Git Analyze

Ever found an old copy of a repository and didn't know its purpose or state? Was it just a test? Are there modifications that were never pushed anywhere? When was it cloned in the first place? And when was it last used?

I didn't find any way in Git itself or in third-party tools to query this information (okay, I didn't search very thoroughly). This little script demonstrates what can be found by digging through the .git directory. The script is on GitHub.

Identify a Git repository

A Git repository carries its meta-information in a .git directory located in its base directory. If the current working directory is a subdirectory of the repository, the path must be followed towards the root directory until a .git directory shows up.

DIR=''
ORIG_DIR=$PWD
if [[ -d .git ]]; then
    DIR=$ORIG_DIR
else
    while [[ $PWD != / ]]; do
        cd ..
        if [[ -d .git ]]; then
            DIR=$PWD
            break
        fi
    done
fi
if [[ -z $DIR ]]; then
    echo "ERROR: no .git directory found in path '$ORIG_DIR'" >&2
    exit 1
fi

These lines check whether the current working directory contains a .git directory. If it does not, the script steps upwards until either a .git directory is found or the root directory is reached. If the loop ends because of the while condition, the variable DIR is still empty and an error message is printed.

Along the way the code changes into the base directory of the Git repository. Since a script runs in its own process, it does not need to restore the original directory afterwards; ORIG_DIR is only kept for the error message. The calling environment is not changed.
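The same upward search can also be written without changing directories at all. The following is a minimal sketch, not the code of the linked script; the function name find_git_root is made up for illustration.

```shell
#!/usr/bin/env bash
# find_git_root is a hypothetical helper, not part of the original script.
# Print the nearest directory at or above the start directory that
# contains a .git directory; return 1 if the search reaches / without a hit.
find_git_root() {
    local dir
    dir=$(cd "${1:-.}" 2>/dev/null && pwd) || return 1   # absolute start path
    while :; do
        if [ -d "$dir/.git" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
        [ "$dir" = / ] && return 1
        dir=${dir%/*}      # strip the last path component
        dir=${dir:-/}      # "/foo" strips to "", which stands for /
    done
}
```

It could be called as DIR=$(find_git_root) || exit 1. Like the code above, it only accepts .git when it is a directory.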

Age of the repo

The age of the repo can be derived from the oldest file in the .git directory. This is probably not the most robust approach; the logs directory, explained later, is a better source for this information.

# try to derive the age (date of init or clone) from .git files
# (use oldest file in .git directory)
echo -n "init'ed or clone'd most probably on: "
stat -c '%Y %y %n' "$DIR"/.git/* | sort -n | head -n1 | awk '{print $2 " " $3 " (" $5 ")"}'

The stat utility (GNU coreutils syntax) is given a format string that prints, for each file, the modification time as seconds since the UNIX epoch (%Y), the same time in a readable format (%y) and the file name (%n). This list is sorted, the first line is extracted, and columns 2 and 3 (date, time) and 5 (file name) are printed; column 4 is the time zone offset.
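The awk step relies on the file name being exactly the fifth column, which fails for paths that contain spaces. A minimal sketch of a variant that splits only at the first space is shown here; the helper name oldest_entry is hypothetical, and GNU stat and GNU date are assumed.

```shell
#!/usr/bin/env bash
# oldest_entry is a hypothetical helper name; it assumes GNU stat and GNU date.
# Print "<date> <time> (<name>)" for the oldest entry directly inside a directory.
oldest_entry() {
    local line epoch name
    # %Y = modification time in epoch seconds, %n = file name
    line=$(stat -c '%Y %n' "$1"/* 2>/dev/null | sort -n | head -n1)
    [ -n "$line" ] || return 1
    epoch=${line%% *}      # everything before the first space
    name=${line#* }        # everything after it, spaces in the name included
    printf '%s (%s)\n' "$(date -d "@$epoch" +'%Y-%m-%d %H:%M:%S')" "${name##*/}"
}
```

In the script it would be called as oldest_entry "$DIR/.git".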

Simple information by git tools

A basic piece of information is the file .git/description, which can be set for every repository. It does not appear to be used by the core git tools, but may be read by other tools (GitWeb) or hooks. If the file exists and does not contain the default text ("Unnamed repository..."), its content is printed.
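The check described here could look roughly like the following sketch. It is not copied from the linked script; the function name print_description and the output label are assumptions.

```shell
#!/usr/bin/env bash
# print_description is a hypothetical helper name, not taken from the script.
# $1 is the base directory of the repository.
print_description() {
    local file="$1/.git/description" desc
    [ -f "$file" ] || return 0
    desc=$(cat "$file")
    case $desc in
        ''|'Unnamed repository'*) ;;           # empty or still the default text
        *) echo "Description: $desc" ;;
    esac
}
```

A call such as print_description "$DIR" stays silent for an untouched repository and prints one line otherwise.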

The remote links are printed by git remote -v, but each link appears twice, annotated with (fetch) and (push). To see just the links, the second column is cut out, sorted and deduplicated:

echo -n "Remote links: "
git remote -v | cut  -f2 | cut -d' ' -f1 | sort | uniq
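To see what this filtering does without a repository at hand, it can be fed with text in the shape of git remote -v output. The sketch below swaps the two cut calls for a single awk call, which is an alternative and not the script's own method; the helper name and the URL are invented.

```shell
#!/usr/bin/env bash
# unique_remote_urls is a hypothetical helper; it reads `git remote -v` output on stdin.
# Each input line has the shape "<name><TAB><url> (fetch|push)".
unique_remote_urls() {
    awk '{ print $2 }' | sort -u
}

# Made-up sample input standing in for `git remote -v`
printf 'origin\thttps://example.com/demo.git (fetch)\norigin\thttps://example.com/demo.git (push)\n' |
    unique_remote_urls
```

Inside a repository the call would be git remote -v | unique_remote_urls.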

SVN connection

The command git svn info should reveal any ties to a Subversion repository. If the command gives an error, the output is suppressed:

SVNINFO=$(git svn info 2>&1)
if [[ ! "$SVNINFO" =~ ^Unable ]]; then
    echo -n "Git-SVN info: " $SVNINFO
fi
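Matching on the text of the error message depends on the exact wording that git svn prints. An alternative sketch tests the exit status instead, under the assumption that git svn info exits with a non-zero status whenever it has nothing to report (for example when there is no SVN connection or git svn is not installed). The function name print_svn_info is hypothetical.

```shell
#!/usr/bin/env bash
# print_svn_info is a hypothetical helper, not part of the original script.
# Assumption: `git svn info` returns non-zero when there is no SVN connection.
print_svn_info() {
    local info
    if info=$(git svn info 2>/dev/null) && [ -n "$info" ]; then
        printf 'Git-SVN info:\n%s\n' "$info"
    fi
}
```

Unlike the version above, this prints the info on separate lines instead of collapsing it into one.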

Last commit

The commit log can be flexibly formatted with git log. The script uses:

echo -n "Last commit: "
git --no-pager log --all -n1 --format="${COLGITHASH}%h ${COLGITDATE}%ci${COLGITRESET}%d ${COLGITSUBJECT}%s${COLGITRESET}"

The variables $COLGIT* contain git log color placeholders of the form %C(...), as documented in its man page. They can be set to empty strings to suppress color (see the final script linked above for the whole picture).
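How these variables might be filled is sketched below. The variable names are the ones used in the git log call above; the helper name set_git_colors and the particular colors are arbitrary examples, not the values of the linked script.

```shell
#!/usr/bin/env bash
# set_git_colors is a hypothetical helper; the color choices are arbitrary examples.
set_git_colors() {
    if [ "$1" = on ]; then
        COLGITHASH='%C(yellow)'
        COLGITDATE='%C(green)'
        COLGITSUBJECT='%C(cyan)'
        COLGITRESET='%C(reset)'
    else
        # empty strings leave the format string without any color codes
        COLGITHASH='' COLGITDATE='' COLGITSUBJECT='' COLGITRESET=''
    fi
}

# Use color only when standard output is a terminal
if [ -t 1 ]; then set_git_colors on; else set_git_colors off; fi
```

Single quotes keep the placeholders literal until git log interprets them.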

Git logs

The directory .git/logs holds files with a history log for various refs. The HEAD file records the initialization or cloning, as well as any pushes, fetches, pulls, commits and checkouts.

The first line tells when the repo was created by initialization or cloning. In the latter case, the clone source is given. Depending on verbosity, the script prints either the first and last entry or the full history. Tokenizing the lines is a bit tricky because, after the two hashes, the name can consist of one or several words. The email address is enclosed in angle brackets. It is followed by a UNIX time stamp (seconds since epoch), the time zone and a description of the action.

function tokenize_log()
{
    read REV0 REV1 REST < <(echo "$@")           # prev. and current revision sha1, remainder
    echo_v3 "REV0   = '$REV0'"                   # echo_v3 prints only on verbosity >= 3
    echo_v3 "REV1   = '$REV1'"
    NAME=${REST%% <*}                            # Name is up to first angle bracket
    echo_v3 "NAME   = '$NAME'"
    REST=${REST#* <}                             # Remainder is after first bracket
    MAIL=${REST%%>*}                             # Mail is up to closing angle bracket
    echo_v3 "MAIL   = '$MAIL'"
    REST=${REST#*> }                             # Remainder is after first closing bracket
    read TIME ZONE ACTION < <(echo $REST )       # Time, Zone, Action (multiple words)
    echo_v3 "TIME   = '$TIME'"
    echo_v3 "ZONE   = '$ZONE'"
    DATE=$(date -d@$TIME  +'%Y-%m-%d %H:%M:%S')  # convert UNIX time stamp into readable date
    echo_v3 "ACTION = '$ACTION'"
    if [[ -z $ACTION ]]; then
        ACTION="git init"                        # if no action: it was a `git init`
    fi
}

if [[ $PARAM_VERBOSE -ge 1 ]]; then
    # full history: loop over all lines of .git/logs/refs/heads/master
    echo -e "History of master"
    while read LINE; do
        # function fills global variables $REV0, $REV1, ..., $ACTION, $DATE, $NAME, $MAIL
        tokenize_log "$LINE"
        # print in a convenient format
        echo -e "   $ACTION on $DATE by $NAME $MAIL"
    done < "$DIR/.git/logs/refs/heads/master"
else
    # print only the first and last line of .git/logs/HEAD
    tokenize_log $(head -n1 $DIR/.git/logs/HEAD)
    echo -e "Source of this Repo: $ACTION on $DATE"

    tokenize_log $(tail -n1 $DIR/.git/logs/HEAD)
    echo -e "Last action: $ACTION on $DATE by $NAME $MAIL"
fi

The function tokenize_log receives a line of the log file and fills the global variables. The echo_v3 function prints its debug output only if the verbosity level is three or higher.
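To make the line format concrete, here is a minimal sketch that splits one log line with parameter expansion only. It is a simplified variant for illustration, not the function from the script; the name split_log_line and the sample line (hashes, name, address, URL) are invented.

```shell
#!/usr/bin/env bash
# split_log_line is a hypothetical, simplified variant of tokenize_log.
# It fills the globals REV0, REV1, NAME, MAIL, TIME, ZONE and ACTION.
split_log_line() {
    local rest="$1"
    REV0=${rest%% *};   rest=${rest#* }      # previous revision
    REV1=${rest%% *};   rest=${rest#* }      # current revision
    NAME=${rest%% <*};  rest=${rest#* <}     # name: everything up to the first " <"
    MAIL=${rest%%>*};   rest=${rest#*> }     # mail: up to the first ">"
    TIME=${rest%% *};   rest=${rest#* }      # UNIX time stamp
    ZONE=${rest%%[[:space:]]*}               # time zone, e.g. +0200
    ACTION=${rest#"$ZONE"}                   # whatever follows the zone
    ACTION=${ACTION#[[:space:]]}             # drop the separating tab or space
}

# Invented sample line in the layout of .git/logs/HEAD
SAMPLE=$(printf '%s %s Jane Q. Doe <jane@example.com> 1400000000 +0200\tclone: from https://example.com/demo.git' \
    0000000000000000000000000000000000000000 1111111111111111111111111111111111111111)
split_log_line "$SAMPLE"
echo "$ACTION by $NAME ($MAIL) at $TIME $ZONE"
```

Because the line is passed as one quoted argument, the tab in front of the action survives until the last step, and a multi-word name needs no special handling.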

The same output is created for all remotes:

for D in $DIR/.git/logs/refs/remotes/*; do
    REMOTE=${D##*/}                         # extract last part of path
    # get name of server
    REMSERVER=$(git remote -v | grep $REMOTE | cut  -f2 | cut -d' ' -f1 | sort | uniq)
    echo -e "History of $REMOTE ($REMSERVER)"
    if [[ $PARAM_VERBOSE -ge 1 ]]; then
        # full history
        while read LINE; do
            tokenize_log "$LINE"
            echo -e "   $ACTION on $DATE by $NAME $MAIL"
        done < "$D/master"
    else
        # only first and last entry
        tokenize_log $(head -n1 $D/master)
        echo -e "  First action: $ACTION on $DATE"

        tokenize_log $(tail -n1 $D/master)
        echo -e "  Last action: $ACTION on $DATE by $NAME $MAIL"
    fi
done
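The loop above only looks at the branch master of each remote. As a hedged sketch of how other branches could be included, the following helper lists every log file below .git/logs/refs/remotes as a pair of remote and branch name. The function name is hypothetical, and the layout (one log file per remote-tracking branch) is an assumption drawn from the paths used above.

```shell
#!/usr/bin/env bash
# list_remote_branch_logs is a hypothetical helper, not part of the original script.
# $1 is the base directory of the repository; output is one "<remote> <branch>" per line.
list_remote_branch_logs() {
    local base="$1/.git/logs/refs/remotes" file rel
    [ -d "$base" ] || return 0
    find "$base" -type f | sort | while IFS= read -r file; do
        rel=${file#"$base"/}                       # e.g. origin/feature/x
        printf '%s %s\n' "${rel%%/*}" "${rel#*/}"  # remote, then branch
    done
}
```

Each pair could then be turned back into a path of the form $DIR/.git/logs/refs/remotes/<remote>/<branch> and passed to the head and tail calls shown above.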
