Exploring Git Internals

Git is becoming widely used as an SCM tool. I have always been interested in SCM tools: it feels good to know exactly where my code is, and to be able to track when and how a change was introduced whenever I need to modify something. To understand it better, I explored the internals of Git, which turned out to be very interesting.

Git is actually a content-tracking tool, and it is remarkable how well that works as a version control tool. The following diagram shows Git's internal structure:

[Figure: Git internal structure]

Basics

There are two types of Git repositories: bare and non-bare (working) repositories. The diagram above shows a working repository. It has a working directory, and inside the working directory is a hidden directory “.git”, which holds the local copy of the Git repository. All the magic happens inside the “.git” directory.
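
To make the bare/non-bare distinction concrete, here is a small sketch that creates one of each in a temporary directory and asks Git which is which. The directory names "work" and "bare.git" are made up for the demo.

```shell
set -e
base=$(mktemp -d)

# Non-bare: a working directory with the repository inside .git
git init -q "$base/work"
# Bare: no working directory, the repository files sit at the top level
git init -q --bare "$base/bare.git"

work_is_bare=$(git -C "$base/work" rev-parse --is-bare-repository)
bare_is_bare=$(git -C "$base/bare.git" rev-parse --is-bare-repository)
echo "work:     bare=$work_is_bare"
echo "bare.git: bare=$bare_is_bare"

ls -A "$base/work"      # shows only the hidden .git directory
ls "$base/bare.git"     # shows HEAD, config, objects, refs, ... directly
```

A bare repository is what usually sits on a server: nobody edits files there, so it needs no working directory.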

Let’s take an example and create an empty repository with “git init”:

weng@weng-u1604:~$ mkdir git-internal-test && cd git-internal-test/
weng@weng-u1604:~/git-internal-test$ git init
Initialized empty Git repository in /home/weng/git-internal-test/.git/
weng@weng-u1604:~/git-internal-test$ find
.
./.git
./.git/HEAD
./.git/config
./.git/hooks
./.git/hooks/pre-rebase.sample
./.git/hooks/applypatch-msg.sample
./.git/hooks/pre-applypatch.sample
./.git/hooks/post-update.sample
./.git/hooks/pre-push.sample
./.git/hooks/prepare-commit-msg.sample
./.git/hooks/commit-msg.sample
./.git/hooks/pre-commit.sample
./.git/hooks/update.sample
./.git/objects
./.git/objects/pack
./.git/objects/info
./.git/refs
./.git/refs/heads
./.git/refs/tags
./.git/info
./.git/info/exclude
./.git/description
./.git/branches
weng@weng-u1604:~/git-internal-test$ 

As shown, “git init” creates several directories inside “.git” to store different kinds of files, such as objects, references, and hooks.

Objects

Objects are the key ingredients of a Git repository. “.git/objects” is the directory that holds Git objects. An object is identified by a 40-character hexadecimal string: the SHA-1 hash of the object’s content, prefixed with a short header giving its type and size. There are four types of objects:

  • commit: stores commit information, a reference to a tree object, and references to parent commits, which together form the Git commit graph.
  • tree: stores directory layouts (pointing to sub-tree objects) and filenames with their SHA-1 hashes.
  • blob: stores file content, which is uniquely identified by its SHA-1 hash.
  • tag: stores an annotated tag, which points to a commit. A tag can later serve as the starting point for a branch, among other uses.

There is one important concept in Git when we are working with files. A file can be in one of the following states:

  • untracked: when a file is first created in the working directory, it is in the untracked state.
  • modified: a file that is under Git version control and has been edited in the working directory. If it has not been edited, it is in the unmodified state.
  • staged: an untracked or modified file has to be staged with "git add" before it can be committed. This information is saved in the file ".git/index".
  • committed: after "git commit", the staged content is recorded by a commit stored in the ".git/objects" directory.

    Let’s create the first file and track how the objects are created and how the file status changes:

    weng@weng-u1604:~/git-internal-test$ echo "test file 1" > f1.txt
    weng@weng-u1604:~/git-internal-test$ tree
    .
    └── f1.txt
    
    0 directories, 1 file
    weng@weng-u1604:~/git-internal-test$ git status
    On branch master
    
    Initial commit
    
    Untracked files:
      (use "git add <file>..." to include in what will be committed)
    
    	f1.txt
    
    nothing added to commit but untracked files present (use "git add" to track)
    weng@weng-u1604:~/git-internal-test$

    As shown, f1.txt has been created but is in the untracked state. Now let’s use “git add” to add it to the staging area:

    weng@weng-u1604:~/git-internal-test$ git add f1.txt
    weng@weng-u1604:~/git-internal-test$ find
    .
    ./.git
    ./.git/HEAD
    ./.git/config
    ./.git/hooks
    ./.git/hooks/pre-rebase.sample
    ./.git/hooks/applypatch-msg.sample
    ./.git/hooks/pre-applypatch.sample
    ./.git/hooks/post-update.sample
    ./.git/hooks/pre-push.sample
    ./.git/hooks/prepare-commit-msg.sample
    ./.git/hooks/commit-msg.sample
    ./.git/hooks/pre-commit.sample
    ./.git/hooks/update.sample
    ./.git/objects
    ./.git/objects/75
    ./.git/objects/75/342f57ac22184fe5047ed0b0e82286bc56eea0
    ./.git/objects/pack
    ./.git/objects/info
    ./.git/refs
    ./.git/refs/heads
    ./.git/refs/tags
    ./.git/info
    ./.git/info/exclude
    ./.git/description
    ./.git/index
    ./.git/branches
    ./f1.txt
    weng@weng-u1604:~/git-internal-test$ 
    weng@weng-u1604:~/git-internal-test$ git status
    On branch master
    
    Initial commit
    
    Changes to be committed:
      (use "git rm --cached <file>..." to unstage)
    
    	new file:   f1.txt
    weng@weng-u1604:~/git-internal-test$ git cat-file -t 75342
    blob
    weng@weng-u1604:~/git-internal-test$ git cat-file -p 75342
    test file 1
    weng@weng-u1604:~/git-internal-test$ 

    Now we see that f1.txt has become a staged new file. We also notice that a new object file “./.git/objects/75/342f57ac22184fe5047ed0b0e82286bc56eea0” has been created. This is a blob object whose content is “test file 1”, which is what we put in f1.txt. This verifies how a blob object is created.
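
Where does the name 75342f57... come from? As a rough sketch: Git hashes the file content with a small header "blob <size>" plus a NUL byte in front of it. The snippet below computes that by hand and compares it with "git hash-object". It assumes the GNU coreutils "sha1sum" tool is available and that the repository uses the default SHA-1 object format.

```shell
set -e
content='test file 1'

# Hash "blob <size>\0<content>" by hand.
# The size is 12: 11 characters plus the trailing newline that echo wrote.
manual=$(printf 'blob 12\0%s\n' "$content" | sha1sum | cut -d' ' -f1)

# Ask Git for the ID of the same content (nothing is written to disk).
via_git=$(printf '%s\n' "$content" | git hash-object --stdin)

echo "by hand: $manual"
echo "by git:  $via_git"
```

Both lines should show 75342f57ac22184fe5047ed0b0e82286bc56eea0, the same ID as in the transcript above. The first two characters become the directory name under ".git/objects" and the remaining 38 become the file name.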

    f1.txt is only staged, not committed yet. But we can already create a tree object for it by using “git write-tree”.

    weng@weng-u1604:~/git-internal-test$ git write-tree
    741053ae1c317a0205edf9d8a756c486688b7d1a
    weng@weng-u1604:~/git-internal-test$ find
    .
    ./.git
    ./.git/HEAD
    ./.git/config
    ./.git/hooks
    ./.git/hooks/pre-rebase.sample
    ./.git/hooks/applypatch-msg.sample
    ./.git/hooks/pre-applypatch.sample
    ./.git/hooks/post-update.sample
    ./.git/hooks/pre-push.sample
    ./.git/hooks/prepare-commit-msg.sample
    ./.git/hooks/commit-msg.sample
    ./.git/hooks/pre-commit.sample
    ./.git/hooks/update.sample
    ./.git/objects
    ./.git/objects/75
    ./.git/objects/75/342f57ac22184fe5047ed0b0e82286bc56eea0
    ./.git/objects/pack
    ./.git/objects/info
    ./.git/objects/74
    ./.git/objects/74/1053ae1c317a0205edf9d8a756c486688b7d1a
    ./.git/refs
    ./.git/refs/heads
    ./.git/refs/tags
    ./.git/info
    ./.git/info/exclude
    ./.git/description
    ./.git/index
    ./.git/branches
    ./f1.txt
    weng@weng-u1604:~/git-internal-test$ git cat-file -t 74105
    tree
    weng@weng-u1604:~/git-internal-test$ git cat-file -p 74105
    100644 blob 75342f57ac22184fe5047ed0b0e82286bc56eea0	f1.txt
    weng@weng-u1604:~/git-internal-test$ 

    As shown, we created a tree object with SHA-1 hash 741053ae1c317a0205edf9d8a756c486688b7d1a, which holds a reference to the blob object along with the file name.
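
A tree's ID depends only on its entries (mode, name, and object ID), not on when or how it was built. As a sketch of that idea, the snippet below builds the same tree in a fresh temporary repository without using the index at all, by feeding one "ls-tree"-style line to "git mktree".

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Store the blob directly (-w writes it into .git/objects, the index is not touched).
blob=$(printf 'test file 1\n' | git hash-object -w --stdin)

# mktree reads lines in "git ls-tree" format: <mode> SP <type> SP <id> TAB <name>
tree=$(printf '100644 blob %s\tf1.txt\n' "$blob" | git mktree)

echo "$tree"
git cat-file -p "$tree"
```

With the same content and file name as above, this should print the same tree ID, 741053ae1c317a0205edf9d8a756c486688b7d1a.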

    Next we verify the commit object:

    weng@weng-u1604:~/git-internal-test$ git commit -m "fist commit for f1.txt file"
    [master (root-commit) c940ba6] fist commit for f1.txt file
     1 file changed, 1 insertion(+)
     create mode 100644 f1.txt
    weng@weng-u1604:~/git-internal-test$ find
    .
    ./.git
    ./.git/COMMIT_EDITMSG
    ./.git/HEAD
    ./.git/logs
    ./.git/logs/HEAD
    ./.git/logs/refs
    ./.git/logs/refs/heads
    ./.git/logs/refs/heads/master
    ./.git/config
    ./.git/hooks
    ./.git/hooks/pre-rebase.sample
    ./.git/hooks/applypatch-msg.sample
    ./.git/hooks/pre-applypatch.sample
    ./.git/hooks/post-update.sample
    ./.git/hooks/pre-push.sample
    ./.git/hooks/prepare-commit-msg.sample
    ./.git/hooks/commit-msg.sample
    ./.git/hooks/pre-commit.sample
    ./.git/hooks/update.sample
    ./.git/objects
    ./.git/objects/75
    ./.git/objects/75/342f57ac22184fe5047ed0b0e82286bc56eea0
    ./.git/objects/pack
    ./.git/objects/c9
    ./.git/objects/c9/40ba63b1bda27738df7b43423fcf1efaf767ce
    ./.git/objects/info
    ./.git/objects/74
    ./.git/objects/74/1053ae1c317a0205edf9d8a756c486688b7d1a
    ./.git/refs
    ./.git/refs/heads
    ./.git/refs/heads/master
    ./.git/refs/tags
    ./.git/info
    ./.git/info/exclude
    ./.git/description
    ./.git/index
    ./.git/branches
    ./f1.txt
    weng@weng-u1604:~/git-internal-test$ git cat-file -t c940ba
    commit
    weng@weng-u1604:~/git-internal-test$ git cat-file -p c940ba
    tree 741053ae1c317a0205edf9d8a756c486688b7d1a
    author Wenwei Weng <weweng@gmail.com> 1484639055 -0800
    committer Wenwei Weng <weweng@gmail.com> 1484639055 -0800
    
    fist commit for f1.txt file
    weng@weng-u1604:~/git-internal-test$
    weng@weng-u1604:~/git-internal-test$ cat .git/HEAD 
    ref: refs/heads/master
    weng@weng-u1604:~/git-internal-test$ cat .git/refs/heads/master
    c940ba63b1bda27738df7b43423fcf1efaf767ce
    weng@weng-u1604:~/git-internal-test$ 
    weng@weng-u1604:~/git-internal-test$ git ls-tree -r HEAD
    100644 blob 75342f57ac22184fe5047ed0b0e82286bc56eea0	f1.txt
    weng@weng-u1604:~/git-internal-test$ 

    We can see that “.git/HEAD” points to “refs/heads/master”, and that after the commit the file “.git/refs/heads/master” exists and holds the hash of the commit we just made.
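
To see that “git commit” is essentially "write a tree, write a commit object, move the branch", here is a sketch that performs those three steps with plumbing commands in a throwaway repository. The identity "Demo User" and the commit message are placeholders for the demo.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Placeholder identity, used only for this demo.
export GIT_AUTHOR_NAME='Demo User' GIT_AUTHOR_EMAIL='demo@example.com'
export GIT_COMMITTER_NAME='Demo User' GIT_COMMITTER_EMAIL='demo@example.com'

echo "test file 1" > f1.txt
git add f1.txt

tree=$(git write-tree)                               # snapshot of the index
commit=$(git commit-tree "$tree" -m "first commit")  # commit object, no ref touched yet
branch=$(git symbolic-ref HEAD)                      # the ref that HEAD points to
git update-ref "$branch" "$commit"                   # make the branch point at the commit

echo "HEAD -> $branch -> $(git rev-parse HEAD)"
git cat-file -p "$commit"
```

The commit ID will differ from c940ba6 above because the author, timestamp, and message differ, but the "tree" line inside the commit should be the same.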

    Lastly, we verify the tag object by creating an annotated tag with “git tag -a -m'version 0.1' V0.1 c940ba”:

    weng@weng-u1604:~/git-internal-test$ git tag -a -m'version 0.1' V0.1 c940ba
    weng@weng-u1604:~/git-internal-test$ find
    .
    ./.git
    ./.git/COMMIT_EDITMSG
    ./.git/HEAD
    ./.git/logs
    ./.git/logs/HEAD
    ./.git/logs/refs
    ./.git/logs/refs/heads
    ./.git/logs/refs/heads/master
    ./.git/config
    ./.git/hooks
    ./.git/hooks/pre-rebase.sample
    ./.git/hooks/applypatch-msg.sample
    ./.git/hooks/pre-applypatch.sample
    ./.git/hooks/post-update.sample
    ./.git/hooks/pre-push.sample
    ./.git/hooks/prepare-commit-msg.sample
    ./.git/hooks/commit-msg.sample
    ./.git/hooks/pre-commit.sample
    ./.git/hooks/update.sample
    ./.git/objects
    ./.git/objects/c8
    ./.git/objects/c8/62407c91ee28f496970ce0585b216681f19c1e
    ./.git/objects/75
    ./.git/objects/75/342f57ac22184fe5047ed0b0e82286bc56eea0
    ./.git/objects/pack
    ./.git/objects/c9
    ./.git/objects/c9/40ba63b1bda27738df7b43423fcf1efaf767ce
    ./.git/objects/info
    ./.git/objects/74
    ./.git/objects/74/1053ae1c317a0205edf9d8a756c486688b7d1a
    ./.git/refs
    ./.git/refs/heads
    ./.git/refs/heads/master
    ./.git/refs/tags
    ./.git/refs/tags/V0.1
    ./.git/info
    ./.git/info/exclude
    ./.git/description
    ./.git/index
    ./.git/branches
    ./f1.txt
    weng@weng-u1604:~/git-internal-test$ git cat-file -t c8624
    tag
    weng@weng-u1604:~/git-internal-test$ git cat-file -p c8624
    object c940ba63b1bda27738df7b43423fcf1efaf767ce
    type commit
    tag V0.1
    tagger Wenwei Weng <weweng@gmail.com> 1484639367 -0800
    
    version 0.1
    weng@weng-u1604:~/git-internal-test$ git show-ref --tags
    c862407c91ee28f496970ce0585b216681f19c1e refs/tags/V0.1
    weng@weng-u1604:~/git-internal-test$ cat .git/refs/tags/V0.1 
    c862407c91ee28f496970ce0585b216681f19c1e
    weng@weng-u1604:~/git-internal-test$ 
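
Note that the ref “refs/tags/V0.1” holds the ID of the tag object, not of the commit. As a small sketch of the difference, the snippet below creates an annotated tag and a lightweight tag (named "light" here, just for the demo) on the same commit and compares them.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Demo User'          # placeholder identity for the demo
git config user.email 'demo@example.com'

echo "test file 1" > f1.txt
git add f1.txt
git commit -q -m "first commit"

git tag -a -m 'version 0.1' V0.1 HEAD     # annotated: creates a tag object
git tag light HEAD                        # lightweight: just a ref to the commit

annotated_type=$(git cat-file -t V0.1)
light_type=$(git cat-file -t light)
peeled=$(git rev-parse 'V0.1^{commit}')   # follow the tag object to its commit

echo "V0.1 is a $annotated_type, light is a $light_type"
echo "V0.1 peels to $peeled"
```

Only the annotated tag adds a new object to “.git/objects”; the lightweight tag is nothing more than a file under “.git/refs/tags”.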

    As the repository grows, we can use “git gc” to pack the objects and save disk space:

    weng@weng-u1604:~/git-internal-test$ git gc
    Counting objects: 4, done.
    Delta compression using up to 2 threads.
    Compressing objects: 100% (2/2), done.
    Writing objects: 100% (4/4), done.
    Total 4 (delta 0), reused 0 (delta 0)
    weng@weng-u1604:~/git-internal-test$ find
    .
    ./.git
    ./.git/COMMIT_EDITMSG
    ./.git/HEAD
    ./.git/logs
    ./.git/logs/HEAD
    ./.git/logs/refs
    ./.git/logs/refs/heads
    ./.git/logs/refs/heads/master
    ./.git/config
    ./.git/hooks
    ./.git/hooks/pre-rebase.sample
    ./.git/hooks/applypatch-msg.sample
    ./.git/hooks/pre-applypatch.sample
    ./.git/hooks/post-update.sample
    ./.git/hooks/pre-push.sample
    ./.git/hooks/prepare-commit-msg.sample
    ./.git/hooks/commit-msg.sample
    ./.git/hooks/pre-commit.sample
    ./.git/hooks/update.sample
    ./.git/packed-refs
    ./.git/objects
    ./.git/objects/pack
    ./.git/objects/pack/pack-19eeb5192ba7924b412bce57e60843e28a4eea51.pack
    ./.git/objects/pack/pack-19eeb5192ba7924b412bce57e60843e28a4eea51.idx
    ./.git/objects/info
    ./.git/objects/info/packs
    ./.git/refs
    ./.git/refs/heads
    ./.git/refs/tags
    ./.git/info
    ./.git/info/exclude
    ./.git/info/refs
    ./.git/description
    ./.git/index
    ./.git/branches
    ./f1.txt
    weng@weng-u1604:~/git-internal-test$

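    As shown, all the object files are packed into two files: *.pack and *.idx. The branch and tag references have likewise moved out of “.git/refs” into the single file “.git/packed-refs”. Even though all the objects are packed, Git still works normally.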
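
To check that packed objects really are still reachable, here is a sketch that counts loose and packed objects before and after “git gc”, lists the pack contents, and reads a blob back out. It uses a fresh temporary repository with a placeholder identity.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Demo User'          # placeholder identity for the demo
git config user.email 'demo@example.com'

echo "test file 1" > f1.txt
git add f1.txt
git commit -q -m "first commit"
blob=$(git rev-parse HEAD:f1.txt)

git count-objects -v        # "count" is the number of loose objects
git gc -q
git count-objects -v        # now "in-pack" accounts for them instead

# List what is inside the pack, then read an object back out of it.
git verify-pack -v .git/objects/pack/pack-*.idx
git cat-file -p "$blob"
```

Commands such as “git cat-file” look in the pack index when no loose object file exists, which is why nothing changes from the user's point of view.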

    References

    Reference files are used to manage and refer to objects inside the repository. Each branch has a head reference, which is stored under “.git/refs/heads/branch-name”. The file “.git/HEAD” contains the path of the reference for the currently active branch. The following example shows that HEAD points to the active branch “master”, and that its latest commit is the object with hash c940ba63b1bda27738df7b43423fcf1efaf767ce.

    weng@weng-u1604:~/git-internal-test$ cat .git/HEAD
    ref: refs/heads/master
    weng@weng-u1604:~/git-internal-test$ cat .git/refs/heads/master
    c940ba63b1bda27738df7b43423fcf1efaf767ce
    weng@weng-u1604:~/git-internal-test$

    When “git checkout branch-name” is used to switch branches, HEAD is updated accordingly.
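
A quick sketch to watch this happen: create a second branch (called "topic" here, a made-up name), switch to it, and compare the content of “.git/HEAD” before and after.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Demo User'          # placeholder identity for the demo
git config user.email 'demo@example.com'

echo "test file 1" > f1.txt
git add f1.txt
git commit -q -m "first commit"

before=$(cat .git/HEAD)
git branch topic                          # new ref under .git/refs/heads
git checkout -q topic
after=$(cat .git/HEAD)

echo "before: $before"
echo "after:  $after"
```

Only the text inside “.git/HEAD” changes. Both branches still point to the same commit, so no object is created or modified.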

    Reset and checkout

    An easier way to think about reset and checkout is through the mental frame of Git being a content manager of three different trees. By “tree” here we really mean “collection of files”, not specifically the data structure.

  • HEAD: HEAD is the pointer to the current branch reference, which is in turn a pointer to the last commit made on that branch. That means HEAD will be the parent of the next commit that is created. It’s generally simplest to think of HEAD as the snapshot of your last commit. To see its content, use "git ls-tree -r HEAD".
  • Index: Proposed next commit snapshot. To see its content, use "git ls-files -s".
  • Working directory: the sandbox that holds all files: untracked, modified, and unmodified. To see its content, simply use the Linux utility "tree".
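
As a small sketch, the three trees can be made to disagree on purpose: one file committed, one only staged, and one untracked. The file names are arbitrary, and plain "ls" stands in for "tree" in case that utility is not installed.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Demo User'          # placeholder identity for the demo
git config user.email 'demo@example.com'

echo "test file 1" > f1.txt
git add f1.txt
git commit -q -m "first commit"           # f1.txt is now in HEAD
echo "test file 2" > f2.txt
git add f2.txt                            # f2.txt is staged only
echo "test file 3" > f3.txt               # f3.txt is untracked

head_files=$(git ls-tree -r --name-only HEAD)
index_files=$(git ls-files)
work_files=$(ls)

echo "HEAD:             " $head_files
echo "Index:            " $index_files
echo "Working directory:" $work_files
```

HEAD lists one file, the index two, and the working directory three.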

    Reset

    reset moves the currently active branch, the one HEAD points to. Then, depending on the option given (--soft, --mixed, or --hard), it performs different additional actions. Assume that initially we have the following state:

    [Figure: git reset, initial state]

    (Note: HEAD~ refers to the parent of the current commit, HEAD~2 to its grandparent, and so on.)

    git reset --soft HEAD~

    This only moves the branch that HEAD points to back to the previous commit, without updating the index or the working directory.

    [Figure: git reset --soft]

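    git reset --mixed HEAD~

    This moves the branch that HEAD points to back to the previous commit and copies that commit's files to the index/staging area, without updating the working directory. This is the default mode when no option is given.

    [Figure: git reset --mixed]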

    git reset --hard HEAD~

    This moves the branch that HEAD points to back to the previous commit and copies that commit's files to both the index/staging area and the working directory, making all three consistent.

    [Figure: git reset --hard]
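
The three modes can be compared side by side with a sketch like the one below. It makes two commits ("v1" and "v2" of the same file), runs each kind of reset, and reports which version HEAD, the index, and the working directory hold afterwards. The helper function "state" is made up for this demo.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Demo User'          # placeholder identity for the demo
git config user.email 'demo@example.com'

echo "v1" > f1.txt; git add f1.txt; git commit -q -m "first"
echo "v2" > f1.txt; git add f1.txt; git commit -q -m "second"
second=$(git rev-parse HEAD)

# Report f1.txt as seen by HEAD, the index, and the working directory.
state() {
  echo "HEAD=$(git show HEAD:f1.txt) index=$(git show :f1.txt) worktree=$(cat f1.txt)"
}

git reset -q --soft HEAD~
soft=$(state)
git reset -q --hard "$second"             # put everything back before the next run

git reset -q --mixed HEAD~
mixed=$(state)
git reset -q --hard "$second"

git reset -q --hard HEAD~
hard=$(state)

echo "soft:  $soft"
echo "mixed: $mixed"
echo "hard:  $hard"
```

Reading the output from top to bottom, each mode reaches one tree further: --soft changes only HEAD, --mixed also the index, and --hard also the working directory.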

    git reset filename

    This makes Git copy the version of the file from the commit HEAD points to into the index/staging area, which basically unstages the file. It is the opposite of “git add”.
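
A minimal sketch of unstaging, in a throwaway repository: stage an edit, reset the path, and check what is staged before and after.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Demo User'          # placeholder identity for the demo
git config user.email 'demo@example.com'

echo "v1" > f1.txt; git add f1.txt; git commit -q -m "first"
echo "v2" > f1.txt
git add f1.txt                            # the index now holds v2
staged_before=$(git diff --cached --name-only)

git reset -q -- f1.txt                    # HEAD's version goes back into the index
staged_after=$(git diff --cached --name-only)

echo "staged before reset: ${staged_before:-nothing}"
echo "staged after reset:  ${staged_after:-nothing}"
echo "working copy still says: $(cat f1.txt)"
```

The edit itself is not lost: the working directory still contains "v2", it is just no longer part of the proposed next commit.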

    git reset [commit-HASH] filename

    This makes Git copy the file from the given commit into the index/staging area. This really shows Git as a content-tracking system rather than a version-tracking system. To do the same in ClearCase, you would specify the file's version number. In Git there is no per-file version number and all information is tracked through SHA-1 hashes, so you have to specify the commit's SHA-1 hash to get the old version of the file. In fact, you can take the commit's SHA-1 hash, find the associated tree object, find the file's SHA-1 hash in the tree object, and then use “git cat-file -p file-SHA1-hash” to print out the file content!
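
The manual lookup described above can be sketched as follows: start from a commit ID, read its tree, find the entry for the file, and print the blob. The sed and awk filters are just one convenient way to pick fields out of the "cat-file" output.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Demo User'          # placeholder identity for the demo
git config user.email 'demo@example.com'

echo "v1" > f1.txt; git add f1.txt; git commit -q -m "first"
first=$(git rev-parse HEAD)
echo "v2" > f1.txt; git add f1.txt; git commit -q -m "second"

# commit -> tree -> blob, one step at a time
tree=$(git cat-file -p "$first" | sed -n '1s/^tree //p')
blob=$(git cat-file -p "$tree" | awk '$4 == "f1.txt" { print $3 }')
old=$(git cat-file -p "$blob")
echo "f1.txt in the first commit: $old"

# The shortcut: copy that version into the index only
git reset -q "$first" -- f1.txt
echo "index: $(git show :f1.txt), working directory: $(cat f1.txt)"
```

The same content can also be printed in one step with "git show <commit>:f1.txt", which does this lookup internally.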

    Checkout

    It is very close to git reset --hard, and it comes in two forms: with and without a path.

    git checkout branch-name

    First, unlike reset --hard, checkout is working-directory safe; it checks to make sure it’s not blowing away files that have changes to them. Actually, it’s a bit smarter than that: it tries to do a trivial merge in the working directory, so all of the files you haven’t changed will be updated. reset --hard, on the other hand, simply replaces everything across the board without checking. The second important difference is how it updates HEAD. Where reset moves the branch that HEAD points to, checkout moves HEAD itself to point to another branch.
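
Here is a sketch of the safety difference. Two branches hold different versions of f1.txt, and the working directory has an uncommitted edit to that file. The branch name "topic" is made up for the demo.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Demo User'          # placeholder identity for the demo
git config user.email 'demo@example.com'

echo "v1" > f1.txt; git add f1.txt; git commit -q -m "first"
base=$(git symbolic-ref --short HEAD)     # whatever the initial branch is called
git checkout -q -b topic
echo "v2" > f1.txt; git commit -q -am "second"
git checkout -q "$base"

echo "local edit" > f1.txt                # uncommitted change to a file that differs on topic

if git checkout -q topic 2>/dev/null; then
  result="switched"
else
  result="refused"
fi
kept=$(cat f1.txt)
echo "checkout topic: $result, f1.txt still says: $kept"

git reset -q --hard topic                 # no such check: the edit is overwritten
echo "after reset --hard topic: $(cat f1.txt)"
```

Note also where HEAD ends up: after the reset it is still attached to the original branch, which has itself been moved to topic's commit.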

    git checkout branch-name file-name

    The other way to run checkout is with a file path, which, like reset, does not move HEAD. It is just like git reset [branch] file in that it updates the index with that file at that commit, but it also overwrites the file in the working directory. It would be exactly like git reset --hard [branch] file (if reset would let you run that): it’s not working-directory safe, and it does not move HEAD. Also, like git reset and git add, checkout accepts a --patch option to allow you to selectively revert file contents on a hunk-by-hunk basis.
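
A last sketch shows checkout with a path: take one file from another branch (again a made-up branch called "topic") and see which of the three trees change.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name 'Demo User'          # placeholder identity for the demo
git config user.email 'demo@example.com'

echo "v1" > f1.txt; git add f1.txt; git commit -q -m "first"
base=$(git symbolic-ref --short HEAD)     # whatever the initial branch is called
git checkout -q -b topic
echo "v2" > f1.txt; git commit -q -am "second"
git checkout -q "$base"

git checkout topic -- f1.txt              # copy topic's f1.txt into index and working directory

echo "HEAD is on:  $(git symbolic-ref --short HEAD)"
echo "HEAD:        $(git show HEAD:f1.txt)"
echo "index:       $(git show :f1.txt)"
echo "working dir: $(cat f1.txt)"
```

HEAD stays on the original branch and still records "v1", while the index and the working directory now both hold "v2", ready to be committed.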