GIT is becoming widely used as SCM tool. I have been always interested in SCM tools which make me feel good if I know exactly where is my code and in case if I need made some change or track when/how the change is introduced. To get know better, I explored the internals of GIT, which is actaully very interesting.
GIT is actaully a content tracking tool. It is amazing it is used as version control tool.
The following diagram shows the internal GIT structure:
Basics
There are two types GIT repositories: bare and non-bare(working repository). The above showing a working repository, which has a working directory, inside working directory, it contains a hidden directory “.git”, which hold a local copy of git repository. All the magics are inside of “.git” directory.
Let’s take an example by creating an empty repository by “git init”:
As it is shown, “git init” creates a few directories inside “.git” directory to store different type of files such as objects, references, hooks.
Objects
Objects are the key ingridients of GIT repository. “.git/objects” is the directory will hold GIT objects. An object is identified by a 40-character-long string – SHA1 hash of the object’s content. There are four types objects:
commit: stores commit info, reference to tree object and has reference to parent commit to form the Git commit graph.
tree: stores direcotry layouts (pointing to sub-tree object) and filenames with SHA1 hash.
blob: stores file content, which is uniquely identified by SHA1 hash.
tag: stores annotated tag, which points to a commit. This can be used to create branch etc.
There is one important concept in GIT when we are working on files. There are following states of a file:
untracked: when the file is created first time in working directory, it is in intracked state.
modifed: for a file which is under version controlled by GIT, and a copy is placed in working directory and being edited. If it no editted, then it is in unmodified state.
staged: for untracked and modified file, in order to be committed, it has to be staged first by using "git add" command. These information will be saved in file ".git/index".
commited: the file inside ".git/objects" directory after "git commit".
Let’s create first file, and track how the objects are created, and file status is changed:
As it is shown f1.txt is created but in untracked state.
Now let’s use “git add” to add it into stage area:
Now we see that f1.txt has become a staged/new file.
Also we noticed that a new object file “./.git/objects/75/342f57ac22184fe5047ed0b0e82286bc56eea0” is created. This is a blob object file with content of “test file 1”, which is what we put in f1.txt. Now we know/verify how blob object is created.
The f1.txt is only staged,not committed yet. But we can create a tree object for it by using “git write-tree”.
As it is shown, we created a tree object with SHA1 hash 741053ae1c317a0205edf9d8a756c486688b7d1a, which has the reference to blob object, plus the file name.
Next we verify the commit object:
We can see that after commit, “.git/HEAD” is updated to point to “.git/refs/heads/master”, which is pointing to the commit we just made.
Lastly we will verify tag object by using “git tag -a -m”v0.1” V0.1 c940ba”
When the repository grows, to save disk space, we can use “git gc” to pack the objects:
As it is show all the object files are packed into two files: *.pack and *.idx.
Even though all object files are packed, GIT still works normally.
References
Reference files are created to manage/refer to objects inside repository. Each branch has a head reference which is stored under “.git/refs/heads/branch-name”.
The file “.git/HEAD” contains the “path + filename” which is the current active branch.
The following example shows that HEAD is current active branch “master”, and its last commit is object with hash c940ba63b1bda27738df7b43423fcf1efaf767ce.
When “git checkout " is used to switch branch, HEAD will be update accordingly.
Reset and checkout
An easier way to think about reset and checkout is through the mental frame
of Git being a content manager of three different trees. By “tree” here we really
mean “collection of files”, not specifically the data structure.
HEAD: HEAD is the pointer to the current branch reference, which is in turn a pointer to the last commit made on that branch. That means HEAD will be the parent of the next commit that is created. It’s generally simplest to think of HEAD as the snapshot of your last commit. To see its content, "git ls-tree -r HEAD"
Index: Proposed next commit snapshot. To see its content, use "git ls-files -s".
Working directory: the sandbox which hold all files: untracked, modified, unmodified. To see its content, simple use linux util command "tree".
Reset
reset moves the current active branch that HEAD is pointing to. Then depending on the option (–soft, –mixed, –hard) given perform different actions.
Assume intially we have following:
(note: HEAD~ means back to previous first commit, HEAD~2 means back to previous 2nd commit so on…)
git reset –soft HEAD~
This will only move HEAD point to previous commit, without updating index and working directory.
git reset –mixed HEAD~
This will only move HEAD point to previous commit, and copy files to index/staging area, without updating working directory.
git reset –hard HEAD~
This will move HEAD point to previous commit, and copy files to index/staging area and working directory to make them all consistent.
git reset filenmae
This will cause git copy the file version from HEAD is pointing to Index/staging area, which is bascially unstage the file, which is the opposite of “git add”.
git reset [commit-HASH] filename
This will cause git copy the file from given commit is pointing to Index/staging area. Here is really show GIT as content tracking system instead of version tracking system. To do the same in clearcase, you would specify the file version number, here in GIT, there is no version number, all information is tracked though SHA1 hash, so you have to specify commit SHA1 hash to copy the old version file out. In fact, you can use commit SHA1 hash, find associated tree object, then file the filename SHA1 hash from tree object, and then use “git cat-file -p file-SH1-HASH” to print out the file content!
Checkout
It is very close to git reset –hard, two options: with and without path
git checkout branch-name
First, unlike reset –hard, checkout is working-directory safe; it will
check to make sure it’s not blowing away files that have changes to them. Actually, it’s a bit smarter than that – it tries to do a trivial merge in the Working Directory, so all of the files you haven’t changed in will be updated. reset –hard, on the other hand, will simply replace everything across the board
without checking.
The second important difference is how it updates HEAD. Where reset will
move the branch that HEAD points to, checkout will move HEAD itself to point
to another branch.
git checkout branch-name file-name
The other way to run checkout is with a file path, which, like reset, does not
move HEAD. It is just like git reset [branch] file in that it updates the
index with that file at that commit, but it also overwrites the file in the working directory. It would be exactly like git reset –hard [branch] file (if reset would let you run that) – it’s not working-directory safe, and it does not
move HEAD.
Also, like git reset and git add, checkout will accept a –patch option
to allow you to selectively revert file contents on a hunk-by-hunk basis.