At the end of this exercise, a web application will be up and running. Executing something like git clone http://localhost:4000/ hello-world-repo will clone a git repo with single text file in it. Yes, clone from a web application with no git on server side.

What is covered:

git objects formats
git objects creation
endpoints to support client requests
dumb protocol

What is not covered:

packs (the way how git optimizes storage)
smart protocol

The app is available @github

What is a git repo

On a high level, a git repo is a set of commits. Each commit has optional link to a parent one; and required link to a tree structure. This tree structure is just a list of links to files and other trees. As simple as that.

On a low level, git uses key-value storage to keep all objects. These keys are hash values for the content and these hashes are used as links in references.

To have a repo with single file in it; these steps must be completed:

calculate a hash value (H1) for some test content
create a tree object with a file name and a link (H1) to the content; and calculate a hash for the tree (H2)
create a commit object with a link (H2) to the tree object; and calculate a hash for the commit itself (H3)

Because hashes have dependencies H3 <- H2 <- H1, any change to the content, the file name or the commit will change some or all hashes as well. Another conclusion is that to calculate a hash for the commit, hashes for the tree and the content must already be calculated.

Implementation

The sample code is a simple rails application. Turning this app into a demo git repository may be splitter into two steps:

calculate hashes for the content, the tree and the commit; and caching them
handle git requests over the HTTP dumb protocol and using a cache to transfer data to a client during a cloning operation

Entire code lives in single controller @github. It should be quite easy to reimplement the logic in any language.

Calculate hashes & cache

0. Git transfers archived data.

To keep things simple, our cache is going to contain archived data as well; it will simplify request/response calls for git clone command.

1. Content (build_file method in the source)

Git stores content as blob objects. The format is blob SIZE\0content. \0 is zero byte and SIZE = size_in_bytes(content). Hash get calculated over this data(blob keyword, size and original content). After the calculation is done; the data is archived and stored to the cache (simple hash map).

2. Tree (build_tree)

Tree is a listing of files and other trees; and is, basically, a list of items. Each item has format: file_permissions file_name\0file_hash. There are few things to keep in mind. First of all, file_hash is the hash calculated on step 1; it’s added to the item as 20 bytes (and not as 40 bytes HEX representation). Secondly, there are no separator between items (no spaces or new lines).

With a list of items in a variable, tree record can be created according to this format: tree SIZE\0list_of_items; where SIZE is size_in_bytes(list_of_items). As usual, hash get calculated over this object and everything is cached (after archiving).

3. Commit(build_commit)

Format: tree TREE_HASH\n\nCOMMIT_MESSAGE. And the object is commit SIZE\0CONTENT. Note, TREE_HASH for the commit is a HEX representation. In case parent commit should be referenced, parent HASH_OF_OTHER_COMMIT\n goes before the tree link. Commit is cached and archived as usual.

Handle git clone

As it was told before, on a low level git is key-value storage. Basically, the application exposes an endpoint - object/*hash, where a client can retrieve an object by providing a hash (as a lowercase hex string). (See routes file for details.)[https://github.com/andrewromanenco/git-server-hello-world/blob/master/config/routes.rb] The handler for this endpoint (method objects) simply looks into the cache to return the data to a caller.

This demo keeps all objects in the cache, so the object/*hash endpoint always returns a result. In a full git implementation, this endpoint could return 404(not found), which means that the object is stored in a pack file. Pack files are outside of the scope for this app; please, read git documentation for details.

With objects available by their hashes, there are two question two answer: how does git knows hash values to ask for; and how git knows which git branch to checkout locally after clone.

Second question is simple: there is an endpoint heads, which returns a name for the default branch. In this example app, the name is hardcoded to ‘ref: refs/heads/master\n’ - see head method; so master get checked out after a clone.

To answer first question there is another endpoint: info/refs. It lists all available heads (each head is a commit). In this demo, there is only one head/commit and a single record is returned: COMMIT_HASH\trefs/heads/master\n. See info_refs method for details.

Clone it

With the app up and running, execute: git clone http://localhost:3000 hello-world

To sum it up; these are steps taken by git clone command against the demo app:

Request a list of available heads (single commit is available)
Request head name to be checked out after clone is done (always master)
Use hash to retrieve an object
Parse the object; and in case it’s a commit or a tree, find all new hashes inside, and resolve each one of them starting from step 3
When no more new hashes are available, checkout a head, identified on step 2
Tell user that the cloning is done!