Time and space complexity of git commands

Before switching to a monorepo, I need to foresee what the consequences will be. An important thing to consider is the behavior of Git when dealing with huge repositories. Questions such as:

  • How much time will it take to push a commit or check out a branch, or
  • Whether rebasing will need 100 GB of memory

need to be answered.

Big tech companies maintain a whole set of in-house tools for dealing with monorepos, so they may overcome some of Git's possible shortcomings in certain use cases. The question is: without such tools, is it possible to retain our daily workflow with a monorepo?

So to formulate it as a general question:

What are the time and space complexities of Git commands in terms of the number of commits, number of objects, size of objects, size of the working copy, number of branches, or possibly other parameters, depending on the sub-command?

For example, git reset --soft <commit> sets the current branch head to <commit>. My first guess is that it is an O(1) operation, since it only changes the content of HEAD to point to <commit>. But it probably first has to locate the commit object, so a search among the commit objects will be required, making it, I guess, O(log N), where N is the number of commits.
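As a quick sanity check of that guess, here is a throwaway-repo sketch (the repo name is illustrative; the -c user.* flags just avoid relying on global config) showing that --soft moves only the branch head, leaving the index and working tree alone:

```shell
# Throwaway repo to observe what `git reset --soft` actually changes.
git init -q demo && cd demo
git -c user.name=t -c user.email=t@e commit -q --allow-empty -m first
git -c user.name=t -c user.email=t@e commit -q --allow-empty -m second
echo dirty > file.txt           # make a working-tree change
git reset --soft HEAD~1         # branch head moves back to "first"
git log --oneline               # only one commit is reachable now
cat file.txt                    # still "dirty": working tree untouched
```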



Solution 1:[1]

What are the time and space complexities of Git commands in terms of the number of commits, number of objects, size of objects, size of the working copy, number of branches, or possibly other parameters, depending on the sub-command?

That's a very difficult question. The answers will depend on which Git features you use and on your Git version (Git is under continuous development that both adds new features, which can make it slower, and addresses existing large-repository performance issues, which makes it faster).

For example, git reset --soft <commit> sets the current branch head to <commit>. My first guess is it is an O(1) operation, since it only changes the content of HEAD to point to <commit>.

Generally, yes:

But probably it is first trying to locate the commit object, so a search will be required among the commit objects, making it - I guess - log(N), where N is number of commits.

If the specified <commit> is a full hash ID, Git will simply look up the hash ID in the objects database. Since it's hashed, this would in theory be O(1), if the hash were perfect. In practice, no hashing scheme is ever perfect, and Git may have to look in one loose-object location and in multiple pack files. The answer now depends on whether you use the multi-pack-index feature and, if not, on how many pack files and loose objects you have.

Git gets to stop as soon as it finds one copy, so this might be O(n/2), i.e., O(n), in the number of pack files. Each pack file carries an index, so there is no need to look through all the objects in the pack, but there may still be some O(log log o) effects, where o is the number of objects in the pack.

Meanwhile, if the object is found loose, the lookup is O(n) in the number of loose objects in the directory containing the loose-object file, unless the OS you're using has faster-than-O(n) directory lookup, which some do. Git tries to keep that number below approximately 30, so this is generally very fast in spite of the potential to be slow.
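The loose-object and pack-file counts that drive these lookup costs can be inspected directly with git count-objects; a throwaway-repo sketch (names are illustrative):

```shell
# Inspect loose-object and pack-file counts in a fresh repo.
git init -q demo && cd demo
git -c user.name=t -c user.email=t@e commit -q --allow-empty -m one
# "count" = number of loose objects, "packs" = number of pack files
git count-objects -v
```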

Note that Git tries to keep the number of pack files reasonable as well, with git gc re-packing N old pack files into one single new pack file, deleting the old pack files. If you use the feature that keeps old pack files, though, this falls apart. Large hosting sites like GitHub do use that feature (or a modified version of it), which I assume drove the idea of a multi-pack index in the first place.
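A sketch of that consolidation, using commands documented in git-gc(1) and git-multi-pack-index(1) (throwaway repo; the multi-pack-index write sub-command needs a reasonably recent Git):

```shell
git init -q demo && cd demo
git -c user.name=t -c user.email=t@e commit -q --allow-empty -m one
git gc -q                      # repack loose objects / old packs into one pack
git multi-pack-index write     # build one index covering all pack files
ls .git/objects/pack/          # pack-*.pack, pack-*.idx, multi-pack-index
```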

If the specified <commit> is an abbreviated hash, Git must find all possible matching hash IDs. This means it cannot short-circuit after finding one match, so it is much more likely to be O(n) for some n.
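For instance (throwaway repo; the hashes themselves will differ on every run):

```shell
git init -q demo && cd demo
git -c user.name=t -c user.email=t@e commit -q --allow-empty -m one
full=$(git rev-parse HEAD)          # full hash: direct object lookup
short=$(git rev-parse --short HEAD) # shortest *unique* abbreviation
# Resolving the abbreviation must scan for all matches to rule out
# ambiguity before expanding it back to the full hash:
git rev-parse "$short"
```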

If the specified <commit> is given by branch or tag name, Git must resolve the name. That adds a step that may (or may not) be O(n) in the number of ref names in the names database. The C implementation of Git is growing a new "reftables" back end that will reduce this to a hash lookup (and reftables are apparently already implemented in JGit).
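One way to gauge the size of that names database is to count the refs; a sketch (repo and branch names are illustrative):

```shell
git init -q demo && cd demo
git -c user.name=t -c user.email=t@e commit -q --allow-empty -m one
git branch topic                 # add a second ref
git for-each-ref | wc -l         # how many refs name resolution may scan
git rev-parse --verify topic     # resolve a branch name to a commit hash
```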

Microsoft in particular has been working on very large source bases (e.g., the Windows code base) stored in a monorepo. You can read something about their experience, starting back in 2017, here: The largest Git repo on the planet.

There is ongoing work in Git to take a different approach; I have no idea when it will be ready for practical use. In practice, the slowest parts in very large monorepos tend to be merges[1] that combine a lot of changed files (the new "merge-ort" strategy helps here) and large full checkouts; sparse checkout may be important to you. Usable sparse-checkout support is in Git now, but there are still various rough edges.
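A cone-mode sparse-checkout sketch using the commands from git-sparse-checkout(1) (throwaway repo; directory names are illustrative):

```shell
git init -q demo && cd demo
mkdir -p src other
echo a > src/a.txt && echo b > other/b.txt
git add . && git -c user.name=t -c user.email=t@e commit -q -m files
git sparse-checkout init --cone   # enable cone-mode sparse checkout
git sparse-checkout set src       # only src/ stays materialized
ls                                # "other" is gone from the working tree
git sparse-checkout disable       # restore the full working tree
```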


[1] Rebase is repeated cherry-pick, and each cherry-pick is a merge, so these do get expensive.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 torek