Fetch fork of a huge GitHub repository without redownloading unmodified files

I cloned a repository from GitHub with this command: git clone https://github.com/user1/huge_repo. This repo is huge: multiple gigabytes.

Someone made a small edit to one of the files and pushed their fork to GitHub. I want to fetch their version of the repo (let's say it's at https://github.com/user2/huge_repo) without downloading gigabytes of unmodified data. Is this possible?



Solution 1:[1]

You can use your existing clone as a reference clone. For instance, if your existing clone lives in $HOME/sources/github.com/user1/huge_repo:

git clone --reference $HOME/sources/github.com/user1/huge_repo https://github.com/user2/huge_repo

You may wish to add --dissociate here, or not; that part is up to you. With or without --dissociate, your Git asks the GitHub Git what commits they have to send you. This makes a list of commits to fetch. Let's assume for concreteness that the list includes a123456. Your Git now:

  • looks at the reference clone's objects to see if you already have commit a123456:
    • if yes, uses the existing copy and tells the other Git no thanks, I already have that commit, which avoids downloading it at all
    • if no, tells the other Git yes please, send a123456 (which makes it offer the parents of a123456, and we add those to the list of commits to check).

Your Git repeats this for all the commits needed, including those added because your Git said yes, please send that one. The result is the set of objects you must download, plus a list of hash IDs that you can borrow or steal from the reference clone.

Their Git (GitHub's) packages up the commits and supporting objects that must be sent, knowing which commits your Git has said you already have. So you get only new objects, not existing ones, and furthermore, those new objects can be delta-compressed against your existing objects.

Now the --dissociate flag comes into play. If you did not use it, your Git stores, in your user2/huge_repo clone, an "alternates" entry (the file .git/objects/info/alternates) that lists the path name in which the objects you're borrowing live. The effect of this is that you're now borrowing them for the foreseeable future: at some point, you might try to use a commit or supporting object in this clone that you never actually downloaded. You borrowed it from the reference.

Your Git software will go back to the reference clone and try to use that object from there. If it's still there, everything works, but if something has happened to that reference clone—such as removing it entirely or merely discarding the referenced object from that clone because it was unused there—then your user2/huge_repo clone may become unusable. (You can restore the usability by retrieving the missing object, as long as there's some place to get it from.)
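To see the borrowing in action without touching a multi-gigabyte repository, here is a minimal sketch that uses throwaway local repositories. The directory names (user1_clone, user2_remote, user2_clone) and the demo identity are assumptions made up for this illustration, and file:// URLs stand in for GitHub so that Git uses its normal fetch negotiation rather than the local-copy shortcut.

```shell
#!/bin/sh
set -eu

# Scratch area; every name below is hypothetical and only for illustration.
work=$(mktemp -d)
cd "$work"
ident="-c user.name=demo -c user.email=demo@example.com"

# Stand-in for your existing clone of user1/huge_repo.
git init -q user1_clone
echo base > user1_clone/file.txt
git -C user1_clone add file.txt
git -C user1_clone $ident commit -q -m "base"

# Stand-in for the fork on the server: same history plus one small edit.
git clone -q "file://$work/user1_clone" user2_remote
echo edit >> user2_remote/file.txt
git -C user2_remote $ident commit -q -am "small edit"

# Clone the fork, borrowing objects from the existing clone (no --dissociate).
git clone -q --reference "$work/user1_clone" "file://$work/user2_remote" user2_clone

# The new clone records where the borrowed objects live.
cat user2_clone/.git/objects/info/alternates

# count-objects -v also reports the alternate object directory.
git -C user2_clone count-objects -v
```

As long as that alternates file exists, the new clone depends on the reference clone staying intact.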

If, on the other hand, you did use --dissociate, your Git will, having cloned user2/huge_repo with reference to the reference clone, now copy the reference clone's objects into your new clone. This means the original reference clone can be removed, or have objects collected out of it, without damaging the new clone.
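The following sketch, again with made-up local repositories standing in for the real ones, shows both ways to end up with a self-contained clone: passing --dissociate at clone time, and converting an already-borrowing clone afterwards with git repack -a -d followed by removing the alternates file. The directory names and the demo identity are assumptions for illustration only.

```shell
#!/bin/sh
set -eu

# Scratch area; every name below is hypothetical and only for illustration.
work=$(mktemp -d)
cd "$work"
ident="-c user.name=demo -c user.email=demo@example.com"

# Stand-in for the existing clone, and for the fork with one small edit.
git init -q user1_clone
echo base > user1_clone/file.txt
git -C user1_clone add file.txt
git -C user1_clone $ident commit -q -m "base"
git clone -q "file://$work/user1_clone" user2_remote
echo edit >> user2_remote/file.txt
git -C user2_remote $ident commit -q -am "small edit"

# Option 1: borrow during the clone, then copy the borrowed objects in.
git clone -q --reference "$work/user1_clone" --dissociate \
    "file://$work/user2_remote" dissociated

# Option 2: a clone that is still borrowing, converted after the fact.
git clone -q --reference "$work/user1_clone" "file://$work/user2_remote" borrowed
git -C borrowed repack -a -d -q              # pack everything, borrowed objects included
rm borrowed/.git/objects/info/alternates     # stop consulting the reference

# The reference clone can now go away without breaking either clone.
rm -rf user1_clone
git -C dissociated fsck
git -C borrowed fsck
```

Either way the download is still small; only the local disk usage grows, because the borrowed objects are copied from the reference clone rather than fetched again.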

If you can guarantee that objects won't vanish from the reference clone, you can leave it in place and save disk space—potentially quite a bit. If not, you'll probably want to spend the disk space up front, to avoid having to repair a broken repository later.

(GitHub works around this kind of problem by never expiring any objects by default. That's why you have to ask them to remove sensitive data.)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 torek