Zip utility giving me different md5sum every time in Linux
When I zip (Zip 2.31) the same file in Linux, I get a different checksum every time. How can I keep the same md5sum as last time? I'm using the latest zip update from yum.
Solution 1:[1]
The generated archive contains not only the compressed file data but also "extra file attributes" (as the zip documentation calls them), such as file timestamps, file attributes, and so on.
If this metadata differs between runs, you will never get the same checksum, because the metadata for the compressed file has changed and is included in the archive.
You can use zip's -X option (or the long --no-extra option) to avoid including the file's extra attributes in the archive:
zip -X foo.zip foo-file
Successive runs of this command, without file modifications in between, should not change the hash of the archive.
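One way to check this on your own system is to build the archive twice and compare checksums. The sketch below is illustrative and not part of the original answer: `zip_is_reproducible` is a made-up helper name, and it assumes `zip`, `sha256sum`, and `mktemp` are installed.

```shell
# zip_is_reproducible FILE
# Hypothetical helper (not a zip feature): archives FILE twice with -X and
# succeeds only if both archives have the same checksum.
zip_is_reproducible() {
    file=$1
    tmp=$(mktemp -d) || return 2
    # -q: quiet, -X: leave out the extra file attributes
    zip -qX "$tmp/first.zip" "$file" || { rm -rf "$tmp"; return 2; }
    sleep 1   # let the clock advance between the two runs
    zip -qX "$tmp/second.zip" "$file" || { rm -rf "$tmp"; return 2; }
    sum1=$(sha256sum < "$tmp/first.zip")
    sum2=$(sha256sum < "$tmp/second.zip")
    rm -rf "$tmp"
    [ "$sum1" = "$sum2" ]
}
```

For example, `zip_is_reproducible foo-file && echo "same hash"` should print the message as long as foo-file is not modified between the two runs.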
Solution 2:[2]
Adding the -X flag as suggested in @mc-nd's answer worked fine for me on a single-file zip.
But when I was compressing a directory (node_modules in my case), I got a different hash each time I reinstalled node_modules.
The fix was to also add the -D flag:
-D
--no-dir-entries
Do not create entries in the zip archive for directories.
Directory entries are created by default so that their attributes can
be saved in the zip archive.
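Put together, a directory archive with both flags might look like the sketch below. `zip_tree` is a hypothetical wrapper name, and the `rm -f` step is an addition to the answer: zip updates an existing archive rather than replacing it, so a stale archive could otherwise affect the result. This only helps when the file modification times themselves are unchanged between runs.

```shell
# zip_tree ARCHIVE DIR
# Hypothetical wrapper around the flags discussed above.
zip_tree() {
    archive=$1
    dir=$2
    rm -f "$archive"   # start from scratch instead of updating an old archive
    # -q: quiet, -r: recurse into DIR, -X: no extra attributes,
    # -D: no separate entries for directories
    zip -q -r -X -D "$archive" "$dir"
}
```

Usage would be something like `zip_tree node_modules.zip node_modules`.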
Solution 3:[3]
Neither -X nor -D worked for me. It looks like zip still sets timestamps within the archive, causing mismatched hashes on identical content.
I've fixed the issue by manually setting file timestamps using:
touch -t 202001010000 file
Solution 4:[4]
In order to make a deterministic archive, one that can be rebuilt and verified using a hash, several things are required:
Timestamps of all files must have predictable values
Set the timestamps of all files to a specific value, e.g.
find . -exec touch -d '1985-10-21 09:00:00' {} \;
As an aside, the earliest date supported by the zip format is 01/01/1980 - timestamping all files to the unix epoch (01/01/1970) won't have the desired effect.
If making a zip from a Git checkout you could use the Git commit timestamp of the last change to each file (inspired by this stackoverflow answer).
git ls-files | xargs -I {} sh -c 'chmod 644 "{}"; touch -m -t "$(git log --pretty=format:%cd -n 1 --date=iso "{}" | sed "s/-//g;s/ //;s/://;s/:/\./;s/ .*//")" "{}"'
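For the fixed-timestamp variant, a portable sketch could use `touch -t` instead of the `-d` date string. The helper name `normalize_mtimes` is hypothetical, and forcing `TZ=UTC` for the touch step is an assumption added here so that the stamp denotes the same instant on every machine.

```shell
# normalize_mtimes DIR
# Hypothetical helper: sets the modification time of DIR and everything
# below it to one fixed value.
# The stamp uses touch -t syntax: [[CC]YY]MMDDhhmm[.SS]
normalize_mtimes() {
    stamp=198510210900.00
    # TZ=UTC is inherited by touch, so the stamp is read as UTC
    TZ=UTC find "$1" -exec touch -t "$stamp" {} +
}
```

Any date from 1980 onwards should work for the reason given above.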
Permissions of all files must have predictable values
Explicitly set permissions, say to 644, like this:
find . -type f -exec chmod 644 {} \;
Don't rely on the permissions applied by git clone because these depend on the environment's umask value and are therefore unpredictable.
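Setting everything to 644 also strips the executable bit from scripts. If that matters, one possible variation (not what the answer above does) is to map executables to 755 and everything else to 644. `normalize_modes` is a hypothetical name, and the 755 choice is an assumption.

```shell
# normalize_modes DIR
# Hypothetical helper, a variation on the plain "chmod 644" above:
# files with the owner-execute bit become 755, all other files become 644.
normalize_modes() {
    # -perm -100 matches files whose owner-execute bit is set
    find "$1" -type f -perm -100 -exec chmod 755 {} +
    find "$1" -type f ! -perm -100 -exec chmod 644 {} +
}
```

Either way, the point is that the resulting modes no longer depend on the umask in effect when the files were created.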
Present files to zip in a specific order
The order in which files are added to a zip matters. Do not rely on recursion or globbing, which follow the order in which files are stored in directories; that order is filesystem dependent and unpredictable. Instead, use something like find and sort the list to provide a predictable order.
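A sorted listing can be sketched as follows. `list_sorted` is a hypothetical name, and `LC_ALL=C` is an addition not found in the original answer: without it, the order produced by `sort` depends on the locale of whoever runs the command.

```shell
# list_sorted DIR
# Hypothetical helper: prints the regular files below DIR, as paths relative
# to DIR, in plain byte order.
list_sorted() {
    # The subshell keeps the cd from affecting the caller.
    # LC_ALL=C makes sort compare raw bytes instead of using locale rules.
    ( cd "$1" && find . -type f | LC_ALL=C sort )
}
```

In byte order, uppercase letters sort before lowercase ones, so `./B` is listed before `./a`.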
Disable the zip "extra attributes" feature
This ensures that non-deterministic data such as archive modification timestamps, user names, etc, is not written to the archive. Use the -X option to do this.
Example:
find . -type f | sort | TZ=UTC zip -qX myfile.zip -@
Also, here, the timezone is forced to UTC to avoid further confusion.
Such a zip should be deterministic and verifiable using md5sum, sha256sum, etc.
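The steps in this answer could be combined into one function along these lines. This is a sketch under stated assumptions rather than a definitive recipe: `make_deterministic_zip` is a hypothetical name, the fixed timestamp and the `LC_ALL=C` sort are choices made here, and the function modifies the timestamps and permissions of the source tree in place.

```shell
# make_deterministic_zip SRC_DIR ARCHIVE
# Hypothetical wrapper. ARCHIVE must be an absolute path outside SRC_DIR,
# because the function changes directory and must not archive its own output.
make_deterministic_zip() {
    src=$1
    archive=$2
    rm -f "$archive"   # zip would otherwise update an existing archive
    (
        cd "$src" || exit 1
        # Predictable permissions and timestamps (changes SRC_DIR in place)
        find . -type f -exec chmod 644 {} +
        TZ=UTC find . -exec touch -t 198510210900.00 {} +
        # Predictable order, no extra attributes, fixed timezone
        find . -type f | LC_ALL=C sort | TZ=UTC zip -qX "$archive" -@
    )
}
```

Running it on two separate checkouts with identical content, for example `make_deterministic_zip "$PWD/src" /tmp/myfile.zip`, and comparing the results with sha256sum is a reasonable way to confirm that the build is reproducible with your version of zip.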
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | RomanHotsiy |
| Solution 3 | Valer |
| Solution 4 | |
