Decompressing git objects with zlib
I decompressed three git objects with this Python script:
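import zlib
filename = '/path_to_file'
# Read the raw (zlib-compressed) bytes of the object file.
with open(filename, 'rb') as f:
    compressed_contents = f.read()
# Inflate the data and print it as a bytes literal.
decompressed_contents = zlib.decompress(compressed_contents)
print(decompressed_contents)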
For these three objects, I get these three outputs:
b'tree 32\x00100644 file\x00\xe6\x01\xe5\x92\x8e\xcc.\xc5\xbe\t\x91{\xe9\x92:\x85\xc4\x89\xe9H'
b'commit 196\x00tree 2b32fe41c7f8c21d5010fb59a59bcce42b2b3ab5\nauthor author <author> 1643729123 +0100\ncommitter author_email <author_email> 1643729123 +0100\n\nadd hello\n'
b'blob 6\x00hello\n'
In the Git documentation (the Pro Git book), they say that git adds a null byte at the end of the object header, which is \u0000. But when I decompress those objects with zlib, \u0000 is replaced by \x00.
- So, what does git really store in those files, \u0000 or \x00?
- Does this script output the raw content of git objects?
Solution 1:[1]
\x00 is stored. Or more precisely: a single byte with the value 0 (or 0x00, if you want) is stored.
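To see that single zero byte at work, here is a minimal sketch that splits a decompressed object into its header and body at the first byte with value 0. The helper name `split_git_object` is hypothetical (it is not part of git or zlib), and the loose object is simulated by compressing the blob output shown in the question rather than by reading a real file.

```python
import zlib


def split_git_object(raw: bytes):
    """Split a decompressed object into (type, size, body).

    Hypothetical helper for illustration; assumes the layout
    b"<type> <size>" + one zero byte + body, as in the outputs above.
    """
    # The first byte with value 0 terminates the header.
    header, _, body = raw.partition(b"\x00")
    obj_type, _, size = header.partition(b" ")
    return obj_type.decode("ascii"), int(size), body


# Simulate an object file using the blob output from the question.
compressed = zlib.compress(b"blob 6\x00hello\n")

obj_type, size, body = split_git_object(zlib.decompress(compressed))
print(obj_type, size, body)
```

The size in the header counts only the bytes after the zero byte, so for the blob it should equal `len(body)`.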
\u0000 is the Unicode NUL character, a.k.a. U+0000 NUL. The \u escape mechanism is a common way to represent Unicode characters, even though it's usually limited to 4 hex digits (which means it can't represent Unicode code points outside of the BMP, such as U+1F600 😀).
Why are these two used interchangeably? Because in most character encodings, \u0000 is actually encoded as the single byte 0x00. Specifically, most 8-bit encodings as well as UTF-8 follow this practice.
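A short Python check illustrates the point. The encodings chosen here are just examples, and UTF-16 is included as a counterexample where the same character does not map to a single zero byte.

```python
# The Unicode character U+0000, written with the \u escape in a str literal.
nul_char = "\u0000"

# In these encodings the NUL character becomes the single byte 0x00.
single_byte = {enc: nul_char.encode(enc) for enc in ("ascii", "latin-1", "utf-8")}

# UTF-16 (little-endian, no BOM) uses two bytes for the same character.
utf16 = nul_char.encode("utf-16-le")

# In a bytes literal, \x00 denotes the byte value 0 directly.
zero_byte = b"\x00"

print(single_byte, utf16, zero_byte[0])
```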
Note that it's still important to distinguish the two things, because one is a character (that will often be mapped onto a byte) and the other is a byte value (that can often be interpreted as a character).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
