Decompressing git objects with zlib

I decompressed three git objects with this Python script:

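import zlib

# Path to a loose object file under .git/objects/ (placeholder)
filename = '/path_to_file'

# Loose objects are zlib-compressed, so read the file as raw bytes
with open(filename, 'rb') as f:
    compressed_contents = f.read()

decompressed_contents = zlib.decompress(compressed_contents)
print(decompressed_contents)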

For these three objects, I get these three outputs:

b'tree 32\x00100644 file\x00\xe6\x01\xe5\x92\x8e\xcc.\xc5\xbe\t\x91{\xe9\x92:\x85\xc4\x89\xe9H'

b'commit 196\x00tree 2b32fe41c7f8c21d5010fb59a59bcce42b2b3ab5\nauthor author <author> 1643729123 +0100\ncommitter author_email <author_email> 1643729123 +0100\n\nadd hello\n'

b'blob 6\x00hello\n'

The git documentation (the Pro Git book) says that git adds a null byte, written \u0000, at the end of the object header. But when I decompress those objects with zlib, the output shows \x00 instead of \u0000.

  • So, what does git really store in those files, \u0000 or \x00?
  • Does this script output the raw content of git objects?


Solution 1:[1]

\x00 is stored. Or more precisely: a single byte with the value 0 (or 0x00, if you want) is stored.
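
This can be checked against the outputs in the question. The following is a minimal sketch, not git's own parser: the helper name split_git_object is hypothetical, and the sample bytes are copied from the blob output above. It splits a decompressed object at the first zero byte and confirms that the separator is a single byte with the value 0.

```python
def split_git_object(raw: bytes):
    """Split a decompressed object into (type, size, body).

    Hypothetical helper for illustration only.
    """
    # The header is "<type> <size>", terminated by one byte with the value 0.
    header, _, body = raw.partition(b"\x00")
    obj_type, _, size = header.partition(b" ")
    return obj_type.decode("ascii"), int(size), body


raw = b"blob 6\x00hello\n"  # decompressed blob from the question
obj_type, size, body = split_git_object(raw)

print(obj_type, size, body)  # blob 6 b'hello\n'
print(raw[6])                # 0: indexing bytes yields the integer byte value
print(len(body) == size)     # True: the declared size matches the body length
```

Because only the first zero byte ends the header, the same split also works for the tree output, whose body contains a further zero byte after the file name.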

\u0000 is the Unicode NUL character, a.k.a. U+0000 NUL. The \u escape mechanism is a common way to represent Unicode characters, even though it's usually limited to 4 hex digits (which means it can't represent Unicode code points outside of the BMP, such as U+1F600 😀).

Why are these two used interchangeably? Because in most character encodings \u0000 is actually encoded as the single byte 0x00. Specifically, most 8-bit encodings as well as UTF-8 follow this practice.

Note that it's still important to distinguish the two things, because one is a character (that will often be mapped onto a byte) and the other is a byte value (that can often be interpreted as a character).
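
The distinction is easy to see in Python 3, where str holds characters and bytes holds byte values. The sketch below uses only built-in str and bytes operations; the variable names are illustrative. It also bears on the output in the question: the \x00 there is just how Python displays a zero byte when printing a bytes object.

```python
nul_char = "\u0000"  # str: one Unicode character, U+0000
nul_byte = b"\x00"   # bytes: one byte with the value 0

# In a str literal, "\x00" and "\u0000" are two spellings of the same character.
print("\x00" == "\u0000")                      # True

# UTF-8 and Latin-1 (an 8-bit encoding) map that character to the byte 0x00.
print(nul_char.encode("utf-8") == nul_byte)    # True
print(nul_char.encode("latin-1") == nul_byte)  # True

# UTF-16 is a counterexample: the same character takes two bytes.
print(nul_char.encode("utf-16-le"))            # b'\x00\x00'

# A character and a byte are different kinds of value, so they never compare equal.
print(nul_char == nul_byte)                    # False
```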

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
