Decompressing git objects with zlib

I decompressed three git objects with this Python script:

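import zlib

filename = '/path_to_file'
# Read the compressed object file as raw bytes
with open(filename, 'rb') as f:
    compressed_contents = f.read()
# Loose objects are zlib-compressed; this recovers the header and body
decompressed_contents = zlib.decompress(compressed_contents)
print(decompressed_contents)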

For these three objects, I get the following outputs:

b'tree 32\x00100644 file\x00\xe6\x01\xe5\x92\x8e\xcc.\xc5\xbe\t\x91{\xe9\x92:\x85\xc4\x89\xe9H'

b'commit 196\x00tree 2b32fe41c7f8c21d5010fb59a59bcce42b2b3ab5\nauthor author <author> 1643729123 +0100\ncommitter author_email <author_email> 1643729123 +0100\n\nadd hello\n'

b'blob 6\x00hello\n'

The Git documentation (the Pro Git book) says that Git adds a null byte at the end of the object header, written as \u0000. But when I decompress those objects with zlib, I see \x00 where I expected \u0000.

So, what does git really store in those files, \u0000 or \x00?
Does this script output the raw content of git objects?

Solution 1:[1]

\x00 is stored. Or more precisely: a single byte with the value 0 (or 0x00, if you want) is stored.
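To see that zero byte acting as the header terminator, here is a minimal sketch that splits a decompressed object at the first \x00. The helper name split_git_object is hypothetical, and the sample data is the blob from the question, compressed in memory instead of being read from a file:

```python
import zlib


def split_git_object(raw: bytes):
    """Split a decompressed object into (type, size, body).

    Hypothetical helper for illustration: it assumes the layout
    b"<type> <size>\\x00<body>" seen in the outputs above.
    """
    header, _, body = raw.partition(b"\x00")   # first zero byte ends the header
    obj_type, _, size = header.partition(b" ")
    return obj_type.decode("ascii"), int(size), body


# Round-trip the blob from the question through zlib.
compressed = zlib.compress(b"blob 6\x00hello\n")
obj_type, size, body = split_git_object(zlib.decompress(compressed))
print(obj_type, size, body)
```

The size in the header counts the bytes of the body only, which is why it is 6 for b'hello\n'.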

\u0000 is the Unicode NUL character, a.k.a. U+0000 NUL. The \u escape mechanism is a common way to represent Unicode characters, even though it's usually limited to 4 hex digits, which means it can't represent Unicode code points outside of the BMP (Basic Multilingual Plane), such as U+1F600 😀.
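As an illustration in Python specifically (other languages handle this differently), the 4-digit \u escape is complemented by an 8-digit \U escape for code points outside the BMP. The variable names below are only for this sketch:

```python
# "\u" takes exactly 4 hex digits; Python's "\U" takes exactly 8.
nul = "\u0000"
grinning_face = "\U0001F600"  # U+1F600, outside the BMP

print(ord(nul))                 # 0
print(hex(ord(grinning_face)))  # 0x1f600
print(len(grinning_face))       # 1 (a single code point)

# In UTF-16, a code point outside the BMP is stored as a surrogate pair.
utf16 = grinning_face.encode("utf-16-be")
print(utf16.hex())              # d83dde00
```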

Why are these two used interchangeably? Because in most character encodings, \u0000 is actually encoded as the single byte 0x00. Specifically, most 8-bit encodings as well as UTF-8 follow this practice.

Note that it's still important to distinguish the two things, because one is a character (that will often be mapped onto a byte) and the other is a byte value (that can often be interpreted as a character).
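The character/byte distinction can be checked directly in Python. This is a small sketch, not part of the original answer: it encodes the character U+0000 under a few encodings and shows that the result is a single zero byte in ASCII, Latin-1, and UTF-8, but not in UTF-16 or UTF-32:

```python
nul = "\u0000"  # a str holding one character, U+0000

# In these encodings the character maps to exactly one zero byte.
single_byte = {name: nul.encode(name) for name in ("ascii", "latin-1", "utf-8")}
print(single_byte)

# In wider encodings the same character takes several zero bytes.
utf16 = nul.encode("utf-16-le")
utf32 = nul.encode("utf-32-le")
print(len(utf16), len(utf32))

# Going the other way: the byte 0x00 decodes to the character U+0000.
decoded = b"\x00".decode("utf-8")
print(decoded == nul)
```

So the \x00 that Python prints for the decompressed object is the byte itself, while \u0000 names the character that this byte represents in the common encodings.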

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
