Why is the size of the bytes bigger than the original file?

I am trying to convert a PDF to binary format using Python. However, I noticed that the binary format is bigger than the file. Is there a reason for this, or am I doing it wrong?

book.pdf size: 4,672,474 bytes,
binary size: 11,302,404 bytes

with open("book.pdf", "rb") as f:
    f = f.read()
    b = bytearray(f)
    bn_str = "".join(format(ord(i), "08b") for i in str(b))
print(len(bn_str) / 8)

I expected the size to stay the same after converting to binary, but the result is two to three times bigger.



Solution 1:[1]

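The size is unexpected because str(b) does not return the raw data. It builds the printable representation of the bytearray (something like bytearray(b'%PDF...')), in which printable ASCII bytes appear as single characters and most other bytes are escaped as sequences such as \xff, four characters each. The resulting string is much longer than the original data, and the loop then counts one byte for every character in it. f.read() already returns a bytes object, which can be iterated over directly and yields one integer per byte.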
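
To see the inflation on a small scale, here is a minimal sketch. The eight sample bytes are made up for illustration (they are not taken from book.pdf), and the behavior shown assumes Python 3.

```python
# Made-up sample: four printable ASCII bytes followed by four other bytes.
sample = bytes([0x25, 0x50, 0x44, 0x46, 0x00, 0xFF, 0x80, 0x0A])

# str() on a bytearray returns its repr, not the raw data.
as_text = str(bytearray(sample))
print(as_text)

# The approach from the question: one 8-digit group per *character* of the repr.
inflated = "".join(format(ord(c), "08b") for c in as_text)

print(len(sample))         # number of real bytes
print(len(inflated) // 8)  # number of "bytes" counted via the repr string
```

The second count is larger because the repr adds the bytearray(b'...') wrapper and spends several characters on each escaped byte. A compressed PDF is mostly non-printable bytes, which is consistent with the roughly 2.4x growth reported in the question.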

Try something like this:

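with open("book.pdf", "rb") as f:
    data = f.read()  # bytes object; iterating over it yields ints 0-255
    print(len(data))  # file size in bytes
    # format(i, "08b") renders each byte as exactly 8 binary digits
    bn_str = "".join(format(i, "08b") for i in data)
    print(len(bn_str) // 8)  # same value as len(data)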

bn_str will be 8 times the size of the original file, since each byte is represented by a sequence of 8 characters, each a 1 or a 0.

The first four bytes of a PDF file should be "%PDF", so the binary output of a PDF file would start with 00100101 01010000 01000100 01000110.
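
As a quick check of that bit pattern and of the 8-characters-per-byte ratio, the following sketch works on the four header bytes only, so it needs no file. The variable names are illustrative.

```python
header = b"%PDF"  # the four bytes a PDF file is expected to start with

# One zero-padded 8-digit group per byte, separated by spaces for readability.
bits = " ".join(format(byte, "08b") for byte in header)
print(bits)  # 00100101 01010000 01000100 01000110

# Without the separators there are exactly 8 characters per original byte.
print(len(bits.replace(" ", "")) // len(header))  # 8

# Converting each group back with int(..., 2) recovers the original bytes.
restored = bytes(int(group, 2) for group in bits.split())
print(restored == header)  # True
```

The round trip at the end confirms that the binary string is only a different rendering of the same bytes, so its length divided by 8 must equal the file size.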

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
