Why is the size of bytes bigger than the original file?
I am trying to convert a PDF to binary format using Python. However, I noticed that the binary output is larger than the original file. Is there a reason for this, or am I doing it wrong?
book.pdf size: 4,672,474 bytes,
binary size: 11,302,404 bytes
with open("book.pdf", "rb") as f:
    f = f.read()
b = bytearray(f)
bn_str = "".join(format(ord(i), "08b") for i in str(b))
print(len(bn_str) / 8)
I was expecting to get the same size after converting to binary. Instead, the result is 2 to 3 times larger.
Solution 1:[1]
The size is unexpected because str(b) creates a printable string representation of the bytearray: a mix of ASCII characters and escape sequences for non-printable bytes (for example, \x00), which makes the string much longer than the data itself. f.read() already returns a bytes object, which can be iterated over directly; each item is an integer from 0 to 255, so neither bytearray() nor ord() is needed.
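To see the inflation on a small input, here is a minimal sketch. The six sample bytes are made up for illustration and are not taken from the question's file. It compares the raw length with the length of the string that str() produces.

```python
# Illustrative sample: "%PDF" followed by two non-printable bytes.
raw = bytes([0x25, 0x50, 0x44, 0x46, 0x00, 0xFF])

# str() on a bytearray returns its repr, not the decoded contents.
text = str(bytearray(raw))

print(len(raw))   # number of bytes in the data
print(text)       # the repr, including the "bytearray(b'...')" wrapper
print(len(text))  # number of characters in the repr
```

Each non-printable byte becomes a four-character escape, and the wrapper text is counted too, so looping over the string processes more characters than there are bytes.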
Try something like this:
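with open("book.pdf", "rb") as f:
    data = f.read()  # bytes object holding the raw file contents
print(len(data))  # size of the file in bytes
# Iterating over bytes yields integers, so no ord() call is needed.
bn_str = "".join(format(i, "08b") for i in data)
print(len(bn_str) // 8)  # matches len(data)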
bn_str will be 8 times the size of the original file, since each byte is represented by an 8-character sequence of 1s and 0s. Dividing len(bn_str) by 8 therefore gives back the original byte count.
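The 8x relationship can be checked without a PDF at hand. This is a small sketch that assumes an in-memory sample (every byte value once) in place of book.pdf, and it also converts the bit string back to confirm that nothing was lost.

```python
# Assumed sample data standing in for the file contents: all 256 byte values.
data = bytes(range(256))

# Iterating over bytes yields integers, so format() can be applied directly.
bn_str = "".join(format(i, "08b") for i in data)

print(len(data))         # bytes in the sample
print(len(bn_str))       # characters in the bit string
print(len(bn_str) // 8)  # back to the byte count

# Round trip: parse each group of 8 characters as a base-2 integer.
restored = bytes(int(bn_str[i:i + 8], 2) for i in range(0, len(bn_str), 8))
print(restored == data)
```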
The first four bytes of a PDF file should be "%PDF", so the binary output for a PDF file would start with 00100101 01010000 01000100 01000110.
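The expected header bits can be reproduced with the same formatting call. In this sketch, header_bits is a hypothetical helper name, and the commented-out usage assumes a file called book.pdf exists.

```python
def header_bits(data, count=4):
    """Return the first `count` bytes of data as space-separated 8-bit groups."""
    return " ".join(format(b, "08b") for b in data[:count])

# The literal signature, used here so the example runs without a file.
print(header_bits(b"%PDF"))

# With a real file (assumed to exist), read only the first four bytes:
# with open("book.pdf", "rb") as f:
#     print(header_bits(f.read(4)))
```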
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
