How to decode <class 'bytes'> PDF file with Python

I want to extract data from a PDF URL without using any library. I have a problem with decoding; this is my code:

import requests

link = 'https://www.heimberg.ch/fileadmin/user_upload/u_Protokoll_GV_07.12.2021.pdf'

response = requests.get(link)
print(response, type(response))  #<Response [200]> <class 'requests.models.Response'>

data = response.content
print(type(data)) #<class 'bytes'>

print(data)

This is the response content (it is too big, so I pasted only a part of it):

b'%PDF-1.4\n%\xe2\xe3\xcf\xd3\r\n1 0 obj\n<< \n/Creator (Canon iR-ADV C5760  PDF)\n/CreationDate (D:20220127150550+01\'00\')\n/Producer (\\376\\377\\000A\\000d\\000o\\000b\\000e\\000 \\000P\\000S\\000L\\000 \\0001\\000.\\000\\\n3\\000e\\000 \\000f\\000o\\000r\\000 \\000C\\000a\\000n\\000o\\000n\\000\\000)\n>> \nendobj\n2 0 obj\n<< \n/Pages 3 0 R \n/Type /Catalog \n/OutputIntents 13 0 R \n/Metadata 14 0 R \n>> \nendobj\n4 0 obj\n<< /Type /XObject /Subtype /Image /Width 1240 /Height 1753 /BitsPerComponent 8 \n/ColorSpace /DeviceGray /Filter [ /FlateDecode /DCTDecode ] /Length 7753 >> \nstream\r\nx\x01\xed]\t\\TU\xdb\x1f\xdc\x97,\xcd\xa1\xb2pI\x19e\xa4\xf1\xcd\x05R\xd4\xcc\xe5\xaa3\x846\x83\x86\xa2\x92\xbb\x02\x82+\xa1\x14\xee\x9aM\x9a2\x0e\x98"j\xc3\xaa\x16\xbc\xe8\xb8\xe0\xd6\xeb\x82\x92J\x06*"\xa6\xe6\x9a\xa2\x98\x0bh\xa2\xc2\xf7\x7f\xce\x9d\x9d\x81\xd8\xfd\xbe\xef-\xcf\xef.\xe7\xde{\x9e\xe5<\xfb9C\x85\x99\x85\xbf\x0b\x1a\xcb\xfaK\xfb\x0blj\xd8\x08\xc2\xf0OPX x\xad\xef\x18

I have tried different encodings, but I think the problem is mixed encodings. Or am I wrong?

I have also tried this:

from io import BytesIO

print(BytesIO(data)) # <_io.BytesIO object at 0x7fb740760c20>
for i in BytesIO(data):
    print(i)

And I'm getting this response (row by row):

b'%PDF-1.4\n'
b'%\xe2\xe3\xcf\xd3\r\n'
b'1 0 obj\n'
b'<< \n'
b'/Creator (Canon iR-ADV C5760  PDF)\n'
b"/CreationDate (D:20220127150550+01'00')\n"
b'/Producer (\\376\\377\\000A\\000d\\000o\\000b\\000e\\000 \\000P\\000S\\000L\\000 \\0001\\000.\\000\\\n'
b'3\\000e\\000 \\000f\\000o\\000r\\000 \\000C\\000a\\000n\\000o\\000n\\000\\000)\n'
b'>> \n'
b'endobj\n'
b'2 0 obj\n'
b'<< \n'
b'/Pages 3 0 R \n'
b'/Type /Catalog \n'
b'/OutputIntents 13 0 R \n'
b'/Metadata 14 0 R \n'
b'>> \n'
b'endobj\n'
b'4 0 obj\n'
b'<< /Type /XObject /Subtype /Image /Width 1240 /Height 1753 /BitsPerComponent 8 \n'
b'/ColorSpace /DeviceGray /Filter [ /FlateDecode /DCTDecode ] /Length 7753 >> \n'
b'stream\r\n'
b'x\x01\xed]\t\\TU\xdb\x1f\xdc\x97,\xcd\xa1\xb2pI\x19e\xa4\xf1\xcd\x05R\xd4\xcc\xe5\xaa3\x846\x83\x86\xa2\x92\xbb\x02\x82+\xa1\x14\xee\x9aM\x9a2\x0e\x98"j\xc3\xaa\x16\xbc\xe8\xb8\xe0\xd6\xeb\x82\x92J\x06*"\xa6\xe6\x9a\xa2\x98\x0bh\xa2\xc2\xf7\x7f\xce\x9d\x9d\x81\xd8\xfd\xbe\xef-\xcf\xef.\xe7\xde{\x9e\xe5<\xfb9C\x85\x99\x85\xbf\x0b\x1a\xcb\xfaK\xfb\x0blj\xd8\x08\xc2\xf0OPX x\xad\xef\x18\xff\xa9\xfe\xad\xa4\xfe\xe3\x04\xf8\xaf\xf0\x82\xa0\xaf\xa0\xd6;\xef\xb4z\xa7\x95}\xabV\xf6\xce\x8e\xf6

... ...

How can I get text instead of the characters above?



Solution 1:[1]

A binary PDF is composed primarily of a mix of objects, many of which can have their own encodings for their different parts, so there is no single encoding: a part may be hex, Flate, zip or something else. The parts are also out of order; see how there is a gap between 2 and 4 on the left. One step is to try to rationalise them into fewer formats, as here on the right. However, we will see that this does not make the task much easier.
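
To see the object numbers and the per-object filters for yourself, a rough sketch like the one below scans the raw bytes with regular expressions. It is not a real PDF parser: it ignores cross-reference tables and object streams, and it would be fooled by binary data that happens to contain "endobj". The names list_objects, OBJ_RE and FILTER_RE are my own, and sample is a shortened copy of the header from the question, with the stream body replaced by four placeholder bytes and /Length adjusted to match.

```python
import re

# Shortened copy of the bytes shown in the question. The real stream body
# is replaced by four placeholder bytes (an assumption for illustration).
sample = (
    b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\r\n"
    b"1 0 obj\n<< \n/Creator (Canon iR-ADV C5760  PDF)\n>> \nendobj\n"
    b"2 0 obj\n<< \n/Pages 3 0 R \n/Type /Catalog \n>> \nendobj\n"
    b"4 0 obj\n<< /Type /XObject /Subtype /Image /Width 1240 /Height 1753 "
    b"/BitsPerComponent 8 \n/ColorSpace /DeviceGray "
    b"/Filter [ /FlateDecode /DCTDecode ] /Length 4 >> \n"
    b"stream\r\n\x00\x01\x02\x03\r\nendstream\nendobj\n"
)

# "N G obj ... endobj" blocks, and the value that follows /Filter
# (either a single name or an array of names).
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/\w+)")


def list_objects(pdf_bytes):
    """Return (object_number, [filter names]) for each object, in file order."""
    found = []
    for match in OBJ_RE.finditer(pdf_bytes):
        number = int(match.group(1))
        body = match.group(3)
        filt = FILTER_RE.search(body)
        names = re.findall(rb"/(\w+)", filt.group(1)) if filt else []
        found.append((number, [name.decode("ascii") for name in names]))
    return found


for number, filters in list_objects(sample):
    print(number, filters)
```

On this sample the scan reports objects 1, 2 and 4, with only object 4 carrying filters (FlateDecode followed by DCTDecode). Every object can declare its own filter chain, which is why trying text encodings on the whole file cannot work.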

What we can see is that this file is heavily compressed and dependent on images, so expanding it is not much help. If we simply copy or export the text we can see exactly why: it becomes instantly clear that it was images plus OCR, and the OCR has many flaws. Rather than spell-checking it, it may be better to start afresh.
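
As a hedged illustration of why expansion does not help here: object 4 in the question declares /Filter [ /FlateDecode /DCTDecode ], and its stream starts with x\x01, which is a zlib header. Inflating such a stream with the standard zlib module yields JPEG data, not text. The sketch below uses a synthetic object with a fake JPEG payload, since the real stream is truncated in the question; inflate_first_stream and looks_like_jpeg are names I made up.

```python
import re
import zlib

# Bytes between "stream" and "endstream". A real parser would use /Length
# instead of a regular expression; this is only a sketch.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def inflate_first_stream(pdf_bytes):
    """Undo FlateDecode on the first stream found, or return None."""
    match = STREAM_RE.search(pdf_bytes)
    if match is None:
        return None
    # decompressobj tolerates stray bytes after the end of the zlib data
    return zlib.decompressobj().decompress(match.group(1))


def looks_like_jpeg(payload):
    """JPEG (DCTDecode) data begins with the marker bytes FF D8."""
    return payload[:2] == b"\xff\xd8"


# Synthetic stand-in for object 4: a fake JPEG payload, Flate-compressed.
fake_jpeg = b"\xff\xd8\xff\xe0" + b"not a real image"
fake_object = (
    b"4 0 obj\n<< /Subtype /Image /Filter [ /FlateDecode /DCTDecode ] >>\n"
    b"stream\r\n" + zlib.compress(fake_jpeg) + b"\r\nendstream\nendobj\n"
)

inflated = inflate_first_stream(fake_object)
print(looks_like_jpeg(inflated))
```

So even after decompressing, what comes out is a scanned page image. Any text you can copy from the viewer comes from the OCR layer, not from these streams.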

For an initial quality test I suggest you simply copy and paste the text, which will give you a feel for what the best outcome might be.

Whilst command-line output to the console has encoding issues of its own, the same output written to a text file should not need decoding. I am testing with Xpdf 4.03 since I am on Win7x32, but most 64-bit Python setups should be able to use PDFtoTEXT from poppler version 2022.02 or older.
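
If you want to drive that from Python, a minimal sketch is below. It assumes a pdftotext executable (Xpdf or poppler) is on your PATH and that you have already saved response.content to a file. The function names and any file names you pass in are my own placeholders.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path):
    """Command line that writes the extracted text straight to a file."""
    # -layout keeps the physical page layout, -enc sets the output encoding
    return ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, txt_path]


def pdf_to_text_file(pdf_path, txt_path):
    """Run pdftotext and return the contents of the text file it wrote."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

For example, write the downloaded bytes with open("protokoll.pdf", "wb") and then call pdf_to_text_file("protokoll.pdf", "protokoll.txt"). Because the text goes to a file rather than the console, the console's own code page does not get in the way.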

If we look at how the text is constructed, we can also see how it will cause problems if extracted as sequential text blocks, since OCR commonly breaks the line flow into many pieces.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1