How to convert a text-based PDF to UTF-8 encoding?
I have a bunch of PDFs that display Cyrillic properly, but if I copy and paste text from them, gibberish is produced.
I then used Okular's Save As function to convert a PDF to a text file and found that the encoding is WINDOWS-1251, an old Cyrillic encoding. After converting it to UTF-8, the Cyrillic displays properly.
A sample link of the file is https://cdn.esis.edu.mn/cover/01/01_mongol_khel.pdf
Is there a way to convert the PDFs themselves to UTF-8 encoding so that I can copy, paste, and search?
SOLVED:
With the information provided by @iPDFdev, I managed to solve this problem.
For anyone who might encounter a similar problem: I took the Windows-1251 to UTF-8 table at https://www.compart.com/en/unicode/charsets/windows-1251 and modified the code at https://github.com/pymupdf/PyMuPDF/issues/530. I disregarded the old ToUnicode map completely and added Cyrillic letter maps for all fonts on all pages.
import fitz  # PyMuPDF
import re

doc = fitz.open(inputFileName)
# Replace whatever bfrange section each ToUnicode CMap has with one linear
# range: CIDs 0xC0..0xFF map onto U+0410 (Cyrillic Capital A) onward.
new = '1 beginbfrange\n<c0> <ff> <0410>\nendbfrange'
for pno in range(doc.page_count):
    for font_tuple in doc.get_page_fonts(pno):  # was hardcoded to page 2
        for line in doc.xref_object(font_tuple[0]).splitlines():
            line = line.strip()
            if line.startswith("/ToUnicode"):
                stream_id = int(line.split()[1])
                old_stream_decoded = doc.xref_stream(stream_id).decode()
                new_stream_decoded = re.sub('[0-9]+? beginbfrange.*endbfrange',
                                            new, old_stream_decoded,
                                            flags=re.DOTALL)
                doc.update_stream(stream_id, new_stream_decoded.encode())
doc.save(outputFileName)
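To see what the `re.sub` call above does in isolation, here is a self-contained sketch. The CMap fragment is made up for illustration; it only mimics the shape of a real bfrange section, it is not taken from the sample file.

```python
import re

# A made-up fragment resembling the bfrange section of a broken ToUnicode CMap,
# where CIDs are mapped to Latin-1 code points instead of Cyrillic ones.
old_stream = (
    "2 beginbfrange\n"
    "<c0> <c0> <00c0>\n"
    "<cc> <cc> <00cc>\n"
    "endbfrange"
)

# One linear range: CIDs 0xC0..0xFF map straight onto U+0410 onward.
new = "1 beginbfrange\n<c0> <ff> <0410>\nendbfrange"

# re.DOTALL lets '.*' span the newlines inside the bfrange section.
patched = re.sub(r"[0-9]+? beginbfrange.*endbfrange", new, old_stream,
                 flags=re.DOTALL)
print(patched)
```

The substitution discards the per-character entries wholesale and replaces them with the single Cyrillic range, which is exactly what the loop above applies to every font's ToUnicode stream.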
Solution 1:[1]
The CIDs (character ids) can be converted correctly (manually) to Cyrillic using Windows 1251 encoding.
But PDFs do not support this encoding, and the ToUnicode CMap on the font is built incorrectly: it assumes Windows-1251 code points where it should use Unicode values.
For example: the CID 0xCC is used to display the Cyrillic Capital Letter EM (U+041C). The internal font encoding maps 0xCC to the glyph (character image) representing U+041C so visually you get the correct letter.
But for text extraction you have to provide a ToUnicode CMap, which tells the PDF processor what Unicode character each id represents. So the ToUnicode CMap should include an entry like 0xCC -> U+041C, but the CMap in the file instead includes the entry 0xCC -> U+00CC, which is not Cyrillic Capital Letter EM.
Incidentally, 0xCC does map to U+041C under Windows-1251 encoding, but the PDF processor has no way of knowing that.
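This relationship can be checked directly in Python; the snippet below is a small illustration of the point above, not part of either answer.

```python
# CID 0xCC decoded with Windows-1251 is CYRILLIC CAPITAL LETTER EM (U+041C).
# cp1251 maps 0xC0..0xFF contiguously onto U+0410..U+044F, which is why a
# single linear bfrange ('<c0> <ff> <0410>') can cover all the letters.
ch = bytes([0xCC]).decode("cp1251")
print(ch, hex(ord(ch)))  # М 0x41c

# The broken ToUnicode entry claims 0xCC -> U+00CC ('Ì'), so extracted text
# comes out as Latin-1 mojibake. That can also be repaired after extraction:
fixed = "Ì".encode("latin-1").decode("cp1251")
print(fixed)  # М
```

The round trip in the last two lines (encode the mojibake back to its byte values, then decode as cp1251) is a common way to rescue text extracted from such a file without touching the PDF itself.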
Solution 2:[2]
Using
poppler-22.04.0\Library\bin>pdftotext -layout -f 1 -l 1 -enc ISO-8859-9 encoded.pdf -
Response: the first-page attempt with just -layout seems to show there are some encoding issues, but it is rough programmatic output that could perhaps be improved via find-and-replace or code tweaking:
?.??????, ?.?????,
?.?????????, ?.??????
?????? ??
I
?????? ?????????? ?????????
1 ??? ?????? ???? ?????
????????, ??, ?????? ?????, ???? ????
????????? ????.
????? ??? ????
???????? ????? ??? ?????.
?????????? ????????.
?????????? ???
2020 ??
An alternative is to try the following, which yields more valid characters but needs "de-spacing":
pdftotext -layout -f 1 -l 1 -enc UTF-16 encoded.pdf -
? . ? ? ? ? ? ? ? ? ? , ? . ? ? ? ? ? ,
? . ? ? ? ? ? ? ? ? ? , ? . ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ?
I
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? , ! ?Q ? , ? ? ? ? ? ? ? ? ? ? ? ? , !? ? ? ? ? ? / ? ? ? ?
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? .
? ? ? ? ? ? ? ?L ? ? ? ? ? ?
! ? ? ? ? ? ? ? ? ? ? ? ? ? ? ! ? ? ? ? ? ? ? ? .
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? .
? ? ? ? ? ? ? ? ? ? ? ? ? ?
2 0 2 0 ? ?
De-spaced, but it still needs replacing ?! with C and ?/ with ? etc.
Note: in both cases the console code page was set with chcp 1251.
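One way to "de-space" the UTF-16 output could be sketched as follows. It assumes that single spaces are glyph padding and that wider runs of spaces mark real word gaps; that is an assumption about this particular pdftotext output, not a general rule, and the sample line is hypothetical.

```python
import re

# Hypothetical line in the style of the UTF-16 pdftotext output above:
# single spaces pad every glyph, multiple spaces separate words.
raw = "? . ? ? ? ? ? ? ? ? ? ,   ? . ? ? ? ? ?"

# Treat runs of 2+ spaces as word separators, single spaces as padding.
words = re.split(r" {2,}", raw.strip())
despaced = " ".join(w.replace(" ", "") for w in words)
print(despaced)  # ?.?????????, ?.?????
```

After this step, the remaining character substitutions (the ?! and ?/ artifacts mentioned above) would still need a find-and-replace pass.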
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | iPDFdev |
| Solution 2 | |


