'How to add a hidden text layer in scanned PDF using python?
I got the text from scanned PDF and the location of the text.
It looks something like this.
[
('Lorem', {'_top': 89, '_left': 60, '_right': 91, '_bottom': 101}),
('Ipsum', {'_top': 89, '_left': 97, '_right': 127, '_bottom': 101}),
('Hello', {'_top': 130, '_left': 74, '_right': 104, '_bottom': 138}),
('world!', {'_top': 129, '_left': 109, '_right': 145, '_bottom': 138})
]
Now, I want to add this text to the given coordinates in the same PDF in a hidden text layer using python.
Note: I know there are some services which directly convert scanned PDFs to searchable PDFs like OCRmyPDF. But I want to use something like a custom function to do this task.
Solution 1:[1]
I'd love to find an elegant solution to this problem as well. I would like to convert ALTO or PAGE XML output into a searchable pdf but haven't been able to figure out how to do it either.
One tool you could look at is hoc2pdf (sudo apt install exactimage). It convert hocr into a pdf with hidden text layers. As far as I know, the tool has its issues, so a more configurable solution would be better. Maybe someone here knows of a Python module that can be used?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Alex W. |
