'How to add a hidden text layer in scanned PDF using python?

I got the text from scanned PDF and the location of the text.

It looks something like this.

[
('Lorem', {'_top': 89, '_left': 60, '_right': 91, '_bottom': 101}), 
('Ipsum', {'_top': 89, '_left': 97, '_right': 127, '_bottom': 101}), 
('Hello', {'_top': 130, '_left': 74, '_right': 104, '_bottom': 138}), 
('world!', {'_top': 129, '_left': 109, '_right': 145, '_bottom': 138})
]

Now, I want to add this text to the given coordinates in the same PDF in a hidden text layer using python.

Note: I know there are some services which directly convert scanned PDFs to searchable PDFs like OCRmyPDF. But I want to use something like a custom function to do this task.



Solution 1:[1]

I'd love to find an elegant solution to this problem as well. I would like to convert ALTO or PAGE XML output into a searchable pdf but haven't been able to figure out how to do it either.

One tool you could look at is hoc2pdf (sudo apt install exactimage). It convert hocr into a pdf with hidden text layers. As far as I know, the tool has its issues, so a more configurable solution would be better. Maybe someone here knows of a Python module that can be used?

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Alex W.