'Parse PDF shape data in python

I am trying to put together a script to fix PDFs a large number of PDFs that have been exported from Autocad via their DWG2PDF print driver.

When using this driver all SHX fonts are rendered as shape data instead of text data, they do however have a comment inserted into the PDF at the expected location with the expected text.

So far in my script I have got it to run through the PDF and insert hidden text on top of each section, with the text squashed to the size of the comment, this gets me 90% of the way and gives me a document that is searchable.

Unfortunately the sizing of the comment regions is relatively course (integer based) which makes it difficult to accurately determine the orientation of short text, and results in uneven sized boxes around text.

What I would like to be able to do is parse through the shape data in the PDF, collect anything within the bounds of the comment, and then determine a smaller and more accurate bounding box. However all the information I can find is by people trying to parse through text data, and I haven't been able to find anything at all in terms of shape data.

The below image is an example of the raw text in the PDF, the second image shows the comment bounding box in blue, with the red text being what I am setting to hidden to make the document searchable, and copy/paste able. I can get things a little better by shrinking the box by a fixed margin, but with small text items the low resolution of the comment box coordinate data messes things up.

To get this far I am using a combination of PyPDF2 and reportlab, but am open to moving to different libraries.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source

'Parse PDF shape data in python

Sources

Related Questions