How to extract only certain tables from a PDF (invoice) that contains multiple tables, in a structured format

How can I extract only one table from a PDF that contains multiple tables? I have tried Amazon Textract, but the problem is that it gives me all the tables in the PDF as CSV. I need to extract only certain tables based on conditions such as the text they contain or their bounding box dimensions.
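To make the condition concrete, here is a minimal sketch of the kind of filtering I mean. It assumes the JSON response of a Textract AnalyzeDocument call with the TABLES feature is already loaded as a Python dict, and it relies only on the documented block fields (BlockType, Relationships, RowIndex, ColumnIndex, Geometry.BoundingBox). The function names, the keyword, and the region values are hypothetical and only for illustration.

```python
import csv
import io


def _children(block, blocks_by_id):
    """Yield the blocks referenced by a block's CHILD relationships."""
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                if child_id in blocks_by_id:
                    yield blocks_by_id[child_id]


def table_to_rows(table, blocks_by_id):
    """Rebuild a TABLE block as a list of rows of cell text."""
    cells = {}
    for cell in _children(table, blocks_by_id):
        if cell["BlockType"] != "CELL":
            continue  # skip merged-cell or title blocks, if present
        words = [w["Text"] for w in _children(cell, blocks_by_id)
                 if w["BlockType"] == "WORD"]
        cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    n_rows = max(r for r, _ in cells)
    n_cols = max(c for _, c in cells)
    # Textract row and column indices start at 1.
    return [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]


def select_tables(response, keyword=None, region=None):
    """Return the rows of every table that matches the given conditions.

    keyword: keep tables whose cell text contains it (case-insensitive).
    region:  (left, top, right, bottom) as fractions of the page (0..1);
             keep tables whose bounding box lies fully inside it.
    """
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    selected = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        if region is not None:
            box = block["Geometry"]["BoundingBox"]
            left, top, right, bottom = region
            inside = (box["Left"] >= left
                      and box["Top"] >= top
                      and box["Left"] + box["Width"] <= right
                      and box["Top"] + box["Height"] <= bottom)
            if not inside:
                continue
        rows = table_to_rows(block, blocks_by_id)
        if keyword is not None:
            text = " ".join(cell for row in rows for cell in row).lower()
            if keyword.lower() not in text:
                continue
        selected.append(rows)
    return selected


def rows_to_csv(rows):
    """Serialize one table's rows to a CSV string."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()
```

For example, select_tables(response, keyword="amount", region=(0.0, 0.5, 1.0, 1.0)) would keep only the tables in the lower half of the page that mention "amount". A multi-page response would also need a filter on each block's Page field, which this sketch leaves out.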

Apart from the paid tool, these are the other libraries I have tried:

PyPDF2
Textract
Tika
pdfplumber
pdfminer
pdftotext
PyMuPDF (bounding box technique)
Tabula

The problem arises when I have multiple PDFs: some open-source libraries can read the text of a PDF but do not return it in a structured format, and sometimes they cannot read the text at all because the PDF is a scanned image.

So I decided to use Amazon Textract. Let me know if you have any recommendations for other libraries or paid tools that work better than Amazon Textract.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow