How to extract only certain tables from a PDF (invoice) that contains multiple tables in a structured format

How can I extract only one table from a PDF that contains multiple tables? I have tried Amazon Textract, but it returns every table in the PDF as CSV. I need to extract only certain tables based on conditions such as the text they contain or their bounding-box dimensions.
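To make the kind of filtering I mean concrete, here is a minimal sketch that works on a Textract table-analysis response after the fact. It assumes the documented response shape (a `Blocks` list where `TABLE` blocks point to `CELL` blocks and cells point to `WORD` blocks through `CHILD` relationships, with `Geometry.BoundingBox` given as page ratios). The function names `select_tables` and `table_rows` and the parameters `must_contain` and `region` are hypothetical names of my own, not part of any AWS API.

```python
from typing import Any, Dict, List, Optional, Tuple

Block = Dict[str, Any]


def _children(block: Block, by_id: Dict[str, Block]) -> List[Block]:
    """Return the blocks referenced by this block's CHILD relationships."""
    found: List[Block] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            found.extend(by_id[i] for i in rel["Ids"] if i in by_id)
    return found


def table_rows(table: Block, by_id: Dict[str, Block]) -> List[List[str]]:
    """Rebuild a TABLE block as a list of rows of cell text."""
    cells: Dict[Tuple[int, int], str] = {}
    for cell in _children(table, by_id):
        if cell["BlockType"] != "CELL":
            continue  # skip merged-cell and title blocks
        words = [w["Text"] for w in _children(cell, by_id)
                 if w["BlockType"] == "WORD"]
        cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    n_rows = max(r for r, _ in cells)
    n_cols = max(c for _, c in cells)
    # RowIndex and ColumnIndex are 1-based in the response.
    return [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]


def select_tables(
    response: Dict[str, Any],
    must_contain: Optional[str] = None,
    region: Optional[Tuple[float, float, float, float]] = None,
) -> List[List[List[str]]]:
    """Keep only tables that match the given conditions.

    must_contain: case-insensitive text that must appear in some cell.
    region: (left, top, right, bottom) as page ratios in [0, 1]; the
            table's bounding box must lie fully inside it.
    """
    by_id = {b["Id"]: b for b in response["Blocks"]}
    selected = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        if region is not None:
            box = block["Geometry"]["BoundingBox"]
            left, top, right, bottom = region
            inside = (box["Left"] >= left
                      and box["Top"] >= top
                      and box["Left"] + box["Width"] <= right
                      and box["Top"] + box["Height"] <= bottom)
            if not inside:
                continue
        rows = table_rows(block, by_id)
        if must_contain is not None:
            flat = " ".join(" ".join(row) for row in rows).lower()
            if must_contain.lower() not in flat:
                continue
        selected.append(rows)
    return selected
```

For example, `select_tables(response, must_contain="Total", region=(0.0, 0.5, 1.0, 1.0))` would keep only tables in the lower half of the page that mention "Total". Here `response` is assumed to be the dictionary returned by a Textract table analysis call (such as boto3's `analyze_document` with `FeatureTypes=["TABLES"]`). Multi-page results would also need filtering on each block's page, which this sketch does not do.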

Apart from the paid tool, I have also tried several other libraries:

  1. PyPDF2
  2. Textract
  3. Tika
  4. pdfplumber
  5. pdfminer
  6. pdftotext
  7. PyMuPDF (bounding-box technique)
  8. Tabula

The problem arises when I have multiple PDFs. Some open-source libraries can read the text of a PDF but do not return it in a structured format. Sometimes they cannot read the text at all because the PDF is a scanned, image-only document.

So I decided to use Amazon Textract. Let me know if you have any other recommendations for libraries or paid tools that work better than Amazon Textract.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source