'How to extract table as text from the PDF using Python?

I have a PDF which contains Tables, text and some images. I want to extract the table wherever tables are there in the PDF.

Right now am doing manually to find the Table from the page. From there I am capturing that page and saving into another PDF.

import PyPDF2

PDFfilename = "Sammamish.pdf" #filename of your PDF/directory where your PDF is stored

pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb")) #PdfFileReader object

pg4 = pfr.getPage(126) #extract pg 127

writer = PyPDF2.PdfFileWriter() #create PdfFileWriter object
#add pages
writer.addPage(pg4)

NewPDFfilename = "allTables.pdf" #filename of your PDF/directory where you want your new PDF to be
with open(NewPDFfilename, "wb") as outputStream:
    writer.write(outputStream) #write pages to new PDF

My goal is to extract the table from the whole PDF document.

Solution 1:^[1]

I would suggest you to extract the table using tabula.
Pass your pdf as an argument to the tabula api and it will return you the table in the form of dataframe.
Each table in your pdf is returned as one dataframe.
The table will be returned in a list of dataframea, for working with dataframe you need pandas.

This is my code for extracting pdf.

import pandas as pd
import tabula
file = "filename.pdf"
path = 'enter your directory path here'  + file
df = tabula.read_pdf(path, pages = '1', multiple_tables = True)
print(df)

Please refer to this repo of mine for more details.

Solution 2:^[2]

If your pdf is text-based and not a scanned document (i.e. if you can click and drag to select text in your table in a PDF viewer), then you can use the module camelot-py with

import camelot
tables = camelot.read_pdf('foo.pdf')

You then can choose how you want to save the tables (as csv, json, excel, html, sqlite), and whether the output should be compressed in a ZIP archive.

tables.export('foo.csv', f='csv', compress=False)

Edit: tabula-py appears roughly 6 times faster than camelot-py so that should be used instead.

import camelot
import cProfile
import pstats
import tabula

cmd_tabula = "tabula.read_pdf('table.pdf', pages='1', lattice=True)"
prof_tabula = cProfile.Profile().run(cmd_tabula)
time_tabula = pstats.Stats(prof_tabula).total_tt

cmd_camelot = "camelot.read_pdf('table.pdf', pages='1', flavor='lattice')"
prof_camelot = cProfile.Profile().run(cmd_camelot)
time_camelot = pstats.Stats(prof_camelot).total_tt

print(time_tabula, time_camelot, time_camelot/time_tabula)

gave

1.8495559890000015 11.057014036000016 5.978199147125147

Solution 3:^[3]

Extract table as text from the PDF using Python pdfminer

from pprint import pprint
from io import StringIO
import re
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams
from lxml import html
ID_LEFT_BORDER = 56
ID_RIGHT_BORDER = 156
QTY_LEFT_BORDER = 355
QTY_RIGHT_BORDER = 455
# Read PDF file and convert it to HTML
output = StringIO()
with open('example.pdf', 'rb') as pdf_file:
    extract_text_to_fp(pdf_file, output, laparams=LAParams(), output_type='html', codec=None)
raw_html = output.getvalue()
# Extract all DIV tags
tree = html.fromstring(raw_html)
divs = tree.xpath('.//div')
# Sort and filter DIV tags
filtered_divs = {'ID': [], 'Qty': []}
for div in divs:
    # extract styles from a tag
    div_style = div.get('style')
    # print(div_style)
    # position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:292px; top:1157px; width:27px; height:12px;
# get left position
    try:
        left = re.findall(r'left:([0-9]+)px', div_style)[0]
    except IndexError:
        continue
# div contains ID if div's left position between ID_LEFT_BORDER and ID_RIGHT_BORDER
    if ID_LEFT_BORDER < int(left) < ID_RIGHT_BORDER:
        filtered_divs['ID'].append(div.text_content().strip('\n'))
# div contains Quantity if div's left position between QTY_LEFT_BORDER and QTY_RIGHT_BORDER
    if QTY_LEFT_BORDER < int(left) < QTY_RIGHT_BORDER:
        filtered_divs['Qty'].append(div.text_content().strip('\n'))
# Merge and clear lists with data
data = []
for row in zip(filtered_divs['ID'], filtered_divs['Qty']):
    if 'ID' in row[0]:
        continue
    data_row = {'ID': row[0].split(' ')[0], 'Quantity': row[1]}
    data.append(data_row)
pprint(data)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Trenton McKinney
Solution 2
Solution 3	DS_ShraShetty

'How to extract table as text from the PDF using Python?

Solution 1:[1]

Solution 2:[2]

Solution 3:[3]

Extract table as text from the PDF using Python pdfminer

Sources

Related Questions

Solution 1:^[1]

Solution 2:^[2]

Solution 3:^[3]