'How to extract table as text from the PDF using Python?

I have a PDF which contains Tables, text and some images. I want to extract the table wherever tables are there in the PDF.

Right now am doing manually to find the Table from the page. From there I am capturing that page and saving into another PDF.

import PyPDF2

PDFfilename = "Sammamish.pdf" #filename of your PDF/directory where your PDF is stored

pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb")) #PdfFileReader object

pg4 = pfr.getPage(126) #extract pg 127

writer = PyPDF2.PdfFileWriter() #create PdfFileWriter object
#add pages
writer.addPage(pg4)

NewPDFfilename = "allTables.pdf" #filename of your PDF/directory where you want your new PDF to be
with open(NewPDFfilename, "wb") as outputStream:
    writer.write(outputStream) #write pages to new PDF

My goal is to extract the table from the whole PDF document.

Please have a look at the sample image of a page in PDF



Solution 1:[1]

  • I would suggest you to extract the table using tabula.
  • Pass your pdf as an argument to the tabula api and it will return you the table in the form of dataframe.
  • Each table in your pdf is returned as one dataframe.
  • The table will be returned in a list of dataframea, for working with dataframe you need pandas.

This is my code for extracting pdf.

import pandas as pd
import tabula
file = "filename.pdf"
path = 'enter your directory path here'  + file
df = tabula.read_pdf(path, pages = '1', multiple_tables = True)
print(df)

Please refer to this repo of mine for more details.

Solution 2:[2]

If your pdf is text-based and not a scanned document (i.e. if you can click and drag to select text in your table in a PDF viewer), then you can use the module camelot-py with

import camelot
tables = camelot.read_pdf('foo.pdf')

You then can choose how you want to save the tables (as csv, json, excel, html, sqlite), and whether the output should be compressed in a ZIP archive.

tables.export('foo.csv', f='csv', compress=False)

Edit: tabula-py appears roughly 6 times faster than camelot-py so that should be used instead.

import camelot
import cProfile
import pstats
import tabula

cmd_tabula = "tabula.read_pdf('table.pdf', pages='1', lattice=True)"
prof_tabula = cProfile.Profile().run(cmd_tabula)
time_tabula = pstats.Stats(prof_tabula).total_tt

cmd_camelot = "camelot.read_pdf('table.pdf', pages='1', flavor='lattice')"
prof_camelot = cProfile.Profile().run(cmd_camelot)
time_camelot = pstats.Stats(prof_camelot).total_tt

print(time_tabula, time_camelot, time_camelot/time_tabula)

gave

1.8495559890000015 11.057014036000016 5.978199147125147

Solution 3:[3]

Extract table as text from the PDF using Python pdfminer

from pprint import pprint
from io import StringIO
import re
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams
from lxml import html
ID_LEFT_BORDER = 56
ID_RIGHT_BORDER = 156
QTY_LEFT_BORDER = 355
QTY_RIGHT_BORDER = 455
# Read PDF file and convert it to HTML
output = StringIO()
with open('example.pdf', 'rb') as pdf_file:
    extract_text_to_fp(pdf_file, output, laparams=LAParams(), output_type='html', codec=None)
raw_html = output.getvalue()
# Extract all DIV tags
tree = html.fromstring(raw_html)
divs = tree.xpath('.//div')
# Sort and filter DIV tags
filtered_divs = {'ID': [], 'Qty': []}
for div in divs:
    # extract styles from a tag
    div_style = div.get('style')
    # print(div_style)
    # position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:292px; top:1157px; width:27px; height:12px;
# get left position
    try:
        left = re.findall(r'left:([0-9]+)px', div_style)[0]
    except IndexError:
        continue
# div contains ID if div's left position between ID_LEFT_BORDER and ID_RIGHT_BORDER
    if ID_LEFT_BORDER < int(left) < ID_RIGHT_BORDER:
        filtered_divs['ID'].append(div.text_content().strip('\n'))
# div contains Quantity if div's left position between QTY_LEFT_BORDER and QTY_RIGHT_BORDER
    if QTY_LEFT_BORDER < int(left) < QTY_RIGHT_BORDER:
        filtered_divs['Qty'].append(div.text_content().strip('\n'))
# Merge and clear lists with data
data = []
for row in zip(filtered_divs['ID'], filtered_divs['Qty']):
    if 'ID' in row[0]:
        continue
    data_row = {'ID': row[0].split(' ')[0], 'Quantity': row[1]}
    data.append(data_row)
pprint(data)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Trenton McKinney
Solution 2
Solution 3 DS_ShraShetty