Read all .pdf files in directory; Extract fillable fields to pandas df

I am writing a script that reads a folder of .pdf files and extracts their fillable fields to a pandas df. I had success extracting the fields from one .pdf with the following code:

import numpy as np
import pandas as pd
import PyPDF2
import glob, os

pwd = os.getcwd()

pdfFileObj = open('pdf_filename', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

fields_dict = pdfReader.getFormTextFields()
series = pd.Series(fields_dict).to_frame()
df = pd.DataFrame(pd.Series(fields_dict)).T

I want to build a function that runs this script for all pdfs in the directory. My first idea was to use a function in glob that collects all pdfs. Here is what I have so far:


import numpy as np
import pandas as pd
import PyPDF2
import glob, os

pwd = os.getcwd()

def readfiles():
   os.chdir(pwd)
   pdfs = []
   for file in glob.glob("*.pdf"):
       print(file)
       pdfs.append(file)

pdfFileObj = open(readfiles, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

fields_dict = pdfReader.getFormTextFields()
series = pd.Series(fields_dict).to_frame()
df = pd.DataFrame(pd.Series(fields_dict)).T

Unfortunately, this doesn't work because I cannot pass the function itself to open() and PdfFileReader. Does anyone have suggestions on a better way to do this? Thanks!



Solution 1:[1]

I can't comment (new account), but you could try making your readfiles function return the list pdfs.

Then, in the code below the function, loop over that list:

list_of_pdfs = readfiles()  # assumes readfiles() now ends with `return pdfs`
list_of_dfs = []
for file in list_of_pdfs:
    # open each PDF in binary mode; the with-block closes it afterwards
    with open(file, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        # same steps as your single-file code: one row per PDF
        fields_dict = pdfReader.getFormTextFields()
    df = pd.DataFrame(pd.Series(fields_dict)).T
    list_of_dfs.append(df)



You end up with a list of dataframes, one per pdf file, provided the first part of the code (which builds the dataframe from a single pdf file) works.
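
If a single table is more convenient than a list, the one-row dataframes can be stacked with pd.concat. This is only a minimal sketch, assuming each dataframe holds exactly one row. The helper name combine_frames and the sample data are hypothetical stand-ins, not part of the original code.

```python
import pandas as pd


def combine_frames(frames, names):
    """Stack one-row DataFrames into one table indexed by file name.

    Hypothetical helper: assumes frames[i] has exactly one row and
    belongs to names[i].
    """
    if not frames:
        return pd.DataFrame()
    # Columns missing from a given PDF are filled with NaN by concat.
    combined = pd.concat(frames, ignore_index=True)
    combined.index = pd.Index(names, name="filename")
    return combined


# Stand-in data in place of real PDF form fields.
frames = [
    pd.DataFrame([{"name": "Ada", "city": "London"}]),
    pd.DataFrame([{"name": "Grace", "zip": "10001"}]),
]
table = combine_frames(frames, ["a.pdf", "b.pdf"])
```

Forms with different field names still combine, since pd.concat takes the union of the columns and leaves the gaps empty.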

Additionally, you could make a dictionary like {'filename': file, 'dataframe': df} and append that to your list, so you can later recover the dataframe based on the name of the file. It all depends on what you plan to do with the dataframes later.
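
One variation on that idea is a single dictionary keyed by file name instead of a list of small dictionaries. The sketch below is an assumption about how that could look, not the code from the question. The names fields_by_file and pypdf2_fields are hypothetical, and the extractor is passed in as a parameter so the folder scan can be tried without any real PDFs. PdfFileReader and getFormTextFields are the names used in the question; PyPDF2 3.0 and later renamed them to PdfReader and get_form_text_fields, so adjust to the installed version.

```python
import glob
import os


def fields_by_file(folder, extract_fields):
    """Map each PDF file name in `folder` to whatever `extract_fields` returns.

    Hypothetical helper: `extract_fields` is any callable that takes a
    path and returns a dict of field names to values.
    """
    results = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        results[os.path.basename(path)] = extract_fields(path)
    return results


def pypdf2_fields(path):
    """Extractor using the same PyPDF2 calls as the question (pre-3.0 names)."""
    import PyPDF2  # imported here so fields_by_file works without PyPDF2

    with open(path, "rb") as handle:
        reader = PyPDF2.PdfFileReader(handle)
        # Read the fields while the file is still open.
        return reader.getFormTextFields()
```

Called as fields_by_file(os.getcwd(), pypdf2_fields), this returns a plain dict; pd.DataFrame.from_dict(result, orient="index") then gives one row per file, indexed by file name.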

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Luis Vásquez