How to read multiple PDFs from a folder one by one

I'm trying to extract data from a PDF file and convert it into a pandas DataFrame. I used 'fitz' from the PyMuPDF module to extract the data, and then I convert it into a DataFrame with pandas.

import fitz  # PyMuPDF
from pathlib import Path
# returns all file paths that have .pdf as extension in the specified directory
pdf_search = Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf")
# convert the glob generator output to a list
# skip this if you are comfortable with generators and pathlib
pdf_files = [str(file.absolute()) for file in pdf_search]

#Code for data extraction:

for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        pypdf_text = ""
        for page in doc:
            pypdf_text += page.getText()

The above code only extracts the data for the last PDF in the folder, so I get the result for that PDF only.

I have a folder that contains many PDF documents. My goal is to read each PDF file from the folder one by one, extract its text, and then convert the result into a DataFrame. How can I do that in Python?

Solution 1:^[1]

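Try this:

import PyPDF2

# Note: older PyPDF2 releases spell these calls PdfFileReader, getNumPages(),
# getPage(i) and extractText(); the names below are the current ones.
for k in range(1, 100):
    # open the pdf file (assumes the files are named file1.pdf ... file99.pdf)
    reader = PyPDF2.PdfReader("C:/my_path/file%s.pdf" % k)

    # get number of pages
    num_pages = len(reader.pages)

    # extract the text of each page
    for i in range(num_pages):
        page = reader.pages[i]
        print("this is page " + str(i))
        text = page.extract_text()
        # print(text)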

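Or this:

import io
import os
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

fold = 'F:/technophile/Proj/SOURCE/'
allyourpdf = []

# find the first PDF in the folder
allyourfiles = os.listdir(fold)
firstpdf = ""
for i in allyourfiles:
    if i.lower().endswith('.pdf'):
        firstpdf = i
        break

# set up the pdfminer objects that collect the extracted text
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle)
page_interpreter = PDFPageInterpreter(resource_manager, converter)

with open(fold + firstpdf, 'rb') as fh:

    for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
        page_interpreter.process_page(page)

    text = fake_file_handle.getvalue()
    allyourpdf.append(text)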
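
The pdfminer snippet above stops at the first PDF it finds. As a rough sketch of how the same idea might cover every PDF in the folder, the following uses the `extract_text` helper from `pdfminer.high_level` (part of pdfminer.six, which is assumed to be installed). The function names `pdf_names` and `extract_all` are made up for this illustration and are not part of the original answer.

```python
import os


def pdf_names(folder):
    """Return the names of all .pdf files in `folder`, sorted."""
    # endswith() avoids matching names like "report.pdf.bak",
    # which a plain "'.pdf' in name" check would accept
    return sorted(n for n in os.listdir(folder) if n.lower().endswith(".pdf"))


def extract_all(folder):
    """Return a dict mapping each PDF file name to its extracted text."""
    # imported here so that pdf_names() can be used without pdfminer.six
    from pdfminer.high_level import extract_text

    return {
        name: extract_text(os.path.join(folder, name))
        for name in pdf_names(folder)
    }
```

Calling `extract_all(fold)` with the folder path from the snippet above would then give one text entry per PDF rather than only the first.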

Solution 2:^[2]

You can use the standard-library pathlib module to list all the PDFs in your directory:

from pathlib import Path
# returns all file paths that have .pdf as extension in the specified directory
pdf_search = Path("<path>/<to>/<pdfs>/").glob("*.pdf")
# convert the glob generator output to a list
# skip this if you are comfortable with generators and pathlib
pdf_files = [str(file.absolute()) for file in pdf_search]

Now you can simply run your block of code in a loop to iterate over the PDFs.

For example:

for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        ...
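
To tie this back to the question: the asker's loop sets `pypdf_text = ""` again for every file, so after the loop only the last PDF's text is left. A minimal sketch of one way around that is to keep one record per file and build the DataFrame at the end. The helper names `extract_text`, `collect_records` and `to_dataframe`, and the column names `file` and `text`, are assumptions made for this example. `page.get_text()` is the newer PyMuPDF spelling of `getText()`.

```python
from pathlib import Path


def extract_text(pdf_path):
    """Return the concatenated text of all pages of one PDF (PyMuPDF)."""
    import fitz  # PyMuPDF; imported here so the helpers below work without it

    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)


def collect_records(folder, extractor=extract_text):
    """Build one {'file': ..., 'text': ...} record per PDF in `folder`."""
    records = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # a fresh record per file, so earlier results are never overwritten
        records.append({"file": pdf.name, "text": extractor(str(pdf))})
    return records


def to_dataframe(records):
    """Turn the records into a DataFrame with one row per PDF."""
    import pandas as pd

    return pd.DataFrame(records, columns=["file", "text"])
```

Usage would look something like `df = to_dataframe(collect_records("<path>/<to>/<pdfs>/"))`. The `extractor` parameter is only there so that a different extraction function can be swapped in.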

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2
