How to read multiple PDFs from a folder one by one
I'm trying to extract data from a PDF file and convert it into a pandas DataFrame. I use 'fitz' from the PyMuPDF module to extract the data, and then convert it into a DataFrame with pandas.
import fitz  # PyMuPDF
from pathlib import Path
# returns all file paths that have .pdf as extension in the specified directory
pdf_search = Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf")
# convert the glob generator output to a list
# skip this if you are comfortable with generators and pathlib
pdf_files = [str(file.absolute()) for file in pdf_search]
# Code for data extraction:
for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        pypdf_text = ""
        for page in doc:
            pypdf_text += page.getText()
The above code only extracts the data for the last PDF in the folder, and so gives the result for that PDF only.
However, I have a folder that contains many PDF documents. My goal is to read each PDF file from the folder one by one, extract its text, and then convert it into a DataFrame. How can I do that in Python?
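One likely cause, assuming the loop is indented the way it reads: `pypdf_text = ""` runs once per file, so each PDF's text replaces the previous one and only the last file's text is left after the loop. The toy below uses plain strings in place of PDFs (the `fake_pdfs` data is made up for illustration) to contrast overwriting with keeping one entry per file.

```python
# Stand-in data: file name -> list of page texts (hypothetical, no real PDFs involved).
fake_pdfs = {
    "a.pdf": ["A1 ", "A2"],
    "b.pdf": ["B1"],
}

# Pattern from the question: the variable is reset for every file.
for name, pages in fake_pdfs.items():
    text = ""
    for page in pages:
        text += page
# After the loop, only the last file's text is left.
last_only = text

# Alternative: keep one entry per file instead of a single variable.
texts_by_file = {}
for name, pages in fake_pdfs.items():
    texts_by_file[name] = "".join(pages)

print(last_only)
print(texts_by_file)
```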
Solution 1:[1]
Try this:
import PyPDF2
import re  # for the search step
# note: PyPDF2 3.0 and later replace these names with PdfReader, reader.pages and extract_text()
for k in range(1, 100):
    # open the pdf file (expects file1.pdf ... file99.pdf to exist)
    reader = PyPDF2.PdfFileReader("C:/my_path/file%s.pdf" % k)
    # get number of pages
    num_pages = reader.getNumPages()
    # extract text and do the search
    for i in range(num_pages):
        page_obj = reader.getPage(i)
        print("this is page " + str(i))
        text = page_obj.extractText()
        # print(text)
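Or this:
import io
import os
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
fold = 'F:/technophile/Proj/SOURCE/'
allyourfiles = os.listdir(fold)
allyourpdf = []
# pick the first PDF in the folder
firstpdf = ""
for i in allyourfiles:
    if i.lower().endswith('.pdf'):
        firstpdf = i
        break
# set up the pdfminer objects the snippet relies on
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle)
page_interpreter = PDFPageInterpreter(resource_manager, converter)
with open(os.path.join(fold, firstpdf), 'rb') as fh:
    for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
        page_interpreter.process_page(page)
text = fake_file_handle.getvalue()
allyourpdf.append(text)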
Solution 2:[2]
You can use the standard-library pathlib module to list all the PDFs in your directory:
from pathlib import Path
# returns all file paths that have .pdf as extension in the specified directory
pdf_search = Path("<path>/<to>/<pdfs>/").glob("*.pdf")
# convert the glob generator output to a list
# skip this if you are comfortable with generators and pathlib
pdf_files = [str(file.absolute()) for file in pdf_search]
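As a small sketch of the listing step: `glob("*.pdf")` matches case-sensitively on some platforms, so files named `.PDF` can be missed. The helper below (the name `list_pdfs` is made up here) filters on the lowercased suffix instead and sorts the result so the processing order is stable.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return sorted paths of all files in `folder` whose extension is .pdf (any case)."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

It could then stand in for the list comprehension, for example `pdf_files = [str(p) for p in list_pdfs("<path>/<to>/<pdfs>/")]`.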
Now you can simply run your block of code in a loop to iterate over the PDFs.
For example:
for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        ...
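Filling in the loop body, a minimal sketch might collect one record per PDF and hand the list to pandas afterwards. The function names `extract_text_fitz` and `pdfs_to_records` are invented for this example, and `get_text()` assumes a recent PyMuPDF release (older ones spell it `getText()`, as in the question).

```python
from pathlib import Path


def extract_text_fitz(pdf_path):
    """Return the text of all pages of one PDF (requires PyMuPDF)."""
    import fitz  # imported lazily so the rest of the sketch runs without PyMuPDF

    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)


def pdfs_to_records(folder, extract=extract_text_fitz):
    """Build one {"file", "text"} record per PDF found in `folder`."""
    records = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Each file gets its own record, so no file overwrites another.
        records.append({"file": pdf.name, "text": extract(str(pdf))})
    return records
```

With pandas installed, `pd.DataFrame(pdfs_to_records("<path>/<to>/<pdfs>/"))` should then give one row per PDF with `file` and `text` columns. The `extract` parameter is only there so a different extractor (PyPDF2, pdfminer) can be swapped in.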
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
