Reading PDF files line by line using Python

I used the following code to read a PDF file, but it does not return any text. What could be the reason?

from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")
contents = reader.pages[0].extractText().split("\n")
print(contents)

The output is [u''] instead of the content of the page.

python pypdf2

Solution 1:^[1]

import re
from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")

for page in reader.pages:
    text = page.extractText()
    text_lower = text.lower()
    # Split into lines first; iterating over the string itself yields single characters.
    for line in text_lower.splitlines():
        if re.search("abc", line):
            print(line)

I use this to iterate over a PDF page by page, search each page for key terms, and process the matches further.
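The page-by-page search described above can be kept separate from the PDF library, which makes it easy to try on plain strings. The following is a minimal sketch under that assumption: find_matching_lines and sample_pages are hypothetical names, and each string stands in for the text that extractText() would return for one page.

```python
import re

def find_matching_lines(page_texts, pattern):
    """Yield (page_number, line) for every line that matches pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            if regex.search(line):
                yield page_number, line.strip()

# Stand-in data; in practice each string would come from page.extractText().
sample_pages = ["Intro\nThe ABC method", "Nothing here", "abc again\nend"]
matches = list(find_matching_lines(sample_pages, "abc"))
print(matches)
```

Returning the page number with each line keeps enough context for whatever processing comes next.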

Solution 2:^[2]

Maybe this can help you read the PDF.

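import pyPdf

def getPDFContent(path):
    content = ""
    # Open in binary mode; file() exists only in Python 2, open() works everywhere.
    with open(path, "rb") as p:
        pdf_content = pyPdf.PdfFileReader(p)
        # Read every page instead of assuming the document has exactly 10 pages.
        for i in range(pdf_content.getNumPages()):
            content += pdf_content.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse runs of whitespace.
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content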

Solution 3:^[3]

I think you need to specify the drive letter; it is missing from your path. For example, "D:/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf". I tried this and could read the file without any problem.

Or, if you want to find the file path using the os module, which you did not actually tie to your directory, you can try the following:

from PyPDF2 import PdfFileReader
import os

def find(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

directory = find('106_2015_34-76357.pdf', 'D:/Users/Rahul/Desktop/Dfiles/')

f = open(directory, 'rb')

reader = PdfFileReader(f)

contents = reader.getPage(0).extractText().split('\n')

f.close()

print(contents)

The find function comes from Nadia Alramli's answer to the question "Find a file in python".
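Since this answer attributes the empty result to the path, it may help to confirm the path before handing it to the PDF reader. The sketch below uses only the standard library; check_pdf_path is a hypothetical helper name and is not part of PyPDF2.

```python
from pathlib import Path

def check_pdf_path(path):
    """Return the absolute path if it points to an existing file, else raise."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        # Show the fully resolved path so a missing drive or folder is easy to spot.
        raise FileNotFoundError(f"No such file: {resolved}")
    return resolved
```

The error message prints the resolved absolute path, which shows where Python actually looked for the file.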

Solution 4:^[4]

To read files from multiple folders under a directory, the code below can be used. This example reads PDF files:

import os
from tika import parser

path = "/usr/local/"  # root directory to search
for r, d, f in os.walk(path):  # walk through all subdirectories
    for file in f:
        if file.lower().endswith(".pdf"):  # process only PDF files
            file_join = os.path.join(r, file)  # build the full path
            file_data = parser.from_file(file_join)  # parse the PDF file
            text = file_data['content']  # read the content
            print(text)  # print the content

Solution 5:^[5]

def getTextPDF(pdfFileName, password=''):
    """Extract text from a PDF and return it as a list of sentences."""
    import PyPDF2
    from nltk import sent_tokenize

    text = []
    # The with block closes the file once all pages have been read.
    with open(pdfFileName, 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        if password != '':
            read_pdf.decrypt(password)
        for i in range(read_pdf.getNumPages()):
            text.append(read_pdf.getPage(i).extractText())
    # Drop line breaks so the tokenizer sees continuous text.
    text = ''.join(text).replace("\n", '')
    text = sent_tokenize(text)
    return text

Solution 6:^[6]

The issue was one of two things: (1) The text was not on page one - hence a user error. (2) PyPDF2 failed to extract the text - hence a bug in PyPDF2.

Sadly, the second one still happens for some PDFs.
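One way to tell the two cases apart is to extract every page and see which ones come back empty. This is a minimal sketch, not a PyPDF2 feature: empty_pages is a hypothetical helper, and page_texts is assumed to be a list holding the extracted text of each page in order.

```python
def empty_pages(page_texts):
    """Return the 1-based numbers of pages whose extracted text is empty or whitespace."""
    return [n for n, text in enumerate(page_texts, start=1) if not text.strip()]

# Stand-in data; in practice each string would come from page.extractText().
sample = ["", "  \n", "Actual text on page 3"]
print(empty_pages(sample))
```

If only the first page is reported, the text is probably just on a later page. If every page is reported, extraction itself likely failed for this file.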

Solution 7:^[7]

Hello Rahul Pipalia,

If PyPDF2 is not installed in your Python environment, install it first and then use the module.

Installation steps for Ubuntu (install python-pypdf)

First, open a terminal.
Then type sudo apt-get install python-pypdf

Solution to your problem

Try the code below:

# Import the library
import PyPDF2

# Name of the file to read, including the ".pdf" extension; open it in binary mode
pdf_file = open('Your_Pdf_File_Name.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()

# Index of the page to read. Indexing is zero-based, so 0 is the first page
# and the largest valid value is number_of_pages - 1.
page_number = 0
page = read_pdf.getPage(page_number)

page_content = page.extractText()

# Display the content of the page
print(page_content)

pdf_file.close()

Download the PDF from the link below and try this code: https://www.dropbox.com/s/4qad66r2361hvmu/sample.pdf?dl=1

I hope my answer is helpful.
If you have any questions, please leave a comment.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Martin Thoma
Solution 2	Tejas Thakar
Solution 3	Ahaha
Solution 4
Solution 5	thrinadhn
Solution 6	Martin Thoma
Solution 7
