PyPDF2 extractText() extracts text but tokenizes strings when PDFs contain a watermark

I am using PyPDF2 to read multiple files and extract the page numbers that contain specific text. For the most part it works fine, but I notice that some files with a watermark are not read properly. No page numbers are returned even though the pages meet the criteria, because the PyPDF2 extractText() method tokenizes the strings instead of keeping them as they are in the PDF. For example, I am looking for the string "tax return 1065" in the PDF and extractText() returns "tax" "return" "1065", so it is treated as unmatched and the page number is not returned. My code is:

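import re
import PyPDF2

# filepath is the path to one of the PDFs being read
reader = PyPDF2.PdfFileReader(filepath)
NumPages = reader.getNumPages()
String = 'tax return 1065'
Pagelist = []
for i in range(NumPages):
    PageObj = reader.getPage(i)
    Text = PageObj.extractText()
    # Literal search: matches only if the words are separated by single spaces
    ReSearch = re.search(String, Text)
    if ReSearch is not None:
        Pagelist.append(i)  # zero-based page index
print(Pagelist)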

It returns an empty list even though the PDF has a few pages that contain the string "tax return 1065". When I use print(Text) to inspect the extracted text, I see that all the words are tokenized onto separate lines:

  • dear
  • client
  • please
  • review
  • your
  • tax
  • return
  • 1065

I only run into this issue when the PDF has a watermark. Does anyone have a solution for it? Thank you!
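
One thing that might work, assuming the words come back in the right order and are separated only by whitespace (spaces, tabs, or newlines), is to match the phrase with a whitespace-tolerant pattern instead of a literal string. The sketch below is not PyPDF2-specific: the helper names build_phrase_pattern and pages_containing are made up for illustration, and sample_pages stands in for the strings that extractText() would return for each page.

```python
import re

def build_phrase_pattern(phrase):
    # Escape each word, then allow any run of whitespace
    # (spaces, tabs, newlines) between consecutive words.
    words = phrase.split()
    return re.compile(r"\s+".join(re.escape(w) for w in words))

def pages_containing(page_texts, phrase):
    # page_texts: one extracted string per page (hypothetical input,
    # e.g. collected from extractText() in the loop above).
    pattern = build_phrase_pattern(phrase)
    return [i for i, text in enumerate(page_texts) if pattern.search(text)]

# Stand-in data that mimics the tokenized output shown above.
sample_pages = [
    "dear\nclient\nplease\nreview\nyour\ntax\nreturn\n1065\n",
    "nothing relevant on this page",
    "tax return 1065 attached",
]
print(pages_containing(sample_pages, "tax return 1065"))  # prints [0, 2]
```

This would not help if the watermark text ends up interleaved between the words themselves, since the pattern only tolerates whitespace between them.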



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
