How can I classify the chapters of a PDF file and analyze the content per chapter?

I want to classify and analyze the chapters and subchapters of a book in PDF format: count the number of words, and determine how often each word occurs and in which chapter.

pip install PyPDF2

import PyPDF2
from PyPDF2 import PdfFileReader

# Creating a pdf file object
pdf = open('C:/Users/Dominik/Desktop/bsc/pdf1.pdf',"rb")
# creating pdf reader object
pdf_reader = PyPDF2.PdfFileReader(pdf)
# checking number of pages in a pdf file
print(pdf_reader.numPages)
print(pdf_reader.getDocumentInfo())
# creating a page object
page = pdf_reader.getPage(0)
# finally extracting text from the page
print(page.extractText())
# Extracting entire PDF
for i in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(i)
    # 1-based page number for display
    page_number = str(1 + pdf_reader.getPageNumber(page))
    print(page_number)
    page_content = page.extractText()
    print(page_content)
# closing the pdf file
pdf.close()

This code already works. Now I want to do further analysis, such as:

  1. Store each chapter in its own variable and count the number of words. In the end, everything should be stored in an Excel file.


Solution 1:[1]

I tried something similar with CVs in PDF format. Here is what I learned:

PDF is an unstructured format, so it is not possible to extract information from every PDF in a structured way. But if you know the structure of the book, you can identify the chapter titles by a distinguishing feature, for example if they are set in bold or italic. This link can help you extract that information. You can then read through a chapter until you hit the next chapter title.
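
As a rough illustration of the "read until the next chapter title" idea, here is a minimal sketch. It assumes the whole book has already been extracted into one string (for example by joining the page texts from the question's loop) and that chapter titles can be recognised by a text pattern such as "Chapter 1 ..." at the start of a line. That pattern is an assumption for the example, not something the PDF guarantees. If your titles can only be told apart by font (bold or italic), you would need a different detection step. The names HEADING, split_into_chapters, word_counts and write_counts are made up for this sketch.

```python
import csv
import re
from collections import Counter

# Assumption: a chapter title is a line starting with "Chapter <number>".
# Adapt this pattern to the headings actually used in your book.
HEADING = re.compile(r"^[ \t]*(Chapter[ \t]+\d+.*)$", re.IGNORECASE | re.MULTILINE)


def split_into_chapters(text):
    """Return {chapter title: chapter text}, cutting the text at each heading."""
    matches = list(HEADING.finditer(text))
    chapters = {}
    for i, match in enumerate(matches):
        start = match.end()
        # A chapter ends where the next heading starts (or at the end of the text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters[match.group(1).strip()] = text[start:end]
    return chapters


def word_counts(chapter_text):
    """Count lower-cased words (runs of letters) in one chapter."""
    return Counter(re.findall(r"[^\W\d_]+", chapter_text.lower()))


def write_counts(chapters, path):
    """Write one row per (chapter, word) pair to a CSV file that Excel can open."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["chapter", "word", "count"])
        for title, body in chapters.items():
            for word, count in word_counts(body).most_common():
                writer.writerow([title, word, count])


# Tiny made-up sample standing in for the extracted book text.
sample = "Chapter 1 Intro\nthe cat sat\nChapter 2 More\nthe dog and the cat\n"
chapters = split_into_chapters(sample)
for title, body in chapters.items():
    counts = word_counts(body)
    print(title, "total words:", sum(counts.values()), "top:", counts.most_common(2))
```

Calling write_counts(chapters, "counts.csv") gives a file Excel can open directly. If you need a real .xlsx file, the same rows could be put into a pandas DataFrame and saved with DataFrame.to_excel, which requires an Excel writer package such as openpyxl to be installed. Text that appears before the first detected heading is ignored by this sketch, and two chapters with identical titles would overwrite each other in the dictionary.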

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1