Detecting Paragraphs and Titles in PDF

I am very new to AI programming and am trying to learn and experiment at the same time. I find the Stack Overflow community a lot more helpful and informative than YouTube videos, so I thought I'd ask for help here.

The program I'm currently trying to make extracts information from academic research PDFs. I am using EasyOCR to detect and read text from the PDF. I learned the code I am using from a YouTube video by AIEngineering. I was able to detect text in my PDF successfully. However, I don't know how to select the information inside specific bounding boxes and transfer it into a file.

from pdf2image import convert_from_path
import easyocr
import numpy as np
import PIL
from PIL import ImageDraw
import spacy

reader = easyocr.Reader(['en'])

images = convert_from_path('/content/Testpdf1.pdf')

from IPython.display import display
from PIL import Image
display(images[0])


bounds = reader.readtext(np.array(images[0]), paragraph=True)
bounds

def draw_boxes(image, bounds, color='red', width=2):
    # Draw a closed outline around each detected region
    draw = ImageDraw.Draw(image)
    for bound in bounds:
        p0, p1, p2, p3 = bound[0]
        draw.line([*p0, *p1, *p2, *p3, *p0], fill=color, width=width)
    return image

draw_boxes(images[0], bounds)


bounds[4][1]

This is what the output of the code looks like:

This is a test PDF, but most of my other PDFs follow the same layout. Some of the others have proper section headings, such as Abstract and Results, before each section starts. As you can see, it doesn't detect paragraphs very well and combines all of them together.
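One way to get finer paragraph boundaries might be to call readtext with paragraph=False, so that each result is a single line of the form (box, text, confidence), and then group the lines by the vertical gap between them. The sketch below assumes a single-column page with one box per text line. The function name group_lines_into_paragraphs, the gap_factor threshold and the sample coordinates are made up for illustration and would need tuning on real pages.

```python
def group_lines_into_paragraphs(lines, gap_factor=0.6):
    """Group line-level OCR results into paragraphs by vertical spacing.

    `lines` is a list of (box, text, ...) items, where box is four
    [x, y] corner points. A new paragraph starts when the gap to the
    previous line exceeds gap_factor times the current line height.
    """
    def top(box):
        return min(point[1] for point in box)

    def bottom(box):
        return max(point[1] for point in box)

    # Read the page top to bottom (assumes a single column).
    ordered = sorted(lines, key=lambda item: top(item[0]))

    paragraphs = []
    current = []
    previous_bottom = None
    for item in ordered:
        box, text = item[0], item[1]
        height = bottom(box) - top(box)
        gap = 0 if previous_bottom is None else top(box) - previous_bottom
        if current and gap > gap_factor * height:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
        previous_bottom = bottom(box)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Hypothetical line-level results, shaped like readtext(..., paragraph=False)
sample_lines = [
    ([[50, 40], [550, 40], [550, 70], [50, 70]], "A Sample Paper Title", 0.99),
    ([[50, 120], [400, 120], [400, 140], [50, 140]], "Abstract line one", 0.95),
    ([[50, 144], [400, 144], [400, 164], [50, 164]], "abstract line two", 0.95),
    ([[50, 200], [400, 200], [400, 220], [50, 220]], "Introduction text", 0.90),
]
paragraphs = group_lines_into_paragraphs(sample_lines)
```

With the sample data, the two abstract lines end up in one paragraph, while the title and the introduction line stay separate. Multi-column layouts would need an extra step that splits the lines by column first.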

Taking the current example, the abstract is all in bold letters in the image, and that's the only part I want in its bounding box. How do I do that? The same goes for the title: it won't omit the authors or put them in a different bounding box, and I would like to extract the authors separately. I tried playing around with the bounding box settings in EasyOCR, but most of the time that just makes it worse. I found this to be a neat alternative to EasyOCR; however, it's in Java, not Python.
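EasyOCR results carry only the box, the text and a confidence score, not the font weight, so bold text cannot be read off the result directly. A rough heuristic for separating the title from the authors could be to compare line heights, on the assumption that the title is set in the largest type on the first page. The sketch below works on line-level results (paragraph=False). The name split_title_from_rest, the tolerance value and the sample data are hypothetical.

```python
def split_title_from_rest(lines, tolerance=0.8):
    """Split line-level OCR results into (title, other_lines).

    Lines whose box height is at least `tolerance` times the tallest
    line are treated as the title. This is a heuristic and assumes the
    title uses the largest type on the page.
    """
    def height(box):
        ys = [point[1] for point in box]
        return max(ys) - min(ys)

    if not lines:
        return "", []

    tallest = max(height(item[0]) for item in lines)
    threshold = tolerance * tallest

    title_parts = [item[1] for item in lines if height(item[0]) >= threshold]
    other_lines = [item[1] for item in lines if height(item[0]) < threshold]
    return " ".join(title_parts), other_lines


# Hypothetical line-level results for the top of a first page
sample_page = [
    ([[50, 40], [550, 40], [550, 70], [50, 70]], "Detecting Paragraphs", 0.99),
    ([[50, 74], [550, 74], [550, 104], [50, 104]], "and Titles", 0.98),
    ([[50, 120], [400, 120], [400, 138], [50, 138]], "First Author, Second Author", 0.95),
    ([[50, 160], [500, 160], [500, 176], [50, 176]], "Abstract text starts here", 0.93),
]
title, other_lines = split_title_from_rest(sample_page)
```

Here the first entry of other_lines would be the author line, but only because of how the sample is laid out. Real pages may have large headers or logos that break the assumption.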

My main goal here is to detect the information I need and extract it to a JSON file.

In the future I want to add an ML model, as I want to extract more than just the title and abstract. I am still learning more about it every day! If you can point me to some resources that would help me do that, or help me learn more about it, that would be amazing!

Thank you for all your help!
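As a minimal sketch of that last step, each entry in bounds can be turned into a plain dictionary and written out with the standard json module. It assumes every entry has the box at index 0 and the text at index 1, which is what the bounds[4][1] lookup above relies on. The helper names, the output filename and the sample data are made up for illustration.

```python
import json


def bounds_to_records(bounds):
    """Convert OCR results into JSON-friendly dictionaries.

    Each entry is expected to hold the four corner points at index 0
    and the recognized text at index 1.
    """
    records = []
    for index, bound in enumerate(bounds):
        box, text = bound[0], bound[1]
        # int() also converts NumPy integers, which json cannot serialize.
        xs = [int(point[0]) for point in box]
        ys = [int(point[1]) for point in box]
        records.append({
            "index": index,
            "left": min(xs),
            "top": min(ys),
            "right": max(xs),
            "bottom": max(ys),
            "text": text,
        })
    return records


def save_records(records, path):
    """Write the records to a UTF-8 encoded JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)


# Hypothetical data shaped like readtext(..., paragraph=True) output
sample_bounds = [
    [[[50, 40], [550, 40], [550, 90], [50, 90]], "A Sample Paper Title"],
    [[[50, 100], [400, 100], [400, 130], [50, 130]], "First Author, Second Author"],
]
records = bounds_to_records(sample_bounds)
```

Calling save_records(records, "page1.json") would then write the file. Selecting a specific region comes down to picking the entry by index, for example records[0]["text"], or filtering on the stored coordinates.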

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
