Text Extraction from PDFs of very large pixel dimensions

I recently started working with image recognition and analysis, and I came across a PDF file from which I want to extract tabular data. I have tried several methods, including plain OCR, pdfplumber, camelot, and the EAST model, but none of them extracts the text accurately. I suspect the cause is the very large pixel dimensions of the scanned document: the bounding boxes do not capture the data regions correctly.

Can someone suggest a reliable way to solve this problem? The PDF is available at this Drive link:

https://drive.google.com/file/d/1E-9emxVQlaumpNR-lcFl2e9Vp4XqoZeb/view?usp=sharing

Any help would be appreciated. Thank you!

Example attempts:

1) Convert PDF to Image

from pdf2image import convert_from_path

pdfs = r"C:\Users\IM-LP-1672\Desktop\PDF-Excel Validator\140-LS140576-004.pdf"
# Render every page at 350 DPI
pages = convert_from_path(pdfs, dpi=350, poppler_path=r'C:\Program Files\poppler-0.68.0\bin')

# Save each page as Page_1.jpg, Page_2.jpg, ...
for i, page in enumerate(pages, start=1):
    image_name = f"Page_{i}.jpg"
    page.save(image_name, "JPEG")
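
Since the suspected cause is the pixel size of the rendered page, it may help to work out that size before rendering. The sketch below is only an illustration: it assumes the page size is known in PDF points (1 point = 1/72 inch), and the 34 x 22 inch sheet and the 8000 px limit are hypothetical values, not measurements taken from the linked file.

```python
import math

POINTS_PER_INCH = 72  # PDF page sizes are expressed in points


def rendered_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return (round(width_pt / POINTS_PER_INCH * dpi),
            round(height_pt / POINTS_PER_INCH * dpi))


def dpi_for_max_side(width_pt, height_pt, max_side_px, preferred_dpi=350):
    """Largest DPI, capped at preferred_dpi, whose longer side fits in max_side_px."""
    longest_in = max(width_pt, height_pt) / POINTS_PER_INCH
    return min(preferred_dpi, math.floor(max_side_px / longest_in))


# Hypothetical drawing sheet of 34 x 22 inches (an assumption for illustration)
width_pt, height_pt = 34 * 72, 22 * 72
print(rendered_size(width_pt, height_pt, 350))
print(dpi_for_max_side(width_pt, height_pt, max_side_px=8000))
```

The resulting value could then be passed as the dpi argument of convert_from_path in place of 350. Lowering the DPI trades resolution for size, so small text may become harder to read.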

2) Marking Regions of Image for Information Extraction

import cv2
import matplotlib.pyplot as plt

def rescale_frame(frame, scale=0.10):
    # Shrink the image to `scale` times its original width and height
    width = int(frame.shape[1] * scale)
    height = int(frame.shape[0] * scale)
    dimensions = (width, height)

    return cv2.resize(frame, dimensions, interpolation=cv2.INTER_AREA)


def mark_region(image_path):
    """Draw boxes around text-like regions; return the image and the box corners."""
    image = cv2.imread(image_path)

    # Regions narrower or shorter than this many pixels are ignored
    THRESHOLD_REGION_IGNORE = 40

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (9, 9), 0)
    thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 11, 30)

    # Dilate to combine adjacent text contours
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 9))
    dilate = cv2.dilate(thresh, kernel, iterations=4)

    # Find contours, highlight text areas, and extract ROIs
    cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # findContours returns two values in OpenCV 4.x and three in OpenCV 3.x
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]

    line_items_coordinates = []
    for c in cnts:
        x, y, w, h = cv2.boundingRect(c)

        if w < THRESHOLD_REGION_IGNORE or h < THRESHOLD_REGION_IGNORE:
            continue

        image = cv2.rectangle(image, (x, y), (x + w, y + h), color=(255, 0, 255), thickness=3)
        line_items_coordinates.append([(x, y), (x + w, y + h)])

    # Manually specified region, left disabled:
    # image = cv2.rectangle(image, (5000,7100), (11814, 7703), color=(255,0,255), thickness=3)
    # line_items_coordinates.append([(5000,7100), (11814, 7703)])

    return image, line_items_coordinates
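
One idea that might be worth trying on a very large page is to process it in overlapping tiles, so that the fixed 9x9 kernels and the 40 px threshold above operate on pieces of a more conventional size. The following is a minimal sketch under that assumption, not a tested solution for this drawing; tile_boxes, to_page_coords, and the tile and overlap sizes are hypothetical names and values introduced here for illustration.

```python
def tile_boxes(width, height, tile=2000, overlap=200):
    """Yield (x0, y0, x1, y1) boxes that cover a width x height image.

    Neighbouring tiles share `overlap` pixels, so text cut by one tile
    edge appears whole in the adjacent tile.
    """
    if overlap >= tile:
        raise ValueError("overlap must be smaller than tile")
    step = tile - overlap
    for y0 in range(0, max(height - overlap, 1), step):
        for x0 in range(0, max(width - overlap, 1), step):
            yield x0, y0, min(x0 + tile, width), min(y0 + tile, height)


def to_page_coords(box, tile_origin):
    """Shift a tile-local ((x0, y0), (x1, y1)) box back into page coordinates."""
    (x0, y0), (x1, y1) = box
    ox, oy = tile_origin
    return [(x0 + ox, y0 + oy), (x1 + ox, y1 + oy)]


# Example with a hypothetical 11900 x 7700 px page
boxes = list(tile_boxes(11900, 7700))
print(len(boxes), boxes[0], boxes[-1])
```

Each tile can be cropped from the page with image[y0:y1, x0:x1] and passed through the same thresholding and contour steps. Regions found in the overlap may be reported twice, so some deduplication would still be needed.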

3) Applying OCR to the Image

import pytesseract
# The executable path must be set on the inner pytesseract module
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# detect the regions with the function from step 2
_, line_items_coordinates = mark_region('Page_1.jpg')

# load the original image
image = cv2.imread('Page_1.jpg')

# get co-ordinates to crop the image
c = line_items_coordinates[2]

# cropping image img = image[y0:y1, x0:x1]
img = image[c[0][1]:c[1][1], c[0][0]:c[1][0]]

# OpenCV stores images as BGR; convert to RGB for matplotlib
plt.figure(figsize=(10, 10))
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
plt.savefig("cropped_image")

# convert the image to black and white for better OCR
_, thresh1 = cv2.threshold(img, 120, 255, cv2.THRESH_BINARY)

# pytesseract image to string to get results
text = pytesseract.image_to_string(thresh1, config='--psm 6')
print(text)
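
Because the goal is tabular data, word-level boxes may be more useful than one block of text. pytesseract.image_to_data with output_type=pytesseract.Output.DICT returns parallel lists such as text, left, top, and height for each detected word. The sketch below assumes those values have already been collected into (text, left, top, height) tuples; group_words_into_rows and its tolerance are hypothetical, and the sample words are made up for illustration.

```python
def group_words_into_rows(words, tolerance=0.6):
    """Group word boxes into text rows by vertical position.

    `words` is an iterable of (text, left, top, height) tuples. A word joins
    the current row when its vertical centre lies within `tolerance` times its
    height of the row's running centre; otherwise it starts a new row.
    """
    rows = []  # each entry: [row_centre, [(left, text), ...]]
    for text, left, top, height in sorted(words, key=lambda w: w[2] + w[3] / 2):
        centre = top + height / 2
        if rows and abs(centre - rows[-1][0]) <= tolerance * height:
            row = rows[-1]
            row[1].append((left, text))
            # A running mean keeps the row centre stable on slightly skewed scans
            row[0] += (centre - row[0]) / len(row[1])
        else:
            rows.append([centre, [(left, text)]])
    # Within each row, order the words from left to right
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Made-up word boxes resembling a header row and one data row
sample = [
    ("QTY", 900, 102, 30), ("ID", 100, 100, 30), ("SIZE", 700, 101, 30),
    ("1", 100, 160, 30), ("12", 700, 161, 30), ("PIPE,", 200, 159, 30),
]
for row in group_words_into_rows(sample):
    print(row)
```

Splitting each row into columns would still require the column boundaries, for example from the header positions. This sketch covers only the row grouping.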

Output image and text:

N SHOP MATERIALS
ID MATERIAL DESCRIPTION COMMODITY CODE SIZE (in) QTY
PIPE
1 PIPE, CS A106-B, SMLS, BE, STD, B36.10M PPPC1HSMBESD00000501 12 18-7"
FITTINGS
CONT. ON 2 IRBR SW, CS A105, CL3000, SW, HDR/BRN, SP-97 PLSC1AK3000000F00F03 12 x 3/4 {
y 3 90LR ELBOW, CS A234-WPB, SMLS, BE, STD, B16.9 P9LC34SMBESD00000U01 12 2
1'-0 3/16" A 140-LS140576-KAAA1-12"-Ic-005
ot OA) t40-L8140576-008 FLANGES
“i 47 E 538'-9" ——
, bs S 1152'-9" 4  WNFLG, CS A105N, CL150, RF, STD, B16.5 PFWC1B15RFSD00000101 12 2
STEMS Lwyye” EL +48'-4 7/16"
Sa son VALVES / IN-LINE ITEMS
12°x3/4"NB 5 GATE VALVE, CS A105, CL800, MSW/FNPT, B16.11, MFR DIMS, EXT +PVGC1A80PF00BQ02CB03 3/4 1
ZO E 537-10 1/2" BODY, API 602, OS&Y, BB, STD PORT, SLD WDG, API TRIM 8, FG
SZ S 4459.9" PACKING, HW , V01GEPJ02104
fo EL +49'-4 5/8"
DD : FIELD MATERIALS
DZ fo ID MATERIAL DESCRIPTION COMMODITY CODE SIZE (in) QTY
RM bk? FITTINGS
; ol > go “ 6 ROUND HEAD PLUG, CS A105, MNPT, B16.11 PP7C1AMT00B200000T01 3/4 1
EL +48'-4 7/16" fo PIPE SPOOLS
Cc > PM-140-L$140576-004-1
NX
~ ‘ ~ ~ 7. ’
N . - 40m
we CONT. ON
>. XN. - 140-LS140576-KAAA1-12"-Ic-003
“Ss Ne _ 140-LS140576-003
we Al E 538'-9"
oS gp OS \a S 1166'-9"
we ge NK EL +48'-4 7/16"
we ge (PM-140-L$140576-004-1]
al \e 533-1"
S 1166'-9"
EL +48'-4 7/16"
PROJECT NUMBER CONSTRUCTION WORK PACKAGE CONSTRUCTION PHASE
DESIGN CONDITIONS = = RODEO RENEWED PROVECT PHILLIPS 68
ee ee ee es es 1, ALL BOLT HOLES TO STRADDLE NORTH-SOUTH AND/OR REFERENCES P&ID: 0140-YD-001-014 (8) | uneciass: KAA (3) Worley PHILLIPS 66 SAN FRANCISCO REFINERY COMPANY
PT td a FRI ERY DmeNStons ON HELD FABRICATED PIPING | rains: System 4 (13) | vesionpress: (Q )145 psig | service: = LS(4 ) |i RODEO, CA, USA San Francisco Refinery
a Po Farnese 4)NO | oesiontenp: 502 0) | nurs lc ~(5) | WASNO. | _UNTNO CINE NUNBER PIPING ISOMETRIG DRAWING NUMBER
7 N77 —_”
PO td a ai eerste ne WELOINo, PRENEAT, AND PHUHT teste: = HYDRO (15) | OPERPRESS: (44) 29 = —PSIG_| INSULTHICK: 20mmPyro (6) | 140 140 | $140576-KAAA1-12"-| o(1) 140-LS$140576-004
— — — ? 140-LS140576-KAAA1-12"-Ic-004 P-140/LS 140576

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
