Problem extracting a table from a PDF on a web page with tabula (Web Scraping in Python)

When I extract a table from a page, the extraction runs without errors, but the data is out of order. For example, data from one column appears as the header of another column. How can I fix this? My code:

from tabula import read_pdf

url = 'https://becas.osinergmin.gob.pe/seccion/centro_documental/hidrocarburos/SCOP/SCOP-DOCS/2022/01-Demanda-Nacional-Combustibles-Liquidos-Enero-2022.pdf'

# In recent tabula-py versions, read_pdf returns a list of DataFrames (one per detected table)
df = read_pdf(url, pages=1)
df

Thanks in advance.



Solution 1:[1]

I found the solution: use the Tabula program to find the coordinates. Open the PDF in the program (https://tabula.technology/), select the table, and download the JSON file to see the coordinates of the selection. Then pass them to the "area" argument of the read_pdf function in this order: top (y1), left (x1), bottom (y2), right (x2).
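As a rough sketch of that step, the snippet below reorders the coordinates from a Tabula JSON export into the order read_pdf expects. It assumes the export is a list of objects with "x1", "y1", "x2" and "y2" keys; check your own file, since the layout may differ. The helper names area_from_tabula_json and read_table and the example numbers are made up for illustration. Only read_pdf and its pages, area and guess arguments come from tabula-py.

```python
import json


def area_from_tabula_json(json_text, index=0):
    """Return [top, left, bottom, right] for one selection in a Tabula JSON export.

    Assumes the export is a list of objects with x1, y1, x2, y2 keys.
    """
    selections = json.loads(json_text)
    sel = selections[index]
    # read_pdf expects the order top (y1), left (x1), bottom (y2), right (x2)
    return [sel["y1"], sel["x1"], sel["y2"], sel["x2"]]


def read_table(pdf, area, pages=1):
    """Read the tables inside `area` from a local path or URL."""
    # Imported here so the helper above works without tabula-py and Java installed
    from tabula import read_pdf

    # guess=False so the explicit area is used instead of auto-detection
    return read_pdf(pdf, pages=pages, area=area, guess=False)


# Hypothetical coordinates, not taken from the PDF in the question
example_json = '[{"page": 1, "x1": 30.0, "y1": 120.5, "x2": 560.0, "y2": 700.25}]'
area = area_from_tabula_json(example_json)
print(area)  # [120.5, 30.0, 700.25, 560.0]
```

With real coordinates, read_table(url, area) should return a list of DataFrames limited to the selected region.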

I then created a loop that reads all the PDFs with the same coordinates, and it works well.
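A loop like that could look roughly like the sketch below. It is not my exact code: the function name extract_tables and the optional reader parameter are hypothetical, and reader exists only so the loop can be tried without tabula-py and Java. It also assumes every PDF has the table in the same position on the same page.

```python
def extract_tables(pdf_sources, area, pages=1, reader=None):
    """Read the same page region from several PDFs.

    pdf_sources: iterable of local paths or URLs.
    area: [top, left, bottom, right], as exported from Tabula.
    reader: optional stand-in for tabula's read_pdf (hypothetical hook for testing).
    """
    if reader is None:
        # Imported lazily; requires tabula-py and a Java runtime
        from tabula import read_pdf
        reader = read_pdf

    results = {}
    for source in pdf_sources:
        try:
            tables = reader(source, pages=pages, area=area, guess=False)
        except Exception as exc:
            # One unreadable file should not stop the whole batch
            print(f"Skipping {source}: {exc}")
            continue
        # Keep the first table found in the region, or None if nothing was detected
        results[source] = tables[0] if tables else None
    return results
```

Calling extract_tables(list_of_urls, area) returns a dictionary that maps each source to its first extracted DataFrame. If a layout changes between files, the coordinates for that file need to be exported again from Tabula.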

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 ABNER FRANCISCO CASALLO TRAUCO