Problem extracting a table from a PDF on a web page with tabula (Web Scraping in Python)
When I extract a table from a page, the extraction runs without errors, but the data is out of order. For example, data from one column appears as the header of another column. How can I fix this? My code:
from tabula import read_pdf
url = 'https://becas.osinergmin.gob.pe/seccion/centro_documental/hidrocarburos/SCOP/SCOP-DOCS/2022/01-Demanda-Nacional-Combustibles-Liquidos-Enero-2022.pdf'
df = read_pdf(url, pages=1)
df
Thanks in advance.
Solution 1:[1]
I found the solution: use the Tabula program to find the coordinates of the table. Get the program from https://tabula.technology/, select the table in it, and download the JSON file to see the coordinates. Then pass them to the "area" argument of the read_pdf function in this order: top (y1), left (x1), bottom (y2), right (x2).
I have now written a loop that applies the same coordinates to all the PDFs, and it is working well.
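A minimal sketch of the approach described above, not the exact code from the answer. It assumes the JSON file exported by Tabula is a list of selections, each with "y1", "x1", "y2" and "x2" keys (the names used in the answer). The helper names area_from_tabula_json and read_tables are hypothetical, and the coordinates in example_json are placeholders rather than values measured from the real PDF.

```python
import json


def area_from_tabula_json(json_text, index=0):
    """Return [top, left, bottom, right] for one selection of a Tabula JSON export.

    Assumption: each selection is an object with y1, x1, y2 and x2 keys.
    """
    selection = json.loads(json_text)[index]
    return [selection["y1"], selection["x1"], selection["y2"], selection["x2"]]


def read_tables(pdf_sources, area, pages=1):
    """Read the same page region from every PDF path or URL.

    Returns a dict mapping each source to whatever read_pdf returns for it
    (a list of DataFrames in recent tabula-py versions).
    """
    # Imported here so the helper above works without tabula-py and Java.
    from tabula import read_pdf

    return {src: read_pdf(src, pages=pages, area=area) for src in pdf_sources}


# Placeholder coordinates for illustration only.
example_json = '[{"page": 1, "y1": 100.0, "x1": 30.0, "y2": 500.0, "x2": 560.0}]'
area = area_from_tabula_json(example_json)
print(area)
```

With real coordinates, something like read_tables([url], area) would then apply the same region to each PDF in the list, which is useful when all the files share the same layout.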
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ABNER FRANCISCO CASALLO TRAUCO |
