Scraping multi-page PDF files from a URL
I want to scrape the information in this PDF using Python. I'm not sure where to start because it doesn't look organized at all. I'm used to scraping HTML. I tried converting the PDF to HTML, but that didn't really help.
How would you try to scrape this PDF? Here is a link to the PDFs (any will work, they're all similar): https://portal.charitycommissioner.je/Public-Register/ https://www.gov.im/media/1371147/publicindex_latest-15121-v2.pdf
Thank you for any help :D
Solution 1:[1]
It is organized: the data is laid out as a "table", and pdfplumber works well for this.
Once you have table settings that correctly match your data, you can call .extract_table() on the page:
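import pdfplumber
import pandas as pd

# Open the PDF and close it automatically when done
with pdfplumber.open('file.pdf') as pdf:
    page = pdf.pages[0]  # first page only
    # Infer column edges from text alignment instead of ruled lines.
    # Newer pdfplumber releases may expect text_keep_blank_chars here.
    table = page.extract_table(
        dict(vertical_strategy="text", keep_blank_chars=True)
    )

# Each extracted row becomes one DataFrame row
df = pd.DataFrame(table)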
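The snippet above reads only the first page of a local file, while the question asks about multi-page PDFs fetched from a URL. The following is a minimal sketch of one way to extend it, not a definitive implementation. The helper names (download_pdf, extract_page_tables, merge_page_tables) are hypothetical, and the header handling assumes that the first extracted row is the header and that later pages may repeat it verbatim, which may not hold for these particular PDFs.

```python
import io
import urllib.request


def download_pdf(url):
    """Fetch the PDF and wrap its bytes in a file-like object."""
    with urllib.request.urlopen(url) as response:
        return io.BytesIO(response.read())


def extract_page_tables(pdf_file, table_settings):
    """Return one list of rows per page (an empty list where no table is found)."""
    # Third-party import kept local so the other helpers work without pdfplumber.
    import pdfplumber

    with pdfplumber.open(pdf_file) as pdf:
        return [page.extract_table(table_settings) or [] for page in pdf.pages]


def merge_page_tables(page_tables):
    """Concatenate per-page rows, keeping the header row only once.

    Assumption: the first row encountered is the header, and any later row
    identical to it is a repeated page header that should be dropped.
    """
    header = None
    rows = []
    for table in page_tables:
        for row in table:
            if header is None:
                header = row
            elif row != header:
                rows.append(row)
    return header, rows
```

Typical use would be to pass the result of download_pdf(url) and the same settings dict as above to extract_page_tables, then build the frame with pd.DataFrame(rows, columns=header). That last step only works if every row has as many cells as the header, so the table settings may need tuning per document first.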
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |