Scraping PDF files with multiple pages from a URL

I want to scrape the information in this PDF using Python. I'm not sure where to start because the content doesn't look organized at all. I'm used to scraping HTML; I tried converting the PDF to HTML, but that didn't really help.

How would you try to scrape this PDF? Here are links to the PDFs (any will work, they're all similar): https://portal.charitycommissioner.je/Public-Register/ and https://www.gov.im/media/1371147/publicindex_latest-15121-v2.pdf

Thank you for any help :D



Solution 1:[1]

It is organized: the data is laid out as a "table", and pdfplumber works well for this.


Once you have table settings that correctly match your data, you can call .extract_table():

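import pdfplumber
import pandas as pd

# Open the PDF; the context manager closes the file when done
with pdfplumber.open('file.pdf') as pdf:
    page = pdf.pages[0]  # first page only
    # vertical_strategy="text" infers column edges from text alignment
    # instead of ruled lines. Depending on the pdfplumber version,
    # keep_blank_chars may need to be spelled text_keep_blank_chars.
    table = page.extract_table(
        dict(vertical_strategy="text", keep_blank_chars=True)
    )

# Each extracted row (a list of cell strings) becomes one DataFrame row
df = pd.DataFrame(table)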
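
The snippet above reads only the first page of a local file, while the question asks about multi-page PDFs fetched from a URL. A minimal sketch of one way to extend it follows. The helper names merge_page_tables and scrape_pdf_from_url are hypothetical, and the code assumes that the first row of the first table is a header that may repeat on later pages, which may not hold for every PDF. The default table settings are also only a starting point and will likely need tuning.

```python
import io
import urllib.request


def merge_page_tables(page_tables):
    """Combine per-page tables (lists of rows) into one list of rows.

    Assumption: the first row of the first non-empty table is a header.
    Pages with no table (None or empty) are skipped, and a header row
    repeated at the top of a later page is dropped.
    """
    rows = []
    header = None
    for table in page_tables:
        if not table:  # extract_table() returns None when nothing is found
            continue
        if header is None:
            header = table[0]
            rows.append(header)
            body = table[1:]
        elif table[0] == header:
            body = table[1:]  # drop the repeated header
        else:
            body = table
        rows.extend(body)
    return rows


def scrape_pdf_from_url(url, table_settings=None):
    """Download a PDF into memory and extract one table per page."""
    import pdfplumber  # third-party: pip install pdfplumber

    settings = table_settings or {"vertical_strategy": "text"}
    with urllib.request.urlopen(url) as response:
        data = io.BytesIO(response.read())
    # pdfplumber.open() accepts a file-like object as well as a path
    with pdfplumber.open(data) as pdf:
        tables = [page.extract_table(settings) for page in pdf.pages]
    return merge_page_tables(tables)
```

Calling rows = scrape_pdf_from_url(url) and then pd.DataFrame(rows[1:], columns=rows[0]) would give a single DataFrame with named columns, provided the header assumption holds for the PDF in question.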

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1