Extract embedded PDF document from a webpage

I am trying to write a Python program that is able to extract a PDF file that is embedded in a website, e.g., in a PDF viewer. However, I haven't yet been able to find a robust way to accomplish this.

Is there a reliable way or best practice to identify PDFs, perhaps based on MIME type?



Solution 1:[1]

Basically, you need to search the HTML page for an iframe element and check its src attribute, which should contain the URL of the PDF file, either directly or as a parameter of the viewer URL.

For example: <iframe src="/pdfjs/web/viewer.html?file=%2Fpdf%2Fsample-3pp.pdf" style="border: none; width: 100%; height: 100%;" frameborder="0"></iframe> from https://pdfobject.com/examples/pdfjs-forced.html

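So the iframe resolves to https://pdfobject.com/pdfjs/web/viewer.html?file=%2Fpdf%2Fsample-3pp.pdf, which is the viewer page. The PDF itself is named in the file parameter, which decodes to /pdf/sample-3pp.pdf, i.e. https://pdfobject.com/pdf/sample-3pp.pdf.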
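
In this example the iframe src points at a PDF.js viewer page, and the document is named in its file query parameter. Here is a minimal sketch of pulling that out with the standard library. It assumes the viewer uses a parameter called file (as in the example; other viewers may use something else), and pdf_url_from_iframe_src is just an illustrative helper name:

```python
from urllib.parse import urljoin, urlparse, parse_qs


def pdf_url_from_iframe_src(page_url, src):
    """Resolve an iframe src against the page URL.

    If the resolved URL has a `file` query parameter (assumption: a
    PDF.js-style viewer), return the URL that parameter points to.
    Otherwise return the resolved src unchanged.
    """
    viewer_url = urljoin(page_url, src)
    # parse_qs also percent-decodes the values (%2F -> /).
    query = parse_qs(urlparse(viewer_url).query)
    file_values = query.get("file")
    if file_values:
        # The file value may be relative, so resolve it against the viewer.
        return urljoin(viewer_url, file_values[0])
    return viewer_url


page = "https://pdfobject.com/examples/pdfjs-forced.html"
src = "/pdfjs/web/viewer.html?file=%2Fpdf%2Fsample-3pp.pdf"
print(pdf_url_from_iframe_src(page, src))
```

If the src has no file parameter, the function simply returns the absolute src, which covers pages that embed the PDF directly.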

Note that not every web PDF reader exposes the location of the file. For example, the site you shared does not.

You can load the HTML page with urllib or requests and search for the tag with BeautifulSoup, or use Scrapy or any of the many other tools available.
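
As a rough sketch of that step, here is a version that sticks to the standard library (urllib plus html.parser instead of requests and BeautifulSoup, so nothing needs installing). Looking at embed and object tags in addition to iframe is my own assumption about how pages commonly embed PDFs. The looks_like_pdf check follows the MIME-type idea from the question: it accepts a response if the Content-Type is application/pdf or the body starts with the %PDF- signature. All class and function names are illustrative:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class EmbedCollector(HTMLParser):
    """Collect URLs referenced by <iframe>, <embed> and <object> tags."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <iframe> and <embed> use `src`; <object> uses `data`.
        if tag in ("iframe", "embed") and attrs.get("src"):
            self.sources.append(attrs["src"])
        elif tag == "object" and attrs.get("data"):
            self.sources.append(attrs["data"])


def find_embedded_sources(html, page_url):
    """Return absolute URLs of everything embedded in the given HTML."""
    collector = EmbedCollector()
    collector.feed(html)
    return [urljoin(page_url, src) for src in collector.sources]


def looks_like_pdf(content_type, first_bytes):
    """True if the MIME type or the leading bytes indicate a PDF."""
    mime = content_type.split(";")[0].strip().lower()
    return mime == "application/pdf" or first_bytes.startswith(b"%PDF-")


def fetch_embedded_sources(page_url):
    """Download a page and list its embedded sources (needs network)."""
    with urlopen(page_url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return find_embedded_sources(html, page_url)
```

Each URL returned by find_embedded_sources is only a candidate: it may be a viewer page rather than the PDF, so it is worth requesting it and passing the Content-Type header and the first few bytes to looks_like_pdf before saving anything. Pages that build the viewer with JavaScript will not show up in the static HTML at all.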

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 aiven