Extract embedded PDF document from a webpage
I am trying to write a Python program that is able to extract a PDF file that is embedded in a website, e.g., in a PDF viewer. However, I haven't yet been able to find a robust way to accomplish this.
Is there a way, or a best practice, to identify PDFs, perhaps based on MIME type?
Solution 1:[1]
Search the page's HTML for an iframe element and check its src attribute; it should contain the URL of the PDF file.
For example:
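<iframe src="/pdfjs/web/viewer.html?file=%2Fpdf%2Fsample-3pp.pdf" style="border: none; width: 100%; height: 100%;" frameborder="0"></iframe>
(taken from https://pdfobject.com/examples/pdfjs-forced.html)
Resolved against the page, the iframe URL is: https://pdfobject.com/pdfjs/web/viewer.html?file=%2Fpdf%2Fsample-3pp.pdf
Note that this URL points at the viewer page. The PDF itself is named by the file query parameter, which decodes to /pdf/sample-3pp.pdf.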
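The URL handling for this example can be sketched with the standard library alone. The function name `pdf_url_from_viewer` is hypothetical, and the reliance on a `file` query parameter is an assumption that matches PDF.js-style viewers like the one above; other viewers may use a different parameter or none.

```python
from urllib.parse import parse_qs, urljoin, urlparse


def pdf_url_from_viewer(page_url, iframe_src):
    """Resolve an iframe src against its page and return the PDF URL it names."""
    # Turn a relative src such as "/pdfjs/web/viewer.html?..." into an absolute URL.
    viewer_url = urljoin(page_url, iframe_src)
    # parse_qs percent-decodes values, so "%2Fpdf%2F..." becomes "/pdf/...".
    query = parse_qs(urlparse(viewer_url).query)
    # Assumption: the viewer passes the document in a "file" parameter.
    files = query.get("file")
    if files:
        return urljoin(viewer_url, files[0])
    # No "file" parameter: the src may already point at the PDF directly.
    return viewer_url


page = "https://pdfobject.com/examples/pdfjs-forced.html"
src = "/pdfjs/web/viewer.html?file=%2Fpdf%2Fsample-3pp.pdf"
print(pdf_url_from_viewer(page, src))
```

For the iframe above this prints https://pdfobject.com/pdf/sample-3pp.pdf.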
Note that not every web-based PDF reader exposes the location of the file. For example, the site you shared does not.
You can load the HTML page with urllib or requests and search for the tag with BeautifulSoup, or use Scrapy or any of the many other tools available.
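As a rough sketch of that search step, the following uses only the standard library's `html.parser` so that it runs without extra packages; BeautifulSoup's `find_all` would do the same job. The names `EmbedCollector`, `find_embedded_urls`, and `is_pdf` are hypothetical. Looking at `embed` and `object` tags in addition to `iframe` is an assumption beyond the answer above. The `is_pdf` helper touches on the MIME-type idea from the question: it trusts the server's Content-Type header, which some servers omit or mislabel, and some servers reject HEAD requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Tag name -> attribute that may carry an embedded document's URL.
EMBED_ATTRS = {"iframe": "src", "embed": "src", "object": "data"}


class EmbedCollector(HTMLParser):
    """Collect absolute URLs referenced by iframe, embed, and object tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = EMBED_ATTRS.get(tag)
        if wanted is None:
            return
        value = dict(attrs).get(wanted)
        if value:
            # Resolve relative references against the page URL.
            self.urls.append(urljoin(self.base_url, value))


def find_embedded_urls(html, base_url):
    """Return candidate URLs of embedded documents found in html."""
    collector = EmbedCollector(base_url)
    collector.feed(html)
    return collector.urls


def is_pdf(url, timeout=10):
    """Return True if the server reports application/pdf for url."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type() == "application/pdf"
```

A possible way to combine them: download the page, pass its text and URL to `find_embedded_urls`, then call `is_pdf` on each candidate (or on the URL taken from a viewer's query string) before downloading it.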
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | aiven |
