Extract URLs from PDF - text doesn't match URL
I'm using the following code to extract URLs from a PDF. It works fine for extracting the anchor, but it does not work when the anchor text differs from the URL behind it. For example, 'www.page.com/A' is used as a short URL in the text, but the actual URL behind it is a longer (full) version.
The code I'm using is:
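
    import urllib.request
    import PyPDF2

    # Download the PDF to a local file ("url" is defined earlier in the script)
    urllib.request.urlretrieve(url, "remoteFile")
    pdfFile = PyPDF2.PdfFileReader("remoteFile", strict=False)
    key = '/Annots'
    uri = '/URI'
    ank = '/A'
    mylist = []
    for page_no in range(pdfFile.numPages):
        page = pdfFile.getPage(page_no)
        text = page.extractText()
        pageObject = page.getObject()
        if key in pageObject.keys():
            # The annotation array of the page, not the page's key list
            ann = pageObject[key]
            for a in ann:
                try:
                    u = a.getObject()
                    if uri in u[ank].keys():
                        mylist.append(u[ank][uri])
                        print(u[ank][uri])
                except KeyError:
                    # Annotation has no /A action dictionary; skip it
                    pass
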
As I said, it works OK if the anchor and the link are the same. If the link is different, it saves the anchor. Ideally I would save both (or just the link).
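One possible way to keep the real target separate from the visible text is sketched below. It assumes the standard PDF layout for link annotations: a `/Subtype` of `/Link`, a `/Rect` giving the clickable area, and an `/A` action dictionary holding the `/URI`. The helper name `collect_link_targets` and the sample data are hypothetical and not part of the original code. The function only needs dict-like objects, so it can be tried with plain dictionaries before pointing it at real annotations.

```python
def collect_link_targets(annotations):
    """Return (rect, uri) pairs for link annotations that carry a URI action.

    `annotations` is an iterable of dict-like, already resolved annotation
    objects. Assumption: the /Rect entries are plain numbers.
    """
    results = []
    for annot in annotations:
        # Only link annotations point at a target
        if '/Subtype' not in annot or annot['/Subtype'] != '/Link':
            continue
        # Internal links (for example GoTo actions) have no /URI entry
        if '/A' not in annot or '/URI' not in annot['/A']:
            continue
        rect = None
        if '/Rect' in annot:
            # [lower-left x, lower-left y, upper-right x, upper-right y]
            rect = tuple(float(v) for v in annot['/Rect'])
        results.append((rect, str(annot['/A']['/URI'])))
    return results


# Hypothetical sample data standing in for resolved annotations
sample = [
    {'/Subtype': '/Link', '/Rect': [10, 20, 110, 32],
     '/A': {'/S': '/URI', '/URI': 'https://www.page.com/full/path/A'}},
    {'/Subtype': '/Link', '/Rect': [0, 0, 1, 1], '/A': {'/S': '/GoTo'}},
    {'/Subtype': '/Text'},
]
for rect, target in collect_link_targets(sample):
    print(rect, target)
```

With the code from the question, the input would presumably be the resolved entries of the page's annotation array, for example `[a.getObject() for a in pageObject['/Annots']]`. The returned rectangle is the area the link covers on the page. Recovering the visible anchor text would mean matching that rectangle against positioned text, which `extractText()` does not provide, so this sketch only covers the target side.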
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
