Getting URL Link Embedded in 'a' Tag with 'accesskey': 'href' Not Returning Proper URL
I'm working through Mitchell's book "Web Scraping with Python", 2nd Edition. The pertinent GitHub link is https://github.com/REMitchell/python-scraping/blob/master/Chapter03-web-crawlers.ipynb. I'm having issues with the "Collecting Data Across an Entire Site" program, which targets Wikipedia. I'm trying to extract the URL(s) that permit editing a page.
The author's highlighted example, where the program was stopped, is the Finnish Civil War page. Judging from the output shown there, the author also had trouble returning these links. A quick glance at https://en.wikipedia.org/wiki/Finnish_Civil_War shows multiple edit URLs, which should be accessible.
I tried a few different things in the code, which appear to have gotten me one step closer, specifically in the code block where I iterate over the descendants of 'ca-edit':
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    try:
        print(bs.h1.get_text())
        print(bs.find(id='mw-content-text').find_all('p')[0])  # gets first paragraph
        print('---------------------------------------------------------------')
        print(bs.find(id='ca-edit').getText())#.find(accesskey='e').getText())#.find(href=re.compile('^(/w/)')))
        print('---------------------------------------------------------------')
        tag = bs.find(id='ca-edit')
        for descendant in tag.descendants:
            print(type(descendant))
            print(descendant)
            print(descendant.findNext())
        print('---------------------------------------------------------------')
        print(bs.span.descendants)
        print(bs.find_all('span'))
        print('---------------------------------------------------------------')
        print(bs.find_all('a', {'tag':'accesskey'}))
        print('---------------------------------------------------------------')
        print(bs.find_all('span', {'class':'edit'}))
        print('---------------------------------------------------------------')
        print(bs.find(id='ca-edit'))
        print(len(bs.find_all(text='Edit this page [alt-shift-e]')))
        print(bs.find(id='ca-edit').find('span').find('edit'))
    except AttributeError:
        print('This page is missing something! Continuing.')

    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):  # this includes ALL pages
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:  # This indicates that we've encountered a new page
                newPage = link.attrs['href']
                print('-'*20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks('')
If I run this code and stop on the '.../wiki/Wikipedia:Perennial_proposals' page, I get the following output, truncated at the end of the "for descendant in tag.descendants:" loop:
/wiki/Wikipedia:Perennial_proposals
Wikipedia:Perennial proposals
<p>This is a list of things that have frequently been proposed on Wikipedia, and have been <a class="mw-redirect" href="/wiki/Wikipedia:Rejected_proposals" title="Wikipedia:Rejected proposals">rejected by the community</a> several times in the past. It should be noted that merely listing something on this page does not mean it will never happen, but that it has been discussed before and never met consensus. <a class="mw-redirect" href="/wiki/Wikipedia:Consensus_can_change" title="Wikipedia:Consensus can change">Consensus can change</a>, and some proposals that remained on this page for a long time have finally been proposed in a way that reached consensus, but you should address rebuttals raised in the past if you make a proposal along these lines. If you feel you would still like to do one of these proposals, then raise it at the <a href="/wiki/Wikipedia:Village_pump" title="Wikipedia:Village pump">village pump</a>.
</p>
---------------------------------------------------------------
Edit
---------------------------------------------------------------
<class 'bs4.element.Tag'>
<a accesskey="e" href="/w/index.php?title=Wikipedia:Perennial_proposals&action=edit" title="Edit this page [e]"><span>Edit</span></a>
<span>Edit</span>
<class 'bs4.element.Tag'>
<span>Edit</span>
<li class="mw-list-item" id="ca-history"><a accesskey="h" href="/w/index.php?title=Wikipedia:Perennial_proposals&action=history" title="Past revisions of this page [h]"><span>View history</span></a></li>
<class 'bs4.element.NavigableString'>
Edit
<li class="mw-list-item" id="ca-history"><a accesskey="h" href="/w/index.php?title=Wikipedia:Perennial_proposals&action=history" title="Past revisions of this page [h]"><span>View history</span></a></li>
---------------------------------------------------------------
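One way to read the output above: the first descendant is a Tag (the 'a' element), and its attributes, including 'href', already hold the edit path; the NavigableString "Edit" is plain text and carries no attributes. The sketch below illustrates this on a hard-coded snippet rather than a live request. The SAMPLE_HTML string is an assumption modeled on the printed output (the enclosing li with id="ca-edit" is inferred from the 'ca-history' item, and '&amp;' is written as it would appear in page source).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed markup, reconstructed from the output printed above.
SAMPLE_HTML = (
    '<li class="mw-list-item" id="ca-edit">'
    '<a accesskey="e" '
    'href="/w/index.php?title=Wikipedia:Perennial_proposals&amp;action=edit" '
    'title="Edit this page [e]"><span>Edit</span></a></li>'
)

bs = BeautifulSoup(SAMPLE_HTML, 'html.parser')
tag = bs.find(id='ca-edit')

hrefs = []
for descendant in tag.descendants:
    # Only Tag objects have attributes; NavigableString nodes are just text.
    if isinstance(descendant, Tag) and descendant.has_attr('href'):
        hrefs.append(descendant['href'])

print(hrefs)
```

Only the 'a' element contributes an entry here, since the 'span' has no 'href' and the text node is skipped by the isinstance check.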
How can I extract the URL located in the code block below?
<a accesskey="e" href="/w/index.php?title=Wikipedia:Perennial_proposals&action=edit" title="Edit this page [e]"><span>Edit</span></a>
<span>Edit</span>
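A possible approach, offered as a sketch rather than a confirmed fix: look up the 'a' element inside 'ca-edit' by its 'accesskey' attribute and read 'href' from it. Note that a filter such as {'tag': 'accesskey'} asks for an attribute literally named "tag", which is likely why that attempt returned nothing; {'accesskey': 'e'} matches the attribute shown in the output. The helper name get_edit_url, the BASE_URL constant, and the sample markup are hypothetical; the returned path is relative, so urljoin is used to build an absolute URL.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'https://en.wikipedia.org'  # assumed base for relative links


def get_edit_url(bs, base_url=BASE_URL):
    """Return the absolute edit URL, or None if the page has no edit tab."""
    container = bs.find(id='ca-edit')
    if container is None:
        return None
    # Match on the attribute name/value pair, not on a key called 'tag'.
    link = container.find('a', attrs={'accesskey': 'e'})
    if link is None or not link.has_attr('href'):
        return None
    return urljoin(base_url, link['href'])


# Hard-coded sample modeled on the snippet above (the <li> wrapper is assumed).
sample = BeautifulSoup(
    '<li class="mw-list-item" id="ca-edit">'
    '<a accesskey="e" '
    'href="/w/index.php?title=Wikipedia:Perennial_proposals&amp;action=edit" '
    'title="Edit this page [e]"><span>Edit</span></a></li>',
    'html.parser',
)
edit_url = get_edit_url(sample)
print(edit_url)
```

Returning None instead of raising keeps the crawler loop running on pages without an edit tab (protected pages, for example), which is the case the original try/except AttributeError was guarding against.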
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
