How do I get the HTML between two div elements in Python?
I'm trying to scrape all the paragraphs from Wikipedia that come between the main heading of the page and the table of contents. I noticed that they always come between two div elements as shown below:
<div id="some-div">...</div>
<p>...</p>
<p>...</p>
<p>...</p>
<div id="some-other-div">...</div>
I want to grab all of the HTML between the two div elements (not just the text). I'm looking for a solution in Python.
Solution 1:[1]
I doubt you can depend on utterly consistent formatting. However, this seems to work for the 'Python (programming language)' page, where the introductory text is delimited by the 'Contents' box.
I offer a few notes:
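- find_previous_siblings (fetchPreviousSiblings in older BeautifulSoup releases) returns the paragraphs in reverse order.
- I would check the length of contents against the unlikely possibility of more than one occurrence.
- With this approach it's almost certainly necessary to check the result for rubbish.
from urllib.request import urlopen
from bs4 import BeautifulSoup
URL = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
# Pass the raw bytes to BeautifulSoup; str() on bytes would produce a "b'...'" literal.
HTML = urlopen(URL).read()
# Name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(HTML, 'html.parser')
# The 'Contents' box is the div with id "toc".
contents = soup.find_all('div', attrs={'id': 'toc'})
# Paragraph siblings that precede the box, nearest first.
paras = contents[0].find_previous_siblings('p')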
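The notes above can be folded into a small helper. This is only a minimal sketch, not part of the original answer: the function name `intro_paragraphs` and the inline sample markup are hypothetical, and it uses the bs4 method names `find_all` and `find_previous_siblings`. It runs on a literal string so it can be tried without a network request.

```python
from bs4 import BeautifulSoup


def intro_paragraphs(html, toc_id="toc"):
    """Return the <p> siblings preceding the div with id=toc_id, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    contents = soup.find_all("div", attrs={"id": toc_id})
    # Guard against a missing or duplicated delimiter (second note above).
    if len(contents) != 1:
        raise ValueError(f"expected exactly one div#{toc_id}, found {len(contents)}")
    # find_previous_siblings yields the nearest sibling first, so reverse it
    # to get the paragraphs in page order (first note above).
    paras = contents[0].find_previous_siblings("p")
    return list(reversed(paras))


# Hypothetical sample markup standing in for a real page.
SAMPLE = (
    '<div id="heading">Title</div>'
    "<p>first</p><p>second</p>"
    '<div id="toc">Contents</div>'
    "<p>after</p>"
)
paras = intro_paragraphs(SAMPLE)
# Joining str(p) keeps the tags, not just the text.
intro_html = "".join(str(p) for p in paras)
```

Raising on an unexpected number of matches is one possible design choice; returning an empty list instead would also be reasonable if pages without a contents box are common.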
Solution 2:[2]
With BeautifulSoup you will find the first div and the second div by their ids:
from bs4 import BeautifulSoup
# html is assumed to hold the page source as a string
bs = BeautifulSoup(html, "html.parser")
first_div = bs.find(id="some-div")
second_div = bs.find(id="some-other-div")
After this, we create a list with all the elements in between the two divs (converted to strings) and afterwards join them together. For this we loop through all the siblings after the first_div and break when we reach the second div:
in_between = []
for sibling in first_div.next_siblings:
    if sibling == second_div:
        break
    else:
        in_between.append(str(sibling))
in_between = "".join(in_between)
The previous code block can be replaced by a one-line list comprehension that uses takewhile from itertools:
from itertools import takewhile
in_between = "".join([str(sibling) for sibling in takewhile(lambda x: x != second_div, first_div.next_siblings)])
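To check the approach end to end, here is a small sketch that runs both steps on markup shaped like the question's. The sample string and the helper name `html_between` are assumptions made for illustration. One detail it changes on purpose: it compares with `is` rather than `==`, because bs4 tags compare equal whenever their name, attributes, and contents match, so an identical-looking element could otherwise end the scan early.

```python
from itertools import takewhile

from bs4 import BeautifulSoup


def html_between(html, start_id, end_id):
    """Return the markup of every sibling between the two elements as one string."""
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find(id=start_id)
    end = soup.find(id=end_id)
    if start is None or end is None:
        raise ValueError("both delimiter elements must be present")
    # Identity check: stop at the end element itself, not at a look-alike tag.
    between = takewhile(lambda node: node is not end, start.next_siblings)
    return "".join(str(node) for node in between)


# Hypothetical sample modelled on the question's markup.
SAMPLE = (
    '<div id="some-div">a</div>'
    "<p>one</p><p>two</p>"
    '<div id="some-other-div">b</div>'
    "<p>three</p>"
)
result = html_between(SAMPLE, "some-div", "some-other-div")
```

Note that if the second element exists but is not a sibling of the first, nothing stops the scan and every following sibling is returned, so it may be worth verifying that both elements share a parent.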
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Pux |
| Solution 2 | Bastian |