Odd type error warning when using bs4 to obtain value from website
The following is a snippet from a website, from which I am trying to obtain (only) the "Text to Capture". That text is nested inside a couple of "div" elements, which contain tables, text etc.
<div class="rankbox">
<div>Ranking
<div class="tooltip-wrapper"> ... </div>
<div class="tooltiptext hide"> ... </div>
**Text to Capture**
<span class="sr-only"> of 5</span>
<span class="rank_chip rankrect_1"> </span>
<span class="rank_chip rankrect_2"> </span>
<span class="rank_chip rankrect_3">3</span>
<span class="rank_chip rankrect_4"> </span>
<span class="rank_chip rankrect_5"> </span>
</div>
</div>
The oddity here is that the text to capture is not wrapped in any tag of its own. I have gotten this to work:
rankbox = soup.find('div', attrs={'class': 'rankbox'})
lx = [x for x in list(rankbox.contents[1])]
returnvalue = str(lx[4]).strip()
However, I am getting a type warning from PyCharm: Expected type 'Iterable[_T]' (matched generic type 'Iterable[_T]'), got 'PageElement' instead. This is because rankbox.contents[1] is typed as a PageElement, not a list.
I am wondering whether there is a more elegant way of achieving this that also avoids the warning.
Solution 1:[1]
Given this HTML source, the following is one possible solution.
The idea is
- Get the first div tag under div.rankbox
- Remove all div and span tags
- Obtain text from the remaining source
- Remove the text "Ranking" at the beginning
- Remove surrounding spaces
import re
from bs4 import BeautifulSoup
html = """
<div class="rankbox">
<div>Ranking
<div class="tooltip-wrapper"> ... </div>
<div class="tooltiptext hide"> ... </div>
**Text to Capture**
<span class="sr-only"> of 5</span>
<span class="rank_chip rankrect_1"> </span>
<span class="rank_chip rankrect_2"> </span>
<span class="rank_chip rankrect_3">3</span>
<span class="rank_chip rankrect_4"> </span>
<span class="rank_chip rankrect_5"> </span>
</div>
</div>
"""
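soup = BeautifulSoup(html, "html.parser")  # explicit parser avoids bs4's "no parser specified" warning
x = soup.select("div.rankbox div")[0]  # the inner div that starts with "Ranking"
# remove all nested divs and spans (this modifies the parsed tree in place)
for d in x.find_all("div"):
    d.extract()
for s in x.find_all("span"):
    s.extract()
x = x.text
x = re.sub(r"^Ranking", "", x)  # remove the leading "Ranking" label
x = x.strip()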
x
# '**Text to Capture**'
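Because extract() detaches the tags from the parsed tree, soup no longer contains those div and span elements afterwards. If the rest of the page still needs to be queried, one option is to run the same steps on a copy. The following is a minimal sketch of that idea and not part of the original answer. The helper name own_text is hypothetical, and the sketch assumes the markup matches the snippet in the question (shortened here).

```python
import copy
import re

from bs4 import BeautifulSoup

html = """
<div class="rankbox">
<div>Ranking
<div class="tooltip-wrapper"> ... </div>
<div class="tooltiptext hide"> ... </div>
**Text to Capture**
<span class="sr-only"> of 5</span>
<span class="rank_chip rankrect_3">3</span>
</div>
</div>
"""


def own_text(tag, label="Ranking"):
    """Return the text of `tag` without nested div/span content (hypothetical helper)."""
    clone = copy.copy(tag)  # bs4 copies the tag together with its children
    for child in clone.find_all(["div", "span"]):
        child.extract()  # only the copy is modified
    text = clone.get_text().strip()
    # drop the leading label, if present
    return re.sub(r"^" + re.escape(label), "", text).strip()


soup = BeautifulSoup(html, "html.parser")
inner = soup.select("div.rankbox div")[0]
result = own_text(inner)
print(result)
```

After the call, soup still contains the span elements, so later lookups such as the rank chips keep working.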
Solution 2:[2]
The previous answer helped me find shorter code for this:
xtract = soup.find('div', attrs={'class': 'zr_rankbox'})
x = xtract.select('div')[0].find_all(text=True, recursive=False)[1].get_text(strip=True)
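This produces no type error warning. The class name zr_rankbox and the index [1] presumably match the live page; with the simplified snippet in the question, the class is rankbox and the whitespace-only text node between the two tooltip divs shifts the wanted string to index [2].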
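A fixed index into the text nodes is sensitive to whitespace in the markup. A more defensive variant could keep only the non-empty strings that are direct children of the inner div, and use isinstance checks so that a type checker sees concrete types. This is a sketch under the assumption that the markup matches the snippet in the question (shortened here) and that the wanted text is the first non-empty string after the "Ranking" label. The helper names direct_texts and rank_text are hypothetical.

```python
from typing import List, Optional

from bs4 import BeautifulSoup, NavigableString, Tag

html = """
<div class="rankbox">
<div>Ranking
<div class="tooltip-wrapper"> ... </div>
<div class="tooltiptext hide"> ... </div>
**Text to Capture**
<span class="sr-only"> of 5</span>
<span class="rank_chip rankrect_3">3</span>
</div>
</div>
"""


def direct_texts(tag: Tag) -> List[str]:
    """Non-empty, stripped strings that are direct children of `tag`."""
    texts = []
    for child in tag.children:
        if isinstance(child, NavigableString):  # skip nested tags
            stripped = str(child).strip()
            if stripped:  # skip whitespace-only nodes
                texts.append(stripped)
    return texts


def rank_text(soup: BeautifulSoup) -> Optional[str]:
    """Return the untagged text that follows the "Ranking" label, or None."""
    inner = soup.select_one("div.rankbox > div")
    if inner is None:
        return None
    texts = direct_texts(inner)
    # texts[0] is the "Ranking" label; the wanted text comes next
    return texts[1] if len(texts) > 1 else None


soup = BeautifulSoup(html, "html.parser")
value = rank_text(soup)
print(value)
```

Returning None instead of raising keeps the caller in control when the page layout changes or the rankbox is missing.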
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Kota Mori |
| Solution 2 | Quirn |
