How do I get data from an infobox?
I need to find an article on Wikipedia and extract the name for a given level of classification from its infobox table.
I have this code:
import requests
from bs4 import BeautifulSoup

def get_infobox(url):
    response = requests.get(url)
    bs = BeautifulSoup(response.text)
    table = bs.find('table', {'class': 'infobox'})
    result = {}
    row_count = 0
    if table is None:
        pass
    else:
        for tr in table.find_all('tr'):
            if tr.find('th'):
                pass
            else:
                row_count += 1
                if row_count > 1:
                    if tr is not None:
                        # Key and value both come from the first <td> of the row
                        result[tr.find('td').text.strip()] = tr.find('td').text
    return result

print(get_infobox(""))
Solution 1:[1]
Checking if the row has exactly two columns seems to be the easiest way. That works for me:
def get_infobox(url):
    response = requests.get(url)
    bs = BeautifulSoup(response.text)
    table = bs.find('table', {'class': 'infobox'})
    result = {}
    if table is None:
        return None
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        # Only rows with exactly two cells are label/value pairs
        if len(tds) == 2:
            key, value = tds
            result[key.text.strip()] = value.text.strip()
    return result

print(get_infobox("https://en.wikipedia.org/wiki/Cat"))
Result:
{'Kingdom:': 'Animalia', 'Phylum:': 'Chordata', 'Class:': 'Mammalia', 'Order:': 'Carnivora', 'Suborder:': 'Feliformia', 'Family:': 'Felidae', 'Subfamily:': 'Felinae', 'Genus:': 'Felis', 'Species:': 'F.\xa0catus[1]'}
You can clean up results as necessary.
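One possible cleanup step, as a minimal sketch: it strips the trailing colon from each key, replaces non-breaking spaces (`\xa0`) with ordinary spaces, and removes footnote markers such as `[1]` from the values. The helper name `clean_infobox` and the footnote pattern are assumptions for illustration, not part of the original answer.

```python
import re

# Hypothetical pattern: matches numeric footnote markers such as "[1]" or "[23]".
_FOOTNOTE = re.compile(r"\[\d+\]")

def clean_infobox(raw):
    """Return a new dict with tidied keys and values (hypothetical helper)."""
    cleaned = {}
    for key, value in raw.items():
        # Normalise non-breaking spaces, then drop the trailing colon from the label.
        key = key.replace("\xa0", " ").strip().rstrip(":")
        # Remove footnote markers and normalise non-breaking spaces in the value.
        value = _FOOTNOTE.sub("", value).replace("\xa0", " ").strip()
        cleaned[key] = value
    return cleaned

sample = {"Kingdom:": "Animalia", "Species:": "F.\xa0catus[1]"}
print(clean_infobox(sample))
# {'Kingdom': 'Animalia', 'Species': 'F. catus'}
```

The function builds a new dictionary instead of modifying its argument, so the raw scrape stays available if a different cleanup is needed later.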
Solution 2:[2]
For the Russian page, you can do it like this:
def get_infobox(url):
    response = requests.get(url)
    bs = BeautifulSoup(response.text, features='lxml')
    # Split on the first colon only, so a value containing ":" stays intact
    return dict(x.getText().split(":", 1) for x in bs.findAll('div', class_='ts-Taxonomy-rang-row'))

print(get_infobox('https://ru.wikipedia.org/wiki/%D0%9A%D0%BE%D1%88%D0%BA%D0%B0'))
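Passing the split pieces straight to `dict()` raises a `ValueError` whenever a row does not split into exactly two parts, for example when it contains no colon. A more defensive variant could look like the sketch below. The function name `rows_to_dict` and the sample strings are made up for illustration; in practice the input would be the text of each `ts-Taxonomy-rang-row` element.

```python
def rows_to_dict(row_texts):
    """Build a dict from 'label: value' strings, skipping rows without a colon."""
    result = {}
    for text in row_texts:
        # partition splits on the first colon only and never raises.
        label, sep, value = text.partition(":")
        if not sep:
            # No colon found: this is not a label/value row, so skip it.
            continue
        result[label.strip()] = value.strip()
    return result

# Made-up sample rows, not real page content.
rows = ["Kingdom: Animalia", "Note: see: details", "no separator here"]
print(rows_to_dict(rows))
# {'Kingdom': 'Animalia', 'Note': 'see: details'}
```

Whitespace is also stripped from both sides, so the keys and values come out in the same shape as in Solution 1.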
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Yevhen Kuzmovych |
| Solution 2 | Sergey K |
