Webscraping issues with tr
I'm trying to scrape this website, initially just this table as a first step before following the further links.
https://www.procyclingstats.com/races.php?year=2021&circuit=1&class=&filter=Filter
The table is called 'table.basic' and has the source code:
<table class="basic" style=""><thead><tr><th class="cu500 " style="width: 11.6%; ">Date</th><th class="hide cs500 " style="width: 5.8%; ">Date</th><th style="width: 46.5%; ">Race</th><th class="cu500 " style="width: 29.1%; ">Winner</th><th style="width: 7%; ">Class</th></tr></thead><tbody>
<tr class="striked" ><td class="cu500 " >19.01 - 24.01</td><td class="hide cs500 " >19.01</td><td><span class="flag au"></span> <a href="race/tour-down-under/2021/startlist/preview">Santos Tour Down Under</a></td><td class="cu500 " ><a href="rider/"></a></td><td>2.UWT</td></tr>
<tr class="striked" ><td class="cu500 " >31.01</td><td class="hide cs500 " >31.01</td><td><span class="flag au"></span> <a href="race/great-ocean-race/2021/startlist/preview">Cadel Evans Great Ocean Road Race</a></td><td class="cu500 " ><a href="rider/"></a></td><td>1.UWT</td></tr>
<tr ><td class="cu500 " >21.02 - 27.02</td><td class="hide cs500 " >21.02</td><td><span class="flag ae"></span> <a href="race/uae-tour/2021">UAE Tour</a></td><td class="cu500 " ><a href="rider/tadej-pogacar">POGAČAR Tadej</a></td><td>2.UWT</td></tr>
<tr ><td class="cu500 " >27.02</td><td class="hide cs500 " >27.02</td><td><span class="flag be"></span> <a href="race/omloop-het-nieuwsblad/2021">Omloop Het Nieuwsblad ME</a></td><td class="cu500 " ><a href="rider/davide-ballerini">BALLERINI Davide</a></td><td>1.UWT</td></tr>
<tr ><td class="cu500 " >06.03</td><td class="hide cs500 " >06.03</td><td><span class="flag it"></span> <a href="race/strade-bianche/2021">Strade Bianche</a></td><td class="cu500 " ><a href="rider/mathieu-van-der-poel">VAN DER POEL Mathieu</a></td><td>1.UWT</td></tr>
<tr ><td class="cu500 " >07.03 - 14.03</td><td class="hide cs500 " >07.03</td><td><span class="flag fr"></span> <a href="race/paris-nice/2021">Paris-Nice</a></td><td class="cu500 " ><a href="rider/maximilian-schachmann">SCHACHMANN Maximilian</a></td><td>2.UWT</td></tr>
I can retrieve the table as a soup using:
url='https://www.procyclingstats.com/races.php?year=2021&circuit=1&class=&filter=Filter'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
table = soup.findAll('table')
print(table)
The issue is that when I try to retrieve the headers etc., I just get errors:
File "C:\Users\Tim\OneDrive\Documents\Python Scripts\cycling stats website scrape.py", line 50, in <module>
    headers = [heading_text for heading in table.find("class=cu500 ")]
File "C:\Users\Tim\anaconda3\lib\site-packages\bs4\element.py", line 2253, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Solution 1:[1]
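.find_all() returns a list of elements (a ResultSet), while .find() returns only the first match. In your code, table = soup.findAll('table') is therefore a ResultSet, and calling .find() on it raises the AttributeError. Note also that the first argument to .find() is a tag name, not a string like "class=cu500 "; attributes are passed separately.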
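To make the difference concrete, here is a minimal offline sketch. The one-row HTML string is a cut-down stand-in for the real page, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the real page (assumption: same tag/class layout).
html = '<table class="basic"><tr><th class="cu500 ">Date</th><th>Race</th></tr></table>'
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")   # ResultSet: a list of matching tags
first_table = soup.find("table")  # a single Tag (or None if nothing matches)

# Calling .find() on the ResultSet itself fails, as in the question.
try:
    tables.find("th")
except AttributeError:
    resultset_has_find = False
else:
    resultset_has_find = True

# Either index into the ResultSet or use .find() from the start.
from_list = tables[0].find("th").get_text()
from_tag = first_table.find("th").get_text()
print(resultset_has_find, from_list, from_tag)
```

Indexing into the ResultSet and calling .find() directly reach the same tag here, so which one to use is mostly a matter of whether you expect one table or several.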
You also might want to familiarize yourself with basic HTML, specifically tags and attributes, so that you can use BeautifulSoup correctly to find those within the HTML.
Also, while this is a nice site/example to learn and practice bs4 on, pandas will get the job done too (usually my go-to if I'm after <table> tags).
import requests
from bs4 import BeautifulSoup
url='https://www.procyclingstats.com/races.php?year=2021&circuit=1&class=&filter=Filter'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
table = soup.find('table', {'class':'basic'})
headers = [heading.text for heading in table.find_all('th',{"class":"cu500"})]
print(headers)
Output:
['Date', 'Winner']
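Since the question mentions moving on to the further links later, the same approach can be extended to whole rows and to the race URLs. The following is only a sketch under some assumptions: it runs on two rows copied from the HTML in the question instead of the live page, it assumes the race link always sits in the third cell, and PAGE_URL, rows and race_links are illustrative names.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

PAGE_URL = "https://www.procyclingstats.com/races.php?year=2021&circuit=1&class=&filter=Filter"

# Two rows taken from the question's HTML (inline styles dropped for brevity).
html = """
<table class="basic"><thead><tr><th class="cu500 ">Date</th><th class="hide cs500 ">Date</th><th>Race</th><th class="cu500 ">Winner</th><th>Class</th></tr></thead><tbody>
<tr><td class="cu500 ">21.02 - 27.02</td><td class="hide cs500 ">21.02</td><td><span class="flag ae"></span> <a href="race/uae-tour/2021">UAE Tour</a></td><td class="cu500 "><a href="rider/tadej-pogacar">POGAČAR Tadej</a></td><td>2.UWT</td></tr>
<tr><td class="cu500 ">27.02</td><td class="hide cs500 ">27.02</td><td><span class="flag be"></span> <a href="race/omloop-het-nieuwsblad/2021">Omloop Het Nieuwsblad ME</a></td><td class="cu500 "><a href="rider/davide-ballerini">BALLERINI Davide</a></td><td>1.UWT</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="basic")

# All header cells, not only those with class cu500.
headers = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
race_links = []
for tr in table.find("tbody").find_all("tr"):
    cells = tr.find_all("td")
    rows.append([td.get_text(strip=True) for td in cells])
    # Assumption: the third cell holds the link to the race page.
    link = cells[2].find("a")
    # hrefs on the page are relative, so resolve them against the page URL.
    race_links.append(urljoin(PAGE_URL, link["href"]))

print(headers)
print(rows)
print(race_links)
```

Rows are kept as plain lists because the table has two columns named Date, which would collide as dictionary keys.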
Pandas:
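import requests
import pandas as pd
from io import StringIO
url='https://www.procyclingstats.com/races.php?year=2021&circuit=1&class=&filter=Filter'
# Usually the line below is enough,
# but for some reason it returns Forbidden here
#dfs = pd.read_html(url)[0]
# So fetch the page with requests and parse the HTML text instead
response = requests.get(url)
# Wrapping the text in StringIO avoids the literal-HTML deprecation warning in newer pandas
dfs = pd.read_html(StringIO(response.text))[0]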
Output:
print(dfs)
             Date  Date.1  ...                 Winner  Class
0   19.01 - 24.01   19.01  ...                    NaN  2.UWT
1           31.01   31.01  ...                    NaN  1.UWT
2   21.02 - 27.02   21.02  ...          POGAČAR Tadej  2.UWT
3           27.02   27.02  ...       BALLERINI Davide  1.UWT
4           06.03    6.03  ...   VAN DER POEL Mathieu  1.UWT
5   07.03 - 14.03    7.03  ...  SCHACHMANN Maximilian  2.UWT
6   10.03 - 16.03   10.03  ...          POGAČAR Tadej  2.UWT
7           20.03   20.03  ...         STUYVEN Jasper  1.UWT
8   22.03 - 28.03   22.03  ...             YATES Adam  2.UWT
9           24.03   24.03  ...            BENNETT Sam  1.UWT
10          26.03   26.03  ...         ASGREEN Kasper  1.UWT
11          28.03   28.03  ...          VAN AERT Wout  1.UWT
12          31.03   31.03  ...       VAN BAARLE Dylan  1.UWT
13          04.04    4.04  ...         ASGREEN Kasper  1.UWT
14  05.04 - 10.04    5.04  ...          ROGLIČ Primož  2.UWT
15          18.04   18.04  ...          VAN AERT Wout  1.UWT
16          21.04   21.04  ...     ALAPHILIPPE Julian  1.UWT
17          25.04   25.04  ...          POGAČAR Tadej  1.UWT
18  27.04 - 02.05   27.04  ...         THOMAS Geraint  2.UWT
19  08.05 - 30.05    8.05  ...            BERNAL Egan  2.UWT
20  30.05 - 06.06   30.05  ...           PORTE Richie  2.UWT
21  06.06 - 13.06    6.06  ...        CARAPAZ Richard  2.UWT
22  26.06 - 18.07   26.06  ...          POGAČAR Tadej  2.UWT
23          31.07   31.07  ...        POWLESS Neilson  1.UWT
24  09.08 - 15.08    9.08  ...           ALMEIDA João  2.UWT
25  14.08 - 05.09   14.08  ...          ROGLIČ Primož  2.UWT
26          22.08   22.08  ...                    NaN  1.UWT
27          29.08   29.08  ...       COSNEFROY Benoît  1.UWT
28  30.08 - 05.09   30.08  ...        COLBRELLI Sonny  2.UWT
29          10.09   10.09  ...                    NaN  1.UWT
30          12.09   12.09  ...                    NaN  1.UWT
31          19.09   19.09  ...       PHILIPSEN Jasper  1.UWT
32          03.10    3.10  ...        COLBRELLI Sonny  1.UWT
33          09.10    9.10  ...          POGAČAR Tadej  1.UWT
34  14.10 - 19.10   14.10  ...                    NaN  2.UWT
[35 rows x 5 columns]
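One detail in the output above: the second date column shows values like 6.03 instead of 06.03, which suggests pandas parsed the day.month strings as floats. A possible workaround, sketched here on a small made-up table with a single Date column (the real page has two, so the column name would need adjusting), is to pass a converter so the column stays text. The attrs argument is also shown as one way to pick the table by its class.

```python
from io import StringIO
import pandas as pd

# Small stand-in table (assumption: simplified to one Date column).
html = """
<table class="basic">
  <thead><tr><th>Date</th><th>Race</th><th>Class</th></tr></thead>
  <tbody>
    <tr><td>06.03</td><td>Strade Bianche</td><td>1.UWT</td></tr>
    <tr><td>07.03</td><td>Paris-Nice</td><td>2.UWT</td></tr>
  </tbody>
</table>
"""

# attrs selects the table by its HTML attributes;
# converters keeps the Date column as text instead of letting pandas infer a number.
text_df = pd.read_html(
    StringIO(html),
    attrs={"class": "basic"},
    converters={"Date": str},
)[0]

print(text_df)
```

Keeping the dates as strings also makes it easier to parse them into real dates later, once the year is added.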
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | chitown88 |
