Webscraping issues with tr

Trying to scrape this website, initially on this table as a beginner step before going for the further links.

https://www.procyclingstats.com/races.php?year=2021&circuit=1&class=&filter=Filter

The table has the class 'basic' (selector table.basic) and its source code begins:

<table class="basic"  style=""><thead><tr><th class="cu500  "  style="width: 11.6%; ">Date</th><th class="hide cs500  "  style="width: 5.8%; ">Date</th><th style="width: 46.5%; ">Race</th><th class="cu500  "  style="width: 29.1%; ">Winner</th><th style="width: 7%; ">Class</th></tr></thead><tbody>
<tr class="striked"  ><td class="cu500  " >19.01 - 24.01</td><td class="hide cs500  " >19.01</td><td><span class="flag au"></span> <a    href="race/tour-down-under/2021/startlist/preview">Santos Tour Down Under</a></td><td class="cu500  " ><a    href="rider/"></a></td><td>2.UWT</td></tr>
<tr class="striked"  ><td class="cu500  " >31.01</td><td class="hide cs500  " >31.01</td><td><span class="flag au"></span> <a    href="race/great-ocean-race/2021/startlist/preview">Cadel Evans Great Ocean Road Race</a></td><td class="cu500  " ><a    href="rider/"></a></td><td>1.UWT</td></tr>
<tr ><td class="cu500  " >21.02 - 27.02</td><td class="hide cs500  " >21.02</td><td><span class="flag ae"></span> <a    href="race/uae-tour/2021">UAE Tour</a></td><td class="cu500  " ><a    href="rider/tadej-pogacar">POGAČAR Tadej</a></td><td>2.UWT</td></tr>
<tr ><td class="cu500  " >27.02</td><td class="hide cs500  " >27.02</td><td><span class="flag be"></span> <a    href="race/omloop-het-nieuwsblad/2021">Omloop Het Nieuwsblad ME</a></td><td class="cu500  " ><a    href="rider/davide-ballerini">BALLERINI Davide</a></td><td>1.UWT</td></tr>
<tr ><td class="cu500  " >06.03</td><td class="hide cs500  " >06.03</td><td><span class="flag it"></span> <a    href="race/strade-bianche/2021">Strade Bianche</a></td><td class="cu500  " ><a    href="rider/mathieu-van-der-poel">VAN DER POEL Mathieu</a></td><td>1.UWT</td></tr>
<tr ><td class="cu500  " >07.03 - 14.03</td><td class="hide cs500  " >07.03</td><td><span class="flag fr"></span> <a    href="race/paris-nice/2021">Paris-Nice</a></td><td class="cu500  " ><a    href="rider/maximilian-schachmann">SCHACHMANN Maximilian</a></td><td>2.UWT</td></tr>

I can retrieve the table as a soup using:

import requests
from bs4 import BeautifulSoup

url='https://www.procyclingstats.com/races.php?year=2021&circuit=1&class=&filter=Filter'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
table = soup.findAll('table')
print(table)

The issue I have is that when I try to retrieve the headers etc. I just get errors:

File "C:\Users\Tim\OneDrive\Documents\Python Scripts\cycling stats website scrape.py", line 50, in <module>
    headers = [heading_text for heading in table.find("class=cu500 ")]
File "C:\Users\Tim\anaconda3\lib\site-packages\bs4\element.py", line 2253, in __getattr__
    raise AttributeError(

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?



Solution 1:[1]

.find_all() will return a list of elements, while .find() will return the first instance it finds.
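To make that distinction concrete, here is a minimal sketch (using a tiny made-up HTML string, not the real page) showing why calling .find() on the result of .find_all() fails, while calling it on a single Tag works:

```python
from bs4 import BeautifulSoup

# A trimmed, hypothetical table just to illustrate the types involved
html = "<table class='basic'><tr><th>Date</th><th>Race</th></tr></table>"
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")  # a ResultSet (list-like) - has no .find() of its own
first = soup.find("table")       # a single Tag - .find()/.find_all() work here

print(type(tables).__name__)     # ResultSet
print(first.find("th").text)     # Date
```

So either use soup.find('table') directly, or index into the list first (tables[0]) before calling .find() on an element.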

You also might want to familiarize yourself with basic HTML, specifically tags and attributes, so that you can use BeautifulSoup correctly to find those within the HTML.

Also, while this is a nice site/example to use to learn and practice bs4, pandas will also get the job done (usually my go-to if I'm after <table> tags).

import requests
from bs4 import BeautifulSoup

url='https://www.procyclingstats.com/races.php?year=2021&circuit=1&class=&filter=Filter'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
table = soup.find('table', {'class':'basic'})
headers = [heading.text for heading in table.find_all('th', {'class': 'cu500'})]

print(headers)

Output:

['Date', 'Winner']
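Since the question mentions going for the further links next, here is a sketch of pulling the race name and its href out of each row. It runs against a single trimmed row copied from the question's HTML rather than the live site, and the startswith("race/") filter is just one possible way to separate race links from rider links:

```python
from bs4 import BeautifulSoup

# One trimmed row from the table HTML shown in the question
html = """<table class="basic"><tbody>
<tr><td class="cu500">21.02 - 27.02</td><td class="hide cs500">21.02</td>
<td><a href="race/uae-tour/2021">UAE Tour</a></td>
<td class="cu500"><a href="rider/tadej-pogacar">POGAČAR Tadej</a></td>
<td>2.UWT</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "basic"})

rows = []
for tr in table.find_all("tr"):
    # bs4 accepts a callable as an attribute filter; keep only race/... links
    race_link = tr.find("a", href=lambda h: h and h.startswith("race/"))
    if race_link:
        rows.append({"race": race_link.text, "href": race_link["href"]})

print(rows)  # [{'race': 'UAE Tour', 'href': 'race/uae-tour/2021'}]
```

Those relative hrefs can then be joined onto https://www.procyclingstats.com/ to follow the further links.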

Pandas:

import requests
import pandas as pd

url='https://www.procyclingstats.com/races.php?year=2021&circuit=1&class=&filter=Filter'

# Usually the line below is enough, but this site returns 403 Forbidden
# to pandas' default urllib User-Agent, so fetch the page with requests instead
#dfs = pd.read_html(url)[0]

response = requests.get(url)
dfs = pd.read_html(response.text)[0]

Output:

print(dfs)
             Date  Date.1  ...                 Winner  Class
0   19.01 - 24.01   19.01  ...                    NaN  2.UWT
1           31.01   31.01  ...                    NaN  1.UWT
2   21.02 - 27.02   21.02  ...          POGAČAR Tadej  2.UWT
3           27.02   27.02  ...       BALLERINI Davide  1.UWT
4           06.03    6.03  ...   VAN DER POEL Mathieu  1.UWT
5   07.03 - 14.03    7.03  ...  SCHACHMANN Maximilian  2.UWT
6   10.03 - 16.03   10.03  ...          POGAČAR Tadej  2.UWT
7           20.03   20.03  ...         STUYVEN Jasper  1.UWT
8   22.03 - 28.03   22.03  ...             YATES Adam  2.UWT
9           24.03   24.03  ...            BENNETT Sam  1.UWT
10          26.03   26.03  ...         ASGREEN Kasper  1.UWT
11          28.03   28.03  ...          VAN AERT Wout  1.UWT
12          31.03   31.03  ...       VAN BAARLE Dylan  1.UWT
13          04.04    4.04  ...         ASGREEN Kasper  1.UWT
14  05.04 - 10.04    5.04  ...          ROGLIČ Primož  2.UWT
15          18.04   18.04  ...          VAN AERT Wout  1.UWT
16          21.04   21.04  ...     ALAPHILIPPE Julian  1.UWT
17          25.04   25.04  ...          POGAČAR Tadej  1.UWT
18  27.04 - 02.05   27.04  ...         THOMAS Geraint  2.UWT
19  08.05 - 30.05    8.05  ...            BERNAL Egan  2.UWT
20  30.05 - 06.06   30.05  ...           PORTE Richie  2.UWT
21  06.06 - 13.06    6.06  ...        CARAPAZ Richard  2.UWT
22  26.06 - 18.07   26.06  ...          POGAČAR Tadej  2.UWT
23          31.07   31.07  ...        POWLESS Neilson  1.UWT
24  09.08 - 15.08    9.08  ...           ALMEIDA João  2.UWT
25  14.08 - 05.09   14.08  ...          ROGLIČ Primož  2.UWT
26          22.08   22.08  ...                    NaN  1.UWT
27          29.08   29.08  ...       COSNEFROY Benoît  1.UWT
28  30.08 - 05.09   30.08  ...        COLBRELLI Sonny  2.UWT
29          10.09   10.09  ...                    NaN  1.UWT
30          12.09   12.09  ...                    NaN  1.UWT
31          19.09   19.09  ...       PHILIPSEN Jasper  1.UWT
32          03.10    3.10  ...        COLBRELLI Sonny  1.UWT
33          09.10    9.10  ...          POGAČAR Tadej  1.UWT
34  14.10 - 19.10   14.10  ...                    NaN  2.UWT

[35 rows x 5 columns]
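The scraped DataFrame carries two Date columns (full range and short date) and NaN winners for races not yet run, so a small clean-up step is often useful. This sketch runs against a tiny inline table mirroring the layout above, selecting columns by position so it works regardless of how pandas names the duplicate Date header:

```python
import pandas as pd
from io import StringIO

# Hypothetical two-row sample mirroring the scraped table's column layout
html = """<table class="basic">
<tr><th>Date</th><th>Date</th><th>Race</th><th>Winner</th><th>Class</th></tr>
<tr><td>21.02 - 27.02</td><td>21.02</td><td>UAE Tour</td><td>POGAČAR Tadej</td><td>2.UWT</td></tr>
<tr><td>19.01 - 24.01</td><td>19.01</td><td>Tour Down Under</td><td></td><td>2.UWT</td></tr>
</table>"""

df = pd.read_html(StringIO(html))[0]
df = df.iloc[:, [0, 2, 3, 4]]      # drop the duplicate short-date column by position
df = df.dropna(subset=["Winner"])  # keep only races that already have a winner

print(df)
```

The same two lines applied to the full 35-row DataFrame would leave only the completed races.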

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 chitown88