How to scrape player positions from the sofifa website: text inside a span with Beautiful Soup

I am scraping the sofifa website into a workable CSV. Each player gets a row. My main problem is that the position section of the website only exports the first position whenever I try to iterate through it. Ideally, I would like all of the positions to be in the same column, separated by commas.

Here is the source HTML and a picture of the page:

[Image: Sofifa website]

<tr>
<td class="col-avatar"><figure class="avatar">
<img alt="" data-src="https://cdn.sofifa.com/players/240/950/21_60.png" data-srcset="https://cdn.sofifa.com/players/240/950/21_120.png 2x, https://cdn.sofifa.com/players/240/950/21_180.png 3x" src="https://cdn.sofifa.com/players/240/950/21_60.png" data-root="https://cdn.sofifa.com/players/" data-type="player" id="240950" class="player-check loaded" srcset="https://cdn.sofifa.com/players/240/950/21_120.png 2x, https://cdn.sofifa.com/players/240/950/21_180.png 3x" data-was-processed="true"></figure></td>
<td class="col-name">
<a class="tooltip" href="/player/240950/pedro-antonio-pereira-goncalves/210058/" data-tooltip="Pedro António Pereira Gonçalves"><div class="bp3-text-overflow-ellipsis"><img title="Portugal" alt="" src="https://cdn.sofifa.com/flags/pt.png" data-src="https://cdn.sofifa.com/flags/pt.png" data-srcset="https://cdn.sofifa.com/flags/[email protected] 2x, https://cdn.sofifa.com/flags/[email protected] 3x" class="flag loaded" srcset="https://cdn.sofifa.com/flags/[email protected] 2x, https://cdn.sofifa.com/flags/[email protected] 3x" data-was-processed="true"> Pedro Gonçalves</div></a><a rel="nofollow" href="/players?pn=23"><span class="pos pos23">RW</span></a> <a rel="nofollow" href="/players?pn=14"><span class="pos pos14">CM</span></a></td><td class="col col-ae" data-col="ae">22</td><td class="col col-oa" data-col="oa"><span class="bp3-tag p p-79">79</span></td><td class="col col-pt" data-col="pt"><span class="bp3-tag p p-87">87</span></td><td class="col-name">
<div class="bp3-text-overflow-ellipsis"><figure class="avatar avatar-sm transparent">
<img alt="" class="team loaded" data-src="https://cdn.sofifa.com/teams/237/30.png" data-srcset="https://cdn.sofifa.com/teams/237/60.png 2x, https://cdn.sofifa.com/teams/237/90.png 3x" src="https://cdn.sofifa.com/teams/237/30.png" data-root="https://cdn.sofifa.com/teams/" data-type="team" srcset="https://cdn.sofifa.com/teams/237/60.png 2x, https://cdn.sofifa.com/teams/237/90.png 3x" data-was-processed="true">
</figure>
<a href="/team/237/sporting-cp/">Sporting CP</a><div class="sub">
2020 ~ 2025</div>
</div>
</td><td class="col col-vl" data-col="vl">€39.5M</td><td class="col col-wg" data-col="wg">€16K</td><td class="col col-tt" data-col="tt"><span class="bp3-tag p">2021</span></td><td class="col-comment">
5.2K</td>
</tr>

This is my web scraping code:

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

# Get basic players information for all players
base_url = "https://sofifa.com/players?offset="
columns = ['ID', 'Name', 'Age',  'Positions','Nationality', 'Overall', 'Potential', 'Club', 'Value', 'Wage',]
data = pd.DataFrame(columns = columns)


for offset in range(0, 335):
    url = base_url + str(offset * 60)
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    table_body = soup.find('tbody')
    for row in table_body.findAll('tr'):
        td = row.findAll('td')
        pid = td[0].find('img').get('id')
        nationality = td[1].find('img').get('title')
        name = td[1].find("a").get("data-tooltip")
        rel = td[1].findAll('a',{'rel': 'nofollow'})
        pos= rel[0].findAll('span')
        for span in pos :
            positions= (span.text.split)
        age = td[2].text
        overall = td[3].text.strip()
        potential = td[4].text.strip( )
        club = td[5].find('a').text
        value = td[6].text.strip()
        wage = td[7].text.strip()
        player_data = pd.DataFrame([[pid, name, age, positions, nationality, overall, potential, club, value, wage]])
        player_data.columns = columns
        data = data.append(player_data, ignore_index=True)
    print("done for "+str(offset),end="\r")
data.drop_duplicates()
data.head()

data.to_csv('player data.csv', encoding='utf-8-sig')

It yields this output:

[Image: Excel output]



Solution 1:[1]

To get positions as string separated by comma, you can try:

import pandas as pd
import requests
from bs4 import BeautifulSoup


def get_data(offset):
    url = "https://sofifa.com/players?offset=" + str(offset * 60)
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    rv = []
    for row in soup.select("tbody tr"):
        id_ = row.select_one("img[id]")["id"]
        name = row.select_one(".col-name [data-tooltip]")["data-tooltip"]
        age = row.select_one(".col-ae").get_text(strip=True)
        positions = [p.get_text(strip=True) for p in row.select("span.pos")]
        nationality = row.select_one("img.flag")["title"]
        overall = row.select_one(".col-oa").get_text(strip=True)
        potential = row.select_one(".col-pt").get_text(strip=True)
        club = row.select_one(".col-name > div > a").get_text(strip=True)

        # sometimes there isn't any club, just country:
        if club == "":
            club = row.select_one(".col-name > div > a")["title"]

        value = row.select_one(".col-vl").get_text(strip=True)
        wage = row.select_one(".col-wg").get_text(strip=True)
        rv.append(
            [
                id_,
                name,
                age,
                ", ".join(positions),
                nationality,
                overall,
                potential,
                club,
                value,
                wage,
            ]
        )

    return rv


all_data = []
for offset in range(0, 3):  # <--- raise the upper bound to scrape more pages
    print("Offset {}...".format(offset))
    all_data.extend(get_data(offset))

df = pd.DataFrame(
    all_data,
    columns=[
        "ID",
        "Name",
        "Age",
        "Positions",
        "Nationality",
        "Overall",
        "Potential",
        "Club",
        "Value",
        "Wage",
    ],
)

print(df)
df.to_csv("data.csv", index=False)

Prints:

...

141  241637               Aurélien Tchouaméni  20       CM, CDM          France      77        85                 AS Monaco     €23M   €35K
142  258315             Bright Akwo Arrey-Mbi  17        CB, LB         Germany      62        85         Bayern München II    €1.2M   €500
143  245367                       Xavi Simons  17            CM     Netherlands      65        84       Paris Saint-Germain    €1.8M    €2K
144  207865                Marcos Aoás Corrêa  26       CB, CDM          Brazil      87        90       Paris Saint-Germain   €92.5M  €135K
145  241852                      Moussa Diaby  20        LW, LM          France      81        88       Bayer 04 Leverkusen     €51M   €60K
146  188567         Pierre-Emerick Aubameyang  31        ST, LW           Gabon      85        85                   Arsenal   €45.5M  €145K

...

and saves data.csv (screenshot from LibreOffice):

[Image: data.csv opened in LibreOffice]
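
A note on why the code in the question only picks up one position: `rel[0]` restricts the search to the first `nofollow` link, the loop reassigns `positions` on every pass instead of collecting values, and `span.text.split` is never actually called. The sketch below is a minimal illustration on a trimmed copy of the row from the question (the HTML string is shortened by hand, so treat it as an assumption about the page rather than the live markup):

```python
from bs4 import BeautifulSoup

# Trimmed, hand-shortened version of the row shown in the question.
html = """
<table><tbody><tr>
<td class="col-name">
<a href="/player/240950/"><div>Pedro Gonçalves</div></a>
<a rel="nofollow" href="/players?pn=23"><span class="pos pos23">RW</span></a>
<a rel="nofollow" href="/players?pn=14"><span class="pos pos14">CM</span></a>
</td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
cell = soup.select_one("td.col-name")

# What the question's code effectively searches: spans of the first link only.
links = cell.find_all("a", {"rel": "nofollow"})
first_only = [s.get_text(strip=True) for s in links[0].find_all("span")]

# Selecting every span with class "pos" in the cell collects all positions.
all_positions = [s.get_text(strip=True) for s in cell.select("span.pos")]
positions = ", ".join(all_positions)

print(first_only)
print(positions)
```

Building the list first and joining it once keeps the result a single string, which is why it ends up in one CSV column.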

Solution 2:[2]

For those who might be interested: as of April 2022, this line from the answer above

name = row.select_one(".col-name [data-tooltip]")["data-tooltip"]

no longer works, probably because the site's HTML has changed. The following works instead:

name = row.select_one(".col-name > a[aria-label]")["aria-label"]
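
Since the markup has already changed once, one option is to try both selectors and fall back gracefully. This is only a sketch: `extract_name` is a hypothetical helper, and the `aria-label` row below is a made-up fragment written to match the selector above, not HTML copied from the site.

```python
from bs4 import BeautifulSoup


def extract_name(row):
    """Return the player's full name from a table row, or None if not found.

    Hypothetical helper: tries the older data-tooltip markup first,
    then the aria-label markup reported in April 2022.
    """
    tag = row.select_one(".col-name [data-tooltip]")
    if tag is not None:
        return tag["data-tooltip"]
    tag = row.select_one(".col-name > a[aria-label]")
    if tag is not None:
        return tag["aria-label"]
    return None


def make_row(cell_html):
    """Wrap a cell's inner HTML in a minimal table row for the demo."""
    doc = '<table><tr><td class="col-name">' + cell_html + "</td></tr></table>"
    return BeautifulSoup(doc, "html.parser").tr


full_name = "Pedro António Pereira Gonçalves"
# Older markup, as shown in the question.
old_row = make_row('<a class="tooltip" data-tooltip="' + full_name + '">Pedro Gonçalves</a>')
# Assumed newer markup (illustrative only).
new_row = make_row('<a aria-label="' + full_name + '" href="/player/240950/">Pedro Gonçalves</a>')
# A row matching neither pattern.
empty_row = make_row("<span>no link here</span>")

print(extract_name(old_row))
print(extract_name(new_row))
print(extract_name(empty_row))
```

Returning None instead of raising lets the caller decide whether to skip the row or stop the scrape when neither pattern matches.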

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Andrej Kesely
Solution 2 Andrea Grianti