Python: Pandas - ONLY remove all-NaN rows and move data up, without moving up data in rows with partial NaNs
Alright, so here is the code I'm currently drafting to pull all National League players' fielding stats. It works fine; however, I'd like to know how to drop ONLY the rows that are entirely NaN from a dataframe without disturbing any of the other data:
# import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
# create a url object
url = r'https://www.baseball-reference.com/leagues/NL/2022-standard-fielding.shtml'
# create list of the stats that we care about
standardFieldingStats = [
'player',
'team_ID',
'G',
'GS',
'CG',
'Inn_def',
'chances',
'PO',
'A',
'E_def',
'DP_def',
'fielding_perc',
'tz_runs_total',
'tz_runs_total_per_season',
'bis_runs_total',
'bis_runs_total_per_season',
'bis_runs_good_plays',
'range_factor_per_nine',
'range_factor_per_game',
'pos_summary'
]
# Create object page
page = requests.get(url)
# parser-lxml = Change html to Python friendly format
# Obtain page's information
soup = BeautifulSoup(page.text, 'lxml')
# grab each team's current year fielding stats and turn it into a dataframe
tableNLFielding = soup.find('table', id='players_players_standard_fielding_fielding')
# grab player UID
puidList = []
rows = tableNLFielding.select('tr')
for row in rows:
    playerUID = row.select_one('td[data-append-csv]')
    playerUID = playerUID.get('data-append-csv') if playerUID else None
    if playerUID is None:
        continue
    else:
        puidList.append(playerUID)
# grab each player's fielding stats
compList = []
for row in rows:
    thingList = []
    for stat in standardFieldingStats:
        thing = row.find("td", attrs={"data-stat": stat})
        if thing is None:
            continue
        elif row.find("td", attrs={"data-stat": 'player'}).text == 'Team Totals':
            continue
        elif row.find("td", attrs={"data-stat": 'player'}).text == 'Rank in 15 NL teams':
            continue
        elif row.find("td", attrs={"data-stat": 'player'}).text == 'Rank in 15 AL teams':
            continue
        elif thing.text == '':
            continue
        elif thing.text == 'NaN':
            continue
        else:
            thingList.append(thing.text)
    compList.append(thingList)
# put the fielding stats into a dataframe
NLFieldingDf = pd.DataFrame(data=compList, columns=standardFieldingStats)
#NLFieldingDf = NLFieldingDf.apply(lambda x: pd.Series(x.dropna().values))
#NLFieldingDf = NLFieldingDf.apply(lambda x: pd.Series(x.fillna('').values))
# make all NaNs blanks for aesthetic reasons
#NLFieldingDf = NLFieldingDf.fillna('')
#NLFieldingDf.insert(loc=0, column='pUID', value=puidList)
Here is an example. The dataframe I want to remove NaNs from:
player            team  pos_summary
NaN               NaN   NaN
Brandon Woodruff  NaN   P
William Woods     ATL   NaN
Kyle Wright       ATL   P
When I try, my dataframe ends up like this, with the data shifted out of place:
player            team  pos_summary
Brandon Woodruff  ATL   P
William Woods     ATL   P
Kyle Wright
Ideally, I want this: no all-NaN rows, with rows that have partial NaNs kept intact:
player            team  pos_summary
Brandon Woodruff        P
William Woods     ATL
Kyle Wright       ATL   P
See the commented-out lines at the end of the code above for my attempts.
Solution 1:[1]
Try this to remove only the rows where every value is NaN (your commented-out attempts apply dropna to each column separately, which collapses the columns independently and is why the values shifted out of place):
df.dropna(how="all")
Further, if you need to replace the NaN values with '', then use
df.fillna('', inplace=True)
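A minimal sketch on the small example frame from the question (the column names and values below are taken from the question's example, not from the scraped table):
import pandas as pd
import numpy as np

# rebuild the small example frame from the question
df = pd.DataFrame({
    'player': [np.nan, 'Brandon Woodruff', 'William Woods', 'Kyle Wright'],
    'team': [np.nan, np.nan, 'ATL', 'ATL'],
    'pos_summary': [np.nan, 'P', np.nan, 'P'],
})

# drop only the rows where every column is NaN; rows with partial NaNs stay put
df = df.dropna(how='all')

# optionally blank out the remaining NaNs for display
df = df.fillna('')
print(df)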
Solution 2:[2]
You could do it that way; however, your data isn't accurate - you shouldn't be getting nulls in a player's position or team.
Secondly, if you need to parse <table> tags (and you don't need to pull out any attributes like an href), let pandas parse the table for you. It uses BeautifulSoup under the hood.
import pandas as pd

url = r'https://www.baseball-reference.com/leagues/NL/2022-standard-fielding.shtml'
# read_html returns a list of dataframes; the fielding table is the last one
df = pd.read_html(url)[-1]
# drop the repeated header rows embedded in the table body
df = df[df['Rk'].ne('Rk')]
Output:
print(df[['Name', 'Tm', 'Pos Summary']])
Name Tm Pos Summary
0 C.J. Abrams SDP SS-2B-OF
1 Ronald Acuna Jr. ATL OF
2 Willy Adames MIL SS
3 Austin Adams SDP P
4 Riley Adams WSN C-1B
.. ... ... ...
509 Miguel Yajure PIT P
510 Mike Yastrzemski SFG OF
511 Christian Yelich MIL OF
512 Juan Yepez STL OF
513 Huascar Ynoa ATL P
[495 rows x 3 columns]
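If you still need the data-append-csv player UID from your original code, pd.read_html cannot extract attributes, only cell text, so one possible approach (a sketch, not part of the answer above) is a second pass with BeautifulSoup over the same HTML. The table id is taken from the question's code, and the length check guards the assumption that the UIDs line up one per remaining data row:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = r'https://www.baseball-reference.com/leagues/NL/2022-standard-fielding.shtml'
html = requests.get(url).text

# parse the table text with pandas, as in the answer above
df = pd.read_html(html)[-1]
df = df[df['Rk'].ne('Rk')].reset_index(drop=True)

# second pass with BeautifulSoup just to pull the data-append-csv attribute
soup = BeautifulSoup(html, 'lxml')
table = soup.find('table', id='players_players_standard_fielding_fielding')
if table is not None:
    puids = [td.get('data-append-csv') for td in table.select('td[data-append-csv]')]
    # only attach the IDs if they line up one per data row (an assumption)
    if len(puids) == len(df):
        df.insert(loc=0, column='pUID', value=puids)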
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | chitown88 |
