'Replace null values by the mean of each group
I have a dataset similar to below with several columns which contain Nan values.

I would like to group the dataset by location and fill the Nan in Iso code and continent column with the same unique string for each location and fill the Nan values with the mean of each location as below:

I have found the code below but it only works for a column at a time and I would like to do it across all columns apart from Iso code and continent columns.
dff['total_deaths'].fillna(dff.groupby('location')['total_deaths'].transform('mean'))
Solution 1:[1]
You can handle the string columns using groupby().transform('first') and the numeric columns using transform('mean'), then put the pieces back together (with location in the middle) using concat:
gb = df.groupby('location')
strCols = ['Iso code','continent']
numCols = list(set(df.columns) - (set(strCols) | set(['location'])))
df = pd.concat([gb[strCols].transform('first'), df['location'], df[numCols].fillna(gb[numCols].transform('mean'))], axis=1)
Output:
Iso code continent location new_cases total_deaths
0 AFG Asia afghanistan 5.0 NaN
1 AFG Asia afghanistan 7.0 2.0
2 AFG NaN afghanistan 8.0 2.0
3 NaN Europe portugal 3.0 7.0
4 PT Europe portugal 3.0 9.0
5 ANG Africa angola NaN NaN
6 ANG NaN angola 5.0 7.0
7 NaN Africa angola NaN 6.0
Iso code continent location new_cases total_deaths
0 AFG Asia afghanistan 5.0 2.0
1 AFG Asia afghanistan 7.0 2.0
2 AFG Asia afghanistan 8.0 2.0
3 PT Europe portugal 3.0 7.0
4 PT Europe portugal 3.0 9.0
5 ANG Africa angola 5.0 6.5
6 ANG Africa angola 5.0 7.0
7 ANG Africa angola 5.0 6.0
For the string columns, I have used the value returned by first in all rows, not just NaN rows, to ensure uniqueness. If you prefer to just fill NaN in the string columns, you can do this:
df = pd.concat([gb[strCols].fillna(gb[strCols].transform('first')), df['location'], df[numCols].fillna(gb[numCols].transform('mean'))], axis=1)
UPDATED:
You can automatically identify numeric columns (dtype float64, for example) so that the only special column in the code is location, with everything else automated:
gb = df.groupby('location')
numCols = df.dtypes[df.dtypes == 'float64'].reset_index()['index'].to_list()
strCols = list(set(df.columns) - (set(numCols) | set(['location'])))
df = pd.concat([gb[strCols].fillna(gb[strCols].transform('first')), df['location'], df[numCols].fillna(gb[numCols].transform('mean'))], axis=1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
