'Replace null values by the mean of each group

I have a dataset similar to below with several columns which contain Nan values.

Dataset sample

I would like to group the dataset by location and fill the Nan in Iso code and continent column with the same unique string for each location and fill the Nan values with the mean of each location as below:

Desired output

I have found the code below but it only works for a column at a time and I would like to do it across all columns apart from Iso code and continent columns.

dff['total_deaths'].fillna(dff.groupby('location')['total_deaths'].transform('mean'))


Solution 1:[1]

You can handle the string columns using groupby().transform('first') and the numeric columns using transform('mean'), then put the pieces back together (with location in the middle) using concat:

gb = df.groupby('location')
strCols = ['Iso code','continent']
numCols = list(set(df.columns) - (set(strCols) | set(['location'])))
df = pd.concat([gb[strCols].transform('first'), df['location'], df[numCols].fillna(gb[numCols].transform('mean'))], axis=1)

Output:

  Iso code continent     location  new_cases  total_deaths
0      AFG      Asia  afghanistan        5.0           NaN
1      AFG      Asia  afghanistan        7.0           2.0
2      AFG       NaN  afghanistan        8.0           2.0
3      NaN    Europe     portugal        3.0           7.0
4       PT    Europe     portugal        3.0           9.0
5      ANG    Africa       angola        NaN           NaN
6      ANG       NaN       angola        5.0           7.0
7      NaN    Africa       angola        NaN           6.0
  Iso code continent     location  new_cases  total_deaths
0      AFG      Asia  afghanistan        5.0           2.0
1      AFG      Asia  afghanistan        7.0           2.0
2      AFG      Asia  afghanistan        8.0           2.0
3       PT    Europe     portugal        3.0           7.0
4       PT    Europe     portugal        3.0           9.0
5      ANG    Africa       angola        5.0           6.5
6      ANG    Africa       angola        5.0           7.0
7      ANG    Africa       angola        5.0           6.0

For the string columns, I have used the value returned by first in all rows, not just NaN rows, to ensure uniqueness. If you prefer to just fill NaN in the string columns, you can do this:

df = pd.concat([gb[strCols].fillna(gb[strCols].transform('first')), df['location'], df[numCols].fillna(gb[numCols].transform('mean'))], axis=1)

UPDATED:

You can automatically identify numeric columns (dtype float64, for example) so that the only special column in the code is location, with everything else automated:

gb = df.groupby('location')
numCols = df.dtypes[df.dtypes == 'float64'].reset_index()['index'].to_list()
strCols = list(set(df.columns) - (set(numCols) | set(['location'])))
df = pd.concat([gb[strCols].fillna(gb[strCols].transform('first')), df['location'], df[numCols].fillna(gb[numCols].transform('mean'))], axis=1)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1