'Adding a summarised column back into a dataframe in python
as part of some data cleansing, i want to add the mean of a variable back into a dataframe to use if the variable is missing for a particular observation. so i've calculated my averages as follows
avg=all_data2.groupby("portfolio")"[sales"].mean().reset_index(name="sales_mean")
now I wanted to add that back into my original dataframe using a left join, but it doesnt appear to be working. what format is my avg now? I thought it would be a dataframe but is it something else?
Solution 1:[1]
UPDATED
This is probably the most succinct way to do it:
all_data2.sales = all_data2.sales.fillna(all_data2.groupby('portfolio').sales.transform('mean'))
This is another way to do it:
all_data2['sales'] = all_data2[['portfolio', 'sales']].groupby('portfolio').transform(lambda x: x.fillna(x.mean()))
Output:
portfolio sales
0 1 10.0
1 1 20.0
2 2 30.0
3 2 40.0
4 3 50.0
5 3 60.0
6 3 NaN
portfolio sales
0 1 10.0
1 1 20.0
2 2 30.0
3 2 40.0
4 3 50.0
5 3 60.0
6 3 55.0
To answer your the part of your question that reads "what format is my avg now? I thought it would be a dataframe but is it something else?", avg
is indeed a dataframe but using it may not be the most direct way to update missing data in the original dataframe. The dataframe avg
looks like this for the sample input data above:
portfolio sales_mean
0 1 15.0
1 2 35.0
2 3 55.0
A related SO question that you may find helpful is here.
Solution 2:[2]
If you want to add a new column, you can use this code:
df['sales_mean']=df[['sales_1','sales_2']].mean(axis=1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 |