How to compute the mean of a column for each group and fill that mean into the corresponding rows of a DataFrame? [duplicate]
I have a mining dataset with the following features: Rock_type and Gold in grams (AU). There are 8 different rock types, and Gold (AU) gives the amount of gold in grams found in each rock type. The dataset has around 30k rows, and the gold content varies widely within each rock type. There are many outliers and I cannot ignore them, so how can I compute the mean for every rock type and impute it to the rows of the corresponding rock type?
Example:
Rock_type: saprolite, margilite, saprolite, saprolite, mafic, mafic, UD, margilite
Gold(AU) : 25.0 , 0.7, 12.0 , 14.0 , 1.5 , 1.7 , 6.7 , 0.9
I need a result like this in a pandas DataFrame:
Rock_type: saprolite, margilite, saprolite, saprolite, mafic, mafic, UD, margilite
Gold(AU) : 17.0 , 0.8, 17.0 , 17.0 , 1.6 , 1.6 , 6.7 , 0.8
Also, is it good practice to use the mean here, or should I consider the median or mode to get better predictions?
Thanks in advance
Solution 1:[1]
Assuming your data is stored in a DataFrame:
import pandas as pd

df = pd.DataFrame(
    {
        'Rock_type': ['saprolite', 'margilite', 'saprolite', 'saprolite', 'mafic', 'mafic', 'UD', 'margilite'],
        'Gold(AU)': [25.0, 0.7, 12.0, 14.0, 1.5, 1.7, 6.7, 0.9]
    }
)
You can get the mean value for each Rock_type with groupby().mean():
df = df.groupby('Rock_type').mean()
But the result will have only one row per rock type:
# Output
           Gold(AU)
Rock_type
UD              6.7
mafic           1.6
margilite       0.8
saprolite      17.0
If you absolutely need the same shape as the original DataFrame, you can rebuild it from these values:
rock_type = ['saprolite', 'margilite', 'saprolite', 'saprolite', 'mafic', 'mafic', 'UD', 'margilite']
# df is now indexed by Rock_type, so df.loc looks up the mean for each rock type
gold = [df.loc[i, 'Gold(AU)'] for i in rock_type]
df2 = pd.DataFrame(
    {
        'Rock_type': rock_type,
        'Gold(AU)': gold
    }
)
# Output
   Rock_type  Gold(AU)
0  saprolite      17.0
1  margilite       0.8
2  saprolite      17.0
3  saprolite      17.0
4      mafic       1.6
5      mafic       1.6
6         UD       6.7
7  margilite       0.8
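As an alternative to rebuilding the DataFrame by hand, pandas can broadcast the group mean back onto the original rows with groupby().transform(). This is a minimal sketch, not part of the original answer; it starts again from the raw data, and the column name Gold_mean is a hypothetical name chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Rock_type": ["saprolite", "margilite", "saprolite", "saprolite",
                      "mafic", "mafic", "UD", "margilite"],
        "Gold(AU)": [25.0, 0.7, 12.0, 14.0, 1.5, 1.7, 6.7, 0.9],
    }
)

# transform("mean") computes the mean per group and returns a Series
# aligned with the original index, so every row gets its group's mean.
# "Gold_mean" is a hypothetical column name used only in this sketch.
df["Gold_mean"] = df.groupby("Rock_type")["Gold(AU)"].transform("mean")

print(df)
```

Writing the result to a new column keeps the original measurements available; assigning to 'Gold(AU)' instead would overwrite them in place.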
Solution 2:[2]
I would suggest using the median instead of the mean, since it is more robust to the outliers in your data. If one of your values is completely wrong, the mean can shift a lot, whereas the median will move much less.
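To illustrate the point, here is a small sketch comparing how the mean and the median react to a single bad reading, followed by a per-group median fill. The value 2500.0 is a made-up outlier and Gold_median is a hypothetical column name; neither comes from the original question.

```python
import pandas as pd

# Saprolite readings from the question, then the same readings plus
# one made-up erroneous value (2500.0) to mimic a bad measurement.
clean = pd.Series([25.0, 12.0, 14.0])
noisy = pd.Series([25.0, 12.0, 14.0, 2500.0])

print("mean:  ", clean.mean(), "->", noisy.mean())
print("median:", clean.median(), "->", noisy.median())

# Per-group median, broadcast back to every row of the original data.
df = pd.DataFrame(
    {
        "Rock_type": ["saprolite", "margilite", "saprolite", "saprolite",
                      "mafic", "mafic", "UD", "margilite"],
        "Gold(AU)": [25.0, 0.7, 12.0, 14.0, 1.5, 1.7, 6.7, 0.9],
    }
)
# "Gold_median" is a hypothetical column name used only in this sketch.
df["Gold_median"] = df.groupby("Rock_type")["Gold(AU)"].transform("median")

print(df)
```

Whether the median is actually the better choice depends on whether the extreme values are measurement errors or genuine high-grade samples, which only domain knowledge can settle.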
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Titouan L |
| Solution 2 | DataSciRookie |
