How to convert mean value of each column variable and fill this mean value to corresponding variable in dataframe? [duplicate]

I have a mining dataset with the following features: Rock_type and Gold in grams (AU). There are 8 different rock types, and Gold (AU) gives the amount of gold, in grams, found in each rock type. The dataset has around 30k rows, and the gold values vary widely within each rock type. There are many outliers and I cannot ignore them, so how can I compute the mean value for each rock type and assign it to every row of the corresponding rock type?

Example:

Rock_type: saprolite, margilite, saprolite, saprolite, mafic, mafic, UD, margilite
Gold(AU):  25.0, 0.7, 12.0, 14.0, 1.5, 1.7, 6.7, 0.9

I need a result like this in a pandas DataFrame, where each value is the mean for its rock type:

Rock_type: saprolite, margilite, saprolite, saprolite, mafic, mafic, UD, margilite
Gold(AU):  17.0, 0.8, 17.0, 17.0, 1.6, 1.6, 6.7, 0.8

Also, is it good practice to use the mean here, or should I consider the median or mode to get a better prediction value?

Thanks in advance

Solution 1:^[1]

Assuming your data is stored in a DataFrame:

import pandas as pd

df = pd.DataFrame(
    {
        'Rock_type': ['saprolite', 'margilite', 'saprolite', 'saprolite', 'mafic', 'mafic', 'UD', 'margilite'],
        'Gold(AU)': [25.0, 0.7, 12.0, 14.0, 1.5, 1.7, 6.7, 0.9]
    }
)

You can get the mean value of each Rock_type with groupby().mean():

df = df.groupby('Rock_type').mean()

But you will then have only 1 row per type of rock:

# Output
           Gold(AU)
Rock_type          
UD              6.7
mafic           1.6
margilite       0.8
saprolite      17.0

But if you absolutely need the same format as the original DataFrame, you can rebuild it by looking up the group mean for each row:

# df now holds the per-group means, indexed by Rock_type
rock_type = ['saprolite', 'margilite', 'saprolite', 'saprolite', 'mafic', 'mafic', 'UD', 'margilite']
# Look up the mean of each row's rock type
gold = [df.loc[i, 'Gold(AU)'] for i in rock_type]
df2 = pd.DataFrame(
    {
        'Rock_type': rock_type,
        'Gold(AU)': gold
    }
)

# Output
   Rock_type  Gold(AU)
0  saprolite      17.0
1  margilite       0.8
2  saprolite      17.0
3  saprolite      17.0
4      mafic       1.6
5      mafic       1.6
6         UD       6.7
7  margilite       0.8
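
As an alternative to rebuilding the frame by hand, pandas can broadcast the group mean back onto every original row with groupby().transform(). The following is a minimal sketch on the example data from the question; the column name Gold(AU)_mean is an assumption chosen here for illustration, and you could overwrite Gold(AU) instead if you do not need the original values.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Rock_type': ['saprolite', 'margilite', 'saprolite', 'saprolite', 'mafic', 'mafic', 'UD', 'margilite'],
        'Gold(AU)': [25.0, 0.7, 12.0, 14.0, 1.5, 1.7, 6.7, 0.9],
    }
)

# transform('mean') returns one value per original row (the mean of that
# row's group), so the result aligns with df and keeps the row order.
# 'Gold(AU)_mean' is a hypothetical column name used for this sketch.
df['Gold(AU)_mean'] = df.groupby('Rock_type')['Gold(AU)'].transform('mean')

print(df)
```

This keeps the original index and row order, which may matter on a 30k-row dataset where the rock types are not known in advance as a hard-coded list.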

Solution 2:^[2]

I would say that you should use the median instead of the mean, since it is more robust to the outliers in your data. If one of your values is completely wrong, it can shift the mean a lot, whereas the median will barely move.
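
To see the difference on the example data from the question, the sketch below compares the per-group mean and median and then fills each row with its group median. It is only an illustration under the assumption that the data looks like the sample above; the names summary and Gold(AU)_median are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Rock_type': ['saprolite', 'margilite', 'saprolite', 'saprolite', 'mafic', 'mafic', 'UD', 'margilite'],
        'Gold(AU)': [25.0, 0.7, 12.0, 14.0, 1.5, 1.7, 6.7, 0.9],
    }
)

# Side-by-side mean and median for each rock type.
summary = df.groupby('Rock_type')['Gold(AU)'].agg(['mean', 'median'])
print(summary)

# Fill every row with the median of its rock type.
# 'Gold(AU)_median' is a hypothetical column name used for this sketch.
df['Gold(AU)_median'] = df.groupby('Rock_type')['Gold(AU)'].transform('median')
print(df)
```

For saprolite, the single high value of 25.0 pulls the mean up to 17.0 while the median stays at 14.0; for groups without an extreme value the two statistics coincide.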

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Titouan L
Solution 2	DataSciRookie
