How can I use groupby with multiple values in a column in pandas?

I have a dataframe like the following:

import pandas as pd

data = {
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None]
}

df = pd.DataFrame(data)
print(df)

      brand model  year   price
0  Mercedes     X  2011     NaN
1   Renault     Y  2010  1000.4
2      Ford     Z  2009  2000.3
3  Mercedes     X  2010  1000.0
4  Mercedes     X  2012  1100.3
5  Mercedes     X  2020  3000.5
6   Renault     Q  2011     NaN

And here is another case to test your solution:

data = {
    'brand': ['Mercedes', 'Mercedes', 'Mercedes', 'Mercedes', 'Mercedes'], 
    'model': ['X', 'X', 'X', 'X', 'X'], 'year': [2017, 2018, 2018, 2019, 2019], 
    'price': [None, None, None, 1000.0, 1200.50]
}

Expected output,

      brand model  year    price
0  Mercedes     X  2017      NaN
1  Mercedes     X  2018  1100.25
2  Mercedes     X  2018  1100.25
3  Mercedes     X  2019  1000.00
4  Mercedes     X  2019  1200.50

I want to fill each missing price with the average price of the observations that have the same brand and model and whose year is year-1, year, or year+1. For instance, the Mercedes X model has a null price in 2011. When I look at the data,

2011 - 1 = 2010
2011 + 1 = 2012

The 4th observation -> Mercedes,X,2010,1000.0
The 5th observation -> Mercedes,X,2012,1100.3

The mean -> (1000.0 + 1100.3) / 2 = 1050.15

I've tried the following:

for c_key, _ in df.groupby(['brand', 'model', 'year']):
    fc = (
        (df['brand'] == c_key[0])
        & (df['model'] == c_key[1])
        & (df['year'].isin([c_key[2] + 1, c_key[2], c_key[2] - 1]))
    )
    
    sc = (
        (df['brand'] == c_key[0])
        & (df['model'] == c_key[1])
        & (df['year'] == c_key[2])
        & (df['price'].isnull())
    )
        
    mean_val = df[fc]['price'].mean()

    df.loc[sc, 'price'] = mean_val

print(df)

      brand model  year    price
0  Mercedes     X  2011  1050.15
1   Renault     Y  2010  1000.40
2      Ford     Z  2009  2000.30
3  Mercedes     X  2010  1000.00
4  Mercedes     X  2012  1100.30
5  Mercedes     X  2020  3000.50
6   Renault     Q  2011      NaN

But this solution takes a long time for 90,000 rows and 27 columns. Is there a more efficient solution? For instance, can I use groupby on brand, model, and the values year-1, year, and year+1?

Thanks in advance.



Solution 1:[1]

I think a more efficient way would be to sort by brand and then year, and then use interpolate:

df = df.sort_values(['brand', 'year']).groupby('brand').apply(lambda g: g.interpolate(limit_area='inside'))

Output:

>>> df
      brand model  year    price
0  Mercedes     X  2011  1050.15
1   Renault     Y  2010  1000.40
2      Ford     Z  2009  2000.30
3  Mercedes     X  2010  1000.00
4  Mercedes     X  2012  1100.30
5  Mercedes     X  2020  3000.50
6   Renault     Q  2011      NaN

That also interpolates all the other numeric columns, not just price.
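To make the behaviour of interpolate easier to check, here is a minimal sketch on toy data (the values are made up and are not from the question). It assumes the default linear method, which works by row position: limit_area='inside' only fills gaps that have valid values on both sides, and the year distance between neighbouring rows is ignored.

```python
import numpy as np
import pandas as pd

# Toy series (illustrative values): a leading gap, an interior gap, a trailing gap.
prices = pd.Series([np.nan, 1000.0, np.nan, 1100.0, np.nan])

# Only the interior gap is filled; the leading and trailing NaN are left alone.
filled = prices.interpolate(limit_area="inside")
print(filled.tolist())  # [nan, 1000.0, 1050.0, 1100.0, nan]

# The default linear method is positional, so a gap between non-adjacent years
# (2010 and 2020 here) is filled with the midpoint as well.
by_year = pd.Series([1000.0, np.nan, 3000.0], index=[2010, 2012, 2020])
gap_filled = by_year.interpolate(limit_area="inside")
print(gap_filled.loc[2012])  # 2000.0
```

So this approach is fast, but it is not restricted to the year-1/year+1 window described in the question.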

Solution 2:[2]

Based on the solution of @richardec, but with an addition that corrects the price when the next year's price is known. I'm not sure whether it is faster than your original solution, though.

import numpy as np

# Make an interpolated average 
df_out = df.sort_values(['brand', 'year']).groupby('brand').apply(lambda g: g.interpolate(limit_area='inside'))

# Make an average per brand/year/model
df1 = df.sort_values(['brand', 'year']).groupby(['brand','year','model']).mean().reset_index()

# Check whether the next row has the same brand and model. If so, use the next row's average price when the price is NaN
mask1 = df1["model"] == df1["model"].shift(-1)
mask2 = df1["brand"] == df1["brand"].shift(-1)
mask3 = df1["price"].isna()
df1["priceCorr"] = np.where(mask1 & mask2 & mask3 ,df1["price"].shift(-1),df1["price"] )

# Merge everything together
df_out = df_out.merge(df1[["brand", "year", "model","priceCorr"]], on=["brand", "year", "model"])
df_out["price"] = np.where(df_out["price"].isna(),df_out["priceCorr"], df_out["price"])

Solution 3:[3]

Here is a solution that looks simpler:

  1. Sort values in the original dataframe:

    df = df.sort_values(["brand", "model", "year"])
    
  2. Group by "brand" and "model", and store the groups in a variable (to calculate only once):

    groups = df.groupby(["brand", "model"])
    
  3. Fill NaN values using the average of the previous and next rows within each group. Important: this assumes that the rows of a group cover consecutive years, meaning that if you're missing data for 2015 you know the values for 2014 and 2016. If either neighbouring row is absent or is itself null, the value stays null.

    df["price"] = df["price"].fillna((groups["price"].ffill(limit=1) + groups["price"].bfill(limit=1)) / 2)
    

Resulting code:

df = df.sort_values(["brand", "model", "year"])
groups = df.groupby(["brand", "model"])
df["price"] = df["price"].fillna((groups["price"].ffill(limit=1) + groups["price"].bfill(limit=1)) / 2)
print(df)

Output:

      brand model  year    price
2      Ford     Z  2009  2000.30
3  Mercedes     X  2010  1000.00
0  Mercedes     X  2011  1050.15
4  Mercedes     X  2012  1100.30
5  Mercedes     X  2020  3000.50
6   Renault     Q  2011      NaN
1   Renault     Y  2010  1000.40
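The rule from the question (same brand and model, years from year-1 to year+1) can also be applied literally by aggregating per year instead of looking at neighbouring rows. The following is only a sketch and not part of the original answer: the function name fill_with_window_mean and the helper columns total, n and window_mean are made up for illustration, and it assumes brand, model and year contain no missing values. Unlike the loop in the question, every mean is computed from the original prices, so results can differ when a filled value would otherwise feed a later fill.

```python
import pandas as pd

def fill_with_window_mean(df, keys=("brand", "model"), year_col="year", value_col="price"):
    """Fill NaN in value_col with the mean over year-1, year, year+1 of the same keys."""
    group_cols = list(keys) + [year_col]
    # Sum and count of the known prices for every (brand, model, year).
    stats = df.groupby(group_cols)[value_col].agg(total="sum", n="count").reset_index()
    # Re-label each year's stats as year-1, year and year+1, so that every target
    # year collects the stats of itself and of both neighbouring years.
    window = pd.concat(
        [stats.assign(**{year_col: stats[year_col] + offset}) for offset in (-1, 0, 1)]
    )
    window = window.groupby(group_cols, as_index=False)[["total", "n"]].sum()
    # Windows without any known price get NaN instead of a division by zero.
    window["window_mean"] = window["total"] / window["n"].where(window["n"] > 0)
    out = df.merge(window[group_cols + ["window_mean"]], on=group_cols, how="left")
    out.index = df.index  # the left merge keeps the row order, so restore the labels
    out[value_col] = out[value_col].fillna(out["window_mean"])
    return out.drop(columns="window_mean")

# The two data sets from the question.
first = pd.DataFrame({
    "brand": ["Mercedes", "Renault", "Ford", "Mercedes", "Mercedes", "Mercedes", "Renault"],
    "model": ["X", "Y", "Z", "X", "X", "X", "Q"],
    "year": [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    "price": [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None],
})
second = pd.DataFrame({
    "brand": ["Mercedes"] * 5,
    "model": ["X"] * 5,
    "year": [2017, 2018, 2018, 2019, 2019],
    "price": [None, None, None, 1000.0, 1200.50],
})

first_filled = fill_with_window_mean(first)
second_filled = fill_with_window_mean(second)
print(first_filled)
print(second_filled)
```

On the first data set this fills the 2011 Mercedes X price with about 1050.15 and leaves Renault Q as NaN. On the second it fills both 2018 rows with 1100.25 and leaves 2017 as NaN, matching the expected output in the question. The work is a few groupby, concat and merge calls rather than one filter per group, which is why it should scale better than the loop, though this has not been benchmarked here.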

Solution 4:[4]

This is not a pretty solution, but from your description, I believe it would work and be really fast. It's just a lot of ifs inside a np.where on a sorted data frame.

import pandas as pd
import numpy as np

data = pd.DataFrame({
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None]
})

data = data.sort_values(by=['brand', 'model', 'year'])

# Previous and next rows after sorting
prev_row, next_row = data.shift(1), data.shift(-1)

# Fill only when both neighbours share the brand and model and are exactly one year away
fillable = (
    data['price'].isnull()
    & (data['brand'] == prev_row['brand']) & (data['brand'] == next_row['brand'])
    & (data['model'] == prev_row['model']) & (data['model'] == next_row['model'])
    & (data['year'] == prev_row['year'] + 1) & (data['year'] == next_row['year'] - 1)
)
data['price'] = np.where(fillable, (prev_row['price'] + next_row['price']) / 2, data['price'])

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1
Solution 2  braml1
Solution 3
Solution 4  Gustavo Linari Rodrigues