How can I use groupby with multiple values in a column in pandas?
I have a dataframe as follows:
import pandas as pd
data = {
'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None]
}
df = pd.DataFrame(data)
print(df)
brand model year price
0 Mercedes X 2011 NaN
1 Renault Y 2010 1000.4
2 Ford Z 2009 2000.3
3 Mercedes X 2010 1000.0
4 Mercedes X 2012 1100.3
5 Mercedes X 2020 3000.5
6 Renault Q 2011 NaN
And here is another case to test your solution:
data = {
'brand': ['Mercedes', 'Mercedes', 'Mercedes', 'Mercedes', 'Mercedes'],
'model': ['X', 'X', 'X', 'X', 'X'], 'year': [2017, 2018, 2018, 2019, 2019],
'price': [None, None, None, 1000.0, 1200.50]
}
Expected output:
brand model year price
0 Mercedes X 2017 NaN
1 Mercedes X 2018 1100.25
2 Mercedes X 2018 1100.25
3 Mercedes X 2019 1000.00
4 Mercedes X 2019 1200.50
I want to fill each missing value with the average of the observations from year-1, year, and year+1 that have the same brand and model. For instance, the Mercedes X model has a null price in 2011. Looking at the data:
2011 - 1 = 2010
2011 + 1 = 2012
The 4th observation -> Mercedes,X,2010,1000.0
The 5th observation -> Mercedes,X,2012,1100.3
The mean -> (1000.0 + 1100.3) / 2 = 1050.15
I've tried something as follows:
for c_key, _ in df.groupby(['brand', 'model', 'year']):
    fc = (
        (df['brand'] == c_key[0])
        & (df['model'] == c_key[1])
        & (df['year'].isin([c_key[2] + 1, c_key[2], c_key[2] - 1]))
    )
    sc = (
        (df['brand'] == c_key[0])
        & (df['model'] == c_key[1])
        & (df['year'] == c_key[2])
        & (df['price'].isnull())
    )
    mean_val = df[fc]['price'].mean()
    df.loc[sc, 'price'] = mean_val
print(df)
brand model year price
0 Mercedes X 2011 1050.15
1 Renault Y 2010 1000.40
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 NaN
But this solution takes a long time for 90,000 rows and 27 columns, so is there a more efficient solution? For instance, can I use groupby with the values year-1, year, year+1, brand, and model?
Thanks in advance.
Solution 1:[1]
I think a more efficient way would be to sort by brand and then year, and then use interpolate:
df = df.sort_values(['brand', 'year']).groupby('brand').apply(lambda g: g.interpolate(limit_area='inside'))
Output:
>>> df
brand model year price
0 Mercedes X 2011 1050.15
1 Renault Y 2010 1000.40
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 1000.40
That also handles all the columns.
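The same fill can also be done per column with transform, which leaves the non-numeric columns untouched and keeps the original row index. This is my adaptation, not the answerer's exact code, shown on the question's first dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None],
})

df = df.sort_values(['brand', 'model', 'year'])
# Interpolate price within each brand/model group; limit_area='inside'
# only fills NaNs that have valid values on both sides.
df['price'] = df.groupby(['brand', 'model'])['price'].transform(
    lambda s: s.interpolate(limit_area='inside'))
print(round(df.loc[0, 'price'], 2))  # 1050.15 (Mercedes X, 2011)
```

Note that with limit_area='inside', a group whose values are all NaN (Renault Q here) stays NaN, since there is nothing to interpolate between.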
Solution 2:[2]
Based on @richardec's solution, but with an addition that corrects the price when the next year's price is known. Not sure if it's faster than your original solution, though.
import numpy as np

# Make an interpolated average
df_out = df.sort_values(['brand', 'year']).groupby('brand').apply(lambda g: g.interpolate(limit_area='inside'))
# Make an average per brand/year/model
df1 = df.sort_values(['brand', 'year']).groupby(['brand','year','model']).mean().reset_index()
# Check if the next row has the same brand and model. If so, take the next
# average price when the price is NaN
mask1 = df1["model"] == df1["model"].shift(-1)
mask2 = df1["brand"] == df1["brand"].shift(-1)
mask3 = df1["price"].isna()
df1["priceCorr"] = np.where(mask1 & mask2 & mask3, df1["price"].shift(-1), df1["price"])
# Merge everything together
df_out = df_out.merge(df1[["brand", "year", "model","priceCorr"]], on=["brand", "year", "model"])
df_out["price"] = np.where(df_out["price"].isna(),df_out["priceCorr"], df_out["price"])
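Putting this together as a runnable sketch against the question's second test case. This is an adaptation rather than the verbatim answer: it interpolates only the price column (via transform) so the string columns are never touched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'brand': ['Mercedes'] * 5,
    'model': ['X'] * 5,
    'year': [2017, 2018, 2018, 2019, 2019],
    'price': [None, None, None, 1000.0, 1200.50],
})

# Interpolated average, restricted to the price column
df_out = df.sort_values(['brand', 'model', 'year']).copy()
df_out['price'] = df_out.groupby(['brand', 'model'])['price'].transform(
    lambda s: s.interpolate(limit_area='inside'))

# Average price per brand/year/model
df1 = df.sort_values(['brand', 'year']).groupby(
    ['brand', 'year', 'model'], as_index=False)['price'].mean()

# If the next row is the same brand/model and this price is NaN, borrow it
mask = (
    (df1['model'] == df1['model'].shift(-1))
    & (df1['brand'] == df1['brand'].shift(-1))
    & df1['price'].isna()
)
df1['priceCorr'] = np.where(mask, df1['price'].shift(-1), df1['price'])

# Merge everything together
df_out = df_out.merge(df1[['brand', 'year', 'model', 'priceCorr']],
                      on=['brand', 'year', 'model'])
df_out['price'] = df_out['price'].fillna(df_out['priceCorr'])
print(df_out['price'].tolist())  # [nan, 1100.25, 1100.25, 1000.0, 1200.5]
```

The 2018 rows get the 2019 average (1100.25), matching the question's expected output, while 2017 stays NaN because its next year (2018) also has no known price.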
Solution 3:[3]
Here is a solution that looks simpler:
Sort values in the original dataframe:
df = df.sort_values(["brand", "model", "year"])
Group by "brand" and "model", and store the groups in a variable (to compute them only once):
groups = df.groupby(["brand", "model"])
Fill NaN values using the average of the previous and next rows (important: this assumes you have data for consecutive years, meaning that if you're missing data for 2015 you know the values of 2014 and 2016; if you have no data for consecutive years, null values will remain null):
df["price"] = df["price"].fillna((groups["price"].ffill(limit=1) + groups["price"].bfill(limit=1)) / 2)
Resulting code:
df = df.sort_values(["brand", "model", "year"])
groups = df.groupby(["brand", "model"])
df["price"] = df["price"].fillna((groups["price"].ffill(limit=1) + groups["price"].bfill(limit=1)) / 2)
print(df)
Output:
brand model year price
2 Ford Z 2009 2000.30
3 Mercedes X 2010 1000.00
0 Mercedes X 2011 1050.15
4 Mercedes X 2012 1100.30
5 Mercedes X 2020 3000.50
6 Renault Q 2011 NaN
1 Renault Y 2010 1000.40
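The ffill/bfill trick works because, for a NaN bounded by values on both sides, ffill(limit=1) copies in the previous value and bfill(limit=1) copies in the next value, so their mean is the neighbour average. A minimal illustration on a bare Series:

```python
import pandas as pd

s = pd.Series([1000.0, None, 1100.3])
# ffill(limit=1) fills the gap with the previous value,
# bfill(limit=1) fills it with the next value; the mean of the
# two filled series is the neighbour average at the gap.
neighbour_avg = (s.ffill(limit=1) + s.bfill(limit=1)) / 2
print(round(neighbour_avg[1], 2))  # 1050.15
```

With limit=1, a run of two or more consecutive NaNs is only partially filled by each direction, so the sum stays NaN there and fillna leaves those rows untouched.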
Solution 4:[4]
This is not a pretty solution, but from your description, I believe it would work and be really fast. It's just a lot of conditions inside an np.where on a sorted dataframe.
import pandas as pd
import numpy as np
data = pd.DataFrame({
'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None]
})
data = data.sort_values(by=['brand', 'model', 'year'])
data['adjusted_price'] = np.where(
    data['price'].isnull()
    & (data['brand'] == data['brand'].shift(1)) & (data['brand'] == data['brand'].shift(-1))
    & (data['model'] == data['model'].shift(1)) & (data['model'] == data['model'].shift(-1))
    & (data['year'] == (data['year'].shift(1) + 1)) & (data['year'] == (data['year'].shift(-1) - 1)),
    (data['price'].shift(1) + data['price'].shift(-1)) / 2,
    data['price'])
data['price'] = data['adjusted_price']
data = data.drop(['adjusted_price'], axis=1)
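The same idea can be written with per-group shifts, which makes the brand/model boundary checks implicit because a grouped shift never crosses group boundaries. A sketch of mine under the same one-year-neighbour assumption:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Mercedes', 'Renault', 'Ford', 'Mercedes', 'Mercedes', 'Mercedes', 'Renault'],
    'model': ['X', 'Y', 'Z', 'X', 'X', 'X', 'Q'],
    'year': [2011, 2010, 2009, 2010, 2012, 2020, 2011],
    'price': [None, 1000.4, 2000.3, 1000.0, 1100.3, 3000.5, None],
})

df = df.sort_values(['brand', 'model', 'year'])
grp = df.groupby(['brand', 'model'])
# Grouped shift() stays within each brand/model group, so no explicit
# brand/model comparisons against neighbouring rows are needed.
prev_price, next_price = grp['price'].shift(1), grp['price'].shift(-1)
prev_year, next_year = grp['year'].shift(1), grp['year'].shift(-1)
fill = (
    df['price'].isna()
    & (df['year'] == prev_year + 1)
    & (df['year'] == next_year - 1)
)
df.loc[fill, 'price'] = ((prev_price + next_price) / 2)[fill]
print(round(df.loc[0, 'price'], 2))  # 1050.15
```

Rows at a group edge (like the lone Renault Q) get NaN from the shifts, the comparisons evaluate to False, and the price is left as NaN, mirroring the behaviour of the np.where version above.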
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | braml1 |
| Solution 3 | |
| Solution 4 | Gustavo Linari Rodrigues |
