'Pandas aggregation: return only value in group, raise error if there are several
When wrangling dataframes, I often want to aggregate data while preserving some high-level variables. I don't want to group by these higher level variables, because this is not what I mean.
Is there an existing aggregation function that returns the unique value of a series (in each group), and throws an error if there are multiple different values?
Example
Consider this table:
df = (
pd.DataFrame({"plant_id": [14] * 2 + [15, 16] * 3,
"plant": ["cactus"] * 2 + ["rose", "clover"] * 3,
"spikes": [False, True] + [True, False] * 3})
.sort_values('plant_id')
)
# plant_id plant score spikes
# 0 14 cactus 1 False
# 1 14 cactus 2 True
# 2 15 clover 6 False
# 4 15 clover 7 False
# 6 15 clover 2 False
# 3 16 rose 4 True
# 5 16 rose 9 True
# 7 16 rose 4 True
I would like to get the average score of each plant, keeping both the plant_id and the plant name. One way to do it is to group by both variables.
df.group_by(['plant_id', 'plant']).agg({'score': "mean"})
This could be fine in this simple example, this isn't exactly what I mean.
What I actually want is to group by plant_id, and then to keep the plant
name, knowing that there is always only one plant name per plant id.
Is there an existing function or method that would do something like
unique_value here?
df.groupby('plant_id').agg({'plant': lambda x: unique_value(x),
'score': "mean"})
# plant score
# plant_id
# 14 cactus 1.500000
# 15 clover 5.000000
# 16 rose 5.666667
Ideally, this function would raise an error if I try to apply it to a series that has several values:
df.groupby('plant_id').agg({'plant': lambda x: unique_value(x),
'spikes': lambda x: unique_value(x),
'score': "mean"})
# ValueError: Not unique: some cactuses have spikes, and some don't!
Solution 1:[1]
Use Series.unique with custom function for raise error if not unique values - length is greater like 1:
def unique_value(x):
u = x.unique()
if len(u) > 1:
raise ValueError('Non unique value per group')
else:
return u[0]
df = df.groupby('plant_id').agg({'plant': lambda x: unique_value(x),
'spikes': lambda x: unique_value(x),
'score': "mean"})
print (df)
ValueError: Non unique value per group
def unique_value(x):
u = x.unique()
if len(u) > 1:
raise ValueError('Non unique value per group')
else:
return u[0]
df = df.groupby('plant_id').agg({'plant': lambda x: unique_value(x),
'score': "mean"})
print (df)
plant score
plant_id
14 cactus 1.500000
15 rose 5.666667
16 clover 5.000000
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
