'Pandas aggregation: return only value in group, raise error if there are several

When wrangling dataframes, I often want to aggregate data while preserving some high-level variables. I don't want to group by these higher level variables, because this is not what I mean.

Is there an existing aggregation function that returns the unique value of a series (in each group), and throws an error if there are multiple different values?

Example

Consider this table:

df = (
    pd.DataFrame({"plant_id": [14] * 2 + [15, 16] * 3,
                  "plant": ["cactus"] * 2 + ["rose", "clover"] * 3,
                  "spikes": [False, True] + [True, False] * 3})
    .sort_values('plant_id')
)

#    plant_id   plant  score  spikes
# 0        14  cactus      1   False
# 1        14  cactus      2    True
# 2        15  clover      6   False
# 4        15  clover      7   False
# 6        15  clover      2   False
# 3        16    rose      4    True
# 5        16    rose      9    True
# 7        16    rose      4    True

I would like to get the average score of each plant, keeping both the plant_id and the plant name. One way to do it is to group by both variables.

df.group_by(['plant_id', 'plant']).agg({'score': "mean"})

This could be fine in this simple example, this isn't exactly what I mean. What I actually want is to group by plant_id, and then to keep the plant name, knowing that there is always only one plant name per plant id. Is there an existing function or method that would do something like unique_value here?

df.groupby('plant_id').agg({'plant': lambda x: unique_value(x),
                            'score': "mean"})

#            plant     score
# plant_id                  
# 14        cactus  1.500000
# 15        clover  5.000000
# 16          rose  5.666667

Ideally, this function would raise an error if I try to apply it to a series that has several values:

df.groupby('plant_id').agg({'plant': lambda x: unique_value(x),
                            'spikes': lambda x: unique_value(x),
                            'score': "mean"})

# ValueError: Not unique: some cactuses have spikes, and some don't!


Solution 1:[1]

Use Series.unique with custom function for raise error if not unique values - length is greater like 1:

def unique_value(x):
    u = x.unique()
    if len(u) > 1:
        raise ValueError('Non unique value per group')
    else:
        return u[0]

df = df.groupby('plant_id').agg({'plant': lambda x: unique_value(x),
                            'spikes': lambda x: unique_value(x),
                            'score': "mean"})
print (df)

ValueError: Non unique value per group


def unique_value(x):
    u = x.unique()
    if len(u) > 1:
        raise ValueError('Non unique value per group')
    else:
        return u[0]

df = df.groupby('plant_id').agg({'plant': lambda x: unique_value(x),
                                 'score': "mean"})
print (df)
           plant     score
plant_id                  
14        cactus  1.500000
15          rose  5.666667
16        clover  5.000000

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1