How to do a custom Group By?
My goal is to group a data frame DF by the values of column Name and aggregate a specific column as a sum.
Current data frame
| | Name | Val1 | val2 | val3 |
|---|---|---|---|---|
| 0 | Test | NaN | 5 | NaN |
| 1 | Test | 30 | NaN | 3 |
| 2 | Test | 30 | NaN | 3 |
Expected output
| | Name | Val1 | val2 | val3 |
|---|---|---|---|---|
| 0 | Test | 60 | 5 | 3 |
What I tried
```python
DF.groupby(['Name'], as_index=False)[["Val1"]].sum()
```
returns
| | Name | Val1 |
|---|---|---|
| 0 | Test | 60 |
Issue
I want to treat val2 and val3 as single unique values per group and include them in the grouping, but I don't know how to do so.
Maybe introducing an intermediary DF
| | Name | Val1 | val2 | val3 |
|---|---|---|---|---|
| 0 | Test | NaN | 5 | 3 |
| 1 | Test | 30 | 5 | 3 |
| 2 | Test | 30 | 5 | 3 |
so that the following code can work:
```python
DF.groupby(['Name','val2','val3'], as_index=False)[["Val1"]].sum()
```
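The intermediary frame described above can be sketched with `groupby(...).transform('first')`, which broadcasts each group's first non-missing value back to every row. This is only one possible approach, and it assumes each Name has at most one distinct non-missing value in val2 and val3; the sample data is reconstructed from the tables in the question:

```python
import pandas as pd

# Sample frame reconstructed from the question
DF = pd.DataFrame({
    'Name': ['Test', 'Test', 'Test'],
    'Val1': [float('nan'), 30, 30],
    'val2': [5, float('nan'), float('nan')],
    'val3': [float('nan'), 3, 3],
})

# Build the intermediary frame: within each Name group, fill val2/val3
# with that group's first non-missing value.
DF[['val2', 'val3']] = DF.groupby('Name')[['val2', 'val3']].transform('first')

# Now the three-column groupby collapses to one row per Name
result = DF.groupby(['Name', 'val2', 'val3'], as_index=False)[['Val1']].sum()
print(result)
```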
Keep in mind that my data frame has several values for Name in it.
What is the best way to do this?
Solution 1:[1]
If I understand correctly, there is only one unique non-missing value in each of the val2 and val3 columns per group. Otherwise your question does not make much sense, because you did not specify how to decide which value to take from these columns.
Given these constraints, you can use:
```python
result = df.groupby('Name', as_index=False).agg({'Val1': 'sum', 'val2': 'first', 'val3': 'first'})
```
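Applied to the sample frame from the question (reconstructed below), this produces the expected single row. `'first'` returns the first non-missing value per group, so the NaNs in val2 and val3 are skipped:

```python
import pandas as pd

# Sample frame reconstructed from the question
df = pd.DataFrame({
    'Name': ['Test', 'Test', 'Test'],
    'Val1': [float('nan'), 30, 30],
    'val2': [5, float('nan'), float('nan')],
    'val3': [float('nan'), 3, 3],
})

# 'sum' ignores NaN by default; 'first' takes each group's
# first non-missing value
result = df.groupby('Name', as_index=False).agg(
    {'Val1': 'sum', 'val2': 'first', 'val3': 'first'})
print(result)
```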
Solution 2:[2]
To group by one or multiple columns while aggregating others, you can use `groupby` followed by `aggregate` (or its alias `agg`).
Example
Given input:
```text
  Waiter  Revenue  Hours   Tables Gender
0  Alice     3000    3.0  Outside      f
1    Bob     2000    4.0   Inside      m
2   Alex     1000    2.0   Inside      d
3   Alex      500    0.5  Outside   None
```
Expected output:
```text
        Revenue  Hours             Tables Gender
Waiter
Alex       1500   1.25  {Inside, Outside}      d
Alice      3000   3.00          {Outside}      f
Bob        2000   4.00           {Inside}      m
```
Grouped by Waiter, this shows:
- the sum of `Revenue`
- the mean of worked `Hours`
- the set of unique `Tables` served
- the first (non-missing) value for `Gender`
How to group-by with specific aggregation
Code:
```python
import pandas as pd

df = pd.DataFrame({
    'Waiter': ['Alice', 'Bob', 'Alex', 'Alex'],
    'Revenue': [3000, 2000, 1000, 500],
    'Hours': [3, 4, 2, 0.5],
    'Tables': ['Outside', 'Inside', 'Inside', 'Outside'],
    'Gender': ['f', 'm', 'd', None],
})

df.groupby(['Waiter']).agg({
    'Revenue': 'sum',
    'Hours': 'mean',
    'Tables': lambda x: set(x),
    'Gender': 'first',
})
```
Explained:
- `groupby` can be done by multiple columns or a single column; here only `Waiter`.
- `agg` (or `aggregate`) takes a dict defining the aggregation for each column. The column is specified as a key like `Revenue`, with the aggregate function given either as a function name such as `'sum'` (in quotes), as a reference like `list`, or as a lambda like `lambda x: set(x)`.
Note: to get the list of Tables we could also use one of the following as the aggregation function:
- `'unique'` produces an array with the unique values (see `Series.unique`)
- `set` for a set, as we did with the lambda
- `list` for a list (which may contain duplicates)
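As a sketch, the three variants can be compared side by side using pandas named aggregation (the output column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Waiter': ['Alice', 'Bob', 'Alex', 'Alex'],
    'Tables': ['Outside', 'Inside', 'Inside', 'Outside'],
})

out = df.groupby('Waiter').agg(
    unique_tables=('Tables', 'unique'),      # array of unique values
    table_set=('Tables', lambda x: set(x)),  # set (no duplicates)
    table_list=('Tables', list),             # list (may contain duplicates)
)
print(out)
```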
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | timgeb |
| Solution 2 | hc_dev |
