'How to more generally do a multi-column grouping with interleaving column headers?
This question is a generalization of the question How to add interleaving rows as result of sort groups? and using the accepted answer as a reference to extend from. I would like a more pandas idiomatic rewrite of doing a multi-column grouping with interleaving column headers.
Suppose I have the following toy case as input:
import pandas as pd
from pandas.api.types import is_numeric_dtype
df = pd.DataFrame({'g1': ['aa', 'aa', 'aa', 'aa', 'bb', 'bb', 'bb', 'bb', 'cc', 'cc', 'cc'],
'g2': ['xx', 'xx', 'yy', 'yy', 'xx', 'yy', 'yy', 'zz', 'yy', 'yy', 'zz'],
'g3': ['ll', 'll', 'mm', 'mm', 'll', 'll', 'mm', 'kk', 'll', 'mm', 'kk'],
'n1': ['e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o'],
'num1': [1]*11,
'num2': list(range(1, 12))})
print(df)
g1 g2 g3 n1 num1 num2
0 aa xx ll e 1 1
1 aa xx ll f 1 2
2 aa yy mm g 1 3
3 aa yy mm h 1 4
4 bb xx ll i 1 5
5 bb yy ll j 1 6
6 bb yy mm k 1 7
7 bb zz kk l 1 8
8 cc yy ll m 1 9
9 cc yy mm n 1 10
10 cc zz kk o 1 11
The g* columns are grouping key columns. n1 is the labels I want to keep and where the grouping columns are going to demarcate each subset. The num* are the computation columns that I'd like to aggregate over.
The following code will do what I need and there we have the desired output.
def func(x, name_col, group_by_col, keep_cols):
data = dict()
tmp = x.dropna(how='any', axis=0)
for col in x.columns:
if is_numeric_dtype(x[col]):
data[col] = [tmp[col].sum()]
else:
if col in keep_cols:
data[col] = [x[col].values[0]]
else:
data[col] = [None]
data[name_col] = [x[group_by_col].values[0]]
y = pd.DataFrame(data)
y = y[x.columns]
return y.append(x)
out = df.sort_values(by=['g1', 'g2', 'g3']).\
groupby(by=['g1', 'g2', 'g3'], group_keys=False).apply(lambda x: func(x, 'n1', 'g3', ['g1', 'g2'])).\
groupby(by=['g1', 'g2'], group_keys=False).apply(lambda x: func(x, 'n1', 'g2', ['g1'])).\
groupby(by=['g1'], group_keys=False).apply(lambda x: func(x, 'n1', 'g1', [])).\
reset_index(drop=True)
print(out)
g1 g2 g3 n1 num1 num2
0 None None None aa 4 10
1 aa None None xx 2 3
2 aa xx None ll 2 3
3 aa xx ll e 1 1
4 aa xx ll f 1 2
5 aa None None yy 2 7
6 aa yy None mm 2 7
7 aa yy mm g 1 3
8 aa yy mm h 1 4
9 None None None bb 4 26
10 bb None None xx 1 5
11 bb xx None ll 1 5
12 bb xx ll i 1 5
13 bb None None yy 2 13
14 bb yy None ll 1 6
15 bb yy ll j 1 6
16 bb yy None mm 1 7
17 bb yy mm k 1 7
18 bb None None zz 1 8
19 bb zz None kk 1 8
20 bb zz kk l 1 8
21 None None None cc 3 30
22 cc None None yy 2 19
23 cc yy None ll 1 9
24 cc yy ll m 1 9
25 cc yy None mm 1 10
26 cc yy mm n 1 10
27 cc None None zz 1 11
28 cc zz None kk 1 11
29 cc zz kk o 1 11
Let's pick the top few rows. The first row n1=aa and the aggregated values are 4 and 10 because that's the sum over the numerical columns when grouping by g1 and with value aa. The second row n1=xx and the aggregated values are 2 and 3 because that's the sum over the numerical columns when grouping by g1,g2 and with values aa,xx and so on. I hope you see the pattern.
Is there a simpler more pandas-idiomatic way to do this?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
