'How to more generally do a multi-column grouping with interleaving column headers?

This question is a generalization of the question How to add interleaving rows as result of sort groups? and using the accepted answer as a reference to extend from. I would like a more pandas idiomatic rewrite of doing a multi-column grouping with interleaving column headers.

Suppose I have the following toy case as input:

import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({'g1': ['aa', 'aa', 'aa', 'aa', 'bb', 'bb', 'bb', 'bb', 'cc', 'cc', 'cc'],
                   'g2': ['xx', 'xx', 'yy', 'yy', 'xx', 'yy', 'yy', 'zz', 'yy', 'yy', 'zz'],
                   'g3': ['ll', 'll', 'mm', 'mm', 'll', 'll', 'mm', 'kk', 'll', 'mm', 'kk'],
                   'n1': ['e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o'],
                   'num1': [1]*11,
                   'num2': list(range(1, 12))})
print(df)

    g1  g2  g3 n1  num1  num2
0   aa  xx  ll  e     1     1
1   aa  xx  ll  f     1     2
2   aa  yy  mm  g     1     3
3   aa  yy  mm  h     1     4
4   bb  xx  ll  i     1     5
5   bb  yy  ll  j     1     6
6   bb  yy  mm  k     1     7
7   bb  zz  kk  l     1     8
8   cc  yy  ll  m     1     9
9   cc  yy  mm  n     1    10
10  cc  zz  kk  o     1    11

The g* columns are grouping key columns. n1 is the labels I want to keep and where the grouping columns are going to demarcate each subset. The num* are the computation columns that I'd like to aggregate over.

The following code will do what I need and there we have the desired output.

def func(x, name_col, group_by_col, keep_cols):
    data = dict()
    tmp = x.dropna(how='any', axis=0)
    for col in x.columns:
        if is_numeric_dtype(x[col]):
            data[col] = [tmp[col].sum()]
        else:
            if col in keep_cols:
                data[col] = [x[col].values[0]]
            else:
                data[col] = [None]
    data[name_col] = [x[group_by_col].values[0]]
    y = pd.DataFrame(data)
    y = y[x.columns]
    return y.append(x)


out = df.sort_values(by=['g1', 'g2', 'g3']).\
    groupby(by=['g1', 'g2', 'g3'], group_keys=False).apply(lambda x: func(x, 'n1', 'g3', ['g1', 'g2'])).\
    groupby(by=['g1', 'g2'], group_keys=False).apply(lambda x: func(x, 'n1', 'g2', ['g1'])).\
    groupby(by=['g1'], group_keys=False).apply(lambda x: func(x, 'n1', 'g1', [])).\
    reset_index(drop=True)
print(out)

      g1    g2    g3  n1  num1  num2
0   None  None  None  aa     4    10
1     aa  None  None  xx     2     3
2     aa    xx  None  ll     2     3
3     aa    xx    ll   e     1     1
4     aa    xx    ll   f     1     2
5     aa  None  None  yy     2     7
6     aa    yy  None  mm     2     7
7     aa    yy    mm   g     1     3
8     aa    yy    mm   h     1     4
9   None  None  None  bb     4    26
10    bb  None  None  xx     1     5
11    bb    xx  None  ll     1     5
12    bb    xx    ll   i     1     5
13    bb  None  None  yy     2    13
14    bb    yy  None  ll     1     6
15    bb    yy    ll   j     1     6
16    bb    yy  None  mm     1     7
17    bb    yy    mm   k     1     7
18    bb  None  None  zz     1     8
19    bb    zz  None  kk     1     8
20    bb    zz    kk   l     1     8
21  None  None  None  cc     3    30
22    cc  None  None  yy     2    19
23    cc    yy  None  ll     1     9
24    cc    yy    ll   m     1     9
25    cc    yy  None  mm     1    10
26    cc    yy    mm   n     1    10
27    cc  None  None  zz     1    11
28    cc    zz  None  kk     1    11
29    cc    zz    kk   o     1    11

Let's pick the top few rows. The first row n1=aa and the aggregated values are 4 and 10 because that's the sum over the numerical columns when grouping by g1 and with value aa. The second row n1=xx and the aggregated values are 2 and 3 because that's the sum over the numerical columns when grouping by g1,g2 and with values aa,xx and so on. I hope you see the pattern.

Is there a simpler more pandas-idiomatic way to do this?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source