Pandas groupby: efficiently chain several functions

I need to group a DataFrame and apply several chained functions on each group.

My problem is basically the same as in the question "pandas - Groupby two functions: apply cumsum then shift on each group".

The answers there show how to obtain a correct result, but they seem to perform suboptimally. My specific question is thus: is there a more efficient way than the ones I describe below?


First, here is some large test data:

from string import ascii_lowercase

import numpy as np
import pandas as pd


n = 100_000_000
np.random.seed(0)
df = pd.DataFrame(
    {
        "x": np.random.choice(np.array([*ascii_lowercase]), size=n),
        "y": np.random.normal(size=n),
    }
)

Below is the runtime of each function on its own:

%timeit df.groupby("x")["y"].cumsum()
4.65 s ± 71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.groupby("x")["y"].shift()
5.29 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A basic solution is to group twice. It seems suboptimal since grouping is a large part of the total runtime and should only be done once.

%timeit df.groupby("x")["y"].cumsum().groupby(df["x"]).shift()
10.1 s ± 63.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
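One possible way to reduce the cost of grouping twice is to compute integer group codes once and reuse them for both groupby calls. The following is only a sketch on a small stand-in frame (`small`, `codes`, `by_codes` and `by_label` are names introduced here for illustration), and it has not been timed against the figures above.

```python
import numpy as np
import pandas as pd

# Small stand-in for the large frame in the question (hypothetical toy data).
rng = np.random.default_rng(0)
small = pd.DataFrame(
    {
        "x": rng.choice(np.array(list("abc")), size=12),
        "y": rng.normal(size=12),
    }
)

# Factorize the string column once. Each groupby below still has to process
# the integer codes, but the string labels are only hashed here.
codes, uniques = pd.factorize(small["x"])

# Chain cumsum and shift, grouping by the precomputed integer codes.
by_codes = small["y"].groupby(codes).cumsum().groupby(codes).shift()

# Reference: the "group twice by label" variant from the question.
by_label = small.groupby("x")["y"].cumsum().groupby(small["x"]).shift()

print(by_codes.equals(by_label))
```

Whether this helps on the full 100-million-row frame would have to be measured. It also assumes the key column has no missing values, since `pd.factorize` encodes them as -1 while groupby drops them by default.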

The accepted answer to the aforementioned question suggests using apply with a custom function to avoid this issue. However, for some reason it actually performs much worse than the previous solution.

def cumsum_shift(s):
    return s.cumsum().shift()

%timeit df.groupby("x")["y"].apply(cumsum_shift)
27.8 s ± 858 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
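Before comparing timings, it is worth confirming that both variants return the same values. Here is a minimal check on a toy frame (`small` is hypothetical data, and `group_keys=False` plus `sort_index()` are assumptions made so that the apply result lines up with the original index regardless of pandas version):

```python
import numpy as np
import pandas as pd


def cumsum_shift(s):
    # Same helper as in the question: cumulative sum, then shift by one row.
    return s.cumsum().shift()


# Toy data with two interleaved groups.
small = pd.DataFrame(
    {
        "x": ["a", "b", "a", "b", "b", "a"],
        "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

# Variant 1: group twice.
twice = small.groupby("x")["y"].cumsum().groupby(small["x"]).shift()

# Variant 2: apply a custom function once per group.
via_apply = (
    small.groupby("x", group_keys=False)["y"].apply(cumsum_shift).sort_index()
)

print(twice.to_numpy())      # first row of each group is NaN
print(via_apply.to_numpy())
```

On this toy input, group "a" holds 1, 3, 6 and group "b" holds 2, 4, 5, so both variants should yield NaN, NaN, 1, 2, 6, 4 in the original row order.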

Do you have any idea how to optimize this code? The performance gains could become quite significant, especially when chaining more than two functions.
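One direction that might be worth benchmarking is to do the grouping work once in NumPy: factorize the keys, sort the rows so each group is contiguous, run the chained operations on the sorted array, and scatter the result back. The sketch below is not a pandas API; `grouped_cumsum_shift` is a hypothetical helper, and it assumes float values with no NaN and keys with no missing entries.

```python
import numpy as np
import pandas as pd


def grouped_cumsum_shift(keys: pd.Series, values: pd.Series) -> pd.Series:
    """Hypothetical helper: per-group cumsum then shift(1), grouping only once."""
    codes, _ = pd.factorize(keys)               # integer group id per row
    order = np.argsort(codes, kind="stable")    # makes each group contiguous
    sorted_codes = codes[order]
    sorted_vals = values.to_numpy(dtype=float)[order]

    # True at the first row of every group in the sorted layout.
    starts = np.empty(len(sorted_codes), dtype=bool)
    starts[:1] = True
    starts[1:] = sorted_codes[1:] != sorted_codes[:-1]

    # Running total over all rows, and the total just before each row.
    total = np.cumsum(sorted_vals)
    prev_total = np.empty_like(total)
    prev_total[:1] = 0.0
    prev_total[1:] = total[:-1]

    # Per-group cumsum = global total minus the total before the group began.
    group_id = np.cumsum(starts) - 1
    offset = prev_total[starts][group_id]
    group_cumsum = total - offset

    # shift(1) inside each group: take the previous row, NaN at group starts.
    shifted = np.empty_like(group_cumsum)
    shifted[1:] = group_cumsum[:-1]
    shifted[starts] = np.nan

    # Scatter back to the original row order.
    out = np.empty_like(shifted)
    out[order] = shifted
    return pd.Series(out, index=values.index, name=values.name)


small = pd.DataFrame(
    {
        "x": ["a", "b", "a", "b", "b", "a"],
        "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)
result = grouped_cumsum_shift(small["x"], small["y"])
reference = small.groupby("x")["y"].cumsum().groupby(small["x"]).shift()
print(np.allclose(result, reference, equal_nan=True))
```

Because the per-group sums are obtained by subtracting from a global running total, results can differ from the groupby version by floating-point rounding, and that error may grow with the number of rows. No timing is claimed for this sketch; it would need to be measured on the real data.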



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
