Efficient way of applying arbitrary functions to a pandas DataFrameGroupBy object?
I have a dataframe with an 'id' column and several other columns. For each id, I need to compute a set of features from the corresponding rows. The features can be complicated functions, not just simple aggregations.
Ideally, the features should be computed reasonably efficiently and in a transparent way, i.e. how each feature is computed from the data should be defined in a single place.
My approach is shown below: feature computations are defined in e.g. a dictionary, which is then applied to each group in a groupby loop (possibly parallelizing that loop).
Is this a reasonable approach, or can it be made more efficient?
import pandas as pd
example_data = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "a": [10, 20, 30, 40],
    "b": ["foo", "bar", "baz", "meh"]})
# Dictionary mapping feature names to functions that compute them
feature_definitions = {
    "sum_a": lambda df: df["a"].sum(),
    "mean_a": lambda df: df["a"].mean(),
    "b_counts": lambda df: sum(s.count("b") for s in df["b"]),
    # ... could be many more here
}
def compute(df):
    # Compute features from a given dataframe
    res = pd.DataFrame({header: [fun(df)] for header, fun in feature_definitions.items()})
    return res
# Compute features for each id
parts = []
for id_, df in example_data.groupby("id"):
    part = compute(df)
    part["id"] = id_
    parts.append(part)
# Combine results into one dataframe
result = pd.concat(parts)
result.set_index("id", inplace=True)
print(result)
# sum_a mean_a b_counts
# id
# 1 10 10.0 0
# 2 50 25.0 2
# 3 40 40.0 0
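For comparison, the explicit loop plus `pd.concat` can also be written with `groupby().apply()`, letting pandas assemble the result. This is a sketch using the same `feature_definitions` dictionary, with `compute` returning a `pd.Series` (one value per feature) so that `apply` stacks the per-group results into a DataFrame indexed by id:

```python
import pandas as pd

example_data = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "a": [10, 20, 30, 40],
    "b": ["foo", "bar", "baz", "meh"]})

feature_definitions = {
    "sum_a": lambda df: df["a"].sum(),
    "mean_a": lambda df: df["a"].mean(),
    "b_counts": lambda df: sum(s.count("b") for s in df["b"]),
}

def compute(df):
    # Returning a Series (one entry per feature) lets apply()
    # build one row per group, indexed by the group key.
    return pd.Series({name: fun(df) for name, fun in feature_definitions.items()})

result = example_data.groupby("id").apply(compute)
print(result)
```

Note that `apply` upcasts each group's features to a common dtype (here float, because of `mean_a`), and recent pandas versions warn about the grouping column being passed to `compute`. This mostly saves boilerplate rather than time; if the feature functions themselves are expensive, the explicit loop is easier to parallelize (e.g. with `multiprocessing` or joblib, bearing in mind that lambdas are not picklable, so the feature functions would need to be module-level `def`s).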
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow