Efficient way of applying arbitrary functions to a pandas DataFrameGroupBy object?
I have a dataframe with an 'id' column and several other columns. For each id, I need to compute a set of features from the corresponding rows. The features can be complicated functions, not just simple aggregations.
Ideally, the features should be computed reasonably efficiently and in a transparent way, i.e. how each feature is computed from the data should be defined in a single place.
My approach is shown below: feature computations are defined in e.g. a dictionary, which is then applied to each group in a groupby loop (possibly parallelizing that loop).
Is this a reasonable approach, or can it be made more efficient?
import pandas as pd
example_data = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "a": [10, 20, 30, 40],
    "b": ["foo", "bar", "baz", "meh"]})
# Dictionary mapping feature names to functions that compute them
feature_definitions = {
    "sum_a": lambda df: df["a"].sum(),
    "mean_a": lambda df: df["a"].mean(),
    "b_counts": lambda df: sum(s.count("b") for s in df["b"]),
    # ... could be many more here
}
def compute(df):
    # Compute features from a given dataframe
    res = pd.DataFrame({header: [fun(df)] for header, fun in feature_definitions.items()})
    return res
# Compute features for each id
parts = []
for id_, df in example_data.groupby("id"):
    part = compute(df)
    part["id"] = id_
    parts.append(part)
# Combine results into one dataframe
result = pd.concat(parts)
result.set_index("id", inplace=True)
print(result)
# sum_a mean_a b_counts
# id
# 1 10 10.0 0
# 2 50 25.0 2
# 3 40 40.0 0
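For comparison, the explicit loop plus `pd.concat` can also be written with `groupby().apply()`, letting pandas assemble the result. This is a sketch using the same `feature_definitions` dictionary, with `compute` returning a `pd.Series` (one value per feature) so that `apply` stacks the per-group results into a DataFrame indexed by id:

```python
import pandas as pd

example_data = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "a": [10, 20, 30, 40],
    "b": ["foo", "bar", "baz", "meh"]})

feature_definitions = {
    "sum_a": lambda df: df["a"].sum(),
    "mean_a": lambda df: df["a"].mean(),
    "b_counts": lambda df: sum(s.count("b") for s in df["b"]),
}

def compute(df):
    # Returning a Series (one entry per feature) lets apply()
    # build one row per group, indexed by the group key.
    return pd.Series({name: fun(df) for name, fun in feature_definitions.items()})

result = example_data.groupby("id").apply(compute)
print(result)
```

Note that `apply` upcasts each group's features to a common dtype (here float, because of `mean_a`), and recent pandas versions warn about the grouping column being passed to `compute`. This mostly saves boilerplate rather than time; if the feature functions themselves are expensive, the explicit loop is easier to parallelize (e.g. with `multiprocessing` or joblib, bearing in mind that lambdas are not picklable, so the feature functions would need to be module-level `def`s).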
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow