Apply expanding function on dataframe
I have a function that I wish to apply to subsets of a pandas DataFrame, so that for each row the function is calculated on all rows of the same group up to and including the current row, i.e. using a groupby and then expanding.
For example, this dataframe:
import pandas as pd
df = pd.DataFrame.from_dict(
{
'group': ['A','A','A','B','B','B'],
'time': [1,2,3,1,2,3],
'x1': [10,40,30,100,200,300],
'x2': [1,2,1,2,0,3]
}).sort_values('time')
i.e.
group time x1 x2
0 A 1 10 1
3 B 1 100 2
1 A 2 40 2
4 B 2 200 0
2 A 3 30 1
5 B 3 300 3
and this function, for example:
def foo(_df):
return _df['x1'].max() * _df['x2'].iloc[-1]
[Edited for clarity following feedback from jezrael: my actual function is more complicated and cannot easily be broken down into components for this task. This simple function is just for an MCVE.]
I want to do something like:
df['foo_result'] = df.groupby('group').expanding().apply(foo, raw=False)
To obtain this result:
group time x1 x2 foo_result
0 A 1 10 1 10
3 B 1 100 2 200
1 A 2 40 2 80
4 B 2 200 0 0
2 A 3 30 1 40
5 B 3 300 3 900
The problem is that running df.groupby('group').expanding().apply(foo, raw=False) raises KeyError: 'x1'.
Is there a correct way to run this, or is it not possible to do so in pandas without breaking down my function into components?
Solution 1:[1]
Applying a DataFrame function on an expanding window is apparently not possible (at least not in pandas 0.23.0; edit: nor in 1.3.0), as one can see by putting a print statement into the function.
Running df.groupby('group').expanding().apply(lambda x: bool(print(x)) , raw=False) on the given DataFrame (where the bool around the print is just to get a valid return value) prints:
0 1.0
dtype: float64
0 1.0
1 2.0
dtype: float64
0 1.0
1 2.0
2 3.0
dtype: float64
0 10.0
dtype: float64
0 10.0
1 40.0
dtype: float64
0 10.0
1 40.0
2 30.0
dtype: float64
(and so on; the call also returns a DataFrame with 0.0 in each cell, of course, since bool(None) is False).
This shows that the expanding window works column by column (first the expanding time series is printed, then x1, and so on) and never passes a DataFrame to the function, so a function that needs several columns at once can't be applied to it.
So, to get the desired functionality, one would have to put the expanding inside the DataFrame function, like in the accepted answer.
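The accepted answer is not reproduced here, so the following is only a minimal sketch of what "putting the expanding inside the function" could look like, not necessarily what that answer does. The helper name expanding_foo is hypothetical, and the x2 values are the ones shown in the question's printed table. It builds each expanding window by hand with positional slicing and loops over the groups explicitly:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group": ["A", "A", "A", "B", "B", "B"],
        "time": [1, 2, 3, 1, 2, 3],
        "x1": [10, 40, 30, 100, 200, 300],
        "x2": [1, 2, 1, 2, 0, 3],  # values as shown in the printed table
    }
).sort_values("time")


def foo(_df):
    # The example function from the question: it needs two columns at once.
    return _df["x1"].max() * _df["x2"].iloc[-1]


def expanding_foo(group_df):
    # Hypothetical helper: call foo on rows 0..i of the group for each i,
    # i.e. construct the expanding window manually.
    values = [foo(group_df.iloc[: i + 1]) for i in range(len(group_df))]
    return pd.Series(values, index=group_df.index)


# Compute one result Series per group, then let pandas align the
# concatenated results with df on the original index.
parts = [expanding_foo(g) for _, g in df.groupby("group")]
df["foo_result"] = pd.concat(parts)
print(df)
```

Note that this calls foo once per row on a growing slice, so the cost grows roughly quadratically with group size; that is probably fine for small groups but may be slow for large ones.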
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
