'Cumulative sum with custom window bounds
I am trying to calculate a trailing sum over varying date ranges.
I would like to sum up a data column ("data") based on the days given in another column ("t"), which counts the days between two start and end dates. In the actual data, time windows are approximately one month (last Friday of each month to the next). This is used to calculate stock market index parameters. Next and last are dates for the monthly rebalancing. I use a datetime index.
Please see an example below. The NaN are specific to this extract. I hope this helps.
import pandas as pd
df = pd.DataFrame({'data': [5, 10, 20, 25, 20, 24, 16, 12],
'next': ['2021-02-26', '2021-02-26', '2021-02-26', '2021-03-26', '2021-03-26', '2021-03-26', '2021-03-26', '2021-03-26'],
'last': ['2020-01-29', '2020-01-29', '2020-01-29', '2021-02-26', '2021-02-26', '2021-02-26', '2021-02-26', '2021-02-26'],
't': [26, 27, 28, 1, 2, 3, 4, 5],
'data_expected_sum': ["NaN", "NaN", "NaN", 25, 45, 69, 85, 97]},
index = ['2021-02-24', '2021-02-25', '2021-02-26', '2021-02-27', '2021-02-28', '2021-03-01', '2021-03-02', '2021-03-03'])
So far, I have been unsuccessfully trying to somehow use df.rolling(). I was hoping this would work with custom windows rolling and passing my next and last columns as starting and ending indices but I could not get it to work. I am working in jupyter notebook.
I was thinking something like the below:
from pandas.api.indexers import BaseIndexer
class CustomIndexer(BaseIndexer):
def get_window_bounds(self, num_values, min_periods, center, closed):
start = start_dates
end = end_dates
return start, end
indexer = CustomIndexer()
df['data'] = df['data'].rolling(indexer).sum()
The actual dataset is large and I would need to perform this calculation for multiple data columns but using the same timeframes.
How could I implement this? I am open to any solution.
Edit: I found a solution that works for my case, summing up the values between two dates based on my date index. This is not a rolling sum but suffices for my case, as I only need the sum at these cut-off dates for my following calculations. Yet, if you have a more elegant solution, happy to hear.
df[columns] = df.apply(lambda x: df.loc[(df.index > x.last) &
(df.index <= x.next), columns].sum(), axis=1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
