Precise 12-month rolling sum with pandas groupby
Is there a way in pandas to compute a rolling sum over 12 calendar months rather than 365 days (which is suggested here)? For simplicity I show just one group, but the code should work with multiple groups. The real data spans about a century, so the one-day errors add up.
Example
Note the switch from the end of month to beginning of month around the end of 2020.
import pandas as pd

data = {'date': ['2020-01-31', '2020-02-29', '2020-03-31',
                 '2020-04-30', '2020-05-31', '2020-06-30',
                 '2020-07-31', '2020-08-31', '2020-09-30',
                 '2020-10-31', '2020-11-30', '2020-12-31',
                 '2021-01-01'],
        'values': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        'group': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data, columns=['date', 'values', 'group'])
df['date'] = pd.to_datetime(df['date'])
# Time-based rolling windows need a sorted datetime index
df = df.sort_values('date').set_index('date')
# 365-day window per group: 2020 is a leap year, so this window is not exactly 12 months
df.groupby('group').rolling("365d", min_periods=12).sum()[['values']]
The input
date values group
0 2020-01-31 1 1
1 2020-02-29 1 1
2 2020-03-31 1 1
3 2020-04-30 1 1
4 2020-05-31 1 1
5 2020-06-30 1 1
6 2020-07-31 1 1
7 2020-08-31 1 1
8 2020-09-30 1 1
9 2020-10-31 1 1
10 2020-11-30 1 1
11 2020-12-31 1 1
12 2021-01-01 1 1
is transformed to
values
group date
1 2020-01-31 NaN
2020-02-29 NaN
2020-03-31 NaN
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 NaN
2020-07-31 NaN
2020-08-31 NaN
2020-09-30 NaN
2020-10-31 NaN
2020-11-30 NaN
2020-12-31 12.0
2021-01-01 13.0
The desired output is
values
group date
1 2020-01-31 NaN
2020-02-29 NaN
2020-03-31 NaN
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 NaN
2020-07-31 NaN
2020-08-31 NaN
2020-09-30 NaN
2020-10-31 NaN
2020-11-30 NaN
2020-12-31 12.0
2021-01-01 12.0
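One way to get a window measured in calendar months rather than days is to bucket the dates by month first and roll over the month buckets. The following is only a minimal sketch under some assumptions: the function name `rolling_12_month_sum`, the helper column `month_id` and the output column `rolling_12m` are made up for illustration, `date` is kept as an ordinary column rather than the index, and rows that share a month within a group are summed into one monthly total.

```python
import pandas as pd

def rolling_12_month_sum(df, date_col="date", value_col="values", group_col="group"):
    """Sum value_col over the 12 calendar months ending in each row's month."""
    out = df.copy()
    # Integer month id: consecutive calendar months differ by exactly 1.
    out["month_id"] = (out[date_col].dt.year * 12 + out[date_col].dt.month).astype("int64")
    # One total per group and calendar month.
    monthly = out.groupby([group_col, "month_id"])[value_col].sum()

    pieces = []
    for key, s in monthly.groupby(level=0):
        s = s.droplevel(0)
        # Reindex onto every month in the group's span; absent months become NaN.
        full = list(range(int(s.index.min()), int(s.index.max()) + 1))
        rolled = s.reindex(full).rolling(12, min_periods=12).sum()
        pieces.append(pd.DataFrame({group_col: key,
                                    "month_id": full,
                                    "rolling_12m": rolled.to_numpy()}))
    rolled_all = pd.concat(pieces, ignore_index=True)
    # Attach the monthly rolling sum back to the original rows.
    merged = out.merge(rolled_all, on=[group_col, "month_id"], how="left")
    return merged.drop(columns="month_id")

# Same dates as the example above: month ends in 2020, then 2021-01-01.
dates = ["2020-01-31", "2020-02-29", "2020-03-31", "2020-04-30",
         "2020-05-31", "2020-06-30", "2020-07-31", "2020-08-31",
         "2020-09-30", "2020-10-31", "2020-11-30", "2020-12-31",
         "2021-01-01"]
df = pd.DataFrame({"date": pd.to_datetime(dates), "values": 1, "group": 1})
result = rolling_12_month_sum(df)
print(result[["group", "date", "rolling_12m"]])
```

On this input the first eleven rows come out as NaN and the rows for 2020-12-31 and 2021-01-01 both come out as 12.0, because 2021-01-01 falls in the January 2021 bucket and the January 2020 bucket has dropped out of its window.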
Edit
I also need to handle missing months: when a month is absent, the sum should start from scratch. Note that February 2020 is missing in the following example:
import pandas as pd

data = {'date': ['2020-01-31', '2020-03-31',
                 '2020-04-30', '2020-05-31', '2020-06-30',
                 '2020-07-31', '2020-08-31', '2020-09-30',
                 '2020-10-31', '2020-11-30', '2020-12-31',
                 '2021-01-31', '2021-02-28', '2021-03-31'],
        'values': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        'group': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data, columns=['date', 'values', 'group'])
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date').set_index('date')
# Fixed window of 12 rows per group: counts rows, so it ignores the missing month
df.groupby('group').rolling(12, min_periods=12).sum()[['values']]
Output:
values
group date
1 2020-01-31 NaN
2020-03-31 NaN
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 NaN
2020-07-31 NaN
2020-08-31 NaN
2020-09-30 NaN
2020-10-31 NaN
2020-11-30 NaN
2020-12-31 NaN
2021-01-31 12.0
2021-02-28 12.0
2021-03-31 12.0
Desired output:
values
group date
1 2020-01-31 NaN
2020-03-31 NaN
2020-04-30 NaN
2020-05-31 NaN
2020-06-30 NaN
2020-07-31 NaN
2020-08-31 NaN
2020-09-30 NaN
2020-10-31 NaN
2020-11-30 NaN
2020-12-31 NaN
2021-01-31 NaN
2021-02-28 NaN
2021-03-31 12.0
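The "start from scratch after a gap" behaviour can be sketched by splitting each group into runs of consecutive calendar months and rolling only within a run. This is an illustrative sketch, not a definitive answer: the function name `rolling_sum_reset_on_gap` and the helper columns `run` and `rolling_12m` are hypothetical, `date` is assumed to be an ordinary column, and each group is assumed to have at most one row per calendar month.

```python
import pandas as pd

def rolling_sum_reset_on_gap(df, window=12, date_col="date",
                             value_col="values", group_col="group"):
    """Rolling sum over `window` consecutive months; restarts after a missing month."""
    out = df.sort_values([group_col, date_col]).copy()
    # Integer month id: consecutive calendar months differ by exactly 1.
    month_id = out[date_col].dt.year * 12 + out[date_col].dt.month
    # Difference to the previous row of the same group (NaN on each group's first row).
    step = month_id.groupby(out[group_col]).diff()
    # A new run starts wherever the previous row is not the preceding month.
    out["run"] = (step != 1).cumsum()
    # Roll inside each run only, so a gap resets the count of available months.
    out["rolling_12m"] = (
        out.groupby([group_col, "run"])[value_col]
           .transform(lambda s: s.rolling(window, min_periods=window).sum())
    )
    return out.drop(columns="run")

# Same dates as the example above: February 2020 is missing.
dates = ["2020-01-31", "2020-03-31", "2020-04-30", "2020-05-31",
         "2020-06-30", "2020-07-31", "2020-08-31", "2020-09-30",
         "2020-10-31", "2020-11-30", "2020-12-31", "2021-01-31",
         "2021-02-28", "2021-03-31"]
df = pd.DataFrame({"date": pd.to_datetime(dates), "values": 1, "group": 1})
result = rolling_sum_reset_on_gap(df)
print(result[["group", "date", "rolling_12m"]])
```

Under this rule the first complete window ends at 2021-02-28, since March 2020 through February 2021 is twelve consecutive months, so the sketch yields 12.0 for both 2021-02-28 and 2021-03-31. That is one row earlier than the desired output listed above, so the run-length convention may need adjusting if that table is the intended result.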
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
