'How to make a custom sum in pandas?
I'm trying to create a running sum on a Dataframe that follows some conditions, sum all previous values with same 'first index current day' value and condiction is true, the pseudocode would be something like
df_ = df[first index current day:current index] #Delimiter dataframe range from begin of day until current row
df_ = df_[df_['toSum'] == True] #drop False
result = sum(df_['sales']) #Sum all sales until now with 'toSum' == True
This is very easy using a loop or even a apply but take too much time, my dataframe have more than 1000k rows, the best I could do until now is use threads but if it's possible to vectorize that algorithm it would be a lot fastter
The result is something like this:
| Actual Id | First ID Actual Day | Sales | toSum | AcumulativeDay |
|---|---|---|---|---|
| 0 | 0 | 1 | True | 1 |
| 1 | 0 | 1 | True | 2 |
| 2 | 0 | 1 | False | 2 |
| 3 | 0 | 1 | False | 2 |
| 4 | 0 | 1 | True | 3 |
| 5 | 5 | 1 | True | 1 |
| 6 | 5 | 1 | True | 2 |
| 7 | 5 | 1 | False | 2 |
| 8 | 5 | 1 | False | 2 |
| 9 | 5 | 1 | True | 3 |
Solution 1:[1]
Use groupby over First ID Actual Day to get all previous values with same first day ID. Then, multiply the boolean column toSum with Sales so that cumsum can interpret the False values as zeros. Remove the indices with droplevel and make the attribution to the variable.
import pandas as pd
d= {'Actual Id': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9},
'First ID Actual Day': {0: 0,1: 0,2: 0,3: 0,4: 0,5: 5,6: 5,7: 5,8: 5,9: 5},
'Sales': {0: 1.0,1: 11.33,2: 1.0,3: 31.2,4: 1.0,5: 241.14,6: 1.0,7: 1.0,8: 1.0,9: 54.22},
'toSum': {0: True,1: True,2: False,3: False,4: True,5: True,6: True,7: False,8: False,9: True}
}
df = pd.DataFrame(d)
df['AcumulativeDay'] = df.groupby(['First ID Actual Day'])\
.apply(lambda x: (x['Sales']*x['toSum'])\
.cumsum()).droplevel(0)
Actual Id First ID Actual Day Sales toSum AcumulativeDay
0 0 0 1.00 True 1.00
1 1 0 11.33 True 12.33
2 2 0 1.00 False 12.33
3 3 0 31.20 False 12.33
4 4 0 1.00 True 13.33
5 5 5 241.14 True 241.14
6 6 5 1.00 True 242.14
7 7 5 1.00 False 242.14
8 8 5 1.00 False 242.14
9 9 5 54.22 True 296.36
Solution 2:[2]
After having the table filtered as you did use:
df['cumsum'] = df['sales'].cumsum()
You may want to have the dataframe sorted by date.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | n1colas.m |
| Solution 2 | Ziur Olpa |
