Pandas - count streak since last value change
I want to get the streak since a value in a specific column changed:
datetime             val1  val2  val1-streak  val2-streak
2018-04-01 00:00:00     4     1            0            0
2018-05-01 00:00:00     5     2            0            0
2018-06-01 00:00:00     5     2            1            1
2018-07-01 00:00:00     6     2            0            2
2018-08-01 00:00:00     7     2            0            3
2018-09-01 00:00:00     7     3            1            0
2018-10-01 00:00:00     7     3            2            1
2018-11-01 00:00:00     5     2            0            0
For now it is OK to assume that the datetime column is equally spaced; that way I can just count the "periods" since the value in a target column changed. Ideally the function could additionally compute the datetime difference since the value last changed.
I found a few solutions on this site, but when I looked more closely they did not really match my problem:
Pandas - count since last transaction
Counting changes in pandas row values?
In pandas, how do you find the time since a column by value changes, when grouped by another column?
Pythonic way to calculate streaks in pandas dataframe
finding streaks in pandas dataframe
https://joshdevlin.com/blog/calculate-streaks-in-pandas/
.... at least I was not able to transfer the given answers to my problem.
Solution 1:[1]
Use a custom function to generate a counter over consecutive values: for each column listed in vals, compute the differences with Series.diff, test for inequality with Series.ne, take the cumulative sum to get a streak id, and finally pass that id to GroupBy.cumcount:
vals = ['val1', 'val2']

def f(x):
    x = x.diff().ne(0).cumsum()
    return x.groupby(x).cumcount()

df = df.join(df[vals].apply(f).add_suffix('_streak'))
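The question also asks for the datetime difference since the last change, which this answer does not cover. A minimal sketch building on the same streak ids (the reconstructed toy DataFrame and the column name val1_elapsed are assumptions, not part of the original answer):

```python
import pandas as pd

# Toy frame reconstructed from the question's example data
df = pd.DataFrame({
    'datetime': pd.date_range('2018-04-01', periods=8, freq='MS'),
    'val1': [4, 5, 5, 6, 7, 7, 7, 5],
})

# Same streak id as above: increments whenever val1 changes
ids = df['val1'].diff().ne(0).cumsum()

# Elapsed time since the first row of the current streak
df['val1_elapsed'] = df['datetime'] - df.groupby(ids)['datetime'].transform('first')
```

Because the result is a Timedelta column, it works even when the datetime column is not equally spaced.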
Solution 2:[2]
Let's try the following:
(i) Create a stacked Series using set_index + unstack (so we do not have to repeat the same operations on two columns separately).
(ii) Use groupby + diff to find differences among consecutive elements: tmp
(iii) We're interested in the locations where the differences are 0; we do another groupby + cumsum to get the desired outcome.
(iv) Use pivot to get the output back into the shape of df.
tmp = df.set_index('datetime')[['val1','val2']].unstack().groupby(level=0).diff()
df[['val1-streak','val2-streak']] = pd.pivot(tmp.eq(0)
                                                .groupby([tmp.index.get_level_values(0),
                                                          tmp.ne(0).cumsum()])
                                                .cumsum()
                                                .reset_index(),
                                             index='datetime', columns='level_0',
                                             values=0).to_numpy()
Output:
datetime val1 val2 val1-streak val2-streak
0 2018-04-01 00:00:00 4 1 0 0
1 2018-05-01 00:00:00 5 2 0 0
2 2018-06-01 00:00:00 5 2 1 1
3 2018-07-01 00:00:00 6 2 0 2
4 2018-08-01 00:00:00 7 2 0 3
5 2018-09-01 00:00:00 7 3 1 0
6 2018-10-01 00:00:00 7 3 2 1
7 2018-11-01 00:00:00 5 2 0 0
Solution 3:[3]
Let's start with how to do it with a single column. First off we need to find each point where the value changes:
diffs = df['val1'].diff(1)
change_points = diffs != 0
Since previously encountered values can show up again, we need a way to distinguish different streaks of the same value. We achieve that with a cumulative sum over the change-point array. For convenience we create a temporary DataFrame to hold the intermediate results (you could also add a new column to your DataFrame, but that is a bit messy):
change_points_cumsum = change_points.cumsum()
tmp_merged = pd.concat([df['val1'], change_points_cumsum], axis=1, keys=['val1', 'change_points_cumsum'])
Every streak is assigned a different change_points_cumsum value, so a simple groupby and cumulative count yields the final result:
tmp_merged["val1-streak"] = tmp_merged.groupby(change_points_cumsum).cumcount()
print(tmp_merged)
val1 change_points_cumsum val1-streak
0 4 1 0
1 5 2 0
2 5 2 1
3 6 3 0
4 7 4 0
5 7 4 1
6 7 4 2
7 5 5 0
If you are looking for a shorter, more compact solution:
change_points_cumsum = df['val1'].diff(1).ne(0).cumsum()
change_points_cumsum.groupby(change_points_cumsum).cumcount()
0 0
1 0
2 1
3 0
4 0
5 1
6 2
7 0
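For completeness, a self-contained sketch that wraps the compact form in a helper and applies it to both columns of the question's example (the toy DataFrame is reconstructed from the question; the helper name streak is an assumption):

```python
import pandas as pd

# Toy data reconstructed from the question
df = pd.DataFrame({
    'val1': [4, 5, 5, 6, 7, 7, 7, 5],
    'val2': [1, 2, 2, 2, 2, 3, 3, 2],
})

def streak(s):
    # New id every time the value changes, then count rows within each id
    ids = s.diff().ne(0).cumsum()
    return ids.groupby(ids).cumcount()

for col in ['val1', 'val2']:
    df[f'{col}-streak'] = streak(df[col])
```

This reproduces the val1-streak and val2-streak columns shown in the question.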
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | jezrael |
| Solution 2 | |
| Solution 3 | |
