How can I group elements in a pandas Series based on how many times they repeat?
I have this example_series:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 True
14 True
15 True
16 True
17 True
18 True
19 True
20 True
21 False
22 False
23 False
24 False
25 False
26 False
27 True
28 False
29 False
30 False
I want to put the elements into groups, where a new group is created only if True repeats n (let's say 5) times. I know how to group them based on where the switch happens:
grouper = example_series.diff().ne(0).cumsum()
which would give me this:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 2
14 2
15 2
16 2
17 2
18 2
19 2
20 2
21 3
22 3
23 3
24 3
25 3
26 3
27 4
28 5
29 5
30 5
But this creates a new group at, e.g., index 27, which I do not want because True has not repeated 5 times there, so 21-30 should all remain group 3. I have been meddling with some loops but didn't really get anywhere. Is there a one-liner for something like this in pandas?
Solution 1:[1]
Not sure if there's a one-liner, but this may work if I understand correctly. It builds on cumsum to apply the counts to each row. If the row count is less than 5, as in your example, those rows should stay with the preceding group number. Both bfill and ffill are needed, depending on where the counts are less than 5.
Note: I named the column 'value' in df1
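import numpy as np

# df is assumed to be a DataFrame holding the boolean series in a column named 'value'
df1 = df.diff().ne(0).cumsum()  # label each run of identical values
# number of rows in each run, cast to float so NaN can be assigned below
df2 = df1.groupby(['value'])['value'].transform('count').astype(float).to_frame()
df2.loc[df2.value < 5, 'value'] = np.nan  # blank out runs shorter than 5
df2 = df2.value.bfill().ffill()  # fill blanked runs from neighboring runs
df_final = df2.diff().ne(0).cumsum()  # renumber the merged groups
df_final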
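As a possible alternative (not part of the original answer), here is a minimal sketch that compares the values themselves rather than the run lengths, so two neighboring runs that happen to have the same length are not merged by accident. It assumes the intended rule is that only a run of at least n consecutive True values starts a new group, and that shorter True runs are absorbed into the surrounding False group. The function name group_by_long_true_runs and the parameter n are hypothetical names chosen for illustration.

```python
import pandas as pd


def group_by_long_true_runs(s: pd.Series, n: int = 5) -> pd.Series:
    """Hypothetical helper: number groups, ignoring True runs shorter than n."""
    # Label each run of identical consecutive values: 1, 2, 3, ...
    run_id = s.ne(s.shift()).cumsum()
    # Length of the run each row belongs to
    run_len = run_id.map(run_id.value_counts())
    # Assumption: a True run shorter than n is treated as if it were False
    smoothed = s & run_len.ge(n)
    # Re-label the runs on the smoothed series
    return smoothed.ne(smoothed.shift()).cumsum()


# The series from the question
example_series = pd.Series(
    [False] * 13 + [True] * 8 + [False] * 6 + [True] + [False] * 3
)
groups = group_by_long_true_runs(example_series, n=5)
```

On the example series this should give group 1 for indices 0-12, group 2 for 13-20, and group 3 for 21-30, since the single True at index 27 is too short to open a new group.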
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Jonathan Leon |