'Group Values in Pandas

I have following data in csv file. I want to groupby values on 'Value' column such that it breaks and prints the count of True and False if got atleast two consective values. I want to achieve this by using pandas because i will have the huge data and need to compute in lesser time.

Date Value
2022-05-16 17:54:30 True
2022-05-16 17:54:50 True
2022-05-16 17:55:01 True
2022-05-16 17:57:34 False
2022-05-16 17:57:44 True
2022-05-16 17:57:44 True
2022-05-16 17:57:54 False
2022-05-16 17:58:35 False
2022-05-16 18:31:43 False
2022-05-16 18:31:43 False
2022-05-16 18:31:43 True
2022-05-16 18:31:54 True
2022-05-16 18:31:54 True
2022-05-16 18:31:54 False
2022-05-16 18:32:45 False
2022-05-16 18:32:45 False
2022-05-16 18:32:45 False
2022-05-16 18:33:15 True
2022-05-16 18:33:15 False
2022-05-16 18:33:25 False
2022-05-16 18:33:25 False

Output should be like this

True: 5
False: 1
2022-05-16 17:57:44

True: 0
False: 4
2022-05-16 18:31:43

True: 3
False: 0
2022-05-16 18:31:54

True: 1
False: 7
2022-05-16 18:33:25


Solution 1:[1]

For this, you have to create a variable that indicates when a new "phase" start and end. A new phase is when a break occurs, that is, when a row is followed by two consecutive values that are different from the actual row. To get this phase, first I created a column "sum_next_two", which for each row, calculates the sum of the next two rows. We are interested in rows where sum_next_two==2 or sum_next_two==0, because they indicate the start/end of a new phase. So, when phase==True it's a positive phase that ends when sum_next_two==0. Then, start a negative phase (phase==False) that ends when sum_next_two==0. After identifying all the existing phases, make a variable group that create a unique number for each phase.

import pandas as pd

data = [{'Date': '2022-05-16 17:54:30', 'Value': True},
 {'Date': '2022-05-16 17:54:50', 'Value': True},
 {'Date': '2022-05-16 17:55:01', 'Value': True},
 {'Date': '2022-05-16 17:57:34', 'Value': False},
 {'Date': '2022-05-16 17:57:44', 'Value': True},
 {'Date': '2022-05-16 17:57:44', 'Value': True},
 {'Date': '2022-05-16 17:57:54', 'Value': False},
 {'Date': '2022-05-16 17:58:35', 'Value': False},
 {'Date': '2022-05-16 18:31:43', 'Value': False},
 {'Date': '2022-05-16 18:31:43', 'Value': False},
 {'Date': '2022-05-16 18:31:43', 'Value': True},
 {'Date': '2022-05-16 18:31:54', 'Value': True},
 {'Date': '2022-05-16 18:31:54', 'Value': True},
 {'Date': '2022-05-16 18:31:54', 'Value': False},
 {'Date': '2022-05-16 18:32:45', 'Value': False},
 {'Date': '2022-05-16 18:32:45', 'Value': False},
 {'Date': '2022-05-16 18:32:45', 'Value': False},
 {'Date': '2022-05-16 18:33:15', 'Value': True},
 {'Date': '2022-05-16 18:33:15', 'Value': False},
 {'Date': '2022-05-16 18:33:25', 'Value': False},
 {'Date': '2022-05-16 18:33:25', 'Value': False}]

df = pd.DataFrame(data)

# for each row, calculates the sum of the next two rows
df['sum_next_two'] = df['Value'].rolling(2).sum().shift(-2).fillna(method='ffill')

# We are interested in rows where sum_next_two==2 or sum_next_two==0, because they indicate the start/end of a new phase
df.loc[0, 'phase'] = True
df.loc[df['sum_next_two'].eq(2), 'phase'] = True
df.loc[df['sum_next_two'].eq(0), 'phase'] = False
df['phase'] = df['phase'].shift(1).fillna(method='ffill')
# identify uniquely each phase
df['group'] = df['phase'].diff().abs().cumsum().fillna(method='bfill')
df

enter image description here

After that, then simply aggregate to get the desired output:

df.groupby('group').agg(total_true=('Value', 'sum'), total_rows=('Value', 'count'), date=('Date', 'max'))\
    .assign(total_false=lambda x: x['total_rows'] - x['total_true'])\
    [['date', 'total_true', 'total_false']]

enter image description here

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 jjsantoso