Checking and filtering the first event of each group in pandas
I have an event dataset. I grouped it by user_id and location with the pandas library, then sorted each group by created_at. Within each group I want to drop every row that comes before the first 'Open' event, and drop a group entirely if it contains no 'Open' event at all.
My dataset:
| user_id | location | event_name | created_at |
|---|---|---|---|
| A1 | AAA | Event5 | 9/1/2020 9:06:52 PM |
| A1 | AAA | Event6 | 9/1/2020 9:07:02 PM |
| A1 | AAA | Open | 9/1/2020 9:07:12 PM |
| A1 | AAA | Event1 | 9/1/2020 9:07:22 PM |
| A1 | AAA | Event2 | 9/1/2020 9:07:32 PM |
| A1 | AAA | Event3 | 9/1/2020 9:07:42 PM |
| A1 | AAA | Event2 | 9/1/2020 9:07:52 PM |
| A1 | AAA | Open | 9/1/2020 9:08:02 PM |
| A1 | AAA | Event1 | 9/1/2020 9:08:12 PM |
| A1 | AAA | Event2 | 9/1/2020 9:08:22 PM |
| A2 | AAA | Event9 | 9/1/2020 9:07:22 PM |
I want:
| user_id | location | event_name | created_at |
|---|---|---|---|
| A1 | AAA | Open | 9/1/2020 9:07:12 PM |
| A1 | AAA | Event1 | 9/1/2020 9:07:22 PM |
| A1 | AAA | Event2 | 9/1/2020 9:07:32 PM |
| A1 | AAA | Event3 | 9/1/2020 9:07:42 PM |
| A1 | AAA | Open | 9/1/2020 9:08:02 PM |
| A1 | AAA | Event1 | 9/1/2020 9:08:12 PM |
| A1 | AAA | Event2 | 9/1/2020 9:08:22 PM |
Thank you.
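For anyone who wants to reproduce the example, here is a minimal sketch that rebuilds a slice of the sample data and applies the grouping/sorting described in the question. The variable name `df` and the timestamp format are assumptions taken from the tables above.

```python
import pandas as pd

# Rebuild a small slice of the sample data from the question.
df = pd.DataFrame(
    {
        "user_id": ["A1", "A1", "A1", "A1", "A1", "A2"],
        "location": ["AAA", "AAA", "AAA", "AAA", "AAA", "AAA"],
        "event_name": ["Event5", "Event6", "Open", "Event1", "Event2", "Event9"],
        "created_at": [
            "9/1/2020 9:06:52 PM",
            "9/1/2020 9:07:02 PM",
            "9/1/2020 9:07:12 PM",
            "9/1/2020 9:07:22 PM",
            "9/1/2020 9:07:32 PM",
            "9/1/2020 9:07:22 PM",
        ],
    }
)

# Parse the timestamps and order each (user_id, location) group chronologically,
# as described in the question.
df["created_at"] = pd.to_datetime(df["created_at"], format="%m/%d/%Y %I:%M:%S %p")
df = df.sort_values(["user_id", "location", "created_at"]).reset_index(drop=True)
```

From here, `df` is in the sorted per-group order that the solution below assumes.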
Solution 1:[1]
In short:
df[~(df['event_name'] != 'Open').groupby([df['user_id'], df['location']]).cumprod().astype(bool)]
Explanation:
(df['event_name'] != 'Open') gives a series of True/False values, which behave like 1/0 when treated as numbers. cumprod takes the cumulative product of that series within each (user_id, location) group, so the running product stays 1 until the first False (i.e. 0, an 'Open' row) appears, and then stays 0 for the rest of the group.
Zero therefore marks exactly the rows we want to keep: the first 'Open' and everything after it. Groups that never contain an 'Open' (such as user A2) stay at 1 throughout and are dropped. The trailing .astype(bool) ensures the mask is boolean before it is inverted with ~, and the whole approach assumes the rows are already sorted by created_at within each group, as described in the question.
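Put together, here is a runnable sketch of the same idea with the steps named and explicit casts added so the intermediate mask has a well-defined integer/boolean dtype regardless of pandas version. The intermediate variable names are just for illustration.

```python
# 1/0 series: 1 while the row is NOT an 'Open' event, 0 on 'Open' rows.
not_open = (df["event_name"] != "Open").astype(int)

# Cumulative product within each (user_id, location) group:
# stays 1 until the first 'Open' row turns it to 0, then stays 0.
before_first_open = (
    not_open.groupby([df["user_id"], df["location"]])
    .cumprod()
    .astype(bool)
)

# Keep the first 'Open' of each group and everything after it;
# groups with no 'Open' at all are dropped entirely.
result = df[~before_first_open]
print(result)
```

With the slice built above, this keeps the 'Open', 'Event1' and 'Event2' rows for user A1 and drops user A2's lone 'Event9' row, matching the first rows of the desired output.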
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
