'How to count number of events in a dataframe before and after a given date?

I'm trying to identify individuals who have events before or after events of their first occurrence of an event of a specific type.

For example, I'm interested in events of type 'y'. I want to know which individuals have had events of types 'x', 'z', 'p' etc. either before or after the first time they are linked to events of type 'y'.

An example input dataframe would be:

df=pd.DataFrame({'IDs':['a','a','a','b','b','c'],'Event_types':['x','y','z','p','y','x'],'date':['2020-01-01','2020-01-06','2020-01-11','2020-01-11','2020-01-14','2020-01-27']})

Giving:

  IDs Event_types       date
0   a           x 2020-01-01
1   a           y 2020-01-06
2   a           z 2020-01-11
3   b           p 2020-01-11
4   b           y 2020-01-14
5   c           x 2020-01-27

If I'm focussing on events of type 'y then desired output would then be

  IDs  Event_type_x_before  Event_type_x_after  Event_type_z_before  Event_type_z_after
0   a                    1                   0                    0                   1
1   b                    0                   0                    0                   0
2   c                    0                   1                    0                   0

The best solution I've found is Pandas count the number of times an event has occurred in last n days by group which is close to what I want but it runs too slowly for the size of the dataframe and also only deals with events in the past (or future if the timedelta was modified). Any help would be much appreciated.



Solution 1:[1]

With the following toy dataframe (one more example than yours):

import pandas as pd

df = pd.DataFrame(
    {
        "IDs": ["a", "a", "a", "b", "b", "c", "c"],
        "Event_types": ["x", "y", "z", "p", "y", "x", "x"],
        "date": [
            "2020-01-01",
            "2020-01-06",
            "2020-01-11",
            "2020-01-11",
            "2020-01-14",
            "2020-01-27",
            "2020-01-28",
        ],
    }
)

Here is one way to do it:

# Setup
TARGET = "y"
first_occurrence = df.loc[df["Event_types"] == TARGET, "date"].min()

# Count number of events before/after TARGET event
df = (
    df.loc[~(df["Event_types"] == TARGET), :]
    .assign(before=lambda df_: (df_["date"] < first_occurrence).astype(int))
    .assign(after=lambda df_: (df_["date"] >= first_occurrence).astype(int))
    .groupby(["IDs", "Event_types"])
    .sum()
    .reset_index()
)

# Create new dataframe
new_df = pd.DataFrame()
for idx in df["IDs"].unique():
    for evt in df["Event_types"].unique():
        mask = (df["IDs"] == idx) & (df["Event_types"] == evt)
        try:
            new_df.loc[idx, f"{evt}_before_{TARGET}"] = df.loc[mask, "before"].values[0]
        except Exception:
            pass
        try:
            new_df.loc[idx, f"{evt}_after_{TARGET}"] = df.loc[mask, "after"].values[0]
        except Exception:
            pass

# Cleanup
new_df = new_df.fillna(0).astype(int)
print(new_df)
# Output
   x_before_y  x_after_y  z_before_y  z_after_y  p_before_y  p_after_y
a           1          0           0          1           0          0
b           0          0           0          0           0          1
c           0          2           0          0           0          0

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Laurent