Predecessor within group satisfying a property

I have a pandas DataFrame where each row corresponds to a chronological event, with a time column t and an indicator of whether a certain event happened or not. Within each group, I want to add one column giving the last time the event happened and another giving the last time it did not happen. Rows before the first such occurrence are set to NaN. Here's an example:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'group_id': [1,1,1,1, 2,2,2,2, 3,3,3,3],
                             't': [1, 5, 9, 12, 100, 120, 150, 180, 400, 410, 450, 490],
                             'happened': [1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0]})

out = pd.DataFrame.from_dict({'group_id':[1,1,1,1, 2,2,2,2, 3,3,3,3], 't':[1, 5, 9, 12, 100, 120, 150, 180, 400, 410, 450, 490],
                              'happened': [1, 0, 1, 0,     0, 0, 1, 0,    1,  1, 1, 0],
                             't_yes': [np.nan, 1, 1, 9,     np.nan, np.nan, np.nan, 150,    np.nan, 400, 410, 450],
                             't_no': [np.nan, np.nan, 5, 5,     np.nan, 100, 120, 120,    np.nan, np.nan, np.nan, np.nan]
                             })

I know it has something to do with groupby and probably shift, but I don't know exactly how.



Solution 1:[1]

First multiply t by happened (and by its negation) to record the times at which the event did and did not happen; since we want t_yes and t_no separately, we build two masked columns. Then use groupby.apply with shift + ffill to carry the last such time forward within each group.

# Keep t where the event (did not) happen; the 0 products become NaN.
df['t_yes'] = df['happened'].mul(df['t']).replace(0, float('nan'))
df['t_no'] = (~df['happened'].astype(bool)).mul(df['t']).replace(0, float('nan'))
# Per group, shift so each row only sees earlier rows, then forward-fill.
# group_keys=False keeps the original index so the assignment aligns.
df[['t_yes','t_no']] = df.groupby('group_id', group_keys=False)[['t_yes','t_no']].apply(lambda x: x.shift().ffill())

Output:

    group_id    t  happened  t_yes   t_no
0          1    1         1    NaN    NaN
1          1    5         0    1.0    NaN
2          1    9         1    1.0    5.0
3          1   12         0    9.0    5.0
4          2  100         0    NaN    NaN
5          2  120         0    NaN  100.0
6          2  150         1    NaN  120.0
7          2  180         0  150.0  120.0
8          3  400         1    NaN    NaN
9          3  410         1  400.0    NaN
10         3  450         1  410.0    NaN
11         3  490         0  450.0    NaN

Another, slightly more involved solution:

First pivot the DataFrame so the columns are the times. Then multiply the values by those column times to mark when the event did or did not happen. stack + reindex bring the pivoted DataFrame back into the original shape, and finally groupby.apply gives the desired output, the last time the event happened:

# Keyword arguments are required for pivot in pandas >= 2.0.
pivoted = df.pivot(index='group_id', columns='t', values='happened')
idx = pd.MultiIndex.from_arrays([df['group_id'], df['t']])
last = lambda x: x.ffill().shift()
df['t_yes'] = (pivoted.replace(0, float('nan')).mul(pivoted.columns)
               .stack().reindex(idx)
               .groupby(level=0, group_keys=False).apply(last)
               .reset_index(drop=True))
df['t_no'] = (pivoted.replace({1: float('nan'), 0: 1}).mul(pivoted.columns)
              .stack().reindex(idx)
              .groupby(level=0, group_keys=False).apply(last)
              .reset_index(drop=True))
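To see the pivot step concretely, here is a minimal sketch on a small subset of the question's data (the variable names are mine):

```python
import pandas as pd

df = pd.DataFrame({'group_id': [1, 1, 2, 2],
                   't': [1, 5, 100, 120],
                   'happened': [1, 0, 0, 1]})

# Columns become the union of all t values across groups; (group, t)
# combinations that never occur are filled with NaN.
pivoted = df.pivot(index='group_id', columns='t', values='happened')

# stack() drops that NaN fill again, returning to one value per
# (group_id, t) pair, which reindex then aligns with the original rows.
stacked = pivoted.stack()
```

Note that this pivot only works because each (group_id, t) pair is unique; duplicate timestamps within a group would raise an error.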

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
