How to group by values in a column and find the time difference using Python?

I have a dataframe as shown below:

Col A  Time                 Col B  Col C
123    2018-01-06 03:45:23  B      1
141    2018-01-08 12:45:55  C      0
123    2018-01-08 11:45:29  A      0
123    2018-01-08 01:45:15  C      0
141    2018-01-08 11:35:06  C      0
123    2018-01-09 10:45:37  A      1
123    2018-01-09 09:45:41  C      0
141    2018-01-06 03:45:28  B      0
123    2018-01-08 12:45:23  C      0
141    2018-01-08 11:45:17  A      0
141    2018-01-08 01:45:02  C      0
123    2018-01-08 11:35:56  C      0
141    2018-01-09 10:45:39  A      0
141    2018-01-09 09:45:19  C      0
178    2019-04-17 08:34:13  B      1
178    2019-05-02 19:48:44  C      0
178    2019-04-18 13:27:28  A      1

For this dataframe, there are four operations that need to be performed:

  1. Group by the values in Col A, sort each group by Time, and calculate the lagging time difference between consecutive rows.
  2. Filter the dataframe to rows where Col B is B or A, and apply the same operation.
  3. Keeping the filter from step 2, calculate the time difference between each A in Col B and the very first timestamp of its Col A group.
  4. Keeping the filter from step 2, and only where Col C is 1, calculate the time difference between the B and the A in Col B. Each Col A group has at most one B and one A with Col C equal to 1.

Each step should produce a new column. In step 2, rows where Col B is C should be filled with NaN in the new column. Can someone please help with this implementation using Python?

Expected Output:

Col A  Time                 Col B  Col C  Lag Time  Filtered Lag  Diff    New Col
123    2018-01-06 03:45:23  B      1      NaN       NaN           NaN     NaN
123    2018-01-08 01:45:15  C      0      165592    NaN           NaN     NaN
123    2018-01-08 11:35:56  C      0      35441     NaN           NaN     NaN
123    2018-01-08 11:45:29  A      0      573       201606        201606  NaN
123    2018-01-08 12:45:23  C      0      3594      NaN           NaN     NaN
123    2018-01-09 09:45:41  C      0      75618     NaN           NaN     NaN
123    2018-01-09 10:45:37  A      1      3596      82808         284414  284414
178    2019-04-17 08:34:13  B      1      NaN       NaN           NaN     NaN
178    2019-04-18 13:27:28  A      1      103995    103995        103995  103995
178    2019-05-02 19:48:44  C      0      1232476   NaN           NaN     NaN

The new columns should be calculated the same way for the other values in Col A.

Note: Values in the new columns are in seconds.



Solution 1:[1]

Would this work? The first column is straightforward. For the second column, we use where to keep the timestamps only for rows with A or B in Col B, so every other row becomes NaN.

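import pandas as pd
df['Time'] = pd.to_datetime(df['Time'])
# Lag between consecutive rows within each Col A group, in seconds
df['new'] = df.sort_values('Time').groupby('Col A')['Time'].diff().dt.total_seconds()
# Same lag, restricted to rows where Col B is A or B (all other rows become NaN)
df['new2'] = df['Time'].where(df['Col B'].isin(['A','B'])).groupby(df['Col A']).apply(lambda x: x.sort_values().diff()).droplevel(0).dt.total_seconds()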

Output:

    Col A                Time Col B        new      new2
0     123 2018-01-06 03:45:23     B        NaN       NaN
1     141 2018-01-08 12:45:55     C     3638.0       NaN
2     123 2018-01-08 11:45:29     A      573.0  201606.0
3     123 2018-01-08 01:45:15     C   165592.0       NaN
4     141 2018-01-08 11:35:06     C    35404.0       NaN
5     123 2018-01-09 10:45:37     A     3596.0   82808.0
6     123 2018-01-09 09:45:41     C    75618.0       NaN
7     141 2018-01-06 03:45:28     B        NaN       NaN
8     123 2018-01-08 12:45:23     C     3594.0       NaN
9     141 2018-01-08 11:45:17     A      611.0  201589.0
10    141 2018-01-08 01:45:02     C   165574.0       NaN
11    123 2018-01-08 11:35:56     C    35441.0       NaN
12    141 2018-01-09 10:45:39     A     3620.0   82822.0
13    141 2018-01-09 09:45:19     C    75564.0       NaN
14    178 2019-04-17 08:34:13     B        NaN       NaN
15    178 2019-05-02 19:48:44     C  1232476.0       NaN
16    178 2019-04-18 13:27:28     A   103995.0  103995.0
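
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. It rebuilds part of the question's data (only the 123 and 178 groups, to keep it short) and computes the filtered lag by selecting the A/B rows first and relying on index alignment when assigning back. This is a swapped-in variant that avoids apply and droplevel, not the exact code above, and the column names Lag Time and Filtered Lag are borrowed from the question's expected output.

```python
import io

import pandas as pd

# Rows copied from the question (Col A groups 123 and 178 only).
raw = """Col A,Time,Col B,Col C
123,2018-01-06 03:45:23,B,1
123,2018-01-08 11:45:29,A,0
123,2018-01-08 01:45:15,C,0
123,2018-01-09 10:45:37,A,1
123,2018-01-09 09:45:41,C,0
123,2018-01-08 12:45:23,C,0
123,2018-01-08 11:35:56,C,0
178,2019-04-17 08:34:13,B,1
178,2019-05-02 19:48:44,C,0
178,2019-04-18 13:27:28,A,1
"""
df = pd.read_csv(io.StringIO(raw), parse_dates=["Time"])

# Lag between consecutive rows of each Col A group, in time order.
df["Lag Time"] = (
    df.sort_values("Time").groupby("Col A")["Time"].diff().dt.total_seconds()
)

# Same lag, computed only over rows whose Col B is A or B.
# Assigning the shorter Series back aligns on the index, so C rows become NaN.
ab = df[df["Col B"].isin(["A", "B"])]
df["Filtered Lag"] = (
    ab.sort_values("Time").groupby("Col A")["Time"].diff().dt.total_seconds()
)

# Show the result in the same order as the question's expected output.
print(df.sort_values(["Col A", "Time"]).to_string(index=False))
```

Filtering before the groupby keeps the original row labels, which is what lets the assignment put each value back on the right row.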

Solution 2:[2]

Below is the solution to the above question meeting all the conditions:

import pandas as pd

# Convert the `Time` column to `datetime`
df['Time'] = pd.to_datetime(df['Time'])

# Group by `Col A` after sorting by `Time` and calculate row difference in seconds for each group
df['Lag Time'] = df.sort_values('Time').groupby('Col A')['Time'].diff().dt.total_seconds()

# Filter dataframe where value of `Col B` is in `["A", "B"]`, group by `Col A` then sort by `Time` using lambda function and calculate row difference in seconds for each group
df['Filtered Lag'] = df['Time'].where(df['Col B'].isin(['B','A'])).groupby(df['Col A']).apply(lambda x: x.sort_values().diff()).droplevel(0).dt.total_seconds()

# Filter dataframe where value of `Col B` is in `["A", "B"]` and value of `Col C` is `1`, group by `Col A` then sort by `Time` using lambda function and calculate row difference in seconds for each group
df['Diff'] = df['Time'].where(df['Col B'].isin(['B','A']) & (df['Col C'] == 1)).groupby(df['Col A']).apply(lambda x: x.sort_values().diff()).droplevel(0).dt.total_seconds()

Values for New Col are yet to be calculated, will post the updated answer soon.
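
Until then, here is one possible sketch for the two remaining columns, based on a reading of the question's expected output rather than on anything the asker confirmed. It assumes that Diff there is the step 3 value (seconds between each A row and the first timestamp of its Col A group, taken among the A/B rows) and that New Col is the step 4 value (the B-to-A gap among rows where Col C is 1). The data is the 123 and 178 groups copied from the question, and the helper names ab_mask, first_seen and flagged are made up for illustration.

```python
import io

import pandas as pd

# Rows copied from the question (Col A groups 123 and 178 only).
raw = """Col A,Time,Col B,Col C
123,2018-01-06 03:45:23,B,1
123,2018-01-08 11:45:29,A,0
123,2018-01-08 01:45:15,C,0
123,2018-01-09 10:45:37,A,1
123,2018-01-09 09:45:41,C,0
123,2018-01-08 12:45:23,C,0
123,2018-01-08 11:35:56,C,0
178,2019-04-17 08:34:13,B,1
178,2019-05-02 19:48:44,C,0
178,2019-04-18 13:27:28,A,1
"""
df = pd.read_csv(io.StringIO(raw), parse_dates=["Time"])

# Rows that survive the step 2 filter (Col B is A or B).
ab_mask = df["Col B"].isin(["A", "B"])
ab = df[ab_mask]

# Step 3 (assumed meaning of Diff): seconds from the earliest A/B timestamp
# of the group to each A row; B rows are blanked out with where().
first_seen = ab.groupby("Col A")["Time"].transform("min")
since_first = (ab["Time"] - first_seen).dt.total_seconds()
df["Diff"] = since_first.where(ab["Col B"].eq("A"))

# Step 4 (assumed meaning of New Col): lag between the flagged B and A rows,
# i.e. A/B rows where Col C is 1. Note the parentheses-free eq() call, which
# sidesteps the operator-precedence trap of writing `mask & col == 1`.
flagged = df[ab_mask & df["Col C"].eq(1)]
df["New Col"] = (
    flagged.sort_values("Time").groupby("Col A")["Time"].diff().dt.total_seconds()
)

print(df.sort_values(["Col A", "Time"]).to_string(index=False))
```

On this sample the A rows get 201606, 284414 and 103995 in Diff, and the two flagged A rows get 284414 and 103995 in New Col, which agrees with the expected output in the question. Rows excluded by either filter are left as NaN by index alignment.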

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2