How to group by values in a column and find time difference using Python?
I have a dataframe as shown below:
Col A | Time | Col B | Col C |
---|---|---|---|
123 | 2018-01-06 03:45:23 | B | 1 |
141 | 2018-01-08 12:45:55 | C | 0 |
123 | 2018-01-08 11:45:29 | A | 0 |
123 | 2018-01-08 01:45:15 | C | 0 |
141 | 2018-01-08 11:35:06 | C | 0 |
123 | 2018-01-09 10:45:37 | A | 1 |
123 | 2018-01-09 09:45:41 | C | 0 |
141 | 2018-01-06 03:45:28 | B | 0 |
123 | 2018-01-08 12:45:23 | C | 0 |
141 | 2018-01-08 11:45:17 | A | 0 |
141 | 2018-01-08 01:45:02 | C | 0 |
123 | 2018-01-08 11:35:56 | C | 0 |
141 | 2018-01-09 10:45:39 | A | 0 |
141 | 2018-01-09 09:45:19 | C | 0 |
178 | 2019-04-17 08:34:13 | B | 1 |
178 | 2019-05-02 19:48:44 | C | 0 |
178 | 2019-04-18 13:27:28 | A | 1 |
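For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: the variable names `rows` and `df` are my own choices, and it assumes `Time` should be parsed as a datetime column.

```python
import pandas as pd

# Rows copied from the table above: (Col A, Time, Col B, Col C)
rows = [
    (123, "2018-01-06 03:45:23", "B", 1),
    (141, "2018-01-08 12:45:55", "C", 0),
    (123, "2018-01-08 11:45:29", "A", 0),
    (123, "2018-01-08 01:45:15", "C", 0),
    (141, "2018-01-08 11:35:06", "C", 0),
    (123, "2018-01-09 10:45:37", "A", 1),
    (123, "2018-01-09 09:45:41", "C", 0),
    (141, "2018-01-06 03:45:28", "B", 0),
    (123, "2018-01-08 12:45:23", "C", 0),
    (141, "2018-01-08 11:45:17", "A", 0),
    (141, "2018-01-08 01:45:02", "C", 0),
    (123, "2018-01-08 11:35:56", "C", 0),
    (141, "2018-01-09 10:45:39", "A", 0),
    (141, "2018-01-09 09:45:19", "C", 0),
    (178, "2019-04-17 08:34:13", "B", 1),
    (178, "2019-05-02 19:48:44", "C", 0),
    (178, "2019-04-18 13:27:28", "A", 1),
]

df = pd.DataFrame(rows, columns=["Col A", "Time", "Col B", "Col C"])

# Parse the timestamps so that time differences can be computed later
df["Time"] = pd.to_datetime(df["Time"])
```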
For this dataframe, the following operations need to be performed:

1. Group by the values in `Col A`, sort by `Time`, and calculate the lagging time difference.
2. Filter the dataframe to rows where `Col B` is `B` or `A`, and apply the above operation.
3. Keeping the conditions in Step 2, calculate the time difference for each `A` in `Col B` from the very first timestamp of its `Col A` group.
4. Keeping the filter conditions in Step 2, and where the value of `Col C` is `1`, calculate the time difference between the `B` and `A` rows of `Col B`. In this case, only one `B` and one `A` in each `Col A` group can have the value `1` in `Col C`.

For each of the above steps, a new column should be created. In case 2, where the value in column `Col B` is `C`, the newly created column should be filled with NaN. Can someone please help with this implementation using Python?
Expected Output:
Col A | Time | Col B | Col C | Lag Time | Filtered Lag | Diff | New Col |
---|---|---|---|---|---|---|---|
123 | 2018-01-06 03:45:23 | B | 1 | NaN | NaN | NaN | NaN |
123 | 2018-01-08 01:45:15 | C | 0 | 165592 | NaN | NaN | NaN |
123 | 2018-01-08 11:35:56 | C | 0 | 35441 | NaN | NaN | NaN |
123 | 2018-01-08 11:45:29 | A | 0 | 573 | 201606 | 201606 | NaN |
123 | 2018-01-08 12:45:23 | C | 0 | 3594 | NaN | NaN | NaN |
123 | 2018-01-09 09:45:41 | C | 0 | 75618 | NaN | NaN | NaN |
123 | 2018-01-09 10:45:37 | A | 1 | 3596 | 82808 | 284414 | 284414 |
178 | 2019-04-17 08:34:13 | B | 1 | NaN | NaN | NaN | NaN |
178 | 2019-04-18 13:27:28 | A | 1 | 103995 | 103995 | 103995 | 103995 |
178 | 2019-05-02 19:48:44 | C | 0 | 1232476 | NaN | NaN | NaN |
The new columns should be calculated in the same way for the other values in `Col A`.
Note: Values in the new columns are calculated in seconds.
Solution 1:[1]
Would this work? The first column seems straightforward. For the second column, we use `where` to only work with rows with `A` or `B` in `Col B`.
import pandas as pd
df['Time'] = pd.to_datetime(df['Time'])
# Lag within each `Col A` group, in seconds
df['new'] = df.sort_values('Time').groupby('Col A')['Time'].diff().dt.total_seconds()
# Same lag, restricted to rows where `Col B` is `A` or `B`
df['new2'] = df['Time'].where(df['Col B'].isin(['A','B'])).groupby(df['Col A']).apply(lambda x: x.sort_values().diff()).droplevel(0).dt.total_seconds()
Output:
Col A Time Col B new new2
0 123 2018-01-06 03:45:23 B NaN NaN
1 141 2018-01-08 12:45:55 C 3638.0 NaN
2 123 2018-01-08 11:45:29 A 573.0 201606.0
3 123 2018-01-08 01:45:15 C 165592.0 NaN
4 141 2018-01-08 11:35:06 C 35404.0 NaN
5 123 2018-01-09 10:45:37 A 3596.0 82808.0
6 123 2018-01-09 09:45:41 C 75618.0 NaN
7 141 2018-01-06 03:45:28 B NaN NaN
8 123 2018-01-08 12:45:23 C 3594.0 NaN
9 141 2018-01-08 11:45:17 A 611.0 201589.0
10 141 2018-01-08 01:45:02 C 165574.0 NaN
11 123 2018-01-08 11:35:56 C 35441.0 NaN
12 141 2018-01-09 10:45:39 A 3620.0 82822.0
13 141 2018-01-09 09:45:19 C 75564.0 NaN
14 178 2019-04-17 08:34:13 B NaN NaN
15 178 2019-05-02 19:48:44 C 1232476.0 NaN
16 178 2019-04-18 13:27:28 A 103995.0 103995.0
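If the `apply` / `droplevel` combination feels fragile, one possible alternative is to filter first and let index alignment fill the remaining rows with NaN. This is a minimal sketch on a reduced sample (only some rows of groups 123 and 178 from the question); the name `ab` is just an illustrative choice.

```python
import pandas as pd

# Reduced sample: a few rows of groups 123 and 178 from the question
df = pd.DataFrame(
    {
        "Col A": [123, 123, 123, 123, 178, 178, 178],
        "Time": pd.to_datetime(
            [
                "2018-01-06 03:45:23",
                "2018-01-08 11:45:29",
                "2018-01-08 01:45:15",
                "2018-01-09 10:45:37",
                "2019-04-17 08:34:13",
                "2019-05-02 19:48:44",
                "2019-04-18 13:27:28",
            ]
        ),
        "Col B": ["B", "A", "C", "A", "B", "C", "A"],
    }
)

# Keep only rows with A or B in Col B, ordered by time
ab = df[df["Col B"].isin(["A", "B"])].sort_values("Time")

# diff() runs per Col A group; assigning back aligns on the original index,
# so rows that were filtered out (Col B == "C") end up as NaN
df["new2"] = ab.groupby("Col A")["Time"].diff().dt.total_seconds()
```

Because the result keeps the original index, no `droplevel` step is needed here.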
Solution 2:[2]
Below is a partial solution to the above question:
import pandas as pd
# Convert the `Time` column to `datetime`
df['Time'] = pd.to_datetime(df['Time'])
# Group by `Col A` after sorting by `Time` and calculate row difference in seconds for each group
df['Lag Time'] = df.sort_values('Time').groupby('Col A')['Time'].diff().dt.total_seconds()
# Filter dataframe where value of `Col B` is in `["A", "B"]`, group by `Col A` then sort by `Time` using lambda function and calculate row difference in seconds for each group
df['Filtered Lag'] = df['Time'].where(df['Col B'].isin(['B','A'])).groupby(df['Col A']).apply(lambda x: x.sort_values().diff()).droplevel(0).dt.total_seconds()
# Filter dataframe where value of `Col B` is in `["A", "B"]` and value of `Col C` is `1` (each condition parenthesized, since `&` binds tighter than `==`), group by `Col A` then sort by `Time` using lambda function and calculate row difference in seconds for each group
df['Diff'] = df['Time'].where(df['Col B'].isin(['B','A']) & (df['Col C'] == 1)).groupby(df['Col A']).apply(lambda x: x.sort_values().diff()).droplevel(0).dt.total_seconds()
Values for `New Col` are yet to be calculated; I will post the updated answer soon.
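In the meantime, here is one possible way to get both remaining columns, checked only against the expected output in the question. Note that the expected `Diff` is also filled for `A` rows where `Col C` is 0 (for example 201606 for group 123), so the `Col C == 1` filter appears to correspond to `New Col` rather than `Diff`. The sketch assumes that "the very first timestamp" means the earliest `A`/`B` row of each `Col A` group, and that the flagged `B` row precedes the flagged `A` row; the names `is_ab`, `is_a`, `first_ab`, and `flagged` are illustrative.

```python
import pandas as pd

# Reduced sample: a few rows of groups 123 and 178 from the question
df = pd.DataFrame(
    {
        "Col A": [123, 123, 123, 123, 178, 178, 178],
        "Time": pd.to_datetime(
            [
                "2018-01-06 03:45:23",
                "2018-01-08 11:45:29",
                "2018-01-08 01:45:15",
                "2018-01-09 10:45:37",
                "2019-04-17 08:34:13",
                "2019-05-02 19:48:44",
                "2019-04-18 13:27:28",
            ]
        ),
        "Col B": ["B", "A", "C", "A", "B", "C", "A"],
        "Col C": [1, 0, 0, 1, 1, 0, 1],
    }
)

is_ab = df["Col B"].isin(["A", "B"])
is_a = df["Col B"].eq("A")

# Diff: seconds from the earliest A/B timestamp of each Col A group
# (assumption) to every A row; all other rows stay NaN
first_ab = df["Time"].where(is_ab).groupby(df["Col A"]).transform("min")
df["Diff"] = (df["Time"] - first_ab).dt.total_seconds().where(is_a)

# New Col: lag between the A/B rows flagged with Col C == 1; the value lands
# on the later row of each pair, and unflagged rows become NaN via alignment
flagged = df[is_ab & df["Col C"].eq(1)].sort_values("Time")
df["New Col"] = flagged.groupby("Col A")["Time"].diff().dt.total_seconds()
```

On this reduced sample the values for groups 123 and 178 agree with the `Diff` and `New Col` columns of the expected output.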
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |