Looking for a faster method for iterative calculations in a DataFrame
My current code works but I would like help making it cleaner/faster.
Goal: Among consecutive rows that share the same 'numb' and 'shopper' values, always keep the first row, then keep a later row only if 10 minutes or more have elapsed since the last row that was kept for that combination. A simple diff() or shift() would not work here, because I'm not comparing each row with the one directly before it but with the last kept row for the same numb/shopper combination.
Original dataframe:
| timestamp | rand | shopper | rand2 | numb |
|---|---|---|---|---|
| 2022-08-06 11:25:00 | b | Mark | i | 706040 |
| 2022-08-06 11:30:00 | a | John | h | 845843 |
| 2022-08-06 11:55:00 | c | John | g | 845843 |
| 2022-08-06 11:57:00 | d | John | h | 845843 |
| 2022-08-06 11:59:00 | f | John | j | 845843 |
| 2022-08-06 12:07:00 | d | John | h | 845843 |
| 2022-08-06 12:10:00 | d | John | h | 845843 |
| 2022-08-06 12:12:00 | f | Peter | j | 635640 |
Expected output:
| timestamp | rand | shopper | rand2 | numb |
|---|---|---|---|---|
| 2022-08-06 11:25:00 | b | Mark | i | 706040 |
| 2022-08-06 11:30:00 | a | John | h | 845843 |
| 2022-08-06 11:55:00 | c | John | g | 845843 |
| 2022-08-06 12:07:00 | d | John | h | 845843 |
| 2022-08-06 12:12:00 | f | Peter | j | 635640 |
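To make the point about diff() concrete, here is a minimal sketch that uses only John's rows from the table. The names `john`, `gap` and `diff_kept` are made up for illustration. Filtering on the gap to the previous row drops 12:07, because it is only 8 minutes after 11:59, even though it is 12 minutes after the last kept row (11:55).

```python
import pandas as pd

# John's rows from the original table (numb 845843); other columns omitted.
john = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-08-06 11:30:00",
        "2022-08-06 11:55:00",
        "2022-08-06 11:57:00",
        "2022-08-06 11:59:00",
        "2022-08-06 12:07:00",
        "2022-08-06 12:10:00",
    ]),
    "numb": 845843,
})

# Gap between each row and the row directly before it within the same numb.
gap = john.groupby("numb")["timestamp"].diff()

# Keep the first row of the group (gap is NaT) or any row with a gap >= 10 minutes.
diff_kept = john[gap.isna() | (gap >= pd.Timedelta(minutes=10))]

print(diff_kept["timestamp"].dt.strftime("%H:%M").tolist())
# ['11:30', '11:55']  (12:07 is missing compared with the expected output)
```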
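My code:

    from datetime import timedelta

    # df is the DataFrame shown above; 'timestamp' must already be a datetime column
    ref_idx = 0   # position of the last row that was kept
    next_idx = 1  # position of the row currently being examined
    numdays = timedelta(minutes=10)  # minimum gap (10 minutes, despite the name)
    saved_rows = df.iloc[[0]].copy()  # copy so the .loc assignment below is safe

    while True:
        try:
            next_time = df['timestamp'].iloc[next_idx]
        except IndexError:
            break  # ran past the last row
        else:
            ref_time = df['timestamp'].iloc[ref_idx]
            ref_numb = df['numb'].iloc[ref_idx]
            next_numb = df['numb'].iloc[next_idx]
            # keep the row if it starts a new numb, or if enough time has passed
            if ((next_time - ref_time) >= numdays and ref_numb == next_numb) or ref_numb != next_numb:
                saved_rows.loc[len(saved_rows.index)] = df.iloc[next_idx]
                ref_idx = next_idx
            next_idx += 1

    print(saved_rows)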
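One possible way to tidy this up, offered as a sketch rather than a definitive answer: the rule depends on the last kept row, so some loop remains, but it can run over plain NumPy arrays and fill a boolean mask instead of calling `.iloc` and appending with `.loc` on every step. The helper name `keep_spaced_rows` and its `min_minutes` parameter are hypothetical. It assumes the rows are already ordered as in the table, and it compares both 'numb' and 'shopper' (the loop above compares only 'numb'; the two agree on the sample data). No timings were measured here.

```python
import numpy as np
import pandas as pd


def keep_spaced_rows(df, min_minutes=10):
    """Hypothetical helper: keep a row when its (numb, shopper) pair changes
    or when at least `min_minutes` have passed since the last kept row."""
    if df.empty:
        return df
    times = df["timestamp"].to_numpy()            # datetime64 array
    keys = list(zip(df["numb"], df["shopper"]))   # one key per row
    min_gap = np.timedelta64(min_minutes, "m")

    keep = np.zeros(len(df), dtype=bool)
    keep[0] = True   # the first row is always kept
    ref = 0          # position of the last kept row
    for i in range(1, len(df)):
        if keys[i] != keys[ref] or times[i] - times[ref] >= min_gap:
            keep[i] = True
            ref = i
    return df[keep]


# Sample data copied from the original table.
df = pd.DataFrame(
    [
        ("2022-08-06 11:25:00", "b", "Mark", "i", 706040),
        ("2022-08-06 11:30:00", "a", "John", "h", 845843),
        ("2022-08-06 11:55:00", "c", "John", "g", 845843),
        ("2022-08-06 11:57:00", "d", "John", "h", 845843),
        ("2022-08-06 11:59:00", "f", "John", "j", 845843),
        ("2022-08-06 12:07:00", "d", "John", "h", 845843),
        ("2022-08-06 12:10:00", "d", "John", "h", 845843),
        ("2022-08-06 12:12:00", "f", "Peter", "j", 635640),
    ],
    columns=["timestamp", "rand", "shopper", "rand2", "numb"],
)
df["timestamp"] = pd.to_datetime(df["timestamp"])

result = keep_spaced_rows(df)
print(result)
```

Unlike the loop above, this keeps the original index labels (0, 1, 2, 5, 7); call `reset_index(drop=True)` on the result if a fresh 0..n-1 index is wanted.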
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
