Group time stamps based on intervals

I have a dataset that looks like this:

      main_id            time_stamp                        
          aaa            2019-05-29 08:16:05+05     
          aaa            2019-05-30 00:11:05+05     
          aaa            2020-05-30 09:15:07+05     
          bbb            2019-05-29 09:11:05+05     

For each main_id, I want to sort the time_stamps such that they are listed from the lowest to the highest.

I want to create a new column day, which uses the time_stamp to derive a number that describes the business day.

Business days are defined like this (example):

Monday 05:00 - Tuesday 01:00 => (1 Business Day, i.e. Monday)

Tuesday 05:00 - Wednesday 01:00 => (1 Business Day, i.e. Tuesday)

and so on...

This means that the first and second rows with main_id = aaa are from the same business day, as the second row shows a time before 1 am on the next day.

However, in the third row, the timestamp falls on a different business day, so its day is 2.

The end result could look something like this:

      main_id        time_stamp                             day
          aaa            2019-05-29 08:16:05+05              1
          aaa            2019-05-30 00:11:05+05              1
          aaa            2020-05-30 09:15:07+05              2
          bbb            2019-05-29 09:11:05+05              1

Day 1 would be anywhere between the first 5:00 am and the next day's 1 am, while day 2 would be the next possible business day (next 5 am - 1 am).

How can I achieve this?



Solution 1:[1]

You can define your business days in the business_days mapping and then compare each timestamp's weekday and time against it.


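from datetime import time

import pandas as pd

df['time_stamp'] = pd.to_datetime(df['time_stamp'])

# Keys follow strftime('%w'): 0 = Sunday, 1 = Monday, ..., 6 = Saturday
business_days = {
    1: 1,  # Mon
    2: 2,  # Tue
    3: 3,  # Wed
    4: 4,  # Thu
    5: 5,  # Fri
}

def map_business(x):
    # -1 means not a business day
    week = int(x.strftime('%w'))

    if week in business_days:
        if x.time() >= time(5, 0):
            return business_days[week]
        elif x.time() < time(1, 0) and (week - 1) in business_days:
            # before 01:00 still belongs to the previous business day
            return business_days[week - 1]
        else:
            return -1
    elif (week - 1) in business_days:
        # e.g. Saturday before 01:00 still belongs to Friday
        if x.time() < time(1, 0):
            return business_days[week - 1]
        else:
            return -1
    else:
        return -1


df = df.sort_values(['main_id', 'time_stamp'])
# map every timestamp to its business weekday (or -1)
df['days'] = df['time_stamp'].apply(map_business)
# number the distinct business weekdays, skipping non-business rows
df['days'] = df[df['days'].ne(-1)].groupby('days').ngroup() + 1
df['days'] = df['days'].fillna(-1).astype(int)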

Output (2020-05-30 is a Saturday, so that row is marked -1):

  main_id                time_stamp  days
0     aaa 2019-05-29 08:16:05+05:00     1
1     aaa 2019-05-30 00:11:05+05:00     1
2     aaa 2020-05-30 09:15:07+05:00    -1
3     bbb 2019-05-29 09:11:05+05:00     1
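
Note that the output above marks the third row as -1, whereas the question expects 2 there. If the goal is the per-main_id running number from the question, one possible sketch (not the approach of the answer above) is to shift every timestamp back by one hour, take the calendar date as the "business date", and dense-rank those dates within each main_id. This assumes weekdays do not matter, that timestamps between 01:00 and 05:00 belong to no business day, and that offsets are written as +05:00; the names shifted, business_date and outside are only illustrative.

```python
import pandas as pd

# Sample data from the question (offset written as +05:00)
df = pd.DataFrame({
    "main_id": ["aaa", "aaa", "aaa", "bbb"],
    "time_stamp": [
        "2019-05-29 08:16:05+05:00",
        "2019-05-30 00:11:05+05:00",
        "2020-05-30 09:15:07+05:00",
        "2019-05-29 09:11:05+05:00",
    ],
})
df["time_stamp"] = pd.to_datetime(df["time_stamp"])
df = df.sort_values(["main_id", "time_stamp"]).reset_index(drop=True)

# Shift back one hour so 00:00-00:59 lands on the previous calendar date,
# then drop the timezone (keeping local wall time) and truncate to midnight.
shifted = df["time_stamp"] - pd.Timedelta(hours=1)
business_date = shifted.dt.tz_localize(None).dt.normalize()

# Assumption: 01:00-04:59 local time is outside every business day -> NaT
hour = df["time_stamp"].dt.hour
outside = (hour >= 1) & (hour < 5)
business_date = business_date.mask(outside)

# Dense rank per main_id: earliest business date -> 1, next distinct one -> 2, ...
df["day"] = (
    business_date.groupby(df["main_id"]).rank(method="dense").astype("Int64")
)
print(df)
```

Rows that fall outside a business day end up with a missing value in day rather than -1; filling them with -1 afterwards would mirror the convention used above.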

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1