'Construct non-overlapping datetime record (start, end datetime) dataframe
I need to create a dataframe removing over-lapping start and end datetimes for multiple ids. I will use the start and end datetimes to aggregate values in a high frequency pandas dataframe, so I need to remove those overlapping datetimes in mst_df.
import pandas as pd
#Proxy reference dataframe
master = [['site a', '2021-07-08 00:00:00', '2021-07-08 10:56:00'],
['site a', '2021-07-08 06:00:00', '2021-07-08 12:00:00'], #slightly overlapping
['site a', '2021-07-08 17:36:00', '2021-07-09 11:40:00'],
['site a', '2021-07-08 18:00:00', '2021-07-09 11:40:00'], #overlapping
['site a', '2021-07-09 00:00:00', '2021-07-09 05:40:00'], #overlapping
['site b', '2021-07-08 00:00:00', '2021-07-08 10:24:00'],
['site b', '2021-07-08 06:00:00', '2021-07-08 10:24:00'], #overlapping
['site b', '2021-07-08 17:32:00', '2021-07-09 11:12:00'],
['site b', '2021-07-08 18:00:00', '2021-07-09 11:12:00'], #overlapping
['site b', '2021-07-09 00:00:00', '2021-07-09 13:00:00']] #slightly overlapping
mst_df = pd.DataFrame(master, columns = ['id', 'start', 'end'])
mst_df['start'] = pd.to_datetime(mst_df['start'], infer_datetime_format=True)
mst_df['end'] = pd.to_datetime(mst_df['end'], infer_datetime_format=True)
Desired DataFrame:
id start end
site a 2021-07-08 00:00:00 2021-07-08 12:00:00
site a 2021-07-08 17:36:00 2021-07-09 11:40:00
site b 2021-07-08 00:00:00 2021-07-08 10:24:00
site b 2021-07-08 17:32:00 2021-07-09 13:00:00
Solution 1:[1]
I don't know if pandas has special function for this. It has Interval.overlaping() to check if two ranges overlaps (and it works even with datetime) but I don't see function to merge these two ranges so it still needs own code for merging. Fortunately it is easy.
Rows are sorted by start so rows are not overlaping when previous_end < next_start and I use it in for-loop.
But first I group by site to work with every site separatelly.
Next I get first row (as previous) and run loop with other rows (as next) and check previous_end < next_start.
If it is True then I can put previous on list of results and get next as previous to work with rest of rows.
If it is False then I create new range from both rows and use it to work with rest of rows.
Finally I add previous to list.
After processing all groups I convert all to DataFrame.
import pandas as pd
#Proxy reference dataframe
master = [
['site a', '2021-07-08 00:00:00', '2021-07-08 10:56:00'],
['site a', '2021-07-08 06:00:00', '2021-07-08 12:00:00'], # slightly overlapping
['site a', '2021-07-08 17:36:00', '2021-07-09 11:40:00'],
['site a', '2021-07-08 18:00:00', '2021-07-09 11:40:00'], # overlapping
['site a', '2021-07-09 00:00:00', '2021-07-09 05:40:00'], # overlapping
['site b', '2021-07-08 00:00:00', '2021-07-08 10:24:00'],
['site b', '2021-07-08 06:00:00', '2021-07-08 10:24:00'], # overlapping
['site b', '2021-07-08 17:32:00', '2021-07-09 11:12:00'],
['site b', '2021-07-08 18:00:00', '2021-07-09 11:12:00'], # overlapping
['site b', '2021-07-09 00:00:00', '2021-07-09 13:00:00'] # slightly overlapping
]
mst_df = pd.DataFrame(master, columns = ['id', 'start', 'end'])
mst_df['start'] = pd.to_datetime(mst_df['start'], infer_datetime_format=True)
mst_df['end'] = pd.to_datetime(mst_df['end'], infer_datetime_format=True)
result = []
for val, group in mst_df.groupby('id'):
# get first
prev = group.iloc[0]
for idx, item in group[1:].iterrows():
if prev['end'] < item['start']:
# not overlapping - put previous to results and use next as previous
result.append(prev)
prev = item
else:
# overlappig - create on range start, end
prev['start'] = min(prev['start'], item['start'])
prev['end'] = max(prev['end'], item['end'])
# add when there is no next item
result.append(prev)
print(pd.DataFrame(result))
Result:
id start end
0 site a 2021-07-08 00:00:00 2021-07-08 12:00:00
2 site a 2021-07-08 17:36:00 2021-07-09 11:40:00
5 site b 2021-07-08 00:00:00 2021-07-08 10:24:00
7 site b 2021-07-08 17:32:00 2021-07-09 13:00:00
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
