Efficiently calculating time interval overlaps with data.table
I have a large data.table with ~1.5 million rows of start and end POSIXct times. For each row I want to calculate the percentage of its start-end interval that overlaps with each start-end interval in the rest of the data, and save a subset of the high-overlap rows to a separate table. This is straightforward to do with a for loop, but my current approach takes a very long time to run. In the code below, looping through just the first 1,000 rows takes ~20 seconds on my desktop; the main subsetting step (i.e., temp <- dt[end_time > focal$start_time & start_time < focal$end_time]) appears to account for the bulk of the processing. Is there another approach that would be faster?
library(data.table)
set.seed(123)
n <- 1500000
start_times <- as.POSIXct(runif(n, as.POSIXct("2000-01-01"), as.POSIXct("2010-01-01")), origin = "1970-01-01")
end_times <- start_times + as.double(runif(n)*1000000)
dt <- data.table(
  id = 1:n,
  start_time = start_times,
  end_time = end_times
)
dt[, window_length_hours := as.double(end_time - start_time, units = "hours")]
start <- Sys.time()
choice_list <- data.table(focal_id = integer(), option_id = integer(), overlap = double())
for (i in 1:1000) {
  focal <- dt[i]
  temp <- dt[end_time > focal$start_time & start_time < focal$end_time]
  temp[, window_overlap_pct :=
         (focal$window_length_hours -
            pmax(0, as.double(focal$end_time - end_time, units = "hours")) -
            pmax(0, as.double(start_time - focal$start_time, units = "hours"))) /
         focal$window_length_hours]
  # save two rows that have high overlap (sampled with replacement, then deduplicated)
  sampled <- unique(temp[window_overlap_pct > 0.80][sample(.N, 2, replace = TRUE)])
  choice_list <- rbindlist(list(
    choice_list,
    list(focal_id = focal$id, option_id = sampled$id, overlap = sampled$window_overlap_pct),
    list(focal_id = focal$id, option_id = focal$id, overlap = NA_real_)
  ))
}
end <- Sys.time()
end-start
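For reference, the per-pair quantity the loop computes reduces to the length of the intersection of the two intervals divided by the focal window length. A minimal standalone version (the helper name is my own, not from the code above):

```r
# Overlap fraction of a candidate interval with a focal interval:
# intersection length divided by the focal window length.
# Vectorized, so it also works on columns of numeric or POSIXct times.
overlap_pct <- function(focal_start, focal_end, cand_start, cand_end) {
  pmax(0, pmin(focal_end, cand_end) - pmax(focal_start, cand_start)) /
    (focal_end - focal_start)
}

overlap_pct(0, 10, 5, 20)   # 0.5: half of the focal window is covered
overlap_pct(0, 10, 20, 30)  # 0: disjoint intervals
```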
Solution 1:[1]
You could use foverlaps:
setkey(dt, start_time, end_time)
foverlaps(dt, dt)[
  , window_overlap_pct := as.numeric(difftime(pmin(end_time, i.end_time),
                                              pmax(start_time, i.start_time),
                                              units = "hours")) /
      as.numeric(difftime(end_time, start_time, units = "hours"))][
  window_overlap_pct > .8 & window_overlap_pct < 1][
  , unique(.SD[sample(.N, 2, replace = TRUE)]), by = .(start_time, end_time)]
start_time end_time id i.id i.start_time i.end_time window_overlap_pct
<POSc> <POSc> <int> <int> <POSc> <POSc> <num>
1: 2000-01-04 04:07:07 2000-01-07 23:40:57 444 21 2000-01-01 18:11:32 2000-01-07 18:07:19 0.9392702
2: 2000-01-04 04:07:07 2000-01-07 23:40:57 444 9856 2000-01-01 19:17:25 2000-01-07 11:32:50 0.8674663
3: 2000-01-01 18:11:32 2000-01-07 18:07:19 21 9856 2000-01-01 19:17:25 2000-01-07 11:32:50 0.9466909
4: 2000-01-01 18:11:32 2000-01-07 18:07:19 21 8668 2000-01-02 13:23:45 2000-01-13 14:45:43 0.8665766
5: 2000-01-05 05:01:53 2000-01-07 14:43:53 3956 939 2000-01-05 16:25:01 2000-01-08 08:51:51 0.8026778
---
13706: 2009-12-31 22:48:04 2010-01-09 15:03:55 8013 2751 2009-12-31 12:21:27 2010-01-08 22:16:44 0.9193986
13707: 2009-12-31 22:48:04 2010-01-09 15:03:55 8013 7056 2009-12-31 23:26:28 2010-01-10 20:01:14 0.9969279
13708: 2009-12-28 05:46:18 2010-01-08 19:13:54 4027 5221 2009-12-30 12:17:44 2010-01-09 05:40:29 0.8034898
13709: 2009-12-31 23:26:28 2010-01-10 20:01:14 7056 5221 2009-12-30 12:17:44 2010-01-09 05:40:29 0.8379163
13710: 2009-12-31 23:26:28 2010-01-10 20:01:14 7056 2751 2009-12-31 12:21:27 2010-01-08 22:16:44 0.8066545
This worked for 100,000 rows in ~3 minutes, but wasn't able to scale to 1 million rows due to a memory allocation error, since the full overlap self-join is materialized at once.
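If memory is the constraint, one possible workaround (a sketch only, not tested at the full 1.5 million rows) is to run foverlaps in batches of focal rows against the full keyed table, keeping only the high-overlap pairs from each batch before moving on. The batch size and the small synthetic dt below are my own assumptions for illustration:

```r
library(data.table)

# Small synthetic stand-in for the question's dt, so the sketch is self-contained.
set.seed(123)
n <- 5000L
start_times <- as.POSIXct(runif(n, 0, 1e6), origin = "1970-01-01", tz = "UTC")
dt <- data.table(id = 1:n, start_time = start_times,
                 end_time = start_times + runif(n) * 5e4)
setkey(dt, start_time, end_time)

batch_size <- 1000L  # tune to available memory
n_batches <- ceiling(nrow(dt) / batch_size)
results <- vector("list", n_batches)
for (b in seq_len(n_batches)) {
  idx <- ((b - 1L) * batch_size + 1L):min(b * batch_size, nrow(dt))
  # Overlap-join this batch of focal rows (the i.* columns) against all rows.
  ov <- foverlaps(dt[idx], dt)
  ov[, window_overlap_pct :=
       as.numeric(difftime(pmin(end_time, i.end_time),
                           pmax(start_time, i.start_time), units = "hours")) /
       as.numeric(difftime(i.end_time, i.start_time, units = "hours"))]
  # keep only high-overlap pairs, dropping each row's self-match
  results[[b]] <- ov[window_overlap_pct > 0.8 & id != i.id]
}
choice_list <- rbindlist(results)
```

Only one batch-sized join result is held in memory at a time, at the cost of repeated binary-search lookups into the keyed table; the union over batches is the same set of pairs the single foverlaps call would produce.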
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
