Efficiently calculating time interval overlaps with data.table
I have a large data.table with ~1.5 million rows of start and end POSIXct times. For each row I want to calculate the percentage of its start-end interval that overlaps with each start-end interval in the rest of the data, and save a subset of the high-overlap rows to a separate table. This is straightforward to do with a for loop, but my current approach takes a very long time to run. In the code below, looping through just the first 1,000 rows takes ~20 seconds on my desktop; the main subsetting step (i.e., temp <- dt[end_time > focal$start_time & start_time < focal$end_time]) appears to account for the bulk of the processing. Is there another approach that would be faster?
library(data.table)
set.seed(123)
n <- 1500000
start_times <- as.POSIXct(runif(n, as.POSIXct("2000-01-01"), as.POSIXct("2010-01-01")), origin = "1970-01-01")
end_times <- start_times + as.double(runif(n)*1000000)
dt <- data.table(
  id = 1:n,
  start_time = start_times,
  end_time = end_times
)
dt[, window_length_hours := as.double(end_time - start_time, units = "hours")]
start <- Sys.time()
choice_list <- data.table(focal_id = integer(), option_id = integer(), overlap = double())
for (i in 1:1000) {
  focal <- dt[i]
  temp <- dt[end_time > focal$start_time & start_time < focal$end_time]
  temp[, window_overlap_pct :=
         (focal$window_length_hours -
            pmax(0, as.double(focal$end_time - end_time, units = "hours")) -
            pmax(0, as.double(start_time - focal$start_time, units = "hours"))) /
         focal$window_length_hours]
  # save two rows that have high overlap (sampled with replacement, then deduplicated)
  sampled <- unique(temp[window_overlap_pct > 0.80][sample(.N, 2, replace = TRUE)])
  choice_list <- rbindlist(list(
    choice_list,
    list(focal_id = focal$id, option_id = sampled$id, overlap = sampled$window_overlap_pct),
    list(focal_id = focal$id, option_id = focal$id, overlap = NA_real_)
  ))
}
end <- Sys.time()
end-start
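For reference, the per-pair quantity the loop computes reduces to the length of the intersection of the two intervals divided by the focal window length. A minimal standalone version (the helper name is my own, not from the code above):

```r
# Overlap fraction of a candidate interval with a focal interval:
# intersection length divided by the focal window length.
# Vectorized, so it also works on columns of numeric or POSIXct times.
overlap_pct <- function(focal_start, focal_end, cand_start, cand_end) {
  pmax(0, pmin(focal_end, cand_end) - pmax(focal_start, cand_start)) /
    (focal_end - focal_start)
}

overlap_pct(0, 10, 5, 20)   # 0.5: half of the focal window is covered
overlap_pct(0, 10, 20, 30)  # 0: disjoint intervals
```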
Solution 1:[1]
You could use foverlaps:
setkey(dt, start_time, end_time)
foverlaps(dt, dt)[
  , window_overlap_pct := as.numeric(difftime(pmin(end_time, i.end_time),
                                              pmax(start_time, i.start_time),
                                              units = "hours")) /
      as.numeric(difftime(end_time, start_time, units = "hours"))][
  window_overlap_pct > .8 & window_overlap_pct < 1][
  , unique(.SD[sample(.N, 2, replace = TRUE)]), by = .(start_time, end_time)]
start_time end_time id i.id i.start_time i.end_time window_overlap_pct
<POSc> <POSc> <int> <int> <POSc> <POSc> <num>
1: 2000-01-04 04:07:07 2000-01-07 23:40:57 444 21 2000-01-01 18:11:32 2000-01-07 18:07:19 0.9392702
2: 2000-01-04 04:07:07 2000-01-07 23:40:57 444 9856 2000-01-01 19:17:25 2000-01-07 11:32:50 0.8674663
3: 2000-01-01 18:11:32 2000-01-07 18:07:19 21 9856 2000-01-01 19:17:25 2000-01-07 11:32:50 0.9466909
4: 2000-01-01 18:11:32 2000-01-07 18:07:19 21 8668 2000-01-02 13:23:45 2000-01-13 14:45:43 0.8665766
5: 2000-01-05 05:01:53 2000-01-07 14:43:53 3956 939 2000-01-05 16:25:01 2000-01-08 08:51:51 0.8026778
---
13706: 2009-12-31 22:48:04 2010-01-09 15:03:55 8013 2751 2009-12-31 12:21:27 2010-01-08 22:16:44 0.9193986
13707: 2009-12-31 22:48:04 2010-01-09 15:03:55 8013 7056 2009-12-31 23:26:28 2010-01-10 20:01:14 0.9969279
13708: 2009-12-28 05:46:18 2010-01-08 19:13:54 4027 5221 2009-12-30 12:17:44 2010-01-09 05:40:29 0.8034898
13709: 2009-12-31 23:26:28 2010-01-10 20:01:14 7056 5221 2009-12-30 12:17:44 2010-01-09 05:40:29 0.8379163
13710: 2009-12-31 23:26:28 2010-01-10 20:01:14 7056 2751 2009-12-31 12:21:27 2010-01-08 22:16:44 0.8066545
This worked for 100,000 rows in ~3 minutes, but wasn't able to scale to 1 million rows due to a memory allocation error, since the full overlap self-join is materialized at once.
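If memory is the constraint, one possible workaround (a sketch only, not tested at the full 1.5 million rows) is to run foverlaps in batches of focal rows against the full keyed table, keeping only the high-overlap pairs from each batch before moving on. The batch size and the small synthetic dt below are my own assumptions for illustration:

```r
library(data.table)

# Small synthetic stand-in for the question's dt, so the sketch is self-contained.
set.seed(123)
n <- 5000L
start_times <- as.POSIXct(runif(n, 0, 1e6), origin = "1970-01-01", tz = "UTC")
dt <- data.table(id = 1:n, start_time = start_times,
                 end_time = start_times + runif(n) * 5e4)
setkey(dt, start_time, end_time)

batch_size <- 1000L  # tune to available memory
n_batches <- ceiling(nrow(dt) / batch_size)
results <- vector("list", n_batches)
for (b in seq_len(n_batches)) {
  idx <- ((b - 1L) * batch_size + 1L):min(b * batch_size, nrow(dt))
  # Overlap-join this batch of focal rows (the i.* columns) against all rows.
  ov <- foverlaps(dt[idx], dt)
  ov[, window_overlap_pct :=
       as.numeric(difftime(pmin(end_time, i.end_time),
                           pmax(start_time, i.start_time), units = "hours")) /
       as.numeric(difftime(i.end_time, i.start_time, units = "hours"))]
  # keep only high-overlap pairs, dropping each row's self-match
  results[[b]] <- ov[window_overlap_pct > 0.8 & id != i.id]
}
choice_list <- rbindlist(results)
```

Only one batch-sized join result is held in memory at a time, at the cost of repeated binary-search lookups into the keyed table; the union over batches is the same set of pairs the single foverlaps call would produce.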
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
