Match two tables based on a time difference criterion

I have a data table (lv_timest) with time stamps every 3 hours for each date:

# A tibble: 6 × 5
     LV0_mean LV1_mean  LV2_mean Date_time           Date      
     <dbl>    <dbl>     <dbl>    <S3:POSIXct>        <date>    
1    0.778    -4.12     0.736    2016-12-28 00:00:00 2016-12-28
2    0.376    -0.234    0.388    2016-12-28 03:00:00 2016-12-28
3    0.409    1.46      0.241    2016-12-28 06:00:00 2016-12-28
4    0.760    2.07      0.460    2016-12-28 09:00:00 2016-12-28
5    0.759    2.91      0.735    2016-12-28 12:00:00 2016-12-28
6    0.857    3.00      0.803    2016-12-28 15:00:00 2016-12-28

from which I would like to extract the time stamps that match as closely as possible those of another table (event_timest):

# A tibble: 6 × 4
   Event_number Date_time           Date       Date_time_new
   <int>        <S3: POSIXct>       <date>     <S3: POSIXct>
1  75           2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00
2  123          2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00
3  264          2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00
4  317          2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00
5  318          2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00
6  369          2017-01-11 07:24:00 2017-01-11 2017-01-11 07:00:00

For example, for row 1 in table event_timest, I would extract row 4 from table lv_timest:

Event_number Date_time.x          Date.x      Date_time_new      LV0_mean LV1_mean   LV2_mean Date_time.y          Date.y
<int>        <S3: POSIXct>        <date>      <S3: POSIXct>      <dbl>    <dbl>      <dbl>    <S3: POSIXct>        <date>                         
75           2016-12-28 08:00:00  2016-12-28 2016-12-28 08:00:00 0.760    2.07       0.460    2016-12-28 09:00:00  2016-12-28

More precisely, the time difference should not be more than one hour. I thought of using the fuzzyjoin package for this and writing a function that computes the time difference between the timestamps of the two tables in hours. However, fuzzy_inner_join does not give one-to-one matches: the same lv_timest row can be matched to several events, and the same event can be matched to several lv_timest timestamps (see rows 4–6 of the output below).

require(lubridate)
require(fuzzyjoin)

diff_timest <- function(x, y) {abs(x %--% y %/% hours(1)) <= 1} # time difference, in whole hours, must be ≤ 1

match_timest <- fuzzy_inner_join(event_timest, lv_timest,
                                 by = c("Date" = "Date",
                                        "Date_time_new" = "Date_time"),
                                 match_fun = list(`==`, diff_timest))
head(match_timest)

# A tibble: 6 × 9
  Event_number Date_time.x         Date.x     Date_time_new       LV0_mean LV1_mean LV2_mean Date_time.y         Date.y    
         <int> <dttm>              <date>     <dttm>                 <dbl>    <dbl>    <dbl> <dttm>              <date>    
1           75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00   0.760     2.07     0.460 2016-12-28 09:00:00 2016-12-28
2          123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00   1.24      1.83     2.05  2016-12-30 15:00:00 2016-12-30
3          264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00  -0.128    -5.43     2.72  2017-01-07 06:00:00 2017-01-07
4          317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00  -0.0751    0.171    2.56  2017-01-09 09:00:00 2017-01-09
5          317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00  -0.204    -0.797    2.28  2017-01-09 12:00:00 2017-01-09
6          318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00  -0.204    -0.797    2.28  2017-01-09 12:00:00 2017-01-09
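
One workaround I can see is to keep, for each event, only the closest of the matched rows afterwards, for example with dplyr (this uses slice_min(), available from dplyr 1.0.0):

library(dplyr)

match_timest %>%
  group_by(Event_number) %>%
  # keep only the row with the smallest absolute time difference per event
  slice_min(abs(difftime(Date_time.x, Date_time.y, units = "hours")), n = 1) %>%
  ungroup()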

Would there be another way to do this?



Solution 1:

Joining is essentially a procedure of first building all combinations of rows and then filtering them. We can do this manually:

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

datetimes_a <- tibble(
  id = seq(3),
  group = "A",
  datetime = c("2016-12-28 00:00:00", "2016-12-28 03:00:00", "2016-12-28 23:59:59") %>% as.POSIXct()
)
datetimes_b <- tibble(
  id = seq(3),
  group = "B",
  datetime = c("2016-12-28 00:00:10", "2016-12-28 03:20:00", "2016-12-29 00:00:02") %>% as.POSIXct()
)

datetimes_a %>%
  # start with cross product of all possible pairs
  expand_grid(datetimes_b %>% rename_all(~ paste0(.x, "_b"))) %>%
  mutate(diff = abs(datetime - datetime_b)) %>%
  # for each row of a, keep only the closest row of b
  group_by(id) %>%
  arrange(diff) %>%
  slice(1) %>%
  # time diff must be less than 1hr
  filter(diff < hours(1))
#> # A tibble: 3 x 7
#> # Groups:   id [3]
#>      id group datetime             id_b group_b datetime_b          diff     
#>   <int> <chr> <dttm>              <int> <chr>   <dttm>              <drtn>   
#> 1     1 A     2016-12-28 00:00:00     1 B       2016-12-28 00:00:10   10 secs
#> 2     2 A     2016-12-28 03:00:00     2 B       2016-12-28 03:20:00 1200 secs
#> 3     3 A     2016-12-28 23:59:59     3 B       2016-12-29 00:00:02    3 secs

Created on 2022-02-08 by the reprex package (v2.0.1)

This also works if the nearest time point falls on another date, e.g. right before and right after midnight.
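
Applied to the tables from the question, the same idea might look like the following sketch. It assumes lv_timest and event_timest with the column names shown above, and uses slice_min() (dplyr >= 1.0.0) as a compact equivalent of arrange() followed by slice(1):

library(tidyverse)
library(lubridate)

event_timest %>%
  # all combinations of events and LV time stamps
  expand_grid(lv_timest %>% rename_all(~ paste0(.x, "_lv"))) %>%
  # time difference between the actual event time and the LV time stamp
  mutate(diff = abs(Date_time - Date_time_lv)) %>%
  # for each event, keep only the closest LV time stamp ...
  group_by(Event_number) %>%
  slice_min(diff, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  # ... and only if it lies within one hour
  filter(as.numeric(diff, units = "hours") <= 1)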

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
