Match two tables based on a time difference criterion
I have a data table (lv_timest) with time stamps every 3 hours for each date:
# A tibble: 6 × 5
LV0_mean LV1_mean LV2_mean Date_time Date
<dbl> <dbl> <dbl> <S3:POSIXct> <date>
1 0.778 -4.12 0.736 2016-12-28 00:00:00 2016-12-28
2 0.376 -0.234 0.388 2016-12-28 03:00:00 2016-12-28
3 0.409 1.46 0.241 2016-12-28 06:00:00 2016-12-28
4 0.760 2.07 0.460 2016-12-28 09:00:00 2016-12-28
5 0.759 2.91 0.735 2016-12-28 12:00:00 2016-12-28
6 0.857 3.00 0.803 2016-12-28 15:00:00 2016-12-28
from which I would like to extract the time stamps that match as closely as possible those of another table (event_timest):
# A tibble: 6 × 4
Event_number Date_time Date Date_time_new
<int> <S3: POSIXct> <date> <S3: POSIXct>
1 75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00
2 123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00
3 264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00
4 317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00
5 318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00
6 369 2017-01-11 07:24:00 2017-01-11 2017-01-11 07:00:00
For example, for row 1 in table event_timest, I would extract row 4 from table lv_timest:
Event_number Date_time.x Date.x Date_time_new LV0_mean LV1_mean LV2_mean Date_time.y Date.y
<int> <S3: POSIXct> <date> <S3: POSIXct> <dbl> <dbl> <dbl> <S3: POSIXct> <date>
75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00 0.760 2.07 0.460 2016-12-28 09:00:00 2016-12-28
In fact, the time difference should not exceed one hour. I thought of using the fuzzyjoin package for this, with a function that computes the time difference between the timestamps of the two tables in hours. However, fuzzy_inner_join replicates rows of the second table (event_timest) because several timestamps in the first table (lv_timest) are matched to the same event.
require(lubridate)
require(fuzzyjoin)
diff_timest <- function(x, y) {
  abs(x %--% y %/% hours(1)) <= 1  # time interval as hours ≤ 1 hour
}
match_timest <- fuzzy_inner_join(event_timest, lv_timest,
                                 by = c("Date" = "Date",
                                        "Date_time_new" = "Date_time"),
                                 match_fun = list(`==`, diff_timest))
head(match_timest)
# A tibble: 6 × 9
Event_number Date_time.x Date.x Date_time_new LV0_mean LV1_mean LV2_mean Date_time.y Date.y
<int> <dttm> <date> <dttm> <dbl> <dbl> <dbl> <dttm> <date>
1 75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00 0.760 2.07 0.460 2016-12-28 09:00:00 2016-12-28
2 123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00 1.24 1.83 2.05 2016-12-30 15:00:00 2016-12-30
3 264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00 -0.128 -5.43 2.72 2017-01-07 06:00:00 2017-01-07
4 317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00 -0.0751 0.171 2.56 2017-01-09 09:00:00 2017-01-09
5 317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00 -0.204 -0.797 2.28 2017-01-09 12:00:00 2017-01-09
6 318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00 -0.204 -0.797 2.28 2017-01-09 12:00:00 2017-01-09
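Since `%/%` is integer division, `diff_timest` compares whole hours rather than the exact gap. As a minimal sketch, assuming the intent is "absolute difference of at most one hour", the test can instead be written on the exact difference with base R's `difftime` in explicit units. The helper name `within_one_hour` and the two sample timestamps are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: TRUE where |x - y| is at most one hour.
# difftime() with explicit units avoids interval/period division.
within_one_hour <- function(x, y) {
  abs(as.numeric(difftime(x, y, units = "hours"))) <= 1
}

# Illustrative values: one rounded event time against two 3-hourly stamps
event <- as.POSIXct("2017-01-09 11:00:00", tz = "UTC")
grid  <- as.POSIXct(c("2017-01-09 09:00:00", "2017-01-09 12:00:00"), tz = "UTC")

within_one_hour(event, grid)
#> [1] FALSE  TRUE
```

A function of this shape takes two vectors and returns a logical vector, so it could be tried as one of the `match_fun` entries in place of `diff_timest`.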
Would there be another way to do this?
Solution 1:[1]
Conceptually, a join is the cross product of all row pairs from both tables followed by a filter. We can do this manually:
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
datetimes_a <- tibble(
  id = seq(3),
  group = "A",
  datetime = c("2016-12-28 00:00:00", "2016-12-28 03:00:00", "2016-12-28 23:59:59") %>% as.POSIXct()
)
datetimes_b <- tibble(
  id = seq(3),
  group = "B",
  datetime = c("2016-12-28 00:00:10", "2016-12-28 03:20:00", "2016-12-29 00:00:02") %>% as.POSIXct()
)
datetimes_a %>%
  # start with the cross product of all possible pairs
  expand_grid(datetimes_b %>% rename_all(~ paste0(.x, "_b"))) %>%
  mutate(diff = abs(datetime - datetime_b)) %>%
  # keep the shortest time difference for each row of datetimes_a
  # (grouping by id only; grouping by id and id_b would leave one row per group)
  group_by(id) %>%
  arrange(diff) %>%
  slice(1) %>%
  # time diff must be less than 1 hour
  filter(diff < hours(1))
#> # A tibble: 3 x 7
#> # Groups: id [3]
#> id group datetime id_b group_b datetime_b diff
#> <int> <chr> <dttm> <int> <chr> <dttm> <drtn>
#> 1 1 A 2016-12-28 00:00:00 1 B 2016-12-28 00:00:10 10 secs
#> 2 2 A 2016-12-28 03:00:00 2 B 2016-12-28 03:20:00 1200 secs
#> 3 3 A 2016-12-28 23:59:59 3 B 2016-12-29 00:00:02 3 secs
Created on 2022-02-08 by the reprex package (v2.0.1)
This also works when the nearest timestamp falls on a different date, e.g. just before or after midnight.
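The cross product grows with the product of the two row counts. As a rough alternative sketch in base R, each event can look up its nearest timestamp directly with `which.min` and then be dropped if the gap exceeds one hour. The toy tables below are assumptions for illustration (event 76 is made up), and the column names `lv_time` and `gap_hours` are hypothetical.

```r
# Toy 3-hourly grid for one day (illustrative, not the asker's full table)
lv <- data.frame(
  Date_time = as.POSIXct("2016-12-28 00:00:00", tz = "UTC") + 3600 * seq(0, 21, by = 3)
)
# Two events: 75 comes from the question, 76 is a made-up counterexample
events <- data.frame(
  Event_number  = c(75L, 76L),
  Date_time_new = as.POSIXct(c("2016-12-28 08:00:00", "2016-12-28 22:30:00"), tz = "UTC")
)

# For each event, index of the closest grid timestamp (first one wins on ties)
nearest <- vapply(
  seq_len(nrow(events)),
  function(i) {
    gaps <- abs(as.numeric(difftime(lv$Date_time, events$Date_time_new[i], units = "secs")))
    which.min(gaps)
  },
  integer(1)
)

# Gap between each event and its nearest grid timestamp, in hours
gap_hours <- abs(as.numeric(
  difftime(lv$Date_time[nearest], events$Date_time_new, units = "hours")
))

matched <- data.frame(
  Event_number  = events$Event_number,
  Date_time_new = events$Date_time_new,
  lv_time       = lv$Date_time[nearest],
  gap_hours     = gap_hours
)
# Keep only events whose nearest timestamp is at most one hour away
matched <- matched[matched$gap_hours <= 1, ]
```

Here event 75 (08:00) pairs with the 09:00 timestamp, while the made-up event 76 (22:30) is dropped because its nearest timestamp, 21:00, is 1.5 hours away. Each event yields at most one row, so no event is duplicated.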
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
