'Merging dataframes by column within some numerical threshold R

I need to merge two dataframes by two columns (participant and time) where time matches may be near, not exact (plus/minus 0.001). Below are two sample dataframes where the first three times are near matches and the others exact. Fuzzy_join in general doesn't work because these are large dataframes...

df1 <- data.frame("participant" = c("1", "1", "1", "1", "1", "1", "1", "1", 
                               "2", "2", "2", "2", "2", "2", "2", "2"), 
             "item" = c("a", "b", "c", "d", "e", "f", "g", "h",
                        "i", "j", "k", "l", "m", "n", "o", "p"),
             "time" = c("43.565", "54.125", "65.923", "73.858", "111.123", "143.124", "255.500", "255.502",
                        "300.595", "350.252", "400.600", "511.122", "525.887", "577.752", "599.129", "601.992"))

df2 <- data.frame("participant" = c("1", "1", "1", "1", "1", "1", "1", "1", 
                                   "2", "2", "2", "2", "2", "2", "2", "2"), 
                 "value" = c("xyz", "hlm", "test", "nop", "test", "nop", "hlm", "test",
                             "hlm", "test", "xyz", "xyz", "test", "xyz", "nop", "xyz"),
                 "time" = c("43.566", "54.124", "65.922", "73.858", "111.123", "143.124", "255.500", "255.502",
                            "300.595", "350.252", "400.600", "511.122", "525.887", "577.752", "599.129", "601.992"))


Solution 1:[1]

Why not use fuzzy_join?. Looks pretty fast, can't think of anything faster. I had to convert the character variable "time" to numeric before the fuzzy_join operation.

df1 <- df1 %>%
    readr::type_convert()
df2 <- df2%>%
    readr::type_convert()

library(fuzzyjoin)

fuzzy_join(df1, df2, by = "time",
           match_fun = ~ abs(.x - .y) < 0.002)

   participant.x item  time.x participant.y value  time.y
1              1    a  43.565             1   xyz  43.566
2              1    b  54.125             1   hlm  54.124
3              1    c  65.923             1  test  65.922
4              1    d  73.858             1   nop  73.858
5              1    e 111.123             1  test 111.123
6              1    f 143.124             1   nop 143.124
7              1    g 255.500             1   hlm 255.500
8              1    h 255.502             1  test 255.502
9              2    i 300.595             2   hlm 300.595
10             2    j 350.252             2  test 350.252
11             2    k 400.600             2   xyz 400.600
12             2    l 511.122             2   xyz 511.122
13             2    m 525.887             2  test 525.887
14             2    n 577.752             2   xyz 577.752
15             2    o 599.129             2   nop 599.129
16             2    p 601.992             2   xyz 601.992

EDIT

The OP asked for a function that matches multiple columns. If we want multiple pairs of columns, we can use e vector of columns to match, and a list of matching functions, as in:

fuzzy_join(df1, df2, by = c("participant", "time"),
           match_fun = list(`==`,
                            \(x,y) abs(x - y) < 0.002)
           )

   participant.x item  time.x participant.y value  time.y
1              1    a  43.565             1   xyz  43.566
2              1    b  54.125             1   hlm  54.124
3              1    c  65.923             1  test  65.922
4              1    d  73.858             1   nop  73.858
5              1    e 111.123             1  test 111.123
6              1    f 143.124             1   nop 143.124
7              1    g 255.500             1   hlm 255.500
8              1    h 255.502             1  test 255.502
9              2    i 300.595             2   hlm 300.595
10             2    j 350.252             2  test 350.252
11             2    k 400.600             2   xyz 400.600
12             2    l 511.122             2   xyz 511.122
13             2    m 525.887             2  test 525.887
14             2    n 577.752             2   xyz 577.752
15             2    o 599.129             2   nop 599.129
16             2    p 601.992             2   xyz 601.992

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1