'Merging dataframes by column within some numerical threshold R
I need to merge two dataframes by two columns (participant and time) where time matches may be near, not exact (plus/minus 0.001). Below are two sample dataframes where the first three times are near matches and the others exact. Fuzzy_join in general doesn't work because these are large dataframes...
df1 <- data.frame("participant" = c("1", "1", "1", "1", "1", "1", "1", "1",
"2", "2", "2", "2", "2", "2", "2", "2"),
"item" = c("a", "b", "c", "d", "e", "f", "g", "h",
"i", "j", "k", "l", "m", "n", "o", "p"),
"time" = c("43.565", "54.125", "65.923", "73.858", "111.123", "143.124", "255.500", "255.502",
"300.595", "350.252", "400.600", "511.122", "525.887", "577.752", "599.129", "601.992"))
df2 <- data.frame("participant" = c("1", "1", "1", "1", "1", "1", "1", "1",
"2", "2", "2", "2", "2", "2", "2", "2"),
"value" = c("xyz", "hlm", "test", "nop", "test", "nop", "hlm", "test",
"hlm", "test", "xyz", "xyz", "test", "xyz", "nop", "xyz"),
"time" = c("43.566", "54.124", "65.922", "73.858", "111.123", "143.124", "255.500", "255.502",
"300.595", "350.252", "400.600", "511.122", "525.887", "577.752", "599.129", "601.992"))
Solution 1:[1]
Why not use fuzzy_join?. Looks pretty fast, can't think of anything faster.
I had to convert the character variable "time" to numeric before the fuzzy_join operation.
df1 <- df1 %>%
readr::type_convert()
df2 <- df2%>%
readr::type_convert()
library(fuzzyjoin)
fuzzy_join(df1, df2, by = "time",
match_fun = ~ abs(.x - .y) < 0.002)
participant.x item time.x participant.y value time.y
1 1 a 43.565 1 xyz 43.566
2 1 b 54.125 1 hlm 54.124
3 1 c 65.923 1 test 65.922
4 1 d 73.858 1 nop 73.858
5 1 e 111.123 1 test 111.123
6 1 f 143.124 1 nop 143.124
7 1 g 255.500 1 hlm 255.500
8 1 h 255.502 1 test 255.502
9 2 i 300.595 2 hlm 300.595
10 2 j 350.252 2 test 350.252
11 2 k 400.600 2 xyz 400.600
12 2 l 511.122 2 xyz 511.122
13 2 m 525.887 2 test 525.887
14 2 n 577.752 2 xyz 577.752
15 2 o 599.129 2 nop 599.129
16 2 p 601.992 2 xyz 601.992
EDIT
The OP asked for a function that matches multiple columns. If we want multiple pairs of columns, we can use e vector of columns to match, and a list of matching functions, as in:
fuzzy_join(df1, df2, by = c("participant", "time"),
match_fun = list(`==`,
\(x,y) abs(x - y) < 0.002)
)
participant.x item time.x participant.y value time.y
1 1 a 43.565 1 xyz 43.566
2 1 b 54.125 1 hlm 54.124
3 1 c 65.923 1 test 65.922
4 1 d 73.858 1 nop 73.858
5 1 e 111.123 1 test 111.123
6 1 f 143.124 1 nop 143.124
7 1 g 255.500 1 hlm 255.500
8 1 h 255.502 1 test 255.502
9 2 i 300.595 2 hlm 300.595
10 2 j 350.252 2 test 350.252
11 2 k 400.600 2 xyz 400.600
12 2 l 511.122 2 xyz 511.122
13 2 m 525.887 2 test 525.887
14 2 n 577.752 2 xyz 577.752
15 2 o 599.129 2 nop 599.129
16 2 p 601.992 2 xyz 601.992
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
