'Check if a value in a dataframe is conditionally between a range of values specified by two columns of another dataframe
So, I have two dataframes - somewhat large (df1 ~= 20k rows & df2 ~= 1.5 million) - and I want to check if a value in df1 is between df2$low & df2$high, but do it conditionally (to limit the number of checks) and only do the check if abs(df1$val-df2$val) < 2. If the value in df1 is found to be within the df2 range then add into a new column with TRUE/FALSE values.
df1
| weight | low | high |
|---|---|---|
| 94.99610 | 94.99608 | 94.99613 |
| 95.00561 | 95.00558 | 95.00566 |
df2
| index | th_weight |
|---|---|
| 1 | 94.996092 |
| 2 | 95.496336 |
| 3 | 95.509906 |
| 4 | 97.473292 |
| 5 | 100.519060 |
Desired output should be:
df1
| weight | lower | upper | filter |
|---|---|---|---|
| 94.99610 | 94.99608 | 94.99613 | TRUE |
| 95.00561 | 95.00558 | 95.00566 | FALSE |
So in that example the logic is that rows 4 and 5 in df2 would get omitted. Value in row 1 of df2 is within row 1 of df1$low & df1$high, therefore df1[,filter := TRUE] for that weight
I have tried multiple ways, with double looping into the data.table, which, as expected, was the least efficient and I also tried to make a cartesian product of df1 x df2 with dplyr::full_join but having memory issues there. Even splitting into chunks and appending into a file (so I can later easily manipulate the file via UNIX sed or awk) didn't help.
There might be a very easy way to do this and I might be completely off-track, so I apologise in advance.
Solution 1:[1]
Use a non-equi join with data.table - convert the first data to data.table (setDT), create the filter column as logical (FALSE) values. Do a non-equi join, and assign (:=) the filter to TRUE, which changes the FALSE to TRUE only when the condition (abs(weight - th_weight) < 2) meets
library(data.table)
setDT(df1)[, filter := FALSE]
df1[df2, filter := abs(weight - th_weight) < 2,
on = .(low <= th_weight, high >= th_weight)]
-output
> df1
weight low high filter
<num> <num> <num> <lgcl>
1: 94.99610 94.99608 94.99613 TRUE
2: 95.00561 95.00558 95.00566 FALSE
data
df1 <- structure(list(weight = c(94.9961, 95.00561), low = c(94.99608,
95.00558), high = c(94.99613, 95.00566)), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(index = 1:5, th_weight = c(94.996092, 95.496336,
95.509906, 97.473292, 100.51906)), class = "data.frame", row.names = c(NA,
-5L))
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | akrun |
