'Joining two data frames using range of values

I have two data sets I would like to join. The income_range data is the master dataset and I would like to join data_occ to the income_range data based on what band the income falls inside. Where there are more than two observations(incomes) that are within the range I would like to take the lower income.

I was attempting to use data.table but was having trouble. I was would also like to keep all columns from both data.frames if possible.

The output dataset should only have 7 observations.

library(data.table)
library(dplyr)

income_range <- data.frame(id = "France"
                     ,inc_lower  = c(10, 21, 31, 41,51,61,71)
                     ,inc_high    = c(20, 30, 40, 50,60,70,80)
                     ,perct       = c(1,2,3,4,5,6,7))

data_occ <- data.frame(id = rep(c("France","Belgium"), each=50)
                   ,income = sample(10:80, 50)
                   ,occ = rep(c("manager","clerk","manual","skilled","office"), each=20))

setDT(income_range)
setDT(data_occ)

First attempt.

df2 <- income_range [data_occ , 
            on = .(id, inc_lower <= income, inc_high >= income),
            .(id, income, inc_lower,inc_high,perct,occ)]

Thank you in advance.



Solution 1:[1]

Since you tagged dplyr, here's one possible solution using that library:

library('fuzzyjoin')

# join dataframes on id == id, inc_lower <= income, inc_high >= income
joined <- income_range %>%
          fuzzy_left_join(data_occ,
                          by = c('id' = 'id', 'inc_lower' = 'income', 'inc_high' = 'income'),
                          match_fun = list(`==`, `<=`, `>=`)) %>%
          rename(id = id.x) %>%
          select(-id.y)

# sort by income, and keep only the first row of every unique perct
result <- joined %>%
    arrange(income) %>%
    group_by(perct) %>%
    slice(1) 

And the (intermediate) results:

> head(joined)
      id inc_lower inc_high perct income     occ
1 France        10       20     1     10 manager
2 France        10       20     1     19 manager
3 France        10       20     1     14 manager
4 France        10       20     1     11 manager
5 France        10       20     1     17 manager
6 France        10       20     1     12 manager

> result    
# A tibble: 7 x 6
# Groups:   perct [7]
  id     inc_lower inc_high perct income occ    
  <chr>      <dbl>    <dbl> <dbl>  <int> <chr>  
1 France        10       20     1     10 manager
2 France        21       30     2     21 manual 
3 France        31       40     3     31 manual 
4 France        41       50     4     43 manager
5 France        51       60     5     51 clerk  
6 France        61       70     6     61 manager
7 France        71       80     7     71 manager

I've added the intermediate dataframe joined for easy of understanding. You can omit it and just chain the two command chains together with %>%.

Solution 2:[2]

Here is one data.table approach:

cols = c("inc_lower", "inc_high")
data_occ[, (cols) := income]

result = data_occ[order(income)
                  ][income_range, 
                    on = .(id, inc_lower>=inc_lower, inc_high<=inc_high), 
                    mult="first"]

data_occ[, (cols) := NULL]


#        id income     occ inc_lower inc_high perct
# 1: France     10   clerk        10       20     1
# 2: France     21 manager        21       30     2
# 3: France     31   clerk        31       40     3
# 4: France     41   clerk        41       50     4
# 5: France     51   clerk        51       60     5
# 6: France     62 manager        61       70     6
# 7: France     71 manager        71       80     7

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 DharmanBot
Solution 2 B. Christian Kamgang