Merge data frames based on multiple columns and thresholds
I have two data.frames with multiple common columns (here: date, city, ctry, and (other_)number).
I would now like to merge them on the above columns but tolerate some level of difference:
threshold.numbers <- 3
threshold.date <- 5 # in days
If the dates differ by more than threshold.date (in days), or the numbers differ by more than threshold.numbers, I don't want the rows to be merged.
Similarly, if the entry in city is a substring of the other df's entry in the city column, I want the rows to be merged. [If anyone has a better idea for testing the similarity of actual city names, I'd be happy to hear about it.] The merged rows should keep the first df's entries for date, city and country, but both (other_)number columns and all other columns of both dfs.
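On the bracketed question about city-name similarity: one option, sketched here as an assumption rather than a recommendation, is base R's approximate matching with agrepl(), which tolerates small spelling differences in addition to plain substrings. The helper name city_similar and the max.distance value are hypothetical choices.

```r
# Hypothetical helper: TRUE if either name approximately contains the other.
city_similar <- function(a, b, max.distance = 0.1) {
  one_way <- function(pattern, x) {
    # agrepl() takes a single pattern, so loop over the pairs
    mapply(function(p, s) agrepl(p, s, max.distance = max.distance,
                                 ignore.case = TRUE, fixed = TRUE),
           pattern, x, USE.NAMES = FALSE)
  }
  one_way(a, b) | one_way(b, a)
}

city_similar(c("Paris", "London", "Bern"),
             c("East-Paris", "near London", "Zurich"))
```

A larger max.distance accepts more typos but also more false matches between short names, so it would need tuning on real data.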
Consider the following example:
df1 <- data.frame(date = c("2003-08-29", "1999-06-12", "2000-08-29", "1999-02-24", "2001-04-17",
"1999-06-30", "1999-03-16", "1999-07-16", "2001-08-29", "2002-07-30"),
city = c("Berlin", "Paris", "London", "Rome", "Bern",
"Copenhagen", "Warsaw", "Moscow", "Tunis", "Vienna"),
ctry = c("Germany", "France", "UK", "Italy", "Switzerland",
"Denmark", "Poland", "Russia", "Tunisia", "Austria"),
number = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
col = c("apple", "banana", "pear", "banana", "lemon", "cucumber", "apple", "peach", "cherry", "cherry"))
df2 <- data.frame(date = c("2003-08-29", "1999-06-12", "2000-08-29", "1999-02-24", "2001-04-17", # all identical to df1
"1999-06-29", "1999-03-14", "1999-07-17", # all 1-2 days different
"2000-01-29", "2002-07-01"), # all very different (> 2 weeks)
city = c("Berlin", "East-Paris", "near London", "Rome", # same or slight differences
"Zurich", # completely different
"Copenhagen", "Warsaw", "Moscow", "Tunis", "Vienna"), # same
ctry = c("Germany", "France", "UK", "Italy", "Switzerland", # all the same
"Denmark", "Poland", "Russia", "Tunisia", "Austria"),
other_number = c(13, 17, 3100, 45, 51, 61, 780, 85, 90, 101), # slightly different to very different
other_col = c("yellow", "green", "blue", "red", "purple", "orange", "blue", "red", "black", "beige"))
Now, I would like to merge the data.frames and receive a df where lines are merged if the above conditions are met.
(The first column is only for your convenience: after the leading digits, which indicate the original case, it shows whether the rows were merged (.) or whether the row comes from df1 (1) or df2 (2).)
date city ctry number col other_number other_col #comment
1. 2003-08-29 Berlin Germany 10 apple 13 yellow # matched on date, city, number
2. 1999-06-12 Paris France 20 banana 17 green # matched on date, city similar, number - other_number == threshold.numbers
31 2000-08-29 London UK 30 pear <NA> <NA> # not matched: number - other_number > threshold.numbers
32 2000-08-29 near London UK <NA> <NA> 3100 blue #
41 1999-02-24 Rome Italy 40 banana <NA> <NA> # not matched: number - other_number > threshold.numbers
42 1999-02-24 Rome Italy <NA> <NA> 45 red #
51 2001-04-17 Bern Switzerland 50 lemon <NA> <NA> # not matched: cities different (dates okay, numbers okay)
52 2001-04-17 Zurich Switzerland <NA> <NA> 51 purple #
6. 1999-06-30 Copenhagen Denmark 60 cucumber 61 orange # matched: date difference < threshold.date (cities okay, numbers okay)
71 1999-03-16 Warsaw Poland 70 apple <NA> <NA> # not matched: number - other_number > threshold.numbers (dates okay)
72 1999-03-14 Warsaw Poland <NA> <NA> 780 blue #
81 1999-07-16 Moscow Russia 80 peach <NA> <NA> # not matched: number - other_number > threshold.numbers (dates okay)
82 1999-07-17 Moscow Russia <NA> <NA> 85 red #
91 2001-08-29 Tunis Tunisia 90 cherry <NA> <NA> # not matched: date difference > threshold.date (cities okay, numbers okay)
92 2000-01-29 Tunis Tunisia <NA> <NA> 90 black #
101 2002-07-30 Vienna Austria 100 cherry <NA> <NA> # not matched: date difference > threshold.date (cities okay, numbers okay)
102 2002-07-01 Vienna Austria <NA> <NA> 101 beige #
I tried different implementations of merging them but cannot get the threshold implemented.
EDIT Apologies for unclear formulation - I would like to retain all rows and receive an indicator whether the row is matched, unmatched and from df1 or unmatched and from df2.
The pseudo-code is:
if there is a case where abs("date_df2" - "date_df1") <= threshold.date:
    if "ctry_df2" == "ctry_df1":
        if "city_df2" ~ "city_df1":
            if abs("number_df2" - "number_df1") <= threshold.numbers:
                merge and go to next row in df2
else:
    add row to df1
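The pseudo-code can be read as a single yes/no test on one row of df1 and one row of df2. A minimal sketch of that test in R follows; rows_match is a hypothetical helper name, and reading "~" as "one city name contains the other" is an assumption based on the description above.

```r
threshold.numbers <- 3
threshold.date <- 5 # in days

# Hypothetical helper: does one row of df1 match one row of df2?
rows_match <- function(row1, row2) {
  date_ok <- abs(as.numeric(as.Date(row1$date) - as.Date(row2$date))) <= threshold.date
  ctry_ok <- row1$ctry == row2$ctry
  # fixed = TRUE treats the city names as literal text, not as regular expressions
  city_ok <- grepl(row1$city, row2$city, fixed = TRUE) ||
    grepl(row2$city, row1$city, fixed = TRUE)
  number_ok <- abs(row1$number - row2$other_number) <= threshold.numbers
  date_ok && ctry_ok && city_ok && number_ok
}

a <- data.frame(date = "1999-06-12", city = "Paris", ctry = "France",
                number = 20, stringsAsFactors = FALSE)
b <- data.frame(date = "1999-06-12", city = "East-Paris", ctry = "France",
                other_number = 17, stringsAsFactors = FALSE)
rows_match(a, b)
```

The answers below differ mainly in how they apply such a test to all candidate pairs without looping over every combination by hand.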
Solution 1:[1]
I first turned the city names into character vectors, since (if I understood correctly) you want to include city names that are contained within df2.
df1$city<-as.character(df1$city)
df2$city<-as.character(df2$city)
Then merge them by country:
df = merge(df1, df2, by = ("ctry"))
> df
ctry date.x city.x number col date.y city.y other_number other_col
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue
The library stringr will allow you to see if city.x is within city.y here (see last column):
library(stringr)
df$city_keep<-str_detect(df$city.y,df$city.x) # this returns logical vector if city.x is contained in city.y (works one way)
> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige TRUE
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red TRUE
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue TRUE
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red TRUE
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple FALSE
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black TRUE
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue TRUE
Then you can get the difference in days between dates:
df$dayDiff<-abs(as.numeric(as.Date(df$date.x) - as.Date(df$date.y))) # true difference in days; comparing $yday would ignore the year
and the difference in numbers:
df$numDiff<-abs(df$number - df$other_number)
Here is what the resulting dataframe looks like:
> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep dayDiff numDiff
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige TRUE 29 1
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE 1 1
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE 0 3
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE 0 3
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red TRUE 0 5
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue TRUE 2 710
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red TRUE 1 5
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple FALSE 0 1
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black TRUE 578 0
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue TRUE 0 3070
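When computing date differences, note that the day of the year ($yday) and a true date difference diverge as soon as the two dates fall in different years. A minimal illustration using the two Tunis dates from the example (the variable names are only for this sketch):

```r
d1 <- "2001-08-29"
d2 <- "2000-01-29"

# Day-of-year difference: ignores that the dates are in different years
yday_diff <- abs(as.POSIXlt(d1)$yday - as.POSIXlt(d2)$yday)

# Calendar difference in days between the two dates
date_diff <- abs(as.numeric(as.Date(d1) - as.Date(d2)))

c(yday = yday_diff, days = date_diff)
```

With a 5-day threshold both values reject this pair, but two dates such as 30 June and 29 June of different years would pass a day-of-year test while being far apart.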
But we want to drop things where city.x was not found within city.y, where the day difference is greater than 5 or the number difference is greater than 3:
df<-df[df$dayDiff<=5 & df$numDiff<=3 & df$city_keep==TRUE,]
> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep dayDiff numDiff
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE 1 1
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE 0 3
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE 0 3
What is left are the three rows that you had above (which contained dots in column 1).
Now we can drop the three columns we created, and the date and city from df2:
> df<-subset(df, select=-c(city.y, date.y, city_keep, dayDiff, numDiff))
> df
ctry date.x city.x number col other_number other_col
2 Denmark 1999-06-30 Copenhagen 60 cucumber 61 orange
3 France 1999-06-12 Paris 20 banana 17 green
4 Germany 2003-08-29 Berlin 10 apple 13 yellow
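The result above contains only the merged rows. Since the question's edit asks to retain all rows with an indicator, one possible extension is sketched below on toy data: tag each row with an id before merging, then append the rows whose id never appears among the matches. The columns id1 and id2, the helper prep, and the toy inputs are hypothetical, and only the number threshold is applied to keep the sketch short.

```r
# Hypothetical toy inputs, much smaller than the question's data
df1 <- data.frame(ctry = c("Germany", "Italy"), city = c("Berlin", "Rome"),
                  number = c(10, 40), stringsAsFactors = FALSE)
df2 <- data.frame(ctry = c("Germany", "Italy"), city = c("Berlin", "Rome"),
                  other_number = c(13, 45), stringsAsFactors = FALSE)

# Tag every row so it can be traced through the merge
df1$id1 <- seq_len(nrow(df1))
df2$id2 <- seq_len(nrow(df2))

# Candidate pairs, then the threshold filter (only the number test here)
cand <- merge(df1, df2, by = "ctry")
matched <- cand[abs(cand$number - cand$other_number) <= 3, ]
matched$city <- matched$city.x  # keep df1's city for merged rows

# Rows whose id never made it into a matched pair
only1 <- df1[!df1$id1 %in% matched$id1, ]
only2 <- df2[!df2$id2 %in% matched$id2, ]

# Bring all three pieces to the same columns, filling gaps with NA
cols <- c("ctry", "city", "number", "other_number", "match")
prep <- function(d, flag) {
  d$match <- rep(flag, nrow(d))
  for (nm in setdiff(cols, names(d))) d[[nm]] <- rep(NA, nrow(d))
  d[cols]
}
result <- rbind(prep(matched, "."), prep(only1, "1"), prep(only2, "2"))
result
```

The match column then uses the same coding as the first column of the desired output in the question: "." for merged rows, "1" for df1 only, "2" for df2 only.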
Solution 2:[2]
Step 1: Merge the data based on "city" and "ctry":
df = merge(df1, df2, by = c("city", "ctry"))
Step 2: Remove rows if the difference between the date entries is > threshold.date (in days):
date_diff = abs(as.numeric(difftime(strptime(df$date.x, format = "%Y-%m-%d"),
strptime(df$date.y, format = "%Y-%m-%d"), units="days")))
index_remove = date_diff > threshold.date
df = df[!index_remove,] # negate the logical vector with "!"; "-index_remove" would not drop the flagged rows
Step 3: Remove rows if the difference between the numbers is > threshold.numbers:
number_diff = abs(df$number - df$other_number)
index_remove = number_diff > threshold.numbers
df = df[!index_remove,]
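Merge the data first and apply the conditions afterwards, because corresponding rows are not necessarily in the same position in both data frames. Note that merging on "city" requires identical city names, so pairs such as "Paris" and "East-Paris" are not joined by this approach.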
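When removing rows by a logical condition, the choice of negation operator matters: "!" flips the logical vector, whereas "-" converts it to numbers first. A small illustration on a toy vector (the names x and drop are only for this sketch):

```r
x <- c("a", "b", "c")
drop <- c(FALSE, TRUE, FALSE)

x[!drop]  # logical negation: keeps the elements where drop is FALSE
x[-drop]  # arithmetic negation: -drop is c(0, -1, 0), which drops element 1
```

The same applies to data frame rows, so df[!index_remove, ] is the form that removes exactly the flagged rows.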
Solution 3:[3]
An option using data.table (explanations inline):
library(data.table)
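setDT(df1)
setDT(df2)
#convert the character dates to Date so that date - threshold.date works
df1[, date := as.Date(date)]
df2[, date := as.Date(date)]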
#dupe columns and create ranges for non-equi joins
df1[, c("n", "ln", "un", "d", "ld", "ud") := .(
number, number - threshold.numbers, number + threshold.numbers,
date, date - threshold.date, date + threshold.date)]
df2[, c("n", "ln", "un", "d", "ld", "ud") := .(
other_number, other_number - threshold.numbers, other_number + threshold.numbers,
date, date - threshold.date, date + threshold.date)]
#perform non-equi join using ctry, num, dates in both ways
res <- rbindlist(list(
df1[df2, on=.(ctry, n>=ln, n<=un, d>=ld, d<=ud),
.(date1=x.date, date2=i.date, city1=x.city, city2=i.city, ctry1=x.ctry, ctry2=i.ctry, number, col, other_number, other_col)],
df2[df1, on=.(ctry, n>=ln, n<=un, d>=ld, d<=ud),
.(date1=i.date, date2=x.date, city1=i.city, city2=x.city, ctry1=i.ctry, ctry2=x.ctry, number, col, other_number, other_col)]),
use.names=TRUE, fill=TRUE)
#determine if cities are substrings of one and another
res[, city_match := {
i <- mapply(grepl, city1, city2) | mapply(grepl, city2, city1)
replace(i, is.na(i), TRUE)
}]
#just like SQL coalesce (there is a version in dev in rdatatable github)
coalesce <- function(...) Reduce(function(x, y) fifelse(!is.na(y), y, x), list(...))
#for rows that are matching or no matches to be found
ans1 <- unique(res[(city_match), .(date=coalesce(date1, date2),
city=coalesce(city1, city2),
ctry=coalesce(ctry1, ctry2),
number, col, other_number, other_col)])
#for rows that are close in terms of dates and numbers but are diff cities
ans2 <- res[(!city_match), .(date=c(.BY$date1, .BY$date2),
city=c(.BY$city1, .BY$city2),
ctry=c(.BY$ctry1, .BY$ctry2),
number=c(.BY$number, NA),
col=c(.BY$col, NA),
other_number=c(NA, .BY$other_number),
other_col=c(NA, .BY$other_col)),
names(res)][, seq_along(names(res)) := NULL]
#final desired output
setorder(rbindlist(list(ans1, ans2)), date, city, number, na.last=TRUE)[]
output:
date city ctry number col other_number other_col
1: 1999-02-24 Rome Italy 40 banana NA <NA>
2: 1999-02-24 Rome Italy NA <NA> 45 red
3: 1999-03-14 Warsaw Poland NA <NA> 780 blue
4: 1999-03-16 Warsaw Poland 70 apple NA <NA>
5: 1999-06-12 East-Paris France 20 banana 17 green
6: 1999-06-29 Copenhagen Denmark 60 cucumber 61 orange
7: 1999-07-16 Moscow Russia 80 peach NA <NA>
8: 1999-07-17 Moscow Russia NA <NA> 85 red
9: 2000-01-29 Tunis Tunisia NA <NA> 90 black
10: 2000-08-29 London UK 30 pear NA <NA>
11: 2000-08-29 near London UK NA <NA> 3100 blue
12: 2001-04-17 Bern Switzerland 50 lemon NA <NA>
13: 2001-04-17 Zurich Switzerland NA <NA> 51 purple
14: 2001-08-29 Tunis Tunisia 90 cherry NA <NA>
15: 2002-07-01 Vienna Austria NA <NA> 101 beige
16: 2002-07-30 Vienna Austria 100 cherry NA <NA>
17: 2003-08-29 Berlin Germany 10 apple 13 yellow
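For readers new to non-equi joins, the core idea of the answer above can be seen on a much smaller case: build a tolerance window around each value in one table and join the other table on "value inside the window". This is a minimal sketch with hypothetical toy tables and column names (a, b, id1, id2, val, val2, lo, hi, tol), not a replacement for the full solution.

```r
library(data.table)

# Hypothetical toy tables, unrelated to the question's data
a <- data.table(id1 = c("a1", "a2"), val = c(10, 50))
b <- data.table(id2 = c("b1", "b2"), val2 = c(12, 70))

tol <- 3
# Build the tolerance window around each value in b
b[, c("lo", "hi") := .(val2 - tol, val2 + tol)]

# For every row of b, look up rows of a whose val lies in [lo, hi];
# rows of b without a partner are kept with NA in the columns from a
res <- a[b, on = .(val >= lo, val <= hi),
         .(id1 = x.id1, val = x.val, id2 = i.id2, val2 = i.val2)]
res
```

Because the join only keeps unmatched rows from the table in the i position (b here), the answer above runs it in both directions and stacks the results.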
Solution 4:[4]
You can test the city match with grepl and the ctry match simply with ==. For the rows that match up to this point, you can calculate the date difference by converting to Date with as.Date and comparing the result to a difftime. The number difference is tested the same way.
i1 <- seq_len(nrow(df1)) #Store all rows
i2 <- seq_len(nrow(df2))
res <- do.call(rbind, lapply(seq_len(nrow(df1)), function(i) { #Loop over all rows in df1; lapply always returns a list
t1 <- which(df1$ctry[i] == df2$ctry) #Match ctry
t2 <- grepl(df1$city[i], df2$city[t1]) | sapply(df2$city[t1], grepl, df1$city[i]) #Match city
t1 <- t1[t2] #Keep only the candidates whose city matched
t1 <- t1[abs(as.Date(df1$date[i]) - as.Date(df2$date[t1])) <=
as.difftime(threshold.date, units = "days") & #Test for date difference
abs(df1$number[i] - df2$other_number[t1]) <= threshold.numbers] #Test for number difference
if(length(t1) > 0) { #Match found
i1 <<- i1[i1!=i] #Remove row as it was found
i2 <<- i2[!i2 %in% t1] #Also works if more than one row of df2 matched
cbind(df1[i,], df2[t1,c("other_number","other_col")], match=".")
}
}))
rbind(res
, cbind(df1[i1,], other_number=NA, other_col=NA, match="1")
, cbind(df2[i2,1:3], number=NA, col=NA, other_number=df2[i2,4]
, other_col=df2[i2,5], match="2"))
# date city ctry number col other_number other_col match
#1 2003-08-29 Berlin Germany 10 apple 13 yellow .
#2 1999-06-12 Paris France 20 banana 17 green .
#6 1999-06-30 Copenhagen Denmark 60 cucumber 61 orange .
#3 2000-08-29 London UK 30 pear NA <NA> 1
#4 1999-02-24 Rome Italy 40 banana NA <NA> 1
#5 2001-04-17 Bern Switzerland 50 lemon NA <NA> 1
#7 1999-03-16 Warsaw Poland 70 apple NA <NA> 1
#8 1999-07-16 Moscow Russia 80 peach NA <NA> 1
#9 2001-08-29 Tunis Tunisia 90 cherry NA <NA> 1
#10 2002-07-30 Vienna Austria 100 cherry NA <NA> 1
#31 2000-08-29 near London UK NA <NA> 3100 blue 2
#41 1999-02-24 Rome Italy NA <NA> 45 red 2
#51 2001-04-17 Zurich Switzerland NA <NA> 51 purple 2
#71 1999-03-14 Warsaw Poland NA <NA> 780 blue 2
#81 1999-07-17 Moscow Russia NA <NA> 85 red 2
#91 2000-01-29 Tunis Tunisia NA <NA> 90 black 2
#101 2002-07-01 Vienna Austria NA <NA> 101 beige 2
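One caveat when matching city names with grepl: by default the first argument is interpreted as a regular expression, so names containing characters such as "." or "(" may behave unexpectedly. Passing fixed = TRUE compares the text literally. The city names below are illustrative and not part of the question's data.

```r
# "." is a regex wildcard, so it also matches the "x"
grepl("St. Louis", "Stx Louis")                              # TRUE
grepl("St. Louis", "Stx Louis", fixed = TRUE)                # FALSE

# Parentheses form a regex group, so the identical name does not match itself
grepl("Frankfurt (Oder)", "Frankfurt (Oder)")                # FALSE
grepl("Frankfurt (Oder)", "Frankfurt (Oder)", fixed = TRUE)  # TRUE
```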
Solution 5:[5]
We can use {powerjoin} :
library(powerjoin)
power_full_join(
df1,
df2,
by = ~
# join if one city name contains the other
(mapply(grepl, .x$city, .y$city) | mapply(grepl, .y$city, .x$city)) &
# and dates are close enough
abs(difftime(.x$date, .y$date, units = "days")) <= threshold.date &
# and numbers are close enough
abs(.x$number - .y$other_number) <= threshold.numbers,
conflict = dplyr::coalesce)
#> number col other_number other_col date city ctry
#> 1 10 apple 13 yellow 2003-08-29 Berlin Germany
#> 2 20 banana 17 green 1999-06-12 Paris France
#> 3 60 cucumber 61 orange 1999-06-30 Copenhagen Denmark
#> 4 30 pear NA <NA> 2000-08-29 London UK
#> 5 40 banana NA <NA> 1999-02-24 Rome Italy
#> 6 50 lemon NA <NA> 2001-04-17 Bern Switzerland
#> 7 70 apple NA <NA> 1999-03-16 Warsaw Poland
#> 8 80 peach NA <NA> 1999-07-16 Moscow Russia
#> 9 90 cherry NA <NA> 2001-08-29 Tunis Tunisia
#> 10 100 cherry NA <NA> 2002-07-30 Vienna Austria
#> 11 NA <NA> 3100 blue 2000-08-29 near London UK
#> 12 NA <NA> 45 red 1999-02-24 Rome Italy
#> 13 NA <NA> 51 purple 2001-04-17 Zurich Switzerland
#> 14 NA <NA> 780 blue 1999-03-14 Warsaw Poland
#> 15 NA <NA> 85 red 1999-07-17 Moscow Russia
#> 16 NA <NA> 90 black 2000-01-29 Tunis Tunisia
#> 17 NA <NA> 101 beige 2002-07-01 Vienna Austria
Created on 2022-04-14 by the reprex package (v2.0.1)
Solution 6:[6]
Here's a flexible approach that lets you specify any collection of merge criteria you choose.
Prep work
I ensured that all the strings in df1 and df2 were strings, not factors (as noted in several of the other answers). I also wrapped the dates in as.Date to make them real dates.
Specify the merge criteria
Create a list of lists. Each element of the main list is one criterion; the members of a criterion are
- final.col.name: the name of the column we want in the final table
- col.name.1: the name of the column in df1
- col.name.2: the name of the column in df2
- exact: boolean; should we do exact matching on this column?
- threshold: threshold (if we aren't doing exact matching)
- match.function: a function that returns whether or not the rows match (for special cases such as using grepl for string matching; note that this function must be vectorized)
merge.criteria = list(
list(final.col.name = "date",
col.name.1 = "date",
col.name.2 = "date",
exact = F,
threshold = 5),
list(final.col.name = "city",
col.name.1 = "city",
col.name.2 = "city",
exact = F,
match.function = function(x, y) {
return(mapply(grepl, x, y) |
mapply(grepl, y, x))
}),
list(final.col.name = "ctry",
col.name.1 = "ctry",
col.name.2 = "ctry",
exact = T),
list(final.col.name = "number",
col.name.1 = "number",
col.name.2 = "other_number",
exact = F,
threshold = 3)
)
Function for merging
This function takes three arguments: the two data frames we want to merge, and the list of match criteria. It proceeds as follows:
- Iterate through the match criteria and determine which row pairs do or don't meet all the criteria. (Inspired by @GKi's answer, it uses row indexes instead of doing a full outer join, which may be less memory-intensive for large datasets.)
- Create a skeleton data frame with just the rows we want (merged rows in the case of matches, unmerged rows for unmatched records).
- Iterate through the columns of the original data frames and use them to populate the desired columns in the new data frame. (Do this first for the columns that appear in the match criteria, and then for any other columns that are left.)
library(dplyr)
merge.data.frames = function(df1, df2, merge.criteria) {
# Create a data frame with all possible pairs of rows from df1 and rows from
# df2.
row.decisions = expand.grid(df1.row = 1:nrow(df1), df2.row = 1:nrow(df2))
# Iterate over the criteria in merge.criteria. For each criterion, flag row
# pairs that don't meet the criterion.
row.decisions$merge = T
for(criterion in merge.criteria) {
# If we're looking for an exact match, test for equality.
if(criterion$exact) {
row.decisions$merge = row.decisions$merge &
df1[row.decisions$df1.row,criterion$col.name.1] == df2[row.decisions$df2.row,criterion$col.name.2]
}
# If we're doing a threshold test, test for difference.
else if(!is.null(criterion$threshold)) {
row.decisions$merge = row.decisions$merge &
abs(df1[row.decisions$df1.row,criterion$col.name.1] - df2[row.decisions$df2.row,criterion$col.name.2]) <= criterion$threshold
}
# If the user provided a function, use that.
else if(!is.null(criterion$match.function)) {
row.decisions$merge = row.decisions$merge &
criterion$match.function(df1[row.decisions$df1.row,criterion$col.name.1],
df2[row.decisions$df2.row,criterion$col.name.2])
}
}
# Create the new dataframe. Just row numbers of the source dfs to start.
new.df = bind_rows(
# Merged rows.
row.decisions %>% filter(merge) %>% select(-merge),
# Rows from df1 only.
row.decisions %>% group_by(df1.row) %>% summarize(matches = sum(merge)) %>% filter(matches == 0) %>% select(df1.row),
# Rows from df2 only.
row.decisions %>% group_by(df2.row) %>% summarize(matches = sum(merge)) %>% filter(matches == 0) %>% select(df2.row)
)
# Iterate over the merge criteria and add columns that were used for matching
# (from df1 if available; otherwise from df2).
for(criterion in merge.criteria) {
new.df[criterion$final.col.name] = coalesce(df1[new.df$df1.row,criterion$col.name.1],
df2[new.df$df2.row,criterion$col.name.2])
}
# Now add all the columns from either data frame that weren't used for
# matching.
for(other.col in setdiff(colnames(df1),
sapply(merge.criteria, function(x) x$col.name.1))) {
new.df[other.col] = df1[new.df$df1.row,other.col]
}
for(other.col in setdiff(colnames(df2),
sapply(merge.criteria, function(x) x$col.name.2))) {
new.df[other.col] = df2[new.df$df2.row,other.col]
}
# Return the result.
return(new.df)
}
Apply the function, and we're done
df = merge.data.frames(df1, df2, merge.criteria)
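The returned data frame keeps the df1.row and df2.row index columns, which are NA for rows that come from only one source (bind_rows() fills missing columns with NA). Assuming that layout, the matched/unmatched indicator requested in the question could be derived along these lines; match_indicator is a hypothetical helper name.

```r
# Hypothetical helper: "." = merged, "1" = only in df1, "2" = only in df2
match_indicator <- function(df1.row, df2.row) {
  ifelse(!is.na(df1.row) & !is.na(df2.row), ".",
         ifelse(is.na(df2.row), "1", "2"))
}

# Toy index columns standing in for df$df1.row and df$df2.row
match_indicator(c(1, 3, NA), c(1, NA, 3))
```

Applied to the result above, this would be df$match = match_indicator(df$df1.row, df$df2.row).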
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Dylan_Gomes |
| Solution 2 | |
| Solution 3 | chinsoon12 |
| Solution 4 | |
| Solution 5 | |
| Solution 6 | A. S. K. |
