'How to check if values in first dataframe are contained or match values in another dataframe
I am using R
to work with some dataframes. My issue is related on how to check if values in a variable in a first dataframe match with values in another dataframe. The match is very different to those like merge
or join
. I will introduce my dataframes (dput()
at end):
My first dataframe is df1
. It contains the variable name
which I want to contrast with other variable in a second dataframe. It looks like this:
df1
name
1 JUAN GIRON
2 GINA OLEAS
3 JUAN FERNANDO ELIZAGA
4 MARCO TORRES
5 JUAN PABLO GONZALEZ
6 IRMA GOMEZ
The second dataframe is df2
. It also contains a variable name
which will be used to the contrast with name
from df1
. It looks like this (In a real situation df2
can be very large with more than 1000 rows):
df2
name val
1 JUANA MARQUEZ 1
2 FERNANDO ELIZAGA 2
3 IRMA GOMEZ 3
4 PABLO GONZALEZ 4
5 GINA LUCIO 5
6 MARK TORRES 6
7 LETICIA BLACIO 7
8 JUAN PABLO GIRON BELTRAN 8
I am looking for a way to check if every row of df1
for name
variable is contained or match with any value for name
in df2
. For example, the value JUAN GIRON
after checking with name
from df2
should return, give a value of yes
because it is contained in the string JUAN PABLO GIRON BELTRAN
from df2
. The same case would apply for the other values. In the end I would like to have something like this:
df3
name val
1 JUAN GIRON yes
2 GINA OLEAS no
3 JUAN FERNANDO ELIZAGA yes
4 MARCO TORRES no
5 JUAN PABLO GONZALEZ yes
6 IRMA GOMEZ yes
How can I reach that result? I have tried with grepl()
concatenating the strings using |
but it is not working because some values are returning a yes
match when there is not match.
Also, as data can be large, I would like to have a solution with dplyr
because the comparison is by row so it can be slow. Or any fast solution is welcome. Many thanks!
Data is next:
#df1
df1 <- structure(list(name = c("JUAN GIRON", "GINA OLEAS", "JUAN FERNANDO ELIZAGA",
"MARCO TORRES", "JUAN PABLO GONZALEZ", "IRMA GOMEZ")), row.names = c(NA,
-6L), class = "data.frame")
#df2
df2 <- structure(list(name = c("JUANA MARQUEZ", "FERNANDO ELIZAGA",
"IRMA GOMEZ", "PABLO GONZALEZ", "GINA LUCIO", "MARK TORRES",
"LETICIA BLACIO", "JUAN PABLO GIRON BELTRAN"), val = 1:8), row.names = c(NA,
-8L), class = "data.frame")
Solution 1:[1]
Perhaps we can do like this
df1 %>%
mutate(val = c("no", "yes")[1 + (rowSums(
outer(
strsplit(name, "\\s+"),
strsplit(df2$name, "\\s+"),
Vectorize(function(x, y) all(x %in% y) | all(y %in% x))
)
) > 0)])
which gives
name val
1 JUAN GIRON yes
2 GINA OLEAS no
3 JUAN FERNANDO ELIZAGA yes
4 MARCO TORRES no
5 JUAN PABLO GONZALEZ yes
6 IRMA GOMEZ yes
Solution 2:[2]
The answer by ThomasIsCoding is great. But using outer()
is quite memory consuming and not parallelizable. The following solution uses nested map()
's. Also, the furrr
package is used to parallelize the outer map()
.
A benchmark with a much larger df2
shows that both nesting and parallelization yield a substantial speed-up for about double the speed in total.
Data and Packages
library(tidyverse)
library(furrr)
#df1
df1 <- structure(list(name = c("JUAN GIRON", "GINA OLEAS", "JUAN FERNANDO ELIZAGA",
"MARCO TORRES", "JUAN PABLO GONZALEZ", "IRMA GOMEZ")), row.names = c(NA,
-6L), class = "data.frame")
#df2
df2 <- structure(list(name = c("JUANA MARQUEZ", "FERNANDO ELIZAGA",
"IRMA GOMEZ", "PABLO GONZALEZ", "GINA LUCIO", "MARK TORRES",
"LETICIA BLACIO", "JUAN PABLO GIRON BELTRAN"), val = 1:8), row.names = c(NA,
-8L), class = "data.frame")
Demo on small data set
plan(multisession, workers = 8) # 8 for my quad-core with hyperthreading
n2 <- df2$name |>
str_split("\\s+")
df1 |>
mutate(val = name |>
str_split("\\s+") |>
future_map_lgl(\(n1e) map_lgl(n2,
\(n2e) all(n1e %in% n2e) | all(n2e %in% n1e)
) |> any()
) |>
factor(labels = c("no", "yes"))
)
#> name val
#> 1 JUAN GIRON yes
#> 2 GINA OLEAS no
#> 3 JUAN FERNANDO ELIZAGA yes
#> 4 MARCO TORRES no
#> 5 JUAN PABLO GONZALEZ yes
#> 6 IRMA GOMEZ yes
Benchmark Code
# Make df2 much larger
df2xl <- df2 |>
list() |>
rep(10000) |>
bind_rows()
bench::mark(
Thomas = df1 %>%
mutate(val = c("no", "yes")[1 + (rowSums(
outer(
strsplit(name, "\\s+"),
strsplit(df2xl$name, "\\s+"),
Vectorize(function(x, y) all(x %in% y) | all(y %in% x))
)
) > 0)]),
nested_map = {
n2 <- df2xl$name |>
str_split("\\s+")
df1 |>
mutate(val = name |>
str_split("\\s+") |>
map_lgl(\(n1e) map_lgl(n2,
\(n2e) all(n1e %in% n2e) | all(n2e %in% n1e)
) |> any()
) |>
factor(labels = c("no", "yes"))
)
},
parallel_nested_map = {
n2 <- df2xl$name |>
str_split("\\s+")
df1 |>
mutate(val = name |>
str_split("\\s+") |>
future_map_lgl(\(n1e) map_lgl(n2,
\(n2e) all(n1e %in% n2e) | all(n2e %in% n1e)
) |> any()
) |>
factor(labels = c("no", "yes"))
)
},
check = F,
min_iterations = 10,
filter_gc = F
)
Benchmark Result
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 Thomas 2.48s 2.53s 0.396 21.51MB 8.31
#> 2 nested_map 1.72s 1.73s 0.563 2.46MB 8.28
#> 3 parallel_nested_map 1.07s 1.22s 0.827 2.86MB 2.56
Created on 2022-04-12 by the reprex package (v2.0.1)
Solution 3:[3]
Here's an approach that uses a regex pattern and handles names that are either length 2 or 3. There's room for improvement, and I'd love to read other answers to this question.
# Input
a <- strsplit(df2$name, " ")
# Output
b <- c()
# Define regex pattern
for(i in 1:length(a)){
if(length(a[[i]]) == 3){
temp <- paste0(
a[[i]][1], " ", a[[i]][2], "|",
a[[i]][1], " ", a[[i]][3], "|",
a[[i]][2], " ", a[[i]][3])
} else if(length(a[[i]] == 2)){
temp <- paste(a[[i]], collapse = " ")
} else {
stop("Length of split name was not 2 or 3")
}
b <- c(b, temp)
}
df1$val <- grepl(paste(b, collapse = "|"), df1$name)
Alternatively, after defining b
using the loop above:
library(dplyr)
patt <- paste(b, collapse = "|")
df1 %>%
mutate(val = grepl(patt, name))
Result:
> df1
name val
1 JUAN GIRON TRUE
2 GINA OLEAS FALSE
3 JUAN FERNANDO ELIZAGA TRUE
4 MARCO TORRES FALSE
5 JUAN PABLO GONZALEZ TRUE
6 IRMA GOMEZ TRUE
Solution 4:[4]
In case the order of the names does not change .*
could be inserted between the names and then use grep
in both directions (df1$names in df2$names and df2$names in df1$names) and combine them with or |
.
transform(df1, val = c("no", "yes")[1+(sapply(gsub(" +", ".*", df1$name),
\(x) any(grep(x, df2$name)), USE.NAMES = FALSE) |
Reduce(\(y, x) y | grepl(x, df1$name), gsub(" +", ".*", df2$name), FALSE))])
# name val
#1 JUAN GIRON yes
#2 GINA OLEAS no
#3 JUAN FERNANDO ELIZAGA yes
#4 MARCO TORRES no
#5 JUAN PABLO GONZALEZ yes
#6 IRMA GOMEZ yes
In case no partly match of the names are allowed surround the names with \b
.
transform(df1, val = c("no", "yes")[1+(sapply(
gsub(" *(\\b[^ ]+\\b)", ".*\\\\b\\1\\\\b", df1$name),
\(x) any(grep(x, df2$name)), USE.NAMES = FALSE) |
Reduce(\(y, x) y | grepl(x, df1$name),
gsub(" *(\\b[^ ]+\\b)", ".*\\\\b\\1\\\\b", df2$name), FALSE))])
In case the order can chaange make a positive look ahead by placing the name in (?=.*NAMME)
or also surround the name with \b
(?=.*\\bNAME\\b)
.
transform(df1, val = c("no", "yes")[1+(sapply(
gsub(" *(\\b[^ ]+\\b)", "(?=.*\\\\b\\1\\\\b)", df1$name),
\(x) any(grep(x, df2$name, perl=TRUE)), USE.NAMES = FALSE) |
Reduce(\(y, x) y | grepl(x, df1$name, perl=TRUE),
gsub(" *(\\b[^ ]+\\b)", "(?=.*\\\\b\\1\\\\b)", df2$name), FALSE))] )
Its also possible to use agrepl
and allow deletions which will be similar to the version assuming that the order of the names does not change and part matches of the name are allowed.
transform(df1, val = c("no", "yes")[1+(
sapply(df1$name, \(x) any(agrepl(x, df2$name,
list(cost=99, insertions=0, deletions=99, substitutions=0)))) |
Reduce(\(y, x) y | agrepl(x, df1$name, list(cost=99, insertions=0,
deletions=99, substitutions=0)), df2$name, FALSE))])
Another option can be the usage of look up tables:
s1 <- strsplit(df1$name, " ", TRUE)
lup1 <- list2env(split(rep(seq_along(s1), lengths(s1)), unlist(s1)))
s2 <- strsplit(df2$name, " ", TRUE)
lup2 <- list2env(split(rep(seq_along(s2), lengths(s2)), unlist(s2)))
`[<-`(sapply(s1, \(x) any(Reduce(intersect, mget(x, lup2, ifnotfound =
list(NULL))))), unlist(lapply(s2, \(x) Reduce(intersect, mget(x, lup1,
ifnotfound = list(NULL))))), TRUE)
#[1] TRUE FALSE TRUE FALSE TRUE TRUE
Benchmark:
Its also possible to limit the comparisons only to those which didn't have a match (GKi1b) where maybe the usage of indices using which
instead of using the logical vector twice could fuhrer improve and making an exit of the loop in case all have a hit. In case the names are not unique use unique
on the names.
library(dplyr)
bench::mark(
Thomas = df1 %>%
mutate(val = c("no", "yes")[1 + (rowSums(
outer(
strsplit(name, "\\s+"),
strsplit(df2$name, "\\s+"),
Vectorize(function(x, y) all(x %in% y) | all(y %in% x))
)
) > 0)]),
GKi1 = transform(df1, val = c("no", "yes")[1+(sapply(gsub(" +", ".*", df1$name),
\(x) any(grep(x, df2$name)), USE.NAMES = FALSE) |
Reduce(\(y, x) y | grepl(x, df1$name), gsub(" +", ".*", df2$name), FALSE))]),
GKi1b = transform(df1, val = c("no", "yes")[1 +
Reduce(\(i, x) `[<-`(i, !i, grepl(x, df1$name[!i])), gsub(" +", ".*",
df2$name), sapply(gsub(" +", ".*", df1$name), \(x) any(grep(x, df2$name)),
USE.NAMES = FALSE)) ]),
GKi1c = transform(df1, val = c("no", "yes")[1+(sapply(gsub(" +", ".*", df1$name),
\(x) any(grep(x, df2$name)), USE.NAMES = FALSE) |
grepl(paste(gsub(" +", ".*", df2$name), collapse = "|"), df1$name) )]),
GKi2 = transform(df1, val = c("no", "yes")[1+(sapply(
gsub(" *(\\b[^ ]+\\b)", ".*\\\\b\\1\\\\b", df1$name),
\(x) any(grep(x, df2$name)), USE.NAMES = FALSE) |
Reduce(\(y, x) y | grepl(x, df1$name),
gsub(" *(\\b[^ ]+\\b)", ".*\\\\b\\1\\\\b", df2$name), FALSE))]),
GKi3 = transform(df1, val = c("no", "yes")[1+(sapply(
gsub(" *(\\b[^ ]+\\b)", "(?=.*\\\\b\\1\\\\b)", df1$name),
\(x) any(grep(x, df2$name, perl=TRUE)), USE.NAMES = FALSE) |
Reduce(\(y, x) y | grepl(x, df1$name, perl=TRUE),
gsub(" *(\\b[^ ]+\\b)", "(?=.*\\\\b\\1\\\\b)", df2$name), FALSE))] ),
GKi4 = transform(df1, val = c("no", "yes")[1+(
sapply(df1$name, \(x) any(agrepl(x, df2$name,
list(cost=99, insertions=0, deletions=99, substitutions=0)))) |
Reduce(\(y, x) y | agrepl(x, df1$name, list(cost=99, insertions=0,
deletions=99, substitutions=0)), df2$name, FALSE))]),
GKi5 = {
s1 <- strsplit(df1$name, " ", TRUE)
lup1 <- list2env(split(rep(seq_along(s1), lengths(s1)), unlist(s1)))
s2 <- strsplit(df2$name, " ", TRUE)
lup2 <- list2env(split(rep(seq_along(s2), lengths(s2)), unlist(s2)))
transform(df1, val = c("no", "yes")[1+`[<-`(sapply(s1, \(x) any(Reduce(base::intersect, mget(x, lup2, ifnotfound =
list(NULL))))), unlist(lapply(s2, \(x) Reduce(base::intersect, mget(x, lup1,
ifnotfound = list(NULL))))), TRUE)])
}
)
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 Thomas 863µs 894µs 919. 4.08KB 21.4 429 10 467ms
2 GKi1 211µs 218µs 3770. 0B 14.6 1803 7 478ms
3 GKi1b 211µs 226µs 3020. 0B 14.6 1448 7 479ms
4 GKi1c 183µs 200µs 3424. 0B 10.3 1667 5 487ms
5 GKi2 262µs 275µs 2755. 0B 12.4 1336 6 485ms
6 GKi3 391µs 409µs 2010. 0B 9.19 875 4 435ms
7 GKi4 374µs 386µs 2295. 0B 16.5 1110 8 484ms
8 GKi5 272µs 285µs 2570. 2.82KB 21.1 1220 10 475ms
All variants are more than 2 times faster than ThomasIsCoding using one CPU-core.
Solution 5:[5]
library(data.table)
library(stringi)
library(purrr)
setDT(df1)
df1[,val := fifelse(map_lgl(stri_replace_all_fixed(name, " ", "|"), ~any(stri_count_regex(..df2$name, .x) >= 2)), "yes", "no")][]
#> name val
#> 1: JUAN GIRON yes
#> 2: GINA OLEAS no
#> 3: JUAN FERNANDO ELIZAGA yes
#> 4: MARCO TORRES no
#> 5: JUAN PABLO GONZALEZ yes
#> 6: IRMA GOMEZ yes
This creates a vector of partial match counts (needles) per name in df2
(haystack).
[[1]]
[1] 1 0 0 0 0 0 0 2
[[2]]
[1] 0 0 0 0 1 0 0 0
[[3]]
[1] 1 2 0 0 0 0 0 1
[[4]]
[1] 0 0 0 0 0 1 0 0
[[5]]
[1] 1 0 0 2 0 0 0 2
[[6]]
[1] 0 0 2 0 0 0 0 0
I've set the number of minimum matches that result in a "yes" to 2, but this number can be tweaked (given we don't know which part is a family name or given name).
Solution 6:[6]
I'm going for robust here, rather than speed or elegance. I am certain that it could be cleaned up and sped up, but this addresses all of the discussed concerns including the added specification in the comments about handling family and given name matches separately.
First, here is updated data that has the family and given names split:
df1_split <-
tibble(
Given = c("JUAN", "GINA", "JUAN FERNANDO"
, "MARCO", "JUAN PABLO", "IRMA"
, "JUAN", "JUAN CARLOS")
, Family = c("GIRON", "OLEAS", "ELIZAGA"
, "TORRES", "GONZALEZ", "GOMEZ"
, "GOMEZ", "MARTINEZ")
)
df2_split <-
tibble(
Given = c("JUANA", "FERNANDO",
"IRMA", "PABLO", "GINA", "MARK",
"LETICIA", "JUAN PABLO"
, "FERNANDO CARLOS"
, "JUAN FERNANDO")
, Family = c("MARQUEZ", "ELIZAGA",
"GOMEZ", "GONZALEZ", "LUCIO", "TORRES",
"BLACIO", "GIRON BELTRAN"
, "MARTINEZ"
, "ELIZAGA")
)
Note that I added a couple of names to highlight some of the problems with the doubled names.
Then, this function will check any name set. It handles the names differently depending on whether they are single (e.g. "Juan") or dual-named (e.g., "Juan Carlos"). For single names to be checked, it just looks if they are present at all. For dual-named names, it checks that they are in the same order if the name to check against has two names as well or if either name is present if the name to check against only has one name.
check_names <- function(to_check, against){
split_against <-
str_split(against, " ")
str_split(to_check, " ") %>%
lapply(function(this_name){
if(length(this_name) == 1){
to_ret <-
sapply(split_against, function(this_against){
any(this_name == this_against)
}) %>%
which
} else if(length(this_name) == 2){
to_ret <-
sapply(split_against, function(this_against){
if(length(this_against) == 2){
return(all(this_against == this_name))
} else if(length(this_against) == 1){
return(any(this_against == this_name))
} else{
stop("Names (against) cannot have three words: "
, this_against)
}
}) %>%
which
} else{
stop("Names (to_check) cannot have three words: "
, this_name)
}
})
}
Then, we wrap that function together to pass the family and given names separately. The results are then checked to see if there are any indices that are matches for both the family and given names.
check_both_simple <- function(to_check_given, to_check_family, against_given, against_family){
checked_given <- check_names(to_check_given, against_given)
checked_family <- check_names(to_check_family, against_family)
valid_matches <- lapply(1:length(checked_given), function(idx){
checked_given[[idx]][checked_given[[idx]] %in% checked_family[[idx]]]
})
to_return <-
ifelse(sapply(valid_matches, length) > 0
, "yes"
, "no")
return(to_return)
}
We can then use this in a call from mutate to add the column:
df1_split %>%
mutate(Match = check_both_simple(Given, Family
, df2_split$Given
, df2_split$Family))
Returns:
# A tibble: 8 × 3
Given Family Match
<chr> <chr> <chr>
1 JUAN GIRON yes
2 GINA OLEAS no
3 JUAN FERNANDO ELIZAGA yes
4 MARCO TORRES no
5 JUAN PABLO GONZALEZ yes
6 IRMA GOMEZ yes
7 JUAN GOMEZ no
8 JUAN CARLOS MARTINEZ no
And should handle all of the odd edge cases discussed in the comments as well.
The nice thing about this approach (and part of why I built it this robustly in the first place), is that you can also set the function to return the matching indices.
check_both_idx <- function(to_check_given, to_check_family, against_given, against_family){
checked_given <- check_names(to_check_given, against_given)
checked_family <- check_names(to_check_family, against_family)
valid_matches <- lapply(1:length(checked_given), function(idx){
checked_given[[idx]][checked_given[[idx]] %in% checked_family[[idx]]]
})
return(valid_matches)
}
Using this, you can actually pull the matches that are found and manually inspect them. This would allow you to identify any additional edge cases where you didn't agree with a found match or where one match may be markedly better than another.
df1_split %>%
mutate(Match_idx = check_both_idx(Given, Family
, df2_split$Given
, df2_split$Family)
, Matches = sapply(Match_idx, function(this_idx_set){
paste(df2_split$Given[this_idx_set]
, df2_split$Family[this_idx_set]) %>%
paste(collapse = "; ")
})
, Match = ifelse(sapply(Match_idx, length) > 0
, "yes"
, "no")
, Match_idx = sapply(Match_idx, paste, collapse = "; ")
)
Returns:
# A tibble: 8 × 5
Given Family Match_idx Matches Match
<chr> <chr> <chr> <chr> <chr>
1 JUAN GIRON "8" "JUAN PABLO GIRON BELTRAN" yes
2 GINA OLEAS "" "" no
3 JUAN FERNANDO ELIZAGA "2; 10" "FERNANDO ELIZAGA; JUAN FERNANDO ELIZAGA" yes
4 MARCO TORRES "" "" no
5 JUAN PABLO GONZALEZ "4" "PABLO GONZALEZ" yes
6 IRMA GOMEZ "3" "IRMA GOMEZ" yes
7 JUAN GOMEZ "" "" no
8 JUAN CARLOS MARTINEZ "" "" no
Editing to add: The following two sets present some tricky examples that would currently be decided incorrectly by the other answers. These examples arose from the discussion in the comments to clarify what should match.
tricky_1 <-
tibble(
Given = c("JUAN", "JUANITA GINA"
, "JUAN CARLO", "GOMEZ")
, Family = c("GIRON BELTRAN", "OLEAS"
, "MARTINEZ", "IRMA")
)
tricky_2 <-
tibble(
Given = c("JUAN PABLO", "GINA"
, "CARLO JUAN", "IRMA")
, Family = c("GIRON", "OLEAS GIRON"
, "MARTINEZ", "GOMEZ")
)
We can view them side by side like so:
bind_cols(
tricky_1 %>%
setNames(paste0("toCheck_", names(.)))
, tricky_2 %>%
setNames(paste0("against_", names(.)))
) %>%
mutate(shouldMatch = c("yes", "yes", "no", "no"))
returns:
# A tibble: 4 × 5
toCheck_Given toCheck_Family against_Given against_Family shouldMatch
<chr> <chr> <chr> <chr> <chr>
1 JUAN GIRON BELTRAN JUAN PABLO GIRON yes
2 JUANITA GINA OLEAS GINA OLEAS GIRON yes
3 JUAN CARLO MARTINEZ CARLO JUAN MARTINEZ no
4 GOMEZ IRMA IRMA GOMEZ no
The first two should match because the family and given names each have a 1-2 match in each direction. However, that means that neither the name to check or the name to check against is completely contained in the other. The third shares all of the components, but I believe that "JUAN CARLO" should not match "CARLO JUAN." The fourth has the family and given names reversed, so shouldn't return a match.
The code from my answer handles these cases:
tricky_1 %>%
mutate(Match_idx = check_both_idx(Given, Family
, tricky_2$Given
, tricky_2$Family)
, Matches = sapply(Match_idx, function(this_idx_set){
paste(tricky_2$Given[this_idx_set]
, tricky_2$Family[this_idx_set]) %>%
paste(collapse = "; ")
})
, Match = ifelse(sapply(Match_idx, length) > 0
, "yes"
, "no")
, Match_idx = sapply(Match_idx, paste, collapse = "; ")
)
Returns:
# A tibble: 4 × 5
Given Family Match_idx Matches Match
<chr> <chr> <chr> <chr> <chr>
1 JUAN GIRON BELTRAN "1" "JUAN PABLO GIRON" yes
2 JUANITA GINA OLEAS "2" "GINA OLEAS GIRON" yes
3 JUAN CARLO MARTINEZ "" "" no
4 GOMEZ IRMA "" "" no
Solution 7:[7]
df2_flat <- df2$name |> stringr::str_split(" ") |> purrr::flatten()
df1 |>
mutate(splitnames = stringr::str_split(name, " ")) |>
rowwise() |>
mutate(val = all(splitnames %in% df2_flat)) |>
select(-splitnames)
Produces the list you specified
Solution 8:[8]
How I would go about it
library(tidyverse)
df1 %>%
mutate(val = sapply(name, \(n) {
result = strsplit(n, " ")[[1]] %>%
sapply(., \(sn) { #loop through each name and look in df2
sum(grepl(sn, df2$name)) #Could be rewritten to a for-loop to break at the first missed name
})
if(0 %in% result) {
return("no") #at least one name was not found
} else {
return("yes") #defaults to yes, since this can only be reached if no names have been missed (i.e. all have been found)
}
}))
Solution 9:[9]
Adding another solution that I didn't see above:
Basically use either left_join()
, anti_join
and full_join
to match the values in the dataframe
#see which values in df2 match df1
left_join(df1,df2,by="name")
#see which values don't match in either df1 and df2
anti_join(df1,df2,by="name")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | |
Solution 4 | |
Solution 5 | Donald Seinen |
Solution 6 | |
Solution 7 | Mark Hamlin |
Solution 8 | KnightofniDK |
Solution 9 | alejandro_hagan |