'Match records with a combination of regex and lookup
I want to match personal records between two tables using the following logic:
Regex match on last name up to minor variations - summarized by the following regex for a given last name:
grepl("LNAME( .r|-| [ivx]|.*)", last_name, ignore.case = TRUE). The functionfuzzyjoin::regex_*_joinwas suggested, but I'm not sure how to use it if the name isn't static...?Match on first name based on the nicknames list. So basically matching all names in
nicknames[[fname]]or justfnameif that is empty. Should not be case-sensitive as well.Exact match on city, not case-sensitive.
Right now I'm just iterating through df1 and implementing this logic by hand, but my data set is large and it's taking way too long, plus the manual implementation doesn't lend itself to parallelization, which is a concern as I willwant to optimize this in the future. There has to be a smarter way of doing this.
Example data:
df1 <- tibble("lname1"=c("SMITH","BLACK","MILLER"),
"fname1"=c("JOHN","THOMAS","JAMES"),
"city"=c("NEW YORK","LOS ANGELES","SEATTLE"),
"id1"=c("aaaa","bbbb","cccc"),
"misc1"=c("bla","ble","bla"))
df2 <- tibble("lname2"=c("Smith Jr.","Black III","Miller-Muller","Smith"),
"fname2"=c("Jon","Tom","Jamie","John"),
"city"=c("New York","Los Angeles","Seattle","New York"),
"id2"=c("1111","2222","3333","4444"),
"misc2"=c("bonk","bzdonk","boom","bam"))
nicknames <- list("john"=c("john","jon","johnny"),
"thomas"=c("thomas","tom","tommy"),
"james"=c("james","jamie","jim"))
Expected output:
expected_output <- tibble("id1"=c("aaaa","aaaa","bbbb","cccc"),
"id2"=c("1111","4444","2222","3333"),
"lname1"=c("SMITH","SMITH","BLACK","MILLER"),
"fname1"=c("JOHN","JOHN","THOMAS","JAMES"),
"lname2"=c("Smith Jr.","Smith","Black III","Miller-Muller"),
"fname2"=c("Jon","John","Tom","Jamie"),
"city"=c("New York","New York","Los Angeles","Seattle"),
"misc1"=c("bla","bla","ble","bla"),
"misc2"=c("bonk","bam","bzdonk","boom"))
# A tibble: 4 x 9
id1 id2 lname1 fname1 lname2 fname2 city misc1 misc2
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 aaaa 1111 SMITH JOHN Smith Jr. Jon New York bla bonk
2 aaaa 4444 SMITH JOHN Smith John New York bla bam
3 bbbb 2222 BLACK THOMAS Black III Tom Los Angeles ble bzdonk
4 cccc 3333 MILLER JAMES Miller-Muller Jamie Seattle bla boom
EDIT:
This is as far as I got. Spent the past few hours trying to get the last step done but I can't. I have this:
df <- tibble("fname1"=c("JOHN","JOHN","JOHN"),
"lname1"=c("SMITH","SMITH","SMITH"),
"fname2"=c("FRANK","JOHN","BILL"),
"lname2"=c("SMITH","SMITH","SMITH"),
"city"=c("NEW YORK","NEW YORK","NEW YORK"))
nicknames_df <- tibble(fname = names(nicknames), nick = paste0("^(", sapply(nicknames, paste, collapse = "|"), ")$"))
>df
# A tibble: 3 x 5
fname1 lname1 fname2 lname2 city
<chr> <chr> <chr> <chr> <chr>
1 JOHN SMITH FRANK SMITH NEW YORK
2 JOHN SMITH JOHN SMITH NEW YORK
3 JOHN SMITH BILL SMITH NEW YORK
>nicknames_df
# A tibble: 3 x 2
fname nick
<chr> <chr>
1 john ^(john|jon|johnny)$
2 thomas ^(thomas|tom|tommy)$
3 james ^(james|jamie|jim)$
Expected output:
> out
# A tibble: 1 x 5
fname1 lname1 fname2 lname2 city
<chr> <chr> <chr> <chr> <chr>
1 JOHN SMITH JOHN SMITH NEW YORK
How do I join it with nicknames df to get just the 2nd row?!
out <- df %>% fuzzyjoin::regex_left_join(nicknames_df, ???)
Solution 1:[1]
fuzzyjoin::regex_right_join(
df2, df1, by = c(lname2 = "lname1"),
ignore_case = TRUE)
# # A tibble: 4 x 10
# lname2 fname2 city.x id2 misc2 lname1 fname1 city.y id1 misc1
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Smith Jr. Jon New York 1111 bonk SMITH JOHN NEW YORK aaaa bla
# 2 Smith John New York 4444 bam SMITH JOHN NEW YORK aaaa bla
# 3 Black III Tom Los Angeles 2222 bzdonk BLACK THOMAS LOS ANGELES bbbb ble
# 4 Miller-Muller Jamie Seattle 3333 boom MILLER JAMES SEATTLE cccc bla
I didn't want to assume any resolution for city.x vs city.y; while it's clear visually that they're good, I'll let you work through that.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | r2evans |
