'Match records with a combination of regex and lookup

I want to match personal records between two tables using the following logic:

  1. Regex match on last name up to minor variations - summarized by the following regex for a given last name: grepl("LNAME( .r|-| [ivx]|.*)", last_name, ignore.case = TRUE). The function fuzzyjoin::regex_*_join was suggested, but I'm not sure how to use it if the name isn't static...?

  2. Match on first name based on the nicknames list. So basically matching all names in nicknames[[fname]] or just fname if that is empty. Should not be case-sensitive as well.

  3. Exact match on city, not case-sensitive.

Right now I'm just iterating through df1 and implementing this logic by hand, but my data set is large and it's taking way too long, plus the manual implementation doesn't lend itself to parallelization, which is a concern as I willwant to optimize this in the future. There has to be a smarter way of doing this.

Example data:

df1 <- tibble("lname1"=c("SMITH","BLACK","MILLER"),
              "fname1"=c("JOHN","THOMAS","JAMES"),
              "city"=c("NEW YORK","LOS ANGELES","SEATTLE"),
              "id1"=c("aaaa","bbbb","cccc"),
              "misc1"=c("bla","ble","bla"))

df2 <- tibble("lname2"=c("Smith Jr.","Black III","Miller-Muller","Smith"),
              "fname2"=c("Jon","Tom","Jamie","John"),
              "city"=c("New York","Los Angeles","Seattle","New York"),
              "id2"=c("1111","2222","3333","4444"),
              "misc2"=c("bonk","bzdonk","boom","bam"))

nicknames <- list("john"=c("john","jon","johnny"), 
                  "thomas"=c("thomas","tom","tommy"),
                  "james"=c("james","jamie","jim"))

Expected output:

expected_output <- tibble("id1"=c("aaaa","aaaa","bbbb","cccc"),
                          "id2"=c("1111","4444","2222","3333"),
                          "lname1"=c("SMITH","SMITH","BLACK","MILLER"),
                          "fname1"=c("JOHN","JOHN","THOMAS","JAMES"),
                          "lname2"=c("Smith Jr.","Smith","Black III","Miller-Muller"),
                          "fname2"=c("Jon","John","Tom","Jamie"),
                          "city"=c("New York","New York","Los Angeles","Seattle"),
                          "misc1"=c("bla","bla","ble","bla"),
                          "misc2"=c("bonk","bam","bzdonk","boom"))

# A tibble: 4 x 9
  id1   id2   lname1      fname1 lname2        fname2 city        misc1 misc2 
  <chr> <chr> <chr>       <chr>  <chr>         <chr>  <chr>       <chr> <chr> 
1 aaaa  1111  SMITH       JOHN   Smith Jr.     Jon    New York    bla   bonk  
2 aaaa  4444  SMITH       JOHN   Smith         John   New York    bla   bam   
3 bbbb  2222  BLACK       THOMAS Black III     Tom    Los Angeles ble   bzdonk
4 cccc  3333  MILLER      JAMES  Miller-Muller Jamie  Seattle     bla   boom  

EDIT:

This is as far as I got. Spent the past few hours trying to get the last step done but I can't. I have this:

df <- tibble("fname1"=c("JOHN","JOHN","JOHN"),
             "lname1"=c("SMITH","SMITH","SMITH"),
             "fname2"=c("FRANK","JOHN","BILL"),
             "lname2"=c("SMITH","SMITH","SMITH"),
             "city"=c("NEW YORK","NEW YORK","NEW YORK"))

nicknames_df <- tibble(fname = names(nicknames), nick = paste0("^(", sapply(nicknames, paste, collapse = "|"), ")$"))

>df
# A tibble: 3 x 5
  fname1 lname1 fname2 lname2 city    
  <chr>  <chr>  <chr>  <chr>  <chr>   
1 JOHN   SMITH  FRANK  SMITH  NEW YORK
2 JOHN   SMITH  JOHN   SMITH  NEW YORK
3 JOHN   SMITH  BILL   SMITH  NEW YORK

>nicknames_df
# A tibble: 3 x 2
  fname  nick                
  <chr>  <chr>               
1 john   ^(john|jon|johnny)$ 
2 thomas ^(thomas|tom|tommy)$
3 james  ^(james|jamie|jim)$ 

Expected output:

> out
# A tibble: 1 x 5
  fname1 lname1 fname2 lname2 city    
  <chr>  <chr>  <chr>  <chr>  <chr>   
1 JOHN   SMITH  JOHN   SMITH  NEW YORK

How do I join it with nicknames df to get just the 2nd row?!

out <- df %>% fuzzyjoin::regex_left_join(nicknames_df, ???)



Solution 1:[1]

fuzzyjoin::regex_right_join(
  df2, df1, by = c(lname2 = "lname1"),
  ignore_case = TRUE)
# # A tibble: 4 x 10
#   lname2        fname2 city.x      id2   misc2  lname1 fname1 city.y      id1   misc1
#   <chr>         <chr>  <chr>       <chr> <chr>  <chr>  <chr>  <chr>       <chr> <chr>
# 1 Smith Jr.     Jon    New York    1111  bonk   SMITH  JOHN   NEW YORK    aaaa  bla  
# 2 Smith         John   New York    4444  bam    SMITH  JOHN   NEW YORK    aaaa  bla  
# 3 Black III     Tom    Los Angeles 2222  bzdonk BLACK  THOMAS LOS ANGELES bbbb  ble  
# 4 Miller-Muller Jamie  Seattle     3333  boom   MILLER JAMES  SEATTLE     cccc  bla  

I didn't want to assume any resolution for city.x vs city.y; while it's clear visually that they're good, I'll let you work through that.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 r2evans