'Combining multiple filters with %ilike%

I'm working on matching last names between two tables. However, there are some variations that I have to take into account. For instance, I found that "Smith" in db1 can potentially have other forms in db2:

Smith, Smith-Whatever, Smith Jr., Smith Sr., Smith III (any Roman numeral)

Lower/uppercase is also an issue.

I'm trying to implement this logic in dplyr. I found the %ilike% operator in data.table, which seems to work kind of like the SQL equivalent. I can use it like this:

match <- db2 %>% 
dplyr::filter(last_name %ilike% "^smith$" | last_name %ilike% "^smith-" | last_name %ilike% "^smith .r" | last_name %ilike% "^smith [ivx]")

Of course the strings wouldn't be hardcoded but rather obtained by iterating through db1. Either way, this is unwieldy.

Hence my question:

Is there a way to combine the functionality of %ilike% with something like %in% - by specifying a vector of regexes the 'ilikes' of which I would match against? Is there a smarter way of doing this?



Solution 1:[1]

You can combine the pattern with |. You may use grepl (or str_detect if you are using stringr).

library(dplyr)

db2 %>% filter(grepl("smith( .r|-| [ivx]|.*)", last_name, ignore.case = TRUE))

#       last_name
#1          Smith
#2 Smith-Whatever
#3      Smith Jr.
#4      Smith Sr.
#5      Smith III

If you want to construct the pattern dynamically you can do -

pat <- c('smith', 'smith-', 'smith .r', 'smith [ivx]')

db2 %>%  dplyr::filter(grepl(paste0(pat, collapse = "|"), last_name, ignore.case = TRUE))

Also, would it be enough to filter rows that have only 'smith' in them ?

db2 %>%  filter(grepl('smith', last_name, ignore.case = TRUE))

Solution 2:[2]

Using RonakShah's pat and my db2 below, ...

Filter

You might try an any operator to iterate through each pattern:

db2 %>%
  filter(rowSums(sapply(pat, grepl, name)) > 0)
#              name
# 1           smith
# 2 smith-something

And since data.table::%ilike% and data.table::%like% are really using grepl under the hood, this is about the same thing.

Merge/join

If your patterns are in a new frame, you can join them in with:

patdf <- data.frame(ptn = pat, num = seq_along(pat))
patdf
#           ptn num
# 1       smith   1
# 2      smith-   2
# 3    smith .r   3
# 4 smith [ivx]   4

fuzzyjoin::regex_left_join(db2, patdf, by = c("name" = "ptn"))
#              name    ptn num
# 1           smith  smith   1
# 2           jones   <NA>  NA
# 3          hubert   <NA>  NA
# 4 smith-something  smith   1
# 5 smith-something smith-   2

Granted, this is multiplying rows, since it matches multiple times. This can be reduced. Let's assume your original data has a unique id field:

db2$id <- 10L + seq_len(nrow(db2))
fuzzyjoin::regex_left_join(db2, patdf, by = c("name" = "ptn")) %>%
  filter(!is.na(ptn)) %>%
  group_by(id) %>%
  slice_min(num) %>%
  ungroup()
# # A tibble: 2 x 4
#   name               id ptn     num
#   <chr>           <int> <chr> <int>
# 1 smith              11 smith     1
# 2 smith-something    14 smith     1

Data

db2 <- structure(list(name = c("smith", "jones", "hubert", "smith-something")), class = "data.frame", row.names = c(NA, -4L))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ronak Shah
Solution 2