'Combining multiple filters with %ilike%
I'm working on matching last names between two tables. However, there are some variations that I have to take into account. For instance, I found that "Smith" in db1 can potentially have other forms in db2:
Smith, Smith-Whatever, Smith Jr., Smith Sr., Smith III (any Roman numeral)
Lower/uppercase is also an issue.
I'm trying to implement this logic in dplyr. I found the %ilike% operator in data.table, which seems to work kind of like the SQL equivalent. I can use it like this:
match <- db2 %>%
dplyr::filter(last_name %ilike% "^smith$" | last_name %ilike% "^smith-" | last_name %ilike% "^smith .r" | last_name %ilike% "^smith [ivx]")
Of course the strings wouldn't be hardcoded but rather obtained by iterating through db1. Either way, this is unwieldy.
Hence my question:
Is there a way to combine the functionality of %ilike% with something like %in% - by specifying a vector of regexes the 'ilikes' of which I would match against? Is there a smarter way of doing this?
Solution 1:[1]
You can combine the pattern with |. You may use grepl (or str_detect if you are using stringr).
library(dplyr)
db2 %>% filter(grepl("smith( .r|-| [ivx]|.*)", last_name, ignore.case = TRUE))
# last_name
#1 Smith
#2 Smith-Whatever
#3 Smith Jr.
#4 Smith Sr.
#5 Smith III
If you want to construct the pattern dynamically you can do -
pat <- c('smith', 'smith-', 'smith .r', 'smith [ivx]')
db2 %>% dplyr::filter(grepl(paste0(pat, collapse = "|"), last_name, ignore.case = TRUE))
Also, would it be enough to filter rows that have only 'smith' in them ?
db2 %>% filter(grepl('smith', last_name, ignore.case = TRUE))
Solution 2:[2]
Using RonakShah's pat and my db2 below, ...
Filter
You might try an any operator to iterate through each pattern:
db2 %>%
filter(rowSums(sapply(pat, grepl, name)) > 0)
# name
# 1 smith
# 2 smith-something
And since data.table::%ilike% and data.table::%like% are really using grepl under the hood, this is about the same thing.
Merge/join
If your patterns are in a new frame, you can join them in with:
patdf <- data.frame(ptn = pat, num = seq_along(pat))
patdf
# ptn num
# 1 smith 1
# 2 smith- 2
# 3 smith .r 3
# 4 smith [ivx] 4
fuzzyjoin::regex_left_join(db2, patdf, by = c("name" = "ptn"))
# name ptn num
# 1 smith smith 1
# 2 jones <NA> NA
# 3 hubert <NA> NA
# 4 smith-something smith 1
# 5 smith-something smith- 2
Granted, this is multiplying rows, since it matches multiple times. This can be reduced. Let's assume your original data has a unique id field:
db2$id <- 10L + seq_len(nrow(db2))
fuzzyjoin::regex_left_join(db2, patdf, by = c("name" = "ptn")) %>%
filter(!is.na(ptn)) %>%
group_by(id) %>%
slice_min(num) %>%
ungroup()
# # A tibble: 2 x 4
# name id ptn num
# <chr> <int> <chr> <int>
# 1 smith 11 smith 1
# 2 smith-something 14 smith 1
Data
db2 <- structure(list(name = c("smith", "jones", "hubert", "smith-something")), class = "data.frame", row.names = c(NA, -4L))
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ronak Shah |
| Solution 2 |
