Parallelizing code for matching between two data frames
I'm trying to parallelize a function that works roughly like this:
process_row <- function(i, db1, db2){
  row <- db1[i, ]
  match_db <- db2 %>% dplyr::filter(name_clear == row$name)
  out <- cbind(row, match_db)
  out
}
out_list <- lapply(1:nrow(db1), process_row, db1 = db1, db2 = db2)
In reality it's more complicated but the basic idea is that it goes row-by-row through db1, finds appropriate matching rows in db2 according to a set of rules, and then binds them together column-wise (one-to-many relationship). In terms of size, db1 has 40k rows, db2 has 6 million. The matches (out) are usually between 1 and 100 rows long.
When I run the regular lapply, it takes up very little CPU & memory. I can "manually" parallelize it by opening up multiple instances of RStudio and changing the range of the lapply to 1:10000 in the first one, 10001:20000 in the second, etc., but that's ridiculous. Unfortunately it's the only "working" solution I've found so far. Neither foreach nor future.apply work - they just crash after running out of memory completely, even on a range of 1:10.
I suspect this has to do with the way I wrote the function. How should I change it?
Solution 1:[1]
Here's a familiar dataset for db1
library(dplyr)
db1 = as_tibble(mtcars, rownames = "model")
From your previous post, I'm guessing that you might have something analogous to a tibble of 'brands' and a corresponding regular expression to extract brands that have models associated with them.
db2 <-
  db1 |>
  mutate(
    brand = sub(" .*", "", model),
    brand_re = paste0("^(?i)", brand, "( .*)$")
  ) |>
  select(brand, brand_re) |>
  distinct()
and you'd like to do the equivalent of
fuzzyjoin::regex_right_join(db1, db2, by = c(model = "brand_re")) |>
  ## just a few columns to illustrate
  select(brand, model, mpg, cyl, disp, drat, wt)
The problem with using fuzzyjoin for your case is that it tries to be
efficient by making two parallel vectors that capture all combinations
of unique regular expressions and unique values to pass to the
stringi::stri_detect() function, but of course with 45k regular
expressions and 6 million values, those are very long vectors!
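Roughly speaking (this is a conceptual sketch of the combinations described above, not fuzzyjoin's exact internals), the two parallel vectors look like this, and their length is the product of the two unique counts:
## sketch: every unique pattern paired with every unique string
u_pat <- unique(db2$brand_re)
u_str <- unique(db1$model)
all_pat <- rep(u_pat, each = length(u_str))
all_str <- rep(u_str, times = length(u_pat))
length(all_pat)   # length(u_pat) * length(u_str)
## with 45k patterns and 6 million values that is about 2.7e11 pairs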
The challenge with your approach to parallelization is that too much data
is used in the lapply(), and the operations (repeatedly subsetting a
very large data.frame and binding columns) are not efficient.
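As a rough illustration (the 6-million-row tibble below is a made-up stand-in for db2, and timings vary by machine), the original approach repeats a filter over a very large table 40k times, and parallel back ends such as foreach or future.apply additionally ship a copy of that table to every worker:
## rough cost sketch with an illustrative 6-million-row table
big <- tibble::tibble(name_clear = sample(LETTERS, 6e6, replace = TRUE))
format(object.size(big), units = "MB")          # one copy per worker ...
system.time(dplyr::filter(big, name_clear == "A"))  # ... plus 40k of these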
Instead, write a function that performs the necessary computation with only the necessary data. Here we iterate over the patterns of interest and return a list of integer vectors, each indexing the elements of str (model in our case) that match the corresponding regular expression.
fuzzy_match <- function(str, patterns) {
  lapply(patterns, function(pattern, str) {
    which(stringi::stri_detect_regex(str, pattern))
  }, str)
}
joins <- fuzzy_match(db1$model, db2$brand_re)
str(head(joins))
For instance, the first element of joins says that the first regular
expression of db2$brand_re matches elements 1 and 2 of db1$model.
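We can spot-check this against the data (the Mazda pattern comes first because distinct() keeps rows in order of first appearance):
db2$brand_re[1]          # "^(?i)Mazda( .*)$"
db1$model[joins[[1]]]    # "Mazda RX4" "Mazda RX4 Wag"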
We can then construct the joined table by building the relevant row indexes and binding the replicated rows of db2 and db1 column-wise
db1_idx <- unlist(joins)
db2_idx <- rep(seq_along(joins), lengths(joins))
## FIXME: not quite a join, since regular expressions in db2 with no
## matches in db1 are not present
cbind(db2[db2_idx,], db1[db1_idx,]) |>
  as_tibble() |>
  select(brand, model, mpg, cyl, disp, drat, wt)
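A quick sanity check on the index construction: there should be one output row per (pattern, match) pair, and the two index vectors must line up.
stopifnot(
  length(db1_idx) == sum(lengths(joins)),
  length(db1_idx) == length(db2_idx)
)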
When str is of length 5 million, this seems to take about 1 second
per regular expression, so 45k regular expressions is still a long time. The first step is to
ask whether there are improvements to the naive implementation of
fuzzy_match() that would speed it up. For instance if the str
contains duplicates, it might be beneficial to only do the match on
unique values.
fuzzy_match_with_duplicates <- function(str, patterns) {
  ## only iterate on unique values; useful if many duplicates
  ustr <- unique(str)
  uresult <- lapply(patterns, function(pattern, str) {
    which(stringi::stri_detect_regex(str, pattern))
  }, ustr)
  ## map matches on the unique values back to positions in the full str
  lapply(uresult, function(i) which(str %in% ustr[i]))
}
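On the mtcars example (where the models happen to be unique already) the two implementations should agree; a quick check along these lines is cheap insurance before switching:
joins_dup <- fuzzy_match_with_duplicates(db1$model, db2$brand_re)
identical(joins, joins_dup)   # expected TRUE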
It might also pay to study the regular expressions and eliminate redundancies; for instance, in the present case we could obtain the relevant results (except for models without brands) in a single pass
db1 |>
mutate(brand = sub(" .*", "", model))
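One way to continue that single-pass idea (a sketch; whether it applies depends on your real matching rules) is to join on the extracted brand rather than running any regular expressions at all:
## same columns as the fuzzy join, for models whose brand appears in db2
db1 |>
  mutate(brand = sub(" .*", "", model)) |>
  inner_join(db2, by = "brand") |>
  select(brand, model, mpg, cyl, disp, drat, wt)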
In the example of your previous post, it seems like one could make progress by replacing specific patterns such as SMITH with [[:alpha:]]+ or similar.
If the function is still slow, the goal is now to parallelize a simple
lapply() with modest data. On Linux or macOS the easiest solution is
to use parallel::mclapply()
## save some cores for other things...
options(mc.cores = parallel::detectCores() - 2L)
fuzzy_match_parallel_mclapply <- function(str, patterns) {
  parallel::mclapply(patterns, function(pattern, str) {
    which(stringi::stri_detect_regex(str, pattern))
  }, str)
}
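Usage mirrors the serial version (mclapply() relies on forking, which is not available on Windows):
joins <- fuzzy_match_parallel_mclapply(db1$model, db2$brand_re)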
On Windows the easiest approach is to use parallel::parLapply() with
a socket-based back end
fuzzy_match_parallel_sockets <- function(str, patterns, cl) {
  parallel::parLapply(cl, patterns, function(pattern, str) {
    which(stringi::stri_detect_regex(str, pattern))
  }, str)
}
cl <- parallel::makePSOCKcluster(parallel::detectCores() - 2L)
joins <- fuzzy_match_parallel_sockets(db1$model, db2$brand_re, cl)
parallel::stopCluster(cl)
Subsequent processing steps are unchanged.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
