How to fork an environment from inside a function for parallel computing in R
I have a custom R function that creates a large data object which needs to be shared with parallel workers. The toy minimal example below shows the structure of my problem. The issue is that when I set up a forking cluster (using parallel::makeForkCluster), it forks the global environment; however, my data object is created inside the function and so is not in global scope. Does anyone have a solution for efficiently sharing this object ('obj' in my example) with the workers?
I have tried clusterExport, but because the data object is large this takes a significant hit in computation time and memory. I only need read access to the object, so thread-safe write access is not an issue. I can push the object to the global environment (my current hack, sketched further below), but that will not pass CRAN checks. The existing shared-memory solutions seem to be restricted to matrix-like objects, whereas I have a complex nested list. This code only needs to run on *nix, so forking is fine and runs with a very acceptable time/memory footprint. I just need to either fork the local function environment or otherwise make the 'obj' object visible to the workers somehow.
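For reference, the clusterExport attempt mentioned above looks roughly like the following. This is a minimal sketch rather than the original code, and the name cluster_call_export is made up for illustration. Note that clusterExport looks up variables in .GlobalEnv by default, so envir = environment() is needed to export an object that only exists inside the function; the export then serializes a full copy of 'obj' to every worker, which is where the time and memory cost comes from.

cluster_call_export = function(par) {
  obj = list(a = par, b = c("x", "y", "z"))
  cluster = parallel::makeForkCluster(2)
  # Export 'obj' from the local frame; the default envir is .GlobalEnv,
  # where 'obj' does not exist. This sends a serialized copy to each worker.
  parallel::clusterExport(cluster, "obj", envir = environment())
  result = parallel::parSapply(cluster, par, worker_call)
  parallel::stopCluster(cluster)
  return(result)
}

With that context, here is the toy minimal example of the actual problem: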
cluster_call = function(par) {
  obj = list(a = par, b = c("x", "y", "z"))  # This object is only created _inside_ the 'cluster_call' environment.
  # assign("obj", obj, envir = .GlobalEnv)   # Uncommenting this line makes the code work.
  cluster = parallel::makeForkCluster(2)     # Only forks the _global_ environment.
  result = parallel::parSapply(cluster, par, worker_call)
  parallel::stopCluster(cluster)
  return(result)
}

worker_call = function(i) {
  paste(obj$a[i], obj$b, sep = ".")
}

cluster_call(par = 1:3)
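For completeness, the global-environment workaround mentioned above can be wrapped so that the temporary global binding is removed again when the function exits. This is only a sketch of that hack: the name cluster_call_global and the on.exit cleanup are additions for illustration. It avoids the per-worker copy because the forked workers inherit the parent's global environment copy-on-write, but writing into .GlobalEnv from package code is precisely what fails CRAN checks.

cluster_call_global = function(par) {
  obj = list(a = par, b = c("x", "y", "z"))
  # Hack: put 'obj' into the global environment before forking so the
  # workers can see it, then remove it again when the function exits.
  assign("obj", obj, envir = .GlobalEnv)
  on.exit(rm(list = "obj", envir = .GlobalEnv), add = TRUE)
  cluster = parallel::makeForkCluster(2)
  on.exit(parallel::stopCluster(cluster), add = TRUE)
  result = parallel::parSapply(cluster, par, worker_call)
  return(result)
}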
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
