Using parallel package functions inside your own R package

I created my own R package with functions that use parallel functions such as makeCluster, parLapply, etc.
However, they are much slower inside the package than when used outside of it: cluster initialization and exporting objects take longer.
Do you have tips on how to properly use parallel functions inside one's own R package?
Example of using parallel:
cl <- parallel::makeCluster(parallel::detectCores() - 1)
parallel::clusterExport(cl, varlist = c("data"), envir = environment())
parallel::clusterCall(cl, function() { library(myPackage) })
data_res <- parallel::parLapply(cl, 1:nrow(data), function(i) {
  tryCatch(myFun(data[i,]), error = function(err) { return(data.table(row = i)) })
})
if (!is.null(cl)) {
  parallel::stopCluster(cl)
  cl <- c()
}
gc()
Thanks
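One tip that addresses the slow initialization: create the cluster once, outside the package function, and pass it in, so cluster startup and loading the package on the workers happen a single time rather than on every call. A minimal sketch under that assumption (run_parallel is an illustrative wrapper; myFun and myPackage are taken from the question):

run_parallel <- function(data, cl = NULL) {
  # Reuse a caller-supplied cluster; only create (and later destroy) one ourselves
  own_cluster <- is.null(cl)
  if (own_cluster) {
    cl <- parallel::makeCluster(parallel::detectCores() - 1)
    parallel::clusterCall(cl, function() library(myPackage))
    on.exit(parallel::stopCluster(cl), add = TRUE)
  }
  parallel::clusterExport(cl, varlist = "data", envir = environment())
  parallel::parLapply(cl, seq_len(nrow(data)), function(i) myFun(data[i, ]))
}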

Related

Showing progress_bar with doParallel + foreach

I am using the example code posted here to show a progress_bar (from the progress package) with doParallel + foreach. The solutions there, however, make use of doSNOW (e.g. the code by Dewey Brooke that I am using for testing), which is more outdated than doParallel and returns this NOTE when building a package with CRAN flags:
Uses the superseded package: ‘doSNOW (>= 1.0.19)’
Making this change does not seem as straightforward as expected. If only registerDoSNOW is replaced by registerDoParallel, and .options.snow by .options.doparallel, the code will run, but in the second case it will not show any progress bar at all.
I think this might relate to the use of .options.X. This part of the code is very obscure to me, since .options.snow does work when using doSNOW, but there is no documentation on the foreach man page about this argument. It is therefore not surprising that .options.doparallel does not work, since it was just a wild guess of mine.
Including the call to pb$tick() within the foreach loop does not work either, and actually causes the result to be wrong. So I am really out of ideas about where to put it.
Where does .options.snow come from? Where should pb$tick() go? How can the progress_bar object be shown using doParallel here?
I paste below the code (doSNOW replaced by doParallel) for convenience, but credit again the original source:
library(parallel)
library(doParallel)
numCores <- detectCores()
cl <- makeCluster(numCores)
registerDoParallel(cl)

# progress bar ------------------------------------------------------------
library(progress)
iterations <- 100  # used for the foreach loop
pb <- progress_bar$new(
  format = "letter = :letter [:bar] :elapsed | eta: :eta",
  total = iterations,  # 100
  width = 60)
progress_letter <- rep(LETTERS[1:10], 10)  # token reported in progress bar

# allowing progress bar to be used in foreach -----------------------------
progress <- function(n){
  pb$tick(tokens = list(letter = progress_letter[n]))
}
opts <- list(progress = progress)

# foreach loop ------------------------------------------------------------
library(foreach)
foreach(i = 1:iterations, .combine = rbind, .options.doparallel = opts) %dopar% {
  summary(rnorm(1e6))[3]
}
stopCluster(cl)
doParallel still uses the .options.snow argument for whatever reason. I found this little tidbit in the doParallel documentation.
The doParallel backend supports both multicore and snow options passed through the foreach function. The supported multicore options are preschedule, set.seed, silent, and cores, which are analogous to the similarly named arguments to mclapply, and are passed using the .options.multicore argument to foreach. The supported snow options are preschedule, which like its multicore analog can be used to chunk the tasks so that each worker gets a prescheduled chunk of tasks, and attachExportEnv, which can be used to attach the export environment in certain cases where R's lexical scoping is unable to find a needed export. The snow options are passed to foreach using the .options.snow argument.
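For example, a minimal sketch assuming the cluster registered above: a snow-style option such as preschedule is passed to the doParallel backend through .options.snow:

foreach(i = 1:10, .combine = c,
        .options.snow = list(preschedule = TRUE)) %dopar% {
  # preschedule = TRUE chunks the tasks so each worker gets one prescheduled block
  sqrt(i)
}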
foreach is a powerful package, but whoever is maintaining it makes odd decisions.
EDIT
doParallel does not support the progress snow option. Therefore, a progress bar will NOT display if registerDoParallel is used instead of registerDoSNOW.
While doSNOW has been superseded, it's unclear whether one is more outdated than the other, since both have undergone very few changes other than updating the current maintainer (doParallel | doSNOW).
doSNOW
doSNOW:::doSNOW <- function (obj, expr, envir, data)
{
    cl <- data
    preschedule <- FALSE
    attachExportEnv <- FALSE
    progressWrapper <- function(...) NULL  # <- CRITICAL DIFFERENCE
    if (!inherits(obj, "foreach"))
        stop("obj must be a foreach object")
    it <- iter(obj)
    accumulator <- makeAccum(it)
    options <- obj$options$snow
    if (!is.null(options)) {
        nms <- names(options)
        recog <- nms %in% c("preschedule", "attachExportEnv",
            "progress")  # <- CRITICAL DIFFERENCE
        if (any(!recog))
            warning(sprintf("ignoring unrecognized snow option(s): %s",
                paste(nms[!recog], collapse = ", ")), call. = FALSE)
    ...
doParallel
doParallel:::doParallelSNOW <- function (obj, expr, envir, data)
{
    cl <- data
    preschedule <- FALSE
    attachExportEnv <- FALSE
    # MISSING: progressWrapper <- function(...) NULL
    if (!inherits(obj, "foreach"))
        stop("obj must be a foreach object")
    it <- iter(obj)
    accumulator <- makeAccum(it)
    options <- obj$options$snow
    if (!is.null(options)) {
        nms <- names(options)
        recog <- nms %in% c("preschedule", "attachExportEnv")  # MISSING: "progress"
        if (any(!recog))
            warning(sprintf("ignoring unrecognized snow option(s): %s",
                paste(nms[!recog], collapse = ", ")), call. = FALSE)
    ...
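Putting this together, a sketch of the minimal change to the question's code: keep progress and opts exactly as defined there, but register doSNOW and pass opts via .options.snow:

library(doSNOW)
registerDoSNOW(cl)
foreach(i = 1:iterations, .combine = rbind, .options.snow = opts) %dopar% {
  summary(rnorm(1e6))[3]
}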

How to initialize workers to use package functions in parallel

I am developing an R package and trying to use parallel processing in it for an embarrassingly parallel problem. I would like to write a loop or functional that uses the other functions from my package. I am working on Windows, and I have tried using parallel::parLapply and foreach::%dopar%, but cannot get the workers (cores) to access the functions in my package.
Here's an example of a simple package with two functions, where the second calls the first inside a parallel loop using %dopar%:
add10 <- function(x) x + 10

slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%`  # so %dopar% doesn't need to be attached
  foreach::foreach(i = 1:m) %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
}
When I load the package with devtools::load_all() and call the slowadd function, Error in { : task 1 failed - "could not find function "add10"" is returned.
I have also tried explicitly initializing the workers with my package:
add10 <- function(x) x + 10

slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%`  # so %dopar% doesn't need to be attached
  foreach::foreach(i = 1:m, .packages = 'mypackage') %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
}
but I get the error Error in e$fun(obj, substitute(ex), parent.frame(), e$data) : worker initialization failed: there is no package called 'mypackage'.
How can I get the workers to access the functions in my package? A solution using foreach would be great, but I'm completely open to solutions using parLapply or other functions/packages.
I was able to initialize the workers with my package's functions, thanks to people's helpful comments. By making sure that all of the needed package functions were exported in the NAMESPACE and installing the package with devtools::install(), foreach was able to find the package for initialization. The R script for the example looks like this:
#' @export
add10 <- function(x) x + 10

#' @export
slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%`  # so %dopar% doesn't need to be attached
  out <- foreach::foreach(i = 1:m, .packages = 'mypackage') %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
  return(out)
}
This is working, but it's not an ideal solution. First, it makes for a much slower workflow. I was using devtools::load_all() every time I made a change to the package and wanted to test it (before incorporating parallelism), but now I have to reinstall the package every time, which is slow when the package is large. Second, every function that is needed in the parallel loop needs to be exported so that foreach can find it. My actual use case has a lot of small utility functions which I would rather keep internal.
You can use devtools::load_all() inside the foreach loop or load the functions you need with source.
out <- foreach::foreach(i = 1:m) %dopar% {
  Sys.sleep(1)
  source("R/some_functions.R")
  load("R/sysdata.rda")
  add10(i)
}
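Another sketch that avoids re-sourcing files on every iteration: pass the needed functions by name through foreach's .export argument, which copies them to the workers, so internal (unexported) functions work and no installed copy of the package is required. This assumes add10 is visible in the calling environment, e.g. under devtools::load_all():

slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%`
  # .export serializes add10 to each worker by name
  out <- foreach::foreach(i = 1:m, .export = "add10") %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
  return(out)
}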

R foreach parallel and package variables

I have the following code, where I get an error like:
object 'a' could not be found
This file is inside a package.
It works when I use %do% instead of %dopar%.
a <- 2

fun1 <- function(x)
{
  y <- x * a
  return(y)
}

fun2 <- function(n)
{
  foreach(data = 1:n, .combine = rbind, .multicombine = TRUE, .export = c("a", "fun1")) %dopar%
  {
    load_all()  # works; get error if line is removed
    y <- fun1(data)
    return(y)
  }
}
In my main file, where I use devtools::load_all() to load the package and doParallel to run my foreach on multiple cores, I execute fun2(5) and get the error.
If I use a directly in the foreach body, everything works. But when I use a function that uses the a variable, I get the error.
My main issue is that I want to be able to call fun1 both from fun2 and standalone from the package.
The cluster is created as:
cl <- makeCluster(16)
registerDoParallel(cl)
clusterCall(cl, function(x) .libPaths(x), .libPaths())
# %dopar% code
stopCluster(cl)
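One workaround sketch, assuming the setup above: pass a into fun1 as an argument, so the workers resolve it from the exported loop variables instead of the package namespace (which is not loaded on the workers):

fun1 <- function(x, a) x * a

fun2 <- function(n)
{
  foreach(data = 1:n, .combine = rbind, .multicombine = TRUE,
          .export = c("a", "fun1")) %dopar%
  {
    fun1(data, a)  # a is referenced in the loop body, so the export takes effect
  }
}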

How does foreach package scope R Environments when using as.formula, SE dplyr, and lapply?

I have a function where I dynamically build multiple formulas as strings and cast them to formulas with as.formula. I then call that function in a parallel process using doSNOW and foreach, and use those formulas through dplyr::mutate_.
When I use lapply(formula_list, as.formula) I get the error could not find function custom_function when run in parallel, though it works fine when run locally. However, when I use lapply(formula_list, function(x) as.formula(x)) it works both in parallel and locally.
Why? What's the correct way to understand the environments here and the "right" way to code it?
I do get a warning that says: In e$fun(obj, substitute(ex), parent.frame(), e$data) : already exporting variable(s): custom_func
A minimal reproducible example is below.
# Packages
library(dplyr)
library(doParallel)
library(doSNOW)
library(foreach)

# A simple custom function
custom_sum <- function(x){
  sum(x)
}

# Functions that create formulas and use them with SE dplyr:
dplyr_mut_lapply_reg <- function(df){
  my_dots <- setNames(
    object = lapply(list("~custom_sum(Sepal.Length)"), as.formula),
    nm = c("Sums")
  )
  return(
    df %>%
      group_by(Species) %>%
      mutate_(.dots = my_dots)
  )
}

dplyr_mut_lapply_lambda <- function(df){
  my_dots <- setNames(
    object = lapply(list("~custom_sum(Sepal.Length)"), function(x) as.formula(x)),
    nm = c("Sums")
  )
  return(
    df %>%
      group_by(Species) %>%
      mutate_(.dots = my_dots)
  )
}

# 1. CALLING BOTH LOCALLY
dplyr_mut_lapply_lambda(iris)  # works
dplyr_mut_lapply_reg(iris)     # works

# 2. CALLING IN PARALLEL
# Faux parallel setup
cl <- makeCluster(1, outfile = "")
registerDoSNOW(cl)

# Call lambda version: WORKS
foreach(j = 1,
        .packages = c("dplyr", "tidyr"),
        .export = lsf.str()
) %dopar% {
  dplyr_mut_lapply_lambda(iris)
}

# Call regular version: FAILS
foreach(j = 1,
        .packages = c("dplyr", "tidyr"),
        .export = lsf.str()
) %dopar% {
  dplyr_mut_lapply_reg(iris)
}

# Close cluster
stopCluster(cl)
EDIT: In my original post title I wrote that I was using NSE, but I really meant standard evaluation. Whoops. I have changed this accordingly.
I don't have an exact answer as to why, but the future package (I'm the author) handles these types of "tricky" globals - they are tricky because they are not part of a package and they are nested, i.e. one global calls another global. For example, if you use:
library("doFuture")
cl <- parallel::makeCluster(1, outfile = "")
plan(cluster, workers = cl)
registerDoFuture()
that problematic "Call Regular Version FAILS" case should now work.
Now, the above uses parallel::makeCluster(), which defaults to type = "PSOCK", whereas if you load doSNOW you get snow::makeCluster(), which defaults to type = "MPI". Unfortunately, a full MPI backend is not yet implemented for the future package. Thus, if you're looking for an MPI solution, this won't help you (yet).
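For completeness, a usage sketch: with the doFuture backend registered as above, the previously failing call from the question can be re-run unchanged:

foreach(j = 1,
        .packages = c("dplyr", "tidyr"),
        .export = lsf.str()
) %dopar% {
  dplyr_mut_lapply_reg(iris)  # future's globals detection now finds custom_sum
}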

do parallel combine progress bar and process

I'm having trouble combining the process that I want to run in parallel with the creation of the progress bar.
My code for the process is:
pred_pnn <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  pred <- foreach(i = xlst, .packages = c('tcltk', 'foreach'), .combine = rbind)
  %dopar%
  {
    mypb <- tkProgressBar(title = "R progress bar", label = "",
                          min = 0, max = max(jSeq), initial = 0, width = 300)
    foreach(j = jSeq) %do% {
      Sys.sleep(.1)
      setTkProgressBar(mypb, j, title = "pb", label = NULL)
    }
    library(pnn)
    data.frame(prob = guess(nn, as.matrix(i))$probabilities[1], row.names = NULL)
  }
}
I combined my code with the one that comes from here, but it doesn't run: I get a syntax error that I can't find.
I tried this other code:
pred_pnn <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  pred <- foreach(i = xlst, .combine = rbind) %dopar% {
    library(pnn)
    cat(i, '\n')
    data.frame(prob = guess(nn, as.matrix(i))$probabilities[1], row.names = NULL)
  }
}
But I get an error too.
The approach that you're trying to use might work under certain circumstances, but it isn't a good general solution. What I would want to do is to create a progress bar in the master process (outside of the foreach loop) and then have foreach update that progress bar as tasks are returned. Unfortunately, none of the backends support that. It's possible to do that using combine function tricks, but only if you're using a backend that supports calling the combine function on-the-fly, which doParallel, doSNOW and doMC do not. Those backends don't call combine on the fly because they are implemented using functions such as clusterApplyLB and mclapply which don't support a hook to allow user supplied code to be executed when tasks are returned.
Because I've seen interest in progress bar support in foreach, I modified the doSNOW package to add support for a doSNOW-specific "progress" option, and I checked the code into R-Forge. It makes use of some lower-level functions in the snow package which unfortunately are not exported by the parallel package.
If you want to try out this new feature, you will need to install doSNOW from R-Forge. I did this on my MacBook Pro using the command:
install.packages("doSNOW", repos="http://R-Forge.R-project.org", type="source")
Here is a simple example script that demonstrates the experimental "progress" option:
library(doSNOW)
library(tcltk)
cl <- makeSOCKcluster(3)
registerDoSNOW(cl)

pb <- tkProgressBar(max = 100)
progress <- function(n) setTkProgressBar(pb, n)
opts <- list(progress = progress)

r <- foreach(i = 1:100, .options.snow = opts) %dopar% {
  Sys.sleep(1)
  sqrt(i)
}
Update
The progress option is now available in the latest version of doSNOW on CRAN.
