Knowing what objects to clusterExport beforehand

I'm new to using the parallel packages and have started exploring them in a bid to speed up some of my work. An annoyance I often encounter is that foreach throws errors when I have not clusterExport-ed the relevant functions/variables.
Example
I know that the example below does not necessarily need foreach to be fast, but I'll use it for illustration's sake.
library(doParallel)
library(parallel)
library(lubridate)
library(foreach)
cl <- makeCluster(c("localhost", "localhost", "localhost", "localhost"), type = "PSOCK")
registerDoParallel(cl, cores = 4)
Dates <- sample(c(dates = format(seq(ISOdate(2010,1,1), by='day', length=365), format='%d-%m-%Y')), 500, replace = TRUE)
foreach(i = seq_along(Dates), .combine = rbind) %dopar% dmy(Dates[i])
Error in dmy(Dates[i]) : task 1 failed - "could not find function "dmy""
As you can see, the error says that the dmy function cannot be found. I then have to add the following:
clusterExport(cl, c("dmy"))
So my question is: besides looking at the error for clues about what to export, is there a more elegant way of knowing beforehand which objects to export, or is there a way to share the global environment with all the workers before running the foreach?

There is no need to export individual package functions manually like that. You can use the .packages argument of the foreach function to load the required packages on the workers, so that all of each package's functions are available to your %dopar% expression.
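For the question's example, a minimal sketch of that suggestion (reusing the Dates vector and the cluster set up above) would be:
# Load lubridate on each worker via .packages instead of clusterExport()
foreach(i = seq_along(Dates), .combine = rbind,
        .packages = "lubridate") %dopar% {
  dmy(Dates[i])
}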

Related

error loading package inside foreach when using testthat

I am trying to debug an issue with unit tests using testthat. The code runs fine when run manually, but when running test(), the workers inside the foreach don't seem to have access to the package I am testing or to the functions inside it. The code is quite complex, so I don't have a great working example, but here is the outline of the structure:
unit test in tests/testthat:
test_that("dataset runs successful", {
expect_snapshot_output(myFunc(dataset, params))
})
myFunc calls another function, and inside that function workers are created to run some code:
final_out <- foreach(i = 1:nrow(data),
                     .combine = c,
                     .export = c("func1", "func2", "params"),
                     .packages = c("fields", "dplyr")) %dopar% {
  output = func1(stuff)
  more = func2(stuff)
  out = rbind(output, more)
  return(out)
}
The workers don't seem to have access to func1, func2 etc..
I tried adding the name of the package to .packages in this line, but that doesn't work either.
Any ideas?
As I mentioned, this is only an issue when trying to run the unit tests and I suspect it is somehow related to how the package I am testing is being loaded?
When the workers are started they do not have the full set of packages that a normal session has. Pass the .packages argument the names of all packages that are on the search path of the local session while the tests are running.
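A hedged sketch of that suggestion, reusing the question's loop (the pkgs helper below is an assumption; it simply reads the current search path):
# Collect the packages attached in the current (test) session ...
pkgs <- sub("^package:", "", grep("^package:", search(), value = TRUE))

# ... and pass them to the workers so they load the same set,
# including the package under test.
final_out <- foreach(i = 1:nrow(data),
                     .combine = c,
                     .export = c("func1", "func2", "params"),
                     .packages = pkgs) %dopar% {
  output = func1(stuff)
  more = func2(stuff)
  rbind(output, more)
}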

How to export many variables and functions from global environment to foreach loop?

How can I export the global environment for the beginning of each parallel simulation in foreach? The following code is part of a function that is called to run the simulations.
num.cores <- detectCores()-1
cluztrr <- makeCluster(num.cores)
registerDoParallel(cl = cluztrr)
sim.result.list <- foreach(r = 1:simulations,
                           .combine = list,
                           .multicombine = TRUE) %dopar% {
  #...tons of calculations using many variables...
  list(vals1,
       vals2,
       vals3)
}
stopCluster(cluztrr)
Is it necessary to use .export with a character vector of every variable and function that I use? Would that be slow in execution time?
If the foreach loop is called from the global environment, variables should be exported automatically. If not, you can use .export = ls(globalenv()) (or ls(.GlobalEnv)).
For functions from other packages, you just need to use the syntax package::function.
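Applied to the question's loop, a rough sketch (note that exporting the whole global environment can be wasteful if it holds large objects):
sim.result.list <- foreach(r = 1:simulations,
                           .combine = list,
                           .multicombine = TRUE,
                           .export = ls(globalenv())) %dopar% {
  # ...calculations; functions from other packages can be called
  # directly with the package::function syntax, e.g. stats::rnorm()...
  list(vals1, vals2, vals3)
}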
The "If [...] in the global environment, ..." part of F. Privé reply is very important here. The foreach framework will only identify global variables in that case. It will not do so if the foreach() call is done within a function.
However, if you use the doFuture backend (disclaimer: I'm the author);
library("doFuture")
registerDoFuture()
plan(cluster, workers = cl)
then the global variables that are needed will be automatically identified and exported (this is done by the future framework, not the foreach framework). Now, if you rely on this and don't explicitly specify .export, your code will only work with doFuture and with none of the other backends. That's a decision you need to make as a developer.
Also, automatic exporting of globals is neat, but make sure you know how much is exported; exporting too many large objects can be quite costly and introduce a lot of overhead in your parallel code.
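A small illustration of that behaviour with hypothetical objects, assuming doFuture has been registered as shown above:
x <- 3.14
scale_by_x <- function(z) z * x   # uses the global 'x'

# No .export needed: the future framework detects that the loop body calls
# scale_by_x(), which in turn needs 'x', and ships both to the workers.
y <- foreach(i = 1:4, .combine = c) %dopar% scale_by_x(i)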

R Parallel computing: select which objects to be distributed into cores

I have a question related to parallel computing in R. I am using something like:
cl.tmp <- makeCluster(10, type = "SOCK")
registerDoParallel(cl.tmp)
AA <- foreach(i = 1:48, .inorder = TRUE, .combine = rbind, .verbose = TRUE) %dopar% {
  # A function that depends on some selected objects in the current environment
}
stopCluster(cl.tmp)
How can I specify which particular objects in the current environment should be distributed to the worker cores for use by the function? I do not want R to copy the whole environment to the different cores, only some selected objects. In my project I have big R objects that I do not need to copy/distribute to the cores, and I want to avoid RAM problems. Is there a solution for that?
Thanks
Take a look at this: reading global variables using foreach in R
Only variables referenced inside the foreach loop are copied from the global environment.
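A hedged illustration with hypothetical objects ('big_obj' is never referenced in the loop body, so it is not shipped to the workers; .noexport makes that intent explicit):
big_obj   <- matrix(rnorm(1e6), ncol = 100)  # stays on the master
small_vec <- rnorm(48)                       # referenced below, so it is exported

AA <- foreach(i = 1:48, .inorder = TRUE, .combine = rbind,
              .noexport = "big_obj") %dopar% {
  small_vec[i] * 2
}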

Error when using %dopar% instead of %do% in R (package doParallel)

I've come up with a strange error.
Suppose I have 10 xts objects in a list called data. I now build every combination of three of them using
data_names <- names(data)
combs <- combn(data_names, 3)
My basic goal is to do a PCA on each of those triples.
To speed things up I wanted to use the doParallel package. Here is the snippet, shortened to the point where the error occurs:
list <- foreach(i = 1:ncol(combs)) %dopar% {
  tmp_triple <- combs[, i]
  p1 <- data[tmp_triple[[1]]][[1]]
  p2 <- data[tmp_triple[[2]]][[1]]
  p3 <- data[tmp_triple[[3]]][[1]]
  data.merge <- merge(p1, p2, p3, all = FALSE)
}
Here, the merge function seems to be the problem. The error is
task 1 failed - "cannot coerce class 'c("xts", "zoo")' into a data.frame"
However, when changing %dopar% to a plain serial %do%, everything works as expected.
So far I have not been able to find any solution to this problem, and I'm not even sure what to look for.
A better solution than explicitly loading the libraries within the loop body is to use the .packages argument of the foreach() function:
list <- foreach(i = 1:ncol(combs), .packages = c("xts", "zoo")) %dopar% {
  tmp_triple <- combs[, i]
  p1 <- data[tmp_triple[[1]]][[1]]
  p2 <- data[tmp_triple[[2]]][[1]]
  p3 <- data[tmp_triple[[3]]][[1]]
  data.merge <- merge(p1, p2, p3, all = FALSE)
}
The problem is likely that you haven't called library(xts) on each of the workers. You don't say what backend you're using, so I can't be 100% sure.
If that's the problem, then this code will fix it:
list <- foreach(i = 1:ncol(combs)) %dopar% {
  library(xts)
  tmp_triple <- combs[, i]
  p1 <- data[tmp_triple[[1]]][[1]]
  p2 <- data[tmp_triple[[2]]][[1]]
  p3 <- data[tmp_triple[[3]]][[1]]
  data.merge <- merge(p1, p2, p3, all = FALSE)
}
A quick fix for problems with foreach %dopar% is to reinstall these packages:
install.packages("doSNOW")
install.packages("doParallel")
install.packages("doMPI")
These are responsible for parallelism in R. A bug that existed in old versions of these packages has since been fixed. It worked in my case.

Why doesn't the plyr package use my parallel backend?

I'm trying to use the parallel package in R for parallel operations rather than doSNOW, since it's built-in and ostensibly the way the R Project wants things to go. I'm doing something wrong that I can't pin down, though. Take this for example:
a <- rnorm(50)
b <- rnorm(50)
arr <- matrix(cbind(a, b), nrow = 50)
aaply(arr, .margins = 1, function(x) { x[1] + x[2] }, .parallel = FALSE)
This works just fine, producing the sums of my two columns. But if I try to bring in the parallel package:
library(parallel)
nodes <- detectCores()
cl <- makeCluster(nodes)
setDefaultCluster(cl)
aaply(arr, .margins = 1, function(x) { x[1] + x[2] }, .parallel = TRUE)
It produces these warnings:
2: In setup_parallel() : No parallel backend registered
3: executing %dopar% sequentially: no parallel backend registered
Am I initializing the backend wrong?
Try this setup:
library(doParallel)
library(plyr)
nodes <- detectCores()
cl <- makeCluster(nodes)
registerDoParallel(cl)
aaply(ozone, 1, mean, .parallel = TRUE)
stopCluster(cl)
Since I have never used plyr for parallel computing, I have no idea why this issues warnings; the result is correct anyway.
The documentation for aaply states
.parallel: if 'TRUE', apply function in parallel, using parallel backend provided by foreach
so presumably you need to register a foreach backend rather than use the parallel package on its own.
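Putting the two answers together, a minimal sketch for the original arr example (the cluster itself can still come from parallel; the key step is registering it as a foreach backend, which setDefaultCluster() does not do):
library(parallel)
library(doParallel)
library(plyr)

cl <- makeCluster(detectCores())
registerDoParallel(cl)   # registers the cluster with foreach

aaply(arr, .margins = 1, function(x) x[1] + x[2], .parallel = TRUE)

stopCluster(cl)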
