Running doRedis- Object not found even when it's been exported - r

I'm testing the doRedis package by running a worker one machine and the master/server on another. The code on my master looks like this:
#Register ...
r <- foreach(a=1:numreps, .export(...)) %dopar% {
train <- func1(..)
best <- func2(...)
weights <- func3(...)
return ...
}
In every function, a global variable is accessed, but not modified. I've exported the global variable in the .export portion of the foreach loop, but whenever I run the code, an error occurs stating that the variable was not found. Interestingly, the code works when all my workers on one machine, but crashes when I have an "outside" worker. Any ideas why this error is occurring, and how to correct it?
Thanks!
UPDATE: I have a gist of some code here: https://gist.github.com/liangricha/fbf29094474b67333c3b
UPDATE2: I asked a another to doRedis related question: "Would it be possible allow each worker machine to utilize all of its cores?
#Steve Weston responded: "Starting one redis worker per core will often fully utilize a machine."

This kind of code was a problem for the doParallel, doSNOW, and doMPI packages in the past, but they were improved in the last year or so to handle it better. The problem is that variables are exported to a special "export" environment, not to the global environment. That is preferable in various ways, but it means that the backend has to do more work so that the exported variables are in the scope of the exported functions. It looks like doRedis hasn't been updated to use these improvements.
Here is a simple example that illustrates the problem:
library(doRedis)
registerDoRedis('jobs')
startLocalWorkers(3, 'jobs')
glob <- 6
f1 <- function() {
glob
}
f2 <- function() {
foreach(1:3, .export=c('f1', 'glob')) %dopar% {
f1()
}
}
f2() # fails with the error: "object 'glob' not found"
If the doParallel backend is used, it succeeds:
library(doParallel)
cl <- makePSOCKcluster(3)
registerDoParallel(cl)
f2() # works with doParallel
One workaround is to define the function "f1" inside function "f2":
f2 <- function() {
f1 <- function() {
glob
}
foreach(1:3, .export=c('glob')) %dopar% {
f1()
}
}
f2() # works with doParallel and doRedis
Another solution is to use some mechanism to export the variables to the global environment of each of the workers. With doParallel or doSNOW, you could do that with the clusterExport function, but I'm not sure how to do that with doRedis.
I'll report this issue to the author of the doRedis package and suggest that he update doRedis to handle exported functions like doParallel.

Related

error loading package inside foreach when using testthat

I am trying to a debug an issue with using unit tests with testthat. The code runs fine if run manually, however it seems when running test(), the workers inside the foreach don't seem to have access to the package or functions inside the package I am testing. The code is quite complex so I don't have a great working example, but here is the outline of the structure:
unit test in tests/testthat:
test_that("dataset runs successful", {
expect_snapshot_output(myFunc(dataset, params))
})
MyFunc calls another func, and inside that func, creates workers to run some code:
final_out <- foreach(i = 1:nrow(data),
.combine = c,
.export = c("func1", "func2", "params"),
.packages = c("fields", "dplyr")) %dopar% {
output = func1(stuff)
more = func2(stuff)
out = rbind(output, more)
return (out)
}
The workers don't seem to have access to func1, func2 etc..
I tried adding the name of the package to packages in this line, but it doesn't work either.
Any ideas?
As I mentioned, this is only an issue when trying to run the unit tests and I suspect it is somehow related to how the package I am testing is being loaded?
When the workers are started they do not have the full set of packages a normal session has; pass all package names of the packages in the search path when the tests are running in the the local session to the .packages argument.

Working with environments in the parallel package in R

I have an R API that makes use of 5 different R files that define different metrics that I use. Each of those files has a number of tasks that I run using the parallel package since they all use the same data, but with different groupings. To avoid having to create and close the clusters in each file, I took out those commands and put them into a cluster.R file. So the structure I have is basically:
cluster.R —
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(computeCluster, {
library(‘dplyr’)
source(‘helpers.R’)
})
.Last <- function() {
stopCluster(cl)
}
Metric1.R —
metric1.function <- function(x,y,z) {
dplyr transformations
}
some_date <- date_from_api_input
tasks <- list(job1 = function() {metric1.function(data, grouping1, some_date)},
job2 = function() {metric1.function(data, grouping2, some_date)},
job3 = function() {metric1.function(data, grouping3, some_date)}
)
clusterExport(cl, c('data', 'metric1.function', 'some_date'), envir = environment())
out <- clusterApplyLB(
cl,
tasks,
function(f) f()
)
bind_rows(out)
This API just creates different metrics that then fills a database table that holds them all. So each metric file contains different functions and inputs but output the same columns and groupings.
Metric 2-5 are all the same except the custom function is different for each file and defined at the beginning of each file. The problem I’m having is that all metrics are also ran in parallel and I’m having issues working with the environments. What ends up happening is that the job will say that some_date isn’t found or that metric2.function isn’t found in metric5.R.
I use plumber to expose R and each time it starts, it sources the cluster.R file, starts up the clusters with their initializations, and listens for any requests that come in.
When running in series, it works just fine for testing and everything passes as expected but in production when our server runs all the scripts in parallel, the variables and functions I've exported in the clusterExport function either don't get passed in or are getting mixed up.
Should I be structuring it in a different fashion or am I using the parallel package incorrectly for my purpose?

Parallelization in R: how to "source" on every node?

I have created parallel workers (all running on the same machine) using:
MyCluster = makeCluster(8)
How can I make every of these 8 nodes source an R-file I wrote?
I tried:
clusterCall(MyCluster, source, "myFile.R")
clusterCall(MyCluster, 'source("myFile.R")')
And several similar versions. But none worked.
Can you please help me to find the mistake?
Thank you very much!
The following code serves your purpose:
library(parallel)
cl <- makeCluster(4)
clusterCall(cl, function() { source("test.R") })
## do some parallel work
stopCluster(cl)
Also you can use clusterEvalQ() to do the same thing:
library(parallel)
cl <- makeCluster(4)
clusterEvalQ(cl, source("test.R"))
## do some parallel work
stopCluster(cl)
However, there is subtle difference between the two methods. clusterCall() runs a function on each node while clusterEvalQ() evaluates an expression on each node. If you have a variable list of files to source, clusterCall() will be easier to use since clusterEvalQ(cl,expr) will regard any expr as an expression so it's not convenient to put a variable there.
If you use a command to source a local file, ensure the file is there.
Else place the file on a network share or NFS, and source the absolute path.
Better still, and standard answers, write a package and have that package installed on each node and then just call library() or require().

parallel R execution problem in R

I am using doSMP as a parallel backend in Windows 7, with R 2.12.2. I incur in an error, and would like to understand the likely cause. Here is some sample code to reproduce the error.
require(foreach)
require(doSMP)
require(data.table)
wrk <- startWorkers(workerCount = 2)
registerDoSMP(wrk)
DF = data.table(x=c("b","b","b","a","a"),v=rnorm(5))
setkey(DF,x)
foreach( i=1:2) %dopar% {
DF[J("a"),]
}
The error message is
Error in { : task 1 failed - "could not find function "J""
I've not used doSMP, but I did some digging around and it looks like this post gets at a similar issue.
so it looks like you should be able to do:
foreach( i=1:2, .packages="data.table") %dopar% {
DF[J("a"),]
}
I can't test as I don't have a Windows machine handy.
OK, I asked Revolution computing, and Steve Weller (of RC) replied:
The problem is a R scoping issue. By
default, foreach() will look for
variables defined in it's own
'environment'. Any objects defined
outside of it's scope need to be
explicitly passed to it via the
'.export' argument.
In your case, you will need to modify
your 'foreach()' call to pass in the
objects 'DF' and 'J':
...
foreach(i=1:2, .export=c("DF","J")) %dopar% {
...
I haven't tried either solution yet, but I trust both JD and RC...

could not find function inside foreach loop

I'm trying to use foreach to do multicore computing in R.
A <-function(....) {
foreach(i=1:10) %dopar% {
B()
}
}
then I call function A in the console. The problem is I'm calling a function Posdef inside B that is defined in another script file which I source. I had to put Posdef in the list of export argument of foreach: .export=c("Posdef"). However I get the following error:
Error in { : task 3 failed - "could not find function "Posdef""
Why cant R find this defined function?
So I can reproduce this, for the curious:
require(doSNOW)
registerDoSNOW(makeCluster(5, type="SOCK"))
getDoParWorkers()
getDoParName()
getDoParVersion()
fib <- function(n) {
if (n <= 1) { return(1) }
return(fib(n-1) + fib(n-2))
}
my.matrix <- matrix(runif(2500, 10, 50), nrow=50)
calcLotsaFibs <- function() {
result <- foreach(row.num=1:nrow(my.matrix), .export=c("fib", "my.matrix")) %dopar% {
return(Vectorize(fib)(my.matrix[row.num,]))
}
return(result)
}
lotsa.fibs <- calcLotsaFibs()
I have been able to get around this by putting the function in another file and loading that file in the body of the foreach. You could also obviously move the function definition into the body of the foreach itself.
[EDIT -- I had previously suggested that perhaps .export doesn't work properly with function names, but was corrected below.]
The short answer is that this was a bug in parallel backends such as doSNOW, doParallel and doMPI, but it has since been fixed.
The slightly longer answer is that foreach exports functions to the workers using a special "export" environment, not the global environment. That used to cause problems for functions that were created in the global environment, because the "export" environment wasn't in their scope, even though they were now defined in that same "export" environment. Thus, they couldn't see any other functions or variables defined in the "export" environment, such as "Posdef" in your case.
The doSNOW, doParallel and doMPI backends now change the associated environment from the global to the "export" environment for functions exported via ".export", and seems to have resolved these issues.
Quick fix for problem with foreach %dopar% is to reinstall these packages:
install.packages("doSNOW")
install.packages("doParallel")
install.packages("doMPI")
It worked in my case.

Resources