Parallelization in R: how to "source" on every node?

I have created parallel workers (all running on the same machine) using:
MyCluster = makeCluster(8)
How can I make each of these 8 nodes source an R file I wrote?
I tried:
clusterCall(MyCluster, source, "myFile.R")
clusterCall(MyCluster, 'source("myFile.R")')
and several similar variants, but none of them worked.
Can you please help me find the mistake?
Thank you very much!

The following code serves your purpose:
library(parallel)
cl <- makeCluster(4)
clusterCall(cl, function() { source("test.R") })
## do some parallel work
stopCluster(cl)
You can also use clusterEvalQ() to do the same thing:
library(parallel)
cl <- makeCluster(4)
clusterEvalQ(cl, source("test.R"))
## do some parallel work
stopCluster(cl)
However, there is a subtle difference between the two methods: clusterCall() runs a function on each node, while clusterEvalQ() evaluates an expression on each node. If you have a variable list of files to source, clusterCall() will be easier to use, since clusterEvalQ(cl, expr) treats expr as an unevaluated expression, so it is not convenient to put a variable there.
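For example, with a variable set of files (a minimal sketch; the file names are placeholders and the files are assumed to exist on every node):
files <- c("helpers.R", "models.R")
for (f in files) {
  clusterCall(cl, source, f)  # the current file name is passed to source() on each node
}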

If you source a local file, make sure the file exists on every node.
Otherwise, place the file on a network share or NFS and source it by its absolute path.
Better still, and the standard answer: write a package, install it on each node, and then just call library() or require().
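A minimal sketch of that package-based approach, assuming your functions live in an installed package called myPackage (a hypothetical name):
library(parallel)
cl <- makeCluster(8)
clusterEvalQ(cl, library(myPackage))  # load the installed package on every worker
## do some parallel work
stopCluster(cl)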

Related

Dynamic library dependencies not recognized when run in parallel under R foreach()

I'm using the Rfast package, which imports the package RcppZiggurat. I'm running R 3.6.3 on a Linux cluster (Red Hat 6.1). The packages are installed in my local directory, but R is installed system-wide.
The Rfast functions (e.g. colsums()) work well when I call them directly. But when I call them in a foreach() loop like the following (EDIT: I added the code to register the cluster, as pointed out by Rui Barradas, but it didn't fix the problem),
library(Rfast)
library(doParallel)
library(foreach)
cores <- detectCores()
cl <- makeCluster(cores)
registerDoParallel(cl)
A <- matrix(rnorm(1e6), 1000, 1000)
cm <- foreach(n = 1:4, .packages = 'Rfast') %dopar% colmeans(A)
stopCluster(cl)
then I get an error:
unable to load shared object '/home/users/sutd/R/x86_64-pc-linux-gnu-library/3.6/RcppZiggurat/libs/RcppZiggurat.so':
libgsl.so.0: cannot open shared object file: No such file or directory
Somehow, the dynamic library is recognized when called directly but not when called under foreach().
I know that libgsl.so is located in /usr/lib64/, so I've added the following line at the beginning of my R script
Sys.setenv(LD_LIBRARY_PATH=paste("/usr/lib64/", Sys.getenv("LD_LIBRARY_PATH"), sep = ":"))
But it didn't work.
I have also tried to do dyn.load('/usr/lib64/libgsl.so') but I get the following error:
Error in dyn.load("/usr/lib64/libgsl.so") : unable to load shared object '/usr/lib64/libgsl.so':
/usr/lib64/libgsl.so: undefined symbol: cblas_ctrmv
How do I make the dependencies available in the foreach() parallel loops?
NOTE
In the actual use case I am using the genetic algorithm package GA, and have GA::ga() which handles the foreach() loop, and within the loop I use a function in my own package which calls the Rfast functions. So I'm hoping that there is a solution where I don't have to modify the foreach() call.
The following works with no problems. Unlike the code in the question, it starts by detecting the number of available cores, creates a cluster, and makes it available to foreach.
library(Rfast)
library(doParallel)
library(foreach)
cores <- detectCores()
cl <- makeCluster(cores)
registerDoParallel(cl)
set.seed(2020)
A <- matrix(rnorm(1e6), 1000, 1000)
cm <- foreach(n = 1:4,
              .combine = rbind,
              .packages = "Rfast") %dopar% {
  colmeans(A)
}
stopCluster(cl)
str(cm)
#num [1:4, 1:1000] -0.02668 -0.02668 -0.02668 -0.02668 0.00172 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:4] "result.1" "result.2" "result.3" "result.4"
# ..$ : NULL
The foreach package was great for its time. But parallel computations should now be done with future, if only for its static code analysis, which takes care of exporting the right objects to the workers. As a result, under the future approach, registering a package with .packages = isn't needed. Moreover, future code mirrors ordinary R code, with only a slight change: the output variable is set up as a listenv. For example, we have:
library("future")
library("listenv")
library("Rfast")
plan(tweak(multiprocess, workers = 2L))
# For all cores, directly use:
# plan(multiprocess)
# Generate matrix once
A <- matrix(rnorm(1e6), 1000, 1000)
# Setup output
x <- listenv()
# Iterate 4 times
for (i in 1:4) {
  # On each core, compute the colmeans()
  x[[i]] %<-% {
    colmeans(A)
    # For better control over function applies, use a namespace call
    # e.g. Rfast::colmeans(A)
  }
}
# Switch from listenv to list
output <- as.list(x)
Thanks to the answers by @RuiBarradas and @coatless, I realized that the problem is not with foreach(), because (1) the problem occurred when I ran the code with future too, and (2) it occurred with the foreach() code even with the wrong call, when I didn't register the cluster. When no cluster is registered, foreach() throws a warning and runs in sequential mode instead, but that didn't happen.
Therefore, I realized that the problem must have occurred even before the foreach() call. In the log, it appeared right after the message Loading package RcppZiggurat. Something must have gone wrong when this package was loaded.
I then checked the dependencies of RcppZiggurat, and found that it depends on another package called RcppGSL, which interfaces R and the GSL library. Bingo, that's where libgsl.so.0 is needed when RcppZiggurat is called.
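A quick way to confirm that dependency chain from an R session is to look at the installed package's DESCRIPTION (a sketch; no internet access needed):
# RcppGSL should show up among these fields, which is where the libgsl link comes from
utils::packageDescription("RcppZiggurat", fields = c("Depends", "Imports", "LinkingTo"))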
So I made an R script named test-gsl.R, which has the following two lines.
library(RcppZiggurat)
print('OK')
Now, I run the following on the head node
$ module load R/3.6.3
$ Rscript test-gsl.R
And everything works fine: 'OK' is printed.
But this doesn't work if I submit the job to the compute node. First, the PBS script, called test.sh, is as follows:
### Resources request
#PBS -l select=1:ncpus=1:mem=1GB
### Walltime
#PBS -l walltime=00:01:00
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
### Run R
module load R/3.6.3
Rscript test-gsl.R
Then I ran
qsub test.sh
And the error popped out. This means that there is something different between the compute node and the head node on my system, and it has nothing to do with the packages. I contacted the system administrator, who explained that the GSL library is available on the head node at the default path, but not on the compute node. So in my shell script, I need to add module load gsl/2.1 before running my R script. I tested that and everything worked.
The solution seems simple enough, but I knew too little about Linux administration to realize it. Only after asking around and trying (rather blindly) many things did I finally come to this solution. So thanks to those who offered help, and mea culpa for not being able to describe the problem accurately at the beginning.

Exposing warnings from nodes using snow

I have an lapply operation that I've parallelised using snow. This works fine except that any warnings generated seem to just get ignored and are hence never shown to the user. Is there a way of exposing warnings on individual nodes so they come through in the main R process?
My best idea at the moment is to have all nodes write their warnings to files, and read those at the end, but there must be a better way!
Here's a reprex:
library(snow)
f <- function(x) {
  warning("mywarning")
  return(NULL)
}
cl <- makeCluster(2, type="SOCK")
lapply(1:2, f) # Gives me warnings, as desired
clusterApply(cl, 1:2, f) # Gives me the same output, faster, but with no warnings
In the end I ended up switching from snow to the future.apply package (in conjunction with parallel). future.apply now has this behaviour by default.
Unfortunately in most cases the messages/warnings don't appear until the whole run has finished, but that's a whole new issue.
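For reference, a minimal sketch of the reprex rewritten with future.apply (warnings raised on the workers are relayed to the main R session once the run has finished):
library(future.apply)
plan(multisession, workers = 2)
f <- function(x) {
  warning("mywarning")
  NULL
}
future_lapply(1:2, f)  # the warnings from both workers are shown here
plan(sequential)       # shut the workers down again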

How can I run multiple independent and unrelated functions in parallel without larger code do-over?

I've been searching around the internet, trying to understand parallel processing.
They all seem to assume that I have some kind of loop operating on, say, every Nth row of a data set, divided among N cores and combined afterwards, and I'm pointed towards a lot of parallelized apply() functions.
(Warning, ugly code below)
My situation, though, is that what I have is of the form
tempJob <- myFunction(filepath, string.arg1, string.arg2)
where the path is a file location, and the string arguments are various ways of sorting my data.
My current workflow is simply amassing a lot of
tempjob1 <- myFunction(args)
tempjob2 <- myFunction(other args)
...
tempjobN <- myFunction(some other args here)
# Make a list of all temporary outputs in the global environment
temp.list <- lapply(ls(pattern = "temp"), get)
# Stack them all
df <- rbindlist(temp.list)
# Remove all variables from workspace matching "temp"
rm(list=ls(pattern="temp"))
These jobs are entirely independent and could in principle be run in 8 separate instances of R (although that would be a bother to manage, I guess). How can I farm the first 8 jobs out to 8 cores so that, whenever a core finishes its job and returns the processed dataset to the global environment, it simply takes whichever job is next in line?
With the future package (I'm the author) you can achieve what you want with a minor modification to your code - use "future" assignments %<-% instead of regular assignments <- for the code you want to run asynchronously.
library("future")
plan(multisession)
tempjob1 %<-% myFunction(args)
tempjob2 %<-% myFunction(other args)
...
tempjobN %<-% myFunction(some other args here)
temp.list <- lapply(ls(pattern = "temp"), get)
EDIT 2022-01-04: plan(multiprocess) -> plan(multisession) since multiprocess is deprecated and will eventually be removed.
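The rest of the original workflow then carries over unchanged; accessing the values with get() blocks until each future has resolved (rbindlist() is from data.table, as in the question):
library(data.table)
temp.list <- lapply(ls(pattern = "temp"), get)  # blocks until every future is resolved
df <- rbindlist(temp.list)
rm(list = ls(pattern = "temp"))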
Unless you are unfortunate enough to be using Windows, you could maybe try with GNU Parallel like this:
parallel Rscript ::: script1.R script2.R JOB86*.R
and that would keep 8 scripts running at a time, if your CPU has 8 cores. You can change it with -j 4 if you just want 4 at a time. The JOB86 part is just random - I made it up.
You can also add switches for a progress bar, for how to handle errors, for adding parameters and distributing jobs across multiple machines.
If you are on a Mac, you can install GNU Parallel with homebrew:
brew install parallel
I think the easiest way is to use one of the parallelized apply functions. Those will do all the fiddly work of separating out the jobs, taking whichever job is next in line, etc.
Put all your arguments into a list:
args <- list(
  list(filePath1, stringArgs11, stringArgs21),
  list(filePath2, stringArgs12, stringArgs22),
  ...
  list(filePath8, stringArgs18, stringArgs28)
)
Then do something like
library(parallel)
cl <- makeCluster(detectCores())
df <- parSapply(cl, args, myFunction)
I'm not sure about parSapply, and I can't check as R isn't working on my machine just now. If that doesn't work, use parLapply and then manipulate the result.
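Along those lines, a minimal sketch of the parLapply() variant; since myFunction takes three separate arguments rather than one list, do.call() unpacks each argument set (myFunction and args are the placeholders from above):
library(parallel)
cl <- makeCluster(detectCores())
clusterExport(cl, "myFunction")  # make the function visible on the workers
# Each element of args holds the three arguments for one job
temp.list <- parLapply(cl, args, function(a) do.call(myFunction, a))
stopCluster(cl)
df <- data.table::rbindlist(temp.list)  # stack the results, as in the original workflow
Note that parLapplyLB() is the load-balancing variant: it hands the next job in line to whichever worker finishes first, which matches the queue behaviour asked for in the question.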

Running doRedis- Object not found even when it's been exported

I'm testing the doRedis package by running a worker one machine and the master/server on another. The code on my master looks like this:
#Register ...
r <- foreach(a = 1:numreps, .export = c(...)) %dopar% {
  train <- func1(...)
  best <- func2(...)
  weights <- func3(...)
  return(...)
}
In every function, a global variable is accessed, but not modified. I've exported the global variable in the .export portion of the foreach loop, but whenever I run the code, I get an error stating that the variable was not found. Interestingly, the code works when all my workers are on one machine, but crashes when I have an "outside" worker. Any ideas why this error is occurring, and how to correct it?
Thanks!
UPDATE: I have a gist of some code here: https://gist.github.com/liangricha/fbf29094474b67333c3b
UPDATE2: I asked another doRedis-related question: "Would it be possible to allow each worker machine to utilize all of its cores?"
@Steve Weston responded: "Starting one redis worker per core will often fully utilize a machine."
This kind of code was a problem for the doParallel, doSNOW, and doMPI packages in the past, but they were improved in the last year or so to handle it better. The problem is that variables are exported to a special "export" environment, not to the global environment. That is preferable in various ways, but it means that the backend has to do more work so that the exported variables are in the scope of the exported functions. It looks like doRedis hasn't been updated to use these improvements.
Here is a simple example that illustrates the problem:
library(doRedis)
registerDoRedis('jobs')
startLocalWorkers(3, 'jobs')
glob <- 6
f1 <- function() {
  glob
}
f2 <- function() {
  foreach(1:3, .export = c('f1', 'glob')) %dopar% {
    f1()
  }
}
f2() # fails with the error: "object 'glob' not found"
If the doParallel backend is used, it succeeds:
library(doParallel)
cl <- makePSOCKcluster(3)
registerDoParallel(cl)
f2() # works with doParallel
One workaround is to define the function "f1" inside function "f2":
f2 <- function() {
  f1 <- function() {
    glob
  }
  foreach(1:3, .export = c('glob')) %dopar% {
    f1()
  }
}
f2() # works with doParallel and doRedis
Another solution is to use some mechanism to export the variables to the global environment of each of the workers. With doParallel or doSNOW, you could do that with the clusterExport function, but I'm not sure how to do that with doRedis.
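For the doParallel case, a minimal sketch of that clusterExport() approach, which copies glob into the global environment of every worker before the loop runs:
library(doParallel)
cl <- makePSOCKcluster(3)
registerDoParallel(cl)
glob <- 6
clusterExport(cl, "glob")  # glob now lives in each worker's global environment
f1 <- function() glob
f2 <- function() {
  foreach(i = 1:3, .export = c("f1")) %dopar% {
    f1()
  }
}
f2()
stopCluster(cl)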
I'll report this issue to the author of the doRedis package and suggest that he update doRedis to handle exported functions like doParallel.

R parallel computing with snowfall - writing to files from separate workers

I am using the snowfall 1.84 package for parallel computing and would like each worker to write data to its own separate file during the computation. Is this possible? If so, how?
I am using the "SOCK" type connection e.g., sfInit( parallel=TRUE, ...,type="SOCK" ) and would like the code to be platform independent (unix/windows).
I know it is possible to use the "slaveOutfile" option in sfInit to define a file to write the log output to. But this is intended for debugging purposes, and all slaves/workers use the same file. I need each worker to have its own output file.
The data I need to write are large data frames, not simple diagnostic messages. These data frames need to be written out by the slaves and cannot be sent back to the master process.
Does anyone know how I can get this done?
Thanks
A simple solution is to use sfClusterApply to execute a function that opens a different file on each of the workers, assigning the resulting file object to a global variable so you can write to it in subsequent parallel operations:
library(snowfall)
nworkers <- 3
sfInit(parallel=TRUE, cpus=nworkers, type='SOCK')
workerinit <- function(datfile) {
  fobj <<- file(datfile, 'w')
  NULL
}
sfClusterApply(sprintf('worker_%02d.dat', seq_len(nworkers)), workerinit)

work <- function(i) {
  write.csv(data.frame(x=1:3, i=i), file=fobj)
  i
}
sfLapply(1:10, work)
sfStop()
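One small refinement worth considering: close each worker's file connection just before the sfStop() call above, for example with
sfClusterEval(close(fobj))  # flush and close the per-worker files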
