Working with environments in the parallel package in R - r

I have an R API that makes use of 5 different R files that define different metrics that I use. Each of those files has a number of tasks that I run using the parallel package since they all use the same data, but with different groupings. To avoid having to create and close the clusters in each file, I took out those commands and put them into a cluster.R file. So the structure I have is basically:
cluster.R —
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(computeCluster, {
library(‘dplyr’)
source(‘helpers.R’)
})
.Last <- function() {
stopCluster(cl)
}
Metric1.R —
metric1.function <- function(x,y,z) {
dplyr transformations
}
some_date <- date_from_api_input
tasks <- list(job1 = function() {metric1.function(data, grouping1, some_date)},
job2 = function() {metric1.function(data, grouping2, some_date)},
job3 = function() {metric1.function(data, grouping3, some_date)}
)
clusterExport(cl, c('data', 'metric1.function', 'some_date'), envir = environment())
out <- clusterApplyLB(
cl,
tasks,
function(f) f()
)
bind_rows(out)
This API just creates different metrics that then fills a database table that holds them all. So each metric file contains different functions and inputs but output the same columns and groupings.
Metric 2-5 are all the same except the custom function is different for each file and defined at the beginning of each file. The problem I’m having is that all metrics are also ran in parallel and I’m having issues working with the environments. What ends up happening is that the job will say that some_date isn’t found or that metric2.function isn’t found in metric5.R.
I use plumber to expose R and each time it starts, it sources the cluster.R file, starts up the clusters with their initializations, and listens for any requests that come in.
When running in series, it works just fine for testing and everything passes as expected but in production when our server runs all the scripts in parallel, the variables and functions I've exported in the clusterExport function either don't get passed in or are getting mixed up.
Should I be structuring it in a different fashion or am I using the parallel package incorrectly for my purpose?

Related

Is there a way to combine Rmpi & mclapply?

I have some R code that applies a function to a list of objects. The function is simple but involves a bootstrapping calculation, which can be easily sped up using mclapply. When run on a single node, everything is fine.
However, I have a cluster and what I've been trying to do is to distribute the application of the function to the list of objects across multiple nodes. To do this I've been using Rmpi (0.6-6).
The code below runs fine
library(Rmpi)
cl <- parallel::makeCluster(10, type='MPI')
parallel::clusterExport(cl, varlist=c('as.matrix'), envir=environment())
descriptor <- parallel::parLapply(1:5, function(am) {
val <- mean(unlist(lapply(1:120, function(x) mean(rnorm(1e7)))))
return(c(val, Rmpi::mpi.universe.size()))
}, cl=cl)
print(do.call(rbind, descriptor))
snow::stopCluster(cl)
However, if I convert the lapply to mclapply and set mc.cores=10, MPI warns that forking will lead to bad things, and the job hangs.
(In all cases jobs are being submitted via SLURM)
Based on the MPI warning, it seems that I should not be using mclapply within Rpmi jobs. Is this a correct assessment?
If so, does anybody have suggestions on how I can parallelize the function that is being run on each node?

How can I run multiple independent and unrelated functions in parallel without larger code do-over?

I've been searching around the internet, trying to understand parallel processing.
What they all seem to assume is that I have some kind of loop function operating on e.g. every Nth row of a data set divided among N cores and combined afterwards, and I'm pointed towards a lot of parallelized apply() functions.
(Warning, ugly code below)
My situation though is that I have is on form
tempJob <- myFunction(filepath, string.arg1, string.arg2)
where the path is a file location, and the string arguments are various ways of sorting my data.
My current workflow is simply amassing a lot of
tempjob1 <- myFunction(args)
tempjob2 <- myFunction(other args)
...
tempjobN <- myFunction(some other args here)
# Make a list of all temporary outputs in the global environment
temp.list <- lapply(ls(pattern = "temp"), get)
# Stack them all
df <- rbindlist(temp.list)
# Remove all variables from workspace matching "temp"
rm(list=ls(pattern="temp"))
These jobs are entirely independent, and could in principle be run in 8 separate instances of R (although that would be a bother to manage I guess). How can I separate the first 8 jobs out to 8 cores, and whenever a core finishes its job and returns a treated dataset to the global environment it'll simply take whichever job is next in line.
With the future package (I'm the author) you can achieve what you want with a minor modification to your code - use "future" assignments %<-% instead of regular assignments <- for the code you want to run asynchronously.
library("future")
plan(multisession)
tempjob1 %<-% myFunction(args)
tempjob2 %<-% myFunction(other args)
...
tempjobN %<-% myFunction(some other args here)
temp.list <- lapply(ls(pattern = "temp"), get)
EDIT 2022-01-04: plan(multiprocess) -> plan(multisession) since multiprocess is deprecated and will eventually be removed.
Unless you are unfortunate enough to be using Windows, you could maybe try with GNU Parallel like this:
parallel Rscript ::: script1.R script2.R JOB86*.R
and that would keep 8 scripts running at a time, if your CPU has 8 cores. You can change it with -j 4 if you just want 4 at a time. The JOB86 part is just random - I made it up.
You can also add switches for a progress bar, for how to handle errors, for adding parameters and distributing jobs across multiple machines.
If you are on a Mac, you can install GNU Parallel with homebrew:
brew install parallel
I think the easiest way is to use one of the parallelized apply functions. Those will do all the fiddly work of separating out the jobs, taking whichever job is next in line, etc.
Put all your arguments into a list:
args <- list(
list(filePath1, stringArgs11, stringArgs21),
list(filePath2, stringArgs12, stringArgs22),
...
list(filePath8, stringArgs18, stringArgs28)
)
Then do something like
library(parallel)
cl <- makeCluster(detectCores())
df <- parSapply(cl, args, myFunction)
I'm not sure about parSapply, and I can't check as R isn't working on my machine just now. If that doesn't work, use parLapply and then manipulate the result.

Using rm(list=ls()) in a parallel environment in R

I am running code in R that runs a function in parallel. The code sets some parameters initially and loads libraries etc, then calls a function (called calibrate) and this runs across several workers using different input parameters on each worker in parallel and returns the result back to the centre. It works, and a number of iterations take place (sometimes more than a 100 over a couple of hours) but crashes after a while, and I suspect that it is a memory resource issue. Hence I want to include an rm type command to reduce memory usage:
At first the function looked like this:
Calibrate <- function() {
rm(list = ls())
gc()
...rest of code calling other functions
}
but this had very little effect. When looking closely and running the code line by line I realised that rm(list=ls()) will do very little inside a function.
So, I thought I would change the code to:
Calibrate <- function() {
ENV <- globalenv()
ll <- ls(envir = ENV)
lf <- lsf.str(envir = ENV)
ll <- ll[ll != lf]
rm(list = ll, envir = ENV)
....rest of code calling other functions
}
This will now get rid of all the variables but not the functions. However, I am worried that this will get rid of all the variables on all the other workers which will still be running. The code runs in parallel but the code does not necessarily run on all the workers at the same speed. So the code is effectively staggered. I only want to remove the variables for an individual worker when the calibrate function is called.
So my question, what should I be doing to clear the variables (rm) for one worker and not the whole system when running in parallel?
Help, really appreciated.

Running doRedis- Object not found even when it's been exported

I'm testing the doRedis package by running a worker one machine and the master/server on another. The code on my master looks like this:
#Register ...
r <- foreach(a=1:numreps, .export(...)) %dopar% {
train <- func1(..)
best <- func2(...)
weights <- func3(...)
return ...
}
In every function, a global variable is accessed, but not modified. I've exported the global variable in the .export portion of the foreach loop, but whenever I run the code, an error occurs stating that the variable was not found. Interestingly, the code works when all my workers on one machine, but crashes when I have an "outside" worker. Any ideas why this error is occurring, and how to correct it?
Thanks!
UPDATE: I have a gist of some code here: https://gist.github.com/liangricha/fbf29094474b67333c3b
UPDATE2: I asked a another to doRedis related question: "Would it be possible allow each worker machine to utilize all of its cores?
#Steve Weston responded: "Starting one redis worker per core will often fully utilize a machine."
This kind of code was a problem for the doParallel, doSNOW, and doMPI packages in the past, but they were improved in the last year or so to handle it better. The problem is that variables are exported to a special "export" environment, not to the global environment. That is preferable in various ways, but it means that the backend has to do more work so that the exported variables are in the scope of the exported functions. It looks like doRedis hasn't been updated to use these improvements.
Here is a simple example that illustrates the problem:
library(doRedis)
registerDoRedis('jobs')
startLocalWorkers(3, 'jobs')
glob <- 6
f1 <- function() {
glob
}
f2 <- function() {
foreach(1:3, .export=c('f1', 'glob')) %dopar% {
f1()
}
}
f2() # fails with the error: "object 'glob' not found"
If the doParallel backend is used, it succeeds:
library(doParallel)
cl <- makePSOCKcluster(3)
registerDoParallel(cl)
f2() # works with doParallel
One workaround is to define the function "f1" inside function "f2":
f2 <- function() {
f1 <- function() {
glob
}
foreach(1:3, .export=c('glob')) %dopar% {
f1()
}
}
f2() # works with doParallel and doRedis
Another solution is to use some mechanism to export the variables to the global environment of each of the workers. With doParallel or doSNOW, you could do that with the clusterExport function, but I'm not sure how to do that with doRedis.
I'll report this issue to the author of the doRedis package and suggest that he update doRedis to handle exported functions like doParallel.

R parallel computing with snowfall - writing to files from separate workers

I am using the snowfall 1.84 package for parallel computing and would like each worker to write data to its own separate file during the computation. Is this possible ? if so how ?
I am using the "SOCK" type connection e.g., sfInit( parallel=TRUE, ...,type="SOCK" ) and would like the code to be platform independent (unix/windows).
I know it is possible to Use the "slaveOutfile" option in sfInit to define a file where to write the log files. But this is intended for debugging purposes and all slaves/workers must use the same file. I need each worker to have its OWN output file !!!
The data i need to write are large dataframes, and NOT simple diagnostic messages. These dataframes need be output by the slaves and could not be sent back to the master process.
Anyone knows how i can get this done?
Thanks
A simple solution is to use sfClusterApply to execute a function that opens a different file on each of the workers, assigning the resulting file object to a global variable so you can write to it in subsequent parallel operations:
library(snowfall)
nworkers <- 3
sfInit(parallel=TRUE, cpus=nworkers, type='SOCK')
workerinit <- function(datfile) {
fobj <<- file(datfile, 'w')
NULL
}
sfClusterApply(sprintf('worker_%02d.dat', seq_len(nworkers)), workerinit)
work <- function(i) {
write.csv(data.frame(x=1:3, i=i), file=fobj)
i
}
sfLapply(1:10, work)
sfStop()

Resources