Is it possible to use the package R.cache together with the package parallel?
I'm doing some computations that are very time consuming, and I would like to use a cache while also going parallel.
The parallel jobs are independent of each other. Yet I cannot load the R.cache package on the cluster workers.
library(parallel)
library(R.cache)
cl <- makeCluster(getOption("cl.cores", 2))
clusterExport(cl,varlist = ls())
clusterEvalQ(cl, library(R.cache))
## Error in checkForRemoteErrors(lapply(cl, recvResult)) :
## 2 nodes produced errors; first error: there is no package called ‘R.cache’
Nothing particular about R.cache here. You need to make sure R.cache is installed on each of the compute nodes.
And yes, R.cache works and is designed to work on concurrent systems. See also https://github.com/HenrikBengtsson/R.cache/issues/18
(I'm the author of R.cache).
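A quick way to check this from the master session is to ask each worker whether R.cache is available in its library paths (a minimal sketch; the installation path in the comment is illustrative):
library(parallel)
cl <- makeCluster(getOption("cl.cores", 2))
# Ask each worker for its library paths and whether R.cache can be found there
clusterEvalQ(cl, list(libPaths = .libPaths(),
                      has_R.cache = requireNamespace("R.cache", quietly = TRUE)))
# If has_R.cache is FALSE on a worker, install the package into a library that
# the workers can see, e.g. a personal library on a shared file system:
# install.packages("R.cache", lib = "~/R/library")  # illustrative path
stopCluster(cl)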
Related
On certain machines, loading packages on all cores eats up all available RAM, resulting in an error 137, and my R session is killed. On my laptop (Mac) and a Linux computer it works fine. On the Linux computer that I want to run this on, a 32-core machine with 32 * 6 GB of RAM, it does not. The sysadmin told me memory is limited on the compute nodes. However, as per my edit below, my memory requirements are not excessive by any stretch of the imagination.
How can I debug this and find out what is different? I am new to the parallel package.
Here is an example (it assumes the command install.packages(c("tidyverse", "OpenMx")) has been run in R under version 4.0.3):
I also note that this seems to happen only with the OpenMx and mixtools packages. I excluded mixtools from the MWE because OpenMx is enough to generate the problem. tidyverse alone works fine.
A workaround I tried was to not load packages on the cluster and just evaluate .libPaths("~/R/x86_64-pc-linux-gnu-library/4.0/") in the body of expr of clusterEvalQ and use namespace calls like OpenMx::vec in my functions (see the sketch after the MWE below), but that produced the same error. So I am stuck, because on two out of three machines it worked fine, just not on the one I am supposed to use (a compute node).
.libPaths("~/R/x86_64-pc-linux-gnu-library/4.0/")
library(parallel)
num_cores <- detectCores()
cat("Number of cores found:")
print(num_cores)
working_mice <- makeCluster(num_cores)
clusterExport(working_mice, ls())
clusterEvalQ(working_mice, expr = {
  library("OpenMx")
  library("tidyverse")
})
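For completeness, the workaround mentioned above looked roughly like this (a sketch; OpenMx::vec stands in for the functions I actually call):
# Only set the library path on each worker, without attaching the packages ...
clusterEvalQ(working_mice, expr = {
  .libPaths("~/R/x86_64-pc-linux-gnu-library/4.0/")
})
# ... and use fully qualified calls such as OpenMx::vec() inside the functions
# exported to the workers.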
Simply loading the packages seems to consume all available RAM, resulting in an error 137. That is a problem because I need the libraries loaded on each available core, where their functions will be performing tasks.
Subsequently, I am using DEoptim, but loading the packages was already enough to trigger the error.
Edit
I have profiled the code using profmem and found that the part in the example code asks for about 2 MB of memory, and the whole script I am trying to run for 94.75 MB. I then also checked using my OS (Catalina) and captured the corresponding processes in the screenshot.
None of these numbers strike me as excessive, especially not on a node that has ~6 GB per CPU and 32 cores. Unless I am missing something major here.
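For reference, the profmem call was along these lines (a minimal sketch; the expression inside profmem() stands in for the example code and the full script):
library(profmem)
p <- profmem({
  # expression to profile, e.g. the clusterEvalQ() call from the example above
})
total(p)  # total number of bytes allocated while evaluating the expression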
I want to start by saying I'm not sure what is causing this issue for you. The following example may help you debug how much memory is being used by each child process.
Using mem_used from the pryr package will help you track how much RAM is used by an R session. The following shows the results of doing this on my local computer with 8 cores and 16 GB of RAM.
library(parallel)
library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
num_cores <- detectCores()
cat("Number of cores found:")
#> Number of cores found:
print(num_cores)
#> [1] 8
working_mice <- makeCluster(num_cores)
clusterExport(working_mice, ls())
clust_memory <- clusterEvalQ(working_mice, expr = {
  start <- pryr::mem_used()
  library("OpenMx")
  mid <- pryr::mem_used()
  library("tidyverse")
  end <- pryr::mem_used()
  data.frame(mem_state = factor(c("start", "mid", "end"), levels = c("start", "mid", "end")),
             mem_used = c(start, mid, end), stringsAsFactors = FALSE)
})
to_GB <- function(x) paste(x/1e9, "GB")
tibble(
  clust_indx = seq_len(num_cores),
  mem = clust_memory
) %>%
  unnest(mem) %>%
  ggplot(aes(mem_state, mem_used, group = clust_indx)) +
  geom_line(position = 'stack') +
  scale_y_continuous(labels = to_GB)  # approximately
As you can see, each process uses about the same amount of RAM, roughly 160 MB on my machine. According to pryr::mem_used(), the amount of RAM used per core is the same after each library step.
In whatever environment you are working in, I'd recommend you do this on just 10 workers and see if it is using a reasonable amount of memory.
I also confirmed with htop that each of the child processes is only using about 4.5 GB of virtual memory and roughly a similar amount of RAM.
The only thing I can think of that may be the issue is clusterExport(working_mice, ls()). This would only be an issue if you are not doing this in a fresh R session. For example, if you had 5 GB of data sitting in your global environment, each socket would be getting a copy.
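If that could be the case, a quick check in the master session, before calling clusterExport(), is to measure how much data would be copied (a sketch):
# Rough size of what clusterExport(working_mice, ls()) would copy to each worker
exported <- mget(ls(), envir = globalenv())
format(object.size(exported), units = "MB")             # total size
sort(sapply(exported, object.size), decreasing = TRUE)  # per object, in bytes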
I'm using the Rfast package, which imports the package RcppZiggurat. I'm running R 3.6.3 on a Linux cluster (Red Hat 6.1). The packages are installed in my local directory, but R is installed system-wide.
The Rfast functions (e.g. colsums()) work well when I call them directly. But when I call them in a foreach() loop like the following (EDIT: I added the code to register the cluster, as pointed out by Rui Barradas, but it didn't fix the problem),
library(Rfast)
library(doParallel)
library(foreach)
cores <- detectCores()
cl <- makeCluster(cores)
registerDoParallel(cl)
A <- matrix(rnorm(1e6), 1000, 1000)
cm <- foreach(n = 1:4, .packages = 'Rfast') %dopar% colmeans(A)
stopCluster(cl)
then I get an error:
unable to load shared object '/home/users/sutd/R/x86_64-pc-linux-gnu-library/3.6/RcppZiggurat/libs/RcppZiggurat.so':
libgsl.so.0: cannot open shared object file: No such file or directory
Somehow, the dynamic library is recognized when called directly but not when called under foreach().
I know that libgsl.so is located in /usr/lib64/, so I've added the following line at the beginning of my R script
Sys.setenv(LD_LIBRARY_PATH=paste("/usr/lib64/", Sys.getenv("LD_LIBRARY_PATH"), sep = ":"))
But it didn't work.
I have also tried to do dyn.load('/usr/lib64/libgsl.so') but I get the following error:
Error in dyn.load("/usr/lib64/libgsl.so") : unable to load shared object '/usr/lib64/libgsl.so':
/usr/lib64/libgsl.so: undefined symbol: cblas_ctrmv
How do I make the dependencies available in the foreach() parallel loops?
NOTE
In the actual use case I am using the genetic algorithm package GA; GA::ga() handles the foreach() loop internally, and within the loop I use a function from my own package which calls the Rfast functions. So I'm hoping for a solution where I don't have to modify the foreach() call.
The following works with no problems. Unlike the code in the question, it starts by detecting the number of available cores, creating a cluster, and making it available to foreach.
library(Rfast)
library(doParallel)
library(foreach)
cores <- detectCores()
cl <- makeCluster(cores)
registerDoParallel(cl)
set.seed(2020)
A <- matrix(rnorm(1e6), 1000, 1000)
cm <- foreach(n = 1:4,
              .combine = rbind,
              .packages = "Rfast") %dopar% {
  colmeans(A)
}
stopCluster(cl)
str(cm)
#num [1:4, 1:1000] -0.02668 -0.02668 -0.02668 -0.02668 0.00172 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:4] "result.1" "result.2" "result.3" "result.4"
# ..$ : NULL
The foreach package was great for its time. But parallel computations should now be done with future, if only for its static code analysis, which handles exporting the right objects to the workers. As a result, under the future approach, registering a package with .packages= isn't needed. Moreover, future code mirrors usual R code, with only a slight change regarding the setup of an output variable as a listenv. For example, we have:
library("future")
library("listenv")
library("Rfast")
plan(tweak(multiprocess, workers = 2L))
# For all cores, directly use:
# plan(multiprocess)
# Generate matrix once
A <- matrix(rnorm(1e6), 1000, 1000)
# Setup output
x <- listenv()
# Iterate 4 times
for (i in 1:4) {
  # On each core, compute the colmeans()
  x[[i]] %<-% {
    colmeans(A)
    # For better control over which function is applied, use a namespace call,
    # e.g. Rfast::colmeans(A)
  }
}
# Switch from listenv to list
output <- as.list(x)
Thanks to the answers by @RuiBarradas and @coatless, I realized that the problem is not with foreach(), because (1) the problem also occurred when I ran the code with future, and (2) it occurred with the foreach() code even with the wrong call, when I didn't register the cluster. When no cluster is registered, foreach() would throw a warning and run in sequential mode instead, but that did not even happen.
Therefore, I realized that the problem must have occurred even before the foreach() call. In the log, the error appeared right after the message Loading package RcppZiggurat, so something must have gone wrong when that package was loaded.
I then checked the dependencies of RcppZiggurat and found that it depends on another package called RcppGSL, which interfaces R with the GSL library. Bingo: that's where libgsl.so.0 is needed when RcppZiggurat is loaded.
So I made an R script named test-gsl.R, which has the following two lines.
library(RcppZiggurat)
print('OK')
Now, I run the following on the head node
$ module load R/3.6.3
$ Rscript test-gsl.R
And everything works fine. The 'OK' is printed.
But this doesn’t work if I submit the job on the compute node. First, the PBS script, called test.sh, is as follows
### Resources request
#PBS -l select=1:ncpus=1:mem=1GB
### Walltime
#PBS -l walltime=00:01:00
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
### Run R
module load R/3.6.3
Rscript test-gsl.R
Then I ran
qsub test.sh
And the error popped out. This means that there is something different between the compute node and the head node on my system, and nothing to do with the packages. I contacted the system administrator, who explained to me that the GSL library is available on the head node at the default path, but not on the compute node. So in my shell script, I need to add module load gsl/2.1 before running my R script. I tested that and everything worked.
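For reference, the amended test.sh is shown below (the module name gsl/2.1 comes from my system; yours may differ):
### Resources request
#PBS -l select=1:ncpus=1:mem=1GB
### Walltime
#PBS -l walltime=00:01:00
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
### Load GSL first so that RcppGSL/RcppZiggurat can find libgsl when R loads them
module load gsl/2.1
### Run R
module load R/3.6.3
Rscript test-gsl.R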
The solution seems simple enough, but I know very little about Linux administration to realize it. Only after asking around and trying (rather blindly) many things did I finally come to this solution. So thanks to those who've offered help, and mea culpa for not being able to describe the problem accurately at the beginning.
I’m trying to implement parallel computing in an R package that calls C from R with the .C function. It seems that the nodes of the cluster can’t access the dynamic library. I have made a parallel socket cluster, like this:
cl <- makeCluster(2)
I would like to evaluate a C function called valgrad from my R package on each of the nodes in my cluster using clusterEvalQ, from the R package parallel. However, my code is producing an error. I compile my package, but when I run
out <- clusterEvalQ(cl, cresults <- .C(C_valgrad, …))
where … represents the arguments to the C function valgrad, I get this error:
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
2 nodes produced errors; first error: object 'C_valgrad' not found
I suspect there is a problem with clusterEvalQ’s ability to access the dynamic library. I attempted to fix this problem by loading the glmm package into the cluster using
clusterEvalQ(cl, library(glmm))
but that did not fix the problem.
I can evaluate valgrad on each of the clusters using the foreach function from the foreach R package, like this:
out <- foreach(1:no_cores) %dopar% {.C(C_valgrad, …)}
no_cores is the number of nodes in my cluster. However, this function doesn’t allow any of the results of the evaluation of valgrad to be accessed in any subsequent calculation on the cluster.
How can I either
(1) make the results of the evaluation of valgrad accessible for later calculations on the cluster or
(2) use clusterEvalQ to evaluate valgrad?
You have to load the shared library, but this is not done with library() calls; it's done with dyn.load().
The following two functions are useful if you work with more than one operating system; they use the built-in variable .Platform$dynlib.ext.
Note also the unload function. You will need it if you develop a library of C functions: if you change a C function, the dynamic library has to be unloaded and then (the new version) reloaded before testing it.
See Writing R Extensions (file R-exts.pdf in the doc folder, or on CRAN), section 5.
dynLoad <- function(dynlib) {
  dynlib <- paste(dynlib, .Platform$dynlib.ext, sep = "")
  dyn.load(dynlib)
}

dynUnload <- function(dynlib) {
  dynlib <- paste(dynlib, .Platform$dynlib.ext, sep = "")
  dyn.unload(dynlib)
}
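For the cluster case, the shared object then has to be loaded on every worker, for example (a sketch that reuses the dynLoad() helper above; the path to the glmm shared object is hypothetical and depends on where the package is installed on your system):
library(parallel)
cl <- makeCluster(2)
# Make the helper available on the workers and load the package's compiled
# code on each of them; adjust the path to your installation.
clusterExport(cl, "dynLoad")
clusterEvalQ(cl, dynLoad("~/R/x86_64-pc-linux-gnu-library/3.6/glmm/libs/glmm"))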
I have created parallel workers (all running on the same machine) using:
MyCluster = makeCluster(8)
How can I make each of these 8 nodes source an R file I wrote?
I tried:
clusterCall(MyCluster, source, "myFile.R")
clusterCall(MyCluster, 'source("myFile.R")')
And several similar versions. But none worked.
Can you please help me to find the mistake?
Thank you very much!
The following code serves your purpose:
library(parallel)
cl <- makeCluster(4)
clusterCall(cl, function() { source("test.R") })
## do some parallel work
stopCluster(cl)
Also you can use clusterEvalQ() to do the same thing:
library(parallel)
cl <- makeCluster(4)
clusterEvalQ(cl, source("test.R"))
## do some parallel work
stopCluster(cl)
However, there is a subtle difference between the two methods: clusterCall() runs a function on each node, while clusterEvalQ() evaluates an expression on each node. If you have a variable list of files to source, clusterCall() will be easier to use, since clusterEvalQ(cl, expr) treats expr as an unevaluated expression, so it is not convenient to put a variable there.
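For example, with a variable list of files (a sketch; the file names are placeholders):
files_to_source <- c("helpers.R", "model.R")  # placeholder file names
for (f in files_to_source) {
  clusterCall(cl, function(file) source(file), f)
}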
If you use a command to source a local file, ensure the file is there on every node.
Otherwise, place the file on a network share or NFS and source it by its absolute path.
Better still, and the standard answer: write a package, have it installed on each node, and then just call library() or require().
I'm trying to parallelize (using snow::parLapply) some code that depends on a package (i.e., a package other than snow). Objects referenced in the function called by parLapply must be explicitly passed to the cluster using clusterExport. Is there any way to pass an entire package to the cluster rather than having to explicitly name every function (including a package's internal functions called by user functions!) in clusterExport?
Install the package on all nodes, and have your code call library(thePackageYouUse) on all nodes via one of the available commands, e.g. something like
clusterEvalQ(cl, library(thePackageYouUse))
I think the parallel package, which comes with recent R releases, has examples -- see for example the following from help(clusterApply), where the boot package is loaded everywhere:
## A bootstrapping example, which can be done in many ways:
clusterEvalQ(cl, {
  ## set up each worker. Could also use clusterExport()
  library(boot)
  cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
  cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
  NULL
})