Passing an entire package to a snow cluster - r

I'm trying to parallelize (using snow::parLapply) some code that depends on a package (ie, a package other than snow). Objects referenced in the function called by parLapply must be explicitly passed to the cluster using clusterExport. Is there any way to pass an entire package to the cluster rather than having to explicitly name every function (including a package's internal functions called by user functions!) in clusterExport?

Install the package on all nodes, and have your code call library(thePackageYouUse) on all nodes via one of the available commands, e.g. something like
clusterEvalQ(cl, library(thePackageYouUse))
I think the parallel package, which comes with recent R releases, has examples -- see for example this snippet from help(clusterApply), where the boot package is loaded everywhere:
## A bootstrapping example, which can be done in many ways:
clusterEvalQ(cl, {
  ## set up each worker. Could also use clusterExport()
  library(boot)
  cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
  cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
  NULL
})
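For completeness, here is a minimal sketch of the whole pattern (thePackageYouUse and someFunctionFromIt are placeholders, not names from the question): install the package on every node, attach it on the workers once, and then parLapply can call any of its functions, internal helpers included, without clusterExport.
library(parallel)   # snow's parLapply interface is also available via the parallel package
cl <- makeCluster(4)
clusterEvalQ(cl, library(thePackageYouUse))   # attach the whole package on every worker
res <- parLapply(cl, 1:10, function(i) someFunctionFromIt(i))
stopCluster(cl)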

Related

Is there a way to perform actions every time I call library(package) in R?

I've already learned that including a script "zzz.R" in my R package with a .onAttach function lets me perform actions every time the package is loaded for the first time. The package name is utils_department.
The thing is: if I do not detach the package, calling library(utils_department) a second time does not trigger those actions again.
Here are my functions:
load_useful_objects <- function(){
  ex_number <<- 2
  ex_string <<- "test"
}

.onAttach <- function(libname, pkgname){
  packageStartupMessage("Welcome to the package utils_department! Please report any bugs or improvement ideas to the package maintainer.\n")
  packageStartupMessage("Loading useful objects into the global environment...")
  load_useful_objects()
}
My goal is to run a function that loads useful objects into the global environment every time library(utils_department) is run, performing the same actions even if the package is already loaded (without having to detach it).
Is it possible?
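For reference, a minimal sketch of the behaviour described above (nothing here beyond what the question states): .onAttach only fires when the package is attached, so a second library() call is a no-op unless the package is detached first, which is exactly the extra step the question wants to avoid.
library(utils_department)                          # first call: .onAttach runs, objects are loaded
library(utils_department)                          # already attached: nothing happens
detach("package:utils_department", unload = TRUE)  # the step the question wants to avoid
library(utils_department)                          # .onAttach runs again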

Dynamic library dependencies not recognized when run in parallel under R foreach()

I'm using the Rfast package, which imports the package RcppZiggurat. I'm running R 3.6.3 on a Linux cluster (Red Hat 6.1). The packages are installed on my local directory but R is installed system-wide.
The Rfast functions (e.g. colsums()) work well when I call them directly. But when I call them in a foreach() loop like the following (EDIT: I added the code to register the cluster, as pointed out by Rui Barradas, but it didn't fix the problem),
library(Rfast)
library(doParallel)
library(foreach)
cores <- detectCores()
cl <- makeCluster(cores)
registerDoParallel(cl)
A <- matrix(rnorm(1e6), 1000, 1000)
cm <- foreach(n = 1:4, .packages = 'Rfast') %dopar% colmeans(A)
stopCluster(cl)
then I get an error:
unable to load shared object '/home/users/sutd/R/x86_64-pc-linux-gnu-library/3.6/RcppZiggurat/libs/RcppZiggurat.so':
libgsl.so.0: cannot open shared object file: No such file or directory
Somehow, the dynamic library is recognized when called directly but not when called under foreach().
I know that libgsl.so is located in /usr/lib64/, so I've added the following line at the beginning of my R script
Sys.setenv(LD_LIBRARY_PATH=paste("/usr/lib64/", Sys.getenv("LD_LIBRARY_PATH"), sep = ":"))
But it didn't work.
I have also tried to do dyn.load('/usr/lib64/libgsl.so') but I get the following error:
Error in dyn.load("/usr/lib64/libgsl.so") : unable to load shared object '/usr/lib64/libgsl.so':
/usr/lib64/libgsl.so: undefined symbol: cblas_ctrmv
How do I make the dependencies available in the foreach() parallel loops?
NOTE
In the actual use case I am using the genetic algorithm package GA, and have GA::ga() which handles the foreach() loop, and within the loop I use a function in my own package which calls the Rfast functions. So I'm hoping that there is a solution where I don't have to modify the foreach() call.
The following works with no problems. Unlike the code in the question, it starts by detecting the number of available cores, creating a cluster, and making it available to foreach.
library(Rfast)
library(doParallel)
library(foreach)
cores <- detectCores()
cl <- makeCluster(cores)
registerDoParallel(cl)
set.seed(2020)
A <- matrix(rnorm(1e6), 1000, 1000)
cm <- foreach(n = 1:4,
              .combine = rbind,
              .packages = "Rfast") %dopar% {
  colmeans(A)
}
stopCluster(cl)
str(cm)
#num [1:4, 1:1000] -0.02668 -0.02668 -0.02668 -0.02668 0.00172 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:4] "result.1" "result.2" "result.3" "result.4"
# ..$ : NULL
The foreach package was great for its time. But, now, parallel computations should be done with future, if only for the static code analysis that handles exporting the correct objects to the workers. As a result, under the future approach, registering a package with .packages= isn't needed. Moreover, code using future mirrors ordinary R code, with only a slight change: the output variable is set up as a listenv. For example, we have:
library("future")
library("listenv")
library("Rfast")
plan(tweak(multiprocess, workers = 2L))
# For all cores, directly use:
# plan(multiprocess)
# Generate matrix once
A <- matrix(rnorm(1e6), 1000, 1000)
# Setup output
x <- listenv()
# Iterate 4 times
for(i in 1:4) {
  # On each core, compute the colmeans()
  x[[i]] %<-% {
    colmeans(A)
    # For better control over function applies, use a namespace call
    # e.g. Rfast::colmeans(A)
  }
}
# Switch from listenv to list
output <- as.list(x)
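As a quick check (not part of the original answer), the list of per-iteration results can be bound into a matrix, mirroring the .combine = rbind output of the foreach version above:
# Bind the four length-1000 result vectors into a 4 x 1000 matrix
cm_future <- do.call(rbind, output)
str(cm_future)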
Thanks to the answers by @RuiBarradas and @coatless, I realized that the problem is not with foreach(), because (1) the problem occurred when I ran the code with future too, and (2) it occurred with the foreach() code even with the wrong call, when I didn't register the cluster. When there is no cluster registered, foreach() throws a warning and runs in sequential mode instead, but that is not what happened.
Therefore, I realized that the problem must have occurred even before the foreach() call. In the log, it appeared right after the message Loading package RcppZiggurat. Something must have gone wrong when this package was loaded.
I then checked the dependencies of RcppZiggurat, and found that it depends on another package called RcppGSL, which interfaces R and the GSL library. Bingo, that's where libgsl.so.0 is needed when RcppZiggurat is called.
So I made an R script named test-gsl.R, which has the following two lines.
library(RcppZiggurat)
print('OK')
Now, I run the following on the head node
$ module load R/3.6.3
$ Rscript test-gsl.R
And everything works fine. The ‘OK’ is printed.
But this doesn’t work if I submit the job on the compute node. First, the PBS script, called test.sh, is as follows
### Resources request
#PBS -l select=1:ncpus=1:mem=1GB
### Walltime
#PBS -l walltime=00:01:00
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
### Run R
module load R/3.6.3
Rscript test-gsl.R
Then I ran
qsub test.sh
And the error popped out. This means that there is something different between the compute node and the head node on my system, and it has nothing to do with the R packages. I contacted the system administrator, who explained to me that the GSL library is available on the head node at the default path, but not on the compute node. So in my shell script, I need to add module load gsl/2.1 before running my R script. I tested that and everything worked.
The solution seems simple enough, but I knew too little about Linux administration to realize it on my own. Only after asking around and trying (rather blindly) many things did I finally come to this solution. So thanks to those who've offered help, and mea culpa for not being able to describe the problem accurately at the beginning.

R testthat and devtools: why does a minimal unit test break my package?

I'm working on an R package for sparse matrix handling. It kinda works; here's a minimal example to set the stage for my question.
devtools::install_github("ekernf01/MatrixLazyEval", ref = "eef5593ad")
library(Matrix)
library(MatrixLazyEval)
data(CAex)
M = rbind(CAex, CAex)
M = matrix(stats::rnorm(prod(dim(M))), nrow = nrow(M))
M_lazy = AsLazyMatrix( M )
svd_lazy = RandomSVDLazyMatrix(M_lazy)
But, when I run even a minimal unit test, it breaks the package permanently (I have to restart my R session or reinstall the package). The immediate cause is that R can't find some S4 methods from packages I depend on (e.g. for matrix transpose t or colSums from the Matrix package). I run the unit test like this:
devtools::test(filter = "minimal")
svd_lazy = RandomSVDLazyMatrix(M_lazy)
Here's the contents of the test files.
> cat tests/testthat.R
library(testthat)
testthat::test_check("MatrixLazyEval")
> cat tests/testthat/testthat_minimal.R
context("minimal")
Why does this happen? Maybe this is naive, but the unit test shouldn't even do anything.
Edit
Possibly related:
r - data.table and testthat package
https://github.com/r-lib/devtools/issues/192
R data.table breaks in exported functions
You need to import all the generics you're using in your package namespace:
#' @importFrom Matrix t tcrossprod colSums rowMeans
NULL
This will fix the issue that you're observing, and you'll be able to run the tests multiple times in the same session.
Also, this will allow other packages that import Matrix::t to consistently use your custom methods. Currently, since you're calling setMethod() in your package, you're creating a new t() generic local to your namespace whenever Matrix is not attached to the search path at load time (this is why it worked the first time you ran the tests). This prevents other packages using Matrix::t() from accessing your methods. Importing Matrix::t() explicitly will fix this because you'll never create a local generic for t().
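For context, a sketch of what this looks like in practice with roxygen2 (the exact list of generics should match what your package actually uses): regenerating the documentation turns the @importFrom tag into importFrom() directives in NAMESPACE.
# After adding the roxygen block above, regenerate NAMESPACE:
devtools::document()
# which should produce directives along these lines in NAMESPACE:
#   importFrom(Matrix, colSums)
#   importFrom(Matrix, rowMeans)
#   importFrom(Matrix, t)
#   importFrom(Matrix, tcrossprod)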

How can I evaluate a C function from a dynamic library in an R package?

I’m trying to implement parallel computing in an R package that calls C from R with the .C function. It seems that the nodes of the cluster can’t access the dynamic library. I have made a parallel socket cluster, like this:
cl <- makeCluster(2)
I would like to evaluate a C function called valgrad from my R package on each of the nodes in my cluster using clusterEvalQ, from the R package parallel. However, my code is producing an error. I compile my package, but when I run
out <- clusterEvalQ(cl, cresults <- .C(C_valgrad, …))
where … represents the arguments in the C function valgrad. I get this error:
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
2 nodes produced errors; first error: object 'C_valgrad' not found
I suspect there is a problem with clusterEvalQ’s ability to access the dynamic library. I attempted to fix this problem by loading the glmm package into the cluster using
clusterEvalQ(cl, library(glmm))
but that did not fix the problem.
I can evaluate valgrad on each of the clusters using the foreach function from the foreach R package, like this:
out <- foreach(1:no_cores) %dopar% {.C(C_valgrad, …)}
no_cores is the number of nodes in my cluster. However, this function doesn’t allow any of the results of the evaluation of valgrad to be accessed in any subsequent calculation on the cluster.
How can I either
(1) make the results of the evaluation of valgrad accessible for later calculations on the cluster or
(2) use clusterEvalQ to evaluate valgrad?
You have to load the external library. But this is not done with library calls, it's done with dyn.load.
The following two functions are useful if you work with more than one operating system; they use the built-in variable .Platform$dynlib.ext.
Note also the unload function. You will need it if you develop a library of C functions: if you change a C function, the dynamic library has to be unloaded and then (the new version) reloaded before testing.
See Writing R Extensions, file R-exts.pdf in the doc folder, section 5 or on CRAN.
dynLoad <- function(dynlib){
  dynlib <- paste(dynlib, .Platform$dynlib.ext, sep = "")
  dyn.load(dynlib)
}

dynUnload <- function(dynlib){
  dynlib <- paste(dynlib, .Platform$dynlib.ext, sep = "")
  dyn.unload(dynlib)
}
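Tying this back to the question, a hedged sketch of how the helper might be used to make the compiled code available on each worker before the .C() calls (the path "glmm/src/glmm" is purely illustrative; point it at your package's actual compiled object, without the extension):
library(parallel)
cl <- makeCluster(2)
clusterExport(cl, "dynLoad")                 # ship the helper to the workers
clusterEvalQ(cl, dynLoad("glmm/src/glmm"))   # illustrative path; dynLoad appends the platform extension
# ... the .C() calls from the question can now run on the workers ...
stopCluster(cl)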

Search all existing functions for package dependencies?

I have a package that I wrote while learning R and its dependency list is quite long. I'm trying to trim it down, for two cases:
I switched to other approaches, and packages listed in Suggests simply aren't used at all.
Only one function out of my whole package relies on a given dependency, and I'd like to switch to an approach where it is loaded only when needed.
Is there an automated way to track down these two cases? I can think of two crude approaches (download the list of functions in all the dependent packages and automate a text search for them through my package's code, or load the package functions without loading the required packages and execute until there's an error), but neither seems particularly elegant or foolproof....
One way to check dependencies in all functions is to use the byte compiler, because it will check whether the functions being called are available in the global workspace and issue a note if it does not find a given function.
So if you as an example use the na.locf function from the zoo package in any of your functions and then byte compile your function you will get a message like this:
Note: no visible global function definition for 'na.locf'
To correctly address it for byte compiling you would have to write it as zoo::na.locf
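As a minimal illustration of that note (a sketch; whether the note is printed can depend on the compiler's note-suppression options):
library(compiler)
f <- function(x) na.locf(x)   # calls zoo's na.locf() without the zoo:: prefix
f_compiled <- cmpfun(f)
# Expected compile note, as described above:
# Note: no visible global function definition for 'na.locf'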
So, as a quick way to test all R functions in a library/package (assuming you didn't write the calls to other functions with the namespace prefix), you could do something like this:
Assuming your R files with the functions are in C:\SomeLibrary\ or subfolders thereof, define a sourcing file as C:\SomeLibrary.r or similar containing:
if (!(as.numeric(R.Version()$major) >= 2 && as.numeric(R.Version()$minor) >= 14.0)) {
  stop("SomeLibrary needs version 2.14.0 or greater.")
}

if ("SomeLibrary" %in% search()) {
  detach("SomeLibrary")
}

currentlyInWorkspace <- ls()

SomeLibrary <- new.env(parent = globalenv())

require("compiler", quietly = TRUE)

pathToLoad <- "C:/SomeLibraryFiles"

filesToSource <- file.path(pathToLoad, dir(pathToLoad, recursive = TRUE)[grepl(".*[\\.R|\\.r].*", dir(pathToLoad, recursive = TRUE))])

for (filename in filesToSource) {
  tryCatch({
    suppressWarnings(sys.source(filename, envir = SomeLibrary))
  }, error = function(ex) {
    cat("Failed to source: ", filename, "\n")
    print(ex)
  })
}

for (SomeLibraryFunction in ls(SomeLibrary)) {
  if (class(get(SomeLibraryFunction, envir = SomeLibrary)) == "function") {
    outText <- capture.output(with(SomeLibrary, assign(SomeLibraryFunction, cmpfun(get(SomeLibraryFunction)))))
    if (length(outText) > 0) {
      cat("The function ", SomeLibraryFunction, " produced the following compile note(s):\n")
      cat(outText, sep = "\n")
      cat("\n")
    }
  }
}

attach(SomeLibrary)
rm(list = ls()[!ls() %in% currentlyInWorkspace])
invisible(gc(verbose = FALSE, reset = TRUE))
Then start up R with no preloaded packages and source in C:\SomeLibrary.r
And then you should get notes from cmpfun for any call to a function in a package that's not part of the base packages and doesn't have a fully qualified namespace defined.
