How to use the parallel package inside another package, using devtools?

When running the following code in an R terminal:
library(parallel)
func <- function(a, b, c) a + b + c
testfun <- function() {
  cl <- makeCluster(detectCores(), outfile = "parlog.txt")
  res <- clusterMap(cl, func, 1:10, 11:20, MoreArgs = list(c = 1))
  print(res)
  stopCluster(cl)
}
testfun()
... it works just fine. However, when I copy the two function definitions into my own package, add a line #' @import parallel, run devtools::load_all("mypackage") in the R terminal and then call testfun(), I get an
Error in unserialize(node$con) (from myfile.r#7) :
error reading from connection
where #7 is the line containing the call to clusterMap.
So the exact same code works on the terminal but not inside a package.
If I take a look into parlog.txt, I see the following:
starting worker pid=7204 on localhost:11725 at 13:17:50.784
starting worker pid=4416 on localhost:11725 at 13:17:51.820
starting worker pid=10540 on localhost:11725 at 13:17:52.836
starting worker pid=9028 on localhost:11725 at 13:17:53.849
Error: (converted from warning) namespace 'mypackage' is not available and has been replaced
by .GlobalEnv when processing object ''
Error: (converted from warning) namespace 'mypackage' is not available and has been replaced
by .GlobalEnv when processing object ''
Error: (converted from warning) namespace 'mypackage' is not available and has been replaced
by .GlobalEnv when processing object ''
Error: (converted from warning) namespace 'mypackage' is not available and has been replaced
by .GlobalEnv when processing object ''
What's the root of this problem and how do I resolve it?
Note that I'm doing this with a completely fresh, bare package (created by devtools::create), so there are no interactions with existing, possibly conflicting code.

While writing the question, I actually found the answer and am going to share it here.
The problem here is the combination of the packages devtools and parallel.
Apparently, for some reason, parallel requires the package mypackage to be installed in a library the workers can find, even if you never load it on the workers explicitly (e.g. via clusterEvalQ(cl, library(mypackage)) or similar)!
I was following the usual devtools workflow, meaning that I was working in dev_mode() the whole time. However, this means my package was installed only in the special dev-mode folders (I do not know exactly how this works internally). These are not searched by the worker processes spawned by parallel, since they are not in dev_mode.
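One way to see what the workers actually have access to is to ask them directly; a hedged diagnostic sketch ("mypackage" being the placeholder name used above):
## Check which library paths the workers search and whether they can find the package
cl <- parallel::makeCluster(2)
parallel::clusterEvalQ(cl, .libPaths())
parallel::clusterEvalQ(cl, requireNamespace("mypackage", quietly = TRUE))
parallel::stopCluster(cl)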
So here is my 'workaround':
## turn off dev mode
dev_mode()
## install the package into a 'real' library
install("mypackage")
library(mypackage)
## ... and now the following works:
mypackage:::testfun()
As Hadley correctly pointed out, another workaround would be to add the line
clusterEvalQ(cl, dev_mode())
right after cluster creation. That way, one can keep working in dev_mode.
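For illustration, here is a sketch of how that might look inside testfun() (assuming devtools itself is installed in a library the workers can see):
testfun <- function() {
  cl <- makeCluster(detectCores(), outfile = "parlog.txt")
  ## make the workers search the dev_mode library as well
  clusterEvalQ(cl, devtools::dev_mode())
  res <- clusterMap(cl, func, 1:10, 11:20, MoreArgs = list(c = 1))
  print(res)
  stopCluster(cl)
}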

Related

Use uninstalled packages

I'm working on a package (and will add more later), making lots of changes and using it as I go, and I'm looking for a way to load the library without having to build/install it each time; the library is written in pure R.
I have the following structure in mind:
Code/
  libs/
    test_pkg1/
    test_pkg2/
  more code.R
The test_pkg* directories contain the packages, already in R's package source layout. The working directory is set inside "Code", and I then add "libs" to the library path in R, but R does not detect the packages.
I'm trying to use the packages in both of these ways:
test_pkg1::func()
test_pkg2::func()
library(test_pkg1)
library(test_pkg2)
For the library() approach I can use devtools::load_all(path), which works great, but I don't want to clutter the environment with other functions, so I need the other way too.
I'm not trying to use both at the same time, but I need both; the most important one for this is the :: method.
I tried the .libPaths() approach from here:
Change R default library path using .libPaths in Rprofile.site fails to work
but I can't get it to work; with that I get the following result:
.libPaths( c( .libPaths(), "./libs") )
test_pkg1::func()
Error in namespaceExport(ns, exports) : undefined exports: func()
Along with this warning message:
In loadNamespace(name) : package ‘test_pkg1’ has no 'package.rds' in Meta/
Note: if I install the package, everything works fine; the NAMESPACE file contains the exported functions. In any case, if there were an error there, I should not be able to install and use the packages at all.
Thx.
Tests from comments:
Try library(pkg, lib.loc)
library("test_pkg1", lib.loc="./libs", verbose=TRUE)
Error in library("test_pkg1", lib.loc = "./libs", verbose = TRUE) :
‘test_pkg1’ is not a valid installed package
Notice that the message is not "there is no package called …": the package is detected, but since it is not built/installed, I can't load it with library(). I tried building the package, but that doesn't work either.
Try devtools::dev_mode(on = NULL, path = "libs")
Sadly, this method seems to have the same behavior as the .libPaths() method.
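For reference, one way to make the :: form work is to actually install the source packages into a real library directory and point .libPaths() at it; a rough sketch (the directory name "rlib" is illustrative, and it assumes the folders under libs/ are valid source packages):
## Install the source packages into a local library, then load from there
dir.create("rlib", showWarnings = FALSE)
install.packages(c("libs/test_pkg1", "libs/test_pkg2"),
                 repos = NULL, type = "source", lib = "rlib")
.libPaths(c("rlib", .libPaths()))
test_pkg1::func()
library(test_pkg2)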

devtools::load_all() side effects: C++ class constructor calling base::system.file fails / returns empty string

Context
I'm developing (for the first time) an R package using Rcpp which implements an interface to another program (maxima). This package defines a C++ class whose constructor needs to retrieve the path to an initialization script that gets installed with the package (the in-package path is inst/extdata/maxima-init.mac). The path to this script is then used as a parameter to spawn a child process that runs the program.
In order to retrieve the path to the installed initialization script I'm calling the R function base::system.file from within the C++ class constructor definition:
...
Environment env("package:base");
Function f = env["system.file"];
fs::path p(Rcpp::as<std::string>(f("extdata", "maxima-init.mac", Named("package") = "rmaxima")));
std::string utilsDir = p.parent_path().string();
...
// spawn child process using path in utilsDir
My R/zzz.R creates an object of that class when the package gets attached:
loadModule("Maxima", TRUE)
.onAttach <- function(libname, pkgname) {
"package:base" %in% search()
maxima <<- new(RMaxima)
}
The Problem
I can install.packages(rmaxima) and library(rmaxima) just fine and the package works as expected.
I now want to increase my development efficiency by using devtools::load_all(), to avoid having to R CMD build rmaxima, install.packages(rmaxima) and library(rmaxima) each time I want to test changes. However, when calling devtools::load_all() (or, similarly, devtools::test(), with the working directory set to the package root), my implementation freezes, because the variable utilsDir is empty and therefore the process launch does not return (I guess it keeps waiting for a valid path). I eventually need to kill the process manually. The same thing happens without setting .onAttach().
Apparently devtools::load_all() does not reproduce R's normal search path. What can I do? Is this the problem, or am I missing something else?
Update
I just came across the following note in the devtools::load_all() R documentation, which could be a pointer in the right direction:
Shim files:
‘load_all’ also inserts shim functions into the imports environment of
the loaded package. It presently adds a replacement version of
‘system.file’ which returns different paths from ‘base::system.file’.
This is needed because installed and uninstalled package sources have
different directory structures. Note that this is not a perfect
replacement for base::system.file.
Also I realized that devtools::load_all() only installs my package temporarily into a directory under /tmp/, but somehow doesn't copy the files from my inst/ directory:
rcst@Velveeta:~$ ls -1R /tmp/RtmpdnvOQg/devtools_install_ee1e82c780/rmaxima/
/tmp/RtmpdnvOQg/devtools_install_ee1e82c780/rmaxima/:
DESCRIPTION
libs
Meta
NAMESPACE
/tmp/RtmpdnvOQg/devtools_install_ee1e82c780/rmaxima/libs:
rmaxima.so
/tmp/RtmpdnvOQg/devtools_install_ee1e82c780/rmaxima/Meta:
features.rds
package.rds
As it turns out devtools provides a solution to exactly this problem.
In short: calling system.file unqualified (i.e. from the global environment, with the devtools package attached) solves the issue. Specifically, the modification:
// Environment env("package:base");
// Function f = env["system.file"];
Function f("system.file");
fs::path p(Rcpp::as<std::string>(f("extdata", "maxima-init.mac", Named("package") = "rmaxima")));
std::string utilsDir = p.parent_path().string();
Explanation
base::system.file(..., mustWork = FALSE) returns an empty string if no match is found. devtools::load_all() temporarily installs the package inside /tmp/ (on my Linux machine). The directory structure of the temporary installation differs from the one of the regular installation, i.e. the one created by install.packages(). In my case, most notably, devtools::load_all() does not copy the inst/ directory, which contains the initialization file.
Now, calling base::system.file("maxima-init.mac", package="rmaxima", mustWork=FALSE) naturally fails, since it searches inside the temporary installation. Having devtools attached masks system.file() with devtools' replacement version, which, as mentioned above, is "... meant to intercept calls to base::system.file()" and behaves differently from base::system.file(). Practically, I think this means that it searches the package's source directory instead of the temporary installation.
This way, simply calling system.file() from the global environment calls the right function, either from devtools or base, for either the development or user version of the package automatically.
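A quick way to see the difference after devtools::load_all() (a hedged sketch, run from the global environment with devtools attached):
## Returns "" because the temporary installation lacks inst/extdata
base::system.file("extdata", "maxima-init.mac", package = "rmaxima")
## Resolves via the devtools/pkgload shim and finds the file in the source tree
system.file("extdata", "maxima-init.mac", package = "rmaxima")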
Nonetheless, additionally using ccache (thanks @Dirk) substantially speeds up my development workflow.

Difference between loading and attaching in R

In RStudio, when I check and uncheck a package, I see the following commands.
library("ggplot2", lib.loc="~/R/win-library/3.4")
detach("package:ggplot2", unload=TRUE)
Can someone explain what unload=TRUE does?
Conceptually is there a difference between loading/unloading vs attaching/detaching?
From R's official help pages (see also R Packages - Namespaces):
Anything needed for the functioning of the namespace should be handled at load/unload times by the .onLoad and .onUnload hooks.
For example, DLLs can be loaded (unless done by a useDynLib directive in the ‘NAMESPACE’ file) and initialized in .onLoad and unloaded in .onUnload.
Use .onAttach only for actions that are needed only when the package becomes visible to the user (for example a start-up message) or need to be run after the package environment has been created.
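As a rough illustration (not taken from any particular package), the two hooks typically live in a package's R/zzz.R:
.onLoad <- function(libname, pkgname) {
  ## runs whenever the namespace is loaded (library(), pkg::fun(), or as a dependency)
  ## e.g. load DLLs, set option defaults
}
.onAttach <- function(libname, pkgname) {
  ## runs only when the package is attached to the search path via library()/require()
  packageStartupMessage("welcome")
}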
 
attaching and .onAttach
mean that a package is attached to the user's workspace, i.e. placed on the search path next to the global environment; usually this is done via library(pkg), and you can then use plain fun() syntax.

loading and .onLoad
mean that the package is (in whatever way) made available to the current R session, e.g. by loading/attaching another package that depends on it, or by using pkg::fun() syntax for the first time; you will not find its functions on the search path, but you can use pkg::fun().

detach() relates to the package environment (the side the user sees), while unload = TRUE relates to the namespace environment (the side other packages see). After detach() you can no longer use functions from that package directly; unloadNamespace() does not stop you from calling a function via pkg::fun(), but other packages can no longer use its functions directly.
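A small demonstration of the difference, using ggplot2 from the question (assuming it is installed):
loadNamespace("ggplot2")                 # loaded only: ggplot2::ggplot() works, ggplot() does not
"ggplot2" %in% loadedNamespaces()        # TRUE
"package:ggplot2" %in% search()          # FALSE
library("ggplot2")                       # loaded and attached: ggplot() now works
detach("package:ggplot2")                # detached only: ggplot() fails, ggplot2::ggplot() still works
library("ggplot2")
detach("package:ggplot2", unload = TRUE) # detached and the namespace unloaded as well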

run Rmpi on cluster, specify library path

I'm trying to run an analysis in parallel on our computing cluster.
Unfortunately I've had to set up Rmpi myself and may not have done so properly.
Because I had to install all necessary packages into my home folder, I always have to call
.libPaths('/home/myfolder/Rlib');
before I can load packages.
However, it appears that doMPI attempts to load itself, before I can set the library path.
.libPaths('/home/myfolder/Rlib')
cat("Step 1")
library(doMPI)
cl <- startMPIcluster()
registerDoMPI(cl)
cat("Step 2")
Children_mcmc1 = foreach(i = 1:2) %dopar% {
  cat("Step 3")
  .libPaths('/home/myfolder/Rlib')
  library(MCMCglmm)
  cat("Step 4")
  load("krmh_married.rdata")
  nitt = 1000; thin = 50; burnin = 100
  MCMCglmm(children ~ paternalage.factor,
           random = ~ idParents,
           family = "poisson",
           data = krmh_married,
           pr = F, saveX = T, saveZ = T,
           nitt = nitt, thin = thin, burnin = burnin)
}
closeCluster(cl)
mpi.quit()
If I do
mpirun -H localhost -n 3 R --slave -f "3 - krmh mcmcglmm scc test 2.r"
I get (after removing some boilerplate messages)
During startup - Warning message:
Step 1
Step 1
Step 1
Step 2Error in { : task 2 failed - "cannot open the connection"
Calls: %dopar% ->
Execution halted
If I do
R --slave -f "3 - krmh mcmcglmm scc test 2.r"
I get
Step 1
Error in library(doMPI) : there is no package called 'doMPI'
Calls: local ... eval -> suppressMessages -> withCallingHandlers -> library
Execution halted
Error in library(doMPI) : there is no package called 'doMPI'
Calls: local ... eval -> suppressMessages -> withCallingHandlers -> library
Execution halted
I've tried installing doMPI on the fly, but even though Step 2 isn't printed, it seems as if the error results from the loop.
And of course, with all this I'm still testing on our frontend; I haven't made it to submitting the job to the intended cluster yet.
I tried to put the .libPaths call in my .Rprofile, but I'm not sure this would get read on the cluster, and I can't even get it to be read on the frontend (and I couldn't figure out where R looks for the file).
It's much easier to install R packages into a "personal library", since it is used automatically so you don't have to call .libPaths in your scripts. You can determine what directory this is by executing:
> Sys.getenv('R_LIBS_USER')
This will automatically be the first directory returned by .libPaths if it exists, so you don't have to worry about calling .libPaths at all.
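For example, something along these lines (a sketch; the directory typically doesn't exist until you create it):
## Create the personal library once and install packages into it
dir.create(Sys.getenv("R_LIBS_USER"), recursive = TRUE, showWarnings = FALSE)
install.packages("doMPI", lib = Sys.getenv("R_LIBS_USER"))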
Note that there's no point in calling .libPaths in the body of the foreach loop since doMPI must be loaded by the cluster workers before they can execute any tasks.
I'm not sure what's going wrong in your "mpirun" case, because mpirun is starting all of the workers, so the first four lines of your script are executed by all of them. That is why "Step 1" is displayed three times. But in your second case, the cluster workers are being spawned, so the doMPI package is loaded by the RMPIworker.R script, resulting in the error loading doMPI.
I suggest that you use the mpirun approach to solve the .libPaths problem, but call startMPIcluster with the verbose=TRUE option. That will create some files in your working directory named "MPI_*.log" which may contain some useful error messages that will provide a clue to the problem.
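Putting that together, the top of the script might look like this (a sketch of the suggested approach, not tested on your cluster):
## With mpirun, every worker runs this script, so .libPaths() is set before doMPI loads
.libPaths('/home/myfolder/Rlib')
library(doMPI)
cl <- startMPIcluster(verbose = TRUE)  # writes MPI_*.log files with worker-side messages
registerDoMPI(cl)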

error: object '.doSnowGlobals' not found?

I'm trying to parallelize some code on 4 nodes (type = "SOCK"). Here is my code.
library(itertools)
library(foreach)
library(doParallel)
library(parallel)
workers <- c("node1-ip", "node2-ip", "node3-ip", "node4-ip")  # IP addresses of the 4 nodes
cl = makePSOCKcluster(workers, master = "ip address of master")
registerDoParallel(cl)
z <- read.csv("ProcessedData.csv", header=TRUE, as.is=TRUE)
z <- as.matrix(z)
system.time({
  chunks <- getDoParWorkers()
  b <- foreach(these = isplitIndices(nrow(z), chunks = chunks),
               .combine = c) %dopar% {
    a <- rep(0, length(these))
    for (i in 1:length(these)) {
      a[i] <- mean(z[these[i], ])
    }
    a
  }
})
I get this error:
4 nodes produced errors; first error: object '.doSnowGlobals' not
found.
This code runs fine if I'm using doMC, i.e. using the same machine's cores. But when I try to use other computers for the parallel computation I get the above error. When I change it to registerDoSNOW the error persists.
Do snow and doSNOW work on a cluster? I could create nodes on the localhost using snow, but not on the cluster. Is anyone out there using snow?
To set the library path on each worker you can run:
clusterEvalQ(cl, .libPaths("Your library path"))
You can get this error if any of the workers are unable to load the doParallel package. You can make that happen by installing doParallel into some directory and pointing the master to it via ".libPaths":
> .libPaths('~/R/lib.test')
> library(doParallel)
> cl <- makePSOCKcluster(3, outfile='')
starting worker pid=26240 on localhost:11566 at 13:47:59.470
starting worker pid=26248 on localhost:11566 at 13:47:59.667
starting worker pid=26256 on localhost:11566 at 13:47:59.864
> registerDoParallel(cl)
> foreach(i=1:10) %dopar% i
Warning: namespace ‘doParallel’ is not available and has been replaced
by .GlobalEnv when processing object ‘’
Warning: namespace ‘doParallel’ is not available and has been replaced
by .GlobalEnv when processing object ‘’
Warning: namespace ‘doParallel’ is not available and has been replaced
by .GlobalEnv when processing object ‘’
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
3 nodes produced errors; first error: object '.doSnowGlobals' not found
The warning happens when a function from doParallel is deserialized on a worker. The error happens when the function is executed and tries to access .doSnowGlobals, which is defined in the doParallel namespace, not in .GlobalEnv.
You can also verify that doParallel is available on the workers by executing:
> clusterEvalQ(cl, library(doParallel))
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
3 nodes produced errors; first error: there is no package called ‘doParallel’
A specific case of @Steve Weston's answer is when your workers aren't able to load a given package (e.g. doParallel) because the package is inside a Packrat project. Install the packages to the system library, or somewhere else that a worker will be able to find them.
I encountered the same problem today, and I tried all the answers above, none of which worked for me. Then I simply reinstalled the doSNOW package, and magically, the problem was solved.
So none of these fixes worked for me at all. In my particular case, I use a custom R library location. Parallel processing worked if my working directory was the base directory where my custom libraries folder was located, but it failed if I used setwd() to change the working directory.
This custom library location was not being passed on to the worker nodes, so they were looking in R's default library directory for packages that were not there. The fix by @Nat did not work for me; the worker nodes still could not find my custom library folder. What did work was:
Before sending jobs to nodes:
paths <- .libPaths()
I then sent jobs to nodes, along with the argument paths. Then, inside the worker function I simply called:
.libPaths(paths)
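For illustration, a minimal sketch of that pattern with a PSOCK cluster (names are placeholders, not the original code):
paths <- .libPaths()                       # capture the master's library paths
cl <- parallel::makePSOCKcluster(4)
res <- parallel::parLapply(cl, 1:10, function(i, paths) {
  .libPaths(paths)                         # make this worker search the same libraries
  library(doParallel)                      # stand-in for any package living in the custom library
  i^2
}, paths = paths)
parallel::stopCluster(cl)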
