Passing a C++ package library to a doParallel cluster

I have some plyr code that is meant to run an Rcpp function that I have written:
library(plyr)
library(doParallel)
nodes = detectCores()
cl = makeCluster(nodes)
registerDoParallel(cl)
l = llply(mylist, function(x) {
  .Call("myfancyfunction", PACKAGE = "mypackage", ...)
}, .parallel = TRUE, .paropts = list(.packages = "mypackage"))
However, even when I include the package I get the error:
Error in do.ply(i) :
task 1 failed - ""myfancyfunction" not available for .Call() for package "mypackage""
How do I make my libraries accessible to the parallel processes?

The error doesn't indicate any issue with plyr or parallel processing; it means you haven't properly registered your C/C++ routine in your package. For example, this works:
library(plyr)
library(doParallel)
nodes = detectCores()
cl = makeCluster(nodes)
registerDoParallel(cl)
l <- llply(list(1,2), function(x) {
.Call("endpoints", 1L, 1L, 1L, TRUE, PACKAGE="xts")
}, .parallel=TRUE, .paropts=list(.packages="xts"))
Running library(mypackage); .Call("myfancyfunction", x, PACKAGE = "mypackage") in a normal, non-parallel R session will most likely throw the same error.
See Registering native routines in Writing R Extensions for details on ways you can register myfancyfunction.
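As a quick diagnostic (a sketch, assuming the package at least loads; "mypackage" and "myfancyfunction" are the placeholder names from the question), you can check from a normal R session whether the routine was registered at all:
library(mypackage)
# Routines the package's shared library has registered:
getDLLRegisteredRoutines("mypackage")
# TRUE only if the symbol can be resolved for .Call():
is.loaded("myfancyfunction", PACKAGE = "mypackage", type = "Call")
If getDLLRegisteredRoutines() shows nothing under .Call, the routine still needs to be registered (or exported via useDynLib in the package NAMESPACE).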

Related

`doParallel` vs `future` while using `Seurat` package

Here is the story.
According to the Seurat vignette, FindMarkers() can be accelerated by using the future package: future::plan("multiprocess", workers = 4).
However, I am running a simulation in which I need to call FindAllMarkers() inside a foreach() loop after doParallel::registerDoParallel(cores = 10).
What parallelization actually happens behind the scenes?
How can I leverage the most power of the HPC under this setup?
How many CPUs should I allocate for this job to maximize the parallelization?
Any idea is welcome.
Below is a minimal example. pbmc.rds is here.
library(Seurat)
# Enable parallelization for `FindAllMarkers()`
library(future)
plan("multiprocess", workers = 4)
# Enable parallelization for `foreach()` loop
library(doParallel)
registerDoParallel(cores = 10)
pbmc <- readRDS("pbmc.rds")
rst <- foreach(i = 1:10 / 10, .combine = "cbind") %dopar% {
  pbmc <- FindClusters(pbmc, resolution = i)
  # should the future command go here instead?
  # plan("multiprocess", workers = 4)
  DEgenes <- FindAllMarkers(pbmc)
  write.csv(DEgenes, paste0("DEgenes_resolution_", i, ".csv"))
  pbmc$seurat_clusters
}
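For reference, here is a minimal sketch of the nesting the commented line hints at, with the plan set inside each worker. This is just one possible arrangement to experiment with, not a confirmed answer; note that "multiprocess" is deprecated in recent versions of future in favor of "multisession", and that this layout can demand roughly 10 * 4 cores at once:
rst <- foreach(i = 1:10 / 10, .combine = "cbind",
               .packages = c("Seurat", "future")) %dopar% {
  # per-worker future plan, so FindAllMarkers() parallelizes inside this worker
  plan("multisession", workers = 4)
  pbmc <- FindClusters(pbmc, resolution = i)
  DEgenes <- FindAllMarkers(pbmc)
  write.csv(DEgenes, paste0("DEgenes_resolution_", i, ".csv"))
  pbmc$seurat_clusters
}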

Error running Rmpi when doing parallel computing

I'm trying to run a parallel computation in R with the lines below:
library(parallel)
library(snow)
library(snowFT)
library(VGAM)
library(dplyr)
library(Rmpi)
nCores <- detectCores() - 1
cl <- makeCluster(nCores)
Then R returns an error:
Error in Rmpi::mpi.comm.spawn(slave = mpitask, slavearg = args, nslaves = count, :
  Internal MPI error!, error stack:
  MPI_Comm_spawn(cmd="C:/R/R-40~1.2/bin/x64/Rscript.exe", argv=0x00000223DB137530,
                 maxprocs=11, MPI_INFO_NULL, root=0, MPI_COMM_SELF,
                 intercomm=0x00000223DCFCD998, errors=0x00000223DA9FC9E8) failed
  Internal MPI error! FAIL
  spawn not supported without process manager
3. Rmpi::mpi.comm.spawn(slave = mpitask, slavearg = args, nslaves = count, intercomm = intercomm)
2. makeMPIcluster(spec, ...)
1. makeCluster(nCores)
I've tried to install MPICH2 on Windows from here, but the final cmd command mpiexec -validate always returns FAIL.
Could you please elaborate on how to solve this issue?
The problem is that makeCluster() is defined by more than one of the attached packages: snow masks parallel::makeCluster(), and the masking version ended up trying to spawn an MPI cluster through Rmpi (hence makeMPIcluster() in the traceback). As such, I use parallel::makeCluster(nCores) to solve the issue.
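A minimal sketch of that fix (the explicit type = "PSOCK" is an extra safeguard added here, not part of the original answer):
library(parallel)
nCores <- detectCores() - 1
# Qualify the namespace so snow's makeCluster() cannot shadow this call,
# and request a socket cluster explicitly instead of an MPI one:
cl <- parallel::makeCluster(nCores, type = "PSOCK")
# ... run the parallel work ...
stopCluster(cl)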

How to clusterExport in a foreach loop (OS:Windows)

I have the following problem: I need the time index of variables in my global environment, but when I try to export them to my cluster workers during parallel processing, I get the following message:
Error in { : task 1 failed - "object 'Szenario' not found"
A minimal example of my original code, which produces the error:
Historical <- structure(c(18.5501872568473, 24.3295787432955, 14.9342384460288,
13.0653757599636, 8.67294618896797, 13.4587662721594, 20.792126254714,
17.5162747884424, 28.8253151239752, 23.0568765432192), index = structure(c(-7305,
-7304, -7303, -7302, -7301, -7300, -7299, -7298, -7297, -7296
), class = "Date"), class = "zoo")
Szenario <- structure(c(10.2800684124652, 14.5495871489361, 9.8565852930294,
21.1654540678686, 21.1936990312861, 12.4209005842752, 9.77473132000267,
17.1997402736739, 17.884107611858, 13.622588360244), index = structure(c(13149,
13150, 13151, 13152, 13153, 13154, 13155, 13156, 13157, 13158
), class = "Date"), class = "zoo")
library(doParallel)
library(foreach)
library(raster)
library(zoo)
library(parallel)
# Parallelisation settings
# Define how many cores you want to use
UseCores <- detectCores() - 2 # leave cores free; at least one is needed for other tasks
# Register CoreCluster
cl <- makeCluster(UseCores)
registerDoParallel(cl)
foreach(fn = 1:1) %dopar% {
  # needed libraries have to be loaded inside the loop while parallel processing occurs
  library(raster)
  library(zoo)
  library(base)
  library(parallel)
  # In my original script I loop through filenames, which are named like my variables
  # in my global environment (without the .tif extension); the variable names are
  # saved as characters
  file.referenz.name <- c("Historical")
  file.szenario.name <- c("Szenario")
  # Create time index for rasterstacks, to subset with later on (getZ, setZ)
  clusterExport(cl, varlist = c(file.szenario.name, file.referenz.name), envir = .GlobalEnv)
  time.index.szenario <- index(get(file.szenario.name))
  time.index.referenz <- index(get(file.referenz.name))
}
#end cluster
stopCluster(cl)
Try this:
foreach(fn = 1:1, .export = c("Szenario", "Historical"),
        .packages = c("raster", "zoo", "base", "parallel")) %dopar% {
  # loop body as above, without the library() calls and the clusterExport()
}
Also, calling clusterExport() inside a %dopar% {} block is confused: the export has to happen before the workers evaluate the loop body. Either call clusterExport() on cl before the foreach(), or simply use the .export argument of foreach().
You can then remove the library() statements inside the %dopar% {} block.
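Putting the pieces together, a minimal sketch of the corrected loop, using the same data and cluster registration as above:
foreach(fn = 1:1, .export = c("Szenario", "Historical"),
        .packages = c("raster", "zoo")) %dopar% {
  file.referenz.name <- "Historical"
  file.szenario.name <- "Szenario"
  # .export copied both objects into the worker, so get() can find them
  time.index.szenario <- index(get(file.szenario.name))
  time.index.referenz <- index(get(file.referenz.name))
  list(szenario = time.index.szenario, referenz = time.index.referenz)
}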

Loading CRAN packages to use with emrlapply() from JD Long's 'segue' package?

I'm using JD Long's segue package (https://code.google.com/p/segue/) to do some parallel computing, and am running into an issue loading CRAN packages on the EC2 instances.
First, I created an EMR cluster like so:
myCluster <- createCluster(numInstances = 5,
                           cranPackages = c("RWeka", "tm"),
                           masterInstanceType = "m1.large",
                           slaveInstanceType = "m1.large",
                           location = "us-east-1c")
Per the documentation, I specified which packages I want to load (in this case, RWeka and tm).
The cluster seems to start properly, with no error messages. I am using RStudio on Linux Mint 17 with R version 3.0.2.
I wrote a function getTerms.jobAd which takes a character string and calls some functions from the packages above, and am using emrlapply() like so:
> jobAdTerms <- emrlapply(myCluster, X = as.list(jobAds[1:2, 3]), FUN = getTerms.jobAd)
RUNNING - 2014-06-24 17:05:19
RUNNING - 2014-06-24 17:05:50
WAITING - 2014-06-24 17:06:20
When I check the jobAdTerms list that is supposed to be returned, I get:
> jobAdTerms
[[1]]
[1] "error caught by Segue: Error in function (txt) : could not find function \"Corpus\"\n"
[[2]]
[1] "error caught by Segue: Error in function (txt) : could not find function \"Corpus\"\n"
Obviously, Corpus is one of the functions from the tm package.
What am I doing wrong? And how can I remedy this situation? Thanks!!
EDIT
Here's the function I am calling:
nGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 4))
getTerms.jobAd <- function(txt) {
tmp <- tolower(txt)
tmp <- gsub('\\s*<.*?>|[:;,#$%^&*()?]|(?<=[a-zA-Z])\\.(?= |$)', '', tmp, perl = TRUE)
txt.Corpus <- Corpus(VectorSource(tmp))
txt.Corpus <- tm_map(txt.Corpus, stripWhitespace)
txt.TFV <- termFreq(txt.Corpus[[1]], control = list(dictionary = jobTags[, 1], wordLengths = c(1, Inf)))
txt.TFV2 <- termFreq(txt.Corpus[[1]], control = list(tokenize = nGramTokenizer, dictionary = jobTags[, 1], wordLengths = c(1, Inf)))
jobTerms <- rowSums(as.matrix(c(txt.TFV, txt.TFV2)))
return(jobTerms)
}
EDIT 2
Here's how you can reproduce the error:
data(crude)
jobAdTerms <- emrlapply(myCluster, X = as.list(crude), FUN = getTerms.jobAd)
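No answer is recorded here, but a standard workaround for "could not find function" errors on remote workers (an assumption on my part, not something the segue docs promise) is to attach the packages inside the function that gets shipped to the workers, for example via a small wrapper:
# Hypothetical wrapper: attach the packages in each worker's R session
# before calling the real function. This assumes segue serializes the
# wrapper together with getTerms.jobAd from the global environment.
getTerms.jobAd.remote <- function(txt) {
  require(tm)
  require(RWeka)
  getTerms.jobAd(txt)
}
jobAdTerms <- emrlapply(myCluster, X = as.list(jobAds[1:2, 3]),
                        FUN = getTerms.jobAd.remote)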

using package snow's parRapply: argument missing error

I want to find documents whose similarity to other documents is larger than a given value (0.1), by cutting the documents into blocks.
library(tm)
data("crude")
sample.dtm <- DocumentTermMatrix(
crude, control=list(
weighting=function(x) weightTfIdf(x, normalize=FALSE),
stopwords=TRUE
)
)
step = 5
n = nrow(sample.dtm)
block = n %/% step
start = (c(1:block)-1)*step+1
end = start+step-1
j = unlist(lapply(1:(block-1),function(x) rep(((x+1):block),times=1)))
i = unlist(lapply(1:block,function(x) rep(x,times=(block-x))))
ij <- cbind(i,j)
library(skmeans)
getdocs <- function(k) {
  ci <- c(start[k[[1]]]:end[k[[1]]])
  cj <- c(start[k[[2]]]:end[k[[2]]])
  combi <- sample.dtm[ci]
  combj <- sample.dtm[cj]
  rownames(combi) <- ci
  rownames(combj) <- cj
  comb <- c(combi, combj)
  sim <- 1 - skmeans_xdist(comb)
  cat("Block", k[[1]], "with Block", k[[2]], "\n")
  flush.console()
  tri.sim <- upper.tri(sim, diag = FALSE)
  results <- tri.sim & sim > 0.1
  docs <- apply(results, 1, function(x) length(x[x == TRUE]))
  docnames <- names(docs)[docs > 0]
  gc()
  return(docnames)
}
It works well when using apply:
system.time(rmdocs <- apply(ij, 1, getdocs))
When using parRapply:
library(snow)
library(skmeans)
cl<-makeCluster(2)
clusterExport(cl,list("getdocs","sample.dtm","start","end"))
system.time(rmdocs<-parRapply(cl,ij,getdocs))
Error:
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: attempt to set 'rownames' on an object with no dimensions
Timing stopped at: 0.01 0 0.04
It seems that sample.dtm couldn't be used in parRapply. I'm confused. Can anyone help me? Thanks!
In addition to exporting objects, you need to load the necessary packages on the cluster workers. In your case, the result of not doing so is that there is no dimnames method defined for "DocumentTermMatrix" objects on the workers, causing rownames<- to fail.
You can load packages on the cluster workers with the clusterEvalQ function:
clusterEvalQ(cl, { library(tm); library(skmeans) })
After doing that, rownames(combi)<-ci will work correctly.
Also, if you want to see the output from cat, you should use the makeCluster outfile argument:
cl <- makeCluster(2, outfile='')
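Putting those two changes together with the original setup, a minimal recap sketch:
library(snow)
library(skmeans)
# outfile='' makes cat() output from the workers visible
cl <- makeCluster(2, outfile = "")
# load the packages on the workers so tm's methods are available there
clusterEvalQ(cl, { library(tm); library(skmeans) })
clusterExport(cl, list("getdocs", "sample.dtm", "start", "end"))
system.time(rmdocs <- parRapply(cl, ij, getdocs))
stopCluster(cl)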
