I'm trying to use the parallel package in R for parallel operations rather than doSNOW, since it's built-in and ostensibly the direction the R Project wants things to go. I'm doing something wrong that I can't pin down, though. Take this example:
library(plyr)
a <- rnorm(50)
b <- rnorm(50)
arr <- matrix(cbind(a, b), nrow = 50)
aaply(arr, .margins = 1, function(x) x[1] + x[2], .parallel = FALSE)
This works just fine, producing the row-wise sums of my two columns. But if I try to bring in the parallel package:
library(parallel)
nodes <- detectCores()
cl <- makeCluster(nodes)
setDefaultCluster(cl)
aaply(arr, .margins = 1, function(x) x[1] + x[2], .parallel = TRUE)
It throws the following warnings (and falls back to running sequentially):
2: In setup_parallel() : No parallel backend registered
3: executing %dopar% sequentially: no parallel backend registered
Am I initializing the backend wrong?
Try this setup:
library(doParallel)
library(plyr)
nodes <- detectCores()
cl <- makeCluster(nodes)
registerDoParallel(cl)
aaply(ozone, 1, mean, .parallel = TRUE)
stopCluster(cl)
Since I have never used plyr for parallel computing, I have no idea why this issues warnings; the result is correct anyway.
The documentation for aaply states
.parallel: if ‘TRUE’, apply function in parallel, using parallel
backend provided by foreach
so presumably you need to register a foreach backend (which is what registerDoParallel does) rather than merely create a cluster with the parallel package.
Related
I am dealing with a computationally intensive package in R. This package has no alternative implementation that interfaces with a Spark cluster; however, it does have an optional argument that takes a cluster created with the parallel package. My question is: can I connect to a Spark cluster using something like sparklyr, and then use that Spark cluster as part of a makeCluster() command to pass into my function?
I have successfully gotten the cluster working with parallel, but I do not know how or if it is possible to leverage the spark clusters.
library(bnlearn)
library(parallel)
my_cluster <- makeCluster(3)
...
pc_structure <- pc.stable(train[,-1], cluster = my_cluster)
My question is: can I connect to a Spark cluster as follows:
sc <- spark_connect(master = "yarn-client", config = config, version = '1.6.2')
and then leverage the connection (the sc object) in the makeCluster() function?
If that would solve your problem (and if I understand you correctly), I'd wrap the code that uses the parallel package into a SparkR function, e.g. spark.lapply (or something similar in sparklyr; I don't have experience with that).
I assume your Spark cluster is Linux-based, hence the mclapply function from the parallel package should be used (instead of makeCluster and the subsequent clusterExport needed on Windows).
For example, a locally executed task of summing up the numbers in each element of a list would be (on Linux):
library(parallel)
input = list(c(1,2,3), c(1,2,3,4), c(1,2,3,4,5))
res = mclapply(X=input, FUN=sum, mc.cores=3)
and doing the same task 10000 times using a Spark cluster:
input = list(c(1,2,3), c(1,2,3,4), c(1,2,3,4,5))
save(input, file="/path/testData.RData")
res = spark.lapply(1:10000, function(x){
  library(parallel)
  load("/path/testData.RData")
  mclapply(X=input, FUN=sum, mc.cores=3)
})
The question is whether your code can be tweaked that way.
What's the difference between using the "doParallel" package with type = MPI and using doMPI directly?
library(foreach)
library(doParallel)
library(Rmpi)  # provides mpi.universe.size()
cl <- makeCluster(mpi.universe.size(), type='MPI')
registerDoParallel(cl)
system.time(foreach(i = 1:3) %dopar% {Sys.sleep(i); i})
vs.
library(doMPI)
cl <- startMPIcluster(count=2)
registerDoMPI(cl)
system.time(foreach(i = 1:3) %dopar% {Sys.sleep(i); i})
The "doParallel" package acts as a wrapper around the "clusterApplyLB" function which is implemented by calling functions from the "Rmpi" package when using an MPI cluster.
The "doMPI" package uses "Rmpi" functions directly and includes some features that aren't available in "clusterApplyLB":
supports fetching inputs and combining outputs on-the-fly to efficiently handle a large number of loop iterations;
supports MPI broadcast to initialize workers;
allows workers to be started either by mpirun or MPI spawn function.
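As a sketch of that first point, doMPI exposes backend-specific tuning through the .options.mpi argument of foreach(); chunkSize is a documented doMPI option, the count of 2 is just an example, and an MPI installation is assumed:
library(doMPI)
cl <- startMPIcluster(count=2)
registerDoMPI(cl)
# ship iterations to the workers in chunks of 10 rather than one at a time
res <- foreach(i = 1:100, .options.mpi=list(chunkSize=10)) %dopar% sqrt(i)
closeCluster(cl)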
How do I get parallelization of code to work in R on Windows? Include a simple example. Posting this self-answered question because this was rather unpleasant to get working. You'll find that package parallel does NOT work on its own, but package snow works very well.
Posting this because it took me bloody forever to figure out. Here's a simple example of parallelization in R that will let you test whether things are working right for you and get you on the right path.
library(snow)
z <- 1:4
system.time(lapply(z, function(x) Sys.sleep(1)))
cl <- makeCluster(### YOUR NUMBER OF CORES GOES HERE ###, type="SOCK")
system.time(clusterApply(cl, z, function(x) Sys.sleep(1)))
stopCluster(cl)
You should also use the doSNOW package to register foreach with the snow cluster; this will cause many packages to parallelize automatically. The command to register is registerDoSNOW(cl) (with cl being the return value from makeCluster()); the command that undoes the registration is registerDoSEQ(). Don't forget to turn off your clusters.
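A minimal sketch of that registration, assuming a 4-core machine (adjust the core count to your hardware):
library(doSNOW)                        # also attaches foreach and snow
cl <- makeCluster(4, type="SOCK")      # 4 is an assumption; use your core count
registerDoSNOW(cl)
foreach(i = 1:4) %dopar% Sys.sleep(1)  # now runs on the snow cluster
registerDoSEQ()                        # undo the registration
stopCluster(cl)                        # shut the workers down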
This worked for me; I used the doParallel package, and it took just a few lines of code:
# process in parallel
library(doParallel)
cl <- makeCluster(detectCores(), type='PSOCK')
registerDoParallel(cl)
# turn parallel processing off and run sequentially again:
registerDoSEQ()
stopCluster(cl)  # shut the workers down when you are done
Calculation of a random forest decreased from 180 secs to 120 secs (on a Windows computer with 4 cores).
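If you want to confirm the backend is actually active before launching a long computation, foreach has helpers for that; a quick sanity check (sketch):
library(foreach)
getDoParWorkers()  # number of workers the registered backend reports
getDoParName()     # name of the registered backend
foreach(i = 1:4, .combine=c) %dopar% Sys.getpid()  # distinct PIDs mean real workers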
Based on the information here, I was able to convert the following code into a parallelised version that worked under RStudio on Windows 7.
Original code:
#
# Basic elbow plot function
#
wssplot <- function(data, nc=20, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i, iter.max=30)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of clusters",
       ylab="Within groups sum of squares")
}
Parallelised code:
library("parallel")
workerFunc <- function(nc) {
set.seed(1234)
return(sum(kmeans(my_data_frame, centers=nc, iter.max=30)$withinss)) }
num_cores <- detectCores()
cl <- makeCluster(num_cores)
clusterExport(cl, varlist=c("my_data_frame"))
values <- 1:20 # this represents the "nc" variable in the wssplot function
system.time(
result <- parLapply(cl, values, workerFunc) ) # paralel execution, with time wrapper
stopCluster(cl)
plot(values, unlist(result), type="b", xlab="Number of clusters", ylab="Within groups sum of squares")
I'm not suggesting it's perfect or even the best approach, just a beginner demonstrating that parallel does seem to work under Windows. Hope it helps.
I think these libraries will help you:
foreach (facilitates executing the loop in parallel)
doSNOW (I think you already use it)
doMC (multicore functionality of the parallel package; fork-based, so it won't run in parallel on Windows)
These articles may also help you:
http://vikparuchuri.com/blog/parallel-r-loops-for-windows-and-linux/
http://www.joyofdata.de/blog/parallel-computing-r-windows-using-dosnow-foreach/
I'm posting a cross-platform answer here because all the other answers I found were over-complicated for what I needed to accomplish. I'm using an example where I read in all the sheets of an Excel workbook.
# read in the spreadsheet
parallel_read <- function(file){
  # detect available cores and use 70%
  numCores = round(parallel::detectCores() * .70)

  # check if OS is Windows and use parLapply
  if(.Platform$OS.type == "windows") {
    cl <- parallel::makePSOCKcluster(numCores)
    # each sheet name is passed to read_excel's second argument ("sheet")
    dfs <- parallel::parLapply(cl, readxl::excel_sheets(file),
                               readxl::read_excel, path = file)
    parallel::stopCluster(cl)
    return(dfs)

  # if not Windows use mclapply
  } else {
    dfs <- parallel::mclapply(readxl::excel_sheets(file),
                              readxl::read_excel,
                              path = file,
                              mc.cores = numCores)
    return(dfs)
  }
}
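A hypothetical call (the file name is made up for illustration); the function returns a list with one data frame per sheet:
sheets <- parallel_read("workbook.xlsx")
length(sheets)  # one element per sheet in the workbook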
For what it is worth, I was running into the same problem but couldn't get any of these to work. I eventually learned that RStudio has a 'Jobs' pane and can run models in the background, each on its own core. So what I did was divvy up my model into 10 segments (it was iterating over 100 vectors, so 10 scripts of 10 vectors each) and run each as a separate job. That way, when one finished I could use its output immediately, and I could keep working on my script without waiting for each model to finish. Here is a link all about using jobs: https://blog.rstudio.com/2019/03/14/rstudio-1-2-jobs/
I've run into a strange error.
Suppose I have 10 xts objects in a list called data. I now generate every combination of three of them using
data_names <- names(data)
combs <- combn(data_names, 3)
My basic goal is to do a PCA on each of those 120 triples.
To speed things up I wanted to use the doParallel package. So here is the snippet, shortened to the point where the error occurs:
list <- foreach(i=1:ncol(combs)) %dopar% {
  tmp_triple <- combs[,i]
  p1 <- data[tmp_triple[[1]]][[1]]
  p2 <- data[tmp_triple[[2]]][[1]]
  p3 <- data[tmp_triple[[3]]][[1]]
  data.merge <- merge(p1, p2, p3, all=FALSE)
}
Here, the merge function seems to be the problem. The error is
task 1 failed - "cannot coerce class 'c("xts", "zoo")' into a data.frame"
However, when changing %dopar% to a normal serial %do%, everything works as expected.
So far I have not been able to find any solution to this problem, and I'm not even sure what to look for.
A better solution than explicitly loading the libraries within the function would be to utilise the .packages argument of the foreach() function:
list <- foreach(i=1:ncol(combs), .packages=c("xts","zoo")) %dopar% {
  tmp_triple <- combs[,i]
  p1 <- data[tmp_triple[[1]]][[1]]
  p2 <- data[tmp_triple[[2]]][[1]]
  p3 <- data[tmp_triple[[3]]][[1]]
  data.merge <- merge(p1, p2, p3, all=FALSE)
}
The problem is likely that you haven't called library(xts) on each of the workers. You don't say what backend you're using, so I can't be 100% sure.
If that's the problem, then this code will fix it:
list <- foreach(i=1:ncol(combs)) %dopar% {
  library(xts)
  tmp_triple <- combs[,i]
  p1 <- data[tmp_triple[[1]]][[1]]
  p2 <- data[tmp_triple[[2]]][[1]]
  p3 <- data[tmp_triple[[3]]][[1]]
  data.merge <- merge(p1, p2, p3, all=FALSE)
}
A quick fix for problems with foreach %dopar% is to reinstall these packages:
install.packages("doSNOW")
install.packages("doParallel")
install.packages("doMPI")
These packages are responsible for parallelism in R. A bug that existed in old versions of them has since been removed. This worked in my case.
I'm new to using the parallel packages and have started exploring them in a bid to speed up some of my work. An annoyance I often encounter is that the foreach command will throw up problems when I have not exported the relevant functions/variables with clusterExport.
Example
I know that the example below does not necessarily need foreach to make it fast, but I'll use it for illustration's sake.
library(doParallel)
library(parallel)
library(lubridate)
library(foreach)
cl <- makeCluster(c("localhost", "localhost", "localhost","localhost"), type = "SOCK")
registerDoParallel(cl, cores = 4)
Dates <- sample(c(dates = format(seq(ISOdate(2010,1,1), by='day', length=365), format='%d-%m-%Y')), 500, replace = TRUE)
foreach(i = seq_along(Dates), .combine = rbind) %dopar% dmy(Dates[i])
Error in dmy(Dates[i]) : task 1 failed - "could not find function "dmy""
As you can see, the error says that the dmy function was not found. I then have to go and add the following:
clusterExport(cl, c("dmy"))
So my question is: besides looking at the error for clues on what to export, is there a more elegant way of knowing beforehand which objects to export, or is there a way to share the global environment with all the workers before running the foreach loop?
No need to export individual package functions manually like that. You can use the .packages argument to the foreach function to load the required packages, so all package functions will be available to your %dopar% expression.
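Applied to the example in the question, a minimal sketch:
foreach(i = seq_along(Dates), .combine = rbind,
        .packages = "lubridate") %dopar% dmy(Dates[i])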