I've run into a strange error.
Suppose I have 10 xts objects in a list called data. I now generate every combination of three of them using
data_names <- names(data)
combs <- combn(data_names, 3)
My basic goal is to do a PCA on those 1080 triples.
To speed things up I wanted to use the doParallel package. Here is the snippet, shortened to the point where the error occurs:
list <- foreach(i=1:ncol(combs)) %dopar% {
  tmp_triple <- combs[,i]
  p1 <- data[tmp_triple[[1]]][[1]]
  p2 <- data[tmp_triple[[2]]][[1]]
  p3 <- data[tmp_triple[[3]]][[1]]
  data.merge <- merge(p1, p2, p3, all=FALSE)
}
Here, the merge function seems to be the problem. The error is
task 1 failed - "cannot coerce class 'c("xts", "zoo")' into a data.frame"
However, when changing %dopar% to a normal serial %do%, everything works as expected.
So far I have not been able to find a solution to this problem, and I'm not even sure what to look for.
A better solution than explicitly loading the libraries within the function is to use the .packages argument of foreach():
list <- foreach(i=1:ncol(combs), .packages=c("xts","zoo")) %dopar% {
  tmp_triple <- combs[,i]
  p1 <- data[tmp_triple[[1]]][[1]]
  p2 <- data[tmp_triple[[2]]][[1]]
  p3 <- data[tmp_triple[[3]]][[1]]
  data.merge <- merge(p1, p2, p3, all=FALSE)
}
The problem is likely that you haven't called library(xts) on each of the workers. You don't say what backend you're using, so I can't be 100% sure.
If that's the problem, then this code will fix it:
list <- foreach(i=1:ncol(combs)) %dopar% {
  library(xts)
  tmp_triple <- combs[,i]
  p1 <- data[tmp_triple[[1]]][[1]]
  p2 <- data[tmp_triple[[2]]][[1]]
  p3 <- data[tmp_triple[[3]]][[1]]
  data.merge <- merge(p1, p2, p3, all=FALSE)
}
A quick fix for problems with foreach %dopar% is to reinstall these packages:
install.packages("doSNOW")
install.packages("doParallel")
install.packages("doMPI")
These packages are responsible for parallelism in R. A bug that existed in older versions of them has since been fixed. This worked in my case.
I have an lapply operation that I've parallelised using snow. This works fine except that any warnings generated seem to just get ignored and are hence never shown to the user. Is there a way of exposing warnings on individual nodes so they come through in the main R process?
My best idea at the moment is to have all nodes write their warnings to files, and read those at the end, but there must be a better way!
Here's a reprex:
library(snow)
f <- function(x){
  warning("mywarning")
  return(NULL)
}
cl <- makeCluster(2, type="SOCK")
lapply(1:2, f) # Gives me warnings, as desired
clusterApply(cl, 1:2, f) # Gives me the same output, faster, but with no warnings
I ended up switching from snow to the future.apply package (in conjunction with parallel). future.apply has this behaviour by default.
Unfortunately, in most cases the messages/warnings don't appear until the whole run has finished, but that's a separate issue.
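For reference, a minimal sketch of the equivalent future.apply call (assuming the future.apply package is installed); warnings raised on the workers are relayed back to the main R session:
library(future.apply)
plan(multisession, workers = 2)  # two local background R sessions

f <- function(x){
  warning("mywarning")
  return(NULL)
}

future_lapply(1:2, f)  # the workers' warnings show up in the main process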
I would like to parallelize a portion of a package I am working on. Which packages and what syntax should I use to make the package flexible and usable on different architectures? My problem sits in a single sapply() call as shown in this mock code:
.heavyStuff <- function(x) {
  # do a lot of work
  Sys.sleep(1)
}
listOfX <- 1:20
userFunc1 <- function(listOfX) {
  res <- sapply(listOfX, .heavyStuff)
  return(res)
}
Based on different guides, I have concocted the following:
userFunc2 <- function(listOfX, dopar.arg=2) {
  if(requireNamespace("doParallel")) {
    doParallel::registerDoParallel(dopar.arg)
    res <- foreach(i=1:length(listOfX)) %dopar% {
      .heavyStuff(listOfX[[i]])
    }
    names(res) <- names(listOfX)
  } else {
    res <- sapply(listOfX, .heavyStuff)
  }
  return(res)
}
Questions:
Can I safely use such a code in a package? Will it work well on a range of platforms?
Is there a way to avoid the foreach() construct? I'd much prefer to use a sapply- or lapply-like function. However, the constructs in the parallel library appear to be much more platform specific.
The above code doesn't work if dopar.arg==NULL, even though the introduction to doParallel says that without any arguments "you will get three workers and on Unix-like systems you will get a number of workers equal to approximately half the number of cores on your system."
As the author of the future framework, I suggest that you have a look at the future.apply package, e.g.
library(future.apply)
userFunc2 <- function(listOfX) {
  res <- future_sapply(listOfX, .heavyStuff)
  return(res)
}
The default is that everything runs sequentially, but if the user wishes, they can use whatever parallel future backend they'd like, e.g.
library(future)
plan(multiprocess) # parallel on local machine - all cores by default
library(future.batchtools)
plan(batchtools_sge) # parallel on an SGE compute cluster
library(future)
plan(sequential) # sequentially
The design pattern is that you decide what to parallelize, whereas the user decides how to parallelize.
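A hedged usage sketch of that split (the toy .heavyStuff(), the worker count, and the call to userFunc2() are illustrative assumptions): the package code only declares what to run, and the user picks the backend with plan() before calling it:
library(future)
library(future.apply)

.heavyStuff <- function(x) { Sys.sleep(1); x^2 }   # stand-in for the real work
userFunc2 <- function(listOfX) future_sapply(listOfX, .heavyStuff)

plan(multisession, workers = 2)  # the user decides *how* to parallelize
userFunc2(1:4)                   # the package decided *what* to parallelize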
I'm running 200 simulations, varying one of my six parameters, using a standard R setup and a normal for loop. It takes 2 hours per variable I vary.
I was recommended to run the loop on parallel cores, and I found the foreach function and the doSNOW library. I've been able to run simple examples posted on various R blogs and on Stack Overflow on my computer, but so far I've had problems with my own function.
I get the following error:
Error in { : task 1 failed - "object 'delta' not found"
Here is a generic code describing the function:
simulation <- function(x){
  # Parameter guesses
  alpha        <- x[1]
  mean_ability <- x[2]
  delta        <- x[3]
  var          <- x[4]
  lambda_0     <- x[5]
  lambda_1     <- x[6]

  # HERE THE SIMULATION PART IS DONE

  # Put moments together
  c(lam_1_hat, lam_0_hat, delta_hat, mean_within, between_var, average_wage)
}
I put this function inside the foreach function:
foreach(kk=1:length(alpha_vec), .combine = 'c', .packages = c(...)) %dopar% {  # .packages lists the packages used
  simulation(c(lambda_1[3], lambda_0[3], delta[3], alpha[kk], var[3], mean_abil[3]))[4]
}
So I keep every element fixed except alpha in this case.
During the simulation part I compute random numbers. The set.seed() command is placed outside the foreach loop; I also tried including it inside, but the error stayed the same.
I have also tried to include the packages I use via the .packages argument of foreach.
I could make it work by including the whole function definition inside the foreach call, but that is surely not the optimal way.
Any suggestions on how to solve my problem?
I think you should add another argument to your foreach() call, namely .export, like this:
foreach(kk=1:length(alpha_vec), .combine = 'c', .packages = c(...), .export = c("simulation")) %dopar% {
  simulation(c(lambda_1[3], lambda_0[3], delta[3], alpha[kk], var[3], mean_abil[3]))[4]
}
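A minimal, self-contained sketch of the same idea (the toy simulation_toy() function, the worker count, and the loop range are made up for illustration); .export explicitly ships the named objects from the master session to each worker:
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

simulation_toy <- function(x) x^2   # stands in for the real simulation()

res <- foreach(kk = 1:4, .combine = 'c', .export = c("simulation_toy")) %dopar% {
  simulation_toy(kk)
}
res

stopCluster(cl)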
How do I get parallelization of code to work in R on Windows? Include a simple example. I'm posting this self-answered question because it was rather unpleasant to get working. You'll find that package parallel does NOT work on its own, but package snow works very well.
Posting this because it took me bloody forever to figure out. Here's a simple example of parallelization in R that will let you test whether things are working right for you and get you on the right path.
library(snow)

z <- 1:4
system.time(lapply(z, function(x) Sys.sleep(1)))

cl <- makeCluster(### YOUR NUMBER OF CORES GOES HERE ###, type="SOCK")
system.time(clusterApply(cl, z, function(x) Sys.sleep(1)))
stopCluster(cl)
You should also use the doSNOW library to register foreach with the snow cluster; this will cause many packages to parallelize automatically. The command to register is registerDoSNOW(cl) (with cl being the return value from makeCluster()), and the command that undoes the registration is registerDoSEQ(). Don't forget to turn off your clusters.
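A hedged sketch of that registration step (the worker count and the toy loop body are illustrative assumptions):
library(snow)
library(doSNOW)   # also attaches foreach

cl <- makeCluster(2, type = "SOCK")
registerDoSNOW(cl)   # foreach %dopar% now runs on the snow cluster

res <- foreach(i = 1:4, .combine = c) %dopar% i^2

registerDoSEQ()   # revert foreach to sequential execution
stopCluster(cl)   # don't forget to shut the workers down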
This worked for me; I used the doParallel package, and it required three lines of code:
# process in parallel
library(doParallel)
cl <- makeCluster(detectCores(), type='PSOCK')
registerDoParallel(cl)

# ... run the parallel-aware code here ...

# turn parallel processing off and run sequentially again:
registerDoSEQ()
stopCluster(cl)
Calculation of a random forest decreased from 180 secs to 120 secs (on a Windows computer with 4 cores).
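As a hedged illustration of the kind of loop that benefits from the registered backend (the randomForest call, the tree counts, and the iris data are assumptions, not the poster's actual model): a forest can be grown in chunks across the workers and the chunks merged afterwards.
library(doParallel)
library(randomForest)

cl <- makeCluster(4)
registerDoParallel(cl)

# grow 4 x 125 trees in parallel and merge them into one forest
rf <- foreach(ntree = rep(125, 4), .combine = randomForest::combine,
              .packages = "randomForest") %dopar% {
  randomForest(Species ~ ., data = iris, ntree = ntree)
}

registerDoSEQ()
stopCluster(cl)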
Based on the information here I was able to convert the following code into a parallelised version that worked under R Studio on Windows 7.
Original code:
#
# Basic elbow plot function
#
wssplot <- function(data, nc=20, seed=1234){
  wss <- (nrow(data)-1) * sum(apply(data, 2, var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i, iter.max=30)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of clusters",
       ylab="Within groups sum of squares")
}
Parallelised code:
library("parallel")
workerFunc <- function(nc) {
set.seed(1234)
return(sum(kmeans(my_data_frame, centers=nc, iter.max=30)$withinss)) }
num_cores <- detectCores()
cl <- makeCluster(num_cores)
clusterExport(cl, varlist=c("my_data_frame"))
values <- 1:20 # this represents the "nc" variable in the wssplot function
system.time(
result <- parLapply(cl, values, workerFunc) ) # paralel execution, with time wrapper
stopCluster(cl)
plot(values, unlist(result), type="b", xlab="Number of clusters", ylab="Within groups sum of squares")
Not suggesting it's perfect or even best, just a beginner demonstrating that parallel does seem to work under Windows. Hope it helps.
I think these libraries will help you:
foreach (facilitates executing the loop in parallel)
doSNOW (I think you already use it)
doMC (multicore functionality of the parallel package)
These articles may also help you:
http://vikparuchuri.com/blog/parallel-r-loops-for-windows-and-linux/
http://www.joyofdata.de/blog/parallel-computing-r-windows-using-dosnow-foreach/
I'm posting a cross-platform answer here because all the other answers I found were over-complicated for what I needed to accomplish. I'm using an example where I read in all the sheets of an Excel workbook.
# read in all sheets of the spreadsheet
parallel_read <- function(file){
  # detect available cores and use 70%
  numCores <- round(parallel::detectCores() * 0.70)
  sheets <- readxl::excel_sheets(file)
  # check if OS is Windows and use parLapply
  if (.Platform$OS.type == "windows") {
    cl <- parallel::makePSOCKcluster(numCores)
    on.exit(parallel::stopCluster(cl))
    dfs <- parallel::parLapply(cl, sheets, readxl::read_excel, path = file)
    return(dfs)
  } else {
    # if not Windows, use mclapply (fork-based)
    dfs <- parallel::mclapply(sheets,
                              readxl::read_excel,
                              path = file,
                              mc.cores = numCores)
    return(dfs)
  }
}
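A hypothetical usage example (the file name is an assumption; any multi-sheet workbook on disk would do):
dfs <- parallel_read("report.xlsx")  # hypothetical file name
str(dfs, max.level = 1)              # one data frame per sheet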
For what it is worth, I was running into the same problem but couldn't get any of these to work. I eventually learned that RStudio has a 'Jobs' pane and can run models in the background, each on its own core. So what I did was divvy up my model into 10 segments (it iterated over 100 vectors, so 10 scripts of 10 vectors each) and ran each as a separate job. That way, when one finished, I could use its output immediately and keep working on my script without waiting for the other models to finish. Here is a link all about using jobs: https://blog.rstudio.com/2019/03/14/rstudio-1-2-jobs/
I'm trying to use the parallel package in R for parallel operations rather than doSNOW since it's built-in and ostensibly the way the R Project wants things to go. I'm doing something wrong that I can't pin down though. Take for example this:
library(plyr)

a <- rnorm(50)
b <- rnorm(50)
arr <- matrix(cbind(a,b), nrow=50)
aaply(arr, .margin=1, function(x){x[1]+x[2]}, .parallel=F)
This works just fine, producing the sums of my two columns. But if I try to bring in the parallel package:
library(parallel)
nodes <- detectCores()
cl <- makeCluster(nodes)
setDefaultCluster(cl)
aaply(arr,.margin=1,function(x){x[1]+x[2]},.parallel=T)
It throws the error
2: In setup_parallel() : No parallel backend registered
3: executing %dopar% sequentially: no parallel backend registered
Am I initializing the backend wrong?
Try this setup:
library(doParallel)
library(plyr)
nodes <- detectCores()
cl <- makeCluster(nodes)
registerDoParallel(cl)
aaply(ozone, 1, mean,.parallel=TRUE)
stopCluster(cl)
Since I have never used plyr for parallel computing, I have no idea why it issues warnings. The result is correct anyway.
The documentation for aaply states
.parallel: if ‘TRUE’, apply function in parallel, using parallel
backend provided by foreach
so presumably you need to use the foreach package rather than the parallel package.