Exposing warnings from nodes using snow - r

I have an lapply operation that I've parallelised using snow. This works fine except that any warnings generated seem to just get ignored and are hence never shown to the user. Is there a way of exposing warnings on individual nodes so they come through in the main R process?
My best idea at the moment is to have all nodes write their warnings to files, and read those at the end, but there must be a better way!
Here's a reprex:
library(snow)
f <- function(x){
warning("mywarning")
return(NULL)
}
cl <- makeCluster(2, type="SOCK")
lapply(1:2, f) # Gives me warnings, as desired
clusterApply(cl, 1:2, f) # Gives me the same output, faster, but with no warnings

In the end I ended up switching from snow to the future.apply package (in conjunction with parallel). future.apply now has this behaviour by default.
Unfortunately in most cases the messages/warnings don't appear until the whole run has finished, but that's a whole new issue.

Related

'mc.cores' > 1 is not supported on Windows

I am new to programming and I am trying to use parallel processing for R in windows, using an existing code.
Following is the snippet of my code:
if (length(grep("linux", R.version$os)) == 1){
num_cores = detectCores()
impact_list <- mclapply(len_a, impact_func, mc.cores = (num_cores - 1))
}
# else if(length(grep("mingw32", R.version$os)) == 1){
# num_cores = detectCores()
# impact_list <- mclapply(len_a, impact_func, mc.cores = (num_cores - 1))
#
# }
else{
impact_list <- lapply(len_a, impact_func)
}
return(sum(unlist(impact_list, use.names = F)))
This works fine, I am using R on windows so the code enters in 'else' statement and it runs the code using lapply() and not by parallel processing.
I have added the 'else if' statement to make it work for windows. So when I un-comment 'else if' block of code and run it, I am getting an error "'mc.cores' > 1 is not supported on Windows".
Please suggest how can I use parallel processing in windows, so that less time is taken to run the code.
Any help will be appreciated.
(disclaimer: I'm author of the future framework here)
The future.apply package provides parallel versions of R's built-in "apply" functions. It's cross platform, i.e. it works on Linux, macOS, and Windows. The package allows you to often just replace an existing lapply() with a future_lapply() call, e.g.
library(future.apply)
plan(multisession)
your_fcn <- function(len_a) {
impact_list <- future_lapply(len_a, impact_func)
sum(unlist(impact_list, use.names = FALSE))
}
Regarding mclapply() per se: If you use parallel::mclapply() in your code, make sure that there is always an option not to use it. The reason is that it is not guaranteed to work in all environment, that is, it might be unstable and crash R. In R-devel thread 'mclapply returns NULLs on MacOS when running GAM' (https://stat.ethz.ch/pipermail/r-devel/2020-April/079384.html), the author of mclapply() wrote on 2020-04-28:
Do NOT use mcparallel() in packages except as a non-default option that user can set for the reasons Henrik explained. Multicore is intended for HPC applications that need to use many cores for computing-heavy jobs, but it does not play well with RStudio and more importantly you don't know the resource available so only the user can tell you when it's safe to use. Multi-core machines are often shared so using all detected cores is a very bad idea. The user should be able to explicitly enable it, but it should not be enabled by default.

How can I run multiple independent and unrelated functions in parallel without larger code do-over?

I've been searching around the internet, trying to understand parallel processing.
What they all seem to assume is that I have some kind of loop function operating on e.g. every Nth row of a data set divided among N cores and combined afterwards, and I'm pointed towards a lot of parallelized apply() functions.
(Warning, ugly code below)
My situation though is that I have is on form
tempJob <- myFunction(filepath, string.arg1, string.arg2)
where the path is a file location, and the string arguments are various ways of sorting my data.
My current workflow is simply amassing a lot of
tempjob1 <- myFunction(args)
tempjob2 <- myFunction(other args)
...
tempjobN <- myFunction(some other args here)
# Make a list of all temporary outputs in the global environment
temp.list <- lapply(ls(pattern = "temp"), get)
# Stack them all
df <- rbindlist(temp.list)
# Remove all variables from workspace matching "temp"
rm(list=ls(pattern="temp"))
These jobs are entirely independent, and could in principle be run in 8 separate instances of R (although that would be a bother to manage I guess). How can I separate the first 8 jobs out to 8 cores, and whenever a core finishes its job and returns a treated dataset to the global environment it'll simply take whichever job is next in line.
With the future package (I'm the author) you can achieve what you want with a minor modification to your code - use "future" assignments %<-% instead of regular assignments <- for the code you want to run asynchronously.
library("future")
plan(multisession)
tempjob1 %<-% myFunction(args)
tempjob2 %<-% myFunction(other args)
...
tempjobN %<-% myFunction(some other args here)
temp.list <- lapply(ls(pattern = "temp"), get)
EDIT 2022-01-04: plan(multiprocess) -> plan(multisession) since multiprocess is deprecated and will eventually be removed.
Unless you are unfortunate enough to be using Windows, you could maybe try with GNU Parallel like this:
parallel Rscript ::: script1.R script2.R JOB86*.R
and that would keep 8 scripts running at a time, if your CPU has 8 cores. You can change it with -j 4 if you just want 4 at a time. The JOB86 part is just random - I made it up.
You can also add switches for a progress bar, for how to handle errors, for adding parameters and distributing jobs across multiple machines.
If you are on a Mac, you can install GNU Parallel with homebrew:
brew install parallel
I think the easiest way is to use one of the parallelized apply functions. Those will do all the fiddly work of separating out the jobs, taking whichever job is next in line, etc.
Put all your arguments into a list:
args <- list(
list(filePath1, stringArgs11, stringArgs21),
list(filePath2, stringArgs12, stringArgs22),
...
list(filePath8, stringArgs18, stringArgs28)
)
Then do something like
library(parallel)
cl <- makeCluster(detectCores())
df <- parSapply(cl, args, myFunction)
I'm not sure about parSapply, and I can't check as R isn't working on my machine just now. If that doesn't work, use parLapply and then manipulate the result.

How do I parallelize in r on windows - example?

How do I get parallelizaton of code to work in r in Windows? Include a simple example. Posting this self-answered question because this was rather unpleasant to get working. You'll find package parallel does NOT work on its own, but package snow works very well.
Posting this because this took me bloody forever to figure out. Here's a simple example of parallelization in r that will let you test if things are working right for you and get you on the right path.
library(snow)
z=vector('list',4)
z=1:4
system.time(lapply(z,function(x) Sys.sleep(1)))
cl<-makeCluster(###YOUR NUMBER OF CORES GOES HERE ###,type="SOCK")
system.time(clusterApply(cl, z,function(x) Sys.sleep(1)))
stopCluster(cl)
You should also use library doSNOW to register foreach to the snow cluster, this will cause many packages to parallelize automatically. The command to register is registerDoSNOW(cl) (with cl being the return value from makeCluster()) , the command that undoes registration is registerDoSEQ(). Don't forget to turn off your clusters.
This worked for me, I used package doParallel, required 3 lines of code:
# process in parallel
library(doParallel)
cl <- makeCluster(detectCores(), type='PSOCK')
registerDoParallel(cl)
# turn parallel processing off and run sequentially again:
registerDoSEQ()
Calculation of a random forest decreased from 180 secs to 120 secs (on a Windows computer with 4 cores).
Based on the information here I was able to convert the following code into a parallelised version that worked under R Studio on Windows 7.
Original code:
#
# Basic elbow plot function
#
wssplot <- function(data, nc=20, seed=1234){
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i, iter.max=30)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of clusters",
ylab="Within groups sum of squares")
}
Parallelised code:
library("parallel")
workerFunc <- function(nc) {
set.seed(1234)
return(sum(kmeans(my_data_frame, centers=nc, iter.max=30)$withinss)) }
num_cores <- detectCores()
cl <- makeCluster(num_cores)
clusterExport(cl, varlist=c("my_data_frame"))
values <- 1:20 # this represents the "nc" variable in the wssplot function
system.time(
result <- parLapply(cl, values, workerFunc) ) # paralel execution, with time wrapper
stopCluster(cl)
plot(values, unlist(result), type="b", xlab="Number of clusters", ylab="Within groups sum of squares")
Not suggesting it's perfect or even best, just a beginner demonstrating that parallel does seem to work under Windows. Hope it helps.
I think these libraries will help you:
foreach (facilitates executing the loop in parallel)
doSNOW (I think you already use it)
doMC (multicore functionality of the parallel package)
May these article also help you
http://vikparuchuri.com/blog/parallel-r-loops-for-windows-and-linux/
http://www.joyofdata.de/blog/parallel-computing-r-windows-using-dosnow-foreach/
I'm posting a cross-platform answer here because all the other answers I found were over-complicated for what I needed to accomplish. I'm using an example where I'm reading in all sheets of an excel workbook.
# read in the spreadsheet
parallel_read <- function(file){
# detect available cores and use 70%
numCores = round(parallel::detectCores() * .70)
# check if os is windows and use parLapply
if(.Platform$OS.type == "windows") {
cl <- makePSOCKcluster(numCores)
parLapply(cl, file, readxl::read_excel)
stopCluster(cl)
return(dfs)
# if not Windows use mclapply
} else {
dfs <-parallel::mclapply(excel_sheets(file),
readxl::read_excel,
path = file,
mc.cores=numCores)
return(dfs)
}
}
For what it is worth. I was running into the same problem but couldn't get any of these to work. I eventually learned that Rstudio has a 'jobs' pane and can run models in the background each on it's own core. so what I did was divy-up my model into 10 segments (it was iterative over 100 vectors so 10 scripts of 10 vectors each) and ran each as a separate job. that way when one finished I could use the output from it immediately and I could keep working on my script without waiting for each model to finish. here is the link all about using jobs https://blog.rstudio.com/2019/03/14/rstudio-1-2-jobs/

Parallelization in R: how to "source" on every node?

I have created parallel workers (all running on the same machine) using:
MyCluster = makeCluster(8)
How can I make every of these 8 nodes source an R-file I wrote?
I tried:
clusterCall(MyCluster, source, "myFile.R")
clusterCall(MyCluster, 'source("myFile.R")')
And several similar versions. But none worked.
Can you please help me to find the mistake?
Thank you very much!
The following code serves your purpose:
library(parallel)
cl <- makeCluster(4)
clusterCall(cl, function() { source("test.R") })
## do some parallel work
stopCluster(cl)
Also you can use clusterEvalQ() to do the same thing:
library(parallel)
cl <- makeCluster(4)
clusterEvalQ(cl, source("test.R"))
## do some parallel work
stopCluster(cl)
However, there is subtle difference between the two methods. clusterCall() runs a function on each node while clusterEvalQ() evaluates an expression on each node. If you have a variable list of files to source, clusterCall() will be easier to use since clusterEvalQ(cl,expr) will regard any expr as an expression so it's not convenient to put a variable there.
If you use a command to source a local file, ensure the file is there.
Else place the file on a network share or NFS, and source the absolute path.
Better still, and standard answers, write a package and have that package installed on each node and then just call library() or require().

Error when using %dopar% instead of %do% in R (package doParallel)

I've come up with a strange error.
Suppose I have 10 xts objects in a list called data. I now search for every three combinations using
data_names <- names(data)
combs <- combn(data_names, 3)
My basic goal is to do a PCA on those 1080 triples.
To speed things up I wanted do use the package doParallel. So here is the snippet shortened till the point where the error occurs:
list <- foreach(i=1:ncol(combs)) %dopar% {
tmp_triple <- combs[,i]
p1<-data[tmp_triple[[1]]][[1]]
p2<-data[tmp_triple[[2]]][[1]]
p3<-data[tmp_triple[[3]]][[1]]
data.merge <- merge(p1,p2,p3,all=FALSE)
}
Here, the merge function seems to be the problem. The error is
task 1 failed - "cannot coerce class 'c("xts", "zoo")' into a data.frame"
However, when changing %dopar% to a normal serial %do% everything works as accepted.
Till now I was not able to find any solution to this problem and I'm not even sure what to look for.
A better solution rather than explicitly loading the libraries within the function would be to utilise the .packages argument of the foreach() function:
list <- foreach(i=1:ncol(combs),.packages=c("xts","zoo")) %dopar% {
tmp_triple <- combs[,i]
p1<-data[tmp_triple[[1]]][[1]]
p2<-data[tmp_triple[[2]]][[1]]
p3<-data[tmp_triple[[3]]][[1]]
data.merge <- merge(p1,p2,p3,all=FALSE)
}
The problem is likely that you haven't called library(xts) on each of the workers. You don't say what backend you're using, so I can't be 100% sure.
If that's the problem, then this code will fix it:
list <- foreach(i=1:ncol(combs)) %dopar% {
library(xts)
tmp_triple <- combs[,i]
p1<-data[tmp_triple[[1]]][[1]]
p2<-data[tmp_triple[[2]]][[1]]
p3<-data[tmp_triple[[3]]][[1]]
data.merge <- merge(p1,p2,p3,all=FALSE)
}
Quick fix for problem with foreach %dopar% is to reinstall these packages:
install.packages("doSNOW")
install.packages("doParallel")
install.packages("doMPI")
These are responsible for parallelism in R. Bug which existed in old versions of these packages is now removed. It worked in my case.

Resources