I am using the example code posted here to show a progress_bar (from the progress package) with doParallel + foreach. Solutions there however make use of doSNOW (e.g. code by Dewey Brooke that I am using for testing), which is more outdated than doParallel and returns this NOTE when building a package with CRAN flags:
Uses the superseded package: ‘doSNOW (>= 1.0.19)’
Making this change does not seem as straightforward as expected. If registerDoSNOW is simply replaced by registerDoParallel, and .options.snow by .options.doparallel, the code will still run, but it will not show any progress bar at all.
I think this might relate to the use of .options.X. This part of the code is very obscure to me: .options.snow does work when using doSNOW, but there is no documentation in the foreach man page about this argument. It is therefore not surprising that .options.doparallel does not work, since it was just a wild guess on my part.
Including the call to pb$tick() within the foreach loop does not work either, and actually causes the result to be wrong. So I am really out of ideas about where this should go in the code.
Where does .options.snow come from? Where should pb$tick() go? How can the progress_bar object be shown when using doParallel here?
I paste below the code (doSNOW replaced by doParallel) for convenience, but credit again the original source:
library(parallel)
library(doParallel)
numCores <- detectCores()
cl <- makeCluster(numCores)
registerDoParallel(cl)

# progress bar ------------------------------------------------------------
library(progress)
iterations <- 100 # used for the foreach loop
pb <- progress_bar$new(
  format = "letter = :letter [:bar] :elapsed | eta: :eta",
  total = iterations, # 100
  width = 60)
progress_letter <- rep(LETTERS[1:10], 10) # token reported in progress bar

# allowing progress bar to be used in foreach -----------------------------
progress <- function(n){
  pb$tick(tokens = list(letter = progress_letter[n]))
}
opts <- list(progress = progress)

# foreach loop ------------------------------------------------------------
library(foreach)
foreach(i = 1:iterations, .combine = rbind, .options.doparallel = opts) %dopar% {
  summary(rnorm(1e6))[3]
}
stopCluster(cl)
doParallel still uses the .options.snow argument for whatever reason. Found this little tidbit in the doParallel documentation.
The doParallel backend supports both multicore and snow options passed through the foreach function. The supported multicore options are preschedule, set.seed, silent, and cores, which are analogous to the similarly named arguments to mclapply, and are passed using the .options.multicore argument to foreach. The supported snow options are preschedule, which like its multicore analog can be used to chunk the tasks so that each worker gets a prescheduled chunk of tasks, and attachExportEnv, which can be used to attach the export environment in certain cases where R's lexical scoping is unable to find a needed export. The snow options are passed to foreach using the .options.snow argument.
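As a minimal sketch illustrating the quoted documentation (the cluster size and option value are arbitrary), the recognized snow options are passed to a doParallel cluster backend like this:
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
# preschedule is one of the two snow options doParallel recognizes;
# a "progress" entry here would be ignored with a warning (see the excerpts below)
res <- foreach(i = 1:4, .combine = c,
               .options.snow = list(preschedule = TRUE)) %dopar% {
  sqrt(i)
}
stopCluster(cl)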
foreach is a powerful package, but whoever maintains it makes odd decisions.
EDIT
doParallel does not support the progress multicore option. Therefore, a progress bar will NOT display if registerDoParallel is used instead of registerDoSNOW.
While doSNOW has been superseded, it's unclear whether one is more outdated than the other, since both have undergone very few changes other than updating the current maintainer (doParallel | doSNOW).
doSNOW
doSNOW:::doSNOW <- function (obj, expr, envir, data)
{
cl <- data
preschedule <- FALSE
attachExportEnv <- FALSE
progressWrapper <- function(...) NULL # <- CRITICAL DIFFERENCE
if (!inherits(obj, "foreach"))
stop("obj must be a foreach object")
it <- iter(obj)
accumulator <- makeAccum(it)
options <- obj$options$snow
if (!is.null(options)) {
nms <- names(options)
recog <- nms %in% c("preschedule", "attachExportEnv",
"progress") # <- CRITICAL DIFFERENCE
if (any(!recog))
warning(sprintf("ignoring unrecognized snow option(s): %s",
paste(nms[!recog], collapse = ", ")), call. = FALSE)
...
doParallel
doParallel:::doParallelSNOW <- function (obj, expr, envir, data)
{
cl <- data
preschedule <- FALSE
attachExportEnv <- FALSE
# MISSING: progressWrapper <- function(...) NULL
if (!inherits(obj, "foreach"))
stop("obj must be a foreach object")
it <- iter(obj)
accumulator <- makeAccum(it)
options <- obj$options$snow
if (!is.null(options)) {
nms <- names(options)
recog <- nms %in% c("preschedule", "attachExportEnv") # MISSING: "progress"
if (any(!recog))
warning(sprintf("ignoring unrecognized snow option(s): %s",
paste(nms[!recog], collapse = ", ")), call. = FALSE)
...
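A quick way to check this difference yourself (a hedged sketch; it simply searches the deparsed source of each internal worker function for the string "progress"):
any(grepl("progress", deparse(doSNOW:::doSNOW)))              # TRUE, per the excerpt above
any(grepl("progress", deparse(doParallel:::doParallelSNOW)))  # FALSE, per the excerpt above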
Related
I am developing an R package and trying to use parallel processing in it for an embarrassingly parallel problem. I would like to write a loop or functional that uses the other functions from my package. I am working in Windows, and I have tried using parallel::parLapply and foreach::%dopar%, but cannot get the workers (cores) to access the functions in my package.
Here's an example of a simple package with two functions, where the second calls the first inside a parallel loop using %dopar%:
add10 <- function(x) x + 10

slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
  foreach::foreach(i = 1:m) %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
}
When I load the package with devtools::load_all() and call the slowadd function, Error in { : task 1 failed - "could not find function "add10"" is returned.
I have also tried explicitly initializing the workers with my package:
add10 <- function(x) x + 10

slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
  foreach::foreach(i = 1:m, .packages = 'mypackage') %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
}
but I get the error Error in e$fun(obj, substitute(ex), parent.frame(), e$data) : worker initialization failed: there is no package called 'mypackage'.
How can I get the workers to access the functions in my package? A solution using foreach would be great, but I'm completely open to solutions using parLapply or other functions/packages.
I was able to initialize the workers with my package's functions, thanks to people's helpful comments. By making sure that all of the package functions that were needed were exported in the NAMESPACE and installing my package with devtools::install(), foreach was able to find the package for initialization. The R script for the example would look like this:
#' @export
add10 <- function(x) x + 10

#' @export
slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
  out <- foreach::foreach(i = 1:m, .packages = 'mypackage') %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
  return(out)
}
This is working, but it's not an ideal solution. First, it makes for a much slower workflow. I was using devtools::load_all() every time I made a change to the package and wanted to test it (before incorporating parallelism), but now I have to reinstall the package every time, which is slow when the package is large. Second, every function that is needed in the parallel loop needs to be exported so that foreach can find it. My actual use case has a lot of small utility functions which I would rather keep internal.
You can use devtools::load_all() inside the foreach loop or load the functions you need with source.
out <- foreach::foreach(i = 1:m) %dopar% {
  Sys.sleep(1)
  source("R/some_functions.R")
  load("R/sysdata.rda")
  add10(i)
}
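If reinstalling after every change is too slow, a hedged alternative is to run devtools::load_all() once on each worker before the loop instead of on every iteration (this sketch assumes the working directory is the package root and that devtools is available on the workers):
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)
# load the in-development package on each worker once, so its (unexported) functions are available there
parallel::clusterCall(cl, function() devtools::load_all("."))
`%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
out <- foreach::foreach(i = 1:4) %dopar% {
  add10(i) # found through the package environment attached by load_all() on the worker
}
parallel::stopCluster(cl)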
A few days ago I asked this question about calling a custom-made function within a loop, which was well resolved by a combination of
eval(parse(text = Function text))
Here is the link: Automatic creation and use of custom made function in R.
This allowed me to work with a for loop and automatically call the function I need from a data frame storing the body of the function to create.
Now I would like to bring the question to the next level. My problem is computation time. I need to evaluate something like 52 indices from a hyperspectral image. In R the hyperspectral image is loaded as a 3D array of 512x512x204 (the third dimension being the bands).
What I would like to do is run the evaluation of the indices in parallel to reduce the computation time.
Here is a dummy example of what I would like to emulate, but without parallel computing.
# create a fake array representing my hyperspectral image
library(fields) # provides image.plot(), used below
HYPR_IMG=array(NA,dim=c(5,3,4))
HYPR_IMG[,,1]=1
HYPR_IMG[,,2]=2
HYPR_IMG[,,3]=3
HYPR_IMG[,,4]=4
image.plot(HYPR_IMG[,,1], zlim=c(0,20))
image.plot(HYPR_IMG[,,2], zlim=c(0,20))
image.plot(HYPR_IMG[,,3], zlim=c(0,20))
image.plot(HYPR_IMG[,,4], zlim=c(0,20))
#create a fake DF for simulating my indices stored in the dataframe
IDXname=c("IDX1","IDX2","IDX3","IDX4")
IDXFunc=c("HYPR_IMG[,,1] + 3*HYPR_IMG[,,2]",
"HYPR_IMG[,,3] + HYPR_IMG[,,2]",
"HYPR_IMG[,,4] + HYPR_IMG[,,2] - HYPR_IMG[,,3]",
"HYPR_IMG[,,1] + HYPR_IMG[,,4] + 4*HYPR_IMG[,,2] + HYPR_IMG[,,3]")
IDX_DF=as.data.frame(cbind(IDXname,IDXFunc))
# that was what I did before
Store_DF = data.frame(NA)
for (i in 1:length(IDX_DF$IDXname)) {
  IDX_ID = IDX_DF$IDXname[i]
  IDX_Fun_tmp = IDX_DF$IDXFunc[which(IDX_DF$IDXname == IDX_ID)] # used for extra care to select the right function
  IDXFunc_call = paste("IDXfun_tmp=function(HYPR_IMG){", IDX_Fun_tmp, "}", sep = "")
  eval(parse(text = IDXFunc_call))
  IDX_VAL = IDXfun_tmp(HYPR_IMG)
  image.plot(IDX_VAL, zlim = c(0,20)); title(main = IDX_ID)
  temp_DF = as.vector(IDX_VAL)
  Store_DF = cbind(Store_DF, temp_DF)
  names(Store_DF)[i+1] <- as.vector(IDX_ID)
}
My final goal is to have the very same Store_DF, storing all the index values. Here I have a for loop, but with a foreach loop things should speed up. If it matters, I am working with Windows 8 or later as the OS.
Is it really possible?
Will I be able, in the end, to reduce the overall computation time while still getting the same Store_DF data frame, or something similar such as a matrix with column names?
Thanks a lot!!!
For this specific example, either the built-in parallelization of a package like data.table or a parallel apply might be more beneficial.
Below is a minimal example of how to achieve the results using parApply from the parallel package. Note that the output is a matrix, which actually yields slightly better performance in base R (not necessarily the case in the tidyverse or data.table). If the data.frame structure is vital, you will have to convert it with as.data.frame.
cl <- parallel::makeCluster( parallel::detectCores() )
result <- parallel::parApply(cl = cl, X = IDX_DF, MARGIN = 1, FUN = function(x, IMAGES){
  IDX_ID <- x[["IDXname"]]
  eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", x[["IDXFunc"]], "}")))
  IDX_VAL <- as.vector(IDXfun_tmp(IMAGES))
  names(IDX_VAL) <- IDX_ID
  IDX_VAL
}, IMAGES = HYPR_IMG)
colnames(result) = IDXname
parallel::stopCluster(cl)
Please note the stopCluster(cl) which is important to shut down any loose R sessions.
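A related housekeeping sketch (not part of the answer above; the function name is illustrative): when the cluster is created inside a function, on.exit() guarantees it is shut down even if the parallel step errors.
run_parallel <- function(n = 4) {
  cl <- parallel::makeCluster(parallel::detectCores())
  on.exit(parallel::stopCluster(cl), add = TRUE) # runs on normal exit and on error
  parallel::parSapply(cl, seq_len(n), sqrt)
}
run_parallel()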
Benchmark results (4 tiny cores):
Unit: milliseconds
expr min lq mean median uq max neval
Loop 8.420432 9.027583 10.426565 9.272444 9.943783 26.58623 100
Parallel 1.382324 1.491634 2.038024 1.554690 1.907728 18.23942 100
For replications of benchmarks the code has been provided below:
cl <- parallel::makeCluster( parallel::detectCores() )
microbenchmark::microbenchmark(
Loop = {
Store_DF=data.frame(NA)
for (i in 1: length(IDX_DF$IDXname)) {
IDX_ID = IDX_DF$IDXname[i]
IDX_Fun_tmp = IDX_DF$IDXFunc[which(IDX_DF$IDXname == IDX_ID)] #use for extra care to select the right function
eval(parse(text = paste0("IDXfun_tmp = function(HYPR_IMG){", IDX_Fun_tmp, "}")))
IDX_VAL = IDXfun_tmp(HYPR_IMG)
#Plotting in parallel is not a good idea. It will most often not work, and might make the R session crash or slow down significantly (at best the latter, at worst the former)
#image.plot(IDX_VAL, zlim = c(0,20)); title(main = IDX_ID)
temp_DF = as.vector(IDX_VAL)
Store_DF = cbind(Store_DF,temp_DF)
names(Store_DF)[i+1] <- as.vector(IDX_ID)
}
rm(Store_DF)
},
Parallel = {
result <- parallel::parApply(cl = cl, X = IDX_DF, MARGIN = 1, FUN = function(x, IMAGES){
IDX_ID <- x[["IDXname"]]
eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", x[["IDXFunc"]], "}")))
IDX_VAL <- as.vector(IDXfun_tmp(IMAGES))
names(IDX_VAL) <- IDX_ID
IDX_VAL
}, IMAGES = HYPR_IMG)
colnames(result) = IDXname
rm(result)
}
)
parallel::stopCluster(cl)
Edit: Using the foreach package
After a few comments about performance issues (maybe due to memory), I decided to make an illustration of how one could obtain the same result using the foreach package. A few notes:
The foreach package uses iterators. By default it can be used like a for loop, where it will iterate over each column in a data.frame.
Like other parallel implementations in R, if you are on Windows you will often have to export the data used for the calculations. This can sometimes be avoided with some fiddling, and foreach will sometimes let you get away without exporting data; when exactly this is the case is unclear from the documentation.
The output of foreach will be combined into a list by default, or as defined by the .combine argument, which can be rbind, cbind or any other function.
There are a lot of comments, making the code seem a lot longer than it actually is. Removing comments and blank lines, it is 9 lines longer.
Below is the code which will yield the same output as above. Note that I have used the data.table package. For more information about this package, I suggest its wiki on GitHub.
cl <- parallel::makeCluster( parallel::detectCores() )
#foreach uses doParallel for the parallelization
doParallel::registerDoParallel(cl)
#To iterate over the rows, we need to use iterators
# if foreach is given a matrix, it will be converted to a column iterator
rowIterator <- iterators::iter(IDX_DF, by = "row")
library(foreach)
result <-
foreach(
#Supply the iterator
row = rowIterator,
#Specify if the calculations need to be done in order. If not, we can get better performance by not requiring it
.inorder = FALSE,
#In most foreach loops you will have to export the data you need for the calculations
# it worked without doing so for me, in which case it is faster if the exported stuff is large
#.export = c("HYPR_IMG"),
#We need to say how the output should be merged. If nothing is given it will be output as a list
#data.table rbindlist is faster than rbind (returns a data.table)
.combine = function(...)data.table::rbindlist(list(...)) ,
#otherwise we could've used:
#.combine = rbind
#if we dont use rbind or cbind (i used data.table::rbindlist for speed)
# we will have to tell if it can take more than 1 argument
.multicombine = TRUE
  ) %dopar% { #to do stuff in parallel we use %dopar%; use %do% for a sequential loop, and %:% to split nested foreach loops
IDX_ID <- row[["IDXname"]]
eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", row[["IDXFunc"]], "}")))
IDX_VAL <- as.vector(IDXfun_tmp(HYPR_IMG))
data.frame(ID = IDX_ID, IDX_VAL)
}
#output is saved in result
result
result_reformatted <- data.table::dcast(result[, indx := 1:.N, by = ID],
indx~ID,
value.var = "IDX_VAL")
#if we don't want to use data.table we could use unstack instead
unstack(result, IDX_VAL ~ ID)
I have a function where I dynamically build multiple formulas as strings and cast them to formulas with as.formula. I then call that function in a parallel process using doSNOW and foreach and use those formulas through dplyr::mutate_.
When I use lapply(formula_list, as.formula) I get the error could not find function *custom_function* when run in parallel, though it works fine when run locally. However, when I use lapply(formula_list, function(x) as.formula(x)) it works both in parallel and locally.
Why? What's the correct way to understand the environments here and the "right" way to code it?
I do get a warning that says: In e$fun(obj, substitute(ex), parent.frame(), e$data) : already exporting variable(s): *custom_func*
A minimal reproducible example is below.
# Packages
library(dplyr)
library(doParallel)
library(doSNOW)
library(foreach)
# A simple custom function
custom_sum <- function(x){
sum(x)
}
# Functions that create formulas and use them with standard-evaluation dplyr:
dplyr_mut_lapply_reg <- function(df){
  my_dots <- setNames(
    object = lapply(list("~custom_sum(Sepal.Length)"), as.formula),
    nm = c("Sums")
  )
  return(
    df %>%
      group_by(Species) %>%
      mutate_(.dots = my_dots)
  )
}

dplyr_mut_lapply_lambda <- function(df){
  my_dots <- setNames(
    object = lapply(list("~custom_sum(Sepal.Length)"), function(x) as.formula(x)),
    nm = c("Sums")
  )
  return(
    df %>%
      group_by(Species) %>%
      mutate_(.dots = my_dots)
  )
}
#1. CALLING BOTH LOCALLY
dplyr_mut_lapply_lambda(iris) #works
dplyr_mut_lapply_reg(iris) #works
#2. CALLING IN PARALLEL
#Faux Parallel Setup
cl <- makeCluster(1, outfile="")
registerDoSNOW(cl)
# Call Lambda Version WORKS
foreach(j = 1,
        .packages = c("dplyr", "tidyr"),
        .export = lsf.str()
) %dopar% {
  dplyr_mut_lapply_lambda(iris)
}

# Call Regular Version FAILS
foreach(j = 1,
        .packages = c("dplyr", "tidyr"),
        .export = lsf.str()
) %dopar% {
  dplyr_mut_lapply_reg(iris)
}
# Close Cluster
stopCluster(cl)
EDIT: In my original post title I wrote that I was using nse, but I really meant using standard evaluation. Whoops. I have changed this accordingly.
I don't have an exact answer as to why here, but the future package (I'm the author) handles these types of "tricky" globals - they are tricky because they are not part of a package and they are nested, i.e. one global calls another global. For example, if you use:
library("doFuture")
cl <- parallel::makeCluster(1, outfile = "")
plan(cluster, workers = cl)
registerDoFuture()
that problematic "Call Regular Version FAILS" case should now work.
Now, the above uses parallel::makeCluster(), which defaults to type = "PSOCK", whereas if you load doSNOW you get snow::makeCluster(), which defaults to type = "MPI". Unfortunately, a full MPI backend is not yet implemented for the future package. Thus, if you're looking for an MPI solution, this won't help you (yet).
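For completeness, a usage sketch with the backend registered as above (assuming doFuture is installed); the previously failing call is simply retried unchanged:
foreach(j = 1,
        .packages = c("dplyr", "tidyr"),
        .export = lsf.str()
) %dopar% {
  dplyr_mut_lapply_reg(iris)
}
parallel::stopCluster(cl)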
I'm having issues combining the process that I want to run in parallel with the creation of the progress bar.
My code for the process is:
pred_pnn <- function(x, nn){
xlst <- split(x, 1:nrow(x))
pred <- foreach(i = xlst,.packages = c('tcltk', 'foreach'), .combine = rbind)
%dopar%
{ mypb <- tkProgressBar(title = "R progress bar", label = "",
min = 0, max = max(jSeq), initial = 0, width = 300)
foreach(j = jSeq) %do% {Sys.sleep(.1)
setTkProgressBar(mypb, j, title = "pb", label = NULL)
}
library(pnn)
data.frame(prob = guess(nn, as.matrix(i))$probabilities[1], row.names = NULL)
}
}
I combined my code with the one that comes from here, but it didn't compile. I get a syntax error, but I can't find it.
I tried this other code:
pred_pnn <- function(x, nn){
xlst <- split(x, 1:nrow(x))
pred <- foreach(i = xlst, .combine = rbind) %dopar%
{library(pnn)
cat(i, '\n')
data.frame(prob = guess(nn, as.matrix(i))$probabilities[1], row.names = NULL)
}
}
But I get an error too.
The approach that you're trying to use might work under certain circumstances, but it isn't a good general solution. What I would want to do is to create a progress bar in the master process (outside of the foreach loop) and then have foreach update that progress bar as tasks are returned. Unfortunately, none of the backends support that. It's possible to do that using combine function tricks, but only if you're using a backend that supports calling the combine function on-the-fly, which doParallel, doSNOW and doMC do not. Those backends don't call combine on the fly because they are implemented using functions such as clusterApplyLB and mclapply which don't support a hook to allow user supplied code to be executed when tasks are returned.
Because I've seen interest in progress bar support in foreach, I modified the doSNOW package to add support for a doSNOW-specific "progress" option, and I checked the code into the R-Forge website. It makes use of some lower level functions in the snow package which unfortunately are not exported by the parallel package.
If you want to try out this new feature, you will need to install doSNOW from R-Forge. I did this on my MacBook Pro using the command:
install.packages("doSNOW", repos="http://R-Forge.R-project.org", type="source")
Here is a simple example script that demonstrates the experimental "progress" option:
library(doSNOW)
library(tcltk)
cl <- makeSOCKcluster(3)
registerDoSNOW(cl)
pb <- tkProgressBar(max=100)
progress <- function(n) setTkProgressBar(pb, n)
opts <- list(progress=progress)
r <- foreach(i=1:100, .options.snow=opts) %dopar% {
Sys.sleep(1)
sqrt(i)
}
Update
The progress option is now available in the latest version of doSNOW on CRAN.
There are some informative posts on how to create a counter for loops in an R program. However, how do you create a similar function when using the parallelized version with "foreach()"?
Edit: After an update to the doSNOW package, it has become quite simple to display a nice progress bar when using %dopar%, and it works on Linux, Windows and OS X.
doSNOW now officially supports progress bars via the .options.snow argument.
library(doSNOW)
cl <- makeCluster(2)
registerDoSNOW(cl)
iterations <- 100
pb <- txtProgressBar(max = iterations, style = 3)
progress <- function(n) setTxtProgressBar(pb, n)
opts <- list(progress = progress)
result <- foreach(i = 1:iterations, .combine = rbind,
.options.snow = opts) %dopar%
{
s <- summary(rnorm(1e6))[3]
return(s)
}
close(pb)
stopCluster(cl)
Yet another way of tracking progress, if you keep in mind the total number of iterations, is to set .verbose = TRUE, as this will print to the console which iterations have been finished.
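A minimal illustration of that tip (the cluster size and iteration count are arbitrary):
library(doSNOW)
cl <- makeCluster(2)
registerDoSNOW(cl)
# .verbose = TRUE prints scheduling details, including which results have been returned
res <- foreach(i = 1:4, .combine = c, .verbose = TRUE) %dopar% sqrt(i)
stopCluster(cl)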
Previous solution for Linux and OS X
On Ubuntu 14.04 (64 bit) and OS X (El Capitan) the progress bar is displayed even when using %dopar% if outfile = "" is set in the makeCluster function. It does not seem to work under Windows. From the help on makeCluster:
outfile: Where to direct the stdout and stderr connection output from the workers. "" indicates no redirection (which may only be useful for workers on the local machine). Defaults to ‘/dev/null’ (‘nul:’ on Windows).
Example code:
library(foreach)
library(doSNOW)
cl <- makeCluster(4, outfile="") # number of cores. Notice 'outfile'
registerDoSNOW(cl)
iterations <- 100
pb <- txtProgressBar(min = 1, max = iterations, style = 3)
result <- foreach(i = 1:iterations, .combine = rbind) %dopar%
{
s <- summary(rnorm(1e6))[3]
setTxtProgressBar(pb, i)
return(s)
}
close(pb)
stopCluster(cl)
The resulting progress bar looks a little odd, since a new bar is printed for every progression of the bar, and because a worker may lag a bit, which causes the progress bar to go back and forth occasionally.
You can also get this to work with the progress package.
# loading parallel and doSNOW package and creating cluster ----------------
library(parallel)
library(doSNOW)
numCores<-detectCores()
cl <- makeCluster(numCores)
registerDoSNOW(cl)
# progress bar ------------------------------------------------------------
library(progress)
iterations <- 100 # used for the foreach loop
pb <- progress_bar$new(
  format = "letter = :letter [:bar] :elapsed | eta: :eta",
  total = iterations, # 100
  width = 60)
progress_letter <- rep(LETTERS[1:10], 10) # token reported in progress bar

# allowing progress bar to be used in foreach -----------------------------
progress <- function(n){
  pb$tick(tokens = list(letter = progress_letter[n]))
}
opts <- list(progress = progress)

# foreach loop ------------------------------------------------------------
library(foreach)
foreach(i = 1:iterations, .combine = rbind, .options.snow = opts) %dopar% {
  summary(rnorm(1e6))[3]
}
stopCluster(cl)
This code is a modified version of the doRedis example, and will make a progress bar even when using %dopar% with a parallel backend:
#Load Libraries
library(foreach)
library(utils)
library(iterators)
library(doParallel)
library(snow)
#Choose number of iterations
n <- 1000
#Progress combine function
f <- function(){
  pb <- txtProgressBar(min = 1, max = n - 1, style = 3)
  count <- 0
  function(...) {
    count <<- count + length(list(...)) - 1
    setTxtProgressBar(pb, count)
    Sys.sleep(0.01)
    flush.console()
    c(...)
  }
}
#Start a cluster
cl <- makeCluster(4, type='SOCK')
registerDoParallel(cl)
# Run the loop in parallel
k <- foreach(i = icount(n), .final=sum, .combine=f()) %dopar% {
log2(i)
}
head(k)
#Stop the cluster
stopCluster(cl)
You have to know the number of iterations and the combination function ahead of time.
This is now possible with the parallel package. Tested with R 3.2.3 on OSX 10.11, running inside RStudio, using a "PSOCK"-type cluster.
library(doParallel)
# default cluster type on my machine is "PSOCK", YMMV with other types
cl <- parallel::makeCluster(4, outfile = "")
registerDoParallel(cl)
n <- 10000
pb <- txtProgressBar(0, n, style = 3)
invisible(foreach(i = icount(n)) %dopar% {
setTxtProgressBar(pb, i)
})
stopCluster(cl)
Strangely, it only displays correctly with style = 3.
You save the start time with Sys.time() before the loop. Loop over rows or columns or something whose total you know. Then, inside the loop, you can calculate the time run so far (see difftime), the percentage complete, the speed, and the estimated time left. Each process can print those progress lines with the message function. You'll get output something like
1/1000 complete # 1 items/s, ETA: 00:00:45
2/1000 complete # 1 items/s, ETA: 00:00:44
Obviously the looping order will greatly affect how well this works. I don't know about foreach, but with multicore's mclapply you'd get good results using mc.preschedule=FALSE, which means that items are allocated to processes one by one, in order, as previous items complete.
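A minimal sketch of this manual-ETA idea with foreach and doParallel (the names and message format are illustrative; outfile = "" is set so worker messages reach the console, and the rate is only approximate since i is an iteration index rather than a per-worker completion count):
library(foreach)
library(doParallel)
cl <- makeCluster(2, outfile = "") # outfile = "" so worker messages are shown
registerDoParallel(cl)
n <- 100
start_time <- Sys.time()
res <- foreach(i = 1:n, .combine = c) %dopar% {
  Sys.sleep(0.05) # stand-in for real work
  elapsed <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))
  rate <- i / elapsed # approximate items per second
  message(sprintf("%d/%d complete # %.1f items/s, ETA: %.0f s", i, n, rate, (n - i) / rate))
  log2(i)
}
stopCluster(cl)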
This code implements a progress bar tracking a parallelized foreach loop using the doMC backend, and using the excellent progress package in R. It assumes that all cores, specified by numCores, do an approximately equal amount of work.
library(foreach)
library(doMC)
library(progress)
iterations <- 100
numCores <- 8
registerDoMC(cores=numCores)
pbTracker <- function(pb, i, numCores) {
  if (i %% numCores == 0) {
    pb$tick()
  }
}
pb <- progress_bar$new(
  format = " progress [:bar] :percent eta: :eta",
  total = iterations / numCores, clear = FALSE, width = 60)
output = foreach(i = 1:iterations) %dopar% {
  pbTracker(pb, i, numCores)
  Sys.sleep(1/20)
}
The following code will produce a nice progress bar in R for the foreach control structure. It will also work with graphical progress bars by replacing txtProgressBar with the desired progress bar object.
# Gives us the foreach control structure.
library(foreach)
# Gives us the progress bar object.
library(utils)
# Gives us icount(), used for the loop counter below.
library(iterators)
# Some number of iterations to process.
n <- 10000
# Create the progress bar.
pb <- txtProgressBar(min = 1, max = n, style=3)
# The foreach loop we are monitoring. This foreach loop will log2 all
# the values from 1 to n and then sum the result.
k <- foreach(i = icount(n), .final=sum, .combine=c) %do% {
  setTxtProgressBar(pb, i)
  log2(i)
}
# Close the progress bar.
close(pb)
While the code above answers your question in its most basic form, a better and much harder question to answer is whether you can create an R progress bar which monitors the progress of a foreach statement when it is parallelized with %dopar%. Unfortunately I don't think it is possible to monitor the progress of a parallelized foreach in this way, but I would love for someone to prove me wrong, as it would be a very useful feature.