doParallel: combining a progress bar with a process run in parallel - r

I'm having trouble combining the process that I want to run in parallel with the creation of the progress bar.
My code for the process is:
pred_pnn <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  pred <- foreach(i = xlst, .packages = c('tcltk', 'foreach'), .combine = rbind)
  %dopar%
  {
    mypb <- tkProgressBar(title = "R progress bar", label = "",
                          min = 0, max = max(jSeq), initial = 0, width = 300)
    foreach(j = jSeq) %do% {
      Sys.sleep(.1)
      setTkProgressBar(mypb, j, title = "pb", label = NULL)
    }
    library(pnn)
    data.frame(prob = guess(nn, as.matrix(i))$probabilities[1], row.names = NULL)
  }
}
I combined my code with the one that comes from here, but it didn't run: I get a syntax error, and I can't find it.
I tried this other code:
pred_pnn <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  pred <- foreach(i = xlst, .combine = rbind) %dopar% {
    library(pnn)
    cat(i, '\n')
    data.frame(prob = guess(nn, as.matrix(i))$probabilities[1], row.names = NULL)
  }
}
But I get an error too.

The approach that you're trying to use might work under certain circumstances, but it isn't a good general solution. What I would want to do is create a progress bar in the master process (outside of the foreach loop) and then have foreach update that progress bar as tasks are returned. Unfortunately, none of the backends support that. It's possible to do it using combine-function tricks, but only with a backend that supports calling the combine function on the fly, which doParallel, doSNOW and doMC do not. Those backends don't call combine on the fly because they are implemented using functions such as clusterApplyLB and mclapply, which don't provide a hook for user-supplied code to run when tasks are returned.
Because I've seen interest in progress bar support in foreach, I modified the doSNOW package to add support for a doSNOW-specific "progress" option, and I checked the code into the R-Forge website. It makes use of some lower level functions in the snow package which unfortunately are not exported by the parallel package.
If you want to try out this new feature, you will need to install doSNOW from R-Forge. I did this on my MacBook Pro using the command:
install.packages("doSNOW", repos="http://R-Forge.R-project.org", type="source")
Here is a simple example script that demonstrates the experimental "progress" option:
library(doSNOW)
library(tcltk)

cl <- makeSOCKcluster(3)
registerDoSNOW(cl)

pb <- tkProgressBar(max = 100)
progress <- function(n) setTkProgressBar(pb, n)
opts <- list(progress = progress)

r <- foreach(i = 1:100, .options.snow = opts) %dopar% {
  Sys.sleep(1)
  sqrt(i)
}
Update
The progress option is now available in the latest version of doSNOW on CRAN.

Related

Showing progress_bar with doParallel + foreach

I am using the example code posted here to show a progress_bar (from the progress package) with doParallel + foreach. The solutions there, however, make use of doSNOW (e.g. the code by Dewey Brooke that I am using for testing), which is more dated than doParallel and returns this NOTE when building a package with CRAN flags:
Uses the superseded package: ‘doSNOW (>= 1.0.19)’
Changing this does not seem as straightforward as expected. If registerDoSNOW is simply replaced by registerDoParallel, and .options.snow by .options.doparallel, the code will run, but in the second case no progress bar is shown at all.
I think this might relate to the use of .options.X. This part of the code is quite obscure to me: .options.snow does work when using doSNOW, but there is no documentation on the foreach man page about this argument. It is therefore not surprising that .options.doparallel does not work, since it was just a wild guess of mine.
Including the call to pb$tick() within the foreach loop will not work either, and will actually cause the result to be wrong. So I am really out of ideas about where to put it.
Where does .options.snow come from? Where should pb$tick() go? How can the progress_bar object be shown when using doParallel here?
I paste below the code (doSNOW replaced by doParallel) for convenience, but credit again the original source:
library(parallel)
library(doParallel)

numCores <- detectCores()
cl <- makeCluster(numCores)
registerDoParallel(cl)

# progress bar ------------------------------------------------------------
library(progress)

iterations <- 100  # used for the foreach loop

pb <- progress_bar$new(
  format = "letter = :letter [:bar] :elapsed | eta: :eta",
  total = iterations,  # 100
  width = 60)

progress_letter <- rep(LETTERS[1:10], 10)  # token reported in progress bar

# allowing progress bar to be used in foreach -----------------------------
progress <- function(n){
  pb$tick(tokens = list(letter = progress_letter[n]))
}

opts <- list(progress = progress)

# foreach loop ------------------------------------------------------------
library(foreach)

foreach(i = 1:iterations, .combine = rbind, .options.doparallel = opts) %dopar% {
  summary(rnorm(1e6))[3]
}

stopCluster(cl)
doParallel still uses the .options.snow argument for whatever reason. Found this little tidbit in the doParallel documentation.
The doParallel backend supports both multicore and snow options passed through the foreach function. The supported multicore options are preschedule, set.seed, silent, and cores, which are analogous to the similarly named arguments to mclapply, and are passed using the .options.multicore argument to foreach. The supported snow options are preschedule, which like its multicore analog can be used to chunk the tasks so that each worker gets a prescheduled chunk of tasks, and attachExportEnv, which can be used to attach the export environment in certain cases where R's lexical scoping is unable to find a needed export. The snow options are passed to foreach using the .options.snow argument.
foreach is a powerful package, but whoever is maintaining it makes odd decisions.
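For instance, per the documentation quoted above, snow-style options other than progress can still be handed to a doParallel backend through .options.snow (a minimal sketch; the option value is illustrative):
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# preschedule = TRUE chunks the tasks so each worker gets one block
res <- foreach(i = 1:10, .combine = c,
               .options.snow = list(preschedule = TRUE)) %dopar% sqrt(i)

stopCluster(cl)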
EDIT
doParallel does not support the progress multicore option. Therefore, a progress bar will NOT display if registerDoParallel is used instead of registerDoSNOW.
While doSNOW has been superseded, it's unclear whether one is more outdated than the other, since both have undergone very few changes other than updates to the current Maintainer (doParallel | doSNOW).
doSNOW
doSNOW:::doSNOW <- function (obj, expr, envir, data)
{
    cl <- data
    preschedule <- FALSE
    attachExportEnv <- FALSE
    progressWrapper <- function(...) NULL            # <- CRITICAL DIFFERENCE
    if (!inherits(obj, "foreach"))
        stop("obj must be a foreach object")
    it <- iter(obj)
    accumulator <- makeAccum(it)
    options <- obj$options$snow
    if (!is.null(options)) {
        nms <- names(options)
        recog <- nms %in% c("preschedule", "attachExportEnv",
            "progress")                              # <- CRITICAL DIFFERENCE
        if (any(!recog))
            warning(sprintf("ignoring unrecognized snow option(s): %s",
                paste(nms[!recog], collapse = ", ")), call. = FALSE)
    ...
doParallel
doParallel:::doParallelSNOW <- function (obj, expr, envir, data)
{
    cl <- data
    preschedule <- FALSE
    attachExportEnv <- FALSE
    # MISSING: progressWrapper <- function(...) NULL
    if (!inherits(obj, "foreach"))
        stop("obj must be a foreach object")
    it <- iter(obj)
    accumulator <- makeAccum(it)
    options <- obj$options$snow
    if (!is.null(options)) {
        nms <- names(options)
        recog <- nms %in% c("preschedule", "attachExportEnv")  # MISSING: "progress"
        if (any(!recog))
            warning(sprintf("ignoring unrecognized snow option(s): %s",
                paste(nms[!recog], collapse = ", ")), call. = FALSE)
    ...

How to initialize workers to use package functions in parallel

I am developing an R package and trying to use parallel processing in it for an embarrassingly parallel problem. I would like to write a loop or functional that uses the other functions from my package. I am working on Windows, and I have tried using parallel::parLapply and foreach's %dopar%, but cannot get the workers (cores) to access the functions in my package.
Here's an example of a simple package with two functions, where the second calls the first inside a parallel loop using %dopar%:
add10 <- function(x) x + 10

slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
  foreach::foreach(i = 1:m) %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
}
When I load the package with devtools::load_all() and call the slowadd function, Error in { : task 1 failed - "could not find function "add10"" is returned.
I have also tried explicitly initializing the workers with my package:
add10 <- function(x) x + 10

slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
  foreach::foreach(i = 1:m, .packages = 'mypackage') %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
}
but I get the error Error in e$fun(obj, substitute(ex), parent.frame(), e$data) : worker initialization failed: there is no package called 'mypackage'.
How can I get the workers to access the functions in my package? A solution using foreach would be great, but I'm completely open to solutions using parLapply or other functions/packages.
I was able to initialize the workers with my package's functions, thanks to people's helpful comments. By making sure that all of the needed package functions were exported in the NAMESPACE and installing my package with devtools::install(), foreach was able to find the package for initialization. The R script for the example looks like this:
#' @export
add10 <- function(x) x + 10

#' @export
slowadd <- function(m) {
  cl <- parallel::makeCluster(parallel::detectCores() - 1)
  doParallel::registerDoParallel(cl)
  `%dopar%` <- foreach::`%dopar%` # so %dopar% doesn't need to be attached
  out <- foreach::foreach(i = 1:m, .packages = 'mypackage') %dopar% {
    Sys.sleep(1)
    add10(i)
  }
  parallel::stopCluster(cl)
  return(out)
}
This works, but it's not an ideal solution. First, it makes for a much slower workflow: I was using devtools::load_all() every time I made a change to the package and wanted to test it (before incorporating parallelism), but now I have to reinstall the package every time, which is slow when the package is large. Second, every function needed in the parallel loop has to be exported so that foreach can find it. My actual use case has a lot of small utility functions which I would rather keep internal.
You can use devtools::load_all() inside the foreach loop or load the functions you need with source.
out <- foreach::foreach(i = 1:m) %dopar% {
  Sys.sleep(1)
  source("R/some_functions.R")
  load("R/sysdata.rda")
  add10(i)
}
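For the devtools::load_all() variant mentioned above, a minimal sketch (the path is illustrative and must point to the package source directory as seen from each worker):
out <- foreach::foreach(i = 1:m) %dopar% {
  # reloads the package from source, including unexported helpers
  devtools::load_all("path/to/mypackage", quiet = TRUE)
  add10(i)
}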

Is it possible to get a progress bar with foreach and a "multicore-kind" of backend

While using "multicore" parallelism using foreach and the doMC backend (I use doMC as at the time I looked into it other package did not allow logging from the I would like to get a progress bar, using the progress package, but any progress (that works on a linux terminal ie no tcltk popups) could do.
Given it uses forking I can imagine it might not be possible but I am not sure.
The intended use is to indicate progress when I load an concatenate 100's of files in parallel (usually within a #!Rscript)
I've looked at a few posts, like How do you create a progress bar when using the "foreach()" function in R?, and I am happy to award a bounty on this.
EDIT
A 500-point bounty is offered for someone showing me how, using foreach and a multicore (forking) type of parallelism, to:
get a progress bar
get logging using futile.logger
Reprex
# load packages
library("futile.logger")
library("data.table")
library("foreach")

# create temp dir
tmp_dir <- tempdir()

# create names for 200 files to be created
nb_files <- 200L
file_names <- file.path(tmp_dir, sprintf("file_%s.txt", 1:nb_files))

# make it reproducible
set.seed(1L)
nb_rows <- 1000L
nb_columns <- 10L

# create those 200 files sequentially
foreach(file_i = file_names) %do%
{
  DT <- as.data.table(matrix(data = runif(n = nb_rows * nb_columns), nrow = nb_rows))
  fwrite(x = DT, file = file_i)
  flog.info("Creating file %s", file_i)
} -> tmp

# load the files back
foreach(file_i = file_names, .final = rbindlist) %dopar%
{
  flog.info("Loading file %s", file_i)
  # >>> SOME PROGRESS BAR HERE <<<
  fread(file_i)
} -> final_data

# show data
final_data
Desired output
(Note that the progress bar is not messed up by the log lines.)
INFO [2018-07-18 19:03:48] Loading file /tmp/RtmpB13Tko/file_197.txt
INFO [2018-07-18 19:03:48] Loading file /tmp/RtmpB13Tko/file_198.txt
INFO [2018-07-18 19:03:48] Loading file /tmp/RtmpB13Tko/file_199.txt
INFO [2018-07-18 19:03:48] Loading file /tmp/RtmpB13Tko/file_200.txt
[ =======> ] 4%
EDIT 2
After the bounty ended, nothing came close to the expected result.
Logging within the progress bar messes everything up.
If someone gets the correct result, I'll offer another result-based bounty.
Here's a solution (not perfect) using a custom function.
The function outputs a progress bar to the console (using message).
ii is the current iteration.
N is the total number of iterations to perform.
per is the step (in iterations) at which to update the progress bar. We need this because when many iterations are being performed, the progress bar would otherwise be updated too often and the output gets messed up.
Function:
progBar <- function(ii, N, per = 10) {
  if (ii %in% seq(1, N, per)) {
    x <- round(ii * 100 / N)
    message("[ ",
            paste(rep("=", x), collapse = ""),
            paste(rep("-", 100 - x), collapse = ""),
            " ] ", x, "%", "\r",
            appendLF = FALSE)
    if (ii == N) cat("\r")
  }
}
Code to test:
library(doMC)
library(foreach)
registerDoMC(10)

nIteration <- 1e3
foreach(i = 1:nIteration, ii = icount()) %dopar% {
  # For progBar's ii I'm using icount(), because the
  # user might iterate over all kinds of objects
  progBar(ii, nIteration)
  Sys.sleep(1 / 10)
}
PS: It's not perfect, because:
The bar does not always run to 100% (depending on the number of iterations it can stop at 99%)
Sometimes the output gets messed up (depends on the number of iterations and how often the workers switch) - still debugging this part
The console is not flushed if you use print/cat within foreach
You can refer to this link, Progress bar parallel, for a few insights (maybe not the exact solution) that will help in creating a parallel progress bar.
Note that txtProgressBar only works when its style is 2 or 3.
library("foreach")
library("doParallel")
library("progress")
registerDoParallel(parallel::makeCluster(7, outfile = ""))
pb <- progress_bar$new(
format = " [:bar] :percent in :elapsed",
total = 30, clear = FALSE, width = 80, force = T)
a <- foreach (i = 1:30) %dopar% {
pb$tick()
Sys.sleep(0.5)
}
pb <- txtProgressBar(title = "Iterative training", min = 0, max = 30, style = 3)
foreach (i = 1:30) %dopar% {
setTxtProgressBar(pb, i)
Sys.sleep(0.5)
}
Refer to this link, Monitoring the function with progress bar, for the different ways a progress bar can be implemented depending on your needs.
Using Multicore:
You can register a different parallel backend later, or deregister doMC by registering the sequential backend with the registerDoSEQ function. For example, consider the following program:
x <- iris[which(iris[, 5] != "setosa"), c(1, 5)]
trials <- 10000
ptime <- system.time({
  r <- foreach(icount(trials), .combine = cbind) %dopar% {
    ind <- sample(100, 100, replace = TRUE)
    result1 <- glm(x[ind, 2] ~ x[ind, 1], family = binomial(logit))
    coefficients(result1)
  }
})[3]
ptime
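Switching back to sequential execution afterwards is a one-liner (registerDoSEQ is part of foreach):
registerDoSEQ()  # subsequent %dopar% loops now run sequentially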
A package I have used that does this in parallel for processing lists is pbmcapply; hope this helps.
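A minimal sketch of that approach (pbmclapply is a drop-in replacement for mclapply, so it relies on forking and works on Linux/macOS but not Windows):
library(pbmcapply)

# the progress bar is drawn by the master process while forked workers run
res <- pbmclapply(1:100, function(i) {
  Sys.sleep(0.05)
  sqrt(i)
}, mc.cores = 4)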

How does foreach package scope R Environments when using as.formula, SE dplyr, and lapply?

I have a function where I dynamically build multiple formulas as strings and cast them to formulas with as.formula. I then call that function in a parallel process using doSNOW and foreach, and use those formulas through dplyr::mutate_.
When I use lapply(formula_list, as.formula) I get the error could not find function *custom_function* when run in parallel, though it works fine when run locally. However, when I use lapply(formula_list, function(x) as.formula(x)) it works both in parallel and locally.
Why? What's the correct way to understand the environments here, and the "right" way to code it?
I do get a warning that says: In e$fun(obj, substitute(ex), parent.frame(), e$data) : already exporting variable(s): *custom_func*
A minimal reproducible example is below.
# Packages
library(dplyr)
library(doParallel)
library(doSNOW)
library(foreach)

# A simple custom function
custom_sum <- function(x){
  sum(x)
}

# Functions that create formulas and use them with SE dplyr:
dplyr_mut_lapply_reg <- function(df){
  my_dots <- setNames(
    object = lapply(list("~custom_sum(Sepal.Length)"), as.formula),
    nm = c("Sums")
  )
  return(
    df %>%
      group_by(Species) %>%
      mutate_(.dots = my_dots)
  )
}

dplyr_mut_lapply_lambda <- function(df){
  my_dots <- setNames(
    object = lapply(list("~custom_sum(Sepal.Length)"), function(x) as.formula(x)),
    nm = c("Sums")
  )
  return(
    df %>%
      group_by(Species) %>%
      mutate_(.dots = my_dots)
  )
}
#1. CALLING BOTH LOCALLY
dplyr_mut_lapply_lambda(iris) # works
dplyr_mut_lapply_reg(iris)    # works

#2. CALLING IN PARALLEL
# Faux parallel setup
cl <- makeCluster(1, outfile = "")
registerDoSNOW(cl)

# Call lambda version: WORKS
foreach(j = 1,
        .packages = c("dplyr", "tidyr"),
        .export = lsf.str()
) %dopar% {
  dplyr_mut_lapply_lambda(iris)
}

# Call regular version: FAILS
foreach(j = 1,
        .packages = c("dplyr", "tidyr"),
        .export = lsf.str()
) %dopar% {
  dplyr_mut_lapply_reg(iris)
}

# Close cluster
stopCluster(cl)
EDIT: In my original post title I wrote that I was using NSE, but I really meant standard evaluation. Whoops. I have changed the title accordingly.
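For reference, the two forms do produce formulas carrying different environments, which can be inspected directly (a quick local check; the comments reflect my understanding and may be imprecise):
f_direct <- lapply(list("~custom_sum(Sepal.Length)"), as.formula)[[1]]
f_lambda <- lapply(list("~custom_sum(Sepal.Length)"), function(x) as.formula(x))[[1]]

# as.formula() attaches its caller's frame to the formula, so the two
# calls capture different environments; printing them shows the difference
environment(f_direct)
environment(f_lambda)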
I don't have an exact answer as to why, but the future package (I'm the author) handles these types of "tricky" globals - they are tricky because they are not part of a package and they are nested, i.e. one global calls another global. For example, if you use:
library("doFuture")
cl <- parallel::makeCluster(1, outfile = "")
plan(cluster, workers = cl)
registerDoFuture()
then the problematic "Call Regular Version FAILS" case should now work.
Now, the above uses parallel::makeCluster(), which defaults to type = "PSOCK", whereas if you load doSNOW you get snow::makeCluster(), which defaults to type = "MPI". Unfortunately, a full MPI backend is not yet implemented for the future package. Thus, if you're looking for an MPI solution, this won't help you (yet).

How do you create a progress bar when using the "foreach()" function in R?

There are some informative posts on how to create a counter for loops in an R program. However, how do you create a similar function when using the parallelized version with foreach()?
Edit: After an update to the doSNOW package, it has become quite simple to display a nice progress bar when using %dopar%, and it works on Linux, Windows and OS X.
doSNOW now officially supports progress bars via the .options.snow argument:
library(doSNOW)

cl <- makeCluster(2)
registerDoSNOW(cl)

iterations <- 100
pb <- txtProgressBar(max = iterations, style = 3)
progress <- function(n) setTxtProgressBar(pb, n)
opts <- list(progress = progress)

result <- foreach(i = 1:iterations, .combine = rbind,
                  .options.snow = opts) %dopar% {
  s <- summary(rnorm(1e6))[3]
  return(s)
}

close(pb)
stopCluster(cl)
Yet another way of tracking progress, if you keep in mind the total number of iterations, is to set .verbose = TRUE, as this prints to the console which iterations have been finished.
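For example (a minimal sketch; .verbose makes foreach print its diagnostic output, including the accumulation of finished tasks):
library(foreach)

# prints scheduling and accumulation details for each iteration
foreach(i = 1:3, .verbose = TRUE) %do% sqrt(i)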
Previous solution for Linux and OS X
On Ubuntu 14.04 (64-bit) and OS X (El Capitan), the progress bar is displayed even when using %dopar% if outfile = "" is set in the makeCluster function. It does not seem to work under Windows. From the help for makeCluster:
outfile: Where to direct the stdout and stderr connection output from the workers. "" indicates no redirection (which may only be useful for workers on the local machine). Defaults to ‘/dev/null’ (‘nul:’ on Windows).
Example code:
library(foreach)
library(doSNOW)

cl <- makeCluster(4, outfile = "") # number of cores; notice 'outfile'
registerDoSNOW(cl)

iterations <- 100
pb <- txtProgressBar(min = 1, max = iterations, style = 3)

result <- foreach(i = 1:iterations, .combine = rbind) %dopar% {
  s <- summary(rnorm(1e6))[3]
  setTxtProgressBar(pb, i)
  return(s)
}

close(pb)
stopCluster(cl)
This is what the progress bar looks like. It looks a little odd, since a new bar is printed for every progression of the bar, and because a worker may lag a bit, which causes the progress bar to go back and forth occasionally.
You can also get this to work with the progress package.
# loading parallel and doSNOW packages and creating cluster ----------------
library(parallel)
library(doSNOW)

numCores <- detectCores()
cl <- makeCluster(numCores)
registerDoSNOW(cl)

# progress bar ------------------------------------------------------------
library(progress)

iterations <- 100  # used for the foreach loop

pb <- progress_bar$new(
  format = "letter = :letter [:bar] :elapsed | eta: :eta",
  total = iterations,  # 100
  width = 60)

progress_letter <- rep(LETTERS[1:10], 10)  # token reported in progress bar

# allowing progress bar to be used in foreach -----------------------------
progress <- function(n){
  pb$tick(tokens = list(letter = progress_letter[n]))
}

opts <- list(progress = progress)

# foreach loop ------------------------------------------------------------
library(foreach)

foreach(i = 1:iterations, .combine = rbind, .options.snow = opts) %dopar% {
  summary(rnorm(1e6))[3]
}

stopCluster(cl)
This code is a modified version of the doRedis example, and will make a progress bar even when using %dopar% with a parallel backend:
# Load libraries
library(foreach)
library(utils)
library(iterators)
library(doParallel)
library(snow)

# Choose number of iterations
n <- 1000

# Progress combine function
f <- function(){
  pb <- txtProgressBar(min = 1, max = n - 1, style = 3)
  count <- 0
  function(...) {
    count <<- count + length(list(...)) - 1
    setTxtProgressBar(pb, count)
    Sys.sleep(0.01)
    flush.console()
    c(...)
  }
}

# Start a cluster
cl <- makeCluster(4, type = 'SOCK')
registerDoParallel(cl)

# Run the loop in parallel
k <- foreach(i = icount(n), .final = sum, .combine = f()) %dopar% {
  log2(i)
}
head(k)

# Stop the cluster
stopCluster(cl)
You have to know the number of iterations and the combination function ahead of time.
This is now possible with the parallel package. Tested with R 3.2.3 on OS X 10.11, running inside RStudio, using a "PSOCK"-type cluster.
library(doParallel)

# default cluster type on my machine is "PSOCK"; YMMV with other types
cl <- parallel::makeCluster(4, outfile = "")
registerDoParallel(cl)

n <- 10000
pb <- txtProgressBar(0, n, style = 2)

invisible(foreach(i = icount(n)) %dopar% {
  setTxtProgressBar(pb, i)
})

stopCluster(cl)
Strangely, it only displays correctly with style = 3.
Save the start time with Sys.time() before the loop. Loop over rows or columns or something else whose total you know. Then, inside the loop, you can calculate the time run so far (see difftime), the percentage complete, the speed and the estimated time left. Each process can print those progress lines with the message function. You'll get output something like:
1/1000 complete # 1 items/s, ETA: 00:00:45
2/1000 complete # 1 items/s, ETA: 00:00:44
Obviously the looping order will greatly affect how well this works. I don't know about foreach, but with multicore's mclapply you'd get good results using mc.preschedule = FALSE, which means that items are allocated to processes one by one, in order, as previous items complete.
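A minimal sketch of that idea with foreach (%do% shown for simplicity; the message format and timings are illustrative):
library(foreach)

n <- 100
start <- Sys.time()

res <- foreach(i = 1:n, .combine = c) %do% {
  Sys.sleep(0.05)                            # stand-in for real work
  elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
  rate <- i / elapsed                        # items per second so far
  eta <- round((n - i) / rate)               # seconds left at the current rate
  message(sprintf("%d/%d complete # %.1f items/s, ETA: %ds", i, n, rate, eta))
  i
}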
This code implements a progress bar tracking a parallelized foreach loop using the doMC backend, and using the excellent progress package in R. It assumes that all cores, specified by numCores, do an approximately equal amount of work.
library(foreach)
library(doMC)
library(progress)

iterations <- 100
numCores <- 8
registerDoMC(cores = numCores)

pbTracker <- function(pb, i, numCores) {
  if (i %% numCores == 0) {
    pb$tick()
  }
}

pb <- progress_bar$new(
  format = " progress [:bar] :percent eta: :eta",
  total = iterations / numCores, clear = FALSE, width = 60)

output <- foreach(i = 1:iterations) %dopar% {
  pbTracker(pb, i, numCores)
  Sys.sleep(1/20)
}
The following code will produce a nice progress bar in R for the foreach control structure. It will also work with graphical progress bars by replacing txtProgressBar with the desired progress bar object.
# Gives us the foreach control structure.
library(foreach)
# Gives us icount().
library(iterators)
# Gives us the progress bar object.
library(utils)

# Some number of iterations to process.
n <- 10000

# Create the progress bar.
pb <- txtProgressBar(min = 1, max = n, style = 3)

# The foreach loop we are monitoring. This loop will log2 all
# the values from 1 to n and then sum the result.
k <- foreach(i = icount(n), .final = sum, .combine = c) %do% {
  setTxtProgressBar(pb, i)
  log2(i)
}

# Close the progress bar.
close(pb)
While the code above answers your question in its most basic form, a better and much harder question is whether you can create an R progress bar which monitors the progress of a foreach statement when it is parallelized with %dopar%. Unfortunately I don't think it is possible to monitor the progress of a parallelized foreach in this way, but I would love for someone to prove me wrong, as it would be a very useful feature.
