How to track progress in mclapply in R (parallel package)

My question is related to this question. However, the question referenced above uses the multicore package, which has since been replaced by parallel, and most of the functions in the responses cannot be replicated with the parallel package. Is there a way to track progress in mclapply? Looking at the mclapply documentation, there is a parameter called mc.silent; I'm not sure whether it can be used to track progress, and if so, how and where we can see the log output. I'm running on Ubuntu Linux. Please see below for a reproducible example for which I would like to track progress.
require(parallel)

wait.then.square <- function(xx) {
  # Wait for two seconds
  Sys.sleep(2)
  # Square the argument
  xx^2
}

output <- mclapply(1:10, wait.then.square, mc.cores = 4, mc.silent = FALSE)
Any help would be greatly appreciated.

Thanks to the pbmcapply package you can now easily track the progress of mclapply and mcmapply jobs. Just replace mclapply with pbmclapply:
library(pbmcapply)

wait.then.square <- function(xx) {
  Sys.sleep(2)
  xx^2
}

output <- pbmclapply(1:10, wait.then.square, mc.cores = 4)
...which will display a pretty progress bar.
The author has a nice blog post on the technical details and performance benchmarks here.
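The same package also covers mcmapply-style calls via pbmcmapply; a minimal sketch (the toy two-argument function here is mine, purely for illustration):

library(pbmcapply)

wait.then.add <- function(xx, yy) {
  Sys.sleep(0.5)
  xx + yy
}

# drop-in replacement for mcmapply, with a progress bar
output <- pbmcmapply(wait.then.add, 1:10, 11:20, mc.cores = 4)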

This is an update of my related answer. The idea is to fork a child process that reads progress messages from a FIFO and prints them, while each mclapply worker writes one message to the FIFO per completed job.
library(parallel)

finalResult <- local({
  f <- fifo(tempfile(), open = "w+b", blocking = TRUE)
  if (inherits(parallel:::mcfork(), "masterProcess")) {
    # Child process: read progress updates from the FIFO and print them
    progress <- 0.0
    while (progress < 1 && !isIncomplete(f)) {
      msg <- readBin(f, "double")
      progress <- progress + as.numeric(msg)
      cat(sprintf("Progress: %.2f%%\n", progress * 100))
    }
    parallel:::mcexit()
  }
  numJobs <- 100
  result <- mclapply(1:numJobs, function(...) {
    # Do something fancy here... For this example, just sleep
    Sys.sleep(0.05)
    # Send progress update
    writeBin(1 / numJobs, f)
    # Some arbitrary result
    sample(1000, 1)
  })
  close(f)
  result
})

cat("Done\n")
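Note that this approach relies on parallel:::mcfork, so, like forked mclapply itself, it only works on Unix-alikes (the question's Ubuntu setup is fine); on Windows mcfork is not available and mclapply only runs sequentially.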

Related

Showing progress_bar with doParallel + foreach

I am using the example code posted here to show a progress_bar (from the progress package) with doParallel + foreach. The solutions there, however, make use of doSNOW (e.g. the code by Dewey Brooke that I am using for testing), which is more outdated than doParallel and returns this NOTE when building a package with CRAN flags:
Uses the superseded package: ‘doSNOW (>= 1.0.19)’
Changing this does not seem as straightforward as expected. If registerDoSNOW is simply replaced by registerDoParallel, and .options.snow by .options.doparallel, the code will run, but in the second case it will not show any progress bar at all.
I think this might relate to the use of .options.X. This part of the code is quite obscure to me: .options.snow does work when using doSNOW, but there is no documentation in the foreach man page about this argument, so it is not surprising that .options.doparallel does not work, since it was just a wild guess of mine.
Including the call to pb$tick() within the foreach loop will not work either, and will actually cause the result to be wrong. So I am really out of ideas on where to include this in the code.
Where does .options.snow come from? Where should pb$tick() go? How can the progress_bar object be shown when using doParallel here?
I paste the code below (doSNOW replaced by doParallel) for convenience, but credit goes again to the original source:
library(parallel)
library(doParallel)
numCores <- detectCores()
cl <- makeCluster(numCores)
registerDoParallel(cl)

# progress bar ------------------------------------------------------------
library(progress)
iterations <- 100  # used for the foreach loop

pb <- progress_bar$new(
  format = "letter = :letter [:bar] :elapsed | eta: :eta",
  total = iterations,  # 100
  width = 60)

progress_letter <- rep(LETTERS[1:10], 10)  # token reported in progress bar

# allowing progress bar to be used in foreach -----------------------------
progress <- function(n) {
  pb$tick(tokens = list(letter = progress_letter[n]))
}

opts <- list(progress = progress)

# foreach loop ------------------------------------------------------------
library(foreach)
foreach(i = 1:iterations, .combine = rbind, .options.doparallel = opts) %dopar% {
  summary(rnorm(1e6))[3]
}

stopCluster(cl)
doParallel still uses the .options.snow argument, for whatever reason. I found this little tidbit in the doParallel documentation:
The doParallel backend supports both multicore and snow options passed through the foreach function. The supported multicore options are preschedule, set.seed, silent, and cores, which are analogous to the similarly named arguments to mclapply, and are passed using the .options.multicore argument to foreach. The supported snow options are preschedule, which like its multicore analog can be used to chunk the tasks so that each worker gets a prescheduled chunk of tasks, and attachExportEnv, which can be used to attach the export environment in certain cases where R's lexical scoping is unable to find a needed export. The snow options are passed to foreach using the .options.snow argument.
foreach is a powerful package, but whoever is maintaining it makes odd decisions.
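For reference, a minimal sketch of passing those multicore options through doParallel on a Unix-alike (where registering with cores uses the fork-based backend); the list names come straight from the documentation quoted above:

library(doParallel)
registerDoParallel(cores = 4)

# disable prescheduling so tasks are handed out one at a time
res <- foreach(i = 1:8, .combine = c,
               .options.multicore = list(preschedule = FALSE)) %dopar% {
  sqrt(i)
}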
EDIT
doParallel does not support the progress snow option. Therefore, a progress bar will NOT be displayed if registerDoParallel is used instead of registerDoSNOW.
While doSNOW has been superseded, it's unclear whether one is more outdated than the other, since both have undergone very few changes other than updating the current Maintainer (doParallel | doSNOW).
doSNOW
doSNOW:::doSNOW <- function (obj, expr, envir, data)
{
  cl <- data
  preschedule <- FALSE
  attachExportEnv <- FALSE
  progressWrapper <- function(...) NULL  # <- CRITICAL DIFFERENCE
  if (!inherits(obj, "foreach"))
    stop("obj must be a foreach object")
  it <- iter(obj)
  accumulator <- makeAccum(it)
  options <- obj$options$snow
  if (!is.null(options)) {
    nms <- names(options)
    recog <- nms %in% c("preschedule", "attachExportEnv",
                        "progress")  # <- CRITICAL DIFFERENCE
    if (any(!recog))
      warning(sprintf("ignoring unrecognized snow option(s): %s",
                      paste(nms[!recog], collapse = ", ")), call. = FALSE)
  ...
doParallel
doParallel:::doParallelSNOW <- function (obj, expr, envir, data)
{
  cl <- data
  preschedule <- FALSE
  attachExportEnv <- FALSE
  # MISSING: progressWrapper <- function(...) NULL
  if (!inherits(obj, "foreach"))
    stop("obj must be a foreach object")
  it <- iter(obj)
  accumulator <- makeAccum(it)
  options <- obj$options$snow
  if (!is.null(options)) {
    nms <- names(options)
    recog <- nms %in% c("preschedule", "attachExportEnv")  # MISSING: "progress"
    if (any(!recog))
      warning(sprintf("ignoring unrecognized snow option(s): %s",
                      paste(nms[!recog], collapse = ", ")), call. = FALSE)
  ...
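The practical upshot for the question above: keep registerDoSNOW and .options.snow if you want the bar. A minimal sketch reusing pb, progress, and opts exactly as defined in the question's code:

library(doSNOW)
cl <- makeCluster(numCores)
registerDoSNOW(cl)

# same pb, progress and opts objects as in the question's code above
foreach(i = 1:iterations, .combine = rbind, .options.snow = opts) %dopar% {
  summary(rnorm(1e6))[3]
}

stopCluster(cl)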

Is it possible to get a progress bar with foreach and a "multicore-kind" of backend

While using "multicore" parallelism with foreach and the doMC backend (I use doMC because, at the time I looked into it, other packages did not allow logging from the workers), I would like to get a progress bar, using the progress package, but any progress bar that works on a Linux terminal (i.e. no tcltk popups) would do.
Given that it uses forking, I can imagine it might not be possible, but I am not sure.
The intended use is to indicate progress when I load and concatenate hundreds of files in parallel (usually within an #!Rscript).
I've looked at a few posts like How do you create a progress bar when using the "foreach()" function in R?. I'm happy to award a bounty on this.
EDIT
500-point bounty offered for someone showing me how to:
use foreach and a multicore (forking) type of parallelism,
get a progress bar, and
get logging using futile.logger.
Reprex
# load packages
library("futile.logger")
library("data.table")
library("foreach")

# create temp dir
tmp_dir <- tempdir()

# create names for 200 files to be created
nb_files <- 200L
file_names <- file.path(tmp_dir, sprintf("file_%s.txt", 1:nb_files))

# make it reproducible
set.seed(1L)
nb_rows <- 1000L
nb_columns <- 10L

# create those 200 files sequentially
foreach(file_i = file_names) %do% {
  DT <- as.data.table(matrix(data = runif(n = nb_rows * nb_columns), nrow = nb_rows))
  fwrite(x = DT, file = file_i)
  flog.info("Creating file %s", file_i)
} -> tmp

# Load back the files
foreach(file_i = file_names, .final = rbindlist) %dopar% {
  flog.info("Loading file %s", file_i)
  # >>> SOME PROGRESS BAR HERE <<<
  fread(file_i)
} -> final_data

# show data
final_data
Desired output
(Note that the progress bar is not messed up by the print lines.)
INFO [2018-07-18 19:03:48] Loading file /tmp/RtmpB13Tko/file_197.txt
INFO [2018-07-18 19:03:48] Loading file /tmp/RtmpB13Tko/file_198.txt
INFO [2018-07-18 19:03:48] Loading file /tmp/RtmpB13Tko/file_199.txt
INFO [2018-07-18 19:03:48] Loading file /tmp/RtmpB13Tko/file_200.txt
[ =======> ] 4%
EDIT 2
After the bounty ended, nothing came close to the expected result.
Logging within the progress bar messes everything up.
If someone gets the correct result I'll offer another result-based bounty.
Here's a (not perfect) solution using a custom function.
This function outputs a progress bar to the console (using message).
ii is the current iteration.
N is the total number of iterations to perform.
per is the step (the bar is updated every per-th iteration). We need this because, when many iterations are performed, the progress bar would otherwise be updated too often and the output would get messed up.
Function:
progBar <- function(ii, N, per = 10) {
  if (ii %in% seq(1, N, per)) {
    x <- round(ii * 100 / N)
    message("[ ",
            paste(rep("=", x), collapse = ""),
            paste(rep("-", 100 - x), collapse = ""),
            " ] ", x, "%", "\r",
            appendLF = FALSE)
    if (ii == N) cat("\r")
  }
}
Code to test:
library(doMC)
library(foreach)
registerDoMC(10)

nIteration <- 1e3
foreach(i = 1:nIteration, ii = icount()) %dopar% {
  # For progBar's ii I'm using icount(), because the
  # user might iterate over all kinds of objects
  progBar(ii, nIteration)
  Sys.sleep(1 / 10)
}
PS: It's not perfect, because:
the bar does not always run to 100% (depending on the number of iterations, it can stop at 99%);
sometimes the output gets messed up (depends on the number of iterations and how often the workers switch) - still debugging this part;
the console is not flushed if you use print/cat within foreach.
You can refer to this link, Progress bar parallel, for a few insights (maybe not the exact solution) that will help in creating a parallel progress bar.
The txtProgressBar only works when the style is 2 or 3:
library("foreach")
library("doParallel")
library("progress")

registerDoParallel(parallel::makeCluster(7, outfile = ""))

pb <- progress_bar$new(
  format = " [:bar] :percent in :elapsed",
  total = 30, clear = FALSE, width = 80, force = TRUE)

a <- foreach(i = 1:30) %dopar% {
  pb$tick()
  Sys.sleep(0.5)
}

pb <- txtProgressBar(title = "Iterative training", min = 0, max = 30, style = 3)
foreach(i = 1:30) %dopar% {
  setTxtProgressBar(pb, i)
  Sys.sleep(0.5)
}
Refer to this link, Monitoring the function with progress bar, for the different ways a progress bar can be implemented depending on your needs.
Using multicore:
You can register a different parallel backend later, or deregister doMC by registering the sequential backend with the registerDoSEQ function. For example, consider the following program:
> library(doMC)
> registerDoMC(cores = 2)
> x <- iris[which(iris[,5] != "setosa"), c(1,5)]
> trials <- 10000
> ptime <- system.time({
+   r <- foreach(icount(trials), .combine = cbind) %dopar% {
+     ind <- sample(100, 100, replace = TRUE)
+     result1 <- glm(x[ind,2] ~ x[ind,1], family = binomial(logit))
+     coefficients(result1)
+   }
+ })[3]
> ptime
A package I have used that does this in parallel for processing lists is pbmcapply; hope this helps.

Parallelized `Find` loop in R

There are several packages in R to simplify running code in parallel, like foreach and future. Most of these have constructs which are like lapply or a for loop: they carry on until all the tasks have finished.
Is there a simple parallel version of Find? That is, I would like to run several tasks in parallel. I don't need all of them to finish, I just need to get the first one that finishes (maybe with a particular result). After that the other tasks can be killed, or left to finish on their own.
Conceptual code:
hunt_needle <- function (x, y) x %in% (y-1000):y
x <- sample.int(1000000, 1)
result <- parallel_find(seq(1000, 1000000, 1000), hunt_needle)
# should return the first value for which hunt_needle is true
You can use shared memory so that processes can communicate with one another.
For that, you can use package bigstatsr (disclaimer: I'm the author).
Choose a block size and do:
# devtools::install_github("privefl/bigstatsr")
library(bigstatsr)

# Data example
cond <- logical(1e6)
cond[sample(length(cond), size = 1)] <- TRUE

ind.block <- bigstatsr:::CutBySize(length(cond), block.size = 1000)

cl <- parallel::makeCluster(nb_cores())
doParallel::registerDoParallel(cl)

# This value (in an on-disk matrix) is shared by processes
found_it <- FBM(1, 1, type = "integer", init = 0L)

library(foreach)
res <- foreach(ic = sample(rows_along(ind.block)), .combine = 'c') %dopar% {
  if (found_it[1]) return(NULL)
  ind <- bigstatsr:::seq2(ind.block[ic, ])
  find <- which(cond[ind])
  if (length(find)) {
    found_it[1] <- 1L
    return(ind[find[1]])
  } else {
    return(NULL)
  }
}
parallel::stopCluster(cl)

# Verification
all.equal(res, which(cond))
Basically, when a solution is found, you don't need to do any more computations, and the other processes know it because you put a 1 in found_it, which is shared between all processes.
As your question is not reproducible and I don't understand everything you need, you may have to adapt this solution a little bit.

How do I run a function every second

I want to run a function that takes less than one second to execute. I want to run it in a loop every second. I do not want to wait one second between running the function like Sys.sleep would do.
while (TRUE) {
  # my function that takes less than a second to run
  Sys.sleep(runif(1, min = 0, max = .8))
  # wait for the remaining time until the next execution...
  # something here
}
I could record a starttime <- Sys.time() and do a comparison every iteration through the loop, something like this...
starttime <- Sys.time()
while (TRUE) {
  if (abs(as.numeric(Sys.time() - starttime) %% 1) < .001) {
    # my function that takes less than a second to run
    Sys.sleep(runif(1, min = 0, max = .8))
    print(paste("it ran", Sys.time()))
  }
}
But my function never seems to be executed.
I know Python has a package to do this sort of thing. Does R also have one that I don't know about? Thanks.
You can keep track of the time with system.time
while (TRUE) {
  s <- system.time(Sys.sleep(runif(1, min = 0, max = 0.8)))
  Sys.sleep(1 - s[3])  # basically sleep for whatever is left of the second
}
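One caveat: Sys.sleep rejects a negative time, so if the work can occasionally overrun the one-second budget, clamp the remainder at zero; a small variation of the loop above:

while (TRUE) {
  s <- system.time(Sys.sleep(runif(1, min = 0, max = 0.8)))
  # clamp at zero in case the work took longer than one second
  Sys.sleep(max(0, 1 - s[3]))
}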
You can also use proc.time directly (which system.time calls), which for some reason gave better results for me:
> system.time(
  for (i in 1:10) {
    p1 <- proc.time()
    Sys.sleep(runif(1, min = 0, max = 0.8))
    p2 <- proc.time() - p1
    Sys.sleep(1 - p2[3])  # basically sleep for whatever is left of the second
  })
   user  system elapsed
   0.00    0.00   10.02
Here are some alternatives. These do not block; that is, you can still use the console to run other code while they are running.
1) tcltk Try after in the tcltk package:
library(tcltk)

run <- function() {
  .id <<- tcl("after", 1000, run)  # after 1000 ms, execute run() again
  cat(as.character(.id), "\n")     # replace with your code
}
run()
Running this on a fresh R session gives:
after#0
after#1
after#2
after#3
after#4
after#5
after#6
after#7
...etc...
To stop it, run tcl("after", "cancel", .id).
2) tcltk2 Another possibility is tclTaskSchedule in the tcltk2 package:
library(tcltk2)
test <- function() cat("Hello\n") # replace with your function
tclTaskSchedule(1000, test(), id = "test", redo = TRUE)
Stop it with:
tclTaskDelete("test")
Alternatively, redo= can specify the number of times it should run.
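For instance, a sketch that runs the task five times in total and then stops on its own (reusing the test() function defined above; the id is arbitrary):

# run test() every 1000 ms, five times in total
tclTaskSchedule(1000, test(), id = "test5", redo = 5)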
Another non-blocking alternative worth mentioning is provided by library(later), using later() recursively:
print_time <- function(interval = 10) {
  timestamp()
  later::later(print_time, interval)
}
print_time()
The example is taken from here.
The shiny package has a function invalidateLater() which can be used to trigger functions. Have a look at http://shiny.rstudio.com/gallery/timer.html
Although it's very late:
As an alternative, we can use recursion. I don't know if it's the solution you are looking for, but it executes a function at a regular interval.
ssc <- function() {
  x <- rnorm(30, 20, 2)
  print(hist(x))
  Sys.sleep(4)
  ssc()
}
ssc()
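Note that because ssc() calls itself without ever returning, a long-running session will eventually exhaust R's evaluation-nesting limit; a plain loop avoids that. A sketch with the same behavior:

ssc_loop <- function() {
  repeat {
    x <- rnorm(30, 20, 2)
    print(hist(x))
    Sys.sleep(4)
  }
}
ssc_loop()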

How do you create a progress bar when using the "foreach()" function in R?

There are some informative posts on how to create a counter for loops in an R program. However, how do you create a similar function when using the parallelized version with foreach()?
Edit: After an update to the doSNOW package, it has become quite simple to display a nice progress bar when using %dopar%, and it works on Linux, Windows and OS X.
doSNOW now officially supports progress bars via the .options.snow argument.
library(doSNOW)
cl <- makeCluster(2)
registerDoSNOW(cl)

iterations <- 100
pb <- txtProgressBar(max = iterations, style = 3)
progress <- function(n) setTxtProgressBar(pb, n)
opts <- list(progress = progress)

result <- foreach(i = 1:iterations, .combine = rbind,
                  .options.snow = opts) %dopar% {
  s <- summary(rnorm(1e6))[3]
  return(s)
}

close(pb)
stopCluster(cl)
Yet another way of tracking progress, if you keep in mind the total number of iterations, is to set .verbose = TRUE, as this will print to the console which iterations have been finished.
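A minimal sketch of that option (works with any registered backend; the exact diagnostic output depends on the foreach version):

result <- foreach(i = 1:iterations, .combine = rbind, .verbose = TRUE) %dopar% {
  summary(rnorm(1e6))[3]
}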
Previous solution for Linux and OS X
On Ubuntu 14.04 (64 bit) and OS X (El Capitan) the progress bar is displayed even when using %dopar%, provided outfile = "" is set in the makeCluster function. It does not seem to work under Windows. From the help on makeCluster:
outfile: Where to direct the stdout and stderr connection output from the workers. "" indicates no redirection (which may only be useful for workers on the local machine). Defaults to ‘/dev/null’ (‘nul:’ on Windows).
Example code:
library(foreach)
library(doSNOW)

cl <- makeCluster(4, outfile = "")  # number of cores. Notice 'outfile'
registerDoSNOW(cl)

iterations <- 100
pb <- txtProgressBar(min = 1, max = iterations, style = 3)

result <- foreach(i = 1:iterations, .combine = rbind) %dopar% {
  s <- summary(rnorm(1e6))[3]
  setTxtProgressBar(pb, i)
  return(s)
}

close(pb)
stopCluster(cl)
This is what the progress bar looks like. It looks a little odd, since a new bar is printed for every progression of the bar, and because a worker may lag a bit, which causes the progress bar to go back and forth occasionally.
You can also get this to work with the progress package.
# loading parallel and doSNOW package and creating cluster ----------------
library(parallel)
library(doSNOW)

numCores <- detectCores()
cl <- makeCluster(numCores)
registerDoSNOW(cl)

# progress bar ------------------------------------------------------------
library(progress)
iterations <- 100  # used for the foreach loop

pb <- progress_bar$new(
  format = "letter = :letter [:bar] :elapsed | eta: :eta",
  total = iterations,  # 100
  width = 60)

progress_letter <- rep(LETTERS[1:10], 10)  # token reported in progress bar

# allowing progress bar to be used in foreach -----------------------------
progress <- function(n) {
  pb$tick(tokens = list(letter = progress_letter[n]))
}

opts <- list(progress = progress)

# foreach loop ------------------------------------------------------------
library(foreach)
foreach(i = 1:iterations, .combine = rbind, .options.snow = opts) %dopar% {
  summary(rnorm(1e6))[3]
}

stopCluster(cl)
This code is a modified version of the doRedis example, and will make a progress bar even when using %dopar% with a parallel backend:
# Load libraries
library(foreach)
library(utils)
library(iterators)
library(doParallel)
library(snow)

# Choose number of iterations
n <- 1000

# Progress combine function
f <- function() {
  pb <- txtProgressBar(min = 1, max = n - 1, style = 3)
  count <- 0
  function(...) {
    count <<- count + length(list(...)) - 1
    setTxtProgressBar(pb, count)
    Sys.sleep(0.01)
    flush.console()
    c(...)
  }
}

# Start a cluster
cl <- makeCluster(4, type = 'SOCK')
registerDoParallel(cl)

# Run the loop in parallel
k <- foreach(i = icount(n), .final = sum, .combine = f()) %dopar% {
  log2(i)
}
head(k)

# Stop the cluster
stopCluster(cl)
You have to know the number of iterations and the combination function ahead of time.
This is now possible with the parallel package. Tested with R 3.2.3 on OSX 10.11, running inside RStudio, using a "PSOCK"-type cluster.
library(doParallel)

# default cluster type on my machine is "PSOCK", YMMV with other types
cl <- parallel::makeCluster(4, outfile = "")
registerDoParallel(cl)

n <- 10000
pb <- txtProgressBar(0, n, style = 3)

invisible(foreach(i = icount(n)) %dopar% {
  setTxtProgressBar(pb, i)
})
stopCluster(cl)
Strangely, it only displays correctly with style = 3.
You save the start time with Sys.time() before the loop, and loop over rows or columns or something whose total you know. Then, inside the loop, you can calculate the time run so far (see difftime), the percentage complete, the speed, and the estimated time left. Each process can print those progress lines with the message function. You'll get output something like:
1/1000 complete # 1 items/s, ETA: 00:00:45
2/1000 complete # 1 items/s, ETA: 00:00:44
Obviously the looping order will greatly affect how well this works. I don't know about foreach, but with multicore's mclapply you'd get good results using mc.preschedule = FALSE, which means that items are allocated to processes one by one, in order, as previous items complete.
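A minimal sketch of that idea (the helper name and formatting are mine, not from any package; shown here in a sequential loop for clarity):

report_progress <- function(i, total, start) {
  elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
  speed   <- i / elapsed          # items per second
  eta     <- (total - i) / speed  # seconds remaining
  message(sprintf("%d/%d complete # %.1f items/s, ETA: %s",
                  i, total, speed,
                  format(.POSIXct(eta, tz = "GMT"), "%H:%M:%S")))
}

starttime <- Sys.time()
for (i in 1:10) {
  Sys.sleep(0.2)  # stand-in for real work
  report_progress(i, 10, starttime)
}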
This code implements a progress bar tracking a parallelized foreach loop using the doMC backend, together with the excellent progress package in R. It assumes that all cores, specified by numCores, do an approximately equal amount of work.
library(foreach)
library(doMC)
library(progress)

iterations <- 100
numCores <- 8
registerDoMC(cores = numCores)

pbTracker <- function(pb, i, numCores) {
  if (i %% numCores == 0) {
    pb$tick()
  }
}

pb <- progress_bar$new(
  format = " progress [:bar] :percent eta: :eta",
  total = iterations / numCores, clear = FALSE, width = 60)

output <- foreach(i = 1:iterations) %dopar% {
  pbTracker(pb, i, numCores)
  Sys.sleep(1 / 20)
}
The following code will produce a nice progress bar in R for the foreach control structure. It will also work with graphical progress bars if you replace txtProgressBar with the desired progress-bar object.
# Gives us the foreach control structure.
library(foreach)
# Gives us icount() for the iterator.
library(iterators)
# Gives us the progress bar object.
library(utils)

# Some number of iterations to process.
n <- 10000

# Create the progress bar.
pb <- txtProgressBar(min = 1, max = n, style = 3)

# The foreach loop we are monitoring. This foreach loop will log2 all
# the values from 1 to n and then sum the result.
k <- foreach(i = icount(n), .final = sum, .combine = c) %do% {
  setTxtProgressBar(pb, i)
  log2(i)
}

# Close the progress bar.
close(pb)
While the code above answers your question in its most basic form, a better and much harder question to answer is whether you can create an R progress bar that monitors the progress of a foreach statement when it is parallelized with %dopar%. Unfortunately I don't think it is possible to monitor the progress of a parallelized foreach in this way, but I would love for someone to prove me wrong, as it would be a very useful feature.
