Print outputs in foreach loop in R - r

I have been trying to get some output displayed from a foreach loop in R. A reproducible example is:
library(doParallel)  # also attaches foreach and parallel
cl <- makeCluster(2)
registerDoParallel(cl)
ptm1 <- proc.time()
foreach(i = 1:50, .packages = c("MASS"), .combine = '+') %dopar% {
  ginv(matrix(rexp(1000000, rate = .001), ncol = 1000))
  if (i > 49) {
    cat("Time taken", proc.time() - ptm1)
  }
}
I expect the time taken to be displayed, but nothing is displayed. Can you please suggest ways of capturing the messages in the foreach loop and displaying them at the end of the loop?

I'm not sure if there's a way to output to the screen, but you can easily output to a log file using the sink function, like so:
ptm1 <- proc.time()
foreach(i = 1:50, .packages = c("MASS"), .combine = '+') %dopar% {
  ginv(matrix(rexp(1000000, rate = .001), ncol = 1000))
  if (i > 49) {
    sink("Report.txt", append = TRUE)  # open sink file and add output
    cat("Time taken", proc.time() - ptm1)
    sink()  # close the sink so the diversion does not stay open on the worker
  }
}
EDIT: As @Roland points out, this can be dangerous if you want to capture output from every iteration and not just the final one, because you don't want the workers to clobber each other's writes to the same file. He links to a better alternative for this scenario in his comment.
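A side-effect-free alternative, given here as a minimal sketch and not as the approach from the comment mentioned above: return each message as part of the loop result and print everything on the master once the loop has finished. The field names value and note, the stand-in computation, and the 4 iterations are assumptions made for illustration. The time is measured on the master with Sys.time(), so it does not depend on the workers' own clocks.

```r
library(doParallel)  # attaches foreach and parallel as well

cl <- makeCluster(2)
registerDoParallel(cl)

start <- Sys.time()
res <- foreach(i = 1:4) %dopar% {
  value <- sum(sqrt(seq_len(1000 * i)))  # stand-in for the real work
  note <- sprintf("iteration %d finished on pid %d", i, Sys.getpid())
  list(value = value, note = note)       # return the message, do not print it
}
stopCluster(cl)

# Back on the master: unpack the values and messages, then report.
elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
notes <- vapply(res, function(r) r$note, character(1))
total <- sum(vapply(res, function(r) r$value, numeric(1)))
cat(notes, sep = "\n")
cat("Time taken:", round(elapsed, 2), "seconds\n")
```

Another option that may be worth trying is makeCluster(2, outfile = ""), which asks the workers not to redirect their output. Whether that output actually appears depends on the platform and on the front end in use.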

Related

Nested foreach with changing index size

I'm trying to compute the daily price returns for each stock I have. The data is cross-sectional and very large, so I use doParallel and nested foreach loops.
Here is the code I've been using so far, as a reproducible example:
stock_name <- as.data.frame(sample(x = 1:100, size = 250, replace=TRUE))
price <- as.data.frame(sample(x = 1:100, size = 1000,replace=TRUE))
## Calculating daily returns.
stock_list<-as.tbl(distinct(stock_name))
numStock<- as.integer(count(stock_list)) #150 #as.integer(count(stock_list))
nCPUcores = detectCores()
if (nCPUcores < 3) {
registerDoSEQ()
}else{
cl = makeCluster(nCPUcores - 1)
registerDoParallel(cl)
}
d_ret<-c()
foreach (stock=1:numStock, .packages = c("doParallel","foreach","data.table","plyr","dplyr")) %dopar%{
s<-as.integer(unlist(stock_list[stock,]))
stock_price <- as.matrix(price[which(stock_name[1,]==s),])
u<-nrow(stock_price)
d_ret<-foreach (p=2:u) %:%{
c(d_ret,(stock_price[p,]-stock_price[p-1,])/stock_price[p-1,])
}
}
stopCluster(cl)
##--
But the code doesn't work. After Florian Prive's remark, I checked the library and it seems that I should write nested foreach loops like this:
x <- foreach(b=bvec, .combine='cbind') %:%
foreach(a=avec, .combine='c') %dopar% {
sim(a, b)
}
So what I understand is I shouldn't be writing anything between %:% and the second foreach.
However, in my case the range of the second loop changes with each iteration of the first foreach, because the stocks do not all have the same number of prices. Therefore I can't just write 'foreach(a=avec)'.
The second foreach would ideally depend on variable u
u<-nrow(stock_price)
Is this even possible with the foreach library?
Thank you for the help
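One way to sidestep the varying inner length, offered as a sketch under assumptions and not as a tested fix for the code above: split the prices by stock first, loop over the stocks with a single foreach, and compute the returns inside the body with vectorised code, so that no inner foreach is needed. The data frame prices and its columns stock and price are hypothetical toy data. The sketch uses %do% so that it runs without a cluster; with a registered backend, %dopar% can be substituted.

```r
library(foreach)

# Hypothetical toy data: one row per (stock, day) observation,
# with a different number of prices per stock.
prices <- data.frame(
  stock = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  price = c(10, 11, 12.1, 20, 19, 5, 5.5, 5.5, 6.05)
)

# One price vector per stock; the vectors may have different lengths.
by_stock <- split(prices$price, prices$stock)

# The outer loop runs over stocks. The inner "loop" is vectorised:
# (p[t] - p[t-1]) / p[t-1] for t = 2, ..., length(p).
daily_returns <- foreach(p = by_stock) %do% {
  diff(p) / head(p, -1)
}
names(daily_returns) <- names(by_stock)
```

Each element of daily_returns has one entry fewer than the corresponding price vector, so the result lengths differ per stock.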

Foreach code works for %do% but not for %dopar%

This works normally on my computer:
registerDoSNOW(makeCluster(2, type = "SOCK"))
foreach(i = 1:M,.combine = "c") %dopar% {
sum(rnorm(M))
}
So I can say that I can run parallelized code on this computer, right?
Ok. I have a piece of code that I wish to run in parallel with foreach. It runs perfectly when it's written with %do%, but doesn't work properly when I change it to %dopar%. (PS: I have already initialized the cluster with registerDoSNOW(makeCluster(2, type = "SOCK")) in the same way as before.)
My main interest in the code is getting the vector u.varpred. I get it nicely with %do%, but when I run it with %dopar%, the vector comes as a NULL.
Here is the loop with the code that's needed to run it all properly. It uses functions in the geoR package.
#you can pretty much ignore all this, it's just preparation for the loop
N=20
NN=10
set.seed(111);
datap <- grf(N, cov.pars=c(20, 5),nug=1)
grid.o <- expand.grid(seq(0, 1, l=100), seq(0, 1, l=100))
grid.c <- expand.grid(seq(0, 1, l=NN), seq(0,1, l=NN))
beta1=mean(datap$data)
emv<- likfit(datap, ini=c(10,0.4), nug=1)
krieging <- krige.conv(datap, loc=grid.o,
krige=krige.control(type.krige="SK", trend.d="cte",
beta =beta1, cov.pars=emv$cov.pars))
names(grid.c) = names(as.data.frame(datap$coords))
list.geodatas<-list()
valores<-c(datap$data,0)
list.dataframes<-list()
list.krigings<-list(); i=0; u.varpred=NULL;
#here is the foreach code
t<-proc.time()
foreach(i=1:length(grid.c[,1]), .packages='geoR') %do% {
list.dataframes[[i]] <- rbind(datap$coords,grid.c[i,]);
list.geodatas[[i]] <- as.geodata(data.frame(cbind(list.dataframes[[i]],valores)))
list.krigings[[i]] <- krige.conv(list.geodatas[[i]], loc=grid.o,
krige=krige.control(type.krige="SK", trend.d="cte",
beta =beta1, cov.pars=emv$cov.pars));
u.varpred[i] <- mean(krieging$krige.var - list.krigings[[i]]$krige.var)
list.dataframes[[i]]<-0 #i dont need those objects anymore but since they
# are lists i dont want to put <-NULL as it'll ruin their ordering
list.krigings[[i]]<- 0
list.geodatas[[i]] <-0
}
t<-proc.time()-t
t
You can check that this runs nicely (provided you have the following packages: geoR, foreach and doSNOW). But once I use registerDoSNOW(......) and %dopar%, u.varpred comes as a NULL.
Could you guys please check whether I made a mistake in the foreach statement/process, or whether this code simply can't be run in parallel? (I thought it could, because no iteration depends on any of the iterations before it.)
I am sorry both the code and this question are so long. Thanks in advance for taking the time to read it.
My friend helped me directly. Here is a version that works:
u.varpred <- foreach(i = 1:length(grid.c[,1]), .packages = 'geoR', .combine = "c") %dopar% {
list.dataframes[[i]] <- rbind(datap$coords,grid.c[i,]);
list.geodatas[[i]] <- as.geodata(data.frame(cbind(list.dataframes[[i]],valores)));
list.krigings[[i]] <- krige.conv(list.geodatas[[i]], loc = grid.o,
krige = krige.control(type.krige = "SK", trend.d = "cte",
beta = beta1, cov.pars = emv$cov.pars));
u.varpred <- mean(krieging$krige.var - list.krigings[[i]]$krige.var);
list.dataframes[[i]] <- 0;
list.krigings[[i]] <- 0;
list.geodatas[[i]] <- 0;
u.varpred #this makes the results go into u.varpred
}
He gave me an example on why this works:
a <- NULL
foreach(i = 1:10) %dopar% {
a <- 5
}
print(a)
# a is still NULL
a <- NULL
a <- foreach(i = 1:10) %dopar% {
a <- 5
a
}
print(a)
#now it works
Hope this helps anyone.

How to avoid 'sink stack is full' error when sink() is used to capture messages in foreach loop

In order to see the console messages output by a function running in a foreach() loop I followed the advice of this guy and added a sink() call like so:
library(foreach)
library(doMC)
cores <- detectCores()
registerDoMC(cores)
X <- foreach(i=1:100) %dopar%{
sink("./out/log.branchpies.txt", append=TRUE)
cat(paste("\n","Starting iteration",i,"\n"), append=TRUE)
myFunction(data, argument1="foo", argument2="bar")
}
However, at iteration 77 I got the error 'sink stack is full'. There are well-answered questions about avoiding this error when using for-loops, but not foreach. What's the best way to write the otherwise-hidden foreach output to a file?
This runs without errors on my Mac:
library(foreach)
library(doMC)
cores <- detectCores()
registerDoMC(cores)
X <- foreach(i=1:100) %dopar%{
sink("log.branchpies.txt", append=TRUE)
cat(paste("\n","Starting iteration",i,"\n"))
sink() #end diversion of output
rnorm(i*1e4)
}
This is better:
library(foreach)
library(doMC)
cores <- detectCores()
registerDoMC(cores)
sink("log.branchpies.txt", append=TRUE)
X <- foreach(i=1:100) %dopar%{
cat(paste("\n","Starting iteration",i,"\n"))
rnorm(i*1e4)
}
sink() #end diversion of output
This works too:
library(foreach)
library(doMC)
cores <- detectCores()
registerDoMC(cores)
X <- foreach(i=1:100) %dopar%{
cat(paste("\n","Starting iteration",i,"\n"),
file="log.branchpies.txt", append=TRUE)
rnorm(i*1e4)
}
As suggested by this guy, it is quite tricky to keep track of the sink stack. It is therefore advisable to use cat's ability to write directly to a file, as suggested in the answer above:
cat(..., file="log.txt", append=TRUE)
To save some typing you could create a wrapper function that diverts output to file every time cat is called:
catf <- function(..., file="log.txt", append=TRUE){
cat(..., file=file, append=append)
}
So that at the end, when you call foreach you would use something like this:
library(foreach)
library(doMC)
cores <- detectCores()
registerDoMC(cores)
X <- foreach(i=1:100) %dopar%{
catf(paste("\n","Starting iteration",i,"\n"))
rnorm(i*1e4)
}
Hope it helps!
Unfortunately, none of the approaches above worked for me. With sink() inside the foreach() loop, the "sink stack is full" error kept being thrown. With sink() outside the loop, the file was created but never updated.
For me, the easiest way to create a log file that tracks the progress of a parallelised foreach() loop is the good old write.table() function.
library(foreach)
library(doParallel)
availableClusters <- makeCluster(detectCores() - 1) # use all CPU threads but one (one is left for the OS)
registerDoParallel(availableClusters) # register the workers for the parallelisation
x <- foreach(i = 1:100) %dopar% {
  log.text <- paste0(Sys.time(), " processing loop run ", i, "/100")
  write.table(log.text, "loop-log.txt", append = TRUE, row.names = FALSE, col.names = FALSE)
  # your statements here
}
stopCluster(availableClusters) # shut the workers down when done
And don't forget (as I did several times...) to use append = TRUE within write.table().
Call sink() with no arguments once inside the loop, at the end of each iteration, to close the diversion that the iteration opened. Every sink(file) call is then matched by a sink() call, and the error no longer occurs.
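The pairing of sink(file) with sink() can be made harder to forget by wrapping both in one function, as in the sketch below. The helper name log_line and the temporary log file are hypothetical. sink.number() reports how many diversions are currently open, so it should be the same before and after the loop. A plain for loop is used here to keep the sketch free of a parallel backend.

```r
log_file <- tempfile(fileext = ".txt")  # hypothetical log location

# Hypothetical helper: open a diversion, write one line, close the diversion.
log_line <- function(text, file) {
  sink(file, append = TRUE)  # push one diversion onto the sink stack
  on.exit(sink())            # pop it again, even if cat() fails
  cat(text, "\n", sep = "")
  invisible(NULL)
}

depth_before <- sink.number()
for (i in 1:100) {
  log_line(paste("Starting iteration", i), log_file)
}
depth_after <- sink.number()  # same as depth_before: nothing was left open

logged <- readLines(log_file)
```

Without the matching sink() call, each iteration would leave one more diversion open, which is what eventually fills the sink stack.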

Saving multiple outputs of foreach dopar loop

I would like to know if/how it would be possible to return multiple outputs as part of foreach dopar loop.
Let's take a very simplistic example. Let's suppose I would like to do 2 operations as part of the foreach loop, and would like to return or save the results of both operations for each value of i.
For only one output to return, it would be as simple as:
library(foreach)
library(doParallel)
cl <- makeCluster(3)
registerDoParallel(cl)
oper1 <- foreach(i=1:100000) %dopar% {
i+2
}
oper1 would be a list with 100000 elements, each element is the result of the operation i+2 for each value of i.
Suppose now I would like to return or save the results of two different operations separately, e.g. i+2 and i+3. I tried the following:
oper1 = list()
oper2 <- foreach(i=1:100000) %dopar% {
oper1[[i]] = i+2
return(i+3)
}
hoping that the results of i+2 will be saved in the list oper1, and that the results of the second operation i+3 will be returned by foreach. However, nothing gets populated in the list oper1! In this case, only the result of i+3 gets returned from the loop.
Is there any way of returning or saving both outputs in two separate lists?
Don't try to use side effects with foreach or any other parallel programming package. Instead, return all of the values from the body of the foreach loop in a list. If you want your final result to be a list of two lists rather than a list of 100,000 lists, then specify a combine function that transposes the results:
comb <- function(x, ...) {
lapply(seq_along(x),
function(i) c(x[[i]], lapply(list(...), function(y) y[[i]])))
}
oper <- foreach(i=1:10, .combine='comb', .multicombine=TRUE,
.init=list(list(), list())) %dopar% {
list(i+2, i+3)
}
oper1 <- oper[[1]]
oper2 <- oper[[2]]
Note that this combine function requires the use of the .init argument to set the value of x for the first invocation of the combine function.
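To see what the combine function does, it can be called by hand outside of foreach. In the sketch below, the accumulator acc plays the role of the .init value, and the string results "a1", "b1" and so on are made-up stand-ins for what the loop body would return.

```r
# Same combine function as above.
comb <- function(x, ...) {
  lapply(seq_along(x),
         function(i) c(x[[i]], lapply(list(...), function(y) y[[i]])))
}

# Start from the same value that .init supplies: two empty lists.
acc <- list(list(), list())

# Feed in the results of two iterations at once (as with .multicombine=TRUE).
acc <- comb(acc, list("a1", "b1"), list("a2", "b2"))

# A later call appends to the accumulated lists.
acc <- comb(acc, list("a3", "b3"))

first_outputs <- acc[[1]]   # "a1", "a2", "a3"
second_outputs <- acc[[2]]  # "b1", "b2", "b3"
```

The first argument always has the two-list shape, which is why the first invocation needs .init: a raw iteration result would not have that shape.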
I prefer to use a class to hold multiple results for a %dopar% loop.
This example spins up 3 cores, calculates multiple results on each core, then returns the list of results to the calling thread.
Tested under RStudio, Windows 10, and R v3.3.2.
library(foreach)
library(doParallel)
# Create class which holds multiple results for each loop iteration.
# Each loop iteration populates two properties: $result1 and $result2.
# For a great tutorial on S3 classes, see:
# http://www.cyclismo.org/tutorial/R/s3Classes.html#creating-an-s3-class
multiResultClass <- function(result1=NULL,result2=NULL)
{
me <- list(
result1 = result1,
result2 = result2
)
## Set the name for the class
class(me) <- append(class(me),"multiResultClass")
return(me)
}
cl <- makeCluster(3)
registerDoParallel(cl)
oper <- foreach(i=1:10) %dopar% {
result <- multiResultClass()
result$result1 <- i+1
result$result2 <- i+2
return(result)
}
stopCluster(cl)
oper1 <- oper[[1]]$result1
oper2 <- oper[[1]]$result2
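Note that oper[[1]] holds the results of the first iteration only. To gather one field across all iterations, the list can be unpacked afterwards, as in this sketch. The oper list is rebuilt here with lapply as a stand-in for the foreach output, so that the example runs without a cluster; the names all_result1 and all_result2 are hypothetical.

```r
# Stand-in for the 'oper' list returned by the foreach loop above:
# one list per iteration with the fields result1 and result2.
oper <- lapply(1:10, function(i) list(result1 = i + 1, result2 = i + 2))

# Pull each field out of every iteration's result.
all_result1 <- vapply(oper, function(r) r$result1, numeric(1))
all_result2 <- vapply(oper, function(r) r$result2, numeric(1))
```

vapply assumes each field is a single number. For fields that hold larger objects such as plots, lapply keeps them in a list instead.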
This toy example shows how to return multiple results from a %dopar% loop.
This example:
Spins up 3 cores.
Renders a graph on each core.
Returns the graph and an attached message.
Prints each graph and its attached message.
I found this really useful to speed up using Rmarkdown to print 1,800 graphs into a PDF document.
Tested under Windows 10, RStudio, and R v3.3.2.
R code:
# Demo of returning multiple results from a %dopar% loop.
library(foreach)
library(doParallel)
library(ggplot2)
cl <- makeCluster(3)
registerDoParallel(cl)
# Create class which holds multiple results for each loop iteration.
# Each loop iteration populates two properties: $resultPlot and $resultMessage.
# For a great tutorial on S3 classes, see:
# http://www.cyclismo.org/tutorial/R/s3Classes.html#creating-an-s3-class
plotAndMessage <- function(resultPlot=NULL,resultMessage="?")
{
me <- list(
resultPlot = resultPlot,
resultMessage = resultMessage
)
# Set the name for the class
class(me) <- append(class(me),"plotAndMessage")
return(me)
}
oper <- foreach(i=1:5, .packages=c("ggplot2")) %dopar% {
x <- c(i:(i+2))
y <- c(i:(i+2))
df <- data.frame(x,y)
p <- ggplot(df, aes(x,y))
p <- p + geom_point()
message <- paste("Hello, world! i=",i,"\n",sep="")
result <- plotAndMessage()
result$resultPlot <- p
result$resultMessage <- message
return(result)
}
# Print resultant plots and messages. Despite running on multiple cores,
# 'foreach' guarantees that the plots arrive back in the original order.
foreach(i=1:5) %do% {
# Print message attached to plot.
cat(oper[[i]]$resultMessage)
# Print plot.
print(oper[[i]]$resultPlot)
}
stopCluster(cl)
