I would like to know whether, and how, a foreach %dopar% loop can return multiple outputs.
Let's take a very simplistic example. Let's suppose I would like to do 2 operations as part of the foreach loop, and would like to return or save the results of both operations for each value of i.
Returning a single output is as simple as:
library(foreach)
library(doParallel)
cl <- makeCluster(3)
registerDoParallel(cl)
oper1 <- foreach(i=1:100000) %dopar% {
i+2
}
oper1 is a list with 100000 elements, each of which is the result of i+2 for one value of i.
Suppose now I would like to return or save the results of two different operations separately, e.g. i+2 and i+3. I tried the following:
oper1 = list()
oper2 <- foreach(i=1:100000) %dopar% {
oper1[[i]] = i+2
return(i+3)
}
hoping that the results of i+2 will be saved in the list oper1, and that the results of the second operation i+3 will be returned by foreach. However, nothing gets populated in the list oper1! In this case, only the result of i+3 gets returned from the loop.
Is there any way of returning or saving both outputs in two separate lists?
Don't try to use side effects with foreach or any other parallel programming package. Instead, return all of the values from the body of the foreach loop in a list. If you want your final result to be a list of two lists rather than a list of 100,000 lists, then specify a combine function that transposes the results:
comb <- function(x, ...) {
  lapply(seq_along(x),
    function(i) c(x[[i]], lapply(list(...), function(y) y[[i]])))
}
oper <- foreach(i=1:10, .combine='comb', .multicombine=TRUE,
                .init=list(list(), list())) %dopar% {
  list(i+2, i+3)
}
oper1 <- oper[[1]]
oper2 <- oper[[2]]
Note that this combine function requires the use of the .init argument to set the value of x for the first invocation of the combine function.
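As a rough alternative to a custom combine function, the same two lists can be obtained by returning one list per iteration and splitting the result after the loop. The sketch below uses %do% so that it runs without a cluster; the assumption is that %dopar% with a registered backend returns the same structure. The name res is illustrative only.

```r
library(foreach)

# Each iteration returns both values together as a two-element list.
res <- foreach(i = 1:10) %do% {
  list(i + 2, i + 3)
}

# Split the list of pairs into two separate lists after the loop.
oper1 <- lapply(res, `[[`, 1)
oper2 <- lapply(res, `[[`, 2)
```

This keeps the loop body simple at the cost of holding the untransposed list in memory until the split is done.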
I prefer to use a class to hold multiple results for a %dopar% loop.
This example spins up 3 cores, calculates multiple results on each core, then returns the list of results to the calling thread.
Tested under RStudio, Windows 10, and R v3.3.2.
library(foreach)
library(doParallel)
# Create class which holds multiple results for each loop iteration.
# Each loop iteration populates two properties: $result1 and $result2.
# For a great tutorial on S3 classes, see:
# http://www.cyclismo.org/tutorial/R/s3Classes.html#creating-an-s3-class
multiResultClass <- function(result1=NULL,result2=NULL)
{
me <- list(
result1 = result1,
result2 = result2
)
## Set the name for the class
class(me) <- append(class(me),"multiResultClass")
return(me)
}
cl <- makeCluster(3)
registerDoParallel(cl)
oper <- foreach(i=1:10) %dopar% {
result <- multiResultClass()
result$result1 <- i+1
result$result2 <- i+2
return(result)
}
stopCluster(cl)
# Results of the first iteration; use oper[[i]] for iteration i.
oper1 <- oper[[1]]$result1
oper2 <- oper[[1]]$result2
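The two lines above read the results of the first iteration only. To gather one field across all iterations, something like the following could be used. This is a minimal sketch: it rebuilds a stand-in for oper with %do% and plain named lists so that it runs on its own, and the names all_result1 and all_result2 are made up for illustration.

```r
library(foreach)

# Stand-in for the list returned by the loop above: one record per iteration.
oper <- foreach(i = 1:10) %do% {
  list(result1 = i + 1, result2 = i + 2)
}

# Pull the same field out of every record; vapply checks that each is one number.
all_result1 <- vapply(oper, function(r) r$result1, numeric(1))
all_result2 <- vapply(oper, function(r) r$result2, numeric(1))
```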
This toy example shows how to return multiple results from a %dopar% loop.
This example:
Spins up 3 cores.
Renders a graph on each core.
Returns the graph and an attached message.
Prints each graph and its attached message.
I found this really useful to speed up using Rmarkdown to print 1,800 graphs into a PDF document.
Tested under Windows 10, RStudio, and R v3.3.2.
R code:
# Demo of returning multiple results from a %dopar% loop.
library(foreach)
library(doParallel)
library(ggplot2)
cl <- makeCluster(3)
registerDoParallel(cl)
# Create class which holds multiple results for each loop iteration.
# Each loop iteration populates two properties: $resultPlot and $resultMessage.
# For a great tutorial on S3 classes, see:
# http://www.cyclismo.org/tutorial/R/s3Classes.html#creating-an-s3-class
plotAndMessage <- function(resultPlot=NULL,resultMessage="?")
{
me <- list(
resultPlot = resultPlot,
resultMessage = resultMessage
)
# Set the name for the class
class(me) <- append(class(me),"plotAndMessage")
return(me)
}
oper <- foreach(i=1:5, .packages=c("ggplot2")) %dopar% {
x <- c(i:(i+2))
y <- c(i:(i+2))
df <- data.frame(x,y)
p <- ggplot(df, aes(x,y))
p <- p + geom_point()
message <- paste("Hello, world! i=",i,"\n",sep="")
result <- plotAndMessage()
result$resultPlot <- p
result$resultMessage <- message
return(result)
}
# Print resultant plots and messages. Despite running on multiple cores,
# 'foreach' guarantees that the plots arrive back in the original order.
foreach(i=1:5) %do% {
# Print message attached to plot.
cat(oper[[i]]$resultMessage)
# Print plot.
print(oper[[i]]$resultPlot)
}
stopCluster(cl)
The following is a parallel loop I am trying to run in R:
cl <- makeCluster(30,type="SOCK")
registerDoSNOW(cl)
results <- foreach (i = 1:30, .combine='bindlist', .multicombine=TRUE) %dopar% {
test <- i
test <- as.list(test)
list(test)
}
stopCluster(cl)
The output of my code is always a list and I want to combine the list into one large list. Thus I wrote the following .combine function:
bindlist <- function(x,y,...){
append(list(x),list(y),list(...))
}
As I am doing multiple runs and the number of variables changes, I tried to use the ... (dots) argument. However, it does not work. How can I rewrite the .combine function so that it works with a changing number of variables?
Have you considered using 'c'?
results <- foreach (i = 1:4, .combine='c', .multicombine=TRUE) %dopar% {
test <- i
test <- as.list(test)
list(test)
}
If this adds an additional unwanted 'level' to your results, you could use 'unlist' to remove that level.
unlist(results, recursive = FALSE)
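If a named combine function is still wanted, a variadic wrapper around c is one possible shape. The sketch below is an assumption about what the asker needs, not a tested replacement for their setup. It uses %do% so that it runs without a cluster, and flat is an illustrative name.

```r
library(foreach)

# Variadic combine: accepts however many partial results foreach passes in.
bindlist <- function(...) c(...)

results <- foreach(i = 1:4, .combine = bindlist, .multicombine = TRUE) %do% {
  test <- as.list(i)
  list(test)
}

# Each element is still wrapped in a one-element list; remove that level.
flat <- unlist(results, recursive = FALSE)
```

Because bindlist takes only ..., it does not matter whether foreach hands it two results or many at a time.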
I'd like to do parallel processing in R using the packages 'doParallel' and 'foreach'. The idea is to run the computations in parallel without keeping any results. From what I've found, the 'foreach' operator always returns some kind of result, which takes up memory. So I need help getting an empty result from parallel loops.
# 1. Packages
library(doParallel)
library(foreach)
# 2. Create and run app cluster
cluster_app <- makeCluster(detectCores())
registerDoParallel(cluster_app)
# 3. Loop with result
list_i <- foreach(i = 1:100) %dopar% {
print(i)
}
# 4. List is not empty
list_i
# 5. How make loop with empty 'list_i' ?
# TODO: make 'list' equal NULL or NA
# 6. Stop app cluster
stopCluster(cluster_app)
Here is the solution I found:
# 1. Packages
library(doParallel)
library(foreach)
# 2. Create and run app cluster
cluster_app <- makeCluster(detectCores())
registerDoParallel(cluster_app)
# 3. Loop with result
list_i <- foreach(i = 1:100) %dopar% {
print(i)
}
list_i
# 4. Mock data processing
mock_data <- function(x) {
  data.frame(matrix(NA, nrow = x, ncol = x))
}
# 5. Loop that returns an empty result
list_i <- foreach(i = 1:10, .combine = 'c') %dopar% {
  # 5.1. Calculations
  mock_data(i)
  # 5.2. Result
  NULL
}
# The result is a single NULL (not a data set)
list_i
# 6. Stop app cluster
stopCluster(cluster_app)
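The key point is that every iteration ends with NULL and that .combine = 'c' collapses those NULLs into one. A minimal sketch of just that part, using %do% so that no cluster is needed (res and tmp are illustrative names):

```r
library(foreach)

# The body does its work, then ends with NULL so nothing is carried back.
res <- foreach(i = 1:10, .combine = 'c') %do% {
  tmp <- matrix(NA, nrow = i, ncol = i)  # placeholder computation
  NULL
}

# Check whether the combined result is empty.
is.null(res)
```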
I am trying to learn foreach to parallelise my task.
My for-loop looks like this:
# create an empty matrix to store results
mat <- matrix(-9999, nrow = length(unique(dat$mun)), ncol = 2)
for(mun in unique(dat$mun)) {
  dat <- read.csv(paste0("data",mun,".csv"))
  tot.dat <- sum(dat$x)
  mat[mat[,1]== mun,2] <- tot.dat
}
unique(dat$mun) has a length of 5563.
I want to use foreach to parallelise my task.
library(foreach)
library(doParallel)
# number of iterations
iters <- 5563
foreach(icount(iters)) %dopar% {
mun <- unique(dat$mun)[mun] # this is where I cannot figure out how to assign mun so that it reads the data for mun
dat <- read.csv(paste0("data",mun,".csv"))
tot.dat <- sum(dat$x)
mat[mat[,1]== mun,2] <- tot.dat
}
This could be one solution.
Note that I'm using Windows here, and I called registerDoParallel() for it to work.
library(foreach)
library(doParallel)
# number of iterations
iters <- 5563
registerDoParallel()
mun <- unique(dat$mun)
tableList <- foreach(i=1:iters) %dopar% {
  dat <- read.csv(paste0("data",mun[i],".csv"))
  tot.dat <- sum(dat$x)
}
unlist(tableList)
Essentially, the value of the last expression inside {...} is stored as one element of a list.
In this case, the result (tot.dat, which is a number) is collected in tableList, and unlist() converts it to a vector for further use.
The result inside {...} can be anything: a single number, a vector, a data frame, and so on.
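To illustrate returning something richer than a single number, the sketch below returns a one-row data frame per file and lets .combine = rbind stack the rows, which gives the two-column table the question was after. The CSV files are hypothetical stand-ins written to a temporary directory, and %do% is used so that the example runs without a registered backend.

```r
library(foreach)

# Hypothetical stand-in data: three small CSV files in a temporary directory.
mun <- c(101, 102, 103)
csv_dir <- tempdir()
for (m in mun) {
  write.csv(data.frame(x = c(m, 1)),
            file.path(csv_dir, paste0("data", m, ".csv")),
            row.names = FALSE)
}

# One row per file; rbind stacks the rows into a single data frame.
totals <- foreach(i = seq_along(mun), .combine = rbind) %do% {
  dat <- read.csv(file.path(csv_dir, paste0("data", mun[i], ".csv")))
  data.frame(mun = mun[i], total = sum(dat$x))
}
```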
Another approach for your problem would be to combine all existing data together, labelling it with its appropriate source file, so the middle component will look something like
library(plyr)
tableAll <- foreach(i=1:iters) %dopar% {
  dat <- read.csv(paste0("data",mun[i],".csv"))
  dat$source = mun[i]
  dat # return the data frame; otherwise only the value of the assignment is returned
}
rbind.fill(tableAll)
Then we can use it for further analysis.
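A self-contained sketch of this stacking approach, with two made-up input files and base R's do.call(rbind, ...) standing in for rbind.fill (an assumption that all files share the same columns). The names src and combined, and the stack_ file prefix, are illustrative only.

```r
library(foreach)

# Hypothetical stand-in data: two small CSV files in a temporary directory.
src <- c("a", "b")
csv_dir <- tempdir()
for (s in src) {
  write.csv(data.frame(x = 1:2),
            file.path(csv_dir, paste0("stack_", s, ".csv")),
            row.names = FALSE)
}

tableAll <- foreach(i = seq_along(src)) %do% {
  dat <- read.csv(file.path(csv_dir, paste0("stack_", src[i], ".csv")))
  dat$source <- src[i]
  dat  # the data frame itself must be the last expression
}

# Stack the per-file data frames into one.
combined <- do.call(rbind, tableAll)
```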
I'm using a foreach loop to combine results in a list. The code below shows two loops one using for and one foreach. Both work fine but after executing the foreach version R prints the contents of the list to the console. Why is it doing this, is my syntax wrong?
Regards
Dave
inputs <- list(
list()
,list()
,list()
)
# prints list
results <- list()
foreach(input = inputs) %do% {
results[[length(results)+1]] <- input
}
# does not print list
results <- list()
for (i in seq_along(inputs)) {
results[[length(results)+1]] <- inputs[[i]]
}
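foreach is meant to be used a bit differently. Unlike for, foreach returns a value (the list of results from each iteration), and R prints that value at the console when it is not assigned to anything. Assign the result instead, as in this example (which doesn't print the results):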
require(foreach)
inputs <- list(list(), list(), list())
results <- foreach(input = inputs) %do% {
one.result <- input
return(one.result)
}
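The difference in printing comes down to visibility of the returned value, which can be inspected with base R's withVisible(). A small sketch, where fe and fl are illustrative names:

```r
library(foreach)

inputs <- list(list(), list(), list())

# foreach(...) %do% {...} is a function call that returns the list of results.
fe <- withVisible(foreach(input = inputs) %do% { input })

# A for loop evaluates to NULL, and that value is marked invisible.
fl <- withVisible(for (input in inputs) input)

fe$visible
fl$visible
```

A visible value that is not assigned gets printed at the top level, which is why the unassigned foreach call prints and the for loop does not.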
I am trying to convert the following code into parallel using foreach and %dopar%.
library(doSNOW)
library(foreach)
cl<- makeCluster(4, type = "SOCK")
registerDoSNOW(cl)
min_subid <- c()
max_subid <- c()
p_typ <- c()
p_nm <- c()
st_tm<-c()
end_tm <- c()
supp <- c()
chart_type <- c()
foreach(j =1:noOfPhases) %dopar%
{
start_time <-phases[j, colnames(phases)=="StartTime"]
end_time <-phases[j, colnames(phases)=="StopTime"]
phase_type <-phases[j, colnames(phases)=="Phase_Type_Id"]
phase_name <-phases[j, colnames(phases)=="Phase_Name"]
suppress <-phases[j, colnames(phases)=="Suppression_Time"]
chart_typ <-phases[j, colnames(phases)=="chartType"]
conft<-(masterData$Time.Subgroup>=start_time & masterData$Time.Subgroup<=end_time)
masterData[which(conft), colnames(masterData)=="Phase_Type"]<-phase_type
masterData[which(conft), colnames(masterData)=="Phase_Name"]<-phase_name
min_subid <- rbind(min_subid, min(which(conft)))
max_subid <- rbind(max_subid, max(which(conft)))
p_typ <- rbind( p_typ, masterData$Phase_Type[min(which(conft))])
p_nm <- rbind( p_nm, masterData$Phase_Name[min(which(conft))])
st_tm <- rbind( st_tm, as.character(start_time))
end_tm <- rbind( end_tm, as.character(end_time))
supp <- rbind(supp,as.character(suppress))
chart_type <- rbind(chart_type,as.character(chart_typ))
phase_info <- data.frame(Subgrp_No_Start=min_subid, Subgrp_No_End=max_subid, Phase_Type=p_typ,
Phase_Name=p_nm, Start_Time=st_tm, Stop_Time=end_tm,
Suppression_Time=supp,ChartType=chart_type)
}
phase_output<-merge(phase_info, phases, by.x=c("Start_Time",
"Stop_Time","ChartType"), by.y=c("StartTime", "StopTime","chartType"))
The above code executes successfully when %do% is used instead of %dopar%. Can anyone help me understand why I get the following error when it runs in parallel (%dopar%) but not when it runs sequentially (%do%)?
Error in merge(phase_info, phases, by.x = c("Start_Time", "Stop_Time", :
object 'phase_info' not found
The solution is really simple, but I will first explain what happens when you execute the code, since that explains the error.
In your foreach block, one data frame (phase_info) is created for each value of j, and these data frames are returned together in a list. However, since the assignment phase_info <- data.frame(...) is located inside the foreach body rather than outside it, the returned list is not stored anywhere and gets discarded. The confusion arises because with %do% all the data frames are created sequentially on the master node, whereas with %dopar% they are created in parallel on the worker nodes. The subsequent merge command is executed on the master node, which causes an error under %dopar% because phase_info does not exist in the master's workspace. Also note that with %do% as above, each iteration of foreach overwrites the result of the previous one (i.e. you get only the result of the last iteration).
This minor change fixes it:
phase_info <- foreach(...) %dopar% {
...
data.frame(Subgrp_No_Start=min_subid, Subgrp_No_End=max_subid, Phase_Type=p_typ,
Phase_Name=p_nm, Start_Time=st_tm, Stop_Time=end_tm,
Suppression_Time=supp,ChartType=chart_type)
# No need to give it a name as it will be returned and the name forgotten
}
phase_output <- merge(phase_info, ...)
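If a single data frame is preferred over a list, one option is to add .combine = rbind so that the per-iteration data frames are stacked and merge can be called once. The sketch below uses made-up data (two phases, ten subgroups) with only a few of the original columns, and %do% so that it runs without a cluster; it is an illustration of the pattern, not the asker's full computation. Without .combine, the loop returns a list, as discussed next.

```r
library(foreach)

# Made-up stand-ins for the asker's 'phases' and 'masterData'.
phases <- data.frame(StartTime = c(1, 6), StopTime = c(5, 10),
                     Phase_Name = c("A", "B"))
masterData <- data.frame(Time.Subgroup = 1:10)

# Each iteration returns a one-row data frame; rbind stacks them.
phase_info <- foreach(j = seq_len(nrow(phases)), .combine = rbind) %do% {
  conft <- masterData$Time.Subgroup >= phases$StartTime[j] &
           masterData$Time.Subgroup <= phases$StopTime[j]
  data.frame(Subgrp_No_Start = min(which(conft)),
             Subgrp_No_End   = max(which(conft)),
             Start_Time      = phases$StartTime[j],
             Stop_Time       = phases$StopTime[j])
}

# phase_info now exists in the calling workspace, so merge can find it.
phase_output <- merge(phase_info, phases,
                      by.x = c("Start_Time", "Stop_Time"),
                      by.y = c("StartTime", "StopTime"))
```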
As I mentioned above, phase_info will now be a list where each element is a data frame. I am just guessing now but you probably want to execute the merge elementwise then, like this:
phase_output <- lapply(phase_info, merge, phases, by.x=c("Start_Time",
"Stop_Time","ChartType"), by.y=c("StartTime", "StopTime","chartType"))