I am using
library(foreach)
library(doSNOW)
and I have a function mystoploss(data, n=14).
I then call it like this (I want to loop over just n=14 for now):
registerDoSNOW(makeCluster(4, type = "SOCK"))
foreach(i = 14) %dopar% {assign(paste("Performance",i,sep=""),
mystoploss(data=mydata,n=i))}
I then try to access Performance14 from above, but it has not been assigned.
Is there some way to make the assignment so that the output ends up in Performance14?
And if I use
foreach(i = 14) %dopar% {assign(paste("Performance",i,sep=""),
mystoploss(data=mydata,n=i),envir = .GlobalEnv)}
I get this error:
Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
worker initialization failed: Error in as.name
This is because the assign operations happen in the worker processes. The values of the variables are sent back to the master (see your R session console), but not under the names you assigned. You need to capture these values and assign them names again. See this related question.
The following is an alternative that may be of help: assign the output of foreach to an intermediate variable, then assign it to your desired variables in the current 'master process' environment.
PerformanceAll <- foreach(i = 1:14,.combine="c") %dopar% { mystoploss(data=mydata,n=i) } #pick .combine appropriately
for(i in 1:14){ assign(paste("Performance",i,sep=""), PerformanceAll[i]) }
Here is the full example I tried:
library(foreach)
library(doSNOW)
mystoploss <- function(data=1,n=1){
return(runif(data)) #some operation, returns a scalar
}
mydata <- 1
registerDoSNOW(makeCluster(4, type = "SOCK"))
PerformanceAll <- foreach(i = 1:14,.combine="c") %dopar% { mystoploss(data=mydata,n=i) }#pick .combine appropriately
for(i in 1:14){ assign(paste("Performance",i,sep=""), PerformanceAll[i]) }
Edit: If the output of mystoploss is a list, then make the following changes:
mystoploss <- function(data=1,n=1){#Example
return(list(a=runif(data),b=1)) #some operation, return a list
}
PerformanceAll <- foreach(i = 1:14) %dopar% { mystoploss(data=mydata,n=i) }#remove .combine
for(i in 1:14){ assign(paste("Performance",i,sep=""), PerformanceAll[[i]]) } #double brackets
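As a compact alternative to the assign loop (just a sketch of the same idea), the results can also be named and copied into the global environment in one step (for the scalar variant above, wrap PerformanceAll in as.list first):
names(PerformanceAll) <- paste0("Performance", seq_along(PerformanceAll))
list2env(PerformanceAll, envir = .GlobalEnv) # creates Performance1, ..., Performance14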
I am trying to use the doParallel package with foreach and %dopar% for the first time, as I need to speed up my computation.
Although the code executes without raising any error, no files are stored in the output folder.
I have a list of file paths (list_files) and a function (my_function) that I previously validated using sapply. When I use sapply, the output is stored in the output location. Using foreach and %dopar%, however, produces no output in my output location.
# Define my function and call it my_function
my_function <- function(input_dir, output_dir) {
tryCatch(
expr = {
file <- read.csv(input_dir,sep = "\t", col.names = c("column1", "column2"))
file <- as_tibble(file)
file_noNA <- file %>% filter(!is.na(column1))
name <- substr(input_dir, nchar(input_dir)-8, nchar(input_dir)-4)
save(file_noNA, file = paste0(output_dir, name, ".rds"))
}
)
}
library("parallel")
library("foreach")
library("doParallel")
# Set number of cores
n.cores <- 5
# Register the parallel backend and check the number of workers
doParallel::registerDoParallel(n.cores)
getDoParWorkers()
# Apply function with parallel computing
foreach(i = list_files) %dopar% function(x) {
my_function(
input_dir = x,
output_dir = output_location)
}
This is what I have tried (without success):
Assigned the result
Used foreach(i = list_files, .combine = 'c') %dopar% function(x) {...}
Used a single file instead of list_files
Reduced the number of cores
Do I need to add an export statement, e.g.
.export=ls(envir=globalenv())
or
.export=ls() ?
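One detail worth checking: the %dopar% body above evaluates to an anonymous function and never actually calls my_function, which is why nothing gets written. A minimal sketch that calls the function directly, reusing the backend registered above and loading dplyr (needed inside my_function) on the workers via .packages, could look like this:
results <- foreach(f = list_files, .packages = "dplyr") %dopar% {
  # call my_function for every file path; output_location must exist in the master session
  my_function(input_dir = f, output_dir = output_location)
}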
I have found a feature/bug in the foreach package which I do not understand. Perhaps someone can explain this behaviour to me:
I created a for loop with the foreach package (I use it together with multicore calculations, but here it is just a sequential example; the bug appears in both variants). This loop runs r times. In every run a list with c entries is returned, so I expect a list with r entries, and every entry consists of c lists.
My code was the following one:
library(foreach)
clusters <- 10
runs <- 100
temp <- foreach(r = 1:runs,
.combine = 'list',
.multicombine = TRUE) %do% {
signal_all <- lapply(1:clusters, function(x){
return(1)
})
return(signal_all)
} ## end do
With this code, everything works as expected: temp is a list with 100 entries, and each entry is a list of 10 elements.
But when increasing runs to 101, the expected list structure of temp is destroyed. However, when the line .combine = 'list' is commented out, everything works as expected again:
library(foreach)
clusters <- 10
runs <- 101
temp <- foreach(r = 1:runs,
.multicombine = TRUE) %do% {
signal_all <- lapply(1:clusters, function(x){
return(1)
})
return(signal_all)
} ## end do
Can someone explain this behaviour?
Thanks for any help!
Meanwhile I have found a solution.
The foreach function knows that some combine functions (e.g. c or cbind) take many arguments, and by default it calls them with up to 100 arguments at a time in order to improve performance. With 101 results, the combine function list is therefore called a second time, which nests the output differently than expected. With the argument .maxcombine you can set this limit manually.
library(foreach)
clusters <- 10
runs <- 101
temp <- foreach(r = 1:runs,
.combine = 'list',
.maxcombine = runs,
.multicombine = T) %do% {
signal_all <- lapply(1:clusters, function(x){
return(1)
})
return(signal_all)
} ## end do
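As a quick sanity check (given runs and clusters above), the structure of temp can be verified like this:
length(temp)      # 101: one element per run
length(temp[[1]]) # 10: one entry per cluster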
The following is a parallel loop I am trying to run in R:
library(doSNOW)
cl <- makeCluster(30, type = "SOCK")
registerDoSNOW(cl)
results <- foreach (i = 1:30, .combine='bindlist', .multicombine=TRUE) %dopar% {
test <- i
test <- as.list(test)
list(test)
}
stopCluster(cl)
The output of my code is always a list, and I want to combine these lists into one large list. Thus I wrote the following .combine function:
bindlist <- function(x,y,...){
append(list(x),list(y),list(...))
}
As I am doing multiple runs and the number of variables changes, I tried to use the dots argument (...). However, it does not work. How can I rewrite the .combine function so that it works with a changing number of variables?
Have you considered using 'c'?
results <- foreach (i = 1:4, .combine='c', .multicombine=TRUE) %dopar% {
test <- i
test <- as.list(test)
list(test)
}
If this adds an additional unwanted 'level' to your results, you could use 'unlist' to remove that level.
unlist(results, recursive = FALSE)
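If you still prefer a named combine function, a variadic version that simply concatenates all of its arguments behaves the same as using 'c' directly (a sketch, using the cluster registered above):
bindlist <- function(...) {
  c(...)  # every argument is a list, so this concatenates them into one flat list
}
results <- foreach (i = 1:4, .combine = bindlist, .multicombine = TRUE) %dopar% {
  list(as.list(i))
}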
I'd like to know whether the cpv function from the trotter package works with %dopar%. I'm getting the following error:
task 1 failed - "object of type 'S4' is not subsettable"
Here's a small example:
library(doParallel)
library(trotter)
registerDoParallel(cores = 2)
x <- letters
combos <- cpv(2, 1:4)
print(combos)
num_combos <- length(combos)
results_list <- foreach(combo_num=1:num_combos) %dopar% { # many iterations
y <- x[combos[combo_num]]
# time consuming stuff follows that involves using y
}
If I replace %dopar% with %do% (or simply use a for loop), it works fine.
Depending on the cluster type, one needs to explicitly specify the packages used on the workers via the .packages argument. The following should work:
library(doParallel)
library(trotter)
cl <- makePSOCKcluster(2)
registerDoParallel(cl=cl)
x <- letters
combos <- cpv(2, 1:4)
num_combos <- length(combos)
rl <- foreach(combo_num=1:num_combos, .packages="trotter") %dopar% {
x[combos[combo_num]]
}
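A roughly equivalent alternative (a sketch) is to load the package on every worker once, right after creating the cluster, instead of passing .packages:
cl <- makePSOCKcluster(2)
clusterEvalQ(cl, library(trotter))  # load trotter on each worker
registerDoParallel(cl = cl)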
I am trying to convert the following code into parallel using foreach and %dopar%.
library(doSNOW)
library(foreach)
cl<- makeCluster(4, type = "SOCK")
registerDoSNOW(cl)
min_subid <- c()
max_subid <- c()
p_typ <- c()
p_nm <- c()
st_tm<-c()
end_tm <- c()
supp <- c()
chart_type <- c()
foreach(j =1:noOfPhases) %dopar%
{
start_time <-phases[j, colnames(phases)=="StartTime"]
end_time <-phases[j, colnames(phases)=="StopTime"]
phase_type <-phases[j, colnames(phases)=="Phase_Type_Id"]
phase_name <-phases[j, colnames(phases)=="Phase_Name"]
suppress <-phases[j, colnames(phases)=="Suppression_Time"]
chart_typ <-phases[j, colnames(phases)=="chartType"]
conft<-(masterData$Time.Subgroup>=start_time & masterData$Time.Subgroup<=end_time)
masterData[which(conft), colnames(masterData)=="Phase_Type"]<-phase_type
masterData[which(conft), colnames(masterData)=="Phase_Name"]<-phase_name
min_subid <- rbind(min_subid, min(which(conft)))
max_subid <- rbind(max_subid, max(which(conft)))
p_typ <- rbind( p_typ, masterData$Phase_Type[min(which(conft))])
p_nm <- rbind( p_nm, masterData$Phase_Name[min(which(conft))])
st_tm <- rbind( st_tm, as.character(start_time))
end_tm <- rbind( end_tm, as.character(end_time))
supp <- rbind(supp,as.character(suppress))
chart_type <- rbind(chart_type,as.character(chart_typ))
phase_info <- data.frame(Subgrp_No_Start=min_subid, Subgrp_No_End=max_subid, Phase_Type=p_typ,
Phase_Name=p_nm, Start_Time=st_tm, Stop_Time=end_tm,
Suppression_Time=supp,ChartType=chart_type)
}
phase_output<-merge(phase_info, phases, by.x=c("Start_Time",
"Stop_Time","ChartType"), by.y=c("StartTime", "StopTime","chartType"))
The above code executes successfully when %do% is used instead of %dopar%. Can anyone help me understand why I get the following error when it runs in parallel (%dopar%) but runs successfully sequentially (%do%)?
Error in merge(phase_info, phases, by.x = c("Start_Time", "Stop_Time", :
object 'phase_info' not found
The solution is really simple, but I will start with an explanation of what happens when you execute the code, in order to explain the error.
What happens in your foreach block is that one data frame (phase_info) is created for each value of j, and they are all returned together in a list. However, since your assignment phase_info <- data.frame(...) is located inside the foreach rather than outside, this list is not stored anywhere and gets discarded. The cause for confusion is that with %do% you create all the data frames sequentially on the master node, whereas with %dopar% they are created in parallel on the worker nodes. The following merge command is executed on the master node, causing an error with %dopar% since phase_info does not exist in the master's workspace. Also note that when using %do% like above, each iteration of the foreach overwrites the result of the previous ones (i.e. you get only the result of the last iteration).
This minor change fixes it:
phase_info <- foreach(...) %dopar% {
...
data.frame(Subgrp_No_Start=min_subid, Subgrp_No_End=max_subid, Phase_Type=p_typ,
Phase_Name=p_nm, Start_Time=st_tm, Stop_Time=end_tm,
Suppression_Time=supp,ChartType=chart_type)
# No need to give it a name as it will be returned and the name forgotten
}
phase_output <- merge(phase_info, ...)
As I mentioned above, phase_info will now be a list where each element is a data frame. I am just guessing now, but you probably want to execute the merge elementwise, like this:
phase_output <- lapply(phase_info, merge, phases, by.x=c("Start_Time",
"Stop_Time","ChartType"), by.y=c("StartTime", "StopTime","chartType"))