I am trying to learn foreach to parallelise my task.
My for-loop looks like this:
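# create a matrix to store results: column 1 holds mun, column 2 the total
muns <- unique(dat$mun)
mat <- matrix(-9999, nrow = length(muns), ncol = 2)
mat[, 1] <- muns
for (mun in muns) {
  mun.dat <- read.csv(paste0("data", mun, ".csv"))  # separate name so dat is not overwritten
  tot.dat <- sum(mun.dat$x)
  mat[mat[, 1] == mun, 2] <- tot.dat
}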
unique(dat$mun) has a length of 5563.
I want to use foreach to parallelise my task.
library(foreach)
library(doParallel)
# number of iterations
iters <- 5563
foreach(icount(iters)) %dopar% {
  mun <- unique(dat$mun)[mun] # this is where I cannot figure out how to assign mun so that it reads the data for mun
  dat <- read.csv(paste0("data", mun, ".csv"))
  tot.dat <- sum(dat$x)
  mat[mat[,1] == mun, 2] <- tot.dat
}
This could be one solution.
Note that I'm on Windows here, and I had to call registerDoParallel() for it to work.
library(foreach)
library(doParallel)
# number of iterations
iters <- 5563
registerDoParallel()
mun <- unique(dat$mun)
tableList <- foreach(i = 1:iters) %dopar% {
  dat <- read.csv(paste0("data", mun[i], ".csv"))
  sum(dat$x)  # the value of the last expression is what foreach collects
}
unlist(tableList)
Essentially, the value of the last expression inside {...} is stored in a list, one element per iteration.
In this case, the result (the sum, which is a single number) is collected in tableList, and unlist() converts it to a vector for further use.
The result inside {...} can be anything: a single number, a vector, a data frame, or any other object.
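As a small illustration of how the value of the loop body is collected, here is a sketch that uses the sequential %do% operator so it runs without a registered backend; swapping in %dopar% after registerDoParallel() should behave the same way. The toy computation (i^2) is only a stand-in for reading a file and summing a column.

```r
library(foreach)

# The last expression evaluated in the body is what foreach collects.
squares_list <- foreach(i = 1:5) %do% {
  i^2
}

# By default the results come back as a list, one element per iteration.
squares_vec <- unlist(squares_list)

# .combine = c asks foreach to build the vector directly.
squares_direct <- foreach(i = 1:5, .combine = c) %do% {
  i^2
}
```

Using .combine = c saves the separate unlist() step when every iteration returns a single number.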
Another approach to your problem would be to combine all the data, labelling each row with its source file, so the middle component would look something like:
library(plyr)
tableAll <- foreach(i = 1:iters) %dopar% {
  dat <- read.csv(paste0("data", mun[i], ".csv"))
  dat$source <- mun[i]
  dat  # return the labelled data frame, not just the value of the assignment
}
rbind.fill(tableAll)
Then we can use it for further analysis.
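To make the "combine everything, then analyse" idea concrete, here is a self-contained sketch. The temporary files, the three mun codes, and the names table_all, combined and totals are made up for illustration; base rbind stands in for rbind.fill, which should be fine as long as every file has the same columns.

```r
library(foreach)

# Hypothetical setup: write three small CSV files to a temporary directory
# so the example is self-contained.
tmp_dir <- tempfile("mun_demo")
dir.create(tmp_dir)
mun <- c(11, 12, 13)
for (m in mun) {
  write.csv(data.frame(x = seq_len(3) * m),
            file.path(tmp_dir, paste0("data", m, ".csv")),
            row.names = FALSE)
}

# Read each file, tag its rows with the source, and return the data frame
# itself as the last expression so that foreach collects it.
table_all <- foreach(i = seq_along(mun)) %do% {
  d <- read.csv(file.path(tmp_dir, paste0("data", mun[i], ".csv")))
  d$source <- mun[i]
  d
}

# Stack the pieces into one data frame.
combined <- do.call(rbind, table_all)

# Per-source totals, matching the sums computed in the first approach.
totals <- tapply(combined$x, combined$source, sum)
```

With the source column in place, any grouped summary can be computed after the loop instead of inside it.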
I have a foreach %dopar% code setup to process my data in parallel but I am looking at ways to increase the speed. Basically I have a large data frame that is loaded using fread and
The size of the dataframe is 225 obs x 655369 variables. The foreach command selects two variables at a time, runs this process function (code that calculated various mediation, moderation, and conditional process models), for a total of 327,684 times. For this function the data must all be within the same dataframe. I noticed that the size of the dataframe seems to greatly slow down the foreach function.
From what I can tell, the slowdown comes from how the process function accesses the data. My guess is that on each foreach iteration the process function scans the entire data frame until it finds the variable named in each of its inputs.
One of my thoughts is to just chunk the data into smaller data frames to speed up processing time, and then merge the outputs together at the end. But I was wondering if anyone else has any suggestions for speeding this up as I am obviously not overly familiar with R.
The variable names for area_list and thickness_list, which are the mediators and the only values that change between each loop are labelled like this, such that the last number is either 0 or 1 for the pair, with all other numbers matching:
value_0_0_0 with value_0_0_1 for the first loop
value_1_0_0 with value_1_0_1 for the second loop
value_2_0_0 with value_2_0_1 for the third loop
value_3_0_0 with value_3_0_1 for the fourth loop
...
value_327684_1_0 with value_327684_1_1 etc.
options(scipen=999)
library(tidyverse)
library(foreach)
library(iterators)
library(parallel)
library(doParallel)
library("data.table")
library('janitor')
source("/scratch/R/process.r")
nCores <- detectCores() - 1
cl <- makeCluster(nCores)
registerDoParallel(cl)
my_data <- fread(
file = "/scratch/R/data.csv", header = TRUE, fill=TRUE, data.table = FALSE)
#Change values from -999 to NA in specific columns to avoid data issues (McAuley data)
my_data[, 88:133][my_data[, 88:133] == -999] <- NA
#Create dataframe for prepost useable data only
prepost_df <- subset(my_data, Select_PrePost==1)
pre_df <- subset(my_data, Select==1)
Large_MyData <- fread(
file = "/scratch/R/large.csv", header = TRUE, sep = ",", data.table = FALSE)
area_list <- names(Large_MyData)[grep("_1$",names(Large_MyData))]
thickness_list <- names(Large_MyData)[grep("_0$",names(Large_MyData))]
merged_data <- merge(pre_df, Large_MyData, by = "subs")
yvalue = "y"
xvalue = "x"
covariates = c("a","g","e")
ptm <- proc.time()
loopResults<-
foreach(area=area_list,thickness=thickness_list, .combine = rbind) %dopar%{
if (merged_data[area][1,1] == 0) {
merge_df3<-rbind(area,thickness)
merge_df_out<-cbind(merge_df3,yvalue,xvalue,'','','','','','')
} else {
result<-process(data=merged_data,y=yvalue,x=xvalue,m=c(area,thickness),cov=covariates,
model=4,contrast=1,boot=5000,save=2,modelbt=1,outscreen=0)
indirectEffects<-result[23:24,1:4]
indirectEffects_bootmean_area<-result[27,2]*result[38,2]
indirectEffects_bootmean_thickness<-result[32,2]*result[39,2]
indirectEffects_bootscore_area<-(indirectEffects_bootmean_area/result[23,2])
indirectEffects_bootscore_thickness<-(indirectEffects_bootmean_thickness/result[24,2])
merge_df1<-rbind(indirectEffects_bootmean_area,indirectEffects_bootmean_thickness)
merge_df2<-rbind(indirectEffects_bootscore_area,indirectEffects_bootscore_thickness)
merge_df3<-rbind(area,thickness)
merge_df_out<-cbind(merge_df3,yvalue,xvalue,indirectEffects,merge_df1,merge_df2)
}
}
proc.time() - ptm
stopCluster(cl)
colnames(loopResults) <- c("Vector","yvalue","xvalue","Effect","BootSE","BootLLCI","BootULCI","BootMean","boot_score")
loopResults
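One way to try the chunking idea from the question is sketched below. It is only a sketch under stated assumptions: base_df and wide_df are small made-up stand-ins for pre_df and Large_MyData, fit_pair is a hypothetical placeholder for process(), and %do% is used so the example runs without a cluster. The point is that the loop iterates over pre-built small data frames, so the body never refers to the wide table; as far as I understand, foreach only ships to the workers what the body references plus the iteration values.

```r
library(foreach)

# Hypothetical stand-ins for the question's data: a small base table and a
# wide table whose columns come in "_0" / "_1" pairs.
set.seed(1)
n <- 20
base_df <- data.frame(subs = seq_len(n), x = rnorm(n), y = rnorm(n))
wide_df <- as.data.frame(matrix(rnorm(n * 8), nrow = n))
names(wide_df) <- paste0("value_", rep(0:3, each = 2), "_0_", rep(0:1, times = 4))

thickness_list <- grep("_0$", names(wide_df), value = TRUE)
area_list <- sub("_0$", "_1", thickness_list)  # keeps each pair aligned

# Hypothetical placeholder for process(): any function of one small data frame.
fit_pair <- function(d, area, thickness) {
  data.frame(area = area, thickness = thickness,
             cor_area_y = cor(d[[area]], d$y),
             cor_thickness_y = cor(d[[thickness]], d$y))
}

# Split the pair indices into chunks and build one small data frame per chunk.
chunk_size <- 2
chunks <- split(seq_along(area_list),
                ceiling(seq_along(area_list) / chunk_size))
chunk_data <- lapply(chunks, function(idx) {
  list(area = area_list[idx], thickness = thickness_list[idx],
       data = cbind(base_df, wide_df[c(area_list[idx], thickness_list[idx])]))
})

# Each task receives only its own chunk; the body never touches wide_df.
loop_results <- foreach(chunk = chunk_data, .combine = rbind) %do% {
  rows <- lapply(seq_along(chunk$area), function(j) {
    fit_pair(chunk$data, chunk$area[j], chunk$thickness[j])
  })
  do.call(rbind, rows)
}
```

Whether this helps in practice depends on where the time actually goes, so it is worth timing a few hundred pairs with and without chunking before committing to it. The chunk size of 2 is arbitrary; larger chunks mean fewer, bigger tasks.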
I'm trying to create a function that runs kmeans clustering on specific columns within a dataset and returns the cluster membership. The idea is that someone else could say "what would the clustering look like if I used columns x, y, and z".
I'm trying to use the following code. For some reason, the magic_result() won't return anything when I put it into the function.
mydata.test <- data.frame(a = c(1,1,1,2,2,2,3,3,3),
                          b = c(2,2,2,4,4,4,5,5,5),
                          c = c(1,1,1,6,6,6,4,4,4),
                          d = c(1,1,1,4,4,4,2,2,2),
                          e = c(14,40,84,14,40,84,14,40,84))
mylist.test <- list(c(1,2),c(2,3),c(1,2,3),c(1,2,5))
magic_free()
my.kmeans.test <- function(myd,myk,myl) {
library(magicfor)
magic_for(print,silent=T)
for(i in myl) {
kmeans <- kmeans(myd[,i],centers=myk,nstart=25)
cl <- kmeans$cluster
print(cl)
}
res <- magic_result()
res.cl <- res$cl
return(res.cl)
}
What I don't understand is that when I try to run this as just a for loop (rather than a function) it works.
library(magicfor)
magic_for(print,silent=T)
for(i in mylist.test) {
kmeans <- kmeans(mydata.test[,i],centers=3,nstart=25)
cl <- kmeans$cluster
print(cl)
}
res <- magic_result()
res.cl <- res$cl
res.cl
I'm guessing there's something funky going on with magicfor. Any idea how to get around this? Anything is appreciated.
Using map from purrr, you can just do:
library(purrr)
my.kmeans.test <- function(myd, myk, myl) {
  map(myl, function(idx) {
    # $cluster holds the cluster membership the question asks for
    kmeans(myd[, idx], centers = myk, nstart = 25)$cluster
  })
}
my.kmeans.test(mydata.test, 3, mylist.test)
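If you would rather avoid an extra package, the same idea can be sketched with base R's lapply. This is a minimal sketch, assuming the built-in iris data as a stand-in for your data; the function name cluster_by_columns and the column sets are made up for illustration.

```r
# Column sets to try; indices refer to the numeric columns of the built-in
# iris data.
col_sets <- list(c(1, 2), c(3, 4), c(1, 2, 3, 4))

# Hypothetical helper: one kmeans run per column set, returning only the
# cluster membership vector of each run.
cluster_by_columns <- function(d, k, col_sets) {
  lapply(col_sets, function(idx) {
    kmeans(d[, idx], centers = k, nstart = 25)$cluster
  })
}

set.seed(42)  # kmeans uses random starts, so fix the seed for repeatability
memberships <- cluster_by_columns(iris, 3, col_sets)
```

Each element of memberships is an integer vector with one cluster label per row, so memberships[[2]] answers "what would the clustering look like with columns 3 and 4".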
I would like to know if it would be possible to output two different objects after using foreach %dopar% loop.
I will try to explain what I am looking for. Let's suppose I have two data.frames as a result of several operations inside the loop:
library(doMC)
library(parallel)
registerDoMC(cores=4)
result <- foreach(i=1:100) %dopar% {
#### some code here
#### some code here
vec1 <- result_from_previous code # It would be the 1st object I'd like to output
vec2 <- result_from_previous code # It would be the 2nd object I'd like to output
}
My desired output would be a list of data.frames of length 2, such as:
dim(result[[1]]) # equals to nrow=length(vec1) and ncol=100
dim(result[[2]]) # equals to nrow=length(vec2) and ncol=100
I have tried this, taken from a previous post, "Saving multiple outputs of foreach dopar loop":
comb <- function(x, ...) {
  lapply(seq_along(x), function(i) c(x[[i]], lapply(list(...), function(y) y[[i]])))
}
result <- foreach(i=1:100, .comb='comb', .multicombine=TRUE) %dopar% {
#### some code here
#### some code here
vec1 <- result_from_previous code
vec2 <- result_from_previous code
list(vec1, vec2)
}
But it doesn't give the expected result
When I do the following:
result <- foreach(i=1:100, .comb=cbind) %dopar% {
#### some code here
#### some code here
vec1 <- result_from_previous code
vec2 <- result_from_previous code
}
I obtain only the data.frame of vec2. Is there any way of returning or saving both outputs?
Thanks
If you need to return two objects from the body of the foreach loop, you must bundle them into one object somehow or other, and a list is the most general way to do that. The trick is to provide an appropriate combine function to achieve the desired final result. If you want to combine all of the vec1 objects with cbind, and also all of the vec2 objects with cbind, the mapply function is quite handy. I think this is what you want:
comb <- function(...) {
mapply('cbind', ..., SIMPLIFY=FALSE)
}
Here's a little test program for this combine function:
result <- foreach(i=1:100, .combine='comb', .multicombine=TRUE) %dopar% {
vec1 <- rep(i, 10)
vec2 <- rep(2*i, 10)
list(vec1, vec2)
}
This will return a list containing two 10 x 100 matrices, but the same combine function can be used if vec1 and vec2 are data frames.
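For data frames, a row-wise variant of the same pattern may be more natural. The sketch below is an assumption-laden illustration, not part of the original answer: it swaps cbind for rbind, uses %do% so it runs without a backend, and the names comb_rows, stats and info are invented.

```r
library(foreach)

# Combine function: pair up the first elements, the second elements, and so
# on, and stack each group with rbind.
comb_rows <- function(...) {
  mapply(rbind, ..., SIMPLIFY = FALSE)
}

result <- foreach(i = 1:5, .combine = comb_rows, .multicombine = TRUE) %do% {
  stats <- data.frame(i = i, square = i^2)
  info  <- data.frame(i = i, label = paste0("run", i))
  list(stats = stats, info = info)  # bundle both outputs into one object
}
```

Here result$stats and result$info are each a data frame with one row per iteration.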
I would like to know if/how it would be possible to return multiple outputs as part of foreach dopar loop.
Let's take a very simplistic example. Let's suppose I would like to do 2 operations as part of the foreach loop, and would like to return or save the results of both operations for each value of i.
For only one output to return, it would be as simple as:
library(foreach)
library(doParallel)
cl <- makeCluster(3)
registerDoParallel(cl)
oper1 <- foreach(i=1:100000) %dopar% {
i+2
}
oper1 would be a list with 100000 elements, each element is the result of the operation i+2 for each value of i.
Suppose now I would like to return or save the results of two different operations separately, e.g. i+2 and i+3. I tried the following:
oper1 = list()
oper2 <- foreach(i=1:100000) %dopar% {
oper1[[i]] = i+2
return(i+3)
}
hoping that the results of i+2 will be saved in the list oper1, and that the results of the second operation i+3 will be returned by foreach. However, nothing gets populated in the list oper1! In this case, only the result of i+3 gets returned from the loop.
Is there any way of returning or saving both outputs in two separate lists?
Don't try to use side-effects with foreach or any other parallel programming package: each worker operates on its own copy of the data, so assignments made inside the loop body never reach the calling session. Instead, return all of the values from the body of the foreach loop in a list. If you want your final result to be a list of two lists rather than a list of 100,000 lists, then specify a combine function that transposes the results:
comb <- function(x, ...) {
lapply(seq_along(x),
function(i) c(x[[i]], lapply(list(...), function(y) y[[i]])))
}
oper <- foreach(i=1:10, .combine='comb', .multicombine=TRUE,
.init=list(list(), list())) %dopar% {
list(i+2, i+3)
}
oper1 <- oper[[1]]
oper2 <- oper[[2]]
Note that this combine function requires the use of the .init argument to set the value of x for the first invocation of the combine function.
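If the custom combine function feels heavy, a simpler alternative (a sketch, not the approach above) is to let foreach return the default list of pairs and transpose it after the loop. The element names plus2 and plus3 are made up, and %do% is used so the example runs without a backend.

```r
library(foreach)

# Each iteration returns both values bundled in a named list.
res <- foreach(i = 1:10) %do% {
  list(plus2 = i + 2, plus3 = i + 3)
}

# Transpose the list of pairs into two separate lists after the loop.
oper1 <- lapply(res, `[[`, "plus2")
oper2 <- lapply(res, `[[`, "plus3")
```

The trade-off is that the full list of pairs is held in memory before it is split, which should only matter for very large results.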
I prefer to use a class to hold multiple results for a %dopar% loop.
This example spins up 3 cores, calculates multiple results on each core, then returns the list of results to the calling thread.
Tested under RStudio, Windows 10, and R v3.3.2.
library(foreach)
library(doParallel)
# Create class which holds multiple results for each loop iteration.
# Each loop iteration populates two properties: $result1 and $result2.
# For a great tutorial on S3 classes, see:
# http://www.cyclismo.org/tutorial/R/s3Classes.html#creating-an-s3-class
multiResultClass <- function(result1=NULL,result2=NULL)
{
me <- list(
result1 = result1,
result2 = result2
)
## Set the name for the class
class(me) <- append(class(me),"multiResultClass")
return(me)
}
cl <- makeCluster(3)
registerDoParallel(cl)
oper <- foreach(i=1:10) %dopar% {
result <- multiResultClass()
result$result1 <- i+1
result$result2 <- i+2
return(result)
}
stopCluster(cl)
# Results of the first iteration; use oper[[i]] for iteration i.
oper1 <- oper[[1]]$result1
oper2 <- oper[[1]]$result2
This toy example shows how to return multiple results from a %dopar% loop.
This example:
Spins up 3 cores.
Renders a graph on each core.
Returns the graph and an attached message.
Prints each graph and its attached message.
I found this really useful to speed up using Rmarkdown to print 1,800 graphs into a PDF document.
Tested under Windows 10, RStudio, and R v3.3.2.
R code:
# Demo of returning multiple results from a %dopar% loop.
library(foreach)
library(doParallel)
library(ggplot2)
cl <- makeCluster(3)
registerDoParallel(cl)
# Create class which holds multiple results for each loop iteration.
# Each loop iteration populates two properties: $resultPlot and $resultMessage.
# For a great tutorial on S3 classes, see:
# http://www.cyclismo.org/tutorial/R/s3Classes.html#creating-an-s3-class
plotAndMessage <- function(resultPlot=NULL,resultMessage="?")
{
me <- list(
resultPlot = resultPlot,
resultMessage = resultMessage
)
# Set the name for the class
class(me) <- append(class(me),"plotAndMessage")
return(me)
}
oper <- foreach(i=1:5, .packages=c("ggplot2")) %dopar% {
x <- c(i:(i+2))
y <- c(i:(i+2))
df <- data.frame(x,y)
p <- ggplot(df, aes(x,y))
p <- p + geom_point()
message <- paste("Hello, world! i=",i,"\n",sep="")
result <- plotAndMessage()
result$resultPlot <- p
result$resultMessage <- message
return(result)
}
# Print resultant plots and messages. Despite running on multiple cores,
# 'foreach' guarantees that the plots arrive back in the original order.
foreach(i=1:5) %do% {
# Print message attached to plot.
cat(oper[[i]]$resultMessage)
# Print plot.
print(oper[[i]]$resultPlot)
}
stopCluster(cl)
I have the following problem:
I use a for-loop in R to get specific data from a matrix.
My code is as follows.
for(i in 1:100){
T <- as.Date(as.mondate (STARTLISTING)+i)
DELIST <- (subset(datensatz_Start_End.frame, TIME <= T))[,1]
write.table(DELIST, file = paste("tab", i, ".csv"), sep="," )
print(DELIST)
}
Using print, R delivers the data.
Using write.table, R delivers the data into different files.
My aim is to aggregate the results from the for-loop into one matrix (one row per 'i').
Unfortunately, I cannot get it to work.
Sorry, I'm a real noob in R.
for(i in 1:100)
{
T <- as.Date(as.mondate (STARTLISTING)+i)
DELIST <- (subset(datensatz_Start_End.frame, TIME <= T))[,1]
assign(paste('b',i,sep=''),DELIST)
}
This delivers 100 objects, which contain my results.
But what I need is one matrix/dataframe with 100 columns, or one list.
Any ideas?
Hey!
Since I'm not allowed to edit my own answers, here is my (simple) solution:
DELIST <- vector("list",100)
for(i in 1:100)
{
T <- as.Date(as.mondate (STARTLISTING)+i)
DELIST[[i]] <- as.character((subset(datensatz_Start_End.frame, TIME <= T))[,1])
}
DELIST[[99]] ## it is possible to request the relevant companies for every 'i'
Thx to everyone!
George
If you want a list, you can use lapply instead of a loop:
LL <- lapply(1:100,
  function(i) {
    T <- as.Date(as.mondate(STARTLISTING) + i)
    # return the result directly; assign() is not needed inside lapply
    (subset(datensatz_Start_End.frame, TIME <= T))[,1]
  }
)
After that you can rbind results together using do.call
result <- do.call(rbind, LL)
Or, if you are confident that all elements of LL have the same columns, you can use the more efficient rbindlist from the data.table package:
result <- rbindlist(LL)
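Here is a self-contained sketch of the list-based approach, in case it helps to see it run. Everything in it is an assumption for illustration: firms is a made-up stand-in for datensatz_Start_End.frame, start replaces STARTLISTING, and seq() with by = "<i> months" is used as a rough base-R substitute for as.mondate() + i.

```r
# Hypothetical stand-in for datensatz_Start_End.frame: companies with a date.
firms <- data.frame(
  company = c("A", "B", "C", "D"),
  TIME = as.Date(c("2020-01-15", "2020-02-20", "2020-03-10", "2020-06-01")),
  stringsAsFactors = FALSE
)
start <- as.Date("2020-01-01")

# One list element per i: the companies whose TIME is on or before the cutoff.
delisted <- lapply(1:3, function(i) {
  cutoff <- seq(start, by = paste(i, "months"), length.out = 2)[2]
  firms$company[firms$TIME <= cutoff]
})

# The elements differ in length, so a long data frame (one row per
# i/company pair) is a safer target than a rectangular matrix.
long <- do.call(rbind, lapply(seq_along(delisted), function(i) {
  data.frame(i = rep(i, length(delisted[[i]])), company = delisted[[i]],
             stringsAsFactors = FALSE)
}))
```

Because the number of matching companies grows with i, the list (or the long data frame) avoids the recycling that rbind would apply to vectors of unequal length.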
Check out the rbind function. You can start with an empty DELIST.DF and append each row to it inside the loop:
DELIST.DF <- NULL
for(i in 1:100){
T <- as.Date(as.mondate (STARTLISTING)+i)
DELIST <- (subset(datensatz_Start_End.frame, TIME <= T))[,1]
DELIST.DF <- rbind(DELIST.DF, DELIST)
write.table(DELIST, file = paste("tab", i, ".csv"), sep="," )
print(DELIST)
}