Ways to increase foreach %dopar% loop speed with large data frame

I have a foreach %dopar% loop set up to process my data in parallel, but I am looking for ways to increase its speed. Basically, I have a large data frame that is loaded with fread.
The data frame is 225 obs x 655,369 variables. The foreach loop selects two variables at a time and runs the process function (code that calculates various mediation, moderation, and conditional process models), for a total of 327,684 times. For this function all the data must be within the same data frame. I noticed that the size of the data frame seems to greatly slow down the foreach loop.
From what I can tell, the main cause of the size-related slowdown is how the process function accesses the data: my guess is that each time the foreach runs, the process function scans the entire data frame until it finds the correct variable for each of its inputs.
One of my thoughts is to chunk the data into smaller data frames to speed up processing, and then merge the outputs together at the end (see the rough sketch after the code below). But I was wondering if anyone else has suggestions for speeding this up, as I am obviously not overly familiar with R.
The variable names for area_list and thickness_list (the mediators, and the only values that change between iterations) are labelled as follows, such that the last number is either 0 or 1 for the pair, with all other numbers matching:
value_0_0_0 with value_0_0_1 for the first loop
value_1_0_0 with value_1_0_1 for the second loop
value_2_0_0 with value_2_0_1 for the third loop
value_3_0_0 with value_3_0_1 for the fourth loop
...
value_327684_1_0 with value_327684_1_1 etc.
options(scipen=999)
library(tidyverse)
library(foreach)
library(iterators)
library(parallel)
library(doParallel)
library("data.table")
library('janitor')
source("/scratch/R/process.r")
nCores <- detectCores() - 1
cl <- makeCluster(nCores)
registerDoParallel(cl)
my_data <- fread(
file = "/scratch/R/data.csv", header = TRUE, fill=TRUE, data.table = FALSE)
#Change values from -999 to NA in specific columns to avoid data issues (McAuley data)
my_data[, 88:133][my_data[, 88:133] == -999] <- NA
#Create dataframe for prepost useable data only
prepost_df <- subset(my_data, Select_PrePost==1)
pre_df <- subset(my_data, Select==1)
Large_MyData <- fread(
file = "/scratch/R/large.csv", header = TRUE, sep = ",", data.table = FALSE)
area_list <- names(Large_MyData)[grep("_1$",names(Large_MyData))]
thickness_list <- names(Large_MyData)[grep("_0$",names(Large_MyData))]
merged_data <- merge(pre_df, Large_MyData, by = "subs")
yvalue = "y"
xvalue = "x"
covariates = c("a","g","e")
ptm <- proc.time()
loopResults <-
  foreach(area = area_list, thickness = thickness_list, .combine = rbind) %dopar% {
    if (merged_data[area][1, 1] == 0) {
      merge_df3 <- rbind(area, thickness)
      merge_df_out <- cbind(merge_df3, yvalue, xvalue, '', '', '', '', '', '')
    } else {
      result <- process(data = merged_data, y = yvalue, x = xvalue, m = c(area, thickness), cov = covariates,
                        model = 4, contrast = 1, boot = 5000, save = 2, modelbt = 1, outscreen = 0)
      indirectEffects <- result[23:24, 1:4]
      indirectEffects_bootmean_area <- result[27, 2] * result[38, 2]
      indirectEffects_bootmean_thickness <- result[32, 2] * result[39, 2]
      indirectEffects_bootscore_area <- indirectEffects_bootmean_area / result[23, 2]
      indirectEffects_bootscore_thickness <- indirectEffects_bootmean_thickness / result[24, 2]
      merge_df1 <- rbind(indirectEffects_bootmean_area, indirectEffects_bootmean_thickness)
      merge_df2 <- rbind(indirectEffects_bootscore_area, indirectEffects_bootscore_thickness)
      merge_df3 <- rbind(area, thickness)
      merge_df_out <- cbind(merge_df3, yvalue, xvalue, indirectEffects, merge_df1, merge_df2)
    }
  }
proc.time() - ptm
stopCluster(cl)
colnames(loopResults) <- c("Vector","yvalue","xvalue","Effect","BootSE","BootLLCI","BootULCI","BootMean","boot_score")
loopResults
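For reference, here is a rough, untested sketch of the column-subsetting version of the chunking idea mentioned above, reusing the objects defined in the code. It assumes process() only needs the x, y, covariate, and mediator columns, so each iteration can work on a small per-iteration data frame instead of the full 655,369-column merged_data:
# Rough sketch only (untested): hand process() a tiny data frame per iteration
base_cols <- c("subs", yvalue, xvalue, covariates)
loopResults <- foreach(area = area_list, thickness = thickness_list, .combine = rbind) %dopar% {
  small_df <- merged_data[, c(base_cols, area, thickness)]  # only the columns this iteration needs
  if (small_df[[area]][1] == 0) {
    cbind(rbind(area, thickness), yvalue, xvalue, '', '', '', '', '', '')
  } else {
    result <- process(data = small_df, y = yvalue, x = xvalue, m = c(area, thickness),
                      cov = covariates, model = 4, contrast = 1, boot = 5000,
                      save = 2, modelbt = 1, outscreen = 0)
    indirectEffects <- result[23:24, 1:4]
    bootmean_area      <- result[27, 2] * result[38, 2]
    bootmean_thickness <- result[32, 2] * result[39, 2]
    cbind(rbind(area, thickness), yvalue, xvalue, indirectEffects,
          rbind(bootmean_area, bootmean_thickness),
          rbind(bootmean_area / result[23, 2], bootmean_thickness / result[24, 2]))
  }
}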

Related

Automatic creation and use of custom made function in R - in for each loop - storing the result in one DF - 3D array

A few days ago I asked this question about calling a custom-made function within a loop, which was nicely resolved by a combination of
eval(parse(text = Function text))
here is the link: Automatic creation and use of custom made function in R.
This allowed me to work with a for loop and automatically call the function I need from a data frame storing the body of the function to create.
Now I would like to take the question to the next level. My problem is computation time. I need to evaluate something like 52 indices from a hyperspectral image; this means that in R my hyperspectral image is loaded as a 3D array of 512 x 512 pixels x 204 bands.
What I would like to do is run the evaluation of the indices in parallel to reduce the computation time.
Here is a dummy example of what I would like to emulate, but not yet in parallel:
# create a fake array representing my hyperspectral image
library(fields)  # provides image.plot()
HYPR_IMG=array(NA,dim=c(5,3,4))
HYPR_IMG[,,1]=1
HYPR_IMG[,,2]=2
HYPR_IMG[,,3]=3
HYPR_IMG[,,4]=4
image.plot(HYPR_IMG[,,1], zlim=c(0,20))
image.plot(HYPR_IMG[,,2], zlim=c(0,20))
image.plot(HYPR_IMG[,,3], zlim=c(0,20))
image.plot(HYPR_IMG[,,4], zlim=c(0,20))
#create a fake DF for simulating my indices stored in the dataframe
IDXname=c("IDX1","IDX2","IDX3","IDX4")
IDXFunc=c("HYPR_IMG[,,1] + 3*HYPR_IMG[,,2]",
"HYPR_IMG[,,3] + HYPR_IMG[,,2]",
"HYPR_IMG[,,4] + HYPR_IMG[,,2] - HYPR_IMG[,,3]",
"HYPR_IMG[,,1] + HYPR_IMG[,,4] + 4*HYPR_IMG[,,2] + HYPR_IMG[,,3]")
IDX_DF=as.data.frame(cbind(IDXname,IDXFunc))
# that was what I did before
Store_DF=data.frame(NA)
for (i in 1: length(IDX_DF$IDXname)) {
IDX_ID=IDX_DF$IDXname[i]
IDX_Fun_tmp=IDX_DF$IDXFunc[which(IDX_DF$IDXname==IDX_ID)] #use for extra care to select the right function
IDXFunc_call=paste("IDXfun_tmp=function(HYPR_IMG){",IDX_Fun_tmp,"}",sep="")
eval(parse(text = IDXFunc_call))
IDX_VAL=IDXfun_tmp (HYPR_IMG)
image.plot(IDX_VAL,zlim=c(0,20)); title(main=IDX_ID)
temp_DF=as.vector(IDX_VAL)
Store_DF=cbind(Store_DF,temp_DF)
names(Store_DF)[i+1] <- as.vector(IDX_ID)
}
My final goal is to end up with the very same Store_DF, storing all the index values. Here I have a for loop, but using a foreach loop things should speed up. If it matters, I am working on Windows 8 or later.
Is this really possible?
Will I be able, in the end, to reduce the overall computation time while getting the same Store_DF data frame, or something similar such as a matrix with column names?
Thanks a lot!
For this specific example, using either the built-in parallelization of a package like data.table or a parallel apply may be more beneficial.
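(As an aside, and not part of the example below: data.table's built-in parallelization refers to its internal multi-threading of many of its own operations, such as grouping and fread/fwrite, controlled via setDTthreads(); it does not parallelize arbitrary user functions.)
library(data.table)
getDTthreads()   # number of threads data.table currently uses for its internal operations
setDTthreads(4)  # e.g. cap or raise the thread count to 4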
Below is a minimal example of how to achieve the result using parApply from the parallel package. Note that the output is a matrix, which actually yields slightly better performance in base R (not necessarily the case in the tidyverse or data.table). If the data.frame structure is vital, you will have to convert it, e.g. with as.data.frame.
cl <- parallel::makeCluster( parallel::detectCores() )
result <- parallel::parApply(cl = cl, X = IDX_DF, MARGIN = 1, FUN = function(x, IMAGES){
IDX_ID <- x[["IDXname"]]
eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", x[["IDXFunc"]], "}")))
IDX_VAL <- as.vector(IDXfun_tmp(IMAGES))
names(IDX_VAL) <- IDX_ID
IDX_VAL
}, IMAGES = HYPR_IMG)
colnames(result) = IDXname
parallel::stopCluster(cl)
Please note the stopCluster(cl), which is important for shutting down any lingering R sessions.
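As a side note (not part of the example above): if the cluster is created inside a function, on.exit() is a simple way to guarantee it gets stopped even when an error occurs. A minimal, self-contained illustration:
run_parallel <- function(X) {
  cl <- parallel::makeCluster(2)
  on.exit(parallel::stopCluster(cl), add = TRUE)  # cluster is shut down even if an error occurs
  parallel::parApply(cl = cl, X = X, MARGIN = 1, FUN = sum)
}
run_parallel(matrix(1:6, nrow = 2))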
Benchmark results (4 tiny cores):
Unit: milliseconds
     expr      min       lq      mean   median       uq      max neval
     Loop 8.420432 9.027583 10.426565 9.272444 9.943783 26.58623   100
 Parallel 1.382324 1.491634  2.038024 1.554690 1.907728 18.23942   100
For replication of the benchmarks, the code is provided below:
cl <- parallel::makeCluster( parallel::detectCores() )
microbenchmark::microbenchmark(
Loop = {
Store_DF=data.frame(NA)
for (i in 1: length(IDX_DF$IDXname)) {
IDX_ID = IDX_DF$IDXname[i]
IDX_Fun_tmp = IDX_DF$IDXFunc[which(IDX_DF$IDXname == IDX_ID)] #use for extra care to select the right function
eval(parse(text = paste0("IDXfun_tmp = function(HYPR_IMG){", IDX_Fun_tmp, "}")))
IDX_VAL = IDXfun_tmp(HYPR_IMG)
#Plotting in parallel is not a good idea: it will most often not work, and might crash the R session or slow it down significantly (at best the latter, at worst the former)
#image.plot(IDX_VAL, zlim = c(0,20)); title(main = IDX_ID)
temp_DF = as.vector(IDX_VAL)
Store_DF = cbind(Store_DF,temp_DF)
names(Store_DF)[i+1] <- as.vector(IDX_ID)
}
rm(Store_DF)
},
Parallel = {
result <- parallel::parApply(cl = cl, X = IDX_DF, MARGIN = 1, FUN = function(x, IMAGES){
IDX_ID <- x[["IDXname"]]
eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", x[["IDXFunc"]], "}")))
IDX_VAL <- as.vector(IDXfun_tmp(IMAGES))
names(IDX_VAL) <- IDX_ID
IDX_VAL
}, IMAGES = HYPR_IMG)
colnames(result) = IDXname
rm(result)
}
)
parallel::stopCluster(cl)
Edit: Using the foreach package
After a few comments about performance issues (maybe due to memory), I decided to make an illustration of how one could obtain the same result using the foreach package. A few notes:
The foreach package uses iterators. By default it can be used like a for loop, and given a data.frame it will iterate over each column.
Like other parallel implementations in R, if you are on Windows you will often have to export the data used for the calculations. This can sometimes be avoided with some fiddling, and foreach will sometimes let you skip exporting data, but exactly when that works is unclear from the documentation.
The output of foreach will be combined either into a list or as defined by the .combine argument, which can be rbind, cbind, or any other function.
There are a lot of comments, making the code seem much longer than it actually is; with comments and blank lines removed, it is only 9 lines longer.
Below is the code, which yields the same output as above. Note that I have used the data.table package; for more information about it, I suggest its wiki on GitHub.
cl <- parallel::makeCluster( parallel::detectCores() )
#foreach uses doParallel for the parallelization
doParallel::registerDoParallel(cl)
#To iterate over the rows, we need to use iterators
# if foreach is given a matrix, it will be converted to a column iterator
rowIterator <- iterators::iter(IDX_DF, by = "row")
library(foreach)
result <-
foreach(
#Supply the iterator
row = rowIterator,
#Specify if the calculations needs to be in order. If not then we can get better performance not doing so
.inorder = FALSE,
#In most foreach loops you will have to export the data needed for the calculations
# (it worked without doing so for me; skipping the export is faster when the exported object is large)
#.export = c("HYPR_IMG"),
#We need to say how the output should be merged. If nothing is given it will be output as a list
#data.table rbindlist is faster than rbind (returns a data.table)
.combine = function(...)data.table::rbindlist(list(...)) ,
#otherwise we could've used:
#.combine = rbind
#if we don't use rbind or cbind (I used data.table::rbindlist for speed)
# we have to tell foreach whether the combine function can take more than one argument
.multicombine = TRUE
) %dopar% #Specify how to do the calculations. %do% loop. %dopar% parallel loop. %:% nested loops (next foreach tells how we do the loop)
{ #to do stuff in parallel we use the %dopar%. Alternative %do%. For multiple foreach we split each of them by %:%
IDX_ID <- row[["IDXname"]]
eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", row[["IDXFunc"]], "}")))
IDX_VAL <- as.vector(IDXfun_tmp(HYPR_IMG))
data.frame(ID = IDX_ID, IDX_VAL)
}
#output is saved in result
result
result_reformatted <- data.table::dcast(result[, indx := 1:.N, by = ID],
                                        indx ~ ID,
                                        value.var = "IDX_VAL")
#if we don't want to use data.table we could use unstack instead
unstack(result, IDX_VAL ~ ID)

Parallel Computation for Create_Matrix 'RTextTools' package

I am creating a DocumentTermMatrix using create_matrix() from RTextTools, and creating a container and model based on it. It is for extremely large datasets.
I do this for each category (factor level), so for each category it has to build the matrix, container, and model. When I run the code below on, say, a 16-core / 64 GB machine, it runs on only one core and uses less than 10% of the memory.
Is there a way I can speed up this process, perhaps using doParallel and foreach? Any information would certainly help.
#import the required libraries
library("RTextTools")
library("hash")
library(tm)
for ( n in 1:length(folderaddress)){
#Initialize the variables
traindata = list()
matrix = list()
container = list()
models = list()
trainingdata = list()
results = list()
classifiermodeldiv = 0.80
#Create the directory to place the models and the output files
pradd = paste(combinedmodelsaveaddress[n],"SelftestClassifierModels",sep="")
if (!file.exists(pradd)){
dir.create(file.path(pradd))
}
Data$CATEGORY <- as.factor(Data$CATEGORY)
#Read the training files
X <- split(Data, Data$CATEGORY)
data <- lapply(seq_along(X), function(x) as.data.frame(X[[x]])[,5])
names(data) <- levels(Data$CATEGORY)
list2env(data, envir = .GlobalEnv)
files=as.matrix(names(data))
fileno=length(files)
fileno=as.integer(fileno)
print(fileno)
#For all the files in the training folder(the number of files in the training folder = Number of categories in Taxonomy)
for(i in 1:fileno){
filename = as.character(files[i,1])
data1 = as.data.frame(data[i])
data1 = as.matrix(data1)
filenamechanged = gsub ("\\.[[:alnum:]]+","",filename)
type = matrix(data = as.character(filenamechanged),nrow = length(data1[,1]),ncol=1 )
data1 = cbind(data1,type)
traindata[[i]] = data1
print(i)
}
for(i in 1:fileno){
#Obtain the unique classified data from the train files for one category
trainingdata1 = as.data.frame(traindata[[i]][,1])
uniquetraintweet = hash()
typetrain1 = matrix(data=as.character(traindata[[i]][1,2]), ncol =1, nrow = length(trainingdata1[,1]))
#If the training data is less than 10 records for a category, do not create a model
#The model created based on a smaller set of data will not be accurate
if (length(trainingdata1[,1])<200){
matrix[[i]] = NULL
next
}
#Obtain the unique classified data from the train files of all the other category except that is considered as training category
trainingdata2=matrix(data="",nrow=0,ncol=1)
for (j in 1:fileno){
if ( j==i) next
trainingdata2dummy = as.data.frame(traindata[[j]][,1])
length(trainingdata1[,1])
colnames(trainingdata2)="feedbacks"
colnames(trainingdata2dummy)="feedbacks"
trainingdata2 = rbind(trainingdata2,trainingdata2dummy)
}
#Consider one category as training set and make the remaining categories as Others
typetrain2 = matrix(data="ZZOther",nrow=length(trainingdata2[,1]),ncol=1)
colnames(trainingdata1)="feedbacks"
trainingdata[[i]]=rbind(trainingdata1,trainingdata2)
colnames(typetrain1)="type"
colnames(typetrain2)="type"
type=rbind(typetrain1,typetrain2)
trainingdata[[i]] = cbind(trainingdata[[i]],type)
trainingdata[[i]]=trainingdata[[i]][sample(nrow(trainingdata[[i]])),]
#Input the training set and other set to the classifier
mindoc = max(1,floor(min(0.001*length(trainingdata[[i]][,1]),3)))
#Create Matrix
matrix[[i]] <- create_matrix(trainingdata[[i]][,1], language="english",
removeNumbers=FALSE, stemWords=FALSE,weighting=weightTf,minWordLength=3, minDocFreq=mindoc, maxDocFreq=floor(0.5*(length(trainingdata[[i]][,1]))))
#rowTotals <- apply(matrix[[i]] , 1, sum) #Find the sum of words in each Document
#matrix[[i]] <- matrix[[i]][rowTotals> 0,]
print(i)
#Create Container
container[[i]] <- create_container(matrix[[i]],trainingdata[[i]][,2],trainSize=1:length(trainingdata[[i]][,1]),virgin=FALSE)
print(i)
#Create Models
models[[i]] <- train_models(container[[i]], algorithms=c("SVM"))
print(i)
}
save(matrix, file = paste(pradd,"/Matrix",sep=""))
save(models, file = paste(pradd,"/Models",sep=""))
}
Here is an example of working with RTextTools in parallel. I created the dummy function using the information found here.
The function myFun follows the introduction in the above link; at the end it writes a CSV file (no directory is specified) containing the analytics/summary. After that, it is a straightforward application of the base R parallel package to run myFun in parallel.
library(parallel)
library(RTextTools)
# I. A dummy function
# Uses RTextTools
myFun <- function (trainMethod) {
library(RTextTools)
data(USCongress)
# Create the document-term matrix
doc_matrix <- create_matrix(USCongress$text, language="english", removeNumbers=TRUE,
stemWords=TRUE, removeSparseTerms=.998)
container <- create_container(doc_matrix, USCongress$major, trainSize=1:4000,
testSize=4001:4449, virgin=FALSE)
# Train
model <- train_model(container,trainMethod)
classify <- classify_model(container, model)
# Analytics
analytics <- create_analytics(container,
cbind(classify))
summary(analytics)
# Saving
nameToSave <- paste(trainMethod, 'DocumentSummary.csv', sep = '_')
write.csv(analytics@document_summary, nameToSave)
}
# II. Parallel Processing
#
# 1. Vector for parallelization & number of cores available
trainMethods <- c('SVM','GLMNET','MAXENT','SLDA','BOOSTING')
num_cores <- detectCores() - 1L
# 2. Start a cluster
cl <- makeCluster(num_cores)
# 3. Export Variables needed to the cluster
# specifying exactly which variables should be exported
clusterExport(cl, varlist = c('myFun', 'trainMethods'))
# 4. do in parallel
parLapply(cl, seq_along(trainMethods), function (n) myFun(trainMethod = trainMethods[n]))
# stop the cluster
stopCluster(cl)
In your case, you'd have to turn your code into a function myFun(n, ...), with n being an element of seq_along(folderaddress), and of course replace seq_along(trainMethods) with seq_along(folderaddress) in the parLapply call. A rough sketch of that wiring follows.
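This is only an illustration of the structure, with a toy stand-in for your per-folder loop body (your real matrix/container/model code would go inside perFolder); the folder names here are made up:
library(parallel)
folders <- c("catA", "catB", "catC")   # stand-in for your folderaddress vector
perFolder <- function(n, folders) {
  # your per-folder matrix/container/model code would go here
  paste("would train models for", folders[n])
}
cl <- makeCluster(detectCores() - 1L)
clusterExport(cl, varlist = c("perFolder", "folders"))
res <- parLapply(cl, seq_along(folders), function(n) perFolder(n, folders))
stopCluster(cl)
res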
Of course, chances are there exist ways besides parallelization to improve your code, but the problem is that without sample data, any suggested improvement is only conjecture.

Using foreach instead of for loop

I am trying to learn foreach to parallelise my task.
My for-loop looks like this:
# create an empty matrix to store results
mat <- matrix(-9999, nrow = length(unique(dat$mun)), ncol = 2)
mat[, 1] <- unique(dat$mun)  # first column holds the municipality IDs
for(mun in unique(dat$mun)) {
  dat <- read.csv(paste0("data", mun, ".csv"))
  tot.dat <- sum(dat$x)
  mat[mat[,1] == mun, 2] <- tot.dat
}
unique(dat$mun) has a length of 5563.
I want to use foreach to parallelise my task.
library(foreach)
library(doParallel)
# number of iterations
iters <- 5563
foreach(icount(iters)) %dopar% {
  mun <- unique(dat$mun)[mun] # this is where I cannot figure out how to assign mun so that it reads the data for mun
  dat <- read.csv(paste0("data", mun, ".csv"))
  tot.dat <- sum(dat$x)
  mat[mat[,1] == mun, 2] <- tot.dat
}
This could be one solution.
Do note that I'm using Windows here, and I specified registerDoParallel() for it to work.
library(foreach)
library(doParallel)
# number of iterations
iters <- 5563
registerDoParallel()
mun <- unique(dat$mun)
tableList <- foreach(i=1:iters) %dopar% {
  dat <- read.csv(paste0("data", mun[i], ".csv"))
  tot.dat <- sum(dat$x)
}
unlist(tableList)
Essentially, whatever is returned inside {...} will be stored in a list.
In this case, the result (tot.dat, which is a number) is collected in tableList, and by calling unlist() we can convert it to a vector for further use.
The result inside {...} can be anything: a single number, a vector, a data frame, and so on.
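For example, a small variant of the loop above (untested, since the CSV files are not available here) that lets foreach do the flattening itself via .combine, so the unlist() step is no longer needed:
totals <- foreach(i = 1:iters, .combine = c) %dopar% {
  dat <- read.csv(paste0("data", mun[i], ".csv"))
  sum(dat$x)  # the last expression is what gets combined into the vector
}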
Another approach for your problem would be to combine all of the data, labelling each row with its source file, so the middle part would look something like
library(plyr)
tableAll <- foreach(i=1:iters) %dopar% {
  dat <- read.csv(paste0("data", mun[i], ".csv"))
  dat$source = mun[i]
  dat  # return the labelled data frame so rbind.fill can stack them
}
rbind.fill(tableAll)
Then we can use it for further analysis.

RAM demands increase exponentially when running a for-loop executing the glm function multiple times

I want to detect interaction effects in a dataset, and I have written code that creates all the possible combinations of predictors, fits glm models with each pair separately, and stores the vital statistics of the model as well as the model itself.
My dataset comprises 100,000+ observations, and I want to examine 200,000+ possible pair combinations.
The code runs without errors, but the problem is that after the 2,000th iteration the 60 GB of RAM on my PC are filled (when I begin running the code, 58 GB of RAM are free).
For a reproducible example I will use the mtcars dataset:
library(data.table)
data(mtcars)
setDT(mtcars)
predictor_names <- setdiff(names(mtcars) , "am")
combinations <- combn(length(predictor_names) , 2)
combinations <- t(combinations)
combinations <- as.data.frame(combinations)
models_glm <- list()
Coefficients_dt <- data.table(Predictor_1 = character() , Predictor_2 = character(), dev_ratio = numeric() ,
Estimate = numeric(), p.value = numeric())
system.time(
for (i in (1 : (nrow(combinations) - 1 ))) {
# Extracts the index positions of the names of the pairs
#----------------------------------------------------------
m <- combinations[i, 1]
k <- combinations[i, 2]
# Extracts the names of the predictors from a vector that holds them
#------------------------------------------------------------------------
m_name <- predictor_names[m]
k_name <- predictor_names[k]
# Uses the names of the predictors to construct a formula
#------------------------------------------------------------------
formula_glm <- paste0( "am ~ " , m_name, " * " , k_name)
formula_glm <- as.formula(formula_glm )
# Passes the formula to a glm model
#-------------------------------------------------------------------
model <- glm(formula_glm , mtcars, family = "binomial")
# Stores the model to a list
#-------------------------------------------------------------------
models_glm [[ paste0(m_name , "_*_" , k_name)]] <- model
# Calculates the dev.ratio
#---------------------------------------------------------------
residual.deviance <- model$deviance
null.deviance <- model$null.deviance
dev.ratio <- (null.deviance - residual.deviance) / null.deviance
# Extracts the Coefficient estimate and p-value from the model
#-------------------------------------------------------------------
Coefficients_df <- as.data.frame(summary(model)$coefficients)
names(Coefficients_df) <- c("Estimate" , "SE" , "Z", "p.value")
if(dim(Coefficients_df)[1] == 4){
Coefficients_dt <- rbind(Coefficients_dt , data.table(
Predictor_1 = m_name ,
Predictor_2 = k_name,
dev_ratio = dev.ratio,
Estimate = Coefficients_df$Estimate[4] ,
p.value = Coefficients_df$p.value[4]
))
}
}
)
What can I do to overcome this problem?
That is, I would like to understand the root cause of the problem: what is taking up space in RAM? The objects involved are not very large compared to the available RAM; specifically, the Coefficients_dt data.table will at most grow to 200,000 rows x 5 columns.
So something else is going on that consumes more and more RAM as the iterations of the for loop accumulate.
Next, I would like to know whether there is some action I can take in the middle of the for loop -- e.g. a command nested in an if statement inside the loop -- that would free RAM while saving to disk any objects that would be removed from RAM and should be safeguarded.
Your advice will be appreciated.
Consider the following options:
Pre-allocate any needed objects: instead of expanding them with values iteratively, which forces the machine to reallocate memory, assign values to existing elements:
models_glm <- vector(mode = "list", length = 45)
In fact, even consider naming the elements beforehand:
pnames <- sapply(1:(nrow(combinations) - 1), function(i){
  paste0(predictor_names[combinations[i,1]], "_*_",
         predictor_names[combinations[i,2]])
})
models_glm <- setNames(vector(mode="list", length=45), pnames)
Use data.table::rbindlist() to row-bind a list of data.tables into one large data.table in a single call, instead of expanding a table row by row inside a loop. The code below uses lapply, which returns an object of the same length as its input. Also notice the empty data.table returned to avoid NULL entries; the zero-row tables are simply dropped by rbindlist:
dTList <- lapply(seq(nrow(combinations)-1), function(i) {
#... same as above
# use <<- operator to update environment object outside function
models_glm[[paste0(m_name, "_*_", k_name)]] <<- model
#...
Coefficients_df <- setNames(as.data.frame(summary(model)$coefficients),
c("Estimate", "SE", "Z", "p.value"))
if(dim(Coefficients_df)[1] == 4){
data.table(
Predictor_1 = m_name ,
Predictor_2 = k_name,
dev_ratio = dev.ratio,
Estimate = Coefficients_df$Estimate[4],
p.value = Coefficients_df$p.value[4]
)
} else {
# RETURN EMPTY DT
data.table(
Predictor_1 = character(),
Predictor_2 = character(),
dev_ratio = numeric(),
Estimate = numeric(),
p.value = numeric()
)
}
})
coefficients <- data.table::rbindlist(dTList)
rm(dTList)
gc()
Finally, for large operations that do not need interactive design/programming work, consider running the script with the automated Rscript.exe rather than RStudio or Rgui, as the latter programs require additional resources. Below is a command line that can be run from PowerShell, the CMD prompt, or a batch (.bat) file, assuming Rscript is on the PATH environment variable:
Rscript "C:\Path\To\ModelCoefficientDataBuild.R"
Specifically, RStudio's rsession.exe on Windows tends not to release memory back to the OS once it has obtained it until the session is over; see RStudio forum posts on the subject. Of course, be sure to save the objects you need to disk for later use:
saveRDS(coefficients, "coefficients_datatable.rds")

Parallel computing with foreach - Saving loop outputs to global list

I would like to run a large loop with the foreach function. This means using the %dopar% operator.
I can't find any already-answered questions that address this problem exactly. If this is a duplicate, though, please point me in the right direction and I'll close this question.
I have been having mixed success. It works for simple examples on my machine, as per the help documentation; however, I cannot seem to get good results for my own work.
My example is slightly more complicated, so the devil seems to be in the detail, as always! I have also read the 'white paper' provided by the package creators, Revolution Analytics (you can get it here).
I don't see how best to use the .combine argument to write the results into my global output list.
I would like to assign the calculated values to one big list, as opposed to using cbind or c.
My example is pretty convoluted, but if I simplify it any further then any answers might not address my issue.
I will perform a kind of moving linear model: fit a model using lm() on 50 observations [1:50], predict the 51st observation [51], and save the results to a list.
Then I will shift everything one observation further, so an lm over [2:51] to predict the 52nd observation [52].
I will use a total of 100 observations, so I can make a maximum of 50 predictions.
## ============================================ ##
## Setup the backend for the foreach function ##
## ============================================ ##
## doMC calls upon cores on demand, uses them and closes them
library(doMC)
registerDoMC(detectCores()) #detectCores() uses all cores
## for Windows users
#library(doParallel) --> for Windows users
#registerDoParallel(detectCores())
## ======================== ##
## Create some dummy data ##
## ======================== ##
## three columns, one hundred observations
library(data.table)
my_data <- data.table(outcome = runif(100), V1 = 3*runif(100), V2 = sqrt(runif(100)))
## Have a look at the data if you like - using the DT package
library(DT)
datatable(my_data, options = list(pageLength = nrow(my_data)))
## ================================= ##
## Perform the loop the normal way ##
## ================================= ##
## Create container (a list of lists) for results
my_results <- sapply(c(paste0("step_", seq(1:50))), function(x) NULL)
step_results <- sapply(c("coefs", "rank", "error"), function(x) NULL)
for(i in 1:length(my_results)){my_results[[i]] <- step_results}
## Use a for loop to step through all 50 'slices'
for(i in 1:50) { #max. 50 predictions possible
## Fit a linear model
my_fit <- lm("outcome ~ V1 + V2", data = my_data[i:(i+49)])
## Predict the next step
my_pred <- predict(my_fit, newdata = my_data[i+50, .(V1, V2)])
error <- my_data$outcome[i+50] - my_pred #simply measure the delta to the actual value
## Assign some results to the container created earlier
my_results[[i]][[1]] <- my_fit$coefficients
my_results[[i]][[2]] <- my_fit$rank
my_results[[i]][[3]] <- error
}
str(my_results) ## Keep this container to compare to our next one
## ============================================ ##
## Perform the loop using foreach and %dopar% ##
## ============================================ ##
## Create same results object for results as previously for parallel results
par_results <- sapply(c(paste0("step_", seq(1:50))), function(x) NULL)
step_results <- sapply(c("coefs", "rank", "error"), function(x) NULL)
for(i in 1:length(par_results)){par_results[[i]] <- step_results}
my_results_par <- foreach(i = 1:50) %dopar%
{ #max. 50 predictions possible
my_fit <- lm("outcome ~ V1 + V2", data = my_data[i:(i+49)])
my_pred <- predict(my_fit, newdata = my_data[i+50, .(V1, V2)])
error <- my_data$outcome[i+50] - my_pred
## Assign some results to the container created earlier
par_results[[i]][[1]] <- my_fit$coefficients
par_results[[i]][[2]] <- my_fit$rank
par_results[[i]][[3]] <- error
Sys.sleep(i/20) #Allows time to see R processes spawn on your system
return(par_results)
}
## We can see straight away that this didn't work as I would like it to
identical(my_results, my_results_par) #FALSE
## This shows that the output seems good on the surface
class(my_results_par)
length(my_results_par)
## This shows that it doesn't (WARNING: very long)
str(my_results_par)
You can try out the various .combine arguments in the foreach function, for example:
foreach(i = 1:50, .combine = "c") {computation}
or
foreach(i = 1:50, .combine = "cbind") {computation}
these prodice a vector and a matrix respectively, but do not contain all the results that I was trying to save in each loop.
Questions
Does that structure give you a clue as to what is going on?
How might I use .combine argument to create my desired output?
Is what I am trying to do even possible??
Do I need to put the loop with foreach at a different point in the algorithm?
I have read that you can supply a custom function to foreach... might this be the way to do it? I still don't see how I would combine the results.
Yes, this can easily be done. We can modify your code for the foreach-step to the following, where we export the data.table package to each worker.
my_results_par <- foreach(i = 1:50, .combine = append, .packages = c("data.table")) %dopar%
{
my_fit <- lm("outcome ~ V1 + V2", data = my_data[i:(i+49)])
my_pred <- predict(my_fit, newdata = my_data[i+50, .(V1, V2)])
error <- my_data$outcome[i+50] - my_pred
par_results <- list(
coefs = my_fit$coefficients,
rank = my_fit$rank,
error = error
)
par_results <- list(par_results)
names(par_results) <- paste0("step_", i)
return(par_results)
}
identical(my_results, my_results_par)
[1] TRUE
