I am currently working on a program to evaluate the out-of-sample performance of several forecasting models on simulated data. For those who are familiar with finance, it works exactly like backtesting a trading strategy, except that I would evaluate forecasts and not transactions.
Some of the objects I currently manipulate using for loops for this type of task are 7-dimensional arrays (the dimensions stand for Monte Carlo replications, data generating processes, forecast horizons, 3 dimensions for model parameter selection, and one dimension for all the periods covered in the out-of-sample analysis). Obviously, it is painfully slow, so parallel computing has become a must for me.
My problem is: how do I keep track of more than 2 dimensions in R? Let me show you what I mean using 'for loops' and only 3 dimensions:
x <- array(dim=c(2,2,2))
for (i in 1:2){
  for (j in 1:2){
    for (k in 1:2){
      x[i,j,k] <- i+j+k
    }
  }
}
If I use something like 'foreach', I am very annoyed by the fact that, to my knowledge, available combining functionalities will return lists, matrices or vectors -- but not arbitrarily large multidimensional arrays. For instance:
library(doParallel)
library(foreach)
# Get the number of cores to use
no_cores <- max(1, detectCores()-1)
# Make cluster object using no_cores
cl <- makeCluster(no_cores)
# Initialize cluster for parallel computing
registerDoParallel(cl)
x <- foreach(i=1:2, .combine=rbind) %:%
  foreach(j=1:2, .combine=cbind) %:%
  foreach(k=1:2, .combine=c) %dopar% {
    i+j+k
  }
Here, I basically combine results into vectors, then matrices and, finally, I pile up matrices by rows. Another option would be to use lists, or pile matrices through columns, but you can imagine the mess when you have 7 dimensions and millions of iterations to track.
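To make the bookkeeping concrete, here is a rough sketch of how that rbind/cbind/c result could be mapped back to a 3-dimensional array. It only mimics the three combines with base R (lapply and do.call stand in for foreach), and the value 100*i + 10*j + k is a made-up placeholder chosen so that the indices are visible in the output. Treat it as an illustration under those assumptions, not as what foreach does internally.

```r
# Mimic the nested combines with base R (lapply/do.call are stand-ins, not foreach itself)
m <- do.call(rbind, lapply(1:2, function(i) {
  do.call(cbind, lapply(1:2, function(j) {
    # innermost combine with c(): a vector over k
    sapply(1:2, function(k) 100 * i + 10 * j + k)
  }))
}))
# m is 4 x 2: rows run over k within i, columns run over j.
# Reading m in column-major order gives an array indexed [k, i, j].
a <- array(m, dim = c(2, 2, 2))
# Permute the dimensions so that the result is indexed [i, j, k]
x <- aperm(a, c(2, 3, 1))
x[2, 1, 2]  # the value computed for i = 2, j = 1, k = 2
```

With 7 dimensions the same idea applies, but the permutation has to be worked out by hand, which is exactly the mess described above.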
I suppose I could also write my own 'combine' function and get the kind of output I want, but I suspect that I am not the first person to encounter this problem. Either there is a way to do exactly what I want, or someone here can point out a way to think differently about storing my results. It wouldn't be surprising that I am taking an absurdly inefficient path toward solving this problem -- I am an economist, not a data scientist, after all!
Any help would be greatly appreciated. Thanks in advance.
There is one available solution that I finally stumbled upon tonight. I can create an appropriate combination function along the dimension of my choice using the 'abind' function of the 'abind' package:
library(abind)
# Get the number of cores to use
no_cores <- max(1, detectCores()-1)
# Make cluster object using no_cores
cl <- makeCluster(no_cores)
# Initialize cluster for parallel computing
registerDoParallel(cl)
mbind <- function(...) abind(..., along=3)
x <- foreach(i=1:2, .combine=mbind) %:%
  foreach(j=1:2, .combine=cbind) %:%
  foreach(k=1:2, .combine=c) %dopar% {
    i+j+k
  }
I would still like to see if someone has other means of doing what I want to do, however. There might be many ways to do it and I am new to R, yet this solution is a distinct possibility.
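Another way to think about storing the results, sketched here under assumptions: flatten all index combinations into one table with expand.grid(), run a single loop over its rows, and reshape the flat result with array(). The sizes in dims are the toy sizes from the example, and vapply is only a sequential stand-in for whatever parallel loop would be used (foreach, parSapply, and so on).

```r
# Dimensions of the target array (toy sizes from the example above)
dims <- c(2, 2, 2)
# expand.grid varies its first column fastest, which matches R's column-major array layout
grid <- expand.grid(i = seq_len(dims[1]), j = seq_len(dims[2]), k = seq_len(dims[3]))
# One flat loop over all index combinations; this is the loop one would parallelize
vals <- vapply(seq_len(nrow(grid)), function(r) {
  grid$i[r] + grid$j[r] + grid$k[r]
}, numeric(1))
# Reshape the flat vector: x_flat[i, j, k] holds the value computed for (i, j, k)
x_flat <- array(vals, dim = dims)
```

The appeal of this layout is that the number of dimensions no longer affects the loop: going from 3 to 7 dimensions only adds columns to grid and entries to dims.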
Here is what I would do, and what I already use in one of my packages, bigstatsr.
Take only one dimension and cut it into no_cores blocks. It should have sufficient iterations (e.g. 20 for 4 cores). For each iteration of the parallel loop, construct the corresponding part of the array and store it in a temporary file. Then, use the contents of these files to fill the whole array. By doing so, you fill only preallocated objects, which should be faster and easier.
Example:
x.all <- array(dim=c(20,2,2))
no_cores <- 3
tmpfile <- tempfile()
range.parts <- bigstatsr:::CutBySize(nrow(x.all), nb = no_cores)
library(foreach)
cl <- parallel::makeCluster(no_cores)
doParallel::registerDoParallel(cl)
foreach(ic = 1:no_cores) %dopar% {
  ind <- bigstatsr:::seq2(range.parts[ic, ])
  x <- array(dim = c(length(ind), 2, 2))
  for (i in seq_along(ind)){
    for (j in 1:2){
      for (k in 1:2){
        x[i,j,k] <- ind[i]+j+k
      }
    }
  }
  saveRDS(x, file = paste0(tmpfile, "_", ic, ".rds"))
}
parallel::stopCluster(cl)
for (ic in 1:no_cores) {
  ind <- bigstatsr:::seq2(range.parts[ic, ])
  x.all[ind, , ] <- readRDS(paste0(tmpfile, "_", ic, ".rds"))
}
print(x.all)
Instead of writing files, you could also directly return the no_cores parts of the array in foreach and combine them with the right abind.
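A minimal sketch of that file-free variant, with two substitutions that are assumptions on my side: parallel::splitIndices() replaces the internal bigstatsr helpers for cutting the first dimension, and lapply is a sequential stand-in for foreach. The parts are written into the preallocated array by index rather than with abind, so only base R is needed.

```r
library(parallel)

x.all <- array(dim = c(20, 2, 2))
n_parts <- 3
# Cut the first dimension into n_parts contiguous blocks of indices
chunks <- splitIndices(dim(x.all)[1], n_parts)
# Each task builds and returns its own block (lapply stands in for foreach here)
parts <- lapply(chunks, function(ind) {
  x <- array(dim = c(length(ind), 2, 2))
  for (i in seq_along(ind)) {
    for (j in 1:2) {
      for (k in 1:2) {
        x[i, j, k] <- ind[i] + j + k
      }
    }
  }
  x
})
# Fill the preallocated array block by block
for (ic in seq_along(chunks)) {
  x.all[chunks[[ic]], , ] <- parts[[ic]]
}
```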
I have a large database and I wrote a code which executes the same calculations on that in a rolling manner by nesting it in a for loop. My problem is that the code runs pretty long. As I read, this is probably caused by R using a single-threaded method as default. As far as I know, foreach package would make it possible to speed up the execution by considerable time, however, I am unsure how to implement it. Currently, my code looks like this, in every iteration I subset a chunk of the large database and do various stuff with these subsets. At the end of an iteration, I collect the output in a time series. Is it possible to apply foreach in this situation?
for (k in seq(1, 5284, 21)) {
  fdata <- data[k:(k+251), ]
  tdata <- data[(k+252):(k+377), ]
}
Thanks!
This is certainly doable using foreach. Depending on your OS you would first have to load a suitable backend (e.g. SNOW on a windows machine) and then set up a cluster.
Example:
library(foreach)
library(doSNOW)
# set number of cores/CPUs to be used
(n_cores <- parallel::detectCores() - 1)
# some example data
dat <- matrix(1:1e3, ncol = 10)
# the set you iterate over
ks <- 1:99
# run stuff in parallel
cl <- makeCluster(n_cores)
registerDoSNOW(cl)
# name the iteration variable (k = ks) so that each task sees a single value of k
result <- foreach(k = ks) %dopar% {
fdata <- dat[k:(k+1), ]
# do computationally expensive stuff with `fdata`
# ... and return something
cumsum(fdata[1,] + fdata[2,])
}
stopCluster(cl)
By default result will be a list of the results. There are, however, ways to combine into an array or similar. Look at details on the .combine argument in ?foreach.
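If you would rather keep the default list output, a base R alternative to .combine is to bind the list afterwards. The sketch below reuses the example data from above and uses lapply as a sequential stand-in for the foreach loop, so it runs without a cluster; the binding step is the same for a list returned by foreach.

```r
# Same example data as above
dat <- matrix(1:1e3, ncol = 10)
# lapply stands in for foreach here; both return a list with one element per k
result <- lapply(1:99, function(k) {
  fdata <- dat[k:(k + 1), ]
  cumsum(fdata[1, ] + fdata[2, ])
})
# Stack the 99 vectors of length 10 into a 99 x 10 matrix, one row per k
result_mat <- do.call(rbind, result)
dim(result_mat)
```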
I'm using the R package QCA (https://cran.r-project.org/web/packages/QCA/index.html) for Qualitative Comparative Analysis. I want to be able to try out many different combinations, which is taking a very long time. On my faster CPU, trying all the options that I am interested in takes a little over 24 hours. R seems to be using only one of the cores available on my CPU and requires relatively little memory (just under 100MB). I am hoping someone has a good idea on how to speed up this process, perhaps through parallelization?
Here's what I'm doing:
Loading my data set (data), which is a CSV file with the outcome condition and all the options for my causal conditions. The causal conditions are in 4 groups A, B, C, and D. There are approx. 200 observations, i.e., rows in the data set.
Starting a log file with sink()
Creating a series of nested loops to generate each combination of causal conditions I want to examine.
Running the minimize() function within the nested loops. Specifically this looks like this:
for (a in causal_condition_group_A) {
  for (b in causal_condition_group_B) {
    for (c in causal_condition_group_C) {
      for (d in causal_condition_group_D) {
        minimize(data, outcome = my_outcome, conditions = paste0(a, ", ", b, ", ", c, ", ", d), ...)
      }
    }
  }
}
The minimize function's conditions argument essentially takes a character vector as input and this is all my nested loops are creating. For example, a random conditions argument might read:
conditions = "causal_condition_A_87, causal_condition_B_2, causal_condition_C_42, causal_condition_D_219"
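Since the nested loops only build these strings, they can be flattened into a single vector first, which gives one flat loop to parallelize. A small sketch, where the group contents are hypothetical placeholder names and not my real column names:

```r
# Hypothetical placeholder names for the four groups of causal conditions
group_A <- c("A_1", "A_2")
group_B <- c("B_1", "B_2", "B_3")
group_C <- c("C_1")
group_D <- c("D_1", "D_2")
# One row per combination of a, b, c and d
grid <- expand.grid(a = group_A, b = group_B, c = group_C, d = group_D,
                    stringsAsFactors = FALSE)
# One conditions string per combination, in the same format as above
conditions <- paste(grid$a, grid$b, grid$c, grid$d, sep = ", ")
length(conditions)
conditions[1]
```

Each element of conditions could then be passed to minimize() from a single parSapply or foreach call instead of four nested loops.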
I tried several different things in an attempt to parallelize this approach, but so far I have not been successful. I tried experimenting with both parSapply and foreach %dopar%, but I am running into various problems. I either can't get the actual parallelization process to work properly or - and this is in some of my toy experiments - I am having trouble logging all the output, which is essential.
Please let me know if I can provide additional information to help clear things up! Thanks for your help!
EDIT:
I was able to create a working foreach() loop based on @HenrikB's advice, but I'm running into a different problem now.
Here's my test solution so far. It includes one less nested loop than I want in the final solution, but that's not important for now:
# SET QCA OUTCOME CONDITION
outcome = "c_outcome"
# LOAD LIBRARIES
library(doParallel)
library(QCA)
# CREATE CLUSTER FOR PARALLELIZATION
cores <- detectCores()
cl <- makeCluster(cores[1]-1, type = "PSOCK", outfile="")
registerDoParallel(cl)
# LOAD AND SET UP DATA
outcomecond <- read.csv("outcomes.csv", header=TRUE, row.names="ID")
causalcond <- read.csv("causal_conditions.csv", header=TRUE, row.names="ID")
data <- cbind(outcomecond[outcome], causalcond)
temp <- data[!is.na(data[outcome]), ] #keep only rows where outcome is not NA
# EXPORT CURRENT DATASET TO CLUSTERS
clusterExport(cl, "temp")
# CREATE CAUSAL CONDITION LISTS
causal_condition_group_A <- colnames(causalcond[, 1:99])
causal_condition_group_B <- colnames(causalcond[, 100:141])
causal_condition_group_C <- colnames(causalcond[, 142:183])
# EXPORT LIBRARIES TO CLUSTERS
clusterEvalQ(cl, library(doParallel))
clusterEvalQ(cl, library(QCA))
# START TIMER
start.time <- Sys.time()
# THREE NESTED FOREACH LOOPS (ONE FOR EACH CAUSAL CONDITION GROUP)
x <-
foreach(c=causal_condition_group_C, .combine='cbind') %:%
foreach(b=causal_condition_group_B, .combine='cbind') %:%
foreach(a=causal_condition_group_A, .combine='cbind') %dopar% {
tryCatch({
minimize(temp, outcome = outcome, conditions = paste0(a,",",b,",",c), n.cut = 1, incl.cut = 0.400, include = "?", details = TRUE, use.letters = TRUE)
}, error=function(e){cat("ERROR :",conditionMessage(e), "\n")})
}
# END TIMER
end.time <- Sys.time()
# PROGRAM RUNNING TIME:
print(end.time - start.time)
# END CLUSTER
stopCluster(cl)
When I run this as a sequential %do% loop, I need something like 3 GB of memory to store x. However, when I try to run this in parallel with %dopar%, the memory use rises exponentially. Here's a screenshot from shortly before I gave up:
Screenshot of task manager
Does someone know why %dopar% is using so much more memory and is there a way to avoid this?
Could I, for example, be writing x to file once in a while and purge memory after I do this? x is a list that is of dimension 14 by number of iterations in the foreach() loops. Here is what one of the "columns" in x looks like:
result.98
tt List,11
options List,10
negatives Numeric,3
initials "~A*B"
PIchart TRUE
primes Integer,2
solution List,1
essential "~A*B"
inputcases "205,245,253,306,425,468,490,511,514,585,587,657,684,696,739,740,784,796"
pims List,1
IC List,4
numbers Numeric,4
SA List,1
call Expression
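To illustrate the write-to-file idea from the question, here is a minimal sketch: each task saves its own result with saveRDS() and returns only the file path, so the combined object stays small. fake_task() is a hypothetical stand-in for the minimize() call, the output directory is an assumption, and vapply is a sequential stand-in for the foreach loop. It does not explain why %dopar% needs more memory; it only shows one way to keep x out of memory.

```r
# Directory for per-task results (assumed location; any writable folder works)
out_dir <- file.path(tempdir(), "qca_results")
dir.create(out_dir, showWarnings = FALSE)

# Hypothetical stand-in for the minimize() call: returns a moderately large list
fake_task <- function(id) list(id = id, payload = rnorm(1000))

# vapply stands in for the foreach loop; each task writes its result to disk
paths <- vapply(1:5, function(id) {
  res <- fake_task(id)
  path <- file.path(out_dir, sprintf("result_%03d.rds", id))
  saveRDS(res, path)
  path  # return only the short path string, not the result itself
}, character(1))

# Results can be read back one at a time when they are needed
first <- readRDS(paths[1])
```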
I have the following R "apply" statement:
for(i in 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation))
{
matrix_of_sums[,i]<-
apply(simulation_results[,colnames(simulation_results) %in%
dataframe_stuff_that_needs_lookup_from_simulation[i,]],1,sum)
}
So, I have the following data structures:
simulation_results: A matrix with column names that identify every possible piece of desired simulation lookup data for 2000 simulations (rows).
dataframe_stuff_that_needs_lookup_from_simulation: Contains, among other items, fields whose values match the column names in the simulation_results data structure.
matrix_of_sums: When function is run, a 2000 row x 250,000 column (# of simulations x items being simulated) structure meant to hold simulation results.
So, the apply function is looking up the dataframe columns values for each row in a 250,000 data set, computing the sum, and storing it in the matrix_of_sums data structure.
Unfortunately, this processing takes a very long time. I have explored the use of rowsums as an alternative, and it has cut the processing time in half, but I would like to try multi-core processing to see if that cuts processing time even more. Can someone help me convert the code above to "lapply" from "apply"?
Thanks!
With base R parallel, try
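library(parallel)
cl <- makeCluster(detectCores())
# workers start with empty sessions, so send them the objects the function uses
clusterExport(cl, c("simulation_results", "dataframe_stuff_that_needs_lookup_from_simulation"))
matrix_of_sums <- parLapply(cl, 1:nrow(dataframe_stuff_that_needs_lookup_from_simulation), function(i)
  rowSums(simulation_results[,colnames(simulation_results) %in%
    dataframe_stuff_that_needs_lookup_from_simulation[i,]]))
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)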
You could also try foreach %dopar%
library(doParallel) # will load parallel, foreach, and iterators
cl <- makeCluster(detectCores())
registerDoParallel(cl)
matrix_of_sums <- foreach(i = 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) %dopar% {
rowSums(simulation_results[,colnames(simulation_results) %in%
dataframe_stuff_that_needs_lookup_from_simulation[i,]])
}
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)
I wasn't quite sure how you wanted your output at the end, but it looks like you're doing a cbind of each result. Let me know if you're expecting something else however.
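For a quick check of the conversion itself, here is a small sequential sketch on made-up data (sim and lookup are hypothetical toy objects, not your real structures). It compares the original for loop with the lapply form; the same function body is what would be handed to parLapply or foreach.

```r
set.seed(1)
# Toy stand-ins: 5 simulations (rows) of 4 named quantities, and 3 items to look up
sim <- matrix(rnorm(5 * 4), nrow = 5, dimnames = list(NULL, c("a", "b", "c", "d")))
lookup <- data.frame(f1 = c("a", "b", "c"), f2 = c("b", "d", "d"),
                     stringsAsFactors = FALSE)

# Original approach: fill a preallocated matrix column by column
loop_sums <- matrix(NA_real_, nrow = nrow(sim), ncol = nrow(lookup))
for (i in seq_len(nrow(lookup))) {
  keep <- colnames(sim) %in% unlist(lookup[i, ])
  loop_sums[, i] <- rowSums(sim[, keep, drop = FALSE])
}

# lapply form: one vector of row sums per lookup row, bound together afterwards
cols <- lapply(seq_len(nrow(lookup)), function(i) {
  keep <- colnames(sim) %in% unlist(lookup[i, ])
  rowSums(sim[, keep, drop = FALSE])
})
lapply_sums <- do.call(cbind, cols)

all.equal(loop_sums, lapply_sums)
```

The drop = FALSE is there so that a lookup row matching a single column still yields a matrix for rowSums.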
Without really having any applicable or sample data to go off of, the process would look like this:
Create a holding matrix(matrix_of_sums)
loop by row through variable table(dataframe_stuff_that_needs_lookup_from_simulation)
find matching indices within the simulation model(simulation_results)
bind the rowSums into the holding matrix(matrix of sums)
I recreated a sample set which is meaningless and produces identical results but should work for your data
# Loop over the rows in parallel and return one column of row sums per row.
# Forked workers cannot modify objects in the parent session, so assigning
# with `<<-` inside mclapply would leave the holding matrix empty.
cols <- parallel::mclapply(1:nrow(ts_df), function(i){
  # Store the row to its own variable for ease
  d <- ts_df[i,]
  rowSums(
    sim_df[, which(colnames(sim_df) %in% colnames(d)), drop = FALSE]
  )
}, mc.cores = parallel::detectCores(), mc.allow.recursive = TRUE)
# Holding matrix which is our end-goal: cbind the returned columns
msums <- do.call(cbind, cols)
I have a function called DTW in similarity measure package. It takes two matrices or data frames as its arguments and returns the dynamic time warping distance. Those data frames hold the longitudes and latitudes of trajectories.
My program looks like this and all the data frames like df1, df2,df3 and so on are available:
distance <- function(arg1,arg2) {
DTW(arg1, arg2)
}
for(i in 1:length(LIST)){
for(j in 1:length(LIST)){
a <- get(paste0("df",i))
b <- get(paste0("df",j))
ddist[i,j] <- distance(a,b)
print(ddist)
}
}
I am building a matrix ddist that holds all the values returned by the distance function. The program is working fine. I want to make it faster using parallel programming, for example with the parApply or parLapply function.
Here is a simple method to give you an idea of how to make it parallel
k<-length(LIST)
ddist<-matrix(0,k,k)
library("doParallel")
cl<-makeCluster(4,outfile='')
registerDoParallel(cl)
for(i in 1:k) {
a <- get(paste0("df",i))
ddist[i,]=foreach(j = 1:k , .combine='cbind' ,.export=paste0("df",1:k)) %dopar% {
b <- get(paste0("df",j))
distance(a,b)
}
}
stopCluster(cl)
Having said that, here are some things to evaluate:
Use parallel execution only if the distance function takes more than 2 seconds.
Separate variables such as df1, df2, etc. may not be a good idea; store the data frames in a list and access them as df[[1]], df[[2]], which is better than using the get function.
If k is very large, transferring the exported df1, df2, etc. to the workers takes quite a long time, so try to find the sweet spot of performance by testing with various numbers of iterations.
You could look at data.table, which supports in-place edits, and use it instead of ddist, as it might be faster.
If this code is called within a function, you might also need to export the distance function, like .export=c("distance", paste0("df",1:k)).
Change the "4" in makeCluster to choose the number of cores you want; as a rule of thumb, keep it at detectCores()-1.
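The list-storage suggestion can be sketched like this. Everything here is illustrative: trajs holds made-up trajectories, toy_distance() is a hypothetical stand-in for DTW(), and mapply is a sequential stand-in for the parallel loop. The point is only that a list plus a table of index pairs replaces get(paste0("df", i)).

```r
set.seed(42)
# Hypothetical trajectories stored in one list instead of df1, df2, ...
trajs <- lapply(1:4, function(i) matrix(runif(10), ncol = 2))
# Stand-in for DTW(): any function that takes two trajectories works here
toy_distance <- function(a, b) sqrt(sum((colMeans(a) - colMeans(b))^2))

k <- length(trajs)
# All (i, j) pairs; i varies fastest, which matches column-major matrix filling
pairs <- expand.grid(i = seq_len(k), j = seq_len(k))
# One flat loop over the pairs (mapply is a sequential stand-in for a parallel loop)
d <- mapply(function(i, j) toy_distance(trajs[[i]], trajs[[j]]), pairs$i, pairs$j)
# ddist[i, j] is the distance between trajectory i and trajectory j
ddist <- matrix(d, nrow = k, ncol = k)
```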
Is there a problem when accessing/writing to global variable in using doSNOW package on multiple cores?
In the below program, each of the MyCalculations(ii) writes to the ii-th column of the matrix "globalVariable"...
Do you think the result will be correct? Will there be hidden catches?
Thanks a lot!
p.s. I have to write out to the global variable because this is a simplified example, in fact I have lots of outputs that need to be transported from within the parallel loops... therefore, probably the only way is to write out to global variables...
library(doSNOW)
MaxSearchSpace=44*5
globalVariable=matrix(0, 10000, MaxSearchSpace)
cl<-makeCluster(7)
registerDoSNOW(cl)
foreach (ii = 2:MaxSearchSpace, .combine=cbind, .verbose=F) %dopar%
{
  MyCalculations(ii)
}
stopCluster(cl)
p.s. I am asking - within the DoSnow framework, is there any danger of accessing/writing global variables... thx
Since this question is a couple months old, I hope you've found an answer by now. However, in case you're still interested in feedback, here's something to consider:
When using foreach with a parallel backend, you won't be able to assign to variables in R's global environment in the way you're attempting (you probably noticed this). Using a sequential backend, assignment will work, but not using a parallel one like with doSNOW.
Instead, save all the results of your calculations for each iteration in a list and return this to an object, so that you can extract the appropriate results after all calculations have been completed.
My suggestion starts similarly to your example:
library(doSNOW)
MaxSearchSpace <- 44*5
cl <- makeCluster(parallel::detectCores())
# do not create the globalVariable object
registerDoSNOW(cl)
# Save the results of the `foreach` iterations as
# lists of lists in an object (`theRes`)
theRes <- foreach (ii = 2:MaxSearchSpace, .verbose=F) %dopar%
{
# do some calculations
theNorms <- rnorm(10000)
thePois <- rpois(10000, 2)
# store the results in a list
list(theNorms, thePois)
}
After all iterations have been completed, extract the results from theRes and store them as objects (e.g., globalVariable1, globalVariable2, etc.)
globalVariable1 <- do.call(cbind, lapply(theRes, "[[", 1))
globalVariable2 <- do.call(cbind, lapply(theRes, "[[", 2))
With this in mind, if you are performing calculations with each iteration that are dependent on the results of calculations from previous iterations, then this type of parallel computing is not the approach to take.
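A small illustration of that distinction, using a made-up recursion (the 0.5 coefficient, the sizes and the names are all hypothetical): within one path each step needs the previous one, so that loop stays sequential, while separate paths are independent of each other and could be distributed. lapply is a sequential stand-in for foreach here.

```r
set.seed(1)
n_rep <- 3  # independent replications
n_t <- 5    # time steps within each replication
shocks <- matrix(rnorm(n_rep * n_t), nrow = n_t)

# Inner loop: step t depends on step t - 1, so it cannot be split across workers
simulate_path <- function(e) {
  x <- numeric(length(e))
  x[1] <- e[1]
  for (t in 2:length(e)) {
    x[t] <- 0.5 * x[t - 1] + e[t]
  }
  x
}

# Outer loop: replications do not depend on each other, so this is the part
# that could be handed to foreach (lapply is a sequential stand-in)
paths <- lapply(seq_len(n_rep), function(r) simulate_path(shocks[, r]))
path_mat <- do.call(cbind, paths)
```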