I have found a feature/bug in the foreach package which I do not understand. Perhaps someone can explain this behaviour to me:
I created a loop with the foreach package (I use it together with multicore calculations, but here in a purely sequential example; the bug appears in both variants). This loop runs r times. In every run a list with c entries is returned, so I expect a list with r entries, each of which consists of c lists.
My code was the following:
library(foreach)
clusters <- 10
runs <- 100
temp <- foreach(r = 1:runs,
                .combine = 'list',
                .multicombine = TRUE) %do% {
  signal_all <- lapply(1:clusters, function(x) {
    return(1)
  })
  return(signal_all)
} ## end do
With this code everything works as expected: temp is a list with 100 entries, each of which is itself a list of 10 entries.
But when increasing runs <- 101, the expected list structure of temp is destroyed. However, when the line .combine = 'list' is commented out, everything works as expected again:
library(foreach)
clusters <- 10
runs <- 100
temp <- foreach(r = 1:runs,
                .multicombine = TRUE) %do% {
  signal_all <- lapply(1:clusters, function(x) {
    return(1)
  })
  return(signal_all)
} ## end do
Can someone explain this behaviour?
Thanks for any help!
Meanwhile I have found a solution.
The foreach function knows that some combine functions (e.g. c or cbind) take many arguments, and will call them with up to 100 arguments (by default) in order to improve performance. With the argument .maxcombine you can set this limit manually.
library(foreach)
clusters <- 10
runs <- 101
temp <- foreach(r = 1:runs,
                .combine = 'list',
                .maxcombine = runs,
                .multicombine = TRUE) %do% {
  signal_all <- lapply(1:clusters, function(x) {
    return(1)
  })
  return(signal_all)
} ## end do
A few days ago I asked a question about calling a custom-made function within a loop, which was well resolved by a combination of
eval(parse(text = Function text))
Here is the link: Automatic creation and use of custom made function in R.
This allowed me to use a for loop and automatically call the function I need from a data frame storing the body of the function to create.
Now I would like to take the question to the next level. My problem is computation time. I need to evaluate something like 52 indices from a hyperspectral image. This means that in R my hyperspectral image is loaded as a 3D array of 512x512x204 bands.
What I would like to do is run the evaluation of the indices in parallel to reduce the computation time.
Here is a dummy example of what I would like to emulate, but without parallel computing:
# create a fake array representing my hyperspectral image
library(fields) # for image.plot
HYPR_IMG=array(NA,dim=c(5,3,4))
HYPR_IMG[,,1]=1
HYPR_IMG[,,2]=2
HYPR_IMG[,,3]=3
HYPR_IMG[,,4]=4
image.plot(HYPR_IMG[,,1], zlim=c(0,20))
image.plot(HYPR_IMG[,,2], zlim=c(0,20))
image.plot(HYPR_IMG[,,3], zlim=c(0,20))
image.plot(HYPR_IMG[,,4], zlim=c(0,20))
#create a fake DF simulating my indices stored in a data frame
IDXname=c("IDX1","IDX2","IDX3","IDX4")
IDXFunc=c("HYPR_IMG[,,1] + 3*HYPR_IMG[,,2]",
          "HYPR_IMG[,,3] + HYPR_IMG[,,2]",
          "HYPR_IMG[,,4] + HYPR_IMG[,,2] - HYPR_IMG[,,3]",
          "HYPR_IMG[,,1] + HYPR_IMG[,,4] + 4*HYPR_IMG[,,2] + HYPR_IMG[,,3]")
IDX_DF=as.data.frame(cbind(IDXname,IDXFunc))
# that was what I did before
Store_DF=data.frame(NA)
for (i in 1:length(IDX_DF$IDXname)) {
  IDX_ID=IDX_DF$IDXname[i]
  IDX_Fun_tmp=IDX_DF$IDXFunc[which(IDX_DF$IDXname==IDX_ID)] # extra care to select the right function
  IDXFunc_call=paste("IDXfun_tmp=function(HYPR_IMG){",IDX_Fun_tmp,"}",sep="")
  eval(parse(text = IDXFunc_call))
  IDX_VAL=IDXfun_tmp(HYPR_IMG)
  image.plot(IDX_VAL,zlim=c(0,20)); title(main=IDX_ID)
  temp_DF=as.vector(IDX_VAL)
  Store_DF=cbind(Store_DF,temp_DF)
  names(Store_DF)[i+1] <- as.vector(IDX_ID)
}
My final goal is to end up with the very same Store_DF, storing all the index values. Here I have a for loop, but using a foreach loop things should speed up. If it matters, I am working on Windows 8 or later.
Is it really possible?
Will I be able, in the end, to reduce the overall computation time and still get the same Store_DF data frame, or something similar like a matrix with column names?
Thanks a lot!!!
For this specific example, using either the built-in parallelization of a package like data.table or a parallel apply might be more beneficial.
Below is a minimal example of how to achieve the result using parApply from the parallel package. Note that the output is a matrix, which actually yields slightly better performance in base R (not necessarily the case in the tidyverse or data.table). In case the data.frame structure is vital, you will have to convert it with as.data.frame.
cl <- parallel::makeCluster( parallel::detectCores() )
result <- parallel::parApply(cl = cl, X = IDX_DF, MARGIN = 1, FUN = function(x, IMAGES){
  IDX_ID <- x[["IDXname"]]
  eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", x[["IDXFunc"]], "}")))
  IDX_VAL <- as.vector(IDXfun_tmp(IMAGES))
  names(IDX_VAL) <- IDX_ID
  IDX_VAL
}, IMAGES = HYPR_IMG)
colnames(result) = IDXname
parallel::stopCluster(cl)
Please note the stopCluster(cl), which is important for shutting down any loose R sessions.
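In case the data.frame structure from the original loop is needed, the matrix can be converted afterwards (a minimal sketch, assuming the result object from the parApply call above; Store_DF is just the name used in the question):
Store_DF <- as.data.frame(result)  # same columns as the loop version (minus its leading NA column)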
Benchmark results (4 tiny cores):
Unit: milliseconds
expr min lq mean median uq max neval
Loop 8.420432 9.027583 10.426565 9.272444 9.943783 26.58623 100
Parallel 1.382324 1.491634 2.038024 1.554690 1.907728 18.23942 100
For replications of benchmarks the code has been provided below:
cl <- parallel::makeCluster( parallel::detectCores() )
microbenchmark::microbenchmark(
  Loop = {
    Store_DF = data.frame(NA)
    for (i in 1:length(IDX_DF$IDXname)) {
      IDX_ID = IDX_DF$IDXname[i]
      IDX_Fun_tmp = IDX_DF$IDXFunc[which(IDX_DF$IDXname == IDX_ID)] # extra care to select the right function
      eval(parse(text = paste0("IDXfun_tmp = function(HYPR_IMG){", IDX_Fun_tmp, "}")))
      IDX_VAL = IDXfun_tmp(HYPR_IMG)
      # Plotting inside a benchmark/parallel run is not a good idea. It will most often not work
      # and might make the R session crash or slow down significantly (at best the latter, at worst the former).
      #image.plot(IDX_VAL, zlim = c(0,20)); title(main = IDX_ID)
      temp_DF = as.vector(IDX_VAL)
      Store_DF = cbind(Store_DF, temp_DF)
      names(Store_DF)[i+1] <- as.vector(IDX_ID)
    }
    rm(Store_DF)
  },
  Parallel = {
    result <- parallel::parApply(cl = cl, X = IDX_DF, MARGIN = 1, FUN = function(x, IMAGES){
      IDX_ID <- x[["IDXname"]]
      eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", x[["IDXFunc"]], "}")))
      IDX_VAL <- as.vector(IDXfun_tmp(IMAGES))
      names(IDX_VAL) <- IDX_ID
      IDX_VAL
    }, IMAGES = HYPR_IMG)
    colnames(result) = IDXname
    rm(result)
  }
)
parallel::stopCluster(cl)
Edit: Using the foreach package
After a few comments about performance issues (maybe due to memory), I decided to make an illustration of how one could obtain the same result using the foreach package. A few notes:
The foreach package uses iterators. By default it can be used like a for loop, where it will iterate over each column of a data.frame.
Like other parallel implementations in R, if you are on Windows you will often have to export the data used for the calculations. It can sometimes be avoided with some fiddling, and foreach will sometimes let you get away without exporting data. When this is the case is unclear from the documentation.
The output of foreach will be combined either as a list or as defined by the .combine argument, which can be rbind, cbind or any other function.
There are a lot of comments, making the code seem a lot longer than it actually is. Removing comments and blank lines, it is 9 lines longer.
Below is the code which will yield the same output as above. Note that I have used the data.table package; for more information about this package I suggest their wiki on GitHub.
cl <- parallel::makeCluster( parallel::detectCores() )
#Foreach uses doParallel for the parallelization
doParallel::registerDoParallel(cl)
#To iterate over the rows, we need to use iterators
# if foreach is given a matrix it will be converted to a column iterator
rowIterator <- iterators::iter(IDX_DF, by = "row")
library(foreach)
result <-
  foreach(
    #Supply the iterator
    row = rowIterator,
    #Specify if the calculations need to be in order. If not, we can get better performance by not requiring it
    .inorder = FALSE,
    #In most foreach loops you will have to export the data you need for the calculations
    # it worked without exporting for me, which is faster when the exported object is large
    #.export = c("HYPR_IMG"),
    #We need to say how the output should be merged. If nothing is given it will be returned as a list
    #data.table's rbindlist is faster than rbind (and returns a data.table)
    .combine = function(...) data.table::rbindlist(list(...)),
    #otherwise we could've used:
    #.combine = rbind
    #if we don't use rbind or cbind (I used data.table::rbindlist for speed)
    # we have to tell foreach that the combine function can take more than 1 argument
    .multicombine = TRUE
  ) %dopar% #%do% runs sequentially, %dopar% in parallel, %:% nests loops (the next foreach describes the inner loop)
  {
    IDX_ID <- row[["IDXname"]]
    eval(parse(text = paste0("IDXfun_tmp <- function(HYPR_IMG){", row[["IDXFunc"]], "}")))
    IDX_VAL <- as.vector(IDXfun_tmp(HYPR_IMG))
    data.frame(ID = IDX_ID, IDX_VAL)
  }
#output is saved in result
result
library(data.table) # for dcast and :=
result_reformatted <- dcast(result[, indx := 1:.N, by = ID],
                            indx ~ ID,
                            value.var = "IDX_VAL")
#if we don't want to use data.table we could use unstack instead
unstack(result, IDX_VAL ~ ID)
I have a generic chunking function that breaks big calls into smaller pieces and runs them in parallel.
chunk_it <- function(d, n, some_fun) {
  # run n chunks of d in parallel
  dat <- foreach(...) %dopar% {
    some_fun(...)
  }
}
I want to make it so that this generic chunking function can identify whether it is being called by a process that is already running in parallel ("chunked" in my terminology):
chunked_highlevel <- function(d, n, some_fun) {
  # run n chunks of d in parallel
  ...
  chunk_it(lowerlevel_d, n) # do not chunk!
}
What I would like to happen here is that, if I have already chunked the process at a higher level, the chunking function is not activated again at the lower level.
Is there a way to identify when you're already inside a parallel process?
So that we could write code like this:
chunk_it <- function(d, n, some_fun) {
  # run n chunks of d in parallel
  if (!already_parallel) {
    dat <- foreach(...) %dopar% {
      some_fun(...)
    }
  } else {
    dat <- some_fun()
  }
}
I don't think there's an official way of doing this. However, in general there should be code evident in the call stack which makes it obvious whether you're in parallel code. What I've got so far looks like this. It seems to work for doSNOW with either MPI or SOCK, but will probably need adjustment for other packages that implement %dopar%. It's also dependent on some internal details of snow which may be subject to change in future versions.
library(doSNOW)
library(foreach)

my_fn <- function(bit) {
  is_parallel <- any(unlist(lapply(sys.calls(), function(cal) {
    as.character(cal[[1]]) %in% c("slaveLoop", "%dopar%")
  })))
  is_parallel
}
foreach(x = 1:2) %do% my_fn(x)
# [[1]]
# [1] FALSE
#
# [[2]]
# [1] FALSE
cl <- makeCluster(2)
registerDoSNOW(cl)
foreach(x = 1:2) %dopar% my_fn(x)
# [[1]]
# [1] TRUE
#
# [[2]]
# [1] TRUE
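To connect this back to the question, the check can be wrapped in a small helper and used inside chunk_it(). The sketch below is hypothetical: the helper name, the way d is split into n pieces and the use of a c combine are illustration only, not part of the original answer.
library(foreach)

# Helper: TRUE if we appear to be inside a %dopar% worker (same heuristic as above)
is_already_parallel <- function() {
  any(unlist(lapply(sys.calls(), function(cal) {
    as.character(cal[[1]]) %in% c("slaveLoop", "%dopar%")
  })))
}

chunk_it <- function(d, n, some_fun) {
  if (!is_already_parallel()) {
    # run n chunks of d in parallel (hypothetical chunking of a vector/list)
    pieces <- split(d, cut(seq_along(d), n, labels = FALSE))
    foreach(piece = pieces, .combine = c) %dopar% some_fun(piece)
  } else {
    some_fun(d)  # already inside a worker: fall back to a plain call
  }
}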
The future package (I'm the author) has built in support for nested parallelism so that you do not have to worry about it as a developer while still giving the end user full power to control how and where parallelization is taking place.
Here's an example from one of the future vignettes:
library("future")
library("listenv")

x <- listenv()
for (ii in 1:3) {
  x[[ii]] %<-% {
    y <- listenv()
    for (jj in 1:3) {
      y[[jj]] %<-% { ii + jj/10 }
    }
    y
  }
}
unlist(x)
## [1] 1.1 1.2 1.3 2.1 2.2 2.3 3.1 3.2 3.3
Note how there are two layers of future assignments (%<-%). The default is to always process them sequentially unless specified otherwise. For instance, to process the outer loop of future assignments in parallel on your local machine, use:
plan(multiprocess)
This will cause x[[ii]] %<-% { ... } for ii = 1, 2, 3 to run in parallel, while the contained y[[jj]] %<-% { ... } will run sequentially. The equivalent fully explicit setting for this is:
plan(list(multiprocess, sequential))
Now, if you want to run the outer loop of futures (x[[ii]]) sequentially and the inner loop of futures (y[[jj]]) in parallel, you can specify:
plan(list(sequential, multiprocess))
before running the code.
BTW, the number of parallel processes used with multiprocess is future::availableCores(). Think of it as parallel::detectCores(), but one that is also agile to options such as mc.cores, HPC cluster environments, etc. Importantly, future::availableCores() will return 1 if the code is already running in parallel (i.e. it is a parallel child). This means that if you do:
plan(list(multiprocess, multiprocess))
the inner layer of futures will actually only see a single core. You can think of this as a built-in automatic protection from creating a huge number of parallel processes by mistake through recursive parallelism.
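As a hedged illustration of that protection (assuming a recent version of the future package, where multisession plays the role of the multiprocess strategy used in the text), you can compare the cores seen by the master with the cores seen inside a future:
library(future)
plan(list(multisession, multisession))

availableCores()              # e.g. 8 on the master process
f <- future(availableCores()) # evaluated in a parallel worker of the outer layer
value(f)                      # typically 1: the built-in protection described above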
You can force a different setting though (but not recommended). For instance, say you want the outer layer to run four parallel tasks at the same time, and each of those tasks to run two parallel tasks at the same time (on your local machine), then you can use:
plan(list(
  tweak(multiprocess, workers = 4L),
  tweak(multiprocess, workers = 2L)
))
That will run at most 4*2 = 8 parallel tasks at the same time (plus the master process).
If you have a set of machines available, you can do:
plan(list(
  tweak(cluster, workers = c("machine1", "machine2", "machine3")),
  multiprocess
))
That will distribute the outer layer of futures (x[[ii]]) to those three machines, and the inner layer of futures (y[[jj]]) will run in parallel using all the available cores on those machines.
Note how the code doesn't change - only the settings (= plan() call). This is in the spirit of "write once, run wherever". There are many different future-strategy setups you can use; see the vignettes of the future package.
Now, what if you want to use foreach()? You can use the doFuture %dopar% adapter, which works on top of the future framework. For example:
library("doFuture")
registerDoFuture()

some_fun <- function(j) {
  list(j = j, pid.j = Sys.getpid())
}

my_fun <- function(i) {
  y <- foreach(j = 1:3) %dopar% { some_fun(j = j) }
  list(i = i, pid.i = Sys.getpid(), y = y)
}

x <- foreach(i = 1:3) %dopar% { my_fun(i = i) }
Run the above and look at str(x) and the different PIDs for the different plan() settings exemplified above. That will illustrate what's going on.
Hope this helps
Problem Description:
I have a big matrix c loaded in RAM. My goal is to have read-only access to it through parallel processing. However, when I create the connections, whether I use doSNOW, doMPI, big.matrix, etc., the amount of RAM used increases dramatically.
Is there a way to properly create a shared memory, where all the processes may read from, without creating a local copy of all the data?
Example:
libs <- function(libraries){ # Installs missing libraries and then loads them
  for (lib in libraries){
    if( !is.element(lib, .packages(all.available = TRUE)) ) {
      install.packages(lib)
    }
    library(lib, character.only = TRUE)
  }
}
libra <- list("foreach","parallel","doSNOW","bigmemory")
libs(libra)
#create a matrix of approximately 1GB
c <- matrix(runif(10000^2), 10000, 10000)
#convert it to a big.matrix
x <- as.big.matrix(c)
# get a description of the matrix
mdesc <- describe(x)
# Create the required connections
cl <- makeCluster(detectCores())
registerDoSNOW(cl)
out <- foreach(linID = 1:10, .combine = c) %dopar% {
  #load bigmemory
  require(bigmemory)
  # attach the matrix via shared memory??
  m <- attach.big.matrix(mdesc)
  #dummy expression to test data acquisition
  c <- m[1,1]
}
closeAllConnections()
RAM usage: memory use increases a great deal while foreach runs, and it is only freed once the loop ends.
I think the solution to the problem can be seen in a post by Steve Weston, the author of the foreach package, here. There he states:
The doParallel package will auto-export variables to the workers that are referenced in the foreach loop.
So I think the problem is that in your code your big matrix c is referenced in the assignment c<-m[1,1]. Just try xyz <- m[1,1] instead and see what happens.
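Applied to the loop in the question, the change is minimal (a sketch of the suggestion above, with nothing else modified):
out <- foreach(linID = 1:10, .combine = c) %dopar% {
  require(bigmemory)
  m <- attach.big.matrix(mdesc)
  xyz <- m[1, 1]   # no reference to the big matrix 'c', so it is not auto-exported
}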
Here is an example with a file-backed big.matrix:
#create a matrix of approximately 1GB
n <- 10000
m <- 10000
c <- matrix(runif(n*m), n, m)
#convert it to a file-backed big.matrix
x <- as.big.matrix(x = c, type = "double",
                   separated = FALSE,
                   backingfile = "example.bin",
                   descriptorfile = "example.desc")
# get a description of the matrix
mdesc <- describe(x)
# Create the required connections
cl <- makeCluster(detectCores())
registerDoSNOW(cl)

## 1) No referencing
out <- foreach(linID = 1:4, .combine = c) %dopar% {
  t <- attach.big.matrix("example.desc")
  for (i in seq_len(30L)) {
    for (j in seq_len(m)) {
      y <- t[i, j]
    }
  }
  return(0L)
}

## 2) Referencing
out <- foreach(linID = 1:4, .combine = c) %dopar% {
  invisible(c) ## c is referenced and thus exported to the workers
  t <- attach.big.matrix("example.desc")
  for (i in seq_len(30L)) {
    for (j in seq_len(m)) {
      y <- t[i, j]
    }
  }
  return(0L)
}
closeAllConnections()
Alternatively, if you are on Linux/Mac and you want copy-on-write (CoW) shared memory, use forks. First load all your data into the main process, and then launch worker processes (forks) with the general function mcparallel from the parallel package.
You can collect their results with mccollect, or use truly shared memory via a bigmemory big.matrix (originally Rdsm; see the edit below), like this:
library(parallel)
library(bigmemory) #for shared variables

size <- 10 # length of the shared vector (any small size works for this demo)
shared <- bigmemory::big.matrix(nrow = size, ncol = 1, type = 'double')
shared[1] <- 1                       #Init shared memory with some number

job <- mcparallel({shared[1] <- 23}) #...change it in another forked process
shared[1,1]                          #...and confirm that it gets changed
# [1] 23
You can confirm that the value really gets updated in the background if you delay the write:
fn <- function()
{
  Sys.sleep(1) #One second delay
  shared[1] <- 11
}
job <- mcparallel(fn())
shared[1]   #Execute immediately after the last command
# [1] 23
shared[1,1] #Execute after one second
# [1] 11
mccollect() #To destroy all forked processes (and possibly collect their output)
To control concurrency and avoid race conditions, use locks:
library(synchronicity) #for locks
m <- boost.mutex() #Let's create a mutex "m"

bad.incr <- function() #This function doesn't protect the shared resource with locks:
{
  a <- shared[1]
  Sys.sleep(1)
  shared[1] <- a + 1
}

good.incr <- function()
{
  lock(m)
  a <- shared[1]
  Sys.sleep(1)
  shared[1] <- a + 1
  unlock(m)
}

shared[1] <- 1
for (i in 1:5) job <- mcparallel(bad.incr())
shared[1] #You can verify that the value didn't get increased 5 times, due to race conditions

mccollect() #To clear all forked processes, not to get the values

shared[1] <- 1
for (i in 1:5) job <- mcparallel(good.incr())
shared[1] #As expected, eventually after 5 seconds of waiting you get 6
#[1] 6
mccollect()
Edit:
I simplified the dependencies a bit by replacing Rdsm::mgrmakevar with bigmemory::big.matrix. mgrmakevar internally calls big.matrix anyway, and we don't need anything more.
I'm trying to move from a serial to parallel approach to accomplish some multivariate time series analysis tasks on a large data.table. The table contains data for many different groups and I'm trying to move from a for loop to a foreach loop using the doParallel package to take advantage of the multicore processor installed.
The problem I am experiencing relates to memory and how the new R processes seem to consume large quantities of it. I think that what is happening is that the large data.table containing ALL data is copied into each new process, hence I run out of RAM and Windows starts swapping to disk.
I've created a simplified reproducible example which replicates my problem, but with less data and less analysis inside the loop. It would be ideal if a solution existed which could farm the data out to the worker processes only on demand, or share the memory already used between cores. Alternatively, some kind of solution may already exist to split the big data into 4 chunks and pass these to the cores so they each have a subset to work with.
A similar question has previously been posted here on Stackoverflow however I cannot make use of the bigmemory solution offered as my data contains a character field. I will look further into the iterators package, however I'd appreciate any suggestions from members with experience of this problem in practice.
rm(list=ls())
library(data.table)

num.series = 40   # can customise the size of the problem (x10 eats my RAM)
num.periods = 200 # can customise the size of the problem (x10 eats my RAM)

dt.all = data.table(
  grp = rep(1:num.series, each = num.periods),
  pd = rep(1:num.periods, num.series),
  y = rnorm(num.series * num.periods),
  x1 = rnorm(num.series * num.periods),
  x2 = rnorm(num.series * num.periods)
)
dt.all[, y_lag := c(NA, head(y, -1)), by = c("grp")]
f_lm = function(dt.sub, grp) {
  my.model = lm("y ~ y_lag + x1 + x2", data = dt.sub)
  coef = summary(my.model)$coefficients
  data.table(grp, variable = rownames(coef), coef)
}
library(doParallel)
registerDoParallel(4)

foreach(grp=unique(dt.all$grp), .packages="data.table", .combine="rbind") %dopar%
{
  dt.sub = dt.all[grp == grp]
  f_lm(dt.sub, grp)
}

detach(package:doParallel)
Iterators can help to reduce the amount of memory that needs to be passed to the workers of a parallel program. Since you're using the data.table package, it's a good idea to use iterators and combine functions that are optimized for data.table objects. For example, here is a function like isplit that works on data.table objects:
isplitDT <- function(x, colname, vals) {
  colname <- as.name(colname)
  ival <- iter(vals)
  nextEl <- function() {
    val <- nextElem(ival)
    list(value=eval(bquote(x[.(colname) == .(val)])), key=val)
  }
  obj <- list(nextElem=nextEl)
  class(obj) <- c('abstractiter', 'iter')
  obj
}
Note that it isn't completely compatible with isplit, since the arguments and return value are slightly different. There may also be a better way to subset the data.table, but I think this is more efficient than using isplit.
Here is your example using isplitDT and a combine function that uses rbindlist which combines data.tables faster than rbind:
dtcomb <- function(...) {
  rbindlist(list(...))
}

results <-
  foreach(dt.sub=isplitDT(dt.all, 'grp', unique(dt.all$grp)),
          .combine='dtcomb', .multicombine=TRUE,
          .packages='data.table') %dopar% {
    f_lm(dt.sub$value, dt.sub$key)
  }
Update
I wrote a new iterator function called isplitDT2 which performs much better than isplitDT but requires that the data.table have a key:
isplitDT2 <- function(x, vals) {
  ival <- iter(vals)
  nextEl <- function() {
    val <- nextElem(ival)
    list(value=x[val], key=val)
  }
  obj <- list(nextElem=nextEl)
  class(obj) <- c('abstractiter', 'iter')
  obj
}
This is called as:
setkey(dt.all, grp)
results <-
  foreach(dt.sub=isplitDT2(dt.all, levels(dt.all$grp)),
          .combine='dtcomb', .multicombine=TRUE,
          .packages='data.table') %dopar% {
    f_lm(dt.sub$value, dt.sub$key)
  }
This uses a binary search to subset dt.all rather than a vector scan, and so is more efficient. I don't know why isplitDT would use more memory, however. Since you're using doParallel, which doesn't call the iterator on-the-fly as it sends out tasks, you might want to experiment with splitting dt.all and then removing it to reduce your memory usage:
dt.split <- as.list(isplitDT2(dt.all, levels(dt.all$grp)))
rm(dt.all)
gc()

results <-
  foreach(dt.sub=dt.split,
          .combine='dtcomb', .multicombine=TRUE,
          .packages='data.table') %dopar% {
    f_lm(dt.sub$value, dt.sub$key)
  }
This may help by reducing the amount of memory needed by the master process during the execution of the foreach loop, while still only sending the required data to the workers. If you still have memory problems, you could also try using doMPI or doRedis, both of which get iterator values as needed, rather than all at once, making them more memory efficient.
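For illustration, here is a hedged sketch of switching the backend to doMPI (assuming the doMPI package and a working MPI installation are available; doRedis would be registered analogously). The foreach call itself stays the same:
library(doMPI)
cl <- startMPIcluster(count = 4)   # launch 4 MPI workers
registerDoMPI(cl)                  # tasks are now fetched from the iterator as workers ask for them

results <-
  foreach(dt.sub = isplitDT2(dt.all, levels(dt.all$grp)),
          .combine = 'dtcomb', .multicombine = TRUE,
          .packages = 'data.table') %dopar% {
    f_lm(dt.sub$value, dt.sub$key)
  }

closeCluster(cl)                   # shut down the MPI workers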
The answer requires the iterators package and use of isplit which is similar to split in that it breaks the main data object into chunks based on one or more factor columns. The foreach loop iterates through the chunks of data, passing only the subset out to the worker process rather than the whole table.
So the differences in the code are as follows:
library(iterators)

dt.all = data.table(
  grp = factor(rep(1:num.series, each = num.periods)), # grp column is a factor
  pd = rep(1:num.periods, num.series),
  y = rnorm(num.series * num.periods),
  x1 = rnorm(num.series * num.periods),
  x2 = rnorm(num.series * num.periods)
)

results =
  foreach(dt.sub = isplit(dt.all, dt.all$grp), .packages="data.table", .combine="rbind") %dopar%
  {
    f_lm(dt.sub$value, dt.sub$key[[1]])
  }
The result of isplit is that dt.sub is now a list with 2 elements: key is itself a list of the values used to split, and value contains the subset as a data.table.
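To make that structure concrete, here is a small illustration (using the dummy data above) of pulling a single element from the iterator by hand:
it <- isplit(dt.all, dt.all$grp)   # the same iterator used in the foreach call
dt.sub <- iterators::nextElem(it)  # first chunk
str(dt.sub$key)                    # a list holding the grp level for this chunk
head(dt.sub$value)                 # the data.table subset for that group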
Credit for this solution is given to a SO answer given by David and a response by Russell to my question on an excellent blog post about iterators.
------------------------------------ EDIT ------------------------------------
To test the performance of isplitDT vs isplit and rbindlist vs rbind, the following code was used:
rm(list=ls())
library(data.table); library(iterators); library(doParallel)

num.series = 400
num.periods = 2000

dt.all = data.table(
  grp = factor(rep(1:num.series, each = num.periods)),
  pd = rep(1:num.periods, num.series),
  y = rnorm(num.series * num.periods),
  x1 = rnorm(num.series * num.periods),
  x2 = rnorm(num.series * num.periods)
)
dt.all[, y_lag := c(NA, head(y, -1)), by = c("grp")]

f_lm = function(dt.sub, grp) {
  my.model = lm("y ~ y_lag + x1 + x2", data = dt.sub)
  coef = summary(my.model)$coefficients
  data.table(grp, variable = rownames(coef), coef)
}

registerDoParallel(8)

isplitDT <- function(x, colname, vals) {
  colname <- as.name(colname)
  ival <- iter(vals)
  nextEl <- function() {
    val <- nextElem(ival)
    list(value=eval(bquote(x[.(colname) == .(val)])), key=val)
  }
  obj <- list(nextElem=nextEl)
  class(obj) <- c('abstractiter', 'iter')
  obj
}

dtcomb <- function(...) {
  rbindlist(list(...))
}
# isplit/rbind
st1 = system.time(results <- foreach(dt.sub=isplit(dt.all, dt.all$grp),
                                     .combine="rbind",
                                     .packages="data.table") %dopar% {
  f_lm(dt.sub$value, dt.sub$key[[1]])
})

# isplit/rbindlist
st2 = system.time(results <- foreach(dt.sub=isplit(dt.all, dt.all$grp),
                                     .combine='dtcomb', .multicombine=TRUE,
                                     .packages="data.table") %dopar% {
  f_lm(dt.sub$value, dt.sub$key[[1]])
})
# isplitDT/rbind
st3 = system.time(results <- foreach(dt.sub=isplitDT(dt.all, 'grp', unique(dt.all$grp)),
                                     .combine="rbind",
                                     .packages='data.table') %dopar% {
  f_lm(dt.sub$value, dt.sub$key)
})
# isplitDT/rbindlist
st4 = system.time(results <- foreach(dt.sub=isplitDT(dt.all, 'grp', unique(dt.all$grp)),
                                     .combine='dtcomb', .multicombine=TRUE,
                                     .packages='data.table') %dopar% {
  f_lm(dt.sub$value, dt.sub$key)
})

rbind(st1, st2, st3, st4)
This gives the following timings:
    user.self sys.self elapsed user.child sys.child
st1     12.08     1.53   14.66         NA        NA
st2     12.05     1.41   14.08         NA        NA
st3     45.33     2.40   48.14         NA        NA
st4     45.00     3.30   48.70         NA        NA
------------------------------------ EDIT 2 ------------------------------------
Thanks to Steve's updated answer and the function isplitDT2, which makes use of the keys on the data.table, we have a clear new winner in terms of speed. Running microbenchmark to compare my original solution (in this answer) shows around 7-fold improvement from isplitDT2 with rbindlist. Memory usage has not yet been compared directly but the performance gain leads me to accept the answer at last.
Holding everything in memory is one of those (aargh, annoying) things that R programmers have to learn to deal with. It's pretty easy to imagine your code example as either memory-bound or CPU-bound, and you'll need to figure that out before trying to apply workarounds.
Assuming the memory is being consumed by your dataset (dt.all) and not during the actual model run, you might be able to release enough memory for the worker processes to parallelize:
foreach(grp=unique(dt.all$grp), .packages="data.table", .combine="rbind") %dopar%
{
  dt.sub = dt.all[grp == grp]
  rm(dt.all)
  gc()
  f_lm(dt.sub, grp)
}
However, this assumes that your working set (dt.sub) is small enough that you can fit more than one of them in memory at a time. It isn't hard to imagine a problem set too large for that. Also, and this is really annoying, all the workers are going to fire up at the same time and kill your machine anyway, so you might need to make them pause for a couple of seconds to allow other children to load up and release memory.
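A hedged sketch of such a staggered start-up, combined with the rm()/gc() trick above (the 0-2 second random delay is an arbitrary choice, and the loop variable is named g here to avoid clashing with the grp column inside the data.table):
foreach(g = unique(dt.all$grp), .packages = "data.table", .combine = "rbind") %dopar%
{
  Sys.sleep(runif(1, 0, 2))   # random pause so the workers don't all copy data at once
  dt.sub = dt.all[grp == g]
  rm(dt.all)
  gc()
  f_lm(dt.sub, g)
}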
Though desperately stupid and brute-force, I have handled this exact problem by writing the subsets out to disk as individual data files and then using a batch script to run the computations in parallel.
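For completeness, a hedged sketch of that brute-force variant (the file names and the worker script are hypothetical, not from the original post):
library(data.table)

# Write each group to its own file from the master session
for (g in unique(dt.all$grp)) {
  saveRDS(dt.all[grp == g], file = sprintf("chunk_%03d.rds", as.integer(g)))
}

# A worker script launched several times by a batch file
# (e.g. "Rscript worker.R chunk_001.rds") could then do:
#   args   <- commandArgs(trailingOnly = TRUE)
#   dt.sub <- readRDS(args[1])
#   saveRDS(f_lm(dt.sub, dt.sub$grp[1]), sub("chunk", "result", args[1]))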