I am working on data in R with 18M records. My computer does not have a wealth of RAM available, so I am trying the "ff" package to compensate. To make the amount of time reasonable, I am also using the "foreach" package and running the job in parallel. I am having issues when I run "foreach" in parallel with the full data; smaller groups of the data (say first 100K rows) run correctly.
What I am trying to obtain is rolling daily averages for peoples' values based on dates. I want the average daily value for past 7, 28, 91, etc. days. I am relatively new to R, so I do not understand its nuances. When I run this on the full data it stops after an hour and gives the error:
Task 1 failed - object 'PersonID' not found
What can I do to appropriately use the "ff" package with the "foreach" package. Also, it would be great if there were some way to output the data in a ff data frame and then into SQL. The code is below:
library("ff")
library("ffbase")
library("RODBC")
myconn <- odbcConnect("NO SHOW")
data <- as.ffdf(sqlFetch(myconn, "NO SHOW"))
#data[data=="NULL"] <- NA
#persons <- unique(data$PersonID, incomparables=FALSE)
persons <- aggregate(Value ~ PersonID, data=data, FUN=length)$PersonID
rollingLength <- 7
rollingTimes <- c(7,28,91,182,364,728,100000000)
valueCol <- 6
sinceCol <- 4
func <- function(stuff,id) {
check <- subset(stuff, PersonID == id)
tempvalue <- data.frame(matrix(,nrow=nrow(check),ncol=7,byrow=TRUE))
colnames(tempvalue) <- c("value7","value28","value91","value182","value364","value728","valueLTD")
tempvalue[1,] <- c(NA,NA,NA,NA,NA,NA,NA)
rollingTrips <- c(1,1,1,1,1,1,1)
rollingSinceLast <- c(0,0,0,0,0,0,0)
startIndex <- c(1,1,1,1,1,1,1)
rollingvalues <- c(0,0,0,0,0,0,0)
rollingvalues[1:rollingLength] <- check[1,valueCol]
if (nrow(check) > 1) {
for (r in 2:nrow(check)) {
tempvalue[r,] <- rollingvalues / rollingTrips
rollingvalues <- rollingvalues + check[r,valueCol]
rollingTrips <- rollingTrips + 1
rollingSinceLast <- rollingSinceLast + ifelse(is.na(check[r,sinceCol]), 0, check[r,sinceCol])
for (c in 1:(rollingLength-1)) {
while (rollingSinceLast[c] >= rollingTimes[c]) {
rollingvalues[c] <- rollingvalues[c] - check[startIndex[c],valueCol]
rollingTrips[c] <- rollingTrips[c] - 1
rollingSinceLast[c] <- rollingSinceLast[c] - check[startIndex[c]+1,sinceCol]
startIndex[c] <- startIndex[c] + 1
}
}
}
}
return (cbind(check, tempvalue))
}
library(foreach)
library(doParallel)
cl<-makeCluster(12)
registerDoParallel(cl)
strt<-Sys.time()
outdata <- foreach(id=persons, .combine="rbind", .packages="ff") %dopar% func(data,id)
print(Sys.time()-strt)
stopCluster(cl)
sqlSave(myconn, outdata)
odbcClose(myconn)
foreach package's %dopar% command need boundaries of a key value.
You can simply split your personID. Also, you sholud set the partition value less than makeCluster(). If you don't do that, you got file.access(filename, 0) == 0 is not TRUE massage. Because, you can not access to pre-saved ff package file on the same cluster.
split personID example:
split_min<-min(persons$personID)
split_max<-max(persons$personID)
partition<-12 # "partition < cluster" is good.
quart_half<-floor((split_max-split_min)/partition)
split_num<-matrix(0,partition,2)
split_num[1,1]<-split_min
split_num[1,2]<-quart_half+split_min
if(partition>=3){
for(i in 2:(partition-1)){
split_num[i,1]<-split_num[i-1,2]+1
split_num[i,2]<-split_num[i-1,2]+quart_half
}}
split_num[partition,1]<-split_num[partition-1,2]+1
split_num[partition,2]<-split_max
And, change foreach statement.
outdata <- foreach(i=1:partition, .combine="rbind", .packages="ff") %dopar% {
IDs<-subset(persons,personID>=split_num[i,1] & personID<=split_num[i,1])$personID
for(z in IDs){
func(data,z)}
}
or,
outdata <- foreach(i=1:partition, .combine="rbind") %dopar% {
require(ff) #or require(ffbase)
IDs<-subset(persons,personID>=split_num[i,1] & personID<=split_num[i,1])$personID
for(z in IDs){
func(data,z)}
}
Good luck to you.
Related
I'm currently trying to write an R script to import a variety of files I've created related to a dataset. This involves reading a lot of .txt files using several nested for loops based on how I've organized the directories and names of the files.
I can run the inner most loop fine (albiet a little slow). However, trying to run the second loop or any further loops creates the following error:
Error: vector memory exhausted (limit reached?)
I believe this may be related to how R handles memory? I'm running R out of Rstuidio. I've also tried the solution posted here with no luck
'R
R version 3.5.1 (2018-07-02) -- "Feather Spray"
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Code Below
subjects <- 72
loop1_names <- as.character(list('a','b','c'))
loop2_names <- as.character(list('one','two','three'))
loop3_names <- as.character(list('N1','N2'))
loop4_names<- as.character(list('choice1','choice2','choice3'))
i<-1;j<-1;
loop3.subset<- data.frame
for(k in 1:length(loop3_names)){
loop4.subset<- data.frame()#Data frame for handling each set of loop 4 values
for(l in 1:length(loop4_names)){
#Code for extracting the variables for each measure
measures.path <- file.path(results_fldr, 'amp_measures',loop1_names[i],loop2_names[j],'mont',loop3_names[k])
measures.data <- read.table(file.path(measures.path, paste(paste(loop1_names[i],loop2_names[j],loop3_names[k],loop4_names[l],sep = '_'),'.txt',sep = '')), header = T, nrows = subjects)
#Get rid of the IDs, we'll add those back in later
col_idx_ID <- grep('ID', names(measures.data))
measures.data <- as.data.frame(measures.data[,-col_idx_ID])# make sure when trimming to keep the measures as a data frame
names(measures.data) <- c(paste(loop1_names[i],loop2_names[j],loop3_names[k],loop4_names[l],sep = '_'))#Add a label to the data
#Now combine this data with the other data in the loop4 subset data frame
if(l == 1){
loop4.subset <- measures.data
} else {
loop4.subset <- merge(erp.subset,measures.data)
}
}#End l/loop 4
if(k == 1){
loop3.subset <- loop4.subset
} else {
freq.subset <- merge(loop3.subset,loop4.subset)
}
}#End k/loop 3
Generally I would suggest you read in only part of the data to memory, then write the partially merge to disk. In the example below which of course I can't run because I don't have your files. I write to disk after each i, j loop and then after that is done have 9 files. Now you merge those 6 files in another loop. If you still have memory problems break this up into another 2 files by first doing the "j" merge and writing each to 3 "i" files. Then if you can't merge those files you have a fundamental problem with lack of memory on your machine.
subjects <- 72
loop1_names <- as.character(list('a','b','c'))
loop2_names <- as.character(list('one','two','three'))
loop3_names <- as.character(list('N1','N2'))
loop4_names<- as.character(list('choice1','choice2','choice3'))
for(i in 1:length(loop1_names)) {
for(j in 1:length(loop2_names)) {
loop3.subset<- data.frame
for(k in 1:length(loop3_names)){
loop4.subset<- data.frame()
for(l in 1:length(loop4_names)){
##Code for extracting the variables for each measure
measures.path <- file.path(results_fldr,
'amp_measures',
loop1_names[i],
loop2_names[j],
'mont',
loop3_names[k])
measures.data <- read.table(file.path(measures.path, paste(paste(loop1_names[i],
loop2_names[j],
loop3_names[k],
loop4_names[l],
sep = '_'),'.txt',sep = '')),
header = T, nrows = subjects)
##Get rid of the IDs, we'll add those back in later
col_idx_ID <- grep('ID', names(measures.data))
measures.data <- as.data.frame(measures.data[,-col_idx_ID])
names(measures.data) <- c(paste(loop1_names[i],
loop2_names[j],
loop3_names[k],
loop4_names[l],
sep = '_'))
## Now combine this data with the other data in the loop4 subset data frame
if(l == 1){
loop4.subset <- measures.data
} else {
loop4.subset <- merge(erp.subset,measures.data)
}
}#End l/loop 4
if(k == 1){
loop3.subset <- loop4.subset
} else {
freq.subset <- merge(loop3.subset,loop4.subset)
}
}#End k/loop 3
write.table(freq.subset, paste0(i, "_", j, ".txt"))
}
}
## Now you have 6 files to read in a merge.
## Something like this:
df <- NULL
for(i in 1:length(loop1_names)) {
for(j in 1:length(loop2_names)) {
df1 <- read.table(paste0(i, "_", j, ".txt"))
df <- merge(df, df1)
}
}
When I using the doParallel library, I encountered this weird error, the system throws this
" Error in { : task 1 failed -could not find function "%dopar%"
To be specific, this is what I did
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
# Read the data
coin95 <-read.csv('~/Documents/coin95.csv')
coin95 <- coin95[,!(names(coin95) %in% c("X"))]
coin95[c("Person")] <- sapply(coin95[c("Person")],as.character)
# create the name list
coin95_name <- as.character(coin95$Person)
coin95_name <- unique(coin95_name)
n <- as.numeric(length(coin95_name))
# the average counting process
ntw <- function(now){
foreach (Ii = coin95_name,.combine = "+",.export = c("coin95","n")) %dopar% {
time <-subset(coin95, subset = coin95$Person == Ii)$duration
stepfun(time,seq(0,length(time)))(now)/n
}
}
# the average cumulative hazard
lambda <- function(now,params){
b <- params[1]
sigma <- params[2]
mu <- params[3]
xi <- params[4]
beta1 <- params[5]
beta2 <- params[6]
k <- function(spread){
L0 <- (1+(spread -mu)*xi/sigma)^(-1/xi)
return(L0)
}
foreach(Ii = coin95_name,.combine = "+",.export = c("coin95","n")) %dopar% {
time <- subset(coin95, subset = coin95$Person == Ii)$duration
noncov <- subset(coin95, subset = coin95$Person == Ii)$noncovered
reim <- subset(coin95, subset = coin95$Person == Ii)$reimbursement
(b*now+sum( exp(-k(now-time[(time < now)])+beta1*noncov[(time < now)]+beta2*reim[(time <now)]) ))/n
}
}
So far, everything is GOOD, I have created two functions ntw and lambda using the foreach. They worked perfectly.
Then I create the third function also using the foreach:
# the distance
Time <- coin95$duration
Time <- sort(as.double(Time))
jl <- function(params){
res<-foreach(Ii = Time,.combine = "rbind",.export = c("ntw","lambda")) %dopar% {
(ntw(Ii)-ntw(Ii-1e-7)) * (ntw(Ii)- lambda(Ii,params))^2
}
return(sqrt(sum(res)))
}
guess<-c(0.0,1.3333,0.0,0.1,-1.2,3e-3)
Type jl(guess):
> jl(guess)
Show Traceback
Rerun with Debug
Error in { : task 1 failed -could not find function "%dopar%"
Any Idea what's going wrong ?
Quick fix for problem with foreach %dopar% is to reinstall these packages:
install.packages("doSNOW")
install.packages("doParallel")
install.packages("doMPI")
Above packages are responsible for parallelism in R. Bug which existed in old versions of these packages is now removed. I should mention that it will most likely help even though you are not using these packages in your code.
I have written a program to generate a very large amount of random multivariate distributed data (25 x 30 x 10 000 000) using mvtnorm, then do some simple calculations and manipulations on the matrices.
I am using the foreach and doParallel packages to run operations in parallel to reduce time. A completely arbitrary example, just to demonstrate the packages is:
foreach (x = matr) %dopar% {
x[time_horizon + 1] <- x[time_horizon]
x <- cbind(100,x)
for (m in 2:(time_horizon + 1)) {
# loop through each row of matrix to apply function
x[,m] <- x[,m-1] + x[,m]
}
return(x)
}
I have created an implicit cluster of cores to run these foreach functions on:
registerDoParallel(4)
The problem
When I run with multiple cores, it appears to multiply or duplicate the memory used when I monitor performance on Task Manager (i.e. 2 cores uses more memory than 1 core, 4 cores uses more memory than 2).
When I run my program for (25 x 30 x 1 000 000), running in parallel helps the speed of execution (i.e. 4 cores is faster than 1 core). However, when I run my program for (25 x 30 x 2 500 000) and above, too much memory is used and that appears to slow it down.
One friend said it could potentially be a page fault and the hard drive must be accessed when I run out of RAM.
Why is the duplication of memory across cores happening? Is it supposed to happen? Can I stop it? Are there other solutions?
Edit (Full Code):
library(mvtnorm)
library(foreach)
library(doParallel)
library(ggplot2)
library(reshape2)
library(plyr)
# Calculate the number of cores
no_cores <- detectCores()
# Create an implicit cluster and regular cluster
registerDoParallel(no_cores)
daily_pnl <- function() {
time_horizon <- 30
paths <- 2500000
asset <- 25
path_split <- 100
corr_mat <- diag(asset)
expected_returns <- runif(asset,0.0, 0.25)
# Create a list of vectors to store pnl information for each asset
foreach(icount(time_horizon), .packages = "mvtnorm") %dopar% {
average_matrix <- matrix(, (paths/path_split), asset)
split_start <- 1
my_day <- rmvnorm(paths, expected_returns, corr_mat, method="chol")
for (n in 1:(paths/path_split)) {
average_matrix[n,] <- colMeans(my_day[split_start:(split_start + path_split - 1),])
split_start <- split_start + path_split
}
return(average_matrix)
}
}
matrix_splitter <- function(matr) {
time_horizon <- 30
paths <- 2500000
path_split <- 100
asset <- 25
alply(array(unlist(daily), c(paths/path_split,time_horizon,asset)),3)
}
cum_returns <- function(matr) {
time_horizon <- 30
paths <- 2500000
asset <- 25
foreach (x = matr) %dopar% {
x[time_horizon + 1] <- x[time_horizon]
x <- cbind(100,x)
for (m in 2:(time_horizon + 1)) {
# loop through each row of matrix to apply function
x[,m] <- x[,m-1] + x[,m]
}
return(x)
}
}
plotting <- function(path_matr) {
security_paths <- as.data.frame(t(path_matr))
security_paths$id <- 1:nrow(security_paths)
plot_paths <- melt(security_paths, id.var="id")
ggplot(plot_paths, aes(x=id, y=value,group=variable,colour=variable)) +
geom_line(aes(lty=variable))
}
system.time(daily <- daily_pnl())
system.time(daily_by_security <- matrix_splitter(daily))
rm(daily)
gc()
system.time(security_paths <- cum_returns(daily_by_security))
rm(daily_by_security)
gc()
plot_list <- foreach(x = security_paths, .packages = c("reshape2", "ggplot2")) %dopar% {
if (nrow(x) > 100) {
plotting(head(x,100))
} else {
plotting(x)
}
}
#Stop implicit cluster and regular cluster
stopImplicitCluster()
gc()
This seems to be a really old problem. I am having a similar issue. I don't need compute parallelization I actually need memory parallelization. (if such a thing can exist)
what works for me is azure do parallel. instead of registering system cores register cores from the cloud using registerDoAzureParallel(cluster)
your json will define the size of the machines (memory) you hire for the job. make sure each worker has enough memory to get a copy of your r environment. This will probably kill your network. You will be sending data to 30 -40 (depending on how many you have asked for) workers from your machine.
more documentation here.
https://github.com/Azure/doAzureParallel
Can we do something with sparklyr to address such issues?
I'm trying to move from a serial to parallel approach to accomplish some multivariate time series analysis tasks on a large data.table. The table contains data for many different groups and I'm trying to move from a for loop to a foreach loop using the doParallel package to take advantage of the multicore processor installed.
The problem I am experiencing relates to memory and how the new R processes seem to consume large quantities of it. I think that what is happening is that the large data.table containing ALL data is copied into each new process, hence I run out of RAM and Windows starts swapping to disk.
I've created a simplified reproducible example which replicates my problem, but with less data and less analysis inside the loop. It would be ideal if a solution existed which could only farm out the data to the worker processes on demand, or sharing the memory already used between cores. Alternatively some kind of solution may already exist to split the big data into 4 chunks and pass these to the cores so they have a subset to work with.
A similar question has previously been posted here on Stackoverflow however I cannot make use of the bigmemory solution offered as my data contains a character field. I will look further into the iterators package, however I'd appreciate any suggestions from members with experience of this problem in practice.
rm(list=ls())
library(data.table)
num.series = 40 # can customise the size of the problem (x10 eats my RAM)
num.periods = 200 # can customise the size of the problem (x10 eats my RAM)
dt.all = data.table(
grp = rep(1:num.series,each=num.periods),
pd = rep(1:num.periods, num.series),
y = rnorm(num.series * num.periods),
x1 = rnorm(num.series * num.periods),
x2 = rnorm(num.series * num.periods)
)
dt.all[,y_lag := c(NA, head(y, -1)), by = c("grp")]
f_lm = function(dt.sub, grp) {
my.model = lm("y ~ y_lag + x1 + x2 ", data = dt.sub)
coef = summary(my.model)$coefficients
data.table(grp, variable = rownames(coef), coef)
}
library(doParallel)
registerDoParallel(4)
foreach(grp=unique(dt.all$grp), .packages="data.table", .combine="rbind") %dopar%
{
dt.sub = dt.all[grp == grp]
f_lm(dt.sub, grp)
}
detach(package:doParallel)
Iterators can help to reduce the amount of memory that needs to be passed to the workers of a parallel program. Since you're using the data.table package, it's a good idea to use iterators and combine functions that are optimized for data.table objects. For example, here is a function like isplit that works on data.table objects:
isplitDT <- function(x, colname, vals) {
colname <- as.name(colname)
ival <- iter(vals)
nextEl <- function() {
val <- nextElem(ival)
list(value=eval(bquote(x[.(colname) == .(val)])), key=val)
}
obj <- list(nextElem=nextEl)
class(obj) <- c('abstractiter', 'iter')
obj
}
Note that it isn't completely compatible with isplit, since the arguments and return value are slightly different. There may also be a better way to subset the data.table, but I think this is more efficient than using isplit.
Here is your example using isplitDT and a combine function that uses rbindlist which combines data.tables faster than rbind:
dtcomb <- function(...) {
rbindlist(list(...))
}
results <-
foreach(dt.sub=isplitDT(dt.all, 'grp', unique(dt.all$grp)),
.combine='dtcomb', .multicombine=TRUE,
.packages='data.table') %dopar% {
f_lm(dt.sub$value, dt.sub$key)
}
Update
I wrote a new iterator function called isplitDT2 which performs much better than isplitDT but requires that the data.table have a key:
isplitDT2 <- function(x, vals) {
ival <- iter(vals)
nextEl <- function() {
val <- nextElem(ival)
list(value=x[val], key=val)
}
obj <- list(nextElem=nextEl)
class(obj) <- c('abstractiter', 'iter')
obj
}
This is called as:
setkey(dt.all, grp)
results <-
foreach(dt.sub=isplitDT2(dt.all, levels(dt.all$grp)),
.combine='dtcomb', .multicombine=TRUE,
.packages='data.table') %dopar% {
f_lm(dt.sub$value, dt.sub$key)
}
This uses a binary search to subset dt.all rather than a vector scan, and so is more efficient. I don't know why isplitDT would use more memory, however. Since you're using doParallel, which doesn't call the iterator on-the-fly as it sends out tasks, you might want to experiment with splitting dt.all and then removing it to reduce your memory usage:
dt.split <- as.list(isplitDT2(dt.all, levels(dt.all$grp)))
rm(dt.all)
gc()
results <-
foreach(dt.sub=dt.split,
.combine='dtcomb', .multicombine=TRUE,
.packages='data.table') %dopar% {
f_lm(dt.sub$value, dt.sub$key)
}
This may help by reducing the amount of memory needed by the master process during the execution of the foreach loop, while still only sending the required data to the workers. If you still have memory problems, you could also try using doMPI or doRedis, both of which get iterator values as needed, rather than all at once, making them more memory efficient.
The answer requires the iterators package and use of isplit which is similar to split in that it breaks the main data object into chunks based on one or more factor columns. The foreach loop iterates through the chunks of data, passing only the subset out to the worker process rather than the whole table.
So the differences in the code are as follows:
library(iterators)
dt.all = data.table(
grp = factor(rep(1:num.series, each =num.periods)), # grp column is a factor
pd = rep(1:num.periods, num.series),
y = rnorm(num.series * num.periods),
x1 = rnorm(num.series * num.periods),
x2 = rnorm(num.series * num.periods)
)
results =
foreach(dt.sub = isplit(dt.all, dt.all$grp), .packages="data.table", .combine="rbind")
%dopar%
{
f_lm(dt.sub$value, dt.sub$key[[1]])
}
The result of the isplit is that dt.sub is now a list with 2 elements: the key is in itself a list of the values used to split and the value contains the subset as a data.table.
Credit for this solution is given to a SO answer given by David and a response by Russell to my question on an excellent blog post about iterators.
------------------------------------ EDIT ------------------------------------
To test the performance of isplitDT v isplit and rbindlist v rbind the following code was used:
rm(list=ls())
library(data.table) ; library(iterators) ; library(doParallel)
num.series = 400
num.periods = 2000
dt.all = data.table(
grp = factor(rep(1:num.series,each=num.periods)),
pd = rep(1:num.periods, num.series),
y = rnorm(num.series * num.periods),
x1 = rnorm(num.series * num.periods),
x2 = rnorm(num.series * num.periods)
)
dt.all[,y_lag := c(NA, head(y, -1)), by = c("grp")]
f_lm = function(dt.sub, grp) {
my.model = lm("y ~ y_lag + x1 + x2 ", data = dt.sub)
coef = summary(my.model)$coefficients
data.table(grp, variable = rownames(coef), coef)
}
registerDoParallel(8)
isplitDT <- function(x, colname, vals) {
colname <- as.name(colname)
ival <- iter(vals)
nextEl <- function() {
val <- nextElem(ival)
list(value=eval(bquote(x[.(colname) == .(val)])), key=val)
}
obj <- list(nextElem=nextEl)
class(obj) <- c('abstractiter', 'iter')
obj
}
dtcomb <- function(...) {
rbindlist(list(...))
}
# isplit/rbind
st1 = system.time(results <- foreach(dt.sub=isplit(dt.all,dt.all$grp),
.combine="rbind",
.packages="data.table") %dopar% {
f_lm(dt.sub$value, dt.sub$key[[1]])
})
# isplit/rbindlist
st2 = system.time(results <- foreach(dt.sub=isplit(dt.all,dt.all$grp),
.combine='dtcomb', .multicombine=TRUE,
.packages="data.table") %dopar% {
f_lm(dt.sub$value, dt.sub$key[[1]])
})
# isplitDT/rbind
st3 = system.time(results <- foreach(dt.sub=isplitDT(dt.all, 'grp', unique(dt.all$grp)),
.combine='dtcomb', .multicombine=TRUE,
.packages='data.table') %dopar% {
f_lm(dt.sub$value, dt.sub$key)
})
# isplitDT/rbindlist
st4 = system.time(results <- foreach(dt.sub=isplitDT(dt.all, 'grp', unique(dt.all$grp)),
.combine='dtcomb', .multicombine=TRUE,
.packages='data.table') %dopar% {
f_lm(dt.sub$value, dt.sub$key)
})
rbind(st1, st2, st3, st4)
This gives the following timings:
user.self sys.self elapsed user.child sys.child
st1 12.08 1.53 14.66 NA NA
st2 12.05 1.41 14.08 NA NA
st3 45.33 2.40 48.14 NA NA
st4 45.00 3.30 48.70 NA NA
------------------------------------ EDIT 2 ------------------------------------
Thanks to Steve's updated answer and the function isplitDT2, which makes use of the keys on the data.table, we have a clear new winner in terms of speed. Running microbenchmark to compare my original solution (in this answer) shows around 7-fold improvement from isplitDT2 with rbindlist. Memory usage has not yet been compared directly but the performance gain leads me to accept the answer at last.
Holding everything in memory is one of those (aargh, annoying) things that R programmers have to learn to deal with. It's pretty easy to imagine your code example as either memory-bound or CPU-bound, and you'll need to figure that out before trying to apply workarounds.
Assuming the memory is being consumed by your dataset (dt_all) and not during the actual model run, it is possible you might be able to release enough memory for the worker processes to parallelize:
foreach(grp=unique(dt.all$grp), .packages="data.table", .combine="rbind") %dopar%
{
dt.sub = dt.all[grp == grp]
rm(dt.all)
gc()
f_lm(dt.sub, grp)
}
However, this assumes that your working set (dt.sub) is small enough that you can fit more than one of them in memory at a time. It isn't hard to imagine a problem set too large for that. Also, and this is really annoying, all the workers are going to fire up at one time and kill your machine anyway, so you might need to make them pause for a couple seconds to allow other children to load up and release memory.
Though desperately stupid and brute-force, I have handled this exact problem by writing the subsets out to disk as individual data files, and then used a batch script to run my computations in parallel.
I would like to do bootstrap of residuals for nls fits in a loop. I use nlsBoot and in order to decrease computation time I would like to do that in parallel (on a Windows 7 system at the moment). Here is some code, which reproduces my problem:
#function for fitting
Falge2000 <- function(GP2000,alpha,PAR) {
(GP2000*alpha*PAR)/(GP2000+alpha*PAR-GP2000/2000*PAR)
}
#some data
PAR <- 10:1600
GPP <- Falge2000(-450,-0.73,PAR) + rnorm(length(PAR),sd=0.0001)
df1 <- data.frame(PAR,GPP)
#nls fit
mod <- nls(GPP~Falge2000(GP2000,alpha,PAR),start=list(GP2000=-450,alpha=-0.73),data=df1, upper=c(0,0),algorithm="port")
#bootstrap of residuals
library(nlstools)
summary(nlsBoot(mod,niter=5))
#works
#now do it several times
#and in parallel
library(foreach)
library(doParallel)
cl <- makeCluster(1)
registerDoParallel(cl)
ttt <- foreach(1:5, .packages='nlstools',.export="df1") %dopar% {
res <- nlsBoot(mod,niter=5)
summary(res)
}
#Error in { :
#task 1 failed - "Procedure aborted: the fit only converged in 1 % during bootstrapping"
stopCluster(cl)
I suspect this an issue with environments and after looking at the code of nlsBoot the problem seems to arise from the use of an anonymous function in a lapply call:
l1 <- lapply(1:niter, function(i) {
data2[, var1] <- fitted1 + sample(scale(resid1, scale = FALSE),
replace = TRUE)
nls2 <- try(update(nls, start = as.list(coef(nls)), data = data2),
silent = TRUE)
if (inherits(nls2, "nls"))
return(list(coef = coef(nls2), rse = summary(nls2)$sigma))
})
if (sum(sapply(l1, is.null)) > niter/2)
stop(paste("Procedure aborted: the fit only converged in",
round(sum(sapply(l1, is.null))/niter), "% during bootstrapping"))
Is there a way to use nlsBoot in a parallel loop? Or do I need to modify the function? (I could try to use a for loop instead of lapply.)
By moving the creation of the mod object into the %dopar% loop, it looks like everything works OK. Also, this automatically exports the df1 object, so you can remove the .export argument.
ttt <- foreach(1:5, .packages='nlstools') %dopar% {
mod <- nls(GPP~Falge2000(GP2000,alpha,PAR),start=list(GP2000=-450,alpha=-0.73),data=df1, upper=c(0,0),algorithm="port")
res <- nlsBoot(mod,niter=5)
capture.output(summary(res))
}
However, you might need to work out what you want returned. Using capture.output was just to see if things were working, since summary(res) seemed to only return NULL.