nlsBoot and foreach %dopar%: scoping issues

I would like to do a bootstrap of residuals for nls fits in a loop. I use nlsBoot, and in order to decrease computation time I would like to run it in parallel (on a Windows 7 system at the moment). Here is some code that reproduces my problem:
#function for fitting
Falge2000 <- function(GP2000,alpha,PAR) {
(GP2000*alpha*PAR)/(GP2000+alpha*PAR-GP2000/2000*PAR)
}
#some data
PAR <- 10:1600
GPP <- Falge2000(-450,-0.73,PAR) + rnorm(length(PAR),sd=0.0001)
df1 <- data.frame(PAR,GPP)
#nls fit
mod <- nls(GPP~Falge2000(GP2000,alpha,PAR),start=list(GP2000=-450,alpha=-0.73),data=df1, upper=c(0,0),algorithm="port")
#bootstrap of residuals
library(nlstools)
summary(nlsBoot(mod,niter=5))
#works
#now do it several times
#and in parallel
library(foreach)
library(doParallel)
cl <- makeCluster(1)
registerDoParallel(cl)
ttt <- foreach(1:5, .packages='nlstools',.export="df1") %dopar% {
res <- nlsBoot(mod,niter=5)
summary(res)
}
#Error in { :
#task 1 failed - "Procedure aborted: the fit only converged in 1 % during bootstrapping"
stopCluster(cl)
I suspect this is an issue with environments, and after looking at the code of nlsBoot the problem seems to arise from the use of an anonymous function in an lapply call:
l1 <- lapply(1:niter, function(i) {
    data2[, var1] <- fitted1 + sample(scale(resid1, scale = FALSE),
        replace = TRUE)
    nls2 <- try(update(nls, start = as.list(coef(nls)), data = data2),
        silent = TRUE)
    if (inherits(nls2, "nls"))
        return(list(coef = coef(nls2), rse = summary(nls2)$sigma))
})
if (sum(sapply(l1, is.null)) > niter/2)
    stop(paste("Procedure aborted: the fit only converged in",
        round(sum(sapply(l1, is.null))/niter), "% during bootstrapping"))
Is there a way to use nlsBoot in a parallel loop? Or do I need to modify the function? (I could try to use a for loop instead of lapply.)

If you move the creation of the mod object into the %dopar% loop, everything seems to work. This also automatically exports the df1 object, so you can remove the .export argument.
ttt <- foreach(1:5, .packages='nlstools') %dopar% {
mod <- nls(GPP~Falge2000(GP2000,alpha,PAR),start=list(GP2000=-450,alpha=-0.73),data=df1, upper=c(0,0),algorithm="port")
res <- nlsBoot(mod,niter=5)
capture.output(summary(res))
}
However, you might need to work out what you want returned. Using capture.output was just to see whether things were working, since summary(res) seemed to return only NULL.
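If you prefer something easier to post-process than captured text, one option (a sketch along the same lines, not a tested recipe) is to return the nlsBoot object itself and inspect or summarise it afterwards on the master:
ttt <- foreach(1:5, .packages = 'nlstools') %dopar% {
  #refit the model on the worker, as above
  mod <- nls(GPP~Falge2000(GP2000,alpha,PAR),start=list(GP2000=-450,alpha=-0.73),data=df1, upper=c(0,0),algorithm="port")
  nlsBoot(mod, niter = 5)  #return the whole bootstrap object
}
#ttt is then a list of nlsBoot objects; summarise them afterwards, e.g. lapply(ttt, summary)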

Related

Data frame creation inside parLapply in R

I am trying something pretty simple: I want to run a bunch of regressions in parallel. When I use the following data generator (PART 1), the parallel part does not work and gives the error listed below.
#PART 1
library(MASS)  #needed for mvrnorm()
p <- 20; rho<-0.7;
cdc<- diag(p)
for( i in 1:(p-1) ){ for( j in (i+1):p ){
cdc[i,j] <- cdc[j,i] <- rho^abs(i-j)
}}
my.data <- mvrnorm(n=100, mu = rep(0, p), Sigma = cdc)
The parallel part below does work, however, if I generate the data as in PART 2:
# PART 2
my.data<-matrix(rnorm(1000,0,1),nrow=100,ncol=10)
I configured the function that I want to run in parallel as follows:
parallel_fun<-function(obj,my.data){
p1 <- nrow(cov(my.data));store.beta<-matrix(0,p1,length(obj))
count<-1
for (itration in obj) {
my_df<-data.frame(my.data)
colnames(my_df)[itration] <- "y"
my.model<-bas.lm(y ~ ., data= my_df, alpha=3,
prior="ZS-null", force.heredity = FALSE, pivot = TRUE)
cf<-coef(my.model, estimator="MPM")
betas<-cf$postmean[-1]
store.beta[ -itration, count]<- betas
count<-count+1
}
result<-list('Beta'=store.beta)
}
So I run parLapply in the following way:
{
library(parallel)    #detectCores, makeCluster, splitIndices, clusterExport, parLapply
library(doParallel)  #registerDoParallel
no_cores <- detectCores(logical = TRUE)
myclusternumber<-(no_cores-1)
cl <- makeCluster(myclusternumber)
registerDoParallel(cl)
p1 <- ncol(my.data)
obj<-splitIndices(p1, myclusternumber)
clusterExport(cl,list('parallel_fun','my.data','obj'),envir=environment())
clusterEvalQ(cl, {
library(MASS)
library(Matrix)
library(BAS)
})
newresult<-parallel::parLapply(cl,obj,fun = parallel_fun,my.data)
stopCluster(cl)
}
But whenever I use the data from PART 1 I get the following error:
Error in checkForRemoteErrors(val) :
7 nodes produced errors; first error: object 'my_df' not found
But this should not happen: the data frame should be created inside the function, and I have no idea why this is failing. Any help is appreciated.
Posting this as one possible workaround, see if it works:
parallel_fun<-function(obj,my.data){
p1 <- nrow(cov(my.data));store.beta<-matrix(0,p1,length(obj))
count<-1
for (itration in obj) {
my_df<-data.frame(my.data)
colnames(my_df)[itration] <- "y"
my_df <<- my_df
my.model<-bas.lm(y ~ ., data= my_df, alpha=3,
prior="ZS-null", force.heredity = FALSE, pivot = TRUE)
cf<-BAS:::coef.bas(my.model, estimator="MPM")
betas<-cf$postmean[-1]
store.beta[ -itration, count]<- betas
count<-count+1
}
result<-list('Beta'=store.beta)
}
The issue seems to be with the BAS:::coef.bas function, which calls eval in order to get my_df and fails to find it when called in parallel. The "hack" here is to force my_df out into the parent environment by calling my_df <<- my_df.
There should be a better way to do this, but <<- is probably the quickest one. In general, <<- may cause unwanted behaviour, especially when used in loops. Assigning a unique variable name before exporting (and not forgetting to remove it after use) is one way to tackle these problems; a minimal sketch of that idea follows.
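A minimal, generic sketch of that unique-name idea (unique_name below is just an illustrative name, not anything from BAS):
unique_name <- paste0("my_df_", Sys.getpid())    #cannot clash between workers
assign(unique_name, my_df, envir = .GlobalEnv)   #export under the unique name
#... run the code that needs to find the object in the global environment ...
rm(list = unique_name, envir = .GlobalEnv)       #and clean up after use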

How to use "DoParallel" for this LOOP which took me 12h+ to run

I am currently working on this code. Everything is fine, except that the code needs 12+ hours to finish.
I really want to change this code to use doParallel, but I have failed over and over again. I have to use foreach, but it does not seem to improve the efficiency, and I don't know how to design a "parallel" version for this array, which I need for further analysis.
Can anybody help me?
library(readr)     #read_csv
library(foreach)
library(forecast)  #naive, ets, forecast
library(smooth)    #auto.ssarima
library(MAPA)      #mapa
library(greybox)   #ro
Data <- read_csv("C:/Users/Desktop/200144.csv")
ourModels <- list("forecast(naive(data),h=h)","forecast(ets(data,model='ANN'),h=h,level=95)", "forecast(ets(data,model='AAN', damped=FALSE),h=h,level=95)"
,"forecast(ets(data,model='MAM', damped=FALSE),h=h,level=95)","forecast(ets(data,model='AAN', damped=TRUE),h=h,level=95)"
,"forecast(ets(data,ic='aicc'),h=h,level=95)","mapa(data,fh=h,conf.lvl=0.95)","forecast(auto.ssarima(data,ic='AICc'),h=h,level=95)")
ourHorizon <- 2 #1,2,6
ourOrigins <- 144-36-ourHorizon + 1 # x rolling origins
ourForecasts <- array (NA,c(ourHorizon,ourOrigins,length(ourModels),ncol(Data)))#max origin, max horizon, how many time series.
ourHoldoutValues <- array (NA,c(ourHorizon,ourOrigins,ncol(Data)))#max origin, max horizon, how many time series.
EMTable<-array (NA,c(ncol(Data),length(ourModels),3))#3 metrics
RMSETABLE<-array (NA,c(1,ourOrigins,length(ourModels),ncol(Data)))
#collect forecasting results and Standard deviation of past forecasting errors
foreach(a = 1:ncol(Data))%do% {# a is the No. of time series
foreach(b = 1:8)%do%{# b is the No. of the models
ourData <- as.data.frame(Data [,a])
#ourData <- na.omit(ourData)
ourData <- ts(ourData,frequency=12) #monthly data
#Change Models here:
ourCall <- ourModels[[b]]
if (b==7){#MAPA
ourValue <- c("outfor","MSE")
} else if (b==8){#SARIMA
ourValue <- c("mean","model$lossValue")
} else if (b==1){
ourValue <- c("mean","model$sigma2")}#?????
else
ourValue <- c("mean","model$mse")
ourROReturn <- ro(ourData, h=ourHorizon, origins=ourOrigins, call=ourCall, parallel = TRUE,
value=ourValue, ci=FALSE, co=TRUE)
if (b==7){#MAPA is special
ourForecasts[,,b,a] <- ourROReturn$outfor
ourHoldoutValues[,,a] <- ourROReturn$holdout
#measure forecasting performances here #MPE
EMTable[a,b,1]<-mean(apply(((ourROReturn$holdout - ourROReturn$outfor) / ourROReturn$holdout),1,mean,na.rm=TRUE))#MPE
RMSETABLE[,,b,a]<-(ourROReturn$MSE)^0.5
} else{#use ourreturn$model can get their INSAMPLE MSE
ourForecasts[,,b,a] <- ourROReturn$mean
ourHoldoutValues[,,a] <- ourROReturn$holdout
EMTable[a,b,1]<-mean(apply(((ourROReturn$holdout - ourROReturn$mean) / ourROReturn$holdout),1,mean,na.rm=TRUE))#MPE
RMSETABLE[,,b,a]<-(ourROReturn$model)^0.5}
}
print(a)
}
You are almost there. All you need to do is switch to parallel execution, as follows:
# load libraries
library(parallel)
library(doParallel)
# create parallel cluster
cl <- makeCluster(9,timeout=120)
doParallel::registerDoParallel(cl)
getDoParWorkers()
## Your parallel for loop
#collect forecasting results and Standard deviation of past forecasting errors
foreach(a = 1:ncol(Data), .packages = c("forecast", "smooth", "MAPA", "greybox")) %dopar% {  #.packages: whichever packages the loop body needs
....
....
....
....
}
If you want to improve the execution time further, turn the inner loop foreach(b = 1:8) %do% { ... } into a function and call it with an apply-family function; a minimal sketch follows.
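A rough sketch of that idea, reusing the calls from the question (fit_one_model is a made-up name, and the per-model choice of ourValue is simplified here):
fit_one_model <- function(b, ourData) {
  ourCall <- ourModels[[b]]
  ourValue <- if (b == 7) c("outfor", "MSE") else "mean"  #simplified; pick per model as in the question
  ro(ourData, h = ourHorizon, origins = ourOrigins, call = ourCall,
     value = ourValue, ci = FALSE, co = TRUE)  #drop parallel=TRUE: the outer loop is already parallel
}
results <- foreach(a = 1:ncol(Data), .packages = c("forecast", "smooth", "MAPA", "greybox")) %dopar% {
  ourData <- ts(as.data.frame(Data[, a]), frequency = 12)
  lapply(seq_along(ourModels), fit_one_model, ourData = ourData)
}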

R foreach could not find function "%dopar%"

When using the doParallel library, I encountered this weird error; the system throws:
Error in { : task 1 failed - could not find function "%dopar%"
To be specific, this is what I did:
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
# Read the data
coin95 <-read.csv('~/Documents/coin95.csv')
coin95 <- coin95[,!(names(coin95) %in% c("X"))]
coin95[c("Person")] <- sapply(coin95[c("Person")],as.character)
# create the name list
coin95_name <- as.character(coin95$Person)
coin95_name <- unique(coin95_name)
n <- as.numeric(length(coin95_name))
# the average counting process
ntw <- function(now){
foreach (Ii = coin95_name,.combine = "+",.export = c("coin95","n")) %dopar% {
time <-subset(coin95, subset = coin95$Person == Ii)$duration
stepfun(time,seq(0,length(time)))(now)/n
}
}
# the average cumulative hazard
lambda <- function(now,params){
b <- params[1]
sigma <- params[2]
mu <- params[3]
xi <- params[4]
beta1 <- params[5]
beta2 <- params[6]
k <- function(spread){
L0 <- (1+(spread -mu)*xi/sigma)^(-1/xi)
return(L0)
}
foreach(Ii = coin95_name,.combine = "+",.export = c("coin95","n")) %dopar% {
time <- subset(coin95, subset = coin95$Person == Ii)$duration
noncov <- subset(coin95, subset = coin95$Person == Ii)$noncovered
reim <- subset(coin95, subset = coin95$Person == Ii)$reimbursement
(b*now+sum( exp(-k(now-time[(time < now)])+beta1*noncov[(time < now)]+beta2*reim[(time <now)]) ))/n
}
}
So far everything is good: I have created two functions, ntw and lambda, using foreach, and they work perfectly.
Then I create a third function, also using foreach:
# the distance
Time <- coin95$duration
Time <- sort(as.double(Time))
jl <- function(params){
res<-foreach(Ii = Time,.combine = "rbind",.export = c("ntw","lambda")) %dopar% {
(ntw(Ii)-ntw(Ii-1e-7)) * (ntw(Ii)- lambda(Ii,params))^2
}
return(sqrt(sum(res)))
}
guess<-c(0.0,1.3333,0.0,0.1,-1.2,3e-3)
Type jl(guess):
> jl(guess)
Error in { : task 1 failed - could not find function "%dopar%"
Any idea what's going wrong?
A quick fix for problems with foreach %dopar% is to reinstall these packages:
install.packages("doSNOW")
install.packages("doParallel")
install.packages("doMPI")
These packages are responsible for parallelism in R. A bug that existed in old versions of these packages has since been fixed. I should mention that reinstalling will most likely help even if you are not using all of these packages in your code.

Parallel computing with foreach - Saving loop outputs to global list

I would like to run a large loop with the foreach function. This means using the %dopar% operator.
I can't find any existing questions that answer this problem exactly. If this is a duplicate, though, please point me in the right direction and I'll close this question.
I have been having mixed success. It works for simple examples on my machine, as per the help documentation, but I cannot seem to get good results for my own work.
My example is slightly more complicated, so the devil seems to be in the detail, as always! I have also read the 'white paper' provided by the package creators, Revolution Analytics (you can get it here).
I don't see how best to use the .combine argument to write results to my global output list.
I would like to assign the calculated values to one big list, as opposed to using cbind or c.
My example is pretty convoluted, but if I simplify it any further then any answers might not address my issue.
I will perform a kind of moving-window linear model: fit a model using lm() over 50 observations [1:50], predict the 51st observation [51], and save the results to a list.
Then I will shift everything one observation further: an lm over [2:51] and a prediction of the 52nd observation [52].
I will use a total of 100 observations, so I can make a maximum of 50 predictions.
## ============================================ ##
## Setup the backend for the foreach function ##
## ============================================ ##
## doMC calls upon cores on demand, uses them and closes them
library(doMC)
registerDoMC(detectCores()) #detectCores() uses all cores
## for Windows users
#library(doParallel) --> for Windows users
#registerDoParallel(detectCores())
## ======================== ##
## Create some dummy data ##
## ======================== ##
## three columns, one hundred observations
library(data.table)
my_data <- data.table(outcome = runif(100), V1 = 3*runif(100), V2 = sqrt(runif(100)))
## Have a look at the data if you like - using the DT package
library(DT)
datatable(my_data, options = list(pageLength = nrow(my_data)))
## ================================= ##
## Perform the loop the normal way ##
## ================================= ##
## Create container (a list of lists) for results
my_results <- sapply(c(paste0("step_", seq(1:50))), function(x) NULL)
step_results <- sapply(c("coefs", "rank", "error"), function(x) NULL)
for(i in 1:length(my_results)){my_results[[i]] <- step_results}
## Use a for loop to step through all 50 'slices'
for(i in 1:50) { #max. 50 predictions possible
## Fit a linear model
my_fit <- lm("outcome ~ V1 + V2", data = my_data[i:(i+49)])
## Predict the next step
my_pred <- predict(my_fit, newdata = my_data[i+50, .(V1, V2)])
error <- my_data$outcome[i+50] - my_pred #simply measure the delta to the actual value
## Assign some results to the container created earlier
my_results[[i]][[1]] <- my_fit$coefficients
my_results[[i]][[2]] <- my_fit$rank
my_results[[i]][[3]] <- error
}
str(my_results) ## Keep this container to compare to our next one
## ============================================ ##
## Perform the loop using foreach and %dopar% ##
## ============================================ ##
## Create same results object for results as previously for parallel results
par_results <- sapply(c(paste0("step_", seq(1:50))), function(x) NULL)
step_results <- sapply(c("coefs", "rank", "error"), function(x) NULL)
for(i in 1:length(par_results)){par_results[[i]] <- step_results}
my_results_par <- foreach(i = 1:50) %dopar%
{ #max. 50 predictions possible
my_fit <- lm("outcome ~ V1 + V2", data = my_data[i:(i+49)])
my_pred <- predict(my_fit, newdata = my_data[i+50, .(V1, V2)])
error <- my_data$outcome[i+50] - my_pred
## Assign some results to the container created earlier
par_results[[i]][[1]] <- my_fit$coefficients
par_results[[i]][[2]] <- my_fit$rank
par_results[[i]][[3]] <- error
Sys.sleep(i/20) #Allows time to see R processes spawn on your system
return(par_results)
}
## We can see straight away that this didn't work as I would like it to
identical(my_results, my_results_par) #FALSE
## This shows that the output seems good on the surface
class(my_results_par)
length(my_results_par)
## This shows that it doesn't (WARNING: very long)
str(my_results_par)
You can try out the various .combine arguments in the foreach function, for example:
foreach(i = 1:50, .combine = "c") %dopar% {computation}
or
foreach(i = 1:50, .combine = "cbind") %dopar% {computation}
These produce a vector and a matrix respectively, but they do not contain all the results that I was trying to save in each loop.
Questions
Does that structure give you a clue as to what is going on?
How might I use .combine argument to create my desired output?
Is what I am trying to do even possible??
Do I need to put the loop with foreach at a different point in the algorithm?
I have read that you can supply a custom function to foreach... might this be the way to do it? I still don't see how I would combine the results.
Yes, this can easily be done. We can modify your code for the foreach step to the following, where we load the data.table package on each worker.
my_results_par <- foreach(i = 1:50, .combine = append, .packages = c("data.table")) %dopar%
{
my_fit <- lm("outcome ~ V1 + V2", data = my_data[i:(i+49)])
my_pred <- predict(my_fit, newdata = my_data[i+50, .(V1, V2)])
error <- my_data$outcome[i+50] - my_pred
par_results <- list(
coefs = my_fit$coefficients,
rank = my_fit$rank,
error = error
)
par_results <- list(par_results)
names(par_results) <- paste0("step_", i)
return(par_results)
}
identical(my_results, my_results_par)
[1] TRUE

Foreach code works for %do% but not for %dopar%

This works normally on my computer:
library(doSNOW)  #also attaches foreach
registerDoSNOW(makeCluster(2, type = "SOCK"))
foreach(i = 1:M,.combine = "c") %dopar% {
sum(rnorm(M))
}
So I can say that I can run parallelized code on this computer, right?
Ok. I have a piece of code that I wish to run on parallel with foreach. It runs perfectly when it's written with %do%, but doesn't work properly when I change it to %dopar%. (PS: I have already initialized the cluster with registerDoSNOW(makeCluster(2, type = "SOCK")) in the same way as before.)
My main interest in the code is getting the vector u.varpred. I get it nicely with %do%, but when I run it with %dopar%, the vector comes back as NULL.
Here is the loop with the code that's needed to run it all properly. It uses functions in the geoR package.
#you can pretty much ignore all this, it's just preparation for the loop
library(geoR)  #grf, likfit, krige.conv, as.geodata
N=20
NN=10
set.seed(111);
datap <- grf(N, cov.pars=c(20, 5),nug=1)
grid.o <- expand.grid(seq(0, 1, l=100), seq(0, 1, l=100))
grid.c <- expand.grid(seq(0, 1, l=NN), seq(0,1, l=NN))
beta1=mean(datap$data)
emv<- likfit(datap, ini=c(10,0.4), nug=1)
krieging <- krige.conv(datap, loc=grid.o,
krige=krige.control(type.krige="SK", trend.d="cte",
beta =beta1, cov.pars=emv$cov.pars))
names(grid.c) = names(as.data.frame(datap$coords))
list.geodatas<-list()
valores<-c(datap$data,0)
list.dataframes<-list()
list.krigings<-list(); i=0; u.varpred=NULL;
#here is the foreach code
t<-proc.time()
foreach(i=1:length(grid.c[,1]), .packages='geoR') %do% {
list.dataframes[[i]] <- rbind(datap$coords,grid.c[i,]);
list.geodatas[[i]] <- as.geodata(data.frame(cbind(list.dataframes[[i]],valores)))
list.krigings[[i]] <- krige.conv(list.geodatas[[i]], loc=grid.o,
krige=krige.control(type.krige="SK", trend.d="cte",
beta =beta1, cov.pars=emv$cov.pars));
u.varpred[i] <- mean(krieging$krige.var - list.krigings[[i]]$krige.var)
list.dataframes[[i]]<-0 # I don't need these objects any more, but since they
# are lists I don't want to assign <- NULL, as that would ruin their ordering
list.krigings[[i]]<- 0
list.geodatas[[i]] <-0
}
t<-proc.time()-t
t
You can check that this runs nicely (provided you have the following packages: geoR, foreach and doSNOW). But once I use registerDoSNOW(......) and %dopar%, u.varpred comes back as NULL.
Could you please check whether I made a mistake in the foreach statement/process, or whether this is simply code that can't be parallelized? (I thought it could be, because any given iteration does not depend on any of the iterations before it.)
I am sorry both the code and this question are so long. Thanks in advance for taking the time to read it.
My friend helped me out directly. Here is a way that works:
u.varpred <- foreach(i = 1:length(grid.c[,1]), .packages = 'geoR', .combine = "c") %dopar% {
list.dataframes[[i]] <- rbind(datap$coords,grid.c[i,]);
list.geodatas[[i]] <- as.geodata(data.frame(cbind(list.dataframes[[i]],valores)));
list.krigings[[i]] <- krige.conv(list.geodatas[[i]], loc = grid.o,
krige = krige.control(type.krige = "SK", trend.d = "cte",
beta = beta1, cov.pars = emv$cov.pars));
u.varpred <- mean(krieging$krige.var - list.krigings[[i]]$krige.var);
list.dataframes[[i]] <- 0;
list.krigings[[i]] <- 0;
list.geodatas[[i]] <- 0;
u.varpred #this makes the results go into u.varpred
}
He gave me an example of why this works:
a <- NULL
foreach(i = 1:10) %dopar% {
a <- 5
}
print(a)
# a is still NULL
a <- NULL
a <- foreach(i = 1:10) %dopar% {
a <- 5
a
}
print(a)
#now it works
Hope this helps someone.
