Avoid "optimization failure" in for loop in R - r

I'm trying to make a lot of time series forecasts using the HoltWinters function in R.
For this purpose, I use a for loop in which I call the function and save each prediction in a data.frame.
The problem is that for some series the HoltWinters function gives errors, specifically optimization errors:
Error in HoltWinters(TS[[i]]) : optimization failure
This error breaks the loop.
So what I need is something like "try": if the HoltWinters fit succeeds, save the prediction; otherwise, save the error.
The code below replicates the problem:
data <- list()
data[[1]] <- rnorm(36)
data[[2]] <- c(
  24, 24, 28, 24, 28, 22, 18, 20, 19, 22, 28, 28,
  28, 26, 24, 20, 24, 20, 18, 17, 21, 21, 21, 28,
  26, 32, 26, 22, 20, 20, 20, 22, 24, 24, 20, 26
)
data[[3]] <- rnorm(36)
TS <- list()
Outputs <- list()
for (i in 1:3) {
  TS[[i]] <- ts(data[[i]], start = 1, frequency = 12)
  Function <- HoltWinters(TS[[i]])
  TSpredict <- predict(Function, n.ahead = 1)[1]
  Outputs[[i]] <-
    data.frame(LastReal = TS[[i]][length(TS[[i]])], Forecast = TSpredict)
}
The problem arises when i == 2.
What I need, in this example, is for the "Outputs" list to look as follows:
> Outputs
[[1]]
LastReal Forecast
1 0.5657129 -2.274507
[[2]]
LastReal Forecast
1 error error
[[3]]
LastReal Forecast
1 0.4039783 -0.9556881
Thanks in advance.

I ran into this same problem with HoltWinters the other day and took Roman's advice to use tryCatch. It's not the most intuitive function to pick up from the documentation alone, but I found this link very helpful for understanding it: How to write trycatch in R
My solution builds on the sample there.
data <- list()
data[[1]] <- rnorm(36)
data[[2]] <- c(
  24, 24, 28, 24, 28, 22, 18, 20, 19, 22, 28, 28,
  28, 26, 24, 20, 24, 20, 18, 17, 21, 21, 21, 28,
  26, 32, 26, 22, 20, 20, 20, 22, 24, 24, 20, 26
)
data[[3]] <- rnorm(36)
TS <- list()
Outputs <- list()
for (i in 1:3) {
  Outputs[[i]] <- tryCatch({
    # You can emit messages to see where the loop is
    # message(paste("Computing", i))
    TS[[i]] <- ts(data[[i]], start = 1, frequency = 12)
    Function <- HoltWinters(TS[[i]])
    TSpredict <- predict(Function, n.ahead = 1)[1]
    # the last expression in this block becomes the value of Outputs[[i]]
    data.frame(LastReal = TS[[i]][length(TS[[i]])], Forecast = TSpredict)
  },
  error = function(cond) {
    # message(paste("ERROR: Cannot process time series:", i))
    data.frame(LastReal = "error", Forecast = "error")
  })
}
And the resulting Outputs:
> Outputs
[[1]]
LastReal Forecast
1 0.4733632 0.5469373
[[2]]
LastReal Forecast
1 error error
[[3]]
LastReal Forecast
1 0.8984626 -0.5168826
You can use other error handling parameters such as finally and warning to deal with other exceptions that may arise.
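For instance, a minimal sketch (a hypothetical helper, not from the original code) combining all three handlers around a HoltWinters fit:

safe_hw <- function(x) {
  tryCatch({
    HoltWinters(ts(x, frequency = 12))
  },
  error   = function(e) paste("error:", conditionMessage(e)),
  warning = function(w) paste("warning:", conditionMessage(w)),
  finally = message("HoltWinters attempt finished")  # runs on success or failure
  )
}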

Related

How can I transform the following script into a function for a package?

I have the following data as an example:
IID<-c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
KB<-c(345,1234,2000,1567,
376,1657,9656,234,
1865,1565,123,111,
1999,2032,1565,234)
data<-cbind(IID,KB)
I wrote a script to process it and give me some outcomes:
results_kb <- function(class) {
  this_iids_roh <- dat[class, ]
  my_list <- c("Sum_long"  = sum(this_iids_roh$KB[this_iids_roh$KB >= 1500]),
               "N_long"    = length(this_iids_roh$KB[this_iids_roh$KB >= 1500]),
               "Sum_short" = sum(this_iids_roh$KB[this_iids_roh$KB < 1500]),
               "N_short"   = length(this_iids_roh$KB[this_iids_roh$KB < 1500]))
  return(my_list)
}

dat <- data.table::as.data.table(data)
dat$IID <- as.factor(dat$IID)
data.table::setkey(dat, "IID")

results <- c()
nLevels <- length(levels(dat$IID))
start <- proc.time()
pb <- txtProgressBar(min = 0, max = nLevels, style = 3)
for (i in 1:nLevels) {
  this_iid <- levels(dat$IID)[i]
  results <- rbind(results, results_kb(this_iid))
  setTxtProgressBar(pb, i)
}
close(pb)
proc.time() - start

results <- data.frame(levels(dat$IID), results)
results$IID <- results$levels.dat.IID.
results[results == 0] <- NA
I first created a function that gives me the outcome table, and then I processed the dataset I want to analyze. I added a timing indicator because this script is meant to process very large samples. As you can verify, it works perfectly.
Now I want to write my own package, so I have to wrap all of this into functions. The first part of the script is already a function. For the second part I have tried the following:
rohsum <- function(data) {
  dat <- data.table::as.data.table(data)
  dat$IID <- as.factor(dat$IID)
  data.table::setkey(dat, "IID")
  results <- c()
  nLevels <- length(levels(dat$IID))
  start <- proc.time()
  pb <- txtProgressBar(min = 0, max = nLevels, style = 3)
  for (i in 1:nLevels) {
    this_iid <- levels(dat$IID)[i]
    results <- rbind(results, results_kb(this_iid))
    setTxtProgressBar(pb, i)
  }
  close(pb)
  proc.time() - start
  results <- data.frame(levels(dat$IID), results)
  results$IID <- results$levels.dat.IID.
  results[results == 0] <- NA
  return(results)
}
However, this does not seem to work: when I try to run rohsum(data) I get the following error message:
Error in results_kb(this_iid) : object 'dat' not found
Moreover, the same error appears every time I run it.
How can I solve this issue so that I can build my own package?
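The error points at scoping: results_kb looks up dat in the global environment, while rohsum builds its own local dat, so the name the helper needs is never visible once everything lives inside a package. A minimal sketch of one likely fix (my assumption: pass the table explicitly rather than relying on a global):

results_kb <- function(class, dat) {
  this_iids_roh <- dat[class, ]
  c("Sum_long"  = sum(this_iids_roh$KB[this_iids_roh$KB >= 1500]),
    "N_long"    = length(this_iids_roh$KB[this_iids_roh$KB >= 1500]),
    "Sum_short" = sum(this_iids_roh$KB[this_iids_roh$KB < 1500]),
    "N_short"   = length(this_iids_roh$KB[this_iids_roh$KB < 1500]))
}

# and inside rohsum(), call the helper with the local table:
# results <- rbind(results, results_kb(this_iid, dat))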

Data frame creation inside Parlapply in R

I am trying something pretty simple: I want to run a bunch of regressions in parallel. When I use the following data generator (PART 1), the parallel part does not work and gives the error listed below.
# PART 1
library(MASS)  # mvrnorm comes from MASS
p <- 20; rho <- 0.7
cdc <- diag(p)
for (i in 1:(p - 1)) {
  for (j in (i + 1):p) {
    cdc[i, j] <- cdc[j, i] <- rho^abs(i - j)
  }
}
my.data <- mvrnorm(n = 100, mu = rep(0, p), Sigma = cdc)
The parallel part does work, however, if I generate the data as in PART 2:
# PART 2
my.data <- matrix(rnorm(1000, 0, 1), nrow = 100, ncol = 10)
I configured the function that I want to run in parallel as:
parallel_fun <- function(obj, my.data) {
  p1 <- nrow(cov(my.data))
  store.beta <- matrix(0, p1, length(obj))
  count <- 1
  for (itration in obj) {
    my_df <- data.frame(my.data)
    colnames(my_df)[itration] <- "y"
    my.model <- bas.lm(y ~ ., data = my_df, alpha = 3,
                       prior = "ZS-null", force.heredity = FALSE, pivot = TRUE)
    cf <- coef(my.model, estimator = "MPM")
    betas <- cf$postmean[-1]
    store.beta[-itration, count] <- betas
    count <- count + 1
  }
  result <- list('Beta' = store.beta)
}
So I run parLapply the following way:
{
  library(parallel)
  library(doParallel)  # registerDoParallel comes from doParallel
  no_cores <- detectCores(logical = TRUE)
  myclusternumber <- (no_cores - 1)
  cl <- makeCluster(myclusternumber)
  registerDoParallel(cl)
  p1 <- ncol(my.data)
  obj <- splitIndices(p1, myclusternumber)
  clusterExport(cl, list('parallel_fun', 'my.data', 'obj'), envir = environment())
  clusterEvalQ(cl, {
    library(MASS)
    library(Matrix)
    library(BAS)
  })
  newresult <- parallel::parLapply(cl, obj, fun = parallel_fun, my.data)
  stopCluster(cl)
}
But whenever I use PART 1, I get the following error:
Error in checkForRemoteErrors(val) :
7 nodes produced errors; first error: object 'my_df' not found
This should not happen: the data frame is created inside the function. I have no idea why this is happening. Any help is appreciated.
Posting this as one possible workaround, see if it works:
parallel_fun <- function(obj, my.data) {
  p1 <- nrow(cov(my.data))
  store.beta <- matrix(0, p1, length(obj))
  count <- 1
  for (itration in obj) {
    my_df <- data.frame(my.data)
    colnames(my_df)[itration] <- "y"
    my_df <<- my_df  # push my_df into the worker's global environment
    my.model <- bas.lm(y ~ ., data = my_df, alpha = 3,
                       prior = "ZS-null", force.heredity = FALSE, pivot = TRUE)
    cf <- BAS:::coef.bas(my.model, estimator = "MPM")
    betas <- cf$postmean[-1]
    store.beta[-itration, count] <- betas
    count <- count + 1
  }
  result <- list('Beta' = store.beta)
}
The issue seems to be with the BAS:::coef.bas function, which calls eval in order to get my_df and fails to do that when called in parallel. The "hack" here is to force my_df out to the worker's global environment by calling my_df <<- my_df.
There should be a better way to do this, but <<- might be the fastest one. In general, <<- may cause unwanted behaviour, especially when used in loops. Assigning a unique variable name before exporting (and not forgetting to remove it after use) is one way to tackle that, as sketched below.
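A rough sketch of that unique-name idea (untested, and it assumes coef.bas only needs to re-find the data by name) for the body of the loop:

df_name <- paste0(".my_df_", Sys.getpid())  # per-worker name, unlikely to collide
assign(df_name, my_df, envir = .GlobalEnv)
my.model <- eval(bquote(
  bas.lm(y ~ ., data = .(as.name(df_name)), alpha = 3,
         prior = "ZS-null", force.heredity = FALSE, pivot = TRUE)
))
cf <- coef(my.model, estimator = "MPM")
rm(list = df_name, envir = .GlobalEnv)      # remove after use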

Read many files in parallel and extract data

I have 1000 JSON files and I would like to read them in parallel. I have 4 CPU cores.
I have a character vector with the names of all the files:
cik_files <- list.files("./data/", pattern = ".json")
Using this vector, I load each file, extract the data, and add it to the following list:
data <- list()
Below is the code for extracting the data:
library(jsonlite)  # fromJSON comes from jsonlite
for (i in 1:1000) {
  data1 <- fromJSON(paste0("./data/", cik_files[i]), flatten = TRUE)
  if ("NetIncomeLoss" %in% names(data1$facts$`us-gaap`)) {
    data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
    data1 <- data1[grep("CY20[0-9]{2}$", data1$frame), c(3, 9)]
    try({
      if (nrow(data1) > 0) {
        data1$cik <- strtrim(cik_files[i], 13)
        data[[length(data) + 1]] <- data1
      }
    }, silent = TRUE)
  }
}
This, however, takes quite a lot of time. So I was wondering how I can run the code inside the for loop in parallel.
Thanks in advance.
Here is an attempt to solve the problem in the question. Untested, since there is no data.
Step 1
First of all, rewrite the loop in the question as a function.
f <- function(i, path = "./data", cik_files) {
  filename <- file.path(path, cik_files[i])
  data1 <- fromJSON(filename, flatten = TRUE)
  if ("NetIncomeLoss" %in% names(data1$facts$`us-gaap`)) {
    data1 <- data1$facts$`us-gaap`$NetIncomeLoss$units$USD
    found <- grep("CY20[0-9]{2}$", data1$frame)
    if (length(found) > 0) {
      tryCatch({
        out <- data1[found, c(3, 9)]
        out$cik <- strtrim(cik_files[i], 13)
        out
      },
      error = function(e) e,
      warning = function(w) w)
    } else NULL
  } else NULL
}
Step 2
Now load the package parallel and run one of the following, depending on OS.
library(parallel)

# Not on Windows
library(jsonlite)
json_list <- mclapply(seq_along(cik_files), f, cik_files = cik_files)

# Windows
ncores <- detectCores()
cl <- makeCluster(ncores - 1L)
clusterExport(cl, "cik_files")
clusterEvalQ(cl, library(jsonlite))
json_list <- parLapply(cl, seq_along(cik_files), f, cik_files = cik_files)
stopCluster(cl)
Step 3
Extract the data from the returned list json_list.
err <- sapply(json_list, inherits, "error")
warn <- sapply(json_list, inherits, "warning")
ok <- !(err | warn)
json_list[ok]  # correctly read in
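If the successful reads should end up in one data frame (an extra step beyond the answer, assuming all pieces share the same columns):

cik_data <- do.call(rbind, json_list[ok])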

Error in meanf - unused argument

Having problems with a function I wrote in R using the forecast package. This is the function:
generateARIMAForecasts <- function(inputTSDecompList, inputArimaOrder, fcstHrzn, cnst, drft){
  tmpSTL <- NULL;
  fcasting <- NULL;
  tsfcastList <- NULL;
  counter <- 1;
  while(counter <= length(inputTSDecompList)){
    # select the TS decompositions
    tmpSTL <- inputTSDecompList[counter]$TimeSeriesDecomposition;
    # add the lattice plot to the list of plots
    if(cnst == TRUE & drft == TRUE){
      fcasting <- forecast(tmpSTL, h = fcstHrzn,
                           forecastfunction = function(x, h, level, ...){
                             fit <- Arima(x, order = inputArimaOrder, include.constant = TRUE, include.drift = TRUE)
                             return(forecast(fit, h = fcstHrzn, level = level, ...))});
    }
    fcastCoefs <- fcasting$model$coef;
    fcstValues <- fcasting;
    fcastSummary <- summary(fcasting);
    # add the forecast results to the forecast list
    tsfcastList[[counter]] <- list(FinancialInstitution = LVTSFITimeSeriesList[counter]$LVTSFITimeSeriesList$FinancialInstitution,
                                   ForecastCoefficients = fcastCoefs,
                                   ForecastedSeries = fcstValues,
                                   ForecastSummary = fcastSummary);
    counter <- counter + 1;
  }
  return(tsfcastList);
}
The function takes a list of STL-decomposed series and generates an Arima forecast for each individual decomposition in the input list.
I have run the forecast generation manually, hardcoding individual elements, and it works. However, when I try to do it through the function I get the following error:
Error in meanf(object, h = h, level = level, fan = fan, lambda = lambda, :
unused argument (forecastfunction = function (x, h, level, ...)
{
fit <- Arima(x, order = inputArimaOrder, include.constant = TRUE, include.drift = TRUE)
return(forecast(fit, h = fcstHrzn, level = level, ...))
})
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Could someone advise please?
After a few more hours of manually debugging each line in the RStudio console, I figured it out. The issue was my call
tmpSTL <- inputTSDecompList[counter]$TimeSeriesDecomposition;
This returned NULL because I had created the inputTSDecompList as a 2-D list using
tsDecomList[[counter]] <- list(FinancialInstitution=inputTSList[counter]$LVTSFITimeSeriesList$FinancialInstitution, TimeSeriesDecomposition=tsDecom);
So I should have been calling
tmpSTL <- inputTSDecompList[[counter]]$TimeSeriesDecomposition;
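A minimal generic illustration of the single- vs double-bracket difference (not the original objects):

x <- list(a = list(TimeSeriesDecomposition = "stl fit"))
x[1]$TimeSeriesDecomposition    # NULL: x[1] is a one-element sublist named "a"
x[[1]]$TimeSeriesDecomposition  # "stl fit": x[[1]] is the element itself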

The function lda() throws an error when passing a subset argument

This error looks common, but I can't seem to get my head around it.
I've been given the following code (on a course, though the code itself is not graded) as a shortcut to doing LDA. Apparently it works on some computers but not mine. I've upgraded R and RStudio and also the MASS library. Any ideas?
The error I get is:
Error in eval(expr, envir, enclos) : object 'training' not found
The code is
lda.valid <- function(formula, data, ..., train.fraction = 0.75){
  grouping <- model.response(model.frame(formula, data))
  tbl <- table(grouping, lda(formula, data, ..., CV = TRUE)$class)
  CV <- sum(diag(tbl))/sum(tbl)
  n <- nrow(data)
  training <- sample(1:n, n*train.fraction)
  lda.training <- lda(formula, data, ..., subset = training)
  lda.pred <- predict(lda.training, data[-training, ])
  tbl <- table(grouping[-training], lda.pred$class)
  VAL <- sum(diag(tbl))/sum(tbl)
  c(CV = CV, VAL = VAL)
}
I run the following and get the error. Is it related to the "..." (ellipsis)?
lda.valid(Species~., data = iris, prior = c(1/3, 1/3, 1/3), train.fraction = 0.5)
I was looking at tryCatch to catch the error, but I don't see how I can print the stack trace.
Any hints or suggestions? I probably don't understand the stack trace at this point.
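On the stack-trace point, two generic base-R tools (not specific to this code) can help:

traceback()               # print the call stack of the most recent error
options(error = recover)  # on the next error, browse the failing frames
# ... reproduce the error, pick a frame number to inspect ...
options(error = NULL)     # restore the default handler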
The error occurs where you call lda.training <- lda(...). This seems to be related to internals of the lda() function, and it's not clear to me why this happens.
However, the intent of this code seems to be to perform the LDA using only a training subset of the data.
This is easy enough to specify directly by subsetting the data in advance. So I suggest replacing the offending line with
lda.training <- lda(formula, data[training, ], ...)
Thus the complete function is:
library(MASS)
lda.valid <- function(formula, data, ..., train.fraction = 0.75){
  grouping <- model.response(model.frame(formula, data))
  tbl <- table(grouping, lda(formula, data, ..., CV = TRUE)$class)
  CV <- sum(diag(tbl))/sum(tbl)
  n <- nrow(data)
  training <- sample(1:n, n*train.fraction)
  lda.training <- lda(formula, data[training, ], ...)  # <<<--- Changed
  lda.pred <- predict(lda.training, data[-training, ])
  tbl <- table(grouping[-training], lda.pred$class)
  VAL <- sum(diag(tbl))/sum(tbl)
  c(CV = CV, VAL = VAL)
}
lda.valid(Species~., data = iris, prior = c(1/3, 1/3, 1/3), train.fraction = 0.5)
This results in:
> lda.valid(Species~., data = iris, prior = c(1/3, 1/3, 1/3), train.fraction = 0.5)
CV VAL
0.98 0.96
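One small note: because training is drawn with sample(), CV and VAL vary between runs. Setting a seed beforehand (plain base R, not part of the original answer) makes the comparison reproducible:

set.seed(1)  # make the random training split reproducible
lda.valid(Species~., data = iris, prior = c(1/3, 1/3, 1/3), train.fraction = 0.5)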
