What's the relation between time-series multi-step forecasts with re-estimation and the forecast errors obtained by tsCV()?

I have a monthly time series (yt) with 108 observations (1/2010 to 12/2018), which I divided into a training set [1:84] and a test set [85:108]. My idea is to train my model on the training set and apply it to the test set, so I can compare the obtained forecasts with the real observations from the test set (doing this for h = 1, 3, 6 and 12). So, I want to obtain forecasts and their errors, and use them to calculate the MAE, MSE and RMSE.
To obtain the forecasts
I implement the code for multi-step forecasts with re-estimation provided by Dr. Rob Hyndman in https://robjhyndman.com/hyndsight/rolling-forecasts/:
h <- 1 # repeat for h = 3, 6 and 12
train<-window(yt, end=c(2016,12))
test<-window(yt, start=2017)
n<-length(test)-h+1
fit<-Arima(train, order=c(0,1,1), seasonal=list(order=c(0,1,1), period=12), method="ML") # model identified by analysing ACF and PACF of yt
order<-arimaorder(fit)
fcmat1<-matrix(0, nrow=n, ncol=h)
for(i in 1:n) {
  # extend the fitting window by one month per iteration; end=2016.99+(i-1)/12
  # follows the blog post (note that end=c(2016,12)+(i-1)/12 adds the fraction
  # to both the year and the month components and drifts off by a full month
  # after twelve iterations)
  x <- window(yt, end=2016.99 + (i-1)/12)
  refit <- Arima(x, order=order[1:3], seasonal=order[4:6])
  fcmat1[i,] <- forecast(refit, h=h)$mean
}
I can then plot the values in fcmat1 over yt and visually assess the difference between the obtained forecasts and the real values of yt[85:108].
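For the plot itself, something along these lines should work (a sketch, not code from the original post; autolayer() needs ggplot2 loaded, and the start date assumes h = 1 so that fcmat1[,1] lines up with the test period):
library(ggplot2)
# turn the first forecast column into a ts aligned with the test set
fc_ts <- ts(fcmat1[,1], start=c(2017,1), frequency=12)
autoplot(yt) + autolayer(fc_ts, series="rolling forecasts")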
To obtain the errors
In the link above, Dr. Rob refers to another page on time-series cross-validation (https://robjhyndman.com/hyndsight/tscv/), so I thought of simply doing
for_func <- function(x, h){forecast(fit, h=h)} # note: ignores x and always forecasts from the global `fit`
e <- tsCV(yt, for_func, h=1)
and use e to obtain the accuracy measures. But then I get confused, because it seems to me that the forecasts calculated by for_func inside tsCV are different from those obtained previously (fcmat1), and so the errors will be too. In the multi-step-forecasts-with-re-estimation approach, I fit the model to the training set and then "extend" it over the remaining data. In the tsCV approach, the Arima model is applied to all the data (yt), not just the training set.
If I didn't want to plot the forecasts, I would never have come across this problem and would have used tsCV without hesitation. The same goes if I just wanted to plot the forecasts without evaluating their accuracy (I would simply use multi-step forecasting with re-estimation).
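(A side note, not from the original post: a forecast function that re-estimates the model on each subseries x that tsCV() passes in would be closer in spirit to the fcmat1 loop, e.g.)
for_func_refit <- function(x, h) {
  # refit the same ARIMA specification on each expanding subseries;
  # very short early subseries may fail to fit, which tsCV records as NA
  forecast(Arima(x, order=c(0,1,1), seasonal=c(0,1,1)), h=h)
}
e_refit <- tsCV(yt, for_func_refit, h=1)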
So, my questions:
Are multi-step forecasting with re-estimation and tsCV somehow related? If so, how?
In practice, which of the following approaches should I use to obtain the errors?
tsCV():
e1<-tsCV(yt, for_func, h=1)
mae1<-mean(abs(e1), na.rm=TRUE)
mse1<-mean(e1^2, na.rm=TRUE)
rmse1<-sqrt(mse1)
# MAE MSE RMSE
# h=1 40.80492 2459.470 49.59305
# h=3 43.23866 2846.406 53.35172
# h=6 46.54597 3321.426 57.63181
# h=12 51.19123 3969.944 63.00749
Errors computed directly from fcmat1:
e1 <- test - fcmat1[,1] # y minus forecast, matching tsCV's sign convention (the sign doesn't affect MAE/MSE/RMSE)
mae1ad<-mean(abs(e1), na.rm=TRUE)
mse1ad<-mean(e1^2, na.rm=TRUE)
rmse1ad<-sqrt(mse1ad)
# MAE MSE RMSE
# h=1 39.05335 2314.315 48.10733
# h=3 35.95370 2045.558 45.22784
# h=6 45.00950 3301.727 57.46066
# h=12 72.05596 7821.347 88.43838
accuracy() as it is usually applied
for1 <- forecast(Arima(train, order=c(0,1,1), seasonal=list(order=c(0,1,1), period=12), method="ML"), h=1)
e1ac <- accuracy(for1, test)
# index by name rather than position; e1ac[6] and e1ac[4] pick out the same
# test-set entries by column-major position
mae1ac <- e1ac["Test set", "MAE"]
rmse1ac <- e1ac["Test set", "RMSE"]
mse1ac <- rmse1ac^2
# MAE MSE RMSE
# h=1 31.93750 1020.004 31.93750
# h=3 55.97562 5097.511 71.39685
# h=6 42.35456 3628.103 60.23374
# h=12 30.84436 2201.275 46.91774
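One detail worth flagging for the tsCV() option above (my note, not part of the original question): when h > 1, tsCV() returns a matrix with one column per horizon, so averaging over the whole matrix mixes the 1-step through h-step errors; the pure h-step errors sit in the last column, e.g.
e12 <- tsCV(yt, for_func, h=12)
rmse12 <- sqrt(mean(e12[,12]^2, na.rm=TRUE)) # 12-step-ahead errors only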
Can someone enlighten me about this? (and possibly alert me if I'm doing something wrong)
Thanks

Related

Implementation of time series cross-validation

I am working with time series 551 of the monthly data of the M3 competition.
So, my data is :
library(forecast)
library(Mcomp)
# Time Series
# Subset the M3 data to contain the relevant series
ts.data<- subset(M3, 12)[[551]]
print(ts.data)
I want to implement time series cross-validation for the last 18 observations of the in-sample interval.
Some people would normally call this “forecast evaluation with a rolling origin” or something similar.
How can I achieve that? What does the in-sample interval mean? Which time series must I evaluate?
I'm quite confused; any help clearing this up would be welcome.
The tsCV function of the forecast package is a good place to start.
From its documentation,
tsCV(y, forecastfunction, h = 1, window = NULL, xreg = NULL, initial = 0, ...)
Let ‘y’ contain the time series y[1:T]. Then ‘forecastfunction’ is applied successively to the time series y[1:t], for t=1,...,T-h, making predictions f[t+h]. The errors are given by e[t+h] = y[t+h] - f[t+h].
That is, tsCV first fits a model to y[1] and forecasts y[1+h], then fits a model to y[1:2] and forecasts y[2+h], and so on, for T-h steps.
The tsCV function returns the forecast errors.
Applying this to the training data of ts.data:
# function to fit a model and forecast
fmodel <- function(x, h){
  forecast(Arima(x, order=c(1,1,1), seasonal=c(0,0,2)), h=h)
}
# time-series CV
cv_errs <- tsCV(ts.data$x, fmodel, h = 1)
# RMSE of the time-series CV
sqrt(mean(cv_errs^2, na.rm=TRUE))
# [1] 778.7898
In your case, it may be that you are supposed to:
fit a model to ts.data$x and then forecast ts.data$xx[1],
fit a model to c(ts.data$x, ts.data$xx[1]) and forecast ts.data$xx[2],
and so on.
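If you only want errors over the out-of-sample part, one option (a sketch, under the assumption that ts.data$x and ts.data$xx hold the in-sample and out-of-sample pieces of a monthly series) is to stitch the two together and use tsCV's initial argument so that no forecasts are made before the in-sample period ends:
full_series <- ts(c(ts.data$x, ts.data$xx), start=start(ts.data$x), frequency=12)
cv_errs2 <- tsCV(full_series, fmodel, h=1, initial=length(ts.data$x))
sqrt(mean(cv_errs2^2, na.rm=TRUE)) # RMSE over the out-of-sample period only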

How to create a sliding window in R to divide data into test and train samples to test accuracy of forecasts?

We are using the forecast package in R to read three weeks' worth of hourly data (3*7*24 data points) and make predictions for the next 24 hours. It's a time series with multiple seasonalities.
We have the forecast model running just fine and it seems to be doing well. Now, we wish to quantify the accuracy of our approach / forecasting algorithm for our data. We wish to use the accuracy function in the forecast package for this purpose. We understand that it works as follows: if f is the forecast and x is the actual observation vector, then accuracy(f, x) gives us several accuracy measures for that forecast.
We have data from the past several months and we wish to write a sliding window algorithm that picks 3*7*24 hourly values and then predicts the next 24 hours. Then it compares these predictions against the actual data for the next day (24 hours), reports the accuracy, slides the window forward by 24 hours, and repeats.
The sample data is generated as follows:
library("forecast")
time <- 1:(12*168)
set.seed(1)
ds <- msts(sin(2*pi*time/24)+c(1,1,1.2,0.8,1,0,0)[((time-1)%/%24)%%7+1]+ time/400+rnorm(length(time),0,0.2),seasonal.periods=c(24,168))
plot(ds)
head(ds)
tail(ds)
length(ds)
length(time)
Forecasting procedure is as follows:
model <- tbats(ds[1:504])
fcst <- forecast(model,h=24,level=90)
accuracy(fcst,ds[505:528]) ##Test accuracy of forecast against next/actual 24 values
Now, we wish to slide the "window" by 24 and repeat the same procedure, that is, the next set of values used to build the model will be ds[25:528] and their accuracy will be tested against ds[529:552] ... and so on. How can we implement this?
Also, is there a better way to test overall accuracy of this forecasting algorithm for our scenario?
I would do this by creating a vector of times representing the front edge of the sliding windows, then using lapply to iterate the forecasting and scoring process over the windows those edges imply. Like...
# set a couple of parameters we'll use to slice the series into chunks:
# window width (w) and the time step at which you want to end the first
# training set
w = 24 ; start = 504
# now use those parameters to make a vector of the time steps at which each
# window will end
steps <- seq(start + w, length(ds), by = w)
# using lapply, iterate the forecasting-and-scoring process over the
# windows those steps define
cv_list <- lapply(steps, function(x) {
  train <- ds[1:(x - w)]
  test <- ds[(x - w + 1):x]
  model <- tbats(train)
  fcst <- forecast(model, h = w, level = 90)
  accuracy(fcst, test)
})
Example output for the first window:
> cv_list[[1]]
ME RMSE MAE MPE MAPE MASE
Training set 0.0001587681 0.3442898 0.2689754 34.3957362 84.30841 0.9560206
Test set 0.2619029897 0.8961109 0.7868256 -0.6832273 36.64301 2.7966186
ACF1
Training set 0.02588145
Test set NA
If you want summaries of the scores for the whole list, you can do something like...
rmse <- mean(unlist(lapply(cv_list, '[[', "Test set","RMSE")))
...which produces this:
[1] 1.011177
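Note that the lapply above grows the training set from the start of the series (an expanding origin). If you want the fixed-width sliding window described in the question (ds[1:504], then ds[25:528], and so on), shift the start of the training slice too; the sketch below does that, with train_w as an assumed name for the training width:
train_w <- 504
cv_list_sliding <- lapply(steps, function(x) {
  # fixed-width training window ending just before the 24-hour test slice
  train <- ds[(x - w - train_w + 1):(x - w)]
  test <- ds[(x - w + 1):x]
  accuracy(forecast(tbats(train), h = w, level = 90), test)
})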

How to use previous observations to forecast the next period using for loops in R?

I have simulated 1000 observations from the AR(2) process x[t] = γ1*x[t-1] + γ2*x[t-2] + ε[t].
What I would like to do is to use the first 900 observations to estimate the model, and use the remaining 100 observations to predict one-step ahead.
This is what I have done so far:
data2=arima.sim(n=1000, list(ar=c(0.5, -0.7))) #1000 observations simulated, (AR (2))
arima(data2, order = c(2,0,0), method= "ML") #estimated parameters of the model with ML
fit2<-arima(data2[1:900], c(2,0,0), method="ML") #first 900 observations used to estimate the model
predict(fit2, 100)
But the problem with my code is that predict(fit2, 100) uses n.ahead=100 (a single 100-step-ahead forecast path), whereas I would like n.ahead=1 and 100 one-step predictions in total.
I think I need to use for loops for this, but since I am a very new user of Rstudio I haven't been able to figure out how to use for loops to make predictions. Can anyone help me with this?
If I've understood you correctly, you want one-step predictions on the test set. This should do what you want without loops:
library(forecast)
data2 <- arima.sim(n=1000, list(ar=c(0.5, -0.7)))
fit2 <- Arima(data2[1:900], c(2,0,0), method="ML")
fit2a <- Arima(data2[901:1000], model=fit2)
fc <- fitted(fit2a)
The Arima command allows a model to be applied to a new data set without the parameters being re-estimated. Then fitted gives one-step in-sample forecasts.
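As a quick check (my addition, not part of the original answer), these one-step forecasts can then be scored against the held-out observations:
accuracy(fc, data2[901:1000]) # test-set ME, RMSE, MAE, etc.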
If you want multi-step forecasts on the test data, you will need to use a loop. Here is an example for two-step ahead forecasts:
fcloop <- numeric(100)
h <- 2
for(i in 1:100) {
  fit2a <- Arima(data2[1:(899+i)], model=fit2)
  fcloop[i] <- forecast(fit2a, h=h)$mean[h]
}
If you set h <- 1 above you will get almost the same results as using fitted in the previous block of code. The first two values will be different because the approach using fitted does not take account of the data at the end of the training set, while the approach using the loop uses the end of the training set when making the forecasts.

Feature selection + cross-validation, but how to make ROC-curves in R

I'm stuck with the following problem. I divide my data into 10 folds. Each time, I use 1 fold as the test set and the other 9 as the training set (I do this ten times). On each training set, I do feature selection (a filter method with chi.squared) and then build an SVM model with the training set and the selected features.
So at the end, I end up with 10 different models (because of the feature selection). But now I want to make one overall ROC curve in R for this filter method. How can I do this?
Silke
You can indeed store the predictions if they are all on the same scale (be especially careful about this as you perform feature selection... some methods may produce scores that are dependent on the number of features) and use them to build a ROC curve. Here is the code I used for a recent paper:
library(pROC)
data(aSAH)
k <- 10
n <- dim(aSAH)[1]
indices <- sample(rep(1:k, ceiling(n/k))[1:n])
all.response <- all.predictor <- aucs <- c()
for (i in 1:k) {
  test <- aSAH[indices == i, ]
  learn <- aSAH[indices != i, ]
  model <- glm(as.numeric(outcome) - 1 ~ s100b + ndka + as.numeric(wfns), data = learn, family = binomial(link = "logit"))
  model.pred <- predict(model, newdata = test)
  aucs <- c(aucs, roc(test$outcome, model.pred)$auc)
  all.response <- c(all.response, test$outcome)
  all.predictor <- c(all.predictor, model.pred)
}
roc(all.response, all.predictor)
mean(aucs)
The ROC curve is built from all.response and all.predictor, which are updated at each step. This code also stores the AUC of each fold in aucs for comparison. Both results should be quite similar when the sample size is sufficiently large; however, small samples within the cross-validation may lead to an underestimated AUC, as the ROC curve built from all the data will tend to be smoother and less underestimated by the trapezoidal rule.
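To actually draw the pooled curve (a small addition on my part, using pROC's plot method for roc objects):
pooled_roc <- roc(all.response, all.predictor)
plot(pooled_roc, print.auc=TRUE) # ROC curve pooled over the 10 folds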

Maximum likelihood estimation for ARMA(1,1)-GARCH(1,1)

Following some standard textbooks on ARMA(1,1)-GARCH(1,1) (e.g. Ruey Tsay's Analysis of Financial Time Series), I am trying to write an R program to estimate the key parameters of an ARMA(1,1)-GARCH(1,1) model for Intel's stock returns. For some reason, I cannot figure out what is wrong with my R program. The R package fGarch already gives me the answer, but my customized function does not seem to produce the same result.
I would like to build an R program that helps estimate the baseline ARMA(1,1)-GARCH(1,1) model. Then I would like to adapt this baseline script to fit different GARCH variants (e.g. EGARCH, NGARCH, and TGARCH). It would be much appreciated if you could provide some guidance in this case. The code below is the R script for estimating the 6 parameters of an ARMA(1,1)-GARCH(1,1) model for Intel's stock returns. At any rate, I would be glad to know your thoughts and insights. If you have a similar example, please feel free to share your extant code in R. Many thanks in advance.
Emily
# This R script offers a suite of functions for estimating the volatility dynamics based on the standard ARMA(1,1)-GARCH(1,1) model and its variants.
# The baseline ARMA(1,1) model characterizes the dynamic evolution of the return generating process.
# The baseline GARCH(1,1) model depicts the return volatility dynamics over time.
# We can extend the GARCH(1,1) volatility model to a variety of alternative specifications to capture the potential asymmetry for a better comparison:
# GARCH(1,1), EGARCH(1,1), NGARCH(1,1), and TGARCH(1,1).
options(scipen=10)
intel= read.csv(file="intel.csv")
summary(intel)
raw_data= as.matrix(intel$logret)
library(fGarch)
garchFit(~arma(1,1)+garch(1,1), data=raw_data, trace=FALSE)
negative_log_likelihood_arma11_garch11 <- function(theta, data) {
  mean  <- theta[1]
  delta <- theta[2]
  gamma <- theta[3]
  omega <- theta[4]
  alpha <- theta[5]
  beta  <- theta[6]
  r <- ts(data)
  n <- length(r)
  # ARMA(1,1) innovations: u[t] = r[t] - mean - delta*r[t-1] - gamma*u[t-1]
  u <- ts(vector(length=n))
  u[1] <- r[1] - mean
  for (t in 2:n) {
    u[t] <- r[t] - mean - delta*r[t-1] - gamma*u[t-1]
  }
  # GARCH(1,1) conditional variances, initialised at the unconditional variance
  h <- ts(vector(length=n))
  h[1] <- omega/(1 - alpha - beta)
  for (t in 2:n) {
    h[t] <- omega + alpha*(u[t-1]^2) + beta*h[t-1]
  }
  # Gaussian negative log-likelihood; equivalent to
  # -sum(dnorm(u[2:n], mean=0, sd=sqrt(h[2:n]), log=TRUE))
  # (base R already provides pi, so no manual definition is needed)
  return(-sum(-0.5*log(2*pi) - 0.5*log(h[2:n]) - 0.5*(u[2:n]^2)/h[2:n]))
}
#theta0=c(0, +0.78, -0.79, +0.0000018, +0.06, +0.93, 0.01)
theta0=rep(0.01,6)
negative_log_likelihood_arma11_garch11(theta=theta0, data=raw_data)
alpha <- proc.time() # start timer (this alpha is unrelated to the GARCH alpha)
maximum_likelihood_fit_arma11_garch11=
nlm(negative_log_likelihood_arma11_garch11,
p=theta0,
data=raw_data,
hessian=TRUE,
iterlim=500)
#optim(theta0,
# negative_log_likelihood_arma11_garch11,
# data=raw_data,
# method="L-BFGS-B",
# upper=c(+0.999999999999,+0.999999999999,+0.999999999999,0.999999999999,0.999999999999,0.999999999999),
# lower=c(-0.999999999999,-0.999999999999,-0.999999999999,0.000000000001,0.000000000001,0.000000000001),
# hessian=TRUE)
# We record the end time and calculate the total runtime for the above work.
omega <- proc.time() # end timer (unrelated to the GARCH omega)
runtime <- (omega - alpha)["elapsed"] # use elapsed wall-clock time, a single number
zhours <- floor(runtime/60/60)
zminutes <- floor(runtime/60 - zhours*60)
zseconds <- floor(runtime - zhours*60*60 - zminutes*60)
print(paste("It takes ", zhours, " hour(s), ", zminutes, " minute(s) and ", zseconds, " second(s) to finish running this R program", sep=""))
maximum_likelihood_fit_arma11_garch11
sqrt(diag(solve(maximum_likelihood_fit_arma11_garch11$hessian)))
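To see how far apart the two fits are, one possibility (a sketch; it assumes the nlm run converged and that the parameter order mu, ar1, ma1, omega, alpha1, beta1 matches fGarch's coef()) is to tabulate the estimates side by side:
fgarch_fit <- garchFit(~arma(1,1)+garch(1,1), data=raw_data, trace=FALSE)
# compare the fGarch estimates with the custom maximum-likelihood estimates
cbind(fGarch=coef(fgarch_fit), custom=maximum_likelihood_fit_arma11_garch11$estimate)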
