Implementation of time series cross-validation in R

I am working with time series 551 of the monthly data of the M3 competition.
So, my data is:
library(forecast)
library(Mcomp)
# Time Series
# Subset the M3 data to contain the relevant series
ts.data <- subset(M3, 12)[[551]]
print(ts.data)
I want to implement time series cross-validation for the last 18 observations of the in-sample interval.
Some people would normally call this “forecast evaluation with a rolling origin” or something similar.
How can I achieve that? What does the in-sample interval mean? Which time series must I evaluate?
I'm quite confused; any help clearing this up would be welcome.

The tsCV function of the forecast package is a good place to start.
From its documentation,
tsCV(y, forecastfunction, h = 1, window = NULL, xreg = NULL, initial = 0, ...)
Let ‘y’ contain the time series y[1:T]. Then ‘forecastfunction’ is applied successively to the time series y[1:t], for t = 1, ..., T-h, making predictions f[t+h]. The errors are given by e[t+h] = y[t+h] - f[t+h].
That is, tsCV first fits a model to y[1] and forecasts y[1 + h], then fits a model to y[1:2] and forecasts y[2 + h], and so on, for T-h steps.
The tsCV function returns the forecast errors.
Applying this to the training data of ts.data:
# function to fit a model and forecast
fmodel <- function(x, h){
  forecast(Arima(x, order = c(1, 1, 1), seasonal = c(0, 0, 2)), h = h)
}
# time-series CV
cv_errs <- tsCV(ts.data$x, fmodel, h = 1)
# RMSE of the time-series CV
sqrt(mean(cv_errs^2, na.rm=TRUE))
# [1] 778.7898
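To score only the last 18 in-sample origins, as the question asks, one option is the initial argument of tsCV, which suppresses the errors for the early part of the series. A minimal sketch, assuming the in-sample data sit in ts.data$x:
# keep errors only for (roughly) the last 18 forecast origins
cv_last18 <- tsCV(ts.data$x, fmodel, h = 1, initial = length(ts.data$x) - 18)
sqrt(mean(cv_last18^2, na.rm = TRUE))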
In your case, it may be that you are supposed to fit a model to ts.data$x and forecast ts.data$xx[1], then fit a model to c(ts.data$x, ts.data$xx[1]) and forecast ts.data$xx[2], and so on.
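A minimal sketch of that rolling loop, assuming ts.data$xx holds the out-of-sample values and fmodel is the function defined above:
n_out <- length(ts.data$xx)
errs <- numeric(n_out)
for (i in seq_len(n_out)) {
  # history = the full in-sample data plus the out-of-sample values seen so far
  history <- ts(c(ts.data$x, head(ts.data$xx, i - 1)),
                start = start(ts.data$x), frequency = 12)
  fc <- fmodel(history, h = 1)
  errs[i] <- ts.data$xx[i] - fc$mean[1]
}
sqrt(mean(errs^2))  # rolling-origin RMSE over the out-of-sample interval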

Related

Time series prediction with and without NAs (ARIMA and Forecast package) in R

This is my first question on Stack Overflow.
Situation: I have two time series. Both series have the same values, but the second series has 5 NAs at the start. Hence, the first series has 105 observations, while the second has 110. I have fitted an ARIMA(0,1,0) using the Arima function to both series separately, and then used the forecast package to predict 10 steps into the future.
Issue: Even though the ARIMA coefficients for both series are the same, the projections (10 steps) are different. I am not sure why this is the case. Has anyone come across this before? Any guidance is highly appreciated.
Tried: I tried setting a seed, creating the index manually, and using auto.arima for the model fitting. However, none of these steps helped me reconcile the difference.
I have added a picture to show what I see. Please note that I have hidden the middle part of the series so you can see the start and the end. The yellow highlighted cells are the projection outputs from the forecast package; I manually added the year index after extracting the results from R.
[Image: projected and base time series in Excel]
Rates <- read.csv("Rates_for_ARIMA.csv")
set.seed(123)
# ARIMA with NA
Simple_Arima <- Arima(
  ts(Rates$Rates1),
  order = c(0, 1, 0),
  include.drift = TRUE)
fcasted_Arima <- forecast(Simple_Arima, h = 10)
fcasted_Arima$mean
# ARIMA without NA
Rates2 <- as.data.frame(Rates$Rates2)
# Remove the final empty rows from the CSV
Rates2 <- Rates2[-c(106, 107, 108, 109, 110), ]
Simple_Arima2 <- Arima(
  ts(Rates2),
  order = c(0, 1, 0),
  include.drift = TRUE)
fcasted_Arima2 <- forecast(Simple_Arima2, h = 10)
fcasted_Arima2$mean
The link to data is here, CSV format
Could you share your data and code such that others can see if there is any issue with it?
I tried to come up with an example and got the same results for both series, one that includes NAs and one that doesn't.
library(forecast)
library(xts)
set.seed(123)
ts1 <- arima.sim(model = list(order = c(0, 1, 0)), n = 105)  # simulate an I(1) series
ts2 <- ts(c(rep(NA, 5), ts1), start = 1)
fit1 <- forecast::Arima(ts1, order = c(0, 1, 0))
fit2 <- forecast::Arima(ts2, order = c(0, 1, 0))
pred1 <- forecast::forecast(fit1, 10)
pred2 <- forecast::forecast(fit2, 10)
forecast::autoplot(pred1)
forecast::autoplot(pred2)
> all.equal(as.numeric(pred1$mean), as.numeric(pred2$mean))
[1] TRUE

Forecasting with tsCV and auto.arima predicted values in R

I want to do an out-of-sample forecast experiment using the auto.arima function. Furthermore, time series cross-validation with a fixed rolling window size should be applied. The goal is to obtain forecasts for 1, 3, and 6 steps ahead.
library(forecast)
library(tseries)
# the time series
y1 <- 2 + 0.15*(1:20) + rnorm(20, 2)
y2 <- y1[20] + 0.3*(1:30) + rnorm(30, 2)
y <- as.ts(c(y1, y2))
# 10 obs in test set, 40 obs in training set
ntest <- 10
ntrain <- length(y) - ntest
# auto.arima with some preferred specifications
farima <- function(x, h){
  forecast(auto.arima(x, ic = "aic", test = "adf", seasonal = FALSE,
                      stepwise = FALSE, approximation = FALSE,
                      method = "ML"), h = h)
}
# running tsCV gives the forecast errors in a matrix, with one column per horizon
e <- tsCV(y, farima, h = 6, window = 40)
The predicted values are given by subtracting the error from the true value:
#predicted values
fc1 <- c(NA,y[2:50]-e[1:49,1])
fc1 <- fc1[41:50]
fc3 <- c(NA,y[2:50]-e[1:49,3])
fc3 <- fc3[41:50]
fc6 <- c(NA,y[2:50]-e[1:49,6])
fc6 <- fc6[41:50]
However, I'm curious whether the predicted values for the 3-step-ahead forecasts are coded correctly, since the first 3-step-ahead forecast is the prediction of the 43rd observation.
Also, I don't understand why the matrix e has a value in the 3-step-ahead error column [3rd column] for observation 40. I thought the first 3-step-ahead forecast is obtained for observation 43, so there shouldn't be an error at observation 40.
Always read the help file:
Value
Numerical time series object containing the forecast errors as a vector (if h=1) and a matrix otherwise. The time index corresponds to the last period of the training data. The columns correspond to the forecast horizons.
So tsCV() returns errors in a matrix where the (i,j)th entry contains the error for forecast origin i and forecast horizon j. So the value in row 40 and column 3 is a 3-step error made at time 40, for time period 43.
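To recover the predictions themselves from the error matrix, the origin/horizon bookkeeping can be written out directly. A sketch using the y and e objects from above:
# the h-step forecast made at origin i targets y[i + h],
# so the prediction is y[i + h] - e[i, h]
h <- 3
origins <- 40:47  # origins whose 3-step target falls in the test set (43..50)
fc3 <- y[origins + h] - e[cbind(origins, h)]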
Thanks for your help!
So for the h=1,2,3 steps ahead the predicted values are the following:
# predicted values
# h = 1: origins 40..49 target observations 41..50
fc1 <- y[41:50] - e[40:49, 1]
# h = 2: origins 40..48 target observations 42..50
fc2 <- y[42:50] - e[40:48, 2]
# h = 3: origins 40..47 target observations 43..50
fc3 <- y[43:50] - e[40:47, 3]
Is that correct?

Do we need to do differencing of exogenous variables before passing to xreg argument of Arima() in R?

I am trying to build a forecasting model using ARIMAX in R and need some guidance on how covariates are handled by the xreg argument.
I understand that the auto.arima function takes care of differencing the covariates while fitting the model (on the training-period data), and that I also don't need to difference the covariates when generating forecasts for the test period (future values).
However, when fitting the model using Arima() in R with custom (p, d, q) and (P, D, Q)[m] values where d or D is greater than 0, do I need to manually difference the covariates?
If I do the differencing, the differenced covariate matrix ends up shorter than the dependent variable.
How should one handle this?
Should I pass the covariate matrix as-is, i.e. without differencing?
Should I do the differencing but omit the first few observations for which differenced covariate data is not available?
Should I keep the actual values for the first few rows, where differenced covariate values are not available, and use differenced values for the remaining rows?
If I have to pass flag variables (1/0) in the xreg matrix, should I difference those as well, or cbind the actual flag values with the differenced values of the remaining variables?
Also, when generating forecasts for the future period, how do I pass the covariate values (as-is or after differencing)?
I am using the following code:
ndiff <- ifelse(((pdq_order == "auto") || (PDQ_order == "auto")),
                ndiffs(ts_train_PowerTransformed), pdq_order[2])
nsdiff <- ifelse(((pdq_order == "auto") || (PDQ_order == "auto")),
                 nsdiffs(ts_train_PowerTransformed), PDQ_order$order[2])
# Creating the covariate matrix after differencing
if (nsdiff >= 1) {
  if (ndiff >= 1) {
    xreg_differenced <- diff(diff(ts_CovariatesData_TrainingPeriod,
                                  lag = PDQ_order$period, differences = nsdiff),
                             lag = 1, differences = ndiff)
  } else {
    xreg_differenced <- diff(ts_CovariatesData_TrainingPeriod,
                             lag = PDQ_order$period, differences = nsdiff)
  }
} else {
  if (ndiff >= 1) {
    xreg_differenced <- diff(ts_CovariatesData, lag = 1, differences = ndiff)
  } else {
    xreg_differenced <- ts_CovariatesData
  }
}
# Fitting the model
model_arimax <- Arima(ts_train_PowerTransformed, order = pdq_order,
                      seasonal = PDQ_order, xreg = xreg_differenced)
# Generating forecasts for the test period
fit.test <- model_arimax %>%
  forecast(h = length(ts_test),
           xreg = as.data.frame(diff(diff(ts_CovariatesData_TestPeriod,
                                          lag = PDQ_order$period, differences = nsdiff),
                                     lag = 1, differences = ndiff)))
Kindly suggest.
Arima will difference both the response variable and the xreg variables as specified in the order and seasonal arguments. You should never need to do the differencing yourself.
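A minimal sketch of that advice, reusing the question's variable names (the (p,d,q) and (P,D,Q) values here are placeholders): pass the covariates undifferenced when fitting and when forecasting, and let Arima() apply the d and D differencing internally.
library(forecast)
# raw (undifferenced) covariates; flag (0/1) variables go in the same matrix, also undifferenced
fit <- Arima(ts_train_PowerTransformed,
             order = c(1, 1, 1), seasonal = c(0, 1, 1),
             xreg = as.matrix(ts_CovariatesData_TrainingPeriod))
# future covariates are passed raw as well; the forecast horizon
# is taken from the number of rows of xreg
fc <- forecast(fit, xreg = as.matrix(ts_CovariatesData_TestPeriod))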

How to create a sliding window in R to divide data into test and train samples to test accuracy of forecasts?

We are using the forecast package in R to read 3 weeks' worth of hourly data (3*7*24 data points) and make predictions for the next 24 hours. It's a time series with multiple seasonality.
We have the forecast model running just fine and it seems to be doing well. Now we wish to quantify the accuracy of our approach / forecasting algorithm for our data. We wish to use the accuracy function in the forecast package for this purpose. We understand that the accuracy function works such that if f is the forecast and x is the vector of actual observations, then accuracy(f, x) gives several accuracy measures for the forecast.
We have data from the past several months and wish to write a sliding-window algorithm that picks 3*7*24 hourly values, predicts the next 24 hours, compares those predictions against the actual data for the next day / 24 hours, displays the accuracy, then slides the window by 24 points / hours to the next day and repeats.
The sample data is generated as follows:
library("forecast")
time <- 1:(12*168)
set.seed(1)
ds <- msts(sin(2*pi*time/24)+c(1,1,1.2,0.8,1,0,0)[((time-1)%/%24)%%7+1]+ time/400+rnorm(length(time),0,0.2),seasonal.periods=c(24,168))
plot(ds)
head(ds)
tail(ds)
length(ds)
length(time)
Forecasting procedure is as follows:
model <- tbats(ds[1:504])
fcst <- forecast(model,h=24,level=90)
accuracy(fcst,ds[505:528]) ##Test accuracy of forecast against next/actual 24 values
Now, we wish to slide the "window" by 24 and repeat the same procedure, that is, the next set of values used to build the model will be ds[25:528] and their accuracy will be tested against ds[529:552] ... and so on. How can we implement this?
Also, is there a better way to test overall accuracy of this forecasting algorithm for our scenario?
I would do this by creating a vector of times representing the front edge of the sliding windows, then using lapply to iterate the forecasting and scoring process over the windows those edges imply. Like...
# set a couple of parameters we'll use to slice the series into chunks:
# window width (w) and the time step at which you want to end the first
# training set
w <- 24; start <- 504
# now use those parameters to make a vector of the time steps at which each
# window will end
steps <- seq(start + w, length(ds), by = w)
# using lapply, iterate the forecasting-and-scoring process over the
# windows those edges imply
cv_list <- lapply(steps, function(x) {
  train <- ds[1:(x - w)]
  test <- ds[(x - w + 1):x]
  model <- tbats(train)
  fcst <- forecast(model, h = w, level = 90)
  accuracy(fcst, test)
})
Example output for the first window:
> cv_list[[1]]
ME RMSE MAE MPE MAPE MASE
Training set 0.0001587681 0.3442898 0.2689754 34.3957362 84.30841 0.9560206
Test set 0.2619029897 0.8961109 0.7868256 -0.6832273 36.64301 2.7966186
ACF1
Training set 0.02588145
Test set NA
If you want summaries of the scores for the whole list, you can do something like...
rmse <- mean(unlist(lapply(cv_list, '[[', "Test set","RMSE")))
...which produces this:
[1] 1.011177
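Note that train <- ds[1:(x - w)] keeps the origin fixed, so the training set grows at every step. For the fixed-width 504-point window described in the question, the training slice can slide instead. A sketch, which also re-wraps the slice as msts because plain [ subsetting drops the seasonal periods that tbats uses:
cv_list_sliding <- lapply(steps, function(x) {
  train <- msts(ds[(x - w - 503):(x - w)], seasonal.periods = c(24, 168))
  test <- ds[(x - w + 1):x]
  accuracy(forecast(tbats(train), h = w, level = 90), test)
})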

Dual seasonal cycles in ts object

I want to strip out seasonality from a ts. This particular ts is daily, and has both yearly and weekly seasonal cycles (frequency 365 and 7).
In order to remove both, I have tried running stl() on the ts with the frequency set to 365, extracting the trend and remainder, then setting the frequency of the new ts to 7 and repeating.
This doesn't seem to be working very well, and I am wondering whether it's my approach or something inherent to the ts that is causing problems. Can anyone critique my methodology and perhaps recommend an alternative approach?
There is a very easy way to do it using a TBATS model implemented in the forecast package. Here is an example assuming your data are stored as x:
library(forecast)
x2 <- msts(x, seasonal.periods=c(7,365))
fit <- tbats(x2)
x.sa <- seasadj(fit)
Details of the model are described in De Livera, Hyndman and Snyder (JASA, 2011).
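The fitted components can also be inspected directly; tbats.components() returns them as a multiple time series (a brief sketch):
comp <- tbats.components(fit)  # level, slope (if any), and seasonal columns
plot(comp)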
An approach that can handle not only seasonal components (cyclically recurring events) but also trends (slow shifts in the norm) admirably is stl(), specifically as implemented by Rob J Hyndman.
The decomp function Hyndman gives there (reproduced below) is very helpful for checking for seasonality and then decomposing a time series into seasonal (if one exists), trend, and residual components.
decomp <- function(x, transform = TRUE)
{
  # decomposes a time series into seasonal and trend components
  # from http://robjhyndman.com/researchtips/tscharacteristics/
  require(forecast)
  # Transform series
  if(transform & min(x, na.rm = TRUE) >= 0)
  {
    lambda <- BoxCox.lambda(na.contiguous(x))
    x <- BoxCox(x, lambda)
  }
  else
  {
    lambda <- NULL
    transform <- FALSE
  }
  # Seasonal data
  if(frequency(x) > 1)
  {
    x.stl <- stl(x, s.window = "periodic", na.action = na.contiguous)
    trend <- x.stl$time.series[, 2]
    season <- x.stl$time.series[, 1]
    remainder <- x - trend - season
  }
  else # Nonseasonal data
  {
    require(mgcv)
    tt <- 1:length(x)
    trend <- rep(NA, length(x))
    trend[!is.na(x)] <- fitted(gam(x ~ s(tt)))
    season <- NULL
    remainder <- x - trend
  }
  return(list(x = x, trend = trend, season = season, remainder = remainder,
              transform = transform, lambda = lambda))
}
As you can see, it uses stl() (which uses loess) if there is seasonality and penalized regression splines if there is no seasonality.
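A hypothetical usage on the daily series from the question, treating one cycle at a time (here the weekly one; daily_values is an assumed numeric vector):
x <- ts(daily_values, frequency = 7)
dec <- decomp(x)
# trend + remainder = the series with the weekly cycle removed,
# on the (possibly Box-Cox transformed) scale stored in dec$x
weekly_adjusted <- dec$trend + dec$remainder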
Check if this is useful:
The start and end values depend on your data; change the frequency value accordingly.
splot <- ts(Data1, start=c(2010, 2), end=c(2013, 9), frequency=12)
The additive trend, seasonal, and irregular components can be decomposed using the stl() function:
fit <- stl(splot, s.window = "periodic")
monthplot(splot)
library(forecast)
vi <-seasonplot(splot)
vi should give separate values for the seasonal indices.
Also check the one below:
splot.stl <- stl(splot,s.window="periodic",na.action=na.contiguous)
trend <- splot.stl$time.series[,2]
season <- splot.stl$time.series[,1]
remainder <- splot - trend - season
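The seasonally adjusted series is then the original minus the seasonal component; forecast::seasadj(splot.stl) should give the same result:
splot_adjusted <- splot - season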
