tsCV() function versus train/test split for time-series in R

The data is monthly and stationary. I have an ARIMA model built with auto.arima; a couple of diagnostics (ACF, the ADF test, etc.) were applied to the data before fitting the model.
y is my monthly time series, created with the ts() function:
myarima <- auto.arima(y, stepwise = FALSE, approximation = FALSE, trace = TRUE)
Then I use the forecast function:
fc <- forecast(myarima, h = 10)
autoplot(fc)
For this case I did not create train and test sets, because my data fluctuates at the end: if I held out the last 10 months as a test set (to match the forecast horizon), the model would never see those fluctuations during training. It would be great to be enlightened about k-fold cross-validation as a way to avoid this kind of scenario.
Without train and test split, after creating the model and visualizing the forecast, I went for tsCV():
myforecast_arima <- function(x, h) {
  # stepwise and approximation are arguments to auto.arima, not forecast
  forecast(auto.arima(x, stepwise = FALSE, approximation = FALSE), h = h)
}
error_myarima <- tsCV(y, myforecast_arima, h = 10)
mean(error_myarima^2, na.rm = TRUE)  # MSE across all forecast origins
This gives me a fairly low MSE value of around 0.30.
So my questions are: is this a trustworthy way to evaluate ARIMA models before deploying them, or should I use a train/test split instead? What would you prefer in general? Should I use any other method? How can I determine the window parameter of the tsCV() function? If my pathway is correct, how can I improve it? And what are the biggest differences between k-fold CV and tsCV()?
Thank you!
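
One note on the window parameter: by default tsCV() re-fits on an expanding window starting from the first observation, while passing window restricts each fit to a rolling window of fixed length. A minimal sketch of the rolling variant (the 60-month window length here is an arbitrary illustration, not a recommendation):

library(forecast)

# Rolling-origin CV with a fixed-length training window:
# each auto.arima fit sees only the most recent 60 observations
error_windowed <- tsCV(y, myforecast_arima, h = 10, window = 60)
mean(error_windowed^2, na.rm = TRUE)  # compare with the expanding-window MSE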

Related

How do I decide between different forecasting model families to automate forecasting for 150 time series?

I have weekly time series data for multiple departments (retail domain) and based on some research, I am automating the process of finding model parameters for each time series. So far, I have implemented the following models for each time series in a for loop:
1) ARIMA (auto.arima in R)
2) stlf (cannot use R's ets function since I have weekly data)
3) TBATS
4) Regression on ARIMA errors (using fourier terms)
5) Baseline models: naive & mean
I want to understand how to choose models for each time series. I have multiple approaches to this:
1) Choose model with lowest RMSE on test data (risk: overfitting on test data)
2) Choose model with lowest RMSE based on cross-validation of the time series (tsCV)
3) Choose one family of models for all the time series based on which family gives lowest average RMSE score across all the time series.
Are there any ways I can improve my approach? Any disadvantages to any of the above approaches? Any better approach?
Thanks a lot!
Forecast your data with all of the forecasting methods mentioned above, then calculate the MAPE and check which model gives the best results; use that model to forecast your data.
Also try different data transformations, such as log or inverse, on your input data.
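
One way to make that comparison concrete is to fit each candidate on a training window, forecast the hold-out period, and tabulate the accuracy measures. A rough sketch along those lines, assuming each y is a weekly ts (frequency 52) and a 13-week hold-out (both arbitrary choices here):

library(forecast)

h <- 13                                    # assumed hold-out length (13 weeks)
train <- subset(y, end = length(y) - h)    # forecast::subset.ts keeps the ts attributes
test  <- subset(y, start = length(y) - h + 1)

candidates <- list(
  arima = forecast(auto.arima(train), h = h),
  stlf  = stlf(train, h = h),
  tbats = forecast(tbats(train), h = h),
  naive = naive(train, h = h),
  meanf = meanf(train, h = h)
)

# Test-set RMSE and MAPE for each candidate model
sapply(candidates, function(f) accuracy(f, test)["Test set", c("RMSE", "MAPE")])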

Accuracy Function

I am doing a TS Analysis. What is the difference between these two accuracies:
fit <- auto.arima(tsdata)
fcast <- forecast(fit, 6)
accuracy(fcast)                #### First Accuracy
fit <- auto.arima(tsdata)
fcast <- forecast(fit, 6)
accuracy(fcast, actual_values) #### Second Accuracy (actual_values = the observed hold-out data)
How does the accuracy function work when I don't specify the actual values, as in the first case?
Secondly, what is the right approach to calculating accuracy?
In this answer I'm assuming you are using the function from the forecast package.
The answer lies within accuracy's description:
Returns range of summary measures of the forecast accuracy. If x is provided, the function measures out-of-sample (test set) forecast accuracy based on x-f. If x is not provided, the function only produces in-sample (training set) accuracy measures of the forecasts based on f["x"]-fitted(f). All measures are defined and discussed in Hyndman and Koehler (2006).
Here x is the second argument of the function. So, in short, accuracy(fcst) provides an estimate of the prediction error based on the training set.
For example, let's assume you have 12 months of data and are predicting 6 months ahead. If you use accuracy(fcst), you get the error of the model over those 12 months only.
Now, let's assume x is the real demand for the 6 months you are forecasting, and that you didn't use this data to build the ARIMA model. In this case accuracy(fcst, x) gives you the test set error, which is a better measure of what you will get in the future using this model (compared to the training set error).
Best practice is to use the test set error, because this measure is less prone to bias: you will most likely get "better" prediction results on the training set than on a held-out test set, but those results are a form of overfitting. If you have a test set, you should pass it as the second argument.
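
A minimal sketch of that workflow with the forecast package, reusing the 12-months-plus-6 example (the split dates here are placeholders):

library(forecast)

train <- window(tsdata, end = c(2019, 12))    # assumed: 12 months of training data
test  <- window(tsdata, start = c(2020, 1))   # assumed: the 6 months being forecast

fit   <- auto.arima(train)
fcast <- forecast(fit, h = length(test))

accuracy(fcast)        # in-sample (training set) measures only
accuracy(fcast, test)  # adds the out-of-sample (test set) row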

Difference between simulate() and forecast() in "forecast" package

I am working on building a time series model.
However, I am having trouble understanding what the difference is between the simulate function and the forecast function in the forecast package.
Suppose I built an ARIMA model and want to use it to simulate future values up to 10 years out. The data is hourly and we have a year's worth of data.
When using forecast() to predict the next 1000 steps ahead, I got the following plot.
[Plot: forecast method]
Then I used the simulate function to generate the next 1000 simulated values and got the following plot.
[Plot: simulate method]
Data points after the red line are simulated data points.
In the latter example, I used the following code to simulate the future values:
simulate(arima1, nsim = 1000, future = TRUE, bootstrap = TRUE)
where arima1 is my trained ARIMA model; bootstrapped residuals are used because the model residuals are not very normal.
Per the definition in the forecast package, future=TRUE means that we are simulating future values conditional on the historical data.
Can anyone tell me what the difference is between these two methods? Why does simulate() give me much more realistic results, while the forecast values from forecast() just converge to a constant after several steps, with none of the fluctuation seen in the results from simulate()?
A simulation is a possible future sample path of the series.
A point forecast is the mean of all possible future sample paths. So the point forecasts are usually much less variable than the data.
The forecast function produces point forecasts (the mean) and interval forecasts containing the estimated variation in the future sample paths.
As a side point, an ARIMA model is not appropriate for this time series because of the skewness. You might need to use a transformation first.
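
To see the distinction directly, one can plot the mean forecast next to a few individual sample paths; a quick sketch, where arima1 is the fitted model from the question:

library(forecast)

fc <- forecast(arima1, h = 1000)   # point forecasts: the mean of all future sample paths
autoplot(fc)                       # the mean flattens out; the intervals stay wide

# A few individual sample paths, each as variable as the historical data
set.seed(1)
paths <- replicate(5, simulate(arima1, nsim = 1000, future = TRUE, bootstrap = TRUE))
matplot(paths, type = "l", lty = 1, ylab = "simulated future values")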

Forecast future values for a time series using support vector machines

I am using support vector regression in R to forecast future values for a univariate time series. Splitting the historical data into training and test sets, I fit a model using the svm function on the training data, then use the predict() command to generate values for the test set, from which prediction errors can be computed. I wonder what happens then? We have a model, and by checking it on the held-out data we see that it is efficient. How can I use this model to predict future values beyond the historical data? Generally speaking, we use the predict function in R and give it a forecast horizon (h = 12) to predict 12 future values. Based on what I saw, the predict() command for SVM has no such argument and needs a new dataset. How should I build a dataset for predicting future values that are not in our historical data?
Thanks
Just a stab in the dark... SVM is not for prediction but for classification, specifically supervised classification. I am guessing you are trying to predict stock values, no? How about classifying your existing data, using some window size of your choice, say 100 values at a time, as noise (N), up (U), big up (UU), down (D), or big down (DD)? That way, as your data comes in, you slide your classification frame and it tells you whether the upcoming trend is N, U, UU, D, or DD.
What you can do is build a data frame with columns for the actual stock price and its n lagged values, and use it as a train set/test set (the actual value is the output and the previous values are the explanatory variables). With this method you can make a 1-day-ahead forecast (or whatever the granularity is), then feed your prediction back in to make another one, and so on, as in the sketch below.
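
A minimal sketch of that idea using the e1071 package (the series y, the lag order n = 12, and the horizon h = 12 are all assumptions for illustration):

library(e1071)

n <- 12                                    # assumed number of lags
emb <- embed(as.numeric(y), n + 1)         # column 1 = y_t, columns 2..n+1 = its lags
train <- data.frame(yt = emb[, 1], emb[, -1])
fit <- svm(yt ~ ., data = train)           # numeric response: support vector regression

# Recursive multi-step forecast: feed each prediction back in as a lag
h <- 12
lags <- rev(tail(as.numeric(y), n))        # most recent value first (lag 1, lag 2, ...)
preds <- numeric(h)
for (i in seq_len(h)) {
  newdata <- as.data.frame(t(lags))
  names(newdata) <- names(train)[-1]
  preds[i] <- predict(fit, newdata)
  lags <- c(preds[i], lags[-n])            # shift: the new prediction becomes lag 1
}
preds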

Multivariate time series model using MARSS package (or maybe dlm)

I have two temporal processes. I would like to see whether one temporal process (X_{t,2}) can be used to improve forecasts of the other process (X_{t,1}). I have multiple sources providing temporal data on X_{t,2} (e.g., 3 time series measuring X_{t,2}). All the series require a seasonal component.
I found MARSS' notation to be pretty natural to fit this type of model and the code looks like this:
Z <- factor(c("R", "S", "S", "S"))            # observation matrix: which state each series measures
B <- matrix(list(1, 0, "beta", 1), 2, 2)      # evolution matrix; "beta" links X_{t,2} to X_{t,1}
A <- "zero"                                   # demeaned
R <- matrix(list(0), 4, 4); diag(R) <- c("r", "s", "s", "s")  # observation error variances
Q <- "diagonal and unequal"
U <- "zero"
period <- 12
per.1st <- 1                                  # now create factors for seasons
c.in <- diag(period)
for (i in 2:(ceiling(TT / period))) { c.in <- cbind(c.in, diag(period)) }  # TT = number of time steps
c.in <- c.in[, (1:TT) + (per.1st - 1)]
rownames(c.in) <- month.abb
C <- "unconstrained"                          # 2 x 12 matrix of seasonal effects
dlmfit <- MARSS(data, model = list(Z = Z, B = B, Q = Q, C = C, c = c.in, R = R, A = A, U = U))
I got a beta estimate implying that the second temporal process is useful in forecasting the first, but to my dismay MARSS gives me an error when I use MARSSsimulate to forecast, because one of the matrices (related to seasonality) is time-varying.
Does anyone know a way around this issue with the MARSS package? And if not, any tips on fitting an analogous model using, say, the dlm package?
I was able to represent my state-space model in a form suitable for the dlm package, but I encountered some problems with dlm too. First, the ML estimates are VERY unstable; I bypassed this issue by constructing the dlm model from the MARSS estimates. However, dlmFilter is not working properly. I think the issue is that dlmFilter is not designed to deal with models that have multiple sources for one time series plus additional seasonal components. dlmForecast, though, gives me the forecasts I need!
In summary, for my multivariate time series model (with multiple sources providing data for one of the temporal processes), the MARSS library gave me reasonable parameter estimates and allowed me to obtain filtered and smoothed values of the states, but forecasting was not possible. On the other hand, dlm gave fishy estimates for my model and dlmFilter didn't work, but I was able to use dlmForecast to produce forecasts from the model I fitted in MARSS and re-expressed in dlm-appropriate form.
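
For reference, a univariate skeleton of a local-level-plus-seasonal model in dlm, ending in dlmForecast, might look like the following. It sidesteps the multi-source observation structure entirely; y1 (standing in for the X_{t,1} series) and the starting parameters are assumptions:

library(dlm)

# Local level plus monthly seasonal component; variances estimated by ML
build <- function(par) {
  dlmModPoly(order = 1, dV = exp(par[1]), dW = exp(par[2])) +
    dlmModSeas(frequency = 12, dV = 0, dW = c(exp(par[3]), rep(0, 10)))
}
mle  <- dlmMLE(y1, parm = c(0, 0, 0), build = build)
mod  <- build(mle$par)
filt <- dlmFilter(y1, mod)
fc   <- dlmForecast(filt, nAhead = 12)
fc$f   # forecast means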
