Auto.Arima incorrectly predicts first point - r

I'm trying to complete a time series analysis of some reservoir data and am using auto.arima with a Fourier component to account for seasonality, as described here https://otexts.com/fpp2/dhr.html#dhr The code I have used is shown below and the dataset I used can be found here https://www.dropbox.com/sh/563nu3daeid0agb/AAB6NSddVUKgBCCbQtuqXPsZa?dl=0
Reservoir = read.csv("Reservoir1.csv",TRUE,",")
#impute missing data from data set
Reservoir = imputeTS::na_interpolation(Reservoir)
#Create Time Series
Reservoir = ts(Reservoir[,2],frequency = (365.25),start = c(2013,116))
plots = list()
for (i in seq (10)) {
fit = auto.arima(Reservoir, xreg = fourier(Reservoir, K = i), seasonal = FALSE)
plots[[i]] = autoplot(forecast(fit, xreg = fourier(Reservoir, K = i, h=10))) +
xlab(paste("K=",i,"AICC=",round(fit[["aicc"]],2))) + ylab("")
}
gridExtra::grid.arrange(plots[[1]],plots[[2]],plots[[3]],plots[[4]],plots[[5]],
plots[[6]],plots[[7]],plots[[8]],plots[[9]],plots[[10]],
nrow=5)
bestfit = auto.arima(Reservoir, xreg=fourier(Reservoir, K=9), seasonal=FALSE)
summary(bestfit)
checkresiduals(bestfit)
plot(Reservoir,col="red")
lines(fitted(bestfit),col="blue")
The model fits well, except for the incorrect first prediction. I'm lost as to why only this value would be so far off. Or, is this an acceptable error?

The residuals are the one-step forecast errors using all previous observations. At time 1, the residual is the forecast error with no previous observations, so it is simply based on the fitted model. In fact, it is an artificially "good" forecast because the differencing means there is no way for the model to know the location of the data until there is an observation. But the way ARIMA models are implemented in R makes the first prediction use a little more information than it should.

Related

How to decide the frequency while using the forecast function in R?

I have a series of daily data from 01-01-2014 to 31-01-2022. I want to predict the next 30 days. I am using auto.arima and it has some exogenous variables attached.
Here's the code: -
datax$NMD1<-(datax$NMD1/1000000000)
#Here to make an Arima series out of NMD 1. Exogenous variables here.
ts1<- ts(datax, frequency = 1)
class(ts1)
colnames(ts1)
autoplot(ts1[,"NMD1"])
#defining the set of exogenous variables
xset<- as.matrix(ts1[,"1Y TD INTEREST RATE"], ts1[,"BSE"], ts1[,"Repo Rate"], ts1[,"MIBOR Rate"], ts1[,"1Y OIS Rate" ], ts1[,"3M CD rate(PSU)"], ts1[,"2 Y GSec Rate"])
#Fitting the model
model1 <- auto.arima(ts1[,'NMD1'], xreg=xset, approximation = FALSE, allowmean = FALSE, allowdrift = FALSE)
summary(model1)
checkresiduals(model1)
fcast <- forecast(model1,xreg=xset, h=1)
print(summary(fcast))
autoplot(fcast)
My problems: -
While my model seems to work fine, I am not able to understand what value of h shall i put while forecasting. I also don't understand what frequency really is while we define a time series.
Please help.

How to check accuracy() in VAR model and How to determine right seasonality( is there a function)

How to check accuracy() in VAR model and How to determine the right seasonality( is there a function)
I am trying to create a VAR model. I have monthly data
Var_model <- VAR(cb, p = 1, type = "both", season = 12, exog = NULL)
I put season=12 by default since my data is monthly. How to determine seasonality?
mstl decomposition graphics
peca
waln
almo
pean
Here is the main problem. How to run accuracy() in var model?
forecast <- predict(Var_model, n.ahead = 24, ci = 0.95)
accuracy(forecast$fcst[[1]][,"fcst"], almo)
Here I think I am following procedure. accuracy(forecast, data) but still getting an error
Error in testaccuracy(object, x, test, d, D) : Not enough forecasts.
Check that forecasts and test data match
There is varresult inside of your model Var_model, so you see different accuracy metrics and define how good your model on training set like
accuracy(your_model_name$varresult$col_name)
Also you should check decomposition waives for seasonal period and lags correlation and cross-correlation and if you using VAR because all variable will impacting each other on forecast level.

LSTM time series forecasting, predictions stabilize

My code is in R using the Keras and Tensorflow libraries. I'm creating an LSTM model to forecast 100 future values. My input shape is (100,200,1).
Let's say my input data is X. I make a prediction at time step t=201 and get the column Y of predictions. Then I create Xnew = c(X[2:200],Y) a new variable where I concatenate X (except for the first column) and Y. I use this Xnew to predict the next time step.
What's happening is that, after a certain number of predicted future time steps (around 15), the predictions become constant for each time step afterwards. Does anyone know why this happens?
prdvec = function(dat,modname, numpreds, cnt, scl){
model = load_model_hdf5(modname)
inpt = dat
pred = list()
for(i in 1:numpreds){
pred[[i]] <- predict(model, reshape_X_3d((inpt[,1:ncol(inpt)]-cnt)/scl), batch_size = 1)
inpt = cbind(inpt[,2:ncol(inpt)],(pred[[i]]*scl+cnt))
print(i)
flush.console()
}
pred
}
I encounter a similar problem. Maybe when the LSTM units take into input created by itself, it tends to stabilize.

How to predict future values of time series using h2o.predict

I am going through the book "Hands-on Time series analysis with R" and I am stuck at the example using machine learning h2o package. I don't get how to use h2o.predict function. In the example it requires newdata argument, which is test data in this case. But how do you predict future values of time series if you in fact don't know these values ?
If I just ignore newdata argument I get : predictions with a missing newdata argument is not implemented yet.
library(h2o)
h2o.init(max_mem_size = "16G")
train_h <- as.h2o(train_df)
test_h <- as.h2o(test_df)
forecast_h <- as.h2o(forecast_df)
x <- c("month", "lag12", "trend", "trend_sqr")
y <- "y"
rf_md <- h2o.randomForest(training_frame = train_h,
nfolds = 5,
x = x,
y = y,
ntrees = 500,
stopping_rounds = 10,
stopping_metric = "RMSE",
score_each_iteration = TRUE,
stopping_tolerance = 0.0001,
seed = 1234)
h2o.varimp_plot(rf_md)
rf_md#model$model_summary
library(plotly)
tree_score <- rf_md#model$scoring_history$training_rmse
plot_ly(x = seq_along(tree_score), y = tree_score,
type = "scatter", mode = "line") %>%
layout(title = "Random Forest Model - Trained Score History",
yaxis = list(title = "RMSE"),
xaxis = list(title = "Num. of Trees"))
test_h$pred_rf <- h2o.predict(rf_md, test_h)
test_1 <- as.data.frame(test_h)
mape_rf <- mean(abs(test_1$y - test_1$pred_rf) / test_1$y)
mape_rf
H2O-3 does not support traditional time series algorithms (e.g., ARIMA). Instead, the recommendation is to treat the time series use case as a supervised learning problem and perform time-series specific pre-processing.
For example, if your goal was to predict the sales for a store tomorrow, you could treat this as a regression problem where your target would be the Sales. If you try to train a supervised learning model on the raw data, however, chances are your performance would be pretty poor. So the trick is to add historical attributes like lags as a pre-processing step.
If we trained a model on an unaltered dataset, the Mean Absolute Error is around 35%.
If we start adding historical features like the sales from the previous day for that store, we can reduce the Mean Absolute Error to about 15%.
While H2O-3 does not support lagging, you can leverage Sparkling Water to perform this pre-processing. You can use Spark to generate the lags per group and then use H2O-3 to train the regression model. Here is an example of this process: https://github.com/h2oai/h2o-tutorials/tree/master/best-practices/forecasting
The training data, train_df has to have all the columns listed in both x (c("month", "lag12", "trend", "trend_sqr")) and y ("y"), whereas the data you give to h2o.predict() just has to have the columns in x; the y-column is what will be returned as the prediction.
As you have features (in x) that are things like lag, trend, etc. the fact that it is a time series does not matter. (But you do have to be very careful when preparing those features to make sure you do not use any information in them that was not known at that point in time - but I would imagine the book has already been emphasizing that.)
Normally with a time series, for a given row in the training data, your x data is the data known at time t, and the value in the y column is the value of interest at time t+1. So when doing a prediction, you give the x values as the values at the moment, and the prediction returned is what will happen next.

arima model for multiple seasonalities in R

I'm learning to create a forecasting model for time series that has multiple seasonalities. Following is the subset of dataset that I'm refering to. This dataset includes hourly data points and I wish to include daily as well as weekly seasonalities in my arima model. Following is the subset of dataset:
data= c(4,4,1,2,6,21,105,257,291,172,72,10,35,42,77,72,133,192,122,59,29,25,24,5,7,3,3,0,7,15,91,230,284,147,67,53,54,55,63,73,114,154,137,57,27,31,25,11,4,4,4,2,7,18,68,218,251,131,71,43,55,62,63,80,120,144,107,42,27,11,10,16,8,10,7,1,4,3,12,17,58,59,68,76,91,95,89,115,107,107,41,40,25,18,14,15,6,12,2,4,1,6,9,14,43,67,67,94,100,129,126,122,132,118,68,26,19,12,9,5,4,2,5,1,3,16,89,233,304,174,53,55,53,52,59,92,117,214,139,73,37,28,15,11,8,1,2,5,4,22,103,258,317,163,58,29,37,46,54,62,95,197,152,58,32,30,17,9,8,1,3,1,3,16,109,245,302,156,53,34,47,46,54,65,102,155,116,51,30,24,17,10,7,4,8,0,11,0,2,225,282,141,4,87,44,60,52,74,135,157,113,57,44,26,29,17,8,7,4,4,2,10,57,125,182,100,33,27,41,39,35,50,69,92,66,30,11,10,11,9,6,5,10,4,1,7,9,17,24,21,29,28,48,38,30,21,26,25,35,10,9,4,4,4,3,5,4,4,4,3,5,10,16,28,47,63,40,49,28,22,18,27,18,10,5,8,7,3,2,2,4,1,4,19,59,167,235,130,57,45,46,42,40,49,64,96,54,27,17,18,15,7,6,2,3,1,2,21,88,187,253,130,77,47,49,48,53,77,109,147,109,45,41,35,16,13)
The code I'm trying to use is following:
tsdata = ts (data, frequency = 24)
aicvalstemp = NULL
aicvals= NULL
for (i in 1:5) {
for (j in 1:5) {
xreg1 = fourier(tsdata,i,24)
xreg2 = fourier(tsdata,j,168)
xregs = cbind(xreg1,xreg2)
armodel = auto.arima(bike_TS_west, xreg = xregs)
aicvalstemp = cbind(i,j,armodel$aic)
aicvals = rbind(aicvals,aicvalstemp)
}
}
The cbind command in the above command fails because the number of rows in xreg1 and xreg2 are different. I even tried using 1:length(data) argument in the fourier function but that also gave me an error. If someone can rectify the mistakes in the above code to produce a forecast of next 24 hours using an arima model with minimum AIC values, it would be really helpful. Also if you can include datasplitting in your code by creating training and testing data sets, it would be totally awesome. Thanks for your help.
I don't understand the desire to fit a weekly "season" to these data as there is no evidence for one in the data subset you provided. Also, you should really log-transform the data because they do not reflect a Gaussian process as is.
So, here's how you could fit models with a some form of hourly signals.
## the data are not normal, so log transform to meet assumption of Gaussian errors
ln_dat <- log(tsdata)
## number of hours to forecast
hrs_out <- 24
## max number of Fourier terms
max_F <- 5
## empty list for model fits
mod_res <- vector("list", max_F)
## fit models with increasing Fourier terms
for (i in 1:max_F) {
xreg <- fourier(ln_dat,i)
mod_res[[i]] <- auto.arima(tsdata, xreg = xreg)
}
## table of AIC results
aic_tbl <- data.frame(F=seq(max_F), AIC=sapply(mod_res, AIC))
## number of Fourier terms in best model
F_best <- which(aic_tbl$AIC==min(aic_tbl$AIC))
## forecast from best model
fore <- forecast(mod_res[[F_best]], xreg=fourierf(ln_dat,F_best,hrs_out))

Resources