Here's the background. I have a time series of daily sales, and I have already built several models: ARIMA (arima), STL decomposition (stl), Holt-Winters (hw), and an exponential smoothing state space model (the ets fit reduces to a Holt-Winters model, since the returned model has additive error, trend and seasonality).
The data is non-stationary, with a trend and weekly seasonality, which can also be confirmed by spectral analysis (a periodogram). Using cross-validation of 1-step to 15-step-ahead forecasts, I found that the STL decomposition gave the best results in terms of MAE.
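A minimal sketch of the periodogram check (assuming the daily sales series is stored in y, the same object passed to dlmMLE below):
# raw periodogram; for a plain numeric vector the frequency axis is in
# cycles per observation, so weekly seasonality shows up as a peak near 1/7
spec.pgram(y, taper = 0, log = "no")
abline(v = 1/7, lty = 2)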
Then I started working on a dynamic linear model to see whether I could build a better forecasting model. The model is also written in R, with the dlm package, as a local linear trend + seasonal + ARMA model; the code is below.
library(dlm)
build <- function(parm) {
level0 <- 20
slope0 <- 1
# Level + Trend
trend <- dlmModPoly(order = 2, dV = parm[1], dW = exp(parm[2:3]),
m0 = c(level0, slope0),
C0 = 400*diag(2))
# Seasonal Term
# Season Factor model
# season <- dlmModSeas(frequency = 7, dW = c(parm[4], rep(0, 5)))
# Fourier Form Seasonal Model
season <- dlmModTrig(s = 7, q = 2, dW = rep(c(parm[4], parm[5]), each = 2))
# ARMA Term
arma <- dlmModARMA(ar = ARtransPars(parm[6:7]), ma = parm[8:9], sigma2 = parm[10])
return(trend + season + arma)
}
# MLE for parameter estimation
init <- c(1e-07, -3, -1, 5, 4, 0.5, 0.4, 0.7, 0.3, 1)
fit_dlm <- dlmMLE(y, parm = init, build, hessian = TRUE)
dlmSales <- build(fit_dlm$par)
f1 <- dlmForecast(dlmSales, n = 16)
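A quick sanity check on the MLE step (a sketch using the fit_dlm object above) is to look at the optimizer's convergence code and approximate standard errors from the returned Hessian:
# dlmMLE() wraps optim(), so a convergence code of 0 means the optimizer reported success
fit_dlm$convergence
# the Hessian is that of the negative log-likelihood, so its inverse
# approximates the asymptotic covariance of the parameter estimates
sqrt(diag(solve(fit_dlm$hessian)))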
My first question: in the mathematical model there should be only one observation error, which means there should also be only one dV in the R model. So I only set the dV argument in dlmModPoly; is this correct?
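One way to check this directly (a sketch, using the build() and init defined above) is to inspect the observation variance of the combined model with the V() accessor; the superposed model carries a single 1 x 1 observation variance:
mod <- build(init)
# V() returns the observation variance of the superposed model, so you can
# confirm it is the single value set via dlmModPoly's dV argument
# (plus whatever the seasonal and ARMA components contribute by default)
V(mod)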
I know that MLE is not necessarily considered the best way to estimate the unknown parameters. The book Dynamic Linear Models with R also describes methods such as Bayesian inference with discount factors (with or without a time-variant dV) and simulation-based Bayesian inference.
Here are my questions:
How do I use dlmFilterDF if my model consists of three parts, i.e. mod <- dlmModPoly + dlmModTrig + dlmModARMA? Can I call modFilt <- dlmFilterDF(y, mod, DF = 0.9) directly? I think Bayesian inference with a discount factor assumes that Ft and Gt are known, but the AR and MA coefficients in my case are unknown parameters, so this method might not work?
When applying dlmGibbsDIG, can I assign vectors to a.theta and b.theta? The unknown parameter vector psi in my case has length 6 (1 for dV, 2 for the local linear dW, 2 for the seasonal dW, 1 for the ARMA dW). Also, the same concern as above: the AR and MA coefficients are unknown parameters in my case; can they be estimated with Gibbs sampling? Probably hybrid sampling is a better choice, as suggested in section 4.6.1.
I am working with arima0() and the co2 dataset. I would like to plot the arima0() model over my data. I have tried fitted() and curve() with no success.
Here is my code:
###### Time Series
# format: time series
data(co2)
# format: matrix
dmn <- list(month.abb, unique(floor(time(co2))))
co2.m <- matrix(co2, 12, dimnames = dmn)
co2.dt <- pracma::detrend(co2.m, tt = 'linear')
co2.dt <- ts(as.numeric(co2.dt), start = c(1959,1), frequency=12)
# seasonal difference (lag 12)
co2.dt.dif <- diff(co2.dt,lag = 12)
# additional non-seasonal first difference
co2.dt.dif2 <- diff(co2.dt.dif,lag = 1)
With the data prepared, I ran the following arima0:
results <- arima0(co2.dt.dif2, order = c(2,0,0), method = "ML")
resultspredict <- predict(results, n.ahead = 36)
I would like to plot the fitted model and the predictions over the data, ideally in base R.
Session 1: To begin with...
To be honest, I am rather worried about the way you are modelling the co2 time series. Something has already gone wrong when you de-trended co2. Why use tt = "linear"? You fit a linear trend within each period (i.e., year) and take the residuals for further inspection. This is often not recommended, as it tends to introduce artificial effects into the residual series. I would be inclined to use tt = "constant", i.e., simply subtracting the yearly average. This at least preserves the within-season correlation of the original data.
Perhaps you want to see some evidence here. Consider using ACF to help you diagnose.
data(co2)
## de-trend by dropping yearly average (no need to use `pracma::detrend`)
yearlymean <- ave(co2, gl(39, 12), FUN = mean)
co2dt <- co2 - yearlymean
## de-trend by dropping within season linear trend
co2.m <- matrix(co2, 12)
co2.dt <- pracma::detrend(co2.m, tt = "linear")
co2.dt <- ts(as.numeric(co2.dt), start = c(1959, 1), frequency = 12)
## compare time series and ACF
par(mfrow = c(2, 2))
ts.plot(co2dt); acf(co2dt)
ts.plot(co2.dt); acf(co2.dt)
Both de-trended series show a strong seasonal effect, so further seasonal differencing is required.
## seasonal differencing
co2dt.dif <- diff(co2dt, lag = 12)
co2.dt.dif <- diff(co2.dt, lag = 12)
## compare time series and ACF
par(mfrow = c(2, 2))
ts.plot(co2dt.dif); acf(co2dt.dif)
ts.plot(co2.dt.dif); acf(co2.dt.dif)
The ACF of co2.dt.dif has more significant negative correlations, which is a sign of over-de-trending, so we prefer co2dt. After seasonal differencing, co2dt.dif is already stationary, and no further differencing is needed (otherwise we would just over-difference it and introduce more negative autocorrelation).
The big negative spike at lag 1 (one season) in the ACF of co2dt.dif suggests a seasonal MA term; the positive spikes within the season suggest a mild AR process. So consider:
## exclude the mean: when it is included, its estimate is essentially 0
fit <- arima0(co2dt.dif, order = c(1,0,0), seasonal = c(0,0,1), include.mean = FALSE)
To judge whether this model does a good job, we need to inspect the ACF of its residuals:
acf(fit$residuals)
Looks like this model is decent (actually pretty great).
For prediction purposes, it is actually a better idea to integrate the seasonal differencing of co2dt with the model fitting of co2dt.dif. Let's do
fit <- arima0(co2dt, order = c(1,0,0), seasonal = c(0,1,1), include.mean = FALSE)
This gives exactly the same estimates of the AR and MA coefficients as the two-stage work above, but now prediction is handled easily by a single predict call.
## 3 years' ahead prediction (no prediction error; only mean)
predco2dt <- predict(fit, n.ahead = 36, se.fit = FALSE)
Let's plot co2dt, fitted model and prediction together:
fittedco2dt <- co2dt - fit$residuals
ts.plot(co2dt, fittedco2dt, predco2dt, col = 1:3)
The result looks very promising!
Now the final stage is to map this back to the original co2 series. For the fitted values, we just add back the yearly mean we dropped earlier:
fittedco2 <- fittedco2dt + yearlymean
But for prediction it is more difficult, because we don't know what the yearly means in the future would be. In this regard our modelling, although it looks good, is not practically useful. I will discuss a better idea in the next session. To finish this session, we plot co2 with its fitted values only:
ts.plot(co2, fittedco2, col = 1:2)
Session 2: A better idea for time series modelling
In the previous session, we saw the difficulty in prediction when de-trending and modelling of the de-trended series are kept separate. Now we try to combine the two stages in one go.
The seasonal pattern of co2 is really strong, so we need a seasonal differencing anyway:
data(co2)
co2dt <- diff(co2, lag = 12)
par(mfrow = c(1,2)); ts.plot(co2dt); acf(co2dt)
After this seasonal differencing, co2dt still does not look stationary, so we need a further non-seasonal differencing.
co2dt.dif <- diff(co2dt)
par(mfrow = c(1,2)); ts.plot(co2dt.dif); acf(co2dt.dif)
The negative spikes within and between seasons suggest that MA terms are needed at both the non-seasonal and the seasonal level. We need not work with co2dt.dif explicitly; we can work with co2 directly:
fit <- arima0(co2, order = c(0,1,1), seasonal = c(0,1,1))
acf(fit$residuals)
Now the residuals look essentially uncorrelated! So we have an ARIMA(0,1,1)(0,1,1)[12] model for the co2 series.
As usual, fitted values are obtained by subtracting residuals from data:
co2fitted <- co2 - fit$residuals
Predictions are made by a single call to predict:
co2pred <- predict(fit, n.ahead = 36, se.fit = FALSE)
Let's plot them together:
ts.plot(co2, co2fitted, co2pred, col = 1:3)
Oh, this is just gorgeous!
Session 3: Model selection
The story could finish here, but I would like to make a comparison with auto.arima from the forecast package, which can automatically decide on the "best" model.
library(forecast)
autofit <- auto.arima(co2)
#Series: co2
#ARIMA(1,1,1)(1,1,2)[12]
#
#Coefficients:
# ar1 ma1 sar1 sma1 sma2
# 0.2569 -0.5847 -0.5489 -0.2620 -0.5123
#s.e. 0.1406 0.1204 0.5880 0.5701 0.4819
#
#sigma^2 estimated as 0.08576: log likelihood=-84.39
#AIC=180.78 AICc=180.97 BIC=205.5
auto.arima has chosen ARIMA(1,1,1)(1,1,2)[12], which is considerably more complicated: it uses the same seasonal and non-seasonal differencing but adds a non-seasonal AR term, a seasonal AR term and an extra seasonal MA term.
Our model, based on the step-by-step investigation above, is an ARIMA(0,1,1)(0,1,1)[12]:
fit <- arima0(co2, order = c(0,1,1), seasonal = c(0,1,1))
#Call:
#arima0(x = co2, order = c(0, 1, 1), seasonal = c(0, 1, 1))
#
#Coefficients:
# ma1 sma1
# -0.3495 -0.8515
#s.e. 0.0497 0.0254
#
#sigma^2 estimated as 0.08262: log likelihood = -85.98, aic = 177.96
The AIC values suggest our model is better. So does the BIC:
BIC = -2 * loglik + log(n) * p
We have n <- length(co2) observations and p <- length(fit$coef) + 1 parameters (the extra one for sigma^2), so our model has BIC
-2 * fit$loglik + log(n) * p
# [1] 196.5503
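For comparison, the same formula applied to the auto.arima fit (the autofit object above) gives a value in line with the BIC reported in its printout; this is a sketch, and small differences can come from the effective sample size used after differencing.
n <- length(co2)
p_auto <- length(autofit$coef) + 1   # 5 coefficients plus sigma^2
-2 * autofit$loglik + log(n) * p_auto
# approximately 205.7, close to the BIC = 205.5 reported by auto.arima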
So auto.arima has over-fitted the data.
In fact, as soon as we see ARIMA(1,1,1)(1,1,2)[12], we have a strong suspicion of over-fitting, because different effects can "cancel" each other out. This happens with the additional seasonal MA and non-seasonal AR terms introduced by auto.arima: an AR term introduces positive autocorrelation, while an MA term introduces negative autocorrelation.
I have fitted an ARIMA model to my data and need to simulate from it 10 years into the future (approximately 3652 days, as the data is daily). This was the best-fit model chosen by auto.arima for my data; my question is how to simulate it into the future.
mydata.arima505 <- arima(d.y, order=c(5,0,5))
The forecast package has the simulate.Arima() function which does what you want. But first, use the Arima() function rather than the arima() function to fit your model:
library(forecast)
mydata.arima505 <- Arima(d.y, order = c(5,0,5))
future_y <- simulate(mydata.arima505, 100)
That will simulate 100 future observations conditional on the past observations using the fitted model.
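Since the question asks for roughly ten years of daily data, the same call with a larger nsim would be (a sketch using the objects above):
# simulate ~10 years of daily observations into the future
future_y <- simulate(mydata.arima505, nsim = 3652)
plot(future_y)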
If your question is about simulating a specific ARIMA process, you can use the function arima.sim(). But I am not sure that is really what you want; usually you would use your model to make predictions.
library(forecast)
# True Data Generating Process
y <- arima.sim(model=list(ar=0.4, ma = 0.5, order =c(1,0,1)), n=100)
# Fit an ARIMA model to the simulated data
fit <- auto.arima(y)
# Use the estimates in a new simulation
arima.sim(list(ar = fit$coef["ar1"], ma = fit$coef["ma1"]), n = 50)
#Use the model to make predictions
predicted_values <- predict(fit, n.ahead = 50)
I'm trying to forecast time in, time out ("TiTo") for someone ordering food at a restaurant using the code below. TiTo is the total time from when a customer walks through the door to when they get their food. TimeTT is the time the customer spends talking to the waiter. I believe TimeTT is a predictor of TiTo, and I would like to use it as a covariate in the forecast for TiTo. I've read a bit about ARIMA, and as I understand it you add predictors to the model via the xreg parameter. I think of the xreg parameter as something like the independent variable in a regression model, as in lm(TiTo ~ TimeTT). Is this the correct way to think of xreg? Also, what does the error message below mean? Do I need to convert TimeTT into a time series to use it in xreg? I'm new to forecasting, so all help is very much appreciated.
Forecast Attempt:
OV<-zoo(SampleData$TiTo, order.by=SampleData$DateTime)
eData <- ts(OV, frequency = 24)
Train <-eData[1:15000]
Test <- eData[15001:20809]
Arima.fit <- auto.arima(Train)
Acast<-forecast(Arima.fit, h=5808, xreg = SampleData$TimeTT)
Error:
Error in if (ncol(xreg) != ncol(object$call$xreg)) stop("Number of regressors does not match fitted model") :
argument is of length zero
Data:
dput(Train[1:5])
c(1152L, 1680L, 1680L, 968L, 1680L)
dput(SampleData[1,]$TimeTT)
structure(1156L, .Label = c("0.000000", "0.125000", "0.142857",
"96.750000", "97.800000", "99.000000", "99.600000", "NULL"), class = "factor")
You need to supply the xreg when you estimate the model itself, and the regressor values need to be available (known or forecast) for the future periods as well. So this will look something like:
Arima.fit <- auto.arima(Train, xreg = SampleData$TimeTT)
forecast(Arima.fit, h = 508, xreg = NewData$TimeTT)
Here is an example using Arima and xreg from Rob Hyndman (here is the link to the example; to read more about using contemporaneous covariates in ARIMA models, go here); the same approach works with auto.arima.
n <- 2000
m <- 200
y <- ts(rnorm(n) + (1:n)%%100/30, f=m)
library(forecast)
fit <- Arima(y, order=c(2,0,1), xreg=fourier(y, K=4))
plot(forecast(fit, h=2*m, xreg=fourierf(y, K=4, h=2*m)))
Hope this helps.
I am using arima() and auto.arima() in R to predict sales. The data is at the weekly level for three years.
My code looks like this:
x<-c(1571,1501,895,1335,2306,930,2850,1380,975,1080,990,765,615,585,838,555,1449,615,705,465,165,630,330,825,555,720,615,360,765,1080,825,525,885,507,884,1230,342,615,1161,
1585,723,390,690,993,1025,1515,903,990,1510,1638,1461.67,1082,1075,2315,1014,2140,1572,794,1363,1184,1248,1344,1056,816,720,896,608,624,560,512,304,640,640,704,1072,768,
816,640,272,1168,736,1003,864,658.67,768,841,1727,944,848,432,704,850.67,1205,592,1104,976,629,814,1626,933.33,1100.33,1730,2742,1552,1038,826,1888,1440,1372,824,1824,1392,1424,768,464,
960,320,384,512,478,1488,384,338.67,176,624,464,528,592,288,544,418.67,336,752,400,1232,477.67,416,810.67,1256,1040,823,240,1422,704,718,1193,1541,1008,640,752,
1008,864,1507,4123,2176,899,1717,935)
length_data<-length(x)
length_train<-round(length_data*0.80)
forecast_period<-length_data-length_train
train_data<-x[1:length_train]
train_data<-ts(train_data,frequency=52,start=c(1,1))
validation_data<-x[(length_train+1):length_data]
validation_data<-ts(validation_data,frequency=52,start=c(ceiling((length_train)/52),((length_train)%%52+1)))
library(forecast)
arima_output<-auto.arima(train_data) # fit the ARIMA model
arima_validate <- Arima(x=validation_data,model=arima_output)
Error:
Error in stats::arima(x = x, order = order, seasonal = seasonal, include.mean = include.mean, :
too few non-missing observations
What am I doing wrong?
What does "too few non-missing observations" mean? I have searched the net, but did not find a good explanation.
Thanks for any kind of help!
arima_output is a seasonal ARIMA model:
> arima_output
Series: train_data
ARIMA(1,0,1)(0,1,0)[52]
Arima() then attempts to refit this particular model to validation_data. But to fit this seasonal model to a time series, you need more than one full year of observations, since the model involves seasonal differencing at lag 52, which cannot even be computed on a shorter series.
As an illustration, note that Arima() will happily and without errors refit the model to a series twice as long as validation_data:
validation_data <- x[(length_train+1):length_data]
validation_data<-ts(rep(validation_data,2),frequency=52,
start=c(ceiling((length_train)/52),((length_train)%%52+1)))
arima_validate <- Arima(x=validation_data,model=arima_output)
One way of dealing with this would be to force auto.arima() to use a nonseasonal model, by specifying D=0:
validation_data <- x[(length_train+1):length_data]
validation_data<-ts(validation_data,frequency=52,
start=c(ceiling((length_train)/52),((length_train)%%52+1)))
arima_output<-auto.arima(train_data, D=0) # fit the ARIMA Model
arima_validate <- Arima(x=validation_data,model=arima_output)
So this did turn out to be more of a CrossValidated question...
Your chosen model is ARIMA(1,0,1)(0,1,0)[52]; that is, it takes a seasonal difference at lag 52. Your validation data has only 32 observations, so you cannot take the seasonal differences of the validation data without knowing the training data.
One way around this is to fit the model to the full time series, and then extract what you want (presumably residuals from the validation portion).
You can also improve the readability of your code:
x <- ts(x, frequency=52, start=c(1,1))
length_data <- length(x)
length_train <- round(length_data*0.80)
train_data <- ts(head(x, length_train),
frequency=frequency(x), start=start(x))
validation_data <- ts(tail(x, length_data-length_train),
frequency=frequency(x), end=end(x))
library(forecast)
arima_train <- auto.arima(train_data)
arima_full <- Arima(x, model=arima_train)
res <- window(residuals(arima_full), start=start(validation_data))