ARIMA modelling, prediction and plotting with CO2 dataset in R

I am working with arima0() and co2. I would like to plot the arima0() model over my data. I have tried fitted() and curve() with no success.
Here is my code:
###### Time Series
# format: time series
data(co2)
# format: matrix
dmn <- list(month.abb, unique(floor(time(co2))))
co2.m <- matrix(co2, 12, dimnames = dmn)
co2.dt <- pracma::detrend(co2.m, tt = 'linear')
co2.dt <- ts(as.numeric(co2.dt), start = c(1959,1), frequency=12)
# first diff
co2.dt.dif <- diff(co2.dt,lag = 12)
# Second diff
co2.dt.dif2 <- diff(co2.dt.dif,lag = 1)
With the data prepared, I ran the following arima0:
results <- arima0(co2.dt.dif2, order = c(2,0,0), method = "ML")
resultspredict <- predict(results, n.ahead = 36)
I would like to plot the fitted model and the predictions over the original data. I am hoping there is a way to do this in base R.

Session 1: To begin with...
To be honest, I am concerned about the way you are modelling the co2 time series. Something has already gone wrong at the de-trending step. Why use tt = "linear"? That fits a linear trend within each period (i.e., year) and takes the residuals for further inspection. This is often not recommended, as it tends to introduce artificial effects into the residual series. I would be inclined to use tt = "constant", i.e., simply subtract each year's average. This at least preserves the within-season correlation of the original data.
Perhaps you would like to see some evidence. The ACF can help you diagnose this.
data(co2)
## de-trend by dropping yearly average (no need to use `pracma::detrend`)
yearlymean <- ave(co2, gl(39, 12), FUN = mean)
co2dt <- co2 - yearlymean
## de-trend by dropping within season linear trend
co2.m <- matrix(co2, 12)
co2.dt <- pracma::detrend(co2.m, tt = "linear")
co2.dt <- ts(as.numeric(co2.dt), start = c(1959, 1), frequency = 12)
## compare time series and ACF
par(mfrow = c(2, 2))
ts.plot(co2dt); acf(co2dt)
ts.plot(co2.dt); acf(co2.dt)
Both de-trended series show a strong seasonal effect, so a further seasonal differencing is required.
## seasonal differencing
co2dt.dif <- diff(co2dt, lag = 12)
co2.dt.dif <- diff(co2.dt, lag = 12)
## compare time series and ACF
par(mfrow = c(2, 2))
ts.plot(co2dt.dif); acf(co2dt.dif)
ts.plot(co2.dt.dif); acf(co2.dt.dif)
The ACF for co2.dt.dif has more significant negative correlations. This is a sign of over-de-trending, so we prefer co2dt. After the seasonal differencing, co2dt.dif is already stationary, and no more differencing is needed (otherwise you just over-difference it and introduce more negative autocorrelation).
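If you want a formal complement to the eyeballed ACF, a unit-root test on the seasonally differenced series is one option (a sketch; it assumes the tseries package is installed):
## optional formal check: the ADF null hypothesis is a unit root,
## so a small p-value supports stationarity of co2dt.dif
tseries::adf.test(co2dt.dif)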
The big negative spike at one seasonal lag (lag 1 on the ACF plot, i.e., 12 months) for co2dt.dif suggests a seasonal MA term. Also, the positive spikes within the season imply a mild non-seasonal AR process. So consider:
## we exclude mean because we found estimation of mean is 0 if we include it
fit <- arima0(co2dt.dif, order = c(1,0,0), seasonal = c(0,0,1), include.mean = FALSE)
To see whether this model is doing a good job, we need to inspect the ACF of its residuals:
acf(fit$residuals)
Looks like this model is decent (actually pretty great).
For prediction purposes, it is actually a better idea to integrate the seasonal differencing of co2dt into the model fit, rather than modelling co2dt.dif separately. Let's do
fit <- arima0(co2dt, order = c(1,0,0), seasonal = c(0,1,1), include.mean = FALSE)
This gives exactly the same AR and MA coefficient estimates as the two-stage approach above, but now prediction is easily handled with a single predict call.
## 3 years' ahead prediction (no prediction error; only mean)
predco2dt <- predict(fit, n.ahead = 36, se.fit = FALSE)
Let's plot co2dt, fitted model and prediction together:
fittedco2dt <- co2dt - fit$residuals
ts.plot(co2dt, fittedco2dt, predco2dt, col = 1:3)
The result looks very promising!
Now the final stage is to map this back to the original co2 series. For fitted values, we just add back the yearly means we dropped:
fittedco2 <- fittedco2dt + yearlymean
But for prediction it is more difficult, because we don't know what the yearly means in the future will be. In this regard, our model, although it looks good, is not practically useful. I will discuss a better idea in the next session. To finish this session, we plot co2 with its fitted values only:
ts.plot(co2, fittedco2, col = 1:2)
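Just to make the difficulty concrete, here is one naive workaround (a rough sketch only, not what the next session recommends): extrapolate the yearly means themselves with a linear fit and add them back to the de-trended predictions.
## naive sketch: extend the yearly mean linearly for 3 future years
yr <- 1:39
ym <- tapply(co2, gl(39, 12), mean)
ym_ext <- predict(lm(ym ~ yr), newdata = data.frame(yr = 40:42))
predco2 <- predco2dt + rep(ym_ext, each = 12)
ts.plot(co2, fittedco2, predco2, col = 1:3)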

Session 2: A better idea for time series modelling
In the previous session, we saw the difficulty of prediction when de-trending and modelling of the de-trended series are kept separate. Now we try to combine the two stages in one go.
The seasonal pattern of co2 is really strong, so we need a seasonal differencing anyway:
data(co2)
co2dt <- diff(co2, lag = 12)
par(mfrow = c(1,2)); ts.plot(co2dt); acf(co2dt)
After this seasonal differencing, co2dt does not look stationary, so we need a further non-seasonal differencing.
co2dt.dif <- diff(co2dt)
par(mfrow = c(1,2)); ts.plot(co2dt.dif); acf(co2dt.dif)
The negative spikes both within the season and at the seasonal lag suggest that an MA term is needed at both the non-seasonal and the seasonal level. There is no need to work with co2dt.dif explicitly; we can work with co2 directly:
fit <- arima0(co2, order = c(0,1,1), seasonal = c(0,1,1))
acf(fit$residuals)
Now the residuals are essentially uncorrelated! So we have an ARIMA(0,1,1)(0,1,1)[12] model for the co2 series.
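As a formal complement to the visual ACF check (a sketch; the lag choice is a judgment call), a Ljung-Box test on the residuals tells the same story:
## Ljung-Box test; fitdf = 2 accounts for the two estimated coefficients
Box.test(fit$residuals, lag = 24, type = "Ljung-Box", fitdf = 2)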
As usual, fitted values are obtained by subtracting residuals from data:
co2fitted <- co2 - fit$residuals
Predictions are made by a single call to predict:
co2pred <- predict(fit, n.ahead = 36, se.fit = FALSE)
Let's plot them together:
ts.plot(co2, co2fitted, co2pred, col = 1:3)
Oh, this is just gorgeous!

Session 3: Model selection
The story could have finished here, but I would like to make a comparison with auto.arima from the forecast package, which automatically decides on the "best" model.
library(forecast)
autofit <- auto.arima(co2)
#Series: co2
#ARIMA(1,1,1)(1,1,2)[12]
#
#Coefficients:
#         ar1      ma1     sar1     sma1     sma2
#      0.2569  -0.5847  -0.5489  -0.2620  -0.5123
#s.e.  0.1406   0.1204   0.5880   0.5701   0.4819
#
#sigma^2 estimated as 0.08576: log likelihood=-84.39
#AIC=180.78 AICc=180.97 BIC=205.5
auto.arima has chosen ARIMA(1,1,1)(1,1,2)[12], which is considerably more complicated: on top of the same seasonal and non-seasonal differencing, it adds a non-seasonal AR term, a seasonal AR term and a second seasonal MA term.
Our model based on step-by-step investigation suggests an ARIMA(0,1,1)(0,1,1)[12]:
fit <- arima0(co2, order = c(0,1,1), seasonal = c(0,1,1))
#Call:
#arima0(x = co2, order = c(0, 1, 1), seasonal = c(0, 1, 1))
#
#Coefficients:
#          ma1     sma1
#      -0.3495  -0.8515
#s.e.   0.0497   0.0254
#
#sigma^2 estimated as 0.08262: log likelihood = -85.98, aic = 177.96
The AIC values suggest our model is better. So does BIC:
BIC = -2 * loglik + log(n) * p
With n the number of observations and p the number of estimated parameters (the ARMA coefficients plus one for sigma^2), our model's BIC is
n <- length(co2)
p <- length(fit$coef) + 1
-2 * fit$loglik + log(n) * p
# [1] 196.5503
This is well below the 205.5 reported by auto.arima, so auto.arima has over-fitted the data.
In fact, as soon as we see ARIMA(1,1,1)(1,1,2)[12] we should strongly suspect over-fitting, because different effects can "cancel" each other out. This happens with the additional seasonal MA and non-seasonal AR terms introduced by auto.arima: AR introduces positive autocorrelation while MA introduces negative autocorrelation.
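A rough check using the printed output above: the ratios of the extra seasonal coefficients to their standard errors are all well below 2 in absolute value, consistent with redundant terms.
## approximate z-statistics from the auto.arima printout
round(c(sar1 = -0.5489 / 0.5880,
        sma1 = -0.2620 / 0.5701,
        sma2 = -0.5123 / 0.4819), 2)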

Related

Lagged Residual as Independent Variable in R

I am building a factor model to estimate future equity returns. I'd like to include an autoregressive residual term in this model: yesterday's error (the difference between yesterday's predicted and actual return) should enter the regression as an independent variable. What type of autoregressive model is this called? I've searched through various time-series econometrics texts and have not found this particular model described. My current solution in R is to rerun the regression at every discrete time step (t) and manually include yesterday's residual, but I am curious whether there is a more efficient method or a package that does this.
Below is some sample code without the residual term included:
Data:
# fake data
set.seed(333)
df <- data.frame(seq(as.Date("2017/1/1"), as.Date("2017/2/19"), "days"),
                 matrix(runif(50*506), nrow = 50, ncol = 506))
names(df) <- c("Date", paste0("var", 1:503), c("mktrf", "smb", "hml"))
Then I store my necessary variables for regression:
1. All the independent variables
x <- df[, 505:507]
2. All the dependent variables
y <- df[, 2:504]
3. Fit all the models
list_models_AR <- lapply(y, function(y)
  with(x, lm(y ~ mktrf + smb + hml, na.action = na.exclude)))
It's an ARIMA(0, 0, 1) model with regressors.
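A minimal sketch of such a fit for a single response (assuming the fake df above; var1 is just an arbitrary example column):
## regression with MA(1) errors via arima()'s xreg argument
reg <- as.matrix(df[, c("mktrf", "smb", "hml")])
fit_ma1 <- arima(df$var1, order = c(0, 0, 1), xreg = reg)
fit_ma1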

Building dynamic linear model in R with dlm package, MLE and Bayesian inference for parameter estimation

Here's the background. I have a time series representing daily sales, and I have built several models: ARIMA (arima), STL decomposition (stl), Holt-Winters (hw), an exponential smoothing state space model (the ets result reduces to a Holt-Winters model, because the returned model has additive error/trend/seasonality), etc.
Anyway, the data is non-stationary, showing a trend and weekly seasonality, which can also be confirmed with spectral analysis (a periodogram). Using cross-validation of 1-step to 15-step-ahead forecasts, I found that the STL decomposition gave me the best MAE.
Then I started working on a dynamic linear model to see if I can build a better one for forecasting. The model is also written in R with dlm; it is a local linear trend + seasonal + ARMA model, with code as below:
build <- function(parm) {
  level0 <- 20
  slope0 <- 1
  # Level + Trend
  trend <- dlmModPoly(order = 2, dV = parm[1], dW = exp(parm[2:3]),
                      m0 = c(level0, slope0),
                      C0 = 400 * diag(2))
  # Seasonal Term
  # Season Factor model
  # season <- dlmModSeas(frequency = 7, dW = c(parm[4], rep(0, 5)))
  # Fourier Form Seasonal Model
  season <- dlmModTrig(s = 7, q = 2, dW = rep(c(parm[4], parm[5]), each = 2))
  # ARMA Term
  arma <- dlmModARMA(ar = ARtransPars(parm[6:7]), ma = parm[8:9], sigma2 = parm[10])
  return(trend + season + arma)
}
# MLE for parameter estimation
init <- c(1e-07, -3, -1, 5, 4, 0.5, 0.4, 0.7, 0.3, 1)
fit_dlm <- dlmMLE(y, parm = init, build, hessian = TRUE)
dlmSales <- build(fit_dlm$par)
f1 <- dlmForecast(dlmSales, n = 16)
My first question is about the observation error: in the mathematical model there should be only one observation error, which suggests there should also be only one dV in the R model. So I only set the dV argument in dlmModPoly; is this correct?
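One way to sanity-check this (a sketch, reusing build() and init from above): when dlm components are added with +, their observation variances should be summed, so the combined model still has a single scalar V that you can inspect.
## inspect the observation variance of the combined model;
## the seasonal and ARMA components leave dV at its default of 0
library(dlm)
mod <- build(init)
V(mod)   # expected to equal parm[1] from dlmModPoly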
I know that MLE is not considered the best way to estimate the unknown parameters. The book Dynamic Linear Models with R also gives methods such as Bayesian inference with discount factors (with/without time-variant dV) and simulation-based Bayesian inference.
Here are my questions
How do I use dlmFilterDF if my model consists of three parts, i.e. mod <- dlmModPoly + dlmModTrig + dlmModARMA? Can I call modFilt <- dlmFilterDF(y, mod, DF = 0.9) directly? I think this Bayesian inference assumes that Ft and Gt are known, but the AR and MA terms in my case are unknown parameters, so this method might not work?
When applying dlmGibbsDIG, can I assign vectors to a.theta and b.theta, given that the unknown parameter vector psi in my case has length 6 (1 for dV, 2 for the local linear dW, 2 for the seasonal dW, 1 for the ARMA dW)? Also, the same concern as above: the AR and MA terms are unknown parameters; can they be estimated with Gibbs sampling? Probably hybrid sampling is a better choice, as suggested in section 4.6.1.

ARIMA loop in R

I'm pretty new to R and I've run into a problem with finding the optimal ARIMA model. So far I've modeled the trend and a seasonal component, and now I want to model the cyclical component with an ARIMA model. I want the output in the end to include coefficients for the time variable, the seasonal variables and also the ARIMA variables. I've tried to use a loop to find the optimal ARIMA model and the coefficients, but I just get this message:
"Error in optim(init[mask], armaCSS, method = optim.method, hessian = FALSE, :
non-finite value supplied by optim"
I've tried looking for other answers in here, but I just can't seem to figure out what I'm doing wrong.
I've included the entire code in case it is necessary, but the error appears after running the loop in the end.
I appreciate any help I can get, thank you!
#clear workspace
rm(list=ls())
#load data
setwd("~/Desktop/CBS/HA almen year 3 /Forecasting /R koder ")
data <- scan("onlineretail.txt")
data <- data[2:69] #cut off first period + two last periods for whole years
T=length(data)
s=4
years=T/s
styear=2000
st=c(styear,1)
data = ts(data,start=st, frequency = s)
plot(data)
summary(data)
#plot shows increasing variance - log transform data
lndata <- log(data)
plot(lndata)
dataTSE = decompose(lndata, type="additive")
plot(dataTSE)
########### Trend ##########
t=(1:T)
t2=t^2
lny <- lndata
lmtrend.model <- lm(lny~t)
summary(lmtrend.model)
#linear trend T_t = 8.97 + 0.039533*TIME - both coefficients significant
#Project 2, explanation why linear is better than quadratic
qtrend.model <- lm(lny~t+t2)
summary(qtrend.model)
lntrend = fitted(lmtrend.model)
lntrend = ts(lntrend, start=st, frequency = s)
#lntrend2 = fitted(qtrend.model)
#lntrend2 = ts(lntrend2, start=st, frequency = s)
residuals=lny-lntrend
par(mar=c(5,5,5,5))
plot(lny, ylim=c(5,12), main="Log e-commerce retail sales")
lines(lntrend, col="blue")
#lines(lntrend2, col="red")
par(new=T)
plot(residuals,ylim=c(-0.2,0.8),ylab="", axes=F)
axis(4, pretty(c(-0.2,0.4)))
abline(h=0, col="grey")
mtext("Residuals", side=4, line=2.5, at=0)
############# Season #################
#The ACF of the residuals confirms the neglected seasonality, because there
#is a clear pattern for every k+4 lags:
acf(residuals)
#Remove trend to observe seasonal factors without the trend:
detrended = residuals
plot(detrended, ylab="ln sales", main="Seasonality in ecommerce retail sales")
abline(h=0, col="grey")
#We can check out the average magnitude of seasonal factors
seasonal.matrix=matrix(detrended, ncol=s, byrow=years)
SeasonalFactor = apply(seasonal.matrix, 2, mean)
SeasonalFactor=ts(SeasonalFactor, frequency = s)
SeasonalFactor
plot(SeasonalFactor);abline(h=0, col="grey")
#We add seasonal dummies to our model of trend and omit the last quarter
library("forecast")
M <- seasonaldummy(lny)
ST.model <- lm(lny ~ t+M)
summary(ST.model)
#ST.model <- tslm(lny~t+season)
#summary(ST.model)
#Both the trend and seasonal dummies appears highly significant
#We will use a Durbin-Watson test to detect serial correlation
library("lmtest")
dwtest(ST.model)
#The DW value is 0.076396. This is quite small, as the value should be around 2,
#and we should therefore try to improve the model with a cyclical component
#I will construct a plot that shows how the model fits the data and
#how the residuals look
lntrend=fitted(ST.model)
lntrend = ts(lntrend, start=st, frequency = s)
residuals=lny-lntrend
par(mar=c(5,5,5,5))
plot(lny, ylim=c(5,12), main="Log e-commerce retail sales")
lines(lntrend, col="blue")
#tell R to draw over the current plot with a new one
par(new=T)
plot(residuals,ylim=c(-0.2,0.8),ylab="", axes=F)
axis(4, pretty(c(-0.2,0.4)))
abline(h=0, col="grey")
mtext("Residuals", side=4, line=2.5, at=0)
############## Test for unit root ############
#We will check if the data is stationary, and to do so we will
#test for unit root.
#To do so, we will perform a Dickey-Fuller test. First, we have to remove the
#seasonal component.
#We can also perform an informal test with ACF and PACF
#the autocorrelation function shows that the data damps slowly
#while the PACF is close to 1 at lag 1 and then lags become insignificant
#this is informal evidence of unit root
acf(residuals)
pacf(residuals)
#Detrended and deseasonalized data
deseason = residuals
plot(deseason)
#level changes a lot over time, not stationary in mean
#Dickey-Fuller test
require(urca)
test <- ur.df(deseason, type = c("trend"), lags=3, selectlags = "AIC")
summary(test)
#We do not reject that there is a unit root if
# |test statistics| < |critical value|
# 1.97 < 4.04
#We can see from the output that the absolute value of the test statistic
#is smaller than the critical value. Therefore, there is no evidence against
#the unit root.
#We check the ACF and PACF in first differences. There should be no
#significant lags
#if the data is white noise in first differences.
acf(diff(deseason))
pacf(diff(deseason))
deseasondiff = diff(deseason, differences = 2)
plot(deseasondiff)
test2 <- ur.df(deseasondiff, type=c("trend"), lags = 3, selectlags = "AIC")
summary(test2)
#From the plot and the Dickey-Fuller test, it looks like we need to difference
#twice.
############# ARIMA model ############
S1 = rep(c(1,0,0,0), T/s)
S2 = rep(c(0,1,0,0), T/s)
S3 = rep(c(0,0,1,0), T/s)
TrSeas = model.matrix(~ t+S1+S2+S3)
#Double loop for finding the best fitting ARIMA model and since there was
#a drift, we include this in the model
best.order <- c(0, 2, 0)
best.aic <- Inf
for (q in 1:6) for (p in 1:6) {
  fit.aic <- AIC(arima(lny, order = c(p, 2, q), include.mean = TRUE, xreg = TrSeas))
  print(c(p, q, fit.aic))
  if (fit.aic < best.aic) {
    best.order <- c(p, 2, q)
    best.arma <- arima(lny, order = c(p, 2, q), include.mean = TRUE, xreg = TrSeas)
    best.aic <- fit.aic
  }
}
best.order
Please use the forecast package from Prof. Hyndman.
The call to:
auto.arima(data)
will return the best ARIMA model it can find for your time series. You will find https://www.otexts.org/fpp/8/7 a great reference as well.
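If you also want the trend and seasonal-dummy coefficients in the output, a hedged sketch (assuming lny and TrSeas from your code above are in the workspace) is to pass them as regressors and let auto.arima() pick the error structure:
## let auto.arima() choose the ARMA errors, keep trend + seasonal dummies as xreg
library(forecast)
fit <- auto.arima(lny, xreg = TrSeas[, -1])   # drop the intercept column
summary(fit)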

SARIMAX model in R

I would like to fit a SARIMAX model with temperature as an exogenous variable in R. Can I do that with the xreg argument available in the TSA package?
I thought to fit the model as:
fit1 = arima(x, order=c(p,d,q), seasonal=list(order=c(P,D,Q), period=S), xreg=temp)
Is that correct, or do I have to use another R function?
If it isn't correct, which steps should I follow?
Thanks.
Check out the forecast package, it's great:
# some random data
x <- ts(rnorm(120,0,3) + 1:120 + 20*sin(2*pi*(1:120)/12), frequency=12)
temp = rnorm(length(x), 20, 30)
require(forecast)
# build the model (check ?auto.arima)
model = auto.arima(x, xreg = data.frame(temp = temp))
# some random predictors
temp.reg = data.frame(temp = rnorm(10, 20, 30))
# forecasting
forec = forecast(model, xreg = temp.reg)
# quick way to visualize things
plot(forec)
# model diagnosis
tsdiag(model)
# model info
summary(forec)
I would not suggest using auto.arima(). Depending on the model you want to fit, it may return poor results. For example, when working with some complex SARIMA models, the difference between models fitted manually and with auto.arima() was noticeable: auto.arima() did not even return white-noise innovations (as would be expected), while the manual fits, of course, did.
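Whichever route you take, it is worth testing the innovations explicitly; a sketch using the model object fitted above:
## residual diagnostics: Ljung-Box test plus ACF and histogram of residuals
checkresiduals(model)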

How to simulate an AR(1) process with arima.sim and an estimated model?

I want to do the following two steps:
Based on a given time series, I want to calibrate an AR(1) process, i.e. I want to estimate the parameters.
Based on the estimated parameters, I want to simulate an AR(1) processes.
Here was my approach:
set.seed(123)
#Just generate random AR(1) time series; based on this, I want to estimate the parameters
ts_AR <- arima.sim(n=10000, list(ar=c(0.5)))
#1. Estimate parameters with arima()
model_AR <- arima(ts_AR, order=c(1,0,0))
#Looks actually good
model_AR
Series: ts_AR
ARIMA(1,0,0) with non-zero mean
Coefficients:
         ar1  intercept
      0.4891    -0.0044
s.e.  0.0087     0.0195
sigma^2 estimated as 0.9974: log likelihood=-14176.35
AIC=28358.69 AICc=28358.69 BIC=28380.32
#2. Simulate based on model
arima.sim(model=model_AR, n = 100)
Error in arima.sim(model = model_AR, n = 100) :
'ar' part of model is not stationary
I'm not the biggest time-series expert, but I'm pretty sure that an AR(1) process with a persistence parameter below one should result in a stationary model. However, the error message tells me something different. So am I doing something stupid here? If so, why, and what should I do to simulate the AR(1) process based on my estimated parameters? Or can't you just pass the output of arima as the model input to arima.sim? Then, however, I don't understand how I get such an error message... I would expect something like "model input cannot be read. It should be something like ...".
It's not the clearest interface in the world, but the model argument is meant to be a list giving the AR and/or MA coefficients, not an actual fitted arima model.
arima.sim(model=as.list(coef(model_AR)), n=100)
This will create a simulated series with AR coefficient 0.489, as estimated from your starting data. Note that the intercept is ignored.
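For clarity, an equivalent call spells the list out explicitly; only the AR part is used:
## equivalent: keep just the AR coefficient in the list
arima.sim(model = list(ar = unname(coef(model_AR)["ar1"])), n = 100)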
I don't think you are using the right approach since there's uncertainty about your coefficient estimate.
The best way to achieve what you want properly is to incorporate this uncertainty into the generation process. There are probably parametric ways to do that, but I think the bootstrap can be handy here.
Let's generate the AR process first:
set.seed(123)
ts_AR <- arima.sim(n = 10000, list(ar = 0.5))
We'll define two helper functions that will be used in the bootstrap. The first one generates the statistics we need (here the coefficient of the AR process and the actual time series), and the second implements our resampling scheme (it is based on residuals).
ar_fun <- function(ts) c(ar = coef(arima(ts, order = c(1, 0, 0),
                                         include.mean = FALSE)), ts = ts)
ar_sim <- function(res, n.sim, ran.args) {
  rg <- function(n, res) sample(res, n, replace = TRUE)
  ts <- ran.args$ts
  model <- ran.args$model
  arima.sim(model = model, n = n.sim,
            rand.gen = rg, res = c(res))
}
Now we can start our simulation
ar_fit <- arima(ts_AR, order = c(1, 0, 0), include.mean = FALSE)
ts_res <- residuals(ar_fit)
ts_res <- ts_res - mean(ts_res)
ar_model <- list(ar = coef(ar_fit))
require(boot)
set.seed(1)
ar_boot <- tsboot(ts_res, ar_fun,
                  R = 99, sim = "model",
                  n.sim = 100, orig.t = FALSE,
                  ran.gen = ar_sim,
                  ran.args = list(ts = ts_AR, model = ar_model))
If you want to get all the generated coefficients and the associated time series:
coefmat <- apply(ar_boot$t, 1, "[", 1)
seriesmat <- apply(ar_boot$t, 1, "[", -1)
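For example, a quick summary of the bootstrap distribution of the AR coefficient:
## rough 95% interval for the AR coefficient across the 99 replicates
quantile(coefmat, c(0.025, 0.5, 0.975))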
You can find more details in the help file of tsboot and in Bootstrap Methods and Their Application, chapter 8.
