I used R's auto.arima() function on a time series data set to produce a forecast. From here, I'd like to know how to find the p, d, q values of the ARIMA model. Is there a quick way to determine that? Thank you.
The forecast::auto.arima() function was written to pick the optimal p, d, and q with respect to some optimization criterion (e.g. AIC). If you want to see which model was picked, use the summary() function.
For example:
library(forecast)
fit <- auto.arima(lynx)
summary(fit)
Series: lynx
ARIMA(2,0,2) with non-zero mean

Coefficients:
         ar1      ar2      ma1      ma2       mean
      1.3421  -0.6738  -0.2027  -0.2564  1544.4039
s.e.  0.0984   0.0801   0.1261   0.1097   131.9242

sigma^2 estimated as 761965:  log likelihood=-932.08
AIC=1876.17   AICc=1876.95   BIC=1892.58

Training set error measures:
                    ME     RMSE      MAE       MPE     MAPE      MASE        ACF1
Training set -1.608903 853.5488 610.1112 -63.90926 140.7693 0.7343143 -0.01267127
The particular specification appears in the second line of the output: in this example, auto.arima picks an ARIMA(2,0,2) with non-zero mean.
Note that I did this naively here for demonstration purposes. I didn't check whether this is an accurate representation of the dependency structure in the lynx data set.
Other than summary(), you could also use arimaorder(fit) to get the vector c(p,d,q) or as.character(fit) to get "ARIMA(p,d,q)".
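For instance, continuing with the fit from above:

arimaorder(fit)    # named vector giving p, d, q (plus P, D, Q and period m for seasonal fits)
as.character(fit)  # "ARIMA(2,0,2) with non-zero mean"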
I have a return time series at daily frequency which is stationary (confirmed by an ADF test), has no autocorrelation up to lag 10 (confirmed by a Ljung-Box (lbq) test with 10 lags), and has an ARCH effect (confirmed by an LM test). My initial thought was to apply a GARCH model directly, rather than follow the usual procedure of first fitting an ARMA(p,q) model and then fitting a GARCH model to the ARMA residuals.
However, out of curiosity, I still looped an ARMA(p,q) model through all (p,q) lag combinations in the range [0, 1, ..., 10] to see whether ARMA(0,0) has the smallest AIC of all. After looping through those 121 (p,q) combinations, I find that the smallest AIC does NOT belong to the ARMA(0,0) model, but to ARMA(2,7). I then checked the coefficients of this ARMA(2,7) model and found that many of the included lags are significant; the two AR lags are both significant at the 1% level.
Now I am quite confused. Based on the result of the lbq(10) test, I should use ARMA(0,0); based on the smallest AIC among the ARMA models, I should use ARMA(2,7). May I ask, in this case, should I use ARMA(0,0) or ARMA(2,7)? My preference is ARMA(2,7), but how can I explain it to others when they ask: why still use an ARMA model when the lbq test shows no autocorrelation?
Any thoughts are greatly appreciated!
Please see the code and results below.
lbqtest(returns,'Lags',1:10)
I could also use the following call to run a single test of autocorrelation up to lag 10:
lbqtest(returns,'Lags',10)
The p-values from lbq(1) through lbq(10) are:
p =
0.3425 0.5612 0.4180 0.5356 0.6637 0.7696 0.7770 0.8448 0.8995 0.9198
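(As an aside, the lbqtest calls above appear to be MATLAB Econometrics Toolbox syntax; the analogous test in R, where the ARMA estimation below is done, would be something like:)

# Ljung-Box test of the (hypothetical) returns series up to lag 10
Box.test(returns, lag = 10, type = "Ljung-Box")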
The AIC results of ARMA(2,7) and ARMA(0,0) are:
          AIC  AR  MA
 -1498.252431   2   7
 -1494.028      0   0
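For reference, the grid search described above can be sketched in R along these lines (a sketch, assuming returns is the series in question):

# Search ARMA(p,q), p and q in 0..10, for the smallest AIC (121 fits);
# try() skips orders where the optimizer fails to converge
best.aic <- Inf
best.order <- c(0, 0, 0)
for (p in 0:10) {
  for (q in 0:10) {
    fit <- try(arima(returns, order = c(p, 0, q)), silent = TRUE)
    if (!inherits(fit, "try-error") && AIC(fit) < best.aic) {
      best.aic <- AIC(fit)
      best.order <- c(p, 0, q)
    }
  }
}
best.order  # the poster reports c(2, 0, 7) here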
The estimation result of ARMA(2,7) using R is
arima(x = returns, order = c(2, 0, 7))
Coefficients:
          ar1      ar2     ma1     ma2      ma3      ma4      ma5     ma6     ma7  intercept
      -1.6786  -0.8756  1.6808  0.8128  -0.1691  -0.1736  -0.1065  0.0419  0.0411    -0.0006
s.e.   0.0381   0.0308  0.0660  0.1044   0.1078   0.1097   0.1082  0.1017  0.0642     0.0015
I am trying to use the R package forecast to fit ARIMA models (with the function Arima) and to select an appropriate model automatically (with the function auto.arima). I first estimated two possible models with the function Arima:
tt.1 <- Arima(x, order=c(1,0,1), seasonal=list(order=c(0,1,1)),
              include.drift=F)
tt.2 <- Arima(x, order=c(1,0,1), seasonal=list(order=c(0,1,0)),
              include.drift=F)
Then, I used the function auto.arima to automatically select an appropriate model for the same data. I fixed d=0 and D=1, just as in the two models above. Furthermore, I set the maximum to 1 for all other order parameters, did not use approximation of the selection criterion, and did not use stepwise selection (note that these settings are only for demonstrating the strange behavior, not what I really intend to use). I used BIC as the criterion for selecting the model. Here is the function call:
tt.auto <- auto.arima(x, ic="bic", approximation=F, seasonal=T, stepwise=F,
                      max.p=1, max.q=1, max.P=1, max.Q=1, d=0, D=1, start.p=1,
                      start.q=1, start.P=1, start.Q=1, trace=T,
                      allowdrift=F)
Now, I would have expected that auto.arima selects the model with the lower BIC from the two models above or a model not estimated above by Arima. Furthermore, I would have expected that the output generated by auto.arima when trace=T is exactly the same as the BIC calculated by Arima for the two models above. This is indeed true for the second model but not for the first one. For the first model, the BIC calculated by Arima is 10405.81 but the screen output of auto.arima for the model (1,0,1)(0,1,1) is Inf. Consequently, the second model is selected by auto.arima although the first model has a lower BIC when comparing the two models estimated by Arima. Does anyone have an idea why the BIC calculated by Arima does not correspond to the BIC calculated by auto.arima in case of the first model?
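(For completeness, the two BIC values compared here can also be read off the fits directly; the generic BIC() works on Arima fits since they carry a log-likelihood:)

BIC(tt.1)  # 10405.81 per summary(tt.1) below
BIC(tt.2)  # 10984.75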
Here is the screen output of auto.arima:
ARIMA(0,0,0)(0,1,0)[96] : 11744.63
ARIMA(0,0,0)(0,1,1)[96] : Inf
ARIMA(0,0,0)(1,1,0)[96] : Inf
ARIMA(0,0,0)(1,1,1)[96] : Inf
ARIMA(0,0,1)(0,1,0)[96] : 11404.67
ARIMA(0,0,1)(0,1,1)[96] : Inf
ARIMA(0,0,1)(1,1,0)[96] : Inf
ARIMA(0,0,1)(1,1,1)[96] : Inf
ARIMA(1,0,0)(0,1,0)[96] : 11120.72
ARIMA(1,0,0)(0,1,1)[96] : Inf
ARIMA(1,0,0)(1,1,0)[96] : Inf
ARIMA(1,0,0)(1,1,1)[96] : Inf
ARIMA(1,0,1)(0,1,0)[96] : 10984.75
ARIMA(1,0,1)(0,1,1)[96] : Inf
ARIMA(1,0,1)(1,1,0)[96] : Inf
ARIMA(1,0,1)(1,1,1)[96] : Inf
And here are summaries of the models calculated by Arima:
> summary(tt.1)
Series: x
ARIMA(1,0,1)(0,1,1)[96]

Coefficients:
         ar1      ma1     sma1
      0.9273  -0.5620  -1.0000
s.e.  0.0146   0.0309   0.0349

sigma^2 estimated as 867.7:  log likelihood=-5188.98
AIC=10385.96   AICc=10386   BIC=10405.81

Training set error measures:
                   ME     RMSE      MAE       MPE     MAPE      MASE        ACF1
Training set 0.205128 28.16286 11.14871 -7.171098 18.42883 0.3612059 -0.03466711
> summary(tt.2)
Series: x
ARIMA(1,0,1)(0,1,0)[96]

Coefficients:
         ar1      ma1
      0.9148  -0.4967
s.e.  0.0155   0.0320

sigma^2 estimated as 1892:  log likelihood=-5481.93
AIC=10969.86   AICc=10969.89   BIC=10984.75

Training set error measures:
                    ME     RMSE      MAE       MPE     MAPE    MASE        ACF1
Training set 0.1942746 41.61086 15.38138 -8.836059 24.55919 0.49834 -0.02253845
Note: I am not allowed to make the data available. But I would be happy to provide more output or run modified calls of the functions if necessary.
EDIT: I have now looked at the source code of auto.arima and found that the behavior is caused by a check on the roots, which sets the information criterion used for model selection to Inf if the model fails the check. The paper cited in the help for auto.arima confirms this (Hyndman, R.J. and Khandakar, Y. (2008), "Automatic time series forecasting: The forecast package for R", Journal of Statistical Software, 26(3), page 11). Sorry for the question; I should have read the paper before asking here!
auto.arima tries to find the best model subject to some constraints, avoiding models with parameters that are close to the non-stationarity and non-invertibility boundaries.
Your tt.1 model has a seasonal MA(1) parameter of -1 which lies on the non-invertibility boundary. So you don't want to use that model as it will lead to numerical instabilities. The seasonal difference operator is confounded with the seasonal MA operator.
Internally, auto.arima gives an AIC/AICc/BIC value of Inf to any model that doesn't satisfy the constraints to avoid it being selected.
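One can verify the boundary problem directly from the fitted coefficients; for example, for the seasonal MA(1) part of tt.1 (a quick manual check, not part of auto.arima itself):

# Root of the seasonal MA polynomial 1 + sma1*z; a modulus of 1 means
# the model sits exactly on the non-invertibility boundary
sma1 <- coef(tt.1)["sma1"]
Mod(polyroot(c(1, sma1)))  # equals 1 when sma1 = -1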
Out of curiosity, I am trying to compare the time series fed into an ARMA model with a series reconstructed after the ARMA estimate is obtained. These are the steps I have in mind:
Construct a simulated time series:
arma.sim <- arima.sim(model=list(ar=c(0.9),ma=c(0.2)),n = 100)
Estimate the model from arma.sim, assuming we know it is a (1,0,1) model:
arma.est1 <- arima(arma.sim, order=c(1,0,1))
Also say we get arma.est1 in this form, which is close to the original (0.9, 0, 0.2):
Coefficients:
         ar1     ma1  intercept
      0.9115  0.0104    -0.4486
s.e.  0.0456  0.1270     1.1396

sigma^2 estimated as 1.15:  log likelihood = -149.79,  aic = 307.57
If I try to reconstruct another time series from arma.est1, how do I incorporate the intercept or s.e. into arima.sim? Something like this doesn't seem to work well, because arma.sim and arma.rec are far off:
arma.rec <- arima.sim(n=100, list(ar=c(0.9115),ma=c(0.0104)))
Normally we would use predict() to check the estimate. But is this a legitimate way to look at the estimate?
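For what it's worth, arima.sim() has no intercept argument, so one way to fold the estimated mean and innovation variance into the reconstruction would be (a sketch based on the estimates above):

# Simulate from the fitted ARMA(1,1): pass the estimated innovation sd
# through ... to rnorm, then add the estimated mean (the "intercept")
arma.rec <- arima.sim(n = 100, list(ar = 0.9115, ma = 0.0104),
                      sd = sqrt(1.15)) + (-0.4486)

Even then, a single simulated path will not track the original arma.sim series: simulation reproduces the distribution of the process, not the particular realization, which is why predict() is the usual tool for checking an estimate.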
I cannot insert a random slope in this model with lme4 (1.1-7):
> difJS<-lmer(JS~Tempo+(Tempo|id),dat,na.action=na.omit)
Error: number of observations (=274) <= number of random effects (=278) for term
(Tempo | id); the random-effects parameters and the residual variance (or scale
parameter) are probably unidentifiable
With nlme it is working:
> JSprova<-lme(JS~Tempo,random=~1+Tempo|id,data=dat,na.action=na.omit)
> summary(JSprova)
Linear mixed-effects model fit by REML
  Data: dat
       AIC      BIC    logLik
  769.6847 791.3196 -378.8424

Random effects:
 Formula: ~1 + Tempo | id
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr
(Intercept) 1.1981593 (Intr)
Tempo       0.5409468 -0.692
Residual    0.5597984

Fixed effects: JS ~ Tempo
                Value  Std.Error  DF   t-value p-value
(Intercept)  4.116867 0.14789184 138 27.837013  0.0000
Tempo       -0.207240 0.08227474 134 -2.518874  0.0129
 Correlation:
      (Intr)
Tempo -0.837

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max
-2.79269550 -0.39879115  0.09688881  0.41525770  2.32111142

Number of Observations: 274
Number of Groups: 139
I think it is a problem of missing data, as I have a few cases with missing data at time two of the DV, but with na.action=na.omit shouldn't the two packages behave in the same way?
It is "working" with lme, but I'm 99% sure that your random slopes are indeed confounded with the residual variation. The problem is that you only have two measurements per subject (or only one measurement per subject in 4 cases -- but that's not important here), so that a random slope plus a random intercept for every individual gives one random effect for every observation.
If you try intervals() on your lme fit, it will give you an error saying that the variance-covariance matrix is unidentifiable.
You can force lmer to do it by disabling some of the identifiability checks (see below).
library("lme4")
library("nlme")
library("plyr")
Restrict the data to only two points per individual:
sleepstudy0 <- ddply(sleepstudy, "Subject",
                     function(x) x[1:2,])
m1 <- lme(Reaction~Days,random=~Days|Subject,data=sleepstudy0)
intervals(m1)
## Error ... cannot get confidence intervals on var-cov components
lmer(Reaction~Days+(Days|Subject),data=sleepstudy0)
## error
If you want you can force lmer to fit this model:
m2B <- lmer(Reaction~Days+(Days|Subject), data=sleepstudy0,
            control=lmerControl(check.nobs.vs.nRE="ignore"))
## warning messages
The estimated variances are different from those estimated by lme, but that's not surprising since some of the parameters are jointly unidentifiable.
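For example, the variance components of the two fits can be put side by side (using the objects from above):

VarCorr(m1)   # nlme fit
VarCorr(m2B)  # lme4 fit with the check disabled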
If you're only interested in inference on the fixed effects, it might be OK to ignore these problems, but I wouldn't recommend it.
The sensible thing to do is to recognize that the variation among slopes is unidentifiable; there may be among-individual variation among slopes, but you just can't estimate it with this model. Don't try; fit a random-intercept model and let the implicit/default random error term take care of the variation among slopes.
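In lmer syntax that would be, for the example above:

# Random-intercept-only model: among-subject slope variation is left to
# the residual term, which is all this design can identify
m3 <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy0)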
There's a recent related question on CrossValidated; there I also refer to another example.