Which function should I use to estimate a specific ARIMA model in R?

I have a time series, mydata.ts, with around 200 observations. I ran stationarity tests, took differences, and examined the ACF and PACF. Based on that, I decided to try ARIMA(1,1,1)(0,1,1), for instance.
Which R function should I use to get fitted values and forecasts: Arima, arima or auto.arima?
And can I trust the MAPE, MAD and other error measures reported by summary(model)? I read an answer saying that these results are not exact but approximated in some way.

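auto.arima searches over model specifications and returns the one that is 'best' according to an information criterion such as AIC or BIC.
If you already know the orders, such as (1,1,1) for the non-seasonal part and (0,1,1) for the seasonal part, use Arima from the forecast package (it works like arima, but is a little more general).
Arima(your_data, order=c(1,1,1)) gives the basic non-seasonal fit; the seasonal orders go in the seasonal argument.
See the documentation for the forecast package.
The actual out-of-sample forecasts can then be produced with the forecast function.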
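As a rough illustration of the advice above, here is a minimal sketch of fitting a fixed-order seasonal model and forecasting from it. The built-in AirPassengers series is only a stand-in for mydata.ts, and the 12-step horizon is an arbitrary choice.

```r
library(forecast)  # provides Arima(), forecast() and accuracy()

# Stand-in for mydata.ts (assumption): a built-in monthly series, frequency = 12.
y <- log(AirPassengers)

# ARIMA(1,1,1)(0,1,1); the seasonal period is taken from frequency(y).
fit <- Arima(y, order = c(1, 1, 1), seasonal = c(0, 1, 1))

in_sample <- fitted(fit)     # one-step-ahead fitted values on the training data
fc <- forecast(fit, h = 12)  # out-of-sample point forecasts with intervals

summary(fit)   # coefficients plus training-set error measures
accuracy(fit)  # the same error measures as a matrix
```

The error measures printed here are computed on the training data, so they describe in-sample one-step errors rather than out-of-sample forecast accuracy.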

Related

When predicting using R ARIMA object, how to declare the time series' history?

Suppose I fit an AR(p) model using the R arima function from the stats package. I fit it on a sample x_1,...,x_n. In theory, predicting x_{n+1} with this model requires access to x_n,...,x_{n-p+1}.
How does the model know which observation I want to predict? What if I actually wanted to predict x_n based on x_{n-1},...,x_{n-p}, and how would my code differ in that case? Can I make in-sample forecasts, similar to Python's functionality?
If my questions imply that I think about forecasting in a wrong way, please kindly correct my understanding of the subject.

Interpreting ACF and PACF plots for SARIMA model

I'm new to time series and used the monthly ozone concentration data from Rob Hyndman's website to do some forecasting.
After doing a log transformation and differencing at lags 1 and 12 to remove the trend and the seasonality respectively, I plotted the ACF and PACF shown in this image. Am I on the right track, and how would I interpret this as a SARIMA?
There seems to be a pattern every 11 lags in the PACF plot, which makes me think I should do more differencing (at 11 lags), but doing so gives me a worse plot.
I'd really appreciate any of your help!
EDIT:
I got rid of the differencing at lag 1 and just used lag 12 instead, and this is what I got for the ACF and PACF.
From there, I deduced that: SARIMA(1,0,1)x(1,1,1) (AIC: 520.098)
or SARIMA(1,0,1)x(2,1,1) (AIC: 521.250)
would be a good fit, but auto.arima gave me (3,1,1)x(2,0,0) (AIC: 560.7) with its default settings and (1,1,1)x(2,0,0) (AIC: 558.09) with stepwise and approximation turned off.
I am confused about which model to use, but based on the lowest AIC, would SARIMA(1,0,1)x(1,1,1) be the best? Also, what concerns me is that none of the models pass the Ljung-Box test. Is there any way I can fix this?
It is quite difficult to manually select a model order that will perform well at forecasting a dataset. This is why Rob has built the auto.arima function in his R forecast package: it looks for the model that may perform best according to certain metrics.
When you see a PACF plot with significantly negative lags, that usually means you have over-differenced your data. Try removing the first-order difference and keeping the lag-12 difference. Then carry on making your best guess.
I'd recommend trying his auto.arima function and passing it a time series object with frequency = 12. He has a good write-up of seasonal ARIMA models here:
https://www.otexts.org/fpp/8/9
If you would like more insight into manually selecting a SARIMA model order, this is a good read:
https://onlinecourses.science.psu.edu/stat510/node/67
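The steps described above (seasonal differencing only, inspecting the ACF and PACF, then letting auto.arima search) can be sketched as follows. The ozone data from the question is not reproduced here, so the built-in AirPassengers series is used as an assumed stand-in; the lag.max of 48 is also just an illustrative choice.

```r
library(forecast)  # provides auto.arima()

# Stand-in monthly series (assumption); a ts object with frequency = 12.
y <- log(AirPassengers)

# Seasonal difference only, and seasonal plus first difference for comparison.
y_seasonal <- diff(y, lag = 12)
y_both <- diff(y_seasonal, lag = 1)

# Compare the correlograms; strongly negative low-order spikes after the
# extra first difference are a common sign of over-differencing.
acf(y_seasonal, lag.max = 48)
pacf(y_seasonal, lag.max = 48)
acf(y_both, lag.max = 48)
pacf(y_both, lag.max = 48)

# Exhaustive search without the speed-up approximations (slower but more thorough).
auto_fit <- auto.arima(y, stepwise = FALSE, approximation = FALSE)
auto_fit
```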
In response to your Edit:
I think it would be beneficial to this post if you clarify your objective. Which of the following are you trying to achieve?
1. Find a model whose residuals satisfy the Ljung-Box test.
2. Produce the most accurate out-of-sample forecast.
3. Manually select lag orders such that the ACF and PACF plots show no significant lags remaining.
In my opinion, #2 is the most sought-after objective, so I'll assume that is your goal. In my experience, #3 produces poor results out of sample. As for #1, I am usually not concerned about correlations remaining in the residuals. We know we do not have the true model for this time series, so I see no reason to expect that an approximate model that performs well out of sample has left nothing behind in the residuals, perhaps something more complex or nonlinear.
To give you another SARIMA result, I ran this data through some code I've developed and found that the following model produced the minimal error over a cross-validation period.
Final model is:
SARIMA(0,1,1)(1,1,1)[12] with a constant, fitted to the natural log of the time series.
The errors in the cross validation period are:
MAPE = 16%
MAE = 0.46
RSQR = 74%
Here is the Partial Autocorrelation plot of the residuals for your information.
To my understanding, this is roughly similar in methodology to selecting a model based on AICc, but it is ultimately a different approach. Regardless, if your objective is out-of-sample accuracy, I'd recommend evaluating models by their out-of-sample accuracy rather than by in-sample fit, tests, or plots.
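A minimal sketch of comparing candidate models on held-out data is given below. It is not the code used to produce the results above; it uses a single train/test split, base R's arima and predict, and the built-in AirPassengers series as an assumed stand-in for the ozone data. The two candidate specifications and the 1958 cut-off are illustrative choices.

```r
# Stand-in monthly series (assumption), modelled on the log scale.
y <- log(AirPassengers)

# Hold out the last two years for evaluation.
train <- window(y, end = c(1958, 12))
test <- window(y, start = c(1959, 1))
h <- length(test)

# Hypothetical candidate specifications to compare.
candidates <- list(
  "(0,1,1)(0,1,1)[12]" = list(order = c(0, 1, 1), seasonal = c(0, 1, 1)),
  "(0,1,1)(1,1,1)[12]" = list(order = c(0, 1, 1), seasonal = c(1, 1, 1))
)

# Fit each candidate on the training window and score it on the holdout.
holdout_mape <- sapply(candidates, function(spec) {
  fit <- arima(train, order = spec$order,
               seasonal = list(order = spec$seasonal, period = 12),
               method = "ML")
  pred <- predict(fit, n.ahead = h)$pred
  actual <- exp(as.numeric(test))    # back to the original scale
  forecasted <- exp(as.numeric(pred))
  100 * mean(abs((actual - forecasted) / actual))
})

holdout_mape  # MAPE in percent; lower is better on this particular split
```

A single split is sensitive to where the cut falls, so repeating the comparison over several forecast origins gives a more stable ranking.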

Simulating a basic sarima model in R

I'm looking for something like arima.sim() but for SARIMA models. I've looked at simulate.Arima() in the forecast package, but it seems to require an input dataset parsed by Arima(), which I don't want to do. I also looked at the gsarima library, but it seems to be able to simulate only seasonal AR models. Is there any way to simulate a SARIMA model if you only want to provide the following information:
The values of all pure ARMA and seasonal ARMA terms for different lags.
The number of differences for both the seasonal and non-seasonal integrated terms.
The length of a season.
The number of terms I want to simulate.

randomForest in R: Is there a possibility of calculating casewise confidence intervals?

R package randomForest reports mean squared errors for each tree in the forest. I need, however, a measure of confidence for each case in the data. Since randomForest calculates the casewise predictions by averaging the predictions of the single trees, I guess that it should also be possible to calculate a casewise standard error and thus a confidence interval. Can this be done using the output randomForest object (if so: how?) or do I have to dig into the source code?
No need to dig into the source code. You only need to read the documentation. ?predict.randomForest states that one of its arguments is called predict.all:
predict.all Should the predictions of all trees be kept?
So setting that to TRUE will keep a prediction for each case, for each tree, which you can then use to calculate standard error for each case.
I have recently been made aware of this paper by Stefan Wager, Trevor Hastie and Brad Efron which investigates more rigorously the idea of standard errors for the predictions generated by random forests (and other bagged predictors).
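The predict.all approach described above might look like the sketch below. The mtcars data, the train/test split and the choice of 500 trees are assumptions made for illustration. Note that the spread of per-tree predictions is an informal measure of uncertainty, not a formally justified standard error of the forest prediction.

```r
library(randomForest)

set.seed(1)

# Illustrative data (assumption): predict mpg from the other mtcars columns.
train_idx <- sample(nrow(mtcars), 24)
rf <- randomForest(mpg ~ ., data = mtcars[train_idx, ], ntree = 500)

# Keep every tree's prediction for each new case.
new_cases <- mtcars[-train_idx, ]
p <- predict(rf, newdata = new_cases, predict.all = TRUE)

# p$aggregate is the forest prediction per case;
# p$individual is a cases-by-trees matrix of single-tree predictions.
casewise <- data.frame(
  prediction = p$aggregate,
  tree_sd = apply(p$individual, 1, sd),
  lower = apply(p$individual, 1, quantile, probs = 0.025),
  upper = apply(p$individual, 1, quantile, probs = 0.975)
)

casewise
```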

In R, how to add an external variable to an ARIMA model?

Does anyone here know how I can add external variables to an ARIMA model?
In my case I am trying to build a volatility model, and I would like to add the squared returns in order to model an ARCH effect.
The reason I am not using GARCH models is that I am only interested in the volatility forecasts, and GARCH models report their errors on the returns, which is not the subject of my study.
I would like to add an external variable and look at the R^2 and p-values to see whether the coefficient is statistically significant.
I know that this is a very old question, but for people like me who were wondering about this: you need to use cbind with xreg.
For example:
Arima(X,order=c(3,1,3),xreg = cbind(ts1,ts2,ts3))
Each external time series should be the same length as the original.
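A minimal sketch of the xreg approach on simulated data follows. The variable names x1 and x2, the AR(1) order and the simulated coefficients are all assumptions made for illustration. The p-values come from an approximate Wald z-test built from the estimated coefficient covariance matrix; Arima does not print them itself.

```r
library(forecast)  # provides Arima() and forecast()

set.seed(42)
n <- 120

# Simulated external regressors and a response with AR(1) errors (assumption).
x1 <- rnorm(n)
x2 <- rnorm(n)
noise <- arima.sim(model = list(ar = 0.5), n = n)
y <- ts(2 * x1 - 1 * x2 + noise)

# Bind the regressors column-wise; each must be as long as y.
xreg_train <- cbind(x1 = x1, x2 = x2)
fit <- Arima(y, order = c(1, 0, 0), xreg = xreg_train)

# Approximate z statistics and two-sided p-values for every coefficient.
est <- coef(fit)
se <- sqrt(diag(vcov(fit)))
z <- est / se
p_value <- 2 * pnorm(-abs(z))
round(cbind(est, se, z, p_value), 4)

# Forecasting needs future regressor values with the same column names;
# the horizon is the number of rows supplied.
xreg_future <- cbind(x1 = rnorm(5), x2 = rnorm(5))
fc <- forecast(fit, xreg = xreg_future)
fc
```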

Resources