How to specify newxreg when predicting from an ARIMA model in R?

I have fit the model below to my time series data. The xreg consists of a time vector that goes from 1 through 1000 and of 12 indicator variables (1 or 0) that represent the month. The data that I'm dealing with has some strong weekly and monthly seasonal patterns.
fit <- arima(x, order = c(3, 0, 0),
             seasonal = list(order = c(1, 0, 1), period = 7),
             xreg = cbind(t, M1, M2, M3, M4, M5, M6,
                          M7, M8, M9, M10, M11, M12),
             include.mean = FALSE,
             transform.pars = TRUE,
             fixed = NULL, init = NULL,
             method = c("CSS-ML", "ML", "CSS"),
             optim.method = "BFGS",
             optim.control = list(), kappa = 1e6)
At this time I'm trying to figure out how I can predict 14 values for the month of January (M1=1).
So when I use the predict function in R, I think I need to specify in the newxreg portion that I want M1=1 and M2,...,M12=0 for my prediction - correct?
I've played around with the code, but I couldn't get it to work, and I wasn't able to find much detailed information online about the newxreg argument of the predict function.
Can anyone explain how I can get predictions for one particular month, say January?
And how do I need to specify that in the newxreg part of the predict function?
Many thanks in advance!

I have finally found a way out and wanted to post it - in case it helps someone else.
So basically, newxreg should be a matrix that contains the values of the regressors you want predictions for.
In my case, my regressors were all 0/1 dummy variables that code for a particular month.
So I created a matrix of 0's and 1's to be used as my newxreg.
I defined a matrix mx and then set newxreg = mx in the predict function, making sure that the number of rows of mx was at least n.ahead.
pred <- predict(fit, n.ahead = n, newxreg = mx)
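For concreteness, here is a minimal sketch of how such a matrix could be built for 14 January forecasts. It assumes the fit and the series x from the question, and that the time regressor simply keeps counting past the end of the sample; the columns must be in the same order as in the xreg used when fitting.
# 14 steps ahead, all falling in January (M1 = 1, M2, ..., M12 = 0)
n <- 14
t_future  <- seq(length(x) + 1, length.out = n)   # continue the time trend
month_ind <- matrix(0, nrow = n, ncol = 12,
                    dimnames = list(NULL, paste0("M", 1:12)))
month_ind[, "M1"] <- 1                            # January indicator switched on

mx   <- cbind(t = t_future, month_ind)            # same column order as in the fit
pred <- predict(fit, n.ahead = n, newxreg = mx)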
Hope this is helpful for others as well!

Related

Implementation of time series cross-validation

I am working with time series 551 of the monthly data of the M3 competition.
So my data is:
library(forecast)
library(Mcomp)
# Time Series
# Subset the M3 data to contain the relevant series
ts.data <- subset(M3, 12)[[551]]
print(ts.data)
I want to implement time series cross-validation for the last 18 observations of the in-sample interval.
Some people would normally call this “forecast evaluation with a rolling origin” or something similar.
How can I achieve that? What does the in-sample interval mean? Which time series must I evaluate?
I'm quite confused; any help to clear this up would be welcome.
The tsCV function of the forecast package is a good place to start.
From its documentation,
tsCV(y, forecastfunction, h = 1, window = NULL, xreg = NULL, initial = 0, ...)
Let ‘y’ contain the time series y[1:T]. Then ‘forecastfunction’ is applied successively to the time series y[1:t], for t = 1, ..., T-h, making predictions f[t+h]. The errors are given by e[t+h] = y[t+h] - f[t+h].
That is, tsCV first fits a model to y[1] and forecasts y[1 + h], then fits a model to y[1:2] and forecasts y[2 + h], and so on, for T - h steps.
The tsCV function returns the forecast errors.
Applying this to the training part of ts.data:
# function to fit a model and forecast
fmodel <- function(x, h) {
  forecast(Arima(x, order = c(1, 1, 1), seasonal = c(0, 0, 2)), h = h)
}
# time-series CV
cv_errs <- tsCV(ts.data$x, fmodel, h = 1)
# RMSE of the time-series CV
sqrt(mean(cv_errs^2, na.rm=TRUE))
# [1] 778.7898
In your case, it may be that you are supposed to
fit a model to ts.data$x and forecast ts.data$xx[1],
then fit a model to c(ts.data$x, ts.data$xx[1]) and forecast ts.data$xx[2],
and so on.
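To restrict the evaluation to the last 18 observations of the in-sample part, one option is the initial argument of tsCV, which skips the early forecast origins. A sketch, assuming ts.data and fmodel from above (the exact indexing of initial may be off by one depending on the forecast-package version):
h <- 1
n <- length(ts.data$x)

# forecast errors are produced only for (roughly) the last 18 origins;
# earlier entries of the result are NA
cv_errs_18 <- tsCV(ts.data$x, fmodel, h = h, initial = n - 18)
sqrt(mean(cv_errs_18^2, na.rm = TRUE))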

acf() function at lag.max = 0

This might be a very simple question, but what exactly is calculated when acf lag.max = 0?
When lag.max = 1, I am assuming it is only calculating the autocovariance (when type = "covariance") given the previous observation, such that given an observation at time t, it is checking covariance with observation at t-1, for all observations. So what is the number generated when lag.max = 0? I notice it is very close to the actual variance of the data, but not precisely the same.
With type = "covariance", acf computes the autocovariance of your data at lags 0 up to lag.max. When lag.max = 0, the output of acf(your_data, lag.max = 0, type = "covariance") is essentially the sample covariance of your data with itself, cov(your_data, your_data). The small numerical difference arises because acf divides by n, whereas cov (and var) divide by n - 1. In essence, acf with type = "covariance" computes something like cov while shifting the start point of the data in the second argument, like this:
n <- length(your_data)
cov(your_data[1:(n - nlag)], your_data[(1 + nlag):n])  # for lag nlag
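A quick numerical check of the lag-0 case (a sketch on simulated data; your_data stands in for any numeric vector):
set.seed(1)
your_data <- rnorm(100)
n <- length(your_data)

acf(your_data, lag.max = 0, type = "covariance", plot = FALSE)$acf[1]  # divisor n
var(your_data)                                                         # divisor n - 1
sum((your_data - mean(your_data))^2) / n                               # equals the acf value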

Calculating Cook's Distance in R manually...running into issues with the for loop

I have been trying to calculate Cook's distance manually for a multiple linear regression dataset, but running into problems with the for loop. What I have been doing is this:
This is the original linear model, and the associated fitted values, length = 'n'.
fitted = lm10$fitted.values
This is the new, n X n, blank matrix, I created to hold the new fitted values.
lev.mat <- matrix(rep(0, nrow(X.des)^2), nrow = nrow(X.des))
I wanted to save time, so I filled in the first column of the matrix manually.
newData = as.data.frame(X.des[-1, ])
newModel = lm(fev ~ ., data = newData - 1)
newFitted = newModel$fitted.values
newDist = c(fitted[1], newFitted)
lev.mat[, 1] = newDist
I then tried to fill in the rest of the columns of the lev.mat similarly, using the for loop.
for(i in 2:nrow(lev.mat)){
  newData = as.data.frame(X.des[-i, ])
  newModel = lm(fev ~ ., data = newData - 1)
  newFitted = newModel$fitted.values
  newDist = c(newFitted[1:(i - 1)], fitted[i], newFitted[i:length(newFitted)])
  lev.mat[, i] = newDist
}
But I keep getting this error repeatedly:
Error in lev.mat[, i] <- newDist :
  number of items to replace is not a multiple of replacement length
I have been at this for three hours now, and it's getting frustrating. Can anybody point out the error and help me move along? My next steps are to calculate the difference between the original fitted values and each column of the new fitted-values matrix, sum the differences, and divide by the product of the number of predictors and the MSE.
Thanks!
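As a side note on the error itself: a likely culprit is the last iteration, where newFitted[i:length(newFitted)] runs past the end of newFitted (which has only n - 1 elements), so newDist no longer has nrow(lev.mat) elements. One way to build each column so it always has the right length is append(); a sketch that reuses the objects from the question (the lm call is kept exactly as posted):
for (i in 1:nrow(lev.mat)) {
  newData   <- as.data.frame(X.des[-i, ])
  newModel  <- lm(fev ~ ., data = newData - 1)
  newFitted <- newModel$fitted.values
  # insert the original fitted value at position i, so the column always
  # has nrow(lev.mat) elements (this also covers i = 1 and the last row)
  lev.mat[, i] <- append(newFitted, fitted[i], after = i - 1)
}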
Thanks a lot to @Harlan Nelson for providing me with a wonderful link! I used the background provided in that link to complete my work. Here is the rest of my code:
Hmat = hatvalues(lm10)
Leverage = Hmat/(1 - Hmat)
mse = (lm10$residuals)^2/var(lm10$residuals)
CooksD <- (1/6)*(mse)*Leverage
lm10 was the name of my linear model, and I had 6 predictors in the model. This helped me calculate Cook's Distance for the model. Thanks again!
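For comparison, base R computes Cook's distance directly, and the standard formula can be checked against it. A short sketch (assuming the model lm10 from above; this is the textbook definition, which need not agree with the shortcut used above):
# built-in Cook's distance
cd_builtin <- cooks.distance(lm10)

# the same quantity by hand:
# D_i = e_i^2 * h_ii / (p * s^2 * (1 - h_ii)^2),
# where p is the number of estimated coefficients and s^2 the residual variance
h  <- hatvalues(lm10)
p  <- length(coef(lm10))
s2 <- summary(lm10)$sigma^2
cd_manual <- residuals(lm10)^2 * h / (p * s2 * (1 - h)^2)

all.equal(unname(cd_builtin), unname(cd_manual))  # should be TRUE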

What does "n.roll" in "dccforecast" really do?

I am having difficulty understanding what n.roll in dccforecast really does. I will give the code and the graphs produced. Please help me understand and fix the issue.
First, let me introduce my data. I have two time series of 4171 daily observations (16 years). I am interested in using the first 6 years of data to estimate the DCC coefficients and then forecasting the correlation for
i) the remaining 10 years (2609 daily observations):
[t=0]----estimation period (1562)-----[t=1562]---forecasting period (2609)---[t=4171]
ii) the following 5 years (1305 daily observations).
[t=0]----estimation period (1562)-----[t=1562]---forecasting period (1305)---[t=2864]
At each point of t in my forecasting period, I want to use all the data available to time t and get a forecast for t+1.
For a forecast at time t+1, use all the data from 0 to t
For a forecast at time t+2, use all the data from 0 to t+1
For a forecast at time t+3, use all the data from 0 to t+2
For a forecast at time t+4, use all the data from 0 to t+3
and so on.
Here is my code for case i:
xspec = ugarchspec(mean.model = list(armaOrder = c(1, 0)),
                   variance.model = list(garchOrder = c(1, 1), model = 'gjrGARCH'),
                   distribution.model = 'norm')
uspec = multispec(replicate(2, xspec))  # 2 is the number of variables
dccspec = dccspec(uspec = uspec, dccOrder = c(1, 1), model = 'aDCC', distribution = 'mvnorm')
dcc.fit.focast = dccfit(dccspec, data = tst, out.sample = 2609,
                        forecast.length = 2609, fit.control = list(eval.se = T))
dcc.focast = dccforecast(dcc.fit.focast, n.ahead = 1, n.roll = 0)
plot(dcc.focast, which = 3, series = c(1, 2))
I am defining "n.roll = 0" because as it is explained in "rmgarch" manual (When n.roll = 0, all forecasts are based on an unconditional n-ahead forecast routine based on the approximation method described in ENGLE and SHEPPARD (2001) paper), I am interested in using the approximation method used in Engle and Sheppard (2001).
However, when I change n.roll from 0 to 1305,
dcc.focast = dccforecast(dcc.fit.focast, n.ahead = 1, n.roll = 1305)
plot(dcc.focast, which = 3, series=c(1,2))
then I get the following graph:
And when I increase n.roll, the end point of the forecast also moves further out. The rmgarch manual describes n.roll in terms of the unconditional mean, but somehow it determines the number of forecasts produced.
Can someone please explain this and give recommendations on how to get the forecasts I need please?
Thanks in advance,
Martin
I ended up writing a loop that produces the forecasts one step at a time, and it worked.
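For reference, a rough sketch of such a loop is given below. It refits the DCC model on an expanding window and stores the one-step-ahead forecast correlation at each step (assuming the bivariate data matrix tst and the dccspec object from the code above; the exact rcor indexing may differ between rmgarch versions, and refitting at every step is slow). The built-in alternative appears to be to reserve out.sample observations in dccfit and let dccforecast roll the one-step forecast origin with n.roll while keeping the estimated parameters fixed, which would explain why the plotted forecast extends further as n.roll grows.
library(rmgarch)

est_end = 1562                      # end of the estimation window
horizon = nrow(tst) - est_end       # number of one-step-ahead forecasts
cor_fc  = numeric(horizon)          # forecast correlation between series 1 and 2

for (i in seq_len(horizon)) {
  # expanding window: use all data available up to time est_end + i - 1
  fit_i = dccfit(dccspec, data = tst[1:(est_end + i - 1), ],
                 fit.control = list(eval.se = FALSE))
  fc_i  = dccforecast(fit_i, n.ahead = 1)
  cor_fc[i] = rcor(fc_i)[[1]][1, 2, 1]  # 1-step-ahead correlation forecast
}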

Random Forest - Caret - Time Series

I have a time series (Apple stock closing prices) turned into a data frame in order to fit a random forest using caret. I created lags of 1 day, 2 days and 6 days, and I want to predict the next 2 days, i.e. a two-step-ahead forecast. But caret uses the predict function, which does not allow the arguments that the forecast function has, and I have seen that some people try to pass the n.ahead argument, but that is not working for me. Any advice? See the code:
df <- data.frame(APPL)
df$f1 <- lag(df$APPL, 1)
df$f2 <- lag(df$APPL, 2)
df$f3 <- lag(df$APPL, 6)
# change column names
colnames(df) <- c("price", "price_1", "price_2", "price_6")
# remove rows (days) with NA
df <- df[complete.cases(df), ]
fitControl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 1,
  classProbs = FALSE,
  verboseIter = TRUE,
  preProcOptions = list(thresh = 0.95, na.remove = TRUE, verbose = TRUE))
set.seed(1234)
rf_grid <- expand.grid(mtry = c(1:3))
fit <- train(price ~ .,
             data = df,
             method = "rf",
             preProcess = c("center", "scale"),
             tuneGrid = rf_grid,
             trControl = fitControl,
             ntree = 200,
             metric = "RMSE")
nextday <- predict(fit, `WHAT GOES HERE?`)
If I just call predict(fit), it uses the whole dataset as newdata, which I think is wrong. The other thing I was thinking about is a loop: predict one step ahead, since I have the data from 1, 2 and 6 days ago, and then, for the two-step-ahead forecast, fill the "one day ago" cell with the forecast I just made.
Right now, you can't pass other options to the underlying predict method. There is a proposed change that might enable this though.
In your case, you should give the predict function a data frame that has the appropriate predictors for the next few observations.
#1: Rename the columns after creating the lags: colnames(df) <- c("price", "price_1", "price_2", "price_6").
#2: predict (from stats) is a generic function for predictions from the results of various model-fitting functions: predict(model_object, dataframe).
There are three cases for the dataframe argument:
Case 1: the training data on which the model was fitted (in-sample prediction).
Case 2: test data (out-of-sample prediction).
Case 3: forecast data, i.e. forecasted values of the independent variables, from which the model produces forecasted values of the dependent variable.
The column names in cases 2 and 3 must match the column names of the training data.
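For the recursive two-step-ahead forecast described in the question, one possible sketch (assuming the caret model fit and the data frame df built above; predict on a train object applies the stored preprocessing automatically):
# most recent 7 closing prices, oldest first, newest last
last_prices <- tail(df$price, 7)

# step 1: lags come from the observed prices 1, 2 and 6 days back
new1 <- data.frame(price_1 = last_prices[7],
                   price_2 = last_prices[6],
                   price_6 = last_prices[2])
pred1 <- predict(fit, newdata = new1)

# step 2: the 1-day lag is now the forecast just made; the others shift by one day
new2 <- data.frame(price_1 = pred1,
                   price_2 = last_prices[7],
                   price_6 = last_prices[3])
pred2 <- predict(fit, newdata = new2)

c(pred1, pred2)  # two-step-ahead forecasts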
