How do you forecast ARIMA with multiple regressors? - r

The complete R data and code for my question is here: https://pastebin.com/QtG6A7ZX.
I am new to R and still a beginner when it comes to time series analysis, so please forgive my ignorance.
I am attempting to model and forecast some enrollment data with 2 dummy-coded regressors. I have already used auto.arima to fit the model:
model <- auto.arima(enroll, xreg=x)
Before I forecast with this model, I am attempting to test its accuracy by selecting only a part of the time series (1:102 instead of 1:112), and likewise, a partial list of regressors.
Based on auto.arima, I fit the partial model as follows:
model_par <-arima((enroll_partial), c(1, 1, 1),seasonal = list(order = c(1, 0, 0), period = 5), xreg=x_par)
I have tried three different ways to forecast and get essentially the same error:
fcast_par <- forecast(model_par, h=10) #error
fcast_par <- forecast(model_par, h=10, xreg=x_par) #error
fcast_par <- forecast(model_par, h=10, xreg=forecast(x_par,h=10)) #error
'xreg' and 'newxreg' have different numbers of columns
I have tested using auto.arima with the partial data. That works, but gives me a different model and, although I specified 10 predictions, I get over 50:
model_par2 <- auto.arima(enroll_partial, xreg=x_par)
fcast_par <- forecast(model_par2, h=12, xreg=x_par)
fcast_par
So, my main question is, how do I specify an exact model and predict using more than 1 regressor given my data (see Paste Bin link above)?

The forecast() function is from the forecast package, and works with model functions that are from that package. This is why it is possible to produce forecasts from auto.arima() using forecast(model_par2,xreg=x_fcst).
The arima() function comes from the stats package, and so there are no guarantees that it would work with forecast(). To specify your own ARIMA model, you can use the Arima() function, which behaves very similarly to arima(), but you will be able to produce forecasts from it using forecast(model_par2,xreg=x_fcst).

You have two problems. One of them is that the various forecasting functions in R are making it (intentionally?) difficult on you.
The first problem is that you need to define the values of your regressors for the forecasting period. Extract the relevant data from x by using window():
x_fcst <- window(x,start=c(2017,4))
The second problem is that forecast() (which dispatches to forecast.Arima()) is a red herring here. You need to use predict() (which dispatches to predict.Arima() - note the capitalization in both cases!):
predict(model_par,newxreg=x_fcst,h=nrow(x_fcst))
which yields
$pred
Time Series:
Start = c(2017, 3)
End = c(2019, 1)
Frequency = 5
[1] 52.00451 52.00451 52.00451 52.00451 52.00451 52.00451 52.00451 52.00451
[9] 52.00451
$se
Time Series:
Start = c(2017, 3)
End = c(2017, 3)
Frequency = 5
[1] 17.13345
You can also use auto.arima(). Confusingly enough, this time forecast() (which still dispatches to forecast.Arima()) does work:
model_par2 <- auto.arima(enroll_partial, xreg=x_par)
forecast(model_par2,xreg=x_fcst)
which yields
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2017.40 39.91035 17.612358 62.20834 5.808514 74.01219
2017.60 59.51003 32.783451 86.23661 18.635254 100.38481
2017.80 69.81000 39.290834 100.32917 23.134962 116.48505
2018.00 57.49140 23.601444 91.38136 5.661183 109.32162
2018.20 55.45759 18.503034 92.41214 -1.059524 111.97470
2018.40 34.57866 -7.306747 76.46406 -29.479541 98.63686
2018.60 52.30199 6.702068 97.90192 -17.437074 122.04106
2018.80 61.61591 12.582055 110.64977 -13.374900 136.60672
2019.00 50.47661 -1.765945 102.71917 -29.421485 130.37471
And yes, you do get five times as many predictions. The first column is an expectation forecast, and the others give prediction intervals. These are governed by the level parameter to forecast().

Related

ets: Error in ets(timeseries, model = "MAM") : Nonseasonal data

I'm trying to create a forecast using an exponential smoothing method, but get the error "nonseasonal data". This is clearly not true - see code below.
Why am I getting this error? Should I use a different function (it should be able to perform simple, double, damped trend, seasonal, Winters method)?
library(forecast)
timelen<-48 # use 48 months
dates<-seq(from=as.Date("2008/1/1"), by="month", length.out=timelen)
# create seasonal data
time<-seq(1,timelen)
season<-sin(2*pi*time/12)
constant<-40
noise<-rnorm(timelen,mean=0,sd=0.1)
trend<-time*0.01
values<-constant+season+trend+noise
# create time series object
timeseries<-as.ts(x=values,start=min(dates),end=max(dates),frequency=1)
plot(timeseries)
# forecast MAM
ets<-ets(timeseries,model="MAM") # ANN works, why MAM not?
ets.forecast<-forecast(ets,h=24,level=0.9)
plot(ets.forecast)
Thanks&kind regards
You should use ts simply to create a time series from a numeric vector. See the help file for more details.
Your start and end values aren't correctly specified.
And setting the frequency at 1 is not a valid seasonality, it's the same as no seasonality at all.
Try:
timeseries <- ts(data=values, frequency=12)
ets <- ets(timeseries, model="MAM")
print(ets)
#### ETS(M,A,M)
#### Call:
#### ets(y = timeseries, model = "MAM")
#### ...
The question in your comments, why ANN works is because the third N means no seasonnality, so the model can be computed even with a non-seasonal timeseries.

arima model for multiple seasonalities in R

I'm learning to create a forecasting model for time series that has multiple seasonalities. Following is the subset of dataset that I'm refering to. This dataset includes hourly data points and I wish to include daily as well as weekly seasonalities in my arima model. Following is the subset of dataset:
data= c(4,4,1,2,6,21,105,257,291,172,72,10,35,42,77,72,133,192,122,59,29,25,24,5,7,3,3,0,7,15,91,230,284,147,67,53,54,55,63,73,114,154,137,57,27,31,25,11,4,4,4,2,7,18,68,218,251,131,71,43,55,62,63,80,120,144,107,42,27,11,10,16,8,10,7,1,4,3,12,17,58,59,68,76,91,95,89,115,107,107,41,40,25,18,14,15,6,12,2,4,1,6,9,14,43,67,67,94,100,129,126,122,132,118,68,26,19,12,9,5,4,2,5,1,3,16,89,233,304,174,53,55,53,52,59,92,117,214,139,73,37,28,15,11,8,1,2,5,4,22,103,258,317,163,58,29,37,46,54,62,95,197,152,58,32,30,17,9,8,1,3,1,3,16,109,245,302,156,53,34,47,46,54,65,102,155,116,51,30,24,17,10,7,4,8,0,11,0,2,225,282,141,4,87,44,60,52,74,135,157,113,57,44,26,29,17,8,7,4,4,2,10,57,125,182,100,33,27,41,39,35,50,69,92,66,30,11,10,11,9,6,5,10,4,1,7,9,17,24,21,29,28,48,38,30,21,26,25,35,10,9,4,4,4,3,5,4,4,4,3,5,10,16,28,47,63,40,49,28,22,18,27,18,10,5,8,7,3,2,2,4,1,4,19,59,167,235,130,57,45,46,42,40,49,64,96,54,27,17,18,15,7,6,2,3,1,2,21,88,187,253,130,77,47,49,48,53,77,109,147,109,45,41,35,16,13)
The code I'm trying to use is following:
tsdata = ts (data, frequency = 24)
aicvalstemp = NULL
aicvals= NULL
for (i in 1:5) {
for (j in 1:5) {
xreg1 = fourier(tsdata,i,24)
xreg2 = fourier(tsdata,j,168)
xregs = cbind(xreg1,xreg2)
armodel = auto.arima(bike_TS_west, xreg = xregs)
aicvalstemp = cbind(i,j,armodel$aic)
aicvals = rbind(aicvals,aicvalstemp)
}
}
The cbind command in the above command fails because the number of rows in xreg1 and xreg2 are different. I even tried using 1:length(data) argument in the fourier function but that also gave me an error. If someone can rectify the mistakes in the above code to produce a forecast of next 24 hours using an arima model with minimum AIC values, it would be really helpful. Also if you can include datasplitting in your code by creating training and testing data sets, it would be totally awesome. Thanks for your help.
I don't understand the desire to fit a weekly "season" to these data as there is no evidence for one in the data subset you provided. Also, you should really log-transform the data because they do not reflect a Gaussian process as is.
So, here's how you could fit models with a some form of hourly signals.
## the data are not normal, so log transform to meet assumption of Gaussian errors
ln_dat <- log(tsdata)
## number of hours to forecast
hrs_out <- 24
## max number of Fourier terms
max_F <- 5
## empty list for model fits
mod_res <- vector("list", max_F)
## fit models with increasing Fourier terms
for (i in 1:max_F) {
xreg <- fourier(ln_dat,i)
mod_res[[i]] <- auto.arima(tsdata, xreg = xreg)
}
## table of AIC results
aic_tbl <- data.frame(F=seq(max_F), AIC=sapply(mod_res, AIC))
## number of Fourier terms in best model
F_best <- which(aic_tbl$AIC==min(aic_tbl$AIC))
## forecast from best model
fore <- forecast(mod_res[[F_best]], xreg=fourierf(ln_dat,F_best,hrs_out))

Weighted Portmanteau Test for Fitted GARCH process

I have fitted a GARCH process to a time series and analyzed the ACF for squared and absolute residuals to check the model goodness of fit. But I also want to do a formal test and after searching the internet, The Weighted Portmanteau Test (originally by Li and Mak) seems to be the one.
It's from the WeightedPortTest package and is one of the few (perhaps the only one?) that properly tests the GARCH residuals.
While going through the instructions in various documents I can't wrap my head around what the "h.t" argument wants. It says in the info in R that I need to assign "a numeric vector of the conditional variances". This may be simple to an experienced user, though I'm struggling to understand. What is it that I need to do and preferably how would I code it in R?
Thankful for any kind of help
Taken directly from the documentation:
h.t: a numeric vector of the conditional variances
A little toy example using the fGarch package follows:
library(fGarch)
library(WeightedPortTest)
spec <- garchSpec(model = list(alpha = 0.6, beta = 0))
simGarch11 <- garchSim(spec, n = 300)
fit <- garchFit(formula = ~ garch(1, 0), data = simGarch11)
Weighted.LM.test(fit#residuals, fit#h.t, lag = 10)
And using garch() from the tseries package:
library(tseries)
fit2 <- garch(as.numeric(simGarch11), order = c(0, 1))
summary(fit2)
# comparison of fitted values:
tail(fit2$fitted.values[,1]^2)
tail(fit#h.t)
# comparison of residuals after unstandardizing:
unstd <- fit2$residuals*fit2$fitted.values[,1]
tail(unstd)
tail(fit#residuals)
Weighted.LM.test(unstd, fit2$fitted.values[,1]^2, lag = 10)

Linear regression for multivariate time series in R

As part of my data analysis, I am using linear regression analysis to check whether I can predict tomorrow's value using today's data.
My data are about 100 time series of company returns. Here is my code so far:
returns <- read.zoo("returns.csv", header=TRUE, sep=",", format="%d-%m-%y")
returns_lag <- lag(returns)
lm_univariate <- lm(returns_lag$companyA ~ returns$companyA)
This works without problems, now I wish to run a linear regression for every of the 100 companies. Since setting up each linear regression model manually would take too much time, I would like to use some kind of loop (or apply function) to shorten the process.
My approach:
test <- lapply(returns_lag ~ returns, lm)
But this leads to the error "unexpected symbol in "test2" " since the tilde is not being recognized there.
So, basically I want to run a linear regression for every company separately.
The only question that looks similar to what I wanted is Linear regression of time series over multiple columns , however there the data seems to be stored in a matrix and the code example is quite messy compared to what I was looking for.
Formulas are great when you know the exact name of the variables you want to include in the regression. When you are looping over values, they aren't so great. Here's an example that uses indexing to extract the columns of interest for each iteration
#sample data
x.Date <- as.Date("2003-02-01") + c(1, 3, 7, 9, 14) - 1
returns <- zoo(cbind(companya=rnorm(10), companyb=rnorm(10)), x.Date)
returns_lag <- lag(returns)
$loop over columns/companies
xx<-lapply(setNames(1:ncol(returns),names(returns)), function(i) {
today <-returns_lag[,i]
yesterday <-head(returns[,i], -1)
lm(today~yesterday)
})
xx
This will return the results for each column as a list.
Using the dyn package (which loads zoo) we can do this:
library(dyn)
z <- zoo(EuStockMarkets) # test data
lapply(as.list(z), function(z) dyn$lm(z ~ lag(z, -1)))

Predict likelihood of each failure type with competing risks model in R

I'm looking to run a competing risks model on historical data and predict the likelihood of each type of death in a new dataset for a specified period (let's say one period). So far I've looked into comp.risk in the timereg package and crr in cmprsk, but am having trouble figuring out how to use their predict methods to return these likelihoods.
Using the bmt dataset (from timereg package) and comp.risk as an example, I'd like to do something like:
m <- comp.risk(Surv(time, cause>0)~platelet+age+tcell, data=bmt,
bmt$cause, causeS=1, resample.iid=1)
ndata <- data.frame(platelet=c(1,0,0), age=c(0,1,0), tcell=c(0,0,1),
start.time=c(1, 1, 1), end.time=c(2, 2, 2))
out <- predict(m, newdata=ndata)
This would ideally predict the likelihood of each type of death between t=1 and t=2, but the predict function returns other types of results.
The final line won't work because the model wasn't built with the start/stop Survival object type and comp.risk doesn't seem to take left-truncated data. To illustrate, the below model statement including start times returns the error Error in comp.risk(Surv(start.time, time, cause > 0) ~ platelet + age + :
only right censored data.
m <- comp.risk(Surv(rep(0, nrow(bmt)), time, cause>0)~platelet+age+tcell, data=bmt,
bmt$cause, causeS=1, resample.iid=1)

Resources