ARIMA giving forecasts with higher RMSE than AR - r

I am trying to argue that ARIMA models are better than AR models: since AR is a subset of ARIMA, the best ARIMA model will not be worse than the best AR model, and may be better. I fitted an AR(6) model, and then used auto.arima() in R, which selected an ARIMA(1,0,2) model as optimal by AICc. I used both models for a rolling-window forecast, but I get an RMSE of 3.901 for AR(6) and 4.503 for ARIMA(1,0,2). My forecasting code is below (I know it is not very advanced, but I'm a beginner and this is the best way I could find; it matches my results by hand):
#find moving averages and residual errors
ma=rep(NA,14976)
for (i in 3:14976){
  ma[i] = mean(ds[(i-2):(i-1)])
}
frame <- ds-ma
#fit model
model <- arima(ds[1:14676],order=c(1,0,2),include.mean=TRUE,method="ML")
`%+=%` = function(e1,e2){eval.parent(substitute(e1 <- e1 + e2))}
training_data <- data[1:14676]
test_data <- data[14677:14976]
window <- 1
window1 <- 2
coef <- model$coef
history <- training_data[(length(training_data)-window+1):14676]
predictions <- list()
for (i in (1:length(test_data))){
  length <- length(history)
  lag <- array()
  for (d in ((length-window+1):length)){
    lag[d-i+1] <- history[d]}
  yhat <- coef[length(coef)]-1
  for (t in (1:window)){
    yhat %+=% (coef[t]*lag[window-t+1])}
  if (window1 != 0){
    for (j in ((window+1):(window+window1))){
      yhat %+=% (coef[j]*frame[14676+i-j+1])}
  }
  obs <- test_data[i]
  predictions <- append(predictions,yhat)
  history <- append(history,obs)
  print(predictions)
}
The graph that comes out for the ARIMA(1,0,2) forecast (compared to the actual values in the test set) looks better, but is quite raised. It seems like the intercept needs to be lower, which does give a better RMSE, but arima() gave the intercept it did so I haven't changed it.
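One possible contributor to the raised forecasts, offered as a guess since the data are not available: the coefficient that stats::arima() labels "intercept" is the process mean mu, not a regression constant. For an AR(1) the one-step forecast is mu + phi * (y_n - mu), so the constant term is mu * (1 - phi). Note also that the MA terms of an ARIMA model multiply the model's own innovation residuals (residuals(model)), not deviations from a moving average of the data. A minimal sketch on simulated data (the series and all variable names are made up for illustration):

```r
set.seed(1)
# Simulated AR(1) series with mean 10 (illustrative data, not the asker's)
y <- 10 + arima.sim(model = list(ar = 0.6), n = 500)

fit <- arima(y, order = c(1, 0, 0), include.mean = TRUE, method = "ML")
phi <- unname(coef(fit)["ar1"])
mu  <- unname(coef(fit)["intercept"])   # this is the mean, despite the name

y_last <- as.numeric(y[length(y)])

# One-step forecast written around the mean
manual <- mu + phi * (y_last - mu)

# Treating "intercept" as a regression constant overshoots by mu * phi
naive <- mu + phi * y_last

# Reference value from the fitted model itself
builtin <- as.numeric(predict(fit, n.ahead = 1)$pred)
print(c(manual = manual, builtin = builtin, naive = naive))
```

If the hand-written forecast sits above the series by roughly mean times the sum of the AR coefficients, this is a likely explanation.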

Related

How to specify zero-inflated negative binomial model in JAGS

I'm currently working on constructing a zero-inflated negative binomial model in JAGS to model yearly change in abundance using count data and am currently a bit lost on how best to specify the model. I've included an example of the base model I'm using below. The main issue I'm struggling with is that in the model output I'm getting poor convergence (high Rhat values, low Neff values) and the 95% credible intervals are huge. I realize that without seeing/running the actual data there's probably not much anyone can help with but I thought I'd at least try and see if there are any obvious errors in the way I have the basic model specified. I also tried fitting a variety of other model types (regular negative binomial, Poisson, and zero-inflated Poisson) but decided to go with the ZINB since it had the lowest DIC scores of all the models and also makes the most intuitive sense to me, given my data structure.
library(R2jags)
# Create example dataframe
years <- c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2)
sites <- c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3)
months <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
# Count data
day1 <- floor(runif(18,0,7))
day2 <- floor(runif(18,0,7))
day3 <- floor(runif(18,0,7))
day4 <- floor(runif(18,0,7))
day5 <- floor(runif(18,0,7))
df <- as.data.frame(cbind(years, sites, months, day1, day2, day3, day4, day5))
# Put count data into array
y <- array(NA,dim=c(2,3,3,5))
for(m in 1:2){
  for(k in 1:3){
    sel.rows <- df$years == m &
      df$months==k
    y[m,k,,] <- as.matrix(df)[sel.rows,4:8]
  }
}
# JAGS model
sink("model1.txt")
cat("
model {
  # PRIORS
  for(m in 1:2){
    r[m] ~ dunif(0,50)
  }
  t.int ~ dlogis(0,1)
  b.int ~ dlogis(0,1)
  p.det ~ dunif(0,1)
  # LIKELIHOOD
  # ECOLOGICAL SUBMODEL FOR TRUE ABUNDANCE
  for (m in 1:2) {
    zero[m] ~ dbern(pi[m])
    pi[m] <- ilogit(mu.binary[m])
    mu.binary[m] <- t.int
    for (k in 1:3) {
      for (i in 1:3) {
        N[m,k,i] ~ dnegbin(p[m,k,i], r[m])
        p[m,k,i] <- r[m] / (r[m] + (1 - zero[m]) * lambda.count[m,k,i]) - 1e-10 * zero[m]
        lambda.count[m,k,i] <- exp(mu.count[m,k,i])
        log(mu.count[m,k,i]) <- b.int
        # OBSERVATIONAL SUBMODEL FOR DETECTION
        for (j in 1:5) {
          y[m,k,i,j] ~ dbin(p.det, N[m,k,i])
        }#j
      }#i
    }#k
  }#m
}#END", fill=TRUE)
sink()
win.data <- list(y = y)
Nst <- apply(y,c(1,2,3),max)+1
inits <- function()list(N = Nst)
params <- c("N")
nc <- 3
nt <- 1
ni <- 50000
nb <- 5000
out <- jags(win.data, inits, params, "model1.txt",
n.chains = nc, n.thin = nt, n.iter = ni, n.burnin = nb,
working.directory = getwd())
print(out)
I tried fitting a ZINB model in JAGS using the code above, but I am having issues with model convergence.
The way that I have tended to specify zero-inflated models is to model the data as being Poisson distributed with mean that is either zero if that individual is part of the zero-inflated group, or distributed according to a gamma distribution otherwise. Something like:
Obs[i] ~ dpois(lambda[i] * is_zero[i])
is_zero[i] ~ dbern(zero_prob)
lambda[i] ~ dgamma(k, k/mean)
Something similar to this was first used in this paper: https://www.researchgate.net/publication/5231190_The_distribution_of_the_pathogenic_nematode_Nematodirus_battus_in_lambs_is_zero-inflated
These models usually converge OK, although the performance is not as good as for simpler models of course. You also need to make sure to supply initial values for is_zero so that the model starts with all individuals with positive counts in the appropriate group.
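A small base-R simulation of the data-generating process above, together with one way of building the initial values for is_zero. The parameter values and sample size are arbitrary assumptions; note that in this formulation is_zero = 1 marks the group that can have positive counts.

```r
set.seed(42)
n         <- 200
zero_prob <- 0.7   # assumed P(is_zero = 1), i.e. P(not an excess zero)
k         <- 2     # assumed gamma shape
mean_cnt  <- 5     # assumed mean count outside the zero-inflated group

is_zero <- rbinom(n, size = 1, prob = zero_prob)
lambda  <- rgamma(n, shape = k, rate = k / mean_cnt)
Obs     <- rpois(n, lambda * is_zero)

# Initial values: any individual with a positive count must start with
# is_zero = 1, otherwise its Poisson mean is 0 and the count is impossible.
init_is_zero <- as.integer(Obs > 0)
inits <- function() list(is_zero = init_is_zero)

print(table(observed_positive = Obs > 0, init_is_zero))
```

The inits function can then be passed to jags() in the same way as in the question.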
In your case, you have multiple time points, so you need to decide whether the zero-inflation is fixed over time (i.e. an individual cannot switch into or out of the zero-inflated group), or whether each observation is completely independent with respect to zero-inflation status. You also need to decide whether you want covariates of year/month/site to affect the mean count (i.e. the gamma part) or the probability of a positive count (i.e. the zero-inflation part). For the former, you need to index mean (in my formulation) by i and then use a GLM-like formula (probably with a log link) to relate it to the appropriate covariates. For the latter, you need to index zero_prob by i and then use a GLM-like formula (probably with a logit link) to relate it to the appropriate covariates. It is also possible to do both, but if you try to use the same covariates in both parts then you can expect convergence problems!
It would arguably be better to replace the separate Poisson-Gamma distributions with a single Negative Binomial distribution using the 'ecology parameterisation' with mean and k. This is not currently implemented in JAGS, but I will add it for the next update.
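The equivalence behind that suggestion can be checked numerically in R: a Poisson whose mean is gamma-distributed with shape k and rate k/mean has a negative binomial marginal with size k and the same mean. A quick sketch with arbitrary values for k and the mean:

```r
k <- 2; m <- 4; x <- 0:10

# Marginal probability of each count, integrating the Poisson over the gamma
mixture <- sapply(x, function(xi) {
  integrate(function(l) dpois(xi, l) * dgamma(l, shape = k, rate = k / m),
            lower = 0, upper = Inf)$value
})

# Negative binomial in the mean/size ("ecology") parameterisation
negbin <- dnbinom(x, size = k, mu = m)

print(round(cbind(x, mixture, negbin), 5))
```

In terms of JAGS's dnegbin(p, r), this corresponds to r = k and p = k / (k + mean).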

How to convert one-fold cross-validation to K-fold cross-validation in R

I have a GAM model for which I would like to calculate AUC, TSS (True Skill Statistic) and RMSE through 5-fold cross-validation in R. Unfortunately, the caret package does not support GAM and therefore cannot be used. As I didn’t find any alternative, I tried to build the code for cross-validation myself, and it works well, with the only problem that it is only one-fold cross-validation. Could anybody help me to make this 5-fold? Sorry if this is an elementary question, I am new to R.
sample <- sample(c(TRUE, FALSE), nrow(DF), replace=TRUE, prob=c(0.8,0.2))
train <- DF[sample, ]
test <- DF[!sample, ]
predicted <- predict(GAM, test, type="response")
# Calculating RMSE
RMSE(test$Y, predicted)
# Calculating AUC
auc(test$Y, predicted)
GAM_TSS <- gam(Y ~ X1 + X2 + X3 + X4 + s(X5, k = 3), train, family = "binomial")
test$pred <- predict(GAM_TSS, type="response", newdata=test)
roc.curve <- roc(test$Y, test$pred, ci=T)
plot(roc.curve)
threshold <- 0.1
CM <- confusionMatrix(factor(test$pred>threshold), factor(test$P_A==1), positive="TRUE")
CM <- CM$byClass
Sensitivity <- CM[['Sensitivity']]
Specificity <- CM[['Specificity']]
# Calculating TSS
TSS = Sensitivity + Specificity - 1
TSS
I have come across precisely this problem with GAM in the past. My approach was to create a vector to split data randomly into parts as equally sized as possible, then loop through the fold ids as follows:
k <- 5
FoldID <- rep(1:k, ceiling(nrow(modelData)/k))
length(FoldID) <- nrow(modelData)
FoldID <- sample(FoldID, replace = FALSE)
for(fold in 1:k){
  train_data <- modelData[FoldID != fold, ]
  val_data <- modelData[FoldID == fold, ]
  # Create training model and predictions
  # Calculate RMSE data etc.
  # Add a line with fold validation results to a dataframe
}
# Calculate column means of your validation results frame
I will leave you to fill in the gaps to suit your own requirements. It would also be a good idea to add an outer loop (outside the FoldID creation) for repeats.
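One way the gaps might be filled in, as a sketch rather than a definitive implementation. To keep it runnable with base R only, it uses simulated data and glm() as a stand-in for the model fit; swapping in gam() with the formula from the question should work the same way. The data frame, the 0.5 threshold, and the helper auc_rank() are all assumptions made for illustration.

```r
set.seed(123)
# Simulated presence/absence data (illustrative only)
n <- 500
modelData <- data.frame(X1 = rnorm(n), X5 = rnorm(n))
modelData$Y <- rbinom(n, 1, plogis(0.5 * modelData$X1 + modelData$X5))

# AUC via the rank (Mann-Whitney) formula; hypothetical helper
auc_rank <- function(obs, pred) {
  r  <- rank(pred)
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

k <- 5
FoldID <- sample(rep(1:k, length.out = nrow(modelData)))
threshold <- 0.5   # assumed classification cut-off
results <- data.frame()

for (fold in 1:k) {
  train_data <- modelData[FoldID != fold, ]
  val_data   <- modelData[FoldID == fold, ]

  # Stand-in model, refitted on the training folds only
  fit  <- glm(Y ~ X1 + X5, data = train_data, family = binomial)
  pred <- predict(fit, newdata = val_data, type = "response")

  pos  <- pred > threshold
  sens <- sum(pos & val_data$Y == 1) / sum(val_data$Y == 1)
  spec <- sum(!pos & val_data$Y == 0) / sum(val_data$Y == 0)

  results <- rbind(results, data.frame(
    fold = fold,
    RMSE = sqrt(mean((val_data$Y - pred)^2)),
    AUC  = auc_rank(val_data$Y, pred),
    TSS  = sens + spec - 1
  ))
}

print(results)
print(colMeans(results[, c("RMSE", "AUC", "TSS")]))
```

The important detail is that the model is refitted inside the loop on train_data, so each fold is scored on data the model has not seen.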

Two methods of recovering fitted values from a Bayesian Structural Time Series model yield different results

Two conceptually plausible methods of retrieving in-sample predictions (or "conditional expectations") of y[t] given y[t-1] from a bsts model yield different results, and I don't understand why.
One method uses the prediction errors returned by bsts (defined as e=y[t] - E(y[t]|y[t-1]); source: https://rdrr.io/cran/bsts/man/one.step.prediction.errors.html):
library(bsts)
get_yhats1 <- function(fit){
  # One step prediction errors defined as e=y[t] - yhat
  # Recover yhat by y-e
  bsts.pred.errors <- bsts.prediction.errors(fit, burn=SuggestBurn(0.1, fit))$in.sample
  predictions <- t(apply(bsts.pred.errors, 1, function(e){fit$original.series-e}))
  return(predictions)
}
Another sums the contributions of all model components at time t:
get_yhats2 <- function(fit){
  burn <- SuggestBurn(0.1, fit)
  X <- fit$state.contributions
  niter <- dim(X)[1]
  ncomp <- dim(X)[2]
  nobs <- dim(X)[3]
  # initialize the predictions matrix with zeros
  predictions <- matrix(data = 0, nrow = niter - burn, ncol = nobs)
  p0 <- predictions
  comps <- seq_len(ncomp)
  for (comp in comps) {
    # pull out the state contributions for this component, dropping burn-in:
    # a (niter - burn) x nobs matrix
    compX <- X[-seq_len(burn), comp, ]
    # accumulate the predictions across each component
    predictions <- predictions + compX
  }
  return(predictions)
}
Fit a model:
## Air passengers data
data("AirPassengers")
# 11 years, monthly data (timestep=monthly) --> 132 observations
Y <- stats::window(AirPassengers, start=c(1949,1), end=c(1959,12))
y <- log(Y)
ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons=12, season.duration=1)
bsts.model <- bsts(y, state.specification=ss, niter=500, family='gaussian')
Compute and compare predictions using each of the functions:
p1 <- get_yhats1(bsts.model)
p2 <- get_yhats2(bsts.model)
# Compare predictions for t=1:5, first MCMC iteration:
p1[1,1:5]; p2[1,1:5]
I'm the author of bsts.
The 'prediction errors' in bsts come from the filtering distribution. That is, they come from p(state | past data). The state contributions come from the smoothing distribution, i.e. p(state | all data). The filtering distribution looks backward in time, while the smoothing distribution looks both forward and backward. One typically needs the filtering distribution while using a fitted model, and the smoothing distribution while fitting the model in the first place.
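To make the distinction concrete, here is a hand-rolled Kalman filter and smoother for a local level model in base R. It is not bsts code and does not reproduce its internals; the variances are assumed known, and all names are made up for illustration. The one-step predictions use only past data, while the smoothed states use the whole series, so the two do not coincide.

```r
set.seed(7)
n   <- 100
s2e <- 1.0    # observation noise variance (assumed known)
s2n <- 0.1    # level innovation variance (assumed known)
level <- cumsum(rnorm(n, sd = sqrt(s2n)))
y     <- level + rnorm(n, sd = sqrt(s2e))

a_pred <- P_pred <- numeric(n + 1)   # mean/variance of state given data up to t-1
a_filt <- P_filt <- numeric(n)       # mean/variance of state given data up to t
a_pred[1] <- 0; P_pred[1] <- 1e6     # vague prior on the initial level

# Forward pass: filtering distribution p(state_t | y_1..y_t)
for (t in 1:n) {
  K <- P_pred[t] / (P_pred[t] + s2e)                 # Kalman gain
  a_filt[t] <- a_pred[t] + K * (y[t] - a_pred[t])
  P_filt[t] <- P_pred[t] * (1 - K)
  a_pred[t + 1] <- a_filt[t]                         # random-walk level
  P_pred[t + 1] <- P_filt[t] + s2n
}

# Backward pass: smoothing distribution p(state_t | y_1..y_n)
a_sm <- a_filt; P_sm <- P_filt
for (t in (n - 1):1) {
  J <- P_filt[t] / P_pred[t + 1]
  a_sm[t] <- a_filt[t] + J * (a_sm[t + 1] - a_pred[t + 1])
  P_sm[t] <- P_filt[t] + J^2 * (P_sm[t + 1] - P_pred[t + 1])
}

one_step_pred <- a_pred[1:n]   # rough analogue of y minus the prediction error
smoothed_fit  <- a_sm          # rough analogue of the summed state contributions
print(head(cbind(y, one_step_pred, smoothed_fit)))
```

The two columns agree only in spirit: the smoothed values track the data more closely because each one has also seen the observations that come after it.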

How to recover fitted values from BSTS poisson model (in R)?

I am trying to recover in-sample predictions (fitted values) from a bsts model with a Poisson response, using the bsts package in R. The following results in the error "Prediction errors are not supported for Poisson or logit models."
data("AirPassengers")
# 11 years, monthly data (timestep=monthly) --> 132 observations
Y <- stats::window(AirPassengers, start=c(1949,1), end=c(1959,12))
y <- log10(Y)
ss <- AddLocalLinearTrend(list(), y)
ss <- AddSeasonal(ss, y, nseasons=12, season.duration=1)
bsts.model <- bsts(Y, state.specification=ss, niter=150, family='poisson')
bsts.prediction.errors(bsts.model)
Is there a way to retrieve predictions on model-training data with a poisson model in bsts?
One way to do it is to extract the contribution of each model component at time t and sum them.
get_yhats2 <- function(fit){
  burn <- SuggestBurn(0.1, fit)
  X <- fit$state.contributions
  niter <- dim(X)[1]
  ncomp <- dim(X)[2]
  nobs <- dim(X)[3]
  # initialize the predictions matrix with zeros
  predictions <- matrix(data = 0, nrow = niter - burn, ncol = nobs)
  p0 <- predictions
  comps <- seq_len(ncomp)
  for (comp in comps) {
    # pull out the state contributions for this component, dropping burn-in:
    # a (niter - burn) x nobs matrix
    compX <- X[-seq_len(burn), comp, ]
    # accumulate the predictions across each component
    predictions <- predictions + compX
  }
  return(predictions)
}
get_yhats2(bsts.model)
But I also posted a separate question showing that this method did not necessarily match my expectations, even in the Gaussian case.
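A more compact version of the same idea, as a sketch. It assumes, as I understand the bsts Poisson family, that the state contributions are on the log (linear predictor) scale, so the summed state is exponentiated to get the expected count per unit of exposure; please verify this against the package documentation. The helper name sum_contributions() and the toy array are made up for illustration.

```r
# X: array of dimension niter x ncomponents x nobs,
#    e.g. fit$state.contributions from a bsts fit
sum_contributions <- function(X, burn = 0, inverse_link = identity) {
  # Guard against burn = 0, where -seq_len(0) would select nothing
  keep <- if (burn > 0) -seq_len(burn) else seq_len(dim(X)[1])
  # drop = FALSE keeps three dimensions even with a single component
  Xk <- X[keep, , , drop = FALSE]
  state_sum <- apply(Xk, c(1, 3), sum)   # (niter - burn) x nobs
  inverse_link(state_sum)
}

# Toy array standing in for state contributions:
# 4 draws, 2 components, 3 time points, every entry equal to log(2)
X <- array(log(2), dim = c(4, 2, 3))
mu_hat <- sum_contributions(X, burn = 1, inverse_link = exp)
print(dim(mu_hat))
print(mu_hat[1, ])
```

With a fitted model this would be called along the lines of sum_contributions(bsts.model$state.contributions, burn = SuggestBurn(0.1, bsts.model), inverse_link = exp).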

Confusion matrix for multinomial logistic regression & ordered logit

I would like to create confusion matrices for a multinomial logistic regression as well as a proportional odds model but I am stuck with the implementation in R. My attempt below does not seem to give the desired output.
This is my code so far:
CH <- read.table("http://data.princeton.edu/wws509/datasets/copen.dat", header=TRUE)
CH$housing <- factor(CH$housing)
CH$influence <- factor(CH$influence)
CH$satisfaction <- factor(CH$satisfaction)
CH$contact <- factor(CH$contact)
CH$satisfaction <- factor(CH$satisfaction,levels=c("low","medium","high"))
CH$housing <- factor(CH$housing,levels=c("tower","apartments","atrium","terraced"))
CH$influence <- factor(CH$influence,levels=c("low","medium","high"))
CH$contact <- relevel(CH$contact,ref=2)
model <- multinom(satisfaction ~ housing + influence + contact, weights=n, data=CH)
summary(model)
preds <- predict(model)
table(preds,CH$satisfaction)
omodel <- polr(satisfaction ~ housing + influence + contact, weights=n, data=CH, Hess=TRUE)
preds2 <- predict(omodel)
table(preds2,CH$satisfaction)
I would really appreciate some advice on how to correctly produce confusion matrices for my 2 models!
You can refer to the question "Predict() - Maybe I'm not understanding it".
In predict() you need to pass unseen data (via the newdata argument) to get predictions for it.
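Separately from the point above, one thing that may be worth checking (an observation about the code in the question, not a confirmed diagnosis): the data appear to be in grouped form, with n giving the number of respondents per row, and table() counts rows rather than respondents. xtabs() can weight each row by its count. A toy sketch with made-up predictions and counts:

```r
# Toy grouped data: each row stands for n respondents (values are invented)
toy <- data.frame(
  satisfaction = factor(c("low", "medium", "high", "low", "high", "medium"),
                        levels = c("low", "medium", "high")),
  pred         = factor(c("low", "low", "high", "high", "high", "medium"),
                        levels = c("low", "medium", "high")),
  n            = c(21, 14, 30, 9, 25, 11)
)

# Unweighted: counts rows of the grouped table
unweighted <- table(predicted = toy$pred, observed = toy$satisfaction)

# Weighted: counts respondents, using n as the cell weight
weighted <- xtabs(n ~ pred + satisfaction, data = toy)

# Overall accuracy from the weighted confusion matrix
accuracy <- sum(diag(weighted)) / sum(weighted)
print(unweighted)
print(weighted)
print(accuracy)
```

Applied to the question's objects, the analogous call would be something like xtabs(n ~ preds + satisfaction, data = cbind(CH, preds = preds)).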
