Passing xreg in auto.arima - r

I'm trying to fit a model to my data set using the auto.arima function, but I get an error message of "no suitable ARIMA model found", which I suspect can be attributed to what I'm passing for the xreg argument. My data set contains 1176 total observations: one variable I'm trying to forecast, with the rest being dummy variables (holidays, days of the week, etc.) that I'm trying to pass into auto.arima as regressors.
library(forecast)
data <- read.csv(...)
#extract the variable to be forecast and the regressors
forecast.var <- data[, 29]
regressors <- data[, 2:27]
#split forecast variable and regressors into train and test sets
train.r <- regressors[1:1000, ]
test.r <- regressors[1001:1176, ]
train.f <- forecast.var[1:1000]
test.f <- forecast.var[1001:1176]
#fit the model, passing 'train.r' through data.matrix() into 'xreg', since
#the documentation for this function says it must be a vector or matrix
fit <- auto.arima(train.f, stepwise = FALSE, approximation = FALSE,
                  xreg = data.matrix(train.r))
If I attempt to run this, I get the aforementioned error message. I do get a fitted model if I don't pass anything for xreg, but the fitted values are nowhere near the actuals. I should mention that train.r already has column names. So what is it that I'm doing wrong? How do I successfully pass the regressors in the hope that my model comes out more accurate?

I managed to fix this by excluding one of the dummy variables. That is, for days of the week the dummies package created 7 variables for me. However, if you have 7 categories, only 6 dummies are needed; including all 7 makes the xreg matrix collinear with the model's intercept. I excluded one and then auto.arima worked fine.
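A minimal sketch of that fix (assuming the day-of-week dummies carry names like "Mon" through "Sun" in train.r; "Sun" below is a hypothetical column name, so substitute one of yours):

#drop one dummy per categorical variable so the remaining columns are not
#collinear with the implicit intercept
train.r.reduced <- train.r[, setdiff(colnames(train.r), "Sun")]
fit <- auto.arima(train.f, stepwise = FALSE, approximation = FALSE,
                  xreg = data.matrix(train.r.reduced))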

Related

Obtaining predictions from a pooled imputation model

I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps I have already developed, using a fictitious example based on the pima data from the faraway package. Step 4 is where my issue occurs.
#-----------activate packages and download data-------------##
library(faraway)
library(mice)
library(margins)
data(pima)
Apply a multiple imputation by chained equations method using the mice package. For the sake of the example, I first randomly assigned missing values to the pima dataset using the ampute function from the same package. Twenty imputed datasets were generated by setting the "m" argument to 20.
#-------------------assign missing values to data-----------------#
result<-ampute(pima)
result<-result$amp
#-------------------multiple imputation by chained equations--------#
#generate 20 imputed datasets
newresult<-mice(result,m=20)
Run a logistic regression on each of the 20 imputed datasets. Inspecting convergence and comparing the original and imputed data distributions is skipped for the sake of the example. The "test" variable is set as the binary dependent variable.
#run a logistic regression on each of the 20 imputed datasets
model <- with(newresult, glm(test ~ pregnant + glucose + diastolic + triceps + age + bmi, family = binomial(link = "logit")))
Combine the regression estimates from the 20 imputation models to create a single pooled imputation model.
#pooled regressions
summary(pool(model))
Generate predictions from the pooled imputation model using the prediction function from the margins package. This function allows generating predicted values fixed at a specific level (for factors) or value (for continuous variables). In this example, I could choose to generate new predicted probabilities, i.e. P(Y=1), while setting the pregnant variable (number of pregnancies) to 3. In other words, it would give me the distribution of the outcome in the counterfactual situation where every observation is set to 3 for this variable. Normally, I would just give my model to the x argument of the prediction function (as below), but in the case of a pooled imputation model from mice, the object class is mipo rather than glm.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
  Unrecognized variable name in 'at': (1) pregnant
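For reference, the same call runs when given an ordinary glm object rather than the pooled model; a minimal sketch on complete cases of the amputed data (cc_model is an illustrative name, and this sidesteps rather than solves the imputation problem):

#prediction() accepts a plain glm fit, so the 'at' mechanism itself is fine
cc <- na.omit(result)
cc_model <- glm(test ~ pregnant + glucose + diastolic + triceps + age + bmi,
                family = binomial(link = "logit"), data = cc)
prediction(cc_model, at = list(pregnant = 3))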
I thought of two solutions:
a) changing the class object to make it fit prediction()'s requirements
b) extracting pooled imputation regression parameters and reconstruct it in a list that would fit prediction()'s requirements
However, I'm not sure how to achieve either of these and would appreciate any advice that could help me get closer to obtaining predictions from a pooled imputation model in R.
You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)
In addition to @Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(marginaleffects)
fit_reg <- function(dat) {
    #fit the logistic regression on one imputed dataset and return the
    #predictions at pregnant = 3
    mod <- glm(test ~ pregnant + glucose + diastolic + triceps + age + bmi,
               data = dat, family = binomial)
    out <- predictions(mod, newdata = datagrid(pregnant = 3))
    return(out)
}
dat_mice <- mice(pima, m = 20, printFlag = FALSE, seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
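The pooled object can then be inspected with the usual mice summary method (a one-line follow-up, assuming the pooling above succeeded):

#pooled point estimate and confidence interval for the predictions
summary(mod_imputation)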

Is it possible to perform a zero-inflated Poisson regression model in R with more than 4 variables?

This is my first time posting on here, so I apologize if this isn't the correct format/info. I'm attempting to run a model in R with the zeroinfl function (pscl package). My data consist of insect counts and 5 different variables, which are 5 different habitat types.
Zero-inflated Poisson model:
summary(m1 <- zeroinfl(count~Hab_1+Hab_2+Hab_3+Hab_4+Hab_5, data = insect_data))
I'm able to run the model when I only use 4 variables in the equation, but when I add the fifth variable it gives me this error:
Error in optim(fn = loglikfun, gr = gradfun, par = c(start$count, start$zero, :
non-finite value supplied by optim
Is there a way to run a zero inflated model using all 5 of these variables or am I missing something? Any input would be greatly appreciated, thank you!

Forecast using auto.arima using exogenous variables

I would love to be able to use exogenous variables to help with the ARIMA forecast, but I run into issues every way I try to use variables other than the one I am trying to forecast.
I would also love for the plot to be more beautiful than the default R output.
Error in auto.arima(datats1$Slots, seasonal = TRUE, xreg = dfTS) :
  xreg is rank deficient
None of the usual causes of a rank-deficient data frame or matrix apply here: there are no linear combinations in the dataset.
#Load data
datats1 <- read.csv("ProjectTS2.CSV") #time series I want to forecast
xreg <- read.csv("ProjectTS4.CSV")    #data I want to use as exogenous regressors
datats1$Slots <- ts(datats1$slots, start = 2015, frequency = 365)
dfTS <- as.matrix(ts(xreg))
new <- auto.arima(datats1$Slots, seasonal = TRUE, xreg = dfTS)
#note: forecasting a model fitted with xreg also needs future regressor values
seas_fcast <- forecast(new, h = 30)
plot(seas_fcast, xlim = c(2018, 2018.2))
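One quick way to test the rank-deficiency claim directly on the matrix being passed (a sketch; the column names are whatever ProjectTS4.CSV contains):

#if the rank is smaller than the column count, some column is a linear
#combination of the others (or constant), which is what auto.arima rejects
qr(dfTS)$rank
ncol(dfTS)
#zero-variance columns (e.g. a dummy that never fires) are a common culprit
apply(dfTS, 2, sd)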

R: glmrob can't predict models with dropped co-linear columns, while glm can?

I'm learning to implement robust GLMs in R, but can't figure out why I am unable to get glmrob to predict values from my regression models when some columns are dropped due to co-linearity. Specifically, when I use the predict function on a glmrob fit, it always gives NA for all values. I don't observe this when predicting values from the same data & model using glm. It doesn't seem to matter what data I use -- as long as there is an NA coefficient in the fitted model (and the NA isn't the last coefficient in the coefficient vector), predict does not work.
This behavior holds for all datasets and models I have tried where an internal column is dropped due to co-linearity. Below is a fake data set where two columns are dropped from the model, which gives two NAs in the coefficient list. Both glm and glmrob give nearly identical coefficients, yet predict only works with the glm model. So my question is: what don't I understand about robust regression that would prevent my glmrob models from generating predicted values?
library(robustbase)
#Make fake data with two categorical predictors
df <- data.frame("category" = rep(c("A","B","C"), each = 6))
df$location <- rep(1:6, each = 3)
val <- rep(c(500,50,5000), each = 6) + rep(c(50,100,25,200,100,1), each = 3)
df$value <- rpois(NROW(df), val)
#note that predict works if we omit the newdata parameter; however I need the
#newdata param, so I use the original dataframe here as a stand-in
mod <- glm(value ~ category + as.factor(location), data = df, family = poisson)
predict(mod, newdata = df) #works fine
mod <- glmrob(value ~ category + as.factor(location), data = df, family = poisson)
predict(mod, newdata = df) #predicts NA for all values
I've been digging into this and have concluded that the problem does not lie in my understanding of robust regression; rather, it lies in a bug in the robustbase package. The predict.lmrob function does not correctly pick the necessary coefficients from the model before the prediction: it needs to pick the first x non-NA coefficients (where x is the rank of the model matrix), but instead it merely picks the first x coefficients without checking whether they are NA. This explains why the problem only surfaces for models where the NA isn't the last coefficient in the coefficient vector.
To fix this, I copied the predict.lmrob source using:
getAnywhere(predict.lmrob)
and created my own replacement function. In this function I made a single modification to the code:
...
p <- object$rank
if (is.null(p)) {
    df <- Inf
    p <- sum(!is.na(coef(object)))
    #piv <- seq_len(p)                 # old code
    piv <- which(!is.na(coef(object))) # new code
}
else {
    p1 <- seq_len(p)
    piv <- if (p)
        qr(object)$pivot[p1]
}
...
I've run a few hundred datasets using this change and it has worked well.
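For completeness, one way to use the patched copy without modifying the installed package is to re-point its environment at the robustbase namespace (a sketch; predict.lmrob.fixed stands for the full copied-and-edited function, whose body is elided here):

#the patched copy needs access to robustbase's internal helpers
environment(predict.lmrob.fixed) <- asNamespace("robustbase")
#either call it directly...
predict.lmrob.fixed(mod, newdata = df)
#...or swap it in for the session so predict() dispatches to it
assignInNamespace("predict.lmrob", predict.lmrob.fixed, ns = "robustbase")
predict(mod, newdata = df)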

Back Test ARIMA model with Exogenous Regressors

Is there a way to create a holdout/back-test sample with the following ARIMA model with exogenous regressors? Let's say I want to estimate the model using the first 50 observations and then evaluate its performance on the remaining 20 observations, where the x-variables are pre-populated for all 70 observations. What I really want at the end is a graph that plots actual and fitted values in the development period and the validation/holdout period (also known as back testing in time series).
library(TSA)
xreg <- cbind(GNP, Time_Scaled_CO) #two time series objects
fit_A <- arima(Charge_Off, order = c(1,1,0), xreg = xreg) #Charge_Off is another ts object
plot(Charge_Off, col = "red")
lines(predict(fit_A, Data), col = "green") #Data contains Charge_Off, GNP, Time_Scaled_CO
You don't seem to be using the TSA package at all, and you don't need to for this problem. Here is some code that should do what you want.
library(forecast)
xreg <- cbind(GNP, Time_Scaled_CO)
training <- window(Charge_Off, end=50)
test <- window(Charge_Off, start=51)
fit_A <- Arima(training,order=c(1,1,0),xreg=xreg[1:50,])
fc <- forecast(fit_A, h=20, xreg=xreg[51:70,])
plot(fc)
lines(test, col="red")
accuracy(fc, test)
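To also overlay the in-sample fitted values for the development period (the question asked for both), one extra line should do it (a sketch using forecast's fitted() accessor):

#one-step-ahead fitted values over the training period
lines(fitted(fit_A), col = "blue")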
See http://otexts.com/fpp/9/1 for an intro to using R with these models.
