Can I test autocorrelation from the generalized least squares model? - r

I am trying to use a generalized least squares model (gls in R) on my panel data to deal with an autocorrelation problem.
I do not want to include lags of any variables.
I am trying to use the Durbin-Watson test (dwtest in R) to check for autocorrelation in my generalized least squares model (gls).
However, I find that dwtest does not work with gls objects, while it does work with other model functions such as lm.
Is there a way to check for autocorrelation in my gls model?

The Durbin-Watson test is designed to check for the presence of autocorrelation in standard least-squares models (such as one fitted by lm). If autocorrelation is detected, one can then capture it explicitly in the model using, for example, generalized least squares (gls in R). My understanding is that Durbin-Watson is not appropriate for then testing "goodness of fit" in the resulting models, as gls residuals may no longer follow the same distribution as residuals from the standard lm model. (Somebody with deeper knowledge of statistics should correct me if I'm wrong.)
With that said, function durbinWatsonTest from the car package will accept arbitrary residuals and return the associated test statistic. You can therefore do something like this:
v <- gls( ... )$residuals
attr(v,"std") <- NULL # get rid of the additional attribute
car::durbinWatsonTest( v )
Note that durbinWatsonTest will compute p-values only for lm models (likely due to the considerations described above), but you can estimate them empirically by permuting your data / residuals.
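The permutation idea mentioned above can be sketched roughly as follows. This is only an illustration under assumptions that are not in the answer itself: the data are simulated, the helper names dw_stat and perm_p are hypothetical, the Durbin-Watson statistic is computed by hand rather than with car, and the residuals are treated as exchangeable under the null, which is only approximately true for regression residuals. The sketch also looks at the normalized residuals of the gls fit (residuals(fit, type = "normalized")), since those are the ones expected to be roughly uncorrelated when the fitted correlation structure is adequate.

```r
library(nlme)

set.seed(42)
n <- 100
x <- rnorm(n)
e <- as.numeric(arima.sim(list(ar = 0.6), n = n))   # simulated AR(1) errors
d <- data.frame(y = 1 + 2 * x + e, x = x, t = seq_len(n))

# gls with an AR(1) correlation structure for the errors
fit <- gls(y ~ x, data = d, correlation = corAR1(form = ~ t))

# Durbin-Watson statistic: close to 2 when there is no lag-1 autocorrelation
dw_stat <- function(r) sum(diff(r)^2) / sum(r^2)

# One-sided permutation p-value (small DW values indicate positive autocorrelation)
perm_p <- function(r, B = 999) {
  obs <- dw_stat(r)
  sim <- replicate(B, dw_stat(sample(r)))
  (sum(sim <= obs) + 1) / (B + 1)
}

r_raw  <- as.numeric(residuals(fit, type = "response"))    # still autocorrelated
r_norm <- as.numeric(residuals(fit, type = "normalized"))  # should be roughly white

dw_raw  <- dw_stat(r_raw);  p_raw  <- perm_p(r_raw)
dw_norm <- dw_stat(r_norm); p_norm <- perm_p(r_norm)
c(dw_raw = dw_raw, p_raw = p_raw, dw_norm = dw_norm, p_norm = p_norm)
```

If the correlation structure has absorbed the serial dependence, the statistic for the normalized residuals should sit much closer to 2 than the one for the raw residuals.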

Does the function multinom() from R's nnet package fit a multinomial logistic regression, or a Poisson regression?

The documentation for the multinom() function from the nnet package in R says that it "[f]its multinomial log-linear models via neural networks" and that "[t]he response should be a factor or a matrix with K columns, which will be interpreted as counts for each of K classes." Even when I go to add a tag for nnet on this question, the description says that it is software for fitting "multinomial log-linear models."
Granting that statistics has wildly inconsistent jargon that is rarely operationally defined by whoever is using it, the documentation for the function even mentions having a count response and so seems to indicate that this function is designed to model count data. Yet virtually every resource I've seen treats it exclusively as if it were fitting a multinomial logistic regression. In short, everyone interprets the results in terms of logged odds relative to the reference (as in logistic regression), not in terms of logged expected count (as in what is typically referred to as a log-linear model).
Can someone clarify what this function is actually doing and what the fitted coefficients actually mean?
nnet::multinom is fitting a multinomial logistic regression as I understand...
If you check the source code of the package, https://github.com/cran/nnet/blob/master/R/multinom.R and https://github.com/cran/nnet/blob/master/R/nnet.R, you will see that the multinom function does indeed use counts (a common input for a multinomial regression model; see also the MGLM or mclogit packages, for example), and that it fits the multinomial regression model using a softmax transform to go from predictions on the additive log-ratio scale to predicted probabilities. The softmax transform is the inverse link function of a multinomial regression model. The way the multinom model predictions are obtained (cf. "predictions from nnet::multinom") is also exactly what you would expect for a multinomial regression model (using an additive log-ratio parameterization, i.e. using one outcome category as a baseline).
That is, the coefficients predict the logged odds relative to the reference baseline category (i.e. it is doing a logistic regression), not the logged expected counts (like a log-linear model).
This is shown by the fact that model predictions are calculated as
fit <- nnet::multinom(...)
X <- model.matrix(fit) # covariate matrix / design matrix
betahat <- t(rbind(0, coef(fit))) # model coefficients, with explicit zero row added for reference category & transposed
preds <- mclustAddons::softmax(X %*% betahat)
Furthermore, I verified that the vcov matrix returned by nnet::multinom matches the one I get from the formula for the vcov matrix of a multinomial regression model (see "Faster way to calculate the Hessian / Fisher Information Matrix of a nnet::multinom multinomial regression in R using Rcpp & Kronecker products").
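The prediction formula above can be checked numerically. The following is a minimal sketch, assuming the built-in iris data as a stand-in for a real dataset and a hand-written row-wise softmax (the hypothetical helper softmax_rows) in place of mclustAddons::softmax, so that only nnet is needed. If multinom fits a baseline-category logit model, the softmax of the linear predictors should reproduce fitted(fit).

```r
library(nnet)

# Three-class example; trace = FALSE only silences the optimizer output
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)

X <- model.matrix(fit)                 # design matrix
betahat <- t(rbind(0, coef(fit)))      # zero column for the reference category
eta <- X %*% betahat                   # linear predictors (log-odds vs. reference)

# Row-wise softmax; subtracting the row maximum avoids overflow in exp()
softmax_rows <- function(z) {
  ez <- exp(z - apply(z, 1, max))
  ez / rowSums(ez)
}

preds <- softmax_rows(eta)
max_diff <- max(abs(preds - fitted(fit)))
max_diff
```

A difference at the level of floating-point noise is what one would expect if the coefficients are log-odds relative to the reference category rather than log expected counts.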
Is it not the case that a multinomial regression model can always be reformulated as a Poisson loglinear model (i.e. as a Poisson GLM) using the Poisson trick (glmnet e.g. uses the Poisson trick to fit multinomial regression models as a Poisson GLM)?

R code to get Log-likelihood for Binary logistic regression

I have developed a binomial logistic regression using glm function in R. I need three outputs which are
Log likelihood (no coefficients)
Log likelihood (constants only)
Log likelihood (at optimal)
What functions or packages do I need to obtain these outputs?
Say we have a fitted model m.
log-likelihood of full model (i.e., at MLE): logLik(m)
log-likelihood of intercept-only model: logLik(update(m, . ~ 1))
although the latter can probably be retrieved without refitting the model if we think carefully enough about the deviance() and $null.deviance components (these are defined with respect to the saturated model)
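A minimal sketch of all three quantities follows. It rests on assumptions that are not in the question: the data are simulated, the response is ungrouped 0/1, and "no coefficients" is read as every coefficient (including the intercept) being fixed at zero, so that each predicted probability is 0.5. For ungrouped 0/1 data the saturated log-likelihood is zero, which is why the deviance components can be converted directly.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))     # simulated binary response
m <- glm(y ~ x, family = binomial)

# Log-likelihood at the optimum (full model at the MLE)
ll_full <- as.numeric(logLik(m))

# Log-likelihood with constants only (intercept-only model, refitted)
ll_null <- as.numeric(logLik(update(m, . ~ 1)))

# Log-likelihood with no coefficients: p = 0.5 for every observation
ll_zero <- n * log(0.5)

# Same two values without refitting, valid for ungrouped 0/1 responses only
ll_full_dev <- -deviance(m) / 2
ll_null_dev <- -m$null.deviance / 2

c(zero = ll_zero, null = ll_null, full = ll_full)
```

For grouped binomial data (successes out of several trials) the saturated log-likelihood is no longer zero, so the deviance shortcut would need an extra constant; refitting with update() avoids that issue.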

re-estimating confidence of C5.0 rules using another dataset

Context: I have fitted a glmnet model to my data, but for operational reasons we would actually like to have a rule set. So I then fitted a C5.0Rules model to the predicted class from my glmnet, i.e. the C5.0Rules model is essentially approximating my glmnet. As a result, however, C5.0Rules reports a very high confidence (and other performance metrics), because its target is easy. A natural way to correct this is to re-estimate the confidence (and other performance metrics) using the real response, or another dataset. But I need to do this in such a way that the model remembers the new confidence, so that in the future it reports the corrected confidence level along with the prediction. How do I do that?
Reproducible example:
library(glmnet)
library(C50)
library(caret)
data(churn)
## original glmnet
glmnet=train(churn~.-state-area_code-international_plan-voice_mail_plan,data=churnTrain,method="glmnet")
## only retain useful predictors
temp=varImp(glmnet)$importance
reducedVar=rownames(temp)[temp>0]
churnTrain2=data.frame(churnTrain[,match(reducedVar,colnames(churnTrain))],
                       prediction=fitted(glmnet))
## fit my C5.0 which approximates the glmnet prediction
C5=train(prediction~.,data=churnTrain2,method="C5.0Rules")
summary(C5) ## notice the high confidence and performance measure.
(An alternative approach I can think of is to get C5.0 to predict the predicted probability instead of class, but this turns it into a regression problem so I won't be able to use C5.0)

Which function/package for robust linear regression works with glmulti (i.e., behaves like glm)?

Background: Multi-model inference with glmulti
glmulti is an R package for automated model selection for generalized linear models: it constructs all possible models given a dependent variable and a set of predictors, fits them via the classic glm function, and then allows for multi-model inference (e.g., using model weights derived from AICc or BIC). In theory, glmulti also works with any other function that returns coefficients, the log-likelihood of the model, and the number of free parameters (and maybe other information?) in the same format as glm.
My goal: Multi-model inference with robust errors
I would like to use glmulti with robust modeling of the errors of a quantitative dependent variable to guard against the effect of outliers.
For example, I could assume that the errors in the linear model are distributed as a t distribution instead of as a normal distribution. With its kurtosis parameter the t distribution can have heavy tails and is thus more robust to outliers (as compared to the normal distribution).
However, I'm not committed to using the t distribution approach. I'm happy with any approach that gives back a log-likelihood and thus works with the multi-model approach in glmulti. Unfortunately, that means I cannot use the well-known robust linear models in R (e.g., lmRob from robust or lmrob from robustbase), because they do not operate under the log-likelihood framework and thus cannot work with glmulti.
The problem: I can't find a robust regression function that works with glmulti
The only robust linear regression function for R I found that operates under the log-likelihood framework is heavyLm (from the heavy package); it models the errors with a t distribution. Unfortunately, heavyLm does not work with glmulti (at least not out of the box) because it has no S3 method for logLik (and possibly other things).
To illustrate:
library(glmulti)
library(heavy)
Using the dataset stackloss
head(stackloss)
Regular Gaussian linear model:
summary(glm(stack.loss ~ ., data = stackloss))
Multi-model inference with glmulti using glm's default Gaussian family (identity link)
stackloss.glmulti <- glmulti(stack.loss ~ ., data = stackloss, level=1, crit=bic)
print(stackloss.glmulti)
plot(stackloss.glmulti)
Linear model with t distributed error (default is df=4)
summary(heavyLm(stack.loss ~ ., data = stackloss))
Multi-model inference with glmulti calling heavyLm as the fitting function
stackloss.heavyLm.glmulti <- glmulti(stack.loss ~ .,
                                     data = stackloss, level=1, crit=bic, fitfunction=heavyLm)
gives the following error:
Initialization...
Error in UseMethod("logLik") :
no applicable method for 'logLik' applied to an object of class "heavyLm".
If I define the following function,
logLik.heavyLm <- function(x){x$logLik}
glmulti can get the log-likelihood, but then the next error occurs:
Initialization...
Error in .jcall(molly, "V", "supplyErrorDF",
as.integer(attr(logLik(fitfunc(as.formula(paste(y, :
method supplyErrorDF with signature ([I)V not found
The question: Which function/package for robust linear regression works with glmulti (i.e., behaves like glm)?
There is probably a way to define further functions to get heavyLm working with glmulti, but before embarking on this journey I wanted to ask whether anybody either
knows of a robust linear regression function that (a) operates under the log-likelihood framework and (b) behaves like glm (and will thus work with glmulti out-of-the-box), or
has already got heavyLm working with glmulti.
Any help is very much appreciated!
Here is an answer using heavyLm. Even though this is a relatively old question, the same problem that you mentioned still occurs when using heavyLm (i.e., the error message Error in .jcall(molly, "V", "supplyErrorDF"…).
The problem is that glmulti requires the degrees of freedom of the model, which you need to provide as an attribute of the value returned by the function logLik.heavyLm; see the documentation for the function logLik for details. Moreover, it turns out that you also need to provide a function that returns the number of data points used for fitting the model, since the information criteria (AIC, BIC, …) depend on this value too. This is done by the function nobs.heavyLm in the code below.
Here is the code:
nobs.heavyLm <- function(mdl) mdl$dims[1] # the sample size (number of data points)
logLik.heavyLm <- function(mdl) {
  res <- mdl$logLik
  attr(res, "nobs") <- nobs.heavyLm(mdl) # this is not really needed for 'glmulti', but is included to adhere to the format of 'logLik'
  attr(res, "df") <- length(mdl$coefficients) + 1 + 1 # I am also considering the scale parameter that is estimated; see mdl$family
  class(res) <- "logLik"
  res
}
which, when put together with the code that you provided, produces the following result:
Initialization...
TASK: Exhaustive screening of candidate set.
Fitting...
Completed.
> print(stackloss.glmulti)
glmulti.analysis
Method: h / Fitting: glm / IC used: bic
Level: 1 / Marginality: FALSE
From 8 models:
Best IC: 117.892471265874
Best model:
[1] "stack.loss ~ 1 + Air.Flow + Water.Temp"
Evidence weight: 0.709174196998897
Worst IC: 162.083142797858
2 models within 2 IC units.
1 models to reach 95% of evidence weight.
producing therefore 2 models within the 2 BIC units threshold.
An important remark though: I am not sure that the expression above for the degrees of freedom is strictly correct. For a standard linear model, the degrees of freedom would be equal to p + 1, where p is the number of parameters in the model, and the extra parameter (the + 1) is the "error" variance (which is used to calculate the likelihood). In function logLik.heavyLm above, it is not clear to me whether one should also count the "scale parameter" that is estimated by heavyLm as an extra degree of freedom, and hence the p + 1 + 1, which would be the case if the likelihood is also a function of this parameter. Unfortunately, I cannot confirm this, since I don’t have access to the reference that heavyLm cites (the paper by Dempster et al., 1980). Because of this, I am counting the scale parameter, thereby providing a (slightly more) conservative estimate of model complexity, penalizing "complex" models. This difference should be negligible, except in the small sample case.
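To see how the "df" and "nobs" attributes of a logLik object feed into an information criterion, here is a small sketch using an ordinary lm fit on stackloss rather than heavyLm (an assumption made so that only base R is needed). The variable names are illustrative.

```r
fit <- lm(stack.loss ~ ., data = stackloss)

ll <- logLik(fit)
k  <- attr(ll, "df")     # 4 regression coefficients + 1 error variance
n  <- attr(ll, "nobs")   # number of observations used in the fit

# Information criteria rebuilt by hand from the logLik attributes
bic_manual <- -2 * as.numeric(ll) + log(n) * k
aic_manual <- -2 * as.numeric(ll) + 2 * k

# Effect on BIC of counting one additional parameter
bic_shift <- log(n)

c(bic_manual = bic_manual, BIC = BIC(fit), bic_shift = bic_shift)
```

Under this formula, counting one extra parameter in every candidate model raises each BIC by the same amount, log(n), so BIC differences between models fitted to the same data would be unchanged. Small-sample corrections such as AICc do not shift uniformly in this way.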

GLM with autoregressive term to correct for serial correlation

I have a stationary time series to which I want to fit a linear model with an autoregressive term to correct for serial correlation, i.e. using the formula A_t = c1*B_t + c2*C_t + u_t, where u_t = r*u_{t-1} + e_t
(u_t is an AR(1) term to correct for serial correlation in the error terms)
Does anyone know what to use in R to model this?
Thanks
Karl
The GLMMarp package will fit these models. If you just want a linear model with Gaussian errors, you can do it with the arima() function where the covariates are specified via the xreg argument.
There are several ways to do this in R. Here are two examples using the "Seatbelts" time series dataset in the datasets package that comes with R.
The arima() function comes in package:stats, which is included with R. The function takes an argument of the form order=c(p, d, q), where you can specify the order of the autoregressive, integrated, and moving average components. In your question, you suggest that you want an AR(1) model to correct for first-order autocorrelation in the errors and that's it. We can do that with the following command:
arima(Seatbelts[,"drivers"], order=c(1,0,0),
      xreg=Seatbelts[,c("kms", "PetrolPrice", "law")])
The value for order specifies that we want an AR(1) model. The xreg component should be a series of other Xs we want to add as part of a regression. The output looks a little bit like the output of summary.lm() turned on its side.
Another approach, which might be closer to the way you've fit regression models before, is to use gls() in the nlme package. The following code turns the Seatbelts time series object into a data frame and then adds a new column (t) that is just a counter over the sorted time series:
Seatbelts.df <- data.frame(Seatbelts)
Seatbelts.df$t <- 1:(dim(Seatbelts.df)[1])
The two lines above are only getting the data in shape. Since the arima() function is designed for time series, it can read time series objects more easily. To fit the model with nlme you would then run:
library(nlme)
m <- gls(drivers ~ kms + PetrolPrice + law,
         data=Seatbelts.df,
         correlation=corARMA(p=1, q=0, form=~t))
summary(m)
The line that begins with "correlation" is the way you pass in the ARMA correlation structure to GLS. The results won't be exactly the same because arima() uses maximum likelihood to estimate models and gls() uses restricted maximum likelihood by default. If you add method="ML" to the call to gls(), you should get essentially the same estimates as with the arima() call above.
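The comparison between the two fits can be sketched as follows. This is an illustration only: it uses corAR1() as a shorthand for the AR(1) case of corARMA(), the object names are made up, and any remaining differences between the two sets of estimates would come from the different optimizers.

```r
library(nlme)

Seatbelts.df <- data.frame(Seatbelts)
Seatbelts.df$t <- seq_len(nrow(Seatbelts.df))

# Regression with AR(1) errors via arima()
fit_arima <- arima(Seatbelts[, "drivers"], order = c(1, 0, 0),
                   xreg = Seatbelts[, c("kms", "PetrolPrice", "law")])

# Same model via gls(), switched from the default REML to ML
fit_gls <- gls(drivers ~ kms + PetrolPrice + law,
               data = Seatbelts.df,
               correlation = corAR1(form = ~ t),
               method = "ML")

# AR(1) coefficient from each fit
phi_arima <- unname(coef(fit_arima)["ar1"])
phi_gls   <- unname(coef(fit_gls$modelStruct$corStruct, unconstrained = FALSE))

# Regression coefficients side by side (arima() labels the intercept "intercept")
cbind(arima = coef(fit_arima)[c("intercept", "kms", "PetrolPrice", "law")],
      gls   = coef(fit_gls))
c(phi_arima = phi_arima, phi_gls = phi_gls)
```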
What is your link function?
The way you describe it sounds like a basic linear regression with autocorrelated errors. In that case, one option is to use lm to get a consistent estimate of your coefficients and use Newey-West HAC standard errors.
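A minimal sketch of the lm plus Newey-West route, assuming the sandwich and lmtest packages are installed and reusing the Seatbelts data from the other answer purely as an example. The lag is left at NeweyWest()'s automatic choice; the object names are illustrative.

```r
library(sandwich)
library(lmtest)

Seatbelts.df <- data.frame(Seatbelts)

# Ordinary least squares: coefficients are estimated ignoring the serial correlation
fit_ols <- lm(drivers ~ kms + PetrolPrice + law, data = Seatbelts.df)

# Newey-West HAC covariance matrix of the coefficient estimates
hac_vcov <- NeweyWest(fit_ols)

# Coefficient table with HAC standard errors instead of the usual OLS ones
hac_table <- coeftest(fit_ols, vcov. = hac_vcov)
hac_table
```

The point estimates are the same as in summary(fit_ols); only the standard errors, test statistics, and p-values change.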
I'm not sure what the best answer is for GLMs more generally.
