Residuals and plots in ordered multinomial regression - r

I need to make a binned residual plot of fitted versus residual values from an ordered multinomial logit regression.
How can I extract residuals when using polr? Is there any other function that fits an ordered multinomial logit model and from which residuals can be extracted?
This is the code I used:
library(MASS) # polr()
library(arm) # binnedplot()
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data = data, method = 'logistic')
fit <- mod1$fitted.values
res <- residuals(mod1) # this comes back as NULL
binnedplot(fit, res)
The problem is that the object res is NULL.
Thanks

For a start, can you tell us how residuals would be defined in principle for a model with categorical responses? fitted.values is a matrix of probabilities. You could define residuals in terms of correct prediction (defining the most likely outcome as the prediction, as in the default predict method for polr objects) -- or you could compute an n-by-n table of true values and predicted values. Alternatively you could reduce the ordinal data back to an integer scale and compute a mean outcome as the prediction ... but I can't see that there's any unique way to define the residuals in the first place.
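To make those options concrete, here is a minimal sketch. It is only an illustration, not a standard definition of residuals for polr: it uses the housing data from MASS as a stand-in for the asker's data, and the names probs, pred_class, conf_tab and score_res are made up for this example.

```r
library(MASS)  # polr() and the housing example data

# Ordered logit on a built-in data set (a stand-in for the asker's data)
mod <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)

probs <- mod$fitted.values   # one row per observation, one column per category
pred_class <- predict(mod)   # most likely category (default type = "class")

# Option 1: cross-tabulate observed vs. predicted categories, weighted by Freq
conf_tab <- xtabs(Freq ~ obs + pred,
                  data = data.frame(obs = housing$Sat, pred = pred_class,
                                    Freq = housing$Freq))

# Option 2: score the categories 1..K and take observed minus expected score
expected_score <- as.vector(probs %*% seq_len(ncol(probs)))
score_res <- as.integer(housing$Sat) - expected_score
```

If an integer scoring of the categories is acceptable for your problem, expected_score and score_res could then be passed to binnedplot().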

polr() has no method that returns residuals. You have to calculate them manually from whichever definition you choose.

There are actually plenty of ways to get residuals from an ordinal probit/logit model. Although polr does not provide any residuals, vglm provides several. See ?residualsvglm from the VGAM package (see also below).
NOTE: However, for a Control Function/2SRI approach, Wooldridge (2014) suggests using the generalised residuals as described in Vella (1993). These are, as far as I know, currently not available in R, although I am working on that, but they are available in Stata (using predict gr, score).
Residuals in VGLM
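A minimal sketch of what this can look like, assuming the VGAM package is installed and using its pneumo example data rather than the data from the question. The choice of type = "response" is just one of the options listed in ?residualsvglm, and the name res_resp is made up here.

```r
library(VGAM)  # vglm(), cumulative(), and the pneumo example data

# Proportional-odds (cumulative logit) model on grouped counts
pneumo <- transform(pneumo, let = log(exposure.time))
fit <- vglm(cbind(normal, mild, severe) ~ let,
            family = cumulative(parallel = TRUE), data = pneumo)

# Response residuals: observed minus fitted proportions, one column per category
res_resp <- residuals(fit, type = "response")
head(res_resp)
```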
Surrogate residuals for polr
You can use the sure package to calculate surrogate residuals with resids. The package is based on a paper in the Journal of the American Statistical Association.
library(sure) # for residual function and sample data sets
library(MASS) # for polr function
df1 <- df1 # copy the sample data set df1 shipped with sure into the workspace
df1$x1 <- df1$x
df1$x <- NULL
df1$y <- df2$y
df1$x2 <- df2$x
df1$x3 <- df3$x
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data=df1, method='probit')
fit <- mod1$fitted.values
res <- resids(mod1)
EDIT: One big issue is the following (from ?resids):
"Note: Surrogate residuals require sampling from a continuous distribution; consequently, the result will be different with every call to resids. The internal functions used for sampling from truncated distributions when method = "latent" are based on modified versions of rtrunc and qtrunc."
Even when running resids(mod1, nsim=1000, method="latent"), there is no convergence of the outcome.

Related

How to plot multi-level meta-analysis by study (in contrast to the subgroup)?

I am doing a multi-level meta-analysis. Many studies have several subgroups. When I make a forest plot, each subgroup is shown as a separate row, which gives 60 rows. I would like to plot one row per study instead, which would give 25 rows and be more appropriate. Does anyone have an idea how to make this forest plot?
I did it this way:
full.model <- rma.mv(yi = yi,
                     V = vi,
                     slab = Author,
                     data = df,
                     random = ~ 1 | Author/Study,
                     test = "t",
                     method = "REML")
forest(full.model)
It is not clear to me if you want to aggregate to the Author level or to the Study level. If there are multiple rows of data for particular studies, then the model isn't really complete and you would want to add another random intercept for the level of the estimates within studies. Essentially, the lowest random effect should have as many values for nlvls in the output as there are estimates (k).
Let's first tackle the case where we have a multilevel structure with two levels, studies and multiple estimates within studies (for some technical reasons, some might call this a three-level model, but let's not get into this). I will use a fully reproducible example for illustration purposes, using the dat.konstantopoulos2011 dataset, where we have districts and schools within districts. We fit a multilevel model of the type as you have with:
library(metafor)
dat <- dat.konstantopoulos2011
res <- rma.mv(yi, vi, random = ~ 1 | district/school, data=dat)
res
We can aggregate the estimates to the district level using the aggregate() function, specifying the marginal var-cov matrix of the estimates from the model to account for their non-independence (note that this makes use of aggregate.escalc() which only works with escalc objects, so if it is not, you need to convert the dataset to one - see help(aggregate.escalc) for details):
agg <- aggregate(dat, cluster=dat$district, V=vcov(res, type="obs"))
agg
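To see what this kind of aggregation does for a single cluster, here is a small base R sketch. It assumes the aggregate is the usual inverse-variance (generalized least squares) weighted average under the supplied var-cov matrix, which is my understanding of the default behavior, so check help(aggregate.escalc) before relying on it. The numbers and the names agg_est and agg_var are made up.

```r
# Two hypothetical, correlated estimates from the same cluster
y <- c(0.20, 0.40)
V <- matrix(c(0.04, 0.01,
              0.01, 0.09), nrow = 2, byrow = TRUE)  # their var-cov matrix

W <- solve(V)                       # weight matrix = inverse of V

agg_var <- 1 / sum(W)               # variance of the aggregate: 1 / (1' V^-1 1)
agg_est <- sum(W %*% y) * agg_var   # aggregate: (1' V^-1 y) / (1' V^-1 1)

c(estimate = agg_est, variance = agg_var)
```

The aggregated variance is never larger than the smallest individual variance, and the off-diagonal elements of V are what stop it from shrinking as fast as it would under independence.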
You will find that if you then fit an equal-effects model to these estimates based on the aggregated data that the results are identical to what you obtained from the multilevel model (we use an equal-effects model since the heterogeneity accounted for by the multilevel model is already encapsulated in vcov(res, type="obs")):
rma(yi, vi, method="EE", data=agg)
So, we can now use these aggregated values in a forest plot:
with(agg, forest(yi, vi, slab=district))
My guess based on your description is that you actually have an additional level that you should include in the model and that you want to aggregate to the intermediate level. This is a tad more complicated, since aggregate() isn't meant for that. Just for illustration purposes, say we use year as another (higher) level and I will mess a bit with the data so that all three variance components are non-zero (again, just for illustration purposes):
dat$yi[dat$year == 1976] <- dat$yi[dat$year == 1976] + 0.8
res <- rma.mv(yi, vi, random = ~ 1 | year/district/school, data=dat)
res
Now instead of aggregate(), we can accomplish the same thing by using a multivariate model, including the intermediate level as a factor and using again vcov(res, type="obs") as the var-cov matrix:
agg <- rma.mv(yi, V=vcov(res, type="obs"), mods = ~ 0 + factor(district), data=dat)
agg
Now the model coefficients of this model are the aggregated values and the var-cov matrix of the model coefficients is the var-cov matrix of these aggregated values:
coef(agg)
vcov(agg)
They are not all independent (since we haven't aggregated to the highest level), so if we want to check that we can obtain the same results as from the multilevel model, we must account for this dependency:
rma.mv(coef(agg), V=vcov(agg), method="EE")
Again, exactly the same results. So now we use these coefficients and the diagonal from vcov(agg) as their sampling variances in the forest plot:
forest(coef(agg), diag(vcov(agg)), slab=names(coef(agg)))
The forest plot cannot indicate the dependency that still remains in these values, so if one were to meta-analyze these aggregated values using only diag(vcov(agg)) as their sampling variances, the results would not be identical to what you get from the full multilevel model. But there isn't really a way around that and the plot is just a visualization of the aggregated estimates and the CIs shown are correct.
You need to specify your own grouping in a new column of data and use this as the new random effect:
df$study_group <- c(1,1,1,2,2,3,4,5,5,5) # example
full.model <- rma.mv(yi = yi,
                     V = vi,
                     slab = Author,
                     data = df,
                     random = ~ 1 | study_group,
                     test = "t",
                     method = "REML")
forest(full.model)

R: glmrob can't predict models with dropped co-linear columns, while glm can?

I'm learning to implement robust glms in R, but can't figure out why I am unable to get glmrob to predict values from my regression models when I have a model where some columns are dropped due to co-linearity. Specifically when I use the predict function to predict values from a glmrob, it always gives NA for all values. I don't observe this when predicting values from the same data & model using glm. It doesn't seem to matter what data I use -- as long as there is a NA coefficient in the fitted model (and the NA isn't the last coefficient in the coefficient vector), the predict does not work.
This behavior holds for all datasets and models I have tried where an internal column is dropped due to co-linearity. I include a fake data set where two columns are dropped from the model, which gives two NAs in the coefficient list. Both glm and glmrob give nearly identical coefficients, yet predict only works with the glm model. So my question is: what don't I understand about robust regression that would prevent my glmrob models from generating predicted values?
library(robustbase)
#Make fake data with two categorical predictors
df <- data.frame("category" = rep(c("A","B","C"),each=6))
df$location <- rep(1:6,each=3)
val <- rep(c(500,50,5000),each=6)+rep(c(50,100,25,200,100,1),each=3)
df$value <- rpois(NROW(df),val)
#note that predict works if we omit the newdata parameter. However I need the newdata param
#so I use the original dataframe here as a stand-in.
mod <- glm(val ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) # works fine
mod <- glmrob(val ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) #predicts NA for all values
I've been digging into this and have concluded that the problem does not lie in my understanding of robust regression, but rather the problem lies with a bug in the robustbase package. The predict.lmrob function does not correctly pick the necessary coefficients from the model before the prediction. It needs to pick the first x non-NA coefficients (where x=rank of the model matrix). Instead it merely picks the first x coefficients without checking if they are NA. This explains why this problem only surfaces for models where the NA isn't the last coefficient in the coefficient vector.
To fix this, I copied the predict.lmrob source using:
getAnywhere(predict.lmrob)
and created my own replacement function. In this function I made a single modification to the code:
...
p <- object$rank
if (is.null(p)) {
df <- Inf
p <- sum(!is.na(coef(object)))
#piv <- seq_len(p) # old code
piv <- which(!is.na(coef(object))) # new code
}
else {
p1 <- seq_len(p)
piv <- if (p)
qr(object)$pivot[p1]
}
...
I've run a few hundred datasets using this change and it has worked well.
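The difference between the old and the new line can be seen on a toy coefficient vector. This is only a sketch with made-up values; cf, old_piv and new_piv are hypothetical names and nothing here calls robustbase.

```r
# Hypothetical coefficient vector with NAs in the middle (dropped columns)
cf <- c("(Intercept)" = 1.2, a = NA, b = 0.5, c = NA, d = -0.3)

p <- sum(!is.na(cf))            # number of estimable coefficients

old_piv <- seq_len(p)           # old code: first p positions, NA or not
new_piv <- which(!is.na(cf))    # new code: positions of the non-NA coefficients

cf[old_piv]   # still contains an NA, so any prediction built on it is NA
cf[new_piv]   # only the estimable coefficients
```

When the NA coefficients sit at the end of the vector the two selections coincide, which matches the observation that the problem only shows up when an NA is not the last coefficient.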

Calculating logLik by hand from a logistic regression

I ran a mixed model logistic regression adjusting my model with genetic relationship matrix using an R package known as GMMAT (function: glmmkin()).
My output from the model includes (taken from the user manual):
theta: the dispersion parameter estimate [1] and the variance component parameter estimate [2]
coefficients: fixed effects parameter estimates (including the intercept).
linear.predictors: the linear predictors.
fitted.values: fitted mean values on the original scale.
Y: a vector of length equal to the sample size for the final working vector.
P: the projection matrix with dimensions equal to the sample size.
residuals: residuals on the original scale. NOT rescaled by the dispersion parameter.
cov: covariance matrix for the fixed effects (including the intercept).
converged: a logical indicator for convergence.
I am trying to obtain the log-likelihood in order to compute variance explained. My first instinct was to pull apart the logLik.glm function in order to compute this "by hand" and I got stuck at trying to compute AIC. I used the answer from here.
I did a sanity check with a logistic regression run with stats::glm() where the model1$aic is 4013.232 but using the Stack Overflow answer I found, I obtained 30613.03.
My question is -- does anyone know how to compute log likelihood from a logistic regression by hand using the output that I have listed above in R?
No statistical insight here, just the solution I see from looking at glm.fit. This only works if you did not specify weights while fitting the models (or, if you did, you would need to include those weights in the model object).
get_logLik <- function(s_model, family = binomial(logit)) {
  n <- length(s_model$y)
  wt <- rep(1, n) # or s_model$prior.weights if that field exists
  deviance <- sum(family$dev.resids(s_model$y, s_model$fitted.values, wt))
  mod_rank <- sum(!is.na(s_model$coefficients)) # or s_model$rank if that field exists
  # family$aic() returns -2 * log-likelihood; adding 2 * rank gives the AIC
  aic <- family$aic(s_model$y, rep(1, n), s_model$fitted.values, wt, deviance) + 2 * mod_rank
  # invert AIC = -2 * logLik + 2 * rank
  log_lik <- mod_rank - aic/2
  return(log_lik)
}
For example...
model <- glm(vs ~ mpg, mtcars, family = binomial(logit))
logLik(model)
# 'log Lik.' -12.76667 (df=2)
sparse_model <- model[c("theta", "coefficients", "linear.predictors", "fitted.values", "y", "P", "residuals", "cov", "converged")]
get_logLik(sparse_model)
#[1] -12.76667
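For an unweighted 0/1 outcome the same number can also be obtained straight from the Bernoulli log-likelihood, sum(y * log(mu) + (1 - y) * log(1 - mu)), which may be easier to adapt. The sketch below checks this against logLik() for an ordinary glm. It assumes you have the observed 0/1 response available (the Y element described in the question is the working vector, not the response), and it is the plain logistic log-likelihood evaluated at the fitted means, not a marginal likelihood that integrates over random effects. The name ll_by_hand is made up.

```r
model <- glm(vs ~ mpg, data = mtcars, family = binomial(logit))

y  <- model$y              # observed 0/1 response
mu <- model$fitted.values  # fitted probabilities

# Bernoulli log-likelihood evaluated at the fitted probabilities
ll_by_hand <- sum(y * log(mu) + (1 - y) * log(1 - mu))

ll_by_hand
logLik(model)   # should agree with ll_by_hand
```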

Collinearity after accounting for random/mixed effects

Could two or more predictors become more or less collinear after accounting for random effects?
In my case I have tested for collinearity prior to modelling, e.g. using VIF, and everything checks out. However, the ranking (using IC) of different models makes me uncertain whether it truly can separate between the predictors.
Any ideas?
ps! Can someone with higher rep than I add a more relevant tag such as collinearity?
There are some solutions listed in a blog post. It provides a function that calculates VIFs for lmer and lme model objects from the lme4 and nlme R packages, respectively. I have copied the code for the function below.
vif.lme <- function (fit) {
  ## adapted from rms::vif
  v <- vcov(fit)
  nam <- names(fixef(fit))
  ## exclude intercepts
  ns <- sum(1 * (nam == "Intercept" | nam == "(Intercept)"))
  if (ns > 0) {
    v <- v[-(1:ns), -(1:ns), drop = FALSE]
    nam <- nam[-(1:ns)]
  }
  d <- diag(v)^0.5
  v <- diag(solve(v/(d %o% d)))
  names(v) <- nam
  v
}
Once you have run that code, the new function vif.lme is available in your R session. I give an example below using a random data set and an uninformative random effect. I use an uninformative random effect so that lme from nlme produces the same parameter values for the predictors as lm in base R. Then I calculate variance inflation factors both with the code above and with the vif function from the car package, which calculates VIFs for linear models, to show that they give the same output.
#make 4 vectors- c is used as an uninformative random effect for the lme model
a<-c(1:10)
b1<-c(2,4,6,8,10,100,14,16,18,20)
b2<-c(1,9,2,4,5,6,4,3,2,-1)
c<-c(1,1,1,1,1,1,1,1,1,1)
test<-data.frame(a,b1,b2,c)
#model a as a function of b1 and b2, and c as a random effect
require(nlme)
fit<-lme(a~b1+b2, random=~1|c,data=test)
#see how the model fits
summary(fit)
#check variance inflation factors
vif.lme(fit)
#create a new regular linear regression model and check VIF using the car package.
#answers should be the same, as our random effect above was totally uninformative
require(car)
fit2<- lm(a~b1+b2,data=test)
#check to see that parameter fits are the same.
summary(fit2)
#check to see that variance inflation factors are the same
vif(fit2)
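To see why the key line diag(solve(v/(d %o% d))) yields VIFs, here is a base R check on the same toy data. It rescales the coefficient covariance matrix to a correlation matrix, inverts it, and compares the diagonal with the textbook definition 1/(1 - R^2), where R^2 comes from regressing one predictor on the other. This is a sketch for an ordinary lm only; vif_from_vcov and vif_from_r2 are made-up names.

```r
test <- data.frame(a  = 1:10,
                   b1 = c(2, 4, 6, 8, 10, 100, 14, 16, 18, 20),
                   b2 = c(1, 9, 2, 4, 5, 6, 4, 3, 2, -1))
fit <- lm(a ~ b1 + b2, data = test)

# Same steps as vif.lme: drop the intercept, rescale to correlations, invert
v <- vcov(fit)[-1, -1, drop = FALSE]
d <- sqrt(diag(v))
vif_from_vcov <- diag(solve(v / (d %o% d)))

# Textbook definition for b1: 1 / (1 - R^2) from regressing b1 on b2
r2_b1 <- summary(lm(b1 ~ b2, data = test))$r.squared
vif_from_r2 <- 1 / (1 - r2_b1)

vif_from_vcov
vif_from_r2
```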

Do I need to set refit=FALSE when testing for random effects in lmer() models with anova()?

I am currently testing whether I should include certain random effects in my lmer model. I use the anova function for that. My procedure so far is to fit each model with a call to lmer() with REML=TRUE (the default). Then I call anova() on two models, one of which includes the random effect being tested and one of which does not. However, it is well known that anova() refits the models with ML; in newer versions of anova() you can prevent this by setting refit=FALSE. To test for random effects, should I set refit=FALSE in my call to anova() or not? (If I do set refit=FALSE, the p-values tend to be lower. Are the p-values anti-conservative when I set refit=FALSE?)
Method 1:
mod0_reml <- lmer(x ~ y + z + (1 | w), data=dat)
mod1_reml <- lmer(x ~ y + z + (y | w), data=dat)
anova(mod0_reml, mod1_reml)
This will result in anova() refitting the models with ML instead of REML. (Newer versions of the anova() function also print a message about this.)
Method 2:
mod0_reml <- lmer(x ~ y + z + (1 | w), data=dat)
mod1_reml <- lmer(x ~ y + z + (y | w), data=dat)
anova(mod0_reml, mod1_reml, refit=FALSE)
This will result in anova() performing its calculations on the original models, i.e. with REML=TRUE.
Which of the two methods is correct in order to test whether I should include a random effect or not?
Thanks for any help
In general I would say that it would be appropriate to use refit=FALSE in this case, but let's go ahead and try a simulation experiment.
First fit a model without a random slope to the sleepstudy data set, then simulate data from this model:
library(lme4)
mod0 <- lmer(Reaction ~ Days + (1|Subject), data=sleepstudy)
## also fit the full model for later use
mod1 <- lmer(Reaction ~ Days + (Days|Subject), data=sleepstudy)
set.seed(101)
simdat <- simulate(mod0,1000)
Now refit the null data with the full and the reduced model, and save the distribution of p-values generated by anova() with and without refit=FALSE. This is essentially a parametric bootstrap test of the null hypothesis; we want to see if it has the appropriate characteristics (i.e., uniform distribution of p-values).
sumfun <- function(x) {
m0 <- refit(mod0,x)
m1 <- refit(mod1,x)
a_refit <- suppressMessages(anova(m0,m1)["m1","Pr(>Chisq)"])
a_no_refit <- anova(m0,m1,refit=FALSE)["m1","Pr(>Chisq)"]
c(refit=a_refit,no_refit=a_no_refit)
}
I like plyr::laply for its convenience, although you could just as easily use a for loop or one of the other *apply approaches.
library(plyr)
pdist <- laply(simdat,sumfun,.progress="text")
library(ggplot2); theme_set(theme_bw())
library(reshape2)
ggplot(melt(pdist),aes(x=value,fill=Var2))+
  geom_histogram(aes(y=after_stat(density)), # needs ggplot2 >= 3.3.0; older versions use ..density..
                 alpha=0.5,position="identity",binwidth=0.02)+
  geom_hline(yintercept=1,lty=2)
ggsave("nullhist.png",height=4,width=5)
Type I error rate for alpha=0.05:
colMeans(pdist<0.05)
## refit no_refit
## 0.021 0.026
You can see that in this case the two procedures give practically the same answer and both procedures are strongly conservative, for well-known reasons having to do with the fact that the null value of the hypothesis test is on the boundary of its feasible space. For the specific case of testing a single simple random effect, halving the p-value gives an appropriate answer (see Pinheiro and Bates 2000 and others); this actually appears to give reasonable answers here, although it is not really justified because here we are dropping two random-effects parameters (the random effect of slope and the correlation between the slope and intercept random effects):
colMeans(pdist/2<0.05)
## refit no_refit
## 0.051 0.055
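The halving rule corresponds to comparing the likelihood ratio statistic with a 50:50 mixture of a point mass at zero and a chi-squared distribution with 1 df, rather than with a plain chi-squared with 1 df. A minimal sketch of that calculation for the single-variance case, using a made-up test statistic (lrt_stat, p_naive and p_mix are hypothetical names):

```r
lrt_stat <- 3.2  # hypothetical likelihood ratio statistic (must be > 0)

# Naive p-value: chi-squared with 1 df, ignores the boundary problem
p_naive <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)

# Mixture p-value: the point mass at zero contributes nothing to the upper tail
# for a positive statistic, so only half of the chi-squared tail remains
p_mix <- 0.5 * p_naive

c(naive = p_naive, mixture = p_mix)
```

As noted above, this simple mixture is only justified when a single variance parameter is dropped, so for the slope-plus-correlation comparison it should be read as a rough guide.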
Other points:
You might be able to do a similar exercise with the PBmodcomp function from the pbkrtest package.
The RLRsim package is designed precisely for fast randomization (parametric bootstrap) tests of null hypotheses about random-effects terms, but it doesn't appear to work in this slightly more complex situation.
See the relevant GLMM FAQ section for similar information, including arguments for why you might not want to test the significance of random effects at all.
For extra credit, you could redo the parametric bootstrap runs using the deviance (-2 log likelihood) differences rather than the p-values as output, and check whether the results conform to a mixture of a chi^2_0 (point mass at 0) and a chi^2_n distribution (where n is probably 2, but I wouldn't be sure for this geometry).
