I'd like to quickly compare AICs that are provided as output when running summary() on individual polr() models created using the MASS package in R. I have no problem compiling this info, but what I can't figure out is where exactly the AIC info is being stored in the polr model objects themselves.
I've tried using str() and attributes() on my model objects, and I've even tried using getAnywhere("polr") to look at the source code itself. Nothing is standing out to me.
Anyone know how to extract AIC output from summary(polr_mod)?
Example for reference:
library(MASS)
dat <- data.frame(v1 = factor(rep(0:2, each = 3), ordered = T), v2 = rep(1:3, each = 3))
mod_polr <- polr(v1 ~ v2, data = dat, Hess = T, method = "logistic")
summary(mod_polr)
Call:
polr(formula = v1 ~ v2, data = dat, Hess = T, method = "logistic")
Coefficients:
v2
46.7233
Intercepts:
0|1 1|2
73.62014 117.05781
Residual Deviance: 1.560782e-08
AIC: 6.00
^ See, at the bottom of the output is AIC: 6.00. Where is this stored in the object? I'd like to call/extract it.
AIC is a generic function with no method for objects of class "polr" but with a default method.
The default method's code can be seen by running
getAnywhere("AIC.default")
and what it does is call logLik on its first argument and then compute the AIC with k = 2:
lls <- logLik(mod_polr)
-2*as.numeric(lls) + k*attr(lls, "df")
The number of model parameters is given by attr(lls, "df").
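For mod_polr above, the parameter count is 3 (one slope for v2 plus the two intercepts), which together with the near-zero residual deviance is exactly where the AIC of 6.00 comes from:
attr(lls, "df")
#[1] 3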
In turn, logLik is also generic, but it does have a method for objects of class "polr".
The code returned by
getAnywhere("logLik.polr")
is very simple, a one-liner. Note that the df attribute in its code is set from the edf component of the "polr" object. Reindented, it is the following.
logLik.polr <- function(object, ...) {
structure(
-0.5 * object$deviance,
df = object$edf,
nobs = object[["nobs"]],
class = "logLik"
)
}
The explanations for deviance, edf and nobs are found in help("polr"), section Value:
deviance: the residual deviance.
edf: the (effective) number of degrees of freedom used by the model.
And as that section says, nobs is there for stepAIC; logLik.polr merely stores it as an attribute, and it is not needed for the AIC value itself.
So this can all be pieced together in one function.
Write an AIC method for objects of class "polr", simplifying the code above.
AIC.polr <- function(object, ..., k = 2) {
  # deviance is -2*logLik and edf is the number of estimated parameters
  object$deviance + k * object$edf
}
AIC(mod_polr)
#[1] 6
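As a cross-check, going through logLik() the way AIC.default does gives the same value:
lls <- logLik(mod_polr)
-2*as.numeric(lls) + 2*attr(lls, "df")
#[1] 6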
I know that this is an old question but I just want to share in case others have the same question and want another option.
If you use modelsummary() to create your regression table, you also get AIC, BIC and RMSE. A minimal example is sketched below.
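This sketch assumes the modelsummary package is installed; it builds tables via broom/parameters, so exactly which goodness-of-fit rows appear can vary by model class.
library(modelsummary)
modelsummary(mod_polr)  # bottom rows of the table report AIC and BIC (and RMSE where available)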
Related
I'm dealing with a problem that has three parts I can solve separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm isn't able to include random effects properly (or I haven't figured it out) and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the std. errors, z statistics and p-values as recommended here and discussed here. So I basically turn Poisson mixed-effect regression into quasi-Poisson mixed-effect regression "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure for analyzing multiply imputed datasets – 1. imputation, 2. model fitting, 3. pooling results (I'm using the mice library). I can do these steps for a Poisson regression, but not for a quasi-Poisson mixed-effect regression. Is it even possible to A) pool across models based on a quasi-distribution, and B) get residuals from a pooled object (class "mipo")? I'm not sure. I'm also not sure how to interpret the pooled results for mixed models (random effects are missing from the pooled output, although I've found this page which I'm currently trying to work through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example of data is here (repre_d_v1 and repre_all_data are stored in there) and below is a crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load them
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson",
data = repre_d_v1)
# I know there are some warnings but it's because I share only a modified subset of data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
qctab = within(as.data.frame(ctab),
{`Std. Error` = `Std. Error`*sqrt(phi)
`z value` = Estimate/`Std. Error`
`Pr(>|z|)` = 2*pnorm(abs(`z value`), lower.tail = FALSE)
})
return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite similar structure as coef(summary(someGLM))
# but I don't see where are the random effects?
# and more importantly, I wanted a quasi-Poisson model, not just Poisson model...
# ...but here it is not possible to use quasi_table function (defined earlier)...
# ...and that's because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not whether this makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying it to the summary table, using a variant of the machinery that you posted above.
## compute dispersion values
phivec <- vapply(modelMultiple$analyses,
function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)
ss <- summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
## adjust
qctab <- within(as.data.frame(ss),
{ std.error <- std.error*sqrt(phi_mean)
statistic <- estimate/std.error
p.value <- 2*pnorm(abs(statistic), lower.tail = FALSE)
})
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...
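If you need this adjustment repeatedly, the same logic can be wrapped in a helper. quasi_pool below is a hypothetical name of mine (not part of mice), and it assumes a mira object whose analyses are Poisson glmer fits:
quasi_pool <- function(mira_obj) {
  # per-imputation Pearson dispersion, averaged across the imputed data sets
  phivec <- vapply(mira_obj$analyses,
                   function(m) sum(residuals(m, type = "pearson")^2) / df.residual(m),
                   FUN.VALUE = numeric(1))
  phi_mean <- mean(phivec)
  ss <- summary(mice::pool(mira_obj))
  # inflate the std. errors by sqrt(phi), then recompute z and p, as in quasi_table()
  within(as.data.frame(ss), {
    std.error <- std.error * sqrt(phi_mean)
    statistic <- estimate / std.error
    p.value <- 2 * pnorm(abs(statistic), lower.tail = FALSE)
  })
}
quasi_pool(modelMultiple)  # should reproduce qctab from above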
I have a question regarding how to apply the delta method when I have clustered standard errors. Consider the following dataset and (simple) regression. (Please note that this question is not necessarily about whether it makes sense to cluster on "vs" or about the correctness/usefulness of this regression.)
#Use packages
library(multiwayvcov)
library(sandwich)
library(lmtest)
library(msm)
#load the data
data(mtcars)
# Run the regression
model1<-lm(mpg~cyl+gear+drat, data = mtcars)
#Calculate variance covariance matrix for clustered standard errors
vcov<-cluster.vcov(model1, mtcars$vs)
coeftest(model1, vcov)
# Apply delta method results in error
g<-model1$coefficients[2] / model1$coefficients[1]
deltamethod(g, mean, cov = vcov, ses=TRUE)
# Error I get is this one: "Error in deltamethod(g, mean = g, cov = vcov, ses = TRUE) :
# Covariances should be a 1 by 1 matrix"
Now I want to calculate the standard error for the coefficient on cyl divided by the intercept, using my matrix of clustered standard errors (clustered on "vs", i.e. the vcov matrix). Does anyone know how to do this? I looked at this website, but for some reason I got an error when applying it (https://rdrr.io/rforge/msm/man/deltamethod.html). I appreciate any help.
Just editing the deltamethod call so that it outputs an answer – I don't know whether this answer actually makes sense for what you want to do.
deltamethod(
g = formula('~x2/x1'),
mean = model1$coefficients,
cov = vcov,
ses = TRUE)
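In msm's formula notation, x1, x2, ... index the elements of mean in order, so ~x2/x1 is the cyl coefficient over the intercept. As a sanity check (my own sketch, not part of the original answer), the same standard error can be computed by hand from the delta-method formula var(g) = grad' V grad:
b <- model1$coefficients
grad <- c(-b[2] / b[1]^2, 1 / b[1])  # d(b2/b1)/db1 and d(b2/b1)/db2
V <- vcov[1:2, 1:2]                  # clustered covariance block for (Intercept, cyl)
sqrt(drop(t(grad) %*% V %*% grad))   # should match the deltamethod() output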
How do I get an AIC from an lm_robust object (package estimatr)? I'm using lm_robust because I want a robust estimator for the standard errors. Unlike with lm, no AIC is shown when you run summary(), and calling AIC() on an lm_robust object produces an error. Below is a toy example of the kind of model I'm trying to run.
library(estimatr)
fake_data <- data.frame(outcome = rnorm(100, 3.65, 1),
                        pred1 = rnorm(100, 15, 7),
                        pred2 = as.factor(sample(1:5, 100, replace = T)))
mod1<-lm_robust(outcome~pred1+pred2,data=fake_data)
AIC(mod1)
Here is what the error message looks like:
> AIC(mod1)
Error in UseMethod("logLik") :
no applicable method for 'logLik' applied to an object of class "lm_robust"
If you have to do it with lm_robust, you can calculate it yourself as below.
The formula for the AIC of a Gaussian linear model is
AIC = 2*k + n*(log(2*pi*RSS/n) + 1)
# n  : number of observations
# k  : number of estimated parameters, including the error variance
# RSS: residual sum of squares
If we apply it to R for your case,
# Note that I take k = 7: an intercept, 1 continuous predictor, 4 dummy coefficients for the 5-level factor, plus 1 for the error variance
AIC_calculated <- 2*7 + 100* (log( 2*pi* (1-mod1$r.squared)*mod1$tss/100 ) + 1)
[1] 332.2865
which is the same as what both lm and glm report.
mod2<-lm(outcome~pred1+pred2,data=fake_data)
> AIC(mod2)
[1] 332.2865
And finally, you can of course wrap this calculation in a function and call it on any lm_robust model, without having to set the n and k values yourself for a given dataset:
myAIC <- function(model) {
  # model$N is the number of observations, model$k the number of coefficients;
  # the +1 accounts for the error variance
  2 * (model$k + 1) + model$N * (log(2 * pi * (1 - model$r.squared) * model$tss / model$N) + 1)
}
> myAIC(mod1)
[1] 332.2865
Note: your results may differ from the above because the example data are generated by rnorm() and sample() without a fixed seed.
Here's a workaround
mod1 = lm_robust(outcome ~ pred1 + pred2, data = fake_data)
#Create any fitted model using 'lm' as a placeholder
mod2 = with(list(x = rnorm(10), y = rnorm(10)), lm(y ~ x))
#Copy values in `mod2` from `mod1`
mod2[names(mod2)] = mod1[names(mod2)]
#Calculate residuals in `mod2` (observed minus fitted; the sign does not affect the AIC)
mod2$residuals = fake_data$outcome - mod2$fitted.values
AIC(mod2)
#[1] 326.6092
I've created a function which fits polynomial regression models of increasing degree, up to the input degree. I also collect all such models in a list.
After executing this function for a given set of inputs, I want to inspect the model list to calculate the MSE. However, I see that the individual models refer to the parameter names used inside the function.
Question: How do I make the glm objects refer to the actual variables?
Function definition:
poly.iter = function(dep, indep, dat, deg) { #Function to iterate through polynomial fits up to input degree
  set.seed(1)
  par(mfrow = c(ceiling(sqrt(deg)), ceiling(sqrt(deg)))) #partitioning the plotting window
  MSE.CV = rep(0, deg)
  modlist = list()
  xvar = seq(from = min(indep), to = max(indep), length.out = nrow(dat))
  for (i in 1:deg) {
    mod = glm(dep ~ poly(indep, i), data = dat)
    #MSE.CV[i] = cv.glm(dat, mod, K = 10)$delta[2] #cv.glm generates warnings here; this can happen with missing obs, but Auto has none
    modlist = c(modlist, list(mod))
    MSE.CV[i] = mean(mod$residuals^2) #cv.glm's delta is about 5x this MSE; not sure why
    plot(jitter(indep), jitter(dep), cex = 0.5, col = "darkgrey")
    preds = predict(mod, newdata = list(indep = xvar), se = T)
    lines(xvar, preds$fit, col = "blue", lwd = 2)
    matlines(xvar, cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit), lty = 3, col = "blue")
  }
  return(list("models" = modlist, "errors" = MSE.CV))
}
Function Call:
output.mpg.disp = poly.iter(mpg,displacement,Auto,9)
Inspecting 3rd degree model:
> output.mpg.disp[[1]][[3]]
Call: glm(formula = dep ~ poly(indep, i), data = dat)
Coefficients:
(Intercept) poly(indep, i)1 poly(indep, i)2 poly(indep, i)3
23.446 -124.258 31.090 -4.466
Degrees of Freedom: 391 Total (i.e. Null); 388 Residual
Null Deviance: 23820
Residual Deviance: 7392 AIC: 2274
Now I can't use this object inside cv.glm with the Auto dataset, as it will not recognize indep, dep and i.
You can use the as.formula() function to build the formula from a string before calling glm(). This answers your question (how to make the glm objects refer to the actual variables), but I'm not sure it is enough for calling cv.glm later (I couldn't reproduce your code here without errors). To be clear, you replace the line
mod = glm(dep~poly(indep,i),data=dat)
with something like:
myexp = paste0(dep, "~ poly(", indep, ",", i, ")")
mod = glm(as.formula(myexp), data=dat)
This requires dep and indep to be character strings holding the names of the variables you want to refer to (e.g. indep = "displacement").
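For example, here is a minimal sketch of the change in isolation, assuming the Auto data from the ISLR package (the rest of poly.iter would also need dep and indep handled consistently, e.g. for the plotting):
library(ISLR)
dep <- "mpg"; indep <- "displacement"; i <- 3
myexp <- paste0(dep, " ~ poly(", indep, ", ", i, ")")
mod <- glm(as.formula(myexp), data = Auto)
formula(mod)  # mpg ~ poly(displacement, 3) -- refers to the actual columns
Note that cv.glm re-evaluates the model's stored call, so myexp must still exist in the calling environment at that point.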
The working data looks like:
set.seed(1234)
df <- data.frame(y = rnorm(1:30),
                 fac1 = as.factor(sample(c("A","B","C","D","E"), 30, replace = T)),
                 fac2 = as.factor(sample(c("NY","NC","CA"), 30, replace = T)),
                 x = rnorm(1:30))
The lmer model is fitted as:
library(lme4)
mixed <- lmer(y ~ x + (1|fac1) + (1|fac2), data = df)
I used bootMer to run the parametric bootstrapping and I can successfully obtain the coefficients (intercept) and SEs for fixed&random effects:
mixed_boot_sum <- function(data) {
  s <- sigma(data)
  c(beta = getME(data, "fixef"), theta = getME(data, "theta"), sigma = s)
}
mixed_boot <- bootMer(mixed, FUN = mixed_boot_sum, nsim = 100, type = "parametric", use.u = FALSE)
My first question is how to obtain the coefficients(slope) of each individual levels of the two random effects from the bootstrapping results mixed_boot ?
I have no problem extracting the coefficients (slopes) from the mixed model by using the augment function from the broom package; see below:
library(broom)
mixed.coef <- augment(mixed, df)
However, it seems like broom can't deal with boot class objects, so I can't use the above functions directly on mixed_boot.
I also tried to modify mixed_boot_sum by adding mmList (I thought this would be what I am looking for), but R complains:
Error in bootMer(mixed, FUN = mixed_boot_sum, nsim = 100, type = "parametric", :
bootMer currently only handles functions that return numeric vectors
Furthermore, is it possible to obtain CIs for both fixed and random effects by specifying FUN as well?
Now I am quite confused about the correct specification of FUN to achieve what I need. Any help regarding my question would be greatly appreciated!
My first question is how to obtain the coefficients(slope) of each individual levels of the two random effects from the bootstrapping results mixed_boot ?
I'm not sure what you mean by "coefficients(slope) of each individual level". broom::augment(mixed, df) gives the predictions (residuals, etc.) for every observation. If you want the predicted coefficients at each level I would try
mixed_boot_coefs <- function(fit){
unlist(coef(fit))
}
which for the original model gives
mixed_boot_coefs(mixed)
## fac1.(Intercept)1 fac1.(Intercept)2 fac1.(Intercept)3 fac1.(Intercept)4
## -0.4973925 -0.1210432 -0.3260958 0.2645979
## fac1.(Intercept)5 fac1.x1 fac1.x2 fac1.x3
## -0.6288728 0.2187408 0.2187408 0.2187408
## fac1.x4 fac1.x5 fac2.(Intercept)1 fac2.(Intercept)2
## 0.2187408 0.2187408 -0.2617613 -0.2617613
## ...
If you want the resulting object to be more clearly named you can use:
flatten <- function(cc) setNames(unlist(cc),
outer(rownames(cc),colnames(cc),
function(x,y) paste0(y,x)))
mixed_boot_coefs <- function(fit){
unlist(lapply(coef(fit),flatten))
}
When run through bootMer/confint/boot::boot.ci, these functions will give confidence intervals for each of these values (note that all of the slopes facW.xZ are identical across groups, because the model assumes random variation in the intercept only). In other words, whatever information you know how to extract from a fitted model (conditional modes/BLUPs [ranef], predicted intercepts and slopes for each level of the grouping variable [coef], parameter estimates [fixef, getME], random-effects variances [VarCorr], predictions under specific conditions [predict], ...) can be used in bootMer's FUN argument, as long as you can flatten its structure into a simple numeric vector.
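For example, a sketch reusing the objects above (bootMer returns an object of class "boot", so boot::boot.ci applies directly):
mixed_boot2 <- bootMer(mixed, FUN = mixed_boot_coefs, nsim = 100,
                       type = "parametric", use.u = FALSE)
boot::boot.ci(mixed_boot2, index = 1, type = "perc")  # percentile CI for the first named element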