Get Regression Coefficient Names with R Bootstrap

I'm using the boot package in R to calculate bootstrapped SEs and confidence intervals. I'm trying to find an elegant and efficient way of getting the names of my parameters along with the bootstrap distribution of their estimates. For instance, consider the simple example given here:
# Bootstrap 95% CI for regression coefficients
library(boot)
# Function to obtain regression weights for each bootstrap sample
bs = function(data, indices, formula) {
  d = data[indices, ]  # allows boot to select a sample
  fit = lm(formula, data = d)
  return(coef(fit))
}
# Bootstrapping with 1000 replications
results = boot(
  data = mtcars,
  statistic = bs,
  R = 1000,
  formula = mpg ~ wt + disp)
This works fine, except that the coefficients in the results are labelled only by numerical indices:
# view results
results
Bootstrap Statistics :
original bias std. error
t1* 34.96055404 0.1559289371 2.487617954
t2* -3.35082533 -0.0948558121 1.152123237
t3* -0.01772474 0.0002927116 0.008353625
Particularly with long, complicated regression formulas involving a variety of factor variables, it can take some work to keep track of precisely which index goes with which coefficient estimate.
I could, of course, just re-fit my model outside of the bootstrap function and extract the names with names(coef(fit)), or perhaps with a call to model.matrix(). But these options seem cumbersome, both in terms of extra coding and in terms of extra CPU and RAM resources.
How can I more easily get a nice vector of the coefficient names to pair with the vector of coefficient standard errors in situations like this?
UPDATE
Based on the great answer from lmo below, here is the basic code I use to build a regression table:
Names = names(results$t0)
SEs = apply(results$t, 2, sd)  # bootstrap SE = sd of each coefficient's replicates
Coefs = as.numeric(results$t0)
zVals = Coefs / SEs
Pvals = 2 * pnorm(-abs(zVals))
# data.frame (rather than cbind) keeps the numeric columns numeric
Formatted_Results = data.frame(Names, Coefs, SEs, zVals, Pvals)

The estimates from calling the bootstrapped statistic function (here, a wrapper around lm) on the original data are stored in the element of the results list called "t0".
results$t0
(Intercept) wt disp
34.96055404 -3.35082533 -0.01772474
This object preserves the names of the estimates from the original function call, which you can then access with names.
names(results$t0)
[1] "(Intercept)" "wt" "disp"

Related

User specified variance-covariance matrix in car::Anova not working

I am trying to use the car::Anova function to carry out joint Wald chi-squared tests for interaction terms involving categorical variables.
I would like to compare results when using bootstrapped variance-covariance matrix for the model coefficients. I have some concerns about the normality of residuals and am doing this as a first step before considering permutation tests as an alternative to joint Wald chi-squared tests.
I have computed the variance-covariance matrix from the model fitted on 1000 bootstrap resamples of the data. The problem is that the car::Anova.merMod function does not seem to use the user-specified variance-covariance matrix: I get the same results whether I specify vcov. or not.
I have made a very simple example below where I try to use the identity matrix in Anova(). I have tried this with the more realistic bootstrapped var-cov matrix as well.
I looked at the code on GitHub, and it looks like there is a line where vcov. is overwritten using vcov(mod), so that might be a bug. However, I thought I'd see if anyone here had come across this issue, or could spot a mistake on my part.
Any help would be great!
library(lme4)  # for lmer
library(car)   # for Anova

df1 = data.frame(y = rbeta(180, 2, 5), x = rnorm(180), group = letters[1:30])
mod1 = lmer(y ~ x + (1|group), data = df1)
# Default: uses the variance-covariance matrix from the model
Anova(mod1)
# Should use the user-specified var-cov matrix but does not - same results as above
Anova(mod1, vcov. = diag(2))
# I'm not bootstrapping the var-cov matrix here to save space/time
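For reference, here is a hedged sketch of how the bootstrapped var-cov matrix itself could be built (via lme4::bootMer; the object names are illustrative):
# Bootstrap the fixed effects, then take the empirical covariance of the draws
boot_fit = bootMer(mod1, FUN = fixef, nsim = 1000)
boot_vcov = cov(boot_fit$t)
Anova(mod1, vcov. = boot_vcov)  # appears to be ignored, as described above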
P.S. Using car::linearHypothesis works with a user-specified vcov., but it does not give results using type 3 sums of squares, and it is more laborious to use for more than one interaction term. Therefore I'd prefer to use car::Anova if possible.

Calculating logLik by hand from a logistic regression

I ran a mixed-model logistic regression, adjusting my model with a genetic relationship matrix, using an R package known as GMMAT (function: glmmkin()).
My output from the model includes (taken from the user manual):
theta: the dispersion parameter estimate [1] and the variance component parameter estimate [2]
coefficients: fixed effects parameter estimates (including the intercept).
linear.predictors: the linear predictors.
fitted.values: fitted mean values on the original scale.
Y: a vector of length equal to the sample size for the final working vector.
P: the projection matrix with dimensions equal to the sample size.
residuals: residuals on the original scale. NOT rescaled by the dispersion parameter.
cov: covariance matrix for the fixed effects (including the intercept).
converged: a logical indicator for convergence.
I am trying to obtain the log-likelihood in order to compute variance explained. My first instinct was to pull apart the logLik.glm function in order to compute this "by hand" and I got stuck at trying to compute AIC. I used the answer from here.
I did a sanity check with a logistic regression run with stats::glm() where the model1$aic is 4013.232 but using the Stack Overflow answer I found, I obtained 30613.03.
My question is -- does anyone know how to compute log likelihood from a logistic regression by hand using the output that I have listed above in R?
No statistical insight here, just the solution I see from looking at glm.fit. This only works if you did not specify weights while fitting the model (or, if you did, you would need to include those weights in the model object).
get_logLik <- function(s_model, family = binomial(logit)) {
  n <- length(s_model$y)
  wt <- rep(1, n)  # or s_model$prior.weights, if that field exists
  deviance <- sum(family$dev.resids(s_model$y, s_model$fitted.values, wt))
  mod_rank <- sum(!is.na(s_model$coefficients))  # or s_model$rank, if that field exists
  aic <- family$aic(s_model$y, rep(1, n), s_model$fitted.values, wt, deviance) + 2 * mod_rank
  # AIC = -2*logLik + 2*rank, hence logLik = rank - AIC/2
  log_lik <- mod_rank - aic / 2
  return(log_lik)
}
For example...
model <- glm(vs ~ mpg, data = mtcars, family = binomial(logit))
logLik(model)
# 'log Lik.' -12.76667 (df=2)
sparse_model <- model[c("theta", "coefficients", "linear.predictors",
                        "fitted.values", "y", "P", "residuals", "cov", "converged")]
get_logLik(sparse_model)
#[1] -12.76667

Plot the Profile Deviance for a GLM fit in R

I would like to be able to plot the profile deviance for a parameter estimate from a model fitted with glm() in R. The profile deviance is the deviance as a function of the parameter in question, after estimating all other parameters. I need to plot the deviance for several values around the fitted estimate, to check the assumption of a quadratic deviance function.
My model is predicting reconviction of offenders. The formula is of the form:
reconv ~ [other variables] + sex, where reconv is a binary yes/no factor, and sex is a binary male/female factor. I'd like to plot the profile deviance of the parameter estimated for sex=female (sex=male is the reference level).
The glm() function estimated the parameter as -0.22, with std error 0.12.
[I'm asking this question because there was no answer I could find, but I worked it out, and wanted to post a solution to be of use to others. Of course, additional help is welcome. :-)]
See the profileModel package by Ioannis Kosmidis. He had a paper in the R Journal (R News it would appear) illustrating the package:
Ioannis Kosmidis. The profileModel R package: Profiling objectives for models with linear predictors. R News, 8(2):12-18, October 2008.
The PDF is here (entire newsletter).
See ?profile.glm (and example("profile.glm")) in the MASS package -- I think it will do everything you want (this is not loaded by default, but it is mentioned in ?profile, which might have been the first place you looked ...) (Note that the profiles are generally plotted on a signed-square-root scale, so that a truly quadratic profile will appear as a straight line.)
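A minimal sketch of that approach (assuming a fitted glm object called model):
library(MASS)         # provides the profile method for glm fits
pr <- profile(model)  # profile the deviance for each coefficient
plot(pr)              # on the signed-sqrt scale a quadratic profile is a straight line
pairs(pr)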
The way I found to do this involves using the offset() function (as detailed in Pawitan, Y. (2001) 'In All Likelihood' p172).
The answers given by @BenBolker and @GavinSimpson are better than this, in that they reference packages which will do everything this does and a lot more.
I'm posting this because it's another way round it, and also because plotting things "manually" is sometimes nice for learning. It taught me a lot.
sexi <- as.numeric(dat$sex) - 1  # recode the factor as 0/1 numeric ('dat' is your data frame)
beta <- numeric(60)      # vector to store the betas
deviance <- numeric(60)  # vector to store the deviances
for (i in 1:60) {
  beta[i] <- 0.5 - (0.01 * i)
  # a grid of values either side of the fitted MLE (in this case -0.22)
  mod <- update(model,
                . ~ . - sex                   # drop the fitted variable
                + offset(I(sexi * beta[i])))  # replace it with an offset term
  deviance[i] <- mod$deviance  # store the i'th deviance
}
best <- which.min(deviance)
# Index of the best deviance; should correspond to the fitted value from the model
deviance0 <- deviance - deviance[best]
# Rescale the deviance to zero by subtracting the best deviance
betahat <- beta[best]  # best beta; should be the fitted value
stderror <- 0.12187    # std error of sex, found in summary(model)
quadratic <- ((beta - betahat)^2) * (1 / (stderror^2))
# Quadratic reference function to check the quadratic assumption against
x11()
plot(beta, deviance0, type = "l", xlab = "Beta(sex)", ylim = c(0, 4))
lines(beta, quadratic, lty = 2, col = 3)  # add the quadratic reference curve
abline(3.84, 0, lty = 3)  # line at deviance = 3.84, the 95% chi-squared(1) cutoff

Residuals and plots in ordered multinomial regression

I need to plot a binned residual plot of fitted versus residual values from an ordered multinomial logit regression.
How can I extract residuals when using polr? Is there any other function for ordered multinomial logit from which residuals can be extracted?
This is the code I used
library(MASS)  # for polr
library(arm)   # for binnedplot

options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data, method = 'logistic')
fit <- mod1$fitted.values
res <- residuals(mod1)
binnedplot(fit, res)
The problem is that the object 'res' is NULL.
Thanks
For a start, can you tell us how residuals would be defined in principle for a model with categorical responses? fitted.values is a matrix of probabilities. You could define residuals in terms of correct prediction (taking the most likely outcome as the prediction, as in the default predict method for polr objects), or you could compute a table of true values against predicted values. Alternatively, you could reduce the ordinal data back to an integer scale and compute a mean outcome as the prediction ... but I can't see that there's any unique way to define the residuals in the first place.
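A quick sketch of the true-versus-predicted table idea (assuming the polr fit mod1 from the question):
pred <- predict(mod1)  # default predict method for polr: the most likely category
table(observed = model.response(model.frame(mod1)), predicted = pred)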
In polr(), there is no function that returns residuals; you have to calculate them manually from whatever definition you choose.
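As one ad hoc illustration (a sketch, not a standard definition), you could take one minus the fitted probability of the observed category:
probs <- fitted(mod1)  # n x K matrix of fitted class probabilities
obs <- as.integer(model.response(model.frame(mod1)))  # index of each observed category
res_ad_hoc <- 1 - probs[cbind(seq_len(nrow(probs)), obs)]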
There are actually plenty of ways to get residuals from an ordinal probit/logit model. Although polr does not provide any residuals, vglm provides several. See ?residualsvglm in the VGAM package (see also below).
NOTE: However, for a control function/2SRI approach, Wooldridge (2014) suggests using the generalised residuals as described in Vella (1993). As far as I know, these are currently not available in R (although I am working on that), but they are in Stata (using predict gr, score).
Residuals in VGLM
Surrogate residuals for polr
You can use the sure package (link) to calculate surrogate residuals with resids. The package is based on this paper in the Journal of the American Statistical Association.
library(sure)  # for the resids function and sample data sets
library(MASS)  # for the polr function

df1 <- df1     # force a local copy of the lazy-loaded sample data from sure
df1$x1 <- df1$x
df1$x <- NULL
df1$y <- df2$y
df1$x2 <- df2$x
df1$x3 <- df3$x
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data = df1, method = 'probit')
fit <- mod1$fitted.values
res <- resids(mod1)
EDIT: One big issue is the following (from ?resids):
"Note: Surrogate residuals require sampling from a continuous distribution; consequently, the result will be different with every call to resids. The internal functions used for sampling from truncated distributions when method = "latent" are based on modified versions of rtrunc and qtrunc."
Even when running resids(mod1, nsim = 1000, method = "latent"), the outcome does not settle down from one call to the next.
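A hedged workaround sketch (not from the sure documentation): average the surrogate residuals over many draws to damp the Monte Carlo noise:
set.seed(101)  # the surrogate draws are random, so fix the seed for reproducibility
res_mat <- replicate(100, resids(mod1))  # one column of residuals per call
res_avg <- rowMeans(res_mat)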

Logistic Regression Using R

I am running logistic regressions using R right now, but I cannot seem to get many useful model fit statistics. I am looking for metrics similar to SAS:
http://www.ats.ucla.edu/stat/sas/output/sas_logit_output.htm
Does anyone know how (or what packages) I can use to extract these stats?
Thanks
Here's a Poisson regression example:
## from ?glm:
d.AD <- data.frame(counts = c(18,17,15,20,10,20,25,13,12),
                   outcome = gl(3,1,9),
                   treatment = gl(3,3))
glm.D93 <- glm(counts ~ outcome + treatment, data = d.AD, family = poisson())
Now define a function to fit an intercept-only model with the same response, family, etc., compute summary statistics, and combine them into a table (matrix). The formula .~1 in the update command below means "refit the model with the same response variable [denoted by the dot on the LHS of the tilde] but with only an intercept term [denoted by the 1 on the RHS of the tilde]"
glmsumfun <- function(model) {
  glm0 <- update(model, . ~ 1)  ## refit with intercept only
  ## apply the built-in logLik (log-likelihood), AIC, and
  ## BIC (Bayesian/Schwarz Information Criterion) functions
  ## to the models with and without the intercept ('model' and 'glm0');
  ## combine the results in a two-column matrix with appropriate
  ## row and column names (note: use 'model', not the global glm.D93,
  ## so the function works for any fitted model passed in)
  matrix(c(logLik(model), BIC(model), AIC(model),
           logLik(glm0), BIC(glm0), AIC(glm0)),
         ncol = 2,
         dimnames = list(c("logLik", "SC", "AIC"), c("full", "intercept_only")))
}
Now apply the function:
glmsumfun(glm.D93)
The results:
full intercept_only
logLik -23.38066 -26.10681
SC 57.74744 54.41085
AIC 56.76132 54.21362
EDIT:
anova(glm.D93,test="Chisq") gives a sequential analysis of deviance table containing df, deviance (=-2 log likelihood), residual df, residual deviance, and the likelihood ratio test (chi-squared test) p-value.
drop1(glm.D93) gives a table with the AIC values (df, deviances, etc.) for each single-term deletion; drop1(glm.D93,test="Chisq") additionally gives the LRT test p value.
Certainly glm with a family="binomial" argument is the function most commonly used for logistic regression. Note that the default handling of contrasts for factors differs between the two systems: R uses treatment contrasts, and SAS (I think) uses sum contrasts. You can look these technical issues up on R-help; they have been discussed many, many times over the last ten-plus years.
I see Greg Snow mentioned lrm in 'rms'. It has the advantage of being supported by several other functions in the 'rms' suite of methods. I would use it, too, but learning the rms package may take some additional time. I didn't see an option that would create SAS-like output.
If you want to compare the packages on similar problems, the UCLA Stat Computing pages have another resource: http://www.ats.ucla.edu/stat/r/dae/default.htm , where a large number of methods are exemplified in SPSS, SAS, Stata and R.
Using the lrm function in the rms package may give you the output that you are looking for.
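For instance, a minimal sketch (the model and data here are chosen purely for illustration):
library(rms)
fit <- lrm(vs ~ mpg, data = mtcars)
fit  # printing the fit reports the LR chi-square, pseudo R-squared, C statistic, etc.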
