Issue with car Anova function in R?

I am running car version 2.1.4 and trying to use the Anova function to get Wald-based p-values for a power analysis using logistic regression with a success/failure setup. If I run the following simple factorial, the function throws an error due to 0 residual degrees of freedom, yet the sample size is clearly extremely large. What am I doing wrong, or what am I misunderstanding?
Is the problem with the glm() call, since that likewise reports zero residual df?
X <- matrix(c(100, 66566, 73, 66593, 1201, 398799, 165, 66501),
            nrow = 4, ncol = 2, byrow = TRUE)
x_df <- data.frame(premium = c(300, 300, 500, 500),
                   restrict = c(500, 2500, 500, 2500))
x_df$int <- x_df$premium * x_df$restrict
mod <- glm(X ~ premium + restrict + int,
           data = x_df, family = binomial)
summary(mod)
car::Anova(mod, type = "III", test.statistic = "Wald")
ADD #1:
It appears that the success/failure syntax doesn't work the way I expected. When I manually expand the data out to ~600,000 rows, the fit is the same but the residual df are as expected:
X <- matrix(c(100, 66566, 73, 66593, 1201, 398799, 165, 66501),
            nrow = 4, ncol = 2, byrow = TRUE)
x_df <- data.frame(premium = c(300, 300, 500, 500),
                   restrict = c(500, 2500, 500, 2500))
mod <- glm(X ~ premium + restrict + premium * restrict,
           data = x_df, family = binomial)
summary(mod)
Anova(mod, type = "III", test.statistic = "Wald")

y <- c(rep(1, 100), rep(0, 66566), rep(1, 73), rep(0, 66593),
       rep(1, 1201), rep(0, 398799), rep(1, 165), rep(0, 66501))
premium <- c(rep(300, 66666 * 2), rep(500, 1201 + 398799 + 165 + 66501))
restrict <- c(rep(500, 66666), rep(2500, 66666),
              rep(500, 1201 + 398799), rep(2500, 165 + 66501))
x <- data.frame(y = y, premium = premium, restrict = restrict)
mod2 <- glm(y ~ premium + restrict + premium * restrict,
            data = x, family = binomial)
summary(mod2)
Anova(mod2, type = "III", test.statistic = "Wald")

The residual degrees of freedom are (number of observations) - (number of parameters). You have four observations and four parameters. I'm not sure what else to say ...
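To make that concrete, here is a minimal check (assuming mod and mod2 from the question are in the workspace). With the success/failure matrix, glm() treats each of the four aggregated rows as one binomial observation, while the expanded data contribute one Bernoulli observation per row:
nobs(mod)          # 4 aggregated binomial observations
length(coef(mod))  # 4 estimated parameters
df.residual(mod)   # 4 - 4 = 0
nobs(mod2)         # ~600,000 Bernoulli observations
df.residual(mod2)  # large, as expected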

Related

Can glmmLasso be used with the Tweedie distribution?

I have a linear mixed effects model and I am trying to do variable selection. The model is testing the level of forest degradation in 1000 sampled points. Most points have no degradation, and so the dependent variable is highly skewed with many zeros. Therefore, I am using the Tweedie distribution to fit the model. My main question is: can the Tweedie distribution actually be used in the glmmLasso function? My second question is: do I even need to use this distribution in glmmLasso()? Any help is much appreciated!
When I run the function with family = tweedie(var.power=1.2,link.power=0) I get the following error:
Error in logLik.glmmLasso(y = y, yhelp = yhelp, mu = mu, family = family, :
object 'loglik' not found
If I change the link.power from 0 to 1 (which I think is not correct for my model, but just for the sake of figuring out the problem), I get a different error:
Error in grad.lasso[b.is.0] <- score.beta[b.is.0] - lambda.b * sign(score.beta[b.is.0]) :
NAs are not allowed in subscripted assignments
Here tweedie comes from the statmod package. A simple example:
library(tweedie)
library(tidyverse)
library(glmmLasso)
library(statmod)

power <- 2
mu <- 1
phi <- seq(2, 8, by = 0.1)
set.seed(10000)
y <- rtweedie(100, mu = mu, power = power, phi = 3)
x <- rnorm(100)
z <- c(rep(1, 50), rep(2, 50))
df <- as.data.frame(cbind(y, x, z))
df$z <- as.factor(df$z)
f <- y ~ x
varSelect <- glmmLasso(fix = f, rnd = list(z = ~1), data = df,
                       lambda = 5, family = tweedie(var.power = 1.2, link.power = 0))
I created a hacked version of glmmLasso that incorporates the Tweedie distribution as an option and put it on GitHub. I had to change two aspects of the code:
- add a clause to compute the log-likelihood when family$family == "Tweedie" (a sketch of such a clause follows this list);
- in a number of places where the code was essentially if (family$family %in% list_of_families) ..., add "Tweedie" as an option.
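For concreteness, here is a hedged sketch of the kind of log-likelihood clause the first bullet describes; this is not the actual patch, and it assumes the response y, fitted means mu, dispersion phi, and variance power p are already in scope at that point in the code:
if (family$family == "Tweedie") {
    ## Tweedie density from the tweedie package; the log-likelihood
    ## is the sum of the log densities of the observations
    loglik <- sum(log(tweedie::dtweedie(y, power = p, mu = mu, phi = phi)))
}
To install the patched version: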
remotes::install_github("bbolker/glmmLasso-bmb")
packageVersion("glmmLasso")
## [1] ‘1.6.2.9000’
Your example runs for me now, but I haven't checked at all to see if the results are sensible.

How to extract AIC from polr summary output in R

I'd like to quickly compare AICs that are provided as output when running summary() on individual polr() models created using the MASS package in R. I have no problem compiling this info, but what I can't figure out is where exactly the AIC info is being stored in the polr model objects themselves.
I've tried using str() and attributes() on my model objects, and I've even tried using getAnywhere("polr") to look at the source code itself. Nothing is standing out to me.
Anyone know how to extract AIC output from summary(polr_mod)?
Example for reference:
library(MASS)
dat <- data.frame(v1 = factor(rep(0:2, each = 3), ordered = T),
                  v2 = rep(1:3, each = 3))
mod_polr <- polr(v1 ~ v2, data = dat, Hess = T, method = "logistic")
summary(mod_polr)
Call:
polr(formula = v1 ~ v2, data = dat, Hess = T, method = "logistic")
Coefficients:
v2
46.7233
Intercepts:
0|1 1|2
73.62014 117.05781
Residual Deviance: 1.560782e-08
AIC: 6.00
^ See, at the bottom of the output is AIC: 6.00. Where is this stored in the object? I'd like to call/extract it.
AIC is a generic function with no method for objects of class "polr", but with a default method.
The default method's code can be seen by running
getAnywhere("AIC.default")
and what it does is call logLik on its first argument and then compute the AIC with k = 2:
lls <- logLik(mod_polr)
-2 * as.numeric(lls) + k * attr(lls, "df")
The number of model parameters is given by attr(lls, "df").
In its turn, logLik is also generic, but with a method for objects of class "polr".
The code returned by
getAnywhere("logLik.polr")
is very simple, a one-liner. Note that the df attribute comes from the edf component of the "polr" object. Indented, it is the following.
logLik.polr <- function(object, ...) {
    structure(-0.5 * object$deviance,
              df = object$edf,
              nobs = object[["nobs"]],
              class = "logLik")
}
The explanations for deviance, edf and nobs are found in help("polr"), section Value:
deviance: the residual deviance.
edf: the (effective) number of degrees of freedom used by the model.
And as that section says, nobs is used for stepAIC (but not by logLik.polr).
So this can all be pieced together in one function: write an AIC method for objects of class "polr", simplifying the code above.
AIC.polr <- function(x, k = 2) {
    dev <- x$deviance
    nparams <- x$edf
    dev + k * nparams
}
AIC(mod_polr)
#[1] 6
I know that this is an old question, but I just want to share in case others have the same question and want another option.
If you use modelsummary() to create your regression table, you also get AIC, BIC and RMSE. If you need an example, let me know.
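For instance, a minimal sketch (assuming the modelsummary package is installed; by default its goodness-of-fit rows include AIC and BIC for supported models, which should cover polr fits):
library(modelsummary)
modelsummary(mod_polr)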

Looping cv.glmnet and get the 'best' coefficients

As noted in the help of cv.glmnet, "the results of cv.glmnet are random, since the folds are selected at random. Users can reduce this randomness by running cv.glmnet many times, and averaging the error curves.".
If I make a loop doing cv.glmnet n times, how can I extract the 'best' coefficients? I usually take the coefficients using this command:
coe <- coef(cvfit, s = "lambda.min")
If I use the mean of all the "lambda.min" values, then I don't know how to choose the right cvfit out of the many I generated. Do I have to use the mean of cvfit$cvm, or the MSE, or something else?
Thanks
When you do coef(cvfit, s = "lambda.min"), you are taking the lambda at which the mean cross-validated error is minimized (it is s = "lambda.1se" that gives the largest lambda within 1 standard error of that minimum); see this discussion. So you can average the MSEs across different cv runs:
library(glmnet)
library(mlbench)

data(BostonHousing)
X <- as.matrix(BostonHousing[, -c(4, 14)])
Y <- BostonHousing[, 14]

nfolds <- 5
nreps <- 10
res <- lapply(1:nreps, function(i) {
    fit <- cv.glmnet(x = X, y = Y, nfolds = nfolds)
    data.frame(MSE_mean = fit$cvm, lambda = fit$lambda, se = fit$cvsd)
})
res <- do.call(rbind, res)
We can summarize the results. The standard error here is approximated by just taking the mean of the per-run standard errors; if you want to be precise, you might have to look into the formula for pooled standard deviation (see the sketch after the lambda calculations below):
library(dplyr)
summarized_res <- res %>%
    group_by(lambda) %>%
    summarise(MSE = mean(MSE_mean), se = mean(se)) %>%
    arrange(desc(lambda))

idx <- which.min(summarized_res$MSE)
lambda.min <- summarized_res$lambda[idx]
lambda.min
[1] 0.019303

index_1se <- with(summarized_res, which(MSE < MSE[idx] + se[idx])[1])
lambda_1se <- summarized_res$lambda[index_1se]
lambda_1se
[1] 0.3145908
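As a hedged sketch of that more precise pooling (assuming roughly equal fold sizes across runs, a reasonable approximation is the root mean square of the per-run standard errors rather than their plain mean):
summarized_res_pooled <- res %>%
    group_by(lambda) %>%
    summarise(MSE = mean(MSE_mean), se = sqrt(mean(se^2))) %>%
    arrange(desc(lambda))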
We can plot this:
library(ggplot2)
ggplot(res, aes(x = log(lambda), y = MSE_mean)) +
    stat_summary(fun = mean, size = 2, geom = "point") +
    geom_vline(xintercept = log(c(lambda.min, lambda_1se)))
Note that since the x-axis is log(lambda), the reference lines need log() as well.

Manual LOOCV vs cv.glm

In Introduction to Statistical Learning we're asked to do Leave-One-Out Cross-Validation for logistic regression manually. The code for it is here:
library(ISLR)  # provides the Weekly data

count <- rep(0, dim(Weekly)[1])
for (i in 1:(dim(Weekly)[1])) {
    ## fit a logistic regression model, leaving the ith observation out of the training data
    glm.fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)
    is_up <- predict.glm(glm.fit, Weekly[i, ], type = "response") > 0.5
    is_true_up <- Weekly[i, ]$Direction == "Up"
    if (is_up != is_true_up)
        count[i] <- 1
}
sum(count)
## [1] 490
The source of this code can be found here.
This means that the error rate is approximately 45%.
But when we do it using the cv.glm() function from the boot library, the result is far different:
> library(boot)
> glm.fit = glm(Direction~Lag1+Lag2,data=Weekly,family=binomial)
> cv.glm = cv.glm(Weekly,glm.fit)
> cv.glm$delta
[1] 0.2464536 0.2464530
Why does this occur? What exactly does the cv.glm() function do?
I believe the problem is with the default cost function used by cv.glm. On line 23 it calculates cost(glm.y, fitted(glmfit)), where fitted(glmfit) are fitted probabilities. To calculate the cross-validated error rate (= total number of misclassified observations over n), we first need to map these to classes. In other words, if you replace
cost.0 <- cost(glm.y, fitted(glmfit))
with
cost.0 <- cost(glm.y, ifelse(fitted(glmfit) > 0.5, 1, 0))
I believe you should get the same thing as what you coded up manually.
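An alternative that avoids editing cv.glm itself is to pass a classification cost function, as shown in help("cv.glm"); the default cost is average squared error, which is why delta comes out near 0.25 instead of a misclassification rate. A sketch, reusing glm.fit from above:
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
cv.err <- cv.glm(Weekly, glm.fit, cost = cost)
cv.err$delta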

R: obtain coefficients and CIs from bootstrapping mixed-effects model results

The working data looks like:
set.seed(1234)
df <- data.frame(y = rnorm(1:30),
fac1 = as.factor(sample(c("A","B","C","D","E"),30, replace = T)),
fac2 = as.factor(sample(c("NY","NC","CA"),30,replace = T)),
x = rnorm(1:30))
The lmer model is fitted as:
library(lme4)
mixed <- lmer(y ~ x + (1 | fac1) + (1 | fac2), data = df)
I used bootMer to run the parametric bootstrap, and I can successfully obtain the coefficients (intercepts) and SEs for the fixed & random effects:
mixed_boot_sum <- function(data) {
    s <- sigma(data)
    c(beta = getME(data, "fixef"), theta = getME(data, "theta"), sigma = s)
}
mixed_boot <- bootMer(mixed, FUN = mixed_boot_sum, nsim = 100,
                      type = "parametric", use.u = FALSE)
My first question is: how do I obtain the coefficients (slopes) of each individual level of the two random effects from the bootstrapping results mixed_boot?
I have no problem extracting the coefficients (slopes) from the mixed model by using the augment function from the broom package, see below:
library(broom)
mixed.coef <- augment(mixed, df)
However, it seems like broom can't deal with objects of class boot, so I can't use the above functions directly on mixed_boot.
I also tried to modify mixed_boot_sum by adding mmList (I thought this would be what I am looking for), but R complains:
Error in bootMer(mixed, FUN = mixed_boot_sum, nsim = 100, type = "parametric", :
bootMer currently only handles functions that return numeric vectors
Furthermore, is it possible to obtain CIs for both the fixed & random effects by specifying FUN as well?
Now I am very confused about the correct specification of FUN to achieve my needs. Any help regarding my question would be greatly appreciated!
My first question is: how do I obtain the coefficients (slopes) of each individual level of the two random effects from the bootstrapping results mixed_boot?

I'm not sure what you mean by "coefficients (slopes) of each individual level". broom::augment(mixed, df) gives the predictions (residuals, etc.) for every observation. If you want the predicted coefficients at each level, I would try
mixed_boot_coefs <- function(fit) {
    unlist(coef(fit))
}
which for the original model gives
mixed_boot_coefs(mixed)
## fac1.(Intercept)1 fac1.(Intercept)2 fac1.(Intercept)3 fac1.(Intercept)4
## -0.4973925 -0.1210432 -0.3260958 0.2645979
## fac1.(Intercept)5 fac1.x1 fac1.x2 fac1.x3
## -0.6288728 0.2187408 0.2187408 0.2187408
## fac1.x4 fac1.x5 fac2.(Intercept)1 fac2.(Intercept)2
## 0.2187408 0.2187408 -0.2617613 -0.2617613
## ...
If you want the resulting object to be more clearly named you can use:
flatten <- function(cc) setNames(unlist(cc),
                                 outer(rownames(cc), colnames(cc),
                                       function(x, y) paste0(y, x)))
mixed_boot_coefs <- function(fit) {
    unlist(lapply(coef(fit), flatten))
}
When run through bootMer/confint/boot::boot.ci, these functions will give confidence intervals for each of these values (note that all of the slopes facW.xZ are identical across groups, because the model assumes random variation in the intercepts only). In other words, whatever information you know how to extract from a fitted model (conditional modes/BLUPs [ranef], predicted intercepts and slopes for each level of the grouping variable [coef], parameter estimates [fixef, getME], random-effects variances [VarCorr], predictions under specific conditions [predict] ...) can be used in bootMer's FUN argument, as long as you can flatten its structure into a simple numeric vector.
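As a hedged sketch of that last step (assuming mixed and mixed_boot_coefs from above are in the workspace; bootMer returns an object of class "boot", so boot::boot.ci works on it, and lme4 also supplies a confint method for such objects):
mixed_boot2 <- bootMer(mixed, FUN = mixed_boot_coefs, nsim = 100,
                       type = "parametric")
confint(mixed_boot2, type = "perc")                   # percentile CIs for every flattened coefficient
boot::boot.ci(mixed_boot2, index = 1, type = "perc")  # or one coefficient at a time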
