How to get AIC from an lm_robust object in R

How do I get an AIC from an lm_robust object (package estimatr)? I'm using lm_robust because I want a robust estimator for the standard errors. Unlike with lm, the AIC is not reported by summary(), and calling AIC() on an lm_robust object produces an error. Below is a toy example of the kind of model I'm trying to run.
library(estimatr)
fake_data <- data.frame(outcome = rnorm(100, 3.65, 1),
                        pred1 = rnorm(100, 15, 7),
                        pred2 = as.factor(sample(1:5, 100, replace = TRUE)))
mod1 <- lm_robust(outcome ~ pred1 + pred2, data = fake_data)
AIC(mod1)
Here is what the error message looks like:
> AIC(mod1)
Error in UseMethod("logLik") :
no applicable method for 'logLik' applied to an object of class "lm_robust"

If you have to use lm_robust, you can calculate the AIC yourself as shown below.
The formula for the AIC of a Gaussian linear model is
AIC = 2*k + n*(log(2*pi*RSS/n) + 1)
# n   : number of observations
# k   : number of estimated parameters, i.e. all regression coefficients
#       (intercept and factor dummies included) plus the residual variance
# RSS : residual sum of squares
If we apply it in R to your case:
# Note: k = 7 here, since the model has 1 intercept + 1 continuous predictor
# + 4 dummies for the 5-level factor, plus 1 for the residual variance
# RSS = (1 - R^2) * TSS
AIC_calculated <- 2*7 + 100 * (log(2*pi * (1 - mod1$r.squared) * mod1$tss / 100) + 1)
[1] 332.2865
which matches the output of both lm and glm:
mod2 <- lm(outcome ~ pred1 + pred2, data = fake_data)
> AIC(mod2)
[1] 332.2865
Finally, you can wrap this calculation in a function that takes the lm_robust model directly, so you never have to set n and k by hand for any given data:
myAIC <- function(fit) {
  # fit$k regression coefficients + 1 for the residual variance
  2 * (fit$k + 1) + fit$N * (log(2*pi * (1 - fit$r.squared) * fit$tss / fit$N) + 1)
}
> myAIC(mod1)
[1] 332.2865
Note: your results may differ because no seed is set before the random draws (rnorm() and sample()) that build the data frame.

Here's a workaround
mod1 <- lm_robust(outcome ~ pred1 + pred2, data = fake_data)
# Create any fitted model using 'lm' as a placeholder
mod2 <- with(list(x = rnorm(10), y = rnorm(10)), lm(y ~ x))
# Copy values in `mod2` from `mod1`
mod2[names(mod2)] <- mod1[names(mod2)]
# Recompute residuals in `mod2` (the sign is irrelevant for AIC,
# which uses only squared residuals)
mod2$residuals <- mod2$fitted.values - fake_data$outcome
AIC(mod2)
#[1] 326.6092
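Alternatively, here is a minimal sketch (my own addition, not part of the answers above) that defines a logLik() method for lm_robust objects so that the generic AIC() dispatches directly. It assumes the usual Gaussian likelihood of lm, which is appropriate because lm_robust point estimates are ordinary OLS estimates; only the standard errors differ:
logLik.lm_robust <- function(object, ...) {
  n   <- object$N
  rss <- (1 - object$r.squared) * object$tss  # residual sum of squares
  val <- -n/2 * (log(2 * pi * rss / n) + 1)
  attr(val, "df")   <- object$k + 1           # coefficients + residual variance
  attr(val, "nobs") <- n
  class(val) <- "logLik"
  val
}
AIC(mod1)  # now works via the method above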

Related

Can glmmLasso be used with the Tweedie distribution?

I have a linear mixed-effects model and I am trying to do variable selection. The model tests the level of forest degradation at 1000 sampled points. Most points have no degradation, so the dependent variable is highly skewed with many zeros; therefore, I am using the Tweedie distribution to fit the model. My main question is: can the Tweedie distribution actually be used in the glmmLasso function? My second question is: do I even need to use this distribution in glmmLasso()? Any help is much appreciated!
When I run the function with family = tweedie(var.power=1.2,link.power=0) I get the following error:
Error in logLik.glmmLasso(y = y, yhelp = yhelp, mu = mu, family = family, :
object 'loglik' not found
If I change the link.power from 0 to 1 (which I think is not correct for my model, but just for the sake of figuring out the problem), I get a different error:
Error in grad.lasso[b.is.0] <- score.beta[b.is.0] - lambda.b * sign(score.beta[b.is.0]) :
NAs are not allowed in subscripted assignments
Here tweedie() comes from the statmod package. A simple reproducible example:
library(tweedie)
library(tidyverse)
library(glmmLasso)
library(statmod)

power <- 2
mu <- 1
phi <- seq(2, 8, by = 0.1)
set.seed(10000)
y <- rtweedie(100, mu = mu, power = power, phi = 3)
x <- rnorm(100)
z <- c(rep(1, 50), rep(2, 50))
df <- as.data.frame(cbind(y, x, z))
df$z <- as.factor(df$z)
f <- y ~ x
varSelect <- glmmLasso(fix = f, rnd = list(z = ~1), data = df,
                       lambda = 5,
                       family = tweedie(var.power = 1.2, link.power = 0))
I created a hacked version of glmmLasso that incorporates the Tweedie distribution as an option and put it on GitHub. I had to change two aspects of the code:
add a clause to compute the log-likelihood when family$family == "Tweedie" (a sketch of what such a clause might look like follows this list);
in a number of places where the code was essentially if (family$family %in% list_of_families) ..., add "Tweedie" as an option.
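For illustration only, a minimal sketch of the kind of log-likelihood clause involved, based on tweedie::dtweedie; the variable names (y, mu, phi, var.power) stand in for glmmLasso internals and are hypothetical, and the actual patch lives in the repository below:
if (family$family == "Tweedie") {
  # log-density of the observed response at the current mean estimates
  loglik <- sum(log(tweedie::dtweedie(y, mu = mu, phi = phi,
                                      power = var.power)))
}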
remotes::install_github("bbolker/glmmLasso-bmb")
packageVersion("glmmLasso")
## [1] ‘1.6.2.9000’
Your example runs for me now, but I haven't checked at all to see if the results are sensible.

Error: $ operator not defined for this S4 class while running hoslem.test

I'm working on an optimization of a logistic regression model made with glm; the optimization is a lasso regression using glmnet. I want to compare both models using the output of a Hosmer-Lemeshow test, and I get this output.
For the glm I get
> hl <- hoslem.test(trainingDatos$Exited, fitted(logit.Mod))
> hl
Hosmer and Lemeshow goodness of fit (GOF) test
data: trainingDatos$Exited, fitted(logit.Mod)
X-squared = 2.9161, df = 8, p-value = 0.9395
And when I try to run the test for the lasso regression I get
> hll <- hoslem.test(trainingDatos$Exited, fitted(lasso.model), g=10)
Error in cut.default(yhat, breaks = qq, include.lowest = TRUE) :
'x' must be numeric
I also tried to use the coefficients of the lasso regression to make it numeric and I get
> hll <- hoslem.test(trainingDatos$Exited, fitted(lasso.model$beta), g=10)
Error: $ operator not defined for this S4 class
But when I treat it as an S4
> hll <- hoslem.test(trainingDatos$Exited, fitted(lasso.model@beta), g=10)
Error in fitted(lasso.model@beta) :
trying to get slot "beta" from an object (class "lognet") that is not an S4 object
Any way to run the test for my lasso regression?
Here is my full code for the lasso regression; I can't share the database right now, sorry.
#Creation of Training Data Set
input_ones <- Datos[which(Datos$Exited == 1), ] #All 1s
input_zeros <- Datos[which(Datos$Exited == 0), ] #All 0s
set.seed(100)
#Training 1s
input_ones_training_rows <- sample(1:nrow(input_ones), 0.7*nrow(input_ones))
#Training 0s
input_zeros_training_rows <- sample(1:nrow(input_zeros), 0.7*nrow(input_ones))
training_ones <- input_ones[input_ones_training_rows, ]
training_zeros <- input_zeros[input_zeros_training_rows, ]
trainingDatos <- rbind(training_ones, training_zeros)
library(glmnet)
#Conversion of training data into matrix form
x <- model.matrix(Exited ~ CreditScore + Geography + Gender
                  + Age + Tenure + Balance + IsActiveMember
                  + EstimatedSalary, trainingDatos)[,-1]
#Defining numeric response variable
y <- trainingDatos$Exited
set.seed(100)
#Grid search to find best lambda
cv.lasso<-cv.glmnet(x, y, alpha = 1, family = "binomial")
#Creation of the model
lasso.model <- glmnet(x, y, alpha = 1, family = "binomial",
                      lambda = cv.lasso$lambda.1se)
coef(cv.lasso, cv.lasso$lambda.1se)
#Now trying to run the test
library(ResourceSelection)
set.seed(12657)
hll <- hoslem.test(trainingDatos$Exited, fitted(lasso.model), g=10)      # 'x' must be numeric
hll <- hoslem.test(trainingDatos$Exited, fitted(lasso.model$beta), g=10) # $ not defined for S4
hll <- hoslem.test(trainingDatos$Exited, fitted(lasso.model@beta), g=10) # says object is not S4
glmnet uses its own predict() method for obtaining fitted values; as noted, the errors come from using fitted(), which has no method for glmnet objects. Alternatively, running such tests can be easier with the gofcat package, where supported model objects are passed directly to the test functions; your glm model, for instance, goes hosmerlem(logit.Mod).
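A minimal sketch of the predict()-based route, assuming x is the model matrix used to fit lasso.model (as in the code above):
library(ResourceSelection)
# type = "response" returns fitted probabilities on the response scale
phat <- as.numeric(predict(lasso.model, newx = x, type = "response"))
hll <- hoslem.test(trainingDatos$Exited, phat, g = 10)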

Manual LOOCV vs cv.glm

In Introduction to Statistical Learning we're asked to perform leave-one-out cross-validation (LOOCV) of a logistic regression manually. The code for it is here:
library(ISLR)  # for the Weekly data set
count <- rep(0, dim(Weekly)[1])
for (i in 1:(dim(Weekly)[1])) {
  ## fit a logistic regression model, leaving the ith observation out of the training data
  glm.fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)
  is_up <- predict.glm(glm.fit, Weekly[i, ], type = "response") > 0.5
  is_true_up <- Weekly[i, ]$Direction == "Up"
  if (is_up != is_true_up)
    count[i] <- 1
}
sum(count)
## [1] 490
This means the error rate is approximately 45% (490 of the 1089 observations are misclassified).
But when we do it using the cv.glm() function from the boot library, the result is far different:
> library(boot)
> glm.fit = glm(Direction~Lag1+Lag2,data=Weekly,family=binomial)
> cv.glm = cv.glm(Weekly,glm.fit)
> cv.glm$delta
[1] 0.2464536 0.2464530
Why does this occur? What does the cv.glm() function exactly do?
I believe there may be a bug in the cv.glm function. On line 23 it calculates
cost(glm.y, fitted(glmfit)), where fitted(glmfit) are fitted probabilities. To calculate the cross-validated error rate (= the number of misclassified observations over n), we first need to map these probabilities to classes. In other words, if you replace
cost.0 <- cost(glm.y, fitted(glmfit))
with
cost.0 <- cost(glm.y, ifelse(fitted(glmfit)>0.5, 1, 0))
I believe you should get the same thing as what you coded up manually.
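Alternatively, without modifying boot's source, you can pass a classification cost function to cv.glm() directly; this cost function is the one shown in the examples of ?cv.glm:
library(boot)
# proportion of held-out observations misclassified at a 0.5 threshold
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
glm.fit <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
cv.err <- cv.glm(Weekly, glm.fit, cost = cost)
cv.err$delta[1]  # should be close to the ~45% manual error rate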

Leave-One-Out CV implementation for linear regression

I am building a linear regression on the cafe dataset and I want to validate the results by calculating leave-one-out cross-validation.
I wrote my own function for that, which works when I fit lm() on all the data, but when I use a subset of the columns (from stepwise regression), I get an error. Consider the following code:
cafe <- read.table("C:/.../cafedata.txt", header = T)
cafe$Date <- as.Date(cafe$Date, format = "%d/%m/%Y")
# delete row 34
cafe <- cafe[-c(34), ]
# won't need the date column
cafe <- cafe[, -1]
library(DAAG)
# scale the data (note: center = FALSE, so this rescales but does not center)
cafe.c <- data.frame(lapply(cafe[, 2:15],
                            function(x) scale(x, center = FALSE,
                                              scale = max(x, na.rm = TRUE))))
cafe.c$Day.of.Week <- cafe$Day.of.Week
cafe.c$Sales <- cafe$Sales
# Leave-one-out cross-validation function
LOOCV <- function(fit, dataset){
  # Arguments:
  # ------------------------------
  # fit: fitted model
  # dataset: dataset to be used
  # ------------------------------
  # Returns the mean of the squared errors over the folds - MSE
  MSEP_ <- c()
  for (idx in 1:nrow(dataset)){
    train <- dataset[-c(idx), ]
    test <- dataset[idx, ]
    MSEP_[idx] <- (predict(fit, newdata = test) - dataset[idx, ]$Sales)^2
  }
  return(mean(MSEP_))
}
Then when I fit the simple linear model and call the function, it works:
#----------------Simple linear regression with all attributes-----------------
fit.all.c <- lm(cafe.c$Sales ~ ., data = cafe.c)
# MSE: 258
LOOCV(fit.all.c, cafe.c)
However, when I fit the same lm() with only a subset of the columns, I get an error:
#-------------------------Linear stepwise regression--------------------------
library(MASS)  # for stepAIC
step <- stepAIC(fit.all.c, direction = "both")
fit.step <- lm(cafe.c$Sales ~ cafe.c$Bread.Sand.Sold + cafe.c$Bread.Sand.Waste
               + cafe.c$Wraps.Waste + cafe.c$Muffins.Sold
               + cafe.c$Muffins.Waste + cafe.c$Fruit.Cup.Sold
               + cafe.c$Chips + cafe.c$Sodas + cafe.c$Coffees
               + cafe.c$Day.of.Week, data = cafe.c)
LOOCV(fit.step, cafe.c)
5495.069
There were 50 or more warnings (use warnings() to see the first 50)
If I look closer:
idx <- 1
train <- cafe.c[-c(idx)]
test <- cafe.c[idx]
(predict(fit.step, newdata = test) -cafe.c[idx]$Sales)^2
I get MSEs for all rows and a warning:
Warning message:
'newdata' had 1 row but variables found have 47 rows
EDIT
I found a related question about this error, which says it occurs when the columns are given different names; however, that is not the case here.
Change your code to the following:
fit.step <- lm(Sales ~ Bread.Sand.Sold + Bread.Sand.Waste
               + Wraps.Waste + Muffins.Sold
               + Muffins.Waste + Fruit.Cup.Sold
               + Chips + Sodas + Coffees
               + Day.of.Week, data = cafe.c)
LOOCV(fit.step, cafe.c)
# [1] 278.8984
idx <- 1
train <- cafe.c[-c(idx), ]
test <- cafe.c[idx, ] # need to select the row, not the column
(predict(fit.step, newdata = test) - cafe.c[idx, ]$Sales)^2
#       1
# 51.8022
Also, your LOOCV implementation is not correct: you must fit a new model on the training data of every leave-one-out fold. Right now you train the model once on the entire dataset and use that single model to compute the MSE on the held-out observation of each fold, whereas each fold should have its own model trained on its own training data, as in the sketch below.
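A minimal corrected sketch (my own illustration): the function takes a model formula instead of a fitted model, and refits inside the loop:
LOOCV <- function(formula, dataset) {
  MSEP_ <- numeric(nrow(dataset))
  for (idx in 1:nrow(dataset)) {
    train <- dataset[-idx, ]
    test  <- dataset[idx, , drop = FALSE]
    fit   <- lm(formula, data = train)  # refit without row idx
    MSEP_[idx] <- (predict(fit, newdata = test) - test$Sales)^2
  }
  mean(MSEP_)
}
LOOCV(Sales ~ ., cafe.c)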

R: obtain coefficients & CIs from bootstrapping mixed-effects model results

The working data looks like:
set.seed(1234)
df <- data.frame(y = rnorm(1:30),
                 fac1 = as.factor(sample(c("A","B","C","D","E"), 30, replace = TRUE)),
                 fac2 = as.factor(sample(c("NY","NC","CA"), 30, replace = TRUE)),
                 x = rnorm(1:30))
The lmer model is fitted as:
library(lme4)
mixed <- lmer(y ~ x + (1|fac1) + (1|fac2), data = df)
I used bootMer to run the parametric bootstrap, and I can successfully obtain the coefficients (intercepts) and SEs for the fixed & random effects:
mixed_boot_sum <- function(data){
  s <- sigma(data)
  c(beta = getME(data, "fixef"), theta = getME(data, "theta"), sigma = s)
}
mixed_boot <- bootMer(mixed, FUN = mixed_boot_sum, nsim = 100,
                      type = "parametric", use.u = FALSE)
My first question is how to obtain the coefficients (slopes) of the individual levels of the two random effects from the bootstrapping results in mixed_boot.
I have no problem extracting the coefficients (slopes) from the mixed model using the augment function from the broom package; see below:
library(broom)
mixed.coef <- augment(mixed, df)
However, it seems that broom can't deal with objects of class boot, so I can't use the above functions directly on mixed_boot.
I also tried to modify mixed_boot_sum by adding mmList (I thought this would be what I am looking for), but R complains:
Error in bootMer(mixed, FUN = mixed_boot_sum, nsim = 100, type = "parametric", :
bootMer currently only handles functions that return numeric vectors
Furthermore, is it possible to obtain CIs for both the fixed & random effects by specifying FUN as well?
Now I am very confused about the correct specification of FUN to achieve my needs. Any help regarding my question would be greatly appreciated!
My first question is how to obtain the coefficients (slopes) of the individual levels of the two random effects from the bootstrapping results in mixed_boot.
I'm not sure what you mean by "coefficients (slopes) of each individual level". broom::augment(mixed, df) gives the predictions (residuals, etc.) for every observation. If you want the predicted coefficients at each level, I would try
mixed_boot_coefs <- function(fit){
  unlist(coef(fit))
}
which for the original model gives
mixed_boot_coefs(mixed)
## fac1.(Intercept)1 fac1.(Intercept)2 fac1.(Intercept)3 fac1.(Intercept)4
## -0.4973925 -0.1210432 -0.3260958 0.2645979
## fac1.(Intercept)5 fac1.x1 fac1.x2 fac1.x3
## -0.6288728 0.2187408 0.2187408 0.2187408
## fac1.x4 fac1.x5 fac2.(Intercept)1 fac2.(Intercept)2
## 0.2187408 0.2187408 -0.2617613 -0.2617613
## ...
If you want the resulting object to be more clearly named you can use:
flatten <- function(cc) setNames(unlist(cc),
                                 outer(rownames(cc), colnames(cc),
                                       function(x, y) paste0(y, x)))
mixed_boot_coefs <- function(fit){
  unlist(lapply(coef(fit), flatten))
}
When run through bootMer/confint/boot::boot.ci, these functions will give confidence intervals for each of these values (note that all of the slopes facW.xZ are identical across groups, because the model allows random variation in the intercepts only). In other words, whatever information you know how to extract from a fitted model (conditional modes/BLUPs [ranef], predicted intercepts and slopes for each level of the grouping variable [coef], parameter estimates [fixef, getME], random-effects variances [VarCorr], predictions under specific conditions [predict], ...) can be used in bootMer's FUN argument, as long as you can flatten its structure into a simple numeric vector. A usage sketch follows.
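For example, a minimal sketch (the nsim value is illustrative; bootMer returns an object compatible with the boot package, so boot::boot.ci applies):
library(boot)
mixed_boot2 <- bootMer(mixed, FUN = mixed_boot_coefs, nsim = 100,
                       type = "parametric", use.u = FALSE)
# percentile CI for the first flattened component
boot.ci(mixed_boot2, index = 1, type = "perc")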
