Cluster robust standard errors for mixed effect/LMER models in R?

I'm estimating a mixed-effects model using simulated data. The basis of this is a conjoint experiment: there are N countries in the study, each with P participants, and each respondent completes the experiment twice, so there are N×P×2 observations. Heterogeneity is introduced into the data at the country level, so I fit a mixed-effects model with lmer, with random effects varying by country, to account for this variance. However, because each respondent does the experiment twice, I also want to cluster my standard errors at the individual level. My data and model look something like this:
library(lme4)
library(dplyr)  # for mutate() and the pipe
data(iris)
# generate respondent IDs: each id appears twice (two tasks per respondent)
iris <- iris %>% mutate(id = rep(1:(n()/2), each = 2))
# run the model
mod <- lmer(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width +
              (Sepal.Width + Petal.Length + Petal.Width || Species),
            data = iris, REML = FALSE, control = lmerControl(optimizer = 'bobyqa'))
I then attempt to get clustered SEs using the parameters package:
library(parameters)
param <- model_parameters(
  mod,
  robust = TRUE,
  vcov_estimation = "CR",
  vcov_type = "CR1",
  vcov_args = list(cluster = iris$id)
)
This returns an error:
Error in vcovCR.lmerMod(obj = new("lmerModLmerTest", vcov_varpar = c(0.00740122363004, : Non-nested random effects detected. clubSandwich methods are not available for such models.
I'm not married to any one method. I just want to get clustered SEs for this type of model specification, and as of now I can't find any package that does this. Does anyone know how this can be done, or whether such a model even makes sense? I'm new to MLMs, but my thinking is that if I ran this as a simple linear model I would use lm_robust and cluster by individual, so it seems to me I should do the same here as well.
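For reference, the simple-linear-model version I have in mind is something like this rough sketch using the estimatr package (with the id column created above):
# OLS analogue with cluster-robust SEs at the respondent level
library(estimatr)
mod_lm <- lm_robust(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                    data = iris, clusters = id, se_type = "stata")
summary(mod_lm)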

Related

Quasi-Poisson mixed-effect model on overdispersed count data from multiple imputed datasets in R

I'm dealing with a problem in three parts that I can solve separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm isn't able to include random effects properly (or I haven't figured it out) and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the std. errors, z statistics and p-values as recommended here and discussed here. So I basically turn Poisson mixed-effect regression into quasi-Poisson mixed-effect regression "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure for analyzing multiply imputed datasets – 1. imputation, 2. model fitting, 3. pooling results (I'm using the mice library). I can do these steps for a Poisson regression but not for a quasi-Poisson mixed-effect regression. Is it even possible to A) pool across models based on a quasi-distribution, B) get residuals from a pooled object (class "mipo")? I'm not sure. Also, I'm not sure how to interpret the pooled results for mixed models (I don't see the random effects in the pooled output, although I've found this page, which I'm currently trying to work through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example data are here (repre_d_v1 and repre_all_data are stored in there), and below is the crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load them
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
                    family = "poisson",
                    data = repre_d_v1)
# I know there are some warnings but it's because I share only a modified subset of data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
  phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
  qctab = within(as.data.frame(ctab),
                 {`Std. Error` = `Std. Error`*sqrt(phi)
                  `z value` = Estimate/`Std. Error`
                  `Pr(>|z|)` = 2*pnorm(abs(`z value`), lower.tail = FALSE)
                 })
  return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
                     expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
                                  family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite a similar structure to coef(summary(someGLM))
# but I don't see where the random effects are?
# and more importantly, I wanted a quasi-Poisson model, not just Poisson model...
# ...but here it is not possible to use quasi_table function (defined earlier)...
# ...and that's because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not whether it makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits, averaging them, and then applying that average to the pooled summary table, using a variant of the machinery you posted above.
## compute dispersion values
phivec <- vapply(modelMultiple$analyses,
                 function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
                 FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)
ss <- summary(pool(modelMultiple))  # class "mipo" ("mipo.summary")
## adjust
qctab <- within(as.data.frame(ss),
                {std.error <- std.error*sqrt(phi_mean)
                 statistic <- estimate/std.error
                 p.value <- 2*pnorm(abs(statistic), lower.tail = FALSE)
                })
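To display the adjusted table, something like this should work (assuming the usual mipo.summary columns, as used above):
## show the adjusted pooled coefficient table
qctab[, c("term", "estimate", "std.error", "statistic", "p.value")]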
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...

How to use Cross Validation to Determine a Final Model using Training, Validation, & Test Sets

I am having trouble understanding which of the datasets (training, validation, and test) should be used for the model selection phase versus the final model testing phase. I try to explain it in detail below and post reproducible code at the bottom. Thank you for any and all advice / suggestions!
Let's say we use the open "Life Expectancy (WHO)" dataset available on Kaggle to create predictions on the feature Life expectancy while using RMSE as our measurement of error. (I am asking more so about the concepts behind CV here rather than targeting the lowest RMSE). We first partition a training and test set led_train and led_test from the original dataset led.
Next we create a linear model with y = Life expectancy and x = GDP on data = led_train, and do the same for random forest and kNN models with repeated cross-validation via the caret package. We then run predictions with the newly created models on led_test. The RMSE can be calculated with a function of the true vs. predicted values.
I now have RMSEs of Linear Model = 9.81141, Random Forest = 9.828415, and kNN = 8.923281 on the test set. Based on these values I would obviously select the kNN model as my "Final Model"; however, I am not sure how to test it on new "unseen" data to see how well it actually performs.
Do I need to split led into three sets (training, validation, and test), then use the validation set for the model selection phase and save the test set for the final model? Additionally, if I choose the kNN model, would I change the data inside the train function from led_train to led so that it is trained on ALL of the data, after which I use led_test for the prediction? For the final model, would I again set trControl and run cross-validation, or is this no longer necessary because it was already done on the training data? Please find my reproducible code posted below (you will have to read in the .csv according to your working directory), and thank you again for taking a look!
*The seed is set to 123 for reproducibility and I am running R 3.6.3.
library(pacman)
pacman::p_load(readr, caret, tidyverse, dplyr)
# Download the dataset:
download.file("https://raw.githubusercontent.com/christianmckinnon/StackQ/master/LifeExpectancyData.csv", "LifeExpectancyData.csv")
# Read in the data:
led <-read_csv("LifeExpectancyData.csv")
# Check for NAs
sum(is.na(led))
# Set all NAs to 0
led[is.na(led)] <- 0
# Rename `Life expectancy` to life_exp to avoid using spaces
led <-led %>% rename(life_exp = `Life expectancy`)
# Partition training and test sets
set.seed(123, sample.kind = "Rounding")
test_index <- createDataPartition(y = led$life_exp, times = 1, p = 0.2, list = F)
led_train <- led[-test_index,]
led_test <- led[test_index,]
# Add RMSE as unit of error measurement
RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}
# Create a linear model
led_lm <- lm(life_exp ~ GDP, data = led_train)
# Create prediction
lm_preds <-predict(led_lm, led_test)
# Check RMSE
RMSE(led_test$life_exp, lm_preds)
# The linear Model achieves an RMSE of 9.81141
# Create a Random Forest Model with Repeated Cross Validation
led_cv <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                       search = "random")
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_rf <- train(life_exp ~ GDP, data = led_train,
                  method = "rf", ntree = 150, trControl = led_cv,
                  tuneLength = 5, nSamp = 1000,
                  preProcess = c("center", "scale"))
# Create Prediction
rf_preds <-predict(train_rf, led_test)
# Check RMSE
RMSE(led_test$life_exp, rf_preds)
# The rf Model achieves an RMSE of 9.828415
# kNN Model:
knn_cv <- trainControl(method = "repeatedcv", repeats = 1)
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_knn <- train(life_exp ~ GDP, method = "knn", data = led_train,
                   tuneLength = 10, trControl = knn_cv,
                   preProcess = c("center", "scale"))
# Create the Prediction:
knn_preds <-predict(train_knn, led_test)
# Check the RMSE:
RMSE(led_test$life_exp, knn_preds)
# The kNN model achieves the lowest RMSE of 8.923281
My approach would be the following. The final model should use all of the data. I am not sure what would motivate not including all data in the final model. You are just throwing away predictive power.
For cross-validation, just split the data into training and test data. Then choose the modelling method with the best performance, and fit that method to the complete dataset to create the final model; a rough sketch is below.
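For example, a sketch of refitting the chosen kNN specification on all of led (re-using the tuning setup from the question; one could instead fix k at the previously selected value):
# refit the selected kNN model on the full dataset
set.seed(123, sample.kind = "Rounding")
final_knn <- train(life_exp ~ GDP, method = "knn", data = led,
                   tuneLength = 10, trControl = knn_cv,
                   preProcess = c("center", "scale"))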
The bigger problem with the current code is that the cross-validation method is likely to produce two things: spurious accuracy and potentially spurious model comparisons. You need to deal with temporal autocorrelation in the cross-validation. For example, if my training dataset has features for the UK for 2014 and 2016, you would expect something like a random forest to predict life expectancy for 2015 with high accuracy, and that is potentially all the current type of cross-validation is measuring. It is better to create a segregated dataset so that the countries in training and test are different, or to split it into clearly distinct time periods. The exact approach would depend on exactly what you want the model to predict; a grouped-CV sketch is below.
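For the country issue, a rough sketch using caret's groupKFold (assuming led_train still has the Country column from the WHO data) might look like this:
# hold out whole countries together, so training and held-out folds share no country
library(caret)
set.seed(123, sample.kind = "Rounding")
country_folds <- groupKFold(led_train$Country, k = 5)
grouped_cv <- trainControl(method = "cv", index = country_folds)
train_knn_grouped <- train(life_exp ~ GDP, method = "knn", data = led_train,
                           tuneLength = 10, trControl = grouped_cv,
                           preProcess = c("center", "scale"))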

R - fixed effect of panel data analysis and robust standard errors

I am working with panel data through the plm package in R. I am considering fixed-effects models with group (city) effects, time effects, and two-way (group and time) effects, respectively. Because I detected heteroskedasticity with the Breusch-Pagan test, I want to compute robust standard errors.
I read the help page ?vcovHC, but I could not fully understand how to use coeftest.
My current code is:
library(plm)
library(lmtest)
library(sandwich)
fem_city <- plm(z ~ x + y, data = rawdata, index = c("city", "year"), model = "within", effect = "individual")
fem_year <- plm(z ~ x + y, data = rawdata, index = c("city", "year"), model = "within", effect = "time")
fem_both <- plm(z ~ x + y, data = rawdata, index = c("city", "year"), model = "within", effect = "twoways")
coeftest(fem_city, vcovHC(fem_city, type = 'HC3', cluster = 'group'))
coeftest(fem_year, vcovHC(fem_year, type = 'HC3', cluster = 'time'))
In order to compute the robust standard errors, are these coeftest calls appropriate? I am wondering how to set the cluster option for effect = 'individual' and effect = 'time', respectively.
For example, I set:
cluster = 'group' in coeftest for fem_city (the effect = 'individual' model)
cluster = 'time' in coeftest for fem_year (the effect = 'time' model)
Is this appropriate?
And how do I compute robust standard errors for the two-way model with both city and year?
Set cluster='group' if you want to cluster on the variable serving as the individual index (city in your example).
Set cluster='time' if you want to cluster on the variable serving as the time index (year in your example).
You can cluster on the time index even for a fixed effects one-way individual model.
For clustering on both index variables, you cannot do that with plm::vcovHC. Look at vcovDC from the same package, which provides double clustering (DC = double clustering), e.g.,
coeftest(fem_city, vcovDC(fem_city))
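Putting it together with the models above, the three calls might look like this (I believe vcovDC also accepts a type argument):
# cluster on city (the individual index)
coeftest(fem_city, vcovHC(fem_city, type = "HC3", cluster = "group"))
# cluster on year (the time index)
coeftest(fem_year, vcovHC(fem_year, type = "HC3", cluster = "time"))
# double clustering on both city and year
coeftest(fem_both, vcovDC(fem_both, type = "HC3"))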

Predicting responses for new observations using a model developed with multiple imputation via MICE

I have developed a model via multiple imputation using mice. I want to use this model to predict responses for new observations (containing no missing data), including standard errors. Passing the model object created by mice to predict doesn't work.
Here is a simple example using the built-in nhanes dataset. Say I wanted to develop a logistic regression model of the form age == 3 ~ bmi + hyp + chl, and use this model to predict, say, prob(age = 3 | bmi = 20, hyp = 2, chl = 190):
library('mice')
imp<-mice(nhanes, seed = 1)
#create model on each imputed dataset
model <- with(imp, glm(age == 3 ~ bmi + hyp + chl, family = binomial))
#pool models into one
poolmodel <- pool(model)
#new data
newdata <- data.frame(bmi = 20, hyp = 2, chl = 190)
#attempt to predict response using predict() function
pred <- predict(object = model, newdata = newdata, type = 'link', se.fit = TRUE)
Error in UseMethod("predict") : no applicable method for 'predict' applied to an object of class "c('mira', 'matrix')"
pred <- predict(object = poolmodel, newdata = newdata, type = 'link', se.fit = TRUE)
Error in UseMethod("predict") : no applicable method for 'predict' applied to an object of class "c('mipo', 'mira', 'matrix')"
Obviously it would be straightforward to calculate predicted responses and errors manually using the pooled coefficients and the pooled covariance matrix. The real problem, however, is much larger, and the model relies on a few splines and interactions, which complicates the calculations considerably. I would rather use existing functions that can do all this for me.
Is there a simple solution in R that will output predicted responses for any given (pooled) model object and any given set of new observations, without having to make cumbersome code modifications?
One way to do this is to stack all the imputed datasets together and fit the model on that stacked dataset. After that you can use predict as normal. The parameter estimates generated by pool are actually the average of the parameter estimates from fitting the same model on each imputed dataset separately. Of course, in this case the standard error of each covariate is underestimated.
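For the nhanes example above, a rough sketch of the stacking approach (using mice::complete with action = "long" to stack the imputations) might be:
# stack all imputed datasets into one data frame and fit a single model
# (caveat from above: the standard errors from this fit are underestimated)
stacked <- complete(imp, action = "long")
fit_stacked <- glm(age == 3 ~ bmi + hyp + chl, family = binomial, data = stacked)
predict(fit_stacked, newdata = newdata, type = "link", se.fit = TRUE)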

How to Code Selection for Bootstrap Probit Models in R

This question regards how to code variable selection in a probit model with marginal effects (either directly or by calling some pre-existing package).
I'm conducting a little probit regression of the effects of free and commercial availability of films on the level of piracy of those films as a TLAPD-related blog post.
The easy way of running a probit in R is typically through glm, i.e.:
probit <- glm(y ~ x1 + x2, data=data, family =binomial(link = "probit"))
but that's problematic for interpretation because it doesn't supply marginal effects.
Typically, if I want marginal effects from a probit regression I define this function (I don't recall the original source, but it's a popular function that gets re-posted a lot):
mfxboot <- function(modform, dist, data, boot = 500, digits = 3){
  x <- glm(modform, family = binomial(link = dist), data)
  # get marginal effects
  pdf <- ifelse(dist == "probit",
                mean(dnorm(predict(x, type = "link"))),
                mean(dlogis(predict(x, type = "link"))))
  marginal.effects <- pdf*coef(x)
  # start bootstrap
  bootvals <- matrix(rep(NA, boot*length(coef(x))), nrow = boot)
  set.seed(1111)
  for(i in 1:boot){
    samp1 <- data[sample(1:dim(data)[1], dim(data)[1], replace = TRUE), ]
    x1 <- glm(modform, family = binomial(link = dist), samp1)
    # note: use the bootstrap refit x1 here, not the original fit x
    pdf1 <- ifelse(dist == "probit",
                   mean(dnorm(predict(x1, type = "link"))),
                   mean(dlogis(predict(x1, type = "link"))))
    bootvals[i, ] <- pdf1*coef(x1)
  }
  res <- cbind(marginal.effects, apply(bootvals, 2, sd), marginal.effects/apply(bootvals, 2, sd))
  if(names(x$coefficients[1]) == "(Intercept)"){
    # drop the intercept row before formatting
    res1 <- res[2:nrow(res), ]
    res2 <- matrix(as.numeric(sprintf(paste("%.", paste(digits, "f", sep = ""), sep = ""), res1)), nrow = dim(res1)[1])
    rownames(res2) <- rownames(res1)
  } else {
    res2 <- matrix(as.numeric(sprintf(paste("%.", paste(digits, "f", sep = ""), sep = ""), res)), nrow = dim(res)[1])
    rownames(res2) <- rownames(res)
  }
  colnames(res2) <- c("marginal.effect", "standard.error", "z.ratio")
  return(res2)
}
Then run the regression like this:
mfxboot(modform = "y ~ x1 + x2",
        dist = "probit",
        data = piracy)
but using that approach I don't know that I can run any variable selection algorithms like forward, backward, stepwise, etc.
What's the best way to solve this problem? Is there a better way of running probits in R that reports marginal effects and also allows for automated model selection? Or should I focus on using mfxboot and doing variable selection with that function?
Thanks!
It is not clear why there is a problem. Model (variable) selection and computing of the marginal effects for a given model are sequential, and there is no reason to try to combine the two.
Here is how you might go about computing marginal effects and their bootstrapped standard errors post model (variable) selection:
Perform variable selection using your preferred model selection procedure (including bootstrap model selection techniques as discussed below, not to be confused with the bootstrap you will use to compute the standard errors of the marginal effects for the final model).
Here is an example on the dataset supplied in this question. Note also that this is in no way an endorsement of the use of stepwise variable selection methods.
#================================================
# read in data, and perform variable selection for
# a probit model
#================================================
dfE = read.csv("ENAE_Probit.csv")
formE = emploi ~ genre +
  filiere + satisfaction + competence + anglais
glmE = glm(formula = formE,
           family = binomial(link = "probit"),
           data = dfE)
# perform model (variable) selection
glmStepE = step(object = glmE)
Now pass the selected model to a function that computes the marginal effects.
#================================================
# function: compute marginal effects for logit and probit models
# NOTE: this assumes that an intercept has been included by default
#================================================
fnMargEffBin = function(objBinGLM) {
  stopifnot(objBinGLM$family$family == "binomial")
  vMargEff = switch(objBinGLM$family$link,
                    probit = colMeans(outer(dnorm(predict(objBinGLM,
                                                          type = "link")),
                                            coef(objBinGLM))[, -1]),
                    logit = colMeans(outer(dlogis(predict(objBinGLM,
                                                          type = "link")),
                                           coef(objBinGLM))[, -1])
  )
  return(vMargEff)
}
# test the function
fnMargEffBin(glmStepE)
Here is the output:
> fnMargEffBin(glmStepE)
genre filiere
0.06951617 0.04571239
To get at the standard errors of the marginal effects, you could bootstrap the marginal effects, using, for example, the Boot function from the car package, since it provides such a neat interface for bootstrapping derived statistics from glm estimates.
#================================================
# compute bootstrap std. err. for the marginal effects
#================================================
library(car)
margEffBootE = Boot(object = glmStepE, f = fnMargEffBin,
                    labels = names(coef(glmStepE))[-1], R = 100)
summary(margEffBootE)
Here is the output:
> summary(margEffBootE)
R original bootBias bootSE bootMed
genre 100 0.069516 0.0049706 0.045032 0.065125
filiere 100 0.045712 0.0013197 0.011714 0.048900
Appendix:
As a matter of theoretical interest, there are two ways to interpret your bootstrapped variable selection request.
You can perform model selection (variable selection) by using as a measure of fit a bootstrap model fit criteria. The theory for this is outlined in Shao (1996), and requires a subsampling approach.
You then compute marginal effects and their bootstrap standard errors conditional on the best model selected above.
You can perform variable selection on multiple bootstrap samples, and arrive at either one best model by looking at the variables retained across the multiple bootstrap model selections, or use a model averaging estimator. The theory for this is discussed in Austin and Tu (2004).
You then compute marginal effects and their bootstrap standard errors conditional on the best model selected above.
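To illustrate the second approach, here is a rough sketch (assuming the dfE data frame and formE formula from the example above) that tallies how often each variable survives stepwise selection across bootstrap resamples:
# bootstrap variable-selection frequencies (illustrative sketch only)
set.seed(42)
R <- 100
selected <- lapply(seq_len(R), function(i) {
  samp <- dfE[sample(nrow(dfE), replace = TRUE), ]
  fit  <- glm(formE, family = binomial(link = "probit"), data = samp)
  sel  <- step(fit, trace = 0)
  all.vars(formula(sel))[-1]   # names of the retained predictors
})
# proportion of resamples in which each variable is retained
sort(table(unlist(selected)) / R, decreasing = TRUE)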
