How to Code Selection for Bootstrap Probit Models in R

This question regards how to code variable selection in a probit model with marginal effects (either directly or by calling some pre-existing package).
I'm conducting a little probit regression of the effects of free and commercial availability of films on the level of piracy of those films as a TLAPD-related blog post.
The easy way of running a probit in R is typically through glm, i.e.:
probit <- glm(y ~ x1 + x2, data=data, family =binomial(link = "probit"))
but that's problematic for interpretation because it doesn't supply marginal effects.
Typically, if I want marginal effects from a probit regression I define this function (I don't recall the original source, but it's a popular function that gets re-posted a lot):
mfxboot <- function(modform,dist,data,boot=500,digits=3){
x <- glm(modform, family=binomial(link=dist),data)
# get marginal effects
pdf <- ifelse(dist=="probit",
mean(dnorm(predict(x, type = "link"))),
mean(dlogis(predict(x, type = "link"))))
marginal.effects <- pdf*coef(x)
# start bootstrap
bootvals <- matrix(rep(NA,boot*length(coef(x))), nrow=boot)
set.seed(1111)
for(i in 1:boot){
samp1 <- data[sample(1:dim(data)[1],replace=T,dim(data)[1]),]
x1 <- glm(modform, family=binomial(link=dist),samp1)
# use the refitted bootstrap model x1 (not the original fit x) for the density
pdf1 <- ifelse(dist=="probit",
mean(dnorm(predict(x1, type = "link"))),
mean(dlogis(predict(x1, type = "link"))))
bootvals[i,] <- pdf1*coef(x1)
}
res <- cbind(marginal.effects,apply(bootvals,2,sd),marginal.effects/apply(bootvals,2,sd))
if(names(x$coefficients[1])=="(Intercept)"){
res1 <- res[2:nrow(res),]
res2 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res1)),nrow=dim(res1)[1])
rownames(res2) <- rownames(res1)
} else {
res2 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res)),nrow=dim(res)[1])
rownames(res2) <- rownames(res)
}
colnames(res2) <- c("marginal.effect","standard.error","z.ratio")
return(res2)
}
Then run the regression like this:
mfxboot(modform = "y ~ x1 + x2",
dist = "probit",
data = piracy)
but using that approach I don't know that I can run any variable selection algorithms like forward, backward, stepwise, etc.
What's the best way to solve this problem? Is there a better way of running probits in R that reports marginal effects and also allows for automated model selection? Or should I focus on using mfxboot and doing variable selection with that function?
Thanks!

It is not clear why there is a problem. Model (variable) selection and computing of the marginal effects for a given model are sequential, and there is no reason to try to combine the two.
Here is how you might go about computing marginal effects and their bootstrapped standard errors after model (variable) selection:
Perform variable selection using your preferred model selection procedure (including bootstrap model selection techniques as discussed below, not to be confused with the bootstrap you will use to compute the standard errors of the marginal effects for the final model).
Here is an example on the dataset supplied in this question. Note also that this is in no way an endorsement of the use of stepwise variable selection methods.
#================================================
# read in data, and perform variable selection for
# a probit model
#================================================
dfE = read.csv("ENAE_Probit.csv")
formE = emploi ~ genre +
filiere + satisfaction + competence + anglais
glmE = glm(formula = formE,
family = binomial(link = "probit"),
data = dfE)
# perform model (variable) selection
glmStepE = step(object = glmE)
Now pass the selected model to a function that computes the marginal effects.
#================================================
# function: compute marginal effects for logit and probit models
# NOTE: this assumes that an intercept has been included by default
#================================================
fnMargEffBin = function(objBinGLM) {
stopifnot(objBinGLM$family$family == "binomial")
vMargEff = switch(objBinGLM$family$link,
probit = colMeans(outer(dnorm(predict(objBinGLM,
type = "link")),
coef(objBinGLM))[, -1]),
logit = colMeans(outer(dlogis(predict(objBinGLM,
type = "link")),
coef(objBinGLM))[, -1])
)
return(vMargEff)
}
# test the function
fnMargEffBin(glmStepE)
Here is the output:
> fnMargEffBin(glmStepE)
genre filiere
0.06951617 0.04571239
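Since ENAE_Probit.csv is not reproduced here, the same two-step workflow can be sketched on simulated data. The data frame dfSim and its columns are made up for illustration, and only base R is used; treat this as a minimal sketch rather than a reproduction of the output above.

```r
# Simulated data: y depends on x1 and x2, x3 is pure noise (all names are hypothetical)
set.seed(42)
n <- 500
dfSim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dfSim$y <- rbinom(n, 1, pnorm(-0.3 + 0.8 * dfSim$x1 - 0.5 * dfSim$x2))

# Step 1: fit the full probit model and run backward selection by AIC
glmFull <- glm(y ~ x1 + x2 + x3, family = binomial(link = "probit"), data = dfSim)
glmSel  <- step(glmFull, trace = 0)

# Step 2: average marginal effects for the selected model, i.e. the mean of
# the normal density at the linear predictor times each slope coefficient
eta <- predict(glmSel, type = "link")
ame <- mean(dnorm(eta)) * coef(glmSel)[-1]
ame
```

The product mean(dnorm(eta)) * coef gives the same numbers as the colMeans(outer(...)) form in fnMargEffBin, because the column mean of an outer product is the mean of the first vector times each element of the second.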
To get the standard errors of the marginal effects, you could bootstrap the marginal effects using, for example, the Boot function from the car package, since it provides a neat interface for bootstrapping statistics derived from glm estimates.
#================================================
# compute bootstrap std. err. for the marginal effects
#================================================
library(car)
# labels must match the coefficients of the selected model, not the full model
margEffBootE = Boot(object = glmStepE, f = fnMargEffBin,
labels = names(coef(glmStepE))[-1], R = 100)
summary(margEffBootE)
Here is the output:
> summary(margEffBootE)
R original bootBias bootSE bootMed
genre 100 0.069516 0.0049706 0.045032 0.065125
filiere 100 0.045712 0.0013197 0.011714 0.048900
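If you prefer not to depend on car, the same idea can be sketched with a plain case-resampling loop in base R. The simulated data frame dfSim and the helper ameProbit are hypothetical names introduced for this illustration.

```r
set.seed(1)
n <- 400
dfSim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dfSim$y <- rbinom(n, 1, pnorm(0.2 + 0.7 * dfSim$x1 - 0.4 * dfSim$x2))

# Average marginal effects for a fitted probit glm (an intercept is assumed)
ameProbit <- function(fit) mean(dnorm(predict(fit, type = "link"))) * coef(fit)[-1]

fit <- glm(y ~ x1 + x2, family = binomial(link = "probit"), data = dfSim)

# Case-resampling bootstrap: refit on each resample and recompute the effects
R <- 200
bootAME <- t(replicate(R, {
  dfB  <- dfSim[sample(nrow(dfSim), replace = TRUE), ]
  fitB <- glm(y ~ x1 + x2, family = binomial(link = "probit"), data = dfB)
  ameProbit(fitB)
}))

# Point estimates next to their bootstrap standard errors
cbind(estimate = ameProbit(fit), bootSE = apply(bootAME, 2, sd))
```

Note that each bootstrap replicate recomputes both the coefficients and the density term from the refitted model, which is what the standard error is meant to capture.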
Appendix:
As a matter of theoretical interest, there are two ways to interpret your bootstrapped variable selection ask.
You can perform model selection (variable selection) by using a bootstrap model fit criterion as the measure of fit. The theory for this is outlined in Shao (1996), and requires a subsampling approach.
You then compute marginal effects and their bootstrap standard errors conditional on the best model selected above.
You can perform variable selection on multiple bootstrap samples, and arrive at either one best model by looking at the variables retained across the multiple bootstrap model selections, or use a model averaging estimator. The theory for this is discussed in Austin and Tu (2004).
You then compute marginal effects and their bootstrap standard errors conditional on the best model selected above.
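The second interpretation, selecting variables on many bootstrap samples and looking at how often each one is retained, can be sketched as follows. This is only a rough illustration in the spirit of Austin and Tu (2004), not a reproduction of their procedure; the simulated data, the use of step() with AIC, and the number of resamples B are all assumptions made for the example.

```r
set.seed(7)
n <- 300
dfSim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dfSim$y <- rbinom(n, 1, pnorm(0.9 * dfSim$x1 + 0.4 * dfSim$x2))  # x3, x4 are noise

candidates <- c("x1", "x2", "x3", "x4")
B <- 50

# For each bootstrap resample, run backward selection and record which terms survive
kept <- replicate(B, {
  dfB  <- dfSim[sample(n, replace = TRUE), ]
  full <- glm(y ~ x1 + x2 + x3 + x4, family = binomial(link = "probit"), data = dfB)
  sel  <- step(full, trace = 0)
  candidates %in% attr(terms(sel), "term.labels")
})
rownames(kept) <- candidates

# Share of resamples in which each variable was retained
inclusionFreq <- rowMeans(kept)
inclusionFreq
```

Variables with a high inclusion frequency are candidates for the final model, on which the marginal effects and their bootstrap standard errors would then be computed as described above.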

Related

Quasi-Poisson mixed-effect model on overdispersed count data from multiple imputed datasets in R

I'm dealing with problems of three parts that I can solve separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm isn't able to include random effects properly (or I haven't figured it out) and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the std. errors, z statistics and p-values as recommended here and discussed here. So I basically turn Poisson mixed-effect regression into quasi-Poisson mixed-effect regression "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure of analyzing multiple imputed datasets – 1. imputation, 2. model fitting, 3. pooling results (I'm using mice library). I can do these steps for a Poisson regression but not for a quasi-Poisson mixed-effect regression. Is it even possible to A) pool across models based on a quasi-distribution, B) get residuals from a pooled object (class "mipo")? I'm not sure. Also I'm not sure how to understand the pooled results for mixed models (I miss random effects in the pooled output; although I've found this page which I'm currently trying to go through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example of data is here (repre_d_v1 and repre_all_data are stored in there) and below is a crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load them
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson",
data = repre_d_v1)
# I know there are some warnings but it's because I share only a modified subset of data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
qctab = within(as.data.frame(ctab),
{`Std. Error` = `Std. Error`*sqrt(phi)
`z value` = Estimate/`Std. Error`
`Pr(>|z|)` = 2*pnorm(abs(`z value`), lower.tail = FALSE)
})
return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite similar structure as coef(summary(someGLM))
# but I don't see where are the random effects?
# and more importantly, I wanted a quasi-Poisson model, not just Poisson model...
# ...but here it is not possible to use quasi_table function (defined earlier)...
# ...and that's because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not whether this makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying it to the summary table, using a variant of the machinery that you posted above.
## compute dispersion values
phivec <- vapply(modelMultiple$analyses,
function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)
ss <- summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
## adjust
qctab <- within(as.data.frame(ss),
{ std.error <- std.error*sqrt(phi_mean)
statistic <- estimate/std.error
p.value <- 2*pnorm(abs(statistic), lower.tail = FALSE)
})
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...
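To see what the sqrt(phi) scaling does, it may help to check it on an ordinary glm, where R fits the quasi-Poisson family directly. The sketch below uses simulated data without random effects (so it is not the mixed model from the question) and compares the hand-scaled Poisson standard errors with those reported for family = quasipoisson.

```r
set.seed(3)
n <- 200
x <- rnorm(n)
# Overdispersed counts via a negative binomial draw (a simulation choice, not from the question)
y <- rnbinom(n, mu = exp(0.5 + 0.3 * x), size = 1.5)

fitP <- glm(y ~ x, family = poisson)
fitQ <- glm(y ~ x, family = quasipoisson)

# Pearson-based dispersion estimate, computed as in quasi_table() above
phi <- sum(residuals(fitP, type = "pearson")^2) / df.residual(fitP)

# Poisson standard errors scaled by sqrt(phi) versus the quasi-Poisson ones
seScaled <- coef(summary(fitP))[, "Std. Error"] * sqrt(phi)
seQuasi  <- coef(summary(fitQ))[, "Std. Error"]
all.equal(seScaled, seQuasi)
```

The two sets of standard errors should agree up to numerical tolerance, which is the sense in which the adjustment turns a Poisson fit into a quasi-Poisson one "by hand". Whether averaging phi across imputed datasets is statistically justified is a separate question that this check does not answer.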

Is there a way to both include PCSE and Prais-Winsten correction in a fixed effects model in R (similar to the xtpcse function in Stata)?

I want to estimate a fixed effects model while using panel-corrected standard errors as well as Prais-Winsten (AR1) transformation in order to solve panel heteroscedasticity, contemporaneous spatial correlation and autocorrelation.
I have time-series cross-section data and want to perform regression analysis. I was able to estimate a fixed effects model, panel corrected standard errors and Prais-winsten estimates individually. And I was able to include panel corrected standard errors in a fixed effects model. But I want them all at once.
# Basic ols model
ols1 <- lm(y ~ x1 + x2, data = data)
summary(ols1)
# Fixed effects model
library('plm')
plm1 <- plm(y ~ x1 + x2, data = data, model = 'within')
summary(plm1)
# Panel Corrected Standard Errors
library(pcse)
lm.pcse1 <- pcse(ols1, groupN = Country, groupT = Time)
summary(lm.pcse1)
# Prais-Winsten estimates
library(prais)
prais1 <- prais_winsten(y ~ x1 + x2, data = data)
summary(prais1)
# Combination of Fixed effects and Panel Corrected Standard Errors
ols.fe <- lm(y ~ x1 + x2 + factor(Country) - 1, data = data)
pcse.fe <- pcse(ols.fe, groupN = Country, groupT = Time)
summary(pcse.fe)
With the Stata command xtpcse it is possible to include both panel-corrected standard errors and Prais-Winsten corrected estimates, with something along the lines of the following code:
xtpcse y x x x i.cc, c(ar1)
I would like to achieve this in R as well.
I am not sure that my answer will completely address your concern; I have been trying to deal with the same problem these days.
In my case, I ran the Prais-Winsten function from the package prais where I included my model with the fixed effects. Afterwards, I correct for heteroskedasticity using the function vcovHC.prais which is analogous to vcovHC function from the package sandwich.
This basically will give you White's/sandwich heteroskedasticity-consistent covariance matrix which, if you later fit into the function coeftest from the package lmtest, it will give you the table output with the corrected standard errors. Taking your posted example, see below the code that I have used:
# Prais-Winsten estimates with Fixed Effects
library(prais)
prais.fe <- prais_winsten(y ~ x1 + x2 + factor(Country), data = data)
library(lmtest)
prais.fe.w <- coeftest(prais.fe, vcov = vcovHC.prais(prais.fe, "HC1"))
prais.fe.w # run the object to see the output with the corrected standard errors.
Alas, I am aware that the sandwich heteroskedasticity-consistent standard errors are not exactly the same as Beck and Katz's PCSEs, because PCSE deals with panel heteroskedasticity while sandwich SEs address overall heteroskedasticity. I am not totally sure how much these two differ in practice, but something is something.
I hope my answer was somehow helpful, this is actually my very first answer :D
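For readers wondering what the "HC1" sandwich covariance actually computes, here is a minimal base-R sketch for a plain lm fit on simulated data. It illustrates White's estimator with the n/(n - k) small-sample factor only; it does not involve the Prais-Winsten transformation and it is not Beck and Katz's PCSE.

```r
set.seed(11)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = exp(0.5 * x))  # error variance depends on x

fit <- lm(y ~ x)
X <- model.matrix(fit)
e <- residuals(fit)
k <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * e)      # X' diag(e^2) X, each row of X scaled by its residual
vcovHC1 <- n / (n - k) * bread %*% meat %*% bread

# Classical versus heteroskedasticity-consistent standard errors
cbind(classical = sqrt(diag(vcov(fit))), HC1 = sqrt(diag(vcovHC1)))
```

The "meat" here treats every observation as having its own error variance, whereas PCSE estimates a covariance structure across panels, which is why the two kinds of standard errors can differ.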

Probability predictions with model averaged Cumulative Link Mixed Models fitted with clmm in ordinal package

I found that the predict function is currently not implemented for cumulative link mixed models fitted using the clmm function in the ordinal R package. While predict is implemented for clmm2 in the same package, I chose to use clmm instead because the latter allows for more than one random effect. Further, I also fitted several clmm models and performed model averaging using the model.avg function in the MuMIn package. Ideally, I want to predict probabilities using the averaged model. However, while MuMIn supports clmm models, predict also does not work with the averaged model.
Is there a way to hack the predict function so that the function not only could predict probabilities from a clmm model, but also predict using model averaged coefficients from clmm (i.e. object of class "averaging")? For example:
require(ordinal)
require(MuMIn)
mm1 <- clmm(SURENESS ~ PROD + (1|RESP) + (1|RESP:PROD), data = soup,
link = "probit", threshold = "equidistant")
## test random effect:
mm2 <- clmm(SURENESS ~ PROD + (1|RESP) + (1|RESP:PROD), data = soup,
link = "logistic", threshold = "equidistant")
#create a model selection object
mm.sel<-model.sel(mm1,mm2)
##perform a model average
mm.avg<-model.avg(mm.sel)
#create new data and predict
new.data<-soup
##predict with individual model
predict(mm1, new.data)
I got the following error message:
In UseMethod("predict") :
no applicable method for predict applied to an object of class "clmm"
##predict with model average
predict(mm.avg, new.data)
Another error is returned:
Error in predict.averaging(mm.avg, new.data) :
predict for models 'mm1' and 'mm2' caused errors
I've been using clmm as well and yes I confirm predict.clmm is NOT (yet?) implemented. I didn't yet check the source code for fake.predict.clmm. It might work. If it doesn't, you're stuck with doing stuff by hand or using predict.clmm2.
I found a potential solution (pasted below) but have not been able to make work for my data.
Solution here: https://gist.github.com/mainambui/c803aaf857e54a5c9089ea05f91473bc
I think the problem is the number of coefficients I am using but am not experienced enough to figure it out. Hopefully this helps someone out though.
This is the model and newdata that I am using, though it is actually a model averaged version. Same predictors though.
ma10 <- clmm(Location3 ~ Sex * Grass3 + Sex * Forb3 + (1|Tag_ID), data =
IP_all_dunes)
ma_1 <- model.avg(ma10, ma8, ma5)##top 3 models
new_ma<- data.frame(Sex = c("m","f","m","f","m","f","m","f"),
Grass3 = c("1","1","1","1","0","0","0","0"),
Forb3 = c("0","0","1","1","0","0","1","1"))
# Arguments:
# - model = a clmm model
# - modelAvg = a clmm model average (object of class averaging)
# - newdata = a dataframe of new data to apply the model to
# Returns a dataframe of predicted probabilities for each row and response level
fake.predict.clmm <- function(modelAvg, newdata) {
# Actual prediction function
pred <- function(eta, theta, cat = 1:(length(theta) + 1), inv.link = plogis) {
Theta <- c(-1000, theta, 1000)
sapply(cat, function(j) inv.link(Theta[j + 1] - eta) - inv.link(Theta[j] -
eta))
}
# Multiply each row by the coefficients
#coefs <- c(model$beta, unlist(model$ST))##turn off if a model average is used
beta <- modelAvg$coefficients[2,3:12]
coefs <- c(beta, unlist(modelAvg$ST))
xbetas <- sweep(newdata, MARGIN=2, coefs, `*`)
# Make predictions
Theta<-modelAvg$coefficients[2,1:2]
#pred.mat <- data.frame(pred(eta=rowSums(xbetas), theta=model$Theta))
pred.mat <- data.frame(pred(eta=rowSums(xbetas), theta=Theta))
#colnames(pred.mat) <- levels(model$model[,1])
a<-attr(modelAvg, "modelList")
colnames(pred.mat) <- levels(a[[1]]$model[,1])
pred.mat
}
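The core of the function above is the cumulative link formula, which turns a linear predictor and a set of thresholds into category probabilities. Here is a stripped-down sketch of just that step, with made-up thresholds and linear predictors (theta, eta and predCumLink are hypothetical names; nothing is extracted from a fitted clmm object, and random effects are ignored).

```r
# Hypothetical thresholds and linear predictors, for illustration only
theta <- c(-1, 0.2, 1.5)   # 3 thresholds -> 4 response categories
eta   <- c(-0.5, 0, 0.8)   # fixed-effects linear predictor for three new rows

predCumLink <- function(eta, theta, inv.link = pnorm) {
  cuts <- c(-Inf, theta, Inf)
  # P(Y = j) = F(theta_j - eta) - F(theta_{j-1} - eta)
  probs <- sapply(seq_len(length(theta) + 1),
                  function(j) inv.link(cuts[j + 1] - eta) - inv.link(cuts[j] - eta))
  matrix(probs, nrow = length(eta))
}

p <- predCumLink(eta, theta)
rowSums(p)  # each row of category probabilities should sum to 1
```

The inverse link has to match the model: pnorm for a probit link such as mm1 above, plogis for a logit link. The default in this sketch is pnorm, while the function posted above defaults to plogis.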

Estimating variance attributed to a fixed effect

Disregarding how "important" it is, I am interested in trying to estimate how much of the variance is attributed to a single fixed effect (it being a main effect, or interaction term).
As a quick thought, I imagined that constructing a linear model for the predicted values of the mixed model (without the random effect) and assessing the ANOVA table would provide an estimate (yes, the residual variance will then be zero, but we know(?) this from the mixed model). However, from playing around, apparently not.
Where is the flaw in my reasoning? Or did I do something wrong along the way? Is there an alternative method?
Disclaimer: I know some people have suggested looking at the change in residual variance when removing/adding fixed effects, but as this does not take into account the correlation between fixed and random effects, I am not interested in that approach.
data(Orthodont,package="nlme")
Orthodont = na.omit(Orthodont)
#Fitting a linear mixed model
library(lme4)
mod = lmer(distance ~ age*Sex + (1|Subject) , data=Orthodont)
# Predicting across all observed values,
pred.frame = expand.grid(age = seq(min(Orthodont$age, na.rm = T),max(Orthodont$age, na.rm=T)),
Sex = unique(Orthodont$Sex))
# But not including random effects
pred.frame$fit = predict(mod, newdata = pred.frame, re.form=NA)
anova(lm(fit~age*Sex, data = pred.frame))
library(data.table)
Orthodont = data.table(Orthodont)
# to test the validity of the approach
# by estimating a linear model using a random observation
# per individual and look at the means
tmp = sapply(1:500, function(x){
print(x)
as.matrix(anova(lm(distance~age*Sex, data =Orthodont[,.SD[sample(1:.N,1)],"Subject"])))[,2]
}
)
# These are clearly not similar
prop.table(as.table(rowMeans(tmp)[-4]))
age Sex age:Sex
0.60895615 0.31874622 0.07229763
> prop.table(as.table(anova(lm(fit~age*Sex, data = pred.frame))[1:3,2]))
A B C
0.52597575 0.44342996 0.03059429

R equivalent to Stata's xtregar

I'm doing a replication of an estimation done with Stata's xtregar command, but I'm using R instead.
The xtregar command implements the method from Baltagi and Wu (1999) "Unequally spaced panel data regressions with AR(1) disturbances" paper. As Stata describes it:
xtregar fits cross-sectional time-series regression models when the disturbance term is first-order autoregressive. xtregar offers a within estimator for fixed-effects models and a GLS estimator for random-effects models. xtregar can accommodate unbalanced panels whose observations are unequally spaced over time.
So far, for the fixed-effects model, I used the plm package for R. The attempt looks like this:
plm(data=A, y ~ x1 + x2, effect = "twoways", model = "within")
Nevertheless, it is not complete (compared to the xtregar description) and the results are not quite like the ones Stata provides. Furthermore, Stata's command needs a panel variable and a time variable to be set, a feature that is (as far as I can tell) absent in the plm environment.
Should I settle with plm or is there another way of doing this?
PS: I searched different websites thoroughly but failed to find an equivalent to Stata's xtregar.
Update
After reading Croissant and Millo (2008) "Panel Data Econometrics in R: The plm Package", specifically section 7.4 "Some useful 'econometric' models in nlme", I used something like this for the Random Effects part of the estimation:
gls(data=A, y ~ x1 + x2, correlation = corAR1(0, form = ~ year | pays), na.action = na.exclude)
Nevertheless the following has results closer to those of Stata
lme(data=A, y ~ x1 + x2, random = ~ 1 | pays, correlation = corAR1(0, form = ~ year | pays), na.action = na.exclude)
Try {panelAR}. This is a package for regressions in panel data that addresses AR1 type of autocorrelations.
Unfortunately, I do not own Stata, so I can not test which correlation method to replicate in panelCorrMethod
library(panelAR)
model <-
panelAR(formula = y ~ x1 + x2,
data = A,
panelVar = 'pays',
timeVar = 'year',
autoCorr = 'ar1',
rho.na = TRUE,
bound.rho = TRUE,
panelCorrMethod ='phet' # You might need to change this parameter. 'phet' uses the HW sandwich estimator for heteroskedasticity cases, but others are available.
)
