R stargazer: getting p-values into one line

stargazer(model1, model2, title = "Models", header=FALSE,
dep.var.labels.include = FALSE,
column.labels = c("Count", "Percentage"),
style = "ajs",
report = "vcp*",
single.row = TRUE)
This is my code to create regression tables with stargazer. However, the p-value still shows up below the coefficient estimates. How do I get p-values to show up next to the coefficient estimates?

You may replace standard errors with p-values. Put models into a list, which allows you to use lapply.
model1 <- lm(mpg ~ hp, mtcars)
model2 <- lm(mpg ~ hp + cyl, mtcars)
model.lst <- list(model1, model2)
stargazer::stargazer(model.lst, title = "Models", header=FALSE,
dep.var.labels.include = FALSE,
column.labels = c("Count", "Percentage"),
style = "ajs",
report = "vcs*",
single.row = TRUE, type="text",
se=lapply(model.lst, function(x) summary(x)$coef[,4]))
# Models
# =================================================================
#                             Count                Percentage
#                               1                      2
# -----------------------------------------------------------------
# hp                    -.068 (0.000)***          -.019 (.213)
# cyl                                           -2.265 (0.000)***
# Constant             30.099 (0.000)***        36.908 (0.000)***
# Observations                 32                      32
# R2                          .602                    .741
# Adjusted R2                 .589                    .723
# Residual Std. Error   3.863 (df = 30)         3.173 (df = 29)
# F Statistic       45.460*** (df = 1; 30)  41.422*** (df = 2; 29)
# -----------------------------------------------------------------
# Notes:            *P < .05
#                   **P < .01
#                   ***P < .001
Note that this is also possible with texreg, which might look a little cleaner; the package is also well maintained.
texreg::screenreg(model.lst, single.row=TRUE,
reorder.coef=c(2:3, 1),
custom.model.names=c("Count", "Percentage"),
override.se=lapply(model.lst, function(x) summary(x)$coef[,4]),
override.pvalues=lapply(model.lst, function(x) summary(x)$coef[,4]),
digits=3
)
# ===================================================
#              Count               Percentage
# ---------------------------------------------------
# hp           -0.068 (0.000) ***  -0.019 (0.213)
# cyl                              -2.265 (0.000) ***
# (Intercept)  30.099 (0.000) ***  36.908 (0.000) ***
# ---------------------------------------------------
# R^2           0.602               0.741
# Adj. R^2      0.589               0.723
# Num. obs.    32                  32
# ===================================================
# *** p < 0.001; ** p < 0.01; * p < 0.05
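To see what the se= and override.se= arguments above actually receive, here is a minimal base-R sketch that builds the same list of p-value vectors and inspects it. The name p.lst is introduced here for illustration only, and selecting the column by its name "Pr(>|t|)" rather than by position 4 is just a small robustness choice, not something the answer requires.

```r
# Same two models as in the answer
model1 <- lm(mpg ~ hp, mtcars)
model2 <- lm(mpg ~ hp + cyl, mtcars)
model.lst <- list(model1, model2)

# One named numeric vector of p-values per model.
# Column 4 of the coefficient matrix of an lm summary is "Pr(>|t|)".
p.lst <- lapply(model.lst, function(x) coef(summary(x))[, "Pr(>|t|)"])

# Each element is named by coefficient, so the table functions can
# line the values up with the estimates.
str(p.lst)
```

Because each vector carries the coefficient names, the same list can be passed to both stargazer (se=) and texreg (override.se=, override.pvalues=).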

Related

Does the `by` argument in `avg_comparisons` compute the strata specific marginal effect?

I'm analyzing data from an AB test we just finished running. Our outcome is binary, y, and we have stratified results by a third variable, g.
Because the intervention could vary by g, I've fit a Poisson regression with robust covariance estimation as follows
library(tidyverse)
library(sandwich)
library(marginaleffects)
fit <- glm(y ~ treatment * g, data=model_data, family=poisson, offset=log(n_users))
From here, I'd like to know the strata specific causal risk ratio (which we usually call "lift" in industry). My approach is to use avg_comparisons as follows
avg_comparisons(fit,
variables = 'treatment',
newdata = model_data,
transform_pre = 'lnratioavg',
transform_post = exp,
by=c('g'),
vcov = 'HC')
The result seems to be consistent with calculations of the lift when I filter the data by groups in g.
Question
By passing by=c('g'), am I actually calculating the strata specific risk ratios as I suspect? Are there any hidden "gotchas" or things I have failed to consider?
I can provide data and a minimal working example if need be.
Here’s a very simple base R example to show what is happening under-the-hood:
library(marginaleffects)
fit <- glm(carb ~ hp * am, data = mtcars, family = poisson)
Unit level estimates of log ratio associated with a change of 1 in hp:
cmp <- comparisons(fit, variables = "hp", transform_pre = "lnratio")
cmp
#
# Term Contrast Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# hp +1 0.0054 0.0027 2.0 0.047 0.00007 0.0107
# hp +1 0.0054 0.0027 2.0 0.047 0.00007 0.0107
# --- 22 rows omitted. See ?avg_comparisons and ?print.marginaleffects ---
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# Prediction type: response
# Columns: rowid, type, term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, carb, hp, am
This is equivalent to:
# prediction grids with 1 unit difference
lo <- transform(mtcars, hp = hp - .5)
hi <- transform(mtcars, hp = hp + .5)
# predictions on response scale
y_lo <- predict(fit, newdata = lo, type = "response")
y_hi <- predict(fit, newdata = hi, type = "response")
# log ratio
lnratio <- log(y_hi / y_lo)
# equivalent to `comparisons()`
all(cmp$estimate == lnratio)
# [1] TRUE
Now we take the strata specific means, with mean() inside log():
by(data.frame(am = lo$am, y_lo, y_hi),
mtcars$am,
FUN = \(x) log(mean(x$y_hi) / mean(x$y_lo)))
# mtcars$am: 0
# [1] 0.005364414
# ------------------------------------------------------------
# mtcars$am: 1
# [1] 0.005566092
Same as:
avg_comparisons(fit, variables = "hp", by = "am", transform_pre = "lnratio") |>
print(digits = 7)
#
# Term Contrast am Estimate Std. Error z Pr(>|z|) 2.5 %
# hp mean(+1) 0 0.005364414 0.002701531 1.985694 0.04706726 6.951172e-05
# hp mean(+1) 1 0.005566092 0.001553855 3.582118 < 0.001 2.520592e-03
# 97.5 %
# 0.010659317
# 0.008611592
#
# Prediction type: response
# Columns: type, term, contrast, am, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo
See the list of transformation functions here: https://vincentarelbundock.github.io/marginaleffects/reference/comparisons.html#transformations
The only difference is that by applies the function within each stratum.
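As a cross-check on the example above, here is a base-R sketch comparing the two possible orders of averaging within strata: the log of the ratio of mean predictions, and the mean of the unit-level log ratios. The variable names are illustrative. In this particular model (log link, hp interacted only with am) every unit in a stratum has the same log ratio, so both versions coincide with a sum of coefficients; this is a property of the example, and with other link functions or additional covariates the two orders would generally differ.

```r
fit <- glm(carb ~ hp * am, data = mtcars, family = poisson)

# Prediction grids one unit of hp apart, as in the answer
lo <- transform(mtcars, hp = hp - 0.5)
hi <- transform(mtcars, hp = hp + 0.5)
y_lo <- predict(fit, newdata = lo, type = "response")
y_hi <- predict(fit, newdata = hi, type = "response")

# (a) log of the ratio of stratum means (mean() inside log())
ln_ratio_of_means <- tapply(seq_along(y_lo), mtcars$am,
  function(i) log(mean(y_hi[i]) / mean(y_lo[i])))

# (b) mean of the unit-level log ratios (mean() outside log())
mean_of_ln_ratios <- tapply(log(y_hi / y_lo), mtcars$am, mean)

# (c) what the log link implies directly from the coefficients
b <- coef(fit)
from_coefs <- c(b[["hp"]], b[["hp"]] + b[["hp:am"]])

all.equal(as.numeric(ln_ratio_of_means), as.numeric(mean_of_ln_ratios))
all.equal(as.numeric(ln_ratio_of_means), from_coefs)
```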

Clustered standard errors, stars, and summary statistics in modelsummary for multinom models

I want to create a regression table with modelsummary (amazing package!!!) for multinomial logistic models run with nnet::multinom that includes clustered standard errors, as well as corresponding "significance" stars and summary statistics.
Unfortunately, I cannot do this automatically with the vcov parameter within modelsummary because the sandwich package that modelsummary uses does not support nnet objects.
I was able to calculate robust standard errors with a customized function originally developed by Daina Chiba and modified by Davenport, Soule, Armstrong (available from: https://journals.sagepub.com/doi/suppl/10.1177/0003122410395370/suppl_file/Davenport_online_supplement.pdf).
I was also able to include these standard errors in the modelsummary table instead of the original ones. Yet, neither the "significance" stars nor the model summary statistics adapt to these new standard errors. I think this is because they are calculated via broom::tidy automatically by modelsummary.
I would be thankful for any advice for how to include stars and summary statistics that correspond to the clustered standard errors and respective p-values.
Another smaller question I have is whether there is any easy way of "spreading" the model statistics (e.g. number of observations or R2) so that they are centered below all response levels of the dependent variable and not just the first level. I am thinking about a multicolumn solution in LaTeX.
Here is some example code that includes how I calculate the standard errors. (Note that the calculated clustered SEs are extremely small because they don't make sense with the example mtcars data. The only takeaway is that the respective stars should correspond to the new SEs, and they don't.)
library(modelsummary)
# load data
dat_multinom <- mtcars
dat_multinom$cyl <- sprintf("Cyl: %s", dat_multinom$cyl)
# run multinomial logit model
mod <- nnet::multinom(cyl ~ mpg + wt + hp, data = dat_multinom, trace = FALSE)
# function to calculate clustered standard errors
mlogit.clust <- function(model,data,variable) {
beta <- c(t(coef(model)))
vcov <- vcov(model)
k <- length(beta)
n <- nrow(data)
max_lev <- length(model$lev)
xmat <- model.matrix(model)
# u is response residuals times model.matrix
u <- lapply(2:max_lev, function(x)
residuals(model, type = "response")[, x] * xmat)
u <- do.call(cbind, u)
m <- dim(table(data[,variable]))
u.clust <- matrix(NA, nrow = m, ncol = k)
fc <- factor(data[,variable])
for (i in 1:k) {
u.clust[, i] <- tapply(u[, i], fc, sum)
}
cl.vcov <- vcov %*% ((m / (m - 1)) * t(u.clust) %*% (u.clust)) %*% vcov
return(cl.vcov = cl.vcov)
}
# get coefficients, variance, clustered standard errors, and p values
b <- c(t(coef(mod)))
var <- mlogit.clust(mod,dat_multinom,"am")
se <- sqrt(diag(var))
p <- (1-pnorm(abs(b/se))) * 2
# modelsummary table with clustered standard errors and respective p-values
modelsummary(
mod,
statistic = "({round(se,3)}),[{round(p,3)}]",
shape = statistic ~ response,
stars = c('*' = .1, '**' = .05, '***' = .01)
)
# modelsummary table with original standard errors and respective p-values
modelsummary(
models = list(mod),
statistic = "({std.error}),[{p.value}]",
shape = statistic ~ response,
stars = c('*' = .1, '**' = .05, '***' = .01)
)
This code produces the following tables:
              Model 1 / Cyl: 6    Model 1 / Cyl: 8
(Intercept)   22.759*             -6.096***
              (0.286),[0]         (0.007),[0]
mpg           -38.699             -46.849
              (5.169),[0]         (6.101),[0]
wt            23.196              39.327
              (3.18),[0]          (4.434),[0]
hp            6.722               7.493
              (0.967),[0]         (1.039),[0]
Num.Obs.      32
R2            1.000
R2 Adj.       0.971
AIC           16.0
BIC           27.7
RMSE          0.00
Note: * p < 0.1, ** p < 0.05, *** p < 0.01
              Model 1 / Cyl: 6    Model 1 / Cyl: 8
(Intercept)   22.759*             -6.096***
              (11.652),[0.063]    (0.371),[0.000]
mpg           -38.699             -46.849
              (279.421),[0.891]   (448.578),[0.918]
wt            23.196              39.327
              (210.902),[0.913]   (521.865),[0.941]
hp            6.722               7.493
              (55.739),[0.905]    (72.367),[0.918]
Num.Obs.      32
R2            1.000
R2 Adj.       0.971
AIC           16.0
BIC           27.7
RMSE          0.00
Note: * p < 0.1, ** p < 0.05, *** p < 0.01
This is not super easy at the moment. I just opened a GitHub issue to track progress. It should be easy to improve, however, so I expect changes to be published in the next release of the package.
In the meantime, you can install the dev version of modelsummary:
library(remotes)
install_github("vincentarelbundock/modelsummary")
Then, you can use the tidy_custom mechanism (described in the modelsummary documentation) to override standard errors and p values manually:
library(modelsummary)
tidy_custom.multinom <- function(x, ...) {
b <- coef(x)
var <- mlogit.clust(x, dat_multinom, "am")
out <- data.frame(
term = rep(colnames(b), times = nrow(b)),
response = rep(row.names(b), each = ncol(b)),
estimate = c(t(b)),
std.error = sqrt(diag(var))
)
out$p.value <- (1-pnorm(abs(out$estimate / out$std.error))) * 2
row.names(out) <- NULL
return(out)
}
modelsummary(
mod,
output = "markdown",
shape = term ~ model + response,
stars = TRUE)
              Model 1 / Cyl: 6    Model 1 / Cyl: 8
(Intercept)   22.759***           -6.096***
              (0.286)             (0.007)
mpg           -38.699***          -46.849***
              (5.169)             (6.101)
wt            23.196***           39.327***
              (3.180)             (4.434)
hp            6.722***            7.493***
              (0.967)             (1.039)
Num.Obs.      32
R2            1.000
R2 Adj.       0.971
AIC           16.0
BIC           27.7
RMSE          0.00
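A side note on the p-value line inside tidy_custom.multinom: the expression (1-pnorm(abs(z))) * 2 is the usual two-sided normal p-value, but it rounds to exactly zero once the z statistic is large. The sketch below compares it with the algebraically equivalent lower-tail form. The z values are hypothetical, chosen only for illustration, and the names p_upper and p_lower are not part of modelsummary.

```r
# Hypothetical z statistics (estimate / std.error), for illustration only
z <- c(0.5, 1.96, 10)

# Form used in the answer: upper tail computed as 1 - pnorm()
p_upper <- (1 - pnorm(abs(z))) * 2

# Equivalent form using the lower tail directly
p_lower <- 2 * pnorm(-abs(z))

# For moderate z the two agree
all.equal(p_upper[1:2], p_lower[1:2])

# For large z, 1 - pnorm() loses all precision and returns exactly 0,
# while the lower-tail form still returns a tiny positive number
p_upper[3] == 0
p_lower[3] > 0
```

For significance stars the difference is irrelevant, since both forms fall below any conventional threshold; it only matters if the p-values are printed or transformed further.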

R- compare lmer with mice imputation to original data

I want to fit a mixed model with data containing missing values.
The imputation is performed with mice.
How can I compare the original data model fit to the mice one?
Example code..
## dummy data
set.seed(123)
DF <- data.frame(countryname = rep(LETTERS[1:10], each = 10),
x1 = sample(10, 100, replace = TRUE),
x2 = sample(5, 100, replace = TRUE),
y = sample(10, 100, replace = TRUE))
# introduce NAs
DF[sample(100,10),c("x1")] <- NA
DF[sample(100,10),c("x2")] <- NA
DF[sample(100,10),c("y")] <- NA
#
library(mice)
imp = mice(data = DF, m = 10, printFlag = FALSE)
fit = with(imp, expr=lme4::lmer(y~ x1+x2+ (1 | countryname)))
library(broom.mixed)
pool(fit)
summary(fit)
## fit to original data
fitor= lme4::lmer(y~ x1+x2+ (1 | countryname),data=DF)
## how to compare model estimates for fit and fitor?
## example output
##
## =======================================
## base w/SES
## ---------------------------------------
## (Intercept) 0.105 -0.954 ***
## (0.058) (0.085)
## x1 -0.497 *** -0.356 ***
## (0.058) (0.054)
## x2 -0.079 -0.102 *
## (0.043) (0.040)
## ---------------------------------------
## R2 0.039 0.157
## Nobs 4073 4073
## =======================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
###

Omit multiple factors in texreg

When using texreg I frequently use omit.coef to remove certain estimates (for fixed effects) as below.
screenreg(lm01,omit.coef='STORE_ID',custom.model.names = c("AA"))
In my lm model if I use multiple fixed effects how can I omit multiple variables? For example, I have two types of fixed effects - STORE_ID and Year, let's say.
This does not work.
screenreg(lm01,omit.coef=c('STORE_ID','Year'),custom.model.names = c("AA"))
You'd have to consider regex instead, separated by an |. Example:
fit <- lm(mpg ~ cyl + disp + hp + drat, mtcars)
texreg::screenreg(fit)
# =====================
# Model 1
# ---------------------
# (Intercept) 23.99 **
# (7.99)
# cyl -0.81
# (0.84)
# disp -0.01
# (0.01)
# hp -0.02
# (0.02)
# drat 2.15
# (1.60)
# ---------------------
# R^2 0.78
# Adj. R^2 0.75
# Num. obs. 32
# =====================
# *** p < 0.001; ** p < 0.01; * p < 0.05
Now omitting:
texreg::screenreg(fit, omit.coef=c('disp|hp|drat'))
# =====================
# Model 1
# ---------------------
# (Intercept) 23.99 **
# (7.99)
# cyl -0.81
# (0.84)
# ---------------------
# R^2 0.78
# Adj. R^2 0.75
# Num. obs. 32
# =====================
# *** p < 0.001; ** p < 0.01; * p < 0.05
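To preview which coefficients a given pattern will drop before building the table, here is a base-R sketch. It assumes that omit.coef is matched against the coefficient names the way grepl() matches a regular expression; the helper names coef.names, keep and pattern are illustrative only.

```r
fit <- lm(mpg ~ cyl + disp + hp + drat, mtcars)
coef.names <- names(coef(fit))

# Coefficients that would remain after omitting 'disp|hp|drat'
keep <- !grepl("disp|hp|drat", coef.names)
coef.names[keep]
# [1] "(Intercept)" "cyl"

# Building the pattern from a character vector, e.g. for the
# STORE_ID and Year fixed effects from the question
pattern <- paste(c("STORE_ID", "Year"), collapse = "|")
pattern
# [1] "STORE_ID|Year"
```

Since the pattern is a substring match, a short name such as hp would also match any other coefficient whose name contains "hp"; anchoring it, for example "^hp$", avoids that.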
screenreg allows you to include the custom.coef.map option and pass a list through it. This option allows you to directly select the variables you want to KEEP (instead of omit with omit.coef), and allows you to change the variables' names simultaneously:
screenreg(lm01, custom.coef.map = list("var1" = "First variable",
"var2" = "Second Variable", "var3" = "Third variable"))

Confidence intervals with clustered standard errors and texreg?

I'm trying to reproduce the 95% CI that Stata produces when you run a model with clustered standard errors. For example:
regress api00 acs_k3 acs_46 full enroll, cluster(dnum)
Regression with robust standard errors Number of obs = 395
F( 4, 36) = 31.18
Prob > F = 0.0000
R-squared = 0.3849
Number of clusters (dnum) = 37 Root MSE = 112.20
------------------------------------------------------------------------------
| Robust
api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
acs_k3 | 6.954381 6.901117 1.008 0.320 -7.041734 20.9505
acs_46 | 5.966015 2.531075 2.357 0.024 .8327565 11.09927
full | 4.668221 .7034641 6.636 0.000 3.24153 6.094913
enroll | -.1059909 .0429478 -2.468 0.018 -.1930931 -.0188888
_cons | -5.200407 121.7856 -0.043 0.966 -252.193 241.7922
------------------------------------------------------------------------------
I am able to reproduce the coefficients and the standard errors:
library(readstata13)
library(texreg)
library(sandwich)
library(lmtest)
clustered.se <- function(model_result, data, cluster) {
model_variables <-
intersect(colnames(data), c(colnames(model_result$model), cluster))
model_rows <- rownames(model_result$model)
data <- data[model_rows, model_variables]
cl <- data[[cluster]]
M <- length(unique(cl))
N <- nrow(data)
K <- model_result$rank
dfc <- (M / (M - 1)) * ((N - 1) / (N - K))
uj <-
apply(estfun(model_result), 2, function(x)
tapply(x, cl, sum))
vcovCL <- dfc * sandwich(model_result, meat = crossprod(uj) / N)
standard.errors <- coeftest(model_result, vcov. = vcovCL)[, 2]
p.values <- coeftest(model_result, vcov. = vcovCL)[, 4]
clustered.se <-
list(vcovCL = vcovCL,
standard.errors = standard.errors,
p.values = p.values)
return(clustered.se)
}
elemapi2 <- read.dta13(file = 'elemapi2.dta')
lm1 <-
lm(formula = api00 ~ acs_k3 + acs_46 + full + enroll,
data = elemapi2)
clustered_se <-
clustered.se(model_result = lm1,
data = elemapi2,
cluster = "dnum")
htmlreg(
lm1,
override.se = clustered_se$standard.errors,
override.pvalues = clustered_se$p.values,
star.symbol = "\\*",
digits = 7
)
=============================
Model 1
-----------------------------
(Intercept) -5.2004067
(121.7855938)
acs_k3 6.9543811
(6.9011174)
acs_46 5.9660147 *
(2.5310751)
full 4.6682211 ***
(0.7034641)
enroll -0.1059909 *
(0.0429478)
-----------------------------
R^2 0.3848830
Adj. R^2 0.3785741
Num. obs. 395
RMSE 112.1983218
=============================
*** p < 0.001, ** p < 0.01, * p < 0.05
Alas, I cannot reproduce the 95% confidence interval:
screenreg(
lm1,
override.se = clustered_se$standard.errors,
override.pvalues = clustered_se$p.values,
digits = 7,
ci.force = TRUE
)
========================================
Model 1
----------------------------------------
(Intercept) -5.2004067
[-243.8957845; 233.4949710]
acs_k3 6.9543811
[ -6.5715605; 20.4803228]
acs_46 5.9660147 *
[ 1.0051987; 10.9268307]
full 4.6682211 *
[ 3.2894567; 6.0469855]
enroll -0.1059909 *
[ -0.1901670; -0.0218148]
----------------------------------------
R^2 0.3848830
Adj. R^2 0.3785741
Num. obs. 395
RMSE 112.1983218
========================================
* 0 outside the confidence interval
If I do it 'by hand', I get the same thing as with texreg:
level <- 0.95
a <- 1-(1 - level)/2
coeff <- lm1$coefficients
se <- clustered_se$standard.errors
lb <- coeff - qnorm(a)*se
ub <- coeff + qnorm(a)*se
> lb
(Intercept)      acs_k3      acs_46        full      enroll
-243.895784   -6.571560    1.005199    3.289457   -0.190167
> ub
 (Intercept)       acs_k3       acs_46         full       enroll
233.49497100  20.48032276  10.92683074   6.04698550  -0.02181481
What is Stata doing and how can I reproduce it in R?
PS: This is a follow up question.
PS2: The Stata data is available here.
It looks like Stata is using confidence intervals based on t(36) rather than Z (i.e. Normal errors).
Taking the values from the Stata output
coef=6.954381; rse= 6.901117 ; lwr= -7.041734; upr= 20.9505
(upr-coef)/rse
## [1] 2.028095
(lwr-coef)/rse
## [1] -2.028094
Computing/cross-checking the tail values for t(36):
pt(2.028094,36)
## [1] 0.975
qt(0.975,36)
## [1] 2.028094
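I don't know how you pass confidence intervals to texreg. Since you haven't given a reproducible example (I don't have elemapi2.dta), I can't say exactly how you would get the df, but it looks like you would want the number of clusters minus one:
tdf <- length(unique(elemapi2$dnum)) - 1
level <- 0.95
a <- 1 - (1 - level)/2
# clustered_se is a list, so take the standard errors out of it explicitly
se <- clustered_se$standard.errors
bounds <- cbind(lwr = coef(lm1) - qt(a, tdf) * se,
upr = coef(lm1) + qt(a, tdf) * se)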
Indeed, Stata is using the t distribution rather than the normal distribution. There is now a really easy way to get confidence intervals that match Stata into texreg: use lm_robust from the estimatr package, which you can install from CRAN with install.packages("estimatr").
> library(estimatr)
> lmro <- lm_robust(mpg ~ hp, data = mtcars, clusters = cyl, se_type = "stata")
> screenreg(lmro)
===========================
Model 1
---------------------------
(Intercept) 30.10 *
[13.48; 46.72]
hp -0.07
[-0.15; 0.01]
---------------------------
R^2 0.60
Adj. R^2 0.59
Num. obs. 32
RMSE 3.86
===========================
* 0 outside the confidence interval
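The t-versus-normal difference discussed in both answers can be checked with base R alone, using only numbers printed in the question. The sketch below takes the acs_k3 coefficient and robust standard error from the Stata output and assumes the degrees of freedom are the number of clusters minus one (37 - 1 = 36), as the first answer suggests. The variable names are illustrative.

```r
# Values copied from the Stata output for acs_k3
est <- 6.954381
rse <- 6.901117

# Assumption: df = number of clusters - 1
tdf <- 37 - 1

# Interval based on t(36), which is what Stata appears to report
ci_t <- est + c(-1, 1) * qt(0.975, tdf) * rse

# Interval based on the normal distribution, which is what
# ci.force = TRUE and the 'by hand' calculation produced
ci_z <- est + c(-1, 1) * qnorm(0.975) * rse

round(ci_t, 4)  # compare with Stata: -7.041734, 20.9505
round(ci_z, 4)  # compare with texreg: -6.5715605, 20.4803228
```

With only 36 degrees of freedom the t quantile (about 2.03) is noticeably larger than the normal one (about 1.96), which accounts for the wider Stata interval.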
