I have a series of linear models and I'd like to report the standardized coefficients for each. However, when I print the models in stargazer, it looks like stargazer automatically prints the significance stars for the standardized coefficients as if they were unstandardized coefficients. You can see how the differences emerge below.
Is it statistically wrong to print the significance stars based on the unstandardized values? How is this done in stargazer? Thanks!
#load libraries
library(stargazer)
library(lm.beta)
#fake data
var1<-rnorm(100, mean=10, sd=5)
var2<-rnorm(100, mean=5, sd=2)
var3<-rnorm(100, mean=2, sd=3)
var4<-rnorm(100, mean=5, sd=1)
df<-data.frame(var1, var2, var3, var4)
#model with unstandardized betas
model1<-lm(var1~var2+var3+var4, data=df)
#Standardized betas
model1.beta<-lm.beta(model1)
#print
stargazer(model1, model1.beta, type='text')
stargazer does not automatically know that it should look for the standardized coefficients in the second model. lm.beta just adds standardized coefficients to the lm object. The result is still an lm object, so stargazer extracts the coefficients as usual (from model1.beta$coefficients). Use the coef = argument to specify the coefficients you want to use: coef = list(model1$coefficients, model1.beta$standardized.coefficients)
> stargazer(model1, model1.beta,
coef = list(model1$coefficients,
model1.beta$standardized.coefficients),
type='text')
==========================================================
Dependent variable:
----------------------------
var1
(1) (2)
----------------------------------------------------------
var2 0.135 0.048
(0.296) (0.296)
var3 -0.088 -0.044
(0.205) (0.205)
var4 -0.190 -0.030
(0.667) (0.667)
Constant 10.195** 0.000
(4.082) (4.082)
----------------------------------------------------------
Observations 100 100
R2 0.006 0.006
Adjusted R2 -0.025 -0.025
Residual Std. Error (df = 96) 5.748 5.748
F Statistic (df = 3; 96) 0.205 0.205
==========================================================
Note: *p<0.1; **p<0.05; ***p<0.01
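To make clear what the second column contains, here is a minimal base-R sketch. It assumes the usual definition of a standardized slope, b * sd(x) / sd(y), which as far as I know is what lm.beta reports for numeric predictors. The seed and the names b_std and fit_z are illustrative additions, not part of the original post.

```r
set.seed(1)  # illustrative seed, not in the original post
df <- data.frame(var1 = rnorm(100, mean = 10, sd = 5),
                 var2 = rnorm(100, mean = 5,  sd = 2),
                 var3 = rnorm(100, mean = 2,  sd = 3),
                 var4 = rnorm(100, mean = 5,  sd = 1))
model1 <- lm(var1 ~ var2 + var3 + var4, data = df)

# Standardized slope = raw slope * sd(predictor) / sd(response)
b     <- coef(model1)[-1]                      # drop the intercept
b_std <- b * sapply(df[names(b)], sd) / sd(df$var1)

# Cross-check: refit the same model on z-scored variables
fit_z <- lm(var1 ~ var2 + var3 + var4, data = as.data.frame(scale(df)))
print(cbind(by_hand = b_std, scaled_fit = coef(fit_z)[-1]))
```

The two columns should agree, which is why only the coefficients (and not the standard errors or stars) change when coef = is supplied on its own.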
A big thank you to paqmo for the answer. I would just add that to get the correct p-values for the standardized solution, you need to add another line detailing which p-values to use:
stargazer(model1, model1.beta,
coef = list(model1$coefficients,
model1.beta$standardized.coefficients),
p = list (coef(summary(model1))[,4], coef(summary(model1.beta))[,5]),
type='text')
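On the question of whether the stars should differ at all: a small base-R check, under the assumption that "standardized" means refitting on z-scored variables, suggests that the t statistics and p-values of the slopes are the same in both versions. The seed and the object names raw, std and cmp are hypothetical and only used for this illustration.

```r
set.seed(1)  # illustrative seed, not in the original post
df <- data.frame(var1 = rnorm(100, mean = 10, sd = 5),
                 var2 = rnorm(100, mean = 5,  sd = 2),
                 var3 = rnorm(100, mean = 2,  sd = 3),
                 var4 = rnorm(100, mean = 5,  sd = 1))

# Coefficient tables for the raw fit and the fit on z-scored data
raw <- summary(lm(var1 ~ var2 + var3 + var4, data = df))$coefficients
std <- summary(lm(var1 ~ var2 + var3 + var4,
                  data = as.data.frame(scale(df))))$coefficients

slopes <- c("var2", "var3", "var4")
cmp <- cbind(t_raw = raw[slopes, "t value"],
             t_std = std[slopes, "t value"],
             p_raw = raw[slopes, "Pr(>|t|)"],
             p_std = std[slopes, "Pr(>|t|)"])
print(cmp)
```

If the raw and standardized columns match, passing the p-values of the original model via p = gives the stars that belong to the standardized coefficients as well.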
Also, more generally, stargazer sometimes does not work with longer model names and fails with the error: Error in if (is.na(s)) { : the condition has length > 1
Thus, I would recommend to keep your models' names short (especially if you want Stargazer to display a few of them).
I made the following simple regression model and used stargazer to output a table that plots the standardized vs non-standardized regression model coefficients, standard errors and p-values.
library(lm.beta)
mod <- lm(mpg ~ cyl + disp, mtcars)
summary(mod)
mod_std <- lm.beta(mod)
summary(mod_std)$coe[, 2]
library(stargazer)
stargazer(mod, mod_std,
coef = list(mod$coefficients,
mod_std$standardized.coefficients),
type='text')
And this is the output:
==========================================================
Dependent variable:
----------------------------
mpg
(1) (2)
----------------------------------------------------------
cyl -1.587** -0.470
(0.712) (0.712)
disp -0.021** -0.423***
(0.010) (0.010)
Constant 34.661*** 0.000
(2.547) (2.547)
----------------------------------------------------------
Observations 32 32
R2 0.760 0.760
Adjusted R2 0.743 0.743
Residual Std. Error (df = 29) 3.055 3.055
F Statistic (df = 2; 29) 45.808*** 45.808***
==========================================================
Note: *p<0.1; **p<0.05; ***p<0.01
As can be observed here, the standard errors that are reported by the stargazer (for the coefficients) are the same for the standardized model and the non-standardized one. This is not correct as standard errors should change with the standardization of coefficients. Is there a way to report the correct standard errors? Or if not, simply remove them?
Lastly, what also changes from the standardized to the non-standardized models are the significance levels (of the coefficients). These should not change as they are not affected by standardization. Is there a way to prevent stargazer from modifying them? p or p.auto arguments maybe would work but I have no idea how to use them.
Reference for lm.beta: Stefan Behrendt (2014). lm.beta: Add Standardized Regression Coefficients to lm-Objects. R package version 1.5-1. https://CRAN.R-project.org/package=lm.beta
You would need to enter the additional values by hand, as a list with one element per model, as you started to do with the coefficients: the standardized standard errors via se=, the p-values via p= (for the stars), and so on, as well as the GOFs (R2, adjusted R2, ...). See the options on the help page: ?stargazer.
However, lm.beta appears to add nothing but the standardized coefficients; the matching standard errors are not calculated, so they cannot be reported directly.
Standardized standard errors are calculated using the formula SE*beta_star/beta.
So you could wrap a function, and calculate them, in order to fill them in the stargazer table:
std_se <- \(x) x[, 'Std. Error']*x[, 'Standardized']/x[, 'Estimate']
std_se(summary(mod_std)$coefficients)
# (Intercept) cyl disp
# 0.0000000 0.2109356 0.2109356
However, it might well be easier to fit an actual standardized model:
mod_std2 <- lm(mpg ~ cyl + disp, as.data.frame(scale(mtcars)))
summary(mod_std2) |> getElement('coefficients')
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 34.66099474 2.54700388 13.608536 4.022869e-14
# cyl -1.58727681 0.71184427 -2.229809 3.366495e-02
# disp -0.02058363 0.01025748 -2.006696 5.418572e-02
and put that in:
stargazer(mod, mod_std2, type='text')
# ==========================================================
# Dependent variable:
# ----------------------------
# mpg
# (1) (2)
# ----------------------------------------------------------
# cyl -1.587** -0.470**
# (0.712) (0.211)
# disp -0.021* -0.423*
# (0.010) (0.211)
# Constant 34.661*** -0.000
# (2.547) (0.090)
# ----------------------------------------------------------
# Observations 32 32
# R2 0.760 0.760
# Adjusted R2 0.743 0.743
# Residual Std. Error (df = 29) 3.055 0.507
# F Statistic (df = 2; 29) 45.808*** 45.808***
# ==========================================================
# Note: *p<0.1; **p<0.05; ***p<0.01
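The agreement between the formula SE*beta_star/beta and the refit on scaled data can be checked without lm.beta. This is a sketch in base R; it assumes beta_star = beta * sd(x) / sd(y), and the names ratio, se_star and mod_z are made up for the illustration.

```r
mod <- lm(mpg ~ cyl + disp, data = mtcars)
est <- summary(mod)$coefficients
vars <- c("cyl", "disp")

# beta_star / beta is the ratio of standard deviations sd(x) / sd(y)
ratio     <- sapply(mtcars[vars], sd) / sd(mtcars$mpg)
beta_star <- est[vars, "Estimate"] * ratio

# Standardized standard error via SE * beta_star / beta
se_star <- est[vars, "Std. Error"] * beta_star / est[vars, "Estimate"]

# Same quantity from a model fitted on z-scored variables
mod_z <- lm(mpg ~ cyl + disp, data = as.data.frame(scale(mtcars)))
print(cbind(formula    = se_star,
            scaled_fit = summary(mod_z)$coefficients[vars, "Std. Error"]))
```

Both routes rescale the raw standard error by the same factor, so either can be used to fill the se = argument for the slopes.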
I managed to make the following script:
stargazer(mod_std,
coef=list(mod_std$standardized.coefficients),
se=list(summary(mod_std)$coe[, 2]),
p=list(summary(mod)$coe[, 4]),
type='text',
omit.stat = c("all"),
keep = c("cyl","disp"),
report = c('vcp'), notes.append = FALSE,
notes = "Coefficients are standardized")
With the following output:
===================================
Dependent variable:
-----------------------------
mpg
-----------------------------------
cyl -0.470
p = 0.034
disp -0.423
p = 0.055
===================================
===================================
Note: Coefficients are standardized
Here, the standardized coefficients are reported, together with the p-values from the original model (which should be unchanged across standardization).
I'm doing some quantile regressions, and want to format the resulting tables with stargazer. However, stargazer doesn't have an option to calculate standard errors with bootstraps for clustered data, which is what I need. So, I will need to input the standard errors manually, using stargazer's se = argument, but I'm not sure how it works, exactly.
model <- lm(mpg ~ wt, data = mtcars)
stargazer(model, type = 'text', se = list(1, 1))
===============================================
Dependent variable:
---------------------------
mpg
-----------------------------------------------
wt -5.344
Constant 37.285***
(1.000)
-----------------------------------------------
Observations 32
R2 0.753
Adjusted R2 0.745
Residual Std. Error 3.046 (df = 30)
F Statistic 91.375*** (df = 1; 30)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
In the code and output above, there are two coefficients, one for the wt variables and one for the intercept. In the se = argument, I entered two arbitrary standard errors, one for each coefficient. However, the output shows only the standard error for the intercept, but not for the other variable.
Any idea what is happening?
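One likely explanation, going by the stargazer help page, is that se = expects a list with one numeric vector per model (matched to the coefficients by element names), whereas list(1, 1) is two length-one vectors, i.e. values for two models. A hedged sketch of the structure follows; my_se and se_arg are hypothetical names, and the stargazer call is left as a comment because I have not verified the resulting table here.

```r
model <- lm(mpg ~ wt, data = mtcars)

# One numeric vector for this model, named after its coefficients
my_se  <- setNames(c(1, 1), names(coef(model)))  # arbitrary values, as in the question
se_arg <- list(my_se)                            # one list element per model

# Compare the two structures
str(list(1, 1))  # two elements, each of length 1
str(se_arg)      # one element of length 2, with names

# Intended usage (requires the stargazer package):
# stargazer(model, type = "text", se = se_arg)
```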
I like Stargazer quite a bit but am running into some issues trying to report standard errors and confidence intervals all in a single table.
Consider this simple regression as a reproducible example:
set.seed(04152020)
x <- rnorm(100)
y <- 2*x + rnorm(100)
m1 <- lm(y~x)
I can report standard errors and confidence intervals in two separate tables no problem using the ci option.
library(stargazer)
# standard errors
stargazer(m1, type = "text")
# confidence intervals
stargazer(m1, ci = TRUE, type = "text")
A workaround to get them into a single table is to "report" the model twice, but then the coefficients are repeated unnecessarily. For example, the following code:
stargazer(list(m1, m1),
ci = c(FALSE, TRUE),
type = "text")
Produces:
==========================================================
Dependent variable:
----------------------------
y
(1) (2)
----------------------------------------------------------
x 1.981*** 1.981***
(0.110) (1.766, 2.196)
Constant -0.218** -0.218**
(0.104) (-0.421, -0.014)
----------------------------------------------------------
Observations 100 100
R2 0.769 0.769
Adjusted R2 0.766 0.766
Residual Std. Error (df = 98) 1.032 1.032
F Statistic (df = 1; 98) 325.893*** 325.893***
==========================================================
Note: *p<0.1; **p<0.05; ***p<0.01
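The intervals in column (2) look consistent with a normal approximation, estimate ± 1.96 × SE (for x, 1.981 ± 1.96 × 0.110 gives roughly (1.766, 2.196)). That is an inference from the table rather than something I have checked in the stargazer source. A base-R sketch that builds both quantities side by side, with ci and z as illustrative names:

```r
set.seed(04152020)
x <- rnorm(100)
y <- 2*x + rnorm(100)
m1 <- lm(y ~ x)

est <- coef(m1)
se  <- sqrt(diag(vcov(m1)))

# Normal-approximation 95% interval: estimate +/- qnorm(0.975) * SE
z  <- qnorm(0.975)
ci <- cbind(lower = est - z * se, upper = est + z * se)

# Estimate, standard error and interval in one table
print(round(cbind(estimate = est, se = se, ci), 3))
```

Note that confint(m1) uses t quantiles instead, so its intervals are slightly wider than the ones computed here.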
Is there a way to put both standard errors and confidence intervals into a single column automatically, like you can do with p-values? E.g. this code:
stargazer(m1,
ci = c(FALSE, TRUE),
report = ('vcsp'),
type = "text")
Produces exactly what I want, but with p-values, and the documentation for the option that allows for it—report—seems to only allow the choice for p-values, as indicated by this question and answer.
===============================================
Dependent variable:
---------------------------
y
-----------------------------------------------
x 1.981
(0.110)
p = 0.000
Constant -0.218
(0.104)
p = 0.039
-----------------------------------------------
Observations 100
R2 0.769
Adjusted R2 0.766
Residual Std. Error 1.032 (df = 98)
F Statistic 325.893*** (df = 1; 98)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
I don't know how to do this in stargazer, but you can easily achieve the desired result with the modelsummary package. (Disclaimer: I am the author.)
library(modelsummary)
set.seed(04152020)
x <- rnorm(100)
y <- 2*x + rnorm(100)
m1 <- lm(y~x)
modelsummary(m1, statistic = c("std.error", "conf.int"))
You can also do crazy things like this, as described on the website:
modelsummary(models, gof_omit = ".*",
statistic = c("conf.int",
"s.e. = {std.error}",
"t = {statistic}",
"p = {p.value}"))
I want the same significance stars in the stargazer regression output as in the "normal output".
I produce data
library("stargazer"); library("lmtest"); library("sandwich")
set.seed(1234)
df <- data.frame(y=1001:1100)
df$x <- c(1:70,-100:-71) + rnorm(100, 0, 74.8)
model <- lm(log(y) ~ x, data=df)
and get some model estimates where the coefficient on x has a p-value of 0.1023
coeftest(model, vcov = vcovHC(model, type="HC3"))
I want to have these results in LaTeX. Based on the same function I calculate heteroscedasticity consistent standard estimates and let stargazer use them.
stderr_HC3_model <- sqrt(diag(vcovHC(model, type = "HC3")))
stargazer(model, se=list(stderr_HC3_model))
The stargazer output has a star at the coefficient indicating significance when alpha=10%. I want stargazer to give the same as the coeftest. (Because of the comparability with Stata where reg L_y x, vce(hc3) gives exactly the coeftest results.)
I played around with the stargazer options p.auto, t.auto which did not help. When I execute "stargazer" I cannot view the underlying code as it is possible in other cases. What to do?
Richard's answer helped me. Here are the steps I used to output more than one regression (let's say ols_a and ols_b).
ses <- list(coeftest(ols_a, vcov = vcovHC(ols_a, type="HC3"))[,2],
coeftest(ols_b, vcov = vcovHC(ols_b, type="HC3"))[,2])
pvals <- list(coeftest(ols_a, vcov = vcovHC(ols_a, type="HC3"))[,4],
coeftest(ols_b, vcov = vcovHC(ols_b, type="HC3"))[,4])
stargazer(ols_a, ols_b, type="text", p=pvals, se=ses)
You need to provide the p values associated with your coeftest. From the man page.
p a list of numeric vectors that will replace the default p-values for
each model. Matched by element names. These will form the basis of
decisions about significance stars
The following should work.
test <- coeftest(model, vcov = vcovHC(model, type="HC3"))
ses <- test[, 2]
pvals <- test[, 4]
stargazer(model, type="text", p=pvals, se=ses)
This provides the following.
===============================================
Dependent variable:
---------------------------
log(y)
-----------------------------------------------
x -0.00005
Constant 6.956***
(0.003)
-----------------------------------------------
Observations 100
R2 0.026
Adjusted R2 0.016
Residual Std. Error 0.027 (df = 98)
F Statistic 2.620 (df = 1; 98)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
It may be a minor issue but Richard's answer is actually not entirely correct -
his stargazer output does not report any standard errors nor potential significance stars for the variable x.
Also, even when reporting only a single model in stargazer, manually supplied coefficients, se, p and t values have to be wrapped in a list. Otherwise stargazer will leave the corresponding entries empty.
The (slightly) corrected example:
test <- coeftest(model, vcov = vcovHC(model, type="HC3"))
ses <- list(test[, 2])
pvals <- list(test[, 4])
stargazer(model, type="text", p=pvals, se=ses)
Output:
=======================================================================
Dependent variable:
-----------------------------------------
Daily added investors
negative
binomial
-----------------------------------------------------------------------
log(lag_raised_amount + 1) -0.466***
(0.124)
lag_target1 -0.661***
(0.134)
Constant -3.480**
(1.290)
-----------------------------------------------------------------------
Observations 6,513
Log Likelihood -8,834
theta 1.840*** (0.081)
Akaike Inf. Crit. 17,924
=======================================================================
Note: + p<0.1; * p<0.05; ** p<0.01; *** p<0.001
There are inherent dangers associated with the se argument.
When using this approach, be careful with the arguments t.auto and p.auto, both of which default to TRUE. I think it would be prudent to set them both to FALSE and supply the t and p values manually.
If you fail to do so, you risk getting significance stars that are out of sync with the displayed p-values. (I suspect that stargazer simply reuses the supplied se, which now differ from the default ones, and recomputes the displayed stars from them, which will naturally yield unexpected results.)
See also:
Displaying p-values instead of SEs in parenthesis
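To keep se, t and p in sync, all three can be computed from the same covariance matrix and passed together. The sketch below hand-rolls the HC3 estimator in base R (my assumption is that it matches sandwich::vcovHC(model, type = "HC3")) and uses a t distribution with the residual degrees of freedom, which as far as I know is what coeftest does by default for lm objects. The names se_hc3, t_hc3 and p_hc3 are illustrative.

```r
set.seed(1234)
df <- data.frame(y = 1001:1100)
df$x <- c(1:70, -100:-71) + rnorm(100, 0, 74.8)
model <- lm(log(y) ~ x, data = df)

# HC3 covariance: (X'X)^-1 X' diag(e^2 / (1 - h)^2) X (X'X)^-1
X     <- model.matrix(model)
e     <- residuals(model)
h     <- hatvalues(model)
bread <- solve(crossprod(X))
meat  <- crossprod(X * (e / (1 - h)))
vc    <- bread %*% meat %*% bread

se_hc3 <- sqrt(diag(vc))
names(se_hc3) <- names(coef(model))
t_hc3  <- coef(model) / se_hc3
p_hc3  <- 2 * pt(abs(t_hc3), df = df.residual(model), lower.tail = FALSE)
print(cbind(se = se_hc3, t = t_hc3, p = p_hc3))

# Intended usage (requires the stargazer package):
# stargazer(model, type = "text",
#           se = list(se_hc3), t = list(t_hc3), p = list(p_hc3),
#           t.auto = FALSE, p.auto = FALSE)
```

Supplying t and p explicitly and switching off t.auto and p.auto means the stars no longer depend on how stargazer would recompute them from the replaced standard errors.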
I am testing differences on the number of pollen grains loading on plant stigmas in different habitats and stigma types.
My sample design comprises two habitats, with 10 sites each habitat.
In each site, I have up to 3 stigma types (wet, dry and semidry), and for each stigma type I have a different number of plant species, with a different number of individuals per plant species (code).
So, I ended up with a nested design as follows: habitat/site/stigmatype/stigmaspecies/code
As it is a descriptive study, stigmatype, stigmaspecies and code vary between sites.
My response variable (n) is the number of pollen grains (log10+1) per stigma per plant, averaged because I collected 3 stigmas per plant.
The data don't fit a Poisson distribution because (i) the values are not integers, and (ii) the variance is much higher than the mean (ratio = 911.0756). So, I fitted the model with negative.binomial.
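For reference, the variance-to-mean ratio mentioned above can be computed as in this small sketch. It uses simulated counts rather than the pollen data, and disp_ratio is a hypothetical helper name.

```r
set.seed(42)  # illustrative seed
y_pois <- rpois(5000, lambda = 10)          # Poisson: variance equals the mean
y_nb   <- rnbinom(5000, mu = 10, size = 2)  # negative binomial: variance = mu + mu^2 / size

# Ratio near 1 is consistent with Poisson; much larger values indicate overdispersion
disp_ratio <- function(y) var(y) / mean(y)
print(c(poisson = disp_ratio(y_pois), negbin = disp_ratio(y_nb)))
```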
After model selection, I have:
m4a <- glmer(n ~ habitat*stigmatype + (1|stigmaspecies/code),
family=negative.binomial(2))
> summary(m4a)
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Family: Negative Binomial(2) ( log )
Formula: n ~ habitat * stigmatype + (1 | stigmaspecies/code)
AIC BIC logLik deviance
993.9713 1030.6079 -487.9856 975.9713
Random effects:
Groups Name Variance Std.Dev.
code:stigmaspecies (Intercept) 1.034e-12 1.017e-06
stigmaspecies (Intercept) 4.144e-02 2.036e-01
Residual 2.515e-01 5.015e-01
Number of obs: 433, groups: code:stigmaspecies, 433; stigmaspecies, 41
Fixed effects:
Estimate Std. Error t value Pr(>|z|)
(Intercept) -0.31641 0.08896 -3.557 0.000375 ***
habitatnon-invaded -0.67714 0.10060 -6.731 1.68e-11 ***
stigmatypesemidry -0.24193 0.15975 -1.514 0.129905
stigmatypewet -0.07195 0.18665 -0.385 0.699885
habitatnon-invaded:stigmatypesemidry 0.60479 0.22310 2.711 0.006712 **
habitatnon-invaded:stigmatypewet 0.16653 0.34119 0.488 0.625491
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) hbttn- stgmtyps stgmtypw hbttnn-nvdd:stgmtyps
hbttnn-nvdd -0.335
stgmtypsmdr -0.557 0.186
stigmatypwt -0.477 0.160 0.265
hbttnn-nvdd:stgmtyps 0.151 -0.451 -0.458 -0.072
hbttnn-nvdd:stgmtypw 0.099 -0.295 -0.055 -0.403 0.133
Two questions:
How do I check for overdispersion from this output?
What is the best way to go through model validation here?
I have been using:
qqnorm(resid(m4a))
hist(resid(m4a))
plot(fitted(m4a),resid(m4a))
While qqnorm() and hist() seem ok, there is a tendency toward heteroscedasticity in the 3rd graph. And here is my final question:
Can I go through model validation with this graph in glmer? or is there a better way to do it? if not, how much should I worry about the 3rd graph?
a simple way to check for overdispersion in glmer is:
> library("blmeco")
> dispersion_glmer(your_model) # it shouldn't be over 1.4
To solve overdispersion I usually add an observation level random factor
For model validation I usually start from these plots...but then depends on your specific model...
par(mfrow=c(2,2))
qqnorm(resid(your_model), main="normal qq-plot, residuals")
qqline(resid(your_model))
qqnorm(ranef(your_model)$id[,1])
qqline(ranef(your_model)$id[,1])
plot(fitted(your_model), resid(your_model)) #residuals vs fitted
abline(h=0)
your_data$fitted <- fitted(your_model) #fitted vs observed
plot(your_data$fitted, jitter(your_data$total,0.1))
abline(0,1)
hope this helps a little....
cheers
Just an addition to Q1 for those who might find this by googling: the blmeco dispersion_glmer function appears to be outdated. It is better to use @Ben_Bolker's function for this purpose:
overdisp_fun <- function(model) {
rdf <- df.residual(model)
rp <- residuals(model,type="pearson")
Pearson.chisq <- sum(rp^2)
prat <- Pearson.chisq/rdf
pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE)
c(chisq=Pearson.chisq,ratio=prat,rdf=rdf,p=pval)
}
Source: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#overdispersion.
With the highlighted notion:
Do PLEASE note the usual, and extra, caveats noted here: this is an APPROXIMATE estimate of an overdispersion parameter.
PS. Why outdated?
The lme4 package includes the residuals function these days, and Pearson residuals are supposedly more robust for this type of calculation than the deviance residuals. The blmeco::dispersion_glmer function sums the squared deviance residuals together with the squared values of u, divides by the number of observations, and takes the square root of the result (the function):
dispersion_glmer <- function (modelglmer)
{
n <- length(resid(modelglmer))
return(sqrt(sum(c(resid(modelglmer), modelglmer@u)^2)/n))
}
The blmeco solution gives considerably higher deviance/df ratios than Bolker's function. Since Ben is one of the authors of the lme4 package, I would trust his solution more although I am not qualified to rationalize the statistical reason.
x <- InsectSprays
x$id <- rownames(x)
mod <- lme4::glmer(count ~ spray + (1|id), data = x, family = poisson)
blmeco::dispersion_glmer(mod)
# [1] 1.012649
overdisp_fun(mod)
# chisq ratio rdf p
# 55.7160734 0.8571704 65.0000000 0.7873823