Get coefficients estimated by maximum likelihood into a stargazer table - r

Stargazer produces very nice LaTeX tables for lm (and other) objects. Suppose I've fit a model by maximum likelihood. I'd like stargazer to produce an lm-like table for my estimates. How can I do this?
Although it's a bit hacky, one way might be to create a "fake" lm object containing my estimates -- I think this would work as long as summary(my.fake.lm.object) works. Is that easily doable?
An example:
library(stargazer)
N <- 200
df <- data.frame(x=runif(N, 0, 50))
df$y <- 10 + 2 * df$x + 4 * rt(N, 4) # True params
plot(df$x, df$y)
model1 <- lm(y ~ x, data=df)
stargazer(model1, title="A Model") # I'd like to produce a similar table for the model below
ll <- function(params) {
## Log likelihood for y ~ x + student's t errors
params <- as.list(params)
return(sum(dt((df$y - params$const - params$beta*df$x) / params$scale, df=params$degrees.freedom, log=TRUE) -
log(params$scale)))
}
model2 <- optim(par=c(const=5, beta=1, scale=3, degrees.freedom=5), lower=c(-Inf, -Inf, 0.1, 0.1),
fn=ll, method="L-BFGS-B", control=list(fnscale=-1), hessian=TRUE)
model2.coefs <- data.frame(coefficient=names(model2$par), value=as.numeric(model2$par),
se=as.numeric(sqrt(diag(solve(-model2$hessian)))))
stargazer(model2.coefs, title="Another Model", summary=FALSE) # Works, but how can I mimic what stargazer does with lm objects?
To be more precise: with lm objects, stargazer nicely prints the dependent variable at the top of the table, includes SEs in parentheses below the corresponding estimates, and has the R^2 and number of observations at the bottom of the table. Is there a(n easy) way to obtain the same behavior with a "custom" model estimated by maximum likelihood, as above?
Here are my feeble attempts at dressing up my optim output as a lm object:
model2.lm <- list() # Mimic an lm object
class(model2.lm) <- c(class(model2.lm), "lm")
model2.lm$rank <- model1$rank # Problematic?
model2.lm$coefficients <- model2$par
names(model2.lm$coefficients)[1:2] <- names(model1$coefficients)
model2.lm$fitted.values <- model2$par["const"] + model2$par["beta"]*df$x
model2.lm$residuals <- df$y - model2.lm$fitted.values
model2.lm$model <- df
model2.lm$terms <- model1$terms # Problematic?
summary(model2.lm) # Not working

I just had this problem and overcame it by using the coef, se, and omit arguments of stargazer, e.g.
stargazer(regressions, ...
coef = list(... list of coefs...),
se = list(... list of standard errors...),
omit = c(sequence),
covariate.labels = c("new names"),
dep.var.labels.include = FALSE,
notes.append = FALSE)
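A minimal sketch of this idea, assuming the simulated data and likelihood from the question: an ordinary lm serves as a template, and the maximum-likelihood estimates and their Hessian-based standard errors are handed to stargazer through coef and se. The names template, ml_fit, ml_est and ml_se are illustrative, and only the two regression parameters are shown. The fit statistics at the bottom of the table would still come from the lm template.

```r
set.seed(1)
N <- 200
df <- data.frame(x = runif(N, 0, 50))
df$y <- 10 + 2 * df$x + 4 * rt(N, 4)

# template lm (hypothetical name): supplies the table layout and covariate names
template <- lm(y ~ x, data = df)

# log likelihood for y ~ x with scaled Student's t errors, as in the question
ll <- function(p) {
  sum(dt((df$y - p[["const"]] - p[["beta"]] * df$x) / p[["scale"]],
         df = p[["degrees.freedom"]], log = TRUE) - log(p[["scale"]]))
}
ml_fit <- optim(par = c(const = 5, beta = 1, scale = 3, degrees.freedom = 5),
                fn = ll, method = "L-BFGS-B",
                lower = c(-Inf, -Inf, 0.1, 0.1),
                control = list(fnscale = -1), hessian = TRUE)

# standard errors from the inverse of the negative Hessian
ml_se_all <- sqrt(diag(solve(-ml_fit$hessian)))
names(ml_se_all) <- names(ml_fit$par)

# keep the regression parameters and name them like the lm coefficients,
# because stargazer matches the supplied vectors to covariates by name
ml_est <- setNames(ml_fit$par[c("const", "beta")], names(coef(template)))
ml_se  <- setNames(ml_se_all[c("const", "beta")], names(coef(template)))

# only build the table if stargazer is installed
if (requireNamespace("stargazer", quietly = TRUE)) {
  stargazer::stargazer(template, coef = list(ml_est), se = list(ml_se),
                       type = "text", title = "ML estimates")
}
```

The scale and degrees-of-freedom parameters have no counterpart in the lm template, so they would have to be reported separately, for example through add.lines.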

You need to first instantiate a dummy lm object, then dress it up:
#...
model2.lm = lm(y ~ ., data.frame(y=runif(5), beta=runif(5), scale=runif(5), degrees.freedom=runif(5)))
model2.lm$coefficients <- model2$par
model2.lm$fitted.values <- model2$par["const"] + model2$par["beta"]*df$x
model2.lm$residuals <- df$y - model2.lm$fitted.values
stargazer(model2.lm, se = list(model2.coefs$se), summary=FALSE, type='text')
# ===============================================
#                         Dependent variable:
#                     ---------------------------
#                                  y
# -----------------------------------------------
# const                        10.127***
#                               (0.680)
#
# beta                         1.995***
#                               (0.024)
#
# scale                        3.836***
#                               (0.393)
#
# degrees.freedom              3.682***
#                               (1.187)
#
# -----------------------------------------------
# Observations                    200
# R2                             0.965
# Adjusted R2                    0.858
# Residual Std. Error      75.581 (df = 1)
# F Statistic             9.076 (df = 3; 1)
# ===============================================
# Note:               *p<0.1; **p<0.05; ***p<0.01
(Some of the summary statistics at the bottom, for example the residual standard error with df = 1, still reflect the dummy fit, so make sure to correct or omit them.)

I don't know how committed you are to using stargazer, but you can try the broom and xtable packages. The problem is that this won't give you the standard errors for the optim model.
library(broom)
library(xtable)
xtable(tidy(model1))
xtable(tidy(model2))

Related

Difference between arima(1,0,0) function and running a regression on lag values?

I'm currently doing time series in R and had a few fundamental R doubts. Mainly, what is the difference between the two pieces of code?
ar_1 <- lm(df$value ~ lag(df$value))
summary(ar_1)
arima_values <- arima(df$value, order=c(1,0,0))
arima_values
I essentially have to get the coefficients, standard errors, etc., but the two pieces of code above return different values for each. What is each piece of code doing? The general formula for AR(1) essentially amounts to running a regression on the first-order lagged values, correct? Shouldn't the arima function achieve the same thing?
They give the same values to several decimals if the arguments to arima are set as shown:
# generate test series
set.seed(13)
n <- 25
mu <- 0.4
phi <- 0.8
s <- seq(0, length = n - 1)
x <- rnorm(1)
for(i in 2:n) x[i] <- mu + phi * x[i-1] + rnorm(1)
# lm
mod.lm <- lm(x[-1] ~ x[-n])
coef(mod.lm)
## (Intercept) x[-n]
## 0.7593169 0.7408584
# arima - use conditional sum of squares and drop 0 observations
mod.arima <- arima(x, c(1, 0, 0), method = "CSS", n.cond = 0)
co <- coef(mod.arima)
co
## ar1 intercept
## 0.7408535 2.9300719
# arima defines intercept differently so use this to compare to lm intercept
with(as.list(co), intercept * (1 - ar1))
## [1] 0.7593179
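The rescaling in the last line follows from how the two parameterizations relate. As a sketch (the symbols c, \mu and \phi are notation introduced here): lm estimates the intercept form, while arima reports the mean around which the series is centered.

```latex
\begin{aligned}
\text{lm:}\quad    & x_t = c + \phi\, x_{t-1} + \varepsilon_t \\
\text{arima:}\quad & x_t - \mu = \phi\,(x_{t-1} - \mu) + \varepsilon_t
  \;\Longrightarrow\; x_t = \mu\,(1 - \phi) + \phi\, x_{t-1} + \varepsilon_t \\
\text{so}\quad     & c = \mu\,(1 - \phi)
\end{aligned}
```

With the estimates above, 2.9300719 * (1 - 0.7408535) is approximately 0.7593, which matches the lm intercept.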
We can also use ar with the appropriate arguments:
mod.ar <- ar(x, order.max = 1, method = "ols", demean = FALSE, intercept = TRUE)
mod.ar
##
## Call:
## ar(x = x, order.max = 1, method = "ols", demean = FALSE, intercept = TRUE)
##
## Coefficients:
## 1
## 0.7409
##
## Intercept: 0.7593 (0.3695)

How to perform bootstrapping for estimation and inference of quantile regression using multiply imputed data in R?

I am trying to manually pool results from quantile regression models run on multiply imputed data in R using mice. I make use of a bootstrapping procedure to get 95% CIs and P values of the model terms, in which model parameters and their standard errors are obtained after sampling a certain number of rows that is equal to the unique number of participants in my data set. This procedure is repeated 500 times for each of the m imputed data sets. Then, as a last step, I pool the estimated coefficients and their standard errors of the resulting 500 * m regression models according to Rubin's rules (1987) (see e.g. https://bookdown.org/mwheymans/bookmi/rubins-rules.html). To speed things up, I use foreach to split up the analyses over multiple processor cores and for to loop over the m imputed data sets.
However, there seems to be a flaw in the part wherein the results are pooled. When I look at the pooled results, I observe that the P values are not in accordance with the 95% CIs (e.g. P < 0.05 when 0 is included in the 95% CI).
To illustrate this issue, I have made a reproducible example, using these publicly available data: https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv
Because there are no missing data in this data set, I introduce them myself and impute the data (m = 10 multiply imputed data sets with 20 iterations). I use set.seed for reproducibility.
Note that I use lm instead of quantreg::rq in this example.
# load data
projdir <- "my_directory"
d <- read.csv(file = file.path(projdir, 'heart_failure_clinical_records_dataset.csv'))
#### introduce missing values
set.seed(1)
# age
age_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$age[age_miss_tag == 1] <- NA # MCAR
# serum creatinine
creat_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$serum_creatinine[creat_miss_tag == 1 & d$anaemia == 0] <- NA # MAR
# CK
CK_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$creatinine_phosphokinase[CK_miss_tag & d$platelets > median(d$platelets)] <- NA # MAR
# platelets
platelets_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$platelets[platelets_miss_tag == 1] <- NA # MCAR
library(mice); library(mitml); library(miceadds); library(splines); library(foreach); library(doParallel)
# impute data
imp <- mice(d, maxit = 20, m = 10, seed = 2)
# log creatinine
implong <- complete(imp, 'long', include = FALSE)
implong$log_creat <- log(implong$serum_creatinine)
imp <- miceadds::datlist2mids(split(implong, implong$.imp))
# compute values for Boundary.knots
temp <- complete(imp, 'long', include = FALSE)
B_knots <- rowMeans(sapply(split(temp, temp$.imp), function(x) {
quantile(x$age, c(0.1, 0.9))
}))
# Convert mids object into a datlist
longlist <- miceadds::mids2datlist(imp)
# fit model based on original data and use the terms in the below foreach loop
# in order to fix the position of the inner knots
fit_orig <- lm(log_creat ~
# Main effects
ns(age, df = 2, B = c(B_knots[1], B_knots[2])) * sex,
data = longlist[[1]])
To further speed things up, I use OLS instead of quantile regression here and parallelize the process.
# make cluster used in foreach
cores_2_use <- detectCores() - 1
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, iseed = 9956)
registerDoParallel(cl)
# No. of bootstrap samples to be taken
n_iter <- 500
boot.1 <- c()
for(k in seq_along(longlist)){
boot.1[[k]] <- foreach(i = seq_len(n_iter),
.combine = rbind,
.packages = c('mice', 'mitml', 'splines')) %dopar% {
# store data from which rows can be sampled
longlist0 <- longlist[[k]]
# set seed for reproducibility
set.seed(i)
# sample rows
boot_dat <- longlist0[sample(1:nrow(longlist0), replace = TRUE), ]
# linear regression model based on sampled rows
fit1 <- lm(terms(fit_orig), data = boot_dat)
# save coefficients
fit1$coef
}
}
stopCluster(cl)
As a last step, I pool the results according to Rubin's rules.
n_cols <- dim(boot.1[[1]])[2]
list <- c()
for(i in seq_len(n_cols)) {
# extract coefficients
parameter <- lapply(boot.1, function(x){
x[,i]
})
m <- length(parameter)
for(k in seq_len(m)) {
names(parameter[[k]]) <- NULL
}
Q <- sapply(parameter, mean)
U <- sapply(parameter, var) # (standard error of estimate)^2
#### Pooling
# Pooled univariate estimate
qbar <- mean(Q)
# Mean of the variances (i.e. the pooled within-imputation variance)
ubar <- mean(U)
# Between-imputation variance
btw_var <- var(Q)
# Total variance of the pooled estimate
tot_var <- ubar + btw_var + (btw_var / m)
# Relative increase in variance due to non-response
r_var <- (btw_var + (btw_var / m)) / ubar
# Fraction of missing information
lambda <- (btw_var + (btw_var / m)) / tot_var
# degrees of freedom for the t-distribution according to Rubin (1987)
df_old <- (m - 1) / lambda^2
# sample size in the imputed data sets
n_sample <- nrow(longlist[[1]])
# observed degrees of freedom
df_observed <- (((n_sample - n_cols) + 1) / ((n_sample - n_cols) + 3)) *
(n_sample - n_cols) * (1 - lambda)
# adjusted degrees of freedom according to Barnard & Rubin (1999)
df_adjusted <- (df_old * df_observed) / (df_old + df_observed)
# 95% confidence interval of qbar
lwr <- qbar - qt(0.975, df_adjusted) * sqrt(tot_var)
upr <- qbar + qt(0.975, df_adjusted) * sqrt(tot_var)
# F statistic
q <- ((0 - qbar)^2 / tot_var)^2
# Significance level associated with the null value Q[0]
p_value <- pf(q, df1 = 1, df2 = df_adjusted, lower.tail = FALSE)
list[[i]] <- cbind(qbar, lwr, upr, p_value)
}
names(list) <- colnames(boot.1[[1]])
list
Obviously, the P value shown below is not in accordance with the 95% CI (as 0 is included in the CI, so the P value should be ≥0.05).
> list
$`(Intercept)`
qbar lwr upr p_value
[1,] 0.06984595 -0.02210231 0.1617942 0.008828337
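One thing worth checking in the pooling code above: the statistic q squares the ratio a second time, whereas the manual pooling function further down computes (0 - qbar)^2 / tot_var without the outer square. A hedged sketch with made-up numbers (qbar, tot_var and df_adj are illustrative values, not results from the data) shows that a t-based interval and an F-based P value agree only when the statistic is the squared t ratio:

```r
# illustrative values only, not taken from the heart failure data
qbar    <- 0.07
tot_var <- 0.0022
df_adj  <- 150

# 95% interval based on the t distribution
lwr <- qbar - qt(0.975, df_adj) * sqrt(tot_var)
upr <- qbar + qt(0.975, df_adj) * sqrt(tot_var)

# Wald statistic: squared t ratio, referred to F(1, df)
q_wald <- (0 - qbar)^2 / tot_var
p_wald <- pf(q_wald, df1 = 1, df2 = df_adj, lower.tail = FALSE)

# the same statistic squared once more, as in the pooling loop above
q_twice <- q_wald^2
p_twice <- pf(q_twice, df1 = 1, df2 = df_adj, lower.tail = FALSE)

# two-sided t test for comparison
p_t <- 2 * pt(-abs(qbar / sqrt(tot_var)), df_adj)

c(lwr = lwr, upr = upr, p_wald = p_wald, p_t = p_t, p_twice = p_twice)
```

With these values the interval contains 0, p_wald equals p_t and exceeds 0.05, while p_twice falls below 0.05, which is the same kind of mismatch as in the output shown.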
EDIT (29 Dec 2021)
As @Gerko Vink notes in his answer, multiple imputation and bootstrapping both induce variance. The variance induced by imputation is taken care of by Rubin's rules; the bootstrap variance is not. Unfortunately, mice::pool will not work with the output returned by quantreg::rq.
I am aware of constructing bootstrap CIs based on a naive percentile-based approach as shown in this post, but I am inclined to think this is not the correct approach to proceed with.
Does anyone know how to appropriately take care of the extra variance induced by bootstrapping when using rq?
EDIT (30 Dec 2021)
Inspired by this recent post, I decided not to go down the road of bootstrapping anymore, but instead to manually extract the point estimates and variances from each of the imputed data sets and pool them using Rubin's rules. I have posted this approach as an answer below. Any input on how to appropriately take care of the extra variance induced by bootstrapping when using rq is still very welcome though!
Bootstrapping and multiple imputation both induce variance. The imputation variance is taken care of by Rubin's rules for parameters with normal sampling distributions. The bootstrap variance is not.
Two remarks:
First, there is a small error in your code. You're calculating the bootstrap variance about Q in U <- sapply(parameter, var). No need for U <- U/n_iter. U is already the variance and sapply(parameter, sd) would yield the bootstrapped standard error.
Second, you're using bootstrap parameters to calculate a parametric interval and p-value. That seems needlessly complicated and, as you can see, potentially problematic. Why not calculate the bootstrap CI?
See also this link for some inspiration with respect to different means of calculating the CIs and their respective validity.
A small simulation demonstrates that you cannot expect both to be identical for a finite set of bootstrap replications.
library(purrr)
library(magrittr)
#fix seed
set.seed(123)
#some data
n = 1000
d <- rnorm(n, 0, 1)
# ci function
fun <- function(x){
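# note: var(x)/length(x) is the squared standard error of the mean; the SE itself is its square root
se <- var(x)/length(x)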
lwr <- mean(x) - 1.96 * se
upr <- mean(x) + 1.96 * se
ci <- c(lwr, upr)
return(ci)
}
# bootstrap
boot <- replicate(500,
d[sample(1:1000, 1000, replace = TRUE)],
simplify = FALSE)
# bootstrapped ci's based on parameters
boot.param.ci <- boot %>%
map(~.x %>% fun) %>%
do.call("rbind", args = .)
# bootstrap CI
boot.ci <- boot %>%
map(~.x %>% mean) %>%
unlist %>%
quantile(c(.025, .975))
# Overview
data.frame(param = fun(d),
boot.param = boot.param.ci %>% colMeans,
boot.ci = boot.ci)
#> param boot.param boot.ci
#> 2.5% 0.01420029 0.01517527 -0.05035913
#> 97.5% 0.01805545 0.01904181 0.07245449
Created on 2021-12-22 by the reprex package (v2.0.1)
The following reprex also demonstrates that the bootstrap applied to the imputed data yields different variance estimates under the same pooling rules.
library(purrr)
library(magrittr)
library(mice)
#fix seed
set.seed(123)
imp <- mice(boys,
m = 10,
printFlag = FALSE)
imp %>%
complete("all") %>%
map(~.x %$%
lm(age ~ hgt + hc)) %>%
pool %>%
summary(conf.int = TRUE)
#> term estimate std.error statistic df p.value 2.5 %
#> 1 (Intercept) -1.9601179 0.809167659 -2.422388 682.5182 0.01567825 -3.5488747
#> 2 hgt 0.1690468 0.002784939 60.700342 572.1861 0.00000000 0.1635768
#> 3 hc -0.2138941 0.021843724 -9.792018 639.0432 0.00000000 -0.2567883
#> 97.5 %
#> 1 -0.3713610
#> 2 0.1745167
#> 3 -0.1710000
imp %>%
complete("all") %>%
map(~.x %>%
.[sample(1:748, 748, replace = TRUE), ] %$%
lm(age ~ hgt + hc)) %>%
pool %>%
summary(conf.int = TRUE)
#> term estimate std.error statistic df p.value 2.5 %
#> 1 (Intercept) -1.9810146 1.253312293 -1.580623 22.57546 1.278746e-01 -4.5763892
#> 2 hgt 0.1689181 0.004124538 40.954423 24.47123 0.000000e+00 0.1604141
#> 3 hc -0.2133606 0.033793045 -6.313743 22.29686 2.217445e-06 -0.2833890
#> 97.5 %
#> 1 0.6143599
#> 2 0.1774221
#> 3 -0.1433322
For quantile regression, mice::pool will not work with the output returned by quantreg::rq, because (according to this post) there is no agreed upon method to calculate standard errors, which are required to pool results under multiple imputation.
An ad hoc solution would be to manually extract the point estimates and variances from each of the imputed data sets and pool them using Rubin's rules.
First, a reprex using lm to verify whether results from the manual approach and mice::pool are identical.
library(mice)
imp <- mice(nhanes, print = FALSE, seed = 123)
# fit linear model
fit <- with(imp, lm(bmi ~ chl + hyp))
# manually pool univariate estimates using Rubin's rules
pool_manual <- \(model_object) {
m <- length(model_object$analyses)
Q <- sapply(model_object$analyses, \(x) summary(x)$coefficients[, 'Estimate'])
U <- sapply(model_object$analyses, \(x) (summary(x)$coefficients[, 'Std. Error'])^2)
qbar <- rowMeans(Q)
ubar <- rowMeans(U)
btw_var <- apply(Q, 1, var)
tot_var <- ubar + btw_var + (btw_var / m)
lambda <- (btw_var + (btw_var / m)) / tot_var
df_old <- (m - 1) / lambda^2
n_sample <- length(residuals(model_object$analyses[[1]]))
n_cols <- dim(Q)[1]
df_com <- n_sample - n_cols
df_observed <- ((df_com + 1) / (df_com + 3)) * df_com * (1 - lambda)
df_adjusted <- (df_old * df_observed) / (df_old + df_observed)
lwr <- qbar - qt(0.975, df_adjusted) * sqrt(tot_var)
upr <- qbar + qt(0.975, df_adjusted) * sqrt(tot_var)
q <- (0 - qbar)^2 / tot_var
p_value <- pf(q, df1 = 1, df2 = df_adjusted, lower.tail = FALSE)
df <- data.frame(noquote(rownames(Q)), qbar, lwr, upr, p_value)
rownames(df) <- NULL
names(df) <- c('term', 'estimate', '2.5 %', '97.5 %', 'p.value')
return(df)
}
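For reference, the quantities computed in pool_manual can be written out as follows. This is a sketch that mirrors the code line by line; the symbols (B for btw_var, T for tot_var, \nu for the degrees of freedom, n for n_sample, k for n_cols) are notation chosen here.

```latex
\begin{aligned}
\bar{Q} &= \frac{1}{m}\sum_{j=1}^{m} \hat{Q}_j, \qquad
\bar{U} = \frac{1}{m}\sum_{j=1}^{m} U_j, \qquad
B = \frac{1}{m-1}\sum_{j=1}^{m} \bigl(\hat{Q}_j - \bar{Q}\bigr)^2 \\
T &= \bar{U} + B + \frac{B}{m}, \qquad
\lambda = \frac{B + B/m}{T} \\
\nu_{\text{old}} &= \frac{m-1}{\lambda^{2}}, \qquad
\nu_{\text{com}} = n - k, \qquad
\nu_{\text{obs}} = \frac{\nu_{\text{com}} + 1}{\nu_{\text{com}} + 3}\,\nu_{\text{com}}\,(1 - \lambda) \\
\nu_{\text{adj}} &= \frac{\nu_{\text{old}}\,\nu_{\text{obs}}}{\nu_{\text{old}} + \nu_{\text{obs}}} \\
\text{CI}_{95\%} &= \bar{Q} \pm t_{0.975,\,\nu_{\text{adj}}}\sqrt{T}, \qquad
F = \frac{(0 - \bar{Q})^{2}}{T} \sim F_{1,\,\nu_{\text{adj}}}
\end{aligned}
```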
Verify.
> pool_manual(fit)
term estimate 2.5 % 97.5 % p.value
1 (Intercept) 21.78583831 8.99373786 34.57793875 0.004228746
2 chl 0.03303449 -0.02812005 0.09418903 0.254696358
3 hyp -1.07291395 -5.57406829 3.42824039 0.624035769
> extract <- c('term', 'estimate', '2.5 %', '97.5 %', 'p.value')
> summary(pool(fit), conf.int = TRUE)[, extract]
term estimate 2.5 % 97.5 % p.value
1 (Intercept) 21.78583831 8.99373786 34.57793875 0.004228746
2 chl 0.03303449 -0.02812005 0.09418903 0.254696358
3 hyp -1.07291395 -5.57406829 3.42824039 0.624035769
Quantile regression
Now, let's pool results from rq for the expected median of the outcome.
library(quantreg)
# fit quantile regression model
fit <- with(imp, rq(bmi ~ chl + hyp, tau = 0.5))
To be able to pool results from rq, only the summary method used to extract point estimates and variances from each of the imputed data sets needs to be adjusted in pool_manual.
Q <- sapply(model_object$analyses, \(x) summary.rq(x, covariance = TRUE)$coefficients[, 'Value'])
U <- sapply(model_object$analyses, \(x) (summary.rq(x, covariance = TRUE)$coefficients[, 'Std. Error'])^2)
Result
> pool_manual(fit)
term estimate 2.5 % 97.5 % p.value
1 (Intercept) 22.23452856 0.8551626 43.6138945 0.04461337
2 chl 0.03487894 -0.0857199 0.1554778 0.47022312
3 hyp -1.43636147 -6.0666990 3.1939761 0.52455041
> summary(pool(fit), conf.int = TRUE)[, extract]
Error in rq.fit.br(x, y, tau = tau, ci = TRUE, ...) :
unused arguments (effects = "fixed", parametric = TRUE, exponentiate = FALSE)

Using a custom `glance` method in huxtable::huxreg()

I am looking for a way to report the number of groups alongside the number of observations when using huxtable::huxreg to create a table of results for a multilevel model fitted with lmer(). I can write a custom glance method that overrides the glance.merMod method from the broom.mixed package; it works when called directly from R but not when called through huxreg().
I believe that the issue is that the huxreg function imports both broom and broom.mixed as required namespaces inside the function.
What would be the best way to overcome this issue to use the custom glance method (or an alternative other than simply adding rows to existing huxtables)?
Here's a MWE
library(lme4)
library(tibble)
library(huxtable)
library(broom.mixed)
## Simulate multilevel data
sigma <- 0.5
tau <- 0.1
x <- rnorm(100)
w <- rep(rnorm(10), each=10); i <- factor(rep(1:10, each=10))
y <- x + w + rep(rnorm(10, 0, tau), each = 10) + rnorm(100, 0, sigma)
d <- tibble(y, x, w)
m <- lmer(y ~ x + w + (1|i), data=d)
## Custom glance method
glance.merMod <- function(x, ret=tibble::tibble_row()) {
ret$nobs <- nobs(x)
ret$ngrps <- summary(x)$ngrps
return(ret)
}
glance(m) ## Works, returns nobs and ngrps
huxreg(m, statistics = c("nobs", "ngrps")) ## Doesn't work, ngrps missing
You can fix this using tidy_override():
m2 <- tidy_override(m,
glance = list(
ngrps = summary(m)$ngrps
),
extend = TRUE
)
huxreg(m2, statistics = c("nobs", "ngrps"))
─────────────────────────────────────────────────
                                   (1)
                        ─────────────────────────
  (Intercept)                         0.027
                                     (0.064)
  x                                   0.916
                                     (0.063)
  w                                   0.982
                                     (0.075)
  sd__(Intercept)                     0.096
                                     (NA)
  sd__Observation                     0.561
                                     (NA)
                        ─────────────────────────
  nobs                              100
  ngrps                              10.000
─────────────────────────────────────────────────
  *** p < 0.001; ** p < 0.01; * p < 0.05.
You'll probably also want to call set_number_format on the ngrps cell.

How to use the replicate function in R to repeat the function

I have a problem when using replicate to repeat a function.
I am trying to use the bootstrap to fit a quadratic model with concentration as the predictor and Total_lignin as the response, and to report an estimate of the maximum with a corresponding standard error.
My idea is to create a function called bootFun that does everything that would happen within one iteration of a for loop. bootFun takes only the data set, the predictor, and the response to use (both variable names in quotes).
However, the resulting SD is 0, which cannot be correct. I do not know where the mistake is. Could you please help me with it?
# Load the libraries
library(dplyr)
library(tidyverse)
# Read the .csv and only use M.giganteus and S.ravennae.
dat <- read_csv('concentration.csv') %>%
filter(variety == 'M.giganteus' | variety == 'S.ravennae') %>%
arrange(variety)
# Check the data
head(dat)
# sample size
n <- nrow(dat)
# A function to do one iteration
bootFun <- function(dat, pred, resp){
# Draw the sample size from the dataset
sample <- sample_n(dat, n, replace = TRUE)
# A quadratic model fit
formula <- paste0('resp', '~', 'pred', '+', 'I(pred^2)')
fit <- lm(formula, data = sample)
# Derive the max of the value of concentration
max <- -fit$coefficients[2]/(2*fit$coefficients[3])
return(max)
}
max <- bootFun(dat = dat, pred = 'concentration', resp = 'Total_lignin' )
# Iterated times
N <- 5000
# Use 'replicate' function to do a loop
maxs <- replicate(N, max)
# An estimate of the max of predictor and corresponding SE
mean(maxs)
sd(maxs)
Function boot from package boot (a recommended package that ships with R) eases the job of calling the bootstrap function repeatedly. The first argument of the statistic function must be the data set; the second is an indices argument, which boot supplies and the user does not set. Other arguments can also be passed to it; in this case they are the predictor and response names.
library(boot)
bootFun <- function(dat, indices, pred, resp){
# Draw the sample size from the dataset
dat.sample <- dat[indices, ]
# A quadratic model fit
formula <- paste0(resp, '~', pred, '+', 'I(', pred, '^2)')
formula <- as.formula(formula)
fit <- lm(formula, data = dat.sample)
# Derive the max of the value of concentration
max <- -fit$coefficients[2]/(2*fit$coefficients[3])
return(max)
}
N <- 5000
set.seed(1234) # Make the bootstrap results reproducible
results <- boot(dat, bootFun, R = N, pred = 'concentration', resp = 'Total_lignin')
results
#
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = dat, statistic = bootFun, R = N, pred = "concentration",
# resp = "Total_lignin")
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* -0.4629808 -0.0004433889 0.03014259
#
results$t0 # this is the statistic, not bootstrapped
#concentration
# -0.4629808
mean(results$t) # bootstrap value
#[1] -0.4633233
Note that to fit a polynomial, function poly is much simpler than to explicitly write down the polynomial terms one by one.
formula <- paste0(resp, '~ poly(', pred, ',2, raw = TRUE)')
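A quick check, on simulated data that is purely illustrative, that the two ways of writing the quadratic give the same coefficients. Note that raw = TRUE matters here: without it, poly fits orthogonal polynomials, and the vertex formula -b1 / (2 * b2) would no longer apply to the reported coefficients.

```r
set.seed(42)
x <- runif(50, -2, 2)
y <- 1 + 3 * x - 2 * x^2 + rnorm(50, sd = 0.3)   # true vertex at x = 0.75
d <- data.frame(concentration = x, Total_lignin = y)

# explicit polynomial terms versus poly(..., raw = TRUE)
fit_explicit <- lm(Total_lignin ~ concentration + I(concentration^2), data = d)
fit_poly     <- lm(Total_lignin ~ poly(concentration, 2, raw = TRUE), data = d)

# same coefficients, only the names differ
cbind(explicit = unname(coef(fit_explicit)), poly = unname(coef(fit_poly)))

# location of the maximum of the fitted parabola
vertex <- unname(-coef(fit_poly)[2] / (2 * coef(fit_poly)[3]))
vertex
```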
Check the distribution of the bootstrapped statistic.
op <- par(mfrow = c(1, 2))
hist(results$t)
qqnorm(results$t)
qqline(results$t)
par(op)
Test data
set.seed(2020) # Make the results reproducible
x <- cumsum(rnorm(100))
y <- x + x^2 + rnorm(100)
dat <- data.frame(concentration = x, Total_lignin = y)
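As for why the original attempt returned a standard deviation of 0: bootFun was called once, its result was stored in max, and replicate(N, max) then simply copied that single number N times. The formula in the question was also built from the literal strings 'resp' and 'pred' rather than from the values of those arguments. The sketch below, using simulated data like the test data above and a hypothetical helper boot_once, shows the difference between replicating a stored value and replicating the function call itself.

```r
set.seed(2020)
x <- cumsum(rnorm(100))
dat <- data.frame(concentration = x, Total_lignin = x + x^2 + rnorm(100))

# hypothetical helper: one bootstrap draw of the vertex of the fitted parabola
boot_once <- function(dat, pred, resp) {
  resampled <- dat[sample(nrow(dat), replace = TRUE), ]
  f <- as.formula(paste0(resp, " ~ ", pred, " + I(", pred, "^2)"))
  b <- coef(lm(f, data = resampled))
  unname(-b[2] / (2 * b[3]))
}

# replicating a stored value: the same number every time
one_value <- boot_once(dat, "concentration", "Total_lignin")
same <- replicate(200, one_value)

# replicating the call: a fresh resample on every iteration
fresh <- replicate(200, boot_once(dat, "concentration", "Total_lignin"))

c(sd_same = sd(same), sd_fresh = sd(fresh))
```

replicate re-evaluates its second argument on every iteration, so the function call has to appear inside it for each iteration to draw a new sample.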

R: Robust SE's and model diagnostics in stargazer table

I am trying to put some 2SLS regression outputs generated via ivreg() from the AER package into a LaTeX document using the stargazer package. However, I have a couple of problems that I can't seem to solve myself.
I can't figure out how to insert the model diagnostics provided by the summary of ivreg(), namely the weak instruments test, the Wu-Hausman test and the Sargan test. I would like to have them alongside the statistics usually reported underneath the table, such as number of observations, R-squared, and residual SE. The stargazer function doesn't seem to have an argument where you can provide a list of additional diagnostics. I didn't put this into my example because I honestly have no clue where to begin.
I want to replace the normal standard errors with robust standard errors, and the only way I have found to do this is to produce objects with robust standard errors and pass them to stargazer() with se=list(). I put this into the minimal working example below. Is there a more elegant way to code this, or a way to re-estimate the model and save it with robust standard errors?
library(AER)
library(stargazer)
y <- rnorm(100, 5, 10)
x <- rnorm(100, 3, 15)
z <- rnorm(100, 3, 7)
a <- rnorm(100, 1, 7)
b <- rnorm(100, 3, 5)
# Fitting IV models
fit1 <- ivreg(y ~ x + a |
a + z,
model = TRUE)
fit2 <- ivreg(y ~ x + a |
a + b + z,
model = TRUE)
# Here are the se's and the diagnostics i want
summary(fit1, vcov = sandwich, diagnostics=T)
summary(fit2, vcov = sandwich, diagnostics=T)
# Getting robust se's, i think HC0 is the standard
# used with "vcov=sandwich" from the above summary
cov1 <- vcovHC(fit1, type = "HC0")
robust1 <- sqrt(diag(cov1))
cov2 <- vcovHC(fit2, type = "HC0")
robust2 <- sqrt(diag(cov2))
# Create latex table
stargazer(fit1, fit2, type = "latex", se=list(robust1, robust2))
Here's one way to do what you want:
require(lmtest)
rob.fit1 <- coeftest(fit1, function(x) vcovHC(x, type="HC0"))
rob.fit2 <- coeftest(fit2, function(x) vcovHC(x, type="HC0"))
summ.fit1 <- summary(fit1, vcov. = function(x) vcovHC(x, type="HC0"), diagnostics=T)
summ.fit2 <- summary(fit2, vcov. = function(x) vcovHC(x, type="HC0"), diagnostics=T)
stargazer(fit1, fit2, type = "text",
se = list(rob.fit1[,"Std. Error"], rob.fit2[,"Std. Error"]),
add.lines = list(c(rownames(summ.fit1$diagnostics)[1],
round(summ.fit1$diagnostics[1, "p-value"], 2),
round(summ.fit2$diagnostics[1, "p-value"], 2)),
c(rownames(summ.fit1$diagnostics)[2],
round(summ.fit1$diagnostics[2, "p-value"], 2),
round(summ.fit2$diagnostics[2, "p-value"], 2)) ))
Which will yield:
==========================================================
                                  Dependent variable:
                              ----------------------------
                                           y
                                   (1)            (2)
----------------------------------------------------------
x                                -1.222         -0.912
                                 (1.672)        (1.002)
a                                -0.240         -0.208
                                 (0.301)        (0.243)
Constant                          9.662         8.450**
                                 (6.912)        (4.222)
----------------------------------------------------------
Weak instruments                  0.45           0.56
Wu-Hausman                        0.11           0.18
Observations                       100            100
R2                               -4.414         -2.458
Adjusted R2                      -4.526         -2.529
Residual Std. Error (df = 97)    22.075         17.641
==========================================================
Note:                          *p<0.1; **p<0.05; ***p<0.01
As you can see, this allows manually including the diagnostics in the respective models.
You could automate this approach by creating a function that takes in a list of models (e.g. list(summ.fit1, summ.fit2)) and outputs the objects required by se or add.lines arguments.
gaze.coeft <- function(x, col="Std. Error"){
stopifnot(is.list(x))
out <- lapply(x, function(y){
y[ , col]
})
return(out)
}
gaze.coeft(list(rob.fit1, rob.fit2))
gaze.coeft(list(rob.fit1, rob.fit2), col=2)
Both calls take a list of coeftest objects and return the vectors of SEs in the form expected by se:
[[1]]
(Intercept) x a
6.9124587 1.6716076 0.3011226
[[2]]
(Intercept) x a
4.2221491 1.0016012 0.2434801
The same can be done for the diagnostics:
gaze.lines.ivreg.diagn <- function(x, col="p-value", row=1:3, digits=2){
stopifnot(is.list(x))
out <- lapply(x, function(y){
stopifnot(inherits(y, "summary.ivreg"))
y$diagnostics[row, col, drop=FALSE]
})
out <- as.list(data.frame(t(as.data.frame(out)), check.names = FALSE))
for(i in 1:length(out)){
out[[i]] <- c(names(out)[i], round(out[[i]], digits=digits))
}
return(out)
}
gaze.lines.ivreg.diagn(list(summ.fit1, summ.fit2), row=1:2)
gaze.lines.ivreg.diagn(list(summ.fit1, summ.fit2), col=4, row=1:2, digits=2)
Both calls will yield:
$`Weak instruments`
[1] "Weak instruments" "0.45" "0.56"
$`Wu-Hausman`
[1] "Wu-Hausman" "0.11" "0.18"
Now the stargazer() call becomes as simple as this, yielding identical output as above:
stargazer(fit1, fit2, type = "text",
se = gaze.coeft(list(rob.fit1, rob.fit2)),
add.lines = gaze.lines.ivreg.diagn(list(summ.fit1, summ.fit2), row=1:2))
