pROC median Sensitivity vs. manual Sensitivity calculation - different Results - r

Calculating the sensitivity manually from the confusion matrix gives the value 0.853:
TN = 16
FP = 7
FN = 5
TP = 29
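For reference, the manual calculation from these counts is just sensitivity = TP / (TP + FN):
TP <- 29; FN <- 5
TP / (TP + FN)
# [1] 0.8529412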
The output of pROC is different (se.median = 0.8235).
y_test = c(1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,
0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0,
0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0)
y_pred_prob = c(0.63069148, 0.65580015, 0.9478634 , 0.94471701, 0.24756774,
0.51969906, 0.26881201, 0.6722361 , 0.30275069, 0.61676645,
0.76116789, 0.90867332, 0.31525658, 0.10681422, 0.6890589 ,
0.25185641, 0.54820684, 0.7175465 , 0.57194733, 0.71304872,
0.98805141, 0.92829077, 0.38150015, 0.97653216, 0.96036858,
0.75878699, 0.95466371, 0.52292342, 0.28296724, 0.5660834 ,
0.91581461, 0.49574317, 0.79025422, 0.14303487, 0.66885536,
0.07660444, 0.10342033, 0.53661914, 0.04701796, 0.83313871,
0.37766607, 0.89157993, 0.47731778, 0.62640482, 0.47664294,
0.0928437 , 0.13605622, 0.2561323 , 0.95572329, 0.49051571,
0.49267652, 0.92600581, 0.48464618, 0.96006108, 0.01548211,
0.56057243, 0.82257937)
library(pROC)
set.seed(99)
boot = 2000
rocobj <- roc(y_test, y_pred_prob)
print(ci.thresholds(rocobj, .95, thresholds = 0.5, method = 'bootstrap', boot.n = boot))
OUT: 95% CI (2000 stratified bootstrap replicates):
thresholds sp.low sp.median sp.high se.low se.median se.high
0.5002624 0.5652 0.7391 0.913 0.6765 0.8235 0.9412
Is this a result of the bootstrapping method? Because it is a median?

What threshold did you use?
You need to be careful when you report and analyze the results of a confusion matrix. When you have numeric predictions, you must consider at which threshold this table was generated. Given the numbers in it, I will assume you used a threshold of 0.495 or something close to that, which allowed me to obtain the same numbers as you:
> table(y_test, y_pred_prob > 0.495)
y_test FALSE TRUE
0 17 6
1 5 29
How to get the empirical sensitivity and specificity from pROC?
Now that we have a threshold to work with, we can extract the data for this threshold from pROC with the coords function:
> coords(rocobj, 0.495, "threshold", transpose = FALSE)
threshold specificity sensitivity
1 0.495 0.7391304 0.8529412
This is exactly the sensitivity you calculated.
What about bootstrapping?
As you suspected, the bootstrapping used to calculate the confidence intervals is a stochastic process, and the median of the resampled curves is going to differ from the empirical value.
However, with 2000 bootstrap replicates the median gets pretty close:
> set.seed(99)
> print(ci.thresholds(rocobj,.95, thresholds = 0.495, method = 'bootstrap',boot.n = boot))
95% CI (2000 stratified bootstrap replicates):
thresholds sp.low sp.median sp.high se.low se.median se.high
0.495 0.5652 0.7391 0.913 0.7353 0.8529 0.9706
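For intuition, here is a minimal sketch of a stratified bootstrap of the sensitivity at a fixed threshold (the idea behind these intervals, not pROC's actual implementation): sensitivity depends only on the cases, so we resample the case predictions with replacement, recompute the sensitivity, and take quantiles of the resampled values.
set.seed(99)
cases <- y_pred_prob[y_test == 1]                       # predictions for the positives
boot_se <- replicate(2000, mean(sample(cases, replace = TRUE) > 0.495))
quantile(boot_se, c(0.025, 0.5, 0.975))                 # low / median / high, close to pROC's se.* values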

Related

cumulative incidence plot for two groups with no competing risk

I am hoping someone can provide some help with an analysis I have.
Context: I have a dataset with patients who had liver transplant (some of them are in treatment group A, others in treatment group B, depending on the immunosuppressive med they’re on). Due to the transplant, these patients are at high risk for developing donor derived HBV infections.
What’s needed: The investigator is interested in time until infection starts (first instance of HBV) and the proportion that develop the infection over time. They also want the cumulative incidence of HBV infection at baseline, and at each of the post-transplant follow-up timepoints (6 months, 12 months, 18 months and 24 months) for group A and group B. For example, the 6-month data would be the proportion of those patients with 6 months of follow-up who ever had HBV, the 12-month data would be the proportion of those with 12 months of follow-up who ever had HBV, and so on.
Cumulative incidence in this specific case just means 1 minus the survival function, without accounting for any competing risks. The analysis population has no deaths or loss to follow up.
My questions are:
How do I get the cumulative incidence by treatment group over time? (I want to also display the numbers at risk and number of events for each group under the plot)
Is it possible to also display a log rank p-value on the cumulative incidence plot, to see if there’s any significant difference between the groups in terms of time to first infection or does that not make any sense?
Is there any way to also obtain the cumulative incidence numbers at each time point, for each group, with the std.err and 95% CI, similar to the life tables we get from summary(km) below? Those life tables give me the survival probabilities, so for cumulative incidence I could just do 1 - survival probability manually, but I am not sure how to obtain the std.err and confidence intervals.
Below is a test dataset similar to the actual one, and what I have done so far:
time<-c(1.5989,6.9433, 0.8890, 3.2691, 1.0514, 2.7625, 1.4319, 0.9681, 7.4416, 0.0268, 1.5168, 1.9647, 0.0657, 4.3571, 6.4490, 0.2198, 1.2028, 0.9555, 0.2601, 2.0096, 7.5156, 0.4463, 0.2355, 0.9391, 2.6996)
censor<-c(1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0)
group<-c(1, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 1)
ID <- 1:25   # patient identifier (assumed 1:25 for this toy example)
df <- data.frame(ID, time, censor, group)
View(df)
library(survival)
km <- survfit(formula = Surv(time, censor) ~ group, data = df)
summary(km)
#cumulative incidence plot
plot(km, fun = function(x) 1-x)
#log rank test;
survdiff(Surv(time, censor) ~ group, data=df)
#plot survival curves for each treatment group
plot(survfit(Surv(time, censor) ~ group, data = df),
xlab = "Time",
ylab = "Overall survival probability")
It is better to use the survminer package to draw survival plots, via its ggsurvplot() function.
The fun argument of ggsurvplot() can be "event" for cumulative incidence or "cumhaz" for the cumulative hazard function; left at its default, it plots the Kaplan-Meier survival curve.
Additional arguments such as pval can also be customized.
The p-value shown with pval = TRUE comes from the log-rank test by default.
You can see more examples in the survminer documentation.
library(survminer)
#> Loading required package: ggplot2
#> Loading required package: ggpubr
time <- c(1.5989,6.9433, 0.8890, 3.2691, 1.0514, 2.7625, 1.4319, 0.9681, 7.4416,
0.0268, 1.5168, 1.9647, 0.0657, 4.3571, 6.4490, 0.2198, 1.2028, 0.9555,
0.2601, 2.0096, 7.5156, 0.4463, 0.2355, 0.9391, 2.6996)
censor <- c(1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0)
group <- c(1, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 1)
df <- data.frame(time, censor, group)
library(survival)
km <- survfit(formula = Surv(time, censor) ~ group, data = df)
#cumulative incidence plot
ggsurvplot(km,
data = df,
censor = T,
risk.table = TRUE,
legend.labs = c("group 1", "group 2"),
xlim = c(0,10),
ylim = c(0,1),
pval = T,
pval.method = T,
pval.method.coord = c(2.5,0.5),
pval.coord = c(4.2,0.5),
xlab = "Months",
ylab = "SURVIVAL PROBABILITY",
linetype = c(1,2),
legend.title = "",
palette = c('red', 'blue'),
fun="event"
)
Created on 2023-02-12 with reprex v2.0.2
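To address the third question (cumulative incidence at specific timepoints with standard errors and confidence intervals), a minimal sketch using summary() on the survfit object: cumulative incidence is 1 - S(t), its standard error is the same as that of S(t), and the confidence bounds swap under the 1 - x transform. The timepoints below are placeholders chosen to fit the toy data; on the real data they would be 6, 12, 18 and 24 months.
s <- summary(km, times = c(2, 4, 6))          # placeholder timepoints
data.frame(group   = s$strata,
           time    = s$time,
           cuminc  = 1 - s$surv,              # cumulative incidence = 1 - survival
           std.err = s$std.err,               # same SE as the survival estimate
           lower95 = 1 - s$upper,             # bounds swap under 1 - x
           upper95 = 1 - s$lower)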

R Manipulating List of Lists With Conditions / Joining Data

I have the following data showing 5 possible kids to invite to a party and what neighborhoods they live in.
I also have a list of solutions (binary indicators of whether each kid is invited or not); e.g., the first solution invites Kelly, Gina, and Patty.
data <- data.frame(c("Kelly", "Andrew", "Josh", "Gina", "Patty"), c(1, 1, 0, 1, 0), c(0, 1, 1, 1, 0))
names(data) <- c("Kid", "Neighborhood A", "Neighborhood B")
solutions <- list(c(1, 0, 0, 1, 1), c(0, 0, 0, 1, 1), c(0, 1, 0, 1, 1), c(1, 0, 1, 0, 1), c(0, 1, 0, 0, 1))
I'm looking for a way to now filter the solutions in the following ways:
a) Only keep solutions where there are at least 3 kids from both neighborhood A and neighborhood B (one kid can count as one for both if they're part of both)
b) Only keep solutions that have at least 3 kids selected (i.e., sum >= 3)
I think I need to somehow join data to the solutions in solutions, but I'm a bit lost on how to manipulate everything since the solutions are stuck in lists. Basically looking for a way to add entries to every solution in the list indicating a) how many kids the solution has, b) how many kids from neighborhood A, and c) how many kids from neighborhood B. From there I'd have to somehow filter the lists to only keep the solutions that satisfy >= 3?
Thank you in advance!
I wrote a little function to check each solution and return TRUE or FALSE based on your requirements. Passing your solutions to this using sapply() will give you a logical vector, with which you can subset solutions to retain only those that met the requirements.
check_solution <- function(solution, data) {
  data <- data[as.logical(solution), ]
  sum(data[["Neighborhood A"]]) >= 3 && sum(data[["Neighborhood B"]]) >= 3
}
### No need for the function to test whether `sum(solution) >= 3`, since
### that is *always* true if either neighborhood sum is >= 3.
tests <- sapply(solutions, check_solution, data = data)
# FALSE FALSE FALSE FALSE FALSE
solutions[tests]
# list()
### none of the `solutions` provided actually meet criteria
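An equivalent base-R one-liner, using the same check_solution() together with Filter(), avoids the intermediate logical vector:
Filter(function(s) check_solution(s, data), solutions)
# list()   # same result: none of the provided solutions qualify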
Edit: OP asked in the comments how to test against all neighborhoods in the data, and return TRUE if a specified number of neighborhoods have enough kids. Below is a solution using dplyr.
library(dplyr)
data <- data.frame(
  c("Kelly", "Andrew", "Josh", "Gina", "Patty"),
  c(1, 1, 0, 1, 0),
  c(0, 1, 1, 1, 0),
  c(1, 1, 1, 0, 1),
  c(0, 1, 1, 1, 1)
)
names(data) <- c("Kid", "Neighborhood A", "Neighborhood B", "Neighborhood C",
                 "Neighborhood D")
solutions <- list(c(1, 0, 0, 1, 1), c(0, 0, 0, 1, 1), c(0, 1, 0, 1, 1),
                  c(1, 0, 1, 0, 1), c(0, 1, 0, 0, 1))
check_solution <- function(solution,
                           data,
                           min_kids = 3,
                           min_neighborhoods = NULL) {
  neighborhood_tests <- data %>%
    filter(as.logical(solution)) %>%
    summarize(across(starts_with("Neighborhood"), ~ sum(.x) >= min_kids)) %>%
    as.logical()
  # require all neighborhoods by default
  if (is.null(min_neighborhoods)) min_neighborhoods <- length(neighborhood_tests)
  sum(neighborhood_tests) >= min_neighborhoods
}
tests1 <- sapply(solutions, check_solution, data = data)
solutions[tests1]
# list()
tests2 <- sapply(
solutions,
check_solution,
data = data,
min_kids = 2,
min_neighborhoods = 3
)
solutions[tests2]
# [[1]]
# [1] 1 0 0 1 1
#
# [[2]]
# [1] 0 1 0 1 1

Pooling Survreg Results Across Multiply Imputed Datasets - Error Message: log(1 - 2 * pnorm(width/2)) : NaNs produced

I am trying to run an interval regression using the survival R package (as described here: https://stats.oarc.ucla.edu/r/dae/interval-regression/), but I am running into difficulties when trying to pool results across multiply imputed datasets. Specifically, although estimates are returned, I get the following warning: In log(1 - 2 * pnorm(width/2)) : NaNs produced. The estimates seem reasonable at face value (no NaNs, no implausibly large or small SEs).
I ran the same model on the stacked dataset (ignoring imputations) and on the individual imputed datasets, and in neither case do I get the warning. Would someone be able to explain to me what is going on? Is this an ignorable warning? If not, is there a workaround that avoids it?
Thanks so much!
# A Reproducible Example
require(survival)
require(mice)
require(car)
# Create DF
dat <- data.frame(dv = c(1, 1, 2, 1, 0, NA, 1, 4, NA, 0, 3, 1, 3, 0, 2, 1, 4, NA, 2, 4),
catvar1 = factor(c(0, 0, 0, 0, 0, 1, 0, 0, 0, NA, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0)),
catvar2 = factor(c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, NA, 0)))
dat_imp <- mice(data = dat)
# Transform Outcome Var for Interval Reg
dat_imp_long <- complete(dat_imp, action = "long", include=TRUE)
# 1-4 correspond to ranges (e.g., 1 = 1 to 2 times...4 = 10 or more)
# create variables that reflect this range
dat_imp_long$dv_low <- car::recode(dat_imp_long$dv, "0 = 0; 1 = 1; 2 = 3; 3 = 6; 4 = 10")
dat_imp_long$dv_high <- car::recode(dat_imp_long$dv, "0 = 0; 1 = 2; 2 = 5; 3 = 9; 4 = 999")
dat_imp_long$dv_high[dat_imp_long$dv_high > 40] <- Inf
# Convert back to mids
dat_mids <- as.mids(dat_imp_long)
# Run Interval Reg
model1 <- with(dat_mids, survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian"))
# Warning message for both calls: In log(1 - 2 * pnorm(width/2)) : NaNs produced
# Problem does not only occur with pool, but summary
summary(model1)
summary(pool(model1))
# Run Equivalent Model on Individual Datasets
# No errors produced
imp1 <- subset(dat_imp_long, .imp == 1)
model2 <- survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian", data = imp1)
summary(model2)
imp2 <- subset(dat_imp_long, .imp == 2)
model3 <- survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian", data = imp2)
summary(model3)
# Equivalent Analysis on Stacked Dataset
# No error
model <- with(dat_imp_long, survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian"))
summary(model)
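As a generic diagnostic (a sketch for locating where the warning is raised, not a fix), warnings can be escalated to errors so the call stack at the offending call can be inspected:
options(warn = 2)          # treat warnings as errors so execution stops at the source
summary(pool(model1))      # re-run the call that produced the warning
traceback()                # inspect where log(1 - 2 * pnorm(width/2)) was evaluated
options(warn = 0)          # restore the default behaviour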

R optim() constraint optimization does not find the first best

My problem boils down to finding the vector X that best solves the following:
L is the vector of profits,
R is the vector of restrictions (right-hand sides),
P is the matrix of constraint parameters,
maximize SUM_i (l_i * x_i), i.e. max t(L) %*% X,
subject to SUM_i (p_ji * x_i) <= r_j for every j, i.e. P %*% X <= R.
I find a solution for X, but not the best one, which would be
fb = c(.217, 0, 0, 23.2865, 0, 13.427).
How do I find the best solution?
code:
X<-matrix(rep(1,6),6,1)
P<-matrix(c(
1, 1, 1, 2, 0, 0,
0, 1, 1, 2, 1, 1,
99.4, 37.75, 19.75, 54.40, 74.75, 53,
2.400, 1.540, 0, 0, 0, 0,
2.400, 1.960, 0, 0, 0, 0,
1.800, 3.300, 5.330, 0, 0, 0,
0, 0, 2.070, 0, 8.700, 0,
0, 0, .436, 0, 19.100, 12.363,
0, 3.000, .364, 0, 9.100, 26.737 ),
9,6,1)
L <- matrix(c(83.4, 72.35, 27.3, 72.05, 217.25, 455), 6,1)
R <- matrix(c(60,60,2000,351,448,479,338,424,359),9,1)
farm <- function(par, P, R, L) {
  trues <- P %*% par <= R            # check each constraint row
  if (min(trues) == 1 && min(par) >= 0) {
    return(-t(L) %*% par)            # feasible: return negative profit (optim minimizes)
  } else {
    return(0)                        # infeasible: flat zero penalty
  }
}
mtds = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN","Brent")
out <- optim(par = X, # initial guess
fn = farm,
P = P,
R = R ,
L = L,
method = mtds[5])
# my result
t(L)%*%out$par
#A matrix: 1 × 1 of type dbl
#7419.596
# the first best
fb<- matrix(c(.217,0,0,23.2865,0,13.427),6,1)
t(L)%*%fb
#A matrix: 1 × 1 of type dbl
#7805.175
I think you can try fmincon from package pracma
library(pracma)
objfun <- function(x) -t(L)%*%x
res <- fmincon(x0 = X,fn = objfun,A = P,b = R,lb = rep(0,length(X)))
and you will see that
> res$par
[1] 4.201711e-16 -1.239088e-15 1.863081e-17 2.310286e+01
[5] 5.566620e-01 1.323762e+01
> -res$value
[,1]
[1,] 7808.615
That looks very much like a model that could be solved by a linear programme.
library("Rglpk")
Rglpk_solve_LP(obj = L,
mat = P,
dir = rep("<=", 9),
rhs = R,
max = TRUE)
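Capturing the return value makes it easy to read off the result; Rglpk's default variable bounds are x >= 0, which matches the nonnegativity requirement here, and the optimum should agree with the fmincon result above (about 7808.6):
res <- Rglpk_solve_LP(obj = L, mat = P, dir = rep("<=", 9), rhs = R, max = TRUE)
res$status    # 0 indicates an optimal solution was found
res$solution  # the optimal X
res$optimum   # the maximized profit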

Stargazer Confidence Interval Incorrect?

So I am really fond of the stargazer package for displaying the statistics for regression models. I've been using R and Stata together to complete some problems in a textbook. One issue I have found is that the confidence interval printed by the stargazer package does not correspond to the confidence interval from Stata. I determined that the CI in Stata is the correct one after doing it by hand.
Because the issue might possibly lie in how I am handling the data, I offer the data here as well. My primary concern is to determine why the CIs do not correspond. From a previous post, here is one possible way of obtaining the data I am using:
install.packages("devtools") # if not already installed
library(devtools)
install_git("https://github.com/ccolonescu/PoEdata")
library(PoEdata) # loads the package in memory
library(multcomp) # for hypo testing
data(fair4) # loads the data set of interest
In Stata, the dataset I am using is called fair4.dta. For the data itself, you can also load it manually:
structure(list(year = structure(c(1880, 1884, 1888, 1892, 1896,
1900, 1904, 1908, 1912, 1916, 1920, 1924, 1928, 1932, 1936, 1940,
1944, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984,
1988, 1992, 1996, 2000, 2004, 2008), label = "year", format.stata = "%9.0g"),
vote = structure(c(50.2200012207031, 49.8460006713867, 50.4140014648438,
48.2680015563965, 47.7599983215332, 53.1710014343262, 60.0060005187988,
54.4830017089844, 54.7080001831055, 51.681999206543, 36.1189994812012,
58.2439994812012, 58.8199996948242, 40.8409996032715, 62.4580001831055,
54.9990005493164, 53.773998260498, 52.3699989318848, 44.5950012207031,
57.7639999389648, 49.9129981994629, 61.3440017700195, 49.5960006713867,
61.7890014648438, 48.9480018615723, 44.6969985961914, 59.1699981689453,
53.9020004272461, 46.5449981689453, 54.7360000610352, 50.2649993896484,
51.2330017089844, 46.5999984741211), label = "Incumbent share of the two-party presidential vote", format.stata = "%9.0g"),
party = structure(c(-1, -1, 1, -1, 1, -1, -1, -1, -1, 1,
1, -1, -1, -1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, 1, -1,
-1, -1, 1, 1, -1, -1), label = "= 1 if Democratic incumbent at election time; -1 if a Republican incumbent", format.stata = "%9.0g"),
person = structure(c(0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1,
0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0,
1, 0), label = "= 1 if incumbent is running for election and 0 otherwise", format.stata = "%9.0g"),
duration = structure(c(1.75, 2, 0, 0, 0, 0, 1, 1.25, 1.5,
0, 1, 0, 1, 1.25, 0, 1, 1.25, 1.5, 1.75, 0, 1, 0, 1, 0, 1,
0, 0, 1, 1.25, 0, 1, 0, 1), label = "number of terms incumbent administration in power", format.stata = "%9.0g"),
war = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), label = "= 1 for elections of 1920, 1944, and 1948 and 0 otherwise.", format.stata = "%9.0g"),
growth = structure(c(3.87899994850159, 1.58899998664856,
-5.55299997329712, 2.76300001144409, -10.0240001678467, -1.42499995231628,
-2.4210000038147, -6.2810001373291, 4.16400003433228, 2.22900009155273,
-11.4630002975464, -3.87199997901917, 4.6230001449585, -14.4989995956421,
11.7650003433228, 3.90199995040894, 4.27899980545044, 3.5789999961853,
0.690999984741211, -1.45099997520447, 0.377000004053116,
5.10900020599365, 5.04300022125244, 5.91400003433228, 3.75099992752075,
-3.59699988365173, 5.44000005722046, 2.17799997329712, 2.66199994087219,
3.12100005149841, 1.21899998188019, 2.69000005722046, 0.219999998807907
), label = "growth rate GDP in first three quarters of the election year", format.stata = "%9.0g"),
inflation = structure(c(1.97399997711182, 1.05499994754791,
0.603999972343445, 2.2739999294281, 3.41000008583069, 2.54800009727478,
1.44200003147125, 1.87899994850159, 2.17199993133545, 4.2519998550415,
0, 5.16099977493286, 0.18299999833107, 7.19999980926514,
2.49699997901917, 0.0810000002384186, 0, 0, 2.36199998855591,
1.93499994277954, 1.96700000762939, 1.25999999046326, 3.13899993896484,
4.81500005722046, 7.63000011444092, 7.83099985122681, 5.25899982452393,
2.90599989891052, 3.27999997138977, 2.06200003623962, 1.60500001907349,
2.32500004768372, 2.88000011444092), label = "growth rate of GDP deflator during first 15 quarters of admin", format.stata = "%9.0g"),
goodnews = structure(c(9, 2, 3, 7, 6, 7, 5, 8, 8, 3, 0, 10,
7, 4, 9, 8, 0, 0, 7, 5, 5, 10, 7, 4, 5, 5, 8, 4, 2, 4, 8,
1, 3), label = "number of quarters in first 15 with real GDP per capita growth > 3.2", format.stata = "%9.0g")), notes = c("more complete variable definitions in fair.def",
"1"), .Names = c("year", "vote", "party", "person", "duration",
"war", "growth", "inflation", "goodnews"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -33L))
So here is the stargazer code that is giving me trouble:
library(haven)       # for read_dta()
library(stargazer)
presidential <- read_dta("~/Directory/fair4.dta")
pres.lm = lm(vote ~ growth, data = subset(presidential,
                                          presidential$year >= 1916))
stargazer(pres.lm,
          type = "text",
          intercept.bottom = T,
          digits = 5,
          report = "vc*stp",
          ci = T
)
confint(pres.lm, level = 0.95)
Consider the difference in the confidence intervals.
(0.52948, 1.24241) # in R, Stargazer
0.5087671 1.263126 # in R, confint(pres.lm)
.5087671 1.263126 # in Stata
I also calculated the confidence intervals by hand, and the confint() and Stata numbers check out. The t-critical value for this dataset should be t_(N-2, prob) = t(22, .025) = -2.073873.
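That critical value can be confirmed directly in R:
qt(0.025, df = 22)
# [1] -2.073873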
In addition, I made sure to create an entirely new data frame. That is, instead of subsetting within the lm() call, I subset first. Comparing this method to the previous one, I still get the same exact (incorrect) confidence intervals.
# subset into a new dataframe
presidential.1 = subset(presidential, presidential$year >= 1916)
# create the model
pres.lm.2 = lm(vote ~ growth, data = presidential.1)
# compare the two
stargazer(pres.lm,pres.lm.2,
type = "text",
intercept.bottom = F,
digits = 5,
report = "vc*stp",
ci = T,
t.auto = T)
(1) (2)
-----------------------------------------------------------------------
Constant 50.84840*** 50.84840***
(48.86384, 52.83295) (48.86384, 52.83295)
t = 50.21835 t = 50.21835
p = 0.00000 p = 0.00000
growth 0.88595*** 0.88595***
(0.52948, 1.24241) (0.52948, 1.24241)
t = 4.87126 t = 4.87126
p = 0.00008 p = 0.00008
# correct intervals from Stata and R's confint()
growth 0.5087671 1.263126
Am I running the code incorrectly? It really isn't a big deal for me to run the stargazer command and print only the coefficients and the t-stats, but it is kind of disappointing that I would have to run confint() as a separate command given that the output for Stargazer is gorgeous. It is quite odd because the coefficient estimates and the t-statistics are perfect. The confidence intervals are off by varying degrees, and I would like to know what the cause of this might be. Any advice would be greatly appreciated.
The simple answer is that Stata and confint calculate confidence intervals using the t-distribution, while stargazer's internal method uses the normal distribution. The result is that the former two are more conservative in their estimates and thus have wider CIs than stargazer. (Well, I'm assuming this for Stata, but since it gives the same results as confint I feel it is a safe assumption.)
Looking deep into the source code for stargazer (line 688ff) we can find how CIs are calculated:
z.value <- qnorm((1 + .format.ci.level.use)/2)
coef <- .global.coefficients[.global.coefficient.variables[which.variable],i]
se <- .global.std.errors[.global.coefficient.variables[which.variable],i]
ci.lower.bound <- coef - z.value * se
ci.upper.bound <- coef + z.value * se
It uses qnorm to set the critical value.
Compare to confint:
a <- (1 - level)/2
a <- c(a, 1 - a)
fac <- qt(a, object$df.residual) ##Relevant line, uses T-distribution
pct <- format.perc(a, 3)
ci <- array(NA, dim = c(length(parm), 2L), dimnames = list(parm,
pct))
ses <- sqrt(diag(vcov(object)))[parm]
ci[] <- cf[parm] + ses %o% fac
Compare:
#Using normal/z distribution
> pres.lm$coefficients[2] + sqrt(diag(vcov(pres.lm)))[2] %o% c(-qnorm((1 + 0.95)/2), qnorm((1 + 0.95)/2))
[,1] [,2]
growth 0.5294839 1.242409
#Using t-distribution with df degrees of freedom
> df <- pres.lm$df.residual
> pres.lm$coefficients[2] + sqrt(diag(vcov(pres.lm)))[2] %o% c(-qt((1 + 0.95)/2, df), qt((1 + 0.95)/2, df))
[,1] [,2]
growth 0.5087671 1.263126
Probably the easiest way to handle this if you are committed to stargazer is to use the ci.custom argument:
> stargazer(pres.lm, type = "text", ci.custom = list(confint(pres.lm)))
===============================================
Dependent variable:
---------------------------
vote
-----------------------------------------------
growth 0.886***
(0.509, 1.263)
Constant 50.848***
(48.749, 52.948)
-----------------------------------------------
Observations 24
R2 0.519
Adjusted R2 0.497
Residual Std. Error 4.798 (df = 22)
F Statistic 23.729*** (df = 1; 22)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
Once the sample size is sufficiently large, the t-distribution converges on the z-distribution and the differences between the CIs become much smaller.
set.seed(432)
x1 <- rnorm(10000, 100, 50)
u <- 2 * rnorm(10000)
y <- 50 + x1 * 0.752 * u
fit <- lm(y ~ x1)
> confint(fit)
2.5 % 97.5 %
(Intercept) 39.29108955 54.1821315
x1 -0.02782141 0.1061173
> stargazer(fit, type= "text", ci = T)
===============================================
Dependent variable:
---------------------------
y
-----------------------------------------------
x1 0.039
(-0.028, 0.106)
Constant 46.737***
(39.292, 54.181)
-----------------------------------------------
Observations 10,000
R2 0.0001
Adjusted R2 0.00003
Residual Std. Error 168.194 (df = 9998)
F Statistic 1.313 (df = 1; 9998)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
With a sample size of 24, the t-distribution with 22 degrees of freedom has much fatter tails than the z!
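Concretely, with 22 residual degrees of freedom the two critical values differ noticeably:
qt(0.975, df = 22)   # 2.073873, used by confint() and Stata
qnorm(0.975)         # 1.959964, used by stargazer's default CI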
