Stratified Sampling a Dataset and Averaging a Variable within the Train Dataset - r

I'm currently trying to do a stratified split in R to create train and test datasets.
A problem posed to me is the following
split the data into a train and test sample such that 70% of the data
is in the train sample. To ensure a similar distribution of price
across the train and test samples, use createDataPartition from the
caret package. Set groups to 100 and use a seed of 1031. What is the
average house price in the train sample?
The dataset is a set of houses with prices (along with other data points)
For some reason, when I run the following code, the output I get is labeled as incorrect in the practice problem simulator. Can anyone spot an issue with my code? Any help is much appreciated since I'm trying to avoid learning this language incorrectly.
dput(head(houses))
library(ISLR); library(caret); library(caTools)
options(scipen=999)
set.seed(1031)
# STRATIFIED RANDOM SAMPLING with groups of 100, stratified on price, 70% in train
split = createDataPartition(y = houses$price, p = 0.7, list = F, groups = 100)
train = houses[split,]
test = houses[-split,]
nrow(train)
nrow(test)
nrow(houses)
mean(train$price)
mean(test$price)
Output
> dput(head(houses))
structure(list(id = c(7129300520, 6414100192, 5631500400, 2487200875,
1954400510, 7237550310), price = c(221900, 538000, 180000, 604000,
510000, 1225000), bedrooms = c(3, 3, 2, 4, 3, 4), bathrooms = c(1,
2.25, 1, 3, 2, 4.5), sqft_living = c(1180, 2570, 770, 1960, 1680,
5420), sqft_lot = c(5650, 7242, 10000, 5000, 8080, 101930), floors = c(1,
2, 1, 1, 1, 1), waterfront = c(0, 0, 0, 0, 0, 0), view = c(0,
0, 0, 0, 0, 0), condition = c(3, 3, 3, 5, 3, 3), grade = c(7,
7, 6, 7, 8, 11), sqft_above = c(1180, 2170, 770, 1050, 1680,
3890), sqft_basement = c(0, 400, 0, 910, 0, 1530), yr_built = c(1955,
1951, 1933, 1965, 1987, 2001), yr_renovated = c(0, 1991, 0, 0,
0, 0), age = c(59, 63, 82, 49, 28, 13)), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
>
> library(ISLR); library(caret); library(caTools)
> options(scipen=999)
>
> set.seed(1031)
> # STRATIFIED RANDOM SAMPLING with groups of 100, stratified on price, 70% in train
> split = createDataPartition(y = houses$price, p = 0.7, list = F, groups = 100)
>
> train = houses[split,]
> test = houses[-split,]
>
> nrow(train)
[1] 15172
> nrow(test)
[1] 6441
> nrow(houses)
[1] 21613
>
> mean(train$price)
[1] 540674.2
> mean(test$price)
[1] 538707.6

I tried to reproduce it manually using sample_frac from the dplyr package and the cut2 function from the Hmisc package. The results are almost the same, but still not identical.
It looks like there might be a difference in the pseudo-random number generation or in some rounding.
In my opinion, your code is correct.
Is it possible that in a previous step you were supposed to remove some outliers or pre-process the dataset in some way?
library(caret)
options(scipen=999)
library(dplyr)
library(ggplot2) # to use diamonds dataset
library(Hmisc)
diamonds$index = 1:nrow(diamonds)
set.seed(1031)
# I use diamonds dataset from ggplot2 package
# g parameter (in cut2) - number of quantile groups
split = diamonds %>%
  group_by(cut2(diamonds$price, g = 100)) %>%
  sample_frac(0.7) %>%
  pull(index)
train = diamonds[split,]
test = diamonds[-split,]
> mean(train$price)
[1] 3932.75
> mean(test$price)
[1] 3932.917
set.seed(1031)
# STRATIFIED RANDOM SAMPLING with groups of 100, stratified on price, 70% in train
split = createDataPartition(y = diamonds$price, p = 0.7, list = T, groups = 100)
train = diamonds[split$Resample1,]
test = diamonds[-split$Resample1,]
> mean(train$price)
[1] 3932.897
> mean(test$price)
[1] 3932.572
This sampling procedure should result in sample means that approximate the population mean.
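As a quick check of that claim (not part of the original answer, using the diamonds objects created above), you can compare the stratified sample means to the overall mean:
# overall mean vs. the train/test means produced by the stratified split
mean(diamonds$price)
c(train = mean(train$price), test = mean(test$price))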

Related

Rolling average of values that satisfy multiple conditions in R

This is my first question on Stackoverflow, so please bear with me if I make any mistakes or omit necessary information.
I have a dataset consisting of a time series where I need to find the 5-day rolling average of a binary variable for each specific hour of the day. An example of my data can be created using:
library(dplyr)
library(zoo)
set.seed(69)
df <- data.frame(Hour = rep(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                              13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24), times = 10),
                 Reg = rep(round(runif(24*10, 0, 1))),
                 HumidityLevel = rep(runif(24*10, 0, 100)))
df_ranges <- data.frame(LowerRange = rep(cbind(rollapply(df$HumidityLevel, 24, min, by = 24)), each = 24),
                        UpperRange = rep(cbind(rollapply(df$HumidityLevel, 24, max, by = 24)), each = 24))
df <- cbind(df, df_ranges)
I have computed the simple rolling average using the following code:
df <- df %>%
  group_by(Hour) %>%
  mutate(AvgReg = lag(rollapplyr(Reg, 5, mean, na.rm = T, partial = T), n = 1))
What I need to do is compute the rolling average of Reg using previous rows where HumidityLevel lies within the range for that specific day. The lower and upper boundary of the range is determined by two columns (LowerRange, UpperRange). The boundary values are dependent on the lowest and highest HumidityLevel-values for the day.
For instance, a day may have levels between 20 and 54. The rolling average for hour 1 of that specific day should then be computed by using previous Hour 1 observations with a HumidityLevel value above or equal to 20 and below or equal to 54.
I hope that my question makes sense.
This is my desired output:
desired_output <- data.frame(RowNum = c(1:10),
                             Hour = rep(1, times = 10),
                             Reg = c(1, 0, 0, 1, 0, 1, 0, 0, 0, 0),
                             HumidityLevel = c(28.36, 65.02, 1.12, 49.61, 24.50, 98.16, 77.33, 97.03, 47.03, 85.71),
                             LowerBoundary = c(5.67, 7.50, 1.12, 19.32, 0.01, 6.94, 7.48, 0.71, 2.85, 1.59),
                             UpperBoundary = c(93.60, 89.37, 97.25, 99.63, 91.92, 98.16, 98.48, 99.98, 99.70, 98.86),
                             AvgReg = c("NA", 1, 0.5, 0.5, 0.5, 0.5, 0.6, 0.4, 0.4, 0.2))
Using data.table, you can use between() for the filter and shift() + frollmean() for the calculation:
library(data.table)  # for setDT, between, shift, frollmean
setDT(df)[
  between(HumidityLevel, LowerRange, UpperRange),
  new_col := shift(
    frollmean(Reg, c(seq_len(min(5, .N)), rep(5, max(0, .N - 5))), adaptive = TRUE)
  ),
  by = Hour
]
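A minimal way to inspect the result (my addition, assuming the column names from the question; df is already a data.table after the setDT() call above):
# rows whose HumidityLevel falls outside [LowerRange, UpperRange] keep new_col = NA,
# because the := assignment above only touches the filtered rows
df[Hour == 1, .(Reg, HumidityLevel, LowerRange, UpperRange, new_col)]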

How to use for-loop (or apply) in R to filter and select specific columns with specific terms

I am trying to make a table like this -
The table contains several scenarios and risk_type.
The scenarios are basically filters. For example
0 - loan_age > 18
1 - interest_rate > 8%
2 - interest_rate > 18% AND referee == "MALE" AND new_LTV > 50
risk_type are columns in the original dataset like
A - flood risk
B - wildfire risk
C - foundation risk
What I want to do is to create a summary table of all these different risks for all the filters.
This is what the data looks like -
Damage and new LTV are functions of the risk score, and I want to filter for risk score > 4.
Edit - The first 5 rows of the dummy dataframe.
structure(list(ID = c(1, 2, 3, 4, 5), LTV_value = c(43, 43, 32,
34, 35), loan_age = c(17, 65, 32, 33, 221), referee = c("MALE",
"FEMALE", "MALE", "MALE", "FEMALE"), interest_rate = c(0.02,
0.03, 0.05, 0.0633333333333333, 0.0783333333333333), value = c(70000,
80000, 90000, 1e+05, 45000), flood_risk_score = c(3, 4, 5, 0,
1), wildfire_risk_score = c(3, 4, 3, 3, 2), foundation_risk_score = c(5,
5, 2, 0, 1), flood_damage = c(21000, 32000, 45000, 0, 4500),
wildfire_damage = c(21000, 32000, 27000, 30000, 9000), foundation_damage = c(35000,
40000, 18000, 0, 4500), new_LTV_flood = c(40, 39, 27, 34,
34), new_LTV_wildfire = c(40, 39, 29, 31, 33), new_LTV_foundation = c(38,
38, 30, 34, 34)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L))
So far I have tried these methods.
risk_list = c("flood_risk_score",
              "wildfire_risk_score",
              "foundation_risk_score")
for (i in risk_list){
  table <- df %>%
    filter(df[i] > 3) %>%
    summarise(Count = n(),
              mean = mean(value, na.rm = TRUE),
              LTV = mean(LTV))
  # Using rbind() to append the output of one iteration to the dataframe
  table_append = rbind(table_append, table)
}
This helps me get the values for all the risk scores; however, I have two issues here.
I am unable to filter according to a filter list.
For the filter list, I tried this code, but I am unable to add it in a loop -
filters_list = list(which(df$interest > 8),
                    which(df$loan_age > 18))
For the LTV update, each risk type has a different new LTV column.
All of them need to be filtered for high LTV using their new LTV scores.
risk_type_list = c("flood"
, "wildfire"
, "foundation")
for (i in list(paste0(risk_type_list, "_risk_level"))){
  table <- df %>%
    filter(df[paste0(i, "_risk_level")] > 3) %>%
    summarise(Count = n())
  # Using rbind() to append the output of one iteration to the dataframe
  table_append = rbind(table_append, table)
}
In the end, I want to have code that will generate data from the given data by putting in required filters for all different risk types and also use their new LTV values.
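For what it's worth, one possible way to combine a list of scenario filters with the risk-type columns is to store the filters as quoted expressions and loop over both. This is only a sketch based on the column names in the dput above; the thresholds, the referee condition, and the output columns are assumptions rather than the asker's final solution, and the new_LTV > 50 part of scenario 2 is omitted because the relevant new_LTV column depends on the risk type:
library(dplyr)

# scenario filters stored as quoted expressions so they can be applied inside filter()
filters_list = list("0" = quote(loan_age > 18),
                    "1" = quote(interest_rate > 0.08),
                    "2" = quote(interest_rate > 0.18 & referee == "MALE"))
risk_type_list = c("flood", "wildfire", "foundation")

table_append = data.frame()
for (f in names(filters_list)) {
  for (r in risk_type_list) {
    score_col = paste0(r, "_risk_score")
    ltv_col   = paste0("new_LTV_", r)
    table <- df %>%
      filter(!!filters_list[[f]]) %>%      # apply the scenario filter
      filter(.data[[score_col]] > 3) %>%   # keep high-risk rows for this risk type
      summarise(scenario     = f,
                risk_type    = r,
                Count        = n(),
                mean_value   = mean(value, na.rm = TRUE),
                mean_new_LTV = mean(.data[[ltv_col]], na.rm = TRUE))
    table_append = rbind(table_append, table)
  }
}
table_append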

Pooled average marginal effects from survey-weighted and multiple-imputed data

I am working with survey data and their associated weights, in addition to missing data that I imputed using mice(). The model I'm eventually running contains complex interactions between variables for which I want the average marginal effect.
This task seems trivial in Stata, but I'd rather stay in R since that's what I know best. It seems easy to retrieve AMEs for each separate imputed dataset and average the estimates. However, I need to make use of pool() (from mice) to make sure I'm getting the correct standard errors.
Here is a reproducible example:
library(tidyverse)
library(survey)
library(mice)
library(margins)
df <- tibble(y = c(0, 5, 0, 4, 0, 1, 2, 3, 1, 12),
             region = c(1, 1, 1, 1, 1, 3, 3, 3, 3, 3),
             weight = c(7213, 2142, 1331, 4342, 9843, 1231, 1235, 2131, 7548, 2348),
             x1 = c(1.14, 2.42, -0.34, 0.12, -0.9, -1.2, 0.67, 1.24, 0.25, -0.3),
             x2 = c(12, NA, 10, NA, NA, 12, 11, 8, 9, 9))
Using margins() on a simple (non-multiple) svyglm works without a hitch. Running svyglm on each imputation using with() and pooling the results also works well.
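(The question does not show how imputed_df and surv_obj were created. A plausible construction, and purely my assumption, would be to impute with mice() and wrap the completed datasets with mitools::imputationList() so that svydesign() returns a svyimputationList:)
library(mitools)

imputed_df <- mice(df, m = 5, seed = 123)                      # impute the missing x2 values
surv_obj <- svydesign(ids = ~ 1, weights = ~ weight,
                      data = imputationList(complete(imputed_df, action = "all")))
# surv_obj is a svyimputationList, so with(surv_obj, ...) fits svyglm on every imputation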
m <- with(surv_obj, svyglm(y ~ x1 * x2))
pool(m)
However, wrapping margins() into with() returns an error "Error in .svycheck(design) : argument "design" is missing, with no default"
with(surv_obj, margins(svyglm(y ~ x1 * x2), design = surv_obj))
If I specify the design in the svyglm call, I get "Error in UseMethod("svyglm", design) : no applicable method for 'svyglm' applied to an object of class "svyimputationList""
with(surv_obj, margins(svyglm(y ~ x1 * x2, design = surv_obj), design = surv_obj))
If I drop the survey layer, and simply try to run the margins on each imputed set and then pool, I get a warning: "Warning in get.dfcom(object, dfcom) : Infinite sample size assumed.".
m1 <- with(imputed_df, margins(lm(y ~ x1 * x2)))
pool(m1)
This worries me given that pool() may use sample size in its calculations.
Does anyone know of any method to either (a) use with(), margins() and pool() to retrieve the pooled average marginal effects or (b) know what elements of margins() I should pass to pool() (or pool.scalar()) to achieve the desired result?
Update following Vincent's comment
Wanted to update this post following Vincent's comment and the related marginaleffects package, which ended up fixing my issue. Hopefully, this will be helpful to others stuck on similar problems.
I implemented the code in the vignette linked in Vincent's comment, adding a few steps that allow for survey weighting and modeling. It's worth noting that svydesign() will drop any observations missing on clustering/weighting variables, so marginaleffects() can't predict values back onto the original "dat" data and will throw an error. Pooling my actual data still throws an "infinite sample size assumed" warning, which (as noted) should be fine, but I'm still looking into fixes.
library(tidyverse)
library(survey)
library(mice)
library(marginaleffects)
fit_reg <- function(dat) {
  svy <- svydesign(ids = ~ 1, cluster = ~ region, weight = ~ weight, data = dat)
  mod <- svyglm(y ~ x1 + x2 * factor(x3), design = svy)
  out <- marginaleffects(mod, newdata = dat)
  class(out) <- c("custom", class(out))
  return(out)
}

tidy.custom <- function(x, ...) {
  out <- marginaleffects:::tidy.marginaleffects(x, ...)
  out$term <- paste(out$term, out$contrast)
  return(out)
}

df <- tibble(y = c(0, 5, 0, 4, 0, 1, 2, 3, 1, 12),
             region = c(1, 1, 1, 1, 1, 3, 3, 3, 3, 3),
             weight = c(7213, 2142, 1331, 4342, 9843, 1231, 1235, 2131, 7548, 2348),
             x1 = c(1.14, 2.42, -0.34, 0.12, -0.9, -1.2, 0.67, 1.24, 0.25, -0.3),
             x2 = c(12, NA, 10, NA, NA, 12, 11, 8, 9, 9),
             x3 = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2))

imputed_df <- mice(df, m = 2, seed = 123)
dat_mice <- complete(imputed_df, "all")

mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
summary(mod_imputation)

Pooling Survreg Results Across Multiply Imputed Datasets - Error Message: log(1 - 2 * pnorm(width/2)) : NaNs produced

I am trying to run an interval regression using the survival R package (as described here https://stats.oarc.ucla.edu/r/dae/interval-regression/), but I am running into difficulties when trying to pool results across multiply imputed datasets. Specifically, although estimates are returned, I get the following error: log(1 - 2 * pnorm(width/2)) : NaNs produced. The estimates seem reasonable at face value (no NaNs or implausibly large or small SEs).
I ran the same model on the stacked dataset (ignoring imputations) and on individual imputed datasets, and in neither case do I get the error. Would someone be able to explain to me what is going on? Is this an ignorable error? If not, is there a workaround that avoids this error?
Thanks so much!
# A Reproducible Example
require(survival)
require(mice)
require(car)
# Create DF
dat <- data.frame(dv = c(1, 1, 2, 1, 0, NA, 1, 4, NA, 0, 3, 1, 3, 0, 2, 1, 4, NA, 2, 4),
                  catvar1 = factor(c(0, 0, 0, 0, 0, 1, 0, 0, 0, NA, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0)),
                  catvar2 = factor(c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, NA, 0)))
dat_imp <- mice(data = dat)
# Transform Outcome Var for Interval Reg
dat_imp_long <- complete(dat_imp, action = "long", include=TRUE)
# 1-4 correspond to ranges (e.g., 1 = 1 to 2 times...4 = 10 or more)
# create variables that reflect this range
dat_imp_long$dv_low <- car::recode(dat_imp_long$dv, "0 = 0; 1 = 1; 2 = 3; 3 = 6; 4 = 10")
dat_imp_long$dv_high <- car::recode(dat_imp_long$dv, "0 = 0; 1 = 2; 2 = 5; 3 = 9; 4 = 999")
dat_imp_long$dv_high[dat_imp_long$dv_high > 40] <- Inf
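# (not in the original post) quick sanity check of the recoded interval bounds:
# dv = 0 -> [0, 0], 1 -> [1, 2], 2 -> [3, 5], 3 -> [6, 9], 4 -> [10, Inf]
unique(na.omit(dat_imp_long[, c("dv", "dv_low", "dv_high")]))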
# Convert back to mids
dat_mids <- as.mids(dat_imp_long)
# Run Interval Reg
model1 <- with(dat_mids, survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian"))
# Warning message for both calls: In log(1 - 2 * pnorm(width/2)) : NaNs produced
# Problem does not only occur with pool, but summary
summary(model1)
summary(pool(model1))
# Run Equivalent Model on Individual Datasets
# No errors produced
imp1 <- subset(dat_imp_long, .imp == 1)
model2 <- survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian", data = imp1)
summary(model2)
imp2 <- subset(dat_imp_long, .imp == 2)
model3 <- survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian", data = imp2)
summary(model3)
# Equivalent Analysis on Stacked Dataset
# No error
model <- with(dat_imp_long, survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian"))
summary(model)
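(Not an answer from the thread, but if you want to see exactly which internal call produces the NaN warning, one generic option is to print the call stack when the warning is raised:)
# my addition: surface the call stack for the "NaNs produced" warning
withCallingHandlers(
  summary(pool(model1)),
  warning = function(w) {
    print(conditionCall(w))   # the call that emitted the warning
    print(sys.calls())        # the full stack at that point
    invokeRestart("muffleWarning")
  }
)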

Stargazer Confidence Interval Incorrect?

So I am really fond of the stargazer package for displaying the statistics for regression models. I've been using R and Stata together to complete some problems in a textbook. One issue that I have found is that the confidence interval printed by the stargazer package does not correspond to the confidence interval reported by Stata. I determined that the CI in Stata is the correct one after doing it by hand.
Because the issue may lie in how I am handling the data, I offer it here as well. My primary concern is to determine why the CIs do not correspond. From a previous post, here is one possible way of obtaining the data I am using:
install.packages("devtools") # if not already installed
library(devtools)
install_git("https://github.com/ccolonescu/PoEdata")
library(PoEdata) # loads the package in memory
library(multcomp) # for hypo testing
data(fair4) # loads the data set of interest
In Stata, the name of the dataset I am using is called fair4.dta. For the data itself, you can use it manually,
structure(list(year = structure(c(1880, 1884, 1888, 1892, 1896,
1900, 1904, 1908, 1912, 1916, 1920, 1924, 1928, 1932, 1936, 1940,
1944, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984,
1988, 1992, 1996, 2000, 2004, 2008), label = "year", format.stata = "%9.0g"),
vote = structure(c(50.2200012207031, 49.8460006713867, 50.4140014648438,
48.2680015563965, 47.7599983215332, 53.1710014343262, 60.0060005187988,
54.4830017089844, 54.7080001831055, 51.681999206543, 36.1189994812012,
58.2439994812012, 58.8199996948242, 40.8409996032715, 62.4580001831055,
54.9990005493164, 53.773998260498, 52.3699989318848, 44.5950012207031,
57.7639999389648, 49.9129981994629, 61.3440017700195, 49.5960006713867,
61.7890014648438, 48.9480018615723, 44.6969985961914, 59.1699981689453,
53.9020004272461, 46.5449981689453, 54.7360000610352, 50.2649993896484,
51.2330017089844, 46.5999984741211), label = "Incumbent share of the two-party presidential vote", format.stata = "%9.0g"),
party = structure(c(-1, -1, 1, -1, 1, -1, -1, -1, -1, 1,
1, -1, -1, -1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1, -1, 1, -1,
-1, -1, 1, 1, -1, -1), label = "= 1 if Democratic incumbent at election time; -1 if a Republican incumbent", format.stata = "%9.0g"),
person = structure(c(0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1,
0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0,
1, 0), label = "= 1 if incumbent is running for election and 0 otherwise", format.stata = "%9.0g"),
duration = structure(c(1.75, 2, 0, 0, 0, 0, 1, 1.25, 1.5,
0, 1, 0, 1, 1.25, 0, 1, 1.25, 1.5, 1.75, 0, 1, 0, 1, 0, 1,
0, 0, 1, 1.25, 0, 1, 0, 1), label = "number of terms incumbent administration in power", format.stata = "%9.0g"),
war = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), label = "= 1 for elections of 1920, 1944, and 1948 and 0 otherwise.", format.stata = "%9.0g"),
growth = structure(c(3.87899994850159, 1.58899998664856,
-5.55299997329712, 2.76300001144409, -10.0240001678467, -1.42499995231628,
-2.4210000038147, -6.2810001373291, 4.16400003433228, 2.22900009155273,
-11.4630002975464, -3.87199997901917, 4.6230001449585, -14.4989995956421,
11.7650003433228, 3.90199995040894, 4.27899980545044, 3.5789999961853,
0.690999984741211, -1.45099997520447, 0.377000004053116,
5.10900020599365, 5.04300022125244, 5.91400003433228, 3.75099992752075,
-3.59699988365173, 5.44000005722046, 2.17799997329712, 2.66199994087219,
3.12100005149841, 1.21899998188019, 2.69000005722046, 0.219999998807907
), label = "growth rate GDP in first three quarters of the election year", format.stata = "%9.0g"),
inflation = structure(c(1.97399997711182, 1.05499994754791,
0.603999972343445, 2.2739999294281, 3.41000008583069, 2.54800009727478,
1.44200003147125, 1.87899994850159, 2.17199993133545, 4.2519998550415,
0, 5.16099977493286, 0.18299999833107, 7.19999980926514,
2.49699997901917, 0.0810000002384186, 0, 0, 2.36199998855591,
1.93499994277954, 1.96700000762939, 1.25999999046326, 3.13899993896484,
4.81500005722046, 7.63000011444092, 7.83099985122681, 5.25899982452393,
2.90599989891052, 3.27999997138977, 2.06200003623962, 1.60500001907349,
2.32500004768372, 2.88000011444092), label = "growth rate of GDP deflator during first 15 quarters of admin", format.stata = "%9.0g"),
goodnews = structure(c(9, 2, 3, 7, 6, 7, 5, 8, 8, 3, 0, 10,
7, 4, 9, 8, 0, 0, 7, 5, 5, 10, 7, 4, 5, 5, 8, 4, 2, 4, 8,
1, 3), label = "number of quarters in first 15 with real GDP per capita growth > 3.2", format.stata = "%9.0g")), notes = c("more complete variable definitions in fair.def",
"1"), .Names = c("year", "vote", "party", "person", "duration",
"war", "growth", "inflation", "goodnews"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -33L))
So here is the stargazer code that is giving me trouble:
library(haven)      # for read_dta()
library(stargazer)
presidential <- read_dta("~/Directory/fair4.dta")
pres.lm = lm(vote ~ growth, data = subset(presidential,
                                          presidential$year >= 1916))
stargazer(pres.lm,
          type = "text",
          intercept.bottom = T,
          digits = 5,
          report = "vc*stp",
          ci = T
          )
confint(pres.lm, level = 0.95)
Consider the difference in the confidence intervals.
(0.52948, 1.24241) # in R, Stargazer
0.5087671 1.263126 # in R, confint(pres.lm)
.5087671 1.263126 # in Stata
I also calculated the confidence intervals by hand, and the confint() and Stata numbers check out. The t-critical value for this dataset should be t_(N-2, prob) = t(22, .025) = -2.073873.
In addition, I made sure to create an entirely new data frame. That is, instead of subsetting within the lm() call, I subset first. When comparing this method to the previous one, I still get the same exact (incorrect) confidence intervals.
# subset into a new dataframe
presidential.1 = subset(presidential, presidential$year >= 1916)
# create the model
pres.lm.2 = lm(vote ~ growth, data = presidential.1)
# compare the two
stargazer(pres.lm, pres.lm.2,
          type = "text",
          intercept.bottom = F,
          digits = 5,
          report = "vc*stp",
          ci = T,
          t.auto = T)
                 (1)                    (2)
-----------------------------------------------------------------------
Constant         50.84840***            50.84840***
                 (48.86384, 52.83295)   (48.86384, 52.83295)
                 t = 50.21835           t = 50.21835
                 p = 0.00000            p = 0.00000
growth           0.88595***             0.88595***
                 (0.52948, 1.24241)     (0.52948, 1.24241)
                 t = 4.87126            t = 4.87126
                 p = 0.00008            p = 0.00008
# correct intervals from Stata and R's confint()
growth 0.5087671 1.263126
Am I running the code incorrectly? It really isn't a big deal for me to run the stargazer command and print only the coefficients and the t-stats, but it is kind of disappointing that I would have to run confint() as a separate command given that the output for Stargazer is gorgeous. It is quite odd because the coefficient estimates and the t-statistics are perfect. The confidence intervals are off by varying degrees, and I would like to know what the cause of this might be. Any advice would be greatly appreciated.
The simple answer is that Stata and confint() calculate confidence intervals using the t-distribution, while stargazer's internal method uses the normal distribution. The result is that the former two are more conservative in their estimates and thus have wider CIs than stargazer. (Well, I'm assuming this for Stata, but since it gives the same results as confint() I feel it is a safe assumption.)
Looking deep into the source code for stargazer (line 688ff) we can find how CIs are calculated:
z.value <- qnorm((1 + .format.ci.level.use)/2)
coef <- .global.coefficients[.global.coefficient.variables[which.variable],i]
se <- .global.std.errors[.global.coefficient.variables[which.variable],i]
ci.lower.bound <- coef - z.value * se
ci.upper.bound <- coef + z.value * se
It uses qnorm to set the critical value.
Compare to confint:
a <- (1 - level)/2
a <- c(a, 1 - a)
fac <- qt(a, object$df.residual) ##Relevant line, uses T-distribution
pct <- format.perc(a, 3)
ci <- array(NA, dim = c(length(parm), 2L), dimnames = list(parm,
pct))
ses <- sqrt(diag(vcov(object)))[parm]
ci[] <- cf[parm] + ses %o% fac
Compare:
#Using normal/z distribution
> pres.lm$coefficients[2] + sqrt(diag(vcov(pres.lm)))[2] %o% c(-qnorm((1 + 0.95)/2), qnorm((1 + 0.95)/2))
[,1] [,2]
growth 0.5294839 1.242409
#Using t-distribution with df degrees of freedom
> df <- pres.lm$df.residual
> pres.lm$coefficients[2] + sqrt(diag(vcov(pres.lm)))[2] %o% c(-qt((1 + 0.95)/2, df), qt((1 + 0.95)/2, df))
[,1] [,2]
growth 0.5087671 1.263126
Probably the easiest way to handle this if you are committed to stargazer is to use the ci.custom argument:
> stargazer(pres.lm, type = "text", ci.custom = list(confint(pres.lm)))
===============================================
Dependent variable:
---------------------------
vote
-----------------------------------------------
growth 0.886***
(0.509, 1.263)
Constant 50.848***
(48.749, 52.948)
-----------------------------------------------
Observations 24
R2 0.519
Adjusted R2 0.497
Residual Std. Error 4.798 (df = 22)
F Statistic 23.729*** (df = 1; 22)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
Once the sample size is sufficiently large, the t-distribution converges on the z-distribution and the differences between the CIs become much smaller.
set.seed(432)
x1 <- rnorm(10000, 100, 50)
u <- 2 * rnorm(10000)
y <- 50 + x1 * 0.752 * u
fit <- lm(y ~ x1)
> confint(fit)
2.5 % 97.5 %
(Intercept) 39.29108955 54.1821315
x1 -0.02782141 0.1061173
> stargazer(fit, type= "text", ci = T)
===============================================
Dependent variable:
---------------------------
y
-----------------------------------------------
x1 0.039
(-0.028, 0.106)
Constant 46.737***
(39.292, 54.181)
-----------------------------------------------
Observations 10,000
R2 0.0001
Adjusted R2 0.00003
Residual Std. Error 168.194 (df = 9998)
F Statistic 1.313 (df = 1; 9998)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
With a sample size of 24, the t-distribution with 22 degrees of freedom has much fatter tails than the z!
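(A quick numeric check of that point, which is not in the original answer:)
# 97.5% critical values: t with 22 df vs. the normal used by stargazer
qt(0.975, df = 22)     # ~2.0739, what confint() uses here
qnorm(0.975)           # ~1.9600, what stargazer uses internally
qt(0.975, df = 9998)   # ~1.9602, essentially equal to the z value at large n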
