Omit multiple factors in texreg

When using texreg I frequently use omit.coef to remove certain estimates (for fixed effects) as below.
screenreg(lm01,omit.coef='STORE_ID',custom.model.names = c("AA"))
If I use multiple fixed effects in my lm model, how can I omit multiple variables? For example, say I have two types of fixed effects, STORE_ID and Year.
This does not work.
screenreg(lm01,omit.coef=c('STORE_ID','Year'),custom.model.names = c("AA"))

You'd have to use a regex instead, with the names separated by |. Example:
fit <- lm(mpg ~ cyl + disp + hp + drat, mtcars)
texreg::screenreg(fit)
# =====================
# Model 1
# ---------------------
# (Intercept) 23.99 **
# (7.99)
# cyl -0.81
# (0.84)
# disp -0.01
# (0.01)
# hp -0.02
# (0.02)
# drat 2.15
# (1.60)
# ---------------------
# R^2 0.78
# Adj. R^2 0.75
# Num. obs. 32
# =====================
# *** p < 0.001; ** p < 0.01; * p < 0.05
Now omitting:
texreg::screenreg(fit, omit.coef=c('disp|hp|drat'))
# =====================
# Model 1
# ---------------------
# (Intercept) 23.99 **
# (7.99)
# cyl -0.81
# (0.84)
# ---------------------
# R^2 0.78
# Adj. R^2 0.75
# Num. obs. 32
# =====================
# *** p < 0.001; ** p < 0.01; * p < 0.05
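If the names already sit in a vector, as in the question, you can collapse them into a single regex; a sketch using the question's lm01:
to_omit <- c('STORE_ID', 'Year')
screenreg(lm01, omit.coef = paste(to_omit, collapse = '|'),
          custom.model.names = c("AA"))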

Alternatively, screenreg accepts the custom.coef.map option, which takes a named list. It lets you directly select the variables you want to KEEP (instead of omitting with omit.coef) and rename them at the same time:
screenreg(lm01, custom.coef.map = list("var1" = "First variable",
"var2" = "Second Variable", "var3" = "Third variable"))

Does the `by` argument in `avg_comparisons` compute the strata-specific marginal effect?

I'm analyzing data from an AB test we just finished running. Our outcome is binary, y, and we have stratified results by a third variable, g.
Because the intervention's effect could vary by g, I've fit a Poisson regression with robust covariance estimation as follows:
library(tidyverse)
library(sandwich)
library(marginaleffects)
fit <- glm(y ~ treatment * g, data=model_data, family=poisson, offset=log(n_users))
From here, I'd like to know the strata-specific causal risk ratio (which we usually call "lift" in industry). My approach is to use avg_comparisons as follows:
avg_comparisons(fit,
variables = 'treatment',
newdata = model_data,
transform_pre = 'lnratioavg',
transform_post = exp,
by=c('g'),
vcov = 'HC')
The result seems to be consistent with calculations of the lift when I filter the data by groups in g.
Question
By passing by=c('g'), am I actually calculating the strata-specific risk ratios, as I suspect? Are there any hidden "gotchas" or things I have failed to consider?
I can provide data and a minimal working example if need be.
Here’s a very simple base R example to show what is happening under the hood:
library(marginaleffects)
fit <- glm(carb ~ hp * am, data = mtcars, family = poisson)
Unit-level estimates of the log ratio associated with a change of 1 in hp:
cmp <- comparisons(fit, variables = "hp", transform_pre = "lnratio")
cmp
#
# Term Contrast Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# hp +1 0.0054 0.0027 2.0 0.047 0.00007 0.0107
# hp +1 0.0054 0.0027 2.0 0.047 0.00007 0.0107
# --- 22 rows omitted. See ?avg_comparisons and ?print.marginaleffects ---
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# hp +1 0.0056 0.0016 3.6 <0.001 0.00252 0.0086
# Prediction type: response
# Columns: rowid, type, term, contrast, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo, carb, hp, am
This is equivalent to:
# prediction grids with 1 unit difference
lo <- transform(mtcars, hp = hp - .5)
hi <- transform(mtcars, hp = hp + .5)
# predictions on response scale
y_lo <- predict(fit, newdata = lo, type = "response")
y_hi <- predict(fit, newdata = hi, type = "response")
# log ratio
lnratio <- log(y_hi / y_lo)
# equivalent to `comparisons()`
all(cmp$estimate == lnratio)
# [1] TRUE
Now we take the strata-specific means, with mean() inside log():
by(data.frame(am = lo$am, y_lo, y_hi),
mtcars$am,
FUN = \(x) log(mean(x$y_hi) / mean(x$y_lo)))
# mtcars$am: 0
# [1] 0.005364414
# ------------------------------------------------------------
# mtcars$am: 1
# [1] 0.005566092
Same as:
avg_comparisons(fit, variables = "hp", by = "am", transform_pre = "lnratio") |>
print(digits = 7)
#
# Term Contrast am Estimate Std. Error z Pr(>|z|) 2.5 %
# hp mean(+1) 0 0.005364414 0.002701531 1.985694 0.04706726 6.951172e-05
# hp mean(+1) 1 0.005566092 0.001553855 3.582118 < 0.001 2.520592e-03
# 97.5 %
# 0.010659317
# 0.008611592
#
# Prediction type: response
# Columns: type, term, contrast, am, estimate, std.error, statistic, p.value, conf.low, conf.high, predicted, predicted_hi, predicted_lo
See the list of transformation functions here: https://vincentarelbundock.github.io/marginaleffects/reference/comparisons.html#transformations
The only caveat is that by applies the function within strata.
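Applied to the original question, the call would follow the pattern sketched below on the same mtcars fit (note that later marginaleffects releases renamed the transform_pre/transform_post arguments):
# exponentiate the averaged log ratio to get a strata-specific ratio ("lift"),
# with robust (sandwich) standard errors as in the question
avg_comparisons(fit,
                variables = "hp",
                by = "am",
                transform_pre = "lnratioavg",
                transform_post = exp,
                vcov = "HC")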

R Stargazer Getting p-values into one line

stargazer(model1, model2, title = "Models", header=FALSE,
dep.var.labels.include = FALSE,
column.labels = c("Count", "Percentage"),
style = "ajs",
report = "vcp*",
single.row = TRUE)
This is my code to create regression tables with stargazer. However, the p-value still shows up below the coefficient estimates. How do I get p-values to show up next to the coefficient estimates?
You can replace the standard errors with p-values by passing them through the se argument. Put the models into a list, which lets you use lapply; with report = "vcs*", the values in the SE slot (now the p-values) are printed next to the coefficient estimates.
model1 <- lm(mpg ~ hp, mtcars)
model2 <- lm(mpg ~ hp + cyl, mtcars)
model.lst <- list(model1, model2)
stargazer::stargazer(model.lst, title = "Models", header=FALSE,
dep.var.labels.include = FALSE,
column.labels = c("Count", "Percentage"),
style = "ajs",
report = "vcs*",
single.row = TRUE, type="text",
se=lapply(model.lst, function(x) summary(x)$coef[,4]))
# Models
# =================================================================
# Count Percentage
# 1 2
# -----------------------------------------------------------------
# hp -.068 (0.000)*** -.019 (.213)
# cyl -2.265 (0.000)***
# Constant 30.099 (0.000)*** 36.908 (0.000)***
# Observations 32 32
# R2 .602 .741
# Adjusted R2 .589 .723
# Residual Std. Error 3.863 (df = 30) 3.173 (df = 29)
# F Statistic 45.460*** (df = 1; 30) 41.422*** (df = 2; 29)
# -----------------------------------------------------------------
# Notes: *P < .05
# **P < .01
# ***P < .001
Note that this is also possible with texreg, which looks a little cleaner, and the package is well maintained. The p-values are passed to both override.se (so they appear in parentheses) and override.pvalues (so the significance stars stay correct):
texreg::screenreg(model.lst, single.row=TRUE,
reorder.coef=c(2:3, 1),
custom.model.names=c("Count", "Percentage"),
override.se=lapply(model.lst, function(x) summary(x)$coef[,4]),
override.pvalues=lapply(model.lst, function(x) summary(x)$coef[,4]),
digits=3
)
# ===================================================
# Count Percentage
# ---------------------------------------------------
# hp -0.068 (0.000) *** -0.019 (0.213)
# cyl -2.265 (0.000) ***
# (Intercept) 30.099 (0.000) *** 36.908 (0.000) ***
# ---------------------------------------------------
# R^2 0.602 0.741
# Adj. R^2 0.589 0.723
# Num. obs. 32 32
# ===================================================
# *** p < 0.001; ** p < 0.01; * p < 0.05

Running a regression

Background: my data set has 52 rows and 12 columns (assume column names are A - L) and the name of my data set is foo
I am told to run a regression where foo$L is the dependent variable, and all other variables are independent except for foo$K.
The way I was doing it is
fit <- lm(foo$L ~ foo$A + ... + foo$J)
then calling
summary(fit)
Is my way a good way to run a regression and find the intercept and coefficients?
Use the data argument to lm so you don't have to use the foo$ syntax for each predictor. Use dependent ~ . as the formula to have the dependent variable predicted by all other variables. Then you can use - K to exclude K:
data_mat = matrix(rnorm(52 * 12), nrow = 52)
df = as.data.frame(data_mat)
colnames(df) = LETTERS[1:12]
lm(L ~ . - K, data = df)
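To get the intercept and coefficients the question asks about, the usual extractors work on the fitted object:
fit <- lm(L ~ . - K, data = df)
coef(fit)      # named vector: intercept plus one slope per predictor
summary(fit)   # full coefficient table with standard errors and p-values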
You can first remove the column K and then do fit <- lm(L ~ ., data = foo). This treats the L column as the dependent variable and all the other columns as independent variables, so you don't have to spell out each column name in the formula.
Here is an example using the mtcars data, fitting a multiple regression model for mpg with all the other variables except carb.
mtcars2 <- mtcars[, !names(mtcars) %in% "carb"]
fit <- lm(mpg ~ ., data = mtcars2)
summary(fit)
# Call:
# lm(formula = mpg ~ ., data = mtcars2)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3.3038 -1.6964 -0.1796 1.1802 4.7245
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 12.83084 18.18671 0.706 0.48790
# cyl -0.16881 0.99544 -0.170 0.86689
# disp 0.01623 0.01290 1.259 0.22137
# hp -0.02424 0.01811 -1.339 0.19428
# drat 0.70590 1.56553 0.451 0.65647
# wt -4.03214 1.33252 -3.026 0.00621 **
# qsec 0.86829 0.68874 1.261 0.22063
# vs 0.36470 2.05009 0.178 0.86043
# am 2.55093 2.00826 1.270 0.21728
# gear 0.50294 1.32287 0.380 0.70745
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 2.593 on 22 degrees of freedom
# Multiple R-squared: 0.8687, Adjusted R-squared: 0.8149
# F-statistic: 16.17 on 9 and 22 DF, p-value: 9.244e-08

Confidence interval for sigma in a purely fixed effect model

Is there a standard way to estimate a confidence interval for the variance parameter of a linear model with only fixed effects? E.g. given:
reg=lm(formula = 100/mpg ~ disp + hp + wt + am, data = mtcars)
how can I get a confidence interval for the variance parameter? confint only covers the fixed effects, and lmer from lme4 does not accept a model without a level-2 random effect, which is my case here.
Unfortunately, you have to implement it yourself, like so:
reg <- lm(formula = 100/mpg ~ disp + hp + wt + am, data = mtcars)
alpha <- 0.05
df <- reg$df.residual        # residual degrees of freedom: n - p = 32 - 5 = 27
s2 <- summary(reg)$sigma^2   # residual variance estimate s^2
df * s2 / qchisq(1 - alpha/2, df = df)   # lower bound, approx. 0.285
df * s2 / qchisq(alpha/2, df = df)       # upper bound, approx. 0.845
It comes from the relation (n - p) * s^2 / sigma^2 ~ chi-squared(n - p); inverting the chi-squared quantiles gives [(n - p) s^2 / q(1 - alpha/2), (n - p) s^2 / q(alpha/2)] as the confidence interval for sigma^2.
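Wrapped as a small convenience function (a sketch; sigma2_ci is just an illustrative name):
sigma2_ci <- function(model, level = 0.95) {
  a  <- 1 - level
  df <- model$df.residual
  s2 <- summary(model)$sigma^2
  # chi-squared interval for the residual variance sigma^2
  c(lower = df * s2 / qchisq(1 - a/2, df),
    upper = df * s2 / qchisq(a/2, df))
}
sigma2_ci(reg)         # interval for sigma^2
sqrt(sigma2_ci(reg))   # interval for sigma itself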
I assume you are looking for the summary() function.
The code below shows it:
data(mtcars)
reg<-lm(formula = 100/mpg ~ disp + hp + wt + am, data = mtcars)
summary(reg)
# Call:
# lm(formula = 100/mpg ~ disp + hp + wt + am, data = mtcars)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.6923 -0.3901 0.0579 0.3649 1.2608
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.740648 0.738594 1.003 0.32487
# disp 0.002703 0.002715 0.996 0.32832
# hp 0.005275 0.003253 1.621 0.11657
# wt 1.001303 0.302761 3.307 0.00267 **
# am 0.155815 0.375515 0.415 0.68147
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.6754 on 27 degrees of freedom
# Multiple R-squared: 0.8527, Adjusted R-squared: 0.8309
# F-statistic: 39.08 on 4 and 27 DF, p-value: 7.369e-11
To extract pieces of it, you can store the summary in a variable and select the coefficients:
summa<-summary(reg)
summa$coefficients
With that, you can pick out the standard error of the coefficient you want and build the confidence interval at the level of interest.
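For instance, a hand-rolled 95% interval for the wt coefficient (a sketch; the confint() call below does the same for every coefficient):
est <- summa$coefficients["wt", "Estimate"]
se  <- summa$coefficients["wt", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = reg$df.residual) * se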
R does it automatically using confint(object, parm, level).
In your case, confint(reg, level = 0.95)

How to update `lm` or `glm` model on same subset of data?

I am trying to fit two nested models and then test those against each other using anova function. The commands used are:
probit <- glm(grad ~ afqt1 + fhgc + mhgc + hisp + black + male, data=dt,
family=binomial(link = "probit"))
nprobit <- update(probit, . ~ . - afqt1)
anova(nprobit, probit, test="Rao")
However, the variable afqt1 apparently contains NAs, and because the update call does not take the same subset of data, anova() returns an error:
Error in anova.glmlist(c(list(object), dotargs), dispersion = dispersion, :
models were not all fitted to the same size of dataset
Is there a simple way how to achieve refitting the model on the same dataset as the original model?
As suggested in the comments, a straightforward approach to this is to use the model data from the first fit (e.g. probit) and update's ability to overwrite arguments from the original call.
Here's a reproducible example:
data(mtcars)
mtcars[1,2] <- NA
nobs( xa <- lm(mpg~cyl+disp, mtcars) )
## [1] 31
nobs( update(xa, .~.-cyl) ) ##not nested
## [1] 32
nobs( xb <- update(xa, .~.-cyl, data=xa$model) ) ##nested
## [1] 31
It is easy enough to define a convenience wrapper around this:
update_nested <- function(object, formula., ..., evaluate = TRUE){
  update(object = object, formula. = formula., data = object$model,
         ..., evaluate = evaluate)
}
This forces the data argument of the updated call to re-use the data from the first model fit.
nobs( xc <- update_nested(xa, .~.-cyl) )
## [1] 31
all.equal(xb, xc) ##only the `call` component will be different
## [1] "Component “call”: target, current do not match when deparsed"
identical(xb[-10], xc[-10]) ## component 10 of an lm object is its `call`
## [1] TRUE
So now you can easily do anova:
anova(xa, xc)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp
## Model 2: mpg ~ disp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 269.97
## 2 29 312.96 -1 -42.988 4.4584 0.04378 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The other approach suggested is na.omit on the data frame prior to the lm() call. At first I thought this would be impractical when dealing with a big data frame (e.g. 1000 cols) and a large number of vars across the various specifications (e.g. ~15 vars), not because of speed, but because it requires manual bookkeeping of which vars should be sanitized of NAs and which shouldn't, which is precisely what the OP seems intent on avoiding. The biggest drawback would be that you must always keep the formula in sync with the subsetted data frame.
This however can be overcome rather easily, as it turns out:
data(mtcars)
for(i in 1:ncol(mtcars)) mtcars[i,i] <- NA
nobs( xa <- lm(mpg~cyl + disp + hp + drat + wt + qsec + vs + am + gear +
carb, mtcars) )
## [1] 21
nobs( xb <- update(xa, .~.-cyl) ) ##not nested
## [1] 22
nobs( xb <- update_nested(xa, .~.-cyl) ) ##nested
## [1] 21
nobs( xc <- update(xa, .~.-cyl, data=na.omit(mtcars[ , all.vars(formula(xa))])) ) ##nested
## [1] 21
all.equal(xb, xc)
## [1] "Component “call”: target, current do not match when deparsed"
identical(xb[-10], xc[-10])
## [1] TRUE
anova(xa, xc)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 10 104.08
## 2 11 104.42 -1 -0.34511 0.0332 0.8591
