lm: use product of two variables as a single variable - r

I am running the following piece of code:
lm(ath ~ HAPP + IQ2 + OPEN2 + INCOME*EXPEC,data=data)
Which, of course, leads me to this output:
Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max
-3.2644 -0.5461 -0.0223  0.4158  3.2217

Coefficients (mean model with logit link):
               Estimate Std. Error z value Pr(>|z|)
(Intercept)   5.730e+00  3.141e+00   1.824 0.068112 .
HAPP         -7.765e-02  8.958e-02  -0.867 0.386014
IQ2           5.080e-04  7.453e-05   6.816 9.38e-12 ***
OPEN2        -5.038e-06  5.114e-06  -0.985 0.324640
INCOME       -1.837e-02  1.211e-01  -0.152 0.879395
EXPEC        -3.336e-01  1.161e-01  -2.873 0.004067 **
INCOME:EXPEC  2.645e-03  7.597e-04   3.481 0.000499 ***

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)
(phi)    9.489      1.363    6.96 3.41e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Type of estimator: ML (maximum likelihood)
Log-likelihood: 222.5 on 8 Df
Pseudo R-squared: 0.6938
Number of iterations: 36 (BFGS) + 4 (Fisher scoring)
I need to drop the INCOME and EXPEC lines (with Estimate, Std. Error, z value and Pr(>|z|)) from the regression in a really elegant way (I need to run something like a million models, so I can't do it by hand one by one). Please note that those variables (INCOME and EXPEC) were not included in the original set of individual variables. That is, ONLY the requested variables (and the demanded interactions, of course) should be printed.
Any piece of advice?
Thanks!!! :D

You can use the I() (AsIs) function, which treats the product as a single term. See the example below:
fit <- lm(Sepal.Length ~ Sepal.Width + I(Petal.Length * Petal.Width), data = iris)
fit
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + I(Petal.Length * Petal.Width),
# data = iris)
#
# Coefficients:
# (Intercept) Sepal.Width
# 4.1072 0.2688
# I(Petal.Length * Petal.Width)
# 0.1578
library(broom)
tidy(fit)
# term estimate std.error statistic p.value
# 1 (Intercept) 4.1072163 0.266529393 15.409994 1.702125e-32
# 2 Sepal.Width 0.2687704 0.081280587 3.306698 1.186597e-03
# 3 I(Petal.Length * Petal.Width) 0.1578160 0.007517941 20.991921 4.426899e-46
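Applied to the model in the question (a sketch, assuming your data frame contains those variables):
# Only the requested variables enter individually; the product is a single term
lm(ath ~ HAPP + IQ2 + OPEN2 + I(INCOME * EXPEC), data = data)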

If you only need part of the coefficients, you can just use the coef function from base R and subset the indices you like. For example:
a1 <- lm(Sepal.Length ~ Sepal.Width + I(Petal.Length * Petal.Width), data = iris)
coefficients(a1)[1:2]
(Intercept) Sepal.Width
4.1072163 0.2687704
If you need the formula call as well, you can use a1$call:
a1$call
lm(formula = Sepal.Length ~ Sepal.Width + I(Petal.Length * Petal.Width),
data = iris)
Or, if you need any other component, just take a look at str(a1) or str(summary(a1)).
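If you are extracting the same pieces from many models, subsetting the summary table by name is more robust than positional indices (a sketch; the row and column labels are the ones summary.lm() produces):
# Pull estimate, std. error, t value and p-value for selected terms by name
keep <- c("(Intercept)", "Sepal.Width")
coef(summary(a1))[keep, c("Estimate", "Std. Error", "t value", "Pr(>|t|)")]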

Related

Perform multiple linear regression analysis including interaction terms, interpret results using summary() and diagnostic plots using lm()

I tried to perform a multiple linear regression analysis with code like the following, but with no success. I tried to do it with the lm() function. I think there is a problem with the 'x1*x2' term.
data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))
model <- lm(y ~ x1 + x2 + x1*x2)
summary(model)
plot(model)
It shows me an error.
What should I do?
The error did not occur because of your interaction term; when I tested it, that part worked perfectly. You forgot to specify the data: the lm() function requires you to provide the data frame your variables come from. In the code below I also shortened the formula, because x1*x2 is already sufficient: R expands it to x1 + x2 + x1:x2, so you don't have to repeat the same variable names.
data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))
model <- lm(y ~ x1*x2, data = data)
summary(model)
#>
#> Call:
#> lm(formula = y ~ x1 * x2, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.21772 -0.77564 0.06347 0.56901 2.15324
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.05853 0.09914 -0.590 0.5564
#> x1 0.17384 0.09466 1.836 0.0694 .
#> x2 -0.02830 0.08646 -0.327 0.7442
#> x1:x2 -0.00836 0.07846 -0.107 0.9154
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9792 on 96 degrees of freedom
#> Multiple R-squared: 0.03423, Adjusted R-squared: 0.004055
#> F-statistic: 1.134 on 3 and 96 DF, p-value: 0.3392
Created on 2023-01-14 with reprex v2.0.2
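As a quick check that the two spellings describe the same model (a sketch): their design matrices are identical, so the fits must be too.
# y ~ x1*x2 expands to y ~ x1 + x2 + x1:x2, giving the same model matrix
m_short <- lm(y ~ x1*x2, data = data)
m_long <- lm(y ~ x1 + x2 + x1:x2, data = data)
all.equal(model.matrix(m_short), model.matrix(m_long)) # should be TRUE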

Full versus partial marginal effect using fixest package

I would like to know the full marginal effect of the continuous variable provtariff, given the interaction term Female * provtariff, on the outcome variable log(totalinc), as well as the coefficient of the interaction term.
Using the code:
feols(log(totalinc) ~ i(Female, provtariff) | hhid02 + year,
data = inc0402_p,
weights = ~hhwt,
vcov = ~tinh)
I got the following results:
OLS estimation, Dep. Var.: log(totalinc)
Observations: 24,966
Weights: hhwt
Fixed-effects: hhid02: 11,018, year: 2
Standard-errors: Clustered (tinh)
Estimate Std. Error t value Pr(>|t|)
Female::0:provtariff 5.79524 1.84811 3.13577 0.0026542 **
Female::1:provtariff 2.66994 2.09540 1.27419 0.2075088
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 7.61702 Adj. R2: 0.670289
Within R2: 0.045238
However, when I implement the following code:
feols(log(totalinc) ~ Female*provtariff | hhid02 + year,
data = inc0402_p,
weights = ~hhwt,
vcov = ~tinh)
I get the following results:
OLS estimation, Dep. Var.: log(totalinc)
Observations: 24,966
Weights: hhwt
Fixed-effects: hhid02: 11,018, year: 2
Standard-errors: Clustered (tinh)
Estimate Std. Error t value Pr(>|t|)
Female -0.290019 0.029894 -9.70142 6.6491e-14 ***
provtariff 4.499561 1.884625 2.38751 2.0130e-02 *
Female:provtariff -0.433963 0.170505 -2.54516 1.3512e-02 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 7.52022 Adj. R2: 0.678592
Within R2: 0.069349
Should the provtariff coefficient in the latter model not be the same as the coefficient for Female::0:provtariff in the first model?
No, the two models are clearly different: one includes two parameters and the other includes three, so they won't produce equivalent results. More specifically, one of your models includes only the interactions but no "constitutive" term, whereas the other model includes both.
Here is a reproducible example with a 3rd model that reproduces your model with the * asterisk, but uses the fixest interaction syntax with i(). You'll see that some of the coefficients and standard errors are exactly identical to those in the 2nd model, and that the R2 values are the same. This suggests that m2 and m3 are equivalent:
library(fixest)
library(modelsummary)
library(marginaleffects)
# Your two models, plus a third, equivalent specification
m1 <- feols(mpg ~ i(am, hp) | gear, data = mtcars)
m2 <- feols(mpg ~ am * hp | gear, data = mtcars)
m3 <- feols(mpg ~ am + i(am, hp) | gear, data = mtcars)
models <- list(m1, m2, m3)
modelsummary(models)
                 (1)        (2)        (3)
am = 0 × hp    -0.076                -0.056
               (0.025)               (0.006)
am = 1 × hp    -0.059                -0.071
               (0.009)               (0.021)
am                         5.568      5.568
                          (1.575)    (1.575)
hp                        -0.056
                          (0.006)
am × hp                   -0.015
                          (0.019)
Num.Obs.         32         32         32
R2              0.763      0.797      0.797
Std.Errors    by: gear   by: gear   by: gear
FE: gear         X          X          X
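You can also see the equivalence in the numbers directly (a sketch using coef()): in m2, the hp slope for am == 1 is the sum of the hp and am:hp coefficients, -0.056 + (-0.015) = -0.071, which is exactly the am = 1 × hp estimate reported for m3.
# hp slope when am == 1, recovered from the m2 parameterization
sum(coef(m2)[c("hp", "am:hp")]) # should be about -0.071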
We can further check the equivalence between models 2 and 3 by computing the partial derivative of the outcome with respect to one of the predictors. In economics they call this slope a “marginal effect”, although the terminology changes across disciplines, and I am not sure if that is the quantity you are interested in when you say “marginal effects”:
marginaleffects(m2, variables = "hp") |> summary()
#> Term Contrast Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> 1 hp mean(dY/dX) -0.062 0.01087 -5.705 1.1665e-08 -0.0833 -0.0407
#>
#> Model type: fixest
#> Prediction type: response
marginaleffects(m3, variables = "hp") |> summary()
#> Term Contrast Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> 1 hp mean(dY/dX) -0.062 0.01087 -5.705 1.1665e-08 -0.0833 -0.0407
#>
#> Model type: fixest
#> Prediction type: response

R Fixest Package: IV Estimation Without Further Exogenous Variables

I intend to run instrumental variable regressions with fixed effects using the fixest package's feols function. However, I am having issues with the syntax specifying an estimation without further exogenous controls.
Consider the following example:
# Load package
require("fixest")
# Load data
df <- airquality
I would like to do something like the following, i.e. explaining the outcome via the instrumented endogenous variable and fixed effects:
feols(Temp | Month + Day | Ozone ~ Wind, df)
This, however, produces an error:
The dependent variable is a constant. Estimation cannot be done.
It only works when I add further exogenous covariates (as in the documentation's examples):
feols(Temp ~ Solar.R | Month + Day | Ozone ~ Wind, df)
How do I fix this? How do I run the estimation without further controls, such as Solar.R in this case?
Note: I post this on Stack Overflow rather than Cross Validated because the question relates to a coding syntax issue, and not to the econometric techniques underlying the estimations.
Actually, there seems to be a misunderstanding about how to write the formula.
The syntax is: Dep_var ~ Exo_vars | Fixed-effects | Endo_vars ~ Instruments.
The Fixed-effects and Endo_vars ~ Instruments parts are optional. On the other hand, the Exo_vars part must always be there, even if it contains only the intercept.
Knowing that, the following works:
base = iris
names(base) = c("y", "x1", "x_endo", "x_inst", "fe")
feols(y ~ 1 | x_endo ~ x_inst, base)
#> TSLS estimation, Dep. Var.: y, Endo.: x_endo, Instr.: x_inst
#> Second stage: Dep. Var.: y
#> Observations: 150
#> Standard-errors: Standard
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.345900 0.08096 53.679 < 2.2e-16 ***
#> fit_x_endo 0.398477 0.01964 20.289 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.404769 Adj. R2: 0.757834
#> F-test (1st stage): stat = 1,882.45 , p < 2.2e-16 , on 1 and 148 DoF.
#> Wu-Hausman: stat = 3.9663, p = 0.048272, on 1 and 147 DoF.
# Same with fixed-effect
feols(y ~ 1 | fe | x_endo ~ x_inst, base)
#> TSLS estimation, Dep. Var.: y, Endo.: x_endo, Instr.: x_inst
#> Second stage: Dep. Var.: y
#> Observations: 150
#> Fixed-effects: fe: 3
#> Standard-errors: Clustered (fe)
#> Estimate Std. Error t value Pr(>|t|)
#> fit_x_endo 0.900061 0.117798 7.6407 0.016701 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.333489 Adj. R2: 0.833363
#> Within R2: 0.57177
#> F-test (1st stage): stat = 44.77 , p = 4.409e-10, on 1 and 146 DoF.
#> Wu-Hausman: stat = 0.001472, p = 0.969447 , on 1 and 145 DoF.
Getting back to the original example:
feols(Temp | Month + Day | Ozone ~ Wind, df) means that the dependent variable will be Temp | Month + Day | Ozone, with the pipe here meaning the logical OR. That expression evaluates to TRUE (i.e. 1) for every observation, hence the error message about a constant dependent variable.
To fix it and obtain the appropriate behavior, use feols(Temp ~ 1 | Month + Day | Ozone ~ Wind, df).
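You can verify what the unquoted left-hand side evaluates to (a sketch): | is the logical OR, and since Temp is never zero the whole expression is TRUE for every row, i.e. constant.
# evaluates (Temp | (Month + Day)) | Ozone element-wise
with(df, head(Temp | Month + Day | Ozone))
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE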

Dynamic variable names in R regressions

Being aware of the danger of using dynamic variable names, I am trying to loop over various regression models where different variable specifications are chosen. Usually !!rlang::sym() solves this kind of problem for me just fine, but it somehow fails in regressions. A minimal example would be the following:
y= runif(1000)
x1 = runif(1000)
x2 = runif(1000)
df2= data.frame(y,x1,x2)
summary(lm(y ~ x1+x2, data=df2)) ## works
var = "x1"
summary(lm(y ~ !!rlang::sym(var) + x2, data=df2)) # gives an error
My understanding was that !!rlang::sym(var) takes the value of var (namely x1) and puts it into the code in a way that R treats it as a variable name (not a character string). But I seem to be wrong. Can anyone enlighten me?
Personally, I like to do this with some computing on the language. For me, a combination of bquote with eval is easiest (to remember).
var <- as.symbol(var)
eval(bquote(summary(lm(y ~ .(var) + x2, data = df2))))
#Call:
#lm(formula = y ~ x1 + x2, data = df2)
#
#Residuals:
# Min 1Q Median 3Q Max
#-0.49298 -0.26248 -0.00046 0.24111 0.51988
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.50244 0.02480 20.258 <2e-16 ***
#x1 -0.01468 0.03161 -0.464 0.643
#x2 -0.01635 0.03227 -0.507 0.612
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 0.2878 on 997 degrees of freedom
#Multiple R-squared: 0.0004708, Adjusted R-squared: -0.001534
#F-statistic: 0.2348 on 2 and 997 DF, p-value: 0.7908
I find this superior to any approach that doesn't show the same call as summary(lm(y ~ x1+x2, data=df2)).
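Since the question mentions looping over many specifications, the same pattern drops straight into a loop (a sketch, assuming every candidate variable lives in df2):
# fit one model per candidate variable, each with a clean, expanded call
for (v in c("x1", "x2")) {
  sym <- as.symbol(v)
  print(eval(bquote(summary(lm(y ~ .(sym), data = df2)))))
}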
The bang-bang operator !! only works with functions that support tidy evaluation; it is not part of the core R language, so a base R function like lm() has no idea how to expand such operators. Instead, you need to wrap the call in functions that can do the expansion. rlang::expr() is one such example:
rlang::expr(summary(lm(y ~ !!rlang::sym(var) + x2, data=df2)))
# summary(lm(y ~ x1 + x2, data = df2))
Then you need to use rlang::eval_tidy() to actually evaluate it:
rlang::eval_tidy(rlang::expr(summary(lm(y ~ !!rlang::sym(var) + x2, data=df2))))
# Call:
# lm(formula = y ~ x1 + x2, data = df2)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.49178 -0.25482 0.00027 0.24566 0.50730
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.4953683 0.0242949 20.390 <2e-16 ***
# x1 -0.0006298 0.0314389 -0.020 0.984
# x2 -0.0052848 0.0318073 -0.166 0.868
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.2882 on 997 degrees of freedom
# Multiple R-squared: 2.796e-05, Adjusted R-squared: -0.001978
# F-statistic: 0.01394 on 2 and 997 DF, p-value: 0.9862
You can see this version preserves the expanded formula in the model object.
1) Just use lm(df2). Or, if df2 has additional columns beyond what is shown in the question but we just want to regress on x1 and x2, then:
df3 <- df2[c("y", var, "x2")]
lm(df3)
The following are optional and only apply if it is important that the formula appear in the output as if it had been explicitly given.
Compute the formula fo using the first line below and then run lm as in the second line:
fo <- formula(model.frame(df3))
fm <- do.call("lm", list(fo, quote(df3)))
or just run lm as in the first line below and then write the call into it as in the second line:
fm <- lm(df3)
fm$call <- call("lm", formula = formula(model.frame(df3)), data = quote(df3))
Either one gives this:
> fm
Call:
lm(formula = y ~ x1 + x2, data = df3)
Coefficients:
(Intercept) x1 x2
0.44752 0.04278 0.05011
2) Character string. lm also accepts a character string for the formula, so this works as well. The fn$ prefix causes substitution to occur in the character arguments:
library(gsubfn)
fn$lm("y ~ $var + x2", quote(df2))
or at the expense of more involved code, without gsubfn:
do.call("lm", list(sprintf("y ~ %s + x2", var), quote(df2)))
or, if you don't care that the formula displays without var substituted, just:
lm(sprintf("y ~ %s + x2", var), df2)
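Another base R option worth knowing (a sketch, assuming var is still the character string "x1"): reformulate() builds the formula object directly from character strings, though the printed call will then show fo rather than the expanded formula unless you combine it with the do.call trick above.
# builds y ~ x1 + x2 from character inputs
fo <- reformulate(c(var, "x2"), response = "y")
lm(fo, data = df2)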

summary dataframe from several multiple regression outputs

I am doing multiple OLS regressions. I have used the following lm function:
GroupNetReturnsStockPickers <- read.csv("GroupNetReturnsStockPickers.csv", header=TRUE, sep=",", dec=".")
ModelGroupNetReturnsStockPickers <- lm(StockPickersNet ~ Mkt.RF+SMB+HML+WML, data=GroupNetReturnsStockPickers)
names(GroupNetReturnsStockPickers)
summary(ModelGroupNetReturnsStockPickers)
Which gives me the summary output of:
Call:
lm(formula = StockPickersNet ~ Mkt.RF + SMB + HML + WML, data = GroupNetReturnsStockPickers)
Residuals:
Min 1Q Median 3Q Max
-0.029698 -0.005069 -0.000328 0.004546 0.041948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.655e-05 5.981e-04 0.078 0.938
Mkt.RF -1.713e-03 1.202e-02 -0.142 0.887
SMB 3.006e-02 2.545e-02 1.181 0.239
HML 1.970e-02 2.350e-02 0.838 0.403
WML 1.107e-02 1.444e-02 0.766 0.444
Residual standard error: 0.009029 on 251 degrees of freedom
Multiple R-squared: 0.01033, Adjusted R-squared: -0.005445
F-statistic: 0.6548 on 4 and 251 DF, p-value: 0.624
This is perfect. However, I am doing a total of 10 multiple OLS regressions, and I wish to create my own summary output in a data frame, where I extract the intercept estimate, the t-value, and the p-value for all 10 analyses individually. Hence it would be a 3x10 data frame, where the column names would be Model1, Model2, ..., Model10, and the row names: Value, t-value and p-Value.
I appreciate any help.
There are a few packages that do this (stargazer and texreg), as well as this code for outreg.
In any case, if you are only interested in the intercept here is one approach:
# Estimate a bunch of different models, stored in a list
fits <- list() # Create empty list to store models
fits$model1 <- lm(Ozone ~ Solar.R, data = airquality)
fits$model2 <- lm(Ozone ~ Solar.R + Wind, data = airquality)
fits$model3 <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
# Combine the results for the intercept
do.call(cbind, lapply(fits, function(z) summary(z)$coefficients["(Intercept)", ]))
# RESULT:
# model1 model2 model3
# Estimate 18.598727772 7.724604e+01 -64.342078929
# Std. Error 6.747904163 9.067507e+00 23.054724347
# t value 2.756222869 8.518995e+00 -2.790841389
# Pr(>|t|) 0.006856021 1.052118e-13 0.006226638
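And to get exactly the 3x10 layout described in the question (rows Value, t-value, p-Value; one column per model), keep just those three entries and relabel the rows (a sketch building on the fits list above):
res <- do.call(cbind, lapply(fits, function(z)
  summary(z)$coefficients["(Intercept)", c("Estimate", "t value", "Pr(>|t|)")]))
rownames(res) <- c("Value", "t-value", "p-Value")
res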
Look at the broom package, which was created to do exactly what you are asking for. The only difference is that it puts the models into rows and the different statistics into columns, and I understand that you would prefer the opposite, but you can work around that afterwards if it is really necessary.
To give you an example, the function tidy() converts a model output into a dataframe.
model <- lm(mpg ~ cyl, data=mtcars)
summary(model)
Call:
lm(formula = mpg ~ cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
And
library(broom)
tidy(model)
yields the following data frame:
term estimate std.error statistic p.value
1 (Intercept) 37.88458 2.0738436 18.267808 8.369155e-18
2 cyl -2.87579 0.3224089 -8.919699 6.112687e-10
Look at ?tidy.lm to see more options, for instance for confidence intervals, etc.
To combine the output of your ten models into one dataframe, you could use
library(dplyr)
bind_rows(one, two, three, ... , .id="models")
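For example, if your ten fitted models are collected in a named list (here called fits, a hypothetical name), you can tidy each one and stack the results:
library(broom)
library(dplyr)
# one row per term per model, with a "model" column taken from the list names
bind_rows(lapply(fits, tidy), .id = "model")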
Or, if your different models come from regressions using the same dataframe, you can combine it with dplyr:
models <- mtcars %>% group_by(gear) %>% do(data.frame(tidy(lm(mpg~cyl, data=.), conf.int=T)))
Source: local data frame [6 x 8]
Groups: gear
gear term estimate std.error statistic p.value conf.low conf.high
1 3 (Intercept) 29.783784 4.5468925 6.550360 1.852532e-05 19.960820 39.6067478
2 3 cyl -1.831757 0.6018987 -3.043297 9.420695e-03 -3.132080 -0.5314336
3 4 (Intercept) 41.275000 5.9927925 6.887440 4.259099e-05 27.922226 54.6277739
4 4 cyl -3.587500 1.2587382 -2.850076 1.724783e-02 -6.392144 -0.7828565
5 5 (Intercept) 40.580000 3.3238331 12.208796 1.183209e-03 30.002080 51.1579205
6 5 cyl -3.200000 0.5308798 -6.027730 9.153118e-03 -4.889496 -1.5105036
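On current dplyr versions do() is superseded; the same per-group fit can be written with group_modify() (a sketch):
library(dplyr)
library(broom)
# fit and tidy one model per gear group; .x is the data for each group
mtcars %>%
  group_by(gear) %>%
  group_modify(~ tidy(lm(mpg ~ cyl, data = .x), conf.int = TRUE))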