I intend to run instrumental variable regressions with fixed effects using the fixest package's feols function. However, I am having issues with the syntax specifying an estimation without further exogenous controls.
Consider the following example:
# Load package
require("fixest")
# Load data
df <- airquality
I would like to do something like the following, i.e. explain the outcome via the instrumented endogenous variable and fixed effects:
feols(Temp | Month + Day | Ozone ~ Wind, df)
This, however, produces an error:
The dependent variable is a constant. Estimation cannot be done.
It only works when I add further exogenous covariates (as in the documentation's examples):
feols(Temp ~ Solar.R | Month + Day | Ozone ~ Wind, df)
How do I fix this? How do I run the estimation without further controls, such as Solar.R in this case?
Note: I post this on Stack Overflow rather than Cross Validated because the question relates to a coding syntax issue, and not to the econometric techniques underlying the estimations.
Actually, there seems to be a misunderstanding about how to write the formula.
The syntax is: Dep_var ~ Exo_vars | Fixed-effects | Endo_vars ~ Instruments.
The Fixed-effects and Endo_vars ~ Instruments parts are optional. The Exo_vars part, on the other hand, must always be present, even if it contains only the intercept.
Knowing that, the following works:
base = iris
names(base) = c("y", "x1", "x_endo", "x_inst", "fe")
feols(y ~ 1 | x_endo ~ x_inst, base)
#> TSLS estimation, Dep. Var.: y, Endo.: x_endo, Instr.: x_inst
#> Second stage: Dep. Var.: y
#> Observations: 150
#> Standard-errors: Standard
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.345900 0.08096 53.679 < 2.2e-16 ***
#> fit_x_endo 0.398477 0.01964 20.289 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.404769 Adj. R2: 0.757834
#> F-test (1st stage): stat = 1,882.45 , p < 2.2e-16 , on 1 and 148 DoF.
#> Wu-Hausman: stat = 3.9663, p = 0.048272, on 1 and 147 DoF.
# Same with fixed-effect
feols(y ~ 1 | fe | x_endo ~ x_inst, base)
#> TSLS estimation, Dep. Var.: y, Endo.: x_endo, Instr.: x_inst
#> Second stage: Dep. Var.: y
#> Observations: 150
#> Fixed-effects: fe: 3
#> Standard-errors: Clustered (fe)
#> Estimate Std. Error t value Pr(>|t|)
#> fit_x_endo 0.900061 0.117798 7.6407 0.016701 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.333489 Adj. R2: 0.833363
#> Within R2: 0.57177
#> F-test (1st stage): stat = 44.77 , p = 4.409e-10, on 1 and 146 DoF.
#> Wu-Hausman: stat = 0.001472, p = 0.969447 , on 1 and 145 DoF.
Getting back to the original example:
In feols(Temp | Month + Day | Ozone ~ Wind, df), everything to the left of the ~ is read as the dependent variable, i.e. Temp | Month + Day | Ozone, where the pipe means the logical OR. That expression evaluates to 1 for all observations, hence the error message.
To fix it and obtain an appropriate behavior, use feols(Temp ~ 1 | Month + Day | Ozone ~ Wind, df).
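As a quick check of the logical-OR reading described above, the mistaken left-hand side can be evaluated on its own in base R. This is only a sketch on the built-in airquality data; the name lhs is introduced here for illustration and is not part of fixest.

```r
# Evaluate the mistaken left-hand side by itself, outside of feols().
# `|` is R's element-wise logical OR; any non-zero number counts as TRUE.
lhs <- with(airquality, Temp | Month + Day | Ozone)

# Temp is never zero, so the OR is TRUE in every row,
# even in rows where Ozone is NA (TRUE | NA is TRUE).
all(lhs)
#> [1] TRUE
length(unique(lhs))
#> [1] 1
```

A single distinct value is exactly what "the dependent variable is a constant" refers to.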
I would like to know the full marginal effect of the continuous variable provtariff given the interaction term Female * provtariff on the outcome variable log(totalinc), as well as the coefficient of the interaction term.
Using the code:
feols(log(totalinc) ~ i(Female, provtariff) | hhid02 + year,
data = inc0402_p,
weights = ~hhwt,
vcov = ~tinh)
I got the following results
OLS estimation, Dep. Var.: log(totalinc)
Observations: 24,966
Weights: hhwt
Fixed-effects: hhid02: 11,018, year: 2
Standard-errors: Clustered (tinh)
Estimate Std. Error t value Pr(>|t|)
Female::0:provtariff 5.79524 1.84811 3.13577 0.0026542 **
Female::1:provtariff 2.66994 2.09540 1.27419 0.2075088
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 7.61702 Adj. R2: 0.670289
Within R2: 0.045238
However, when I implement the following code
feols(log(totalinc) ~ Female*provtariff | hhid02 + year,
data = inc0402_p,
weights = ~hhwt,
vcov = ~tinh)
I get the following results
OLS estimation, Dep. Var.: log(totalinc)
Observations: 24,966
Weights: hhwt
Fixed-effects: hhid02: 11,018, year: 2
Standard-errors: Clustered (tinh)
Estimate Std. Error t value Pr(>|t|)
Female -0.290019 0.029894 -9.70142 6.6491e-14 ***
provtariff 4.499561 1.884625 2.38751 2.0130e-02 *
Female:provtariff -0.433963 0.170505 -2.54516 1.3512e-02 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 7.52022 Adj. R2: 0.678592
Within R2: 0.069349
Should the provtariff coefficient in the latter model not be the same as the coefficient for Female::0:provtariff in the first model?
No, clearly the two models are different because one includes two parameters and the other one includes 3. They won’t produce equivalent results. More specifically, one of your models includes only the interactions, but no “constitutive” term, whereas the other model includes both.
Here is a reproducible example with a 3rd model that reproduces your model with the * asterisk, but uses the fixest interaction syntax with i(). You’ll see that some of the coefficients and standard errors are exactly identical to those in the 2nd model, and that R2 are the same. This suggests that m2 and m3 are equivalent:
library(fixest)
library(modelsummary)
library(marginaleffects)
# Your
m1 <- feols(mpg ~ i(am, hp) | gear, data = mtcars)
m2 <- feols(mpg ~ am * hp | gear, data = mtcars)
m3 <- feols(mpg ~ am + i(am, hp) | gear, data = mtcars)
models <- list(m1, m2, m3)
modelsummary(models)
                  (1)       (2)       (3)
am = 0 × hp    -0.076              -0.056
              (0.025)             (0.006)
am = 1 × hp    -0.059              -0.071
              (0.009)             (0.021)
am                        5.568     5.568
                        (1.575)   (1.575)
hp                       -0.056
                        (0.006)
am × hp                  -0.015
                        (0.019)
Num.Obs.           32        32        32
R2              0.763     0.797     0.797
Std.Errors   by: gear  by: gear  by: gear
FE: gear            X         X         X
We can further check the equivalence between models 2 and 3 by computing the partial derivative of the outcome with respect to one of the predictors. In economics they call this slope a “marginal effect”, although the terminology changes across disciplines, and I am not sure if that is the quantity you are interested in when you say “marginal effects”:
marginaleffects(m2, variables = "hp") |> summary()
#> Term Contrast Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> 1 hp mean(dY/dX) -0.062 0.01087 -5.705 1.1665e-08 -0.0833 -0.0407
#>
#> Model type: fixest
#> Prediction type: response
marginaleffects(m3, variables = "hp") |> summary()
#> Term Contrast Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
#> 1 hp mean(dY/dX) -0.062 0.01087 -5.705 1.1665e-08 -0.0833 -0.0407
#>
#> Model type: fixest
#> Prediction type: response
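The same equivalence can be sketched in base R without fixest, assuming the gear fixed effect is entered as factor(gear) in lm(). The model names m_star and m_split are made up for this illustration. Only point estimates and fitted values are compared, since the standard errors above are clustered and plain lm() standard errors are not.

```r
# Base-R sketch: the `*` model and the "one slope per group" model are the same fit.
m_star  <- lm(mpg ~ am * hp + factor(gear), data = mtcars)
# hp:factor(am) without an hp main effect yields one hp slope per level of am.
m_split <- lm(mpg ~ am + hp:factor(am) + factor(gear), data = mtcars)

# Slopes implied by the `*` model: hp when am = 0, hp + am:hp when am = 1.
slope_am0 <- coef(m_star)[["hp"]]
slope_am1 <- slope_am0 + coef(m_star)[["am:hp"]]

# Slopes reported directly by the split model (ordered am = 0, then am = 1).
split_slopes <- coef(m_split)[grepl("^hp:", names(coef(m_split)))]

all.equal(unname(split_slopes), c(slope_am0, slope_am1))
all.equal(fitted(m_star), fitted(m_split))
```

In other words, the hp coefficient of the `*` model corresponds to the am = 0 slope only when the split model also contains the am main effect.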
I want to perform the following task using fastfood dataset from openintro package in R.
a) Create a regression predicting whether or not a restaurant is McDonalds or Subway
based on calories, sodium, and protein. (McDonalds should be 1, Subway 0).
Save the coefficients to Q2.
b) use data from only restaurants with between 50 and 60 items in the
data set. Predict total fat from cholesterol, total carbs, vitamin a, and restaurant.
Remove any nonsignificant predictors and run again.
Assign the strongest standardized regression coefficient to Q5.
Here's my code.
library(tidyverse)
library(openintro)
library(lm.beta)
fastfood <- openintro::fastfood
head(fastfood)
#Solving for part (a)
fit_1 <- lm(I(restaurant %in% c("Subway", "Mcdonalds")) ~ calories + sodium + protein, data = fastfood)
Q2 <- round(summary(fit_1)$coefficients,2)
#Solving for part (b)
newdata <- fastfood[ which(fastfood$item>=50 & fastfood$item <= 60), ]
df = sort(sample(nrow(newdata), nrow(data)*.7))
newdata_train<-data[df,]
newdata_test<-data[-df,]
fit_5 <- lm(I(total_fat) ~ cholesterol + total_carb + vit_a + restaurant, data = newdata)
prediction_5 <- predict(fit_5, newdata = newdata_test)
Q5 <- lm.beta(fit_5)
But I'm not getting the desired results.
Here is the desired output:
output for part (a):
output for part (b):
The first question requires logistic regression rather than linear regression, since the aim is to predict a binary outcome. The most sensible way to do this is, as the question suggests, to remove all the restaurants except McDonald's and Subway, then create a new binary variable to mark which rows are McDonald's and which aren't:
library(dplyr)
fastfood <- openintro::fastfood %>%
filter(restaurant %in% c("Mcdonalds", "Subway")) %>%
mutate(is_mcdonalds = restaurant == "Mcdonalds")
The logistic regression is done like this:
fit_1 <- glm(is_mcdonalds ~ calories + sodium + protein,
family = "binomial", data = fastfood)
And your coefficients are obtained like this:
Q2 <- round(coef(fit_1), 2)
Q2
#> (Intercept) calories sodium protein
#> -1.24 0.00 0.00 0.06
The second question requires that you filter out any restaurants with more than 60 or fewer than 50 items:
fastfood <- openintro::fastfood %>%
group_by(restaurant) %>%
filter(n() >= 50 & n() <= 60)
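The key point of this filter is that it acts on the number of rows per restaurant, not on a column of the data. A base-R sketch of the same idea follows; mtcars, cyl and the 10 to 12 range are stand-ins chosen for illustration, not part of the exercise.

```r
# Count rows per group, then keep only the groups whose size is in range.
# mtcars/cyl and the 10-12 range stand in for fastfood/restaurant and 50-60.
n_per_group <- table(mtcars$cyl)
keep <- names(n_per_group)[n_per_group >= 10 & n_per_group <= 12]

# Keep every row that belongs to one of the retained groups.
kept_rows <- mtcars[as.character(mtcars$cyl) %in% keep, ]

keep
#> [1] "4"
nrow(kept_rows)
#> [1] 11
```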
We now fit the described regression and examine it to look for non-significant regressors:
fit_2 <- lm(total_fat ~ cholesterol + vit_a + total_carb + restaurant,
data = fastfood)
summary(fit_2)
#>
#> Call:
#> lm(formula = total_fat ~ cholesterol + vit_a + total_carb + restaurant,
#> data = fastfood)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -24.8280 -2.9417 0.9397 5.1450 21.0494
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.20102 2.08029 -0.577 0.564751
#> cholesterol 0.26932 0.01129 23.853 < 2e-16 ***
#> vit_a 0.01159 0.01655 0.701 0.484895
#> total_carb 0.16327 0.03317 4.922 2.64e-06 ***
#> restaurantMcdonalds -4.90272 1.94071 -2.526 0.012778 *
#> restaurantSonic 6.43353 1.89014 3.404 0.000894 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 7.611 on 125 degrees of freedom
#> (34 observations deleted due to missingness)
#> Multiple R-squared: 0.8776, Adjusted R-squared: 0.8727
#> F-statistic: 179.2 on 5 and 125 DF, p-value: < 2.2e-16
We note that vit_a is non-significant and drop it from our model:
fit_3 <- update(fit_2, . ~ . - vit_a)
Now we get the standardized coefficients and round them:
coefs <- round(coef(lm.beta::lm.beta(fit_3)), 2)
and Q5 will be the maximum value of these coefficients:
Q5 <- coefs[which.max(coefs)]
Q5
#> cholesterol
#> 0.82
Created on 2022-02-26 by the reprex package (v2.0.1)
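One caveat on the last step: which.max() picks the largest signed value. If "strongest" is meant as largest in magnitude (an assumption about what the exercise intends), a negative coefficient could be missed. A minimal base-R sketch on mtcars, with standardized slopes computed by hand as b * sd(x) / sd(y) rather than with lm.beta:

```r
# Hand-computed standardized slopes for a small stand-in model.
fit <- lm(mpg ~ wt + hp, data = mtcars)
predictors <- c("wt", "hp")
std_coefs <- coef(fit)[predictors] *
  sapply(mtcars[predictors], sd) / sd(mtcars$mpg)

# Select by absolute value so that strong negative effects are not overlooked.
strongest <- std_coefs[which.max(abs(std_coefs))]
strongest
```

Here both slopes are negative and the selected one is wt; which.max() without abs() would have returned hp, the coefficient closest to zero.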
I would like to ask how to calculate information criteria such as AIC for a fixed effects logit model from the bife package.
The basic summary output does not include AIC. However, in the question "Goodness-of-fit for fixed effect logit model using 'bife' package", the AIC was computed. I have neither the AIC nor the log-likelihood in my summary output.
dta = bife::psid
mod_logit <- bife(LFP ~ AGE + I(INCH / 1000) + KID1 + KID2 + KID3 | ID, data = dta, bias_corr = "ana")
summary(mod_logit)
If you check the bife source code, AIC was computed in earlier versions, at least in version 0.5. You may be using the current version 0.6, in which AIC is no longer included.
If you do not mind using the older version, try the following:
Remove the current version from your library.
Download version 0.5 from the CRAN archive: https://cran.r-project.org/src/contrib/Archive/bife/
Install it from the downloaded file, assuming it is stored on the D: drive: install.packages("D:\\bife_0.5.tar.gz", repos = NULL, type = "source").
Or:
require(devtools)
install_version("bife", version = "0.5", repos = "http://cran.us.r-project.org")
If successfully installed, run the following with AIC included:
library(bife)
dta = bife::psid
mod_logit <- bife(LFP ~ AGE + I(INCH / 1000) + KID1 + KID2 + KID3 | ID, data = dta, bias_corr = "ana")
summary(mod_logit)
#> ---------------------------------------------------------------
#> Fixed effects logit model
#> with analytical bias-correction
#>
#> Estimated model:
#> LFP ~ AGE + I(INCH/1000) + KID1 + KID2 + KID3 | ID
#>
#> Log-Likelihood= -3045.505
#> n= 13149, number of events= 9516
#> Demeaning converged after 5 iteration(s)
#> Offset converged after 3 iteration(s)
#>
#> Corrected structural parameter(s):
#>
#> Estimate Std. error t-value Pr(> t)
#> AGE 0.033945 0.012990 2.613 0.00898 **
#> I(INCH/1000) -0.007630 0.001993 -3.829 0.00013 ***
#> KID1 -1.052985 0.096186 -10.947 < 2e-16 ***
#> KID2 -0.509178 0.084510 -6.025 1.74e-09 ***
#> KID3 -0.010562 0.060413 -0.175 0.86121
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> AIC= 9023.011 , BIC= 19994.7
#>
#>
#> Average individual fixed effects= 0.0122
#> ---------------------------------------------------------------
Created on 2020-01-09 by the reprex package (v0.3.0)
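For reference, the AIC and BIC in this output appear to follow the usual definitions, AIC = -2 logLik + 2k and BIC = -2 logLik + k log(n). The sketch below redoes that arithmetic from the numbers printed above. The parameter count is an assumption: 5 structural coefficients plus one fixed effect per ID, with 1461 IDs assumed (this can be checked with length(unique(dta$ID))).

```r
# Values copied from the summary output above.
log_lik <- -3045.505   # Log-Likelihood
n_obs   <- 13149       # n

# Assumed parameter count: 5 slopes + one fixed effect per ID.
n_coef <- 5            # AGE, I(INCH/1000), KID1, KID2, KID3
n_fe   <- 1461         # assumption: number of distinct IDs in psid
k      <- n_coef + n_fe

aic <- -2 * log_lik + 2 * k
bic <- -2 * log_lik + k * log(n_obs)
round(c(AIC = aic, BIC = bic), 2)
```

Under that assumption the results should come out close to the AIC and BIC printed in the summary, so the same formulas could be applied by hand whenever a log-likelihood is available.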
I am running the following piece of code:
lm(ath ~ HAPP + IQ2 + OPEN2 + INCOME*EXPEC,data=data)
Which, of course, leads to the output:
Standardized weighted residuals 2:
Min 1Q Median 3Q Max
-3.2644 -0.5461 -0.0223 0.4158 3.2217
Coefficients (mean model with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.730e+00 3.141e+00 1.824 0.068112 .
HAPP -7.765e-02 8.958e-02 -0.867 0.386014
IQ2 5.080e-04 7.453e-05 6.816 9.38e-12 ***
OPEN2 -5.038e-06 5.114e-06 -0.985 0.324640
INCOME -1.837e-02 1.211e-01 -0.152 0.879395
EXPEC -3.336e-01 1.161e-01 -2.873 0.004067 **
INCOME:EXPEC 2.645e-03 7.597e-04 3.481 0.000499 ***
Phi coefficients (precision model with identity link):
Estimate Std. Error z value Pr(>|z|)
(phi) 9.489 1.363 6.96 3.41e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Type of estimator: ML (maximum likelihood)
Log-likelihood: 222.5 on 8 Df
Pseudo R-squared: 0.6938
Number of iterations: 36 (BFGS) + 4 (Fisher scoring)
I need to drop the INCOME and EXPEC lines (with Estimate, Std. Error, z value and Pr(>|z|)) from the regression output in a really elegant way (I need to run something like a million models, so I can't do it by hand one by one). Please note that those variables (INCOME and EXPEC) were not included in the original set of individual variables. That is, ONLY the requested variables (and the demanded interactions, of course) should be printed.
Any piece of advice?
Thanks!!! :D
You can use the AsIs function, I(). See the example below:
fit <- lm(Sepal.Length ~ Sepal.Width + I(Petal.Length * Petal.Width) , data = iris)
fit
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + I(Petal.Length * Petal.Width),
# data = iris)
#
# Coefficients:
# (Intercept) Sepal.Width
# 4.1072 0.2688
# I(Petal.Length * Petal.Width)
# 0.1578
library(broom)
tidy(fit)
# term estimate std.error statistic p.value
# 1 (Intercept) 4.1072163 0.266529393 15.409994 1.702125e-32
# 2 Sepal.Width 0.2687704 0.081280587 3.306698 1.186597e-03
# 3 I(Petal.Length * Petal.Width) 0.1578160 0.007517941 20.991921 4.426899e-46
If you only need part of the coefficients you can just use the coef function from base R and subset the indices you like. For example:
a1 <- lm(Sepal.Length ~ Sepal.Width + I(Petal.Length * Petal.Width) , data = iris)
coefficients(a1)[1:2]
(Intercept) Sepal.Width
4.1072163 0.2687704
If you need the formula call as well you could do a1$call
a1$call
lm(formula = Sepal.Length ~ Sepal.Width + I(Petal.Length * Petal.Width),
data = iris)
Or if you need any other argument just take a look at str(a1) or str(summary(a1))
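When many models are fitted in a loop, subsetting by term name rather than by position may be more robust. A minimal sketch for lm objects, where the model, the name drop_terms and the variables are chosen only for illustration (other model classes may store their coefficient table differently):

```r
# Fit a model with an interaction; the main effects hp and qsec come along with `*`.
fit <- lm(mpg ~ wt + hp * qsec, data = mtcars)

# Full coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- coef(summary(fit))

# Drop the unwanted main-effect rows by name, keeping the interaction row.
drop_terms <- c("hp", "qsec")
trimmed <- coef_table[!rownames(coef_table) %in% drop_terms, , drop = FALSE]

rownames(trimmed)
#> [1] "(Intercept)" "wt"          "hp:qsec"
```

The model still estimates the dropped terms; they are only hidden from the printed table.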
I'm actually running into a problem with the T-pipe. I'm trying to do 3 things in the same chain:
Fit my GLM
Save it in a variable
Print its summary
So I'm trying the following syntax:
my_variable <- data %>%
glm(Responce ~ Variables, family) %T>%
summary
Which does not work as expected. The glm gets fitted, but the summary won't show. So I'm forced to break it into 2 chains:
my_variable <- data %>%
glm(Responce ~ Variables, family)
my_variable %>% summary
So I'm thinking: either I didn't understand what the T-pipe does, or it's not properly coded and interferes with the summary function.
Because if I try:
my_variable <- data %>%
glm(Responce ~ Variables, family) %T>%
plot
it works well.
Any ideas?
When you type summary(something) in the console, print is called implicitly. That is not the case in your pipe call, so you need to call print explicitly.
Because %T>% branches off for one operation only, you'll have to compose print and summary:
my_variable <- data %>%
glm(Responce ~ Variables, family) %T>%
{print(summary(.))}
You need the curly braces and the dot; otherwise the glm output would be passed as the first argument to print.
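A runnable version of this pattern, as a sketch: the small Poisson data set and the name my_model are made up for illustration, and data = . is spelled out so the piped data frame does not land in glm's first (formula) argument.

```r
library(magrittr)

# Small illustrative data set (not from the question).
counts    <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome   <- gl(3, 1, 9)
treatment <- gl(3, 3)
my_data   <- data.frame(treatment, outcome, counts)

# %T>% runs the braces for their side effect (printing the summary)
# and then passes the fitted model itself on to the assignment.
my_model <- my_data %>%
  glm(counts ~ outcome + treatment, family = poisson(), data = .) %T>%
  {print(summary(.))}

# The stored object is the glm fit, not the summary.
class(my_model)
```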
I don't see why you would need %T>% here. If you want to force printing, just use regular pipes and add print() to your pipe. Keep in mind, however, that with this approach you store the summary in my_variable, not the model itself.
library(magrittr)
my_variable <- my_data %>%
glm(counts ~ outcome + treatment, family = poisson(), data = .) %>%
summary() %>%
print()
#>
#> Call:
#> glm(formula = counts ~ outcome + treatment, family = poisson(),
#> data = .)
#>
#> Deviance Residuals:
#> 1 2 3 4 5 6 7
#> -0.67125 0.96272 -0.16965 -0.21999 -0.95552 1.04939 0.84715
#> 8 9
#> -0.09167 -0.96656
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 3.045e+00 1.709e-01 17.815 <2e-16 ***
#> outcome2 -4.543e-01 2.022e-01 -2.247 0.0246 *
#> outcome3 -2.930e-01 1.927e-01 -1.520 0.1285
#> treatment2 1.338e-15 2.000e-01 0.000 1.0000
#> treatment3 1.421e-15 2.000e-01 0.000 1.0000
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 10.5814 on 8 degrees of freedom
#> Residual deviance: 5.1291 on 4 degrees of freedom
#> AIC: 56.761
#>
#> Number of Fisher Scoring iterations: 4
Data
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
my_data <- data.frame(treatment, outcome, counts)