I'm trying to do a 'classic' difference-in-differences with multiple time periods. The model I want to estimate is:
y = a + b1*x1 + b2*treat + b3*period + b4*(treat*period) + u (eq. 1)
So basically I'm testing different setups just to make sure I specify my model the right way, using different packages. I want to use the fixest package, so I tried to compare its estimates with those from the standard lm() function. The results, however, differ -- both the coefficients and the standard errors.
My question is:
Are any of the lm_mod, lm_mod2, or feols_mod regressions specified correctly (as in eq. 1)?
If not, I would appreciate it if anyone could show me how to get the same results in lm() as in feols()!
# libraries
library(fixest)
library(modelsummary)
library(tidyverse)
# load data
data(base_did)
# make df for lm_mod with 5 as the reference period
base_ref_5 <- base_did %>%
mutate(period = as.factor(period)) %>%
mutate(period = relevel(period, ref = 5))
# Notice that I use base_ref_5 for the lm models and base_did for feols_mod.
lm_mod <- lm(y ~ x1 + treat*period, base_ref_5)
lm_mod2 <- lm(y ~ x1 + treat + period + treat*period, base_ref_5)
feols_mod <- feols(y ~ x1 + i(period, treat, ref = 5), base_did)
# compare models
models <- list("lm" = lm_mod,
"lm2" = lm_mod2,
"feols" = feols_mod)
msummary(models, stars = TRUE)
**EDIT:**
The reason I created base_ref_5 was so that both regressions would have period 5 as the reference period, in case that was unclear.
**EDIT 2**:
Added a third model (lm_mod2), which is much closer, but there is still a difference.
There are two issues here.
In the lm() model, the period variable is interacted, but treated as a continuous numeric variable. In contrast, calling i(period, treat) treats period as a factor (this is explained clearly in the documentation).
The i() function only includes the interactions, and not the constitutive terms.
Here are two models to illustrate the parallels:
library(fixest)
data(base_did)
lm_mod <- lm(y ~ x1 + factor(period) * factor(treat), base_did)
feols_mod <- feols(y ~ x1 + factor(period) + i(period, treat), base_did)
coef(lm_mod)["x1"]
#> x1
#> 0.9799697
coef(feols_mod)["x1"]
#> x1
#> 0.9799697
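For completeness, the full set of coefficients (main effects plus interactions) can also be matched, because feols() accepts plain R formula syntax. A minimal sketch, assuming the same base_did data:
# pass the identical factor-interaction formula to both functions
lm_full <- lm(y ~ x1 + factor(period) * factor(treat), base_did)
feols_full <- feols(y ~ x1 + factor(period) * factor(treat), base_did)
# the point estimates should coincide term by term
all.equal(unname(coef(lm_full)), unname(coef(feols_full)))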
Please note that I only answered the part of your question about the parallels between lm and feols. Stack Overflow is a programming Q&A site. If you have questions about the proper specification of a statistical model, you might want to ask on Cross Validated.
I am attempting to get clustered SEs (at the school level in my data) with data that is both imputed (MICE) and weighted (CBPS). I have tried a couple of different approaches, which have thrown different errors.
This is what I have to start, which works fine:
library(tidyverse)
library(mice)
library(MatchThem)
library(CBPS)
tempdata <- mice(d, m = 10, maxit = 50, meth = "pmm", seed = 99)
weighted_data <- weightthem(trtmnt ~ x1 + x2 + x3,
data = tempdata,
method = "cbps",
estimand = "ATT")
Using this post (https://www.r-bloggers.com/2021/05/clustered-standard-errors-with-r/) as a guide, I attempted all three approaches, each of which resulted in a different error message.
My data is on a restricted server, so unfortunately I can't share it here to reproduce things exactly, although if it's useful I could attempt to recreate some sample data.
So attempting with estimatr first, I get this error:
m1 <- estimatr::lm_robust(outcome ~ trtmnt + x1 + x2 + x3,
clusters = schoolID,
data = weighted_data)
Error in eval_tidy(mfargs[[da]], data = data) :
object 'schoolID' not found
I have no clue where the schoolID variable would have dropped out or why it isn't recognized. It isn't part of the weighting procedure, but it should still be in the data frame... if I use it as a covariate in a standard model without clustering, it's there.
I also attempted with miceadds and got this error:
m2 <- miceadds::lm.cluster(outcome ~ trtmnt + x1 + x2 + x3,
cluster = "schoolID",
data = weighted_data)
Error in as.data.frame.default(data) :
cannot coerce class '"wimids"' to a data.frame
And finally, with sandwich and lmtest:
library(sandwich)
library(lmtest)
m3 <- weighted_models <- with(weighted_data,
exp=lm(outcome ~ trtmnt + x1 + x2 + x3))
msandwich <- coeftest(m3, vcov = vcovCL, cluster = ~schoolID)
Error in UseMethod("estfun") :
no applicable method for 'estfun' applied to an object of class "c('mimira', 'mira')"
Any ideas on any of the above methods, or where to go next?
You were really close. You need to use with(weighted_data, .) to fit a model to your weighted datasets, and you need to use estimatr::lm_robust() to get the clustered standard errors. So try the following:
weighted_models <- with(weighted_data,
estimatr::lm_robust(outcome ~ trtmnt + x1 + x2 + x3,
cluster = schoolID))
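After fitting, the estimates still need to be pooled across the imputations (Rubin's rules). A minimal sketch, assuming the pool() generic from mice/MatchThem accepts the fitted object:
# pool the per-imputation estimates and summarize
pooled <- pool(weighted_models)
summary(pooled, conf.int = TRUE)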
Your first and second approaches were incorrect because you supplied weighted_data to a single model as if it were a data frame, but it's not; it's a complicated wimids object. You need to use the with() infrastructure to fit a model to the imputed weighted data.
Your third approach was close, but coeftest() needs to be used on a single model, not a mimira object, which contains all the models fit to the imputed datasets. Although you can use coeftest() inside with() with mira objects, you cannot do so with mimira objects from MatchThem. This is where estimatr::lm_robust() comes in, since it is able to apply the clustering within each imputed dataset.
I also recommend you take a look at this blog post on estimating treatment effects after weighting with multiply imputed data. The only difference between your case and the code presented in the post is that you would change vcov = "HC3" to vcov = ~schoolID in whichever function you use.
I'm structuring a simple linear regression model using lm, and since I think my dependent variable is history-dependent, I would like to add a control variable that is the dependent variable at time (t-1).
How can I do that in R? This is my model so far:
lm(Y ~ Country + Year + GDP, data=df)
And I would like to add a control variable like so:
lm(Y ~ Y(t-1) + Country + Year + GDP, data=df)
I hope it is clear! Thanks in advance!
You can use the dplyr library to add the lagged control variable. Note that with panel data the lag should be computed within each Country after sorting by time, so lags don't cross panel boundaries.
library(dplyr)
# create a new variable for the value of Y at time (t-1),
# sorting by time and lagging within each Country
df <- df %>%
  group_by(Country) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(Y_lag = lag(Y)) %>%
  ungroup()
# run the regression with Y_lag as a control variable
fit <- lm(Y ~ Y_lag + Country + Year + GDP, data = df)
1) Assuming that the data is in order of increasing time, align the variables appropriately by removing the first or last element of each column. We have used the built-in BOD data frame since the input was not provided in the question -- see the info at the top of the r tag page for guidance on asking questions.
nr <- nrow(BOD)
# demand[-1] is y at t = 2..n; demand[-nr] is y at t-1
lm(demand[-1] ~ Time[-1] + demand[-nr], BOD)
2) Use flag from the collapse package.
library(collapse)
lm(demand ~ Time + flag(demand), BOD)
3) An alternative is the dyn package. In particular, comparing models using anova is tricky when lags are involved, but dyn has an anova.dyn method which addresses these problems automatically. Note that dplyr masks R's lag with its own lag, which silently breaks code that expects stats::lag, so be sure that dplyr is not loaded, or use library(dplyr, exclude = c("lag", "filter")) if you need it. This works with zoo and ts objects.
library(dyn)
z <- read.zoo(BOD, drop = FALSE)
dyn$lm(demand ~ time(z) + lag(demand, -1), z)
4) There is also the dynlm package, which uses a slightly different syntax than dyn but is similar in principle. An important advantage of dynlm is that it supports instrumental-variables regression via two-stage least squares.
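For illustration, a minimal dynlm sketch on the same BOD example; note that dynlm's lag operator L() uses the opposite sign convention from lag() in (3):
library(dynlm)
z <- zoo::read.zoo(BOD, drop = FALSE)
# L(demand, 1) looks one period back, like lag(demand, -1) in dyn
dynlm(demand ~ time(z) + L(demand, 1), data = z)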
I have heard people talk about "modeling on the residuals" when they want to estimate some effect after an a priori model has been fitted. For example, if they know that two variables, var_1 and var_2, are correlated, we first make a model with var_1 and then model the effect of var_2 afterwards. My problem is that I've never seen this done in practice.
I'm interested in the following:
If I'm using a glm, how do I account for the link function used?
What distribution do I choose when running a second glm with var_2 as explanatory variable?
I assume this is related to 1.
Is this at all related to using the first model's prediction as an offset in the second model?
My attempt:
library(data.table)
dt <- data.table(mtcars) # I have a hypothesis that `mpg` is a function of both `cyl` and `wt`
dt[, cyl := as.factor(cyl)]
model <- stats::glm(mpg ~ cyl, family=Gamma(link="log"), data=dt) # I want to model `cyl` first
dt[, pred := stats::predict(model, type="response", newdata=dt)]
dt[, res := mpg - pred]
# will this approach work?
model2_1 <- stats::glm(mpg ~ wt + offset(pred), family=Gamma(link="log"), data=dt)
dt[, pred21 := stats::predict(model2_1, type="response", newdata=dt) ]
# or will this approach work?
model2_2 <- stats::glm(res ~ wt, family=gaussian(), data=dt)
dt[, pred22 := stats::predict(model2_2, type="response", newdata=dt) ]
My first suggested approach has convergence issues, but this is how my silly brain would approach this problem. Thanks for any help!
In a sense, an ANCOVA is 'modeling on the residuals'. The model for ANCOVA is y_i = grand_mean + treatment_i + b * (covariate - covariate_mean_i) + error for each treatment i. The term (covariate - covariate_mean_i) can be seen as the residuals of a model with covariate as DV and treatment as IV.
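To make that concrete, a quick check in base R that the residuals of covariate ~ treatment are exactly the group-mean-centered covariate (using mtcars, with cyl as treatment and wt as covariate):
# residuals of wt ~ factor(cyl) equal wt minus its group mean
res_wt <- resid(lm(wt ~ factor(cyl), data = mtcars))
centered <- with(mtcars, wt - ave(wt, cyl))
all.equal(unname(res_wt), centered)
#> TRUE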
The following regression is equivalent to this ANCOVA:
lm(y ~ treatment * scale(covariate, scale = FALSE))
Which applied to the data would look like this:
lm(mpg ~ factor(cyl) * scale(wt, scale = FALSE), data = mtcars)
And can be turned into a glm similar to the one you use in your example:
glm(mpg ~ factor(cyl) * scale(wt, scale = FALSE),
family=Gamma(link="log"),
data = mtcars)
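As for the offset question in the original post: with a log link, the offset must enter on the link scale, not the response scale. A sketch of the corrected first approach, assuming the dt built in the question:
# the linear predictor is log(mu), so the offset must be log(pred)
model2_1 <- stats::glm(mpg ~ wt + offset(log(pred)),
                       family = Gamma(link = "log"), data = dt)
This is also a likely source of the convergence issues: offsetting with pred on the response scale distorts the linear predictor.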
I have a problem in that I need to pre-specify my formulas before passing them to a regression function. Using the stan_gamm4 function in R, consider the following example:
library(rstanarm)
dat <- mgcv::gamSim(1, n = 400, scale = 2) ## simulate 4 term additive truth
## Now add 20 level random effect `fac'...
dat$fac <- fac <- as.factor(sample(1:20, 400, replace = TRUE))
dat$y <- dat$y + model.matrix(~ fac - 1) %*% rnorm(20) * .5
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = ~ (1 | fac),
chains = 1, iter = 200) # for example speed
Now, because the formula and random formula were specified explicitly, then if we call:
br$call$random
#> ~(1 | fac)
We are able to retrieve the form of the random effects.
NOW, let us leave everything the same, BUT use a variable for the random part:
formula.rand <- as.formula( '~(1|fac)' )
Then, if we do the same thing as before, but with formula.rand taking its place, we have:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand,
chains = 1, iter = 200) # for example speed
BUT NOW we get:
br$call$random
#> formula.rand
That is, we get the symbol instead of the original formula. A lot of Bayesian packages rely on accessing br$call$random, so is there a way to use a variable for the formula, have it passed in, AND retain the original relation when calling br$call$random? Thanks.
While I haven't used Stan, this is a problem inherent in the way that R handles storing calls. You can see it happening with lm, for example:
model <- function(formula)
{
lm(formula, data=mtcars)
}
m <- model(mpg ~ disp)
m$call$formula
# formula
The simplest solution is to construct the call using substitute to insert the actual values you want to keep, not the symbol name. In the case of lm, this would be something like
model2 <- function(formula)
{
call <- substitute(lm(formula=.f, data=mtcars), list(.f=formula))
eval(call)
}
m2 <- model2(mpg ~ disp)
m2$call$formula
# mpg ~ disp
For Stan, you can do
stan_call <- substitute(br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data=dat, random=.rf,
chains=1, iter=200),
list(.rf=formula.rand))
br <- eval(stan_call)
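With the substituted call, br$call$random now stores the formula itself, so the earlier check should print it directly:
br$call$random
#> ~(1 | fac)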
If I understand correctly, your problem is not that stan_gamm4 computes incorrect results (it doesn't, from what I gather), but only that br$call$random refers to the variable name and not the formula. This seems to be problematic for further post-processing of the model.
Since stan_gamm4 uses match.call inside to find the call, I don't know of a way to specify the model differently to obtain a "correct" br$call$random up front. But you can simply modify it after the fact via:
br <- stan_gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = formula.rand)
br$call$random <- formula.rand
br$call$random
#> ~(1 | fac)
and then continue with whatever you are doing.
IMHO, this is not a problem with stan_gamm4. In your second example, if you then do
class(br$call$random)
you will see that it is of class "name". So it is not as if $call is just some list with stuff in it. In order to access it programmatically in general, you need to evaluate it with
eval(br$call$random)
in order to obtain ~(1 | fac), which is of class "formula".
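A minimal illustration of the distinction, assuming br from the second example above:
rand <- br$call$random
class(rand)       # "name": the unevaluated symbol formula.rand
class(eval(rand)) # "formula": the object the symbol refers to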
I'm trying to predict a simple lagged time series regression with the dyn library in R. This question was a helpful starting point, but I'm getting some weird behaviour that I'm hoping someone can explain.
Here's a minimum working example.
library(dyn)
# Initial data
y.orig <- arima.sim(model=list(ar=c(.9)),n=10)
x1.orig <- rnorm(10)
data <- cbind(y=y.orig, x1=x1.orig)
# This model, with a single lag term, predicts from t=2
mod1 <- dyn$lm(y ~ lag(y, -1), data)
y.new <- window(y.orig, end=end(y.orig) + c(5,0), extend=TRUE)
newdata1 <- cbind(y=y.new)
predict(mod1, newdata1)
# This one, with a lag plus another predictor, predicts from t=1 on
mod2 <- dyn$lm(y ~ lag(y, -1) + x1, data)
y.new <- window(y.orig, end=end(y.orig) + c(5,0), extend=TRUE)
x1.new <- c(x1.orig, rnorm(5))
newdata2 <- cbind(y=y.new, x1=x1.new)
predict(mod2, newdata2)
Why is there a difference between the two? Can anyone suggest how to predict my mod1 using dyn? Thanks in advance.
Both mod1 and mod2 start predicting at t=2; the prediction vector for mod2 starts at t=1, but its first value is NA. Regarding why one starts at 2 and the other at 1, note that predict merges together the variables on the right-hand side of the formula. In the case of mod1, lag(y, -1) starts at t=2, since y starts at t=1. In the case of mod2, merging lag(y, -1) and x1 gives a series that starts at t=1 (since x1 starts at t=1). Try this, which does not involve dyn:
> start(with(as.list(newdata1), merge.zoo(lag(y, -1))))
[1] 2
> start(with(as.list(newdata2), merge.zoo(lag(y, -1), x1)))
[1] 1
If we wanted predict(mod1, newdata1) to start at t=1 we could add our own Intercept column and remove the default intercept to avoid duplication. That would force it to start at 1 since now the RHS has a series which starts at 1:
data.b <- cbind(y=y.orig, x1=x1.orig, Intercept = 1)
mod.b <- dyn$lm(y ~ Intercept + lag(y, -1) - 1, data.b)
newdata.b <- cbind(Intercept = 1, y = y.new)
predict(mod.b, newdata.b)
Regarding the second question, if you want to predict mod1 then use fitted(mod1).
There also seems to be a lurking third question about how this all basically works, so maybe this clarifies it. All dyn does is align the time series in the formula; then lm and predict can be run as usual. For example, if we create an aligned model frame using dyn$model.frame, then everything else can be done using just ordinary lm and ordinary predict, and dyn is not involved from that point onwards. Below, mod1a is similar to mod1 from the question except that it runs an ordinary lm on the aligned model frame. If you understand the mod1a lm and its predict, then mod1 and its predict are similar.
## mod1 and mod1a are similar
# from code in the question
mod1 <- dyn$lm(y ~ lag(y, -1), data = data)
mod1
# redo it using a plain lm by applying dyn to model.frame
mf <- dyn$model.frame(y ~ lag(y, -1), data = data)
mod1a <- lm(y ~ `lag(y, -1)`, mf)
mod1a
## the two predicts below are similar
# the 1 ensures it's an mts rather than a ts but is otherwise not used
newdata1 <- cbind(y=y.new, 1)
predict(mod1, newdata1)
newdata1a <- cbind(1, `lag(y, -1)` = lag(y.new, -1))
predict(mod1a, newdata1a)