Fixed effects regression in R: difference in specification

I'm doing a few regressions using state and year fixed effects. There are two ways that I've done it:
reg1 <- lm(x ~ y + z + factor(year) + factor(state) + year:state, data=df)
and:
reg1 <- lm(x ~ y + z + factor(year) + factor(state) + factor(year)*factor(state), data=df)
but I can't explain the difference in the results between the two approaches.
Does anyone know the difference between year:state and factor(year)*factor(state) when using the lm function?
I know that plm does that for you, but in that specific case, I have to add the fixed effects manually.
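A quick way to see the difference is to compare the design matrices the two formulas generate. The sketch below uses toy numeric year and state columns (hypothetical data, not the asker's df): with numeric columns, year:state is a single product column, while factor(year)*factor(state) expands to both sets of dummies plus one dummy per year-state combination.

```r
# Toy panel: 3 years x 3 states, numeric codes (hypothetical data)
df <- expand.grid(year = 2001:2003, state = 1:3)
set.seed(1)
df$y <- rnorm(9); df$z <- rnorm(9)

# year:state on numeric columns is ONE column: the product year * state
m1 <- model.matrix(~ y + z + factor(year) + factor(state) + year:state, data = df)

# factor(year)*factor(state) adds a dummy for every year-state cell
m2 <- model.matrix(~ y + z + factor(year) * factor(state), data = df)

ncol(m1)  # 8: intercept, y, z, 2 year dummies, 2 state dummies, 1 product
ncol(m2)  # 11: intercept, y, z, 2 + 2 dummies, 4 interaction dummies
```

So the first specification treats year and state as continuous in the interaction (usually not what you want for two-way fixed effects), while the second gives the fully saturated state-by-year dummies, which is why the results differ.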

How to develop a hierarchical model to see the heterogeneity in mean of a specific variable using glmer?

I am using the following model in R:
glmer(y ~ x + z + (1|id), weights = specification, family = binomial, data = data)
where:
y ~ Binomial(specification, p)
logit(p) = intercept + a*x + b*z
a and b are the coefficients of the x and z variables, and
a = a0 + a1*I
The coefficient of one of the variables (here x) depends on another variable (here I), so I need a hierarchical model to capture the heterogeneity in the mean effect of x.
I would appreciate it if anyone could help me with this problem.
Sorry if the question does not look professional! This is one of my first experiences.
I'm not perfectly sure I understand the question, but: if Logit(y) = intercept + a*x + b*z and a = a0 + a1*I, then it would seem that
Logit(y) = intercept + (a0+a1*I)*x + b*z
This looks like a straightforward interaction model:
glmer(y ~ 1 + x + x:I + z + (1|id), ...)
To make it more explicit, this could also be written as
glmer(y ~ 1 + x + I(x*I) + z + (1|id), ...)
(although the use of I as a predictor variable and in the I() function is a little bit confusing at first glance ...)
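To confirm that x:I and I(x*I) really produce the same column, you can compare the model matrices in base R (a minimal sketch with simulated data; variable names follow the question):

```r
set.seed(1)
d <- data.frame(x = rnorm(20), I = rnorm(20), z = rnorm(20))

mA <- model.matrix(~ 1 + x + x:I + z, data = d)
mB <- model.matrix(~ 1 + x + I(x * I) + z, data = d)

# Same elementwise product; only the column label (and term order) differs
identical_cols <- isTRUE(all.equal(unname(mA[, "x:I"]), unname(mB[, "I(x * I)"])))
identical_cols  # TRUE
```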

Problems with Fixed effects panel data

I am trying to run a regression with panel data from the Michigan Consumers Survey. It is the first time I am using panel data in R, so I am not very familiar with the "plm" package that is needed. I am setting up my panel data with fixed effects on individuals (CASEID) and time (YYYY):
Michigan_panel <- pdata.frame(Michigan_survey, index = c("CASEID", "YYYY"))
Then I am using the following regression:
mod_1 <- plm(data = Michigan_panel, ICS ~ ICE + PX1Q2 + RATEX + ZLB + INCOME + AGE + EDUC + MARRY + SEX + AGE_sq, model = "within")
However R is showing me the following error:
> mod_1 <- plm(data = Michigan_panel, ICS ~ ICE + PX1Q2 + RATEX + ZLB + INCOME + AGE + EDUC + MARRY + SEX + AGE_sq, model = "within")
Error in plm.fit(data, model, effect, random.method, random.models, random.dfcor, :
empty model
Does anyone know what I am doing wrong?
Could you share a link to this specific survey? I found various datasets with this name.
I suspect (only suspect) that your data isn't actually panel data; please check the CASEID variable.
Changing the order of the formula and data arguments in plm won't solve your problem.
I think the error comes from how the model call is written. Your current code is:
mod_1 <- plm(data = Michigan_panel, ICS ~ ICE + PX1Q2 + RATEX + ZLB + INCOME + AGE + EDUC + MARRY + SEX + AGE_sq, model = "within")
In my view, you should specify the indexes in the plm() call itself and follow the argument order of the plm package. I would write your model as follows:
mod_1 <- plm(ICS ~ ICE + PX1Q2 + RATEX + ZLB + INCOME + AGE + EDUC + MARRY + SEX + AGE_sq,
             data = Michigan_panel,
             index = c("CASEID", "YYYY"),
             model = "within")
1. Different approach
From my knowledge, we can also code this formula in a more compact format.
library(plm)
Michigan_panel <- pdata.frame(Michigan_survey, index = c("CASEID", "YYYY"))
attach(Michigan_panel)
y <- cbind(ICS)
X <- cbind(ICE,PX1Q2,RATEX,ZLB,INCOME,AGE,EDUC,MARRY,SEX,AGE_sq)
model1 <- plm(y ~ X + factor(CASEID) + factor(YYYY), data = Michigan_panel, model = "within")
summary(model1)
detach()
Adding factor(CASEID) and factor(YYYY) will add dummy variables to your model.

did modeling in R - right set up of data in staggered model

I would appreciate any insights into staggered DiD (difference-in-differences) models.
I wanted to ask whether I am using the correct function to set up the model (data structure provided below):
did=time*treated
didreg = lm(y ~ time + treated + did + x + factor(year) + factor(firm), data = sample)
The data looks like this: (screenshot of the data from the original post omitted)
I'm not familiar with difference-in-differences modelling, but from skimming the Wikipedia article it seems that what you want is a simple interaction. To fit that, you don't even need to calculate a new variable (did); you can specify it directly in the model. There are a couple of ways to specify that with R formula syntax:
# Simple main effects models, no interactions
main_mod <- lm(y ~ time + treated + x + factor(year) + factor(firm), data = sample)
# Model with the interaction effect explicitly specified
did_mod1 <- lm(y ~ time + treated + time:treated + x + factor(year) + factor(firm), data = sample)
# Model with shortened syntax for specifying interactions
did_mod2 <- lm(y ~ time * treated + x + factor(year) + factor(firm), data = sample)
did_mod1 and did_mod2 are identical, did_mod2 is just a more compact way of writing the same model. The * indicates that you want both the main effects and the interactions of the variables to the left and the right. It's recommended to always fit main effects when you fit interactions, so the second way of writing the model saves time & space.
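You can verify the equivalence numerically with simulated data (a sketch with a hypothetical variable layout, not the asker's actual sample):

```r
set.seed(42)
n <- 40
sample_df <- data.frame(
  time    = rep(0:1, each = n / 2),   # pre/post indicator
  treated = rep(0:1, times = n / 2),  # treatment group indicator
  x       = rnorm(n),
  year    = rep(2001:2004, 10),
  firm    = rep(letters[1:5], 8)
)
sample_df$y <- rnorm(n)

did_mod1 <- lm(y ~ time + treated + time:treated + x + factor(year) + factor(firm),
               data = sample_df)
did_mod2 <- lm(y ~ time * treated + x + factor(year) + factor(firm),
               data = sample_df)

# Identical coefficients, including the time:treated interaction
all.equal(coef(did_mod1), coef(did_mod2))  # TRUE
```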

lmer multilevel fit with intercept constraint

I regularly run into this problem: I want to fit a multilevel regression with a constraint, and I don't know how to do it. I usually end up using lavaan, as it allows setting constraints on the regression coefficients. But lavaan can't fit random-slope models (only random intercepts, and truthfully I don't know how to set a constraint on the intercept in lavaan either), and I would like a proper multilevel approach.
So basically I have a variable y with a second-order polynomial dependence on x, with coefficients depending on the subject ID:
library(data.table)
library(ggplot2)
df <- data.table(x = rep(0:10, 5), ID = rep(LETTERS[1:5], each = 11))
df[, a := rnorm(1, 2, 1), by = ID]
df[, b := rnorm(1, 1, 0.2), by = ID]
df[, y := rnorm(.N, 0, 10) + a * x + b * x^2]
ggplot(df, aes(x, y, color = ID)) +
  geom_point()
and I can fit a normal multilevel model:
lmer(y ~ x + I(x^2) + (x + I(x^2) | ID), df)
But I would like to constrain the intercept to be 0. Is there a simple way to do so?
Thank you
You can suppress the intercept with -1. For example:
coef(summary(lmer(y ~ x + I(x^2) + (x + I(x^2) | ID), df)))
Estimate Std. Error t value
(Intercept) -1.960196 4.094491 -0.4787398
x 2.535092 1.754963 1.4445275
I(x^2) 1.015212 0.130004 7.8090889
coef(summary(lmer(y ~ -1 + x + I(x^2) + (x + I(x^2) | ID), df)))
Estimate Std. Error t value
x 1.831692 0.9780500 1.872800
I(x^2) 1.050261 0.1097583 9.568856
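Note that -1 only removes the fixed-effect intercept; the random term (x + I(x^2) | ID) still estimates a random intercept per ID. If the goal is to force every subject's curve through zero, the random intercept can be dropped too by putting 0 (or -1) inside the random term. A sketch, re-simulating data like the question's in base R (lme4 required):

```r
library(lme4)

set.seed(1)
# Simulate data like the question's, in base R
df <- data.frame(x = rep(0:10, 5), ID = rep(LETTERS[1:5], each = 11))
a <- rnorm(5, 2, 1)
b <- rnorm(5, 1, 0.2)
i <- match(df$ID, LETTERS[1:5])
df$y <- rnorm(55, 0, 10) + a[i] * df$x + b[i] * df$x^2

# -1 removes the fixed intercept; 0 inside the random term removes the
# per-ID random intercepts as well
fit <- lmer(y ~ -1 + x + I(x^2) + (0 + x + I(x^2) | ID), data = df)
names(fixef(fit))  # "x" "I(x^2)" -- no intercept anywhere
```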

How can I fit all variables using splines with gam in R without typing each one?

Let's say I have a data set with y and variables x1, x2, ..., xp.
I want to fit all of my predictors with splines.
For example:
gam_object = gam(y ~ s(x1) + s(x2) + s(xp), data)
How can I do this without typing every single variable? And what if I would like to fit the model with the first two variables entering without splines? How can I do that?
gam_object2 = gam(y ~ x1 + x2 + s(x3) + s(xp), data)
Maybe this could help you:
p <- 10
as.formula(paste0("y ~ ", paste0("s(x", 1:p, ")", collapse = " + ")))
If you don't want splines on the first two variables, or more generally on some specific variables, use:
data <- your_data  # placeholder for your data set
use <- c(3:6, 9:10)
dontuse <- c(1:2, 7:8)
form <- as.formula(
  paste0("y ~ ", paste0("s(x", use, ")", collapse = " + "), " + ",
         paste0("x", dontuse, collapse = " + ")))
And then run the model:
gam(data = data, formula = form, family = gaussian)
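As a sanity check, the constructed right-hand side can be inspected before handing the formula to gam() (base R only, no mgcv needed):

```r
use <- c(3:6, 9:10)
dontuse <- c(1:2, 7:8)
rhs <- paste0(paste0("s(x", use, ")", collapse = " + "), " + ",
              paste0("x", dontuse, collapse = " + "))
form <- as.formula(paste0("y ~ ", rhs))
rhs
# "s(x3) + s(x4) + s(x5) + s(x6) + s(x9) + s(x10) + x1 + x2 + x7 + x8"
```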
