How to estimate a regression with both variables i and t simultaneously - r

I want to estimate a regression for a variable, LWAGE (log wage), against EXP (years of work experience). The data that I have has participants tracked across 7 years, so each year their number of years of work experience increases by 1.
When I do the regression for
πΏπ‘Šπ΄πΊπΈπ‘– = 𝛽0 + 𝛽1𝐸𝐷𝑖 + 𝑒𝑖
I used
reg1 <- lm(LWAGE~EXP, data=df)
Now I'm trying to do the following regression:
πΏπ‘Šπ΄πΊπΈπ‘–π‘‘ = 𝛽0 + 𝛽1𝐸𝑋𝑃𝑖𝑑 + 𝑒i.
But I'm not sure how to include the time-based component in my regression. I searched around but couldn't find anything relevant.

Are you attempting to include time fixed effects in your model, or an interaction between your variable EXP and time (call it TIME for this demonstration)?
For time fixed effects with lm() you can simply include time as a variable in your model; TIME should be a factor.
reg2 <- lm(LWAGE~EXP + TIME, data = df)
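If TIME is stored as a number in df, convert it first (the column name TIME is assumed here):
df$TIME <- factor(df$TIME)  # treat each year as its own level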
As an interaction between EXP and TIME it would be
reg3 <- lm(LWAGE~EXP*TIME, data = df)
Based on your description it sounds like you might be looking for the interaction (note that EXP*TIME expands to EXP + TIME + EXP:TIME), i.e. how does the effect of experience on log wages vary over time?
You can also take a look at the plm package for working with panel data.
https://cran.r-project.org/web/packages/plm/vignettes/plmPackage.html
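For instance, a minimal plm sketch, assuming df also has a person identifier column named ID (a hypothetical name; adjust to your data):
library(plm)
# declare the panel structure: ID indexes individuals, TIME indexes years
panel_df <- pdata.frame(df, index = c("ID", "TIME"))
# "within" estimator: individual fixed effects
reg4 <- plm(LWAGE ~ EXP, data = panel_df, model = "within")
summary(reg4)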

Related

backwards selection of glm does not change the complete model

I am very new to working with GLMs. I have a dataset with categorical (factor) and numerical predictor variables, and the response variable is count data with a Poisson distribution. I put these in a glm:
glm2<- glm(formula = count ~ Salinity + Period + Intensity + Depth + Temp + Treatment, data = dfglm, family = "poisson")
Treatment (levels 1.1–3.6) and Period (morning/midday) are factors.
The summary output is not reproduced here. I already see multiple surprising things in it (a very big difference between the null deviance and the residual deviance, treatment level 1.1 not showing, the morning and midday levels of Period not shown separately, and very high standard errors), but I will continue for now.
For the backward selection I used this code:
backward<-step(glm2,direction="backward",trace=0)
summary(backward)
I got exactly the same output as given above. Also when checking backward$coefficients, all coefficients remained.
If anyone could give me advice/an interpretation of this output and how to make a better model with a working backward selection, it is greatly appreciated!
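An editorial note, since this thread has no answer in this excerpt: step() keeps every term whose removal would raise the AIC, so an unchanged model simply means the full model already has the lowest AIC. The single-term comparisons can be inspected directly with drop1():
drop1(glm2, test = "LRT")  # AIC and likelihood-ratio test for each single-term deletion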

How do I include a quadratic time trend to a plm fixed effects model in R?

I am working with an unbalanced panel of 16 entities and 38 years. So far, I've simply used the plm() twoways specification for my panel and I clustered heteroscedastic standard errors at the state level. Now, I'd like to include a state-specific quadratic time trend.
So far, I've tried to do it with a variable giving the year of each observation for each state (1980 to 2019, in chronological order), and I simply included this variable as an explanatory variable:
fixed_trend <- plm(X ~ Y + Z + time + time^2, data = df, model = "within", effect = "twoways")
summary(fixed_trend, vcovHC(fixed_trend, type="HC3", cluster="group"))
The output of this regression, however, doesn't show the trends; it only gives me the coefficients of my explanatory variables (in this example, Y and Z). Can you tell me what I did wrong?
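An editorial note, since this thread has no answer in this excerpt: in an R formula, ^ denotes crossing, so time^2 expands to just time, and a plain linear time term is collinear with the time fixed effects of the twoways specification, so it is dropped. The square must be protected with I(), and a state-specific trend must be interacted with the state identifier. A rough sketch, assuming a factor column named state identifies the entities:
# I() makes ^ act as arithmetic squaring; state:... gives state-specific trends
fixed_trend <- plm(X ~ Y + Z + state:time + state:I(time^2),
                   data = df, model = "within", effect = "twoways")
summary(fixed_trend, vcov = vcovHC(fixed_trend, type = "HC3", cluster = "group"))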

Application of a multi-way cluster-robust function in R

Hello (first timer here),
I would like to estimate a "two-way" cluster-robust variance-covariance matrix in R. I am using a particular canned routine from the "multiwayvcov" library. My question relates solely to the set-up of the cluster.vcov function in R. I have panel data of various crime outcomes. My cross-sectional unit is the "precinct" (over 40 precincts) and I observe crime in those precincts over several "months" (i.e., 24 months). I am evaluating an intervention that 'turns on' (dummy coded) for only a few months throughout the year.
I include "precinct" and "month" fixed effects (i.e., a full set of precinct and month dummies enter the model). I have only one independent variable I am assessing. I want to cluster on "both" dimensions but I am unsure how to set it up.
Do I estimate all the fixed effects with lm first? Or do I simply run a model regressing crime on the independent variable (excluding the fixed effects), then use cluster.vcov with ~ precinct + month_year?
That seems like it would provide the wrong standard errors, though. Right? I hope this was clear; sorry for any confusion. See my setup below.
library(multiwayvcov)
library(lmtest)  # for coeftest()
model <- lm(crime ~ as.factor(precinct) + as.factor(month_year) + policy, data = DATASET_full)
boot_both <- cluster.vcov(model, ~ precinct + month_year)
coeftest(model, boot_both)
### What the documentation offers as an example
### https://cran.r-project.org/web/packages/multiwayvcov/multiwayvcov.pdf
library(lmtest)
data(petersen)
m1 <- lm(y ~ x, data = petersen)
### Double cluster by firm and year using a formula
vcov_both_formula <- cluster.vcov(m1, ~ firmid + year)
coeftest(m1, vcov_both_formula)
Is it appropriate to first estimate a model that ignores the fixed effects?
First the answer: you should estimate your lm model with the fixed effects included. This will give you your asymptotically correct parameter estimates. The standard errors are incorrect because they are calculated from a vcov matrix that assumes iid errors.
To replace the iid covariance matrix with a cluster-robust vcov matrix, you can use cluster.vcov, i.e. my_new_vcov_matrix <- cluster.vcov(model, ~ precinct + month_year).
Then a recommendation: I warmly recommend the felm function from the lfe package for both multi-way fixed effects and cluster-robust standard errors.
The syntax is as follows:
library(multiwayvcov)
library(lfe)
data(petersen)
# formula parts: regressors | fixed effects | instruments (none) | cluster variables
my_fe_model <- felm(y ~ x | firmid + year | 0 | firmid + year, data = petersen)
summary(my_fe_model)
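For comparison, a quick sketch (not part of the original answer) of the dummy-variable lm route on the same data; the coefficient on x should match felm's, and the two-way clustered standard errors should agree up to degrees-of-freedom corrections:
# two-way FE model via explicit dummies, then the two-way clustered vcov
m_fe <- lm(y ~ x + factor(firmid) + factor(year), data = petersen)
vc_fe <- cluster.vcov(m_fe, ~ firmid + year)
coeftest(m_fe, vc_fe)["x", ]  # estimate and clustered standard error for x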

How to convert Afex or car ANOVA models to lmer? Observed variables

In the afex package we can find this example of ANOVA analysis:
data(obk.long, package = "afex")
# estimate mixed ANOVA on the full design:
# can be written in any of these ways:
aov_car(value ~ treatment * gender + Error(id/(phase*hour)), data = obk.long,
observed = "gender")
aov_4(value ~ treatment * gender + (phase*hour|id), data = obk.long,
observed = "gender")
aov_ez("id", "value", obk.long, between = c("treatment", "gender"),
within = c("phase", "hour"), observed = "gender")
My question is: how can I write the same model in lme4?
In particular, I don't know how to include the "observed" term.
If I just write
lmer(value ~ treatment * gender + (phase*hour|id), data = obk.long,
observed = "gender")
I get an error telling me that observed is not a valid argument.
Furthermore, if I simply remove the observed option, lmer produces the error:
Error: number of observations (=240) <= number of random effects (=240) for term (phase * hour | id); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable.
Where in the lmer syntax do I specify the "between" or "within" variables? As far as I know, you just write the dependent variable on the left side and all other variables on the right side, with the error term as (1|id).
The package "car" uses the idata for the intra-subject variable.
I might not know enough about classical ANOVA theory to answer this question completely, but I'll take a crack at it. First, a couple of points:
the observed argument appears only to be relevant for the computation of effect size.
observed: 'character' vector indicating which of the variables are observed (i.e., measured) as compared to experimentally manipulated. The default effect size reported (generalized eta-squared) requires correct specification of the obsered [sic] (in contrast to manipulated) variables.
... so I think you'd be safe leaving it out.
if you want to override the error you can use
control=lmerControl(check.nobs.vs.nRE="ignore")
... but this probably isn't the right way forward.
I think but am not sure that this is the right way:
m1 <- lmer(value ~ treatment * gender + (1|id/phase:hour), data = obk.long,
control=lmerControl(check.nobs.vs.nRE="ignore",
check.nobs.vs.nlev="ignore"),
contrasts=list(treatment=contr.sum,gender=contr.sum))
This specifies that the interaction of phase and hour varies within id. The residual variance and the (phase-by-hour-within-id) variance are confounded (which is why we need the overriding lmerControl() specification), so don't trust those particular variance estimates. However, the main effects of treatment and gender should be handled just the same. If you load lmerTest on top of lme4 and run summary(m1) or anova(m1), it gives you the same degrees of freedom (10) for the fixed (gender and treatment) effects that afex computes.
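A minimal sketch of that check, using lmerTest's as_lmerModLmerTest() to attach Satterthwaite machinery to the existing fit (assuming m1 from above):
library(lmerTest)
m1_t <- as_lmerModLmerTest(m1)  # converts an lme4 fit for lmerTest's methods
anova(m1_t)  # F tests with denominator df for treatment, gender, and their interaction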
lme gives comparable answers, but needs to have the phase-by-hour interaction constructed beforehand:
library(nlme)
obk.long$ph <- with(obk.long,interaction(phase,hour))
m2 <- lme(value ~ treatment * gender,
random=~1|id/ph, data = obk.long,
contrasts=list(treatment=contr.sum,gender=contr.sum))
anova(m2,type="marginal")
I don't know how to reconstruct afex's tests of the random effects.
As Ben Bolker correctly says, simply leave observed out.
Furthermore, I would not recommend doing what you want to do. Using a mixed model for a data set without replications within each cell of the design per participant is somewhat questionable, as it is not really clear how to specify the random-effects structure. Importantly, the Barr et al. maxim of "keep it maximal" does not work here, as you realized: the model is overparameterized (hence the error from lmer).
I recommend using the ANOVA. More discussion of exactly this question can be found in a Cross Validated thread where Ben and I discussed it more thoroughly.

Mixed Modelling - Different Results between lme and lmer functions

I am currently working through Andy Field's book, Discovering Statistics Using R. Chapter 14 is on Mixed Modelling and he uses the lme function from the nlme package.
The model he creates, using speed dating data, is such:
speedDateModel <- lme(dateRating ~ looks + personality +
                        gender + looks:gender + personality:gender +
                        looks:personality,
                      random = ~1|participant/looks/personality,
                      data = speedData)
I tried to recreate a similar model using the lmer function from the lme4 package; however, my results are different. I thought I had the proper syntax, but maybe not?
speedDateModel.2 <- lmer(dateRating ~ looks + personality + gender +
looks:gender + personality:gender +
(1|participant) + (1|looks) + (1|personality),
data = speedData, REML = FALSE)
Also, when I look at the coefficients of these models, I notice that only random intercepts for each participant are produced. I was then trying to create a model that produces both random intercepts and slopes, but I can't seem to get the syntax right in either function. Any help would be greatly appreciated.
The only difference between the lme and the corresponding lmer formula should be that the random and fixed components are aggregated into a single formula:
dateRating ~ looks + personality +
gender + looks:gender + personality:gender +
looks:personality+ (1|participant/looks/personality)
using (1|participant) + (1|looks) + (1|personality) is only equivalent if looks and personality have unique values at each nested level.
It's not clear what continuous variable you want to use to define your slopes: if you have a continuous variable x and groups g, then (x|g), or equivalently (1+x|g), will give you a random-slopes model (x should also be included in the fixed-effects part of the model, i.e. the full formula should be y ~ x + (x|g) ...); see the sketch below.
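For concreteness, a minimal random-slopes sketch with hypothetical y, x, g, and dat:
library(lme4)
m_rs <- lmer(y ~ x + (1 + x | g), data = dat)  # random intercepts and slopes by g
# the nlme equivalent
library(nlme)
m_rs2 <- lme(y ~ x, random = ~ 1 + x | g, data = dat)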
update: I got the data, or rather a script file that allows one to reconstruct the data, from here. Field makes a common mistake in his book, which I have made several times in the past: since there is only a single observation in the data set for each participant/looks/personality combination, the three-way interaction has one level per observation. In a linear mixed model, this means the variance at the lowest level of nesting will be confounded with the residual variance.
You can see this in two ways:
lme appears to fit the model just fine, but if you try to calculate confidence intervals via intervals(), you get
intervals(speedDateModel)
## Error in intervals.lme(speedDateModel) :
## cannot get confidence intervals on var-cov components:
## Non-positive definite approximate variance-covariance
If you try this with lmer you get:
## Error: number of levels of each grouping factor
## must be < number of observations
In both cases, this is a clue that something's wrong. (You can overcome this in lmer if you really want to: see ?lmerControl.)
If we leave out the lowest grouping level, everything works fine:
sd2 <- lmer(dateRating ~ looks + personality +
gender + looks:gender + personality:gender +
looks:personality+
(1|participant/looks),
data=speedData)
Compare lmer and lme fixed effects:
all.equal(fixef(sd2),fixef(speedDateModel)) ## TRUE
The starling example here gives another instance of this issue, with further explanation.
