Two-way Fixed Effects Model: Clustering Standard Errors - r

I'm running a two-way fixed effects regression in R with "firmID" and "Year" as the indexes in the model specification. I have provincial data and would like to cluster the standard errors at the provincial level. Is there a way to do this? It seems hard to cluster standard errors on any variable that is not one of the indexes, and using province as one of the indexes is not an option. So far the most promising route is vcovCR in the clubSandwich package; however, according to the documentation, this will not work for a two-way fixed effects model. Something similar to this approach would be great.
x <- plm(y ~ x, data=data, model='within', index=c("firmid", "year"), effect="twoway")
then followed by
clubSandwich::vcovCR(x, type = 'CR2', cluster = data$Prov)
to get clustered standard errors at the provincial level. But as I mentioned earlier, the package documentation specifies that this will only work with individual fixed effects. An approach like this that works with a two-way fixed effects model would be amazing. Any insight is appreciated in any form.
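To make the target of the question concrete, here is a minimal base-R sketch of what "two-way within estimator with province-clustered standard errors" computes. It is not the clubSandwich method and it does not implement the CR2 correction; it uses simulated, balanced data, and every name in it (d, prov, twoway_demean, se_cluster) is made up for illustration. The G/(G-1) scaling is only one common small-sample convention, and packages differ on this.

```r
# Simulated balanced panel: firms nested in provinces (all names are illustrative)
set.seed(1)
n_firm <- 60; n_year <- 8; n_prov <- 12
d <- expand.grid(firmid = 1:n_firm, year = 1:n_year)
d$prov <- (d$firmid - 1) %% n_prov + 1               # each firm sits in one province
prov_year_shock <- rnorm(n_prov * n_year)            # shock shared within a province-year
d$x <- rnorm(nrow(d)) + 0.3 * d$prov
d$y <- 2 * d$x +
  rnorm(n_firm)[d$firmid] +                          # firm effect
  rnorm(n_year)[d$year] +                            # year effect
  prov_year_shock[(d$prov - 1) * n_year + d$year] +  # within-province correlation
  rnorm(nrow(d))

# Two-way within transformation (exact only for a balanced panel)
twoway_demean <- function(v, i, t) v - ave(v, i) - ave(v, t) + mean(v)
yt <- twoway_demean(d$y, d$firmid, d$year)
xt <- twoway_demean(d$x, d$firmid, d$year)

beta <- sum(xt * yt) / sum(xt^2)                     # two-way FE slope
u <- yt - beta * xt                                  # within residuals

# Cluster-robust sandwich with one regressor: sum the scores within each province
score_by_prov <- tapply(xt * u, d$prov, sum)
G <- length(score_by_prov)
V_cr0 <- sum(score_by_prov^2) / sum(xt^2)^2
se_cluster <- sqrt(V_cr0 * G / (G - 1))              # assumed small-sample scaling

# Cross-check the slope against the dummy-variable (LSDV) regression
beta_lsdv <- unname(coef(lm(y ~ x + factor(firmid) + factor(year), data = d))["x"])
c(beta = beta, beta_lsdv = beta_lsdv, se_cluster = se_cluster)
```

The point of the sketch is that the clustering variable only enters when the scores are summed, so it does not have to be one of the panel indexes.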

Related

I want to use a fixed effects model on a regression with one variable being the group variable

I am using felm() and the code is running on all the model… but I need it to run on state only… the problem asks "Estimate the model using fixed effects (FE) at the state level". Using felm() is not getting me the correct results because I don't know if I need to include state as a dependent variable (doesn't give me correct answers) or how to specify that one variable needs to be the group variable (I'm assuming this is how to get accurate results).
I have tried using
plm(ind~depvar+state,data=data, model='within')
and
felm(ind~depvar+state,data=data)
Concretely:
FELinMod3<-felm(DRIVING$totfatrte~DRIVING$D81+DRIVING$state, data=DRIVING)
FELinMod3<-plm(DRIVING$totfatrte~DRIVING$D81+DRIVING$state, data=DRIVING, model='within')
The output gives coefficients that differ from the ones I know are correct from Stata.
It looks like felm() is meant for cases with multiple grouping variables, but it sounds like you're using only one grouping variable for fixed effects (i.e., state).
You should get the same correct result from
mod3 <- lm(totfatrte ~ D81 + state, data = DRIVING)
provided state is stored as a factor or character; if it is a numeric code, use factor(state) so that each state gets its own intercept.
Also, if the coefficients or standard errors disagree between Stata and R, that doesn't necessarily mean that R is wrong.
Reading the documentation for felm() indicates that your code should look more like this:
model3 <- felm(totfatrte ~ D81 | state, data = DRIVING)
but the specification gets fairly complex depending on whether you want to cluster your standard errors and so on.
Hope this helps.
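As a rough illustration of why the dummy-variable regression and a "within" fixed effects estimator agree, the sketch below uses simulated data as a stand-in for DRIVING (the data frame DRV and its values are invented; only the column names follow the question). It also shows what goes wrong when a numeric state code is entered without factor().

```r
# Hypothetical stand-in for the DRIVING data: 10 states, 6 observations each
set.seed(42)
DRV <- data.frame(state = rep(1:10, each = 6))
DRV$D81 <- rbinom(nrow(DRV), 1, 0.5)
DRV$totfatrte <- 20 + 1.5 * DRV$D81 + rnorm(10)[DRV$state] + rnorm(nrow(DRV))

# Numeric state code: fits a single slope on "state", NOT state fixed effects
b_numeric <- coef(lm(totfatrte ~ D81 + state, data = DRV))["D81"]

# State dummies (LSDV): one intercept per state
b_lsdv <- coef(lm(totfatrte ~ D81 + factor(state), data = DRV))["D81"]

# Within estimator: subtract state means, then regress without an intercept
y_w <- DRV$totfatrte - ave(DRV$totfatrte, DRV$state)
x_w <- DRV$D81 - ave(DRV$D81, DRV$state)
b_within <- coef(lm(y_w ~ x_w - 1))["x_w"]

c(numeric_code = unname(b_numeric), lsdv = unname(b_lsdv), within = unname(b_within))
```

The last two numbers coincide; the first generally does not, which is one plausible source of a mismatch with Stata.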

Most straightforward R package for setting subject as random effect in mixed logit model

I have a dataset in which individuals, each belonging to a particular group, repeatedly chose between multiple discrete outcomes.
subID group choice
1 Big A
1 Big B
2 Small B
2 Small B
2 Small C
3 Big A
3 Big B
. . .
. . .
I want to test how group membership influences choice, and want to account for non-independence of observations due to repeated choices being made by the same individuals. In turn, I planned to implement a mixed multinomial regression treating group as a fixed effect and subID as a random effect. It seems that there are a few options for multinomial logits in R, and I'm hoping for some guidance on which may be most easily implemented for this mixed model:
1) multinom - GLM, via nnet, allows the usage of the multinom function. This appears to be a nice, clear, straightforward option... for fixed effect models. However is there a manner to implement random effects with multinom? A previous CV post suggests that multinom is able to handle mixed-effects GLM with poisson distribution and a log link. However, I don't understand (a) why this is the case or (b) the required syntax. Can anyone clarify?
2) mlogit - A fantastic package, with incredibly helpful vignettes. However, the "mixed logit" documentation refers to models that have random effects related to alternative specific covariates (implemented via the rpar argument). My model has no alternative specific variables; I simply want to account for the random intercepts of the participants. Is this possible with mlogit? Is that variance automatically accounted for by setting subID as the id.var when shaping the data to long form with mlogit.data? EDIT: I just found an example of "tricking" mlogit to provide random coefficients for variables that vary across individuals (very bottom here), but I don't quite understand the syntax involved.
3) MCMCglmm is evidently another option. However, as a relative novice with R and someone completely unfamiliar with Bayesian stats, I'm not personally comfortable parsing example syntax of mixed logits with this package, or, even following the syntax, making guesses at priors or other needed arguments.
Any guidance toward the most straightforward approach and its syntax implementation would be thoroughly appreciated. I'm also wondering if the random effect of subID needs to be nested within group (as individuals are members of groups), but that may be a question for CV instead. In any case, many thanks for any insights.
I would recommend the Apollo package by Hess & Palma. It comes with great documentation and a quite helpful user group.

How do you correctly perform a glmmPQL on non-normal data?

I ran a model using glmer looking at the effect that Year and Treatment had on the number of points covered with wood, then plotted the residuals to check for normality and the resulting graph is slightly skewed to the right. Is this normally distributed?
model <- glmer(Number~Year*Treatment(1|Year/Treatment), data=data,family=poisson)
This site recommends using glmmPQL if your data is not normal: http://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html
library(MASS)
library(nlme)
model1<-glmmPQL(Number~Year*Treatment,~1|Year/Treatment,
family=gaussian(link = "log"),
data=data,start=coef(lm(Log~Year*Treatment)),
na.action = na.pass,verbose=FALSE)
summary(model1)
plot(model1)
Now do you transform the data in the Excel document or in the R code (Number1 <- log(Number)) before running this model? Does the link="log" imply that the data is already log transformed or does it imply that it will transform it?
If you have data with zeros, is it acceptable to add 1 to all observations to make it more than zero in order to log transform it: Number1<-log(Number+1)?
Is fit<-anova(model,model1,test="Chisq") sufficient to compare both models?
Many thanks for any advice!
tl;dr your diagnostic plots look OK to me, you can probably proceed to interpret your results.
This formula:
Number~Year*Treatment+(1|Year/Treatment)
might not be quite right (besides the missing + between the terms in your code above). In general you shouldn't include the same term in both the random and the fixed effects, although there is one exception: if Year has more than a few values and there are multiple observations per year, you can include it as a continuous covariate in the fixed effects and as a grouping factor in the random effects, so this might be correct.
I'm not crazy about the linked introduction; at a quick skim there's nothing horribly wrong with it, but there seem to be a lot of minor inaccuracies and confusions. "Use glmmPQL if your data aren't Normal" is really shorthand for "you might want to use a GLMM if your data aren't Normal". Your glmer model should be fine.
Interpreting diagnostic plots is a bit of an art, but the degree of deviation that you show above doesn't look like a problem.
Since you don't need to log-transform your data, you don't need to get into the slightly messy issue of how to log-transform data containing zeros. In general log(1+x) transformations for count data are reasonable, but, again, unnecessary here.
anova() in this context does a likelihood ratio test, which is a reasonable way to compare models.
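On the link = "log" question specifically: a log link models the log of the expected value and leaves the response untransformed, so raw counts (zeros included) go into the model as they are. The sketch below illustrates this with a plain Poisson GLM on simulated counts rather than with glmer or glmmPQL; the variable names and the true coefficients (0.2 and 0.8) are assumptions chosen for the illustration.

```r
# Simulated counts whose log-mean is linear in x; zeros occur naturally
set.seed(7)
n <- 5000
x <- runif(n)
counts <- rpois(n, lambda = exp(0.2 + 0.8 * x))

# Log link: the raw counts are the response, no transformation needed
fit_link <- glm(counts ~ x, family = poisson(link = "log"))

# Transforming the data instead: a different model on a different scale
fit_log1p <- lm(log(counts + 1) ~ x)

sum(counts == 0)   # number of zeros handled without any adjustment
coef(fit_link)     # on the log-mean scale, close to the simulated 0.2 and 0.8
coef(fit_log1p)    # not directly comparable to the GLM coefficients
```

The same logic carries over to the random-effects versions: with family = poisson the response stays on the count scale.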

Clustered robust standard errors on country-year pairs

I want to replicate a Stata do.file (panel model) in R, but unfortunately I'm ending up with the wrong standard error estimates. The data is proprietary, so I can't post it here. The Stata code used looks like:
xtreg Y X, vce(cluster countrycodeid) fe nonest dfadj
With fe for fixed effects, nonest indicating that the panels are not nested within the clusters, and dfadj for the fact that some sort of DF-adjustment takes place - not possible to find out which sort as of now.
My R code looks like this and gives the right coefficient values:
model <- plm(Y~X+as.factor(year),data=panel,model="within",index=c("codeid","year"))
Now comes the difficult part, for which I haven't found a solution so far, even after trying numerous robust standard error estimators, for example making extensive use of lmtest and various degrees-of-freedom adjustments. The standard errors are supposed to be clustered on country-year pairs (captured by the variable countrycodeid in the Stata code, which takes the form codeid-year), as there appears to be missing data for some variables which are not available on a monthly basis.
Does anyone know if there are special tricks to keep in mind when working with unbalanced panels and the plm() package, which sort of DF-adjustment can be used, and if there is a possibility to group data in the coeftest() function on a country-year basis?
This is not a complete answer.
Stata uses a finite sample correction described in this post. I think that may get your standard errors a tad closer.
Moreover, you can learn more about nonest/dfadj by issuing help whatsnew9. Stata used to adjust the VCE for the within transformation when the cluster() option was specified. The cluster-robust VCE no longer makes that adjustment unless dfadj is specified. You may need to use version control to replicate old estimates.
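Two pieces of this can be sketched in base R. The first builds a country-year cluster identifier like countrycodeid; the second computes a finite-sample scaling of the form G/(G-1) * (N-1)/(N-K), which is a commonly cited form of Stata's cluster correction. That formula is an assumption here, not something taken from the linked post, and the function name cluster_df_factor and the toy data are hypothetical. Whether K counts the absorbed fixed effects is the kind of thing the dfadj option changes.

```r
# Toy monthly-style panel: several rows per country-year (illustrative values)
panel <- data.frame(
  codeid = rep(1:3, each = 4),
  year   = rep(c(2000, 2000, 2001, 2001), times = 3)
)

# Country-year cluster id, analogous to countrycodeid in the Stata code
panel$countrycodeid <- interaction(panel$codeid, panel$year, drop = TRUE)
G <- nlevels(panel$countrycodeid)   # number of clusters
N <- nrow(panel)                    # number of observations

# Assumed scaling: G clusters, N observations, K estimated parameters
cluster_df_factor <- function(G, N, K) (G / (G - 1)) * ((N - 1) / (N - K))

# K = 1 counts only the slope; K = 1 + 3 also counts the three country effects
c(without_fe = cluster_df_factor(G, N, K = 1),
  with_fe    = cluster_df_factor(G, N, K = 1 + 3))
```

Multiplying an unadjusted cluster-robust variance by one of these factors (and taking the square root for standard errors) is one way to check which convention reproduces the Stata output.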

Test for Multicollinearity in Panel Data R

I am running a panel data regression using the plm package in R and want to control for multicollinearity between the explanatory variables.
I know there is the vif() function in the car-package, however as far as I know, it cannot deal with panel data output.
The plm package offers other diagnostics such as unit root tests, but I found no method to test for multicollinearity.
Is there a way to calculate a similar test to vif, or can I just regard each variable as a time-series, leaving out the panel information and run tests using the car package?
I cannot disclose the data, but the problem should be relevant to all panel data models.
The dimension is roughly 1,000 observations, over 50 time-periods.
The code I use looks like this:
pdata <- plm.data(RegData, index=c("id","time"))
fixed <- plm(Y~X, data=pdata, model="within")
and then
vif(fixed)
returns an error.
Thank you in advance.
This question has been asked for other statistical packages such as SAS (https://communities.sas.com/thread/47675) and Stata (http://www.stata.com/statalist/archive/2005-08/msg00018.html), and the common answer has been to fit a pooled model to get the VIF. The logic is that multicollinearity concerns only the independent variables, so there is no need to control for individual effects using panel methods.
Here's some code adapted from another site:
library(plm)
library(car)  # provides vif()
mydata <- read.csv("US Panel Data.csv")
# pdata.frame() replaces plm.data(), which is deprecated in newer plm versions
pdata <- pdata.frame(mydata, index = c("id", "t"))
# Pooled OLS: no individual effects, fitted only to obtain the VIFs
model <- plm(Return ~ ESG + Beta + Market.Cap + PTBV + Momentum + Dummy1 + Dummy2 + Dummy3 + Dummy4 + Dummy5 +
               Dummy6 + Dummy7 + Dummy8 + Dummy9,
             data = pdata, model = "pooling")
vif(model)
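Because the VIF depends only on the regressors, it can also be computed by hand without any panel machinery. The sketch below does that in base R on simulated data; the column names are borrowed from the code above, but the values and the correlation between ESG and Beta are invented for the illustration. Each VIF is 1/(1 - R²) from regressing one explanatory variable on the others, which also equals the corresponding diagonal entry of the inverse correlation matrix.

```r
# Simulated regressors: Beta is built to be correlated with ESG (assumption)
set.seed(3)
n <- 500
X <- data.frame(ESG = rnorm(n))
X$Beta <- 0.8 * X$ESG + rnorm(n, sd = 0.5)
X$Momentum <- rnorm(n)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest
vif_manual <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  1 / (1 - summary(lm(f, data = X))$r.squared)
})

# Same numbers via the inverse of the correlation matrix
vif_from_cor <- diag(solve(cor(X)))

round(rbind(vif_manual, vif_from_cor), 3)
```

Note that the response variable never appears, which is why the pooled-model shortcut gives the same VIFs as any other specification with the same regressors.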

Resources