Does first-difference work on unbalanced data with holes in plm? - r

On page 423 in Computational Laboratory for Economics, it is stated that "the argument model="fd" doesn't work correctly, with the current version (1.3-1) of plm, on unbalanced data with holes." Has this been fixed in the newer versions of plm?
As a workaround, the author used diff to obtain the first differences and fitted a model="pooling" on the differenced data. Can someone explain how the diff function works on unbalanced data with holes?
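To make the concern concrete, here is a minimal base-R sketch of what a plain row-wise diff does within each individual when a period is missing. The data frame `panel`, its columns, and the helper `lagged_diff` are made up for illustration; this is not plm's own diff method, which may behave differently depending on the version.

```r
# Toy unbalanced panel: id 1 has no row for 2002 (a "hole").
panel <- data.frame(
  id   = c(1, 1, 1, 1, 2, 2, 2),
  year = c(2000, 2001, 2003, 2004, 2000, 2001, 2002),
  y    = c(10, 12, 17, 18, 5, 6, 8)
)
panel <- panel[order(panel$id, panel$year), ]

# Hypothetical helper: difference to the previous row, NA for the first row.
lagged_diff <- function(v) c(NA, diff(v))

# Row-wise difference within each id: silently spans the hole (17 - 12).
panel$dy_row <- ave(panel$y, panel$id, FUN = lagged_diff)

# Time gap between consecutive rows of the same id.
panel$gap <- ave(panel$year, panel$id, FUN = lagged_diff)

# Time-aware difference: keep only differences between adjacent years.
panel$dy_time <- ifelse(panel$gap == 1, panel$dy_row, NA)

print(panel)
```

For id 1, the row-wise difference at 2003 is a two-year change, while the time-aware column sets it to NA. Which of the two a given workaround produces is worth checking before pooling the differenced data.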
Also, on page 68 of the plm documentation (version 1.6-5), it is stated that
"plm is a general function for the estimation of linear panel models. It supports the following estimation methods: pooled OLS (model = "pooling"), fixed effects ("within"), random effects ("random"), first–differences ("fd"), and between ("between"). It supports unbalanced panels and two–way effects (although not with all methods)."

Related

Multilevel mixed-effects tobit regression in R

I have a dataset with data left censored and I wanted to apply a multilevel mixed-effects tobit regression, but I only find information about how to do it in Stata. Is it possible to do it in R?
I found the packages 'VGAM' and 'censReg', but I don't get how to add fixed and random effects.
Also, my data is log-normally distributed. Is there a way to add this to the model?
Thanks!
According to Section 3.5 of a vignette, the censReg package can handle a mixed model if the data are prepared properly via the plm package.
This Cross Validated page shows an example.
I don't have experience with this; it might only work with formal panel data rather than more general random-effects structures.
If your data are truly log-normal, you could take logs first and set the lower censoring limit on the log scale. Note that an apparent log-normal distribution of outcomes might just represent a corresponding distribution of predictor values with an underlying normal error distribution around the predictions. Don't jump blindly into a log-normal assumption.
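A minimal sketch of the "take logs first" step, using simulated data. The variable names, the detection limit of 1.5, and the coefficients are assumptions for illustration only. The censReg call is the plain cross-sectional form, guarded so the block still runs without the package; the random-effects variant would additionally need the data prepared via plm, as the vignette describes.

```r
set.seed(42)
n <- 300
x <- rnorm(n)

# Simulated log-normal latent outcome (illustrative values only).
y_latent <- exp(1 + 0.5 * x + rnorm(n, sd = 0.7))

# Hypothetical lower detection limit on the original scale.
limit <- 1.5
y_obs <- pmax(y_latent, limit)   # values below the limit are recorded at the limit

# Move both the outcome and the censoring limit to the log scale.
log_y     <- log(y_obs)
log_limit <- log(limit)
censored  <- log_y <= log_limit

cat("share censored:", mean(censored), "\n")

# Tobit fit on the log scale, only if censReg is installed.
if (requireNamespace("censReg", quietly = TRUE)) {
  dat <- data.frame(log_y = log_y, x = x)
  fit <- censReg::censReg(log_y ~ x, left = log_limit, data = dat)
  print(summary(fit))
}
```

The key point is that `left` is set to the log of the limit, not the limit itself, once the outcome has been logged.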

random effects test in GLMM or zero-inflated mixed model

I'm considering several models for count data: GLM, GLMM, zero-inflated, and zero-inflated mixed models.
All my work was done in R.
Prior studies have shown that excess zeros and over-dispersion need to be considered in count data analysis.
So I tried the following tests.
1. zero excess
A Vuong test was performed comparing the zero-inflated model with the GLM.
vuong from the pscl package was used.
ZIP vs. GLM Poisson
ZINB vs. GLM NB
Significant results were obtained from the above two tests (p<0.05).
2. over-dispersion
A dispersion test was performed on the Poisson model to check whether over-dispersion needs to be considered in the real data.
dispersiontest from the AER package was used (Cameron and Trivedi, 1990).
The above test results in rejection of the null hypothesis (p<0.05)
In addition, it was confirmed that dispersion parameter(1/theta) had a value of about 0.39.
However, I have not yet found a verification method for the reason why random effects should be considered.
My data is traffic accident data according to the year of each road. i.e. it is longitudinal count data.
I was told by a professor of statistics that a mixed model should be used considering road heterogeneity.
Therefore, I constructed GLMM poisson/NB and zero-inflated mixed poisson/NB using random effects by road and confirmed the results.
GLMM used glmer of lme4, and glmmTMB of glmmTMB was used as the zero-inflated mixed model.
I tried the Hausman test at first. However, this test compares the fixed-effects model with the random-effects model and seemed inappropriate for count data (not a linear model).
Crucially, I have not seen any previous study that used the Hausman test to test the random effect of a mixed model for count data.
Therefore, my question is as follows:
1. I would like to know if there is a previous study that justifies including a random effect when modeling longitudinal data.
2. Is there a validation method to verify the significant effects of random effects in the mixed model?
The AIC and BIC comparison has already been carried out.
3. If there is a way, which R package provides it, and how is it used?

Fixed Effects in betareg

I am currently working with 20+ years of panel data, and upon implementing a standard linear fixed effects model I realized that there was considerable non-linearity towards the boundary. As I am working with data on the Gini coefficient, I implemented a beta regression.
There is definitely a need to include fixed effects but I am not entirely sure how to do this in a beta regression framework as implemented through the betareg package in R. Is it unproblematic to simply include individual dummies in the regression?

R: mixed model with heteroscedastic data -> only lm function works?

This question asks the same question, but hasn't been answered. My question relates to how to specify the model with the lm() function and is therefore a programming (not statistical) question.
I have a mixed design (2 repeated and 1 independent predictors). Participants were first primed into group A or B (this is the independent predictor) and then they rated how much they liked 4 different statements (these are the two repeated predictors).
There are many great online resources on how to model this data. However, my data is heteroscedastic, so I would like to use heteroscedasticity-consistent covariance matrices. This paper explains it well. The sandwich and lmtest packages are great. Here is a good explanation of how to do it for an independent design in R with lm(y ~ x).
It seems that I have to use lm, or else it won't work?
Here is the code for a regression model assuming that all variances are equal (which they are not as Levene's test comes back significant).
fit3 <- nlme::lme(DV ~ repeatedIV1*repeatedIV2*independentIV1, random = ~1|participants, data = df) ## works fine
Here is the code for an independent design correcting for heteroscedasticity, which works.
fit3 <- lm(DV ~ independentIV1, data = df)
library(sandwich)
vcovHC(fit3, type = 'HC4', sandwich = FALSE)
library(lmtest)
coeftest(fit3, vcov. = vcovHC, type = 'HC4') ## coeftest, not coef: coef() ignores vcov.
So my question really is, how to specify my model with lm?
Alternative approaches in R how to fit my model accounting for heteroscedasticity are welcome too!
Thanks a lot!!!
My impression is that your problems come from mixing various approaches for various aspects (repeated measurements/correlation vs. heteroscedasticity) that cannot be mixed so easily. Instead of using random effects you might also consider fixed effects, or instead of only adjusting the inference for heteroscedasticity you might consider a Gaussian model and model both mean and variance, etc. For me, it's hard to say what is the best route forward here. Hence, I only comment on some aspects regarding the sandwich package:
The sandwich package is not limited to lm/glm only but is in principle object-oriented, see vignette("sandwich-OOP", package = "sandwich") (also published as doi:10.18637/jss.v016.i09).
There are suitable methods for a wide variety of packages/models but not for nlme or lme4. The reason is that it's not so obvious for which mixed-effects models the usual sandwich trick actually works. (Disclaimer: But I'm no expert in mixed-effects modeling.)
However, for lme4 there is a relatively new package called merDeriv (https://CRAN.R-project.org/package=merDeriv) that supplies estfun and bread methods so that sandwich covariances can be computed for lmer output etc. There is also a working paper associated with that package: https://arxiv.org/abs/1612.04911
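For readers unfamiliar with the "sandwich trick", here is a minimal base-R sketch of the simplest case (HC0 for an lm fit), built by hand as bread times meat times bread. The simulated data and object names are illustrative assumptions. This only covers the plain linear model and says nothing about whether the same construction is valid for a given mixed model.

```r
set.seed(1)
n <- 200
x <- runif(n)
# Error standard deviation grows with x, so the data are heteroscedastic.
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + 2 * x)

fit <- lm(y ~ x)
X <- model.matrix(fit)   # n x p design matrix
e <- residuals(fit)      # OLS residuals

bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * e)      # X' diag(e^2) X; row i of X is scaled by e[i]

vcov_hc0 <- bread %*% meat %*% bread
se_hc0   <- sqrt(diag(vcov_hc0))

# Compare robust and conventional standard errors.
print(cbind(classical = sqrt(diag(vcov(fit))), HC0 = se_hc0))

# Optional cross-check against the sandwich package, if installed.
if (requireNamespace("sandwich", quietly = TRUE)) {
  print(all.equal(vcov_hc0, sandwich::vcovHC(fit, type = "HC0"),
                  check.attributes = FALSE))
}
```

The object-oriented design mentioned above generalizes exactly these two pieces: a bread method and an estfun method (from which the meat is computed) for each model class.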

gbm::interact.gbm vs. dismo::gbm.interactions

Background
The reference manual for the gbm package states that the interact.gbm function computes Friedman's H-statistic to assess the strength of variable interactions. The H-statistic is on the scale of [0, 1].
The reference manual for the dismo package does not reference any literature for how the gbm.interactions function detects and models interactions. Instead it gives a list of general procedures used to detect and model interactions. The dismo vignette "Boosted Regression Trees for ecological modeling" states that the dismo package extends functions in the gbm package.
Question
How does dismo::gbm.interactions actually detect and model interactions?
Why
I am asking this question because gbm.interactions in the dismo package yields results >1, which the gbm package reference manual says is not possible.
I checked the tar.gz for each of the packages to see if the source code was similar. It is different enough that I cannot determine if these two packages are using the same method to detect and model interactions.
To summarize, the difference between the two approaches boils down to how the "partial dependence function" of the two predictors is estimated.
The dismo package is based on code originally given in Elith et al., 2008 and you can find the original source in the supplementary material. The paper very briefly describes the procedure. Basically the model predictions are obtained over a grid of two predictors, setting all other predictors at their means. The model predictions are then regressed onto the grid. The mean squared errors of this model are then multiplied by 1000. This statistic indicates departures of the model predictions from a linear combination of the predictors, indicating a possible interaction.
From the dismo package, we can also obtain the relevant source code for gbm.interactions. The interaction test boils down to the following commands (copied directly from source):
interaction.test.model <- lm(prediction ~ as.factor(pred.frame[,1]) + as.factor(pred.frame[,2]))
interaction.flag <- round(mean(resid(interaction.test.model)^2) * 1000,2)
pred.frame contains a grid of the two predictors in question, and prediction is the prediction from the original gbm fitted model where all but two predictors under consideration are set at their means.
This is different from Friedman's H statistic (Friedman & Popescu, 2005), which is estimated via formula (44) for any pair of predictors. This is essentially the departure from additivity for any two predictors, averaging over the values of the other variables, NOT setting the other variables at their means. It is expressed as a fraction of the total variance of the partial dependence function of the two variables (or model-implied predictions), so it will always be between 0 and 1.
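A toy sketch of why the dismo-style flag is not confined to [0, 1]. The function `interaction_flag` wraps the two lines quoted above from the dismo source. Everything else is made up for illustration: the grid, the two toy "prediction" surfaces, and `normalized_ratio`, which is only a variance-normalized analogue in the spirit of the H statistic, not Friedman's estimator and not anything gbm computes.

```r
# Grid over two predictors (illustrative values).
grid <- expand.grid(x1 = seq(0, 1, length.out = 10),
                    x2 = seq(0, 1, length.out = 10))

# dismo-style flag: mean squared residual of an additive fit, times 1000.
interaction_flag <- function(prediction, pred.frame) {
  m <- lm(prediction ~ as.factor(pred.frame[, 1]) + as.factor(pred.frame[, 2]))
  round(mean(resid(m)^2) * 1000, 2)
}

# Hypothetical normalized analogue: non-additive share of the total variation.
normalized_ratio <- function(prediction, pred.frame) {
  m <- lm(prediction ~ as.factor(pred.frame[, 1]) + as.factor(pred.frame[, 2]))
  sum(resid(m)^2) / sum((prediction - mean(prediction))^2)
}

# Toy surfaces standing in for model predictions.
additive    <- 2 * grid$x1 + 3 * grid$x2
interacting <- 2 * grid$x1 + 3 * grid$x2 + 5 * grid$x1 * grid$x2

flag_add  <- interaction_flag(additive, grid)
flag_int  <- interaction_flag(interacting, grid)
ratio_int <- normalized_ratio(interacting, grid)

print(c(flag_additive = flag_add, flag_interacting = flag_int,
        ratio_interacting = ratio_int))
```

The flag depends on the scale of the predictions and on the factor of 1000, so values above 1 are expected, whereas a ratio of non-additive to total variation is bounded by construction.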
