I'm considering several models for count data: GLM, GLMM, zero-inflated, and zero-inflated mixed models.
All my work was done in R.
Prior studies have shown that excess zeros and over-dispersion need to be considered when analysing count data.
So I tried the following tests.
1. zero excess
A Vuong test was performed comparing each zero-inflated model with the corresponding GLM.
vuong() from the pscl package was used.
ZIP vs. GLM Poisson
ZINB vs. GLM NB
Significant results were obtained from the above two tests (p<0.05).
2. over-dispersion
A dispersion test was performed on the Poisson model to check whether over-dispersion needs to be
considered in the real data.
dispersiontest() from the AER package was used (Cameron and Trivedi, 1990).
The test rejected the null hypothesis of equidispersion (p<0.05).
In addition, the dispersion parameter (1/theta) was found to be about 0.39.
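As a rough cross-check of the over-dispersion result, the Pearson dispersion statistic of a Poisson GLM can be computed in base R. The sketch below uses simulated data only; the variable names, the sample size, and the choice of a negative binomial generator with 1/theta = 0.39 are illustrative assumptions, not the traffic data from this post.

```r
# Minimal sketch on simulated data (all names and values are hypothetical).
set.seed(42)
n  <- 500
x  <- rnorm(n)
mu <- exp(0.5 + 0.3 * x)

# Negative binomial counts: Var(y) = mu + mu^2 / theta, with 1/theta = 0.39
y <- rnbinom(n, mu = mu, size = 1 / 0.39)

# Fit a plain Poisson GLM, which assumes Var(y) = mu
fit <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: close to 1 under equidispersion,
# clearly above 1 when the counts are over-dispersed
pearson_disp <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(pearson_disp)
```

A value well above 1 points in the same direction as a rejected dispersion test, but it is only a descriptive check and not a replacement for the formal test.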
However, I have not yet found a way to test whether random effects need to be considered.
My data are traffic accident counts for each road by year, i.e. longitudinal count data.
I was told by a professor of statistics that a mixed model should be used considering road heterogeneity.
Therefore, I fitted Poisson/NB GLMMs and zero-inflated mixed Poisson/NB models with a random effect for road and checked the results.
The GLMMs were fitted with glmer() from lme4, and the zero-inflated mixed models with glmmTMB() from glmmTMB.
I tried the Hausman test at first. However, this test compares a fixed-effects model with a random-effects model, and I considered it inappropriate for count data (not a linear model).
Crucially, I found no previous study that used the Hausman test to test the random effect of a mixed model for count data.
Therefore, my question is as follows:
1. Is there a previous study that justifies including a random effect when modelling longitudinal data?
2. Is there a method to test whether the random effects in a mixed model are significant?
The AIC and BIC comparison has already been carried out.
3. If there is such a method, which R package provides it, and how is it used?
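For question 2, one commonly used check is a likelihood-ratio test between a model with and without the random intercept for road. This is not taken from the post; it is a sketch on simulated data, and the column names (accidents, traffic, road, year) are hypothetical. Because the variance is tested on the boundary of its parameter space, the naive chi-square p-value is conservative, and halving it is a common approximation.

```r
# Sketch only: simulated road-by-year counts, hypothetical variable names.
library(glmmTMB)

set.seed(1)
n_road <- 40
n_year <- 6
d <- data.frame(
  road = factor(rep(seq_len(n_road), each = n_year)),
  year = rep(seq_len(n_year), times = n_road)
)
u <- rnorm(n_road, sd = 0.7)                 # road-level heterogeneity
d$traffic   <- rnorm(nrow(d))
d$accidents <- rpois(nrow(d), exp(0.3 + 0.4 * d$traffic + u[as.integer(d$road)]))

# Same package fits both models, so the log-likelihoods are comparable
m0 <- glmmTMB(accidents ~ traffic,              family = poisson, data = d)
m1 <- glmmTMB(accidents ~ traffic + (1 | road), family = poisson, data = d)

# Likelihood-ratio statistic for the random-intercept variance
lrt        <- as.numeric(2 * (logLik(m1) - logLik(m0)))
p_naive    <- pchisq(lrt, df = 1, lower.tail = FALSE)
p_boundary <- 0.5 * p_naive                  # 50:50 mixture of chi2_0 and chi2_1

print(c(LRT = lrt, p_naive = p_naive, p_boundary = p_boundary))
```

The same comparison could be repeated with a negative binomial or zero-inflated family in both models, provided the two models differ only in the random effect.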
Related
I am currently working on univariate GARCH models with different specifications and got stuck on including the exponential term in the variance equation:
mean model (setting ω4 = 0)
variance model
I am using the rugarch package in R and (unsuccessfully) tried the 'eGARCH' model type and external regressor option for the recession dummy INBER to get the estimates. Is this generally the correct way for including the exponential part or am I completely off?
I am working on a LDA model with textmineR, have calculated coherence, log-likelihood measures and optimized my model.
As a last step I would like to see how well the model predicts topics on unseen data. Thus, I am using the predict() function from the textmineR package in combination with Gibbs sampling on my test sample.
This results in predicted "theta" values for each document in my test sample.
I have read in another post that perplexity calculations are not available in the textmineR package (see: How do i measure perplexity scores on a LDA model made with the textmineR package in R?), so I am now wondering what the purpose of the prediction function is. Especially with a large dataset of over 100,000 documents, it is hard to assess visually whether the prediction has performed well.
I do not want to use perplexity for model selection (I am using coherence/log-likelihood instead), but as far as I understand, perplexity would help me to understand how good the prediction is and how "surprised" the model is by new, previously unseen data.
Since this does not seem to be available for textmineR, I am not sure how to assess the model prediction. Is there anything else that I could use to measure the prediction quality of my textmineR model?
Thank you!
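Perplexity as described above can be computed by hand from three matrices: the held-out document-term counts, the predicted theta (documents x topics), and the fitted phi (topics x words). The base R sketch below uses small made-up dense matrices; the function name perplexity() is hypothetical, and it is an assumption that the real model exposes theta and phi in these shapes and that a sparse document-term matrix would first be converted or processed in chunks.

```r
# Sketch with toy matrices; perplexity() is a hypothetical helper, not a textmineR function.
perplexity <- function(dtm, theta, phi) {
  p_word <- theta %*% phi              # P(word | document), documents x words
  nz     <- dtm > 0                    # only observed words contribute
  loglik <- sum(dtm[nz] * log(p_word[nz]))
  exp(-loglik / sum(dtm))              # exp of minus the mean per-token log-likelihood
}

# Toy example: 3 held-out documents, 2 topics, 4 words
dtm   <- matrix(c(2, 0, 1, 3,
                  0, 4, 1, 0,
                  1, 1, 1, 1), nrow = 3, byrow = TRUE)
theta <- matrix(c(0.9, 0.1,
                  0.2, 0.8,
                  0.5, 0.5), nrow = 3, byrow = TRUE)
phi   <- matrix(c(0.4, 0.1, 0.2, 0.3,
                  0.1, 0.6, 0.2, 0.1), nrow = 2, byrow = TRUE)

pp <- perplexity(dtm, theta, phi)
print(pp)

# Sanity check: a model that spreads probability uniformly over the
# vocabulary has a perplexity equal to the vocabulary size
phi_uniform <- matrix(0.25, nrow = 2, ncol = 4)
pp_uniform  <- perplexity(dtm, theta, phi_uniform)
print(pp_uniform)
```

Lower values mean the held-out documents are less surprising to the model; comparing against the uniform baseline gives a sense of scale.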
I am working on a dataset that has random effects (so I need a mixed-effects model). The response variable is a count (non-negative, integer) which is also zero-inflated (51% zeros). The model that I have arrived at is a zero-inflated generalized linear mixed-effects model (ZIGLMM). Several packages that I have attempted to use to fit such a model include glmmTMB and glmmADMB in R.
My question is: is it possible to account for spatial autocorrelation using such a model and if so, how can it be done? I am unsure if this has been done before since both packages are relatively new.
I'm trying to optimize a multivariate linear regression model lmMod=lm(depend_var~var1+var2+var3+var4....,data=df) and I'm presently checking the assumptions of the model: constant variance of the residuals and absence of auto-correlation. For this I'm using:
Breusch-Pagan test for homo/heteroscedasticity: lmtest::bptest(lmMod)
Durbin Watson test for auto-correlation: durbinWatsonTest(lmMod)
I found examples which are testing either one independent variable at a time:
example for Breusch-Pagan test – one independent variable:
https://datascienceplus.com/how-to-detect-heteroscedasticity-and-rectify-it/
example for Durbin Watson test - one independent variable:
http://math.furman.edu/~dcs/courses/math47/R/library/lmtest/html/dwtest.html
or the whole model with several independent variables at a time:
example for Durbin Watson test – multiple independent variable:
https://www.rdocumentation.org/packages/car/versions/2.1-6/topics/durbinWatsonTest
Here are the questions:
1. Can durbinWatsonTest() and bptest() be fed with a whole multivariate model?
2. If the answer to 1 is yes, how is it then possible to determine which variable is causing heteroscedasticity or auto-correlation in the model in order to fix it, as each of those tests gives only one p-value for the entire multivariate model?
3. If the answer to 1 is no, the tests should then be performed with one independent variable at a time. But homoscedasticity can only be tested AFTER a particular regression has been fitted. Hence the pattern of homo/heteroscedasticity in a univariate regression model lmMod_1=lm(depend_var~var1, data=df) will differ from the pattern in a multivariate regression model lmMod_2=lm(depend_var~var1+var2+var3+var4....,data=df)
Thank you very much in advance for your help!
Let me offer a first attempt at an answer.
The answer to the first question: Yes, you can use the Breusch-Pagan test and the Durbin-Watson test for multivariate models. (However, I have always used dwtest() instead of durbinWatsonTest().)
Also note that dwtest() checks only for first-order autocorrelation. Unfortunately, I do not know how to find out which variable is causing heteroscedasticity or auto-correlation. However, if you encounter these problems, one possible solution is to use robust standard errors: the Newey-West estimator in the case of autocorrelation, via coeftest(lmMod, vcov = NeweyWest), or a heteroscedasticity-consistent estimator in the case of heteroscedasticity, via coeftest(lmMod, vcov = vcovHC). Both are available once the AER package is loaded (coeftest() comes from lmtest, NeweyWest() and vcovHC() from sandwich).
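To make it concrete that both tests work on the residuals of the whole model, the statistics can be computed by hand in base R. The sketch below uses simulated data in which the error variance depends on var2 only (an assumption made for illustration); it computes the Durbin-Watson statistic and the studentized (Koenker) form of the Breusch-Pagan statistic, n times the R-squared of an auxiliary regression of the squared residuals on the regressors. The coefficient table of that auxiliary regression is one informal way to see which regressor the residual variance is associated with.

```r
# Sketch on simulated data; variable names follow the question, values are made up.
set.seed(7)
n  <- 500
df <- data.frame(var1 = rnorm(n), var2 = rnorm(n), var3 = rnorm(n))

# Error standard deviation grows with var2 only (heteroscedasticity by design)
df$depend_var <- 1 + 2 * df$var1 - df$var2 + 0.5 * df$var3 +
  rnorm(n, sd = exp(0.5 * df$var2))

lmMod <- lm(depend_var ~ var1 + var2 + var3, data = df)
e     <- resid(lmMod)

# Durbin-Watson statistic: about 2 when there is no first-order autocorrelation
dw <- sum(diff(e)^2) / sum(e^2)

# Studentized Breusch-Pagan statistic from the auxiliary regression
aux     <- lm(e^2 ~ var1 + var2 + var3, data = df)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 3, lower.tail = FALSE)

print(c(DW = dw, BP = bp_stat, p = bp_p))

# Which regressor is the squared residual most strongly related to?
aux_t <- summary(aux)$coefficients[-1, "t value"]
print(aux_t)
```

In this simulation the largest auxiliary t-value belongs to var2, matching how the data were generated; with real data this is a diagnostic hint rather than a formal per-variable test.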
Background
The reference manual for the gbm package states that the interact.gbm function computes Friedman's H-statistic to assess the strength of variable interactions. The H-statistic is on the scale of [0-1].
The reference manual for the dismo package does not reference any literature for how the gbm.interactions function detects and models interactions. Instead it gives a list of general procedures used to detect and model interactions. The dismo vignette "Boosted Regression Trees for ecological modeling" states that the dismo package extends functions in the gbm package.
Question
How does dismo::gbm.interactions actually detect and model interactions?
Why
I am asking this question because gbm.interactions in the dismo package yields results >1, which the gbm package reference manual says is not possible.
I checked the tar.gz for each of the packages to see if the source code was similar. It is different enough that I cannot determine if these two packages are using the same method to detect and model interactions.
To summarize, the difference between the two approaches boils down to how the "partial dependence function" of the two predictors is estimated.
The dismo package is based on code originally given in Elith et al., 2008, and you can find the original source in the supplementary material. The paper describes the procedure very briefly. Basically, the model predictions are obtained over a grid of the two predictors, with all other predictors set at their means. The model predictions are then regressed onto the grid, with each predictor entered as a factor. The mean of the squared residuals of this regression is then multiplied by 1000. This statistic measures the departure of the model predictions from an additive combination of the two predictors, indicating a possible interaction.
From the dismo package, we can also obtain the relevant source code for gbm.interactions. The interaction test boils down to the following commands (copied directly from source):
interaction.test.model <- lm(prediction ~ as.factor(pred.frame[,1]) + as.factor(pred.frame[,2]))
interaction.flag <- round(mean(resid(interaction.test.model)^2) * 1000,2)
pred.frame contains a grid of the two predictors in question, and prediction is the prediction from the original gbm fitted model where all but two predictors under consideration are set at their means.
This is different from Friedman's H statistic (Friedman & Popescu, 2005), which is estimated via formula (44) for any pair of predictors. It is essentially the departure from additivity for any two predictors, averaging over the values of the other variables, NOT setting the other variables at their means. It is expressed as a fraction of the total variance of the partial dependence function of the two variables (or model-implied predictions), so it will always be between 0 and 1. The dismo statistic, in contrast, is an unnormalised mean squared residual on the scale of the predictions, which is why it can exceed 1.
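The difference in scaling can be illustrated in base R on a toy prediction surface. The sketch below is a simplification under stated assumptions: there are only two predictors, the "model prediction" is a known function on a regular grid rather than a fitted gbm, and the H-type ratio averages over the grid instead of over the observed data. The helper names dismo_flag() and h_squared() are hypothetical; the first mirrors the two lines quoted from gbm.interactions, the second is a grid-based analogue of the squared H statistic.

```r
# Sketch only: toy prediction surfaces, hypothetical helper names.
grid <- expand.grid(x1 = seq(0, 1, by = 0.1), x2 = seq(0, 1, by = 0.1))

# dismo-style statistic: mean squared residual of an additive factor model, times 1000
dismo_flag <- function(prediction, grid) {
  fit <- lm(prediction ~ as.factor(grid[, 1]) + as.factor(grid[, 2]))
  round(mean(resid(fit)^2) * 1000, 2)
}

# Grid-based analogue of Friedman's H^2: non-additive part of the centred
# joint partial dependence, as a fraction of its total variation
h_squared <- function(prediction, grid) {
  f12 <- prediction - mean(prediction)   # centred joint function
  f1  <- ave(f12, grid[, 1])             # centred main effect of predictor 1
  f2  <- ave(f12, grid[, 2])             # centred main effect of predictor 2
  sum((f12 - f1 - f2)^2) / sum(f12^2)
}

pred_small <- grid$x1 * grid$x2          # pure interaction, predictions in [0, 1]
pred_large <- 100 * pred_small           # same shape, predictions in [0, 100]
pred_add   <- grid$x1 + 2 * grid$x2      # purely additive, no interaction

flag_small <- dismo_flag(pred_small, grid)
flag_large <- dismo_flag(pred_large, grid)
flag_add   <- dismo_flag(pred_add, grid)

h2_small <- h_squared(pred_small, grid)
h2_large <- h_squared(pred_large, grid)
h2_add   <- h_squared(pred_add, grid)

print(c(flag_small = flag_small, flag_large = flag_large, flag_add = flag_add))
print(c(h2_small = h2_small, h2_large = h2_large, h2_add = h2_add))
```

Rescaling the predictions changes the dismo-style value by the square of the scale factor, while the normalised ratio stays the same and remains within 0 and 1; taking its square root gives the analogue of H itself.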