I am modeling coastal dolphin distribution with GAMs (package mgcv). Because most of the "events" modelled result in no sightings, the data contain many zeros.
The zero-inflated Poisson model seemed to work well with the data, but summary.gam was not able to provide an R-squared value, and the deviance explained came out negative.
I moved on to trying a negative binomial model, but I am unsure whether to model presence-absence data or to model the number of individuals in a sighting as count data.
Any ideas regarding the zero-inflated Poisson errors?
Any recommendations on modeling negative binomial as presence/absence or count?
Many thanks!
Related
I am using GLMMs in R to examine the influence of continuous predictor variables (x) on a biological count variable (y). My response variables (n=5) each have a high number of zeros (data distribution), so I have tested the fit of various distributions (genpois, poisson, nbinom1, nbinom2, zip, zinb1, zinb2) and selected the best-fitting one according to the lowest AIC/LogLik value.
According to this selection criterion, the three response variables with the highest number of zeros are best fit by the zero-inflated negative binomial (zinb2) distribution. Compared to the regular (non-zero-inflated) NB distribution, the delta AIC is between 30 and 150.
My question is: must I use the ZI models for these variables considering the dAIC? I have received advice from a statistician that if the dAIC between the ZI and non-ZI model is small enough, I should use the non-ZI model even if it is a marginally worse fit, since ZI models involve much more complicated modelling and interpretation. The distribution matters in this case because the ZINB and NB models select a different combination of top candidate models when testing my predictors.
Thank you for any clarification!
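To put a dAIC of that size in perspective, here is a minimal sketch in plain Python of how AIC, dAIC and Akaike weights are computed. The log-likelihoods and parameter counts are made-up illustration values, not taken from the question, and the function names are hypothetical.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2*logLik (lower is better)."""
    return 2 * n_params - 2 * log_lik

def akaike_weights(aics):
    """Relative support for each candidate model, given their AIC values."""
    best = min(aics)
    rel = [math.exp(-0.5 * (a - best)) for a in aics]
    total = sum(rel)
    return [r / total for r in rel]

# Hypothetical fits: a plain NB model and a zero-inflated NB model
# with two extra parameters for the zero-inflation part.
aic_nb = aic(log_lik=-520.0, n_params=4)
aic_zinb = aic(log_lik=-503.0, n_params=6)

delta = aic_nb - aic_zinb
weights = akaike_weights([aic_nb, aic_zinb])
print(f"dAIC = {delta:.1f}, Akaike weight of NB = {weights[0]:.2e}")
```

With these illustrative numbers the dAIC is 30, and the Akaike weight of the simpler model is already close to zero, which is why a difference of 30-150 is not usually regarded as "small".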
My outcome variable ranges from 0 to 8.5, with many zero values. The values contain decimals, and the distribution is right-skewed. I understand that models like the zero-inflated Poisson and zero-inflated negative binomial assume that the outcome is a count, i.e. values 0, 1, 2, etc. Is there an alternative GLM that can predict this type of data?
I need to specifically use a GLM in this case.
I know that when random forest (RF) is used for classification, the AUC is normally used to assess the quality of the classification on test data. However, I have no clue which metric to use to assess the quality of regression with RF. I now want to use RF for regression analysis, e.g. using a matrix with several hundred samples and features to predict the (numerical) concentration of chemicals.
The first step is to run randomForest to build the regression model, with y as a continuous numeric variable. How can I know whether the model is good or not, based on the Mean of squared residuals and % Var explained? Sometimes my % Var explained is negative.
Afterwards, if the model is fine and I apply it directly to the test data, I get the predicted values. How can I assess whether those predicted values are good or not? I read online that some people calculate the accuracy (formula: 1-abs(predicted-actual)/actual), which also makes sense to me. However, I have many zero values in my actual dataset. Are there any other ways to assess the accuracy of the predicted values?
Looking forward to any suggestions and thanks in advance.
The randomForest R package comes with an importance function which can be used to see how much each predictor contributes to the accuracy of a model. From the documentation:
importance(x, type=NULL, class=NULL, scale=TRUE, ...), where x is the output from your initial call to randomForest.
There are two types of importance measurements. One permutes the out-of-bag data to test the accuracy of the model. The other uses the Gini index. Again, from the documentation:
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
For further information, one more simple check you may do, really more of a sanity check than anything else, is to compare against something called the best constant model. The best constant model has a constant output, which is the mean of all responses in the test data set. It can be regarded as the crudest model possible. You may compare the average performance of your random forest model against the best constant model for a given set of test data. If the random forest does not outperform the best constant model by at least a factor of, say, 3-5, then your RF model is not very good.
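The comparison against the best constant model can be sketched as follows in plain Python. The data and function names are hypothetical. The "variance explained" here is 1 - MSE(model)/MSE(constant model), which, to my understanding, is the same idea as the "% Var explained" that randomForest reports from out-of-bag predictions; it goes negative whenever the model does worse than simply predicting the mean. MSE and MAE do not divide by the actual value, so zeros in the response are not a problem.

```python
def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def variance_explained(actual, predicted):
    """1 - MSE(model) / MSE(best constant model); negative if worse than the mean."""
    mean_y = sum(actual) / len(actual)
    baseline_mse = mse(actual, [mean_y] * len(actual))
    return 1.0 - mse(actual, predicted) / baseline_mse

# Hypothetical test-set responses with several zeros, and two sets of predictions.
actual = [0.0, 0.0, 1.0, 3.0, 0.0, 4.0]
good = [0.2, 0.1, 1.1, 2.7, 0.0, 3.8]
bad = [3.0, 2.0, 0.0, 0.0, 4.0, 1.0]

for name, pred in [("good", good), ("bad", bad)]:
    print(name, round(mse(actual, pred), 3), round(mae(actual, pred), 3),
          round(variance_explained(actual, pred), 3))
```

The "factor of 3-5" rule of thumb above corresponds to checking whether the baseline MSE divided by the model MSE exceeds that factor.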
The output of the zeroinfl regression from pscl provides a list of coefficients under "count model coefficients" as well as a list of coefficients under "zero-inflation model coefficients."
Given that the interest is in the zero-inflated model, what is the utility of the count model coefficients? Are they simply provided for reference?
Your zero-inflated regression consists of two models. The zero-inflation part is a binary model, such as a logit or probit model, and accounts for the probability that an observation is an excess ("structural") zero, i.e. a zero that does not come from the count process. The count part is a model for count data (non-negative integers), such as a Poisson or negative binomial model. It is fitted jointly with the zero-inflation part on all observations and can itself produce zeros. In sum, the zero-inflation model estimates the probability of an excess zero, and the count model describes the counts (including the remaining zeros) for everything else, so the count model coefficients are an essential part of the fit and not just a reference output.
This zero-inflated regression is similar to, but not the same as, a hurdle model. You can read more on this at Cross Validated: What is the difference between zero-inflated and hurdle models? BTW, that platform is actually better suited for this kind of purely statistical question.
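To make the difference concrete, here is a minimal sketch of the two probability mass functions with a Poisson count part, written in plain Python rather than with pscl. The function names and parameter values are hypothetical; the formulas are the standard textbook definitions.

```python
import math

def poisson_pmf(k, lam):
    """P(Y = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k, lam, pi_zero):
    """Zero-inflated Poisson: an excess zero with probability pi_zero,
    otherwise a draw from Poisson(lam), which can also be zero."""
    base = (1.0 - pi_zero) * poisson_pmf(k, lam)
    return pi_zero + base if k == 0 else base

def hurdle_pmf(k, lam, p_zero):
    """Hurdle Poisson: zero with probability p_zero, otherwise a draw from
    a zero-truncated Poisson(lam), so all zeros come from the binary part."""
    if k == 0:
        return p_zero
    return (1.0 - p_zero) * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))

# With the same parameter values, the two models imply different zero probabilities.
print(zip_pmf(0, lam=2.0, pi_zero=0.3))    # excess zeros plus Poisson zeros
print(hurdle_pmf(0, lam=2.0, p_zero=0.3))  # zeros from the binary part only
```

In the zero-inflated version a zero can come from either part, which is why the count model coefficients matter for the zeros as well as for the positive counts.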
I want to report the results of a one-factorial lme from the nlme package. I want to know the overall effect of A on y. To do so, I would compare the model with a null model:
m1 <- lme(y~A,random=~1|B/C,data=data,weights=varIdent(form = ~1|A),method="ML")
m0 <- lme(y~1,random=~1|B/C,data=data,weights=varIdent(form = ~1|A),method="ML")
I am using maximum likelihood because I am comparing models with different main effects.
stats::anova(m0,m1) gives me a significant p value, meaning that there is a significant effect of A on y. However, in contrast to lmer models made with lme4, no Chi2 values are given. First: Is this approach valid? And second: What is the best way to report the result?
Thanks for your answers
An anova with lme should give you the same information as with lmer. Both use what's called a deviance test or likelihood ratio test. The L.ratio value in the table returned by anova is simply twice the difference in the log-likelihoods of the two models (equivalently, the difference in deviance, where the deviance is the log-likelihood multiplied by -2). A deviance test compares this value against a Chi2 distribution with degrees of freedom equal to the difference in the number of model parameters (for a factor A, its number of levels minus 1). So the value reported under L.ratio for lme models is the same as the Chi2 value reported for lmer models (assuming the models are the same of course, and allowing for the rounding that lmer applies to the value).
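The arithmetic of the test can be sketched in a few lines of plain Python. The log-likelihoods and degrees of freedom below are made-up values, and chi2_sf is a small hand-written helper for integer degrees of freedom (not a library function), included only so that the example is self-contained.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting values for df = 1 and df = 2.
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(half))
    else:
        a, q = 1.0, math.exp(-half)
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(loglik_null, loglik_full, df_diff):
    """Return the likelihood ratio statistic and its chi-square p-value."""
    stat = 2.0 * (loglik_full - loglik_null)
    return stat, chi2_sf(stat, df_diff)

# Hypothetical log-likelihoods of m0 and m1, with A adding two parameters.
stat, p = likelihood_ratio_test(loglik_null=-250.0, loglik_full=-245.5, df_diff=2)
print(f"L.ratio = {stat:.2f}, df = 2, p = {p:.4f}")
```

The three numbers printed here (statistic, degrees of freedom, p-value) are the ones that would go into the report.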
The approach is valid, and you could report the value under L.ratio along with the degrees of freedom and p-value. I would, however, add more information to your report, such as the fixed and random coefficients of both models and the other parameters that you've added (such as the difference in variance for levels of A specified under weights). If you're only interested in the fixed effect of A, then a Wald test should also be appropriate, though REML estimates are recommended in cases with a small number of groups (Snijders & Bosker, 2012). The test statistic is the t-value, with its associated p-value, in the model summary output summary(m1). Chapter 6 in Snijders & Bosker (2012) gives a great explanation of tests for fixed and random parameters, along with reporting examples.