Justifying need for zero-inflated model in GLMMs - r

I am using GLMMs in R to examine the influence of continuous predictor variables (x) on biological counts variable (y). My response variables (n=5) each have a high number of zeros (data distribution), so I have tested the fit of various distributions (genpois, poisson, nbinom1, nbinom2, zip, zinb1, zinb2) and selected the best fit one according to the lowest AIC/LogLik value.
According to this selection criteria, three of my response variables with the highest number of zeros are best fit to the zero inflated negative binomial (zinb2) distribution. Compared to the regular NB distribution (non-zero inflated), the delta AIC is between 30-150.
My question is: must I use the ZI models for these variables considering the dAIC? I have received advice from a statistician that if dAIC is small enough between the ZI and non-ZI model, use the non-ZI model even if it is marginally worse fit since ZI models involve much more complicated modelling & interpretation. The distribution matters in this case because ZINB / NB models select a different combination of top candidate models when testing my predictors.
Thank you for any clarification!

Related

How to assess the model and prediction of random forest when doing regression analysis?

I know when random forest (RF) is used for classification, the AUC normally is used to assess the quality of classification after applying it to test data. However,I have no clue the parameter to assess the quality of regression with RF. Now I want to use RF for the regression analysis, e.g. using a metrics with several hundreds samples and features to predict the concentration (numerical) of chemicals.
The first step is to run randomForest to build the regression model, with y as continuous numerics. How can I know whether the model is good or not, based on the Mean of squared residuals and % Var explained? Sometime my % Var explained is negative.
Afterwards, if the model is fine and/or used straightforward for test data, and I get the predicted values. Now how can I assess the predicted values good or not? I read online some calculated the accuracy (formula: 1-abs(predicted-actual)/actual), which also makes sense to me. However, I have many zero values in my actual dataset, are there any other solutions to assess the accuracy of predicted values?
Looking forward to any suggestions and thanks in advance.
The randomForest R package comes with an importance function which can used to determine the accuracy of a model. From the documentation:
importance(x, type=NULL, class=NULL, scale=TRUE, ...), where x is the output from your initial call to randomForest.
There are two types of importance measurements. One uses a permutation of out of bag data to test the accuracy of the model. The other uses the GINI index. Again, from the documentation:
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
For further information, one more simple importance check you may do, really more of a sanity check than anything else, is to use something called the best constant model. The best constant model has a constant output, which is the mean of all responses in the test data set. The best constant model can be assumed to be the crudest model possible. You may compare the average performance of your random forest model against the best constant model, for a given set of test data. If the latter does not outperform the former by at least a factor of say 3-5, then your RF model is not very good.

Interpreting zero-inflated regression summary

The output of the zeroinfl regression from pscl provides a list of coefficients under "count model coefficients" as well as a list of coefficients under "zero-inflation model coefficients."
Given the interest is to follow the z inflated model, what is the utility of the count model coefficients? Is it simply provided for reference?
Your zero inflated regression consists of two models. The zero part is usually a binomial part, such as a logit or probit model, and accounts for the probability that Y is not zero. The count part is usually a model for count data (usually integers), such as a poisson or negative binomial model, and only considers those observations that are not zero. When you compare the number of observations of both models, e.g. using summary(fit), you will see the difference. In sum, your zero model calculates the probability that an observations is not zero, the count model fits a model on those observations that are not zero.
This zero inflated regression is similar to a hurdle model. You can read more on this at Cross Validated: What is the difference between zero-inflated and hurdle models?. BTW that platform is actually better suited for this kind of merely statistical questions.

Subset selection with LASSO involving categorical variables

I ran a LASSO algorithm on a dataset that has multiple categorical variables. When I used model.matrix() function on the independent variables, it automatically created dummy values for each factor level.
For example, I have a variable "worker_type" that has three values: FTE, contr, other. Here, reference is modality "FTE".
Some other categorical variables have more or fewer factor levels.
When I output the coefficients results from LASSO, I noticed that worker_typecontr and worker_typeother both have coefficients zero. How should I interpret the results? What's the coefficient for FTE in this case? Should I just take this variable out of the formula?
Perhaps this question is suited more for Cross Validated.
Ridge Regression and the Lasso are both "shrinkage" methods, typically used to deal with high dimensional predictor space.
The fact that your Lasso regression reduces some of the beta coefficients to zero indicates that the Lasso is doing exactly what it is designed for! By its mathematical definition, the Lasso assumes that a number of the coefficients are truly equal to zero. The interpretation of coefficients that go to zero is that these predictors do not explain any of the variance in the response compared to the non-zero predictors.
Why does the Lasso shrink some coefficients to zero? We need to investigate how the coefficients are chosen. The Lasso is essentially a multiple linear regression problem that is solved by minimizing the Residual Sum of Squares, plus a special L1 penalty term that shrinks coefficients to 0. This is the term that is minimized:
where p is the number of predictors, and lambda is a a non-negative tuning parameter. When lambda = 0, the penalty term drops out, and you have a multiple linear regression. As lambda becomes larger, your model fit will have less bias, but higher variance (ie - it will be subject to overfitting).
A cross-validation approach should be taken towards selecting the appropriate tuning parameter lambda. Take a grid of lambda values, and compute the cross-validation error for each value of lambda and select the tuning parameter value for which the cross-validation error is the lowest.
The Lasso is useful in some situations and helps in generating simple models, but special consideration should be paid to the nature of the data itself, and whether or not another method such as Ridge Regression, or OLS Regression is more appropriate given how many predictors should be truly related to the response.
Note: See equation 6.7 on page 221 in "An Introduction to Statistical Learning", which you can download for free here.

R vs. SPSS mixed model repeated measures code [from Cross Validated]

NOTE: This question was originally posted on Cross Validated, where it was suggested that it should be asked in StackOverflow instead.
I am trying to model a 3-way repeated measures experiment, FixedFactorA * FixedFactorB * Time[days]. There are no missing observations, but my groups (FactorA * FactorB) are unequal (close, but not completely balanced). From reading online, the best way to model a repeated measures experiment in which observation order matters (due to the response mean and variance changing in a time-dependent way) and for unequal groups is to use a mixed model and specify an appropriate covariance structure. However, I am new to the idea of mixed models and I am confused as to whether I am using the correct syntax to model what I am trying to model.
I would like to do a full factorial analysis, such that I could detect significant time * factor interactions. For example, for subjects with FactorA = 1, their responses over time might have a different slope and/or intercept than subjects with FactorA =2. I also want to be able to check whether certain combinations of FactorA and FactorB have significantly different responses over time (hence the full three-way interaction term).
From reading online, it seems like AR1 is a reasonable covariance structure for longitudinal-like data, so I decided to try that. Also, I saw that one is supposed to use ML if one plans to compare two different models, so I chose that approach in anticipation of needing to fine-tune the model. It is also my understanding that the goal is to minimize the AIC during model selection.
This is the code in the log for what I tried in SPSS (for long-form data), which yielded an AIC of 2471:
MIXED RESPONSE BY FactorA FactorB Day
/CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0,
ABSOLUTE) LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)
/FIXED=FactorA FactorB Day FactorA*FactorB FactorA*Day FactorB*Day FactorA*FactorB*Day | SSTYPE(3)
/METHOD=ML
/PRINT=SOLUTION TESTCOV
/REPEATED=Day | SUBJECT(Subject_ID) COVTYPE(AR1)
This is what I tried in R, which yielded an AIC of 2156:
require(nlme)
#output error fix: https://stats.stackexchange.com/questions/40647/lme-error-iteration-limit-reached
ctrl <- lmeControl(opt='optim') #I used this b/c otherwise I get the iteration limit reached error
fit1 <- lme(RESPONSE ~ Day*FactorA*FactorB, random = ~ Day|Subject_ID, control=ctrl,
correlation=corAR1(form=~Day), data, method="ML")
summary(fit1)
These are my questions:
The SPSS code above yielded a model with AIC = 2471, while the R code yielded a model with AIC = 2156. What is it about the codes that makes the models different?
From what I described above, are either of these models appropriate for what I am trying to test? If not, what would be a better way, and how would I do it in both programs to get the same results?
Edits
Another thing to note is that I didn't dummy-code my factors. I don't know if this is a problem for either software, or if the built-in coding is different in SPSS vs R. I also don't know if this will be a problem for my three-way interaction term.
Also, when I say "factor", I mean an unchanging group or characteristic (like "sex").
Start with an unconditional model, one with an identity variance-covariance structure at level-1 and one with an AR(1) var-covar structure at level 1:
unconditional.identity<-lme(RESPONSE~Day, random=~Day|Subject_ID, data=data, method='ML')
unconditional.ar1<-lme(RESPONSE~Day, random=~Day|Subject_ID, correlation=corAR1(form=~Day), data=data, method='ML')
Find the intra-class correlation coefficient of this unconditional model, which is the level-2 error divided by the sum of level-1 and level-2 errors. This is probably easier in a spreadsheet program, but in R:
intervals(unconditional.identity)$reStruct$Subject_ID[2]^2/(intervals(unconditional.identity)$reStruct$Subject_ID[2]^2+intervals(unconditional.identity)$sigma[2]^2)
intervals(unconditional.ar1)$reStruct$Subject_ID[2]^2/(intervals(unconditional.ar1)$reStruct$Subject_ID[2]^2+intervals(unconditional.ar1)$sigma[2]^2)
It depends on your field, but in educational research, an ICC below 0.2, definitely below 0.1, is considered not ready for hierarchical linear models. That is to say, multiple regression would be better because the assumption of independence is confirmed. If your ICC is below a cutoff for your field, then do not use a hierarchical longitudinal model.
If your ICC is acceptable for hierarchical linear models, then add in your control grouping variable with identity and AR(1) var-covar matrix:
conditional1.identity<-lme(RESPONSE~Day+Group, random=~Day+Group|Subject_ID, data=data, method='ML')
conditional1.ar1<-lme(RESPONSE~Day+Group, random=~Day+Group|Subject_ID, correlation=corAR1(form=~Day), data=data, method='ML')
If your factors are time-invariant (which you said on Cross Validated), then your model gets bigger because time and group are nested in these fixed effects:
conditional2.identity<-lme(RESPONSE~Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group+FactorB, random=~Day+Group|Subject_ID, data=data, method='ML')
conditional2.ar1<-lme(Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group+FactorB, random=~Day+Group|Subject_ID, correlation=corAR1(form=~Day), data=data, method='ML')
You can get confidence intervals on the coefficients with intervals() or p-values with summary(). Remember, lme reports error terms in standard deviation format.
I do not know your area of study, so I can't say if your three-way interaction effect makes theoretical sense. But your model is getting quite dense at this point. The more parameters you estimate, the more degrees of freedom the model has when you compare them, so the statistical significance will be biased. If you are really interested in a three-way interaction effect, I suggest you consider the theoretical meaning of such an interaction and what it would mean if such an interaction did occur. Nonetheless, you can estimate it by adding it to the code above:
conditional3.identity<-lme(RESPONSE~Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group+FactorB+Day*FactorA*FactorB, random=~Day+Group|Subject_ID, data=data, method='ML')
conditional3.ar1<-lme(Day+Group+FactorA+FactorB+FactorA*Day+FactorB*Day+FactorA*Group+FactorB*Group+FactorB+Day*FactorA*FactorB, random=~Day+Group|Subject_ID, correlation=corAR1(form=~Day), data=data, method='ML')
Finally, compare the nested models:
anova(unconditional.identity,conditional1.identity,conditional2.identity,conditional3.identity)
anova(unconditional.ar1,conditional1.ar1,conditional2.ar1,conditional3.ar1)
Like I said, the more parameters you estimate, the more biased your statistical significance will be: i.e., more parameters = more degrees of freedom = less chance of a statistically significant model.
HOWEVER, the best part about multi-level models is comparing effect sizes, so then you don't have to worry about p-values at all. Effect sizes are in the form of a "proportional reduction in variance explained."
This is comparing models. For example, to comapre the proportional reduction in variance explained in level 1 from the unconditional model to the conditional1 model:
(intervals(unconditional.identity)$sigma[2]^2 - intervals(conditional1.identity)$sigma[2]^2) / intervals(unconditional.identity)$sigma[2]^2
Hopefully you can "plug and play" the same code for the number of level-2 error terms you have (which is more than one in some of your cases). Make sure to compare only nested models in this way.

interactions in logistical regression R

I am struggling to interpret the results of a binomial logistic regression I did. The experiment has 4 conditions, in each condition all participants receive different version of treatment.
DVs (1 per condition)=DE01,DE02,DE03,DE04, all binary (1 - participants take a spec. decision, 0 - don't)
Predictors: FTFinal (continuous, a freedom threat scale)
SRFinal (continuous, situational reactance scale)
TRFinal (continuous, trait reactance scale)
SVO_Type(binary, egoists=1, altruists=0)
After running the binomial (logit) models, I ended up with the following:see output. Initially I tested 2 models per condition, when condition 2 (DE02 as a DV) got my attention. In model(3)There are two variables, which are significant predictors of DE02 (taking a decision or not) - FTFinal and SVO Type. In context, the values for model (3) would mean that all else equal, being an Egoist (SVO_Type 1) decreases the (log)likelihood of taking a decision in comparison to being an altruist. Also, higher scores on FTFinal(freedom threat) increase the likelihood of taking the decision. So far so good. Removing SVO_Type from the regression (model 4) made the FTFinal coefficient non-significant. Removing FTFinal from the model does not change the significance of SVO_Type.
So I figured:ok, mediaiton, perhaps, or moderation.
I tried all models both in R and SPSS, and entering an interaction term SVO_Type*FTFinal makes !all variables in model(2) non-significant. I followed this "http: //www.nrhpsych.com/mediation/logmed.html" mediation procedure for logistic regression, but found no mediation. (sorry for link, not allowd to post more than 2, remove space after http:) To sum up: Predicting DE02 from SVO_Type only is not significant.Predicting DE02 from FTFinal is not significantPutitng those two in the regression makes them significant predictors
code and summaries here
Including an interaction between these both in a model makes all coefficients insignificant.
So I am at a total loss: As far as I know, to test moderation, you need an interaction term. This term is between a categorical var (SVO_Type) and the continuous one(FTFinal), perhaps that goes wrong?
And to test mediation outside SPSS, I tried the "mediate" package (sorry I am a noob, so I am allowed max 3 links per post), only to discover that there is a "treatment" argument in the funciton, which is to be the treatment variable (exp Vs cntrl). I don't have such, all ppns are subjected to different versions of the same treatment.
Any help will be appreciated.

Resources