Interactions in logistic regression in R

I am struggling to interpret the results of a binomial logistic regression I did. The experiment has 4 conditions; in each condition, all participants receive a different version of the treatment.
DVs (1 per condition) = DE01, DE02, DE03, DE04, all binary (1 = participant takes a specific decision, 0 = doesn't)
Predictors: FTFinal (continuous, a freedom threat scale)
SRFinal (continuous, a situational reactance scale)
TRFinal (continuous, a trait reactance scale)
SVO_Type (binary, egoists = 1, altruists = 0)
After running the binomial (logit) models, I ended up with the following (see output). Initially I tested 2 models per condition, and condition 2 (DE02 as the DV) caught my attention. In model (3) there are two significant predictors of DE02 (taking the decision or not): FTFinal and SVO_Type. In context, the model (3) coefficients would mean that, all else equal, being an egoist (SVO_Type = 1) decreases the log-odds of taking the decision compared to being an altruist, while higher scores on FTFinal (freedom threat) increase the log-odds of taking the decision. So far so good. Removing SVO_Type from the regression (model 4) made the FTFinal coefficient non-significant. Removing FTFinal from the model does not change the significance of SVO_Type.
So I figured: OK, mediation perhaps, or moderation.
I tried all models in both R and SPSS, and entering an interaction term SVO_Type*FTFinal makes all variables in model (2) non-significant. I followed this mediation procedure for logistic regression (http://www.nrhpsych.com/mediation/logmed.html), but found no mediation. To sum up:
Predicting DE02 from SVO_Type only is not significant.
Predicting DE02 from FTFinal only is not significant.
Putting both in the regression makes them significant predictors.
code and summaries here
Including an interaction between the two in the model makes all coefficients non-significant.
So I am at a total loss. As far as I know, to test moderation you need an interaction term; here that term is between a categorical variable (SVO_Type) and a continuous one (FTFinal). Perhaps that is where things go wrong?
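For reference, a minimal sketch of the moderation model in R (the data frame name `dat` is an assumption, not from the post). One common reason all the lower-order terms "turn non-significant" once the interaction enters is that they then refer to the effect at FTFinal = 0, a point that may lie far outside the observed scale; mean-centering the continuous predictor leaves the interaction test unchanged but makes the lower-order coefficients interpretable at the average FTFinal:

```r
# Hedged sketch: moderation as a categorical x continuous interaction.
# `dat`, with columns DE02, SVO_Type, FTFinal, is an assumed data frame.
dat$FTFinal_c <- dat$FTFinal - mean(dat$FTFinal)   # mean-center the scale

fit_int <- glm(DE02 ~ SVO_Type * FTFinal_c, family = binomial, data = dat)
summary(fit_int)   # the SVO_Type:FTFinal_c row is the moderation test
```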
And to test mediation outside SPSS, I tried the mediate() function from the "mediation" package, only to discover that it has a "treatment" argument, which is supposed to be the treatment variable (experimental vs. control). I don't have such a variable; all participants are subjected to different versions of the same treatment.
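For what it's worth, mediate() does not strictly require a binary experimental factor: its control.value and treat.value arguments let you contrast two values of a continuous "treatment". A sketch under assumed variable roles (the causal ordering FTFinal -> SRFinal -> DE02 here is purely illustrative, not the OP's hypothesis):

```r
# Hedged sketch: mediation with a continuous "treatment" variable.
library(mediation)

med.fit <- lm(SRFinal ~ FTFinal, data = dat)                       # mediator model
out.fit <- glm(DE02 ~ FTFinal + SRFinal, family = binomial, data = dat)  # outcome model

# control.value / treat.value pick the two treatment levels to contrast,
# e.g. one SD below vs. one SD above the mean of FTFinal:
med <- mediate(med.fit, out.fit, treat = "FTFinal", mediator = "SRFinal",
               control.value = mean(dat$FTFinal) - sd(dat$FTFinal),
               treat.value   = mean(dat$FTFinal) + sd(dat$FTFinal),
               sims = 1000)
summary(med)
```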
Any help will be appreciated.

Related

Justifying need for zero-inflated model in GLMMs

I am using GLMMs in R to examine the influence of continuous predictor variables (x) on biological count variables (y). My response variables (n=5) each have a high number of zeros (see data distribution), so I have tested the fit of various distributions (genpois, poisson, nbinom1, nbinom2, zip, zinb1, zinb2) and selected the best-fitting one according to the lowest AIC / highest log-likelihood.
According to this selection criterion, the three response variables with the highest numbers of zeros are best fit by the zero-inflated negative binomial (zinb2) distribution. Compared to the regular NB distribution (non-zero-inflated), the ΔAIC is between 30 and 150.
My question is: must I use the ZI models for these variables, considering the ΔAIC? I have received advice from a statistician that if the ΔAIC between the ZI and non-ZI models is small enough, one should use the non-ZI model even if it fits marginally worse, since ZI models involve much more complicated modelling and interpretation. The distribution matters in this case because the ZINB and NB models select different combinations of top candidate models when testing my predictors.
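For concreteness, a minimal sketch of the comparison being described, assuming glmmTMB (whose family names match those listed above; `site` stands in for whatever grouping variable the GLMMs actually use):

```r
# Hedged sketch: plain NB fit vs. its zero-inflated counterpart, compared by AIC.
library(glmmTMB)

m_nb   <- glmmTMB(y ~ x + (1 | site), family = nbinom2, data = dat)
m_zinb <- glmmTMB(y ~ x + (1 | site), ziformula = ~ 1,
                  family = nbinom2, data = dat)

AIC(m_nb, m_zinb)   # the dAIC of 30-150 reported above is strong support
                    # for keeping the zero-inflation component
```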
Thank you for any clarification!

Output from Linear Mixed Models differs from Estimated Marginal Means

I have a query about the output statistics from linear mixed models (using the lmer function) relative to those from the estimated marginal means derived from the same model.
Essentially, I am running an LMM comparing the within-subjects effect of different contexts (with "Negative" coded as the baseline) on enjoyment ratings. The LMM output suggests that the difference between negative and polite contexts is not significant, with a p-value of .35. See the screenshot below with the relevant line highlighted:
LMM output
However, when I then run the lsmeans function on the same model (with the Holm correction), the p-value for the comparison between Negative and Polite context categories is now .05, and all of the other statistics have changed too. Again, see the screenshot below with the relevant line highlighted:
LSMeans output
I'm probably being dense because my understanding of LMMs isn't hugely advanced, but I've tried to Google the reason for this and can't seem to find out why. I don't think it's down to the correction itself, because the smaller p-value is the one observed when the Holm correction is applied. So why does this happen, and which value should I report and why?
Thank you for your help!
Regression coefficients and marginal means are not one and the same. Once you learn these concepts it'll be easier to figure out which one is more informative and therefore which one you should report.
After we fit a regression by estimating its coefficients, we can predict the outcome y_i given the m input variables X_i = (X_i1, ..., X_im). If the inputs are informative about the outcome, the predicted y_i differs for different X_i. If we average the predictions y_i over examples with X_ij = x_j, we get the marginal effect of the jth feature at the value x_j. It's crucial to keep track of which inputs are kept fixed (and at what values) and which inputs are averaged over (aka marginalized out).
In your case, contextCatPolite in the coefficients summary is the difference between Polite and Negative when smileType is set to its reference level (no reward, I'd guess). In the emmeans contrasts, Polite - Negative is the average difference over all smileTypes.
Interactions have a way of making interpretation more challenging and your model includes an interaction between smileType and contextCat. See Interaction analysis in emmeans.
To add to @dipetkov's answer, the coefficients in your LMM are based on treatment coding (sometimes called "dummy" coding). With the interaction in the model, these coefficients are no longer "main effects" in the traditional sense of factorial ANOVA. For instance, if you have:
y = b_0 + b_1(X_1) + b_2(X_2) + b_3 (X_1 * X_2)
...b_1 is "the effect of X_1" only when X_2 = 0:
y = b_0 + b_1(X_1) + b_2(0) + b_3 (X_1 * 0)
y = b_0 + b_1(X_1)
Thus, as @dipetkov points out, 1.625 is not the difference between Negative and Polite averaged across all other factors (which is what you get from emmeans). Instead, this coefficient is the difference between Negative and Polite specifically when smileType = 0.
If you use contrast coding instead of treatment coding, the coefficients from the regression output would match the estimated marginal means, because smileType = 0 would then correspond to the average across smile types. The coding scheme thus has a huge effect on the estimated values and statistical significance of regression coefficients, but it should not affect F-tests based on the reduction in deviance/variance (because no matter how you code it, a given variable explains the same amount of variance). A quick sketch of the two coding schemes follows the link below.
https://stats.oarc.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/
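A minimal sketch of the two codings (the data frame `dat`, the `enjoyment` column, and the `(1 | subject)` random-effect structure are assumptions; contextCat and smileType are from the post):

```r
# Hedged sketch: how the coding scheme changes the coefficients
# without changing the fitted model or the marginal means.
library(lme4)
library(emmeans)

# Treatment (dummy) coding: contextCat coefficients are simple effects
# at the reference level of smileType.
m_trt <- lmer(enjoyment ~ contextCat * smileType + (1 | subject), data = dat)

# Sum-to-zero coding: the coefficients now average over smileType and
# line up with the estimated marginal means.
m_sum <- lmer(enjoyment ~ contextCat * smileType + (1 | subject), data = dat,
              contrasts = list(contextCat = contr.sum, smileType = contr.sum))

# The marginal means themselves are identical under either coding:
emmeans(m_trt, pairwise ~ contextCat, adjust = "holm")
```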

Extracting linear term from a polynomial predictor in a GLM

I am relatively new to both R and Stack Overflow, so please bear with me. I am currently using GLMs to model ecological count data under a negative binomial distribution in brms. Here is my general model structure, which I have chosen based on fit, convergence, low LOOIC compared to other models, etc.:
My goal is to characterize population trends of the study organisms over the study period. I have created marginal-effects plots by using the model to predict on a new dataset where all covariates are held constant except year (shaded areas are 80% and 95% credible intervals for the posterior predicted means):
I am now hoping to extract trend magnitudes that I can report and compare across species (i.e. say a certain species declined or increased by x% (+/- y%) per year). Because I use poly() in the model, my understanding is that R uses orthogonal polynomials, so the resulting polynomial coefficients are not directly interpretable. I tried generating raw polynomials (setting raw=TRUE in poly()), which I thought would produce the same fit with directly interpretable coefficients. However, the resulting models don't really run (after 5 hours neither chain gets through even a single iteration, whereas the same model with raw=FALSE takes only a few minutes). Very simplified versions of the model (e.g. count ~ poly(year, 2, raw=TRUE)) do run, but take several orders of magnitude longer than with raw=FALSE, and the resulting model also predicts different counts than the model with orthogonal polynomials. My questions are: (1) what is going on here? And (2) more broadly, how can I feasibly extract the linear term of the quartic polynomial describing the response to year, or otherwise get a value corresponding to the population trend?
I feel like this should be relatively simple, and I apologize if I'm overlooking something obvious. Please let me know if there is further code I should share for more clarity; I didn't want to make the initial post overly long, but I'm happy to show specific predictions from different models or anything else. Thank you for any help.
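One way to get a reportable trend without refitting with raw polynomials is to work on the prediction scale: draw expected counts over a grid of years, then summarize the year-over-year change across posterior draws. A sketch under assumed names (`fit`, `dat`, `year` are illustrative; the other covariates would need to be fixed in newdata just as for the marginal-effects plots):

```r
# Hedged sketch: average annual % change from a brms fit with poly(year, 4).
library(brms)

nd <- data.frame(year = seq(min(dat$year), max(dat$year), by = 1))
# ...add the other covariates here, held at the same constants used
# for the marginal-effects plots...

ep <- posterior_epred(fit, newdata = nd)   # draws x years matrix of expected counts
ratio <- ep[, -1] / ep[, -ncol(ep)]        # year-over-year multiplicative change
annual <- 100 * (rowMeans(ratio) - 1)      # average % change per year, per draw
quantile(annual, c(0.025, 0.5, 0.975))     # posterior median and 95% CrI
```

This sidesteps interpreting the orthogonal-polynomial coefficients entirely. On question (1): raw polynomials of a variable like calendar year are numerically awful, because year, year^2, ..., year^4 are enormous and nearly collinear; centering and scaling year before calling poly(..., raw=TRUE) usually restores sampling speed.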

Negative Binomial Regression Assumption Testing

First post!
I'm a biologist with limited background in applied statistics and R. I basically know enough to be dangerous, so I'd appreciate it if someone could confirm/deny that I'm on the right path.
My datasets consists of count data (wildlife visits to water wells) as a response variable and multiple continuous predictor variables (environmental measurements).
First, I eliminated multicollinearity by dropping a few predictor variables. Second, I investigated the distribution of the response variable. Initially it looked Poisson; however, a Poisson exact test came back significant, and the variance of the response variable was around 200 with a mean around 9, i.e. overdispersed. Because of this, I moved forward with negative binomial and quasi-Poisson regressions. Both selected the same model, whose residuals are approximately normally distributed. Furthermore, a plot of residuals against predicted values is unbiased and homoscedastic.
Questions:
1. Have I selected the correct regressions to model this data?
2. Are there additional assumptions of the NBR and QpR that I need to test? How should I/Where can I learn about how to do these?
3. Did I check for overdispersion correctly? Is there a difference in comparing the mean and variance vs comparing the conditional mean and variance of the response variable?
4. While the NBR and QpR called the same model, is there a way to select which is the "better" approach?
5. I would like to eventually publish. Are there more analyses I should perform on my selected model?
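On question 3: one common check compares the conditional variance to the conditional mean via the Pearson dispersion statistic of a fitted Poisson model, rather than the raw (marginal) mean and variance of y. A sketch with hypothetical variable names (visits, temperature, rainfall, wells are illustrative):

```r
# Hedged sketch: overdispersion check on the conditional distribution,
# then a negative binomial alternative.
library(MASS)

m_pois <- glm(visits ~ temperature + rainfall, family = poisson, data = wells)

# Pearson dispersion statistic: ~1 under Poisson; much greater than 1
# indicates overdispersion conditional on the predictors.
sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)

# Negative binomial fit; its AIC is comparable to the Poisson fit's,
# whereas quasi-Poisson has no likelihood and hence no AIC (relevant to Q4).
m_nb <- MASS::glm.nb(visits ~ temperature + rainfall, data = wells)
AIC(m_pois, m_nb)
```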

Subset selection with LASSO involving categorical variables

I ran a LASSO algorithm on a dataset that has multiple categorical variables. When I used the model.matrix() function on the independent variables, it automatically created dummy variables for each factor level.
For example, I have a variable "worker_type" that has three values: FTE, contr, other. Here, the reference level is "FTE".
Some other categorical variables have more or fewer factor levels.
When I output the coefficient results from the LASSO, I noticed that worker_typecontr and worker_typeother both have coefficients of zero. How should I interpret these results? What's the coefficient for FTE in this case? Should I just take this variable out of the formula?
Perhaps this question is suited more for Cross Validated.
Ridge regression and the lasso are both "shrinkage" methods, typically used to deal with a high-dimensional predictor space.
The fact that your lasso reduces some of the beta coefficients exactly to zero indicates that the lasso is doing exactly what it is designed to do! Its L1 penalty allows coefficient estimates to be exactly zero, which amounts to assuming that a number of the true coefficients are zero. The interpretation of coefficients that go to zero is that, at the chosen penalty, these predictors explain little additional variance in the response beyond the predictors that remain in the model.
Why does the lasso shrink some coefficients to zero? We need to look at how the coefficients are chosen. The lasso is essentially a multiple linear regression problem that is solved by minimizing the residual sum of squares plus a special L1 penalty term that shrinks coefficients toward 0. This is the quantity that is minimized:

sum_{i=1}^n (y_i - b_0 - sum_{j=1}^p b_j x_ij)^2 + lambda * sum_{j=1}^p |b_j|

where p is the number of predictors and lambda is a non-negative tuning parameter. When lambda = 0, the penalty term drops out and you have a multiple linear regression. As lambda becomes larger, the coefficients are shrunk more strongly toward zero; this increases bias but reduces variance (it is too small a lambda, not too large, that leaves the model prone to overfitting).
A cross-validation approach should be used to select the tuning parameter lambda: take a grid of lambda values, compute the cross-validation error for each, and select the value for which the cross-validation error is lowest.
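In R this is typically done with glmnet; a minimal sketch (variable names are illustrative, not the OP's):

```r
# Hedged sketch: lasso with lambda chosen by cross-validation.
library(glmnet)

x <- model.matrix(y ~ ., data = df)[, -1]  # dummy-code factors, drop intercept
y <- df$y

cvfit <- cv.glmnet(x, y, alpha = 1)        # alpha = 1 -> lasso penalty
plot(cvfit)                                # CV error across the lambda grid
coef(cvfit, s = "lambda.min")              # coefficients at the best lambda
```

In the coef() printout, dummy levels shrunk to zero (like worker_typecontr and worker_typeother above) appear as dots; the reference level FTE never gets its own coefficient because it is absorbed into the intercept.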
The lasso is useful in some situations and helps generate simple models, but special consideration should be paid to the nature of the data itself, and to whether another method such as ridge regression or OLS regression is more appropriate given how many predictors are believed to be truly related to the response.
Note: See equation 6.7 on page 221 in "An Introduction to Statistical Learning", which you can download for free here.
