ggeffects & dummy coding; sjPlot & odds ratios - r

I am currently examining marginal effects of some fixed-effects factors in a mixed-effects logistic regression. To do so, I've employed the ggpredict function of the tremendously helpful ggeffects package. I then also used the tab_model function of the associated sjPlot package to produce tables that include odds ratios. However, I was a bit surprised by the output of each:
1) I now see that all levels of my factor predictors are included in the output (as opposed to R's usual dummy coding in which one level of each factor serves as a reference for contrasts). Is it possible to retain a reference level in the ggpredict output? I was hoping to use it to i) check against manual calculations and ii) compare it to the glmer model coefficients that are not similarly calculated conditionally upon the random effects.
2) The odds ratios provided by tab_model are identical to those that I obtained by exponentiating the coefficients provided by my original glmer model (per the IDRE example procedure). However, I was under the impression that the ORs calculated were derived from marginal coefficients that did not account for the influence of the random effect in my model (see the paragraph starting with "Many people prefer" here, the "Predicted Probabilities and Graphing" section here, and top answer here for more information). In turn, does this mean that the ORs for fixed effects variables provided by tab_model similarly do not account for the influence of the random effect? If that's the case, is there an argument or other means by which to do so?
Thanks!
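For what it's worth, the distinction raised in point 2 can be illustrated numerically without fitting any model. The sketch below is base R only and uses made-up parameter values (beta0, beta1 and sigma_u are assumptions, not taken from any real fit). It compares the conditional odds ratio, i.e. exp() of a fixed-effect coefficient as reported by glmer, with the population-averaged odds ratio obtained by integrating over a normally distributed random intercept.

```r
# Hypothetical parameter values, chosen only for illustration
beta0   <- -1   # fixed intercept (log-odds scale)
beta1   <- 1    # fixed effect of a binary predictor x
sigma_u <- 2    # SD of the random intercept

# Population-averaged probability: integrate the conditional probability
# over the normal distribution of the random intercept u
marginal_prob <- function(eta, sigma) {
  integrate(function(u) plogis(eta + u) * dnorm(u, sd = sigma),
            lower = -Inf, upper = Inf)$value
}

odds <- function(p) p / (1 - p)

# Conditional (subject-specific) OR: what exponentiating the coefficient gives
conditional_or <- exp(beta1)

# Marginal (population-averaged) OR: compare averaged probabilities at x = 0 and x = 1
p0 <- marginal_prob(beta0, sigma_u)
p1 <- marginal_prob(beta0 + beta1, sigma_u)
marginal_or <- odds(p1) / odds(p0)

print(c(conditional = conditional_or, marginal = marginal_or))
```

Under these assumed values the marginal odds ratio comes out closer to 1 than the conditional one, which is the usual attenuation when averaging over the random effect. If tab_model simply exponentiates the glmer coefficients, as the identical values suggest, its ORs would be of the conditional kind.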

Related

Multilevel mixed-effects tobit regression in R

I have a dataset with left-censored data and I want to fit a multilevel mixed-effects tobit regression, but I can only find information about how to do it in Stata. Is it possible to do it in R?
I found the packages 'VGAM' and 'censReg', but I don't understand how to specify fixed and random effects.
Also, my data are log-normally distributed; is there a way to account for this in the model?
Thanks!
According to Section 3.5 of a vignette, the censReg package can handle a mixed model if the data are prepared properly via the plm package.
This Cross Validated page shows an example.
I don't have experience with this; it might only work with formal panel data rather than more general random-effects structures.
If your data are truly log-normal, you could take logs first and set the lower censoring limit on the log scale. Note that an apparent log-normal distribution of outcomes might just represent a corresponding distribution of predictor values with an underlying normal error distribution around the predictions. Don't jump blindly into a log-normal assumption.
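The "take logs first" step can be sketched as follows. This is only a fixed-effects illustration on simulated data, assuming the survival package (a recommended package that ships with standard R installations) rather than censReg, and it does not include any random effects. The detection limit, sample size and coefficients are all made up.

```r
library(survival)  # recommended package shipped with standard R installs

set.seed(1)
n <- 500
x <- rnorm(n)
# Simulated log-normal outcome: log(y) is linear in x with normal errors
y_true <- exp(1 + 0.5 * x + rnorm(n))

limit    <- 1.5                  # hypothetical detection limit, original scale
censored <- y_true < limit
y_obs    <- pmax(y_true, limit)  # censored values are recorded at the limit

# Take logs first; the censoring limit then becomes log(limit)
log_y <- log(y_obs)

# Left-censored Gaussian regression (a tobit model) on the log scale.
# In Surv(), event = TRUE marks an observed value, FALSE a left-censored one.
fit <- survreg(Surv(log_y, !censored, type = "left") ~ x, dist = "gaussian")
print(coef(fit))
```

The estimated intercept and slope should land near the simulated values of 1 and 0.5. Whether the same log-scale trick carries over to the censReg/plm panel setup is something I have not verified.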

R: Using relative importance (relaimpo package) to build a linear model for prediction?

I have a huge dataset and I'm trying to build a good predictive linear model using the relaimpo package.
Using the calc.relimp function with type="lmg", I get an output of the variables' relative importance. Although the proportion of variance explained by the model is only 52%, I want to go ahead and build a linear model using these variables.
Is there a way to build a lm model using these variables and somehow take into account the relative importance values into the model?
I'm not too familiar with this and was thinking maybe something along the lines of weighting each variable based on its relative importance value...?
I'm not a statistician, so I won't give you any Greek symbols, but I think you are confusing a few things.
As you correctly say, the relative importances based on the LMG method are more or less some sort of variance decomposition in case of correlated predictor variables, i.e. it tells you how much of your variance in the model is explained by which predictor.
However, this doesn't have to do anything with the lm function and its estimation itself. In fact, the R² of your lm model is exactly the same as you'll get by summing up the relative importances from calc.relimp.
There is no way to tell the lm function to pay more attention to a certain predictor during prediction/estimation.
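The point that the LMG shares simply add up to the model R² can be checked by hand for two predictors. The sketch below uses base R only and simulated data (all variable names and coefficients are made up). It computes the LMG shares directly as the average R² increment over both orders of entry, rather than calling calc.relimp.

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)        # correlated with x1
y  <- 1 + 2 * x1 - x2 + rnorm(n)

# Helper: R^2 of a linear model given its formula
r2 <- function(formula) summary(lm(formula))$r.squared

r2_full <- r2(y ~ x1 + x2)
r2_x1   <- r2(y ~ x1)
r2_x2   <- r2(y ~ x2)

# LMG: average each predictor's R^2 increment over both entry orders
lmg_x1 <- mean(c(r2_x1, r2_full - r2_x2))
lmg_x2 <- mean(c(r2_x2, r2_full - r2_x1))

print(c(lmg_x1 = lmg_x1, lmg_x2 = lmg_x2,
        sum = lmg_x1 + lmg_x2, r2_full = r2_full))
```

The two shares sum to the full-model R², so they describe how the fit is split between predictors; they are not weights that could be fed back into lm.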
What you probably want to do is an elastic net (which is a combination of LASSO and RIDGE regression), which basically does what you want, i.e. it shrinks the impact of "unimportant"/small predictors and emphasizes the impact of important/large predictors: https://en.wikipedia.org/wiki/Elastic_net_regularization (Lasso and Ridge regression are linked in the Wikipedia article).
I think this one here is the original package from Jerome Friedman, Trevor Hastie, Rob Tibshirani, et al.: https://cran.r-project.org/web/packages/glmnet/index.html

Extracting normal-distributed subset from a dataset in R

Working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables are distributed normally. Is it possible to extract a data subset where at least one desired variable will be distributed normally? Want to do some statistics after (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distribution of individual variables, and no other reasons speaking against using a particular method (such as logistic regression) on your data, you might want to study the nature of "weird" observations before deciding on which analysis method to use eventually.
I would:
carry out the desired regression analysis (e.g. logistic regression) and, as is always required, carry out residual analysis (Q-Q normal plot, Tukey-Anscombe plot, leverage plot, also see here) to check the model assumptions. See whether the residuals are normally distributed: the normal distribution of model residuals is the actual assumption in linear regression, not that each variable is normally distributed (you might of course have e.g. bimodally distributed data if there are differences between groups). See if there are observations which could be regarded as outliers, study them (see e.g. here), and if possible remove them from the final dataset before re-fitting the linear model without outliers.
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
As suggested by GuedesBF, you may want to find a test or modelling method that has no assumption of normality.
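To illustrate the point about residuals versus raw variables, here is a small base R sketch on simulated data (the groups, means and sample sizes are made up). The outcome is clearly bimodal because of a group difference, yet the residuals of a linear model that includes the group look far more normal.

```r
set.seed(123)
group <- factor(rep(c("a", "b"), each = 100))
# The outcome is bimodal overall because the group means differ
y <- ifelse(group == "a", 0, 5) + rnorm(200)

fit <- lm(y ~ group)

# Shapiro-Wilk W is close to 1 for normal-looking data
w_raw   <- shapiro.test(y)$statistic
w_resid <- shapiro.test(resid(fit))$statistic
print(c(raw = unname(w_raw), residuals = unname(w_resid)))

# Standard residual diagnostics
qqnorm(resid(fit)); qqline(resid(fit))   # Q-Q normal plot
plot(fitted(fit), resid(fit),            # residuals vs fitted (Tukey-Anscombe)
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

This is a linear-model illustration only; for a logistic regression the residual checks are different, but the general message that individual variables need not be normal still applies.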
Before modelling anything or removing any data, I would always plot the data by treatment / outcome groups, and inspect the presence of missing values. After quickly looking at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good start to get an overview of all data, their distribution and pairwise correlation (and if you don't have more than around 20 variables) is to use the psych library and pairs.panels.
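library(psych)
# Read the data; the file has no header row
dat <- read.delim("~/Downloads/dput.txt", header = FALSE)
# Scatterplot matrices with histograms and pairwise correlations
pairs.panels(dat[, 1:12])
pairs.panels(dat[, 13:23])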
You can then quickly see the distribution of each variable, and the presence of correlations among each pair of variables. You can tune arguments of that function to use different correlation methods, and different displays. Happy exploratory data analysis :)

gbm::interact.gbm vs. dismo::gbm.interactions

Background
The reference manual for the gbm package states that the interact.gbm function computes Friedman's H-statistic to assess the strength of variable interactions. The H-statistic is on the scale of [0, 1].
The reference manual for the dismo package does not reference any literature for how the gbm.interactions function detects and models interactions. Instead it gives a list of general procedures used to detect and model interactions. The dismo vignette "Boosted Regression Trees for ecological modeling" states that the dismo package extends functions in the gbm package.
Question
How does dismo::gbm.interactions actually detect and model interactions?
Why
I am asking this question because gbm.interactions in the dismo package yields results >1, which the gbm package reference manual says is not possible.
I checked the tar.gz for each of the packages to see if the source code was similar. It is different enough that I cannot determine if these two packages are using the same method to detect and model interactions.
To summarize, the difference between the two approaches boils down to how the "partial dependence function" of the two predictors is estimated.
The dismo package is based on code originally given in Elith et al., 2008 and you can find the original source in the supplementary material. The paper very briefly describes the procedure. Basically the model predictions are obtained over a grid of two predictors, setting all other predictors at their means. The model predictions are then regressed onto the grid. The mean squared errors of this model are then multiplied by 1000. This statistic indicates departures of the model predictions from a linear combination of the predictors, indicating a possible interaction.
From the dismo package, we can also obtain the relevant source code for gbm.interactions. The interaction test boils down to the following commands (copied directly from source):
interaction.test.model <- lm(prediction ~ as.factor(pred.frame[,1]) + as.factor(pred.frame[,2]))
interaction.flag <- round(mean(resid(interaction.test.model)^2) * 1000,2)
pred.frame contains a grid of the two predictors in question, and prediction is the prediction from the original gbm fitted model where all but two predictors under consideration are set at their means.
This is different from Friedman's H statistic (Friedman & Popescu, 2005), which is estimated via formula (44) for any pair of predictors. This is essentially the departure from additivity for any two predictors, averaging over the values of the other variables, NOT setting the other variables at their means. It is expressed as a fraction of the total variance of the partial dependence function of the two variables (or model implied predictions), so it will always be between 0 and 1.
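To see why the dismo statistic is not bounded by 1, the two commands quoted above can be run on made-up predictions. The sketch below is base R only: the grid and the two "prediction" vectors are hypothetical stand-ins for gbm predictions, and interaction_flag is just a wrapper I wrote around the two quoted lines.

```r
# Hypothetical grid over two predictors
pred.frame <- expand.grid(x1 = seq(0, 1, length.out = 10),
                          x2 = seq(0, 1, length.out = 10))

# dismo-style statistic: residual mean square from an additive
# two-factor fit to the predictions, multiplied by 1000
interaction_flag <- function(prediction, pred.frame) {
  fit <- lm(prediction ~ as.factor(pred.frame[, 1]) + as.factor(pred.frame[, 2]))
  round(mean(resid(fit)^2) * 1000, 2)
}

# Stand-ins for model predictions over the grid
additive    <- 2 * pred.frame$x1 + 3 * pred.frame$x2          # no interaction
interacting <- additive + 4 * pred.frame$x1 * pred.frame$x2   # with interaction

flag_additive    <- interaction_flag(additive, pred.frame)
flag_interacting <- interaction_flag(interacting, pred.frame)
print(c(additive = flag_additive, interacting = flag_interacting))
```

The purely additive surface gives a flag of 0, while the surface with a product term gives a value well above 1. The statistic is on the (squared) scale of the predictions times 1000, not a proportion of variance, so values greater than 1 are expected.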

Mixed Logit fitted probabilities in RSGHB

My question has to do with using the RSGHB package for predicting choice probabilities per alternative by applying mixed logit models (variation across respondents) with correlated coefficients.
I understand that the choice probabilities are simulated on an individual level and in order to get preference share an average of the individual shares would do. All the sources I have found treat each prediction as a separate simulation which makes the whole process cumbersome if many predictions are needed.
Since one can save the respondent-specific coefficient draws, wouldn't it be faster to simply apply the logit transform to each (vector of) coefficient draw? Once this is done, new or existing alternatives could be calculated faster than by rerunning a whole simulation process for each required alternative. For the time being, using a fitted() approach will not help me understand how prediction actually works.
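A minimal sketch of the idea, in base R: the matrix beta below is a random stand-in for saved respondent-level coefficients, and X is a made-up scenario of alternatives. Neither reflects the actual RSGHB output format, which I am assuming would need reshaping into this layout.

```r
set.seed(7)
n_resp <- 50   # respondents
n_attr <- 3    # attributes per alternative

# Stand-in for saved respondent-level coefficients (one row per respondent);
# in practice these would come from the fitted model's output
beta <- matrix(rnorm(n_resp * n_attr), nrow = n_resp)

# Hypothetical scenario: 4 alternatives described by the 3 attributes
X <- matrix(c(1, 0, 0,
              0, 1, 0,
              0, 0, 1,
              1, 1, 0), ncol = n_attr, byrow = TRUE)

# Utilities: respondents x alternatives
V <- beta %*% t(X)

# Logit (softmax) transform per respondent; subtract the row maximum for stability
expV <- exp(V - apply(V, 1, max))
prob <- expV / rowSums(expV)

# Preference share = average of individual choice probabilities
shares <- colMeans(prob)
print(round(shares, 3))
```

Evaluating a new scenario then only requires a new X and one matrix product. If several posterior draws per respondent are saved, the same transform would presumably be applied to each draw and averaged within respondent before averaging across respondents.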
