predict and multiplicative variables / interaction terms in probit regressions - r

I want to determine the marginal effects of each dependent variable in a probit regression as follows:
predict the (base) probability with the mean of each variable
for each variable, predict the change in probability compared to the base probability if the variable takes the value of mean + 1x standard deviation of the variable
In one of my regressions, I have a multiplicative variable, as follows:
my_probit <- glm(a ~ b + c + I(b*c), family = binomial(link = "probit"), data=data)
Two questions:
When I determine the marginal effects using the approach above, will the value of the multiplicative term reflect the value of b or c taking the value mean + 1x standard deviation of the variable?
Same question, but with an interaction term (* and no I()) instead of a multiplicative term.
Many thanks

When interpreting the results of models involving interaction terms, the general rule is DO NOT interpret coefficients. The very presence of interactions means that the meaning of coefficients for terms will vary depending on the other variate values being used for prediction. The right way to go about looking at the results is to construct a "prediction grid", i.e. a set of values that are spaced across the range of interest (hopefully within the domain of data support). The two essential functions for this process are expand.grid and predict.
dgrid <- expand.grid(b=fivenum(data$b)[2:4], c=fivenum(data$c)[2:4]
# A grid with the upper and lower hinges and the medians for `a` and `b`.
predict(my_probit, newdata=dgrid)
You may want to have the predictions on a scale other than the default (which is to return the linear predictor), so perhaps this would be easier to interpret if it were:
predict(my_probit, newdata=dgrid, type ="response")
Be sure to read ?predict and ?predict.glm and work with some simple examples to make sure you are getting what you intended.
Predictions from models containing interactions (at least those involving 2 covariates) should be thought of as being surfaces or 2-d manifolds in three dimensions. (And for 3-covariate interactions as being iso-value envelopes.) The reason that non-interaction models can be decomposed into separate term "effects" is that the slopes of the planar prediction surfaces remain constant across all levels of input. Such is not the case with interactions, especially those with multiplicative and non-linear model structures. The graphical tools and insights that one picks up in a differential equations course can be productively applied here.

Related

Output from Linear Mixed Models differs from Estimated Marginal Means

I have a query about the output statistics gained from linear mixed models (using the lmer function) relative to the output statistics taken from the estimated marginal means gained from this model
Essentially, I am running an LMM comparing the within-subjects effect of different contexts (with "Negative" coded as the baseline) on enjoyment ratings. The LMM output suggests that the difference between negative and polite contexts is not significant, with a p-value of .35. See the screenshot below with the relevant line highlighted:
LMM output
However, when I then run the lsmeans function on the same model (with the Holm correction), the p-value for the comparison between Negative and Polite context categories is now .05, and all of the other statistics have changed too. Again, see the screenshot below with the relevant line highlighted:
LSMeans output
I'm probably being dense because my understanding of LMMs isn't hugely advanced, but I've tried to Google the reason for this and yet I can't seem to find out why? I don't think it has anything to do with the corrections because the smaller p-value is observed when the Holm correction is used. Therefore, I was wondering why this is the case, and which value I should report/stick with and why?
Thank you for your help!
Regression coefficients and marginal means are not one and the same. Once you learn these concepts it'll be easier to figure out which one is more informative and therefore which one you should report.
After we fit a regression by estimating its coefficients, we can predict the outcome yi given the m input variables Xi = (Xi1, ..., Xim). If the inputs are informative about the outcome, the predicted yi is different for different Xi. If we average the predictions yi for examples with Xij = xj, we get the marginal effect of the jth feature at the value xj. It's crucial to keep track of which inputs are kept fixed (and at what values) and which inputs are averaged over (aka marginalized out).
In your case, contextCatPolite in the coefficients summary is the difference between Polite and Negative when smileType is set to its reference level (no reward, I'd guess). In the emmeans contrasts, Polite - Negative is the average difference over all smileTypes.
Interactions have a way of making interpretation more challenging and your model includes an interaction between smileType and contextCat. See Interaction analysis in emmeans.
To add to #dipetkov's answer, the coefficients in your LMM are based on treatment coding (sometimes called 'dummy' coding). With the interactions in the model, these coefficients are no longer "main-effects" in the traditional sense of factorial ANOVA. For instance, if you have:
y = b_0 + b_1(X_1) + b_2(X_2) + b_3 (X_1 * X_2)
...b_1 is "the effect of X_1" only when X_2 = 0:
y = b_0 + b_1(X_1) + b_2(0) + b_3 (X_1 * 0)
y = b_0 + b_1(X_1)
Thus, as #dipetkov points out, 1.625 is not the difference between Negative and Polite on average across all other factors (which you get from emmeans). Instead, this coefficient is the difference between Negative and Polite specifically when smileType = 0.
If you use contrast coding instead of treatment coding, then the coefficients from the regression output would match the estimated marginal means, because smileType = 0 would now be on average across smile types. The coding scheme thus has a huge effect on the estimated values and statistical significance of regression coefficients, but it should not effect F-tests based on the reduction in deviance/variance (because no matter how you code it, a given variable explains the same amount of variance).
https://stats.oarc.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/

Mixed effect model or multiple regressions comparison in nested setup

I have a response Y that is a percentage ranging between 0-1. My data is nested by taxonomy or evolutionary relationship say phylum/genus/family/species and I have one continuous covariate temp and one categorial covariate fac with levels fac1 & fac2.
I am interested in estimating:
is there a difference in Y between fac1 and fac2 (intercept) and how much variance is explained by that
does each level of fac responds differently in regard to temp (linearly so slope)
is there a difference in Y for each level of my taxonomy and how much variance is explained by those (see varcomp)
does each level of my taxonomy responds differently in regard to temp (linearly so slope)
A brute force idea would be to split my data into the lowest taxonomy here species, do a linear beta regression for each species i as betareg(Y(i)~temp) . Then extract slope and intercepts for each speies and group them to a higher taxonomic level per fac and compare the distribution of slopes (intercepts) say, via Kullback-Leibler divergence to a distribution that I get when bootstrapping my Y values. Or compare the distribution of slopes (or interepts) just between taxonomic levels or my factor fac respectively.Or just compare mean slopes and intercepts between taxonomy levels or my factor levels.
Not sure is this is a good idea. And also not sure of how to answer the question of how many variance is explained by my taxonomy level, like in nested random mixed effect models.
Another option may be just those mixed models, but how can I include all the aspects I want to test in one model
say I could use the "gamlss" package to do:
library(gamlss)
model<-gamlss(Y~temp*fac+re(random=~1|phylum/genus/family/species),family=BE)
But here I see no way to incorporate a random slope or can I do:
model<-gamlss(Y~re(random=~temp*fac|phylum/genus/family/species),family=BE)
but the internal call to lme has some trouble with that and guess this is not the right notation anyways.
Is there any way to achive what I want to test, not necessarily with gamlss but any other package that inlcuded nested structures and beta regressions?
Thanks!
In glmmTMB, if you have no exact 0 or 1 values in your response, something like this should work:
library(glmmTMB)
glmmTMB(Y ~ temp*fac + (1 + temp | phylum/genus/family/species),
data = ...,
family = beta_family)
if you have zero values, you will need to do something . For example, you can add a zero-inflation term in glmmTMB; brms can handle zero-one-inflated Beta responses; you can "squeeze" the 0/1 values in a little bit (see the appendix of Smithson and Verkuilen's paper on Beta regression). If you have only a few 0/1 values it won't matter very much what you do. If you have a lot, you'll need to spend some serious time thinking about what they mean, which will influence how you handle them. Do they represent censoring (i.e. values that aren't exactly 0/1 but are too close to the borders to measure the difference)? Are they a qualitatively different response? etc. ...)
As I said in my comment, computing variance components for GLMMs is pretty tricky - there's not necessarily an easy decomposition, e.g. see here. However, you can compute the variances of intercept and slope at each taxonomic level and compare them (and you can use the standard deviations to compare with the magnitudes of the fixed effects ...)
The model given here might be pretty demanding, depending on the size of your phylogeny - for example, you might not have enough replication at the phylum level (in which case you could fit the model ~ temp*(fac + phylum) + (1 + temp | phylum:(genus/family/species)), i.e. pull out the phylum effects as fixed effects).
This is assuming that you're willing to assume that the effects of fac, and its interaction with temp, do not vary across the phylogeny ...

How does fixest handle negative values of the demeaned dependent variable in poisson estimations?

I need to perform glm (poisson) estimations with fixed-effects (say merely unit FE) and several regressors (RHS variables). I have an unbalanced panel dataset where most (~90%) observations have missing values (NA) for some but not all regressors.
fixest::feglm() can handle this and returns my fitted model.
However, to do so, it (and fixest::demean too) removes observations that have at least one regressor missing, before constructing the fixed-effect means.
In my case, I am afraid this implies not using a significant share of available information in the data.
Therefore, I would like to demean my variables by hand, to be able to include as much information as possible in each fixed-effect dimension's mean, and then run feglm on the demeaned data. However, this implies getting negative dependent variable values, which is not compatible with Poisson. If I run feglm with "poisson" family and my manually demeaned data, I (coherently) get: "Negative values of the dependent variable are not allowed for the "poisson" family.". The same error is returned with data demeaned with the fixest::demean function.
Question:
How does feglm handle negative values of the demeaned dependent variable? Is there a way (like some data transformation) to reproduce fepois on a fixed-effect in the formula with fepois on demeaned data and a no fixed-effect formula?
To use the example from fixest::demean documentation (with two-way fixed-effects):
data(trade)
base = trade
base$ln_dist = log(base$dist_km)
base$ln_euros = log(base$Euros)
# We center the two variables ln_dist and ln_euros
# on the factors Origin and Destination
X_demean = demean(X = base[, c("ln_dist", "ln_euros")],
fe = base[, c("Origin", "Destination")])
base[, c("ln_dist_dm", "ln_euros_dm")] = X_demean
and I would like to reproduce
est_fe = fepois(ln_euros ~ ln_dist | Origin + Destination, base)
with
est = fepois(ln_euros_dm ~ ln_dist_dm, base)
I think there are two main problems.
Modelling strategy
In general, it is important to be able to formally describe the estimated model.
In this case it wouldn't be possible to write down the model with a single equation, where the fixed-effects are estimated using all the data and other variables only on the non-missing observations. And if the model is not clear, then... maybe it's not a good model.
On the other hand, if your model is well defined, then removing random observations should not change the expectation of the coefficients, only their variance. So again, if your model is well specified, you shouldn't worry too much.
By suggesting that observations with missing values are relevant to estimate the fixed-effects coefficients (or stated differently, that they are used to demean some variables) you are implying that these observations are not randomly distributed. And now you should worry.
Just using these observations to demean the variables wouldn't remove the bias on the estimated coefficients due to the selection to non-missingness. That's a deeper problem that cannot be removed by technical tricks but rather by a profound understanding of the data.
GLM
There is a misunderstanding with GLM. GLM is a super smart trick to estimate maximum likelihood models with OLS (there's a nice description here). It was developed and used at a time when regular optimization techniques were very expensive in terms of computational time, and it was a way to instead employ well developed and fast OLS techniques to perform equivalent estimations.
GLM is an iterative process where typical OLS estimations are performed at each step, the only changes at each iteration concern the weights associated to each observation. Therefore, since it's a regular OLS process, techniques to perform fast OLS estimations with multiple fixed-effects can be leveraged (as is in the fixest package).
So actually, you could do what you want... but only within the OLS step of the GLM algorithm. By no means you should demean the data before running GLM because, well, it makes no sense (the FWL theorem has absolutely no hold here).

Test of second differences for average marginal effects in logistic regression

I have a question similar to the one here: Testing the difference between marginal effects calculated across factors. I used the same code to generate average marginal effects for two groups. The difference is that I am running a logistic rather than linear regression model. My average marginal effects are on the probability scale, so emmeans will not provide the correct contrast. Does anyone have any suggestions for how to test whether there is a significant difference in the average marginal effects between group 1 and group 2?
Thank you so much,
Ilana
It is a bit unclear what the issue really is, but I'll try. I'm supposing your logistic regression model was fitted using, say, glm:
mod <- glm(cbind(heads, tails) ~ treat, data = mydata, family = binomial())
If you then do
emm <- emmeans(mod, "treat")
emm ### marginal means
pairs(emm) ### differences
Your results will be presented on the logit scale.
If you want them on the probability scale, you can do
summary(emm, type = "response")
summary(pairs(emm), type = "response")
However, the latter will back-transform the differences of logits, thereby producing odds ratios.
If you actually want differences of probabilities rather than ratios of odds, use regrid(), which will construct a new grid of values after back-transforming (and hence it will forget the log transformation):
pairs(regrid(emm))
It seems possible that two or more factors are present and you want contrasts of contrasts on the probability scale. In that case, extend this idea by calling regrid() on the table of EMMs to put everything on the probability scale, then follow the analogous procedure used in the linked article.

Address unequal variance between groups before applying contrasts for a linear model? (r)

My Goal: I have an ordinal factor variable (5 levels) to which I would like to apply contrasts to test for a linear trend. However, the factor groups have heterogeneity of variance.
What I've done: Upon recommendation, I used lmRob() from robust pckg to create a robust linear model, then applied the contrasts.
# assign the codes for a linear contrast of 5 groups, save as object
contrast5 <- contr.poly(5)
# set contrast property of sf1 to contain the weights
contrasts(SCI$sf1) <- contrast5
# fit and save a robust model (exhaustive instead of subsampling)
robmod.sf1 <- lmRob(ICECAP_A ~ sf1, data = SCI, nrep = Exhaustive)
summary.lmRob(robmod.sf1)
My problem: I have since been reading that robust regression is more suited to address outliers, and not heterogeneity of variance. (bottom of https://stats.idre.ucla.edu/r/dae/robust-regression/_ ) This UCLA page (among others) suggests the sandwich package to get heteroskedastic-consistent (HC) standard errors (such as in https://thestatsgeek.com/2014/02/14/the-robust-sandwich-variance-estimator-for-linear-regression-using-r/ ).
But these examples use a series of functions/calls to generate output that gives you the HC that could be used to calculate confidence intervals, t-values, p-values etc.
My thinking is that if I use vcovHC(), I could get the HC std errors, but the HC std errors would not have been 'applied'/a property of the model, so I couldn't pass the model (with the HC errors) through a function to apply the contrasts that I ultimately want. I hope I am not conflating two separate concepts, but surely if a function addresses/down-weights outliers, that should at least somewhat address unequal variances as well?
Can anyone confirm if my reasoning is sound (and thus remain with lmRob()? Or suggest how I could just correct my standard errors and still apply the contrasts?
vcovHC is the right function to deal with heteroscedasticity. HC stands for heteroscedasticity-consistent estimator. This will not downweight outliers in estimates of model effects, but it will calculated the CIs and p-values differently to accommodate the impact of such outlying observations. lmRob does downweight outlying values and does not handle heteroscedasticity
See more here:
https://stats.stackexchange.com/questions/50778/sandwich-estimator-intuition/50788#50788

Resources