Plotting standard errors for effects in R

I have an lme4 hierarchical logistic regression model, and I'm plotting the effects using the effects package. I would like to create an effects graph with the standard error of the mean as the error bars. I can get the point estimates, 95% confidence intervals, and standard errors into a data frame. The standard errors, however, seem at odds with the confidence limits; see below for an example with a regular glm.
library(effects)
library(dplyr)
mtcars <- mtcars %>%
  mutate(vs = factor(vs))
glm1 <- glm(am ~ vs, mtcars, family = "binomial")
(glm1_eff <- Effect("vs", glm1) %>%
  as.data.frame())
  vs       fit        se     lower     upper
1  0 0.3333333 0.4999999 0.1580074 0.5712210
2  1 0.5000000 0.5345225 0.2596776 0.7403224
My understanding is that the fit column displays the point estimate of the probability that am equals 1, and that lower and upper are the corresponding 95% confidence limits. Note that the standard error does not seem to correspond to the confidence interval (e.g., .33 + .49 > .57).
Here's what I am shooting for: instead of a 95% confidence interval, I would like an effects plot with error bars of ± the standard error of the mean. Are the standard errors in log-odds rather than probabilities? Is there a simple way to convert them to probabilities so that I can make the graph?

John Fox shared this helpful response:
From ?Effect: "se: (for "eff" objects) a vector of standard errors for the effect, on the scale of the linear predictor." So the standard errors are on the log-odds scale. You could use the delta method to get standard errors on the probability scale, but that would be very ill-advised, since the approach to asymptotic normality of estimated probabilities is much slower than that of log-odds. Effect() computes confidence limits on the scale of the linear predictor (log-odds for a logit model) and then inverse-transforms them to the scale of the response (probabilities).
All of the information you need to create a custom plot is in the "eff" object returned by Effect(); the contents of the object are documented in ?Effect.
I agree, by the way, that the as.data.frame.eff() method could be improved, and I'll do that when I have a chance. In particular, it invites misunderstanding to report the effects and confidence limits on the scale of the response but to show standard errors for the linear-predictor scale.

I'll answer the mystery first, then address the "show SE on the plot" question.
Explanation of the SE mystery: all of the math in a GLM is done on the link scale, because that is the additive scale (where quantities can be added up). So:
- The values in the column "fit" are the predicted probabilities of success (the "predictions on the response scale"). Their values are expit(b0) and expit(b0 + b1), where expit() is the inverse logit function.
- The SEs are on the link scale. An SE on the response scale doesn't make much sense, because the response scale is non-linear (although it is a bit odd to have statistics on the response and link scales in the same table).
- "lower" and "upper" are on the response scale, so these are the CIs of the predicted probabilities of success. They are computed as expit(b0 ± 1.96*SE) and expit(b0 + b1 ± 1.96*SE).
To recover these values from what is given:
library(boot) # provides the inv.logit() and logit() functions
expit.pred_0 <- 1/3       # fit for vs = 0
expit.pred_1 <- 1/2       # fit for vs = 1
se1 <- 1/2                # se for vs = 0
se2 <- .5345225           # se for vs = 1
inv.logit(logit(expit.pred_0) - qnorm(.975)*se1)  # [1] 0.1580074
inv.logit(logit(expit.pred_0) + qnorm(.975)*se1)  # [1] 0.5712211
inv.logit(logit(expit.pred_1) - qnorm(.975)*se2)  # [1] 0.2596776
inv.logit(logit(expit.pred_1) + qnorm(.975)*se2)  # [1] 0.7403224
Showing an SE computed from a GLM on the response (non-additive) scale doesn't make sense, because the SE is only additive on the link scale. In other words, multiplying the SE by some quantile on the response scale (the scale of the plot you envision, with probability on the y-axis) is meaningless. A CI, by contrast, is a point estimate back-transformed from the link scale, and so makes sense for plotting.
I frequently see researchers plot SE bars computed from a linear model, as you envision, even though the statistics presented are from a GLM. These SEs are arguably meaningful in some sense, but they often imply absurd consequences (such as probabilities less than zero or greater than one), so don't do that either.
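If you still want an SE-style band on the plot, the defensible way is to form the interval on the link scale and back-transform it, just as the confidence limits are built. Here is a minimal sketch (mine, not from the original answers) doing that for the mtcars example above, relying on the fit and se columns shown earlier; plogis() and qlogis() are the base-R equivalents of expit() and logit():
library(effects)
library(ggplot2)
mtcars$vs <- factor(mtcars$vs)
glm1 <- glm(am ~ vs, mtcars, family = "binomial")
eff_df <- as.data.frame(Effect("vs", glm1))
# +/- 1 SE on the link (log-odds) scale, back-transformed to probabilities
eff_df$lo_se <- plogis(qlogis(eff_df$fit) - eff_df$se)
eff_df$hi_se <- plogis(qlogis(eff_df$fit) + eff_df$se)
ggplot(eff_df, aes(x = vs, y = fit)) +
  geom_point() +
  geom_errorbar(aes(ymin = lo_se, ymax = hi_se), width = 0.1) +
  labs(y = "Predicted probability of am = 1")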

Related

How do I specify the dispersion parameter when computing the confidence interval for a GLM?

I have a model of exponential decay of the form Y = exp{a + bX + cW}. In R, I represent this as a generalized linear model (GLM) with a gamma random component and a log link function.
fitted <- glm(Y ~ X + W, family=Gamma(link='log'))
I know from this post that for the standard errors to really represent an exponential rather than gamma random component, I need to specify the dispersion parameter as being 1 when I call summary.
summary(fitted, dispersion=1)
summary(fitted) # not the same!
Now I want to find the 95% confidence intervals for my estimates of a, b, and c. However, there seems to be no way to specify the dispersion parameter for confint, even though I know it should affect the confidence interval (because it affects the standard error).
confint(fitted)
confint(fitted, dispersion=1) # same as the last confint :(
So, in order to get the confidence intervals corresponding to an exponential rather than gamma random component, how do I specify the dispersion parameter when computing the confidence interval for a GLM?
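One workaround, sketched here rather than taken from the original thread: confint on a glm computes profile-likelihood intervals, which is presumably why the dispersion argument is ignored, but Wald intervals can be assembled by hand from the summary that does accept it:
cf <- coef(summary(fitted, dispersion = 1))  # estimates and SEs with dispersion fixed at 1
cbind(lower = cf[, "Estimate"] - qnorm(.975) * cf[, "Std. Error"],
      upper = cf[, "Estimate"] + qnorm(.975) * cf[, "Std. Error"])
Note these are Wald intervals, not the profile-likelihood intervals confint would produce.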

Test of second differences for average marginal effects in logistic regression

I have a question similar to the one here: Testing the difference between marginal effects calculated across factors. I used the same code to generate average marginal effects for two groups. The difference is that I am running a logistic rather than linear regression model. My average marginal effects are on the probability scale, so emmeans will not provide the correct contrast. Does anyone have any suggestions for how to test whether there is a significant difference in the average marginal effects between group 1 and group 2?
Thank you so much,
Ilana
It is a bit unclear what the issue really is, but I'll try. I'm supposing your logistic regression model was fitted using, say, glm:
mod <- glm(cbind(heads, tails) ~ treat, data = mydata, family = binomial())
If you then do
library(emmeans)
emm <- emmeans(mod, "treat")
emm          ### marginal means
pairs(emm)   ### differences
Your results will be presented on the logit scale.
If you want them on the probability scale, you can do
summary(emm, type = "response")
summary(pairs(emm), type = "response")
However, the latter will back-transform the differences of logits, thereby producing odds ratios.
If you actually want differences of probabilities rather than ratios of odds, use regrid(), which constructs a new grid of values after back-transforming (and hence it forgets the logit transformation):
pairs(regrid(emm))
It seems possible that two or more factors are present and you want contrasts of contrasts on the probability scale. In that case, extend this idea by calling regrid() on the table of EMMs to put everything on the probability scale, then follow the analogous procedure used in the linked article, as sketched below.
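For concreteness, a hedged sketch of that extension (the second factor, group, is hypothetical here):
mod2 <- glm(cbind(heads, tails) ~ treat * group, data = mydata, family = binomial())
emm2 <- emmeans(mod2, ~ treat * group)
prob2 <- regrid(emm2)                      # everything now on the probability scale
contrast(prob2, interaction = "pairwise")  # contrasts of contrasts of probabilities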

Contrast plot for GAM using mgcv

When using the visreg package to visualise a GAM with a contrast plot, the confidence interval goes to zero at the inflection point when the graph is U-shaped:
# Load libraries
library(mgcv)
library(visreg)
# Synthetic data
df <- data.frame(a = -10:10, b = jitter((-10:10)^2, amount = 10))
# Fit GAM
res <- gam(b ~ s(a), data = df)
# Make contrast figure
visreg(res, type = "contrast")
This seems dodgy and doesn't happen when making a conditional plot (i.e., visreg(res, type = "conditional")), so instead I'm looking at the mgcv package to make the same plot. I can make a conditional plot using mgcv (e.g., plot.gam(res)), but I don't see the option to make a contrast plot. Is this possible with the mgcv package?
This is due to identifiability constraints imposed on the spline basis/bases used in the model. This is a sum-to-zero constraint and effectively removes an intercept-like basis function from the basis used for each smooth term so that these are not confounded with the model intercept. This allows the model to be identifiable, rather than having an infinity of solutions.
Under standard theory, the confidence interval has to tend to zero where the estimate crosses zero on the y-axis (usually the centred effect, though here it is shown on some transformed scale), because the constraint implies that at some point x the effect is 0 and has 0 variance.
This is nonsense, of course, and recent research has investigated the problem. One solution, provided by Simon Wood and colleagues, employs extensions of Nychka's observation that, in the Gaussian case, the Bayesian credible interval for a smooth has good across-the-function interpretation as a frequentist confidence interval (so not pointwise, but not simultaneous either). Nychka's result on the coverage properties of the interval fails in situations where the squared bias of the estimated smooth is not substantially less than its variance; that condition clearly fails when the variance hits zero as the estimated smooth passes through zero effect, since the bias is not actually zero at that point.
Marra and Wood (2012) have extended these results to the generalized model setting, basically estimating the confidence interval for one smooth by assuming that all the other terms in the model have had the identifiability constraints applied to them, but not the smooth of interest. This shifts the focus of inference from the smooth directly to the smooth + intercept. You can turn this on in plot.gam() with the argument seWithMean = TRUE.
I don't see an easy way to make visreg do this, however, although it is trivial to get back the information you want via predict.gam() with the options type = 'iterms', se.fit = TRUE. This returns, on the scale of the linear predictor, the contribution of each smooth term plus standard errors that include the correction implied by seWithMean. You can then fiddle with this to your heart's content; adding on the model constant term (the estimate of the intercept), for example, should give you something close to the figure shown in your question.
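For example, a minimal sketch along those lines (mine, not from the original answer), reusing the synthetic data from the question:
library(mgcv)
df <- data.frame(a = -10:10, b = jitter((-10:10)^2, amount = 10))
res <- gam(b ~ s(a), data = df)
newd <- data.frame(a = seq(-10, 10, length.out = 200))
# type = "iterms" returns per-term contributions; its SEs include the
# uncertainty about the overall mean (the seWithMean-style correction)
pr <- predict(res, newd, type = "iterms", se.fit = TRUE)
fit <- pr$fit[, "s(a)"]
se <- pr$se.fit[, "s(a)"]
plot(newd$a, fit, type = "l", xlab = "a", ylab = "centred effect of a")
lines(newd$a, fit + 2 * se, lty = 2)
lines(newd$a, fit - 2 * se, lty = 2)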

Confidence intervals for predicted probabilities from predict.lrm

I am trying to determine confidence intervals for predicted probabilities from a binomial logistic regression in R. The model is estimated using lrm (from the package rms) to allow for clustering standard errors on survey respondents (each respondent appears up to 3 times in the data):
library(rms)
model1 <- lrm(outcome ~ var1 + var2 + var3, data = mydata, x = T, y = T, se.fit = T)
model.rob <- robcov(model1, cluster = respondent.id)
I am able to estimate a predicted probability for the outcome using predict.lrm:
predicted.prob <- predict(model.rob,
                          newdata = data.frame(var1 = 1, var2 = .33, var3 = .5),
                          type = "fitted")
What I want to determine is a 95% confidence interval for this predicted probability. I have tried specifying se.fit = T, but this is not permitted in predict.lrm when type = "fitted".
I have spent the last few hours scouring the Internet for how to do this with lrm to no avail (obviously). Can anyone point me toward a method for determining this confidence interval? Alternatively, if it is impossible or difficult with lrm models, is there another way to estimate a logit with clustered standard errors for which confidence intervals would be more easily obtainable?
The help file for predict.lrm has a clear example. Here is a slight modification of it:
L <- predict(fit, newdata=data.frame(...), se.fit=TRUE)
plogis(with(L, linear.predictors + 1.96*cbind(- se.fit, se.fit)))
For some problems you may want to use the gendata or Predict functions, e.g.
L <- predict(fit, gendata(fit, var1=1), se.fit=TRUE) # leave other vars at median/mode
Predict(fit, var1=1:2, var2=3) # leave other vars at median/mode; gives CLs
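Applied to the model in the question, the recipe would look something like this (a sketch; the variable names come from the question):
L <- predict(model.rob, newdata = data.frame(var1 = 1, var2 = .33, var3 = .5),
             se.fit = TRUE)
# back-transform the +/- 1.96 SE limits from the logit to the probability scale
plogis(with(L, linear.predictors + 1.96 * cbind(lower = -se.fit, upper = se.fit)))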

Scale back linear regression coefficients in R from scaled and centered data

I'm fitting a linear model using OLS and have scaled my regressors with the scale function in R because of the different units of measure of the variables. I then fit the model using the lm command and get the coefficients of the fitted model. As far as I know, the coefficients of the fitted model are not in the same units as the original regressors and therefore must be scaled back before they can be interpreted. I have been searching for a direct way to do this but couldn't find anything. Does anyone know how?
Please have a look at the code; could you help me implement what you proposed?
library(zoo)
filename <- "DataReg4.csv"
filepath <- paste("C:/Reg/", filename, sep = "")
separator <- ";"
readfile <- read.zoo(filepath, sep = separator, header = T, format = "%m/%d/%Y", dec = ".")
readfile <- as.data.frame(readfile)
str(readfile)
DF <- readfile
DF <- as.data.frame(scale(DF))
fm <- lm(USD_EUR ~ diff_int + GDP_US + Net.exports.Eur, data = DF)
summary(fm)
plot(fm)
I'm sorry, here is the data:
http://www.mediafire.com/?hmcp7urt0ag8187
If you used the scale function with default arguments then your regressors will be centered (subtracting their mean) and divided by their standard deviations. You can interpret the coefficients without transforming them back to the original units:
Holding everything else constant, on average, a one standard deviation change in one of the regressors is associated with a change in the dependent variable corresponding to the coefficient of that regressor.
If you have included an intercept term in your model keep in mind that the interpretation of the intercept will change. The estimated intercept now represents the average level of the dependent variable when all of the regressors are at their average levels. This is a result of subtracting the mean from each variable.
To interpret the coefficients in non-standard-deviation terms, just calculate the standard deviation of each regressor and divide the coefficient by it.
To de-scale or back-transform regression coefficients from a regression with scaled predictor variable(s) and a non-scaled response, compute the intercept and slope as:
A = As - Bs*Xmean/sdx
B = Bs/sdx
so the regression is
Y = (As - Bs*Xmean/sdx) + (Bs/sdx)*X
where
As    = intercept from the scaled regression
Bs    = slope from the scaled regression
Xmean = the mean of the original (unscaled) predictor variable
sdx   = the standard deviation of the original (unscaled) predictor variable
This can be adjusted if Y was also scaled but it appears you decided not to do that ultimately with your dataset.
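As a quick numerical check of those formulas, here is a minimal sketch (mine, not from the original answer) using a built-in dataset, with only the predictor scaled:
x <- mtcars$wt
y <- mtcars$mpg
fm_raw <- lm(y ~ x)            # unscaled regression
fm_scaled <- lm(y ~ scale(x))  # predictor centred and scaled
As <- coef(fm_scaled)[1]
Bs <- coef(fm_scaled)[2]
A <- As - Bs * mean(x) / sd(x)  # back-transformed intercept
B <- Bs / sd(x)                 # back-transformed slope
all.equal(unname(c(A, B)), unname(coef(fm_raw)))  # TRUE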
If I understand your description (which is unfortunately code-free at the moment), you are getting standardized regression coefficients for Y ~ As + Bs*Xs, where all of those "s" items are scaled variables. The coefficients are then the predicted change in Y, on a standard-deviation scale, associated with a one-standard-deviation change in X. The scale function will have recorded the means and standard deviations in attributes of the scaled object; if not, you will have those estimates somewhere in your console log. The estimated change dY for a change dX in X should satisfy dY*(1/sdY) = Bs*dX*(1/sdX). Predictions should be something along these lines:
Yest = Ymean + sdY*(As + Bs*Xs)
You probably should not have needed to standardize the Y values, and I'm hoping that you didn't, because it makes dealing with the adjustment for the means of the X's easier. Put some code and example data in if you want implemented and checked answers. I think #DanielGerlance is correct in saying to multiply rather than divide by the SDs.
