plot logistic regression line over heat plot - r

My data is binary with two linear independent variables. For both predictors, as they get bigger, there are more positive responses. I have plotted the data in a heatplot showing density of positive responses along the two variables. There are the most positive responses in the top right corner and negative responses in the bottom left, with a gradient change visible along both axes.
I would like to plot a line on the heatplot showing where a logistic regression model predicts that positive and negative responses are equally likely. (My model is of the form response~predictor1*predictor2+(1|participant).)
My question: How can I figure out the line based on this model at which the positive response rate is 0.5?
I tried using predict(), but that works the opposite way; I have to give it values for the factor rather than giving the response rate I want. I also tried using a function that I used before when I had only one predictor (function(x) (log(x/(1-x))-fixef(fit)[1])/fixef(fit)[2]), but I can only get single values out of that, not a line, and I can only get values for one predictor at a time.

Using a simple example logistic regression model fitted to the mtcars dataset, and the algebra described here, I can produce a heatmap with a decision boundary using:
library(ggplot2)
library(tidyverse)
data("mtcars")
m1 = glm(am ~ hp + wt, data = mtcars, family = binomial)
# Generate combinations of hp and wt across their observed range. Only
# generating 50 values of each here, which is not a lot but since each
# combination is included, you get 50 x 50 rows
pred_df = expand.grid(
hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 50),
wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50)
)
pred_df$pred_p = predict(m1, pred_df, type = "response")
# For a given value of hp (predictor1), find the value of
# wt (predictor2) that will give predicted p = 0.5
find_boundary = function(hp_val, coefs) {
beta_0 = coefs['(Intercept)']
beta_1 = coefs['hp']
beta_2 = coefs['wt']
# Solve beta_0 + beta_1 * hp + beta_2 * wt = 0 for wt (log-odds of 0, i.e. p = 0.5)
(-beta_0 - beta_1 * hp_val) / beta_2
}
# Find the boundary value of wt for each of the 50 values of hp
# Using the algebra in the linked question you can instead find
# the slope and intercept of the boundary, so you could potentially
# skip this step
boundary_df = pred_df %>%
select(hp) %>%
distinct %>%
mutate(wt = find_boundary(hp, coef(m1)))
ggplot(pred_df, aes(x = hp, y = wt)) +
geom_tile(aes(fill = pred_p)) +
geom_line(data = boundary_df)
Producing:
Note that this only takes into account the fixed effects from the model, so if you want to somehow take into account random effects this could be more complex.
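The model in the question includes an interaction (predictor1*predictor2), which the mtcars example above leaves out. With an interaction the p = 0.5 boundary is no longer a straight line: setting the log-odds b0 + b1*x1 + b2*x2 + b3*x1*x2 to zero gives x2 = -(b0 + b1*x1) / (b2 + b3*x1). Here is a minimal sketch of that algebra on simulated data. The data, the names x1, x2, fit_int and boundary_x2 are all made up for illustration, and it uses glm() rather than a mixed model. For a glmer fit you would presumably pass fixef(fit) instead of coef(fit), which again describes the fixed effects only.

```r
set.seed(1)
n <- 500
sim <- data.frame(x1 = runif(n, 0, 10), x2 = runif(n, 0, 10))
# Simulated binary response with a positive interaction (illustrative values)
eta <- -6 + 0.5 * sim$x1 + 0.5 * sim$x2 + 0.05 * sim$x1 * sim$x2
sim$y <- rbinom(n, 1, plogis(eta))

fit_int <- glm(y ~ x1 * x2, data = sim, family = binomial)
b <- coef(fit_int)

# For a given x1, the x2 at which the linear predictor is 0 (p = 0.5)
boundary_x2 <- function(x1, b) {
  -(b["(Intercept)"] + b["x1"] * x1) / (b["x2"] + b["x1:x2"] * x1)
}

x1_grid <- seq(0, 10, length.out = 50)
x2_grid <- seq(0, 10, length.out = 50)
boundary <- data.frame(x1 = x1_grid, x2 = unname(boundary_x2(x1_grid, b)))

# Check: predictions along the boundary should all be 0.5
p_on_boundary <- predict(fit_int, newdata = boundary, type = "response")

# Heat map of predicted probability with the boundary drawn on top
grid <- expand.grid(x1 = x1_grid, x2 = x2_grid)
grid$p <- predict(fit_int, newdata = grid, type = "response")
image(x1_grid, x2_grid, matrix(grid$p, nrow = length(x1_grid)),
      xlab = "x1", ylab = "x2")
lines(boundary$x1, boundary$x2, lwd = 2)
```

The curve can leave the plotted range (or blow up where b2 + b3*x1 is close to zero), so it is worth checking the boundary values before plotting them.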

Related

How to make a forest plot for a mixed model

How can I make a forest plot of the coefficients of a mixed model and their corresponding confidence intervals?
I tried this code
model = lme(fixed = score ~ Age + Sex + yearsofeducation + walkspeed,
random = ~1|ID,
data = DB,
na.action = na.omit, method = "ML"
)
plot_summs(model)
However, I want the OR in the forest plots to be ordered in a descending fashion.
Thanks for the help.
I would call this a "coefficient plot", not a "forest plot". (A forest plot is used in meta-analyses, when you are comparing the magnitude of estimates of the same effect from many different studies.)
example setup
This is a slightly silly example, but should be close enough to yours (not clear to me why you're mentioning OR (= odds ratios?), these are typically from a logistic regression ... ?)
library(nlme)
mtcars <- transform(mtcars, cylgear = interaction(cyl, gear))
m1 <- lme(mpg ~ disp + hp + drat + wt + qsec,
random = ~1|cylgear,
data = mtcars)
coefficient plots: dotwhisker
You could get approximately what you want directly from the dotwhisker package, but it won't sort effects (or not easily, as far as I know):
library(dotwhisker)
library(broom.mixed) ## required to 'tidy' (process) lme fits
dwplot(m1, effects = "fixed")
coefficient plots: tidyverse
I usually do the processing myself, as I prefer increased flexibility.
library(tidyverse)
tt <- (m1
## extract estimates and CIs
|> tidy(effects = "fixed", conf.int = TRUE)
## usually *don't* want to compare intercept (dwplot does this automatically)
|> filter(term != "(Intercept)")
## scale parameters by 2SD - usually necessary for comparison
|> dotwhisker::by_2sd(data = mtcars)
## take only the bits we need, rename some (cosmetic)
|> select(term, estimate, lwr = conf.low, upr = conf.high)
## order terms by estimate value
|> mutate(across(term, ~reorder(factor(.), estimate)))
)
gg0 <- (ggplot(tt,
aes(estimate, term))
+ geom_pointrange(aes(xmin = lwr, xmax = upr))
+ geom_vline(xintercept = 0, lty = 2)
)
print(gg0)
The only remaining, possibly tricky, question here is what to do if you have positive and negative coefficients of similar magnitude. If you want to sort by absolute value then
|> mutate(across(term, ~reorder(factor(.), estimate,
FUN = function(x) mean(abs(x)))))
although this gets a bit ugly.
If you like the tidyverse you can substitute forcats::fct_reorder for reorder.
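If you would rather not depend on any packages, the same idea (extract estimates and intervals, drop the intercept, sort, plot) can be sketched in base R. This is only an illustration under some assumptions: lm() stands in for the mixed model to keep it self-contained, and coef_tab is a made-up name. The estimates are sorted in descending order, as the question asked.

```r
# lm() stands in for the mixed model here (an assumption for this sketch)
fit <- lm(mpg ~ disp + hp + drat + wt + qsec, data = mtcars)

est <- coef(fit)[-1]                       # drop the intercept
ci  <- confint(fit)[-1, , drop = FALSE]    # 95% confidence intervals
ord <- order(est, decreasing = TRUE)       # descending by estimate

coef_tab <- data.frame(
  term     = names(est)[ord],
  estimate = unname(est[ord]),
  lwr      = unname(ci[ord, 1]),
  upr      = unname(ci[ord, 2]),
  row.names = NULL
)

# Largest estimate at the top of the plot
y <- rev(seq_len(nrow(coef_tab)))
plot(coef_tab$estimate, y, pch = 19, yaxt = "n",
     xlim = range(coef_tab$lwr, coef_tab$upr),
     xlab = "Estimate (95% CI)", ylab = "")
segments(coef_tab$lwr, y, coef_tab$upr, y)
axis(2, at = y, labels = coef_tab$term, las = 1)
abline(v = 0, lty = 2)
```

The coefficients here are on their raw scales, so the comparison across terms is only meaningful after some standardisation, as noted above. For odds ratios from a logistic model you would exponentiate the estimate and both interval limits before plotting and move the reference line to 1.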
I’m just adding one more option to Ben Bolker’s excellent answer: using the modelsummary package. (Disclaimer: I am the author.)
With that package, you can use the modelplot() function to create a forest plot, and the coef_map argument to rename and reorder coefficients. If you are estimating a logit model and want the odds ratios, you can use the exponentiate argument.
The order in which you insert coefficients in the coef_map vector sorts them in the plot, from bottom to top. For example:
library(lme4)
library(modelsummary)
mod <- lmer(mpg ~ wt + drat + (1 | gear), data = mtcars)
modelplot(
mod,
coef_map = c("(Intercept)" = "Constant",
"drat" = "Rear Axle Ratio",
"wt" = "Weight"))

Find Cook's Distance on Predicted Values for LM

Problem
I would like to use Cook's distance to identify outliers in my predicted data.
Background
I know it is easy to find the outliers in the original data used to build a linear model using cooks.distance() (illustrated in Example 1 below).
More Explanation of Problem
When I fit new data with that model (using predict()), I can't see how to get the Cook's distance on the new points since cooks.distance() only operates on a model object. I understand that it is calculated by a leave-one-out method iteratively rebuilding the model so perhaps it doesn't make sense to calculate it on fitted values but I was hoping that I'm missing something simple about how one might approach this.
Desired Output
In Example 2 below I show the predicted values, where I'd like to highlight outliers by their Cook's D. Since I didn't know how to do that, I used their residuals instead to illustrate something close to my desired output.
Example 1
# subset data
a <- mtcars[1:16,]
# build model on one half
m <- lm(mpg ~ disp, a)
# find outliers
c <- cooks.distance(m)
# visualize outliers with cook's d
pal <- colorRampPalette(c("black", "red"))(102)
with(a,
plot(mpg ~ disp,
col = pal[1 + round(100 * scale(c, min(c), max(c)))],
pch = 19,
main = "Color by Cook's D")); abline(m)
Example 2
# predict on full data and add residuals
b <- mtcars
b$pred_mpg <- predict(m, mtcars)
b$resid <- b$mpg - b$pred_mpg
# visualize outliers in full data by residuals
with(b,
plot(mpg ~ disp,
pch = 19,
col = pal[1 + round(100 * scale(resid, min(resid), max(resid)))],
main = "Color by Residual")); abline(m)
Created on 2022-03-10 by the reprex package (v2.0.1)
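One way to make the leave-one-out idea concrete for new points, offered only as a sketch and not as an established method: add each new point to the training data on its own, refit, and take that point's Cook's distance in the augmented fit. That measures how much the fit would move if the point had been part of the training data. The names new_pts and cooks_for_new are made up here, and the "new" data is taken to be the rows of mtcars that were not used to build m.

```r
a <- mtcars[1:16, ]          # training half, as in Example 1
new_pts <- mtcars[17:32, ]   # rows treated as "new" data (an assumption)

# For each new point: append it to the training data, refit, and
# extract the Cook's distance of the appended row
cooks_for_new <- vapply(seq_len(nrow(new_pts)), function(i) {
  aug <- rbind(a, new_pts[i, ])
  fit_aug <- lm(mpg ~ disp, data = aug)
  unname(cooks.distance(fit_aug)[nrow(aug)])
}, numeric(1))
names(cooks_for_new) <- rownames(new_pts)

# Largest values first
head(sort(cooks_for_new, decreasing = TRUE))
```

This refits the model once per new point, which is cheap for lm() but may be slow for large data or expensive models.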

Shaded confidence interval bands for glm coefficients with covariates set to mean values

I would like to plot the fitted line and shaded 95% confidence interval bands from a glm model (family binomial), either in base graphics (for example using polygon) or using ggplot. For linear models (lm), I have previously been able to plot the confidence intervals from the predictions, as they included the fit and the lower and upper levels, but I do not know how to do it here. I have tried using predict.glm with the optional argument se.fit set to TRUE and then calculating the confidence intervals as the prediction +/- 1.96 * std.error, but it did not work for me.
Thanks for help in advance. You can find here the data that I used (it contains 10 variables and 996 observations): https://drive.google.com/file/d/1Yu7Dk2eh0R1ztKiuNTtN_W5Yg4C2Ne-2/view?usp=sharing Code and figure here:
# Models
mod= glm(site ~S + age + pH + soil + peat+
spruce+ I(spruce^2)+pine+ birch+
tsumma+ I(tsumma^2),
data=test.dat,family=binomial)
# Means of all covariates
means = apply(test.dat[,c("S", "pH","soil", "spruce", "pine","birch", "tsumma")],2,mean,na.rm=T)
# Calculate the constant given by all other covariates being at their means and assuming only pine on the plot
const = mod$coefficients[1]+
mod$coefficients["S"]*means["S"]+
mod$coefficients["pH"]*means["pH"]+
mod$coefficients["soil"]*means["soil"]+
mod$coefficients["spruce"]*means["spruce"]+
mod$coefficients["I(spruce^2)"]*means["spruce"]*means["spruce"]+
mod$coefficients["pine"]*means["pine"]+
mod$coefficients["birch"]*means["birch"]+
mod$coefficients["tsumma"]*means["tsumma"]+
mod$coefficients["I(tsumma^2)"]*means["tsumma"]*means["tsumma"]
# Plot
age = seq(from=min(test.dat$age,na.rm=T),to=150,length=100)
lin= const + mod$coefficients["age"]*age
Pr = exp(lin) / (exp(lin)+1)
par(mar = c(4, 4, 1.5, 0.3))
plot(age,Pr,type="l", ylim=c(0,.5),las=1, main="Probability of hotspot", ylab="Probability of occurrence",xlab="Forest age (years)")
You can use a package, indicating the term to plot while holding others constant:
library(sjPlot)
set.seed(888)
data = mtcars
data$vs = data$vs + rnorm(nrow(data))
mod = glm(am ~ disp + vs + carb+ I(vs^2),data=data,family="binomial")
plot_model(mod,type="pred",terms="disp")
Or derive it like you did, except I think you might need to create the extra term for the squared value, so that you can hold the other terms at their means, and use the predict function (which dispatches to predict.glm for a glm fit):
data$vs2 = data$vs^2
mod = glm(am ~ disp + vs + carb+ vs2,data=data,family="binomial")
varMeans = colMeans(mod$model)[c("vs","carb","vs2")]
pred_disp = seq(min(data$disp),max(data$disp),length.out=100)
df = data.frame(
disp = pred_disp,
t(replicate(length(pred_disp),varMeans))
)
pred = predict(mod,df,se=TRUE)
plot(df$disp,plogis(pred$fit),"l")
lines(df$disp,plogis(pred$fit + 1.96*pred$se.fit),col="blue",lty=8)
lines(df$disp,plogis(pred$fit - 1.96*pred$se.fit),col="blue",lty=8)
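Since the question asked for a shaded band with polygon, here is a minimal base-graphics sketch. The data are simulated and the variable names (site, age, pH) only echo the question, so treat all of it as illustrative. The important step is to build the interval on the link scale and then transform with plogis, which keeps the band inside [0, 1].

```r
set.seed(42)
n <- 300
d <- data.frame(age = runif(n, 10, 150), pH = rnorm(n, 5, 0.5))
# Simulated binary response (illustrative coefficients)
d$site <- rbinom(n, 1, plogis(-3 + 0.02 * d$age + 0.2 * (d$pH - 5)))

mod <- glm(site ~ age + pH, data = d, family = binomial)

# Vary age, hold the other covariate at its mean
nd <- data.frame(age = seq(min(d$age), max(d$age), length.out = 100),
                 pH  = mean(d$pH))
pr <- predict(mod, newdata = nd, type = "link", se.fit = TRUE)

# Interval on the link scale, then back-transform to probabilities
fit <- plogis(pr$fit)
lwr <- plogis(pr$fit - 1.96 * pr$se.fit)
upr <- plogis(pr$fit + 1.96 * pr$se.fit)

plot(nd$age, fit, type = "n", ylim = c(0, 1), las = 1,
     xlab = "Forest age (years)", ylab = "Probability of occurrence")
polygon(c(nd$age, rev(nd$age)), c(lwr, rev(upr)),
        col = adjustcolor("grey50", alpha.f = 0.4), border = NA)
lines(nd$age, fit, lwd = 2)
```

The polygon is drawn before the line so that the fitted curve stays visible on top of the band.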

How do I plot predicted probabilities for a Logit regression with fixed effects in R?

I am a complete newbie to R.
I have the following logit equation I am estimating:
allAM <- glm (AM ~ VS + Prom + LS_Exp + Sex + Age + Age2 + Jpart + X2004LS + X2009LS + X2014LS + factor(State), family = binomial(link = "logit"), data = mydata)
AM is a standard binary (happened/didn’t happen). The three “X****LS” variables are dummies indicating different sessions of congress and “factor(State)” is used to generate fixed effects/dummies for each state.
VS is the key independent variable of interest and I want to generate the predicted probability that AM=1 for each value of VS between 0 and 60, holding everything else at its mean.
I am running into trouble, however, generating and plotting the predicted probabilities because “State” is a factor. I want to be able to show the average effects, not 50 different charts/effects for each state.
Per (Hanmer and Kalkan 2013) http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2012.00602.x/abstract I was advised to do the following to plot the predicted probabilities:
pred.seq <- seq(from=0, to=60, by=0.01)
pred.out <- c()
for(i in 1:length(pred.seq)){
mydata.c <- mydata
mydata.c$VS <- pred.seq[i]
pred.out[i] <- mean(predict(allAM, newdata=mydata.c, type="response"))
}
plot(pred.out ~ pred.seq, type="l")
This approach seems to work, though I don’t really understand it.
I want to add the upper and lower 95% confidence intervals to the plot, but when I attempt to do it by hand the way I know how:
lower <- pred.out$fit - (1.96*pred.out$se.fit)
upper <- pred.out$fit + (1.96*pred.out$se.fit)
I get the following error:
Error in pred.out$fit : $ operator is invalid for atomic vectors
Can anyone advise how I can plot the confidence intervals and how I can specify different levels of VS so that I can report some specific predicted probabilities?
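The error arises because pred.out is a plain numeric vector of averaged probabilities, so it has no $fit or $se.fit components. One possible way to get intervals for averaged predictions, sketched here under assumptions and not taken from the cited paper, is to simulate coefficient vectors from a normal approximation to their sampling distribution and recompute the average prediction for each draw. The data below are simulated, only a few of the question's variables are mimicked, and names such as beta_draws and avg_pred are made up.

```r
set.seed(2013)
n <- 400
dat <- data.frame(
  VS    = runif(n, 0, 60),
  Age   = rnorm(n, 50, 10),
  State = factor(sample(LETTERS[1:5], n, replace = TRUE))
)
dat$AM <- rbinom(n, 1, plogis(-2 + 0.04 * dat$VS + 0.01 * (dat$Age - 50)))

fit <- glm(AM ~ VS + Age + State, family = binomial, data = dat)

# Draw coefficient vectors from N(coef, vcov): Z %*% chol(V) has covariance V
n_sim <- 500
p <- length(coef(fit))
beta_draws <- sweep(matrix(rnorm(n_sim * p), n_sim) %*% chol(vcov(fit)),
                    2, coef(fit), "+")

# For each VS value: set VS for every row, average the predicted
# probabilities, and repeat for each simulated coefficient vector
vs_grid <- seq(0, 60, by = 5)
avg_pred <- t(sapply(vs_grid, function(v) {
  d <- dat
  d$VS <- v
  X <- model.matrix(formula(fit), data = d)
  p_sim <- colMeans(plogis(X %*% t(beta_draws)))
  c(estimate = mean(plogis(X %*% coef(fit))),
    quantile(p_sim, c(0.025, 0.975)))
}))

plot(vs_grid, avg_pred[, "estimate"], type = "l", ylim = c(0, 1),
     xlab = "VS", ylab = "Average predicted probability")
lines(vs_grid, avg_pred[, "2.5%"], lty = 2)
lines(vs_grid, avg_pred[, "97.5%"], lty = 2)
```

To report specific predicted probabilities, read off the rows of avg_pred for the VS values of interest, or put only those values in vs_grid.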

Plotting a multiple logistic regression for binary and continuous values in R

I have a data frame of mammal genera. Each row is a different genus. There are three columns: a column of each genus's geographic range size (a continuous variable), a column stating whether a genus is found inside or outside of river basins (a binary variable), and a column stating whether the genus is found in the fossil record (a binary variable).
I have performed a multiple logistic regression to see if geographic range size and presence in/out of basins is a predictor of presence in the fossil record using the following R code.
Regression<-glm(df[ ,"FossilRecord"] ~ log(df[ ,"Geographic Range"]) + df[ ,"Basin"], family="binomial")
I am trying to find a way to visually summarize the output of this regression (other than a table of the regression summary).
I know how to do this for a single-variable regression. For example, I could use a plot like this one (image not shown) if I wanted to see the relationship between just geographic range size and presence in the fossil record.
However, I do not know how to make a similar or equivalent plot when there are two independent variables, and one of them is binary. What are some plotting and data visualization techniques I could use in this case?
Thanks for the help!
Visualization is important and yet it can be very hard. With your example, I would recommend plotting one line for predicted FossilRecord versus GeographicRange for each level of your categorical covariate (Basin). Here's an example of how to do it with the ggplot2 package
##generating data
ssize <- 100
set.seed(12345)
dat <- data.frame(
Basin = rbinom(ssize, 1,.4),
GeographicRange = rnorm(ssize,10,2)
)
dat$FossilRecord = rbinom(ssize,1,(.3 + .1*dat$Basin + 0.04*dat$GeographicRange))
##fitting model
fit <- glm(FossilRecord ~ Basin + GeographicRange, family=binomial(), data=dat)
We can use the predict() function to obtain predicted response values for many GeographicRange values and for each Basin category.
##getting predicted response from model
plotting_dfm <- expand.grid(GeographicRange = seq(from=0, to = 20, by=0.1),
Basin = (0:1))
plotting_dfm$preds <- plogis( predict(fit , newdata=plotting_dfm))
Now you can plot the predicted results:
##plotting the predicted response on the two covariates
library(ggplot2)
pl <- ggplot(plotting_dfm, aes(x=GeographicRange, y =preds, color=as.factor(Basin)))
pl +
geom_point( ) +
ggtitle("Predicted FossilRecord by GeoRange and Basin") +
ggplot2::ylab("Predicted FossilRecord")
This will produce a figure like this:
You can plot a separate curve for each value of the categorical variable. You didn't provide sample data, so here's an example with another data set:
library(ggplot2)
# Data
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
# Model. gre is continuous. rank has four categories.
m1 = glm(admit ~ gre + rank, family=binomial, data=mydata)
# Predict admit probability
newdata = expand.grid(gre=seq(200,800, length.out=100), rank=1:4)
newdata$prob = predict(m1, newdata, type="response")
ggplot(newdata, aes(gre, prob, color=factor(rank), group=rank)) +
geom_line()
UPDATE: To respond to @Provisional.Modulation's comment: There are lots of options, depending on what you want to highlight and what is visually clear enough to understand, given your particular data and model output.
Here's an example using the built-in mtcars data frame and a logistic regression with one categorical and two continuous predictor variables:
m1 = glm(vs ~ cyl + mpg + hp, data=mtcars, family=binomial)
Now we create a new data frame with the unique values of cyl, five quantiles of hp and a continuous sequence of mpg, which we'll put on the x-axis (you could also of course do quantiles of mpg and use hp as the x-axis variable). If you have many continuous variables, you may need to set some of them to a single value, say, the median, when you graph the relationships between other variables.
newdata = with(mtcars, expand.grid(cyl=unique(cyl),
mpg=seq(min(mpg),max(mpg),length=20),
hp = quantile(hp)))
newdata$prob = predict(m1, newdata, type="response")
Here are three potential graphs, with varying degrees of legibility.
ggplot(newdata, aes(mpg, prob, colour=factor(cyl))) +
geom_line() +
facet_grid(. ~ hp)
ggplot(newdata, aes(mpg, prob, colour=factor(hp), linetype=factor(cyl))) +
geom_line()
ggplot(newdata, aes(mpg, prob, colour=factor(hp))) +
geom_line() +
facet_grid(. ~ cyl)
And here's another approach using geom_tile to include two continuous dimensions in each plot panel.
newdata = with(mtcars, expand.grid(cyl=unique(cyl),
mpg=seq(min(mpg),max(mpg),length=100),
hp =seq(min(hp),max(hp),length=100)))
newdata$prob = predict(m1, newdata, type="response")
ggplot(newdata, aes(mpg, hp, fill=prob)) +
geom_tile() +
facet_grid(. ~ cyl) +
scale_fill_gradient2(low="red",mid="yellow",high="blue",midpoint=0.5,
limits=c(0,1))
If you're looking for a canned solution, the visreg package might work for you.
An example using @eipi10's data
library(visreg)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
m1 = glm(admit ~ gre + rank, family=binomial, data=mydata)
# The x variable must be a predictor ("gre"), not the response ("admit")
visreg(m1, "gre", by = "rank")
Many more options described in documentation.
