How do I add confidence intervals to a glm model in ggplot?

Here is an example of what my data looks like:
DATA <- data.frame(
TotalAbund = sample(1:10),
TotalHab = sample(0:1),
TotalInv = sample(c("yes", "no"), 20, replace = TRUE)
)
DATA$TotalHab<-as.factor(DATA$TotalHab)
DATA
Here is my model:
MOD.1<-glm(TotalAbund~TotalInv+TotalHab, family=quasipoisson, data=DATA)
Here is my plot:
NEWDATA <- with(DATA,
expand.grid(TotalInv=unique(TotalInv),
TotalHab=unique(TotalHab)))
pred <- predict(MOD.1,newdata= NEWDATA,se.fit=TRUE)
gg1 <- ggplot(NEWDATA, aes(x=factor(TotalHab), y=TotalAbund,colour=TotalInv))
I get the following error...
Error in eval(expr, envir, enclos) : object 'TotalAbund' not found
...when trying to run the last line of code:
gg1 + geom_point(data=pframe,size=8,shape=17,alpha=0.7,
position=position_dodge(width=0.75))
Can anyone help? Also how do I add 95% confidence intervals to my points? Thanks.

You will need to calculate the 95% confidence intervals yourself. You were on the right track using predict and asking for se.fit. We first ask for the predictions on the link scale, calculate the 95% confidence intervals there, and then back-transform them to the response scale for plotting. Here is a convenience function that calculates the CIs for the log link (the default link of the quasipoisson family you used in the model).
# get your prediction
pred <- predict(MOD.1,newdata= NEWDATA,se.fit=TRUE,
type = "link")
# CI function
make_ci <- function(pred, data){
# fit, lower, and upper CI
fit <- pred$fit
lower <- fit - 1.96*pred$se.fit
upper <- fit + 1.96*pred$se.fit
return(data.frame(exp(fit), exp(lower), exp(upper), data))
}
my_pred <- make_ci(pred, NEWDATA)
# to be used in geom_errorbar
limits <- aes(x = factor(TotalHab), ymax = my_pred$exp.upper., ymin = my_pred$exp.lower.,
group = TotalInv)
Then we plot it. I will leave the final tweaking of the figure to you.
ggplot(my_pred, aes(x = factor(TotalHab), y = exp.fit., color = TotalInv))+
geom_errorbar(limits, position = position_dodge(width = 0.75),
color = "black")+
geom_point(size = 8, position = position_dodge(width = 0.75), shape = 16)+
ylim(c(0,15))+
geom_point(data = DATA, aes(x = factor(TotalHab), y = TotalAbund, colour = TotalInv),
size = 8, shape = 17, alpha = 0.7,
position = position_dodge(width = 0.75))
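The same back-transformation can be written without hard-coding exp() by taking the inverse link function from the fitted model. This is only a minimal sketch on simulated data: the names dat, abund, inv, hab, grid and ci are made up for illustration and are not from the question.

```r
# Simulated data (an assumption, only for illustration)
set.seed(1)
dat <- data.frame(
  abund = rpois(40, lambda = 5),
  inv   = factor(rep(c("yes", "no"), each = 20)),
  hab   = factor(rep(c(0, 1), times = 20))
)
mod <- glm(abund ~ inv + hab, family = quasipoisson, data = dat)

# One row per combination of factor levels
grid <- expand.grid(inv = levels(dat$inv), hab = levels(dat$hab))

# Predict on the link scale, build the interval there, then back-transform
pr      <- predict(mod, newdata = grid, se.fit = TRUE, type = "link")
linkinv <- family(mod)$linkinv   # exp() for the log link
crit    <- qnorm(0.975)          # normal approximation, about 1.96

ci <- data.frame(
  grid,
  fit   = linkinv(pr$fit),
  lower = linkinv(pr$fit - crit * pr$se.fit),
  upper = linkinv(pr$fit + crit * pr$se.fit)
)
ci
```

Because the columns have plain names (fit, lower, upper), they can be mapped directly inside aes(), for example geom_errorbar(aes(ymin = lower, ymax = upper)).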

Related

pROC package: ci.se. How are the CI calculated?

I have two questions:
I am using the pROC package to calculate the CI of the ROC curve for a logistic regression model and a random forest model. What I cannot understand is which algorithm is used for this computation. Is it the vertical averaging algorithm? Tom Fawcett's paper mentions, "Confidence intervals of the mean of tp rate are computed using the common assumption of a binomial distribution." Does he mean the normal approximation? Moreover, is the curve that I am plotting the average curve?
forest <- randomForest(factor(extreme, levels = c("Yes", "No"))~ tas + X0+X1+X2+X3+X4+X5+X8,
train_df, ntree = 500, na.omit = TRUE)
Random_Forest <- predict(forest, test_df, type = "prob")[,2]
roc <- roc(test_df$extry, Random_Forest , plot=TRUE, legacy.axes=TRUE)
Logistic_Regression <- predict(model,test_df, type='response')
roc <- roc(test_df$extry, Logistic_Regression, plot=TRUE,legacy.axes=TRUE)
roc.list <- roc(test_df$extry ~ Logistic_Regression+Random_Forest,legacy.axes=TRUE)
ci.list <- lapply(roc.list, ci.se, specificities = seq(0, 1, .1), boot.n=2000, stratified=TRUE, conf.level=0.95,parallel = TRUE)
dat.ci.list <- lapply(ci.list, function(ciobj)
data.frame(x = as.numeric(rownames(ciobj)),
lower = ciobj[, 1],
upper = ciobj[, 3]))
p <- ggroc(roc.list,legacy.axes=TRUE,aes = c("linetype")) +
labs(x = "False Positive Rate", y = "True Positive Rate", linetype="Model")+
scale_linetype_discrete(labels=c("Logistic Regression","Random Forest"))+
theme_classic() +
geom_abline(slope=1, intercept = 0, linetype = "dashed", alpha=0.7, color = "grey") +
coord_equal()
for(i in 1:2) {
p <- p + geom_ribbon(
data = dat.ci.list[[i]],
aes(x = 1-x, ymin = lower, ymax = upper),
fill = i + 1,
alpha = 0.2,
inherit.aes = FALSE)
}
p
Can I use the pROC package to calculate the CI on the test datasets obtained from cross-validation? For example, if I use 10-fold cross-validation for the logistic regression model, I will have 10 ROC curves. The line roc.list <- roc(test_df$extry ~ Logistic_Regression+Random_Forest,legacy.axes=TRUE) will not work, since the data are not the same in the 10 different test datasets. Any ideas?

Boxplot not showing range

I have predicted values, via:
glm0 <- glm(use ~ as.factor(decision), data = decision_use, family = binomial(link = "logit"))
predicted_glm <- predict(glm0, newdata = decision_use, type = "response", interval = "confidence", se = TRUE)
predict <- predicted_glm$fit
predict <- predict + 1
head(predict)
1 2 3 4 5 6
0.3715847 0.3095335 0.3095335 0.3095335 0.3095335 0.5000000
Now when I plot a box plot using ggplot2,
ggplot(decision_use, aes(x = decision, y = predict)) +
geom_boxplot(aes(fill = factor(decision)), alpha = .2)
I get a box plot with one horizontal line per category. If you look at the predict data, it is the same for every observation within a category, so that makes sense.
But I want a box plot that shows the range. How can I get that? When I use "use" instead of predict, I get boxes stretching from end to end (1 to 0), so I suppose that's not it. Thank you in advance.
To clarify, predicted_glm includes se.fit values. I wonder how to incorporate those.
It doesn't really make sense to do a boxplot here. A boxplot shows the range and spread of a continuous variable within groups. Your dependent variable is binary, so the values are all 0 or 1. Since you are plotting predictions for each group, your plot would have just a single point representing the expected value (i.e. the probability) for each group.
The closest you can come is probably to plot the prediction with 95% confidence bars around it.
You haven't provided any sample data, so I'll make some up here:
set.seed(100)
df <- data.frame(outcome = rbinom(200, 1, c(0.1, 0.9)), var1 = rep(c("A", "B"), 100))
Now we'll create our model and get the prediction for each level of my predictor variable using the newdata parameter of predict. I'm going to specify type = "link" because I want the log odds, and I'm also going to specify se.fit = TRUE so I can get the standard error of these predictions:
mod <- glm(outcome ~ var1, data = df, family = binomial)
prediction <- predict(mod, list(var1 = c("A", "B")), se.fit = TRUE, type = "link")
Now I can work out the 95% confidence intervals for my predictions:
prediction$lower <- prediction$fit - prediction$se.fit * 1.96
prediction$upper <- prediction$fit + prediction$se.fit * 1.96
Finally, I transform the fit and confidence intervals from log odds into probabilities:
prediction <- lapply(prediction, function(logodds) exp(logodds)/(1 + exp(logodds)))
plotdf <- data.frame(Group = c("A", "B"), fit = prediction$fit,
upper = prediction$upper, lower = prediction$lower)
plotdf
#> Group fit upper lower
#> 1 A 0.13 0.2111260 0.07700412
#> 2 B 0.92 0.9594884 0.84811360
Now I am ready to plot. I will use geom_point for the probability estimates and geom_errorbar for the confidence intervals:
library(ggplot2)
ggplot(plotdf, aes(x = Group, y = fit, colour = Group)) +
geom_errorbar(aes(ymin = lower, ymax = upper), size = 2, width = 0.5) +
geom_point(size = 3, colour = "black") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Probability estimate with 95% CI", y = "Probability")
Created on 2020-05-11 by the reprex package (v0.3.0)
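As a variation on the steps above, the hand-written inverse logit can be replaced by base R's plogis(), and only the columns needed for the plot have to be transformed. This is a hedged sketch that reuses the made-up data from this answer; the names new, pr, crit and plotdf2 are introduced here for illustration.

```r
set.seed(100)
df  <- data.frame(outcome = rbinom(200, 1, c(0.1, 0.9)),
                  var1    = rep(c("A", "B"), 100))
mod <- glm(outcome ~ var1, data = df, family = binomial)

# Predictions and standard errors on the log-odds (link) scale
new  <- data.frame(var1 = c("A", "B"))
pr   <- predict(mod, newdata = new, se.fit = TRUE, type = "link")
crit <- qnorm(0.975)   # about 1.96

# plogis(q) is the inverse logit, exp(q) / (1 + exp(q))
plotdf2 <- data.frame(
  Group = new$var1,
  fit   = plogis(pr$fit),
  lower = plogis(pr$fit - crit * pr$se.fit),
  upper = plogis(pr$fit + crit * pr$se.fit)
)
plotdf2
```

Building the interval on the link scale and transforming afterwards keeps both bounds strictly between 0 and 1, which an interval computed directly on the probability scale does not guarantee.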

Confidence interval over a normal distribution plot

I want to plot vertical lines on the x position for the confidence interval. I did the statistics, but I cannot find a way to add it to the plot. Please follow this MWE:
xseq<-seq(-4,4,.01)
densities<-dnorm(xseq, 0,1)
par(mfrow=c(1,3), mar=c(3,4,4,2))
plot(xseq, densities, col="darkgreen",xlab="", ylab="Densidade", type="l",lwd=2, cex=2, main="Normal", cex.axis=.8)
This generates a plot of the standard normal density.
The CI is:
x<-t.test(xseq, conf.level = 0.95)$conf.int
But when I try to plot the line with:
line(x[1], x[2])
It gives me the error:
Error in structure(.Call(C_tukeyline, as.double(xy$x[ok]), as.double(xy$y[ok]), :
insufficient observations
After comments pointed out abline(), the lines are drawn.
However, I was wrong to think that t.test would give the CI for a normal distribution.
What am I doing wrong?
Using ggplot2:
ggplot(data = df, aes(x = xseq, y = densities)) +
geom_point() +
geom_vline(xintercept = c(x[1], x[2]))
With proper confidence intervals:
ggplot(data = df, aes(x = xseq, y = densities)) +
geom_point() +
geom_vline(xintercept = c(x2[1], x2[2]))
Sample data:
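xseq <- seq(-4, 4, .01)
df <- data.frame(xseq = xseq,
densities = dnorm(xseq, 0, 1))
# t.test gives a confidence interval for the mean of the grid values
x <- t.test(xseq, conf.level = 0.95)$conf.int
# central 95% interval of the plotted N(0, 1) density
x2 <- qnorm(c(0.025, 0.975), mean = 0, sd = 1)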
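To make the difference between the two intervals concrete, here is a small sketch that computes both and compares their widths. It assumes the curve being plotted is the standard normal, N(0, 1), as in the question; the names ci_mean and central_95 are made up for illustration.

```r
xseq <- seq(-4, 4, 0.01)

# 95% confidence interval for the MEAN of the grid values (what t.test returns).
# It is narrow because it shrinks with the number of grid points.
ci_mean <- t.test(xseq, conf.level = 0.95)$conf.int

# Central 95% interval of the standard normal distribution itself:
# 2.5% of the probability mass lies below the lower bound, 2.5% above the upper.
central_95 <- qnorm(c(0.025, 0.975), mean = 0, sd = 1)

ci_mean
central_95

# On an existing base-graphics plot the bounds could be drawn with
# abline(v = central_95, lty = 2)
```

Note that a central 95% interval uses the 0.025 and 0.975 quantiles; the 0.05 and 0.95 quantiles would give a 90% interval.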

How is `level` used to generate the confidence interval in geom_smooth?

I'm having trouble emulating how stat_smooth calculates its confidence interval.
Let's generate some data and a simple model:
library(tidyverse)
# sample data
df = tibble(
x = runif(10),
y = x + rnorm(10)*0.2
)
# simple linear model
model = lm(y ~ x, df)
Now use predict() to generate values and confidence intervals
# predict
df$predicted = predict(
object = model,
newdata = df
)
# predict 95% confidence interval
df$CI = predict(
object = model,
newdata = df,
se.fit = TRUE
)$se.fit * qnorm(1 - (1-0.95)/2)
Notice that qnorm is used to expand from standard error to 95% CI
Plot the data (black dots), geom_smooth (black line + gray ribbon), and the predicted ribbon (red and blue lines).
ggplot(df) +
aes(x = x, y = y) +
geom_point(size = 2) +
geom_smooth(method = "lm", level = 0.95, fullrange = TRUE, color = "black") +
geom_line(aes(y = predicted + CI), color = "blue") + # upper
geom_line(aes(y = predicted - CI), color = "red") + # lower
theme_classic()
The red and blue lines should be the same as the ribbon's edges. What am I doing wrong?
As pointed out in a comment by @Dason, the answer is that geom_smooth uses a t-distribution, not a normal distribution.
In my original question, replace qnorm(1 - (1-0.95)/2) with qt(1 - (1-0.95)/2, df.residual(model)) for the lines to match up. The degrees of freedom are the residual degrees of freedom of the model, here nrow(df) - 2.
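One way to check the t-based interval without plotting is to compare it with the interval that predict() itself returns for an lm fit. This is a minimal sketch with freshly simulated data; the names level, half_width, manual and builtin are made up for illustration.

```r
set.seed(42)
df <- data.frame(x = runif(10))
df$y <- df$x + rnorm(10) * 0.2
model <- lm(y ~ x, data = df)

level <- 0.95
pred  <- predict(model, newdata = df, se.fit = TRUE)

# t quantile with the residual degrees of freedom (n - 2 here)
half_width <- pred$se.fit * qt(1 - (1 - level) / 2, df = pred$df)
manual <- cbind(lwr = pred$fit - half_width, upr = pred$fit + half_width)

# Reference: the confidence interval computed by predict.lm
builtin <- predict(model, newdata = df, interval = "confidence", level = level)

# The two sets of bounds should agree up to floating-point error
all.equal(unname(manual), unname(builtin[, c("lwr", "upr")]))
```

With only 10 observations the t quantile is noticeably larger than the normal one, which is why the qnorm-based lines sit inside the ribbon.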

Customize how the smooth confidence interval is computed

I usually plot the loess estimate of a set of points, along with its confidence interval, using the geom_smooth function.
Now I need to change the method by which the confidence bounds are computed (i.e. I need to change the shape of the blur band). Is there a way to do that in geom_smooth?
Or, how can I emulate it with ggplot2? How can I draw such a blur band?
If you need to plot something that isn't one of the options in geom_smooth, your best bet is to fit the model yourself.
You haven't said what method you need.
But here is an example of fitting the loess with family symmetric and computing the standard errors of that.
library(ggplot2)
d <- data.frame(x = rnorm(100), y = rnorm(100))
# The original plot using the default loess method
p <- ggplot(d, aes(x, y)) + geom_smooth(method = 'loess', se = TRUE)
# Fit loess model with family = 'symmetric'
# Replace the next 2 lines with whatever different method you need
loess_smooth <- loess(y ~ x, data = d, family = 'symmetric')
# Predict the model over the range of data you are interested in.
# newdata must be a data frame with a column named like the predictor.
loess_pred <- predict(loess_smooth,
newdata = data.frame(x = seq(min(d$x), max(d$x), length.out = 1000)),
se = TRUE)
loess.df <- data.frame(fit = loess_pred$fit,
x = seq(min(d$x), max(d$x), length.out = 1000),
upper = loess_pred$fit + loess_pred$se.fit,
lower = loess_pred$fit - loess_pred$se.fit)
# plot to compare
p +
geom_ribbon(data = loess.df, aes(x = x, y = fit, ymax = upper, ymin = lower), alpha = 0.6) +
geom_line(data = loess.df, aes(x = x, y = fit))
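The band in the code above is the fit plus or minus one standard error. If a 95% pointwise band is wanted instead, one option is to scale the standard error by a t quantile, using the degrees of freedom that predict() reports for the loess fit. The sketch below is an illustration under that assumption, not a statement of what geom_smooth does internally; the names fit, grid, pr, crit and band are made up.

```r
set.seed(1)
d <- data.frame(x = rnorm(100), y = rnorm(100))

# Robust loess fit; swap in whatever model you actually need
fit  <- loess(y ~ x, data = d, family = "symmetric")
grid <- data.frame(x = seq(min(d$x), max(d$x), length.out = 200))
pr   <- predict(fit, newdata = grid, se = TRUE)

# 95% pointwise band: t quantile with the degrees of freedom from predict()
crit <- qt(0.975, df = pr$df)
band <- data.frame(
  x     = grid$x,
  fit   = pr$fit,
  lower = pr$fit - crit * pr$se.fit,
  upper = pr$fit + crit * pr$se.fit
)
head(band)
```

The resulting data frame can be passed to geom_ribbon and geom_line in the same way as loess.df, with lower and upper mapped to ymin and ymax.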
