I'm having trouble emulating how stat_smooth calculates its confidence interval.
Let's generate some data and a simple model:
library(tidyverse)
# sample data
df = tibble(
  x = runif(10),
  y = x + rnorm(10) * 0.2
)
# simple linear model
model = lm(y ~ x, df)
Now use predict() to generate fitted values and confidence intervals:
# predict
df$predicted = predict(
  object = model,
  newdata = df
)
# predict 95% confidence interval
df$CI = predict(
  object = model,
  newdata = df,
  se.fit = TRUE
)$se.fit * qnorm(1 - (1 - 0.95) / 2)
Notice that qnorm() is used to expand the standard error to a 95% confidence half-width.
Plot the data (black dots), geom_smooth (black line + gray ribbon), and the predicted interval limits (red and blue lines).
ggplot(df) +
  aes(x = x, y = y) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", level = 0.95, fullrange = TRUE, color = "black") +
  geom_line(aes(y = predicted + CI), color = "blue") + # upper
  geom_line(aes(y = predicted - CI), color = "red") +  # lower
  theme_classic()
The red and blue lines should be the same as the ribbon's edges. What am I doing wrong?
As posted in a comment by @Dason, the answer is that geom_smooth uses a t-distribution, not a normal distribution.
In my original question, replace qnorm(1 - (1-0.95)/2) with qt(1 - (1-0.95)/2, df.residual(model)), i.e. use the t quantile with the model's residual degrees of freedom, for the lines to match up.
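For reference, a minimal version of the corrected computation, reusing the df and model defined above:
# 95% confidence half-width using the t-distribution
# and the residual degrees of freedom of the fitted model
df$CI = predict(
  object = model,
  newdata = df,
  se.fit = TRUE
)$se.fit * qt(1 - (1 - 0.95) / 2, df.residual(model))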
I have two questions:
I am using the pROC package to calculate the CI of the ROC curve for a logistic regression model and a random forest model. What I cannot understand is which algorithm is used for this computation. Is it the vertical averaging algorithm? Tom Fawcett's paper mentions that "Confidence intervals of the mean of tp rate are computed using the common assumption of a binomial distribution." Does he mean a normal approximation? Moreover, is the curve that I am plotting the average curve?
library(randomForest)
library(pROC)

forest <- randomForest(factor(extreme, levels = c("Yes", "No")) ~ tas + X0 + X1 + X2 + X3 + X4 + X5 + X8,
                       data = train_df, ntree = 500, na.action = na.omit)
Random_Forest <- predict(forest, test_df, type = "prob")[, 2]
roc <- roc(test_df$extry, Random_Forest, plot = TRUE, legacy.axes = TRUE)

Logistic_Regression <- predict(model, test_df, type = 'response')
roc <- roc(test_df$extry, Logistic_Regression, plot = TRUE, legacy.axes = TRUE)

roc.list <- roc(test_df$extry ~ Logistic_Regression + Random_Forest, legacy.axes = TRUE)
ci.list <- lapply(roc.list, ci.se, specificities = seq(0, 1, .1),
                  boot.n = 2000, stratified = TRUE, conf.level = 0.95, parallel = TRUE)
dat.ci.list <- lapply(ci.list, function(ciobj)
  data.frame(x = as.numeric(rownames(ciobj)),
             lower = ciobj[, 1],
             upper = ciobj[, 3]))
p <- ggroc(roc.list, legacy.axes = TRUE, aes = c("linetype")) +
  labs(x = "False Positive Rate", y = "True Positive Rate", linetype = "Model") +
  scale_linetype_discrete(labels = c("Logistic Regression", "Random Forest")) +
  theme_classic() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", alpha = 0.7, color = "grey") +
  coord_equal()

for(i in 1:2) {
  p <- p + geom_ribbon(
    data = dat.ci.list[[i]],
    aes(x = 1 - x, ymin = lower, ymax = upper),
    fill = i + 1,
    alpha = 0.2,
    inherit.aes = FALSE)
}
p
Can I use the pROC package to calculate CIs on the test datasets obtained from cross-validation? For example, if I use 10-fold cross-validation for the logistic regression model, I will have 10 ROC curves. The line roc.list <- roc(test_df$extry ~ Logistic_Regression + Random_Forest, legacy.axes = TRUE) will not work, since the data are not the same across the 10 different test folds. Any idea?
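One possible approach for the cross-validation part (a rough sketch only, not from the original thread; it assumes a single data frame df holding the outcome extreme, the predictors from the formula above, and a pre-assigned fold column) is to fit the model and build a ROC curve, with its bootstrap CI, separately for each fold:
library(pROC)

fold_rocs <- lapply(1:10, function(k) {
  train <- df[df$fold != k, ]
  test  <- df[df$fold == k, ]
  fit   <- glm(factor(extreme) ~ tas + X0 + X1 + X2 + X3 + X4 + X5 + X8,
               data = train, family = binomial)
  probs <- predict(fit, newdata = test, type = "response")
  roc(test$extreme, probs, legacy.axes = TRUE)
})

# bootstrap CI of sensitivity at fixed specificities, one per fold
fold_cis <- lapply(fold_rocs, ci.se, specificities = seq(0, 1, 0.1))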
I am trying to plot 95% confidence intervals on some simulated values, but I am running into some issues when I try to plot the CIs using geom_ribbon(). The trouble I'm having is that my plot does not show the CIs, like so:
I have included all of my code below, in case anyone can see where I have gone wrong:
library(ggplot2)

set.seed(20220520)
# simulating 200 values between 0 and 1 from a uniform distribution
x = runif(200, min = 0, max = 1)
lam = exp(0.3 + 5*x)
y = rpois(200, lambda = lam)
# each Yi may contain zeros, so we add a small constant
y <- y + 0.1
# combining x and y into a data frame so we can plot
df = data.frame(x, y)
# fitting a Poisson GLM
model2 <- glm(y ~ x,
              data = df,
              family = poisson(link = 'log'))
# make predictions (this may be the same as predictions_mod2)
preds <- predict(model2, type = "response")
# making CI predictions
predictions_mod2 = predict(model2, df, se.fit = TRUE, type = 'response')
# calculate confidence interval limits
upper_mod2 = predictions_mod2$fit + 1.96*predictions_mod2$se.fit
lower_mod2 = predictions_mod2$fit - 1.96*predictions_mod2$se.fit
# transform the CI limits to get them on the scale of the mean
upper_mod2 = exp(upper_mod2)/(1 + exp(upper_mod2))
lower_mod2 = exp(lower_mod2)/(1 + exp(lower_mod2))
# combining into a df
predframe = data.frame(lwr = lower_mod2, upr = upper_mod2, x = df$x, y = df$y)
# plot model with 95% confidence intervals using ggplot
ggplot(df, aes(x, y)) +
  geom_ribbon(data = predframe, aes(ymin = lwr, ymax = upr), alpha = 0.4) +
  geom_point() +
  geom_line(aes(x, preds), col = 'blue')
In a comment to the question, it was asked why not to logit-transform the predicted values. The reason is that the type of prediction asked for is "response". From the documentation (my emphasis):
type
the type of prediction required. The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and type = "response" gives the predicted probabilities. The "terms" option returns a matrix giving the fitted values of each term in the model formula on the linear predictor scale.
A good way to answer is to show the code:
library(ggplot2, quietly = TRUE)

set.seed(20220520)
# simulating 200 values between 0 and 1 from a uniform distribution
x = runif(200, min = 0, max = 1)
lam = exp(0.3 + 5*x)
y = rpois(200, lambda = lam)
# each Yi may contain zeros, so we add a small constant
y <- y + 0.1
# combining x and y into a data frame so we can plot
df = data.frame(x, y)
# fitting a Poisson GLM
suppressWarnings(
  model2 <- glm(y ~ x,
                data = df,
                family = poisson(link = 'log'))
)
# make predictions (this may be the same as predictions_mod2)
preds <- predict(model2, type = "response")
# making CI predictions
predictions_mod2 = predict(model2, df, se.fit = TRUE, type = 'response')
# calculate confidence interval limits
upper_mod2 = predictions_mod2$fit + 1.96*predictions_mod2$se.fit
lower_mod2 = predictions_mod2$fit - 1.96*predictions_mod2$se.fit
# combining into a df
predframe = data.frame(lwr = lower_mod2, upr = upper_mod2, x = df$x, y = df$y)
# plot model with 95% confidence intervals using ggplot
ggplot(df, aes(x, y)) +
  geom_ribbon(data = predframe, aes(ymin = lwr, ymax = upr), alpha = 0.4) +
  geom_point() +
  geom_line(aes(x, preds), col = 'blue')
Created on 2022-05-29 by the reprex package (v2.0.1)
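If you prefer a confidence band that respects the log link (so the lower bound cannot go negative), an alternative sketch, not part of the original answer, is to predict on the link scale, form the interval there, and exponentiate back to the response scale:
# alternative: compute the interval on the link (log) scale, then back-transform
pred_link <- predict(model2, df, se.fit = TRUE, type = "link")
predframe_link <- data.frame(
  x = df$x,
  y = df$y,
  lwr = exp(pred_link$fit - 1.96 * pred_link$se.fit),
  upr = exp(pred_link$fit + 1.96 * pred_link$se.fit)
)
ggplot(df, aes(x, y)) +
  geom_ribbon(data = predframe_link, aes(ymin = lwr, ymax = upr), alpha = 0.4) +
  geom_point() +
  geom_line(aes(x, preds), col = 'blue')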
I need to create an insightful graphic with a regression line, data points, and confidence intervals. I am not looking for smoothed lines. I have tried several approaches, but I just can't get it right.
I am looking for something like this:
Some code I have tried:
library(car)  # for scatterplot()

p <- scatterplot(df.regsoft$w ~ df.regsoft$b,
                 data = df.regsoft,
                 boxplots = FALSE,
                 regLine = list(method = lm, col = "red"),
                 pch = 16,
                 cex = 0.7,
                 xlab = "Fitted Values",
                 ylab = "Residuals",
                 legend = TRUE,
                 smooth = FALSE)
abline(coef = confint.lm(result.rs))
But this doesn't create what I want, although it is the closest to what I intended. Notice that I turned off "smooth", since a smoothed line is not really what I am looking for.
How can I make this plot interactive?
If you don't mind switching to ggplot2 and the tidyverse, then this is simply a geom_smooth(method = "lm"):
library(tidyverse)
d <- tibble( # random data
  x = rnorm(100, 0, 1),
  y = 0.25 * x + rnorm(100, 0, 0.25)
)
m <- lm(y ~ x, data = d) # linear model
d %>%
  ggplot() +
  aes(x, y) + # what to plot
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw()
Without method = "lm" it draws a smoothed (loess) curve instead.
As for the Conf. interval (Obs 95%) lines, it seems to me that's simply a quantile regression. In that case, you can use the quantreg package.
If you want to make it interactive, you can use the plotly package:
library(plotly)
p <- d %>%
  ggplot() +
  aes(x, y) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw()
ggplotly(p)
================================================
P.S.
I am not completely sure this is what the figure you posted is showing (I guess so), but to add the quantile lines, I would just perform two quantile regressions (upper and lower) and then calculate the values of the quantile lines for your data:
library(tidyverse)
library(quantreg)
d <- tibble( # random data
  x = rnorm(100, 0, 1),
  y = 0.25 * x + rnorm(100, 0, 0.25)
)
m <- lm(y ~ x, data = d) # linear model
# 95% quantiles, two tailed
rq_low <- rq(y ~ x, data = d, tau = 0.025)  # lower quantile
rq_high <- rq(y ~ x, data = d, tau = 0.975) # upper quantile
d %>%
  mutate(low = rq_low$coefficients[1] + x * rq_low$coefficients[2],
         high = rq_high$coefficients[1] + x * rq_high$coefficients[2]) %>%
  ggplot() +
  geom_point(aes(x, y)) +
  geom_smooth(aes(x, y), method = "lm") +
  geom_line(aes(x, low), linetype = "dashed") +
  geom_line(aes(x, high), linetype = "dashed") +
  theme_bw()
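If the dashed lines in your reference figure are in fact 95% prediction intervals for new observations (rather than quantile-regression lines), a sketch of that variant, reusing the d and m defined above, would be:
# 95% prediction interval from the linear model (an alternative to quantile regression)
pred_int <- predict(m, newdata = d, interval = "prediction", level = 0.95)
d %>%
  mutate(low = pred_int[, "lwr"], high = pred_int[, "upr"]) %>%
  ggplot() +
  geom_point(aes(x, y)) +
  geom_smooth(aes(x, y), method = "lm") +
  geom_line(aes(x, low), linetype = "dashed") +
  geom_line(aes(x, high), linetype = "dashed") +
  theme_bw()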
I usually plot the loess estimate for a bunch of points, along with its confidence interval, by means of the geom_smooth function.
Now I need to change the method by which the confidence bounds are computed (i.e. I need to change the shape of the shaded band). Is there a way to do that in geom_smooth?
Or, if not, how can I emulate it with ggplot2? How can I draw such a band myself?
If you need to plot something that isn't one of the options in geom_smooth, your best bet is to fit the model manually yourself.
You haven't said what method you need, but here is an example of fitting a loess with family = "symmetric" and computing the standard errors of that.
library(ggplot2)

d <- data.frame(x = rnorm(100), y = rnorm(100))

# The original plot using the default loess method
p <- ggplot(d, aes(x, y)) + geom_smooth(method = 'loess', se = TRUE)

# Fit the loess model with family = 'symmetric'
# Replace these lines with whatever different method you need
loess_smooth <- loess(y ~ x, data = d, family = 'symmetric')

# Predict the model over the range of data you are interested in.
newx <- seq(min(d$x), max(d$x), length.out = 1000)
loess_pred <- predict(loess_smooth,
                      newdata = data.frame(x = newx),
                      se = TRUE)
loess.df <- data.frame(fit = loess_pred$fit,
                       x = newx,
                       upper = loess_pred$fit + loess_pred$se.fit,
                       lower = loess_pred$fit - loess_pred$se.fit)

# plot to compare (the ribbon here is fit +/- one standard error)
p +
  geom_ribbon(data = loess.df, aes(x = x, y = fit, ymax = upper, ymin = lower), alpha = 0.6) +
  geom_line(data = loess.df, aes(x = x, y = fit))
Here is an example of what my data looks like:
DATA <- data.frame(
  TotalAbund = sample(1:10),
  TotalHab = sample(0:1),
  TotalInv = sample(c("yes", "no"), 20, replace = TRUE)
)
DATA$TotalHab <- as.factor(DATA$TotalHab)
DATA
Here is my model:
MOD.1 <- glm(TotalAbund ~ TotalInv + TotalHab, family = quasipoisson, data = DATA)
Here is my plot:
NEWDATA <- with(DATA,
                expand.grid(TotalInv = unique(TotalInv),
                            TotalHab = unique(TotalHab)))
pred <- predict(MOD.1, newdata = NEWDATA, se.fit = TRUE)
gg1 <- ggplot(NEWDATA, aes(x = factor(TotalHab), y = TotalAbund, colour = TotalInv))
I get the following error...
Error in eval(expr, envir, enclos) : object 'TotalAbund' not found
...when trying to run the last line of code:
gg1 + geom_point(data = pframe, size = 8, shape = 17, alpha = 0.7,
                 position = position_dodge(width = 0.75))
Can anyone help? Also how do I add 95% confidence intervals to my points? Thanks.
You will need to calculate the 95% confidence intervals yourself. You were on the right track using predict and asking for the se.fit. We will first ask for the predictions on the link scale, calculate 95% confidence intervals, and then transform them back to the response scale for plotting. Here is a convenience function to calculate the CIs for the log link (which you used in the model).
# get your prediction
pred <- predict(MOD.1, newdata = NEWDATA, se.fit = TRUE,
                type = "link")

# CI function
make_ci <- function(pred, data){
  # fit, lower, and upper CI
  fit <- pred$fit
  lower <- fit - 1.96*pred$se.fit
  upper <- fit + 1.96*pred$se.fit
  return(data.frame(exp(fit), exp(lower), exp(upper), data))
}

my_pred <- make_ci(pred, NEWDATA)

# to be used in geom_errorbar
limits <- aes(x = factor(TotalHab), ymax = my_pred$exp.upper., ymin = my_pred$exp.lower.,
              group = TotalInv)
Then we plot it out; I will leave the final tweaking to you to get the figure looking the way you want.
ggplot(my_pred, aes(x = factor(TotalHab), y = exp.fit., color = TotalInv)) +
  geom_errorbar(limits, position = position_dodge(width = 0.75),
                color = "black") +
  geom_point(size = 8, position = position_dodge(width = 0.75), shape = 16) +
  ylim(c(0, 15)) +
  geom_point(data = DATA, aes(x = factor(TotalHab), y = TotalAbund, colour = TotalInv),
             size = 8, shape = 17, alpha = 0.7,
             position = position_dodge(width = 0.75))