R - Manually plot calibration plot - r

From Clinical Prediction Models by Ewout W. Steyerberg we have the following:
A calibration plot has predictions on the x axis, and the outcome on
the y axis. A line of identity helps for orientation: Perfect
predictions should be on the 45° line. For linear regression, the
calibration plot results in a simple scatter plot. For binary
outcomes, the plot contains only 0 and 1 values for the y axis.
Probabilities are not observed directly. However, smoothing techniques
can be used to estimate the observed probabilities of the outcome ( p
( y = 1)) in relation to the predicted probabilities. The observed 0/1
outcomes are replaced by values between 0 and 1 by combining outcome
values of subjects with similar predicted probabilities, e.g. using
the loess algorithm.
I'm fitting a logistic regression model with a binary outcome. Below is some example code. The calibration curve is going to look odd because the sample is so small; I'm mostly wondering whether the methodology is correct.
library(tidyverse)
tibble_ex <- tibble(
event = c(1, 0, 1, 0, 0, 1),
weight = c(100, 200, 110, 210, 220, 105)
)
model <- glm(event ~ weight, family = 'binomial', data = tibble_ex)
tibble_ex <- tibble_ex %>%
mutate(pred = predict(model, type = 'response'))
tibble_ex %>%
arrange(pred) %>%
ggplot(aes(x = pred, y = event)) +
stat_smooth(method = 'glm', method.args = list(family = binomial), se = F) +
geom_abline()

You are missing just the smoothing part of the plot. If you want to use glm to draw the curve, you have to combine it with splines.
tibble_ex %>%
arrange(pred) %>%
ggplot(aes(x = pred, y = event)) +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
stat_smooth(method = "glm", formula = y ~ splines::ns(x, 1), size = 1) + # ns() lives in the splines package
geom_abline()
However, I have noticed that Steyerberg and Harrell prefer the use of loess smoothing.
tibble_ex %>%
arrange(pred) %>%
ggplot(aes(x = pred, y = event)) +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
geom_smooth(aes(x = pred, y = event), color = "red", se = F, method = "loess") +
# you can use stat_smooth in place of geom_smooth
geom_abline()
I also want to mention Frank Harrell's rms package. It has many helpful functions for fitting and validating models, including calibration plots. The code below plots the calibration curve and provides other statistics.
library(rms)
val.prob(fitted(model),tibble_ex$event)
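The smoothing idea from the Steyerberg quote can also be checked by hand, without ggplot. Here is a minimal sketch in base R, assuming simulated data (the six-row example is too small for loess to behave); the variable names and the coefficients in plogis() are made up for illustration. It smooths the 0/1 outcome against the predicted probability with loess and, as a cross-check, computes the observed event rate within deciles of predicted risk.

```r
# Hypothetical simulated data; not the six-row example from the question
set.seed(1)
n <- 500
weight <- rnorm(n, mean = 150, sd = 40)
event <- rbinom(n, 1, plogis(4 - 0.03 * weight))

fit <- glm(event ~ weight, family = binomial)
pred <- fitted(fit)

# Loess-smoothed observed probability as a function of predicted probability
lo <- loess(event ~ pred)
grid <- seq(quantile(pred, 0.01, names = FALSE),
            quantile(pred, 0.99, names = FALSE),
            length.out = 50)
obs_smooth <- predict(lo, newdata = data.frame(pred = grid))

# Cross-check: observed event rate within deciles of predicted risk
bins <- cut(pred, breaks = quantile(pred, probs = seq(0, 1, 0.1)),
            include.lowest = TRUE)
calib <- data.frame(mean_pred = as.numeric(tapply(pred, bins, mean)),
                    obs_rate  = as.numeric(tapply(event, bins, mean)))

# Calibration plot: smooth curve, decile points, and the 45-degree line
plot(grid, obs_smooth, type = "l", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted probability", ylab = "Observed proportion")
points(calib$mean_pred, calib$obs_rate)
abline(0, 1, lty = 2)
```

If the loess curve and the decile points both hug the dashed line, the two views of calibration agree. Note that this is apparent calibration on the training data, so it will look optimistic.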

Related

How to make beautiful ROC curves for two models in the same plot?

I've trained two xgboost models, say model1 and model2. I have the AUC scores for each model and I want them to appear in the plot. I want to make beautiful ROC curves for both models in the same plot. Something like this:
How can I do that?
I usually use the library pROC, and I know I need to extract the scores, and the truth from each model, right?
so something like this maybe:
roc1 = roc(model1$truth, model1$scores)
roc2 = roc(model2$truth, model2$scores)
I also need the fpr and tpr for each model:
D1 = data.frame(fpr = 1 - roc1$specificities, tpr = roc1$sensitivities)
D2 = data.frame(fpr = 1 - roc2$specificities, tpr = roc2$sensitivities)
Then I can maybe add arrows to point out which curve is which:
arrows = tibble(x1 = c(0.5, 0.13) , x2 = c(0.32, 0.2), y1 = c(0.52, 0.83), y2 = c(0.7,0.7) )
And finally ggplot: (this part is missing)
ggplot(data = D1, aes(x = fpr, y = tpr)) +
geom_smooth(se = FALSE) +
geom_smooth(data = D2, color = 'red', se = FALSE) +
annotate("text", x = 0.5, y = 0.475, label = 'score of model 1') +
annotate("text", x = 0.13, y = 0.9, label = 'scores of model 2')
So I need help with two things:
How do I get the right information out of the models to make ROC curves? How do I get the truth and the prediction scores? Is the truth just the labels of the target feature in the training set?
How do I continue the code? and is my code right so far?
You can get the sensitivity and specificity in a data frame using coords from pROC. Just rbind the results for the two models after first attaching a column labelling each set as model 1 or model 2. To get the smooth-looking ROC with automatic labels you can use geom_textsmooth from the geomtextpath package:
library(pROC)
library(geomtextpath)
roc1 <- roc(model1$truth, model1$scores)
roc2 <- roc(model2$truth, model2$scores)
df <- rbind(cbind(model = "Model 1", coords(roc1)),
cbind(model = "Model 2", coords(roc2)))
ggplot(df, aes(1 - specificity, sensitivity, color = model)) +
geom_textsmooth(aes(label = model), size = 7, se = FALSE, span = 0.2,
textcolour = "black", vjust = 1.5, linewidth = 1,
text_smoothing = 50) +
geom_abline() +
scale_color_brewer(palette = "Set1", guide = "none", direction = -1) +
scale_x_continuous("False Positive Rate", labels = scales::percent) +
scale_y_continuous("True Positive Rate", labels = scales::percent) +
coord_equal(expand = FALSE) +
theme_classic(base_size = 20) +
theme(plot.margin = margin(10, 30, 10, 10))
Data used
set.seed(2023)
model1 <- model2 <- data.frame(scores = rep(1:100, 50))
p1 <- model2$scores + rnorm(5000, 0, 20)
p2 <- model1$scores/100
model1$truth <- rbinom(5000, 1, (p1 - min(p1))/diff(range(p1)))
model2$truth <- rbinom(5000, 1, p2)
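Since the question also asks where fpr, tpr and the AUC come from, here is a minimal base-R sketch of the computation, assuming continuous scores without ties. The helper names roc_points and trapezoid_auc are hypothetical, the data are simulated, and this is not how pROC is implemented internally. The last line builds a label string that could be passed as a label in the plot so the AUC shows up next to the curve.

```r
# Hypothetical simulated data: positives tend to get higher scores
set.seed(2023)
truth  <- rbinom(1000, 1, 0.4)
scores <- truth + rnorm(1000)

# Sweep the threshold from the highest score down and track the rates
# (assumes no tied scores)
roc_points <- function(truth, scores) {
  ord   <- order(scores, decreasing = TRUE)
  truth <- truth[ord]
  tpr <- cumsum(truth == 1) / sum(truth == 1)
  fpr <- cumsum(truth == 0) / sum(truth == 0)
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))
}

# Area under the curve by the trapezoid rule
trapezoid_auc <- function(d) {
  sum(diff(d$fpr) * (head(d$tpr, -1) + tail(d$tpr, -1)) / 2)
}

d   <- roc_points(truth, scores)
auc <- trapezoid_auc(d)

# Text that could be used as the curve label in a plot
label <- sprintf("Model 1 (AUC = %.2f)", auc)
```

The truth vector is simply the observed 0/1 labels for the rows you scored. Ideally those are held-out rows rather than the training set, otherwise the curve will be optimistic.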

Plot binomial GAM in ggplot

I'm trying to visualize a dataset that uses a binomial response variable (proportions). I'm using a gam to examine the trend, but I'm having difficulty getting it to plot with ggplot. How do I get the smooth added to the plot?
Example:
set.seed(42)
df <- data.frame(y1 = sample.int(100),
y2 = sample.int(100),
x = runif(100, 0, 100))
ggplot(data = df,
aes(y = y1/(y1+y2), x = x)) +
geom_point(shape = 1) +
geom_smooth(method = "gam",
method.args = list(family = binomial),
formula = cbind(y1, y2) ~ s(x))
Warning message:
Computation failed in `stat_smooth()`
Caused by error in `cbind()`:
! object 'y1' not found
The formula in geom_smooth has to be in terms of x and y, representing the variables on your x and y axes, so you can't pass in y1 and y2.
The way around this is to expand the counts into 1s and 0s so that there is only a single y variable, rather than using the cbind-type left-hand side of your gam. Although this makes for a little extra pre-processing, it lets you draw your points just as easily using stat = 'summary' inside geom_point, and it makes your geom_smooth very straightforward:
library(tidyverse)
set.seed(42)
df <- data.frame(y1 = sample.int(100),
y2 = sample.int(100),
x = runif(100, 0, 100))
df %>%
rowwise() %>%
summarize(y = rep(c(1, 0), times = c(y1, y2)), x = x) %>%
ggplot(aes(x, y)) +
geom_point(stat = 'summary', fun = mean, shape = 1) +
geom_smooth(method = "gam",
method.args = list(family = binomial),
formula = y ~ s(x)) +
theme_classic()
Created on 2023-01-20 with reprex v2.0.2
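If you want to convince yourself that expanding the counts does not change the model, here is a minimal sketch, assuming a plain glm with a fixed natural-spline basis as a stand-in for the gam (the knot positions are arbitrary choices for illustration). With the basis held fixed, the cbind form and the expanded 0/1 form have the same likelihood up to a constant, so the coefficients should agree.

```r
library(splines)

set.seed(42)
df <- data.frame(y1 = sample.int(100),
                 y2 = sample.int(100),
                 x  = runif(100, 0, 100))

# Aggregated form: successes and failures per row.
# Knots are fixed explicitly so both fits use the same basis.
fit_counts <- glm(cbind(y1, y2) ~ ns(x, knots = c(33, 66),
                                     Boundary.knots = c(0, 100)),
                  family = binomial, data = df)

# Expanded form: one row per trial, y1 ones followed by y2 zeros
long <- data.frame(
  y = unlist(Map(function(a, b) rep(c(1, 0), times = c(a, b)),
                 df$y1, df$y2)),
  x = rep(df$x, times = df$y1 + df$y2)
)
fit_long <- glm(y ~ ns(x, knots = c(33, 66),
                       Boundary.knots = c(0, 100)),
                family = binomial, data = long)

# Compare the two sets of coefficients
all.equal(coef(fit_counts), coef(fit_long), tolerance = 1e-6)
```

One caveat: a smoother that picks its knots or penalty from the data sees each x repeated in the expanded form, so a gam fitted either way may not match to the last digit. I'd expect the overall trend to be the same.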

pROC package: ci.se. How are the CI calculated?

I have two questions:
I am using the pROC package to calculate the CI of the ROC curve for a logistic regression model and a random forest model. What I cannot understand is which algorithm is used for this computation. Is it the vertical averaging algorithm? Tom Fawcett's paper mentions, "Confidence intervals of the mean of tp rate are computed using the common assumption of a binomial distribution." Does he mean the normal approximation? Moreover, is the curve that I am plotting the average curve?
forest <- randomForest(factor(extreme, levels = c("Yes", "No")) ~ tas + X0+X1+X2+X3+X4+X5+X8,
train_df, ntree = 500, na.action = na.omit) # na.action, not na.omit = TRUE, drops rows with NAs
Random_Forest <- predict(forest, test_df, type = "prob")[,2]
roc <- roc(test_df$extry, Random_Forest , plot=TRUE, legacy.axes=TRUE)
Logistic_Regression <- predict(model,test_df, type='response')
roc <- roc(test_df$extry, Logistic_Regression, plot=TRUE,legacy.axes=TRUE)
roc.list <- roc(test_df$extry ~ Logistic_Regression+Random_Forest,legacy.axes=TRUE)
ci.list <- lapply(roc.list, ci.se, specificities = seq(0, 1, .1), boot.n = 2000, boot.stratified = TRUE, conf.level = 0.95, parallel = TRUE)
dat.ci.list <- lapply(ci.list, function(ciobj)
data.frame(x = as.numeric(rownames(ciobj)),
lower = ciobj[, 1],
upper = ciobj[, 3]))
p <- ggroc(roc.list,legacy.axes=TRUE,aes = c("linetype")) +
labs(x = "False Positive Rate", y = "True Positive Rate", linetype="Model")+
scale_linetype_discrete(labels=c("Logistic Regression","Random Forest"))+
theme_classic() +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", alpha = 0.7, color = "grey") + # with legacy axes the chance line runs from (0, 0) to (1, 1)
coord_equal()
for(i in 1:2) {
p <- p + geom_ribbon(
data = dat.ci.list[[i]],
aes(x = 1-x, ymin = lower, ymax = upper),
fill = i + 1,
alpha = 0.2,
inherit.aes = F)
}
p
Can I use the pROC package to calculate CIs on the test datasets obtained from cross-validation? For example, if I use 10-fold cross-validation for the logistic regression model, I will have 10 ROC curves. The line roc.list <- roc(test_df$extry ~ Logistic_Regression+Random_Forest,legacy.axes=TRUE) will not work, since the data differ across the 10 test datasets. Any ideas?
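On the first question: the boot.n argument in the ci.se call suggests the intervals come from bootstrap resampling rather than from a binomial or normal approximation, and as far as I know pROC documents ci.se as a bootstrap (stratified by default) that reports lower, median and upper percentiles. Below is a minimal base-R sketch of that idea. It is an illustration only, not pROC's actual code; the helpers sens_at_spec and boot_ci are hypothetical and the data are simulated.

```r
# Hypothetical simulated data: cases tend to score higher than controls
set.seed(1)
n <- 400
truth <- rbinom(n, 1, 0.5)
score <- truth + rnorm(n)

# Sensitivity at a target specificity: the threshold is the matching
# quantile of the control scores
sens_at_spec <- function(truth, score, spec) {
  thr <- quantile(score[truth == 0], probs = spec, names = FALSE)
  mean(score[truth == 1] > thr)
}

# Stratified bootstrap: resample cases and controls separately,
# then take percentiles of the resampled sensitivities
boot_ci <- function(truth, score, spec, B = 2000, level = 0.95) {
  cases    <- which(truth == 1)
  controls <- which(truth == 0)
  stats <- replicate(B, {
    idx <- c(sample(cases, replace = TRUE),
             sample(controls, replace = TRUE))
    sens_at_spec(truth[idx], score[idx], spec)
  })
  alpha <- (1 - level) / 2
  quantile(stats, probs = c(alpha, 0.5, 1 - alpha), names = FALSE)
}

# Lower bound, median and upper bound of sensitivity at 90% specificity
ci <- boot_ci(truth, score, spec = 0.9, B = 500)
ci
```

Under this reading the ribbon is a percentile band around the empirical curve, and the plotted line is the ROC curve of the full test set rather than an average of curves. For cross-validation, one common option is to pool the out-of-fold predictions into a single vector and build one ROC curve from it; that is a design choice, not the only way.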

How to visualize spline regression with ggplot2?

I'm working with the Wage dataset in the ISLR library. My objective is to perform a spline regression with knots at 3 locations (see code below). I can do this regression. That part is fine.
My issue concerns the visualization of the regression curve. Using base R functions, I seem to get the correct curve. But I can't seem to get quite the right curve using the tidyverse. This is what is expected, and what I get with the base functions:
This is what ggplot spits out
It's noticeably different. R gives me the following message when running the ggplot functions:
`geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
What does this mean and how do I fix it?
library(tidyverse)
library(ISLR)
library(splines) # provides bs()
attach(Wage)
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60), degree = 3), data = Wage) #Default is 3
plot(age, wage, col = 'grey', xlab = 'Age', ylab = 'Wages')
points(age.grid, predict(fit, newdata = list(age = age.grid)), col = 'darkgreen', lwd = 2, type = "l")
abline(v = c(25, 40, 60), lty = 2, col = 'darkgreen')
ggplot(data = Wage) +
geom_point(mapping = aes(x = age, y = wage), color = 'grey') +
geom_smooth(mapping = aes(x = age, y = fit$fitted.values), color = 'red')
I also tried
ggplot() +
geom_point(data = Wage, mapping = aes(x = age, y = wage), color = 'grey') +
geom_smooth(mapping = aes(x = age.grid, y = predict(fit, newdata = list(age = age.grid))), color = 'red')
but that looks very similar to the 2nd picture.
Thanks for any help!
splines::bs() and s(., bs = "cs") from mgcv (which is what geom_smooth picked by default, per the message) do very different things; the latter is a penalized regression spline. I would try (untested!)
geom_smooth(method="lm",
formula= y ~ splines::bs(x, knots = c(25, 40, 60), degree = 3))
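Another option is to skip geom_smooth entirely: predict from the fitted lm on a grid and draw the result with geom_line, so nothing gets re-smoothed. Here is a minimal sketch, assuming simulated data as a hypothetical stand-in for ISLR::Wage (the column names age and wage are kept, the values are made up). The ggplot2 part only runs if the package is installed.

```r
library(splines)

# Hypothetical stand-in for ISLR::Wage
set.seed(1)
dat <- data.frame(age = sample(18:80, 300, replace = TRUE))
dat$wage <- 60 + 40 * sin(dat$age / 20) + rnorm(300, sd = 10)

# Same model form as in the question: cubic B-spline with three knots
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60), degree = 3), data = dat)

# Predict on a grid of ages; this is the curve the base-R plot draws
grid <- data.frame(age = seq(min(dat$age), max(dat$age)))
grid$wage_hat <- predict(fit, newdata = grid)

# Draw the predictions as a line instead of smoothing them again
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dat, aes(age, wage)) +
    geom_point(colour = "grey") +
    geom_line(data = grid, aes(age, wage_hat), colour = "red") +
    geom_vline(xintercept = c(25, 40, 60), linetype = 2,
               colour = "darkgreen")
  print(p)
}
```

The original attempts looked different because geom_smooth was fitting its own default smoother to the already-fitted values, which is what the message about method = 'gam' is reporting.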

Plotting lift curve in MLR

I would like to know how to plot lift curves in MLR especially for a Benchmark experiment with multiple algorithms and tasks. Help with ROC curve plotting will also be appreciated.
Thanks.
I am not a mlr user but here is a general way.
First some data:
Two class problem
iris2 = iris[iris$Species!="setosa",]
iris2$Species = factor(iris2$Species)
1st model:
log_model = glm(Species~., data = iris2, family = "binomial")
prob = predict(log_model, iris2, type = "response") #get the logistic regression prob
2nd model:
library(e1071)
svm_model = svm(Species~., data = iris2, probability = TRUE)
prob_svm = predict(svm_model, iris2, probability = TRUE)
prob_svm = attr(prob_svm , "probabilities")[,2] #get the probability for svm model
make a data frame from classes (1/0 coding) and additional columns for predicted probabilities for each model
for_lift = data.frame(Class = as.factor(ifelse(iris2$Species == "versicolor", 1, 0)), glm = prob, svm = prob_svm)
make a lift object
library(caret)
lift_obj = lift(Class ~ glm+svm, data = for_lift)
xyplot(lift_obj, auto.key = list(columns = 2,
lines = TRUE,
points = FALSE))
You can use the same data frame to plot ROC curves
library(pROC)
plot(pROC::roc(response = for_lift$Class,
predictor = for_lift$glm,
levels=c(0, 1)),
lwd=1.5)
plot(
pROC::roc(response = for_lift$Class,
predictor = for_lift$svm ,
levels=c(0, 1)),
add=T, lty=2, lwd=1.5)
legend(0.9, 0.9, c("logistic", "svm"), lty = c(1,2))
You can also check the ROCR package: https://cran.r-project.org/web/packages/ROCR/ROCR.pdf it has methods to plot both types of plots
Additionally if you are a ggplot2 user you can use the lift_obj to plot lift and ROC curves with it also.
library(ggplot2)
p1 = ggplot(lift_obj$data)+
geom_line(aes(CumTestedPct, CumEventPct, color = liftModelVar))+
xlab("% Samples tested")+
ylab("% Samples found")+
scale_color_discrete(guide = guide_legend(title = "method"))+
geom_polygon(data = data.frame(x = c(0, lift_obj$pct, 100, 0),
y = c(0, 100, 100, 0)),
aes(x = x, y = y), alpha = 0.1)
p2 = ggplot(lift_obj$data)+
geom_line(aes(1-Sp , Sn, color = liftModelVar))+
scale_color_discrete(guide = guide_legend(title = "method"))
library(cowplot)
plot_grid(p1, p2, labels=c("lift", "ROC"))
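For reference, the cumulative gains curve that lift() draws can be computed by hand in a few lines. Here is a minimal base-R sketch, assuming virginica is treated as the event and using only two predictors to keep the logistic fit well behaved; the column names pct_tested and pct_found are my own, not caret's.

```r
# Two-class problem, as above
iris2 <- droplevels(iris[iris$Species != "setosa", ])

# glm models the probability of the second factor level (virginica)
fit  <- glm(Species ~ Sepal.Length + Sepal.Width,
            data = iris2, family = binomial)
prob  <- predict(fit, type = "response")
event <- iris2$Species == "virginica"

# Rank samples from most to least likely and count the events found so far
ord  <- order(prob, decreasing = TRUE)
hits <- cumsum(event[ord])

gains <- data.frame(
  pct_tested = 100 * seq_along(hits) / length(hits),
  pct_found  = 100 * hits / sum(event)
)
# Lift: how much better than testing samples in random order
gains$lift <- gains$pct_found / gains$pct_tested

plot(gains$pct_tested, gains$pct_found, type = "l",
     xlab = "% Samples tested", ylab = "% Samples found")
abline(0, 1, lty = 2)  # what a random ordering would give
```

The further the curve sits above the dashed diagonal, the more events the model finds early. Repeating the ranking step for each model's probabilities gives one curve per model.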
