pROC package: ci.se. How are the CI calculated? - r

I have two questions:
I am using the pROC package to calculate the CI of the ROC curve for a logistic regression model and a random forest model. What I cannot understand is which algorithm is used for this computation. Is it the vertical averaging algorithm? Tom Fawcett's paper mentions, "Confidence intervals of the mean of tp rate are computed using the common assumption of a binomial distribution." Does he mean the normal approximation? Moreover, is the curve that I am plotting the average curve?
forest <- randomForest(factor(extreme, levels = c("Yes", "No"))~ tas + X0+X1+X2+X3+X4+X5+X8,
train_df, ntree = 500, na.action = na.omit)
Random_Forest <- predict(forest, test_df, type = "prob")[,2]
roc <- roc(test_df$extry, Random_Forest , plot=TRUE, legacy.axes=TRUE)
Logistic_Regression <- predict(model,test_df, type='response')
roc <- roc(test_df$extry, Logistic_Regression, plot=TRUE,legacy.axes=TRUE)
roc.list <- roc(test_df$extry ~ Logistic_Regression+Random_Forest,legacy.axes=TRUE)
ci.list <- lapply(roc.list, ci.se, specificities = seq(0, 1, .1), boot.n=2000, stratified=TRUE, conf.level=0.95,parallel = TRUE)
dat.ci.list <- lapply(ci.list, function(ciobj)
data.frame(x = as.numeric(rownames(ciobj)),
lower = ciobj[, 1],
upper = ciobj[, 3]))
p <- ggroc(roc.list,legacy.axes=TRUE,aes = c("linetype")) +
labs(x = "False Positive Rate", y = "True Positive Rate", linetype="Model")+
scale_linetype_discrete(labels=c("Logistic Regression","Random Forest"))+
theme_classic() +
geom_abline(slope=1, intercept = 0, linetype = "dashed", alpha=0.7, color = "grey") + # with legacy axes the chance diagonal passes through the origin
coord_equal()
for(i in 1:2) {
p <- p + geom_ribbon(
data = dat.ci.list[[i]],
aes(x = 1-x, ymin = lower, ymax = upper),
fill = i + 1,
alpha = 0.2,
inherit.aes = F)
}
p
My second question: can I use the pROC package to calculate the CI on the test datasets obtained from cross-validation? For example, if I use 10-fold cross-validation for the logistic regression model, I will have 10 ROC curves. The line roc.list <- roc(test_df$extry ~ Logistic_Regression+Random_Forest,legacy.axes=TRUE) will not work, since the data are not the same in the 10 different test datasets. Any ideas?
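For what it is worth, the pROC documentation describes ci.se as a bootstrap procedure rather than vertical averaging or a binomial formula: the observations are resampled with replacement (within each class when stratified = TRUE), the sensitivity at each requested specificity is recomputed on every replicate, and the interval is read off the percentiles of those values. The curve drawn by ggroc would then be the empirical ROC curve of the test set, not an average. The sketch below re-creates that idea in base R on simulated scores. The data are made up, sens_at_spec is a hypothetical helper, and its interpolation and tie handling are simplifications that will not match pROC exactly.

```r
set.seed(42)
# Simulated test set (an assumption, not the asker's data)
n <- 300
y <- rbinom(n, 1, 0.4)
score <- rnorm(n, mean = 1.2 * y)
specs <- seq(0, 1, 0.1)

# Hypothetical helper: sensitivity of the empirical ROC curve at fixed specificities
sens_at_spec <- function(y, score, specs) {
  cuts <- c(-Inf, sort(unique(score)), Inf)
  sens <- sapply(cuts, function(t) mean(score[y == 1] > t))
  spec <- sapply(cuts, function(t) mean(score[y == 0] <= t))
  # linear interpolation; on vertical segments keep the highest sensitivity
  approx(spec, sens, xout = specs, ties = max, rule = 2)$y
}

# One stratified bootstrap replicate: resample cases and controls separately
boot_once <- function() {
  idx <- c(sample(which(y == 1), replace = TRUE),
           sample(which(y == 0), replace = TRUE))
  sens_at_spec(y[idx], score[idx], specs)
}

boots <- replicate(500, boot_once())   # length(specs) x 500 matrix
# percentile interval per specificity: lower bound, median, upper bound
ci <- t(apply(boots, 1, quantile, probs = c(0.025, 0.5, 0.975)))
rownames(ci) <- specs
ci
```

For cross-validation, one option (an assumption here, not something pROC prescribes) is to pool the out-of-fold predictions of all 10 folds into a single vector and bootstrap that one curve in the same way.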

Related

Unable to plot confidence intervals using ggplot, (geom_ribbon() argument)

I am trying to plot 95% confidence intervals on some simulated values, but I am running into some issues when plotting the CIs with geom_ribbon(). The trouble is that the plot does not show the CIs around my model.
I have included all of my code below, in case anyone can see where I have gone wrong:
set.seed(20220520)
#simulating 200 values between 0 and 1 from a uniform distribution
x = runif(200, min = 0, max = 1)
lam = exp(0.3+5*x)
y = rpois(200, lambda = lam)
#before we do this each Yi may contain zeros so we need to add a small constant
y <- y + .1
#combining x and y into a dataframe so we can plot
df = data.frame(x, y)
#fitting a Poisson GLM
model2 <- glm(y ~ x,
data = df,
family = poisson(link='log'))
#make predictions (this may be the same as predictions_mod2)
preds <- predict(model2, type = "response")
#making CI predictions
predictions_mod2 = predict(model2, df, se.fit = TRUE, type = 'response')
#calculate confidence intervals limit
upper_mod2 = predictions_mod2$fit+1.96*predictions_mod2$se.fit
lower_mod2 = predictions_mod2$fit-1.96*predictions_mod2$se.fit
#transform the CI limit to get one at the level of the mean
upper_mod2 = exp(upper_mod2)/(1+exp(upper_mod2))
lower_mod2 = exp(lower_mod2)/(1+exp(lower_mod2))
#combining into a df
predframe = data.frame(lwr=lower_mod2,upr=upper_mod2, x = df$x, y = df$y)
#plot model with 95% confidence intervals using ggplot
ggplot(df, aes(x, y)) +
geom_ribbon(data = predframe, aes(ymin=lwr, ymax=upr), alpha = 0.4) +
geom_point() +
geom_line(aes(x, preds2), col = 'blue')
A comment on the question raises the point that the predicted values should not be logit-transformed. The reason is that the predictions were requested with type = "response", so they are already on the scale of the response. From the documentation of predict.glm (my emphasis):
type
the type of prediction required. The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and type = "response" gives the predicted probabilities. The "terms" option returns a matrix giving the fitted values of each term in the model formula on the linear predictor scale.
The best way to answer is to show the code.
library(ggplot2, quietly = TRUE)
set.seed(20220520)
#simulating 200 values between 0 and 1 from a uniform distribution
x = runif(200, min = 0, max = 1)
lam = exp(0.3+5*x)
y = rpois(200, lambda = lam)
#before we do this each Yi may contain zeros so we need to add a small constant
y <- y + 0.1
#combining x and y into a dataframe so we can plot
df = data.frame(x, y)
#fitting a Poisson GLM
suppressWarnings(
model2 <- glm(y ~ x,
data = df,
family = poisson(link='log'))
)
#make predictions (this may be the same as predictions_mod2)
preds <- predict(model2, type = "response")
#making CI predictions
predictions_mod2 = predict(model2, df, se.fit = TRUE, type = 'response')
#calculate confidence intervals limit
upper_mod2 = predictions_mod2$fit+1.96*predictions_mod2$se.fit
lower_mod2 = predictions_mod2$fit-1.96*predictions_mod2$se.fit
#combining into a df
predframe = data.frame(lwr=lower_mod2,upr=upper_mod2, x = df$x, y = df$y)
#plot model with 95% confidence intervals using ggplot
ggplot(df, aes(x, y)) +
geom_ribbon(data = predframe, aes(ymin=lwr, ymax=upr), alpha = 0.4) +
geom_point() +
geom_line(aes(x, preds), col = 'blue')
Created on 2022-05-29 by the reprex package (v2.0.1)
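The symmetric interval above is built on the response scale, so it can dip below zero where the fitted mean is small. A common alternative, sketched here on the same simulated setup but without the added constant, is to build the interval on the link scale and back-transform with the inverse of the model's link, which for a log link is exp() rather than the logistic function. The object names (fit, link_pred, band) are illustrative.

```r
set.seed(20220520)
x <- runif(200, min = 0, max = 1)
y <- rpois(200, lambda = exp(0.3 + 5 * x))
df <- data.frame(x, y)

fit <- glm(y ~ x, data = df, family = poisson(link = "log"))

# predictions and standard errors on the linear-predictor (log) scale
link_pred <- predict(fit, newdata = df, type = "link", se.fit = TRUE)

# Wald interval on the log scale, mapped back with the model's inverse link
inv_link <- family(fit)$linkinv
band <- data.frame(
  x   = df$x,
  fit = inv_link(link_pred$fit),
  lwr = inv_link(link_pred$fit - 1.96 * link_pred$se.fit),
  upr = inv_link(link_pred$fit + 1.96 * link_pred$se.fit)
)
head(band)
```

The resulting band can be passed to geom_ribbon(data = band, aes(x = x, ymin = lwr, ymax = upr), inherit.aes = FALSE) in place of predframe.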

How can I get confidence intervals for an nls(broken stick) class object

I am trying to figure out how to obtain/plot confidence bounds for nls objects in R.
For example here is a nls model
bstick.lm.mean <- nls(TCTmean ~ cbind("intercept" = 1,
"l2Flow" = l2Flow,
"l2FlowBr" = ifelse(l2Flow > Br,
l2Flow - Br, 0)),
start = list(Br = 6),
algorithm = "plinear",
data = flow.new.sum.dat)
So bstick.lm.mean is a nls class object.
new.seq4 = seq(min(flow.new.sum.dat$l2Flow), max(flow.new.sum.dat$l2Flow), length = 200)
new.seq4 = data.frame(new.seq4)
names(new.seq4) = 'l2Flow'
pz = predict(bstick.lm.mean, newdata = new.seq4,
interval = 'confidence', se.fit = TRUE, level = 0.95)
test.frame2 = data.frame(new.seq4,pz)
ggplot(data = test.frame2)+
geom_point(mapping = aes(x = l2Flow, y = pz),
shape = 1, col = 'red') +
geom_point(data = test.frame,
aes(x = l2Flow, y = TCTmean),
shape = 0) +
theme(panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black")) +
xlab("Log2 Flow (KL)") +
ylab("Mean Transformed Ct")
Produces the following plot:
My question is, how can I obtain confidence bands for this nonlinear fit? I am referring to the bands around the regression line.
I extracted data from the scatterplot for analysis, and performed an equation search. I found that several different sigmoidal equations were fitting the data best, and for individual sigmoidal equations it should be simpler to determine the confidence intervals with standard statistical software. As an example, here are my results for a Hyperbolic Logistic sigmoidal equation, "y = (a * pow(x, b)) / (c + pow(x, b))", with fitted parameters a = 1.6177820755100655E+01, b = -1.5270446610701983E+01, and c = 4.2601082365916449E-12 yielding RMSE = 2.58 and R-squared = 0.85. Again, there were several equally "good" sigmoidal equations to choose from.
Estimates of uncertainty around change points (point of break) are notoriously hard to do analytically. Try the R package mcp which takes a computational (Bayesian) approach:
library(mcp)
model = list(
y ~ 1 + x, # Slope
~ 0 # joined plateau
)
fit = mcp(model, df)
plot(fit, q_fit = TRUE) # Plot with quantiles
The red lines are the highest-density interval, the grey lines random posterior draws, and the blue curve is the posterior distribution of the change point location. Use plot_pars(fit) and summary(fit) to get parameter-wise summaries and plots, including uncertainty.

How to find x value for OR = 1 in logit crude and adjusted GAM

I've some data for fitting crude and adjusted logit GAMs:
library(mgcv)
## Simulate some data...
set.seed(3);n<-400
dat <- gamSim(1,n=n)
mu <- binomial()$linkinv(dat$f/4-2)
phi <- .5
a <- mu*phi;b <- phi - a;
dat$y <- rbeta(n,a,b)
## Fitting GAMs
crude <- gam(y~s(x0),family=binomial(link="logit"),data=dat)
adj <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=binomial(link="logit"),data=dat)
Now I would like to find the value of x0 at which the odds ratio (OR) is 1.00 (i.e. probability 0.50). For this purpose I use visreg with the argument plot = FALSE.
## Prepare data for ggplotting
library(visreg)
p.crude <- visreg(crude, "x0", plot = FALSE)
p.adj <- visreg(adj, "x0", plot = FALSE)
library(dplyr)
bind_rows(
mutate(p.crude$fit, Model = "crude"),
mutate(p.adj$fit, Model = "adj")
) -> fits
Next, I compute the OR from the log OR. Is the following code correct?
# Compute ORs and CI from LogOR
fits$or <- exp(fits$visregFit)
fits$ci.low <- exp(fits$visregLwr)
fits$ci.up <- exp(fits$visregUpr)
Now I use approx in order to interpolate the x0 value with OR 1.00
## Interpolate x0 which give OR 1.00 (or 50% of probability)
x.crude <- round(approx(x = crude$fitted.values, y=crude$model$x0, xout = .5)$y, 1)
x.adj <- round(approx(x = adj$fitted.values, y=adj$model$x0, xout = .5)$y, 1)
Finally, I'm plotting the two models in a single graph:
## Plotting using ggplot
library(ggplot2)
ggplot(data = fits) +
geom_vline(aes(xintercept = x.crude), size=.2, color="black")+
geom_vline(aes(xintercept = x.adj), size=.2, color="red")+
annotate(geom ="text", x= x.crude - 0.05, y=.5, label = x.crude, size=3.5) +
annotate(geom ="text", x= x.adj - 0.05, y=.5, label = x.adj, size=3.5, color="red") +
geom_ribbon(aes(x0, ymin=ci.low, ymax=ci.up, group=Model, fill=Model), alpha=.05) +
geom_line(aes(x0, or, group=Model, color=Model)) +
labs(x="X0", y="Odds ratio")+
theme_bw(16)
As you can see, only the crude model crosses OR = 1.00 (at about x0 = 0.9), while the adj model never does.
First, how can I interpolate the x0 value at which the OR is exactly 1?
Second, with my limited statistical knowledge, I expected the adj model to cross OR = 1 as well, given the observed values of x0 under this model. Why is its curve shifted upwards?
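One way to get the crossing point exactly, instead of rounding an approx() result, is to solve for the root of the fitted log-odds curve. The sketch below uses a plain logistic glm on simulated data so that it runs with base R only; the same uniroot() call should work with predict(..., type = "link") on a gam fit, provided the curve changes sign inside the search interval. Note that exp() of the linear predictor is the odds, so the value 1 here means odds of 1 (probability 0.5); an odds ratio additionally needs a reference value of x0 to compare against. All object names and the simulated data are illustrative assumptions.

```r
set.seed(3)
n <- 400
sim <- data.frame(x0 = runif(n))
sim$y <- rbinom(n, 1, plogis(-2 + 4 * sim$x0))   # true crossing at x0 = 0.5

fit <- glm(y ~ x0, family = binomial(link = "logit"), data = sim)

# log-odds as a function of x0; zero log-odds <=> odds = 1 <=> p = 0.5
log_odds <- function(x) {
  as.numeric(predict(fit, newdata = data.frame(x0 = x), type = "link"))
}

# uniroot needs a sign change over the interval, so check for one first
rng <- range(sim$x0)
if (log_odds(rng[1]) * log_odds(rng[2]) < 0) {
  x_cross <- uniroot(log_odds, interval = rng, tol = 1e-9)$root
} else {
  x_cross <- NA_real_   # the curve never reaches odds = 1 on the observed range
}
x_cross
```

When the sign check fails, as it may for a curve that stays above 1 over the whole observed range, there is no crossing to interpolate and NA is the honest result.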

How is `level` used to generate the confidence interval in geom_smooth?

I'm having trouble emulating how stat_smooth calculates its confidence interval.
Let's generate some data and a simple model:
library(tidyverse)
# sample data
df = tibble(
x = runif(10),
y = x + rnorm(10)*0.2
)
# simple linear model
model = lm(y ~ x, df)
Now use predict() to generate values and confidence intervals
# predict
df$predicted = predict(
object = model,
newdata = df
)
# predict 95% confidence interval
df$CI = predict(
object = model,
newdata = df,
se.fit = TRUE
)$se.fit * qnorm(1 - (1-0.95)/2)
Notice that qnorm is used to expand from standard error to 95% CI
Plot the data (black dots), geom_smooth (black line + gray ribbon), and the predicted ribbon (red and blue lines).
ggplot(df) +
aes(x = x, y = y) +
geom_point(size = 2) +
geom_smooth(method = "lm", level = 0.95, fullrange = TRUE, color = "black") +
geom_line(aes(y = predicted + CI), color = "blue") + # upper
geom_line(aes(y = predicted - CI), color = "red") + # lower
theme_classic()
The red and blue lines should be the same as the ribbon's edges. What am I doing wrong?
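As posted in a comment by @Dason, the answer is that geom_smooth uses a t-distribution, not a normal distribution.
In my original question, replace qnorm(1 - (1-0.95)/2) with qt(1 - (1-0.95)/2, df.residual(model)) for the lines to match up. The degrees of freedom are the model's residual degrees of freedom (nrow(df) - 2 for this model), not nrow(df).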
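A quick base-R check of this, on simulated data of the same shape: predict.lm builds its own confidence interval from a t quantile with the residual degrees of freedom (n - 2 for a straight line), and a hand-computed band matches it only when the same quantile is used. The variable names are illustrative.

```r
set.seed(1)
d <- data.frame(x = runif(10))
d$y <- d$x + rnorm(10) * 0.2
model <- lm(y ~ x, data = d)

pred <- predict(model, newdata = d, se.fit = TRUE)

# multipliers for a 95% interval: t with residual df versus standard normal
t_mult <- qt(1 - (1 - 0.95) / 2, df = df.residual(model))
z_mult <- qnorm(1 - (1 - 0.95) / 2)

manual_t <- cbind(lwr = pred$fit - t_mult * pred$se.fit,
                  upr = pred$fit + t_mult * pred$se.fit)

# reference interval computed by predict.lm itself
ref <- predict(model, newdata = d, interval = "confidence", level = 0.95)

# compare the hand-computed band with the reference band
all.equal(unname(manual_t), unname(ref[, c("lwr", "upr")]))
c(t = t_mult, z = z_mult)
```

With only 10 observations the t multiplier is noticeably larger than the normal one, which is why the qnorm-based lines sit inside the ribbon.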

Example of fitting marginal distributions to histogram in R

Could someone show me how to fit a polynomial marginal distribution to my data? I have done a binomial and beta binomial, but I would like to see how to fit a polynomial. I would also be interested in trying a gamma if that is something you know how to do.
This is what I have done so far.
nodes <- read.table("https://web.stanford.edu/~hastie/CASI_files/DATA/nodes.txt",
header = T)
nodes %>%
ggplot(aes(x=x/n))+
geom_histogram(bins = 30)+
theme_bw()+
labs(x = "nodes",
n = "p=x/n")
# log-likelihood function
ll <- function(alpha, beta) {
x <- nodes$x
total <- nodes$n
-sum(VGAM::dbetabinom.ab(x, total, alpha, beta, log = TRUE))
}
# maximum likelihood estimation
m <- mle(ll, start = list(alpha = 1, beta = 10), method = "L-BFGS-B",
lower = c(0.0001, .1))
ab <- coef(m)
alpha0 <- ab[1]
beta0 <- ab[2]
nodes %>%
ggplot() +
geom_histogram(aes(x/n, y = ..density..), bins= 30) +
stat_function(fun = function(x) dbeta(x, alpha0, beta0), color = "red",
size = 1) +
xlab("p=x/n")
Here is another fit
ll <- function(a){
x <- nodes$x
total <- nodes$n
-sum(stats::dbinom(x, total, a, log = TRUE))
}
#stats::dbinom()
m <- mle(ll, start = list(a=.5), method = "L-BFGS-B",
lower = c(0.0001, .1))
a = coef(m)
nodes %>%
ggplot() +
geom_histogram(aes(x/n, y = ..density..), bins=40) +
stat_function(fun = function(x) dbeta(x, a, 1), color = "red",
size = 1) +
xlab("proportion x/n")
For fitting a gamma distribution:
data(iris)
library(MASS) ##for the fitdistr function
library(ggplot2)
fit.params <- fitdistr(iris$Sepal.Length, "gamma", lower = c(0, 0))
##histogram on the density scale with the fitted gamma density on top
ggplot(data = iris, aes(x = Sepal.Length)) +
geom_histogram(aes(y = ..density..)) +
geom_line(aes(y = dgamma(Sepal.Length, fit.params$estimate["shape"],
fit.params$estimate["rate"])), color = "red", size = 1) +
theme_classic()
You might also like to take a look at the distribution of the quantiles using the qqp function in the car package. Here are a few examples:
library(car)
qqp(iris$Sepal.Length, "norm") ##normal distribution
qqp(iris$Sepal.Length, "lnorm") ##log-normal distribution
gamma <- fitdistr(iris$Sepal.Length, "gamma")
qqp(iris$Sepal.Length, "gamma", shape = gamma$estimate[[1]],
rate = gamma$estimate[[2]]) ##gamma distribution
nbinom <- fitdistr(iris$Sepal.Length, "Negative Binomial")
qqp(iris$Sepal.Length, "nbinom", size = nbinom$estimate[[1]],
mu = nbinom$estimate[[2]]) ##negative binomial distribution
You can use the fitdistr function for ggplots or qqPlots. It supports lots of different distributions. Take a look at ?fitdistr
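To see what fitdistr() returns before plotting, the sketch below fits a gamma distribution to simulated data with known parameters, so the estimates can be compared with the truth. The simulated sample and the object names are assumptions for illustration, and MASS is assumed to be installed.

```r
library(MASS)   # fitdistr

set.seed(123)
x <- rgamma(2000, shape = 2, rate = 0.5)

# maximum-likelihood fit; lower bounds keep both parameters positive
fit <- fitdistr(x, "gamma", lower = c(0.001, 0.001))
est <- fit$estimate       # named vector: shape, rate
fit$sd                    # approximate standard errors of the estimates

# overlay the fitted density on a density-scaled histogram (base graphics)
hist(x, breaks = 40, freq = FALSE, main = "Gamma fit", xlab = "x")
curve(dgamma(x, shape = est["shape"], rate = est["rate"]),
      add = TRUE, col = "red", lwd = 2)
```

The same est["shape"] and est["rate"] values can be handed to dgamma() inside a ggplot layer, as in the iris example above.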
