Prediction and Confidence intervals for Logistic Regression - r

Below is a set of fictitious probability data, which I converted to binary outcomes using a threshold of 0.5. I ran a glm() model on the discrete data to test whether the intervals returned from glm() were 'mean prediction intervals' ("Confidence Interval") or 'point prediction intervals' ("Prediction Interval"). It appears from the plot below that the returned intervals are the latter, 'Point Prediction Intervals'; note that, with 95% confidence, 2/20 points fall outside of the line in this sample.
If this is indeed the case, how do I generate the 'mean prediction interval' (i.e., "Confidence Intervals") in R for a binomial data set bounded by 0 and 1 using glm()? Please show your code and a plot similar to mine with the fit line, given probabilities, 'confidence intervals' and 'prediction intervals'.
# Fictitious data
xVal <- c(15,15,17,18,32,33,41,42,47,50,
53,55,62,63,64,65,66,68,70,79,
94,94,94,95,98)
randRatio <- c(.01,.03,.05,.04,.01,.2,.1,.08,.88,.2,
.2,.99,.49,.88,.2,.88,.66,.87,.66,.90,
.98,.88,.95,.95,.95)
# Converted to binomial
randBinom <- ifelse(randRatio < .5, 0, 1)
# Data frame for model
binomData <- data.frame(
randBinom = randBinom,
xVal = xVal
)
# Model
mode1 <- glm(randBinom~ xVal, data = binomData, family = binomial(link = "logit"))
# Predict all points in xVal range
frame <- data.frame(xVal=(0:100))
predAll <- predict(mode1, newdata = frame,type = "link", se.fit=TRUE)
# Params for intervals and plot
confidence <- .95
score <- qnorm((confidence / 2) + .5)
frame <- data.frame(xVal=(0:100))
#Plot
with(binomData, plot(xVal, randBinom, type="n", ylim=c(0, 1),
ylab = "Probability", xlab="xVal"))
lines(frame$xVal, plogis(predAll$fit), col = "red", lty = 1)
lines(frame$xVal, plogis(predAll$fit + score * predAll$se.fit), col = "red", lty = 3)
lines(frame$xVal, plogis(predAll$fit - score * predAll$se.fit), col = "red", lty = 3)
points(xVal, randRatio, col = "red") # Original probabilities
points(xVal, randBinom, col = "black", lwd = 3) # Binomial Points used in glm
Here's the plot, presumably with 'point prediction intervals' (i.e., "Prediction Intervals") in dashed red, and the mean fit in solid red. Black dots represent the discrete binomial data from original probabilities in randRatio:

I am not sure whether you are asking for a plain prediction interval, but if you are, you can calculate it simply.
You can extract a traditional confidence interval for the model coefficients like this:
confint(mode1)
Then, once you have run a prediction with type = "link" and se.fit = TRUE, you can calculate an interval around the prediction like so:
upper = predAll$fit + 1.96 * predAll$se.fit
lower = predAll$fit - 1.96 * predAll$se.fit
You are simply taking the prediction (at any given point if you use a single set of predictor variables) and adding and subtracting 1.96 times the standard error. (1.96 is the 97.5th percentile of the standard normal distribution, so plus or minus 1.96 standard errors covers the central 95%.) These bounds are on the link (log-odds) scale, so pass them through plogis() to get probabilities between 0 and 1.
Note that se.fit is the standard error of the fitted mean, so strictly speaking this is the usual Wald confidence interval for the mean response (the fitted probability), not an interval for a new observation. An interval for a new observation would have to be wider, because it must also cover the variability of the outcome itself, and for a 0/1 outcome a new observation can only ever be 0 or 1.
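As a minimal sketch of the calculation described above, here is the interval built on the link scale and back-transformed with plogis(), using the data from the question. The data frame name band and the use of qnorm(0.975) in place of the rounded 1.96 are choices made for this illustration only, not something taken from the original answer.

```r
# Data from the question
xVal <- c(15,15,17,18,32,33,41,42,47,50,
          53,55,62,63,64,65,66,68,70,79,
          94,94,94,95,98)
randRatio <- c(.01,.03,.05,.04,.01,.2,.1,.08,.88,.2,
               .2,.99,.49,.88,.2,.88,.66,.87,.66,.90,
               .98,.88,.95,.95,.95)
randBinom <- ifelse(randRatio < .5, 0, 1)

mode1 <- glm(randBinom ~ xVal, family = binomial(link = "logit"))

# Predict on the link (log-odds) scale, with standard errors of the fitted mean
frame <- data.frame(xVal = 0:100)
predAll <- predict(mode1, newdata = frame, type = "link", se.fit = TRUE)

# Wald interval on the link scale, then back-transform to probabilities
z <- qnorm(0.975)
band <- data.frame(
  xVal = frame$xVal,
  fit  = plogis(predAll$fit),
  lwr  = plogis(predAll$fit - z * predAll$se.fit),
  upr  = plogis(predAll$fit + z * predAll$se.fit)
)
head(band)
```

Because plogis() is monotone, the back-transformed bounds stay inside [0, 1] and always bracket the fitted probability. They can be drawn with lines(band$xVal, band$lwr, lty = 3) and the same call for band$upr.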
Update:
Method for plotting prediction intervals courtesy of RStudio!
As requested... though not done by me!

Related

R plot confidence interval lines with a robust linear regression model (rlm)

I need to plot a Scatterplot with the confidence interval for a robust linear regression (rlm) model, all the examples I had found only work with LM.
This is my code:
model1 <- rlm(weightsE$brain ~ weightsE$body)
newx <- seq(min(weightsE$body), max(weightsE$body), length.out=70)
newx<-as.data.frame(newx)
colnames(newx)<-"brain"
conf_interval <- predict(model1, newdata = data.frame(x=newx), interval = 'confidence',
level=0.95)
#create scatterplot of values with regression line
plot(weightsE$body, weightsE$body)
abline(model1)
#add dashed lines (lty=2) for the 95% confidence interval
lines(newx, conf_interval[,2], col="blue", lty=2)
lines(newx, conf_interval[,3], col="blue", lty=2)
but the results of predict don't produce a straight line for the upper and lower level, they are more like random predictions.
You have a few problems to fix here.
When you generate a model, don't use rlm(weightsE$brain ~ weightsE$body); instead use rlm(brain ~ body, data = weightsE). Otherwise, the model cannot take new data for predictions. Any predictions you get will be produced from the original weightsE$body values, not from the new data you pass into predict.
You are trying to create a prediction data frame with a column called "brain", but "brain" is the value you are trying to predict, so you need a column called "body".
newx is already a data frame, but for some reason you are wrapping it inside another data frame when you do newdata = data.frame(x=newx). Just pass newx.
You are plotting with plot(weightsE$body, weightsE$body), when it should be plot(weightsE$body, weightsE$brain)
Putting all this together, and using a dummy data set with the same names as your own (see below), we get:
library(MASS)
model1 <- rlm(brain ~ body, data = weightsE)
newx <- data.frame(body = seq(min(weightsE$body),
max(weightsE$body), length.out=70))
conf_interval <- predict(model1, newdata = newx,
interval = 'confidence',
level=0.95)
#create scatterplot of values with regression line
plot(weightsE$body, weightsE$brain)
abline(model1)
#add dashed lines (lty=2) for the 95% confidence interval
lines(newx$body, conf_interval[, 2], col = "blue", lty = 2)
lines(newx$body, conf_interval[, 3], col = "blue", lty = 2)
Incidentally, you could do the whole thing in ggplot in much less code:
library(ggplot2)
ggplot(weightsE, aes(body, brain)) +
geom_point() +
geom_smooth(method = MASS::rlm)
Reproducible dummy data
data(mtcars)
weightsE <- setNames(mtcars[c(1, 6)], c("brain", "body"))
weightsE$body <- 10 - weightsE$body

How to solve these problems about inverted ROC curve, small AUC, and the cutoff?

I am constructing this ROC curve from my SVM model, but the curve came out inverted. Also, although my SVM prediction has high accuracy (~93%), my ROC curve shows that my area under the curve is just about 2.7%. Moreover, it tells me that the optimal cutoff value is infinity, which is not what I expected from my model fitting.
I fitted my SVM model using the built-in svm() function, as in the code below, and then generated predictions with predict(). Next, I built the prediction() object and computed the performance(), the cutoff value, and the AUC (all code shown below).
svm.fit <- svm(label ~ NDAI + SD + CORR, data = trainSet, scale = FALSE, kernel = "radial", cost = 2, probability=TRUE)
svm.pred <- predict(svm.fit, testSet, probability=TRUE)
mean(svm.pred== testSet$label)*100
prediction.svm <- prediction(attr(svm.pred, "probabilities")[,2], testSet$label)
eval.svm <- performance(prediction.svm, "acc")
roc.svm <- performance(prediction.svm, "tpr", "fpr")
#identify best values and cutoff
max_index.svm <- which.max(slot(eval.svm, "y.values")[[1]])
max.acc_svm <- (slot(eval.svm, "y.values")[[1]])[max_index.svm]
opt.cutoff_svm <- (slot(eval.svm, "x.values")[[1]])[max_index.svm][[1]]
#AUC
auc.svm <- performance(prediction.svm, "auc")
auc.svm <- unlist(slot(auc.svm, "y.values"))
auc.svm <- round(auc.svm, 4)
plot(roc.svm,colorize=TRUE)
points(0.072, 0.93, pch= 20)
legend(.6,.2, auc.svm, title = "AUC", cex = 0.8)
legend(.8,.2, round(opt.cutoff_svm,4), title = "cutoff", cex = 0.8)
I expect a curve with an AUC close to 1 and a cutoff close to 0.5. Has anyone encountered a similar problem? If so, how should I fix my code?
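One thing worth checking, offered here as an assumption and not as a confirmed diagnosis: an AUC of about 0.027 is almost exactly 1 minus 0.973, which is what you would see if the score passed to prediction() were the probability of the other class. The base R sketch below uses made-up scores and a hypothetical helper, auc_rank(), to show that swapping the probability column mirrors the AUC around 0.5. It does not use e1071 or ROCR.

```r
# Rank-based (Mann-Whitney) AUC: the probability that a randomly chosen
# positive case gets a higher score than a randomly chosen negative case.
# auc_rank is a hypothetical helper written for this illustration.
auc_rank <- function(score, is_pos) {
  r <- rank(score)
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)
is_pos <- rep(c(TRUE, FALSE), each = 50)

# Made-up class probabilities from a classifier that separates the classes well
p_pos <- ifelse(is_pos, rbeta(100, 5, 1), rbeta(100, 1, 5))  # P(positive class)
p_neg <- 1 - p_pos                                           # P(negative class)

auc_right <- auc_rank(p_pos, is_pos)  # score = probability of the positive class
auc_wrong <- auc_rank(p_neg, is_pos)  # score = probability of the other class

c(right = auc_right, wrong = auc_wrong, sum = auc_right + auc_wrong)
```

If this is what is happening, selecting the column of attr(svm.pred, "probabilities") by its class name instead of by position [,2] makes the choice of positive class explicit.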

Bootstrapping Confidence Envelope for Binary Logistic Model in R

I am trying to develop a lower and upper 95% CI for a binary logistic regression model in R for some biological data. The response is pregnancy state based on hormone values, so an individual is either pregnant (1) or not (0). I then predict a series of unknown values across the model to get probabilities of pregnancy for individuals. I need to develop a 95% upper and lower CI envelope for the model and plot it. I have been able to do this in Matlab, but I cannot get it to work in R: I want to use a bootstrap function to generate 1000 replicates of the fitted curve and then take the 0.975 and 0.025 quantiles at each point as my CIs. Any help and feedback would be great. Thanks a bunch.
R code:
model5 <-glm(Preg~logP4, data = controls, family=binomial(link="logit"))
summary(model5)
range(controls$logP4)
xlogP4 <- seq(-1, 3, 0.01)
ylogP4 <- predict(model5, list(logP4=xlogP4), type ="response")
plot(controls$logP4, controls$Preg, pch =16, xlab ="Log10(Progesterone)", ylab ="Probability of being pregnant")
curve(predict(model5, data.frame(logP4=x), type="resp"), add=TRUE)
Matlab code:
data = GoMControlsFinal;
x = data(1:29,4)%logP4 value
X = table2array(x)
y = data(1:29,6)% pregnant binary response
Y = table2array(y)
x1 = (-1:0.001:3)'
[b,dev,stats] = glmfit(X,Y,'binomial', 'logit') % linear regression analysis
yfit = glmval(b, x1, 'logit') % linear regression analysis
%% not yet giving me a P of 0 to 1 for pregnancy, still working as linear model
for i=1:10000 %number of replicates
b2 = bootstrp(1,@glmBfit,X,Y); %generates bootstrap error envelope
yfitBoot(:,i) = glmval(b2', x1, 'logit');
%plot (x1, yfitBoot(:,i), '-','LineWidth',1)
end
s =sort(yfitBoot');
s_lo = s(250,:) %number of replicates * 0.025
s_hi = s(9750,:)%number of replicates * 0.975
s_lo3 = s_lo'
s_hi3 = s_hi'
figure
z1= plot(x1, s_lo, 'b:', 'linewidth',2) % CI low line
hold on
z2 = plot(x1, s_hi, 'b:', 'linewidth',2) %ci hi line
z3= plot (x1, yfit, 'k-', 'LineWidth',2) % Model line
z4=scatter(X,Y, 'r', 'filled')
legend([z1, z3, z4], {'95% CI','Logistic Model', 'GoM Control Samples'})
xlabel('Log10 progesterone concentration')
ylabel('Probability of being pregnant')
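A minimal base R sketch of the same percentile bootstrap is shown below. The controls data frame here is simulated stand-in data (the real data are not in the question), and the names grid, boot_fit and ci are made up for this illustration. The idea follows the Matlab loop: resample rows with replacement, refit the model, predict on a fixed grid, then take the 2.5% and 97.5% quantiles at each grid point.

```r
set.seed(42)

# Hypothetical stand-in for the `controls` data frame (29 rows, as in the Matlab code)
n <- 29
controls <- data.frame(logP4 = runif(n, -1, 3))
controls$Preg <- rbinom(n, 1, plogis(-2 + 2 * controls$logP4))

model5 <- glm(Preg ~ logP4, data = controls, family = binomial(link = "logit"))

# Fixed grid of predictor values on which every bootstrap curve is evaluated
grid <- data.frame(logP4 = seq(-1, 3, by = 0.01))
fit <- predict(model5, newdata = grid, type = "response")

# Case-resampling bootstrap: each replicate refits the model on resampled rows
n_boot <- 1000
boot_fit <- replicate(n_boot, {
  idx <- sample(nrow(controls), replace = TRUE)
  # Small resamples can be perfectly separated; glm() then warns, so the
  # warnings are silenced here and the (extreme) fitted curve is kept
  m <- suppressWarnings(
    glm(Preg ~ logP4, data = controls[idx, ], family = binomial(link = "logit"))
  )
  predict(m, newdata = grid, type = "response")
})

# boot_fit has one row per grid point and one column per replicate;
# take pointwise percentiles to get the envelope
ci <- apply(boot_fit, 1, quantile, probs = c(0.025, 0.975))
```

The envelope can then be added to the existing plot with lines(grid$logP4, ci[1, ], lty = 2) and lines(grid$logP4, ci[2, ], lty = 2). The boot package offers a more complete interface for this, but plain sample() keeps the sketch free of dependencies.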

How to obtain profile confidence intervals of the difference in probability of success between two groups from a logit model (glmer)?

I am struggling to transform the log odds ratio profile confidence intervals obtained from a logit model into probabilities. I would like to know how to calculate the confidence intervals of the difference between two groups.
If the p-value is > 0.05, the 95% CI of the difference should span from below zero to above zero. However, I don’t know how negative values can be obtained when the log ratios have to be exponentiated. Therefore I tried to calculate the CI of one of the groups (B) and see what the difference of the lower and the upper end of the CI to the estimate of group A is. I believe this is not the correct way to calculate the CI of the difference because the estimate of A is also uncertain.
I would be happy if anyone could help me out.
library(lme4)
# Example data:
set.seed(11)
treatment = c(rep("A",30), rep("B", 40))
site = rep(1:14, each = 5)
presence = c(rbinom(30, 1, 0.6),rbinom(40, 1, 0.8))
df = data.frame(presence, treatment, site)
# Likelihood ratio test
M0 = glmer(presence ~ 1 + (1|site), family = "binomial", data = df)
M1 = glmer(presence ~ treatment + (1|site), family = "binomial", data = df)
anova(M1, M0)
# Calculating confidence intervals
cc <- confint(M1, parm = "beta_")
ctab <- cbind(est = fixef(M1), cc)
cdat = as.data.frame(ctab)
# Function to back-transform to probability (0-1)
unlogit = function(y){
y_retransformed = exp(y)/(1+exp(y))
y_retransformed
}
# Getting estimates
A_est = unlogit(cdat$est[1])
B_est = unlogit(cdat$est[1] + cdat$est[2])
B_lwr = unlogit(cdat$est[1] + cdat[2,2])
B_upr = unlogit(cdat$est[1] + cdat[2,3])
Difference_est = B_est - A_est
# This is how I tried to calculate the CI of the difference
Difference_lwr = B_lwr - A_est
Difference_upr = B_upr - A_est
# However, I believe this is wrong because A_est is also “uncertain”
How to get the confidence interval of the difference of the probability of presence?
We can calculate the average treatment effect in the following way. From the original data, create two new datasets, one in which all units receive treatment A, and one in which all units receive treatment B. Now, based on your model estimates (in your case, M1), we compute predicted outcomes for units in each of these two datasets. We then compute the mean difference in the outcomes between the two datasets to get our estimated average treatment effect. Here, we can write a function that takes a glmer object and computes the average treatment effect:
ate <- function(.) {
treat_A <- treat_B <- df
treat_A$treatment <- "A"
treat_B$treatment <- "B"
c("ate" = mean(predict(., newdata = treat_B, type = "response") -
predict(., newdata = treat_A, type = "response")))
}
ate(M1)
# ate
# 0.09478276
How do we get the uncertainty interval? We can use the bootstrap, i.e. re-estimate the model many times using randomly generated samples from your original data, calculating the average treatment effect each time. We can then use the distribution of the bootstrapped average treatment effects to compute our uncertainty interval. Here we generate 100 simulations using the bootMer function:
out <- bootMer(M1, ate, seed = 1234, nsim = 100)
and inspect the distribution of the effect:
quantile(out$t, c(0.025, 0.5, 0.975))
# 2.5% 50% 97.5%
# -0.06761338 0.10508751 0.26907504

How to plot confidence bands for my weighted log-log linear regression?

I need to plot an exponential species-area relationship using the exponential form of a weighted log-log linear model, where mean species number per location/Bank (sb$NoSpec.mean) is weighted by the variance in species number per year (sb$NoSpec.var).
I am able to plot the fit, but have issues figuring out how to plot the confidence intervals around this fit. The following is the best I have come up with so far. Any advice for me?
# Data
df <- read.csv("YearlySpeciesCount_SizeGroups.csv")
require(doBy)
sb <- summaryBy(NoSpec ~ Short + Area + Regime + SizeGrp, df,
FUN=c(mean,var, length))
# Plot to fill
plot(S ~ A, xlab = "Bank Area (km2)", type = "n", ylab = "Species count",
ylim = c(min(S), max(S)))
text(A, S, label = Pisc$Short, col = 'black')
# The Arrhenius model
require(vegan)
gg <- data.frame(S=S, A=A, W=W)
mloglog <- lm(log(S) ~ log(A), weights = 1 / (log10(W + 1)), data = gg)
# Add exponential fit to plot (this works well)
lines(xtmp, exp(predict(mloglog, newdata = data.frame(A = xtmp))),
lty=1, lwd=2)
Now I want to add confidence bands... This is where I'm finding issues...
## predict using original model.. get standard errors
pp<-data.frame(A = xtmp)
p <- predict(mloglog, newdata = pp, se.fit = TRUE)
pp$fit <- p$fit
pp$se <- p$se.fit
## Calculate lower and upper bounds for each estimate using standard error * 1.96
pp$upr95 <- pp$fit + (1.96 * pp$se)
pp$lwr95 <- pp$fit - (1.96 * pp$se)
But I am not sure whether the following is correct. I couldn't find any answers that didn't involve ggplot when searching google / stack overflow / cross validated.
## Create new linear models to create a fitted line given upper and lower bounds?
upr <- lm(log(upr95) ~ log(A), data=pp)
lwr <- lm(log(lwr95) ~ log(A), data=pp)
lines(xtmp, exp(predict(upr, newdata=pp)), lty=2, lwd=1)
lines(xtmp, exp(predict(lwr, newdata=pp)), lty=2, lwd=1)
Thanks in advance for any help!
It is OK for this question to come without data, because:
OP's code is said to be working fine, so there is nothing "not working";
this question is more about statistical procedure: what is the right thing to do.
I will keep this answer brief, as I saw you added "solved" to the question title in your last update. Note that adding such a keyword to the question title is not recommended. If something is solved, post an answer.
Strictly speaking, using 1.96 is incorrect. You can read How does predict.lm() compute confidence interval and prediction interval? for details. We need the residual degrees of freedom and the 0.975 quantile of the t-distribution (equivalently, the 0.025 quantile with the sign flipped).
The point is that predict.lm can return the confidence interval for you:
pp <- data.frame(A = xtmp)
p <- predict(mloglog, newdata = pp, interval = "confidence")
p will be a three-column matrix with columns "fit", "lwr" and "upr".
Since you fitted a log-log model, both the fitted values and the confidence interval need to be back-transformed. Simply take exp of this matrix p:
p <- exp(p)
Now you can easily use matplot to produce nice regression plot:
matplot(xtmp, p, type = "l", col = c(1, 2, 2), lty = c(1, 2, 2))
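To make the point about 1.96 versus the t quantile concrete, here is a small check on simulated stand-in data (the gg data frame below is made up, since the original CSV is not available). It builds the interval by hand from se.fit and the residual degrees of freedom and compares it with what predict.lm returns for interval = "confidence".

```r
set.seed(1)

# Hypothetical stand-in for the species-area data
gg <- data.frame(A = exp(runif(20, 0, 6)))
gg$S <- exp(1 + 0.3 * log(gg$A) + rnorm(20, sd = 0.2))
gg$W <- runif(20, 1, 10)

mloglog <- lm(log(S) ~ log(A), weights = 1 / log10(W + 1), data = gg)
pp <- data.frame(A = seq(min(gg$A), max(gg$A), length.out = 50))

# Manual interval: fitted value plus or minus a t quantile times the standard error
p <- predict(mloglog, newdata = pp, se.fit = TRUE)
tcrit <- qt(0.975, df = p$df)  # p$df is the residual degrees of freedom
manual <- cbind(fit = p$fit,
                lwr = p$fit - tcrit * p$se.fit,
                upr = p$fit + tcrit * p$se.fit)

# Interval as returned directly by predict.lm
builtin <- predict(mloglog, newdata = pp, interval = "confidence")

all.equal(unname(manual), unname(builtin))

# Back-transform both the fit and the bounds to the original scale
band <- exp(builtin)
```

With 18 residual degrees of freedom the t quantile is noticeably larger than 1.96, so the normal approximation would give a band that is slightly too narrow.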
