R plot confidence interval lines with a robust linear regression model (rlm)

I need to plot a scatterplot with the confidence interval for a robust linear regression (rlm) model; all the examples I have found only work with lm.
This is my code:
model1 <- rlm(weightsE$brain ~ weightsE$body)
newx <- seq(min(weightsE$body), max(weightsE$body), length.out=70)
newx<-as.data.frame(newx)
colnames(newx)<-"brain"
conf_interval <- predict(model1, newdata = data.frame(x=newx), interval = 'confidence',
level=0.95)
#create scatterplot of values with regression line
plot(weightsE$body, weightsE$body)
abline(model1)
#add dashed lines (lty=2) for the 95% confidence interval
lines(newx, conf_interval[,2], col="blue", lty=2)
lines(newx, conf_interval[,3], col="blue", lty=2)
But the results of predict don't produce straight lines for the upper and lower levels; they look more like random predictions.

You have a few problems to fix here.
When you generate the model, don't use rlm(weightsE$brain ~ weightsE$body); use rlm(brain ~ body, data = weightsE) instead. Otherwise the model cannot take new data for predictions: any predictions you get will be produced from the original weightsE$body values, not from the new data you pass into predict().
You are trying to create a prediction data frame with a column called "brain", but since "brain" is the value you are predicting, the new data needs a column called "body".
newx is already a data frame, so there is no need to wrap it inside another data frame with newdata = data.frame(x=newx). Just pass newx.
You are plotting with plot(weightsE$body, weightsE$body), when it should be plot(weightsE$body, weightsE$brain).
Putting all this together, and using a dummy data set with the same names as your own (see below), we get:
library(MASS)
model1 <- rlm(brain ~ body, data = weightsE)
newx <- data.frame(body = seq(min(weightsE$body),
max(weightsE$body), length.out=70))
conf_interval <- predict(model1, newdata = newx,
                         interval = 'confidence',
                         level = 0.95)
#create scatterplot of values with regression line
plot(weightsE$body, weightsE$brain)
abline(model1)
#add dashed lines (lty=2) for the 95% confidence interval
lines(newx$body, conf_interval[, 2], col = "blue", lty = 2)
lines(newx$body, conf_interval[, 3], col = "blue", lty = 2)
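As a small variant (not part of the original answer), both interval lines can be drawn in a single call with matlines():
matlines(newx$body, conf_interval[, 2:3], col = "blue", lty = 2)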
Incidentally, you could do the whole thing in ggplot in much less code:
library(ggplot2)
ggplot(weightsE, aes(body, brain)) +
geom_point() +
geom_smooth(method = MASS::rlm)
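geom_smooth() draws a 95% confidence ribbon by default; if you want to make that explicit (a minimal variant, not part of the original answer), the se and level arguments control it:
ggplot(weightsE, aes(body, brain)) +
  geom_point() +
  geom_smooth(method = MASS::rlm, se = TRUE, level = 0.95)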
Reproducible dummy data
data(mtcars)
weightsE <- setNames(mtcars[c(1, 6)], c("brain", "body"))
weightsE$body <- 10 - weightsE$body

Related

(R) Adding Confidence Intervals To Plots

I am using R. I am following this tutorial over here (https://rviews.rstudio.com/2017/09/25/survival-analysis-with-r/) and I am trying to adapt the code for a similar problem.
In this tutorial, a statistical model is developed on a dataset and then this statistical model is used to predict 3 new observations. We then plot the results for these 3 observations:
#load libraries
library(survival)
library(dplyr)
library(ranger)
library(data.table)
library(ggplot2)
#use the built in "lung" data set
#remove missing values (dataset is called "a")
a = na.omit(lung)
#create id variable
a$ID <- seq_along(a[,1])
#create test set with only the first 3 rows
new = a[1:3,]
#create a training set by removing first three rows
a = a[-c(1:3),]
#fit survival model (random survival forest)
r_fit <- ranger(Surv(time,status) ~ age + sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss, data = a, mtry = 4, importance = "permutation", splitrule = "extratrees", verbose = TRUE)
#create new intermediate variables required for the survival curves
death_times <- r_fit$unique.death.times
surv_prob <-data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob, mean)
#use survival model to produce estimated survival curves for the first three observations
pred <- predict(r_fit, new, type = 'response')$survival
pred <- data.table(pred)
colnames(pred) <- as.character(r_fit$unique.death.times)
#plot the results for these 3 patients
plot(r_fit$unique.death.times, pred[1,], type = "l", col = "red")
lines(r_fit$unique.death.times, r_fit$survival[2,], type = "l", col = "green")
lines(r_fit$unique.death.times, r_fit$survival[3,], type = "l", col = "blue")
From here, I would like to add confidence intervals (confidence regions) to each of these 3 curves.
I found a previous stackoverflow post (survfit() Shade 95% confidence interval survival plot ) that shows how to do something similar, but I am not sure how to extend the results from this post to each individual observation.
Does anyone know if there is a direct way to add these confidence intervals?
Thanks
If you create your plot using ggplot, you can use the geom_ribbon function to draw confidence intervals as follows:
ggplot(data = ...) +
  geom_line(aes(x = ..., y = ...), color = ...) +
  geom_ribbon(aes(x = ..., ymin = ..., ymax = ...), fill = ..., alpha = ...) +
  geom_line(aes(x = ..., y = ...), color = ...) +
  geom_ribbon(aes(x = ..., ymin = ..., ymax = ...), fill = ..., alpha = ...)
You can put + after geom_line and repeat the same steps for each observation.
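As a rough sketch of what that could look like for the three curves above, using the objects from the question (r_fit and pred): the reshaping below is only illustrative, and the lower/upper columns are placeholders set to a fixed ±0.1 band, since ranger does not return per-patient confidence bounds directly; you would replace them with bounds you compute yourself.
library(ggplot2)
# Reshape the three predicted survival curves into long format
plot_df <- data.frame(
  time    = rep(r_fit$unique.death.times, 3),
  surv    = c(t(as.matrix(pred[1:3, ]))),
  patient = factor(rep(1:3, each = length(r_fit$unique.death.times)))
)
# Placeholder bounds only: substitute your real lower/upper limits here
plot_df$lower <- pmax(plot_df$surv - 0.1, 0)
plot_df$upper <- pmin(plot_df$surv + 0.1, 1)
ggplot(plot_df, aes(x = time, y = surv, colour = patient)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, fill = patient),
              alpha = 0.2, colour = NA) +
  geom_line()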
You can also check:
Having trouble plotting multiple data sets and their confidence intervals on the same GGplot. Data Frame included and
https://bookdown.org/ripberjt/labbook/appendix-guide-to-data-visualization.html

Plotting a 95% confidence interval for a lm object

How can I calculate and plot a confidence interval for my regression in R? So far I have two numerical vectors of equal length (x, y) and a regression object (lm.out). I have made a scatterplot of y given x and added the regression line to this plot. I am looking for a way to add a 95% prediction confidence band for lm.out to the plot. I've tried using the predict function, but I don't even know where to start with that :/. Here is my code at the moment:
x=c(1,2,3,4,5,6,7,8,9,0)
y=c(13,28,43,35,96,84,101,110,108,13)
lm.out <- lm(y ~ x)
plot(x,y)
regression.data = summary(lm.out) #save regression summary as variable
names(regression.data) #get names so we can index this data
a= regression.data$coefficients["(Intercept)","Estimate"] #grab values
b= regression.data$coefficients["x","Estimate"]
abline(a,b) #add the regression line
Thank you!
Edit: I've taken a look at the proposed duplicate and can't quite get to the bottom of it.
You have to use predict with a new vector of data, here newx.
x=c(1,2,3,4,5,6,7,8,9,0)
y=c(13,28,43,35,96,84,101,110,108,13)
lm.out <- lm(y ~ x)
newx = seq(min(x),max(x),by = 0.05)
conf_interval <- predict(lm.out, newdata=data.frame(x=newx), interval="confidence",
level = 0.95)
plot(x, y, xlab="x", ylab="y", main="Regression")
abline(lm.out, col="lightblue")
lines(newx, conf_interval[,2], col="blue", lty=2)
lines(newx, conf_interval[,3], col="blue", lty=2)
EDIT
As mentioned in the comments by Ben, this can be done with matlines as follows:
plot(x, y, xlab="x", ylab="y", main="Regression")
abline(lm.out, col="lightblue")
matlines(newx, conf_interval[,2:3], col = "blue", lty=2)
I'm going to add a tip that would have saved me a lot of frustration when trying the method given by @Alejandro Andrade: if your data are in a data frame, then when you build your model with lm(), use the data = argument rather than $ notation. E.g., use
lm.out <- lm(y ~ x, data = mydata)
rather than
lm.out <- lm(mydata$y ~ mydata$x)
If you do the latter, then this statement
predict(lm.out, newdata=data.frame(x=newx), interval="confidence", level = 0.95)
seems to either ignore the new values passed using newdata= or there's a silent error. Either way, the output is the predictions from the original data, not the new data.
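A minimal illustration of that behaviour, using the built-in mtcars data rather than the poster's data:
# Fit the same regression two ways
fit_dollar <- lm(mtcars$mpg ~ mtcars$wt)       # $ notation
fit_data   <- lm(mpg ~ wt, data = mtcars)      # data = argument
newx <- data.frame(wt = seq(1.5, 5.5, by = 0.5))
length(predict(fit_dollar, newdata = newx))    # 32: the new data are silently ignored
length(predict(fit_data,   newdata = newx))    # 9: one prediction per new wt value
In the first case R typically also warns that newdata had 9 rows but the variables found have 32 rows, which is the clue that the original data were used.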
Also, be sure your x variable gets the same name in the new data frame that it had
in the original. That's easier to figure out because you do get an error, but knowing it ahead of time might save you a round of debugging.
Note: Tried to add this as a comment, but don't have enough reputation points.

How to plot confidence bands for my weighted log-log linear regression?

I need to plot an exponential species-area relationship using the exponential form of a weighted log-log linear model, where mean species number per location/Bank (sb$NoSpec.mean) is weighted by the variance in species number per year (sb$NoSpec.var).
I am able to plot the fit, but have issues figuring out how to plot the confidence intervals around this fit. The following is the best I have come up with so far. Any advice for me?
# Data
df <- read.csv("YearlySpeciesCount_SizeGroups.csv")
require(doBy)
sb <- summaryBy(NoSpec ~ Short + Area + Regime + SizeGrp, df,
FUN=c(mean,var, length))
# Plot to fill
plot(S ~ A, xlab = "Bank Area (km2)", type = "n", ylab = "Species count",
ylim = c(min(S), max(S)))
text(A, S, label = Pisc$Short, col = 'black')
# The Arrhenius model
require(vegan)
gg <- data.frame(S=S, A=A, W=W)
mloglog <- lm(log(S) ~ log(A), weights = 1 / (log10(W + 1)), data = gg)
# Add exponential fit to plot (this works well)
lines(xtmp, exp(predict(mloglog, newdata = data.frame(A = xtmp))),
lty=1, lwd=2)
Now I want to add confidence bands... This is where I'm finding issues...
## predict using original model.. get standard errors
pp<-data.frame(A = xtmp)
p <- predict(mloglog, newdata = pp, se.fit = TRUE)
pp$fit <- p$fit
pp$se <- p$se.fit
## Calculate lower and upper bounds for each estimate using standard error * 1.96
pp$upr95 <- pp$fit + (1.96 * pp$se)
pp$lwr95 <- pp$fit - (1.96 * pp$se)
But I am not sure whether the following is correct. I couldn't find any answers that didn't involve ggplot when searching google / stack overflow / cross validated.
## Create new linear models to create a fitted line given upper and lower bounds?
upr <- lm(log(upr95) ~ log(A), data=pp)
lwr <- lm(log(lwr95) ~ log(A), data=pp)
lines(xtmp, exp(predict(upr, newdata=pp)), lty=2, lwd=1)
lines(xtmp, exp(predict(lwr, newdata=pp)), lty=2, lwd=1)
Thanks in advance for any help!
It is OK for this question to come without example data, because:
the OP's code is said to be working fine, so there is nothing "not working";
this question is more about statistical procedure: what is the right thing to do.
I will give a brief answer, as I saw you added "solved" to the question title in your last update. Note that it is not recommended to add such a keyword to the title; if something is solved, post an answer instead.
Strictly speaking, using 1.96 is incorrect. You can read How does predict.lm() compute confidence interval and prediction interval? for details. We need the residual degrees of freedom and the 97.5% quantile (upper 2.5% point) of the t-distribution.
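If you do want to build the bounds by hand, the exact multiplier would come from qt() rather than 1.96 (a small sketch, reusing the pp data frame from the question):
# Exact t-based multiplier instead of the normal approximation 1.96
crit <- qt(0.975, df.residual(mloglog))
pp$upr95 <- pp$fit + crit * pp$se
pp$lwr95 <- pp$fit - crit * pp$se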
What I want to say is that predict.lm can return the confidence interval for you:
pp <- data.frame(A = xtmp)
p <- predict(mloglog, newdata = pp, interval = "confidence")
p will be a three-column matrix, with "fit", "lwr" and "upr".
Since you fitted a log-log model, both the fitted values and the confidence interval need to be back-transformed. Simply take exp of this matrix p:
p <- exp(p)
Now you can easily use matplot to produce a nice regression plot:
matplot(xtmp, p, type = "l", col = c(1, 2, 2), lty = c(1, 2, 2))
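If you would rather keep your existing plot (the one with the text labels) instead of starting a fresh one with matplot, matlines() adds the same three curves to the current device (a minor variant, not in the original answer):
matlines(xtmp, p, col = c(1, 2, 2), lty = c(1, 2, 2))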

Prediction and Confidence intervals for Logistic Regression

Below is a set of fictitious probability data, which I converted into binomial with a threshold of 0.5. I ran a glm() model on the discrete data to test whether the intervals returned from glm() were 'mean prediction intervals' ("confidence intervals") or 'point prediction intervals' ("prediction intervals"). It appears from the plot below that the returned intervals are the latter, 'point prediction intervals'; note that, with 95% confidence, 2/20 points fall outside of the line in this sample.
If this is indeed the case, how do I generate the 'mean prediction interval' (i.e., "confidence intervals") in R for a binomial data set bound by 0 and 1 using glm()? Please show your code and a plot similar to mine with the fit line, given probabilities, 'confidence intervals' and 'prediction intervals'.
# Fictitious data
xVal <- c(15,15,17,18,32,33,41,42,47,50,
53,55,62,63,64,65,66,68,70,79,
94,94,94,95,98)
randRatio <- c(.01,.03,.05,.04,.01,.2,.1,.08,.88,.2,
.2,.99,.49,.88,.2,.88,.66,.87,.66,.90,
.98,.88,.95,.95,.95)
# Converted to binomial
randBinom <- ifelse(randRatio < .5, 0, 1)
# Data frame for model
binomData <- data.frame(
randBinom = randBinom,
xVal = xVal
)
# Model
mode1 <- glm(randBinom~ xVal, data = binomData, family = binomial(link = "logit"))
# Predict all points in xVal range
frame <- data.frame(xVal=(0:100))
predAll <- predict(mode1, newdata = frame,type = "link", se.fit=TRUE)
# Params for intervals and plot
confidence <- .95
score <- qnorm((confidence / 2) + .5)
frame <- data.frame(xVal=(0:100))
#Plot
with(binomData, plot(xVal, randBinom, type="n", ylim=c(0, 1),
ylab = "Probability", xlab="xVal"))
lines(frame$xVal, plogis(predAll$fit), col = "red", lty = 1)
lines(frame$xVal, plogis(predAll$fit + score * predAll$se.fit), col = "red", lty = 3)
lines(frame$xVal, plogis(predAll$fit - score * predAll$se.fit), col = "red", lty = 3)
points(xVal, randRatio, col = "red") # Original probabilities
points(xVal, randBinom, col = "black", lwd = 3) # Binomial Points used in glm
Here's the plot, presumably with 'point prediction intervals' (i.e., "Prediction Intervals") in dashed red, and the mean fit in solid red. Black dots represent the discrete binomial data from original probabilities in randRatio:
I am not sure if you are asking for the straight up prediction interval, but if you are you can calculate it simply.
You can extract traditional confidence intervals for the model coefficients like so:
confint(mode1)
And then once you run a prediction, you can calculate a prediction interval based on the prediction like so:
upper = predAll$fit + 1.96 * predAll$se.fit
lower = predAll$fit - 1.96 * predAll$se.fit
You are simply taking the prediction (at any given point, for a single set of predictor values) and adding and subtracting 1.96 * the standard error. (1.96 is the 97.5% quantile of the standard normal distribution, so plus or minus 1.96 standard errors covers the central 95% of the distribution.)
This is the same formula you would use for a traditional confidence interval, with the standard error of the fit capturing the uncertainty in the prediction itself.
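Note that predAll was produced with type = "link", so those bounds live on the logit scale; a minimal sketch of mapping them back to probabilities and adding them to the question's plot (using the objects already defined above):
# Back-transform the link-scale bounds to probabilities with plogis()
upper_prob <- plogis(predAll$fit + 1.96 * predAll$se.fit)
lower_prob <- plogis(predAll$fit - 1.96 * predAll$se.fit)
lines(frame$xVal, upper_prob, col = "blue", lty = 2)
lines(frame$xVal, lower_prob, col = "blue", lty = 2)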
Update:
Method for plotting prediction intervals courtesy of RStudio!
As requested...though not done by me!

Conditionally colour data points outside of confidence bands in R

I need to colour data points that are outside of the confidence bands on the plot below differently from those within the bands. Should I add a separate column to my dataset to record whether the data points are within the confidence bands? Can you provide an example please?
Example dataset:
## Dataset from http://www.apsnet.org/education/advancedplantpath/topics/RModules/doc1/04_Linear_regression.html
## Disease severity as a function of temperature
# Response variable, disease severity
diseasesev<-c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
# Predictor variable, (Centigrade)
temperature<-c(2,1,5,5,20,20,23,10,30,25)
## For convenience, the data may be formatted into a dataframe
severity <- as.data.frame(cbind(diseasesev,temperature))
## Fit a linear model for the data and summarize the output from function lm()
severity.lm <- lm(diseasesev~temperature,data=severity)
# Take a look at the data
plot(
diseasesev~temperature,
data=severity,
xlab="Temperature",
ylab="% Disease Severity",
pch=16,
pty="s",
xlim=c(0,30),
ylim=c(0,30)
)
title(main="Graph of % Disease Severity vs Temperature")
par(new=TRUE) # don't start a new plot
## Get datapoints predicted by best fit line and confidence bands
## at every 0.01 interval
xRange=data.frame(temperature=seq(min(temperature),max(temperature),0.01))
pred4plot <- predict(
lm(diseasesev~temperature),
xRange,
level=0.95,
interval="confidence"
)
## Plot lines derived from best fit line and confidence band datapoints
matplot(
xRange,
pred4plot,
lty=c(1,2,2), #vector of line types and widths
type="l", #type of plot for each column of y
xlim=c(0,30),
ylim=c(0,30),
xlab="",
ylab=""
)
Well, I thought that this would be pretty easy with ggplot2, but now I realize that I have no idea how the confidence limits for stat_smooth/geom_smooth are calculated.
Consider the following:
library(ggplot2)
pred <- as.data.frame(predict(severity.lm,level=0.95,interval="confidence"))
dat <- data.frame(diseasesev,temperature,
in_interval = diseasesev <=pred$upr & diseasesev >=pred$lwr ,pred)
ggplot(dat,aes(y=diseasesev,x=temperature)) +
stat_smooth(method='lm') + geom_point(aes(colour=in_interval)) +
geom_line(aes(y=lwr),colour=I('red')) + geom_line(aes(y=upr),colour=I('red'))
This produces:
(plot: http://ifellows.ucsd.edu/pmwiki/uploads/Main/strangeplot.jpg)
I don't understand why the confidence band calculated by stat_smooth is inconsistent with the band calculated directly from predict (i.e. the red lines). Can anyone shed some light on this?
Edit:
Figured it out. ggplot2 uses 1.96 * standard error to draw the intervals for all smoothing methods.
pred <- as.data.frame(predict(severity.lm,se.fit=TRUE,
level=0.95,interval="confidence"))
dat <- data.frame(diseasesev,temperature,
in_interval = diseasesev <=pred$fit.upr & diseasesev >=pred$fit.lwr ,pred)
ggplot(dat,aes(y=diseasesev,x=temperature)) +
stat_smooth(method='lm') +
geom_point(aes(colour=in_interval)) +
geom_line(aes(y=fit.lwr),colour=I('red')) +
geom_line(aes(y=fit.upr),colour=I('red')) +
geom_line(aes(y=fit.fit-1.96*se.fit),colour=I('green')) +
geom_line(aes(y=fit.fit+1.96*se.fit),colour=I('green'))
The easiest way is probably to calculate a vector of TRUE/FALSE values that indicate whether a data point is inside the confidence interval or not. I'm going to reshuffle your example a little bit so that all of the calculations are completed before the plotting commands are executed; this provides a clean separation in the program logic that could be exploited if you were to package some of this into a function.
The first part is pretty much the same, except I replaced the additional call to lm() inside predict() with the severity.lm variable; there is no need to use additional computing resources to recalculate the linear model when we already have it stored:
## Dataset from
# apsnet.org/education/advancedplantpath/topics/
# RModules/doc1/04_Linear_regression.html
## Disease severity as a function of temperature
# Response variable, disease severity
diseasesev<-c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
# Predictor variable, (Centigrade)
temperature<-c(2,1,5,5,20,20,23,10,30,25)
## For convenience, the data may be formatted into a dataframe
severity <- as.data.frame(cbind(diseasesev,temperature))
## Fit a linear model for the data and summarize the output from function lm()
severity.lm <- lm(diseasesev~temperature,data=severity)
## Get datapoints predicted by best fit line and confidence bands
## at every 0.01 interval
xRange=data.frame(temperature=seq(min(temperature),max(temperature),0.01))
pred4plot <- predict(
severity.lm,
xRange,
level=0.95,
interval="confidence"
)
Now, we'll calculate the confidence intervals for the original data points and run a test to see if the points are inside the interval:
modelConfInt <- predict(
severity.lm,
level = 0.95,
interval = "confidence"
)
insideInterval <- modelConfInt[,'lwr'] < severity[['diseasesev']] &
severity[['diseasesev']] < modelConfInt[,'upr']
Then we'll do the plot: first the high-level plotting function plot(), as you used it in your example, but we will only plot the points inside the interval. We will then follow up with the low-level function points(), which will plot all the points outside the interval in a different color. Finally, matplot() will be used to fill in the confidence intervals as you used it. However, instead of calling par(new=TRUE), I prefer to pass the argument add=TRUE to high-level functions to make them act like low-level functions.
Using par(new=TRUE) is like playing a dirty trick on a plotting function, which can have unforeseen consequences. The add argument is provided by many functions to cause them to add information to a plot rather than redraw it; I would recommend exploiting this argument whenever possible and falling back on par() manipulations as a last resort.
# Take a look at the data- those points inside the interval
plot(
diseasesev~temperature,
data=severity[ insideInterval,],
xlab="Temperature",
ylab="% Disease Severity",
pch=16,
pty="s",
xlim=c(0,30),
ylim=c(0,30)
)
title(main="Graph of % Disease Severity vs Temperature")
# Add points outside the interval, color differently
points(
diseasesev~temperature,
pch = 16,
col = 'red',
data = severity[ !insideInterval,]
)
# Add regression line and confidence intervals
matplot(
xRange,
pred4plot,
lty=c(1,2,2), #vector of line types and widths
type="l", #type of plot for each column of y
add = TRUE
)
I liked the idea and tried to make a function for that. Of course it's far from perfect. Your comments are welcome.
diseasesev<-c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
# Predictor variable, (Centigrade)
temperature<-c(2,1,5,5,20,20,23,10,30,25)
## For convenience, the data may be formatted into a dataframe
severity <- as.data.frame(cbind(diseasesev,temperature))
## Fit a linear model for the data and summarize the output from function lm()
severity.lm <- lm(diseasesev~temperature,data=severity)
# Function to plot the linear regression and overlay the confidence intervals
ci.lines<-function(model,conf= .95 ,interval = "confidence"){
x <- model$model[[2]]   # predictor column of the model frame (model[[12]] in a default lm fit)
y <- model$model[[1]]   # response column of the model frame
xm<-mean(x)
n<-length(x)
ssx<- sum((x - mean(x))^2)
s.t<- qt(1-(1-conf)/2,(n-2))
xv<-seq(min(x),max(x),(max(x) - min(x))/100)
yv<- coef(model)[1]+coef(model)[2]*xv
se <- switch(interval,
confidence = summary(model)[[6]] * sqrt(1/n+(xv-xm)^2/ssx),
prediction = summary(model)[[6]] * sqrt(1+1/n+(xv-xm)^2/ssx)
)
# summary(model)[[6]] = 'sigma'
ci<-s.t*se
uyv<-yv+ci
lyv<-yv-ci
limits1 <- min(c(x,y))
limits2 <- max(c(x,y))
predictions <- predict(model, level = conf, interval = interval)
insideCI <- predictions[,'lwr'] < y & y < predictions[,'upr']
x_name <- rownames(attr(model$terms, "factors"))[2]   # predictor name from the terms object
y_name <- rownames(attr(model$terms, "factors"))[1]   # response name
plot(x[insideCI],y[insideCI],
pch=16,pty="s",xlim=c(limits1,limits2),ylim=c(limits1,limits2),
xlab=x_name,
ylab=y_name,
main=paste("Graph of ", y_name, " vs ", x_name,sep=""))
abline(model)
points(x[!insideCI],y[!insideCI], pch = 16, col = 'red')
lines(xv,uyv,lty=2,col=3)
lines(xv,lyv,lty=2,col=3)
}
Use it like this:
ci.lines(severity.lm, conf= .95 , interval = "confidence")
ci.lines(severity.lm, conf= .85 , interval = "prediction")
