Plotting a 95% confidence interval for an lm object - r

How can I calculate and plot a confidence interval for my regression in R? So far I have two numerical vectors of equal length (x, y) and a regression object (lm.out). I have made a scatterplot of y against x and added the regression line to this plot. I am looking for a way to add a 95% confidence band for lm.out to the plot. I've tried using the predict function, but I don't know where to start with it. Here is my code at the moment:
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 0)
y <- c(13, 28, 43, 35, 96, 84, 101, 110, 108, 13)
lm.out <- lm(y ~ x)
plot(x, y)
regression.data <- summary(lm.out)  # save regression summary as a variable
names(regression.data)  # get names so we can index this data
a <- regression.data$coefficients["(Intercept)", "Estimate"]  # grab values
b <- regression.data$coefficients["x", "Estimate"]
abline(a, b)  # add the regression line
Thank you!
Edit: I've taken a look at the proposed duplicate and can't quite get to the bottom of it.

You have to use predict() on a new set of x values, here newx.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 0)
y <- c(13, 28, 43, 35, 96, 84, 101, 110, 108, 13)
lm.out <- lm(y ~ x)
newx <- seq(min(x), max(x), by = 0.05)  # fine grid over the observed x range
conf_interval <- predict(lm.out, newdata = data.frame(x = newx),
                         interval = "confidence", level = 0.95)  # columns: fit, lwr, upr
plot(x, y, xlab = "x", ylab = "y", main = "Regression")
abline(lm.out, col = "lightblue")
lines(newx, conf_interval[, 2], col = "blue", lty = 2)  # lower bound
lines(newx, conf_interval[, 3], col = "blue", lty = 2)  # upper bound
EDIT
As Ben mentions in the comments, this can be done with matlines as follows:
plot(x, y, xlab="x", ylab="y", main="Regression")
abline(lm.out, col="lightblue")
matlines(newx, conf_interval[,2:3], col = "blue", lty=2)
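The question also mentions a prediction band: interval = "prediction" gives the wider band for new observations rather than for the mean response. A minimal sketch, reusing lm.out and newx from above:
pred_interval <- predict(lm.out, newdata = data.frame(x = newx),
                         interval = "prediction", level = 0.95)
matlines(newx, pred_interval[, 2:3], col = "red", lty = 3)  # wider than the CI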

I'm going to add a tip that would have saved me a lot of frustration when trying the method given by @Alejandro Andrade: if your data are in a data frame, then when you build your model with lm(), use the data = argument rather than $ notation. E.g., use
lm.out <- lm(y ~ x, data = mydata)
rather than
lm.out <- lm(mydata$y ~ mydata$x)
If you do the latter, then this statement
predict(lm.out, newdata=data.frame(x=newx), interval="confidence", level = 0.95)
silently ignores the new values passed via newdata=: predict() looks for a variable literally named mydata$x in the new data frame, fails to find it, and falls back to the original data. The output is then the predictions from the original data, not the new data (recent versions of R at least emit a warning when the row counts differ).
Also, be sure your x variable gets the same name in the new data frame that it had
in the original. That's easier to figure out because you do get an error, but knowing it ahead of time might save you a round of debugging.
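A minimal sketch of the gotcha, using a hypothetical data frame mydata (the names here are made up for illustration):
mydata <- data.frame(x = 1:10, y = 3 * (1:10) + rnorm(10))
newx <- seq(1, 10, by = 0.5)  # 19 values
bad  <- lm(mydata$y ~ mydata$x)   # predict() cannot match newdata to 'mydata$x'
good <- lm(y ~ x, data = mydata)  # predict() matches newdata's 'x' column
length(predict(bad,  newdata = data.frame(x = newx)))  # 10: newdata ignored
length(predict(good, newdata = data.frame(x = newx)))  # 19: newdata used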

Related

R plot confidence interval lines with a robust linear regression model (rlm)

I need to plot a scatterplot with the confidence interval for a robust linear regression (rlm) model; all the examples I have found only work with lm.
This is my code:
model1 <- rlm(weightsE$brain ~ weightsE$body)
newx <- seq(min(weightsE$body), max(weightsE$body), length.out=70)
newx<-as.data.frame(newx)
colnames(newx)<-"brain"
conf_interval <- predict(model1, newdata = data.frame(x=newx), interval = 'confidence',
level=0.95)
#create scatterplot of values with regression line
plot(weightsE$body, weightsE$body)
abline(model1)
#add dashed lines (lty=2) for the 95% confidence interval
lines(newx, conf_interval[,2], col="blue", lty=2)
lines(newx, conf_interval[,3], col="blue", lty=2)
but the results of predict don't produce a straight line for the upper and lower level, they are more like random predictions.
You have a few problems to fix here.
1. When you generate the model, don't use rlm(weightsE$brain ~ weightsE$body); instead use rlm(brain ~ body, data = weightsE). Otherwise the model cannot take new data for predictions: anything you pass to predict() is ignored, and the predictions come from the original weightsE$body values.
2. You are creating a prediction data frame with a column called "brain", but you are trying to predict the value of "brain", so the new data frame needs a column called "body".
3. newx is already a data frame, so there is no reason to wrap it inside another data frame with newdata = data.frame(x = newx). Just pass newdata = newx.
4. You are plotting with plot(weightsE$body, weightsE$body), when it should be plot(weightsE$body, weightsE$brain).
Putting all this together, and using a dummy data set with the same names as your own (see below), we get:
library(MASS)
model1 <- rlm(brain ~ body, data = weightsE)
newx <- data.frame(body = seq(min(weightsE$body), max(weightsE$body),
                              length.out = 70))  # new body values over the observed range
conf_interval <- predict(model1, newdata = newx,
                         interval = 'confidence', level = 0.95)
# create scatterplot of values with regression line
plot(weightsE$body, weightsE$brain)
abline(model1)
# add dashed lines (lty = 2) for the 95% confidence interval
lines(newx$body, conf_interval[, 2], col = "blue", lty = 2)
lines(newx$body, conf_interval[, 3], col = "blue", lty = 2)
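As in the lm example above, matlines can draw both interval bounds in one call:
matlines(newx$body, conf_interval[, 2:3], col = "blue", lty = 2)  # equivalent to the two lines() calls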
Incidentally, you could do the whole thing in ggplot2 in much less code (geom_smooth draws the 95% confidence ribbon by default):
library(ggplot2)
ggplot(weightsE, aes(body, brain)) +
geom_point() +
geom_smooth(method = MASS::rlm)
Reproducible dummy data
data(mtcars)
weightsE <- setNames(mtcars[c(1, 6)], c("brain", "body"))
weightsE$body <- 10 - weightsE$body

How to extrapolate loess result? [duplicate]

I am struggling with "out-of-sample" prediction using loess. I get NA values for new x that are outside the original sample. Can I get these predictions?
x <- c(24,36,48,60,84,120,180)
y <- c(3.94,4.03,4.29,4.30,4.63,4.86,5.02)
lo <- loess(y~x)
x.all <- seq(3, 200, 3)
predict(object = lo, newdata = x.all)
I need to model full yield curve, i.e. interest rates for different maturities.
From the manual page of predict.loess:
When the fit was made using surface = "interpolate" (the default), predict.loess will not extrapolate – so points outside an axis-aligned hypercube enclosing the original data will have missing (NA) predictions and standard errors
If you change the surface parameter to "direct" you can extrapolate values.
For instance, this will work (on a side note: after plotting the prediction, my feeling is that you should increase the span parameter in the loess call a little bit):
lo <- loess(y~x, control=loess.control(surface="direct"))
predict(lo, newdata=x.all)
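If you also want uncertainty for the extrapolated values, predict.loess can return standard errors; a small sketch, assuming the direct-surface fit above:
pr <- predict(lo, newdata = x.all, se = TRUE)
head(pr$fit)     # extrapolated predictions
head(pr$se.fit)  # their standard errors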
In addition to nico's answer, I would suggest fitting a gam (which uses penalized regression splines) instead. Bear in mind, though, that extrapolation is not advisable unless you have a model grounded in subject-matter knowledge.
x <- c(24,36,48,60,84,120,180)
y <- c(3.94,4.03,4.29,4.30,4.63,4.86,5.02)
# loess with a direct surface so that it can extrapolate
lo <- loess(y ~ x, control = loess.control(surface = "direct"))
x.all <- seq(3, 200, 3)
plot(x.all, predict(lo, newdata = x.all), type = "l", col = "blue")
points(x, y)
# penalized regression spline via mgcv
library(mgcv)
fit <- gam(y ~ s(x, bs = "cr", k = 7, fx = FALSE), data = data.frame(x, y))
summary(fit)
lines(x.all, predict(fit, newdata = data.frame(x = x.all)), col = "green")
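To stay on topic for this page, here is a sketch of adding an approximate 95% confidence band for the gam fit, using the pointwise standard errors from predict():
pr <- predict(fit, newdata = data.frame(x = x.all), se.fit = TRUE)
lines(x.all, pr$fit + 1.96 * pr$se.fit, col = "green", lty = 2)  # upper bound
lines(x.all, pr$fit - 1.96 * pr$se.fit, col = "green", lty = 2)  # lower bound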

Why abline won't show line from glm with Gamma family?

I have the following data, which I'm trying to model via a GLM with a Gamma family. The fit works, except that abline won't show any line. What am I doing wrong?
y <- c(0.00904977380111,0.009174311972687,0.022573363475789,0.081632653008122,0.005571030584803,1e-04,0.02375296916921,0.004962779106823,0.013729977117333,0.00904977380111,0.004514672640982,0.016528925619835,1e-04,0.027855153258277,0.011834319585449,0.024999999936719,1e-04,0.026809651528869,0.016348773841071,1e-04,0.009345794439034,0.00457665899303,0.004705882305772,0.023201856194357,1e-04,0.033734939711656,0.014251781472007,0.004662004755245,0.009259259166667,0.056872037917387,0.018518518611111,0.014598540145986,0.009478673032951,0.023529411811211,0.004819277060357,0.018691588737881,0.018957345923721,0.005390835525461,0.056179775223141,0.016348773841071,0.01104972381185,0.010928961639344,1e-04,1e-04,0.010869565271444,0.011363636420778,0.016085790883856,0.016,0.005665722322786,0.01117318441372,0.028818443860841,1e-04,0.022988505862069,0.01010101,1e-04,0.018083182676638,0.00904977380111,0.00961538466323,0.005390835525461,0.005763688703004,1e-04,0.005571030584803,1e-04,0.014388489208633,0.005633802760722,0.005633802760722,1e-04,0.005361930241431,0.005698005811966,0.013986013986014,1e-04,1e-04)
x <- c(600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,600,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,3500,744.47,744.47,744.47,744.47,744.47,744.47,744.47,630.42,630.42,630.42,630.42,630.42,630.42,630.42,630.42,630.42)
hist(y,breaks=15)
plot(y~x)
fit <- glm(y~x,family='Gamma'(link='log'))
abline(fit)
abline plots straight lines, e.g. from a simple linear regression. A GLM with a Gamma family and a log link is nonlinear on the original scale, so abline has nothing sensible to draw. To visualize the fit of such a model, you can use predict (an example is given below). Several R packages (e.g. effects or visreg) also provide functions that plot the fit directly on the original scale, including confidence intervals.
Here is an example using visreg using your data and model:
library(visreg)
# y, x, and fit as defined in the question above
visreg(fit, scale = "response")
And here is an example using base R graphics and predict:
pred_frame <- data.frame(x = seq(min(x), max(x), length.out = 1000))
pred_frame$fit <- predict(fit, newdata = pred_frame, type = "response")
plot(y ~ x, pch = 16, las = 1, cex = 1.5)
lines(fit ~ x, data = pred_frame, col = "steelblue", lwd = 3)
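predict.glm has no interval argument, so a common approach for a confidence band is to compute it on the link (log) scale and back-transform; a sketch, reusing fit and pred_frame from above:
pr <- predict(fit, newdata = pred_frame, type = "link", se.fit = TRUE)
pred_frame$lwr <- exp(pr$fit - 1.96 * pr$se.fit)  # lower bound on the response scale
pred_frame$upr <- exp(pr$fit + 1.96 * pr$se.fit)  # upper bound on the response scale
lines(lwr ~ x, data = pred_frame, col = "steelblue", lty = 2)
lines(upr ~ x, data = pred_frame, col = "steelblue", lty = 2)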
You are not being consistent here: you chose to model on the log scale, but you are plotting on the raw scale (mind you, many published plots do the same). You need to either plot the points in log space or transform the coefficients and pass them to abline() explicitly.
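A sketch of the log-space option (assuming the fit object from above; abline() picks up the model coefficients, which describe a straight line on the log scale):
plot(x, log(y), pch = 16, las = 1)
abline(fit, col = "steelblue", lwd = 2)  # log(mu) = b0 + b1 * x is linear here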

get equation of linear SVM regression line

I'm struggling to find a way to get the equation for a linear SVM model in the regression case, since most of the questions deal with classification... I have fit it with the caret package.
1. Univariate case
library(caret)
set.seed(1)
fit <- train(mpg ~ hp, data = mtcars, method = "svmLinear")
plot(x = mtcars$hp, y = predict(fit, mtcars), pch = 15)
points(x = mtcars$hp, y = mtcars$mpg, col = "red")
abline(lm(mpg ~ hp, mtcars), col = "blue")
Which gives the plot with red=actual, black=fitted, and blue line is classic regression. In this case I know I could manually calculate the SVM prediction line from 2 points, but is there a way to get directly the equation from the model structure? I actually need the equation like this y=a+bx (here mpg=?+?*hp ) with values in the original scale.
2. Multivariate case
Same question, but with two predictor variables (mpg ~ hp + wt).
Thanks,
Yes, I believe there is. Take a look at this answer, which is similar but does not use the caret library. If you add svp = fit$finalModel to the example, you should be able to follow it almost exactly. I applied a similar technique to your data below. I scaled the data so that it fits nicely on the plot of the vectors, since the library scales the data at runtime.
require(caret)
require(kernlab)  # for alphaindex(), b(), and coef() on the ksvm object
set.seed(1)
x <- model.matrix(data = mtcars, mpg ~ scale(hp))  # set up data
y <- mtcars$mpg
fit <- train(x, y, method = "svmLinear")  # train
svp <- fit$finalModel  # extract the S4 model object
plot(x, xlab = "", ylab = "")
w <- colSums(coef(svp)[[1]] * x[unlist(alphaindex(svp)), ])
b <- b(svp)
abline(b/w[1], -w[2]/w[1], col = 'red')
abline((b + 1)/w[1], -w[2]/w[1], lty = 2, col = 'red')
abline((b - 1)/w[1], -w[2]/w[1], lty = 2, col = 'red')
And for your second question:
x <- model.matrix(data = mtcars, mpg ~ scale(hp) + scale(wt) - 1)  # set up data
fit <- train(x, y, method = "svmLinear")  # train
svp <- fit$finalModel  # extract the S4 model object
plot(x, xlab = "", ylab = "")
w <- colSums(coef(svp)[[1]] * x[unlist(alphaindex(svp)), ])
b <- b(svp)
abline(b/w[1], -w[2]/w[1], col = 'red')
abline((b + 1)/w[1], -w[2]/w[1], lty = 2, col = 'red')
abline((b - 1)/w[1], -w[2]/w[1], lty = 2, col = 'red')
Edit
The above answer concerns plotting a boundary, not the linear SVM regression line. To answer the question, one easy way to get the line is to extract the predicted values and fit a line through them. You actually only need two points to get the line, but for simplicity, I used the following code.
abline(lm(predict(fit, newdata=mtcars) ~ mtcars$hp), col='green')
or
abline(lm(predict(fit) ~ mtcars$hp), col='green')
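If you want the actual numbers a and b in mpg = a + b*hp rather than a plotted line, two predictions determine the straight line. A minimal sketch, assuming the formula-based fit from the question (fit <- train(mpg ~ hp, data = mtcars, method = "svmLinear")); the probe values 100 and 200 are arbitrary:
nd <- data.frame(hp = c(100, 200))          # any two distinct hp values work
p <- predict(fit, newdata = nd)
b <- (p[2] - p[1]) / (nd$hp[2] - nd$hp[1])  # slope
a <- p[1] - b * nd$hp[1]                    # intercept
c(intercept = a, slope = b)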

How to find "y" values of the already estimated monotone function of the non-monotone regression curve corresponding to the original "x" points?

The title sounds complicated, but that is what I am looking for. Focus on the picture, which is reproduced by the code below.
## data
x <- c(1.009648,1.017896,1.021773,1.043659,1.060277,1.074578,1.075495,1.097086,1.106268,1.110550,1.117795,1.143573,1.166305,1.177850,1.188795,1.198032,1.200526,1.223329,1.235814,1.239068,1.243189,1.260003,1.262732,1.266907,1.269932,1.284472,1.307483,1.323714,1.326705,1.328625,1.372419,1.398703,1.404474,1.414360,1.415909,1.418254,1.430865,1.431476,1.437642,1.438682,1.447056,1.456152,1.457934,1.457993,1.465968,1.478041,1.478076,1.485995,1.486357,1.490379,1.490719)
y <- c(0.5102649,0.0000000,0.6360097,0.0000000,0.8692671,0.0000000,1.0000000,0.0000000,0.4183691,0.8953987,0.3442624,0.0000000,0.7513169,0.0000000,0.0000000,0.0000000,0.0000000,0.1291901,0.4936121,0.7565551,1.0085108,0.0000000,0.0000000,0.1655482,0.0000000,0.1473168,0.0000000,0.0000000,0.0000000,0.1875293,0.4918018,0.0000000,0.0000000,0.8101771,0.6853480,0.0000000,0.0000000,0.0000000,0.0000000,0.4068802,1.1061434,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.0000000,0.6391678)
fit1 <- c(0.5102649100,0.5153380934,0.5177234836,0.5255544980,0.5307668662,0.5068087080,0.5071001179,0.4825657520,0.4832969250,0.4836378194,0.4842147729,0.5004039310,0.4987301366,0.4978800742,0.4978042478,0.4969807064,0.5086987191,0.4989497612,0.4936121200,0.4922210302,0.4904593166,0.4775197108,0.4757040857,0.4729265271,0.4709141776,0.4612406896,0.4459316517,0.4351338346,0.4331439717,0.4318664278,0.3235179189,0.2907908968,0.1665721429,0.1474035158,0.1443999345,0.1398517097,0.1153991839,0.1142140393,0.1022584672,0.1002410843,0.0840033244,0.0663669309,0.0629119398,0.0627979240,0.0473336492,0.0239237481,0.0238556876,0.0084990298,0.0077970954,0.0000000000,-0.0006598571)
fit2 <- c(-0.0006598571,0.0153328298,0.0228511733,0.0652889427,0.0975108758,0.1252414661,0.1270195143,0.1922510501,0.2965234797,0.3018551305,0.3108761043,0.3621749370,0.4184150225,0.4359301495,0.4432114081,0.4493565757,0.4510158144,0.4661865431,0.4744926045,0.4766574718,0.4796937554,0.4834718810,0.4836125426,0.4839450098,0.4841092849,0.4877317306,0.4930561638,0.4964939389,0.4970089201,0.4971376528,0.4990394601,0.5005881678,0.5023814257,0.5052125977,0.5056691690,0.5064254338,0.5115481820,0.5117259449,0.5146054557,0.5149729419,0.5184178197,0.5211542908,0.5216215426,0.5216426533,0.5239797875,0.5273573222,0.5273683002,0.5293994824,0.5295130266,0.5306236672,0.5307303109)
## picture
plot(x, y)
## red regression curve
points(x, fit1, col=2); lines(x, fit1, col=2)
## blue monotonic curve to the regression
points(min(x) + cumsum(c(0, rev(diff(x)))), rev(fit2), col="blue"); lines(min(x) + cumsum(c(0, rev(diff(x)))), rev(fit2), col="blue")
## "x" original point matches with the regression estimated point
## but not with the estimated (fit2=estimate) monotonic curve
abline(v=1.223329, lty=2, col="grey")
Focus on the dashed grey line. The idea is to get y value of the monotonic blue curve corresponding to x original value. The grey line should cross three points (the original one "black", the regression estimate "red", the adjusted regression estimate "blue"). Can we do this?
Methodology:
The object "fit2" is the output of the function rearrangement(). It is always monotonically increasing.
library(Rearrangement)
fit2 <- rearrangement(x=as.data.frame(x), y=fit1)
It sounds like you might be interested in approxfun.
# linear interpolator through the (reordered) blue-curve points
fn <- approxfun(x = min(x) + cumsum(c(0, rev(diff(x)))), y = rev(fit2))
fn(1.223329)
It's not very fancy, but it will do a basic linear interpolation between observed points for unobserved x values. The code above will estimate a y value for the value x=1.223329 using the existing points. You can then use the fn function to estimate other points as well.
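For example, to read the blue curve off at every original x value at once (reusing fn from above):
y_blue <- fn(x)  # interpolated blue-curve values at the original x points
head(cbind(x, y_blue))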
I found a way to do this without approxfun. The rearrangement function always returns a monotonically increasing result. Your fitted curve is decreasing, so a simple trick gets what you want (also what you wanted in your earlier question).
## Apply rearrangement to minus fit1, then negate the result back
fit3 <- rearrangement(x = as.data.frame(x), y = -fit1)
## Plot the negated rearranged result alongside the original fit
plot(x, y)
lines(x, -fit3, col = "green"); points(x, -fit3, col = "green")  # monotone curve
lines(x, fit1, col = "green"); points(x, fit1, col = "green")    # original fit
So, the result is a monotonically decreasing curve, with x values equal to that of your data and your fit.
ind <- which(x == 1.223329)
-fit3[ind]
## 0.4857717
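If you also need values at x points that are not in the data, you can combine this with approxfun from the answer above:
fn3 <- approxfun(x, -fit3)  # interpolator for the decreasing curve
fn3(1.223329)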
Hope it helps,
alex
