Exporting Linear Regression Results Including Confidence Intervals - r

Hey out there how can I can I export a table of the results used to make the chart I generated for this linear regression model below.
d <- data.frame(x=c(200110,86933,104429,240752,255332,75998,
204302,97321,342812,220522,110990,259706,65733),
y=c(200000,110000,165363,225362,313284,113972,
137449,113106,409020,261733,171300,344437,89000))
lm1 <- lm(y~x,data=d)
p_conf1 <- predict(lm1,interval="confidence")
nd <- data.frame(x=seq(0,80000,length=510000))
p_conf2 <- predict(lm1,interval="confidence",newdata=nd)
plot(y~x,data=d,ylim=c(-21750,600000),xlim=c(0,600000)) ## data
abline(lm1) ## fit
matlines(d$x,p_conf1[,c("lwr","upr")],col=2,lty=1,type="b",pch="+")
matlines(nd$x,p_conf2[,c("lwr","upr")],col=4,lty=1,type="b",pch="+")

Still not entirely sure what you want but this would seem to be reasonable:
dat1 <- data.frame(d,p_conf1)
dat2 <- data.frame(nd,y=NA,p_conf2)
write.csv(rbind(dat1,dat2),file="linpredout.csv")
It includes x, y (equal to the observation or NA for non-observed points), the predicted value fit, and lwr/upr bounds.
edit: fix typo.

This will return a matrix that has some of the information needed to construct the confidence intervals:
> coef(summary(lm1))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21749.037058 2.665203e+04 0.8160369 4.317954e-01
x 1.046954 1.374353e-01 7.6177997 1.037175e-05
Any text on linear regression should have the formula for the confidence interval. You may need to calculate some ancillary quantities dependent on which formula you're using. The code for predict is visible ... just type at the console :
predict.lm
And don't forget that confidence intervals are different than prediction intervals.

Related

why I can't get a confidence interval using predict function in R

I am trying to get a confidence interval for my response in a poisson regression model. Here is my data:
X <- c(1,0,2,0,3,1,0,1,2,0)
Y <- c(16,9,17,12,22,13,8,15,19,11)
What I've done so far:
(i) read my data
(ii) fit a Y by poisson regression with X as a covariate
model <- glm(Y ~ X, family = "poisson", data = mydata)
(iii) use predict()
predict(model,newdata=data.frame(X=4),se.fit=TRUE,interval="confidence",level=0.95, type = "response")
I was expecting to get "fit, lwr, upr" for my response but I got the following instead:
$fit
1
30.21439
$se.fit
1
6.984273
$residual.scale
[1] 1
Could anyone offer some suggestions? I am new to R and struggling with this problem for a long time.
Thank you very much.
First, the function predict() that you are using is the method predict.glm(). If you look at its help file, it does not even have arguments 'interval' or 'level'. It doesn't flag them as erroneous because predict.glm() has the (in)famous ... argument, that absorbs all 'extra' arguments. You can write confidence=34.2 and interval="woohoo" and it still gives the same answer. It only produces the estimate and the standard error.
Second, one COULD then take the fit +/- 2*se to get an approximate 95 percent confidence interval. However, without getting into the weeds of confidence intervals, pivotal statistics, non-normality in the response scale, etc., this doesn't give very satisfying intervals because, for instance, they often include impossible negative values.
So, I think a better approach is to form an interval in the link scale, then transform it (this is still an approximation, but probably better):
X <- c(1,0,2,0,3,1,0,1,2,0)
Y <- c(16,9,17,12,22,13,8,15,19,11)
model <- glm(Y ~ X, family = "poisson")
tmp <- predict(model, newdata=data.frame(X=4),se.fit=TRUE, type = "link")
exp(tmp$fit - 2*tmp$se.fit)
1
19.02976
exp(tmp$fit + 2*tmp$se.fit)
1
47.97273

bootstrap standard errors of a linear regression in R

I have a lm object and I would like to bootstrap only its standard errors. In practice I want to use only part of the sample (with replacement) at each replication and get a distribution of standard erros. Then, if possible, I would like to display the summary of the original linear regression but with the bootstrapped standard errors and the corresponding p-values (in other words same beta coefficients but different standard errors).
Edited: In summary I want to "modify" my lm object by having the same beta coefficients of the original lm object that I ran on the original data, but having the bootstrapped standard errors (and associated t-stats and p-values) obtained by computing this lm regression several times on different subsamples (with replacement).
So my lm object looks like
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.812793 0.095282 40.016 < 2e-16 ***
x -0.904729 0.284243 -3.183 0.00147 **
z 0.599258 0.009593 62.466 < 2e-16 ***
x*z 0.091511 0.029704 3.081 0.00208 **
but the associated standard errors are wrong, and I would like to estimate them by replicating this linear regression 1000 times (replications) on different subsample (with replacement).
Is there a way to do this? can anyone help me?
Thank you for your time.
Marco
What you ask can be done following the line of the code below.
Since you have not posted an example dataset nor the model to fit, I will use the built in dataset mtcars an a simple formula with two continuous predictors.
library(boot)
boot_function <- function(data, indices, formula){
d <- data[indices, ]
obj <- lm(formula, d)
coefs <- summary(obj)$coefficients
coefs[, "Std. Error"]
}
set.seed(8527)
fmla <- as.formula("mpg ~ hp * cyl")
seboot <- boot(mtcars, boot_function, R = 1000, formula = fmla)
colMeans(seboot$t)
##[1] 6.511530646 0.068694001 1.000101450 0.008804784
I believe that it is possible to use the code above for most needs with numeric response and predictors.

Parameters and AUC and IC50 of a dose response curve

I have these dose response data:
df <- data.frame(viability=c(14,81,58,78,71,83,64,16,32,100,100,81,86,83,100,90,15,100,38,100,91,84,92,100),
dose=c(10,0.62,2.5,0.16,0.039,0.0024,0.0098,0.00061,10,0.62,2.5,0.16,0.039,0.0024,0.0098,0.00061,10,0.62,2.5,0.16,0.039,0.0024,0.0098,0.00061),
stringsAsFactors=F)
I then use the drc package's drm function to fit a log-logistic curve to these data:
library(drc)
fit <- drm(viability~dose,data=df,fct=LL.4(names=c("slope","low","high","ED50")),type="continuous")
> summary(fit)
Model fitted: Log-logistic (ED50 as parameter) (4 parms)
Parameter estimates:
Estimate Std. Error t-value p-value
slope:(Intercept) 5.15328 18.07742 0.28507 0.7785
low:(Intercept) 20.19430 12.61122 1.60130 0.1250
high:(Intercept) 83.33181 4.96736 16.77586 0.0000
ED50:(Intercept) 2.98733 1.99685 1.49602 0.1503
Residual standard error:
21.0743 (20 degrees of freedom)
I then generate predictions so I'll be able to plot the curve:
pred.df <- expand.grid(dose=exp(seq(log(max(df$dose)),log(min(df$dose)),length=100)))
pred <- predict(fit,newdata=pred.df,interval="confidence")
pred.df$viability <- pmax(pred[,1],0)
pred.df$viability <- pmin(pred.df$viability,100)
pred.df$viability.low <- pmax(pred[,2],0)
pred.df$viability.low <- pmin(pred.df$viability.low,100)
pred.df$viability.high <- pmax(pred[,3],0)
pred.df$viability.high <- pmin(pred.df$viability.high,100)
I also use the PharmacoGx Bioconductor package to compute AUC and IC50 for both the curve and its high and low bounds:
library(PharmacoGx)
auc.mid <- computeAUC(rev(pred.df$dose),rev(pred.df$viability))/((max(pred.df$viability)-min(pred.df$viability))*(max(pred.df$dose)-min(pred.df$dose)))
auc.low <- computeAUC(rev(pred.df$dose),rev(pred.df$viability.low))/((max(pred.df$viability.low)-min(pred.df$viability.low))*(max(pred.df)-min(pred.df$dose)))
auc.high <- computeAUC(rev(pred.df$dose),rev(pred.df$viability.high))/((max(pred.df$viability.high)-min(pred.df$viability.high))*(max(pred.df$dose)-min(pred.df$dose)))
ic50.mid <- computeIC50(rev(pred.df$dose),rev(pred.df$viability))
ic50.low <- computeIC50(rev(pred.df$dose),rev(pred.df$viability.low))
ic50.high <- computeIC50(rev(pred.df$dose),rev(pred.df$viability.high))
Ceating a table with all the parameters so I can plot everything together:
ann.df <- data.frame(param=c("slope","low","high","ED50","auc.mid","auc.high","auc.low","ic50.mid","ic50.high","ic50.low"),value=signif(c(summary(fit)$coefficient[,1],auc.mid,auc.high,auc.low,ic50.mid,ic50.high,ic50.low),2),stringsAsFactors=F)
And finally plotting it all:
library(ggplot2)
library(grid)
library(gridExtra)
pl <- ggplot(df,aes(x=dose,y=viability))+geom_point()+geom_ribbon(data=pred.df,aes(x=dose,y=viability,ymin=viability.low,ymax=viability.high),alpha=0.2)+labs(y="viability")+
geom_line(data=pred.df,aes(x=dose,y=viability))+coord_trans(x="log")+theme_bw()+scale_x_continuous(name="dose",breaks=sort(unique(df$dose)),labels=format(signif(sort(unique(df$dose)),3),scientific=T))
ggdraw(pl)+draw_grob(tableGrob(ann.df,rows=NULL),x=0.1,y=0.175,width=0.3,height=0.4)
Which gives:
My questions are:
I thought that slope should be negative. How come it's 5.2?
the auc.mid, auc.high, and auc.lowcumputed as:
auc.mid <- computeAUC(rev(pred.df$dose),rev(pred.df$viability))
auc.low <- computeAUC(rev(pred.df$dose),rev(pred.df$viability.low))
auc.high <- computeAUC(rev(pred.df$dose),rev(pred.df$viability.high))
give 21.47818, 37.52389, and 2.678228, respectively.
Since these are not in the [0,1] range I thought that divinding them by the area under the highest corresponding viability will give what I'm looking for, i.e., relative AUC, but these values seem too low relative to what the figure shows. What are these AUCs then?
Also, how come auc.mid > auc.low > auc.high? I would think that it should be auc.high > auc.mid > auc.low
The IC50 values also seem a little low. Do they make sense?
Bonus question: how do I avoid the trailing zeros in slope, low, high, ED50, ic50.mid, and ic50.high in the figure?
The parameter you are pulling out is the hill slope parameter, or the coefficient in front of the concentration variable in the exponential, not the actual slope of the curve.
The AUC provided is in the [0-100] range, for the area above the curve. I ran the code and got the order as auc.low>auc.mid>auc.high. Traditionally the area under the response curve was reported, or 1-viability.
It is important to note that the PharmacoGx package uses a 3 parameter hill slope model, similar to LL.3 in drc. Therefore, the plot will not correspond to the function fit by PharmacoGx to calculate the IC50 or AUC.
Source: PharmacoGx dev.

'predict' gives different results than using manually the coefficients from 'summary'

Let me state my confusion with the help of an example,
#making datasets
x1<-iris[,1]
x2<-iris[,2]
x3<-iris[,3]
x4<-iris[,4]
dat<-data.frame(x1,x2,x3)
dat2<-dat[1:120,]
dat3<-dat[121:150,]
#Using a linear model to fit x4 using x1, x2 and x3 where training set is first 120 obs.
model<-lm(x4[1:120]~x1[1:120]+x2[1:120]+x3[1:120])
#Usig the coefficients' value from summary(model), prediction is done for next 30 obs.
-.17947-.18538*x1[121:150]+.18243*x2[121:150]+.49998*x3[121:150]
#Same prediction is done using the function "predict"
predict(model,dat3)
My confusion is: the two outcomes of predicting the last 30 values differ, may be to a little extent, but they do differ. Whys is it so? should not they be exactly same?
The difference is really small, and I think is just due to the accuracy of the coefficients you are using (e.g. the real value of the intercept is -0.17947075338464965610... not simply -.17947).
In fact, if you take the coefficients value and apply the formula, the result is equal to predict:
intercept <- model$coefficients[1]
x1Coeff <- model$coefficients[2]
x2Coeff <- model$coefficients[3]
x3Coeff <- model$coefficients[4]
intercept + x1Coeff*x1[121:150] + x2Coeff*x2[121:150] + x3Coeff*x3[121:150]
You can clean your code a bit. To create your training and test datasets you can use the following code:
# create training and test datasets
train.df <- iris[1:120, 1:4]
test.df <- iris[-(1:120), 1:4]
# fit a linear model to predict Petal.Width using all predictors
fit <- lm(Petal.Width ~ ., data = train.df)
summary(fit)
# predict Petal.Width in test test using the linear model
predictions <- predict(fit, test.df)
# create a function mse() to calculate the Mean Squared Error
mse <- function(predictions, obs) {
sum((obs - predictions) ^ 2) / length(predictions)
}
# measure the quality of fit
mse(predictions, test.df$Petal.Width)
The reason why your predictions differ is because the function predict() is using all decimal points whereas on your "manual" calculations you are using only five decimal points. The summary() function doesn't display the complete value of your coefficients but approximate the to five decimal points to make the output more readable.

Bootstrapping CI for a Logistic Regression Model

I have a logistic regression model that I am using to predict size at maturity for king crab, but I am having trouble setting up the code for bootstrapping using the boot package. This is what I have:
#FEMALE GKC SAM#
LowerChatham<-read.table(file=file.choose(),header=TRUE)
#LOGISTIC REGRESSION FIT#
glm.out<-glm(Mature~CL,family=binomial(link=logit),data=LowerChatham)
plot(Mature~CL,data=LowerChatham)
lines(LowerChatham$CL,glm.out$fitted,col="red")
title(main="Lower Chatham")
summary(glm.out)
segments(98.9,0,98.9,0.5,col=1,lty=3,lwd=3)
SAM<-data.frame(CL=98.97)
predict(glm.out,SAM,type="response")
I would like to to bootstrap the statistic CL=98.97 since I am interested in the size at which 50% of crab are mature, but I have no idea how to setup my function to specify the that statistic and let alone the bootstrap function in general to get my 95% C.I. Any help would be greatly appreciated! Thanks!
In each bootstrap iteration, you want to do something like
range <- 1:100 # this could be any substantively meaningful range
p <- predict(glm.out, newdata = data.frame(CL=range), "response")
range[match(TRUE,p>.5)] # predicted probability of 50% maturity
where you specify a range of values of CL to whatever precision you need. Then calculate the predicted probability of maturity at each of those levels. Then find the threshold value in the range where the predicted probability cross 0.5. This is the statistic it sounds like you want to bootstrap.
You also don't need the boot to do this. If you define a function that samples and outputs that statistic as its result, you can just do replicate(1000, myfun) to get your bootstrap distribution, as follows:
myfun <- function(){
srows <- sample(1:nrow(LowerChatham),nrow(LowerChatham),TRUE)
glm.out <- (Mature ~ CL, family=binomial(link=logit), data=LowerChatham[srows,])
range <- 1:100 # this could be any substantively meaningful range
p <- predict(glm.out, newdata = data.frame(CL=range), "response")
return(range[match(TRUE,p>.5)]) # predicted probability of 50% maturity
}
bootdist <- replicate(1000, myfun()) # your distribution
quantile(unlist(bootdist),c(.025,.975)) # 95% CI

Resources