R - calculating 95% confidence interval for R-squared and residual standard error by boot strap linear model - r

I'm new to R and am trying to calculate the 95% confidence intervals for the R-squared values and residual standard error for linear models have formed by using the bootstrap method to resample the response variable, and then create 999 linear models by regressing these 999 bootstrapped response variables on the original explanatory variable.
First of all, I am not sure if I should be calculating the 95% CI for R-squared and residual standard error for the ORIGINAL linear model (without the bootstrap data), because that doesn't make sense - the R-squared value is 100% exact for that linear model, and it doesn't make sense to calculate a CI for it.
Is that correct?
Importantly I'm not sure how to calculate the CI for the R-squared values and residual standard error values for the 999 linear models I've created from bootstrapping.

You can definitely use the boot package to do this. But because I may be confused about what you want Ill go step by step.
I make up some fake data
n=10
x=rnorm(n)
realerror=rnorm(n,0,.9)
beta=3
y=beta*x+realerror
make an empty place to catch the statistics I am interested in.
rsquared=NA
sse=NA
Then make a for loop that will resample the data, run a regression and collect two statistics for each iteration.
for(i in 1:999)
{
#create a vector of the index to resample data row-wise with replacement.
use=sample(1:n,replace=T)
lm1=summary(lm(y[use]~x[use]))
rsquared[i]=lm1$r.squared
sse[i]=sum(lm1$residuals^2)
}
Now I want to figure out the confidence intervals so I order each of them and report out the (n*.025)th and the (n*.975)th
first order the statistics
sse=sse[order(sse)]
rsquared=rsquared[order(rsquared)]
Then the 25th is the lower confidence limit and the 975th is the upper confidence limit
> sse[c(25,975)]
[1] 2.758037 18.027106
> rsquared[c(25,975)]
[1] 0.5613399 0.9795167

Related

No standard error output in quantile regression using R

I am running a quantile regression (as my residuals in linear regression were not normally distributed) for a study on the association of mediterranean diet and inflammatory markers. As I was building the model I got outputs for beta coefficients and standard error plus p-values, and confidence intervals. However, once I stratified for low and high levels of exercise, there was no longer an output for standard error. Any ideas?enter image description here
By default rq does not returns standard errors, but rather the confidence interval by inverting a rank test. If you want standard errors you have to specify another method for se, such as se="boot".
Note: the whole point of quantile regression is to move away from means and SD, so this may not be the most adequate estimate for your problem.

confidence interval of estimates in a fitted hybrid model by spatstat

hybrid Gibbs models are flexible for fitting spatial pattern data, however, I am confused on how to get the confidence interval for the fitted model's estimate. for instance, I fitted a hybrid geyer model including a hardcore and a geyer saturation components, got the estimates:
Mo.hybrid<-Hybrid(H=Hardcore(), G=Geyer(81,1))
my.hybrid<-ppm(my.X~1,Mo.hybrid, correction="bord")
#beta = 1.629279e-06
#Hard core distance: 31.85573
#Fitted G interaction parameter gamma: 10.241487
what I interested is the gamma, which present the aggregation of points. obviously, the data X is a sample, i.e., of cells in a anatomical image. in order to report statistical result, a confidence interval for gamma is needed. however, i do not have replicates for the image data.
can i simlate 10 time of the fitted hybrid model, then refitted them to get confidence interval of the estimate? something like:
mo.Y<-rmhmodel(cif=c("hardcore","geyer"),
par=list(list(beta=1.629279e-06,hc=31.85573),
list(beta=1, gamma=10.241487,r=81,sat=1)), w=my.X)
Y1<-rmh(model=mo.Y, control = list(nrep=1e6,p=1, fixall=TRUE),
start=list(n.start=c(npoint(my.X))))
Y1.fit<-ppm(Y1~1, Mo.hybrid,rbord=0.1)
# simulate and fit Y2,Y3,...Y10 in same way
or:
Y10<-simulate(my.hybrid,nsim=10)
Y1.fit<-ppm(Y10[1]~1, Mo.hybrid,rbord=0.1)
# fit Y2,Y3,...Y10 in same way
certainly, the algorithms is different, the rmh() can control simulated intensity while the simulate() does not.
now the questions are:
is it right to use simualtion to get confidence interval of estimate?
or the fitted model can provide estimate interval that could be extracted?
if simulation is ok, which algorithm is better in my case?
The function confint calculates confidence intervals for the canonical parameters of a statistical model. It is defined in the standard stats package. You can apply it to fitted point process models in spatstat: in your example just type confint(my.hybrid).
You wanted a confidence interval for the non-canonical parameter gamma. The canonical parameter is theta = log(gamma) so if you do exp(confint(my.hybrid) you can read off the confidence interval for gamma.
Confidence intervals and other forms of inference for fitted point process models are discussed in detail in the spatstat book chapters 9, 10 and 13.
The confidence intervals described above are the asymptotic ones (based on the asymptotic variance matrix using the central limit theorem).
If you really wanted to estimate the variance-covariance matrix by simulation, it would be safer and easier to fit the model using method='ho' (which performs the simulation) and then apply confint as before (which would then use the variance of the simulations rather than the asymptotic variance).
rmh.ppm and simulate.ppm are essentially the same algorithm, apart from some book-keeping. The differences observed in your example occur because you passed different arguments. You could have passed the same arguments to either of these functions.

Significance level of ACF and PACF in R

I want to obtain the the limits that determine the significance of autocorrelation coefficients and partial autocorrelation coefficients, but I don't know how to do it.
I obtained the Partial autocorrelogram using this function pacf(data). I want that R print me the values indicated in the figure.
The limits that determine the significance of autocorrelation coefficients are: +/- of (exp(2*1.96/√(N-3)-1)/(exp(2*1.96/√(N-3)+1).
Here N is the length of the time series, and I used the 95% confidence level.
The correlation values that correspond to the m % confidence intervals chosen for the test are given by 0 ± i/√N where:
N is the length of the time series
i is the number of standard deviations we expect m % of the correlations to lie within under the null hypothesis that there is zero autocorrelation.
Since the observed correlations are assumed to be normally distributed:
i=2 for a 95% confidence level (acf's default),
i=3 for a 99% confidence level,
and so on as dictated by the properties of a Gaussian distribution
Figure A1, Page 1011 here provides a nice example of how the above principle applies in practice.
After investigating acf and pacf functions and library psychometric with its CIz and CIr functions I found this simple code to do the task:
Compute confidence interval for z Fisher:
ciz = c(-1,1)*(-qnorm((1-alpha)/2)/sqrt(N-3))
here alpha is the confidence level (typically 0.95). N - number of observations.
Compute confidence interval for R:
cir = (exp(2*ciz)-1)/(exp(2*ciz)+1

Confidence intervals for predicted probabilities from predict.lrm

I am trying to determine confidence intervals for predicted probabilities from a binomial logistic regression in R. The model is estimated using lrm (from the package rms) to allow for clustering standard errors on survey respondents (each respondent appears up to 3 times in the data):
library(rms)
model1<-lrm(outcome~var1+var2+var3,data=mydata,x=T,y=T,se.fit=T)
model.rob<-robcov(model1,cluster=respondent.id)
I am able to estimate a predicted probability for the outcome using predict.lrm:
predicted.prob<-predict(model.rob,newdata=data.frame(var1=1,var2=.33,var3=.5),
type="fitted")
What I want to determine is a 95% confidence interval for this predicted probability. I have tried specifying se.fit=T, but this not permissible in predict.lrm when type=fitted.
I have spent the last few hours scouring the Internet for how to do this with lrm to no avail (obviously). Can anyone point me toward a method for determining this confidence interval? Alternatively, if it is impossible or difficult with lrm models, is there another way to estimate a logit with clustered standard errors for which confidence intervals would be more easily obtainable?
The help file for predict.lrm has a clear example. Here is a slight modification of it:
L <- predict(fit, newdata=data.frame(...), se.fit=TRUE)
plogis(with(L, linear.predictors + 1.96*cbind(- se.fit, se.fit)))
For some problems you may want to use the gendata or Predict functions, e.g.
L <- predict(fit, gendata(fit, var1=1), se.fit=TRUE) # leave other vars at median/mode
Predict(fit, var1=1:2, var2=3) # leave other vars at median/mode; gives CLs

Combining ROC estimates from multiple imputation data

I have used the following R packages: mice, mitools, and pROC.
Basic design: 3 predictor measures with missing data rates between 5% and 70% on n~1,000. 1 binary target outcome variable.
Analytic Goal: Determine the AUROC of each of the 3 predictors.
I used the mice package to impute data and now have m datasets of imputed data.
Using the following command, I am able to get AUROC curves for each of m datasets:
fit1<-with(imp2, (roc(target, symptom1, ci=TRUE)))
fit2<-with(imp2, (roc(target, symptom2, ci=TRUE)))
fit3<-with(imp2, (roc(target, symptom3, ci=TRUE)))
I can see the estimates for each of m datasets without any problems.
fit1
fit2
fit3
To combine the parameters, I attempted to use mitools
>summary(pool(fit1))
>summary(pool(fit2))
>summary(pool(fit3))
I get the following error message:
"Error in pool(fit): Object has no vcov() method".
When combining coefficient estimates from m datasets, my understanding is that this is a simple average of the coefficients. However, the error term is more complex.
My question: How do I pool the "m" ROC parameter estimates (AUROC and 95% C.I. or S.E.) to get an accurate estimate of the error term for significance testing/95% Confidence Intervals?
Thank you for any help in advance.
I think the following works to combine the estimates.
pROC produces a point estimate for the AUROC as well as a 95% Confidence Interval.
To combine the AUROC from m imputation dataets, it is simply averaging the AUROC.
To create an appropriate standard error estimate and then a 95% C.I., I converted the 95% C.I.s into S.E. Using the standard formulas (Multiple Imputation FAQ, I computed the within, between, and total variance for the estimate. Once I had the standard error, I converted that back to a 95% C.I.
If anyone has any better suggestions, I would very much appreciate it.
I would use bootstrapping with the boot package to assess the different sources of variance. For instance for the variance due to imputation, you could use something like this:
bootstrap.imputation <- function(d, i, symptom){
sampled.data <- d[i,]
imputed.data <- ... # here the code you use to generate one imputed dataset, but apply it to sampled.data
auc(roc(imputed.data$target, imputed.data[[symptom]]))
}
boot.n <- 2000
boot(dataset, bootstrap.imputation, boot.n, "symptom1") # symptom1 is passed with ... to bootstrap.imputation
boot(dataset, bootstrap.imputation, boot.n, "symptom2")
boot(dataset, bootstrap.imputation, boot.n, "symptom3")
Then you can then do the same to assess the variance of the AUC. Impute your data, and apply the bootstrap again (or you can do with the built-in functions of pROC).

Resources