I am trying to get studentized confidence intervals for a correlation coefficient using the bootstrap. This is what I have so far; I have not found a way to get the studentized CIs to come out correctly. I'm running this through the R extension in SPSS, but that shouldn't affect the question.
Begin Program R.
rm(list=ls())
library(boot)
allData <- spssdata.GetDataFromSPSS(variables =c('compsales','T003000_mean'))
dict <-
spssdictionary.GetDictionaryFromSPSS(variables=c('compsales','T003000_mean'))
allData <- na.omit(allData)
cortest <- cor.test(allData$compsales,allData$T003000_mean)
pearson <- function(d, i){
  d2 <- d[i,]
  return(cor(d2$compsales, d2$T003000_mean))
}
bootcorr <- boot(allData, pearson, R=1200)
bootcorr
boot.ci(bootcorr,type = c("norm","basic", "perc","bca"),conf = .95)
End Program.
To try to get the studentized interval, I rewrote the function as:
pearson <- function(d, i){
d2 <- d[i,]
return(cor(d2$compsales,d2$T003000_mean), var(d2$compsales,d2$T003000_mean))
}
and changed the boot.ci call to:
boot.ci(bootcorr,type = "all",conf = .95)
But the vector of variances comes back as all NAs.
You need to return the correlation and the variance together as a vector; for the studentized interval, boot.ci() treats the first column of the bootstrap output as the statistic and the second column as its variance:
pearson <- function(d, i){
  d2 <- d[i,]
  # first element: the statistic; boot.ci(type = "stud") uses the second element
  # as the variance estimate for the studentized interval
  return(
    c(cor(d2$compsales, d2$T003000_mean), var(d2$compsales), var(d2$T003000_mean))
  )
}
library(boot)
data <- data.frame(compsales = runif(100), T003000_mean = runif(100))
bo <- boot(data, pearson, R = 99)
boot.ci(bo, type = "stud")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 99 bootstrap replicates
CALL :
boot.ci(boot.out = bo, type = "stud")
Intervals :
Level Studentized
95% (-0.1593, 0.3648 )
Calculations and Intervals on Original Scale
Some studentized intervals may be unstable
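If you would rather have the second element be an estimate of the variance of the correlation itself, one option (a sketch of my own, not the only way) is a small nested bootstrap inside the statistic function; pearson_stud and inner_R below are made-up names:
library(boot)
pearson_stud <- function(d, i, inner_R = 100){
  d2 <- d[i,]
  r <- cor(d2$compsales, d2$T003000_mean)
  # nested bootstrap: resample the current replicate to estimate var(r)
  inner <- replicate(inner_R, {
    j <- sample(nrow(d2), replace = TRUE)
    cor(d2$compsales[j], d2$T003000_mean[j])
  })
  c(r, var(inner))  # boot.ci(type = "stud") reads the variance from the second column
}
data <- data.frame(compsales = runif(100), T003000_mean = runif(100))
bootcorr <- boot(data, pearson_stud, R = 999)
boot.ci(bootcorr, type = "stud", conf = 0.95)
The inner loop makes this noticeably slower, so keep inner_R modest.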
I used fitdist from the fitdistrplus package to fit a (gamma) distribution to my data:
fitg <- fitdist(mdt, "gamma")
The result is a list of parameters that describe the fit. I wonder if there is a way to use the result to create cumulative distribution functions and random sample generators from this distribution.
For example, if the distribution fitted with fitdist corresponded to a normal distribution with mean 0 and sd 1, how could I easily recreate pnorm(..,0,1) and rnorm(..,0,1)?
I understand I can do it manually, but it would be much easier to have a function that does it "automatically", as I have to do this for many different datasets that will be fitted with different kinds of distributions.
Thanks a lot for your help!
Do you want something like the following?
library(fitdistrplus)
data <- rnorm(1000, 0.01, 1.01)  # sampled from original distribution N(0.01, 1.01^2)
fit_and_draw_sample <- function(data, nsamples, distr = 'norm') {
  if (distr == 'norm') {
    fitg <- fitdist(data, distr)
    params <- fitg$estimate
    print(params)  # fitted distribution N(0.0398281, 0.9876068^2) with estimated params
    #      mean        sd
    # 0.0398281 0.9876068
    mu <- params[1]
    sigma <- params[2]
    return(rnorm(nsamples, mu, sigma))
  }
  # handle other distributions here
  return(NULL)
}
samples <- fit_and_draw_sample(data, 1000)
hist(data, col = scales::alpha('blue', .2), border = FALSE,
     main = 'samples from original and fitted distribution')
hist(samples, col = scales::alpha('red', .2), add = TRUE, border = FALSE)
legend('topright', c("original", "fitted"),
       col = c(rgb(0, 0, 1, 0.2), rgb(1, 0, 0, 0.2)), lwd = c(2, 2))
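If you need this for many different datasets fitted with different kinds of distributions, a more generic sketch (my own suggestion, not part of the answer above) can build the p*/r* calls from the fitted object, since fitdist() stores the distribution name in $distname and the fitted parameters in $estimate. It assumes the matching p/r functions (pnorm/rnorm, pgamma/rgamma, ...) are available, and make_pdist/make_rdist are made-up helper names:
library(fitdistrplus)
make_pdist <- function(fitg) {
  # cumulative distribution function of the fitted distribution
  function(q) do.call(paste0("p", fitg$distname), c(list(q), as.list(fitg$estimate)))
}
make_rdist <- function(fitg) {
  # random sample generator from the fitted distribution
  function(n) do.call(paste0("r", fitg$distname), c(list(n), as.list(fitg$estimate)))
}
set.seed(1)
fitg <- fitdist(rgamma(500, shape = 2, rate = 1), "gamma")
pfit <- make_pdist(fitg)  # behaves like pgamma(., shape = ..., rate = ...)
rfit <- make_rdist(fitg)  # behaves like rgamma(., shape = ..., rate = ...)
pfit(2)                   # P(X <= 2) under the fitted distribution
hist(rfit(1000))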
Suppose I want to assess the goodness of fit of a linear model before and after leaving out a covariate, and I want to implement some kind of bootstrapping.
I tried to bootstrap the sum of residuals of both models and then applied the Kolmogorov-Smirnov test to assess whether the two samples come from the same distribution.
The minimal working code:
library(boot)
lm.statistic.resid <- function(data, i){
  d <- data[i, ]
  r.gressor <- colnames(data)[1]
  c.variates <- colnames(data)[-1]
  # fit the response (first column) on the remaining covariates of the resampled data
  lm.boot <- lm(reformulate(c.variates, r.gressor), data = d)
  out <- sum(resid(lm.boot))
  return(out)
}
df.restricted <- mtcars[ , names(mtcars) != "wt"]
classical.lm <- lm(mtcars)
restricted.lm <- lm(df.restricted)
boot.regression.full <- boot(mtcars,
                             statistic = lm.statistic.resid,
                             R = 1000)
boot.regression.restricted <- boot(df.restricted,
                                   statistic = lm.statistic.resid,
                                   R = 1000)
x <- boot.regression.restricted$t
y <- boot.regression.full$t
ks.test(x, y)
However, I get roughly the same result whether I remove wt (which is statistically significant) or am (which is not).
I would expect a smaller p-value when I remove wt.
I'm fitting a power model to a dataset by applying a simple linear model with the R function lm after a log-log transformation, as in the example below (instead of fitting the power model directly, for example with the nls function).
I can use the function predict.lm to apply the model to new data and calculate prediction intervals.
data(stackloss); dat <- stackloss[c(2, 4)]; colnames(dat) <- c("x","y")
dat.lm <- lm(log(y) ~ log(x), data = dat)
new <- data.frame(x = seq(0, 30, 1))
pred <- predict.lm(dat.lm, new, interval = "prediction", level = 0.95)
matplot(new$x, exp(pred), type = "l", col = 1, lty = c(1, 2, 2)); points(dat$x, dat$y)
Now, I need to sum n predicted values (which is straightforward, after applying the 'exp' function) and also to calculate the aggregated variance and prediction intervals.
The latter has been described for a simple linear model in the following Q&A: linear model with `lm`: how to get prediction variance of sum of predicted values.
In that interesting answer, the following functions were introduced for the simple linear model: lm_predict, which computes the complete variance-covariance matrix of the predicted values, and agg_pred.
lm_predict <- function (lmObject, newdata, diag = TRUE) {
  ## input checking
  if (!inherits(lmObject, "lm")) stop("'lmObject' is not a valid 'lm' object!")
  ## extract "terms" object from the fitted model, but delete response variable
  tm <- delete.response(terms(lmObject))
  ## linear predictor matrix
  Xp <- model.matrix(tm, newdata)
  ## predicted values by direct matrix-vector multiplication
  pred <- c(Xp %*% coef(lmObject))
  ## efficiently form the complete variance-covariance matrix
  QR <- lmObject$qr   ## qr object of fitted model
  piv <- QR$pivot     ## pivoting index
  r <- QR$rank        ## model rank / numeric rank
  if (is.unsorted(piv)) {
    ## pivoting has been done
    B <- forwardsolve(t(QR$qr), t(Xp[, piv]), r)
  } else {
    ## no pivoting is done
    B <- forwardsolve(t(QR$qr), t(Xp), r)
  }
  ## residual variance
  sig2 <- c(crossprod(residuals(lmObject))) / df.residual(lmObject)
  if (diag) {
    ## return point-wise prediction variance
    VCOV <- colSums(B ^ 2) * sig2
  } else {
    ## return full variance-covariance matrix of predicted values
    VCOV <- crossprod(B) * sig2
  }
  list(fit = pred, var.fit = VCOV, df = lmObject$df.residual, residual.var = sig2)
}
agg_pred <- function (w, predObject, alpha = 0.95) {
  ## input checking
  if (length(w) != length(predObject$fit)) stop("'w' has wrong length!")
  if (!is.matrix(predObject$var.fit)) stop("'predObject' has no variance-covariance matrix!")
  ## mean of the aggregation
  agg_mean <- c(crossprod(predObject$fit, w))
  ## variance of the aggregation
  agg_variance <- c(crossprod(w, predObject$var.fit %*% w))
  ## adjusted variance-covariance matrix
  VCOV_adj <- with(predObject, var.fit + diag(residual.var, nrow(var.fit)))
  ## adjusted variance of the aggregation
  agg_variance_adj <- c(crossprod(w, VCOV_adj %*% w))
  ## t-distribution quantiles
  Qt <- c(-1, 1) * qt((1 - alpha) / 2, predObject$df, lower.tail = FALSE)
  ## names of CI and PI
  NAME <- c("lower", "upper")
  ## CI
  CI <- setNames(agg_mean + Qt * sqrt(agg_variance), NAME)
  ## PI
  PI <- setNames(agg_mean + Qt * sqrt(agg_variance_adj), NAME)
  ## return
  list(mean = agg_mean, var = agg_variance, CI = CI, PI = PI)
}
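For reference, here is a quick sketch (my own illustration) of how these two functions are called on the log scale for the fit above; new2 drops x = 0 because log(0) is not finite:
new2 <- data.frame(x = seq(1, 30, 1))
pred <- lm_predict(dat.lm, new2, diag = FALSE)  # full variance-covariance matrix of the predicted log(y)
w <- rep(1, nrow(new2))                         # equal weights, i.e. a plain sum
agg_pred(w, pred, alpha = 0.95)                 # aggregated mean, variance, CI and PI on the log scale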
However, these functions cannot be applied directly to aggregate the variance properly in the case of the log-log regression. Perhaps I should transform the variance in the output of lm_predict, but I couldn't figure out how to proceed.
Thank you in advance for any help.
I am trying to calculate the type I error rate and power of the correlation test for bivariate normal data using Monte Carlo simulation.
But I am getting unexpected values for the type I error and for the power (a type I error rate of 0.864).
I need to know whether I have made a mistake somewhere. Can anyone help me?
set.seed(160230)
library("mvtnorm", lib.loc="~/R/win-library/3.4")
sigma= matrix(c(1,0.8,0.8,1),2,2)
mu <- c(0,0)
#bivariate normal data
sim=replicate(n=1000 , rmvnorm(10,mean=mu , sigma = sigma))
pval1=c()
for(i in 1:1000)
{
pval1[i]=cor.test(sim[,1,i],sim[,2,i],method = c("pearson"))$p.value
}
#type1 error rate
mean(pval1<0.05)
#power
mean(pval3>0.05)
Your code is okay but you have set up your simulations wrong.
In your code, you
Simulate bivariate data with a strong correlation, rho=0.8.
Test the null hypothesis H0: rho = 0.
Thus, you are simulating data under the alternative hypothesis, which is why you get the result of 0.864. This is essentially your power for that particular alternative. You could do the following instead:
First simulate data under the null hypothesis
sigma <- matrix(c(1,0,0,1),2,2)
mu <- c(0,0)
#bivariate normal data under H0
sim <- replicate(n=1000, rmvnorm(10, mean=mu, sigma = sigma))
# Test the actual level under H0
result <- sapply(1:1000, function(i) {
cor.test(sim[,1,i],sim[,2,i],method = c("pearson"))$p.value})
mean(result < 0.05)
which gives a value around 0.05. Under the alternative you can use your code with the correlation 0.8 (or some other number). You can generalise this with the following code to easily get the power for several correlations.
rho <- seq(0, .9, .1)
pwr <- sapply(rho, function(r) {
sigma <- matrix(c(1,r,r,1),2,2)
mu <- c(0,0)
#bivariate normal data
sim <- replicate(n=1000, rmvnorm(10, mean=mu, sigma = sigma))
# Test the actual level
result <- sapply(1:1000, function(i) {
cor.test(sim[,1,i],sim[,2,i],method = c("pearson"))$p.value})
mean(result < 0.05)
})
Then you can see the impact of the correlation on the power by plotting the relationship:
plot(rho, pwr, type="l", xlab=expression(rho), ylab="Power")
I would like to calculate a BCa confidence interval for a multi-stage bootstrap using boot.ci(). Here is an example from Non-parametric bootstrapping on the highest level of clustered data using boot() function from {boot} in R, which uses the boot command.
# creating example df
rho <- 0.4
dat <- expand.grid(
trial=factor(1:5),
subject=factor(1:3)
)
sig <- rho * tcrossprod(model.matrix(~ 0 + subject, dat))
diag(sig) <- 1
set.seed(17); dat$value <- chol(sig) %*% rnorm(15, 0, 1)
# function for resampling
resamp.mean <- function(dat,
                        indices,
                        cluster = c('subject', 'trial'),
                        replace = TRUE){
  cls <- sample(unique(dat[[cluster[1]]]), replace = replace)
  sub <- lapply(cls, function(b) subset(dat, dat[[cluster[1]]] == b))
  sub <- do.call(rbind, sub)
  mean(sub$value)
}
dat.boot <- boot(dat, resamp.mean, 4) # produces an estimated statistic
boot.ci(dat.boot) # produces errors
How can I use boot.ci on the boot output?
You have used too few bootstrap resamples. When you call boot.ci, influence measures are needed and, if they are not provided, they are obtained from empinf, which may fail when there are too few bootstrap replicates. See here for an explanation along similar lines.
Try
dat.boot <- boot(dat, resamp.mean, 1000)
boot.ci(dat.boot, type = "bca")
which gives:
> boot.ci(dat.boot, type = "bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = dat.boot, type = "bca")
Intervals :
Level BCa
95% (-0.2894, 1.2979 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
As an alternative, you can provide L (the influence measures) yourself.
# proof of concept, use appropriate value for L!
> dat.boot <- boot(dat, resamp.mean, 4)
> boot.ci(dat.boot, type = "bca", L = 0.2)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 4 bootstrap replicates
CALL :
boot.ci(boot.out = dat.boot, type = "bca", L = 0.2)
Intervals :
Level BCa
95% ( 0.1322, 1.2979 )
Calculations and Intervals on Original Scale
Warning : BCa Intervals used Extreme Quantiles
Some BCa intervals may be unstable