Coverage probability calculation for LM in R

I am trying to calculate the coverage probability for a set of residual bootstrap replicates I generated on the intercept and slope of a regression. Can anyone show me how to calculate the coverage probability of the confidence intervals? Many thanks.
Note that I ran the regression manually using QR decomposition, but you can use lm() if that's easier. I just thought doing it manually would be faster.
set.seed(42) ## for sake of reproducibility
n <- 100
x <- rnorm(n)
e <- rnorm(n)
y <- as.numeric(50 + 25*x + e)
dd <- data.frame(id=1:n, x=x, y=y)
mo <- lm(y ~ x, data=dd)
# Manual Residual Bootstrap
resi <- residuals(mo)
fit <- fitted(mo)
ressampy <- function() fit + sample(resi, length(resi), replace=TRUE)
# Sample y values:
head(ressampy())
# QR decomposition of X values
qrX <- qr(cbind(Intercept=1, dd[, "x", drop=FALSE]), LAPACK=TRUE)
# faster than lm()
qr.coef(qrX, dd[, "y"])
# One Bootstrap replication
boot1 <- qr.coef(qrX, ressampy())
# 1000 bootstrap replications
boot <- t(replicate(1000, qr.coef(qrX, ressampy())))
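For what it's worth, a rough timing comparison of the two approaches (just a sketch; numbers will vary by machine):
system.time(replicate(1000, qr.coef(qrX, ressampy())))       # manual QR refit
system.time(replicate(1000, coef(lm(ressampy() ~ x, data = dd))))  # full lm() refit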
EDIT
Incorporating jay.sf's answer, I rewrote the code using the lm() method and compared the first and second approaches of calculating coverage probability described in the link shared by jay.sf:
library(lmtest);library(sandwich)
ci <- confint(coeftest(mo, vcov.=vcovHC(mo, type="HC3")))
ci
FUNInter <- function() {
  X <- model.matrix(mo)
  ressampy.2 <- fit + sample(resi, length(resi), replace = TRUE)
  bootmod <- lm(ressampy.2 ~ X - 1)
  confint(bootmod, "X(Intercept)", level = 0.95)
}
FUNBeta <- function() {
  X <- model.matrix(mo)
  ressampy.2 <- fit + sample(resi, length(resi), replace = TRUE)
  bootmod <- lm(ressampy.2 ~ X - 1)
  confint(bootmod, "Xx", level = 0.95)
}
set.seed(42)
R <- 1000
Interres <- replicate(R, FUNInter(), simplify=FALSE)
Betares <- replicate(R, FUNBeta(), simplify=FALSE)
ciinter <- t(sapply(Interres, function(x, y) x[grep(y, rownames(x)), ], "X\\(Intercept\\)"))
cibeta <- t(sapply(Betares, function(x, y) x[grep(y, rownames(x)), ], "Xx"))
# second approach of calculating CP
sum(ciinter[, "2.5 %"] <= 50 & 50 <= ciinter[, "97.5 %"])/R
#> [1] 0.842
sum(cibeta[, "2.5 %"] <= 25 & 25 <= cibeta[, "97.5 %"])/R
#> [1] 0.945
# first approach of calculating CP
sum(apply(ciinter, 1, function(x) {
  all(data.table::between(x, ci[1, 1], ci[1, 2]))
}))/R
#> [1] 0.076
sum(apply(cibeta, 1, function(x) {
  all(data.table::between(x, ci[2, 1], ci[2, 2]))
}))/R
#> [1] 0.405

According to Morris et al. (2019), Table 6, the coverage probability is defined as how often the true theta lies within a bootstrapped confidence interval (CI), i.e. an interval from the model applied to many samples based on the actual data, or, in other words, to new experiments.
Hence, we want to compute CIs from the OP's proposed i.i.d. bootstrap R times and calculate the proportion of these CIs that contain theta.
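Schematically (a sketch only, with ci an R-row matrix of lower and upper bounds and theta the true value), that is simply:
## schematic; ci and theta as described above
# cp <- mean(ci[, 1] <= theta & theta <= ci[, 2])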
First, we estimate our model mo using the actual data.
mo <- lm(y ~ x)
To avoid unnecessarily unpacking the fitted values yhat, the residuals u, the model matrix X, and the coefficients coef0 in the replications, we extract them beforehand.
yhat <- mo$fitted.values
u <- as.matrix(mo$residuals)
X <- model.matrix(mo)
theta <- c(50, 25) ## known from data generating process of simulation
In a bootstrap function FUN we wrap all the steps we want to perform in one replication. In order to apply the very fast .lm.fit, we have to calculate the White (heteroskedasticity-consistent) standard errors manually (identical to lmtest::coeftest(fit, vcov.=sandwich::vcovHC(fit, type="HC1"))).
FUN <- function() {
  ## resampling residuals
  y.star <- yhat + sample(u, length(u), replace=TRUE)
  ## refit model
  fit <- .lm.fit(X, y.star)
  coef <- fit$coefficients[sort.list(fit$pivot)]
  ## alternatively using QR, but `.lm.fit` is slightly faster
  # qrX <- qr(X, LAPACK=TRUE)
  # coef <- qr.coef(qrX, y.star)
  ## white standard errors
  v.cov <- chol2inv(chol(t(X) %*% X))
  meat <- t(X) %*% diag(diag(u %*% t(u))) %*% X
  ## degrees of freedom adjust (HC1)
  d <- dim(X)
  dfa <- d[1] / (d[1] - d[2])
  white.se <- sqrt(diag(v.cov %*% meat %*% v.cov)*dfa)
  ## 95% CIs
  ci <- coef + qt(1 - .025, d[1] - d[2])*white.se %*% t(c(-1, 1))
  ## coverage
  c(intercept=theta[1] >= ci[1, 1] & theta[1] <= ci[1, 2],
    x=theta[2] >= ci[2, 1] & theta[2] <= ci[2, 2])
}
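As a side check of that claim (a sketch, assuming the lmtest and sandwich packages are installed), the manual HC1 standard errors can be verified on the original fit mo; the names u0, meat0, bread0, dfa0 are illustrative only:
library(sandwich); library(lmtest)
u0 <- residuals(mo)
bread0 <- chol2inv(chol(t(X) %*% X))            # (X'X)^-1
meat0 <- t(X) %*% diag(as.vector(u0)^2) %*% X   # HC0 "meat"
dfa0 <- nrow(X) / (nrow(X) - ncol(X))           # HC1 degrees-of-freedom adjustment
manual.se <- sqrt(diag(bread0 %*% meat0 %*% bread0) * dfa0)
all.equal(unname(manual.se),
          unname(coeftest(mo, vcov. = vcovHC(mo, type = "HC1"))[, "Std. Error"]))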
Now we execute the bootstrap using replicate.
R <- 5e3
set.seed(42)
system.time(res <- t(replicate(R, FUN())))
# user system elapsed
# 71.19 28.25 100.28
head(res, 3)
# intercept x
# [1,] TRUE TRUE
# [2,] FALSE TRUE
# [3,] TRUE TRUE
The proportion of rows in which both columns are TRUE simultaneously, or the mean of each column separately, gives the coverage probability we are looking for.
(cp.t <- mean(rowSums(res) == ncol(res))) ## coverage probability total
(cp.i <- colMeans(res)) ## coverage probability individual coefs
(cp <- c(total=cp.t, cp.i))
# total intercept x
# 0.8954 0.9478 0.9444
## values with other R:
# total intercept x
# 0.90700 0.95200 0.95200 ## R == 1k
# 0.89950 0.95000 0.94700 ## R == 2k
# 0.89540 0.94780 0.94440 ## R == 5k
# 0.89530 0.94570 0.94680 ## R == 10k
# 0.89722 0.94694 0.94777 ## R == 100k
And this is how it looks after 100k repetitions:
Code for plot:
r1 <- sapply(seq(nrow(res)), \(i) mean(rowSums(res[1:i,,drop=FALSE]) == ncol(res)))
r2 <- t(sapply(seq(nrow(res)), \(i) colMeans(res[1:i,,drop=FALSE])))
r <- cbind(r1, r2)
matplot(r, type='l', col=2:4, lty=1, main='coverage probability', xlab='R',
ylab='cum. mean',ylim=c(.89, .955))
grid()
sapply(seq(cp), \(i) abline(h=cp[i], lty=2, col=i + 1))
legend('right', col=2:4, lty=1, legend=names(cp), bty='n')
Data:
set.seed(42)
n <- 1e3
x <- rnorm(n)
y <- 50 + 25*x + rnorm(n)

Related

How to calculate the Power Function of the Chi Square Goodness of Fit Test through Monte Carlo simulations using R?

I need tips on how to calculate the power function of the chi-square goodness-of-fit test using Monte Carlo simulations in R. I am familiar with the pwr.chisq function, but I need a way to write the code for the Monte Carlo simulation in R.
I can do it for the t-test as follows:
n <- 100
mean_true <- 17
sd_true <- 2
## Null-Hypothesis (H0: mean_true = mean_0):
mean_0 <- seq(16, 18, len=15)
alpha <- 0.05
B <- 1000
Empirical_Power <- rep(NA, length(mean_0))
for (j in 1:length(Empirical_Power)) {
  Test_Decisions <- rep(NA, B)
  for (i in 1:B) {
    dat_X <- rnorm(n = n, mean = mean_true, sd = sd_true)
    t.Test_result <- t.test(x = dat_X, alternative = "two.sided", mu = mean_0[j])
    Test_Decisions[i] <- t.Test_result$p.value < alpha
  }
  Number_of_Rejections <- length(Test_Decisions[Test_Decisions == TRUE])
  Empirical_Power[j] <- Number_of_Rejections/B
}
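To inspect the resulting power curve, one could simply plot it (just a sketch):
plot(mean_0, Empirical_Power, type = "b",
     xlab = "hypothesized mean (mean_0)", ylab = "empirical power")
abline(h = alpha, lty = 2)   # nominal significance level for reference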
I need a similar approach for the chi-square test, and it doesn't seem to work. This is how far I got, but it's clearly wrong because no sense can be made from the results:
n <- 100
Frequency_true <- c(50,60,40,47,53)
sd_true <- 2
Frequency_0 <- c(0.2,0.2,0.2,0.2,0.2)
alpha <- 0.05
B <- 1000
Empirical_Power <- rep(NA, length(Frequency_0))
for (j in 1:length(Empirical_Power)) {
  Test_Decisions <- rep(NA, B)
  for (i in 1:B) {
    dat_X <- rchisq(100000, df = 99)
    Chisq_result <- chisq.test(x = Frequency_true, p = Frequency_0)
    Test_Decisions[i] <- Chisq_result$p.value < alpha
  }
  Number_of_Rejections <- length(Test_Decisions[Test_Decisions == TRUE])
  Empirical_Power[j] <- Number_of_Rejections/B
}
The experiment is to draw 250 balls from an urn with 5 different types of balls, where all types are equally likely to be drawn. The counts of the types drawn are given in the vector Frequency_true, which follows a multinomial distribution.
So, in order to get the simulated power of the test, simulate B draws with a fixed total count of 250 using rmultinom, run chi-squared goodness-of-fit tests, and compute the proportion of p-values below the significance level alpha.
sim_p_value <- function(B, freq, prob){
  Sum <- sum(freq)
  x <- rmultinom(B, size = Sum, prob = prob)
  apply(x, 2, \(y) chisq.test(y, p = prob)$p.value)
}
Frequency_true <- c(50,60,40,47,53)
Frequency_0 <- c(0.2,0.2,0.2,0.2,0.2)
alpha <- 0.05
B <- 1000
set.seed(2022)
pval <- sim_p_value(B, Frequency_true, Frequency_0)
Empirical_Power <- mean(pval < alpha)
Empirical_Power
#> [1] 0.16
Created on 2022-07-09 by the reprex package (v2.0.1)
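If you also want the power as a function of the alternative, analogous to the mean_0 grid in the t-test example, a minimal sketch along the same lines (reusing B, alpha, Frequency_true and Frequency_0 from above; the tilt parameter delta is illustrative) would be:
N <- sum(Frequency_true)                       # 250 balls per draw
delta <- seq(0, 0.15, length.out = 7)          # how far the first cell departs from 0.2
power_curve <- sapply(delta, function(d) {
  p_true <- c(0.2 + d, rep(0.2 - d/4, 4))      # alternative cell probabilities (sum to 1)
  x <- rmultinom(B, size = N, prob = p_true)
  pvals <- apply(x, 2, \(y) chisq.test(y, p = Frequency_0)$p.value)
  mean(pvals < alpha)
})
round(cbind(delta, power_curve), 3)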

How to perform bootstrapping for estimation and inference of quantile regression using multiply imputed data in R?

I am trying to manually pool results from quantile regression models run on multiply imputed data in R using mice. I make use of a bootstrapping procedure to get 95% CIs and P values of the model terms, in which model parameters and their standard errors are obtained after sampling a certain number of rows that is equal to the unique number of participants in my data set. This procedure is repeated 500 times for each of the m imputed data sets. Then, as a last step, I pool the estimated coefficients and their standard errors of the resulting 500 * m regression models according to Rubin's rules (1987) (see e.g. https://bookdown.org/mwheymans/bookmi/rubins-rules.html). To speed things up, I use foreach to split up the analyses over multiple processor cores and for to loop over the m imputed data sets.
However, there seems to be a flaw in the part wherein the results are pooled. When I look at the pooled results, I observe that the P values are not in accordance with the 95% CIs (e.g. P < 0.05 when 0 is included in the 95% CI).
To illustrate this issue, I have made a reproducible example, using these publicly available data: https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv
Because there are no missing data in this data set, I introduce them myself and impute the data (m = 10 multiply imputed data sets with 20 iterations). I use set.seed for reproducibility.
Note that I use lm instead of quantreg::rq in this example.
# load data
projdir <- "my_directory"
d <- read.csv(file = file.path(projdir, 'heart_failure_clinical_records_dataset.csv'))
#### introduce missing values
set.seed(1)
# age
age_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$age[age_miss_tag == 1] <- NA # MCAR
# serum creatinine
creat_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$serum_creatinine[creat_miss_tag == 1 & d$anaemia == 0] <- NA # MAR
# CK
CK_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$creatinine_phosphokinase[CK_miss_tag & d$platelets > median(d$platelets)] <- NA # MAR
# platelets
platelets_miss_tag <- rbinom(nrow(d), 1, 0.3)
d$platelets[platelets_miss_tag == 1] <- NA # MCAR
library(mice); library(mitml); library(miceadds); library(splines); library(foreach); library(doParallel)
# impute data
imp <- mice(d, maxit = 20, m = 10, seed = 2)
# log creatinine
implong <- complete(imp, 'long', include = FALSE)
implong$log_creat <- log(implong$serum_creatinine)
imp <- miceadds::datlist2mids(split(implong, implong$.imp))
# compute values for Boundary.knots
temp <- complete(imp, 'long', include = FALSE)
B_knots <- rowMeans(sapply(split(temp, temp$.imp), function(x) {
quantile(x$age, c(0.1, 0.9))
}))
# Convert mids object into a datlist
longlist <- miceadds::mids2datlist(imp)
# fit model based on original data and use the terms in the foreach loop below
# in order to fix the position of the inner knots
fit_orig <- lm(log_creat ~
# Main effects
ns(age, df = 2, B = c(B_knots[1], B_knots[2])) * sex,
data = longlist[[1]])
To further speed things up, I use OLS instead of quantile regression here and parallelize the process.
# make cluster used in foreach
cores_2_use <- detectCores() - 1
cl <- makeCluster(cores_2_use)
clusterSetRNGStream(cl, iseed = 9956)
registerDoParallel(cl)
# No. of bootstrap samples to be taken
n_iter <- 500
boot.1 <- c()
for (k in seq_along(longlist)) {
  boot.1[[k]] <- foreach(i = seq_len(n_iter),
                         .combine = rbind,
                         .packages = c('mice', 'mitml', 'splines')) %dopar% {
    # store data from which rows can be sampled
    longlist0 <- longlist[[k]]
    # set seed for reproducibility
    set.seed(i)
    # sample rows
    boot_dat <- longlist0[sample(1:nrow(longlist0), replace = TRUE), ]
    # linear regression model based on sampled rows
    fit1 <- lm(terms(fit_orig), data = boot_dat)
    # save coefficients
    fit1$coef
  }
}
stopCluster(cl)
As a last step, I pool the results according to Rubin's rules.
n_cols <- dim(boot.1[[1]])[2]
list <- c()
for (i in seq_len(n_cols)) {
  # extract coefficients
  parameter <- lapply(boot.1, function(x){
    x[, i]
  })
  m <- length(parameter)
  for (k in seq_len(m)) {
    names(parameter[[k]]) <- NULL
  }
  Q <- sapply(parameter, mean)
  U <- sapply(parameter, var) # (standard error of estimate)^2
  #### Pooling
  # Pooled univariate estimate
  qbar <- mean(Q)
  # Mean of the variances (i.e. the pooled within-imputation variance)
  ubar <- mean(U)
  # Between-imputation variance
  btw_var <- var(Q)
  # Total variance of the pooled estimate
  tot_var <- ubar + btw_var + (btw_var / m)
  # Relative increase in variance due to non-response
  r_var <- (btw_var + (btw_var / m)) / ubar
  # Fraction of missing information
  lambda <- (btw_var + (btw_var / m)) / tot_var
  # Degrees of freedom for the t-distribution according to Rubin (1987)
  df_old <- (m - 1) / lambda^2
  # Sample size in the imputed data sets
  n_sample <- nrow(longlist[[1]])
  # Observed degrees of freedom
  df_observed <- (((n_sample - n_cols) + 1) / ((n_sample - n_cols) + 3)) *
    (n_sample - n_cols) * (1 - lambda)
  # Adjusted degrees of freedom according to Barnard & Rubin (1999)
  df_adjusted <- (df_old * df_observed) / (df_old + df_observed)
  # 95% confidence interval of qbar
  lwr <- qbar - qt(0.975, df_adjusted) * sqrt(tot_var)
  upr <- qbar + qt(0.975, df_adjusted) * sqrt(tot_var)
  # F statistic
  q <- ((0 - qbar)^2 / tot_var)^2
  # Significance level associated with the null value Q[0]
  p_value <- pf(q, df1 = 1, df2 = df_adjusted, lower.tail = FALSE)
  list[[i]] <- cbind(qbar, lwr, upr, p_value)
}
names(list) <- colnames(boot.1[[1]])
list
Obviously, the P value shown below is not in accordance with the 95% CI (as 0 is included in the CI, so the P value should be ≥0.05).
> list
$`(Intercept)`
qbar lwr upr p_value
[1,] 0.06984595 -0.02210231 0.1617942 0.008828337
EDIT (29 Dec 2021)
As @Gerko Vink notes in his answer, multiple imputation and bootstrapping both induce variance. The variance induced by imputation is taken care of by Rubin's rules; the bootstrap variance is not. Unfortunately, mice::pool will not work with the output returned by quantreg::rq.
I am aware of constructing bootstrap CIs based on a naive percentile-based approach as shown in this post, but I am inclined to think this is not the correct approach to proceed with.
Does anyone know how to appropriately take care of the extra variance induced by bootstrapping when using rq?
EDIT (30 Dec 2021)
Inspired by this recent post, I decided not to hit the road of bootstrapping anymore, but instead manually extract the point estimates and variances from each of the imputed data sets and pool them using Rubin's rules. I have posted this approach as answer below. Any input on how to appropriately take care of the extra variance induced by bootstrapping when using rq is still very welcome though!
Bootstrapping and multiple imputation both induce variance. The imputation variance is taken care of by Rubin's rules for parameters with normal sampling distributions. The bootstrap variance is not.
Two remarks:
First, there is a small error in your code. You're already calculating the bootstrap variance about Q with U <- sapply(parameter, var); there is no need for an additional U <- U/n_iter step. U is already the variance, and sapply(parameter, sd) would yield the bootstrapped standard error.
Second, you're using bootstrap parameters to calculate a parametric interval and p-value. That seems needlessly complicated and, as you can see, potentially problematic. Why not calculate the bootstrap CI?
See also this link for some inspiration with respect to different means of calculating the CIs and their respective validity.
A small sim that demonstrates that you cannot expect both to be identical for a finite set of bootstrap replications.
library(purrr)
library(magrittr)
#fix seed
set.seed(123)
#some data
n = 1000
d <- rnorm(n, 0, 1)
# ci function
fun <- function(x){
  se <- var(x)/length(x)
  lwr <- mean(x) - 1.96 * se
  upr <- mean(x) + 1.96 * se
  ci <- c(lwr, upr)
  return(ci)
}
# bootstrap
boot <- replicate(500,
                  d[sample(1:1000, 1000, replace = TRUE)],
                  simplify = FALSE)
# bootstrapped ci's based on parameters
boot.param.ci <- boot %>%
  map(~.x %>% fun) %>%
  do.call("rbind", args = .)
# bootstrap CI
boot.ci <- boot %>%
  map(~.x %>% mean) %>%
  unlist %>%
  quantile(c(.025, .975))
# Overview
data.frame(param = fun(d),
           boot.param = boot.param.ci %>% colMeans,
           boot.ci = boot.ci)
#> param boot.param boot.ci
#> 2.5% 0.01420029 0.01517527 -0.05035913
#> 97.5% 0.01805545 0.01904181 0.07245449
Created on 2021-12-22 by the reprex package (v2.0.1)
The following reprex also demonstrates that the bootstrap applied to the imputed data yields different variance estimates under the same pooling rules.
library(purrr)
library(magrittr)
library(mice)
#fix seed
set.seed(123)
imp <- mice(boys,
m = 10,
printFlag = FALSE)
imp %>%
complete("all") %>%
map(~.x %$%
lm(age ~ hgt + hc)) %>%
pool %>%
summary(conf.int = TRUE)
#> term estimate std.error statistic df p.value 2.5 %
#> 1 (Intercept) -1.9601179 0.809167659 -2.422388 682.5182 0.01567825 -3.5488747
#> 2 hgt 0.1690468 0.002784939 60.700342 572.1861 0.00000000 0.1635768
#> 3 hc -0.2138941 0.021843724 -9.792018 639.0432 0.00000000 -0.2567883
#> 97.5 %
#> 1 -0.3713610
#> 2 0.1745167
#> 3 -0.1710000
imp %>%
complete("all") %>%
map(~.x %>%
.[sample(1:748, 748, replace = TRUE), ] %$%
lm(age ~ hgt + hc)) %>%
pool %>%
summary(conf.int = TRUE)
#> term estimate std.error statistic df p.value 2.5 %
#> 1 (Intercept) -1.9810146 1.253312293 -1.580623 22.57546 1.278746e-01 -4.5763892
#> 2 hgt 0.1689181 0.004124538 40.954423 24.47123 0.000000e+00 0.1604141
#> 3 hc -0.2133606 0.033793045 -6.313743 22.29686 2.217445e-06 -0.2833890
#> 97.5 %
#> 1 0.6143599
#> 2 0.1774221
#> 3 -0.1433322
Created on 2021-12-22 by the reprex package (v2.0.1)
For quantile regression, mice::pool will not work with the output returned by quantreg::rq, because (according to this post) there is no agreed upon method to calculate standard errors, which are required to pool results under multiple imputation.
An ad hoc solution would be to manually extract the point estimates and variances from each of the imputed data sets and pool them using Rubin's rules.
First, a reprex using lm to verify whether results from the manual approach and mice::pool are identical.
library(mice)
imp <- mice(nhanes, print = FALSE, seed = 123)
# fit linear model
fit <- with(imp, lm(bmi ~ chl + hyp))
# manually pool univariate estimates using Rubin's rules
pool_manual <- \(model_object) {
  m <- length(model_object$analyses)
  Q <- sapply(model_object$analyses, \(x) summary(x)$coefficients[, 'Estimate'])
  U <- sapply(model_object$analyses, \(x) (summary(x)$coefficients[, 'Std. Error'])^2)
  qbar <- rowMeans(Q)
  ubar <- rowMeans(U)
  btw_var <- apply(Q, 1, var)
  tot_var <- ubar + btw_var + (btw_var / m)
  lambda <- (btw_var + (btw_var / m)) / tot_var
  df_old <- (m - 1) / lambda^2
  n_sample <- length(residuals(model_object$analyses[[1]]))
  n_cols <- dim(Q)[1]
  df_com <- n_sample - n_cols
  df_observed <- ((df_com + 1) / (df_com + 3)) * df_com * (1 - lambda)
  df_adjusted <- (df_old * df_observed) / (df_old + df_observed)
  lwr <- qbar - qt(0.975, df_adjusted) * sqrt(tot_var)
  upr <- qbar + qt(0.975, df_adjusted) * sqrt(tot_var)
  q <- (0 - qbar)^2 / tot_var
  p_value <- pf(q, df1 = 1, df2 = df_adjusted, lower.tail = FALSE)
  df <- data.frame(noquote(rownames(Q)), qbar, lwr, upr, p_value)
  rownames(df) <- NULL
  names(df) <- c('term', 'estimate', '2.5 %', '97.5 %', 'p.value')
  return(df)
}
Verify.
> pool_manual(fit)
term estimate 2.5 % 97.5 % p.value
1 (Intercept) 21.78583831 8.99373786 34.57793875 0.004228746
2 chl 0.03303449 -0.02812005 0.09418903 0.254696358
3 hyp -1.07291395 -5.57406829 3.42824039 0.624035769
> extract <- c('term', 'estimate', '2.5 %', '97.5 %', 'p.value')
> summary(pool(fit), conf.int = TRUE)[, extract]
term estimate 2.5 % 97.5 % p.value
1 (Intercept) 21.78583831 8.99373786 34.57793875 0.004228746
2 chl 0.03303449 -0.02812005 0.09418903 0.254696358
3 hyp -1.07291395 -5.57406829 3.42824039 0.624035769
Quantile regression
Now, let's pool results from rq for the expected median of the outcome.
library(quantreg)
# fit quantile regression model
fit <- with(imp, rq(bmi ~ chl + hyp, tau = 0.5))
To be able to pool results from rq, only the summary method used to extract point estimates and variances from each of the imputed data sets needs to be adjusted in pool_manual.
Q <- sapply(model_object$analyses, \(x) summary.rq(x, covariance = TRUE)$coefficients[, 'Value'])
U <- sapply(model_object$analyses, \(x) (summary.rq(x, covariance = TRUE)$coefficients[, 'Std. Error'])^2)
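Written out in full, this amounts to redefining the function with those two lines swapped in; a sketch of the adjusted version (otherwise identical to pool_manual above) is:
pool_manual <- \(model_object) {
  m <- length(model_object$analyses)
  Q <- sapply(model_object$analyses, \(x) summary.rq(x, covariance = TRUE)$coefficients[, 'Value'])
  U <- sapply(model_object$analyses, \(x) (summary.rq(x, covariance = TRUE)$coefficients[, 'Std. Error'])^2)
  qbar <- rowMeans(Q)
  ubar <- rowMeans(U)
  btw_var <- apply(Q, 1, var)
  tot_var <- ubar + btw_var + (btw_var / m)
  lambda <- (btw_var + (btw_var / m)) / tot_var
  df_old <- (m - 1) / lambda^2
  n_sample <- length(residuals(model_object$analyses[[1]]))
  n_cols <- dim(Q)[1]
  df_com <- n_sample - n_cols
  df_observed <- ((df_com + 1) / (df_com + 3)) * df_com * (1 - lambda)
  df_adjusted <- (df_old * df_observed) / (df_old + df_observed)
  lwr <- qbar - qt(0.975, df_adjusted) * sqrt(tot_var)
  upr <- qbar + qt(0.975, df_adjusted) * sqrt(tot_var)
  q <- (0 - qbar)^2 / tot_var
  p_value <- pf(q, df1 = 1, df2 = df_adjusted, lower.tail = FALSE)
  df <- data.frame(noquote(rownames(Q)), qbar, lwr, upr, p_value)
  rownames(df) <- NULL
  names(df) <- c('term', 'estimate', '2.5 %', '97.5 %', 'p.value')
  return(df)
}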
Result
> pool_manual(fit)
term estimate 2.5 % 97.5 % p.value
1 (Intercept) 22.23452856 0.8551626 43.6138945 0.04461337
2 chl 0.03487894 -0.0857199 0.1554778 0.47022312
3 hyp -1.43636147 -6.0666990 3.1939761 0.52455041
> summary(pool(fit), conf.int = TRUE)[, extract]
Error in rq.fit.br(x, y, tau = tau, ci = TRUE, ...) :
unused arguments (effects = "fixed", parametric = TRUE, exponentiate = FALSE)

Coverage probability problem for moving block bootstrap

I am applying the moving block bootstrap (MBB) to a regression model using time series data. When I calculated the coverage probability of the estimators derived from the MBB, the outcomes were anomalous except for one coefficient (the coefficient for x1, which was set to be a continuous variable). Given that MBB is a well-established method (see https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.713.1262&rep=rep1&type=pdf and https://en.wikipedia.org/wiki/Bootstrapping_(statistics)), I was wondering if there is something wrong with my code. I appreciate any input!
set.seed(63)
#create a function to generate time series data
tsfunc3 <- function (size = 30, ar = 0.7) {
  ar.epsilon <- arima.sim(list(order = c(1, 0, 0), ar = ar), n = size, sd = 2)
  x1 <- rnorm(size)
  x2 <- sample(1:5, size, replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))
  x3 <- rbinom(size, 1, 0.5)
  # A combination of continuous predictor x1, ordinal predictor x2 and
  # binary predictor x3
  y <- as.numeric(5 + 0.25*x1 + 0.4*x2 + 0.8*x3 + ar.epsilon)
  data.frame(time = 1:size, x1 = x1, x2 = x2, x3 = x3, y = y)
}
#A time series
tdat <- tsfunc3()
# Block length derived from the data based on the approach proposed by Politis & White (2003):
b <- 3
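## (Side note, not in the original post: the Politis & White block length can be
## computed automatically, assuming the np package is installed, e.g.
##   library(np); b.star(tdat$y)
## which returns suggested block lengths for the stationary and circular bootstrap.)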
#Initial values
#blocks=tdat[1:3,c(2,3,4,5)]
n <- 30
#A sequence of blocks
blocks <- lapply(seq_len(n-b+1), function(i) seq(i, i+b-1))
#MBB for intercept estimator
IntMbb <- function() {
  take.blocks <- sample(1:28, 10, replace = TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]), ]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "(Intercept)", level = 0.95)
}
#MBB for x1 coefficient estimator
B1Mbb <- function() {
  take.blocks <- sample(1:28, 10, replace = TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]), ]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "x1", level = 0.95)
}
#MBB for x2 coefficient estimator
B2Mbb <- function() {
  take.blocks <- sample(1:28, 10, replace = TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]), ]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "x2", level = 0.95)
}
#MBB for x3 coefficient estimator
B3Mbb <- function() {
  take.blocks <- sample(1:28, 10, replace = TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]), ]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "x3", level = 0.95)
}
#Replications
set.seed(47)
R <- 100
int.mbb <- replicate(R, IntMbb(), simplify=FALSE)
b1.mbb <- replicate(R, B1Mbb(), simplify=FALSE)
b2.mbb <- replicate(R, B2Mbb(), simplify=FALSE)
b3.mbb <- replicate(R, B3Mbb(), simplify=FALSE)
#Calculate coverage probability for intercept estimator
int.ci <- t(sapply(int.mbb, function(x, y) x[grep(y, rownames(x)), ], "Intercept"))
sum(int.ci[,"2.5 %"] <=5 & 5 <= int.ci[,"97.5 %"])/R
#> [1] 0.34
#Calculate coverage probability for x1 coefficient estimator
int.ci <- t(sapply(b1.mbb, function(x, y) x[grep(y, rownames(x)), ], "x1"))
sum(int.ci[,"2.5 %"] <=0.25 & 0.25 <= int.ci[,"97.5 %"])/R
#> [1] 0.9
#Calculate coverage probability for x2 coefficient estimator
int.ci <- t(sapply(b2.mbb, function(x, y) x[grep(y, rownames(x)), ], "x2"))
sum(int.ci[,"2.5 %"] <=0.4 & 0.4 <= int.ci[,"97.5 %"])/R
#> [1] 0.38
#Calculate coverage probability for x3 coefficient estimator
int.ci <- t(sapply(b3.mbb, function(x, y) x[grep(y, rownames(x)), ], "x3"))
sum(int.ci[,"2.5 %"] <=0.8 & 0.8 <= int.ci[,"97.5 %"])/R
#> [1] 0.33
As you can see, only the coverage probability for the x1 coefficient estimator is OK. So is anything wrong with my code? Or does this have something to do with the MBB itself?
You're not really evaluating the coverage probabilities for the bootstrap. You need to build the confidence interval from the bootstrapped statistics, not by making confidence intervals from the parametric models run on the bootstrapped samples. Here's how I would do it.
First, we can generate the data:
set.seed(45301)
b <- 3
n <- 30
nblocks <- ceiling(n/b)
blocks <- lapply(seq_len(n-b+1), function(i) seq(i, i+b-1))
#A time series
tdat <- tsfunc3(size=n, ar=.7)
Next, we could write a function that we will bootstrap. This function generates the bootstrap sample, runs the regression and saves the coefficients.
bsfun <- function(data, blocks){
  ## resample whole blocks (with replacement) and stack them into one sample
  samp.data <- data[unlist(blocks[sample(1:length(blocks), nblocks, replace = TRUE)]), ]
  mod <- lm(y ~ x1 + x2 + x3, data = samp.data)
  coef(mod)
}
Next, we can run the function lots of times. Note that to generate a reliable 95% percentile confidence interval, you should have in the neighborhood of 1500-2500 bootstrap statistics. The farther the quantile you're trying to characterize is in the tails, the more bootstrap samples you need. So, the code below generates a single set of bootstrap coefficients:
out <- t(replicate(1000, bsfun(data=tdat, blocks=blocks)))
From this one set of bootstrap statistics, we can make a single confidence interval.
ci1 <- t(apply(out, 2, quantile, probs=c(.025,.975), na.rm=TRUE))
# 2.5% 97.5%
# (Intercept) -0.3302237 10.258229
# x1 -1.7577214 2.301975
# x2 -0.8016478 2.049435
# x3 -3.0723869 6.190383
If you want to investigate the coverage probabilities of these intervals, you would have to do what I did above lots of times (we'll do 100, though to get better estimates you would probably want to do more). We could then write a little function that evaluates the coverage of one set of estimates:
eval_cover <- function(true = c(5, .25, .4, .8), obs){
  out <- as.numeric(obs[, 1] < true & obs[, 2] > true)
  names(out) <- rownames(obs)
  out
}
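The construction of those repeated intervals is the same as above; a minimal sketch (storing them in a list called outci, as used below, and slow, since it refits 100 x 1000 models) could be:
## build outci: 100 percentile CIs, each from a freshly simulated series
one_ci <- function() {
  tdat_i <- tsfunc3(size = n, ar = .7)
  out_i <- t(replicate(1000, bsfun(data = tdat_i, blocks = blocks)))
  t(apply(out_i, 2, quantile, probs = c(.025, .975), na.rm = TRUE))
}
outci <- replicate(100, one_ci(), simplify = FALSE)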
Then, you could apply that function to each of the bootstrap confidence intervals you generated. Using the rowMeans() function will get the mean of the coverage 1/0 values, which will be the coverage probability. In this case, using only 100 intervals, the coverage is 100%.
rowMeans(sapply(outci, function(x)eval_cover(obs=x)))
# (Intercept) x1 x2 x3
# 1 1 1 1

How to calculate variance of least squares estimator using QR decomposition in R?

I'm trying to learn QR decomposition, but can't figure out how to get the variance of beta_hat without resorting to traditional matrix calculations. I'm practising with the iris data set, and here's what I have so far:
y<-(iris$Sepal.Length)
x<-(iris$Sepal.Width)
X<-cbind(1,x)
n<-nrow(X)
p<-ncol(X)
qr.X<-qr(X)
b<-(t(qr.Q(qr.X)) %*% y)[1:p]
R<-qr.R(qr.X)
beta<-as.vector(backsolve(R,b))
res<-as.vector(y-X %*% beta)
Thanks for your help!
setup (copying in your code)
y <- iris$Sepal.Length
x <- iris$Sepal.Width
X <- cbind(1,x)
n <- nrow(X)
p <- ncol(X)
qr.X <- qr(X)
b <- (t(qr.Q(qr.X)) %*% y)[1:p] ## can be optimized; see Remark 1 below
R <- qr.R(qr.X) ## can be optimized; see Remark 2 below
beta <- as.vector(backsolve(R, b))
res <- as.vector(y - X %*% beta)
math
With the (thin) QR factorization X = QR, the usual least squares results give
Var(beta_hat) = sigma^2 * (X'X)^(-1) = sigma^2 * (R'R)^(-1),
so only R is needed for the variance-covariance matrix; chol2inv() computes (R'R)^(-1) directly from R.
computation
The residual degrees of freedom are n - p, so the estimated error variance is
se2 <- sum(res ^ 2) / (n - p)
Thus, the variance covariance matrix of estimated coefficients is
V <- chol2inv(R) * se2
# [,1] [,2]
#[1,] 0.22934170 -0.07352916
#[2,] -0.07352916 0.02405009
validation
Let's check the correctness by comparing with lm:
fit <- lm(Sepal.Length ~ Sepal.Width, iris)
vcov(fit)
# (Intercept) Sepal.Width
#(Intercept) 0.22934170 -0.07352916
#Sepal.Width -0.07352916 0.02405009
Identical result!
Remark 1 (skip forming 'Q' factor)
Instead of b <- (t(qr.Q(qr.X)) %*% y)[1:p], you can use function qr.qty (to avoid forming 'Q' matrix):
b <- qr.qty(qr.X, y)[1:p]
Remark 2 (skip forming 'R' factor)
You don't have to extract R <- qr.R(qr.X) for backsolve; using qr.X$qr is sufficient:
beta <- as.vector(backsolve(qr.X$qr, b))
Appendix: A function for estimation
The above is the simplest demonstration. In practice, column pivoting and rank deficiency need to be dealt with. The following is an implementation. X is a model matrix and y is the response. Results should be compared with lm(y ~ X + 0).
qr_estimation <- function (X, y) {
  ## QR factorization
  QR <- qr(X)
  r <- QR$rank
  piv <- QR$pivot[1:r]
  ## estimate identifiable coefficients
  b <- qr.qty(QR, y)[1:r]
  beta <- backsolve(QR$qr, b, r)
  ## fitted values
  yhat <- base::c(X[, piv] %*% beta)
  ## residuals
  resi <- y - yhat
  ## error variance
  se2 <- base::c(crossprod(resi)) / (nrow(X) - r)
  ## variance-covariance for coefficients
  V <- chol2inv(QR$qr, r) * se2
  ## post-processing on pivoting and rank-deficiency
  p <- ncol(X)
  beta_full <- rep.int(NA_real_, p)
  beta_full[piv] <- beta
  V_full <- matrix(NA_real_, p, p)
  V_full[piv, piv] <- V
  ## return
  list(coefficients = beta_full, vcov = V_full,
       fitted.values = yhat, residuals = resi, sig = sqrt(se2))
}
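A quick check against lm() (a sketch using the X and y defined at the top of this question):
est <- qr_estimation(X, y)
fit0 <- lm(y ~ X + 0)
all.equal(unname(est$coefficients), unname(coef(fit0)))  # coefficients match
all.equal(unname(est$vcov), unname(vcov(fit0)))          # vcov matches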

Extracting the Linear Discriminant Equation

So I have this data and I would like to extract the coefficients from the equation it produces. That way I would be able to plug in a new data point and see where it would be placed.
library(MASS)
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
Sp = rep(c("s","c","v"), rep(50,3)))
train <- sample(1:150, 75)
table(Iris$Sp[train])
## your answer may differ
## c s v
## 22 23 30
z <- lda(Sp ~ ., Iris, prior = c(1,1,1)/3, subset = train)
I know I can get this:
> z
Call:
lda(Sp ~ ., data = Iris, prior = c(1, 1, 1)/3, subset = train)
Prior probabilities of groups:
c s v
0.3333333 0.3333333 0.3333333
Group means:
Sepal.L. Sepal.W. Petal.L. Petal.W.
c 5.969231 2.753846 4.311538 1.3384615
s 5.075000 3.541667 1.500000 0.2583333
v 6.700000 2.936000 5.552000 1.9880000
Coefficients of linear discriminants:
LD1 LD2
Sepal.L. -0.5458866 0.5215937
Sepal.W. -1.5312824 1.7891248
Petal.L. 1.8087255 -1.2637188
Petal.W. 2.8620894 3.2868849
Proportion of trace:
LD1 LD2
0.9893 0.0107
but is there a way to get just the equation so I would not have to calculate the new observation by hand?
Just turning this into an answer: you need predict(). The predict.lda method in the MASS package has your exact example on its help page:
tr <- sample(1:50, 25)
train <- rbind(iris3[tr,,1], iris3[tr,,2], iris3[tr,,3])
test <- rbind(iris3[-tr,,1], iris3[-tr,,2], iris3[-tr,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
z <- lda(train, cl)
predict(z, test)$class
The default method is "plug-in", so this is the (snipped) code from MASS:::predict.lda; object is the fitted object and x comes from the newdata argument converted to a matrix:
# snipped preamble and error checking
means <- colSums(prior * object$means)
scaling <- object$scaling
x <- scale(x, center = means, scale = FALSE) %*% scaling
dm <- scale(object$means, center = means, scale = FALSE) %*% scaling
method <- match.arg(method)
dimen <- if (missing(dimen)) length(object$svd) else min(dimen, length(object$svd))
N <- object$N
if (method == "plug-in") {
  dm <- dm[, 1L:dimen, drop = FALSE]
  dist <- matrix(0.5 * rowSums(dm^2) - log(prior), nrow(x),
                 length(prior), byrow = TRUE) -
    x[, 1L:dimen, drop = FALSE] %*% t(dm)
  dist <- exp(-(dist - apply(dist, 1L, min, na.rm = TRUE)))
}
# snipped two other methods
posterior <- dist/drop(dist %*% rep(1, ng))
This is mostly put in to demonstrate why Gregor's answer is the most sensible approach. Trying to pull out an "equation" seems unfruitful. (I remember using the results of linear regression to do such an exercise in my first-year regression class in grad school.)
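That said, if you really want to apply the "equation" by hand, a sketch (using the z and test objects from the code above) reproduces the LD scores directly from the fitted object and checks them against predict():
ctr <- colSums(z$prior * z$means)                     # prior-weighted centroid
manual_scores <- scale(test, center = ctr, scale = FALSE) %*% z$scaling
all.equal(unname(manual_scores), unname(predict(z, test)$x))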
