Coverage probability problem for moving block bootstrap - r

I am applying the moving block bootstrap (MBB) to a regression model using time series data. When I calculated the coverage probability of the estimators derived from the MBB, the outcomes were anomalous except for one coefficient (the coefficient for x1, which was set to be a continuous variable). Given that MBB is a well-established method (see https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.713.1262&rep=rep1&type=pdf and https://en.wikipedia.org/wiki/Bootstrapping_(statistics)), I was wondering if there is something wrong with my code. I appreciate any input!
set.seed(63)
#create a function to generate time series data
tsfunc3 <- function (size=30, ar=0.7) {
ar.epsilon <- arima.sim(list(order = c(1,0,0), ar = ar), n = size, sd=2) # use the ar argument rather than a hard-coded 0.7
x1=rnorm(size)
x2=sample(1:5, size, replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))
x3=rbinom(size, 1, 0.5)
y=as.numeric(5 + 0.25*x1 + 0.4*x2 + 0.8*x3 + ar.epsilon) #A combination of continuous
#predictor x1, ordinal predictor
#x2 and binary predictor x3
data.frame(time=1:size, x1=x1, x2=x2, x3=x3, y=y)}
#A time series
tdat <- tsfunc3()
# Block length derived from the data based on the approach proposed by Politis & White
#(2003):
b <- 3
#Initial values
#blocks=tdat[1:3,c(2,3,4,5)]
n <- 30
#A sequence of blocks
blocks <- lapply(seq_len(n-b+1), function(i) seq(i, i+b-1))
#MBB for intercept estimator
IntMbb <- function() {
take.blocks <- sample(1:28,10,replace=TRUE)
newdat <- tdat[unlist(blocks[take.blocks]),]
x1 <- unlist(newdat["x1"])
x2 <- unlist(newdat["x2"])
x3 <- unlist(newdat["x3"])
y <- unlist(newdat["y"])
regmbb <- lm(y ~ x1 + x2 + x3)
confint(regmbb, "(Intercept)", level = 0.95)
}
#MBB for x1 coefficient estimator
B1Mbb <- function() {
take.blocks <- sample(1:28,10,replace=TRUE)
newdat <- tdat[unlist(blocks[take.blocks]),]
x1 <- unlist(newdat["x1"])
x2 <- unlist(newdat["x2"])
x3 <- unlist(newdat["x3"])
y <- unlist(newdat["y"])
regmbb <- lm(y ~ x1 + x2 + x3)
confint(regmbb, "x1", level = 0.95)
}
#MBB for x2 coefficient estimator
B2Mbb <- function() {
take.blocks <- sample(1:28,10,replace=TRUE)
newdat <- tdat[unlist(blocks[take.blocks]),]
x1 <- unlist(newdat["x1"])
x2 <- unlist(newdat["x2"])
x3 <- unlist(newdat["x3"])
y <- unlist(newdat["y"])
regmbb <- lm(y ~ x1 + x2 + x3)
confint(regmbb, "x2", level = 0.95)
}
#MBB for x3 coefficient estimator
B3Mbb <- function() {
take.blocks <- sample(1:28,10,replace=TRUE)
newdat <- tdat[unlist(blocks[take.blocks]),]
x1 <- unlist(newdat["x1"])
x2 <- unlist(newdat["x2"])
x3 <- unlist(newdat["x3"])
y <- unlist(newdat["y"])
regmbb <- lm(y ~ x1 + x2 + x3)
confint(regmbb, "x3", level = 0.95)
}
#Replications
set.seed(47)
R <- 100
int.mbb <- replicate(R, IntMbb(), simplify=FALSE)
b1.mbb <- replicate(R, B1Mbb(), simplify=FALSE)
b2.mbb <- replicate(R, B2Mbb(), simplify=FALSE)
b3.mbb <- replicate(R, B3Mbb(), simplify=FALSE)
#Calculate coverage probability for intercept estimator
int.ci <- t(sapply(int.mbb, function(x, y) x[grep(y, rownames(x)), ], "Intercept"))
sum(int.ci[,"2.5 %"] <=5 & 5 <= int.ci[,"97.5 %"])/R
[1] 0.34
#Calculate coverage probability for x1 coefficient estimator
int.ci <- t(sapply(b1.mbb, function(x, y) x[grep(y, rownames(x)), ], "x1"))
sum(int.ci[,"2.5 %"] <=0.25 & 0.25 <= int.ci[,"97.5 %"])/R
[1] 0.9
#Calculate coverage probability for x2 coefficient estimator
int.ci <- t(sapply(b2.mbb, function(x, y) x[grep(y, rownames(x)), ], "x2"))
sum(int.ci[,"2.5 %"] <=0.4 & 0.4 <= int.ci[,"97.5 %"])/R
[1] 0.38
#Calculate coverage probability for x3 coefficient estimator
int.ci <- t(sapply(b3.mbb, function(x, y) x[grep(y, rownames(x)), ], "x3"))
sum(int.ci[,"2.5 %"] <=0.8 & 0.8 <= int.ci[,"97.5 %"])/R
[1] 0.33
As you can see, only the coverage probability for the x1 coefficient estimator is OK. Is anything wrong with my code? Or does this have something to do with MBB itself?

You're not really evaluating the coverage probabilities for the bootstrap. You need to build the confidence interval from the bootstrapped statistics, rather than taking the parametric confidence intervals from models fitted to the bootstrapped samples. Here's how I would do it.
First, we can generate the data:
set.seed(45301)
b <- 3
n <- 30
nblocks <- ceiling(n/b)
blocks <- lapply(seq_len(n-b+1), function(i) seq(i, i+b-1))
#A time series
tdat <- tsfunc3(size=n, ar=.7)
Next, we could write a function that we will bootstrap. This function generates the bootstrap sample, runs the regression and saves the coefficients.
bsfun <- function(data, blocks){
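# draw ceiling(n/b) whole blocks with replacement and stack their rows
n.blocks <- ceiling(nrow(data) / length(blocks[[1]]))
samp.data <- data[unlist(blocks[sample(seq_along(blocks), n.blocks, replace=TRUE)]), ]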
mod <- lm(y ~ x1 + x2 + x3, data=samp.data)
coef(mod)
}
Next, we can run the function lots of times. Note that to generate a reliable 95% percentile confidence interval, you should have in the neighborhood of 1500-2500 bootstrap statistics. The farther the quantile you're trying to characterize is in the tails, the more bootstrap samples you need. So, the code below generates a single set of bootstrap coefficients:
out <- t(replicate(1000, bsfun(data=tdat, blocks=blocks)))
From this one set of bootstrap statistics, we can make a single confidence interval.
ci1 <- t(apply(out, 2, quantile, probs=c(.025,.975), na.rm=TRUE))
# 2.5% 97.5%
# (Intercept) -0.3302237 10.258229
# x1 -1.7577214 2.301975
# x2 -0.8016478 2.049435
# x3 -3.0723869 6.190383
If you want to investigate the coverage probabilities of these intervals, you would have to do what I did above lots of times (we'll do 100, though to get better estimates you would probably want to do more). We could then write a little function that would evaluate the coverage of one set of estimates:
eval_cover <- function(true = c(5,.25,.4, .8), obs){
out <- as.numeric(obs[,1] < true & obs[,2] > true)
names(out) <- rownames(obs)
out
}
Then, you could apply that function to each of the bootstrap confidence intervals you generated (collected here in a list called outci). Using the rowMeans() function will get the mean of the coverage 1/0 values, which will be the coverage probability. In this case, using only 100 intervals, the coverage is 100%.
rowMeans(sapply(outci, function(x)eval_cover(obs=x)))
# (Intercept) x1 x2 x3
# 1 1 1 1
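The answer above never shows how outci is built, so here is a minimal sketch of one way the whole loop could look. It is an assumption on my part, not the answerer's code: gen_data() and mbb_ci() are hypothetical helper names, the data generator mirrors tsfunc3() from the question, and a fresh series is simulated for every replicate so that the coverage refers to repeated experiments rather than to one fixed data set.

```r
set.seed(1)
true <- c(5, 0.25, 0.4, 0.8)   # coefficients of the data generating process

# Hypothetical helper: one series with AR(1) errors, mirroring tsfunc3()
gen_data <- function(n = 30, ar = 0.7) {
  x1 <- rnorm(n)
  x2 <- sample(1:5, n, replace = TRUE)
  x3 <- rbinom(n, 1, 0.5)
  eps <- as.numeric(arima.sim(list(order = c(1, 0, 0), ar = ar), n = n, sd = 2))
  data.frame(x1 = x1, x2 = x2, x3 = x3,
             y = 5 + 0.25 * x1 + 0.4 * x2 + 0.8 * x3 + eps)
}

# Hypothetical helper: percentile CI from B moving-block bootstrap refits
mbb_ci <- function(data, b = 3, B = 200) {
  n <- nrow(data)
  starts <- seq_len(n - b + 1)   # start index of every overlapping block
  k <- ceiling(n / b)            # blocks needed to rebuild a series of length n
  X <- cbind(1, as.matrix(data[, c("x1", "x2", "x3")]))
  boot <- replicate(B, {
    # each column of outer() is one block; trim to n rows after stacking
    idx <- as.vector(outer(0:(b - 1), sample(starts, k, replace = TRUE), "+"))[seq_len(n)]
    lm.fit(X[idx, , drop = FALSE], data$y[idx])$coefficients
  })
  ci <- t(apply(boot, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE))
  rownames(ci) <- c("(Intercept)", "x1", "x2", "x3")
  ci
}

# One row of hits per simulated data set: is the true value inside the CI?
hits <- t(replicate(100, {
  ci <- mbb_ci(gen_data())
  ci[, 1] <= true & true <= ci[, 2]
}))
cover <- colMeans(hits)   # estimated coverage per coefficient
cover
```

With only 30 observations and 100 replicates the estimates are noisy, so treat the numbers as a rough check rather than a verdict on MBB.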

Related

Control the printout of confidence intervals related to a linear model function

I am doing bootstrapping for a linear model but how do I edit the printout names of the intercept and the x variable?
Here are the simulated data
set.seed(42)
n <- 100
x <- rnorm(n)
e <- rnorm(n)
y <- as.numeric(50 + 25*x + e)
dd <- data.frame(id=1:n, x=x, y=y)
Here is the model:
mo <- lm(y ~ x, data=dd)
Find fit and residuals:
fit <- fitted(mo)
resi <- residuals(mo)
Function to retrieve confidence intervals based on the residual bootstrapping:
FUN <- function() {
X <- model.matrix(mo)
ressampy <- fit + sample(resi, length(resi), replace = TRUE)
bootmod <- lm(ressampy ~ X-1)
confint(bootmod, level = 0.95)
}
Output of 1 run (notice that the printouts are X(Intercept) and Xx but instead I just want them to be (Intercept) and x)
FUN()
2.5 % 97.5 %
X(Intercept) 49.74439 50.07817
Xx 24.92904 25.25103
This may be an easy fix but I just couldn't get it to work. Any help will be greatly appreciated!
Simply use rownames() to change the row names of the matrix containing the confidence intervals, as in:
FUN <- function() {
X <- model.matrix(mo)
ressampy <- fit + sample(resi, length(resi), replace = TRUE)
bootmod <- lm(ressampy ~ X-1)
ci <- confint(bootmod, level = 0.95)
rownames(ci) <- c("(Intercept)", "x")
return(ci)
}
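As an alternative to renaming, the prefix can probably be avoided altogether by refitting with a formula on a data frame instead of on the model matrix. This is only a sketch: FUN2 and boot.dd are names I made up, and the data are regenerated here so the block runs on its own.

```r
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 50 + 25 * x + rnorm(n)
dd <- data.frame(x = x, y = y)
mo <- lm(y ~ x, data = dd)
fit <- fitted(mo)
resi <- residuals(mo)

# Hypothetical variant of FUN: refit y ~ x on resampled responses
FUN2 <- function() {
  boot.dd <- data.frame(x = dd$x,
                        y = fit + sample(resi, length(resi), replace = TRUE))
  confint(lm(y ~ x, data = boot.dd), level = 0.95)
}
ci <- FUN2()
rownames(ci)   # "(Intercept)" "x"
```

The trade-off is that lm() rebuilds the model matrix on every call, which is a little slower than reusing X.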

Coverage probability calculation for LM

I am trying to calculate the coverage probability for a set of residual bootstrap replicates I generated on the intercept and slope of a regression. Can anyone show me how to calculate the coverage probability of confidence intervals? Many thanks.
Note that I manually ran the regression using the QR decomposition, but you can use lm() if that's easier. I just thought doing it manually would be faster.
set.seed(42) ## for sake of reproducibility
n <- 100
x <- rnorm(n)
e <- rnorm(n)
y <- as.numeric(50 + 25*x + e)
dd <- data.frame(id=1:n, x=x, y=y)
mo <- lm(y ~ x, data=dd)
# Manual Residual Bootstrap
resi <- residuals(mo)
fit <- fitted(mo)
ressampy <- function() fit + sample(resi, length(resi), replace=TRUE)
# Sample y values:
head(ressampy())
# Qr decomposition of X values
qrX <- qr(cbind(Intercept=1, dd[, "x", drop=FALSE]), LAPACK=TRUE)
# faster than LM
qr.coef(qrX, dd[, "y"])
# One Bootstrap replication
boot1 <- qr.coef(qrX, ressampy())
# 1000 bootstrap replications
boot <- t(replicate(1000, qr.coef(qrX, ressampy())))
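For what it's worth, here is a small self-contained sketch of the two things the code above relies on: that the QR route reproduces the lm() coefficients, and that a percentile interval can be read off the matrix of bootstrap replicates. The names same and pct.ci are mine, and the plain qr() call (without LAPACK=TRUE) is a simplification.

```r
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 50 + 25 * x + rnorm(n)
mo <- lm(y ~ x)
fit <- fitted(mo)
resi <- residuals(mo)

X <- cbind(Intercept = 1, x = x)
qrX <- qr(X)   # decompose once, reuse for every refit

# The QR coefficients should match lm() up to numerical tolerance
same <- all.equal(unname(qr.coef(qrX, y)), unname(coef(mo)))

# 1000 residual-bootstrap refits, one row per replicate
boot <- t(replicate(1000, qr.coef(qrX, fit + sample(resi, n, replace = TRUE))))

# Percentile interval for each coefficient
pct.ci <- apply(boot, 2, quantile, probs = c(0.025, 0.975))
pct.ci
```

This gives a single interval per coefficient; a coverage probability needs many such intervals and a known true value to compare them against.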
EDIT
Incorporating jay.sf's answer, I rewrote the code that ran the lm() method and compared the first and second approach of calculating coverage probability in the link shared by jay.sf:
library(lmtest);library(sandwich)
ci <- confint(coeftest(mo, vcov.=vcovHC(mo, type="HC3")))
ci
FUNInter <- function() {
X <- model.matrix(mo)
ressampy.2 <- fit + sample(resi, length(resi), replace = TRUE)
bootmod <- lm(ressampy.2 ~ X-1)
confint(bootmod, "X(Intercept)", level = 0.95)
}
FUNBeta <- function() {
X <- model.matrix(mo)
ressampy.2 <- fit + sample(resi, length(resi), replace = TRUE)
bootmod <- lm(ressampy.2 ~ X-1)
confint(bootmod, "Xx", level = 0.95)
}
set.seed(42)
R <- 1000
Interres <- replicate(R, FUNInter(), simplify=FALSE)
Betares <- replicate(R, FUNBeta(), simplify=FALSE)
ciinter <- t(sapply(Interres, function(x, y) x[grep(y, rownames(x)), ], "X\\(Intercept\\)"))
cibeta <- t(sapply(Betares, function(x, y) x[grep(y, rownames(x)), ], "Xx"))
#second approach of calculating CP
sum(ciinter[,"2.5 %"] <=50 & 50 <= ciinter[,"97.5 %"])/R
[1] 0.842
sum(cibeta[,"2.5 %"] <=25 & 25 <= cibeta[,"97.5 %"])/R
[1] 0.945
#first approach of calculating CP
sum(apply(ciinter, 1, function(x) {
all(data.table::between(x, ci[1,1], ci[1,2]))
}))/R
[1] 0.076
sum(apply(cibeta, 1, function(x) {
all(data.table::between(x, ci[2,1], ci[2,2]))
}))/R
[1] 0.405
According to Morris et al. (2019), Table 6, the coverage probability is defined as the probability that the real theta lies within a bootstrapped confidence interval (CI) (i.e. those of the model applied on many samples based on the actual data, or, in other words, new experiments).
Hence, we want to compute CIs based on OP's proposed i.i.d. bootstrap R times and calculate the ratio of how often theta is or is not in these CIs.
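To make the definition concrete, here is a tiny sketch of the estimator: the share of intervals that contain theta, together with its Monte Carlo standard error taken as the usual binomial standard error of a proportion. The function name coverage and the toy numbers are made up for illustration.

```r
# Hypothetical helper: coverage estimate and its Monte Carlo SE
coverage <- function(lower, upper, theta) {
  hit <- lower <= theta & theta <= upper   # TRUE where the CI contains theta
  est <- mean(hit)
  c(coverage = est, mcse = sqrt(est * (1 - est) / length(hit)))
}

# Toy example: four intervals, two of which contain theta = 3.5
res <- coverage(lower = c(1, 2, 3, 4), upper = c(3, 4, 5, 6), theta = 3.5)
res
# coverage     mcse
#     0.50     0.25
```

The standard error is a reminder that with a few hundred replicates a coverage of 0.93 and one of 0.95 can be hard to tell apart.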
First, we estimate our model mo using the actual data.
mo <- lm(y ~ x)
To avoid unnecessarily unpacking the fitted values yhat, residuals u, and model matrix X in the replications, we extract them beforehand, together with the true coefficients theta.
yhat <- mo$fitted.values
u <- as.matrix(mo$residuals)
X <- model.matrix(mo)
theta <- c(50, 25) ## known from data generating process of simulation
In a bootstrap function FUN we wrap all the steps we want to do in one replication. In order to apply the very fast .lm.fit, we have to calculate the White standard errors manually (identical to lmtest::coeftest(fit, vcov.=sandwich::vcovHC(fit, type="HC1"))).
FUN <- function() {
## resampling residuals
y.star <- yhat + sample(u, length(u), replace=TRUE)
## refit model
fit <- .lm.fit(X, y.star)
coef <- fit$coefficients[sort.list(fit$pivot)]
## alternatively using QR, but `.lm.fit` is slightly faster
# qrX <- qr(X, LAPACK=TRUE)
# coef <- qr.coef(qrX, y.star)
## white standard errors
v.cov <- chol2inv(chol(t(X) %*% X))
meat <- t(X) %*% diag(diag(u %*% t(u))) %*% X
## degrees of freedom adjust (HC1)
d <- dim(X)
dfa <- d[1] / (d[1] - d[2])
white.se <- sqrt(diag(v.cov %*% meat %*% v.cov)*dfa)
## 95% CIs
ci <- coef + qt(1 - .025, d[1] - d[2])*white.se %*% t(c(-1, 1))
## coverage
c(intercept=theta[1] >= ci[1, 1] & theta[1] <= ci[1, 2],
x=theta[2] >= ci[2, 1] & theta[2] <= ci[2, 2])
}
Now we execute the bootstrap using replicate.
R <- 5e3
set.seed(42)
system.time(res <- t(replicate(R, FUN())))
# user system elapsed
# 71.19 28.25 100.28
head(res, 3)
# intercept x
# [1,] TRUE TRUE
# [2,] FALSE TRUE
# [3,] TRUE TRUE
The mean of TRUEs in both columns simultaneously across the rows, or in each column respectively, gives the coverage probability we are looking for.
(cp.t <- mean(rowSums(res) == ncol(res))) ## coverage probability total
(cp.i <- colMeans(res)) ## coverage probability individual coefs
(cp <- c(total=cp.t, cp.i))
# total intercept x
# 0.8954 0.9478 0.9444
## values with other R:
# total intercept x
# 0.90700 0.95200 0.95200 ## R == 1k
# 0.89950 0.95000 0.94700 ## R == 2k
# 0.89540 0.94780 0.94440 ## R == 5k
# 0.89530 0.94570 0.94680 ## R == 10k
# 0.89722 0.94694 0.94777 ## R == 100k
And this is what it looks like after 100k repetitions.
Code for plot:
r1 <- sapply(seq(nrow(res)), \(i) mean(rowSums(res[1:i,,drop=FALSE]) == ncol(res)))
r2 <- t(sapply(seq(nrow(res)), \(i) colMeans(res[1:i,,drop=FALSE])))
r <- cbind(r1, r2)
matplot(r, type='l', col=2:4, lty=1, main='coverage probability', xlab='R',
ylab='cum. mean',ylim=c(.89, .955))
grid()
sapply(seq(cp), \(i) abline(h=cp[i], lty=2, col=i + 1))
legend('right', col=2:4, lty=1, legend=names(cp), bty='n')
Data:
set.seed(42)
n <- 1e3
x <- rnorm(n)
y <- 50 + 25*x + rnorm(n)

Predict using felm output with standard errors

Is there a way to get predict behavior with standard errors from lfe::felm if the fixed effects are swept out using the projection method in felm? This question is very similar to the question here, but none of the answers to that question can be used to estimate standard errors or confidence/prediction intervals. I know that there's currently no predict.felm, but I am wondering if there are workarounds similar to those linked above that might also work for estimating the prediction interval.
library(DAAG)
library(lfe)
model1 <- lm(data = cps1, re74 ~ age + nodeg + marr)
predict(model1, newdata = data.frame(age=40, nodeg = 0, marr=1), se.fit = T, interval="prediction")$fit
# Result: fit lwr upr
# 1 18436.18 2339.335 34533.03
model2 <- felm(data = cps1, re74 ~ age | nodeg + marr)
predict(model2, newdata = data.frame(age=40, nodeg = 0, marr=1), se.fit = T, interval="prediction")$fit
# Does not work
The goal is to estimate a prediction interval for yhat, for which I think I'd need to compute the full variance-covariance matrix (including the fixed effects). I haven't been able to figure out how to do this, and I'm wondering if it's even computationally feasible.
After conversations with several people, I don't believe it is possible to obtain an estimate of the distribution of yhat=Xb (where X includes both the covariates and the fixed effects) directly from felm, which is what this question boils down to. It is possible to bootstrap them, however. The following code does so in parallel. There is scope for performance improvements, but this gives the general idea.
Note: here I do not compute the full prediction interval, just the SEs on Xb, but obtaining the prediction interval is straightforward: add sigma^2 to the squared SE and take the square root to get the prediction SE.
library(DAAG)
library(lfe)
library(parallel)
model1 <- lm(data = cps1, re74 ~ age + nodeg + marr)
yhat_lm <- predict(model1, newdata = data.frame(age=40, nodeg = 0, marr=1), se.fit = T)
set.seed(42)
boot_yhat <- function(b) {
print(b)
n <- nrow(cps1)
boot <- cps1[sample(1:n, n, replace=T),]
lm.model <- lm(data=demeanlist(boot[, c("re74", "age")], list(factor(boot$nodeg), factor(boot$marr))),
formula = re74 ~ age)
fe <- getfe(felm(data = boot, re74 ~ age | nodeg + marr))
bootResult <- predict(lm.model, newdata = data.frame(age = 40)) +
fe$effect[fe$fe == "nodeg" & fe$idx==0] +
fe$effect[fe$fe == "marr" & fe$idx==1]
return(bootResult)
}
B = 1000
yhats_boot <- mclapply(1:B, boot_yhat)
plot(density(rnorm(10000, mean=yhat_lm$fit, sd=yhat_lm$se.fit)))
lines(density(unlist(yhats_boot)), col="red") # mclapply returns a list, so flatten it first
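On the step from the SE of Xb to a prediction interval: the two variance pieces combine on the variance scale, not on the SE scale. A quick sketch with plain lm() on the built-in mtcars data (my choice of example, not the cps1 data above) suggests that fit ± t * sqrt(se.fit^2 + sigma^2) reproduces what predict.lm() reports.

```r
m <- lm(mpg ~ wt + hp, data = mtcars)
new <- data.frame(wt = 3, hp = 120)

p <- predict(m, newdata = new, se.fit = TRUE)
tcrit <- qt(0.975, df = m$df.residual)
sigma2 <- sum(residuals(m)^2) / m$df.residual   # residual variance

# Prediction interval built by hand: SE of the fit and sigma^2 add as variances
manual <- p$fit + c(-1, 1) * tcrit * sqrt(p$se.fit^2 + sigma2)

# Reference interval from predict.lm()
builtin <- predict(m, newdata = new, interval = "prediction")[1, c("lwr", "upr")]

all.equal(unname(manual), unname(builtin))
```

With a bootstrap, the standard deviation of the bootstrapped yhat values would take the place of se.fit, under the assumption that the bootstrap distribution is roughly normal.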
From your first model predict(.) yields this:
# fit lwr upr
# 1 18436.18 2339.335 34533.03
Following 李哲源 we can achieve these results manually, too.
beta.hat.1 <- coef(model1) # save coefficients
# model matrix: age=40, nodeg = 0, marr=1:
X.1 <- cbind(1, matrix(c(40, 0, 1), ncol=3))
pred.1 <- as.numeric(X.1 %*% beta.hat.1) # prediction
V.1 <- vcov(model1) # save var-cov matrix
se2.1 <- unname(rowSums((X.1 %*% V.1) * X.1)) # prediction var
alpha.1 <- qt((1-0.95)/2, df = model1$df.residual) # 5 % level
pred.1 + c(alpha.1, -alpha.1) * sqrt(se2.1) # 95%-CI
# [1] 18258.18 18614.18
sigma2.1 <- sum(model1$residuals ^ 2) / model1$df.residual # sigma.sq
PI.1 <- pred.1 + c(alpha.1, -alpha.1) * sqrt(se2.1 + sigma2.1) # prediction interval
matrix(c(pred.1, PI.1), nrow = 1, dimnames = list(1, c("fit", "lwr", "upr")))
# fit lwr upr
# 1 18436.18 2339.335 34533.03
Now, applying your linked example to multiple FE, we get these results:
lm.model <- lm(data=demeanlist(cps1[, c(8, 2)],
list(as.factor(cps1$nodeg),
as.factor(cps1$marr))), re74 ~ age)
fe <- getfe(model2)
predict(lm.model, newdata = data.frame(age = 40)) + fe$effect[fe$idx=="1"]
# [1] 15091.75 10115.21
The first value is with and the second without added FE (try fe$effect[fe$idx=="1"]).
Now we're following the manual approach above.
beta.hat <- coef(model2) # coefficient
x <- 40 # age = 40
pred <- as.numeric(x %*% beta.hat) # prediction
V <- model2$vcv # var/cov
se2 <- unname(rowSums((x %*% V) * x)) # prediction var
alpha <- qt((1-0.95)/2, df = model2$df.residual) # 5% level
pred + c(alpha, -alpha) * sqrt(se2) # CI
# [1] 9599.733 10630.697
sigma2 <- sum(model2$residuals ^ 2) / model2$df.residual # sigma^2
PI <- pred + c(alpha, -alpha) * sqrt(se2 + sigma2) # PI
matrix(c(pred, PI), nrow = 1, dimnames = list(1, c("fit", "lwr", "upr"))) # output
# fit lwr upr
# 1 10115.21 -5988.898 26219.33
As we see, the fit is the same as the linked example approach, but now with prediction interval. (Disclaimer: The logic of the approach should be straightforward, the values of the PI should still be evaluated, e.g. in Stata with reghdfe.)
Edit: In case you want to achieve exactly the same output from felm() which predict.lm() yields with the linear model1, you simply need to "include" the fixed effects in your model again (see model3 below). Then just follow the same approach. For more convenience you could easily wrap it into a function.
library(DAAG)
library(lfe)
model3 <- felm(data = cps1, re74 ~ age + nodeg + marr)
pv <- c(40, 0, 1) # prediction x-values
predict0.felm <- function(mod, pv.=pv) {
beta.hat <- coef(mod) # coefficient
x <- cbind(1, matrix(pv., ncol=3)) # prediction vector
pred <- as.numeric(x %*% beta.hat) # prediction
V <- mod[['vcv']] # var/cov
se2 <- unname(rowSums((x %*% V) * x)) # prediction var
alpha <- qt((1-0.95)/2, df = mod[['df.residual']]) # 5% level
CI <- structure(pred + c(alpha, -alpha) * sqrt(se2),
names=c("CI lwr", "CI upr")) # CI
sigma2 <- sum(mod[['residuals']] ^ 2) / mod[['df.residual']] # sigma^2
PI <- pred + c(alpha, -alpha) * sqrt(se2 + sigma2) # PI
mx <- matrix(c(pred, PI), nrow = 1,
dimnames = list(1, c("PI fit", "PI lwr", "PI upr"))) # output
list(CI, mx)
}
predict0.felm(model3)[[2]]
# PI fit PI lwr PI upr
# 1 18436.18 2339.335 34533.03
In this way, felm() gives you the same prediction interval as predict.lm().

Combining the result of multiple imputation in partial least square (PLS) (non-normal distribution)

I am using a combination of IPW and MI with PLS. In each of the MI models, I have calculated the 95% CI. My question is how I could combine the 95% CIs into a final result. Below is a sample script that I am using. Assume the parameter of interest does not follow a normal distribution.
The coeftable[[i]] contains the coefficients and 95%CI in each MI model.
library(Hmisc)
library(dplyr)
library(nlme)
library(reshape)
library(plsRglm)
library(xlsx)
library(boot)
set.seed(123)
id <- c(1:1000)
y <- sample(c(1:5,NA), 1000, replace=T)
x1 <- sample(c(1:2,NA), 1000, replace=T)
x2 <- sample(c(1:3,NA), 1000, replace=T)
x3 <- sample(c(1:4,NA), 1000, replace=T)
df <- data.frame(id,y,x1,x2,x3)
df.nomiss <- subset(df, !is.na(df$y))
# obs==1: with any missing data of x
df.nomiss[,"obs"] <- 0
df.nomiss$obs[is.na(df.nomiss$x1)==TRUE |
is.na(df.nomiss$x2)==TRUE|
is.na(df.nomiss$x3)==TRUE ] <- 1
# only include obs==1 into the imputation
include<-df.nomiss[df.nomiss$obs==1,]
exclude<-anti_join(df.nomiss,include,by="id")
# imputation
m=10
include.i <- aregImpute(~factor(y) + factor(x1) + factor(x2) +factor(x3) ,
data=include,n.impute=m)
include.nomiss <- list(include, include, include, include, include, include,include, include, include, include)
# if a variable is coded as 0, use "include.i$imputed$x1[,i]-1 "
for(i in 1:m){
include.nomiss[[i]]$y[is.na(include.nomiss[[i]]$y)] <-
include.i$imputed$y[,i]
include.nomiss[[i]]$x1[is.na(include.nomiss[[i]]$x1)] <-
include.i$imputed$x1[,i]
include.nomiss[[i]]$x2[is.na(include.nomiss[[i]]$x2)] <-
include.i$imputed$x2[,i]
include.nomiss[[i]]$x3[is.na(include.nomiss[[i]]$x3)] <-
include.i$imputed$x3[,i]
}
missingmodel <- list(NA)
analysismodel<-list(NA)
all<- rep(list(NA), m)
modplsglm <- rep(list(NA), m)
coeftable <- rep(list(NA), m)
rawci <- rep(list(NA), m)
loading <- rep(list(NA), m)
temp.bootplsRglm <- rep(list(NA), m)
# PLSRGLM
R <- 1000
ncomp <- 3
# IPW with PLS
for(i in 1:m){
all[[i]]<-rbind(exclude,include.nomiss[[i]])
# IPW
missingmodel[[i]] <- glm(obs ~y + x1 + x2 +x3 ,
data=all[[i]], family=binomial)
all[[i]]$pw<-(1/missingmodel[[i]]$fitted.values)
# PLSRGLM
modplsglm[[i]] <- plsRglm(y~ factor(x1) + factor(x2) + factor(x3) ,
nt=ncomp,data=all[[i]], modele="pls", weights=all[[i]]$pw)
# bootstrap 95%CI
temp.bootplsRglm[[i]] <- bootplsglm(modplsglm[[i]], typeboot="plsmodel", R=R , statistic=coefs.plsRglmnp, sim="balanced", stype="i", stabvalue=1e6, verbose=TRUE)
indices.temp.bootplsRglm <- !is.na(temp.bootplsRglm[[i]]$t[,1])
temp.bootplsRglm[[i]]$t=temp.bootplsRglm[[i]]$t[indices.temp.bootplsRglm,]
temp.bootplsRglm[[i]]$R=sum(indices.temp.bootplsRglm)
temp.bootplsRglm[[i]]$call$R<-sum(indices.temp.bootplsRglm)
Cornell.bootYX.raw <- temp.bootplsRglm[[i]]
# generate coeftable
options(scipen=999)
coeftable[[i]] <- as.data.frame(modplsglm[[i]]$Coeffs)
colnames(coeftable[[i]]) <- "coef"
rawci[[i]] <- confints.bootpls(Cornell.bootYX.raw, typeBCa=FALSE)
rawci[[i]] <- as.data.frame(rawci[[i]])
col <- c("Normal lower","Normal upper","Basic lower","Basic upper","Percentile lower","Percentile upper")
colnames(rawci[[i]]) <- col
rawci[[i]] <- rawci[[i]][,c("Percentile lower","Percentile upper")]
coeftable[[i]] <- cbind(variable=0,coeftable[[i]],rawci[[i]])
coeftable[[i]][,"variable"]<-rownames(coeftable[[i]])
}
Below is the coeftable of one MI model. My expected output should look like this, but presenting the overall estimate across all MI models.
coeftable[[1]]
               variable        coef Percentile lower Percentile upper
Intercept     Intercept  2.96021462      0.000000000       0.00000000
factor.x1.1 factor.x1.1  0.04540381     -0.019860282       0.04854000
factor.x1.2 factor.x1.2 -0.04540381     -0.048540000       0.01986028
factor.x2.2 factor.x2.2  0.23350314     -0.002184034       0.15478083
factor.x2.3 factor.x2.3  0.04506754     -0.063760940       0.09520172
factor.x3.2 factor.x3.2  0.08297860     -0.057287056       0.09292398
factor.x3.3 factor.x3.3 -0.15542543     -0.124509722       0.02694244
factor.x3.4 factor.x3.4 -0.05176159     -0.092618253       0.05736522

When simulating multivariate data for regression, how can I set the R-squared (example code included)?

I am trying to simulate a three-variable dataset so that I can run linear regression models on it. 'X1' and 'X2' would be continuous independent variables (mean=0, sd=1), and 'Y' would be the continuous dependent variable.
The regression model should produce coefficients like this:
Y = 5 + 3(X1) - 2(X2)
I would like to simulate this dataset such that the resulting regression model has an R-squared value of 0.2. How can I determine the value of 'sd.value' so that the regression model has this R-squared?
n <- 200
set.seed(101)
sd.value <- 1
X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
simdata <- data.frame(X1, X2, Y)
summary(lm(Y ~ X1 + X2, data=simdata))
Take a look at this code, it should be close enough to what you want:
simulate <- function(n.obs=10^4, beta=c(5, 3, -2), R.sq=0.8) {
stopifnot(length(beta) == 3)
df <- data.frame(x1=rnorm(n.obs), x2=rnorm(n.obs)) # x1 and x2 are independent
var.epsilon <- (beta[2]^2 + beta[3]^2) * (1 - R.sq) / R.sq
stopifnot(var.epsilon > 0)
df$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
df$y <- with(df, beta[1] + beta[2]*x1 + beta[3]*x2 + epsilon)
return(df)
}
get.R.sq <- function(desired) {
model <- lm(y ~ x1 + x2, data=simulate(R.sq=desired))
return(summary(model)$r.squared)
}
df <- data.frame(desired.R.sq=seq(from=0.05, to=0.95, by=0.05))
df$actual.R.sq <- sapply(df$desired.R.sq, FUN=get.R.sq)
plot(df)
abline(a=0, b=1, col="red", lty=2)
Basically your question comes down to figuring out the expression for var.epsilon. Since we have y = b1 + b2*x1 + b3*x2 + epsilon, and Xs and epsilon are all independent, we have var[y] = b2^2 * var[x1] + b3^2 * var[x2] + var[eps], where the var[Xs]=1 by assumption. You can then solve for var[eps] as a function of R-squared.
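Worked through for the numbers in the question, and assuming standard normal, independent X1 and X2 as stated there: R^2 = (b2^2 + b3^2) / (b2^2 + b3^2 + var[eps]), so var[eps] = (b2^2 + b3^2) * (1 - R^2) / R^2. For R^2 = 0.2 that is 13 * 0.8 / 0.2 = 52. The sketch below wraps this in a helper (sd_for_rsq is a made-up name) and checks it on a large simulated sample.

```r
# Hypothetical helper: error SD that yields a target population R-squared,
# assuming independent predictors with unit variance
sd_for_rsq <- function(beta, r2) {
  signal <- sum(beta[-1]^2)          # variance explained by the predictors
  sqrt(signal * (1 - r2) / r2)
}

sd.value <- sd_for_rsq(c(5, 3, -2), 0.2)   # sqrt(52), about 7.21

# Check on a large sample so the sample R-squared is close to the target
set.seed(101)
n <- 1e5
X1 <- rnorm(n)
X2 <- rnorm(n)
Y <- rnorm(n, 5 + 3 * X1 - 2 * X2, sd.value)
r2.hat <- summary(lm(Y ~ X1 + X2))$r.squared
r2.hat
```

With n = 200, as in the question, the sample R-squared will scatter noticeably around 0.2 from one simulated data set to the next.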
So the formula for R^2 is 1-var(residual)/var(total)
In this case, the variance of Y is going to be 3^2+2^2+sd.value^2, since we are adding three independent random variables. And, asymptotically, the residual variance is going to be simply sd.value^2.
So you can compute rsquared explicitly with this function:
rsq<-function(x){1-x^2/(9+ 4+x^2)}
With a little algebra, you can compute the inverse of this function:
rsqi<-function(x){sqrt(13)*sqrt((1-x)/x)}
So setting sd.value<-rsqi(rsquared) should give you what you want.
We can test this as follows:
simrsq<-function(x){
Y <- rnorm(n, (5 + 3*X1 - 2*X2), rsqi(x))
simdata <- data.frame(X1, X2, Y)
summary(lm(Y ~ X1 + X2, data=simdata))$r.squared
}
meanrsq<-rep(0,9)
for(i in 1:50)
meanrsq<-meanrsq+Vectorize(simrsq)((1:9)/10)
meanrsq/50
[1] 0.1031827 0.2075984 0.3063701 0.3977051 0.5052408 0.6024988 0.6947790
[8] 0.7999349 0.8977187
So it looks to be correct.
This is how I would do it (blind iterative algorithm, assuming no knowledge, for when you are purely interested in "how to simulate this"):
simulate.sd <- function(nsim=10, n=200, seed=101, tol=0.01) {
set.seed(seed)
sd.value <- 1
rsquare <- 1:nsim
results <- 1:nsim
for (i in 1:nsim) {
# tracking iterations: if we miss the value, abort after 3000 steps.
iter <- 0
while (rsquare[i] > (0.20 + tol) | rsquare[i] < (0.2 - tol)) {
sd.value <- sd.value + 0.01
rsquare[i] <- simulate.sd.iter(sd.value, n)
iter <- iter + 1
if (iter > 3000) { break }
}
results[i] <- sd.value # store the current sd.value that is OK!
sd.value <- 1
}
cbind(results, rsquare)
}
simulate.sd.iter <- function(sd.value, n=200) { # helper function
# Takes the sd.value, creates data, and returns the r-squared
X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
Y <- rnorm(n, (5 + 3*X1 - 2*X2), sd.value)
simdata <- data.frame(X1, X2, Y)
return(summary(lm(Y ~ X1 + X2, data=simdata))$r.squared)
}
simulate.sd()
A few things to note:
I let X1 and X2 vary, since this affects the sought sd.value.
The tolerance is how exact you want this estimate to be. Are you fine with an r-squared of ~0.19 or ~0.21? Have the tolerance be 0.01.
Note that a too precise tolerance might not allow you to find a result.
The value of 1 is quite a bad starting value, making this iterative algorithm quite slow.
The resulting vector for 10 results is:
[1] 5.64 5.35 5.46 5.42 5.79 5.39 5.64 5.62 4.70 5.55,
which takes roughly 13 seconds on my machine.
My next step would be to start from 4.5, add 0.001 to the iteration instead of 0.01, and perhaps lower the tolerance. Good luck!
Alright, some summary statistics for nsim=100, taking 150 seconds, with steps increase of 0.001, and tolerance still at 0.01:
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.513 4.913 5.036 5.018 5.157 5.393
Why are you interested in this though?
Here is some more code to simulate a multiple linear regression with normally distributed errors:
Oops, sorry: this code only simulates a multiple regression.
sim.regression<-function(n.obs=10,coefficients=runif(10,-5,5),s.deviation=.1){
n.var=length(coefficients)
M=matrix(0,ncol=n.var,nrow=n.obs)
beta=as.matrix(coefficients)
for (i in 1:n.var){
M[,i]=rnorm(n.obs,0,1)
}
y=M %*% beta + rnorm(n.obs,0,s.deviation)
return (list(x=M,y=y,coeff=coefficients))
}

Resources