I am trying to calculate the coverage probability for a set of residual bootstrap replicates I generated for the intercept and slope of a regression. Can anyone show me how to calculate the coverage probability of the confidence intervals? Many thanks.
Note that I manually ran the regression using a QR decomposition, but you can use lm() if that's easier. I just thought doing it manually would be faster.
set.seed(42) ## for sake of reproducibility
n <- 100
x <- rnorm(n)
e <- rnorm(n)
y <- as.numeric(50 + 25*x + e)
dd <- data.frame(id=1:n, x=x, y=y)
mo <- lm(y ~ x, data=dd)
# Manual Residual Bootstrap
resi <- residuals(mo)
fit <- fitted(mo)
ressampy <- function() fit + sample(resi, length(resi), replace=TRUE)
# Sample y values:
head(ressampy())
# Qr decomposition of X values
qrX <- qr(cbind(Intercept=1, dd[, "x", drop=FALSE]), LAPACK=TRUE)
# faster than LM
qr.coef(qrX, dd[, "y"])
# One Bootstrap replication
boot1 <- qr.coef(qrX, ressampy())
# 1000 bootstrap replications
boot <- t(replicate(1000, qr.coef(qrX, ressampy())))
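(Not part of the original question, but for orientation: once the replicate matrix boot is available, 95% percentile intervals for both coefficients can be read off it directly, as a quick sketch.)
# a sketch: 95% percentile intervals from the bootstrap replicates
t(apply(boot, 2, quantile, probs = c(0.025, 0.975)))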
EDIT
Incorporating jay.sf's answer, I rewrote the code to use the lm() method and compared the first and second approaches to calculating coverage probability from the link shared by jay.sf:
library(lmtest);library(sandwich)
ci <- confint(coeftest(mo, vcov.=vcovHC(mo, type="HC3")))
ci
FUNInter <- function() {
  X <- model.matrix(mo)
  ressampy.2 <- fit + sample(resi, length(resi), replace = TRUE)
  bootmod <- lm(ressampy.2 ~ X - 1)
  confint(bootmod, "X(Intercept)", level = 0.95)
}
FUNBeta <- function() {
  X <- model.matrix(mo)
  ressampy.2 <- fit + sample(resi, length(resi), replace = TRUE)
  bootmod <- lm(ressampy.2 ~ X - 1)
  confint(bootmod, "Xx", level = 0.95)
}
set.seed(42)
R <- 1000
Interres <- replicate(R, FUNInter(), simplify=FALSE)
Betares <- replicate(R, FUNBeta(), simplify=FALSE)
ciinter <- t(sapply(Interres, function(x, y) x[grep(y, rownames(x)), ], "X\\(Intercept\\)"))
cibeta <- t(sapply(Betares, function(x, y) x[grep(y, rownames(x)), ], "Xx"))
#second approach of calculating CP
sum(ciinter[,"2.5 %"] <=50 & 50 <= ciinter[,"97.5 %"])/R
[1] 0.842
sum(cibeta[,"2.5 %"] <=25 & 25 <= cibeta[,"97.5 %"])/R
[1] 0.945
#first approach of calculating CP
sum(apply(ciinter, 1, function(x) {
  all(data.table::between(x, ci[1,1], ci[1,2]))
}))/R
[1] 0.076
sum(apply(cibeta, 1, function(x) {
  all(data.table::between(x, ci[2,1], ci[2,2]))
}))/R
[1] 0.405
According to Morris et al. (2019), Table 6, the coverage probability is defined as the probability that the true theta lies within a bootstrapped confidence interval (CI), i.e. one from the model applied to many samples based on the actual data, or, in other words, new experiments.
Hence, we want to compute CIs based on the OP's proposed i.i.d. bootstrap R times and calculate the proportion of times theta falls inside these CIs.
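In symbols, with R bootstrap replications, true value theta, and the CI limits of replication r written with hats (notation adapted here, not copied from the paper):
\widehat{\text{Coverage}} = \frac{1}{R}\sum_{r=1}^{R} \mathbf{1}\left(\hat\theta_{\mathrm{low},r} \le \theta \le \hat\theta_{\mathrm{upp},r}\right)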
First, we estimate our model mo using the actual data.
mo <- lm(y ~ x)
To avoid unnecessary re-computation in the replications, we extract the fitted values yhat, the residuals u, and the model matrix X beforehand.
yhat <- mo$fitted.values
u <- as.matrix(mo$residuals)
X <- model.matrix(mo)
theta <- c(50, 25) ## known from data generating process of simulation
In a bootstrap function FUN we wrap all the steps of one replication. In order to use the very fast .lm.fit, we have to calculate the White standard errors manually (identical to lmtest::coeftest(fit, vcov.=sandwich::vcovHC(fit, type="HC1"))).
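In formula form, the standard errors computed in FUN below are the square roots of the diagonal of the HC1 sandwich estimator (with n observations and k regressors):
\widehat{\operatorname{Var}}_{\mathrm{HC1}}(\hat\beta) = \frac{n}{n-k}\,(X'X)^{-1}\Big(X'\operatorname{diag}(\hat u_i^2)\,X\Big)(X'X)^{-1}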
FUN <- function() {
  ## resampling residuals
  y.star <- yhat + sample(u, length(u), replace=TRUE)
  ## refit model
  fit <- .lm.fit(X, y.star)
  coef <- fit$coefficients[sort.list(fit$pivot)]
  ## alternatively using QR, but `.lm.fit` is slightly faster
  # qrX <- qr(X, LAPACK=TRUE)
  # coef <- qr.coef(qrX, y.star)
  ## White standard errors
  v.cov <- chol2inv(chol(t(X) %*% X))
  meat <- t(X) %*% diag(diag(u %*% t(u))) %*% X
  ## degrees of freedom adjustment (HC1)
  d <- dim(X)
  dfa <- d[1] / (d[1] - d[2])
  white.se <- sqrt(diag(v.cov %*% meat %*% v.cov)*dfa)
  ## 95% CIs
  ci <- coef + qt(1 - .025, d[1] - d[2])*white.se %*% t(c(-1, 1))
  ## coverage
  c(intercept=theta[1] >= ci[1, 1] & theta[1] <= ci[1, 2],
    x=theta[2] >= ci[2, 1] & theta[2] <= ci[2, 2])
}
Now we execute the bootstrap using replicate.
R <- 5e3
set.seed(42)
system.time(res <- t(replicate(R, FUN())))
# user system elapsed
# 71.19 28.25 100.28
head(res, 3)
# intercept x
# [1,] TRUE TRUE
# [2,] FALSE TRUE
# [3,] TRUE TRUE
The mean of TRUEs in both columns simultaneously across the rows, or in each column respectively, gives the coverage probability we are looking for.
(cp.t <- mean(rowSums(res) == ncol(res))) ## coverage probability total
(cp.i <- colMeans(res)) ## coverage probability individual coefs
(cp <- c(total=cp.t, cp.i))
# total intercept x
# 0.8954 0.9478 0.9444
## values with other R:
# total intercept x
# 0.90700 0.95200 0.95200 ## R == 1k
# 0.89950 0.95000 0.94700 ## R == 2k
# 0.89540 0.94780 0.94440 ## R == 5k
# 0.89530 0.94570 0.94680 ## R == 10k
# 0.89722 0.94694 0.94777 ## R == 100k
And this is how it looks after 100k repetitions:
Code for plot:
r1 <- sapply(seq(nrow(res)), \(i) mean(rowSums(res[1:i,,drop=FALSE]) == ncol(res)))
r2 <- t(sapply(seq(nrow(res)), \(i) colMeans(res[1:i,,drop=FALSE])))
r <- cbind(r1, r2)
matplot(r, type='l', col=2:4, lty=1, main='coverage probability', xlab='R',
ylab='cum. mean',ylim=c(.89, .955))
grid()
sapply(seq(cp), \(i) abline(h=cp[i], lty=2, col=i + 1))
legend('right', col=2:4, lty=1, legend=names(cp), bty='n')
Data:
set.seed(42)
n <- 1e3
x <- rnorm(n)
y <- 50 + 25*x + rnorm(n)
I am applying a moving block bootstrap (MBB) to a regression model using time series data. When I calculated the coverage probability of the estimators derived from the MBB, the outcomes were anomalous except for one coefficient (the coefficient for x1, which was set to be a continuous variable). Given that MBB is a well-established method (see https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.713.1262&rep=rep1&type=pdf and https://en.wikipedia.org/wiki/Bootstrapping_(statistics)), I was wondering if there is something wrong with my code. I appreciate any input!
set.seed(63)
#create a function to generate time series data
tsfunc3 <- function(size=30, ar=0.7) {
  ar.epsilon <- arima.sim(list(order = c(1,0,0), ar = 0.7), n = size, sd=2)
  x1 <- rnorm(size)
  x2 <- sample(1:5, size, replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))
  x3 <- rbinom(size, 1, 0.5)
  # a combination of continuous predictor x1, ordinal predictor x2 and binary predictor x3
  y <- as.numeric(5 + 0.25*x1 + 0.4*x2 + 0.8*x3 + ar.epsilon)
  data.frame(time=1:size, x1=x1, x2=x2, x3=x3, y=y)
}
#A time series
tdat <- tsfunc3()
# Block length derived from the data based on the approach proposed by Politis & White (2003):
b <- 3
#Initial values
#blocks=tdat[1:3,c(2,3,4,5)]
n <- 30
#A sequence of blocks
blocks <- lapply(seq_len(n-b+1), function(i) seq(i, i+b-1))
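(Not part of the original question: the block length b is hard-coded above; a sketch of how it could instead be derived from the data along the lines of the Politis & White approach, assuming the np package and its b.star() function are available:)
# a sketch, assuming the np package is installed
library(np)
b.star(tdat$y) # suggested block lengths for the stationary and circular block bootstrap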
#MBB for intercept estimator
IntMbb <- function() {
  take.blocks <- sample(1:28, 10, replace=TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]), ]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "(Intercept)", level = 0.95)
}
#MBB for x1 coefficient estimator
B1Mbb <- function() {
  take.blocks <- sample(1:28, 10, replace=TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]), ]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "x1", level = 0.95)
}
#MBB for x2 coefficient estimator
B2Mbb <- function() {
  take.blocks <- sample(1:28, 10, replace=TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]), ]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "x2", level = 0.95)
}
#MBB for x3 coefficient estimator
B3Mbb <- function() {
  take.blocks <- sample(1:28, 10, replace=TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]), ]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "x3", level = 0.95)
}
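(Not part of the original question: since the four functions above differ only in the coefficient name passed to confint(), they could be collapsed into one helper; MbbCI is a hypothetical name.)
# hypothetical helper combining IntMbb/B1Mbb/B2Mbb/B3Mbb; `parm` names the coefficient
MbbCI <- function(parm, level = 0.95) {
  take.blocks <- sample(1:28, 10, replace = TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]), ]
  regmbb <- lm(y ~ x1 + x2 + x3, data = newdat)
  confint(regmbb, parm, level = level)
}
# e.g. int.mbb <- replicate(R, MbbCI("(Intercept)"), simplify = FALSE)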
#Replications
set.seed(47)
R <- 100
int.mbb <- replicate(R, IntMbb(), simplify=FALSE)
b1.mbb <- replicate(R, B1Mbb(), simplify=FALSE)
b2.mbb <- replicate(R, B2Mbb(), simplify=FALSE)
b3.mbb <- replicate(R, B3Mbb(), simplify=FALSE)
#Calculate coverage probability for intercept estimator
int.ci <- t(sapply(int.mbb, function(x, y) x[grep(y, rownames(x)), ], "Intercept"))
sum(int.ci[,"2.5 %"] <=5 & 5 <= int.ci[,"97.5 %"])/R
[1] 0.34
#Calculate coverage probability for x1 coefficient estimator
int.ci <- t(sapply(b1.mbb, function(x, y) x[grep(y, rownames(x)), ], "x1"))
sum(int.ci[,"2.5 %"] <=0.25 & 0.25 <= int.ci[,"97.5 %"])/R
[1] 0.9
#Calculate coverage probability for x2 coefficient estimator
int.ci <- t(sapply(b2.mbb, function(x, y) x[grep(y, rownames(x)), ], "x2"))
sum(int.ci[,"2.5 %"] <=0.4 & 0.4 <= int.ci[,"97.5 %"])/R
[1] 0.38
#Calculate coverage probability for x3 coefficient estimator
int.ci <- t(sapply(b3.mbb, function(x, y) x[grep(y, rownames(x)), ], "x3"))
sum(int.ci[,"2.5 %"] <=0.8 & 0.8 <= int.ci[,"97.5 %"])/R
[1] 0.33
As you can see, only the coverage probability for the x1 coefficient estimator is OK. So is there anything wrong with my code? Or does this have something to do with MBB itself?
You're not really evaluating the coverage probabilities for the bootstrap. You need to build the confidence interval from the bootstrapped statistics, rather than make confidence intervals from the parametric models run on the bootstrapped samples. Here's how I would do it.
First, we can generate the data:
set.seed(45301)
b <- 3
n <- 30
nblocks <- ceiling(n/b)
blocks <- lapply(seq_len(n-b+1), function(i) seq(i, i+b-1))
#A time series
tdat <- tsfunc3(size=n, ar=.7)
Next, we could write a function that we will bootstrap. This function generates the bootstrap sample, runs the regression and saves the coefficients.
bsfun <- function(data, blocks){
  samp.data <- data[sample(1:length(blocks), length(blocks), replace=TRUE), ]
  mod <- lm(y ~ x1 + x2 + x3, data=samp.data)
  coef(mod)
}
Next, we can run the function lots of times. Note that to generate a reliable 95% percentile confidence interval, you should have in the neighborhood of 1500-2500 bootstrap statistics. The farther the quantile you're trying to characterize is in the tails, the more bootstrap samples you need. So, the code below generates a single set of bootstrap coefficients:
out <- t(replicate(1000, bsfun(data=tdat, blocks=blocks)))
From this one set of bootstrap statistics, we can make a single confidence interval.
ci1 <- t(apply(out, 2, quantile, probs=c(.025,.975), na.rm=TRUE))
# 2.5% 97.5%
# (Intercept) -0.3302237 10.258229
# x1 -1.7577214 2.301975
# x2 -0.8016478 2.049435
# x3 -3.0723869 6.190383
If you want to investigate the coverage probabilities of these intervals, you would have to do what I did above lots of times (we'll do 100, though to get better estimates, you would probably want to do more). We could then write a little function that would evaluate the coverage of one set of estimates:
eval_cover <- function(true = c(5, .25, .4, .8), obs){
  out <- as.numeric(obs[,1] < true & obs[,2] > true)
  names(out) <- rownames(obs)
  out
}
Then, you could apply that function to each of the bootstrap confidence intervals you generated. Using rowMeans() gives the mean of the coverage 1/0 values, which is the coverage probability. In this case, using only 100 intervals, the coverage is 100%.
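(The object outci used below is not constructed in the answer as shown; a minimal sketch of how it could be built under the same assumptions, repeating the interval construction above 100 times, which takes a while to run:)
# hypothetical reconstruction of `outci`: 100 repetitions of the percentile-CI construction
outci <- replicate(100, {
  out <- t(replicate(1000, bsfun(data = tdat, blocks = blocks)))
  t(apply(out, 2, quantile, probs = c(.025, .975), na.rm = TRUE))
}, simplify = FALSE)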
rowMeans(sapply(outci, function(x)eval_cover(obs=x)))
# (Intercept) x1 x2 x3
# 1 1 1 1
For my thesis I have to fit some GLM models with MLEs that R doesn't have. I was doing fine for the models with a closed form, but now I have to use the Gaussian CDF, so I decided to fit a simple probit model.
This is the code:
Data:
set.seed(123)
x <-matrix( rnorm(50,2,4),50,1)
m <- matrix(runif(50,2,4),50,1)
t <- matrix(rpois(50,0.5),50,1)
z <- (1+exp(-((x-mean(x)/sd(x)))))^-1 + runif(50)
y <- ifelse(z < 1.186228, 0, 1)
data1 <- as.data.frame(cbind(y,x,m,t))
myprobit <- function(formula, data)
{
  mf <- model.frame(formula, data)
  y <- model.response(mf, "numeric")
  X <- model.matrix(formula, data = data)
  if (any(is.na(cbind(y, X))))
    stop("Some data are missing.")
  loglik <- function(betas, X, y, sigma) { # log-likelihood
    p <- length(betas)
    beta <- betas[-p]
    eta <- X %*% beta
    sigma <- 1 # because of identification, sigma must be equal to 1
    G <- pnorm(y, mean = eta, sd = sigma)
    sum(y*log(G) + (1-y)*log(1-G))
  }
  ls.reg <- lm(y ~ X - 1) # starting values using OLS; X already contains a constant
  start <- coef(ls.reg)
  fit <- optim(start, loglik, X = X, y = y, control = list(fnscale = -1),
               method = "BFGS", hessian = TRUE) # optimize
  if (fit$convergence > 0) {
    print(fit)
    stop("optim failed to converge!") # verify convergence
  }
  return(fit)
}
myprobit(y ~ x + m + t,data = data1)
And I get: Error in X %*% beta : non-conformable arguments. If I change start <- coef(ls.reg) to start <- c(coef(ls.reg), 1), I get wrong estimates compared with:
probit <- glm(y ~ x + m + t,data = data1 , family = binomial(link = "probit"))
What am I doing wrong?
Is it possible to correctly fit this model using pnorm? If not, what algorithm should I use to approximate the Gaussian CDF? Thanks!!
The line of code responsible for your error is the following:
eta <- X %*% beta
Note that "%*%" is the matrix multiplication operator. By reproducing your code I noticed that X is a matrix with 50 rows and 4 columns. Hence, for matrix multiplication to be possible your "beta" needs to have 4 rows. But when you run "betas[-p]" you subset the betas vector by removing its last element, leaving only three elements instead of the four you need for matrix multiplication to be defined. If you remove [-p] the code will work.
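(A minimal sketch of that change, shown here for illustration only; it addresses the dimension error, while the rest of the likelihood is left exactly as in the question:)
loglik <- function(betas, X, y, sigma) {
  eta <- X %*% betas # use the full coefficient vector; no betas[-p]
  sigma <- 1 # identification: sigma fixed at 1
  G <- pnorm(y, mean = eta, sd = sigma)
  sum(y*log(G) + (1-y)*log(1-G))
}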
Is there a way to get predict-like behavior with standard errors from lfe::felm if the fixed effects are swept out using the projection method in felm? This question is very similar to the question here, but none of the answers to that question can be used to estimate standard errors or confidence/prediction intervals. I know that there's currently no predict.felm, but I am wondering if there are workarounds similar to those linked above that might also work for estimating the prediction interval.
library(DAAG)
library(lfe)
model1 <- lm(data = cps1, re74 ~ age + nodeg + marr)
predict(model1, newdata = data.frame(age=40, nodeg = 0, marr=1), se.fit = T, interval="prediction")$fit
# Result: fit lwr upr
# 1 18436.18 2339.335 34533.03
model2 <- felm(data = cps1, re74 ~ age | nodeg + marr)
predict(model2, newdata = data.frame(age=40, nodeg = 0, marr=1), se.fit = T, interval="prediction")$fit
# Does not work
The goal is to estimate a prediction interval for yhat, for which I think I'd need to compute the full variance-covariance matrix (including the fixed effects). I haven't been able to figure out how to do this, and I'm wondering if it's even computationally feasible.
After conversations with several people, I don't believe it is possible to obtain an estimate of the distribution of yhat=Xb (where X includes both the covariates and the fixed effects) directly from felm, which is what this question boils down to. It is possible to bootstrap them, however. The following code does so in parallel. There is scope for performance improvements, but this gives the general idea.
Note: here I do not compute the full prediction interval, just the SEs on Xb, but obtaining the prediction interval is straightforward: add sigma^2 to the variance of Xb before taking the square root.
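In formula form (using the usual homoskedastic error variance for a new observation):
\operatorname{se}_{\mathrm{PI}} = \sqrt{\operatorname{Var}\!\big(x_0'\hat\beta\big) + \hat\sigma^2}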
library(DAAG)
library(lfe)
library(parallel)
model1 <- lm(data = cps1, re74 ~ age + nodeg + marr)
yhat_lm <- predict(model1, newdata = data.frame(age=40, nodeg = 0, marr=1), se.fit = T)
set.seed(42)
boot_yhat <- function(b) {
  print(b)
  n <- nrow(cps1)
  boot <- cps1[sample(1:n, n, replace=T), ]
  lm.model <- lm(data=demeanlist(boot[, c("re74", "age")],
                                 list(factor(boot$nodeg), factor(boot$marr))),
                 formula = re74 ~ age)
  fe <- getfe(felm(data = boot, re74 ~ age | nodeg + marr))
  bootResult <- predict(lm.model, newdata = data.frame(age = 40)) +
    fe$effect[fe$fe == "nodeg" & fe$idx==0] +
    fe$effect[fe$fe == "marr" & fe$idx==1]
  return(bootResult)
}
B = 1000
yhats_boot <- mclapply(1:B, boot_yhat)
plot(density(rnorm(10000, mean=yhat_lm$fit, sd=yhat_lm$se.fit)))
lines(density(unlist(yhats_boot)), col="red")
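(Not in the original answer: a bootstrap standard error and percentile interval for Xb can also be read off the replicates directly, as a quick sketch:)
xb <- unlist(yhats_boot) # collapse the list returned by mclapply
sd(xb) # bootstrap SE of Xb
quantile(xb, c(0.025, 0.975)) # 95% percentile interval for Xb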
From your first model predict(.) yields this:
# fit lwr upr
# 1 18436.18 2339.335 34533.03
Following 李哲源 we can achieve these results manually, too.
beta.hat.1 <- coef(model1) # save coefficients
# model matrix: age=40, nodeg = 0, marr=1:
X.1 <- cbind(1, matrix(c(40, 0, 1), ncol=3))
pred.1 <- as.numeric(X.1 %*% beta.hat.1) # prediction
V.1 <- vcov(model1) # save var-cov matrix
se2.1 <- unname(rowSums((X.1 %*% V.1) * X.1)) # prediction var
alpha.1 <- qt((1-0.95)/2, df = model1$df.residual) # 5 % level
pred.1 + c(alpha.1, -alpha.1) * sqrt(se2.1) # 95%-CI
# [1] 18258.18 18614.18
sigma2.1 <- sum(model1$residuals ^ 2) / model1$df.residual # sigma.sq
PI.1 <- pred.1 + c(alpha.1, -alpha.1) * sqrt(se2.1 + sigma2.1) # prediction interval
matrix(c(pred.1, PI.1), nrow = 1, dimnames = list(1, c("fit", "lwr", "upr")))
# fit lwr upr
# 1 18436.18 2339.335 34533.03
Now, applying your linked example to multiple FEs, we get these results:
lm.model <- lm(data=demeanlist(cps1[, c(8, 2)],
list(as.factor(cps1$nodeg),
as.factor(cps1$marr))), re74 ~ age)
fe <- getfe(model2)
predict(lm.model, newdata = data.frame(age = 40)) + fe$effect[fe$idx=="1"]
# [1] 15091.75 10115.21
The first value is with and the second without added FE (try fe$effect[fe$idx=="1"]).
Now we're following the manual approach above.
beta.hat <- coef(model2) # coefficient
x <- 40 # age = 40
pred <- as.numeric(x %*% beta.hat) # prediction
V <- model2$vcv # var/cov
se2 <- unname(rowSums((x %*% V) * x)) # prediction var
alpha <- qt((1-0.95)/2, df = model2$df.residual) # 5% level
pred + c(alpha, -alpha) * sqrt(se2) # CI
# [1] 9599.733 10630.697
sigma2 <- sum(model2$residuals ^ 2) / model2$df.residual # sigma^2
PI <- pred + c(alpha, -alpha) * sqrt(se2 + sigma2) # PI
matrix(c(pred, PI), nrow = 1, dimnames = list(1, c("fit", "lwr", "upr"))) # output
# fit lwr upr
# 1 10115.21 -5988.898 26219.33
As we see, the fit is the same as with the linked example approach, but now with a prediction interval. (Disclaimer: The logic of the approach should be straightforward; the values of the PI should still be evaluated, e.g. in Stata with reghdfe.)
Edit: In case you want to achieve exactly the same output from felm() that predict.lm() yields for the linear model1, you simply need to "include" the fixed effects in your model again (see model3 below) and then follow the same approach. For more convenience you could easily wrap it into a function.
library(DAAG)
library(lfe)
model3 <- felm(data = cps1, re74 ~ age + nodeg + marr)
pv <- c(40, 0, 1) # prediction x-values
predict0.felm <- function(mod, pv.=pv) {
  beta.hat <- coef(mod)                                        # coefficients
  x <- cbind(1, matrix(pv., ncol=3))                           # prediction vector
  pred <- as.numeric(x %*% beta.hat)                           # prediction
  V <- mod[['vcv']]                                            # var/cov
  se2 <- unname(rowSums((x %*% V) * x))                        # prediction var
  alpha <- qt((1-0.95)/2, df = mod[['df.residual']])           # 5% level
  CI <- structure(pred + c(alpha, -alpha) * sqrt(se2),
                  names=c("CI lwr", "CI upr"))                 # CI
  sigma2 <- sum(mod[['residuals']] ^ 2) / mod[['df.residual']] # sigma^2
  PI <- pred + c(alpha, -alpha) * sqrt(se2 + sigma2)           # PI
  mx <- matrix(c(pred, PI), nrow = 1,
               dimnames = list(1, c("PI fit", "PI lwr", "PI upr"))) # output
  list(CI, mx)
}
predict0.felm(model3)[[2]]
# PI fit PI lwr PI upr
# 1 18436.18 2339.335 34533.03
In this way, felm() gives you the same prediction interval as predict.lm().
The nls function works normally like the following:
x <- 1:10
y <- 2*x + 3 # perfect fit
yeps <- y + rnorm(length(y), sd = 0.01) # added noise
nls(yeps ~ a + b*x, start = list(a = 0.12345, b = 0.54321))#
Because the model I use has a lot of parameters, or because I don't know beforehand what will be included in the parameter list, I want something like the following:
tmp <- function(x,p) { p["a"]+p["b"]*x }
p0 <- c(a = 0.12345, b = 0.54321)
nls(yeps ~ tmp(x,p), start = list(p=p0))
Does anyone know how to modify the nls function so that it can accept a parameter vector argument in the formula instead of many separate parameters?
You can give a vector of initial coefficients like this:
tmp <- function(x, coef){
  a <- coef[1]
  b <- coef[2]
  a + b*x
}
x <- 1:10
yeps <- y + rnorm(length(y), sd = 0.01) # added noise
nls(yeps ~ a + b*x, start = list(a = 0.12345, b = 0.54321))#
nls(yeps ~ tmp(x,coef), start = list(coef = c(0.12345, 0.54321)))
Nonlinear regression model
model: yeps ~ tmp(x, coef)
data: parent.frame()
coef1 coef2
3 2
residual sum-of-squares: 0.0016
Number of iterations to convergence: 2
Achieved convergence tolerance: 3.47e-08
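(Not part of the original answer: if you want the names from p0 back on the estimates, a small follow-up, assuming the fit object is stored:)
fit <- nls(yeps ~ tmp(x, coef), start = list(coef = c(0.12345, 0.54321)))
coef(fit) # named coef1, coef2 in the order supplied
setNames(coef(fit), c("a", "b")) # restore the question's parameter names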
PS:
example(nls)
should be a good start to understand how to play with nls.