Bootstrap in pair within a matched sample - r

Hi, I would like to compare outcomes between the treatment and control groups by bootstrapping the matched sample. The matched sample was obtained with the genetic matching approach provided by MatchIt. According to this paper:
Austin, P. C., & Small, D. S. (2014). The use of bootstrapping when using propensity-score matching without replacement: a simulation study. Statistics in Medicine, 33(24), 4306-4319.
I think I have to bootstrap the matched sample by matched pair, not by individual. However, I don't know how to resample by matched pair.
Here I provide an example:
id <- c("A", "B", "C", "D", "E", "F")
treatment <- c(1, 0, 1, 0, 1, 0)
subclass <- c(1, 1, 2, 2, 3, 3)
outcome1 <- c(100, 300, 400, 500, 600, 700)
outcome2 <- c(200, 50, 600, 800, 900, 1000)
matched_sample <- data.frame(id, treatment, subclass, outcome1, outcome2)
> matched_sample
  id treatment subclass outcome1 outcome2
1  A         1        1      100      200
2  B         0        1      300       50
3  C         1        2      400      600
4  D         0        2      500      800
5  E         1        3      600      900
6  F         0        3      700     1000
Subclass indicates the matched pair. For example, individual A and B are a matched pair because they share the same subclass number. Whenever A appears in any sample, B should also appear in that sample.
After bootstrapping, I will run regression on outcome1 and outcome2 to estimate the average treatment effects (ATE), and also to obtain the 95% confidence intervals of the ATEs.
I think the package boot might be useful, but I'm not sure how to use it. I would be really grateful for your help on this.
EDIT: The ATEs that I would like to estimate are basically the coefficients of "treatment" in regressions. That is,
lm.ATE1 <- lm(outcome1 ~ treatment)
lm.ATE2 <- lm(outcome2 ~ treatment)
The idea is to bootstrap the matched sample 10,000 times, estimate these two regressions within each bootstrapped sample, rank the resulting coefficients, and then take the coefficients at the 2.5th and 97.5th percentiles as the 95% confidence intervals for the ATE on outcome1 and outcome2, respectively.
Hopefully this clarifies. Thanks in advance.

The following function resamples matched pairs from matched_sample R times, so both members of a pair always enter a bootstrap sample together. For each resample it fits the two regressions and extracts the treatment coefficients (ATE1 and ATE2 in the question), returning a 2 x R matrix. Finally, apply computes the percentile 95% confidence intervals.
id <- c("A", "B", "C", "D", "E", "F")
treatment <- c(1, 0, 1, 0, 1, 0)
subclass <- c(1, 1, 2, 2, 3, 3)
outcome1 <- c(100, 300, 400, 500, 600, 700)
outcome2 <- c(200, 50, 600, 800, 900, 1000)
matched_sample <- data.frame(id, treatment, subclass, outcome1, outcome2)
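fun_boot <- function(data, R = 10000L) {
  # one bootstrap replicate: resample pairs, refit both models
  f <- function() {
    # draw n subclass labels with replacement
    b <- sample(uniq_sclass, n, TRUE)
    # pull the full pair (both rows) for every drawn label
    out <- sp[match(b, uniq_sclass)]
    out <- do.call(rbind, out)
    lm.ATE1 <- lm(outcome1 ~ treatment, out)
    lm.ATE2 <- lm(outcome2 ~ treatment, out)
    # second coefficient is the one on treatment
    c(ATE1 = unname(coef(lm.ATE1))[2],
      ATE2 = unname(coef(lm.ATE2))[2])
  }
  # one data frame per matched pair
  sp <- split(data, data$subclass)
  n <- length(sp)
  uniq_sclass <- names(sp)
  replicate(R, f())
}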
set.seed(2022)
# change this value to 10,000
R <- 10L
bootres <- fun_boot(matched_sample, R)
t(apply(bootres, 1, quantile, probs = c(0.025, 0.975)))
#>           2.5%      97.5%
#> ATE1 -159.1667 -100.00000
#> ATE2 -159.1667   47.91667
Created on 2022-08-12 by the reprex package (v2.0.1)
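Since the question mentions the boot package, the same pair-level resampling can also be sketched with boot::boot(). This is only a sketch under assumptions: the names pair_ids and ate_stat are made up for illustration, boot() is handed the vector of subclass labels rather than the rows, and R is kept small so it runs quickly.

```r
library(boot)

matched_sample <- data.frame(
  id        = c("A", "B", "C", "D", "E", "F"),
  treatment = c(1, 0, 1, 0, 1, 0),
  subclass  = c(1, 1, 2, 2, 3, 3),
  outcome1  = c(100, 300, 400, 500, 600, 700),
  outcome2  = c(200, 50, 600, 800, 900, 1000)
)

# One entry per matched pair; boot() resamples these labels, not the rows
# (pair_ids is a hypothetical name).
pair_ids <- unique(matched_sample$subclass)

# Hypothetical statistic: rebuild the sample from the drawn pairs,
# then fit both regressions and keep the treatment coefficients.
ate_stat <- function(ids, i) {
  rows <- unlist(lapply(ids[i], function(s) which(matched_sample$subclass == s)))
  d <- matched_sample[rows, ]
  c(ATE1 = unname(coef(lm(outcome1 ~ treatment, data = d))[2]),
    ATE2 = unname(coef(lm(outcome2 ~ treatment, data = d))[2]))
}

set.seed(2022)
b <- boot(data = pair_ids, statistic = ate_stat, R = 200)  # raise R for real use

# Percentile 95% intervals, one row per coefficient
ci <- t(apply(b$t, 2, quantile, probs = c(0.025, 0.975)))
rownames(ci) <- names(b$t0)
ci
```

Here b$t0 holds the estimates on the original matched sample and b$t holds one row per bootstrap replicate. boot.ci(b, type = "perc", index = 1) should give a comparable percentile interval for the first coefficient, although with only three pairs it may warn about extreme order statistics.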

Related

r: for loop to simulate predictions when random sampling is applied

I am trying to simulate how replacement/reassignment of values in random samples affects predictions, as measured by AUC.
I have a tumor classification in a data frame, denoted df$who, which has levels 1, 2, 3 corresponding to the severity of the tumor lesion.
Intro to the question
Let's say the baseline data looks like this:
set.seed(1)
df <- data.frame(
  who = as.factor(sample(1:3, size = 6000, replace = TRUE, prob = c(0.8, 0.15, 0.05))),
  age = round(runif(n = 6000, min = 18, max = 95), digits = 1),
  gender = sample(c("m", "f"), size = 6000, replace = TRUE, prob = c(1/3, 2/3)),
  event.time = runif(n = 6000, min = 8, max = 120),
  event = as.factor(sample(0:2, size = 6000, replace = TRUE, prob = c(0.25, 0.2, 0.55)))
)
And a standard cause-specific Cox regression looks like:
library(survival)
a_baseline <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df, x = TRUE)
From this model, AUC can be obtained as a measure of predictive performance. Here I use the leave-one-out bootstrap for the 5-year prediction of df$event == 1.
library(riskRegression)
u <- Score(list("baseline" = a_baseline),
  Surv(event.time, event == 1) ~ 1,
  data = df,
  times = 60,
  plots = "cal",
  B = 50,
  split.method = "loob",
  metrics = c("auc", "brier")
)
# The AUC is then obtained
u$AUC$score$AUC[2]
Question
I want to simulate how re-classifying a random 5% of df$who == 1 to df$who == 2 affects the 5-year prediction of df$event == 1.
I want to create 10 separate simulated copies of the baseline data df, each containing a random re-allocation of 5% of df$who == 1 to df$who == 2. Then, I want to use each of these 10 simulated datasets to predict the 5-year risk of df$event == 1.
I have tried a for loop for this. The expected output is a data frame that tells me which of the 10 simulated datasets yielded the highest and lowest u$AUC$score$AUC[2] (i.e., the best and worst prediction).
I am new to for loops, but here is my attempt (which obviously did not work).
all_auc <- data.frame() ## create a dataframe to fill in AUC from all 10 simulated sub-datasets
for(i in 1:10){ #1:10 represent the simulated datasets from 1 to 10
  df[i] <- df #allocating baseline data to each of the 10 datasets
  df[i]$who[sample(which(df[i]$who==1), round(0.05*length(which(df[i]$who==1))))]=2 #create the random 5% allocation of who==1 to who==2 in the i'th simulated dataset
  ith_cox <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df[i], x = TRUE) #create the i'th Cox regression based on the i'th dataset
  # create the predictions based on the i'th Cox
  u[i] <- Score(list("baseline" = ith_cox),
    Surv(event.time, event == 1) ~ 1,
    data = df[i],
    times = 60,
    plots = "cal",
    B = 50,
    split.method = "loob",
    metrics = c("auc", "brier")
  )
  # summarize all AUC from all 10 sub-datasets
  all_auc <- u[i]$AUC$score$AUC[2]
}
(1) I could not get this for loop to work as described, and
(2) the final data frame all_auc should report only which of the 10 datasets yielded the worst and best predictions (I will then use these two datasets for further analysis).
A final note
This is only a reproducible example. The for loop will be applied to 10,000 simulated datasets in our analysis. I do not know if this affects the answer, but it illustrates the importance of the result: a data frame (or vector?) that simply tells me which simulated datasets yielded the best and worst predictions, so that I can subsequently use those two data frames for further analysis, e.g. df2930 and df8939.
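One possible way to structure such a loop is sketched below. It is not a tested solution to the question: reclassify, run_simulations, and standin_score are hypothetical names, and the score function is a cheap placeholder (mean age among who == 2) so that the sketch runs with base R only. The design choice is to store one AUC and one seed per simulation rather than 10,000 data frames, and to regenerate only the best and worst datasets from their seeds afterwards.

```r
set.seed(1)
df <- data.frame(
  who = as.factor(sample(1:3, size = 6000, replace = TRUE, prob = c(0.8, 0.15, 0.05))),
  age = round(runif(n = 6000, min = 18, max = 95), digits = 1),
  gender = sample(c("m", "f"), size = 6000, replace = TRUE, prob = c(1/3, 2/3)),
  event.time = runif(n = 6000, min = 8, max = 120),
  event = as.factor(sample(0:2, size = 6000, replace = TRUE, prob = c(0.25, 0.2, 0.55)))
)

# Hypothetical helper: move a random fraction of who == 1 to who == 2.
reclassify <- function(data, frac = 0.05) {
  idx <- which(data$who == 1)
  data$who[sample(idx, round(frac * length(idx)))] <- "2"
  data
}

# Hypothetical driver: one seed and one score per simulated dataset.
run_simulations <- function(data, score_fun, n_sim = 10, frac = 0.05) {
  seeds <- seq_len(n_sim)
  auc <- numeric(n_sim)
  for (i in seq_len(n_sim)) {
    set.seed(seeds[i])                          # makes dataset i reproducible
    auc[i] <- score_fun(reclassify(data, frac))
  }
  list(auc = auc, seeds = seeds,
       best = which.max(auc), worst = which.min(auc))
}

# Placeholder score, NOT an AUC: it only demonstrates the mechanics.
standin_score <- function(d) mean(d$age[d$who == 2])

res <- run_simulations(df, standin_score)

# Rebuild only the two datasets of interest from their seeds
set.seed(res$seeds[res$best]);  df_best  <- reclassify(df)
set.seed(res$seeds[res$worst]); df_worst <- reclassify(df)
```

In the real analysis, score_fun would presumably be a function that fits the question's coxph() model on its argument, calls Score() as in the question, and returns u$AUC$score$AUC[2]. Because reclassify() is the first random call after set.seed(), the regenerated datasets match the ones that were scored.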

Estimate power for a smaller sample size - simr package

NB: this question has been cross-posted as this GitHub issue.
I am estimating sample size based on the data from a previous experiment that included 40 participants.
I used simr::powerCurve for several sample sizes smaller than the original one:
pc <- powerCurve(fit = model, nsim = 100, alpha=0.02,
breaks = c(10, 20, 30, 40), along = 'subject_id')
The results are identical for all sizes and are close to 100%. I assume this is due to the simulated sample size being smaller than the original one.
Is there a way to estimate power for sample sizes smaller than the one used to fit the model?
Here is a reproducible example using synthetic data (code taken from https://humburg.github.io/Power-Analysis/simr_power_analysis.html and adapted slightly):
library(simr)
subj <- factor(1:40)
class_id <- letters[1:5]
time <- 0:2
group <- c("control", "intervention")
subj_full <- rep(subj, 15)
class_full <- rep(rep(class_id, each=10), 3)
time_full <- rep(time, each=50)
group_full <- rep(rep(group, each=5), 15)
covars <- data.frame(id=subj_full, class=class_full, treat=group_full, time=factor(time_full))
## Intercept and slopes for intervention, time1, time2, intervention:time1, intervention:time2
fixed <- c(5, 2, 0.1, 0.2)
## Random intercepts for participants clustered by class
rand <- list(0.5, 0.1)
## residual variance
res <- 2
model <- makeLmer(y ~ treat + time + (1|class/id), fixef=fixed, VarCorr=rand, sigma=res, data=covars)
pc <- powerCurve(model, test = fixed('treat'), nsim=100, along='subj', breaks = c(10, 20, 30, 40))
print(pc)
The output is
Power for predictor 'treat', (95% confidence interval),
by number of levels in subj:
10: 100.0% (96.38, 100.0) - 150 rows
20: 100.0% (96.38, 100.0) - 300 rows
30: 100.0% (96.38, 100.0) - 450 rows
40: 100.0% (96.38, 100.0) - 600 rows
Time elapsed: 0 h 0 m 55 s

Draw ID's from a data frame based on conditions and probabilities for the conditions

I have the following data frame
ID <- c(1,2,3,4,5,6)
pop <- c(10,10,10,10,1000,1000)
df <- data.frame(pop,ID)
Now I would like to draw unique IDs from this data frame, with the condition that the probability of drawing an ID with a population greater than or equal to 1000 is 0.1 and the probability of drawing an ID with a population lower than 1000 is 0.9.
This is the code, that I have come up with:
sample(c(df$ID[df$pop>=1000],df$ID[df$pop<1000]),3, prob=c(0.1,0.9))
However, I get the following error message:
Error in sample.int(length(x), size, replace, prob) :
incorrect number of probabilities
How can I rewrite this problem so it works without assigning a probability to every ID?
Maybe this one is your solution:
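library(dplyr)
# 1. Data set
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  population = c(10, 10, 10, 10, 1000, 1000))
# 2. Set 'probability' by 'population'
df <- df %>%
  mutate(probability = ifelse((population >= 1000), 0.1, 0.9))
# 3. Draw three unique ids using the per-id probabilities
sample(df$id, 3, prob = df$probability)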
A simple way is to build a probability vector with one entry per ID and pass it to the sample function: 0.1 for IDs with pop >= 1000 and 0.9 for IDs with pop < 1000.
# Create a probability vector for each ID and pass it to sample function.
sample(df$ID, 3, prob = ifelse(df$pop>=1000, 0.1, 0.9))
#[1] 4 3 2
sample(df$ID, 3, prob = ifelse(df$pop>=1000, 0.1, 0.9))
#[1] 2 1 4
sample(df$ID, 3, prob = ifelse(df$pop>=1000, 0.1, 0.9))
#[1] 3 2 4
Data:
ID <- c(1,2,3,4,5,6)
pop <- c(10,10,10,10,1000,1000)
df <- data.frame(pop,ID)
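One caveat worth illustrating: sample() treats prob as relative weights per element and normalizes them, so with weights of 0.1 and 0.9 per ID the first draw lands in the large-population group with probability 0.2 / 3.8, roughly 0.05, not 0.1. If the intent is that the whole group has probability 0.1, one option is to split each group's probability evenly among its members. This is a sketch under that assumption (big and w are illustrative names), and since sampling without replacement renormalizes after every draw, it holds exactly only for the first draw.

```r
ID  <- c(1, 2, 3, 4, 5, 6)
pop <- c(10, 10, 10, 10, 1000, 1000)
df  <- data.frame(pop, ID)

big <- df$pop >= 1000

# Assumption: 0.1 and 0.9 are group totals, shared equally within each group
w <- ifelse(big, 0.1 / sum(big), 0.9 / sum(!big))

sum(w[big])    # total weight of the large-population group
sum(w[!big])   # total weight of the small-population group

set.seed(1)
s <- sample(df$ID, 3, prob = w)
s
```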

Stratified cluster sampling estimates from survey package

I want to estimate means and totals from a stratified sampling design in which single stage cluster sampling was used in each stratum. I believe I have the design properly specified using the svydesign() function of the survey package. But I'm not sure how to correctly specify the stratum weights.
Example code is shown below. I provide unadjusted stratum weights using the weights= argument. I expected that the estimate and the SE from svytotal() would be equal to the sum of the stratum weights (70, in the example) times the estimate and SE from svymean(). Instead the estimates differ by a factor of 530 (which is the sum of the stratum weights over all of the elements in the counts data) and the SEs differ by a factor of 898 (???). My questions are (1) how can I provide my 3 stratum weights to svydesign() in a way that it understands, and (2) why aren't the estimates and SEs from svytotal() and svymean() differing by the same factor?
library(survey)
# example data from a stratified sampling design in which
# single stage cluster sampling is used in each stratum
counts <- data.frame(
  Stratum=rep(c("A", "B", "C"), c(5, 8, 8)),
  Cluster=rep(1:8, c(3, 2, 3, 2, 3, 2, 3, 3)),
  Element=c(1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 3),
  Count = 1:21
)
# stratum weights
weights <- data.frame(
  Stratum=c("A", "B", "C"),
  W=c(10, 20, 40)
)
# combine counts and weights
both <- merge(counts, weights)
# estimate mean and total count
D <- svydesign(id=~Cluster, strata=~Stratum, weights=~W, data=both)
a <- svymean(~Count, D)
b <- svytotal(~Count, D)
sum(weights$W) # 70
sum(both$W) # 530
coef(b)/coef(a) # 530
SE(b)/SE(a) # 898.4308
First update
I'm adding a diagram to help explain my design. The entire population is a lake with known area (70 ha in this example). The strata have known areas, too (10, 20, and 40 ha). The number of clusters allocated to each stratum was not proportional. Also, the clusters are tiny relative to the number that could possibly be sampled, so the finite population correction is FPC = 1.
I want to calculate an overall mean and SE on a per unit area basis and a total that is equal to 70 times this mean and SE.
Second update
I wrote the code to do the calculations from scratch. I get a total estimate of 920 with se 61.6.
library(survey)
library(tidyverse)
# example data from a stratified sampling design in which
# single stage cluster sampling is used in each stratum
counts <- data.frame(
  Stratum=rep(c("A", "B", "C"), c(5, 8, 8)),
  Cluster=rep(1:8, c(3, 2, 3, 2, 3, 2, 3, 3)),
  Element=c(1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 1, 2, 3, 1, 2, 3),
  Count = c(5:1, 6:21)
)
# stratum weights
areas <- data.frame(
  Stratum=c("A", "B", "C"),
  A_h=c(10, 20, 40)
)
# calculate cluster means
step1 <- counts %>%
  group_by(Stratum, Cluster) %>%
  summarise(P_hi = sum(Count), m_hi=n())
step2 <- step1 %>%
  group_by(Stratum) %>%
  summarise(
    ybar_h = sum(P_hi) / sum(m_hi),
    n_h = n(),
    sh.numerator = sum((P_hi - ybar_h*m_hi)^2),
    mbar_h = mean(m_hi)
  ) %>%
  mutate(
    S_ybar_h = 1 / mbar_h * sqrt( sh.numerator / (n_h * (n_h-1)) )
  )
# now expand up to strata
step3 <- step2 %>%
  left_join(areas) %>%
  mutate(
    W_h = A_h / sum(A_h)
  ) %>%
  summarise(
    A = sum(A_h),
    ybar_strat = sum(W_h * ybar_h),
    S_ybar_strat = sum(W_h * S_ybar_h / sqrt(n_h))
  ) %>%
  mutate(
    tot = A * ybar_strat,
    S_tot = A * S_ybar_strat
  )
step2
step3
This gives the following output:
> step2
# A tibble: 3 x 6
  Stratum ybar_h   n_h sh.numerator   mbar_h S_ybar_h
   <fctr>  <dbl> <int>        <dbl>    <dbl>    <dbl>
1       A    3.0     2         18.0 2.500000 1.200000
2       B    9.5     3        112.5 2.666667 1.623798
3       C   17.5     3         94.5 2.666667 1.488235
> step3
# A tibble: 1 x 5
      A ybar_strat S_ybar_strat   tot   S_tot
  <dbl>      <dbl>        <dbl> <dbl>   <dbl>
1    70   13.14286    0.8800657   920 61.6046
(Revised answer to revised question)
In this case svytotal isn't what you want -- it's for the actual population total of the elements being sampled, and so doesn't make sense when the population is thought of as infinitely bigger than the sample. The whole survey package is really designed for discrete, finite populations, but we can work around it.
I think you want to get a mean for each stratum and then multiply it by the stratum weights. To do that,
D <- svydesign(id=~Cluster, strata=~Stratum, data=both)
means<- svyby(~Count, ~Stratum, svymean, design=D)
svycontrast(means, quote(10*A+20*B+40*C))
You'll get a warning
Warning message:
In vcov.svyby(stat) : Only diagonal elements of vcov() available
That's because svyby doesn't return covariances between the stratum means. It's harmless, because the strata really are independent samples (that's what stratification means) so the covariances are zero.
svytotal is doing what I think it should do here: weights are based on sampling probability, so they are only defined for sampling units. The svydesign call applied those weights to the clusters and (because this is cluster sampling) to the elements, giving the 530-fold higher total. You need to supply either observation weights or enough information for svydesign to calculate them itself. If this is cluster sampling with no subsampling, you can divide the stratum weight over the clusters to get the cluster weight and then divide this over the elements within a cluster to get the observation weight. Or, if the stratum weight is the number of clusters in the population, you can use the fpc argument to svydesign.
The fact that the SE doesn't scale the same way as the point estimate is because the population size is unknown and has to be estimated. The mean is the estimated total divided by the estimated population size, and the SE estimate takes account of the variance of the denominator and its covariance with the numerator.
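The "divide the stratum weight over the clusters, then over the elements" step can be sketched in base R as follows. This is an illustration under the stated no-subsampling assumption, using the first example's data; the column names n_clusters, m_elements, and obs_w are made up here.

```r
counts <- data.frame(
  Stratum = rep(c("A", "B", "C"), c(5, 8, 8)),
  Cluster = rep(1:8, c(3, 2, 3, 2, 3, 2, 3, 3)),
  Count   = 1:21
)
weights <- data.frame(Stratum = c("A", "B", "C"), W = c(10, 20, 40))
both <- merge(counts, weights)

# Number of sampled clusters in each stratum, repeated on every row
both$n_clusters <- ave(both$Cluster, both$Stratum,
                       FUN = function(x) length(unique(x)))

# Number of elements in each cluster, repeated on every row
both$m_elements <- ave(both$Count, both$Cluster, FUN = length)

# Stratum weight shared equally by clusters, then by elements in a cluster
both$obs_w <- both$W / both$n_clusters / both$m_elements

tapply(both$obs_w, both$Stratum, sum)  # recovers the stratum weights 10, 20, 40
sum(both$obs_w)                        # 70, the sum of the stratum weights
```

The resulting column could then be passed to svydesign() through weights = ~obs_w in place of ~W, so that the weights sum to 70 rather than 530.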

How to calculate multivariate normal distribution function in R

Here's what I tried, making use of the mvtnorm package
Sample Dataset
library(mvtnorm)
set.seed(2357)
df <- data.frame(
  x = rnorm(1000, mean=80, sd=20),
  y = rnorm(1000, mean=0, sd=5),
  z = rnorm(1000, mean=0, sd=5)
)
head(df)
      x      y       z
1 70.38  1.307  0.2005
2 59.76  5.781 -3.5095
3 54.14 -1.313 -1.9022
4 79.91  7.754 -6.2076
5 87.07  1.389  1.1065
6 75.89  1.684  6.2979
Fit multivariate normal dist and check P(x <= 80) ~ 0.5
# Get the dimension means and correlation matrix
means <- c(x=mean(df$x), y=mean(df$y), z=mean(df$z))
corr <- cor(df)
# Check P(x <= 80)
sum(df$x <= 80)/nrow(df) # 0.498
pmvnorm(lower=-Inf, upper=c(80, Inf, Inf), mean=means, corr=corr) # 0.8232
Why is the fitted result 0.82? Where did I go wrong?
First, you don't need to simulate anything to study the pmvnorm function:
pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=c(80,0,0), corr=diag(rep(1,3)))
The result is 0.5, as you expected.
Your means vector is approximately (79, 0, 0), so let's try it:
pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=c(79,0,0), corr=diag(rep(1,3)))
The result now is 0.8413447. There's nothing the matter. By specifying only the correlation matrix, you told the software to assume that all variances were unity. In your simulation, the variances were 400, 25, and 25: very different from what you specified in the arguments!
The correct calculation uses the covariance matrix of the data, not its correlation matrix:
pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=means, sigma=cov(df))
The result is 0.5178412, quite in keeping with the data.
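Because the upper limits for y and z are Inf, the joint probability reduces to the marginal probability for x, so the effect of the variance can be checked with pnorm alone. This is a small illustrative check using the round values from the answer (mean 79 or 80, standard deviation 1 or 20), not the simulated data.

```r
# Unit variance (what corr= implies) with the true mean of 80
p_unit_80 <- pnorm(80, mean = 80, sd = 1)

# Unit variance with a mean of about 79: 80 is one full sd above the mean
p_unit_79 <- pnorm(80, mean = 79, sd = 1)

# The simulated sd of 20: 80 is only 0.05 sd above the mean
p_sd20_79 <- pnorm(80, mean = 79, sd = 20)

c(p_unit_80, p_unit_79, p_sd20_79)
```

The second value reproduces the 0.8413447 quoted above, and the third is close to 0.5, which is why supplying the covariance matrix through sigma brings the result back in line with the data.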

Resources