NB: this question has been cross-posted as this GitHub issue.
I am estimating sample size based on the data from a previous experiment that included 40 participants.
I used simr::powerCurve for several sample sizes smaller than the original one:
pc <- powerCurve(fit = model, nsim = 100, alpha = 0.02,
                 breaks = c(10, 20, 30, 40), along = 'subject_id')
The results are identical for all sizes and are close to 100%. I assume this is due to the simulated sample size being smaller than the original one.
Is there a way to estimate power for sample sizes smaller than the one used to fit the model?
Here is a reproducible example using synthetic data (code taken from https://humburg.github.io/Power-Analysis/simr_power_analysis.html and adapted slightly):
library(simr)
subj <- factor(1:40)
class_id <- letters[1:5]
time <- 0:2
group <- c("control", "intervention")
subj_full <- rep(subj, 15)
class_full <- rep(rep(class_id, each=10), 3)
time_full <- rep(time, each=50)
group_full <- rep(rep(group, each=5), 15)
covars <- data.frame(id=subj_full, class=class_full, treat=group_full, time=factor(time_full))
## Intercept and effects for intervention, time1, and time2
fixed <- c(5, 2, 0.1, 0.2)
## Random intercepts for participants clustered by class
rand <- list(0.5, 0.1)
## residual standard deviation (sigma)
res <- 2
model <- makeLmer(y ~ treat + time + (1|class/id), fixef=fixed, VarCorr=rand, sigma=res, data=covars)
pc <- powerCurve(model, test = fixed('treat'), nsim = 100, along = 'id', breaks = c(10, 20, 30, 40))
print(pc)
The output is
Power for predictor 'treat', (95% confidence interval),
by number of levels in id:
10: 100.0% (96.38, 100.0) - 150 rows
20: 100.0% (96.38, 100.0) - 300 rows
30: 100.0% (96.38, 100.0) - 450 rows
40: 100.0% (96.38, 100.0) - 600 rows
Time elapsed: 0 h 0 m 55 s
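One possibility worth noting (a sketch, not a verified answer): with fixed effects this large relative to the residual noise, power saturates at 100% for every break, which would make the curve flat regardless of sample size. simr lets you shrink an assumed effect before rerunning the curve; the coefficient name below assumes lme4's default dummy coding.

# Sketch: reduce the assumed treatment effect so power no longer saturates.
# "treatintervention" is the assumed coefficient name under default dummy coding.
fixef(model)["treatintervention"] <- 0.5
pc_small <- powerCurve(model, test = fixed('treat'), nsim = 100,
                       along = 'id', breaks = c(10, 20, 30, 40))
print(pc_small)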
I am trying to simulate how replacement/reassignment of values in random samples affects the predictive performance conveyed by AUC.
I have a tumor classification in a data frame, denoted df$who, with levels 1, 2, and 3 corresponding to the severity of the tumor lesion.
Intro to the question
Let's say the baseline data look like this:
set.seed(1)
df <- data.frame(
  who = as.factor(sample(1:3, size = 6000, replace = TRUE, prob = c(0.8, 0.15, 0.05))),
  age = round(runif(n = 6000, min = 18, max = 95), digits = 1),
  gender = sample(c("m", "f"), size = 6000, replace = TRUE, prob = c(1/3, 2/3)),
  event.time = runif(n = 6000, min = 8, max = 120),
  event = as.factor(sample(0:2, size = 6000, replace = TRUE, prob = c(0.25, 0.2, 0.55)))
)
And a standard cause-specific Cox regression looks like:
library(survival)
a_baseline <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df, x = TRUE)
AUC can then be obtained from it as a measure of predictive performance; here, via leave-one-out bootstrap for the 5-year prediction of df$event == 1.
library(riskRegression)
u <- Score(list("baseline" = a_baseline),
           Surv(event.time, event == 1) ~ 1,
           data = df,
           times = 60,
           plots = "cal",
           B = 50,
           split.method = "loob",
           metrics = c("auc", "brier")
)
# The AUC is then obtained
u$AUC$score$AUC[2]
Question
I want to simulate how re-classifying a random 5% of df$who == 1 to df$who == 2 affects the 5-year prediction of df$event == 1.
I want to create 10 separate simulated versions of the baseline data df, each containing a random reallocation of 5% of df$who == 1 to df$who == 2. Then I want to use each of these 10 simulated datasets to predict the 5-year risk of df$event == 1.
I have tried a for loop for this. The expected output is a data frame that tells me which of the 10 simulated datasets yielded the highest and lowest u$AUC$score$AUC[2] (i.e., the best and worst prediction).
I am new to for loops, but here is my attempt (which obviously did not work).
all_auc <- data.frame() ## create a data frame to collect the AUC from all 10 simulated sub-datasets
for(i in 1:10){ # 1:10 represent the simulated datasets from 1 to 10
  df[i] <- df # allocating baseline data to each of the 10 datasets
  df[i]$who[sample(which(df[i]$who==1), round(0.05*length(which(df[i]$who==1))))]=2 # create the random 5% allocation of who==1 to who==2 in the i-th simulated dataset
  ith_cox <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = df[i], x = TRUE) # create the i-th Cox regression based on the i-th dataset
  # create the predictions based on the i-th Cox
  u[i] <- Score(list("baseline" = ith_cox),
                Surv(event.time, event == 1) ~ 1,
                data = df[i],
                times = 60,
                plots = "cal",
                B = 50,
                split.method = "loob",
                metrics = c("auc", "brier")
  )
  # summarize all AUC from all 10 sub-datasets
  all_auc <- u[i]$AUC$score$AUC[2]
}
(1) I could not get this for loop to work as described, and
(2) the final data frame all_auc should indicate only which of the 10 datasets yielded the worst and best predictions (I will then use those two datasets for further analysis).
A final note
This is only a reproducible example; in our analysis the for loop will be applied to 10,000 simulated datasets. I do not know whether this affects the answer, but it illustrates the importance of the result: a data frame (or vector?) that simply tells me which simulated dataset yielded the best vs. worst predictions, so that I can subsequently use those two data frames, e.g. df2930 and df8939, for further analysis.
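For reference, a minimal sketch of what the intended loop could look like, assuming the simulated datasets are kept in a list so they remain available afterwards; the names n_sim and sim_data are illustrative, and the AUC extraction mirrors the question's u$AUC$score$AUC[2]:

library(survival)
library(riskRegression)

n_sim <- 10                        # 10 here; 10,000 in the real analysis
all_auc <- numeric(n_sim)          # one AUC per simulated dataset
sim_data <- vector("list", n_sim)  # keep each dataset for later use

for (i in seq_len(n_sim)) {
  d <- df                                            # copy the baseline data
  idx1 <- which(d$who == 1)                          # rows eligible for reclassification
  flip <- sample(idx1, round(0.05 * length(idx1)))   # random 5% of who == 1
  d$who[flip] <- "2"                                 # reassign them to level 2
  sim_data[[i]] <- d

  fit <- coxph(Surv(event.time, event == 1) ~ who + age + gender, data = d, x = TRUE)
  sc <- Score(list("model" = fit),
              Surv(event.time, event == 1) ~ 1,
              data = d, times = 60, B = 50,
              split.method = "loob", metrics = c("auc", "brier"))
  all_auc[i] <- sc$AUC$score$AUC[2]
}

best <- which.max(all_auc)   # index of the best-performing dataset
worst <- which.min(all_auc)  # index of the worst-performing dataset
sim_data[[best]]             # e.g. keep these two for further analysis
sim_data[[worst]]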
I need to do a linear mixed model simulation to get power for varying sample sizes.
My model is:
Ratings = y
Fixed effect, x = Ring
Random effect = participants
The code I tried is below. It only returns 'Based on 100 simulations, (0 warnings, 100 errors), alpha = 0.05, nrow = 2000'.
Thank you!!
#create a dataframe
library(lmerTest)
library(simr)
library(tidyverse)
Ring = c('Ring', 'NoRing')
Ring = rep(Ring, times = 1000)
attractiveness = floor(runif(10, min=1, max=11)) #this creates random numbers
#from 1 to 10 (11 is not included).
participants<-rep(factor(1:100),each=20)
targetID = rep(c(1,2,3,4,5,6,7,8,9,10), each= 2)
targetImage= rep(targetID, times= 100)
Ratings = rep(attractiveness, times = 200)
data<-data.frame(participants, Ring, targetImage, Ratings)
#parameters for the model:
## Intercept and slopes for ring
fixed <- c(3, 0.5)
## Random intercepts for participants
rand <- 0.5
## residual standard deviation (sigma)
res <- 2
model <- makeLmer(Ratings ~ Ring + (1|participants), fixef=fixed,
VarCorr=rand, sigma=res, data=data)
sim_treat <- powerSim(model, nsim=100, test = fcompare(Ratings~Ring))
sim_treat
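For reference, a hedged sketch of one possible culprit: the reduced model passed to fcompare() still contains Ring, so the full and reduced fixed-effect structures are identical and every simulated comparison fails. Two test specifications that avoid this (a sketch, not verified against this exact setup):

# Test the Ring effect directly as a fixed effect
sim_fixed <- powerSim(model, nsim = 100, test = fixed('Ring'))

# Or compare against a reduced model that drops Ring (intercept only)
sim_lrt <- powerSim(model, nsim = 100, test = fcompare(Ratings ~ 1))
sim_fixed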
I need to simulate an AR(2) process Y[t] = 1/20 + (sqrt(3)/2) Y[t-1] - (1/4) Y[t-2] + e[t], where e[t] ~ N(0, 0.02^2).
The simulation has to cover 30 years with the model measured in quarters (120 observations).
I've tried with x <- arima.sim(model = list(order = c(2, 0, 0), ar = c(a1, a2)), n = 120, n.start = 100, sd = 0.02)
Using the above, R says the model isn't stationary.
Here a1 and a2 are equal to phi 1 and phi 2 in the model, but I can't figure out how to add phi 0, or how to set the required starting values y[0] = 0.1 and y[-1] = 0.12.
I've also tried the following
set.seed(9029) # set a seed to fix the simulated numbers
nsim = 1 # no. of simulations
burn = 100 # burn-in periods
n = 220 # sample length + burn-in periods --> sample length = 4quarters*30yrs
tp=(burn+1):n # time points to be sampled
sigerr = 0.02 # error s.d.
a1 = (sqrt(3)/2) # AR(2) coefficient
a2 = 0.25 # AR(2) coefficient = 1/4
a0 = 1/20 # Phi 0
# create data series and error series
y = array(0,c(n,nsim)) # data series
err = array(rnorm(n*nsim,0,sigerr),c(n,nsim)) # iid errors
# simulate y from an AR(2) process
for (k in 1:nsim) {
  for (i in 2:n) {
    y[i,k] = a0 + a1*y[i-1,k] + a2*y[i-2,k] + err[i,k]
  }
}
But I keep getting 'replacement has length zero' as an error, and I still can't figure out how to set y[0] and y[-1] to 0.1 and 0.12, respectively. Please help; I can't seem to find a fix. Thanks.
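For reference, a sketch of a direct simulation that addresses all three issues: the loop must start at i = 3 so that y[i-2] exists (the source of the 'replacement has length zero' error), the second coefficient enters with a minus sign (a2 = -1/4, which is also why arima.sim rejected ar = c(sqrt(3)/2, 1/4) as non-stationary), and the two starting values can simply be prepended to the series:

set.seed(9029)
n <- 4 * 30                     # 30 years of quarterly data
a0 <- 1/20                      # phi 0 (intercept)
a1 <- sqrt(3)/2                 # phi 1
a2 <- -1/4                      # phi 2: note the minus sign from the model
sigerr <- 0.02                  # error s.d.

y <- c(0.12, 0.1, numeric(n))   # y[-1] = 0.12 and y[0] = 0.1 prepended as starting values
err <- rnorm(n, 0, sigerr)      # iid errors
for (i in 3:(n + 2)) {
  y[i] <- a0 + a1 * y[i - 1] + a2 * y[i - 2] + err[i - 2]
}
y <- tail(y, n)                 # drop the two starting values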
Here's what I tried, making use of the mvtnorm package
Sample Dataset
library(mvtnorm)
set.seed(2357)
df <- data.frame(
  x = rnorm(1000, mean=80, sd=20),
  y = rnorm(1000, mean=0, sd=5),
  z = rnorm(1000, mean=0, sd=5)
)
head(df)
x y z
1 70.38 1.307 0.2005
2 59.76 5.781 -3.5095
3 54.14 -1.313 -1.9022
4 79.91 7.754 -6.2076
5 87.07 1.389 1.1065
6 75.89 1.684 6.2979
Fit a multivariate normal distribution and check that P(x <= 80) ≈ 0.5
# Get the dimension means and correlation matrix
means <- c(x=mean(df$x), y=mean(df$y), z=mean(df$z))
corr <- cor(df)
# Check P(x <= 80)
sum(df$x <= 80)/nrow(df) # 0.498
pmvnorm(lower=-Inf, upper=c(80, Inf, Inf), mean=means, corr=corr) # 0.8232
Why is the fitted result 0.82? Where did I go wrong?
First, you don't need to simulate anything to study the pmvnorm function:
pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=c(80,0,0), corr=diag(rep(1,3)))
The result is 0.5, as you expected.
Your means vector is approximately (79, 0, 0), so let's try it:
pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=c(79,0,0), corr=diag(rep(1,3)))
The result now is 0.8413447. There's nothing the matter. By specifying only the correlation matrix, you told the software to assume that all variances were unity. In your simulation, the variances were 400, 25, and 25: very different from what you specified in the arguments!
The correct calculation uses the covariance matrix of the data, not its correlation matrix:
pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=means, sigma=cov(df))
The result is 0.5178412, quite in keeping with the data.
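Alternatively, if you do want to keep the correlation matrix, standardize the bound instead; this gives essentially the same answer (a quick sketch using the same df):

# P(x <= 80) = P(z <= (80 - mean(x)) / sd(x)) after standardizing each margin
z_upper <- c((80 - mean(df$x)) / sd(df$x), Inf, Inf)
pmvnorm(lower = rep(-Inf, 3), upper = z_upper, mean = rep(0, 3), corr = cor(df))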
Here is the problem: Five observations on Y are to be taken when X = 4, 8, 12, 16, 20, respectively. The true regression function is E(Y) = 20 + 4X, and the ei are independent N(0, 25).
Generate five normal random numbers with mean 0 and variance 25. Consider these random numbers as the error terms for the five Y observations at X = 4, 8, 12, 16, 20 and calculate Y1, Y2, Y3, Y4, and Y5. Obtain the least squares estimates b0 and b1 when fitting a straight line to the five cases. Also calculate Yh when Xh = 10 and obtain a 95 percent confidence interval for E(Yh) when Xh = 10. I did part 1, but I need help repeating it 200 times.
Repeat part (1) 200 times, generating new random numbers each time.
Make a frequency distribution of the 200 estimates b1. Calculate the mean and standard deviation of the 200 estimates b1. Are the results consistent with theoretical expectations?
What proportion of the 200 confidence intervals for E(Yh) when Xh = 10 include E(Yh)? Is this result consistent with theoretical expectations?
Here's my code so far; I am stumped on how to repeat part 1 200 times:
X <- matrix(c(4, 8, 12, 16, 20), nrow = 5, ncol = 1)
e <- matrix(c(rnorm(5,0,sqrt(5))), nrow = 5, ncol = 1)
Y <- 20 + 4 * X + e
mydata <- data.frame(cbind(Y=Y, X=X, e=e))
names(mydata) <- c("Y","X","e")
reg<-lm(Y ~ X, data = mydata)
predict(reg, newdata = data.frame(X=10), interval="confidence")
There is a mistake in your code. You want independent N(0, 25) errors, but you passed sqrt(5) as the standard deviation to rnorm(). It should be 5.
We first wrap your code into a function. This function takes no input; it runs the experiment once and returns the regression coefficients b0 and b1 and the prediction fit, lwr, upr in a named vector.
sim <- function() {
  x <- c(4, 8, 12, 16, 20)
  y <- 20 + 4 * x + rnorm(5, 0, 5)
  fit <- lm(y ~ x)
  pred <- predict(fit, data.frame(x = 10), interval = "confidence")
  pred <- setNames(c(pred), dimnames(pred)[[2]])
  ## return simulation result
  c(coef(fit), pred)
}
For example, let's try
set.seed(2016)
sim()
#(Intercept) x fit lwr upr
# 24.222348 3.442742 58.649773 47.522309 69.777236
Now we use replicate to repeat the experiment 200 times.
set.seed(0)
z <- t(replicate(200, sim()))
head(z)
# (Intercept) x fit lwr upr
#[1,] 24.100535 3.987755 63.97808 57.61262 70.34354
#[2,] 6.417639 5.101501 57.43265 52.44263 62.42267
#[3,] 20.652355 3.797991 58.63227 52.74861 64.51593
#[4,] 20.349829 3.816426 58.51409 52.59115 64.43702
#[5,] 19.891873 4.095140 60.84327 57.49911 64.18742
#[6,] 24.586749 3.589483 60.48158 53.64574 67.31743
There will be 200 rows, one for each of the 200 simulations.
The second column contains the estimates of b1 from the 200 simulations; we compute their mean and standard deviation:
mean(z[,2])
# [1] 3.976249
sd(z[,2])
# [1] 0.4263377
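For the frequency distribution requested in the question, a histogram of the second column is enough (a quick sketch):

hist(z[, 2], breaks = 20, main = "200 estimates of b1", xlab = "slope estimate")
abline(v = 4, lwd = 2)  # true slope, for reference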
We know that the true value is 4, so our estimates are clearly consistent with it.
Finally, let's check the 95% confidence intervals for the prediction at X = 10. The true value is 20 + 4 * 10 = 60, so the proportion of confidence intervals that cover this true value is
mean(z[, "lwr"] < 60 & z[, "upr"] > 60)
## 0.95
which is exactly 0.95.