Plotting a graph with sample sizes and power estimates - r

I have simulated a linear model 1000 times using randomly generated height and weight values, and randomly assigned each participant to a treatment or non-treatment group (a factor coded 1 and 0). Let's say the model was:
lm(bmi~height + weight + treatment, data = df)
I am now struggling with the following:
The model now needs to cycle through sample sizes from 300 to 500 in steps of 10, with 1000 replications at each size, and store the proportion of simulated experiments with p-values below 0.05, in order to estimate the power to detect a change of 0.5 in bmi between the two treatment groups at the 5% significance level.
After doing the above, I need to create a figure with the sample sizes on the x-axis and the estimated power on the y-axis, and highlight in a distinct colour the smallest sample size that achieves an 80% power estimate.
Any ideas how and where to go from here?
Thanks,
Chris

I would do it something like this:
library(dplyr)
library(ggplot2)
# first, encapsulate the steps required to generate one sample of data
# at a given sample size, run the model, and extract the treatment p-value
do_simulate <- function(n) {
  # use the assumed data-generating process to simulate data and add error
  data <- tibble(height = rnorm(n, 69, 0.1),
                 weight = rnorm(n, 197.8, 1.9),
                 treatment = sample(c(0, 1), n, replace = TRUE),
                 error = rnorm(n, sd = 1.75),
                 bmi = 703 * weight / height^2 + 0.5 * treatment + error)
  # model the data
  mdl <- lm(bmi ~ height + weight + treatment, data = data)
  # extract the p-value for the treatment effect
  summary(mdl)[["coefficients"]]["treatment", "Pr(>|t|)"]
}
# second, wrap that single simulation in replicate() so that you can run
# many simulations at a given sample size and estimate power as the
# proportion of simulations that achieve a significant p-value
simulate_power <- function(n, alpha = 0.05, r = 1000) {
  p_values <- replicate(r, do_simulate(n))
  power <- mean(p_values < alpha)
  return(c(n, power))
}
# third, estimate power at each of your desired
# sample sizes and restructure that data for ggplot
mx <- vapply(seq(300, 500, 10), simulate_power, numeric(2))
plot_data <- tibble(n = mx[1, ],
                    power = mx[2, ])
# fourth, make a note of the minimum sample size that achieves your desired power
plot_data %>%
  filter(power >= 0.80) %>%
  slice_min(order_by = n, n = 1) %>%
  pull(n) -> min_n
# finally, construct the plot
ggplot(plot_data, aes(x = n, y = power)) +
  geom_smooth(method = "loess", se = FALSE) +
  geom_vline(xintercept = min_n)
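The question also asked for the smallest qualifying sample size to be shown in a distinct colour, whereas geom_vline only draws a line. One option, a sketch building on the plot_data and min_n objects defined above, is to layer a highlighted point on top:
# highlight the first sample size that reaches 80% power in red
ggplot(plot_data, aes(x = n, y = power)) +
  geom_point(colour = "grey40") +
  geom_smooth(method = "loess", se = FALSE) +
  geom_hline(yintercept = 0.80, linetype = "dashed") +
  geom_point(data = filter(plot_data, n == min_n),
             colour = "red", size = 3)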

Related

create sequence of predictor values to generate posterior predictions of simultaneous change in predictors

I am trying to create a data frame (using either tidyr::expand_grid or tibble::tibble) in order to generate posterior predictions with the tidybayes::epred_draws function from tidybayes (akin to posterior_predict). I have three continuous predictors that I would like to vary simultaneously at three set values: one standard deviation below the mean of each predictor, the mean of each predictor, and one standard deviation above the mean of each predictor. The issue I am running into is that I cannot figure out a way to generate values in between those set points while keeping the structure of the dataset intact.
I created a reproducible example below; as you can see, the final posterior prediction doesn't look great. Is there any way to generate additional incremental values between the -1 SD and +1 SD points?
My go-to method would be either seq() or
modelr::seq_range(data_var_1, pretty = TRUE, n = 100), but I'm not sure how to incorporate that into the new dataset in a way that lets me see what happens when the predictors all shift simultaneously.
Let me know if I can explain anything else.
library(brms)
library(tidybayes)
library(tibble)
library(ggplot2)
library(ggthemes)
## create a dataset
data <- tibble(
  outcome = rnorm(100, 2, 2),
  var_1 = rnorm(100, 5, 2),
  var_2 = rnorm(100, 8, 2),
  var_3 = rnorm(100, 10, 2)
)
## model the data
m1 <- brms::brm(outcome ~ var_1 + var_2 + var_3, data) # run model (takes a few sec.)
## prepare for predictions with set values (note: the means and SDs must be
## taken from the columns of `data`, which are not attached to the search path)
new_data <- tibble(
  var_1 = mean(data$var_1) + c(-1, 0, 1) * sd(data$var_1),
  var_2 = mean(data$var_2) + c(-1, 0, 1) * sd(data$var_2),
  var_3 = mean(data$var_3) + c(-1, 0, 1) * sd(data$var_3))
pred_1 <- m1 %>%
  tidybayes::epred_draws(new_data)
# generate grand mean posterior predictions (for more on this,
# see: https://www.andrewheiss.com/blog/2021/11/10/ame-bayes-re-guide/)
plot_1 <- ggplot(pred_1, aes(x = var_1, y = .epred)) +
  stat_lineribbon() +
  scale_fill_brewer(palette = "Reds") +
  labs(x = "Shifts in Var 1, 2, and 3", y = "Outcome",
       fill = "Credible interval") +
  ggthemes::theme_pander() +
  theme(legend.position = "bottom") +
  scale_x_continuous(limits = c(new_data$var_1[1], new_data$var_1[3]),
                     breaks = c(new_data$var_1[1],
                                new_data$var_1[2],
                                new_data$var_1[3]),
                     labels = c("-1 SD", "Mean", "+1 SD"))
# visualize posterior predictions (example isn't so pretty, sorry)
plot_1
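This question appears unanswered here, but a minimal sketch of the seq() idea mentioned above, assuming the intent is for all three predictors to move together along a shared standardized index from -1 SD to +1 SD, might look like this:
# build a fine grid where all predictors shift simultaneously
idx <- seq(-1, 1, length.out = 100)
new_data_fine <- tibble(
  var_1 = mean(data$var_1) + idx * sd(data$var_1),
  var_2 = mean(data$var_2) + idx * sd(data$var_2),
  var_3 = mean(data$var_3) + idx * sd(data$var_3)
)
pred_fine <- tidybayes::epred_draws(m1, new_data_fine)
# pred_fine can then be fed into the same stat_lineribbon() plot as above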

Combining LOESS and quantreg to calculate percentiles/quantiles for data

I am trying to calculate percentiles or quantiles for data that scatter considerably.
Using the loess function, the mean is nicely estimated; however, I cannot get percentiles/quantiles from this function.
I tried to combine quantreg with loess, but the resulting plot shows linear curves instead of loess-smoothed curves.
I would like to get a result similar to this:
data(cars)
plot(cars)
lmodel <- loess(cars$dist ~ cars$speed, span = 0.3, degree = 1)
lpred <- predict(lmodel, newdata = 5:25, se = TRUE)
lines(5:25, lpred$fit, col = '#000066', lwd = 4)
lines(5:25, lpred$fit - qt(0.975, lpred$df) * lpred$se.fit, lty = 2)
lines(5:25, lpred$fit + qt(0.975, lpred$df) * lpred$se.fit, lty = 2)
#### combination of quantreg with loess (the attempt that yields linear curves)
library(quantreg)
plot(cars$speed, cars$dist)
xx <- seq(min(cars$speed), max(cars$speed), 1)
taus <- c(0.1, 0.25, 0.5, 0.75, 0.9)
f <- coef(rq(loess(cars$dist ~ cars$speed, span = 0.3, degree = 1), tau = taus))
yy <- cbind(1, xx) %*% f
for (i in 1:length(taus)) {
  lines(xx, yy[, i], col = "gray")
}
I also tried the suggested code; however, I could not change the smoothing settings, and the lines followed a wavy path.
library(quantreg)
data(cars)
taus <- c(0.1, 0.25, 0.5, 0.75, 0.9)
lmodel <- loess(dist ~ speed, data = cars, span = 0.9, degree = 1)
rqmodel <- rq(lmodel, tau = taus, data = cars)
f <- coef(rqmodel)
xx <- seq(min(cars$speed), max(cars$speed), length.out = nrow(cars))
yy <- predict(rqmodel)
plot(cars)
matlines(xx, yy, col = "grey", lwd = 3)
The loess function does not provide quantiles the way rq does.
However, loess produces a curve without zigzag.
Please see the code snippet below. What settings for tau = 0.5 in the rq function would reproduce the result of the loess function?
data(cars)
lmodel <- loess(dist ~ speed, data = cars, span = 0.9)
plot(cars)
lines(x = 4:25, y = predict(lmodel, newdata = data.frame(speed = 4:25)), col = "blue")
I believe the code in the question is mixing loess and quantile regression; they are different methods, and the latter does not require the former.
I will fit both and plot the respective results.
In the code below I use matlines rather than a for loop.
These lines are common to both approaches:
library(quantreg)
data(cars)
xx <- seq(min(cars$speed), max(cars$speed), length.out = nrow(cars))
First, the loess model:
lmodel <- loess(dist ~ speed, data = cars, span = 0.5, degree = 1)
ls_yy <- predict(lmodel, se = TRUE)
ls_yy <- cbind(ls_yy$fit,
               ls_yy$fit - 2 * ls_yy$se.fit,
               ls_yy$fit + 2 * ls_yy$se.fit)
plot(cars)
matlines(xx, ls_yy, col = "darkgrey")
Now, the quantile regression:
taus <- c(0.1, 0.25, 0.5, 0.75, 0.9)
rqmodel <- rq(dist ~ speed, tau = taus, data = cars)
rq_yy <- predict(rqmodel)
plot(cars)
matlines(xx, rq_yy, col = "darkgrey")
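If the goal is smooth, nonlinear quantile curves (closer to the loess look), one option, an addition of mine rather than part of the original answer, is to fit the quantile regression on a spline basis:
library(splines)
# quantile regression on a B-spline basis: one smooth curve per tau
rq_spline <- rq(dist ~ bs(speed, df = 4), tau = taus, data = cars)
plot(cars)
matlines(cars$speed, predict(rq_spline), col = "grey", lty = 1)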
The code below (taken from an "answer") is not correct and should not be included in a solution. It produces a 95% confidence interval for the fitted mean, i.e. an interval that covers the true trend line with 95% probability; it does not correspond to a quantile of the data within the span of the moving average. A normal-based approximation of those quantiles would require multiplying ls_yy$se.fit by sqrt(n_i), where n_i is the number of observations within the particular span. Unfortunately, loess does not return n_i, so this is not a tenable approach unless the span covers the entire dataset (so that n_i can be set to n) and there is no heteroskedasticity.
data(cars)
plot(cars)
lmodel <- loess(dist ~ speed, data = cars, span = 0.5, degree = 1)
ls_yy <- predict(lmodel, se = TRUE)
# wrong - these are confidence bands, not quantiles of the input data:
ls_yy <- cbind(ls_yy$fit,
               ls_yy$fit - 2 * ls_yy$se.fit,
               ls_yy$fit + 2 * ls_yy$se.fit)
xx <- seq(min(cars$speed), max(cars$speed), length.out = nrow(cars))
plot(cars)
matlines(xx, ls_yy, col = "darkgrey")
We can make this more obvious using sample datasets with more observations. Samples 1 and 2 are identical apart from their sample sizes (500 and 1500 observations), so they should have very similar quantiles.
set.seed(1)
x1 <- runif(500, 0, 10)
y1 <- x1 + rnorm(length(x1))
x2 <- runif(1500, 0, 10)
y2 <- x2 + rnorm(length(x2))
dfpd <- data.frame(x = 1:9)
lmodel1 <- loess(y ~ x, data = data.frame(x = x1, y = y1), span = 0.5, degree = 1)
ls_yy1 <- predict(lmodel1, newdata = dfpd, se = TRUE)
lmodel2 <- loess(y ~ x, data = data.frame(x = x2, y = y2), span = 0.5, degree = 1)
ls_yy2 <- predict(lmodel2, newdata = dfpd, se = TRUE)
# the only difference between lmodel1 and lmodel2 is the number of observations;
# the quantiles should be very similar, but the se values are a function of
# sample size and are thus quite different
ls_yy1$se.fit
ls_yy2$se.fit
ls_yy1$se.fit / ls_yy2$se.fit
We can see that the se values differ by a factor of roughly sqrt(1500/500) ≈ 1.7 (equivalently, one is about 60% of the other), which confirms that they scale with sample size and cannot be used as-is for quantile calculations.
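Finally, to address the asker's specific wish for a smooth tau = 0.5 curve from quantreg itself, here is a sketch I am adding (the lambda value is an arbitrary choice): rqss() fits a total-variation penalized quantile smoother, where lambda plays a role loosely analogous to the loess span.
library(quantreg)
data(cars)
# smooth median (tau = 0.5) fit; increase lambda for a smoother curve
fmodel <- rqss(dist ~ qss(speed, lambda = 5), tau = 0.5, data = cars)
sx <- seq(min(cars$speed), max(cars$speed), length.out = 100)
plot(cars)
lines(sx, predict(fmodel, newdata = data.frame(speed = sx)), col = "blue")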

Converting data to percentage rank

I have data whose mean and variance change as a function of the independent variable. How do I convert the dependent variable into (estimated) conditional percentage ranks?
For example, say the data looks like Z below:
library(dplyr)
library(ggplot2)
data.frame(x = runif(1000, 0, 5)) %>%
  mutate(y = sin(x) + rnorm(n()) * cos(x) / 3) ->
  Z
we can plot it with Z %>% ggplot(aes(x, y)) + geom_point(): it looks like a dispersed sine function, where the variance around the sine curve varies with x. My goal is to convert each y value into a number between 0 and 1 representing its percentage rank among values with similar x. So values very close to the sine curve should be converted to about 0.5, while values below it should be converted to values closer to 0 (depending on the variance around that x).
One quick way to do this is to bucket the data and then simply compute the rank of each observation in each bucket.
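A sketch of that bucketing approach (the number of buckets, 10 here, is an arbitrary choice):
# cut x into intervals and take the percentage rank of y within each interval
Z %>%
  mutate(bucket = cut(x, breaks = 10)) %>%
  group_by(bucket) %>%
  mutate(pct_rank = percent_rank(y)) %>%
  ungroup() -> Z_bucketed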
Another way (which I think is preferable) to do what I ask is to perform a quantile regression for a number of different quantiles (tau):
library(quantreg)
library(splines)
model.fit <- rq(y ~ bs(x, df = 5), tau = (1:9)/10, data = Z)
which can be plotted as follows:
library(tidyr)
data.frame(x = seq(0, 5, len = 100)) %>%
  data.frame(., predict(model.fit, newdata = .), check.names = FALSE) %>%
  gather(Tau, y, -x) %>%
  ggplot(aes(x, y)) +
  geom_point(data = Z, size = 0.1) +
  geom_line(aes(color = Tau), size = 1)
Given model.fit I could now use the estimated quantiles for each x value to convert each y value into a percentage rank (with the help of approx(...)) but I suspect that package quantreg may do this more easily and better. Is there, in fact, some function in quantreg which automates this?
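I am not aware of a quantreg function that automates this final step, but a sketch of the approx()-based conversion described above might look like this (sorting each row of predicted quantiles guards against quantile crossing):
taus <- (1:9)/10
# conditional quantiles of y at each observation's x, one column per tau
qhat <- predict(model.fit, newdata = Z)
Z$pct_rank <- vapply(seq_len(nrow(Z)), function(i) {
  # rule = 2 clamps y values outside the 10th-90th quantile range to 0.1/0.9
  approx(x = sort(qhat[i, ]), y = taus, xout = Z$y[i], rule = 2)$y
}, numeric(1))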

LMS (Lambda-Mu-Sigma) method in R

I want to create percentile curves for my data using the LMS (Lambda-Mu-Sigma) method. I have the following example data. How can the 10th, 50th, and 90th percentile curves of yvar (on the y-axis) vs age (on the x-axis) be drawn using LMS?
age <- sample(5:75, 500, replace = TRUE)
yvar <- rnorm(500, age, 20)
mydata <- data.frame(age, yvar)
head(mydata)
#   age     yvar
# 1  61 87.16011
# 2  58 49.73289
# 3  65 15.60212
# 4  71 83.32699
# 5  33 40.89592
# 6  18 25.04376
plot(age, yvar)
I came across the VGAM package (http://www.inside-r.org/packages/cran/VGAM/docs/lms.bcn). Is that the best method to use? I could not really understand its example code well enough to create a simple percentile curve from the above data. Thanks for your help.
Simulate data (reproducibly):
set.seed(1001)
mydata <- data.frame(
  age = sample(5:75, 500, replace = TRUE))
mydata <- transform(mydata,
  yvar = rnorm(500, age, 20))
Since the LMS method appears to be based on variants of the Box-Cox transformation, which requires positive values, a simpler way to do this is to use quantile regression.
library("quantreg")
library("ggplot2"); theme_set(theme_bw())
g0 <- ggplot(mydata,aes(x=age,y=yvar))+geom_point()
g0 + geom_smooth(method="rq",tau=c(0.1),se=FALSE,lty=2)+
geom_smooth(method="rq",tau=c(0.5),se=FALSE)+
geom_smooth(method="rq",tau=c(0.9),se=FALSE,lty=2)
rq() by itself is able to fit all three percentiles at the same time, but you need the strategy suggested in this blog post to draw them more conveniently:
model.rq <- rq(yvar ~ age, mydata, tau = c(0.1, 0.5, 0.9))
quantile.regressions <- data.frame(t(coef(model.rq)))
colnames(quantile.regressions) <- c("intercept", "slope")
quantile.regressions$quantile <- rownames(quantile.regressions)
g0 + geom_abline(aes(intercept = intercept, slope = slope,
                     colour = quantile), show.legend = TRUE,
                 data = quantile.regressions)
Alternatively, it is possible to do this within VGAM, although I'm not sure whether it's what you want or whether the results make sense. The Yeo-Johnson transformation, via lms.yjn, allows you to do this even when some data values are negative; see ?lms.bcg and ?lms.bcn for alternatives that work for non-negative data.
library("VGAM")
fit <- vgam(yvar ~ s(age, df = 4), lms.yjn, data=mydata,
control=vgam.control(maxit=100),
trace=FALSE)
We get a warning message:
## Warning message:
## In vgam.fit(x = x, y = y, w = w, mf = mf, Xm2 = Xm2, Ym2 = Ym2, :
## convergence not obtained in 100 iterations
This might be because we're overfitting the data using a 4-knot spline model?
Quantile plot (following example("lms.yjn"))
par(bty = "l", mar = c(5, 4, 4, 3) + 0.1, xpd = TRUE)
qtplot(fit, percentiles = c(10, 50, 90),
       las = 1, ylab = "yvar", lwd = 2, lcol = 4)
This is a terrible hack, but if you want access to the raw values so you can plot the curves yourself:
pcurves <- qtplot.lmscreg(fit, show.plot = FALSE,
                          percentiles = c(10, 50, 90))
vals <- data.frame(age = mydata$age, pcurves$fitted.values)
vals <- vals[order(vals$age), ]
matplot(vals$age, vals[, -1], type = "l", lty = c(2, 1, 2), col = 1,
        xlab = "age", ylab = "")

R: Determine the threshold that maximally separates two groups based on a continuous variable?

Say I have 200 subjects, 100 in group A and 100 in group B, and for each I measure some continuous parameter.
require(ggplot2)
set.seed(100)
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep('A', 100), rep('B', 100))
data <- data.frame(value, group)
ggplot(data = data, aes(x = value)) +
  geom_histogram(aes(fill = group))
I would like to determine the value (threshold? breakpoint?) that maximizes separation and minimizes misclassification between the groups. Does such a function exist in R?
I've tried searching along the lines of "r breakpoint maximal separation between groups" and "r threshold minimize misclassification", but my google-fu seems to be off today.
EDIT:
Responding to @Thomas's comment, I have tried to fit the data using logistic regression and then solve for the threshold, but I haven't gotten very far.
lr <- glm(group ~ value)  # note: no family specified -- this turns out to be the mistake
coef(lr)
# (Intercept)       value
#   1.1857435  -0.0911762
So b0 = 1.1857435 and b1 = -0.0911762.
From Wikipedia, the logistic function is F(x) = 1/(1 + exp(-(b0 + b1*x))), and solving for x:
x = (log(F(x)/(1 - F(x))) - b0)/b1
But trying this in R, I get an obviously incorrect answer:
(log(0.5/(1 - 0.5)) - 1.1857435)/-0.0911762 # 13.00497
A simple approach is to write a function that calculates the accuracy given a threshold:
accuracy = Vectorize(function(th) mean(c("A", "B")[(value > th) + 1] == group))
Then find the maximum using optimize:
optimize(accuracy, c(min(value), max(value)), maximum=TRUE)
# $maximum
# [1] 8.050888
#
# $objective
# [1] 0.86
I've gotten the answer I need thanks to help from @Thomas and @BenBolker.
Summary
The problem with my attempt to solve this through logistic regression was that I hadn't specified family = binomial.
The dose.p() function in MASS will do the work for me, given a glm fit.
Code
# Include libraries
require(ggplot2)
require(MASS)
# Set seed
set.seed(100)
# Put together some dummy data
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep(0, 100), rep(1, 100))
data <- data.frame(value, group)
# Plot the distribution -- visually
# the answer appears to be between 7 and 8
ggplot(data = data, aes(x = value)) +
  geom_histogram(aes(fill = factor(group)))
# Fit a glm model, specifying the binomial distribution
my.glm <- glm(group ~ value, data = data, family = binomial)
b0 <- coef(my.glm)[[1]]
b1 <- coef(my.glm)[[2]]
# See what the probability function looks like
lr <- function(x, b0, b1) {
  prob <- 1 / (1 + exp(-1 * (b0 + b1 * x)))
  return(prob)
}
# The line appears to cross 0.5 just above 7.5
x <- 0:12
y <- lr(x, b0, b1)
lr.val <- data.frame(x, y)
ggplot(lr.val, aes(x = x, y = y)) +
  geom_line()
# The inverse of this function computes the threshold for a given probability
inv.lr <- function(p, b0, b1) {
  x <- (log(p / (1 - p)) - b0) / b1
  return(x)
}
# With the betas from this function, we get 7.686814
inv.lr(0.5, b0, b1)
# Or, feeding the glm model into dose.p from MASS, we get the same answer
dose.p(my.glm, p = 0.5)
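As a rough sanity check (my addition, not from the thread): with equal group sizes and roughly equal variances, the optimal cutoff should sit near the midpoint of the two group means used to simulate the data:
# midpoint of the two true means
(5 + 10) / 2  # 7.5, in line with the 7.686814 from inv.lr/dose.p above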
Thanks, everyone, for your help!