Setting hypotheses in R package SPRT

I am new to using the SPRT package in R to perform sequential probability ratio testing, and vignettes/tutorials for this package seem to be sparse.
By default, the SPRT() function can receive cumulative values of n and k (trials and events). I will be using this method on large studies where trials and events are tallied daily in a cumulative fashion, and I want to check my logic on how I have applied SPRT().
SPRT requires users to set explicit null and alternative hypotheses. I have set these to H_0: treat = control
H_1: treat = control * 1.01
In the for loop that follows, I apply the SPRT() function every day to compute the log-likelihood ratio of the cumulative data under each hypothesis, and I really just want to confirm that this is the correct way to analyze the data. Most examples I have seen set h0 and h1 explicitly (e.g., h0 = .85 and h1 = .85 * 1.01), whereas I have set them to reflect the observed rates for each day in the cumulative data (e.g., h0 = df_sprt$control[i]/df_sprt$n[i], h1 = (df_sprt$control[i] * MDE)/df_sprt$n[i]), which seems more appropriate in the setting of an experiment.
library(SPRT)
library(tidyverse)
# simulate cumulative data from an AB Test
set.seed(42)
DAYS <- 14
DAILY_N <- 1e3
BASERATE <- .85
MDE <- 1.02
df_sprt <-
  tibble(
    day = 1:DAYS,
    control = rbinom(n = DAYS, size = DAILY_N, prob = BASERATE),
    treat = rbinom(n = DAYS, size = DAILY_N, prob = BASERATE * MDE),
    n = DAILY_N
  ) %>%
  mutate(
    control = cumsum(control),
    treat = cumsum(treat),
    n = cumsum(n)
  )
# apply SPRT in a for loop
wald_a <- vector('numeric', length = nrow(df_sprt))
wald_b <- vector('numeric', length = nrow(df_sprt))
llr <- vector('numeric', length = nrow(df_sprt))
for (i in 1:nrow(df_sprt)) {
  out <- SPRT(
    distribution = "bernoulli",
    type1 = 0.05, type2 = 0.20,
    h0 = df_sprt$control[i] / df_sprt$n[i],
    h1 = (df_sprt$control[i] * MDE) / df_sprt$n[i],
    n = df_sprt$n[i],
    k = df_sprt$treat[i]
  )
  wald_a[i] <- out$wald.A
  wald_b[i] <- out$wald.B
  llr[i] <- out$llr
}
sprt_out <-
  tibble(
    llr,
    wald_a,
    wald_b,
    cohort_day = 1:DAYS
  )
# Plot the results
sprt_out %>%
  ggplot(aes(x = cohort_day, y = llr)) +
  geom_hline(
    yintercept = c(max(sprt_out$wald_a), max(sprt_out$wald_b)),
    color = c('darkgreen', 'red')
  ) +
  geom_point() +
  geom_line() +
  annotate(
    x = 10, y = max(sprt_out$wald_b),
    label = "Reject Alternative Hy & Retain Null Hy",
    vjust = -1, geom = "text", color = 'red'
  ) +
  annotate(
    x = 10, y = max(sprt_out$wald_a),
    label = "Reject Null Hy & Accept Alternative Hy",
    vjust = 1.5, geom = "text", color = 'darkgreen'
  ) +
  scale_y_continuous(breaks = -10:20) +
  scale_x_continuous(breaks = 1:20) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
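For reference, the quantities the loop asks SPRT() for can be computed by hand as a sanity check. This is a sketch based on the standard Wald formulas for Bernoulli data, not the package's internals, though SPRT() should agree up to rounding:

# Manual Bernoulli SPRT (textbook Wald formulas), for cross-checking SPRT()
manual_sprt <- function(n, k, h0, h1, type1 = 0.05, type2 = 0.20) {
  list(
    wald.A = log((1 - type2) / type1),  # upper boundary: crossing it accepts H1
    wald.B = log(type2 / (1 - type1)),  # lower boundary: crossing it accepts H0
    llr = k * log(h1 / h0) + (n - k) * log((1 - h1) / (1 - h0))
  )
}
# e.g., day 1 of the simulated data:
manual_sprt(
  n = df_sprt$n[1], k = df_sprt$treat[1],
  h0 = df_sprt$control[1] / df_sprt$n[1],
  h1 = (df_sprt$control[1] * MDE) / df_sprt$n[1]
)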

Related

Monte Carlo Sim in R plots STRAIGHTS

So I am getting started with Monte Carlo simulations and went with this basic code to simulate returns for a given portfolio. Somehow a portion of the simulated returns always results in straight lines, which are easy to see on the plotted graph. First I decreased the number of sims so you can see it more clearly, and I also played around with some other factors, but they keep showing up. The rest of the output looks promising and "random".
I've added a link to the image (my account is new) as well as the code; I appreciate any help!
library(quantmod)
library(ggplot2)
maxDate<- "2000-01-01"
tickers<-c("MSFT", "AAPL", "BRK-B")
getSymbols(tickers, from=maxDate)
Port.p<-na.omit(merge(Cl(AAPL),Cl(MSFT),Cl(`BRK-B`)))
Port.r<-ROC(Port.p, type = "discrete")[-1,]
stock_Price<- as.matrix(Port.p[,1:3])
stock_Returns <- as.matrix(Port.r[,1:3])
mc_rep = 50 # Number of Sims
training_days = 200
portfolio_Weights = c(0.5,0.3,0.2)
coVarMat = cov(stock_Returns)
miu = colMeans(stock_Returns)
Miu = matrix(rep(miu, training_days), nrow = 3)
portfolio_Returns_m = matrix(0, training_days, mc_rep)
set.seed(2000)
for (i in 1:mc_rep) {
  Z = matrix(rnorm(dim(stock_Returns)[2] * training_days), ncol = training_days)
  L = t(chol(coVarMat))
  daily_Returns = Miu + L %*% Z
  portfolio_Returns_200 = cumprod(portfolio_Weights %*% daily_Returns + 1)
  portfolio_Returns_m[,i] = portfolio_Returns_200
}
x_axis = rep(1:training_days, mc_rep)
y_axis = as.vector(portfolio_Returns_m-1)
plot_data = data.frame(x_axis, y_axis)
ggplot(data = plot_data, aes(x = x_axis, y = y_axis)) +
  geom_path(col = 'red', size = 0.1) +
  xlab('Days') + ylab('Portfolio Returns') +
  ggtitle('Simulated Portfolio Returns in 200 days') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
The lines are the 'return' from the end of each series to the beginning of the next. You can keep the lines separate by adding a grouping variable to your plotting data and using the group aesthetic to tell ggplot about it:
g <- rep(1:mc_rep, each = training_days) # one id per simulation: y_axis stacks portfolio_Returns_m column by column
plot_data = data.frame(x_axis, y_axis, g)
ggplot(data = plot_data, aes(x = x_axis, y = y_axis, group = g)) + ...
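For completeness, the full plotting call with the grouping applied (assembled from the code already in the question, nothing else changed):

ggplot(data = plot_data, aes(x = x_axis, y = y_axis, group = g)) +
  geom_path(col = 'red', size = 0.1) +
  xlab('Days') + ylab('Portfolio Returns') +
  ggtitle('Simulated Portfolio Returns in 200 days') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))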

How to generate covariate-adjusted cox survival/hazard functions?

I'm using the survminer package to try to generate survival and hazard function graphs for a longitudinal student-level dataset that has 5 subgroups of interest.
I've had success creating a model that shows the survival functions without adjusting for student-level covariates using ggsurvplot.
ggsurvplot(survfit(Surv(expectedgr, sped) ~ langstatus_new, data=mydata), pvalue=TRUE)
Output example
However, I cannot manage to get these curves adjusted for covariates. My aim is to create graphs like these. As you can see, these are covariate-adjusted survival curves according to some factor variable. Does anyone how such graphs can be obtained in R?
You want to obtain survival probabilities from a Cox model for certain values of some covariate of interest, while adjusting for other covariates. However, because we do not make any assumption about the distribution of the survival times in a Cox model, we cannot directly obtain survival probabilities from it. We first have to estimate the baseline hazard function, which is typically done with the non-parametric Breslow estimator. Concretely, the fitted model implies S(t | x) = S0(t)^exp(x'beta), so once the baseline survival S0(t) is estimated, a survival probability follows for any covariate pattern x. When the Cox model is fitted with coxph() from the survival package, we can obtain such probabilities with a call to the survfit() function. You may consult ?survfit.coxph for more information.
Let's see how we can do this by using the lung data set.
library(survival)
# select covariates of interest
df <- subset(lung, select = c(time, status, age, sex, ph.karno))
# assess whether there are any missing observations
apply(df, 2, \(x) sum(is.na(x))) # 1 in ph.karno
# listwise delete missing observations
df <- df[complete.cases(df), ]
# Cox model
fit <- coxph(Surv(time, status == 2) ~ age + sex + ph.karno, data = df)
## Note that I ignore the fact that ph.karno does not satisfy the PH assumption.
# specify for which combinations of values of age, sex, and
# ph.karno we want to derive survival probabilies
ND1 <- with(df, expand.grid(
  age = median(age),
  sex = c(1, 2),
  ph.karno = median(ph.karno)
))
ND2 <- with(df, expand.grid(
  age = median(age),
  sex = 1, # males
  ph.karno = round(create_intervals(n_groups = 3L))
))
# Obtain the expected survival times
sfit1 <- survfit(fit, newdata = ND1)
sfit2 <- survfit(fit, newdata = ND2)
The code behind the function create_intervals() can be found in this post. I just simply replaced speed with ph.karno in the function.
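Since create_intervals() lives in that other post, here is a minimal stand-in so the snippet above is self-contained. This is an assumption about its behaviour, not the original code: it cuts ph.karno into n_groups equal-frequency groups and returns a representative (mean) value per group, so the exact values may differ from the linked post.

# Hypothetical stand-in for create_intervals() from the linked post
create_intervals <- function(n_groups) {
  breaks <- quantile(df$ph.karno, probs = seq(0, 1, length.out = n_groups + 1))
  groups <- cut(df$ph.karno, breaks = breaks, include.lowest = TRUE)
  tapply(df$ph.karno, groups, mean)  # one representative value per group
}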
The output sfit1 contains the expected median survival times and the corresponding 95% confidence intervals for the combinations of covariates as specified in ND1.
> sfit1
Call: survfit(formula = fit, newdata = ND1)
n events median 0.95LCL 0.95UCL
1 227 164 283 223 329
2 227 164 371 320 524
Survival probabilities at specific follow-up times can be obtained with the times argument of the summary() method.
# survival probabilities at 200 days of follow-up
summary(sfit1, times = 200)
The output again contains the expected survival probabilities, but now after 200 days of follow-up: survival1 corresponds to the first row of ND1 (a male patient of median age with median ph.karno) and survival2 to the second row (the corresponding female patient).
> summary(sfit1, times = 200)
Call: survfit(formula = fit, newdata = ND1)
time n.risk n.event survival1 survival2
200 144 71 0.625 0.751
The 95% confidence limits associated with these two probabilities can be manually extracted from summary().
sum_sfit <- summary(sfit1, times = 200)
sum_sfit <- t(rbind(sum_sfit$surv, sum_sfit$lower, sum_sfit$upper))
colnames(sum_sfit) <- c("S_hat", "2.5 %", "97.5 %")
# ------------------------------------------------------
> sum_sfit
S_hat 2.5 % 97.5 %
1 0.6250586 0.5541646 0.7050220
2 0.7513961 0.6842830 0.8250914
If you would like to use ggplot to depict the expected survival probabilities (and the corresponding 95% confidence intervals) for the combinations of values as specified in ND1 and ND2, we first need to make data.frames that contain all the information in an appropriate format.
# function which returns the output from a survfit.object
# in an appropriate format, which can be used in a call
# to ggplot()
df_fun <- \(surv_obj, newdata, factor) {
  len <- length(unique(newdata[[factor]]))
  out <- data.frame(
    time = rep(surv_obj[['time']], times = len),
    n.risk = rep(surv_obj[['n.risk']], times = len),
    n.event = rep(surv_obj[['n.event']], times = len),
    surv = stack(data.frame(surv_obj[['surv']]))[, 'values'],
    upper = stack(data.frame(surv_obj[['upper']]))[, 'values'],
    lower = stack(data.frame(surv_obj[['lower']]))[, 'values']
  )
  out[, 7] <- gl(len, length(surv_obj[['time']]))
  names(out)[7] <- 'factor'
  return(out)
}
# data for the first panel (A)
df_leftPanel <- df_fun(surv_obj = sfit1, newdata = ND1, factor = 'sex')
# data for the second panel (B)
df_rightPanel <- df_fun(surv_obj = sfit2, newdata = ND2, factor = 'ph.karno')
Now that we have defined our data.frames, we need to define a new function which allows us to plot the 95% CIs. We assign it the generic name geom_stepribbon.
library(ggplot2)
# Function for geom_stepribbon
geom_stepribbon <- function(
    mapping = NULL,
    data = NULL,
    stat = "identity",
    position = "identity",
    na.rm = FALSE,
    show.legend = NA,
    inherit.aes = TRUE, ...) {
  layer(
    data = data,
    mapping = mapping,
    stat = stat,
    geom = GeomStepribbon,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}
GeomStepribbon <- ggproto(
  "GeomStepribbon", GeomRibbon,
  extra_params = c("na.rm"),
  draw_group = function(data, panel_scales, coord, na.rm = FALSE) {
    if (na.rm) data <- data[complete.cases(data[c("x", "ymin", "ymax")]), ]
    data <- rbind(data, data)
    data <- data[order(data$x), ]
    data$x <- c(data$x[2:nrow(data)], NA)
    data <- data[complete.cases(data["x"]), ]
    GeomRibbon$draw_group(data, panel_scales, coord, na.rm = FALSE)
  }
)
Finally, we can plot the expected survival probabilities for ND1 and ND2.
yl <- 'Expected Survival probability\n'
xl <- '\nTime (days)'
# left panel
my_colours <- c('blue4', 'darkorange')
adj_colour <- \(x) adjustcolor(x, alpha.f = 0.2)
my_colours <- c(
  my_colours, adj_colour(my_colours[1]), adj_colour(my_colours[2])
)
left_panel <- ggplot(df_leftPanel,
                     aes(x = time, colour = factor, fill = factor)) +
  geom_step(aes(y = surv), size = 0.8) +
  geom_stepribbon(aes(ymin = lower, ymax = upper), colour = NA) +
  scale_colour_manual(name = 'Sex',
                      values = c('1' = my_colours[1], '2' = my_colours[2]),
                      labels = c('1' = 'Males', '2' = 'Females')) +
  scale_fill_manual(name = 'Sex',
                    values = c('1' = my_colours[3], '2' = my_colours[4]),
                    labels = c('1' = 'Males', '2' = 'Females')) +
  ylab(yl) + xlab(xl) +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 12),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 12),
        legend.position = 'top')
# right panel
my_colours <- c('blue4', 'darkorange', '#00b0a4')
my_colours <- c(
  my_colours, adj_colour(my_colours[1]),
  adj_colour(my_colours[2]), adj_colour(my_colours[3])
)
right_panel <- ggplot(df_rightPanel,
                      aes(x = time, colour = factor, fill = factor)) +
  geom_step(aes(y = surv), size = 0.8) +
  geom_stepribbon(aes(ymin = lower, ymax = upper), colour = NA) +
  scale_colour_manual(name = 'Ph.karno',
                      values = c('1' = my_colours[1],
                                 '2' = my_colours[2],
                                 '3' = my_colours[3]),
                      labels = c('1' = 'Low', '2' = 'Middle', '3' = 'High')) +
  scale_fill_manual(name = 'Ph.karno',
                    values = c('1' = my_colours[4],
                               '2' = my_colours[5],
                               '3' = my_colours[6]),
                    labels = c('1' = 'Low', '2' = 'Middle', '3' = 'High')) +
  ylab(yl) + xlab(xl) +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 12),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 12),
        legend.position = 'top')
# composite plot
library(ggpubr)
ggarrange(left_panel, right_panel,
          ncol = 2, nrow = 1,
          labels = c('A', 'B'))
Output
Interpretation
Panel A shows the expected survival probabilities for a male and female patient of median age with a median ph.karno.
Panel B shows the expected survival probabilities for three male patients of median age with ph.karnos of 67 (low), 83 (middle), and 100 (high).
These survival curves will always satisfy the PH assumption, as they were derived from the Cox model.
Note: use function(x) instead of \(x) if you use a version of R <4.1.0
Although correct, I believe that the method described in the answer of Dion Groothof is not what is usually of interest. Usually, researchers are interested in visualizing the causal effect of a variable adjusted for confounders. Simply showing the predicted survival curve for one single covariate combination does not really do the trick here. I would recommend reading up on confounder-adjusted survival curves. See https://arxiv.org/abs/2203.10002 for example.
Those types of curves can be calculated in R using the adjustedCurves package: https://github.com/RobinDenz1/adjustedCurves
In your example, the following code could be used:
library(survival)
library(devtools)
# install adjustedCurves from github, load it
devtools::install_github("RobinDenz1/adjustedCurves")
library(adjustedCurves)
# "event" needs to be binary
lung$status <- lung$status - 1
# "variable" needs to be a factor
lung$ph.ecog <- factor(lung$ph.ecog)
fit <- coxph(Surv(time, status) ~ ph.ecog + age + sex,
             data = lung, x = TRUE)
# calculate and plot curves
adj <- adjustedsurv(data = lung, variable = "ph.ecog", ev_time = "time",
                    event = "status", method = "direct",
                    outcome_model = fit, conf_int = TRUE)
plot(adj)
Producing the following output:
These survival curves are adjusted for the effect of age and sex. More information on how this adjustment works can be found in the documentation of the adjustedCurves package or the article I cited above.
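To see what method = "direct" is doing conceptually, here is a minimal hand-rolled version, a sketch of g-computation / direct standardization rather than the adjustedCurves internals (and without the confidence intervals the package provides): for each level of ph.ecog, set every patient's ph.ecog to that level, predict one survival curve per patient from the Cox model, and average those curves.

# Conceptual sketch of the "direct" adjustment (not the package's actual code)
adjusted_curve <- function(fit, data, level) {
  nd <- data
  nd$ph.ecog <- factor(level, levels = levels(data$ph.ecog))  # set everyone to this level
  sf <- survfit(fit, newdata = nd)  # one predicted curve per row of nd
  data.frame(time = sf$time,
             surv = rowMeans(sf$surv),  # average over the observed covariate distribution
             ph.ecog = level)
}
adj_manual <- do.call(rbind, lapply(levels(lung$ph.ecog),
                                    function(lv) adjusted_curve(fit, lung, lv)))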

Given the probability find y for gamma distribution

Looking for some assistance using R. I know that there is invgamma, but I am not sure if that will work or how to use it correctly. If X has a Gamma distribution with shape parameter r = 3 and scale parameter ρ = 6, is there a way to calculate y such that Prob(X < y) = .95? Thanks!
In R you have four types of functions for every distribution:
r[name of the distribution] - random number generator
q[name of the distribution] - quantile function
d[name of the distribution] - density function
p[name of the distribution] - distribution function (CDF)
So since you have the probability, you need to use qgamma.
P(X < x) = 0.95
shape <- 3
rate <- 6
x_95 <- qgamma(p = 0.95,shape = shape, rate = rate)
[1] 1.049299
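As a quick check, the distribution function should return 0.95 at that quantile. One caveat worth flagging: qgamma() accepts either rate or scale (rate = 1/scale), and the code above treats ρ = 6 as a rate; if ρ is genuinely a scale parameter, as the question states, pass scale = 6 instead.

pgamma(q = x_95, shape = shape, rate = rate)
# [1] 0.95
# if rho = 6 is a scale parameter rather than a rate:
qgamma(p = 0.95, shape = 3, scale = 6)
# roughly 37.8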
Plot code (the tibble/%>%/ggplot calls below require the tidyverse):
library(tidyverse)
df <-
  tibble(
    x = seq(0, 3, l = 1000)
  ) %>%
  mutate(y = dgamma(x = x, shape = shape, rate = rate))
ggplot(df, aes(x, y)) +
  geom_function(fun = dgamma, args = list(shape = shape, rate = rate)) +
  geom_vline(xintercept = x_95, linetype = "dashed") +
  theme_bw() +
  scale_x_continuous(breaks = x_95) +
  geom_area(data = df %>% filter(x <= x_95),
            alpha = .7, fill = "chocolate2") +
  scale_y_continuous(expand = c(0, 0)) +
  annotate(geom = "text", x = .5, y = .7, label = "95%", size = 12)

How do I overlay a set of functions onto a ggplot2 graph when the function relies on the parameters of the inherited datafame?

Effectively, what I want to do is more efficiently generate a graph like this using the reproducible data I am providing.
I have plotted experimental data points of ([Substrate Concentration], [Substrate Concentration] / velocity), grouped by [Inhibitor Concentration]. Each of the lines is a continuous function that accepts the parameters Substrate_Conc and Inhibitor_Conc.
It is somewhat difficult for me to plot a function that relies on two inherited parameters.
For instance, it is easy to plot a function that relies on one inherited parameter x, but to my knowledge x cannot be specified as an argument.
fx <- function(x){return(1 / x)}
ggplot() + theme_bw() + xlim(c(1, 100)) + stat_function(fun = fx, col = "black") + labs(x = "x", y = "y")
This simply plots the graph of 1 / x, but I cannot specify x as a parameter.
In my case I need to specify x = Substrate_Conc as well as another parameter, Inhibitor_Conc, which I have grouped by.
My code:
library(dplyr) # For %>% operator
library(ggplot2) # for plots
library(scales) # for pretty_breaks()
library(stats) # For nls() function
# Construct a dataframe of experimental Inhibitor Concentration with corresponding substrate concentrations
# And with corresponding velocities
df <- data.frame(
  Inhibitor_Conc = rep(c(0, 6, 12, 24), each = 5),
  Substrate_Conc = rep(c(2, 5, 10, 20, 50), times = 4),
  velocity = c(0.0552, 0.1128, 0.1476, 0.1799, 0.2261,
               0.0242, 0.0690, 0.0774, 0.1905, 0.1861,
               0.0231, 0.0420, 0.0979, 0.1329, 0.1722,
               0.0138, 0.0393, 0.0855, 0.1042, 0.1562)
)
# Compute S_over_V for the Woolf_Hanes Plot
df <- df %>%
mutate(S_over_V = Substrate_Conc/velocity)
# Fit a function to describe this dataset. The Competitive Model is the best model to describe this enzyme kinetics dataset.
# But I will get around to fitting the other kinetic models such as Uncompetitive for comparison.
Competitive_Fit <- nls(
  data = df,
  formula = velocity ~ (V_Max * Substrate_Conc) / (K_M * (1 + Inhibitor_Conc / K_I) + Substrate_Conc),
  start = list(K_M = median(df$Substrate_Conc), K_I = median(df$Substrate_Conc), V_Max = max(df$velocity)),
  control = nls.control(maxiter = 100)
)
# Print a summary of the fit: It is pretty good.
print(summary(Competitive_Fit))
# Extract the parameters. First entry is K_M, followed by K_I, followed by V_Max
Parameters <- coefficients(Competitive_Fit)
Competitive_Woolf_H_Function <- function(S, K_M, K_I, V_Max, I) {
  # S is the substrate concentration and I is the inhibitor concentration
  predicted_S_over_V <- (K_M / V_Max) + (1 / V_Max) * S + (K_M / (V_Max * K_I)) * I
  return(predicted_S_over_V)
}
# First construct a pretty Woolf Hanes Plot without any fitting:
# Construct a Woolf Hanes Plot simply a plot a {(Substrate Concentration) / (Velocity)} vs Substrate Concentration
Woolf_Hanes_Plot <- ggplot(data = df, aes(x = Substrate_Conc, y = S_over_V)) +
  theme_bw() +
  geom_point(aes(group = factor(Inhibitor_Conc), col = factor(Inhibitor_Conc), shape = factor(Inhibitor_Conc))) +
  geom_vline(xintercept = 0, col = "black") +
  geom_hline(yintercept = 0, col = "black") +
  labs(x = "Substrate Concentration (mM)", y = "[Substrate] / Velocity",
       title = "Woolf Hanes Plot of Enzymatic Inhibition") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 20), limits = c(-50, 50)) +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 20)) +
  scale_color_manual(values = c("red", "green", "blue", "purple"), name = "Inhibitor Concentration:") +
  scale_shape_manual(values = c(15, 16, 17, 3), name = "Inhibitor Concentration:")
# Add the fitted lines to the plot.
# Very redundant calls to the Competitive_Woolf_H_Function
# With the only difference being the Inhibitor Concentration is different
Woolf_Hanes_Plot_Fitted <- Woolf_Hanes_Plot +
  stat_function(size = 1, fun = Competitive_Woolf_H_Function,
                args = list(K_M = Parameters[[1]], K_I = Parameters[[2]], V_Max = Parameters[[3]], I = 0),
                col = "red") +
  stat_function(size = 1, fun = Competitive_Woolf_H_Function,
                args = list(K_M = Parameters[[1]], K_I = Parameters[[2]], V_Max = Parameters[[3]], I = 6),
                col = "green") +
  stat_function(size = 1, fun = Competitive_Woolf_H_Function,
                args = list(K_M = Parameters[[1]], K_I = Parameters[[2]], V_Max = Parameters[[3]], I = 12),
                col = "blue") +
  stat_function(size = 1, fun = Competitive_Woolf_H_Function,
                args = list(K_M = Parameters[[1]], K_I = Parameters[[2]], V_Max = Parameters[[3]], I = 24),
                col = "violet")
print(Woolf_Hanes_Plot_Fitted)
Effectively, I have created a dataframe of Inhibitor_Conc, Substrate_Conc, and velocity. I fit the model of competitive inhibition to this enzyme kinetics data, using nonlinear regression to extract the unknown parameters K_M, K_I, and V_Max that describe my dataset.
I then used those calculated parameters to create a function which returns values for the Woolf-Hanes version of the competitive inhibition function. The hallmark of this function is that whatever the parameters K_M, K_I, and V_Max are, the lines as a function of substrate concentration will always be parallel.
The most inefficient part of my code is where I call stat_function() four times, manually specifying I = 0, I = 6, I = 12, and I = 24. Calling stat_function() once per inhibitor concentration is not very efficient, especially when the inhibitor concentrations are already in my plotted dataframe. I did this, though, to demonstrate my desired output. How can I do this by passing Substrate_Conc and Inhibitor_Conc into the function I call in stat_function()? Thank you!
The output of the nonlinear model yields the parameters:
Formula: velocity ~ (V_Max * Substrate_Conc)/(K_M * (1 + Inhibitor_Conc/K_I) +
Substrate_Conc)
Parameters:
Estimate Std. Error t value Pr(>|t|)
K_M 6.81741 1.40083 4.867 0.000145 ***
K_I 7.64747 1.92212 3.979 0.000971 ***
V_Max 0.24724 0.01475 16.763 5.24e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01489 on 17 degrees of freedom
Number of iterations to convergence: 5
Achieved convergence tolerance: 5.661e-06
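One way to avoid the four stat_function() calls (a sketch, not from the original post): precompute the fitted line over a substrate grid for every inhibitor concentration already in df, then draw them all with a single geom_line(). The lines inherit x and y from the plot's aes() and reuse the manual colour scale already defined for the points.

# Predictions for every inhibitor concentration already in df
pred_grid <- expand.grid(
  Substrate_Conc = seq(-50, 50, length.out = 200),
  Inhibitor_Conc = unique(df$Inhibitor_Conc)
)
pred_grid$S_over_V <- Competitive_Woolf_H_Function(
  S = pred_grid$Substrate_Conc,
  K_M = Parameters[["K_M"]], K_I = Parameters[["K_I"]],
  V_Max = Parameters[["V_Max"]], I = pred_grid$Inhibitor_Conc
)
# A single geom_line() replaces the four stat_function() calls
Woolf_Hanes_Plot +
  geom_line(data = pred_grid,
            aes(group = factor(Inhibitor_Conc), col = factor(Inhibitor_Conc)),
            size = 1)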

Change scale in geom_qq

I'd like to get the numeric values of a variable (rather than z-scores) on the x-axis using ggplot and geom_qq.
library("ggplot2")
coin_prob <- 0.5 # this is a fair coin
tosses_per_test <- 5000 # we want to flip a coin 5000 times
no_of_tests <- 1000
outcomes <- rbinom(n = no_of_tests,
                   size = tosses_per_test,
                   prob = coin_prob) / tosses_per_test
outcomes.df <- data.frame("results" = outcomes)
ggplot(outcomes.df, aes(sample = results)) +
  geom_qq() +
  geom_qq_line(color = "red") +
  labs(x = "Theoretical Data", title = "Simulated Coin toss",
       subtitle = "5000 tosses repeated 1000 times", y = "Sample Outcomes")
The default in ggplot for the x-axis seems to be z-scores rather than raw theoretical values. I can hack around it like this to get the "real" x-axis:
p <- ggplot(outcomes.df, aes(sample = results)) + geom_qq()
g <- ggplot_build(p)
raw_qs <- g$data[[1]]$theoretical * sd(outcomes.df$results) + mean(outcomes.df$results)
ggplot(outcomes.df, aes(sample = results)) +
  geom_qq() +
  geom_qq_line(color = "red") +
  labs(x = "Theoretical Data", title = "Simulated Coin toss",
       subtitle = "5000 tosses repeated 1000 times", y = "Sample Outcomes") +
  scale_x_continuous(breaks = seq(-3, 3, 1),
                     labels = round(seq(-3, 3, 1) * sd(outcomes.df$results) + mean(outcomes.df$results), 2))
But there's got to be something simpler.
Set the parameters of the distribution such that the theoretical quantiles match the distribution to which you're comparing.
library("ggplot2")
coin_prob <- 0.5 # this is a fair coin
tosses_per_test <- 5000 # we want to flip a coin 5000 times
no_of_tests <- 1000
outcomes <- rbinom(
  n = no_of_tests,
  size = tosses_per_test,
  prob = coin_prob) / tosses_per_test
## set dparams in _qq calls
## so that we're not comparing against standard normal distn.
ggplot(mapping = aes(sample = outcomes)) +
  geom_qq(dparams = list(mean = mean(outcomes), sd = sd(outcomes))) +
  geom_qq_line(
    dparams = list(mean = mean(outcomes), sd = sd(outcomes)),
    color = "red"
  ) +
  labs(
    x = "Theoretical Data",
    title = "Simulated Coin toss",
    subtitle = "5000 tosses repeated 1000 times",
    y = "Sample Outcomes"
  )
You can also change the distribution entirely.
For example, to compare against uniform quantiles (e.g., p-values):
pvals <- replicate(1000, cor.test(rnorm(100), rnorm(100))$p.value)
ggplot(mapping = aes(sample = pvals)) +
  geom_qq(distribution = stats::qunif) +
  geom_qq_line(
    distribution = stats::qunif,
    color = "red"
  ) +
  labs(
    x = "Uniform quantiles",
    title = "p-values under the null",
    subtitle = "1,000 null correlation tests",
    y = "Observed p-value"
  )
