Change scale in geom_qq - r

I'd like to get the numeric values of a variable (rather than z-scores) on the x-axis using ggplot and geom_qq:
library("ggplot2")
coin_prob <- 0.5 # this is a fair coin
tosses_per_test <- 5000 # we want to flip a coin 5000 times
no_of_tests <- 1000
outcomes <- rbinom(n = no_of_tests,
size = tosses_per_test,
prob = coin_prob)/tosses_per_test
outcomes.df <- data.frame("results"= outcomes)
ggplot(outcomes.df, aes(sample = results)) +
geom_qq() +
geom_qq_line(color="red") +
labs(x="Theoretical Data", title = "Simulated Coin toss", subtitle = "5000 tosses repeated 1000 times", y="Sample Outcomes")
The default in ggplot for the x-axis seems to be z-scores rather than raw theoretical values. I can hack around it like this to get the "real" x-axis:
p <- ggplot(outcomes.df, aes(sample = results)) + geom_qq()
g <- ggplot_build(p)
raw_qs <- g$data[[1]]$theoretical*sd(outcomes.df$results) + mean(outcomes.df$results)
ggplot(outcomes.df, aes(sample = results)) +
geom_qq() +
geom_qq_line(color="red") +
labs(x="Theoretical Data", title = "Simulated Coin toss", subtitle = "5000 tosses repeated 1000 times", y="Sample Outcomes") +
scale_x_continuous(breaks=seq(-3,3,1), labels = round((seq(-3,3,1)*sd(outcomes.df$results) + mean(outcomes.df$results)),2))
But there's got to be something simpler.

Pass dparams to geom_qq() and geom_qq_line() so that the theoretical quantiles come from the distribution you're comparing against (here a normal with the sample's mean and sd) instead of the standard normal.
library("ggplot2")
coin_prob <- 0.5 # this is a fair coin
tosses_per_test <- 5000 # we want to flip a coin 5000 times
no_of_tests <- 1000
outcomes <- rbinom(
n = no_of_tests,
size = tosses_per_test,
prob = coin_prob) / tosses_per_test
## set dparams in _qq calls
## so that we're not comparing against standard normal distn.
ggplot(mapping = aes(sample = outcomes)) +
geom_qq(dparams = list(mean = mean(outcomes), sd = sd(outcomes))) +
geom_qq_line(
dparams = list(mean = mean(outcomes), sd = sd(outcomes)),
color = "red"
) +
labs(
x = "Theoretical Data",
title = "Simulated Coin toss",
subtitle = "5000 tosses repeated 1000 times",
y = "Sample Outcomes"
)
You can also change the distribution entirely.
For example, to compare against uniform quantiles (e.g., for p-values):
pvals <- replicate(1000, cor.test(rnorm(100), rnorm(100))$p.value)
ggplot(mapping = aes(sample = pvals)) +
geom_qq(distribution = stats::qunif) +
geom_qq_line(
distribution = stats::qunif,
color = "red"
) +
labs(
x = "Uniform quantiles",
title = "p-values under the null",
subtitle = "1,000 null correlation tests",
y = "Observed p-value"
)
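To see why dparams gives the same axis as the manual rescaling in the question, here is a minimal base-R sketch. It assumes that the theoretical quantiles are evaluated at the probabilities stats::ppoints(n), which is my understanding of what stat_qq does; the variable names are for illustration only.

```r
set.seed(1)
outcomes <- rbinom(n = 1000, size = 5000, prob = 0.5) / 5000

# probabilities at which the theoretical quantiles are evaluated (assumption: ppoints)
p <- ppoints(length(outcomes))

# standard-normal quantiles, rescaled by hand as in the question
rescaled <- qnorm(p) * sd(outcomes) + mean(outcomes)

# quantiles of a normal with the sample's mean and sd, as with dparams
direct <- qnorm(p, mean = mean(outcomes), sd = sd(outcomes))

# the two agree up to floating-point error
all.equal(rescaled, direct)
```

So dparams does not change the shape of the plot, only the units of the theoretical axis.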

Related

Why is geom_density plotting my data differently from the expected image?

I have the following task statement:
In this task we want to simulate random variables with density f(x) = exp(-|x|)/2.
To do this, write a function r_density(n) that simulates n of such random variables.
Then use this function to simulate N = 1000 of such random variables. Using geom_density() you can now estimate the density from the simulated random variables. We can compare this estimate with the real density. To do this, create a graph that looks like this:
The problem, however, is that I don't understand why my output looks like this:
Why is the estimated density plotted in such a weird way? Can someone explain to me why it looks like that and how to get the estimated density from the expected image?
This is the corresponding code I wrote for the above plot:
library(tidyverse)
N <- 1000
r_density <- function(n){
exp(-abs(n))/2
}
x <- runif(N)
tb <- tibble(
x = x,
density_fkt = r_density(x)
)
ggplot() +
geom_density(
data = tb,
mapping = aes(
x = density_fkt,
y = ..scaled..
)
) +
geom_function(
fun = r_density,
xlim = c(-6,6),
color = "red",
size = 1
) +
theme_minimal() +
labs(
x = "x",
y = "Density function f(x)",
title = "Estimated (black) vs true (red) density"
)
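Your r_density() evaluates the density f at its argument; it does not draw random numbers. Your plot therefore shows the estimated density of the values f(x) for uniform x, not the density of samples drawn from f. To actually simulate from f you may use inverse transform sampling or rejection sampling. I chose rejection sampling; r_density() is kept as the density function and the accepted draws are collected in x: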
library(tidyverse)
N <- 1000
r_density <- function(n){
exp(-abs(n))/2
}
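x = c()
while (length(x) < N) {
# propose from a Cauchy: its tails are heavier than the target's, so
# 3 * dcauchy(y) >= r_density(y) for every y (a normal proposal cannot dominate the tails)
y = rcauchy(1)
# restrict proposals to the plotted range [-6, 6]
while (y > 6 | y < -6) {
y = rcauchy(1)
}
u = runif(1)
# accept with probability target / (3 * proposal)
if (u < r_density(y)/(dcauchy(y) * 3)) {
x=append(x, y)
}
}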
tb <- tibble(
x = x,
density_fkt = r_density(x)
)
ggplot() +
geom_density(
data = tb,
mapping = aes(
x = x,
y = ..density..
)
) +
geom_function(
fun = r_density,
xlim = c(-6,6),
color = "red",
size = 1
) +
theme_minimal() +
labs(
x = "x",
y = "Density function f(x)",
title = "Estimated (black) vs true (red) density"
)
Here's the inverse transform sampling method. It needs the CDF, which takes some integration to derive, so it is perhaps not what your teacher intended. The CDF is inverted numerically with uniroot():
r_density <- function(n) {
cdf <- function(x) {
1/4 * exp(-x) * (-1 + 2 * exp(x) + exp(2*x) - (-1 + exp(x))^2 * sign(x))
}
sapply(runif(n), function(i) {
uniroot(function(x) cdf(x) - i, c(-30, 20))$root
})
}
Plotting gives:
ggplot() +
geom_density(aes(r_density(1000))) +
geom_function(
fun = function(x) exp(-abs(x))/2,
xlim = c(-6,6),
color = "red",
size = 1
) +
theme_minimal() +
labs(
x = "x",
y = "Density function f(x)",
title = "Estimated (black) vs true (red) density"
)
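For this particular density the CDF can also be inverted by hand, which avoids the root finding. The sketch below assumes the target is f(x) = exp(-|x|)/2 (the standard Laplace distribution); r_laplace is a hypothetical name, not a function from any package.

```r
# inverse CDF of f(x) = exp(-abs(x)) / 2:
#   u <  0.5  ->  log(2 * u)
#   u >= 0.5  -> -log(2 * (1 - u))
r_laplace <- function(n) {
  u <- runif(n)
  ifelse(u < 0.5, log(2 * u), -log(2 * (1 - u)))
}

set.seed(1)
x <- r_laplace(1e5)

# the variance of this distribution is 2, so sd(x) should be close to sqrt(2)
c(mean = mean(x), sd = sd(x))
```

The vector x can be passed to geom_density() in the same way as the output of r_density(1000) above.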

Monte Carlo Sim in R plots STRAIGHTS

I am getting started with Monte Carlo simulations and used this basic code to simulate returns for a given portfolio. Somehow a portion of the simulated returns always shows up as straight lines, which are easy to see on the plotted graph. I decreased the number of simulations so that they are easier to see, and I also played around with some other factors, but the lines keep showing up. The rest of the output looks promising and "random".
I've added a link to the image, as my account is new, along with the code. I appreciate any help!
library(quantmod)
library(ggplot2)
maxDate<- "2000-01-01"
tickers<-c("MSFT", "AAPL", "BRK-B")
getSymbols(tickers, from=maxDate)
Port.p<-na.omit(merge(Cl(AAPL),Cl(MSFT),Cl(`BRK-B`)))
Port.r<-ROC(Port.p, type = "discrete")[-1,]
stock_Price<- as.matrix(Port.p[,1:3])
stock_Returns <- as.matrix(Port.r[,1:3])
mc_rep = 50 # Number of Sims
training_days = 200
portfolio_Weights = c(0.5,0.3,0.2)
coVarMat = cov(stock_Returns)
miu = colMeans(stock_Returns)
Miu = matrix(rep(miu, training_days), nrow = 3)
portfolio_Returns_m = matrix(0, training_days, mc_rep)
set.seed(2000)
for (i in 1:mc_rep) {
Z = matrix ( rnorm( dim(stock_Returns)[2] * training_days ), ncol = training_days )
L = t( chol(coVarMat) )
daily_Returns = Miu + L %*% Z
portfolio_Returns_200 = cumprod( portfolio_Weights %*% daily_Returns + 1 )
portfolio_Returns_m[,i] = portfolio_Returns_200;
}
x_axis = rep(1:training_days, mc_rep)
y_axis = as.vector(portfolio_Returns_m-1)
plot_data = data.frame(x_axis, y_axis)
ggplot(data = plot_data, aes(x = x_axis, y = y_axis)) + geom_path(col = 'red', size = 0.1) +
xlab('Days') + ylab('Portfolio Returns') +
ggtitle('Simulated Portfolio Returns in 200 days')+
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
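The straight lines are the 'return' from the end of each series to the beginning of the next: geom_path() connects the rows in order, so the last day of one simulation gets joined to the first day of the following one. You can keep the series separate by adding a grouping variable to your plotting data and using the group aesthetic to tell ggplot about it. Each simulation occupies training_days consecutive rows of plot_data, so the group id must be constant within those rows:
g <- rep(1:mc_rep, each = training_days)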
plot_data = data.frame(x_axis, y_axis, g)
ggplot(data = plot_data, aes(x = x_axis, y = y_axis, group = g)) + ...
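The layout of the plotting data determines which grouping vector is needed. Here is a small self-contained sketch, with made-up random walks in place of the portfolio simulation (the sizes and the sims matrix are assumptions for illustration):

```r
library(ggplot2)

set.seed(1)
mc_rep <- 5          # number of simulated paths
training_days <- 20  # length of each path

# one column per simulation, laid out like portfolio_Returns_m
sims <- replicate(mc_rep, cumsum(rnorm(training_days)))

plot_data <- data.frame(
  x_axis = rep(1:training_days, mc_rep),        # days 1..20, repeated per simulation
  y_axis = as.vector(sims),                     # column-major: simulation 1 first, then 2, ...
  g      = rep(1:mc_rep, each = training_days)  # simulation id for every row
)

# without group = g, geom_path would join the end of one path to the start of the next
p <- ggplot(plot_data, aes(x = x_axis, y = y_axis, group = g)) +
  geom_path(colour = "red")
```

Because as.vector() unrolls the matrix column by column, each block of training_days rows belongs to one simulation, which is what g encodes.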

Illustrate standard deviation in histogram

Consider the following simple example:
# E. Musk in Grunheide
set.seed(22032022)
# generate random numbers
randomNumbers <- rnorm(n = 1000, mean = 10, sd = 10)
# empirical sd
sd(randomNumbers)
#> [1] 10.34369
# histogram
hist(randomNumbers, probability = TRUE, main = "", breaks = 50)
# just for illustration purposes
###
# empirical density
lines(density(randomNumbers), col = 'black', lwd = 2)
# theoretical density
curve(dnorm(x, mean = 10, sd = 10), col = "blue", lwd = 2, add = TRUE)
###
Created on 2022-03-22 by the reprex package (v2.0.1)
Question:
Is there a nice way to illustrate the empirical standard deviation (sd) in the histogram by colour?
E.g. representing the inner bars by a different color, or indicating the range of the sd by an interval, i.e., [mean +/- sd], on the x-axis?
Note, if ggplot2 provides an easy solution, suggesting this would be also much appreciated.
This is a similar ggplot solution to Benson's answer, except that we precompute the histogram and use geom_col, so that we don't get any of the unwelcome stacking at the sd boundary:
# E. Musk in Grunheide
set.seed(22032022)
# generate random numbers
randomNumbers <- rnorm(n=1000, mean=10, sd=10)
h <- hist(randomNumbers, breaks = 50, plot = FALSE)
lower <- mean(randomNumbers) - sd(randomNumbers)
upper <- mean(randomNumbers) + sd(randomNumbers)
df <- data.frame(x = h$mids, y = h$density,
fill = h$mids > lower & h$mids < upper)
library(ggplot2)
ggplot(df) +
geom_col(aes(x, y, fill = fill), width = 1, color = 'black') +
geom_density(data = data.frame(x = randomNumbers),
aes(x = x, color = 'Actual density'),
key_glyph = 'path') +
geom_function(fun = function(x) {
dnorm(x, mean = mean(randomNumbers), sd = sd(randomNumbers)) },
aes(color = 'theoretical density')) +
scale_fill_manual(values = c(`TRUE` = '#FF374A', 'FALSE' = 'gray'),
name = 'within 1 SD') +
scale_color_manual(values = c('black', 'blue'), name = 'Density lines') +
labs(x = 'Value of random number', y = 'Density') +
theme_minimal()
Here is a ggplot solution. First calculate the mean and sd, and save the values in separate variables. Then use an ifelse statement to categorise the values into "Within range" and "Outside range", and fill them with different colours.
The blue line represents the normal distribution stated in your question, and the black line represents the estimated density of the data we're plotting.
library(ggplot2)
set.seed(22032022)
# generate random numbers
randomNumbers <- rnorm(n=1000, mean=10, sd=10)
randomNumbers_mean <- mean(randomNumbers)
randomNumbers_sd <- sd(randomNumbers)
ggplot(data.frame(randomNumbers = randomNumbers), aes(randomNumbers)) +
geom_histogram(aes(
fill = ifelse(
randomNumbers > randomNumbers_mean + randomNumbers_sd |
randomNumbers < randomNumbers_mean - randomNumbers_sd,
"Outside range",
"Within range"
)
),
binwidth = 1, col = "gray") +
geom_density(aes(y = ..count..)) +
stat_function(fun = function(x) dnorm(x, mean = 10, sd = 10) * 1000,
color = "blue") +
labs(fill = "Data")
Created on 2022-03-22 by the reprex package (v2.0.1)
data.frame(rand = randomNumbers,
cut = {
sd <- sd(randomNumbers)
mn <- mean(randomNumbers)
cut(randomNumbers, c(-Inf, mn -sd, mn +sd, Inf))
}) |>
ggplot(aes(x = rand, fill = cut ) ) +
geom_histogram()

Setting hypotheses in R package SPRT

I am new to using the SPRT package in R to perform sequential probability ratio testing, and vignettes/tutorials for this package seem to be sparse.
By default the SPRT function can receive cumulative values of n & k (trials and events). I will be using this method in a large study where trials and events will be tallied daily in a cumulative fashion, and I want to check my logic on how I have applied SPRT().
SPRT requires users to set explicit null and alternative hypotheses. I have set these to
H_0: treat = control
H_1: treat = control * 1.01
In my for-loop that follows I apply the SPRT() function every day to compute the log likelihood ratio of the cumulative data under each hypothesis, and I really just want to confirm that this is the correct way to analyze the data. Most examples I have seen set h0 and h1 in a more explicit fashion (e.g., h0 = .85 & h1 = .85*1.01), while I have set them to reflect the observed rates for each day in the cumulative data as seems more appropriate in the setting of an experiment (e.g., h0 = df_sprt$control[i]/df_sprt$n[i], h1 = (df_sprt$control[i] * MDE)/df_sprt$n[i]).
library(SPRT)
library(tidyverse)
# simulate cumulative data from an AB Test
set.seed(42)
DAYS <- 14
DAILY_N <- 1e3
BASERATE <- .85
MDE <- 1.02
df_sprt <-
tibble(
day = 1:DAYS,
control = rbinom(n = DAYS, size = DAILY_N, prob = BASERATE),
treat = rbinom(n = DAYS, size = DAILY_N, prob = BASERATE*MDE),
n = DAILY_N
) %>%
mutate(
control = cumsum(control),
treat = cumsum(treat),
n = cumsum(n)
)
# apply SPRT in a for loop
wald_a <- vector('numeric', length = nrow(df_sprt))
wald_b <- vector('numeric', length = nrow(df_sprt))
llr <- vector('numeric', length = nrow(df_sprt))
for (i in 1:nrow(df_sprt)) {
out <- SPRT(
distribution = "bernoulli",
type1 = 0.05, type2 = 0.20,
h0 = df_sprt$control[i]/df_sprt$n[i], h1 = (df_sprt$control[i] * MDE)/df_sprt$n[i],
n = df_sprt$n[i],
k = df_sprt$treat[i]
)
wald_a[i] <- out$wald.A
wald_b[i] <- out$wald.B
llr[i] <- out$llr
}
sprt_out <-
tibble(
llr,
wald_a,
wald_b,
cohort_day = 1:DAYS
)
# Plot the results
sprt_out %>%
ggplot(aes(x = cohort_day, y = llr)) +
geom_hline(
yintercept =
c(max(sprt_out$wald_a), max(sprt_out$wald_b)),
color = c('darkgreen', 'red')
) +
geom_point() +
geom_line() +
annotate(
x=10,y=max(sprt_out$wald_b),
label="Reject Alternative Hy & Retain Null Hy",
vjust=-1, geom="text", color = 'red'
) +
annotate(
x=10,y=max(sprt_out$wald_a),
label="Reject Null Hy & Accept Alternative Hy",
vjust=1.5, geom="text", color = 'darkgreen'
) +
scale_y_continuous(breaks = -10:20) +
scale_x_continuous(breaks = 1:20) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
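For checking the numbers by hand, the Bernoulli log-likelihood ratio and Wald's boundaries can be computed directly in base R. This is a sketch of the textbook formulas, not of the SPRT package's internals; bernoulli_llr and wald_bounds are hypothetical helper names, and the counts in the usage lines are made up.

```r
# log-likelihood ratio of H1 (p = p1) against H0 (p = p0)
# after k events in n Bernoulli trials
bernoulli_llr <- function(n, k, p0, p1) {
  k * log(p1 / p0) + (n - k) * log((1 - p1) / (1 - p0))
}

# Wald's approximate decision boundaries on the log scale
wald_bounds <- function(type1, type2) {
  c(upper = log((1 - type2) / type1),   # at or above: accept H1
    lower = log(type2 / (1 - type1)))   # at or below: accept H0
}

bounds <- wald_bounds(type1 = 0.05, type2 = 0.20)
llr <- bernoulli_llr(n = 1000, k = 870, p0 = 0.85, p1 = 0.85 * 1.02)

# keep sampling while the LLR stays strictly between the two boundaries
decision <- if (llr >= bounds[["upper"]]) {
  "accept H1"
} else if (llr <= bounds[["lower"]]) {
  "accept H0"
} else {
  "continue"
}
```

In this sketch p0 and p1 are fixed in advance, as in Wald's formulation, whereas the loop above re-estimates h0 and h1 from the cumulative control data each day.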

How to Add a Legend to a ggplot without plotting the raw data?

I have made a plot of a polynomial function: y = x^2 - 6*x + 9
with a series of several points in a sequence + minor standard error in y. I used these points to construct a spline model for that function from the raw data points, and then I calculated the derivative from the spline model with R's predict() function and then I added both of the spline curves to the plot.
By the way, the expected derivative function is this: dy / dx = 2*x - 6
The original function I colored blue and the 1st derivative function I colored red. I wish to add legends to these plots, but I'm finding that difficult since I did not assign any points to the plots, as I declared the data-frames within the geom_smooth() functions.
The code I'm using is this:
library(ggplot2)
# Plot the function: f(x) = x^2 - 6x + 9
# with a smooth spline:
# And then the derivative of that function from predicted values of the
# smoothed spline: f ' (x) = 2*x - 6
# Get a large sequence of x-values:
x <- seq(from = -10, to = 10, by = 0.01)
# The y-values are a function of each x value.
y <- x^2 - 6*x + 9 + rnorm(length(x), 0, 0.5)
# Fit the curve to a model which is a smoothing spline.
model <- smooth.spline(x = x, y = y)
# Predict the 1st derivative of this smoothed spline.
f_x <- predict(model, x = seq(from = min(x), to = max(x), by = 1), deriv = 1)
# Plot the smoothed spline of the original function and the derivative with respect to x.
p <- ggplot() + theme_bw() + geom_smooth(data = data.frame(x,y), aes(x = x, y = y), method = "loess", col = "blue", se = TRUE) + geom_smooth(data = data.frame(f_x$x, f_x$y), aes(x = f_x$x, y = f_x$y), method = "loess", col = "red", se = TRUE)
# Set the bounds of the plot.
p <- p + scale_x_continuous(breaks = scales::pretty_breaks(n = 20), limits = c(-5, 10)) + scale_y_continuous(breaks = scales::pretty_breaks(n = 20), limits = c(-10, 10))
# Add some axis labels
p <- p + labs(x = "x-axis", y = "y-axis", title = "Original Function and predicted derivative function")
p <- p + scale_fill_manual(values = c("blue", "red"), labels = c("Original Function", "Derivative Function with respect to x"))
print(p)
I was hoping that I could add the legend with scale_fill_manual(), but my attempt does not add a legend to the plot. Essentially, the plot I get generally looks like this, minus the messy legend that I added in paint. I would like that legend, thank you.
I did this because I want to show my chemistry instructor that I can accurately measure heat capacity from differential scanning calorimetry data points alone. I believe the heat capacity is just the first derivative of the heat flow vs. temperature curve with respect to temperature.
So I tried to make a plot showing the original function overlaid with its first derivative with respect to x, showing that the first derivative computed only from a spline curve fitted to raw data points reliably produces the expected line dy / dx = 2 * x - 6, which it does.
I just want to add that legend.
Creating a single data frame with your data and mapping color inside the aesthetics is the most common way of doing this.
df <- rbind(
data.frame(data='f(x)', x=x, y=y),
data.frame(data='f`(x)', x=f_x$x, y=f_x$y))
p <- ggplot(df, aes(x,y, color=data)) + geom_smooth(method = 'loess')
p <- p + scale_x_continuous(breaks = scales::pretty_breaks(n = 20), limits = c(-5, 10)) + scale_y_continuous(breaks = scales::pretty_breaks(n = 20), limits = c(-10, 10))
p <- p + labs(x = "x-axis", y = "y-axis", title = "Original Function and predicted derivative function")
p <- p + scale_color_manual(name = "Functions", values = c("blue", "red"), labels = c("Original Function", "Derivative Function with respect to x"))
print(p)
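If you would rather not reshape the data, another common approach is to map colour to a constant string inside aes() in each layer; ggplot2 then builds the legend from those strings. Here is a minimal sketch, using noise-free stand-ins for the two curves (the data frames orig and deriv are assumptions for illustration):

```r
library(ggplot2)

x <- seq(-5, 10, by = 0.1)
orig  <- data.frame(x = x, y = x^2 - 6 * x + 9)  # stand-in for the fitted function
deriv <- data.frame(x = x, y = 2 * x - 6)        # stand-in for the predicted derivative

p <- ggplot(mapping = aes(x = x, y = y)) +
  # a constant inside aes() becomes a legend entry
  geom_line(data = orig,  aes(colour = "Original function")) +
  geom_line(data = deriv, aes(colour = "Derivative with respect to x")) +
  # a named vector ties each colour to its label regardless of ordering
  scale_colour_manual(
    name = "Functions",
    values = c("Original function" = "blue",
               "Derivative with respect to x" = "red")
  ) +
  theme_bw()
```

Naming the entries of values avoids relying on the alphabetical order in which the legend keys are sorted.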

Resources