Looking for some assistance using R. I know that there is invgamma, but I am not sure whether that will work or how to use it correctly. If X has a Gamma distribution with shape parameter r = 3 and scale parameter ρ = 6, is there a way to calculate y such that Prob(X < y) = .95? Thanks!
In R, each distribution comes with four types of functions:
r[name of the distribution]- Random number generator
q[name of the distribution]- Quantile function
d[name of the distribution]- Density function
p[name of the distribution]- Distribution function
Since you are given the probability, you need to use qgamma.
P(X < x) = 0.95
shape <- 3
rate <- 6
x_95 <- qgamma(p = 0.95,shape = shape, rate = rate)
[1] 1.049299
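One caveat: the question calls 6 a scale parameter, while the code above passes it as `rate`. The following is a minimal sketch of the difference, assuming the usual relation scale = 1/rate that R's gamma functions use. `pgamma` is called only to confirm each quantile.

```r
shape <- 3

# Interpretation used in the answer above: 6 is a rate
q_rate <- qgamma(p = 0.95, shape = shape, rate = 6)

# Interpretation stated in the question: 6 is a scale (i.e. rate = 1/6)
q_scale <- qgamma(p = 0.95, shape = shape, scale = 6)

# Round trip: the distribution function should return 0.95 in both cases
pgamma(q_rate, shape = shape, rate = 6)
pgamma(q_scale, shape = shape, scale = 6)

# The two quantiles differ by a factor of 6 * 6 = 36
q_scale / q_rate
```

If 6 really is a scale, `q_scale` is the value to report; otherwise `q_rate` matches the output shown above.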
Plot code
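library(tidyverse)
# evaluate the density on a grid so the area up to x_95 can be shaded
df <-
tibble(
x = seq(0, 3, length.out = 1000)
) %>%
mutate(y = dgamma(x = x, shape = shape, rate = rate))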
ggplot(df,aes(x,y)) +
geom_function(fun = dgamma, args = list(shape = shape, rate = rate))+
geom_vline(xintercept = x_95, linetype = "dashed")+
theme_bw()+
scale_x_continuous(breaks = x_95)+
geom_area(data = df %>% filter(x <= x_95),
alpha = .7, fill = "chocolate2")+
scale_y_continuous(expand = c(0,0))+
annotate(geom = "text",x = .5,y = .7,label = "95%",size = 12)
Consider the following simple example:
# E. Musk in Grunheide
set.seed(22032022)
# generate random numbers
randomNumbers <- rnorm(n = 1000, mean = 10, sd = 10)
# empirical sd
sd(randomNumbers)
#> [1] 10.34369
# histogram
hist(randomNumbers, probability = TRUE, main = "", breaks = 50)
# just for illustration purposes
###
# empirical density
lines(density(randomNumbers), col = 'black', lwd = 2)
# theoretical density
curve(dnorm(x, mean = 10, sd = 10), col = "blue", lwd = 2, add = TRUE)
###
Created on 2022-03-22 by the reprex package (v2.0.1)
Question:
Is there a nice way to illustrate the empirical standard deviation (sd) in the histogram by colour?
E.g. representing the inner bars by a different color, or indicating the range of the sd by an interval, i.e., [mean +/- sd], on the x-axis?
Note, if ggplot2 provides an easy solution, suggesting this would be also much appreciated.
This is a ggplot solution similar to Benson's answer, except that we precompute the histogram and use geom_col, so that we don't get any of the unwelcome stacking at the sd boundary:
# E. Musk in Grunheide
set.seed(22032022)
# generate random numbers
randomNumbers <- rnorm(n=1000, mean=10, sd=10)
h <- hist(randomNumbers, breaks = 50, plot = FALSE)
lower <- mean(randomNumbers) - sd(randomNumbers)
upper <- mean(randomNumbers) + sd(randomNumbers)
df <- data.frame(x = h$mids, y = h$density,
fill = h$mids > lower & h$mids < upper)
library(ggplot2)
ggplot(df) +
geom_col(aes(x, y, fill = fill), width = 1, color = 'black') +
geom_density(data = data.frame(x = randomNumbers),
aes(x = x, color = 'Actual density'),
key_glyph = 'path') +
geom_function(fun = function(x) {
dnorm(x, mean = mean(randomNumbers), sd = sd(randomNumbers)) },
aes(color = 'theoretical density')) +
scale_fill_manual(values = c(`TRUE` = '#FF374A', 'FALSE' = 'gray'),
name = 'within 1 SD') +
scale_color_manual(values = c('black', 'blue'), name = 'Density lines') +
labs(x = 'Value of random number', y = 'Density') +
theme_minimal()
Here is a ggplot solution. First calculate the mean and sd and save the values in separate variables. Then use an ifelse statement to categorise the values into "Within range" and "Outside range", and fill the two groups with different colours.
The blue line represents the normal distribution stated in your question, and the black line represents the density of the data shown in the histogram.
library(ggplot2)
set.seed(22032022)
# generate random numbers
randomNumbers <- rnorm(n=1000, mean=10, sd=10)
randomNumbers_mean <- mean(randomNumbers)
randomNumbers_sd <- sd(randomNumbers)
ggplot(data.frame(randomNumbers = randomNumbers), aes(randomNumbers)) +
geom_histogram(aes(
fill = ifelse(
randomNumbers > randomNumbers_mean + randomNumbers_sd |
randomNumbers < randomNumbers_mean - randomNumbers_sd,
"Outside range",
"Within range"
)
),
binwidth = 1, col = "gray") +
geom_density(aes(y = after_stat(count))) +
stat_function(fun = function(x) dnorm(x, mean = 10, sd = 10) * 1000,
color = "blue") +
labs(fill = "Data")
Created on 2022-03-22 by the reprex package (v2.0.1)
data.frame(rand = randomNumbers,
cut = {
sd <- sd(randomNumbers)
mn <- mean(randomNumbers)
cut(randomNumbers, c(-Inf, mn -sd, mn +sd, Inf))
}) |>
ggplot(aes(x = rand, fill = cut ) ) +
geom_histogram()
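Since the question started from base `hist()`, here is a base-graphics sketch of the same idea, assuming it is acceptable to colour a bar by whether its midpoint falls inside [mean - sd, mean + sd]. The colour names are arbitrary choices.

```r
set.seed(22032022)
randomNumbers <- rnorm(n = 1000, mean = 10, sd = 10)

# Compute the histogram without drawing it
h <- hist(randomNumbers, breaks = 50, plot = FALSE)

lower <- mean(randomNumbers) - sd(randomNumbers)
upper <- mean(randomNumbers) + sd(randomNumbers)

# One colour per bar, chosen by whether the bar midpoint lies in [mean - sd, mean + sd]
inside <- h$mids >= lower & h$mids <= upper
bar_cols <- ifelse(inside, "tomato", "grey80")

# Draw on the density scale and mark the interval with dashed lines
plot(h, freq = FALSE, col = bar_cols, main = "", xlab = "randomNumbers")
abline(v = c(lower, upper), lty = 2)
```

Bars that straddle a boundary are coloured as a whole, so the coloured region only approximates the interval; the dashed lines show its exact limits.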
I am new to using the SPRT package in R to perform sequential probability ratio testing, and vignettes/tutorials for this package seem to be sparse.
By default the SPRT function can receive cumulative values of n & k (trials and events). I will be using this method on large studies where trials and events will be tallied daily in a cumulative fashion, and I want to check my logic on how I have applied SPRT().
SPRT requires users to set explicit null and alternative hypothesis. I have set these to H_0: treat = control
H_1: treat = control * 1.01
In my for-loop that follows I apply the SPRT() function every day to compute the log likelihood ratio of the cumulative data under each hypothesis, and I really just want to confirm that this is the correct way to analyze the data. Most examples I have seen set h0 and h1 in a more explicit fashion (e.g., h0 = .85 & h1 = .85*1.01), while I have set them to reflect the observed rates for each day in the cumulative data as seems more appropriate in the setting of an experiment (e.g., h0 = df_sprt$control[i]/df_sprt$n[i], h1 = (df_sprt$control[i] * MDE)/df_sprt$n[i]).
library(SPRT)
library(tidyverse)
# simulate cumulative data from an AB Test
set.seed(42)
DAYS <- 14
DAILY_N <- 1e3
BASERATE <- .85
MDE <- 1.02
df_sprt <-
tibble(
day = 1:DAYS,
control = rbinom(n = DAYS, size = DAILY_N, prob = BASERATE),
treat = rbinom(n = DAYS, size = DAILY_N, prob = BASERATE*MDE),
n = DAILY_N
) %>%
mutate(
control = cumsum(control),
treat = cumsum(treat),
n = cumsum(n)
)
# apply SPRT in a for loop
wald_a <- vector('numeric', length = nrow(df_sprt))
wald_b <- vector('numeric', length = nrow(df_sprt))
llr <- vector('numeric', length = nrow(df_sprt))
for (i in 1:nrow(df_sprt)) {
out <- SPRT(
distribution = "bernoulli",
type1 = 0.05, type2 = 0.20,
h0 = df_sprt$control[i]/df_sprt$n[i], h1 = (df_sprt$control[i] * MDE)/df_sprt$n[i],
n = df_sprt$n[i],
k = df_sprt$treat[i]
)
wald_a[i] <- out$wald.A
wald_b[i] <- out$wald.B
llr[i] <- out$llr
}
sprt_out <-
tibble(
llr,
wald_a,
wald_b,
cohort_day = 1:DAYS
)
# Plot the results
sprt_out %>%
ggplot(aes(x = cohort_day, y = llr)) +
geom_hline(
yintercept =
c(max(sprt_out$wald_a), max(sprt_out$wald_b)),
color = c('darkgreen', 'red')
) +
geom_point() +
geom_line() +
annotate(
x=10,y=max(sprt_out$wald_b),
label="Reject Alternative Hy & Retain Null Hy",
vjust=-1, geom="text", color = 'red'
) +
annotate(
x=10,y=max(sprt_out$wald_a),
label="Reject Null Hy & Accept Alternative Hy",
vjust=1.5, geom="text", color = 'darkgreen'
) +
scale_y_continuous(breaks = -10:20) +
scale_x_continuous(breaks = 1:20) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
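For reference, the quantities being plotted can be sketched by hand. This assumes the textbook Wald definitions of the boundaries and the Bernoulli log-likelihood ratio; it is not taken from the SPRT package source, and the names `upper`, `lower`, and `bernoulli_llr` are hypothetical. The sketch keeps p0 and p1 fixed in advance, as in Wald's classical formulation.

```r
# Error rates as in the call above
type1 <- 0.05  # alpha
type2 <- 0.20  # beta

# Wald's approximate decision boundaries on the log scale
upper <- log((1 - type2) / type1)  # LLR above this: reject H0
lower <- log(type2 / (1 - type1))  # LLR below this: retain H0

# Log-likelihood ratio of H1 vs H0 for k successes in n Bernoulli trials
bernoulli_llr <- function(k, n, p0, p1) {
  k * log(p1 / p0) + (n - k) * log((1 - p1) / (1 - p0))
}

# Hypothetical cumulative counts, with both hypotheses fixed beforehand
llr <- bernoulli_llr(k = 870, n = 1000, p0 = 0.85, p1 = 0.85 * 1.02)
c(lower = lower, llr = llr, upper = upper)
```

Comparing these values with `out$llr` and the two boundaries returned by `SPRT()` for the same inputs is one way to check how the package is being driven.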
I'm trying to fit a negative binomial distribution to count data, scaled back to counts as in this example. In my data, I have to plot the fitted function separately for two species. However, there is no easy way to specify this within the function while also getting line legends with the parameter values in the key for both species.
library(MASS)
library(tidyverse)
set.seed(111)
count <- rbinom(500,100,0.1)
species <- rep(c("A","B"),times = 250)
df <- data.frame(count,species)
#Specifying negative binomial function
negbinom.params <- fitdistr(df$count,"negative binomial", method = "SANN")$estimate
dist.params <- map(list(`Negative Binomial` = negbinom.params),~ map2(names(.),.,~ paste0(.x," = ",round(.y,2))) %>% unlist %>% paste0(.,collapse = ", ")) %>% map2_chr(names(.),., ~ paste(.x,.y,sep=":\n"))
#Plotting
mybinwidth = 2
ggplot(df, aes(x = count, colour = species, fill = species)) +
facet_grid(.~species) +
geom_histogram(aes(y=..count..),alpha = 0.5, binwidth = mybinwidth) +
stat_function(aes(color = "orange"),
fun = function(x,size, mu) {
mybinwidth * nrow(df) * dnbinom(x,size = size, mu = mu)
},
args=fitdistr(df$count, "negative binomial", method="SANN")$estimate,
xlim=c(0,50),n=20)
You are right, this is a bit of a pain to get right. I've adapted your example a little to show two different distributions more clearly. Here is my attempt to make your approach work:
library(ggplot2)
library(MASS)
#> Warning: package 'MASS' was built under R version 3.6.2
set.seed(111)
df <- data.frame(
count = rnbinom(500, rep(c(5, 10), each = 250), 0.5),
species = rep(c("A", 'B'), each = 250)
)
# Not the prettiest formatting, but it'll show the point
ests <- sapply(split(df$count, df$species), function(x) {
est <- fitdistr(x, "negative binomial", method = "SANN")$estimate
formatted <- paste0(names(est)[1], " = ", format(est, digits = 2)[1], ",",
names(est)[2], " = ", format(est, digits = 2)[2])
formatted
})
mybinwidth <- 1
spec_A = df[df$species == "A",]
spec_B = df[df$species == "B",]
ggplot(df, aes(count)) +
geom_histogram(binwidth = mybinwidth,
aes(fill = species), alpha = 0.5,
position = "identity") +
stat_function(aes(color = "A"),
data = data.frame(species = "A"),
fun = function(x, size, mu) {
mybinwidth * nrow(spec_A) * dnbinom(x,size = size, mu = mu)
},
args = fitdistr(spec_A$count, "negative binomial", method="SANN")$estimate,
xlim = c(0, max(df$count)), n= max(df$count) + 1, inherit.aes = FALSE) +
stat_function(aes(color = "B"),
data = data.frame(species = "B"),
fun = function(x, size, mu) {
mybinwidth * nrow(spec_B) * dnbinom(x,size = size, mu = mu)
},
args = fitdistr(spec_B$count, "negative binomial", method="SANN")$estimate,
xlim = c(0, max(df$count)), n= max(df$count) + 1, inherit.aes = FALSE) +
scale_colour_discrete(labels = unname(ests), name = "fit") +
facet_wrap(~ species)
#> Warning: `mapping` is not used by stat_function()
#> Warning: `data` is not used by stat_function()
#> Warning: `mapping` is not used by stat_function()
#> Warning: `data` is not used by stat_function()
Created on 2020-04-15 by the reprex package (v0.3.0)
There are also packages that do the majority of this work for you. Disclaimer: I wrote ggh4x, so I'm not unbiased. You can also replace the ggplot code with the following (assuming similar preprocessing)
library(ggh4x)
ggplot(df, aes(count)) +
geom_histogram(binwidth = mybinwidth,
aes(fill = species), alpha = 0.5,
position = "identity") +
stat_theodensity(aes(colour = species,
y = after_stat(count * mybinwidth)),
distri = "nbinom") +
scale_colour_discrete(labels = unname(ests), name = "fit") +
facet_wrap(~ species)
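Another route, sketched here under the assumption that precomputing the curves is acceptable, is to fit each species separately and store the scaled densities in a data frame. The names `grid`, `curves`, and `expected` are hypothetical, and `fitdistr()` is called with its default optimiser rather than `method = "SANN"`.

```r
library(MASS)  # for fitdistr()

set.seed(111)
df <- data.frame(
  count   = rnbinom(500, size = rep(c(5, 10), each = 250), prob = 0.5),
  species = rep(c("A", "B"), each = 250)
)
binwidth <- 1
grid <- 0:max(df$count)

# Fit each species separately and evaluate the scaled density on the grid
curves <- do.call(rbind, lapply(split(df, df$species), function(d) {
  est <- suppressWarnings(fitdistr(d$count, "negative binomial"))$estimate
  data.frame(
    species  = d$species[1],
    count    = grid,
    # density * number of observations in the group * bin width
    expected = binwidth * nrow(d) * dnbinom(grid, size = est[["size"]], mu = est[["mu"]])
  )
}))
head(curves)
```

Because `curves` carries a `species` column, a layer such as `geom_line(data = curves, aes(count, expected, colour = species))` is split across panels by `facet_wrap(~ species)` without any per-species `stat_function()` calls.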
Hope that helped!
When executing the following piece of code, the output plot shows a blue line of f(x) = 0, instead of the Gamma pdf (see the blue line in this picture).
analyzeGamma <- function(csvPath, alpha, beta) {
dfSamples <- read.csv(file = csvPath,
header = TRUE,
sep = ",")
base <- ggplot(dfSamples, aes(x = value, y = quantity))
base +
geom_col(color = "red") +
geom_vline(xintercept = qgamma(seq(0.1, 0.9, by = 0.1), alpha, beta)) +
stat_function(
fun = dgamma,
args = list(shape = alpha, rate = beta),
colour = "blue"
)
}
path = "/tmp/data.csv"
alpha = 1.2
beta = 0.01
analyzeGamma(path, alpha, beta)
When I comment out the line:
geom_col(color = "red") +
The Gamma pdf is drawn correctly, as can be seen here.
Any idea why it happens and how to resolve?
Thanks.
It's because your geom_col() goes up to 25 and probability density functions have an integral of 1. If I'm correct in assuming your columns resemble a histogram with count data as quantities, you would have to scale your density to match the columns as follows:
density * number of samples * width of columns
If you've precomputed the columns, 'number of samples' would be the sum of all your y-values.
An example with some toy data, notice the function in the stat:
library(ggplot2)
alpha = 1.2
beta = 0.01
df <- data.frame(x = rgamma(1000, shape = alpha, rate = beta))
binwidth <- 5
ggplot(df, aes(x)) +
geom_histogram(binwidth = binwidth) +
stat_function(
fun = function(z, shape, rate)(dgamma(z, shape, rate) * length(df$x) * binwidth),
args = list(shape = alpha, rate = beta),
colour = "blue"
)
The following example with geom_col() gives the same picture:
x <- table(cut_width(df$x, binwidth, boundary = 0))
newdf <- data.frame(x = seq(0.5*binwidth, max(df$x), by = binwidth),
y = as.numeric(x))
ggplot(newdf, aes(x, y)) +
geom_col(width = binwidth) +
stat_function(
fun = function(z, shape, rate)(dgamma(z, shape, rate) * sum(newdf$y) * binwidth),
args = list(shape = alpha, rate = beta),
colour = "blue"
)
ggplot scales the y-axis to show all data. The blue curve appears as a straight line because of the scale. If you compare the y-axis in both charts you'll see that when you draw the geom_col, the y-axis maximum is somewhere around 25 (and the stat_function curve looks like a straight line). Without the geom_col, the y-axis maximum is somewhere around 0.006.
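The scaling rule can be checked numerically without plotting. This is a rough sketch on simulated data, using the same toy parameters as above; the seed and the names `breaks` and `expected` are arbitrary.

```r
set.seed(1)
alpha <- 1.2
beta <- 0.01
x <- rgamma(1000, shape = alpha, rate = beta)
binwidth <- 5

# Bin the sample with fixed-width breaks starting at 0
breaks <- seq(0, ceiling(max(x) / binwidth) * binwidth, by = binwidth)
h <- hist(x, breaks = breaks, plot = FALSE)

# Expected bar height = density at bin midpoint * sample size * bin width
expected <- dgamma(h$mids, shape = alpha, rate = beta) * length(x) * binwidth

# Tallest bar versus the peak of the unscaled density
max(h$counts)
max(dgamma(h$mids, shape = alpha, rate = beta))

# The scaled heights sum to roughly the sample size
sum(expected)
```

The unscaled density peaks far below the tallest count bar, which is why it looks flat when both share one y-axis.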
I have a data.frame object with a numeric column amount and a categorical column fraud:
amount <- c(60.00, 336.38, 119.00, 115.37, 220.01, 60.00, 611.88, 189.78, ...)
fraud <- c(1, 0, 0, 0, 0, 0, 1, 0, ...)
I want to fit a gamma distribution to amount but to plot it by factor(fraud).
I want a graph that will show me 2 curves with 2 different colors that will distinguish between the 2 sets (fraud/non fraud groups).
Here is what I have done so far:
library(fitdistrplus)
fit.gamma1 <- fitdist(df$amount[df$fraud == 1], distr = "gamma", method = "mle")
plot(fit.gamma1)
fit.gamma0 <- fitdist(df$amount[df$fraud == 0], distr = "gamma", method = "mle")
plot(fit.gamma0)
I have used this reference:
How would you fit a gamma distribution to a data in R?
Perhaps what you want is
curve(dgamma(x, shape = fit.gamma0$estimate[1], rate = fit.gamma0$estimate[2]),
from = min(amount), to = max(amount), ylab = "")
curve(dgamma(x, shape = fit.gamma1$estimate[1], rate = fit.gamma1$estimate[2]),
from = min(amount), to = max(amount), col = "red", add = TRUE)
or with ggplot2
ggplot(data.frame(x = range(amount)), aes(x)) +
stat_function(fun = dgamma, aes(color = "Non fraud"),
args = list(shape = fit.gamma0$estimate[1], rate = fit.gamma0$estimate[2])) +
stat_function(fun = dgamma, aes(color = "Fraud"),
args = list(shape = fit.gamma1$estimate[1], rate = fit.gamma1$estimate[2])) +
theme_bw() + scale_color_discrete(name = NULL)
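A self-contained variant of the base-graphics version, as a sketch only: the data are simulated stand-ins for the real `amount` and `fraud` columns, and `MASS::fitdistr()` is swapped in for `fitdistrplus::fitdist()` to keep dependencies small. The list name `fits` is hypothetical.

```r
library(MASS)  # fitdistr(); used here instead of fitdistrplus

set.seed(42)
# Hypothetical stand-in for the real data: two gamma-distributed groups
df <- data.frame(
  amount = c(rgamma(300, shape = 2, rate = 0.02), rgamma(300, shape = 2, rate = 0.01)),
  fraud  = rep(c(0, 1), each = 300)
)

# One gamma fit per level of fraud; list elements are named "0" and "1"
fits <- lapply(split(df$amount, df$fraud), function(a) {
  suppressWarnings(fitdistr(a, "gamma"))$estimate
})

# Overlay both fitted densities on a shared x-range
curve(dgamma(x, shape = fits[["0"]][["shape"]], rate = fits[["0"]][["rate"]]),
      from = min(df$amount), to = max(df$amount), xlab = "amount", ylab = "Density")
curve(dgamma(x, shape = fits[["1"]][["shape"]], rate = fits[["1"]][["rate"]]),
      col = "red", add = TRUE)
legend("topright", legend = c("Non fraud", "Fraud"), col = c("black", "red"), lty = 1)
```

Looping over `split()` avoids repeating the fitting call per group and extends to more than two levels.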