ggplot2: Extending stat_function to geom_violin

In a data.frame, I would like to compare the density estimates drawn by ggplot2::geom_violin() with the densities that stat_function() would compute, and to do this for every level of a factor.
In this setting, I want to compare the empirical density of two samples of size 100 with the true densities of normal distributions with means 10 and 20.
library(tidyverse)
test <- tibble(a = rnorm(100, mean = 10),
b = rnorm(100, mean = 20)) %>%
gather(key, value)
One way to achieve this is to draw, for every factor level, an overlay of stat_density and stat_function. However, with many levels this creates too many plots. (Several answers on this already exist, e.g. overlaying a histogram with the empirical density and a dnorm function.)
For the clarity of the next graphs I use the geom_flat_violin of @DavidRobinson: dgrtwo/geom_flat_violin.R.
source("geom_flat_violin.R")
# without the "true" distribution
test %>%
ggplot(aes(x = key, y = value)) +
geom_flat_violin(col = "red", fill = "red", alpha = 0.3) +
geom_point()
# comparing with the "true" distribution
test %>%
ggplot(aes(x = key, y = value)) +
geom_flat_violin(col = "red", fill = "red", alpha = 0.3) +
geom_point() +
geom_flat_violin(data = tibble(value = rnorm(10000, mean = 10), key = "a"),
fill = "blue", alpha = 0.2)
The problem with this solution is that it requires simulating, for every factor level, enough data points for the final density to be smooth. For the normal distribution 10000 points are enough, but other distributions might need even more.
The question is: can stat_function be used to achieve this, so that simulating data is not necessary?
stat_function(fun = dnorm, args = list(mean = 10))
stat_function(fun = dnorm, args = list(mean = 20))

Rather than having to calculate the density of a large sample, you could simply get the distribution directly and plot it as a polygon:
library(tidyverse)
test <- tibble(a = rnorm(100, mean = 10),
b = rnorm(100, mean = 20)) %>%
gather(key, value)
test %>%
ggplot(aes(x = key, y = value)) +
geom_flat_violin(col = "red", fill = "red", alpha = 0.3) +
geom_point() +
geom_polygon(data = tibble(value = seq(7, 13, length.out = 100),
key = 1 + dnorm(value, 10)),
fill = "blue", colour = "blue", alpha = 0.2)
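The polygon idea can be extended to every factor level by building one outline per level. The following is a minimal sketch, assuming that discrete levels sit at integer positions 1, 2, ... on the x axis; the helper name density_outline and its arguments are hypothetical, not part of ggplot2. Only geom_point() is drawn here so that the example runs without sourcing geom_flat_violin.R.

```r
library(ggplot2)

# Hypothetical helper: outline of a theoretical normal density,
# drawn to the right of the discrete x position `pos`.
density_outline <- function(key, pos, mean, sd = 1, n = 201, width = 4) {
  value <- seq(mean - width * sd, mean + width * sd, length.out = n)
  data.frame(key = key, x = pos + dnorm(value, mean, sd), value = value)
}

# One outline per factor level ("a" at position 1, "b" at position 2)
outlines <- rbind(
  density_outline("a", pos = 1, mean = 10),
  density_outline("b", pos = 2, mean = 20)
)

set.seed(1)
test <- data.frame(key = rep(c("a", "b"), each = 100),
                   value = c(rnorm(100, mean = 10), rnorm(100, mean = 20)))

p <- ggplot(test, aes(x = key, y = value)) +
  geom_point() +
  geom_polygon(data = outlines, aes(x = x, y = value, group = key),
               fill = "blue", colour = "blue", alpha = 0.2)
```

Printing p draws both theoretical curves in one layer. A violin geom may rescale its width, so it is safer to compare shapes than absolute heights.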

Related

Overlay two plots from different dataframes in R

I would like to overlay two ggplots from different data sources. I don't think a left_join will work because the data frames have different lengths and the join would potentially change the underlying plots. [Maybe?]
library(tidyverse)
set.seed(123)
player_df <- tibble(name = rep(c("A","B","C","D"), each = 10, times = 1),
pos = rep(c("DEF","DEF","MID","MID"), each = 10, times = 1),
load = c(rnorm(10, mean = 200, sd = 100),
rnorm(10, mean = 300, sd = 50),
rnorm(10, mean = 400, sd = 100),
rnorm(10, mean = 500, sd = 50)))
p1 <- player_df %>%
ggplot(aes(x = load, y = name)) +
geom_point()
pos_df <- tibble(pos = rep(c("DEF","MID"), each = 30, times = 1),
load = (c(rnorm(30, mean = 250, sd = 100),
rnorm(30, mean = 350, sd = 100))))
p2 <- pos_df %>%
ggplot(aes(x = load, y = pos)) +
geom_boxplot()
p1
p2
# add p2 to every p1 player plot by pos
I would like p1 to have the corresponding p2 - by pos - appear behind it. So... add the matching p2 boxplot to each p1 scatterplot.
It's not really advisable to attempt to superimpose two plots on each other. A ggplot is made of layers already, so usually it's just a case of superimposing one geom on another. This can be difficult if (as in your case) one of the axes has different labels. However, with a little work it is possible to wrangle your data so that it all sits on a single plot. In your case, you could do something like:
levs <- c("A", "DEF", "B", "C", "MID", "D")
ggplot(within(pos_df, pos <- factor(pos, levs)), aes(x = load, y = pos)) +
geom_boxplot(width = 2.3) +
geom_point(data = within(player_df, pos <- factor(name, levs))) +
scale_y_discrete(limits = c("A", "DEF", "B", " ", "C", "MID", "D"))
Dug into ggplot a bit and re-engineered a boxplot bit by bit.
# manually calculate stats that are used in boxplots
pos_df_summary <- pos_df %>%
group_by(pos, .drop = FALSE) %>%
summarise(min = fivenum(load)[1],
Q1 = fivenum(load)[2],
median = fivenum(load)[3],
Q3 = fivenum(load)[4],
max = fivenum(load)[5]
)
# add the boxplot data to each player
joined_df <- player_df %>%
left_join(., pos_df_summary, by = "pos") %>%
distinct(name, .keep_all = TRUE)
# plot
ggplot(data = NULL, aes(group = name)) +
# create the line from min to max
geom_segment(data = joined_df, aes(y = name, yend = name, x=min, xend=max), color="black") +
#create the box with median line
geom_crossbar(data = joined_df,
aes(y = name, xmin = Q1, xmax = Q3, x = median, fill = "NA"),
color = "black",
fatten = 1) +
scale_fill_manual(values = "white") +
# add the points from the player_df
geom_point(data = player_df,
aes(x = load, y = name, group=name),
color = "red",
show.legend=FALSE) +
theme(legend.position = "none")
There may be some extraneous code in here as I cobbled it from some other resources. Specifically, I'm not sure what the aes(group = name) in the ggplot() call does exactly.
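One difference from geom_boxplot() is worth noting: fivenum() returns the raw minimum and maximum, whereas geom_boxplot() by default stops its whiskers at the most extreme points within 1.5 * IQR of the box. A minimal sketch of a summary that follows the second convention is shown below; box_summary is a hypothetical helper, not a ggplot2 function, and it uses quantile() for the quartiles.

```r
# Hypothetical helper (not part of ggplot2): Tukey-style box summary for one vector.
box_summary <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  iqr <- q[3] - q[1]
  # keep only the points inside the fences; whiskers end at their extremes
  inside <- x[x >= q[1] - coef * iqr & x <= q[3] + coef * iqr]
  data.frame(lower = min(inside), Q1 = q[1], median = q[2],
             Q3 = q[3], upper = max(inside))
}

x <- c(1:9, 100)
fivenum(x)      # last element is the raw maximum, 100
box_summary(x)  # upper whisker stops at 9, the last point inside the fence
```

Applied per group (for example inside summarise()), the lower and upper columns could replace min and max in the geom_segment() call if outliers should not stretch the line.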

How to add name labels to a graph using ggplot2 in R?

I have the following code:
plot <- ggplot(data = df_sm)+
geom_histogram(aes(x=simul_means, y=..density..), binwidth = 0.20, fill="slategray3", col="black", show.legend = TRUE)
plot <- plot + labs(title="Density of 40 Means from Exponential Distribution", x="Mean of 40 Exponential Distributions", y="Density")
plot <- plot + geom_vline(xintercept=sampl_mean,size=1.0, color="black", show.legend = TRUE)
plot <- plot + stat_function(fun=dnorm,args=list(mean=sampl_mean, sd=sampl_sd),color = "dodgerblue4", size = 1.0)
plot <- plot+ geom_vline(xintercept=th_mean,size=1.0,color="indianred4",linetype = "longdash")
plot <- plot + stat_function(fun=dnorm,args=list(mean=th_mean, sd=th_mean_sd),color = "darkmagenta", size = 1.0)
plot
I want to show the legends of each layer. I tried show.legend = TRUE but it does nothing.
My data frame consists of means from exponential distribution simulations. I also have some theoretical values from the distribution (mean and standard deviation), which I call th_mean and th_mean_sd.
The code for my simulation is the following:
lambda <- 0.2
th_mean <- 1/lambda
th_sd <- 1/lambda
th_var <- th_sd^2
n <- 40
th_mean_sd <- th_sd/sqrt(n)
th_mean_var <- th_var/n  # variance of the mean is var/n (sd is sd/sqrt(n))
simul <- 1000
simul_means <- NULL
for(i in 1:simul) {
simul_means <- c(simul_means, mean(rexp(n, lambda)))
}
sampl_mean <- mean(simul_means)
sampl_sd <- sd(simul_means)
df_sm<-data.frame(simul_means)
If you want to get a legend, you have to map on aesthetics instead of setting the color, fill, ... as parameters, i.e. move color = ... inside aes(...) and use scale_color_manual / scale_fill_manual to set the color values. Personally I find it helpful to use meaningful labels, e.g. for your histogram I map the label "hist" on the fill, but you could use whatever label you like:
set.seed(123)
lambda <- 0.2
th_mean <- 1 / lambda
th_sd <- 1 / lambda
th_var <- th_sd^2
n <- 40
th_mean_sd <- th_sd / sqrt(n)
th_mean_var <- th_var / n  # variance of the mean is var/n (sd is sd/sqrt(n))
simul <- 1000
simul_means <- NULL
for (i in 1:simul) {
simul_means <- c(simul_means, mean(rexp(n, lambda)))
}
sampl_mean <- mean(simul_means)
sampl_sd <- sd(simul_means)
df_sm <- data.frame(simul_means)
library(ggplot2)
ggplot(data = df_sm) +
geom_histogram(aes(x = simul_means, y = ..density.., fill = "hist"), binwidth = 0.20, col = "black") +
labs(title = "Density of 40 Means from Exponential Distribution", x = "Mean of 40 Exponential Distributions", y = "Density") +
stat_function(fun = dnorm, args = list(mean = sampl_mean, sd = sampl_sd), aes(color = "sampl_dens"), size = 1.0) +
stat_function(fun = dnorm, args = list(mean = th_mean, sd = th_mean_sd), aes(color = "th_dens"), size = 1.0) +
geom_vline(size = 1.0, aes(xintercept = sampl_mean, color = "sampl_mean")) +
geom_vline(size = 1.0, aes(xintercept = th_mean, color = "th_mean"), linetype = "longdash") +
scale_fill_manual(values = c(hist = "slategray3")) +
scale_color_manual(values = c(sampl_dens = "dodgerblue4", th_dens = "darkmagenta", th_mean = "indianred4", sampl_mean = "black"))
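The theoretical curve relies on the fact that the mean of n independent draws has standard deviation sd / sqrt(n) and variance sd^2 / n. A minimal sketch of a sanity check follows, reusing the variable names from the simulation; replicate() is swapped in for the loop purely for brevity.

```r
set.seed(123)
lambda <- 0.2
n <- 40

# 1000 simulated means of n exponential draws
simul_means <- replicate(1000, mean(rexp(n, lambda)))

th_sd <- 1 / lambda            # sd of a single Exp(lambda) draw
th_mean_sd <- th_sd / sqrt(n)  # sd of the mean of n draws
th_mean_var <- th_sd^2 / n     # variance of the mean of n draws

# Simulated and theoretical values side by side
c(simulated = sd(simul_means), theoretical = th_mean_sd)
c(simulated = var(simul_means), theoretical = th_mean_var)
```

If the simulated and theoretical values are close, the two density curves in the plot should nearly coincide.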

Stat function dnorm failure

I am working through a class problem to test if the central limit theorem applies to medians as well. I've written the code, and as far as I can tell, it is working just fine. But my dnorm stat to plot the normal distribution is not showing up. It just creates a flat line when it should create a bell curve. Here is the code:
set.seed(14)
median_clt <- rnorm(1000, mean = 10, sd = 2)
many_sample_medians <- function(vec, n, reps) {
rep_vec <- replicate(reps, sample(vec, n), simplify = "vector")
median_vec <- apply(rep_vec, 2, median)
return(median_vec)
}
median_clt_test <- many_sample_medians(median_clt, 500, 1000)
median_clt_test_df <- data.frame(median_clt_test)
bw_clt <- 2 * IQR(median_clt_test_df$median_clt_test) / length(median_clt_test_df$median_clt_test)^(1/3)
ggplot(median_clt_test_df, aes(x = median_clt_test)) +
geom_histogram(binwidth = bw_clt, aes(y = ..density..), fill = "hotpink1", col = "white") +
stat_function(fun = ~dnorm(.x, mean = 10, sd = 2), col = "darkorchid1", lwd = 2) +
theme_classic()
As far as I can tell, the rest of the code is working properly - it just doesn't plot the dnorm stat function correctly. The exact same stat line worked for me before, so I'm not sure what's gone wrong.
The line isn't quite flat; it's just very stretched out compared to the histogram. We can see this more clearly if we zoom out on the x axis and zoom in on the y axis:
ggplot(median_clt_test_df, aes(x = median_clt_test)) +
geom_histogram(binwidth = bw_clt, aes(y = ..density..),
fill = "hotpink1", col = "white") +
stat_function(fun = ~dnorm(.x, mean = 10, sd = 2),
col = "darkorchid1",
lwd = 2) +
xlim(c(5, 15)) +
coord_cartesian(xlim = c(5, 15), ylim = c(0, 1)) +
theme_classic()
But why is this?
It's because you are using dnorm to plot the distribution of the random variable from which the medians were drawn, but your histogram is a sample of the medians themselves. So you are plotting the wrong dnorm curve. The sd should not be the standard deviation of the random variable, but the standard deviation of the sample medians:
ggplot(median_clt_test_df, aes(x = median_clt_test)) +
geom_histogram(binwidth = bw_clt, aes(y = ..density..),
fill = "hotpink1", col = "white") +
stat_function(fun = ~dnorm(.x,
mean = mean(median_clt_test),
sd = sd(median_clt_test)),
col = "darkorchid1",
lwd = 2) +
theme_classic()
If you prefer, you could use the theoretical standard error of the mean instead of the measured standard deviation of your medians; in this example the two are very similar:
# Theoretical SEM
2/sqrt(500)
#> [1] 0.08944272
# SD of medians
sd(median_clt_test)
#> [1] 0.08850221
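The closeness of these two numbers depends on the setup. For independent normal samples, the large-sample standard error of the median is sqrt(pi / 2) * sigma / sqrt(n), about 25% larger than the standard error of the mean. The code above instead samples without replacement from one fixed vector of 1000 values, which reduces the spread of the medians. The sketch below is an illustrative check that draws a fresh sample each time; the variable names are made up for the example.

```r
set.seed(14)
sigma <- 2
n <- 500

# Medians of independent normal samples (a fresh sample on every repetition,
# not sampling from a fixed vector as in the question)
medians <- replicate(2000, median(rnorm(n, mean = 10, sd = sigma)))

sem <- sigma / sqrt(n)                       # standard error of the mean
se_median <- sqrt(pi / 2) * sigma / sqrt(n)  # large-sample SE of the median (normal data)

c(sd_of_medians = sd(medians), sem = sem, se_median = se_median)
```

Either way, using sd(median_clt_test) computed from the data itself, as in the plot code, avoids having to pick the right formula.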

Overlay a Normal Density Plot On Top of Data ggplot2

I plot a density curve using ggplot2. After I plot the data, I would like to add a normal density plot right on top of it with a fill.
Currently, I am using rnorm() to create the data but this is not efficient and would work poorly on small data sets.
library(tidyverse)
#my data that I want to plot
my.data = rnorm(1000, 3, 10)
#create the normal density plot to overlay the data
overlay.normal = rnorm(1000, 0, 5)
all = tibble(my.data = my.data, overlay.normal = overlay.normal)
all = reshape2::melt(all)  # melt() comes from reshape2, which library(tidyverse) does not attach
ggplot(all, aes(value, fill = variable))+geom_density()
The goal would be to plot my data and overlay a normal distribution on top of it (with a fill). Something like:
ggplot(my.data)+geom_density()+add_normal_distribution(mean = 0, sd = 5, fill = "red")
Here's an approach using stat_function to define a normal curve and draw it within the ggplot call.
ggplot(my.data %>% enframe(), aes(value)) +
geom_density(fill = "mediumseagreen", alpha = 0.1) +
stat_function(fun = function(x) dnorm(x, mean = 0, sd = 5),
color = "red", linetype = "dotted", size = 1)
I figured out the solution from mixing Jon's answer and an answer from Hadley.
my.data = rnorm(1000, 3, 10)
ggplot(my.data %>% enframe(), aes(value)) +
geom_density(fill = "mediumseagreen", alpha = 0.1) +
geom_area(stat = "function", fun = function(x) dnorm(x, mean = 0, sd = 5), fill = "red", alpha = .5)

Using position_dodge within stat_summary for means and confidence intervals?

I'm trying to show data of two groups. I am using the ggplot2 package to graph the data and stat_summary() to obtain a point estimate (mean) and 90% CI within the plot of the data. What I'd like is for the mean and confidence interval to be offset to the right of the points representing the distribution of the data. Currently, stat_summary() simply draws the mean and CI on top of the distribution.
Here is an example of data that I am working with:
set.seed(9909)
Subjects <- 1:100
values <- c(rnorm(n = 50, mean = 30, sd = 5), rnorm(n = 50, mean = 35, sd = 8))
data <- cbind(Subjects, values)
group1 <- rep("group1", 50)
group2 <- rep("group2", 50)
group <- c(group1, group2)
data <- data.frame(data, group)
data
And this is what my current ggplot2 code looks like (distribution as points with the mean and 90% CI overlaid on top for each group):
ggplot(data, aes(x = group, y = values, group = 1)) +  
geom_point() +
stat_summary(fun = "mean", color = "red", size = 5, geom = "point") +  # fun.y is deprecated since ggplot2 3.3.0
stat_summary(fun.data = "mean_cl_normal", color = "red", size = 2, geom = "errorbar", width = 0, fun.args = list(conf.int = 0.9)) + theme_bw()
Is it possible to get the mean and confidence intervals to position_dodge to the right of their respective groups?
You can use position_nudge:
ggplot(data, aes(x = group, y = values, group = 1)) +
geom_point() +
stat_summary(fun = "mean", color = "red", size = 5, geom = "point",  # fun.y is deprecated since ggplot2 3.3.0
position=position_nudge(x = 0.1, y = 0)) +
stat_summary(fun.data = "mean_cl_normal", color = "red", size = 2,
geom = "errorbar", width = 0, fun.args = list(conf.int = 0.9),
position=position_nudge(x = 0.1, y = 0)) +
theme_bw()
