Using stat_summary to plot the location of the median

I want to draw a vertical line at the median of each group in my data on top of a histogram. I can do that by first grouping the data, adding a column holding the median with mutate, and then faceting by the group. Here is some code to do that:
library(tidyverse)
N = 1000
m = c(1,5,10)
z = c('A','B','C')
d<-map2_dfr(m,z, ~data.frame(x = rbeta(N,shape1 =.x, shape2 = 20), z = .y))
d %>%
group_by(z) %>%
mutate(med = median(x)) %>%
ungroup %>%
ggplot(aes(x, fill = z))+
geom_histogram(aes(y = ..density..),bins = 10,color = 'black')+
geom_vline(aes(xintercept = med))+
facet_wrap(~z)
Since the median is a statistical summary, can I achieve the same result using stat_summary or stat_function with geom="vline"?

Yes, you can; there are just a few tricks to it.
Since stat_summary calculates a summary over y for every x, we'll need to fool the function by giving it a dummy x-variable and supply the input for the histogram as y. I've found that giving a dummy-x that is within the range of the data works best, since then it does not affect the axis limits.
In the code below, assume d is the data frame generated by your code.
ggplot(d, aes(x, fill = z)) +
geom_histogram(aes(y = ..density..), bins = 10, colour = "black") +
# fun and after_stat() need ggplot2 >= 3.3.0; older versions used fun.y and stat()
stat_summary(aes(x = 0.1, y = x, xintercept = after_stat(y), group = z),
fun = median, geom = "vline") +
facet_wrap(~ z)
As compared to the original plot:
d %>%
group_by(z) %>%
mutate(med = median(x)) %>%
ungroup %>%
ggplot(aes(x, fill = z))+
geom_histogram(aes(y = ..density..),bins = 10,color = 'black')+
geom_vline(aes(xintercept = med))+
facet_wrap(~z)
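To see what the dummy-x trick computes, here is a minimal sketch that rebuilds similar data with base R and pulls the computed layer data back out of the plot. It assumes ggplot2 3.3.0 or later (for fun and after_stat()); the object names meds, p and vlines are only illustrative.

```r
library(ggplot2)

set.seed(42)
N <- 1000
# Same three beta-distributed groups as in the question, built with base R
d <- do.call(rbind, Map(
  function(m, z) data.frame(x = rbeta(N, shape1 = m, shape2 = 20), z = z),
  c(1, 5, 10), c("A", "B", "C")
))

# Reference values: one median per group
meds <- tapply(d$x, d$z, median)

p <- ggplot(d, aes(x, fill = z)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10, colour = "black") +
  # Constant dummy x, so each panel yields exactly one summary row
  stat_summary(aes(x = 0.1, y = x, xintercept = after_stat(y), group = z),
               fun = median, geom = "vline") +
  facet_wrap(~ z)

# Layer 2 is the stat_summary layer; its xintercept column holds the medians
vlines <- layer_data(p, 2)
```

If the trick works as described, vlines has one row per panel and its xintercept values agree with meds.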

Related

How to plot stat_mean for scatterplot in R ggplot2?

For each treatment tmt, I want to plot the means using stat_summary in ggplot2 with a different colour and size. I find that multiple means are being plotted over the current points. I'm not sure how to rectify it.
df <- data.frame(x = rnorm(12, 4,1), y = rnorm(12, 6,4), tmt = rep(c("A","B","C"), each = 4))
ggplot(aes(x = x, y = y, fill = tmt), data = df) +
geom_point(shape=21, size=5, alpha = 0.6) +
scale_fill_manual(values=c("pink","blue", "purple")) +
stat_summary(aes(fill = tmt), fun = 'mean', geom = 'point', size = 5) +
scale_fill_manual(values=c("pink","blue", "purple"))
Plot without the last two lines of code
Plot with the entire code

With stat_summary you compute the mean of y for each pair of x and tmt. If you want the mean of x and the mean of y per tmt, I would suggest computing the means manually outside of ggplot and using a second geom_point to plot them. In my code below I increased the size and used squares (shape 22) for the means:
df <- data.frame(x = rnorm(12, 4,1), y = rnorm(12, 6,4), tmt = rep(c("A","B","C"), each = 4))
library(ggplot2)
library(dplyr)
df_mean <- df |>
group_by(tmt) |>
summarise(across(c(x, y), mean))
ggplot(aes(x = x, y = y, fill = tmt), data = df) +
geom_point(shape=21, size=5, alpha = 0.6) +
geom_point(data = df_mean, shape=22, size=8, alpha = 0.6) +
scale_fill_manual(values=c("pink","blue", "purple"))
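The difference between the two groupings can be checked without plotting. This is a small base R sketch (the names per_x_tmt and per_tmt are mine): with a continuous x, practically every (x, tmt) pair is unique, so a per-pair mean just reproduces each point, whereas grouping by tmt alone gives one mean per treatment.

```r
set.seed(1)
df <- data.frame(x = rnorm(12, 4, 1), y = rnorm(12, 6, 4),
                 tmt = rep(c("A", "B", "C"), each = 4))

# What stat_summary effectively does: mean of y within each (x, tmt) pair
per_x_tmt <- aggregate(y ~ x + tmt, data = df, FUN = mean)

# What the question is after: mean of x and of y within each treatment
per_tmt <- aggregate(cbind(x, y) ~ tmt, data = df, FUN = mean)

nrow(per_x_tmt)  # one row per original point
nrow(per_tmt)    # one row per treatment
```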

How to plot multiple mean lines in a single histogram with multiple groups present?

I am plotting the distributions of two variables on a single histogram. I am interested in highlighting each distribution's mean value on that graph with a dotted line or something similar (but hopefully something that matches the colors already set in the aes section of the code).
How would I do that?
This is my code so far.
hist_plot <- ggplot(data, aes(x= value, fill= type, color = type)) +
geom_histogram(position="identity", alpha=0.2) +
labs( x = "Value", y = "Count", fill = "Type", title = "Title") +
guides(color = FALSE)
Also, is there any way to show the count of n for each type on this graph?

I've made some reproducible code that might help you with your problem.
library(tidyverse)
# Generate some random data
df <- data.frame(value = c(runif(50, 0.5, 1), runif(50, 1, 1.5)),
type = c(rep("type1", 50), rep("type2", 50)))
# Calculate means from df
stats <- df %>% group_by(type) %>% summarise(mean = mean(value),
n = n())
# Make the ggplot
ggplot(df, aes(x= value, fill= type, color = type)) +
geom_histogram(position="identity", alpha=0.2) +
labs(x = "Value", y = "Count", fill = "Type", title = "Title") +
guides(color = "none") + # "none" replaces the deprecated FALSE (ggplot2 >= 3.3.4)
geom_vline(data = stats, aes(xintercept = mean, color = type), size = 2) +
geom_text(data = stats, aes(x = mean, y = max(df$value), label = n),
size = 10,
color = "black")
If things go as intended, you'll end up with something akin to the following plot.
histogram with means
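The question also asked for a dotted line and a visible n. As a variation on the code above, here is a minimal sketch using linetype and a label pinned to the top of the panel. The label text and the vjust/hjust offsets are my own choices, not part of the original answer, and guides(color = "none") assumes ggplot2 3.3.4 or later.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(value = c(runif(50, 0.5, 1), runif(50, 1, 1.5)),
                 type = rep(c("type1", "type2"), each = 50))

# Per-type mean and count, computed with base R
means <- tapply(df$value, df$type, mean)
ns <- tapply(df$value, df$type, length)
stats <- data.frame(type = names(means),
                    mean = as.numeric(means),
                    n = as.integer(ns))

p <- ggplot(df, aes(x = value, fill = type, color = type)) +
  geom_histogram(position = "identity", alpha = 0.2, bins = 30) +
  # Dotted line at each mean, coloured by type
  geom_vline(data = stats, aes(xintercept = mean, color = type),
             linetype = "dotted") +
  # y = Inf pins the label to the top edge; vjust nudges it back inside
  geom_text(data = stats, aes(x = mean, y = Inf, label = paste0("n = ", n)),
            vjust = 1.5, hjust = -0.1, color = "black", inherit.aes = FALSE) +
  labs(x = "Value", y = "Count", fill = "Type", title = "Title") +
  guides(color = "none")
```

Pinning the label with y = Inf avoids having to know the tallest bar in advance, since the y-axis here is a count rather than a data value.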

How to add the x intercept from geom_vline in density plot as a label?

I'm creating a density plot and faceting it.
Let's say I have 4 groups for a variable; faceting by this variable gives me 4 density plots.
For each density plot, I add a vertical line that represents the mean.
However, one has to eyeball the intersection between the x-axis and the vertical line to see roughly what the mean is.
What I want is to also show the mean as a label in the plotting area of each density plot.
Example code below:
x <- rnorm(n = 100, mean = 10, sd = 1)
y <- rnorm(n = 100, mean = 20, sd = 1)
z <- rnorm(n = 100, mean = 40, sd = 1)
df <- as_tibble(cbind(
c(x,y,z),
c(rep('x',length(x)), rep('y',length(y)), rep('z',length(z))),
c(rep('a',length(x)/2), rep('b',length(x)/2))))
df$V1 <- as.numeric(df$V1)
df <- df %>% group_by(V2, V3) %>%
summarise(mumean = mean(V1)) %>%
right_join(df)
df %>%
ggplot(aes(x = V1, color = V2)) +
geom_density(aes(fill = V2)) + facet_grid(V3 ~ V2) + theme_bw() +
geom_vline(data = df, aes(xintercept = mumean))

One approach is to pre-calculate the means and use them to feed geom_vline and geom_text:
library(dplyr)
iris_means <- iris %>%
group_by(Species) %>%
summarize(mean = mean(Sepal.Length))
ggplot(iris, aes(Sepal.Length)) +
geom_density() +
geom_vline(data = iris_means, aes(xintercept = mean)) +
geom_text(data = iris_means, aes(x = mean, label = mean),
y = 0.1, angle = 90, vjust = -0.2) +
facet_wrap(~Species)
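With label = mean, the text is printed at whatever precision the mean happens to have. A small sketch of the same idea with a separately formatted label column follows; it uses base R's aggregate instead of dplyr, and the column name label is illustrative.

```r
library(ggplot2)

# One mean per species, plus a two-decimal text label for display
iris_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
names(iris_means)[2] <- "mean"
iris_means$label <- sprintf("%.2f", iris_means$mean)

p <- ggplot(iris, aes(Sepal.Length)) +
  geom_density() +
  geom_vline(data = iris_means, aes(xintercept = mean)) +
  # The line still sits at the exact mean; only the text is rounded
  geom_text(data = iris_means, aes(x = mean, label = label),
            y = 0.1, angle = 90, vjust = -0.2) +
  facet_wrap(~ Species)
```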

ggplot: labelling geom_smooth / stat_smooth values at correct value

I'm trying to get labels to line up with the values from a smooth line. While other answers I've seen suggest creating a data column of predicted values, I'm looking for a cleaner alternative that uses the data that is already being produced for the ggplot.
See example below for the problem:
require(tidyverse)
require(ggrepel)
set.seed(1)
df <- data.frame(x = rep(1:100, 5), y = c(sample(1:20, 100, T), sample(21:40, 100, T), sample(41:60, 100, T), sample(61:80, 100, T), sample(81:100, 100, T)), group = rep(letters[1:5], each = 100))
df <- tbl_df(df)
df %>%
ggplot(aes(x = x, y = y, label = group, color = group)) +
geom_smooth() +
guides(color = F) +
geom_text_repel(data = . %>% filter(x == max(x)), aes(x = x, y = y, label = group), nudge_x = 50)
Is there some way to get the smooth line value at max(x) without using ggplot_build() or another external, multi-step approach?

I'm not sure if this is really more elegant, but it's all in one pipe. I didn't have the "repel" version handy, but the idea is the same.
library(broom)
df %>%
{ggplot(., aes(x, y, label = group, color = group)) +
geom_smooth() +
guides(color = F) +
geom_text(data = group_by(., group) %>%
do(augment(loess(y~x, .))) %>%
filter(x == max(x)),
aes(x, .fitted), nudge_x = 5)}
You need to get the prediction of the loess smoother at that final x value, so you just have to fit it twice. If the model-fitting is slow, you can do that once, higher in the dplyr chain, and just use the output for the rest of the figure.
df %>%
group_by(group) %>%
do(augment(loess(y~x, .))) %>%
{ggplot(., aes(x, y, label = group, color = group)) +
geom_smooth() +
guides(color = F) +
geom_text(data = filter(., x == max(x)),
aes(x, .fitted), nudge_x = 5)}
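If broom is not available, the label positions can also be computed with base R alone. This is a sketch under the assumption that geom_smooth() is using its default loess fit (which, as far as I know, it does for groups of fewer than 1000 observations); label_pos is an illustrative name.

```r
set.seed(1)
df <- data.frame(
  x = rep(1:100, 5),
  y = c(sample(1:20, 100, TRUE), sample(21:40, 100, TRUE),
        sample(41:60, 100, TRUE), sample(61:80, 100, TRUE),
        sample(81:100, 100, TRUE)),
  group = rep(letters[1:5], each = 100)
)

# Fit one loess per group and predict only at that group's largest x
label_pos <- do.call(rbind, lapply(split(df, df$group), function(g) {
  fit <- loess(y ~ x, data = g)
  x_max <- max(g$x)
  data.frame(group = g$group[1], x = x_max,
             .fitted = unname(predict(fit, newdata = data.frame(x = x_max))))
}))
```

label_pos has one row per group and can be passed as the data argument of geom_text() with aes(x, .fitted, label = group).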

ggplot2: histogram with normal curve

I've been trying to superimpose a normal curve over my histogram with ggplot2.
My code:
data <- read.csv (path...)
ggplot(data, aes(V2)) +
geom_histogram(alpha=0.3, fill='white', colour='black', binwidth=.04)
I tried several things:
+ stat_function(fun=dnorm)
....didn't change anything
+ stat_density(geom = "line", colour = "red")
...gave me a straight red line on the x-axis.
+ geom_density()
doesn't work for me because I want to keep my frequency values on the y-axis, and want no density values.
Any suggestions?
Solution found!
+geom_density(aes(y=0.045*..count..), colour="black", adjust=4)

Think I got it:
library(ggplot2)
set.seed(1)
df <- data.frame(PF = 10*rnorm(1000))
ggplot(df, aes(x = PF)) +
geom_histogram(aes(y =..density..),
breaks = seq(-50, 50, by = 10),
colour = "black",
fill = "white") +
stat_function(fun = dnorm, args = list(mean = mean(df$PF), sd = sd(df$PF)))

This has been answered here and partially here.
The area under a density curve equals 1, and the area under the histogram equals the width of the bars times the sum of their heights, i.e. the binwidth times the total number of non-missing observations. To fit both on the same graph, one or the other needs to be rescaled so that their areas match.
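The area argument can be checked numerically. Here is a minimal base R sketch on simulated data (the variable names are mine), using hist() only to obtain the bin counts:

```r
set.seed(1)
x <- rnorm(350, mean = 20, sd = 5)
bw <- 2
n_obs <- sum(!is.na(x))

# Bin edges on multiples of bw that cover the whole sample
breaks <- seq(floor(min(x) / bw) * bw, ceiling(max(x) / bw) * bw, by = bw)
h <- hist(x, breaks = breaks, plot = FALSE)

area_counts  <- sum(h$counts * bw)    # equals bw * n_obs
area_density <- sum(h$density * bw)   # equals 1

# A normal density multiplied by bw * n_obs has (almost exactly) the same
# area as the count histogram; integrate over mean +/- 10 sd
scaled <- function(v) dnorm(v, mean(x), sd(x)) * bw * n_obs
area_curve <- integrate(scaled, mean(x) - 10 * sd(x), mean(x) + 10 * sd(x))$value
```

So multiplying the density curve by bw * n_obs, or dividing the counts by it, puts both on the same scale.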
If you want the y-axis to have frequency counts, there are a number of options:
First simulate some data.
library(ggplot2)
set.seed(1)
dat_hist <- data.frame(
group = c(rep("A", 200), rep("B",150)),
value = c(rnorm(200, 20, 5), rnorm(150,25,10)))
# Set desired binwidth and number of non-missing obs
bw = 2
n_obs = sum(!is.na(dat_hist$value))
Option 1: Plot both histogram and density curve as density and then rescale the y axis
This is perhaps the easiest approach for a single histogram.
Using the approach suggested by Carlos, plot both histogram and density curve as density
g <- ggplot(dat_hist, aes(value)) +
geom_histogram(aes(y = ..density..), binwidth = bw, colour = "black") +
stat_function(fun = dnorm, args = list(mean = mean(dat_hist$value), sd = sd(dat_hist$value)))
And then rescale the y axis.
ybreaks = seq(0,50,5)
## On primary axis
g + scale_y_continuous("Counts", breaks = round(ybreaks / (bw * n_obs),3), labels = ybreaks)
## Or on secondary axis
g + scale_y_continuous("Density", sec.axis = sec_axis(
trans = ~ . * bw * n_obs, name = "Counts", breaks = ybreaks))
Option 2: Rescale the density curve using stat_function
With code tidied as per PatrickT's answer.
ggplot(dat_hist, aes(value)) +
geom_histogram(colour = "black", binwidth = bw) +
stat_function(fun = function(x)
dnorm(x, mean = mean(dat_hist$value), sd = sd(dat_hist$value)) * bw * n_obs)
Option 3: Create an external dataset and plot using geom_line.
Unlike the above options, this one works with facets. (EDITED to provide a dplyr rather than plyr based solution.) Note that the summarised dataset is used as the primary data for the plot, and the raw data is passed in only for the histogram layer.
library(tidyverse)
dat_hist %>%
group_by(group) %>%
nest(data = c(value)) %>%
mutate(y = map(data, ~ dnorm(
.$value, mean = mean(.$value), sd = sd(.$value)
) * bw * sum(!is.na(.$value)))) %>%
unnest(c(data,y)) %>%
ggplot(aes(x = value)) +
geom_histogram(data = dat_hist, binwidth = bw, colour = "black") +
geom_line(aes(y = y)) +
facet_wrap(~ group)
Option 4: Create external functions to edit the data on the fly
A bit over the top perhaps, but might be useful for someone?
## Function to create scaled dnorm data along full x axis range
dnorm_scaled <- function(data, x = NULL, binwidth = 1, xlim = NULL) {
.x <- na.omit(data[,x])
if(is.null(xlim))
xlim = c(min(.x), max(.x))
x_range = seq(xlim[1], xlim[2], length.out = 101)
setNames(
data.frame(
x = x_range,
y = dnorm(x_range, mean = mean(.x), sd = sd(.x)) * length(.x) * binwidth),
c(x, "y"))
}
## Function to apply over groups
dnorm_scaled_group <- function(data, x = NULL, group = NULL, binwidth = NULL, xlim = NULL) {
dat_hists <- lapply(
split(data, data[, group]), dnorm_scaled,
x = x, binwidth = binwidth, xlim = xlim)
for(g in names(dat_hists))
dat_hists[[g]][, "group"] <- g
setNames(do.call(rbind, dat_hists), c(x, "y", group))
}
## Single histogram
ggplot(dat_hist, aes(value)) +
geom_histogram(binwidth = bw, colour = "black") +
geom_line(data = ~ dnorm_scaled(., "value", binwidth = bw),
aes(y = y))
## With a single faceting variable
ggplot(dat_hist, aes(value)) +
geom_histogram(binwidth = 2, colour = "black") +
geom_line(data = ~ dnorm_scaled_group(
., x = "value", group = "group", binwidth = 2, xlim = c(0,50)),
aes(y = y)) +
facet_wrap(~ group)

This is an extended comment on JWilliman's answer. I found J's answer very useful. While playing around I discovered a way to simplify the code. I'm not saying it is a better way, but I thought I would mention it.
Note that JWilliman's answer provides the count on the y-axis and a "hack" to scale the corresponding density normal approximation (which otherwise would cover a total area of 1 and have therefore a much lower peak).
Main point of this comment: simpler syntax inside stat_function, by passing the needed parameters to the aesthetics function, e.g.
aes(x = x, mean = 0, sd = 1, binwidth = 0.3, n = 1000)
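This avoids having to pass args = to stat_function and is therefore more user-friendly. One caveat: as far as I can tell, stat_function does not forward these extra aesthetics to fun; the anonymous function below finds mean, sd, n and binwidth as ordinary variables in the calling environment, so the code should work the same without them in aes(). Okay, it's not very different, but hopefully someone will find it interesting.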
# parameters that will be passed to ``stat_function``
n = 1000
mean = 0
sd = 1
binwidth = 0.3 # passed to geom_histogram and stat_function
set.seed(1)
df <- data.frame(x = rnorm(n, mean, sd))
ggplot(df, aes(x = x, mean = mean, sd = sd, binwidth = binwidth, n = n)) +
theme_bw() +
geom_histogram(binwidth = binwidth,
colour = "white", fill = "cornflowerblue", size = 0.1) +
stat_function(fun = function(x) dnorm(x, mean = mean, sd = sd) * n * binwidth,
color = "darkred", size = 1)
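The anonymous function passed to stat_function looks up mean, sd, n and binwidth by name when it is called. One way to make that dependence explicit is a small function factory, sketched below; scaled_dnorm is a hypothetical helper name, not part of ggplot2.

```r
# Hypothetical helper: returns a function of x with the parameters baked in
scaled_dnorm <- function(mean, sd, n, binwidth) {
  # Evaluate the arguments now so later changes to globals have no effect
  force(mean); force(sd); force(n); force(binwidth)
  function(x) dnorm(x, mean = mean, sd = sd) * n * binwidth
}

curve_fun <- scaled_dnorm(mean = 0, sd = 1, n = 1000, binwidth = 0.3)
curve_fun(0)  # peak height of the scaled curve, in counts per bin
```

The result can be used as stat_function(fun = curve_fun), and it keeps working even if variables called mean or sd are later reassigned.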

This code should do it:
set.seed(1)
z <- rnorm(1000)
qplot(z, geom = "blank") +
geom_histogram(aes(y = ..density..)) +
stat_density(geom = "line", aes(colour = "bla")) +
stat_function(fun = dnorm, aes(x = z, colour = "blabla")) +
scale_colour_manual(name = "", values = c("red", "green"),
breaks = c("bla", "blabla"),
labels = c("kernel_est", "norm_curv")) +
theme(legend.position = "bottom", legend.direction = "horizontal")
Note: I used qplot but you can use the more versatile ggplot.

Here's a tidyverse informed version:
Setup
library(tidyverse)
Some data
d <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/speed_gender_height.csv")
Preparing data
We'll use a "total" histogram for the whole sample. To that end, we need to remove the grouping information from the data.
d2 <-
d |>
select(-gender)
Here's a data set with summary data:
d_summary <-
d %>%
group_by(gender) %>%
summarise(height_m = mean(height, na.rm = T),
height_sd = sd(height, na.rm = T))
d_summary
Plot it
d %>%
ggplot() +
aes() +
geom_histogram(aes(y = ..density.., x = height, fill = gender)) +
facet_wrap(~ gender) +
geom_histogram(data = d2, aes(y = ..density.., x = height),
alpha = .5) +
stat_function(data = d_summary %>% filter(gender == "female"),
fun = dnorm,
#color = "red",
args = list(mean = filter(d_summary,
gender == "female")$height_m,
sd = filter(d_summary,
gender == "female")$height_sd)) +
stat_function(data = d_summary %>% filter(gender == "male"),
fun = dnorm,
#color = "red",
args = list(mean = filter(d_summary,
gender == "male")$height_m,
sd = filter(d_summary,
gender == "male")$height_sd)) +
theme(legend.position = "none",
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
labs(title = "Facetted histograms with overlaid normal curves",
caption = "The grey histograms show the whole distribution over both groups, i.e. females and males") +
scale_fill_brewer(type = "qual", palette = "Set1")
