Display mean and median on two ggplot histograms - r

I'm a new stackoverflow user and can't comment currently on the original post to ask a question. I came across a previous stackoverflow answer (https://stackoverflow.com/a/34045068/11799491) and I was wondering how you would add two vertical lines (mean of the group and median of the group) to this graph here.
My attempt: I don't know how to add in the group variable "type"
geom_vline(aes(xintercept = mean(diff), ), color="black") +
geom_vline(aes(xintercept = median(diff), ), color="red")

There are a few different ways to do this, but I like creating a separate summarized data frame and then passing that into the geom_vline call. This lets you analyze the results and makes it easy to add multiple lines that are automatically sorted and colored by type:
library(tidyverse)
df <-
tibble(
x = rnorm(40),
category = rep(c(0, 1), each = 20)
)
df_stats <-
df %>%
group_by(category) %>%
summarize(
mean = mean(x),
median = median(x)
) %>%
gather(key = key, value = value, mean:median)
df %>%
ggplot(aes(x = x)) +
geom_histogram(bins = 20) +
facet_wrap(~ category) +
geom_vline(data = df_stats, aes(xintercept = value, color = key))
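As a side note, gather() has been superseded in tidyr by pivot_longer() (tidyr 1.0.0 and later). A minimal sketch of the same summary step with the newer verb, assuming the same simulated data as above; the fixed seed and the column name stat are illustrative choices, not part of the original answer:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible

df <- tibble(
  x = rnorm(40),
  category = rep(c(0, 1), each = 20)
)

# One row per category and statistic (long format)
df_stats <- df %>%
  group_by(category) %>%
  summarize(mean = mean(x), median = median(x)) %>%
  pivot_longer(c(mean, median), names_to = "stat", values_to = "value")

# Each facet only receives the lines whose category matches it
p <- ggplot(df, aes(x = x)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ category) +
  geom_vline(data = df_stats, aes(xintercept = value, color = stat))
```

Because df_stats carries the faceting column, each panel gets only its own mean and median line.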

The easiest way is to pre-compute the means and the medians by groups of type. I will do it with aggregate.
agg <- aggregate(diff ~ type, data, function(x) {
c(mean = mean(x), median = median(x))
})
agg <- cbind(agg[1], agg[[2]])
agg <- reshape2::melt(agg, id.vars = "type")
library(ggplot2)
ggplot(data, aes(x = diff)) +
geom_histogram() +
geom_vline(data = agg, mapping = aes(xintercept = value,
color = variable)) +
facet_grid(~type) +
theme_bw()

Related

Is it possible to adjust a second graph to a second y-axis in ggplot?

I am trying to make several bar plots with their standard errors added to the plot. Adding a second y-axis was not that hard, but now I also want my standard errors to be drawn against this new y-axis. I know that I can manipulate the y-axis, but that is not really what I want: the standard errors themselves should fit the new axis. To illustrate, this is the plot I have now, where I just divided the first y-axis by 100.
What I want is something more like this for all bar plots (this was done for the first bar plot using Excel). Here is my code:
df_bar <- as.data.frame(
rbind(
c('g1', 0.945131015, 1.083188828, 1.040164338,
1.115716593, 0.947886795),
c('g2', 1.393211286, 1.264193745, 1.463434395,
1.298126006, 1.112718796),
c('g3', 1.509976099, 1.450923745, 1.455102201,
1.280102338, 1.462689245),
c('g4', 1.591697668, 1.326292649, 1.767207296,
1.623619341, 2.528108183),
c('g5', 2.625114848, 2.164050167, 2.092843287,
2.301950359, 2.352736806)
)
)
colnames(df_bar)<-c('interval', 'lvl3.Mellem.Høj', 'lvl1.Lav', 'TOM',
',lvl4.Høj', 'lvl2.Lav.Mellem')
df_bar <- melt(df_bar, id.vars = "interval",
variable.name = "name",
value.name = "value")
df_line <- as.data.frame(
rbind(
c('g1', 0.0212972, 0.0164494, 0.0188898, 0.01888982,
0.03035883),
c('g2', 0.0195600, 0.0163811, 0.0188747, 0.01887467,
0.03548092),
c('g3', 0.0192249, 0.0161914, 0.02215852, 0.02267605,
0.03426538),
c('g4', 0.0187961, 0.0180842, 0.01962371, 0.02103450,
0.03902890),
c('g5', 0.0209987, 0.0164596, 0.01838280, 0.02282300,
0.03516818)
)
)
colnames(df_line)<-c('interval', 'lvl3.Mellem.Høj', 'lvl1.Lav', 'TOM',
',lvl4.Høj', 'lvl2.Lav.Mellem')
df_line <- melt(df_line, id.vars = "interval",
variable.name = "name",
value.name = "sd")
df <- inner_join(df_bar,df_line, by=c("interval", "name"))
df %>%
mutate(value = as.numeric(value)) %>%
mutate(sd = as.numeric(sd)) %>%
mutate(interval = as.factor(interval)) %>%
mutate(name = as.factor(name)) %>%
ggplot() +
geom_bar(aes(x = interval, y = value, fill = interval), stat = "identity") +
geom_line(aes(x = interval, y = sd, group = 1),
color = "black", size = .75) +
scale_y_continuous("Value", sec.axis = sec_axis(~ . /100, name = "sd")) +
facet_grid(~name, scales = "free") +
theme_bw() + theme(legend.position = "none") +
xlab("Interval") + ylab("Value") +
labs(caption = "Black line indicates standard deviation.")
Thanks in advance..
As described in this example, you also have to transform your sd values to match the scale of the second axis. In your example you divided the axis by 100, so you have to multiply sd by 100, as shown below:
library(tidyverse)
library(data.table)
df %>%
mutate(value = as.numeric(value)) %>%
mutate(sd = as.numeric(sd)) %>%
mutate(interval = as.factor(interval)) %>%
mutate(name = as.factor(name)) %>%
ggplot() +
geom_bar(aes(x = interval, y = value, fill = interval), stat = "identity") +
scale_y_continuous("Value", sec.axis = sec_axis(~ ./100, name = "sd"))+
geom_line(aes(x = interval, y = sd*100, group = 1),
color = "black", size = .75)+
facet_grid(~name, scales = "free")+
theme_bw() + theme(legend.position = "none") +
xlab("Interval") + ylab("Value") +
labs(caption = "Black line indicates standard deviation.")
You can also use a different value to scale your second axis. In this example I used 50 as a scaling factor, which in my opinion looks a bit better:
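A minimal sketch of that idea, assuming only one of the question's series (the 'lvl1.Lav' column) to keep it short. The name scale_factor is hypothetical; the point is that the same number is used to multiply the sd values and to divide the secondary axis, so the two always stay in sync:

```r
library(ggplot2)

# Values copied from the question (column 'lvl1.Lav' only)
df_small <- data.frame(
  interval = c("g1", "g2", "g3", "g4", "g5"),
  value = c(1.083188828, 1.264193745, 1.450923745, 1.326292649, 2.164050167),
  sd = c(0.0164494, 0.0163811, 0.0161914, 0.0180842, 0.0164596)
)

scale_factor <- 50  # hypothetical name; any positive number works

p <- ggplot(df_small, aes(x = interval)) +
  geom_col(aes(y = value, fill = interval)) +
  # Stretch sd onto the primary axis ...
  geom_line(aes(y = sd * scale_factor, group = 1)) +
  # ... and undo the stretch when labelling the secondary axis
  scale_y_continuous("Value",
                     sec.axis = sec_axis(~ . / scale_factor, name = "sd")) +
  theme_bw() +
  theme(legend.position = "none")
```

Changing scale_factor moves the line up or down relative to the bars without changing the values read off the secondary axis.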
Created on 2022-08-25 with reprex v2.0.2

ggplot: How to show density instead of count in grouped bar plot with facet_wrap?

The dataframe consists of two factor variables: cls with 3 levels and subset with 2 levels. I want to compare how much of each class (cls) there is in both groups of subset. I want to show percentages on the y-axis. They should be computed within each subset group, not over the whole dataset.
library(tidyverse)
data = data.frame(
x = rnorm(1000),
cls = factor(c(rep("A", 200), rep("B", 300), rep("C", 500))),
subset = factor(c(rep("train", 900), rep("test", 100)))
)
This was my attempt to show percentages, but it failed because they are computed within whole dataset instead of subset group:
ggplot(data, aes(x = cls, fill = cls)) + geom_bar(aes(y = ..count.. / sum(..count..))) + facet_wrap(~subset)
How can I fix it?
Edit related to the accepted answer:
plot_train_vs_test = function(data, var, subset_colname){
plot_data = data %>%
count(var, eval(subset_colname)) %>%
group_by(eval(subset_colname)) %>%
mutate(perc = n/sum(n))
ggplot(plot_data, aes(x = var, y = perc, fill = var)) +
geom_col() +
scale_y_continuous(labels = scales::label_percent()) +
facet_wrap(~eval(subset_colname))
}
plot_train_vs_test(data, "cls", "subset")
Results in errors.
One option and easy fix would be to compute the percentages outside of ggplot and plot the summarized data:
library(ggplot2)
library(dplyr, warn = FALSE)
set.seed(123)
data <- data.frame(
x = rnorm(1000),
cls = factor(c(rep("A", 200), rep("B", 300), rep("C", 500))),
subset = factor(c(rep("train", 900), rep("test", 100)))
)
data_sum <- data %>%
count(cls, subset) %>%
group_by(subset) %>%
mutate(pct = n / sum(n))
ggplot(data_sum, aes(x = cls, y = pct, fill = cls)) +
geom_col() +
scale_y_continuous(labels = scales::label_percent()) +
facet_wrap(~subset)
EDIT One approach to put the code in a function may look like so:
plot_train_vs_test <- function(.data, x, facet) {
.data_sum <- .data %>%
count({{ x }}, {{ facet }}) %>%
group_by({{ facet }}) %>%
mutate(pct = n / sum(n))
ggplot(.data_sum, aes(x = {{ x }}, y = pct, fill = {{ x }})) +
geom_col() +
scale_y_continuous(labels = scales::label_percent()) +
facet_wrap(vars({{ facet }}))
}
plot_train_vs_test(data, cls, subset)
For more details, especially on the {{ operator, see Programming with dplyr, Programming with ggplot2, and Best practices for programming with ggplot2.
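The function in the question was called with column names as strings ("cls", "subset"). A hedged sketch of a string-based variant using the .data pronoun from tidy evaluation; the function names share_by_facet and plot_train_vs_test_chr are hypothetical, and the data is the simulated data from the question:

```r
library(dplyr)
library(ggplot2)

set.seed(123)
data <- data.frame(
  x = rnorm(1000),
  cls = factor(c(rep("A", 200), rep("B", 300), rep("C", 500))),
  subset = factor(c(rep("train", 900), rep("test", 100)))
)

# Share of each level of `x` within each level of `facet`;
# both arguments are column names given as strings
share_by_facet <- function(data, x, facet) {
  data %>%
    count(.data[[x]], .data[[facet]]) %>%
    group_by(.data[[facet]]) %>%
    mutate(pct = n / sum(n)) %>%
    ungroup()
}

plot_train_vs_test_chr <- function(data, x, facet) {
  ggplot(share_by_facet(data, x, facet),
         aes(x = .data[[x]], y = pct, fill = .data[[x]])) +
    geom_col() +
    scale_y_continuous(labels = scales::label_percent()) +
    facet_wrap(vars(.data[[facet]])) +
    labs(x = x, fill = x)  # readable labels instead of the .data expression
}

shares <- share_by_facet(data, "cls", "subset")
p <- plot_train_vs_test_chr(data, "cls", "subset")
```

Splitting the summary into its own function makes it possible to inspect the percentages before plotting them.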

Adding a single label per group in ggplot with stat_summary and text geoms

I would like to add counts to a ggplot that uses stat_summary().
I am having an issue with the requirement that the text vector be the same length as the data.
With the examples below, you can see that what is being plotted is the same label multiple times.
The workaround to set the location on the y axis has the effect that multiple labels are stacked up. The visual effect is a bit strange (particularly when you have thousands of observations) and not sufficiently professional for my purposes. You will have to trust me on this one - the attached picture doesn't fully convey the weirdness of it.
I was wondering if someone else has worked out another way. It is for a plot in shiny that has dynamic input, so text cannot be overlaid in a hardcoded fashion.
I'm pretty sure ggplot wasn't designed for the kind of behaviour with stat_summary that I am looking for, and I may have to abandon stat_summary and create a new summary dataframe, but thought I would first check if someone else has some wizardry to offer up.
This is the plot without setting the y location:
library(dplyr)
library(ggplot2)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
df_x <- df_x %>%
group_by(Group) %>%
mutate(w_count = n())
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
geom_text(aes(label = w_count)) +
coord_flip() +
theme_classic()
and this is with my hack
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
geom_text(aes(y = 1, label = w_count)) +
coord_flip() +
theme_classic()
Create a df_text that has the grouped info for your labels. Then use annotate:
library(dplyr)
library(ggplot2)
set.seed(123)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
df_text <- df_x %>%
group_by(Group) %>%
summarise(avg = mean(Value),
n = n()) %>%
ungroup()
yoff <- 0.0
xoff <- -0.1
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
annotate("text",
x = 1:2 + xoff,
y = df_text$avg + yoff,
label = df_text$n) +
coord_flip() +
theme_classic()
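A variation on the same idea, sketched under the assumption that the group order may change: passing the summary data frame to geom_text() lets ggplot match the labels to the groups by name, so the positions 1:2 do not have to be hardcoded. To avoid requiring the Hmisc package (which mean_cl_boot depends on), this sketch swaps in a plain mean point; that is a simplification, not the original summary:

```r
library(dplyr)
library(ggplot2)

set.seed(123)
df_x <- data.frame(Group = c(rep("A", 1000), rep("B", 2)),
                   Value = rnorm(1002))

# One row per group: label position (mean) and label text (count)
df_text <- df_x %>%
  group_by(Group) %>%
  summarise(avg = mean(Value), n = n())

p <- ggplot(df_x, aes(x = Group, y = Value)) +
  stat_summary(fun = mean, geom = "point", size = 3) +
  # x = Group is inherited, so labels follow the groups if they are reordered
  geom_text(data = df_text, aes(y = avg, label = n), nudge_x = -0.1) +
  coord_flip() +
  theme_classic()
```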
I found another way which is a little more robust for when the plot is dynamic in its ordering and filtering, and works well for faceting. More robust, because it uses stat_summary for the text.
library(dplyr)
library(ggplot2)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
counts_df <- function(y) {
return( data.frame( y = 1, label = paste0('n=', length(y)) ) )
}
p <- ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
coord_flip() +
theme_classic()
p + stat_summary(geom="text", fun.data=counts_df)

ggplot faceted cumulative histogram

I have the following data
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(100, 6, 1))
gender = rep(c("Male", "Female"), each=100)
mydata = data.frame(x=x, gender=gender)
and I want to plot two cumulative histograms (one for males and the other for females) with ggplot.
I have tried the code below
ggplot(data=mydata, aes(x=x, fill=gender)) + stat_bin(aes(y=cumsum(..count..)), geom="bar", breaks=1:10, colour=I("white")) + facet_grid(gender~.)
but I get this chart
that, obviously, is not correct.
How can I get the correct one, like this:
Thanks!
I would pre-compute the cumsum values per bin per group, and then use geom_histogram to plot.
library(tidyverse)
mydata %>%
mutate(x = cut(x, breaks = 1:10, labels = FALSE)) %>% # Bin x
count(gender, x) %>% # Counts per bin per gender
mutate(x = factor(x, levels = 1:10)) %>% # x as factor
complete(x, gender, fill = list(n = 0)) %>% # Fill missing bins with 0
group_by(gender) %>% # Group by gender ...
mutate(y = cumsum(n)) %>% # ... and calculate cumsum
ggplot(aes(x, y, fill = gender)) + # The rest is (gg)plotting
geom_histogram(stat = "identity", colour = "white") +
facet_grid(gender ~ .)
Like @Edo, I also came here looking for exactly this. @Edo's solution was the key for me. It's great. But I post here a few additions that increase the information density and allow comparisons across different situations.
library(ggplot2)
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(50, 6, 1))
gender = c(rep("Male", 100), rep("Female", 50))
grade = rep(1:3, 50)
mydata = data.frame(x=x, gender=gender, grade = grade)
ggplot(mydata, aes(x,
y = ave(after_stat(density), group, FUN = cumsum)*after_stat(width),
group = interaction(gender, grade),
color = gender)) +
geom_line(stat = "bin") +
scale_y_continuous(labels = scales::percent_format()) +
facet_wrap(~grade)
I rescale y so that the cumulative plot always ends at 100%. Otherwise, if the groups are not the same size (in the original example data they are), the cumulative plots have different final heights, which obscures their relative distribution.
Secondly, I use geom_line(stat="bin") instead of geom_histogram() so that I can put more than one line on a panel. This way I can compare them easily.
Finally, because I also want to compare across facets, I need to make sure the ggplot group variable uses more than just color=gender. We set it manually with group = interaction(gender, grade).
Answering a million years later....
I was looking for a solution for the same problem and I got here..
Eventually I figured it out by myself, so I'll drop it here in case other people will ever need it.
As required: no pre-work is necessary!
ggplot(mydata) +
geom_histogram(aes(x = x, y = ave(..count.., group, FUN = cumsum),
fill = gender, group = gender),
colour = "gray70", breaks = 1:10) +
facet_grid(rows = "gender")
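The ..count.. notation is deprecated as of ggplot2 3.4.0 in favour of after_stat(). A sketch of the same plot in the newer notation, assuming the data from the question; wrapping the whole expression in after_stat() is my reading of the equivalent call, not something the answer itself states:

```r
library(ggplot2)

set.seed(123)
x <- c(rnorm(100, 4, 1), rnorm(100, 6, 1))
gender <- rep(c("Male", "Female"), each = 100)
mydata <- data.frame(x = x, gender = gender)

p <- ggplot(mydata) +
  geom_histogram(
    aes(x = x,
        # cumulative sum of the bin counts, computed separately per group
        y = after_stat(ave(count, group, FUN = cumsum)),
        fill = gender, group = gender),
    colour = "gray70", breaks = 1:10
  ) +
  facet_grid(rows = vars(gender))

# Computed layer data, useful for checking the bar heights
built <- layer_data(p, 1)
```

Within each group the bar heights in built$y should never decrease from one bin to the next, which is a quick way to confirm the bars are cumulative.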

How to mark minimum point from ggplot line plot [duplicate]

I am using the built-in economics (from the ggplot2 package) dataset in R, and have plotted a time-series for each variable in the same graph using the following code :
library(reshape2)
library(ggplot2)
me <- melt(economics, id = c("date"))
ggplot(data = me) +
geom_line(aes(x = date, y = value)) +
facet_wrap(~variable, ncol = 1, scales = 'free_y')
Now I want to refine my graph further: for each series, I want to display a red point at the smallest and the largest value.
So I thought if I could find the co-ordinates of the min and max of each time-series, I could find a way to plot a red dot at beginning and ending of each time series. For this I used the following code :
which(pce == min(economics$pce), arr.ind = TRUE)
which(pca == max(pca), arr.ind = TRUE)
This doesn't really lead me anywhere.
Thank you:)
Method 1: Using Joins
This can be nice when you want to save the filtered subsets
library(reshape2)
library(ggplot2)
library(dplyr)
me <- melt(economics, id=c("date"))
me %>%
group_by(variable) %>%
summarise(min = min(value),
max = max(value)) -> me.2
left_join(me, me.2) %>%
mutate(color = value == min | value == max) %>%
filter(color == TRUE) -> me.3
ggplot(data=me, aes(x = date, y = value)) +
geom_line() +
geom_point(data=me.3, aes(x = date, y = value), color = "red") +
facet_wrap(~variable, ncol=1, scales='free_y')
Method 2: Simplified without Joins
Thanks @Gregor
me.2 <- me %>%
group_by(variable) %>%
mutate(color = (min(value) == value | max(value) == value))
ggplot(data=me.2, aes(x = date, y = value)) +
geom_line() +
geom_point(aes(color = color)) +
facet_wrap(~variable, ncol=1, scales="free_y") +
scale_color_manual(values = c(NA, "red"))
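A third variant, sketched here as an alternative rather than as part of the answers above: filter each series down to its extreme rows with range(), then draw only those rows as points. It uses tidyr::pivot_longer() in place of reshape2::melt(); the name extremes is illustrative:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Long format: one row per date and series
me <- economics %>%
  pivot_longer(-date, names_to = "variable", values_to = "value")

# Keep only the rows holding each series' minimum or maximum
# (ties, if any, are all kept)
extremes <- me %>%
  group_by(variable) %>%
  filter(value %in% range(value)) %>%
  ungroup()

p <- ggplot(me, aes(x = date, y = value)) +
  geom_line() +
  geom_point(data = extremes, color = "red") +
  facet_wrap(~ variable, ncol = 1, scales = "free_y")
```

Since extremes contains only the marked rows, no NA colour is needed to hide the remaining points.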
