Adding labels in ggplot for summary statistics - r

About 18 months ago, this helpful exchange appeared, with code to show how to produce a plot of median along with interquartile ranges. Here's the code:
ggplot(data = diamonds) +
  geom_pointrange(mapping = aes(x = cut, y = depth),
                  stat = "summary",
                  fun.ymin = function(z) {quantile(z, 0.25)},
                  fun.ymax = function(z) {quantile(z, 0.75)},
                  fun.y = median)
Producing this plot:
What I'm wondering is how to add labels for the median and interquartile range, and how to format the bar (color, alpha, etc.). I tried saving the plot as an object to see whether it contained components I could pass to formatting functions, but nothing was obvious when I inspected it in the RStudio IDE.
Is this even doable? I know I can do a boxplot, but that would have to include min/max. I'd like to produce boxplots with just the mean/median and IQR.

You can change the formatting like you would for any ggplot layer; see the docs for "Vertical intervals: lines, crossbars & errorbars" in this case. An example of this is the following:
library(ggplot2)
ggplot(data = diamonds) +
  geom_pointrange(mapping = aes(x = cut, y = depth),
                  stat = "summary",
                  fun.ymin = function(z) {quantile(z, 0.25)},
                  fun.ymax = function(z) {quantile(z, 0.75)},
                  fun.y = median,
                  size = 4,        # <- adjusts size
                  colour = "red",  # <- adjusts colour
                  alpha = .3)      # <- adjusts transparency
If you want to control formatting for the points and lines individually, you need to do as @camille suggests and pre-process your data, because geom_pointrange() draws a single graphical object, so the points and lines are one and the same.
I would suggest something like this:
library(dplyr)
library(ggplot2)
diamonds %>%
  group_by(cut) %>%
  summarise(median = median(depth),
            lq = quantile(depth, 0.25),
            uq = quantile(depth, 0.75)) %>%
  ggplot(aes(cut, median)) +
  geom_linerange(aes(ymin = lq, ymax = uq), size = 4, colour = "blue", alpha = .4) +
  geom_point(size = 10, colour = "red", alpha = .8)
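The labels asked about in the question can be layered on top of the same pre-summarised data. The following is only a minimal sketch and not part of the original answer: it assumes ggplot2 3.4.0 or later (for the `linewidth` argument; older versions use `size` instead), and the name `depth_summary` and the `nudge_x` offset are illustrative choices.

```r
library(dplyr)
library(ggplot2)

# Summarise once, then reuse the same data frame for the range,
# the point and the three sets of labels.
depth_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(median = median(depth),
            lq = unname(quantile(depth, 0.25)),
            uq = unname(quantile(depth, 0.75)))

p <- ggplot(depth_summary, aes(cut, median)) +
  geom_linerange(aes(ymin = lq, ymax = uq),
                 linewidth = 4, colour = "blue", alpha = .4) +
  geom_point(size = 4, colour = "red") +
  # nudge_x moves each label to the right of its bar
  geom_text(aes(label = median), nudge_x = 0.25, size = 3) +
  geom_text(aes(y = lq, label = lq), nudge_x = 0.25, size = 3) +
  geom_text(aes(y = uq, label = uq), nudge_x = 0.25, size = 3)
p
```

Because each label is its own layer, its colour, size and position can be changed independently of the bar and the point.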

Related

Meansplot with Tukey HSD confidence intervals in R

I'm trying to make a meansplot with confidence intervals, but I would like the intervals to be Tukey HSD intervals after an ANOVA is computed.
I'll use the following example to explain. The data frame contains a factor, poison, with levels {1, 2, 3}.
library(magrittr)
library(ggplot2)
library(ggpubr)
library(dplyr)
library("agricolae")
PATH <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/poisons.csv"
df <- read.csv(PATH) %>%
select(-X) %>%
mutate(poison = factor(poison, ordered = TRUE))
glimpse(df)
ggplot(df, aes(x = poison, y = time, fill = poison)) +
geom_boxplot() +
geom_jitter(shape = 15,
color = "steelblue",
position = position_jitter(0.21)) +
theme_classic()
anova_one_way <- aov(time ~ poison, data = df)
summary(anova_one_way)
# Use TukeyHSD
tukeyHSD <- TukeyHSD(anova_one_way)
plot(tukeyHSD)
I would like the plot to be similar to the one from Statgraphics, where you can see the mean point and the length of the bars is the Tukey HSD interval, so that at a glance you can see which level is best and whether it is statistically significantly better.
I have seen some examples in more complex questions, but they are for boxplots and I don't understand them well enough to adapt the solutions here.
Tukey's results on boxplot in R (example 1)
TukeyHSD results on boxplot after two-way anova (example 2)
Edit:
The answer provided by Allan Cameron (@allan-cameron) is great; however, it doesn't work on my computer as written, probably because of package versions: the stat_summary() argument names have changed. I took his solution and made a couple of changes to make it work for me.
# Allan's original computation
tukeyCI <- (tukeyHSD$poison[1, 1] - tukeyHSD$poison[1, 2]) / 2
# Changed fun.max and fun.min to fun.ymax and fun.ymin
# Changed fun to fun.y to make Allan's solution work for me.
ggplot(df, aes(x = poison, y = time)) +
  stat_summary(fun.ymax = function(x) mean(x) + tukeyCI,
               fun.ymin = function(x) mean(x) - tukeyCI,
               geom = 'errorbar', size = 1, color = 'gray50',
               width = 0.25) +
  stat_summary(fun.y = mean, geom = 'point', size = 4, shape = 21,
               fill = 'white') +
  geom_point(position = position_jitter(width = 0.25), alpha = 0.4,
             color = 'deepskyblue4') +
  theme_minimal(base_size = 16)
The original code produced these warnings:
Warning: Ignoring unknown parameters: fun.max, fun.min
Warning: Ignoring unknown parameters: fun
No summary function supplied, defaulting to `mean_se()`
I'm currently using these versions:
version R version 3.5.2 (2018-12-20)
packageVersion("ggplot2") ‘3.1.0’
packageVersion("dplyr") ‘0.7.8’
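The warnings above are consistent with the argument renaming in stat_summary(): as far as I know, ggplot2 3.3.0 deprecated fun.y, fun.ymin and fun.ymax in favour of fun, fun.min and fun.max, so version 3.1.0 does not recognise the newer names. Below is a small sketch of choosing the names at run time; `has_new_names` and `summary_args` are made-up names for illustration, and mtcars stands in for the poison data.

```r
library(ggplot2)

# TRUE when the installed ggplot2 understands fun / fun.min / fun.max
has_new_names <- packageVersion("ggplot2") >= "3.3.0"

# Build the argument list under whichever names the installed version expects
summary_args <- if (has_new_names) {
  list(fun = mean, fun.min = min, fun.max = max)
} else {
  list(fun.y = mean, fun.ymin = min, fun.ymax = max)
}

# do.call() passes the named list on as ordinary arguments
summary_layer <- do.call(stat_summary,
                         c(summary_args, list(geom = "pointrange")))

p <- ggplot(mtcars, aes(factor(cyl), mpg)) + summary_layer
p
```

Upgrading ggplot2 is the simpler fix where that is possible; the branch is only useful when the same script has to run on both old and new installations.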
The image from statgraphics shows error bars around the mean points, and if I understand you correctly then you want to be able to draw error bars around your mean points such that non-overlapping error bars mean there are significant differences between the variables. That being the case, we can extract the required confidence interval like this:
tukeyCI <- (tukeyHSD$poison[1,1] - tukeyHSD$poison[1,2])/2
And we can draw the result in ggplot like this:
ggplot(df, aes(x = poison, y = time)) +
  stat_summary(fun.max = function(x) mean(x) + tukeyCI,
               fun.min = function(x) mean(x) - tukeyCI,
               geom = 'errorbar', size = 1, color = 'gray50',
               width = 0.25) +
  stat_summary(fun = mean, geom = 'point', size = 4, shape = 21,
               fill = 'white') +
  geom_point(position = position_jitter(width = 0.25), alpha = 0.4,
             color = 'deepskyblue4') +
  theme_minimal(base_size = 16)
Here we can see that there are significant differences between 1 and 3, and between 2 and 3, but that the difference between 1 and 2 is non-significant.
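To make the arithmetic behind tukeyCI easier to follow, here is a sketch on the built-in warpbreaks data (used only because it needs no download; it is not the poison data). It assumes a balanced design, so every pairwise interval has the same width; `half_bar` and `means` are illustrative names.

```r
fit <- aov(breaks ~ tension, data = warpbreaks)

# Matrix with one row per pair and columns diff, lwr, upr, p adj
hsd <- TukeyHSD(fit)$tension

# diff - lwr is the HSD margin for a difference between two means.
# Halving it gives a bar length such that two bars just touch when
# the difference between their means equals that margin.
half_bar <- unname((hsd[1, "diff"] - hsd[1, "lwr"]) / 2)

# Group means with the bar limits that would be drawn around them
means <- aggregate(breaks ~ tension, data = warpbreaks, FUN = mean)
means$lower <- means$breaks - half_bar
means$upper <- means$breaks + half_bar
print(means)
```

With unequal group sizes the margins differ between pairs, so a single bar length would only be an approximation.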

extend line in median and iqr graph

I am trying to make an IQR plot with no min or max and I just want to display the median and IQR. The reason is I am working with sensitive data and showing outliers might identify participants. I have been able to create a version of the plot below but as you can see the "-" that represents the median does not stretch all the way across the bars. Is it possible to get the "-" to be the length of each bar? Other packages to answer the question are welcome, also acceptable is to have the plot made from aggregate or raw data.
library(ggplot2)
library(dplyr)
median_IQR <- function(x) {
data.frame(y = median(x), # Median
ymin = quantile(x)[2], # 1st quartile
ymax = quantile(x)[4]) # 3rd quartile
}
iris <- mutate(iris, let = rep(c("a","b"), nrow(iris)/2))
p <- ggplot(iris, aes(x=Species, y=Sepal.Length, color = let, fill=let ))
posn.d <- position_dodge(width=0.5)
p +
stat_summary(geom = "crossbar",
fun.data = median_IQR,
position = 'dodge') +
stat_summary(geom = "point",
fun = "median",
position = posn.d,
size = 3,
col = "black",
shape = "_")
Rewriting the first stat_summary solution, using the geom function instead of the stat function:
rm(list = ls())
library(ggplot2)
library(dplyr)
iris <- mutate(iris, let = rep(c("a","b"), nrow(iris)/2))
p <- ggplot(iris, aes(x=Species, y=Sepal.Length, color = let, fill=let ))+theme_bw()
p + geom_pointrange(mapping = aes(x=Species, y=Sepal.Length),
stat = "summary",
fun.min = function(z) {quantile(z,0.25)},
fun.max = function(z) {quantile(z,0.75)},
fun = median,
size=1)
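If the crossbar look from the question is preferred, note that geom_crossbar already draws its middle line across the full bar width. The sketch below assumes (this is a guess, not something confirmed in the question) that the median line was hidden because fill and colour were mapped to the same variable, so it maps only colour. `median_iqr` and `iris2` are illustrative names.

```r
library(ggplot2)

# Returns the three values a crossbar needs: y (middle line), ymin, ymax
median_iqr <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.5, 0.75)))
  data.frame(y = q[2], ymin = q[1], ymax = q[3])
}

# Same a/b grouping as in the question, kept in a copy of iris
iris2 <- transform(iris, let = rep(c("a", "b"), length.out = nrow(iris)))

p <- ggplot(iris2, aes(x = Species, y = Sepal.Length, colour = let)) +
  # No fill mapping, so the median line stays visible inside each box
  stat_summary(geom = "crossbar", fun.data = median_iqr,
               width = 0.4, position = position_dodge(width = 0.5))
p
```

Only the median and the quartiles are drawn, so no individual observations or outliers appear in the plot.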
To group the points side by side
p + geom_boxplot()
Otherwise

ggplot median and percentile

I'm trying to replicate this image.
I was able to plot a scatter plot and the median (but it's not continuous).
I failed to plot the percentiles.
The median varies according to different spell length.
ggplot(df,aes(x=Spell.Length,y=Growth.Rate)) +
geom_point() +
stat_summary(fun = median, fun.min = median, fun.max = median,
geom = "crossbar", width = 0.5,colour="red")
What I'm trying to do
What I got so far
Use dplyr::summarize() together with group_by(Spell.Length) to create a data frame of percentile values, then plot those with geom_line(). Add the horizontal lines with geom_hline().
df %>% group_by(Spell.Length) %>%
  summarize(median = quantile(Growth.Rate, probs = .5), q1 = quantile(Growth.Rate, probs = .25)) %>%
  ggplot(aes(x = Spell.Length, y = median)) +
  geom_line() +
  geom_line(aes(x = Spell.Length, y = q1)) +
  geom_hline(yintercept = 3)
That would be the basic idea:
geom_line() for each specific line style/group
geom_hline() for the red lines
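A fuller, runnable variant of that idea might look like the following sketch. The question's data is not shown, so `df` here is simulated and purely hypothetical; the chosen percentiles and the reference line at 3 are also just placeholders.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for the asker's data
df <- data.frame(Spell.Length = rep(1:10, each = 30),
                 Growth.Rate = rnorm(300, mean = 3))

# One row per spell length and percentile, in long format for geom_line()
pct <- df %>%
  group_by(Spell.Length) %>%
  summarise(p25 = unname(quantile(Growth.Rate, 0.25)),
            p50 = unname(quantile(Growth.Rate, 0.50)),
            p75 = unname(quantile(Growth.Rate, 0.75))) %>%
  pivot_longer(-Spell.Length, names_to = "percentile", values_to = "value")

p <- ggplot(df, aes(Spell.Length, Growth.Rate)) +
  geom_point(alpha = 0.3) +
  # linetype both styles and groups the lines, one per percentile
  geom_line(data = pct, aes(y = value, linetype = percentile)) +
  geom_hline(yintercept = 3, colour = "red")
p
```

Reshaping to long format means one geom_line() call covers all percentiles and a legend is produced automatically.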

How to avoid overlapping of labels/texts of boxplot in R?

I am drawing a boxplot along with violin plot to see the distribution of data using ggplot2. The quartiles of the box plot are very close to each other. That's why it causes overlapping.
I used ggrepel::geom_label_repel but, it did not work. If I remove geom_label_repel, some labels overlap.
Here is my R code and a sample data:
dataset <- data.frame(Age = sample(1:20, 100, replace = T))
ggplot(dataset, aes(x = "", y = Age)) +
geom_violin(position = "dodge", width = 1, fill = "blue") +
geom_boxplot(width=0.1, position = "dodge", fill = "red") +
stat_boxplot(geom = "errorbar", width = 0.1) +
stat_summary(geom = "label", fun.y = quantile, aes(label = ..y..),
position = position_nudge(x = -0.05), size = 3) +
ggrepel::geom_label_repel(aes(label = quantile)) +
ggtitle("") +
xlab("") +
ylab("Age")
In addition to this, is anyone familiar with combining a boxplot and a violin plot? The left side of the plot would be the box plot and the right side the violin plot (I am not asking for side-by-side plots, just one plot).
Here is a slightly different approach, without ggrepel. Half a violin plot is actually a classic density plot, just vertical. That's the basis for the plot. I am adding a horizontal box plot with ggstance::geom_boxploth. For the labels, we cannot use stat_summary any more, because we cannot summarise over x values (maybe someone knows how to do that, I don't). So I used this fantastically obscure code by @eipi10 to pre-calculate the quantiles in one go. You can set the position of the boxplot to 0, and just fill the density plot, in order to avoid some real hack with calculating your segments etc.
You can then pretty neatly fine tune your graphs to your liking.
library(tidyverse)
library(ggstance)
#>
#> Attaching package: 'ggstance'
#> The following objects are masked from 'package:ggplot2':
#>
#> geom_errorbarh, GeomErrorbarh
dataset <- data.frame(Age = sample(1:20, 100, replace = T))
my_quant <- dataset %>%
summarise(Age = list(enframe(quantile(Age, probs=c(0.25,0.5,0.75))))) %>%
unnest
my_y <- 0
ggplot(dataset) +
ggstance::geom_boxploth(aes(x = Age, y = my_y), width = .05) +
geom_density(aes(x = Age)) +
annotate(geom = "label", x = my_quant$value, my_y, label = my_quant$value) +
coord_flip()
Now adding a fill.
ggplot(dataset) +
ggstance::geom_boxploth(aes(x = Age, y = my_y), width = .05) +
geom_density(aes(x = Age), fill = 'white') +
annotate(geom = "label", x = my_quant$value, my_y, label = my_quant$value) +
coord_flip()
Created on 2019-07-29 by the reprex package (v0.2.1)
When using the standard R boxplot command, use the text command to add the five statistics to the graph.
Example:
boxplot(arq1$J00_J99, arq1$V01_Y89, horizontal = TRUE)
text(x = boxplot.stats(arq1$J00_J99)$stats,
     labels = boxplot.stats(arq1$J00_J99)$stats, y = 0.5)
text(x = boxplot.stats(arq1$V01_Y89)$stats,
     labels = boxplot.stats(arq1$V01_Y89)$stats, y = 2.5)
This produces one overlap of the labels on the upper boxplot.
To avoid this, call text twice, placing different statistics at different y heights:
text(x = boxplot.stats(arq1$V01_Y89)$stats[2:5],
     labels = boxplot.stats(arq1$V01_Y89)$stats[2:5], y = 2.5)
text(x = boxplot.stats(arq1$V01_Y89)$stats[1],
     labels = boxplot.stats(arq1$V01_Y89)$stats[1], y = 2)
Here I have placed statistics 2 to 5 (1st quartile, median, 3rd quartile and maximum value) at y = 2.5 and the minimum value at y = 2.
The same approach can be used to separate any overlapping statistic labels on a boxplot.
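Since `arq1` is not shown, here is a self-contained sketch of the same idea on built-in data. It alternates label heights instead of moving a single label, which is a variation on the answer rather than the answer's exact method; the `boxwex` value and the y offsets are arbitrary choices. Note that boxplot.stats()$stats returns the whisker ends and hinges, which coincide with the minimum and maximum only when there are no outliers.

```r
# Built-in data stands in for the answer's `arq1`, which is not shown
a <- mtcars$mpg[mtcars$am == 0]
b <- mtcars$mpg[mtcars$am == 1]

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end
stats_a <- boxplot.stats(a)$stats
stats_b <- boxplot.stats(b)$stats

# Narrow boxes leave room for two rows of labels per group
boxplot(a, b, horizontal = TRUE, boxwex = 0.4)

# Alternate between two heights so neighbouring labels do not collide
y_a <- rep(c(0.68, 0.55), length.out = length(stats_a))
y_b <- rep(c(2.32, 2.45), length.out = length(stats_b))
text(x = stats_a, y = y_a, labels = stats_a, cex = 0.8)
text(x = stats_b, y = y_b, labels = stats_b, cex = 0.8)
```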

Removing lower and upper quartiles in boxplot, with connection between whiskers in R

I'm trying to make some different boxplots.
Completely normal boxplot
I can't figure out how to create the boxplot without the lower and upper quartiles, which essentially would be the outliers and the median connected by the whiskers. Something that would look like this:
My attempt
But I need a complete connection between the whiskers with a vertical line.
What I did for the second plot in R was the following:
boxplot(mpg~cyl,data=mtcars, main="Car Milage Data", xlab="Number of Cylinders",
ylab="Miles Per Gallon",col="white",frame=F,medcol = "black", boxlty =0,
whisklty = 1, staplelwd = 1,boxwex=0.4)
Many Thanks.
Here is a way to get what you are looking for using a scatter plot and error bars:
library(tidyverse)
data_summary <- data %>%
group_by(grouping_var) %>%
summarize(median = median(quant_var),
max = max(quant_var),
min = min(quant_var))
ggplot(data_summary, aes(x = grouping_var,
y = median)) +
geom_point() +
geom_errorbar(aes(ymin = min,
ymax = max))
Then if you need to overlay your old data you can just add a new geom like so:
ggplot(data_summary, aes(x = grouping_var,
y = median)) +
geom_point() +
geom_errorbar(aes(ymin = min,
ymax = max)) +
geom_point(data = data, aes(x = grouping_var,
y = quant_var))
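Applied to the mtcars data from the question, the placeholders above could be filled in as in this sketch. The name `mpg_summary` and the choice of geom_linerange (a plain vertical line without end caps) are illustrative; geom_errorbar works the same way if caps are wanted.

```r
library(dplyr)
library(ggplot2)

# One row per cylinder count with the median and the extremes of mpg
mpg_summary <- mtcars %>%
  mutate(cyl = factor(cyl)) %>%
  group_by(cyl) %>%
  summarise(median = median(mpg),
            min = min(mpg),
            max = max(mpg))

p <- ggplot(mpg_summary, aes(x = cyl, y = median)) +
  # Single vertical line from minimum to maximum, with no box
  geom_linerange(aes(ymin = min, ymax = max)) +
  geom_point(size = 3) +
  labs(title = "Car Mileage Data",
       x = "Number of Cylinders", y = "Miles Per Gallon")
p
```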
