Meansplot with Tukey HSD confidence intervals in R

I'm trying to make a means plot with confidence intervals, but I would like the intervals to be Tukey HSD intervals computed after an ANOVA.
I'll use the following example to explain. The data frame contains a factor, poison, with levels {1, 2, 3}.
library(magrittr)
library(ggplot2)
library(ggpubr)
library(dplyr)
library("agricolae")
PATH <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/poisons.csv"
df <- read.csv(PATH) %>%
  select(-X) %>%
  mutate(poison = factor(poison, ordered = TRUE))
glimpse(df)
ggplot(df, aes(x = poison, y = time, fill = poison)) +
  geom_boxplot() +
  geom_jitter(shape = 15,
              color = "steelblue",
              position = position_jitter(0.21)) +
  theme_classic()
anova_one_way <- aov(time ~ poison, data = df)
summary(anova_one_way)
# Use TukeyHSD
tukeyHSD <- TukeyHSD(anova_one_way)
plot(tukeyHSD)
I would like the plot to be similar to the one from Statgraphics, where you can see the mean point and the length of the bars is the Tukey HSD interval, so that at a glance you can see which level is best and whether it is statistically significantly better.
I have seen some examples in more complex questions, but they are for boxplots and I don't understand them well enough to adapt the solutions here:
Tukey's results on boxplot in R
TukeyHSD results on boxplot after two-way anova
Edit:
The answer provided by Allan Cameron (@allan-cameron) is great; however, it doesn't work on my computer as written, probably because of package versions: the stat_summary argument names are slightly different in my version of ggplot2. I took his solution and made a couple of changes to make it work for me.
# Allan's original calculation
tukeyCI <- (tukeyHSD$poison[1, 1] - tukeyHSD$poison[1, 2]) / 2
# Changed fun.max and fun.min to fun.ymax and fun.ymin,
# and fun to fun.y, to make Allan's solution work for me.
ggplot(df, aes(x = poison, y = time)) +
  stat_summary(fun.ymax = function(x) mean(x) + tukeyCI,
               fun.ymin = function(x) mean(x) - tukeyCI,
               geom = 'errorbar', size = 1, color = 'gray50',
               width = 0.25) +
  stat_summary(fun.y = mean, geom = 'point', size = 4, shape = 21,
               fill = 'white') +
  geom_point(position = position_jitter(width = 0.25), alpha = 0.4,
             color = 'deepskyblue4') +
  theme_minimal(base_size = 16)
The original code produced these warnings:
Warning: Ignoring unknown parameters: fun.max, fun.min
Warning: Ignoring unknown parameters: fun
No summary function supplied, defaulting to `mean_se()`
I'm currently using these versions:
R version 3.5.2 (2018-12-20)
packageVersion("ggplot2"): ‘3.1.0’
packageVersion("dplyr"): ‘0.7.8’

The image from Statgraphics shows error bars around the mean points. If I understand you correctly, you want to draw error bars around your mean points such that non-overlapping error bars indicate a significant difference between the groups. In that case, we can extract the required bar half-length like this:
tukeyCI <- (tukeyHSD$poison[1,1] - tukeyHSD$poison[1,2])/2
And we can draw the result in ggplot like this:
ggplot(df, aes(x = poison, y = time)) +
  stat_summary(fun.max = function(x) mean(x) + tukeyCI,
               fun.min = function(x) mean(x) - tukeyCI,
               geom = 'errorbar', size = 1, color = 'gray50',
               width = 0.25) +
  stat_summary(fun = mean, geom = 'point', size = 4, shape = 21,
               fill = 'white') +
  geom_point(position = position_jitter(width = 0.25), alpha = 0.4,
             color = 'deepskyblue4') +
  theme_minimal(base_size = 16)
Here we can see that there are significant differences between 1 and 3, and between 2 and 3, but that the difference between 1 and 2 is non-significant.
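As a rough check of the calculation above, here is a sketch on the built-in PlantGrowth data (an assumption, chosen so the example runs without downloading poisons.csv; the names fit, tk, half_width, tukey_ci and means are illustrative). TukeyHSD returns, for each factor, a matrix with columns diff, lwr, upr and p adj, so diff - lwr is the half-width of the interval for a difference. In a balanced design that half-width is the same for every pair, and bars of mean ± half_width / 2 fail to overlap exactly when two means differ by more than half_width, that is, when the Tukey interval for their difference excludes zero. Pre-computing the bar limits and drawing them with geom_errorbar also sidesteps the fun / fun.y argument names, which differ between ggplot2 versions.

```r
library(ggplot2)

# Built-in data used only for illustration (the question uses poisons.csv)
fit <- aov(weight ~ group, data = PlantGrowth)
tk  <- TukeyHSD(fit)$group   # matrix with columns diff, lwr, upr, p adj

# Half-width of the Tukey interval for a difference between two means;
# identical for all pairs here because every group has 10 observations
half_width <- unname(tk[1, "diff"] - tk[1, "lwr"])
tukey_ci   <- half_width / 2 # bar half-length, as in the answer above

# Pre-compute group means and bar limits in a small data frame
means <- aggregate(weight ~ group, data = PlantGrowth, FUN = mean)
means$lower <- means$weight - tukey_ci
means$upper <- means$weight + tukey_ci

# Draw the bars from the pre-computed columns (no fun.* arguments needed)
p <- ggplot(means, aes(x = group, y = weight)) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.25) +
  geom_point(size = 4, shape = 21, fill = "white")
p
```

With unequal group sizes the Tukey half-widths differ between pairs, so a single bar length no longer corresponds exactly to the pairwise tests.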

Related

Half of the boxplot made in GGPLOT2

I'm drawing a boxplot with the ggplot2 package; however, for some reason, only half of the box is drawn for the "Control" and "Commercial IMD" treatments.
When I make the graph with the base boxplot function, it is drawn normally:
mediasCon = tapply(dados$CS, dados$Trat, mean)
boxplot(dados$CS ~ dados$Trat, data = dados, col = "gray",
        xlab = 'Tratamentos', ylab = 'Espermatozoides - Cabeça Solta')
points(1:3, mediasCon, col = 'Red', pch = 16)
However, when I make the same graph with ggplot2, only half of the box is drawn for the first two treatments. Why does this happen?
Also, how do I add boxplot "tails" (whisker end caps) with a ggplot2 function?
library(ggplot2)
ggplot(data = dados, aes(x = Trat, y = CS)) +
  geom_boxplot(fill = c("#DEEBF7", "#2171B5", "#034E7B"), color = "black") +
  xlab('Tratamentos') +
  ylab('Espermatozoides - Cabeça Solta') +
  stat_summary(fun = mean, colour = "black", geom = "point",
               shape = 18, size = 5) +
  theme(axis.title = element_text(size = 20),
        axis.text = element_text(size = 16))
If you look at the help file under ?geom_boxplot you will see:
The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). This differs slightly from the method used by the boxplot() function, and may be apparent with small samples. See boxplot.stats() for more information on how hinge positions are calculated for boxplot().
In your case, the 4 entries for IMD Commercial are c(0, 1, 1, 1), which is certainly a small sample.
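To make the difference concrete, here is a small base R sketch using the four values quoted above (the variable names are illustrative). It compares the hinges that boxplot() uses with the quartiles from quantile() under its default type = 7, which, as far as I understand, is what the quoted help text describes for geom_boxplot.

```r
# The four values quoted above for one treatment
x <- c(0, 1, 1, 1)

# Hinges used by base boxplot() (Tukey's five-number summary)
base_hinges <- fivenum(x)[c(2, 4)]

# Quartiles from quantile() with its default type = 7
default_quartiles <- unname(quantile(x, c(0.25, 0.75)))

# type = 2 averages at discontinuities and matches the hinges for this sample
type2_quartiles <- unname(quantile(x, c(0.25, 0.75), type = 2))

base_hinges        # 0.5 1.0
default_quartiles  # 0.75 1.00
type2_quartiles    # 0.5 1.0
```

For this sample the lower edge of the box moves from 0.5 to 0.75 depending on the method, which is the kind of small-sample discrepancy the help file warns about.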
One way around this is to calculate where you want the hinges to be and pass that data to ggplot, using stat = "identity". This makes the code a bit more complex, but this is often the case when you are trying to modify default behaviour:
library(ggplot2)
library(dplyr)
dados %>%
  group_by(Trat) %>%
  summarize(median = median(CS), mean = mean(CS),
            upper = quantile(CS, 0.75, type = 2),
            lower = quantile(CS, 0.25, type = 2),
            max = max(CS), min = min(CS)) %>%
  ggplot(aes(x = Trat, y = mean, fill = Trat)) +
  geom_boxplot(aes(ymin = min, lower = lower,
                   middle = median, upper = upper, ymax = max),
               stat = "identity", color = "black") +
  geom_point(size = 3, shape = 21, fill = "red") +
  scale_fill_manual(values = c("#DEEBF7", "#2171B5", "#034E7B")) +
  theme_classic() +
  xlab('Tratamentos') +
  ylab('Espermatozoides - Cabeça Solta')

How to include number of observations in each quartile of boxplot using ggplot2 in R?

I am plotting a box plot to see the distribution of a variable. I am also interested in seeing the number of observations in each quartile. Is there any way to add the number of observations in each quartile to the box plot, along with the quartile values?
The code below generates a box plot labelled with the quartile values.
df <- datasets::iris
boxplot <- ggplot(df, aes(x = "", y = Sepal.Length)) +
  geom_boxplot(width = 0.1, position = "dodge", fill = "red") +
  stat_boxplot(geom = "errorbar", width = 0.1) +
  stat_summary(geom = "label_repel", fun.y = quantile, aes(label = ..y..),
               position = position_nudge(x = -0.1), size = 3) +
  ggtitle("") +
  xlab("") +
  ylab('Sepal.Length')
I expect the values of quartiles on the left-hand side of the plot and the number of observations on the right-hand side of the plot if possible.
This would be one possibility. I always prefer to keep my additional data in a separate data frame, because this gives me more control over what is calculated and how.
The counting was inspired by https://stackoverflow.com/a/54451575
quantile_counts <- function(x) {
  df <- data.frame(label = table(cut(x, quantile(x))),
                   label_pos = diff(quantile(x)) / 2 + quantile(x)[1:4])
  return(df)
}
df_quantile_counts <- quantile_counts(df$Sepal.Length)
boxplot <- ggplot(df, aes(x = "", y = Sepal.Length)) +
  geom_boxplot(width = 0.1, position = "dodge", fill = "red") +
  stat_boxplot(geom = "errorbar", width = 0.1) +
  stat_summary(geom = "label", fun.y = quantile, aes(label = ..y..),
               position = position_nudge(x = -0.1), size = 3) +
  geom_text(data = df_quantile_counts,
            aes(x = "", y = label_pos, label = label.Freq),
            position = position_nudge(x = +0.1), size = 3) +
  ggtitle("") +
  xlab("") +
  ylab('Sepal.Length')
HTH, Tobi
@TobiO's answer is correct. However, my data were skewed and some cut points coincided (for example, the first and second cut points were the same), so I needed to take the unique values to calculate the number of observations in each quartile. Another point concerns the cut function, which by default does not include the starting point: its intervals are (low bound, high bound]. To include the starting point, I used the cut2 function from the Hmisc package. I also added a label_pos_extension line to keep the labels from overlapping for quartiles whose cut points are very close to each other; geom_text_repel did not prevent the overlaps.
library(Hmisc)  # provides cut2()
quantile_counts2 <- function(x) {
  q <- quantile(x)
  n_unique <- length(unique(q))
  # manual offsets that keep neighbouring labels apart
  label_pos_extension <- c(0, 3, 4, 0)
  if (n_unique < 5) {
    df <- data.frame(label = table(cut2(x, g = 4)),
                     label_pos = c(0, diff(unique(q)) / 2 + q[seq_len(n_unique - 1)]) +
                       label_pos_extension[1:n_unique])
  } else {
    df <- data.frame(label = table(cut2(x, g = 4)),
                     label_pos = diff(q) / 2 + q[1:4] + label_pos_extension)
  }
  return(df)
}
PS: I tried to put my edited function in a comment, but it did not work.
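A base R sketch of the two issues described above, offered as an alternative to cut2 rather than an equivalent (the grouping at tied values may differ). unique() removes duplicated cut points, and include.lowest = TRUE makes cut() keep the minimum in the first interval. The variable names are illustrative.

```r
x <- iris$Sepal.Length

# Duplicated quantiles would make cut() stop with "'breaks' are not unique"
breaks <- unique(quantile(x))

# Default intervals are (low, high], so the minimum falls outside the first one
counts_default <- table(cut(x, breaks = breaks))

# include.lowest = TRUE closes the first interval: [low, high]
counts_closed <- table(cut(x, breaks = breaks, include.lowest = TRUE))

sum(counts_default)  # fewer than length(x): the minimum is dropped
sum(counts_closed)   # equal to length(x)
```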

How to avoid overlapping of labels/texts of boxplot in R?

I am drawing a boxplot together with a violin plot to see the distribution of the data using ggplot2. The quartiles of the box plot are very close to each other, so their labels overlap.
I tried ggrepel::geom_label_repel, but it did not work. If I remove geom_label_repel, some labels overlap.
Here is my R code and some sample data:
dataset <- data.frame(Age = sample(1:20, 100, replace = T))
ggplot(dataset, aes(x = "", y = Age)) +
  geom_violin(position = "dodge", width = 1, fill = "blue") +
  geom_boxplot(width = 0.1, position = "dodge", fill = "red") +
  stat_boxplot(geom = "errorbar", width = 0.1) +
  stat_summary(geom = "label", fun.y = quantile, aes(label = ..y..),
               position = position_nudge(x = -0.05), size = 3) +
  ggrepel::geom_label_repel(aes(label = quantile)) +
  ggtitle("") +
  xlab("") +
  ylab("Age")
In addition, is anyone familiar with a combined box and violin plot, where the left side of the plot is a box plot and the right side is a violin plot? (I am not asking for side-by-side plots, just one plot.)
Here is a slightly different approach, without ggrepel. Half a violin plot is really just a classic density plot drawn vertically, so that is the basis for the plot. I add a horizontal box plot with ggstance::geom_boxploth. For the labels, we can no longer use stat_summary, because we cannot summarise over x values (maybe someone knows how to do that; I don't). So I used this fantastically obscure code by @eipi10 to pre-calculate the quantiles in one go. You can set the position of the boxplot to 0 and just fill the density plot, which avoids hacks such as calculating your own segments.
You can then fine-tune your graphs to your liking.
library(tidyverse)
library(ggstance)
#>
#> Attaching package: 'ggstance'
#> The following objects are masked from 'package:ggplot2':
#>
#> geom_errorbarh, GeomErrorbarh
dataset <- data.frame(Age = sample(1:20, 100, replace = T))
my_quant <- dataset %>%
  summarise(Age = list(enframe(quantile(Age, probs = c(0.25, 0.5, 0.75))))) %>%
  unnest
my_y <- 0
ggplot(dataset) +
  ggstance::geom_boxploth(aes(x = Age, y = my_y), width = .05) +
  geom_density(aes(x = Age)) +
  annotate(geom = "label", x = my_quant$value, my_y, label = my_quant$value) +
  coord_flip()
Now adding a fill.
ggplot(dataset) +
  ggstance::geom_boxploth(aes(x = Age, y = my_y), width = .05) +
  geom_density(aes(x = Age), fill = 'white') +
  annotate(geom = "label", x = my_quant$value, my_y, label = my_quant$value) +
  coord_flip()
Created on 2019-07-29 by the reprex package (v0.2.1)
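If the enframe/unnest step feels too obscure, the same three quantiles can be collected with base R. This is a sketch, not the answer's code; set.seed(1) and the column names name and value (chosen to mirror what enframe returns by default) are assumptions.

```r
set.seed(1)  # assumption: fixed seed so the sample is reproducible
dataset <- data.frame(Age = sample(1:20, 100, replace = TRUE))

# quantile() returns a named numeric vector ("25%", "50%", "75%")
q <- quantile(dataset$Age, probs = c(0.25, 0.5, 0.75))

# Store names and values as two plain columns, one row per quantile
my_quant <- data.frame(name = names(q), value = unname(q))
my_quant
```

my_quant$value can then be passed to annotate() in the same way as in the plots above.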
When using the base R boxplot command, use the text command to add the five summary statistics to the graph.
Example:
boxplot(arq1$J00_J99, arq1$V01_Y89, horizontal = TRUE)
text(x = boxplot.stats(arq1$J00_J99)$stats,
     labels = boxplot.stats(arq1$J00_J99)$stats, y = 0.5)
text(x = boxplot.stats(arq1$V01_Y89)$stats,
     labels = boxplot.stats(arq1$V01_Y89)$stats, y = 2.5)
This produces one overlap between labels on the upper boxplot.
To avoid this, call text twice, placing different statistics at different y heights:
text(x = boxplot.stats(arq1$V01_Y89)$stats[2:5],
     labels = boxplot.stats(arq1$V01_Y89)$stats[2:5], y = 2.5)
text(x = boxplot.stats(arq1$V01_Y89)$stats[1],
     labels = boxplot.stats(arq1$V01_Y89)$stats[1], y = 2)
Here I have placed statistics 2 to 5 (first quartile, median, third quartile and maximum) at y = 2.5 and the minimum at y = 2.
The same approach resolves any overlap between statistic labels on boxplots.
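Since arq1 is not available here, the same idea can be sketched with the built-in mtcars data (an assumption; the names s and y_pos and the chosen heights are illustrative). Instead of calling text twice, the label heights simply alternate between two values so that neighbouring statistics never share a height.

```r
# Five summary statistics of one variable from a built-in data set
s <- boxplot.stats(mtcars$mpg)$stats

# A single horizontal box is centred at y = 1; boxwex narrows it
# so that there is room for labels above it
boxplot(mtcars$mpg, horizontal = TRUE, boxwex = 0.4)

# Alternate the label heights: 1.3, 1.4, 1.3, 1.4, 1.3
y_pos <- rep(c(1.3, 1.4), length.out = length(s))
text(x = s, y = y_pos, labels = s)
```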

Adding labels in ggplot for summary statistics

About 18 months ago, this helpful exchange appeared, with code to show how to produce a plot of median along with interquartile ranges. Here's the code:
ggplot(data = diamonds) +
  geom_pointrange(mapping = aes(x = cut, y = depth),
                  stat = "summary",
                  fun.ymin = function(z) {quantile(z, 0.25)},
                  fun.ymax = function(z) {quantile(z, 0.75)},
                  fun.y = median)
Producing this plot:
What I'd like to know is how to add labels for the median and the interquartile range, and how to format the bar (color, alpha, etc.). I tried storing the plot as an object to see whether it contained components I could pass to formatting functions, but nothing was obvious when I inspected it in the RStudio IDE.
Is this even doable? I know I can draw a boxplot, but that would have to include the min/max. I'd like to produce boxplots with just the mean/median and the interquartile range.
You can change the formatting as you would for any ggplot layer; in this case, see the docs for "Vertical intervals: lines, crossbars & errorbars". For example:
library(ggplot2)
ggplot(data = diamonds) +
  geom_pointrange(mapping = aes(x = cut, y = depth),
                  stat = "summary",
                  fun.ymin = function(z) {quantile(z, 0.25)},
                  fun.ymax = function(z) {quantile(z, 0.75)},
                  fun.y = median,
                  size = 4,        # <- adjusts size
                  colour = "red",  # <- adjusts colour
                  alpha = .3)      # <- adjusts transparency
If you want to control the formatting of the points and lines individually, you need to do as @camille suggests and pre-process your data, because geom_pointrange() draws a single graphical object, so the points and lines are one and the same.
I would suggest something like this:
library(dplyr)
library(ggplot2)
diamonds %>%
  group_by(cut) %>%
  summarise(median = median(depth),
            lq = quantile(depth, 0.25),
            uq = quantile(depth, 0.75)) %>%
  ggplot(aes(cut, median)) +
  geom_linerange(aes(ymin = lq, ymax = uq), size = 4, colour = "blue", alpha = .4) +
  geom_point(size = 10, colour = "red", alpha = .8)
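The question also asked about labels. Once the data are pre-summarised, one possible sketch is to add a geom_text layer per statistic; the names depth_summary and p, the nudge distance and the text size are illustrative choices, not part of the answer above.

```r
library(dplyr)
library(ggplot2)

# Summarise first so every statistic is available as a column for labelling
depth_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(median = median(depth),
            lq = unname(quantile(depth, 0.25)),
            uq = unname(quantile(depth, 0.75)))

p <- ggplot(depth_summary, aes(x = cut, y = median)) +
  geom_linerange(aes(ymin = lq, ymax = uq), colour = "blue", alpha = 0.4) +
  geom_point(colour = "red") +
  # One text layer per statistic, nudged right so it does not cover the bar
  geom_text(aes(y = lq, label = lq), nudge_x = 0.3, size = 3) +
  geom_text(aes(label = median), nudge_x = 0.3, size = 3) +
  geom_text(aes(y = uq, label = uq), nudge_x = 0.3, size = 3)
p
```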

ggplot2 create barplot with CIs with descriptive data (i.e., without raw data)

In base R it is easy (but cumbersome) to create a plot with error bars from descriptive statistics. With ggplot2 I am struggling to do so, and all the examples I have found are based on raw data.
Specifically, how can I create a barplot with confidence intervals for a simple two-group design with M1 = 3, M2 = 4, SD1 = 1, SD2 = 1.2, n1 = 111, n2 = 222? I started off simply with
ggplot(aes(x=c(1:2), y=c(3, 4))) + geom_bar()
# or
ggplot(aes(y=c(3, 4))) + geom_bar()
but not even this seems to produce a barplot.
Any suggestions?
What about using ggplot2::stat_summary()? You can let it take care of your mean and SE calculations (several of its summary functions, such as mean_cl_normal and mean_cl_boot, are wrappers around functions in Hmisc, so look there for more help).
library(ggplot2)
ggplot(mtcars, aes(cyl, mpg)) +
  stat_summary(geom = "bar", fun.y = mean) +
  stat_summary(geom = "errorbar", fun.data = mean_se)
Adjust width = for skinnier bars or error bars.
You can also use a true confidence interval with mean_cl_normal or mean_cl_boot, and a crossbar, which gives a better visualization of the data dispersion:
ggplot(mtcars, aes(cyl, mpg)) +
  stat_summary(geom = "crossbar", fun.data = mean_cl_normal)
Edit:
If you want to recreate a figure from a published paper, just put the summary statistics into a data.frame first:
datf <- data.frame(
  group = c("1", "2"),
  means = c(3, 4),
  sds = c(1, 1.2),
  ns = c(111, 222)
)
# add your CI calculations as columns called lwr and upr
library(tidyverse)
datf <- datf %>% mutate(lwr = means - (qnorm(.975) * (sds / sqrt(ns))),
                        upr = means + (qnorm(.975) * (sds / sqrt(ns))))
ggplot(datf, aes(group, y = means, ymin = lwr, ymax = upr)) +
  geom_crossbar()
Or, if you must, the traditional columns with error bars:
ggplot(datf, aes(group, y = means, ymin = lwr, ymax = upr)) +
  geom_col() +
  geom_errorbar()
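The qnorm-based limits above are a large-sample approximation. A slightly more conservative sketch uses the t distribution with n - 1 degrees of freedom; this is a swapped-in variant, not the calculation from the answer, and the column names se, t_crit, lwr_t and upr_t are illustrative.

```r
# Summary statistics from the question
datf <- data.frame(
  group = c("1", "2"),
  means = c(3, 4),
  sds   = c(1, 1.2),
  ns    = c(111, 222)
)

# Standard error of each mean
datf$se <- datf$sds / sqrt(datf$ns)

# Two-sided 95% critical value from the t distribution, one per group
datf$t_crit <- qt(0.975, df = datf$ns - 1)

# t-based confidence limits for each mean
datf$lwr_t <- datf$means - datf$t_crit * datf$se
datf$upr_t <- datf$means + datf$t_crit * datf$se
datf
```

With groups this large the t and normal critical values are very close, so the two sets of limits will be nearly identical; the difference matters more for small n.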
You can draw an error bar to whatever values you want. Error bars have aesthetics called ymin and ymax that you can set. Here I draw the bars +/- 1 standard deviation from the mean:
dd <- read.table(text = "sample mean sd n
1 3 1 111
2 4 1.2 222", header = T)
ggplot(dd, aes(sample)) +
  geom_col(aes(y = mean)) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd))
