Boxplot and five-number statistics: plotting quantiles in R

I want to visualize my data using a boxplot.
I have created a boxplot and a stripchart using the following commands:
adjbox(nkv.murgang$NK, main = "NKV, Murgang - Skewness Adjusted", horizontal = T, axes = F,
staplewex = 1, xlab = "Nutzen Kosten Verhältnis")
stripchart(nkv.murgang$NK, main = "NKV, Murgang - Stripchart", horizontal = T, pch = 1,
method = "jitter", xlab = "Nutzen Kosten Verhältnis")
However, I cannot figure out how to incorporate the corresponding five-number statistics into the graph (min, 1st Qu., Mean, 3rd Qu., Max). I want them to be displayed next to the whiskers.
What is my y-axis in this case?
Additionally, I want to highlight the mean and median with different colours. Something like this:
Is it possible to combine these two into one graph?
Thanks for any input. I know this seems very basic, but I am stuck here...

You can combine a boxplot with a point plot using ggplot2 as follows:
library(ggplot2)
ggplot(mtcars, aes(x = as.factor(gear), y = wt)) +
geom_boxplot() +
geom_jitter(aes(col = (cyl == 4)), width = 0.1)
The result would be:

Instead of using adjbox, use ggplot.
There is a trick for the unknown x-axis: x = factor(0).
library(ggplot2)
ggplot(nkv.murgang, aes(x = factor(0), y = NK)) +
  geom_boxplot(notch = FALSE, outlier.color = "darkgrey", outlier.shape = 1,
               color = "black", fill = "darkorange", varwidth = TRUE) +
  ggtitle("NKV Murgang - Einfamilienhaus") +
  labs(x = "Murgang", y = "Nutzen / Kosten \n Verhältnis") +
  # fun and after_stat(y) replace the deprecated fun.y and ..y.. (ggplot2 >= 3.3.0)
  stat_summary(geom = "text", fun = quantile,
               aes(label = sprintf("%1.1f", after_stat(y))),
               position = position_nudge(x = 0.4), size = 3.5)
This question explains.
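For comparison, here is a minimal base-graphics sketch of the same idea. It uses mtcars$wt as a stand-in for nkv.murgang$NK (an assumption, since the original data is not shown). In a horizontal boxplot of a single vector the box is centred at y = 1, which is one way to answer the y-axis question: labels placed at y = 1.3 sit just above the box. The colours, offsets and legend position are arbitrary choices.

```r
x <- mtcars$wt        # stand-in data (assumption, not the asker's data)
five <- fivenum(x)    # min, lower hinge, median, upper hinge, max

boxplot(x, horizontal = TRUE, xlab = "wt")

# The single box is centred at y = 1, so y = 1.3 places the labels above it
text(x = five, y = 1.3, labels = round(five, 1), cex = 0.8)

# Highlight the median (blue bar) and the mean (red diamond)
segments(median(x), 0.8, median(x), 1.2, col = "blue", lwd = 3)
points(mean(x), 1, col = "red", pch = 18, cex = 1.5)
legend("bottomright", legend = c("median", "mean"),
       col = c("blue", "red"), lty = c(1, NA), pch = c(NA, 18), bty = "n")
```

Note that fivenum() returns the true minimum and maximum, which may be drawn as outlier points rather than whisker ends; boxplot.stats(x)$stats gives the whisker ends instead.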

Related

How to use stat_compare_means in a ggplot facet?

I have this boxplot faceted by TCGA Cancer Type, and I want to compare the means of each pair of boxplots in all of the graphs. However, when I use stat_compare_means, it only applies the mean comparison to the top-left panel, under ACC. What would I have to do so that the p-value shows up on all of the graphs?
l <- ggplot(TP53,
aes(x = TP53_Mutation_Status,
y = Leukocyte_Fraction, fill = TCGA_Cancer_Type)) +
geom_boxplot(outlier.color = "red")+
scale_x_reordered()+
geom_jitter(alpha = 0.125, size = 1)+
ggtitle("Leukocyte Fractions Comparison between TP53+ and WT")+
xlab("TP53 Mutation Status")+
ylab("Leukocyte Fractions")+
stat_compare_means(comparison = list(c("FALSE", "TRUE")), label.y = .75, group = TP53[3])+
facet_wrap(~ TCGA_Cancer_Type,nrow = 4)+
theme_classic()
l

Half of the boxplot made in GGPLOT2

I'm making a boxplot with the ggplot2 package. However, for some reason, only half of the box is drawn for the "Control" and "Commercial IMD" treatments.
Note that when I make the graph with the base "boxplot" function, it is drawn normally.
mediasCon = tapply(dados$CS, dados$Trat, mean)
boxplot(dados$CS ~ dados$Trat, data = dados, col="gray",
xlab = 'Tratamentos', ylab = 'Espermatozoides - Cabeça Solta')
points(1:3, mediasCon, col = 'Red', pch = 16)
However, when I make the same graph with ggplot2, only half of the box is drawn for the first two treatments. Why does this occur?
Also, how do I add boxplot "tails" using a ggplot2 function?
library(ggplot2)
ggplot(data=dados, aes(x=Trat, y=CS)) + geom_boxplot(fill=c("#DEEBF7","#2171B5","#034E7B"),color="black") +
xlab('Tratamentos') +
ylab('Espermatozoides - Cabeça Solta') +
stat_summary(fun=mean, colour="black", geom="point",
shape=18, size=5) +
theme(axis.title = element_text(size = 20),
axis.text = element_text(size = 16))
If you look at the help file under ?geom_boxplot you will see:
The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). This differs slightly from the method used by the boxplot() function, and may be apparent with small samples. See boxplot.stats() for more information on how hinge positions are calculated for boxplot().
In your case, the 4 entries for IMD Commercial are c(0, 1, 1, 1), which is certainly a small sample.
One way around this is to calculate where you want the hinges to be and pass that data to ggplot, using stat = "identity". This makes the code a bit more complex, but this is often the case when you are trying to modify default behaviour:
library(ggplot2)
library(dplyr)
dados %>%
group_by(Trat) %>%
summarize(median = median(CS), mean = mean(CS),
upper = quantile(CS, 0.75, type = 2),
lower = quantile(CS, 0.25, type = 2),
max = max(CS), min = min(CS)) %>%
ggplot(aes(x = Trat, y = mean, fill = Trat)) +
geom_boxplot(aes(ymin = min, lower = lower,
middle = median, upper = upper, ymax = max),
stat = "identity", color = "black") +
geom_point(size = 3, shape = 21, fill = "red") +
scale_fill_manual(values = c("#DEEBF7","#2171B5","#034E7B")) +
theme_classic() +
xlab('Tratamentos') +
ylab('Espermatozoides - Cabeça Solta')
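The difference in hinge positions can be checked directly in base R for the four values quoted above. This is a small sketch, not part of the original answer: it compares the hinges used by boxplot() (via fivenum()) with the default quantile() definition (type 7, which is what geom_boxplot's quartiles correspond to) and with type 2 as used in the workaround.

```r
x <- c(0, 1, 1, 1)   # the four IMD Commercial values mentioned above

# Hinges as used by boxplot() / boxplot.stats()
hinges_base <- fivenum(x)[c(2, 4)]

# Quartiles with the default quantile type (7)
hinges_q7 <- unname(quantile(x, c(0.25, 0.75), type = 7))

# Quartiles with type 2, as in the workaround above
hinges_q2 <- unname(quantile(x, c(0.25, 0.75), type = 2))

print(rbind(base = hinges_base, type7 = hinges_q7, type2 = hinges_q2))
```

With type 7 the lower quartile is 0.75 while the median and upper quartile are both 1, so only the lower half of the box is visible. With hinges or type 2 the lower edge drops to 0.5.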

How to avoid overlapping of labels/texts of boxplot in R?

I am drawing a boxplot along with a violin plot to see the distribution of the data using ggplot2. The quartiles of the box plot are very close to each other, which causes the labels to overlap.
I used ggrepel::geom_label_repel, but it did not work. If I remove geom_label_repel, some labels overlap.
Here is my R code and some sample data:
dataset <- data.frame(Age = sample(1:20, 100, replace = T))
ggplot(dataset, aes(x = "", y = Age)) +
geom_violin(position = "dodge", width = 1, fill = "blue") +
geom_boxplot(width=0.1, position = "dodge", fill = "red") +
stat_boxplot(geom = "errorbar", width = 0.1) +
stat_summary(geom = "label", fun.y = quantile, aes(label = ..y..),
position = position_nudge(x = -0.05), size = 3) +
ggrepel::geom_label_repel(aes(label = quantile)) +
ggtitle("") +
xlab("") +
ylab("Age")
In addition, is anyone familiar with combining a boxplot and a violin plot in one plot, with the box plot on the left side and the violin plot on the right? (I am not asking for side-by-side plots, just one plot.)
Here is a slightly different approach, without ggrepel. Half a violin plot is actually a classic density plot, just vertical. That's the basis for the plot. I am adding a horizontal box plot with ggstance::geom_boxploth. For the labels, we can no longer use stat_summary, because we cannot summarise over x values (maybe someone knows how to do that; I don't). So I used this fantastically obscure code by @eipi10 to pre-calculate the quantiles in one go. You can set the position of the boxplot to 0 and just fill the density plot, which avoids real hacks such as calculating the segments yourself.
You can then fine-tune the graph to your liking.
library(tidyverse)
library(ggstance)
#>
#> Attaching package: 'ggstance'
#> The following objects are masked from 'package:ggplot2':
#>
#> geom_errorbarh, GeomErrorbarh
dataset <- data.frame(Age = sample(1:20, 100, replace = T))
my_quant <- dataset %>%
  summarise(Age = list(enframe(quantile(Age, probs = c(0.25, 0.5, 0.75))))) %>%
  unnest(cols = Age)  # naming cols avoids the warning in tidyr >= 1.0
my_y <- 0
ggplot(dataset) +
ggstance::geom_boxploth(aes(x = Age, y = my_y), width = .05) +
geom_density(aes(x = Age)) +
annotate(geom = "label", x = my_quant$value, my_y, label = my_quant$value) +
coord_flip()
Now adding a fill.
ggplot(dataset) +
ggstance::geom_boxploth(aes(x = Age, y = my_y), width = .05) +
geom_density(aes(x = Age), fill = 'white') +
annotate(geom = "label", x = my_quant$value, my_y, label = my_quant$value) +
coord_flip()
Created on 2019-07-29 by the reprex package (v0.2.1)
When using the standard R boxplot command, use the text command to add the five statistics to the graph.
Example:
boxplot(arq1$J00_J99, arq1$V01_Y89, horizontal = TRUE)
text(x = boxplot.stats(arq1$J00_J99)$stats,
     labels = boxplot.stats(arq1$J00_J99)$stats, y = 0.5)
text(x = boxplot.stats(arq1$V01_Y89)$stats,
     labels = boxplot.stats(arq1$V01_Y89)$stats, y = 2.5)
This leaves one pair of labels overlapping on the upper boxplot.
To avoid this, call text twice, placing different statistics at different y heights:
text(x = boxplot.stats(arq1$V01_Y89)$stats[2:5],
     labels = boxplot.stats(arq1$V01_Y89)$stats[2:5], y = 2.5)
text(x = boxplot.stats(arq1$V01_Y89)$stats[1],
     labels = boxplot.stats(arq1$V01_Y89)$stats[1], y = 2)
Here, statistics 2 to 5 (first quartile, median, third quartile and upper whisker end, which is the maximum when there are no outliers) are drawn at y = 2.5, and statistic 1 (the lower whisker end) at y = 2.
The same approach resolves any overlap between statistic labels on a boxplot.
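Splitting the labels over two heights can also be done in a single text call by alternating the y positions. The sketch below is one way to do that; stagger_y is a hypothetical helper written for this illustration, and the data and offsets are made up.

```r
# Hypothetical helper: alternate label heights so that neighbouring
# statistics never share the same y position.
stagger_y <- function(n, base, step = 0.12) {
  base + step * ((seq_len(n) - 1) %% 2)
}

x <- c(2, 4, 4.2, 5, 9, 9.1, 12)   # made-up data
stats <- boxplot.stats(x)$stats    # whisker ends, hinges and median

boxplot(x, horizontal = TRUE)
# A single horizontal box is centred at y = 1; labels go just above it
text(x = stats, y = stagger_y(length(stats), base = 1.28),
     labels = stats, cex = 0.8)
```

Alternating heights only separates neighbouring labels; if three or more statistics are very close together, a larger step or more levels may be needed.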

Removing lower and upper quartiles in boxplot, with connection between whiskers in R

I'm trying to make some different boxplots.
[Figure: a completely normal boxplot]
I can't figure out how to create the boxplot without the lower and upper quartiles, so that essentially only the outliers and the median remain, connected by the whiskers. Something that would look like this:
[Figure: my attempt]
But I need an unbroken vertical line connecting the whiskers.
For the second plot, I did the following in R:
boxplot(mpg~cyl,data=mtcars, main="Car Milage Data", xlab="Number of Cylinders",
ylab="Miles Per Gallon",col="white",frame=F,medcol = "black", boxlty =0,
whisklty = 1, staplelwd = 1,boxwex=0.4)
Many Thanks.
Here is a way to get what you are looking for using a scatter plot and error bars:
library(tidyverse)
data_summary <- data %>%
group_by(grouping_var) %>%
summarize(median = median(quant_var),
max = max(quant_var),
min = min(quant_var))
ggplot(data_summary, aes(x = grouping_var,
y = median)) +
geom_point() +
geom_errorbar(aes(ymin = min,
ymax = max))
Then if you need to overlay your old data you can just add a new geom like so:
ggplot(data_summary, aes(x = grouping_var,
y = median)) +
geom_point() +
geom_errorbar(aes(ymin = min,
ymax = max)) +
geom_point(data = data, aes(x = grouping_var,
y = quant_var))
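The names data, grouping_var and quant_var above are placeholders for your own data. As a concrete sketch of the same idea in base graphics, using the mtcars data from the question (mpg by cyl), the median can be drawn as a point with one unbroken line from minimum to maximum. The plotting parameters are arbitrary choices.

```r
groups <- split(mtcars$mpg, mtcars$cyl)   # mpg values per cylinder count
med <- sapply(groups, median)
lo  <- sapply(groups, min)
hi  <- sapply(groups, max)
xs  <- seq_along(groups)

plot(xs, med, pch = 16, xaxt = "n", xlim = c(0.5, length(xs) + 0.5),
     ylim = range(lo, hi), xlab = "Number of Cylinders",
     ylab = "Miles Per Gallon", frame.plot = FALSE)
axis(1, at = xs, labels = names(groups))
# One continuous vertical line per group, with flat caps at both ends
arrows(xs, lo, xs, hi, angle = 90, code = 3, length = 0.05)
```

Like the ggplot2 version above, this runs the line from minimum to maximum rather than stopping at the whisker ends, so outliers are not shown separately.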

Plot Gaussian Mixture in R using ggplot2

I'm approximating a distribution with gaussian mixtures and was wondering whether there was an easy way to automatically plot the estimated kernel density of the whole (uni-dimensional) dataset as the sum of the component densities in a nice fashion like this using ggplot2:
Given the following example data, my approach in ggplot2 would be to manually plot the subset densities into the scaled overall density like this:
#example data
a<-rnorm(1000,0,1) #component 1
b<-rnorm(1000,5,2) #component 2
d<-c(a,b) #overall data
df<-data.frame(d,id=rep(c(1,2),each=1000)) #add group id
##ggplot2
require(ggplot2)
ggplot(df) +
geom_density(aes(x=d,y=..scaled..)) +
geom_density(data=subset(df,id==1), aes(x=d), lty=2) +
geom_density(data=subset(df,id==2), aes(x=d), lty=4)
Note that this does not work out regarding the scales. It also does not work when you scale all 3 densities or no density at all. So I was not able to replicate above plot.
In addition, I am not able to automatically generate this plot without having to subset manually. I tried using position = "stacked" as parameter in geom_density.
I usually have around 5-6 Components per dataset, so manually subsetting would be possible. However, I would like to have different colors or line-types per component density which are displayed in the legend of ggplot, so doing all subsets manually would increase the workload quite a bit.
Any ideas?
Thanks!
Here is a possible solution by specifying each density in the aes call with position = "identity" in one layer and in the second layer using stacked density without the legend.
ggplot(df) +
stat_density(aes(x = d, linetype = as.factor(id)), position = "stack", geom = "line", show.legend = F, color = "red") +
stat_density(aes(x = d, linetype = as.factor(id)), position = "identity", geom = "line")
Do note that when using more than two groups:
a <- rnorm(1000, 0, 1)
b <- rnorm(1000, 5, 2)
c <- rnorm(1000, 3, 2)
d <- rnorm(1000, -2, 1)
d <- c(a, b, c, d)
df <- data.frame(d, id = as.factor(rep(c(1, 2, 3, 4), each = 1000)))
curves for each stack appear (this is also a problem in the two-group example, but the linetype mapping in the first layer disguised it; use group instead to check):
ggplot(df) +
stat_density(aes(x = d, group = id), position = "stack", geom = "line", show.legend = F, color = "red") +
stat_density(aes(x = d, linetype = id), position = "identity", geom = "line")
A relatively easy fix to this is to add alpha mapping and manually set it to 0 for unwanted curves:
ggplot(df) +
stat_density(aes(x=d, alpha = id), position = "stack", geom = "line", show.legend = F, color = "red") +
stat_density(aes(x=d, linetype = id), position = "identity", geom = "line")+
scale_alpha_manual(values = c(1,0,0,0))
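The scaling problem described in the question arises because each density is normalised to integrate to 1. A kernel density estimate of the pooled data equals the sum of the component estimates only when each component is weighted by its share of the observations and all estimates use the same bandwidth. The base R sketch below checks this numerically on simulated data like the question's; the seed, the evaluation grid and the bandwidth choice are assumptions made for the illustration.

```r
set.seed(1)
a <- rnorm(1000, 0, 1)   # component 1
b <- rnorm(1000, 5, 2)   # component 2
d <- c(a, b)             # overall data

bw   <- bw.nrd0(d)                  # one shared bandwidth for all estimates
grid <- range(d) + c(-3, 3) * bw    # common evaluation range
kde  <- function(x) {
  density(x, bw = bw, from = grid[1], to = grid[2], n = 512)$y
}

total <- kde(d)
# Weight each component by its share of the observations
parts <- cbind(a = kde(a) * length(a) / length(d),
               b = kde(b) * length(b) / length(d))

xs <- seq(grid[1], grid[2], length.out = 512)
plot(xs, total, type = "l", xlab = "d", ylab = "density")
lines(xs, parts[, "a"], lty = 2)
lines(xs, parts[, "b"], lty = 4)
legend("topright", legend = c("overall", "component 1", "component 2"),
       lty = c(1, 2, 4), bty = "n")

# Largest gap between the overall curve and the sum of weighted components
print(max(abs(rowSums(parts) - total)))
```

The printed gap should be at the level of floating-point error, which is why the weighted component curves fit under the overall curve in the plot.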
