I am drawing a boxplot along with violin plot to see the distribution of data using ggplot2. The quartiles of the box plot are very close to each other. That's why it causes overlapping.
I used ggrepel::geom_label_repel but, it did not work. If I remove geom_label_repel, some labels overlap.
Here is my R code and a sample data:
dataset <- data.frame(Age = sample(1:20, 100, replace = T))
ggplot(dataset, aes(x = "", y = Age)) +
geom_violin(position = "dodge", width = 1, fill = "blue") +
geom_boxplot(width=0.1, position = "dodge", fill = "red") +
stat_boxplot(geom = "errorbar", width = 0.1) +
stat_summary(geom = "label", fun.y = quantile, aes(label = ..y..),
position = position_nudge(x = -0.05), size = 3) +
ggrepel::geom_label_repel(aes(label = quantile)) +
ggtitle("") +
xlab("") +
ylab(Age)
In addition to this, does anyone familiar with the combination of boxplot and violin plot? The left side of the plot is box-plot and the right side is the violin plot (I am not asking side by side plots. Just one plot).
Here a slightly different approach, without ggrepel. Half a violin plot is actually a classic density plot, just vertical. That's the basis for the plot. I am adding a horizontal box plot with ggstance::geom_boxploth. For the labels, we cannot use stat_summary any more, because we cannot summarise over x values (maybe someone knows how to do that, I don't). So I used this fantastically obscure code by #eipi10 to pre-calculate the quantiles in one go. You can set the position of the boxplot to 0, and just fill the density plot, in order to avoid some real hack with calculating your segments etc.
You can then pretty neatly fine tune your graphs to your liking.
library(tidyverse)
library(ggstance)
#>
#> Attaching package: 'ggstance'
#> The following objects are masked from 'package:ggplot2':
#>
#> geom_errorbarh, GeomErrorbarh
dataset <- data.frame(Age = sample(1:20, 100, replace = T))
my_quant <- dataset %>%
summarise(Age = list(enframe(quantile(Age, probs=c(0.25,0.5,0.75))))) %>%
unnest
my_y <- 0
ggplot(dataset) +
ggstance::geom_boxploth(aes(x = Age, y = my_y), width = .05) +
geom_density(aes(x = Age)) +
annotate(geom = "label", x = my_quant$value, my_y, label = my_quant$value) +
coord_flip()
Now adding a fill.
ggplot(dataset) +
ggstance::geom_boxploth(aes(x = Age, y = my_y), width = .05) +
geom_density(aes(x = Age), fill = 'white') +
annotate(geom = "label", x = my_quant$value, my_y, label = my_quant$value) +
coord_flip()
Created on 2019-07-29 by the reprex package (v0.2.1)
When using the standard R boxplot command, use the command text to include the 5 statistical parameters into the graph.
Example:
#
boxplot(arq1$J00_J99,arq1$V01_Y89,horizontal = TRUE)
text(x = boxplot.stats(arq1$J00_J99)$stats, labels =
boxplot.stats(arq1$J00_J99)$stats, y = 0.5)
text(x = boxplot.stats(arq1$V01_Y89)$stats, labels =
boxplot.stats(arq1$V01_Y89)$stats, y = 2.5)
This shows one overlap of the labels into the upper boxplot
To avoid this, execute text twice, selecting distinct statistical parameters into distinct y heights:
text(x = boxplot.stats(arq1$V01_Y89)$stats[2:5], labels =
boxplot.stats(arq1$V01_Y89)$stats[2:5], y = 2.5)
text(x = boxplot.stats(arq1$V01_Y89)$stats[1], labels =
boxplot.stats(arq1$V01_Y89)$stats[1], y = 2.)
#
Above I have asked to include the parameters from 2 to 5: 1st quartile, median, 3rd quartile and maximum value at y=2.5 and the minimum value at y=2.
This solves any kind of statistical parameters overlapping into boxplots
When using the standard R boxplot command, use the command text to include the 5 statistical parameters into the graph, for example:
boxplot(arq1$J00_J99,arq1$V01_Y89,horizontal = TRUE)
text(x = boxplot.stats(arq1$J00_J99)$stats, labels = boxplot.stats(arq1$J00_J99)$stats, y = 0.5)
text(x = boxplot.stats(arq1$V01_Y89)$stats, labels = boxplot.stats(arq1$V01_Y89)$stats, y = 2.5)
This shows one overlap of the labels into the upper boxplot.
To avoid this, execute text twice, selecting distinct statistical parameters into distinct y heights:
text(x = boxplot.stats(arq1$V01_Y89)$stats[2:5], labels = boxplot.stats(arq1$V01_Y89)$stats[2:5], y = 2.5)
text(x = boxplot.stats(arq1$V01_Y89)$stats[1], labels = boxplot.stats(arq1$V01_Y89)$stats[1], y = 2.)
above I have asked to include the parameters from 2 to 5: 1st quartile, median, 3rd quartile and maximum value at y=2.5 and the minimum value at y=2
This solves any kind of statistical parameters overlapping into boxplots
Related
I'm plotting correlations in ggpairs and am splitting the data based on a filter.
The density plots are normalising themselves on the number of data points in each filtered group. I would like them to normalise on the total number of data points in the entire data set. Essentially, I would like to be able to have the sum of the individual density plots be equal to the density plot of the entire dataset.
I know this probably breaks the definition of "density plot", but this is a presentation style I'd like to explore.
In plain ggplot, I can do this by adding y=..count.. to the aesthetic, but ggpairs doesn't accept x or y aesthetics.
Some sample code and plots:
set.seed(1234)
group = as.numeric(cut(runif(100),c(0,1/2,1),c(1,2)))
x = rnorm(100,group,1)
x[group == 1] = (x[group == 1])^2
y = (2 * x) + rnorm(100,0,0.1)
data = data.frame(group = as.factor(group), x = x, y = y)
#plot of everything
data %>%
ggplot(aes(x)) +
geom_density(color = "black", alpha = 0.7)
#the scaling I want
data %>%
ggplot(aes(x,y=..count.., fill=group)) +
geom_density(color = "black", alpha = 0.7)
#the scaling I get
data %>%
ggplot(aes(x, fill=group)) +
geom_density(color = "black", alpha = 0.7)
data %>% ggpairs(., columns = 2:3,
mapping = ggplot2::aes(colour=group),
lower = list(continuous = wrap("smooth", alpha = 0.5, size=1.0)),
diag = list(continuous = wrap("densityDiag", alpha=0.5 ))
)
Are there any suggestions that don't involve reformatting the entire dataset?
I am not sure I understand the question but if the densities of both groups plus the density of the entire data is to be plotted, it can easily be done by
Getting rid of the grouping aesthetics, in this case, fill.
Placing another call to geom_density but this time with inherit.aes = FALSE so that the previous aesthetics are not inherited.
And then plot the densities.
library(tidyverse)
data %>%
ggplot(aes(x, y=..count.., fill = group)) +
geom_density(color = "black", alpha = 0.7) +
geom_density(mapping = aes(x, y = ..count..),
inherit.aes = FALSE)
I really struggle to set the correct legend for a geom_point plot with loess regression, while there is 2 data set used
I got a data set, who is summarizing activity over a day, and then I plot on the same graph, all the activity per hours and per days recorded, plus a regression curve smoothed with a loess function, plus the mean of each hours for all the days.
To be more precise, here is an example of the first code, and the graph returned, without legend, which is exactly what I expected:
# first graph, which is given what I expected but with no legend
p <- ggplot(dat1, aes(x = Hour, y = value)) +
geom_point(color = "darkgray", size = 1) +
geom_point(data = dat2, mapping = aes(x = Hour, y = mean),
color = 20, size = 3) +
geom_smooth(method = "loess", span = 0.2, color = "red", fill = "blue")
and the graph (in grey there is all the data, per hours, per days. the red curve is the loess regression. The blue dots are the means for each hours):
When I tried to set the legend I failed to plot one with the explanation for both kind of dots (data in grey, mean in blue), and the loess curve (in red). See below some example of what I tried.
# second graph, which is given what I expected + the legend for the loess that
# I wanted but with not the dot legend
p <- ggplot(dat1, aes(x = Hour, y = value)) +
geom_point(color = "darkgray", size = 1) +
geom_point(data = dat2, mapping = aes(x = Hour, y = mean),
color = "blue", size = 3) +
geom_smooth(method = "loess", span = 0.2, aes(color = "red"), fill = "blue") +
scale_color_identity(name = "legend model", guide = "legend",
labels = "loess regression \n with confidence interval")
I obtained the good legend for the curve only
and another trial :
# I tried to combine both date set into a single one as following but it did not
# work at all and I really do not understand how the legends works in ggplot2
# compared to the normal plots
A <- rbind(dat1, dat2)
p <- ggplot(A, aes(x = Heure, y = value, color = variable)) +
geom_point(data = subset(A, variable == "data"), size = 1) +
geom_point(data = subset(A, variable == "Moy"), size = 3) +
geom_smooth(method = "loess", span = 0.2, aes(color = "red"), fill = "blue") +
scale_color_manual(name = "légende",
labels = c("Data", "Moy", "loess regression \n with confidence interval"),
values = c("darkgray", "royalblue", "red"))
It appears that all the legend settings are mixed together in a "weird" way, the is a grey dot covering by a grey line, and then the same in blue and in red (for the 3 labels). all got a background filled in blue:
If you need to label the mean, might need to be a bit creative, because it's not so easy to add legend manually in ggplot.
I simulate something that looks like your data below.
dat1 = data.frame(
Hour = rep(1:24,each=10),
value = c(rnorm(60,0,1),rnorm(60,2,1),rnorm(60,1,1),rnorm(60,-1,1))
)
# classify this as raw data
dat1$Data = "Raw"
# calculate mean like you did
dat2 <- dat1 %>% group_by(Hour) %>% summarise(value=mean(value))
# classify this as mean
dat2$Data = "Mean"
# combine the data frames
plotdat <- rbind(dat1,dat2)
# add a dummy variable, we'll use it later
plotdat$line = "Loess-Smooth"
We make the basic dot plot first:
ggplot(plotdat, aes(x = Hour, y = value,col=Data,size=Data)) +
geom_point() +
scale_color_manual(values=c("blue","darkgray"))+
scale_size_manual(values=c(3,1),guide=FALSE)
Note with the size, we set guide to FALSE so it will not appear. Now we add the loess smooth, one way to introduce the legend is to introduce a linetype, and since there's only one group, you will have just one variable:
ggplot(plotdat, aes(x = Hour, y = value,col=Data,size=Data)) +
geom_point() +
scale_color_manual(values=c("blue","darkgray"))+
scale_size_manual(values=c(3,1),guide=FALSE)+
geom_smooth(data=subset(plotdat,Data="Raw"),
aes(linetype=line),size=1,alpha=0.3,
method = "loess", span = 0.2, color = "red", fill = "blue")
I am plotting a box-plot to see the distribution of the variable. I am also interested in seeing the number of observations in each quartile. Is there any way to add the number of observations in each quartile to the boxplot along with the values of quartiles?
I included some code below which can generate box-plot with the values of quartiles.
df <- datasets::iris
boxplot <- ggplot(df, aes(x = "", y = Sepal.Length)) +
geom_boxplot(width=0.1, position = "dodge", fill = "red") +
stat_boxplot(geom = "errorbar", width = 0.1) +
stat_summary(geom = "label_repel", fun.y = quantile, aes(label = ..y..),
position = position_nudge(x = -0.1), size = 3) +
ggtitle("") +
xlab("") +
ylab('Sepal.Length')
I expect the values of quartiles on the left-hand side of the plot and the number of observations on the right-hand side of the plot if possible.
this would be one possibility. I always prefer to have my additional data as an extra data frame, because this gives me more control on what is how calculated.
Counting made with some inspiration from https://stackoverflow.com/a/54451575
quantile_counts=function(x){
df= data.frame(label=table(cut(x, quantile(x))),
label_pos=diff(quantile(x))/2+quantile(x)[1:4])
return(df)
}
df_quantile_counts=quantile_counts(df$Sepal.Length)
boxplot <- ggplot(df, aes(x = "", y = Sepal.Length)) +
geom_boxplot(width=0.1, position = "dodge", fill = "red") +
stat_boxplot(geom = "errorbar", width = 0.1) +
stat_summary(geom = "label", fun.y = quantile, aes(label = ..y..),
position = position_nudge(x = -0.1), size = 3) +
geom_text(data=df_quantile_counts,aes(x="",y=label_pos,label = label.Freq),
position = position_nudge(x = +0.1), size = 3) +
ggtitle("") +
xlab("") +
ylab('Sepal.Length')
HTH, Tobi
#TobiO 's answer is correct. But, my data was kind of skewed and some cut points were the same (such as the first and second cut points were the same). I needed to take the unique values to calculate the number of observations in each quartile. Another point is related to usage of cut function which does not include the starting point (low bound, high bound]. In order to include the starting point, I have used the cut2 function from the Hmisc package. I included a label_pos_extension line in order to prevent the overlap of label/text for the quartiles whose cut points are very close to each other. geom_text_repel did not work for preventing the overlaps.
quantile_counts2 <- function(x){
label_pos_extension <- c(0,3,4,0)
if(length(unique(quantile(x))) < 5){
df <- data.frame(label = table(cut2(x, g = 4)),
label_pos = c(0, diff(unique(quantile(x))) / 2 + quantile(x)[1:length(unique(quantile(x)))-1]) + label_pos_extension[1:length(unique(quantile(x)))])
} else {
df <- data.frame(label = table(cut2(x, g = 4)),
label_pos = diff(quantile(x)) / 2 + quantile(x)[1:4] + label_pos_extension)
} return(df)
}
PS. I tried to put my edited function in comment but, it did not work.
I'm approximating a distribution with gaussian mixtures and was wondering whether there was an easy way to automatically plot the estimated kernel density of the whole (uni-dimensional) dataset as the sum of the component densities in a nice fashion like this using ggplot2:
Given the following example data, my approach in ggplot2 would be to manually plot the subset densities into the scaled overall density like this:
#example data
a<-rnorm(1000,0,1) #component 1
b<-rnorm(1000,5,2) #component 2
d<-c(a,b) #overall data
df<-data.frame(d,id=rep(c(1,2),each=1000)) #add group id
##ggplot2
require(ggplot2)
ggplot(df) +
geom_density(aes(x=d,y=..scaled..)) +
geom_density(data=subset(df,id==1), aes(x=d), lty=2) +
geom_density(data=subset(df,id==2), aes(x=d), lty=4)
Note that this does not work out regarding the scales. It also does not work when you scale all 3 densities or no density at all. So I was not able to replicate above plot.
In addition, I am not able to automatically generate this plot without having to subset manually. I tried using position = "stacked" as parameter in geom_density.
I usually have around 5-6 Components per dataset, so manually subsetting would be possible. However, I would like to have different colors or line-types per component density which are displayed in the legend of ggplot, so doing all subsets manually would increase the workload quite a bit.
Any ideas?
Thanks!
Here is a possible solution by specifying each density in the aes call with position = "identity" in one layer and in the second layer using stacked density without the legend.
ggplot(df) +
stat_density(aes(x = d, linetype = as.factor(id)), position = "stack", geom = "line", show.legend = F, color = "red") +
stat_density(aes(x = d, linetype = as.factor(id)), position = "identity", geom = "line")
Do note that when using more then two groups:
a <- rnorm(1000, 0, 1)
b <- rnorm(1000, 5, 2)
c <- rnorm(1000, 3, 2)
d <- rnorm(1000, -2, 1)
d <- c(a, b, c, d)
df <- data.frame(d, id = as.factor(rep(c(1, 2, 3, 4), each = 1000)))
curves for each stack appear (this is a problem with the two group example but linetype in first layer disguised it - use group instead to check) :
gplot(df) +
stat_density(aes(x = d, group = id), position = "stack", geom = "line", show.legend = F, color = "red") +
stat_density(aes(x = d, linetype = id), position = "identity", geom = "line")
A relatively easy fix to this is to add alpha mapping and manually set it to 0 for unwanted curves:
ggplot(df) +
stat_density(aes(x=d, alpha = id), position = "stack", geom = "line", show.legend = F, color = "red") +
stat_density(aes(x=d, linetype = id), position = "identity", geom = "line")+
scale_alpha_manual(values = c(1,0,0,0))
I am trying to generate a (grouped) density plot labelled with sample sizes.
Sample data:
set.seed(100)
df <- data.frame(ab.class = c(rep("A", 200), rep("B", 200)),
val = c(rnorm(200, 0, 1), rnorm(200, 1, 1)))
The unlabelled density plot is generated and looks as follows:
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
What I want to do is add text labels somewhere near the peak of each density, showing the number of samples in each group. However, I cannot find the right combination of options to summarise the data in this way.
I tried to adapt the code suggested in this answer to a similar question on boxplots: https://stackoverflow.com/a/15720769/1836013
n_fun <- function(x){
return(data.frame(y = max(x), label = paste0("n = ",length(x))))
}
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4) +
stat_summary(geom = "text", fun.data = n_fun)
However, this fails with Error: stat_summary requires the following missing aesthetics: y.
I also tried adding y = ..density.. within aes() for each of the geom_density() and stat_summary() layers, and in the ggplot() object itself... none of which solved the problem.
I know this could be achieved by manually adding labels for each group, but I was hoping for a solution that generalises, and e.g. allows the label colour to be set via aes() to match the densities.
Where am I going wrong?
The y in the return of fun.data is not the aes. stat_summary complains that he cannot find y, which should be specificed in global settings at ggplot(df, aes(x = val, group = ab.class, y = or stat_summary(aes(y = if global setting of y is not available. The fun.data compute where to display point/text/... at each x based on y given in the data through aes. (I am not sure whether I have made this clear. Not a native English speaker).
Even if you have specified y through aes, you won't get desired results because stat_summary compute a y at each x.
However, you can add text to desired positions by geom_text or annotate:
# save the plot as p
p <- ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
# build the data displayed on the plot.
p.data <- ggplot_build(p)$data[[1]]
# Note that column 'scaled' is used for plotting
# so we extract the max density row for each group
p.text <- lapply(split(p.data, f = p.data$group), function(df){
df[which.max(df$scaled), ]
})
p.text <- do.call(rbind, p.text) # we can also get p.text with dplyr.
# now add the text layer to the plot
p + annotate('text', x = p.text$x, y = p.text$y,
label = sprintf('n = %d', p.text$n), vjust = 0)