How to embed the number of observations into violin plots? - r

I want to put data on facets of violin plots and annotate these violins with the number of observations used to plot the violin.
Here is an example of what I have without observation counts:
library(ggplot2)
library(dplyr)
library(tidyverse)
data("iris")
c <- rep(c('r', 'g', 'b'), 50)
c <- sample(c)
facet_row <- rep(c('row1', 'row2', 'row3', 'row4', 'row5'), 30)
facet_col <- rep(c('col1', 'col2', 'col3'), 50)
iris$facet_rows <- facet_row
iris$facet_cols <- facet_col
iris$color <- c
iris$count <- sample(1:10, size = 150, replace = T)
p <- ggplot(iris, aes(x=Species, y=Petal.Length, fill=color)) +
geom_violin(alpha = 0.7, na.rm = T) +
coord_flip() +
facet_grid(rows = vars(facet_rows), cols = vars(facet_cols))
print(p)
Result:
I want to put the number of observations right behind those violins.
I tried this so far:
count_data <- function (y){
df <- data.frame(y = min(y) - 0.2, label = length(y))
return(df)
}
p <- ggplot(iris, aes(x=Species, y=Petal.Length, fill=color)) +
geom_violin(alpha = 0.7, na.rm = T) + stat_summary(fun.data = count_data, geom = "text", aes(group = Species)) +
coord_flip() +
facet_grid(rows = vars(facet_rows), cols = vars(facet_cols))
print(p)
This produces an output with an issue:
Grouped violins now share a single count value. The problem is that those violins will almost certainly have different numbers of observations.
I have tried to just draw a geom_text using precomputed number of observations
(assume that iris$count actually contains observation counts that will have the same value for different rows, but random here):
p <- ggplot(iris, aes(x=Species, y=Petal.Length, fill=color)) +
geom_violin(alpha = 0.7, na.rm = T) + geom_text(aes(label=count, y=Petal.Length), nudge_y = -0.1) +
coord_flip() +
facet_grid(rows = vars(facet_rows), cols = vars(facet_cols))
print(p)
This has problems similar to the previous approach:
It has values for two violins in the same group in one line.
Each violin repeats the number of observations once for each observation.
I am relatively new to R, I feel like there is a clean way to do this, but I can't figure it out...

Removing the explicit grouping and adding position_dodge resolved the issue:
count_data <- function (y){
df <- data.frame(y = min(y) - 0.2, label = length(y))
return(df)
}
p <- ggplot(iris, aes(x=Species, y=Petal.Length, fill=color)) +
geom_violin(alpha = 0.7, na.rm = T) + stat_summary(fun.data = count_data, geom = "text", position = position_dodge(1)) +
coord_flip() +
facet_grid(rows = vars(facet_rows), cols = vars(facet_cols))
print(p)
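As an alternative to stat_summary, the counts can be computed up front with dplyr and drawn with geom_text, which gives one label per violin. This is only a sketch under assumptions: the names `d` and `counts` are made up for illustration, the facets are left out for brevity (the facet columns would have to be added to `group_by()`), and the dodge width of 0.9 is chosen to match the default width of geom_violin.

```r
library(ggplot2)
library(dplyr)

set.seed(1)
d <- iris
d$color <- sample(rep(c("r", "g", "b"), 50))

# One row per violin: its observation count and a label position
# slightly below the smallest value in that violin
counts <- d %>%
  group_by(Species, color) %>%
  summarise(n = n(), y = min(Petal.Length) - 0.2, .groups = "drop")

p <- ggplot(d, aes(x = Species, y = Petal.Length, fill = color)) +
  geom_violin(alpha = 0.7, na.rm = TRUE) +
  # fill is inherited, so the labels are grouped and dodged like the violins
  geom_text(data = counts, aes(y = y, label = n),
            position = position_dodge(width = 0.9)) +
  coord_flip()
print(p)
```

Because `counts` has exactly one row per Species/color combination, each violin gets a single label instead of one label per observation.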

Related

How to get a complete vector of breaks from the scale of a plot in R?

I am trying to add captions outside the plots using the solutions from this post and this one.
I think I managed to get what I want, but I am trying to automate the code so that it still works if the data changes. My problem is that I need a way to get the vector of all the values/breaks on the y-axis of the plot. I don't want to change the y-axis, and I don't want only the range (I found this post on getting the ranges, but that is not enough).
On the other hand, I found this post, but the solution doesn't work for new versions of ggplot2 (mine is 3.3.5).
This is my example:
library(ggplot2)
library(dplyr)
# DATA
val1 <- c(2.1490626,2.2035281,1.5927854,3.1399245,2.3967338,3.7915825,4.6691277,3.0727319,2.9230937,2.6239759,3.7664386,4.0160378,1.2500835,4.7648343,0.0000000,5.6740227,2.7510256,3.0709322,2.7998003,4.0809085,2.5178086,5.9713330,2.7779843,3.6724801,4.2648527,3.6841084,2.5597235,3.8477471,2.6587736,2.2742209,4.5862788,6.1989269,4.1167091,3.1769325,4.2404515,5.3627032,4.1576810,4.3387921,1.4024381,0.0000000,4.3999099,3.4381837,4.8269218,2.6308474,5.3481382,4.9549753,4.5389650,1.3002293,2.8648220,2.4015338,2.0962332,2.6774765,3.0581759,2.5786137,5.0539080,3.8545796,4.3429043,4.2233248,2.0434363,4.5980727)
val2 <- c(3.7691229,3.6478055,0.5435826,1.9665861,3.0802654,1.2248374,1.7311236,2.2492826,2.2365337,1.5726119,2.0147144,2.3550348,1.9527204,3.3689502,1.7847986,3.5901329,1.6833872,3.4240479,1.8372175,0.0000000,2.5701453,3.6551315,4.0327091,3.8781182)
val3 <- c(2.1490626,2.2035281,1.5927854,3.1399245,2.3967338,3.7915825,4.6691277,3.0727319,2.9230937,2.6239759,3.7664386,4.0160378,1.2500835,4.7648343,0.0000000,5.6740227,2.7510256,3.0709322,2.7998003,4.0809085,2.5178086,5.9713330,2.7779843,3.6724801,4.2648527,3.6841084,2.5597235,3.8477471,2.6587736,2.2742209,4.5862788,6.1989269,4.1167091,3.1769325,4.2404515,5.3627032,4.1576810,4.3387921,1.4024381,0.0000000,4.3999099,3.4381837,4.8269218,2.6308474,5.3481382,4.9549753,4.5389650,1.3002293,2.8648220,2.4015338,2.0962332,2.6774765,3.0581759,2.5786137,5.0539080,3.8545796,4.3429043,4.2233248,2.0434363,4.5980727)
df1 <- data.frame(value = val1)
df2 <- data.frame(value = val2)
df3 <- data.frame(value = val3)
data <- bind_rows(lst(df1, df2, df3), .id = 'id')
data$Sex <- rep(c("Male", "Female"), times=72)
data$d <- "ff"
data <- as.data.frame(unclass(data), stringsAsFactors = TRUE)
# PLOT
p <- data %>%
ggplot(aes(value)) +
geom_density(lwd = 1.2, colour="red", show.legend = FALSE) +
geom_histogram(aes(y=..density.., fill = id), bins=10, col="black", alpha=0.2) +
facet_grid(id ~ Sex ) +
xlab("type_data") +
ylab("Density") +
ggtitle("title") +
guides(fill=guide_legend(title="legend_title")) +
theme(strip.text.y = element_blank())
p
# ADD CAPTION
caption_df = data.frame(value = c(min(data$value), max(data$value)), id = c(rep(tail(levels(data$id), n=1), times=length(levels(data$Sex)))),
Sex = c(levels(data$Sex)))
p + coord_cartesian(clip = "off",
ylim = layer_scales(p)$y$range$range,
xlim = layer_scales(p)$x$range$range) +
geom_text(data = caption_df,
aes(y = -0.15, label = c(levels(data$Sex))))
Before adding the caption:
After the caption:
The idea is that I want to avoid having to set the y parameter every time I change the data. Imagine that the y-axis is different (something like 0.0000, 0.0005, 0.0010, 0.0015). In that case, the appropriate y would be -0.0005, because the "jump" between breaks is 0.0005, so I just have to make it negative.
For that reason, I was wondering if it is possible to get the COMPLETE vector of values from the y-axis.
For example, if we want to get all the values/breaks of the y-axis from the previous images would be: c(0.0, 0.2, 0.4, 0.6).
Does anyone know if I can get ALL the values from the y-axis of a plot?
Thanks in advance
You can get the y axis breaks from the p object like this:
as.numeric(na.omit(layer_scales(p)$y$break_positions()))
#> [1] 0.0 0.2 0.4 0.6
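To get the "jump" described in the question, the spacing between consecutive breaks can be taken with diff(). The following is a minimal sketch, assuming evenly spaced breaks; it uses the built-in mtcars data as a stand-in for the question's data, and the names `brks`, `step` and `ypos` are illustrative only.

```r
library(ggplot2)

# Stand-in plot; any plot with a continuous y scale should behave the same way
p <- ggplot(mtcars, aes(mpg)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10)

# Breaks that fall inside the plotted range (out-of-range breaks come back as NA)
brks <- as.numeric(na.omit(layer_scales(p)$y$break_positions()))

# Distance between neighbouring breaks, then one step below the lowest break
step <- diff(brks)[1]
ypos <- min(brks) - step
```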
However, if you want the labels to be a fixed distance below the panel regardless of the y axis scale, it would be best to use a fixed fraction of the entire panel range rather than the breaks:
yrange <- layer_scales(p)$y$range$range
ypos <- min(yrange) - 0.2 * diff(yrange)
p + coord_cartesian(clip = "off",
ylim = layer_scales(p)$y$range$range,
xlim = layer_scales(p)$x$range$range) +
geom_text(data = caption_df,
aes(y = ypos, label = c(levels(data$Sex))))
For example, suppose you had a y scale that was twice the size:
p <- data %>%
ggplot(aes(value)) +
geom_density(lwd = 1.2, colour="red", show.legend = FALSE) +
geom_histogram(aes(y= 2 * ..density.., fill = id), bins=10, col="black", alpha=0.2) +
facet_grid(id ~ Sex ) +
xlab("type_data") +
ylab("Density") +
ggtitle("title") +
guides(fill=guide_legend(title="legend_title")) +
theme(strip.text.y = element_blank())
Then the exact same code would give you the exact same label placement, without any reference to breaks:
yrange <- layer_scales(p)$y$range$range
ypos <- min(yrange) - 0.2 * diff(yrange)
p + coord_cartesian(clip = "off",
ylim = layer_scales(p)$y$range$range,
xlim = layer_scales(p)$x$range$range) +
geom_text(data = caption_df,
aes(y = ypos, label = c(levels(data$Sex))))

Make geom_histogram display x-axis labels as integers instead of numerics

I have a data.frame that has counts for several groups:
set.seed(1)
df <- data.frame(group = sample(c("a","b"),200,replace = T),
n = round(runif(200,1,2)))
df$n <- as.integer(df$n)
And I'm trying to display a histogram of df$n, facetted by the group using ggplot2's geom_histogram:
library(ggplot2)
ggplot(data = df, aes(x = n)) + geom_histogram() + facet_grid(~group) + theme_minimal()
Any idea how to get ggplot2 to label the x-axis ticks with the integers the histogram is summarizing rather than the numeric values it is currently showing?
You could tweak this with the binwidth argument of geom_histogram:
library(ggplot2)
ggplot(data = df, aes(x = n)) +
geom_histogram(binwidth = 0.5) +
facet_grid(~group) +
theme_minimal()
Another example:
set.seed(1)
df <- data.frame(group = sample(c("a","b"),200,replace = T),
n = round(runif(200,1,5)))
library(ggplot2)
ggplot(data = df, aes(x = n)) +
geom_histogram(binwidth = 0.5) +
facet_grid(~group) +
theme_minimal()
You can manually specify the breaks with scale_x_continuous(breaks = seq(1, 2)). Alternatively, you can set the breaks and labels separately as well.
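A minimal sketch of the manual-breaks approach, assuming the data are whole numbers: the breaks are derived from the data range instead of being hard-coded. The name `int_breaks` and the choice of binwidth = 1 are illustrative assumptions.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(group = sample(c("a", "b"), 200, replace = TRUE),
                 n = as.integer(round(runif(200, 1, 5))))

# One break per integer between the smallest and largest observed value
int_breaks <- seq(min(df$n), max(df$n), by = 1)

p <- ggplot(df, aes(x = n)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = int_breaks) +
  facet_grid(~group) +
  theme_minimal()
print(p)
```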

Color outlier dots above a specific value in R

How do I color outliers that are above a specific value using ggplot2 in R?
Sorry for the seemingly easy question, I am a beginner. The reason I ask is that some of the frequencies are 0, and I transform this column of data by taking -log10(), so anything with a frequency of 0 is transformed into Inf. Attached is a screenshot of my plot; essentially, I want all the outlier points above 10 on the y axis to be a different color.
boxplots <- function(df){
df$'frequency'[is.na(df$'frequency')] <- 0.00
df$'-log10(frequency)' <- -log10(df$'frequency')
x <- data.frame(group = 'x', value = df$'-log10(frequency)'[df$'Type'=='x'])
y <- data.frame(group = 'y', value = df$'-log10(frequency)'[df$'Type'=='y'])
z <- data.frame(group = 'z', value = df$'-log10(frequency)'[df$'Type'=='c=z'])
plot.data <<- rbind(x, y, z)
labels <- c("z", "y", "z")
t<-plot.data %>%
ggplot(aes(x = group, y = value, fill = group))+
geom_boxplot()+
scale_fill_viridis(discrete = TRUE, alpha = 0.6)+
geom_jitter(color="black", size=0.4, alpha=0.9) +
theme_ipsum() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("Distribution of -log10(frequency) by Type") +
xlab("Type")+
ylab("-log10(frequency)")+
scale_x_discrete(labels=labels)+
scale_y_continuous(limits = c(0, 10), breaks = seq(0, 10, by = 2))
print(t)
s<<-t
ggsave("frequency_by_type.png", plot = t)
}
You could create a new column indicating whether each point is an outlier and map it to the geom_jitter color. I reduced the answer to a smaller example, but you should be able to adapt it to your data:
library(ggplot2)
library(dplyr)  # provides %>% and mutate()
library(viridis)
plot.data <- data.frame(group = c("1","1","1","1","1","2","2","2","2","2"),
value = c(1,5,10,6,3,1,5,10,6,3))
t<-plot.data %>%
mutate(outlier = ifelse(value >9, "YES", "NO")) %>%
ggplot(aes(x = group, y = value, fill = group))+
geom_boxplot()+
geom_jitter(aes(group, value, color = outlier) , size=2, alpha=0.9)+
scale_fill_viridis(discrete = TRUE, alpha = 0.6)
t
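The same idea can be sketched for the situation in the question, where -log10(0) produces Inf. This is only an illustration under assumptions: the data are invented, the cutoff of 10 is taken from the question, and the colors are arbitrary. Since Inf > 10 is TRUE, infinite values are flagged together with the other points above the cutoff.

```r
library(ggplot2)

threshold <- 10  # cutoff taken from the question

# Invented frequencies; the zeros become Inf after the transformation
plot.data <- data.frame(
  group = rep(c("x", "y"), each = 5),
  value = -log10(c(0.5, 0.1, 0.01, 1e-12, 0, 0.3, 0.2, 0.05, 1e-11, 0))
)
plot.data$outlier <- plot.data$value > threshold

p <- ggplot(plot.data, aes(x = group, y = value)) +
  # hide the boxplot's own outlier points so they are not drawn twice
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(aes(color = outlier), width = 0.2, height = 0) +
  scale_color_manual(values = c(`FALSE` = "black", `TRUE` = "red"))
```

When printed, ggplot2 warns that the non-finite values were dropped from the boxplot statistics; the points themselves are still colored by the `outlier` flag.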
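library(ggplot2)
# Convert dose to a factor so that each dose gets its own box
ToothGrowth$dose <- as.factor(ToothGrowth$dose)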
# Basic box plot
p <- ggplot(ToothGrowth, aes(x=dose, y=len)) +
geom_boxplot()
p
# Rotate the box plot
p + coord_flip()
# Notched box plot
ggplot(ToothGrowth, aes(x=dose, y=len)) +
geom_boxplot(notch=TRUE)
# Change outlier, color, shape and size
ggplot(ToothGrowth, aes(x=dose, y=len)) +
geom_boxplot(outlier.colour="red", outlier.shape=8,
outlier.size=4)

Is it possible to create a ggMarginal plot without disaggregating the data?

I have a data frame with some points and their frequency of occurrence, and I want to plot the points (balls) using their frequency to represent their size. I also want to use ggMarginal to create the marginal plots. The code below creates the marginals without taking their frequencies into account.
library(ggplot2)
df <- data.frame("x" = 1:5, "y" = c(5,8,8,12,10), "f" = c(4,5,8,8,5))
p <- ggplot(df, aes(x=x, y=y, size=f)) + geom_point() + theme_bw()
ggExtra::ggMarginal(p, data=df, type = "histogram")
I don't want to create another data frame with disaggregated data, although it would lead to the right marginals, as shown below:
# disaggregated data
df2 <- df[ rep(1:nrow(df), df$f), c("x", "y") ]
p <- ggplot(df2, aes(x=x, y=y)) + geom_point() + theme_bw()
ggExtra::ggMarginal(p, data=df2, type = "histogram")
But even if I try to use both data frames, the resulting marginals still go wrong.
p <- ggplot(df, aes(x=x, y=y, size=f)) + geom_point() + theme_bw()
ggExtra::ggMarginal(p, data=df2, type = "histogram")
1. Is it possible to create the marginals without disaggregating the data? How?
2. If 1. is not possible, how can it be done anyway, since none of the examples above produced the desired plot?
It can be done with the cowplot package.
library(tidyverse)
library(cowplot)
df <- data.frame("x" = 1:5,
"y" = c(5,8,8,12,10),
"f" = c(4,5,8,8,5))
df2 <- df[rep(1:nrow(df), df$f), c("x", "y") ]
p <-
ggplot(df, aes(x=x, y=y, size=f)) +
geom_count() +
theme_bw()
xhist <-
axis_canvas(p, axis = "x") +
geom_histogram(data = df2, aes(x = x), color = 'lightgray')
yhist <-
axis_canvas(p, axis = "y", coord_flip = TRUE) +
geom_histogram(data = df2, aes(x = y), color = 'lightgray') +
coord_flip()
p %>%
insert_xaxis_grob(xhist, grid::unit(1, "in"), position = "top") %>%
insert_yaxis_grob(yhist, grid::unit(1, "in"), position = "right") %>%
ggdraw()
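One possible way to avoid the disaggregated data frame altogether is the weight aesthetic of geom_histogram, which lets each row count f times. The sketch below is an assumption on my part and not part of the answer above; it only checks that the weighted histogram on `df` yields the same bin counts as the plain histogram on `df2`. The binwidth of 1 and the names `xhist_weighted`, `xhist_expanded`, `w` and `e` are illustrative.

```r
library(ggplot2)

df <- data.frame(x = 1:5, y = c(5, 8, 8, 12, 10), f = c(4, 5, 8, 8, 5))

# Weighted histogram on the aggregated data: each row contributes f observations
xhist_weighted <- ggplot(df, aes(x = x, weight = f)) +
  geom_histogram(binwidth = 1)

# Reference: the same histogram built from the disaggregated rows
df2 <- df[rep(seq_len(nrow(df)), df$f), c("x", "y")]
xhist_expanded <- ggplot(df2, aes(x = x)) +
  geom_histogram(binwidth = 1)

# Extract the computed bin counts from both plots
w <- layer_data(xhist_weighted)$count
e <- layer_data(xhist_expanded)$count
```

If the counts agree, a layer such as geom_histogram(data = df, aes(x = x, weight = f)) could presumably replace the df2-based layers in the cowplot code.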

ggplot2 confusion matrix geom_text labeling

I've plotted a confusion matrix (predicting 5 outcomes) in R using ggplot and scales for geom_text labeling.
The way geom_text(aes(label = percent(Freq/sum(Freq))) is written in code, it's showing Frequency of each box divided by sum of all observations, but what I want to do is get Frequency of each box divided by sum Frequency for each Reference.
In other words, instead of A,A = 15.8%,
it should be A,A = 15.8%/(0.0%+0.0%+0.0%+0.0%+15.8%) = 100.0%
library(ggplot2)
library(scales)
library(caret)  # provides confusionMatrix()
valid_actual <- as.factor(c("A","B","B","C","C","C","E","E","D","D","A","A","A","E","E","D","D","C","B"))
valid_pred <- as.factor(c("A","B","C","C","E","C","E","E","D","B","A","B","A","E","D","E","D","C","B"))
cfm <- confusionMatrix(valid_actual, valid_pred)
ggplotConfusionMatrix <- function(m){
mytitle <- paste("Accuracy", percent_format()(m$overall[1]),
"Kappa", percent_format()(m$overall[2]))
p <-
ggplot(data = as.data.frame(m$table) ,
aes(x = Reference, y = Prediction)) +
geom_tile(aes(fill = log(Freq)), colour = "white") +
scale_fill_gradient(low = "white", high = "green") +
geom_text(aes(x = Reference, y = Prediction, label = percent(Freq/sum(Freq)))) +
theme(legend.position = "none") +
ggtitle(mytitle)
return(p)
}
ggplotConfusionMatrix(cfm)
The problem is that, as far as I know, ggplot is not able to do group calculations. See this recent post for a similar question.
To solve your problem, you can take advantage of the dplyr package.
This should work:
library(ggplot2)
library(scales)
library(caret)
library(dplyr)
valid_actual <- as.factor(c("A","B","B","C","C","C","E","E","D","D","A","A","A","E","E","D","D","C","B"))
valid_pred <- as.factor(c("A","B","C","C","E","C","E","E","D","B","A","B","A","E","D","E","D","C","B"))
cfm <- confusionMatrix(valid_actual, valid_pred)
ggplotConfusionMatrix <- function(m){
mytitle <- paste("Accuracy", percent_format()(m$overall[1]),
"Kappa", percent_format()(m$overall[2]))
data_c <- mutate(group_by(as.data.frame(m$table), Reference ), percentage =
percent(Freq/sum(Freq)))
p <-
ggplot(data = data_c,
aes(x = Reference, y = Prediction)) +
geom_tile(aes(fill = log(Freq)), colour = "white") +
scale_fill_gradient(low = "white", high = "green") +
geom_text(aes(x = Reference, y = Prediction, label = percentage)) +
theme(legend.position = "none") +
ggtitle(mytitle)
return(p)
}
ggplotConfusionMatrix(cfm)
And the result:
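The grouped percentage step can be checked on its own, without plotting. The following is a minimal sketch that assumes a base R table() as a stand-in for the table stored in the caret confusion matrix; the column names Prediction and Reference mirror the ones used above, and `share` is a hypothetical helper column.

```r
library(dplyr)
library(scales)

valid_actual <- as.factor(c("A","B","B","C","C","C","E","E","D","D","A","A","A","E","E","D","D","C","B"))
valid_pred <- as.factor(c("A","B","C","C","E","C","E","E","D","B","A","B","A","E","D","E","D","C","B"))

# Same argument order as confusionMatrix(valid_actual, valid_pred) above
tab <- as.data.frame(table(Prediction = valid_actual, Reference = valid_pred))

# Divide each cell by the total of its Reference column
data_c <- tab %>%
  group_by(Reference) %>%
  mutate(share = Freq / sum(Freq),
         percentage = percent(share)) %>%
  ungroup()
```

Within each Reference level the shares add up to 1, so the A,A cell becomes 100% as requested in the question.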
