geom_violin overlapping plots - r

By default, the neighboring violins will touch each other at the widest point if the widest point occurs at the same height. I would like to make my violin plots wider so that they overlap each other. Basically, something more similar to a ridge plot:
Is that possible with geom_violin?
I saw the width parameter, but if I set it higher than 1, I get these warnings, which makes me think that may not be the most appropriate approach:
Warning: position_dodge requires non-overlapping x intervals

I don't think geom_violin is meant to do this by design, but we can hack it with some effort.
Illustration using the diamonds dataset from ggplot2:
library(dplyr)
library(ggplot2)

# normal violin plot
p1 <- diamonds %>%
  ggplot(aes(color, depth)) +
  geom_violin()

# overlapping violin plot
p2 <- diamonds %>%
  rename(x.label = color) %>% # rename the x-variable here;
                              # rest of the code need not be changed
  mutate(x = as.numeric(factor(x.label)) / 2) %>%
  ggplot(aes(x = x, y = depth, group = x)) +
  # plot violins in two separate layers, such that neighbouring x values are
  # never plotted in the same layer & there's no overlap WITHIN each layer
  geom_violin(data = . %>% filter(x %% 1 != 0)) +
  geom_violin(data = . %>% filter(x %% 1 == 0)) +
  # add label for each violin near the bottom of the chart
  geom_text(aes(y = min(depth), label = x.label), vjust = 2, check_overlap = TRUE) +
  # hide x-axis labels as they are irrelevant now
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

gridExtra::grid.arrange(
  p1 + ggtitle("Normal violins"),
  p2 + ggtitle("Overlapping violins"),
  nrow = 2
)
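To see why the two layers never contain neighbouring violins, the halving trick can be checked on its own. This is a minimal base R sketch, assuming the seven levels of diamonds$color (D to J); the names lv, x and layer are only illustrative and do not appear in the answer.

```r
# Levels of diamonds$color, written out so the sketch needs no packages
lv <- c("D", "E", "F", "G", "H", "I", "J")

# Same transformation as in the answer: level index divided by 2
x <- seq_along(lv) / 2

# Non-integer positions go to the first layer, integer positions to the second
layer <- ifelse(x %% 1 != 0, "first", "second")

data.frame(level = lv, x = x, layer = layer)
```

Within each layer the positions are 1 apart, while neighbours across layers are only 0.5 apart, which is presumably why the violins can overlap between layers without triggering the position_dodge warning inside a layer.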

Related

ggplot2 - align overlayed points in center of boxplot, and connect the points with lines

I am working on a boxplot with points overlayed and lines connecting the points between two time sets, example data provided below.
I have two questions:
I would like the points to look like this, with just a little height jitter and more width jitter. However, I want the points to be symmetrically centered around the middle of the boxplot on each y axis label (to make the plots more visually pleasing). For example, I would like the 6 datapoints at y = 4 and x = "after" to be placed 3 to the right of the boxplot center and 3 to the left of the center, at symmetrical distances from the center.
Also, I want the lines to connect with the correct points, but now the lines start and end in the wrong places. I know I can use position = position_dodge() in geom_point() and geom_line() to get the correct positions, but I want to be able to adjust the points by height also (why do the points and lines align with position_dodge() but not with position_jitter?).
Are these two things possible to achieve?
Thank you!
library(ggplot2)

examiner <- rep(1:15, 2)
time <- rep(c("before", "after"), each = 15)
result <- c(1,3,2,3,2,1,2,4,3,2,3,2,1,3,3,3,4,4,5,3,4,3,2,2,3,4,3,4,4,3)
data <- data.frame(examiner, time, result)

ggplot(data, aes(time, result, fill = time)) +
  geom_boxplot() +
  geom_point(aes(group = examiner),
             position = position_jitter(width = 0.2, height = 0.03)) +
  geom_line(aes(group = examiner),
            position = position_jitter(width = 0.2, height = 0.03), alpha = 0.3)
I'm not sure that you can satisfy both of your questions together.
You can have a more "symmetric" jitter by using a geom_dotplot, as per:
ggplot(data, aes(time, result, fill = time)) +
  geom_boxplot() +
  geom_dotplot(binaxis = "y", aes(x = time, y = result, group = time),
               stackdir = "center", binwidth = 0.075)
The problem is that when you add the lines, they will join at the original, un-jittered points.
To join jittered points with lines that map to the jittered points, the jitter can be added to the data before plotting. As you saw, jittering both ends up with points and lines that don't match. See Connecting grouped points with lines in ggplot for a better explanation.
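The mismatch comes from each layer drawing its own random offsets. A minimal base R sketch of that effect, using four made-up positions (the vector x and the variable names are hypothetical, not taken from the question):

```r
set.seed(1)

# Positions of four hypothetical observations on a discrete axis
x <- c(1, 1, 2, 2)

# Two independent calls, as happens when two layers each apply their own jitter
jit_points <- jitter(x, amount = 0.1)
jit_lines  <- jitter(x, amount = 0.1)
isTRUE(all.equal(jit_points, jit_lines))  # the two sets of offsets differ

# Computing the jitter once and reusing it keeps both layers on the same coordinates
jit_shared <- jitter(x, amount = 0.1)
point_x <- jit_shared
line_x  <- jit_shared
identical(point_x, line_x)
```

Storing the jittered values as columns in the data, as the code that follows does, is the same idea applied to the plot: both geoms read one shared set of coordinates.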
library(dplyr)

data <- data %>%
  mutate(result_jit = jitter(result, amount = 0.1),
         # "after" sorts before "before" on the discrete axis, so it sits at x = 1
         time_jit = jitter(case_when(
           time == "before" ~ 2,
           time == "after" ~ 1
         ), amount = 0.1))

ggplot(data, aes(time, result, fill = time)) +
  geom_boxplot() +
  geom_point(aes(x = time_jit, y = result_jit, group = examiner)) +
  geom_line(aes(x = time_jit, y = result_jit, group = examiner), alpha = 0.3)
It is possible to extract the transformed points from the geom_dotplot using ggplot_build() - see Is it possible to get the transformed plot data? (e.g. coordinates of points in dot plot, density curve)
These points can be merged onto the original data, to be used as the anchor points for the geom_line.
Putting it all together:
library(dplyr)
library(ggplot2)

examiner <- rep(1:15, 2)
time <- rep(c("before", "after"), each = 15)
result <- c(1,3,2,3,2,1,2,4,3,2,3,2,1,3,3,3,4,4,5,3,4,3,2,2,3,4,3,4,4,3)

# Create a numeric version of time
data <- data.frame(examiner, time, result) %>%
  mutate(group = case_when(
    time == "before" ~ 2,
    time == "after" ~ 1))

# Build a ggplot of the dotplot to extract data
dotpoints <- ggplot(data, aes(time, result, fill = time)) +
  geom_dotplot(binaxis = "y", aes(x = time, y = result, group = time),
               stackdir = "center", binwidth = 0.075)

# Extract values of the dotplot
dotpoints_dat <- ggplot_build(dotpoints)[["data"]][[1]] %>%
  mutate(key = row_number(),
         x = as.numeric(x),
         newx = x + 1.2 * stackpos * binwidth / 2) %>%
  select(key, x, y, newx)

# Join the extracted values to the original data
data <- arrange(data, group, result) %>%
  mutate(key = row_number())
newdata <- inner_join(data, dotpoints_dat, by = "key") %>%
  select(-key)

# Create final plot
ggplot(newdata, aes(time, result, fill = time)) +
  geom_boxplot() +
  geom_dotplot(binaxis = "y", aes(x = time, y = result, group = time),
               stackdir = "center", binwidth = 0.075) +
  geom_line(aes(x = newx, y = result, group = examiner), alpha = 0.3)

ggplot2 barplot - adding percentage labels inside the stacked bars but retaining counts on the y-axis

I have created a stacked barplot with the counts of a variable. I want to keep these as counts, so that the different bar sizes represent different group sizes. However, inside the bars I would like to add labels that show the proportion of each stack as a percentage.
I managed to create the stacked plot of counts for every group. I have also created the labels and they are placed correctly. What I struggle with is how to calculate the percentage there.
I have tried this, but I get an error:
dataex <- iris %>%
dplyr::group_by(group, Species) %>%
dplyr::summarise(N = n())
names(dataex)
dataex <- as.data.frame(dataex)
str(dataex)
ggplot(dataex, aes(x = group, y = N, fill = factor(Species))) +
geom_bar(position="stack", stat="identity") +
geom_text(aes(label = ifelse((..count..)==0,"",scales::percent((..count..)/sum(..count..)))), position = position_stack(vjust = 0.5), size = 3) +
theme_pubclean()
Error in (count) == 0 : comparison (1) is possible only for atomic
and list types
Well, I just found an answer, or at least a workaround. Maybe this will help someone in the future: calculate the percentage before the ggplot call and then just use that column as the labels.
library(dplyr)
library(ggplot2)
library(ggpubr)  # provides theme_pubclean()

dataex <- iris %>%
  dplyr::group_by(group, Species) %>%
  dplyr::summarise(N = n()) %>%
  dplyr::mutate(pct = paste0((round(N/sum(N)*100, 2)), " %"))
names(dataex)
dataex <- as.data.frame(dataex)
str(dataex)

ggplot(dataex, aes(x = group, y = N, fill = factor(Species))) +
  geom_bar(position = "stack", stat = "identity") +
  # refer to the column by name inside aes() rather than via dataex$pct
  geom_text(aes(label = pct), position = position_stack(vjust = 0.5), size = 3) +
  theme_pubclean()
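The workaround relies on a detail that is easy to miss: after summarise(), the result is still grouped by the first grouping variable, so sum(N) is the total of each bar rather than of the whole data set. Below is a minimal sketch of just that step. The built-in iris data has no group column, so a hypothetical one with values "A", "B" and "C" is made up here purely for illustration; share and pct_tbl are also illustrative names.

```r
library(dplyr)

# Hypothetical grouping column (assumption: iris itself has no `group` variable)
dat <- iris %>%
  mutate(group = rep(c("A", "B", "C"), length.out = n()))

pct_tbl <- dat %>%
  group_by(group, Species) %>%
  # drop only the last grouping level, so the result stays grouped by `group`
  summarise(N = n(), .groups = "drop_last") %>%
  # sum(N) is therefore the total count of each bar
  mutate(share = N / sum(N),
         pct = paste0(round(share * 100, 2), " %")) %>%
  ungroup()

pct_tbl
```

The shares within each value of group add up to 1, which is what the labels inside one stacked bar should do while the bar heights remain raw counts.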

Density curves on multiple histograms sharing same y-axis

I need to overlay normal density curves on 3 histograms sharing the same y-axis. The curves need to be separate for each histogram.
My dataframe (example):
height <- seq(140,189, length.out = 50)
weight <- seq(67,86, length.out = 50)
fev <- seq(71,91, length.out = 50)
df <- as.data.frame(cbind(height, weight, fev))
I created the histograms for the data as:
library(ggplot2)
library(tidyr)

df %>%
  gather(key = Type, value = Value) %>%
  ggplot(aes(x = Value, fill = Type)) +
  geom_histogram(binwidth = 8, position = "dodge")
I am now stuck at how to overlay normal density curves for the 3 variables (separate curve for each histogram) on the histograms that I have generated. I won't mind the final figure showing either count or density on the y-axis.
Any thoughts on how to proceed from here?
Thanks in advance.
I believe that the code in the question is almost right; the code below just uses the answer in the link provided by @akrun.
Note that I have commented out the call to facet_wrap by placing a comment char before the last plus sign.
library(ggplot2)
library(tidyr)

df %>%
  gather(key = Type, value = Value) %>%
  ggplot(aes(x = Value, color = Type, fill = Type)) +
  geom_histogram(aes(y = ..density..),
                 binwidth = 8, position = "dodge") +
  geom_density(alpha = 0.25) #+
  facet_wrap(~ Type)
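Note that geom_density draws a kernel density estimate, not a normal curve. If fitted normal curves are what is wanted, one possible approach is to evaluate dnorm() per variable with that variable's own mean and standard deviation and add the result as a line layer. The following is only a sketch under that assumption; the objects long, curves and p are illustrative names and this is not the method from the linked answer.

```r
library(ggplot2)
library(tidyr)

# Example data from the question
height <- seq(140, 189, length.out = 50)
weight <- seq(67, 86, length.out = 50)
fev <- seq(71, 91, length.out = 50)
df <- data.frame(height, weight, fev)

long <- gather(df, key = Type, value = Value)

# One fitted normal curve per variable, evaluated on a grid over its own range
curves <- do.call(rbind, lapply(split(long, long$Type), function(d) {
  grid <- seq(min(d$Value), max(d$Value), length.out = 100)
  data.frame(Type = d$Type[1],
             Value = grid,
             density = dnorm(grid, mean = mean(d$Value), sd = sd(d$Value)))
}))

# Histograms on the density scale with the fitted curves on top
p <- ggplot(long, aes(x = Value, fill = Type)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 8, position = "dodge") +
  geom_line(data = curves, aes(y = density, colour = Type))
p
```

Because the histogram is drawn on the density scale, the curves and bars share one y-axis; adding facet_wrap(~ Type) would give each variable its own panel.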

Repelling selective text in ggplot in R

I have a set of text labels to be printed using ggplot at (x, y) locations, where only a subset of them overlap. I would like to keep the ones that do not overlap exactly where they are and repel only the ones that do. I know which ones these are: for example, the names of states in New England overlap while in the west nothing overlaps, so I want to keep the western state names where they are but repel the ones in New England. When I use geom_text_repel, it repels all of the text. If I choose the subset that does not overlap and print it with geom_text, and print the rest with geom_text_repel, the two sets do not take each other into account because they are on different layers. Is there a way to fix some subset of the text and repel the rest using geom_text_repel, or do I need to go for a completely different solution?
Here is an example:
library(tidyverse)
library(ggrepel)

# state centers by fixing Alaska and Hawaii to look good in our maps
df = data.frame(x = state.center$x, y = state.center$y, z = state.abb)
overlaps = c('RI', 'DE', 'CT', 'MA')

df %>%
  ggplot() +
  geom_point(aes(x, y),
             size = 1) +
  # plot the ones I would like to keep where they are
  # I want these right centered around the points
  geom_text(aes(x, y, label = z),
            data = df %>% filter(! z %in% overlaps),
            size = 4) +
  # plot the ones I would like to repel
  geom_text_repel(aes(x, y, label = z),
                  data = df %>% filter(z %in% overlaps),
                  size = 4,
                  min.segment.length = unit(0, "npc")) +
  coord_map() +
  theme_minimal()

df %>%
  ggplot() +
  geom_point(aes(x, y),
             size = 1) +
  # if we repel all instead
  geom_text_repel(aes(x, y, label = z),
                  size = 4,
                  min.segment.length = unit(0, "npc")) +
  coord_map() +
  theme_minimal()

ggplot's scale_y_log10 behavior

Trying to plot a stacked histogram using ggplot:
set.seed(1)
my.df <- data.frame(param = runif(10000,0,1),
x = runif(10000,0.5,1))
my.df$param.range <- cut(my.df$param, breaks = 5)
require(ggplot2)
not logging the y-axis:
ggplot(my.df,aes_string(x = "x", fill = "param.range")) +
geom_histogram(binwidth = 0.1, pad = TRUE) +
scale_fill_grey()
gives:
But I want to log10+1 transform the y-axis to make it easier to read:
ggplot(my.df, aes_string(x = "x", y = "..count..+1", fill = "param.range")) +
geom_histogram(binwidth = 0.1, pad = TRUE) +
scale_fill_grey() +
scale_y_log10()
which gives:
The tick marks on the y-axis don't make sense.
I get the same behavior if I log10 transform rather than log10+1:
ggplot(my.df, aes_string(x = "x", fill = "param.range")) +
geom_histogram(binwidth = 0.1, pad = TRUE) +
scale_fill_grey() +
scale_y_log10()
Any idea what is going on?
It looks like invoking scale_y_log10 with a stacked histogram is causing ggplot to plot the product of the counts for each component of the stack within each x bin. Below is a demonstration. We create a data frame called product.of.counts that contains the product, within each x bin, of the counts for each param.range bin. We use geom_text to add those values to the plot and see that they coincide with the top of each stack of histogram bars.
At first I thought this was a bug, but after a bit of searching, I was reminded of the way ggplot does the log transformation. As described in the linked answer, "scale_y_log10 makes the counts, converts them to logs, stacks those logs, and then displays the scale in the anti-log form. Stacking logs, however, is not a linear transformation, so what you have asked it to do does not make any sense."
As a simpler example, say each of five components of a stacked bar has a count of 100. Then log10(100) = 2 for all five and the sum of the logs will be 10. Then ggplot takes the anti-log for the scale, which gives 10^10 for the total height of the bar (which is 100^5), even though the actual height is 100 x 5 = 500. This is exactly what's happening with your plot.
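The arithmetic of that simpler example can be checked directly in base R. This is just a sketch of the numbers quoted above; counts, stacked_logs and apparent_top are illustrative names.

```r
# Five stacked components, each with a count of 100 (the example from the text)
counts <- rep(100, 5)

# What happens on a log10 scale: each count is logged, then the logs are stacked
stacked_logs <- sum(log10(counts))

# The axis labels are the anti-log of the stacked height
apparent_top <- 10^stacked_logs

apparent_top   # 1e+10
prod(counts)   # 1e+10, the product of the counts
sum(counts)    # 500, the true stacked count
```

Summing logs is the same as taking the log of a product, which is why the top of each stack lands on the product of the counts rather than on their sum.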
library(dplyr)
library(ggplot2)

# Data
set.seed(1)
my.df <- data.frame(param = runif(10000, 0, 1), x = runif(10000, 0.5, 1))
my.df$param.range <- cut(my.df$param, breaks = 5)

# Calculate product of counts within each x bin
product.of.counts = my.df %>%
  group_by(param.range, breaks = cut(x, breaks = seq(-0.05, 1.05, 0.1), labels = seq(0, 1, 0.1))) %>%
  tally %>%
  group_by(breaks) %>%
  summarise(prod = prod(n),
            param.range = NA) %>%
  ungroup %>%
  mutate(breaks = as.numeric(as.character(breaks)))

ggplot(my.df, aes(x, fill = param.range)) +
  geom_histogram(binwidth = 0.1, colour = "grey30") +
  scale_fill_grey() +
  scale_y_log10(breaks = 10^(0:14)) +
  geom_text(data = product.of.counts, size = 3.5,
            aes(x = breaks, y = prod, label = format(prod, scientific = TRUE, digits = 3)))
