How to add "arbitrary" points to a violin plot? - r

Long story short, I ran a bunch of stochastic simulations for each of 15 groups, and have one integer per group that I need to add to each violin in the plot, and can't seem to figure out how to do it. Here's a reproducible example:
# Making data
df <- data.frame(c(rep(1,10), rep(2,10), rep(3,10)), sample.int(100, 30), c(rep(85,10), rep(60,10), rep(55,10)))
colnames(df) <- c("Group", "Data", "Extra")
# Grouping data
df$Group <- as.factor(df$Group)
# Plotting
Violin2 <- ggplot(data = df, aes(x = Group, y = Data))+
geom_violin(aes(fill = Group, color = Group))+
stat_summary(aes(y = Data), fun=mean, geom="point", color = "navyblue", shape = 17, size = 3)+
stat_summary(aes(y = Data), fun=median, geom="point", color = "black", shape = 16, size = 3)
#geom_point(aes(y = Extra, color = "#00BB66", shape = 16, size = 3)+
Violin2
So here, the df has three groups (1, 2, and 3) that apply to the "Data" column. What I need to add are the integers from the "Extra" column of the df, as a single point on each violin (so the three integers would be 85, 60, and 55).
I initially tried to add a geom_point layer, and thought Extra would be grouped by Group, just as Data was, but that didn't work (Error: Discrete value supplied to continuous scale).
I've been searching around on here a lot, and can't find a solution, so any advice would be greatly appreciated! Thanks so much in advance for any help! :)
This is the data:
And this is the plot so far:

So it's actually just one more line of code: you can stitch different geoms together in ggplot, which makes it really easy to do exactly what you're describing. Just add
geom_point(aes(y = Extra)) +
So the whole code would look like this
ggplot(data = df, aes(x = Group, y = Data))+
geom_violin(aes(fill = Group, color = Group))+
geom_point(aes(y = Extra), size = 2, colour = "red") +
stat_summary(aes(y = Data), fun=mean, geom="point",
color = "navyblue", shape = 17, size = 3)+
stat_summary(aes(y = Data), fun=median, geom="point",
color = "black", shape = 16, size = 3)
I've coloured the points red and made them bigger but you can change that. That gives:

Your example works as written. The only thing to change is not to put a constant value for the color argument inside aes(); constants like that belong outside aes().
# Making data
library(ggplot2)
df <- data.frame(c(rep(1,10), rep(2,10), rep(3,10)), sample.int(100, 10), c(rep(85,10), rep(60,10), rep(55,10)))
colnames(df) <- c("Group", "Data", "Extra")
# Grouping data
df$Group <- as.factor(df$Group)
# Plotting
Violin2 <- ggplot(data = df, aes(x = Group, y = Data))+
geom_violin(aes(fill = Group, color = Group))+
stat_summary(aes(y = Data), fun=mean, geom="point", color = "navyblue", shape = 17, size = 3)+
stat_summary(aes(y = Data), fun=median, geom="point", color = "black", shape = 16, size = 3) +
geom_point(aes(y = Extra))
Violin2
Created on 2021-06-08 by the reprex package (v2.0.0)
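Because Extra repeats the same value on every row of a group, geom_point(aes(y = Extra)) draws ten overlapping points per violin. Below is a minimal sketch of one way to draw exactly one point per group, assuming the data layout from the question. The name extra_df and the seed are introduced here for illustration only.

```r
library(ggplot2)

set.seed(1)  # assumed seed, only so the sketch is repeatable
df <- data.frame(
  Group = factor(rep(1:3, each = 10)),
  Data  = sample.int(100, 30),
  Extra = rep(c(85, 60, 55), each = 10)
)

# One row per group: keeps only the distinct Group/Extra pairs
extra_df <- unique(df[, c("Group", "Extra")])

Violin2 <- ggplot(df, aes(x = Group, y = Data)) +
  geom_violin(aes(fill = Group, colour = Group)) +
  # Constants (colour, shape, size) go outside aes(); only Extra is mapped
  geom_point(data = extra_df, aes(y = Extra),
             colour = "#00BB66", shape = 16, size = 3)
```

Passing a separate data frame to the point layer keeps the violin layer on the full data while the extra layer sees only three rows.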

Related

R drawing baseline in facet_wrap specific to each facet

Let's say I have the following dataset:
set.seed(42)
data <- data.frame(type = sample(LETTERS[1:2], 40, replace = T),
condition = sample(c("Control", "Treatment"), 40, replace = T),
measurement = runif(40))
And I'd like to create the facetted graph:
ggplot(data, aes( x= condition, y = measurement))+
geom_point()+
facet_wrap(~type)
I'd also like to show a baseline (with geom_hline, for example) equal to the mean of the control values, mean(data$measurement[data$condition == "Control"]). But because the control values differ between types (that is, between the facets of the graph), I can't just calculate a single mean.
Is there any way to specify a different yintercept for geom_hline in each facet?
Something like this, but with the specified yintercept value, calculating the mean values for the control group for each individual facet:
ggplot(data, aes( x= condition, y = measurement))+
geom_point()+
geom_hline(yintercept= mean(data$measurement[data$condition == "Control"]),
linetype="dashed",
color = "red", size=1)+
facet_wrap(~type)
Thanks a lot!
Best regards,
Eugene
You can use stat_summary with fun = mean and geom = "hline", passing only the control subset to the data parameter. You can map yintercept to the y value calculated by the stat.
ggplot(data, aes(x = condition, y = measurement))+
geom_point() +
stat_summary(fun = mean, geom = "hline", aes(yintercept = after_stat(y)),
data = data[data$condition == "Control",], color = "red",
linetype = "dashed") +
facet_wrap(~type)
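An alternative sketch, assuming the same example data: compute the control mean per type up front and hand that small data frame to geom_hline. The name baseline is introduced here for illustration. Because the summary keeps the type column, facet_wrap places each line in its own facet.

```r
library(ggplot2)

set.seed(42)
data <- data.frame(
  type        = sample(LETTERS[1:2], 40, replace = TRUE),
  condition   = sample(c("Control", "Treatment"), 40, replace = TRUE),
  measurement = runif(40)
)

# One mean per facet, computed from the control rows only
baseline <- aggregate(measurement ~ type,
                      data = data[data$condition == "Control", ],
                      FUN = mean)

p <- ggplot(data, aes(x = condition, y = measurement)) +
  geom_point() +
  geom_hline(data = baseline, aes(yintercept = measurement),
             colour = "red", linetype = "dashed") +
  facet_wrap(~type)
```

This version makes the baseline values available as ordinary data, which can help if they are also needed for labels or tables.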

How do I add a legend to identify vertical lines in ggplot?

I have a chart that shows mobile usage by operating system. I'd like to add vertical lines to identify when those operating systems were released. I'll go through the chart and then the code.
The chart -
The code -
dev %>%
group_by(os) %>%
mutate(monthly_change = prop - lag(prop)) %>%
ggplot(aes(month, monthly_change, color = os)) +
geom_line() +
geom_vline(xintercept = as.numeric(ymd("2013-10-01"))) +
geom_text(label = "KitKat", x = as.numeric(ymd("2013-10-01")) + 80, y = -.5)
Instead of adding the text in the plot, I'd like to create a legend to identify each of the lines. I'd like to give each of them its own color and then have a legend to identify each. Something like this -
Can I make my own custom legend like that?
1) Define a data frame that contains the line data and then use geom_vline with it. Note that BOD is a data frame that comes with R.
line.data <- data.frame(xintercept = c(2, 4), Lines = c("lower", "upper"),
color = c("red", "blue"), stringsAsFactors = FALSE)
ggplot(BOD, aes( Time, demand ) ) +
geom_point() +
geom_vline(aes(xintercept = xintercept, color = Lines), line.data, size = 1) +
scale_colour_manual(values = line.data$color)
2) Alternately put the labels right on the plot itself to avoid an extra legend. Using the line.data frame above. This also has the advantage of avoiding possible multiple legends with the same aesthetic.
ggplot(BOD, aes( Time, demand ) ) +
geom_point() +
annotate("text", line.data$xintercept, max(BOD$demand), hjust = -.25,
label = line.data$Lines) +
geom_vline(aes(xintercept = xintercept), line.data, size = 1)
3) If the real problem is that you want two color legends then there are two packages that can help.
3a) ggnewscale Any color geom that appears after invoking new_scale_color will get its own scale.
library(ggnewscale)
BOD$g <- gl(2, 3, labels = c("group1", "group2"))
line.data <- data.frame(xintercept = c(2, 4), Lines = c("lower", "upper"),
color = c("red", "blue"), stringsAsFactors = FALSE)
ggplot(BOD, aes( Time, demand ) ) +
geom_point(aes(colour = g)) +
scale_colour_manual(values = c("red", "orange")) +
new_scale_color() +
geom_vline(aes(xintercept = xintercept, colour = Lines), line.data,
size = 1) +
scale_colour_manual(values = line.data$color)
3b) relayer The experimental relayer package (only on GitHub) allows one to define two color aesthetics, color and color2, say, and then have separate scales for each one.
library(dplyr)
library(relayer)
BOD$g <- gl(2, 3, labels = c("group1", "group2"))
ggplot(BOD, aes( Time, demand ) ) +
geom_point(aes(colour = g)) +
geom_vline(aes(xintercept = xintercept, colour2 = line.data$color), line.data,
size = 1) %>% rename_geom_aes(new_aes = c("colour" = "colour2")) +
scale_colour_manual(aesthetics = "colour", values = c("red", "orange")) +
scale_colour_manual(aesthetics = "colour2", values = line.data$color)
You can definitely make your own custom legend, but it is a bit complicated, so I'll take you through it step-by-step with some fake data.
The fake data contains 100 samples from a normal distribution (monthly_change in your data), 5 groupings (similar to the os variable in your data) and a sequence of dates from an arbitrary starting point.
library(tidyverse)
library(lubridate)
y <- rnorm(100)
df <- tibble(y) %>%
mutate(os = factor(rep_len(1:5, 100)),
date = seq(from = ymd('2013-01-01'), by = 1, length.out = 100))
You already use the colour aes for your call to geom_line, so you will need to choose a different aes to map onto the calls to geom_vline. Here, I use linetype and a call to scale_linetype_manual to manually edit the linetype legend to how I want it.
ggplot(df, aes(x = date, y = y, colour = os)) +
geom_line() +
# set `xintercept` to your date and `linetype` to the name of the os which starts
# at that date in your `aes` call; set colour outside of the `aes`
geom_vline(aes(xintercept = min(date),
linetype = 'os 1'), colour = 'red') +
geom_vline(aes(xintercept = median(date),
linetype = 'os 2'), colour = 'blue') +
# in the call to `scale_linetype_manual`, `name` will be the legend title;
# set `values` to 1 for each os to force a solid vertical line;
# use `guide_legend` and `override.aes` to change the colour of the lines in the
# legend to match the colours in the calls to `geom_vline`
scale_linetype_manual(name = 'lines',
values = c('os 1' = 1,
'os 2' = 1),
guide = guide_legend(override.aes = list(colour = c('red',
'blue'))))
And there you go, a nice custom legend. Please do remember next time that if you provide your data, or a minimal reproducible example, we can better answer your question without having to generate fake data.
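The same linetype idea can be driven from a data frame with one row per vertical line, which avoids one geom_vline call per release. This is a rough sketch under stated assumptions: the usage data, the release names, and the dates are all made up for illustration, and the lines are distinguished by line type rather than by colour.

```r
library(ggplot2)

set.seed(1)  # assumed seed, only so the sketch is repeatable
# Made-up stand-in for the asker's data: monthly change per operating system
usage <- data.frame(
  month = rep(seq(as.Date("2013-01-01"), by = "month", length.out = 24), times = 2),
  os = rep(c("os 1", "os 2"), each = 24),
  monthly_change = rnorm(48)
)

# Hypothetical release dates, one row per vertical line
releases <- data.frame(
  release = c("release A", "release B"),
  date    = as.Date(c("2013-10-01", "2014-06-01"))
)

p <- ggplot(usage, aes(month, monthly_change, colour = os)) +
  geom_line() +
  # linetype is mapped inside aes(), so it gets its own legend
  geom_vline(data = releases, aes(xintercept = date, linetype = release)) +
  scale_linetype_manual(name = "lines",
                        values = c("release A" = "dashed",
                                   "release B" = "dotted"))
```

Since colour is already taken by os, the vertical lines use a different aesthetic and therefore appear in a second, separate legend.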

How to modify and add an extra legend in a ggplot2 figure

I have data that looks like this:
example.df <- as.data.frame(matrix( c("height","fruit",0.2,0.4,0.7,
"height","veggies",0.3,0.6,0.8,
"height","exercise",0.1,0.2,0.5,
"bmi","fruit",0.2,0.4,0.6,
"bmi","veggies",0.1,0.5,0.7,
"bmi","exercise",0.4,0.7,0.8,
"IQ","fruit",0.4,0.5,0.6,
"IQ","veggies",0.3,0.5,0.7,
"IQ","exercise",0.1,0.4,0.6),
nrow=9, ncol=5, byrow = TRUE))
colnames(example.df) <- c("phenotype","predictor","corr1","corr2","corr3")
So basically three different correlations between 3x3 variables. I want to visualize the increase in correlations as follows:
ggplot(example.df, aes(x=phenotype, y=corr1, yend=corr3, colour = predictor)) +
geom_linerange(aes(x = phenotype,
ymin = corr1, ymax = corr3,
colour = predictor),
position = position_dodge(width = 0.5))+
geom_point(size = 3,
aes(x = phenotype, y = corr1, colour = predictor),
position = position_dodge(width = 0.5), shape=4)+
geom_point(size = 3,
aes(x = phenotype, y = corr2, colour = predictor),
position = position_dodge(width = 0.5), shape=18)+
geom_point(size = 3,
aes(x = phenotype, y = corr3, colour = predictor),
position = position_dodge(width = 0.5))+
labs(x=NULL, y=NULL,
title="Stackoverflow Example Plot")+
scale_colour_manual(name="", values=c("#4682B4", "#698B69", "#FF6347"))+
theme_minimal()
This gives me the following plot:
Problems:
1. There is something wrong with the way the geom_point shapes are dodged for bmi and IQ. Each point should sit on the line of the same colour, as it does for height.
2. How do I get an extra legend that shows what the circle, cross, and square represent (cross = correlation 1, square = correlation 2, circle = correlation 3)?
3. The legend currently shows a line, a circle, and a cross drawn over each other, while just a line for each predictor (exercise, fruit, veggies) would suffice.
Sorry for the multiple issues, but adding the extra legend (problem #2) is the most important one, and I would be already very satisfied if that could be solved, the rest is bonus! :)
See if the following works for you? The main idea is to convert the data frame from wide to long format for the geom_point layer, and map correlation as a shape aesthetic:
example.df %>%
ggplot(aes(x = phenotype, color = predictor, group = predictor)) +
geom_linerange(aes(ymin = corr1, ymax = corr3),
position = position_dodge(width = 0.5)) +
geom_point(data = . %>% tidyr::gather(corr, value, -phenotype, -predictor),
aes(y = value, shape = corr),
size = 3,
position = position_dodge(width = 0.5)) +
scale_color_manual(values = c("#4682B4", "#698B69", "#FF6347")) +
scale_shape_manual(values = c(4, 18, 16),
labels = paste("correlation", 1:3)) +
labs(x = NULL, y = NULL, color = "", shape = "") +
theme_minimal()
Note: The colour legend is based on both geom_linerange and geom_point, hence the legend keys include both a line and a point shape. While it's possible to get rid of the second one, it does take some more convoluted code, and I don't think the plot would be much improved as a result...
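One thing worth checking before plotting: building the data through matrix() with mixed strings and numbers coerces every cell to character, so corr1 to corr3 are stored as text rather than numbers. That is a plausible cause of the misplaced points in the question. Below is a minimal sketch that builds the same values as numeric columns and reshapes them with tidyr::pivot_longer (used here in place of gather). The name long.df is introduced for illustration.

```r
library(ggplot2)
library(tidyr)

# Same values as the question, but with numeric correlation columns
example.df <- data.frame(
  phenotype = rep(c("height", "bmi", "IQ"), each = 3),
  predictor = rep(c("fruit", "veggies", "exercise"), times = 3),
  corr1 = c(0.2, 0.3, 0.1, 0.2, 0.1, 0.4, 0.4, 0.3, 0.1),
  corr2 = c(0.4, 0.6, 0.2, 0.4, 0.5, 0.7, 0.5, 0.5, 0.4),
  corr3 = c(0.7, 0.8, 0.5, 0.6, 0.7, 0.8, 0.6, 0.7, 0.6)
)

# Long format: one row per phenotype/predictor/correlation
long.df <- pivot_longer(example.df, cols = starts_with("corr"),
                        names_to = "corr", values_to = "value")

p <- ggplot(example.df, aes(x = phenotype, colour = predictor, group = predictor)) +
  geom_linerange(aes(ymin = corr1, ymax = corr3),
                 position = position_dodge(width = 0.5)) +
  geom_point(data = long.df, aes(y = value, shape = corr), size = 3,
             position = position_dodge(width = 0.5)) +
  scale_shape_manual(values = c(4, 18, 16),
                     labels = paste("correlation", 1:3))
```

An existing all-character data frame could instead be converted in place, for example with as.numeric() on each correlation column.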

Tile density plot by fill gradient with multiple colours

I'm enjoying using tile density plots to represent probability densities. I often use the second (y) dimension to illustrate comparisons of densities between factors, but I'm having trouble introducing a third dimension. I want to use colour to represent the third dimension. How can I do this? (I've tried inserting aes references to Type in the example below but they appear to collide with the ..density.. aesthetic.)
Beginning with the following plot,
library(ggplot2)
dz <- data.frame(Type = c(rep("A", 100), rep("B", 100)),
Costs = c(rnorm(100), rnorm(100, 5, 1))
)
ggplot(dz, aes(x = Costs, y = 1)) +
stat_density(aes(fill = ..density..), geom = "tile", position = "identity") +
scale_fill_gradient(low = "white", high = "black")
What I want is a combination of the following. For A:
and B:
If you map fill to Type, and alpha to the density, you get more or less what you want:
ggplot(dz, aes(x = Costs, y = 1, fill=Type)) +
stat_density(aes(alpha=..density..), geom = "tile", position = "identity") +
scale_fill_manual(values=c("red", "blue"))
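In recent ggplot2 releases the ..density.. notation is deprecated in favour of after_stat(). Here is a sketch of the same plot in that notation, assuming ggplot2 3.3.0 or later. The seed and the named colour vector are additions made here for illustration.

```r
library(ggplot2)

set.seed(1)  # assumed seed, only so the sketch is repeatable
dz <- data.frame(
  Type  = rep(c("A", "B"), each = 100),
  Costs = c(rnorm(100), rnorm(100, 5, 1))
)

p <- ggplot(dz, aes(x = Costs, y = 1, fill = Type)) +
  # fill encodes the factor, alpha encodes the estimated density
  stat_density(aes(alpha = after_stat(density)),
               geom = "tile", position = "identity") +
  scale_fill_manual(values = c(A = "red", B = "blue"))
```

Naming the colour values ties each colour to a level explicitly instead of relying on level order.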

Selecting entries in legend of ggplot in R

I'm creating a figure using ggplot. That figure has 27 lines that I want to show but not emphasize, and two lines, mean and weighted mean, that I want to emphasize. I would like only these last two lines to appear into the legend of the plot. Here is my code:
p_plot <- ggplot(data = dta, aes(x = date, y = premium, colour = State)) +
geom_line(, show_guide=FALSE) +
scale_color_manual(values=c(rep("gray60", 27)))
p_plot <- p_plot + geom_line(aes(y = premium.m), colour = "blue", size = 1.25,
show_guide=TRUE) + geom_line(aes(y = premium.m.w), colour = "red",
size = 1.25, show_guide=c(TRUE)) + ylab("Pe/pg")
p_plot
The show_guide = FALSE statement in the first geom_line seems to be overridden by the other show_guide=TRUE statements. How can I limit the number of entries in the legend of my figures to the lines "premium.m" and "premium.m.w"? Thank you.
I think this should answer your question: (the code's been slightly modified but the concept is the same)
dta <- data.frame(date = rep(seq.Date(as.Date("2010-01-01"), as.Date("2010-12-01"), "months"), 26),
premium = rnorm(12*26),
State = rep(letters, each = 12))
library(ggplot2)
p_plot <- ggplot(data = dta) +
geom_line(aes(x = date, y = premium, group = State), colour = "grey60")
p_plot <- p_plot + geom_line(aes(x = unique(date), y = as.numeric(tapply(premium, date, mean)), colour = "mean"),
size = 1.25) +
geom_line(aes(x = unique(date), y = as.numeric(tapply(premium, date, median)), colour = "median"),
size = 1.25) + ylab("Pe/pg") + scale_color_discrete("stats")
p_plot
However, this is just an (ugly) workaround and far from best practice for data visualisation (especially for the purposes ggplot was designed for). Anyway, I could provide a more elegant solution if you edited your question to add more details.
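One possibly tidier route, sketched here under assumptions rather than as the answerer's intended solution, is to let stat_summary compute the per-date statistics. Mapping colour to a constant string inside aes() is what creates the legend entries, while the grey lines set their colour outside aes() and so stay out of the legend. The sketch assumes ggplot2 3.4.0 or later for linewidth; in recent versions the tapply-based aesthetics above may also be rejected because their length differs from the number of data rows.

```r
library(ggplot2)

set.seed(1)  # assumed seed, only so the sketch is repeatable
dta <- data.frame(
  date    = rep(seq.Date(as.Date("2010-01-01"), as.Date("2010-12-01"), "months"), 26),
  premium = rnorm(12 * 26),
  State   = rep(letters, each = 12)
)

p_plot <- ggplot(dta, aes(x = date, y = premium)) +
  # Background lines: constant colour outside aes(), so no legend entry
  geom_line(aes(group = State), colour = "grey60") +
  # Summary lines: colour inside aes(), so each gets a legend entry
  stat_summary(aes(colour = "mean"), fun = mean, geom = "line", linewidth = 1.25) +
  stat_summary(aes(colour = "median"), fun = median, geom = "line", linewidth = 1.25) +
  scale_colour_manual("stats", values = c(mean = "blue", median = "red")) +
  ylab("Pe/pg")
```

If the summary columns already exist in the data (such as premium.m in the question), the same legend trick applies to plain geom_line layers: put a label string in aes(colour = ...) and define the colours in scale_colour_manual.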
