Overlay two plots from different dataframes in R

I would like to overlay two ggplots from different data sources. I don't think a left_join will work, because the dataframes are of different lengths and joining them would potentially change the underlying plots. [Maybe?]
library(tidyverse)
set.seed(123)
player_df <- tibble(name = rep(c("A","B","C","D"), each = 10, times = 1),
pos = rep(c("DEF","DEF","MID","MID"), each = 10, times = 1),
load = c(rnorm(10, mean = 200, sd = 100),
rnorm(10, mean = 300, sd = 50),
rnorm(10, mean = 400, sd = 100),
rnorm(10, mean = 500, sd = 50)))
p1 <- player_df %>%
ggplot(aes(x = load, y = name)) +
geom_point()
pos_df <- tibble(pos = rep(c("DEF","MID"), each = 30, times = 1),
load = (c(rnorm(30, mean = 250, sd = 100),
rnorm(30, mean = 350, sd = 100))))
p2 <- pos_df %>%
ggplot(aes(x = load, y = pos)) +
geom_boxplot()
p1
p2
# add p2 to every p1 player plot by pos
I would like p1 to have the corresponding p2 - by pos - appear behind it. So... add the matching p2 boxplot to each p1 scatterplot.

It's not really advisable to attempt to superimpose two plots on each other. A ggplot is made of layers already, so usually it's just a case of superimposing one geom on another. This can be difficult if (as in your case) one of the axes has different labels. However, with a little work it is possible to wrangle your data so that it all sits on a single plot. In your case, you could do something like:
levs <- c("A", "DEF", "B", "C", "MID", "D")
ggplot(within(pos_df, pos <- factor(pos, levs)), aes(x = load, y = pos)) +
geom_boxplot(width = 2.3) +
geom_point(data = within(player_df, pos <- factor(name, levs))) +
scale_y_discrete(limits = c("A", "DEF", "B", " ", "C", "MID", "D"))

Dug into ggplot a bit and re-engineered a boxplot bit by bit.
# manually calculate stats that are used in boxplots
pos_df_summary <- pos_df %>%
group_by(pos, .drop = FALSE) %>%
summarise(min = fivenum(load)[1],
Q1 = fivenum(load)[2],
median = fivenum(load)[3],
Q3 = fivenum(load)[4],
max = fivenum(load)[5]
)
# add the boxplot data to each player
joined_df <- player_df %>%
left_join(., pos_df_summary, by = "pos") %>%
distinct(name, .keep_all = TRUE)
# plot
ggplot(data = NULL, aes(group = name)) +
# create the line from min to max
geom_segment(data = joined_df, aes(y = name, yend = name, x=min, xend=max), color="black") +
#create the box with median line
geom_crossbar(data = joined_df,
aes(y = name, xmin = Q1, xmax = Q3, x = median, fill = "NA"),
color = "black",
fatten = 1) +
scale_fill_manual(values = "white") +
# add the points from the player_df
geom_point(data = player_df,
aes(x = load, y = name, group=name),
color = "red",
show.legend=FALSE) +
theme(legend.position = "none")
There may be some extraneous code in here as I cobbled it from some other resources. Specifically, I'm not sure what the aes(group = name) in the ggplot() call does exactly.
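On the aes(group = name) question: ggplot2 already groups by the discrete variables in a layer, so with y = name mapped in every layer the explicit group is most likely redundant. As a shorter route to the same picture, here is a minimal sketch (not what either answer above did) that copies each position's observations onto every player in that position and lets geom_boxplot() compute the statistics itself. The names player_pos and box_df are made up for illustration, and the data is rebuilt with base data.frame() so the block runs on its own.

```r
library(ggplot2)

set.seed(123)
player_df <- data.frame(
  name = rep(c("A", "B", "C", "D"), each = 10),
  pos  = rep(c("DEF", "DEF", "MID", "MID"), each = 10),
  load = c(rnorm(10, 200, 100), rnorm(10, 300, 50),
           rnorm(10, 400, 100), rnorm(10, 500, 50))
)
pos_df <- data.frame(
  pos  = rep(c("DEF", "MID"), each = 30),
  load = c(rnorm(30, 250, 100), rnorm(30, 350, 100))
)

# One row per player, carrying that player's position
player_pos <- unique(player_df[, c("name", "pos")])

# Copy every positional observation onto each player in that position
# (box_df is a hypothetical helper name, not from the answers above)
box_df <- merge(player_pos, pos_df, by = "pos")

# Boxplot layer first so it sits behind the player points
p <- ggplot(mapping = aes(x = load, y = name)) +
  geom_boxplot(data = box_df, outlier.shape = NA) +
  geom_point(data = player_df, colour = "red")
p
```

Players A and B end up with the same DEF box, and C and D with the same MID box, because each pair shares the same positional data.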

Related

Dodge two different geoms apart in ggplot2

Let's say I have two different sources of data. One is of repeated observations, and one is just a mean +/- standard error predicted by a model.
n <- 30
obs <- data.frame(
group = rep(c("A", "B"), each = n*3),
level = rep(rep(c("low", "med", "high"), each = n), 2),
yval = c(
rnorm(n, 30), rnorm(n, 50), rnorm(n, 90),
rnorm(n, 40), rnorm(n, 55), rnorm(n, 70)
)
) %>%
mutate(level = factor(level, levels = c("low", "med", "high")))
model_preds <- data.frame(
group = c("A", "A", "A", "B", "B", "B"),
level = rep(c("low", "med", "high"), 2),
mean = c(32,56,87,42,51,74),
sem = runif(6, min = 2, max = 5)
)
Now I can plot these on the same graph easily enough:
p <- ggplot(obs, aes(x = level, y = yval, fill = group)) +
geom_boxplot() +
geom_point(data = model_preds, aes(x = level, y = mean), size = 2, colour = "forestgreen") +
geom_errorbar(data = model_preds, aes(x = level, y = mean, ymax = mean + sem, ymin = mean - sem), colour = "forestgreen", size = 1) +
facet_wrap(~group)
and use that to visually compare the model predictions with the observed results.
But I think this looks a bit ugly, so ideally would want to 'dodge' the point-and-errorbars geom(s) from the boxplot geom.
If you'll forgive my quick paint drawing, something like this:
It seems like position_dodge() might be the way to go but I haven't figured out how to combine two different geoms this way and the docs don't have any examples.
Might be that it's impossible, but thought I'd ask to check
As a consequence of the grammar of graphics, which clearly separates the various aspects of plotting, there is no way to communicate information between different layers (geoms and stats) of a plot. This also means that a position adjustment cannot be shared across layers, so layers cannot be dodged against each other.
The next best thing you could do is to use position = position_nudge() in every layer, so that across the layers they appear dodged. You might also want to adjust the width parameter of the boxplot and errorbar for this. Example below:
library(tidyverse)
n <- 30
obs <- data.frame(
group = rep(c("A", "B"), each = n*3),
level = rep(rep(c("low", "med", "high"), each = n), 2),
yval = c(
rnorm(n, 30), rnorm(n, 50), rnorm(n, 90),
rnorm(n, 40), rnorm(n, 55), rnorm(n, 70)
)
) %>%
mutate(level = factor(level, levels = c("low", "med", "high")))
model_preds <- data.frame(
group = c("A", "A", "A", "B", "B", "B"),
level = rep(c("low", "med", "high"), 2),
mean = c(32,56,87,42,51,74),
sem = runif(6, min = 2, max = 5)
)
ggplot(obs, aes(x = level, y = yval, fill = group)) +
geom_boxplot(position = position_nudge(x = -0.3),
width = 0.5) +
geom_point(data = model_preds, aes(x = level, y = mean),
size = 2, colour = "forestgreen",
position = position_nudge(x = 0.3)) +
geom_errorbar(data = model_preds,
aes(x = level, y = mean, ymax = mean + sem, ymin = mean - sem),
colour = "forestgreen", size = 1, width = 0.5,
position = position_nudge(x = 0.3)) +
facet_wrap(~group)
Created on 2021-01-17 by the reprex package (v0.3.0)
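To see what position_nudge() actually does to a discrete axis, the following small sketch inspects the computed positions with layer_data(). The preds data frame is a cut-down stand-in for model_preds, invented for this illustration.

```r
library(ggplot2)

# Cut-down stand-in for model_preds (illustrative values only)
preds <- data.frame(
  level = factor(c("low", "med", "high"), levels = c("low", "med", "high")),
  mean  = c(32, 56, 87)
)

p <- ggplot(preds, aes(x = level, y = mean)) +
  geom_point(position = position_nudge(x = 0.3))

# Discrete levels are placed at the integers 1, 2, 3 before any adjustment;
# the nudge then shifts every point in this layer by the same fixed amount
nudged_x <- as.numeric(layer_data(p)$x)
nudged_x
```

Because the shift is a fixed offset per layer rather than something negotiated between layers, the offsets and widths have to be chosen by hand so that the boxplot and the error bars do not overlap.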

How to extend line across entire violin plot

Dataframe as example:
library(tidyverse)
set.seed(123)
df <- data.frame("b" = runif(1000, min = 2, max = 10),
"c" = runif(1000, min = 2, max = 10),
"d" = runif(1000, min = 2, max = 10))
df_2 <- data.frame(id = c("b", "c", "d"),
cutoff = c(5, 3, 5),
stringsAsFactors = FALSE)
df <-
pivot_longer(
df,
cols = c("b", "c", "d"),
names_to = "id",
values_to = "value"
) %>%
left_join(df_2, by = "id")
I can now make a violin plot (or a boxplot, same issue) with a line overlaid:
df %>%
ggplot(aes(x = id)) +
geom_violin(aes(y = value)) +
geom_line(aes(x = id, y = cutoff, group = 1), color = "red")
What I'd like though is three lines (don't need to be connected) each of which extend across the entire width of a single violin, at the cutoff value specified in df_2.
I can do this manually with geom_segment, but is there a better, more programmatic way?
df %>%
ggplot(aes(x = id)) +
geom_violin(aes(y = value)) +
geom_segment(aes(x = 0.55, xend = 1.45, y = 5, yend = 5), color = "blue") +
geom_segment(aes(x = 1.55, xend = 2.45, y = 3, yend = 3), color = "blue") +
geom_segment(aes(x = 2.55, xend = 3.45, y = 5, yend = 5), color = "blue")
I understand that at some fundamental level the x-axis is ordered by factor level, with b = 1, c = 2 etc., so asking for a line intersecting x = 0.9 would require specifying corresponding y value. In another sense though, ggplot2 clearly knows (in some sense) that the region above x = 0.9 (that is, y values intersected by a vertical line at x = 0.9) is associated with factor level b because the corresponding violin for b overlaps that region. Is there a way to get at that information?
You can use geom_errorbar(). So change your second block to:
df %>%
ggplot(aes(x = id)) +
geom_violin(aes(y = value)) +
geom_errorbar(aes(x = id, ymin = cutoff,ymax = cutoff), color = "red")
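One thing to be aware of with the version above: because cutoff was joined onto every row of df, the layer draws one error bar per observation, all stacked on top of each other. A possible variant, sketched here under the assumption that df_2 is still available, passes df_2 to the layer directly so that exactly one bar is drawn per violin. The width value of 0.9 and the name df_long are choices made for this illustration.

```r
library(ggplot2)

set.seed(123)
# Long-format stand-in for the pivoted data (fewer rows, same shape)
df_long <- data.frame(
  id    = rep(c("b", "c", "d"), each = 200),
  value = runif(600, min = 2, max = 10)
)
df_2 <- data.frame(id = c("b", "c", "d"), cutoff = c(5, 3, 5))

p <- ggplot(df_long, aes(x = id)) +
  geom_violin(aes(y = value)) +
  # One row per id in df_2, so one horizontal bar per violin;
  # x = id is inherited from the plot-level mapping
  geom_errorbar(data = df_2, aes(ymin = cutoff, ymax = cutoff),
                width = 0.9, colour = "red")
p
```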

How to pass break values to stat_contour by facet or group

I am trying to use the ks library to calculate the 95% home range for groups within a data set. The problem is that the "break" values which define the cut-off for the 95% contours differ between groups. So far, I have been able to get my plots, but I have to manually add the break values for each group/level and I would really like to find a solution where I can create figures in ggplot where the break values are imported automatically.
require(ks)
require(dplyr)
require(ggplot2)
# define the ks function to pass to a grouped_df
ksFUN = function(data){
H = Hpi(data[,c("x","y")], binned = TRUE) * 1
fhata = kde(data[,c("x","y")], H = H, compute.cont = TRUE,
xmin = c(minX, minY), xmax = c(maxX, maxY))
res95 = data.frame(HR = contourSizes(fhata, cont = 95, approx = TRUE))
dimnames(fhata[['estimate']]) = list(fhata[["eval.points"]][[1]],
fhata[["eval.points"]][[2]])
dat = reshape2::melt(fhata[['estimate']])
dat$breaks50 = fhata[["cont"]]["50%"]
dat$breaks95 = fhata[["cont"]]["5%"]
return(dat)
}
set.seed(100)
# create some data
df1 = data.frame(x = rnorm(100, 0, 5),
y = rnorm(100, 0, 5),
Group = "Test1")
df2 = data.frame(x = rnorm(100, 10, 5),
y = rnorm(100, 10, 5),
Group = "Test2")
df = rbind(df1, df2)
# Set the minimum and maximum x and y values outside
# of the ksFUN to keep the data on the same scale
minX = min(df$x, na.rm = T); maxX = max(df$x, na.rm = T)
minY = min(df$y, na.rm = T); maxY = max(df$y, na.rm = T)
xx = df %>%
group_by(Group) %>%
do(as.data.frame(ksFUN(.)))
# extract the break value for the 95% contour (home range) and 50% (core area)
breaks = xx %>%
group_by(Group) %>%
summarize(breaks95 = mean(breaks95),
breaks50 = mean(breaks50))
breaks
# The only way I have been able to add the breaks is to manually add them
ggplot(data = xx, aes(x = Var1, y = Var2, fill = Group)) +
geom_point(data = df, aes(x = x, y = y, col = Group)) +
stat_contour(data = xx[xx$Group == "Test1",], aes(z = value),
breaks = 0.000587, alpha = 0.3, geom = "polygon") +
stat_contour(data = xx[xx$Group == "Test2",], aes(z = value),
breaks = 0.000527, alpha = 0.3, geom = "polygon")
I would really like to find a solution where I don't have to explicitly pass the break values to the stat_contour functions
Is there a problem with using the breaks column in breaks? e.g.
# base plot
pl <- ggplot(data = xx, aes(x = Var1, y = Var2, fill = Group)) +
geom_point(data = df, aes(x = x, y = y, col = Group))
groups <- unique(xx$Group)
# loop and add for each group
for(i in groups){
pl <- pl + stat_contour(data = xx[xx$Group == i,], aes(z = value),
breaks = breaks[breaks$Group == i, ]$breaks95,
alpha = 0.3, geom = "polygon")
}
pl
I get some weird plots, at the edges, especially when I remove the breaks part from stat_contour, which leads me to think there might be a bug in ksFUN
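The same per-group idea can be written without a loop, since a list of layers can be added to a ggplot in one step. The sketch below avoids the ks dependency by using synthetic Gaussian surfaces in place of the kde() output. The contour levels in brk and the helper make_surface are made up for illustration and are not the values from the question.

```r
library(ggplot2)

# Synthetic density surfaces on a shared grid, standing in for the kde() output
grid_xy <- expand.grid(x = seq(-15, 25, length.out = 60),
                       y = seq(-15, 25, length.out = 60))
make_surface <- function(group, centre) {
  data.frame(grid_xy, Group = group,
             value = dnorm(grid_xy$x, centre, 5) * dnorm(grid_xy$y, centre, 5))
}
xx <- rbind(make_surface("Test1", 0), make_surface("Test2", 10))

# Hypothetical per-group contour levels; in the question these come from kde()
brk <- c(Test1 = 0.001, Test2 = 0.002)

# Build one stat_contour layer per group, each with its own break value
contour_layers <- lapply(names(brk), function(g) {
  stat_contour(data = xx[xx$Group == g, ], aes(z = value),
               breaks = brk[[g]], geom = "polygon", alpha = 0.3)
})

# Adding a list to a ggplot adds every layer in the list
p <- ggplot(xx, aes(x = x, y = y, fill = Group)) + contour_layers
p
```

With the real data, brk would be filled from the breaks summary table instead of being typed in.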

Use position_jitterdodge without mapping aesthetic

I would like to produce a plot like the one obtained with the code below. However, I would like to dodge by "replicate", but without actually mapping an aesthetic (because I would like to assign fill and colors to other aesthetics).
dataset <- tibble::tibble(sample = rep(c("Sample1","Sample2","Sample3", "Sample4"), each = 25),
replicate = sample(x = c("A", "B"), size = 100, replace = TRUE),
value = rnorm(n = 100, mean = 0, sd = 10))
ggplot(data = dataset, aes(x = sample, y = value, fill = replicate)) +
geom_point(position = position_jitterdodge(jitter.width = 0.15, dodge.width = 0.75),
show.legend = F)
I had hoped that using group = replicate instead of fill = replicate would work, but it doesn't. I can imagine a workaround using, for example, alpha = replicate as an aesthetic and setting scale_alpha_manual(values = c(1, 1)), but I don't find this solution ideal and would like to keep all aesthetics other than x and y available for further use.
ggplot(data = dataset, aes(x = sample, y = value, alpha = replicate)) +
geom_point(position = position_jitterdodge(jitter.width = 0.15, dodge.width = 0.75),
show.legend = F) +
scale_alpha_manual(values = c(1, 1))
The plot that I expect to get is:
I hope my question makes sense. Any hints?
Best,
Yvan
You could unite the sample and replicate columns and use that as the x-axis, injecting a 'Placeholder' value for spacing between samples.
library(tidyverse)
set.seed(20181101)
dataset <- tibble(sample = rep(c("Sample1","Sample2","Sample3", "Sample4"), each = 25),
replicate = sample(x = c("A", "B"), size = 100, replace = TRUE),
value = rnorm(n = 100, mean = 0, sd = 10))
dataset %>%
bind_rows({
#create a dummy placeholder to allow for spacing between samples
data.frame(sample = unique(dataset$sample),
replicate = rep("Placeholder", length(unique(dataset$sample))),
stringsAsFactors = FALSE)
}) %>%
#unite the sample & replicate columns, and use it as the new x-axis
unite(sample_replicate, sample, replicate, remove = FALSE) %>%
ggplot(aes(x = sample_replicate, y = value, color = replicate)) +
geom_jitter() +
#only have x-axis labels for each sample
scale_x_discrete(breaks = paste0("Sample", 1:length(unique(dataset$sample)), "_B"),
labels = paste0("Sample ", 1:length(unique(dataset$sample)))) +
labs(x = "Sample") +
#don't show the Placeholder value in the legend
scale_color_discrete(breaks = c("A", "B"))
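Another way to keep every aesthetic free, sketched here as an alternative rather than as part of the answer above, is to compute the dodged x position by hand and jitter around it on a continuous axis. The column x_pos and the 0.19 offset are arbitrary choices for this illustration.

```r
library(ggplot2)

set.seed(20181101)
dataset <- data.frame(
  sample    = rep(c("Sample1", "Sample2", "Sample3", "Sample4"), each = 25),
  replicate = sample(c("A", "B"), size = 100, replace = TRUE),
  value     = rnorm(100, mean = 0, sd = 10)
)

sample_f <- factor(dataset$sample)

# Shift replicate A left and replicate B right of the sample's integer position
# (0.19 is an arbitrary offset chosen for illustration)
dataset$x_pos <- as.numeric(sample_f) +
  ifelse(dataset$replicate == "A", -0.19, 0.19)

p <- ggplot(dataset, aes(x = x_pos, y = value)) +
  geom_jitter(width = 0.07, height = 0) +
  # Relabel the continuous axis so it reads like the original discrete one
  scale_x_continuous(breaks = seq_along(levels(sample_f)),
                     labels = levels(sample_f)) +
  labs(x = "sample")
p
```

Since replicate is not mapped to anything, colour, fill, alpha and shape all remain available for other variables.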

Add hline with population median for each facet

I'd like to plot a horizontal facet-wide line with the population median of that facet.
I tried the approach without creating a dummy summary table with the following code:
require(ggplot2)
dt = data.frame(gr = rep(1:2, each = 500),
id = rep(1:5, 2, each = 100),
y = c(rnorm(500, mean = 0, sd = 1), rnorm(500, mean = 1, sd = 2)))
ggplot(dt, aes(x = as.factor(id), y = y)) +
geom_boxplot() +
facet_wrap(~ gr) +
geom_hline(aes(yintercept = median(y), group = gr), colour = 'red')
However, the line is drawn for the median of the entire dataset instead of the median separately for each facet:
In the past, a solution has been suggested to use
geom_line(stat = "hline", yintercept = "median")
but it's been discontinued (produces the error "No stat called StatHline").
Another solution suggested
geom_errorbar(aes(ymax=..y.., ymin=..y.., y = mean))
but it generates
Error in data.frame(y = function (x, ...) :
arguments imply differing number of rows: 0, 1000
Finally, there's a way to plot the median by creating a dummy table with the desired stats but I'd like to avoid it.
You could create an extra column in dt for median per facet.
library(dplyr) # With dplyr for example
dt <- dt %>% group_by(gr) %>%
mutate(med = median(y))
# Rerun ggplot line with yintercept = med
ggplot(dt, aes(x = as.factor(id), y = y)) +
geom_boxplot() +
facet_wrap(~ gr) +
geom_hline(aes(yintercept = med, group = gr), colour = 'red')
If you don't want to add a new column with the computed median, you can add a geom_smooth using a quantile regression:
library(ggplot2)
library(quantreg)
set.seed(1234)
dt <- data.frame(gr = rep(1:2, each = 500),
id = rep(1:5, 2, each = 100),
y = c(rnorm(500, mean = 0, sd = 1),
rnorm(500, mean = 1, sd = 2)))
ggplot(dt, aes(y = y)) +
geom_boxplot(aes(x = as.factor(id))) +
geom_smooth(aes(x = id), method = "rq", formula = y ~ 1, se = FALSE) +
facet_wrap(~ gr)
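A third option, given here as a sketch and not taken from either answer, is to hand geom_hline() a function as its data argument. ggplot2 calls that function on the plot data when the plot is built, so the per-facet medians are computed inside the layer and neither dt nor the workspace gains an extra table. The helper name facet_medians is invented for this example.

```r
library(ggplot2)

set.seed(1234)
dt <- data.frame(gr = rep(1:2, each = 500),
                 id = rep(1:5, 2, each = 100),
                 y  = c(rnorm(500, mean = 0, sd = 1),
                        rnorm(500, mean = 1, sd = 2)))

# Called by the layer with the plot data; returns one median per value of gr
# (facet_medians is a hypothetical helper name)
facet_medians <- function(d) aggregate(y ~ gr, data = d, FUN = median)

p <- ggplot(dt, aes(x = as.factor(id), y = y)) +
  geom_boxplot() +
  facet_wrap(~ gr) +
  # The summarised data keeps gr, so each line lands in its own facet
  geom_hline(data = facet_medians, aes(yintercept = y), colour = "red")
p
```

This also draws a single line per facet, whereas mapping yintercept to a column of the full data draws one overlapping line per row.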
