Add hline with population median for each facet


I'd like to plot a horizontal facet-wide line with the population median of that facet.
I tried the approach without creating a dummy summary table with the following code:
require(ggplot2)
dt = data.frame(gr = rep(1:2, each = 500),
id = rep(1:5, 2, each = 100),
y = c(rnorm(500, mean = 0, sd = 1), rnorm(500, mean = 1, sd = 2)))
ggplot(dt, aes(x = as.factor(id), y = y)) +
geom_boxplot() +
facet_wrap(~ gr) +
geom_hline(aes(yintercept = median(y), group = gr), colour = 'red')
However, the line is drawn at the median of the entire dataset rather than at each facet's own median.
In the past, a solution has been suggested to use
geom_line(stat = "hline", yintercept = "median")
but it's been discontinued (produces the error "No stat called StatHline").
Another solution suggested
geom_errorbar(aes(ymax=..y.., ymin=..y.., y = mean))
but it generates
Error in data.frame(y = function (x, ...) :
arguments imply differing number of rows: 0, 1000
Finally, there's a way to plot the median by creating a dummy table with the desired stats but I'd like to avoid it.

You could create an extra column in dt for median per facet.
library(dplyr) # With dplyr for example
dt <- dt %>%
  group_by(gr) %>%
  mutate(med = median(y))
# Rerun ggplot line with yintercept = med
ggplot(dt, aes(x = as.factor(id), y = y)) +
  geom_boxplot() +
  facet_wrap(~ gr) +
  geom_hline(aes(yintercept = med, group = gr), colour = 'red')
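If the aim is mainly to leave dt untouched, one possible variation (a sketch, not part of the original answer) is to compute the per-facet medians inline with base R's aggregate() and hand them to geom_hline() as layer data. This still builds a small summary table, it just doesn't modify dt. The name med_df is only illustrative.

```r
library(ggplot2)

set.seed(1)
dt <- data.frame(gr = rep(1:2, each = 500),
                 id = rep(1:5, 2, each = 100),
                 y  = c(rnorm(500, mean = 0, sd = 1),
                        rnorm(500, mean = 1, sd = 2)))

# One row per facet: columns gr and y, where y is the median within that gr
med_df <- aggregate(y ~ gr, data = dt, FUN = median)

p <- ggplot(dt, aes(x = as.factor(id), y = y)) +
  geom_boxplot() +
  facet_wrap(~ gr) +
  # med_df carries the gr column, so each line is drawn in its own facet
  geom_hline(data = med_df, aes(yintercept = y), colour = "red")
p
```

Because med_df contains the faceting variable gr, each row ends up in the matching panel.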

If you don't want to add a new column with the computed median, you can add a geom_smooth using a quantile regression (with an intercept-only formula, quantreg's rq fits the median by default, since tau = 0.5):
library(ggplot2)
library(quantreg)
set.seed(1234)
dt <- data.frame(gr = rep(1:2, each = 500),
id = rep(1:5, 2, each = 100),
y = c(rnorm(500, mean = 0, sd = 1),
rnorm(500, mean = 1, sd = 2)))
ggplot(dt, aes(y = y)) +
geom_boxplot(aes(x = as.factor(id))) +
geom_smooth(aes(x = id), method = "rq", formula = y ~ 1, se = FALSE) +
facet_wrap(~ gr)

Related

Overlay two plots from different dataframes in R

I would like to overlay two ggplots from different data sources. I don't think a left_join will work because the dataframes are of two different lengths and the join would potentially change the underlying plots. [Maybe?]
library(tidyverse)
set.seed(123)
player_df <- tibble(name = rep(c("A","B","C","D"), each = 10, times = 1),
pos = rep(c("DEF","DEF","MID","MID"), each = 10, times = 1),
load = c(rnorm(10, mean = 200, sd = 100),
rnorm(10, mean = 300, sd = 50),
rnorm(10, mean = 400, sd = 100),
rnorm(10, mean = 500, sd = 50)))
p1 <- player_df %>%
ggplot(aes(x = load, y = name)) +
geom_point()
pos_df <- tibble(pos = rep(c("DEF","MID"), each = 30, times = 1),
load = (c(rnorm(30, mean = 250, sd = 100),
rnorm(30, mean = 350, sd = 100))))
p2 <- pos_df %>%
ggplot(aes(x = load, y = pos)) +
geom_boxplot()
p1
p2
# add p2 to every p1 player plot by pos
I would like p1 to have the corresponding p2 - by pos - appear behind it. So... add the matching p2 boxplot to each p1 scatterplot.

It's not really advisable to attempt to superimpose two plots on each other. A ggplot is made of layers already, so usually it's just a case of superimposing one geom on another. This can be difficult if (as in your case) one of the axes has different labels. However, with a little work it is possible to wrangle your data so that it all sits on a single plot. In your case, you could do something like:
levs <- c("A", "DEF", "B", "C", "MID", "D")
ggplot(within(pos_df, pos <- factor(pos, levs)), aes(x = load, y = pos)) +
  geom_boxplot(width = 2.3) +
  geom_point(data = within(player_df, pos <- factor(name, levs))) +
  scale_y_discrete(limits = c("A", "DEF", "B", " ", "C", "MID", "D"))

Dug into ggplot a bit and re-engineered a boxplot bit by bit.
# manually calculate stats that are used in boxplots
pos_df_summary <- pos_df %>%
  group_by(pos, .drop = FALSE) %>%
  summarise(min    = fivenum(load)[1],
            Q1     = fivenum(load)[2],
            median = fivenum(load)[3],
            Q3     = fivenum(load)[4],
            max    = fivenum(load)[5])
# add the boxplot data to each player
joined_df <- player_df %>%
  left_join(., pos_df_summary, by = "pos") %>%
  distinct(name, .keep_all = TRUE)
# plot
ggplot(data = NULL, aes(group = name)) +
  # create the line from min to max
  geom_segment(data = joined_df,
               aes(y = name, yend = name, x = min, xend = max),
               color = "black") +
  # create the box with median line
  geom_crossbar(data = joined_df,
                aes(y = name, xmin = Q1, xmax = Q3, x = median, fill = "NA"),
                color = "black",
                fatten = 1) +
  scale_fill_manual(values = "white") +
  # add the points from the player_df
  geom_point(data = player_df,
             aes(x = load, y = name, group = name),
             color = "red",
             show.legend = FALSE) +
  theme(legend.position = "none")
There may be some extraneous code in here as I cobbled it from some other resources. Specifically, I'm not sure what the aes(group = name) in the ggplot() call does exactly.

predefine bins in geom_histogram

How do I pre-define bins for the histogram? For example:
predefine_bin_edges <- seq(0, 10, 1)
Can someone please use these predefined bin edges to update the example below?
Edited later:
Also, is there a way to include the extreme points that fall outside the bin range in the corner bins, for a complete picture of the data distribution?
library(tidyverse)
# data
x <- rnorm(n = 1000, mean = 5, sd = 3)
tbl <- tibble(x)
# geom_histogram()
ggplot(data = tbl,
aes(x = x)) +
geom_histogram()

Do you mean like this?
library(tidyverse)
# data
x <- rnorm(n = 1000, mean = 5, sd = 3)
tbl <- tibble(x)
# geom_histogram()
ggplot(data = tbl,
aes(x = x)) +
geom_histogram(breaks = seq(0, 10, 1))
I don't think geom_histogram() has an argument that folds values outside the range into the corner bins. However, you can squish the data at the aes() level.
ggplot(data = tbl,
aes(x = pmax(pmin(x, 10), 0))) +
geom_histogram(breaks = seq(0, 10, 1))
Or if you're uncomfortable with that, you can set the limits + oob arguments in the scale.
ggplot(data = tbl,
aes(x = x)) +
geom_histogram(breaks = seq(0, 10, 1)) +
scale_x_continuous(limits = c(0, 10),
oob = scales::oob_squish)
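To make the effect of the pmax(pmin(...)) clamp concrete, here is a small base-R check on made-up values (the vector x and the use of hist() are only for illustration, not part of the original answer). Out-of-range values are pulled to the nearest edge and therefore counted in the first or last bin.

```r
# Hypothetical sample with values below 0 and above 10
x <- c(-3, 0.5, 4.2, 9.9, 12, 25)
breaks <- seq(0, 10, 1)

# Clamp every value into [0, 10], as in aes(x = pmax(pmin(x, 10), 0))
squished <- pmax(pmin(x, max(breaks)), min(breaks))
squished
# 0.0  0.5  4.2  9.9 10.0 10.0

# Count per bin without drawing anything
counts <- hist(squished, breaks = breaks, plot = FALSE)$counts
counts
# 2 0 0 0 1 0 0 0 0 3
```

The -3 lands in the first bin and the 12 and 25 land in the last one, so no observation is dropped.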

Remove points with 0 density (no data) in stat_density_2d(geom = 'point')

I have two dataframes, one which I want to make a stat_density_2d plot using a 'raster' geom and one in which I want to use a 'point' geom. For the point geom I want to remove any point where there is no data though, as measured by a point size of 0.
The following is my code:
library(tidyverse)
set.seed(1)
#tibble for raster density plot
df <- tibble(x = runif(1000000, min = -7, max = 5),
y = runif(1000000, min = 0, max = 1000))
#tibble for point density plot
df2 <- tibble(x = runif(20000, min = -2, max = 2),
y = runif(20000, min = 0, max = 500))
#create the density plot
p1 <- ggplot(NULL, aes(x=x, y=y) ) +
stat_density_2d(data = df, aes(fill = stat(density)), geom = "raster", contour = FALSE) +
scale_fill_gradient(low="transparent", high="red") +
stat_density_2d(data = df2, geom = "point", aes(size = ..density..), n = 40, contour = FALSE) +
theme_bw() +
theme(text=element_text(size=18)) +
ylim(0, 1000) + xlim(-7, 5)
p1
This produces the plot I expect, except that where the points are smallest (outside the bounds specified in the df2 tibble) I don't want any density points to be shown. Is there any way to remove these?

Here's a hack, though I don't know how robust it is to differences in data.
BLUF: add scale_radius(range=c(-1,6)).
I reduced your data a lot so that it doesn't take 5 minutes to render.
set.seed(1)
df <- tibble(x = runif(1000, min = -7, max = 5),
y = runif(1000, min = 0, max = 1000))
df2 <- tibble(x = runif(20, min = -2, max = 2),
y = runif(20, min = 0, max = 500))
Four plots:
Your code (my data), no other change;
scale_radius();
scale_radius(range = c(-0.332088004, 6)); and
scale_radius(range = c(-1, 6)).
This is surely a hack, and I don't know how to find a more precise way of filtering out specific levels.
The modified code:
p1 <- ggplot(NULL, aes(x=x, y=y) ) +
  stat_density_2d(data = df, aes(fill = stat(density)), geom = "raster", contour = FALSE) +
  scale_fill_gradient(low="transparent", high="red") +
  stat_density_2d(data = df2, geom = "point", aes(size = ..density..), n = 40, contour = FALSE) +
  theme_bw() +
  # scale_radius() +
  # scale_radius(range = c(-0.332088004, 6)) +
  scale_radius(range = c(-1, 6)) +
  theme(text=element_text(size=18)) +
  ylim(0, 1000) + xlim(-7, 5)
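A possibly more explicit alternative to the scale hack, offered only as a sketch: compute the density grid yourself with MASS::kde2d() (which, as far as I know, is the estimator stat_density_2d() uses internally), drop grid points below a cutoff, and draw the rest with geom_point(). The 1% cutoff and the names grid, pts and cutoff are assumptions chosen for illustration, and the result may not match stat_density_2d() exactly.

```r
library(ggplot2)

set.seed(1)
df2 <- data.frame(x = runif(20, min = -2, max = 2),
                  y = runif(20, min = 0, max = 500))

# Evaluate the 2D kernel density on a 40 x 40 grid spanning the plot limits
k <- MASS::kde2d(df2$x, df2$y, n = 40, lims = c(-7, 5, 0, 1000))

# expand.grid() varies x fastest, matching the column-major layout of k$z
grid <- expand.grid(x = k$x, y = k$y)
grid$density <- as.vector(k$z)

# Assumed cutoff: keep only grid points above 1% of the peak density
cutoff <- 0.01 * max(grid$density)
pts <- grid[grid$density > cutoff, ]

p <- ggplot(pts, aes(x = x, y = y, size = density)) +
  geom_point() +
  xlim(-7, 5) + ylim(0, 1000) +
  theme_bw()
p
```

The same pts data frame could be passed as the data of a geom_point() layer on top of the raster layer from the question.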

shade block between two lines, values vary with facet_wrap

I'm plotting the relationships between speed and time for four different species (each in a different facet). For each species, I have a range of speeds I'm interested in, and would like to shade the area between the min and max values. However, these ranges are different for the 4th species compared to the first three.
#data to plot as points
species <- sample(letters[1:4], 40, replace = TRUE)
time <- runif(40, min = 1, max = 100)
speed <- runif(40, min = 1, max = 20)
df <- data.frame(species, time, speed)
#ranges of key speeds
sp <- letters[1:4]
minspeed <- c(5, 5, 5, 8)
maxspeed <- c(10, 10, 10, 13)
df.range <- data.frame(sp, minspeed, maxspeed)
ggplot() +
geom_hline(data = df.range, aes(yintercept = minspeed),
colour = "red") +
geom_hline(data = df.range, aes(yintercept = maxspeed),
colour = "red") +
geom_point(data=df, aes(time, speed),
shape = 1) +
facet_wrap(~species) +
theme_bw()
How do I:
get geom_hline to only plot the max and min ranges for the correct species, and
shade the area between the two lines?
For the latter part, I've tried adding geom_ribbon to my plot, but I keep getting an error message that I'm unsure how to address.
geom_ribbon(data = df,
aes(ymin = minspeed, ymax = maxspeed,
x = c(0.0001, 100)),
fill = "grey",
alpha = 0.5) +
Error: Aesthetics must be either length 1 or the same as the data
(40): x, ymin, ymax

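As per my comment, the following should work. The key change is renaming the sp column of df.range to species, so that facet_wrap(~species) can place each row of df.range in its own facet; geom_rect then shades the band between the two lines. Perhaps there are other unobserved differences between your actual use case and the example in your question?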
colnames(df.range)[which(colnames(df.range) == "sp")] <- "species"
ggplot() +
  geom_hline(data = df.range, aes(yintercept = minspeed),
             colour = "red") +
  geom_hline(data = df.range, aes(yintercept = maxspeed),
             colour = "red") +
  geom_point(data = df, aes(time, speed),
             shape = 1) +
  geom_rect(data = df.range,
            aes(xmin = -Inf, xmax = Inf, ymin = minspeed, ymax = maxspeed),
            fill = "grey", alpha = 0.5) +
  facet_wrap(~species) +
  theme_bw()
Data used:
df <- data.frame(species = sample(letters[1:4], 40, replace = TRUE),
time = runif(40, min = 1, max = 100),
speed = runif(40, min = 1, max = 20))
df.range <- data.frame(sp = letters[1:4],
minspeed = c(5, 5, 5, 8),
maxspeed = c(10, 10, 10, 13))

Show multiple histogram using facet_wrap

Sample data
df <- data.frame(id = rep(1:6, each = 50), x = rnorm(50*6, mean = 10, sd = 5),
y = rnorm(50*6, mean = 20, sd = 10),
z = rnorm(50*6, mean = 30, sd = 15))
ggplot(df, aes(x)) + geom_histogram() + facet_wrap(~id)
How do I show x, y and z in the same plot for each id, in different colours?

It's best to reshape data from wide to long first, and then add a fill aesthetic to map what (i.e. x, y, z) to different fill colours:
library(tidyverse)
df %>%
  gather(what, val, -id) %>%
  ggplot(aes(val, fill = what)) +
  geom_histogram() +
  facet_wrap(~id)
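In current tidyr, gather() is superseded by pivot_longer(), so the same reshape could be sketched as follows. The bins, alpha and position settings are my own additions so that the three overlapping histograms stay visible instead of being stacked; they are not part of the original answer.

```r
library(tidyverse)

set.seed(42)
df <- data.frame(id = rep(1:6, each = 50),
                 x = rnorm(50 * 6, mean = 10, sd = 5),
                 y = rnorm(50 * 6, mean = 20, sd = 10),
                 z = rnorm(50 * 6, mean = 30, sd = 15))

# Wide to long: one row per (id, variable) value
long <- df %>%
  pivot_longer(-id, names_to = "what", values_to = "val")

p <- ggplot(long, aes(val, fill = what)) +
  # position = "identity" overlays the histograms rather than stacking them
  geom_histogram(bins = 30, alpha = 0.5, position = "identity") +
  facet_wrap(~ id)
p
```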
