How to make geom_smooth less dynamic

When generating smoothed plots with faceting in ggplot, if the range of the data changes from facet to facet, the smoothing may acquire too many degrees of freedom in the facets with less data.
For example
library(dplyr)
library(ggplot2) # ggplot2_2.2.1
set.seed(1234)
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
mutate(y = dnorm(x) + 0.4*runif(n())) %>%
filter(z <= x) %>%
ggplot(aes(x,y)) +
geom_line() +
geom_smooth(method = 'loess', span = 0.3) +
facet_wrap(~ z)
generates the following:
The z=-5 facet is fine, but as one moves to subsequent facets the smoothing seems to 'overfit'; indeed z=-1 already suffers from that, and in the last facet, z=2, the smoothed line fits the data perfectly. Ideally, what I would like is a less dynamic smoothing that for example always smooths about 4 points (or kernel smoothing with a fixed kernel).
The following SO question is related but perhaps more ambitious (in that it wants more control over span); here I want a simpler form of smoothing.
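To make "always smooths about 4 points" concrete, here is roughly the sort of thing I have in mind, sketched with a plain k-point moving average via stats::filter (zoo::rollmean would do the same job):
library(dplyr)
library(ggplot2)
set.seed(1234)
k <- 4  # fixed window: always average about k neighbouring points
expand.grid(z = -5:2, x = seq(-5, 5, len = 50)) %>%
  mutate(y = dnorm(x) + 0.4 * runif(n())) %>%
  filter(z <= x) %>%
  group_by(z) %>%
  mutate(y_smooth = as.numeric(stats::filter(y, rep(1 / k, k)))) %>%  # NA at the edges
  ggplot(aes(x, y)) +
  geom_line() +
  geom_line(aes(y = y_smooth), color = 'blue', na.rm = TRUE) +
  facet_wrap(~ z)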

I moved a few things around in your code to get this to work. I'm not sure if it's the best way to do it, but it's a simple way.
First we group by your z variable and then generate a span value that is small for large numbers of observations and large for small numbers. I guessed at 10/length(x); perhaps there's a more statistically sound way of looking at it, or perhaps it should be 2/diff(range(x)). Since this is for your own visual smoothing, you'll have to fine-tune that parameter yourself.
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
filter(z <= x) %>%
group_by(z) %>%
mutate(y = dnorm(x) + 0.4*runif(length(x)),
span = 10/length(x)) %>%
distinct(z, span)
# A tibble: 8 x 2
# Groups: z [8]
z span
<int> <dbl>
1 -5 0.2000000
2 -4 0.2222222
3 -3 0.2500000
4 -2 0.2857143
5 -1 0.3333333
6 0 0.4000000
7 1 0.5000000
8 2 0.6666667
Update
The method I did have here was not working correctly. The best way to do this (and the most flexible way to do model-fitting in general) is to pre-compute it.
So we take our grouped dataframe with the computed span, fit a loess model to each group with the appropriate span, and then use broom::augment to form it back into a dataframe.
library(broom)
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
filter(z <= x) %>%
group_by(z) %>%
mutate(y = dnorm(x) + 0.4*runif(length(x)),
span = 10/length(x)) %>%
do(fit = list(augment(loess(y~x, data = ., span = unique(.$span)), newdata = .))) %>%
unnest()
# A tibble: 260 x 7
z z1 x y span .fitted .se.fit
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -5 -5 -5.000000 0.045482851 0.2 0.07700057 0.08151451
2 -5 -5 -4.795918 0.248923802 0.2 0.18835244 0.05101045
3 -5 -5 -4.591837 0.243720422 0.2 0.25458037 0.04571323
4 -5 -5 -4.387755 0.249378098 0.2 0.28132026 0.04947480
5 -5 -5 -4.183673 0.344429272 0.2 0.24619206 0.04861535
6 -5 -5 -3.979592 0.256269425 0.2 0.19213489 0.05135924
7 -5 -5 -3.775510 0.004118627 0.2 0.14574901 0.05135924
8 -5 -5 -3.571429 0.093698117 0.2 0.15185599 0.04750935
9 -5 -5 -3.367347 0.267809673 0.2 0.17593182 0.05135924
10 -5 -5 -3.163265 0.208380125 0.2 0.22919335 0.05135924
# ... with 250 more rows
This has the side effect of duplicating the grouping column z, but it intelligently renames it to avoid name-collision, so we can ignore it. You can see that there are the same number of rows as the original data, and the original x, y, and z are there, as well as our computed span.
If you want to prove to yourself that it's really fitting each group with the right span, you can do something like:
... mutate(...) %>%
do(fit = (loess(y~x, data = ., span = unique(.$span)))) %>%
pull(fit) %>% purrr::map(summary)
That will print out the model summaries with the span included.
Now it's just a matter of plotting the augmented dataframe we just made, and manually reconstructing the smoothed line and confidence interval.
... %>%
ggplot(aes(x,y)) +
geom_line() +
geom_ribbon(aes(x, ymin = .fitted - 1.96*.se.fit,
ymax = .fitted + 1.96*.se.fit),
alpha = 0.2) +
geom_line(aes(x, .fitted), color = "blue", size = 1) +
facet_wrap(~ z)
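As an aside: dplyr has since superseded do(), and the same pre-computation can be written with tidyr::nest() and purrr::map(). A sketch of that route, assuming tidyr >= 1.0 and the libraries loaded above:
library(broom)
library(tidyr)
library(purrr)
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
  filter(z <= x) %>%
  group_by(z) %>%
  mutate(y = dnorm(x) + 0.4*runif(length(x)),
         span = 10/length(x)) %>%
  nest() %>%  # one row per z, the rest of the columns in a list-column 'data'
  mutate(fit = map(data, ~ augment(loess(y ~ x, data = .x, span = unique(.x$span)),
                                   newdata = .x))) %>%
  select(-data) %>%
  unnest(fit) %>%
  ggplot(aes(x, y)) +
  geom_line() +
  geom_ribbon(aes(ymin = .fitted - 1.96*.se.fit,
                  ymax = .fitted + 1.96*.se.fit), alpha = 0.2) +
  geom_line(aes(y = .fitted), color = 'blue', size = 1) +
  facet_wrap(~ z)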

I would simply remove the span option (because 0.3 seems too granular) or use lm method to do polynomial fit.
library(dplyr)
library(ggplot2) # ggplot2_2.2.1
set.seed(1234)
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
mutate(y = dnorm(x) + 0.4*runif(n())) %>%
filter(z <= x) %>%
ggplot(aes(x,y)) +
geom_line() +
geom_smooth(method = 'lm', formula = y ~ poly(x, 4)) +
#geom_smooth(method = 'loess') +
#geom_smooth(method = 'loess', span = 0.3) +
facet_wrap(~ z)

Since I asked how to do kernel smoothing I wanted to provide an answer for that.
I'll start by just adding it as extra data to the data frame and plotting that, much as the accepted answer does.
First here is the data and packages I'll be using (same as in my post):
library(dplyr)
library(ggplot2) # ggplot2_2.2.1
set.seed(1234)
expand.grid(z = -5:2, x = seq(-5,5, len = 50)) %>%
mutate(y = dnorm(x) + 0.4*runif(n())) %>%
filter(z <= x) ->
Z
Next here is the plot:
Z %>%
group_by(z) %>%
do(data.frame(ksmooth(.$x, .$y, 'normal', bandwidth = 2))) %>%
ggplot(aes(x,y)) +
geom_line(data = Z) +
geom_line(color = 'blue', size = 1) +
facet_wrap(~ z)
which simply uses ksmooth from base R. Note that it's quite simple to avoid the dynamic smoothing (making the bandwidth constant takes care of that). In fact, one can recover a dynamic-style smoothing (i.e., more like geom_smooth) as follows:
Z %>%
group_by(z) %>%
do(data.frame(ksmooth(.$x, .$y, 'normal', bandwidth = diff(range(.$x))/5))) %>%
ggplot(aes(x,y)) +
geom_line(data = Z) +
geom_line(color = 'blue', size = 1) +
facet_wrap(~ z)
I also followed the example in https://github.com/hrbrmstr/ggalt/blob/master/R/geom_xspline.r to turn this idea into an actual stat_ and geom_ as follows:
# Constructor: a line geom whose stat does the kernel smoothing;
# bandwidth is passed through to the stat via params
geom_ksmooth <- function(mapping = NULL, data = NULL, stat = "ksmooth",
                         position = "identity", na.rm = TRUE, show.legend = NA,
                         inherit.aes = TRUE,
                         bandwidth = 0.5, ...) {
  layer(
    geom = GeomKsmooth,
    mapping = mapping,
    data = data,
    stat = stat,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(bandwidth = bandwidth,
                  ...)
  )
}
# The geom is just GeomLine with different default aesthetics
GeomKsmooth <- ggproto("GeomKsmooth", GeomLine,
                       required_aes = c("x", "y"),
                       default_aes = aes(colour = "blue", size = 1, linetype = 1, alpha = NA)
)
stat_ksmooth <- function(mapping = NULL, data = NULL, geom = "line",
                         position = "identity", na.rm = TRUE, show.legend = NA,
                         inherit.aes = TRUE,
                         bandwidth = 0.5, ...) {
  layer(
    stat = StatKsmooth,
    data = data,
    mapping = mapping,
    geom = geom,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(bandwidth = bandwidth,
                  ...)
  )
}
# The stat does the actual work: it replaces each group's data
# with the x/y grid returned by ksmooth()
StatKsmooth <- ggproto("StatKsmooth", Stat,
                       required_aes = c("x", "y"),
                       compute_group = function(self, data, scales, params,
                                                bandwidth = 0.5) {
                         data.frame(ksmooth(data$x, data$y, kernel = 'normal',
                                            bandwidth = bandwidth))
                       }
)
(Note that I have a very poor understanding of the above code.) But now we can do:
Z %>%
ggplot(aes(x,y)) +
geom_line() +
geom_ksmooth(bandwidth = 2) +
facet_wrap(~ z)
And the smoothing is not dynamic, as I originally wanted.
I do wonder if there is a simpler way, though.
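One possibly simpler alternative, if fixed flexibility (rather than true kernel smoothing) is acceptable: fit a regression spline with a fixed number of degrees of freedom through geom_smooth, so the model complexity stays constant across facets. A sketch in the same spirit as the polynomial answer above:
Z %>%
  ggplot(aes(x, y)) +
  geom_line() +
  geom_smooth(method = 'lm', formula = y ~ splines::ns(x, df = 4)) +
  facet_wrap(~ z)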

Related

Visualising diagonal in asymmetric matrix plot

I have a number of symmetric matrices of the same dimensionality, and I wish to visualise the mean and variance of the values in each cell across these matrices in an elegant way (which I will make more precise below) that makes use of the symmetric character.
Let me start by making some data to illustrate. The following creates 10 9x9 matrices, aggregates the mean and variance, and transforms to long format in preparation for plotting:
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
make_matrix <- function(n) {
  m <- matrix(NA, nrow = n, ncol = n)
  m[lower.tri(m)] <- runif((n^2 - n) / 2)
  m <- pmax(m, t(m), na.rm = TRUE)
  diag(m) <- runif(n)
  rownames(m) <- colnames(m) <- letters[1:n]
  m
}
matrices <- replicate(10, make_matrix(9))
means <- apply(matrices, 1:2, mean) %>%
as_tibble(rownames = "row") %>%
pivot_longer(-1, names_to = "col", values_to = "mean")
vars <- apply(matrices, 1:2, var) %>%
as_tibble(rownames = "row") %>%
pivot_longer(-1, names_to = "col", values_to = "var")
df <- full_join(means, vars, by = c("row", "col"))
head(df)
#> # A tibble: 6 x 4
#> row col mean var
#> <chr> <chr> <dbl> <dbl>
#> 1 a a 0.548 0.111
#> 2 a b 0.507 0.0914
#> 3 a c 0.374 0.105
#> 4 a d 0.350 0.0976
#> 5 a e 0.525 0.0752
#> 6 a f 0.452 0.0887
Now, I could simply use geom_tile to make one plot of the means, and one plot of the variances. However, considering that both of these are symmetric, this wastes quite a lot of space, and also fails to communicate the symmetric character to the audience.
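For reference, the naive version I am describing is just something along these lines (two separate full-matrix plots):
library(ggplot2)
ggplot(df, aes(x = col, y = row, fill = mean)) + geom_tile()
ggplot(df, aes(x = col, y = row, fill = var)) + geom_tile()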
To address this problem, I have been playing around with the ggasym package to create an asymmetric matrix plot. The following is a slight modification from the ggasym vignette:
library(ggasym)
library(ggplot2)
ggplot(df, aes(x = col, y = row)) +
geom_asymmat(aes(fill_diag = mean, fill_tl = mean, fill_br = var)) +
scale_fill_diag_gradient(limits = c(0, 1), low = "lightpink", high = "tomato") +
scale_fill_tl_gradient(limits = c(0, 1), low = "lightpink", high = "tomato") +
scale_fill_br_gradient(low = "lightblue1", high = "dodgerblue") +
geom_text(data = filter(df, row == col), aes(label = signif(var, 2)))
Created on 2020-06-27 by the reprex package (v0.3.0)
What bothers me about this is the diagonal. In the above, I have mapped the fill of the diagonal to the means, and overlaid the variance by text, which works, but doesn't seem great. Specifically, I would like to map all the information here to fill, so as to get rid of the text. I see a couple of options for how to do this, but I am not sure how to implement any of them:
Split the fill of the diagonal cells, so that (in the example above) the lower right of each cell on the diagonal is an appropriate shade of blue, while the upper left is some shade of red.
Plot the upper and lower matrices separately (each with the diagonal), and then somehow "overlay" these plots so that they end up next to each other in an appropriate way. In other words, this would plot the diagonal twice.
I am open to other suggestions for how to accomplish this in a clean way. Let me emphasise that I do not require a solution building on ggasym, this was simply the closest I have been able to get so far. However, I would like some kind of ggplot-based solution.
So here is my take on the 'split-the-fill' strategy. You can plot most of the things you would want in ggplot if you don't mind parameterising your stuff as polygons. We let the ggnewscale package handle the double fill mapping for us.
First off, we no longer autoname the matrices, as we will not use the dimnames.
suppressPackageStartupMessages({
library(ggplot2)
library(tidyr)
library(dplyr)
library(ggnewscale)
})
make_matrix <- function(n) {
  m <- matrix(NA, nrow = n, ncol = n)
  m[lower.tri(m)] <- runif((n^2 - n) / 2)
  m <- pmax(m, t(m), na.rm = TRUE)
  diag(m) <- runif(n)
  # rownames(m) <- colnames(m) <- letters[1:n]
  m
}
Below is a function that takes a matrix, parameterises it as a polygon and cuts off one half.
halfmat <- function(mat, side) {
  side <- match.arg(side, c("upper", "lower", "both"))
  # Convert to long format
  dat <- data.frame(
    x = as.vector(row(mat)),
    y = as.vector(col(mat)),
    id = seq_along(mat),
    value = as.vector(mat)
  )
  # Parameterise as polygon
  poly <- with(dat, data.frame(
    x = c(x - 0.5, x + 0.5, x + 0.5, x - 0.5),
    y = c(y - 0.5, y - 0.5, y + 0.5, y + 0.5),
    id = rep(id, 4),
    value = rep(value, 4)
  ))
  # Slice off one of the triangles
  if (side == "upper") {
    poly <- filter(poly, y >= x)
  } else if (side == "lower") {
    poly <- filter(poly, x >= y)
  }
  poly
}
Then we generate the data, compute the means and variances and reparameterise them.
matrices <- replicate(10, make_matrix(9))
means <- apply(matrices, 1:2, mean) %>% halfmat("upper")
vars <- apply(matrices, 1:2, var) %>% halfmat("lower")
Then we put in the means and variances as two separate polygon layers, since we need to separate the fill mappings with new_scale_fill(). There is a bit of extra fiddling with the scales, as these are now continuous instead of discrete, but it is not that bad.
ggplot(means, aes(x, y, fill = value, group = id)) +
geom_polygon() +
scale_fill_distiller(palette = "Reds", name = "Mean") +
# Be sure to call new_scale_fill() only after you've set up a fill scale
# for the upper part
new_scale_fill() +
geom_polygon(data = vars, aes(fill = value)) +
scale_fill_distiller(palette = "Blues", name = "Variance") +
scale_x_continuous(breaks = function(x){seq(x[1] + 0.5, x[2] - 0.5, by = 1)},
labels = function(x){letters[x]},
expand = c(0,0), name = "col") +
scale_y_continuous(breaks = function(x){seq(x[1] + 0.5, x[2] - 0.5, by = 1)},
labels = function(x){letters[x]},
expand = c(0,0), name = "row")
Created on 2020-06-27 by the reprex package (v0.3.0)

Geom_freqpoly with Predefined Count

I can plot geom_freqpoly without problems using the number of observations:
ggplot(data=demo) +
geom_freqpoly(mapping=aes(x = value))
But I'd like to use the precalculated observation count contained in the data.
I tried using stat = "identity" but it apparently doesn't work.
ggplot(data=demo) +
geom_freqpoly(mapping=aes(x = value, y = cnt), stat = "identity")
This is my sample data
demo <- tribble(
~value, ~cnt,
.25, 20,
.25, 30,
.1, 40
)
TL;DR: You didn't get the graph you want, because the data of pre-calculated counts you passed to ggplot was NOTHING like what was used to produce the freqpoly graph.
Since you didn't include code for the original demo used to generate graph 1, I'll venture a guess:
demo.orig <- data.frame(value = c(0.25, 0.25, 0.1))
p <- ggplot(demo.orig, aes(x = value)) +
geom_freqpoly()
p # show plot to verify its appearance, which matches the graph in the question
layer_data(p) # look at the calculated data used by geom_freqpoly
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
y count x xmin xmax width density ncount ndensity PANEL group colour size linetype alpha
1 0 0 0.09310345 0.09051724 0.09568966 0.005172414 0.00000 0.0 0.0 1 -1 black 0.5 1 NA
2 1 1 0.09827586 0.09568966 0.10086207 0.005172414 64.44444 0.5 0.5 1 -1 black 0.5 1 NA
3 0 0 0.10344828 0.10086207 0.10603448 0.005172414 0.00000 0.0 0.0 1 -1 black 0.5 1 NA
... (omitted to conserve space)
30 0 0 0.24310345 0.24051724 0.24568966 0.005172414 0.00000 0.0 0.0 1 -1 black 0.5 1 NA
31 2 2 0.24827586 0.24568966 0.25086207 0.005172414 128.88889 1.0 1.0 1 -1 black 0.5 1 NA
32 0 0 0.25344828 0.25086207 0.25603448 0.005172414 0.00000 0.0 0.0 1 -1 black 0.5 1 NA
From a small dataframe with only two unique values, stat_bin generated a much larger dataframe with the x-axis split into 30 bins (the default number), and count / y = 0 everywhere except for the two bins containing the original values.
> geom_freqpoly
function (mapping = NULL, data = NULL, stat = "bin", position = "identity",
    ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE)
{
    params <- list(na.rm = na.rm, ...)
    if (identical(stat, "bin")) {
        params$pad <- TRUE
    }
    layer(data = data, mapping = mapping, stat = stat, geom = GeomPath,
        position = position, show.legend = show.legend, inherit.aes = inherit.aes,
        params = params)
}
A quick check by printing geom_freqpoly to console shows that its underlying geom is simply GeomPath, which plots x/y pairs in sequential order.
In other words, if you want to get the peaks from graph 1, you need to provide a similar dataset, with rows indicating where y should drop to 0. While it's certainly possible to calculate this by digging into the code for StatBin$compute_group, I'd think it's simpler to expand from the data of pre-calculated counts and let ggplot do its normal job:
demo %>%
tidyr::uncount(cnt) %>%
ggplot(aes(x = value)) +
geom_freqpoly() +
theme_minimal()
Edit: solution without fully expanding dataframe of aggregated counts
Sample dataset with 2 groups:
demo <- data.frame(value = c(0.25, 0.5, 0.1, 0.25, 0.75, 0.1),
cnt = c(5, 2, 4, 3, 8, 7) * 10e8,
group = rep(c("a", "b"), each = 3))
Code:
library(ggplot2)
library(dplyr)
demo %>%
  # rename here so approach below can be easily applied
  # to other datasets with different column names
  rename(x = value, y = cnt) %>%
  tidyr::nest(data = c(x, y)) %>% # nest to apply same approach for each group
  mutate(data = purrr::map(
    data,
    function(d) ggplot2:::bin_vector( # cut x's range into appropriate bins
      x = d$x,
      bins = ggplot2:::bin_breaks_bins(
        x_range = range(d$x),
        bins = 30), # default bin count is 30; change if desired
      pad = TRUE) %>%
      select(x, xmin, xmax) %>%
      # place y counts into the corresponding x bins (this is probably similar
      # to interval join, but I don't have that package installed on my machine)
      tidyr::crossing(d %>% rename(x2 = x)) %>%
      mutate(y = ifelse(x2 >= xmin & x2 < xmax, y, 0)) %>%
      select(-x2) %>%
      group_by(x) %>%
      filter(y == max(y)) %>%
      ungroup() %>%
      unique())) %>%
  tidyr::unnest(cols = c(data)) %>% # unnest to get one flat dataframe back
  ggplot(aes(x = x, y = y, colour = group)) + # plot as per normal
  geom_path() +
  theme_bw()
# package versions used: dplyr 1.0.0, ggplot2 3.3.1, tidyr 1.1.0, purrr 0.3.4
Based on the similar problem for histograms, the solution seems to be as simple as using the weight parameter in the aesthetics.
The solution using the sample data from the other answer would be
demo <- data.frame(value = c(0.25, 0.5, 0.1, 0.25, 0.75, 0.1),
cnt = c(5, 2, 4, 3, 8, 7) * 10e8,
group = rep(c("a", "b"), each = 3))
ggplot(demo, aes(value, weight = cnt, color = group)) + geom_freqpoly()
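For what it's worth, the same weight trick carries over to the histogram version of the problem, e.g.:
ggplot(demo, aes(value, weight = cnt, fill = group)) +
  geom_histogram(position = 'dodge', bins = 30)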

How to make stacked circle plot without coord_polar

I've got a dataset similar to this:
x <- 100 - abs(rnorm(1e6, 0, 5))
y <- 50 + rnorm(1e6, 0, 3)
dist <- sqrt((x - 100)^2 + (y - 50)^2)
z <- exp(-(dist / 8)^2)
which can be visualised as follows:
data.frame(x, y, z) %>%
ggplot() + geom_point(aes(x, y, color = z))
What I would like to do is a stacked half-circle plot with averaged value of z in subsequent layers. I think it can be done with the combination of geom_col and coord_polar(), although the farthest I can get is
data.frame(x, y, z, dist) %>%
mutate(dist_fct = cut(dist, seq(0, max(dist), by = 5))) %>%
ggplot() + geom_bar(aes(x = 1, y = 1, fill = dist_fct), stat = 'identity', position = 'fill') +
coord_polar()
which is obviously far from the expectation (layers should be of equal size, plot should be clipped on the right half).
The problem is that I can't really use coord_polar() due to further use of annotation_custom(). So my questions are:
Can a plot like this be done without coord_polar()?
If not, how can it be done with coord_polar()?
The result should be similar to the graphic below, except that instead of plotting layers constructed from points, I would like to plot only whole layers, with color defined as the average value of z inside a layer.
If you want simple radius bands, perhaps something like this would work as you pictured it in your question:
# your original sample data
x <- 100 - abs(rnorm(1e6, 0, 5))
y <- 50 + rnorm(1e6, 0, 3)
dist <- sqrt((x - 100)^2 + (y - 50)^2)
nbr_bands <- 6 # set nbr of bands to plot
# calculate width of bands
band_width <- max(dist)/(nbr_bands-1)
# dist div band_width yields an integer 0 to nbr bands
# as.factor makes it categorical, which is what you want for the plot
band = as.factor(dist %/% (band_width))
library(dplyr)
library(ggplot2)
data.frame(x, y, band) %>%
ggplot() + geom_point(aes(x, y, color = band)) + coord_fixed() +
theme_dark() # dark theme
Edit to elaborate:
As you first attempted, it would be nice to use the very handy cut() function to calculate the radius color categories.
One way to get categorical (discrete) colors, rather than continuous shading, for your plot color groups is to set your aes color= to a factor column.
To directly get a factor from cut() you may use option ordered_result=TRUE:
band <- cut(dist, nbr_bands, ordered_result=TRUE, labels=1:nbr_bands) # also use `labels=` to specify your own labels
data.frame(x, y, band) %>%
ggplot() + geom_point(aes(x, y, color = band)) + coord_fixed()
Or more simply you may use cut() without options and convert to a factor using as.factor():
band <- as.factor( cut(dist, nbr_bands, labels=FALSE) )
data.frame(x, y, band) %>%
ggplot() + geom_point(aes(x, y, color = band)) + coord_fixed()
Sounds like you may find the circle & arc plotting functions from the ggforce package useful:
# data
set.seed(1234)
df <- data.frame(x = 100 - abs(rnorm(1e6, 0, 5)),
y = 50 + rnorm(1e6, 0, 3)) %>%
mutate(dist = sqrt((x - 100)^2 + (y - 50)^2)) %>%
mutate(z = exp(-(dist / 8)^2))
# define cut-off values
cutoff.values <- seq(0, ceiling(max(df$dist)), by = 5)
df %>%
# calculate the mean z for each distance band
mutate(dist_fct = cut(dist, cutoff.values)) %>%
group_by(dist_fct) %>%
summarise(z = mean(z)) %>%
ungroup() %>%
# add the cutoff values to the dataframe for inner & outer radius
arrange(dist_fct) %>%
mutate(r0 = cutoff.values[-length(cutoff.values)],
r = cutoff.values[-1]) %>%
# add coordinates for circle centre
mutate(x = 100, y = 50) %>%
# plot
ggplot(aes(x0 = x, y0 = y,
r0 = r0, r = r,
fill = z)) +
geom_arc_bar(aes(start = 0, end = 2 * pi),
color = NA) + # hide outline
# force equal aspect ratio in order to get true circle
coord_equal(xlim = c(70, 100), expand = FALSE)
Plot generation took <1s on my machine. Yours may differ.
I'm not sure this satisfies everything, but it should be a start. To cut down on the time for plotting, I'm summarizing the data into a grid, which lets you use geom_raster. I don't entirely understand the breaks and everything you're using, so you might want to tweak how I divided the data into distinct bands. I tried out a couple of ways with cut_interval and cut_width; this would be a good place to plug in different options, such as the number or width of bands.
Since you mentioned getting the average z for each band, I'm grouping by the gridded x and y and the cut dist, then using the mean of z to set the bands. I threw in a step to make labels like in the example (you probably want to reverse them or adjust their positioning), which comes from getting the number of each band's factor level.
library(tidyverse)
set.seed(555)
n <- 1e6
df <- data_frame(
x = 100 - abs(rnorm(n, 0, 5)),
y = 50 + rnorm(n, 0, 3),
dist = sqrt((x - 100)^2 + (y - 50)^2),
z = exp(-(dist / 8)^2)
) %>%
mutate(brk = cut(dist, seq(0, max(dist), by = 5), include.lowest = T))
summarized <- df %>%
filter(!is.na(brk)) %>%
mutate(x_grid = floor(x), y_grid = floor(y)) %>%
group_by(x_grid, y_grid, brk) %>%
summarise(avg_z = mean(z)) %>%
ungroup() %>%
# mutate(z_brk = cut_width(avg_z, width = 0.15)) %>%
mutate(z_brk = cut_interval(avg_z, n = 9)) %>%
mutate(brk_num = as.numeric(z_brk))
head(summarized)
#> # A tibble: 6 x 6
#> x_grid y_grid brk avg_z z_brk brk_num
#> <dbl> <dbl> <fct> <dbl> <fct> <dbl>
#> 1 75 46 (20,25] 0.0000697 [6.97e-05,0.11] 1
#> 2 75 47 (20,25] 0.000101 [6.97e-05,0.11] 1
#> 3 75 49 (20,25] 0.0000926 [6.97e-05,0.11] 1
#> 4 75 50 (20,25] 0.0000858 [6.97e-05,0.11] 1
#> 5 75 52 (20,25] 0.0000800 [6.97e-05,0.11] 1
#> 6 76 51 (20,25] 0.000209 [6.97e-05,0.11] 1
To make the labels, summarize that data to have a single row per band; I did this by taking the minimum of the gridded x, then using the average of y so they'll show up in the middle of the plot.
labels <- summarized %>%
group_by(brk_num) %>%
summarise(min_x = min(x_grid)) %>%
ungroup() %>%
mutate(y_grid = mean(summarized$y_grid))
head(labels)
#> # A tibble: 6 x 3
#> brk_num min_x y_grid
#> <dbl> <dbl> <dbl>
#> 1 1 75 49.7
#> 2 2 88 49.7
#> 3 3 90 49.7
#> 4 4 92 49.7
#> 5 5 93 49.7
#> 6 6 94 49.7
geom_raster is great for these situations where you have data in an evenly spaced grid that just needs uniform tiles at each position. At this point, the summarized data has 595 rows, instead of the original 1 million, so the time to plot shouldn't be an issue.
ggplot(summarized) +
geom_raster(aes(x = x_grid, y = y_grid, fill = z_brk)) +
geom_label(aes(x = min_x, y = y_grid, label = brk_num), data = labels, size = 3, hjust = 0.5) +
theme_void() +
theme(legend.position = "none", panel.background = element_rect(fill = "gray40")) +
coord_fixed() +
scale_fill_brewer(palette = "PuBu")
Created on 2018-11-04 by the reprex package (v0.2.1)

Adding group-specific text/data to faceted plot in R/ggplot2

I am comparing the intra-group correlation between duplicate samples within a large gene expression experiment, where I have multiple separate biological groups - the idea being to see if any of the groups is much less well-correlated than the others, indicating a potential sample mixup or other error.
I am using ggplot to plot the expression values of each duplicate pair against each other. I would like to also be able to add the correlation coefficient and p-value to each panel of the plot, which I obtain through summarize and cor.test. You can use this code to get the general idea: in exp1, the duplicates are correlated, but not in exp2.
library(tidyverse)
df <- data.frame(exp=c(rep('exp1', 100), rep('exp2', 100)), a=rnorm(200, 1000, 200))
df <- mutate(df, b=ifelse(exp=='exp1', a*rnorm(100,1,0.05), rnorm(100, 1000, 200)))
head(df)
tail(df)
df %>% ggplot(aes(x=a, y=b))+
geom_point() +
facet_wrap(~exp)
group_by(df, exp) %>%
summarize(corr=cor.test(a,b)$estimate, pval=cor.test(a,b)$p.value)
This is the plot I generated via ggplot, and I've manually added the R and p-values that I got at the end. But of course, if I have a lot of sample pairs to analyze, it would be nice to be able to add these automatically from within the ggplot call. I'm just not sure how to do it.
If, for whatever reason, you want to build this yourself instead of using the ggpubr functions, you can create your summary data, format labels, and place the labels with geom_text.
I'm formatting the stats so that R has a fixed 3 significant digits and p has 3 digits, falling back on scientific notation. I changed the names of those columns in summarise to R and p to make the labels below. Reshaping to long data and creating a new column with unite gets this:
library(tidyverse)
...
group_by(df, exp) %>%
summarize(R = cor.test(a, b)$estimate, p = cor.test(a, b)$p.value) %>%
mutate(R = formatC(R, format = "fg", digits = 3),
p = formatC(p, format = "g", digits = 3)) %>%
gather(key = measure, value = value, -exp) %>%
unite("stat", measure, value, sep = " = ")
#> # A tibble: 4 x 2
#> exp stat
#> <chr> <chr>
#> 1 exp1 R = 0.965
#> 2 exp2 R = 0.0438
#> 3 exp1 p = 1.14e-58
#> 4 exp2 p = 0.665
Then for each of the groups, I want to collapse both labels, separated by a newline \n. This is a place that will scale well: you might have more summary stats to display, but this should still work.
summ <- group_by(df, exp) %>%
summarize(R = cor.test(a, b)$estimate, p = cor.test(a, b)$p.value) %>%
mutate(R = formatC(R, format = "fg", digits = 3),
p = formatC(p, format = "g", digits = 3)) %>%
gather(key = measure, value = value, -exp) %>%
unite("stat", measure, value, sep = " = ") %>%
group_by(exp) %>%
summarise(both_stats = paste(stat, collapse = "\n"))
summ
#> # A tibble: 2 x 2
#> exp both_stats
#> <chr> <chr>
#> 1 exp1 "R = 0.965\np = 1.14e-58"
#> 2 exp2 "R = 0.0438\np = 0.665"
In geom_text, I'm setting the x coordinate to -Inf, which gets the minimum of all x values, and the y coordinate as Inf for the maximum of all y values. That puts the label in the top-left corner, regardless of the values in the data.
The one thing I don't like here is then hacking the hjust and vjust outside their intended ranges of 0 to 1. But nudge_x/nudge_y won't do anything because of the values being set to infinity.
df %>%
ggplot(aes(x = a, y = b)) +
geom_point() +
geom_text(aes(x = -Inf, y = Inf, label = both_stats), data = summ,
hjust = -0.1, vjust = 1.1, lineheight = 1) +
facet_wrap(~ exp)
Created on 2018-11-14 by the reprex package (v0.2.1)
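If the out-of-range hjust/vjust values bother you, one workaround (a sketch) is to pad the label string itself with a leading newline and some spaces, keeping the justification at its limits:
summ_padded <- summ %>%
  mutate(both_stats = paste0("\n  ", gsub("\n", "\n  ", both_stats)))
df %>%
  ggplot(aes(x = a, y = b)) +
  geom_point() +
  geom_text(aes(x = -Inf, y = Inf, label = both_stats), data = summ_padded,
            hjust = 0, vjust = 1, lineheight = 1) +
  facet_wrap(~ exp)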
We can use the stat_cor function from the ggpubr package.
set.seed(123)
library(dplyr)
library(ggplot2)
library(ggpubr)
df <- data.frame(exp=c(rep('exp1', 100), rep('exp2', 100)), a=rnorm(200, 1000, 200))
df <- mutate(df, b=ifelse(exp=='exp1', a*rnorm(100,1,0.05), rnorm(100, 1000, 200)))
ggplot(df, aes(x=a, y=b))+
geom_point() +
facet_wrap(~exp) +
stat_cor(method = "pearson")
Similar to camille's answer, but you can do it all in one run:
library(tidyverse)
set.seed(123)
df %>%
group_by(exp) %>%
mutate(p = cor.test(a, b)$p.value,
rho = cor.test(a, b)$estimate) %>%
mutate_at(vars(p, rho), signif, 2) %>%
ggplot(aes(x=a, y=b)) +
geom_point() +
geom_text(data = . %>% distinct(p, rho, exp),
aes(x = -Inf, y = Inf,label = paste("p=",p,"\nrho=",rho)),
hjust = -0.1, vjust = 1.1, lineheight = 1) +
facet_wrap(~exp)

overlay/superimpose grouped bar plots in ggplot2

I'd like to make a bar plot featuring an overlay of data from two time points, 'before' and 'after'.
At each time point, participants were asked two questions ('pain' and 'fear'), which they would answer by stating a score of 1, 2, or 3.
My existing code plots the counts for the data from the 'before' time point nicely, but I can't seem to add the counts for the 'after' data.
This is a sketch of what I'd like the plot to look like with the 'after' data added, with the black bars representing the 'after' data:
I'd like to make the plot in ggplot2, and I've tried to adapt code from How to superimpose bar plots in R?, but I can't get it to work for grouped data.
Many thanks!
#DATA PREP
library(dplyr)
library(ggplot2)
library(tidyr)
df <- data.frame(before_fear=c(1,1,1,2,3),before_pain=c(2,2,1,3,1),after_fear=c(1,3,3,2,3),after_pain=c(1,1,2,3,1))
df <- df %>% gather("question", "answer_option")
# Get the counts for each answer of each question
df2 <- df %>%
group_by(question,answer_option) %>%
summarise (n = n())
df2 <- as.data.frame(df2)
df3 <- df2 %>% mutate(time = factor(ifelse(grepl("before", question), "before", "after"),
                                    c("before", "after")))
# change classes and split data into two data frames
df3$n <- as.numeric(df3$n)
df3$answer_option <- as.factor(df3$answer_option)
df3after <- df3[ which(df3$time=='after'), ]
df3before <- df3[ which(df3$time=='before'), ]
# CODE FOR 'BEFORE' DATA ONLY PLOT - WORKS
ggplot(df3before, aes(fill=answer_option, y=n, x=question)) + geom_bar(position="dodge", stat="identity")
# CODE FOR 'BEFORE' AND 'AFTER' DATA PLOT - DOESN'T WORK
ggplot(mapping = aes(x, y,fill)) +
geom_bar(data = data.frame(x = df3before$question, y = df3before$n, fill= df3before$index_value), width = 0.8, stat = 'identity') +
geom_bar(data = data.frame(x = df3after$question, y = df3after$n, fill=df3after$index_value), width = 0.4, stat = 'identity', fill = 'black') +
theme_classic() + scale_y_continuous(expand = c(0, 0))
I think the clue is to set the width of the "after" bars, but to dodge them as if their width were 0.9 (i.e. the same (default) width as the "before" bars). In addition, because we don't map fill of the "after" bars, we need to use the group aesthetic instead to achieve the dodging.
I prefer to have only one data set and just subset it in each call to geom_col.
ggplot(mapping = aes(x = question, y = n, fill = factor(ans))) +
geom_col(data = d[d$t == "before", ], position = "dodge") +
geom_col(data = d[d$t == "after", ], aes(group = ans),
fill = "black", width = 0.5, position = position_dodge(width = 0.9))
Data:
set.seed(2)
d <- data.frame(t = rep(c("before", "after"), each = 6),
question = rep(c("pain", "fear"), each = 3),
ans = 1:3, n = sample(12))
Alternative data preparation using data.table, starting with your original 'df':
library(data.table)
d <- melt(setDT(df), measure.vars = names(df), value.name = "ans")
d[ , c("t", "question") := tstrsplit(variable, "_")]
Either pre-calculate the counts (keeping t in the grouping so the before/after subsets survive) and proceed as above with geom_col:
# d2 <- d[ , .N, by = .(t, question, ans)]
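For completeness, a sketch of that route (data.table names the count column N, so the y aesthetic changes accordingly):
d2 <- d[ , .N, by = .(t, question, ans)]
ggplot(mapping = aes(x = question, y = N, fill = factor(ans))) +
  geom_col(data = d2[d2$t == "before", ], position = "dodge") +
  geom_col(data = d2[d2$t == "after", ], aes(group = ans),
           fill = "black", width = 0.5, position = position_dodge(width = 0.9))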
Or let geom_bar do the counting:
ggplot(mapping = aes(x = question, fill = factor(ans))) +
geom_bar(data = d[d$t == "before", ], position = "dodge") +
geom_bar(data = d[d$t == "after", ], aes(group = ans),
fill = "black", width = 0.5, position = position_dodge(width = 0.9))
Data:
df <- data.frame(before_fear = c(1,1,1,2,3), before_pain = c(2,2,1,3,1),
after_fear = c(1,3,3,2,3),after_pain = c(1,1,2,3,1))
My solution is very similar to Henrik's, but I wanted to point out a few things.
First, you're building your data frames inside your geom_cols, which is probably messier than you need it to be. If you've already created df3after, etc., you might as well use it inside your ggplot.
Second, I had a hard time following your tidying. There are a couple of tidyr functions that might make this task easier on you, so I went a different route, using separate to create the time and measure columns rather than essentially searching for them manually, which makes it more scalable. This also lets you put "pain" and "fear" on your x-axis, rather than still having "before_pain" and "before_fear", which are no longer accurate representations once you have "after" values on the plot as well. But feel free to disregard this and stick with your own method.
library(tidyverse)
df <- data.frame(before_fear = c(1,1,1,2,3),
before_pain = c(2,2,1,3,1),
after_fear = c(1,3,3,2,3),
after_pain = c(1,1,2,3,1))
df_long <- df %>%
gather(key = question, value = answer_option) %>%
mutate(answer_option = as.factor(answer_option)) %>%
count(question, answer_option) %>%
separate(question, into = c("time", "measure"), sep = "_", remove = F)
df_long
#> # A tibble: 12 x 5
#> question time measure answer_option n
#> <chr> <chr> <chr> <fct> <int>
#> 1 after_fear after fear 1 1
#> 2 after_fear after fear 2 1
#> 3 after_fear after fear 3 3
#> 4 after_pain after pain 1 3
#> 5 after_pain after pain 2 1
#> 6 after_pain after pain 3 1
#> 7 before_fear before fear 1 3
#> 8 before_fear before fear 2 1
#> 9 before_fear before fear 3 1
#> 10 before_pain before pain 1 2
#> 11 before_pain before pain 2 2
#> 12 before_pain before pain 3 1
I split this into before & after datasets, as you did, then plotted them with two geom_cols. I still put df_long into ggplot, treating it almost as a dummy to get uniform x and y aesthetics. Like Henrik said, you can use different widths in the geom_col and in its position_dodge to dodge the bars at a width of 90% but give the bars themselves a width of only 40%.
df_before <- df_long %>% filter(time == "before")
df_after <- df_long %>% filter(time == "after")
ggplot(df_long, aes(x = measure, y = n)) +
geom_col(aes(fill = answer_option),
data = df_before, width = 0.9,
position = position_dodge(width = 0.9)) +
geom_col(aes(group = answer_option),
data = df_after, fill = "black", width = 0.4,
position = position_dodge(width = 0.9))
What you could do instead of making the two separate data frames is to filter inside each geom_col. This is generally my preference unless the filtering is more complex. This code will get the same plot as above.
ggplot(df_long, aes(x = measure, y = n)) +
geom_col(aes(fill = answer_option),
data = . %>% filter(time == "before"), width = 0.9,
position = position_dodge(width = 0.9)) +
geom_col(aes(group = answer_option),
data = . %>% filter(time == "after"), fill = "black", width = 0.4,
position = position_dodge(width = 0.9))
