How is the binning done in stat_summary_bin in ggplot2? - r

I'm trying to add some custom features to a bin-scatter plot using ggplot2. The original way that I was doing the bin-scatter was with stat_summary_bin(fun.y="mean"). This seems to produce a reasonable binning, but when I try to reproduce it by binning manually, I keep getting slightly different results -- especially at the right tail.
Can anyone help me figure out how the binning in stat_summary_bin is done? I need to figure out if this is a reliable form of bin-scattering that I can use...
library(tidyverse)
library(mltools)
#>
#> Attaching package: 'mltools'
#> The following object is masked from 'package:tidyr':
#>
#> replace_na
x = runif(1000, 0, 10)
y = x + rnorm(1000, 0.5, 2)
plot(x,y)
df <- data.frame(x = x, y = y)
p <- df %>%
ggplot(aes(x = x, y = y)) +
stat_summary_bin(aes(color ="stat summary"),fun.y = "mean", size = 2.5, geom="point", bins=20)
p
## Attempt 1 at binning
df$x_bin <- mltools::bin_data(df$x, bins=20, binType = "explicit")
df_binned <- df %>%
group_by(x_bin) %>%
mutate(
x_binned = mean(x),
y_binned = mean(y)
) %>%
ungroup()
p <- p + geom_point(aes(x = df_binned$x_binned, y = df_binned$y_binned, color = "manual bin"), size = 2.5)
p
## Attempt 2 at binning
xbreaks = quantile(df$x, probs = seq(0,1,0.05))
df_binned$x_bin_2 <- cut(df$x, xbreaks, include.lowest = T)
df_binned <- df_binned %>%
group_by(x_bin_2) %>%
mutate(
x_binned2 = mean(x),
y_binned2 = mean(y)
) %>%
ungroup()
p <- p + geom_point(aes(x = df_binned$x_binned2, y = df_binned$y_binned2, color = "2nd manual bin"), size = 2.5)
p
Created on 2018-09-09 by the reprex package (v0.2.0).
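One possible source of the mismatch, offered as a hedged note rather than a confirmed answer: as far as I understand, stat_summary_bin() cuts x into equal-width bins and draws each summary at the midpoint of its bin, whereas both manual attempts plot at mean(x) within the bin, and the second attempt uses quantile (equal-count) bins. I have not checked exactly how ggplot2 places the outermost breaks, so the base-R sketch below is an approximation. The names n_bins, equal_width and equal_count are made up for illustration.

```r
set.seed(1)
x <- runif(1000, 0, 10)
y <- x + rnorm(1000, 0.5, 2)
n_bins <- 20

# Equal-width bins over the observed range of x (ASSUMPTION: this only
# approximates what stat_summary_bin() does; its exact breaks may differ).
breaks <- seq(min(x), max(x), length.out = n_bins + 1)
bin <- cut(x, breaks, include.lowest = TRUE, labels = FALSE)
bin <- factor(bin, levels = seq_len(n_bins))

# Summaries are placed at the bin midpoints, not at mean(x) within the bin.
mids <- (breaks[-1] + breaks[-length(breaks)]) / 2
equal_width <- data.frame(x = mids, y = as.vector(tapply(y, bin, mean)))

# Equal-count (quantile) bins, as in the second manual attempt, placed at mean(x).
q_breaks <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1))
q_bin <- cut(x, q_breaks, include.lowest = TRUE, labels = FALSE)
equal_count <- data.frame(
  x = as.vector(tapply(x, q_bin, mean)),
  y = as.vector(tapply(y, q_bin, mean))
)

head(equal_width)
head(equal_count)
```

Comparing equal_width with the points drawn by stat_summary_bin() should show whether the remaining differences come from the x positions or from the breaks themselves.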

Related

Using assign within ggplot loop gives incorrect plots

I'm creating three plots in a loop over i and using assign to save each plot. The y variable is scaled by the loop index, which should ensure that each plot in the final panel has y going from 0 to 1. This isn't happening, and the plots seem to change as the loop runs. I'd be grateful if someone could explain this apparently odd behaviour.
library(dplyr)
library(ggplot2)
library(gridExtra)
loci = c(1,2,3)
x <- seq(0,1,0.01)
df <- expand.grid(x = x, loci = loci)
df <- df %>% mutate(y = loci * x)
cols = c("red", "blue", "green")
for (i in loci){
plot_this <- df %>% filter(loci == i)
my_plot = ggplot(plot_this) +
geom_point( aes( x = x, y = y/i), colour = cols[i]) +
ylim(0,3) + ggtitle(paste0("i = ", i))
assign(paste0("plot_", i), my_plot)
print(plot_1)
}
grid.arrange(plot_1, plot_2, plot_3, ncol = 3)
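It's due to the lazy evaluation nature of ggplot: the expression y/i inside aes() is not evaluated when the plot object is created but when the plot is drawn, so every stored plot uses whatever value i has at that moment. More explanation can be found in this post.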
"Looping" with lapply avoids the problem.
Data
library(ggplot2)
library(gridExtra)
library(dplyr)
loci = c(1,2,3)
x <- seq(0,1,0.01)
df <- expand.grid(x = x, loci = loci)
df <- df %>% mutate(y = loci * x)
cols = c("red", "blue", "green")
Code
my_plot <- lapply(loci, function(i) {
df %>%
filter(loci == i) %>%
ggplot() +
geom_point(aes(x = x, y = y/i), colour = cols[i]) +
ylim(0,3) +
ggtitle(paste0("i = ", i))
})
grid.arrange(my_plot[[1]], my_plot[[2]], my_plot[[3]], ncol = 3)
Created on 2022-04-26 by the reprex package (v2.0.1)
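If you would rather keep the for loop, another option is to do the scaling in the data before it reaches ggplot, so that aes() only refers to columns and never to i. This is a minimal sketch of that idea, not part of the original answer; the column name y_scaled and the list plots are made up for illustration.

```r
library(ggplot2)

loci <- c(1, 2, 3)
df <- expand.grid(x = seq(0, 1, 0.01), loci = loci)
df$y <- df$loci * df$x
cols <- c("red", "blue", "green")

plots <- list()
for (i in loci) {
  plot_this <- df[df$loci == i, ]
  # Evaluate the scaling now, so the plot stores numbers rather than an
  # expression that still refers to the loop variable i.
  plot_this$y_scaled <- plot_this$y / i
  plots[[i]] <- ggplot(plot_this, aes(x = x, y = y_scaled)) +
    geom_point(colour = cols[i]) +
    ylim(0, 3) +
    ggtitle(paste0("i = ", i))
}

# gridExtra::grid.arrange(grobs = plots, ncol = 3) would then draw the panel.
```

Storing the plots in a list also avoids assign() and makes them easy to pass on together.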

Simulate a two-dimensional random walk in a grid in R and plot with ggplot

I was looking for simple code that could simulate a two-dimensional random walk in a grid (using R), and then plot the data using ggplot.
In particular, I am interested in a random walk from a few starting positions (5 points) in a 2D grid to the center of the square grid. It is just for visualisation purposes.
My idea was then to plot the results with ggplot on a discrete grid (like the one simulated), maybe using the function geom_tile.
Do you have any suggestions for pre-existing code that I could easily adapt?
Here is a small example with a for loop. From here, you can simply adjust how X_t and Y_t are defined:
library(ggplot2)

# Start at the origin; each step adds independent Gaussian noise to x and y
Xt = 0; Yt = 0
for (i in 2:1000)
{
  Xt[i] = Xt[i-1] + rnorm(1,0,1)
  Yt[i] = Yt[i-1] + rnorm(1,0,1)
}
df <- data.frame(x = Xt, y = Yt)
ggplot(df, aes(x=x, y=y)) + geom_path() + theme_classic() + coord_fixed(1)
EDIT ----
After chatting with OP I've revised the code to include a step probability. This may result in the walk being stationary much more frequently. In higher dimensions, you will need to scale your prob factor lower in order to compensate for more options.
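To make the effect of prob concrete, here is a small sketch of the weight given to staying in place. It mirrors the case_when()/if_else() logic used in move_step(); the helper name stay_weight is made up for illustration, and the neighbourhood counts assume an interior grid point with size = 1.

```r
# Weight given to staying put: 1 - prob * (number of other options),
# floored at zero, as in the weight column built inside move_step().
stay_weight <- function(n_options, prob = 0.04) {
  max(0, 1 - prob * (n_options - 1))
}

stay_2d <- stay_weight(9)                           # 3 x 3 neighbourhood
stay_3d <- stay_weight(27)                          # 3 x 3 x 3 neighbourhood
stay_3d_small_prob <- stay_weight(27, prob = 0.01)  # same, with a lower prob

c(stay_2d, stay_3d, stay_3d_small_prob)
```

With the default prob the 3D walker never stays put, because the raw weight goes negative and is clipped to zero; lowering prob restores a positive chance of staying.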
Finally, my function does not account for absolute distance; it only considers points on the grid that are within a certain step size in every dimension. For example, at position c(0,0) you could move to c(1,1) with this function. But I guess this is relative to the grid's connectivity.
If the OP wants to only consider nodes that are within 1 (by Euclidean distance) of the current position, then use the following version of move_step():
move_step <- function(cur_pos, grid, prob = 0.04, size = 1){
opts <- grid %>%
rowwise() %>%
mutate(across(.fns = ~(.x-.env$cur_pos[[cur_column()]])^2,
.names = '{.col}_square_diff')) %>%
filter(sqrt(sum(c_across(ends_with("_square_diff"))))<=.env$size) %>%
select(-ends_with("_square_diff")) %>%
left_join(y = mutate(cur_pos, current = TRUE), by = names(grid))
new_pos <- opts %>%
mutate(weight = case_when(current ~ 1-(prob*(n()-1)), #calculate chance to move,
TRUE ~ prob), #in higher dimensions, we may have more places to move
weight = if_else(weight<0, 0, weight)) %>% #thus depending on prob, we may always move.
sample_n(size = 1, weight = weight) %>%
select(-weight, -current)
new_pos
}
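The difference between the two neighbourhood rules can be checked in isolation. The base-R sketch below is only an illustration and does not call move_step(); the names box_nb and disc_nb are made up. It counts the candidate positions around c(0, 0) on a 3 x 3 grid under the per-dimension rule and under the Euclidean-distance rule.

```r
grid <- expand.grid(x = -1:1, y = -1:1)
cur <- c(x = 0, y = 0)
size <- 1

dx <- grid$x - cur[["x"]]
dy <- grid$y - cur[["y"]]

# Per-dimension rule: every coordinate within `size` (diagonal moves allowed)
box_nb <- grid[pmax(abs(dx), abs(dy)) <= size, ]

# Euclidean rule: straight-line distance within `size` (no diagonal moves)
disc_nb <- grid[sqrt(dx^2 + dy^2) <= size, ]

nrow(box_nb)   # 9 candidates, including staying put
nrow(disc_nb)  # 5 candidates, including staying put
```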
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(gganimate)
move_step <- function(cur_pos, grid, prob = 0.04, size = 1){
opts <- grid %>%
filter(across(.fns = ~ between(.x, .env$cur_pos[[cur_column()]]-.env$size, .env$cur_pos[[cur_column()]]+.env$size))) %>%
left_join(y = mutate(cur_pos, current = TRUE), by = names(grid))
new_pos <- opts %>%
mutate(weight = case_when(current ~ 1-(prob*(n()-1)), #calculate chance to move,
TRUE ~ prob), #in higher dimensions, we may have more places to move
weight = if_else(weight<0, 0, weight)) %>% #thus depending on prob, we may always move.
sample_n(size = 1, weight = weight) %>%
select(-weight, -current)
new_pos
}
sim_walk <- function(cur_pos, grid, grid_prob = 0.04, steps = 50, size = 1){
iterations <- cur_pos
for(i in seq_len(steps)){
cur_pos <- move_step(cur_pos, grid, prob = grid_prob, size = size)
iterations <- bind_rows(iterations, cur_pos)
}
iterations$i <- 1:nrow(iterations)
iterations
}
origin <- data.frame(x = 0, y =0)
small_grid <- expand.grid(x = -1:1, y = -1:1)
small_walk <- sim_walk(cur_pos = origin,
grid = small_grid)
ggplot(small_walk, aes(x, y)) +
geom_path() +
geom_point(color = "red") +
transition_reveal(i) +
labs(title = "Step {frame_along}") +
coord_fixed()
large_grid <- expand.grid(x = -10:10, y = -10:10)
large_walk <- sim_walk(cur_pos = origin,
grid = large_grid,
steps = 100)
ggplot(large_walk, aes(x,y)) +
geom_path() +
geom_point(color = "red") +
transition_reveal(i) +
labs(title = "Step {frame_along}") +
xlim(c(-10,10)) + ylim(c(-10,10))+
coord_fixed()
large_walk %>%
count(x, y) %>%
right_join(y = expand.grid(x = -10:10, y = -10:10), by = c("x","y")) %>%
mutate(n = if_else(is.na(n), 0L, n)) %>%
ggplot(aes(x,y)) +
geom_tile(aes(fill = n)) +
coord_fixed()
multi_dim_walk <- sim_walk(cur_pos = data.frame(x = 0, y = 0, z = 0),
grid = expand.grid(x = -20:20, y = -20:20, z = -20:20),
steps = 100, size = 2)
library(cowplot)
plot_grid(
ggplot(multi_dim_walk, aes(x, y)) + geom_path(),
ggplot(multi_dim_walk, aes(x, z)) + geom_path(),
ggplot(multi_dim_walk, aes(y, z)) + geom_path())
Created on 2021-05-06 by the reprex package (v1.0.0)
Here is a base R option using Reduce + replicate + plot for 2D random walk process
set.seed(0)
plot(
setNames(
data.frame(replicate(
2,
Reduce(`+`, rnorm(99), init = 0, accumulate = TRUE)
)),
c("X", "Y")
),
type = "o"
)
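The Reduce(`+`, ..., accumulate = TRUE) call is a running sum, so the same walk can also be written with cumsum(). This is a sketch of that equivalent form, not a replacement for the answer above; the names walk_reduce and walk_cumsum are made up for illustration.

```r
set.seed(0)
walk_reduce <- replicate(
  2,
  Reduce(`+`, rnorm(99), init = 0, accumulate = TRUE)
)

set.seed(0)
# Prepend the starting point 0, then accumulate the 99 Gaussian steps
walk_cumsum <- replicate(2, c(0, cumsum(rnorm(99))))

all.equal(walk_reduce, walk_cumsum)

plot(
  setNames(data.frame(walk_cumsum), c("X", "Y")),
  type = "o"
)
```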

Visualising diagonal in asymmetric matrix plot

I have a number of symmetric matrices of the same dimensionality, and I wish to visualise the mean and variance of the values in each cell across these matrices in an elegant way (which I will make more precise below) that makes use of the symmetric character.
Let me start by making some data to illustrate. The following creates 10 9x9 matrices, aggregates the mean and variance, and transforms to long format in preparation for plotting:
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
make_matrix <- function(n) {
m <- matrix(NA, nrow = n, ncol = n)
m[lower.tri(m)] <- runif((n^2 - n) / 2)
m <- pmax(m, t(m), na.rm = TRUE)
diag(m) <- runif(n)
rownames(m) <- colnames(m) <- letters[1:n]
m
}
matrices <- replicate(10, make_matrix(9))
means <- apply(matrices, 1:2, mean) %>%
as_tibble(rownames = "row") %>%
pivot_longer(-1, names_to = "col", values_to = "mean")
vars <- apply(matrices, 1:2, var) %>%
as_tibble(rownames = "row") %>%
pivot_longer(-1, names_to = "col", values_to = "var")
df <- full_join(means, vars, by = c("row", "col"))
head(df)
#> # A tibble: 6 x 4
#> row col mean var
#> <chr> <chr> <dbl> <dbl>
#> 1 a a 0.548 0.111
#> 2 a b 0.507 0.0914
#> 3 a c 0.374 0.105
#> 4 a d 0.350 0.0976
#> 5 a e 0.525 0.0752
#> 6 a f 0.452 0.0887
Now, I could simply use geom_tile to make one plot of the means, and one plot of the variances. However, considering that both of these are symmetric, this wastes quite a lot of space, and also fails to communicate the symmetric character to the audience.
To address this problem, I have been playing around with the ggasym package to create an asymmetric matrix plot. The following is a slight modification from the ggasym vignette:
library(ggasym)
library(ggplot2)
ggplot(df, aes(x = col, y = row)) +
geom_asymmat(aes(fill_diag = mean, fill_tl = mean, fill_br = var)) +
scale_fill_diag_gradient(limits = c(0, 1), low = "lightpink", high = "tomato") +
scale_fill_tl_gradient(limits = c(0, 1), low = "lightpink", high = "tomato") +
scale_fill_br_gradient(low = "lightblue1", high = "dodgerblue") +
geom_text(data = filter(df, row == col), aes(label = signif(var, 2)))
Created on 2020-06-27 by the reprex package (v0.3.0)
What bothers me about this is the diagonal. In the above, I have mapped the fill of the diagonal to the means, and overlaid the variance by text, which works, but doesn't seem great. Specifically, I would like to map all the information here to fill, so as to get rid of the text. I see a couple of options for how to do this, but I am not sure how to implement any of them:
1. Split the fill of the diagonal cells, so that (in the example above) the lower right of each cell on the diagonal is an appropriate shade of blue, while the upper left is some shade of red.
2. Plot the upper and lower matrices separately (each with the diagonal), and then somehow "overlay" these plots so that they end up next to each other in an appropriate way. In other words, this would plot the diagonal twice.
I am open to other suggestions for how to accomplish this in a clean way. Let me emphasise that I do not require a solution building on ggasym, this was simply the closest I have been able to get so far. However, I would like some kind of ggplot-based solution.
So here is my take on the 'split-the-fill' strategy. You can plot most of the things you would want in ggplot if you don't mind parameterising your stuff as polygons. We let the ggnewscale package handle the double fill mapping for us.
First off, we no longer autoname the matrices, as we will not use the dimnames.
suppressPackageStartupMessages({
library(ggplot2)
library(tidyr)
library(dplyr)
library(ggnewscale)
})
make_matrix <- function(n) {
m <- matrix(NA, nrow = n, ncol = n)
m[lower.tri(m)] <- runif((n^2 - n) / 2)
m <- pmax(m, t(m), na.rm = TRUE)
diag(m) <- runif(n)
# rownames(m) <- colnames(m) <- letters[1:n]
m
}
Below is a function that takes a matrix, parameterises it as a polygon and cuts off one half.
halfmat <- function(mat, side) {
side <- match.arg(side, c("upper", "lower", "both"))
# Convert to long format
dat <- data.frame(
x = as.vector(row(mat)),
y = as.vector(col(mat)),
id = seq_along(mat),
value = as.vector(mat)
)
# Parameterise as polygon
poly <- with(dat, data.frame(
x = c(x - 0.5, x + 0.5, x + 0.5, x - 0.5),
y = c(y - 0.5, y - 0.5, y + 0.5, y + 0.5),
id = rep(id, 4),
value = rep(value, 4)
))
# Slice off one of the triangles
if (side == "upper") {
poly <- filter(poly, y >= x)
} else if (side == "lower") {
poly <- filter(poly, x >= y)
}
poly
}
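To see what the slicing step does, here is a small base-R sketch on a 2 x 2 matrix that repeats the corner construction and counts how many corners of each cell survive the y >= x filter. It is an illustration only; the names cells, corners and upper are made up.

```r
mat <- matrix(1:4, nrow = 2)

cells <- data.frame(
  x  = as.vector(row(mat)),
  y  = as.vector(col(mat)),
  id = seq_along(mat)
)

# Four corners per cell, in the same order as in halfmat()
corners <- with(cells, data.frame(
  x  = c(x - 0.5, x + 0.5, x + 0.5, x - 0.5),
  y  = c(y - 0.5, y - 0.5, y + 0.5, y + 0.5),
  id = rep(id, 4)
))

# Keep the "upper" side, as halfmat(mat, "upper") does
upper <- corners[corners$y >= corners$x, ]
corner_counts <- table(upper$id)
corner_counts
```

Diagonal cells (ids 1 and 4) keep three corners and become triangles, the cell on the kept side (id 3) keeps all four, and the cell on the other side (id 2) is reduced to a single corner, which does not form a visible polygon.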
Then we generate the data, compute the means and variances and reparameterise them.
matrices <- replicate(10, make_matrix(9))
means <- apply(matrices, 1:2, mean) %>% halfmat("upper")
vars <- apply(matrices, 1:2, var) %>% halfmat("lower")
Then we put in the means and variances as two separate polygon layers, since we need to separate the fill mappings with new_scale_fill(). There is a bit of extra fiddling with the scales, as these are now continuous instead of discrete, but it is not that bad.
ggplot(means, aes(x, y, fill = value, group = id)) +
geom_polygon() +
scale_fill_distiller(palette = "Reds", name = "Mean") +
# Be sure to call new_scale_fill() only after you've set up a fill scale
# for the upper part
new_scale_fill() +
geom_polygon(data = vars, aes(fill = value)) +
scale_fill_distiller(palette = "Blues", name = "Variance") +
scale_x_continuous(breaks = function(x){seq(x[1] + 0.5, x[2] - 0.5, by = 1)},
labels = function(x){letters[x]},
expand = c(0,0), name = "col") +
scale_y_continuous(breaks = function(x){seq(x[1] + 0.5, x[2] - 0.5, by = 1)},
labels = function(x){letters[x]},
expand = c(0,0), name = "row")
Created on 2020-06-27 by the reprex package (v0.3.0)

Adding group-specific text/data to faceted plot in R/ggplot2

I am comparing the intra-group correlation between duplicate samples within a large gene expression experiment, where I have multiple separate biological groups - the idea being to see if any of the groups is much less well-correlated than the others, indicating a potential sample mixup or other error.
I am using ggplot to plot the expression values of each duplicate pair against each other. I would like to also be able to add the correlation coefficient and p-value to each panel of the plot, which I obtain through summarize and cor.test. You can use this code to get the general idea: in exp1, the duplicates are correlated, but not in exp2.
library(tidyverse)
df <- data.frame(exp=c(rep('exp1', 100), rep('exp2', 100)), a=rnorm(200, 1000, 200))
df <- mutate(df, b=ifelse(exp=='exp1', a*rnorm(100,1,0.05), rnorm(100, 1000, 200)))
head(df)
tail(df)
df %>% ggplot(aes(x=a, y=b))+
geom_point() +
facet_wrap(~exp)
group_by(df, exp) %>%
summarize(corr=cor.test(a,b)$estimate, pval=cor.test(a,b)$p.value)
This is the plot I generated via ggplot, and I've manually added the R and p-values that I got at the end. But of course, if I have a lot of sample pairs to analyze, it would be nice to be able to add these automatically from within the ggplot call. I'm just not sure how to do it.
If, for whatever reason, you want to build this yourself instead of using the ggpubr functions, you can create your summary data, format labels, and place the labels with geom_text.
I'm formatting the stats so that R has a fixed 3 significant digits and p has 3 digits, falling back on scientific notation. I changed the names of those columns in summarise to R and p to make the labels below. Reshaping to long data and creating a new column with unite gets this:
library(tidyverse)
...
group_by(df, exp) %>%
summarize(R = cor.test(a, b)$estimate, p = cor.test(a, b)$p.value) %>%
mutate(R = formatC(R, format = "fg", digits = 3),
p = formatC(p, format = "g", digits = 3)) %>%
gather(key = measure, value = value, -exp) %>%
unite("stat", measure, value, sep = " = ")
#> # A tibble: 4 x 2
#> exp stat
#> <chr> <chr>
#> 1 exp1 R = 0.965
#> 2 exp2 R = 0.0438
#> 3 exp1 p = 1.14e-58
#> 4 exp2 p = 0.665
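The two formatC() calls can be tried on their own to see what each format does. The numbers below are illustrative values chosen to resemble the output above, not results computed from the data, and the names r_values, p_values, r_labels and p_labels are made up.

```r
r_values <- c(0.96512, 0.04381)
p_values <- c(1.14e-58, 0.665)

# "fg": fixed notation, with digits counted as significant digits
r_labels <- formatC(r_values, format = "fg", digits = 3)

# "g": up to 3 significant digits, switching to scientific notation
# for very small or very large values
p_labels <- formatC(p_values, format = "g", digits = 3)

# One two-line label per group, built directly with paste0()
labels <- paste0("R = ", r_labels, "\np = ", p_labels)
cat(labels, sep = "\n\n")
```

Building the label with paste0() like this is a shortcut for the case of exactly two statistics; the reshape-and-collapse approach scales better when more statistics are added.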
Then for each of the groups, I want to collapse both labels, separated by a newline \n. This is a place that will scale well—you might have more summary stats to display, but this should still work.
summ <- group_by(df, exp) %>%
summarize(R = cor.test(a, b)$estimate, p = cor.test(a, b)$p.value) %>%
mutate(R = formatC(R, format = "fg", digits = 3),
p = formatC(p, format = "g", digits = 3)) %>%
gather(key = measure, value = value, -exp) %>%
unite("stat", measure, value, sep = " = ") %>%
group_by(exp) %>%
summarise(both_stats = paste(stat, collapse = "\n"))
summ
#> # A tibble: 2 x 2
#> exp both_stats
#> <chr> <chr>
#> 1 exp1 "R = 0.965\np = 1.14e-58"
#> 2 exp2 "R = 0.0438\np = 0.665"
In geom_text, I'm setting the x coordinate to -Inf, which places the label at the left edge of the panel, and the y coordinate to Inf, which places it at the top edge. That puts the label in the top-left corner, regardless of the values in the data.
The one thing I don't like here is then hacking the hjust and vjust outside their intended ranges of 0 to 1. But nudge_x/nudge_y won't do anything because of the values being set to infinity.
df %>%
ggplot(aes(x = a, y = b)) +
geom_point() +
geom_text(aes(x = -Inf, y = Inf, label = both_stats), data = summ,
hjust = -0.1, vjust = 1.1, lineheight = 1) +
facet_wrap(~ exp)
Created on 2018-11-14 by the reprex package (v0.2.1)
We can use the stat_cor function from the ggpubr package.
set.seed(123)
library(dplyr)
library(ggplot2)
library(ggpubr)
df <- data.frame(exp=c(rep('exp1', 100), rep('exp2', 100)), a=rnorm(200, 1000, 200))
df <- mutate(df, b=ifelse(exp=='exp1', a*rnorm(100,1,0.05), rnorm(100, 1000, 200)))
ggplot(df, aes(x=a, y=b))+
geom_point() +
facet_wrap(~exp) +
stat_cor(method = "pearson")
Similar to the answer of camille, but you can do all in one run
library(tidyverse)
set.seed(123)
df %>%
group_by(exp) %>%
mutate(p = cor.test(a, b)$p.value,
rho = cor.test(a, b)$estimate) %>%
mutate_at(vars(p, rho), signif, 2) %>%
ggplot(aes(x=a, y=b)) +
geom_point() +
geom_text(data = . %>% distinct(p, rho, exp),
aes(x = -Inf, y = Inf,label = paste("p=",p,"\nrho=",rho)),
hjust = -0.1, vjust = 1.1, lineheight = 1) +
facet_wrap(~exp)

Point colors and facets in ggplot2

Hadley Wickham's "ggplot2: Elegant Graphics for Data Analysis" book has this example graph in chapter 7, where mean points for all three colors of z appear on each facet:
Here's the code provided by the book to make the graph:
df <- data.frame(
x = rnorm(120, c(0, 2, 4)),
y = rnorm(120, c(1, 2, 1)),
z = letters[1:3]
)
df_sum <- df %>%
group_by(z) %>%
summarise(x = mean(x), y = mean(y)) %>%
rename(z2 = z)
ggplot(df, aes(x, y)) +
geom_point() +
geom_point(data = df_sum, aes(colour = z2), size = 4) +
facet_wrap(~z)
Ideally, keeping the color name as z as opposed to renaming it to z2 should produce the same result:
df_sum <- df %>%
group_by(z) %>%
summarise(x = mean(x), y = mean(y))
ggplot(df, aes(x, y)) +
geom_point() +
geom_point(data = df_sum, aes(colour = z), size = 4) +
facet_wrap(~z)
But doing so makes it so that only one colored point appears on each facet:
Is there a reason for this?
No, it should not produce the same result. Because you facet by the variable z, ggplot2 splits every layer whose data contains a column named z across the panels, so each panel only shows the rows that belong to a, b or c (your "incorrect" graph). By renaming z to z2, the black points are still faceted by z, but df_sum no longer has a column z for facet_wrap to use, so each coloured point is drawn in every panel.
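This can be checked by counting the rows ggplot2 actually draws for the summary layer in each version. The sketch below uses layer_data() from ggplot2 and base aggregate() in place of the dplyr pipeline; the names df_sum2, p_same and p_renamed are made up for illustration.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(
  x = rnorm(120, c(0, 2, 4)),
  y = rnorm(120, c(1, 2, 1)),
  z = letters[1:3]
)

# Group means; columns come back as z, x, y
df_sum <- aggregate(cbind(x, y) ~ z, data = df, FUN = mean)
# Same summary with the grouping column renamed, as in the book
df_sum2 <- setNames(df_sum, c("z2", "x", "y"))

p_same <- ggplot(df, aes(x, y)) +
  geom_point() +
  geom_point(data = df_sum, aes(colour = z), size = 4) +
  facet_wrap(~z)

p_renamed <- ggplot(df, aes(x, y)) +
  geom_point() +
  geom_point(data = df_sum2, aes(colour = z2), size = 4) +
  facet_wrap(~z)

n_same <- nrow(layer_data(p_same, 2))        # one point per panel
n_renamed <- nrow(layer_data(p_renamed, 2))  # all three points in every panel
c(n_same, n_renamed)
```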
