Binning geom_boxplot in ggplot2 in R?

I want to use geom_boxplot to make boxplots that correlate two variables: for each bin of x values, plot the distribution (as boxplot) of y values for that bin. I tried:
ggplot(cars) + geom_boxplot(aes(x=dist, y=speed))
but this creates basically one large bin of x values. How can I make it so for each bin of dist, there's a boxplot representing the corresponding speed values?

Not sure what you mean by "bin", since you haven't provided any bins in your question. If you just mean that you would like a speed boxplot for each unique dist value, you can do it like this (treating dist as discrete):
ggplot(cars) + geom_boxplot(aes(factor(dist), speed))
If you were to actually create bins you could do something like:
cars$bin <- cut(cars$dist, c(1, 10, 30, 50, 200))
ggplot(cars) + geom_boxplot(aes(bin, speed))
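If equal-width bins are enough, ggplot2's own cut_width() helper can stand in for cut(). A minimal sketch, assuming a bin width of 20 with a bin edge at 0 (both values are illustrative choices, not taken from the question):

```r
library(ggplot2)

# Assumed for illustration: bins of width 20 with an edge at 0
cars$bin <- cut_width(cars$dist, width = 20, boundary = 0)

# One boxplot of speed per dist bin
p <- ggplot(cars) + geom_boxplot(aes(bin, speed))
p
```

cut_interval() and cut_number() are the sibling helpers for a fixed number of bins, by equal range or by roughly equal counts respectively.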

Just to put it out there, you could also do
library(dplyr)
library(ggplot2)
bin_size <- 10
cars %>%
  # lower edge of the bin each dist value falls into
  mutate(bin_dist = factor(dist %/% bin_size * bin_size)) %>%
  ggplot(aes(x = bin_dist, y = speed)) +
  geom_boxplot()
And to make the labeling better:
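(cars2 <- cars %>%
  mutate(bin_dist = dist %/% bin_size * bin_size)) %>%
  ggplot(aes(x = factor(bin_dist), y = speed)) +
  geom_boxplot() +
  # sort() keeps the labels in the same order as the factor levels
  scale_x_discrete(labels = paste0(sort(unique(cars2$bin_dist)), "-", sort(unique(cars2$bin_dist)) + bin_size)) +
  labs(x = "dist")
cars2 is assigned inside the parentheses so that it is available to paste0. The sort() matters because unique() returns values in order of appearance, while the factor levels on the axis are sorted.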
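The labels can also be computed from the break values themselves by passing a function to scale_x_discrete(), so they cannot drift out of step with the factor levels. A sketch under the same bin_size = 10 assumption; bin_label and cars_binned are hypothetical names used only for this example:

```r
library(ggplot2)

bin_size <- 10

# Hypothetical helper: turn a lower bin edge such as "20" into "20-30"
bin_label <- function(breaks, size = bin_size) {
  lower <- as.numeric(breaks)
  paste0(lower, "-", lower + size)
}

# Lower bin edge for every row, stored as a factor
cars_binned <- transform(cars, bin_dist = factor(dist %/% bin_size * bin_size))

p <- ggplot(cars_binned, aes(x = bin_dist, y = speed)) +
  geom_boxplot() +
  scale_x_discrete(labels = bin_label) +  # called with the break values
  labs(x = "dist")
p
```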

Related

How can I generate data from a plotted line in R?

I have no dataset just two plotted lines and I want to generate scattered y-axis data 2 standard deviations away from the mean (the plotted line). Here is my code for the line:
ggplot() +
lims(x = c(0,20), y = c(0,1)) +
annotate("segment",x = .1,xend = 5, yend = .25, y = .1) +
annotate("segment",x = 5,xend = 20, yend = .35,y = .25)
Sorry if this post is unclear but I am not sure the best way to explain it. Let me know if you have any questions or if what I am trying to do isn't possible.
Here's an example for one of the lines you have. I took y = 0.09*x + 0 from your comment and did not double-check that it matches the segments you plotted.
library(ggplot2)
library(dplyr)
df <- tibble(x=1:20,
y1=0.09*x,
y2=0.0067*x)
# generate dots for y1
# mean y1 and sd = 1
sapply(df$y1, function(tt) rnorm(10, tt)) %>%
# make it into tibble
as_tibble() %>%
# pivot into longer format
tidyr::pivot_longer(everything()) %>%
# names of the columns get assigned to V1 V2 ...
# we can clean that and get the actual x
# this works nicely because your x=1:20, will fail otherwise
mutate(X=as.numeric(stringr::str_remove(name, "V"))) %>%
# plot the thing
ggplot(aes(X, value)) +
geom_point() +
# add the "mean" values from before
geom_point(data=df, aes(x, y1), col="red", size=2)
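Because rnorm() is vectorised over its mean argument, the same idea can be written without parsing column names, which also covers x values that are not 1:20. A minimal sketch: the grid of x values, n_per_x and noise_sd are all assumed for illustration. Roughly 95% of normal draws fall within 2 standard deviations of the mean, so noise_sd controls how wide the scatter around the line is.

```r
library(ggplot2)

set.seed(1)
x_grid   <- seq(0.5, 20, by = 0.5)  # assumed x positions, not necessarily integers
n_per_x  <- 10                      # assumed number of points per x value
noise_sd <- 0.1                     # assumed spread around the line

line_df <- data.frame(x = x_grid, y1 = 0.09 * x_grid)

# Repeat each x n_per_x times and draw one value around the line for each copy
sim <- data.frame(x = rep(line_df$x, each = n_per_x))
sim$value <- rnorm(nrow(sim), mean = 0.09 * sim$x, sd = noise_sd)

p <- ggplot(sim, aes(x, value)) +
  geom_point(alpha = 0.4) +
  geom_line(data = line_df, aes(x, y1), colour = "red")
p
```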

Additional x axis on ggplot

I'm aware there are similar posts but I could not get those answers to work in my case.
e.g. Here and here.
Example:
diamonds %>%
ggplot(aes(scale(price) %>% as.vector)) +
geom_density() +
xlim(-3, 3) +
facet_wrap(vars(cut))
Returns a plot:
Since I used scale(), the axis values are z-scores: the number of standard deviations each break lies from the mean.
I would like to add, as a row underneath, the unscaled raw value that corresponds to each break.
Tried:
diamonds %>%
ggplot(aes(scale(price) %>% as.vector)) +
geom_density() +
xlim(-3, 3) +
facet_wrap(vars(cut)) +
geom_text(aes(label = price))
Gives:
Error: geom_text requires the following missing aesthetics: y
My primary question is how can I add the raw values underneath -3:3 of each break? I don't want to change those breaks, I still want 6 breaks between -3:3.
Secondary question, how can I get -3 and 3 to actually show up in the chart? They have been trimmed.
[edit]
I've been trying to make it work with geom_text but keep hitting errors:
diamonds %>%
ggplot(aes(x = scale(price) %>% as.vector)) +
geom_density() +
xlim(-3, 3) +
facet_wrap(vars(cut)) +
geom_text(label = price)
Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomText, :
object 'price' not found
I then tried changing my call to geom_text()
geom_text(data = diamonds, aes(price), label = price)
This results in the same error message.
You can make a custom labeling function for your axis. It receives each break on the axis and returns the label to draw. In your case you could paste together the z-score, a line break, and the z-score times the standard deviation plus the mean. Because of the distribution of prices in the diamonds data set, z-scores below about -1 correspond to negative prices. This may not be a problem in your own data. For clarity I have drawn a dashed vertical line at the z-score that corresponds to $0.
labeller <- function(x) {
  paste0(x, "\n", scales::dollar(sd(diamonds$price) * x + mean(diamonds$price)))
}
diamonds %>%
  ggplot(aes(scale(price) %>% as.vector)) +
  geom_density() +
  # z-score of $0, i.e. -mean / sd (about -0.986)
  geom_vline(xintercept = -mean(diamonds$price) / sd(diamonds$price), linetype = 2) +
  facet_wrap(vars(cut)) +
  scale_x_continuous(labels = labeller, limits = c(-3, 3)) +
  xlab("price")
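The arithmetic behind these labels can be checked on its own: a z-score z maps back to z * sd + mean, and the z-score of $0 is -mean / sd, which is where the dashed line sits. A small sketch (the variable names are illustrative):

```r
library(ggplot2)  # provides the diamonds data set

mean_price <- mean(diamonds$price)
sd_price   <- sd(diamonds$price)

z      <- -3:3
raw    <- z * sd_price + mean_price  # raw price at each z-score break
zero_z <- -mean_price / sd_price     # z-score at which the price is $0

data.frame(z = z, price = round(raw))
zero_z
```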
We can use the sec_axis functionality in scale_x_continuous. To use this functionality we need to manually scale your data. This will add a secondary axis at the top of the plot, not underneath. So it's not quite exactly what you're looking for.
library(tidyverse)
# manually scale the data
mean_price <- mean(diamonds$price)
sd_price <- sd(diamonds$price)
diamonds$price_scaled <- (diamonds$price - mean_price) / sd_price
# make the plot
ggplot(diamonds, aes(price_scaled))+
geom_density()+
facet_wrap(~cut)+
scale_x_continuous(sec.axis = sec_axis(~ mean_price + (sd_price * .)),
limits = c(-3, 4), breaks = -3:3)
You could cheat a bit by passing some dummy data to geom_text:
geom_text(data = tibble(label = round(((-3:3) * sd_price) + mean_price),
y = -0.25,
x = -3:3),
aes(x, y, label = label))
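Put together with the manually scaled data, the geom_text() fragment might be assembled as below. This is a sketch: raw_labels and d are hypothetical names, and y = -0.25 and size = 3 are arbitrary placements that may need adjusting for other data.

```r
library(ggplot2)

mean_price <- mean(diamonds$price)
sd_price   <- sd(diamonds$price)

# Work on a copy so the packaged data set is left untouched
d <- diamonds
d$price_scaled <- (d$price - mean_price) / sd_price

# One row per axis break: where to draw the text and what it should say
raw_labels <- data.frame(
  x     = -3:3,
  y     = -0.25,
  label = round((-3:3) * sd_price + mean_price)
)

p <- ggplot(d, aes(price_scaled)) +
  geom_density() +
  # raw_labels has no cut column, so the text is repeated in every facet
  geom_text(data = raw_labels, aes(x, y, label = label), size = 3) +
  facet_wrap(vars(cut)) +
  scale_x_continuous(limits = c(-3, 4), breaks = -3:3)
p
```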

violin_plot() with continuous axis for grouping variable?

The grouping variable for creating a geom_violin() plot in ggplot2 is expected to be discrete for obvious reasons. However my discrete values are numbers, and I would like to show them on a continuous scale so that I can overlay a continuous function of those numbers on top of the violins. Toy example:
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df) + geom_violin(aes(x=factor(x), y=y))
This works as you'd imagine: violins with their x axis values (equally spaced) labelled 1, 2, and 5, with their means at y=1,2,5 respectively. I want to overlay a continuous function such as y=x, passing through the means. Is that possible? Adding + scale_x_continuous() predictably gives Error: Discrete value supplied to continuous scale. A solution would presumably spread the violins horizontally by the numeric x values, i.e. three times the spacing between 2 and 5 as between 1 and 2, but that is not the only thing I'm trying to achieve - overlaying a continuous function is the key issue.
If this isn't possible, alternative visualisation suggestions are welcome. I know I could replace violins with a simple scatter plot to give a rough sense of density as a function of y for a given x.
The functionality to plot violin plots on a continuous scale is directly built into ggplot.
The key is to keep the original continuous variable (instead of transforming it into a factor variable) and specify how to group it within the aesthetic mapping of the geom_violin() object. The width of the groups can be modified with the width argument of cut_width(), depending on the data at hand.
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'lm')
By using this approach, all geoms for continuous data and their varying functionalities can be combined with the violin plots, e.g. we could easily replace the line with a loess curve and add a scatter plot of the points.
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'loess') +
geom_point()
More examples can be found in the ggplot helpfile for violin plots.
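When the grouping values are already a small set of numbers, mapping group directly to x and drawing the curve with stat_function() is another way to overlay an arbitrary function such as y = x. A sketch on simulated data shaped like the question's; the seed is an assumption:

```r
library(ggplot2)

set.seed(42)
x  <- sample(c(1, 2, 5), size = 1000, replace = TRUE)
df <- data.frame(x = x, y = rnorm(1000, mean = x))

p <- ggplot(df, aes(x = x, y = y)) +
  geom_violin(aes(group = x)) +                       # one violin per distinct x
  stat_function(fun = function(x) x, colour = "red")  # the continuous function y = x
p
```

The violins sit at their numeric positions on a continuous axis, so the gap between 2 and 5 is three times the gap between 1 and 2.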
Try this. As you already guessed, spreading the violins by numeric values is the key to the solution. To this end I expand the df to include all x values in the interval min(x) to max(x) and use scale_x_discrete(drop = FALSE) so that all values are displayed.
Note: Thanks @ChrisW for the more general example of my approach.
library(tidyverse)
set.seed(42)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T), y = rnorm(1000, mean = x^2))
# y = x^2
# add missing x values
x.range <- seq(from=min(df$x), to=max(df$x))
df <- df %>% right_join(tibble(x = x.range))
#> Joining, by = "x"
# Whatever the desired continuous function is:
df.fit <- tibble(x = x.range, y=x^2) %>%
mutate(x = factor(x))
ggplot() +
geom_violin(data=df, aes(x = factor(x, levels = 1:5), y=y)) +
geom_line(data=df.fit, aes(x, y, group=1), color = "red") +
scale_x_discrete(drop = FALSE)
#> Warning: Removed 2 rows containing non-finite values (stat_ydensity).
Created on 2020-06-11 by the reprex package (v0.3.0)

ggplot faceted cumulative histogram

I have the following data
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(100, 6, 1))
gender = rep(c("Male", "Female"), each=100)
mydata = data.frame(x=x, gender=gender)
and I want to plot two cumulative histograms (one for males and the other for females) with ggplot.
I have tried the code below
ggplot(data=mydata, aes(x=x, fill=gender)) + stat_bin(aes(y=cumsum(..count..)), geom="bar", breaks=1:10, colour=I("white")) + facet_grid(gender~.)
but I get this chart
that, obviously, is not correct.
How can I get the correct one, like this:
Thanks!
I would pre-compute the cumsum values per bin per group, and then use geom_histogram to plot.
library(tidyverse) # dplyr, tidyr (for complete()) and ggplot2
mydata %>%
mutate(x = cut(x, breaks = 1:10, labels = F)) %>% # Bin x
count(gender, x) %>% # Counts per bin per gender
mutate(x = factor(x, levels = 1:10)) %>% # x as factor
complete(x, gender, fill = list(n = 0)) %>% # Fill missing bins with 0
group_by(gender) %>% # Group by gender ...
mutate(y = cumsum(n)) %>% # ... and calculate cumsum
ggplot(aes(x, y, fill = gender)) + # The rest is (gg)plotting
geom_histogram(stat = "identity", colour = "white") +
facet_grid(gender ~ .)
Like @Edo, I also came here looking for exactly this. @Edo's solution was the key for me. It's great. But I post here a few additions that increase the information density and allow comparisons across different situations.
library(ggplot2)
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(50, 6, 1))
gender = c(rep("Male", 100), rep("Female", 50))
grade = rep(1:3, 50)
mydata = data.frame(x=x, gender=gender, grade = grade)
ggplot(mydata, aes(x,
y = ave(after_stat(density), group, FUN = cumsum)*after_stat(width),
group = interaction(gender, grade),
color = gender)) +
geom_line(stat = "bin") +
scale_y_continuous(labels = scales::percent_format()) +
facet_wrap(~grade)
I rescale the y so that the cumulative plot always ends at 100%. Otherwise, if the groups are not the same size (like they are in the original example data) then the cumulative plots have different final heights. This obscures their relative distribution.
Secondly, I use geom_line(stat="bin") instead of geom_histogram() so that I can put more than one line on a panel. This way I can compare them easily.
Finally, because I also want to compare across facets, I need to make sure the ggplot group variable uses more than just color=gender. We set it manually with group = interaction(gender, grade).
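If bins are not needed at all, ggplot2's stat_ecdf() draws the empirical cumulative distribution directly, and it also ends at 100% for every group. This swaps in a different technique from the binned line above; a minimal sketch using the same simulated data:

```r
library(ggplot2)

set.seed(123)
mydata <- data.frame(
  x      = c(rnorm(100, 4, 1), rnorm(50, 6, 1)),
  gender = c(rep("Male", 100), rep("Female", 50)),
  grade  = rep(1:3, 50)
)

p <- ggplot(mydata, aes(x, colour = gender)) +
  stat_ecdf(geom = "step") +  # unbinned cumulative share per gender within each facet
  scale_y_continuous(labels = scales::percent_format()) +
  facet_wrap(~grade)
p
```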
Answering a million years later...
I was looking for a solution to the same problem and ended up here.
Eventually I figured it out myself, so I'll drop it here in case other people ever need it.
As required: no pre-work is necessary!
ggplot(mydata) +
geom_histogram(aes(x = x, y = ave(..count.., group, FUN = cumsum),
fill = gender, group = gender),
colour = "gray70", breaks = 1:10) +
facet_grid(rows = "gender")
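The ..count.. notation was deprecated in ggplot2 3.4.0 in favour of after_stat(). Assuming a recent ggplot2, the same plot could be written as follows (a sketch using the question's data):

```r
library(ggplot2)

set.seed(123)
mydata <- data.frame(
  x      = c(rnorm(100, 4, 1), rnorm(100, 6, 1)),
  gender = rep(c("Male", "Female"), each = 100)
)

p <- ggplot(mydata) +
  geom_histogram(
    aes(x = x,
        # running total of the bin counts, restarted for each group
        y = ave(after_stat(count), group, FUN = cumsum),
        fill = gender, group = gender),
    colour = "gray70", breaks = 1:10
  ) +
  facet_grid(rows = vars(gender))
p
```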

ggplot2, y limits on geom_bar with faceting

In the following, selecting free_y makes the maximum value of each scale adjust as expected. How can I get the minimum values to adjust as well? At the moment both facets start at 0, when I really want the upper facet to run from about 99 to 100 and the lower facet from about 900 to 1000.
library(ggplot2)
n = 100
df = rbind(data.frame(x = 1:n,y = runif(n,min=99,max=100),variable="First"),
data.frame(x = 1:n,y = runif(n,min=900,max=1000),variable="Second"))
ggplot(data=df,aes(x,y,fill=variable)) +
geom_bar(stat='identity') +
facet_grid(variable~.,scales='free')
You could use geom_linerange rather than geom_bar. A general way to do this is to first find the min of y for each value of variable and then merge the minimums with the original data. Code would look like:
library(ggplot2)
min_y <- aggregate(y ~ variable, data=df, min)
sp <- ggplot(data=merge(df, min_y, by="variable", suffixes = c("","min")),
aes(x, colour=variable)) +
geom_linerange(aes(ymin=ymin, ymax=y), size=1.3) +
facet_grid(variable ~ .,scales='free')
plot(sp)
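The per-facet minimum can also be attached without a merge by using ave(), which returns the group minimum aligned with each row. A minimal sketch on data shaped like the question's; the seed is an assumption, and linewidth assumes ggplot2 3.4.0 or later (older versions use size, as above):

```r
library(ggplot2)

set.seed(1)
n <- 100
df <- rbind(
  data.frame(x = 1:n, y = runif(n, min = 99,  max = 100),  variable = "First"),
  data.frame(x = 1:n, y = runif(n, min = 900, max = 1000), variable = "Second")
)

# Minimum of y within each facet, repeated on every row of that facet
df$ymin <- ave(df$y, df$variable, FUN = min)

p <- ggplot(df, aes(x, colour = variable)) +
  geom_linerange(aes(ymin = ymin, ymax = y), linewidth = 1.3) +
  facet_grid(variable ~ ., scales = "free")
p
```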
Plot looks like:
