ggplot2: geom_bar with facet-wise proportion and fill argument - r

I'm trying to plot proportions with geom_bar() combining fill and facet_grid.
library(tidyverse)
set.seed(123)
df <- data_frame(val_num = c(rep(1, 60), rep(2, 40), rep(1, 30), rep(2, 70)),
val_cat = ifelse(val_num == 1, "cat", "mouse"),
val_fill = sample(c("black", "white", "gray"), 200, replace = TRUE),
group = rep(c("A", "B"), each = 100))
ggplot(df) +
stat_count(mapping = aes(x = val_cat, y = ..count../tapply(..count.., ..x.. , sum)[..x..],
fill = val_fill),
position = position_dodge2(preserve = "single")) +
facet_grid(.~ group)
However, it seems that proportions are calculated for all cats (or all mices) in categories A and B together. In other words, sum of proportions in the first three columns is not 1.
It should be solved with adding group = group into the mapping. However:
ggplot(df) +
stat_count(mapping = aes(x = val_cat, y = ..count../tapply(..count.., ..x.. , sum)[..x..],
fill = val_fill, group = group),
position = position_dodge2(preserve = "single")) +
facet_grid(.~ group)
plot ignores fill argument (and moreover does not solve the issue). I tried to specify group with different choices including interaction() but without any real success.
I would like to solve problem within ggplot and I would like to avoid data manipulation before plotting.

So it wasn't as easy as I thought because I don't tend to use the stat_xxx() functions. Although you seem persistent in not manipulating the data before hand, here is an approach you can use.
grouped.df <- df %>%
group_by( group, val_fill ) %>%
count( val_cat ) %>%
ungroup() %>%
group_by( group, val_cat ) %>%
mutate( prop=n/sum(n) ) %>%
ungroup()
grouped.df %>%
ggplot() +
geom_col( aes(x=val_cat,y=prop,fill=val_fill), position="dodge" ) +
facet_wrap( ~ group )
to produce
But getting back to your "no data manipulation approach", I think your error is within your y variable. For example, consider the following code and output.
df2 %>%
ggplot() +
stat_count( aes(x=val_cat,y=..count..,color=val_fill,label=tapply(..count.., ..x.. , sum)[..x..]),
geom="text" ) +
facet_wrap( ~ group )
In the plot above, the y value is the numerator of your attempted proportion and the label value is the denominator of your attempted proportion. I think all you need to do is mess around some more with your tapply() function calls until you have the right combination of y and label.

Related

Reordering within 3-factor grouped plot

Using ggplot2, I'm attempting to reorder a data representation with 3 factors: condition, sex, and time.
library(ggplot2)
library(dplyr)
DF <- data.frame(value = rnorm(100, 20, sd = 0.1),
cond = c(rep("a",25),rep("b",25),rep("a",25),rep("b",25)),
sex = c(rep("M",50),rep("F",50)),
time = rep(c("1","2"),50)
)
ggplot(data=DF, aes( x = time,
y = value,
fill = cond,
colour = sex,
)
) +
geom_boxplot(size = 1, outlier.shape = NA) +
scale_fill_manual(values=c("#69b3a2", "#404080")) +
scale_color_manual(values=c("grey10", "grey40")) +
ggtitle("aF,aM,bF,bM") +
theme(legend.position = "top")
Badly ordered plot.
The way ggplot2 automatically orders condition first and interleaves sex poses the issue. It defaults to an interleaved "aF,aM,bF,bM" order regardless of which factor I assign to which aesthetic.
For analysis purposes, my preferred order is "aM,bM,aF,bF". Order sex first and interleave condition. I tried to fix it by converting the 2x2 factor assignments to one group with 4 levels, which gives me complete control over the order:
DF %>% mutate(grp = as.factor(paste0(cond,sex))) -> DF
level_order <- c("aM", "bM", "aF", "bF")
ggplot(data=DF, aes( x = time,
y = value,
fill = factor(grp, level=level_order),
colour = sex
)
) +
geom_boxplot(size = 1, outlier.shape = NA) +
scale_fill_manual(values=c("#69b3a2", "#404080","#69b3a2", "#404080")) +
scale_color_manual(values=c("grey10", "grey40", "grey40", "grey10")) +
ggtitle("aM,bM,aF,bF") +
theme(legend.position = "top")
Ordering OK, bad representation.
However artificial grouping like this has its downsides, subjects are not assigned to a group, they are male/female (can't be changed) and assigned to some condition. Also the plot legend is unnecessarily cluttered, it has 6 keys instead of 4. It doesn't convey that it's 2x2 repeated measures design all that well.
I'm not sure if what I'm trying to do makes sense (I hope this isn't some massive brain fart), any help would be appreciated.
The order in which you place the aesthetics controls the priority of its groupings. Thus if you switch the position of fill and colour you will get the result you are looking for (e.i. you want colour to be grouped first, and then fill)
ggplot(data=DF, aes( x = time,
y = value,
colour = sex,
fill = cond)) +
geom_boxplot(size = 1, outlier.shape = NA) +
scale_fill_manual(values=c("#69b3a2", "#404080")) +
scale_color_manual(values=c("grey10", "grey40")) +
theme(legend.position = "top")

Adding a single label per group in ggplot with stat_summary and text geoms

I would like to add counts to a ggplot that uses stat_summary().
I am having an issue with the requirement that the text vector be the same length as the data.
With the examples below, you can see that what is being plotted is the same label multiple times.
The workaround to set the location on the y axis has the effect that multiple labels are stacked up. The visual effect is a bit strange (particularly when you have thousands of observations) and not sufficiently professional for my purposes. You will have to trust me on this one - the attached picture doesn't fully convey the weirdness of it.
I was wondering if someone else has worked out another way. It is for a plot in shiny that has dynamic input, so text cannot be overlaid in a hardcoded fashion.
I'm pretty sure ggplot wasn't designed for the kind of behaviour with stat_summary that I am looking for, and I may have to abandon stat_summary and create a new summary dataframe, but thought I would first check if someone else has some wizardry to offer up.
This is the plot without setting the y location:
library(dplyr)
library(ggplot2)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
df_x <- df_x %>%
group_by(Group) %>%
mutate(w_count = n())
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
geom_text(aes(label = w_count)) +
coord_flip() +
theme_classic()
and this is with my hack
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
geom_text(aes(y = 1, label = w_count)) +
coord_flip() +
theme_classic()
Create a df_text that has the grouped info for your labels. Then use annotate:
library(dplyr)
library(ggplot2)
set.seed(123)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
df_text <- df_x %>%
group_by(Group) %>%
summarise(avg = mean(Value),
n = n()) %>%
ungroup()
yoff <- 0.0
xoff <- -0.1
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
annotate("text",
x = 1:2 + xoff,
y = df_text$avg + yoff,
label = df_text$n) +
coord_flip() +
theme_classic()
I found another way which is a little more robust for when the plot is dynamic in its ordering and filtering, and works well for faceting. More robust, because it uses stat_summary for the text.
library(dplyr)
library(ggplot2)
df_x <- data.frame("Group" = c(rep("A",1000), rep("B",2) ),
"Value" = rnorm(1002))
counts_df <- function(y) {
return( data.frame( y = 1, label = paste0('n=', length(y)) ) )
}
ggplot(df_x, aes(x = Group, y = Value)) +
stat_summary(fun.data="mean_cl_boot", size = 1.2) +
coord_flip() +
theme_classic()
p + stat_summary(geom="text", fun.data=counts_df)

How to add percentage values to strata in alluvial plot with ggalluvial?

I'm looking for the most convenient way for adding rounded percentage labels to strata of an alluvial plot.
There are 50 cases in the following example. Independently of stages 1 or 2, each case belongs to one group of A, B or C. I'd like to display the relative group affiliation during each stage.
library(ggplot2)
library(ggalluvial)
df <- data.frame('id' = rep(1:50,2),
'stage' = c(rep(1,50), rep(2,50)),
'group' = sample(c('A','B','C'), 100, replace = TRUE))
ggplot(df,
aes(x = stage, stratum = group, alluvium = id, fill = group)) +
scale_x_discrete(expand = c(.1, .1)) +
geom_flow() +
geom_stratum(alpha = .5)
Is there a way to add rounded percentage labels (including "%") to the strata (bar segments) without calculating a percentage column in the initial data frame? If I'm not completely mistaken, geom_text doesn't work the same way here as in geom_bar().
The standard ggplot2 solution to this question is to use "calculated aesthetics". These are aesthetic specifications that come not from the data passed to ggplot() but from the output of the statistical transformation (the stat_*()), which is used to render the graphical elements (the geom_*()). The columns of this output (which are rarely seen by the user) are called "computed variables". The documentation on this topic is limited and a bit out of date, using stat() instead of after_stat() to call them. Since ggalluvial did not support computed variables, the answer from #bencekd was correct at the time.
As of today, v0.12.0 is on CRAN with support and documentation for computed variables. In particular, three computed variables are available that correspond to variables with the same names used by stat_bin() or stat_count(): n, count (a weighted version of n), and prop (a within-axis proportion calculated from count). It looks like you'd want to use prop, as illustrated below:
library(ggplot2)
library(scales)
library(ggalluvial)
df <- data.frame('id' = rep(1:50,2),
'stage' = c(rep(1,50), rep(2,50)),
'group' = sample(c('A','B','C'), 100, replace = TRUE))
ggplot(df,
aes(x = stage, stratum = group, alluvium = id, fill = group)) +
scale_x_discrete(expand = c(.1, .1)) +
geom_flow() +
geom_stratum(alpha = .5) +
geom_text(stat = "stratum",
aes(label = percent(after_stat(prop), accuracy = .1)))
Created on 2020-07-14 by the reprex package (v0.3.0)
Unfortunately I don't think you can do it without calculating the percentage column in the initial data frame yet. But that can be done easily and also gives more flexibility with the labeling:
library(ggplot2)
library(ggalluvial)
df <- data.frame('id' = rep(1:50,2),
'stage' = c(rep(1,50), rep(2,50)),
'group' = sample(c('A','B','C'), 100, replace = TRUE))
# the list needs to be reversed, as stratums are displayed reversed in the alluvial by default
stratum_list <- df %>%
group_by(stage, group) %>%
summarize(s = n()) %>%
group_by(stage) %>%
mutate(s = percent(s/sum(s), accuracy=0.1)) %>%
arrange(stage, -as.numeric(group)) %>%
.$s
ggplot(df,
aes(x = stage, stratum = group, alluvium = id, fill = group)) +
scale_x_discrete(expand = c(.1, .1)) +
geom_flow() +
geom_stratum(alpha = .5) +
geom_text(stat = "stratum", label=stratum_list)
UPDATE [13/04/2020]
Added stratum_list reversion as Yonghao suggested

is it possible to ggplot grouped partial boxplots w/o facets w/ a single `geom_boxplot()`?

I needed to add some partial boxplots to the following plot:
library(tidyverse)
foo <- tibble(
time = 1:100,
group = sample(c("a", "b"), 100, replace = TRUE) %>% as.factor()
) %>%
group_by(group) %>%
mutate(value = rnorm(n()) + 10 * as.integer(group)) %>%
ungroup()
foo %>%
ggplot(aes(x = time, y = value, color = group)) +
geom_point() +
geom_smooth(se = FALSE)
I would add a grid of (2 x 4 = 8) boxplots (4 per group) to the plot above. Each boxplot should consider a consecutive selection of 25 (or n) points (in each group). I.e., the firsts two boxplots represent the points between the 1st and the 25th (one boxplot below for the group a, and one boxplot above for the group b). Next to them, two other boxplots for the points between the 26th and 50th, etcetera. If they are not in a perfect grid (which I suppose would be both more challenging to obtain and uglier) it would be even better: I prefer if they will "follow" their corresponding smooth line!
That all without using facets (because I have to insert them in a plot which is already facetted :-))
I tried to
bar <- foo %>%
group_by(group) %>%
mutate(cut = 12.5 * (time %/% 25)) %>%
ungroup()
bar %>%
ggplot(aes(x = time, y = value, color = group)) +
geom_point() +
geom_smooth(se = FALSE) +
geom_boxplot(aes(x = cut))
but it doesn't work.
I tried to call geom_boxplot() using group instead of x
bar %>%
ggplot(aes(x = time, y = value, color = group)) +
geom_point() +
geom_smooth(se = FALSE) +
geom_boxplot(aes(group = cut))
But it draws the boxplots without considering the groups and loosing even the colors (and add a redundant call including color = group doesn't help)
Finally, I decided to try it roughly:
bar %>%
ggplot(aes(x = time, y = value, color = group)) +
geom_point() +
geom_smooth(se = FALSE) +
geom_boxplot(data = filter(bar, group == "a"), aes(group = cut)) +
geom_boxplot(data = filter(bar, group == "b"), aes(group = cut))
And it works (maintaining even the correct colors from the main aes)!
Does someone know if it is possible to obtain it using a single call to geom_boxplot()?
Thanks!
This was interesting! I haven't tried to use geom_boxplot with a continuous x before and didn't know how it behaved. I think what is happening is that setting group overrides colour in geom_boxplot, so it doesn't respect either the inherited or repeated colour aesthetic. I think this workaround does the trick; we combine the group and cut variables into group_cut, which takes 8 different values (one for each desired boxplot). Now we can map aes(group = group_cut) and get the desired output. I don't think this is particularly intuitive and it might be worth raising it on the Github, since usually we expect aesthetics to combine nicely (e.g. combining colour and linetype works fine).
library(tidyverse)
bar <- tibble(
time = 1:100,
group = sample(c("a", "b"), 100, replace = TRUE) %>% as.factor()
) %>%
group_by(group) %>%
mutate(
value = rnorm(n()) + 10 * as.integer(group),
cut = 12.5 * ((time - 1) %/% 25), # modified this to prevent an extra boxplot
group_cut = str_c(group, cut)
) %>%
ungroup()
bar %>%
ggplot(aes(x = time, y = value, colour = group)) +
geom_point() +
geom_smooth(se = FALSE) +
geom_boxplot(aes(group = group_cut), position = "identity")
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Created on 2019-08-13 by the reprex package (v0.3.0)

ggplot faceted cumulative histogram

I have the following data
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(100, 6, 1))
gender = rep(c("Male", "Female"), each=100)
mydata = data.frame(x=x, gender=gender)
and I want to plot two cumulative histograms (one for males and the other for females) with ggplot.
I have tried the code below
ggplot(data=mydata, aes(x=x, fill=gender)) + stat_bin(aes(y=cumsum(..count..)), geom="bar", breaks=1:10, colour=I("white")) + facet_grid(gender~.)
but I get this chart
that, obviously, is not correct.
How can I get the correct one, like this:
Thanks!
I would pre-compute the cumsum values per bin per group, and then use geom_histogram to plot.
mydata %>%
mutate(x = cut(x, breaks = 1:10, labels = F)) %>% # Bin x
count(gender, x) %>% # Counts per bin per gender
mutate(x = factor(x, levels = 1:10)) %>% # x as factor
complete(x, gender, fill = list(n = 0)) %>% # Fill missing bins with 0
group_by(gender) %>% # Group by gender ...
mutate(y = cumsum(n)) %>% # ... and calculate cumsum
ggplot(aes(x, y, fill = gender)) + # The rest is (gg)plotting
geom_histogram(stat = "identity", colour = "white") +
facet_grid(gender ~ .)
Like #Edo, I also came here looking for exactly this. #Edo's solution was the key for me. It's great. But I post here a few additions that increase the information density and allow comparisons across different situations.
library(ggplot2)
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(50, 6, 1))
gender = c(rep("Male", 100), rep("Female", 50))
grade = rep(1:3, 50)
mydata = data.frame(x=x, gender=gender, grade = grade)
ggplot(mydata, aes(x,
y = ave(after_stat(density), group, FUN = cumsum)*after_stat(width),
group = interaction(gender, grade),
color = gender)) +
geom_line(stat = "bin") +
scale_y_continuous(labels = scales::percent_format()) +
facet_wrap(~grade)
I rescale the y so that the cumulative plot always ends at 100%. Otherwise, if the groups are not the same size (like they are in the original example data) then the cumulative plots have different final heights. This obscures their relative distribution.
Secondly, I use geom_line(stat="bin") instead of geom_histogram() so that I can put more than one line on a panel. This way I can compare them easily.
Finally, because I also want to compare across facets, I need to make sure the ggplot group variable uses more than just color=gender. We set it manually with group = interaction(gender, grade).
Answering a million years later....
I was looking for a solution for the same problem and I got here..
Eventually I figured it out by myself, so I'll drop it here in case other people will ever need it.
As required: no pre-work is necessary!
ggplot(mydata) +
geom_histogram(aes(x = x, y = ave(..count.., group, FUN = cumsum),
fill = gender, group = gender),
colour = "gray70", breaks = 1:10) +
facet_grid(rows = "gender")

Resources