ggplot faceted cumulative histogram - r

I have the following data
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(100, 6, 1))
gender = rep(c("Male", "Female"), each=100)
mydata = data.frame(x=x, gender=gender)
and I want to plot two cumulative histograms (one for males and the other for females) with ggplot.
I have tried the code below
ggplot(data=mydata, aes(x=x, fill=gender)) + stat_bin(aes(y=cumsum(..count..)), geom="bar", breaks=1:10, colour=I("white")) + facet_grid(gender~.)
but the chart I get is, obviously, not correct.
How can I get the correct one, with a separate cumulative histogram in each facet?
Thanks!

I would pre-compute the cumulative counts per bin and per group, and then use geom_histogram to plot them.
library(dplyr)
library(tidyr)
library(ggplot2)
mydata %>%
  mutate(x = cut(x, breaks = 1:10, labels = FALSE)) %>% # Bin x
  count(gender, x) %>% # Counts per bin per gender
  mutate(x = factor(x, levels = 1:10)) %>% # x as factor
  complete(x, gender, fill = list(n = 0)) %>% # Fill missing bins with 0
  group_by(gender) %>% # Group by gender ...
  mutate(y = cumsum(n)) %>% # ... and calculate cumsum
  ggplot(aes(x, y, fill = gender)) + # The rest is (gg)plotting
  geom_histogram(stat = "identity", colour = "white") +
  facet_grid(gender ~ .)

Like @Edo, I also came here looking for exactly this. @Edo's solution was the key for me. It's great. I'm posting a few additions here that increase the information density and allow comparisons across different situations.
library(ggplot2)
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(50, 6, 1))
gender = c(rep("Male", 100), rep("Female", 50))
grade = rep(1:3, 50)
mydata = data.frame(x=x, gender=gender, grade = grade)
ggplot(mydata, aes(x,
y = ave(after_stat(density), group, FUN = cumsum)*after_stat(width),
group = interaction(gender, grade),
color = gender)) +
geom_line(stat = "bin") +
scale_y_continuous(labels = scales::percent_format()) +
facet_wrap(~grade)
I rescale y so that every cumulative curve ends at 100%. Otherwise, when the groups differ in size (unlike the original example data, where both groups have 100 observations), the cumulative curves end at different heights, which obscures their relative distributions.
Secondly, I use geom_line(stat="bin") instead of geom_histogram() so that I can put more than one line on a panel. This way I can compare them easily.
Finally, because I also want to compare across facets, I need to make sure the ggplot group variable uses more than just color=gender. We set it manually with group = interaction(gender, grade).
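To see why the rescaling works, here is a small base R sketch. It uses hist() as a stand-in for the binning that ggplot2 does, and the helper name cum_share and the bin edges are made up for illustration. Density times bin width is each bin's share of its group, so the running total ends at 1 whatever the group size:

```r
set.seed(123)
x_large <- rnorm(100, 4, 1)  # group with 100 observations
x_small <- rnorm(50, 6, 1)   # group with 50 observations

# Cumulative share of observations per bin: cumsum(density * width).
# cum_share is a hypothetical helper, not part of ggplot2.
cum_share <- function(x) {
  breaks <- seq(floor(min(x)), ceiling(max(x)), by = 1)
  h <- hist(x, breaks = breaks, plot = FALSE)
  cumsum(h$density * diff(h$breaks))
}

share_large <- cum_share(x_large)
share_small <- cum_share(x_small)

# Both curves end at 1 (100%), although the raw counts end at 100 and 50
tail(share_large, 1)
tail(share_small, 1)
```

With raw counts instead of shares, the smaller group's curve would stop at half the height of the larger one.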

Answering a million years later....
I was looking for a solution for the same problem and I got here..
Eventually I figured it out by myself, so I'll drop it here in case other people will ever need it.
As required: no pre-work is necessary!
ggplot(mydata) +
  geom_histogram(aes(x = x, y = ave(after_stat(count), group, FUN = cumsum),
                     fill = gender, group = gender),
                 colour = "gray70", breaks = 1:10) +
  facet_grid(rows = "gender")
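For anyone wondering what ave(..., FUN = cumsum) does here: it applies cumsum separately within each group and returns the result in the original row order. A toy illustration in base R (the count and group values are invented, standing in for the per-bin counts that the stat computes):

```r
# Made-up per-bin counts for two groups, three bins each
count <- c(2, 5, 3, 1, 4, 6)
group <- c(1, 1, 1, 2, 2, 2)

# Running total that restarts for each group
cum_count <- ave(count, group, FUN = cumsum)
cum_count
# 2 7 10 1 5 11
```

A plain cumsum(count) would keep accumulating across the group boundary instead of restarting at the fourth value.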

Related

ggplot2: geom_bar with facet-wise proportion and fill argument

I'm trying to plot proportions with geom_bar() combining fill and facet_grid.
library(tidyverse)
set.seed(123)
df <- tibble(val_num = c(rep(1, 60), rep(2, 40), rep(1, 30), rep(2, 70)),
             val_cat = ifelse(val_num == 1, "cat", "mouse"),
             val_fill = sample(c("black", "white", "gray"), 200, replace = TRUE),
             group = rep(c("A", "B"), each = 100))
ggplot(df) +
stat_count(mapping = aes(x = val_cat, y = ..count../tapply(..count.., ..x.. , sum)[..x..],
fill = val_fill),
position = position_dodge2(preserve = "single")) +
facet_grid(.~ group)
However, it seems that the proportions are calculated for all cats (or all mice) in groups A and B together. In other words, the proportions in the first three columns do not sum to 1.
I expected that adding group = group to the mapping would solve this. However:
ggplot(df) +
stat_count(mapping = aes(x = val_cat, y = ..count../tapply(..count.., ..x.. , sum)[..x..],
fill = val_fill, group = group),
position = position_dodge2(preserve = "single")) +
facet_grid(.~ group)
the plot ignores the fill argument (and, moreover, does not solve the issue). I tried specifying group in different ways, including interaction(), but without any real success.
I would like to solve the problem within ggplot and avoid manipulating the data before plotting.

So it wasn't as easy as I thought, because I don't tend to use the stat_xxx() functions. Although you seem set on not manipulating the data beforehand, here is an approach you can use.
grouped.df <- df %>%
group_by( group, val_fill ) %>%
count( val_cat ) %>%
ungroup() %>%
group_by( group, val_cat ) %>%
mutate( prop=n/sum(n) ) %>%
ungroup()
grouped.df %>%
ggplot() +
geom_col( aes(x=val_cat,y=prop,fill=val_fill), position="dodge" ) +
facet_wrap( ~ group )
This produces a dodged bar chart of the proportions, with one facet per group.
But getting back to your "no data manipulation approach", I think your error is within your y variable. For example, consider the following code and output.
df %>%
ggplot() +
stat_count( aes(x=val_cat,y=..count..,color=val_fill,label=tapply(..count.., ..x.. , sum)[..x..]),
geom="text" ) +
facet_wrap( ~ group )
In the plot above, the y value is the numerator of your attempted proportion and the label value is the denominator of your attempted proportion. I think all you need to do is mess around some more with your tapply() function calls until you have the right combination of y and label.
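To make the denominator issue concrete outside of ggplot, here is a base R sketch on the question's data. The column names pooled and within are made up for this illustration. Summing the counts over the x category alone pools facets A and B, whereas summing within each facet and x category gives proportions that add up to 1 per category in each facet:

```r
set.seed(123)
df <- data.frame(
  val_num = c(rep(1, 60), rep(2, 40), rep(1, 30), rep(2, 70)),
  group   = rep(c("A", "B"), each = 100)
)
df$val_cat  <- ifelse(df$val_num == 1, "cat", "mouse")
df$val_fill <- sample(c("black", "white", "gray"), 200, replace = TRUE)

# One row per bar: counts by category, fill and facet
counts <- as.data.frame(
  table(val_cat = df$val_cat, val_fill = df$val_fill, group = df$group),
  responseName = "count"
)

# Denominator summed over x only: pools facets A and B (the reported problem)
counts$pooled <- counts$count / ave(counts$count, counts$val_cat, FUN = sum)

# Denominator summed within each facet and x category
counts$within <- counts$count /
  ave(counts$count, counts$val_cat, counts$group, FUN = sum)

# Sums per category and facet: 'within' gives 1 everywhere, 'pooled' does not
within_sums <- tapply(counts$within, list(counts$val_cat, counts$group), sum)
pooled_sums <- tapply(counts$pooled, list(counts$val_cat, counts$group), sum)
within_sums
pooled_sums
```

Whatever expression ends up in the y aesthetic needs to reproduce the second denominator, not the first.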

How to show a range of scale values when plotting means on bar chart in R

I am creating bar charts in R displaying individuals' mean scores and their cohort score in various scales. However, I also want to show the range for individual item scores for the scale.
For example, if I have a mean score of 3 on a scale consisting of 5 items with a 5-point Likert scale, I want to also see whether my individual item ratings ranged from 1-5 (meaning I was all over the place, averaging 3), or 2-4 (meaning I was more consistently scoring near 3).
Initially I was just going to do error bars, but I need it to show the range of scores, not standard error or confidence interval. I'm using ggplot2 and need to ideally stick with this R package due to some of the other coding I need it for.
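library(reshape2) # assumed source of melt(); the original post does not say
library(dplyr)
library(ggplot2)
# Make some data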
dat <- data.frame(Item1=c(1,2,2,4,5), Item2=c(3,3,1,5,2), Item3=c(1,5,5,4,5),
Item4=c(1,4,3,4,2), Item5=c(3,2,3,4,3))
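# Find mean scores (computed once over the item columns only, so that the
# new ScaleMean column is not averaged into the other rows)
dat$ScaleMean <- round(rowMeans(dat[, paste0("Item", 1:5)], na.rm = TRUE), 2)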
# Find cohort mean
dat$CoScaleMean <- round(mean(dat$ScaleMean, na.rm=T), 2)
# Add participant IDs
dat$ID <- c(1,2,3,4,5)
# Create long data format
dat.long3 <- melt(dat, id.vars=c('ID'),
measure.vars=c('ScaleMean', 'CoScaleMean'))
# Rename variable and value columns
colnames(dat.long3)[c(2,3)] <- c('Scale', 'Score')
# Bar chart
dat.long3 %>%
filter(ID == 1) %>%
ggplot(aes(x=Scale, y=Score)) +
geom_bar(aes(x=Scale, y=Score, fill=Scale), stat = 'identity', width=.9, alpha=1, position='dodge') +
coord_flip() +
ggtitle(label='Scale Name') +
scale_y_continuous(breaks=seq(0, 5, 1), limits = c(0, 5), expand=c(0,0)) +
scale_x_discrete(labels=element_blank()) +
scale_fill_manual(label=paste0(c('Your Score', 'Cohort Score')), values=c('gold', 'darkorange')) +
guides(fill = guide_legend(nrow = 2, reverse = TRUE)) +
geom_text(aes(x=Scale, label=Score, y=1), size=5, color='#000000')
This is for another individual who is requesting a visual representation in the chart (not a table) and needs to keep it in bar chart form (no box plots, etc.). I'm not sure what other options there are to achieve this, if any? Your help is greatly appreciated!

You can create a custom label with the glue package and call that in geom_text. Starting after dat$ID <- c(1, 2, 3, 4, 5):
library(dplyr)
library(tidyr)
library(purrr)
library(glue)
dat <- dat %>%
nest(items = Item1:Item5) %>%
mutate(min = map_dbl(items, min),
max = map_dbl(items, max),
ScaleMean = glue("{ScaleMean} [{min}-{max}]"),
CoScaleMean = glue("{CoScaleMean} [{min}-{max}]")) %>%
select(ID, ScaleMean, CoScaleMean)
# Create long data format
dat.long3 <- melt(dat, id.vars=c('ID'),
measure.vars=c('ScaleMean', 'CoScaleMean'))
dat.long3 <- dat.long3 %>%
separate(value, into = c("value", "range"), sep = " ") %>%
mutate(value = as.numeric(value))
# Rename variable and value columns
colnames(dat.long3)[c(2,3)] <- c('Scale', 'Score')
# Bar chart
dat.long3 %>%
mutate(label = glue("{Score} {range}")) %>%
dplyr::filter(ID == 1) %>%
ggplot(aes(x=Scale, y=Score)) +
geom_bar(aes(x=Scale, y=Score, fill=Scale), stat = 'identity', width=.9, alpha=1, position='dodge') +
coord_flip() +
ggtitle(label='Scale Name') +
scale_y_continuous(breaks=seq(0, 5, 1), limits = c(0, 5), expand=c(0,0)) +
scale_x_discrete(labels=element_blank()) +
scale_fill_manual(label=paste0(c('Your Score', 'Cohort Score')), values=c('gold', 'darkorange')) +
guides(fill = guide_legend(nrow = 2, reverse = TRUE)) +
geom_text(aes(x=Scale, label=label, y=1), size=5, color='#000000')
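If the range should be drawn rather than printed, one option (a sketch, not part of the answer above) is geom_errorbar() with ymin and ymax set to the smallest and largest item score instead of a standard error. The column names ItemMin and ItemMax are made up for this example, and only the individual's bar is shown:

```r
library(ggplot2)

items <- data.frame(Item1 = c(1, 2, 2, 4, 5), Item2 = c(3, 3, 1, 5, 2),
                    Item3 = c(1, 5, 5, 4, 5), Item4 = c(1, 4, 3, 4, 2),
                    Item5 = c(3, 2, 3, 4, 3))

# Per-participant mean plus the lowest and highest item score
scores <- data.frame(
  ID        = 1:5,
  ScaleMean = round(rowMeans(items), 2),
  ItemMin   = apply(items, 1, min),  # hypothetical column name
  ItemMax   = apply(items, 1, max)   # hypothetical column name
)

one <- scores[scores$ID == 1, ]

p <- ggplot(one, aes(x = factor(ID), y = ScaleMean)) +
  geom_col(fill = "gold", width = 0.9) +
  # The bar ends mark the item range, not a standard error or interval
  geom_errorbar(aes(ymin = ItemMin, ymax = ItemMax), width = 0.2) +
  coord_flip() +
  scale_y_continuous(breaks = 0:5, limits = c(0, 5), expand = c(0, 0)) +
  labs(x = NULL, y = "Score", title = "Scale Name")
p
```

For participant 1 the bar ends at 1.8 and the range marker spans 1 to 3.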

Within a function, how to create a discrete axis with _repeated and ordered_ labels

I want to create a function that makes a heatmap where the y axis will have unique breaks, but repeated and ordered labels. I know that this might not be great practice. I am also aware that similar questions have been asked before. For example: ggplot in R, reordering the bars. But I want to achieve these repeated and ordered labels through sorting within a function, not by typing them manually. I am aware of solutions for reordering axes based on the values of a factor (e.g., Order Bars in ggplot2 bar graph), but I don't think they apply, or I can't see how to apply them to my case, where the breaks are unique but the labels repeat.
Here is some code to reproduce the problem and some of my attempts:
Libraries and data
library(ggplot2)
library(dplyr)
library(tidyr)
set.seed(4)
id <- LETTERS[1:10]
lab <- paste(c("AB", "CD"), 1:5, sep = "_") %>%
sample(., size = 10, replace = TRUE)
val <- sample.int(n = 6, size = 10, replace = TRUE)
tes <- ifelse(val >= 4, 1, 0)
dat <- data.frame(id, lab, val, tes)
A heatmap with unique breaks on the y axis
dat2 <- dat %>% gather(kind, value, val:tes)
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1)
A heatmap where the y axis is labeled with repeated labels instead of the unique breaks
This works, to the point that labels are used instead of unique ids, but the y axis is not ordered by the labels. Also, I am not sure about setting breaks and labels from the data frame in wide format (dat), rather than the data frame in long format used by ggplot (dat2).
dat2 <- dat %>% gather(kind, value, val:tes)
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1) +
scale_y_discrete(breaks=dat$id, labels=dat$lab)
Mapping the vector with repeated values on the y axis obviously doesn't work
dat2 <- dat %>% gather(kind, value, val:tes)
ggplot(dat2) +
geom_tile(aes(x = kind, y = lab, fill = value), color="white", size=1)
Repeated and ordered labels, try 1
As expected, merely sorting the input data by the non-unique lab variable does not work.
dat2 <- dat %>% gather(kind, value, val:tes) %>%
arrange(lab)
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1) +
scale_y_discrete(breaks=id, label=lab)
Repeated and ordered labels, try 2
Try to create a named breaks vector ordered by the (repeating) labels. This gets me nowhere. Half the labels are missing and they are still not sorted.
dat2 <- dat %>% gather(kind, value, val:tes)
brks <- setNames(dat$id, dat$lab)[sort(dat$lab)]
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1) +
scale_y_discrete(breaks = brks, labels = names(brks))
Repeated and ordered labels, try 3
Starting with the data frame sorted by label, try to create an ordered factor for lab. Then sort the table by this ordered factor. No luck.
dat2 <- dat %>% gather(kind, value, val:tes) %>% arrange(lab)
dat2 <- mutate(dat2, lab_f=factor(lab, levels=sort(unique(lab)), ordered = TRUE))
dat2 <- arrange(dat2, lab_f)
# check
dat2$lab_f
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1) +
scale_y_discrete(breaks = dat2$id, labels = dat2$lab_f)
A workaround, which I can use if I have to, but I am trying to avoid
We can create a combination of id and lab which will be unique and use it for the y axis
dat2 <- dat %>% gather(kind, value, val:tes) %>%
mutate(id_lab=paste(lab, id, sep="_"))
ggplot(dat2) +
geom_tile(aes(x = kind, y = id_lab, fill = value), color="white", size=1)
I must be missing something. Any help is much appreciated.
The goal is to have a function that will take an arbitrarily long table and plot a y axis with unique breaks but (possibly) repeated and ordered labels.
heat <- function(dat) {
dat2 <- dat %>% gather(kind, value, val:tes)
# any other manipulation here
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1)
# scale_y_discrete() (if needed)
}
The plot I am looking for is something like this (created in Inkscape)

Using limits instead of breaks sets the order:
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1) +
geom_text(aes(x = 1, y = id, label = id), col = 'white') +
scale_y_discrete(limits = dat$id[order(dat$lab)], labels = sort(dat$lab))
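Since the question asks for a function, the same idea can be wrapped up roughly like this. This is a sketch: y_axis_order is a made-up helper name, and the column names id, lab, val and tes are assumed to match the question's data. Using one order() permutation for both limits and labels keeps each label attached to its own id:

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# Hypothetical helper: y-axis limits (unique ids) and labels (possibly
# repeated), both arranged by the same permutation so they stay aligned
y_axis_order <- function(dat) {
  ord <- order(dat$lab)
  list(limits = as.character(dat$id[ord]),
       labels = as.character(dat$lab[ord]))
}

heat <- function(dat) {
  ax <- y_axis_order(dat)
  dat2 <- dat %>% gather(kind, value, val:tes)
  ggplot(dat2) +
    geom_tile(aes(x = kind, y = id, fill = value), color = "white") +
    scale_y_discrete(limits = ax$limits, labels = ax$labels)
}

# Example data from the question
set.seed(4)
id <- LETTERS[1:10]
lab <- sample(paste(c("AB", "CD"), 1:5, sep = "_"), size = 10, replace = TRUE)
val <- sample.int(n = 6, size = 10, replace = TRUE)
tes <- ifelse(val >= 4, 1, 0)
dat <- data.frame(id, lab, val, tes)

p <- heat(dat)
p
```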

is it possible to ggplot grouped partial boxplots w/o facets w/ a single `geom_boxplot()`?

I needed to add some partial boxplots to the following plot:
library(tidyverse)
foo <- tibble(
time = 1:100,
group = sample(c("a", "b"), 100, replace = TRUE) %>% as.factor()
) %>%
group_by(group) %>%
mutate(value = rnorm(n()) + 10 * as.integer(group)) %>%
ungroup()
foo %>%
ggplot(aes(x = time, y = value, color = group)) +
geom_point() +
geom_smooth(se = FALSE)
I would like to add a grid of (2 x 4 = 8) boxplots (4 per group) to the plot above. Each boxplot should cover a consecutive selection of 25 (or n) points (in each group). I.e., the first two boxplots represent the points between the 1st and the 25th (one boxplot below for group a, and one boxplot above for group b). Next to them, two other boxplots for the points between the 26th and 50th, et cetera. If they are not in a perfect grid (which I suppose would be both more challenging to obtain and uglier), even better: I would prefer that they "follow" their corresponding smooth line!
All of that without using facets (because I have to insert them in a plot which is already faceted :-))
I tried to
bar <- foo %>%
group_by(group) %>%
mutate(cut = 12.5 * (time %/% 25)) %>%
ungroup()
bar %>%
ggplot(aes(x = time, y = value, color = group)) +
geom_point() +
geom_smooth(se = FALSE) +
geom_boxplot(aes(x = cut))
but it doesn't work.
I tried to call geom_boxplot() using group instead of x
bar %>%
ggplot(aes(x = time, y = value, color = group)) +
geom_point() +
geom_smooth(se = FALSE) +
geom_boxplot(aes(group = cut))
But it draws the boxplots without considering the groups, losing even the colors (and adding a redundant call including color = group doesn't help).
Finally, I decided to try it roughly:
bar %>%
ggplot(aes(x = time, y = value, color = group)) +
geom_point() +
geom_smooth(se = FALSE) +
geom_boxplot(data = filter(bar, group == "a"), aes(group = cut)) +
geom_boxplot(data = filter(bar, group == "b"), aes(group = cut))
And it works (maintaining even the correct colors from the main aes)!
Does someone know if it is possible to obtain it using a single call to geom_boxplot()?
Thanks!

This was interesting! I haven't tried to use geom_boxplot with a continuous x before and didn't know how it behaved. I think what is happening is that setting group overrides colour in geom_boxplot, so it doesn't respect either the inherited or repeated colour aesthetic. I think this workaround does the trick; we combine the group and cut variables into group_cut, which takes 8 different values (one for each desired boxplot). Now we can map aes(group = group_cut) and get the desired output. I don't think this is particularly intuitive and it might be worth raising it on GitHub, since usually we expect aesthetics to combine nicely (e.g. combining colour and linetype works fine).
library(tidyverse)
bar <- tibble(
time = 1:100,
group = sample(c("a", "b"), 100, replace = TRUE) %>% as.factor()
) %>%
group_by(group) %>%
mutate(
value = rnorm(n()) + 10 * as.integer(group),
cut = 12.5 * ((time - 1) %/% 25), # modified this to prevent an extra boxplot
group_cut = str_c(group, cut)
) %>%
ungroup()
bar %>%
ggplot(aes(x = time, y = value, colour = group)) +
geom_point() +
geom_smooth(se = FALSE) +
geom_boxplot(aes(group = group_cut), position = "identity")
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Created on 2019-08-13 by the reprex package (v0.3.0)
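A possible variation on the same idea, if you would rather not add a column: build the combined grouping inside aes() with interaction(). This is a sketch along the lines of the code above; the seed is added here only to make it reproducible.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # not in the original code; added for reproducibility
bar <- tibble(
  time = 1:100,
  group = sample(c("a", "b"), 100, replace = TRUE) %>% as.factor()
) %>%
  group_by(group) %>%
  mutate(
    value = rnorm(n()) + 10 * as.integer(group),
    cut = 12.5 * ((time - 1) %/% 25)
  ) %>%
  ungroup()

p <- bar %>%
  ggplot(aes(x = time, y = value, colour = group)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  # interaction() forms the group/cut combinations on the fly
  geom_boxplot(aes(group = interaction(group, cut)), position = "identity")
p
```

The colour mapping is still inherited from the main aes(), so each box takes the colour of its own group.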

Using ggplot to group in two different ways

I have data that looks kinda like this:
id = rep(1:33,3)
year = rep(1:3,33)
group = sample(c(1:3),99, replace=T)
test_result = sample(c(TRUE,FALSE), size=99, replace = T)
df = data.frame(id, year, group, test_result)
df$year = as.factor(year)
df$group = as.factor(group)
My goal is to visualize it so that I can see how group number and year relate to test_result.
df %>%
group_by(id,year) %>%
summarize(x=sum(test_result)) %>%
ggplot() +
geom_histogram(aes(fill = year,
x = x),
binwidth = 1,
position='dodge') +
theme_minimal()
gets me almost all the way there. What I want is to be able to add something like facet_wrap(group~.) to the end of this to show how these change by group but obviously group is not part of the aggregated dataframe.
Right now my best solution is just to show multiple plots like
df %>% filter(group==1) %>% # Replace group number here
group_by(id,year) %>%
summarize(x=sum(test_result)) %>%
ggplot() +
geom_histogram(aes(fill = year,
x = x),
binwidth = 1,
position='dodge') +
theme_minimal()
but I'd love to figure out how to put them all in one figure and I'm wondering if maybe the way to do that is to put more of the grouping logic into ggplot?
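One possible direction, sketched under the assumption that it is acceptable to aggregate within group as well: keep group in the group_by() call so that it survives the summarize() step and can be used for faceting. Note that this changes the aggregation, because the sums are then per id, year and group rather than per id and year. The seed is added only to make the sketch reproducible.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # not in the original post
df <- data.frame(
  id = rep(1:33, 3),
  year = factor(rep(1:3, 33)),
  group = factor(sample(1:3, 99, replace = TRUE)),
  test_result = sample(c(TRUE, FALSE), size = 99, replace = TRUE)
)

# Keeping group in the grouping keeps it in the aggregated data frame
agg <- df %>%
  group_by(id, year, group) %>%
  summarize(x = sum(test_result), .groups = "drop")

p <- ggplot(agg) +
  geom_histogram(aes(fill = year, x = x),
                 binwidth = 1,
                 position = "dodge") +
  facet_wrap(~ group) +
  theme_minimal()
p
```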

Resources