Order x axis in stacked bar by subset of fill - r

There are multiple questions (here for instance) on how to arrange the x axis by frequency in a bar chart with ggplot2. However, my aim is to arrange the categories on the X-axis in a stacked bar chart by the relative frequency of a subset of the fill. For instance, I would like to sort the x-axis by the percentage of category B in variable z.
This was my first try using only ggplot2
library(ggplot2)
library(tibble)
library(scales)
factor1 <- as.factor(c("ABC", "CDA", "XYZ", "YRO"))
factor2 <- as.factor(c("A", "B"))
set.seed(43)
data <- tibble(x = sample(factor1, 1000, replace = TRUE),
z = sample(factor2, 1000, replace = TRUE))
ggplot(data = data, aes(x = x, fill = z, order = z)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = percent)
When that didn't work, I created a summarised data frame using dplyr, spread the data, sorted it by B, and gathered it again. But plotting that didn't work either.
library(dplyr)
library(tidyr)
data %>%
group_by(x, z) %>%
count() %>%
spread(z, n) %>%
arrange(-B) %>%
gather(z, n, -x) %>%
ggplot(aes(x = reorder(x, n), y = n, fill = z)) +
geom_bar(stat = "identity", position = "fill") +
scale_y_continuous(labels = percent)
I would prefer a solution with ggplot only, so that I am not dependent on the row order of the data frame created by dplyr/tidyr. However, I'm open to anything.

If you want to sort by absolute frequency:
lvls <- names(sort(table(data[data$z == "B", "x"])))
If you want to sort by relative frequency:
lvls <- names(sort(tapply(data$z == "B", data$x, mean)))
Then you can create the factor on the fly inside ggplot:
ggplot(data = data, aes(factor(x, levels = lvls), fill = z)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = percent)
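As a quick sanity check, the relative-frequency ordering can be verified without plotting. The sketch below rebuilds data of the same shape as a plain data frame (an assumption made here so that it runs with base R only) and prints the share of B per category in the order given by lvls.

```r
# Data of the same shape as in the question, as a base R data frame (assumption)
set.seed(43)
data <- data.frame(
  x = sample(c("ABC", "CDA", "XYZ", "YRO"), 1000, replace = TRUE),
  z = sample(c("A", "B"), 1000, replace = TRUE)
)

# Share of z == "B" within each category of x
share_b <- tapply(data$z == "B", data$x, mean)

# Categories sorted by that share, as in the answer above
lvls <- names(sort(share_b))

# Read in the order of lvls, the shares are non-decreasing
print(round(share_b[lvls], 3))
```

If the printed shares increase from left to right, factor(x, levels = lvls) will place the bars in the intended order.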

A solution using tidyverse would be:
data %>%
mutate(x = forcats::fct_reorder(x, as.numeric(z), .fun = mean)) %>%
ggplot(aes(x, fill = z)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = percent)
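The reason the mean of as.numeric(z) works as a sort key may not be obvious. A minimal sketch, assuming a two-level factor like z in the question (the toy vector below is made up for illustration): the integer codes are 1 for A and 2 for B, so their mean is 1 plus the share of B.

```r
# Toy two-level factor (hypothetical values, for illustration only)
z <- factor(c("A", "B", "B", "A", "B"))

as.numeric(z)        # integer codes: A = 1, B = 2
mean(as.numeric(z))  # 1.6, i.e. 1 + share of "B"
mean(z == "B")       # 0.6, the share of "B"
```

With more than two fill levels this shortcut no longer isolates a single category, so sorting on mean(z == "B") directly is the safer choice there.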

Related

ggplot: How to show density instead of count in grouped bar plot with facet_wrap?

The dataframe consists of two factor variables: cls with 3 levels and subset with 2 levels. I want to compare how much of each class (cls) there is in both groups of subset. I want to show percentages on the y-axis. They should be computed within each subset group, not over the whole dataset.
library(tidyverse)
data = data.frame(
x = rnorm(1000),
cls = factor(c(rep("A", 200), rep("B", 300), rep("C", 500))),
subset = factor(c(rep("train", 900), rep("test", 100)))
)
This was my attempt to show percentages, but it failed because they are computed within whole dataset instead of subset group:
ggplot(data, aes(x = cls, fill = cls)) + geom_bar(aes(y = ..count.. / sum(..count..))) + facet_wrap(~subset)
How can I fix it?
Edit related to the accepted answer:
plot_train_vs_test = function(data, var, subset_colname){
plot_data = data %>%
count(var, eval(subset_colname)) %>%
group_by(eval(subset_colname)) %>%
mutate(perc = n/sum(n))
ggplot(plot_data, aes(x = var, y = perc, fill = var)) +
geom_col() +
scale_y_continuous(labels = scales::label_percent()) +
facet_wrap(~eval(subset_colname))
}
plot_train_vs_test(data, "cls", "subset")
Results in errors.
One option and easy fix would be to compute the percentages outside of ggplot and plot the summarized data:
library(ggplot2)
library(dplyr, warn = FALSE)
set.seed(123)
data <- data.frame(
x = rnorm(1000),
cls = factor(c(rep("A", 200), rep("B", 300), rep("C", 500))),
subset = factor(c(rep("train", 900), rep("test", 100)))
)
data_sum <- data %>%
count(cls, subset) %>%
group_by(subset) %>%
mutate(pct = n / sum(n))
ggplot(data_sum, aes(x = cls, y = pct, fill = cls)) +
geom_col() +
scale_y_continuous(labels = scales::label_percent()) +
facet_wrap(~subset)
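To confirm that the percentages are computed within each subset group, the same numbers can be cross-checked in base R. This is only a sketch of the check, not part of the plotting code; the vector name subset_grp is chosen here to avoid masking base::subset.

```r
# Same factor layout as in the question
cls <- factor(c(rep("A", 200), rep("B", 300), rep("C", 500)))
subset_grp <- factor(c(rep("train", 900), rep("test", 100)))

# Column-wise proportions: each subset group sums to 1
pct <- prop.table(table(cls, subset_grp), margin = 2)
print(round(pct, 3))
colSums(pct)
```

Unlike count(), table() also keeps combinations with zero observations, so the test column shows 0 for A and B.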
EDIT One approach to put the code in a function may look like so:
plot_train_vs_test <- function(.data, x, facet) {
.data_sum <- .data %>%
count({{ x }}, {{ facet }}) %>%
group_by({{ facet }}) %>%
mutate(pct = n / sum(n))
ggplot(.data_sum, aes(x = {{ x }}, y = pct, fill = {{ x }})) +
geom_col() +
scale_y_continuous(labels = scales::label_percent()) +
facet_wrap(vars({{ facet }}))
}
plot_train_vs_test(data, cls, subset)
For more on the details and especially the {{ operator see Programming with dplyr, Programming with ggplot2 and Best practices for programming with ggplot2

How can I change the size of a bar in a grouped bar chart when one group has no data? [duplicate]

Is there a way to set a constant width for geom_bar() in the event of missing data in the time series example below? I've tried setting width in aes() with no luck. Compare May '11 to June '11 width of bars in the plot below the code example.
colours <- c("#FF0000", "#33CC33", "#CCCCCC", "#FFA500", "#000000" )
iris$Month <- rep(seq(from=as.Date("2011-01-01"), to=as.Date("2011-10-01"), by="month"), 15)
d<-aggregate(iris$Sepal.Length, by=list(iris$Month, iris$Species), sum)
d$quota<-seq(from=2000, to=60000, by=2000)
colnames(d) <- c("Month", "Species", "Sepal.Width", "Quota")
d$Sepal.Width<-d$Sepal.Width * 1000
g1 <- ggplot(data=d, aes(x=Month, y=Quota, color="Quota")) + geom_line(size=1)
g1 + geom_bar(data=d[c(-1:-5),], aes(x=Month, y=Sepal.Width, width=10, group=Species, fill=Species), stat="identity", position="dodge") + scale_fill_manual(values=colours)
Some new options for position_dodge() and the new position_dodge2(), introduced in ggplot2 3.0.0, can help.
You can use preserve = "single" in position_dodge() to base the widths off a single element, so the widths of all bars will be the same.
ggplot(data = d, aes(x = Month, y = Quota, color = "Quota")) +
geom_line(size = 1) +
geom_col(data = d[c(-1:-5),], aes(y = Sepal.Width, fill = Species),
position = position_dodge(preserve = "single") ) +
scale_fill_manual(values = colours)
Using position_dodge2() changes the way things are centered, centering each set of bars at each x axis location. It has some padding built in, so use padding = 0 to remove.
ggplot(data = d, aes(x = Month, y = Quota, color = "Quota")) +
geom_line(size = 1) +
geom_col(data = d[c(-1:-5),], aes(y = Sepal.Width, fill = Species),
position = position_dodge2(preserve = "single", padding = 0) ) +
scale_fill_manual(values = colours)
The easiest way is to supplement your data set so that every combination is present, even if it has NA as its value. Taking a simpler example (as yours has a lot of unneeded features):
dat <- data.frame(a=rep(LETTERS[1:3],3),
b=rep(letters[1:3],each=3),
v=1:9)[-2,]
ggplot(dat, aes(x=a, y=v, colour=b)) +
geom_bar(aes(fill=b), stat="identity", position="dodge")
This shows the behavior you are trying to avoid: in group "B", there is no group "a", so the bars are wider. Supplement dat with a dataframe with all the combinations of a and b:
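# unique() is used instead of levels() because a and b are character columns in R >= 4.0
dat.all <- rbind(dat, cbind(expand.grid(a=unique(dat$a), b=unique(dat$b), stringsAsFactors=FALSE), v=NA))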
ggplot(dat.all, aes(x=a, y=v, colour=b)) +
geom_bar(aes(fill=b), stat="identity", position="dodge")
I had the same problem but was looking for a solution that works with the pipe (%>%). Using tidyr::spread and tidyr::gather from the tidyverse does the trick. I use the same data as @Brian Diggs, but with uppercase variable names so that the column names do not clash when transforming to wide:
library(tidyverse)
dat <- data.frame(A = rep(LETTERS[1:3], 3),
B = rep(letters[1:3], each = 3),
V = 1:9)[-2, ]
dat %>%
spread(key = B, value = V, fill = NA) %>% # turn data to wide, using fill = NA to generate missing values
gather(key = B, value = V, -A) %>% # go back to long, with the missings
ggplot(aes(x = A, y = V, fill = B)) +
geom_col(position = position_dodge())
Edit:
There actually is an even simpler solution to that problem in combination with the pipe. Using tidyr::complete gives the same result in one line:
dat %>%
complete(A, B) %>%
ggplot(aes(x = A, y = V, fill = B)) +
geom_col(position = position_dodge())
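To see what complete() does to the data before it reaches ggplot, the intermediate result can be inspected on its own. A small sketch using the same dat as above:

```r
library(tidyr)

# Same example data: the combination A = "B", B = "a" is missing
dat <- data.frame(A = rep(LETTERS[1:3], 3),
                  B = rep(letters[1:3], each = 3),
                  V = 1:9)[-2, ]

# complete() adds one row per missing A/B combination, with V = NA
dat_full <- complete(dat, A, B)

nrow(dat)                        # 8 rows before
nrow(dat_full)                   # 9 rows after
dat_full[is.na(dat_full$V), ]    # the added row
```

The NA row gives the dodge position a slot to reserve, which is why the remaining bars keep a constant width.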

Ordering geom_bar() without a y defined variable

Is there a way to order the bars in geom_bar() when y is just the count of x?
Example:
ggplot(dat) +
geom_bar(aes(x = feature_1))
I tried using reorder() but it requires a defined y variable within aes().
Made up data:
dfexmpl <- data.frame(stringsAsFactors = FALSE,
group = c("a","a","a","a","a","a",
"a","a","a","b","b","b","b","b","b","b","b","b",
"b","b","b","b","b","b"))
plot code - reorder is doing the work of arranging by count:
library(dplyr)
library(ggplot2)
dfexmpl %>%
group_by(group) %>%
mutate(count = n()) %>%
ggplot(aes(x = reorder(group, -count), y = count)) +
geom_bar(stat = "identity")
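If y should stay undefined, as in the question, one alternative is to order the factor levels by their counts before plotting. This is a sketch of a different technique from the answer above, using only base R for the reordering; the data is rebuilt with rep() but has the same 9 "a" and 15 "b" values.

```r
library(ggplot2)

# Same made-up data as above: 9 x "a", 15 x "b"
dfexmpl <- data.frame(group = c(rep("a", 9), rep("b", 15)))

# Levels sorted by decreasing count, so the most frequent group comes first
lvls <- names(sort(table(dfexmpl$group), decreasing = TRUE))
dfexmpl$group <- factor(dfexmpl$group, levels = lvls)

# geom_bar() counts the rows itself; no y aesthetic is needed
p <- ggplot(dfexmpl) +
  geom_bar(aes(x = group))
p
```

Because the order lives in the factor levels, the plot call stays identical to the one in the question.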

Within a function, how to create a discrete axis with _repeated and ordered_ labels

I want to create a function that makes a heatmap where the y axis will have unique breaks, but repeated and ordered labels. I know that this might not be a great practice. I am also aware that similar questions have been asked before. For example: ggplot in R, reordering the bars. But I want to achieve these repeated and ordered labels through sorting within a function, not by typing them manually. I am aware of solutions for reordering axes based on the values of a factor (e.g., Order Bars in ggplot2 bar graph), but I don't think they apply, or I can't see how to apply them to my case, where the breaks are unique but the labels repeat.
Here is some code to reproduce the problem and some of my attempts:
Libraries and data
library(ggplot2)
library(dplyr)
library(tidyr)
set.seed(4)
id <- LETTERS[1:10]
lab <- paste(c("AB", "CD"), 1:5, sep = "_") %>%
sample(., size = 10, replace = TRUE)
val <- sample.int(n = 6, size = 10, replace = TRUE)
tes <- ifelse(val >= 4, 1, 0)
dat <- data.frame(id, lab, val, tes)
A heatmap with unique breaks on the y axis
dat2 <- dat %>% gather(kind, value, val:tes)
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1)
A heatmap where the y axis is labeled with repeated labels instead of the unique breaks
This works, to the point that labels are used instead of unique ids, but the y axis is not ordered by the labels. Also, I am not sure about setting breaks and labels from the data frame in wide format (dat), rather than the data frame in long format used by ggplot (dat2).
dat2 <- dat %>% gather(kind, value, val:tes)
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1) +
scale_y_discrete(breaks=dat$id, labels=dat$lab)
Mapping the vector with repeated values on the y axis obviously doesn't work
dat2 <- dat %>% gather(kind, value, val:tes)
ggplot(dat2) +
geom_tile(aes(x = kind, y = lab, fill = value), color="white", size=1)
Repeated and ordered labels, try 1
As expected, merely sorting the input data by the non-unique lab variable does not work.
dat2 <- dat %>% gather(kind, value, val:tes) %>%
arrange(lab)
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1) +
scale_y_discrete(breaks=id, label=lab)
Repeated and ordered labels, try 2
Try to create a named breaks vector ordered by the (repeating) labels. This gets me nowhere. Half the labels are missing and they are still not sorted.
dat2 <- dat %>% gather(kind, value, val:tes)
brks <- setNames(dat$id, dat$lab)[sort(dat$lab)]
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1) +
scale_y_discrete(breaks = brks, labels = names(brks))
Repeated and ordered labels, try 3
Starting with the data frame sorted by label, try to create an ordered factor for lab. Then sort the table by this ordered factor. No luck.
dat2 <- dat %>% gather(kind, value, val:tes) %>% arrange(lab)
dat2 <- mutate(dat2, lab_f=factor(lab, levels=sort(unique(lab)), ordered = TRUE))
dat2 <- arrange(dat2, lab_f)
# check
dat2$lab_f
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1) +
scale_y_discrete(breaks = dat2$id, labels = dat2$lab_f)
A workaround, which I can use if I have to, but I am trying to avoid
We can create a combination of id and lab which will be unique and use it for the y axis
dat2 <- dat %>% gather(kind, value, val:tes) %>%
mutate(id_lab=paste(lab, id, sep="_"))
ggplot(dat2) +
geom_tile(aes(x = kind, y = id_lab, fill = value), color="white", size=1)
I must be missing something. Any help is much appreciated.
The goal is to have a function that will take an arbitrarily long table and plot a y axis with unique breaks but (possibly) repeated and ordered labels.
heat <- function(dat) {
dat2 <- dat %>% gather(kind, value, val:tes)
# any other manipulation here
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1)
# scale_y_discrete() (if needed)
}
The plot I am looking for is something like this (created in inkscape)
Using limits instead of breaks sets the order:
ggplot(dat2) +
geom_tile(aes(x = kind, y = id, fill = value), color="white", size=1) +
geom_text(aes(x = 1, y = id, label = id), col = 'white') +
scale_y_discrete(limits = dat$id[order(dat$lab)], labels = sort(dat$lab))
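Since the question asks for a function, the limits approach can be wrapped as follows. This is a minimal sketch, assuming the input always has the columns id, lab, val and tes as in the question; the small dat below is made up for illustration.

```r
library(ggplot2)
library(tidyr)

heat <- function(dat) {
  ord <- order(dat$lab)                        # row order that sorts the labels
  dat2 <- gather(dat, kind, value, val:tes)    # long format, as in the question
  ggplot(dat2) +
    geom_tile(aes(x = kind, y = id, fill = value), color = "white") +
    scale_y_discrete(limits = dat$id[ord],     # unique ids, ordered by label
                     labels = dat$lab[ord])    # repeated labels, same order
}

# Hypothetical input with repeated labels
dat <- data.frame(id  = LETTERS[1:4],
                  lab = c("CD_2", "AB_1", "CD_2", "AB_1"),
                  val = c(5, 2, 6, 1),
                  tes = c(1, 0, 1, 0))

dat$id[order(dat$lab)]   # "B" "D" "A" "C": ids in label order
p <- heat(dat)
```

Indexing both limits and labels with the same ord vector keeps each id paired with its own label, which sort(dat$lab) achieves only implicitly.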

ggplot faceted cumulative histogram

I have the following data
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(100, 6, 1))
gender = rep(c("Male", "Female"), each=100)
mydata = data.frame(x=x, gender=gender)
and I want to plot two cumulative histograms (one for males and the other for females) with ggplot.
I have tried the code below
ggplot(data=mydata, aes(x=x, fill=gender)) + stat_bin(aes(y=cumsum(..count..)), geom="bar", breaks=1:10, colour=I("white")) + facet_grid(gender~.)
but I get this chart
that, obviously, is not correct.
How can I get the correct one, like this:
Thanks!
I would pre-compute the cumulative counts per bin per group, and then plot them as bars with geom_col().
library(dplyr)
library(tidyr)
library(ggplot2)
mydata %>%
mutate(x = cut(x, breaks = 1:10, labels = FALSE)) %>% # Bin x
count(gender, x) %>% # Counts per bin per gender
mutate(x = factor(x, levels = 1:10)) %>% # x as factor
complete(x, gender, fill = list(n = 0)) %>% # Fill missing bins with 0
group_by(gender) %>% # Group by gender ...
mutate(y = cumsum(n)) %>% # ... and calculate cumsum
ggplot(aes(x, y, fill = gender)) + # The rest is (gg)plotting
geom_col(colour = "white") + # Bars from pre-computed heights
facet_grid(gender ~ .)
Like @Edo, I also came here looking for exactly this. @Edo's solution was the key for me. It's great. But I post here a few additions that increase the information density and allow comparisons across different situations.
library(ggplot2)
set.seed(123)
x = c(rnorm(100, 4, 1), rnorm(50, 6, 1))
gender = c(rep("Male", 100), rep("Female", 50))
grade = rep(1:3, 50)
mydata = data.frame(x=x, gender=gender, grade = grade)
ggplot(mydata, aes(x,
y = ave(after_stat(density), group, FUN = cumsum)*after_stat(width),
group = interaction(gender, grade),
color = gender)) +
geom_line(stat = "bin") +
scale_y_continuous(labels = scales::percent_format()) +
facet_wrap(~grade)
I rescale the y so that the cumulative plot always ends at 100%. Otherwise, if the groups are not the same size (like they are in the original example data) then the cumulative plots have different final heights. This obscures their relative distribution.
Secondly, I use geom_line(stat="bin") instead of geom_histogram() so that I can put more than one line on a panel. This way I can compare them easily.
Finally, because I also want to compare across facets, I need to make sure the ggplot group variable uses more than just color=gender. We set it manually with group = interaction(gender, grade).
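The rescaling rests on the fact that density times bin width is the share of observations in each bin, so its cumulative sum ends at 1 regardless of group size. A small base R sketch of that arithmetic, using hist() rather than ggplot2's binning (the bin edges below are an assumption chosen to cover the data):

```r
set.seed(123)
x <- c(rnorm(100, 4, 1), rnorm(50, 6, 1))

# Bin edges covering the full range of x, in steps of 0.5 (assumption)
brks <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)
h <- hist(x, breaks = brks, plot = FALSE)

# density * width = share of observations per bin
cum_share <- cumsum(h$density * diff(h$breaks))

tail(cum_share, 1)   # final value of the cumulative share
```

The same holds per group in the plot, which is why lines for groups of 100 and 50 observations both end at 100%.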
Answering a million years later....
I was looking for a solution for the same problem and I got here..
Eventually I figured it out by myself, so I'll drop it here in case other people will ever need it.
As required: no pre-work is necessary!
ggplot(mydata) +
geom_histogram(aes(x = x, y = ave(..count.., group, FUN = cumsum),
fill = gender, group = gender),
colour = "gray70", breaks = 1:10) +
facet_grid(rows = "gender")
