Ordering geom_bar() without a y defined variable - r

Is there a way to order the bars in geom_bar() when y is just the count of x?
Example:
ggplot(dat) +
geom_bar(aes(x = feature_1))
I tried using reorder() but it requires a defined y variable within aes().

Made up data:
dfexmpl <- data.frame(stringsAsFactors = FALSE,
group = c("a","a","a","a","a","a",
"a","a","a","b","b","b","b","b","b","b","b","b",
"b","b","b","b","b","b"))
plot code - reorder is doing the work of arranging by count:
library(dplyr)
library(ggplot2)
dfexmpl %>%
  group_by(group) %>%
  mutate(count = n()) %>%
  ggplot(aes(x = reorder(group, -count), y = count)) +
  geom_bar(stat = "identity")
results in:
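If the goal is only to sort the bars by how often each value occurs, a possible alternative is forcats::fct_infreq(), which orders factor levels by frequency so that no y variable is needed. This is a minimal sketch on the same made-up data (rebuilt with rep() for brevity); it assumes the forcats package is installed.

```r
library(ggplot2)
library(forcats)

# Same made-up data as above: 9 "a" values and 15 "b" values
dfexmpl <- data.frame(group = c(rep("a", 9), rep("b", 15)),
                      stringsAsFactors = FALSE)

# fct_infreq() puts the most frequent level first,
# so geom_bar() can keep counting the rows itself
p <- ggplot(dfexmpl, aes(x = fct_infreq(group))) +
  geom_bar() +
  labs(x = "group")
```

Wrapping the call in fct_rev() should give the ascending order instead.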

Related

Difference between geom_col() and geom_point() for same value

So, I'm trying to plot missing values here over time (longitudinal data).
I would prefer placing them in a geom_col() to fill up with colours of certain treatments afterwards. But for some weird reason, geom_col() gives me weird values, while geom_point() gives me the correct values using the same function. I'm trying to wrap my head around why this is happening. Take a look at the y-axis.
Disclaimer:
I know the missing values disappear on day 19-20. This is why I'm making the plot.
Sorry about the lay-out of the plot. Not polished yet.
For the geom_point:
gaussian_transformed %>%
  group_by(factor(time)) %>%
  mutate(missing = sum(is.na(Rose_width))) %>%
  ggplot(aes(x = factor(time), y = missing)) +
  geom_point()
Picture: geom_point
For the geom_col:
gaussian_transformed %>%
  group_by(factor(time)) %>%
  mutate(missing = sum(is.na(Rose_width))) %>%
  ggplot(aes(x = factor(time), y = missing)) +
  geom_col()
Picture: geom_col
The problem is that you're using mutate and creating several rows for your groups. You cannot see that, but you will have plenty of points overlapping in your geom_point plot.
One fix is to use summarise() instead of mutate(), or to keep a single row per group with distinct().
Compare
library(tidyverse)
msleep %>% group_by(order) %>%
mutate(missing = sum(is.na(sleep_cycle))) %>%
ggplot(aes(x = order, y = missing)) +
geom_point()
The points look ugly because there is a lot of overplotting.
msleep %>% group_by(order) %>%
mutate(missing = sum(is.na(sleep_cycle))) %>%
distinct(order, .keep_all = TRUE) %>%
ggplot(aes(x = order, y = missing)) +
geom_col()
msleep %>% group_by(order) %>%
mutate(missing = sum(is.na(sleep_cycle))) %>%
ggplot(aes(x = order, y = missing)) +
geom_col()
Created on 2021-06-02 by the reprex package (v2.0.0)
So after some digging:
What happened was that geom_col() sums up all the missing values while geom_point() does not, hence the large values for y. I do not know why this happens. However, doing the following worked fine for me:
gaussian_transformed$time <- as.factor(gaussian_transformed$time)
gaussian_transformed <- gaussian_transformed %>%
  group_by(time) %>%
  summarise(missing = sum(is.na(Rose_width)))
gaussian_transformed %>%
  ggplot(aes(x = time, y = missing)) +
  geom_col(fill = "blue", alpha = 0.5) +
  theme_minimal() +
  labs(title = "Missing values in Gaussian Outcome over the days",
       x = "Time (in days)", y = "Amount of missing values") +
  scale_y_continuous(breaks = seq(0, 10, 1))
With the plot: GaussianMissing
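On the "why": geom_col() uses position = "stack" by default, so when several rows share the same x value their y values are stacked on top of each other, whereas geom_point() simply draws them at the same spot. The sketch below reproduces those stacked heights by hand on msleep (the data used in the answer above); the names per_row and stacked are only illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the msleep data set

# mutate() keeps every row, so each group repeats its 'missing' value
per_row <- msleep %>%
  group_by(order) %>%
  mutate(missing = sum(is.na(sleep_cycle))) %>%
  ungroup()

# Stacking all rows of a group gives (rows in group) * missing
stacked <- per_row %>%
  group_by(order) %>%
  summarise(n_rows = n(),
            missing = first(missing),
            bar_height = sum(missing))

stacked
```

With one row per group (via summarise() or distinct()), there is nothing left to stack and the bar height equals missing.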

GGplot order legend using last values on x-axis

I have some time series data plotted using ggplot. I'd like the legend, which appears to the right of the plot, to be in the same order as the line on the most recent date/value on the plot's x-axis. I tried using the case_when function, but I'm obviously using it wrong. Here is an example.
df <- tibble(
x = runif(100),
y = runif(100),
z = runif(100),
year = sample(seq(1900, 2010, 10), 100, T)
) %>%
gather(variable, value,-year) %>%
group_by(year, variable) %>%
summarise(mean = mean(value))
df %>%
ggplot(aes(year, mean, color = variable)) +
geom_line()
## does not work
df %>%
mutate(variable = fct_reorder(variable, case_when(mean ~ year == 2010)))
ggplot(aes(year, mean, color = variable)) +
geom_line()
We may add one extra line
ungroup() %>% mutate(variable = fct_reorder(variable, mean, tail, n = 1, .desc = TRUE))
before plotting, or use
df %>%
mutate(variable = fct_reorder(variable, mean, tail, n = 1, .desc = TRUE)) %>%
ggplot(aes(year, mean, color = variable)) +
geom_line()
In this way we look at the last values of mean and reorder variable accordingly.
There's another way without adding a new column using fct_reorder2():
library(tidyverse)
df %>%
ggplot(aes(year, mean, color = fct_reorder2(variable, year, mean))) +
geom_line() +
labs(color = "variable")
Although it is not advisable in your case, you can order the legend by the first (earliest) values in your plot by setting .fun = first2:
df %>%
ggplot(aes(year, mean, color = fct_reorder2(variable, year, mean, .fun = first2))) +
geom_line() +
labs(color = "variable")
The default is .fun = last2 (see also https://forcats.tidyverse.org/reference/fct_reorder.html)
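To see what fct_reorder2() does without plotting, here is a small sketch on hand-made data (df_small and its values are invented for illustration, not taken from the question). With the default .fun = last2, the levels are ordered by the mean at the largest year, in descending order.

```r
library(forcats)

# Hypothetical toy data: three series observed in two years
df_small <- data.frame(
  year     = rep(c(2000, 2010), times = 3),
  variable = rep(c("x", "y", "z"), each = 2),
  mean     = c(0.9, 0.2,   # x: high in 2000, low in 2010
               0.1, 0.8,   # y: low in 2000, high in 2010
               0.5, 0.5)   # z: flat
)

# Default .fun = last2: order by 'mean' at the last 'year', largest first
lv_last <- levels(fct_reorder2(df_small$variable, df_small$year, df_small$mean))
lv_last
```

Here y ends highest in 2010, followed by z and then x, which is the order the legend would take.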

Using expand_limits in ggplot when counts are not a column in the dataset (R)

I have a dataset and I need to plot a bar chart of the counts of the different outcomes of a certain column. For this example I am using the mtcars dataset.
When I first attempted this, I found that the labels on the bars were getting cut off at the top, so I used the expand_limits() function to give them more space. As I want to be able to use this code for refreshed data, the limits might change, which is why I've used the max() function.
library(dplyr)
library(ggplot2)
library(scales)  # for comma()
mtcars_cyl_counts <- as.data.frame(table(mtcars$cyl))
colnames(mtcars_cyl_counts)[1:2] <- c("cyl", "counts")
mtcars_cyl_counts %>%
arrange(desc(counts)) %>%
ggplot(aes(x = reorder(cyl, -counts), y = counts)) +
geom_bar(stat = "identity") +
geom_text(aes(label = comma(counts), vjust = -0.5), size = 3) +
expand_limits(y = max((mtcars_cyl_counts$counts) * 1.05))
This works fine, but it seemed unnecessarily cumbersome to create a separate table of counts, and made some of the future code more complicated, so I redid this:
mtcars %>%
group_by(cyl) %>%
summarize(counts = n()) %>%
arrange(-counts) %>%
mutate(cyl = factor(cyl, cyl)) %>%
ggplot() +
geom_bar(aes(x = cyl, y = counts), stat = "identity") +
geom_text(aes(x = cyl, y = counts, label = comma(counts), vjust = -0.5), size = 3) +
expand_limits(y = max((counts) * 1.05))
However, this returns the following error:
Error in data.frame(..., stringsAsFactors = FALSE) :
object 'counts' not found
I get that 'counts' is not technically in the mtcars dataset (which is why it also doesn't work if I use mtcars$counts), but it's what I've used elsewhere in the code for y definitions.
So, is there a way to write this so that it works, or an alternative way to expand the vertical limits in a way that will adapt to different datasets?
(NB: with these examples, the bar labels don't get cut off because they aren't very big, but for the purpose of this I just need the limits expanded so that detail is not critically important to the working...)
If this helps Megan,
mtcars %>%
count(cyl) %>%
arrange(-n) %>%
mutate(cyl = factor(cyl, cyl)) %>%
ggplot(aes(cyl, n)) +
geom_text(vjust = -0.5, aes(label = n)) +
geom_bar(stat = "identity") +
expand_limits(y = max(table(mtcars$cyl) * 1.05))
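Another option that adapts to refreshed data without referring to the counts at all might be the expand argument of scale_y_continuous() together with expansion(). The sketch below assumes ggplot2 3.3.0 or later (where expansion() is available); the 10% headroom is an arbitrary choice, and the object names counts and p are illustrative.

```r
library(dplyr)
library(ggplot2)

# Count and order the categories, as in the answer above
counts <- mtcars %>%
  count(cyl) %>%
  arrange(-n) %>%
  mutate(cyl = factor(cyl, cyl))

# expansion(mult = c(0, 0.1)) adds no padding below the bars and
# 10% of the data range above them, whatever the largest count is
p <- ggplot(counts, aes(cyl, n)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.5) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
```

Because the padding is relative to the plotted range, nothing needs to change when the data are refreshed.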

Order x axis in stacked bar by subset of fill

There are multiple questions (here for instance) on how to arrange the x axis by frequency in a bar chart with ggplot2. However, my aim is to arrange the categories on the X-axis in a stacked bar chart by the relative frequency of a subset of the fill. For instance, I would like to sort the x-axis by the percentage of category B in variable z.
This was my first try using only ggplot2
library(ggplot2)
library(tibble)
library(scales)
factor1 <- as.factor(c("ABC", "CDA", "XYZ", "YRO"))
factor2 <- as.factor(c("A", "B"))
set.seed(43)
data <- tibble(x = sample(factor1, 1000, replace = TRUE),
z = sample(factor2, 1000, replace = TRUE))
ggplot(data = data, aes(x = x, fill = z, order = z)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = percent)
When that didn't work, I created a summarised data frame using dplyr, spread the data, sorted it by B, and then gathered it again. But plotting that didn't work either.
library(dplyr)
library(tidyr)
data %>%
group_by(x, z) %>%
count() %>%
spread(z, n) %>%
arrange(-B) %>%
gather(z, n, -x) %>%
ggplot(aes(x = reorder(x, n), y = n, fill = z)) +
geom_bar(stat = "identity", position = "fill") +
scale_y_continuous(labels = percent)
I would prefer a solution with ggplot only, in order not to be dependent on the order in the data frame created by dplyr/tidyr. However, I'm open to anything.
If you want to sort by absolute frequency:
lvls <- names(sort(table(data[data$z == "B", "x"])))
If you want to sort by relative frequency:
lvls <- names(sort(tapply(data$z == "B", data$x, mean)))
Then you can create the factor on the fly inside ggplot:
ggplot(data = data, aes(factor(x, levels = lvls), fill = z)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = percent)
A solution using tidyverse would be:
data %>%
mutate(x = forcats::fct_reorder(x, as.numeric(z), .fun = mean)) %>%
ggplot(aes(x, fill = z)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = percent)
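As a quick check that the factor levels really follow the share of B, the sketch below rebuilds the question's data and compares the levels produced by fct_reorder() with the shares computed via tapply(). It uses the logical z == "B" in place of as.numeric(z), which should give the same ordering; share_b and ordered are illustrative names.

```r
library(dplyr)
library(forcats)

# Rebuild the example data from the question
set.seed(43)
factor1 <- as.factor(c("ABC", "CDA", "XYZ", "YRO"))
factor2 <- as.factor(c("A", "B"))
data <- tibble(x = sample(factor1, 1000, replace = TRUE),
               z = sample(factor2, 1000, replace = TRUE))

# Share of B within each x category
share_b <- tapply(data$z == "B", data$x, mean)

# Reorder x by that share (ascending); note the argument is .fun
ordered <- data %>%
  mutate(x = fct_reorder(x, z == "B", .fun = mean))

levels(ordered$x)
sort(share_b)
```

Adding .desc = TRUE to fct_reorder() should put the category with the largest share of B first.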

ggplot2() bar chart fill argument

I've got a data frame with two categorical variables called verified and procedure.
I'd like to make a bar chart with procedure on the x-axis, and the corresponding percentages rather than counts on the y-axis. Furthermore, I'd like for verified to be the fill of the bars.
The problem's that when I've tried using the fill argument it hasn't worked. My current code gets me bars that are all grey with a black line (despite the absence of a fill argument the black line seems to indicate the levels of verified???). Instead I'd like the levels to be in different colours.
Thanks!
starting point (df):
df <- data.frame(verified=c("small","large","small","small","large","small","small","large","small"),procedure=c(1,2,1,2,1,2,2,2,2))
current code:
library(dplyr)
library(ggplot2)
df %>%
count(procedure,verified) %>%
mutate(prop = round((n / sum(n)) * 100, 2)) %>%
group_by(procedure) %>%
ggplot(aes(x = procedure, y = prop)) +
geom_bar(stat = "identity",colour="black")
Just add fill = verified to your initial aes() or within your geom_bar():
# common elements
g_df <- df %>%
count(procedure, verified) %>%
mutate(prop = round((n / sum(n)) * 100, 2)) %>%
group_by(procedure)
# fill added to initial aes
g1 <- ggplot(g_df, aes(x = procedure, y = prop, fill = verified)) +
geom_bar(stat = "identity", colour = "black")
# fill added to geom_bar
g2 <- ggplot(g_df, aes(x = procedure, y = prop)) +
geom_bar(aes(fill = verified), stat = "identity", colour = "black")
Both g1 and g2 produce the same plot below
As suggested by eipi10 in the comments to my answer, you could clean up the x-axis by making it a factor; a modification of their code is below.
df %>%
count(procedure, verified) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = factor(procedure), y = prop, fill = verified)) +
geom_bar(stat = "identity", colour = "black") +
labs(x = "procedure", y = "percent")
to produce
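Note that the code above computes each proportion relative to all rows. If the percentages are instead meant to sum to 100% within each procedure (an assumption about the intent, not something stated in the question), a sketch would group before computing prop. The name props is illustrative, and scales::percent is used only for axis labels.

```r
library(dplyr)
library(ggplot2)

df <- data.frame(
  verified  = c("small", "large", "small", "small", "large",
                "small", "small", "large", "small"),
  procedure = c(1, 2, 1, 2, 1, 2, 2, 2, 2)
)

# Group first, so sum(n) is taken within each procedure
props <- df %>%
  count(procedure, verified) %>%
  group_by(procedure) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

p <- ggplot(props, aes(x = factor(procedure), y = prop, fill = verified)) +
  geom_col(colour = "black") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "procedure", y = "percent")
```

Each bar then stacks to 100%, with the fill showing the split between the levels of verified.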
