I'm having trouble understanding what the 'fct_reorder2()' function does - r

I'm trying to understand what fct_reorder2() does, but the R documentation is not clear on this. The following is copied from the help page for this function:
df <- tibble::tribble(
~color, ~a, ~b,
"blue", 1, 2,
"green", 6, 2,
"purple", 3, 3,
"red", 2, 3,
"yellow", 5, 1
)
df$color <- factor(df$color)
#> Levels: blue green purple red yellow
fct_reorder(df$color, df$a, min)
#>Levels: blue red purple yellow green
fct_reorder2(df$color, df$a, df$b)
#> Levels: purple red blue green yellow
Can someone help me understand why fct_reorder2() sorts the data this way?

The example in the docs is poor, because there is only one row of data for each factor level.
What helped me was a chapter of R4DS: tidyverse and beyond, which is based on an example in the original R4DS, which I reproduce below.
fct_reorder2() reorders the factor by the y values associated with the largest x values. This makes the plot easier to read because the line colours line up with the legend.
library(tidyverse)
by_age <- gss_cat %>%
filter(!is.na(age)) %>%
count(age, marital) %>%
group_by(age) %>%
mutate(prop = n / sum(n))
ggplot(by_age, aes(age, prop, colour = marital)) +
geom_line(na.rm = TRUE)
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
geom_line() +
labs(colour = "marital")
Created on 2021-03-03 by the reprex package (v1.0.0)
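To see the rule on something smaller than gss_cat, here is a minimal sketch with a made-up data frame (`toy`, `series`, `x` and `y` are hypothetical names, not from the docs), assuming the default `.fun = last2`:

```r
library(forcats)

# Hypothetical toy data: three series, each observed at x = 1, 2, 3
toy <- data.frame(
  series = factor(rep(c("a", "b", "c"), each = 3)),
  x      = rep(1:3, times = 3),
  y      = c(1, 2, 3,  9, 5, 1,  4, 4, 4)
)

# y at the largest x of each series: a = 3, b = 1, c = 4.
# The default .desc = TRUE sorts those values from largest to smallest.
reordered <- fct_reorder2(toy$series, toy$x, toy$y)
levels(reordered)
#> [1] "c" "a" "b"
```

With several rows per level, the ordering depends only on where each series ends, which is what lines the legend up with the right-hand side of a line plot.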

fct_reorder2() uses last2 as its default .fun. If we check the source code:
last2
function (.x, .y)
{
.y[order(.x, na.last = FALSE)][length(.y)]
}
and now using last2
last2(df$a, df$b)
#[1] 2
df$b[order(df$a, na.last = FALSE)][length(df$b)]
#[1] 2
Now, if we check the source code of fct_reorder2, it calls lvls_reorder, which in turn calls refactor
s1 <- tapply(seq_along(df$a), df$color, function(i) last2(df$a[i], df$b[i]))
forcats:::refactor(df$color, levels(df$color)[order(s1, decreasing = TRUE)])
#[1] blue green purple red yellow
#Levels: purple red blue green yellow
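Looking at the per-level summary itself may make the result easier to follow. This is only a sketch: it rebuilds `df` with base R instead of tribble, and assumes `last2` is exported by the installed forcats version.

```r
library(forcats)

df <- data.frame(
  color = factor(c("blue", "green", "purple", "red", "yellow")),
  a = c(1, 6, 3, 2, 5),
  b = c(2, 2, 3, 3, 1)
)

# One row per level, so last2() simply returns that row's b:
# blue = 2, green = 2, purple = 3, red = 3, yellow = 1
s1 <- tapply(seq_along(df$a), df$color, function(i) last2(df$a[i], df$b[i]))

# Sorting descending puts purple and red (3) first, then blue and green (2),
# then yellow (1); tied levels keep their original order
new_levels <- levels(df$color)[order(s1, decreasing = TRUE)]
new_levels
#> [1] "purple" "red"    "blue"   "green"  "yellow"
```

So in the docs example, `a` plays no real role: with a single row per level, the factor is just sorted by `b`, largest first.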
Code lineage
fct_reorder2
function (.f, .x, .y, .fun = last2, ..., .desc = TRUE)
{
f <- check_factor(.f)
stopifnot(length(f) == length(.x), length(.x) == length(.y))
ellipsis::check_dots_used()
summary <- tapply(seq_along(.x), f, function(i) .fun(.x[i],
.y[i], ...))
if (is.list(summary)) {
stop("`fun` must return a single value per group", call. = FALSE)
}
lvls_reorder(.f, order(summary, decreasing = .desc))
}
lvls_reorder
function (f, idx, ordered = NA)
{
f <- check_factor(f)
if (!is.numeric(idx)) {
stop("`idx` must be numeric", call. = FALSE)
}
if (!setequal(idx, lvls_seq(f)) || length(idx) != nlevels(f)) {
stop("`idx` must contain one integer for each level of `f`",
call. = FALSE)
}
refactor(f, levels(f)[idx], ordered = ordered)
}

I was also confused by this function and it took me a lot of time to understand. Below is my understanding. I hope it helps.
Imagine that you are trying to draw a line plot with multiple lines. The lines are distinguished by their colors. Allow me to use code from #nacnudus, for clarity.
library(tidyverse)
by_age <- gss_cat %>%
filter(!is.na(age)) %>%
count(age, marital) %>%
group_by(age) %>%
mutate(prop = n / sum(n)) %>%
ungroup()
# here we use a customized sequential palette
my_palette = c("#7a0177", "#c51b8a", "#f768a1", "#fa9fb5", "#fcc5c0", "#feebe2")
scales::show_col(my_palette)
my_palette
ggplot(by_age, aes(age, prop, colour = marital)) +
scale_color_manual(values = my_palette) +
geom_line(na.rm = TRUE)
line plot without reorder
Without reordering the marital variable, the colors look messy. For the colors to match the data, we generally want them to follow the value at the right-hand end of each line: the larger the value at the rightmost point, the darker the color.
So our next task is to reorder the variable marital by its value at the rightmost point (i.e., the value of prop at each marital level's largest age).
new_order = by_age %>%
group_by(marital) %>% # for each marital
arrange(age, prop) %>% # sort first by age, and then by prop
dplyr::filter(age == max(age)) %>% # find the largest age in each marital
ungroup() %>%
arrange(desc(prop)) %>% # reorder the data by prop
pull(marital) %>% # get the new order
as.character() # convert the new order to general character
# Then we plot the reordered line plot
ggplot(by_age, aes(age, prop, colour = fct_relevel(marital, new_order))) + # now we reorder the marital by new orders
geom_line() +
scale_color_manual(values = my_palette) +
labs(colour = "marital")
reordered line plot 1
Now its colors look pretty reasonable.
And it's EXACTLY what fct_reorder2() does for you.
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) + # now we use fct_reorder2 to reorder marital
geom_line() +
scale_color_manual(values = my_palette) +
labs(colour = "marital")
reordered line plot 2
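As a quick check of that equivalence, here is a sketch on a small made-up data set (`toy`, `manual_order` and the column names are hypothetical). It assumes no ties in the final values; with ties the two routes could order the tied levels differently.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data: three series, each observed at x = 1, 2, 3
toy <- tibble(
  series = factor(rep(c("a", "b", "c"), each = 3)),
  x      = rep(1:3, times = 3),
  y      = c(1, 2, 3,  9, 5, 1,  4, 4, 4)
)

# Manual route: take y at each series' largest x, then sort descending
manual_order <- toy %>%
  group_by(series) %>%
  filter(x == max(x)) %>%
  ungroup() %>%
  arrange(desc(y)) %>%
  pull(series) %>%
  as.character()

manual_levels <- levels(fct_relevel(toy$series, manual_order))
auto_levels   <- levels(fct_reorder2(toy$series, toy$x, toy$y))

identical(manual_levels, auto_levels)
#> [1] TRUE
```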
What it does is, for each marital level, sort the rows by age and pick the prop value at the largest age (in this case, there is only one such value). The picked values are then used to reorder the factor levels, largest first (by default .fun = last2). If you set .fun = first2, the function reorders marital by the prop value at each marital level's smallest age:
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop, .fun = first2))) +
geom_line() +
scale_color_manual(values = my_palette) +
labs(colour = "marital")
reordered line plot 3
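The difference between the two helpers is easier to see without a plot. A minimal sketch on made-up data (`toy` and its columns are hypothetical), assuming both `last2` and `first2` are available in the installed forcats:

```r
library(forcats)

# Hypothetical toy data: three series, each observed at x = 1, 2, 3
toy <- data.frame(
  series = factor(rep(c("a", "b", "c"), each = 3)),
  x      = rep(1:3, times = 3),
  y      = c(1, 2, 3,  9, 5, 1,  4, 4, 4)
)

# last2 (the default): y at the largest x -> a = 3, b = 1, c = 4
by_last <- levels(fct_reorder2(toy$series, toy$x, toy$y))
by_last
#> [1] "c" "a" "b"

# first2: y at the smallest x -> a = 1, b = 9, c = 4
by_first <- levels(fct_reorder2(toy$series, toy$x, toy$y, .fun = first2))
by_first
#> [1] "b" "c" "a"
```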
I hope it helps.

Related

How to add a gradient fill to a geom_density chart

I have a dataset where I'd like to plot the density of one column and add a gradient fill that is associated with another column.
For example, this code creates the following plot
library(datasets)
library(tidyverse)
df <- airquality
df %>%
group_by(Temp) %>%
mutate(count = n(),
avgWind = mean(Wind)) %>%
ggplot(aes(x = Temp, fill = avgWind)) +
geom_density()
What I'd like is for the plot to have a gradient fill that indicates what the average wind (avgWind) was at each temperature along the x-axis.
I've seen some examples that allow me to create a gradient fill that is associated with the values on the x-axis itself (in this case, Temp) or by percentile/quantiles, but I'd like the gradient fill to be associated with an additional variable.
It's sort of like this, but instead of a bar plot, I'd like to keep it as a smoothed density chart:
df %>%
group_by(Temp) %>%
mutate(count = n(),
avgWind = mean(Wind)) %>%
ggplot(aes(x = (Temp), fill = avgWind, group = Temp)) +
geom_bar(aes(y = (..count..)/sum(..count..)))
You can't do gradient fills in geom_polygon so the usual solution is to draw lots of line segments. For example you could do something like this:
library("datasets")
library("tidyverse")
library("viridis")
df <- airquality
df <- df %>%
group_by(Temp) %>%
mutate(count = n(), avgWind = mean(Wind))
## Since we (presumably) want continuous fill, we need to interpolate to
## get avgWind at each Temp value.
## The edges are grey because the KDE estimates density in regions
## where we don't know the relationship between Temp and avgWind
d2fun <- approxfun(df$Temp, df$avgWind)
#> Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
#> 'x' values
dens <- density(df$Temp)
dens_df <- data.frame(x = dens$x, y = dens$y, fill = d2fun(dens[["x"]]))
ggplot(dens_df) +
geom_segment(aes(x = x, xend = x, y = 0, yend = y, color = fill)) +
scale_color_viridis()
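The grey edges mentioned in the comments come from approxfun() returning NA outside the observed temperature range. A minimal sketch of that behaviour, which also averages Wind per distinct Temp first so that the ties warning should not arise (`avg` and `wind_at` are names made up for this illustration):

```r
# Mean wind per distinct temperature, one row per Temp
avg <- aggregate(Wind ~ Temp, data = airquality, FUN = mean)

# Linear interpolation; the default rule = 1 gives NA outside range(avg$Temp)
wind_at <- approxfun(avg$Temp, avg$Wind)

range(avg$Temp)

# 40 and 110 lie outside the observed temperatures, 75.5 lies inside
probe <- wind_at(c(40, 75.5, 110))
is.na(probe)
#> [1]  TRUE FALSE  TRUE
```

Since density() extends its grid a little beyond the data, the outermost segments get an NA fill and ggplot2 draws them in its NA colour.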

ggplot2 - Two color series in area chart

I've got a question regarding an edge case with ggplot2 in R.
They don't like you adding multiple legends, but I think this is a valid use case.
I've got a large economic dataset with the following variables.
year = year of observation
input_type = *labor* or *supply chain*
input_desc = specific type of labor (eg. plumbers OR building supplies respectively)
value = percentage of industry spending
And I'm building an area chart over approximately 15 years. There are 39 different input descriptions, so I'd like the user to see the two major components (internal employee spending OR outsourcing/supply spending) in two major color brackets (say green and blue), but ggplot won't let me group my colors in that way.
Here are a few things I tried.
Junk code to reproduce
spec_trend_pie<- data.frame("year"=c(2006,2006,2006,2006,2007,2007,2007,2007,2008,2008,2008,2008),
"input_type" = c("labor", "labor", "supply", "supply", "labor", "labor","supply","supply","labor","labor","supply","supply"),
"input_desc" = c("plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck", "plumber" ,"manager", "pipe", "truck"),
"value" = c(1,2,3,4,4,3,2,1,1,2,3,4))
spec_broad <- ggplot(data = spec_trend_pie, aes(y = value, x = year, group = input_type, fill = input_desc)) + geom_area()
Which gave me
Error in f(...) : Aesthetics can not vary with a ribbon
And then I tried this
sff4 <- ggplot() +
geom_area(data=subset(spec_trend_pie, input_type="labor"), aes(y=value, x=variable, group=input_type, fill= input_desc)) +
geom_area(data=subset(spec_trend_pie, input_type="supply_chain"), aes(y=value, x=variable, group=input_type, fill= input_desc))
Which gave me this image...so closer...but not quite there.
To give you an idea of what is desired, here's an example of something I was able to do in GoogleSheets a long time ago.
It's a bit of a hack but forcats might help you out. I did a similar post earlier this week:
How to factor sub group by category?
First some base data
library(tidyverse)
set.seed(123)
raw_data <-
tibble(
x = rep(1:20, each = 6),
rand = sample(1:120, 120) * (x/20),
group = rep(letters[1:6], times = 20),
cat = ifelse(group %in% letters[1:3], "group 1", "group 2")
) %>%
group_by(group) %>%
mutate(y = cumsum(rand)) %>%
ungroup()
Now, use factor levels to create gradients within colors
df <-
raw_data %>%
# create factors for group and category
mutate(
group = fct_reorder(group, y, max),
cat = fct_reorder(cat, y, max) # ordering in the stack
) %>%
arrange(cat, group) %>%
mutate(
group = fct_inorder(group), # takes the category into account first
group_fct = as.integer(group), # factor as integer
hue = as.integer(cat)*(360/n_distinct(cat)), # base hue values
light_base = 1-(group_fct)/(n_distinct(group)+2), # trust me
light = floor(light_base * 100) # new L value for hcl()
) %>%
mutate(hex = hcl(h = hue, l = light))
Create a lookup table for scale_fill_manual()
area_colors <-
df %>%
distinct(group, hex)
Lastly, make your plot
ggplot(df, aes(x, y, fill = group)) +
geom_area(position = "stack") +
scale_fill_manual(
values = area_colors$hex,
labels = area_colors$group
)
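The hue and lightness arithmetic above boils down to one hue per category and a different `l` per group within it. A small sketch of what hcl() returns for that pattern (the hue and lightness numbers here are arbitrary choices for illustration, not the ones computed above):

```r
# Same hue, decreasing lightness: three shades of one colour family
shades_1 <- hcl(h = 180, l = c(80, 60, 40))

# A different hue for the second category, same lightness steps
shades_2 <- hcl(h = 360, l = c(80, 60, 40))

# Each is a character vector of hex codes that can be passed to
# scale_fill_manual(values = ...)
shades_1
shades_2
```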

How can I organize my legend into subgroups?

The legend for my bar graph currently lists all the items in the graph in one long list. I would like to have the legend group itself by each column.
The number of columns is dynamic so the legend must be able to adjust accordingly.
library("phyloseq"); packageVersion("phyloseq")
library(ggplot2)
library(scales)
data("GlobalPatterns")
TopNOTUs <- names(sort(taxa_sums(GlobalPatterns), TRUE)[1:50])
gp.ch <- prune_species(TopNOTUs, GlobalPatterns)
gp.ch = subset_taxa(gp.ch, Genus != "NA")
mdf = psmelt(gp.ch)
# Create a ggplot similar to
library("ggplot2")
mdf$group <- paste0(mdf$Phylum, "-", mdf$Genus)
colours <-ColourPalleteMulti(mdf, "Phylum", "Genus")
# Plot results
ggplot(mdf, aes(Phylum)) +
geom_bar(aes(fill = group), colour = "grey", position = "stack")
Right now the legend prints the items:
Actinobacteria-Bifidobacterium
Actinobacteria-Rothia
Bacteriodetes-Alistipes
Bacteriodetes-Bacteroides
...
I would like it to print:
Actinobacteria
-Bifidobacterium
-Rothia
Bacteriodetes
-Alistipes
-Bacteroides
...
This is hacky but might work for you. First, using mtcars dataset, I add dummy rows to the data representing the groupings, then assign a factor level to each of the groupings and component categories. Finally, I hack the alpha in the legend so that grouping headers have transparent colors and look hidden.
# Fake data sample
library(tidyverse)
cars_sample <- mtcars %>%
rownames_to_column(var = "name") %>%
mutate(make = word(name, end = 1),
model = word(name, start = 2, end = -1)) %>%
filter(make %in% c("Mazda", "Merc", "Hornet")) %>%
select(name, make, model, mpg, wt)
# Add rows for groups and make a factor for each group and each component
cars_sample_fct <- cars_sample %>%
bind_rows( cars_sample %>% count(make) %>% mutate(model = make, name = "")) %>%
arrange(make, name) %>%
mutate(name_fct = fct_inorder(if_else(name == "", make, paste0("- ", model))))
# Plot with transparent grouping legend labels
ggplot(cars_sample_fct, aes(wt, mpg, color = name_fct)) +
geom_point() +
scale_color_discrete(name = "Car") +
guides(color = guide_legend(
override.aes = list(size = 5,
alpha = cars_sample_fct$name != "")))

Apply Geom Layer Conditionally - Separate Points & Lines

I have a data set similar to the one below where I have a lot of data for certain groups and then only single observations for other groups. I would like my single observations to show up as points but the other groups with multiple observations to show up as lines (no points). My code is below:
EDIT: I'm attempting to find a way to do this without using multiple datasets in the geom_* calls because of the issues it causes with the legend. There was an answer that has since been deleted that was able to handle the legend but didn't get rid of the points on the lines. I would potentially like a single legend with points only showing up if they are a single observation.
library(tidyverse)
dat <- tibble(x = runif(10, 0, 5),
y = runif(10, 0, 20),
group = c(rep("Group1", 4),
rep("Group2", 4),
"Single Point 1",
"Single Point 2")
)
dat %>%
ggplot(aes(x = x, y = y, color = group)) +
geom_point() +
geom_line()
Created on 2019-04-02 by the reprex package (v0.2.1)
Only plot the data with 1 point in geom_point() and the data with >1 point in geom_line(). These can be precalculated in mutate().
dat = dat %>%
group_by(group) %>%
mutate(n = n() )
dat %>%
ggplot(aes(x = x, y = y, color = group)) +
geom_point(data = filter(dat, n == 1) ) +
geom_line(data = filter(dat, n > 1) )
Having the legend match this is trickier. This is the sort of thing that the override.aes argument in guide_legend() can be useful for.
In your case I would separately calculate the number of observations in each group first, since that is what the line vs point is based on.
sumdat = dat %>%
group_by(group) %>%
summarise(n = n() )
The result is in the same order as the factor levels in the legend, which is why this works.
Now we need to remove lines and keep points whenever the group has only a single observation. 0 stands for a blank line and NA stands for no shape. I use an ifelse() statement for linetype and shape in override.aes, based on the number of observations per group.
dat %>%
ggplot(aes(x = x, y = y, color = group)) +
geom_point(data = filter(dat, n == 1) ) +
geom_line(data = filter(dat, n > 1) ) +
guides(color = guide_legend(override.aes = list(linetype = ifelse(sumdat$n == 1, 0, 1),
shape = ifelse(sumdat$n == 1, 19, NA) ) ) )
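To see what actually gets passed to override.aes, it can help to print the two ifelse() results next to the group names. A minimal sketch using only the group column of the example data (`legend_aes` is a name made up here):

```r
library(dplyr)

dat <- tibble(group = c(rep("Group1", 4), rep("Group2", 4),
                        "Single Point 1", "Single Point 2"))

sumdat <- dat %>%
  group_by(group) %>%
  summarise(n = n())

# One entry per legend key, in the order of sumdat$group
legend_aes <- data.frame(
  group    = sumdat$group,
  linetype = ifelse(sumdat$n == 1, 0, 1),   # 0 = blank line
  shape    = ifelse(sumdat$n == 1, 19, NA)  # NA = no point
)
legend_aes
# Group1 and Group2 get linetype 1 and shape NA;
# the two single-point groups get linetype 0 and shape 19
```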

Stacked bar graphs in plotly: how to control the order of bars in each stack

I'm trying to order a stacked bar chart in plotly, but it is not respecting the order I pass it in the data frame.
It is best shown using some mock data:
library(dplyr)
library(plotly)
cars <- sapply(strsplit(rownames(mtcars), split = " "), "[", i = 1)
dat <- mtcars
dat <- cbind(dat, cars, stringsAsFactors = FALSE)
dat <- dat %>%
mutate(carb = factor(carb)) %>%
distinct(cars, carb, .keep_all = TRUE) %>% # .keep_all retains mpg for the next step
select(cars, carb, mpg) %>%
arrange(carb, desc(mpg))
plot_ly(dat) %>%
add_trace(data = dat, type = "bar", x = carb, y = mpg, color = cars) %>%
layout(barmode = "stack")
The resulting plot doesn't respect the ordering: I want the cars with the largest mpg stacked at the bottom of each carb group. Any ideas?
As already pointed out here, the issue is caused by having duplicate values in the column used for color grouping (in this example, cars). As indicated already, the ordering of the bars can be remedied by grouping your colors by a column of unique names. However, doing so will have a couple of undesired side-effects:
different model cars from the same manufacturer would be shown with different colors (not what you are after - you want to color by manufacturer)
the legend will have more entries in it than you want i.e. one per model of car rather than one per manufacturer.
We can hack our way around this by a) creating the legend from a dummy trace that never gets displayed (add_trace(type = "bar", x = 0, y = 0... in the code below), and b) setting the colors for each category manually using the colors= argument. I use a rainbow palette below to show the principle. You may like to select some more attractive colours yourself.
dat$unique.car <- make.unique(as.character(dat$cars))
dat2 <- data.frame(cars=levels(as.factor(dat$cars)),color=rainbow(nlevels(as.factor(dat$cars))))
dat2[] <- lapply(dat2, as.character)
dat$color <- dat2$color[match(dat$cars,dat2$cars)]
plot_ly() %>%
add_trace(data=dat2, type = "bar", x = 0, y = 0, color = cars, colors=color, showlegend=T) %>%
add_trace(data=dat, type = "bar", x = carb, y = mpg, color = unique.car, colors=color, showlegend=F, marker=list(line=list(color="black", width=1))) %>%
layout(barmode = "stack", xaxis = list(range=c(0.4,8.5)))
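For reference, this is what make.unique() does to repeated manufacturer names, shown on a short made-up vector (`makers` is hypothetical, not the mtcars data):

```r
# Repeated names get a numeric suffix; first occurrences are left alone
makers <- c("Merc", "Mazda", "Merc", "Merc", "Mazda")
unique_makers <- make.unique(makers)
unique_makers
#> [1] "Merc"    "Mazda"   "Merc.1"  "Merc.2"  "Mazda.1"
```

Every bar then has its own trace name, which is why the stacking order is respected while the colours still have to be assigned per manufacturer by hand.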
One way to address this is to give unique names to all models of car and use that in plotly, but it's going to make the legend messier and impact the color mapping. Here are a few options:
dat$carsID <- make.unique(as.character(dat$cars))
# dat$carsID <- apply(dat, 1, paste0, collapse = " ") # alternative
plot_ly(dat) %>%
add_trace(data = dat, type = "bar", x = carb, y = mpg, color = carsID) %>%
layout(barmode = "stack")
plot_ly(dat) %>%
add_trace(data = dat, type = "bar", x = carb, y = mpg, color = carsID,
colors = rainbow(length(unique(carsID)))) %>%
layout(barmode = "stack")
I'll look more tomorrow to see if I can improve the legend and color mapping.
