Alluvial plot with 2 different sources but a converging/shared variable [R]


I have experience making alluvial plots with the ggalluvial package. However, I have run into an issue while trying to create an alluvial plot with two different sources that converge onto one variable.
Here is example data:
library(dplyr)
library(ggplot2)
library(ggalluvial)
data <- data.frame(
unique_alluvium_entires = 1:10,
label_1 = c("A", "B", "C", "D", "E", rep(NA, 5)),
label_2 = c(rep(NA, 5), "F", "G", "H", "I", "J"),
shared_label = c("a", "b", "c", "c", "c", "c", "c", "a", "a", "b")
)
Here is the code I use to make the plot:
#prep the data
data <- data %>%
group_by(shared_label) %>%
mutate(freq = n())
data <- reshape2::melt(data, id.vars = c("unique_alluvium_entires", "freq"))
data$variable <- factor(data$variable, levels = c("label_1", "shared_label", "label_2"))
#ggplot
ggplot(data,
aes(x = variable, stratum = value, alluvium = unique_alluvium_entires,
y = freq, fill = value, label = value)) +
scale_x_discrete(expand = c(.1, .1)) +
geom_flow() +
geom_stratum(color = "grey", width = 1/4, na.rm = TRUE) +
geom_text(stat = "stratum", size = 4) +
theme_void() +
theme(
axis.text.x = element_text(size = 12, face = "bold")
)
(apparently I cannot embed images yet)
As you can see, I can remove the NA values, but the shared_label column does not "stack" properly. Each unique row should stack on top of the others in the shared_label column. This would also fix the sizing issue, so that they are equal in size along the y axis.
Any ideas how to fix this? I have tried ggsankey, but the same issue arises and I cannot remove the NA values. Any tips are greatly appreciated!

This plot is the expected result of the "flow" statistical transformation, which is the default for the "flow" graphical object. (That is, geom_flow() = geom_flow(stat = "flow").) It looks like what you want is to specify the "alluvium" statistical transformation instead. Below, I've reused all of your code and copied and edited only the ggplot() call.
#ggplot
ggplot(data,
aes(x = variable, stratum = value, alluvium = unique_alluvium_entires,
y = freq, fill = value, label = value)) +
scale_x_discrete(expand = c(.1, .1)) +
geom_flow(stat = "alluvium") + # <-- specify alternate stat
geom_stratum(color = "grey", width = 1/4, na.rm = TRUE) +
geom_text(stat = "stratum", size = 4) +
theme_void() +
theme(
axis.text.x = element_text(size = 12, face = "bold")
)
#> Warning: Removed 2 rows containing missing values (geom_text).
Created on 2021-12-10 by the reprex package (v2.0.1)
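As a side note on the data preparation, the same long-format table can probably be built without reshape2. The sketch below is not part of the original answer: it assumes dplyr::add_count() and tidyr::pivot_longer() are acceptable substitutes for the group_by()/mutate()/melt() steps, and the object name `long` is made up for illustration.

```r
library(dplyr)
library(tidyr)

data <- data.frame(
  unique_alluvium_entires = 1:10,
  label_1 = c("A", "B", "C", "D", "E", rep(NA, 5)),
  label_2 = c(rep(NA, 5), "F", "G", "H", "I", "J"),
  shared_label = c("a", "b", "c", "c", "c", "c", "c", "a", "a", "b")
)

long <- data %>%
  # freq = number of rows sharing the same shared_label (no grouping left behind)
  add_count(shared_label, name = "freq") %>%
  # one row per (alluvium, axis) pair, matching the melt() output columns
  pivot_longer(c(label_1, shared_label, label_2),
               names_to = "variable", values_to = "value") %>%
  # fix the axis order used on the x axis
  mutate(variable = factor(variable,
                           levels = c("label_1", "shared_label", "label_2")))
```

`long` has the same four columns as the melted `data` (unique_alluvium_entires, freq, variable, value), so it should be usable in the ggplot() call above in its place.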

Related

ggplot2: enforcing empty space for some missing levels in a plot

In the following example (using the iris dataset), I create a factor variable for which one of the species contains no values of level C. When I make the plot, I cannot find a way to stop ggplot from dropping the empty level (virginica-C). A previous post (from 10 years ago) suggests using the argument drop = FALSE, but it is not working for me. Any suggestions?
require(dplyr)
require(ggplot2)
iris %>%
mutate(fct_x = factor(x = sample(x = c("A", "B", "C"), size = nrow(.), replace = TRUE),
levels = c("A", "B", "C"))) %>%
filter(!(Species == "virginica" & fct_x == "C")) %>%
ggplot(aes(x = Species, y = Sepal.Length, fill = fct_x)) +
geom_boxplot() +
scale_fill_discrete(drop = FALSE)
In other words, the code shown above generates the following graphic. As you can see, the virginica group does NOT show an empty space for group C (because there are no elements of type virginica-C) and that is exactly what I want to achieve: to show that empty space in the figure.
PS: There is also another similar post (from 6 years ago) in which they suggest placing values outside the limits. It is not a bad idea when you have to make a point plot, but in my case I am making a script that generates automatic plots from incoming information and, therefore, I cannot limit the y-axis since the script itself defines the ylim according to the values that appear.

You can specify the position function in the geom_boxplot call. In dodge2 (the default position parameter) you can set preserve="single" so the width of all the single columns is the same.
iris %>%
mutate(fct_x = factor(x = sample(x = c("A", "B", "C"), size = nrow(.), replace = TRUE),
levels = c("A", "B", "C"))) %>%
filter(!(Species == "virginica" & fct_x == "C")) %>%
mutate(fct_x = factor(fct_x, levels = c("A", "B", "C"))) %>%
ggplot(aes(x = Species, y = Sepal.Length, fill = fct_x)) +
geom_boxplot(position=position_dodge2(preserve="single"))
See the definition of position_dodge2(): https://ggplot2.tidyverse.org/reference/position_dodge.html
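Before choosing a fix, it can help to confirm which Species/fct_x combinations are actually empty. The following is a minimal sketch, assuming dplyr's count() with .drop = FALSE; the seed and the names `iris_fct` and `combos` are illustrative and not from the original post.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
iris_fct <- iris %>%
  mutate(fct_x = factor(sample(c("A", "B", "C"), size = n(), replace = TRUE),
                        levels = c("A", "B", "C"))) %>%
  filter(!(Species == "virginica" & fct_x == "C"))

# .drop = FALSE keeps factor combinations that have zero rows
combos <- count(iris_fct, Species, fct_x, .drop = FALSE)

# Combinations with no observations (here: virginica / C)
combos[combos$n == 0, ]
```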

You could get the empty slot by faceting with scales = "free_x" and using scale_x_discrete(drop = FALSE):
(The strip labels could be moved to the bottom, and the fct_x labels & gaps between facets removed, if preferred per the second example.)
require(dplyr)
require(ggplot2)
iris %>%
mutate(fct_x = factor(
x = sample(x = c("A", "B", "C"), size = nrow(.), replace = TRUE),
levels = c("A", "B", "C")
)) %>%
filter(!(Species == "virginica" & fct_x == "C")) %>%
ggplot(aes(x = fct_x, y = Sepal.Length, fill = fct_x)) +
geom_boxplot() +
facet_wrap(~ Species, scales = "free_x") +
scale_x_discrete(drop = FALSE)
Created on 2022-06-16 by the reprex package (v2.0.1)
# Mimicking the original plot
require(dplyr)
require(ggplot2)
iris %>%
mutate(fct_x = factor(
x = sample(x = c("A", "B", "C"), size = nrow(.), replace = TRUE),
levels = c("A", "B", "C")
)) %>%
filter(!(Species == "virginica" & fct_x == "C")) %>%
ggplot(aes(x = fct_x, y = Sepal.Length, fill = fct_x)) +
geom_boxplot() +
facet_wrap(~ Species, scales = "free_x", strip.position = "bottom") +
scale_x_discrete(drop = FALSE) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
strip.background = element_blank(),
panel.spacing = unit(0, "lines")) +
labs(x = "Species")
Created on 2022-06-16 by the reprex package (v2.0.1)

I can't have a really back-to-back bar plot [duplicate]

I want to plot a back-to-back bar plot in which each side is on an independent axis. I can plot them back to back by taking the negative of one set, but that leaves them on the same axis, and because the p-values are much smaller, their bars are barely visible.
library(ggplot2)
df <-structure(list(Description = c("a", "b", "c", "d", "e", "f",
"g", "h", "a", "b", "c", "d", "e", "f", "g", "h"), test = c("size",
"size", "size", "size", "size", "size", "size", "size", "p",
"p", "p", "p", "p", "p", "p", "p"), value = c(0.1, 0.1, 0.125,
0.1, 0.075, 0.1, 0.075, 0.125, 0.000230705311441713, 0.000314488619269942,
0.00106639822095382, 0.00108290238851994, 0.00114723539549198,
0.00160204850890075, 0.0019276388745184, 0.00320371567547557)), .Names = c("Description",
"test", "value"), row.names = c(NA, -16L), class = "data.frame")
df$value[df$test == 'p'] <- -(df$value[df$test == 'p'])
ggplot(df, aes(x=Description, y= value, group=test, fill=test)) + geom_col() +coord_flip()
Ideally, I would like each group on an independent axis, so that the bars meet at zero (in the middle of the plot region) but are on different scales. For this example, the limits would be something like ylim(0, 0.13) for size and c(0, 0.0035) for the p-values.

You can do this by using facets, and tweaking to remove the spacing between facets:
ggplot(df, aes(x=Description, y= value, fill=test)) +
facet_wrap(~ test, scales = "free_x") +
geom_col() +
coord_flip() +
scale_y_continuous(expand = c(0, 0)) +
theme(panel.spacing.x = unit(0, "mm"))
It might create some issues with axis labels, and these would be a bit tricky to solve. In that case, it might be easier to keep some space between the facets, at the expense of not having the bars meet in the middle.
Output:
PS: you can also remove the negative axis labels with something like:
scale_y_continuous(
expand = c(0, 0),
labels = function(x) signif(abs(x), 3)
)
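To see what that labelling function does on its own, here is a small sketch. The name `abs_label` and the input values are made up for illustration; the function body is the one passed to `labels` above.

```r
# Label function: drop the sign, keep 3 significant digits
abs_label <- function(x) signif(abs(x), 3)

# Negative break positions are printed as positive labels
abs_label(c(-0.0032037, -0.001, 0, 0.125))
```

The breaks themselves stay negative on the scale; only the text shown on the axis changes.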

I have adapted this elegant solution to my needs. Kudos to Lingyun Zhang.
library(dplyr)
library(ggplot2)
set.seed(123)
ten_positive_rand_numbers <- abs(rnorm(10)) + 0.1
the_prob <- ten_positive_rand_numbers / sum(ten_positive_rand_numbers)
fk_data <- data.frame(job_type = sample(LETTERS[1:10], 1000,
replace = TRUE, prob = the_prob),
gender = sample(c("Male", "Female"), 1000,
replace = TRUE))
# prepare data for plotting
plotting_df <-
fk_data %>%
group_by(job_type, gender) %>%
summarise(Freq = n()) %>%
# a trick!
mutate(Freq = if_else(gender == "Male", -Freq, Freq))
## find the order
temp_df <-
plotting_df %>%
filter(gender == "Female") %>%
arrange(Freq)
the_order <- temp_df$job_type
# plot
p <-
plotting_df %>%
ggplot(aes(x = job_type, y = Freq, group = gender, fill = gender)) +
geom_bar(stat = "identity", width = 0.75) +
coord_flip() +
scale_x_discrete(limits = the_order) +
# another trick!
scale_y_continuous(breaks = seq(-150, 150, 50),
labels = abs(seq(-150, 150, 50))) +
labs(x = "Job type", y = "Count", title = "Back-to-back bar chart") +
theme(legend.position = "bottom",
legend.title = element_blank(),
plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "grey90")) +
# reverse the order of items in legend
# guides(fill = guide_legend(reverse = TRUE)) +
# change the default colors of bars
scale_fill_manual(values = c("red", "blue"),
name = "",
breaks = c("Male", "Female"),
labels = c("Male", "Female"))
print(p)
It can be improved with other minor details, including geom_hline(yintercept = 0, colour = "black").
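The breaks in the code above are hard-coded to seq(-150, 150, 50). If the counts are not known in advance, symmetric breaks could be derived from the data instead. This is only a sketch: the signed counts in `freq` and the step size are made-up values, not taken from the post.

```r
freq <- c(-120, 80, 30, -45)  # made-up signed counts (negative = left side)
step <- 50                    # assumed spacing between axis breaks

# Round the largest absolute count up to the next multiple of `step`
limit <- ceiling(max(abs(freq)) / step) * step

breaks <- seq(-limit, limit, by = step)  # symmetric around zero
labels <- abs(breaks)                    # shown without the sign
```

`breaks` and `labels` can then be passed to scale_y_continuous() in place of the two seq() calls.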

@Marius's solution is easier than this one, but this approach allows more control over each graph independently.
I had to remove the plot margins on the right of p1 and on the left of p2. For some reason there is padding on the left margin, so I needed -3.5pt to bring it flush; I am not sure whether this will be consistent across all plots. The other manual step is changing the breaks on one axis so that 0 isn't plotted twice, one label on top of the other.
I also don't need to negate the p-values; I just use scale_y_reverse.
library(grid) # provides grid.newpage() and grid.draw()
# expansion() replaces expand_scale(), which is deprecated since ggplot2 3.3.0
p1 <- ggplot(df[df$test == 'p',], aes(x=Description, y= value)) + geom_col(fill='red') + theme_minimal()+
coord_flip() + scale_y_reverse(name= "axis1",expand = expansion(mult= c(0.05,0))) +
theme(panel.spacing.x = unit(0, "mm")) +theme(plot.margin = unit(c(5.5, 0, 5.5, 5.5), "pt"))
p2 <- ggplot(df[df$test != 'p',], aes(x=Description, y= value)) + geom_col(fill='blue') +
scale_y_continuous(name = "axis2", breaks = seq(0.025, 0.125, 0.025) ,expand = expansion(mult= c(0,0.05))) +
coord_flip() +
theme(panel.spacing.x = unit(0, "mm"))+ theme_minimal() +
theme(axis.title.y=element_blank(), axis.text.y=element_blank(),
axis.line.y = element_blank(), axis.ticks.y=element_blank(),
plot.margin = unit(c(5.5, 5.5, 5.5, -3.5), "pt"))
grid.newpage()
grid.draw(cbind(ggplotGrob(p1), ggplotGrob(p2), size = "last"))
I also used theme_minimal, but that was just my aesthetic preference.

ggplot custom legend instead of default

I've searched and tried a bunch of suggestions to be able to display a custom legend instead of the default one in a grouped scatter ggplot. I've tried this and this and following this among others.
For instance, let's say I have a df like this one:
df = data.frame(id = c("A", "A", "B", "C", "C", "C"),
value = c(1,2,1,2,3,4),
ref = c(1.5, 1.5, 1, 2,2,2),
min = c(0.5, 0.5, 1,2,2,2))
and I want to display the values of each id as round dots, but also put the reference values and minimum values for each id as a differently shaped dot, as follows:
p = ggplot(data = df) +
geom_point(aes(x = id, y = value, color = factor(id)), shape = 19, size = 6) +
geom_point(aes(x = id, y = ref, color = factor(id)), shape = 0, size = 8) +
geom_point(aes(x = id, y = min, color = factor(id)), shape = 2, size = 8) +
xlab("") +
ylab("Value")
#print(p)
Now all is fine, but my legend doesn't add anything to the interpretation of the plot, as the X axis and colors are enough to understand it. I know I can remove the legend via theme(legend.position = "none").
Instead, I would like to have a legend of what the actual shapes of each dot represent (e.g., filled round dot = value, triangle = min, square = ref).
Among other things, I tried manually setting the scale values via scale_fill_manual, and something along these lines:
override.shape = shapes$shape
override.linetype = shapes$pch
guides(colour = guide_legend(override.aes = list(shape = override.shape, linetype = override.linetype)))...
....
I've also tried making a secondary plot, but not display it, using something suggested in one of the links pasted above:
shapes = data.frame(shape = c("value", "reference", "minimum"), pch = c(19,0,2), col = c("gray", "gray", "gray"))
p2 = ggplot(shapes, aes(shape, pch)) + geom_point()
#print(p2)
g_legend <- function(a.gplot){
tmp <- ggplot_gtable(ggplot_build(a.gplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend)
}
legend <- g_legend(p2)
library(gridExtra)
pp <- arrangeGrob(p1 ,legend,
widths=c(5/4, 1/4),
ncol = 2)
but then I get the error:
> legend <- g_legend(p2)
Error in tmp$grobs[[leg]] :
attempt to select less than one element in get1index
for which I did not find a working solution. Any suggestion on how I could show only a legend for the different dot shapes would be welcome.
Thank you

You can manually build a shape legend using scale_shape_manual:
library(ggplot2)
ggplot(data = df) +
geom_point(aes(x = id, y = value, color = factor(id), shape = 'value'), size = 6) +
geom_point(aes(x = id, y = ref, color = factor(id), shape = 'ref'), size = 8) +
geom_point(aes(x = id, y = min, color = factor(id), shape = 'min'), size = 8) +
scale_shape_manual(values = c('value' = 19, 'ref' = 0, 'min' = 2)) +
xlab("") +
ylab("Value")
Created on 2020-04-15 by the reprex package (v0.3.0)
But a better way to do this would be to reshape the df to a long format, and map each aes to a variable:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-id) %>%
ggplot() +
geom_point(aes(x = id, y = value, color = factor(id), shape = name, size = name)) +
scale_shape_manual(values = c('value' = 19, 'ref' = 0, 'min' = 2)) +
scale_size_manual(values = c('value' = 6, 'ref' = 8, 'min' = 8)) +
xlab("") +
ylab("Value")
Created on 2020-04-15 by the reprex package (v0.3.0)
To remove the legend for the color use guide_none():
library(tidyr)
library(ggplot2)
df %>%
pivot_longer(-id) %>%
ggplot() +
geom_point(aes(x = id, y = value, color = factor(id), shape = name, size = name)) +
scale_shape_manual(values = c('value' = 19, 'ref' = 0, 'min' = 2)) +
scale_size_manual(values = c('value' = 6, 'ref' = 8, 'min' = 8)) +
guides(color = guide_none()) +
xlab("") +
ylab("Value")
Created on 2020-04-16 by the reprex package (v0.3.0)
Data:
df = data.frame(id = c("A", "A", "B", "C", "C", "C"),
value = c(1,2,1,2,3,4),
ref = c(1.5, 1.5, 1, 2,2,2),
min = c(0.5, 0.5, 1,2,2,2))

You can tidy your data first using tidyr, and then map the shape aesthetic to the new variable:
library(tidyr)
df2 <- pivot_longer(df, -id)
ggplot(data = df2) +
geom_point(aes(x = id, y = value, shape = name), size = 6) +
xlab("") +
ylab("Value")
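For reference, this is what the reshaping step used in the answers above produces. The sketch reuses the question's `df`; the name `df_long` is only illustrative.

```r
library(tidyr)

df <- data.frame(id = c("A", "A", "B", "C", "C", "C"),
                 value = c(1, 2, 1, 2, 3, 4),
                 ref = c(1.5, 1.5, 1, 2, 2, 2),
                 min = c(0.5, 0.5, 1, 2, 2, 2))

# Every column except id is stacked: the old column name goes to `name`,
# the number goes to `value` (the pivot_longer() defaults)
df_long <- pivot_longer(df, -id)

head(df_long)
```

Each original row becomes three rows (one each for value, ref and min), which is why a single geom_point() layer with shape = name is enough to produce a shape legend.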

Connecting points from two datasets with lines in ggplot2 in R

I want to connect datapoints from two datasets with a vertical line. The points that should be connected vertically have the same identifier (V), but I was hoping to keep the datasets separate.
Here is my figure so far:
d1 <- data.frame (V = c("A", "B", "C", "D", "E", "F", "G", "H"),
O = c(9,2.5,7,8,7,6,7,7.5),
S = c(6,5,3,5,3,4,5,6))
d2 <- data.frame (V = c("A", "B", "C", "D"),
O = c(10,3,7.5,8.2),
S = c(6,5,3,5))
scaleFUN <- function(x) sprintf("%.0f", x)
p<-ggplot(data=d1, aes(x=S, y=O), group=factor(V), shape=V) +
geom_point(size = 5, aes(fill=V),pch=21, alpha = 0.35)+
theme_bw()+
geom_point(data = d2, size=5, aes(fill=V), pch=22,colour="black")+
theme(legend.title=element_blank())+
xlab(expression(italic("S"))) + theme(text = element_text(size=25))+
ylab(expression(italic("O")))+ theme(text = element_text(size=25))+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
theme(axis.text.y=element_text(angle=90, hjust=1))+
theme(legend.position="none") # remove legend
print(p)
So the final figure would look something like this:
Can I do this with geom_line() without combining datasets (so the other formatting can be separate for each dataset)?

As bouncyball pointed out, you can use a separate data set (merged from d1 and d2) with geom_segment.
See the following:
ggplot(data = d1, aes(x = S, y = O)) +
geom_point(size = 5, aes(fill = V), pch = 21, alpha = 0.35) +
geom_point(data = d2, size = 5, aes(fill = V), pch = 22, colour = "black") +
geom_segment(data = merge(d1, d2, by = 'V'),
aes(x = S.x, xend = S.y, y = O.x, yend = O.y)) +
guides(fill = "none") # guides(fill = FALSE) is deprecated in recent ggplot2
Which yields:
You can add your themes also.
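To make the S.x / S.y / O.x / O.y names in the geom_segment() call less mysterious, here is what merge() returns for the two data frames from the question. The name `segs` is made up for illustration.

```r
d1 <- data.frame(V = c("A", "B", "C", "D", "E", "F", "G", "H"),
                 O = c(9, 2.5, 7, 8, 7, 6, 7, 7.5),
                 S = c(6, 5, 3, 5, 3, 4, 5, 6))
d2 <- data.frame(V = c("A", "B", "C", "D"),
                 O = c(10, 3, 7.5, 8.2),
                 S = c(6, 5, 3, 5))

# Inner join on V: only identifiers present in both data frames are kept.
# Columns that clash get the default suffixes .x (from d1) and .y (from d2).
segs <- merge(d1, d2, by = "V")

segs
```

Only A to D survive the merge, so segments are drawn only for points that exist in both datasets, while d1 and d2 themselves stay separate for the point layers.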

“for” loop only adding one of the layers in ggplot

I recognize that this issue has been raised in many other questions, but none of the solutions provided worked for my particular problem.
Here, I have the following data:
library(tidyverse)
library(scales)
mydata <- tibble(Category = c("A", "B", "C", "D"),
Result = c(0.442, 0.537, 0.426, 0.387),
A = c(NA, "A", NA, NA),
B = rep(NA, 4),
C = c(NA, "C", NA, NA),
D = c("D", "D", NA, NA))
mydata$Category <- factor(mydata$Category)
And I have the following vector for the colors:
colors_vct <- c(A = "#0079c0", B = "#cc9900", C = "#252525", D = "#c5120e")
With this information, I can create the following plot:
p <- ggplot(data = mydata , aes(x = Category, y = Result, fill = Category)) +
geom_bar(stat = "identity") + geom_text(aes(label = percent(Result), color = Category), hjust = -.25) +
coord_flip() + scale_y_continuous(limits = c(0,1), labels = percent) +
scale_colour_manual(values = colors_vct) + scale_fill_manual(values = colors_vct)
p
And I'd like to have little triangles appear after the labels based on whether a certain category is mentioned in the last 4 columns of mydata, colored by that category's color, as so:
p <- p + geom_text(data = filter(mydata, mydata[,3] == "A"), aes(label = sprintf("\u25b2")), colour = colors_vct["A"], hjust = -4)
#p <- p + geom_text(data = filter(mydata, mydata[,4] == "B"), aes(label = sprintf("\u25b2")), colour = colors_vct["B"], hjust = -5) #This is commented out because there are no instances where the layer ends up being applied.
p <- p + geom_text(data = filter(mydata, mydata[,5] == "C"), aes(label = sprintf("\u25b2")), colour = colors_vct["C"], hjust = -6)
p <- p + geom_text(data = filter(mydata, mydata[,6] == "D"), aes(label = sprintf("\u25b2")), colour = colors_vct["D"], hjust = -7)
p
This is what I want the final chart to look like (more or less, see bonus question below). Now, I'd like to iterate the last bit of code using a for loop. And this is where I'm running into trouble. It just ends up adding one layer only. How do I make this work? Here is my attempt:
#Set the colors into another table for matching:
colors_tbl <- tibble(Category = levels(mydata$Category),
colors = c("#0079c0", "#cc9900", "#252525", "#c5120e"))
for (i in seq_along(mydata$Category)) {
if (is_character(mydata[[i]])) { #This makes the loop skip if there is nothing to be applied, as with category B.
#Filters to just the specific categories I need to have the triangles shown.
triangles <- filter(mydata, mydata[,(i+2)] == levels(mydata$Category)[i])
#Matches up with the colors_tbl to determine which color to use for that triangle.
triangles <- mutate(triangles, colors = colors_tbl$colors[match(levels(triangles$Category)[i], colors_tbl$Category)])
#Sets a particular position for that triangle for the hjust argument below.
pos <- -(i+3)
#Adding the layer to the plot object
p <- p + geom_text(data = triangles, aes(label = sprintf("\u25b2")), color = triangles$colors, hjust = pos)
}
}
p
:(
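One likely reason the loop adds only a single layer, assuming the data exactly as posted: the condition tests column i of mydata, while the filter uses column i + 2. The sketch below reproduces the check with base R's is.character() standing in for is_character() (an assumption; the two should agree for these column types) and with a plain data.frame instead of a tibble.

```r
mydata <- data.frame(
  Category = factor(c("A", "B", "C", "D")),
  Result = c(0.442, 0.537, 0.426, 0.387),
  A = c(NA, "A", NA, NA),
  B = rep(NA, 4),
  C = c(NA, "C", NA, NA),
  D = c("D", "D", NA, NA),
  stringsAsFactors = FALSE
)

# What the loop actually tests for i = 1..4: columns Category, Result, A, B
tested <- vapply(mydata[1:4], is.character, logical(1))

# What it presumably meant to test: columns i + 2, i.e. A, B, C, D
intended <- vapply(mydata[3:6], is.character, logical(1))

tested
intended
```

Only the third column tested is character, so the loop body runs once (for i = 3). Testing column i + 2 instead would run it for A, C and D and still skip B, whose column is all NA and therefore logical.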
Bonus question: Is there a way I can avoid gaps in between the triangles, as per the 2nd chart?
EDIT: As per @baptiste's suggestion, I re-processed the data as follows:
mydata2 <- mydata %>% gather(key = comp, value = Present, -Result, -Category)
mydata2 <- mydata2 %>% mutate(colors = colors_tbl$colors[match(mydata2$Present, colors_tbl$Category)]) %>%
filter(!is.na(mydata2$Present)) %>% select(-comp)
mydata2 <- mydata2 %>% mutate(pos = if_else(Present == "A", -4, if_else(Present == "B", -5, if_else(Present == "C", -6, -7))))
p <- p + geom_text(data = mydata2, aes(x = Category, label = sprintf("\u25b2")), colour = mydata2$colors, hjust = mydata2$pos)
p
OK, I got it to work. My bonus question still stands.
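For the bonus question, one possible approach (a sketch, not a tested answer to the original plot) is to number the triangles within each Category rather than tying the position to the letter, so the offsets are consecutive. It assumes pivot_longer() in place of gather(), declares column B as character NA to keep the column types uniform, and keeps the -4 starting offset from the code above; the name `triangles` is illustrative.

```r
library(dplyr)
library(tidyr)

mydata <- tibble(Category = c("A", "B", "C", "D"),
                 Result = c(0.442, 0.537, 0.426, 0.387),
                 A = c(NA, "A", NA, NA),
                 B = rep(NA_character_, 4),  # assumption: character NA
                 C = c(NA, "C", NA, NA),
                 D = c("D", "D", NA, NA))

triangles <- mydata %>%
  pivot_longer(A:D, names_to = "comp", values_to = "Present") %>%
  filter(!is.na(Present)) %>%
  group_by(Category) %>%
  # 1st triangle in a bar gets -4, 2nd gets -5, ... regardless of its letter
  mutate(pos = -(3 + row_number())) %>%
  ungroup()

triangles
```

`triangles$pos` could then replace `mydata2$pos` as the hjust values, so a bar with triangles for A, C and D gets offsets -4, -5, -6 with no gap where B would have been.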