Displaying multiple factors with Sina plots - r

NOTE: I have updated this post following discussion with Z. Lin. Originally, I had simplified my problem to a two factor design (see section "Original question"). However, my actual data consists of four factors, requiring facet_grid. I am therefore providing an example for a four factor design further below (see section "Edit").
Original question
Let's assume I have a two factor design with dv as my dependent variable and iv.x and iv.y as my factors/independent variables. Some quick sample data:
DF <- data.frame(dv = rnorm(900),
iv.x = sort(rep(letters[1:3], 300)),
iv.y = rep(sort(rep(rev(letters)[1:3], 100)), 3))
My goal is to display each condition separately as can nicely be done with violin plots:
ggplot(DF, aes(iv.x, dv, colour=iv.y)) + geom_violin()
I have recently come across Sina plots and would like to do the same here. Unfortunately Sina plots don't do this, collapsing the data instead.
ggplot(DF, aes(iv.x, dv, colour=iv.y)) + geom_sina()
An explicit call to position dodge doesn't help either, as this produces an error message:
ggplot(DF, aes(iv.x, dv, colour=iv.y)) + geom_sina(position = position_dodge(width = 0.5))
The authors of Sina plots have already been made aware of this issue in 2016:
https://github.com/thomasp85/ggforce/issues/47
My problem is more in terms of time. We soon want to submit a manuscript and Sina plots would be a great way to display our data. Can anyone think of a workaround for Sina plots such that I can still display two factors as in the example with violin plots above?
Edit
Sample data for a four factor design:
DF <- data.frame(dv=rnorm(400),
iv.w=sort(rep(letters[1:2],200)),
iv.x=rep(sort(rep(letters[3:4],100)), 2),
iv.y=rep(sort(rep(rev(letters)[1:2],50)),4),
iv.z=rep(sort(rep(letters[5:6],25)),8))
An example with violin plots of what I would like to create using Sina plots:
ggplot(DF, aes(iv.x, dv, colour=iv.y)) +
facet_grid(iv.w ~ iv.z) +
geom_violin(aes(y = dv, fill = iv.y),
position = position_dodge(width = 1))+
stat_summary(aes(y = dv, fill = iv.y), fun=mean, geom="point", # fun.y is deprecated since ggplot2 3.3.0
colour="black", show.legend = FALSE, size=.2,
position=position_dodge(width=1))+
stat_summary(aes(y = dv, fill = iv.y), fun.data=mean_cl_normal, geom="errorbar",
position=position_dodge(width=1), width=.2, show.legend = FALSE,
colour="black", size=.2)

Edited solution, since OP clarified that facets are required:
ggplot(DF, aes(x = interaction(iv.y, iv.x),
y = dv, fill = iv.y, colour = iv.y)) +
facet_grid(iv.w ~ iv.z) +
geom_sina() +
stat_summary(fun=mean, geom="point", # fun.y is deprecated since ggplot2 3.3.0
colour="black", show.legend = FALSE, size=.2,
position=position_dodge(width=1))+
stat_summary(fun.data=mean_cl_normal, geom="errorbar",
position=position_dodge(width=1), width=.2,
show.legend = FALSE,
colour="black", size=.2) +
scale_x_discrete(name = "iv.x",
labels = c("c", "", "d", "")) +
theme(panel.grid.major.x = element_blank(),
axis.text.x = element_text(hjust = -4),
axis.ticks.x = element_blank())
Instead of using facets to simulate dodging between colours, this approach creates a new variable interaction(colour.variable, x.variable) to be mapped to the x-axis.
The rest of the code, in scale_x_discrete() and theme(), is there to hide the default x-axis labels, ticks and grid lines.
axis.text.x = element_text(hjust = -4) is a hack that shifts x-axis labels to approximately the right position. It's ugly, but considering the use case is for a manuscript submission, I assume the size of plots will be fixed, and you just need to tweak it once.
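To see why the labels are given as c("c", "", "d", ""), it can help to look at the levels that interaction() produces. The following is a minimal base-R sketch using made-up vectors with the same factor values as the four-factor sample data; it only illustrates the level ordering and is not part of the plotting code.

```r
# Toy vectors mirroring the values of iv.x and iv.y in the sample data
iv.x <- factor(c("c", "c", "d", "d"))
iv.y <- factor(c("y", "z", "y", "z"))

# interaction() varies its first argument fastest, so the two iv.y
# levels sit next to each other within each iv.x level
x_levels <- levels(interaction(iv.y, iv.x))
print(x_levels)
```

The four x positions come out as y.c, z.c, y.d, z.d, so labelling only the first position of each pair (and shifting the text with hjust) gives one visible label per iv.x level.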
Original solution:
Assuming your plots don't otherwise require faceting, you can simulate the appearance with facets:
ggplot(DF, aes(x = iv.y, y = dv, colour = iv.y)) +
geom_sina() +
facet_grid(~iv.x, switch = "x") +
labs(x = "iv.x") +
theme(axis.text.x = element_blank(), # hide iv.y labels
axis.ticks.x = element_blank(), # hide iv.y ticks
strip.background = element_blank(), # make facet strip background transparent
panel.spacing.x = unit(0, "mm")) # remove horizontal space between facets

Why is the transformation of my y-axis to logarithmic scale not showing the correct values in a column plot made with ggplot2? [duplicate]

I'm forgetting something very fundamental which would explain why I'm seeing very inflated y values after a log10 transformation of the y-axis.
I have the following stacked ggplot + geom_histogram.
ggTherapy <- ggplot(genderTherapyDF, aes(freq, fill=name)) +
geom_histogram(data=genderTherapyDF, binwidth = 1, alpha=0.5, color="black") + theme_bw() +
theme(legend.position="none", axis.title = element_text(size=14), legend.text = element_text(size=14), axis.text.y = element_text(size=12, angle=45), axis.text.x = element_text(size=12), legend.background = element_rect(fill="transparent")) +
ylab("No. of patients") + xlab("Events") + labs(fill="") + ggtitle("Therapy")
The y-values are true to form, exactly what I expect. However, it's so skewed that to the naked eye I'm finding this very unsatisfying. I'd rather see a transformed plot.
I tried transforming x, but quickly realised that a transformation along the binned axis is very difficult to interpret. So I transformed the frequency on the y axis instead:
ggTherapy <- ggplot(genderTherapyDF, aes(freq, fill=name)) +
geom_histogram(data=genderTherapyDF, binwidth = 1, alpha=0.5, color="black") + theme_bw() +
theme(legend.position="none", axis.title = element_text(size=14), legend.text = element_text(size=14), axis.text.y = element_text(size=12, angle=45), axis.text.x = element_text(size=12), legend.background = element_rect(fill="transparent")) +
ylab("No. of patients") + xlab("Events") + labs(fill="") + ggtitle("Therapy") +
scale_y_log10()
Visually, the plot makes sense. However, I'm struggling to come to terms with the y-axis labels! Why are they so huge after a log10 transformation?
I'm going to make a case against using a stacked position on a log transformed y axis.
Consider the following data.
df <- data.frame(
x = c(1, 1),
y = c(10, 10),
z = c("A", "B")
)
It's just two equal observations from two groups sharing an x position. If we were to plot this in a stacked bar chart, it would look like the following:
library(ggplot2)
ggplot(df, aes(x, y, fill = z)) +
geom_col(position = "stack")
And this does exactly what you expect it would do. However, if we now transform the y-axis, we get the following:
ggplot(df, aes(x, y, fill = z)) +
geom_col(position = "stack") +
scale_y_continuous(trans = "log10")
In the plot above, group B appears to reach 10, which is correct, while group A appears to span from 10 to 100, i.e. a value of 90, which is incorrect. This happens because the scale transformation is applied to the data before the position adjustment, so the bars are stacked in log space: instead of log10(A + B), the top of the stack sits at log10(A) + log10(B), which is the same as log10(A * B).
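The arithmetic behind this can be checked without any plotting. This is a small sketch with illustrative variable names, using the two values of 10 from the toy data:

```r
a <- 10
b <- 10

# Stacking on a log10 scale adds the transformed values ...
stack_top_log <- log10(a) + log10(b)

# ... so the axis label at the top of the stack is a * b, not a + b
apparent_top <- 10^stack_top_log
true_top <- a + b

# The upper segment appears to cover everything above the lower one
apparent_upper_segment <- apparent_top - b

print(c(apparent_top = apparent_top,
        true_top = true_top,
        apparent_upper_segment = apparent_upper_segment))
```

The stack ends at 100 on the axis although the true total is 20, which is where the apparent value of 90 for the upper group comes from.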
Instead, I'd recommend not stacking histograms if you plan on transforming the y-axis; use the fill's alpha to tease them apart. Example below:
df <- data.frame(
x = c(rnorm(100, 1), rnorm(100, 2)),
z = rep(c("A", "B"), each = 100)
)
ggplot(df, aes(x, fill = z)) +
geom_histogram(position = "identity", alpha = 0.5) +
scale_y_continuous(trans = "log10")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Transformation introduced infinite values in continuous y-axis
Yes, the 0s will become -Inf but at least the y-axis is now correct.
EDIT: If you want to filter out the -Inf observations, one nice thing in the scales v1.1.1 package is the oob_censor_any() function used as follows:
scale_y_continuous(trans = "log10", oob = scales::oob_censor_any)
I'm guessing that you should transform the data manually as described here https://ggplot2-book.org/scales.html#continuous-position-scales:
"Note that there is nothing preventing you from performing the transformation manually. For example, instead of using scale_x_log10() to transform the scale, you could transform the data instead and plot log10(x). The appearance of the geom will be the same, but the tick labels will be different. Specifically, if you use a transformed scale, the axes will be labelled in the original data space; if you transform the data, the axes will be labelled in the transformed space. Regardless of which method you use, the transformation occurs before any statistical summaries. To transform after statistical computation use coord_trans(). See Section 14.1 for more details."
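One way to read the suggestion of transforming manually is to do the counting yourself and take the logarithm only after the groups have been summed. The following base-R sketch assumes simulated event counts and made-up column names (events, group, log10_total); it is an illustration, not the asker's data.

```r
set.seed(1)

# Simulated stand-in for the real data: an event count per patient in two groups
df <- data.frame(
  x = c(rpois(100, 2), rpois(100, 4)),
  z = rep(c("A", "B"), each = 100)
)

# Count patients per event value and group
counts <- as.data.frame(table(events = df$x, group = df$z))

# Sum across groups first, then take log10 of the total,
# so the result is log10(A + B) rather than log10(A) + log10(B)
totals <- aggregate(Freq ~ events, data = counts, FUN = sum)
totals$log10_total <- log10(totals$Freq)

print(totals)
```

Plotting log10_total then labels the axis in log units, which matches the book's remark that manually transformed data are labelled in the transformed space.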

Stacking multiple figures together in ggplot

I am attempting to make publication ready figures where the bottom axis (with tick marks) of one figure is cleanly combined with the top axis of the figure below it. Here is an example of what it might look like, although this one doesn't have tick marks on each panel:
Here is my attempt to do so, by simply using grid.arrange:
#Libraries:
library(ggplot2)
library(dplyr)
library(gridExtra)
#Filter to create two separate data sets:
dna1 <- DNase %>% filter(Run == 1)
dna2 <- DNase %>% filter(Run == 2)
#Figure 1:
dna1_plot <- ggplot(dna1, aes(x = conc, y = density)) + geom_point() + theme_classic() +
theme(axis.title.x = element_blank())
#Figure 2:
dna2_plot <- ggplot(dna2, aes(x = conc, y = density)) + geom_point() + theme_classic()
#Using grid.arrange to combine:
dna <- grid.arrange(dna1_plot, dna2_plot, nrow = 2)
And an attempt with some adjustments to the plot margins, although this didn't seem to work:
dna1_plot_round2 <- ggplot(dna1, aes(x = conc, y = density)) + geom_point() + theme_classic() +
theme(axis.title.x = element_blank(),
plot.margin = unit(c(0,0,0,0), "cm"))
dna2_plot_round2 <- ggplot(dna2, aes(x = conc, y = density)) + geom_point() + theme_classic() +
theme(plot.margin = unit(c(-0.5,-1,0,0), "cm"))
dna_round2 <- grid.arrange(dna1_plot_round2, dna2_plot_round2, nrow = 2)
Does anyone know the best way to stack figures like this in ggplot? Is there a better way than using grid.arrange? If possible it would be great to see how to do it with/without tick marks on each x axis as well.
Thank you!
You don't need any non-native ggplot stuff. Keep your data in one data frame and use facet_grid.
dna <- DNase %>% filter(Run %in% 1:2)
ggplot(dna, aes(x = conc, y = density)) +
geom_point() +
theme_bw() +
facet_grid(rows = vars(Run)) +
theme(panel.spacing = unit(0, "mm"))
The R package deeptime has a function called ggarrange2 that can achieve this. Instead of just pasting the plots together like grid.arrange (and ggarrange), it lines up all of the axes and axis labels from all of the plots.
library(deeptime) # provides ggarrange2()
# remove bottom axis elements, reduce bottom margin, add panel border
dna1_plot_round2 <- ggplot(dna1, aes(x = conc, y = density)) + geom_point() + theme_classic() +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.x = element_blank(),
plot.margin = margin(0,0,-.05,0, "cm"), panel.border = element_rect(fill = NA))
# reduce top margin (split the difference so the plots are the same height), add panel border
dna2_plot_round2 <- ggplot(dna2, aes(x = conc, y = density)) + geom_point() + theme_classic() +
theme(plot.margin = margin(-.05,0,0,0, "cm"), panel.border = element_rect(fill = NA))
dna_round2 <- ggarrange2(dna1_plot_round2, dna2_plot_round2, nrow = 2)
You might also try the fairly recent patchwork package, although I don't have much experience with it.
Note that while Gregor's answer may be fine for this specific example, this answer might be more appropriate for other folks that come across this question (and see the example at the top of the question).
For your purposes, I believe Gregor Thomas' answer is best. But if you are in a situation where facets aren't the best option for combining two plots, the newish package patchwork handles this more elegantly than any alternatives I've seen.
Patchwork also provides lots of options for adding annotations surrounding the combined plot. The README and vignettes will get you started.
library(patchwork)
(dna1_plot / dna2_plot) +
plot_annotation(title = "Main title for combined plots")
Edit to better address @Cameron's question.
According to the package creator, patchwork does not add any space between the plots. The white space in the example above is due to the margins around each individual ggplot. These margins can be adjusted using the plot.margin argument in theme(), which takes a numeric vector of the top, right, bottom, and left margins.
In the example below, I set the bottom margin of dna1_plot to 0 and strip out all the bottom x-axis ticks and text. I also set the top margin of dna2_plot to 0. Doing this nearly makes the y-axis lines touch in the two plots.
dna1_plot <- ggplot(dna1, aes(x = conc, y = density)) + geom_point() + theme_classic() +
theme(axis.title.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
plot.margin = unit(c(1,1,0,1), "mm"))
#Figure 2:
dna2_plot <- ggplot(dna2, aes(x = conc, y = density)) + geom_point() + theme_classic() +
theme(plot.margin = unit(c(0,1,1,1), "mm"))
(dna1_plot / dna2_plot)
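If remembering the order top, right, bottom, left is error-prone, the same margins can be written with named sides using ggplot2's margin() helper. This is a minimal sketch; the object names (top_plot_margin, top_theme and so on) are made up for illustration.

```r
library(ggplot2)

# Named sides make the top/right/bottom/left order explicit
top_plot_margin    <- margin(t = 1, r = 1, b = 0, l = 1, unit = "mm")
bottom_plot_margin <- margin(t = 0, r = 1, b = 1, l = 1, unit = "mm")

# Hypothetical theme fragments; add them to a plot with `+`
top_theme    <- theme(plot.margin = top_plot_margin)
bottom_theme <- theme(plot.margin = bottom_plot_margin)
```

For example, dna1_plot + top_theme would apply the zero bottom margin to the upper plot.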

Manually change order of y axis items on complicated stacked bar chart in ggplot2

I've been stuck on an issue and can't find a solution. I've tried many suggestions on Stack Overflow and elsewhere about manually ordering a stacked bar chart, since that should be a pretty simple fix, but those suggestions don't work with the huge complicated mess of code I plucked from many places. My only issue is y-axis item ordering.
I'm making a series of stacked bar charts, and ggplot2 changes the ordering of the items on the y-axis depending on which dataframe I am trying to plot. I'm trying to make 39 of these plots and want them to all have the same ordering. I think ggplot2 only wants to plot them in ascending order of their numeric mean or something, but I'd like all of the bar charts to first display the group "Bird Advocates" and then "Cat Advocates." (This is also the order they appear in my data frame, but that ordering is lost at the coord_flip() point in plotting.)
I think that taking the data frame through so many changes is why I can't just add something simple at the end or use the reorder() function. Adding things into aes() also doesn't work, since the stacked bar chart I'm creating seems to depend on those items being exactly a certain way.
Here's one of my data frames where ggplot2 is ordering my y-axis items incorrectly, plotting "Cat Advocates" before "Bird Advocates":
Group,Strongly Opposed,Opposed,Slightly Opposed,Neutral,Slightly Support,Support,Strongly Support
Bird Advocates,0.005473026,0.010946052,0.012509773,0.058639562,0.071149335,0.31118061,0.530101642
Cat Advocates,0.04491726,0.07013396,0.03624901,0.23719464,0.09141056,0.23404255,0.28605201
And here's all the code that takes that and turns it into a plot:
library(ggplot2)
library(reshape2)
library(plotly)
#Importing data from a .csv file
data <- read.csv("data.csv", header=TRUE)
data$s.Strongly.Opposed <- 0-data$Strongly.Opposed-data$Opposed-data$Slightly.Opposed-.5*data$Neutral
data$s.Opposed <- 0-data$Opposed-data$Slightly.Opposed-.5*data$Neutral
data$s.Slightly.Opposed <- 0-data$Slightly.Opposed-.5*data$Neutral
data$s.Neutral <- 0-.5*data$Neutral
data$s.Slightly.Support <- 0+.5*data$Neutral
data$s.Support <- 0+data$Slightly.Support+.5*data$Neutral
data$s.Strongly.Support <- 0+data$Support+data$Slightly.Support+.5*data$Neutral
#to percents
data[,2:15]<-data[,2:15]*100
#melting
mdfr <- melt(data, id=c("Group"))
mdfr<-cbind(mdfr[1:14,],mdfr[15:28,3])
colnames(mdfr)<-c("Group","variable","value","start")
#remove dot in level names
mylevels<-c("Strongly Opposed","Opposed","Slightly Opposed","Neutral","Slightly Support","Support","Strongly Support")
mdfr$variable<-droplevels(mdfr$variable)
levels(mdfr$variable)<-mylevels
pal<-c("#bd7523", "#e9aa61", "#f6d1a7", "#999999", "#c8cbc0", "#65806d", "#334e3b")
ggplot(data=mdfr) +
geom_segment(aes(x = Group, y = start, xend = Group, yend = start+value, colour = variable,
text=paste("Group: ",Group,"<br>Percent: ",value,"%")), size = 5) +
geom_hline(yintercept = 0, color =c("#646464")) +
coord_flip() +
theme(legend.position="top") +
theme(legend.key.width=unit(0.5,"cm")) +
guides(col = guide_legend(ncol = 12)) + #has 7 real columns, using to adjust legend position
scale_color_manual("Response", labels = mylevels, values = pal, guide="legend") +
theme(legend.title = element_blank()) +
theme(axis.title.x = element_blank()) +
theme(axis.title.y = element_blank()) +
theme(axis.ticks = element_blank()) +
theme(axis.text.x = element_blank()) +
theme(legend.key = element_rect(fill = "white")) +
scale_y_continuous(breaks=seq(-100,100,100), limits=c(-100,100)) +
theme(panel.background = element_rect(fill = "#ffffff"),
panel.grid.major = element_line(colour = "#CBCBCB"))
The plot:
I think this works, you may need to play around with the axis limits/breaks:
library(dplyr)
mdfr <- mdfr %>%
mutate(group_n = as.integer(case_when(Group == "Bird Advocates" ~ 2,
Group == "Cat Advocates" ~ 1)))
ggplot(data=mdfr) +
geom_segment(aes(x = group_n, y = start, xend = group_n, yend = start + value, colour = variable,
text=paste("Group: ",Group,"<br>Percent: ",value,"%")), size = 5) +
scale_x_continuous(limits = c(0,3), breaks = c(1, 2), labels = c("Cat", "Bird")) +
geom_hline(yintercept = 0, color =c("#646464")) +
theme(legend.position="top") +
theme(legend.key.width=unit(0.5,"cm")) +
coord_flip() +
guides(col = guide_legend(ncol = 12)) + #has 7 real columns, using to adjust legend position
scale_color_manual("Response", labels = mylevels, values = pal, guide="legend") +
theme(legend.title = element_blank()) +
theme(axis.title.x = element_blank()) +
theme(axis.title.y = element_blank()) +
theme(axis.ticks = element_blank()) +
theme(axis.text.x = element_blank()) +
theme(legend.key = element_rect(fill = "white"))+
scale_y_continuous(breaks=seq(-100,100,100), limits=c(-100,100)) +
theme(panel.background = element_rect(fill = "#ffffff"),
panel.grid.major = element_line(colour = "#CBCBCB"))
produces this plot:
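You want to convert the 'Group' variable to a factor whose levels are in the order in which the bars should appear. Note that with coord_flip() the first level is drawn at the bottom, so "Bird Advocates" has to be the last level to end up on top:
mdfr$Group <- factor(mdfr$Group, levels = c("Cat Advocates", "Bird Advocates"))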
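How factor levels translate into axis positions can be checked in base R. The sketch below uses a made-up vector and a hypothetical top_to_bottom helper; it relies on the fact that a discrete axis places levels at positions 1, 2, ... and that coord_flip() draws position 1 at the bottom.

```r
# Made-up group labels standing in for mdfr$Group
group <- c("Bird Advocates", "Cat Advocates", "Bird Advocates")

# Desired reading order from the top of the flipped chart downwards
top_to_bottom <- c("Bird Advocates", "Cat Advocates")

# Reverse it, because position 1 ends up at the bottom after coord_flip()
group_f <- factor(group, levels = rev(top_to_bottom))

print(levels(group_f))
print(as.integer(group_f))  # axis position of each observation
```

Reusing the same top_to_bottom vector for every data frame should keep the ordering identical across all of the plots.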

Create a dodged barplot with ggplot2

I have the dataset below:
Database<-c("Composite","DB","TC","RH","DGI","DCH","DCH","DCH","LDP")
Unique_Drugs<-c(12672,5130,1425,3090,6100,2019,250,736,1182)
Unique_Targets<-c(3987,2175,842,2308,2413,1441,198,327,702)
db<-data.frame(Database,Unique_Drugs,Unique_Targets)
and I would like to create a dodged bar chart like the picture below:
This plot came from a dataframe like:
The difference is that on the x-axis I want the 7 unique Database names, and the fill should distinguish Unique_Drugs from Unique_Targets, so that each database gets 2 coloured bars displaying those values. I'm not sure how to make it work.
My code is:
p <- ggplot(data = db, aes(Database)) +
geom_bar(position = position_dodge(preserve = "single"), stat="count", aes(fill = colnames(db[2:4])), color = "black")+
coord_flip()+
theme(legend.position="top",
legend.title=element_blank(),
axis.title.x=element_text(size=18, face="bold", color="#000000"), # this changes the x axis title
axis.text.x = element_text(size=14, face="bold", color="#000000"), #This changes the x axis ticks text
axis.title.y=element_text(size=18, face="bold", color="#000000"), # this changes the y axis title
axis.text.y = element_text(size=14, face="bold", color="#000000"))+ #This changes the y axis ticks text
labs(x = "Database") +
labs(y = "Value") +
scale_x_discrete(limits = rev(factor(Database))) +
scale_fill_manual("Databases", values = c("tomato","steelblue3"))
Here's one way to achieve what you want:
library(reshape2)
ggplot(melt(db), aes(x = Database, y = value, fill = variable)) +
geom_col(position = "dodge") + ylab(NULL) + theme_minimal() +
scale_fill_discrete(NULL, labels = c("Drugs", "Targets"))
If you wanted a bar plot only for drugs, there would be no need for melt, as you could use y = Unique_Drugs to specify the bar heights (note that since we already have the heights, we use geom_col). In this case, however, we want to specify two kinds of heights. Your requirement that the fill should be Unique_Drugs and Unique_Targets is exactly what tells us that the data need reshaping, because ggplot doesn't accept two variables for the same aesthetic. Using melt, we get all the heights in a single variable (value) and a single variable for fill (variable).
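To make the reshaping concrete, here is a base-R sketch that builds the same long layout by hand (one row per database and measure). It is only meant to show the shape of the melted data, with column names chosen to match melt's defaults; it is not a replacement for the melt call.

```r
Database <- c("Composite", "DB", "TC", "RH", "DGI", "DCH", "DCH", "DCH", "LDP")
Unique_Drugs <- c(12672, 5130, 1425, 3090, 6100, 2019, 250, 736, 1182)
Unique_Targets <- c(3987, 2175, 842, 2308, 2413, 1441, 198, 327, 702)
db <- data.frame(Database, Unique_Drugs, Unique_Targets)

# Stack the two measure columns into one "value" column and
# record which column each value came from in "variable"
long <- data.frame(
  Database = rep(db$Database, times = 2),
  variable = rep(c("Unique_Drugs", "Unique_Targets"), each = nrow(db)),
  value    = c(db$Unique_Drugs, db$Unique_Targets)
)

print(head(long))
```

Each original row now appears twice, once per measure, which is what lets a single fill = variable mapping produce the two dodged bars.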
