I am pretty sure that this is easy to do, but I can't find the right way to phrase this question for Google or Stack Overflow, so here we are:
I have a plot made in ggplot2 which makes use of geom_jitter(), efficiently creating one row for each element in a factor and plotting its values.
I would like to add a complementary geom_violin() to the plot, but just adding the extra geom_ function to the plot code returns two layers: the jitter and the violin, one on top of the other (as usually expected).
EDIT:
This is what the plot looks like:
How can I have the violin as a separate row, without generating a second plot?
Side quest: how can I have the jitter and the violin geoms interleaved? (i.e. element A jitter row followed by element A violin row, and then element B jitter row followed by element B violin row)
This is the minimum required code to make it (without all the theme() embellishments):
P1 <- ggplot(data=TEST_STACK_SUB, aes(x=E, y=C, col=A)) +
theme(... , aspect.ratio=0.3) +
geom_point(position = position_jitter(w = 0.30, h = 0), alpha=0.2, size=0.5) +
geom_violin(data=TEST_STACK_SUB, mapping=aes(x=E, y=C), position="dodge") +
scale_x_discrete() +
scale_y_continuous(limits=c(0,1), breaks=seq(0,1,0.1),
labels=c(seq(0,1,0.1))) +
scale_color_gradient2(breaks=seq(0,100,20),
limits=c(0,100),
low="green3",
high="darkorchid4",
midpoint=50,
name="") +
coord_flip()
options(repr.plot.width=8, repr.plot.height=2)
plot(P1)
Here is a subset of the data to generate it (for you to try):
data
How about manipulating your factor as a continuous variable and nudging the entries across the aes() calls like so:
library(dplyr)
library(ggplot2)
set.seed(42)
tibble(x = rep(c(1, 3), each = 10),
y = c(rnorm(10, 2), rnorm(10))) -> plot_data
ggplot(plot_data) +
geom_jitter(aes(x = x - 0.5, y = y), width = 0.25) +
geom_violin(aes(x = x + 0.5, y = y, group = x), width = 0.5) +
coord_flip() +
labs(x = "x") +
scale_x_continuous(breaks = c(1, 3),
labels = paste("Level", 1:2),
trans = scales::reverse_trans())
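The same nudging idea can be applied when the grouping variable is a factor, as in the question. This is only a minimal sketch on made-up data (`toy`, `E`, `C` and `pos` are illustrative names, not the asker's real data): the factor is converted to integer positions spaced two units apart, and the jitter and violin layers are shifted in opposite directions so that each level gets a jitter row followed by a violin row.

```r
library(ggplot2)

set.seed(42)

# Hypothetical stand-in data: a factor E and a numeric C in [0, 1]
toy <- data.frame(E = factor(rep(c("A", "B", "C"), each = 20)),
                  C = runif(60))

# Space the levels two units apart (1, 3, 5, ...) so that each level
# has room for two rows
toy$pos <- 2 * as.integer(toy$E) - 1

# Axis breaks sit at the centre of each jitter/violin pair
level_pos <- 2 * seq_along(levels(toy$E)) - 1

p <- ggplot(toy, aes(y = C)) +
  geom_jitter(aes(x = pos - 0.5), width = 0.25, height = 0, alpha = 0.5) +
  geom_violin(aes(x = pos + 0.5, group = E), width = 0.8) +
  scale_x_continuous(breaks = level_pos, labels = levels(toy$E)) +
  coord_flip()

# print(p) renders the plot
```

Because the x axis is now continuous, the level names have to be put back by hand through `breaks` and `labels`.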
I want to create a graph that looks something like this:
However, I would like to incorporate density based on the connected lines (and not individual plot points, as the graph above using geom_density_2d does). The data, in reality, looks something like this:
Here I am showing gene expression over a 4-point time series (y = gene expression value, x = time). In both examples, the centre line was created using LOESS curve fitting.
How can I create a density or contour plot based on the actual individual connecting lines that span from time=1 to time=4?
This is what I have done so far:
# make a dataset
test <- data.frame(gene=rep(c((1:500)), each=4),
time=rep(c(1:4), 125),
value=rep(c(1,2,3,1), 125))
# add random noise to dataset
test$value <- jitter(test$value, factor=1,amount=2)
# first graph created as follows:
ggplot(data=test, aes(x=time, y=value)) +
geom_density_2d(colour="grey") +
scale_x_continuous(limits = c(0,5),
breaks = seq(1,4),
minor_breaks = seq(1)) +
scale_y_continuous(limits = c(-3,8)) +
guides(fill=FALSE) +
theme_classic()
# second plot created as follows
ggplot(test, aes(time, value)) +
geom_line(aes(group = gene),
size = 0.5,
alpha = 0.3,
color = "snow3") +
geom_point() +
scale_y_continuous(limits = c(-3, 8)) +
scale_x_continuous(breaks = seq(1,4), minor_breaks = seq(1)) +
theme_classic()
Thanks in advance for your help!
I have data which comes from a statistical test (gene set enrichment analysis, but that's not important), so I obtain p-values for statistics that are normally distributed, i.e., both positive and negative values:
The test is run on several categories:
set.seed(1)
df <- data.frame(col = rep(1,7),
category = LETTERS[1:7],
stat.sign = sign(rnorm(7)),
p.value = runif(7, 0, 1),
stringsAsFactors = TRUE)
I want to present these data in a geom_tile ggplot such that I color code the df$category by their df$p.value multiplied by their df$stat.sign (i.e, the sign of the statistic)
For that I first take the log10 of df$p.value:
df$sig <- df$stat.sign*(-1*log10(df$p.value))
Then I order the df by df$sig for each sign of df$sig:
library(dplyr)
df <- rbind(dplyr::filter(df, sig < 0)[order(dplyr::filter(df, sig < 0)$sig), ],
dplyr::filter(df, sig > 0)[order(dplyr::filter(df, sig > 0)$sig), ])
And then I ggplot it:
library(ggplot2)
df$category <- factor(df$category, levels=df$category)
ggplot(data = df,
aes(x = col, y = category)) +
geom_tile(aes(fill=sig)) +
scale_fill_gradient2(low='darkblue', mid='white', high='darkred') +
theme_minimal() +
xlab("") + ylab("") + labs(fill="-log10(P-Value)") +
theme(axis.text.y = element_text(size=12, face="bold"),
axis.text.x = element_blank())
which gives me:
Is there a way to manipulate the legend such that the values of df$sig are represented by their absolute value but everything else remains unchanged? That way I still get both red and blue shades and maintain the order I want.
If you check ggplot's documentation, scale_fill_gradient2, like other continuous scales, accepts one of the following for its labels argument:
NULL for no labels
waiver() for the default labels computed by the transformation object
a character vector giving labels (must be same length as breaks)
a function that takes the breaks as input and returns labels as output
Since you only want the legend values to be absolute, I assume you're satisfied with the default breaks in the legend colour bar (-0.1 to 0.4 in increments of 0.1), so all you really need is to add a function that manipulates the labels.
I.e. instead of this:
scale_fill_gradient2(low = 'darkblue', mid = 'white', high = 'darkred') +
Use this:
scale_fill_gradient2(low = 'darkblue', mid = 'white', high = 'darkred',
labels = abs) +
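To see why `labels = abs` is enough: the labelling function receives the numeric break positions and its return value becomes the legend text, while the colours still follow the signed values. Below is a small sketch on made-up data; `toy` and `legend_breaks` are hypothetical names, and the breaks are set explicitly only so the effect is easy to check.

```r
library(ggplot2)

# Hypothetical signed scores, standing in for df$sig
toy <- data.frame(col = 1,
                  category = LETTERS[1:4],
                  sig = c(-1.2, -0.3, 0.4, 1.5))

# Explicit breaks, chosen here only for illustration
legend_breaks <- c(-1, 0, 1)

# What the legend will display: the same function applied to the breaks
legend_labels <- abs(legend_breaks)

p <- ggplot(toy, aes(x = col, y = category, fill = sig)) +
  geom_tile() +
  scale_fill_gradient2(low = "darkblue", mid = "white", high = "darkred",
                       breaks = legend_breaks,
                       labels = abs)

# print(p) renders the plot
```

Any function of the breaks can be passed the same way, for example `function(b) round(abs(b), 1)`.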
I'm not sure I understood what you're looking for. Do you mean that you want to change the labels within the legend? If so, setting the breaks and labels arguments of scale_fill_gradient2() should do it.
ggplot(data=df,aes(x=col,y=category)) +
geom_tile(aes(fill=sig)) +
scale_fill_gradient2(low='darkblue',mid='white',high='darkred',
breaks = sort(unique(df$sig)),
labels = round(abs(sort(unique(df$sig))), 2)) +
theme_minimal()+xlab("")+ylab("")+labs(fill="-log10(P-Value)") +
theme(axis.text.y=element_text(size=12,face="bold"),axis.text.x=element_blank())
For what you're looking for maybe you could display texts inside the figure to show the values, try stacking stat_bin_2d() like this:
ggplot(data=df,aes(x=col,y=category)) +
geom_tile(aes(fill=sig)) +
scale_fill_gradient2(low='darkblue',mid='white',high='darkred',
breaks = sort(unique(df$sig)),
labels = round(abs(sort(unique(df$sig))), 2)) +
theme_minimal()+xlab("")+ylab("")+labs(fill="-log10(P-Value)") +
stat_bin_2d(geom = 'text', aes(label = sig), colour = 'black', size = 16) +
theme(axis.text.y=element_text(size=12,face="bold"),axis.text.x=element_blank())
You might want to experiment with the size and colour arguments.
This is a continuation of a question I recently asked (Manually assigning colors with scale_fill_manual only works for certain hexagon sizes).
I was unable to plot geom_hex() so that all hexagons were the same size. Someone solved the problem. However, their solution removed the legend key. Now, I am unable to keep all the hexagons the same size while also retaining the legend.
To be specific, I really want to keep the legend labels sensical. In the example below, the legend has values (0,2,4,6,8,20), rather than hexadecimal labels (#08306B, #08519C, etc).
Below is an MWE illustrating the problem. At the end, as per the 3 comments, you can see that I am able to 1) create a plot with consistent hexagon sizes but no legend, 2) create a plot with a legend but inconsistent hexagon sizes, and 3) attempt, and fail, to create a plot with both consistent hexagon sizes and a legend:
library(ggplot2)
library(hexbin)
library(RColorBrewer)
library(reshape)
set.seed(1)
xbins <- 10
x <- abs(rnorm(10000))
y <- abs(rnorm(10000))
minVal <- min(x, y)
maxVal <- max(x, y)
maxRange <- c(minVal, maxVal)
buffer <- (maxRange[2] - maxRange[1]) / (xbins / 2)
bindata = data.frame(x=x,y=y,factor=as.factor(1))
h <- hexbin(bindata, xbins = xbins, IDs = TRUE, xbnds = maxRange, ybnds = maxRange)
counts <- hexTapply (h, bindata$factor, table)
counts <- t (simplify2array (counts))
counts <- melt (counts)
colnames (counts) <- c ("factor", "ID", "counts")
counts$factor =as.factor(counts$factor)
hexdf <- data.frame (hcell2xy (h), ID = h@cell)
hexdf <- merge (counts, hexdf)
my_breaks <- c(2, 4, 6, 8, 20, 1000)
clrs <- brewer.pal(length(my_breaks) + 3, "Blues")
clrs <- clrs[3:length(clrs)]
hexdf$countColor <- cut(hexdf$counts, breaks = c(0, my_breaks, Inf), labels = rev(clrs))
# Has consistent hexagon sizes, but no legend
ggplot(hexdf, aes(x=x, y=y, hexID=ID, counts=counts, fill=countColor)) +
geom_hex(stat="identity", fill=hexdf$countColor) +
scale_fill_manual(labels = as.character(c(0, my_breaks)), values = rev(clrs), name = "Count") +
geom_abline(intercept = 0, color = "red", size = 0.25) +
labs(x = "A", y = "C") +
coord_fixed(xlim = c(-0.5, (maxRange[2]+buffer)), ylim = c(-0.5, (maxRange[2]+buffer))) +
theme(aspect.ratio=1)
# Has legend, but inconsistent hexagon sizes
ggplot(hexdf, aes(x=x, y=y, hexID=ID, counts=counts, fill=countColor)) +
geom_hex(data=hexdf, stat="identity", aes(fill=countColor)) +
scale_fill_manual(labels = as.character(c(0, my_breaks)), values = rev(clrs), name = "Count") +
geom_abline(intercept = 0, color = "red", size = 0.25) +
labs(x = "A", y = "C") +
coord_fixed(xlim = c(-0.5, (maxRange[2]+buffer)), ylim = c(-0.5, (maxRange[2]+buffer))) +
theme(aspect.ratio=1)
# One attempt to create consistent hexagon sizes and retain legend
ggplot(hexdf, aes(x=x, y=y, hexID=ID, counts=counts, fill=countColor)) +
geom_hex(data=hexdf, aes(fill=countColor)) +
geom_hex(stat="identity", fill=hexdf$countColor) +
scale_fill_manual(labels = as.character(c(0, my_breaks)), values = rev(clrs), name = "Count") +
geom_abline(intercept = 0, color = "red", size = 0.25) +
labs(x = "A", y = "C") +
coord_fixed(xlim = c(-0.5, (maxRange[2]+buffer)), ylim = c(-0.5, (maxRange[2]+buffer))) +
theme(aspect.ratio=1)
Any suggestions on how to keep the hexagon sizes consistent while retaining the legend would be very helpful!
Wow, this is an interesting one -- geom_hex seems to really dislike mapping color/fill onto categorical variables. I assume that's because it is designed to be a two-dimensional histogram and visualize continuous summary statistics, but if anyone has any insight into what's going on behind the scenes, I would love to know.
For your specific problem, that really throws a wrench in the works, because you're attempting to have categorical colorization that assigns non-linear groups to the individual hexagons. Conceptually, you might consider why you're doing that. There may be a good reason, but you're essentially taking a linear color gradient and mapping it non-linearly onto your data, which can end up being visually misleading.
However, if that is what you want to do, the best approach I could come up with was to create a new continuous variable that mapped linearly onto your chosen colors and then use those to create a color gradient. Let me try to walk you through my thought process.
You essentially have a continuous variable (counts) that you want to map onto colors. That's easy enough with a simple color gradient, which is the default in ggplot2 for continuous variables. Using your data:
ggplot(hexdf, aes(x=x, y=y)) +
geom_hex(stat="identity", aes(fill=counts))
yields something close.
However, the bins with really high counts wash out the gradient for points with much lower counts, so we need to change the way the gradient maps colors onto values. You've already declared the colors you want to use in the clrs variable; we just need to add a column to your data frame to use in conjunction with these colors to create a smooth gradient. I did that as follows:
all_breaks <- c(0, my_breaks)
breaks_n <- 1:length(all_breaks)
get_break_n <- function(n) {
break_idx <- max(which((all_breaks - n) < 0))
breaks_n[break_idx]
}
hexdf$bin <- sapply(hexdf$counts, get_break_n)
We create the bin variable as the index of the largest break that is strictly below the count. Now, you'll notice that:
ggplot(hexdf, aes(x=x, y=y)) +
geom_hex(stat="identity", aes(fill=bin))
is getting much closer to the goal.
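As an aside, the bin index computed by get_break_n can probably be obtained without the sapply loop. The sketch below assumes that base R's findInterval() with left.open = TRUE reproduces the same "largest break strictly below the count" rule; the `counts` vector is made up for the comparison and is not the asker's data.

```r
my_breaks  <- c(2, 4, 6, 8, 20, 1000)
all_breaks <- c(0, my_breaks)

# Helper from the answer: index of the largest break strictly below n
get_break_n <- function(n) {
  max(which((all_breaks - n) < 0))
}

# Hypothetical counts, including values that sit exactly on a break
counts <- c(1, 2, 3, 8, 9, 20, 21, 1000, 1500)

bin_loop <- sapply(counts, get_break_n)

# Vectorised alternative: intervals are (break[i], break[i + 1]]
bin_vec <- findInterval(counts, all_breaks, left.open = TRUE)

print(rbind(counts, bin_loop, bin_vec))
```

On these test values the two approaches agree, including at the break points themselves, so `hexdf$bin <- findInterval(hexdf$counts, all_breaks, left.open = TRUE)` should be a drop-in replacement for positive counts.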
The next step is to change how the color gradient maps onto that bin variable, which we can do by adding a call to scale_fill_gradientn:
ggplot(hexdf, aes(x=x, y=y)) +
geom_hex(stat="identity", aes(fill=bin)) +
scale_fill_gradientn(colors=rev(clrs[-1])) # odd color reversal to
# match OP's color mapping
This takes a vector of colors between which you want to interpolate a gradient. The way we've set it up, the points along the interpolation will perfectly match up with the unique values of the bin variable, meaning each value will get one of the colors specified.
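The claim that evenly spaced values land exactly on the supplied colours can be illustrated outside ggplot2. This sketch uses base R's colorRamp() as a stand-in for the interpolation that scale_fill_gradientn performs (an assumption: ggplot2 interpolates through the scales package and in a different colour space, but the knot behaviour is the same idea). The four hex colours are example values only.

```r
# Example colours (hypothetical choice, light to dark blue)
cols <- c("#C6DBEF", "#6BAED6", "#2171B5", "#08306B")

# Interpolating function over [0, 1]; returns an RGB matrix (0-255)
ramp <- colorRamp(cols)

# Integer bins 1..4, rescaled to [0, 1] the way a continuous scale would
bins <- seq_along(cols)
pos  <- (bins - min(bins)) / (max(bins) - min(bins))

# Colours at the bin positions versus the original colours
knot_rgb <- ramp(pos)
orig_rgb <- t(col2rgb(cols))

print(knot_rgb)
print(orig_rgb)
```

Each integer bin falls on a knot of the gradient, so no in-between colours are produced as long as there are as many colours as distinct bin values.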
Now we're cooking with gas, and the only thing left to do is add the various bells and whistles from the original graph. Most importantly, we need to make the legend look the way we want. This requires three things: (1) changing it from the default color bar to a discretized legend, (2) specifying our own custom labels, and (3) giving it an informative title.
# create the custom labels for the legend
all_break_labs <- as.character(all_breaks[1:(length(all_breaks)-1)])
ggplot(hexdf, aes(x=x, y=y)) +
geom_hex(stat="identity", aes(fill=bin)) +
scale_fill_gradientn(colors=rev(clrs[-1]),
guide="legend", # (1) make legend discrete
labels=all_break_labs, # (2) specify labels
name="Count") + # (3) legend title
# All the other prettification from the OP
geom_abline(intercept = 0, color = "red", size = 0.25) +
labs(x = "A", y = "C") +
coord_fixed(xlim = c(-0.5, (maxRange[2]+buffer)),
ylim = c(-0.5, (maxRange[2]+buffer))) +
theme(aspect.ratio=1)
All of this leaves us with the following graph:
Hopefully that helps you out. For completeness, here's the new code in full:
# ... the rest of your code before the plots
clrs <- clrs[3:length(clrs)]
hexdf$countColor <- cut(hexdf$counts,
breaks = c(0, my_breaks, Inf),
labels = rev(clrs))
### START OF NEW CODE ###
# create new bin variable
all_breaks <- c(0, my_breaks)
breaks_n <- 1:length(all_breaks)
get_break_n <- function(n) {
break_idx <- max(which((all_breaks - n) < 0))
breaks_n[break_idx]
}
hexdf$bin <- sapply(hexdf$counts, get_break_n)
# create legend labels
all_break_labs <- as.character(all_breaks[1:(length(all_breaks)-1)])
# create final plot
ggplot(hexdf, aes(x=x, y=y)) +
geom_hex(stat="identity", aes(fill=bin)) +
scale_fill_gradientn(colors=rev(clrs[-1]),
guide="legend",
labels=all_break_labs,
name="Count") +
geom_abline(intercept = 0, color = "red", size = 0.25) +
labs(x = "A", y = "C") +
coord_fixed(xlim = c(-0.5, (maxRange[2]+buffer)),
ylim = c(-0.5, (maxRange[2]+buffer))) +
theme(aspect.ratio=1)
I have data that looks like this
df = data.frame(x=sample(1:5,100,replace=TRUE),y=rnorm(100),assay=sample(c('a','b'),100,replace=TRUE),project=rep(c('primary','secondary'),50))
and am producing a plot using this code
ggplot(df,aes(project,x)) + geom_violin(aes(fill=assay)) + geom_jitter(aes(shape=assay,colour=y),height=.5) + coord_flip()
which gives me this
This is 90% of the way to being what I want. But I would like each point to be plotted only on top of the violin plot for the matching assay type. That is, the jittered positions of the points should be set such that the triangles are only ever on the upper teal violin plot and the circles on the bottom red violin plot for each project type.
Any ideas how to do this?
In order to get the desired result, it is probably best to use position_jitterdodge as this gives you the best control over the way the points are 'jittered':
ggplot(df, aes(x = project, y = x, fill = assay, shape = assay, color = y)) +
geom_violin() +
geom_jitter(position = position_jitterdodge(dodge.width = 0.9,
jitter.width = 0.5,
jitter.height = 0.2),
size = 2) +
coord_flip()
which gives:
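A variant of this approach, as a sketch on simulated data with the same columns as the question: the continuous colour is mapped only in the point layer, so the violins are grouped by fill alone, and a single `dodge` value (a hypothetical helper variable) keeps the violins and the points dodged by the same amount. The seed argument of position_jitterdodge() makes the jitter reproducible.

```r
library(ggplot2)

set.seed(1)

# Simulated data shaped like the question's df
toy <- data.frame(x = sample(1:5, 100, replace = TRUE),
                  y = rnorm(100),
                  assay = sample(c("a", "b"), 100, replace = TRUE),
                  project = rep(c("primary", "secondary"), 50))

# Shared dodge width so points line up with their own violin
dodge <- 0.9

p <- ggplot(toy, aes(x = project, y = x, fill = assay)) +
  geom_violin(position = position_dodge(width = dodge)) +
  geom_point(aes(shape = assay, colour = y),
             position = position_jitterdodge(dodge.width = dodge,
                                             jitter.width = 0.3,
                                             jitter.height = 0.2,
                                             seed = 1),
             size = 2) +
  coord_flip()

# print(p) renders the plot
```

The points are dodged by the inherited fill aesthetic, which is why fill stays in the top-level aes() call.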
You can use interaction between assay & project:
p <- ggplot(df,aes(x = interaction(assay, project), y=x)) +
geom_violin(aes(fill=assay)) +
geom_jitter(aes(shape=assay, colour=y), height=.5, cex=4)
p + coord_flip()
The labeling can be adjusted by numeric scaled x axis:
# cbind the interaction as a numeric
df$group <- as.numeric(interaction(df$assay, df$project))
# plot
p <- ggplot(df,aes(x=group, y=x, group=cut_interval(group, n = 4))) +
geom_violin(aes(fill=assay)) +
geom_jitter(aes(shape=assay, colour=y), height=.5, cex=4)
p + coord_flip() + scale_x_continuous(breaks = c(1.5, 3.5), labels = levels(factor(df$project)))
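The breaks at 1.5 and 3.5 follow from how interaction() orders its levels: the first factor varies fastest, so the two assays of one project occupy adjacent integers. A small base R check on made-up vectors (the names `assay`, `project` and `grp` are illustrative):

```r
# One observation per assay/project combination
assay   <- c("a", "b", "a", "b")
project <- c("primary", "primary", "secondary", "secondary")

grp <- interaction(assay, project)

# Level order: the first factor (assay) varies fastest
print(levels(grp))

# Numeric codes used as x positions in the plot
grp_num <- as.numeric(grp)
print(grp_num)

# Midpoint of each project's pair of positions
project_mid <- tapply(grp_num, project, mean)
print(project_mid)
```

With two assays per project, the midpoints are where the project labels belong on the numeric axis.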
I am trying to plot a histogram of two overlapping distributions in ggplot2. Unfortunately, the graphic needs to be in black and white. I tried representing the two categories with different shades of grey, with transparency, but the result is not as clear as I would like. I tried adding outlines to the bars with different linetypes, but this produced some strange results.
require(ggplot2)
set.seed(65)
a = rnorm(100, mean = 1, sd = 1)
b = rnorm(100, mean = 3, sd = 1)
dat <- data.frame(category = rep(c('A', 'B'), each = 100),
values = c(a, b))
ggplot(data = dat, aes(x = values, linetype = category, fill = category)) +
geom_histogram(colour = 'black', position = 'identity', alpha = 0.4, binwidth = 1) +
scale_fill_grey()
Notice that one of the lines that should appear dotted is in fact solid (at a value of x = 4). I think this must be a result of it actually being two lines - one from the 3-4 bar and one from the 4-5 bar. The dots are out of phase so they produce a solid line. The effect is rather ugly and inconsistent.
Is there any way of fixing this overlap?
Can anyone suggest a more effective way of clarifying the difference between the two categories, without resorting to colour?
Many thanks.
One possibility would be to use a 'hollow histogram', as described here:
# assign your original plot object to a variable
p1 <- ggplot(data = dat, aes(x = values, linetype = category, fill = category)) +
geom_histogram(colour = 'black', position = 'identity', alpha = 0.4, binwidth = 0.4) +
scale_fill_grey()
# p1
# extract relevant variables from the plot object to a new data frame
# your grouping variable 'category' is named 'group' in the plot object
df <- ggplot_build(p1)$data[[1]][ , c("xmin", "y", "group")]
# plot using geom_step
ggplot(data = df, aes(x = xmin, y = y, linetype = factor(group))) +
geom_step()
If you want to vary both linetype and fill, you need to plot a histogram first (which can be filled). Set the outline colour of the histogram to transparent, then add the geom_step. Use theme_bw to avoid 'grey elements on grey background'.
p1 <- ggplot() +
geom_histogram(data = dat, aes(x = values, fill = category),
colour = "transparent", position = 'identity', alpha = 0.4, binwidth = 0.4) +
scale_fill_grey()
df <- ggplot_build(p1)$data[[1]][ , c("xmin", "y", "group")]
df$category <- factor(df$group, labels = c("A", "B"))
p1 +
geom_step(data = df, aes(x = xmin, y = y, linetype = category)) +
theme_bw()
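A possibly shorter route to the same kind of hollow histogram, offered as a sketch rather than a tested replacement: let geom_step do the binning itself through stat = "bin", which avoids the ggplot_build step. This assumes a ggplot2 version in which geom_step accepts direction = "mid" (3.3.0 or later, as far as I know), so that the steps are centred on the bins; the data are regenerated here to keep the example self-contained.

```r
library(ggplot2)

set.seed(65)

# Same structure as the question's data
dat <- data.frame(category = rep(c("A", "B"), each = 100),
                  values = c(rnorm(100, mean = 1, sd = 1),
                             rnorm(100, mean = 3, sd = 1)))

p <- ggplot(dat, aes(x = values, linetype = category)) +
  geom_step(stat = "bin", binwidth = 0.4,
            position = "identity", direction = "mid") +
  theme_bw()

# print(p) renders the plot
```

The outlines will not necessarily coincide exactly with those built from the xmin column above, since the bin alignment is left to the stat's defaults.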
First, I would recommend theme_set(theme_bw()) or theme_set(theme_classic()) (this sets the background to white, which makes it (much) easier to see shades of gray).
Second, you could try something like scale_linetype_manual(values=c(1,3)) -- this won't completely eliminate the artifacts you're unhappy about, but it might make them a little less prominent since linetype 3 is sparser than linetype 2.
Short of drawing density plots instead (which won't work very well for small samples and may not be familiar to your audience), dodging the positions of the histograms (which is ugly), or otherwise departing from histogram conventions, I can't think of a better solution.