Fill specific regions in geom_violin plot - r

How can I fill a geom_violin plot in ggplot2 with different colors based on a fixed cutoff?
For instance, given the setup:
library(ggplot2)
set.seed(123)
dat <- data.frame(x = rep(1:3,each = 100),
y = c(rnorm(100,-1),rnorm(100,0),rnorm(100,1)))
dat$f <- with(dat,ifelse(y >= 0,'Above','Below'))
I'd like to take this basic plot:
ggplot() +
geom_violin(data = dat,aes(x = factor(x),y = y))
and simply have each violin colored differently above and below zero. The naive thing to try, mapping the fill aesthetic, splits and dodges the violin plots:
ggplot() +
geom_violin(data = dat,aes(x = factor(x),y = y, fill = f))
which is not what I want. I'd like a single violin plot at each x value, but with the interior filled with different colors above and below zero.

Here's one way to do this.
library(ggplot2)
library(plyr)
#Data setup
set.seed(123)
dat <- data.frame(x = rep(1:3,each = 100),
y = c(rnorm(100,-1),rnorm(100,0),rnorm(100,1)))
First we'll use ggplot2::ggplot_build to capture all the calculated variables that go into plotting the violin plot:
p <- ggplot() +
geom_violin(data = dat,aes(x = factor(x),y = y))
p_build <- ggplot2::ggplot_build(p)$data[[1]]
Next, if we take a look at the source code for geom_violin we see that it does some specific transformations of this computed data frame before handing it off to geom_polygon to draw the actual outlines of the violin regions.
So we'll mimic that process and simply draw the filled polygons manually:
#This comes directly from the source of geom_violin
p_build <- transform(p_build,
xminv = x - violinwidth * (x - xmin),
xmaxv = x + violinwidth * (xmax - x))
p_build <- rbind(plyr::arrange(transform(p_build, x = xminv), y),
plyr::arrange(transform(p_build, x = xmaxv), -y))
I'm omitting a small detail from the source code about duplicating the first row in order to ensure that the polygon is closed.
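For reference, the omitted closing step can be sketched in base R. This is only an illustration on a made-up outline; `close_polygon` and `outline` are hypothetical names and are not part of ggplot2.

```r
# Hypothetical helper: repeat the first vertex at the end so the outline closes.
close_polygon <- function(df) rbind(df, df[1, ])

# Toy outline with two groups (made-up coordinates, for illustration only).
outline <- data.frame(
  x = c(0, 1, 1, 0, 2, 3, 3, 2),
  y = c(0, 0, 1, 1, 0, 0, 1, 1),
  group = rep(1:2, each = 4)
)

# Close each group separately, then stack the pieces back together.
closed <- do.call(rbind, lapply(split(outline, outline$group), close_polygon))
rownames(closed) <- NULL
```

Each group gains one extra row that repeats its first vertex. For filled polygons this usually makes no visible difference, which is presumably why it can be skipped here.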
Now we do two final modifications:
#Add our fill variable
p_build$fill_group <- ifelse(p_build$y >= 0,'Above','Below')
#This is necessary to ensure that instead of trying to draw
# 3 polygons, we're telling ggplot to draw six polygons
p_build$group1 <- with(p_build,interaction(factor(group),factor(fill_group)))
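To see why the interaction is needed, here is a minimal sketch on a toy data frame (the `toy` object is made up for illustration): three violin groups crossed with two fill groups give six distinct polygon groups.

```r
# Toy stand-in for p_build: three violins, each with rows above and below zero.
toy <- data.frame(
  group = rep(1:3, each = 2),
  fill_group = rep(c("Above", "Below"), times = 3)
)

# One level per (violin, fill) combination, so each becomes its own polygon.
toy$group1 <- interaction(factor(toy$group), factor(toy$fill_group))
levels(toy$group1)
```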
And finally plot:
#Note the use of the group aesthetic here with our computed version,
# group1
p_fill <- ggplot() +
geom_polygon(data = p_build,
aes(x = x,y = y,group = group1,fill = fill_group))
p_fill
Note that in general, this will clobber nice handling of any categorical x axis labels. So you will often need to do the plot using a continuous x axis and then if you need categorical labels, add them manually.
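One possible way to add the labels back manually is scale_x_continuous() with explicit breaks and labels. The sketch below uses made-up rectangles in place of the violin polygons, and the label text is a placeholder.

```r
library(ggplot2)

# Toy rectangles standing in for the violin polygons (made-up coordinates).
toy <- data.frame(
  x = c(0.8, 1.2, 1.2, 0.8, 1.8, 2.2, 2.2, 1.8),
  y = c(-1, -1, 1, 1, -1, -1, 1, 1),
  group = rep(1:2, each = 4)
)

# Continuous x axis, with category names placed at the polygon centres.
p_labelled <- ggplot(toy, aes(x = x, y = y, group = group)) +
  geom_polygon(fill = "grey70") +
  scale_x_continuous(breaks = 1:2, labels = c("first", "second"))
```

For the violin example above, the same idea would presumably be `p_fill + scale_x_continuous(breaks = 1:3, labels = ...)` with whatever category names are wanted.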

Related

How to draw different line segment with different facets

I have a question about using geom_segment in R ggplot2.
For example, I have three facets and two clusters of points (points which have the same y value) in each facet. How do I draw multiple vertical line segments, one for each cluster, with geom_segment?
Like if my data is
x <- (1:24)
y <- (rep(1,2),2,rep(2,2),1,rep(3,2),4, rep(4,1),5,6, ..rep(8,2),7)
facets <-(1,2,3)
factors <-(1,2,3,4,5,6)
xmean <- ( (1+2+3)/3, (4+5+6)/3, ..., (22+23+24)/3)
Note: (1+2+3)/3 is the mean of the first cluster in the first facet, (4+5+6)/3 is the mean of the second cluster in the first facet, and (7+8+9)/3 is the mean of the first cluster in the second facet.
My Code:
ggplot(, aes(x = as.numeric(x), y = as.numeric(y), color = factors)) +
geom_point(alpha = 0.85, size = 1.85) +
facet_grid(~facets) +
geom_segment(what should I put here to draw this line in different factors?)
Desired result:
Please see the picture!
Please see the updated picture!
Thank you so much! Have a nice day :).
Maybe this is what you are looking for. Instead of working with vectors, put your data in a data frame. That way you can easily build an aggregated data frame with the mean x value per facet and cluster, which makes it easy to draw the segments:
Note: Wasn't sure about the setup of your data. You talk about two clusters per facet but your data has 8. So I slightly changed the example data.
library(ggplot2)
library(dplyr)
df <- data.frame(
x = 1:24,
y = rep(1:6, each = 4),
facets = rep(1:3, each = 8)
)
df_sum <- df %>%
group_by(facets, y) %>%
summarise(x = mean(x))
#> `summarise()` has grouped output by 'facets'. You can override using the `.groups` argument.
ggplot(df, aes(x, y, color = factor(y))) +
geom_point(alpha = 0.85, size = 1.85) +
geom_segment(data = df_sum, aes(x = x, xend = x, y = y - .25, yend = y + .25), color = "black") +
facet_wrap(~facets)
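If dplyr is not available, the aggregation step can also be sketched in base R with aggregate(). This is an alternative to the summarise() call above, not what the answer used, and `df_sum_base` is a name chosen here for illustration.

```r
# Same example data as above.
df <- data.frame(
  x = 1:24,
  y = rep(1:6, each = 4),
  facets = rep(1:3, each = 8)
)

# Mean x position for every facet/cluster combination.
df_sum_base <- aggregate(x ~ facets + y, data = df, FUN = mean)
df_sum_base
```

The result has one row per cluster and can be passed to geom_segment(data = ...) in the same way as df_sum.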

violin_plot() with continuous axis for grouping variable?

The grouping variable for creating a geom_violin() plot in ggplot2 is expected to be discrete for obvious reasons. However my discrete values are numbers, and I would like to show them on a continuous scale so that I can overlay a continuous function of those numbers on top of the violins. Toy example:
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df) + geom_violin(aes(x=factor(x), y=y))
This works as you'd imagine: violins with their x axis values (equally spaced) labelled 1, 2, and 5, with their means at y=1,2,5 respectively. I want to overlay a continuous function such as y=x, passing through the means. Is that possible? Adding + scale_x_continuous() predictably gives Error: Discrete value supplied to continuous scale. A solution would presumably spread the violins horizontally by the numeric x values, i.e. three times the spacing between 2 and 5 as between 1 and 2, but that is not the only thing I'm trying to achieve - overlaying a continuous function is the key issue.
If this isn't possible, alternative visualisation suggestions are welcome. I know I could replace violins with a simple scatter plot to give a rough sense of density as a function of y for a given x.
The functionality to plot violin plots on a continuous scale is directly built into ggplot.
The key is to keep the original continuous variable (instead of transforming it into a factor variable) and specify how to group it within the aesthetic mapping of the geom_violin() object. The width of the groups can be modified with the width argument of cut_width(), depending on the data at hand.
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'lm')
By using this approach, all geoms for continuous data and their varying functionalities can be combined with the violin plots, e.g. we could easily replace the line with a loess curve and add a scatter plot of the points.
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'loess') +
geom_point()
More examples can be found in the ggplot helpfile for violin plots.
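To check what cut_width() does with the x values in this example, a small sketch (the vector `x` here is a shortened stand-in for df$x):

```r
library(ggplot2)

# Shortened stand-in for the x column: the three distinct values, repeated.
x <- c(1, 2, 5, 1, 2, 5)

# Bin x into intervals of width 1; each distinct value lands in its own bin.
groups <- cut_width(x, width = 1)
table(groups)
```

Because 1, 2 and 5 fall into different bins, each gets its own violin. A larger width would merge neighbouring values into a single violin.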
Try this. As you already guessed, spreading the violins by numeric values is the key to the solution. To this end I expand the df to include all x values in the interval min(x) to max(x) and use scale_x_discrete(drop = FALSE) so that all values are displayed.
Note: Thanks @ChrisW for the more general example of my approach.
library(tidyverse)
set.seed(42)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T), y = rnorm(1000, mean = x^2))
# y = x^2
# add missing x values
x.range <- seq(from=min(df$x), to=max(df$x))
df <- df %>% right_join(tibble(x = x.range))
#> Joining, by = "x"
# Whatever the desired continuous function is:
df.fit <- tibble(x = x.range, y=x^2) %>%
mutate(x = factor(x))
ggplot() +
geom_violin(data=df, aes(x = factor(x, levels = 1:5), y=y)) +
geom_line(data=df.fit, aes(x, y, group=1), color = "red") +
scale_x_discrete(drop = FALSE)
#> Warning: Removed 2 rows containing non-finite values (stat_ydensity).
Created on 2020-06-11 by the reprex package (v0.3.0)

Scale density plots in ggplot2 to have same x-axis range

I want to overlay two density plots; one of data prior to transformation and one after. I don't care about the x and y values, only the shape of the curve.
I want to superimpose the 2 charts for a given Predictor on top of each other, even though the x-axis is different. I find it hard to look across the two facets. In reality, as well, there will be a lot more plots, so combining the non-transformed and transformed data into the one would be the best solution.
library(tidyverse)
require(caret)
data(BloodBrain)
bbbTrans <- preProcess(select(bbbDescr, adistd, adistm, dpsa3, inthb), method = "YeoJohnson")
bbbTransData <- predict(bbbTrans, select(bbbDescr, adistd, adistm, dpsa3, inthb))
dat <- bbbTransData %>%
gather(Predictor, Value) %>%
mutate(Transformation = "Yeo-Johnson") %>%
bind_rows(data.frame(gather(select(bbbDescr, adistd, adistm, dpsa3, inthb), Predictor, Value), Transformation = "NA", stringsAsFactors = FALSE))
# For the predictor adistd, I would like the x-axis range to be 0:12.5 for the
# "Yeo-Johnson" transformation and 0:250 for no transformation. In this plot, it
# is hard to see the shape of the transformed variables due to the different x-value range.
dat %>% ggplot(aes(x = Value, color = Transformation)) +
geom_density(aes(y = ..scaled..), position = "dodge") +
facet_wrap(~Predictor, scales = "free")
# i.e., I want to superimpose the 2 charts for a given Predictor on top of each other, even though the x-axis is different
# I find it hard to look across the two facets. In reality, as well, there will be a lot more plots, so combining the non-transformed and transformed data into the one plot using colour would be the best solution.
filter(dat, Transformation != 'NA') %>% ggplot(aes(x = Value, y = ..scaled..)) +
geom_density() +
facet_wrap(~Predictor, scales = "free")
filter(dat, Transformation == 'NA') %>% ggplot(aes(x = Value, y = ..scaled..)) +
geom_density() +
facet_wrap(~Predictor, scales = "free")
Edit: The algorithm I think I need is (and prefer to do using tidyverse):
Group by predictor/transformation
Get density for each
Transform x of density to (x-xmin)/(xmax-xmin) so that between 0 to 1
Plot transformed density$x, density$y
Here is a solution that standardises each series (base::scale) and then estimates its density (stats::density). density() returns the same number of equally spaced points for every series, so the points can be placed on a common axis from 0 to 1 (as the OP wants).
# How many points we want
nPoints <- 1e3
# Final result
res <- list()
# Using simple loop to scale and calculate density
combinations <- expand.grid(unique(dat$Predictor), unique(dat$Transformation))
for(i in 1:nrow(combinations)) {
# Subset data
foo <- subset(dat, Predictor == combinations$Var1[i] & Transformation == combinations$Var2[i])
# Perform density on scaled signal
densRes <- density(x = scale(foo$Value), n = nPoints)
# Position signal from 1 to wanted number of points
res[[i]] <- data.frame(x = 1:nPoints, y = densRes$y,
pred = combinations$Var1[i], trans = combinations$Var2[i])
}
res <- do.call(rbind, res)
ggplot(res, aes(x / nPoints, y, color = trans, linetype = trans)) +
geom_line(alpha = 0.5, size = 1) +
facet_wrap(~ pred, scales = "free")
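The min-max step from the question, (x - xmin)/(xmax - xmin), can also be written out explicitly. The sketch below is a base-R illustration on simulated data; `rescale_density` is a hypothetical helper name, and rescaling y to a maximum of 1 is an extra assumption made so that only the curve shapes are compared.

```r
# Hypothetical helper: estimate a density, then map x onto [0, 1] and
# rescale y so that the peak equals 1.
rescale_density <- function(values, n = 512) {
  d <- density(values, n = n)
  data.frame(
    x = (d$x - min(d$x)) / (max(d$x) - min(d$x)),
    y = d$y / max(d$y)
  )
}

# Simulated skewed data and a log-transformed version of the same values.
set.seed(1)
v <- rexp(200)
raw <- rescale_density(v)
transformed <- rescale_density(log1p(v))
```

Both data frames share the same 0 to 1 x range, so they could be bound together with a label column and drawn with geom_line() in a single panel.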

ggplot2: how to add sample numbers to density plot?

I am trying to generate a (grouped) density plot labelled with sample sizes.
Sample data:
set.seed(100)
df <- data.frame(ab.class = c(rep("A", 200), rep("B", 200)),
val = c(rnorm(200, 0, 1), rnorm(200, 1, 1)))
The unlabelled density plot is generated and looks as follows:
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
What I want to do is add text labels somewhere near the peak of each density, showing the number of samples in each group. However, I cannot find the right combination of options to summarise the data in this way.
I tried to adapt the code suggested in this answer to a similar question on boxplots: https://stackoverflow.com/a/15720769/1836013
n_fun <- function(x){
return(data.frame(y = max(x), label = paste0("n = ",length(x))))
}
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4) +
stat_summary(geom = "text", fun.data = n_fun)
However, this fails with Error: stat_summary requires the following missing aesthetics: y.
I also tried adding y = ..density.. within aes() for each of the geom_density() and stat_summary() layers, and in the ggplot() object itself... none of which solved the problem.
I know this could be achieved by manually adding labels for each group, but I was hoping for a solution that generalises, and e.g. allows the label colour to be set via aes() to match the densities.
Where am I going wrong?
The y returned by fun.data is not the y aesthetic. stat_summary complains that it cannot find y, which has to be specified either globally, as in ggplot(df, aes(x = val, group = ab.class, y = ...)), or in stat_summary(aes(y = ...)) if no global y is available. fun.data then computes where to display the point/text/... at each x, based on the y supplied through aes. (I am not sure whether I have made this clear; I am not a native English speaker.)
Even if you specify y through aes, you won't get the desired result, because stat_summary computes a y at each x.
However, you can add text to desired positions by geom_text or annotate:
# save the plot as p
p <- ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
# build the data displayed on the plot.
p.data <- ggplot_build(p)$data[[1]]
# Note that column 'scaled' is the density rescaled to a maximum of 1,
# so its maximum marks the peak row of each group
p.text <- lapply(split(p.data, f = p.data$group), function(df){
df[which.max(df$scaled), ]
})
p.text <- do.call(rbind, p.text) # we can also get p.text with dplyr.
# now add the text layer to the plot
p + annotate('text', x = p.text$x, y = p.text$y,
label = sprintf('n = %d', p.text$n), vjust = 0)
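A variant that keeps the group column, so the label colour can be mapped through aes(), is sketched below. It computes the peaks with stats::density() directly rather than from the built plot. This assumes geom_density() is used with its default settings, so that the two density estimates agree; `peaks` is a name chosen here for illustration.

```r
# Same sample data as in the question.
set.seed(100)
df <- data.frame(ab.class = c(rep("A", 200), rep("B", 200)),
                 val = c(rnorm(200, 0, 1), rnorm(200, 1, 1)))

# One row per group: location and height of the density peak, plus a label.
peaks <- do.call(rbind, lapply(split(df, df$ab.class), function(g) {
  d <- density(g$val)
  data.frame(
    ab.class = g$ab.class[1],
    x = d$x[which.max(d$y)],
    y = max(d$y),
    label = paste0("n = ", nrow(g))
  )
}))
peaks
```

The data frame could then be added to the plot with geom_text(data = peaks, aes(x = x, y = y, label = label, colour = ab.class), vjust = 0, inherit.aes = FALSE).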

Plotting a time series where color depends on a category with ggplot

Consider this minimum working example:
library(ggplot2)
x <- c(1,2,3,4,5,6)
y <- c(3,2,5,1,3,1)
data <- data.frame(x,y)
pClass <- c(0,1,1,2,2,0)
plottedGraph <- ggplot(data, aes(x = x, y = y, colour = factor(pClass))) + geom_line()
print(plottedGraph)
I have a time series y = f(x) where x is a timestep. Each timestep should have a color which depends on the category of the timestep, recorded in pClass.
This is the result it gives:
It doesn't make any kind of sense to me why ggplot would connect points with the same color together and not points that follow each other (which is what geom_line should do according to the documentation).
How do I make it plot the following:
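You should use group = 1 inside aes() to tell ggplot that the different colours in fact belong to the same line (i.e. the same group). By default, ggplot2 forms groups from the discrete variables in the plot, so mapping colour to a factor splits the data into one line per colour.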
ggplot(data, aes(x = x, y = y, colour = factor(pClass), group = 1)) +
geom_line()
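One way to check the grouping is to count the groups in the built plot data. The sketch below puts pClass into the data frame for convenience; `n_groups` is a hypothetical helper, not a ggplot2 function.

```r
library(ggplot2)

dat <- data.frame(
  x = 1:6,
  y = c(3, 2, 5, 1, 3, 1),
  pClass = factor(c(0, 1, 1, 2, 2, 0))
)

# Without an explicit group, the colour factor defines the groups.
p_split <- ggplot(dat, aes(x = x, y = y, colour = pClass)) + geom_line()

# With group = 1, all points belong to one line.
p_joined <- ggplot(dat, aes(x = x, y = y, colour = pClass, group = 1)) +
  geom_line()

# Hypothetical helper: number of distinct groups in the first layer.
n_groups <- function(p) length(unique(ggplot_build(p)$data[[1]]$group))
n_groups(p_split)
n_groups(p_joined)
```

The first plot should report three groups (one per class) and the second a single group, which is why the second draws one connected line whose colour changes along the way.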
