violin_plot() with continuous axis for grouping variable? - r

The grouping variable for creating a geom_violin() plot in ggplot2 is expected to be discrete for obvious reasons. However my discrete values are numbers, and I would like to show them on a continuous scale so that I can overlay a continuous function of those numbers on top of the violins. Toy example:
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df) + geom_violin(aes(x=factor(x), y=y))
This works as you'd imagine: violins with their x axis values (equally spaced) labelled 1, 2, and 5, with their means at y=1,2,5 respectively. I want to overlay a continuous function such as y=x, passing through the means. Is that possible? Adding + scale_x_continuous() predictably gives Error: Discrete value supplied to continuous scale. A solution would presumably spread the violins horizontally by the numeric x values, i.e. three times the spacing between 2 and 5 as between 1 and 2, but that is not the only thing I'm trying to achieve - overlaying a continuous function is the key issue.
If this isn't possible, alternative visualisation suggestions are welcome. I know I could replace violins with a simple scatter plot to give a rough sense of density as a function of y for a given x.

The functionality to plot violin plots on a continuous scale is directly built into ggplot.
The key is to keep the original continuous variable (instead of transforming it into a factor variable) and specify how to group it within the aesthetic mapping of the geom_violin() object. The width of the groups can be modified with the cut_width argument, depending on the data at hand.
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'lm')
By using this approach, all geoms for continuous data and their varying functionalities can be combined with the violin plots, e.g. we could easily replace the line with a loess curve and add a scatter plot of the points.
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'loess') +
geom_point()
More examples can be found in the ggplot helpfile for violin plots.

Try this. As you already guessed, spreading the violins by numeric values is the key to the solution. To this end I expand the df to include all x values in the interval min(x) to max(x) and use scale_x_discrete(drop = FALSE) so that all values are displayed.
Note: Thanks #ChrisW for the more general example of my approach.
library(tidyverse)
set.seed(42)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T), y = rnorm(1000, mean = x^2))
# y = x^2
# add missing x values
x.range <- seq(from=min(df$x), to=max(df$x))
df <- df %>% right_join(tibble(x = x.range))
#> Joining, by = "x"
# Whatever the desired continuous function is:
df.fit <- tibble(x = x.range, y=x^2) %>%
mutate(x = factor(x))
ggplot() +
geom_violin(data=df, aes(x = factor(x, levels = 1:5), y=y)) +
geom_line(data=df.fit, aes(x, y, group=1), color = "red") +
scale_x_discrete(drop = FALSE)
#> Warning: Removed 2 rows containing non-finite values (stat_ydensity).
Created on 2020-06-11 by the reprex package (v0.3.0)

Related

How to rescale color mapping in scale_color_distiller (ggplot2)?

My data consists of three numeric variables. Something like this:
set.seed(1)
df <- data.frame(x= rnorm(10000), y= rnorm(10000))
df$col= df$x + df$y + df$x*df$y
Plotting this as a heatplot looks good:
ggplot(df, aes(x, y, col= col)) + geom_point(size= 2) + scale_color_distiller(palette = "Spectral")
But real variables can have some skewness or outliers and this totally changes the plot. After df$col[nrow(df)] <- 100 same ggplot code as above returns this plot:
Clearly, the problem is that this one point changes the scale and we get a plot with little information. My solution is to rank the data with rank() which gives a reasonable color progression for any variable I`ve tried so far. See here:
ggplot(df, aes(x, y, col= rank(col))) + geom_point(size= 2) + scale_color_distiller(palette = "Spectral")
The problem with this solution that the new scale (2,500 to 10,000) is shown as the color label. I want the original scale to be shown as color label (o to 10). Therefor, I want that the color progression corresponds to the ranked data; i.e. I need to somehow map the original values to the ranked color values. Is that possible? I tried to change limits argument to limits= c(0, 10) inside scale_color_distiller() but this does not help.
Sidenotes: I do not want to remove the outlier. Ranking works well. I wan to use scale_color_distiller(). If possible, I want not to use any additional packages than ggplot2.
rescale the rank to the range of your original df$col.
library(tidyverse)
set.seed(1)
df <- data.frame(x = rnorm(10000), y = rnorm(10000))
df %>%
mutate(
col = x + y + x * y,
scaled_rank = scales::rescale(rank(col), range(col))
) %>%
ggplot(aes(x, y, col = scaled_rank)) +
geom_point(size = 2) +
scale_color_distiller(palette = "Spectral")
Created on 2021-11-17 by the reprex package (v2.0.1)

How to draw different line segment with different facets

I have a question about using geom_segment in R ggplot2.
For example, I have three facets and two clusters of points(points which have the same y values) in each facets, how do I draw multiple vertical line segments for each clustering with geom_segment?
Like if my data is
x <- (1:24)
y <- (rep(1,2),2,rep(2,2),1,rep(3,2),4, rep(4,1),5,6, ..rep(8,2),7)
facets <-(1,2,3)
factors <-(1,2,3,4,5,6)
xmean <- ( (1+2+3)/3, (4+5+6)/3, ..., (22+23+24)/3)
Note: (1+2+3)/3 is the mean first cluster in the first facet and (4+5+6)/3 is the mean second cluster in the second facet and (7+8+9)/3 is the first cluster in the second facet.
My Code:
ggplot(,aes(x=as.numeric(x),y=as.numeric(y),color=factors)+geom_point(alpha=0.85,size=1.85)+facet_grid(~facets)
+geom_segment(what should I put here to draw this line in different factors?)
Desired result:
Please see the picture!
Please see the updated picture!
Thank you so much! Have a nice day :).
Maybe this is what you are looking for. Instead of working with vectors put your data in a dataframe. Doing so you could easily make an aggregated dataframe with the mean values per facet and cluster which makes it easy to the segments:
Note: Wasn't sure about the setup of your data. You talk about two clusters per facet but your data has 8. So I slightly changed the example data.
library(ggplot2)
library(dplyr)
df <- data.frame(
x = 1:24,
y = rep(1:6, each = 4),
facets = rep(1:3, each = 8)
)
df_sum <- df %>%
group_by(facets, y) %>%
summarise(x = mean(x))
#> `summarise()` has grouped output by 'facets'. You can override using the `.groups` argument.
ggplot(df, aes(x, y, color = factor(y))) +
geom_point(alpha = 0.85, size = 1.85) +
geom_segment(data = df_sum, aes(x = x, xend = x, y = y - .25, yend = y + .25), color = "black") +
facet_wrap(~facets)

Coloring a geom_histogram by gradient

I'm trying to plot a geom_histogram where the bars are colored by a gradient.
This is what I'm trying to do:
library(ggplot2)
set.seed(1)
df <- data.frame(id=paste("ID",1:1000,sep="."),val=rnorm(1000),stringsAsFactors=F)
ggplot(df,aes_string(x="val",y="..count..+1",fill="val"))+geom_histogram(binwidth=1,pad=TRUE)+scale_y_log10()+scale_fill_gradient2("val",low="darkblue",high="darkred")
But getting:
Any idea how to get it colored by the defined gradient?
Not sure you can fill by val because each bar of the histogram represents a collection of points.
You can, however, fill by categorical bins using cut. For example:
ggplot(df, aes(val, fill = cut(val, 100))) +
geom_histogram(show.legend = FALSE)
Just for completeness.
If the colors I'd like to have the gradient on to be manually selected here's what I suggest:
data:
library(ggplot2)
set.seed(1)
df <- data.frame(id=paste("ID",1:1000,sep="."),val=rnorm(1000),stringsAsFactors=F)
colors:
bins <- 10
cols <- c("darkblue","darkred")
colGradient <- colorRampPalette(cols)
cut.cols <- colGradient(bins)
cuts <- cut(df$val,bins)
names(cuts) <- sapply(cuts,function(t) cut.cols[which(as.character(t) == levels(cuts))])
plot:
ggplot(df,aes(val,fill=cut(val,bins))) +
geom_histogram(show.legend=FALSE) +
scale_color_manual(values=cut.cols,labels=levels(cuts)) +
scale_fill_manual(values=cut.cols,labels=levels(cuts))
Instead of binning manually another option would be to make use of the bins computed by stat_bin by mapping ..x.. (or factor(..x..) in case of a discrete scale) or after_stat(x) on the fill aesthetic.
An issue with computing the bins manually is that we end up with multiple groups per bin for which the count has to be computed (even if the count is zero most of the time) and which get stacked on top of each other in the histogram. Especially, this gets problematic if one would add labels of counts to the histogram as can be seen in this post, because in that case one ends up with multiple labels per bin.
library(ggplot2)
set.seed(1)
df <- data.frame(id = paste("ID", 1:1000, sep = "."), val = rnorm(1000), stringsAsFactors = F)
ggplot(df, aes(x = val, y = ..count.. + 1, fill = ..x..)) +
geom_histogram(binwidth = .1, pad = TRUE) +
scale_y_log10() +
scale_fill_gradient2(name = "val", low = "darkblue", high = "darkred")
#> Warning: Duplicated aesthetics after name standardisation: pad

ggplot2: how to add sample numbers to density plot?

I am trying to generate a (grouped) density plot labelled with sample sizes.
Sample data:
set.seed(100)
df <- data.frame(ab.class = c(rep("A", 200), rep("B", 200)),
val = c(rnorm(200, 0, 1), rnorm(200, 1, 1)))
The unlabelled density plot is generated and looks as follows:
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
What I want to do is add text labels somewhere near the peak of each density, showing the number of samples in each group. However, I cannot find the right combination of options to summarise the data in this way.
I tried to adapt the code suggested in this answer to a similar question on boxplots: https://stackoverflow.com/a/15720769/1836013
n_fun <- function(x){
return(data.frame(y = max(x), label = paste0("n = ",length(x))))
}
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4) +
stat_summary(geom = "text", fun.data = n_fun)
However, this fails with Error: stat_summary requires the following missing aesthetics: y.
I also tried adding y = ..density.. within aes() for each of the geom_density() and stat_summary() layers, and in the ggplot() object itself... none of which solved the problem.
I know this could be achieved by manually adding labels for each group, but I was hoping for a solution that generalises, and e.g. allows the label colour to be set via aes() to match the densities.
Where am I going wrong?
The y in the return of fun.data is not the aes. stat_summary complains that he cannot find y, which should be specificed in global settings at ggplot(df, aes(x = val, group = ab.class, y = or stat_summary(aes(y = if global setting of y is not available. The fun.data compute where to display point/text/... at each x based on y given in the data through aes. (I am not sure whether I have made this clear. Not a native English speaker).
Even if you have specified y through aes, you won't get desired results because stat_summary compute a y at each x.
However, you can add text to desired positions by geom_text or annotate:
# save the plot as p
p <- ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
# build the data displayed on the plot.
p.data <- ggplot_build(p)$data[[1]]
# Note that column 'scaled' is used for plotting
# so we extract the max density row for each group
p.text <- lapply(split(p.data, f = p.data$group), function(df){
df[which.max(df$scaled), ]
})
p.text <- do.call(rbind, p.text) # we can also get p.text with dplyr.
# now add the text layer to the plot
p + annotate('text', x = p.text$x, y = p.text$y,
label = sprintf('n = %d', p.text$n), vjust = 0)

ggplot2 boxplot medians aren't plotting as expected

So, I have a fairly large dataset (Dropbox: csv file) that I'm trying to plot using geom_boxplot. The following produces what appears to be a reasonable plot:
require(reshape2)
require(ggplot2)
require(scales)
require(grid)
require(gridExtra)
df <- read.csv("\\Downloads\\boxplot.csv", na.strings = "*")
df$year <- factor(df$year, levels = c(2010,2011,2012,2013,2014), labels = c(2010,2011,2012,2013,2014))
d <- ggplot(data = df, aes(x = year, y = value)) +
geom_boxplot(aes(fill = station)) +
facet_grid(station~.) +
scale_y_continuous(limits = c(0, 15)) +
theme(legend.position = "none"))
d
However, when you dig a little deeper, problems creep in that freak me out. When I labeled the boxplot medians with their values, the following plot results.
df.m <- aggregate(value~year+station, data = df, FUN = function(x) median(x))
d <- d + geom_text(data = df.m, aes(x = year, y = value, label = value))
d
The medians plotted by geom_boxplot aren't at the medians at all. The labels are plotted at the correct y-axis value, but the middle hinge of the boxplots are definitely not at the medians. I've been stumped by this for a few days now.
What is the reason for this? How can this type of display be produced with correct medians? How can this plot be debugged or diagnosed?
The solution to this question is in the application of scale_y_continuous. ggplot2 will perform operations in the following order:
Scale Transformations
Statistical Computations
Coordinate Transformations
In this case, because a scale transformation is invoked, ggplot2 excludes data outside the scale limits for the statistical computation of the boxplot hinges. The medians calculated by the aggregate function and used in the geom_text instruction will use the entire dataset, however. This can result in different median hinges and text labels.
The solution is to omit the scale_y_continuous instruction and instead use:
d <- ggplot(data = df, aes(x = year, y = value)) +
geom_boxplot(aes(fill = station)) +
facet_grid(station~.) +
theme(legend.position = "none")) +
coord_cartesian(y = c(0,15))
This allows ggplot2 to calculate the boxplot hinge stats using the entire dataset, while limiting the plot size of the figure.

Resources