ggplot2: how to add sample numbers to density plot?

I am trying to generate a (grouped) density plot labelled with sample sizes.
Sample data:
set.seed(100)
df <- data.frame(ab.class = c(rep("A", 200), rep("B", 200)),
val = c(rnorm(200, 0, 1), rnorm(200, 1, 1)))
The unlabelled density plot is generated and looks as follows:
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
What I want to do is add text labels somewhere near the peak of each density, showing the number of samples in each group. However, I cannot find the right combination of options to summarise the data in this way.
I tried to adapt the code suggested in this answer to a similar question on boxplots: https://stackoverflow.com/a/15720769/1836013
n_fun <- function(x){
return(data.frame(y = max(x), label = paste0("n = ",length(x))))
}
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4) +
stat_summary(geom = "text", fun.data = n_fun)
However, this fails with Error: stat_summary requires the following missing aesthetics: y.
I also tried adding y = ..density.. within aes() for each of the geom_density() and stat_summary() layers, and in the ggplot() object itself... none of which solved the problem.
I know this could be achieved by manually adding labels for each group, but I was hoping for a solution that generalises, and e.g. allows the label colour to be set via aes() to match the densities.
Where am I going wrong?

The y returned by fun.data is not the y aesthetic. stat_summary complains that it cannot find y, which has to be mapped either globally, as in ggplot(df, aes(x = val, group = ab.class, y = ...)), or in the layer itself, as in stat_summary(aes(y = ...)). fun.data then uses the y values supplied through aes() to compute where to draw the point, text, etc. at each x.
Even if you specify y through aes(), you won't get the desired result, because stat_summary computes a summary of y at each x.
However, you can add text to desired positions by geom_text or annotate:
# save the plot as p
p <- ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
# build the data displayed on the plot.
p.data <- ggplot_build(p)$data[[1]]
# Note that column 'scaled' is the density rescaled to a maximum of 1 per group,
# so its largest value marks the row at the peak of each group's density
p.text <- lapply(split(p.data, f = p.data$group), function(df){
df[which.max(df$scaled), ]
})
p.text <- do.call(rbind, p.text) # we can also get p.text with dplyr.
# now add the text layer to the plot
p + annotate('text', x = p.text$x, y = p.text$y,
label = sprintf('n = %d', p.text$n), vjust = 0)
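If the label colour should follow the groups through aes(), as the question asks, one option is to compute the peak positions yourself and pass them to geom_text(). The following is only a sketch: the helper data frame peak_labels and its columns are names made up for this example, and it assumes that base R's density() with default settings is close enough to what geom_density() draws (the two evaluate the density over slightly different ranges, so the peak can be off by a small amount).

```r
library(ggplot2)

set.seed(100)
df <- data.frame(ab.class = c(rep("A", 200), rep("B", 200)),
                 val = c(rnorm(200, 0, 1), rnorm(200, 1, 1)))

# One row per group: sample size plus the location and height of the density peak.
# peak_labels is a hypothetical helper, not something ggplot2 provides.
peak_labels <- do.call(rbind, lapply(split(df, df$ab.class), function(g) {
  d <- density(g$val)   # assumption: default kernel and bandwidth, as in geom_density()
  i <- which.max(d$y)   # index of the highest point of the estimated density
  data.frame(ab.class = g$ab.class[1],
             x = d$x[i],
             y = d$y[i],
             label = paste0("n = ", nrow(g)))
}))

# The text layer gets its own data, so colour can be mapped per group.
p <- ggplot(df, aes(x = val)) +
  geom_density(aes(fill = ab.class), alpha = 0.4) +
  geom_text(data = peak_labels,
            aes(x = x, y = y, label = label, colour = ab.class),
            vjust = -0.5, show.legend = FALSE)
```

Printing p draws the plot. Because the labels live in an ordinary data frame, adding more groups to df needs no further changes.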

Related

violin_plot() with continuous axis for grouping variable?

The grouping variable for creating a geom_violin() plot in ggplot2 is expected to be discrete for obvious reasons. However my discrete values are numbers, and I would like to show them on a continuous scale so that I can overlay a continuous function of those numbers on top of the violins. Toy example:
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df) + geom_violin(aes(x=factor(x), y=y))
This works as you'd imagine: violins with their x axis values (equally spaced) labelled 1, 2, and 5, with their means at y=1,2,5 respectively. I want to overlay a continuous function such as y=x, passing through the means. Is that possible? Adding + scale_x_continuous() predictably gives Error: Discrete value supplied to continuous scale. A solution would presumably spread the violins horizontally by the numeric x values, i.e. three times the spacing between 2 and 5 as between 1 and 2, but that is not the only thing I'm trying to achieve - overlaying a continuous function is the key issue.
If this isn't possible, alternative visualisation suggestions are welcome. I know I could replace violins with a simple scatter plot to give a rough sense of density as a function of y for a given x.

The functionality to plot violin plots on a continuous scale is directly built into ggplot.
The key is to keep the original continuous variable (instead of converting it to a factor) and specify how to group it within the aesthetic mapping of geom_violin(). The width of the groups can be adjusted through the width argument of cut_width(), depending on the data at hand.
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'lm')
By using this approach, all geoms for continuous data and their varying functionalities can be combined with the violin plots, e.g. we could easily replace the line with a loess curve and add a scatter plot of the points.
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'loess') +
geom_point()
More examples can be found in the ggplot helpfile for violin plots.
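To overlay a specific function such as y = x rather than a fitted smoother, one possibility is to evaluate the function on a grid and draw it with geom_line(). This is a minimal sketch: curve_df is a name invented for this example, and grouping directly by x (instead of cut_width()) is an assumption that only works here because x takes just a few distinct values.

```r
library(ggplot2)

set.seed(1)
x <- sample(c(1, 2, 5), size = 1000, replace = TRUE)
df <- data.frame(x = x, y = rnorm(1000, mean = x))

# Evaluate the function to overlay on a fine grid; here simply y = x.
curve_df <- data.frame(x = seq(1, 5, by = 0.1))
curve_df$y <- curve_df$x

p <- ggplot(df, aes(x = x, y = y)) +
  geom_violin(aes(group = x)) +                 # one violin per distinct x value
  geom_line(data = curve_df, colour = "red")    # continuous overlay on the same axis
```

Any other function can be swapped in by changing how curve_df$y is computed.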

Try this. As you already guessed, spreading the violins by numeric values is the key to the solution. To this end I expand the df to include all x values in the interval min(x) to max(x) and use scale_x_discrete(drop = FALSE) so that all values are displayed.
Note: Thanks @ChrisW for the more general example of my approach.
library(tidyverse)
set.seed(42)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T), y = rnorm(1000, mean = x^2))
# y = x^2
# add missing x values
x.range <- seq(from=min(df$x), to=max(df$x))
df <- df %>% right_join(tibble(x = x.range))
#> Joining, by = "x"
# Whatever the desired continuous function is:
df.fit <- tibble(x = x.range, y=x^2) %>%
mutate(x = factor(x))
ggplot() +
geom_violin(data=df, aes(x = factor(x, levels = 1:5), y=y)) +
geom_line(data=df.fit, aes(x, y, group=1), color = "red") +
scale_x_discrete(drop = FALSE)
#> Warning: Removed 2 rows containing non-finite values (stat_ydensity).
Created on 2020-06-11 by the reprex package (v0.3.0)

Plot legend for multiple histograms plotted on top of each other ggplot

I've made this multiple histogram plot in ggplot and now I want to add a legend for both the light purple part and the dark purple part. I know the conventional way is to do it with aes, but I can't figure out how to integrate that into my multiple histogram plot.
I don't shy away from manual labour, but more sophisticated solutions are preferred. Can anyone help me out?
#dataframe
set.seed(20)
df <- data.frame(expl = rbinom(n=100, size = 1, prob=0.08),
resp = sample(50:100, size = 100, replace = T))
#graph
graph <- ggplot(data = df, aes(x = resp))
graph +
geom_histogram(fill = "#BEBADA", alpha = 0.5, bins = 10) +
geom_histogram(data = subset(df, expl == '1'), fill = "#BEBADA", bins = 10)

Your data is already in the long format that is well suited for ggplot; you just need to map expl to alpha. In general, if you find yourself making multiples of the same geom, you probably want to rethink either the shape of your data or your approach for feeding it into geoms.
library(tidyverse)
set.seed(20)
df <- data.frame(expl = rbinom(n=100, size = 1, prob=0.08),
resp = sample(50:100, size = 100, replace = T))
To map expl onto alpha, make it a factor, and then assign that to alpha inside your aes. Then you can set the alpha scale to values of 0.5 and 1.
ggplot(df, aes(x = resp, alpha = as.factor(expl))) +
geom_histogram(fill = "#bebada", bins = 10) +
scale_alpha_manual(values = c(0.5, 1))
However, differentiating by alpha is a little awkward. You could instead map to fill and use light and dark purples:
ggplot(df, aes(x = resp, fill = as.factor(expl))) +
geom_histogram(bins = 10) +
scale_fill_manual(values = c("0" = "mediumpurple1", "1" = "mediumpurple4"))
Note also that you can adjust the position of the histogram bars if you need to, by assigning geom_histogram(position = ...), where you could fill in with something such as "dodge" if that's what you'd like.

If you want a legend on the alpha value, the idea is to include it as an aesthetic rather than as a direct argument as you tried. In order to do this, a simple solution is to enrich the data frame used by ggplot:
df2 <- rbind(
cbind(df, filter="all lines"),
cbind(subset(df, expl == '1'), filter="expl==1")
)
df2 corresponds to df with the rows from your subset of interest appended (the filter column records which copy each row comes from).
Then this solves your problem:
ggplot(df2, aes(resp, alpha=filter)) +
geom_histogram(fill="#BEBADA", bins=10, position="identity") +
scale_alpha_discrete(range=c(.5,1))

ggplot2: Why is it displaying the wrong values when set to log10 axis?

I'm using stat_summary to display the mean and, based off my calculations, "type1, G-" should have a mean of ~10^7.3. And that's the value I get from plotting it without a log10 axis. But when I add in the log10 axis, suddenly "type1, G-" shows a value of 10^6.5.
What's going on?
library(ggplot2)
library(reshape2) # assumption: melt() comes from reshape2
#Data
Type = rep(c("type1", "type2"), each = 6)
Gen = rep(rep(c("G-", "G+"), each = 3), 2)
A = c(4.98E+05, 5.09E+05, 1.03E+05, 3.08E+05, 5.07E+03, 4.22E+04, 6.52E+05, 2.51E+04, 8.66E+05, 8.10E+04, 6.50E+06, 1.64E+06)
B = c(6.76E+07, 3.25E+07, 1.11E+07, 2.34E+06, 4.10E+04, 1.20E+06, 7.50E+07, 1.65E+05, 9.52E+06, 5.92E+06, 3.11E+08, 1.93E+08)
df = melt(data.frame(Type, Gen, A, B))
#Correct, non-log10 version ("type1 G-" has a value over 1e+07)
ggplot(data = df, aes(x =Type,y = value)) +
stat_summary(fun.y="mean",geom="bar",position="dodge",aes(fill=Gen))+
scale_x_discrete(limits=c("type1"))+
coord_cartesian(ylim=c(10^7,10^7.5))
#Incorrect, log10 version ("type1 G-" has a value under 1e+07)
ggplot(data = df, aes(x =Type,y = value)) +
stat_summary(fun.y="mean",geom="bar",position="dodge",aes(fill=Gen))+
scale_y_log10()

You want coord_trans. As its documentation says:
# The difference between transforming the scales and
# transforming the coordinate system is that scale
# transformation occurs BEFORE statistics, and coordinate
# transformation afterwards.
However, you cannot make a barplot with this, since bars start at 0 and log10(0) is not defined. But barplots are usually not a good visualization anyway.
ggplot(data = df, aes(x =Type,y = value)) +
stat_summary(fun.y="mean",geom="point",position="identity",aes(color=Gen))+
coord_trans(y = "log10", limy = c(1e5, 1e8)) +
scale_y_continuous(breaks = 10^(5:8))
Obviously you should plot some kind of uncertainty information. I'd recommend a boxplot.
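To see where the two numbers in the question come from, it may help to redo the calculation by hand for the "type1, G-" bar. The sketch below assumes the bar averages all six values in that group (three from column A and three from column B, since variable is not mapped to any aesthetic); the variable names are made up for this example.

```r
# Values behind the "type1, G-" bar: rows 1-3 of A and rows 1-3 of B
vals <- c(4.98e5, 5.09e5, 1.03e5, 6.76e7, 3.25e7, 1.11e7)

# No scale transformation (or coord_trans): the mean is taken first,
# and only the result is placed on a log axis.
arith_mean <- mean(vals)
log10_of_mean <- log10(arith_mean)   # roughly 7.27

# scale_y_log10(): the values are log-transformed BEFORE the statistic,
# so stat_summary averages the logs, which gives the geometric mean.
mean_of_logs <- mean(log10(vals))    # roughly 6.47
geo_mean <- 10^mean_of_logs
```

The first result matches the ~10^7.3 seen without the log axis, the second matches the ~10^6.5 seen with scale_y_log10().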

Coloring a geom_histogram by gradient

I'm trying to plot a geom_histogram where the bars are colored by a gradient.
This is what I'm trying to do:
library(ggplot2)
set.seed(1)
df <- data.frame(id=paste("ID",1:1000,sep="."),val=rnorm(1000),stringsAsFactors=F)
ggplot(df, aes_string(x = "val", y = "..count..+1", fill = "val")) +
geom_histogram(binwidth = 1, pad = TRUE) +
scale_y_log10() +
scale_fill_gradient2("val", low = "darkblue", high = "darkred")
But getting:
Any idea how to get it colored by the defined gradient?

Not sure you can fill by val because each bar of the histogram represents a collection of points.
You can, however, fill by categorical bins using cut. For example:
ggplot(df, aes(val, fill = cut(val, 100))) +
geom_histogram(show.legend = FALSE)

Just for completeness.
If you'd like to select the colors of the gradient manually, here's what I suggest:
data:
library(ggplot2)
set.seed(1)
df <- data.frame(id=paste("ID",1:1000,sep="."),val=rnorm(1000),stringsAsFactors=F)
colors:
bins <- 10
cols <- c("darkblue","darkred")
colGradient <- colorRampPalette(cols)
cut.cols <- colGradient(bins)
cuts <- cut(df$val,bins)
names(cuts) <- sapply(cuts,function(t) cut.cols[which(as.character(t) == levels(cuts))])
plot:
ggplot(df,aes(val,fill=cut(val,bins))) +
geom_histogram(show.legend=FALSE) +
scale_color_manual(values=cut.cols,labels=levels(cuts)) +
scale_fill_manual(values=cut.cols,labels=levels(cuts))

Instead of binning manually, another option is to use the bins computed by stat_bin by mapping ..x.. (or after_stat(x) in current ggplot2 syntax) to the fill aesthetic; use factor(..x..) in case of a discrete scale.
An issue with computing the bins manually is that we end up with multiple groups per bin for which the count has to be computed (even if the count is zero most of the time) and which get stacked on top of each other in the histogram. This becomes especially problematic if one adds count labels to the histogram, as can be seen in this post, because in that case one ends up with multiple labels per bin.
library(ggplot2)
set.seed(1)
df <- data.frame(id = paste("ID", 1:1000, sep = "."), val = rnorm(1000), stringsAsFactors = F)
ggplot(df, aes(x = val, y = ..count.. + 1, fill = ..x..)) +
geom_histogram(binwidth = .1, pad = TRUE) +
scale_y_log10() +
scale_fill_gradient2(name = "val", low = "darkblue", high = "darkred")
#> Warning: Duplicated aesthetics after name standardisation: pad
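For reference, the same idea can be written with after_stat(), which the answer mentions as the alternative to the ..x.. notation. This is a pared-down sketch that assumes ggplot2 3.3.0 or later and leaves out the log scale and the pad argument to keep the mapping itself in focus.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(val = rnorm(1000))

# after_stat(x) refers to the bin centres computed by stat_bin,
# so each bar is filled according to its position on the x axis.
p <- ggplot(df, aes(x = val, fill = after_stat(x))) +
  geom_histogram(binwidth = 0.1) +
  scale_fill_gradient2(name = "val", low = "darkblue", high = "darkred")

# The built layer data holds one fill colour per bin.
built <- ggplot_build(p)$data[[1]]
```

Since there is exactly one group per bin, count labels added on top would not be duplicated.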

Using ggplot2, how can I create a histogram or bar plot where the last bar is the count of all values greater than some number?

I would like to plot a histogram of my data to show its distribution, but I have a few outliers that are really high compared to most of the values, which are < 1.00. Rather than having one or two bars scrunched up at the far left and then nothing until the very far right side of the graph, I'd like to have a histogram with everything except the outliers and then add a bar at the end where the label underneath it is ">100%". I can do that with ggplot2 using geom_bar() like this:
X <- c(rnorm(1000, mean = 0.5, sd = 0.2),
rnorm(10, mean = 10, sd = 0.5))
Data <- data.frame(table(cut(X, breaks=c(seq(0,1, by=0.05), max(X)))))
library(ggplot2)
ggplot(Data, aes(x = Var1, y = Freq)) + geom_bar(stat = "identity") +
scale_x_discrete(labels = paste0(c(seq(5,100, by = 5), ">100"), "%"))
The problem is that, for the size I need this to be, the labels end up overlapping or needing to be plotted at an angle for readability. I don't really need all of the bars labeled. Is there some way to either
A) plot this in a different manner other than geom_bar() so that I don't need to manually add that last bar or
B) only label some of the bars?

I will try to answer B.
I don't know if there is a parameter that would let you do B), but you can define a function to do that for you. For example:
library(ggplot2)
X <- c(rnorm(1000, mean = 0.5, sd = 0.2),
rnorm(10, mean = 10, sd = 0.5))
Data <- data.frame(table(cut(X, breaks=c(seq(0,1, by=0.05), max(X)))))
# the function blanks out every n-th label
remove_elem <- function(x, n) {
for (i in seq_along(x)) {
if (i %% n == 0) {x[i] <- ''}
}
return(x)
}
# make initial labels outside ggplot (same way as before).
labels <- paste0(c(seq(5, 100, by = 5), '>100'), '%')
Now using that function inside the ggplot function:
ggplot(Data, aes(x = Var1, y = Freq)) + geom_bar(stat = "identity") +
scale_x_discrete(labels = remove_elem(labels,2))
outputs:
I don't know if this is what you are looking for but it does the trick!
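For part A, one possible approach is to cap the values just above 1 so that all outliers fall into a single final bin, and then use geom_histogram() on a continuous axis where only a few breaks need labels. This is a sketch under assumptions: the cap of 1.025 (the centre of the last bin), the data frame name capped, and the chosen axis breaks are all picked for this example. Values below 0 fall outside the bins and are dropped with a warning, much as cut() turns them into NA in the original code.

```r
library(ggplot2)

set.seed(1)
X <- c(rnorm(1000, mean = 0.5, sd = 0.2),
       rnorm(10, mean = 10, sd = 0.5))

# Cap everything above 1.025 so that all large values land in the bin (1, 1.05].
capped <- data.frame(x = pmin(X, 1.025))

p <- ggplot(capped, aes(x = x)) +
  geom_histogram(breaks = seq(0, 1.05, by = 0.05)) +
  # Label only a handful of positions; the last one marks the overflow bin.
  scale_x_continuous(breaks = c(0, 0.25, 0.5, 0.75, 1.025),
                     labels = c("0%", "25%", "50%", "75%", ">100%"))
```

Because the axis is continuous, the number of labels is independent of the number of bars.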
