Use stat_summary to annotate plot with number of observations - r

How can I use stat_summary to label a plot with n = x, where x is a variable? Here's an example of the desired output:
I can make the plot above with this rather inefficient code:
nlabels <- sapply(1:length(unique(mtcars$cyl)), function(i) as.vector(t(as.data.frame(table(mtcars$cyl))[,2][[i]])))
ggplot(mtcars, aes(factor(cyl), mpg, label=rownames(mtcars))) +
geom_boxplot(fill = "grey80", colour = "#3366FF") +
geom_text(aes(x = 1, y = median(mtcars$mpg[mtcars$cyl==sort(unique(mtcars$cyl))[1]]), label = paste0("n = ",nlabels[[1]]) )) +
geom_text(aes(x = 2, y = median(mtcars$mpg[mtcars$cyl==sort(unique(mtcars$cyl))[2]]), label = paste0("n = ",nlabels[[2]]) )) +
geom_text(aes(x = 3, y = median(mtcars$mpg[mtcars$cyl==sort(unique(mtcars$cyl))[3]]), label = paste0("n = ",nlabels[[3]]) ))
This is a follow up to this question: How to add a number of observations per group and use group mean in ggplot2 boxplot? where I can use stat_summary to calculate and display the number of observations, but I haven't been able to find a way to include n = in the stat_summary output. Seems like stat_summary might be the most efficient way to do this kind of labelling, but other methods are welcome.

You can write your own function to use inside stat_summary(). Here n_fun sets the y position to the median() of the group and builds a label made of "n = " and the number of observations. It is important to return a data.frame() rather than c(): paste0() produces a character string while the y value is numeric, and c() would coerce both to character. Then pass this function to stat_summary() together with geom = "text". This ensures that, for each x value, the position and label are computed only from that level's data.
n_fun <- function(x){
return(data.frame(y = median(x), label = paste0("n = ",length(x))))
}
ggplot(mtcars, aes(factor(cyl), mpg, label=rownames(mtcars))) +
geom_boxplot(fill = "grey80", colour = "#3366FF") +
stat_summary(fun.data = n_fun, geom = "text")
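The coercion point made above can be checked directly in base R. This is only a small sketch; the names x, as_vector and as_frame are illustrative and not part of the original answer.

```r
# Values for one group, as stat_summary would pass them to the function
x <- mtcars$mpg[mtcars$cyl == 4]

# c() has to pick a single type, so the numeric y becomes a string
as_vector <- c(y = median(x), label = paste0("n = ", length(x)))

# data.frame() keeps one type per column, so y stays numeric
as_frame <- data.frame(y = median(x), label = paste0("n = ", length(x)))

is.character(as_vector)  # y was coerced to character
is.numeric(as_frame$y)   # y is still numeric
```

With the data frame, geom = "text" receives a numeric y to position the label and a character label to draw.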

Most things in R are vectorized, so you can leverage that.
nlabels <- table(mtcars$cyl)
# To create the median labels, you can use by
meds <- c(by(mtcars$mpg, mtcars$cyl, median))
ggplot(mtcars, aes(factor(cyl), mpg, label=rownames(mtcars))) +
geom_boxplot(fill = "grey80", colour = "#3366FF") +
geom_text(data = data.frame(), aes(x = names(meds) , y = meds,
label = paste("n =", nlabels)))
Regarding the nlabels:
Instead of your sapply statement you can simply use:
nlabels <- table(mtcars$cyl)
Notice that your current code is taking the above, converting it, transposing it, then iterating over each row only to grab the values one by one, then put them back together into a single vector.
If you really want them as an un-dimensioned integer vector, use c()
nlabels <- c(table(mtcars$cyl))
but of course, even this is not needed to accomplish the above.

Related

ggplot how to edit axis labels in R

I have a data frame to plot where the y-axis variable is a character column. I want the axis labels to show only the last part of each string, using '_' as the separator.
Here is an example with the iris dataset. As you can see, all y-axis labels are the same. How can I do it? Thanks.
iris$trial = paste('hello', 'good_bye', iris$Sepal.Length, sep = '_')
myfun = function(x) {
tail(unlist(strsplit(x, '_')), n = 1)
}
ggplot(iris, aes(x = Species, y = trial, color = Species)) +
geom_point() +
scale_y_discrete(labels = function(x) myfun(x)) +
theme_bw()
It seems to me that your function returns a single value for the whole vector of labels, and that value is then replicated. Using lapply applies the function to each element instead. However, I don't know if it makes sense in this example without making it numeric (and sorting it), so you might want to add that as well.
ggplot(iris, aes(x = Species, y = trial, color = Species)) +
geom_point() +
scale_y_discrete(labels = lapply(iris$trial, myfun)) +
theme_bw()
You can make use of regex instead to extract the required value.
library(ggplot2)
#This removes everything until the last underscore
myfun = function(x) sub('.*_', '', x)
ggplot(iris, aes(x = Species, y = trial, color = Species)) +
geom_point() +
scale_y_discrete(labels = myfun) +
theme_bw()
If you want to extract numbers from y-axis value, you can also use scale_y_discrete(labels = readr::parse_number).
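To see why the original myfun produces one repeated label, it may help to run the two approaches on a small character vector. This is a sketch with made-up input; x, flattened and per_element are illustrative names.

```r
x <- c("hello_good_bye_5.1", "hello_good_bye_4.9")

# unlist() merges the pieces of every string into one long vector,
# so tail(n = 1) keeps only the last piece of the last string
flattened <- tail(unlist(strsplit(x, "_")), n = 1)

# sub() is vectorised: it returns one result per input string
per_element <- sub(".*_", "", x)

flattened
per_element
```

A labelling function passed to scale_y_discrete() receives all the breaks at once, so it needs to return one label per break, as the sub() version does.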

Using both stat_bin computed variables and stepping through a vector of the same length as data

I'm trying to label the bars in a histogram. The basic code works fine:
library(tidyverse) # provides as_tibble(), %>% and ggplot2
mtcars <- as_tibble(mtcars)
bw <- 2
mtcars %>% ggplot(aes(x = cyl)) + geom_histogram(binwidth = bw) +
stat_bin(binwidth = bw, geom = "text", vjust = -.5, aes(label = ..count..))
However, I'm dealing with a dataset that has some bins resulting in a count of 0. This can be simulated above by changing the binwidth to 1: bw <- 1. Since there are no odd numbers, 5 and 7 will be labeled 0. Because I want 0's to just be blank, I can do something like this:
mtcars %>% ggplot(aes(x = cyl)) + geom_histogram(binwidth = bw) +
stat_bin(binwidth = bw, geom = "text", vjust = -.5, aes(label = ifelse(..count.. > 0, ..count.., "")))
So far so good. I can also use a vector of characters for the label. Here's an example:
bw <- 1
labs <- letters[1:5]
mtcars %>% ggplot(aes(x = cyl)) + geom_histogram(binwidth = bw) +
stat_bin(binwidth = bw, geom = "text", vjust = -.5, label = labs)
Note that 1) the vector must be of the same length as the number of bins and 2) label can no longer be inside aes. Violating either of those will result in Error: Aesthetics must be either length 1 or the same as the data (x): label where x is 5 in the first case and 32 in the second.
Here lies the problem: I want to once again hide labels for bins that have a count of 0 and use a vector of characters to label the bins, but I can't use an ifelse expression with ..count.. anymore because it can't be accessed outside aes, nor can I use labs inside aes without having to assign a value to each data point as opposed to each bin. Is there a way to work around this?
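One possible workaround, offered as a sketch rather than a confirmed answer, is to put the whole expression inside after_stat() so that it is evaluated on the binned data, where count has one value per bin. This assumes ggplot2 3.3.0 or later. The vector is named bin_labels here (not the name used in the question) to avoid masking ggplot2's labs() function, and it is assumed to hold one label per bin in bin order.

```r
library(ggplot2)

bw <- 1
bin_labels <- letters[1:5]  # assumed: one label per bin, in bin order

p <- ggplot(mtcars, aes(x = cyl)) +
  geom_histogram(binwidth = bw) +
  stat_bin(
    binwidth = bw, geom = "text", vjust = -0.5,
    # Evaluated after binning: 'count' is the per-bin count, and
    # bin_labels is looked up in the calling environment
    aes(label = after_stat(ifelse(count > 0, bin_labels[seq_along(count)], "")))
  )
p
```

Because the lookup happens after the statistic is computed, the label vector only has to match the number of bins, not the number of rows in the data.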

Passing argument to facet grid in function -ggplot

I am trying to write a function to plot graphs in a grid. I am using ggplot and facet grid. I am unable to pass the argument for facet grid. I wonder if anybody can point me in the right direction.
The data example:
Year = as.factor(rep(c("01", "02"), each = 4, times = 1))
Group = as.factor(rep(c("G1", "G2"), each = 2, times = 2))
Gender = as.factor(rep(c("Male", "Female"), times = 4))
Percentage = as.integer(c("80","20","50","50","45","55","15","85"))
df1 = data.frame (Year, Group, Gender, Percentage)
The code for the grid plot without function is:
p = ggplot(data=df1, aes(x=Year, y=Percentage, fill = Gender)) + geom_bar(stat = "identity")
p = p + facet_grid(~ Group, scales = 'free')
p
This produces a plot like the ones I want to do. However, when I put it into a function:
MyGridPlot <- function (df, x_axis, y_axis, bar_fill, fgrid){
p = ggplot(data=df1, aes(x=x_axis, y=y_axis, fill = bar_fill)) + geom_bar(stat = "identity")
p = p + facet_grid(~ fgrid, scales = 'free')
return(p)
}
And then run:
MyGridPlot(df1, df1$Year, df1$Percentage, df1$Gender, df1$Group)
It comes up with the error:
Error: At least one layer must contain all faceting variables: `fgrid`.
* Plot is missing `fgrid`
* Layer 1 is missing `fgrid`
I have tried using aes_string, which works for the x, y and fill but not for the grid.
MyGridPlot <- function (df, x_axis, y_axis, bar_fill, fgrid){
p = ggplot(data=df1, aes_string(x=x_axis, y=y_axis, fill = bar_fill)) + geom_bar(stat = "identity")
p = p + facet_grid(~ fgrid, scales = 'free')
return(p)
}
and then run:
MyGridPlot(df1, Year, Percentage, Gender, Group)
This produces the same error. If I delete the facet grid, both versions of the function run fine, though without the grid :-(
Thanks a lot for helping this beginner.
Gustavo
Your problem is that in your function, ggplot is looking for variable names (x_axis, y_axis, etc.), but you're giving it objects (df1$Year, ...).
There are a couple ways you could deal with this. Maybe the simplest would be to rewrite the function so that it expects objects. For example:
MyGridPlot <- function(x_axis, y_axis, bar_fill, fgrid){ # Note no df parameter here
df1 <- data.frame(x_axis = x_axis, y_axis = y_axis, bar_fill = bar_fill, fgrid = fgrid) # Create a data frame from inputs
p = ggplot(data=df1, aes(x=x_axis, y=y_axis, fill = bar_fill)) + geom_bar(stat = "identity")
p = p + facet_grid(~ fgrid, scales = 'free')
return(p)
}
MyGridPlot(Year, Percentage, Gender, Group)
Alternatively, you could set up the function with a data frame and variable names. There isn't really much reason to do this if you're working with individual objects the way you are here, but if you're working with a data frame, it might make your life easier:
MyGridPlot <- function(df, x_var, y_var, fill_var, grid_var){
# Need to "tell" R to treat parameters as variable names.
df <- df %>% mutate(x_var = UQ(enquo(x_var)), y_var = UQ(enquo(y_var)), fill_var = UQ(enquo(fill_var)), grid_var = UQ(enquo(grid_var)))
p = ggplot(data = df, aes(x = x_var, y = y_var, fill = fill_var)) + geom_bar(stat = "identity")
p = p + facet_grid(~grid_var, scales = 'free')
return(p)
}
MyGridPlot(df1, Year, Percentage, Gender, Group)
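With more recent versions of rlang and ggplot2, the same idea can probably be written more directly with the embrace operator {{ }} and vars(), without adding helper columns. The following is a sketch under that assumption (rlang 0.4.0 or later, ggplot2 3.0.0 or later). The function name my_grid_plot is illustrative, and geom_col() is used as the equivalent of geom_bar(stat = "identity").

```r
library(ggplot2)

# Same example data as in the question
df1 <- data.frame(
  Year = factor(rep(c("01", "02"), each = 4)),
  Group = factor(rep(c("G1", "G2"), each = 2, times = 2)),
  Gender = factor(rep(c("Male", "Female"), times = 4)),
  Percentage = c(80L, 20L, 50L, 50L, 45L, 55L, 15L, 85L)
)

# Column names are passed unquoted and forwarded with {{ }}
my_grid_plot <- function(df, x_var, y_var, fill_var, grid_var) {
  ggplot(df, aes(x = {{ x_var }}, y = {{ y_var }}, fill = {{ fill_var }})) +
    geom_col() +
    facet_grid(cols = vars({{ grid_var }}), scales = "free")
}

p <- my_grid_plot(df1, Year, Percentage, Gender, Group)
p
```

Here the faceting variable is forwarded the same way as the aesthetics, which is the step the formula interface (~ fgrid) could not do.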

ggplot2: how to add sample numbers to density plot?

I am trying to generate a (grouped) density plot labelled with sample sizes.
Sample data:
set.seed(100)
df <- data.frame(ab.class = c(rep("A", 200), rep("B", 200)),
val = c(rnorm(200, 0, 1), rnorm(200, 1, 1)))
The unlabelled density plot is generated and looks as follows:
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
What I want to do is add text labels somewhere near the peak of each density, showing the number of samples in each group. However, I cannot find the right combination of options to summarise the data in this way.
I tried to adapt the code suggested in this answer to a similar question on boxplots: https://stackoverflow.com/a/15720769/1836013
n_fun <- function(x){
return(data.frame(y = max(x), label = paste0("n = ",length(x))))
}
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4) +
stat_summary(geom = "text", fun.data = n_fun)
However, this fails with Error: stat_summary requires the following missing aesthetics: y.
I also tried adding y = ..density.. within aes() for each of the geom_density() and stat_summary() layers, and in the ggplot() object itself... none of which solved the problem.
I know this could be achieved by manually adding labels for each group, but I was hoping for a solution that generalises, and e.g. allows the label colour to be set via aes() to match the densities.
Where am I going wrong?
The y returned by fun.data is not the y aesthetic. stat_summary complains that it cannot find y, which has to be specified either globally in ggplot(df, aes(x = val, group = ab.class, y = ...)) or, if no global y is available, in stat_summary(aes(y = ...)). fun.data then computes where to display the point/text/... at each x, based on the y supplied through aes.
Even if you specify y through aes, you won't get the desired result, because stat_summary computes a y at each x.
However, you can add text at the desired positions with geom_text or annotate:
# save the plot as p
p <- ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
# build the data displayed on the plot.
p.data <- ggplot_build(p)$data[[1]]
# 'scaled' is the density rescaled to a maximum of 1, so its largest
# value marks the peak row for each group
p.text <- lapply(split(p.data, f = p.data$group), function(df){
df[which.max(df$scaled), ]
})
p.text <- do.call(rbind, p.text) # we can also get p.text with dplyr.
# now add the text layer to the plot
p + annotate('text', x = p.text$x, y = p.text$y,
label = sprintf('n = %d', p.text$n), vjust = 0)
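If the label colour should follow the groups through aes(), one alternative is to compute the peaks outside ggplot2 and pass them to geom_text() as a small data frame. This is a sketch, not the method of the answer above: it uses base R's density() with its default settings, so the peak positions should be close to those drawn by geom_density() but are not guaranteed to be identical. The names peaks and dens are illustrative.

```r
library(ggplot2)

set.seed(100)
df <- data.frame(
  ab.class = c(rep("A", 200), rep("B", 200)),
  val = c(rnorm(200, 0, 1), rnorm(200, 1, 1))
)

# One row per group: location and height of the density peak, plus group size
peaks <- do.call(rbind, lapply(split(df, df$ab.class), function(d) {
  dens <- density(d$val)  # base R kernel density estimate
  i <- which.max(dens$y)
  data.frame(ab.class = d$ab.class[1], val = dens$x[i],
             dens = dens$y[i], n = nrow(d))
}))

p <- ggplot(df, aes(x = val)) +
  geom_density(aes(fill = ab.class), alpha = 0.4) +
  geom_text(data = peaks,
            aes(y = dens, label = paste0("n = ", n), colour = ab.class),
            vjust = -0.3, show.legend = FALSE)
p
```

Since the labels live in their own data frame, any aesthetic (colour, size, and so on) can be mapped to the grouping column in the usual way.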

How to add a number of observations per group and use group mean in ggplot2 boxplot?

I am doing a basic boxplot where y=age and x=Patient groups
age <- ggplot(data, aes(factor(group2), age)) + ylim(15, 80)
age + geom_boxplot(fill = "grey80", colour = "#3366FF")
I was hoping you could help me out with a few things:
1) Is it possible to include a number of observations per group above each group boxplot (but NOT on the X axis where my group labels are) without having to do this in paint :)?
I have tried using:
age + annotate("text", x = "CON", y = 60, label = "25")
where CON is the 1st group and y = 60 is roughly just above the boxplot for this group. However, the command didn't work. I assume it has something to do with x being read as a continuous rather than a categorical variable.
2) Also, although there are plenty of questions about using the mean rather than the median for the boxplots, I still haven't found code that works for me.
3) On the same matter is there a way you could include the mean group stat in the boxplot? Perhaps using
age + stat_summary(fun.y=mean, colour="red", geom="point")
which however only includes a dot of where the mean lies. Or again using
age + annotate("text", x = "CON", y = 30, label = "30")
where CON is the 1st group and y = 30 is ~ the group age mean.
Knowing how flexible and rich ggplot2 syntax is I was hoping that there is a more elegant way of using the real stats output rather than annotate.
Any suggestions/links would be much appreciated!
Thanks!!
Is this anything like what you're after? With stat_summary, as requested:
# function for number of observations
give.n <- function(x){
return(c(y = median(x)*1.05, label = length(x)))
# experiment with the multiplier to find the perfect position
}
# function for mean labels
mean.n <- function(x){
return(c(y = median(x)*0.97, label = round(mean(x),2)))
# experiment with the multiplier to find the perfect position
}
# plot
ggplot(mtcars, aes(factor(cyl), mpg, label=rownames(mtcars))) +
geom_boxplot(fill = "grey80", colour = "#3366FF") +
# fun.data alone determines the summary; the fun.y argument was deprecated in ggplot2 3.3.0
stat_summary(fun.data = give.n, geom = "text") +
stat_summary(fun.data = mean.n, geom = "text", colour = "red")
Black number is number of observations, red number is mean value. joran's answer shows you how to put the numbers at the top of the boxes
hat-tip: https://stackoverflow.com/a/3483657/1036500
I think this is what you're looking for maybe?
library(plyr) # provides ddply() and .()
myboxplot <- ddply(mtcars,
.(cyl),
summarise,
min = min(mpg),
q1 = quantile(mpg,0.25),
med = median(mpg),
q3 = quantile(mpg,0.75),
max= max(mpg),
lab = length(cyl))
ggplot(myboxplot, aes(x = factor(cyl))) +
geom_boxplot(aes(lower = q1, upper = q3, middle = med, ymin = min, ymax = max), stat = "identity") +
geom_text(aes(y = max,label = lab),vjust = 0)
I just realized I mistakenly used the median when you were asking about the mean, but you can obviously use whatever function for the middle aesthetic you please.
Answer to the first problem:
To show a value above a box, provide the x value as a number rather than as a level name. So, to place the label above the first box, use x = 1.
data(ToothGrowth)
ggplot(ToothGrowth,aes(supp,len))+geom_boxplot()+
annotate("text",x=1,y=32,label=30)
