Annotate ggplot boxplot facets with number of observations per bar/group - r

I already looked through other questions like these (Annotate ggplot2 facets with number of observations per facet), but didn't find the answer for carrying out an annotation of the single bars of a boxplot with facets.
Here's my sample code for creating the boxplot:
require(ggplot2)
require(plyr)
mms <- data.frame(deliciousness = rnorm(100),
type=sample(as.factor(c("peanut", "regular")),
100, replace=TRUE),
color=sample(as.factor(c("red", "green", "yellow", "brown")),
100, replace=TRUE))
ggplot(mms, aes(x=type, y=deliciousness, fill=type)) +
geom_boxplot(notch=TRUE)+
facet_wrap(~ color,nrow=3, scales = "free")+
xlab("")+
scale_fill_manual(values = c("coral1", "lightcyan1", "olivedrab1"))+
theme(legend.position="none")
And here the corresponding plot:
Now I want to annotate individually for each facet of the color the number of observations per group (peanut/regular), as shown in my drawing:
What I have already done is summarize the number of observations per color and per group (peanut/regular) with plyr, using this code:
mms.cor <- ddply(.data=mms,
.(type,color),
summarize,
n=paste("n =", length(deliciousness)))
However, I do not know how to add this summary of the data to the ggplot. How can this be done?

Try this approach using dplyr and ggplot2. You can build the label with mutate(), then keep it only on the row that holds each group's maximum deliciousness, so that it is drawn once per box, just at the top. geom_text() then places the text. Here is the code:
library(dplyr)
library(ggplot2)
#Data
mms <- data.frame(deliciousness = rnorm(100),
type=sample(as.factor(c("peanut", "regular")),
100, replace=TRUE),
color=sample(as.factor(c("red", "green", "yellow", "brown")),
100, replace=TRUE))
#Plot
mms %>% group_by(color,type) %>% mutate(N=n()) %>%
mutate(N=ifelse(deliciousness==max(deliciousness,na.rm=T),paste0('n=',N),NA)) %>%
ggplot(aes(x=type, y=deliciousness, fill=type,label=N)) +
geom_boxplot(notch=TRUE)+
geom_text(fontface='bold')+
facet_wrap(~ color,nrow=3, scales = "free")+
xlab("")+
scale_fill_manual(values = c("coral1", "lightcyan1", "olivedrab1"))+
theme(legend.position="none")
Output:
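An alternative sketch, closer to what the question started with: keep the per-group summary in its own data frame and hand it to geom_text() through its data argument. The names counts, label and y are illustrative and not from the original posts, the label position (each group's maximum) is an assumption, and .groups = "drop" assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
mms <- data.frame(
  deliciousness = rnorm(100),
  type  = sample(factor(c("peanut", "regular")), 100, replace = TRUE),
  color = sample(factor(c("red", "green", "yellow", "brown")), 100, replace = TRUE)
)

# One row per color/type combination: the label text and a y position
# at the group's maximum (assumed placement; adjust as needed).
counts <- mms %>%
  group_by(color, type) %>%
  summarise(label = paste0("n = ", n()),
            y = max(deliciousness),
            .groups = "drop")

p <- ggplot(mms, aes(x = type, y = deliciousness, fill = type)) +
  geom_boxplot() +
  # The summary has the faceting column (color), so each label lands in its own facet.
  geom_text(data = counts, aes(y = y, label = label), vjust = -0.5) +
  facet_wrap(~ color, nrow = 3, scales = "free") +
  xlab("") +
  theme(legend.position = "none")
```

Because counts carries both color and type, the same data frame works for any facet layout; printing p draws the plot.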

Related

Trying to change colour of one variable in ggplot geom_bar dependent on the string

I have a for loop to run through a tonne of microbiome data (using phyloseq) and generate plots for multiple experiments.
ggplot(data_M1, aes(x = Sample, y = Abundance, fill = get(i))) +
geom_bar(stat = "identity")+
facet_wrap(vars(Status, Time.Point, Treatment), scales = "free", ncol=2)+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
guides(fill = guide_legend(reverse = TRUE, keywidth = 1, keyheight = 1, title = i))+
ylab(yaxisname)+
ggtitle(plotname)
ggsave(ggsavename, last_plot())
Example outcome:
What I am trying to do though is make all the "_unclassified" samples/ sequencing data grey... so maybe I need some kind of if statement with str_contains?
Happy to dput a reproducible example if required but someone might have a simple solution.
Thank you!
@camille's comment about a minimal reproducible example is germane. We need to know nothing about your facets, guides or call to ggsave to answer your question.
First, generate some test data
library(tidyverse)
d <- tibble(
Species=rep(c("s__reuteri", "s__guilliermondii",
"o__Clostridiales_unclassified", "k__bacteria_unclassified"),
each=4),
Sample=as.factor(rep(1:4, times=4)),
Abundance=runif(16)
)
Generate custom labels and colours
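# Sort the labels so they follow the order of the fill scale's levels; unnamed values are matched by position
labels <- sort(unique(d$Species))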
# Make sure length of availableColours is long enough to accommodate the maximum length of labels
availableColours <- c("red", "blue", "green", "orange", "yellow")
legendColours <- ifelse(str_detect(labels, fixed("unclassified")), "grey", availableColours)
Create the plot
d %>%
ggplot(aes(x=Sample, y=Abundance, fill=Species)) +
geom_bar(stat="identity") +
scale_fill_manual(labels=labels, values=legendColours)
Giving
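scale_fill_manual() matches an unnamed values vector to the levels of the fill variable by position, so the colour vector has to be in the same order as those levels. A sketch that sidesteps the ordering question by naming each colour after its species; fillColours and palette are illustrative names, not from the original answer.

```r
library(ggplot2)

species <- c("s__reuteri", "s__guilliermondii",
             "o__Clostridiales_unclassified", "k__bacteria_unclassified")
d <- data.frame(
  Species   = rep(species, each = 4),
  Sample    = factor(rep(1:4, times = 4)),
  Abundance = runif(16)
)

# Hypothetical palette; it needs at least as many entries as there are species.
palette <- c("red", "blue", "green", "orange")

# Name each colour after its species, then overwrite the unclassified ones.
fillColours <- setNames(palette[seq_along(species)], species)
fillColours[grepl("unclassified", names(fillColours), fixed = TRUE)] <- "grey"

p <- ggplot(d, aes(x = Sample, y = Abundance, fill = Species)) +
  geom_col() +
  scale_fill_manual(values = fillColours)
```

With a named vector the colours are looked up by species name, so the result does not depend on how the levels happen to be sorted.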
If you want to "pool" all the unclassified species, then
d1 <- d %>%
mutate(
LegendSpecies=ifelse(
str_detect(
Species,
fixed("unclassified")
),
"Unclassified",
Species
)
)
# Sorted so that labels and colours follow the order of the fill scale's levels
legendLabels <- sort(unique(d1$LegendSpecies))
legendColours <- ifelse(str_detect(legendLabels, fixed("Unclassified")), "grey", availableColours)
d1 %>%
ggplot(aes(x=Sample, y=Abundance, fill=LegendSpecies)) +
geom_bar(stat="identity")+
scale_fill_manual(labels=legendLabels, values=legendColours)
Giving

Normal curves on multiple histograms on a same plot

My example dataframe:
sample1 <- seq(100,157, length.out = 50)
sample2 <- seq(113, 167, length.out = 50)
sample3 <- seq(95,160, length.out = 50)
sample4 <-seq(88, 110, length.out = 50)
df <- as.data.frame(cbind(sample1, sample2, sample3, sample4))
I have managed to create histograms for these four variables, which share the same y-axis. Now I need an overlay normal curve. Based on previous posts, I've managed a density curve, but this is not what I want. This comes close, but I'd like a smooth line...
This is my current code for plotting:
df <- as.data.table(df)
new.df<-melt(df,id.vars="sample")
names(new.df)=c("sample","type","value")
cdat <- ddply(new.df, "type", summarise, value.mean=mean(value))
ggplot(data = new.df,aes(x=value)) +
geom_histogram(aes(x = value), bins = 15, colour = "black", fill = "gray") +
facet_wrap(~ type) + geom_density(aes(x = value),alpha=.2, fill="#FF6666") +
geom_vline(data=cdat, aes(xintercept=value.mean),
linetype="dashed", size=1, colour="black") +
theme_classic() +
theme(text = element_text(size = 15), element_line(size = 0.5),aspect.ratio = 0.75 )
And I found the following code, which I hoped would do the trick, but this gives me nothing:
stat_function(fun = dnorm, args = list(mean = mean(df$value), sd = sd(df$value)))
Unfortunately, stat_function doesn't play nicely with facets: it overlays the same function on each facet without taking account of the faceting variable.
One of the most common reasons I see for people posting ggplot questions on Stack Overflow is that they get lost while trying to coerce ggplot to do too much of their data manipulation. Functions like geom_smooth and geom_function are useful helpers for common tasks, but if you want to do something that is complex or uncommon, it is best to produce the data you want to plot, then plot it.
In fact, the main author of ggplot2 recommends this approach for a very similar problem to yours in this thread, saying:
I think you are better off generating the data outside of ggplot2 and then plotting it. See https://speakerdeck.com/jennybc/row-oriented-workflows-in-r-with-the-tidyverse to get started.
Hadley Wickham, 26 April 2018
So here's one way of doing that using tidyverse. You create a data frame of the dnorm for each sample and plot these using plain old geom_line.
Note that your histograms are counts, so you either need to change them to density, or multiply the dnorm output by the number of observations * the binwidth, otherwise you will just get an apparently "flat" line on the x axis, since the dnorm values will all be so small in relation to the counts:
library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
dfn <- df %>%
pivot_longer(everything()) %>%
ddply("name", function(x) {
xvar <- seq(min(x$value), max(x$value), length.out = 100)
data.frame(value = xvar,
y = 5 * nrow(x) * dnorm(xvar, mean(x$value), sd(x$value)))
})
df %>%
pivot_longer(everything()) %>%
group_by(name) %>%
mutate(mean = mean(value), sd = sd(value)) %>%
ggplot(aes(value)) +
geom_histogram(aes(x = value), binwidth = 5,
colour = "black", fill = "gray") +
facet_wrap(~ name) +
geom_vline(aes(xintercept = mean),
linetype = "dashed", size=1, colour="black") +
geom_line(data = dfn, aes(y = y)) +
theme_classic() +
theme(text = element_text(size = 15), element_line(size = 0.5),
aspect.ratio = 0.75 )
Created on 2020-12-07 by the reprex package (v0.3.0)
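The factor 5 * nrow(x) in the code above is the binwidth times the number of observations. A small base R sketch of why that puts the curve on the count scale: the bars of a count histogram cover a total area of n * binwidth, whereas dnorm integrates to 1. The sample x and all variable names here are made up for illustration.

```r
set.seed(42)
x <- rnorm(500, mean = 120, sd = 15)  # hypothetical sample
binwidth <- 5
n <- length(x)

# Bin edges on multiples of the binwidth that cover the whole sample
breaks <- seq(floor(min(x) / binwidth) * binwidth,
              ceiling(max(x) / binwidth) * binwidth,
              by = binwidth)
h <- hist(x, breaks = breaks, plot = FALSE)

# Total area of the count histogram: sum of counts times bar width
hist_area <- sum(h$counts) * binwidth

# Normal density (area 1) rescaled to the same total area as the histogram
grid <- seq(min(x), max(x), length.out = 200)
density_curve <- dnorm(grid, mean(x), sd(x))
count_curve <- n * binwidth * density_curve
```

Plotting count_curve over a count histogram with the same binwidth gives a curve of comparable height; plotting density_curve instead gives the flat-looking line described above.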

Color an ecdf plot that is grouped by one discrete factor, to be colored continuously using a different (continuous) factor?

I'm trying to make an ecdf graph (Empirical cumulative distribution function) with a different colored plot for each subject ('A', 'B' or 'C' in this example).
In this example, the X axis describes the RT (response time), and the Y axis describes the cumulative proportion of rt observations.
Using ggplot2 and ecdf function, I managed to plot each subject's ecdf plot with a different discrete color for each of them. The problem starts when I want to color the subject's plot continuously based on a totally different variable, here called 'color_factor', which is different for each subject and is continuous.
Here is my simplified example:
set.seed(125)
dat <- data.frame(
subject = c(rep(c("A"), 10), rep(c("B"), 10), rep(c("C"), 10)),
color_factor = c(rep(0.3, 10), rep(0.6,10), rep(0.9,10)),
rt = sample(1:50, 30, replace =T)
)
dat <- arrange(dat,color_factor,rt)
dat.ecdf <- ddply(dat, .(color_factor), transform, ecdf=ecdf(rt)(rt) )
p <- ggplot( dat.ecdf, aes(rt, ecdf, colour = subject)) + geom_line()
p2 <- ggplot( dat.ecdf, aes(rt, ecdf, colour = color_factor)) + geom_line()
the initial data looks like this:
Plot p works great and looks like this:
But when I try to color the plots using the color_factor variable, it draws only one plot for all subjects and colors it not as intended.
What I intend to do is that the graph will look like graph p, except for the plots colors, which will be, for example colored as such: subject A- light blue, subject B- blue, and subject C- dark blue, corresponding to each subject's color_factor variable.
Does anyone have any ideas about what I can do? Any help would be greatly appreciated!
Thanks very much,
Yuval
Try any of these options:
library(plyr)
library(ggplot2)
#Data
set.seed(125)
dat <- data.frame(
subject = c(rep(c("A"), 10), rep(c("B"), 10), rep(c("C"), 10)),
color_factor = c(rep(0.3, 10), rep(0.6,10), rep(0.9,10)),
rt = sample(1:50, 30, replace =T)
)
#Transform
dat <- arrange(dat,color_factor,rt)
dat.ecdf <- ddply(dat, .(color_factor), transform, ecdf=ecdf(rt)(rt) )
#Plot 1
ggplot( dat.ecdf, aes(rt, ecdf, colour = subject,group=1)) + geom_line()+
scale_color_manual(values = c('lightblue','blue','darkblue'))
Output:
Or this:
#Plot 2
ggplot( dat.ecdf, aes(rt, ecdf, colour = factor(color_factor),group=subject)) + geom_line()+
scale_color_manual(values = c('lightblue','blue','darkblue'))+
labs(color='Factor')
Output:
Or this:
#Plot 3
ggplot( dat.ecdf, aes(rt, ecdf, colour = subject,group=subject)) + geom_line()+
scale_color_manual(values = c('lightblue','blue','darkblue'))+
labs(color='Subject')
Output:
Here is the answer that does exactly what I wanted, provided by @Lime:
p <- ggplot( dat.ecdf, aes(rt, ecdf, group = subject, colour = color_factor)) + geom_line()
This colors each subject's line according to that subject's 'color_factor' value:
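The reason this works: a continuous colour variable does not split the data into groups, so without group = subject all rows are drawn as a single line. A self-contained sketch of the same idea, with a light-to-dark blue gradient added via scale_colour_gradient(); the gradient endpoints and the base R ave() step are choices made here, not part of the accepted answer.

```r
library(ggplot2)

set.seed(125)
dat <- data.frame(
  subject      = rep(c("A", "B", "C"), each = 10),
  color_factor = rep(c(0.3, 0.6, 0.9), each = 10),
  rt           = sample(1:50, 30, replace = TRUE)
)

# Per-subject empirical CDF, computed in base R
dat <- dat[order(dat$subject, dat$rt), ]
dat$ecdf <- ave(as.numeric(dat$rt), dat$subject,
                FUN = function(v) ecdf(v)(v))

# group splits the lines by subject; colour stays continuous
p <- ggplot(dat, aes(rt, ecdf, group = subject, colour = color_factor)) +
  geom_line() +
  scale_colour_gradient(low = "lightblue", high = "darkblue")
```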

How to plot lines for the count data in R?

I have a data frame like this:
frame <- data.frame("AGE" = seq(18,44,1),
"GROUP1"= c(83,101,159,185,212,276,330,293,330,356,370,325,264,274,214,229,227,154,132,121,83,69,57,32,16,17,8),
"GROUP2"= c(144,210,259,329,391,421,453,358,338,318,270,258,207,186,173,135,106,92,74,56,41,31,25,13,16,5,8))
I want to plot AGE on the x-axis and the values of GROUP1 and GROUP2 on the y-axis in the same plot, in different colors. The values should be joined by a smoothed line.
As a first part, I melted the data frame and plotted:
melt <- melt(frame, id.vars = "AGE")
melt <- melt[order(melt$AGE),]
plot(melt$AGE, melt$value)
Here is an alternative solution using dplyr and tidyr packages.
library(dplyr)
library(tidyr)
library(ggplot2)
newframe <- frame %>% gather("variable","value",-AGE)
ggplot(newframe, aes(x=AGE, y=value, color=variable)) +
geom_point() +
geom_smooth()
You could use geom_line() to get lines between the points, but it feels better to use geom_smooth() here. geom_area gives you a shaded area under the lines, but we need to change color to fill.
ggplot(newframe, aes(x=AGE, y=value, fill=variable)) + geom_area()
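The tidyr documentation marks gather() as superseded by pivot_longer(). A sketch of the same reshaping and smoothed plot with the newer function; the loess method and se = FALSE are choices made here, not something the answer above specifies.

```r
library(tidyr)
library(ggplot2)

frame <- data.frame(
  AGE = 18:44,
  GROUP1 = c(83,101,159,185,212,276,330,293,330,356,370,325,264,274,
             214,229,227,154,132,121,83,69,57,32,16,17,8),
  GROUP2 = c(144,210,259,329,391,421,453,358,338,318,270,258,207,186,
             173,135,106,92,74,56,41,31,25,13,16,5,8)
)

# Long format: one row per AGE/group combination
newframe <- pivot_longer(frame, cols = -AGE,
                         names_to = "variable", values_to = "value")

p <- ggplot(newframe, aes(x = AGE, y = value, colour = variable)) +
  geom_point() +
  # Smoothed line per group, without the confidence band
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)
```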
We can use matplot
matplot(`row.names<-`(as.matrix(frame[-1]), frame[,1]),
ylab='value',type = "l", xlab = "AGE",col = c("red", "blue"), pch = 1)
legend("topright", inset = .05, legend = c("GROUP1", "GROUP2"),
pch = 1, col = c("red", "blue"), horiz = TRUE)
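Try this, where meltdf is the melted data frame (called melt in the question):
library(ggplot2)
meltdf <- reshape2::melt(frame, id.vars = "AGE")
ggplot(meltdf,aes(x=AGE,y=value,colour=variable,group=variable)) + geom_line()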

Add arbitrary series with legend in ggplot2?

I have a bunch of data - three timeseries (model group means), coloured by group, with standard deviation represented by geom_ribbon. By default they have a nice legend on the side. I also have a single timeseries of observations, that I want to overlay over the plot (without the geom_ribbon), like this:
df <- data.frame(year=1991:2010, group=c(rep('group1',20), rep('group2',20), rep('group3',20)), mean=c(cumsum(abs(rnorm(20))),cumsum(abs(rnorm(20))),cumsum(abs(rnorm(20)))),sd=3+rnorm(60))
obs_df <- data.frame(year=1991:2010, value=cumsum(abs(rnorm(20))))
ggplot(df, aes(x=year, y=mean)) + geom_line(aes(colour=group)) + geom_ribbon(aes(ymax=mean+sd, ymin=mean-sd, fill=group), alpha = 0.2) +geom_line(data=obs_df, aes(x=year, y=value))
But the observations do not appear in the legend, because that line is not mapped to a colour (I want it black). How can I add the obs to the legend?
First, create a combined data frame of df and obs_df:
dat <- rbind(df, data.frame(year = obs_df$year,
group = "obs", mean = obs_df$value, sd = 0))
Plot:
ggplot(dat, aes(x=year, y=mean)) +
geom_line(aes(colour=group)) +
geom_ribbon(aes(ymax=mean+sd, ymin=mean-sd, fill=group), alpha = 0.2) +
scale_colour_manual(values = c("red", "green", "blue", "black")) +
scale_fill_manual(values = c("red", "green", "blue", NA))
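Another common way to get a legend entry without combining the data frames is to map colour to a constant string inside aes() for the extra layer. A sketch under that assumption; the key name "obs" and the named colour vector are illustrative.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(
  year  = rep(1991:2010, times = 3),
  group = rep(c("group1", "group2", "group3"), each = 20),
  mean  = c(cumsum(abs(rnorm(20))), cumsum(abs(rnorm(20))), cumsum(abs(rnorm(20)))),
  sd    = 3 + rnorm(60)
)
obs_df <- data.frame(year = 1991:2010, value = cumsum(abs(rnorm(20))))

p <- ggplot(df, aes(x = year, y = mean)) +
  geom_ribbon(aes(ymin = mean - sd, ymax = mean + sd, fill = group), alpha = 0.2) +
  geom_line(aes(colour = group)) +
  # A constant inside aes() creates a legend key called "obs"
  geom_line(data = obs_df, aes(y = value, colour = "obs")) +
  scale_colour_manual(values = c(group1 = "red", group2 = "green",
                                 group3 = "blue", obs = "black"))
```

Since the colour scale now has four keys and the fill scale three, ggplot2 will most likely draw them as two separate legends; the combined data frame approach above keeps them merged.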
I'm guessing you made an error with your construction of 'obs_df'. If you create it with year = 1991:2010 it makes more sense in the context of the rest of the data and it gives you the plot you are hoping for with the ggplot call unchanged.
