Short:
How do you plot a different, user/data-defined curve in each facet in ggplot2?
Long:
I would like to overlay faceted scatterplots of real data with user-defined curves of predicted data based on a faceting variables, i.e. using different curves for each facet.
Here's a toy example:
We have data on number of hedgehogs played by red or white queens for two years at two sites, with two different rate treatments. We expect those treatments to alter the hedgehog population by an annual exponential rate of either 0.5 or 1.5. So out data look like
queen <- as.factor(c(rep("red", 8), rep("white",8)))
site <- as.factor(c(rep(c(rep(1,4), rep(2,4)),2)))
year <- c(rep(c(rep(1,2), rep(2,2)),4))
rate <- rep(c(0.5,1.5),8)
hedgehogs <- c(8,10,6,14,16,9,8,11,11,9,9,10,8,11,11,6)
toy.data <- data.frame(queen, site, year, rate, hedgehogs)
Using the following this makes four nice facets of site by rate:
library("ggplot2")
ggplot(toy.data, aes(year, hedgehogs)) +
geom_point(aes(colour=queen), size=10) +
scale_colour_manual(values=c("red", "white")) +
facet_grid(rate ~ site, labeller= label_both)
I would like to overlay rate curves onto these plots.
Our prediction curve looks like:
predict.hedgehogs <- function(year, rate){
10*(rate^(year-1))
}
Where the number of hedgehogs predicted based on an exponentiation of the rate and the number of years multiplied by the starting number (here given as 10 hedgehogs).
I've tried all manner of stuffing around with stat_function and produced something on the right track but just not there,
E.g:
Adding facet specific data as per geom_hline (see bottom page here)
facet.data <- data.frame(rate=c(0.5, 0.5, 1.5, 1.5),
site=c(1, 2, 1, 2))
Then plotting
ggplot(toy.data, aes(year, hedgehogs)) +
geom_point(aes(colour = queen), size = 10) +
scale_colour_manual(values = c("red", "white")) +
facet_grid(rate ~ site, labeller = label_both) +
stat_function(mapping = aes(x = year, y = predict.hedgehogs(year,rate)),
fun = predict.hedgehogs,
args = list(r = facet.data$rate), geom = "line")
Or separate stat_function call for each rate (i.e., this strategy):
ggplot(toy.data, aes(year, hedgehogs)) +
geom_point(aes(colour=queen), size=10) +
scale_colour_manual(values=c("red", "white")) +
facet_grid(rate ~ site, labeller= label_both) +
stat_function(fun=predict.hedgehogs, args=list(rate=0.5), geom="line", rate==0.5)+
stat_function(fun=predict.hedgehogs, args=list(rate=1.5), geom="line", rate==1.5)
Error: `mapping` must be created by `aes()`
Any thoughts?
And with many thanks to comment from #Roland
If we add to toy.data predicted data from the function predict.hedgehogs above:
pred.hogs <- predict.hedgehogs(year, rate)
toy.data <- data.frame(toy.data, pred.hogs)
We can plot:
ggplot(toy.data, aes(year, hedgehogs)) +
geom_point(aes(colour=queen), size=10) +
scale_colour_manual(values=c("red", "white")) +
facet_grid(rate ~ site) +
geom_smooth(aes(x=year, y=pred.hogs), stat="identity", colour = "black")
Related
i am currently plotting (long format) data which consists of fluorescence (RFU) on the 1. Y-Axis and Growth (OD600) on the 2. Y-Axis. I have managed to create the plots, but i find it very difficult to log transform the 2. Y-axis (for OD600) and not messing up the entire plot. (The data is all derived from the same data frame)
My question is this: Is there any way to log10 transform only the 2. Y-axis (from 0.01-1) and making perhaps 5 breaks something like:("0.01","0.1","0.5","0.1")?
My code looks like this: (i apologize for ugly code)
for (i in 1:length(unique(lf_combined$media)[grepl("^.+(gfp)$",unique(lf_combined$media))])){
print(i)
coeff <- 1/max(lf_combined_test$normalized_gfp)
p1<-lf_combined_test[lf_combined_test$media %in% unique(lf_combined$media)[grepl("^.+(gfp)$",unique(lf_combined$media))][i], ] %>%
# filter(normalized_gfp>0) %>%
filter(row_number() %% 3 == 1) %>%
ggplot( aes(x=time)) +
geom_bar( aes(y=normalized_gfp), stat="identity", size=.1, fill="green", color="green", alpha=.4)+
geom_line( aes(y=od / coeff), size=2, color="tomato") +
scale_x_continuous(breaks = round(seq(0,92, by = 5),1))+
geom_vline(xintercept = 12, linetype="dotted",
color = "blue", size=1)+
scale_y_continuous(limits = c(0,80000),
name = "Relative Flourescence [RFU]/[OD] ",
sec.axis = sec_axis(~.*coeff, name="[OD600]")
) +
scale_y_log10(limits=c(0.01,1))+
theme_grey() +
theme(
axis.title.y = element_text(color = "green", size=13),
axis.title.y.right = element_text(color = "tomato", size=13)
) +
ggtitle(paste("Relative fluorescence & OD600 time series for",unique(lf_combined$media)[grepl("^.+(gfp)$",unique(lf_combined$media))][i],sep=" "))
print(p1)
)
}
Which gives a plots that looks like this for now:
Thank you very much in advance! :))
Yes, this is certainly possible. Without your data set it is difficult to give you specific code, but here is an example using the built-in mtcars data set. We plot a best-fitting line for mpg against an x axis of wt.
p <- ggplot(mtcars, aes(wt, mpg)) + geom_smooth(aes(color = 'mpg'))
p
Suppose we want to draw the value of disp according to a log scale which we will show on the y axis. We need to carry out the log transform of our data to do this, but also multiply it by 10 to get it on a similar visual scale to the mpg line:
p <- p + geom_smooth(aes(y = 10 * log10(disp), color = 'disp'))
p
To draw the secondary axis in, we need to supply it with the reverse transformation of 10 * log10(x), which is 10^(x/10), and we will supply appropriately logarithmic breaks at 10, 100 and 1000
p + scale_y_continuous(
sec.axis = sec_axis(~ 10^(.x/10), breaks = c(10, 100, 1000), name = 'disp'))
It seems that you are generating the values of your line by using od / coeff, and reversing that transform with .*coeff, which seems appropriate, but to get a log10 axis, you will need to do something like log10(od) * constant and reverse it with 10^(od/constant). Without your data, it's impossible to know what this constant should be, but you can play around with different values until it looks right visually.
I created the plot below using:
ggplot(data_all, aes(x = data_all$Speed, fill = data_all$Season)) +
theme_bw() +
geom_histogram(position = "identity", alpha = 0.2, binwidth=0.1)
As you can see, the difference in the amount of data available is very large. Is there a way to look only at the distribution and not at the total data amount?
You can reference some of the other calculated values from stat functions using a notation that you may have seen before: ..value... I'm not sure the proper name for these or where you can find a list documented, but sometimes these are called "special variables" or "calculated aesthetics".
In this case, the default calculated aesthetic on the y axis for geom_histogram() is ..count... When comparing distributions of different total N size, it's useful to use ..density... You can access ..density.. by passing it to the y aesthetic directly in the geom_histogram() function.
First, here's an example of two histograms with vastly different sizes (similar to OP's question):
library(ggplot2)
set.seed(8675309)
df <- data.frame(
x = c(rnorm(1000, -1, 0.5), rnorm(100000, 3, 1)),
group = c(rep("A", 1000), rep("B", 100000))
)
ggplot(df, aes(x, fill=group)) + theme_classic() +
geom_histogram(
alpha=0.2, color='gray80',
position="identity", bins=80)
And here's the same plot using ..density..:
ggplot(df, aes(x, fill=group)) + theme_classic() +
geom_histogram(
aes(y=..density..), alpha=0.2, color='gray80',
position="identity", bins=80)
I'm preparing an appendix plot for a revised manuscript where I need to give information of the within-year ranges (variability) of several variables between years and sites.
I figured the tidiest way to do this (I have 7 sites, 21 years, and 5 variables...) would be to use a rose plot using coord_polar. However, I stumbled upon something that has always frustrated me about ggplot - the default ordering assumptions. While factors are easily reordered based on some value, this seems to only work in a fixed fashion: as far as I've understood, the order needs to apply throughout the data frame.
In this plot, the ordering needs to depend on a value which changes between years, and therefore the colour and fill values need to change in plotting order within the panel.
To demonstrate, I've created a reproducible example coded below (pictured in the way it should not work)
Basically, I always need the Site with the minimum value within a given Year to be plotted first (in the centre), followed outwards by the increase in value of the other sites, in order of the original value (see order and diff columns of the data frame). In other words, some years Site a will be at the centre, some years Site c will be in the centre, etc.
Any help would be massively appreciated.
library('ggplot2')
library('reshape2')
library("plyr")
## reproducible example of problem: create dummy data
madeup <- data.frame(Year = rep(2000:2015, each=20), Site=rep(c("a","b","c","d"), each=5, times=16),
var1 = rnorm(n=16*20, mean=20, sd=5), var2= rnorm(n=16*20, mean=50, sd=1))
## create ranges of the data by Year and Site
myRange <- function(dat) {range=max(dat, na.rm=TRUE)-min(dat,na.rm = TRUE)}
vardf <- ddply(madeup, .(Site, Year), summarise, var1=myRange(var1),
var2=myRange(var2))
varmelt <- melt(vardf, id.vars = c("Site","Year"))
varmelt$Site <- as.character(varmelt$Site) # this to preserve the new order when rbind called
varmelt <- by(varmelt, list(varmelt$Year, varmelt$variable), function(x) {x <- x[order(x$value),]
x$order <- 1:nrow(x)
return(x)})
varmelt <- do.call(rbind, varmelt)
## create difference between these values so that each site gets plotted cumulatively on the rose plot
## (otherwise areas close to the centre become uninterpretable)
vartest <- by(varmelt, list(varmelt$Year, varmelt$variable), function(x) {
x$diff <- c(x$value[1], diff(x$value))
return(x)
})
vartest <- do.call(rbind,vartest)
## plot rose plot to display how ranges in variables vary by year and between sites
## for this test example we'll just take one variable, but the idea is to facet by variable
max1 <- max(vartest$value[vartest$variable=='var1'])
yearlength <- length(2000:2015)
ggplot(vartest[vartest$variable=="var1",], aes(x=factor(Year), y=diff)) +
theme_bw() +
geom_hline(yintercept = seq(0,max1, by=1), size=0.3, col="grey60",lty=3) +
geom_vline(xintercept=seq(1,yearlength,1), size=0.3, col='grey30', lty=2) +
geom_bar(stat='identity', width=1, size=0.5, aes(col=Site, fill=Site)) +
scale_x_discrete() +
coord_polar() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
As long as you don't use stacked bars (position = "stack", which is the default for geom_bar), ggplot2 will actually use the order of the rows in your data for the plotting order. So all you need to do, is use the original values for the y-axis (rather than the cumulatively differenced ones) along with position = "identity", and order your data from largest to smallest value before plotting:
ordered_data <- vartest[order(-vartest$value), ]
ggplot(ordered_data, aes(factor(Year), value)) +
geom_col(aes(fill = Site), position = "identity", width = 1) +
coord_polar() +
facet_wrap(~ variable)
Created on 2018-02-17 by the reprex package (v0.2.0).
PS. When generating random data for an example, consider using set.seed so that your results can be reproduced exactly.
You can start with a single plot of the largest site, and then layer smaller sites on top like so:
a <- ggplot(vartest[vartest$variable=="var1"& vartest$order==4,], aes(x=factor(Year), y=value,group=order)) +
theme_bw() +
geom_hline(yintercept = seq(0,max1, by=1), size=0.3, col="grey60",lty=3) +
geom_vline(xintercept=seq(1,yearlength,1), size=0.3, col='grey30', lty=2) +
geom_bar(stat='identity', width=1, size=0.5, aes(col=Site, fill=Site)) +
scale_x_discrete() +
coord_polar() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
b <- a + geom_bar(data = vartest[vartest$variable=="var1"& vartest$order==3,],
stat='identity', width=1, size=0.5, aes(x=factor(Year), y=value,col=Site, fill=Site))
c <- b + geom_bar(data = vartest[vartest$variable=="var1"& vartest$order==2,],
stat='identity', width=1, size=0.5, aes(x=factor(Year), y=value,col=Site, fill=Site))
c + geom_bar(data = vartest[vartest$variable=="var1"& vartest$order==1,],
stat='identity', width=1, size=0.5, aes(x=factor(Year), y=value,col=Site, fill=Site))
This produces the following:
Is that what you wanted?
Using this data.frame
DATA
#import_data
df <- read.csv(url("https://www.dropbox.com/s/1fdi26qy4ozs4xq/df_RMSE.csv?raw=1"))
and this script
library(ggplot2)
ggplot(df, aes( measured, simulated, col = indep_cumulative))+
geom_point()+
geom_smooth(method ="lm", se = F)+
facet_grid(drain~scenario)
I got this plot
I want to add RMSE for each of the two models (independent and accumulative; two values only) to the top left in each facet.
I tried
geom_text(data = df , aes(measured, simulated, label= RMSE))
It resulted in RMSE values being added to each point in the facets.
I will appreciate any help with adding the two RMSE values only to the top left of each facet.
In case you want to plot two numbers per facet you need to do some data preparation to avoid text overlapping.
library(dplyr)
df <- df %>%
mutate(label_vjust=if_else(indep_cumulative == "accumulative",
1, 2.2))
In your question you explicitly told ggplot2 to add label=RMSE at points with x=measured and y=simulated. To add labels at top left corner you could use x=-Inf and y=Inf. So the code will look like this:
ggplot(df, aes(measured, simulated, colour = indep_cumulative)) +
geom_point() +
geom_smooth(method ="lm", se = F) +
geom_text(aes(x=-Inf, y=Inf, label=RMSE, vjust=label_vjust),
hjust=0) +
facet_grid(drain~scenario)
I'm trying to make a graphic that will show three things side-by-side. First is to show change in the individual over time. Next is to show change in their peer group over time. Last is to show change in the overall population over time.
I have four time points on each observation. What I'd like to see is two sets of boxplots next to each other, one for the peer group and one for the population. Overlaid on each of these would the datapoints for a given individual. Each set would show data at time1, time2, time3, and time4. The overlayed points would convey where the individuals had been at each time, so the information can be conveyed in two sets of boxplots.
Here is code to simulate the sort of data I am working with, and my ineffective attempt at creating my plot.
peer <- c(rep(1, 15), rep(2, 41))
year <- rep(c(1, 2), 28)
pct <- rep(1:8, 7)
dat <- data.frame(cbind(peer, year, pct))
ggplot(dat, aes(peer==1, pct)) + geom_boxplot() + facet_grid(. ~ year)
I don't think my ggplot approach is even close to correct. Please help!
Here's a sketch of what I'm trying to do.
Is this close to what you had in mind? There's a boxplot for each value of peer for each year. I've also included the mean values for each group.
# Boxplots for each combination of year and peer, with means superimposed
ggplot(dat, aes(year, pct, group=interaction(year,peer), colour=factor(peer))) +
geom_boxplot(position=position_dodge(width=0.4), width=0.4) +
stat_summary(fun.y=mean, geom="line", position=position_dodge(width=0.4),
aes(group=peer)) +
stat_summary(fun.y=mean, geom="point", position=position_dodge(width=0.4), size=4,
aes(group=peer)) +
scale_x_continuous(breaks=unique(dat$year))
You can add a population boxplot, but then the plot starts to look cluttered:
# Add population boxplot (not grouped by peer)
ggplot(dat, aes(year, pct, group=interaction(year,peer), colour=factor(peer))) +
geom_boxplot(aes(group=year), width=0.05, colour="grey60", fill="#FFFFFF90") +
geom_boxplot(position=position_dodge(width=0.4), width=0.2) +
stat_summary(fun.y=mean, geom="line", position=position_dodge(width=0.4),
aes(group=peer)) +
stat_summary(fun.y=mean, geom="point", position=position_dodge(width=0.4), size=4,
aes(group=peer)) +
scale_x_continuous(breaks=unique(dat$year))
UPDATE: Based on your comment, maybe something like this:
# Add an ID variable to the data
dat$id = rep(1:(nrow(dat)/2), each=2)
library(gridExtra) # For grid.arrange function
pdf("plots.pdf", 7, 5)
for (i in unique(dat$id)) {
p1 = ggplot() +
geom_boxplot(data=dat[dat$peer==unique(dat$peer[dat$id==i]),],
aes(year, pct, group=year)) +
geom_point(data=dat[dat$id==i,], aes(year, pct),
pch=8, colour="red", size=5) +
ggtitle("Your Peers")
p2 = ggplot() +
geom_boxplot(data=dat, aes(year, pct, group=year)) +
geom_point(data=dat[dat$id==i,], aes(year, pct),
pch=8, colour="red", size=5) +
ggtitle("All Participants")
grid.arrange(p1, p2, ncol=2, main=paste0("ID = ", i))
}
dev.off()
Here's what the first plot looks like: