ggplot2: add RMSE values of two models to each facet - r

Using this data.frame
DATA
#import_data
df <- read.csv(url("https://www.dropbox.com/s/1fdi26qy4ozs4xq/df_RMSE.csv?raw=1"))
and this script
library(ggplot2)
ggplot(df, aes( measured, simulated, col = indep_cumulative))+
geom_point()+
geom_smooth(method ="lm", se = F)+
facet_grid(drain~scenario)
I got this plot
I want to add RMSE for each of the two models (independent and accumulative; two values only) to the top left in each facet.
I tried
geom_text(data = df , aes(measured, simulated, label= RMSE))
It resulted in RMSE values being added to each point in the facets.
I will appreciate any help with adding the two RMSE values only to the top left of each facet.

In case you want to plot two numbers per facet you need to do some data preparation to avoid text overlapping.
library(dplyr)
df <- df %>%
mutate(label_vjust=if_else(indep_cumulative == "accumulative",
1, 2.2))
In your question you explicitly told ggplot2 to add label=RMSE at points with x=measured and y=simulated. To add labels at top left corner you could use x=-Inf and y=Inf. So the code will look like this:
ggplot(df, aes(measured, simulated, colour = indep_cumulative)) +
geom_point() +
geom_smooth(method ="lm", se = F) +
geom_text(aes(x=-Inf, y=Inf, label=RMSE, vjust=label_vjust),
hjust=0) +
facet_grid(drain~scenario)

Related

data points misaligned when using a third value with position jitterdodge

Edited with sample data:
When I try to plot a grouped boxplot together with jittered points using position=position_jitterdodge(), and add an additional group indicated by e.g. shape, I end up with a graph where the jittered points are misaligned within the individual groups:
n <- 16
data <- data.frame(
age = factor(rep(c('young', 'old'), each=8)),
group=rep(LETTERS[1:2], n/2),
yval=rnorm(n)
)
ggplot(data, aes(x=group, y=yval))+
geom_boxplot(aes(color=group), outlier.shape = NA)+
geom_point(aes(color=group, shape=age, fill=group),size = 1.5, position=position_jitterdodge())+
scale_shape_manual(values = c(21,24))+
scale_color_manual(values=c("black", "#015393"))+
scale_fill_manual(values=c("white", "#015393"))+
theme_classic()
Is there a way to suppress that additional separation?
Thank you!
OP, I think I get what you are trying to explain. It seems the points are grouped according to age, rather than treated as the same for each group. The reason for this is that you have not specified what to group together. In order to jitter the points, they are first grouped together according to some aesthetic, then the jitter is applied. If you don't specify the grouping, then ggplot2 gives it a guess as to how you want to group the points.
In this case, it is grouping according to age and group, since both are defined to be used in the aesthetics (x=, fill=, and color= are assigned to group and shape= is assigned to age).
To define that you only want to group the points by the column group, you can use the group= aesthetic modifier. (reposting your data with a seed so you see the same thing)
set.seed(8675309)
n <- 16
data <- data.frame(
age = factor(rep(c('young', 'old'), each=8)),
group=rep(LETTERS[1:2], n/2),
yval=rnorm(n)
)
ggplot(data, aes(x=group, y=yval))+
geom_boxplot(aes(color=group), outlier.shape = NA)+
geom_point(aes(color=group, shape=age, fill=group, group=group),size = 1.5, position=position_jitterdodge())+
scale_shape_manual(values = c(21,24))+
scale_color_manual(values=c("black", "#015393"))+
scale_fill_manual(values=c("white", "#015393"))+
theme_classic()

plotting geom_text at individual positions in each facet of facet_wrap

I have this dataset
df <- data.frame(groups=factor(c(rep("A",6), rep("B",6), rep("C",6))),
types=factor(c(rep(c(rep("Z",2), rep("Y",2), rep("X",2)),3))),
values=c(10,11,1,2,0.1, 0.2, 12,13, 2,2.5, 0.2, 0.01,
12,14, 2,3,0.1,0.2))
library(ggplot2)
px <- ggplot(df, aes(groups, values)) + facet_wrap(~types, scale="free") + geom_point()
px
i conducted post-hoc analysis (values~groups for each level of types) and created a dataset, which contains significance groups (as letters: a, b, c and so on):
df.text <- data.frame(groups=factor(c(rep("A",3), rep("B",3), rep("C",3))),
label=rep("a", 9))
i proceeded to plot the labels on px:
px + geom_text(data=df.text, aes(x=groups, y=0.1, label=label), size=4, col="red", stat="identity") +theme_bw()
which doesnt look great.
My problem is to define a aes(y) in geom_text, which plots the labels at at a fixed position (e.g. above the x-axis or on top of the panel), without shifting the limits of the y axis too much. With previous datasets, the y-values were quite homogeneous among groups, so i could get away with a very low y-value. This time however the range of y is quite high, so its not easily getting done.
So, the question is how to plot the labels inside df.text at a fixed position in facet_wrap, while keeping scale="free". Best would be above the top panel.border.
You could define a new variable height which determines the height to plot the labels:
library(tidyverse)
df %>%
group_by(types) %>%
mutate(height = max(values) + .3 * sd(values)) %>%
left_join(df.text, by = "groups") %>%
ggplot(aes(groups, values)) +
facet_wrap(~types, scale = "free") +
geom_point() +
geom_text(aes(x = groups, y = height, label = label), size = 4, col = "red", stat = "identity") +
theme_bw()
Here I used the max value plus .3 times the standard deviation but you could change that to whatever you wanted obviously. Not sure how to get the labels on top of the panel strips though.

ggplot2: How to combine histogram, rug plot, and logistic regression prediction in a single graph

I am trying to plot combined graphs for logistic regressions as the function logi.hist.plot but I would like to do it using ggplot2 (aesthetic reasons).
The problem is that only one of the histograms should have the scale_y_reverse().
Is there any way to specify this in a single plot (see code below) or to overlap the two histograms by using coordinates that can be passed to the previous plot?
ggplot(dat) +
geom_point(aes(x=ind, y=dep)) +
stat_smooth(aes(x=ind, y=dep), method=glm, method.args=list(family="binomial"), se=FALSE) +
geom_histogram(data=dat[dat$dep==0,], aes(x=ind)) +
geom_histogram(data=dat[dat$dep==1,], aes(x=ind)) ## + scale_y_reverse()
This final plot is what I have been trying to achieve:
We use geom_segment to create the "bars" for the histogram and also to create the rug plots. Adjust the size parameter to change the "bar" widths in the histogram. In the example below, the bar heights are equal to the percentage of values within a given x range. If you want to change the absolute heights of the bars, just multiply n/sum(n) by a scaling factor when you create the h data frame of histogram counts.
To generate histogram counts for the plot, we pre-summarize the data to create the histogram values. Note the ifelse statement in the mutate function, which adjusts the values of pct in order to get the upward and downward bars in the plot, depending on whether y is 0 or 1, respectively. You can do this in the plot code itself, but then you need two separate calls to geom_segment.
library(dplyr)
# Fake data
set.seed(1926)
dat = data.frame(y = sample(0:1, 1000, replace=TRUE))
dat$x1 = rnorm(1000, 5, 2) * (dat$y+1)
# Summarise data to create histogram counts
h = dat %>% group_by(y) %>%
mutate(breaks = cut(x1, breaks=seq(-2,20,0.5), labels=seq(-1.75,20,0.5),
include.lowest=TRUE),
breaks = as.numeric(as.character(breaks))) %>%
group_by(y, breaks) %>%
summarise(n = n()) %>%
mutate(pct = ifelse(y==0, n/sum(n), 1 - n/sum(n)))
ggplot() +
geom_segment(data=h, size=4, show.legend=FALSE,
aes(x=breaks, xend=breaks, y=y, yend=pct, colour=factor(y))) +
geom_segment(dat=dat[dat$y==0,], aes(x=x1, xend=x1, y=0, yend=-0.02), size=0.2, colour="grey30") +
geom_segment(dat=dat[dat$y==1,], aes(x=x1, xend=x1, y=1, yend=1.02), size=0.2, colour="grey30") +
geom_line(data=data.frame(x=seq(-2,20,0.1),
y=predict(glm(y ~ x1, family="binomial", data=dat),
newdata=data.frame(x1=seq(-2,20,0.1)),
type="response")),
aes(x,y), colour="grey50", lwd=1) +
scale_y_continuous(limits=c(-0.02,1.02)) +
scale_x_continuous(limits=c(-1,20)) +
theme_bw(base_size=12)

How to specify ggplot2 boxplot fill colour for continuous data?

I want to plot a ggplot2 boxplot using all columns of a data.frame, and I want to reorder the columns by the median for each column, rotate the x-axis labels, and fill each box with the colour corresponding to the same median. I can't figure out how to do the last part. There are plenty of examples where the fill colour corresponds to a factor variable, but I haven't seen a clear example of using a continuous variable to control fill colour. (The reason I'm trying to do this is that the resultant plot will provide context for a force-directed network graph with nodes that will be colour-coded in the same way as the boxplot -- the colour will then provide a mapping between the two plots.) It would be nice if I could re-use the value-to-colour mapping for later plots so that colours are consistent between plots. So, for example, the box corresponding to the column variable with a high median value will have a colour that denotes this mapping and matches perfectly the colour for the same column variable in other plots (such as the corresponding node in a force-directed network graph).
So far, I have something like this:
# Melt the data.frame:
DT.m <- melt(results, id.vars = NULL) # using reshape2
# I can now make a boxplot for every column in the data.frame:
g <- ggplot(DT.m, aes(x = reorder(variable, value, FUN=median), y = value)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
stat_summary(fun.y=mean, colour="darkred", geom="point") +
geom_boxplot(???, alpha=0.5)
The colour fill information is what I'm stuck on. "value" is a continuous variable in the range [0,1] and there are 55 columns in my data.frame. Various approaches I've tried seem to result in the boxes being split vertically down the middle, and I haven't got any further. Any ideas?
You can do this by adding the median-by-group to your data frame and then mapping the new median variable to the fill aesthetic. Here's an example with the built-in mtcars data frame. By using this same mapping across different plots, you should get the same colors:
library(ggplot2)
library(dplyr)
ggplot(mtcars %>% group_by(carb) %>%
mutate(medMPG = median(mpg)),
aes(x = reorder(carb, mpg, FUN=median), y = mpg)) +
geom_boxplot(aes(fill=medMPG)) +
stat_summary(fun.y=mean, colour="darkred", geom="point") +
scale_fill_gradient(low=hcl(15,100,75), high=hcl(195,100,75))
If you have various data frames with different ranges of medians, you can still use the method above, but to get a consistent mapping of color to median across all your plots, you'll need to also set the same limits for scale_fill_gradient in each plot. In this example, the median of mpg (by carb grouping) varies from 15.0 to 22.8. But let's say across all my data sets, it varies from 13.3 to 39.8. Then I could add this to all my plots:
scale_fill_gradient(limits=c(13.3, 39.8),
low=hcl(15,100,75), high=hcl(195,100,75))
This is just for illustration. For ease of maintenance if your data might change, you'll want to set the actual limits programmatically.
I built on eipi10's solution and obtained the following code which does what I want:
# "results" is a 55-column data.frame containing
# bootstrapped estimates of the Gini impurity for each column variable
# (But can synthesize fake data for testing with a bunch of rnorms)
DT.m <- melt(results, id.vars = NULL) # using reshape2
g <- ggplot(DT.m %>% group_by(variable) %>%
mutate(median.gini = median(value)),
aes(x = reorder(variable, value, FUN=median), y = value)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_boxplot(aes(fill=median.gini)) +
stat_summary(fun.y=mean, colour="darkred", geom="point") +
scale_fill_gradientn(colours = heat.colors(9)) +
ylab("Gini impurity") +
xlab("Feature") +
guides(fill=guide_colourbar(title="Median\nGini\nimpurity"))
plot(g)
Later, for the second plot:
medians <- lapply(results, median)
color <- colorRampPalette(colors =
heat.colors(9))(1000)[cut(unlist(medians),1000,labels = F)]
color is then a character vector containing the colours of the nodes in my subsequent network graph, and these colours match those in the boxplot. Job done!

Overlay raw data onto geom_bar

I have a data-frame arranged as follows:
condition,treatment,value
A , one , 2
A , one , 1
A , two , 4
A , two , 2
...
D , two , 3
I have used ggplot2 to make a grouped bar plot that looks like this:
The bars are grouped by "condition" and the colours indicate "treatment." The bar heights are the mean of the values for each condition/treatment pair. I achieved this by creating a new data frame containing the mean and standard error (for the error bars) for all the points that will make up each group.
What I would like to do is superimpose the raw jittered data to produce a bar-chart version of this box plot: http://docs.ggplot2.org/0.9.3.1/geom_boxplot-6.png [I realise that a box plot would probably be better, but my hands are tied because the client is pathologically attached to bar charts]
I have tried adding a geom_point object to my plot and feeding it the raw data (rather than the aggregated means which were used to make the bars). This sort of works, but it plots the raw values at the wrong x axis locations. They appear at the points at which the red and grey bars join, rather than at the centres of the appropriate bar. So my plot looks like this:
I can not figure out how to shift the points by a fixed amount and then jitter them in order to get them centered over the correct bar. Anyone know? Is there, perhaps, a better way of achieving what I'm trying to do?
What follows is a minimal example that shows the problem I have:
#Make some fake data
ex=data.frame(cond=rep(c('a','b','c','d'),each=8),
treat=rep(rep(c('one','two'),4),each=4),
value=rnorm(32) + rep(c(3,1,4,2),each=4) )
#Calculate the mean and SD of each condition/treatment pair
agg=aggregate(value~cond*treat, data=ex, FUN="mean") #mean
agg$sd=aggregate(value~cond*treat, data=ex, FUN="sd")$value #add the SD
dodge <- position_dodge(width=0.9)
limits <- aes(ymax=value+sd, ymin=value-sd) #Set up the error bars
p <- ggplot(agg, aes(fill=treat, y=value, x=cond))
#Plot, attempting to overlay the raw data
print(
p + geom_bar(position=dodge, stat="identity") +
geom_errorbar(limits, position=dodge, width=0.25) +
geom_point(data= ex[ex$treat=='one',], colour="green", size=3) +
geom_point(data= ex[ex$treat=='two',], colour="pink", size=3)
)
I found it is unnecessary to create separate dataframes. The plot can be created by providing ggplot with the raw data.
ex <- data.frame(cond=rep(c('a','b','c','d'),each=8),
treat=rep(rep(c('one','two'),4),each=4),
value=rnorm(32) + rep(c(3,1,4,2),each=4) )
p <- ggplot(ex, aes(cond,value,fill = treat))
p + geom_bar(position = 'dodge', stat = 'summary', fun.y = 'mean') +
geom_errorbar(stat = 'summary', position = 'dodge', width = 0.9) +
geom_point(aes(x = cond), shape = 21, position = position_dodge(width = 1))
You need just one call to geom_point() where you use data frame ex and set x values to cond, y values to value and color=treat (inside aes()). Then add position=dodge to ensure that points are dodgeg. With scale_color_manual() and argument values= you can set colors you need.
p+geom_bar(position=dodge, stat="identity") +
geom_errorbar(limits, position=dodge, width=0.25)+
geom_point(data=ex,aes(cond,value,color=treat),position=dodge)+
scale_color_manual(values=c("green","pink"))
UPDATE - jittering of points
You can't directly use positions dodge and jitter together. But there are some workarounds. If you save whole plot as object then with ggplot_build() you can see x positions for bars - in this case they are 0.775, 1.225, 1.775... Those positions correspond to combinations of factors cond and treat. As in data frame ex there are 4 values for each combination, then add new column that contains those x positions repeated 4 times.
ex$xcord<-rep(c(0.775,1.225,1.775,2.225,2.775,3.225,3.775,4.225),each=4)
Now in geom_point() use this new column as x values and set position to jitter.
p+geom_bar(position=dodge, stat="identity") +
geom_errorbar(limits, position=dodge, width=0.25)+
geom_point(data=ex,aes(xcord,value,color=treat),position=position_jitter(width =.15))+
scale_color_manual(values=c("green","pink"))
As illustrated by holmrenser above, referencing a single dataframe and updating the stat instruction to "summary" in the geom_bar function is more efficient than creating additional dataframes and retaining the stat instruction as "identity" in the code.
To both jitter and dodge the data points with the bar charts per the OP's original question, this can also be accomplished by updating the position instruction in the code with position_jitterdodge. This positioning scheme allows widths for jitter and dodge terms to be customized independently, as follows:
p <- ggplot(ex, aes(cond,value,fill = treat))
p + geom_bar(position = 'dodge', stat = 'summary', fun.y = 'mean') +
geom_errorbar(stat = 'summary', position = 'dodge', width = 0.9) +
geom_point(aes(x = cond), shape = 21, position =
position_jitterdodge(jitter.width = 0.5, jitter.height=0.4,
dodge.width=0.9))

Resources