Show standard devation using geom_smooth and ggplot - r

We have some data which represents many model runs under different scenarios. For a single scenario, we'd like to display the smoothed mean, with the filled areas representing standard deviation at a particular point in time, rather than the quality of the fit of smooting.
For example:
d <- as.data.frame(rbind(cbind(1:20, 1:20, 1),
cbind(1:20, -1:-20, 2)))
names(d)<-c("Time","Value","Run")
ggplot(d, aes(x=Time, y=Value)) +
geom_line(aes(group=Run)) +
geom_smooth()
This produces a graph with two runs represented, and a smoothed mean, but even though the SD between the runs is increasing, the smoother's bars stay the same size. I'd like to make the surrounds of the smoother represent standard deviation at a given timestep.
Is there a non-labour intensive way of doing this, given many different runs and output variables?

hi i'm not sure if I correctly understand what you want, but for example,
d <- data.frame(Time=rep(1:20, 4),
Value=rnorm(80, rep(1:20, 4)+rep(1:4*2, each=20)),
Run=gl(4,20))
mean_se <- function(x, mult = 1) {
x <- na.omit(x)
se <- mult * sqrt(var(x) / length(x))
mean <- mean(x)
data.frame(y = mean, ymin = mean - se, ymax = mean + se)
}
ggplot( d, aes(x=Time,y=Value) ) + geom_line( aes(group=Run) ) +
geom_smooth(se=FALSE) +
stat_summary(fun.data=mean_se, geom="ribbon", alpha=0.25)
note that mean_se is going to appear in the next version of ggplot2.

The accepted answer just works if measurements are aligned/discretized on x. In case of continuous data you could use a rolling window and add a custom ribbon
iris %>%
## apply same grouping as for plot
group_by(Species) %>%
## Important sort along x!
arrange(Petal.Length) %>%
## calculate rolling mean and sd
mutate(rolling_sd=rollapply(Petal.Width, width=10, sd, fill=NA), rolling_mean=rollmean(Petal.Width, k=10, fill=NA)) %>% # table_browser()
## build the plot
ggplot(aes(Petal.Length, Petal.Width, color = Species)) +
# optionally we could rather plot the rolling mean instead of the geom_smooth loess fit
# geom_line(aes(y=rolling_mean), color="black") +
geom_ribbon(aes(ymin=rolling_mean-rolling_sd/2, ymax=rolling_mean+rolling_sd/2), fill="lightgray", color="lightgray", alpha=.8) +
geom_point(size = 1, alpha = .7) +
geom_smooth(se=FALSE)

Related

Standard Error Bars in wrong position on my graph

I am trying to graph my data in R for my research project and for some reason on the three graphs I have created my error bars look like this. They are all at the bottom of the bars rather than in the correct spot on the top.
Here is my coding for that specific graph:
ggplot(Epiphyte_Biomass,aes(x=Treatment, y=Epiphyte.Biomass,fill=Treatment))+
geom_bar(stat="Identity")+
geom_errorbar(aes(ymin=mean(Epiphyte.Biomass)-sd(Epiphyte.Biomass),
ymax=mean(Epiphyte.Biomass)+ sd(Epiphyte.Biomass)),
width=0.2)+
theme_classic()
When you computed the mean and sd, ggplot didn't automatically subdivide the data by group, so I think you got the overall mean and SD (the mean looks low, but perhaps you have fewer data points in the "NC+N" treatment?)
ggplot2 has some built-in convenience wrappers for functions from the Hmisc package that compute different kinds of ranges, but ±1 SD bars are not included. Try
msd <- function(y) {
my <- mean(y, na.rm = TRUE)
sy <- sd(y, na.rm = TRUE)
data.frame(y = my, ymin = my - sy, ymax = my + sy)
}
## and use this in place of `geom_errorbar()`:
+ stat_summary(fun.data = msd, geom = "errorbar")
Here is an example using mtcars:
ggplot(mtcars, aes(cyl, mpg, fill = factor(cyl))) +
stat_summary(geom = "bar", fun = mean) +
stat_summary(geom = "errorbar", fun.data = msd)
The point is that this way ggplot does all the mean and SD calculations per treatment for you, on the fly, rather than your having to do them separately ...
It looks as though your data set may already have computed the mean of epiphyte biomass per treatment, in which case your SD calculations will be messed up anyway (they will be the SDs across treatment means rather than the within-treatment SDs)
I think the error is in -+sd(Epiphyte.Biomass)... You have to calculate sd for each treatment separately. In your case, sd is the same for both!
Here is an example with the mtcars dataset.
Just take your variables and put them in.
I really appreciate Ben Bolkers answer. It is not trivial to set the errorbars at least if you are not doing it every day.
library(tidyverse)
library(plyr)
# function
data_summary <- function(data, varname, groupnames){
require(plyr)
summary_func <- function(x, col){
c(mean = mean(x[[col]], na.rm=TRUE),
sd = sd(x[[col]], na.rm=TRUE))
}
data_sum<-ddply(data, groupnames, .fun=summary_func,
varname)
data_sum <- rename(data_sum, c("mean" = varname))
return(data_sum)
}
# definition of variables
df <- data_summary(mtcars, varname="mpg",
groupnames=c("cyl"))
# plot
ggplot(df, aes(x=factor(cyl), y=mpg, fill=factor(cyl))) +
geom_bar(stat="identity", color="black",
position=position_dodge()) +
geom_errorbar(aes(ymin=mpg-sd, ymax=mpg+sd), width=.2,
position=position_dodge(.9))

Why does ggplot geom_jitter plots extra values? [duplicate]

How would I ignore outliers in ggplot2 boxplot? I don't simply want them to disappear (i.e. outlier.size=0), but I want them to be ignored such that the y axis scales to show 1st/3rd percentile. My outliers are causing the "box" to shrink so small its practically a line. Are there some techniques to deal with this?
Edit
Here's an example:
y = c(.01, .02, .03, .04, .05, .06, .07, .08, .09, .5, -.6)
qplot(1, y, geom="boxplot")
Use geom_boxplot(outlier.shape = NA) to not display the outliers and scale_y_continuous(limits = c(lower, upper)) to change the axis limits.
An example.
n <- 1e4L
dfr <- data.frame(
y = exp(rlnorm(n)), #really right-skewed variable
f = gl(2, n / 2)
)
p <- ggplot(dfr, aes(f, y)) +
geom_boxplot()
p # big outlier causes quartiles to look too slim
p2 <- ggplot(dfr, aes(f, y)) +
geom_boxplot(outlier.shape = NA) +
scale_y_continuous(limits = quantile(dfr$y, c(0.1, 0.9)))
p2 # no outliers plotted, range shifted
Actually, as Ramnath showed in his answer (and Andrie too in the comments), it makes more sense to crop the scales after you calculate the statistic, via coord_cartesian.
coord_cartesian(ylim = quantile(dfr$y, c(0.1, 0.9)))
(You'll probably still need to use scale_y_continuous to fix the axis breaks.)
Here is a solution using boxplot.stats
# create a dummy data frame with outliers
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# compute lower and upper whiskers
ylim1 = boxplot.stats(df$y)$stats[c(1, 5)]
# scale y limits based on ylim1
p1 = p0 + coord_cartesian(ylim = ylim1*1.05)
I had the same problem and precomputed the values for Q1, Q2, median, ymin, ymax using boxplot.stats:
# Load package and generate data
library(ggplot2)
data <- rnorm(100)
# Compute boxplot statistics
stats <- boxplot.stats(data)$stats
df <- data.frame(x="label1", ymin=stats[1], lower=stats[2], middle=stats[3],
upper=stats[4], ymax=stats[5])
# Create plot
p <- ggplot(df, aes(x=x, lower=lower, upper=upper, middle=middle, ymin=ymin,
ymax=ymax)) +
geom_boxplot(stat="identity")
p
The result is a boxplot without outliers.
One idea would be to winsorize the data in a two-pass procedure:
run a first pass, learn what the bounds are, e.g. cut of at given percentile, or N standard deviation above the mean, or ...
in a second pass, set the values beyond the given bound to the value of that bound
I should stress that this is an old-fashioned method which ought to be dominated by more modern robust techniques but you still come across it a lot.
gg.layers::geom_boxplot2 is just what you want.
# remotes::install_github('rpkgs/gg.layers')
library(gg.layers)
library(ggplot2)
p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot2(width = 0.8, width.errorbar = 0.5)
https://rpkgs.github.io/gg.layers/reference/geom_boxplot2.html
If you want to force the whiskers to extend to the max and min values, you can tweak the coef argument. Default value for coef is 1.5 (i.e. default length of the whiskers is 1.5 times the IQR).
# Load package and create a dummy data frame with outliers
#(using example from Ramnath's answer above)
library(ggplot2)
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# create boxplot where whiskers extend to max and min values
p1 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)), coef = 500)
Simple, dirty and effective.
geom_boxplot(outlier.alpha = 0)
The "coef" option of the geom_boxplot function allows to change the outlier cutoff in terms of interquartile ranges. This option is documented for the function stat_boxplot. To deactivate outliers (in other words they are treated as regular data), one can instead of using the default value of 1.5 specify a very high cutoff value:
library(ggplot2)
# generate data with outliers:
df = data.frame(x=1, y = c(-10, rnorm(100), 10))
# generate plot with increased cutoff for outliers:
ggplot(df, aes(x, y)) + geom_boxplot(coef=1e30)

ggplot2: Different vlines for each graph using facet_wrap [duplicate]

I've poked around, but been unable to find an answer. I want to do a weighted geom_bar plot overlaid with a vertical line that shows the overall weighted average per facet. I'm unable to make this happen. The vertical line seems to a single value applied to all facets.
require('ggplot2')
require('plyr')
# data vectors
panel <- c("A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
instrument <-c("V1","V2","V1","V1","V1","V2","V1","V1","V2","V1","V1","V2","V1","V1","V2","V1")
cost <- c(1,4,1.5,1,4,4,1,2,1.5,1,2,1.5,2,1.5,1,2)
sensitivity <- c(3,5,2,5,5,1,1,2,3,4,3,2,1,3,1,2)
# put an initial data frame together
mydata <- data.frame(panel, instrument, cost, sensitivity)
# add a "contribution to" vector to the data frame: contribution of each instrument
# to the panel's weighted average sensitivity.
myfunc <- function(cost, sensitivity) {
return(cost*sensitivity/sum(cost))
}
mydata <- ddply(mydata, .(panel), transform, contrib=myfunc(cost, sensitivity))
# two views of each panels weighted average; should be the same numbers either way
ddply(mydata, c("panel"), summarize, wavg=weighted.mean(sensitivity, cost))
ddply(mydata, c("panel"), summarize, wavg2=sum(contrib))
# plot where each panel is getting its overall cost-weighted sensitivity from. Also
# put each panel's weighted average on the plot as a simple vertical line.
#
# PROBLEM! I don't know how to get geom_vline to honor the facet breakdown. It
# seems to be computing it overall the data and showing the resulting
# value identically in each facet plot.
ggplot(mydata, aes(x=sensitivity, weight=contrib)) +
geom_bar(binwidth=1) +
geom_vline(xintercept=sum(contrib)) +
facet_wrap(~ panel) +
ylab("contrib")
If you pass in the presumarized data, it seems to work:
ggplot(mydata, aes(x=sensitivity, weight=contrib)) +
geom_bar(binwidth=1) +
geom_vline(data = ddply(mydata, "panel", summarize, wavg = sum(contrib)), aes(xintercept=wavg)) +
facet_wrap(~ panel) +
ylab("contrib") +
theme_bw()
Example using dplyr and facet_wrap incase anyone wants it.
library(dplyr)
library(ggplot2)
df1 <- mutate(iris, Big.Petal = Petal.Length > 4)
df2 <- df1 %>%
group_by(Species, Big.Petal) %>%
summarise(Mean.SL = mean(Sepal.Length))
ggplot() +
geom_histogram(data = df1, aes(x = Sepal.Length, y = ..density..)) +
geom_vline(data = df2, mapping = aes(xintercept = Mean.SL)) +
facet_wrap(Species ~ Big.Petal)
vlines <- ddply(mydata, .(panel), summarize, sumc = sum(contrib))
ggplot(merge(mydata, vlines), aes(sensitivity, weight = contrib)) +
geom_bar(binwidth = 1) + geom_vline(aes(xintercept = sumc)) +
facet_wrap(~panel) + ylab("contrib")

ggplot2 shading envelope of time series

I am plotting the results of 50 - 100 experiments.
Each experiment results in a time series.
I can plot a spaghetti plot of all time series, but
what I'd like to have is sort of a density map for the time series plume.
(something similar to the gray shading in the lower panel
in this figure: http://www.ipcc.ch/graphics/ar4-wg1/jpg/fig-6-14.jpg)
I can 'sort of' do this with 2d binning or binhex but the result could be prettier (see example below).
Here is a code that reproduces a plume plot for mock data (uses ggplot2 and reshape2).
# mock data: random walk plus a sinus curve.
# two envelopes for added contrast.
tt=10*sin(c(1:100)/(3*pi))
rr=apply(matrix(rnorm(5000),100,50),2,cumsum) +tt
rr2=apply(matrix(rnorm(5000),100,50),2,cumsum)/1.5 +tt
# stuff data into a dataframe and melt it.
df=data.frame(c(1:100),cbind(rr,rr2) )
names(df)=c("step",paste("ser",c(1:100),sep=""))
dfm=melt(df,id.vars = 1)
# ensemble average
ensemble_av=data.frame(step=df[,1],ensav=apply(df[,-1],1,mean))
ensemble_av$variable=as.factor("Mean")
ggplot(dfm,aes(step,value,group=variable))+
stat_binhex(alpha=0.2) + geom_line(alpha=0.2) +
geom_line(data=ensemble_av,aes(step,ensav,size=2))+
theme(legend.position="none")
Does anyone know of a nice way do get a shaded envelope with gradients. I have also tried geom_ribbon but that did not give any indication of density changes along the plume. binhex does that, but not with aesthetically pleasing results.
Compute quantiles:
qs = data.frame(
do.call(
rbind,
tapply(
dfm$value, dfm$step, function(i){quantile(i)})),
t=1:100)
head(qs)
X0. X25. X50. X75. X100. t
1 -0.8514179 0.4197579 0.7681517 1.396382 2.883903 1
2 -0.6506662 1.2019163 1.6889073 2.480807 5.614209 2
3 -0.3182652 2.0480082 2.6206045 4.205954 6.485394 3
4 -0.1357976 2.8956990 4.2082762 5.138747 8.860838 4
5 0.8988975 3.5289219 5.0621513 6.075937 10.253379 5
6 2.0027973 4.5398120 5.9713921 7.015491 11.494183 6
Plot ribbons:
ggplot() +
geom_ribbon(data=qs, aes(x=t, ymin=X0., ymax=X100.),fill="gray30", alpha=0.2) +
geom_ribbon(data=qs, aes(x=t, ymin=X25., ymax=X75.),fill="gray30", alpha=0.2)
This is for two quantile intervals, (0-100) and (25-75). You'll need more args to quantile and more ribbon layers for more quantiles, and need to adjust the colours too.
Based on the idea of Spacedman, I found a way to add more intervals in an automatic way: I first compute the quantiles for each step, group them by pairs of symmetric values and then use geom_ribbon in the right order...
library(tidyr)
library(dplyr)
condquant <- dfm %>% group_by(step) %>%
do(quant = quantile(.$value, probs = seq(0,1,.05)), probs = seq(0,1,.05)) %>%
unnest() %>%
mutate(delta = 2*round(abs(.5-probs)*100)) %>%
group_by(step, delta) %>%
summarize(quantmin = min(quant), quantmax= max(quant))
ggplot() +
geom_ribbon(data = condquant, aes(x = step, ymin = quantmin, ymax = quantmax,
group = reorder(delta, -delta), fill = as.numeric(delta)),
alpha = .5) +
scale_fill_gradient(low = "grey10", high = "grey95") +
geom_line(data = dfm, aes(x = step, y = value, group=variable), alpha=0.2) +
geom_line(data=ensemble_av,aes(step,ensav),size=2)+
theme(legend.position="none")
Thanks Erwan and Spacedman.
Avoiding 'tidyr' ('dplyr' and 'magrittr') my version of Erwans answer becomes
probs=c(0:10)/10 # use fewer quantiles than Erwan
arr=t(apply(df[,-1],1,quantile,prob=probs))
dfq=data.frame(step=df[,1],arr)
names(dfq)=c("step",colnames(arr))
dfqm=melt(dfq,id.vars=c(1))
# add inter-quantile (per) range as delta
dfqm$delta=dfqm$variable
levels(dfqm$delta)=abs(probs-rev(probs))*100
dfplot=ddply(dfqm,.(step,delta),summarize,
quantmin=min(value),
quantmax=max(value) )
ggplot() +
geom_ribbon(data = dfplot, aes(x = step, ymin = quantmin,
ymax =quantmax,group=rev(delta),
fill = as.numeric(delta)),
alpha = .5) +
scale_fill_gradient(low = "grey25", high = "grey75") +
geom_line(data=ensemble_av,aes(step,ensav),size=2) +
theme(legend.position="none")

Ignore outliers in ggplot2 boxplot

How would I ignore outliers in ggplot2 boxplot? I don't simply want them to disappear (i.e. outlier.size=0), but I want them to be ignored such that the y axis scales to show 1st/3rd percentile. My outliers are causing the "box" to shrink so small its practically a line. Are there some techniques to deal with this?
Edit
Here's an example:
y = c(.01, .02, .03, .04, .05, .06, .07, .08, .09, .5, -.6)
qplot(1, y, geom="boxplot")
Use geom_boxplot(outlier.shape = NA) to not display the outliers and scale_y_continuous(limits = c(lower, upper)) to change the axis limits.
An example.
n <- 1e4L
dfr <- data.frame(
y = exp(rlnorm(n)), #really right-skewed variable
f = gl(2, n / 2)
)
p <- ggplot(dfr, aes(f, y)) +
geom_boxplot()
p # big outlier causes quartiles to look too slim
p2 <- ggplot(dfr, aes(f, y)) +
geom_boxplot(outlier.shape = NA) +
scale_y_continuous(limits = quantile(dfr$y, c(0.1, 0.9)))
p2 # no outliers plotted, range shifted
Actually, as Ramnath showed in his answer (and Andrie too in the comments), it makes more sense to crop the scales after you calculate the statistic, via coord_cartesian.
coord_cartesian(ylim = quantile(dfr$y, c(0.1, 0.9)))
(You'll probably still need to use scale_y_continuous to fix the axis breaks.)
Here is a solution using boxplot.stats
# create a dummy data frame with outliers
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# compute lower and upper whiskers
ylim1 = boxplot.stats(df$y)$stats[c(1, 5)]
# scale y limits based on ylim1
p1 = p0 + coord_cartesian(ylim = ylim1*1.05)
I had the same problem and precomputed the values for Q1, Q2, median, ymin, ymax using boxplot.stats:
# Load package and generate data
library(ggplot2)
data <- rnorm(100)
# Compute boxplot statistics
stats <- boxplot.stats(data)$stats
df <- data.frame(x="label1", ymin=stats[1], lower=stats[2], middle=stats[3],
upper=stats[4], ymax=stats[5])
# Create plot
p <- ggplot(df, aes(x=x, lower=lower, upper=upper, middle=middle, ymin=ymin,
ymax=ymax)) +
geom_boxplot(stat="identity")
p
The result is a boxplot without outliers.
One idea would be to winsorize the data in a two-pass procedure:
run a first pass, learn what the bounds are, e.g. cut of at given percentile, or N standard deviation above the mean, or ...
in a second pass, set the values beyond the given bound to the value of that bound
I should stress that this is an old-fashioned method which ought to be dominated by more modern robust techniques but you still come across it a lot.
gg.layers::geom_boxplot2 is just what you want.
# remotes::install_github('rpkgs/gg.layers')
library(gg.layers)
library(ggplot2)
p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot2(width = 0.8, width.errorbar = 0.5)
https://rpkgs.github.io/gg.layers/reference/geom_boxplot2.html
If you want to force the whiskers to extend to the max and min values, you can tweak the coef argument. Default value for coef is 1.5 (i.e. default length of the whiskers is 1.5 times the IQR).
# Load package and create a dummy data frame with outliers
#(using example from Ramnath's answer above)
library(ggplot2)
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# create boxplot where whiskers extend to max and min values
p1 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)), coef = 500)
Simple, dirty and effective.
geom_boxplot(outlier.alpha = 0)
The "coef" option of the geom_boxplot function allows to change the outlier cutoff in terms of interquartile ranges. This option is documented for the function stat_boxplot. To deactivate outliers (in other words they are treated as regular data), one can instead of using the default value of 1.5 specify a very high cutoff value:
library(ggplot2)
# generate data with outliers:
df = data.frame(x=1, y = c(-10, rnorm(100), 10))
# generate plot with increased cutoff for outliers:
ggplot(df, aes(x, y)) + geom_boxplot(coef=1e30)

Resources