I am trying to graph my data in R for my research project and for some reason on the three graphs I have created my error bars look like this. They are all at the bottom of the bars rather than in the correct spot on the top.
Here is my code for that graph:
ggplot(Epiphyte_Biomass,aes(x=Treatment, y=Epiphyte.Biomass,fill=Treatment))+
geom_bar(stat="Identity")+
geom_errorbar(aes(ymin=mean(Epiphyte.Biomass)-sd(Epiphyte.Biomass),
ymax=mean(Epiphyte.Biomass)+ sd(Epiphyte.Biomass)),
width=0.2)+
theme_classic()
When you computed the mean and sd, ggplot didn't automatically subdivide the data by group, so I think you got the overall mean and SD (the mean looks low, but perhaps you have fewer data points in the "NC+N" treatment?)
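To make that concrete, here is a minimal sketch with mtcars standing in for the asker's data (an assumption, since Epiphyte_Biomass is not available: cyl plays the role of Treatment and mpg the role of Epiphyte.Biomass). Calling mean() and sd() on the whole column gives one number each, which is then recycled for every bar. If the data has several rows per treatment, geom_bar(stat = "identity") also stacks them, so the bars show sums while the error bars sit near the overall mean.

```r
# mtcars stands in for the asker's data: cyl ~ Treatment, mpg ~ Epiphyte.Biomass
overall_mean <- mean(mtcars$mpg)   # a single value, shared by every bar
overall_sd   <- sd(mtcars$mpg)

# What the error bars actually need: one mean and one SD per group
group_mean <- tapply(mtcars$mpg, mtcars$cyl, mean)
group_sd   <- tapply(mtcars$mpg, mtcars$cyl, sd)

print(c(mean = overall_mean, sd = overall_sd))
print(rbind(mean = group_mean, sd = group_sd))
```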
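ggplot2 has some built-in convenience wrappers for functions from the Hmisc package that compute different kinds of ranges, but ±1 SD bars are not the default for any of them (mean_sdl gives ±2 SD unless you pass mult = 1, and it needs Hmisc installed). A small helper avoids the dependency. Try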
msd <- function(y) {
my <- mean(y, na.rm = TRUE)
sy <- sd(y, na.rm = TRUE)
data.frame(y = my, ymin = my - sy, ymax = my + sy)
}
## and use this in place of `geom_errorbar()`:
+ stat_summary(fun.data = msd, geom = "errorbar")
Here is an example using mtcars:
ggplot(mtcars, aes(cyl, mpg, fill = factor(cyl))) +
stat_summary(geom = "bar", fun = mean) +
stat_summary(geom = "errorbar", fun.data = msd)
The point is that this way ggplot does all the mean and SD calculations per treatment for you, on the fly, rather than your having to do them separately ...
It looks as though your data set may already have computed the mean of epiphyte biomass per treatment, in which case your SD calculations will be messed up anyway (they will be the SDs across treatment means rather than the within-treatment SDs)
I think the error is in the ± sd(Epiphyte.Biomass) part: you have to calculate the SD for each treatment separately. In your case, the SD is the same for both!
Here is an example with the mtcars dataset.
Just take your variables and put them in.
I really appreciate Ben Bolker's answer. Setting the error bars is not trivial, at least if you are not doing it every day.
library(tidyverse)
library(plyr)
# function
data_summary <- function(data, varname, groupnames){
require(plyr)
summary_func <- function(x, col){
c(mean = mean(x[[col]], na.rm=TRUE),
sd = sd(x[[col]], na.rm=TRUE))
}
data_sum<-ddply(data, groupnames, .fun=summary_func,
varname)
data_sum <- rename(data_sum, c("mean" = varname))
return(data_sum)
}
# definition of variables
df <- data_summary(mtcars, varname="mpg",
groupnames=c("cyl"))
# plot
ggplot(df, aes(x=factor(cyl), y=mpg, fill=factor(cyl))) +
geom_bar(stat="identity", color="black",
position=position_dodge()) +
geom_errorbar(aes(ymin=mpg-sd, ymax=mpg+sd), width=.2,
position=position_dodge(.9))
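If you would rather not attach plyr (it masks several dplyr functions when loaded after tidyverse), the same summary table can be built in base R. This is only a sketch using aggregate(); the column names mpg.mean and mpg.sd are just what do.call(data.frame, ...) happens to produce here, not something ggplot2 requires.

```r
library(ggplot2)

# Mean and SD of mpg within each cyl group; FUN returns two values per group
summ <- aggregate(mpg ~ cyl, data = mtcars,
                  FUN = function(v) c(mean = mean(v), sd = sd(v)))
# aggregate() stores the two values in a matrix column; flatten it
summ <- do.call(data.frame, summ)   # columns: cyl, mpg.mean, mpg.sd

p <- ggplot(summ, aes(x = factor(cyl), y = mpg.mean, fill = factor(cyl))) +
  geom_col(color = "black") +
  geom_errorbar(aes(ymin = mpg.mean - mpg.sd, ymax = mpg.mean + mpg.sd),
                width = 0.2)
print(p)
```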
I want to change the summary statistics shown in the following boxplot:
I have created the boxplot as follows:
ggplot(as.data.frame(beta2), aes(y=var1,x=as.factor(Year))) +
geom_boxplot(outlier.shape = NA)+
ylab(expression(beta[1]))+
xlab("\nYear")+
theme_bw()
By default the box spans the first and third quartiles. I want the box to show the 2.5% and 97.5% quantiles instead. I know one can easily change what is shown for a single boxplot by adding the following to geom_boxplot:
aes(
ymin= min(var1),
lower = quantile(var1,0.025),
middle = mean(var1),
upper = quantile(var1,0.975),
ymax=max(var1))
However, this does not work when the boxplots are generated by group. Any idea how one would do this? You can use the iris data set:
ggplot(iris, aes(y=Sepal.Length,x=Species)) +
geom_boxplot(outlier.shape = NA)
EDIT:
The answer accepted does work. My data-frame is really big and as such the method provided takes a bit of time. I found another solution here: SOLUTION that works for large datasets and my specific need.
This can be achieved via stat_summary by setting geom = "boxplot" and passing to fun.data a function that returns a data frame with the summary statistics you want to display, named ymin, lower, middle, upper and ymax as in a boxplot:
library(ggplot2)
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
stat_summary(geom = "boxplot", fun.data = function(x) {
data.frame(
ymin = min(x),
lower = quantile(x, 0.025),
middle = mean(x),
upper = quantile(x, 0.975),
ymax = max(x)
)}, outlier.shape = NA)
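For a very large data frame (the situation mentioned in the edit above), one option is to compute the five statistics once per group and draw them with stat = "identity", so that ggplot2 does no summarising at draw time. This is only a sketch of that idea, not the linked solution, and box_stats is a name made up for the example.

```r
library(ggplot2)

# Split the response by group, then compute each statistic once per group
groups <- split(iris$Sepal.Length, iris$Species)
q <- function(v, p) unname(quantile(v, p))   # drop the "2.5%"-style names

box_stats <- data.frame(
  Species = names(groups),
  ymin    = vapply(groups, min, numeric(1)),
  lower   = vapply(groups, q, numeric(1), p = 0.025),
  middle  = vapply(groups, mean, numeric(1)),
  upper   = vapply(groups, q, numeric(1), p = 0.975),
  ymax    = vapply(groups, max, numeric(1)),
  row.names = NULL
)

# stat = "identity" tells geom_boxplot to use the supplied values as-is
p <- ggplot(box_stats, aes(x = Species, ymin = ymin, lower = lower,
                           middle = middle, upper = upper, ymax = ymax)) +
  geom_boxplot(stat = "identity")
print(p)
```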
I need to reorder the bars of the plot by increasing value of y (nbi).
Thanks in advance!
#NBI*Sig_lip
p4 <-ggplot(DF, aes(x=sig_lip, y=nbi, fill=sig_lip)) +
stat_summary(fun.y="mean", geom="bar",show.legend = TRUE) +
stat_summary(func="sd", geom="errorbar") +
theme_minimal()
p4+ coord_flip()
p4 + ggtitle(label = "nbi associated to signaling lipids")
The difficulty with reordering here is that the mean and standard deviation are computed inside ggplot2, so there is no column of group means to sort on. (reorder(x, -y) does order by the per-group mean of -y, since its default FUN is mean, but any group containing an NA then gets an NA mean.)
The more transparent approach is to calculate Mean and SD before passing them to ggplot2:
library(dplyr)
library(ggplot2)
DF %>% group_by(sig_lip) %>%
summarise(Mean = mean(nbi, na.rm = TRUE),
SD = sd(nbi, na.rm = TRUE)) %>%
ggplot(aes(x = reorder(sig_lip,-Mean), y = Mean, fill = sig_lip))+
geom_col()+
geom_errorbar(aes(ymin = Mean-SD, ymax = Mean+SD))
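For reference, here is how reorder() decides the level order once the summary table exists. This is a base R sketch with mtcars standing in for DF (cyl for sig_lip, mpg for nbi; both substitutions are assumptions for illustration).

```r
# One row per group, like the output of the summarise() step
means <- tapply(mtcars$mpg, mtcars$cyl, mean)
summ <- data.frame(group = factor(names(means)), Mean = as.numeric(means))

# reorder() sorts the levels by FUN (mean by default) of the second argument,
# so negating Mean puts the group with the largest mean first
levels_increasing <- levels(reorder(summ$group, summ$Mean))
levels_decreasing <- levels(reorder(summ$group, -summ$Mean))

print(levels_increasing)
print(levels_decreasing)
```

Since the question asked for increasing nbi, reorder(sig_lip, Mean) without the minus sign would be the variant to use there.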
Does it answer your question?
If not, please provide a reproducible example of your dataset by following this guide: How to make a great R reproducible example
I've poked around, but been unable to find an answer. I want to do a weighted geom_bar plot overlaid with a vertical line that shows the overall weighted average per facet. I'm unable to make this happen. The vertical line seems to be a single value applied to all facets.
require('ggplot2')
require('plyr')
# data vectors
panel <- c("A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
instrument <-c("V1","V2","V1","V1","V1","V2","V1","V1","V2","V1","V1","V2","V1","V1","V2","V1")
cost <- c(1,4,1.5,1,4,4,1,2,1.5,1,2,1.5,2,1.5,1,2)
sensitivity <- c(3,5,2,5,5,1,1,2,3,4,3,2,1,3,1,2)
# put an initial data frame together
mydata <- data.frame(panel, instrument, cost, sensitivity)
# add a "contribution to" vector to the data frame: contribution of each instrument
# to the panel's weighted average sensitivity.
myfunc <- function(cost, sensitivity) {
return(cost*sensitivity/sum(cost))
}
mydata <- ddply(mydata, .(panel), transform, contrib=myfunc(cost, sensitivity))
# two views of each panels weighted average; should be the same numbers either way
ddply(mydata, c("panel"), summarize, wavg=weighted.mean(sensitivity, cost))
ddply(mydata, c("panel"), summarize, wavg2=sum(contrib))
# plot where each panel is getting its overall cost-weighted sensitivity from. Also
# put each panel's weighted average on the plot as a simple vertical line.
#
# PROBLEM! I don't know how to get geom_vline to honor the facet breakdown. It
# seems to be computing it overall the data and showing the resulting
# value identically in each facet plot.
ggplot(mydata, aes(x=sensitivity, weight=contrib)) +
geom_bar(binwidth=1) +
geom_vline(xintercept=sum(contrib)) +
facet_wrap(~ panel) +
ylab("contrib")
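If you pass in the pre-summarized data, it seems to work (geom_histogram is used here because recent ggplot2 versions no longer accept binwidth in geom_bar):
ggplot(mydata, aes(x=sensitivity, weight=contrib)) +
geom_histogram(binwidth=1) +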
geom_vline(data = ddply(mydata, "panel", summarize, wavg = sum(contrib)), aes(xintercept=wavg)) +
facet_wrap(~ panel) +
ylab("contrib") +
theme_bw()
Here is an example using dplyr and facet_wrap, in case anyone wants it.
library(dplyr)
library(ggplot2)
df1 <- mutate(iris, Big.Petal = Petal.Length > 4)
df2 <- df1 %>%
group_by(Species, Big.Petal) %>%
summarise(Mean.SL = mean(Sepal.Length))
ggplot() +
geom_histogram(data = df1, aes(x = Sepal.Length, y = after_stat(density))) +
geom_vline(data = df2, mapping = aes(xintercept = Mean.SL)) +
facet_wrap(Species ~ Big.Petal)
vlines <- ddply(mydata, .(panel), summarize, sumc = sum(contrib))
ggplot(merge(mydata, vlines), aes(sensitivity, weight = contrib)) +
geom_histogram(binwidth = 1) + geom_vline(aes(xintercept = sumc)) +
facet_wrap(~panel) + ylab("contrib")
How would I ignore outliers in a ggplot2 boxplot? I don't simply want them to disappear (i.e. outlier.size=0); I want them to be ignored so that the y axis scales to show the 1st/3rd quartiles. My outliers are causing the "box" to shrink so small it's practically a line. Are there some techniques to deal with this?
Edit
Here's an example:
y = c(.01, .02, .03, .04, .05, .06, .07, .08, .09, .5, -.6)
qplot(1, y, geom="boxplot")
Use geom_boxplot(outlier.shape = NA) to not display the outliers and scale_y_continuous(limits = c(lower, upper)) to change the axis limits.
An example.
n <- 1e4L
dfr <- data.frame(
y = exp(rlnorm(n)), #really right-skewed variable
f = gl(2, n / 2)
)
p <- ggplot(dfr, aes(f, y)) +
geom_boxplot()
p # big outlier causes quartiles to look too slim
p2 <- ggplot(dfr, aes(f, y)) +
geom_boxplot(outlier.shape = NA) +
scale_y_continuous(limits = quantile(dfr$y, c(0.1, 0.9)))
p2 # no outliers plotted, range shifted
Actually, as Ramnath showed in his answer (and Andrie too in the comments), it makes more sense to crop the scales after you calculate the statistic, via coord_cartesian.
coord_cartesian(ylim = quantile(dfr$y, c(0.1, 0.9)))
(You'll probably still need to use scale_y_continuous to fix the axis breaks.)
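The reason this matters: limits set through scale_y_continuous() discard out-of-range rows before the boxplot statistics are computed, whereas coord_cartesian() only zooms the view. Here is a rough base R sketch of the effect on the quartiles (the simulated data and the seed are arbitrary choices for illustration).

```r
set.seed(42)
y <- rlnorm(1000)                      # right-skewed toy data

lims <- quantile(y, c(0.1, 0.9))       # the limits used in the example above
kept <- y[y >= lims[1] & y <= lims[2]] # what survives scale limits

# Quartiles on all data (what coord_cartesian shows) versus
# quartiles on the trimmed data (what scale limits would draw)
q_full    <- quantile(y,    c(0.25, 0.75))
q_trimmed <- quantile(kept, c(0.25, 0.75))

print(rbind(full = q_full, trimmed = q_trimmed))
```

The trimmed box is narrower than the real interquartile range, which is why cropping after the statistic is computed is usually the safer choice.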
Here is a solution using boxplot.stats
# create a dummy data frame with outliers
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# compute lower and upper whiskers
ylim1 = boxplot.stats(df$y)$stats[c(1, 5)]
# scale y limits based on ylim1
p1 = p0 + coord_cartesian(ylim = ylim1*1.05)
I had the same problem and precomputed the values for Q1, median, Q3, ymin and ymax using boxplot.stats:
# Load package and generate data
library(ggplot2)
data <- rnorm(100)
# Compute boxplot statistics
stats <- boxplot.stats(data)$stats
df <- data.frame(x="label1", ymin=stats[1], lower=stats[2], middle=stats[3],
upper=stats[4], ymax=stats[5])
# Create plot
p <- ggplot(df, aes(x=x, lower=lower, upper=upper, middle=middle, ymin=ymin,
ymax=ymax)) +
geom_boxplot(stat="identity")
p
The result is a boxplot without outliers.
One idea would be to winsorize the data in a two-pass procedure:
In a first pass, learn what the bounds are, e.g. cut off at a given percentile, or at N standard deviations above the mean, or ...
In a second pass, set the values beyond a given bound to the value of that bound.
I should stress that this is an old-fashioned method which ought to be dominated by more modern robust techniques, but you still come across it a lot.
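The two passes can be sketched in a few lines of base R. The 5%/95% cut-offs below are an arbitrary choice for illustration, and y is the example vector from the question.

```r
y <- c(.01, .02, .03, .04, .05, .06, .07, .08, .09, .5, -.6)

# Pass 1: learn the bounds (here the 5th and 95th percentiles)
bounds <- unname(quantile(y, c(0.05, 0.95)))

# Pass 2: clamp anything beyond a bound to that bound
y_wins <- pmin(pmax(y, bounds[1]), bounds[2])

print(bounds)
print(y_wins)
```

The vector keeps its length; only the two extreme values are pulled in to the bounds, so a boxplot of y_wins is no longer squashed by them.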
gg.layers::geom_boxplot2 is just what you want.
# remotes::install_github('rpkgs/gg.layers')
library(gg.layers)
library(ggplot2)
p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot2(width = 0.8, width.errorbar = 0.5)
https://rpkgs.github.io/gg.layers/reference/geom_boxplot2.html
If you want to force the whiskers to extend to the max and min values, you can tweak the coef argument. Default value for coef is 1.5 (i.e. default length of the whiskers is 1.5 times the IQR).
# Load package and create a dummy data frame with outliers
#(using example from Ramnath's answer above)
library(ggplot2)
df = data.frame(y = c(-100, rnorm(100), 100))
# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))
# create boxplot where whiskers extend to max and min values
p1 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)), coef = 500)
Simple, dirty and effective.
geom_boxplot(outlier.alpha = 0)
The "coef" option of geom_boxplot lets you change the outlier cutoff, expressed as a multiple of the interquartile range. This option is documented under stat_boxplot. To deactivate outliers (in other words, to treat them as regular data), you can specify a very high cutoff value instead of the default of 1.5:
library(ggplot2)
# generate data with outliers:
df = data.frame(x=1, y = c(-10, rnorm(100), 10))
# generate plot with increased cutoff for outliers:
ggplot(df, aes(x, y)) + geom_boxplot(coef=1e30)
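To see what coef changes without drawing anything, boxplot.stats() from base R takes the same kind of multiplier (I am assuming it mirrors what stat_boxplot does closely enough for illustration; the two use slightly different quartile definitions). This sketch uses a toy vector with two planted extremes.

```r
x <- c(-100, 1:10, 100)

default_stats <- boxplot.stats(x)              # coef = 1.5
wide_stats    <- boxplot.stats(x, coef = 500)  # cutoff far beyond the data

# With the default, the planted extremes are flagged and the whiskers stop short
print(default_stats$out)
print(default_stats$stats[c(1, 5)])

# With a huge coef nothing is flagged and the whiskers reach min and max
print(wide_stats$out)
print(wide_stats$stats[c(1, 5)])
```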
We have some data which represents many model runs under different scenarios. For a single scenario, we'd like to display the smoothed mean, with the filled areas representing the standard deviation at a particular point in time, rather than the quality of the fit of the smoothing.
For example:
d <- as.data.frame(rbind(cbind(1:20, 1:20, 1),
cbind(1:20, -1:-20, 2)))
names(d)<-c("Time","Value","Run")
ggplot(d, aes(x=Time, y=Value)) +
geom_line(aes(group=Run)) +
geom_smooth()
This produces a graph with two runs represented, and a smoothed mean, but even though the SD between the runs is increasing, the smoother's bars stay the same size. I'd like to make the surrounds of the smoother represent standard deviation at a given timestep.
Is there a non-labour intensive way of doing this, given many different runs and output variables?
Hi, I'm not sure I understand correctly what you want, but for example:
d <- data.frame(Time=rep(1:20, 4),
Value=rnorm(80, rep(1:20, 4)+rep(1:4*2, each=20)),
Run=gl(4,20))
mean_se <- function(x, mult = 1) {
x <- na.omit(x)
se <- mult * sqrt(var(x) / length(x))
mean <- mean(x)
data.frame(y = mean, ymin = mean - se, ymax = mean + se)
}
ggplot( d, aes(x=Time,y=Value) ) + geom_line( aes(group=Run) ) +
geom_smooth(se=FALSE) +
stat_summary(fun.data=mean_se, geom="ribbon", alpha=0.25)
Note that mean_se was going to appear in the next version of ggplot2 when this was written; it is now available as ggplot2::mean_se.
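Since the question asked for the standard deviation rather than the standard error, the same idea works with a mean ± SD band. This sketch uses the two-run data from the question, with the per-timestep summary computed in base R (band, avg and sdev are names made up for the example).

```r
library(ggplot2)

# The two runs from the question: Value = Time and Value = -Time
d <- data.frame(Time  = rep(1:20, 2),
                Value = c(1:20, -(1:20)),
                Run   = rep(1:2, each = 20))

# Mean and SD across runs at each time step
band <- data.frame(
  Time = sort(unique(d$Time)),
  avg  = as.numeric(tapply(d$Value, d$Time, mean)),
  sdev = as.numeric(tapply(d$Value, d$Time, sd))
)

p <- ggplot(d, aes(x = Time, y = Value)) +
  geom_ribbon(data = band, inherit.aes = FALSE, alpha = 0.25,
              aes(x = Time, ymin = avg - sdev, ymax = avg + sdev)) +
  geom_line(aes(group = Run)) +
  geom_line(data = band, aes(y = avg), linetype = "dashed")
print(p)
```

Here the band widens with Time, which is the behaviour the question was after.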
The accepted answer only works if the measurements are aligned/discretized on x. For continuous data you could use a rolling window and add a custom ribbon:
library(dplyr)
library(ggplot2)
library(zoo)  # provides rollapply() and rollmean()
iris %>%
## apply same grouping as for plot
group_by(Species) %>%
## Important sort along x!
arrange(Petal.Length) %>%
## calculate rolling mean and sd
mutate(rolling_sd = rollapply(Petal.Width, width = 10, FUN = sd, fill = NA),
       rolling_mean = rollmean(Petal.Width, k = 10, fill = NA)) %>%
## build the plot
ggplot(aes(Petal.Length, Petal.Width, color = Species)) +
# optionally we could rather plot the rolling mean instead of the geom_smooth loess fit
# geom_line(aes(y=rolling_mean), color="black") +
geom_ribbon(aes(ymin=rolling_mean-rolling_sd/2, ymax=rolling_mean+rolling_sd/2), fill="lightgray", color="lightgray", alpha=.8) +
geom_point(size = 1, alpha = .7) +
geom_smooth(se=FALSE)