Plot using mean and standard error values using ggplot2 - r

Here's my data:
year means stder
1 A_1996 4.1291 0.19625
2 B_1997 3.4490 0.18598
3 C_1998 4.1166 0.15977
4 D_1999 3.6500 0.15093
5 E_2000 3.9528 0.14950
6 F_2001 2.7318 0.13212
This is all the data I have. I'd like to plot these using the ggplot2 package, if possible. The X axis will be year, and the Y axis will be means. Each year will have one point (its corresponding mean value), with the respective standard error values as the "whiskers" around that point. How would I do this using the ggplot() function?
I think I'm mainly confused on how to put the standard error data into the ymin and ymax inputs.
I started looking here, but the beginning data is different, so I'm a little confused.
Plotting means and error bars (ggplot2)

Simple plot using general ggplot2 commands:
library(ggplot2)
df$year <- as.numeric(gsub(".*_", "", df$year))
ggplot(df, aes(year, means)) +
geom_point() +
geom_errorbar(aes(ymin = means - stder,
ymax = means + stder))
Same plot with fancier visuals:
ggplot(df, aes(year, means)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = means - stder,
ymax = means + stder),
width = 0.5, size = 0.5) +
theme_bw() +
labs(x = "Year",
y = "Mean",
title = "Change in mean over the period")
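For anyone who wants to reproduce the plot from scratch, here is a minimal sketch that rebuilds the question's table as a data frame and draws each mean with error bars of one standard error. The object name df matches the answer above; the helper columns lower and upper are assumptions added for illustration.

```r
library(ggplot2)

# Rebuild the table from the question
df <- data.frame(
  year  = c("A_1996", "B_1997", "C_1998", "D_1999", "E_2000", "F_2001"),
  means = c(4.1291, 3.4490, 4.1166, 3.6500, 3.9528, 2.7318),
  stder = c(0.19625, 0.18598, 0.15977, 0.15093, 0.14950, 0.13212)
)

# Drop the "A_"-style prefix so that year becomes numeric
df$year <- as.numeric(sub(".*_", "", df$year))

# Hypothetical helper columns: mean minus/plus one standard error
df$lower <- df$means - df$stder
df$upper <- df$means + df$stder

p <- ggplot(df, aes(year, means)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2)
p
```

Precomputing lower and upper is optional; writing the arithmetic directly inside aes() gives the same plot.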

Related

Additional x axis on ggplot

I'm aware there are similar posts but I could not get those answers to work in my case.
e.g. Here and here.
Example:
diamonds %>%
ggplot(aes(scale(price) %>% as.vector)) +
geom_density() +
xlim(-3, 3) +
facet_wrap(vars(cut))
Returns a plot:
Since I used scale, those numbers are the zscores or standard deviations away from the mean of each break.
I would like to add as a row underneath the equivalent non scaled raw number that corresponds to each.
Tried:
diamonds %>%
ggplot(aes(scale(price) %>% as.vector)) +
geom_density() +
xlim(-3, 3) +
facet_wrap(vars(cut)) +
geom_text(aes(label = price))
Gives:
Error: geom_text requires the following missing aesthetics: y
My primary question is how can I add the raw values underneath -3:3 of each break? I don't want to change those breaks, I still want 6 breaks between -3:3.
Secondary question, how can I get -3 and 3 to actually show up in the chart? They have been trimmed.
[edit]
I've been trying to make it work with geom_text but keep hitting errors:
diamonds %>%
ggplot(aes(x = scale(price) %>% as.vector)) +
geom_density() +
xlim(-3, 3) +
facet_wrap(vars(cut)) +
geom_text(label = price)
Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomText, :
object 'price' not found
I then tried changing my call to geom_text()
geom_text(data = diamonds, aes(price), label = price)
This results in the same error message.
You can make a custom labeling function for your axis. It takes each label on the axis and applies a custom transform. In your case you could paste together the z-score, a line break, and the z-score times the standard deviation plus the mean. Because of the distribution of prices in the diamonds data set, z-scores below about -1 represent negative prices. This may not be a problem in your own data. For clarity I have drawn in a vertical line representing $0:
labeller <- function(x) {
paste0(x,"\n", scales::dollar(sd(diamonds$price) * x + mean(diamonds$price)))
}
diamonds %>%
ggplot(aes(scale(price) %>% as.vector)) +
geom_density() +
geom_vline(aes(xintercept = -0.98580251364833), linetype = 2) +
facet_wrap(vars(cut)) +
scale_x_continuous(labels = labeller, limits = c(-3, 3)) +
xlab("price")
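The labelling function works because scale() standardises with the overall mean and standard deviation, so a z-score z maps back to a raw value as z * sd + mean. Below is a small sketch that checks this round trip on the diamonds data; the names z, raw_again and z_zero are assumptions for illustration.

```r
library(ggplot2)  # only needed here for the diamonds data set

price <- diamonds$price
z <- as.vector(scale(price))          # z-scores, as used on the x axis

# Back-transform: z-score times the standard deviation, plus the mean
raw_again <- z * sd(price) + mean(price)
stopifnot(isTRUE(all.equal(raw_again, as.numeric(price))))

# z-score at which the back-transformed price equals 0;
# this is where the dashed vertical line is drawn
z_zero <- -mean(price) / sd(price)
z_zero
```

Note that scale() is applied to the whole price column before facetting, so the same mean and standard deviation apply to every panel.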
We can use the sec_axis functionality in scale_x_continuous. To use this functionality we need to manually scale your data. This will add a secondary axis at the top of the plot, not underneath. So it's not quite exactly what you're looking for.
library(tidyverse)
# manually scale the data
mean_price <- mean(diamonds$price)
sd_price <- sd(diamonds$price)
diamonds$price_scaled <- (diamonds$price - mean_price) / sd_price
# make the plot
ggplot(diamonds, aes(price_scaled))+
geom_density()+
facet_wrap(~cut)+
scale_x_continuous(sec.axis = sec_axis(~ mean_price + (sd_price * .)),
limits = c(-3, 4), breaks = -3:3)
You could cheat a bit by passing some dummy data to geom_text:
geom_text(data = tibble(label = round(((-3:3) * sd_price) + mean_price),
y = -0.25,
x = -3:3),
aes(x, y, label = label))

ggplot2 create barplot with CIs with descriptive data (i.e., without raw data)

In the base version of R it is easy (but cumbersome) to create a plot with error bars based on the descriptive data. With ggplot2 I am struggling to do so, and all the examples I have found are based on the raw data.
Specifically, how can I create a barplot with confidence intervals for a simple two-group design? M1 = 3, M2 = 4, SD1 = 1, SD2 = 1.2, n1 = 111, n2 = 222? I started off simply with
ggplot(aes(x=c(1:2), y=c(3, 4))) + geom_bar()
# or
ggplot(aes(y=c(3, 4))) + geom_bar()
but not even this seems to work to create a barplot.
Any suggestions?
What about using ggplot2::stat_summary()? You can let it take care of your mean and se calculations (it relies on library(Hmisc) for most of these summary functions, so look there for more help).
library(ggplot2)
ggplot(mtcars, aes(cyl, mpg)) +
stat_summary(geom = "bar", fun.y = mean) +
stat_summary(geom = "errorbar", fun.data = mean_se)
Adjust width = for skinnier bars or error bars.
You can also show a true confidence interval with mean_cl_normal or mean_cl_boot, and use a crossbar for a better visualization of the data dispersion:
ggplot(mtcars, aes(cyl, mpg)) +
stat_summary(geom = "crossbar", fun.data = mean_cl_normal)
Edit:
If you want to recreate a plot from a published paper, just roll your data into a data.frame first:
datf <- data.frame(
group = c("1", "2"),
means = c(3,4),
sds = c(1,1.2),
ns = c(111, 222)
)
# add your CI calcs as column called upr and lwr
library(tidyverse)
datf <- datf %>% mutate(lwr = means - (qnorm(.975)*(sds/sqrt(ns))),
upr = means + (qnorm(.975)*(sds/sqrt(ns))))
ggplot(datf, aes(group, y = means, ymin = lwr, ymax = upr)) +
geom_crossbar()
Or, if you must, the traditional columns with error bars:
ggplot(datf, aes(group, y = means, ymin = lwr, ymax = upr)) +
geom_col() +
geom_errorbar()
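The interval above uses the normal quantile qnorm(.975). A t-based interval is a common alternative when only means, standard deviations and group sizes are known. The sketch below assumes that is what you want; the column names se, tcrit, lwr_t and upr_t are hypothetical. With group sizes of 111 and 222 the two versions are nearly identical.

```r
library(ggplot2)

# Summary statistics from the question
datf <- data.frame(
  group = c("1", "2"),
  means = c(3, 4),
  sds   = c(1, 1.2),
  ns    = c(111, 222)
)

# Standard error of each mean and the t critical value for a 95% interval
datf$se    <- datf$sds / sqrt(datf$ns)
datf$tcrit <- qt(0.975, df = datf$ns - 1)

# Hypothetical column names for the t-based limits
datf$lwr_t <- datf$means - datf$tcrit * datf$se
datf$upr_t <- datf$means + datf$tcrit * datf$se

ggplot(datf, aes(group, y = means, ymin = lwr_t, ymax = upr_t)) +
  geom_col() +
  geom_errorbar(width = 0.2)
```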
You can draw an error bar to whatever values you want. Error bars have aesthetics called ymin and ymax that you can set. Here I draw the bars +/- 1 standard deviation from the mean:
dd<-read.table(text="sample mean sd n
1 3 1 111
2 4 1.2 222", header=T)
ggplot(dd, aes(sample)) +
geom_col(aes(y=mean)) +
geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd))

Formatting the numbers appearing as text in a boxplot in ggplot. Explaining the positioning of means outside the boxplots

I have developed code in ggplot for a boxplot that displays the mean --calling a custom function. The code is the following:
fun_mean <- function(x){
return(data.frame(y=round(mean(x), digits = 3),label=mean(x,na.rm=T)))}
ggplot(my_data, aes(x = as.factor(viotiko), y = pd_1year, fill = as.factor(viotiko))) + geom_boxplot() +
labs(title="Does the PD differ significantly by 'Viotiko' group?",x="Viotiko Group", y = "PD (pd_1year)") +
coord_cartesian(ylim = c(0,0.05)) + stat_summary(fun.y = mean, geom="point",colour="darkred", size=3) +
stat_summary(fun.data = fun_mean, geom="text", vjust=-0.7)
The boxplot outputted is the following:
As you can see, although I have rounded the means to contain only 3 decimal digits, they appear long and clutter the plot.
What should I do to limit the digits displayed to only 3?
Moreover, I am puzzled by the fact that the means appear outside the distribution depicted by the boxplots in the majority of the groups. How could this be interpreted?
Your advice will be appreciated.
I faked up some data to reproduce this - in the future you should do that yourself before posting:
library(ggplot2)
set.seed(1234)
n <- 100
my_data <- data.frame(viotiko=sample(0:8,n,T),pd_1year=exp(rnorm(n,-4.5,0.8)))
fun_mean <- function(x){
y = mean(x,na.rm=T)
return(data.frame(y=y,label=round(y,3)))
}
ggplot(my_data, aes(x = as.factor(viotiko), y = pd_1year, fill = as.factor(viotiko))) +
geom_boxplot() +
labs(title="Does the PD differ significantly by 'Viotiko' group?",
x="Viotiko Group", y = "PD (pd_1year)") +
coord_cartesian(ylim = c(0,0.05)) +
stat_summary(fun.y = mean, geom="point",colour="darkred", size=3) +
stat_summary(fun.data = fun_mean, geom="text", vjust=-0.7)
Yielding this:
As for your question about why the mean falls outside the box (the range between the 25th and 75th percentiles): that is quite common, and to be expected, for long-tailed data, even if it does seem a bit counter-intuitive. In this case I used a log-normal distribution and I had 3 of 8 mean values outside those quartiles.
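To see which groups have a mean outside the box, you can tabulate the mean next to the quartiles for each group. This is a minimal sketch using the same simulated data as above; the object summ and its column names are assumptions for illustration.

```r
# Same simulated data as in the answer
set.seed(1234)
n <- 100
my_data <- data.frame(viotiko = sample(0:8, n, TRUE),
                      pd_1year = exp(rnorm(n, -4.5, 0.8)))

# Per-group mean, median and quartiles
summ <- do.call(rbind, lapply(split(my_data$pd_1year, my_data$viotiko), function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  data.frame(n = length(x), mean = mean(x), median = median(x),
             q25 = q[1], q75 = q[2])
}))

# TRUE where the mean lies below the 25th or above the 75th percentile
summ$mean_outside_box <- summ$mean < summ$q25 | summ$mean > summ$q75
summ
```

A few large values in a small group are enough to pull the mean above the upper quartile, while the median and the box barely move.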

Error when adding errorbars to ggplot

Dear Stackoverflow users,
I would like to draw a grouped barplot with three independent variables with error bars. I based my graph on an example on Stack Overflow (stacked bars within grouped bars), using ggplot with geom_bar. When I add the geom_errorbar according to examples of the help pages, I get the following error:
Error in if (empty(data)) { : missing value where TRUE/FALSE needed
This is the script I use:
treatment<-rep(c(rep(c(1),8),rep(c(2),8)),2)
origin<-rep(c("A","B"),16)
time<-c(rep(c(5),16),rep(c(10),16))
sulfide<-c(0,10,5,8,9,6,16,18,20,25,50,46,17,58,39,43,20,25,50,46,17,58,39,43,100,120,103,104,150,160,200,180)
Reed<-data.frame(treatment,origin,time,sulfide)
# specify factor types
Reed$treatment<-as.factor(Reed$treatment)
Reed$origin<-as.character(Reed$origin)
Reed$time<-as.factor(Reed$time)
library(ggplot2)
library(scales)
#draw plot
ggplot() +geom_bar(data=Reed, aes(y = sulfide, x = treatment, fill=origin), stat="identity",position="dodge") +theme_bw() + facet_grid( ~ time)+xlab("treatment") +ylab("Sulfide")+ggtitle("Time)")
This is how I added error bars:
ErrorBars <- function(x, y, upper, lower=upper, length=0.03, ...) { # function for errorbars
if(length(x) != length(y) | length(y) != length(lower) | length(lower) != length(upper))
stop("vectors must be same length")
arrows(x, y+upper, x, y-lower, angle=90, code=3, length=length, ...)
}
SE<- function(x) sqrt(var(x,na.rm=TRUE)/length(na.omit(x))) #function for SE
Reed$trt<- paste(Reed$treatment,Reed$origin,sep="")#combine treatment and origin to a column
mean_Reed<-data.frame(tapply(Reed$sulfide,list(Reed$trt,Reed$time),mean,na.rm=TRUE)) #mean
SE_Reed<-data.frame(tapply(Reed$sulfide,list(Reed$trt, Reed$time),SE)) # SE
limits <- aes(ymax = mean_Reed + SE_Reed, ymin=mean_Reed - SE_Reed)# Define the top and bottom of the errorbars
#plot with error bars:
ggplot() +geom_bar(data=Reed, aes(y = sulfide, x = treatment, fill=origin), stat="identity",position="dodge") +theme_bw() + facet_grid( ~ time)+xlab("treatment") +ylab("Sulfide")+ggtitle("Time)")+ geom_errorbar(limits, width=.2,position="dodge")
I really can't find what I'm doing wrong.
I hope you can help me:)
Leaving aside the issue of error bars for the moment, there's a much more serious problem with your plot. You have 2 values each of treatment, time, and origin, for a total of 8 combinations, but 32 values of sulfide - so there are 4 values of sulfide for each combination. When you plot this using, e.g.,
ggplot(data=Reed) +
geom_bar(aes(y = sulfide, x = treatment, fill=origin), stat="identity",position="dodge") +
facet_grid( ~ time)+xlab("treatment") +ylab("Sulfide")
you are plotting bars for all four sulfide values on top of each other all in the same color. This has the effect of displaying only the maximum value. It's a little hard to believe this is what you intended, and even if you did there's a better way to do that. For instance, if you want to plot the mean value of sulfide for each combination of factors, you can do it this way.
ggp <- ggplot(data=Reed, aes(y = sulfide, x = as.factor(treatment), group=origin)) +
geom_bar(aes(fill=origin), stat="summary", fun.y=mean, position="dodge") +
theme_bw() +
facet_grid( ~ time)+xlab("treatment") +ylab("Sulfide")+ggtitle("Time")
ggp
This uses stat="summary" to automatically summarize the result using the aggregating function mean (fun.y=mean).
A similar approach can be used to add the error bars very simply:
se <- function(y) sd(y)/sqrt(length(y)) # standard error of the mean
ggp+stat_summary(geom="errorbar",position=position_dodge(width=0.85),
fun.data=function(y) data.frame(ymin=mean(y)-se(y), ymax=mean(y)+se(y)), width=0.1)
Notice that there is no need to aggregate the data externally - ggplot does it for you.
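The standard error of the mean is the sample standard deviation divided by the square root of the sample size. As a quick sanity check, the sketch below computes it for one group of the question's data (treatment 1, origin A, time 5) and compares it with ggplot2's mean_se() helper. The vector name x is an assumption for illustration.

```r
library(ggplot2)

# Standard error of the mean: sd divided by the square root of n
se <- function(y) sd(y) / sqrt(length(y))

# Sulfide values for treatment 1, origin A, time 5 in the question's data
x <- c(0, 5, 9, 16)

mean(x)   # group mean
se(x)     # standard error of that mean

# mean_se() returns y, ymin and ymax as mean -/+ one standard error
ms <- mean_se(x)
ms
```

The half-width of the interval from mean_se() should agree with se(x), which is why either can be passed to stat_summary() for one-standard-error bars.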
Finally, this approach lends itself to the use of many built-in functions for generating confidence limits with more statistical rigor.
ggp+stat_summary(fun.data=mean_cl_normal, fun.args=list(conf.int=0.95),
geom="errorbar",position=position_dodge(width=0.85), width=0.1)
So here we use the ggplot built-in function mean_cl_normal to calculate 95% confidence limits on the mean assuming the data follows a normal distribution (and that, hence, the means will follow a t-distribution). We use the argument conf.int=... to specify the desired confidence interval, but the default is 0.95 so it really wasn't necessary in this example.
There are several other functions of this type: see the documentation and links therein for an explanation.
If you want to build your error bars by making a summary dataset, you just need to get that dataset in the correct format. There are lots of options for this; I will use dplyr. Notice I keep all the grouping variables from the plot in this dataset in a "tidy" format, with each variable in a separate column.
library(dplyr)
meandat = Reed %>%
group_by(treatment, time, origin) %>%
summarise(mean = mean(sulfide, na.rm = TRUE), se = SE(sulfide))
Source: local data frame [8 x 5]
Groups: treatment, time [?]
treatment time origin mean se
(fctr) (fctr) (chr) (dbl) (dbl)
1 1 5 A 7.50 3.378856
2 1 5 B 10.50 2.629956
3 1 10 A 31.50 7.858117
4 1 10 B 43.00 6.819091
5 2 5 A 31.50 7.858117
6 2 5 B 43.00 6.819091
7 2 10 A 138.25 23.552689
8 2 10 B 141.00 17.540429
Now error bars can be added via geom_errorbar. You'll see I set the aesthetics globally within ggplot to save myself having to re-type some of these, but you can change this as you want. I use position_dodge to get the error bars placed correctly over each bar.
ggplot(data = Reed, aes(y = sulfide, x = treatment, fill=origin)) +
geom_bar(stat="identity", position="dodge") +
theme_bw() +
facet_grid( ~ time)+
xlab("treatment") +
ylab("Sulfide")+
ggtitle("Time")+
geom_errorbar(data = meandat, aes(ymin = mean - se, ymax = mean + se, y = mean),
position = position_dodge(width = .9))
You can actually do all of this via stat_summary, rather than calculating the summary statistics "by hand". An example is here. The code would look like so, and gives the same plot as above.
ggplot(data = Reed, aes(y = sulfide, x = treatment, fill=origin)) +
geom_bar(stat="identity",position="dodge") +
theme_bw() +
facet_grid( ~ time) +
xlab("treatment") +
ylab("Sulfide") +
ggtitle("Time") +
stat_summary(geom = "errorbar", fun.data = mean_cl_normal, mult = 1,
position = position_dodge(width = .9))
I've been using the development version of ggplot2, ggplot2_1.0.1.9003, and found that I needed to add stat_summary function arguments via fun.args. This would look like fun.args = list(mult = 1) to get error bars of 1 standard error.
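Here is a minimal sketch of passing summary-function arguments through fun.args, assuming a ggplot2 version that supports it. It uses mtcars and mean_se (rather than the Reed data and mean_cl_normal) so that it runs without Hmisc, and it reads the computed limits back with layer_data() to confirm that mult was applied. The names bars and manual_se are assumptions for illustration.

```r
library(ggplot2)

# Error bars of +/- 2 standard errors; mult is forwarded to mean_se()
p <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  stat_summary(geom = "errorbar", fun.data = mean_se,
               fun.args = list(mult = 2), width = 0.2)

# Pull the computed ymin/ymax for the first (only) layer
bars <- layer_data(p, 1)

# Standard error per cylinder group, computed by hand
manual_se <- tapply(mtcars$mpg, mtcars$cyl,
                    function(v) sd(v) / sqrt(length(v)))

# Each bar should span 4 standard errors in total (2 below, 2 above)
cbind(bar_height = bars$ymax - bars$ymin, four_se = 4 * manual_se)
```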

How to combine stat_ecdf with geom_ribbon?

I am trying to draw an ECDF of some data with a "confidence interval" represented via a shaded region using ggplot2. I am having trouble combining geom_ribbon() with stat_ecdf() to achieve the effect I am after.
Consider the following example data:
set.seed(1)
dat <- data.frame(variable = rlnorm(100) + 2)
dat <- transform(dat, lower = variable - 2, upper = variable + 2)
> head(dat)
variable lower upper
1 2.534484 0.5344838 4.534484
2 3.201587 1.2015872 5.201587
3 2.433602 0.4336018 4.433602
4 6.929713 4.9297132 8.929713
5 3.390284 1.3902836 5.390284
6 2.440225 0.4402254 4.440225
I am able to produce an ECDF of variable using
library("ggplot2")
ggplot(dat, aes(x = variable)) +
geom_step(stat = "ecdf")
However I am unable to use lower and upper as the ymin and ymax aesthetics of geom_ribbon() to superimpose the confidence interval on the plot as another layer. I have tried:
ggplot(dat, aes(x = variable)) +
geom_ribbon(aes(ymin = lower, ymax = upper), stat = "ecdf") +
geom_step(stat = "ecdf")
but this raises the following error
Error: geom_ribbon requires the following missing aesthetics: ymin, ymax
Is there a way to coax geom_ribbon() into working with stat_ecdf() to produce a shaded confidence interval? Or, can anyone suggest an alternative means of adding a shaded polygon defined by lower and upper as a layer to the ECDF plot?
Try this (a bit of a shot in the dark):
ggplot(dat, aes(x = variable)) +
geom_ribbon(aes(x = variable,ymin = ..y..-2,ymax = ..y..+2), stat = "ecdf",alpha=0.2) +
geom_step(stat = "ecdf")
OK, so that's not the same thing as what you are trying to do, but it should explain what's going on. The stat returns a data frame with just the original x and the computed y, so I think that's all you have to work with, i.e. stat_ecdf only computes the cumulative distribution function for a single x at a time.
The only other thing I can think of is the obvious, calculating the lower and upper separately, something like this:
l <- ecdf(dat$lower)
u <- ecdf(dat$upper)
v <- ecdf(dat$variable)
dat$lower1 <- l(dat$variable)
dat$upper1 <- u(dat$variable)
dat$variable1 <- v(dat$variable)
ggplot(dat,aes(x = variable)) +
geom_step(aes(y = variable1)) +
geom_ribbon(aes(ymin = upper1,ymax = lower1),alpha = 0.2)
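The reason upper1 is used as ymin and lower1 as ymax is that shifting the data to the right lowers its ECDF at any given x. A short sketch to confirm that ordering on the example data (the name x is an assumption for illustration):

```r
# Example data from the question
set.seed(1)
dat <- data.frame(variable = rlnorm(100) + 2)
dat <- transform(dat, lower = variable - 2, upper = variable + 2)

# ecdf() returns a function that can be evaluated at any point
l <- ecdf(dat$lower)
u <- ecdf(dat$upper)
v <- ecdf(dat$variable)

x <- dat$variable

# ECDF of the upper series lies at or below that of variable,
# and the ECDF of the lower series lies at or above it
stopifnot(all(u(x) <= v(x)), all(v(x) <= l(x)))

range(l(x) - u(x))   # vertical thickness of the resulting ribbon
```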
Not sure exactly how you want to reflect the CI, but ggplot_build() lets you get the generated data back from the plot, you can then overplot what you like.
This chart shows:
red = original ribbon
blue = takes the original CI vectors and applies to the ecdf curve
green = calculates the ecdf of upper and lower series and plots
g<-ggplot(dat, aes(x = variable)) +
geom_step(stat = "ecdf") +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha=0.5, fill="red")
inside<-ggplot_build(g)
matched<-merge(inside$data[[1]],data.frame(x=dat$variable,dat$lower,dat$upper),by=("x"))
g +
geom_ribbon(data=matched, aes(x = x,
ymin = y + dat.upper-x,
ymax = y - x + dat.lower),
alpha=0.5, fill="blue") +
geom_ribbon(data=matched, aes(x = x,
ymin = ecdf(dat.lower)(x),
ymax = ecdf(dat.upper)(x)),
alpha=0.5, fill="green")
