Violin plot with confidence interval in R

How can I add a confidence interval to this violin plot?
library(ggplot2)
df <- data.frame("Need" = c(3,4.3,4.5,2.2,5.1,5.2), "Condition" = c("A","A","A","B","B","B"))
ggplot(df, aes(x = Condition, y = Need, fill = Condition)) +
  geom_violin() +
  # mean_cl_boot needs the Hmisc package installed
  stat_summary(fun.data = "mean_cl_boot", geom = "pointrange",
               colour = "red") +
  ggtitle("Needs by condition violin plot")
I can't attach pictures yet, but you get the gist. With this code I can create violin plots with standard deviation lines for each violin, but I'd like to add 95% confidence interval lines instead.
Any ideas?

What you can do is first calculate the error bars per condition and after that add them by using geom_errorbar like this:
library(tidyverse)
# Summarise each condition; N is the number of observations in that condition
stats <- df %>%
  group_by(Condition) %>%
  summarise(N = n(), Mean = mean(Need), SD = sd(Need),
            CI_L = Mean - (SD * 1.96)/sqrt(N),
            CI_U = Mean + (SD * 1.96)/sqrt(N))
ggplot() +
  geom_violin(df, mapping = aes(x = Condition, y = Need, fill = Condition)) +
  geom_point(stats, mapping = aes(Condition, Mean)) +
  geom_errorbar(stats, mapping = aes(x = Condition, ymin = CI_L, ymax = CI_U), width = 0.2) +
  ggtitle("Needs by condition violin plot")
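With only three observations per condition, the normal multiplier 1.96 tends to understate the interval. A minimal sketch of a t-based alternative follows, assuming the same df as in the question; the names stats_t and t_crit are illustrative and not part of the original answer.

```r
library(dplyr)

df <- data.frame(Need = c(3, 4.3, 4.5, 2.2, 5.1, 5.2),
                 Condition = c("A", "A", "A", "B", "B", "B"))

# Per-condition mean, SD and a t-based 95% confidence interval.
# t_crit is the 97.5% quantile of Student's t with N - 1 degrees of freedom.
stats_t <- df %>%
  group_by(Condition) %>%
  summarise(N = n(), Mean = mean(Need), SD = sd(Need)) %>%
  mutate(t_crit = qt(0.975, df = N - 1),
         CI_L = Mean - t_crit * SD / sqrt(N),
         CI_U = Mean + t_crit * SD / sqrt(N))

print(stats_t)
```

The resulting table has the same CI_L and CI_U columns, so it can be passed to geom_errorbar in the same way as stats.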

Related

Binned Histogram with overlay of empirical and/or normal distribution [duplicate]

This question already has answers here:
ggplot2: histogram with normal curve
(5 answers)
Closed 1 year ago.
I am trying to look at the frequency distribution of a certain variable. Due to the large amount of data, I have created bins for a range of values and I'm plotting the count of each bin. I want to be able to overlay lines which will represent both the empirical distribution seen by my data, and what a theoretically normal distribution would look like. I can accomplish this without pre-binning my data or using ggplot2 by doing something such as this:
df <- ggplot2::diamonds
hist(df$price,freq = FALSE)
lines(density(df$price),lwd=3,col="blue")
or with ggplot2 as such:
mean_price <- mean(df$price)
sd_price <- sd(df$price)
ggplot(df, aes(x = price)) +
geom_histogram(aes(y = ..density..),
bins = 40, colour = "black", fill = "white") +
geom_line(aes(y = ..density.., color = 'Empirical'), stat = 'density') +
stat_function(fun = dnorm, aes(color = 'Normal'),
args = list(mean = mean_price, sd = sd_price)) +
scale_colour_manual(name = "Colors", values = c("red", "blue"))
but I cannot figure out how to overlay similar lines on my pre-binned data:
breaks <- seq(from=min(df$price),to=max(df$price),length.out=11)
price_freq <- cut(df$price,breaks = breaks,right = TRUE,include.lowest = TRUE)
ggplot(data = df,mapping = aes(x=price_freq)) +
stat_count() +
theme(axis.text.x = element_text(angle = 270))
# + geom_line(aes(y = ..density.., color = 'Empirical'), stat = 'density') +
# stat_function(fun = dnorm, aes(color = 'Normal'),
# args = list(mean = mean_price, sd = sd_price)) +
# scale_colour_manual(name = "Colors", values = c("red", "blue"))
Any ideas?
Your problem is that cut gives you a factor/character for your x-axis. You need a numeric x-axis to add the other layers. A first step might be to try the following. I added a small fudge to get the last bin to work out.
library(tidyverse)
df <- ggplot2::diamonds
mean_price <- mean(df$price)
sd_price <- sd(df$price)
num_bins <- 40
breaks <- seq(from=min(df$price),to=max(df$price)+1e-10,length.out=num_bins+1)
midpoints <- (breaks[1:num_bins] + breaks[2:(num_bins+1)])/2
precomputed <- df %>%
mutate(bin_left = breaks[findInterval(price, breaks)],
bin_mid = midpoints[findInterval(price, breaks)]) %>%
count(bin_mid)
precomputed %>%
ggplot(aes(x = bin_mid, weight = n)) +
geom_histogram(aes(y = ..density..), bins = num_bins, boundary = breaks[1], colour = "black", fill = "white") +
geom_line(aes(y = ..density.., color = 'Empirical'), stat = 'density') +
stat_function(fun = dnorm, aes(color = 'Normal'),
args = list(mean = mean_price, sd = sd_price)) +
scale_colour_manual(name = "Colors", values = c("red", "blue"))
But you will notice that the red Empirical curve is quite different from your ggplot2 example. The reason is that here it is being computed using the summary data which moves all x-values to the bin midpoint. You will need to pre-compute this empirical curve, or drop it and rely on the histogram to represent this data.
Sorry for the partial answer.
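One way to pre-compute the empirical curve, sketched under the assumption that the raw prices are still available next to the binned counts: estimate the density once from the raw data with density() and draw it as its own layer. The names binned, bin_width and empirical are illustrative, and geom_col is used here in place of geom_histogram because the bar heights are already computed.

```r
library(ggplot2)

df <- ggplot2::diamonds
num_bins <- 40
breaks <- seq(min(df$price), max(df$price), length.out = num_bins + 1)
bin_width <- diff(breaks)[1]

# Pre-binned summary: one row per bin, with heights scaled to a density
binned <- data.frame(
  bin_mid = (head(breaks, -1) + tail(breaks, -1)) / 2,
  n = as.vector(table(cut(df$price, breaks, include.lowest = TRUE)))
)
binned$density <- binned$n / (sum(binned$n) * bin_width)

# Empirical curve estimated from the raw prices, not from the bin midpoints
dens <- density(df$price)
empirical <- data.frame(price = dens$x, density = dens$y)

p <- ggplot(binned, aes(x = bin_mid, y = density)) +
  geom_col(width = bin_width, colour = "black", fill = "white") +
  geom_line(data = empirical, aes(x = price, y = density, colour = "Empirical")) +
  stat_function(fun = dnorm, aes(colour = "Normal"),
                args = list(mean = mean(df$price), sd = sd(df$price))) +
  scale_colour_manual(name = "Colors",
                      values = c(Empirical = "red", Normal = "blue"))
print(p)
```

Because the curve comes from the raw values, it should match the one in the unbinned ggplot2 example rather than the midpoint-based version.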
Take a look at the PearsonDS package (I am guessing you are not using rnorm for a reason). The easiest approach may be to generate a vector of data that meets your requirements and map that vector using geom_line.
library("PearsonDS")
df <- rpearson(5000,moments=c(mean=10,variance=2,skewness=0,kurtosis=3))

Confidence interval graph in ggplot: how to make it clearer?

I have continuous data (spectra) and I would like to determine the significance of the differences. I tried to do it using confidence intervals in R with ggplot2.
Here is the code:
ggplot(df, aes(x = wn, y = value)) +
  geom_line(aes(colour = group)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, fill = group))
I have around 15 spectra, and they are indistinguishable in the resulting graph.
How could I make it clearer?

How to plot multiple mean lines in a single histogram with multiple groups present?

I am plotting the distributions of two variables on a single histogram. I am interested in highlighting each distribution's mean value on that graph with a dotted line or something similar (but hopefully something that matches the color already set in the aes section of the code).
How would I do that?
This is my code so far.
hist_plot <- ggplot(data, aes(x= value, fill= type, color = type)) +
geom_histogram(position="identity", alpha=0.2) +
labs( x = "Value", y = "Count", fill = "Type", title = "Title") +
guides(color = FALSE)
Also, is there any way to show the count of n for each type on this graph?
I've made some reproducible code that might help you with your problem.
library(tidyverse)
# Generate some random data
df <- data.frame(value = c(runif(50, 0.5, 1), runif(50, 1, 1.5)),
type = c(rep("type1", 50), rep("type2", 50)))
# Calculate means from df
stats <- df %>% group_by(type) %>% summarise(mean = mean(value),
n = n())
# Make the ggplot
ggplot(df, aes(x= value, fill= type, color = type)) +
geom_histogram(position="identity", alpha=0.2) +
labs(x = "Value", y = "Count", fill = "Type", title = "Title") +
guides(color = "none") +
geom_vline(data = stats, aes(xintercept = mean, color = type), linewidth = 2) +
# y = Inf pins each count label to the top of the panel
geom_text(data = stats, aes(x = mean, y = Inf, label = n), vjust = 1.5,
size = 10,
color = "black")
If things go as intended, you'll end up with a histogram that has one mean line per type and the count n printed at each line.

Plot standard deviation

I want to plot the standard deviation for one line (one flow series; the plot will have two) as lines or a smoothed area. I've seen and applied some code from sd representation and other examples... but it's not working for me.
My original data has several flow values for the same day, from which I've calculated the daily mean and sd. I'm stuck here: I don't know whether it is possible to represent the daily sd with lines from the created column called "sd", or whether I should use the original data.
The code below is a general example of what I'll apply to my data. flow, flow1 and sd are examples of the results of calculating the daily mean and sd from the original data.
library(gridExtra)
library(ggplot2)
library(grid)
x <- data.frame(
date = seq(as.Date("2012-01-01"),as.Date("2012-12-31"), by="week"),
rain = sample(0:20,53,replace=T),
flow1 = sample(50:150,53,replace=T),
flow = sample(50:200,53,replace=T),
sd = sample (0:10,53, replace=T))
g.top <- ggplot(x, aes(x = date, y = rain, ymin=0, ymax=rain)) +
geom_linerange() +
scale_y_continuous(limits=c(22,0),expand=c(0,0), trans="reverse")+
theme_classic() +
theme(plot.margin = unit(c(5,5,-32,6),units="points"),
axis.title.y = element_text(vjust = 0.3))+
labs(y = "Rain (mm)")
g.bottom <- ggplot(x, aes(x = date)) +
geom_line(aes(y = flow, colour = "flow")) +
geom_line(aes(y = flow1, colour = "flow1")) +
stat_summary(geom="ribbon", fun.ymin="min", fun.ymax="max", aes(fill=sd), alpha=0.3) +
theme_classic() +
theme(plot.margin = unit(c(0,5,1,1),units="points"),legend.position="bottom") +
labs(x = "Date", y = "River flow (m/s)")
grid.arrange(g.top, g.bottom , heights = c(1/5, 4/5))
The above code gives Error: stat_summary requires the following missing aesthetics: y
Another option is geom_smooth, but as far as I understand it requires some line equation (I may be wrong; I'm new to R).
Something like this maybe?
library(dplyr)  # %>%, select, mutate
library(tidyr)  # gather
g.bottom <- x %>%
select(date, flow1, flow, sd) %>%
gather(key, value, c(flow, flow1)) %>%
mutate(min = value - sd, max = value + sd) %>%
ggplot(aes(x = date)) +
geom_ribbon(aes(ymin = min, ymax = max, fill = key)) +
geom_line(aes(y = value, colour = key)) +
scale_fill_manual(values = c("grey", "grey")) +
theme_classic() +
theme(plot.margin = unit(c(0,5,1,1),units="points"),legend.position="bottom") +
labs(x = "Date", y = "River flow (m/s)")
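If the original data with several flow values per day is still available, the daily mean and sd can also be computed in the same pipeline that feeds the ribbon. This is only a sketch on made-up data: raw, daily, mean_flow and sd_flow are hypothetical names, and the five readings per day are an assumption.

```r
library(dplyr)
library(ggplot2)

set.seed(42)
# Hypothetical raw data: five flow readings per day for 30 days
raw <- data.frame(
  date = rep(seq(as.Date("2012-01-01"), by = "day", length.out = 30), each = 5),
  flow = runif(150, 50, 200)
)

# One row per day with the daily mean and standard deviation
daily <- raw %>%
  group_by(date) %>%
  summarise(mean_flow = mean(flow), sd_flow = sd(flow))

# Ribbon spans mean - sd to mean + sd; the line shows the daily mean
p <- ggplot(daily, aes(x = date)) +
  geom_ribbon(aes(ymin = mean_flow - sd_flow, ymax = mean_flow + sd_flow),
              fill = "grey", alpha = 0.5) +
  geom_line(aes(y = mean_flow)) +
  theme_classic() +
  labs(x = "Date", y = "River flow (m/s)")
print(p)
```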

Wrong location of errorbars

I am trying to make a plot with multiple lines, using stat_summary to mark the mean values. When I add geom_errorbar(), some of the error bars are drawn some distance away from the mean points, so they appear to be 'flying'. What is happening?
Thanks!
My code:
#First I add another data set with SE, SD and mean.
cdata <- ddply(data2, c("OGTT","Treatment"), summarise,
N = sum(!is.na(Glucose)),
mean = mean(Glucose, na.rm=TRUE),
sd = sd(Glucose, na.rm=TRUE),
se = sd / sqrt(N))
#Then I merge it with my original data
totalglu<-merge(data2,cdata)
#Then I make the ggplot
p<-ggplot(data=totalglu, aes(x = factor(OGTT), y = Glucose, group = StudyID, color=StudyID)) +
geom_line() +
facet_grid(End.start ~Treatment)+
stat_summary(aes(group = Treatment), geom = "point", fun.y = mean, shape = 16, size = 2) +
theme(legend.position="none") +
labs(x = "OGTT time points (min)",y= "Glucose (mmol/l)")+
geom_errorbar(aes(ymin=mean-se,ymax=mean+se), width=.1, colour="black")
p
It appears that you are not using End.start when calculating the bars, but it is being used by stat_summary because of the faceting.
Try:
cdata <- ddply(data2, c("OGTT","Treatment","End.start"), summarise,
N = sum(!is.na(Glucose)),
mean = mean(Glucose, na.rm=TRUE),
sd = sd(Glucose, na.rm=TRUE),
se = sd / sqrt(N))
#Then I merge it with my original data
totalglu<-merge(data2,cdata)
#Then I make the ggplot
p<-ggplot(data=totalglu, aes(x = factor(OGTT), y = Glucose, group = StudyID, color=StudyID)) +
geom_line() +
facet_grid(End.start ~Treatment)+
stat_summary(aes(group = Treatment), geom = "point", fun.y = mean, shape = 16, size = 2) +
theme(legend.position="none") +
labs(x = "OGTT time points (min)",y= "Glucose (mmol/l)")+
geom_errorbar(aes(ymin=mean-se,ymax=mean+se), width=.1, colour="black")
p
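Though, without the actual starting data, I am not quite sure what data2 looks like, or how ddply is affecting things. Instead, I might suggest skipping cdata altogether and letting stat_summary compute the intervals from data2 directly. Note that mean_cl_normal needs the Hmisc package installed and draws a 95% confidence interval around the mean rather than mean ± se:
ggplot(data=data2, aes(x = factor(OGTT), y = Glucose, group = StudyID, color=StudyID)) +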
geom_line() +
facet_grid(End.start ~Treatment)+
stat_summary(aes(group = Treatment), fun.data = mean_cl_normal) +
theme(legend.position="none") +
labs(x = "OGTT time points (min)",y= "Glucose (mmol/l)")
