I have a data frame with several columns and I need to draw boxplots and a some kind of an interval plot (with 2.5% and 97.5%) for each column.
My data set looks like this:
set.seed(123)
x1=rnorm(100,0,1)
x2=rnorm(100,0,0.5)
x3=rnorm(100,0,0.6)
data_x=data.frame(x1,x2,x3)
I was able to draw the boxplots for this data using following lines of code:
datax_long=data_x %>% gather(x ,value ,x1:x3)
ggplot(data=datax_long, aes(y= x, x=value, fill=x))+ geom_boxplot()
Now I need to draw a interval plot for each column. It is kind of a horizontal line from 2.5%th percentile to 97.5%th percentile. The range of values for each variable should roughly the same as in the boxplot.
Is this something we can do using ggplot2 package in R ?
Thank you
Something like this should work:
ggplot(datax_long, aes(x = value, y = x)) +
stat_summary(geom = "errorbarh",
fun.min = function(z) quantile(z, .025),
fun = mean,
fun.max = function(z) quantile(z, 0.975), color = "red") +
stat_summary(geom = "point", fun = mean, color = "blue")
Related
I want to change the summary statistics shown in the following boxplot:
I have created the boxplot as follows:
ggplot(as.data.frame(beta2), aes(y=var1,x=as.factor(Year))) +
geom_boxplot(outlier.shape = NA)+
ylab(expression(beta[1]))+
xlab("\nYear")+
theme_bw()
The default is for the box is the first and third quantile. I want the box to show the 2.5% and 97.5% quantiles. I know one can easily change what is shown when one boxplot is visualised by adding the following to geom_boxplot:
aes(
ymin= min(var1),
lower = quantile(var1,0.025),
middle = mean(var1),
upper = quantile(var1,0.975),
ymax=max(var1))
However, this does not work for when boxplots are generated by group. Any idea how one would do this? You can use the Iris data set:
ggplot(iris, aes(y=Sepal.Length,x=Species)) +
geom_boxplot(outlier.shape = NA)
EDIT:
The answer accepted does work. My data-frame is really big and as such the method provided takes a bit of time. I found another solution here: SOLUTION that works for large datasets and my specific need.
This could be achieved via stat_summary by setting geom="boxplot". and passing to fun.data a function which returns a data frame with the summary statistics you want to display as ymin, lower, ... in your boxplot:
library(ggplot2)
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
stat_summary(geom = "boxplot", fun.data = function(x) {
data.frame(
ymin = min(x),
lower = quantile(x, 0.025),
middle = mean(x),
upper = quantile(x, 0.975),
ymax = max(x)
)}, outlier.shape = NA)
I'm trying to replicate this image.
I was able to plot a scatter plot and the median (but it's not continuous).
I failed to plot the percentiles.
The median varies according to different spell length.
ggplot(df,aes(x=Spell.Length,y=Growth.Rate)) +
geom_point() +
stat_summary(fun = median, fun.min = median, fun.max = median,
geom = "crossbar", width = 0.5,colour="red")
What I'm trying to do
What I got so far
Use dplyr::summarize to create a data frame of the values of percentiles also group_by(Spell.Length), then plot those using geom_line(). Then the horizontal lines with geom_hline().
df %>% group_by(Spell.Length) %>%
summarize(median = quantile(Growth.Rate, p = .5), q1 = quantile(Growth.Rate, p = .25)) %>%
ggplot(aes(x = Spell.Length, y = median) +
geom_line() +
geom_line(aes(x = Spell.Length, y = q1)) +
geom_hline(yintercept = 3)
would be the basic idea.
geom_line() for each specific line style/group
Red lines geom_hline()
About 18 months ago, this helpful exchange appeared, with code to show how to produce a plot of median along with interquartile ranges. Here's the code:
ggplot(data = diamonds) +
geom_pointrange(mapping = aes(x = cut, y = depth),
stat = "summary",
fun.ymin = function(z) {quantile(z,0.25)},
fun.ymax = function(z) {quantile(z,0.75)},
fun.y = median)
Producing this plot:
What I'd wonder is how to add labels for the median and IQ ranges, and how to format the bar (color, alpha, etc). I tried calling the plot as an object to see if there were objects within I could then use to call format functions, but nothing was obvious when I looked at it in the r Studio IDE.
Is this even doable? I know I can do a boxplot but that would have to include min/max. I'd like to produce boxplots with just mean/median and IQs.
You can change the formating like you would any ggplot layer, see the docs for Vertical intervals: lines, crossbars & errorbars in this case. An example of this is the following:
library(ggplot2)
ggplot(data = diamonds) +
geom_pointrange(mapping = aes(x = cut, y = depth),
stat = "summary",
fun.ymin = function(z) {quantile(z,0.25)},
fun.ymax = function(z) {quantile(z,0.75)},
fun.y = median,
size = 4, # <- adjusts size
colour = "red", # <- adjusts colour
alpha = .3) # <- adjusts transparency
If you want to control formatting for the points and lines individually you need to do as #camille suggests and pre-process your data as geom_pointrange() draws a single graphical object so the points and lines are one in the same.
I would suggest something like this:
library(dplyr)
library(ggplot2)
diamonds %>%
group_by(cut) %>%
summarise(median = median(depth),
lq = quantile(depth, 0.25),
uq = quantile(depth, 0.75)) %>%
ggplot(aes(cut, median)) +
geom_linerange(aes(ymin=lq, ymax=uq), size = 4, colour = "blue", alpha = .4) +
geom_point(size = 10, colour = "red", alpha = .8)
I am trying to generate a (grouped) density plot labelled with sample sizes.
Sample data:
set.seed(100)
df <- data.frame(ab.class = c(rep("A", 200), rep("B", 200)),
val = c(rnorm(200, 0, 1), rnorm(200, 1, 1)))
The unlabelled density plot is generated and looks as follows:
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
What I want to do is add text labels somewhere near the peak of each density, showing the number of samples in each group. However, I cannot find the right combination of options to summarise the data in this way.
I tried to adapt the code suggested in this answer to a similar question on boxplots: https://stackoverflow.com/a/15720769/1836013
n_fun <- function(x){
return(data.frame(y = max(x), label = paste0("n = ",length(x))))
}
ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4) +
stat_summary(geom = "text", fun.data = n_fun)
However, this fails with Error: stat_summary requires the following missing aesthetics: y.
I also tried adding y = ..density.. within aes() for each of the geom_density() and stat_summary() layers, and in the ggplot() object itself... none of which solved the problem.
I know this could be achieved by manually adding labels for each group, but I was hoping for a solution that generalises, and e.g. allows the label colour to be set via aes() to match the densities.
Where am I going wrong?
The y in the return of fun.data is not the aes. stat_summary complains that he cannot find y, which should be specificed in global settings at ggplot(df, aes(x = val, group = ab.class, y = or stat_summary(aes(y = if global setting of y is not available. The fun.data compute where to display point/text/... at each x based on y given in the data through aes. (I am not sure whether I have made this clear. Not a native English speaker).
Even if you have specified y through aes, you won't get desired results because stat_summary compute a y at each x.
However, you can add text to desired positions by geom_text or annotate:
# save the plot as p
p <- ggplot(df, aes(x = val, group = ab.class)) +
geom_density(aes(fill = ab.class), alpha = 0.4)
# build the data displayed on the plot.
p.data <- ggplot_build(p)$data[[1]]
# Note that column 'scaled' is used for plotting
# so we extract the max density row for each group
p.text <- lapply(split(p.data, f = p.data$group), function(df){
df[which.max(df$scaled), ]
})
p.text <- do.call(rbind, p.text) # we can also get p.text with dplyr.
# now add the text layer to the plot
p + annotate('text', x = p.text$x, y = p.text$y,
label = sprintf('n = %d', p.text$n), vjust = 0)
I am doing a basic boxplot where y=age and x=Patient groups
age <- ggplot(data, aes(factor(group2), age)) + ylim(15, 80)
age + geom_boxplot(fill = "grey80", colour = "#3366FF")
I was hoping you could help me out with a few things:
1) Is it possible to include a number of observations per group above each group boxplot (but NOT on the X axis where my group labels are) without having to do this in paint :)?
I have tried using:
age + annotate("text", x = "CON", y = 60, label = "25")
where CON is the 1st group and y = 60 is ~ just above the boxplot for this group. However, the command didn't work. I assume it has something to do that it reads x as a continuous rather than a categorical variable.
2) Also although there are plenty of questions about using the mean rather than the median for the boxplots, I still haven`t found a code that works for me?
3) On the same matter is there a way you could include the mean group stat in the boxplot? Perhaps using
age + stat_summary(fun.y=mean, colour="red", geom="point")
which however only includes a dot of where the mean lies. Or again using
age + annotate("text", x = "CON", y = 30, label = "30")
where CON is the 1st group and y = 30 is ~ the group age mean.
Knowing how flexible and rich ggplot2 syntax is I was hoping that there is a more elegant way of using the real stats output rather than annotate.
Any suggestions/links would be much appreciated!
Thanks!!
Is this anything like what you're after? With stat_summary, as requested:
# function for number of observations
give.n <- function(x){
return(c(y = median(x)*1.05, label = length(x)))
# experiment with the multiplier to find the perfect position
}
# function for mean labels
mean.n <- function(x){
return(c(y = median(x)*0.97, label = round(mean(x),2)))
# experiment with the multiplier to find the perfect position
}
# plot
ggplot(mtcars, aes(factor(cyl), mpg, label=rownames(mtcars))) +
geom_boxplot(fill = "grey80", colour = "#3366FF") +
stat_summary(fun.data = give.n, geom = "text", fun.y = median) +
stat_summary(fun.data = mean.n, geom = "text", fun.y = mean, colour = "red")
Black number is number of observations, red number is mean value. joran's answer shows you how to put the numbers at the top of the boxes
hat-tip: https://stackoverflow.com/a/3483657/1036500
I think this is what you're looking for maybe?
myboxplot <- ddply(mtcars,
.(cyl),
summarise,
min = min(mpg),
q1 = quantile(mpg,0.25),
med = median(mpg),
q3 = quantile(mpg,0.75),
max= max(mpg),
lab = length(cyl))
ggplot(myboxplot, aes(x = factor(cyl))) +
geom_boxplot(aes(lower = q1, upper = q3, middle = med, ymin = min, ymax = max), stat = "identity") +
geom_text(aes(y = max,label = lab),vjust = 0)
I just realized I mistakenly used the median when you were asking about the mean, but you can obviously use whatever function for the middle aesthetic you please.
Answer to the first problem.
To show value above the box you should provide x values as numeric not as level names. So, to plot the value above first value give x=1.
data(ToothGrowth)
ggplot(ToothGrowth,aes(supp,len))+geom_boxplot()+
annotate("text",x=1,y=32,label=30)