I'm making a boxplot with the ggplot2 package; however, for some reason only half of the box is drawn for the "Control" and "Commercial IMD" treatments.
When I make the graph with the base boxplot function, it is drawn normally:
mediasCon = tapply(dados$CS, dados$Trat, mean)
boxplot(dados$CS ~ dados$Trat, data = dados, col="gray",
xlab = 'Tratamentos', ylab = 'Espermatozoides - Cabeça Solta')
points(1:3, mediasCon, col = 'Red', pch = 16)
However, when I make the same graph with ggplot2, only half of the box is drawn for the first two treatments. Why is this occurring?
Also, how do I add whisker end caps ("tails") to the boxplot in ggplot2?
library(ggplot2)
ggplot(data=dados, aes(x=Trat, y=CS)) + geom_boxplot(fill=c("#DEEBF7","#2171B5","#034E7B"),color="black") +
xlab('Tratamentos') +
ylab('Espermatozoides - Cabeça Solta') +
stat_summary(fun=mean, colour="black", geom="point",
shape=18, size=5) +
theme(axis.title = element_text(size = 20),
axis.text = element_text(size = 16))
If you look at the help file under ?geom_boxplot you will see:
The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). This differs slightly from the method used by the boxplot() function, and may be apparent with small samples. See boxplot.stats() for more information on how hinge positions are calculated for boxplot().
In your case, the 4 entries for IMD Commercial are c(0, 1, 1, 1), which is certainly a small sample.
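To see the difference concretely for these four values, here is a quick check in base R. It is only a sketch: it assumes geom_boxplot computes its quartiles with quantile() at the default type = 7, which is my understanding of the current behaviour rather than something stated in the help text quoted above.

```r
# The four values reported for the "IMD Commercial" treatment
x <- c(0, 1, 1, 1)

# Hinges as used by base boxplot(): Tukey's five-number summary
hinges <- fivenum(x)[c(2, 4)]

# Quartiles with quantile()'s default (type = 7, linear interpolation)
q_default <- unname(quantile(x, c(0.25, 0.75)))

# Quartiles with type = 2 (averaging at discontinuities)
q_type2 <- unname(quantile(x, c(0.25, 0.75), type = 2))

print(rbind(hinges, q_default, q_type2))
```

For this sample the lower hinge is 0.5, the default lower quartile is 0.75, and type = 2 gives 0.5 again. The upper value is 1 under all three methods, the same as the median, so the upper half of the box has zero height however the hinges are computed.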
One way around this is to calculate where you want the hinges to be and pass that data to ggplot, using stat = "identity". This makes the code a bit more complex, but this is often the case when you are trying to modify default behaviour:
library(ggplot2)
library(dplyr)
dados %>%
group_by(Trat) %>%
summarize(median = median(CS), mean = mean(CS),
upper = quantile(CS, 0.75, type = 2),
lower = quantile(CS, 0.25, type = 2),
max = max(CS), min = min(CS)) %>%
ggplot(aes(x = Trat, y = mean, fill = Trat)) +
geom_boxplot(aes(ymin = min, lower = lower,
middle = median, upper = upper, ymax = max),
stat = "identity", color = "black") +
geom_point(size = 3, shape = 21, fill = "red") +
scale_fill_manual(values = c("#DEEBF7","#2171B5","#034E7B")) +
theme_classic() +
xlab('Tratamentos') +
ylab('Espermatozoides - Cabeça Solta')
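As for the whisker "tails", one common approach is to draw the whiskers as an error-bar layer underneath the boxes with stat_boxplot(geom = "errorbar"). The sketch below uses a made-up data frame, dados_demo, as a stand-in for dados, so the numbers are purely illustrative.

```r
library(ggplot2)

# Hypothetical stand-in for `dados`: three treatments with small counts
set.seed(42)
dados_demo <- data.frame(
  Trat = rep(c("A", "B", "C"), each = 8),
  CS   = rpois(24, lambda = 3)
)

p <- ggplot(dados_demo, aes(x = Trat, y = CS)) +
  # Draw the whiskers as error bars first so that their end caps are visible
  stat_boxplot(geom = "errorbar", width = 0.2) +
  # Then draw the boxes on top, covering the part of the bar inside the box
  geom_boxplot(fill = "#DEEBF7", colour = "black")
p
```

The width argument controls the length of the caps. Recent ggplot2 releases (3.5.0 and later, as far as I know) also accept a staplewidth argument in geom_boxplot that draws the caps without the extra layer.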
I'm trying to make a means plot with confidence intervals, but I would like the intervals to be Tukey HSD intervals computed after an ANOVA.
I'll use the following example to explain. The data frame contains a factor, poison, with levels {1, 2, 3}.
library(magrittr)
library(ggplot2)
library(ggpubr)
library(dplyr)
library("agricolae")
PATH <- "https://raw.githubusercontent.com/guru99-edu/R-Programming/master/poisons.csv"
df <- read.csv(PATH) %>%
select(-X) %>%
mutate(poison = factor(poison, ordered = TRUE))
glimpse(df)
ggplot(df, aes(x = poison, y = time, fill = poison)) +
geom_boxplot() +
geom_jitter(shape = 15,
color = "steelblue",
position = position_jitter(0.21)) +
theme_classic()
anova_one_way <- aov(time ~ poison, data = df)
summary(anova_one_way)
# Use TukeyHSD
tukeyHSD <- TukeyHSD(anova_one_way)
plot(tukeyHSD)
I would like the plot to be similar to the one from Statgraphics, where you can see the mean point and the length of the bars is the Tukey HSD interval, so that at a glance you can see which level is best and whether it is statistically significantly better.
I have seen some examples in more complex questions, but they are for boxplots and I don't understand them well enough to adapt the solutions here:
Tukey's results on boxplot in R (example1)
TukeyHSD results on boxplot after two-way anova (example2)
Edit:
The answer provided by Allan Cameron (@allan-cameron) is great; however, it doesn't work on my computer right now, probably because of package versions: the stat_summary argument names have changed a bit. I took his solution and made a couple of changes to make it work for me.
# Allan's original interval calculation
tukeyCI <- (tukeyHSD$poison[1, 1] - tukeyHSD$poison[1, 2]) / 2
# Changed fun.max and fun.min to fun.ymax and fun.ymin,
# and fun to fun.y, so that Allan's solution works with ggplot2 3.1.0.
ggplot(df, aes(x = poison, y = time)) +
stat_summary(fun.ymax = function(x) mean(x) + tukeyCI,
fun.ymin = function(x) mean(x) - tukeyCI,
geom = 'errorbar', size = 1, color = 'gray50',
width = 0.25) +
stat_summary(fun.y = mean, geom = 'point', size = 4, shape = 21,
fill = 'white') +
geom_point(position = position_jitter(width = 0.25), alpha = 0.4,
color = 'deepskyblue4') +
theme_minimal(base_size = 16)
With the original code, the output was:
Warning: Ignoring unknown parameters: fun.max, fun.min
Warning: Ignoring unknown parameters: fun
No summary function supplied, defaulting to `mean_se()`
I'm currently using these versions:
version R version 3.5.2 (2018-12-20)
packageVersion("ggplot2") ‘3.1.0’
packageVersion("dplyr") ‘0.7.8’
The image from statgraphics shows error bars around the mean points, and if I understand you correctly then you want to be able to draw error bars around your mean points such that non-overlapping error bars mean there are significant differences between the variables. That being the case, we can extract the required confidence interval like this:
tukeyCI <- (tukeyHSD$poison[1,1] - tukeyHSD$poison[1,2])/2
And we can draw the result in ggplot like this:
ggplot(df, aes(x = poison, y = time)) +
stat_summary(fun.max = function(x) mean(x) + tukeyCI,
fun.min = function(x) mean(x) - tukeyCI,
geom = 'errorbar', size = 1, color = 'gray50',
width = 0.25) +
stat_summary(fun = mean, geom = 'point', size = 4, shape = 21,
fill = 'white') +
geom_point(position = position_jitter(width = 0.25), alpha = 0.4,
color = 'deepskyblue4') +
theme_minimal(base_size = 16)
Here we can see that there are significant differences between 1 and 3, and between 2 and 3, but that the difference between 1 and 2 is non-significant.
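Since the stat_summary argument names differ between ggplot2 versions (fun.y, fun.ymin and fun.ymax before 3.3.0; fun, fun.min and fun.max from 3.3.0 on), another option is to compute the means and bar limits beforehand and draw them with geom_errorbar, which avoids those arguments altogether. The sketch below uses the built-in PlantGrowth data instead of the poisons file so that it runs without a download; the names half_bar and means are my own.

```r
library(ggplot2)

# One-way ANOVA and Tukey HSD on a built-in, balanced data set
fit <- aov(weight ~ group, data = PlantGrowth)
tuk <- TukeyHSD(fit)$group

# Half of the half-width of a pairwise Tukey interval (diff - lwr).
# In a balanced design this value is the same for every pair.
half_bar <- (tuk[1, "diff"] - tuk[1, "lwr"]) / 2

# Group means with the bar limits computed by hand
means <- aggregate(weight ~ group, data = PlantGrowth, FUN = mean)
means$lower <- means$weight - half_bar
means$upper <- means$weight + half_bar

p <- ggplot(means, aes(x = group, y = weight)) +
  geom_errorbar(aes(ymin = lower, ymax = upper),
                width = 0.25, colour = "gray50") +
  geom_point(size = 4, shape = 21, fill = "white")
p
```

The reason for dividing by two is that two bars of half-length half_bar stop overlapping exactly when the means differ by more than diff - lwr, which is the same condition as the pairwise Tukey interval excluding zero. This holds for balanced designs; with unequal group sizes the interval widths differ between pairs and a single bar length is only an approximation.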
I'm creating a bar chart with a pattern for a subset of the bars, and I want to add error bars.
However, I'm having trouble lining up the error bars with the bars: I want them to appear centered on each bar. How do I do this? Moreover, the legend currently does not clearly distinguish the striped and non-striped bars as corresponding to the not treated and treated groups.
Finally, I'd like to create a version of this plot which stacks adjacent bars (i.e. bars within each facet_grid panel). Any tips on how to do that would be much appreciated.
The code I'm using is:
library(ggplot2)
library(tidyverse)
library(ggpattern)
models = c("a", "b")
task = c("1","2")
ratios = c(0.3, 0.4)
standard_errors = c(0.02, 0.02)
ymax = ratios + standard_errors
ymin = ratios - standard_errors
colors = c("#F39B7FFF", "#8491B4FF")
df <- data.frame(task = task, ratios = ratios)
df <- df %>% mutate(filler = 1-ratios)
df <- df %>% gather(key = "obs", value = "ratios", -1)
df$upper <- df$ratios + c(standard_errors,standard_errors)
df$models <- c(models,models)
df$lower <- df$ratios - c(standard_errors,standard_errors)
df$col <- c(colors,colors)
df$group <- paste(df$task, df$models, sep="-")
df$treated <- "yes"
df[df$ratios<0.5,]$treated = "no"
p <- ggplot(df, aes(x = group, y = ratios, fill = col, ymin = lower, ymax = upper)) +
stat_summary(aes(pattern=treated),
fun = "mean", position=position_dodge(),
geom = "bar_pattern", pattern_fill="black", colour="black") +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2, position=position_dodge(0.9)) +
scale_pattern_manual(values=c("none", "stripe"))+ #edited part
facet_grid(.~task,
scales = "free_x", # Let the x axis vary across facets.
space = "free_x", # Let the width of facets vary and force all bars to have the same width.
switch = "x") + guides(colour = guide_legend(nrow = 1)) +
guides(fill = "none")
p
Here is an option:
df %>%
ggplot(aes(x = models, y = ratios)) +
geom_col_pattern(
aes(fill = col, pattern = treated),
pattern_fill = "black",
colour = "black",
pattern_key_scale_factor = 0.2,
position = position_dodge()) +
geom_errorbar(
aes(ymin = lower, ymax = upper, group = interaction(task, treated)),
width = 0.2,
position = position_dodge(0.9)) +
facet_grid(~ task, scales = "free_x") +
scale_pattern_manual(values = c("none", "stripe")) +
scale_fill_identity()
A few comments:
I don't understand the point of creating group. IMO this is unnecessary. TBH, I also don't understand the point of models and task: if task = "1" then models = "a"; if task = "2" then models = "b"; so task and models are redundant as they encode the same thing (whether you call it "1"/"2" or "a"/"b").
The reason why you (originally) didn't see a pattern in the legend is because of the scale factor in the legend key. As per ?scale_col_pattern, you can adjust this with the pattern_key_scale_factor parameter. Here, I've chosen pattern_key_scale_factor = 0.2 but you may want to play with different values.
The reason why the error bars didn't align with the dodged bars was because geom_errorbar didn't know that there are different task-treated combinations. We can fix this by explicitly defining a group aesthetic given by the combination of task & treated values. The reason why you don't need this in geom_col_pattern is because you already allow for different treated values through the pattern aesthetic.
You want to use scale_fill_identity() if you already have actual colour values defined in the data.frame.
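The alignment point can be reproduced without ggpattern. The following is a minimal sketch with plain geom_col and made-up numbers (the bars data frame is hypothetical): both layers share the same group aesthetic and the same position_dodge object, so the error bars are dodged to the same x positions as the bars.

```r
library(ggplot2)

# Hypothetical data: two tasks, each with an untreated and a treated bar
bars <- data.frame(
  task    = rep(c("1", "2"), each = 2),
  treated = rep(c("no", "yes"), times = 2),
  ratio   = c(0.30, 0.70, 0.40, 0.60),
  se      = 0.02
)

# One dodge specification shared by both layers
dodge <- position_dodge(width = 0.9)

p <- ggplot(bars, aes(x = task, y = ratio, group = treated)) +
  geom_col(aes(fill = treated), position = dodge, colour = "black") +
  geom_errorbar(aes(ymin = ratio - se, ymax = ratio + se),
                width = 0.2, position = dodge)
p
```

Putting group in the top-level aes() means neither layer has to guess the grouping. The dodge width (0.9) matches the default bar width, which is what keeps each narrower error bar centered on its bar.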
I am plotting a box-plot to see the distribution of the variable. I am also interested in seeing the number of observations in each quartile. Is there any way to add the number of observations in each quartile to the boxplot along with the values of quartiles?
I included some code below which can generate box-plot with the values of quartiles.
df <- datasets::iris
boxplot <- ggplot(df, aes(x = "", y = Sepal.Length)) +
geom_boxplot(width=0.1, position = "dodge", fill = "red") +
stat_boxplot(geom = "errorbar", width = 0.1) +
stat_summary(geom = "label_repel", fun.y = quantile, aes(label = ..y..),
position = position_nudge(x = -0.1), size = 3) +
ggtitle("") +
xlab("") +
ylab('Sepal.Length')
I expect the values of quartiles on the left-hand side of the plot and the number of observations on the right-hand side of the plot if possible.
This would be one possibility. I always prefer to keep my additional data in a separate data frame, because this gives me more control over what is calculated and how.
Counting made with some inspiration from https://stackoverflow.com/a/54451575
quantile_counts=function(x){
df= data.frame(label=table(cut(x, quantile(x))),
label_pos=diff(quantile(x))/2+quantile(x)[1:4])
return(df)
}
df_quantile_counts=quantile_counts(df$Sepal.Length)
boxplot <- ggplot(df, aes(x = "", y = Sepal.Length)) +
geom_boxplot(width=0.1, position = "dodge", fill = "red") +
stat_boxplot(geom = "errorbar", width = 0.1) +
stat_summary(geom = "label", fun.y = quantile, aes(label = ..y..),
position = position_nudge(x = -0.1), size = 3) +
geom_text(data=df_quantile_counts,aes(x="",y=label_pos,label = label.Freq),
position = position_nudge(x = +0.1), size = 3) +
ggtitle("") +
xlab("") +
ylab('Sepal.Length')
HTH, Tobi
@TobiO's answer is correct. However, my data were skewed and some cut points coincided (for example, the first and second cut points were the same), so I needed to take the unique values to count the observations in each quartile. Another point concerns the cut function, which by default excludes the starting point: its intervals are (low bound, high bound]. To include the starting point, I used the cut2 function from the Hmisc package. I also added a label_pos_extension vector to prevent the labels from overlapping for quartiles whose cut points are very close to each other; geom_text_repel did not prevent the overlaps.
library(Hmisc)  # provides cut2()
quantile_counts2 <- function(x){
  # Manual offsets that keep labels of nearly identical cut points apart
  label_pos_extension <- c(0,3,4,0)
  if(length(unique(quantile(x))) < 5){
    # Some cut points coincide: work with the unique quantiles only
    df <- data.frame(label = table(cut2(x, g = 4)),
                     label_pos = c(0, diff(unique(quantile(x))) / 2 + quantile(x)[1:length(unique(quantile(x)))-1]) + label_pos_extension[1:length(unique(quantile(x)))])
  } else {
    df <- data.frame(label = table(cut2(x, g = 4)),
                     label_pos = diff(quantile(x)) / 2 + quantile(x)[1:4] + label_pos_extension)
  }
  return(df)
}
PS. I tried to put my edited function in comment but, it did not work.
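The remark about cut excluding the lower bound can be checked in base R. As an alternative to cut2 (my suggestion, not part of either answer), cut also accepts include.lowest = TRUE, and wrapping the breaks in unique() guards against the error that cut raises when cut points are tied. A small sketch on the iris data:

```r
x <- iris$Sepal.Length

# unique() avoids the "'breaks' are not unique" error with tied quantiles
breaks <- unique(quantile(x))

# Default intervals are (low, high], so the minimum falls outside every bin
counts_open <- table(cut(x, breaks))

# include.lowest = TRUE closes the first interval: [low, high]
counts_closed <- table(cut(x, breaks, include.lowest = TRUE))

sum(counts_open)    # smaller than length(x): the minimum value is dropped
sum(counts_closed)  # equal to length(x)
```

With tied quantiles there are fewer than four bins after unique(), so the label positions have to be computed from the same reduced set of breaks.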
I'm approximating a distribution with gaussian mixtures and was wondering whether there was an easy way to automatically plot the estimated kernel density of the whole (uni-dimensional) dataset as the sum of the component densities in a nice fashion like this using ggplot2:
Given the following example data, my approach in ggplot2 would be to manually plot the subset densities into the scaled overall density like this:
#example data
a<-rnorm(1000,0,1) #component 1
b<-rnorm(1000,5,2) #component 2
d<-c(a,b) #overall data
df<-data.frame(d,id=rep(c(1,2),each=1000)) #add group id
##ggplot2
require(ggplot2)
ggplot(df) +
geom_density(aes(x=d,y=..scaled..)) +
geom_density(data=subset(df,id==1), aes(x=d), lty=2) +
geom_density(data=subset(df,id==2), aes(x=d), lty=4)
Note that this does not work out regarding the scales. It also does not work when you scale all 3 densities or no density at all. So I was not able to replicate above plot.
In addition, I am not able to automatically generate this plot without having to subset manually. I tried using position = "stacked" as parameter in geom_density.
I usually have around 5-6 Components per dataset, so manually subsetting would be possible. However, I would like to have different colors or line-types per component density which are displayed in the legend of ggplot, so doing all subsets manually would increase the workload quite a bit.
Any ideas?
Thanks!
Here is a possible solution: draw the stacked density in one layer without a legend, and draw each component density in a second layer with position = "identity".
ggplot(df) +
stat_density(aes(x = d, linetype = as.factor(id)), position = "stack", geom = "line", show.legend = F, color = "red") +
stat_density(aes(x = d, linetype = as.factor(id)), position = "identity", geom = "line")
Do note that when using more than two groups:
a <- rnorm(1000, 0, 1)
b <- rnorm(1000, 5, 2)
c <- rnorm(1000, 3, 2)
d <- rnorm(1000, -2, 1)
d <- c(a, b, c, d)
df <- data.frame(d, id = as.factor(rep(c(1, 2, 3, 4), each = 1000)))
a separate curve appears for each stack level (this is also a problem in the two-group example, but the linetype mapping in the first layer disguised it; use group instead to check):
ggplot(df) +
stat_density(aes(x = d, group = id), position = "stack", geom = "line", show.legend = F, color = "red") +
stat_density(aes(x = d, linetype = id), position = "identity", geom = "line")
A relatively easy fix to this is to add alpha mapping and manually set it to 0 for unwanted curves:
ggplot(df) +
stat_density(aes(x=d, alpha = id), position = "stack", geom = "line", show.legend = F, color = "red") +
stat_density(aes(x=d, linetype = id), position = "identity", geom = "line")+
scale_alpha_manual(values = c(1,0,0,0))
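If the alpha workaround feels fragile, an alternative is to compute the curves outside ggplot2. This is a sketch under my own assumptions, not part of the answer above: each component density is evaluated with density() on a shared grid, weighted by the component's share of the data, and the weighted curves are added pointwise. The names components, comp_df and total_df are made up for the example.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(
  d  = c(rnorm(1000, 0, 1), rnorm(1000, 5, 2)),
  id = factor(rep(1:2, each = 1000))
)

# Evaluate every component on the same grid so the curves can be added up
grid_range <- range(df$d)
components <- lapply(split(df$d, df$id), function(v) {
  dens <- density(v, from = grid_range[1], to = grid_range[2], n = 512)
  # Weight each curve by the component's share of all observations
  data.frame(x = dens$x, y = dens$y * length(v) / nrow(df))
})

# Long format for the component curves, plus their pointwise sum
comp_df <- do.call(rbind, Map(function(part, id) cbind(part, id = id),
                              components, names(components)))
total_df <- data.frame(
  x = components[[1]]$x,
  y = Reduce(`+`, lapply(components, function(part) part$y))
)

p <- ggplot() +
  geom_line(data = comp_df, aes(x = x, y = y, linetype = id)) +
  geom_line(data = total_df, aes(x = x, y = y), colour = "red")
p
```

Because every component gets its own legend entry through the linetype mapping, this extends to five or six components without manual subsetting. Note that the red curve is the weighted sum of the component estimates, each with its own bandwidth, so it is close to but not identical to a kernel density fitted to the pooled data.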
I want to visualize my data using Boxplot.
I have created a boxplot and a stripchart using the following commands:
adjbox(nkv.murgang$NK, main = "NKV, Murgang - Skewness Adjusted", horizontal = T, axes = F,
staplewex = 1, xlab = "Nutzen Kosten Verhältnis")
stripchart(nkv.murgang$NK, main = "NKV, Murgang - Stripchart", horizontal = T, pch = 1,
method = "jitter", xlab = "Nutzen Kosten Verhältnis")
However, I cannot figure out how to incorporate the corresponding five-number statistics into the graph (Min, 1st Qu., Mean, 3rd Qu., Max). I want them to be displayed next to the whiskers.
What is my y-axis in this case?
Additionally, I also want to highlight the mean and median with different colours. Something like this:
Is it possible to combine these two into one graph?
Thanks for any input. I know this seems very basic, but I am stuck here...
You can combine a boxplot with a point-plot using ggplot2 as follows
require(ggplot2)
ggplot(mtcars, aes(x = as.factor(gear), y = wt)) +
geom_boxplot() +
geom_jitter(aes(col = (cyl == 4)), width = 0.1)
The result would be:
Instead of using adjbox, use ggplot:
There is a trick for the unknown x-axis: x = factor(0).
ggplot(nkv.murgang, aes(x = factor(0), y = NK)) +
geom_boxplot(notch = FALSE, outlier.color = "darkgrey", outlier.shape = 1,
color = "black", fill = "darkorange", varwidth = TRUE) +
ggtitle("NKV Murgang - Einfamilienhaus") +
labs(x = "Murgang", y = "Nutzen / Kosten \n Verhältnis") +
# quantile() returns five values, so five labels are drawn;
# on ggplot2 >= 3.3.0 use fun = quantile instead of fun.y
stat_summary(geom = "text", fun.y = quantile,
aes(label=sprintf("%1.1f", ..y..)),
position=position_nudge(x=0.4), size=3.5)
This question explains.
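To also highlight the mean in a different colour and keep the plot horizontal like the original adjbox call, the labels can be computed in a separate data frame first. This is only a sketch: it uses mtcars$wt as a stand-in for nkv.murgang$NK, and the names five and vals are made up for the example.

```r
library(ggplot2)

vals <- mtcars$wt  # stand-in for nkv.murgang$NK

# Five-number summary: quantile() defaults to min, Q1, median, Q3, max
five <- data.frame(
  stat  = c("Min", "1st Qu.", "Median", "3rd Qu.", "Max"),
  value = as.numeric(quantile(vals))
)
five$label <- sprintf("%s: %.2f", five$stat, five$value)

p <- ggplot(mtcars, aes(x = factor(0), y = wt)) +
  geom_boxplot(width = 0.3) +
  # One text label per statistic, nudged away from the box
  geom_text(data = five, aes(y = value, label = label),
            nudge_x = 0.35, size = 3.5) +
  # The mean as a separate, differently coloured point
  geom_point(data = data.frame(value = mean(vals)), aes(y = value),
             colour = "red", size = 3) +
  coord_flip() +
  labs(x = NULL, y = "wt")
p
```

Keep in mind that the whiskers of geom_boxplot stop at 1.5 times the interquartile range, so the Min and Max labels sit on the most extreme points, which may be outliers rather than whisker ends.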