Include outliers in ggplot boxplot - r

I conducted some interviews and want to create box plots from them with ggplot. I managed to create the box plots, but I have not managed to include the outliers in them. I have only a few observations, so I want the outliers to be part of the box plot.
This is the code that I have so far:
data_insurances_boxplot_merged <- ggplot(data_insurances_merged, aes(x = value, y = func, fill = group)) +
stat_boxplot(geom = "errorbar", width = 0.3, position = position_dodge(width = 0.75)) +
geom_boxplot() +
stat_summary(fun.y = mean, geom = "point", shape = 20, size = 3, color = "red",
position = position_dodge2(width = 0.75,
preserve = "single")) +
scale_x_continuous(breaks = seq(1, 7, 1), limits = c(1, 7)) +
scale_fill_manual(values = c("#E6645E", "#EF9C9D")) +
labs(x = "",
y = "", title = "") +
theme_light(base_size = 12) +
theme(legend.title = element_blank())
data_insurances_boxplot_merged
And this is the box plot that is generated:
Does anyone know how to achieve this?
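One possible approach, sketched under the assumption that "include the outliers" means the whiskers should reach the most extreme observations instead of those observations being drawn as separate points: geom_boxplot() accepts a coef argument (whisker length as a multiple of the IQR, 1.5 by default), and coef = Inf leaves no observation outside the whiskers. The toy data frame below is made up, since data_insurances_merged is not shown. The stat_boxplot(geom = "errorbar") layer from the question would presumably need the same coef.

```r
# Toy data (an assumption; the question's data_insurances_merged is not shown).
# Each group contains one value that the default coef = 1.5 would flag as an outlier.
toy <- data.frame(
  func  = rep(c("A", "B"), each = 7),
  value = c(1, 4, 4, 5, 5, 5, 5,
            2, 2, 3, 3, 3, 4, 7)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # coef = Inf stretches the whiskers to the minimum and maximum of each group
  p <- ggplot(toy, aes(x = value, y = func)) +
    geom_boxplot(coef = Inf)

  # layer_data() returns the statistics ggplot2 computed for the first layer;
  # the outliers column should hold empty vectors here
  drawn <- layer_data(p, 1)
  print(drawn$outliers)
}
```

Whether this is the wanted result depends on what "part of the box plot" means. If the outliers should stay visible as points, the default settings already do that.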

Related

How to create an individual line plot in between box plot in r

I'm trying to create a plot like the linked image (Image to create in R ggplot2), where the individual data lines are in between the box plots.
The closest I am getting is something like this (Image using ggplot2), but it looks a bit cluttered with the lines and points behind the boxes.
data1 %>%
ggplot(aes(Time,Trait)) +
geom_line(aes(group=ID), position = "identity")+
geom_point(aes(group=ID), shape=21, colour="black", size=2, position = "identity")+
geom_boxplot(width=.5,position = position_dodge(width=0.9), fill="white") +
stat_summary(fun.data= mean_cl_boot, geom = "errorbar", width = 0.1, position = position_dodge(width = .9)) +
stat_summary(fun = mean, geom = "point", shape = 18, size=3, position = "identity")+
facet_wrap(~Cond) +
theme_classic()
Any tips would be greatly appreciated!
One option to achieve your desired result would be to use a continuous x scale. This makes it possible to shift the box plots outward (left for "Pre", right for "Post") and the points and lines inward:
The code below uses some random data (see DATA) to mimic your real data set.
data1$Time1 <- as.numeric(factor(data1$Time, levels = c("Pre", "Post")))
data1$Time_box <- data1$Time1 + .1 * ifelse(data1$Time == "Pre", -1, 1)
data1$Time_lp <- data1$Time1 + .1 * ifelse(data1$Time == "Pre", 1, -1)
library(ggplot2)
ggplot(data1, aes(x = Time_box, y = Trait)) +
geom_line(aes(x = Time_lp, group=ID), position = "identity")+
geom_point(aes(x = Time_lp, group=ID), shape=21, colour="black", size=2, position = "identity")+
geom_boxplot(aes(x = Time_box, group=Time1), width=.25, fill="white") +
stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.1) +
stat_summary(fun = mean, geom = "point", shape = 18, size=3, position = "identity") +
scale_x_continuous(breaks = c(1, 2), labels = c("Pre", "Post")) +
facet_wrap(~Cond) +
theme_classic()
DATA
set.seed(42)
data1 <- data.frame(
ID = rep(1:10, 4),
Time = rep(c("Pre", "Post"), each = 10),
Trait = runif(40),
Cond = rep(c("MBSR", "SME"), each = 20)
)
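To see what the three helper columns contain, here is a small base-R sketch. The vector time is a made-up stand-in for data1$Time; the other names mirror the columns used in the answer.

```r
# Toy vector standing in for data1$Time (an assumption for illustration)
time <- c("Pre", "Post", "Pre", "Post")

# Fixing the level order makes "Pre" map to 1 and "Post" to 2
time1 <- as.numeric(factor(time, levels = c("Pre", "Post")))

# Boxes move outward by 0.1, points and lines move inward by 0.1
time_box <- time1 + 0.1 * ifelse(time == "Pre", -1, 1)
time_lp  <- time1 + 0.1 * ifelse(time == "Pre", 1, -1)

print(data.frame(time, time1, time_box, time_lp))
```

Because the x positions are plain numbers, scale_x_continuous(breaks = c(1, 2), labels = c("Pre", "Post")) is what puts the category names back on the axis.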
EDIT: If you want two boxplots side by side, the approach is basically the same. In that case, however, you have to map the interaction of Time1 and the variable mapped to fill onto the group aesthetic in geom_boxplot (and probably for the error bars as well):
library(ggplot2)
set.seed(42)
data1 <- data.frame(
ID = rep(1:10, 4),
Time = rep(c("Pre", "Post"), each = 10),
Fill = rep(c("Fill1", "Fill2"), each = 5),
Trait = runif(40),
Cond = rep(c("MBSR", "SME"), each = 20)
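)
# Recompute the helper columns, since data1 was recreated above
data1$Time1 <- as.numeric(factor(data1$Time, levels = c("Pre", "Post")))
data1$Time_box <- data1$Time1 + .1 * ifelse(data1$Time == "Pre", -1, 1)
data1$Time_lp <- data1$Time1 + .1 * ifelse(data1$Time == "Pre", 1, -1)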
ggplot(data1, aes(x = Time_box, y = Trait)) +
geom_line(aes(x = Time_lp, group=ID, color = Fill), position = "identity")+
geom_point(aes(x = Time_lp, group=ID, fill = Fill), shape=21, colour="black", size=2, position = "identity")+
geom_boxplot(aes(x = Time_box, group=interaction(Time1, Fill) , fill = Fill), width=.25) +
stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.1) +
stat_summary(fun = mean, geom = "point", shape = 18, size=3, position = "identity") +
scale_x_continuous(breaks = c(1, 2), labels = c("Pre", "Post")) +
facet_wrap(~Cond) +
theme_classic()

log transform X axis R

I have the following raw data that I plotted in R:
And I would like to edit this plot to look like this version below which was made by log-transforming the X axis using Excel
However, when I run my code below using scale_x_log10(), the output is not the desired plot I was hoping to make. See image below:
Can anyone identify where I have gone wrong?
ggplot(data = data, aes(x = x, y = y, group = group, color = group)) +
stat_summary(fun = "mean", geom = "line", size = 1.2, aes(group = group, linetype = group, color = group)) +
stat_summary(fun = "mean", geom = "point", size = 3, aes(color = group)) +
theme_apa() +
scale_linetype_manual(values = c("solid", "dashed")) +
scale_color_manual(values = c("mediumturquoise", "red")) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
scale_x_log10(limits = c(.01, 40), breaks = c(.01, .1, 1, 10))
It looks like your first data point is at zero, which can't be displayed on a log scale. You'll need to work out whether your data in Excel are different. Failing that, you could achieve a similar result by raising the lowest value of x to 0.01 with pmax():
ggplot(data = data, aes(x = pmax(x,0.01), y = y, group = group, color = group)) +
stat_summary(fun = "mean", geom = "line", size = 1.2, aes(group = group, linetype = group, color = group)) +
stat_summary(fun = "mean", geom = "point", size = 3, aes(color = group)) +
theme_apa() +
scale_linetype_manual(values = c("solid", "dashed")) +
scale_color_manual(values = c("mediumturquoise", "red")) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
scale_x_log10(limits = c(.01, 40), breaks = c(.01, .1, 1, 10))
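Why the zero matters can be checked in base R. The vector x below is made up for illustration; only the pmax(x, 0.01) call comes from the answer.

```r
# Hypothetical x values; the first one mimics the data point at zero
x <- c(0, 0.1, 1, 10)

# log10(0) is -Inf, so that point has no finite position on a log axis
print(is.finite(log10(x)))

# pmax() replaces every value below 0.01 with 0.01 and leaves the rest alone
x_clamped <- pmax(x, 0.01)
print(x_clamped)
print(is.finite(log10(x_clamped)))
```

Note that this changes where the zero observation is drawn (at 0.01), so it is a display workaround rather than a transformation of the data.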

custom color for each group + category combination raincloud plot

I have a raincloud plot:
but I would like each combination of TL group and yr to be a different color, as one can do in base boxplot():
I have tried using the following code for the raincloud plot:
Y_C_rain= ggplot(yct_rain, aes(y=d13C, x=lengthcat,fill = yr,color=yr)) +
geom_flat_violin(position = position_nudge(x = .2, y =0), alpha = .8)+
geom_point(aes(y = d13C, color = yr),
position = position_jitter(width = .05), size = 2, alpha = .5) +
geom_boxplot(width = .3, guides = FALSE, outlier.shape = NA, alpha = 0, notch = FALSE) +
stat_summary(fun= mean, geom = "point", shape = 21, size = 3, fill = "black") +
scale_y_continuous (limits = c(-35,-10),expand = c(0,0),breaks=seq(-35,-10,5)) +
ylab("d13C") + xlab("TL group") +
ggtitle("YCT d13C") +
theme_bw() +
scale_colour_discrete(my_clrs_yct)+
scale_fill_discrete(my_clrs_yct)
Y_C_rain
I know that the colors in the rain plot will need to be coded with some variant of scale_fill_xxx but I am hitting a road block since it appears that each point also needs to have its own color. Therefore the variations of scale_fill_xxx with only 6 individual colors listed is not working.
Do you want something like this?
library(dplyr)
library(data.table)
library(ggplot2)
# used geom_flat_violin from https://gist.github.com/dgrtwo/eb7750e74997891d7c20
my_clrs_yct <- c("#404040", "#407a8c", "#7a7a7a", "#404f86", "#a6a6a6", "#3e1451")
## used storms from dplyr as reproducible example
data("storms")
setDT(storms)
storms[, season:= factor(ifelse(month <=6, "Q12", "Q34"))]
ggplot(storms, aes(x=status, y=pressure, color=interaction(status, season),
fill=interaction(status, season))) +
geom_point(aes(color = interaction(status, season)),
position = position_jitterdodge(
jitter.width=.1, dodge.width=.25), size = 2, alpha = .5)+
geom_flat_violin(position = position_nudge(x = .5, y =0), alpha = .5)+
geom_boxplot(width = .3, guides = FALSE, outlier.shape = NA, alpha = 0)+
stat_summary(fun = mean, geom = "point", shape = 21, size = 3,
fill = "black", position = position_nudge(x = c(-.075,.075), y =0)) +
theme_bw() +
scale_colour_manual(values=my_clrs_yct) +
scale_fill_manual(values=my_clrs_yct)
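The part of this answer that yields one colour per group and category combination is interaction(). Here is a base-R sketch of what it returns, with made-up lengthcat and yr values standing in for the columns of yct_rain (which is not shown). Naming the colour vector is optional: scale_fill_manual(values = ...) matches a named vector to the levels by name, which avoids relying on level order.

```r
# Made-up stand-ins for yct_rain$lengthcat and yct_rain$yr (assumptions)
lengthcat <- factor(c("small", "medium", "large", "small", "medium", "large"),
                    levels = c("small", "medium", "large"))
yr <- factor(c("2018", "2018", "2018", "2019", "2019", "2019"))

# interaction() builds one factor level per lengthcat/yr combination
combo <- interaction(lengthcat, yr)
print(levels(combo))

# The six colours from the answer, named by combination
my_clrs_yct <- c("#404040", "#407a8c", "#7a7a7a", "#404f86", "#a6a6a6", "#3e1451")
clr_map <- setNames(my_clrs_yct, levels(combo))
print(clr_map)

# In the plot this would be used as, for example:
#   aes(fill = interaction(lengthcat, yr)) + scale_fill_manual(values = clr_map)
```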

R graph: label by group

The data I am working on is clustered, with multiple observations within each group. I generated a caterpillar plot and want a label for each group (zipid), not for every line. My current graph and code look like this:
text = hosp_new[,c("zipid")]
ggplot(hosp_new, aes(x = id, y = oe, colour = zipid, shape = group)) +
# theme(panel.grid.major = element_blank()) +
geom_point(size=1) +
scale_shape_manual(values = c(1, 2, 4)) +
geom_errorbar(aes(ymin = low_ci, ymax = high_ci)) +
geom_smooth(method = lm, se = FALSE) +
scale_linetype_manual(values = linetype) +
geom_segment(aes(x = start_id, xend = end_id, y = region_oe, yend = region_oe, linetype = "4", size = 1.2)) +
geom_ribbon(aes(ymin = region_low_ci, ymax = region_high_ci), alpha=0.2, linetype = "blank") +
geom_hline(aes(yintercept = 1, alpha = 0.2, colour = "red", size = 1), show.legend = "FALSE") +
scale_size_identity() +
scale_x_continuous(name = "hospital id", breaks = seq(0,210, by = 10)) +
scale_y_continuous(name = "O:E ratio", breaks = seq(0,7, by = 1)) +
geom_text(aes(label = text), position = position_stack(vjust = 10.0), size = 2)
Caterpillar plot:
Each color represents a region. I want just one label per region, but I don't know how to remove the duplicated labels in this graph.
Any ideas?
The key is to have geom_text return only one value for each zipid, rather than multiple values. If we want each zipid label located in the middle of its group, then we can use the average value of id as the x-coordinate for each label. In the code below, we use stat_summaryh (from the ggstance package) to calculate that average id value for the x-coordinate of the label and return a single label for each zipid.
library(ggplot2)
theme_set(theme_bw())
library(ggstance)
# Fake data
set.seed(300)
dat = data.frame(id=1:100, y=cumsum(rnorm(100)),
zipid=rep(LETTERS[1:10], c(10, 5, 20, 8, 7, 12, 7, 10, 13,8)))
ggplot(dat, aes(id, y, colour=zipid)) +
geom_segment(aes(xend=id, yend=0)) +
stat_summaryh(fun.x=mean, aes(label=zipid, y=1.02*max(y)), geom="text") +
guides(colour = "none")
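If ggstance is not available, the same one-label-per-group idea can be sketched with a small summary data frame passed to geom_text(). This is an alternative to the stat_summaryh() call, not what the answer used; label_df is a name introduced here for illustration.

```r
# Same fake data as in the answer
set.seed(300)
dat <- data.frame(id = 1:100, y = cumsum(rnorm(100)),
                  zipid = rep(LETTERS[1:10], c(10, 5, 20, 8, 7, 12, 7, 10, 13, 8)))

# One row per zipid: the mean id becomes the x position of the label
label_df <- aggregate(id ~ zipid, data = dat, FUN = mean)
# All labels share one height just above the tallest segment
label_df$y <- 1.02 * max(dat$y)
print(label_df)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dat, aes(id, y, colour = zipid)) +
    geom_segment(aes(xend = id, yend = 0)) +
    # geom_text() sees only the 10 summary rows, so it draws 10 labels
    geom_text(data = label_df, aes(label = zipid)) +
    guides(colour = "none")
  print(p)
}
```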
You could also use faceting, as mentioned by @user20650. In the code below, panel.spacing.x=unit(0,'pt') removes the space between facet panels, while expand=c(0,0.5) adds 0.5 units of padding on the sides of each panel. Together, these ensure constant spacing between tick marks, even across facets.
ggplot(dat, aes(id, y, colour=zipid)) +
geom_segment(aes(xend=id, yend=0)) +
facet_grid(. ~ zipid, scales="free_x", space="free_x") +
guides(colour = "none") +
theme_classic() +
scale_x_continuous(breaks=0:nrow(dat),
labels=c(rbind(seq(0,100,5),'','','',''))[1:(nrow(dat)+1)],
expand=c(0,0.5)) +
theme(panel.spacing.x = unit(0,"pt"))

ggplot2: add line and points showing means (stat_summary)

So I am using this data frame:
xym <- data.frame(
Var1 = c("vloga", "odločitve", "dolgoročno",
"krizno", "uživa v", "vloga", "odločitve",
"dolgoročno", "krizno", "uživa v", "vloga",
"odločitve","dolgoročno", "krizno", "uživa v",
"vloga","odločitve", "dolgoročno", "krizno",
"uživa v"),
Var2 = c("Nad","Nad", "Nad", "Nad", "Nad", "Pod",
"Pod", "Pod", "Pod", "Pod", "Enak","Enak",
"Enak", "Enak", "Enak", "Sam.", "Sam.", "Sam.",
"Sam.", "Sam."),
value = c(4, 3, 4, 4, 3, 3, 3, 2, 3, 3, 3, 2.5, 2.5,
2, 3.5 ,5 ,6 ,6 ,5 ,6))
And with this code:
p <- ggplot(xym, aes(x = Var1, y = value, fill = Var2)) + coord_flip()+
theme_bw() + scale_fill_manual(values = c("yellow", "deepskyblue1", "yellowgreen","orchid4")) + xlim(rev(levels(xym$Var1)))+ theme(axis.title=element_blank(),axis.ticks.y=element_blank(),legend.position = "bottom",
axis.text.x = element_text(angle = 0,vjust = 0.4)) +
geom_bar(stat = "identity", width = 0.7, position = position_dodge(width=0.7)) +
geom_text(aes(x = Var1, y =max(value), label = round(value, 2), fill = Var2),
angle = 0, position = position_dodge(width = 0.7), size = 4.2)
p + labs(fill="")
p + stat_summary(fun.y=mean, colour="red", geom="line", aes(group = 1))
I produce output:
But besides the red line, which marks the overall average for each question (i.e. "dolgoročno", "krizno" etc.), I would like to add points next to the bars, as well as labels showing each question's mean.
My output should look something like the picture below (I made it in Paint), where the black dots represent my desired points, and the value 3.6 at the first dot is the average of (6, 2, 4, 2.5) and represents my desired value labels.
I've also looked at:
Plot average line in a facet_wrap
ggplot2: line connecting the means of grouped data
How to label graph with the mean of the values using ggplot2
One option would be the following. I followed your code and added a few lines.
# Your code
p <- ggplot(xym, aes(x = Var1, y = value, fill = Var2)) +
coord_flip() +
theme_bw() +
scale_fill_manual(values = c("yellow", "deepskyblue1", "yellowgreen","orchid4")) +
xlim(rev(levels(xym$Var1))) +
theme(axis.title = element_blank(),
axis.ticks.y = element_blank(),
legend.position = "bottom",
axis.text.x = element_text(angle = 0,vjust = 0.4)) +
geom_bar(stat = "identity", width = 0.7, position = position_dodge(width = 0.7)) +
geom_text(aes(x = Var1, y = max(value), label = round(value, 2), fill = Var2),
angle = 0, position = position_dodge(width = 0.7), size = 4.2)
p + labs(fill = "")
Then I added the following code. You can add dots by changing geom to point in stat_summary. For the labels, I chose to get the data from ggplot_build() and created a data frame called foo. (I think there are other ways to do the same job.) Using foo, I added the annotation at the end.
p2 <- p +
stat_summary(fun = mean, color = "red", geom = "line", aes(group = 1)) +
stat_summary(fun = mean, color = "black", geom ="point", aes(group = 1), size = 5,
show.legend = FALSE)
# This is the data for your dots in the graph
foo <- as.data.frame(ggplot_build(p2)$data[[4]])
p2 +
annotate("text", x = foo$x, y = foo$y + 0.5, color = "black", label = foo$y)
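One of the "other ways" could be to compute the per-question means up front and hand them to geom_point() and geom_text() directly. This is a sketch, not the answer's method; the name means is introduced here, and the plot is simplified compared with the original.

```r
# Data from the question, written compactly with rep()
xym <- data.frame(
  Var1  = rep(c("vloga", "odločitve", "dolgoročno", "krizno", "uživa v"), times = 4),
  Var2  = rep(c("Nad", "Pod", "Enak", "Sam."), each = 5),
  value = c(4, 3, 4, 4, 3, 3, 3, 2, 3, 3, 3, 2.5, 2.5, 2, 3.5, 5, 6, 6, 5, 6)
)

# One mean per question (Var1), averaged over the four Var2 groups
means <- aggregate(value ~ Var1, data = xym, FUN = mean)
print(means)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(xym, aes(x = Var1, y = value, fill = Var2)) +
    geom_col(width = 0.7, position = position_dodge(width = 0.7)) +
    # inherit.aes = FALSE because means has no Var2 column to map to fill
    geom_point(data = means, aes(x = Var1, y = value),
               inherit.aes = FALSE, size = 5) +
    geom_text(data = means, aes(x = Var1, y = value, label = round(value, 2)),
              inherit.aes = FALSE, nudge_y = 0.5) +
    coord_flip()
  print(p)
}
```

This keeps the label values independent of the layer order, whereas ggplot_build(p2)$data[[4]] depends on the point layer being the fourth layer.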
