How do I jitter just the outliers in a ggplot boxplot? - r

I am using geom_jitter() for a boxplot with ggplot. I noticed it adds a point for every record on top of the boxplot, instead of jittering just the points that represent outliers.
This is demonstrated by this code.
data <- as.data.frame(c(rnorm(10000, mean = 10, sd = 20), rnorm(300, mean = 90, sd = 5)))
names(data) <- "blapatybloo"
data %>% ggplot(aes("column", blapatybloo)) + geom_boxplot() + geom_jitter(alpha=.1)
How do I apply geom_jitter to only the points on the boxplot without overlapping the rest of the records?

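Create a new column that flags whether each data point is an outlier, using the same rule geom_boxplot() uses: a point is an outlier if it lies more than 1.5 * IQR below the first quartile or above the third quartile (not 1.5 * IQR from the median).
Then hide the boxplot's own outlier points and overlay only the flagged points, jittered horizontally.
library(dplyr)
library(ggplot2)
data <- as.data.frame(c(rnorm(10000, mean = 10, sd = 20),
                        rnorm(300, mean = 90, sd = 5)))
names(data) <- "blapatybloo"
# Flag points beyond the whisker limits (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
data <- data %>%
  mutate(outlier = blapatybloo > quantile(blapatybloo, 0.75) + IQR(blapatybloo) * 1.5 |
                   blapatybloo < quantile(blapatybloo, 0.25) - IQR(blapatybloo) * 1.5)
data %>%
  ggplot(aes("column", blapatybloo)) +
  geom_boxplot(outlier.shape = NA) +  # suppress the default outlier points
  geom_point(data = function(x) dplyr::filter(x, outlier),
             position = position_jitter(width = 0.2, height = 0))  # jitter x only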
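If the boxplot has more than one box, the outlier flag has to be computed within each group rather than over the whole column. The following is a minimal sketch of that idea, not part of the original answer: the helper name is_outlier, the coef argument, and the toy data frame d are all assumptions made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: TRUE for values more than coef * IQR outside the quartiles.
is_outlier <- function(x, coef = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- coef * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

set.seed(1)
# Made-up data with two groups, only for illustration.
d <- data.frame(
  grp = rep(c("a", "b"), each = 500),
  val = c(rnorm(500, 10, 2), rnorm(500, 20, 5))
)

# Flag outliers separately within each group.
flagged <- d %>%
  group_by(grp) %>%
  mutate(outlier = is_outlier(val)) %>%
  ungroup()

p <- ggplot(flagged, aes(grp, val)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(data = ~ dplyr::filter(.x, outlier),
             position = position_jitter(width = 0.2, height = 0),
             alpha = 0.5)
```

Printing p draws the plot. Setting height = 0 keeps the jitter horizontal, so the plotted y values are the real data values.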

Related

Elegant ggplot to report summary data and trend at each time point in an RCT

I am analysing an RCT and I wish to report summary statistics (mean with 95%CI) for a number of variables at three time points stratified by treatment allocation. Below is my code so far which only yields this figure.
set.seed(42)
n <- 100
dat1 <- data.frame(id=1:n,
treat = factor(sample(c('Trt','Ctrl'), n, rep=TRUE, prob=c(.5, .5))),
time = factor("T1"),
outcome1=rbinom(n = 100, size = 1, prob = 0.3),
st=runif(n, min=24, max=60),
qt=runif(n, min=.24, max=.60),
zt=runif(n, min=124, max=360)
)
dat2 <- data.frame(id=1:n,
treat = dat1$treat,
time = factor("T2"),
outcome1=dat1$outcome1,
st=runif(n, min=34, max=80),
qt=runif(n, min=.44, max=.90),
zt=runif(n, min=214, max=460)
)
dat3 <- data.frame(id=1:n,
treat = dat1$treat,
time = factor("T3"),
outcome1=dat1$outcome1,
st=runif(n, min=44, max=90),
qt=runif(n, min=.74, max=1.60),
zt=runif(n, min=324, max=1760)
)
dat <- rbind(dat1,dat2, dat3)
ggplot(dat,aes(x=mean(zt), y=time)) + geom_point(aes(colour=treat)) + coord_flip() + geom_line(aes(colour=treat))
I have three questions:
Can a line be added connecting T1 to T2 to T3 to show the trend?
Can the 95% CI for the mean be added to each point without having to calculate a "ymin" and "ymax" for all my response variables?
If I have multiple response variables (in this example "st", "qt" and "zt"), is there a way to produce these all at once as some sort of facet?
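pivot_longer() should do most of what you need. Pivot st, qt, and zt (and any other response variables) into long format; here the variable names go into a column called "response_variables" and their values into "value". You can then facet_wrap() by response_variables. After grouping and colouring by treat, stat_summary() adds the mean points, the line connecting them, and the error bars. mean_cl_normal (which needs the Hmisc package installed) computes a 95% confidence interval for the mean based on the t distribution, so you don't have to calculate ymin and ymax yourself. I opted for scales = "free" in facet_wrap(); otherwise you won't see much going on, because zt dominates with its larger range.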
library(dplyr)
library(ggplot2)
library(Hmisc)
library(tidyr)
dat %>%
pivot_longer(-(1:4), names_to = "response_variables") %>%
ggplot(.,aes(x=value, y=time, group = treat, color = treat)) +
facet_wrap(~response_variables, scales = "free") +
coord_flip() +
stat_summary(fun.data = mean_cl_normal,
geom = "errorbar") +
stat_summary(fun = mean,
geom = "line") +
stat_summary(fun = mean,
geom = "point")
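If you would rather not depend on Hmisc, the interval can be computed by a small function of your own. The sketch below is an assumption-laden illustration: mean_ci is a hypothetical name, and it uses the usual t-based formula (mean plus or minus the t quantile times sd / sqrt(n)), which to my knowledge is what mean_cl_normal computes by default.

```r
# Hypothetical helper: mean and t-based confidence interval, returned in the
# y / ymin / ymax shape that stat_summary(fun.data = ...) expects.
mean_ci <- function(x, conf = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  m <- mean(x)
  half_width <- qt(1 - (1 - conf) / 2, df = n - 1) * sd(x) / sqrt(n)
  data.frame(y = m, ymin = m - half_width, ymax = m + half_width)
}

mean_ci(c(2, 4, 6, 8))
```

It can then replace mean_cl_normal in the plot, as in stat_summary(fun.data = mean_ci, geom = "errorbar").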

Adding trend lines across groups and setting tick labels in a grouped violin plot or box plot

I have xy grouped data that I'm plotting using R's ggplot2 geom_violin adding regression trend lines:
Here are the data:
library(dplyr)
library(plotly)
library(ggplot2)
set.seed(1)
df <- data.frame(value = c(rnorm(500,8,1),rnorm(600,6,1.5),rnorm(400,4,0.5),rnorm(500,2,2),rnorm(400,4,1),rnorm(600,7,0.5),rnorm(500,3,1),rnorm(500,3,1),rnorm(500,3,1)),
age = c(rep("d3",500),rep("d8",600),rep("d24",400),rep("d3",500),rep("d8",400),rep("d24",600),rep("d3",500),rep("d8",500),rep("d24",500)),
group = c(rep("A",1500),rep("B",1500),rep("C",1500))) %>%
dplyr::mutate(time = as.integer(age)) %>%
dplyr::arrange(group,time) %>%
dplyr::mutate(group_age=paste0(group,"_",age))
df$group_age <- factor(df$group_age,levels=unique(df$group_age))
And my current plot:
ggplot(df,aes(x=group_age,y=value,fill=age,color=age,alpha=0.5)) +
geom_violin() + geom_boxplot(width=0.1,aes(fill=age,color=age,middle=mean(value))) +
geom_smooth(data=df,mapping=aes(x=group_age,y=value,group=group),color="black",method='lm',size=1,se=T) + theme_minimal()
My questions are:
How do I get rid of the alpha part of the legend?
I would like the x-axis ticks to be df$group rather than df$group_age, which means a tick per each group at the center of that group where the label is group. Consider a situation where not all groups have all ages - for example, if a certain group has only two of the ages and I'm pretty sure ggplot will only present only these two ages, I'd like the tick to still be centered between their two ages.
One more question:
It would also be nice to have the p-values of each fitted slope plotted on top of each group.
I tried:
library(ggpmisc)
my.formula <- value ~ group_age
ggplot(df,aes(x=group_age,y=value,fill=age,color=age,alpha=0.5)) +
geom_violin() + geom_boxplot(width=0.1,aes(fill=age,color=age,middle=mean(value))) +
geom_smooth(data=df,mapping=aes(x=group_age,y=value,group=group),color="black",method='lm',size=1,se=T) + theme_minimal() +
stat_poly_eq(formula = my.formula,aes(label=stat(p.value.label)),parse=T)
But I get the same plot as above with the following warning message:
Warning message:
Computation failed in `stat_poly_eq()`:
argument "x" is missing, with no default
geom_smooth() fits a line, while stat_poly_eq() issues an error. A factor is a categorical variable with unordered levels, so a trend against a factor is undefined. geom_smooth() may be taking the levels and converting them to "arbitrary" numerical values, but these values are just indexes rather than meaningful values. Note also that the formula passed to stat_poly_eq() has to be written in terms of the aesthetics x and y, not the column names.
To obtain a plot similar to what is described in the question, but using code that provides correct linear regression lines and the corresponding p-values, I would use the code below. The main change is that the numerical variable time is mapped to x, which makes fitting a regression a valid operation. To allow for a linear fit, an x-scale with a log10 transformation is used, with breaks and labels at the ages for which data are available.
library(dplyr)
library(ggplot2)
library(ggpmisc)
set.seed(1)
df <-
data.frame(
value = c(
rnorm(500, 8, 1), rnorm(600, 6, 1.5), rnorm(400, 4, 0.5),
rnorm(500, 2, 2), rnorm(400, 4, 1), rnorm(600, 7, 0.5),
rnorm(500, 3, 1), rnorm(500, 3, 1), rnorm(500, 3, 1)
),
age = c(
rep("d3", 500), rep("d8", 600), rep("d24", 400),
rep("d3", 500), rep("d8", 400), rep("d24", 600),
rep("d3", 500), rep("d8", 500), rep("d24", 500)
),
group = c(rep("A", 1500), rep("B", 1500), rep("C", 1500))
) %>%
mutate(time = as.integer(gsub("d", "", age))) %>%
arrange(group, time) %>%
mutate(age = factor(age, levels = c("d3", "d8", "d24")),
group = factor(group))
my_formula = y ~ x
ggplot(df, aes(x = time, y = value)) +
geom_violin(aes(fill = age, color = age), alpha = 0.3) +
geom_boxplot(width = 0.1,
aes(color = age), fill = NA) +
geom_smooth(color = "black", formula = my_formula, method = 'lm') +
stat_poly_eq(aes(label = stat(p.value.label)),
formula = my_formula, parse = TRUE,
npcx = "center", npcy = "bottom") +
scale_x_log10(name = "Age", breaks = c(3, 8, 24)) +
facet_wrap(~group) +
theme_minimal()
Which creates the following figure:
Here is a solution. The alpha legend issue is easy to fix. Anything you place inside aes() gets a legend entry, so aes() should be used only when you want a variable in the data to drive an aesthetic. Setting alpha outside aes() applies it as a constant and removes it from the legend.
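To make the difference concrete, here is a small sketch using the built-in mtcars data (an assumption chosen for illustration, not the data from the question). The object names p_mapped and p_fixed are made up.

```r
library(ggplot2)

# alpha inside aes(): treated as a mapped variable, so an alpha legend appears.
p_mapped <- ggplot(mtcars, aes(factor(cyl), mpg, fill = factor(cyl), alpha = 0.5)) +
  geom_violin()

# alpha outside aes(): a fixed setting on the layer, so there is no alpha legend.
p_fixed <- ggplot(mtcars, aes(factor(cyl), mpg, fill = factor(cyl))) +
  geom_violin(alpha = 0.5)
```

Printing the two objects side by side shows the extra legend only on the first plot.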
I'm not sure the x-axis labelling is exactly what you wanted, but I set the labels manually, so it should be easy to adjust.
Regarding the p-values, I fitted a separate linear regression for each group and stored each slope's p-value in its own object, which can then be passed to the plot with annotate(). For two of the groups the p-value is below .001, so round() turns it into 0; for those I simply wrote "<.001" by hand.
Good luck with this!
library(dplyr)
library(ggplot2)
set.seed(1)
df <- data.frame(value = c(rnorm(500,8,1),rnorm(600,6,1.5),rnorm(400,4,0.5),rnorm(500,2,2),rnorm(400,4,1),rnorm(600,7,0.5),rnorm(500,3,1),rnorm(500,3,1),rnorm(500,3,1)),
age = c(rep("d3",500),rep("d8",600),rep("d24",400),rep("d3",500),rep("d8",400),rep("d24",600),rep("d3",500),rep("d8",500),rep("d24",500)),
group = c(rep("A",1500),rep("B",1500),rep("C",1500))) %>%
dplyr::mutate(time = as.integer(gsub("d", "", age))) %>% # "d3" -> 3; as.integer("d3") would give NA
dplyr::arrange(group,time) %>%
dplyr::mutate(group_age=paste0(group,"_",age))
df$group_age <- factor(df$group_age,levels=unique(df$group_age))
# Slope p-value from a separate regression per group, rounded to 2 decimals
mod1 <- lm(value ~ time, df[df$group == 'A', ])
mod1 <- summary(mod1)$coefficients["time", "Pr(>|t|)"] %>% round(2)
mod2 <- lm(value ~ time, df[df$group == 'B', ])
mod2 <- summary(mod2)$coefficients["time", "Pr(>|t|)"] %>% round(2)
mod3 <- lm(value ~ time, df[df$group == 'C', ])
mod3 <- summary(mod3)$coefficients["time", "Pr(>|t|)"] %>% round(2)
ggplot(df,aes(x=group_age,y=value,fill=age,color=age)) +
geom_violin(alpha=0.5) +
geom_boxplot(width=0.1,aes(fill=age,color=age,middle=mean(value))) +
geom_smooth(mapping=aes(x=group_age,y=value,group=group),color="black",method='lm',size=1,se=T) +
scale_x_discrete(labels = c('','A','','','B','','','C','')) +
annotate('text',x = 2,y = -1,label = paste('pvalue: <.001')) +
annotate('text',x = 6,y = 10,label = paste('pvalue: <.001')) +
annotate('text',x = 8,y = -1.2,label = paste('pvalue:',mod3))+
theme_minimal()
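The three near-identical regressions can also be written as one loop, and format.pval() avoids the problem of small p-values rounding to 0. This is only a sketch on made-up data: the data frame toy and the names slope_p and p_labels are assumptions, with columns shaped like the example (value, numeric time, group).

```r
set.seed(1)
# Made-up data in the same shape as the example: group, numeric time, value.
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 300),
  time  = rep(rep(c(3, 8, 24), each = 100), times = 3)
)
toy$value <- ifelse(toy$group == "A", 8 - 0.2 * toy$time,
             ifelse(toy$group == "B", 2 + 0.2 * toy$time, 3)) +
  rnorm(nrow(toy))

# Slope p-value from a separate regression per group.
slope_p <- vapply(
  split(toy, toy$group),
  function(d) summary(lm(value ~ time, data = d))$coefficients["time", "Pr(>|t|)"],
  numeric(1)
)

# format.pval() prints values below eps as "<0.001" instead of rounding them to 0.
p_labels <- paste("p-value:", format.pval(slope_p, digits = 2, eps = 0.001))
p_labels
```

Each element of p_labels can then be passed as the label argument of an annotate("text", ...) call.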

Use stat_summary to label median line on boxplot

I have a function wherein I'm trying to use stat_summary() to plot the value of the median just above the median line on a geom_boxplot(). I've reduced my problem and created a toy example to simplify but retain context.
library(ggplot2)
set.seed(20191120)
dat <- data.frame(var = sample(c("a", "b"),
50,
replace = TRUE),
value = rpois(50, 5))
lims <- c(0, 10)
myplot <- function(DATA, YLIMS) {
ggplot(data = DATA,
aes(x = var)) +
geom_boxplot(aes(y = value),
outlier.shape = NA,
coef = 0) +
stat_summary(aes(y = ifelse(value > (YLIMS[2]*0.9), # if median in top10% of plot window
(value - (YLIMS[2]/10)), # put it below bar
(value + (YLIMS[2]/10))), # else put it above
label = round(..y.., 2)), #round(median(value), 2))
fun.y = median,
geom = "text") +
coord_cartesian(ylim = YLIMS)
}
myplot(dat, lims)
My actual plots have several facets, a variety of ranges, and some of the medians are at the top or bottom of the range. As you can see, I've excluded whiskers and outliers. This is where the YLIMS argument comes in to zoom and focus on the boxes and exclude unused plot space. I've used these YLIMS values to also position the label at +/- 10% of the range which works out perfectly.
I tried using the ..y.. value to get the value of the median for the label argument of stat_summary(aes()) but it is instead taking the new value. As you can see from the plot, we'd expect both labels to be "5" but they are instead "6" as that 10% of 10 has been added.
I also tried recalculating the median (as you can see commented out) but that takes a simple median of all the data and doesn't control for groupings/facets/etc.
I know of ways to refactor my code to precompute the label values and positions in the data, or to aggregate first and use identity with the boxplot, but I'm wondering if there is a way to calculate this in-line, as my attempt is close to doing.
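The key to solving this problem is not to adjust the value itself, but to move the label with the position = position_nudge() option. The label text is computed from the summarised median before the position adjustment is applied, so it still shows the true median.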
library(ggplot2)
set.seed(20191120)
dat <- data.frame(var = sample(c("a", "b"), 50, replace = TRUE),
value = rpois(50, 5))
lims <- c(0, 10)
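myplot <- function(DATA, YLIMS) {
  # position_nudge() cannot see the data columns, so compute the per-group
  # medians up front (same sorted order as the x axis; single panel assumed).
  meds <- as.vector(tapply(DATA$value, DATA$var, median))
  nudge <- ifelse(meds > (YLIMS[2] * 0.9), # if median in top 10% of plot window
                  -YLIMS[2] / 10,          # put it below bar
                  YLIMS[2] / 10)           # else put it above
  ggplot(data = DATA, aes(x = var)) +
    geom_boxplot(aes(y = value), outlier.shape = NA, coef = 0) +
    stat_summary(aes(y = value, label = round(after_stat(y), 2)),
                 fun = median, geom = "text",
                 position = position_nudge(x = 0, y = nudge)) +
    coord_cartesian(ylim = YLIMS)
}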
myplot(dat, lims)
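An alternative that also copes with facets is to let the summary function return both the label position and the label text, since stat_summary() passes every column returned by fun.data on to the geom. This is a sketch rather than the answer's method: median_label and its ylims argument are hypothetical names introduced here.

```r
library(ggplot2)

# Hypothetical helper: returns where to draw the label (y) and what it says (label).
median_label <- function(v, ylims) {
  m <- median(v)
  offset <- ylims[2] / 10
  # Place the label below the median if it sits in the top 10% of the window.
  y_pos <- if (m > ylims[2] * 0.9) m - offset else m + offset
  data.frame(y = y_pos, label = round(m, 2))
}

set.seed(20191120)
dat <- data.frame(var = sample(c("a", "b"), 50, replace = TRUE),
                  value = rpois(50, 5))
lims <- c(0, 10)

p <- ggplot(dat, aes(x = var, y = value)) +
  geom_boxplot(outlier.shape = NA, coef = 0) +
  stat_summary(fun.data = median_label, fun.args = list(ylims = lims),
               geom = "text") +
  coord_cartesian(ylim = lims)
```

Because the function is applied once per group within each panel, the label and its offset are computed per box without precomputing anything outside the plot call.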

Advice on how to plot side-by-side histograms with a line graph going through them in ggplot2

I'm currently finishing off my Masters project and need to include some graphics for the write-up. Without boring you too much, I have some data which is associated with AR(1) parameters ranging from 0.1 to 0.9 by 0.1 increments. As such I thought of doing a faceted histogram like the one below (worry not about the hideous fruit salad of colours, it will not be used).
I used this code.
ggplot(opt_lens_geom,aes(x=l_1024,fill=factor(rho))) + geom_histogram()+coord_flip()+facet_grid(.~rho,scales = "free_x")
I would also like to draw a trend line for the median values, since the AR(1) parameter is continuous. In a later iteration I deleted the padding and made it "look" like one graph, but I have had issues with the endpoints matching up, since each facet is a separate graphical device. Can anyone give me some advice on how to do this? I am not particularly attached to the faceting, so if it is not needed I can do away with it.
I will try to upload sample data, but simulating 100 values for each of the 9 rhos would be enough to get started, like:
opt_lens_geom <- data.frame(rho= rep(seq(0.1,0.9,by=0.1),each=100),l_1024=rnorm(900))
You might consider ggridges. I've assumed here that you want a median value for each value of rho.
library(ggplot2)
library(ggridges)
library(dplyr)
set.seed(1001)
opt_lens_geom <- data.frame(rho = rep(seq(0.1, 0.9, by = 0.1), each = 100),
l_1024 = rnorm(900))
opt_lens_geom %>%
mutate(rho_f = factor(rho)) %>%
ggplot(aes(l_1024, rho_f)) +
stat_density_ridges(quantiles = 2, quantile_lines = TRUE)
You can add scale = 1 as a parameter to stat_density_ridges() if you don't like the amount of overlap.
Try the following. It uses a pre-computed data frame of the medians.
library(ggplot2)
df <- iris[c(1, 5)]
names(df) <- c("val", "rho")
med <- plyr::ddply(df, "rho", plyr::summarise, m = median(val)) # plyr is not attached, so qualify summarise too
ggplot(data = df, aes(x = val, fill = factor(rho))) +
geom_histogram() +
coord_flip() +
geom_vline(data = med, aes(xintercept = m), colour = 'black') +
facet_wrap(~ factor(rho))
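If plyr is not installed, the same table of medians can be built with base R. A minimal sketch, assuming the same df derived from iris as above; the renaming step is only there so the column matches the m used in geom_vline().

```r
df <- iris[c(1, 5)]
names(df) <- c("val", "rho")

# One row per level of rho with the median of val.
med <- aggregate(val ~ rho, data = df, FUN = median)
# Rename the summary column to m so aes(xintercept = m) still works.
names(med)[names(med) == "val"] <- "m"
med
```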
You could do a variant on this using geom_violin instead of using histograms, although you wouldn't get labelled counts, just an idea of the relative density. Example with made up data:
df = data.frame(
rho = rep(c(0.1, 0.2, 0.3), each = 50),
val = sample(1:10, 150, replace = TRUE)
)
df$val = df$val + (5 * (df$rho == 0.2)) + (8 * (df$rho == 0.3))
ggplot(df, aes(x = rho, y = val, fill = factor(rho))) +
geom_violin() +
stat_summary(aes(group = 1), colour = "black",
geom = "line", fun = "median") # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
This produces a violin for each value of rho, and joins the medians for each violin.

show 2 standard deviations on a ggplot2 control chart (in addition to the normal 3)

First I create the data:
library(ggplot2)
library(ggQC)
set.seed(5555)
Golden_Egg_df <- data.frame(month=1:12, egg_diameter = rnorm(n = 12, mean = 1.5, sd = 0.2))
Then I setup the base ggplot.
XmR_Plot <- ggplot(Golden_Egg_df, aes(x = month, y = egg_diameter)) +
geom_point() + geom_line()
I can create a simple control chart with the ggQC package, in the following manner.
XmR_Plot + stat_QC(method = "XmR")
I can facet the control chart to show different levels of standard deviation (in this example, between 1-3).
XmR_Plot + stat_qc_violations(method = "XmR")
What I want is to be able to see both 2 and 3 standard deviations on the same chart, not faceted. My imagined syntax would be
XmR_Plot + stat_QC(method = "XmR", stand.dev = c(2, 3))
or something like that, but it obviously does not work. How do I get multiple standard deviations to show on one chart?
I highly recommend calculating your summary statistics yourself. You'll get a lot more control over the plot!
library(ggplot2)
library(dplyr)
library(tidyr)
set.seed(5555)
golden.egg.df = data.frame(month=1:12,
egg_diameter = rnorm(n = 12,
mean = 1.5,
sd = 0.2)
)
lines.df = golden.egg.df %>%
# Calculate all the summary stats
mutate(mean = mean(egg_diameter),
sd = sd(egg_diameter),
plus_one = mean + sd,
plus_two = mean + 2 * sd,
plus_three = mean + 3 * sd,
minus_one = mean - sd,
minus_two = mean - 2 * sd,
minus_three = mean - 3 * sd
) %>%
# Remove what we don't want to plot
select(-month, -egg_diameter, -sd) %>%
# Filter so the dataframe is now one unique row
unique() %>%
# Make the table tall for plotting
gather(key = stat,
value = value) %>%
# Add a new column which indicates how many SDs a line is from
# the mean
mutate(linetype = gsub("[\\s\\S]+?_", "", stat, perl = TRUE))
ggplot(golden.egg.df,
aes(x = month, y = egg_diameter)) +
geom_hline(data = lines.df,
aes(yintercept = value, linetype = linetype)) +
geom_point() +
geom_line()
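One caveat about the answer above: it uses the overall sample standard deviation, whereas XmR (individuals) charts conventionally estimate sigma from the average moving range, MRbar / 1.128, which puts the 3-sigma limits at mean ± 2.66 * MRbar. The limits drawn by stat_QC(method = "XmR") will therefore probably not match the sd()-based lines. Below is a sketch of the moving-range version with both 2- and 3-sigma lines; xmr_limits and the data frame name eggs are hypothetical names introduced for illustration.

```r
library(ggplot2)

set.seed(5555)
eggs <- data.frame(month = 1:12,
                   egg_diameter = rnorm(n = 12, mean = 1.5, sd = 0.2))

# Hypothetical helper: XmR-style limits at k sigma, with sigma estimated as
# (average moving range) / d2, where d2 = 1.128 for moving ranges of size 2.
xmr_limits <- function(x, k = c(2, 3)) {
  centre <- mean(x)
  sigma_hat <- mean(abs(diff(x))) / 1.128
  data.frame(
    k     = rep(k, times = 2),
    side  = rep(c("upper", "lower"), each = length(k)),
    value = c(centre + k * sigma_hat, centre - k * sigma_hat)
  )
}

limits <- xmr_limits(eggs$egg_diameter)

p <- ggplot(eggs, aes(month, egg_diameter)) +
  geom_hline(yintercept = mean(eggs$egg_diameter)) +
  geom_hline(data = limits,
             aes(yintercept = value, linetype = factor(k))) +
  geom_point() +
  geom_line() +
  labs(linetype = "sigma")
```

Mapping linetype to factor(k) gives one legend entry per sigma level, so the 2- and 3-sigma lines are distinguishable on a single, unfaceted chart.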
