How to mimic geom_boxplot() with outliers using geom_boxplot(stat = "identity")

I would like to pre-compute by-variable summaries of data (with plyr and passing a quantile function) and then plot with geom_boxplot(stat = "identity"). This works great except it (a) does not plot outliers as points and (b) extends the "whiskers" to the max and min of the data being plotted.
Example:
library(plyr)
library(ggplot2)
set.seed(4)
df <- data.frame(fact = sample(letters[1:2], 12, replace = TRUE),
                 val = c(1:10, 100, 101))
df
#    fact val
# 1     b   1
# 2     a   2
# 3     a   3
# 4     a   4
# 5     b   5
# 6     a   6
# 7     b   7
# 8     b   8
# 9     b   9
# 10    a  10
# 11    b 100
# 12    a 101
by.fact.df <- ddply(df, c("fact"), function(x) quantile(x$val))
by.fact.df
#   fact 0%  25% 50%  75% 100%
# 1    a  2 3.25 5.0 9.00  101
# 2    b  1 5.50 7.5 8.75  100
# What I can do...with faults (a) and (b) above
ggplot(by.fact.df,
       aes(x = fact, ymin = `0%`, lower = `25%`, middle = `50%`,
           upper = `75%`, ymax = `100%`)) +
  geom_boxplot(stat = "identity")
# What I want...
ggplot(df, aes(x = fact, y = val)) +
  geom_boxplot()
Initial Thoughts: Perhaps there is some way to pre-compute the true end-points of the whiskers without the outliers? Then, subset the data for outliers and pass them as geom_point()?
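That idea works; here is a minimal sketch of it (reusing df and the already-loaded plyr and ggplot2; the column names ymin/lower/middle/upper/ymax are just illustrative choices). The whisker end-points come from boxplot.stats(), and the outliers go into a separate data frame for geom_point():
# whisker end-points (outliers excluded) per group, via boxplot.stats()
whisk <- ddply(df, "fact", function(x)
  setNames(data.frame(rbind(boxplot.stats(x$val)$stats)),
           c("ymin", "lower", "middle", "upper", "ymax")))
# outliers only, for a separate point layer
outs <- ddply(df, "fact", function(x) data.frame(val = boxplot.stats(x$val)$out))
ggplot(whisk, aes(x = fact, ymin = ymin, lower = lower, middle = middle,
                  upper = upper, ymax = ymax)) +
  geom_boxplot(stat = "identity") +
  geom_point(data = outs, aes(x = fact, y = val), inherit.aes = FALSE)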
Motivation: When working with larger datasets, I have found it faster and more practical to leverage plyr, dplyr, and/or data.table to pre-compute the stats and then plot them, rather than having ggplot2 do the calculations.
UPDATE
I am able to extract what I need with the following mix of dplyr and plyr code, but I'm not sure if this is the most efficient way:
df %>%
  group_by(fact) %>%
  do(ldply(boxplot.stats(.$val), data.frame))
Source: local data frame [6 x 3]
Groups: fact
  fact   .id X..i..
1    a stats      2
2    a stats      4
3    a stats     10
4    a stats     13
5    a stats     16
6    a     n      9
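For what it's worth, a plyr-free sketch of the same extraction (this assumes a more recent dplyr plus tidyr, whose unnest_wider() spreads the five statistics into one column each):
library(dplyr)
library(tidyr)
df %>%
  group_by(fact) %>%
  summarise(stats = list(boxplot.stats(val)$stats)) %>%
  unnest_wider(stats, names_sep = "_")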

Here's my answer, using the built-in functions quantile and boxplot.stats.
geom_boxplot does the calculations for the boxplot slightly differently than boxplot.stats; read ?geom_boxplot and ?boxplot.stats to understand the implementation below.
# Function to calculate boxplot stats to match ggplot2's implementation in geom_boxplot
my_boxplot.stats <- function(x) {
  quantiles <- quantile(x, c(0, 0.25, 0.5, 0.75, 1))
  labels <- names(quantile(x))
  # replace the upper whisker with the value geom_boxplot would draw
  # (the lower whisker, boxplot.stats(x)$stats[1], could be replaced the same way)
  quantiles[5] <- boxplot.stats(x)$stats[5]
  res <- data.frame(rbind(quantiles))
  names(res) <- labels
  # note: this assumes every group has at least one outlier; assigning a
  # zero-length vector to res$out would fail for a group with none
  res$out <- boxplot.stats(x)$out
  return(res)
}
Code to calculate the stats and plot it:
library(dplyr)
df %>%
  group_by(fact) %>%
  do(my_boxplot.stats(.$val)) %>%
  ggplot(aes(x = fact, y = out, ymin = `0%`, lower = `25%`, middle = `50%`,
             upper = `75%`, ymax = `100%`)) +
  geom_boxplot(stat = "identity") +
  geom_point()

To get the correct statistics, you have to do a bit more calculation than just finding the quantiles. geom_boxplot with stat = "identity" does not draw the outliers, so you have to calculate the statistics without the outliers and then draw the outliers separately with geom_point. The following function (basically a simplified version of stat_boxplot) is probably not the most efficient, but it gives the desired result:
box.df <- df %>% group_by(fact) %>% do({
  # five-number summary, then adjust the whisker ends
  stats <- as.numeric(quantile(.$val, c(0, 0.25, 0.5, 0.75, 1)))
  iqr <- diff(stats[c(2, 4)])
  coef <- 1.5
  # flag values beyond 1.5 * IQR from the hinges
  outliers <- .$val < (stats[2] - coef * iqr) | .$val > (stats[4] + coef * iqr)
  if (any(outliers)) {
    # whiskers extend to the most extreme non-outlier values
    stats[c(1, 5)] <- range(c(stats[2:4], .$val[!outliers]), na.rm = TRUE)
  }
  outlier_values <- .$val[outliers]
  if (length(outlier_values) == 0) outlier_values <- NA_real_
  res <- as.list(t(stats))
  names(res) <- c("lower.whisker", "lower.hinge", "median", "upper.hinge", "upper.whisker")
  res$out <- outlier_values
  as.data.frame(res)
})
box.df
## Source: local data frame [2 x 7]
## Groups: fact
##
##   fact lower.whisker lower.hinge median upper.hinge upper.whisker out
## 1    a             2        3.25    5.0        9.00            10 101
## 2    b             1        5.50    7.5        8.75             9 100
ggplot(box.df, aes(x = fact, y = out, middle = median,
                   ymin = lower.whisker, ymax = upper.whisker,
                   lower = lower.hinge, upper = upper.hinge)) +
  geom_boxplot(stat = "identity") +
  geom_point()

Related

Calculating odds ratios between deciles of data in R

Similar to: How to calculate and plot odds-ratios and their standard errors from a logistic regression in R?
But I would like to plot the Phenotypes separately in the plot.
Data (subset of 20,000 similar lines):
ID PHENO SCORE
1 1 -0.001
2 1 0.132
3 1 0.023
4 0 -0.20032
5 1 -0.002
6 0 0.012
7 1 -0.23
8 0 0.321
9 0 -0.21
10 0 -0.497
I have then run a logistic glm model on these data. I would like to put the scores into deciles (or some other meaningful division) and then work out the odds ratio of having the phenotype (1 = having the disease, 0 = controls) per division of score, ideally between cases and controls, using R.
To decile I do:
library(dplyr)
# place each value into a decile (note: ntile() takes the score vector, not the whole data frame)
data$decile <- ntile(data$SCORE, 10)
I then follow the question above, but want the plot to show the cases and controls separately. I would like to end up with an image like the one below, with cases (1) vs controls (0) from the PHENO column.
Any help would be appreciated.
First of all, I generated some random data to make this reproducible. Next, you could make your target and deciles factors. To extract the odds ratios and confidence intervals, you could use coef and confint wrapped in exp. After that you can take the mean per ID and PHENO of your results. To create the graph you can use geom_pointrange like this:
# Generate random data
set.seed(7)
data <- data.frame(ID = rep(c(1:10), 2000),
                   PHENO = sample(c(0, 1), replace = TRUE, size = 20000),
                   SCORE = rnorm(20000, 0, 1))
library(dplyr)
library(ggplot2)
#place each value into a decile
data <- data %>% mutate(decile = ntile(SCORE, 10))
# convert PHENO and decile to factor
data$PHENO <- as.factor(data$PHENO)
data$decile <- as.factor(data$decile)
# model
fit <- glm(PHENO ~ decile, data=data, family='binomial')
# Extract odds ratio with intervals
results <- as.data.frame(exp(cbind(coef(fit), confint(fit))))
#> Waiting for profiling to be done...
# Change columnames results dataframe
colnames(results) <- c('odds_ratio', '2.5', '97.5')
# Add id column
results$ID <- c(1:10)
# Join data and results dataframe
data <- left_join(data, results, by = 'ID')
# Take mean
data_sum <- data %>%
  group_by(decile, PHENO) %>%
  summarise(odds_ratio = mean(odds_ratio),
            `2.5` = mean(`2.5`),
            `97.5` = mean(`97.5`))
#> `summarise()` has grouped output by 'decile'. You can override using the
#> `.groups` argument.
# plot
ggplot(data_sum, aes(x = decile, y = odds_ratio, ymin = `2.5`, ymax = `97.5`,
                     color = PHENO, shape = PHENO)) +
  geom_pointrange(position = position_dodge(width = 0.4)) +
  scale_color_manual(values = c('blue', 'green')) +
  scale_shape_manual(values = c(18, 16)) +
  guides(shape = 'none') +
  theme_classic() +
  labs(x = 'Decile', y = 'Odds ratio', color = '')
Created on 2022-10-29 with reprex v2.0.2
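As a side note, a sketch of the same extraction with the broom package (assuming broom is installed), which avoids building the results data frame by hand:
library(broom)
# one row per coefficient: exponentiated estimates with profile confidence intervals
tidy(fit, exponentiate = TRUE, conf.int = TRUE)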

Explain the code underlying a linear model in R visualised with ggplot

I am trying to understand how linear modelling can be used as an alternative to the t-test when analysing gene expression data. For a single gene, I have a data frame of 20 gene expression values in total: group 1 (n = 10) and group 2 (n = 10).
gexp = data.frame(expression = c(2.7, 0.4, 1.8, 0.8, 1.9, 5.4, 5.7, 2.8, 2.0, 4.0,
                                 3.9, 2.8, 3.1, 2.1, 1.9, 6.4, 7.5, 3.6, 6.6, 5.4),
                  group = c(rep(1, 10), rep(2, 10)))
The data can be (box)plotted using ggplot as shown below:
library(dplyr)   # provides the pipe %>%
library(ggplot2)

plot <- gexp %>%
  ggplot(aes(x = group, y = expression)) +
  geom_boxplot() +
  geom_point()
plot
I wish to model the expression in groups 1 and 2 using the regression formula:
Y = Beta0 + Beta1 * X + e
where Y is the expression I want to model and X represents the two groups, encoded as 0 and 1 respectively. Therefore, the expression in group 1 (when X = 0) is equal to Beta0, and the expression in group 2 (when X = 1) is equal to Beta0 + Beta1.
If this is modelled with:
mod1 <- lm(expression ~ group, data = gexp)
mod1
The above code outputs an intercept of 2.75 and a slope of 1.58. It is the visualisation of the linear model that I don't understand. I would be grateful for a clear explanation of the code below:
plot +
  geom_point(data = data.frame(x = c(1, 2), y = c(2.75, 4.33)),
             aes(x = x, y = y),
             colour = "red", size = 5) +
  geom_abline(intercept = coefficients(mod1)[1] - coefficients(mod1)[2],
              slope = coefficients(mod1)[2])
I get why the data.frame values are the ones chosen (the value of 4.33 is the sum of the intercept, Beta0, and the slope, Beta1), but it is the geom_abline arguments I do not understand. Why is the intercept calculated as shown? The text I am using states, '...we need to subtract the slope from the intercept when plotting the linear model because groups 1 and 2 are encoded as 0 and 1 in the model, but plotted as 1 and 2 on the figure.' I don't follow this point and would be grateful for an explanation, without getting too technical.
I believe your code is correct if the group variable is encoded as a factor.
library(ggplot2)
gexp = data.frame(expression = c(2.7, 0.4, 1.8, 0.8, 1.9, 5.4, 5.7, 2.8, 2.0, 4.0,
                                 3.9, 2.8, 3.1, 2.1, 1.9, 6.4, 7.5, 3.6, 6.6, 5.4),
                  group = factor(c(rep(1, 10), rep(2, 10))))
plot <- ggplot(gexp, aes(x = group, y = expression)) +
  geom_boxplot() +
  geom_point()
mod1 <- lm(expression ~ group, data = gexp)
plot +
  geom_point(data = data.frame(x = c(1, 2), y = c(2.75, 4.33)),
             aes(x = x, y = y),
             colour = "red", size = 5) +
  geom_abline(intercept = coefficients(mod1)[1] - coefficients(mod1)[2],
              slope = coefficients(mod1)[2])
Created on 2022-03-30 by the reprex package (v2.0.1)
To understand the difference between factors and integers in specifying linear models, you can have a look at the model matrix.
model.matrix(y ~ f, data = data.frame(f = 1:3, y = 1))
#> (Intercept) f
#> 1 1 1
#> 2 1 2
#> 3 1 3
#> attr(,"assign")
#> [1] 0 1
model.matrix(y ~ f, data = data.frame(f = factor(1:3), y = 1))
#> (Intercept) f2 f3
#> 1 1 0 0
#> 2 1 1 0
#> 3 1 0 1
#> attr(,"assign")
#> [1] 0 1 1
#> attr(,"contrasts")
#> attr(,"contrasts")$f
#> [1] "contr.treatment"
Created on 2022-03-30 by the reprex package (v2.0.1)
In the first model matrix, what you specify is what you get: you're modelling something as a function of the intercept and the f variable. In this model, you assume that f = 2 has twice the effect of f = 1.
This works a little differently when f is a factor. A k-level factor gets split into k - 1 dummy variables, each encoding with 1 or 0 whether an observation deviates from the reference level (the first factor level). Modelled this way, the 2nd factor level is not assumed to be twice the 1st factor level.
Because ggplot2 displays the first factor level at position 1, not at position 0 (how it is modelled), your calculated intercept is off. You need to subtract 1 * slope from the calculated intercept to get the line to display correctly in ggplot2.
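A quick numeric check of that shift, reusing mod1 from above: evaluating the shifted line at the plotted positions x = 1 and x = 2 recovers the two group means quoted in the question.
b <- unname(coefficients(mod1))
(b[1] - b[2]) + b[2] * c(1, 2)
#> [1] 2.75 4.33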

Creating stimuli in R with ggplot

I am trying to generate my own stimuli for an experiment using R. Below is the code that creates my (x, y) coordinates using rnorm() with different means and SDs. I also create another variable to represent the size of the circles, which is determined by runif().
dt <- data.frame(x = NA,
                 y = NA,
                 size = NA,
                 M = NA,
                 sd = NA,
                 col = NA,
                 iter = NA)
sa <- 0
mySD <- c(5, 15)
myMeans <- c(35, 45)
colors <- c("Blues", "Reds")
for (i in 1:10) {
  for (s in mySD) {
    for (m in myMeans) {
      x <- abs(rnorm(n = 1, mean = m, sd = s))
      y <- abs(rnorm(n = 1, mean = m, sd = s))
      size <- runif(1, 1, 25)  # select a random size between 1 and 25
      sa <- sa + 1
      dt[sa, ] <- NA
      dt$x[sa] <- x
      dt$y[sa] <- y
      dt$M[sa] <- m
      dt$sd[sa] <- s
      dt$size[sa] <- size
      dt$iter[sa] <- i
    }
  }
}
Next, I want to plot with ggplot(dt, aes(x, y, size = size)). I want to randomly select 4 (x, y) values to plot for one graph, then 8 for another, then 16 for another, and so on. Basically, I want to plot different graphs with different numbers of data points; for example, some graphs would have 4 data points that vary by size and color, others would have 32. I'm not sure how to select a set of unique data points from the data frame that I created. Any help would be great. I'm pretty new to R.
Here are two ways, depending on whether you want each group to exclude points that appear in any other group.
I'll just use a dummy data frame that just has columns x, y, and size.
library(tidyverse)
dt <- tibble(x = runif(100), y = runif(100), size = runif(100))
Allowing groups to share the same points
Create a vector for the size of each group.
sample_sizes <- 2^(seq_len(4) + 1)
sample_sizes
#> [1] 4 8 16 32
Randomly sample the data frame and add a group column.
sampled <- map_dfr(
  sample_sizes,
  ~ sample_n(dt, .),
  .id = "group"
)
Plot using facets.
ggplot(sampled, aes(x, y, size = size)) +
  geom_point() +
  facet_wrap(~group)
Requiring groups to have different points
First, we need a way to generate four 1s, eight 2s, and so on. This can be done using log2 and some tricks.
groups <- floor(log2(seq_len(nrow(dt)) + 3)) - 1
groups
#> [1] 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4
#> [36] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5
#> [71] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
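If the log2 trick feels opaque, an equivalent construction uses rep() (a sketch; the counts 4 + 8 + 16 + 32 cover 60 rows and the remaining rows fall into a fifth group):
groups <- rep(1:5, times = c(4, 8, 16, 32, nrow(dt) - 60))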
Shuffle this vector and add it as a column.
dt$group <- sample(groups)
Facet using this new column to generate the desired plots.
ggplot(dt, aes(x, y, size = size)) +
  geom_point() +
  facet_wrap(~group)
First of all, the question's data creation code can be greatly simplified and rewritten with no loops at all. R is a vectorized language, and the following will create a data frame with the same structure.
Don't forget to set the RNG seed, in order to make the results reproducible.
library(ggplot2)
set.seed(2020) # make the results reproducible
# mySD and myMeans as defined in the question
sd <- rep(rep(mySD, each = 2), 10)
M <- rep(myMeans, 2 * 10)
x <- abs(rnorm(n = 40, mean = M, sd = sd))
y <- abs(rnorm(n = 40, mean = M, sd = sd))
size <- runif(40, 1, 25)
iter <- seq_along(x)
dt2 <- data.frame(x, y, size, M, sd, iter)
dt2$col <- c("blue", "red")
Now the plots. The following function accepts a data frame X as its first argument and a number of points n to draw as the second. It then plots n points chosen at random, coloured by col and sized by the continuous variable size.
plot_fun <- function(X, n){
  Colors <- unique(X[["col"]])
  Colors <- setNames(Colors, Colors)
  i <- sample(nrow(X), n)
  g <- ggplot(X[i, ], aes(x, y, size = size, color = col)) +
    geom_point() +
    scale_color_manual(values = Colors) +
    theme_bw()
  g
}
plot_fun(dt2, 8)
To plot several values for n, produce the plots with lapply then use grid.arrange from package gridExtra.
plot_list <- lapply(c(4, 8, 16, 32), function(n) plot_fun(dt2, n))
gridExtra::grid.arrange(grobs = plot_list)
Individual plots are still possible with
plot_list[[1]]
plot_list[[2]]
and so on.
Another way is to use faceting. Write another function, plot_fun_facets, that assigns the number of points to a new variable n in each sampled data frame and uses that variable for faceting.
plot_fun_facets <- function(X, n){
  Colors <- unique(X[["col"]])
  Colors <- setNames(Colors, Colors)
  X_list <- lapply(n, function(.n){
    i <- sample(nrow(X), .n)
    Y <- X[i, ]
    Y$n <- .n
    Y
  })
  X <- do.call(rbind, X_list)
  g <- ggplot(X, aes(x, y, size = size, color = col)) +
    geom_point() +
    scale_color_manual(values = Colors) +
    facet_wrap(~ n) +
    theme_bw()
  g
}
plot_fun_facets(dt2, c(4,8,16,32))

Understanding the fundamentals of quantile() and quantiles

If there are, say, 59 observations, I'm confused why quantile(data, probs = 0.05) would put 3 values under the 5th percentile, as 3/59 ≈ 0.051.
library(tidyverse)
a <- 1:59
b <- rnorm(59)
df <- data.frame(a, b)
df_5thperc <- df %>%
  summarize(`05%` = quantile(b, probs = 0.05))
y <- mean(df_5thperc$`05%`)
ggplot() +
  geom_point(data = df, aes(x = a, y = b)) +
  geom_hline(yintercept = y, color = "blue")
To expand on @BenBolker's comment, you could consider the type parameter of the quantile() function. You are using a continuous distribution, so types 4 through 9 are relevant. For example:
b[b < quantile(b, probs = c(.05), type = 9)]
Types 4 and 6 will give what you were probably expecting:
[1] -1.893092 -3.263889
while types 5, 7, 8, and 9 will give:
[1] -1.893092 -1.538927 -3.263889
The help file gives much detail about why, but in the end it comes down to the fact that there is no single agreed-upon method to estimate sample quantiles (including the median).
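To see all the relevant types side by side for the same data, a quick sketch using the b vector from the question:
# 5th-percentile estimate under each continuous-distribution type (4 to 9)
setNames(sapply(4:9, function(t) quantile(b, probs = 0.05, type = t)),
         paste("type", 4:9))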

ggplot to show confidence intervals from bootstrapping curve fitting

Thanks to the help I got here, I was able to get a spaghetti plot of curve fits using bootstrapping. I am trying to derive confidence bands from these fitted models. I've had no luck getting something like
quants <- apply(fitted_boot, 1, quantile, c(0.025, 0.5, 0.975))
to work with the following:
library(dplyr)
library(broom)
library(ggplot2)
xdata <- c(-35.98, -34.74, -33.46, -32.04, -30.86, -29.64, -28.50, -27.29, -26.00,
           -24.77, -23.57, -22.21, -21.19, -20.16, -18.77, -17.57, -16.47, -15.35,
           -14.40, -13.09, -11.90, -10.47, -9.95, -8.90, -7.77, -6.80, -5.99,
           -5.17, -4.21, -3.06, -2.29, -1.04)
ydata <- c(-4.425, -4.134, -5.145, -5.411, -6.711, -7.725, -8.087, -9.059, -10.657,
           -11.734, NA, -12.803, -12.906, -12.460, -12.128, -11.667, -10.947, -10.294,
           -9.185, -8.620, -8.025, -7.493, -6.713, -6.503, -6.316, -5.662, -5.734, -4.984,
           -4.723, -4.753, -4.503, -4.200)
data <- data.frame(xdata, ydata)
x_range <- seq(min(xdata), max(xdata), length.out = 1000)
fitted_boot <- data %>%
  bootstrap(100) %>%
  do({
    m <- nls(ydata ~ A * cos(2 * pi * ((xdata - x_0) / z)) + M, .,
             start = list(A = 4, M = -7, x_0 = -10, z = 30))
    f <- predict(m, newdata = list(xdata = x_range))
    data.frame(xdata = x_range, .fitted = f)
  })
ggplot(data, aes(xdata, ydata)) +
  geom_line(aes(y = .fitted, group = replicate), fitted_boot, alpha = .1, color = "blue") +
  geom_point(size = 3) +
  theme_bw()
I thought perhaps geom_ribbon() would be a nice way to go, but I just don't know where to go from here.
Thank you to Axeman for helping on the other post!
One approach would be to calculate the confidence interval at each x-value and then just plot that. Here, I am using the first value outside the 2.5th and 97.5th percentiles, though you could adjust the code as needed.
First, I change to group_by the xdata locations (instead of the replicates). Then, I arrange by the .fitted values so that I can slice out the values I want (the first outside each percentile cutoff). Finally, I tag each one with which bound I am getting (they always go lower then upper because we sorted).
forConfInt <- fitted_boot %>%
  ungroup() %>%
  group_by(xdata) %>%
  arrange(.fitted) %>%
  slice(c(floor(0.025 * n()),
          ceiling(0.975 * n()))) %>%
  mutate(range = c("lower", "upper"))
This gives:
   replicate     xdata   .fitted range
       <int>     <dbl>     <dbl> <chr>
1          9 -35.98000 -4.927462 lower
2         94 -35.98000 -4.249348 upper
3          9 -35.94503 -4.927248 lower
4         94 -35.94503 -4.257776 upper
5          9 -35.91005 -4.927228 lower
6         94 -35.91005 -4.266334 upper
7          9 -35.87508 -4.927401 lower
8         94 -35.87508 -4.275020 upper
9          9 -35.84010 -4.927766 lower
10        94 -35.84010 -4.283836 upper
# ... with 1,990 more rows
And we can then just add an additional line to the ggplot call:
ggplot(data, aes(xdata, ydata)) +
  geom_line(aes(y = .fitted, group = replicate), fitted_boot, alpha = .1, color = "blue") +
  # added confidence interval:
  geom_line(aes(y = .fitted, group = range), forConfInt, color = "red") +
  geom_point(size = 3) +
  theme_bw()
This gives a plot with the bootstrapped fits in blue and the red lines tracing the confidence bounds.
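If you would rather have the shaded band that the question's geom_ribbon() idea suggests, a sketch along the same lines summarises the bootstrap fits with quantile() at each x-value (here using the 2.5th and 97.5th percentiles rather than the first values outside them):
ribbon_df <- fitted_boot %>%
  ungroup() %>%
  group_by(xdata) %>%
  summarise(lower = quantile(.fitted, 0.025),
            upper = quantile(.fitted, 0.975))
ggplot(data, aes(xdata, ydata)) +
  geom_ribbon(data = ribbon_df, aes(x = xdata, ymin = lower, ymax = upper),
              inherit.aes = FALSE, fill = "red", alpha = 0.3) +
  geom_point(size = 3) +
  theme_bw()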
