Removing outliers from statistical testing of stat_compare_means - r

I have a larger dataset that has to be presented as boxplots. There may be outliers within each group, and I want to run the statistical tests only after excluding them. Sample df and code below:
df = data.frame(name = c(rep("Bob",5),rep("Tom",5)),
score = c(2,3,4,5,100,5,8,9,10,95))
df %>% ggplot(aes(x=name,y=score)) + geom_boxplot() +
stat_compare_means(comparisons = list(c("Bob","Tom")),method="t.test", paired=F)
I use stat_compare_means because the larger dataset has many more groups and facets, which makes manual elimination of outliers very tedious (unless it can be done across the whole dataset at once). Is it possible to make the function ignore the outliers when it computes the statistical tests? Thanks

If you remove the outliers only for the statistical test, you end up showing a test result computed without outliers on a graph that still displays them, which is misleading. You can remove the outliers beforehand and run the t-test on the cleaned data. The first graph overlays the outlier-free t-test on a boxplot that still includes the outliers; the second graph shows both the t-test and the boxplot without outliers:
library(dplyr)
library(ggpubr)
df = data.frame(name = c(rep("Bob",5),rep("Tom",5)),
score = c(2,3,4,5,100,5,8,9,10,95))
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
val <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - val)] <- NA
y[x > (qnt[2] + val)] <- NA
y
}
df2 <- df %>%
group_by(name) %>%
mutate(score = remove_outliers(score)) %>%
ungroup()
# Drop the rows flagged as outliers (NA); unlike df2[-indx, ], this also works when nothing is flagged
df2_clean <- df2 %>% filter(!is.na(score))
df %>% ggplot(aes(x=name,y=score)) +
geom_boxplot() +
stat_compare_means(data = df2_clean, comparisons = list(c("Bob","Tom")),
method="t.test",
paired=F)
df2 %>% ggplot(aes(x=name,y=score)) +
geom_boxplot() +
stat_compare_means(comparisons = list(c("Bob","Tom")),
method="t.test",
paired=F)
#> Warning: Removed 2 rows containing non-finite values (stat_boxplot).
#> Warning: Removed 2 rows containing non-finite values (stat_signif).
Created on 2022-08-10 by the reprex package (v2.0.1)
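To check the number that ends up on the plot, the same test can be run directly on the cleaned data. The following is a minimal sketch in base R, assuming the same 1.5 * IQR rule as remove_outliers above; the names `flagged`, `clean` and `res` are only illustrative, and the p-value should agree with the annotation only up to rounding.

```r
# Sample data from the question
df <- data.frame(name = c(rep("Bob", 5), rep("Tom", 5)),
                 score = c(2, 3, 4, 5, 100, 5, 8, 9, 10, 95))

# Same 1.5 * IQR rule as in the answer: outliers become NA
remove_outliers <- function(x, na.rm = TRUE) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm)
  val <- 1.5 * IQR(x, na.rm = na.rm)
  x[x < qnt[1] - val | x > qnt[2] + val] <- NA
  x
}

# Apply the rule within each group, then drop the flagged rows
flagged <- ave(df$score, df$name, FUN = remove_outliers)
clean <- df[!is.na(flagged), ]

# Welch two-sample t-test on the cleaned scores (unpaired)
res <- t.test(score ~ name, data = clean)
res$p.value
```

Because ave() works group by group, the same lines should carry over to more groups; for facets, the grouping argument would need to include the facet variable as well.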

Related

Adding mean comparisons to plot + Is it possible to display p-values in ggplot (or R in general) from a KS test, specifically on a violin plot?

I'm seeking to create something like this:
Example Output
Using my own data, I would be specifically using the p-values I found here:
KS test p-values
I was able to produce something similar, albeit with the incorrect method. Specifically, I was able to produce something similar using a T-test:
T test p-value
I produced this by writing this code:
l<- ggplot(VioPos, aes(x=Regulation, y=Score,fill=Regulation)) +
geom_violin(trim=FALSE)+
labs(title="Plot of ARE Scores by Regulation",x="Gene Regulation", y = "ARE Score")+
geom_boxplot(width=0.1,fill="white")+
theme_classic()
l
dp <- l + scale_y_continuous(trans="log2")
dp
dp7 <- dp +
stat_compare_means(comparisons=my_comparisons, method="t.test")
dp7
In other words, I utilized stat_compare_means() using ggplot2/tidyverse/ggpubr/rstatix.
However, if I modify the method in the code, it seems to display correctly for Wilcoxon and T tests, but not for ANOVA and Kruskal-Wallis tests. Moreover, it seems that stat_compare_means() only supports those four and not KS, but I'm specifically interested in plotting mean comparisons from my KS test output onto my violin plots.
Is there some other package I can use?
Also please note: for the KS test, the "UpScorePos" "DownScorePos" etc. was to compare ARE score by regulation (as I did with the graphs in the T test).
You can get the p-value from a KS-test like this:
x <- rnorm(100)
y <- rnorm(100)
res <- ks.test(x, y)
res$p.value
[1] 0.9670685
Just use this p-value and add it to your plots.
EDIT: A somewhat hacky solution is to run a t-test to get the right data structure for stat_pvalue_manual, and then overwrite its p-values with the ones from ks.test. See the example below (I used the ToothGrowth data as an example).
library(ggpubr)
library(rstatix)
# Transform `dose` into factor variable
df <- ToothGrowth
df$dose <- as.factor(df$dose)
# Run t-tests only to get a tibble in the layout stat_pvalue_manual() expects
stat.test <- df %>%
t_test(len ~ dose)
stat.test <- stat.test %>% add_y_position()
stat.test
kst <- stat.test # copy tibble to overwrite p-values for ks.test
p1 <- ks.test(x = ToothGrowth$len[ToothGrowth$dose == 0.5],
y = ToothGrowth$len[ToothGrowth$dose == 1]
)$p.value
p2 <- ks.test(x = ToothGrowth$len[ToothGrowth$dose == 0.5],
y = ToothGrowth$len[ToothGrowth$dose == 2]
)$p.value
p3 <- ks.test(x = ToothGrowth$len[ToothGrowth$dose == 1],
y = ToothGrowth$len[ToothGrowth$dose == 2]
)$p.value
# Same order as the rows of stat.test: 0.5 vs 1, 0.5 vs 2, 1 vs 2
kst[, 'p'] <- as.numeric(c(p1, p2, p3))
ggplot(df, aes(x = dose, y = len)) +
geom_violin(trim = F) +
stat_pvalue_manual(kst, label = "p = {p}")
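With more than three groups, writing out p1, p2, p3 by hand gets tedious. A minimal base-R sketch of the same idea, looping over all pairs of levels with combn(); the names `pairs`, `ks_p` and `ks_tab` are illustrative and not part of any package:

```r
# ToothGrowth ships with base R (package datasets)
df <- ToothGrowth
df$dose <- as.factor(df$dose)

# All pairs of dose levels, one pair per column
pairs <- combn(levels(df$dose), 2)

# Two-sample KS test for each pair; ties in `len` trigger a warning
# about approximate p-values, which is suppressed here
ks_p <- apply(pairs, 2, function(g) {
  suppressWarnings(
    ks.test(df$len[df$dose == g[1]], df$len[df$dose == g[2]])$p.value
  )
})

ks_tab <- data.frame(group1 = pairs[1, ], group2 = pairs[2, ], p = ks_p)
ks_tab
```

If the row order of `ks_tab` matches the tibble returned by t_test() (worth checking against its group1 and group2 columns), the `p` column can be copied across in one assignment.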

Find all local maxima of a geom_smooth curve in R ggplot?

I need to find all local maxima of a geom_smooth() curve in R. This has been asked in Stack Overflow before:
How can I get the peak and valleys of a geom_smooth line in ggplot2?
But the answer related to finding a single maximum. What if there are multiple local maxima we want to find?
Here's some sample data:
library(tidyverse)
set.seed(404)
df <- data.frame(x = seq(0,4*pi,length.out=1000),
y = sin(seq(0,4*pi,length.out=1000))+rnorm(100,0,1))
df %>% ggplot(aes(x=x,y=y)) +
geom_point() +
geom_smooth()
To find a single maximum, we use the function underlying geom_smooth() in order to get the y values of the curve. This would be either gam() for 1000+ data points or loess() for fewer than 1000. In this case, it's gam() from library(mgcv). To find our maximum is a simple matter of subsetting with which.max(). We can plot the modeled y values over geom_smooth() to confirm they're the same, with our maximum represented by a vertical line:
library(mgcv)
df <- df %>%
mutate(smooth_y = predict(gam(y ~ s(x,bs="cs"),data=df)))
maximum <- df$x[which.max(df$smooth_y)]
df %>% ggplot() +
geom_point(aes(x=x,y=y)) +
geom_smooth(aes(x=x,y=y)) +
geom_line(aes(x=x,y=smooth_y),size = 1.5, linetype = 2, col = "red") +
geom_vline(xintercept = maximum,color="green")
So far, so good. But, there is more than one maximum here. Maybe we're trying to find the periodicity of the sine wave, measured as the average distance between maxima. How do we make sure we find all maxima in the series?
I am posting my answer below, but I am wondering if there's a more elegant solution than the brute-force method I used.
You can find the points where the difference between subsequent points flips sign using run-length encoding. Note that this method is approximate and relies on x being ordered. You can refine the locations by predicting more closely spaced x-values.
library(tidyverse)
library(mgcv)
set.seed(404)
df <- data.frame(x = seq(0,4*pi,length.out=1000),
y = sin(seq(0,4*pi,length.out=1000))+rnorm(100,0,1))
df <- df %>%
mutate(smooth_y = predict(gam(y ~ s(x,bs="cs"),data=df)))
# Run length encode the sign of difference
rle <- rle(diff(as.vector(df$smooth_y)) > 0)
# Calculate startpoints of runs
starts <- cumsum(rle$lengths) - rle$lengths + 1
# Take the points where the rle is FALSE (so difference goes from positive to negative)
maxima_id <- starts[!rle$values]
# Also convenient, but not in the question:
# minima_id <- starts[rle$values]
maximum <- df$x[maxima_id]
df %>% ggplot() +
geom_point(aes(x=x,y=y)) +
geom_smooth(aes(x=x,y=y)) +
geom_line(aes(x=x,y=smooth_y),size = 1.5, linetype = 2, col = "red") +
geom_vline(xintercept = maximum,color="green")
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Created on 2020-12-24 by the reprex package (v0.3.0)
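The sign-flip idea can also be written without rle(). A minimal sketch, assuming a noise-free sine as a stand-in for the smoothed curve so that the result is easy to check; `peak_id`, `peaks` and `period` are illustrative names:

```r
# Noise-free stand-in for the smoothed curve: two full sine periods
x <- seq(0, 4 * pi, length.out = 1000)
smooth_y <- sin(x)

# The slope sign goes from +1 to -1 at a local maximum, so the
# difference of the signs is -2 there; + 1 realigns the index
peak_id <- which(diff(sign(diff(smooth_y))) == -2) + 1
peaks <- x[peak_id]

# Average distance between neighbouring maxima as a period estimate
period <- mean(diff(peaks))
```

On this grid the two maxima land within one grid step of pi/2 and 5*pi/2, so `period` comes out close to 2*pi. Flat stretches, where the slope is exactly zero, would not be caught by the `== -2` test.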
I went with a brute force, Monte Carlo method to solve the problem. Using replicate(), we try out 100 random ranges of x and find the maximum y value within each range. We reject maxima that occur at either end of the range. Then we find all unique values of the output vector:
maxima <- replicate(100,{
x_range <- sample(df$x,size=2,replace=FALSE) %>% sort()
max_loc <- df %>%
filter(x >= x_range[1] & x <= x_range[2]) %>%
filter(smooth_y == max(smooth_y)) %>%
pull(x)
if(max_loc == min(x_range)|max_loc == max(x_range)){NA}else{max_loc}
})
unique_maxima <- unique(maxima[!is.na(maxima)])
df %>% ggplot() +
geom_point(aes(x=x,y=y)) +
geom_smooth(aes(x=x,y=y)) +
geom_line(aes(x=x,y=smooth_y),size = 1.5, linetype = 2, col = "red") +
geom_vline(xintercept = unique_maxima,color="green")

Is it possible to recreate the functionality of bayesplot's "mcmc_areas" plot in ggplot in R

There is a package supported by Stan called bayesplot that can produce nice density area plots, with the area under each density curve partitioned by credible intervals computed from the posterior samples drawn through MCMC. This results in a plot that looks like the following:
I am looking to make a similar style of plot given 1D lists of sampled values using ggplot, that I can pass any generic list of values to without it having to be a Stan fit etc. Does anyone know how to do this? The density part is clear via geom_density, but I am struggling with the fill partitioning.
Here's a function that generates a plot similar to bayesplot::mcmc_areas. It plots credible intervals (equal-tailed by default, or highest density) with optional setting of the probability width of the interval:
library(tidyverse)
library(ggridges)
library(bayestestR)
theme_set(theme_classic(base_size=15))
# Create ridgeplots with credible intervals
# ARGUMENTS
# data A data frame
# FUN A function that calculates credible intervals
# ci The width of the credible interval
# ... For passing optional arguments to geom_ridgeline.
# For example, change the scale parameter to control overlap of ridge lines.
# geom_ridgeline's default is scale=1.
plot_density_ridge = function(data, FUN=c("eti", "hdi"), ci=0.89, ...) {
# Determine whether to use eti or hdi function
FUN = match.arg(FUN)
FUN = match.fun(FUN)
# Get kernel density estimate as a data frame
dens = map_df(data, ~ {
d = density(.x, na.rm=TRUE)
tibble(x=d$x, y=d$y)
}, .id="name")
# Set relative width of median line
e = diff(range(dens$x)) * 0.006
# Get credible interval width and median
cred.int = data %>%
pivot_longer(cols=everything()) %>%
group_by(name) %>%
summarise(CI=list(FUN(value, ci=ci)),
m=median(value, na.rm=TRUE)) %>%
unnest_wider(CI)
dens %>%
left_join(cred.int) %>%
ggplot(aes(y=name, x=x, height=y)) +
geom_vline(xintercept=0, colour="grey70") +
geom_ridgeline(data= . %>% group_by(name) %>%
filter(between(x, CI_low, CI_high)),
fill=hcl(230,25,85), ...) +
geom_ridgeline(data=. %>% group_by(name) %>%
filter(between(x, m - e, m + e)),
fill=hcl(240,30,60), ...) +
geom_ridgeline(fill=NA, ...) +
geom_ridgeline(fill=NA, aes(height=0), ...) +
labs(y=NULL, x=NULL)
}
Now let's try out the function
# Fake data
set.seed(2)
d = data.frame(a = rnorm(1000, 0.6, 1),
b = rnorm(1000, 1.3, 0.5),
c = rnorm(1000, -1.2, 0.7))
plot_density_ridge(d)
plot_density_ridge(d, ci=0.5, scale=1.5)
plot_density_ridge(iris %>% select(-Species))
plot_density_ridge(iris %>% select(-Species), FUN="hdi")
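The fill partitioning itself comes down to splitting the density grid at the interval bounds. A minimal base-R sketch for a single vector of samples, assuming an equal-tailed interval taken from sample quantiles (bayestestR may compute its intervals slightly differently); `draws`, `dens`, `bounds` and `inside` are illustrative names:

```r
set.seed(2)
draws <- rnorm(1000, 0.6, 1)   # stand-in for posterior samples

# Kernel density estimate on a regular grid
d <- density(draws)
dens <- data.frame(x = d$x, y = d$y)

# Equal-tailed 89% interval from the sample quantiles
ci <- 0.89
bounds <- quantile(draws, probs = c((1 - ci) / 2, 1 - (1 - ci) / 2))

# Flag the grid points inside the interval; this is the region to fill
dens$in_ci <- dens$x >= bounds[1] & dens$x <= bounds[2]
inside <- dens[dens$in_ci, ]
```

In ggplot, `dens` could then be drawn as the outline and `inside` as a filled area layer on top of it.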
Use the ggridges package:
library(tidyverse)
library(ggridges)
tibble(data_1, data_2, data_3) %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value, y = name, group = name)) +
geom_density_ridges()
Data:
set.seed(123)
n <- 15
data_1 <- rnorm(n)
data_2 <- data_1 - 1
data_3 <- data_1 + 2

R Percentiles of data frame with non-zero subset of observations

I would like to calculate the percentiles of the following tibble...
I have a non-zero subset of 10 observations in each of 3 variables i.e...
n <- 10
tibb <- tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y)
(The excluded observations are all zero)
Therefore the mean is the sum of the fields / 10 (as opposed to / 5):
meantibb <- tibb %>% group_by() %>%
summarise_if(is.numeric, sum, na.rm = TRUE) / n
meantibb
How do I get the following percentiles of x, y and z in the tibble please?
perciles <- c(0.5, 0.75)
percentiles <- function(p) quantile(p, perciles)
Thank you
You could create a data set including the zeroes
missingRowCount <- n - nrow(tibb)
colCount <- ncol(tibb)
zeroTibb <- matrix(0, nrow = missingRowCount, ncol = colCount,
dimnames = list(NULL, colnames(tibb))) %>% as_tibble()
allTibb <- dplyr::bind_rows(tibb, zeroTibb)
Once you have the full data you can run the following to get a tibble of percentiles
percTibble = sapply(allTibb, percentiles) %>%
as_tibble(rownames = "percentile")
The assumption here is that the data is not going to be too large when the zeroes are included.
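If building the padded data set feels heavy, the zeros can also be appended column by column just before calling quantile(). A minimal sketch in base R using a plain data frame with the question's values; `pad_zeros` and `perc` are hypothetical names:

```r
n <- 10
tibb <- data.frame(x = 1:5, y = 1)
tibb$z <- tibb$x^2 + tibb$y

# Pad a column with the zeros that were left out, up to n observations
pad_zeros <- function(v, n) c(v, rep(0, n - length(v)))

# Percentiles of each padded column; rows are "50%" and "75%"
perc <- sapply(tibb, function(v) quantile(pad_zeros(v, n), c(0.5, 0.75)))
perc
```

The result is a matrix with one column per variable, so `perc["75%", "z"]` picks out a single percentile.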
You're close, your method of creating the mean (and subsequently the percentiles) could be simpler if you use gather first and then group the data by the three different factors.
library(dplyr)
library(tidyr) # gather() comes from tidyr
n <- 10
tibb <- tibble(x = 1:5, y = 1, z = x ^ 2 + y)
tibb %>%
gather("fctr", "value") %>%
group_by(fctr) %>%
summarise(mean = sum(value) / n,
perc_50 = quantile(value, 0.5),
perc_75 = quantile(value, 0.75))
However, I'm not sure if you want the percentile of the non-zero subset or the entire dataset, because this will change your outcomes, i.e.
> x = 1:5
> quantile(x, 0.1)
10%
1.4
> test <- c(0,0,0,0,0,1,2,3,4,5)
> quantile(test, 0.1)
10%
0

Function to remove outliers by group from dataframe

I am trying to remove the outliers from my dataframe containing x and y variables grouped by variable cond.
I have created a function that removes the outliers based on boxplot statistics and returns the df without outliers. The function works well when applied to the raw data. However, when applied to grouped data, it does not work and I get an error:
Error in mutate_impl(.data, dots) :
Evaluation error: argument "df" is missing, with no default.
Please, how can I correct my function to take vectors df$x and df$y as arguments, and correctly get rid of outliers by group?
My dummy data:
set.seed(955)
# Make some noisily increasing data
dat <- data.frame(cond = rep(c("A", "B"), each = 22),
xvar = c(1:10+rnorm(20,sd=3), 40, 10, 11:20+rnorm(20,sd=3), 85, 115),
yvar = c(1:10+rnorm(20,sd=3), 200, 60, 11:20+rnorm(20,sd=3), 35, 200))
removeOutliers<-function(df, ...) {
# first, identify the outliers and store them in a vector
outliers.x<-boxplot.stats(df$x)$out
outliers.y<-boxplot.stats(df$y)$out
# remove the outliers from the original data
df<-df[-which(df$x %in% outliers.x),]
df[-which(df$y %in% outliers.y),]
}
# Remove outliers (check that the function works)
removeOutliers(dat)
# Apply the function to group
# Not working!!!
dat_noOutliers<- dat %>%
group_by(cond) %>%
mutate(removeOutliers)
I have found this function, which removes the outliers from a single vector. However, I would like to remove outliers from both the df$x and df$y vectors in a dataframe.
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
(remove outliers by group in R)
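Since you are applying the function to every non-grouping column of the df, use mutate_all with the vector-based remove_outliers instead. Note that this replaces the outliers with NA rather than dropping the rows: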
dat_noOutliers<- dat %>%
group_by(cond) %>%
mutate_all(remove_outliers)
You may just filter your data:
library(tidyverse)
set.seed(955)
dat <- data.frame(cond = rep(c("A", "B"), each = 22),
xvar = c(1:10+rnorm(20,sd=3), 40, 10, 11:20+rnorm(20,sd=3), 85, 115),
yvar = c(1:10+rnorm(20,sd=3), 200, 60, 11:20+rnorm(20,sd=3), 35, 200))
dat %>%
ggplot(aes(x = xvar, y = yvar)) +
geom_point() +
geom_smooth(method = lm) +
ggthemes::theme_hc()
dat %>%
group_by(cond) %>%
filter(!xvar %in% boxplot.stats(xvar)$out) %>%
filter(!yvar %in% boxplot.stats(yvar)$out) %>%
ggplot(aes(x = xvar, y = yvar)) +
geom_point() +
geom_smooth(method = lm) +
ggthemes::theme_hc()
Created on 2018-12-11 by the reprex package (v0.2.1)
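The same per-group filtering can be sketched without dplyr. This version assumes that outliers in both variables are flagged on the full group before any row is dropped, which differs slightly from the chained filter() calls above, where yvar is checked after the xvar outliers are already gone. `is_outlier`, `keep` and `dat_clean` are illustrative names:

```r
set.seed(955)
dat <- data.frame(cond = rep(c("A", "B"), each = 22),
                  xvar = c(1:10 + rnorm(20, sd = 3), 40, 10, 11:20 + rnorm(20, sd = 3), 85, 115),
                  yvar = c(1:10 + rnorm(20, sd = 3), 200, 60, 11:20 + rnorm(20, sd = 3), 35, 200))

# TRUE where a value is a boxplot outlier within its own group
# (ave() returns 0/1 here, hence the comparison with 1)
is_outlier <- function(v, g) {
  ave(v, g, FUN = function(z) z %in% boxplot.stats(z)$out) == 1
}

# Keep only rows that are outliers in neither variable
keep <- !is_outlier(dat$xvar, dat$cond) & !is_outlier(dat$yvar, dat$cond)
dat_clean <- dat[keep, ]
```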
