Function to remove outliers by group from dataframe

I am trying to remove outliers from a data frame containing x and y variables, grouped by the variable cond.
I have written a function that removes outliers based on boxplot statistics and returns the data frame without them. The function works when applied to the raw data. However, when I apply it to grouped data, it fails with this error:
Error in mutate_impl(.data, dots) :
Evaluation error: argument "df" is missing, with no default.
How can I correct my function so that it takes the vectors df$x and df$y as arguments and correctly removes outliers by group?
My dummy data:
set.seed(955)
# Make some noisily increasing data
dat <- data.frame(cond = rep(c("A", "B"), each = 22),
                  xvar = c(1:10+rnorm(20,sd=3), 40, 10, 11:20+rnorm(20,sd=3), 85, 115),
                  yvar = c(1:10+rnorm(20,sd=3), 200, 60, 11:20+rnorm(20,sd=3), 35, 200))
removeOutliers <- function(df, ...) {
  # first, identify the outliers and store them in a vector
  outliers.x <- boxplot.stats(df$x)$out
  outliers.y <- boxplot.stats(df$y)$out
  # remove the outliers from the original data
  df <- df[-which(df$x %in% outliers.x), ]
  df[-which(df$y %in% outliers.y), ]
}
# Remove outliers (check that the function works)
removeOutliers(dat)
# Apply the function by group
# Not working!!!
dat_noOutliers <- dat %>%
  group_by(cond) %>%
  mutate(removeOutliers)
I have found this function, which removes outliers from a single vector. However, I would like to remove outliers from both the df$x and df$y vectors in a data frame.
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
(remove outliers by group in R)

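Since you want to apply the function to every non-grouping column of the data frame, use mutate_all with the vector-based remove_outliers function instead of mutate. Note that it replaces outliers with NA rather than dropping the rows: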
dat_noOutliers <- dat %>%
  group_by(cond) %>%
  mutate_all(remove_outliers)
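In dplyr 1.0.0 and later, mutate_all is superseded by across(). Here is a minimal sketch of the same idea, assuming the dat data frame from the question and a slightly condensed copy of remove_outliers. The final filter() step is an addition, not part of the answer above; it drops the rows that were marked NA.

```r
library(dplyr)

# Dummy data from the question
set.seed(955)
dat <- data.frame(cond = rep(c("A", "B"), each = 22),
                  xvar = c(1:10 + rnorm(20, sd = 3), 40, 10, 11:20 + rnorm(20, sd = 3), 85, 115),
                  yvar = c(1:10 + rnorm(20, sd = 3), 200, 60, 11:20 + rnorm(20, sd = 3), 35, 200))

# Condensed copy of remove_outliers: values beyond 1.5 * IQR from the quartiles become NA
remove_outliers <- function(x, na.rm = TRUE) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  x[x < (qnt[1] - H) | x > (qnt[2] + H)] <- NA
  x
}

dat_no_outliers <- dat %>%
  group_by(cond) %>%
  mutate(across(c(xvar, yvar), remove_outliers)) %>%  # applied within each group
  ungroup() %>%
  filter(!is.na(xvar), !is.na(yvar))                  # assumed extra step: drop flagged rows
```

Because the function is applied inside group_by(cond), the quartiles are computed separately for groups A and B.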

You may just filter your data:
library(tidyverse)
set.seed(955)
dat <- data.frame(cond = rep(c("A", "B"), each = 22),
                  xvar = c(1:10+rnorm(20,sd=3), 40, 10, 11:20+rnorm(20,sd=3), 85, 115),
                  yvar = c(1:10+rnorm(20,sd=3), 200, 60, 11:20+rnorm(20,sd=3), 35, 200))
dat %>%
  ggplot(aes(x = xvar, y = yvar)) +
  geom_point() +
  geom_smooth(method = lm) +
  ggthemes::theme_hc()
dat %>%
  group_by(cond) %>%
  filter(!xvar %in% boxplot.stats(xvar)$out) %>%
  filter(!yvar %in% boxplot.stats(yvar)$out) %>%
  ggplot(aes(x = xvar, y = yvar)) +
  geom_point() +
  geom_smooth(method = lm) +
  ggthemes::theme_hc()
Created on 2018-12-11 by the reprex package (v0.2.1)
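The original data-frame function could also be repaired directly. The sketch below uses base R only and rests on two observations about the question's code: df$x reaches xvar only through partial matching (which a grouped tibble does not do), and df[-which(...), ] returns zero rows whenever which() finds no outliers. The name remove_outliers_df and its cols argument are illustrative, not from the original post.

```r
# Dummy data from the question
set.seed(955)
dat <- data.frame(cond = rep(c("A", "B"), each = 22),
                  xvar = c(1:10 + rnorm(20, sd = 3), 40, 10, 11:20 + rnorm(20, sd = 3), 85, 115),
                  yvar = c(1:10 + rnorm(20, sd = 3), 200, 60, 11:20 + rnorm(20, sd = 3), 35, 200))

# Keep only rows that are not boxplot outliers in any of the given columns.
# A logical mask is used instead of -which(), so "no outliers" keeps all rows.
remove_outliers_df <- function(df, cols = c("xvar", "yvar")) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    out <- boxplot.stats(df[[col]])$out
    keep <- keep & !(df[[col]] %in% out)
  }
  df[keep, , drop = FALSE]
}

# Apply the function to each group separately, then stack the pieces again
dat_no_outliers <- do.call(rbind, lapply(split(dat, dat$cond), remove_outliers_df))
rownames(dat_no_outliers) <- NULL
```

Unlike the chained filter() calls above, this version identifies the x and y outliers on the same, unfiltered group data, which matches what the question's function attempted.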

Related

Removing outliers from statistical testing of stat_compare_means

I have a large dataset that has to be presented as box plots. However, there may be outliers within each group, and I want to run the statistical tests only after excluding those outliers. Sample data and code:
df = data.frame(name = c(rep("Bob",5),rep("Tom",5)),
                score = c(2,3,4,5,100,5,8,9,10,95))
df %>% ggplot(aes(x=name,y=score)) + geom_boxplot() +
  stat_compare_means(comparisons = list(c("Bob","Tom")),method="t.test", paired=F)
I use stat_compare_means because the larger dataset has many more groups and facets, which makes removing outliers manually very tedious (unless that can be done for the whole dataset at once). Is it possible to make the function ignore the outliers when computing the statistical tests? Thanks

If you exclude the outliers only from the statistical test, the plot will show test results computed without outliers on top of box plots drawn with them, which is misleading. You could instead remove the outliers beforehand and then run the t-test. The first plot below shows a t-test without outliers on a plot with outliers; the second shows a t-test without outliers on a plot without outliers:
library(dplyr)
library(ggpubr)
df = data.frame(name = c(rep("Bob",5),rep("Tom",5)),
                score = c(2,3,4,5,100,5,8,9,10,95))
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  val <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - val)] <- NA
  y[x > (qnt[2] + val)] <- NA
  y
}
df2 <- df %>%
  group_by(name) %>%
  mutate(score = remove_outliers(score)) %>%
  ungroup()
indx <- which(is.na(df2$score), arr.ind=TRUE)
df %>% ggplot(aes(x=name,y=score)) +
  geom_boxplot() +
  stat_compare_means(data = df2[-indx,], comparisons = list(c("Bob","Tom")),
                     method="t.test",
                     paired=F)
df2 %>% ggplot(aes(x=name,y=score)) +
  geom_boxplot() +
  stat_compare_means(comparisons = list(c("Bob","Tom")),
                     method="t.test",
                     paired=F)
#> Warning: Removed 2 rows containing non-finite values (stat_boxplot).
#> Warning: Removed 2 rows containing non-finite values (stat_signif).
Created on 2022-08-10 by the reprex package (v2.0.1)
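To check which values are excluded and what the test sees, the same steps can be run without any plotting. This is a base R sketch under the assumption that an unpaired Welch t-test (the t.test default) is the test of interest; the clean column and df_clean are illustrative names.

```r
df <- data.frame(name = c(rep("Bob", 5), rep("Tom", 5)),
                 score = c(2, 3, 4, 5, 100, 5, 8, 9, 10, 95))

# Same rule as above: values beyond 1.5 * IQR from the quartiles become NA
remove_outliers <- function(x, na.rm = TRUE) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm)
  val <- 1.5 * IQR(x, na.rm = na.rm)
  x[x < (qnt[1] - val) | x > (qnt[2] + val)] <- NA
  x
}

# ave() applies the function within each name and returns a vector aligned with df
df$clean <- ave(df$score, df$name, FUN = remove_outliers)
df_clean <- df[!is.na(df$clean), ]

# Unpaired t-test on the outlier-free scores
res <- t.test(score ~ name, data = df_clean)
res$p.value
```

Here the scores 100 (Bob) and 95 (Tom) are the two values dropped, which matches the "Removed 2 rows" warnings in the reprex output.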

ggplot - use data passed to ggplot to calculate the mean of the data in subsequent geom calls [duplicate]

I was wondering why variable mean_y is not recognized by my
geom_hline(yintercept = unique(mean_y)) call?
library(tidyverse)
set.seed(20)
n_groups <- 2
n_in_group <- 20
sd_e = 2
groups <- gl(n_groups, n_in_group, labels = c("T","C"))
age <-rnorm(length(groups), 25, 3)
betas <- c(5,0,0,2)
dat <- data.frame(groups=groups,age=age)
X <- model.matrix(~ groups * age, data = dat)
lin_pred <- as.vector(X %*% betas)
dat$y <- rnorm(nrow(X), lin_pred, sd_e)
dat %>% group_by(groups) %>% mutate(mean_y = mean(y)) %>%
  ungroup() %>%
  ggplot()+aes(x = age, y = y) +
  geom_point(aes(color=groups)) +
  geom_hline(yintercept = unique(mean_y)) # Error in unique(mean_y) :
                                          # object 'mean_y' not found

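Columns of the plot data are only visible inside aes(). Outside it, mean_y is looked up as an ordinary R object, which does not exist. Try: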
geom_hline(aes(yintercept = mean_y))
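A self-contained sketch of both working variants follows. The data here is a simplified stand-in for the question's simulation (an assumption made for brevity), and the group means are computed with base R's ave() rather than group_by() and mutate().

```r
library(ggplot2)

# Simplified stand-in data: two groups of 20, with a shifted mean in group "C"
set.seed(20)
dat <- data.frame(groups = gl(2, 20, labels = c("T", "C")),
                  age = rnorm(40, 25, 3))
dat$y <- rnorm(40, mean = 5 + 2 * (dat$groups == "C"), sd = 2)
dat$mean_y <- ave(dat$y, dat$groups)   # per-group mean, repeated on every row

# Variant 1: map the column inside aes(), so it is looked up in the plot data
p <- ggplot(dat, aes(x = age, y = y)) +
  geom_point(aes(color = groups)) +
  geom_hline(aes(yintercept = mean_y))

# Variant 2: pass plain values, referring to the data frame explicitly
p2 <- ggplot(dat, aes(x = age, y = y)) +
  geom_point(aes(color = groups)) +
  geom_hline(yintercept = unique(dat$mean_y))
```

Variant 1 draws one line per row, so many lines overlap at each of the two group means. Variant 2 draws exactly one line per unique mean.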

Passing data frame fields to a regression model in a function

I am just a few months into using R and this is my first post. I am looking to create a function that takes fields from a data frame, filters the outliers via quantiles, then writes the regression parameters as annotations on a scatter plot. The filtering and plotting work correctly but I get an error in the linear model. Can I convert those fields to execute in the model?
Error message:
Error in model.frame.default(formula = df[, field1] ~ df[, field2], drop.unused.levels = TRUE) :
invalid type (list) for variable 'df[, field1]'
Here is the function:
scatter_filtered <- function(df,field1,field2,field3) {
  range1 <- quantile(df[, field1], probs= c(0.1,0.9), na.rm=TRUE)
  range2 <- quantile(df[, field2], probs= c(0.1,0.9), na.rm=TRUE)
  low_end1 <- range1[1]
  high_end1 <- range1[2]
  low_end2 <- range2[1]
  high_end2 <- range2[2]
  df %>%
    filter(df[, field1] > low_end1, df[, field1] < high_end1,
           df[, field2] > low_end2, df[, field2] < high_end2) %>%
    model <- lm(df[,field1] ~ df[, field2])
  r_output <- round(glance(model)$r.squared, digits = 5)
  r_adj_output <- round(glance(model)$adj.r.squared, digits = 5)
  p_output <- round(glance(model)$p.value, digits = 5) %>%
    ggplot(aes_string(x = field1, y = field2, color = field3)) +
    geom_point() +
    geom_smooth(method="lm", se=FALSE)
  # annotate("text", label = paste("r_sq:",r_output), x=0.1, y=0.1, parse=TRUE) +
  # annotate("text", label = paste("p-val:",p_output), x=0.1, y=0.1, parse=TRUE)
}
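One possible way to get the model step working is sketched below in base R. It assumes the error arises because df[, field1] on a tibble returns a one-column tibble (a list) rather than a vector, so it uses df[[field1]] for the filtering and builds the formula from the column names with reformulate(). The function name filter_and_fit and the returned list are illustrative; the plotting part is left out.

```r
filter_and_fit <- function(df, field1, field2, probs = c(0.1, 0.9)) {
  # Quantile limits for both fields; [[ ]] always returns a plain vector
  range1 <- quantile(df[[field1]], probs = probs, na.rm = TRUE)
  range2 <- quantile(df[[field2]], probs = probs, na.rm = TRUE)

  # Keep rows strictly inside both ranges; which() drops rows with NA comparisons
  keep <- df[[field1]] > range1[1] & df[[field1]] < range1[2] &
          df[[field2]] > range2[1] & df[[field2]] < range2[2]
  filtered <- df[which(keep), , drop = FALSE]

  # Build field1 ~ field2 from the column names and fit on the filtered data
  model <- lm(reformulate(field2, response = field1), data = filtered)
  s <- summary(model)
  f <- s$fstatistic

  list(data = filtered,
       model = model,
       r_squared = round(s$r.squared, 5),
       adj_r_squared = round(s$adj.r.squared, 5),
       p_value = round(unname(pf(f[1], f[2], f[3], lower.tail = FALSE)), 5))
}

# Usage with a built-in dataset
res <- filter_and_fit(mtcars, "mpg", "wt")
res$r_squared
```

Fitting with data = filtered also ensures that the model uses the filtered rows, whereas the original lm() call referred to the unfiltered df.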

Fill negative value area below geom_line [duplicate]

I frequently use kernel density plots to illustrate distributions. These are easy and fast to create in R like so:
set.seed(1)
draws <- rnorm(100)^2
dens <- density(draws)
plot(dens)
#or in one line like this: plot(density(rnorm(100)^2))
Which gives me this nice little PDF:
I'd like to shade the area under the PDF from the 75th to 95th percentiles. It's easy to calculate the points using the quantile function:
q75 <- quantile(draws, .75)
q95 <- quantile(draws, .95)
But how do I shade the area between q75 and q95?

Use the polygon() function; see its help page. I believe we have had similar questions here too.
You need to find the indices of the quantile values in dens$x to get the actual (x, y) pairs.
Edit: Here you go:
x1 <- min(which(dens$x >= q75))
x2 <- max(which(dens$x < q95))
with(dens, polygon(x=c(x[c(x1,x1:x2,x2)]), y= c(0, y[x1:x2], 0), col="gray"))

Another solution:
dd <- with(dens,data.frame(x,y))
library(ggplot2)
qplot(x,y,data=dd,geom="line")+
  geom_ribbon(data=subset(dd,x>q75 & x<q95),aes(ymax=y),ymin=0,
              fill="red",colour=NA,alpha=0.5)
Result:

An expanded solution:
If you wanted to shade both tails (copy & paste of Dirk's code) and use known x values:
set.seed(1)
draws <- rnorm(100)^2
dens <- density(draws)
plot(dens)
q2 <- 2
q65 <- 6.5
qn08 <- -0.8
qn02 <- -0.2
x1 <- min(which(dens$x >= q2))
x2 <- max(which(dens$x < q65))
x3 <- min(which(dens$x >= qn08))
x4 <- max(which(dens$x < qn02))
with(dens, polygon(x=c(x[c(x1,x1:x2,x2)]), y= c(0, y[x1:x2], 0), col="gray"))
with(dens, polygon(x=c(x[c(x3,x3:x4,x4)]), y= c(0, y[x3:x4], 0), col="gray"))
Result:
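Since the two polygon() calls differ only in their limits, the index lookup can be wrapped in a small helper. This is a sketch; shade_between is a hypothetical name, and it must be called after plot(dens) because polygon() draws on the current plot.

```r
set.seed(1)
draws <- rnorm(100)^2
dens <- density(draws)
q75 <- quantile(draws, .75)
q95 <- quantile(draws, .95)

# Shade the area under the density curve for from <= x < to
shade_between <- function(dens, from, to, col = "gray") {
  idx <- which(dens$x >= from & dens$x < to)
  if (length(idx) == 0) return(invisible(idx))   # nothing to shade
  first <- idx[1]
  last <- idx[length(idx)]
  # Close the polygon by dropping to y = 0 at both ends
  polygon(x = dens$x[c(first, idx, last)],
          y = c(0, dens$y[idx], 0),
          col = col)
  invisible(idx)
}

plot(dens)
idx <- shade_between(dens, q75, q95)
```

Calling the helper again with other limits, for example shade_between(dens, -0.8, -0.2), shades a second region on the same plot.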

This question needs a lattice answer. Here's a very basic one, simply adapting the method employed by Dirk and others:
library(lattice)
#Set up the data
set.seed(1)
draws <- rnorm(100)^2
dens <- density(draws)
#Put in a simple data frame
d <- data.frame(x = dens$x, y = dens$y)
#Define a custom panel function;
# Options like color don't need to be hard coded
shadePanel <- function(x,y,shadeLims){
  panel.lines(x,y)
  m1 <- min(which(x >= shadeLims[1]))
  m2 <- max(which(x <= shadeLims[2]))
  tmp <- data.frame(x1 = x[c(m1,m1:m2,m2)], y1 = c(0,y[m1:m2],0))
  panel.polygon(tmp$x1,tmp$y1,col = "blue")
}
#Plot
xyplot(y~x,data = d, panel = shadePanel, shadeLims = c(1,3))

Here's another ggplot2 variant based on a function that approximates the kernel density at the original data values:
approxdens <- function(x) {
  dens <- density(x)
  f <- with(dens, approxfun(x, y))
  f(x)
}
Using the original data (rather than producing a new data frame with the density estimate's x and y values) has the benefit of also working in faceted plots where the quantile values depend on the variable by which the data is being grouped:
Code used
library(tidyverse)
library(RColorBrewer)
# dummy data
set.seed(1)
n <- 1e2
dt <- tibble(value = rnorm(n)^2)
# function that approximates the density at the provided values
approxdens <- function(x) {
  dens <- density(x)
  f <- with(dens, approxfun(x, y))
  f(x)
}
probs <- c(0.75, 0.95)
dt <- dt %>%
  mutate(dy = approxdens(value), # calculate density
         p = percent_rank(value), # percentile rank
         pcat = as.factor(cut(p, breaks = probs, # percentile category based on probs
                              include.lowest = TRUE)))
ggplot(dt, aes(value, dy)) +
  geom_ribbon(aes(ymin = 0, ymax = dy, fill = pcat)) +
  geom_line() +
  scale_fill_brewer(guide = "none") +
  theme_bw()
# dummy data with 2 groups
dt2 <- tibble(category = c(rep("A", n), rep("B", n)),
              value = c(rnorm(n)^2, rnorm(n, mean = 2)))
dt2 <- dt2 %>%
  group_by(category) %>%
  mutate(dy = approxdens(value),
         p = percent_rank(value),
         pcat = as.factor(cut(p, breaks = probs,
                              include.lowest = TRUE)))
# faceted plot
ggplot(dt2, aes(value, dy)) +
  geom_ribbon(aes(ymin = 0, ymax = dy, fill = pcat)) +
  geom_line() +
  facet_wrap(~ category, nrow = 2, scales = "fixed") +
  scale_fill_brewer(guide = "none") +
  theme_bw()
Created on 2018-07-13 by the reprex package (v0.2.0).
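To see what approxdens and the percentile rank contribute without drawing anything, the two columns can be computed in base R. This is a sketch; the percentile rank is written out by hand on the assumption that it follows the usual (rank - 1) / (n - 1) definition, and in_band is an illustrative name.

```r
# Approximate the kernel density at the original data values
approxdens <- function(x) {
  dens <- density(x)
  f <- with(dens, approxfun(x, y))  # here x and y are the density grid
  f(x)                              # evaluated at the original data
}

set.seed(1)
value <- rnorm(100)^2

dy <- approxdens(value)             # one density height per observation

# Percentile rank of each observation, scaled to [0, 1]
p <- (rank(value, ties.method = "min") - 1) / (length(value) - 1)

# Observations that fall between the 75th and 95th percentiles
in_band <- p >= 0.75 & p <= 0.95
head(data.frame(value, dy, p, in_band))
```

Each observation carries its own density height and percentile, so the same computation can be repeated within each group without building a separate density data frame per facet.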
