label or colour outliers geomboxplot with a column [duplicate] - r

I have code that creates a boxplot using ggplot2 in R, and I want to label the outliers with the year and Battle.
Here is the code that creates the boxplot:
require(ggplot2)
ggplot(seabattle, aes(x = PortugesOutcome, y = RatioPort2Dutch)) +
geom_boxplot(outlier.size = 2, outlier.colour = "green") +
stat_summary(fun = "mean", geom = "point", shape = 23, size = 3, fill = "pink") +
labs(x = "Outcome", y = "Ratio of Portuguese to Dutch/British ships") +
ggtitle("Portuguese Sea Battles")
Can anyone help? I know the plot itself is correct; I just want to label the outliers.

The following is a reproducible solution that uses dplyr and the built-in mtcars dataset.
Walking through the code: first, create a function, is_outlier, that returns TRUE or FALSE for each value passed to it, depending on whether that value is an outlier. Next, group_by the grouping variable (cyl here; PortugesOutcome in your data) and, in the call to mutate, add a variable outlier that holds the drat value when it is an outlier (drat corresponds to RatioPort2Dutch in your data) and NA otherwise, so that non-outliers get no label. Finally, plot the boxplot and add the labels via geom_text with the label aesthetic set to the new variable. The text is offset slightly to the right with hjust so that each value appears next to, rather than on top of, its outlier point.
library(dplyr)
library(ggplot2)
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
mtcars %>%
group_by(cyl) %>%
mutate(outlier = ifelse(is_outlier(drat), drat, as.numeric(NA))) %>%
ggplot(., aes(x = factor(cyl), y = drat)) +
geom_boxplot() +
geom_text(aes(label = outlier), na.rm = TRUE, hjust = -0.3)
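To see what is_outlier returns on its own, here is a minimal sketch on a made-up vector (the values are illustrative and not taken from the question's data). It applies the same 1.5 * IQR rule as the function above and then mirrors the ifelse() step used inside mutate().

```r
# Same rule as in the answer: flag values more than 1.5 * IQR outside the quartiles.
is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

x <- c(1, 2, 3, 4, 100)          # made-up data
flags <- unname(is_outlier(x))   # only the last element is TRUE
print(flags)

# Keep the value where the flag is TRUE and NA elsewhere, as in the mutate() call
labels <- ifelse(flags, x, NA_real_)
print(labels)
```

Here the quartiles are 2 and 4, so the IQR is 2 and anything below -1 or above 7 is flagged; only 100 qualifies.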

You can do this simply within ggplot itself, using an appropriate stat_summary call.
ggplot(mtcars, aes(x = factor(cyl), y = drat, fill = factor(cyl))) +
geom_boxplot() +
stat_summary(
aes(label = round(after_stat(y), 1)),
geom = "text",
fun = function(y) { o <- boxplot.stats(y)$out; if (length(o) == 0) NA_real_ else o },
hjust = -1
)
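The summary function in this answer relies on boxplot.stats() from base R (package grDevices), whose $out element holds the points that lie beyond the whiskers. A minimal sketch on made-up values, assuming the default coef = 1.5:

```r
x <- c(1, 2, 3, 4, 100)        # made-up data
out <- boxplot.stats(x)$out    # points lying beyond the whiskers
print(out)                     # only 100 is returned

# With no outliers, $out is empty, which is why the answer checks length(o) == 0
no_out <- boxplot.stats(c(1, 2, 3, 4, 5))$out
print(length(no_out))
```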

To label the outliers with rownames (based on JasonAizkalns answer)
library(dplyr)
library(ggplot2)
library(tibble)
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
dat <- mtcars %>%
tibble::rownames_to_column(var = "outlier") %>%
group_by(cyl) %>%
mutate(is_outlier = ifelse(is_outlier(drat), drat, as.numeric(NA)))
# blank out the row name for every point that is not an outlier
dat$outlier[which(is.na(dat$is_outlier))] <- as.numeric(NA)
ggplot(dat, aes(y = drat, x = factor(cyl))) +
geom_boxplot() +
geom_text(aes(label = outlier), na.rm = TRUE, nudge_y = 0.05)

Does this work for you?
library(ggplot2)
library(data.table)
#generate some data
set.seed(123)
n=500
dat <- data.table(group=c("A","B"),value=rnorm(n))
By default, ggplot treats a point as an outlier when it lies more than 1.5 * IQR beyond the edges of the box.
#function that takes in vector of data and a coefficient,
#returns boolean vector if a certain point is an outlier or not
check_outlier <- function(v, coef=1.5){
quantiles <- quantile(v,probs=c(0.25,0.75))
IQR <- quantiles[2]-quantiles[1]
res <- v < (quantiles[1]-coef*IQR)|v > (quantiles[2]+coef*IQR)
return(res)
}
#apply this to our data
dat[,outlier:=check_outlier(value),by=group]
dat[,label:=ifelse(outlier,"label","")]
#plot
ggplot(dat,aes(x=group,y=value))+geom_boxplot()+geom_text(aes(label=label),hjust=-0.3)

Similar to the answers above, but this takes the outliers directly from ggplot2, which avoids any mismatch between your own outlier rule and the one the plot uses:
# calculate boxplot object
g <- ggplot(mtcars, aes(factor(cyl), drat)) + geom_boxplot()
# get list of outliers
out <- ggplot_build(g)[["data"]][[1]][["outliers"]]
# label list elements with factor levels
names(out) <- levels(factor(mtcars$cyl))
# convert to tidy data
tidyout <- purrr::map_df(out, tibble::as_tibble, .id = "cyl")
# plot boxplots with labels
g + geom_text(data = tidyout, aes(cyl, value, label = value),
hjust = -.3)

With a small twist on @JasonAizkalns's solution, you can label outliers with their row position in your data frame.
mtcars[,'row'] <- row(mtcars)[,1]
...
mutate(outlier = ifelse(is_outlier(drat), row, as.numeric(NA)))
...
I load the data frame into the RStudio environment, so I can then take a closer look at the data in the outlier rows.

Related

Annotate several regression lines produced with geom_smooth

I have a figure with 16 regression lines and I need to be able to tell them apart. A color gradient, symbols, or different line types do not really help.
My idea, therefore, is to just (haha) annotate every line.
To do that, I built a dataset (hpAnnotatedLines) with the maximum x value of each line, which is where the text should start. However, I have no idea how to automatically extract the y values of the predicted regression lines at those maximum x values, which differ for each line.
Below is a smaller example using mtcars:
library(ggplot2)
library(dplyr)
library(ggrepel)
#just select the data I need
mtcars1 <- select(mtcars, disp,cyl,hp)
mtcars1$cyl <- as.factor(mtcars1$cyl)
#extract max values
mtcars2 <- mtcars1 %>%
group_by(cyl) %>%
summarise(Max.disp= max(disp))
#build dataset for the annotation layer
#note that hp was done by hand. Here I need help
hpAnnotatedLines <- data.frame(cyl=levels(mtcars2$cyl),
disp=mtcars2$Max.disp,
hp=c(90,100,210))
#example plot
ggplot(mtcars, aes(x=disp, y=hp, color = factor(cyl))) +
geom_point() +
geom_smooth(method=lm)+
coord_cartesian(xlim = c(min(mtcars$disp), max(mtcars$disp) + 50)) +
geom_text_repel(
data = hpAnnotatedLines,
aes(label = cyl),
size = 3,
nudge_x = 1)
Instead of extracting the fitted values you could add the labels via geom_text by switching the stat to smooth and setting the label aesthetic via after_stat such that only the last point of each regression line gets labelled:
library(ggplot2)
library(dplyr)
myfun <- function(x, color) {
data.frame(x = x, color = color) %>%
group_by(color) %>%
mutate(label = ifelse(x %in% max(x), as.character(color), "")) %>%
pull(label)
}
ggplot(mtcars, aes(x=disp, y=hp, color = factor(cyl))) +
geom_point() +
geom_smooth(method=lm) +
geom_text(aes(label = after_stat(myfun(x, color))),
stat = "smooth", method = "lm", hjust = 0, size = 3, nudge_x = 1, show.legend = FALSE) +
coord_cartesian(xlim = c(min(mtcars$disp), max(mtcars$disp) + 50))
It's a bit of a hack, but you can extract the data from the compiled plot object. For example first make the plot without the labels,
myplot <- ggplot(mtcars, aes(x=disp, y=hp, color = factor(cyl))) +
geom_point() +
geom_smooth(method=lm)+
coord_cartesian(xlim = c(min(mtcars$disp), max(mtcars$disp) + 50))
Then use ggplot_build to get the data from the second layer (the geom_smooth layer) and rename its columns to the names used in your data. Here we find the largest x value per group and take the y value at that point. Note that mtcars$cyl is numeric, so it has to be wrapped in factor() before its levels can be looked up.
pobj <- ggplot_build(myplot)
hpAnnotatedLines <- pobj$data[[2]] %>% group_by(group) %>%
top_n(1, x) %>%
transmute(disp=x, hp=y, cyl=levels(factor(mtcars$cyl))[group])
Then add an additional layer to your plot
myplot +
geom_text_repel(
data = hpAnnotatedLines,
aes(label = cyl),
size = 3,
nudge_x = 1)
If your data set is not too large, you can extract the predictions with augment() from broom and keep the row with the largest disp value in each group:
library(broom)
library(dplyr)
library(ggplot2)
library(ggrepel)
hpAnn = mtcars %>% group_by(cyl) %>%
do(augment(lm(hp ~ disp,data=.))) %>%
top_n(1,disp) %>%
select(cyl,disp,.fitted) %>%
rename(hp = .fitted)
hpAnn
# A tibble: 3 x 3
# Groups: cyl [3]
#     cyl  disp    hp
#   <dbl> <dbl> <dbl>
# 1     4  147.  96.7
# 2     6  258   99.9
# 3     8  472  220.
Then plot:
ggplot(mtcars, aes(x=disp, y=hp, color = factor(cyl))) +
geom_point() +
geom_smooth(method=lm)+
coord_cartesian(xlim = c(min(mtcars$disp), max(mtcars$disp) + 50))+
geom_text_repel(
data = hpAnn,
aes(label = cyl),
size = 3,
nudge_x = 1)
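The same end points can also be computed without extra packages. The following is a minimal sketch, assuming a separate linear fit of hp on disp per cyl group (which is what geom_smooth(method=lm) draws here); the name end_points is made up for illustration.

```r
# Fit hp ~ disp separately for each cyl group and predict at the group's largest disp.
end_points <- do.call(rbind, lapply(split(mtcars, mtcars$cyl), function(d) {
  fit <- lm(hp ~ disp, data = d)
  x_max <- max(d$disp)
  data.frame(
    cyl  = d$cyl[1],
    disp = x_max,
    hp   = unname(predict(fit, newdata = data.frame(disp = x_max)))
  )
}))
print(end_points)
```

The resulting data frame has the columns disp, hp, and cyl, so it could be passed as the data argument of geom_text_repel in the same way as hpAnn.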

How to place geom_text labels in the correct position with a grouped box plot in ggplot

In this scenario I have added a grouping variable to the iris data frame. I wish to make a boxplot of Sepal.Length by Species, filled by the grouping variable, with the outliers identified by a label. This all works, but when I try to label the outliers with geom_text, the labels do not print at the grouped position but in the center instead. It seems geom_text is not inheriting the global aes(), but I don't know why.
code:
library(tidyverse)
# function to id outlier
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
# make a grouping variable
iris$group <- sample(1:3, nrow(iris),replace = T)
# make a outlier variable
iris <-
iris %>%
group_by(Species, group) %>%
mutate(outlier = ifelse(is_outlier(Sepal.Length), Sepal.Length, as.numeric(NA)))
iris$outlier
# graph
iris %>%
ggplot(aes(x = Species,y = Sepal.Length, fill = factor(group))) +
geom_boxplot() +
geom_text(aes(label = outlier))
The labels are in the center rather than over their respective boxes. What's going on here?
This happens because geom_boxplot dodges the boxes once a fill group is present, while geom_text does not dodge by default. Use position_dodge with the same width in both layers to control it explicitly. You may also want to experiment with the hjust and vjust arguments of geom_text to avoid plotting over the points.
iris %>%
ggplot(aes(x = Species,y = Sepal.Length, fill = factor(group))) +
geom_boxplot(position = position_dodge(width = 1)) +
geom_text(aes(label = outlier), position = position_dodge(width = 1))

R: Unexplainable behavior of ggplot inside a function

I have written a function that draws histograms with ggplot2 for the numeric columns of a data frame passed to it. The function stores the plots in a list and returns the list.
However, when I run the function I get the same plot again and again.
My code is below, along with a reproducible example.
hist_of_columns = function(data, class, variables_to_exclude = c()){
library(ggplot2)
library(ggthemes)
data = as.data.frame(data)
variables_numeric = names(data)[unlist(lapply(data, function(x){is.numeric(x) | is.integer(x)}))]
variables_not_to_plot = c(class, variables_to_exclude)
variables_to_plot = setdiff(variables_numeric, variables_not_to_plot)
indices = match(variables_to_plot, names(data))
index_of_class = match(class, names(data))
plots = list()
for (i in (1 : length(variables_to_plot))){
p = ggplot(data, aes(x= data[, indices[i]], color= data[, index_of_class], fill=data[, index_of_class])) +
geom_histogram(aes(y=..density..), alpha=0.3,
position="identity", bins = 100)+ theme_economist() +
geom_density(alpha=.2) + xlab(names(data)[indices[i]]) + labs(fill = class) + guides(color = FALSE)
name = names(data)[indices[i]]
plots[[name]] = p
}
plots
}
data(mtcars)
mtcars$am = factor(mtcars$am)
data = mtcars
variables_to_exclude = 'mpg'
class = 'am'
plots = hist_of_columns(data, class, variables_to_exclude)
If you check the list plots you will discover that it contains the same plot repeated.
Simply use aes_string to pass string variable names into the ggplot() call. Right now, x, color, and fill are given as expressions such as data[, indices[i]] rather than as columns of ggplot's data argument. aes() stores these expressions unevaluated and only evaluates them when a plot is drawn, by which time the loop has finished and i holds its final value, so every plot in the list shows the same column:
ggplot(data, aes(x= data[, indices[i]], color= data[, index_of_class], fill=data[, index_of_class]))
However, with aes_string, passing string names to x, color, and fill will point to data:
ggplot(data, aes_string(x= names(data)[indices[i]], color= class, fill= class))
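aes_string has been soft-deprecated since ggplot2 3.0.0, so on current versions the same idea might be written with the .data pronoun instead. The following is a minimal sketch, not the original poster's function: hist_by_column and its arguments are hypothetical names, and it assumes ggplot2 3.3.0 or later for after_stat().

```r
library(ggplot2)

# Hypothetical helper: one density-scaled histogram per requested column.
# lapply() gives every plot its own `col`, so the expression stored by aes()
# still refers to the right column when the plot is drawn later.
hist_by_column <- function(data, class_col, columns) {
  plots <- lapply(columns, function(col) {
    ggplot(data, aes(x = .data[[col]], fill = .data[[class_col]])) +
      geom_histogram(aes(y = after_stat(density)), alpha = 0.3,
                     position = "identity", bins = 30) +
      labs(x = col, fill = class_col)
  })
  names(plots) <- columns
  plots
}

df <- transform(mtcars, am = factor(am))
plots <- hist_by_column(df, "am", c("disp", "hp"))
```

Using lapply() rather than a for loop matters here: inside a for loop, .data[[col]] would still be evaluated only at draw time and would again see the last value of col.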
Here is a strategy using tidyeval that does what you are after:
library(rlang)
library(tidyverse)
hist_of_cols <- function(data, class, drop_vars) {
# tidyeval overhead
class_enq <- enquo(class)
drop_enqs <- enquo(drop_vars)
data %>%
group_by(!!class_enq) %>% # keep the 'class' column always
select(-!!drop_enqs) %>% # drop any 'drop_vars'
select_if(is.numeric) %>% # keep only numeric columns
gather("key", "value", -!!class_enq) %>% # go to long form
split(.$key) %>% # make a list of data frames
map(~ ggplot(., aes(value, fill = !!class_enq)) + # plot as usual
geom_histogram() +
geom_density(alpha = .5) +
labs(x = unique(.$key)))
}
hist_of_cols(mtcars, am, mpg)
hist_of_cols(mtcars, am, c(mpg, wt))
