How is the extreme of the whisker of boxplot calculated with ggplot? - r

I'm trying to draw a boxplot with ggplot2: ggplot() + geom_boxplot(...)
How are the ends of the whiskers calculated in ggplot?

This is answered on the ?geom_boxplot help page:
The upper whisker extends from the hinge to the largest value no further than 1.5 * IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles). The lower whisker extends from the hinge to the smallest value at most 1.5 * IQR of the hinge. Data beyond the end of the whiskers are called "outlying" points and are plotted individually.
(From my version 2.2.1. Find the current docs here: http://ggplot2.tidyverse.org/reference/geom_boxplot.html, or the docs for whatever version you are using at ?geom_boxplot.)
If you have questions about the hinge, see the preceding paragraph on the help page.
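The rule quoted above can be sketched in base R. This is only a minimal illustration, assuming the hinges are the 25% and 75% quantiles as returned by quantile() with its default settings; the vector x and the variable names are made up for the example.

```r
# Hypothetical data: nine evenly spaced values plus one extreme point
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Hinges, assumed here to be the default 25% and 75% quantiles
hinges <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- diff(hinges)

# Whiskers end at the most extreme observations within 1.5 * IQR of each hinge
lower_whisker <- min(x[x >= hinges[1] - 1.5 * iqr])
upper_whisker <- max(x[x <= hinges[2] + 1.5 * iqr])

# Everything beyond the whiskers is drawn as an individual point
outlying <- x[x < lower_whisker | x > upper_whisker]
```

Here the upper whisker stops at 9, the largest observation inside the cutoff, rather than at the cutoff value 14.5 itself.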

Related

R Boxplot - specify upper and lower whisker?

The lower whisker is defined by Q1 - c * IQR
and the upper whisker by Q3 + c * IQR,
where Q1 is the first quartile, Q3 the third quartile, IQR the interquartile range, and c a constant. Usually c is 1.5; I'm not sure, but that is probably the default for boxplot().
Is it possible to specify this c value when creating a boxplot?
Solutions using ggplot2 are welcome too.
I found the answer. Completely missed it reading the documentation, sorry.
It can be given as an argument in boxplot():
range
This determines how far the plot whiskers extend out from the box. If range is positive, the whiskers extend to the most extreme data point which is no more than range times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
boxplot(x,...,range=2)
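One way to see what range does, without drawing anything, is to compare the statistics that boxplot() returns with plot = FALSE. The data vector below is invented for illustration.

```r
# Hypothetical data with one moderately extreme value
x <- c(1:9, 17)

default_box <- boxplot(x, plot = FALSE)             # range = 1.5
wide_box    <- boxplot(x, range = 2, plot = FALSE)

default_box$out          # points drawn individually under the default rule
wide_box$out             # points drawn individually with range = 2
default_box$stats[5, 1]  # upper whisker end, default
wide_box$stats[5, 1]     # upper whisker end, range = 2
```

With the default, 17 lies beyond the upper whisker; with range = 2 the whisker extends to 17 and no points are flagged. For ggplot2, stat_boxplot() documents a coef argument that plays the same role.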
Edit: as tjebo suggested, this post (In ggplot2, what do the end of the boxplot lines represent?) completes the answer. Although the title mentions ggplot2, it covers the solution to my problem and might be worth checking out.

Method of Outlier Removal for Boxplots

In R, what method is used in boxplot() to remove outliers? In other words, what determines if a given value is an outlier?
Please note, this question is not asking how to remove outliers.
EDIT: Why is this question being downvoted? Please provide comments. The method for outlier removal is not available in any documentation I have come across.
tl;dr outliers are points that are beyond approximately twice the interquartile range away from the median (in a symmetric case). More precisely, points beyond a cutoff equal to the 'hinges' (approx. 1st and 3rd quartiles) +/- 1.5 times the interquartile range.
R's boxplot() function does not actually remove outliers at all; all observations in the data set are represented in the plot (unless the outline argument is FALSE). The information on the calculation for which points are plotted as outliers (i.e., as individual points beyond the whiskers) is, implicitly, contained in the description of the range parameter:
range [default 1.5]: this determines how far the plot whiskers extend out from the
box. If ‘range’ is positive, the whiskers extend to the most
extreme data point which is no more than ‘range’ times the
interquartile range from the box. A value of zero causes the
whiskers to extend to the data extremes.
This has to be deconstructed a little bit more: what does "from the box" mean? To figure this out, we need to look at the Details of ?boxplot.stats:
The two ‘hinges’ are versions of the first and third quartile,
i.e., close to ‘quantile(x, c(1,3)/4)’ [... see ?boxplot.stats for slightly more detail ...]
The reason for all the complexity/"approximately equal to the quartile" language is that the developers of the boxplot wanted to make sure that the hinges and whiskers were always drawn at points representing actual observations in the data set (whereas the quartiles can be located between observed points, e.g. in the case of data sets with odd numbers of observations).
Example:
set.seed(101)
z <- rnorm(100000)
boxplot(z)
## theoretical quartiles of the standard normal stand in for the hinges
hinges <- qnorm(c(0.25,0.75))
iqr <- diff(hinges) ## named iqr so that stats::IQR() is not masked
abline(h=hinges,lty=2,col=4) ## hinges ~ quartiles
abline(h=hinges+c(-1,1)*1.5*iqr,col=2) ## whisker cutoffs
## in this case hinges = +/- IQR/2, so whisker cutoffs ~ +/- 2*IQR
abline(h=c(-1,1)*iqr*2,lty=2,col="purple")
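To see that the hinges are only "close to" the quartiles, a small sketch can compare the box ends reported by boxplot.stats() with the output of quantile() at its default settings. The vector 1:10 is chosen purely for illustration.

```r
x <- 1:10

hinges    <- boxplot.stats(x)$stats[c(2, 4)]          # ends of the box
quartiles <- quantile(x, c(1, 3) / 4, names = FALSE)  # default quantile type

hinges     # 3 8
quartiles  # 3.25 7.75
```

The two agree closely on large samples, which is why the theoretical quartiles work as a stand-in in the example above, but they can differ noticeably on small ones.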

Documentation for special variables in ggplot (..count.., ..density.., etc.)

The stat_ functions in ggplot2 create special variables, e.g. stat_bin2d creates a special variable called ..count... Where can I find documentation listing which special variables are returned by which stat_ function?
I've looked on the main ggplot2 documentation pages, and in R online help. I tried reading the source code for stat_bin2d, but it uses bits of the language I don't understand -- I don't know how to get at the code behind StatBin2d$new(...).
Most of them are documented in the value section of the help pages, e.g., ?stat_boxplot says
Value:
A data frame with additional columns:
  width: width of boxplot
  ymin: lower whisker = smallest observation greater than or equal to lower hinge - 1.5 * IQR
  lower: lower hinge, 25% quantile
  notchlower: lower edge of notch = median - 1.58 * IQR / sqrt(n)
  middle: median, 50% quantile
  notchupper: upper edge of notch = median + 1.58 * IQR / sqrt(n)
  upper: upper hinge, 75% quantile
  ymax: upper whisker = largest observation less than or equal to upper hinge + 1.5 * IQR
I suggest submitting bug reports for those that remain undocumented. There is a bug report for stat_bin2d, but it was closed as fixed. If you create a new bug report you can refer to that one.
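Another way to find out what a stat computes is to inspect the data ggplot2 builds for a layer. The sketch below assumes ggplot2 is installed and uses the built-in mtcars data as a stand-in; it is guarded so that it does nothing if the package is missing.

```r
# Inspect the columns computed for a boxplot layer (requires ggplot2)
computed <- NULL
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(mtcars, ggplot2::aes(x = factor(cyl), y = mpg)) +
    ggplot2::geom_boxplot()
  # layer_data() returns the data frame built for the given layer
  computed <- ggplot2::layer_data(p, 1)
  print(names(computed))
}
```

The printed names should include the ymin, lower, middle, upper and ymax columns described in the help page.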

What method does outline=FALSE use to determine outliers? [duplicate]

In R, I have used the outline=FALSE parameter to exclude outliers when plotting a box and whisker for a particular set. It's worked spectacularly, but leaves me wondering how exactly it determines which elements are outliers.
boxplot(x, horizontal = TRUE, axes = FALSE, outline = FALSE)
An "outlier" in the terminology of box-and-whisker plots is any point in the data set that falls farther than a specified distance from the median, typically approximately 4 times the difference between the median and the 0.25 (lower) or 0.75 (upper) quantile. To get there, see ?boxplot.stats: first, look at the definition of out in the output
out: the values of any data points which lie beyond the extremes of the whiskers (if(do.out)).
These are the "outliers".
Second, look at the definition of the whiskers, which are based on the coef parameter, which is 1.5 by default:
the whiskers extend to the most extreme data point which is no more than coef times the length of the box away from the box.
Finally, look at the definition of the "hinges", which are the ends of the box:
The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4).
Put these together, and you get outliers defined (approximately) as points that are farther from the median than 4 times the distance between the median and the relevant quartile: the end of the box is one such distance from the median, and the whisker can reach at most 1.5 box lengths (3 such distances) beyond it. The reasons for these somewhat convoluted definitions are (I think) partly historical and partly the desire to have the components of the plots reflect actual values that are present in the data (rather than, say, the halfway point between two data points) as much as possible. (You would probably need to go back to the original literature referenced in the help page for the full justifications and explanations.)
The thing to be careful about is that points defined as "outliers" by this algorithm are not necessarily outliers in the usual statistical sense (e.g. points that are surprisingly extreme based on a particular statistical model of the data). In particular, if you have a big data set you will necessarily see lots of "outliers" (one indication that you might want to switch to a more data-hungry graphical summary such as a violin plot or beanplot).
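To illustrate why large data sets show many such points, the sketch below computes the share of a standard normal distribution that lies beyond the default cutoffs and compares it with a simulated sample. It assumes normally distributed data and the default coef of 1.5; the seed and sample size are arbitrary.

```r
# Box ends for a standard normal sit at the quartiles, about +/- 0.674
hinge  <- qnorm(0.75)
cutoff <- hinge + 1.5 * (2 * hinge)   # = 4 * hinge, about 2.70

# Expected share of points beyond either cutoff
expected_share <- 2 * pnorm(-cutoff)

# Compare with a simulated sample
set.seed(1)
z <- rnorm(100000)
observed_share <- length(boxplot.stats(z)$out) / length(z)
```

The expected share is about 0.7%, so a sample of 100,000 normal draws will show several hundred points beyond the whiskers even though nothing about them is unusual.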
For boxplot, outliers are the points that are above or below the "whiskers". By default, the whiskers extend to the most extreme data points that are no more than the range argument times the interquartile range from the box. The default value of range is 1.5, but you can change it, and thereby change which points count as outliers.
You can also see that with the boxplot.stats function, which performs the computation used by the plot.
For example, if you have the following vector:
v <- c(runif(10), -0.5, -1)
boxplot(v)
By default, only the -1 value is considered an outlier (the exact result depends on the random draw, since no seed is set). You can see it with boxplot.stats:
boxplot.stats(v)$out
[1] -1
But if you change the range argument (or the coef one for boxplot.stats), then -1 is no longer considered an outlier:
boxplot(v, range=2)
boxplot.stats(v, coef=2)$out
numeric(0)
This is admittedly not immediately evident from boxplot(). Look at the range parameter:
this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
So the value of range is used, together with the interquartile range and the box (given by the quartiles), to determine where the whiskers end. And everything outside the whiskers is an outlier.
I'll be the first to agree that this definition is unintuitive. Sadly enough, it is established by now.

Changing the outlier rule in a boxplot

I have constructed some box-plots in R and have several outliers. I know that the default criteria to set outlier limits are:
Q3 + 1.5*IQR
Q1 - 1.5*IQR
However, I would like outliers classified as values that fall outside of the boundaries:
Q3 + 3*IQR
Q1 - 3*IQR
Is it possible to set this in R?
From ?boxplot
range: this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
So set range=3
I'd encourage you not to do this without a lot of thought - people expect that the whiskers extend 1.5 IQRs. Changing the range will violate these assumptions and make it easy for people to draw incorrect conclusions from your graphic.
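As a rough check of what the stricter rule changes, boxplot.stats() accepts the same multiplier through its coef argument. The data below are invented for illustration.

```r
# Hypothetical data with one moderate and one extreme value
x <- c(1:9, 17, 40)

default_out <- boxplot.stats(x)$out            # coef = 1.5
strict_out  <- boxplot.stats(x, coef = 3)$out  # same rule as boxplot(x, range = 3)
```

Under the default rule both 17 and 40 are flagged; with a multiplier of 3 only 40 remains.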
