Inconsistent Whiskers upper distance with 1.5 IQR - r

The same data plotted with two different boxplot methods gives two different whisker lengths. I understand that a whisker only extends as far as the maximum (minimum) point that is less (greater) than the upper (lower) fence value. In my case, either the two methods are picking a different maximum point or the fence is miscalculated. From what I read in the documentation, both methods use 1.5 IQR, so the fence limit should be about 57.7 and 39 should be chosen, not 58.8.
http://docs.ggplot2.org/0.9.3.1/geom_boxplot.html
https://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/boxplot.stats.html
[Figures: geom_boxplot vs boxplot]
library(ggplot2)
df <- data.frame(num = c(81.16469, 11.59219, 29.7309, 86.03547, 16.42667, 33.52099, 26.07814, 30.91522, 39.49079, 31.634, 37.8732, 20.50268, 16.9127, 20.1115, 23.74309, 22.30444, 24.21399, 27.30867, 39.07071, 14.81049, 21.42116, 23.30437, 17.94871, 17.50281, 58.82008, 20.18478, 10.65572, 37.97092, 25.16336, 35.69668))
quantile(df$num)
      0%      25%      50%      75%     100%
10.65572 20.12982 24.68867 35.15276 86.03547
boxplot(df$num)
IQR(df$num) * 1.5 + quantile(df$num)[4]
     75%
57.68716
ggplot(df, aes("x", num)) + geom_boxplot()
boxplot(df$num)
More importantly, I want to extract from df$num the five stats (a vector of length 5, containing the extreme of the lower whisker, the lower ‘hinge’, the median, the upper ‘hinge’ and the extreme of the upper whisker) using any function, so that I can add text to the ggplot at specific locations. boxplot.stats(df$num)$stats, shown below, provides those stats, but its upper whisker does not match my ggplot result.
boxplot.stats(df$num)$stats
[1] 10.65572 20.11150 24.68867 35.69668 58.82008
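The numbers above already hint at the cause: boxplot.stats reports hinges of 20.11150 and 35.69668, while quantile() gives 20.12982 and 35.15276. A minimal sketch of the two whisker calculations, assuming (as ?stat_boxplot describes) that ggplot2 builds its box from the 25% and 75% quantiles while boxplot.stats uses Tukey's hinges from fivenum(). The helper box_stats is hypothetical and written only for this illustration:

```r
# Data from the question
num <- c(81.16469, 11.59219, 29.7309, 86.03547, 16.42667, 33.52099,
         26.07814, 30.91522, 39.49079, 31.634, 37.8732, 20.50268,
         16.9127, 20.1115, 23.74309, 22.30444, 24.21399, 27.30867,
         39.07071, 14.81049, 21.42116, 23.30437, 17.94871, 17.50281,
         58.82008, 20.18478, 10.65572, 37.97092, 25.16336, 35.69668)

# Hypothetical helper: five box statistics with a choice of box edges
box_stats <- function(x, edges = c("quantile", "hinge"), coef = 1.5) {
  edges <- match.arg(edges)
  box <- if (edges == "quantile") {
    unname(quantile(x, c(0.25, 0.5, 0.75)))  # quartiles (assumed ggplot2 rule)
  } else {
    fivenum(x)[2:4]                          # Tukey hinges, as in boxplot.stats()
  }
  # Fences sit coef box-lengths beyond each box edge
  fence <- box[c(1, 3)] + c(-1, 1) * coef * (box[3] - box[1])
  inside <- x[x >= fence[1] & x <= fence[2]]
  # Whiskers end at the most extreme observations inside the fences
  c(min(inside), box, max(inside))
}

q_stats <- box_stats(num, "quantile")
h_stats <- box_stats(num, "hinge")

q_stats[5]  # 39.49079: upper fence is about 57.69, so 58.82008 falls outside
h_stats[5]  # 58.82008: upper fence is about 59.07, so 58.82008 falls inside
```

Under that assumption, the quantile-based vector is the one to use for positioning text on the ggplot, and the hinge-based vector reproduces boxplot.stats(df$num)$stats.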

Related

Method of Outlier Removal for Boxplots

In R, what method is used in boxplot() to remove outliers? In other words, what determines if a given value is an outlier?
Please note, this question is not asking how to remove outliers.
EDIT: Why is this question being downvoted? Please provide comments. The method for outlier removal is not available in any documentation I have come across.
tl;dr outliers are points that are beyond approximately twice the interquartile range away from the median (in a symmetric case). More precisely, points beyond a cutoff equal to the 'hinges' (approx. 1st and 3rd quartiles) +/- 1.5 times the interquartile range.
R's boxplot() function does not actually remove outliers at all; all observations in the data set are represented in the plot (unless the outline argument is FALSE). The information on the calculation for which points are plotted as outliers (i.e., as individual points beyond the whiskers) is, implicitly, contained in the description of the range parameter:
range [default 1.5]: this determines how far the plot whiskers extend out from the
box. If ‘range’ is positive, the whiskers extend to the most
extreme data point which is no more than ‘range’ times the
interquartile range from the box. A value of zero causes the
whiskers to extend to the data extremes.
This has to be deconstructed a little bit more: what does "from the box" mean? To figure this out, we need to look at the Details of ?boxplot.stats:
The two ‘hinges’ are versions of the first and third quartile,
i.e., close to ‘quantile(x, c(1,3)/4)’ [... see ?boxplot.stats for slightly more detail ...]
The reason for all the complexity/"approximately equal to the quartile" language is that the developers of the boxplot wanted the hinges and whiskers to be drawn, as far as possible, at points representing actual observations in the data set (whereas the quartiles computed by quantile() are often interpolated between observed points).
Example:
set.seed(101)
z <- rnorm(100000)
boxplot(z)
hinges <- qnorm(c(0.25,0.75))
IQR <- diff(qnorm(c(0.25,0.75)))
abline(h=hinges,lty=2,col=4) ## hinges ~ quartiles
abline(h=hinges+c(-1,1)*1.5*IQR,col=2)
## in this case hinges = +/- IQR/2, so whiskers ~ +/- 2*IQR
abline(h=c(-1,1)*IQR*2,lty=2,col="purple")
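As a rough check of the rule described above, the cutoffs can be computed by hand from the hinges and compared with what boxplot.stats reports. This is only a sketch; the variable names are made up for the illustration, and a smaller sample is used to keep it quick:

```r
set.seed(101)
z <- rnorm(1000)

hinges  <- fivenum(z)[c(2, 4)]                 # lower and upper hinge
box_len <- diff(hinges)                        # length of the box
fences  <- hinges + c(-1, 1) * 1.5 * box_len   # cutoffs beyond each hinge

# Points flagged by hand versus points reported by boxplot.stats()
manual_out <- z[z < fences[1] | z > fences[2]]
all.equal(sort(manual_out), sort(boxplot.stats(z)$out))
```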

Subsetting Outliers from a data frame by using the results of a boxplot diagram

I plotted my data as a boxplot (using geom_boxplot from ggplot2) so that the outliers became visible. Afterwards I wanted to remove them from my data, so I used "ggplot_build" to get all the information about the plot and saved it under a new name.
Outlier_boxplot<-ggplot_build(boxplot)
Now it was possible to extract the column with the outliers. In the next step I used the function "subset" to select only the values of my data.frame that are not equal to the extracted outliers.
Without_Outlier_dF <- subset(round(dF[1], digits = 3), Test != c(round(Outlier_boxplot$data[[1]]$outliers[[4]], digits = 3)))
That worked well in nearly all cases. The problem is that sometimes values are not removed, even though they look the same as one of the outliers.
Extract of values data.frame:
-234,347 75,764 93,34 95,237 99,005 100,044 97,924 98,875 98,072 99,569 98,848 98,414 99,33 96,901 99,29 100,359 99,169 97,828 97,146 97,229 94,278 97,146 97,229 94,278
Outliers
-234.347 75.764 93.340 94.278
Results: Outliers removed except for the value 94,278
95,237 99,005 100,044 97,924 98,875 98,072 99,569 98,848 98,414 99,33 96,901 99,29 100,359 99,169 97,828 97,146 97,229 94,278
I already tried to round all values (as you can see) but it didn't help. Do you have any ideas?
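boxplot.stats calculates the positions of the upper and lower whiskers the same way base boxplot() does. geom_boxplot applies the same 1.5 * IQR rule, but its box edges are the 25% and 75% quantiles (see ?stat_boxplot) rather than the hinges, so its whiskers and outliers can differ slightly. You can run boxplot.stats yourself (v is assumed to be your input data vector):
> boxplot.stats(v)
$stats
[1] 93.340 96.069 97.876 99.087 100.359
$n
[1] 24
$conf
[1] 96.90265 98.84935
$out
[1] -234.347 75.764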
From the boxplot.stats documentation:
stats a vector of length 5, containing the extreme of the lower
whisker, the lower ‘hinge’, the median, the upper ‘hinge’ and the
extreme of the upper whisker.
n the number of non-NA observations in the sample.
conf the lower and upper extremes of the ‘notch’ (if(do.conf)). See
the details.
out the values of any data points which lie beyond the extremes of
the whiskers (if(do.out)).
I guess it contains all the data you might need for further analysis.
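One possible explanation for the leftover 94.278 in the question is that != compares element by element and recycles the shorter vector, so a value is only dropped when it happens to line up with the same value in the outlier vector. A minimal sketch, assuming the values and the outliers reported in the question (with decimal commas written as points); the variable names are made up for the illustration:

```r
# Values from the question
v <- c(-234.347, 75.764, 93.34, 95.237, 99.005, 100.044, 97.924, 98.875,
       98.072, 99.569, 98.848, 98.414, 99.33, 96.901, 99.29, 100.359,
       99.169, 97.828, 97.146, 97.229, 94.278, 97.146, 97.229, 94.278)

# Outliers as reported in the question (taken from the ggplot object there)
reported_out <- c(-234.347, 75.764, 93.340, 94.278)

# `!=` recycles reported_out along v and compares position by position
kept_recycled <- v[v != reported_out]
sum(kept_recycled == 94.278)   # 1: one copy of 94.278 survives

# `%in%` tests each value against the whole set of outliers
kept <- v[!(v %in% reported_out)]
sum(kept == 94.278)            # 0

# The same pattern with the outliers that boxplot.stats() reports
kept_base <- v[!(v %in% boxplot.stats(v)$out)]
length(kept_base)              # 22
```

Inside subset(), the equivalent condition would be !(Test %in% outliers) rather than Test != outliers.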

Documentation for special variables in ggplot (..count.., ..density.., etc.)

The stat_ functions in ggplot2 create special variables, e.g. stat_bin2d creates a special variable called ..count... Where can I find documentation listing which special variables are returned by which stat_ function?
I've looked on the main ggplot2 documentation pages, and in R online help. I tried reading the source code for stat_bin2d, but it uses bits of the language I don't understand -- I don't know how to get at the code behind StatBin2d$new(...).
Most of them are documented in the value section of the help pages, e.g., ?stat_boxplot says
Value:
A data frame with additional columns:
width: width of boxplot
ymin: lower whisker = smallest observation greater than or equal to
lower hinge - 1.5 * IQR
lower: lower hinge, 25% quantile
notchlower: lower edge of notch = median - 1.58 * IQR / sqrt(n)
middle: median, 50% quantile
notchupper: upper edge of notch = median + 1.58 * IQR / sqrt(n)
upper: upper hinge, 75% quantile
ymax: upper whisker = largest observation less than or equal to
upper hinge + 1.5 * IQR
I suggest submitting bug reports for those that remain undocumented. There is a bug report for stat_bin2d, but it was closed as fixed. If you create a new bug report you can refer to that one.
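When the help page is silent, the computed columns can also be inspected directly on a built plot. A minimal sketch, assuming a ggplot2 version that provides layer_data() (ggplot_build(p)$data[[1]] is the older route to the same data frame); the data are made up for the illustration:

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  d <- data.frame(num = c(1:10, 20, 40))   # made-up illustrative data
  p <- ggplot(d, aes(x = "x", y = num)) + geom_boxplot()

  # Data frame produced by the stat for the first layer
  computed <- layer_data(p, 1)
  print(names(computed))
  print(computed[, c("ymin", "lower", "middle", "upper", "ymax")])
}
```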

What method does outline=FALSE use to determine outliers? [duplicate]

This question already has answers here:
In ggplot2, what do the end of the boxplot lines represent?
In R, I have used the outline=FALSE parameter to exclude outliers when plotting a box and whisker for a particular set. It's worked spectacularly, but leaves me wondering how exactly it determines which elements are outliers.
boxplot(x, horizontal = TRUE, axes = FALSE, outline = FALSE)
An "outlier" in the terminology of box-and-whisker plots is any point in the data set that falls farther than a specified distance from the median, typically approximately 4 times the difference between the median and the 0.25 (lower) or 0.75 (upper) quantile (that difference itself, plus 1.5 times a box that is roughly twice as long). To get there, see ?boxplot.stats: first, look at the definition of out in the output:
out: the values of any data points which lie beyond the extremes of the whiskers (if(do.out)).
These are the "outliers".
Second, look at the definition of the whiskers, which are based on the coef parameter, which is 1.5 by default:
the whiskers extend to the most extreme data point which is no more than coef times the length of the box away from the box.
Finally, look at the definition of the "hinges", which are the ends of the box:
The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4).
Put these together, and you get outliers defined (approximately) as points that are farther from the median than 4 times the distance between the median and the relevant quartile: the box is roughly twice that distance long, and the whisker may extend 1.5 box lengths beyond the end of the box. The reasons for these somewhat convoluted definitions are (I think) partly historical and partly the desire to have the components of the plots reflect actual values that are present in the data (rather than, say, the halfway point between two data points) as much as possible. (You would probably need to go back to the original literature referenced in the help page for the full justifications and explanations.)
The thing to be careful about is that points defined as "outliers" by this algorithm are not necessarily outliers in the usual statistical sense (e.g. points that are surprisingly extreme based on a particular statistical model of the data). In particular, if you have a big data set you will necessarily see lots of "outliers" (one indication that you might want to switch to a more data-hungry graphical summary such as a violin plot or beanplot).
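The distance from the median to the cutoff can be checked numerically for a symmetric distribution. A minimal sketch using the quartiles of the standard normal (qnorm is base R; the variable names are just for illustration):

```r
q3  <- qnorm(0.75)            # upper quartile of a standard normal; the median is 0
iqr <- 2 * q3                 # symmetric case: Q1 = -Q3, so the box is 2 * q3 long
upper_fence <- q3 + 1.5 * iqr # end of the box plus 1.5 box lengths

upper_fence / iqr             # about 2: cutoff distance from the median in IQR units
upper_fence / q3              # about 4: the same distance in median-to-quartile units
```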
For boxplot, outliers are the points that lie above or below the "whiskers". By default, the whiskers extend to the most extreme data points that are no more than range times the interquartile range from the box. The default range value is 1.5, but you can change it, which also changes which points count as outliers.
You can also see that with the boxplot.stats function, which performs the computation used by the plot.
For example, if you have the following vector:
v <- c(runif(10), -0.5, -1)
boxplot(v)
With the default range, only the -1 value is considered an outlier here (the exact result depends on the random draw from runif). You can see it with boxplot.stats:
boxplot.stats(v)$out
[1] -1
But if you change the range argument (or the coef one for boxplot.stats), then -1 is no longer considered an outlier:
boxplot(v, range=2)
boxplot.stats(v, coef=2)$out
numeric(0)
This is admittedly not immediately evident from boxplot(). Look at the range parameter:
this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
So the value of range is used, together with the interquartile range and the box (given by the quartiles), to determine where the whiskers end. And everything outside the whiskers is an outlier.
I'll be the first to agree that this definition is unintuitive. Sadly enough, it is established by now.

Changing the outlier rule in a boxplot

I have constructed some box-plots in R and have several outliers. I know that the default criteria to set outlier limits are:
Q3 + 1.5*IQR
Q1 - 1.5* IQR
However, I would like outliers classified as values that fall outside of the boundaries:
Q3 + 3*IQR
Q1 - 3* IQR
Is it possible to set this in R?
From ?boxplot
range: this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
So set range=3
I'd encourage you not to do this without a lot of thought - people expect that the whiskers extend 1.5 IQRs. Changing the range will violate these assumptions and make it easy for people to draw incorrect conclusions from your graphic.
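To see what the wider rule changes before plotting, the flagged points can be compared for both settings. A minimal sketch with made-up data; plot = FALSE makes boxplot() return its statistics without drawing anything:

```r
x <- c(1:10, 20, 40)   # made-up data with two high values

boxplot.stats(x)$out             # 20 40: flagged with the default coef = 1.5
boxplot.stats(x, coef = 3)$out   # 40: only the most extreme value remains

# The same rule through boxplot(); range is passed on as coef
b3 <- boxplot(x, range = 3, plot = FALSE)
b3$out                           # 40
b3$stats[5, 1]                   # 20: the upper whisker now ends here
```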

Resources