Method of Outlier Removal for Boxplots - r

In R, what method is used in boxplot() to remove outliers? In other words, what determines if a given value is an outlier?
Please note, this question is not asking how to remove outliers.
EDIT: Why is this question being downvoted? Please provide comments. The method for outlier removal is not available in any documentation I have come across.

tl;dr outliers are points that lie more than approximately twice the interquartile range away from the median (in a symmetric case). More precisely, they are points beyond a cutoff equal to the 'hinges' (approx. 1st and 3rd quartiles) +/- 1.5 times the interquartile range.
R's boxplot() function does not actually remove outliers at all; all observations in the data set are represented in the plot (unless the outline argument is FALSE). The information on the calculation for which points are plotted as outliers (i.e., as individual points beyond the whiskers) is, implicitly, contained in the description of the range parameter:
range [default 1.5]: this determines how far the plot whiskers extend out from the
box. If ‘range’ is positive, the whiskers extend to the most
extreme data point which is no more than ‘range’ times the
interquartile range from the box. A value of zero causes the
whiskers to extend to the data extremes.
This has to be deconstructed a little bit more: what does "from the box" mean? To figure this out, we need to look at the Details of ?boxplot.stats:
The two ‘hinges’ are versions of the first and third quartile,
i.e., close to ‘quantile(x, c(1,3)/4)’ [... see ?boxplot.stats for slightly more detail ...]
The reason for all the complexity/"approximately equal to the quartile" language is that the developers of the boxplot wanted the hinges and whiskers to stay as close as possible to actual observations in the data set. The whiskers always end at observed values, and each hinge is the median of one half of the data, so it is either an observation or the midpoint of two adjacent observations (whereas quantile() interpolates, so the quartiles can fall at other positions between observed points).
Example:
set.seed(101)
z <- rnorm(100000)
boxplot(z)
## theoretical quartiles of the standard normal
hinges <- qnorm(c(0.25,0.75))
iqr <- diff(hinges) ## lower-case name avoids masking stats::IQR()
abline(h=hinges,lty=2,col=4) ## hinges ~ quartiles
abline(h=hinges+c(-1,1)*1.5*iqr,col=2) ## outlier cutoffs
## in this case hinges = +/- IQR/2, so whiskers ~ +/- 2*IQR
abline(h=c(-1,1)*iqr*2,lty=2,col="purple")
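The rule can also be checked by hand on a small vector. The sketch below uses made-up data (the vector x is an illustration, not taken from the question) and compares cutoffs computed from fivenum() with what boxplot.stats() reports:

```r
# Illustrative data (an assumption for this sketch): one clearly extreme value
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

# fivenum() returns min, lower hinge, median, upper hinge, max
hinges <- fivenum(x)[c(2, 4)]
box_length <- diff(hinges)

# Cutoffs: hinges -/+ 1.5 times the length of the box
cutoffs <- hinges + c(-1, 1) * 1.5 * box_length

# Points beyond the cutoffs are the ones drawn individually
manual_out <- x[x < cutoffs[1] | x > cutoffs[2]]

bs <- boxplot.stats(x)
print(manual_out)  # same points as bs$out
print(bs$stats)    # whisker ends are the most extreme points inside the cutoffs
```

Note that the value 30 is still part of the data; it is only drawn as an individual point instead of being covered by the whisker.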

Related

R - TablePlot() - Clarifications

I'm just trying to understand how to read table plots. I don't understand what the dividing line in a numerical column/variable represents. For example, the dividing black line in P1/2/3/4/5 here:
https://steemitimages.com/DQmeEJ8RyPkdRhdqX6CwNsUTzXfGWt36RwyFrixt6NNbPTw/tabplot.PNG
Also, I understand the Y Axis represents proportions (0% to 100%). Does the X axis for each variable represent proportions too or is that just regular values for the data?
Thanks!
It's hard to be sure, but those look like boxplots (though they're not; see below). A classic boxplot has a central marker for the median, and the box ends are at points called hinges, which are set by the first and third quartiles. You can read up about it at ?boxplot.stats. It's also possible that someone chose a different statistic to form the x-bounds for those boxes, but we can be certain that they are not proportions since some of them are negative.
The proportions in the "y-axis" are for the various regions. You will need to find documentation to determine the sequence of the regions. I suspect they are viticulture regions in Italy.
Here is a copy of part of the help page: ?tableplot:
numMode: character value that determines how numeric values are plotted. The value consists of the following building blocks, which are concatenated with the "-" symbol. The default value is "mb-sdb-sdl". Prior to version 1.2, "MB-ML" was the default value.
  sdb: sd bars between mean-sd to mean+sd are shown
  sdl: sd lines at mean-sd and mean+sd are shown
  mb: mean bars are shown
  MB: mean bars are shown, where the color of the bar indicate completeness where positive mean values are blue and negative orange
  ml: mean lines are shown
  ML: mean lines are shown, where positive mean values are blue and negative orange
  mean2: mean values are shown
This default value for numMode has evidently changed, since most of the examples in the documentation only show the mean value.

How to account for outliers in a histogram? - R/Matlab

I am wondering if there is a way to account for an outlier in a histogram plot. I want to plot the frequencies of a random variable, which is very small and distributed around zero. However, in most of the cases I am considering I also have an outlier that complicates things. Is there a way to adjust the scale of the x axis in R/Matlab so that I can capture the distribution of the random variable I am considering and also show the outlier? The usual ways of obtaining the plot result in a scale on which all values appear to be zero, and I want to show how they are distributed around zero. So ideally I would like the scale around zero to accommodate very small numbers and then, after a gap (which does not necessarily have to be proportional to the actual distance from zero), a bin to indicate the value of the outlier. I do not want to remove the outlier from the sample.
Is such a thing possible in R/Matlab? Any other suggestions would be welcome.
Edit: The problem is not in identifying the outliers and using a different color for them. The problem is in adjusting the scales on the x-axis so I can observe the distribution of the variable as well as have the outlier included in the plot.
The following code will do the job. Note that it has to change the XTickLabel entries of the axes so that they show the real value of the outliers.
A=rand(1000,1)*0.1;
A(1:10)=10;
% modify the data for plotting purposes: move the outliers closer
expected_maximum_value=1; % you could compute this using 3*sigma, maybe?
distance_to_outliers=0.5;
outlier_mean=mean(A(A>expected_maximum_value));
A(A>expected_maximum_value)=A(A>expected_maximum_value)-outlier_mean+distance_to_outliers;
% plot
h=histogram(A,'BinWidth',0.01);
%% trick the X axis
ax=gca;
ax.XTickLabel{end-1}=[ax.XTickLabel{end-1} '//'];
ax.XTickLabel{end}=['//' num2str(outlier_mean)];
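Since the question also asks about R, here is a rough base-R counterpart of the same idea. It is a sketch under assumptions: the simulated data, the cutoff of 1, the gap of 0.5 and all variable names are illustrative choices, not part of the original answer.

```r
# Simulated data mirroring the MATLAB example: small values plus a few outliers
set.seed(1)
a <- runif(1000) * 0.1
a[1:10] <- 10

cutoff <- 1  # hypothetical threshold separating ordinary values from outliers
gap <- 0.5   # position of the outliers on the compressed axis

is_out <- a > cutoff
outlier_mean <- mean(a[is_out])

# Move the outliers close to the bulk of the data, for plotting only
shifted <- a
shifted[is_out] <- a[is_out] - outlier_mean + gap

# Bin the shifted values without drawing, then draw with a custom x axis
h <- hist(shifted, breaks = seq(0, gap + 0.01, by = 0.01), plot = FALSE)
plot(h, xaxt = "n", main = "Histogram with compressed outlier", xlab = "value")
axis(1, at = c(0, 0.05, 0.1))                           # true scale near zero
axis(1, at = gap, labels = paste0("//", outlier_mean))  # relabelled outlier bin
```

The original vector a is left untouched, so the outlier stays in the sample; only the plotted copy is shifted.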

Subsetting Outliers from a data frame by using the results of a boxplot diagram

I plotted my data as a boxplot (using geom_boxplot from ggplot) so that the outliers became visible. Afterwards I wanted to remove them from my data, so I used "ggplot_build" to get all the information about the plot and saved it under a new name.
Outlier_boxplot<-ggplot_build(boxplot)
Now it was possible to extract the column with the outliers. In the next step I used the function "subset" to select only the values of my data.frame that are not equal to the extracted outliers.
Without_Outlier_dF<-subset(round(dF[1],digits=3),Test !=c(round(Outlier_boxplot$data[[1]]$outliers[[4]],digits=3)))
That worked out well for nearly all cases. The problem is that sometimes values (even though they look the same) are not left out.
Extract of values data.frame:
-234,347 75,764 93,34 95,237 99,005 100,044 97,924 98,875 98,072 99,569 98,848 98,414 99,33 96,901 99,29 100,359 99,169 97,828 97,146 97,229 94,278 97,146 97,229 94,278
Outliers
-234.347 75.764 93.340 94.278
Results: Outliers removed except for the value 94,278
95,237 99,005 100,044 97,924 98,875 98,072 99,569 98,848 98,414 99,33 96,901 99,29 100,359 99,169 97,828 97,146 97,229 94,278
I already tried to round all values (as you can see) but it didn't help. Do you have any ideas?
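geom_boxplot computes the positions of the upper and lower whiskers in essentially the same way as boxplot.stats (according to the ggplot2 documentation, its hinges are the first and third quartiles, which can differ slightly from the hinges used by boxplot for small samples). You can do the calculation yourself (v is assumed to be your input data vector):
> boxplot.stats(v)
$stats
[1] 93.340 96.069 97.876 99.087 100.359
$n
[1] 24
$conf
[1] 96.90265 98.84935
$out
[1] -234.347 75.764
From the boxplot.stats documentation: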
stats a vector of length 5, containing the extreme of the lower
whisker, the lower ‘hinge’, the median, the upper ‘hinge’ and the
extreme of the upper whisker.
n the number of non-NA observations in the sample.
conf the lower and upper extremes of the ‘notch’ (if(do.conf)). See
the details.
out the values of any data points which lie beyond the extremes of
the whiskers (if(do.out)).
I guess it contains all the data you might need for further analysis.
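One likely explanation for the leftover 94,278 (an inference from the code in the question, not something stated above) is that != compares element by element and silently recycles the shorter vector, so each value is only checked against one of the four outliers. A minimal sketch using the numbers from the question, with %in% as the set-membership alternative:

```r
# Values and outliers as listed in the question (decimal points instead of commas)
v <- c(-234.347, 75.764, 93.34, 95.237, 99.005, 100.044, 97.924, 98.875,
       98.072, 99.569, 98.848, 98.414, 99.33, 96.901, 99.29, 100.359,
       99.169, 97.828, 97.146, 97.229, 94.278, 97.146, 97.229, 94.278)
out <- c(-234.347, 75.764, 93.340, 94.278)

# Element-wise comparison: 'out' is recycled six times along 'v',
# so v[21] (94.278) is compared with out[1] only and therefore kept
kept_neq <- v[v != out]

# Set membership: every value is checked against all outliers
kept_in <- v[!(v %in% out)]

print(94.278 %in% kept_neq)  # TRUE: one copy survives
print(94.278 %in% kept_in)   # FALSE: all copies removed
```

Because the length of v happens to be a multiple of the length of out, R does not even warn about the recycling.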

What method does outline=FALSE use to determine outliers? [duplicate]

In R, I have used the outline=FALSE parameter to exclude outliers when plotting a box and whisker for a particular set. It's worked spectacularly, but leaves me wondering how exactly it determines which elements are outliers.
boxplot(x, horizontal = TRUE, axes = FALSE, outline = FALSE)
An "outlier" in the terminology of box-and-whisker plots is any point in the data set that falls farther than a specified distance from the box, typically 1.5 times the length of the box beyond the 0.25 (lower) or 0.75 (upper) quantile; in a symmetric case that puts the cutoff approximately 4 times as far from the median as the quartile is. To get there, see ?boxplot.stats: first, look at the definition of out in the output
out: the values of any data points which lie beyond the extremes of the whiskers (if(do.out)).
These are the "outliers".
Second, look at the definition of the whiskers, which are based on the coef parameter, which is 1.5 by default:
the whiskers extend to the most extreme data point which is no more than coef times the length of the box away from the box.
Finally, look at the definition of the "hinges", which are the ends of the box:
The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4).
Put these together, and you get outliers defined (approximately) as points that lie more than 1.5 times the length of the box beyond the relevant quartile; in a symmetric case, that is about 4 times the distance between the median and that quartile, measured from the median. The reasons for these somewhat convoluted definitions are (I think) partly historical and partly the desire to have the components of the plots reflect actual values that are present in the data (rather than, say, the halfway point between two data points) as much as possible. (You would probably need to go back to the original literature referenced in the help page for the full justifications and explanations.)
The thing to be careful about is that points defined as "outliers" by this algorithm are not necessarily outliers in the usual statistical sense (e.g. points that are surprisingly extreme based on a particular statistical model of the data). In particular, if you have a big data set you will necessarily see lots of "outliers" (one indication that you might want to switch to a more data-hungry graphical summary such as a violin plot or beanplot).
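To put a rough number on the big-data-set point, the sketch below assumes the data follow a normal distribution and uses the theoretical quartiles in place of the hinges. Both are simplifying assumptions made here for illustration.

```r
# Theoretical quartiles and box length for a standard normal distribution
q <- qnorm(c(0.25, 0.75))
box_length <- diff(q)

# Lower cutoff: 1.5 box lengths below the lower quartile
lower_cutoff <- q[1] - 1.5 * box_length

# Probability of landing beyond either cutoff (the distribution is symmetric)
p_out <- 2 * pnorm(lower_cutoff)
print(p_out)  # about 0.007

# Expected number of flagged points for samples of increasing size
n <- c(100, 10000, 1000000)
print(n * p_out)
```

So even for perfectly well-behaved normal data, roughly 0.7% of the points are expected to be drawn as "outliers", which adds up quickly as the sample grows.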
For boxplot, outliers are the points that lie above or below the "whiskers". By default, the whiskers extend to the most extreme data points that are no more than range times the interquartile range from the box. The default value of range is 1.5, but you can change it, which also changes which points are listed as outliers.
You can also see that with the boxplot.stats function, which performs the computation used by the plot.
For example, if you have the following vector:
v <- c(runif(10), -0.5, -1)
boxplot(v)
By default, only the -1 value is considered an outlier (for a typical draw of the random values). You can see this with boxplot.stats:
boxplot.stats(v)$out
[1] -1
But if you change the range argument (or the coef argument of boxplot.stats), then -1 is no longer considered an outlier:
boxplot(v, range=2)
boxplot.stats(v, coef=2)$out
numeric(0)
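Because runif() gives different values on every run, the exact result above depends on the draw. A deterministic variant, using fixed values chosen here purely for illustration, shows the same effect reproducibly:

```r
# Fixed stand-in for runif(10), plus the two low values from the example
v <- c(seq(0.1, 1, by = 0.1), -0.5, -1)

# Default rule: cutoffs 1.5 box lengths beyond the hinges
out_default <- boxplot.stats(v)$out

# Wider rule: cutoffs 2 box lengths beyond the hinges
out_wide <- boxplot.stats(v, coef = 2)$out

print(out_default)  # only -1 is flagged
print(out_wide)     # nothing is flagged
```

Here the hinges are 0.15 and 0.75, so the lower cutoff moves from -0.75 (coef = 1.5) to -1.05 (coef = 2), which is just enough to bring -1 back inside the whiskers.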
This is admittedly not immediately evident from boxplot(). Look at the range parameter:
this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
So the value of range is used, together with the interquartile range and the box (given by the quartiles), to determine where the whiskers end. And everything outside the whiskers is an outlier.
I'll be the first to agree that this definition is unintuitive. Sadly enough, it is established by now.

Changing the outlier rule in a boxplot

I have constructed some box-plots in R and have several outliers. I know that the default criteria to set outlier limits are:
Q3 + 1.5*IQR
Q1 - 1.5* IQR
However, I would like outliers classified as values that fall outside of the boundaries:
Q3 + 3*IQR
Q1 - 3* IQR
Is it possible to set this in R?
From ?boxplot
range: this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
So set range=3
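As a small illustration of the difference (the data vector is made up for this sketch), boxplot() can return its computed statistics without drawing anything when called with plot = FALSE, which makes the two rules easy to compare:

```r
# Illustrative data (an assumption): one moderate and one extreme value
x <- c(1:10, 20, 40)

# plot = FALSE returns the statistics without drawing the plot
default_rule <- boxplot(x, plot = FALSE)             # range = 1.5
wide_rule    <- boxplot(x, range = 3, plot = FALSE)  # range = 3

print(default_rule$out)  # values outside the 1.5 * IQR boundaries
print(wide_rule$out)     # values outside the 3 * IQR boundaries
print(wide_rule$stats)   # the upper whisker now reaches the moderate value
```

With range = 3 only the extreme value 40 is still drawn as an individual point, while 20 becomes the end of the upper whisker.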
I'd encourage you not to do this without a lot of thought - people expect that the whiskers extend 1.5 IQRs. Changing the range will violate these assumptions and make it easy for people to draw incorrect conclusions from your graphic.
