Boxplot in R uper/lower whiskers - r

I am using the most basic function of Boxplot, boxplot(x, ..., range = 1.5, but if I don't set the rang, and let R use its default value. Something like boxplot(x, ...,) what exactly quantile of the whiskers ? Because I have outliners that is larger or smaller than the upper/lower whiskers. How can I know the exact percentage of the outliners above or below the uper/lower whiskers? In other words, without setting the range, may I know what the percentage of the data is for the uper/lower whiskers?

For example, you could calculate the percentage of utliers as follows:
# Some data with outliers:
d <- rnorm(100)
d[sample(1:100, 10)] <- rnorm(10,mean = 0, sd = 10)
bp <- boxplot(d)
# Get the values of the outliers:
out <- bp$out
# The proportion of outliers:
length(out)/length(d)*100
9

Not entirely sure what your question is, but: ?boxplot says the default value of range is 1.5, and then it says
range: this determines how far the plot whiskers extend out from the
box. If ‘range’ is positive, the whiskers extend to the most
extreme data point which is no more than ‘range’ times the
interquartile range from the box. A value of zero causes the
whiskers to extend to the data extremes.
In other words, the whiskers are not defined as a proportion of the data, but as a multiple of the interquartile range.
If you want to know the proportions, you can use boxplot.stats:
set.seed(101)
x <- runif(100)
bb <- boxplot.stats(x)
c(mean(x<min(bb$stats)),mean(x>max(bb$stats)))
## [1] 0 0
mean(<logical value>) is a shortcut for computing a proportion. Because I have chosen the data from a uniform distribution, there are actually no points beyond the whiskers (confirmed by looking at boxplot(x)). If I were to do re-do this with rcauchy() there would be lots ...

Related

When plotting a curve in R, a piece of the curve gets cut off, not sure why

I am trying to plot this formula. As x approaches 0 from the right, y should be approaching infinity, and so my curve should be going upwards close to y-axis. Instead it gets cut off at y=23 or so.
my_formula = function(x){7.9*x^(-0.5)-1.3}
curve(my_formula,col="red",from=0 ,to=13, xlim=c(0,13),ylim=c(0,50),axes=T, xlab=NA, ylab=NA)
I tried to play with from= parameter, and actually got what I needed when I
put from=-4.8 but I have no idea why this works. in fact x doesn't get less than 0, and from/to should represent the range of x values, Do they? If someone could explain it to me, this would be amazing! Thank you!
By default, curve only chooses 101 x-values within the (from, to) range, set by the default value of the n argument. In your case this means there aren't many values that are close enough to 0 to show the full behaviour of the function. Increasing the number of values that are plotted with something like n=500 helps:
curve(my_formula,col="red",from=0 ,to=13,
xlim=c(0,13),ylim=c(0,50),axes=T, xlab=NA, ylab=NA,
n=500)
This is due mainly to the fact that my_formula(0) is Inf:
So plotting from=0, to=13 in curve means your first 2 values are by default (with 101 points as #Marius notes):
# x
seq(0, 13, length.out=101)[1:2]
#[1] 0.00 0.13
# y
my_formula(seq(0, 13, length.out=101)[1:2])
#[1] Inf 20.61066
And R will not plot infinite values to join the lines from the first point to the second one.
If you get as close to 0 on your x axis as is possible on your system, you can make this work a-okay. For instance:
curve(my_formula, col="red", xlim=c(0 + .Machine$double.eps, 13), ylim=c(0,50))

Method of Outlier Removal for Boxplots

In R, what method is used in boxplot() to remove outliers? In other words, what determines if a given value is an outlier?
Please note, this question is not asking how to remove outliers.
EDIT: Why is this question being downvoted? Please provide comments. The method for outlier removal is not available in any documentation I have come across.
tl;dr outliers are points that are beyond approximately twice the interquartile range away from the median (in a symmetric case). More precisely, points beyond a cutoff equal to the 'hinges' (approx. 1st and 3d quartiles) +/- 1.5 times the interquartile range.
R's boxplot() function does not actually remove outliers at all; all observations in the data set are represented in the plot (unless the outline argument is FALSE). The information on the calculation for which points are plotted as outliers (i.e., as individual points beyond the whiskers) is, implicitly, contained in the description of the range parameter:
range [default 1.5]: this determines how far the plot whiskers extend out from the
box. If ‘range’ is positive, the whiskers extend to the most
extreme data point which is no more than ‘range’ times the
interquartile range from the box. A value of zero causes the
whiskers to extend to the data extremes.
This has to be deconstructed a little bit more: what does "from the box" mean? To figure this out, we need to look at the Details of ?boxplot.stats:
The two ‘hinges’ are versions of the first and third quartile,
i.e., close to ‘quantile(x, c(1,3)/4)' [... see ?boxplot.stats for slightly more detail ...]
The reason for all the complexity/"approximately equal to the quartile" language is that the developers of the boxplot wanted to make sure that the hinges and whiskers were always drawn at points representing actual observations in the data set (whereas the quartiles can be located between observed points, e.g. in the case of data sets with odd numbers of observations).
Example:
set.seed(101)
z <- rnorm(100000)
boxplot(z)
hinges <- qnorm(c(0.25,0.75))
IQR <- diff(qnorm(c(0.25,0.75)))
abline(h=hinges,lty=2,col=4) ## hinges ~ quartiles
abline(h=hinges+c(-1,1)*1.5*IQR,col=2)
## in this case hinges = +/- IQR/2, so whiskers ~ +/- 2*IQR
abline(h=c(-1,1)*IQR*2,lty=2,col="purple")

Set symmetric specific cuts for bimodal data

I have a binomial assymetric distribution which I would like to cut at both ends. The specific part of it is that I would like to calculate symmetric boundaries at the appropriate side of each 'bell'. The figure shows an extreme case of separation between bells for simplicity.
In this case the red cuts were selected by eye and the 1550 blue lines used at each side represent an arbitrary value that could potentially be passed through a function for the trim. My goal would be subset everything between blue lines.
hist(p3_cut$x,50)
abline(v=c(6200,7600),col='red')
abline(v=c(6200-1500,7600+1500),col='blue')
My guess is that the problem here is basically find the 'edges' of each curve. I cannot use half distance between means, I need something that recognizes frequency change from 0 (or very low value) to something relatively high.
A somewhat general answer. Depending on the problem you might need to adjust the binwidth in the density function:
# get density of x and normalize so max is one
dens <- density(x,adjust=0.1)
dens$y <- dens$y / max(dens$y)
# keep all x where density is higher than some fraction of max (here 1%)
min_frac <- 0.01
x_keep <- dens$x[dens$y > 0.01]
# find position of gap in x, and get x just before and after gap
gap_pos <- which.max(diff(x_keep))
left_cut <- x_keep[gap_pos]
right_cut <- x_keep[gap_pos + 1]
Using this code and changing the adjust parameter in the density function I was able to calculate almost perfect cuts at least for this case. I am positive that this approach is flexible enough for most situations that are similar to this one. I show the results for the cuts proposed.

R: area under curve of ogive?

I have an algorithm that uses an x,y plot of sorted y data to produce an ogive.
I then derive the area under the curve to derive %'s.
I'd like to do something similar using kernel density estimation. I like how the upper/lower bounds are smoothed out using kernel densities (i.e. the min and max will extend slightly beyond my hard coded input).
Either way... I was wondering if there is a way to treat an ogive as a type of cumulative distribution function and/or use kernel density estimation to derive a cumulative distribution function given y data?
I apologize if this is a confusing question. I know there is a way to derive a cumulative frequency graph (i.e. ogive). However, I can't determine how to derive a % given this cumulative frequency graph.
What I don't want is an ecdf. I know how to do that, and I am not quite trying to capture an ecdf. But, rather integration of an ogive given two intervals.
I'm not exactly sure what you have in mind, but here's a way to calculate the area under the curve for a kernel density estimate (or more generally for any case where you have the y values at equally spaced x-values (though you can, of course, generalize to variable x intervals as well)):
library(zoo)
# Kernel density estimate
# Set n to higher value to get a finer grid
set.seed(67839)
dens = density(c(rnorm(500,5,2),rnorm(200,20,3)), n=2^5)
# How to extract the x and y values of the density estimate
#dens$y
#dens$x
# x interval
dx = median(diff(dens$x))
# mean height for each pair of y values
h = rollmean(dens$y, 2)
# Area under curve
sum(h*dx) # 1.000943
# Cumulative area
# cumsum(h*dx)
# Plot density, showing points at which density is calculated
plot(dens)
abline(v=dens$x, col="#FF000060", lty="11")
# Plot cumulative area under curve, showing mid-point of each x-interval
plot(dens$x[-length(dens$x)] + 0.5*dx, cumsum(h*dx), type="l")
abline(v=dens$x[-length(dens$x)] + 0.5*dx, col="#FF000060", lty="11")
UPDATE to include ecdf function
To address your comments, look at the two plots below. The first is the empirical cumulative distribution function (ECDF) of the mixture of normal distributions that I used above. Note that the plot of this data looks the same below as it does above. The second is a plot of the ECDF of a plain vanilla normal distribution, mean=0, sd=1.
set.seed(67839)
x = c(rnorm(500,5,2),rnorm(200,20,3))
plot(ecdf(x), do.points=FALSE)
plot(ecdf(rnorm(1000)))

R question about plotting probability/density histogram the right way

I have a following matrix [500,2], so we have 500 rows and 2 columns, the left one gives us the index of X observations, and the right one gives the probability with which this X comes true, so - a typical probability density relationship.
So, my question is, how to plot the histogram the right way, so that the x-axis is the x-index, and the y-axis is the density(0.01-1.00). The bandwidth of the estimator is 0.33.
Thanks in advance!
the end of the whole data looks like this: just for a little orientation
[490,] 2.338260830 0.04858685
[491,] 2.347839477 0.04797310
[492,] 2.357418125 0.04736149
[493,] 2.366996772 0.04675206
[494,] 2.376575419 0.04614482
[495,] 2.386154067 0.04553980
[496,] 2.395732714 0.04493702
[497,] 2.405311361 0.04433653
[498,] 2.414890008 0.04373835
[499,] 2.424468656 0.04314252
[500,] 2.434047303 0.04254907
#everyone,
yes, I have made the estimation before, so.. the bandwith is what I mentioned, the data is ordered from low to high values, so respecively the probability at the beginning is 0,22, at the peak about 0,48, at the end 0,15.
The line with the density is plotted like a charm but I have to do in addition is to plot a histogram! So, how I can do this, ordering the blocks properly(ho the data to be splitted in boxes etc..)
Any suggestions?
Here is a part of the data AFTER the estimation, all values are discrete, so I assume histogram can be created.., hopefully.
[491,] 4.956164 0.2618131
[492,] 4.963014 0.2608723
[493,] 4.969863 0.2599309
[494,] 4.976712 0.2589889
[495,] 4.983562 0.2580464
[496,] 4.990411 0.2571034
[497,] 4.997260 0.2561599
[498,] 5.004110 0.2552159
[499,] 5.010959 0.2542716
[500,] 5.017808 0.2533268
[501,] 5.024658 0.2523817
Best regards,
appreciate the fast responses!(bow)
What will do the job is to create a histogram just for the indexes, grouping them in a way x25/x50 each, for instance...and compute the average probability for each 25 or 50/100/150/200/250 etc as boxes..?
Assuming the rows are in order from lowest to highest value of x, as they appear to be, you can use the default plot command, the only change you need is the type:
plot(your.data, type = 'l')
EDIT:
Ok, I'm not sure this is better than the density plot, but it can be done:
x = dnorm(seq(-1, 1, length = 500))
x.bins = rep(1:50, each = 10)
bars = aggregate(x, by = list(x.bins), FUN = sum)[,2]
barplot(bars)
In your case, replace x with the probabilities from the second column of your matrix.
EDIT2:
On second thought, this only makes sense if your 500 rows represent discrete events. If they are instead points along a continuous distribution function adding them together as I have done is incorrect. Mathematically I don't think you can produce the binned probability for a range using only a few points from within that range.
Assuming M is the matrix. wouldn't this just be :
plot(x=M[ , 1], y = M[ , 2] )
You have already done the density estimation since this is not the original data.

Resources