How to account for outliers in a histogram? - R/Matlab

I am wondering if there is a way to account for outliers in a histogram plot. I want to plot the frequencies of a random variable whose values are very small and distributed around zero. However, in most of the cases I am considering there is also an outlier that complicates things. Is there a way to adjust the scale of the x-axis in R/Matlab so that I can capture the distribution of the random variable and also show the outlier? The usual ways of producing the plot result in a scale on which all values appear to be zero, and I want to show how they are distributed around zero. Ideally, I would like the scale around zero to resolve very small numbers and then, after a gap (which does not necessarily have to be proportional to the actual distance from zero), a bin to indicate the value of the outlier. I do not want to remove the outlier from the sample.
Is such a thing possible in R/Matlab? Any other suggestions would be welcome.
Edit: The problem is not in identifying the outliers and using a different color for them. The problem is in adjusting the scales on the x-axis so I can observe the distribution of the variable as well as have the outlier included in the plot.

The next code will do the job, but you need to change the Xticklabels of the axes in order to make them show the real value of the outliers.
A=rand(1000,1)*0.1;
A(1:10)=10;
% Modify the data for plotting purposes: move the outliers closer to the bulk.
expected_maximum_value=1; % You could compute this using 3*sigma, maybe?
distance_to_outliers=0.5;
outlier_mean=mean(A(A>expected_maximum_value));
A(A>expected_maximum_value)=A(A>expected_maximum_value)-outlier_mean+distance_to_outliers;
% plot
h=histogram(A,'BinWidth',0.01);
%% trick the X axis
ax=gca;
ax.XTickLabel{end-1}=[ax.XTickLabel{end-1} '//'];
ax.XTickLabel{end}=['//' num2str(outlier_mean)];
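
Since the question also asks about R, here is a minimal sketch of the same trick in base R. It is not part of the original answer; the simulated data, the cutoff of 1 and the names cutoff, gap_pos and a_plot are assumptions chosen only for illustration.

```r
# Sketch: same idea as the MATLAB code above, in base R (simulated data).
set.seed(1)
a <- runif(1000) * 0.1          # bulk of the sample, close to zero
a[1:10] <- 10                   # ten identical outliers

cutoff <- 1                     # assumed threshold; pick one that suits your data
gap_pos <- 0.5                  # where the outliers are drawn on the compressed axis
is_out <- a > cutoff
outlier_mean <- mean(a[is_out])

# Shift only a copy of the outliers; the raw data stay untouched in `a`.
a_plot <- a
a_plot[is_out] <- a[is_out] - outlier_mean + gap_pos

h <- hist(a_plot, breaks = seq(0, gap_pos + 0.05, by = 0.01),
          xaxt = "n", xlab = "value", main = "Outliers drawn after an axis break")

# Relabel the axis by hand: true values near zero, "//" for the break,
# and the real outlier value at the shifted position.
axis(1, at = c(0, 0.05, 0.1), labels = c("0", "0.05", "0.1"))
axis(1, at = (0.1 + gap_pos) / 2, labels = "//", tick = FALSE)
axis(1, at = gap_pos, labels = format(outlier_mean))
```

No observation is dropped: the outliers are only drawn at a shifted position, and the label at that position reports their real mean value.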

Related

Method of Outlier Removal for Boxplots

In R, what method is used in boxplot() to remove outliers? In other words, what determines if a given value is an outlier?
Please note, this question is not asking how to remove outliers.
EDIT: Why is this question being downvoted? Please provide comments. The method for outlier removal is not available in any documentation I have come across.
tl;dr outliers are points that are beyond approximately twice the interquartile range away from the median (in a symmetric case). More precisely, points beyond a cutoff equal to the 'hinges' (approx. 1st and 3rd quartiles) +/- 1.5 times the interquartile range.
R's boxplot() function does not actually remove outliers at all; all observations in the data set are represented in the plot (unless the outline argument is FALSE). The information on the calculation for which points are plotted as outliers (i.e., as individual points beyond the whiskers) is, implicitly, contained in the description of the range parameter:
range [default 1.5]: this determines how far the plot whiskers extend out from the
box. If ‘range’ is positive, the whiskers extend to the most
extreme data point which is no more than ‘range’ times the
interquartile range from the box. A value of zero causes the
whiskers to extend to the data extremes.
This has to be deconstructed a little bit more: what does "from the box" mean? To figure this out, we need to look at the Details of ?boxplot.stats:
The two ‘hinges’ are versions of the first and third quartile,
i.e., close to ‘quantile(x, c(1,3)/4)’ [... see ?boxplot.stats for slightly more detail ...]
The reason for all the complexity/"approximately equal to the quartile" language is that the developers of the boxplot wanted to make sure that the hinges and whiskers were always drawn at points representing actual observations in the data set (whereas the quartiles can be located between observed points, e.g. in the case of data sets with odd numbers of observations).
Example:
set.seed(101)
z <- rnorm(100000)
boxplot(z)
hinges <- qnorm(c(0.25,0.75))
IQR <- diff(qnorm(c(0.25,0.75)))
abline(h=hinges,lty=2,col=4) ## hinges ~ quartiles
abline(h=hinges+c(-1,1)*1.5*IQR,col=2)
## in this case hinges = +/- IQR/2, so whiskers ~ +/- 2*IQR
abline(h=c(-1,1)*IQR*2,lty=2,col="purple")
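
The rule can also be checked numerically against boxplot.stats(), which is what boxplot() uses to compute these values. The following is a small sketch on simulated data; the names fences and manual_out are only illustrative.

```r
set.seed(101)
z <- rnorm(1000)

s <- boxplot.stats(z)     # s$stats: lower whisker, lower hinge, median, upper hinge, upper whisker
hinges <- s$stats[c(2, 4)]

# Cutoffs: hinges -/+ 1.5 times the distance between the hinges
fences <- hinges + c(-1, 1) * 1.5 * diff(hinges)

# Points beyond the cutoffs are the ones drawn individually
manual_out <- z[z < fences[1] | z > fences[2]]
print(setequal(manual_out, s$out))

# The whiskers end at the most extreme observations inside the cutoffs
whiskers <- range(z[z >= fences[1] & z <= fences[2]])
print(all.equal(whiskers, s$stats[c(1, 5)]))
```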

Is there a way to plot a frequency histogram from a continuous variable?

I have DNA segment lengths (relative to chromosome arm, 251296 entries), as such:
0.24592963
0.08555043
0.02128725
...
The range goes from 0 to 2, and I would like to make a continuous relative frequency plot. I know that I could bin the values and use a histogram, but I would like to show continuity. Is there a simple strategy? If not, I'll use binning. Thank you!
EDIT:
I have created a binning vector with 40 equally spaced values between 0 and 2 (both included). For simplicity's sake, is there a way to round each of the 251296 entries to the closest value within the binning vector? Thank you!
Given that most of your values are not duplicated and thus don't have an easy way to derive a value for plotting on the y-axis, I'd probably go for a density plot. This will highlight dense segment lengths i.e. where you have lots of segment lengths occurring near each other.
d <- c(0.24592963, 0.08555043, 0.02128725)
plot(density(d), xlab="DNA Segment Length", xlim=c(0,2))
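
For the follow-up in the EDIT (rounding each entry to the closest value of a 40-point binning vector), one possible sketch is below. It relies on the grid being equally spaced; the uniform stand-in data and the names grid, step, idx and rel_freq are assumptions for illustration, not the real segment lengths.

```r
set.seed(42)
d <- runif(1000, 0, 2)               # stand-in data in [0, 2], NOT the real values

grid <- seq(0, 2, length.out = 40)   # 40 equally spaced values, both ends included
step <- grid[2] - grid[1]

# Index of the nearest grid value (valid because the grid is equally spaced)
idx <- round((d - grid[1]) / step) + 1
nearest <- grid[idx]

# Relative frequency of each grid value, including values that never occur
rel_freq <- table(factor(idx, levels = seq_along(grid))) / length(d)
plot(grid, as.numeric(rel_freq), type = "h",
     xlab = "DNA Segment Length", ylab = "relative frequency")
```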

Understanding what the kde2d z values mean?

I have two data sets that I am comparing using a kde2d contour plot on a log10 scale,
Here I will use an example of the following data sets,
library(MASS) # provides kde2d
b<-log10(rgamma(1000,6,3))
a<-log10((rweibull(1000,8,2)))
density<-kde2d(a,b,n=100)
filled.contour(density,color.palette=colorRampPalette(c('white','blue','yellow','red','darkred')))
This produces the following plot,
Now my question is what do the z values on the legend actually mean? I know they represent where most of the data lies, but 0-15 confuses me. I thought it could be a percentage, but without the log10 scale I have values ranging from 0-1? And I have also produced plots with scales 1-1.2, 1-2 using my real data.
The colors represent the values of the estimated density function, apparently ranging from 0 to 15. Just like with your other question about the odd looking linear regression I can relate to your confusion.
You just have to understand that a density's integral over the full domain has to be 1, so you can use it to calculate the probability of an observation falling into a specific region.
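
A rough numerical check of that statement: the sketch below sums z times the area of one grid cell, which should come out close to 1 even though individual z values may exceed 1. The widened lims (so that little of the density is cut off at the edges) and the names dens, total and p_region are assumptions of this sketch.

```r
library(MASS)   # provides kde2d
set.seed(1)
b <- log10(rgamma(1000, 6, 3))
a <- log10(rweibull(1000, 8, 2))

# Evaluate the density on a grid slightly wider than the data range
dens <- kde2d(a, b, n = 100,
              lims = c(range(a) + c(-0.5, 0.5), range(b) + c(-0.5, 0.5)))

dx <- diff(dens$x[1:2])          # grid spacing in x
dy <- diff(dens$y[1:2])          # grid spacing in y

# Riemann-sum approximation of the integral over the whole grid
total <- sum(dens$z) * dx * dy

# Probability of a region: here, all points with a above its median
# (rows of dens$z correspond to dens$x, columns to dens$y)
p_region <- sum(dens$z[dens$x > median(a), ]) * dx * dy

print(c(total = total, p_region = p_region, max_z = max(dens$z)))
```

Because the density is measured per unit area, data squeezed into a small region (as happens on a log10 scale) must have large z values for the total to still be 1.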

Bar plot with broken y-axis and log x-axis

I am looking to present a variable as a bar plot with the caveat that the groups I am trying to plot (the size of an object) vary over several orders of magnitude. The other complication of the data is that the variable y also varies over several orders of magnitude when positive as well as having negative values. I usually think in pictures so I have sketched something along the lines that I am looking for below (the colour would simply be a function of the distance from zero, i.e. white zero, dark blue very negative, dark red very positive etc):
Here is a real case of the data if required:
x <- c(1.100e-08, 1.200e-08, 1.300e-08, 1.400e-08, 1.600e-08, 1.700e-08, 1.900e-08, 2.100e-08, 2.300e-08, 2.600e-08, 3.100e-08, 3.500e-08, 4.200e-08, 4.700e-08, 5.200e-08, 5.800e-08, 6.400e-08, 7.100e-08, 7.900e-08, 8.800e-08, 9.800e-08, 1.100e-07, 1.230e-07, 1.380e-07, 1.550e-07, 1.760e-07, 3.250e-07, 3.750e-07, 4.250e-07, 4.750e-07, 5.400e-07, 6.150e-07, 6.750e-07, 7.500e-07, 9.000e-07, 1.150e-06, 1.450e-06, 1.800e-06, 2.250e-06, 2.750e-06, 3.250e-06, 3.750e-06, 4.500e-06, 5.750e-06, 7.000e-06, 8.000e-06, 9.250e-06, 1.125e-05, 1.375e-05, 1.625e-05, 1.875e-05, 2.250e-05, 2.750e-05, 3.100e-05)
y <-c(1.592140e+01, -1.493541e+01, -6.255603e+00, -2.191637e+00, -1.274086e+00, -1.343391e+00, -8.869018e-01, -7.717447e-01, -6.140710e-01, -5.637220e-01, -5.404424e-01, -3.473077e-01, -2.279666e-01, -1.945254e-01, -2.485636e-01, -2.363181e-01, -2.197054e-01, -2.119314e-01, -1.897220e-01, -1.656779e-01, -1.478176e-01, -1.364191e-01, -1.297830e-01, -1.408082e-01, -1.514742e-01, -1.311300e-01, -1.358422e-01, -2.718636e+00, -2.231532e+00, -3.479395e+00, -3.572720e+00, -2.297957e+00, -3.265428e+00, -5.449620e+00, -7.741435e+00, -1.172256e+01, 9.368365e+00, 1.078983e+02, 9.542029e+01, 1.484089e+02, 2.293383e+02, 3.678836e+02, 7.965286e+02, 1.349151e+03, 1.577808e+04, 4.554271e+05, 1.821730e+06, 8.092310e+04, 1.015619e+06, 2.113788e+06, 5.208331e+06, 4.534863e+06, 8.086026e+06, 1.577413e+07)
I could also plot this as a scatterplot with a broken axis, but I am currently looking for a nice approach to display such data. What matters to me is highlighting the approximate value of x at which y changes sign, as well as the variability and magnitude of both the positive and negative values. Any tips and advice on plotting such data would be great.
Edit based upon comments
I realise that on my graph x and y are the wrong way around, apologies for that. Parameter x should indeed be on the x-axis and parameter y on the y-axis.
Taking your suggestions on board, I would be better off plotting this data as a scatterplot. Even so, I still need to break my axis at a relevant value of y (not x as shown in the figure), with a log scale above this value and a linear scale below. Somewhere below the smallest "positive" value of y seems sensible for this break. Can this be done using base R?
I guess something like this, but with the split on the y-axis rather than the x-axis, and in R of course.

Probability distribution values plot

I have
probability values: 0.06,0.06,0.1,0.08,0.12,0.16,0.14,0.14,0.08,0.02,0.04 ,summing up to 1
the corresponding intervals where a stochastic variable may take its value with the corresponding probability from the above list:
126,162,233,304,375,446,517,588,659,730,801,839
How can I plot the probability distribution?
I would like the interval boundaries on the x-axis and, between each pair of boundaries, a histogram bar showing the probability value.
Thanks.
How about
x <- c(126,162,233,304,375,446,517,588,659,730,801,839)
p <- c(0.06,0.06,0.1,0.08,0.12,0.16,0.14,0.14,0.08,0.02,0.04)
plot(x,c(p,0),type="s")
lines(x,c(0,p),type="S")
rect(x[-1],0,x[-length(x)],p,col="lightblue")
for a quick answer? (With the rect included you might not need the lines call and might be able to change it to plot(x,p,type="n"). As usual I would recommend par(bty="l",lty=1) for my preferred graphical defaults ...)
(Explanation: "s" and "S" are two different stair-step types (see Details in ?plot): I used them both to get both the left and right boundaries of the distribution.)
edit: In your comments you say "(it) doesn't look like a histogram". It's not quite clear what you want. I added rectangles in the example above -- maybe that does it? Or you could do
b <- barplot(p,width=diff(x),space=0)
but getting the x-axis labels right is a pain.
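
One way to handle the labels, as a sketch: with space=0 the first bar starts at 0 and each bar is as wide as its interval, so the bar edges sit at x - x[1] in barplot coordinates. The names w, mids, edges and dens are only illustrative.

```r
x <- c(126,162,233,304,375,446,517,588,659,730,801,839)
p <- c(0.06,0.06,0.1,0.08,0.12,0.16,0.14,0.14,0.08,0.02,0.04)
w <- diff(x)                        # interval widths

mids <- barplot(p, width = w, space = 0, ylab = "probability")

# Bar edges in barplot coordinates, labelled with the original boundaries
edges <- x - x[1]
axis(1, at = edges, labels = x, las = 2)

# The intervals are not all equally wide (the first and last are narrower),
# so a histogram in the strict sense would plot p / w instead of p
dens <- p / w
print(sum(dens * w))                # total probability
```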
