Understanding what the kde2d z values mean? - r

I have two data sets that I am comparing using a ked2d contour plot on a log10 scale,
Here I will use an example of the following data sets,
b<-log10(rgamma(1000,6,3))
a<-log10((rweibull(1000,8,2)))
density<-kde2d(a,b,n=100)
filled.contour(density,color.palette=colorRampPalette(c('white','blue','yellow','red','darkred')))
This produces the following plot,
Now my question is what does the z values on the legend actually mean? I know it represents where most the data lies but 0-15 confuses me. I thought it could be a percentage but without the log10 scale I have values ranging from 0-1? And I have also produced plots with scales 1-1.2, 1-2 using my real data.

The colors represent the the values of the estimated density function ranging from 0 to 15 apparently. Just like with your other question about the odd looking linear regression I can relate to your confusion.
You just have to understand that a density's integral over the full domain has to be 1, so you can use it to calculate the probability of an observation falling into a specific region.

Related

Plotting a probability mass function for a poisson distribution

uppose that i have a poisson distribution with mean of 6 i would like to plot a probability mass function which includes an overlay of the approximating normal density.
This is what i have tried
plot( dpois( x=0:10, lambda=6 ))
this produces
which is wrong since it doesnt contain an overlay of approxiamating noral density
How do i go about this?
Something like what you seem to be asking for (I'm outlining the commands and the basic ideas, but checking the help on the functions and trying should fill in the remaining details):
taking a wider range of x-values (out to at least 13 or so) and use xlim to extend the plot slightly into the negatives (maybe to -1.5) and
plotting the pmf of the Poisson with solid dots (similar to your command but with pch=16 as an argument to plot) with a suitable color, then
call points with the same x and y arguments as above and have type=h and lty=3 to get vertical dotted lines (to give a clear impression of the relative heights, somewhat akin to the appearance of a Cleveland dot-chart); I'd use the same colour as the dots or a slightly lighter/greyer version of the dot-colour
use curve to draw the normal curve with the same mean and standard deviation as the Poisson with mean 6 (see details at the Wikipedia page for the Poisson which gives the mean and variance), but across the wider range we plotted; I'd use a slightly contrasting colour for that.
I'd draw a light x-axis in (e.g. using abline with the h argument)
Putting all those suggestions together:
(However, while it's what you're asking for it's not strictly a suitable way to compare discrete and continuous variables since density and pmf are not on the same scale, since density is not probability -- the "right" comparison between a Poisson and an approximating normal would be on the scale of the cdfs so you compare like with like -- they'd both be on the scale of probabilities then)

How to account for outliers in a histogram? - R/Matlab

I am wondering if there is a way to account for outlier in a histogram plot. I want to plot the frequencies of a random variable, which is very small and distributed around zero. However, in most of the cases I am considering I also have an outlier that complicates things. Is there a way to adjust the scale of the x axis in R/Matlab so that I can capture the distribution of the random variable I am considering and also show the outlier? Because normal ways to obtain the plot result in such a scale that all values are considered to be zero, and I want to show how they are distributed around zero. So ideally I would like to have the scales around zero accounting for very small numbers and than after a gap (which does not necessarily have to be proportional to the actual distance from zero) a bin to indicate the value of the outlier. And I do not want to remove the outlier from the sample.
Is such a thing possible in R/Matlab? Any other suggestions would be welcome.
Edit: The problem is not in identifying the outliers and using a different color for them. The problem is in adjusting the scales on the x-axis so I can observe the distribution of the variable as well as have the outlier included in the plot.
The next code will do the job, but you need to change the Xticklabels of the axes in order to make them show the real value of the outliers.
A=rand(1000,1)*0.1;
A(1:10)=10;
% modify the data for plotting pourposes. Get the outliers closer
expected_maximum_value=1; % You can compute this useg 3*sigma maybe?
distance_to_outliers=0.5;
outlier_mean=mean(A(A>expected_maximum_value));
A(A>expected_maximum_value)=A(A>expected_maximum_value)-outlier_mean+distance_to_outliers;
% plot
h=histogram(A,'BinWidth',0.01)
%% trick the X axis
ax=gca;
ax.XTickLabel{end-1}=[ax.XTickLabel{end-1} '//'];
ax.XTickLabel{end}=['//' num2str(outlier_mean)];

Is there a way to plot a frequency histogram from a continuous variable?

I have DNA segment lengths (relative to chromosome arm, 251296 entries), as such:
0.24592963
0.08555043
0.02128725
...
The range goes from 0 to 2, and I would like to make a continuous relative frequency plot. I know that I could bin the values and use a histogram, but I would like to show continuity. Is there a simple strategy? If not, I'll use binning. Thank you!
EDIT:
I have created a binning vector with 40 equally spaced values between 0 and 2 (both included). For simplicity's sake, is there a way to round each of the 251296 entries to the closest value within the binning vector? Thank you!
Given that most of your values are not duplicated and thus don't have an easy way to derive a value for plotting on the y-axis, I'd probably go for a density plot. This will highlight dense segment lengths i.e. where you have lots of segment lengths occurring near each other.
d <- c(0.24592963, 0.08555043, 0.02128725)
plot(density(d), xlab="DNA Segment Length", xlim=c(0,2))

Bar plot with broken y-axis and log x-axis

I am looking to present a variable as a bar plot with the caveat that the groups I am trying to plot (the size of an object) vary over several orders of magnitude. The other complication of the data is that the variable y also varies over several orders of magnitude when positive as well as having negative values. I usually think in pictures so I have sketched something along the lines that I am looking for below (the colour would simply be a function of the distance from zero, i.e. white zero, dark blue very negative, dark red very positive etc):
Here is a real case of the data if required:
x <- c(1.100e-08, 1.200e-08, 1.300e-08, 1.400e-08, 1.600e-08, 1.700e-08, 1.900e-08, 2.100e-08, 2.300e-08, 2.600e-08, 3.100e-08, 3.500e-08, 4.200e-08, 4.700e-08, 5.200e-08, 5.800e-08, 6.400e-08, 7.100e-08, 7.900e-08, 8.800e-08, 9.800e-08, 1.100e-07, 1.230e-07, 1.380e-07, 1.550e-07, 1.760e-07, 3.250e-07, 3.750e-07, 4.250e-07, 4.750e-07, 5.400e-07, 6.150e-07, 6.750e-07, 7.500e-07, 9.000e-07, 1.150e-06, 1.450e-06, 1.800e-06, 2.250e-06, 2.750e-06, 3.250e-06, 3.750e-06, 4.500e-06, 5.750e-06, 7.000e-06, 8.000e-06, 9.250e-06, 1.125e-05, 1.375e-05, 1.625e-05, 1.875e-05, 2.250e-05, 2.750e-05, 3.100e-05)
y <-c(1.592140e+01, -1.493541e+01, -6.255603e+00, -2.191637e+00, -1.274086e+00, -1.343391e+00, -8.869018e-01, -7.717447e-01, -6.140710e-01, -5.637220e-01, -5.404424e-01, -3.473077e-01, -2.279666e-01, -1.945254e-01, -2.485636e-01, -2.363181e-01, -2.197054e-01, -2.119314e-01, -1.897220e-01, -1.656779e-01, -1.478176e-01, -1.364191e-01, -1.297830e-01, -1.408082e-01, -1.514742e-01, -1.311300e-01, -1.358422e-01, -2.718636e+00, -2.231532e+00, -3.479395e+00, -3.572720e+00, -2.297957e+00, -3.265428e+00, -5.449620e+00, -7.741435e+00, -1.172256e+01, 9.368365e+00, 1.078983e+02, 9.542029e+01, 1.484089e+02, 2.293383e+02, 3.678836e+02, 7.965286e+02, 1.349151e+03, 1.577808e+04, 4.554271e+05, 1.821730e+06, 8.092310e+04, 1.015619e+06, 2.113788e+06, 5.208331e+06, 4.534863e+06, 8.086026e+06, 1.577413e+07)
I could also plot this as a scatterplot with broken axis but I am currently playing with the a nice approach to display such data- important for me is highlighting at the approximate value of x that y changes sign as well as the variability and magnitude of both the positive and negative values. Any tips and advice you have plotting such data would be great.
Edit based upon comments
I realise that on my graph x and y are the wrong way around, apologies for that. Parameter x should indeed be on the x-axis and parameter y on the y-axis.
Taking on board your suggestions I would be better to plot this data as a scatterplot. Accepting that I still need to break my axis at a relevant value of y (not x as shown in the figure) and have a log scale above this value and linear scale below. Somewhere below the smallest "positive" value of y seems sensible for this break. Can this be done using base r?
I guess something like this but with the split on the y-axis rather than the x-axis and in r of course.

R question about plotting probability/density histogram the right way

I have a following matrix [500,2], so we have 500 rows and 2 columns, the left one gives us the index of X observations, and the right one gives the probability with which this X comes true, so - a typical probability density relationship.
So, my question is, how to plot the histogram the right way, so that the x-axis is the x-index, and the y-axis is the density(0.01-1.00). The bandwidth of the estimator is 0.33.
Thanks in advance!
the end of the whole data looks like this: just for a little orientation
[490,] 2.338260830 0.04858685
[491,] 2.347839477 0.04797310
[492,] 2.357418125 0.04736149
[493,] 2.366996772 0.04675206
[494,] 2.376575419 0.04614482
[495,] 2.386154067 0.04553980
[496,] 2.395732714 0.04493702
[497,] 2.405311361 0.04433653
[498,] 2.414890008 0.04373835
[499,] 2.424468656 0.04314252
[500,] 2.434047303 0.04254907
#everyone,
yes, I have made the estimation before, so.. the bandwith is what I mentioned, the data is ordered from low to high values, so respecively the probability at the beginning is 0,22, at the peak about 0,48, at the end 0,15.
The line with the density is plotted like a charm but I have to do in addition is to plot a histogram! So, how I can do this, ordering the blocks properly(ho the data to be splitted in boxes etc..)
Any suggestions?
Here is a part of the data AFTER the estimation, all values are discrete, so I assume histogram can be created.., hopefully.
[491,] 4.956164 0.2618131
[492,] 4.963014 0.2608723
[493,] 4.969863 0.2599309
[494,] 4.976712 0.2589889
[495,] 4.983562 0.2580464
[496,] 4.990411 0.2571034
[497,] 4.997260 0.2561599
[498,] 5.004110 0.2552159
[499,] 5.010959 0.2542716
[500,] 5.017808 0.2533268
[501,] 5.024658 0.2523817
Best regards,
appreciate the fast responses!(bow)
What will do the job is to create a histogram just for the indexes, grouping them in a way x25/x50 each, for instance...and compute the average probability for each 25 or 50/100/150/200/250 etc as boxes..?
Assuming the rows are in order from lowest to highest value of x, as they appear to be, you can use the default plot command, the only change you need is the type:
plot(your.data, type = 'l')
EDIT:
Ok, I'm not sure this is better than the density plot, but it can be done:
x = dnorm(seq(-1, 1, length = 500))
x.bins = rep(1:50, each = 10)
bars = aggregate(x, by = list(x.bins), FUN = sum)[,2]
barplot(bars)
In your case, replace x with the probabilities from the second column of your matrix.
EDIT2:
On second thought, this only makes sense if your 500 rows represent discrete events. If they are instead points along a continuous distribution function adding them together as I have done is incorrect. Mathematically I don't think you can produce the binned probability for a range using only a few points from within that range.
Assuming M is the matrix. wouldn't this just be :
plot(x=M[ , 1], y = M[ , 2] )
You have already done the density estimation since this is not the original data.

Resources