Detect multimodal distribution and split the data in R

I have a data set with more than 10000 distributions that look like the ones in red. I want to compare each of them with a reference distribution like the one in blue. Because some are unimodal and some are multimodal, I cannot use a t-test for all of them. So I am trying to detect multimodal distributions in order to apply a conditional test (t-test for a normal distribution, Mann-Whitney for a multimodal distribution; if you have any other idea, please let me know). Is there any way to detect a multimodal distribution?
I am also thinking about splitting the modes when I have a multimodal distribution and comparing each mode to the reference. Is this possible? I found this SO link Calculate the modes in a multimodal distribution in R but didn't find anything more recent.
I tried mclust to find how many modes there are, but it doesn't work well, as it finds 2 modes even when the distribution looks unimodal.
library(mclust)
clust <- Mclust(data$sample_frequency)
I also tried dip.test
library(diptest)
dip.test(b$sample_frequency)
but again the p-value does not always look right (for example, plot 77 is significant at p=0.001, whereas plot 79 comes out at p=0.076).
Any help/thought is welcome!
Thanks!
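One base-R heuristic that touches both questions (detecting several modes and splitting on them) is to count the peaks of a kernel density estimate and cut the data at the valleys between them. The sketch below is only an illustration and not a formal test like dip.test: the helper names find_modes and split_by_modes, the 10% relative-height cutoff, and the toy data are all assumptions made here, and the result depends on the bandwidth that density() picks.

```r
# Hypothetical helper: locate peaks of a kernel density estimate.
# Peaks lower than min_rel_height * (highest peak) are ignored as noise.
find_modes <- function(x, min_rel_height = 0.1, ...) {
  d <- density(x, ...)
  slope <- sign(diff(d$y))
  idx <- which(slope != 0)            # drop flat steps so a plateau counts once
  s <- slope[idx]
  top <- idx[which(diff(s) == -2)] + 1  # rising step followed by a falling step
  top <- top[d$y[top] >= min_rel_height * max(d$y)]
  list(density = d, peak_index = top, modes = d$x[top])
}

# Hypothetical helper: split x at the lowest density point between
# each pair of neighbouring peaks.
split_by_modes <- function(x, ...) {
  m <- find_modes(x, ...)
  p <- m$peak_index
  if (length(p) < 2) return(list(x))
  cuts <- vapply(seq_len(length(p) - 1), function(i) {
    seg <- p[i]:p[i + 1]
    m$density$x[seg[which.min(m$density$y[seg])]]
  }, numeric(1))
  unname(split(x, findInterval(x, cuts)))
}

# Toy data (assumed, not from the question): two well separated groups
x <- c(qnorm(ppoints(250), mean = -3), qnorm(ppoints(250), mean = 3))
modes  <- find_modes(x)$modes
groups <- split_by_modes(x)
length(modes)     # number of detected modes
lengths(groups)   # size of each group after splitting
```

Each element of groups could then be compared to the reference separately. With noisy real data the cutoff and the bandwidth (for example the adjust argument of density()) would likely need tuning.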

Related

Can one set distributions to give only the distribution mean while calibrating a model?

I am trying to calibrate parameters in a probabilistic cost-effective model I built. The model is a discrete state-transition (Markov) model. Before building models in R, I used the software TreeAge Pro. There one could set the model to run using the distribution means for each parameter, instead of drawing random values from the distributions. This made it easy to see how varying single input parameters affected certain outcome variables.
My model uses many different distribution draws with Beta, Gamma and Normal distributions (rbeta, rgamma, rnorm). I would like to be able to set them to take their mean value using one command and setting them back to draw values with another. Does something like this exist?
E.g. the Beta distribution rbeta(1, 45, 55) would then take the value 0.45, when it is called, as long as it is set to take the mean.
I tried temporarily replacing the rbeta(1,, rgamma(1, etc. and parts of the distributions with qbeta(0.5 and qgamma(0.5 but this does not give me the means. For qbeta(0.5, 45,55) it gave me 0.4496653 instead of 0.45; for qbeta(0.5, 1,99) it gave me 0.006977 instead of 0.01.
The question is if there is a function in R that tells all distributions that draw random values to return the mean instead of random values, thereby avoiding having to replace every single distribution manually with its mean value.
I appreciate any hint on how to set the distributions to take their mean values. Does anyone know of any package that would be able to help?
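As far as base R goes, there does not appear to be a switch that makes rbeta, rgamma and rnorm return their means. Note also that qbeta(0.5, ...) returns the median, which is why it differs from the mean for skewed distributions. One possible workaround, sketched here under assumptions, is to call thin wrappers that check a global option. The wrapper names (draw_beta, draw_gamma, draw_norm) and the option name model.use_means are hypothetical.

```r
# Hypothetical wrappers: draw randomly by default, return the distribution
# mean when options(model.use_means = TRUE) is set.
use_means <- function() isTRUE(getOption("model.use_means"))

draw_beta <- function(n, shape1, shape2) {
  # Mean of Beta(shape1, shape2) is shape1 / (shape1 + shape2)
  if (use_means()) rep(shape1 / (shape1 + shape2), n) else rbeta(n, shape1, shape2)
}
draw_gamma <- function(n, shape, rate = 1) {
  # Mean of Gamma(shape, rate) is shape / rate
  if (use_means()) rep(shape / rate, n) else rgamma(n, shape, rate = rate)
}
draw_norm <- function(n, mean = 0, sd = 1) {
  if (use_means()) rep(mean, n) else rnorm(n, mean, sd)
}

options(model.use_means = TRUE)    # deterministic run
draw_beta(1, 45, 55)               # mean of Beta(45, 55)
draw_beta(1, 1, 99)                # mean of Beta(1, 99)
options(model.use_means = FALSE)   # back to random draws
draw_beta(1, 45, 55)
```

This still requires replacing the calls once, but afterwards a single options() call toggles the whole model between mean values and random draws.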

R identifying type of frequency distribution

I am interested in frequency distributions that are not normally distributed.
If I have a frequency distribution table which is not normally distributed, is there a function or package that will identify the type of distribution for me?
You can use the fitdistr function (in the MASS package, I think) and check for yourself if you find a 'fitting' distribution. However, I suggest that you plot the data first and see what it looks like. This approach is generally not recommended, as you can always use different parameters to fit a distribution and thus confuse one distribution with another. If you have found a suitable distribution, you should test it against the data.
Edit: For instance, a normal distribution may look like a Poisson distribution. Fitting is in my opinion only useful if you have enough random variables. Otherwise just draw variables from your data if you need to.
You can always try to test whether a distribution is adequate for your data with QQ plot. If you have data that is dynamic, I would suggest that you use ECDF (Empirical Cumulative Distribution Function) which will give you more precise distributions as your data grows. You can use ECDF in R with the ecdf() function.
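As a rough illustration of the suggestions above, the sketch below fits two candidate distributions with fitdistr from MASS, compares them by AIC, and builds the empirical CDF with ecdf(). The data are made up for the example (evenly spaced exponential quantiles) and the choice of candidates is an assumption. A lower AIC only says that one candidate fits better than the other, not that it is the true distribution.

```r
library(MASS)  # provides fitdistr()

# Toy data (assumed): evenly spaced quantiles of an exponential with rate 2
x <- qexp(ppoints(200), rate = 2)

# Fit two candidate distributions by maximum likelihood
fit_exp  <- fitdistr(x, "exponential")
fit_norm <- fitdistr(x, "normal")

fit_exp$estimate                                        # fitted rate
c(exponential = AIC(fit_exp), normal = AIC(fit_norm))   # lower is better

# Empirical CDF: a step function that can be evaluated at any point
Fn <- ecdf(x)
Fn(median(x))

# QQ plot against the fitted exponential (uncomment to draw)
# qqplot(qexp(ppoints(length(x)), rate = fit_exp$estimate), x); abline(0, 1)
```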

Test for a logistic distribution in R

I have a set of data and I'd like to know whether this data set has a logistic distribution.
When I made a histogram of my data set (see the histogram on http://imageshack.us/photo/my-images/593/histogram.png/) it seems to have a logistic distribution, but to be sure I'd like to test for a logistic distribution in R. So my question is: Is there a way to test your data for a logistic distribution and how do you do this?
Additional information: The data set consists of 8544 items. The data are horizontal distances in km between 2 geographical points.
Thanks for your attention
Sander
In R you can use the ks.test or chisq.test functions (and probably others) to test against a hypothesized distribution. Note that these tests (and others) are all rule-out tests: a non-significant result does not guarantee that the data come from the given distribution, just that you cannot rule it out. Also note that with a sample size of 8544 these tests are likely to be way overpowered, meaning that they will have power to find slight, meaningless differences and you are likely to reject the null hypothesis even though the fit is "close enough". Also, the fact that you decided on a distribution based on looking at the data first could bias the results.
Another approach that may give you a better feel for if a logistic distribution is "close enough" rather than exactly is to use the vis.test function in the TeachingDemos package (be sure to read the paper referenced in the help page to understand the test and what assumptions you are making).
Most important is understanding the science that leads to the data: does a logistic distribution make sense scientifically? What other distributions could be reasonable? Also understand what question(s) you are trying to answer with the data and what effect the distribution has on those answers (e.g. the CLT will let you use the normal to answer some questions, but not others, even though the data come from a logistic or something similar).
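For reference, a Kolmogorov-Smirnov test against a logistic distribution can be sketched as follows. The data here are synthetic, and the location and scale are treated as known in advance, which is an assumption: if they are estimated from the same sample, the reported p-value is no longer exact and tends to be too large.

```r
# Synthetic stand-in for the distance data (assumed values, not the real data set)
x <- qlogis(ppoints(300), location = 10, scale = 2)

# Moment-style estimates, shown only to indicate where parameters could come
# from: the logistic variance is scale^2 * pi^2 / 3
loc_hat   <- median(x)
scale_hat <- sd(x) * sqrt(3) / pi

# One-sample KS test against a fully specified logistic distribution
res <- ks.test(x, "plogis", location = 10, scale = 2)
res$statistic   # largest gap between empirical and hypothesized CDF
res$p.value     # a large value means "cannot rule out", not "confirmed"
```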

Comparing Kernel Density Estimation plots

I am a novice to R and stats. Could something like this be done in R?
Determining the density estimates of two samples (2 vectors)?
I have done this using R and obtained 2 density curves for the 2 samples using kernel density estimation.
Is there any way to quantitatively compare how similar or dissimilar the density estimates of the 2 samples are?
I am trying to find out which data sample has a distribution similar to a particular distribution.
I am using the R language. Can somebody please help?
You can use Kolmogorov-Smirnov test (ks.test) to compare two distributions. Cramer-von-Mises test is another one. There is this PDF Fitting Distributions with R where they also list other tests that are available (although the nortest package that he uses only tests for normality).
Apprentice Queue is right about using the Kolmogorov-Smirnov test, but I wanted to add a warning: don't use it on its own. You should visually compare the distributions as well, either with two kernel density plots or histograms, or with a qqplot. Human brains are very good at playing spot-the-difference.
You can try calculating the Earth mover's distance
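The suggestions in these answers can be combined in a short sketch: a two-sample Kolmogorov-Smirnov test plus a one-dimensional Earth mover's distance. The helper emd_1d is hypothetical; it approximates the distance as the average absolute difference between matching quantiles of the two samples, which is one common way to compute it in 1D. The two vectors are made-up examples.

```r
# Hypothetical helper: approximate 1D Earth mover's distance (Wasserstein-1)
# as the mean absolute gap between quantiles on a common probability grid.
emd_1d <- function(x, y, n_grid = 1000) {
  p <- ppoints(n_grid)
  mean(abs(quantile(x, p, names = FALSE) - quantile(y, p, names = FALSE)))
}

# Toy samples (assumed): the second is the first shifted by 1
a <- qnorm(ppoints(400))
b <- a + 1

emd_1d(a, b)              # a pure shift of 1 gives a distance of about 1
ks.test(a, b)$statistic   # largest gap between the two empirical CDFs

# Visual check, as recommended: overlay the two kernel density estimates
# plot(density(a)); lines(density(b), col = "red")
```

Unlike the KS statistic, the Earth mover's distance is expressed in the units of the data, so it also says how far apart the two samples are.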

Histogram matching - image processing - c/c++

I have two histograms.
int Hist1[10] = {1,4,3,5,2,5,4,6,3,2};
int Hist2[10] = {1,4,3,15,12,15,4,6,3,2};
Hist1's distribution is of type multi-modal;
Hist2's distribution is of type uni-modal with single prominent peak.
My questions are:
1. Is there any way that I could determine the type of distribution programmatically?
2. How to quantify whether these two histograms are similar/dissimilar?
Thanks
Raj,
I posted a C function in your other question ( automatically compare two series -Dissimilarity test ) that will compute divergence between two sets of similar data. It's actually intended to tell you how closely real data matches predicted data but I suspect you could use it for your purpose.
Basically, the smaller the error, the more similar the two sets are.
These are just guesses, but I would try fitting each distribution as a gaussian distribution and use something like the R-squared value to determine if the distribution is uni-modal or not.
As to the similarity between the two distributions, I would try doing an autocorrelation and using the peak positive value in the autocorrelation as a similarity measure. These ideas are pretty rough, but hopefully they give you some ideas.
For #2, you could calculate their cross-correlation (so long as the buckets themselves can be sorted). That would give you a rough estimate of their similarity.
Comparison of Histograms (For Use in Cloud Modeling).
(That's an MS .doc file.)
There are a variety of software packages that will "fit" your distributions to known discrete distributions for you - Minitab, STATA, R, etc. A reference to fitting distributions in R is here. I wouldn't advise programming this from scratch.
Regarding distribution comparisons, if neither distribution fits a known distribution (Poisson, Binomial, etc.), then you need to use non-parametric methods described here.
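For the question of similarity, a couple of simple scores can be sketched in R (one of the packages mentioned above) using the two histograms from the question. The particular scores, histogram intersection and Pearson correlation of the bin counts, are choices made here for illustration; the correlation is only the simplest, lag-0 case of the cross-correlation suggested earlier.

```r
# Bin counts copied from the question
hist1 <- c(1, 4, 3, 5, 2, 5, 4, 6, 3, 2)
hist2 <- c(1, 4, 3, 15, 12, 15, 4, 6, 3, 2)

# Normalize counts so both histograms sum to 1
p1 <- hist1 / sum(hist1)
p2 <- hist2 / sum(hist2)

# Histogram intersection: 1 for identical shapes, 0 for no overlap
intersection <- sum(pmin(p1, p2))

# Pearson correlation of the bin counts (centered, scaled, lag 0)
correlation <- cor(hist1, hist2)

c(intersection = intersection, correlation = correlation)
```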
