How to create a discrete normal distribution in R? - r

I am trying to create a discrete normal distribution using something such as
x <- rnorm(1000, mean = 350, sd = 20)
but I don't think the rnorm function has a built in "discrete numbers only" option. I have spent a few hours trying to search this on StackOverflow, Google and R documentation but have yet to find anything.

Obviously, there is no discrete normal distribution as by default it is continuous. However, as mentioned here (Wikipedia is not the best possible source but this is correct anyway):
If n is large enough, then the skew of the distribution is not too great. In this case a reasonable approximation to B(n, p) is given by the normal distribution
This can be seen with a quick example:
par(mfrow=c(1,2) )
#values generated by a binomial distribution
plot(density(rbinom(1000, 30, p=0.25)))
#values generated by a normal distribution
plot(density(rnorm(1000)))
Plot:
The graph on the left (binomial) certainly approximates the right (normal) and this will get more obvious as n goes to Inf.
As you will see rbinom(1000, 30, p=0.25) will produce discrete values (integers). Also, density is probably not the best function to use on a discrete variable, but it proves the point here.

if you want to generate a set of random integers following a normal distribution you could simply round them like so...
round(rnorm(10, 5, 1), 0)

library(extraDistr)
set.seed(12)
rdnorm(10)

Related

Test multivariate normality of 2D binned data in R

I have some heatmap data and I want a notion as to whether that heat map is 'centered' around the middle of my image or skewed to one side (in R). My data is too big to give an example here, so this is some fake data of the same form (but in real life my intensity values are not uniformly distributed, I assume they are binned counts from an underlying multivariate normal distribution but I don't know how to code that as a reproducible example).
set.seed(42)
tibble(
x = rep(0:7, each = 8),
y = rep(0:7, 8),
intensity = sample(0:10, 64, replace = TRUE)
)
The x value here is the horizontal index of a pixel, the y value is the vertical index of a pixel and intensity is the value of that pixel according to a heatmap. I have managed to find a "centre" of the heatmap by marginalising these intensity values and finding the marginalised mean for x and y, but how would I perform a hypothesis test on whether the underlying multivariate normal distribution was centered around a certain point? In this case I would like to have a test statistic (more specifically a -log10 p-value) as to whether the underlying multivariate normal distibution that generated this count data is centered around the point c(3.5, 3.5).
Furthermore, I would also like a test statistic (again, more specifically a -log10 p-value) as to whether the underlying distribution that generated the count data actually is multivariate normal.
This is all part of a larger pipeline where I would like to use dplyr and group_by to perform this test on multiple heatmaps at once so if it is possible to keep this in tidy format that would be great.
A little bit of googling finds this web page which suggests mvnormtest::mshapiro.test.
mshap <- function(z, nrow = round(sqrt(length(z)))) {
mvnormtest::mshapiro.test(matrix(z, nrow = nrow))
}
mshap(dd$intensity)
If you want to make this more tidy-like you could do something with map/nest/etc..
I'm not quite sure how to test the centering hypothesis (likelihood ratio test using mnormt::dmnorm ?)

Random number simulation in R

I have been going through some random number simulation equations while i found out that as Pareto dosent have an inbuilt function.
RPareto is found as
rpareto <- function(n,a,l){
rp <- l*((1-runif(n))^(-1/a)-1)
rp
}
can someone explain the intuitive meaning behind this.
It's a well known result that if X is a continuous random variable with CDF F(.), then Y = F(X) has a Uniform distribution on [0, 1].
This result can be used to draw random samples of any continuous random variable whose CDF is known: generate u, a Uniform(0, 1) random variable and then determine the value of x for which F(x) = u.
In specific cases, there may well be more efficient ways of sampling from F(.), but this will always work as a fallback.
It's likely (I haven't checked the accuracy of the code myself, but it looks about right) that the body of your function solves f(x) = u for known u in order to generate a random variable with a Pareto distribution. You can check it with a little algebra after getting the CDF from this Wikipedia page.

Generate beta-binomial distribution from existing vector

Is it possible to/how can I generate a beta-binomial distribution from an existing vector?
My ultimate goal is to generate a beta-binomial distribution from the below data and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0-5 in increments of 0.5. It has been suggested to me here that my data follow a beta-binomial distribution, discrete values with a restricted range.
set1 <- as.data.frame(c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5))
colnames(set1) <- "numbers"
I see that there are multiple functions which appear to be able to do this, betabinomial() in VGAM and rbetabinom() in emdbook, but my stats and coding knowledge is not yet sufficient to be able to understand and implement the instructions provided on the function help pages, at least not in a way that has been helpful for my intended purpose yet.
We can look at the distribution of your variables, y-axis is the probability:
x1 = set1$numbers*2
h = hist(x1,breaks=seq(0,10))
bp = barplot(h$counts/length(x1),names.arg=(h$mids+0.5)/2,ylim=c(0,0.35))
You can try to fit it, but you have too little data points to estimate the 3 parameters need for a beta binomial. Hence I fix the probability so that the mean is the mean of your scores, and looking at the distribution above it seems ok:
library(bbmle)
library(emdbook)
library(MASS)
mtmp <- function(prob,size,theta) {
-sum(dbetabinom(x1,prob,size,theta,log=TRUE))
}
m0 <- mle2(mtmp,start=list(theta=100),
data=list(size=10,prob=mean(x1)/10),control=list(maxit=1000))
THETA=coef(m0)[1]
We can also use a normal distribution:
normal_fit = fitdistr(x1,"normal")
MEAN=normal_fit$estimate[1]
SD=normal_fit$estimate[2]
Plot both of them:
lines(bp[,1],dbetabinom(1:10,size=10,prob=mean(x1)/10,theta=THETA),
col="blue",lwd=2)
lines(bp[,1],dnorm(1:10,MEAN,SD),col="orange",lwd=2)
legend("topleft",c("normal","betabinomial"),fill=c("orange","blue"))
I think you are actually ok with using a normal estimation and in this case it will be:
normal_fit$estimate
mean sd
6.560000 1.134196

Simulate a distribution with a given kurtosis and skewness in r? [duplicate]

Is it possible to generate distributions in R for which the Mean, SD, skew and kurtosis are known? So far it appears the best route would be to create random numbers and transform them accordingly.
If there is a package tailored to generating specific distributions which could be adapted, I have not yet found it.
Thanks
There is a Johnson distribution in the SuppDists package. Johnson will give you a distribution that matches either moments or quantiles. Others comments are correct that 4 moments does not a distribution make. But Johnson will certainly try.
Here's an example of fitting a Johnson to some sample data:
require(SuppDists)
## make a weird dist with Kurtosis and Skew
a <- rnorm( 5000, 0, 2 )
b <- rnorm( 1000, -2, 4 )
c <- rnorm( 3000, 4, 4 )
babyGotKurtosis <- c( a, b, c )
hist( babyGotKurtosis , freq=FALSE)
## Fit a Johnson distribution to the data
## TODO: Insert Johnson joke here
parms<-JohnsonFit(babyGotKurtosis, moment="find")
## Print out the parameters
sJohnson(parms)
## add the Johnson function to the histogram
plot(function(x)dJohnson(x,parms), -20, 20, add=TRUE, col="red")
The final plot looks like this:
You can see a bit of the issue that others point out about how 4 moments do not fully capture a distribution.
Good luck!
EDIT
As Hadley pointed out in the comments, the Johnson fit looks off. I did a quick test and fit the Johnson distribution using moment="quant" which fits the Johnson distribution using 5 quantiles instead of the 4 moments. The results look much better:
parms<-JohnsonFit(babyGotKurtosis, moment="quant")
plot(function(x)dJohnson(x,parms), -20, 20, add=TRUE, col="red")
Which produces the following:
Anyone have any ideas why Johnson seems biased when fit using moments?
This is an interesting question, which doesn't really have a good solution. I presume that even though you don't know the other moments, you have an idea of what the distribution should look like. For example, it's unimodal.
There a few different ways of tackling this problem:
Assume an underlying distribution and match moments. There are many standard R packages for doing this. One downside is that the multivariate generalisation may be unclear.
Saddlepoint approximations. In this paper:
Gillespie, C.S. and Renshaw, E. An improved saddlepoint approximation. Mathematical Biosciences, 2007.
We look at recovering a pdf/pmf when given only the first few moments. We found that this approach works when the skewness isn't too large.
Laguerre expansions:
Mustapha, H. and Dimitrakopoulosa, R. Generalized Laguerre expansions of multivariate probability densities with moments. Computers & Mathematics with Applications, 2010.
The results in this paper seem more promising, but I haven't coded them up.
This question was asked more than 3 years ago, so I hope my answer doesn't come too late.
There is a way to uniquely identify a distribution when knowing some of the moments. That way is the method of Maximum Entropy. The distribution that results from this method is the distribution that maximizes your ignorance about the structure of the distribution, given what you know. Any other distribution that also has the moments that you specified but is not the MaxEnt distribution is implicitly assuming more structure than what you input. The functional to maximize is Shannon's Information Entropy, $S[p(x)] = - \int p(x)log p(x) dx$. Knowing the mean, sd, skewness and kurtosis, translate as constraints on the first, second, third, and fourth moments of the distribution, respectively.
The problem is then to maximize S subject to the constraints:
1) $\int x p(x) dx = "first moment"$,
2) $\int x^2 p(x) dx = "second moment"$,
3) ... and so on
I recommend the book "Harte, J., Maximum Entropy and Ecology: A Theory of Abundance, Distribution, and Energetics (Oxford University Press, New York, 2011)."
Here is a link that tries to implement this in R:
https://stats.stackexchange.com/questions/21173/max-entropy-solver-in-r
One solution for you might be the PearsonDS library. It allows you to use a combination of the first four moments with the restriction that kurtosis > skewness^2 + 1.
To generate 10 random values from that distribution try:
library("PearsonDS")
moments <- c(mean = 0,variance = 1,skewness = 1.5, kurtosis = 4)
rpearson(10, moments = moments)
I agree you need density estimation to replicate any distribution. However, if you have hundreds of variables, as is typical in a Monte Carlo simulation, you would need to have a compromise.
One suggested approach is as follows:
Use the Fleishman transform to get the coefficient for the given skew and kurtosis. Fleishman takes the skew and kurtosis and gives you the coefficients
Generate N normal variables (mean = 0, std = 1)
Transform the data in (2) with the Fleishman coefficients to transform the normal data to the given skew and kurtosis
In this step, use data from from step (3) and transform it to the desired mean and standard deviation (std) using new_data = desired mean + (data from step 3)* desired std
The resulting data from Step 4 will have the desired mean, std, skewness and kurtosis.
Caveats:
Fleishman will not work for all combinations of skewness and kurtois
Above steps assume non-correlated variables. If you want to generate correlated data, you will need a step before the Fleishman transform
Those parameters don't actually fully define a distribution. For that you need a density or equivalently a distribution function.
The entropy method is a good idea, but if you have the data samples you use more information compared to the use of only the moments! So a moment fit is often less stable. If you have no more information about how the distribution looks like then entropy is a good concept, but if you have more information, e.g. about the support, then use it! If your data is skewed and positive then using a lognormal model is a good idea. If you know also the upper tail is finite, then do not use the lognormal, but maybe the 4-parameter Beta distribution. If nothing is known about support or tail characteristics, then maybe a scaled and shifted lognormal model is fine. If you need more flexibility regarding kurtosis, then e.g. a logT with scaling + shifting is often fine. It can also help if you known that the fit should be near-normal, if this is the case then use a model which includes the normal distribution (often the case anyway), otherwise you may e.g. use a generalized secant-hyperbolic distribution. If you want to do all this, then at some point the model will have some different cases, and you should make sure that there are no gaps or bad transition effects.
As #David and #Carl wrote above, there are several packages dedicated to generate different distributions, see e.g. the Probability distributions Task View on CRAN.
If you are interested in the theory (how to draw a sample of numbers fitting to a specific distribution with the given parameters) then just look for the appropriate formulas, e.g. see the gamma distribution on Wiki, and make up a simple quality system with the provided parameters to compute scale and shape.
See a concrete example here, where I computed the alpha and beta parameters of a required beta distribution based on mean and standard deviation.

Octave distribution plots not working

I am trying to plot the cdf of a uniform distribution in octave but I am not getting the cdf. I am simply getting the original distribution. Also the original distribution, which is meant to be a uniform distribution, is not a uniform distribution at all!
Here is my octave code:
x = unifrnd(0,1,100,1);
hist(x)
cdfPlot = unifcdf(x)
hist(cdfPlot)
The histogram for the 1st one (hist(x)):
and the second one (hist(cdfPlot)) :
I also tried to use cdfplot(x) in octave but it said :
warning: the 'cdfplot' function belongs to the statistics package from
Octave Forge but has not yet been implemented.
Please read http://www.octave.org/missing.html to learn how you can
contribute missing functionality.
please help!
Judging by the submitted code, what you are trying to do is obtain a sample from a uniform distribution and then show a flat (mostly) histogram corresponding to a uniform distribution and a line corresponding to the cumulative distribution of the distribution.
For the first part:
Of course, with 100 samples (and no averaging), you are not going to observe a flat distribution, but if you try:
x=unifrnd(0,1,100000,1);
hist(x);
Then you are more likely to get a flat-looking histogram.
For the second part:
unifcdf(x,A,B) will return the value of a uniform distribution's CDF at some value x, between the interval set by parameters A,B. That is, the value of the CDF model itself, NOT the cumulative sum of the sample's histogram. To obtain that, you need to:
x=unifrnd(0,1,100000,1);
[counts, intervals] = hist(x);
xCDF = cumsum(counts);
bar(xCDF);
Finally, if you are looking for the model values, that is the values that would be returned by a formula describing a distribution, then for the uniform distribution that would be a probability of (1/nBins) between your A, B interval (in this case, 0,1) and a count of (1/nBins)*NSamples, while the CDF would be a line of slope (1/nBins) (i.e. the interval of the density function) and of binNum*((1/nBins)*NSamples). In the example above and using the default nBins for hist which is 10, x is decomposed to 10 intervals each with an approximate number of counts of 10000 items of x and the last value of the cumulative sum is 100000 which is of course the total number of samples in x.
For more information please see this link.
Hope this helps.

Resources