I want to generate some data that correspond to a quantile function, but the data need a min and a max value.
set.seed(30)
a1<-950 ; a2<-0; a3<-2.48; a4<-1.92
invcdf<-function (x)(a1+a2*a3*((-log(x))^(1/a4)))/(a3*((-log(x))^(1/a4))+1)
t<-invcdf(runif(2000,min=80,max=800))
When I use min and max in the runif function, NaNs are produced.
How can I improve this code to avoid NaNs? I can't change the parameters.
Since you don't explain what exactly you are trying to do (which distribution are you trying to sample?), all I can do is interpret this as an attempt to generate random variables according to some distribution using its inverse CDF. Because I don't know which distribution it is, I can't comment on whether your implementation of it is correct.
However, when you use this method, you should know that a CDF takes values between 0 and 1: it is a cumulative probability, starting at 0 and increasing to 1 in some limit.
The inverse of that function therefore only makes sense if you feed it values between 0 and 1, and that is where the error lies: runif(2000, min=80, max=800) generates random values between 80 and 800, way outside the (0,1) interval.
If you instead do this:
t <- invcdf(runif(2000))
We do get results, which happen to lie mostly between 80 and 800.
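For completeness, here is a minimal end-to-end version of the fix (same parameters as in the question; summary() is just a quick way to confirm there are no NaNs):
set.seed(30)
a1 <- 950; a2 <- 0; a3 <- 2.48; a4 <- 1.92
invcdf <- function(x) (a1 + a2*a3*((-log(x))^(1/a4))) / (a3*((-log(x))^(1/a4)) + 1)
t <- invcdf(runif(2000))
summary(t)   # all values finite, falling in (0, 950)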
Is it possible to generate distributions in R for which the Mean, SD, skew and kurtosis are known? So far it appears the best route would be to create random numbers and transform them accordingly.
If there is a package tailored to generating specific distributions which could be adapted, I have not yet found it.
Thanks
There is a Johnson distribution in the SuppDists package. Johnson will give you a distribution that matches either moments or quantiles. Other comments are correct that four moments do not a distribution make, but Johnson will certainly try.
Here's an example of fitting a Johnson to some sample data:
require(SuppDists)
## make a weird dist with Kurtosis and Skew
a <- rnorm( 5000, 0, 2 )
b <- rnorm( 1000, -2, 4 )
c <- rnorm( 3000, 4, 4 )
babyGotKurtosis <- c( a, b, c )
hist( babyGotKurtosis , freq=FALSE)
## Fit a Johnson distribution to the data
## TODO: Insert Johnson joke here
parms<-JohnsonFit(babyGotKurtosis, moment="find")
## Summary statistics of the fitted distribution
sJohnson(parms)
## add the Johnson function to the histogram
plot(function(x)dJohnson(x,parms), -20, 20, add=TRUE, col="red")
The final plot looks like this:
You can see a bit of the issue that others point out about how 4 moments do not fully capture a distribution.
Good luck!
EDIT
As Hadley pointed out in the comments, the Johnson fit looks off. I did a quick test and fit the Johnson distribution using moment="quant", which fits using 5 quantiles instead of the 4 moments. The results look much better:
parms<-JohnsonFit(babyGotKurtosis, moment="quant")
plot(function(x)dJohnson(x,parms), -20, 20, add=TRUE, col="red")
Which produces the following:
Anyone have any ideas why Johnson seems biased when fit using moments?
This is an interesting question, which doesn't really have a good solution. I presume that even though you don't know the other moments, you have an idea of what the distribution should look like. For example, it's unimodal.
There are a few different ways of tackling this problem:
Assume an underlying distribution and match moments. There are many standard R packages for doing this. One downside is that the multivariate generalisation may be unclear.
Saddlepoint approximations. In this paper:
Gillespie, C.S. and Renshaw, E. An improved saddlepoint approximation. Mathematical Biosciences, 2007.
we look at recovering a pdf/pmf when given only the first few moments. We found that this approach works when the skewness isn't too large.
Laguerre expansions:
Mustapha, H. and Dimitrakopoulos, R. Generalized Laguerre expansions of multivariate probability densities with moments. Computers & Mathematics with Applications, 2010.
The results in this paper seem more promising, but I haven't coded them up.
This question was asked more than 3 years ago, so I hope my answer doesn't come too late.
There is a way to uniquely identify a distribution when knowing some of the moments: the method of Maximum Entropy. The distribution that results from this method is the one that maximizes your ignorance about the structure of the distribution, given what you know. Any other distribution that also has the moments that you specified but is not the MaxEnt distribution implicitly assumes more structure than what you put in. The functional to maximize is Shannon's information entropy, $S[p(x)] = -\int p(x) \log p(x)\, dx$. Knowing the mean, sd, skewness and kurtosis translates into constraints on the first, second, third, and fourth moments of the distribution, respectively.
The problem is then to maximize $S$ subject to the constraints:
1) $\int x\, p(x)\, dx = \text{first moment}$,
2) $\int x^2\, p(x)\, dx = \text{second moment}$,
3) ... and so on.
I recommend the book "Harte, J., Maximum Entropy and Ecology: A Theory of Abundance, Distribution, and Energetics (Oxford University Press, New York, 2011)."
Here is a link that tries to implement this in R:
https://stats.stackexchange.com/questions/21173/max-entropy-solver-in-r
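To make this concrete, here is a minimal numerical sketch of the idea (entirely illustrative: the bounded support, the grid, the target moments, and the starting values are all assumptions made up for the example). With four moment constraints, the MaxEnt density has the form $p(x) \propto \exp(\lambda_1 x + \lambda_2 x^2 + \lambda_3 x^3 + \lambda_4 x^4)$, so one can solve for the multipliers numerically:
target <- c(0, 1, 0.5, 4)              # illustrative raw moments E[x^k], k = 1..4
x  <- seq(-10, 10, length.out = 2001)  # assumed bounded support
dx <- x[2] - x[1]
obj <- function(lam) {
  w <- exp(lam[1]*x + lam[2]*x^2 + lam[3]*x^3 + lam[4]*x^4)
  if (!all(is.finite(w))) return(1e10) # guard against overflow
  p <- w / sum(w * dx)                 # normalized MaxEnt-form density
  m <- sapply(1:4, function(k) sum(x^k * p * dx))
  sum((m - target)^2)                  # mismatch to the target moments
}
fit <- optim(c(0, -0.5, 0, 0), obj, control = list(maxit = 5000))
lam <- fit$par
p <- exp(lam[1]*x + lam[2]*x^2 + lam[3]*x^3 + lam[4]*x^4)
p <- p / sum(p * dx)
plot(x, p, type = "l")                 # the (approximate) MaxEnt density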
One solution for you might be the PearsonDS package. It allows you to use a combination of the first four moments, with the restriction that kurtosis > skewness^2 + 1.
To generate 10 random values from that distribution try:
library("PearsonDS")
moments <- c(mean = 0,variance = 1,skewness = 1.5, kurtosis = 4)
rpearson(10, moments = moments)
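As a quick sanity check (empMoments() also ships with PearsonDS), the empirical moments of a large sample should come out close to the requested ones:
empMoments(rpearson(1e5, moments = moments))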
I agree that you need density estimation to replicate any distribution exactly. However, if you have hundreds of variables, as is typical in a Monte Carlo simulation, you need a compromise.
One suggested approach is as follows (a short R sketch follows the steps):
1) Use the Fleishman transform to get the polynomial coefficients for the given skew and kurtosis.
2) Generate N standard normal variables (mean = 0, std = 1).
3) Transform the data from step (2) with the Fleishman coefficients, so that they have the given skew and kurtosis.
4) Take the data from step (3) and transform them to the desired mean and standard deviation (std): new_data = desired_mean + (data from step 3) * desired_std.
The resulting data from step (4) will have the desired mean, std, skewness and kurtosis.
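A minimal sketch of steps 1-4 (my own illustration, not a library API: the helper name fleishman_coef, the target values, and the use of optim to solve the moment equations by least squares are all assumptions; the equations are the standard Fleishman (1978) ones, stated for excess kurtosis):
fleishman_coef <- function(skew, exkurt) {
  ## solve Fleishman's equations for b, c, d (with a = -c)
  eqs <- function(p) {
    b <- p[1]; c <- p[2]; d <- p[3]
    e1 <- b^2 + 6*b*d + 2*c^2 + 15*d^2 - 1
    e2 <- 2*c*(b^2 + 24*b*d + 105*d^2 + 2) - skew
    e3 <- 24*(b*d + c^2*(1 + b^2 + 28*b*d) +
              d^2*(12 + 48*b*d + 141*c^2 + 225*d^2)) - exkurt
    e1^2 + e2^2 + e3^2
  }
  p <- optim(c(1, 0, 0), eqs)$par
  c(a = -p[2], b = p[1], c = p[2], d = p[3])
}
cf <- fleishman_coef(skew = 1, exkurt = 2)             # step 1
z  <- rnorm(1e5)                                       # step 2
y  <- cf["a"] + cf["b"]*z + cf["c"]*z^2 + cf["d"]*z^3  # step 3
x  <- 10 + 2*y                                         # step 4: mean 10, std 2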
Caveats:
Fleishman will not work for all combinations of skewness and kurtosis.
The steps above assume uncorrelated variables. If you want to generate correlated data, you will need a step before the Fleishman transform.
Those parameters don't actually fully define a distribution. For that you need a density or equivalently a distribution function.
The entropy method is a good idea, but if you have the data samples, you use more information compared to using only the moments! So a moment fit is often less stable. If you have no further information about what the distribution looks like, then entropy is a good concept; but if you do have more information, e.g. about the support, then use it!
If your data are skewed and positive, then a lognormal model is a good idea. If you also know that the upper tail is finite, then do not use the lognormal; maybe the 4-parameter beta distribution instead. If nothing is known about the support or tail characteristics, then a scaled and shifted lognormal model may be fine. If you need more flexibility regarding kurtosis, then e.g. a log-t with scaling and shifting is often fine. It can also help if you know that the fit should be near-normal; if that is the case, use a model which includes the normal distribution (often the case anyway). Otherwise you may e.g. use a generalized secant-hyperbolic distribution.
If you want to do all this, then at some point the model will have several different cases, and you should make sure that there are no gaps or bad transition effects between them.
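For instance, the lognormal suggestion above only needs a two-line moment match (the target mean and sd values here are made-up examples):
m <- 10; s <- 4                 # example target mean and sd
sigma2 <- log(1 + (s/m)^2)      # from var = (exp(sigma^2) - 1)*exp(2*mu + sigma^2)
mu     <- log(m) - sigma2/2     # from mean = exp(mu + sigma^2/2)
x <- rlnorm(1e5, meanlog = mu, sdlog = sqrt(sigma2))
c(mean(x), sd(x))               # approximately 10 and 4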
As @David and @Carl wrote above, there are several packages dedicated to generating different distributions, see e.g. the Probability distributions Task View on CRAN.
If you are interested in the theory (how to draw a sample of numbers fitting a specific distribution with the given parameters), then just look up the appropriate formulas, e.g. see the gamma distribution on Wiki, and set up a simple equation system with the provided parameters to compute scale and shape.
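For the gamma example, the equation system is mean = shape*scale and var = shape*scale^2, so (with made-up target values):
m <- 5; s <- 2                    # example target mean and sd
shape <- (m/s)^2
scale <- s^2/m
x <- rgamma(1e5, shape = shape, scale = scale)
c(mean(x), sd(x))                 # approximately 5 and 2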
See a concrete example here, where I computed the alpha and beta parameters of a required beta distribution based on mean and standard deviation.
So I've got a data set that I want to parameterise, but it is not a Gaussian distribution, so I can't parameterise it in terms of its mean and standard deviation. I want to fit a distribution function with a set of parameters and extract the values of the parameters (e.g. a and b) that give the best fit. I want to do this in exactly the same way as
lm(y~f(x;a,b))
except that I don't have a y; I have a distribution of different x values.
Here's an example. If I assume that the data follow a Gumbel (double exponential) distribution,
f(x; u, b) = (1/b) exp(-(z + exp(-z))), where z = (x - u)/b:
library(QRM)       # for rGumbel
library(ggplot2)   # for qplot
rg <- rGumbel(1000)   # default parameters are 0 and 1 for u and b
## then plot its distribution
qplot(rg)
## should give a nice, skewed distribution
If I assume that I don't know the distribution parameters and I want to perform a best fit of the probability density function to the observed frequency data, how do I go about showing that the best fit is (in this test case), u = 0 and b = 1?
I don't want code that simply maps the function onto the plot graphically, although that would be a nice aside. I want a method that I can use repeatedly to extract the parameter values from the function and compare them to others. ggplot/qplot was used because it quickly shows the distribution for anyone wanting to test the code. I prefer to use it, but I can use other packages if they are easier.
Note: this seems to me like a really obvious thing to have been asked before, but I can't find a question that relates to histogram data (which again seems strange), so if there's another tutorial I'd really like to see it.
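One generic route, sketched under the assumption that a maximum-likelihood fit is acceptable (the hand-rolled dgumbel and the starting values are my own illustration; MASS::fitdistr wraps the same idea): write the density down and minimize the negative log-likelihood with optim.
dgumbel <- function(x, u, b) {
  z <- (x - u)/b
  exp(-(z + exp(-z))) / b
}
nll <- function(p) {
  if (p[2] <= 0) return(1e10)   # keep the scale parameter positive
  -sum(log(dgumbel(rg, p[1], p[2])))
}
fit <- optim(c(u = 0.5, b = 1.5), nll)   # arbitrary starting guesses
fit$par   # should come out close to u = 0, b = 1 for the rGumbel(1000) sample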
I applied the fitdistrplus package in order to fit a distribution to empirical data.
It turned out the best fit was the negative binomial distribution with parameters:
size=0.6900788
mu=2.6522087
dnbinom(0:10, mu = 2.6522087, size =0.6900788)
[1] 0.33666338 0.18435650 0.12362301 0.08796440 0.06439416 0.04793144 0.03607044 0.02735574 0.02086667 0.01598815 0.01229390
I am now trying to generate the same numbers in Excel, where the parameters are required in a different format:
NEGBINOMDIST(number_f,number_s,probability_s)
How am I meant to do this? Any ideas?
Many thanks.
According to Microsoft's documentation, Excel uses the standard "number of failures before the n-th success" definition; the parameterization used by fitdistrplus is the alternative referred to in ?dnbinom as:
An alternative parametrization (often used in ecology) is by the
mean ‘mu’, and ‘size’, the dispersion parameter, where ‘prob’
= ‘size/(size+mu)’. The variance is ‘mu + mu^2/size’ in this
parametrization.
So if you want to get back from mu and size to prob and size (Excel's probability_s and number_s respectively) you need
number_s=size
probability_s=size/(size+mu)
muval <- 2.6522087
sizeval <- 0.6900788
(probval <- sizeval/(sizeval+muval))
## [1] 0.206469
all.equal(dnbinom(0:10,mu=muval,size=sizeval),
dnbinom(0:10,prob=probval,size=sizeval))
## [1] TRUE
However, you're not done yet, because (as commented above by @James) Excel only allows positive integers for number_s, and the estimated value above is 0.69. You may need to search/ask on an Excel-related forum about how to overcome this limitation ... at worst, since Excel does have an implementation of the gamma function, you can use the formula given in ?dnbinom
Gamma(x+n)/(Gamma(n) x!) p^n (1-p)^x
to implement your own calculation of the NB (this formulation allows non-integer values of n). It would be best to use the GAMMALN function in Excel to calculate the numerator and denominator of the normalization constant on the log scale ... if you're lucky, someone out there will have saved you some trouble and implemented this already ...
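As a sanity check for such a hand-rolled version, the same log-scale computation in R (lgamma and lfactorial play the role of Excel's GAMMALN) reproduces dnbinom:
muval <- 2.6522087; sizeval <- 0.6900788
probval <- sizeval/(sizeval + muval)
x <- 0:10
logpmf <- lgamma(x + sizeval) - lgamma(sizeval) - lfactorial(x) +
          sizeval*log(probval) + x*log(1 - probval)
all.equal(exp(logpmf), dnbinom(x, mu = muval, size = sizeval))
## [1] TRUE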
I am using R, and would like to generate a number of observations using rweibull(n, shape, scale = 1).
I have the arrival rate (i.e. 1/interarrival time), but I do not know how to use it in rweibull function.
The scale parameter is what you need to work with, and the shape parameter is what needs to be set to 1 to get an exponential distribution (the exponential is a Weibull with shape 1). The scale parameter is 1/rate, i.e. the mean interarrival time:
interT <- 8   # mean interarrival time = 1/rate
plot( density(rexp(100, rate=1/interT)) )
with( density(rweibull(100, scale=interT, shape=1)),
      lines(x, y, col="red") )
(But if you are using the survival package you need to be aware that the parameters are different.)
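A quick numerical check of the equivalence (the sample size is an arbitrary choice): with shape = 1, rweibull draws behave like rexp draws with rate = 1/scale, so both means should land near interT.
interT <- 8
c(mean(rexp(1e5, rate = 1/interT)),
  mean(rweibull(1e5, shape = 1, scale = interT)))   # both approximately 8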