Simulate a distribution with a given kurtosis and skewness in r? [duplicate] - r

Is it possible to generate distributions in R for which the Mean, SD, skew and kurtosis are known? So far it appears the best route would be to create random numbers and transform them accordingly.
If there is a package tailored to generating specific distributions which could be adapted, I have not yet found it.
Thanks

There is a Johnson distribution in the SuppDists package. Johnson will give you a distribution that matches either moments or quantiles. Others comments are correct that 4 moments does not a distribution make. But Johnson will certainly try.
Here's an example of fitting a Johnson to some sample data:
require(SuppDists)
## make a weird dist with Kurtosis and Skew
a <- rnorm( 5000, 0, 2 )
b <- rnorm( 1000, -2, 4 )
c <- rnorm( 3000, 4, 4 )
babyGotKurtosis <- c( a, b, c )
hist( babyGotKurtosis , freq=FALSE)
## Fit a Johnson distribution to the data
## TODO: Insert Johnson joke here
parms<-JohnsonFit(babyGotKurtosis, moment="find")
## Print out the parameters
sJohnson(parms)
## add the Johnson function to the histogram
plot(function(x)dJohnson(x,parms), -20, 20, add=TRUE, col="red")
The final plot looks like this:
You can see a bit of the issue that others point out about how 4 moments do not fully capture a distribution.
Good luck!
EDIT
As Hadley pointed out in the comments, the Johnson fit looks off. I did a quick test and fit the Johnson distribution using moment="quant" which fits the Johnson distribution using 5 quantiles instead of the 4 moments. The results look much better:
parms<-JohnsonFit(babyGotKurtosis, moment="quant")
plot(function(x)dJohnson(x,parms), -20, 20, add=TRUE, col="red")
Which produces the following:
Anyone have any ideas why Johnson seems biased when fit using moments?

This is an interesting question, which doesn't really have a good solution. I presume that even though you don't know the other moments, you have an idea of what the distribution should look like. For example, it's unimodal.
There a few different ways of tackling this problem:
Assume an underlying distribution and match moments. There are many standard R packages for doing this. One downside is that the multivariate generalisation may be unclear.
Saddlepoint approximations. In this paper:
Gillespie, C.S. and Renshaw, E. An improved saddlepoint approximation. Mathematical Biosciences, 2007.
We look at recovering a pdf/pmf when given only the first few moments. We found that this approach works when the skewness isn't too large.
Laguerre expansions:
Mustapha, H. and Dimitrakopoulosa, R. Generalized Laguerre expansions of multivariate probability densities with moments. Computers & Mathematics with Applications, 2010.
The results in this paper seem more promising, but I haven't coded them up.

This question was asked more than 3 years ago, so I hope my answer doesn't come too late.
There is a way to uniquely identify a distribution when knowing some of the moments. That way is the method of Maximum Entropy. The distribution that results from this method is the distribution that maximizes your ignorance about the structure of the distribution, given what you know. Any other distribution that also has the moments that you specified but is not the MaxEnt distribution is implicitly assuming more structure than what you input. The functional to maximize is Shannon's Information Entropy, $S[p(x)] = - \int p(x)log p(x) dx$. Knowing the mean, sd, skewness and kurtosis, translate as constraints on the first, second, third, and fourth moments of the distribution, respectively.
The problem is then to maximize S subject to the constraints:
1) $\int x p(x) dx = "first moment"$,
2) $\int x^2 p(x) dx = "second moment"$,
3) ... and so on
I recommend the book "Harte, J., Maximum Entropy and Ecology: A Theory of Abundance, Distribution, and Energetics (Oxford University Press, New York, 2011)."
Here is a link that tries to implement this in R:
https://stats.stackexchange.com/questions/21173/max-entropy-solver-in-r

One solution for you might be the PearsonDS library. It allows you to use a combination of the first four moments with the restriction that kurtosis > skewness^2 + 1.
To generate 10 random values from that distribution try:
library("PearsonDS")
moments <- c(mean = 0,variance = 1,skewness = 1.5, kurtosis = 4)
rpearson(10, moments = moments)

I agree you need density estimation to replicate any distribution. However, if you have hundreds of variables, as is typical in a Monte Carlo simulation, you would need to have a compromise.
One suggested approach is as follows:
Use the Fleishman transform to get the coefficient for the given skew and kurtosis. Fleishman takes the skew and kurtosis and gives you the coefficients
Generate N normal variables (mean = 0, std = 1)
Transform the data in (2) with the Fleishman coefficients to transform the normal data to the given skew and kurtosis
In this step, use data from from step (3) and transform it to the desired mean and standard deviation (std) using new_data = desired mean + (data from step 3)* desired std
The resulting data from Step 4 will have the desired mean, std, skewness and kurtosis.
Caveats:
Fleishman will not work for all combinations of skewness and kurtois
Above steps assume non-correlated variables. If you want to generate correlated data, you will need a step before the Fleishman transform

Those parameters don't actually fully define a distribution. For that you need a density or equivalently a distribution function.

The entropy method is a good idea, but if you have the data samples you use more information compared to the use of only the moments! So a moment fit is often less stable. If you have no more information about how the distribution looks like then entropy is a good concept, but if you have more information, e.g. about the support, then use it! If your data is skewed and positive then using a lognormal model is a good idea. If you know also the upper tail is finite, then do not use the lognormal, but maybe the 4-parameter Beta distribution. If nothing is known about support or tail characteristics, then maybe a scaled and shifted lognormal model is fine. If you need more flexibility regarding kurtosis, then e.g. a logT with scaling + shifting is often fine. It can also help if you known that the fit should be near-normal, if this is the case then use a model which includes the normal distribution (often the case anyway), otherwise you may e.g. use a generalized secant-hyperbolic distribution. If you want to do all this, then at some point the model will have some different cases, and you should make sure that there are no gaps or bad transition effects.

As #David and #Carl wrote above, there are several packages dedicated to generate different distributions, see e.g. the Probability distributions Task View on CRAN.
If you are interested in the theory (how to draw a sample of numbers fitting to a specific distribution with the given parameters) then just look for the appropriate formulas, e.g. see the gamma distribution on Wiki, and make up a simple quality system with the provided parameters to compute scale and shape.
See a concrete example here, where I computed the alpha and beta parameters of a required beta distribution based on mean and standard deviation.

Related

Compute and plot central credible and highest posterior density intervals for distributions in Distributions.jl

I would like to (i) compute and (ii) plot the central credible interval and the highest posterior density intervals for a distribution in the Distributions.jl library.
Ideally, one can write their own function to compute CI and HPD and then use Plots.jl to plot them. However, I'm finding the implementation quite tricky (disclaimer: I'm new to Julia).
Any suggestions about libraries/gists/repo to check out that make the computing and plotting them easier?
Context
using Plots, StatsPlots, LaTeXStrings
using Distributions
dist = Beta(10, 10)
plot(dist) # thanks to StatsPlots it nicely plots the distribution
# missing piece 1: compute CI and HPD
# missing piece 2: plot CI and HPD
Expected end result summarized in the image below or at p. 33 of BDA3.
Resources found so far:
gist: uses PyPlot, though
hdrcde R package
Thanks for updating the question; it brings a new perspective.
The gist is kind of correct; only it uses an earlier version of Julia.
Hence linspace should be replaced by LinRange. Instead of using PyPlot use using Plots.
I would change the plotting part to the following:
plot(cred_x, pdf(B, cred_x), fill=(0, 0.9, :orange))
plot!(x,pdf(B,x), title="pdf with 90% region highlighted")
At first glance, the computation of the CI seems correct. (Like the answer from Closed Limelike Curves or the answer from the question [there][1]). For the HDP, I concur with Closed Limelike Curves. Only I would add that you could build your HDP function upon the gist code. I would also have a version for posterior with a known distribution (like in your reference document page 33, figure 2.2) as you don't need to sample. And another with sampling like Closed Limelike Curves indicated.
OP edited the question, so I'm giving a new answer.
For central credible intervals, the answer is pretty easy: Take the quantiles at each point:
lowerBound = quantile(Normal(0, 1), .025)
upperBound = quantile(Normal(0, 1), .975)
This will give you an interval where the probability of x lying below the lower bound .025, and similarly for the upper bound, adding up to .05.
HPDs are harder to calculate. In addition, they tend to be less common because they have some weird properties that aren't shared by central credible intervals. The easiest way to do it is probably using a Monte Carlo algorithm. Use randomSample = rand(Normal(0, 1), 2^12) to draw 2^12 samples from the Normal distribution. (Or however many samples you want, more gives more accurate results that are less affected by random chance.) Then, for each random point, evaluate the probability density at that random point using pdf.(randomSample). Then, pick the 95% of points with the highest probability density; include all of these points in the highest density interval, and any points between them (I'm assuming you're dealing with a single-moded distribution like the normal).
There are better ways to do this for the normal distribution, but they're harder to generalize.
You're looking for ArviZ.jl, together with Turing.jl's MCMCChains. MCMCChains will give you very basic plotting capabilities, e.g. a plot of the PDF estimated from each chain. ArviZ.jl (a wrapper around the Python ArviZ package) adds a lot more plots.

Generate multivariate nonnormal random numbers in R

Background
I want to generate multivariate distributed random numbers with a fixed variance matrix. For example, I want to generate a 2 dimensional data with covariance value = 0.5, each dimensional variance = 1. The first maginal of data is a norm distribution with mean = 0, sd = 1, and the next is a exponential distribution with rate = 2.
My attempt
My attempt is that we can generate a correlated multinormal distribution random numbers and then revised them to be any distribution by Inverse transform sampling.
In below, I give an example about transforming 2 dimensional normal distribution random numbers into a norm(0,1)+ exp(2) random number:
# generate a correlated multi-normal distribution, data[,1] and data[,2] are standard norm
data <- mvrnorm(n = 1000,mu = c(0,0), Sigma = matrix(c(1,0.5,0.5,1),2,2))
# calculate the cdf of dimension 2
exp_cdf = ecdf(data[,2])
Fn = exp_cdf(data[,2])
# inverse transform sampling to get Exponetial distribution with rate = 2
x = -log(1-Fn + 10^(-5))/2
mean(x);cor(data[,1],x)
Out:
[1] 0.5035326
[1] 0.436236
From the outputs, the new x is a set of exponential(rate = 2) random numbers. Also, x and data[,1] are correlated with 0.43. The correlated variance is 0.43, not very close to my original setting value 0.5. It maybe a issue. I think covariance of sample generated should stay more closer to initial setting value. In general, I think my method is not quite decent, maybe you guys have some amazing code snippets.
My question
As a statistics graduate, I know there exist 10+ methods to generate multivariate random numbers theoretically. In this post, I want to collect bunch of code snippets to do it automatically using packages or handy . And then, I will compare them from different aspects, like time consuming and quality of data etc. Any ideas is appreciated!
Note
Some users think I am asking for package recommendation. However, I am not looking for any recommendation. I already knew commonly used statistical theroms and R packages. I just wanna know how to generate multivariate distributed random numbers with a fixed variance matrix decently and give a code example about generate norm + exp random numbers. I think there must exist more powerful code snippets to do it in a decent way! So I ask for help right now!
Sources:
generating-correlated-random-variables, math
use copulas to generate multivariate random numbers, stackoverflow
Ross simulation, theoretical book
R CRAN distribution task View

Generate beta-binomial distribution from existing vector

Is it possible to/how can I generate a beta-binomial distribution from an existing vector?
My ultimate goal is to generate a beta-binomial distribution from the below data and then obtain the 95% confidence interval for this distribution.
My data are body condition scores recorded by a veterinarian. The values of body condition range from 0-5 in increments of 0.5. It has been suggested to me here that my data follow a beta-binomial distribution, discrete values with a restricted range.
set1 <- as.data.frame(c(3,3,2.5,2.5,4.5,3,2,4,3,3.5,3.5,2.5,3,3,3.5,3,3,4,3.5,3.5,4,3.5,3.5,4,3.5))
colnames(set1) <- "numbers"
I see that there are multiple functions which appear to be able to do this, betabinomial() in VGAM and rbetabinom() in emdbook, but my stats and coding knowledge is not yet sufficient to be able to understand and implement the instructions provided on the function help pages, at least not in a way that has been helpful for my intended purpose yet.
We can look at the distribution of your variables, y-axis is the probability:
x1 = set1$numbers*2
h = hist(x1,breaks=seq(0,10))
bp = barplot(h$counts/length(x1),names.arg=(h$mids+0.5)/2,ylim=c(0,0.35))
You can try to fit it, but you have too little data points to estimate the 3 parameters need for a beta binomial. Hence I fix the probability so that the mean is the mean of your scores, and looking at the distribution above it seems ok:
library(bbmle)
library(emdbook)
library(MASS)
mtmp <- function(prob,size,theta) {
-sum(dbetabinom(x1,prob,size,theta,log=TRUE))
}
m0 <- mle2(mtmp,start=list(theta=100),
data=list(size=10,prob=mean(x1)/10),control=list(maxit=1000))
THETA=coef(m0)[1]
We can also use a normal distribution:
normal_fit = fitdistr(x1,"normal")
MEAN=normal_fit$estimate[1]
SD=normal_fit$estimate[2]
Plot both of them:
lines(bp[,1],dbetabinom(1:10,size=10,prob=mean(x1)/10,theta=THETA),
col="blue",lwd=2)
lines(bp[,1],dnorm(1:10,MEAN,SD),col="orange",lwd=2)
legend("topleft",c("normal","betabinomial"),fill=c("orange","blue"))
I think you are actually ok with using a normal estimation and in this case it will be:
normal_fit$estimate
mean sd
6.560000 1.134196

Why does dnorm() not return the standard deviation I inputted when I do sd(dnorm())?

This may be a dumb question, however I don't understand why sd(dnorm(1:100, mean=50, sd=15)) doesn't return the standard deviation as [1] 15.0 instead of what it actually returns which is [1] 0.009440673. When I do this with rnorm() sd(rnorm(100, mean=50, sd=15)) it returns what I would expect which is a number close to 15: [1] 17.00682. Can someone please explain why sd(dnorm(x,mean=mean,sd=sd)) doesn't return the standard deviation that I input to dnorm?
The dnorm function returns the density of the normal distribution with the mean (50) and standard deviation (15) you gave it.
On the other hand, rnorm will sample 100 numbers over a normal distribution, that's why you get standard deviations close to 15.
It's always helpful to plot your data. If you try hist(dnorm(1:100, mean=50, sd=15)) you'll see that the variability is very small (see below). As MkWTF points out, that's because dnorm returns the value of the probability density function of the normal distribution at value x given specified mean and sd.
rnorm in contrast generates random numbers with probability given by the probability density function of the normal distribution, which is why it allows a sensible estimate of the SD - the generated values follow that distribution.
The documentation for dnorm/pnorm/qnorm/rnorm is not great in my opinion (as someone who lacks a background in mathematics), but if you spend some time reading different online resources about these functions, and ensuring that you understand the meaning of the different underlying concepts (probability density functions, quantiles, random number generation, and (cumulative) distribution functions, it will become clear over time.
hist(dnorm(1:100, mean=50, sd=15))
Created on 2020-01-09 by the reprex package (v0.3.0)

Cox regression in MATLAB

I know there is COXPHFIT function in MATLAB to do Cox regression, but I have problems understanding how to apply it.
1) How to compare two groups of samples with survival data in days (survdays), censoring (cens) and some predictor value (x)? The groups defined by groups logical variable. Groups have different number of samples.
2) What is the baseline parameter in coxphfit? I did read the docs, but how should I choose the baseline properly?
It would be great if you know a site with good detailed examples on medical survival data. I found only the Mathworks demo that does not even mention coxphfit.
Do you know may be another 3rd party function for Cox regression?
UPDATE: The r tag added since the answer I've got is for R.
With survival analysis, the hazard function is the instantaneous death rate.
In these analyses, you are typically measuring what effect something has on this hazard function. For example, you may ask "does swallowing arsenic increase the rate at which people die?". A background hazard is the level at which people would die anyway (without swallowing arsenic, in this case).
If you read the docs for coxphfit carefully, you will notice that that function tries to calculate the baseline hazard; it is not something that you enter.
baseline The X values at which to
compute the baseline hazard.
EDIT: MATLAB's coxphfit function doesn't obviously work with grouped data. If you are happy to switch to R, then the anaylsis is a one-liner.
library(survival)
#Create some data
n <- 20;
dfr <- data.frame(
survdays = runif(n, 5, 15),
cens = runif(n) < .3,
x = rlnorm(n),
groups = rep(c("first", "second"), each = n / 2)
)
#The Cox ph analysis
summary(coxph(Surv(survdays, cens) ~ x / groups, dfr))
ANOTHER EDIT: That baseline parameter to MATLAB's coxphfit appears to be a normalising constant. R's coxph function doesn't have an equivalent parameter. I looked in Statistical Computing by Michael Crawley and it seems to suggest that the baseline hazard isn't important, since it cancels out when you calculate the likelihood of your individual dying. See Chapter 33, and p615-616 in particular. My knowledge of how the model works isn't deep enough to explain the discrepancy in the MATLAB and R implementations; perhaps you could ask on the Stack Exchange Stats Analysis site.

Resources