Convolution of densities in R

If x is the data and hist$density gives me the empirical densities, how do I obtain the convolution of x with itself? convolve() offers several options, but I am not sure which to use. I need a function implementing discrete convolution: http://en.wikipedia.org/wiki/Convolution#Discrete_convolution (actually just the real-valued case would suffice).
Simple test: the convolution of two Bernoullis should give a Binomial.
Pr=c(0.7, 0.3)
The right answer should be a Binomial with parameters n=2 and p=0.3.

OK, the right answer is:
> convolve(Pr,rev(Pr),type="o")
[1] 0.49 0.42 0.09
a Binomial with parameters n=2 and p=0.3. So to convolve densities one could use:
convolve(hist$density, rev(hist$density), type="o")
This works well for discrete distributions but may work (very) badly for continuous ones.
Note: if you want the CDF, apply cumsum() to the result of the convolution.
Suggestion: for discrete data, table(x)/sum(table(x)) gives a more accurate input for the convolution than hist$density.
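To make the recipe concrete, here is a minimal sketch (the simulated sample and object names are illustrative, not from the original post) that recovers the Binomial(2, 0.3) pmf by self-convolving an empirical Bernoulli pmf:
set.seed(1)
x <- rbinom(10000, size = 1, prob = 0.3)    # Bernoulli(0.3) sample
pmf <- as.numeric(table(x)) / length(x)     # empirical pmf over 0 and 1
conv <- convolve(pmf, rev(pmf), type = "open")
round(conv, 3)    # close to dbinom(0:2, size = 2, prob = 0.3)
cumsum(conv)      # CDF of the sum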

Related

Generate random data from arbitrary CDF in R?

I have an arbitrary CDF that is applied to a point estimate. I have a number of these point estimates with associated CDFs, and I need to simulate random data from them for a Monte Carlo simulation.
I generate the CDF by fitting a spline to the quantile points provided in a table. For example, the 0.1 quantile is 0.13 times the point estimate, and the 0.9 quantile is 7.57 times the point estimate. This is fairly crude and is based on a large study comparing these models to a real-world system; please ignore that for now.
I fit the CDF using a spline fit as shown here.
If I take the derivative of this, I get the shape of the pdf (image).
I modified the function "samplepdf" found here, Sampling from an Arbitrary Density, as follows:
samplecdf <- function(n, cdf, spdf.lower = -Inf, spdf.upper = Inf) {
  my_fun <- match.fun(cdf)
  invcdf <- function(u) {
    subcdf <- function(t) my_fun(t) - u
    # endsign() (from the linked post) searches outward for a sign change
    if (spdf.lower == -Inf)
      spdf.lower <- endsign(subcdf, -1)
    if (spdf.upper == Inf)
      spdf.upper <- endsign(subcdf)
    uniroot(subcdf, c(spdf.lower, spdf.upper))$root
  }
  sapply(runif(n), invcdf)
}
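Note that samplecdf() relies on the helper endsign() from the linked post, which walks outward until the bracketing function changes sign; if you don't have it, something along these lines should work (my reconstruction, not verbatim from the post):
endsign <- function(f, sign = 1) {
  b <- sign
  # step outward by factors of 10 until f(b) has the requested sign
  while (sign * f(b) < 0) b <- 10 * b
  b
}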
This seems to work OK: when I compare quantiles estimated from the randomly generated data, they are fairly close to the initial values. However, when I look at the histogram, something funny is happening at the tail, where the function consistently generates more values than it should according to the pdf. It does this across all my point estimates, and even though the individual quantiles look close, I can tell that the overall Monte Carlo simulation produces higher estimates for the 50th percentile than I expect. Here is a plot of my histogram of the random samples.
Any tips or advice would be very welcome. I think the best route would be to fit an exponential distribution to the CDF, but I'm struggling to do that: all "fitting" methods assume you have data to fit, and this is more arbitrary than that.
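One possible route (a sketch under assumptions, not a tested answer): instead of spline-fitting the CDF, fit a parametric distribution directly to the (probability, value) pairs by least squares on its quantile function. A lognormal is used here purely as an illustration, and every value except the 0.1 and 0.9 quantiles from the question is made up:
probs <- c(0.1, 0.5, 0.9)      # quantiles from the table
vals  <- c(0.13, 1.00, 7.57)   # multipliers of the point estimate (the 0.5 value is hypothetical)
obj   <- function(p) sum((qlnorm(probs, p[1], exp(p[2])) - vals)^2)
fit   <- optim(c(0, 0), obj)$par     # meanlog and log(sdlog); exp() keeps sdlog positive
rlnorm(5, fit[1], exp(fit[2]))       # multiply the draws by the point estimate afterwards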

Does R have something similar to TransformedDistribution in Mathematica?

I have a random variable X and a transformation f, and I would like to know the probability distribution of f(X), at least approximately. In Mathematica there is TransformedDistribution, but I could not find anything similar in R. As I said, an approximate solution would be fine, too.
You can check the distr package. For instance, say that y = x^2+2x+1, where x is normally distributed with mean 2 and standard deviation 5. You can:
require(distr)
x <- Norm(2, 5)
y <- x^2 + 2*x + 1
# y@r gives random samples; we make a histogram
hist(y@r(10000))
# y@d and y@p are the density and cumulative distribution functions
y@d(80)
# [1] 0.002452403
y@p(80)
# [1] 0.8891796
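One caveat worth adding: per the distr documentation, arithmetic between distribution objects treats the operands as independent, so x^2 + 2*x + 1 above is the sum of independent terms rather than the transform (x + 1)^2 of a single variable (just as x + x is not 2*x). Writing the expression with a single occurrence of x avoids this, and the result can then be checked in closed form, since x + 1 is Norm(3, 5):
y2 <- (x + 1)^2   # a genuine transform of one variable
y2@p(80)
# exact value for comparison: P((x+1)^2 <= 80) with x + 1 ~ N(3, 5)
pnorm(sqrt(80), 3, 5) - pnorm(-sqrt(80), 3, 5)   # about 0.874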

Simulate a distribution with a given kurtosis and skewness in R?

Is it possible to generate distributions in R for which the Mean, SD, skew and kurtosis are known? So far it appears the best route would be to create random numbers and transform them accordingly.
If there is a package tailored to generating specific distributions which could be adapted, I have not yet found it.
Thanks
There is the Johnson distribution in the SuppDists package. JohnsonFit() will give you a distribution that matches either moments or quantiles. The other comments are correct that four moments do not a distribution make, but Johnson will certainly try.
Here's an example of fitting a Johnson to some sample data:
require(SuppDists)
## make a weird dist with Kurtosis and Skew
a <- rnorm( 5000, 0, 2 )
b <- rnorm( 1000, -2, 4 )
c <- rnorm( 3000, 4, 4 )
babyGotKurtosis <- c( a, b, c )
hist( babyGotKurtosis , freq=FALSE)
## Fit a Johnson distribution to the data
## TODO: Insert Johnson joke here
parms<-JohnsonFit(babyGotKurtosis, moment="find")
## Print out the parameters
sJohnson(parms)
## add the Johnson function to the histogram
plot(function(x)dJohnson(x,parms), -20, 20, add=TRUE, col="red")
The final plot looks like this:
You can see a bit of the issue that others point out about how 4 moments do not fully capture a distribution.
Good luck!
EDIT
As Hadley pointed out in the comments, the Johnson fit looks off. I did a quick test and fit the Johnson distribution using moment="quant" which fits the Johnson distribution using 5 quantiles instead of the 4 moments. The results look much better:
parms<-JohnsonFit(babyGotKurtosis, moment="quant")
plot(function(x)dJohnson(x,parms), -20, 20, add=TRUE, col="red")
Which produces the following:
Anyone have any ideas why Johnson seems biased when fit using moments?
This is an interesting question, which doesn't really have a good solution. I presume that even though you don't know the other moments, you have an idea of what the distribution should look like, for example that it's unimodal.
There are a few different ways of tackling this problem:
1. Assume an underlying distribution and match moments. There are many standard R packages for doing this. One downside is that the multivariate generalisation may be unclear.
2. Saddlepoint approximations. In this paper:
Gillespie, C.S. and Renshaw, E. An improved saddlepoint approximation. Mathematical Biosciences, 2007.
we look at recovering a pdf/pmf when given only the first few moments. We found that this approach works when the skewness isn't too large.
3. Laguerre expansions:
Mustapha, H. and Dimitrakopoulos, R. Generalized Laguerre expansions of multivariate probability densities with moments. Computers & Mathematics with Applications, 2010.
The results in this paper seem more promising, but I haven't coded them up.
This question was asked more than 3 years ago, so I hope my answer doesn't come too late.
There is a way to uniquely identify a distribution when knowing some of the moments: the method of Maximum Entropy. The distribution produced by this method is the one that maximizes your ignorance about the structure of the distribution, given what you know; any other distribution that also has the specified moments but is not the MaxEnt distribution implicitly assumes more structure than what you put in. The functional to maximize is Shannon's information entropy, $S[p(x)] = -\int p(x)\log p(x)\,dx$. Knowing the mean, sd, skewness and kurtosis translates into constraints on the first, second, third, and fourth moments of the distribution, respectively.
The problem is then to maximize $S$ subject to the constraints:
1) $\int x\,p(x)\,dx = m_1$ (the first moment),
2) $\int x^2\,p(x)\,dx = m_2$ (the second moment),
3) and so on for the third and fourth moments.
I recommend the book "Harte, J., Maximum Entropy and Ecology: A Theory of Abundance, Distribution, and Energetics (Oxford University Press, New York, 2011)."
Here is a link that tries to implement this in R:
https://stats.stackexchange.com/questions/21173/max-entropy-solver-in-r
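For a self-contained illustration, here is a minimal grid-based sketch of the idea (my own construction, not the solver from the linked thread). It finds the MaxEnt density of exponential-family form p(x) proportional to exp(l1*x + l2*x^2 + l3*x^3 + l4*x^4) by minimizing the convex dual log Z(lambda) - sum(lambda * m), whose gradient vanishes exactly when the first four raw moments match m:
maxent_density <- function(m, lower = -10, upper = 10, n = 2001) {
  x  <- seq(lower, upper, length.out = n)
  dx <- x[2] - x[1]
  X  <- cbind(x, x^2, x^3, x^4)
  # convex dual of the entropy maximization under moment constraints
  dual <- function(lambda) {
    e  <- as.vector(X %*% lambda)
    mx <- max(e)                  # log-sum-exp trick for numerical stability
    mx + log(sum(exp(e - mx)) * dx) - sum(lambda * m)
  }
  lambda <- optim(rep(0, 4), dual, method = "BFGS")$par
  e <- as.vector(X %*% lambda)
  w <- exp(e - max(e))
  data.frame(x = x, density = w / (sum(w) * dx))
}
# the raw moments (0, 1, 0, 3) are those of N(0, 1), so the fitted density
# should lie close to dnorm on this interval
fit <- maxent_density(c(0, 1, 0, 3))
plot(fit, type = "l"); curve(dnorm(x), add = TRUE, col = 2, lty = 2)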
One solution for you might be the PearsonDS package. It allows you to use a combination of the first four moments, with the restriction that kurtosis > skewness^2 + 1.
To generate 10 random values from that distribution try:
library("PearsonDS")
moments <- c(mean = 0,variance = 1,skewness = 1.5, kurtosis = 4)
rpearson(10, moments = moments)
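A quick sanity check: PearsonDS also exports empMoments(), and computing it on a large sample should approximately recover the requested moments.
empMoments(rpearson(1e6, moments = moments))
# approximately: mean 0, variance 1, skewness 1.5, kurtosis 4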
I agree that you need density estimation to replicate an arbitrary distribution. However, if you have hundreds of variables, as is typical in a Monte Carlo simulation, you need a compromise.
One suggested approach is as follows (a code sketch follows the caveats below):
1. Use the Fleishman transform to get the coefficients for the given skew and kurtosis. The Fleishman transform takes a skew and kurtosis and gives you a set of polynomial coefficients.
2. Generate N standard normal variables (mean = 0, sd = 1).
3. Transform the data from step 2 with the Fleishman coefficients; this imposes the given skew and kurtosis.
4. Rescale the data from step 3 to the desired mean and standard deviation: new_data = desired_mean + (data from step 3) * desired_sd.
The resulting data from step 4 will have the desired mean, sd, skewness, and kurtosis.
Caveats:
- Fleishman will not work for all combinations of skewness and kurtosis.
- The steps above assume uncorrelated variables; if you want to generate correlated data, you will need an extra step before the Fleishman transform.
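Here is a minimal sketch of that recipe (the three equations are the standard Fleishman power-method system; the least-squares solver, starting values, and example numbers are my own choices, and kurtosis here means excess kurtosis):
## step 1: solve the Fleishman system for b, c, d (with a = -c)
fleishman_coef <- function(skew, exkurt) {
  obj <- function(p) {
    b <- p[1]; c <- p[2]; d <- p[3]
    e1 <- b^2 + 6*b*d + 2*c^2 + 15*d^2 - 1                  # unit variance
    e2 <- 2*c*(b^2 + 24*b*d + 105*d^2 + 2) - skew           # skewness
    e3 <- 24*(b*d + c^2*(1 + b^2 + 28*b*d) +
              d^2*(12 + 48*b*d + 141*c^2 + 225*d^2)) - exkurt
    e1^2 + e2^2 + e3^2   # zero at an exact solution
  }
  # extreme skew/kurtosis combinations may need better starting values
  p <- optim(c(1, 0.1, 0.1), obj, method = "BFGS")$par
  c(a = -p[2], b = p[1], c = p[2], d = p[3])
}
## steps 2-4: transform standard normals, then shift and scale
cf <- fleishman_coef(skew = 1, exkurt = 1.5)
z  <- rnorm(1e5)
y  <- cf["a"] + cf["b"]*z + cf["c"]*z^2 + cf["d"]*z^3   # mean 0, sd 1
x  <- 10 + 2*y                                          # mean 10, sd 2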
Those parameters don't actually fully define a distribution. For that you need a density or equivalently a distribution function.
The entropy method is a good idea, but if you have the data samples, you can use more information than just the moments, so a pure moment fit is often less stable. If you have no further information about what the distribution looks like, then entropy is a good concept; but if you do have more information, e.g. about the support, then use it.

If your data are skewed and positive, a lognormal model is a good idea. If you also know the upper tail is finite, do not use the lognormal; consider the 4-parameter beta distribution instead. If nothing is known about the support or tail characteristics, a scaled and shifted lognormal model may be fine. If you need more flexibility regarding kurtosis, a scaled and shifted log-t often works. It also helps to know whether the fit should be near-normal; if so, use a model that includes the normal distribution (often the case anyway), otherwise you may use, e.g., a generalized secant-hyperbolic distribution. If you do all this, at some point the model will have several distinct cases, and you should make sure there are no gaps or bad transition effects between them.
As @David and @Carl wrote above, there are several packages dedicated to generating different distributions; see e.g. the Probability Distributions Task View on CRAN.
If you are interested in the theory (how to draw a sample of numbers fitting a specific distribution with the given parameters), then just look up the appropriate formulas, e.g. for the gamma distribution on Wikipedia, and set up a simple system of equations from the provided parameters to compute the scale and shape.
See a concrete example here, where I computed the alpha and beta parameters of a required beta distribution based on mean and standard deviation.
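For reference, the standard method-of-moments calculation for that task looks like this (a sketch; the function name is mine and the linked code may differ). For a Beta(alpha, beta) on [0, 1] with desired mean m and variance v, both parameters follow from k = m(1 - m)/v - 1:
beta_params <- function(m, v) {
  stopifnot(v < m * (1 - m))   # otherwise no beta distribution exists
  k <- m * (1 - m) / v - 1
  c(alpha = m * k, beta = (1 - m) * k)
}
p <- beta_params(0.3, 0.02)   # mean 0.3, sd ~ 0.14
p["alpha"] / sum(p)           # recovers the mean, 0.3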

R function equivalent to Excel's CHIINV

I'm looking for a function that does the same thing as Excel's CHIINV.
From Microsoft documentation, the definition of CHIINV is
Returns the inverse of the right-tailed probability of the chi-squared distribution
For example,
=CHIINV(0.2,2) returns 3.21.
The closest function I could find in R is geoR's dinvchisq. However,
dinvchisq(0.2,2) returns 1.026062.
Please help!
What you want is ?qchisq. This takes a probability and a degrees of freedom, and outputs the associated quantile. Consider:
> qchisq(p=0.2, df=2, lower.tail=FALSE)
[1] 3.218876
Furthermore, according to the documentation, dinvchisq() is the density function (the height of the pdf at a given quantile) of the inverse chi-squared distribution, i.e. the distribution of 1/X when X follows a chi-squared distribution. You need the quantile function, not the density function, and you don't want the inverse chi-squared distribution (although the confusion is natural coming from Excel's function name).
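A round trip makes the quantile/probability relationship explicit: feeding the quantile back into pchisq() with lower.tail=FALSE recovers the right-tailed probability of 0.2.
> pchisq(qchisq(p=0.2, df=2, lower.tail=FALSE), df=2, lower.tail=FALSE)
[1] 0.2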

R: Regression of the sum of distributions on a histogram

Here is the case:
I want to describe a histogram as the sum of several distributions, and thus to fit these distributions to that histogram. In ROOT/C++ that is pretty obvious, but I am looking for the equivalent in R. Here is a self-explanatory example:
## SUM OF TWO GAUSSIANS OF DIFFERENT WIDTHS
x=rnorm(n=1000,mean=0,sd=1)
y=rnorm(n=1000,mean=0,sd=3)
y=y[abs(y)<=10]   # hist() fails if any value falls outside the breaks
z=append(x,y)
b=seq(-10,10,by=0.25)
hist(z,breaks=b)
In this case the individual contributions (x) and (y) are known, and I can extract their density curves with a kernel estimate:
## NARROW GAUSSIAN
hist(x,prob=T,breaks=b)
dx=density(x,ker="epan")
lines(dx,col=3,lwd=2)
## WIDE GAUSSIAN
hist(y,prob=T,breaks=b)
dy=density(y,ker="epan")
lines(dy,col=2,lwd=2)
I would like to do something like
z ~ dx + dy
where the fractions of dx and dy would be the parameters to be fitted.
Looking through the R documentation, I have only found references to single regression and smoothing.
Does anyone have a clue or a sympathetic link?
Thanks in advance,
X.
I found a way, but it ignores the kernel:
x=rnorm(n=10000,mean=0,sd=1)
y=rnorm(n=10000,mean=0,sd=3)
z=append(x,y)
x=subset(x,abs(x)<=10)
y=subset(y,abs(y)<=10)
z=subset(z,abs(z)<=10)
hx=hist(x,prob=T,breaks=b)
hy=hist(y,prob=T,breaks=b)
hz=hist(z,prob=T,breaks=b)
lm(formula=as.formula(hz$intensities~hx$intensities+hy$intensities))
Call:
lm(formula = as.formula(hz$intensities ~ hx$intensities + hy$intensities))
Coefficients:
(Intercept) hx$intensities hy$intensities
4.344e-17 5.002e-01 4.998e-01
That assumes that the template histograms are reliable (enough entries, relevant binning).
I will meanwhile dig further to see how that can be applied to the fit of kernels, given that
lm(formula=as.formula(hz$intensities~dx$y+dy$y))
lm(formula=as.formula(z~dx$y+dy$y))
both end up with the error
variable lengths differ (found for 'dx$y')
because the kernel density is estimated from the full sample (x) on its own grid of points, not on the bins of the histogram hx.
Thanks, greetings to Massachusetts!
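A possible way around the length mismatch (a sketch, reusing the objects defined above, with dx and dy recomputed from the subset samples): interpolate each kernel density onto the histogram bin midpoints with approx(), so that every regressor has one value per bin.
dx=density(x,ker="epan")
dy=density(y,ker="epan")
dxm=approx(dx$x, dx$y, xout=hx$mids, yleft=0, yright=0)$y   # zero outside the kernel support
dym=approx(dy$x, dy$y, xout=hx$mids, yleft=0, yright=0)$y
lm(hz$intensities ~ dxm + dym)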
