Approximate the distribution of a sum of binomial random variables in R

My goal is to approximate the distribution of a sum of binomial random variables.
I am following the paper "The Distribution of a Sum of Binomial Random Variables" by Ken Butler and Michael Stephens.
I want to write an R script that finds the Pearson approximation to the sum of binomials.
There is an R package, PearsonDS, that allows this to be done in a simple way.
So I take the first example from the paper and try to find the density of the Pearson distribution for this case.
But I get the error message "There are no probability distributions with these moments".
Could you please explain to me what's wrong in the code below?
library(PearsonDS)
# define parameters for five binomial random variables
n<-rep(5,5)
p<-seq(0.02,0.10,0.02)
# find the first four cumulants
k.1<-sum(n*p)
k.2<-sum(n*p*(1-p))
k.3<-sum(n*p*(1-p)*(1-2*p))
k.4<-sum(n*p*(1-p)*(1-6*p*(1-p)))
# find the skewness and kurtosis parameters
beta.1<-k.3^2/k.2^3
beta.2<-k.4/k.2^2
# define the moments and calculate
moments <- c(mean=k.1,variance=k.2,skewness=sqrt(beta.1),kurtosis=beta.2)
dpearson(1:7,moments=moments)
I get the error message "There are no probability distributions with these moments".

What you are trying to insert as kurtosis in your moments is actually the excess kurtosis, which is just kurtosis - 3. From the help page of dpearson():
moments:
optional vector/list of mean, variance, skewness, kurtosis (not excess kurtosis).
So adding 3 to beta.2 will provide you with the real kurtosis:
beta.1 <- (k.3^2)/(k.2^3)
beta.2 <- k.4/(k.2^2)
kurt <- beta.2 + 3
moments <- c(mean = k.1, variance = k.2, skewness = beta.1, kurtosis = kurt)
dpearson(1:7, moments=moments)
# [1] 0.3438773545 0.2788412385 0.1295129534 0.0411140817 0.0099279576
# [6] 0.0019551512 0.0003294087
To get a result like the one in the paper, we should look at the cumulative distribution function and add 0.5 as a continuity correction for approximating a discrete distribution by a continuous one:
ppearson(1:7+0.5, moments = moments)
# [1] 0.5348017 0.8104394 0.9430092 0.9865434 0.9973715 0.9995578 0.9999339
A little background information:
The function threw an error because the relationship between the kurtosis and skewness you supplied was invalid: kurtosis is bounded below by the skewness in the following way: kurtosis >= skewness^2 + 1. The proof isn't pretty and is certainly beyond the scope of the question, but you can check out the references below for different versions of this inequality.
Wilkins, J. Ernest (1944). "A Note on Skewness and Kurtosis." Annals of Mathematical Statistics 15(3), 333-335. http://projecteuclid.org/euclid.aoms/1177731243
Pearson, K. (1916). "Mathematical contributions to the theory of evolution, XIX; second supplement to a memoir on skew variation." Philosophical Transactions of the Royal Society of London, Series A, 216, p. 432. http://rsta.royalsocietypublishing.org/content/216/538-548/429
Pearson, K. (1929). "Editorial note to 'Inequalities for moments of frequency functions and for various statistical constants'." Biometrika 21(1-4), 361-375.
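As a quick numeric check of this bound against the values in the question (a sketch that re-derives the cumulants rather than reusing the objects above):
n <- rep(5, 5); p <- seq(0.02, 0.10, 0.02)
k.2 <- sum(n * p * (1 - p))
k.3 <- sum(n * p * (1 - p) * (1 - 2 * p))
k.4 <- sum(n * p * (1 - p) * (1 - 6 * p * (1 - p)))
skew   <- k.3 / k.2^1.5       # sqrt(beta.1), about 0.72
exkurt <- k.4 / k.2^2         # beta.2, i.e. the EXCESS kurtosis, about 0.43
exkurt >= skew^2 + 1          # FALSE: passing beta.2 as kurtosis triggers the error
(exkurt + 3) >= skew^2 + 1    # TRUE: the corrected kurtosis is admissible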

Related

R: how to calculate a confidence interval based on a proportion

I'm new to R and trying to learn stats.
Here is one practice question that I'm trying to figure out: how should I use R to write a function based on this math equation?
I have a dataframe in which the "exposed" column contains two groups, one called "Test Group (Exposed)" and the other called "Control Group"; the math equation refers to these two groups.
In another exercise I used the following code to calculate a confidence interval:
# sample size
# OK for non normal data if n > 30
n <- 150
# calculate the mean & standard deviation
will_mean <- mean(will_sample)
will_s <- sd(will_sample)
# normal quantile function, assuming mean has a normal distribution:
qnorm(p=0.975, mean=0, sd=1) # 97.5th percentile for a N(0,1) distribution
# a.k.a. Z = 1.96 from the standard normal distribution
# calculate the margin of error
# confidence interval = mean +/- critical value * (s / sqrt(n))
# "q" functions in r give the value of the statistic at a given quantile
critical_value <- qt(p=0.975, df=n-1)
error <- critical_value * will_s/sqrt(n)
# confidence interval
will_mean - error
will_mean + error
but I'm not sure how to do this for the two groups in the "exposed" column.
Don't worry, it's quite easy if you have experience in at least one programming language; R is quite approachable.
The only notable difference between R and most other programming languages is that R was developed for statistical purposes.
You can compute the quantile for a certain significance level α (remember to divide it by 2 for your formula) using the function qnorm(). By default it is set up for the standardized normal distribution, as in your case, but you can get more details from the documentation, reachable with the command ?qnorm().
Strictly speaking the exercise does not require you to compute it, since z is passed as an argument, but in practice you will need to.
The code should be something like:
conf <- function(p1, p2, n1, n2, z) {
  # half-width of the interval for the difference of two proportions
  part <- z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  c(lower = p1 - p2 - part,
    upper = p1 - p2 + part)
}
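For example, with made-up proportions and sample sizes, and with qnorm(0.975) supplying z for a 95% interval:
z <- qnorm(0.975)                                      # about 1.96
conf(p1 = 0.35, p2 = 0.28, n1 = 150, n2 = 150, z = z)
# roughly (-0.035, 0.175)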

Why does dnorm() not return the standard deviation I inputted when I do sd(dnorm())?

This may be a dumb question, but I don't understand why sd(dnorm(1:100, mean=50, sd=15)) doesn't return the standard deviation as [1] 15.0, instead of what it actually returns, which is [1] 0.009440673. When I do this with rnorm(), as in sd(rnorm(100, mean=50, sd=15)), it returns what I would expect, a number close to 15: [1] 17.00682. Can someone please explain why sd(dnorm(x, mean=mean, sd=sd)) doesn't return the standard deviation that I input to dnorm?
The dnorm function returns the density of the normal distribution with the mean (50) and standard deviation (15) you gave it.
On the other hand, rnorm samples 100 numbers from that normal distribution, which is why you get a standard deviation close to 15.
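A quick way to see the difference (a small sketch; the rnorm result varies with the random seed):
dnorm(50, mean = 50, sd = 15)       # density of the N(50, 15) curve at x = 50, about 0.0266
set.seed(1)                         # for reproducibility
sd(rnorm(1e5, mean = 50, sd = 15))  # close to 15, because these are actual draws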
It's always helpful to plot your data. If you try hist(dnorm(1:100, mean=50, sd=15)) you'll see that the variability is very small (see below). As MkWTF points out, that's because dnorm returns the value of the probability density function of the normal distribution at value x given specified mean and sd.
rnorm in contrast generates random numbers with probability given by the probability density function of the normal distribution, which is why it allows a sensible estimate of the SD - the generated values follow that distribution.
The documentation for dnorm/pnorm/qnorm/rnorm is not great in my opinion (as someone who lacks a background in mathematics), but if you spend some time reading different online resources about these functions, and make sure you understand the underlying concepts (probability density functions, quantiles, random number generation, and (cumulative) distribution functions), it will become clear over time.
hist(dnorm(1:100, mean=50, sd=15))
Created on 2020-01-09 by the reprex package (v0.3.0)

Simulate a distribution with a given kurtosis and skewness in R? [duplicate]

Is it possible to generate distributions in R for which the mean, SD, skew, and kurtosis are known? So far it appears the best route would be to create random numbers and transform them accordingly.
If there is a package tailored to generating specific distributions which could be adapted, I have not yet found it.
Thanks
There is a Johnson distribution in the SuppDists package. Johnson will give you a distribution that matches either moments or quantiles. Other comments are correct that four moments do not a distribution make, but Johnson will certainly try.
Here's an example of fitting a Johnson to some sample data:
require(SuppDists)
## make a weird dist with Kurtosis and Skew
a <- rnorm( 5000, 0, 2 )
b <- rnorm( 1000, -2, 4 )
c <- rnorm( 3000, 4, 4 )
babyGotKurtosis <- c( a, b, c )
hist( babyGotKurtosis , freq=FALSE)
## Fit a Johnson distribution to the data
## TODO: Insert Johnson joke here
parms<-JohnsonFit(babyGotKurtosis, moment="find")
## Print out the parameters
sJohnson(parms)
## add the Johnson function to the histogram
plot(function(x)dJohnson(x,parms), -20, 20, add=TRUE, col="red")
The final plot (the histogram with the fitted Johnson density overlaid in red) shows a bit of the issue that others point out about how four moments do not fully capture a distribution.
Good luck!
EDIT
As Hadley pointed out in the comments, the Johnson fit looks off. I did a quick test and fit the Johnson distribution using moment="quant" which fits the Johnson distribution using 5 quantiles instead of the 4 moments. The results look much better:
parms<-JohnsonFit(babyGotKurtosis, moment="quant")
plot(function(x)dJohnson(x,parms), -20, 20, add=TRUE, col="red")
This produces a much better-looking fit.
Anyone have any ideas why Johnson seems biased when fit using moments?
This is an interesting question, which doesn't really have a good solution. I presume that even though you don't know the other moments, you have an idea of what the distribution should look like. For example, it's unimodal.
There are a few different ways of tackling this problem:
1. Assume an underlying distribution and match moments. There are many standard R packages for doing this. One downside is that the multivariate generalisation may be unclear.
2. Saddlepoint approximations. In this paper:
Gillespie, C. S. and Renshaw, E. An improved saddlepoint approximation. Mathematical Biosciences, 2007.
we look at recovering a pdf/pmf when given only the first few moments. We found that this approach works when the skewness isn't too large.
3. Laguerre expansions:
Mustapha, H. and Dimitrakopoulos, R. Generalized Laguerre expansions of multivariate probability densities with moments. Computers & Mathematics with Applications, 2010.
The results in this paper seem more promising, but I haven't coded them up.
This question was asked more than 3 years ago, so I hope my answer doesn't come too late.
There is a way to uniquely identify a distribution when knowing some of the moments: the method of Maximum Entropy. The distribution that results from this method is the one that maximizes your ignorance about the structure of the distribution, given what you know. Any other distribution that also has the moments you specified, but is not the MaxEnt distribution, implicitly assumes more structure than what you put in. The functional to maximize is Shannon's information entropy, $S[p(x)] = -\int p(x)\log p(x)\,dx$. Knowing the mean, SD, skewness and kurtosis translates into constraints on the first, second, third, and fourth moments of the distribution, respectively.
The problem is then to maximize $S$ subject to the moment constraints (together with the normalization constraint $\int p(x)\,dx = 1$):
1) $\int x\,p(x)\,dx$ = first moment,
2) $\int x^2\,p(x)\,dx$ = second moment,
3) ... and so on for the third and fourth moments.
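For completeness (this derivation is standard and not spelled out in the original answer): carrying out the constrained maximization with Lagrange multipliers gives, when a maximizer exists, a density of exponential-polynomial form,
$$p(x) \propto \exp\left(-\lambda_1 x - \lambda_2 x^2 - \lambda_3 x^3 - \lambda_4 x^4\right),$$
where the multipliers $\lambda_1,\dots,\lambda_4$ are chosen numerically so that the four moment constraints (and the normalization) are satisfied.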
I recommend the book "Harte, J., Maximum Entropy and Ecology: A Theory of Abundance, Distribution, and Energetics (Oxford University Press, New York, 2011)."
Here is a link to a discussion of implementing this in R:
https://stats.stackexchange.com/questions/21173/max-entropy-solver-in-r
One solution for you might be the PearsonDS library. It allows you to use a combination of the first four moments with the restriction that kurtosis > skewness^2 + 1.
To generate 10 random values from that distribution try:
library("PearsonDS")
moments <- c(mean = 0,variance = 1,skewness = 1.5, kurtosis = 4)
rpearson(10, moments = moments)
I agree that you need density estimation to replicate any distribution. However, if you have hundreds of variables, as is typical in a Monte Carlo simulation, you will need a compromise.
One suggested approach is as follows (a code sketch follows the caveats below):
1. Use the Fleishman transform to get the coefficients for the given skew and kurtosis. Fleishman takes the skew and kurtosis and gives you the coefficients.
2. Generate N standard normal variables (mean = 0, sd = 1).
3. Apply the Fleishman polynomial with those coefficients to the data from step 2; this transforms the normal data to the given skew and kurtosis.
4. Rescale the data from step 3 to the desired mean and standard deviation (std): new_data = desired mean + (data from step 3) * desired std.
The resulting data from Step 4 will have the desired mean, std, skewness and kurtosis.
Caveats:
Fleishman will not work for all combinations of skewness and kurtosis.
The steps above assume uncorrelated variables. If you want to generate correlated data, you will need an extra step before the Fleishman transform.
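A minimal sketch of these steps in base R, assuming the standard Fleishman (1978) power-method equations (the function names are illustrative, not from any package):
fleishman_coef <- function(skew, exkurt) {
  # solve the Fleishman system for b, c, d (with a = -c) by least squares
  obj <- function(par) {
    b <- par[1]; c <- par[2]; d <- par[3]
    eq1 <- b^2 + 6*b*d + 2*c^2 + 15*d^2 - 1
    eq2 <- 2*c*(b^2 + 24*b*d + 105*d^2 + 2) - skew
    eq3 <- 24*(b*d + c^2*(1 + b^2 + 28*b*d) +
               d^2*(12 + 48*b*d + 141*c^2 + 225*d^2)) - exkurt
    eq1^2 + eq2^2 + eq3^2
  }
  fit <- optim(c(1, 0, 0), obj, method = "BFGS")
  c(a = -fit$par[2], b = fit$par[1], c = fit$par[2], d = fit$par[3])
}
simulate_fleishman <- function(n, target_mean, target_sd, skew, exkurt) {
  cf <- fleishman_coef(skew, exkurt)                       # step 1
  z  <- rnorm(n)                                           # step 2
  y  <- cf["a"] + cf["b"]*z + cf["c"]*z^2 + cf["d"]*z^3    # step 3
  target_mean + target_sd * y                              # step 4
}
x <- simulate_fleishman(1e5, target_mean = 10, target_sd = 2, skew = 1, exkurt = 2)
Note that exkurt here is the excess kurtosis, and (per the first caveat) the solver will not converge for infeasible skewness/kurtosis combinations.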
Those parameters don't actually fully define a distribution. For that you need a density or equivalently a distribution function.
The entropy method is a good idea, but if you have the data samples you are using more information than just the moments, so a moment fit is often less stable. If you have no further information about what the distribution looks like, then entropy is a good concept; but if you do have more information, e.g. about the support, then use it!
If your data are skewed and positive, then a lognormal model is a good idea. If you also know the upper tail is finite, then do not use the lognormal, but maybe the 4-parameter beta distribution. If nothing is known about the support or tail characteristics, then a scaled and shifted lognormal model may be fine. If you need more flexibility regarding kurtosis, then e.g. a log-t with scaling and shifting is often fine. It can also help to know whether the fit should be near-normal; if so, use a model that includes the normal distribution (often the case anyway), otherwise you may e.g. use a generalized secant-hyperbolic distribution.
If you want to do all this, then at some point the model will have several different cases, and you should make sure that there are no gaps or bad transition effects between them.
As @David and @Carl wrote above, there are several packages dedicated to generating different distributions; see e.g. the Probability Distributions Task View on CRAN.
If you are interested in the theory (how to draw a sample of numbers fitting a specific distribution with the given parameters), then just look up the appropriate formulas, e.g. see the gamma distribution on Wikipedia, and set up a simple system of equations with the provided parameters to compute the scale and shape.
See a concrete example here, where I computed the alpha and beta parameters of a required beta distribution based on mean and standard deviation.
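A method-of-moments sketch of that idea in R (illustrative; not the exact code from the linked answer): solve mean = a/(a+b) and variance = a*b/((a+b)^2*(a+b+1)) for the two shape parameters.
beta_params <- function(mu, sigma) {
  stopifnot(sigma^2 < mu * (1 - mu))   # otherwise no beta distribution has these moments
  k <- mu * (1 - mu) / sigma^2 - 1
  c(alpha = mu * k, beta = (1 - mu) * k)
}
pars <- beta_params(0.3, 0.1)          # alpha = 6, beta = 14
x <- rbeta(1e5, pars["alpha"], pars["beta"])
c(mean(x), sd(x))                      # close to 0.3 and 0.1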

Using the sd command in R with a binomially distributed variable

I want to know whether the sd command in R works accurately when calculating the standard deviation of a binomial distribution.
Take the following example:
coin <- c("heads", "heads", "tails", "heads", "tails", "heads", "heads", "tails")
die <- as.factor(coin)
The standard deviation formula for such a distribution would be:
sd <- sqrt(n*p*(1-p))
where n is the number of trials, and p is the probability of success.
So we would calculate it in R as follows:
sqrt(8*(5/8)*(3/8))
[1] 1.369306
However, when we use the sd command, we get a different answer:
sd(coin)
[1] 0.5175492
Does the sd function in R not take into consideration the fact that the variable is not numeric? That explanation would make sense to me if R returned an error message, but it produces a result. Can you please clarify this for me? Thanks.
The sd function returns the (corrected) sample standard deviation, not the theoretic SD of a Bernoulli random variable. The sample SD is defined as sqrt(sum((x - x_bar)^2) / (N - 1)); see ?sd and ?var. Your example can be checked:
samp_var_die <- sum((as.numeric(die)-mean(as.numeric(die)))^2)/(length(die)-1)
samp_sd_die <- sqrt(samp_var_die)
samp_sd_die
#[1] 0.5175492
If you are interested in exploring the theoretic aspects of statistical distributions, there is a nice suite of packages devoted to this topic. Check out the distr package and its cousins: distrEllipse, distrEx, distrMod, distrRmetrics, distrSim, distrTeach, and RandVar. I found playing with the functions and examples from those packages quite educational and entertaining.
By the way, the 1.37 value you computed would be the theoretic SD (sigma) of the count of heads, i.e. the spread around the estimate of 5 heads you got from that series of observations.
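To make the distinction concrete, here is a small sketch (coding heads as 1 and tails as 0 is my assumption about what the questioner intended):
x <- c(1, 1, 0, 1, 0, 1, 1, 0)            # heads = 1, tails = 0
p_hat <- mean(x)                          # 5/8 = 0.625
sd(x)                                     # sample SD of the 0/1 data: 0.5175492
sqrt(p_hat * (1 - p_hat))                 # theoretic SD of a single Bernoulli draw: ~0.484
sqrt(length(x) * p_hat * (1 - p_hat))     # theoretic SD of the count of heads: 1.369306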

Difference between Hmisc wtd.var and SAS proc Mean generated weighted variance

I'm getting different results from R and SAS when I try to calculate a weighted variance. Does anyone know what might be causing this difference?
I create vectors of weights and values, and then calculate the weighted variance using the wtd.var function from the Hmisc library:
library(Hmisc)
wt <- c(5, 5, 4, 1)
x <- c(3.7,3.3,3.5,2.8)
wtd.var(x,weights=wt)
I get an answer of:
[1] 0.0612381
But if I try to reproduce these results in SAS I get a quite different result:
data test;
input wt x;
cards;
5 3.7
5 3.3
4 3.5
1 2.8
;
run;
proc means data=test var;
var x;
weight wt;
run;
This results in an answer of:
0.2857778
You probably have a difference in how the variance is calculated. SAS gives you an option, VARDEF, which may help here.
proc means data=test var vardef=WDF;
var x;
weight wt;
run;
On your dataset that gives the same variance as R. Both are 'right', depending on how you choose to calculate the weighted variance. (At my shop we calculate it a third way, of course...)
Complete text from PROC MEANS documentation:
VARDEF=divisor specifies the divisor to use in the calculation of the
variance and standard deviation. The following table shows the
possible values for divisor and associated divisors.
Possible values for VARDEF=:
Value          Divisor                    Formula for divisor
DF             degrees of freedom         n - 1
N              number of observations     n
WDF            sum of weights minus one   sum(w_i) - 1
WEIGHT | WGT   sum of weights             sum(w_i)
The procedure computes the variance as CSS/Divisor, where CSS
is the corrected sums of squares and equals Sum((Xi-Xbar)^2). When you
weight the analysis variables, CSS equals sum(Wi*(Xi-Xwbar)^2), where
Xwbar is the weighted mean.
Default: DF
Requirement: To compute the standard error of the mean, confidence limits for the mean, or the Student's t-test, use the default value of VARDEF=.
Tip: When you use the WEIGHT statement and
VARDEF=DF, the variance is an estimate of Sigma^2, where the
variance of the ith observation is Sigma^2/wi and wi is the
weight for the ith observation. This method yields an estimate of the
variance of an observation with unit weight.
Tip: When you use the
WEIGHT statement and VARDEF=WGT, the computed variance is
asymptotically (for large n) an estimate of Sigma^2/wbar, where
wbar is the average weight. This method yields an asymptotic
estimate of the variance of an observation with average weight.
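For reference, both numbers can be reproduced by hand in base R, following the CSS/divisor formulas quoted above (a sketch, not the internals of either package):
wt <- c(5, 5, 4, 1)
x  <- c(3.7, 3.3, 3.5, 2.8)
xwbar <- sum(wt * x) / sum(wt)     # weighted mean
css   <- sum(wt * (x - xwbar)^2)   # corrected (weighted) sum of squares
css / (length(x) - 1)              # VARDEF=DF:  0.2857778 (the SAS default)
css / (sum(wt) - 1)                # VARDEF=WDF: 0.0612381 (matches Hmisc wtd.var here)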
