Gamma equivalent to standard deviations - r

I have a gamma distribution fitted to my data using library(fitdistrplus). I need to determine a method for defining the range of x values that can be "reasonably" expected, analogous to using standard deviations with normal distributions.
For example, x values within two standard deviations from the mean could be considered to be the reasonable range of expected values from a normal distribution. Any suggestions for how to define a similar range of expected values based on the shape and rate parameters of a gamma distribution?
...maybe something like identifying the two values of x between which 95% of the data falls?

Let's assume we have a random variable that is gamma distributed with shape alpha=2 and rate beta=3. We would expect this distribution to have mean 2/3 and standard deviation sqrt(2)/3, and indeed we see this in simulated data:
mean(rgamma(100000, 2, 3))
# [1] 0.6667945
sd(rgamma(100000, 2, 3))
# [1] 0.4710581
sqrt(2) / 3
# [1] 0.4714045
It would be pretty weird to define confidence ranges as [mean - k*sd, mean + k*sd]. To see why, consider selecting k=2 in the example above. This yields the confidence range [-0.276, 1.609], but the gamma distribution can't even take on negative values, and 4.7% of the data falls above 1.609. This is, at the very least, not a well-balanced confidence interval.
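We can check the 4.7% figure directly:
pgamma(1.609, 2, 3, lower.tail = FALSE)
# ~0.047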
A more natural choice might be to take the 0.025 and 0.975 quantiles of the distribution as a confidence range. We would expect 2.5% of the data to fall below this range and 2.5% of the data to fall above it. We can use qgamma to determine that, for our example parameters, the confidence range is [0.081, 1.857].
qgamma(c(0.025, 0.975), 2, 3)
# [1] 0.08073643 1.85721446

The expected value of a gamma distribution is:
E[X] = k * theta
The variance is Var[X] = k * theta^2, where k is the shape and theta is the scale (if you have a rate parameter instead, theta = 1/rate).
But typically I would use the 2.5% and 97.5% quantiles to indicate data spread.
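For example, with the example's shape = 2 and rate = 3:
shape <- 2; rate <- 3
scale <- 1 / rate            # theta = 1 / rate
shape * scale                # mean: k * theta = 2/3
sqrt(shape) * scale          # sd: sqrt(k) * theta = sqrt(2)/3
qgamma(c(0.025, 0.975), shape = shape, rate = rate)  # central 95% range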


Gamma Likelihood in R

I want to plot the posterior distribution for data sampled from gamma(2,3) with a prior distribution of gamma(3,3). I am assuming the shape parameter alpha is known. But a graph of my posterior for different values of the rate parameter centers around 4. It should be 3. I even tried with a uniform prior to make things simpler. Can you please spot what's wrong? Thank you.
set.seed(101)
dat <- rgamma(100, shape = 2, rate = 3)
alpha <- 3
n <- 100
post <- function(beta_1) {
  # unnormalised posterior (under a flat prior this is just the likelihood)
  posterior <- (((beta_1^alpha)^n) / gamma(alpha)^n) *
    prod(dat^(alpha - 1)) * exp(-beta_1 * sum(dat))
  return(posterior)
}
vlogl <- Vectorize(post)
curve(vlogl, from = 2, to = 6)
A tricky question, and possibly more related to statistics than to programming =). I initially made the same reasoning mistake as you, but subsequently realised I had to be more careful with the posterior and the roles of alpha and beta_1.
The prior is uniform (or flat) so the posterior distribution is proportional (not equal) to the likelihood.
The quantity you have assigned to the posterior is indeed the likelihood. Plugging in alpha=3, this evaluates to
(prod(dat^2)/(gamma(alpha)^n)) * beta_1^(3*n)*exp(-beta_1*sum(dat)).
This is the crucial step. The last two terms in the product depend on beta_1 only, so they determine the shape of the posterior. The posterior distribution is thus gamma distributed with shape parameter 3*n + 1 and rate parameter sum(dat). The mode of a gamma distribution with shape k and rate lambda is (k - 1)/lambda, and sum(dat) is about 66 for this seed, so we get a mode of 300/66 (about 4.55). This coincides with the "posterior plot" produced by your code; again, you plotted the likelihood, which is not properly scaled, i.e. does not integrate to 1.
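As a quick numerical check, reusing the simulated data from your code:
set.seed(101)
dat <- rgamma(100, shape = 2, rate = 3)
n <- 100
3 * n / sum(dat)  # posterior mode (k - 1)/lambda = 300/sum(dat), about 4.55
curve(dgamma(x, shape = 3 * n + 1, rate = sum(dat)), from = 2, to = 6)  # properly scaled posterior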
I hope LifeisBetter now =).
But a graph of my posterior for different values of the rate parameter
centers around 4. It should be 3.
The mean of your data is 0.659 (~2/3). Given a gamma distribution with a shape parameter alpha = 3, we are trying to find likely values of the rate parameter, beta, that gave rise to the observed data (subject to our prior information). The mean of a gamma distribution is the shape parameter divided by the rate parameter. 100 observations should be enough to mostly overcome the somewhat informative prior (which had a mean of 1), so we should expect beta to take values somewhere in the region of alpha/mean(dat), not 3.
alpha/mean(dat)
#> [1] 4.54915
I'm not going to show the derivation of the posterior distribution for beta without TeX, but it is a gamma distribution that includes the rate parameter from the prior distribution of beta (betaPrior = 3):
set.seed(101)
n <- 100
dat <- rgamma(n, 2, 3)
alpha <- 3
betaPrior <- 3
post <- function(x) dgamma(x, alpha*(n + 1), sum(dat) + betaPrior)
curve(post, 2, 6)
Notice that the mean of beta is at ~4.39 rather than ~4.55 because of the informative prior that had a mean of 1.
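A quick check of that figure, reusing the objects defined above:
alpha * (n + 1) / (sum(dat) + betaPrior)  # posterior mean of beta, about 4.39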

How to sample from an upside-down bell curve

I can generate numbers with uniform distribution by using the code below:
runif(1,min=10,max=20)
How can I sample randomly generated numbers that fall more frequently closer to the minimum and maximum boundaries? (a.k.a. an "upside-down bell curve")
Well, a bell curve is usually Gaussian, meaning it doesn't have a min and max. You could try the Beta distribution and map it to the desired interval. Something along these lines:
lo <- 1   # avoid naming these min/max, which would shadow the base R functions
hi <- 20
q <- lo + (hi - lo) * rbeta(10000, 0.5, 0.5)
As @Gregor-reinstateMonica noted, the Beta distribution is bounded on both ends, [0...1], so it can easily be mapped into any bounded interval just by scaling and shifting. It has two parameters, and it is symmetric if those parameters are equal. Parameters above 1 make it a kind of bell distribution, but parameters below 1 turn it into an inverse bell, which is what you're looking for. You could play with them, putting in different values instead of 0.5, and see how it goes. Parameters equal to 1 make it uniform.
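For a quick visual comparison, reusing lo and hi from above:
hist(lo + (hi - lo) * rbeta(10000, 0.5, 0.5), breaks = 40)  # inverse bell
hist(lo + (hi - lo) * rbeta(10000, 2, 2), breaks = 40)      # bell-like
hist(lo + (hi - lo) * rbeta(10000, 1, 1), breaks = 40)      # uniform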
Sampling from a beta distribution is a good idea. Another way is to sample a number of uniform numbers and then take the minimum or maximum of them.
According to the theory of order statistics, the cumulative distribution function for the maximum is F(x)^n where F is the cdf from which the sample is taken and n is the number of samples, and the cdf for the minimum is 1 - (1 - F(x))^n. For a uniform distribution, the cdf is a straight line from 0 to 1, i.e., F(x) = x, and therefore the cdf of the maximum is x^n and the cdf of the minimum is 1 - (1 - x)^n. As n increases, these become more and more curved, with most of the mass close to the ends.
A web search for "order statistics" will turn up some resources.
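Here is a sketch of this idea in R via inverse CDFs: if U is uniform on [0, 1], then U^(1/n) has the distribution of the maximum of n uniforms (CDF x^n), and 1 - U^(1/n) that of the minimum (CDF 1 - (1 - x)^n). Mixing the two 50/50 piles mass at both ends.
n <- 5                          # larger n pushes mass harder toward the edges
u <- runif(10000)
v <- ifelse(runif(10000) < 0.5, u^(1/n), 1 - u^(1/n))  # 50/50 mix of max and min
hist(10 + 10 * v, breaks = 40)  # mapped to [10, 20] as in the question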
If you don't care about decimal places, a hacky way would be to generate a large sample of normally distributed data points using rnorm(), then count the number of times each rounded value appears (n), and then subtract n from the maximum count (max(n)) to get inverse counts.
You can then use the inverse count to make a new vector (that you can sample from), i.e.:
library(tidyverse)
x <- rnorm(100000, 100, 15)
x_tib <- round(x) %>%
  tibble(x = .) %>%
  count(x) %>%
  mutate(new_n = max(n) - n)
new_x <- rep(x_tib$x, x_tib$new_n)
qplot(new_x, binwidth = 1)
An "upside-down bell curve" compared to the normal distribution can be sampled using the following algorithm. I write it in pseudocode because I'm not familiar with R. Notice that this sampler samples in a truncated interval (here, the interval [x0, x1]) because it's not possible for an upside-down bell curve extended to infinity to integrate to 1 (which is one of the requirements for a probability density).
In the pseudocode, RNDU01() is a uniform(0, 1) random number.
x0pdf = 1-exp(-(x0*x0))
x1pdf = 1-exp(-(x1*x1))
ymax = max(x0pdf, x1pdf)
while true
    # Choose a random x-coordinate
    x = RNDU01() * (x1 - x0) + x0
    # Choose a random y-coordinate
    y = RNDU01() * ymax
    # Return x if y falls within the PDF
    if y < 1 - exp(-(x*x)): return x
end
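A direct R translation of that pseudocode might look like this (a sketch; x0 and x1 are arbitrary truncation points):
sample_inverted_bell <- function(x0 = -2, x1 = 2) {
  ymax <- max(1 - exp(-x0^2), 1 - exp(-x1^2))
  repeat {
    x <- runif(1, x0, x1)             # random x-coordinate
    y <- runif(1, 0, ymax)            # random y-coordinate
    if (y < 1 - exp(-x^2)) return(x)  # accept if y falls under the PDF
  }
}
hist(replicate(10000, sample_inverted_bell()), breaks = 40)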

Generate positive real numbers from rpois()

I am trying to create a Poisson simulation using rpois(). I have a distribution of two-decimal-place interest rates and I want to find out if these follow a Poisson rather than a normal distribution.
The rpois() function returns positive integers. I want it to return two decimal place positive numbers instead. I have tried the following
set.seed(123)
trialA <- rpois(1000, 13.67) # generate 1000 numbers
mean(trialA)
13.22 # Great! Close enough to 13.67
var(trialA)
13.24 # terrific! mean and variance should be the same
head(trialA, 4)
6 7 8 14 # Oh no!! I want numbers with two decimal places...??????
# Here is my solution...but it has a problem
# 1) Scale the initial distribution by multiplying lambda by 100
trialB <- rpois(1000, 13.67 * 100)
# 2) Then, divide the result by 100 so I get a fractional component
trialB <- trialB / 100
head(trialB, 4) # check results
13.56 13.62 13.26 13.44 # terrific !
# check summary results
mean(trialB)
13.67059 # as expected..great!
var(trialB)
0.153057 # oh no!! I want it to be close to: mean(trialB) = 13.67059
How can I use rpois() to generate positive two-decimal-place numbers that have a Poisson distribution?
I know that Poisson distributions are used for counts and that counts are positive integers but I also believe that Poisson distributions can be used to model rates. And these rates could be just positive integers divided by a scalar.
If you scale a Poisson distribution to change its mean, the result is no longer Poisson, and the mean and variance are no longer equal -- if you scale the mean by a factor s, then the variance changes by a factor s^2.
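A quick numerical check of this, reusing lambda = 13.67 from the question:
lambda <- 13.67
y <- rpois(1e5, 100 * lambda) / 100
mean(y)  # ~13.67
var(y)   # ~0.137, i.e. lambda / 100, not lambda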
You probably want to use the Gamma distribution. The mean of the Gamma is shape*scale and the variance is shape*scale^2, so you have to use scale=1 to get real, positive numbers with equal mean and variance:
set.seed(1001)
r <- rgamma(1e5,shape=13.67,scale=1)
mean(r) ## 13.67375
var(r) ## 13.6694
You can round to two decimal places without changing the mean and variance very much:
r2 <- round(r,2)
mean(r2) ## 13.67376
var(r2) ## 13.66938
Compare with a Poisson distribution:
par(las=1,bty="l")
curve(dgamma(x, shape=13.67, scale=1), from=0, to=30,
      ylab="Probability or prob. density")
points(0:30,dpois(0:30,13.67),type="h",lwd=2)

Sum of N independent standard normal variables

I wanted to simulate the sum of N independent standard normal variables.
sums <- numeric(5000)  # one entry per simulation path
for (i in 1:5000) {
  sums[i] <- sum(rnorm(5000, 0, 1))
}
I tried to draw N = 5000 standard normals and sum them, repeating for 5000 simulation paths.
I would expect the expectation of sums to be 0, and the variance of sums to be 5000.
> mean(sums)
[1] 0.4260789
> var(sums)
[1] 5032.494
The simulated expectation is too big. When I tried it again, I got 1.309206 for the mean.
@ilir is correct; the value you get is essentially zero.
If you plot a histogram of sums, you will see values ranging between roughly -200 and 200. On that scale, 0.42 is for all intents and purposes 0.
You can test this with t.test.
> t.test(sums, mu = 0)
One Sample t-test
data: sums
t = -1.1869, df = 4999, p-value = 0.2353
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-3.167856 0.778563
sample estimates:
mean of x
-1.194646
There is no evidence that your mean value differs from zero (given that the null hypothesis is true).
It is perfectly normal that the mean does not fall exactly on 0, because it is an empirical mean computed from "only" 5000 realizations of the random variable.
However, the distribution of the realizations contained in the sums vector should "look" Gaussian.
For example, when I plot the histogram and the Q-Q plot of 10000 realizations of the sum of 5000 Gaussian variables (created in this way: sums <- replicate(1e4, sum(rnorm(5000, 0, 1)))), it looks normal:
hist(sums)
qqnorm(sums)
The sum of independent normals is again normal, with mean equal to the sum of the means and variance equal to the sum of the variances. So sum(rnorm(5000, 0, 1)) is equivalent to rnorm(1, 0, sqrt(5000)). The sample average of normals is again normal. In your case you take a sample average of 5000 independent normal variables, each with zero mean and variance 5000. This is a normal variable with zero mean and unit variance, i.e. the standard normal.
So in your case mean(sums) has the same distribution as rnorm(1), and any value in the interval (-1.96, 1.96) will come up about 95% of the time.
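You can see this equivalence numerically:
sums2 <- rnorm(5000, 0, sqrt(5000))  # one draw per path, same distribution as the loop above
mean(sums2)  # close to 0
var(sums2)   # close to 5000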

R: Function that finds the range of 95% of all values?

Is there a function, or an elegant way in the R language, to get the minimum range that covers, say, 95% of all values in a vector?
Any suggestions are very welcome :)
95% of the data will fall between the 2.5th percentile and 97.5th percentile. You can compute that value in R as follows:
x <- runif(100)
quantile(x,probs=c(.025,.975))
To get a sense of what's going on, here's a plot:
qts <- quantile(x, probs = c(.025, .975))
hist(x)
abline(v=qts[1],col="red")
abline(v=qts[2],col="red")
Note this is the exact/empirical 95% interval; there's no normality assumption.
It's not so hard to write such a function:
find_cover_region <- function(x, alpha = 0.95) {
  n <- length(x)
  x <- sort(x)
  k <- as.integer(round((1 - alpha) * n))  # number of points to leave out
  # slide a window over all intervals containing n - k points and pick the shortest
  i <- which.min(x[seq.int(n - k, n)] - x[seq_len(k + 1L)])
  c(x[i], x[n - k + i - 1L])
}
The function finds the shortest such interval. If there are several intervals of the same length, the first one (from -Inf) is picked.
find_cover_region(1:100, 0.70)
# [1] 1 70
find_cover_region(rnorm(10000), 0.9973) # three sigma, approx (-3,3)
# [1] -2.859 3.160 # results may differ
You could also look at highest density regions (e.g. the hdr function in package hdrcde). It's a more statistical way to find the shortest intervals with a given coverage probability (some kernel density estimators are involved).
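A usage sketch, assuming the hdrcde package is installed (note that prob is given in percent):
library(hdrcde)
x <- rexp(10000)
hdr(x, prob = 95)  # highest density region covering ~95%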
The emp.hpd function in the TeachingDemos package will find the values in a vector that enclose a given percentage of the data (e.g. 95%) while giving the shortest range between the values. If the data are roughly symmetric then this will be close to the result of using quantile, but if the data are skewed then this will give a shorter range.
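A usage sketch, assuming the TeachingDemos package is installed:
library(TeachingDemos)
x <- rexp(10000)                      # strongly skewed data
emp.hpd(x, conf = 0.95)               # shortest interval containing ~95%
quantile(x, probs = c(0.025, 0.975))  # equal-tailed interval, wider here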
If the values are distributed approximately like the normal distribution, you can use the standard deviation. First, calculate the mean µ and standard deviation of the distribution. 95% of the values will be in the interval of (µ - 1.960 * stdev, µ + 1.960 * stdev).
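For example:
x <- rnorm(10000, mean = 50, sd = 10)
mean(x) + c(-1.96, 1.96) * sd(x)  # roughly (30.4, 69.6)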
