Gamma Likelihood in R

I want to plot the posterior distribution for data sampled from gamma(2,3) with a prior distribution of gamma(3,3). I am assuming alpha=2 is known. But a graph of my posterior for different values of the rate parameter centers around 4. It should be 3. I even tried with a uniform prior to make things simpler. Can you please spot what's wrong? Thank you.
set.seed(101)
dat <- rgamma(100,shape=2,rate=3)
alpha <- 3
n <- 100
post <- function(beta_1) {
  posterior <- (((beta_1^alpha)^n) / gamma(alpha)^n) *
    prod(dat^(alpha - 1)) * exp(-beta_1 * sum(dat))
  return(posterior)
}
vlogl <- Vectorize(post)
curve(vlogl, from = 2, to = 6)

A tricky question, and possibly more related to statistics than to programming =). I initially made the same reasoning mistake as you, but subsequently realised I needed to be more careful with the posterior and the roles of alpha and beta_1.
The prior is uniform (or flat) so the posterior distribution is proportional (not equal) to the likelihood.
The quantity you have assigned to the posterior is indeed the likelihood. Plugging in alpha=3, this evaluates to
(prod(dat^2)/(gamma(alpha)^n)) * beta_1^(3*n)*exp(-beta_1*sum(dat)).
This is the crucial step. The last two terms in the product depend on beta_1 only, so these two parts determine the shape of the posterior. The posterior distribution is thus gamma distributed with shape parameter 3*n + 1 and rate parameter sum(dat). The mode of a gamma distribution is (shape - 1)/rate, and sum(dat) is about 66 for this seed, so we get a mode of 300/66 (about 4.55). This coincides perfectly with the "posterior plot" (again, you plotted the likelihood, which is not properly scaled, i.e. does not integrate to 1) produced by your code.
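A quick way to check this (my own sketch, reusing dat and n from the question's code): plot the properly normalised flat-prior posterior with dgamma and mark its mode.
curve(dgamma(x, shape = 3*n + 1, rate = sum(dat)), from = 2, to = 6,
      ylab = "posterior density")  # normalised version of the likelihood curve
abline(v = (3*n)/sum(dat), lty = 2)  # mode = (shape - 1)/rate, about 4.55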
I hope LifeisBetter now =).

But a graph of my posterior for different values of the rate parameter centers around 4. It should be 3.
The mean of your data is 0.659 (~2/3). Given a gamma distribution with a shape parameter alpha = 3, we are trying to find likely values of the rate parameter, beta, that gave rise to the observed data (subject to our prior information). The mean of a gamma distribution is the shape parameter divided by the rate parameter. 100 observations should be enough to mostly overcome the somewhat informative prior (which had a mean of 1), so we should expect beta to take values somewhere in the region of alpha/mean(dat), not 3.
alpha/mean(dat)
#> [1] 4.54915
I'm not going to show the derivation of the posterior distribution for beta without TeX, but it is a gamma distribution that includes the rate parameter from the prior distribution of beta (betaPrior = 3); a plain-text sketch of the derivation follows the code:
set.seed(101)
n <- 100
dat <- rgamma(n, 2, 3)
alpha <- 3
betaPrior <- 3
post <- function(x) dgamma(x, alpha*(n + 1), sum(dat) + betaPrior)
curve(post, 2, 6)
Notice that the posterior mean of beta is ~4.39 rather than ~4.55, because of the informative prior, which had a mean of 1.
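For reference, here is the skipped derivation in the same plain notation used elsewhere in this thread. The gamma(3, 3) prior on beta is proportional to beta^(alpha - 1) * exp(-betaPrior*beta) (with alpha = 3 doubling as the prior's shape), and the likelihood is proportional to beta^(alpha*n) * exp(-beta*sum(dat)), so
posterior(beta) ∝ beta^(alpha - 1) * exp(-betaPrior*beta) * beta^(alpha*n) * exp(-beta*sum(dat))
               = beta^(alpha*(n + 1) - 1) * exp(-beta*(sum(dat) + betaPrior))
which is the kernel of a gamma distribution with shape alpha*(n + 1) and rate sum(dat) + betaPrior, exactly the parameters passed to dgamma above.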

Related

How to use the ncp parameter in R's qt function?

I'm using R to make some calculations. This question is about R but also about statistics.
Say I have a dataset of paired samples consisting of a subject's blood platelet concentration after injection of placebo and then again after injection of medication, for a number of subjects. I want to estimate the mean difference for the paired samples. I'm just learning about the t distribution. If I wanted a 95% confidence interval for the mean difference using a Z-test, I could simply use:
mydata$diff <- mydata$medication - mydata$placebo
mu0 <- mean(mydata$diff)
sdmu <- sd(mydata$diff) / sqrt(length(mydata$diff))
qnorm(c(0.025, 0.975), mu0, sdmu)
After much confusion and cross-checking with the t.test function, I've figured out that I can get the 95% confidence interval for a t-test with:
qt(c(0.025, 0.975), df=19) * sdmu + mu0
My understanding of this is as follows:
Tstatistic = (mu - mu0)/sdmu
Tcdf^-1(0.025) <= (mu - mu0) / sdmu <= Tcdf^-1(0.975)
=>
sdmu * Tcdf^-1(0.025) + mu0 <= mu <= sdmu * Tcdf^-1(0.975) + mu0
The reason this is confusing is that if I were using a Z-test, I would write it like this:
qnorm(c(0.025, 0.975), mu0, sdmu)
and it's not until I tried to figure out how to use the t distribution that I realised I could move the normal distribution parameters out of the function too:
qnorm(c(0.025, 0.975), 0, 1) * sdmu + mu0
Trying to wrap my head around what this means mathematically: does it mean that the Z-statistic (mu - mu0)/sdmu is always normally distributed with mean 0 and standard deviation 1?
What has me stumped is that I'd like to move the t distribution parameters into the arguments to the function to cut down on the enormous mental overhead of thinking about this transformation.
However, according to my version of the R function qt's documentation, in order to do this, I would need to calculate the non-centrality parameter ncp. According to (my version of) the documentation, the ncp is explained as follows:
Let T = (mX - m0) / (S/sqrt(n)) where mX is the mean and S the sample standard deviation (sd) of X_1, X_2, …, X_n, which are i.i.d. N(μ, σ^2). Then T is distributed as non-central t with df = n - 1 degrees of freedom and non-centrality parameter ncp = (μ - m0) * sqrt(n)/σ.
I can't wrap my head around this at all. At first it seems to fit into my framework because Tstatistic = (mu - m0) / sdmu. But isn't μ what I want the qt function (which is Tcdf^-1) to return? How can it appear in the ncp, which I need to give as an input? And what about σ? What do μ and σ mean in this context?
Basically, how can I get the same result as qt(c(0.025, 0.975), df=19) * sdmu + mu0, without any terms outside of the function call, and could I have an explanation of how it works?
Let me try to explain without using any formulae.
First of all, the Student's t distribution and the normal distribution are two distinct probability distributions and (in most situations) are not supposed to give you the same results.
The t distribution is the appropriate probability distribution to test for a difference between two normally distributed samples. Since we do not know the population sd, we have to stick with the one we get from the sample. The resulting statistic is not normally distributed anymore; it is t-distributed.
The z-distribution can be used to approximate the test. In this case, we use the z-distribution as an approximation of the t-distribution. However, it is recommended not to do this with low degrees of freedom. Reason: the more degrees of freedom a t distribution has, the more similar it becomes to a normal distribution. Textbooks usually say that the t and normal distributions with df > 30 are similar enough to approximate the t with the normal distribution. In order to do that, you would first have to normalise your data so that mean = 0 and sd = 1. Then you can do the approximation using the z-distribution.
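You can see this convergence directly (a quick illustration, not part of the original answer):
qt(0.975, df = c(5, 30, 1000))
# [1] 2.570582 2.042272 1.962339
qnorm(0.975)
# [1] 1.959964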
I usually recommend not to use this approximation. It was a reasonable crutch when calculations had to be done on paper using your head, a pen, and a bunch of tables. There exist many workarounds in basic statistics that were supposed to give you a reasonable result with less computational effort. With modern computers, that is usually obsolete (in most cases, at least).
The z distribution, by the way, is defined (by convention) as a normal distribution N(0, 1) i.e. a normal distribution with mean = 0 and sd = 1.
Finally, about the different ways these distributions are specified. The normal distribution is actually the only probability distribution that I know of that you can specify by setting mean and sd directly (there are dozens of distributions, in case you're interested). The non-centrality parameter has a similar effect to the mean of the normal distribution. In a plot it moves the t-distribution along the x-axis. But it also changes its shape and skews it, so that mean and ncp move away from each other.
This code will show how the ncp changes the shape and location of the t-distribution:
x <- seq(-5, 15, 0.1)
plot(x, dt(x, df = 10, ncp = 0), type = "l")  # central t as the baseline
for (ncp in 1:6)
  lines(x, dt(x, df = 10, ncp = ncp))  # shifted and increasingly skewed
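As a footnote on the question itself: a quick numeric check (with made-up paired differences, since the original data is not shown) confirms that the questioner's formula reproduces the interval from t.test:
set.seed(1)
d <- rnorm(20)                      # hypothetical paired differences, n = 20
mu0 <- mean(d)
sdmu <- sd(d) / sqrt(length(d))
qt(c(0.025, 0.975), df = 19) * sdmu + mu0
t.test(d)$conf.int                  # same interval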

how to sample from an upside down bell curve

I can generate numbers with uniform distribution by using the code below:
runif(1,min=10,max=20)
How can I sample randomly generated numbers that fall more frequently closer to the minimum and maximum boundaries? (a.k.a. an "upside-down bell curve")
Well, a bell curve is usually Gaussian, meaning it doesn't have a min and max. You could try the beta distribution and map it to the desired interval, along the lines of:
min <- 1
max <- 20
q <- min + (max-min)*rbeta(10000, 0.5, 0.5)
As @Gregor-reinstateMonica noted, the beta distribution is bounded on both ends, [0, 1], so it can easily be mapped into any bounded interval just by scaling and shifting. It has two parameters, and it is symmetric if those parameters are equal. Parameters above 1 make it a kind of bell distribution, but parameters below 1 make it an inverted bell, which is what you're looking for. You could play with them: put in different values instead of 0.5 and see how it goes. Parameters equal to 1 make it uniform.
Sampling from a beta distribution is a good idea. Another way is to sample a number of uniform numbers and then take the minimum or maximum of them.
According to the theory of order statistics, the cumulative distribution function for the maximum is F(x)^n where F is the cdf from which the sample is taken and n is the number of samples, and the cdf for the minimum is 1 - (1 - F(x))^n. For a uniform distribution, the cdf is a straight line from 0 to 1, i.e., F(x) = x, and therefore the cdf of the maximum is x^n and the cdf of the minimum is 1 - (1 - x)^n. As n increases, these become more and more curved, with most of the mass close to the ends.
A web search for "order statistics" will turn up some resources.
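A minimal R sketch of that idea (my own illustration, not from the answer): take the max or the min of n uniforms with equal probability, then rescale to the question's [10, 20] interval. Larger n pushes more mass toward the ends.
n <- 5                                    # uniforms per draw
samp <- replicate(10000, {
  u <- runif(n)
  if (runif(1) < 0.5) max(u) else min(u)  # mix min and max 50/50
})
hist(10 + 10*samp)                        # mapped onto [10, 20]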
If you don't care about decimal places, a hacky way would be to generate a large sample of normally distributed data points using rnorm(), then count the number of times each rounded value appears (n), and then subtract n from the maximum count (max(n)) to get inverse counts.
You can then use the inverse count to make a new vector (that you can sample from), i.e.:
library(tidyverse)
x <- rnorm(100000, 100, 15)
x_tib <- round(x) %>%
  tibble(x = .) %>%
  count(x) %>%
  mutate(new_n = max(n) - n)
new_x <- rep(x_tib$x, x_tib$new_n)
qplot(new_x, binwidth = 1)
An "upside-down bell curve" compared to the normal distribution can be sampled using the following algorithm. I write it in pseudocode because I'm not familiar with R. Notice that this sampler samples in a truncated interval (here, the interval [x0, x1]) because it's not possible for an upside-down bell curve extended to infinity to integrate to 1 (which is one of the requirements for a probability density).
In the pseudocode, RNDU01() is a uniform(0, 1) random number.
x0pdf = 1-exp(-(x0*x0))
x1pdf = 1-exp(-(x1*x1))
ymax = max(x0pdf, x1pdf)
while true
    # Choose a random x-coordinate
    x = RNDU01()*(x1-x0)+x0
    # Choose a random y-coordinate
    y = RNDU01()*ymax
    # Return x if y falls within PDF
    if y < 1-exp(-(x*x)): return x
end
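For R users, a direct translation of that pseudocode (my sketch; runif() plays the role of RNDU01()):
sample_inverted_bell <- function(x0, x1) {
  # Rejection sampling: density proportional to 1 - exp(-x^2) on [x0, x1]
  ymax <- max(1 - exp(-x0^2), 1 - exp(-x1^2))
  repeat {
    x <- runif(1, x0, x1)             # choose a random x-coordinate
    y <- runif(1, 0, ymax)            # choose a random y-coordinate
    if (y < 1 - exp(-x^2)) return(x)  # return x if y falls within the PDF
  }
}
hist(replicate(10000, sample_inverted_bell(-2, 2)))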

R - Gamma Cumulative Distribution Function

I want to calculate the gamma CDF for an array of data that I have. I have calculated the alpha and beta parameters, but I am not sure how to calculate the CDF in R (is there something like Matlab's gamcdf?).
I have seen some people use fitdistr or pgamma, but I do not understand where to put the alpha and beta values, or whether I need them at all.
Thanks.
A gamma distribution is defined by two parameters, and given those two parameters you can calculate the cdf for an array of values using pgamma.
# Let's make a vector
x = seq(0, 3, .01)
# Now define the parameters of your gamma distribution
shape = 1
rate = 2
# Now calculate points on the cdf
cdf = pgamma(x, shape, rate)
# Shown plotted here
plot(x,cdf)
Note that the gamma distribution can be parameterized in different ways. Check ?pgamma for specifics to ensure your two parameters match.
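For example, rate and scale are reciprocals of each other, so these two calls agree (for shape = 1 the CDF at 1 is 1 - exp(-2)):
pgamma(1, shape = 1, rate = 2)
# [1] 0.8646647
pgamma(1, shape = 1, scale = 0.5)
# [1] 0.8646647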

Simulate Values that follow a Distribution Curve in R

I want to simulate demand values that follow different distributions (e.g., as above: starts off linear > exponential > invlog > etc.). I'm a bit confused by the notion of probability distributions but thought I could use rnorm, rexp, rlogis, etc. Is there any way I could do so?
I think it may be this but in R: Generating smoothed randoms that follow a distribution
Simulating random values from commonly-used probability distributions in R is fairly trivial using rnorm(), rexp(), etc, if you know what distribution you want to use, as well as its parameters. For example, rnorm(10, mean=5, sd=2) returns 10 draws from a normal distribution with mean 5 and sd 2.
rnorm(10, mean = 5, sd = 2)
## [1] 5.373151 7.970897 6.933788 5.455081 6.346129 5.767204 3.847219 7.477896 5.860069 6.154341
## or here's a histogram of 10000 draws...
hist(rnorm(10000, 5, 2))
You might be interested in an exponential distribution - check out hist(rexp(10000, rate=1)) to get the idea.
The easiest solution will be to investigate what probability distribution(s) you're interested in and their implementation in R.
It is still possible to return random draws from some custom function, and there are a few techniques out there for doing it - but it might get messy. Here's a VERY rough implementation of drawing randomly from probabilities defined by the curve x^3 - 3*x^2 + 4 between zero and 3 (its maximum on that interval is 4, hence the scaling by 4 below).
## first a vector of random uniform draws from the region
unifdraws <- runif(10000, 0, 3)
## assign a probability of "keeping" draws based on scaled probability
pkeep <- (unifdraws^3 - 3*unifdraws^2 + 4)/4
## randomly keep observations based on this probability
keep <- rbinom(10000, size=1, prob=pkeep)
draws <- unifdraws[keep==1]
## and there it is!
hist(draws)
## of course, it's less than 10000 now, because we rejected some
length(draws)
## [1] 4364

Gamma equivalent to standard deviations

I have a gamma distribution fit to my data using library(fitdistrplus). I need to determine a method for defining the range of x values that can be "reasonably" expected, analogous to using standard deviations with normal distributions.
For example, x values within two standard deviations from the mean could be considered to be the reasonable range of expected values from a normal distribution. Any suggestions for how to define a similar range of expected values based on the shape and rate parameters of a gamma distribution?
...maybe something like identifying the two values of x between which 95% of the data falls?
Let's assume we have a random variable that is gamma distributed with shape alpha=2 and rate beta=3. We would expect this distribution to have mean 2/3 and standard deviation sqrt(2)/3, and indeed we see this in simulated data:
mean(rgamma(100000, 2, 3))
# [1] 0.6667945
sd(rgamma(100000, 2, 3))
# [1] 0.4710581
sqrt(2) / 3
# [1] 0.4714045
It would be pretty weird to define confidence ranges as [mean - k*sd, mean + k*sd]. To see why, consider if we selected k = 2 in the example above. This would yield the confidence range [-0.276, 1.609], but the gamma distribution can't even take on negative values, and 4.7% of the data falls above 1.609. This is, at the very least, not a well-balanced confidence interval.
A more natural choice might be to take the 0.025 and 0.975 quantiles of the distribution as a confidence range. We would expect 2.5% of data to fall below this range and 2.5% of data to fall above it. We can use qgamma to determine that, for our example parameters, the confidence range would be [0.081, 1.857].
qgamma(c(0.025, 0.975), 2, 3)
# [1] 0.08073643 1.85721446
The expected value of a gamma distribution is E[X] = k * theta and the variance is Var[X] = k * theta^2, where k is the shape and theta is the scale.
But typically I would use 95% quantiles to indicate data spread.
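For the example above, the same 95% range can be computed in this shape/scale parameterisation, since theta = 1/rate:
qgamma(c(0.025, 0.975), shape = 2, scale = 1/3)
# [1] 0.08073643 1.85721446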
