I can generate numbers with uniform distribution by using the code below:
runif(1,min=10,max=20)
How can I sample randomly generated numbers that fall more frequently closer to the minimum and maximum boundaries (a.k.a. an "upside-down bell curve")?
Well, a bell curve is usually Gaussian, meaning it doesn't have a min or a max. You could try the Beta distribution and map it to the desired interval, along the lines of:
min <- 1
max <- 20
q <- min + (max-min)*rbeta(10000, 0.5, 0.5)
As @Gregor-reinstateMonica noted, the Beta distribution is bounded on both ends, [0...1], so it can easily be mapped into any bounded interval just by scaling and shifting. It has two parameters, and it is symmetric if those parameters are equal. Parameters above 1 make it a kind of bell distribution, but parameters below 1 turn it into an inverted bell, which is what you're looking for. You could play with them: put in different values instead of 0.5 and see how it goes. Parameters equal to 1 make it uniform.
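For instance, a quick way to see the effect of the parameters (a small sketch, reusing the 10-20 range from the question):
par(mfrow = c(1, 3))
for (p in c(0.3, 0.5, 1)) {
  q <- 10 + (20 - 10) * rbeta(10000, p, p)
  hist(q, breaks = 50, main = paste("shape =", p))  # p < 1 gives the inverted bell
}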
Sampling from a beta distribution is a good idea. Another way is to sample a number of uniform numbers and then take the minimum or maximum of them.
According to the theory of order statistics, the cumulative distribution function for the maximum is F(x)^n where F is the cdf from which the sample is taken and n is the number of samples, and the cdf for the minimum is 1 - (1 - F(x))^n. For a uniform distribution, the cdf is a straight line from 0 to 1, i.e., F(x) = x, and therefore the cdf of the maximum is x^n and the cdf of the minimum is 1 - (1 - x)^n. As n increases, these become more and more curved, with most of the mass close to the ends.
A web search for "order statistics" will turn up some resources.
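As a small R sketch of that idea (n uniforms per draw, taking the minimum or the maximum at random so that both ends get mass, then rescaling to the 10-20 range from the question):
n <- 3                                      # uniforms per draw; larger n pushes more mass to the ends
draws <- replicate(10000, {
  u <- runif(n)
  if (runif(1) < 0.5) min(u) else max(u)    # take the minimum or the maximum
})
x <- 10 + 10 * draws                        # rescale from [0, 1] to [10, 20]
hist(x, breaks = 50)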
If you don't care about decimal places, a hacky way would be to generate a large sample of normally distributed data points using rnorm(), count the number of times each rounded value appears (n), and then subtract n from the maximum count (max(n)) to get inverse counts.
You can then use the inverse counts to build a new vector (that you can sample from), i.e.:
library(tidyverse)
x <- rnorm(100000, 100, 15)
x_tib <- round(x) %>%
  tibble(x = .) %>%
  count(x) %>%                 # n = how often each rounded value appears
  mutate(new_n = max(n) - n)   # inverse counts
new_x <- rep(x_tib$x, x_tib$new_n)
qplot(new_x, binwidth = 1)
An "upside-down bell curve" compared to the normal distribution can be sampled using the following algorithm. I write it in pseudocode because I'm not familiar with R. Notice that this sampler samples in a truncated interval (here, the interval [x0, x1]) because it's not possible for an upside-down bell curve extended to infinity to integrate to 1 (which is one of the requirements for a probability density).
In the pseudocode, RNDU01() is a uniform(0, 1) random number.
x0pdf = 1 - exp(-(x0*x0))
x1pdf = 1 - exp(-(x1*x1))
ymax = max(x0pdf, x1pdf)
while true
    # Choose a random x-coordinate
    x = RNDU01()*(x1-x0) + x0
    # Choose a random y-coordinate
    y = RNDU01()*ymax
    # Return x if y falls within the PDF
    if y < 1 - exp(-(x*x)): return x
end
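For reference, one possible R translation of that pseudocode (x0 and x1 are the truncation bounds you pick yourself):
sample_inverted_bell <- function(x0, x1) {
  ymax <- max(1 - exp(-x0^2), 1 - exp(-x1^2))  # envelope height for rejection
  repeat {
    x <- runif(1, x0, x1)                      # random x-coordinate
    y <- runif(1, 0, ymax)                     # random y-coordinate
    if (y < 1 - exp(-x^2)) return(x)           # accept if y falls under the PDF
  }
}
samples <- replicate(10000, sample_inverted_bell(-2, 2))
hist(samples, breaks = 50)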
I want to plot the posterior distribution for data sampled from gamma(2,3) with a prior distribution of gamma(3,3). I am assuming alpha=2 is known. But a graph of my posterior for different values of the rate parameter centers around 4. It should be 3. I even tried with a uniform prior to make things simpler. Can you please spot what's wrong? Thank you.
set.seed(101)
dat <- rgamma(100,shape=2,rate=3)
alpha <- 3
n <- 100
post <- function(beta_1) {
  posterior <- (((beta_1^alpha)^n) / gamma(alpha)^n) *
    prod(dat^(alpha - 1)) * exp(-beta_1 * sum(dat))
  return(posterior)
}
vlogl <- Vectorize(post)
curve(vlogl, from = 2, to = 6)
A tricky question and possibly more related to statistics than to programming =). I initially made the same reasoning mistake as you, but subsequently realised that one has to be more careful with the posterior and with the roles of alpha and beta_1.
The prior is uniform (or flat) so the posterior distribution is proportional (not equal) to the likelihood.
The quantity you have assigned to the posterior is indeed the likelihood. Plugging in alpha=3, this evaluates to
(prod(dat^2)/(gamma(alpha)^n)) * beta_1^(3*n)*exp(-beta_1*sum(dat)).
This is the crucial step. The last two terms in the product depend on beta_1 only, so these two parts determine the shape of the posterior. The posterior distribution is thus gamma distributed with shape parameter 3*n+1 and rate parameter sum(dat). As the mode of a gamma distribution is (shape - 1)/rate and sum(dat) is about 66 for this seed, we get a mode of 300/66 (about 4.55). This coincides perfectly with the "posterior plot" (again, you plotted the likelihood, which is not properly scaled, i.e. does not integrate to 1) produced by your code.
I hope LifeisBetter now =).
But a graph of my posterior for different values of the rate parameter centers around 4. It should be 3.
The mean of your data is 0.659 (~2/3). Given a gamma distribution with a shape parameter alpha = 3, we are trying to find likely values of the rate parameter, beta, that gave rise to the observed data (subject to our prior information). The mean of a gamma distribution is the shape parameter divided by the rate parameter. 100 observations should be enough to mostly overcome the somewhat informative prior (which had a mean of 1), so we should expect beta to take values somewhere in the region alpha/mean(dat), not 3.
alpha/mean(dat)
#> [1] 4.54915
I'm not going to show the derivation of the posterior distribution for beta without TeX, but it is a gamma distribution that includes the rate parameter from the prior distribution of beta (betaPrior = 3):
set.seed(101)
n <- 100
dat <- rgamma(n, 2, 3)
alpha <- 3
betaPrior <- 3
post <- function(x) dgamma(x, alpha*(n + 1), sum(dat) + betaPrior)
curve(post, 2, 6)
Notice that the mean of beta is at ~4.39 rather than ~4.55 because of the informative prior that had a mean of 1.
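For reference, a sketch of the omitted derivation (conjugate gamma prior on the rate beta, with the shape alpha treated as known, prior shape a_0 and prior rate b_0):

$$p(\beta \mid x) \;\propto\; \beta^{n\alpha}\, e^{-\beta \sum_i x_i} \;\times\; \beta^{a_0 - 1}\, e^{-b_0 \beta} \;=\; \beta^{n\alpha + a_0 - 1}\, e^{-(\sum_i x_i + b_0)\,\beta},$$

i.e. a gamma distribution with shape n*alpha + a_0 and rate sum(dat) + b_0. With a_0 = b_0 = 3 and alpha = 3, the shape equals 3*(n + 1) = alpha*(n + 1), which is what the dgamma call above uses.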
I'm very new to coding in R (and to coding in general). I've created a distribution using a random walk within the following code:
set.seed(124)
norm <- rnorm(1000)
mean(norm)
mean(norm)^2
sd(norm)
d <- density(norm)
plot(d)
Now I want to create a function of n steps using the above distribution. The function calculates the expected values based on the probability of moving n steps to the left or right from the center. I have no idea where to begin.
Any direction would be greatly appreciated.
Thanks
If each normally distributed variate is your step size (positive moves right and negative moves left), then the cumulative sum of your random draws represents your current position. You can compute that with the cumsum function in R:
set.seed(144)
pos <- cumsum(rnorm(1000))
plot(seq_along(pos), pos, xlab="Step Number", ylab="Current Position")
Using replicate and logical operations, you can simulate any number of different questions about random walks. For instance "with what probability does the value of the random walk exceed 100 within the first 1000 steps" could be simulated with:
set.seed(144)
exceed.100 <- replicate(100000, any(cumsum(rnorm(1000)) >= 100))
mean(exceed.100)
# [1] 0.00173
From these 100k replicates, it looks like the probability is around 0.17% that the random walk will exceed 100 during the first 1000 steps.
I have the pdf of a distribution. This distribution is not a standard distribution and no functions exist in R to sample from it. How do I sample from this pdf using R?
This is more of a statistics question, as it requires sampling, but in general, you can take this approach to the problem:
Find a distribution f whose pdf, when multiplied by some constant k, is always greater than or equal to the pdf of the distribution in question, g (i.e. k*f(x) >= g(x) for all x).
For each sample, do the following steps:
1. Sample a random number x from the distribution f.
2. Calculate C = g(x)/(k*f(x)). This should be less than or equal to 1.
3. Draw a random number u from a uniform distribution U(0,1). If u > C, then go back to step 1. Otherwise keep x as the sample and continue sampling if desired.
This process is known as rejection sampling, and is often used in random number generators that are not uniform.
The normal distribution and the uniform distribution are some of the more common distributions to sample from, but you can do other ones. Generally you want the shapes of k*f(x) and g(x) to be very close, so you don't have to reject a lot of samples.
Here's an example implementation:
# n  is the sample size
# g  is the pdf you want to sample from
# rf is the sampling function for f
# df is the density function for f
# k  is the multiplicative constant (chosen so that k * df(x) >= g(x) everywhere)
# ... is any necessary parameters for f
function.sample <- function(n, g, rf, df, k, ...) {
  results <- numeric(n)
  counter <- 0
  while (counter < n) {
    x <- rf(1, ...)                        # propose a value from f
    x.pdf <- df(x, ...)
    if (runif(1) <= g(x) / (k * x.pdf)) {  # accept with probability g(x) / (k * f(x))
      results[counter + 1] <- x
      counter <- counter + 1
    }
  }
  results
}
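As a hypothetical usage example, here is how you might sample the "upside-down bell" shape g(x) = 1 - exp(-x^2) on [-2, 2] with a uniform proposal (the rejection step only needs g up to a normalising constant, so the unnormalised form is fine here; all names below are just illustrative):
g  <- function(x) 1 - exp(-x^2)
rf <- function(n, lo, hi) runif(n, lo, hi)   # proposal sampler
df <- function(x, lo, hi) dunif(x, lo, hi)   # proposal density (0.25 on [-2, 2])
k  <- 4                                      # k * df(x) = 1 >= g(x) on [-2, 2]
samples <- function.sample(10000, g, rf, df, k, lo = -2, hi = 2)
hist(samples, breaks = 50)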
There are other methods to do random sampling, but this is usually the easiest, and it works well for most functions (unless their PDF is hard to calculate but their CDF isn't).
I am using density {stats} to construct a kernel "gaussian" density of a vector of variables. If I use the following example dataset:
x <- rlogis(1475, location=0, scale=1) # x is a vector of values - taken from a rlogis just for the purpose of explanation
d<- density(x=x, kernel="gaussian")
Is there some way to get the first derivative of this density d at each of the n = 1475 points?
Edit #2:
Following up on Greg Snow's excellent suggestion to use the analytical expression for the derivative of a Gaussian, and our conversation following his post, this will get you the exact slope at each of those points:
s <- d$bw
## Mean of the 1475 kernel slopes at each point X; the 1/s^2 factor comes from differentiating dnorm
slope2 <- sapply(x, function(X) {mean(dnorm(x - X, mean = 0, sd = s) * (x - X)) / s^2})
## And then, to compare to the method below, plot the results against one another
plot(slope2 ~ slope)
Edit:
OK, I just reread your question, and see that you wanted slopes at each of the points in the input vector x. Here's one way you might approximate that:
slope <- (diff(d$y)/diff(d$x))[findInterval(x, d$x)]
A possible further refinement would be to find the location of the point within its interval, and then calculate its slope as the weighted average of the slope of the present interval and the interval to its right or left.
I'd approach this by averaging the slopes of the segments just to the right and left of each point. (A bit of special care needs to be taken for the first and last points, which have no segment to their left and right, respectively.)
dy <- diff(d$y)
dx <- diff(d$x)[1] ## Works b/c density() returns points at equal x-intervals
((c(dy, tail(dy, 1)) + c(head(dy, 1), dy))/2)/dx
The curve of a density estimator is just the sum of all the kernels, in your case a gaussian (divided by the number of points). The derivative of a sum is the sum of the derivatives and the derivative of a constant times a function is that constant times the derivative. So the derivative of the density estimate at a given point will just be the average of the slopes for the 1475 different gaussian curves at that given point. Each gaussian curve will have a mean corresponding to each of the data points and a standard deviation based on the bandwidth. So if you can calculate the slope for a gaussian, then finding the slope for the density estimate is just a mean of the 1475 slopes.
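In symbols (a sketch; phi is the standard normal density, h is the bandwidth d$bw, and x_1, ..., x_n are the data points):

$$\hat f(X) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h}\,\varphi\!\left(\frac{X - x_i}{h}\right),
\qquad
\hat f'(X) = \frac{1}{n}\sum_{i=1}^{n} \frac{x_i - X}{h^2}\cdot\frac{1}{h}\,\varphi\!\left(\frac{X - x_i}{h}\right),$$

which is exactly the mean of the individual kernel slopes computed in the slope2 code above.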
Suppose I have the following two random variables:
X where mean = 6 and stdev = 3.5
Y where mean = -42 and stdev = 5
I would like to create a new random variable Z based on the first two, knowing that X happens 90% of the time and Y happens 10% of the time.
It is easy to calculate the mean for Z: 0.9 * 6 + 0.1 * (-42) = 1.2
But is it possible to generate random values for Z in a single function?
Of course, I could do something along those lines :
if (randIntBetween(1, 10) > 1)
    GenerateRandomNormalValue(6, 3.5);
else
    GenerateRandomNormalValue(-42, 5);
But I would really like to have a single function that would act as a probability density function for such a random variable (Z) that is not necessarily normal.
sorry for the crappy pseudo-code
Thanks for your help!
Edit: here is one concrete question:
Let's say we add the result of 5 consecutive values from Z. What would be the probability of ending up with a number higher than 10?
But I would really like to have a single function that would act as a probability density function for such a random variable (Z) that is not necessarily normal.
Okay, if you want the density, here it is:
rho = 0.9 * density_of_x + 0.1 * density_of_y
But you cannot sample from this density unless you 1) compute its CDF (cumbersome, but not infeasible) and 2) invert it (you will need a numerical solver for this). Or you can do rejection sampling (or variants, e.g. importance sampling). This is costly, and cumbersome to get right.
So you should go for the "if" statement (i.e. call the generator 3 times), unless you have a very strong reason not to (using quasi-random sequences, for instance).
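As a minimal R sketch of that density (dnorm gives the two component densities):
dz <- function(z) 0.9 * dnorm(z, mean = 6, sd = 3.5) + 0.1 * dnorm(z, mean = -42, sd = 5)
curve(dz, from = -60, to = 20, n = 1000)   # bimodal, clearly not normal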
If a random variable is denoted x = (mean, stdev), then the following algebra applies:
number * x = ( number*mean, number*stdev )
x1 + x2 = ( mean1+mean2, sqrt(stdev1^2+stdev2^2) )
so for the case of X = (mx,sx), Y= (my,sy) the linear combination is
Z = w1*X + w2*Y = (w1*mx,w1*sx) + (w2*my,w2*sy) =
( w1*mx+w2*my, sqrt( (w1*sx)^2+(w2*sy)^2 ) ) =
( 1.2, 3.19 )
Link: Normal Distribution (look for the Miscellaneous section, item 1).
PS. Sorry for the weird notation. The new standard deviation is calculated by something similar to the Pythagorean theorem: it is the square root of the sum of squares.
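A quick simulation check of this linear combination (note that 0.9*X + 0.1*Y is a different random variable from the 90%/10% mixture described in the question, as the next answer points out):
z <- 0.9 * rnorm(1e6, 6, 3.5) + 0.1 * rnorm(1e6, -42, 5)
c(mean(z), sd(z))   # approximately 1.2 and 3.19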
This is the form of the distribution:
ListPlot[
  BinCounts[
    Table[If[RandomReal[] < .9,
             RandomReal[NormalDistribution[6, 3.5]],
             RandomReal[NormalDistribution[-42, 5]]], {1000000}],
    {-60, 20, .1}],
  PlotRange -> Full, DataRange -> {-60, 20}]
It is NOT Normal, as you are not adding Normal variables, but just choosing one or the other with a certain probability.
Edit
This is the curve for adding five variables with this distribution:
The upper and lower peaks represent taking one of the distributions alone, and the middle peak accounts for the mixing.
The most straightforward and generically applicable solution is to simulate the problem:
Run the piecewise function you have 1,000,000 times (just a high number), generate a histogram of the results by splitting them into bins, and divide the count for each bin by your N (1,000,000 in my example). This will leave you with an approximation of the PDF of Z for every bin.
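A minimal R sketch of that simulation, which also answers the "5 consecutive values" edit (the 90/10 split and the two normals are taken from the question; the names are just illustrative):
rz <- function(n) ifelse(runif(n) < 0.9, rnorm(n, 6, 3.5), rnorm(n, -42, 5))
z <- rz(1e6)                           # a large sample from Z
hist(z, breaks = 200, freq = FALSE)    # normalised histogram approximates the PDF of Z
sums <- replicate(1e5, sum(rz(5)))     # sums of 5 consecutive values of Z
mean(sums > 10)                        # estimated P(sum of 5 draws > 10)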
Lots of unknowns here, but essentially you just wish to add the two (or more) probability functions to one another.
For any given probability function you could calculate a random number with that density by calculating the area under the probability curve (the integral) and then generating a random number between 0 and that area. Then move along the curve until the area is equal to your random number and use that as your value.
This process can then be generalized to any function (or sum of two or more functions).
Elaboration:
Suppose you have a density function f(x) defined for x from 0 to 1. You could generate a random number based on the distribution by calculating the integral of f(x) from 0 to 1, giving you the area under the curve; let's call it A.
Now, you generate a random number between 0 and A; let's call that number r. You then need to find a value t such that the integral of f(x) from 0 to t is equal to r. That t is your random number.
This process can be used for any probability density function f(x). Including the sum of two (or more) probability density functions.
I'm not sure what your functions look like, so I'm not sure whether you can find analytic solutions for all of this, but worst case you could use numeric techniques to approximate the effect.
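For example, here is a hedged numeric sketch of that approach in R, using the mixture density from the question as f (the grid range and size are arbitrary choices):
f <- function(x) 0.9 * dnorm(x, 6, 3.5) + 0.1 * dnorm(x, -42, 5)
grid <- seq(-70, 25, length.out = 5000)
cdf <- cumsum(f(grid)) * diff(grid)[1]      # numeric integral of f up to each grid point
cdf <- cdf / max(cdf)                       # normalise so the total area A becomes 1
r <- runif(10000)
samples <- grid[findInterval(r, cdf) + 1]   # move along the curve until the area reaches r
hist(samples, breaks = 100)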