How do I sample from a custom distribution? - r

I have the pdf of a distribution. This distribution is not a standard distribution and no functions exist in R to sample from it. How do I sample from this pdf using R?

This is more of a statistics question, as it requires sampling, but in general you can take this approach to the problem:
Find a distribution f and a constant k such that k times the pdf of f is everywhere greater than or equal to the pdf of the distribution in question, g (that is, k*f(x) >= g(x) for all x).
For each sample, do the following steps:
1. Sample a random number x from the distribution f.
2. Calculate C = g(x)/(k*f(x)). By the choice of k, this is at most 1.
3. Draw a random number u from a uniform distribution U(0,1). If u > C, reject x and go back to step 1. Otherwise keep x as the sample, and continue from step 1 if more samples are desired.
This process is known as rejection sampling, and is often used in random number generators that are not uniform.
The normal distribution and the uniform distribution are among the more common distributions to sample from, but you can use others. Generally you want the shapes of k*f(x) and g(x) to be very close, so you don't have to reject a lot of samples.
Here's an example implementation:
# n is the sample size
# g is the pdf you want to sample from
# rf is the sampling function for f
# df is the density function for f
# k is the multiplicative constant (k * df(x) >= g(x) for all x)
# ... is any necessary parameters for f
function.sample <- function(n, g, rf, df, k, ...) {
  results <- numeric(n)
  counter <- 0
  while (counter < n) {
    x <- rf(1, ...)                         # propose a candidate from f
    accept.prob <- g(x) / (k * df(x, ...))  # acceptance probability, at most 1
    if (runif(1) <= accept.prob) {          # accept x with probability accept.prob
      results[counter + 1] <- x
      counter <- counter + 1
    }
  }
  results                                   # return the accepted samples
}
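For example, here is a quick sanity check of the function above: sampling from the Beta(2,2) density g(x) = 6x(1-x) on [0,1] with a Uniform(0,1) proposal, where k = 1.5 works because g peaks at 1.5 (at x = 0.5) while the uniform density is 1:
g <- function(x) 6 * x * (1 - x)   # target pdf, maximum value 1.5
samples <- function.sample(10000, g, runif, dunif, k = 1.5)
hist(samples, freq = FALSE)        # should trace the Beta(2,2) shape
curve(dbeta(x, 2, 2), add = TRUE)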
There are other methods for random sampling, but this is usually the easiest, and it works well for most distributions (unless their PDF is hard to calculate but their CDF isn't).

Related

Random number simulation in R

I have been going through some random number simulation equations and found that R doesn't have a built-in function for the Pareto distribution. rpareto can be written as:
rpareto <- function(n, a, l) {
  rp <- l * ((1 - runif(n))^(-1/a) - 1)
  rp
}
Can someone explain the intuitive meaning behind this?
It's a well-known result that if X is a continuous random variable with CDF F(.), then Y = F(X) has a Uniform distribution on [0, 1].
This result can be used to draw random samples of any continuous random variable whose CDF is known: generate u, a Uniform(0, 1) random variable and then determine the value of x for which F(x) = u.
In specific cases, there may well be more efficient ways of sampling from F(.), but this will always work as a fallback.
It's likely (I haven't checked the accuracy of the code myself, but it looks about right) that the body of your function solves F(x) = u for known u in order to generate a random variable with a Pareto distribution. You can check it with a little algebra after getting the CDF from this Wikipedia page.
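As a quick numerical check (assuming the Pareto Type II, a.k.a. Lomax, parametrization, whose CDF is F(x) = 1 - (1 + x/l)^(-a); the helper plomax below is my own definition, not a base R function): inverting u = F(x) gives x = l*((1-u)^(-1/a) - 1), which is exactly the body of rpareto.
plomax <- function(q, a, l) 1 - (1 + q/l)^(-a)  # theoretical CDF
x <- rpareto(100000, a = 2, l = 3)
ks.test(x, plomax, a = 2, l = 3)  # a large p-value means the samples match the CDF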

how to sample from an upside down bell curve

I can generate numbers with uniform distribution by using the code below:
runif(1,min=10,max=20)
How can I sample randomly generated numbers that fall more frequently closer to the minimum and maximum boundaries? (a.k.a. an "upside-down bell curve")
Well, a bell curve is usually Gaussian, meaning it doesn't have a min and max. You could try the Beta distribution and map it to the desired interval, along the lines of:
min <- 1
max <- 20
q <- min + (max-min)*rbeta(10000, 0.5, 0.5)
As @Gregor-reinstateMonica noted, the Beta distribution is bounded on both ends, [0, 1], so it can easily be mapped into any bounded interval by a scale and shift. It has two shape parameters and is symmetric when they are equal. Parameters above 1 make it a kind of bell distribution, while parameters below 1 make it an inverted bell, which is what you're looking for. Parameters equal to 1 make it uniform. You can play with them: put in different values instead of 0.5 and see how it goes.
Sampling from a beta distribution is a good idea. Another way is to sample a number of uniform numbers and then take the minimum or maximum of them.
According to the theory of order statistics, the cumulative distribution function for the maximum is F(x)^n where F is the cdf from which the sample is taken and n is the number of samples, and the cdf for the minimum is 1 - (1 - F(x))^n. For a uniform distribution, the cdf is a straight line from 0 to 1, i.e., F(x) = x, and therefore the cdf of the maximum is x^n and the cdf of the minimum is 1 - (1 - x)^n. As n increases, these become more and more curved, with most of the mass close to the ends.
A web search for "order statistics" will turn up some resources.
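Here is a small sketch of that idea (my own illustration, not from the answer above): mix the minimum and maximum of n uniforms with equal probability, so mass piles up at both ends of the interval.
m <- 10000                                    # number of samples
n <- 3                                        # uniforms per sample
u <- matrix(runif(m * n, min = 10, max = 20), m, n)
lo <- apply(u, 1, min)                        # minimum of each row
hi <- apply(u, 1, max)                        # maximum of each row
x <- ifelse(runif(m) < 0.5, lo, hi)           # 50/50 mixture of min and max
hist(x, breaks = 40)                          # mass concentrates near 10 and 20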
If you don't care about decimal places, a hacky way would be to generate a large sample of normally distributed data points using rnorm(), count the number of times each rounded value appears (n), and then subtract n from the maximum count (max(n)) to get inverse counts.
You can then use the inverse count to make a new vector (that you can sample from), i.e.:
library(tidyverse)
x <- rnorm(100000, 100, 15)
x_tib <- round(x) %>%
  tibble(x = .) %>%
  count(x) %>%
  mutate(new_n = max(n) - n)
new_x <- rep(x_tib$x, x_tib$new_n)
qplot(new_x, binwidth = 1)
An "upside-down bell curve" compared to the normal distribution can be sampled using the following algorithm. I write it in pseudocode because I'm not familiar with R. Notice that this sampler samples in a truncated interval (here, the interval [x0, x1]) because it's not possible for an upside-down bell curve extended to infinity to integrate to 1 (which is one of the requirements for a probability density).
In the pseudocode, RNDU01() is a uniform(0, 1) random number.
x0pdf = 1 - exp(-(x0*x0))
x1pdf = 1 - exp(-(x1*x1))
ymax = max(x0pdf, x1pdf)
while true
    # Choose a random x-coordinate
    x = RNDU01() * (x1 - x0) + x0
    # Choose a random y-coordinate
    y = RNDU01() * ymax
    # Return x if y falls within the PDF
    if y < 1 - exp(-(x*x)): return x
end
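For reference, here is a direct R translation of the pseudocode (the function name sample_inverted_bell is my own):
# Rejection sampler for the truncated density proportional to 1 - exp(-x^2) on [x0, x1]
sample_inverted_bell <- function(x0, x1) {
  g <- function(x) 1 - exp(-x^2)
  ymax <- max(g(x0), g(x1))    # the curve is highest at the interval's endpoints
  repeat {
    x <- runif(1, x0, x1)      # choose a random x-coordinate
    y <- runif(1, 0, ymax)     # choose a random y-coordinate
    if (y < g(x)) return(x)    # return x if the point falls under the curve
  }
}
x <- replicate(10000, sample_inverted_bell(-2, 2))
hist(x, breaks = 40)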

How to simulate from poisson distribution using simulations from exponential distribution

I am asked to implement an algorithm to simulate from a Poisson(lambda) distribution using simulations from an exponential distribution.
I was given the following identity:
P(X = k) = P(X1 + ... + Xk ≤ 1 < X1 + ... + Xk+1), for k = 1, 2, ...
where X is Poisson(lambda) and the Xi are exponentially distributed.
I wrote code to simulate the exponential distribution, but have no clue how to simulate a Poisson. Could anybody help me with this? Thanks a million.
My code:
n <- c(1:k)
u <- runif(k)
x <- -log(1 - u) / lambda  # k draws from Exp(lambda) by inverse transform
I'm working on the assumption that you (or your instructor) want to do this from first principles rather than just calling the built-in Poisson generator. The algorithm is pretty straightforward: you count how many exponentials with the specified rate you can generate before their sum exceeds 1.
My R is rusty and this sounds like homework anyway, so I'll express it as pseudo-code:
count <- 0
sum <- 0
repeat {
    generate x ~ exp(lambda)
    sum <- sum + x
    if sum > 1
        break
    else
        count <- count + 1
}
The value of count after you break from the loop is your Poisson outcome for this trial. If you wrap this as a function, return count rather than breaking from the loop.
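In R, that pseudocode translates to something like the following (rpois_basic is a name I've made up):
rpois_basic <- function(lambda) {
  count <- 0
  total <- 0    # 'sum' is a base R function, so use a different name
  repeat {
    total <- total + rexp(1, rate = lambda)  # generate x ~ exp(lambda) and accumulate
    if (total > 1) break
    count <- count + 1
  }
  count         # the Poisson(lambda) outcome for this trial
}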
You can improve this computationally in a couple of ways. The first is to notice that the 1-U term for generating the exponentials has a uniform distribution, and can be replaced by just U. The more significant improvement is obtained by writing the evaluation as maximize i s.t. SUM(-log(Ui) / rate) <= 1, so SUM(log(Ui)) >= -rate.
Now exponentiate both sides and simplify to get
PRODUCT(Ui) >= Exp(-rate).
The right-hand side of this is constant, and can be pre-calculated, reducing the amount of work from k+1 log evaluations and additions to one exponentiation and k+1 multiplications:
count <- 0
product <- 1
threshold <- Exp(-lambda)
repeat {
    generate u ~ Uniform(0,1)
    product <- product * u
    if product < threshold
        break
    else
        count <- count + 1
}
Assuming you do the U for 1-U substitution for both implementations, they are algebraically equal and will yield identical answers to within the precision of floating point arithmetic for a given set of U's.
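A possible R transcription of the improved version (rpois_product is again my own name):
rpois_product <- function(lambda) {
  threshold <- exp(-lambda)       # pre-calculated once
  count <- 0
  product <- runif(1)
  while (product >= threshold) {  # keep multiplying uniforms until we drop below exp(-lambda)
    count <- count + 1
    product <- product * runif(1)
  }
  count
}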
You can use rpois to generate Poisson variates directly, as suggested above. However, my understanding of the question is that you wish to do so from first principles rather than using built-in functions. To do this, you need to use the property of Poisson arrivals that the inter-arrival times are exponentially distributed. Therefore we proceed as follows:
Step 1: Generate a (large) sample from the exponential distribution and create a vector of its cumulative sums. The k-th entry of this vector is the waiting time until the k-th Poisson arrival.
Step 2: Count how many arrivals we see in a unit time interval.
Step 3: Repeat steps 1 and 2 many times and gather the results into a vector.
This will be your sample from the Poisson distribution with the correct rate parameter.
The code:
lambda <- 20  # for example
out <- sapply(1:100000, function(i) {
  u <- runif(100)              # 100 uniform draws
  x <- -log(1 - u) / lambda    # exponential inter-arrival times
  y <- cumsum(x)               # arrival times
  sum(y <= 1)                  # number of arrivals in the unit interval
})
Then you can test the validity vs the built-in function via the Kolmogorov-Smirnov test:
ks.test(out, rpois(100000, lambda))

Defining exponential distribution in R to estimate probabilities

I have a bunch of random variables (X1,...,Xn) which are i.i.d. Exp(1/2) and represent the duration of time of a certain event. So this distribution obviously has an expected value of 2, but I am having problems defining it in R. I did some research and found something about so-called Monte Carlo simulation, but I can't seem to find what I'm looking for in it.
An example of what I want to estimate: let's say we have 10 random variables (X1,...,X10) distributed as above, and we want to determine the probability P(X1 + ... + X10 ≤ 25).
Thanks.
You don't actually need Monte Carlo simulation in this case, because:
If Xi ~ Exp(λ), then the sum X1 + ... + Xk ~ Erlang(k, λ), which is just a Gamma(k, 1/λ) (in the (k, θ) parametrization) or a Gamma(k, λ) (in the (α, β) parametrization) with an integer shape parameter k.
From wikipedia (https://en.wikipedia.org/wiki/Exponential_distribution#Related_distributions)
So, P([X1+...+X10<=25]) can be computed by
pgamma(25, shape=10, rate=0.5)
Are you aware of the rexp() function in R? Have a look at its documentation page by typing ?rexp in the R console.
A quick answer to your Monte Carlo estimation of desired probability:
mean(rowSums(matrix(rexp(1000 * 10, rate = 0.5), 1000, 10)) <= 25)
I have generated 1000 sets of 10 exponential samples, putting them into a 1000 * 10 matrix. We take row sums and get a vector of 1000 entries. The proportion of values between 0 and 25 is an empirical estimate of the desired probability.
Thanks, this was helpful! Can I use replicate with this code, to make it look like this: F <- function(n, B=1000) mean(replicate(B, (rexp(10, rate = 0.5))))? But I am unable to get the right result.
replicate here generates a matrix too, but it is a 10 * 1000 matrix (as opposed to the 1000 * 10 one in my answer), so you now need to take colSums. Also, where did you put n?
The correct function would be
F <- function(n, B=1000) mean(colSums(replicate(B, rexp(10, rate = 0.5))) <= n)
For a non-Monte Carlo method for your given example, see the other answer: the exponential distribution is a special case of the gamma distribution, and the latter has an additivity property.
I am giving you the Monte Carlo method because you name it in your question, and it is applicable beyond your example.
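To see the two approaches agree, you can compare the Monte Carlo estimate against the exact Gamma probability:
set.seed(1)                          # any seed; results vary only by Monte Carlo error
F(25, B = 100000)                    # Monte Carlo estimate, using F as defined above
pgamma(25, shape = 10, rate = 0.5)   # exact probability, roughly 0.80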

Mistake in Chi-Square Distribution

I'm trying to generate chi-square random variables according to the following algorithm:
X = -2*log(a(1) * a(2) * ... * a(m/2)), for m even
X = -2*log(a(1) * a(2) * ... * a((m-1)/2)) + z^2, for m odd
where the a(i) are independent, standard normal random variables, with m even and odd respectively.
Wikipedia gives the following definition:
https://en.wikipedia.org/wiki/Chi-squared_distribution#Definition
The code I wrote is:
dch = double(1000)
t = double(1)
for (i in 1:1000) {
  for (j in 1:m) {
    x = runif(1, 0, 1)
    t = t + x*x
  }
  dch[i] = t
}
but I am getting the wrong density plot.
So, where are the mistakes, and how can I fix them?
As Gregor suggested in comments, you are misinterpreting the inputs to the algorithm. One way to get a Chi-squared with m degrees of freedom is to sum m independent squared standard normals, but that's not the only distributional relationship we know. It turns out that a Chi-squared(2) is the same as an exponential distribution with a mean of 2, and exponentials are straightforward to generate with inverse transform sampling, a.k.a. inversion. So in principle, if m is even you want to generate m/2 exponential(2)'s and sum them. If m is odd, do the same but add one additional standard normal squared.
What all that means is that a straightforward implementation would have you doing m/2 logarithmic evaluations to generate the exponentials. It turns out you can apply the superposition property of exponentials so you only have to do one log evaluation. Since transcendental functions are computationally expensive, this improves the efficiency of the algorithm.
When the dust settles: the z on the second line of your algorithm is a standard normal, but the a's are Uniform(0,1)'s, not normals.
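Putting it together, here is a corrected R sketch of the algorithm (my interpretation of the answer above; rchisq_custom is a made-up name):
rchisq_custom <- function(n, m) {
  k <- m %/% 2                            # number of uniforms in the product
  replicate(n, {
    x <- -2 * log(prod(runif(k)))         # sum of k Exp(2)'s, i.e. chi-squared with 2k df
    if (m %% 2 == 1) x <- x + rnorm(1)^2  # one extra squared standard normal for odd m
    x
  })
}
hist(rchisq_custom(10000, m = 5), breaks = 50, freq = FALSE)
curve(dchisq(x, df = 5), add = TRUE)      # should match the built-in density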
