How do I best simulate an arbitrary univariate random variate using its probability function? - r

In R, what's the best way to simulate an arbitrary univariate random variate if only its probability density function is available?

Here is a (slow) implementation of the inverse cdf method when you are only given a density.
den <- dnorm  # replace with your own density

# compute the CDF by numerical integration
cdf <- function(x) integrate(den, -Inf, x)$value

# invert the CDF: bracket the root, then solve with uniroot()
inverse.cdf <- function(x, cdf, starting.value = 0) {
  lower.found <- FALSE
  lower <- starting.value
  while (!lower.found) {
    if (cdf(lower) >= (x - .000001))
      lower <- lower - (lower - starting.value)^2 - 1
    else
      lower.found <- TRUE
  }
  upper.found <- FALSE
  upper <- starting.value
  while (!upper.found) {
    if (cdf(upper) <= (x + .000001))
      upper <- upper + (upper - starting.value)^2 + 1
    else
      upper.found <- TRUE
  }
  uniroot(function(y) cdf(y) - x, c(lower, upper))$root
}

# generate 1000 random variates from the distribution 'den'
vars <- apply(matrix(runif(1000)), 1, function(x) inverse.cdf(x, cdf))
hist(vars)
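If speed matters, a common refinement is to tabulate the CDF once on a grid and invert it by interpolation. A minimal sketch, assuming the density's mass lies essentially inside [-10, 10] (reusing den and cdf from above):

# sketch: grid-based inverse CDF; assumes the mass of 'den' lies in [-10, 10]
grid <- seq(-10, 10, length.out = 1000)
cdf.grid <- sapply(grid, cdf)              # tabulate the CDF once
keep <- c(TRUE, diff(cdf.grid) > 0)        # drop flat tail values so the map is strictly increasing
inv.approx <- approxfun(cdf.grid[keep], grid[keep])  # interpolated inverse CDF
vars2 <- inv.approx(runif(1000))           # sampling is now vectorized and fast
hist(vars2)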

To clarify the "use Metropolis-Hastings" answer above: suppose ddist() is your probability density function. Then a minimal sampler looks something like this:
n <- 10000
cand.sd <- 0.1
init <- 0
vals <- numeric(n)
vals[1] <- init
oldprob <- 0
for (i in 2:n) {
  newval <- rnorm(1, mean = vals[i-1], sd = cand.sd)  # random-walk proposal
  newprob <- ddist(newval)
  if (runif(1) < newprob/oldprob) {
    vals[i] <- newval
  } else vals[i] <- vals[i-1]
  oldprob <- newprob
}
Notes:
- completely untested
- efficiency depends on the candidate distribution (i.e. the value of cand.sd); for maximum efficiency, tune cand.sd to an acceptance rate of 25-40%
- results will be autocorrelated (although I guess you could always sample() the results to scramble them, or thin)
- you may need to discard a "burn-in" if your starting value is weird
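A quick way to act on these notes, as a sketch reusing vals and n from the code above:

# acceptance rate: fraction of iterations where the chain moved
acc.rate <- mean(diff(vals) != 0)
acc.rate                            # tune cand.sd toward roughly 0.25-0.40
# discard a burn-in and thin to reduce autocorrelation
vals.use <- vals[seq(1001, n, by = 10)]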
The classical approach to this problem is rejection sampling (see e.g. Press et al., Numerical Recipes).
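For reference, a minimal rejection-sampling sketch (assuming the target density is bounded and has essentially all its mass in [-5, 5]; here the target is dnorm and the proposal is uniform on that interval):

# rejection sampling: accept x ~ U(-5,5) with probability target(x) / (M * proposal(x))
target <- dnorm
M <- 4  # must satisfy target(x) <= M * (1/10) on [-5, 5]; max of dnorm is ~0.399
rej.sample <- function(n) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- runif(n, -5, 5)
    u <- runif(n)
    out <- c(out, x[u < target(x) / (M * dunif(x, -5, 5))])
  }
  out[1:n]
}
hist(rej.sample(10000), breaks = 50, freq = FALSE)
curve(dnorm(x), add = TRUE)

The acceptance rate is 1/M, so a tighter envelope M wastes fewer proposals.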

Use the cumulative distribution function http://en.wikipedia.org/wiki/Cumulative_distribution_function and then invert it; the inverse is also called the quantile function.
That means: pick a random number from [0,1], treat it as a value of the CDF, then solve for the corresponding x.
See http://en.wikipedia.org/wiki/Normal_distribution for a better picture.
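When R already provides the quantile function, this is a one-liner; a sketch for the normal case:

u <- runif(1000)   # uniform draws on [0,1]
x <- qnorm(u)      # pass them through the quantile (inverse CDF) function
hist(x)            # standard normal variates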

This is a comment, but I don't have enough reputation to leave a comment on Ben Bolker's answer.
I am new to Metropolis, but IMHO this code is wrong because:
a) newval is drawn from a normal distribution, whereas in other codes it is drawn from a uniform distribution; this value must be drawn from the range covered by the random variable. For example, for a Gaussian distribution this should be something like runif(1, -5, +5).
b) the prob value must be updated only on acceptance.
Hope this helps, and I hope someone with reputation can correct this answer (especially mine, if I am wrong).
# the distribution
ddist <- dnorm
# number of random numbers
n <- 100000
# the center of the range is taken as init
init <- 0
# the following should go into a function
vals <- numeric(n)
vals[1] <- init
oldprob <- 0
for (i in 2:n) {
  newval <- runif(1, -5, +5)   # independence proposal on [-5, 5]
  newprob <- ddist(newval)
  if (runif(1) < newprob/oldprob) {
    vals[i] <- newval
    oldprob <- newprob         # update only on acceptance
  } else vals[i] <- vals[i-1]
}
# final view
hist(vals, breaks = 100)
# and comparison
hist(rnorm(length(vals)), breaks = 100)
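Beyond eyeballing the two histograms, a Q-Q plot gives a sharper comparison (a quick sketch):

# compare the sampler output against exact normal quantiles
qqnorm(vals)
qqline(vals, col = "red")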


R's code to obtain a histogram following a chi-square distribution from uniform random numbers

My textbook (written in Japanese) had code to generate a chi-square distribution with 3 degrees of freedom from a uniform distribution. I adapted it into code that produces a histogram following a chi-square distribution with 4 degrees of freedom. This agrees well with R's distribution function, so I think it probably works correctly (see Box 1 below).
I then tried to generalize Box 1's code to produce a histogram following a chi-square distribution with a specified number of degrees of freedom, but it fails with many errors (see Box 2).
My question:
Box 2's code for generating a chi-square distribution from a uniform distribution does not work. Please help me fix its errors.
The problem is probably my generalization of y <- ifelse(x<0.2, 1, ifelse(x<0.4, 2, ifelse(x<0.6, 3, ifelse(x<0.8, 4, 5)))) in Box 2.
Box 1: code for obtaining a histogram that follows a chi-square distribution with 4 degrees of freedom (probably works correctly)

ite <- 10000
sc <- numeric(ite)       #★1
A <- c(20,20,20,20,20)   #★2
for (i in 1:ite) {
  s <- runif(sum(A)*5)   #★3
  y <- ifelse(s<0.2, 1, ifelse(s<0.4, 2, ifelse(s<0.6, 3, ifelse(s<0.8, 4, 5)))) #★4
  z1 <- table(y)
  z2 <- A*5
  z3 <- (z1-z2)^2 / z2
  sc[i] <- sum(z3)
}
hist(sc, ylim=c(0,0.35), breaks="Scott", freq=F)
curve(dchisq(x,4), add=T)
The code for Box 1 is based on the following facts: if 500 = sum(A)*5 uniform random numbers are divided into five bins of equal size, the expected count in each bin is 100.
Here, the 1st bin, 2nd bin, ..., and 5th bin are defined by 0≤x<0.2, 0.2≤x<0.4, ..., and 0.8≤x≤1. We can see this from the output of table(y) in Box 1' below. Of course, sum(table(y)) in Box 1' is always 500.
Box 1': logic for turning the uniform random numbers (s) into bin labels (y) in Box 1's code

A <- c(20,20,20,20,20)
s <- runif(sum(A)*5)   #★3
y <- ifelse(s<0.2, 1, ifelse(s<0.4, 2, ifelse(s<0.6, 3, ifelse(s<0.8, 4, 5))))
table(y)
sum(table(y))
Box 2: code for obtaining a histogram following a chi-square distribution with n degrees of freedom (with many errors)

chiq_dist_n <- function(numb, itr) {
  A <- numeric(numb)    #★2
  aa <- numeric(numb)   #★4-1
  for (i in 1:numb) {
    A[i] = 20
  }                     #★2
  ntot = sum(A)
  for (i in 1:numb) {
    if (i == 1) {
      aa[i] = A[i]/ntot
    } else {
      aa[i] = aa[i-1] + (A[i]/ntot)
    }
  }                     #★4-2
  sc <- numeric(itr)    #★1
  y <- numeric(ntot*numb)  #★4-3
  for (i in 1:itr) {
    x <- runif(ntot*numb)
    for (k in 1:ntot*numb) {
      for (j in 1:numb) {
        if (x[k] < aa[numb-j+1]) {
          y[k] <- j
        } else {}
      }
    }                   #★3
    z1 <- table(y)
    z2 <- A*ntot
    z3 <- (z1-z2)^2/z2
    sc[i] <- sum(z3)
  }
  return(sc)
}
hist(chiq_dist_n(10,1000), ylim=c(0,0.35), breaks="Scott", freq=F)
The part of Box 2's code that generates y is extracted into Box 2'.
If you look at table(y) in Box 2', you can see that too many y[i] are zero.
I want the output of table(y) in Box 2' to be roughly the same as the output of table(y) in Box 1'.
Box 2': logic for turning the uniform random numbers (x) into bin labels (y) in Box 2's code

A <- c(20,20,20,20,20)
ntot = sum(A)
numb = length(A)
aa <- numeric(numb)
for (i in 1:numb) {
  if (i == 1) {
    aa[i] = A[i]/ntot
  } else {
    aa[i] = aa[i-1] + (A[i]/ntot)
  }
}  #★4-2
y <- numeric(ntot*numb)
x <- runif(ntot*numb)
for (k in 1:ntot*numb) {
  for (j in 1:numb) {
    if (x[k] < aa[numb-j+1]) {
      y[k] <- j
    } else {}
  }
}  #★3
table(y)
You don't need ifelse to bin a uniform sample; you can just use cut() and specify the breaks, for example (with numb = 4, so there are numb + 1 = 5 bins):

set.seed(111)
numb = 4
v = runif(10)
v
# [1] 0.59298128 0.72648112 0.37042200 0.51492383 0.37766322 0.41833733
# [7] 0.01065785 0.53229524 0.43216062 0.09368152
cut(v, breaks=seq(0,1,length.out=numb+2), labels=1:5)
# [1] 3 4 2 3 2 3 1 3 3 1
I am not so sure about A or what it does, but for simulating a chi-square statistic I suppose you take a random sample of the labels 1:(df+1), where df is the degrees of freedom. If we fix the number of draws at 500, then we know the expected count for each bin is 500/(df+1).
So, without changing too much of your code:
chiq_dist_n <- function(numb, ite) {
  sc <- numeric(ite)
  for (i in 1:ite) {
    x <- runif(500)   #★3
    y <- cut(x, breaks=seq(0,1,length.out=numb+2), labels=1:(numb+1))
    z1 <- table(y)
    z2 <- length(x)/(numb+1)
    z3 <- (z1-z2)^2 / z2
    sc[i] <- sum(z3)
  }
  hist(sc, ylim=c(0,0.35), breaks="Scott", freq=F, main=paste0("df=",numb))
  curve(dchisq(x,numb), add=T)
}
And we try degrees of freedom 2, 4, ..., 12:

par(mfrow=c(3,2))
par(mar=c(2.5,2.5,2.5,2.5))
for (i in seq(2,12,2)) {
  chiq_dist_n(i, 10000)
}

Implement a Monte Carlo Simulation Method to Estimate an Integral in R

I am trying to implement a Monte Carlo simulation method to estimate an integral in R. However, I still get the wrong answer. My code is as follows:

f <- function(x) {
  ((cos(x))/x) * exp(log(x)-3)^3
}
t <- integrate(f, 0, 1)
n <- 10000                  # assume we conduct 10000 simulations
int_gral <- Monte_Car(n)    # Monte_Car is the simulation function I am trying to write
int_gral
You are not doing Monte Carlo here. Monte Carlo is a simulation method that approximates integrals using sums/means of random variables.
You should do something in this flavor (you might have to verify that it is correct to approximate your integral by the mean of the f output):

f <- function(n) {
  x <- runif(n)                     # n uniform draws on [0,1]
  ((cos(x))/x) * exp(log(x)-3)^3    # integrand evaluated at the random points
}
int_gral <- mean(f(10000))

What your code does is take a number n and return ((cos(n))/n)*exp(log(n)-3)^3; there is no randomness in that.
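As a quick check, a sketch comparing the Monte Carlo mean against deterministic quadrature on the same integrand:

g <- function(x) ((cos(x))/x) * exp(log(x)-3)^3   # the original integrand
integrate(g, 0, 1)$value                          # quadrature reference value
mean(f(1e6))                                      # Monte Carlo estimate; should be close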
Update
Now, to get a more precise estimate, you need to replicate this step K times. Rather than using a loop, you can use the replicate() function:
K <- 100
dist <- data.frame(
int = replicate(K, mean(f(10000)))
)
You get a distribution of estimates of your integral:
library(ggplot2)
ggplot(dist) + geom_histogram(aes(x = int, y = ..density..))
and you can use mean to get a single numerical value:
mean(dist$int)
# [1] 2.95036e-05
You can evaluate the precision of your estimates with
sd(dist$int)
# [1] 2.296033e-07
Here it is small because n is already large, so each first-step estimate is already quite precise.
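Equivalently, as a sketch, one long run gives the standard error of the Monte Carlo mean directly, without replicate():

draws <- f(1e6)                     # one long run (same total number of draws)
mean(draws)                         # point estimate of the integral
sd(draws) / sqrt(length(draws))     # standard error of the Monte Carlo mean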
I have managed to change the code as follows. Kindly confirm that I am doing the right thing.
Regards.

f <- function(x) {
  ((cos(x))/x) * exp(log(x)-3)^3
}
set.seed(234)
n <- 10000
for (i in 1:10000) {
  x <- runif(n)
  I <- sum(f(x))/n   # note: I is overwritten on every pass; only the last sample's estimate survives the loop
}
I

R generating binomial Random variables from exponential random variables

I have 100000 exponential random variables generated with rexp, and I am asked to generate 100000 binomial random variables from them using built-in R functions.
I really don't know how to generate one random variable from another. I searched some resources on the internet, but they were mostly about generating Poisson variates from exponentials, which are closely related because the exponential distribution can be interpreted as the time between Poisson events: applying cumsum to the exponentials and then using cut() to bin the arrival times gives the number of occurrences in each time interval.
But I don't know how it is possible to generate binomial variates from exponentials.
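(For reference, a minimal sketch of that Poisson construction: with unit-rate exponential inter-arrival times, the number of arrivals before time lambda is Poisson(lambda).)

# sketch: Poisson(lambda) variates by counting exponential arrivals before time lambda
rpois.from.exp <- function(n, lambda) {
  replicate(n, {
    t <- 0; k <- -1L
    while (t < lambda) { t <- t + rexp(1); k <- k + 1L }
    k   # number of arrival times strictly below lambda
  })
}
mean(rpois.from.exp(10000, 5))   # should be close to 5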
The function rbin below generates binomial rv's from exponential rv's. The reason why it works might be a question for CrossValidated, not for StackOverflow, which is about code.
rbin <- function(n, size, p) {
  onebin <- function(i, size, thres) {
    I <- 0L
    repeat {
      # partial sum of scaled exponentials (Rényi representation of exponential order statistics)
      S <- sum(rexp(I + 1)/(size + 1 - seq_len(I + 1)))
      if (S > thres) break
      I <- I + 1L
    }
    I
  }
  thres <- -log(1 - p)
  sapply(seq_len(n), onebin, size, thres)
}

set.seed(1234)
u <- rbin(100000, 1, 0.5)
v <- rbinom(100000, 1, 0.5)
X <- cbind(u, v)
cbind(Mean = colMeans(X), Var = apply(X, 2, var))
#      Mean       Var
# u 0.50124 0.2500010
# v 0.49847 0.2500002

Calculating pmf and cdf for 20 sided dice in R

I would like to create two functions that calculate the probability mass function (pmf) and the cumulative distribution function (cdf) for a 20-sided die.
Each function takes one argument, the side y (a number from 1 to 20). I should be able to pass a vector and get back the value for each element.
If a value entered is not a valid discrete side, the function should return zero for it and issue a warning message.
This is what I have solved so far for the PMF:
PMF = function(side) {
  a = NULL
  for (i in side) {
    a = dbinom(1, size=1, prob=1/20)
    print(a)
  }
}
And this is what I got for the CDF:

CDF = function(side) {
  a = NULL
  for (i in side) {
    a = pnorm(side)
    print(a)
  }
}

I am currently stuck on the warning message and on returning zero. How can I add those to the functions?
Next, how can I plot these two functions on the same plot over a specific interval (for example 1 to 12)?
Did I use the right functions for calculating the cdf and pmf?
I would propose the following simplifications:
PMF <- function(side) {
  x <- rep(0.05, length(side))
  bad_sides <- ! side %in% 1:20  # sides that aren't in 1:20 are bad
  x[bad_sides] <- 0              # set bad sides to 0
  # warnings use the warning() function; see ?warning for details
  if (any(bad_sides)) warning("Sides not integers between 1 and 20 have 0 probability!")
  # printing results is probably not what you want; we return them instead
  return(x)
}
For the CDF, I assume you mean the probability of rolling a number less than or equal to the given side, which is side / 20. (pnorm is the wrong function: it gives the CDF of the normal distribution.)
CDF <- function(side) {
return(pmin(1, pmax(0, floor(side) / 20)))
}
Technically, the CDF is also defined for non-integer values: the CDF of 1.2 is the same as the CDF of 1, hence the floor() here. The pmax()/pmin() wrappers make it robust, clamping the result to 0 for sides below 1 and to 1 for sides above 20.
Plotting:
my_interval <- 1:12
plot(range(my_interval), c(0, 1), type = "n")
points(my_interval, PMF(my_interval))
lines(my_interval, CDF(my_interval), type = "s")
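As a sanity check (a quick sketch), the CDF at the integer sides should equal the running sum of the PMF:

# the CDF at sides 1:20 should match the cumulative sum of the PMF
all.equal(CDF(1:20), cumsum(PMF(1:20)))   # should be TRUE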

Use inverse CDF to generate random variable in R

First, I have no idea whether the professor gave us the wrong question. Anyway, I tried to generate F(x) ~ U(0,1), where the CDF is F(x) = 1 - (1+x)exp(-x). (For this CDF you cannot solve x = F^{-1}(u) by hand.) I then find the root of F(x) = u to get what the question wants.
Because the root ranges from 0 to Inf, uniroot() seemed out of the question, so I wrote my own Newton's method.
Then, my code is like this:
f = function(x) {
  ifelse(x >= 0, x*exp(-x), 0)   # density f(x) = F'(x)
}
in.C = function(n) {
  a = runif(n)
  G = NULL
  for (i in 1:n) {
    del = 1
    x = 2
    while (abs(del) > 1e-12) {
      del = (1 - (1+x)*exp(-x) - a[i]) / f(x)   # Newton step: (F(x) - u)/f(x)
      x = x - del
    }
    G[i] = x
  }
  G
}
system.time(tt <- in.C(100000))
However, if F(x) is too small, one Newton step may take the iterate below zero, and then errors occur. So I revised my code like this:
f = function(x) {
  ifelse(x >= 0, x*exp(-x), 0)
}
in.C = function(n) {
  a = runif(n)
  G = NULL
  for (i in 1:n) {
    del = 1
    x = 2
    while (abs(del) > 1e-12) {
      if (x >= 0) {
        del = (1 - (1+x)*exp(-x) - a[i]) / f(x)
        x = x - del
      } else break
    }
    if (x >= 0) G[i] = x
  }
  G[!is.na(G)]
}
system.time(tt <- in.C(100000))
hist(tt, breaks=70, right=F, freq=F)
curve(f(x), from=0, to=20, add=T)
Clearly, this code is wrong, because I am throwing away the results near zero.
So, my question is whether my code can be revised to compute this correctly, and if not, whether there is another way to do it. Any assistance is appreciated.
You can use uniroot(...) for this.
[Note: if the point of this exercise is to implement your own Newton-Raphson technique, let me know and I'll delete this answer.]
If I'm understanding this correctly, you want to generate random samples from a distribution with probability density function f and cumulative distribution function F, where

f(x) = x*exp(-x)
F(x) = 1 - (1+x)*exp(-x)

As you imply, this can be done by generating a random sample from U[0,1] and transforming it according to the inverse CDF of F. The procedure is very similar to the ones posted here and here, except that you already have an expression for the CDF.
f <- function(x) x*exp(-x)
F <- function(x) 1 - (1+x)*exp(-x)
F.inv <- function(y) { uniroot(function(x) F(x) - y, interval=c(0,100))$root }
F.inv <- Vectorize(F.inv)

x <- seq(0, 10, length.out=1000)
y <- seq(0, 1, length.out=1000)
par(mfrow=c(1,3))
plot(x, f(x), type="l", main="f(x)")
plot(x, F(x), type="l", main="CDF of f(x)")
plot(y, F.inv(y), type="l", main="Inverse CDF of f(x)")
Then, generate X ~ U[0,1] and Z = F.inv(X).
set.seed(1)
X <- runif(1000,0,1) # random sample from U[0,1]
Z <- F.inv(X)
par(mfrow=c(1,1))
hist(Z, freq=FALSE, breaks=c(seq(0,10,length=30),Inf), xlim=c(0,10))
lines(x,f(x),type="l",main="Density function", col="red",lty=2)
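If the exercise does require your own Newton iteration, one way around the negative-step problem is damping. A sketch, reusing f and F from above (the damping rule here is my own choice, not a standard recipe):

# damped Newton for F(x) = u on x > 0
F.inv.newton <- function(u, x0 = 2, tol = 1e-12, maxit = 200) {
  x <- x0
  for (it in seq_len(maxit)) {
    del <- (F(x) - u) / f(x)    # Newton step: (F(x) - u)/F'(x)
    x <- max(x - del, x/2)      # never step below x/2, so x stays strictly positive
    if (abs(del) < tol) break
  }
  x
}
Z2 <- sapply(runif(1000), F.inv.newton)
hist(Z2, freq = FALSE, breaks = 50)

Because the iterate can never cross zero, no samples near zero are rejected, which addresses the problem in your revised code.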
