Use inverse CDF to generate random variable in R

First, I have no idea whether the professor set the question wrong. Anyway, I tried to generate F(x) ~ U(0,1), where the CDF is F(x) = 1-(1+x)exp(-x) (for this CDF, you cannot solve x = g(F(x)) by hand), and then solve F(x) = u for x to get what the question wants.
Because the root can lie anywhere from 0 to Inf, uniroot() is out of the question. Therefore, I wrote my own solver using Newton's method.
My code looks like this:
f=function(x) {
  ifelse(x>=0,x*exp(-x),0)
}
in.C=function(n) {
  a=runif(n)
  G=NULL
  for(i in 1:n) {
    del=1
    x=2
    while(abs(del)>1e-12){
      del=(1-(1+x)*exp(-x)-a[i])/f(x)
      x=x-del
    }
    G[i]=x
  }
  G
}
system.time(tt<-in.C(100000))
However, if F(x) is too small, a single Newton step can push x below zero, and then errors occur. So I revised my code like this:
f=function(x) {
  ifelse(x>=0,x*exp(-x),0)
}
in.C=function(n) {
  a=runif(n)
  G=NULL
  for(i in 1:n) {
    del=1
    x=2
    while(abs(del)>1e-12){
      if(x>=0){
        del=(1-(1+x)*exp(-x)-a[i])/f(x)
        x=x-del
      }
      else break
    }
    if(x>=0) G[i]=x
  }
  G[!is.na(G)]
}
system.time(tt<-in.C(100000))
hist(tt, breaks=70, right=F, freq=F)
curve(f(x),from=0,to=20,add=T)
Clearly, the code is wrong, because it rejects the results near zero.
So, my question is whether my code can be revised to calculate this correctly and, if not, whether there is another way to do it. Any assistance is appreciated.

You can use uniroot(...) for this.
[Note: If the point of this exercise is to implement your own version of a Newton Raphson technique, let me know and I'll delete the answer.]
If I'm understanding this correctly, you want to generate random samples from a distribution with probability density function f and cumulative distribution function F, where
f = x*exp(-x)
F = 1 - (1+x)*exp(-x)
As you imply, this can be done by generating a random sample from U[0,1] and transforming that according to the inverse CDF of F. The procedure is very similar to the ones posted here and here, except that you already have an expression for the CDF.
f <- function(x) x*exp(-x)
F <- function(x) 1-(1+x)*exp(-x)
F.inv <- function(y){uniroot(function(x){F(x)-y},interval=c(0,100))$root}
F.inv <- Vectorize(F.inv)
x <- seq(0,10,length.out=1000)
y <- seq(0,1,length.out=1000)
par(mfrow=c(1,3))
plot(x,f(x),type="l",main="f(x)")
plot(x,F(x),type="l",main="CDF of f(x)")
plot(y,F.inv(y),type="l",main="Inverse CDF of f(x)")
Then, generate X ~ U[0,1] and Z = F.inv(X).
set.seed(1)
X <- runif(1000,0,1) # random sample from U[0,1]
Z <- F.inv(X)
par(mfrow=c(1,1))
hist(Z, freq=FALSE, breaks=c(seq(0,10,length=30),Inf), xlim=c(0,10))
lines(x,f(x),type="l",col="red",lty=2) # density function, for comparison
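If the goal is to keep a hand-written Newton iteration, as in the question, one option is to keep the root bracketed and take a bisection step whenever a Newton step would leave the bracket. The sketch below is only an illustration: newton_inv, its arguments and the upper bound of 100 are assumptions, not part of the original code. Since f(x) = x*exp(-x) is the Gamma density with shape 2 and rate 1, qgamma() gives an independent check.

```r
# CDF and density from the question
F <- function(x) 1 - (1 + x) * exp(-x)
f <- function(x) x * exp(-x)

# Hypothetical helper: Newton's method that keeps the root bracketed in
# [lo, hi] and uses a bisection step whenever Newton would leave the bracket
newton_inv <- function(u, lo = 0, hi = 100, tol = 1e-12, maxit = 200) {
  x <- 2
  for (k in seq_len(maxit)) {
    g <- F(x) - u
    if (g == 0) break                      # landed exactly on the root
    if (g > 0) hi <- x else lo <- x        # F is increasing, so shrink the bracket
    x_new <- x - g / f(x)                  # plain Newton step
    if (!is.finite(x_new) || x_new <= lo || x_new >= hi) {
      x_new <- (lo + hi) / 2               # safeguard: bisection step
    }
    converged <- abs(x_new - x) < tol
    x <- x_new
    if (converged) break
  }
  x
}

set.seed(1)
u <- runif(1000)
z <- vapply(u, newton_inv, numeric(1))

# Largest difference from the built-in Gamma(shape = 2, rate = 1) quantiles
max(abs(z - qgamma(u, shape = 2)))
```

Because the iterate can never leave [lo, hi], no draws near zero have to be discarded, which was the problem with the revised code in the question.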

Related

Implement a Monte Carlo Simulation Method to Estimate an Integral in R

I am trying to implement a Monte Carlo simulation method to estimate an integral in R. However, I still get the wrong answer. My code is as follows:
f <- function(x){
((cos(x))/x)*exp(log(x)-3)^3
}
t <- integrate(f,0,1)
n <- 10000 #Assume we conduct 10000 simulations
int_gral <- Monte_Car(n)
int_gral
You are not doing Monte Carlo here. Monte Carlo is a simulation method that approximates integrals using sums or means of random variables.
You should do something along these lines (you may want to verify that the mean of the f output really approximates your integral):
f <- function(n){
  x <- runif(n)
  return(
    ((cos(x))/x)*exp(log(x)-3)^3
  )
}
int_gral <- mean(f(10000))
What your code does is take a number n and return ((cos(n))/n)*exp(log(n)-3)^3; there is no randomness in that.
Update
Now, to get a more precise estimate, you need to replicate this step K times. Rather than using a loop, you can use the replicate function:
K <- 100
dist <- data.frame(
int = replicate(K, mean(f(10000)))
)
You get a distribution of estimators for your integral :
library(ggplot2)
ggplot(dist) + geom_histogram(aes(x = int, y = ..density..))
and you can use mean to have a numerical value:
mean(dist$int)
# [1] 2.95036e-05
You can evaluate the precision of your estimates with
sd(dist$int)
# [1] 2.296033e-07
Here it is small because n is already large, so each individual estimate from the first step is already quite precise.
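Because exp(log(x)-3)^3 equals x^3*exp(-9), the integrand simplifies to x^2*cos(x)*exp(-9), whose integral over [0,1] has the closed form (2*cos(1) - sin(1))*exp(-9). That makes it easy to check a Monte Carlo estimate and its standard error. The sketch below assumes a plain uniform sample on [0,1]; the names g, est, se and exact are illustrative only.

```r
# Simplified integrand: (cos(x)/x) * exp(log(x) - 3)^3 == x^2 * cos(x) * exp(-9)
g <- function(x) x^2 * cos(x) * exp(-9)

set.seed(234)
n <- 10000
x <- runif(n)                 # uniform draws on [0, 1]
est <- mean(g(x))             # Monte Carlo estimate of the integral
se  <- sd(g(x)) / sqrt(n)     # estimated standard error of that estimate

# Closed-form value, from the antiderivative x^2*sin(x) + 2*x*cos(x) - 2*sin(x)
exact <- (2 * cos(1) - sin(1)) * exp(-9)

c(estimate = est, std_error = se, exact = exact)
```

The estimate should typically land within a few standard errors of the exact value, which is about 2.95e-05 and agrees with the mean reported above.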
I have managed to change the code as follows. Kindly confirm that I am doing the right thing.
Regards.
f <- function(x){
((cos(x))/x)*exp(log(x)-3)^3
}
set.seed(234)
n<-10000
for (i in 1:10000) {
x<-runif(n)
I<-sum(f(x))/n
}
I

How to calculate the area under each end of a sine curve

Given this data set:
y<-c(-13,16,35,40,28,36,43,33,40,33,22,-5,-27,-31,-29,-25,-26,-31,-26,-24,-25,-29,-23,4)
t<-1:24
My goal is to calculate two areas. The first area would integrate only data from the first part of the curve found above the Zero line. The second area would integrate data from the second part of the curve found below the zero line.
First I would like to fit a sine wave to this data. Using this excellent answer:
https://stats.stackexchange.com/questions/60994/fit-a-sinusoidal-term-to-data
I was able to fit a sine wave (I will be using the periodic with second harmonic which looks to have a better fit)
ssp <- spectrum(y)
per <- 1/ssp$freq[ssp$spec==max(ssp$spec)]
reslm <- lm(y ~ sin(2*pi/per*t)+cos(2*pi/per*t))
summary(reslm)
rg <- diff(range(y))
plot(y~t,ylim=c(min(y)-0.1*rg,max(y)+0.1*rg))
lines(fitted(reslm)~t,col=4,lty=2) # dashed blue line is sin fit
# including 2nd harmonic really improves the fit
reslm2 <- lm(y ~ sin(2*pi/per*t)+cos(2*pi/per*t)+sin(4*pi/per*t)+cos(4*pi/per*t))
summary(reslm2)
lines(fitted(reslm2)~t,col=3) # solid green line is periodic with second harmonic
abline(h=0,lty=2)
Next I would like to calculate the area under the part of the curve that is positive, as well as the area of the part that is negative. I've had luck looking at similar answers using the AUC functions in the Bolstad2 and MESS packages. But my data points do not fall neatly on the zero line, and I do not know how to break up the sine function into areas only above the zero line and only below the zero line.
First things first. To get an exact calculation, you need to work with the exact function of the second-harmonic Fourier fit. Secondly, the beauty of harmonic functions is that they are periodic. So if you want to find where your function reaches 0, you merely need to expand your interval so you can be sure to find more than 2 roots.
First we get the exact function from the regression model
fourierfnct <- function(t){
  fnct <- reslm2$coeff[1]+
    reslm2$coeff[2]*sin(2*pi/per*t)+
    reslm2$coeff[3]*cos(2*pi/per*t)+
    reslm2$coeff[4]*sin(4*pi/per*t)+
    reslm2$coeff[5]*cos(4*pi/per*t)
  return(fnct)
}
Secondly, you can write a function that finds the roots (where the function is 0). R provides a uniroot function, which you can call in a loop to find multiple roots.
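manyroots <- function(f,inter,period){
  # period is kept so the call below still works; it is not used here
  roots <- rep(NA_real_, length(inter)-1)
  for(i in 1:(length(inter)-1)){
    # uniroot() throws an error when f has the same sign at both ends
    # of the sub-interval; leave NA in that case
    roots[i] <- tryCatch(
      uniroot(f,c(inter[i],inter[i+1]))$root,
      error = function(err) NA_real_
    )
  }
  retroots <- roots[!is.na(roots)]
  return(retroots)
}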
Then you simply calculate the roots and use them as the integration limits.
roots <- manyroots(fourierfnct,seq(0,25),per)
integrate(fourierfnct, roots[1],roots[2])
#300.6378 with absolute error < 3.3e-12
integrate(fourierfnct, roots[2],roots[3])
#-284.6378 with absolute error < 3.2e-12
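As an alternative to locating the roots first, the positive and negative parts of the fitted curve can be integrated directly by clipping the curve at zero. The following is a sketch under assumptions: it integrates over the observed range 1 to 24 rather than between roots, so its numbers will differ from the ones above, and fit_fun is a hypothetical helper name.

```r
y <- c(-13,16,35,40,28,36,43,33,40,33,22,-5,-27,-31,-29,-25,-26,-31,-26,-24,-25,-29,-23,4)
t <- 1:24

# Same fit as in the question (second harmonic included)
ssp <- spectrum(y, plot = FALSE)
per <- 1 / ssp$freq[which.max(ssp$spec)]
reslm2 <- lm(y ~ sin(2*pi/per*t) + cos(2*pi/per*t) + sin(4*pi/per*t) + cos(4*pi/per*t))

# Fitted curve as a function of time (hypothetical helper)
fit_fun <- function(tt) as.numeric(predict(reslm2, newdata = data.frame(t = tt)))

# Clip at zero so each integral only sees one side of the zero line
pos <- integrate(function(tt) pmax(fit_fun(tt), 0), 1, 24)$value
neg <- integrate(function(tt) pmin(fit_fun(tt), 0), 1, 24)$value
total <- integrate(fit_fun, 1, 24)$value

c(positive = pos, negative = neg, total = total)
```

The two clipped areas should add up to the plain integral of the fitted curve, which is a quick sanity check on the split.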
This may not be the solution you are looking for, but you could try this:
# Create a new t vector but with more subdivisions
t2 = seq(1,24,length.out = 10000)
# Evaluate your model on this t2
y2 = predict(reslm2, newdata = data.frame(t = t2))
lines(t2[y2>=0],y2[y2>=0],col="red")
# Estimate the area where the curve is greater than 0
sum(diff(t2)[1]*y2[y2>0])
# Estimate the area where the curve is less than 0
sum(diff(t2)[1]*y2[y2<0])

Calculating pmf and cdf for 20 sided dice in R

I would like to create two functions that calculate the probability mass function (pmf) and cumulative distribution function (cdf) for a 20-sided die.
Each function would take one argument, y, for the side (a number from 1 to 20). I should be able to pass a vector and get back the value for each element.
If a value entered is non-discrete (not a whole number), the function should return zero for it and issue a warning message.
This is what I have so far for the PMF:
PMF= function(side) {
  a = NULL
  for (i in side)
  {
    a= dbinom(1, size=1, prob=1/20)
    print(a)
  }
}
And this is what I got for CDF:
CDF= function(side) {
  a = NULL
  for (i in side)
  {
    a= pnorm(side)
    print(a)
  }
}
I am currently stuck on the warning message and on returning zero. How can I add that to the function?
Next, how can I plot these two functions on the same plot over a specific interval (for example, 1 to 12)?
Did I use the right functions for calculating the cdf and pmf?
I would propose the following simplifications:
PMF <- function(side) {
  x <- rep(0.05, length(side))
  bad_sides <- ! side %in% 1:20 # sides that aren't in 1:20 are bad
  x[bad_sides] <- 0 # set bad sides to 0
  # warnings use the warning() function. See ?warning for details
  if (any(bad_sides)) warning("Sides not integers between 1 and 20 have 0 probability!")
  # print results is probably not what you want, we'll return them instead.
  return(x)
}
For the CDF, I assume you mean the probability of rolling a number less than or equal to the side given, which is side / 20. (pnorm is the wrong function... it gives the CDF of the normal distribution.)
CDF <- function(side) {
  return(pmin(1, pmax(0, floor(side) / 20)))
}
Technically, the CDF is defined for non-integer values. The CDF of 1.2 is just the same as the CDF of 1, so I use floor here. The pmin(1, ...) and pmax(0, ...) calls make it more robust: they ensure the result never exceeds 1 for values over 20 and is never negative for negative values.
Plotting:
my_interval <- 1:12
plot(range(my_interval), c(0, 1), type = "n")
points(my_interval, PMF(my_interval))
lines(my_interval, CDF(my_interval), type = "s")
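A few quick checks can confirm that the two functions behave like a proper pmf/cdf pair. This is a small sketch that repeats the definitions in compact form so it runs on its own; the variable names total, same and bad are illustrative only.

```r
PMF <- function(side) {
  x <- rep(0.05, length(side))
  bad_sides <- !side %in% 1:20
  x[bad_sides] <- 0
  if (any(bad_sides)) warning("Sides not integers between 1 and 20 have 0 probability!")
  x
}
CDF <- function(side) pmin(1, pmax(0, floor(side) / 20))

# The probabilities over all 20 sides should sum to 1
total <- sum(PMF(1:20))

# The CDF at each side should equal the running sum of the PMF
same <- all.equal(CDF(1:20), cumsum(PMF(1:20)))

# Invalid sides give 0 (the warning is suppressed here only to keep output clean)
bad <- suppressWarnings(PMF(c(3, 2.5, 21)))
bad
```

Calling PMF(c(3, 2.5, 21)) without suppressWarnings() returns the same values and also emits the warning.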

Fitting an inverse function

I have a function which looks like:
g(x) = f(x) - a^b / f(x)^b
g(x) - known function, data vector provided.
f(x) - hidden process.
a,b - parameters of this function.
From the above we get the relation:
f(x) = inverse(g(x))
My goal is to optimize the parameters a and b so that f(x) is as close as possible to a normal distribution. Looking at the normal Q-Q plot of f(x) (attached), my aim is to minimize the distance between f(x) and the straight line that represents the normal distribution by optimizing a and b.
I wrote the code below:
g_fun <- function(x) {x - a^b/x^b}
inverse = function (f, lower = 0, upper = 2000) {
function (y) uniroot((function (x) f(x) - y), lower = lower, upper = upper)[1]
}
f_func = inverse(function(x) g_fun(x))
# let's make up an example
# g(x) values are known
g <- c(-0.016339, 0.029646, -0.0255258, 0.003352, -0.053258, -0.018971, 0.005172,
0.067114, 0.026415, 0.051062)
# Calculate f(x) by using the inverse of g(x), when a=a0 and b=b0
for (i in 1:10) {
f[i] <- f_fun(g[i])
}
I have two questions:
How to pass parameters a and b to the functions?
How to perform this optimization task, meaning find a and b such that f(x) would approximate normal distribution.
Not sure how you were able to produce the Q-Q plot since your provided examples do not work. You are not specifying the values of a and b and you are defining f_func but calling f_fun. Anyway here is my answer to your questions:
How to pass parameters a and b to the functions? - Just pass them as arguments to the functions.
How to perform this optimization task, meaning find a and b such that f(x) would approximate normal distribution? - The same way any optimization task is done. Define a cost function, then minimize it.
Here is the revised code: I have added a and b as parameters, removed the inverse function and incorporated it inside f_func, which can now take vector input so no need for a for loop.
g_fun <- function(x,a,b) {x - a^b/x^b}
f_func = function(y,a,b,lower = 0, upper = 2000){
  sapply(y,function(z) { uniroot(function(x) g_fun(x,a,b) - z, lower = lower, upper = upper)$root})
}
# g(x) values are known
g <- c(-0.016339, 0.029646, -0.0255258, 0.003352, -0.053258, -0.018971, 0.005172,
0.067114, 0.026415, 0.051062)
f <- f_func(g,1,1) # using a = 1 and b = 1
#[1] 0.9918427 1.0149329 0.9873386 1.0016774 0.9737270 0.9905320 1.0025893
#[8] 1.0341199 1.0132947 1.0258569
f_func(g,2,10)
#[1] 1.876408 1.880554 1.875578 1.878138 1.873094 1.876170 1.878304 1.884049
#[9] 1.880256 1.882544
Now for the optimization part, it depends on what you mean by "f(x) would approximate normal distribution". You can compare the mean squared error from the Q-Q line if you want. Also, since you say approximate, how close is good enough? You can go with shapiro.test and keep searching until you find a p-value above 0.05, i.e. one where normality is not rejected (beware that there may not be a solution).
shapiro.test(f_func(g,1,2))$p
#[1] 0.9484821
cost <- function(x,y) shapiro.test(f_func(g,x,y))$p
Now that we have a cost function, how do we go about optimizing it? Since a larger p-value is better here, you would maximize it, or equivalently minimize its negative. There are many different ways to do numerical optimization. Take a look at the optim function: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/optim.html.
optim(c(1,1),cost)
This final line does not work, but without proper data and context this is as far as I can go. Hope this helps.
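One reason the optim() call fails is that optim() passes all parameters as a single vector, while cost expects two separate arguments. A possible way to make it run is sketched below. Everything beyond the answer's own g_fun and f_func is an assumption made for illustration: the cost is taken as 1 - W (W being the Shapiro-Wilk statistic) instead of the p-value, parameter values for which the inversion fails get an arbitrary penalty of 1e6, and a tighter uniroot tolerance is used so the cost is less noisy.

```r
g_fun <- function(x, a, b) x - a^b / x^b
f_func <- function(y, a, b, lower = 0, upper = 2000) {
  sapply(y, function(z) {
    uniroot(function(x) g_fun(x, a, b) - z,
            lower = lower, upper = upper, tol = 1e-10)$root
  })
}

g <- c(-0.016339, 0.029646, -0.0255258, 0.003352, -0.053258, -0.018971, 0.005172,
       0.067114, 0.026415, 0.051062)

# Cost takes ONE vector par = c(a, b), as optim() requires.
# 1 - W is small when the recovered f values look normal.
cost <- function(par) {
  val <- tryCatch(
    suppressWarnings(1 - unname(shapiro.test(f_func(g, par[1], par[2]))$statistic)),
    error = function(e) NA_real_   # e.g. uniroot() finds no sign change
  )
  if (is.finite(val)) val else 1e6 # assumed penalty for unusable parameters
}

fit <- optim(c(1, 1), cost)        # Nelder-Mead by default
fit$par                            # candidate values for a and b
fit$value                          # 1 - W at those values
```

With only ten data points the result should be read with caution; the sketch only shows the mechanics of wiring the cost function into optim().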

How do I best simulate an arbitrary univariate random variate using its probability function?

In R, what's the best way to simulate an arbitrary univariate random variate if only its probability density function is available?
Here is a (slow) implementation of the inverse cdf method when you are only given a density.
den<-dnorm #replace with your own density
#calculates the cdf by numerical integration
cdf<-function(x) integrate(den,-Inf,x)[[1]]
#inverts the cdf
inverse.cdf<-function(x,cdf,starting.value=0){
  lower.found<-FALSE
  lower<-starting.value
  while(!lower.found){
    if(cdf(lower)>=(x-.000001))
      lower<-lower-(lower-starting.value)^2-1
    else
      lower.found<-TRUE
  }
  upper.found<-FALSE
  upper<-starting.value
  while(!upper.found){
    if(cdf(upper)<=(x+.000001))
      upper<-upper+(upper-starting.value)^2+1
    else
      upper.found<-TRUE
  }
  uniroot(function(y) cdf(y)-x,c(lower,upper))$root
}
#generates 1000 random variables of distribution 'den'
vars<-apply(matrix(runif(1000)),1,function(x) inverse.cdf(x,cdf))
hist(vars)
To clarify the "use Metropolis-Hastings" answer above:
Suppose ddist() is your probability density function. Then something like:
n <- 10000
cand.sd <- 0.1
init <- 0
vals <- numeric(n)
vals[1] <- init
oldprob <- 0
for (i in 2:n) {
  newval <- rnorm(1,mean=vals[i-1],sd=cand.sd)
  newprob <- ddist(newval)
  if (runif(1)<newprob/oldprob) {
    vals[i] <- newval
  } else vals[i] <- vals[i-1]
  oldprob <- newprob
}
Notes:
completely untested
efficiency depends on candidate distribution (i.e. value of cand.sd). For maximum efficiency, tune cand.sd to an acceptance rate of 25-40%
results will be autocorrelated ... (although I guess you could always sample() the results to scramble them, or thin)
may need to discard a "burn-in", if your starting value is weird
The classical approach to this problem is rejection sampling (see e.g. Press et al., Numerical Recipes).
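As a rough illustration of rejection sampling, the sketch below draws from a density on a bounded interval using a flat envelope. The helper rejection_sample, its arguments and the example density 6*x*(1-x) on [0,1] (maximum 1.5) are assumptions chosen for the example; a density with unbounded support would need a different envelope.

```r
# Hypothetical helper: rejection sampling under a flat envelope of height max_den
rejection_sample <- function(n, den, lower, upper, max_den) {
  out <- numeric(0)
  while (length(out) < n) {
    m <- 2 * (n - length(out))             # propose in batches
    x <- runif(m, lower, upper)            # candidate points
    u <- runif(m, 0, max_den)              # uniform heights under the envelope
    out <- c(out, x[u <= den(x)])          # keep points that fall under the density
  }
  out[seq_len(n)]
}

set.seed(42)
den <- function(x) 6 * x * (1 - x)         # example density on [0, 1], peak 1.5
draws <- rejection_sample(10000, den, lower = 0, upper = 1, max_den = 1.5)

hist(draws, freq = FALSE, breaks = 40)
curve(den(x), from = 0, to = 1, add = TRUE, col = "red")
```

The tighter the envelope fits the density, the fewer candidates are thrown away.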
Use the cumulative distribution function: http://en.wikipedia.org/wiki/Cumulative_distribution_function
Then just use its inverse.
See here for a clearer picture: http://en.wikipedia.org/wiki/Normal_distribution
That means: pick a random number from [0,1], treat it as a CDF value, and look up the corresponding value of the variable.
The inverse CDF is also called the quantile function.
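For distributions whose quantile function is known, this takes one line. The sketch below uses the exponential distribution as an assumed example, because its CDF F(x) = 1 - exp(-rate*x) can be inverted by hand and compared with R's built-in qexp().

```r
set.seed(1)
u <- runif(5)                           # random CDF values in [0, 1]
rate <- 2

x_manual  <- -log(1 - u) / rate         # inverse of F(x) = 1 - exp(-rate * x)
x_builtin <- qexp(u, rate = rate)       # R's quantile function for the same law
all.equal(x_manual, x_builtin)

# The same idea with the normal quantile function
z <- qnorm(runif(1000))
hist(z)
```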
This is really a comment, but I don't have enough reputation to comment on Ben Bolker's answer.
I am new to Metropolis, but IMHO this code is wrong because:
a) newval is drawn from a normal distribution, whereas in other codes it is drawn from a uniform distribution; this value must be drawn from the range covered by the random number. For example, for a Gaussian distribution this should be something like runif(1, -5, +5).
b) the prob value must be updated only on acceptance.
Hope this helps, and I hope that someone with reputation can correct this answer (especially mine if I am wrong).
# the distribution
ddist <- dnorm
# number of random number
n <- 100000
# the center of the range is taken as init
init <- 0
# the following should go into a function
vals <- numeric(n)
vals[1] <- init
oldprob <- 0
for (i in 2:n) {
  newval <- runif(1, -5, +5)
  newprob <- ddist(newval)
  if (runif(1) < newprob/oldprob) {
    vals[i] <- newval
    oldprob <- newprob
  } else vals[i] <- vals[i-1]
}
# Final view
hist(vals, breaks = 100)
# and comparison
hist(rnorm(length(vals)), breaks = 100)
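For reference, here is a sketch of a random-walk Metropolis sampler that combines points from the two answers above: it keeps the normal candidate centred on the current value (a symmetric proposal, which the Metropolis acceptance rule allows), starts oldprob at the density of the initial value, and updates oldprob only on acceptance. The choices n = 20000 and cand.sd = 1, and the variables accepted and acc_rate, are assumptions made for this example.

```r
ddist <- dnorm                  # target density (replace with your own)
n <- 20000
cand.sd <- 1                    # proposal standard deviation (tuning parameter)

set.seed(1)
vals <- numeric(n)
vals[1] <- 0
oldprob <- ddist(vals[1])       # density at the current state
accepted <- 0

for (i in 2:n) {
  newval <- rnorm(1, mean = vals[i - 1], sd = cand.sd)  # symmetric proposal
  newprob <- ddist(newval)
  if (runif(1) < newprob / oldprob) {
    vals[i] <- newval
    oldprob <- newprob          # update only when the candidate is accepted
    accepted <- accepted + 1
  } else {
    vals[i] <- vals[i - 1]
  }
}

acc_rate <- accepted / (n - 1)  # fraction of accepted candidates
c(mean = mean(vals), sd = sd(vals), acceptance = acc_rate)
hist(vals, breaks = 100, freq = FALSE)
curve(ddist(x), add = TRUE, col = "red")
```

The draws are autocorrelated, so the sample mean and standard deviation approach the target values more slowly than they would for independent draws.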
