Efficiently generating discrete random numbers - r

I want to quickly generate discrete random numbers where I have a known CDF. Essentially, the algorithm is:
Construct the CDF vector cdf (an increasing vector starting at 0 and ending at 1)
Generate a uniform(0, 1) random number u
If u < cdf[1] choose 1
else if u < cdf[2] choose 2
else if u < cdf[3] choose 3
...
Example
First generate a CDF:
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
Next generate N uniform random numbers:
N = 1000
u = runif(N)
Now sample the value:
##With some experimenting this seemed to be very quick
##However, with N = 100000 we run out of memory
##N = 10^6 would be a reasonable maximum to cope with
colSums(sapply(u, ">", cdf))

If you know the probability mass function (which you do, if you know the cumulative distribution function), you can use R's built-in sample function, where you can define the probabilities of discrete events with argument prob.
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
system.time(sample(size=1e6,x=1:10000,prob=c(cdf[1],diff(cdf)),replace=TRUE))
user system elapsed
0.01 0.00 0.02
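As a small check of the idea above, the sketch below recovers the probability mass function from a three-point CDF and compares the sampled frequencies with it. The toy CDF and the names pmf, draws and freq are illustrative assumptions, not part of the original example.

```r
set.seed(42)
# Hypothetical three-point CDF, chosen only for illustration
cdf <- c(0.2, 0.5, 1.0)
# The first difference of the CDF gives the probability mass function
pmf <- c(cdf[1], diff(cdf))
draws <- sample(seq_along(pmf), size = 1e5, replace = TRUE, prob = pmf)
# Empirical frequencies, which should be close to pmf
freq <- as.numeric(prop.table(table(draws)))
round(freq, 3)
```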

How about using cut:
N <- 1e6
u <- runif(N)
system.time(as.numeric(cut(u,cdf)))
user system elapsed
1.03 0.03 1.07
head(table(as.numeric(cut(u,cdf))))
1 2 3 4 5 6
51 95 165 172 148 75

If you have a finite number of possible values, then you can use findInterval or cut or, better, sample as mentioned by @Hemmo.
However, if you want to generate data from a distribution whose support is theoretically infinite (like the geometric, negative binomial, Poisson, etc.), then here is an algorithm that will work (it will also work with a finite number of values if wanted):
Start with your vector of uniform values and loop through the distribution's probabilities, subtracting each one from the vector of uniforms; the random value is the iteration at which the running value goes negative. This is easier to see with an example. The code below generates values from a Poisson with mean 5 (replace the dpois call with your calculated values) and compares the result to using the inverse CDF (which is more efficient in cases like this one where it exists).
i <- 0
tmp <- tmp2 <- runif(10000)
randvals <- rep(0, length(tmp) )
while( any(tmp > 0) ) {
tmp <- tmp - dpois(i, 5)
randvals <- randvals + (tmp > 0)
i <- i + 1
}
randvals2 <- qpois( tmp2, 5 )
all.equal(randvals, randvals2)
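For the finite case mentioned at the start of this answer, a minimal findInterval sketch might look like the following. It assumes cdf is sorted in increasing order and ends at 1, as in the question; the names draws and naive are illustrative.

```r
set.seed(1)
cdf <- cumsum(runif(10000, 0, 0.1))
cdf <- cdf / max(cdf)
u <- runif(1e6)
# findInterval(u, cdf) counts the CDF entries that are <= u, so adding 1
# gives the smallest index k with u < cdf[k]
draws <- findInterval(u, cdf) + 1L
# Cross-check a few values against a naive linear search
naive <- sapply(u[1:100], function(v) which(v < cdf)[1])
all(draws[1:100] == naive)
```

Because findInterval does a binary search over the sorted CDF, memory use stays proportional to N rather than to N times the length of cdf.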

Related

How to generate n random numbers from negative binomial distribution?

I am trying to make a function in order to generate n random numbers from negative binomial distribution.
To generate it, I first made a function to generate n random variables from geometric distribution. My function for generating n random numbers from geometric distribution as follows:
rGE<-function(n,p){
I<-rep(NA,n)
for (j in 1:n){
x<-rBer(1,p)
i<-1 # number of trials
while(x==0){
x<-rBer(1,p)
i<-i+1
}
I[j]<- i
}
return(I)
}
I tested this function (rGE), for example for rGE(10,0.5), which is generating 10 random numbers from a geometric distribution with probability of success 0.5, a random result was:
[1] 2 4 2 1 1 3 4 2 3 3
In rGE function I used a function named rBer which is:
rBer<-function(n,p){
sample(0:1,n,replace = TRUE,prob=c(1-p,p))
}
Now, I want to extend my function above (rGE) into a function that generates n random numbers from a negative binomial distribution. I wrote the following function:
rNB<-function(n,r,p){
I<-seq(n)
for (j in 1:n){
x<-0
x<-rBer(1,p)
i<-1 # number of trials
while(x==0 & I[j]!=r){
x<-rBer(1,p)
i<-i+1
}
I[j]<- i
}
return(I)
}
I tested it several times with rNB(3,2,0.1), which generates 3 random numbers from a negative binomial distribution with parameters r=2 and p=0.1:
> rNB(3,2,0.1)
[1] 2 1 7
> rNB(3,2,0.1)
[1] 3 1 4
> rNB(3,2,0.1)
[1] 3 1 2
> rNB(3,2,0.1)
[1] 3 1 3
> rNB(3,2,0.1)
[1] 46 1 13
As you can see, my function (rNB) does not seem to work correctly, because the result is always 1 for the second random number.
Could anyone help me correct my function (rNB) so that it generates n random numbers from a negative binomial distribution with parameters r and p, where r is the number of successes and p is the probability of success?
[[Hint: Explanations regarding geometric distribution and negative binomial distribution:
Geometric distribution: In probability theory and statistics, the geometric distribution is either of two discrete probability distributions:
The probability distribution of the number X of Bernoulli trials needed to get one success, supported on the set { 1, 2, 3, ... }.
The probability distribution of the number Y = X − 1 of failures before the first success, supported on the set { 0, 1, 2, 3, ... }
Negative binomial distribution: A negative binomial experiment is a statistical experiment that has the following properties:
The experiment consists of x repeated trials.
Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other, a failure.
The probability of success, denoted by P, is the same on every trial.
The trials are independent; that is, the outcome on one trial does not affect the outcome on other trials.
The experiment continues until r successes are observed, where r is specified in advance.
]]
Your function will be much faster if you use R's native vectorization. The way you can do this is to generate all your Bernoulli trials at once.
Note that for a negative binomial distribution, the expected number of Bernoulli trials it takes to get r successes is r / p (equivalently, the expected number of failures is r * (1 - p) / p) (Reference)
If we want to draw n negative binomial samples, the expected total number of Bernoulli trials is therefore n * r / p, so we want to draw at least that many Bernoulli samples. For simplicity, we can start by drawing twice that number: 2 * n * r / p. In the unlikely case that this is not enough, we can keep doubling the number of draws until we have enough; once the sum of the resulting vector of Bernoulli trials is at least r * n, we know we have enough Bernoulli trials to simulate our n negative binomial draws.
We can now run a cumsum on the vector of Bernoulli trials to keep track of the number of successes. Subtracting each trial's own outcome gives the number of successes seen before that trial, so that the r-th success closes a draw instead of opening the next one. If you then perform integer division on this vector with %/% r, you will have all the Bernoulli trials labelled according to which negative binomial draw they belong to. You then table this vector.
The first n numbers of the table (obtained by subsetting the table by [1:n] or equivalently by [seq(n)]) are your negative binomial draws. We just remove the table's names by using as.numeric. We also subtract the number of successes (i.e. r) from each of our counts, since we are only counting the failures, not the successes.
rNB <- function(n, r, p) {
  mult <- 2
  all_samples <- 0
  # Keep doubling the number of trials until there are at least n * r successes
  while (sum(all_samples) < n * r) {
    all_samples <- rBer(ceiling(mult * n * r / p), p)
    mult <- mult * 2
  }
  # Number of successes strictly before each trial
  prior_successes <- cumsum(all_samples) - all_samples
  # Trials per draw, minus the r successes, leaves the failures
  as.numeric(table(prior_successes %/% r))[seq(n)] - r
}
So we can do:
rNB(3, 2, 0.1)
#> [1] 14 19 41
rNB(3, 2, 0.1)
#> [1] 23 6 56
rNB(3, 2, 0.1)
#> [1] 11 31 59
rNB(3, 2, 0.1)
#> [1] 7 21 14
mean(rNB(10000, 2, 0.1))
#> [1] 18.0002
We can test this against R's own rnbinom:
mean(rnbinom(10000, 2, 0.1))
#> [1] 18.0919
hist(rnbinom(10000, 2, 0.5), breaks = 0:20)
hist(rNB(10000, 2, 0.5), breaks = 0:20)
Note that the logic of your own version isn't quite right. In particular, the line while(x == 0 & I[j] != r) doesn't make any sense. I is a vector of 1:n, so in your example, whenever j is 2, I[j] is equal to r and the loop stops. This is why your second number is always 1. I don't know what you were trying to do here.
If you want to do it one Bernoulli trial at a time, as you are doing in your own version, try this modified function. The variable names should hopefully make it easy to follow the logic:
rNB <- function(n, r, p) {
# Create an empty vector of length n for our results
draws <- numeric(n)
# Now for each of the n trials we will get a negative binomial sample:
for (i in 1:n) {
# Create success and failure counters for this draw
failures <- successes <- 0
# Now run Bernoulli trials, counting successes and failures as we go
# until we hit r successes
while(successes < r)
{
if(rBer(1, p) == 1)
successes <- successes + 1
else
failures <- failures + 1
}
# Once we have reached r successes, the current number of failures is our
# negative binomial draw
draws[i] <- failures
}
return(draws)
}
This gives results with the same distribution as the faster, albeit more opaque, vectorized version.
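Since the question started from a geometric generator, it may also be worth noting that the number of failures before the r-th success is the sum of r independent geometric counts of failures before a first success. A minimal sketch of that idea follows, using base R's rgeom (which counts failures, not trials) and a hypothetical function name rNB_geom:

```r
rNB_geom <- function(n, r, p) {
  # r geometric failure counts per draw, arranged as an r x n matrix
  geom_draws <- matrix(rgeom(n * r, p), nrow = r)
  # Each column sum is one negative binomial draw (failures before r successes)
  colSums(geom_draws)
}
set.seed(123)
draws <- rNB_geom(10000, 2, 0.1)
mean(draws)  # the theoretical mean is r * (1 - p) / p = 18
```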

How to generate samples from MVN model?

I am trying to run some code on R based on this paper here through example 5.1. I want to simulate the following:
My background on R isn't great so I have the following code below, how can I generate a histogram and samples from this?
xseq<-seq(0, 100, 1)
n<-100
Z<- pnorm(xseq,0,1)
U<- pbern(xseq, 0.4, lower.tail = TRUE, log.p = FALSE)
Beta <- (-1)^U*(4*log(n)/(sqrt(n)) + abs(Z))
Some demonstrations of tools that will be of use:
rnorm(1) # generates one standard normal variable
rnorm(10) # generates 10 standard normal variables
rnorm(1, 5, 6) # generates 1 normal variable with mu = 5, sigma = 6
# not needed for this problem, but perhaps worth saying anyway
rbinom(5, 1, 0.4) # generates 5 Bernoulli variables that are 1 w/ prob. 0.4
So, to generate one instance of a beta:
n <- 100 # using the value you gave; I have no idea what n means here
u <- rbinom(1, 1, 0.4) # make one Bernoulli variable
z <- rnorm(1) # make one standard normal variable
beta <- (-1)^u * (4 * log(n) / sqrt(n) + abs(z))
But now, you'd like to do this many times for a Monte Carlo simulation. One way you might do this is by building a function, having beta be its output, and using the replicate() function, like this:
n <- 100 # putting this here because I assume it doesn't change
genbeta <- function(){ # output of this function will be one copy of beta
u <- rbinom(1, 1, 0.4)
z <- rnorm(1)
return((-1)^u * (4 * log(n) / sqrt(n) + abs(z)))
}
# note that we don't need to store beta anywhere directly;
# rather, it is just the return()ed value of the function we defined
betadraws <- replicate(5000, genbeta())
hist(betadraws)
This will have the effect of making 5000 copies of your beta variable and putting them in a histogram.
There are other ways to do this -- for instance, one might just make a big matrix of the random variables and work directly with it -- but I thought this would be the clearest approach for starting out.
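That vectorized alternative could be sketched as follows; it draws all the Bernoulli and normal variables in one call each instead of calling a function 5000 times. The names n_draws and betadraws_vec are illustrative.

```r
set.seed(2024)
n <- 100
n_draws <- 5000
u <- rbinom(n_draws, 1, 0.4)  # 5000 Bernoulli(0.4) variables
z <- rnorm(n_draws)           # 5000 standard normal variables
# Arithmetic in R is elementwise, so this yields 5000 draws of beta at once
betadraws_vec <- (-1)^u * (4 * log(n) / sqrt(n) + abs(z))
hist(betadraws_vec)
```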
EDIT: I realized that I ignored the second equation entirely, which you probably didn't want.
We've now made a vector of beta values, and you can control the length of the vector in the first parameter of the replicate() function above. I'll leave it as 5000 in my continued example below.
To get random samples of the Y vector, you could use something like:
x <- replicate(5000, rnorm(17))
# makes a 17 x 5000 matrix of independent standard normal variables
epsilon <- rnorm(17)
# vector of 17 standard normals
y <- x %*% betadraws + epsilon
# y is now a 17 x 1 matrix (morally equivalent to a vector of length 17)
and if you wanted to get many of these, you could wrap that inside another function and replicate() it.
Alternatively, if you didn't want the Y vector, but just a single Y_i component:
x <- rnorm(5000)
# x is a vector of 5000 iid standard normal variables
epsilon <- rnorm(1)
# epsilon_i is a single standard normal variable
y <- t(x) %*% betadraws + epsilon
# t() is the transpose function; y is now a 1 x 1 matrix

Inverse CDF method to simulate a random sample

I have a problem where I have written this piece of code, however I think there might be an issue with it.
This is the question:
Write an R function called pr1 that simulates a random sample of size n from the distribution with the CDF given by:
F_X(x) = 0 for x <= 10
F_X(x) = (x-10)^3/1000 for 10 < x < 20
F_X(x) = 1 for x >= 20
x = 10 * (1 + u^(1/3)) # I have used the inverse CDF method here, and I now want to simulate a random sample of size n from the distribution.
Here is my code:
pr1 = function(n)
{ u = runif(n,0,1)
x = 10 * ( 1 + u^(1/3))
x }
pr1(5)
#This was just to check an example with n=5
My question is, since the CDF is 10< x <20, will this affect my code in any way?
Thank you
Are you confusing the range of X with the sample size? The former is restricted to the range (10, 20), the latter can be any positive integer.
You can do a sanity check on your inversion by considering U = 0, which should (and does) yield the minimum of the range of X, and U = 1, which should (and does) yield the maximum of the range. There is no need to range-restrict your inversion: the restriction is built into the use of U(0,1) values on the input side, combined with the fact that CDFs are monotonically non-decreasing. Thus no value of U such that 0 < U < 1 can yield an outcome outside the range 10 < X < 20.
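That sanity check can be sketched in a few lines. The helper name inv_cdf and the comparison point x = 15 are illustrative choices, not part of the original question.

```r
# Inverse of F_X on (10, 20), as derived in the question
inv_cdf <- function(u) 10 * (1 + u^(1/3))
inv_cdf(c(0, 1))  # endpoints of the range: 10 and 20
set.seed(7)
x <- inv_cdf(runif(1e5))
range(x)          # stays inside the range of X
# The empirical CDF at 15 should be close to (15 - 10)^3 / 1000 = 0.125
mean(x <= 15)
```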
Since you want to simulate a piecewise function, your R function should contain some flow control, such as if or its vectorized counterpart ifelse.
Here's a start:
pr1 = function(n, drawing_range){
  x <- sample(drawing_range, size = n) # random drawing of x
  # R does not accept chained comparisons such as 10 < x < 20, and if()
  # only handles a single value, so use nested vectorized ifelse() calls
  output <- ifelse(x <= 10, 0,
                   ifelse(x < 20, (x - 10)^3 / 1000, 1))
  output
}
n is the number of draws. drawing_range is the population from which you draw; for example it can be from [-999, 999] in which case you input -999:999.

Producing RNG vectors in R that have pre-defined sum of pdf or sum of cdf

I am a new R user and I am trying to produce vectors with numbers randomly generated based on a specific distribution (with the rnorm command for example) with the vectors having a pre-defined sum of probability densities or sum of cumulative distributions.
For example, when generating vectors x1, x2 … xn I want them to obey either
sum(pnorm(x1)) = sum(pnorm(x2)) = … sum(pnorm(xn))
or
sum(pnorm(xi)) = "fixed value"
or do the same but with dnorm. In other words, is there a possibility to set such parameters when using rnorm or any other RNG in R?
Tips and suggestions for strategies instead of complete solutions would also be greatly appreciated.
Many thanks in advance for your time.
1.
In the case of a Gaussian distribution,
sampling from (X1,...,Xn) under the condition that X1+...+Xn=s
is just sampling from a
conditional Gaussian distribution.
The vector (X1,X2,...,Xn,X1+...+Xn) has a Gaussian distribution, with zero mean,
and variance matrix
1 0 0 ... 0 1
0 1 0 ... 0 1
0 0 1 ... 0 1
...
0 0 0 ... 1 1
1 1 1 ... 1 n.
We can therefore sample from it as follows.
s <- 1 # Desired sum
n <- 10
mu1 <- rep(0,n)
mu2 <- 0
V11 <- diag(n)
V12 <- as.matrix(rep(1,n))
V21 <- t(V12)
V22 <- as.matrix(n)
mu <- mu1 + V12 %*% solve(V22, s - mu2)
V <- V11 - V12 %*% solve(V22,V21)
library(mvtnorm)
# Random vectors (in each row)
x <- rmvnorm( 100, mu, V )
# Check the sum and the distribution
apply(x, 1, sum)
hist(x[,1])
qqnorm(x[,1])
For an arbitrary distribution, this approach would require you to compute the conditional distribution, which may not be easy.
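For independent standard normals specifically, the conditional distribution above has mean s/n in every coordinate and variance matrix I - J/n (J being the all-ones matrix), which is also what one gets by centering an unconditioned sample and shifting it. A minimal sketch of that shortcut, which avoids the mvtnorm dependency, follows; the function name draw_fixed_sum is hypothetical.

```r
draw_fixed_sum <- function(m, n, s) {
  z <- matrix(rnorm(m * n), nrow = m)  # m unconditioned samples, one per row
  # Remove each row's mean, then add s / n so that every row sums to s
  z - rowMeans(z) + s / n
}
set.seed(11)
x <- draw_fixed_sum(100, 10, 1)
range(rowSums(x))  # all equal to s up to rounding error
```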
2.
There is another easy special case: a uniform distribution.
To uniformly sample n (positive) numbers that sum up to 1,
you can take n-1 numbers,
uniformly in [0,1],
and sort them: they define n intervals,
whose lengths sum to 1 and happen to be uniformly distributed (over the set of positive vectors that sum to 1).
Since those points form a Poisson process,
you can also generate them with an exponential distribution.
x <- rexp(n)
x <- x / sum(x) # Sums to 1; the vector x is uniformly distributed on the simplex
This idea is explained (with a lot of pictures) in the following article:
Portfolio Optimization for VaR, CVaR, Omega and Utility with General Return Distributions,
(W.T. Shaw, 2011), pages 6 to 8.
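The sorting construction described in words at the start of this section could be written as follows; the function name simplex_by_spacings is an illustrative assumption.

```r
simplex_by_spacings <- function(n) {
  # n - 1 sorted uniform cut points split [0, 1] into n intervals
  cuts <- sort(runif(n - 1))
  # Interval lengths are the gaps between consecutive cut points
  diff(c(0, cuts, 1))
}
set.seed(3)
w <- simplex_by_spacings(10)
sum(w)
```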
3.
(EDIT) I had initially misread the question, which was asking about sum(pnorm(x)), not sum(x). This turns out to be easier.
If X has a Gaussian distribution, then pnorm(X) has a uniform distribution:
the problem is then to sample from a uniform distribution, with a prescribed sum.
n <- 10
s <- 1 # Desired sum
p <- rexp(n)
p <- p / sum(p) * s # Uniform, sums to s
x <- qnorm(p) # Gaussian, the p-values sum to s

Generate Random Numbers with Std Dev x and Fixed Product

I want to generate a series of returns x such that the standard deviation of the returns is, say, 0.03 and the product of 1+x is 1. To summarise, there are two conditions for the returns:
1) sd(x) == 0.03
2) prod(1+x) == 1
Is this possible and if so, how can I implement it in R?
Thank you.
A slightly more sophisticated approach is to use knowledge of the log-normal distribution: from ?dlnorm, Var= exp(2*mu + sigma^2)*(exp(sigma^2) - 1). We want the geometric mean to equal 1, so the mean on the log scale should be 0. We have Var = exp(sigma^2)*(exp(sigma^2)-1), can't obviously solve this analytically but we can use uniroot:
Find the correct log-variance:
vfun <- function(s2,v=0.03^2) { exp(s2)*(exp(s2)-1)-v }
s2 <- uniroot(vfun,interval=c(1e-6,100))$root
Generate values:
set.seed(1001)
x <- rnorm(1000,mean=0,sd=sqrt(s2))
x <- exp(x-mean(x))-1 ## centers the log-returns so that they sum to exactly zero
prod(1+x) ## exactly 1
sd(x)
This produces values with a standard deviation not exactly equal to 0.03, but close. If we wanted we could fix this too ...
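One way to do that, sketched here as an assumption rather than as the fix the answer had in mind: any multiple of the centered log-returns still sums to zero, so the product stays at 1 while a scale factor is tuned with uniroot until the standard deviation reaches 0.03. The names y, sd_gap and k are illustrative.

```r
set.seed(1001)
y <- rnorm(1000)
y <- y - mean(y)  # centered log-returns: sum(y) is zero
# Difference between the achieved and the target standard deviation
sd_gap <- function(k) sd(exp(k * y) - 1) - 0.03
k <- uniroot(sd_gap, interval = c(1e-6, 1), tol = 1e-12)$root
x <- exp(k * y) - 1
prod(1 + x)  # 1 up to rounding error
sd(x)        # 0.03 up to the uniroot tolerance
```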
A very simple approach is to simply simulate returns until you have a set that satisfies your requirements. You will need to specify a tolerance to your requirements, though (see here why).
nn <- 10
epsilon <- 1e-3
while ( TRUE ) {
xx <- rnorm(nn,0,0.03)
if ( abs(sd(xx)-0.03)<epsilon & abs(prod(1+xx)-1)<epsilon ) break
}
xx
yields
[1] 0.007862226 -0.011437600 -0.038740969 0.028614022 0.006986953
[6] -0.004131429 0.030846398 -0.037977057 0.046448318 -0.025294236
