How to generate n random numbers from negative binomial distribution? - r

I am trying to write a function that generates n random numbers from a negative binomial distribution.
To do this, I first wrote a function that generates n random numbers from a geometric distribution:
rGE<-function(n,p){
  I<-rep(NA,n)
  for (j in 1:n){
    x<-rBer(1,p)
    i<-1 # number of trials
    while(x==0){
      x<-rBer(1,p)
      i<-i+1
    }
    I[j]<- i
  }
  return(I)
}
I tested this function (rGE) with, for example, rGE(10,0.5), which generates 10 random numbers from a geometric distribution with probability of success 0.5. One random result was:
[1] 2 4 2 1 1 3 4 2 3 3
The rGE function uses a helper function named rBer:
rBer<-function(n,p){
  sample(0:1,n,replace = TRUE,prob=c(1-p,p))
}
Now I want to extend rGE into a function that generates n random numbers from a negative binomial distribution. I wrote the following function:
rNB<-function(n,r,p){
  I<-seq(n)
  for (j in 1:n){
    x<-0
    x<-rBer(1,p)
    i<-1 # number of trials
    while(x==0 & I[j]!=r){
      x<-rBer(1,p)
      i<-i+1
    }
    I[j]<- i
  }
  return(I)
}
I ran rNB(3,2,0.1), which should generate 3 random numbers from a negative binomial distribution with parameters r=2 and p=0.1, several times:
> rNB(3,2,0.1)
[1] 2 1 7
> rNB(3,2,0.1)
[1] 3 1 4
> rNB(3,2,0.1)
[1] 3 1 2
> rNB(3,2,0.1)
[1] 3 1 3
> rNB(3,2,0.1)
[1] 46 1 13
As you can see, my function (rNB) does not seem to work correctly, because it always generates 1 for the second random number.
Could anyone help me correct my function (rNB) so that it generates n random numbers from a negative binomial distribution with parameters r and p, where r is the number of successes and p is the probability of success?
[[Hint: Explanations regarding geometric distribution and negative binomial distribution:
Geometric distribution: In probability theory and statistics, the geometric distribution is either of two discrete probability distributions:
The probability distribution of the number X of Bernoulli trials needed to get one success, supported on the set { 1, 2, 3, ... }.
The probability distribution of the number Y = X − 1 of failures before the first success, supported on the set { 0, 1, 2, 3, ... }
Negative binomial distribution: A negative binomial experiment is a statistical experiment that has the following properties:
The experiment consists of x repeated trials.
Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other, a failure.
The probability of success, denoted by P, is the same on every trial.
The trials are independent; that is, the outcome on one trial does not affect the outcome on other trials.
The experiment continues until r successes are observed, where r is specified in advance.
]]

Your function will be much faster if you use R's native vectorization. The way you can do this is to generate all your Bernoulli trials at once.
Note that for a negative binomial distribution with success probability p, the expected number of failures before the r-th success is r * (1 - p) / p, so the expected number of Bernoulli trials needed to get r successes is r / p (Reference)
If we want to draw n negative binomial samples, then the expected total number of Bernoulli trials will therefore be n * r / p. So we want to draw at least that many Bernoulli samples. For simplicity, we can start by drawing twice that number: 2 * n * r / p. In the unlikely case that this is not enough, we can double the number and draw again until we have enough; once the sum of the resultant vector of Bernoulli trials is at least r * n, we know we have enough Bernoulli trials to simulate our n negative binomial draws.
We can now run a cumsum on the vector of Bernoulli trials to keep track of the number of successes. Subtracting the trial itself from the cumsum gives the number of successes seen before each trial, and integer division of that count by r (%/% r) labels every Bernoulli trial according to which negative binomial draw it belongs to, with each group ending on its r-th success. You then table this vector.
The first n numbers of the table (obtained by subsetting the table by [1:n] or equivalently by [seq(n)]) are your negative binomial draws. We just remove the table's names by using as.numeric. We also subtract the number of successes (i.e. r) from each of our counts, since we are only counting the failures, not the successes.
rNB <- function(n, r, p) {
  mult <- 2
  all_samples <- 0
  while (sum(all_samples) < n * r) {
    # Draw mult times the expected number of trials, n * r / p
    all_samples <- rBer(ceiling(mult * n * r / p), p)
    mult <- mult * 2
  }
  # Successes seen before each trial, so that every group ends on its r-th success
  prior_successes <- cumsum(all_samples) - all_samples
  as.numeric(table(prior_successes %/% r))[seq(n)] - r
}
So we can do:
rNB(3, 2, 0.1)
#> [1] 14 19 41
rNB(3, 2, 0.1)
#> [1] 23 6 56
rNB(3, 2, 0.1)
#> [1] 11 31 59
rNB(3, 2, 0.1)
#> [1] 7 21 14
mean(rNB(10000, 2, 0.1))
#> [1] 18.0002
We can test this against R's own rnbinom:
mean(rnbinom(10000, 2, 0.1))
#> [1] 18.0919
hist(rnbinom(10000, 2, 0.5), breaks = 0:20)
hist(rNB(10000, 2, 0.5), breaks = 0:20)
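The grouping step can be traced by hand on a short, fixed vector of Bernoulli outcomes. The sketch below is only an illustration: the vector trials and the other variable names are made up for this example, and it groups each trial by the number of successes seen before it, so that every group ends on its r-th success.

```r
# Hypothetical fixed outcomes (1 = success, 0 = failure), not random draws
trials <- c(0, 1, 0, 0, 1, 1, 1, 0, 1, 0)
r <- 2  # successes per negative binomial draw
n <- 2  # number of complete draws contained in `trials`

# Number of successes observed before each trial
prior_successes <- cumsum(trials) - trials

# Trials 1-5 fall in group 0, trials 6-7 in group 1, the rest is an unfinished draw
group <- prior_successes %/% r

# Trials per group for the first n groups, minus the r successes, leaves the failures
draws <- tabulate(group + 1, nbins = n) - r
draws
#> [1] 3 0
```

The first draw needed five trials (three failures), the second needed two trials (no failures), and the trailing unfinished group is dropped.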
Note that the logic of your own version isn't quite right. In particular, the condition while(x == 0 & I[j] != r) does not do what you want. I is the vector 1:n, so in your example, whenever j is 2, I[j] is equal to r and the loop never runs. This is why your second number is always 1. I don't know what you were trying to do here.
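The effect is easy to reproduce in isolation. Using the values from the question (n = 3, r = 2), the second half of the loop condition is already FALSE for j = 2 before any trial is drawn:

```r
n <- 3
r <- 2
I <- seq(n)   # 1 2 3, as in the question's rNB
I != r
#> [1]  TRUE FALSE  TRUE
```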
If you want to do it one Bernoulli trial at a time, as you are doing in your own version, try this modified function. The variable names should hopefully make it easy to follow the logic:
rNB <- function(n, r, p) {
  # Create an empty vector of length n for our results
  draws <- numeric(n)
  # Now for each of the n trials we will get a negative binomial sample:
  for (i in 1:n) {
    # Create success and failure counters for this draw
    failures <- successes <- 0
    # Now run Bernoulli trials, counting successes and failures as we go
    # until we hit r successes
    while(successes < r)
    {
      if(rBer(1, p) == 1)
        successes <- successes + 1
      else
        failures <- failures + 1
    }
    # Once we have reached r successes, the current number of failures is our
    # negative binomial draw
    draws[i] <- failures
  }
  return(draws)
}
This gives results from the same distribution as the faster, albeit more opaque, vectorized version.
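Another way to build on a geometric generator, shown here only as a sketch, uses the fact that a negative binomial draw with r successes is the sum of r independent geometric draws. The example relies on base R's rgeom, which counts failures before the first success; the name rNB_geom is made up for illustration.

```r
# Hypothetical helper: negative binomial draws as sums of r geometric draws
rNB_geom <- function(n, r, p) {
  # One column per negative binomial draw, r geometric failure counts per column
  geom_draws <- matrix(rgeom(n * r, p), nrow = r)
  colSums(geom_draws)
}

set.seed(1)  # arbitrary seed, only for reproducibility
x <- rNB_geom(10000, 2, 0.1)
mean(x)      # should be close to r * (1 - p) / p = 18
```

With the question's rGE, which counts trials rather than failures, the same sum would need r subtracted to give the number of failures.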

Related

Calculate PMF for a fair coin tossed 10 times

A fair coin is tossed 10 times. Find the Probability Mass Function (PMF) of X and the length of the longest run of heads observed. I need to write R code to accomplish this task. The following functions are suggested for it:
as.integer(), intToBits(), rev(), rle().
I have a starting idea of which functions to use, but I do not know enough to tie them together to calculate the PMF and the length of the longest run.
toBinary <- function(n){
  paste0(as.integer(rev(intToBits(n)[1:10])),collapse = "")
}
toBinary(4)
toBinary(1023)
for(i in 0:1023){
  print(toBinary(i))
}
I have added the following lines to complete the task:
# number of trials
trials <- 10
# probability of success
success <- 0.5
# x is number of random variable X
x <- 0:trials
# number of probabilities for a binomial distribution
prob <- dbinom(x,trials,success)
prob
# create a table with the data from above
prob_table<-cbind(x,prob)
prob_table
# specify the column names from the probability table
colnames(prob_table)<-c("x", "P(X=x)")
prob_table
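One reading of the task is that X is the length of the longest run of heads. Under that assumption, a minimal sketch can enumerate all 1024 equally likely sequences with intToBits and measure runs with rle. The names longest_run, runs_all and pmf are made up for this example.

```r
# Longest run of heads (1s) among the lowest `tosses` bits of the integer k
longest_run <- function(k, tosses = 10) {
  bits <- as.integer(intToBits(k))[seq_len(tosses)]
  runs <- rle(bits)
  head_runs <- runs$lengths[runs$values == 1]
  if (length(head_runs) == 0) 0L else max(head_runs)
}

# Every integer 0..1023 encodes one equally likely sequence of 10 tosses
runs_all <- vapply(0:1023, longest_run, integer(1))

# PMF of the longest run: count sequences for each length 0..10, divide by 1024
pmf <- tabulate(runs_all + 1, nbins = 11) / 1024
names(pmf) <- 0:10
pmf
```

Bit order does not matter here, because reversing a sequence leaves its longest run unchanged, so rev() is not needed for this particular statistic.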

rnorm in R not iid

I'm generating small samples (e.g. 24 obs) of a normally distributed variable in R. It seems that the resulting variable has a systematically negative autocorrelation.
The code below generates 1000 samples of 24 observations of x and calculates the first three partial autocorrelations (via pacf) for each. These are not huge on average (-0.075 to -0.045), but the averages are always negative. Increasing the sample size (N) moves the autocorrelation towards zero. However, my question is: why are the random numbers in a small sample negatively autocorrelated?
K <- 1000
N <- 24
ac <- NULL
for (k in 1:K) {
  x <- rnorm(n=N)
  ac <- rbind(ac, pacf(x, plot=FALSE)$acf[1:3,1,1])
}
apply(ac, 2, mean)
[1] -0.04925651 -0.07523400 -0.04542514

Why is Adam optimization unable to converge in linear regression?

I am studying the Adam optimizer. This is a toy problem. In R, I generate some artificial data:
Y = c0 + c1 * x1 + c2 * x2 + noise
In the above equation, x1, x2 and noise are normal random numbers I generated in R, and theta = [c0, c1, c2] is the parameter vector I try to estimate with the Adam optimizer. For this simple regression problem, I can also determine theta analytically; this is k in my R code below.
For the Adam algorithm, I use the formulae from this site:
Overview: Adam
I vary the step size eta in this parametric study. The final theta from the Adam algorithm is not the same as the analytical solution k in my R code.
I checked my code many times. I ran it line by line and cannot understand why the Adam algorithm does not converge.
Added:
I changed the algorithm to AMSGrad. It performs better than Adam in this case. However, AMSGrad does not converge either.
rm(list = ls())
n=500
x1=rnorm(n,mean=6,sd=1.6)
x2=rnorm(n,mean=4,sd=2.5)
X=cbind(x1,x2)
A=as.matrix(cbind(intercept=rep(1,n),x1,x2))
Y=-20+51*x1-15*x2+rnorm(n,mean=0,sd=2);
k=solve(t(A)%*%A,t(A)%*%Y) # k is the parameters determined by analytical method
MSE=sum((A%*%k-Y)^2)/(n);
iterations=4000 # total number of steps
epsilon = 0.0001 # set precision
eta=0.04 # step size
beta1=0.9
beta2=0.999
t1=integer(iterations)
t2=matrix(0,iterations,3)
t3=integer(iterations)
epsilon1=1E-8 # small number defined for numerical computation
X=as.matrix(X)# convert data table X into a matrix
N=dim(X)[1] # total number of observations
X=as.matrix(cbind(intercept=rep(1,length(N)),X))# add a column of ones to represent intercept
np=dim(X)[2] # number of parameters to be determined
theta=matrix(rnorm(n=np,mean=0,sd=2),1,np) # Initialize theta:1 x np matrix
m_i=matrix(0,1,np) # initialization, zero vector
v_i=matrix(0,1,np) # initialization, zero vector
for(i in 1:iterations){
  error=theta%*%t(X)-t(Y) # error = (theta * x' -Y'). Error is a 1xN row vector;
  grad=1/N*error%*%X # Gradient grad is 1 x np vector
  m_i=beta1*m_i+(1-beta1)*grad # moving average of gradient, 1 x np vector
  v_i=beta2*v_i+(1-beta2)*grad^2 # moving average of squared gradients, 1 x np vector
  # corrected moving averages
  m_corrected=m_i/(1-beta1^i)
  v_corrected=v_i/(1-beta2^i)
  d_theta=eta/(sqrt(v_corrected)+epsilon1)*m_corrected
  theta=theta-d_theta
  L=sqrt(sum((d_theta)^2)) # calculating the L2 norm
  t1[i]=L # record the L2 norm in each step
  if ((is.infinite(L))||(is.nan(L))) {
    print("Learning rate is too large. Lowering the rate may help.")
    break
  }
  else if (L<=epsilon) {
    print("Algorithm convergence is reached.")
    break # checking whether convergence is obtained or not
  }
  # if (i==1){
  # browser()
  # }
}
plot(t1,type="l",ylab="norm",lwd=3,col=rgb(0,0,1))
k
theta

Efficiently generating discrete random numbers

I want to quickly generate discrete random numbers where I have a known CDF. Essentially, the algorithm is:
Construct the CDF vector cdf (an increasing vector that ends at 1)
Generate a uniform(0, 1) random number u
If u < cdf[1] choose 1
else if u < cdf[2] choose 2
else if u < cdf[3] choose 3
...
Example
First generate a CDF:
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
Next generate N uniform random numbers:
N = 1000
u = runif(N)
Now sample the value:
##With some experimenting this seemed to be very quick
##However, with N = 100000 we run out of memory
##N = 10^6 would be a reasonable maximum to cope with
colSums(sapply(u, ">", cdf))
If you know the probability mass function (which you do, if you know the cumulative distribution function), you can use R's built-in sample function, where you can define the probabilities of discrete events with argument prob.
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
system.time(sample(size=1e6,x=1:10000,prob=c(cdf[1],diff(cdf)),replace=TRUE))
user system elapsed
0.01 0.00 0.02
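The conversion from CDF to probability mass function used in the prob argument can be checked on a tiny hand-made example. The three-value CDF below is invented for illustration only.

```r
cdf <- c(0.2, 0.5, 1)            # small made-up CDF over the values 1, 2, 3
pmf <- c(cdf[1], diff(cdf))      # successive differences: 0.2 0.3 0.5

set.seed(42)                     # arbitrary seed, only for reproducibility
draws <- sample(seq_along(cdf), size = 1e5, replace = TRUE, prob = pmf)
prop.table(table(draws))         # empirical frequencies, close to pmf
```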
How about using cut:
N <- 1e6
u <- runif(N)
system.time(as.numeric(cut(u,cdf)))
user system elapsed
1.03 0.03 1.07
head(table(as.numeric(cut(u,cdf))))
1 2 3 4 5 6
51 95 165 172 148 75
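A closely related base R option is findInterval, which implements the "first index i with u < cdf[i]" rule from the question directly. The sketch below uses a small made-up CDF and hand-picked values of u so that the result can be checked by eye.

```r
cdf <- c(0.2, 0.5, 1)          # small made-up CDF over the values 1, 2, 3
u <- c(0.1, 0.2, 0.49, 0.9)    # fixed "uniform" values, chosen by hand

# findInterval counts how many cdf entries are <= u; adding 1 gives the
# first index whose cdf value is strictly greater than u
idx <- findInterval(u, cdf) + 1
idx
#> [1] 1 2 2 3
```

Unlike cut(u, cdf), this also assigns a value to any u below cdf[1] instead of returning NA.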
If you have a finite number of possible values then you can use findInterval or cut or, better, sample as mentioned by @Hemmo.
However, if you want to generate data from a distribution that theoretically goes to infinity (like the geometric, negative binomial, Poisson, etc.) then here is an algorithm that will work (this will also work with a finite number of values if wanted):
Start with your vector of uniform values and loop through the probability mass values, subtracting each from the vector of uniforms; the random value is the iteration at which the entry goes negative. This is easier to see with an example. The code below generates values from a Poisson with mean 5 (replace the dpois call with your calculated values) and compares the result to using the inverse CDF (which is more efficient in this case, where it exists).
i <- 0
tmp <- tmp2 <- runif(10000)
randvals <- rep(0, length(tmp) )
while( any(tmp > 0) ) {
  tmp <- tmp - dpois(i, 5)
  randvals <- randvals + (tmp > 0)
  i <- i + 1
}
randvals2 <- qpois( tmp2, 5 )
all.equal(randvals, randvals2)

Producing RNG vectors in R that have pre-defined sum of pdf or sum of cdf

I am a new R user and I am trying to produce vectors with numbers randomly generated based on a specific distribution (with the rnorm command for example) with the vectors having a pre-defined sum of probability densities or sum of cumulative distributions.
For example, when generating vectors x1, x2 … xn I want them to obey either
sum(pnorm(x1)) = sum(pnorm(x2)) = … sum(pnorm(xn))
or
sum(pnorm(xi)) = "fixed value"
or do the same but with dnorm. In other words, is there a possibility to set such parameters when using rnorm or any other RNG in R?
Tips and suggestions for strategies instead of complete solutions would also be greatly appreciated.
Many thanks in advance for your time.
1.
In the case of a Gaussian distribution, sampling from (X1,...,Xn) under the condition that X1+...+Xn=s is just sampling from a conditional Gaussian distribution.
The vector (X1,X2,...,Xn,X1+...+Xn) has a Gaussian distribution, with zero mean, and variance matrix
1 0 0 ... 0 1
0 1 0 ... 0 1
0 0 1 ... 0 1
...
0 0 0 ... 1 1
1 1 1 ... 1 n.
We can therefore sample from it as follows.
s <- 1 # Desired sum
n <- 10
mu1 <- rep(0,n)
mu2 <- 0
V11 <- diag(n)
V12 <- as.matrix(rep(1,n))
V21 <- t(V12)
V22 <- as.matrix(n)
mu <- mu1 + V12 %*% solve(V22, s - mu2)
V <- V11 - V12 %*% solve(V22,V21)
library(mvtnorm)
# Random vectors (in each row)
x <- rmvnorm( 100, mu, V )
# Check the sum and the distribution
apply(x, 1, sum)
hist(x[,1])
qqnorm(x[,1])
For an arbitrary distribution, this approach would require you to compute the conditional distribution, which may not be easy.
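For independent standard Gaussians, the same conditional distribution can also be reached in base R by shifting each sample so that its sum equals s. The sketch below is an alternative to the mvtnorm code, not the method used above; it relies on the standard result that x - mean(x) + s/n has mean s/n and covariance I - J/n (J being the all-ones matrix), which matches mu and V. The variable names m and z are made up for this example.

```r
s <- 1     # desired sum
n <- 10    # length of each vector
m <- 100   # number of vectors (one per row)

set.seed(123)  # arbitrary seed, only for reproducibility
z <- matrix(rnorm(m * n), nrow = m)

# Remove each row's mean, then add s / n so that every row sums to s
x <- z - rowMeans(z) + s / n

range(rowSums(x))  # both ends equal s up to floating-point error
```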
2.
There is another easy special case: a uniform distribution.
To uniformly sample n (positive) numbers that sum up to 1, you can take n-1 numbers, uniformly in [0,1], and sort them: they define n intervals, whose lengths sum up to 1 and turn out to be uniformly distributed over all such vectors.
Since those points form a Poisson process, you can also generate them with an exponential distribution.
x <- rexp(n)
x <- x / sum(x) # Sums to 1; the vector is uniform over all positive vectors with sum 1
This idea is explained (with a lot of pictures) in the following article:
Portfolio Optimization for VaR, CVaR, Omega and Utility with General Return Distributions,
(W.T. Shaw, 2011), pages 6 to 8.
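The sorting construction described above can be sketched in a few lines. The variable name cuts is made up for this example.

```r
n <- 5
set.seed(7)                    # arbitrary seed, only for reproducibility
cuts <- sort(runif(n - 1))     # n - 1 cut points in [0, 1]
x <- diff(c(0, cuts, 1))       # lengths of the n intervals they define
x
sum(x)                         # 1, up to floating-point error
```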
3.
(EDIT) I had initially misread the question, which was asking about sum(pnorm(x)), not sum(x). This turns out to be easier.
If X has a Gaussian distribution, then pnorm(X) has a uniform distribution:
the problem is then to sample from a uniform distribution, with a prescribed sum.
n <- 10
s <- 1 # Desired sum
p <- rexp(n)
p <- p / sum(p) * s # Positive, sums to s (every p must stay below 1 for qnorm)
x <- qnorm(p) # Gaussian, the p-values sum to s
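A quick check, repeating the same steps with a fixed seed, confirms that the constraint on sum(pnorm(x)) holds. This assumes s is small enough that every entry of p stays below 1, which is always the case for s = 1.

```r
n <- 10
s <- 1                 # desired value of sum(pnorm(x))
set.seed(99)           # arbitrary seed, only for reproducibility
p <- rexp(n)
p <- p / sum(p) * s    # positive values that sum to s
x <- qnorm(p)          # map the probabilities back to the Gaussian scale
sum(pnorm(x))          # equals s up to floating-point error
```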
