Producing RNG vectors in R that have a pre-defined sum of pdf or sum of cdf

I am a new R user trying to produce vectors of numbers randomly generated from a specific distribution (with the rnorm command, for example), where each vector has a pre-defined sum of probability densities or sum of cumulative probabilities.
For example, when generating vectors x1, x2, ..., xn I want them to obey either
sum(pnorm(x1)) = sum(pnorm(x2)) = ... = sum(pnorm(xn))
or
sum(pnorm(xi)) = "fixed value"
or do the same but with dnorm. In other words, is there a possibility to set such parameters when using rnorm or any other RNG in R?
Tips and suggestions for strategies instead of complete solutions would also be greatly appreciated.
Many thanks in advance for your time.

1.
In the case of a Gaussian distribution, sampling (X1,...,Xn) under the condition that X1+...+Xn = s is just sampling from a conditional Gaussian distribution. The vector (X1, X2, ..., Xn, X1+...+Xn) has a Gaussian distribution, with zero mean, and variance matrix
1 0 0 ... 0 1
0 1 0 ... 0 1
0 0 1 ... 0 1
...
0 0 0 ... 1 1
1 1 1 ... 1 n.
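For reference, these are the standard Gaussian conditioning formulas that the code below implements, with V11, V12, V21, V22 denoting the blocks of the variance matrix above:
mu* = mu1 + V12 V22^(-1) (s - mu2)
V*  = V11 - V12 V22^(-1) V21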
We can therefore sample from it as follows.
s <- 1 # Desired sum
n <- 10
mu1 <- rep(0,n)
mu2 <- 0
V11 <- diag(n)
V12 <- as.matrix(rep(1,n))
V21 <- t(V12)
V22 <- as.matrix(n)
mu <- mu1 + V12 %*% solve(V22, s - mu2)
V <- V11 - V12 %*% solve(V22,V21)
library(mvtnorm)
# Random vectors (in each row)
x <- rmvnorm( 100, mu, V )
# Check the sum and the distribution
apply(x, 1, sum)
hist(x[,1])
qqnorm(x[,1])
For an arbitrary distribution, this approach would require you to compute the conditional distribution, which may not be easy.
2.
There is another easy special case: the uniform distribution. To uniformly sample n (positive) numbers that sum to 1, you can take n-1 numbers, uniform in [0,1], and sort them: they define n intervals, whose lengths sum to 1 and which are uniformly distributed on the simplex.
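A minimal sketch of that construction (sorting n-1 uniforms and taking the gap lengths):
n <- 5
u <- sort(runif(n - 1)) # n-1 uniform points in [0, 1]
x <- diff(c(0, u, 1))   # lengths of the n intervals they define
sum(x)                  # always 1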
Since those points form a Poisson process,
you can also generate them with an exponential distribution.
x <- rexp(n)
x <- x / sum(x) # Sums to 1; the vector of spacings is uniform on the simplex
This idea is explained (with a lot of pictures) in the following article:
Portfolio Optimization for VaR, CVaR, Omega and Utility with General Return Distributions (W. T. Shaw, 2011), pages 6 to 8.
3.
(EDIT) I had initially misread the question, which was asking about sum(pnorm(x)), not sum(x). This turns out to be easier.
If X has a Gaussian distribution, then pnorm(X) has a uniform distribution: the problem is then to sample from a uniform distribution with a prescribed sum.
n <- 10
s <- 1 # Desired sum
p <- rexp(n)
p <- p / sum(p) * s # Nonnegative, sums to s
x <- qnorm(p) # Gaussian quantiles whose pnorm values sum to s
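A quick sanity check on the result (note that qnorm needs every rescaled p to stay strictly between 0 and 1, which is guaranteed here since s = 1):
sum(pnorm(x)) # recovers s, up to floating-point error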

Related

R - Standardize matrix to have unit diagonals

I am seeking to generate the below matrix:
Θ = B + δI_p ∈ R^(p×p),
where I_p is the identity matrix and each off-diagonal entry of the symmetric matrix B is generated independently, equal to 0.5 with probability 0.1 or 0 with probability 0.9. The parameter δ > 0 is chosen such that Θ is positive definite. The matrix is standardized to have unit diagonal (transforming from a covariance matrix to a correlation matrix).
I think that I have most of the code, but I'm unsure how to standardize the matrix to have unit diagonal syntactically in R (and theoretically, why that is a useful feature of a matrix).
# set number of cols/rows
p <- 5
set.seed(123)
# generate matrix B with values of 0.5 given probabilities
B <- matrix(sample(c(0,0.5), p^2, replace=TRUE, prob=c(0.9,0.1)), p)
# make B symmetric by copying the upper triangle into the lower triangle
i <- lower.tri(B)
B[i] <- t(B)[i]
diag(B) <- rep(0, p)
# finding parameter delta, such that Θ is positive definite.
(delta <- -min(eigen(B, symmetric=TRUE, only.values=TRUE)$values))
# set theta (delta is 2.8802)
theta <- B + 2.89*(diag(p))
# now to standardize the matrix to have unit diagonals ?
There are many ways to do this, but the following is very fast in timing experiments:
v <- 1/sqrt(diag(theta))
B <- theta * outer(v, v)
This divides all rows and columns by their standard deviations, which are the square roots of the diagonal elements.
It will fail whenever any diagonal element is zero or negative, but in that case such standardization isn't possible anyway. Computing the square roots and their reciprocals first lets you learn as early as possible, with minimal computation, whether the procedure will succeed.
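As a sanity check, base R's cov2cor() performs the same covariance-to-correlation conversion, so you can compare against it:
R <- cov2cor(theta)                       # built-in covariance -> correlation
all.equal(R, B, check.attributes = FALSE) # TRUE: matches the manual version
all.equal(diag(B), rep(1, p))             # unit diagonal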
BTW, a direct way to compute B in the first part of your code (which has a zero diagonal) is
B <- as.matrix(structure(sample(c(0, 0.5), p*(p-1)/2, replace=TRUE, prob=c(0.9, 0.1)),
                         Size=p, Diag=TRUE, class="dist"))
This eliminates the superfluous sampling.
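A quick check that this shortcut produces what the longer code did, namely a symmetric matrix with zero diagonal:
isSymmetric(B)    # TRUE
all(diag(B) == 0) # TRUE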

How to generate n random numbers from negative binomial distribution?

I am trying to make a function to generate n random numbers from the negative binomial distribution.
To generate them, I first made a function to generate n random variables from the geometric distribution. My function for generating n random numbers from the geometric distribution is as follows:
rGE <- function(n, p){
  I <- rep(NA, n)
  for (j in 1:n){
    x <- rBer(1, p)
    i <- 1 # number of trials
    while (x == 0){
      x <- rBer(1, p)
      i <- i + 1
    }
    I[j] <- i
  }
  return(I)
}
I tested this function (rGE); for example, rGE(10, 0.5) generates 10 random numbers from a geometric distribution with success probability 0.5. A sample result was:
[1] 2 4 2 1 1 3 4 2 3 3
In the rGE function I used a function named rBer, which is:
rBer <- function(n, p){
  sample(0:1, n, replace = TRUE, prob = c(1-p, p))
}
Now, I want to improve my above function (rGE) to make a function that generates n random numbers from a negative binomial distribution. I made the following function:
rNB <- function(n, r, p){
  I <- seq(n)
  for (j in 1:n){
    x <- 0
    x <- rBer(1, p)
    i <- 1 # number of trials
    while (x == 0 & I[j] != r){
      x <- rBer(1, p)
      i <- i + 1
    }
    I[j] <- i
  }
  return(I)
}
I tested it with rNB(3, 2, 0.1), which generates 3 random numbers from a negative binomial distribution with parameters r=2 and p=0.1, several times:
> rNB(3,2,0.1)
[1] 2 1 7
> rNB(3,2,0.1)
[1] 3 1 4
> rNB(3,2,0.1)
[1] 3 1 2
> rNB(3,2,0.1)
[1] 3 1 3
> rNB(3,2,0.1)
[1] 46 1 13
As you can see, I think my function (rNB) does not work correctly, because it always generates 1 for the second random number.
Could anyone help me correct my function (rNB) so that it generates n random numbers from a negative binomial distribution with parameters n, r, and p, where r is the number of successes and p is the probability of success?
[[Hint: Explanations regarding geometric distribution and negative binomial distribution:
Geometric distribution: In probability theory and statistics, the geometric distribution is either of two discrete probability distributions:
The probability distribution of the number X of Bernoulli trials needed to get one success, supported on the set { 1, 2, 3, ... }.
The probability distribution of the number Y = X − 1 of failures before the first success, supported on the set { 0, 1, 2, 3, ... }
Negative binomial distribution: A negative binomial experiment is a statistical experiment that has the following properties:
The experiment consists of x repeated trials.
Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other, a failure.
The probability of success, denoted by P, is the same on every trial.
The trials are independent; that is, the outcome on one trial does not affect the outcome on other trials.
The experiment continues until r successes are observed, where r is specified in advance.
]]
Your function will be much faster if you use R's native vectorization. The way you can do this is to generate all your Bernoulli trials at once.
Note that for a negative binomial distribution, the expected number of failures before the r-th success is r * (1 - p) / p, so the expected total number of Bernoulli trials needed to get r successes is r / p (Reference).
If we want to draw n negative binomial samples, the expected total number of Bernoulli trials is therefore n * r / p, so we want to draw at least that many Bernoulli samples. For simplicity, we can start by drawing twice that number: 2 * n * r / p. In the unlikely case that this is not enough, we can repeatedly draw twice as many again until we have enough; once the sum of the resulting vector of Bernoulli trials is at least r * n, we know we have enough successes to carve out our n negative binomial draws.
We can now run cumsum on the vector of Bernoulli trials to keep track of the cumulative number of successes. If you then perform integer division of this vector by r (with %/%), the Bernoulli trials are labelled according to which negative binomial draw they belong to. You can then table this vector.
The first n numbers of the table (obtained by subsetting the table with [1:n], or equivalently [seq(n)]) are your negative binomial draws. We remove the table's names with as.numeric, and we subtract the number of successes (i.e. r) from each count, since we are counting only the failures, not the successes. (One subtlety: the group for the first draw contains only r - 1 success trials, because the trial completing the r-th success already starts the next group; prepending a 0 to the cumulative sums pads that first group so that subtracting r is correct everywhere.)
rNB <- function(n, r, p) {
  mult <- 2
  all_samples <- 0
  # Keep drawing Bernoulli trials in bulk until there are at least n * r successes
  while (sum(all_samples) < n * r) {
    all_samples <- rBer(ceiling(mult * n * r / p), p)
    mult <- mult * 2
  }
  # Prepend a 0 so the first group, which contains only r - 1 success trials,
  # is the same size as the others before we subtract r
  as.numeric(table(c(0, cumsum(all_samples)) %/% r))[seq(n)] - r
}
So we can do:
rNB(3, 2, 0.1)
#> [1] 14 19 41
rNB(3, 2, 0.1)
#> [1] 23 6 56
rNB(3, 2, 0.1)
#> [1] 11 31 59
rNB(3, 2, 0.1)
#> [1] 7 21 14
mean(rNB(10000, 2, 0.1))
#> [1] 18.0002
We can test this against R's own rnbinom:
mean(rnbinom(10000, 2, 0.1))
#> [1] 18.0919
hist(rnbinom(10000, 2, 0.5), breaks = 0:20)
hist(rNB(10000, 2, 0.5), breaks = 0:20)
Note that the logic of your own version isn't quite right. In particular, the line while(x == 0 & I[j] != r) doesn't make sense. I is the vector 1:n, so in your example, whenever j is 2, I[j] equals r and the while loop never executes, leaving i at 1. This is why your second number is always 1. I don't know what you were trying to do here.
If you want to do it one Bernoulli trial at a time, as you are doing in your own version, try this modified function. The variable names should hopefully make it easy to follow the logic:
rNB <- function(n, r, p) {
  # Create an empty vector of length n for our results
  draws <- numeric(n)
  # Now for each of the n trials we will get a negative binomial sample:
  for (i in 1:n) {
    # Create success and failure counters for this draw
    failures <- successes <- 0
    # Now run Bernoulli trials, counting successes and failures as we go,
    # until we hit r successes
    while (successes < r) {
      if (rBer(1, p) == 1)
        successes <- successes + 1
      else
        failures <- failures + 1
    }
    # Once we have reached r successes, the current number of failures is our
    # negative binomial draw
    draws[i] <- failures
  }
  return(draws)
}
This gives identical results to the faster, albeit more opaque, vectorized version.

Why is Adam optimization unable to converge in linear regression?

I am studying the Adam optimizer. This is a toy problem. In R, I generate some artificial data:
Y = c0 + c1 * x1 + c2 * x2 + noise
In the above equation, x1, x2 and noise are normal random numbers generated in R, and theta = [c0, c1, c2] is the parameter vector I try to estimate with the Adam optimizer. For this simple regression problem I can use the analytical method to determine theta, which is the k in my R code below.
Regarding the Adam algorithm, I use the formulae from this site:
Overview: Adam
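For reference, these are the standard Adam update rules that the loop below implements (g_t is the gradient at step t):
m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
m_hat = m_t / (1 - beta1^t), v_hat = v_t / (1 - beta2^t)
theta_t = theta_{t-1} - eta * m_hat / (sqrt(v_hat) + epsilon)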
I vary the step size eta in this parametric study. The final theta from the Adam algorithm is not the same as the analytical solution k in my R code.
I checked my code many times. I ran it line by line and cannot understand why the Adam algorithm fails to converge.
Added:
I changed the algorithm to AMSGrad. It performs better than Adam in this case, but it still does not converge.
rm(list = ls())
n = 500
x1 = rnorm(n, mean = 6, sd = 1.6)
x2 = rnorm(n, mean = 4, sd = 2.5)
X = cbind(x1, x2)
A = as.matrix(cbind(intercept = rep(1, n), x1, x2))
Y = -20 + 51*x1 - 15*x2 + rnorm(n, mean = 0, sd = 2)
k = solve(t(A) %*% A, t(A) %*% Y) # k is the parameter vector determined by the analytical method
MSE = sum((A %*% k - Y)^2) / n
iterations = 4000 # total number of steps
epsilon = 0.0001  # set precision
eta = 0.04        # step size
beta1 = 0.9
beta2 = 0.999
t1 = integer(iterations)
t2 = matrix(0, iterations, 3)
t3 = integer(iterations)
epsilon1 = 1E-8 # small number defined for numerical stability
X = as.matrix(X) # convert data table X into a matrix
N = dim(X)[1]    # total number of observations
X = as.matrix(cbind(intercept = rep(1, N), X)) # add a column of ones for the intercept
np = dim(X)[2]   # number of parameters to be determined
theta = matrix(rnorm(n = np, mean = 0, sd = 2), 1, np) # initialize theta: 1 x np matrix
m_i = matrix(0, 1, np) # initialization, zero vector
v_i = matrix(0, 1, np) # initialization, zero vector
for (i in 1:iterations){
  error = theta %*% t(X) - t(Y) # error = theta * X' - Y', a 1 x N row vector
  grad = 1/N * error %*% X      # gradient, a 1 x np vector
  m_i = beta1*m_i + (1 - beta1)*grad   # moving average of the gradient, 1 x np vector
  v_i = beta2*v_i + (1 - beta2)*grad^2 # moving average of the squared gradients, 1 x np vector
  # bias-corrected moving averages
  m_corrected = m_i / (1 - beta1^i)
  v_corrected = v_i / (1 - beta2^i)
  d_theta = eta / (sqrt(v_corrected) + epsilon1) * m_corrected
  theta = theta - d_theta
  L = sqrt(sum(d_theta^2)) # the L2 norm of the update
  t1[i] = L                # record the L2 norm at each step
  if (is.infinite(L) || is.nan(L)) {
    print("Learning rate is too large. Lowering the rate may help.")
    break
  } else if (L <= epsilon) {
    print("Algorithm convergence is reached.")
    break # check whether convergence is reached
  }
  # if (i == 1){
  #   browser()
  # }
}
plot(t1, type = "l", ylab = "norm", lwd = 3, col = rgb(0, 0, 1))
k
theta

How to generate samples from MVN model?

I am trying to run some R code based on this paper here, working through example 5.1. I want to simulate the following:
My background in R isn't great, so I have the following code below; how can I generate a histogram and samples from this?
xseq<-seq(0, 100, 1)
n<-100
Z<- pnorm(xseq,0,1)
U<- pbern(xseq, 0.4, lower.tail = TRUE, log.p = FALSE)
Beta <- (-1)^U*(4*log(n)/(sqrt(n)) + abs(Z))
Some demonstrations of tools that will be of use:
rnorm(1) # generates one standard normal variable
rnorm(10) # generates 10 standard normal variables
rnorm(1, 5, 6) # generates 1 normal variable with mu = 5, sigma = 6
# not needed for this problem, but perhaps worth saying anyway
rbinom(5, 1, 0.4) # generates 5 Bernoulli variables that are 1 w/ prob. 0.4
So, to generate one instance of a beta:
n <- 100 # using the value you gave; I have no idea what n means here
u <- rbinom(1, 1, 0.4) # make one Bernoulli variable
z <- rnorm(1) # make one standard normal variable
beta <- (-1)^u * (4 * log(n) / sqrt(n) + abs(z))
But now, you'd like to do this many times for a Monte Carlo simulation. One way you might do this is by building a function, having beta be its output, and using the replicate() function, like this:
n <- 100 # putting this here because I assume it doesn't change
genbeta <- function(){ # output of this function will be one copy of beta
  u <- rbinom(1, 1, 0.4)
  z <- rnorm(1)
  return((-1)^u * (4 * log(n) / sqrt(n) + abs(z)))
}
# note that we don't need to store beta anywhere directly;
# rather, it is just the return()ed value of the function we defined
betadraws <- replicate(5000, genbeta())
hist(betadraws)
This will have the effect of making 5000 copies of your beta variable and putting them in a histogram.
There are other ways to do this -- for instance, one might just make a big matrix of the random variables and work directly with it -- but I thought this would be the clearest approach for starting out.
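For instance, a minimal sketch of that matrix-style approach, equivalent to the replicate() version above but generating all the u's and z's at once (ndraws is just an illustrative name):
n <- 100
ndraws <- 5000
u <- rbinom(ndraws, 1, 0.4) # all the Bernoulli variables at once
z <- rnorm(ndraws)          # all the standard normals at once
betadraws <- (-1)^u * (4 * log(n) / sqrt(n) + abs(z))
hist(betadraws)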
EDIT: I realized that I ignored the second equation entirely, which you probably didn't want.
We've now made a vector of beta values, and you can control the length of the vector in the first parameter of the replicate() function above. I'll leave it as 5000 in my continued example below.
To get random samples of the Y vector, you could use something like:
x <- replicate(5000, rnorm(17))
# makes a 17 x 5000 matrix of independent standard normal variables
epsilon <- rnorm(17)
# vector of 17 standard normals
y <- x %*% betadraws + epsilon
# y is now a 17 x 1 matrix (morally equivalent to a vector of length 17)
and if you wanted to get many of these, you could wrap that inside another function and replicate() it.
Alternatively, if you didn't want the Y vector, but just a single Y_i component:
x <- rnorm(5000)
# x is a vector of 5000 iid standard normal variables
epsilon <- rnorm(1)
# epsilon_i is a single standard normal variable
y <- t(x) %*% betadraws + epsilon
# t() is the transpose function; y is now a 1 x 1 matrix

Efficiently generating discrete random numbers

I want to quickly generate discrete random numbers where I have a known CDF. Essentially, the algorithm is:
Construct the CDF vector (an increasing vector starting at 0 and ending at 1), cdf
Generate a uniform(0, 1) random number u
If u < cdf[1] choose 1
else if u < cdf[2] choose 2
else if u < cdf[3] choose 3
...
Example
First generate a cdf:
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
Next generate N uniform random numbers:
N = 1000
u = runif(N)
Now sample the value:
##With some experimenting this seemed to be very quick
##However, with N = 100000 we run out of memory
##N = 10^6 would be a reasonable maximum to cope with
colSums(sapply(u, ">", cdf))
If you know the probability mass function (which you do, if you know the cumulative distribution function), you can use R's built-in sample function, where you can define the probabilities of discrete events with argument prob.
cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)
system.time(sample(size=1e6,x=1:10000,prob=c(cdf[1],diff(cdf)),replace=TRUE))
user system elapsed
0.01 0.00 0.02
How about using cut:
N <- 1e6
u <- runif(N)
system.time(as.numeric(cut(u,cdf)))
user system elapsed
1.03 0.03 1.07
head(table(as.numeric(cut(u,cdf))))
1 2 3 4 5 6
51 95 165 172 148 75
If you have a finite number of possible values then you can use findInterval or cut, or better sample, as mentioned by @Hemmo.
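For the finite case, a minimal findInterval sketch (using the cdf and u from above; findInterval(u, cdf) counts how many cdf values are <= each u, so adding 1 gives the chosen category):
N <- 1e6
u <- runif(N)
x <- findInterval(u, cdf) + 1 # chooses k whenever cdf[k-1] <= u < cdf[k]
head(table(x))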
However, if you want to generate data from a distribution that theoretically goes to infinity (like the geometric, negative binomial, Poisson, etc.), then here is an algorithm that will work (it also works with a finite number of values if wanted):
Start with your vector of uniform values and loop through the distribution's probability values, subtracting them from the vector of uniforms; for each uniform, the random value is the iteration at which its running value goes negative. This is easier to see with an example. The following generates values from a Poisson with mean 5 (replace the dpois call with your calculated values) and compares the result to using the inverse CDF (which is more efficient in this case, where it exists).
i <- 0
tmp <- tmp2 <- runif(10000)
randvals <- rep(0, length(tmp))
while (any(tmp > 0)) {
  tmp <- tmp - dpois(i, 5)
  randvals <- randvals + (tmp > 0)
  i <- i + 1
}
randvals2 <- qpois(tmp2, 5)
all.equal(randvals, randvals2)
