When I want to generate a random number with runif() within a specific interval while excluding a particular value (e.g. 0.5), I can write a function ex.runif() that does the job, but it is hundreds of times slower than plain runif(). Could anyone point me to a better solution?
ex.runif <- function(n, excl, min, max) {
  # ex.runif() excludes the specific value 'excl':
  # redraw as long as any element equals 'excl'
  q <- rep(excl, n)
  while (any(q == excl)) {
    q <- runif(n, min = min, max = max)
  }
  return(q)
}
set.seed(42)
ex.runif(1, .5, .25, .75) # exclude .5, interval [.25, .75]
# [1] 0.707403
library(microbenchmark)
microbenchmark(ex.runif(1, .5, .25, .75), runif(1, min = .25, max = .75))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# ex.runif 692.439 704.685 721.51135 715.2735 722.9275 962.373 100 b
# runif 2.041 2.551 3.49044 2.8070 3.3170 21.176 100 a
If the set of values that you want to exclude is finite, then, in most cases, there is no need for a function like that. The reason is that the uniform distribution is continuous, so any single value (indeed, any finite set of values) is drawn with probability zero. That is, q == excl is, in terms of probability theory, true with probability zero.
For instance,
set.seed(42)
ex.runif(5, .5, .25, .75)
# [1] 0.7074030 0.7185377 0.3930698 0.6652238 0.5708728
set.seed(42)
runif(5, 0.25, 0.75)
# [1] 0.7074030 0.7185377 0.3930698 0.6652238 0.5708728
The same is most likely going to happen under any other seed as well. Thus, you may just keep using runif.
@duckmayr makes a good point about numeric precision. In fact, as the interval [min, max] gets narrower, q == excl becomes true with increasingly high probability and, in some applications, that may even become relevant.
However, if in theory you only need to exclude the single point 0.5, then a check like q == excl might even do harm by throwing away draws unnecessarily.
For instance, on my machine .Machine$double.eps is 2.220446e-16. The probability of getting a draw from [0.5 - .Machine$double.eps / 4, 0.5 + .Machine$double.eps / 4], and hence reaching a false conclusion, when [min, max] is [0.5 - 10^(-k), 0.5 + 10^(-k)], is 2 * (2.220446e-16 / 4) / (2 * 10^(-k)), or around 0.55 * 10^(k - 16).
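To put a number on that, here is a small sketch of the formula above (my own illustration, not part of the original answer):
# probability that a draw from [0.5 - 10^(-k), 0.5 + 10^(-k)] lands within
# .Machine$double.eps / 4 of 0.5
collision_prob <- function(k) {
  eps <- .Machine$double.eps
  (eps / 2) / (2 * 10^(-k))
}
collision_prob(c(1, 8, 14))
# roughly 5.6e-16, 5.6e-09 and 5.6e-03 -- negligible unless the interval is extremely narrow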
I am comparing these two forms of drawing random numbers from a beta and a Gaussian distribution. What are their differences? Why are they different?
The first way (_1) simulates from a Uniform(0,1) and then applies the inverse CDF of the Beta (Normal) distribution on those uniform draws to get draws from the Beta (Normal) distribution.
The second way (_2) uses the distribution's built-in random number generator (rbeta / rnorm) directly.
Beta Distribution
set.seed(1)
beta_1 <- qbeta(runif(1000,0,1), 2, 5)
set.seed(1)
beta_2 <- rbeta(1000, 2,5)
> summary(beta_1); summary(beta_2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.009481 0.164551 0.257283 0.286655 0.387597 0.895144
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.006497 0.158083 0.261649 0.284843 0.396099 0.841760
Here every number is different.
Normal distribution
set.seed(1)
norm_1 <- qnorm(runif(1000, 0,1), 0, 0.1)
set.seed(1)
norm_2 <- rnorm(1000, 0, 0.1)
> summary(norm_1); summary(norm_2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.3008048 -0.0649125 -0.0041975 0.0009382 0.0664868 0.3810274
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.300805 -0.069737 -0.003532 -0.001165 0.068843 0.381028
Here the summaries are almost the same (the min and max match), with small differences in the mean and median.
Shouldn't they all be equal, since I am generating random numbers from distributions with the same parameters and the same seed?
I think your question boils down to an assumption about the random number generator. If rnorm consumed the uniform stream in exactly the same way as your qnorm(runif(...)) call, then your expectation would hold. It does not: the normal generator is configured separately (see the normal.kind argument in ?RNGkind) and does not use the uniform draws one-for-one, so the two streams never line up draw for draw. Without that exact match, you are left with the statistical tests below:
Is the mean of norm_1 different from the mean of norm_2?
t.test(x = norm_1, y = norm_2)
p-value > 0.05 indicates there is insufficient evidence to reject the null hypothesis that the means are equal at the 0.05 type I error level
Are the distributions different?
ks.test(x = norm_1, y = norm_2)
p-value > 0.05 indicates there is insufficient evidence to reject the null hypothesis that the distributions are equal at the 0.05 type I error level
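The same checks can be applied to the beta draws from the question (my own addition, reusing the beta_1 and beta_2 vectors defined above):
t.test(x = beta_1, y = beta_2)   # compare the means
ks.test(x = beta_1, y = beta_2)  # compare the distributions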
I tried to sample a Bernoulli variable at home in two different ways.
I flip a coin and assign 1 to heads and 0 to tails.
I roll a six-sided die and assign the result 1 to a roll of the 3 highest numbers and the result 0 to a roll of the 3 lowest numbers.
I did this only twenty times instead of a thousand times, but the principle is the same. I got the following results:
           result 0   result 1
Method 1         11          9
Method 2          8         12
Q: Why did I not get the same result for both methods?
A: Well, it is of course because they are samples and are supposed to vary every time.
Even if I could reset some random seed to remove the variability, it still wouldn't matter, because they are different methods.
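A quick sketch of the same experiment in R (my own addition; it matches the twenty trials above only in spirit, not in the exact counts):
set.seed(1)
coin <- rbinom(20, size = 1, prob = 0.5)                  # method 1: coin flips
die <- as.integer(sample(1:6, 20, replace = TRUE) >= 4)   # method 2: die rolls, 4-6 count as 1
table(coin)
table(die)
# both are fair Bernoulli(0.5) samples, yet the realised counts differ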
Why is there no use of inverse transform sampling?
The normal generator actually does use inverse transform sampling. The following two commands both return the same value, 0.3735462:
set.seed(1)
rnorm(1,1,1)
set.seed(1)
qnorm(runif(1),1,1)
rbeta also uses inverse transform sampling here, and the following returns 0.7344913 and 0.2655087, which differ only through the relationship Y = 1 - X (so internally the draw is reflected):
alpha = 1
beta = 1
set.seed(1)
rbeta(1,alpha,beta)
set.seed(1)
qbeta(runif(1),alpha,beta)
The beta generator behaves differently when $\alpha$ and $\beta$ are not both equal to one. This is because inverse sampling is not very efficient in that case, and rbeta switches to an algorithm that creates the sample in a different way. Below is code with the algorithm for the case $\min(\alpha,\beta) \leq 1$.
See for more about the algorithm: Hung, Ying-Chao, Narayanaswamy Balakrishnan, and Yi-Te Lin. "Evaluation of beta generation algorithms." Communications in Statistics-Simulation and Computation 38.4 (2009): 750-770.
You can see a few points that are calculated differently. The algorithm has a few steps where it starts redrawing random numbers, and it does this because redrawing numbers is easier than computing the inverse transform for a difficult case.
alpha = 0.9
beta = 0.9
#### Cheng's BC algorithm
### used if min(alpha,beta)<=1
### initialize
set.seed(1)
p = min(alpha,beta)
q = max(alpha,beta)
a = p+q
b = p^-1
delta = 1+q-p
k1 = delta*(0.0138889+0.0416667*p)/(q*b-0.777778)
k2 = 0.25 + (0.5+0.25/delta)*p
sample = function() {
  ### Perform steps of the algorithm in a loop
  step = 1
  while (step < 6) {
    if (step == 1) {
      U1 = runif(1)
      U2 = runif(1)
      if (U1 < 0.5) {
        step = 2
      } else {
        step = 3
      }
    }
    if (step == 2) {
      Y = U1 * U2
      Z = U1 * Y
      if (0.25 * U2 + Z - Y >= k1) {
        step = 1
      } else {
        step = 5
      }
    }
    if (step == 3) {
      Z = U1^2 * U2
      if (Z > 0.25) {
        step = 4
      } else {
        V = b * log(U1 / (1 - U1))
        W = q * exp(V)
        step = 6
      }
    }
    if (step == 4) {
      if (Z < k2) {
        step = 5
      } else {
        step = 1
      }
    }
    if (step == 5) {
      V = b * log(U1 / (1 - U1))
      W = q * exp(V)
      if (a * (log(a / (p + W)) + V) - 1.3862944 < log(Z)) {
        step = 1
      } else {
        step = 6
      }
    }
  }
  if (q == alpha) {
    X = W / (p + W)
  } else {
    X = p / (p + W)
  }
  return(X)
}
sample()
n = 20
beta_orig = sapply(1:n, function(x) {
  set.seed(x)
  rbeta(1, alpha, beta)
})
beta_quantile = sapply(1:n, function(x) {
  set.seed(x)
  qbeta(runif(1), alpha, beta)
})
beta_BC = sapply(1:n, function(x) {
  set.seed(x)
  sample()
})
plot(beta_orig,beta_BC, pch = 1, xlim = c(0,1), ylim = c(0,1))
points(beta_orig,beta_quantile, col = 2, pch = 3)
legend(0.3,1, c("rbeta compared to inverse transform sampling", "rbeta compared to manual"), pch=c(3,1), col = c(2,1), cex = 0.85)
Some weird effect
In the code above I was resetting the random seed for each computation. When you generate several numbers at once, only the first one agrees with the inverse transform; the later draws diverge.
The following code
set.seed(1)
rnorm(6,1,1)
set.seed(1)
qnorm(runif(6),1,1)
set.seed(2)
rnorm(6,1,1)
set.seed(2)
qnorm(runif(6),1,1)
returns
[1] 0.3735462 1.1836433 0.1643714 2.5952808 1.3295078 0.1795316
[1] 0.3735462 0.6737666 1.1836433 2.3297993 0.1643714 2.2724293
[1] 0.1030855 1.1848492 2.5878453 -0.1303757 0.9197482 1.1324203
[1] 0.10308546 1.53124079 1.18484918 0.03810797 2.58784531 2.58463150
What you see here is that the rnorm function skips every second uniform draw. The reason is that it consumes two uniform numbers per normal draw in order to get more precision.
See these lines in the source code of the norm_rand() function that R uses: https://svn.r-project.org/R/trunk/src/nmath/snorm.c
#define BIG 134217728 /* 2^27 */
/* unif_rand() alone is not of high enough precision */
u1 = unif_rand();
u1 = (int)(BIG*u1) + unif_rand();
return qnorm5(u1/BIG, 0.0, 1.0, 1, 0);
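You can check this at the R level with a small sketch (my own addition; it mimics the C code above and should reproduce the first rnorm draw):
set.seed(1)
rnorm(1)                                  # -0.6264538
set.seed(1)
BIG <- 134217728                          # 2^27, as in snorm.c
u <- runif(2)
qnorm((floor(BIG * u[1]) + u[2]) / BIG)   # should give -0.6264538 as well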
I want to efficiently solve a degree-7 polynomial in k.
For example, with the following set of 7 unconditional probabilities,
p <- c(0.0496772, 0.04584501, 0.04210299, 0.04026439, 0.03844668, 0.03487194, 0.03137491)
the overall event probability is approximately 25% :
> 1 - prod(1 - p)
[1] 0.2506676
And if I want to find a constant k that proportionally scales all elements of p so that the overall event probability becomes approximately 30%, I can do this using an equation solver (such as Wolfram Alpha), which may use Newton's method or bisection to approximate k in
$1 - \prod_{i=1}^{7} (1 - k\,p_i) = 0.30$
Here, k is approximately 1.23:
> 1 - prod(1 - 1.23*p)
[1] 0.3000173
But what if I want to solve this for many different overall event probabilities? How can I do this efficiently in R?
I've looked at the function SMfzero in the package NLRoot, but it's still not clear to me how I can achieve it.
EDIT
I've benchmarked the solutions so far. On the toy data p above:
Unit: nanoseconds
expr min lq mean median uq max neval
approximation_fun 800 1700 3306.7 3100 4400 39500 1000
polynom_fun 1583800 1748600 2067028.6 1846300 2036300 16332600 1000
polyroot_fun 596800 658300 863454.2 716250 792100 44709000 1000
bsoln_fun 48800 59800 87029.6 85100 102350 613300 1000
find_k_fun 48500 60700 86657.4 85250 103050 262600 1000
NB, I'm not sure if it's fair to compare approximation_fun with the others, but I did ask for an approximate solution, so it does meet the brief.
The real problem is a degree-52 polynomial in k. Benchmarking on the real data:
Unit: microseconds
expr min lq mean median uq max neval
approximation_fun 1.9 3.20 7.8745 5.50 14.50 55.5 1000
polynom_fun 10177.2 10965.20 12542.4195 11268.45 12149.95 80230.9 1000
bsoln_fun 52.3 60.95 91.4209 71.80 117.75 295.6 1000
find_k_fun 55.0 62.80 90.1710 73.10 118.40 358.2 1000
This can be solved with the polynom library.
library(polynom)
library(purrr)
p <- runif(3, 0, 1)
p
#> [1] 0.1072518 0.5781922 0.3877427
# Overall probability
1 - prod(1 - p)
#> [1] 0.7694434
# Target overall probability
target_op <- 0.3
# calculate polynomial to solve for k
poly_list <- p %>%
  map(~ polynomial(c(1, -.))) %>%
  as.polylist()
# List of linear polynomials to be multiplied:
poly_list
#> [[1]]
#> 1 - 0.1072518*x
#>
#> [[2]]
#> 1 - 0.5781922*x
#>
#> [[3]]
#> 1 - 0.3877427*x
# we want to solve this polynomial
poly <- 1 - prod(poly_list) - target_op
poly
#> -0.3 + 1.073187*x - 0.3277881*x^2 + 0.02404476*x^3
roots <- solve(poly)
good_roots <-
  roots %>%
  # keep only real values
  keep(~ Im(.) == 0) %>%
  Re() %>%
  # keep only positive values
  keep(~ . > 0)
good_roots
#> [1] 0.1448852
k <- good_roots[[1]]
1 - prod(1 - k*p)
#> [1] 0.3
Created on 2021-04-28 by the reprex package (v1.0.0)
Following @IaroslavDomin's solutions, but constructing the coefficients for this particular case by hand, then using polyroot():
Here's a sequence of three functions (compute individual coeffs, put them together into a vector, find positive real roots):
## construct the i-th elementary symmetric polynomial of p:
## the sum of the products of all i-element combinations
bcoef <- function(p, i) {
  sum(apply(combn(p, i), 2, prod))
}
## compute all of those coefficients and put them together into
## the coefficient vector of 1 - prod(1 - k*p) - target
mypoly <- function(p, target = 0.3) {
  c(-target, -1 * sapply(seq_along(p), bcoef, p = -p))
}
## compute real positive solutions
soln <- function(p, target = 0.3) {
  roots <- polyroot(mypoly(p, target))
  roots <- Re(roots[abs(Im(roots)) < 1e-16])
  roots <- roots[roots > 0]
  if (length(roots) > 1) warning(">1 solution")
  return(roots)
}
Try it out for a couple of cases:
p1 <- c(0.1072518,0.5781922, 0.3877427)
s1 <- soln(p1)
1-prod(1-s1*p1)
p2 <- c(0.0496772, 0.04584501, 0.04210299, 0.04026439, 0.03844668, 0.03487194, 0.03137491)
s2 <- soln(p2)
1-prod(1-s2*p2)
If you don't want to be clever, then brute force is perfectly adequate (56 microseconds on my machine when length(p) is 52):
bsoln <- function(p, target = 0.3) {
  f <- function(k) { (1 - prod(1 - k*p)) - target }
  return(uniroot(f, c(0, 20))$root)
}
For a quick approximate solution, note that for small probabilities $1 - \prod_i (1 - k p_i) \approx 1 - \exp(-k \sum_i p_i)$, which gives a closed form for k:
asoln <- function(p, target = 0.3) {
  return(-log(1 - target) / sum(p))
}
I started to run benchmarks and gave up; I don't like the format of microbenchmark output and the approximate solution is too fast for rbenchmark::benchmark() to time accurately. In any case, one run of bsoln() with length(p)==52 takes on the order of 50 microseconds, so you're going to have to run this a whole bunch of times before speed becomes problematic ...
Another option would be to just search for a root on a segment without specifically calculating polynomial coefficients. This can be done e.g. with the uniroot function.
The only not-so-trivial thing we need to do here is to specify the search interval. k is obviously >= 0, so that is the left endpoint. We also know that all the k*p values should be probabilities, hence <= 1; therefore k <= 1/max(p), which gives the right endpoint.
And so the code is:
find_k <- function(p, target_op) {
  f <- function(x) 1 - prod(1 - x*p) - target_op
  max_k <- 1/max(p)
  res <- uniroot(f, c(0, max_k))
  res$root
}
p <- runif(1000, 0, 1)
target_op <- 0.3
k <- find_k(p, target_op)
k
#> [1] 0.000710281
1 - prod(1 - k*p)
#> [1] 0.2985806
Created on 2021-04-29 by the reprex package (v1.0.0)
This works pretty fast even for 1000 probabilities.
I am trying to create boxplots where the medians of my variables are aligned at 0. Their ranges do not have to be fixed at [-1, 1], but I would like their min and max to fall within this range. Is there an operation that I could use?
I was able to normalize my variables to [-1, 1], but I am aiming at having the medians aligned at 0 and the ranges merely falling within (not being fixed at) [-1, 1].
Here's a function to do that. It finds which extreme is farther from the median and then uses the median and the max distance from the median to scale everything into a range between -1 and 1, with median at the center. This will break if the data has no range (ie min = median = max), as that will result in an infinite rescaling factor, but I'm not sure what the expected behavior should be in that case.
rescale_center_median <- function(my_numbers) {
  my_median = median(my_numbers, na.rm = TRUE)
  my_range = range(my_numbers, na.rm = TRUE)
  scale_factor = max(abs(my_range - my_median))
  (my_numbers - my_median) / scale_factor
}
Testing:
set.seed(42)
scaled_numbers <- rescale_center_median(rnorm(10))
scaled_numbers
# [1] 0.60393025 -0.58015650 -0.01258313 0.15241963 0.01258313 -0.29963620 0.68991628
# [8] -0.29262249 1.00000000 -0.27308102
median(scaled_numbers)
# [1] 0
range(scaled_numbers)
# [1] -0.4922334 1.0000000
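A usage sketch for the boxplot goal from the question (my own addition, with a made-up data frame):
set.seed(42)
dat <- data.frame(a = rnorm(100, 5, 2), b = runif(100, 0, 10), c = rexp(100))
scaled <- as.data.frame(lapply(dat, rescale_center_median))
boxplot(scaled)        # every median sits at 0, all values fall within [-1, 1]
abline(h = 0, lty = 2)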
This may be a very stupid question, but does anyone know why I am not getting the mean (100) in the second case?
#beta=4, alpha=5, mean=20
qgamma(0.5, 5, 1/4)
# 18.68364
#beta=2500, alpha=0.04, mean=100
qgamma(0.5,0.04,1/2500)
# 0.00004320412
It is because you are using the quantile function: qgamma(0.5, ...) returns the 50% quantile, i.e. the median of the distribution, not the mean you are expecting.
See the example below;
x <- rgamma(50000, shape = 0.04, scale = 2500)
mean(x)
# [1] 98.82911
median(x)
# [1] 3.700012e-05
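For a distribution this skewed (shape = 0.04), the mean and the median are wildly different. A quick check against the theoretical values (my own addition):
shape <- 0.04
scale <- 2500
shape * scale                       # theoretical mean: 100
qgamma(0.5, shape, scale = scale)   # theoretical median: ~4.3e-05, matching the question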
So I'm using the Monte Carlo method to evaluate definite integrals of a bunch of functions.
To start with,
y = x ^ (-0.5) ; for x in [0.01,1]
for which, my code in R looks like this
s <- NULL
m <- 100
a <- 0.01
b <- 1
set.seed(5)
x <- runif(m, a, b)
y <- runif(m, 0, 1)
for (i in 1:m) {
  if (y[i] < (x[i]^(-0.5))) {
    s[i] <- 1
  } else {
    s[i] <- 0
  }
}
nn <- sum(s == 1)*(b - a)/m
print(nn)
Answer (nn) : 0.99
Actual answer: 1.8
I cannot figure out where I'm going wrong with this. Have I done something wrong?
For x in (0, 1), x^(-0.5) is always greater than 1, so the comparison y < x^(-0.5) with y drawn from U(0, 1) is always true and you end up with a vector of all 1s.
The rectangle you're using is too short (a height of 1). It should be 10 tall, since 0.01^(-0.5) = 10 is the maximum value of the function on [0.01, 1].
Then you take the total area of the rectangle and multiply it by the average of s, so the revised code looks like this:
s <- NULL
m <- 100
a <- 0.01
b <- 1
set.seed(5)
x <- runif(m, a, b)
y <- 10*runif(m, 0, 1)
for (i in 1:m) {
  if (y[i] < (x[i]^(-0.5))) {
    s[i] <- 1
  } else {
    s[i] <- 0
  }
}
nn <- sum(s)*(b - a)/m*10  # note the extra factor of 10, the height of the rectangle
print(nn)
I got a result of 1.683, which is a lot closer to the real answer.
Edit: made a superfluous multiplication, answer revised slightly
As user1362215 points out, your function should be contained in the rectangle. You get closer to the solution if you increase n. Here is a vectorised solution; the results land close to the true value of 1.8.
# Hit and miss
f <- function(x) x ^ (-0.5)
n <- 1000000
a <- 0.01
b <- 1
#ceiling(max(f((seq(0.01,1,by=0.001)))))
#[1] 10
set.seed(5)
x <- runif(n,a,b)
y <- 10*runif(n,0,1)
R <- sum(y < f(x))/n
(b-a)*10*R
#[1] 1.805701
# Repeat a few times to look at the distribution
set.seed(5)
n <- 100000
r <- replicate(1000,sum(10*runif(n,0,1) < f(runif(n,a,b)))/n *(b-a)*10)
hist(r)
summary(r)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.755 1.792 1.800 1.800 1.809 1.845
# Sample mean method for comparison
set.seed(5)
r <- replicate(1000, mean(f(runif(n, a,b)))*(b-a))
hist(r)
summary(r)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.788 1.798 1.800 1.800 1.803 1.813
Re your edit: I am assuming that by x^2 + y^2 on [-1, 1] you are referring to a circle rather than a function f(z), so the task is really to estimate the area of the unit circle (i.e. Pi) by simulation.
f2 <- function(x) sqrt(1-x^2)
s <- seq(-1 , 1 ,by=0.001)
plot(s,f2(s))
# Get the max value of function within the range
c <- ceiling(max(f2(s)))
# [1] 1
n <- 1000000
a <- -1
b <- 1
set.seed(5)
x <- runif(n,a,b)
y <- c*runif(n,0,1)
R <- sum(y < f2(x))/n
(b-a)*c*R
#[1] 1.57063 # multiply it by 2 to get full area
pi/2
#[1] 1.570796
A Monte Carlo alternative to acceptance/rejection is to uniformly generate x values, average the resulting y = f(x) values to estimate the average height, and multiply that by the interval length to get the estimated area. I don't know R well enough, so here it is in Ruby to illustrate the algorithm:
def f(x)
  x ** -0.5
end
sum = 0.0
10000.times { sum += f(0.01 + 0.99 * rand) }
print (1.0 - 0.01) * (sum / 10000)
I'm getting results in the range 1.8 +/- 0.02
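For completeness, roughly the same algorithm in R (my own translation of the Ruby sketch above):
set.seed(5)
n <- 10000
x <- runif(n, 0.01, 1)
(1 - 0.01) * mean(x^(-0.5))   # should land around 1.8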
You can also improve the precision of your estimator by using antithetic random variates: for each x you generate, also use the value mirrored about the midpoint of the sampling interval (for a U(a, b) draw, pair x with a + b - x); a sketch of that follows at the end of this answer.
Using @user20650's code for guidance for how to do this in R, you can estimate Pi / 2 with the sample-mean approach as follows:
f <- function(x) sqrt(1-x^2)
n <- 100000
a <- -1
b <- 1
range <- b-a
set.seed(5)
r <- replicate(1000, mean(f(runif(n,a,b))) * range)
hist(r)
summary(r)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.566 1.570 1.571 1.571 1.572 1.575
No bounding function is needed for this approach, and generally it yields greater precision than the acceptance/rejection approach.
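Finally, here is a hedged sketch of the antithetic-variate idea mentioned above (my own addition). It uses the original x^(-0.5) integral, where f is monotone so the mirrored draws are negatively correlated; for the half-circle function f(-x) = f(x), so antithetics would not help there.
f <- function(x) x^(-0.5)
a <- 0.01
b <- 1
n <- 50000   # half the draws, so each replicate still evaluates f 100000 times
set.seed(5)
r_plain <- replicate(1000, mean(f(runif(2 * n, a, b))) * (b - a))
set.seed(5)
r_anti <- replicate(1000, {
  x <- runif(n, a, b)
  mean((f(x) + f(a + b - x)) / 2) * (b - a)
})
c(sd_plain = sd(r_plain), sd_anti = sd(r_anti))  # the antithetic sd should be smaller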