Drawing from two distributions with a probability in R

I am trying to draw 100,000 values, each from one of two different distributions chosen with a given probability. Unfortunately I can't see what is wrong with my for loop: it only adds 1 value to simulated_data instead of the desired 100,000 values.
Question 1: How can I fix this?
Question 2: Is there a far more efficient method where I don't have to loop through 100,000 items in a list?
# creating a vector of probabilities
probabilities <- rep(0.99, 100000)
# creating a vector of booleans
logicals <- runif(length(probabilities)) < probabilities
# empty vector for my simulated data
simulated_data <- c()
# drawing from two different distributions depending on the value in logicals
for (i in logicals) {
  if (isTRUE(i)) {
    simulated_data[i] <- rnorm(n = 1, mean = 0, sd = 1)
  } else {
    simulated_data[i] <- rnorm(n = 1, mean = 0, sd = 10)
  }
}

It seems that you want to create a final sample where each element is taken randomly from either sample1 or sample2, with probabilities 0.99 and 0.01.
One approach is to generate both samples, each containing the same number of elements, and then select randomly from one or the other:
# Generate both samples
n = 100000
sample1 = rnorm(n,0,1)
sample2 = rnorm(n,0,10)
# Create the logical vector that will decide whether to take from sample 1 or 2
s1_s2 = runif(n) < 0.99
# Create the final sample
sample = ifelse(s1_s2, sample1, sample2)
In this case, it is not guaranteed that there are exactly 0.99*n samples from sample1 and 0.01*n from sample2. In fact:
> sum(sample == sample1)
[1] 98953
This is close to 0.99*n, as expected, but not exactly.
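As for Question 1, the bug in your loop: for (i in logicals) iterates over the values TRUE/FALSE themselves, so simulated_data[i] only ever writes to position 1 (when i is TRUE) or nowhere at all (when i is FALSE). A minimal fix is to loop over indices instead (a sketch; the vectorized approach above is still preferable):
for (i in seq_along(logicals)) {
  if (logicals[i]) {
    simulated_data[i] <- rnorm(n = 1, mean = 0, sd = 1)
  } else {
    simulated_data[i] <- rnorm(n = 1, mean = 0, sd = 10)
  }
}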

Create a vector with the desired fraction of values from each distribution and then create a random permutation of the values:
N = 10000
frac = 0.99
rand_mix = sample(c(rnorm(frac * N, 0, sd = 1), rnorm((1 - frac) * N, 0, sd = 10)))
> table( abs(rand_mix) > 1.96 )
FALSE  TRUE
 9364   636
> (10000 - 636) / 10000
[1] 0.9364
For comparison, a pure N(0, 1) sample essentially never exceeds 6, so any such extreme values in the mixture must come from the sd = 10 component:
> table( rnorm(10000) > 6 )
FALSE
10000
The fraction is fixed. If you want a possibly random fraction (but statistically close to 0.99), then try this:
> table( sample( c( rnorm(10e6), rnorm(10e4, sd=10) ), 10e4) > 1.96 )
FALSE TRUE
97151 2849
Compare with:
> N = 100000
> frac = 0.99
> rand_mix = sample(c(rnorm(frac * N, 0, sd = 1), rnorm((1 - frac) * N, 0, sd = 10)))
> table( rand_mix > 1.96 )
FALSE TRUE
97117 2883

Here is another concise solution along the same fixed-fraction lines:
n <- 100000
prob1 <- 0.99
prob2 <- 1-prob1
dist1 <- rnorm(prob1*n, 0, 1)
dist2 <- rnorm(prob2*n, 0, 10)
actual_sample <- c(dist1, dist2)
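Note that c(dist1, dist2) keeps the two blocks in order, with all the sd = 10 draws at the end; if the ordering matters for your use case, permute the result:
actual_sample <- sample(actual_sample)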

Related

Creating multiple pseudo-random matrices at the same time in R, and comparing the signs at matching positions?

I can make one pseudo-random matrix with the following:
nc = 14
nr = 14
set.seed(111)
M = matrix(sample(
  c(runif(58, min = -1, max = 0), runif(71, min = 0, max = 0),
    runif(nr * nc - 129, min = 0, max = +1))), nrow = nr, nc = nc)
The more important question: I need 1000 matrices with the same counts of negative, positive, and zero values; only the locations within each matrix should vary.
I can make the matrices one by one, but I want to do this task faster.
The less important question: once I have the 1000 matrices, I need to identify, for every position, how many positive, negative, or zero values occurred there. For example:
MATRIX_A
     [,1]
[9,] -0.2
MATRIX_B
     [,1]
[9,] -0.5
MATRIX_C
     [,1]
[9,]  0.1
MATRIX_D
     [,1]
[9,]  0.0
MATRIX_E
     [,1]
[9,]  0.9
What I need:
FINAL_MATRIX_positive
     [,1]
[9,] (2/5 * 100) = 40% or 0.4 or 2
because at this position 2 of the 5 matrices had a positive value; I also need this for negative and zero values.
If it isn't possible to do this in R, I can compare them "manually" in Excel.
Thank you for your help!
Actually you are almost there!
You can try the code below, where replicate generates the random matrix N = 1000 times and Reduce collects the statistics for each position:
nc <- 14
nr <- 14
N <- 1000
lst <- replicate(
  N,
  matrix(sample(
    c(
      runif(58, min = -1, max = 0),
      runif(71, min = 0, max = 0),
      runif(nr * nc - 129, min = 0, max = +1)
    )
  ), nrow = nr, nc = nc),
  simplify = FALSE
)
pos <- Reduce(`+`, lapply(lst, function(M) M > 0)) / N
neg <- Reduce(`+`, lapply(lst, function(M) M < 0)) / N
zero <- Reduce(`+`, lapply(lst, function(M) M == 0)) / N
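As a quick sanity check (a small sketch using the objects defined above): every entry is positive, negative, or zero, so the three proportion matrices should sum to 1 at each position.
# the three proportions partition the outcomes, so they must sum to 1 everywhere
all(abs(pos + neg + zero - 1) < 1e-12)
# proportion of positive values at row 9, column 1, as in the example
pos[9, 1]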
I use a function for your simulation scheme:
my_sim <- function(n_neg = 58, n_0 = 71, n_pos = 67) {
  res <- c(runif(n_neg, min = -1, max = 0),
           rep(0, n_0),
           runif(n_pos, min = 0, max = +1))
  return(sample(res))
}
Then, I simulate your matrices (I store them in a list):
N <- 1000
nr <- 14
nc <- nr
set.seed(111)
my_matrices <- list()
for (i in 1:N) {
  my_matrices[[i]] <- matrix(my_sim(), nrow = nr, ncol = nc)
}
Finally, I compute the proportion of positive numbers for the position row 1 and column 9:
sum(sapply(my_matrices, function(x) x[1,9]) > 0)/N
# [1] 0.366
However, if you are interested in all the positions, these lines will do the job:
aux <- lapply(my_matrices, function(x) x > 0)
FINAL_MATRIX_positive <- 0
for (i in 1:N) {
  FINAL_MATRIX_positive <- FINAL_MATRIX_positive + aux[[i]]
}
FINAL_MATRIX_positive <- FINAL_MATRIX_positive / N
# row 1, column 9
FINAL_MATRIX_positive[1, 9]
# [1] 0.366
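A quick check of the construction: since my_sim() always draws exactly 58 negative, 71 zero, and 67 positive values, the sign counts of every simulated matrix are fixed by design.
table(sign(my_matrices[[1]]))
# expected counts:
# -1   0   1
# 58  71  67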

Generate random numbers in R satisfying constraints

I need help with code to generate random numbers subject to constraints.
Specifically, I am trying to simulate random numbers ALFA and BETA from, respectively, a Normal and a Gamma distribution such that ALFA - BETA < 1.
Here is what I have written but it does not work at all.
set.seed(42)
n <- 0
repeat {
  n <- n + 1
  a <- rnorm(1, 10, 2)
  b <- rgamma(1, 8, 1)
  d <- a - b
  if (d < 1)
    alfa[n] <- a
  beta[n] <- b
  l = length(alfa)
  if (l == 10000) break
}
Due to vectorization, it will be faster to generate the numbers "all at once" rather than in a loop:
set.seed(42)
N = 1e5
a = rnorm(N, 10, 2)
b = rgamma(N, 8, 1)
d = a - b
alfa = a[d < 1]
beta = b[d < 1]
length(alfa)
# [1] 36436
This generated 100,000 candidates, 36,436 of which met your criterion. If you want exactly n samples, try setting N = 4 * n; you'll almost certainly generate more than enough, and you can keep the first n.
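For instance, a sketch of that keep-the-first-n idea (n = 10000 here is just an illustration):
n <- 10000
N <- 4 * n
a <- rnorm(N, 10, 2)
b <- rgamma(N, 8, 1)
keep <- which(a - b < 1)[1:n]  # indices of the first n successes
alfa <- a[keep]
beta <- b[keep]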
Your loop has 2 problems: (a) you need curly braces to enclose multiple lines after an if statement. (b) you are using n as an attempt counter, but it should be a success counter. As written, your loop will only stop if the 10000th attempt is a success. Move n <- n + 1 inside the if statement to fix:
set.seed(42)
n <- 0
alfa <- numeric(0)
beta <- numeric(0)
repeat {
  a <- rnorm(1, 10, 2)
  b <- rgamma(1, 8, 1)
  d <- a - b
  if (d < 1) {
    n <- n + 1
    alfa[n] <- a
    beta[n] <- b
    l <- length(alfa)
    if (l == 500) break
  }
}
But the first way is better: because alfa and beta are "grown" inside the loop and the numbers are generated one at a time, this method takes longer to produce 500 numbers than the vectorized code above takes to produce 30,000.
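If you do need a loop, preallocating the output vectors avoids the cost of growing them on every success (a sketch under the same setup):
set.seed(42)
n_target <- 500
alfa <- numeric(n_target)  # preallocated, not grown
beta <- numeric(n_target)
n <- 0
while (n < n_target) {
  a <- rnorm(1, 10, 2)
  b <- rgamma(1, 8, 1)
  if (a - b < 1) {
    n <- n + 1
    alfa[n] <- a
    beta[n] <- b
  }
}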
As commented by @Gregor Thomas, the failure of your attempt is due to the missing curly braces around the body of the if statement. If you would like to skip {} in the if control flow, you can try the code below:
set.seed(42)
r <- list()
repeat {
  a <- rnorm(1, 10, 2)
  b <- rgamma(1, 8, 1)
  d <- a - b
  if (d < 1) r[[length(r) + 1]] <- cbind(alfa = a, beta = b)
  if (length(r) == 100000) break
}
r <- do.call(rbind, r)
such that
> head(r)
alfa beta
[1,] 9.787751 12.210648
[2,] 9.810682 14.046190
[3,] 9.874572 11.499204
[4,] 6.473674 8.812951
[5,] 8.720010 8.799160
[6,] 11.409675 10.602608

Manual simulation of Markov Chain in R

Consider the Markov chain with state space S = {1, 2}, transition matrix
P = [ 1/2  1/2 ]
    [  0    1  ]
and initial distribution α = (1/2, 1/2).
Simulate 5 steps of the Markov chain (that is, simulate X0, X1, ..., X5). Repeat the simulation 100 times. Use the results of your simulations to solve the following problems.
Estimate P(X1 = 1|X0 = 1). Compare your result with the exact probability.
My solution:
# returns Xn
func2 <- function(alpha1, mat1, n1)
{
  xn <- alpha1 %*% matrixpower(mat1, n1 + 1)
  return(xn)
}
alpha <- c(0.5, 0.5)
mat <- matrix(c(0.5, 0.5, 0, 1), nrow = 2, ncol = 2)
n <- 10
for (variable in 1:100)
{
  print(func2(alpha, mat, n))
}
What is the difference if I run this code once or 100 times (as is said in the problem-statement)?
How can I find the conditional probability from here on?
Let
alpha <- c(1, 1) / 2
mat <- matrix(c(1 / 2, 0, 1 / 2, 1), nrow = 2, ncol = 2) # Different than yours
be the initial distribution and the transition matrix. Your func2 only finds the n-th step distribution, which isn't needed here, and it doesn't simulate anything. Instead we may use
chainSim <- function(alpha, mat, n) {
  out <- numeric(n)
  out[1] <- sample(1:2, 1, prob = alpha)
  for (i in 2:n)
    out[i] <- sample(1:2, 1, prob = mat[out[i - 1], ])
  out
}
where out[1] is generated using only the initial distribution and then for subsequent terms we use the transition matrix.
Then we have
set.seed(1)
# Doing once
chainSim(alpha, mat, 1 + 5)
# [1] 2 2 2 2 2 2
so this chain started at 2 and stayed stuck there: the specified transition probabilities make state 2 absorbing.
Doing it 100 times, we have
# Doing 100 times
sim <- replicate(chainSim(alpha, mat, 1 + 5), n = 100)
rowMeans(sim - 1)
# [1] 0.52 0.78 0.87 0.94 0.99 1.00
where the last line shows, at each step, how often we ended up in state 2 rather than 1. That gives one (of many) reasons why 100 repetitions are more informative: a single simulation got stuck in state 2, while repeating it 100 times explored many more possible paths.
Then the conditional probability can be found with
mean(sim[2, sim[1, ] == 1] == 1)
# [1] 0.4583333
while the true probability is 0.5 (given by the upper left entry of the transition matrix).
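For reference, the exact value can be read directly off the transition matrix defined above, and with more replications the simulated estimate settles near it:
mat[1, 1]
# [1] 0.5
# e.g. sim <- replicate(chainSim(alpha, mat, 2), n = 1e5)
#      mean(sim[2, sim[1, ] == 1] == 1)   # should be close to 0.5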

Delete matrix rows based on a threshold coverage with another matrix in r

I have a matrix composed of sites and species. Some species have a certain trait value but not all of them.
I want to keep only the site-species matrix rows that contain enough trait information, in my case more than 60%.
So far, I have the following for-loop but I would like to have a faster version of this code. How can I optimize this and skip the for-loop part?
# site-species matrix
A <- matrix(c(0, 0.2, 0.2, 0.6, 0.3, 0.3, 0, 0.4), byrow = TRUE, nrow = 2)
colnames(A) <- paste0("sp_", seq(ncol(A)))
rownames(A) <- paste0("site_", seq(nrow(A)))
# trait information
B <- data.frame(sp = paste0("sp_", seq(ncol(A))),
                value = c(NA, NA, 2, 3))
# For-loop to get the coverage percentage for each row
pcover <- c()
for (i in 1:nrow(A)) {
  non_null_A <- A[i, ][A[i, ] > 0]
  B_match <- match(names(non_null_A), B[, "sp"])
  B_value <- B[B_match, "value"]
  pcover <- rbind(pcover,
                  sum(!is.na(B_value)) / length(B_value) * 100)
}
A
A[pcover > 60, , drop = FALSE] # in this case, the second site is removed
The idea is that you have two conditions working together:
is A positive?
is B$value NA?
We compute both tests up front and use only vectorized code:
Apos <- A[, B$sp] > 0  # or just A > 0 here, but I assumed from your code that you need the column matching
pcover <- 100 * colSums(t(Apos) & !is.na(B$value)) / rowSums(Apos)
pcover
# site_1 site_2
# 66.66667 33.33333
A[pcover > 60, , drop = FALSE]
# sp_1 sp_2 sp_3 sp_4
# site_1 0 0.2 0.2 0.6
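An equivalent one-liner, assuming (as in your example) that B$sp matches colnames(A) in order:
pcover2 <- 100 * rowSums(A[, !is.na(B$value), drop = FALSE] > 0) / rowSums(A > 0)
A[pcover2 > 60, , drop = FALSE]  # same sites retained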

Sampling with weight for 1 number

I want to sample from a big pool of numbers, e.g. -2000:5000.
However, I would like to set the weight for one particular number to 80%.
Every other number should have an equal share of the remaining probability.
Doing this for a small sample is easy:
sample(-2:2, 10, replace = TRUE, prob=c(0.05, 0.05, 0.80, 0.05, 0.05))
this would output:
[1] 0 0 0 0 0 -1 0 0 0 0
How can I do this for a big range of numbers?
Just build your probability vector programmatically. So you have
values <- -2:2
special_value <- 0
Then you can do
probs <- rep(1, length(values))
probs[values == special_value] <- (length(values) - 1) * 0.8 / (1 - 0.8)
Then use
sample(values, 10, replace = TRUE, prob=probs)
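Note that sample() normalizes prob internally, so the weights don't need to sum to 1. A quick check that the special value indeed gets an 80% share:
probs[values == special_value] / sum(probs)
# [1] 0.8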
You only need a separate vector with all the probabilities. When you create it, you can use a logical test to assign p1 = 0.8 to the desired number and p2 = 0.2 / (length(x) - 1) to every other value. The code runs as follows:
n <- 37  # your number
x <- -100:1000  # your sequence
# probabilities add up to 1: 0.8 for n, the remaining 0.2 split evenly among the others
probs <- ifelse(x == n, 0.8, (1 - 0.8) / (length(x) - 1))
sum(probs)
sample(x = x, size = 1000, prob = probs, replace = TRUE)
Let me know if it helps.
