How to do large combinations with condition in R efficiently? - r

Survey shows average score of 4.2 out of 5, with sample size of 14. How do I create a dataframe that provides a combination of results to achieve score of 4.2?
I tried this but it got too big
library(tidyverse)
n <- 14
avg <- 4.2
df <- expand.grid(rep(list(c(1:5)),n))
df <- df %>%
rowwise() %>%
mutate(avge = mean(c_across())) %>%
filter(ave >= 4)
The aim for this is, given the limited information above, I want to know the distribution of combinations of individual scores and see which combination is more likely to occur and how many low scores + high scores needed to have an average of that score above.
Thanks!

If you can tolerate doing this randomly, then
set.seed(42) # only so that you get the same results I show here
n <- 14
iter <- 1000000
scores <- integer(0)
while (iter > 0) {
tmp <- sample(1:5, size = n, replace = TRUE)
if (mean(tmp) > 4) {
scores <- tmp
break
}
iter <- iter - 1
}
mean(scores)
# [1] 4.142857
scores
# [1] 5 3 5 5 5 3 3 5 5 2 5 5 4 3
Notes:
The reason I use iter in there is to preclude the possibility of an "infinite" loop. While here it reacts rather quickly and is highly unlikely to go that far, if you change the conditions then it is possible your conditions could be infeasible or just highly improbable. If you don't need this, then remove iter and use instead while (TRUE) ...; you can always interrupt R with Escape (or whichever mechanism your IDE provides).
The reason I prefill scores with an empty vector and use tmp is so that you won't accidentally assume that scores having values means you have your average. That is, if the constraints are too tight, then you should find nothing, and therefore scores should not have values.
FYI: if you're looking for an average of 4.2, two things to note:
change the conditional to be what you need, such as looking for 4.2 ... but ...
looking for floating-point equality is going to bite you hard (see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754), I suggest looking within a tolerance, perhaps
tol <- 0.02
# ...
if (abs(mean(tmp) - 4.2) < tol) {
scores <- tmp
break
}
# ...
where tol is some meaningful number. Unfortunately, using this seed (and my iter limit) there is no combination of 14 votes (of 1 to 5) that produce a mean that is within tol = 0.01 of 4.2:
set.seed(42)
n <- 14
iter <- 100000
scores <- integer(0)
tol <- 0.01
while (iter > 0) {
tmp <- sample(1:5, size = n, replace = TRUE)
# if (mean(tmp) > 4) {
if (abs(mean(tmp) - 4.2) < tol) {
scores <- tmp
break
}
iter <- iter - 1
}
iter
# [1] 0 # <-- this means the loop exited on the iteration-limit, not something found
scores
# integer(0)
if you instead set tol = 0.02 then you will find something:
tol <- 0.02
# ...
scores
# [1] 4 4 4 4 4 5 4 5 5 5 3 4 3 5
mean(scores)
# [1] 4.214286

You can try the code below
n <- 14
avg <- 4.2
repeat{
x <- sample(1:5, n, replace = TRUE)
if (sum(x) == round(avg * n)) break
}
and you will see
> x
[1] 5 5 5 5 5 5 4 5 5 4 1 5 1 4
> mean(x)
[1] 4.214286

Related

Generating "Non-Random" Numbers in R?

I know how to generate 100 random numbers in R (without replacement):
random_numbers = sample.int(100, 100, replace = FALSE)
I was now curious about learning how to generate 100 "non random" numbers (without replacement). The first comes to mind is to generate a random number, and the next number will be the old number + 1 with a probability of 0.5 or an actual random number with probability 0.5. Thus, these numbers are not "fully random".
This was my attempt to write this code for numbers in a range of 0 to 100 (suppose I want to repeat this procedure 100 times):
library(dplyr)
all_games <- vector("list", 100)
for (i in 1:100){
index_i = i
guess_sets <- 1:100
prob_i = runif(n=1, min=1e-12, max=.9999999999)
guess_i = ifelse(prob_i> 0.5, sample.int(1, 100, replace = FALSE), guess_i + 1)
guess_sets_i <- setdiff(guess_sets_i, guess_i)
all_games_i = as.list(index_i, guess_i, all_games_i)
all_games[[i]] <- all_games_i
}
all_games <- do.call("rbind", all_games)
I tried to make a list that stores all guesses such that the range for the next guess automatically excludes numbers that have already been guessed, but I get this error:
Error in sample.int(1, 100, replace = FALSE) :
cannot take a sample larger than the population when 'replace = FALSE'
Ideally, I am trying to get the following results (format doesn't matter):
index_1 : 5,6,51,4,3,88,87,9 ...
index_2 77,78,79,2,65,3,1,99,100,4...
etc.
Can someone please show me how to do this? Are there easier ways in R to generate "non-random numbers"?
Thank you!
Note: I think an extra line of logic needs to be added - Suppose I guess the number 100, after guessing the number 100 I must guess a new random number since 100+1 is not included in the original range. Also, if I guess the number 5, 17 then 4 - and after guessing 4, the loop tells me to guess 4+1, this is impossible because 5 has already been guessed. In such a case, I would also have to guess a new random number?
It would be tricky to make your algorithm very efficient in R... it doesn't lend itself nicely to vectorization. Here's how I'd write it directly as a for loop:
semirandom = function(n) {
safe_sample = function(x, ...) {
if(length(x) == 1) return(x)
sample(x, ...)
}
result = numeric(n)
result[1] = sample.int(n, size = 1)
for(i in 2:length(result)) {
if(runif(1) < .5 &&
result[i - 1] < n &&
!((result[i - 1] + 1) %in% result)) {
result[i] = result[i - 1] + 1
} else {
result[i] = safe_sample(x = setdiff(1:n, result), size = 1)
}
}
result
}
# generate 10 semirandom numbers 5 times
replicate(semirandom(10), n = 5)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 6 4 4 2 6
# [2,] 3 5 5 3 7
# [3,] 4 3 6 4 5
# [4,] 5 1 2 5 2
# [5,] 7 9 3 6 3
# [6,] 9 10 10 1 1
# [7,] 10 2 8 9 4
# [8,] 2 8 1 8 10
# [9,] 1 7 9 10 9
# [10,] 8 6 7 7 8
You get the error cannot take a sample larger than the population when 'replace = FALSE' because you attempt to extract 100 values from a vector of length one without replacement.
The following draws numbers between 1 and 100, draws each number not more than once, has a 50 percent chance of drawing the previous number + 1 and a 50 percent chance of drawing another random number, if the previous number + 1 has not been drawn yet, and a 100 percent chance to draw another random number, if the previous number + 1 has been drawn.
i <- sample.int(100, 1)
j <- i
for(x in 1:99) {
if((i + 1L) %in% j) {
i <- sample((1:100)[-j], 1L)
} else {
if(runif(1L) > 0.5 || i == 100L) {
i <- sample((1:100)[-j], 1L)
} else {
i <- i + 1L
}
}
j <- c(j, i)
}

R: Logical Conditions Not Being Respected

I am working with the R programming language. I am trying to build a loop that performs the following :
Step 1: Keep generating two random numbers "a" and "b" until both "a" and "b" are greater than 12
Step 2: Track how many random numbers had to be generated until it took for Step 1 to be completed
Step 3: Repeat Step 1 and Step 2 100 times
Since I do not know how to keep generating random numbers until a condition is met, I tried to generate a large amount of random numbers hoping that the condition is met (there is probably a better way to write this):
results <- list()
for (i in 1:100){
# do until break
repeat {
# repeat many random numbers
a = rnorm(10000,10,1)
b = rnorm(10000,10,1)
# does any pair meet the requirement
if (any(a > 12 & b > 12)) {
# put it in a data.frame
d_i = data.frame(a,b)
# end repeat
break
}
}
# select all rows until the first time the requirement is met
# it must be met, otherwise the loop would not have ended
d_i <- d_i[1:which(d_i$a > 10 & d_i$b > 10)[1], ]
# prep other variables and only keep last row (i.e. the row where the condition was met)
d_i$index = seq_len(nrow(d_i))
d_i$iteration = as.factor(i)
e_i = d_i[nrow(d_i),]
results[[i]] <- e_i
}
results_df <- do.call(rbind.data.frame, results)
Problem: When I look at the results, I noticed that the loop is incorrectly considering the condition to be met, for example:
head(results_df)
a b index iteration
4 10.29053 10.56263 4 1
5 10.95308 10.32236 5 2
3 10.74808 10.50135 3 3
13 11.87705 10.75067 13 4
1 10.17850 10.58678 1 5
14 10.14741 11.07238 1 6
For instance, in each one of these rows - both "a" and "b" are smaller than 12.
Does anyone know why this is happening and can someone please show me how to fix this problem?
Thanks!
How about this way? As you tag while-loop, I tried using it.
res <- matrix(0, nrow = 0, ncol = 3)
for (j in 1:100){
a <- rnorm(1, 10, 1)
b <- rnorm(1, 10, 1)
i <- 1
while(a < 12 | b < 12) {
a <- rnorm(1, 10, 1)
b <- rnorm(1, 10, 1)
i <- i + 1
}
x <- c(a,b,i)
res <- rbind(res, x)
}
head(res)
[,1] [,2] [,3]
x 12.14232 12.08977 399
x 12.27158 12.01319 1695
x 12.57345 12.42135 302
x 12.07494 12.64841 600
x 12.03210 12.07949 82
x 12.34006 12.00365 782
dim(res)
[1] 100 3

take a sample that has a specific mean

Let's say I have a population like {1,2,3, ..., 23} and I want to generate a sample so that the sample's mean equals 6.
I tried to use the sample function, using a custom probability vector, but it didn't work:
population <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23)
mean(population)
minimum <- min(population)
maximum <- max(population)
amplitude <- maximum - minimum
expected <- 6
n <- length(population)
prob.vector = rep(expected, each=n)
for(i in seq(1, n)) {
if(expected > population[i]) {
prob.vector[i] <- (i - minimum) / (expected - minimum)
} else {
prob.vector[i] <- (maximum - i) / (maximum - expected)
}
}
sample.size <- 5
sample <- sample(population, sample.size, prob = prob.vector)
mean(sample)
The mean of the sample is about the mean of the population (oscillates around 12), and I wanted it to be around 6.
A good sample would be:
{3,5,6,8,9}, mean=6.2
{2,3,4,8,9}, mean=5.6
The problem is different from sample integer values in R with specific mean because I have a specific population and I can't just generate arbitrary real numbers, they must be inside the population.
The plot of the probability vector:
You can try this:
m = local({b=combn(1:23,5);
d = colMeans(b);
e = b[,d>5.5 &d<6.5];
function()sample(e[,sample(ncol(e),1)])})
m()
[1] 8 5 6 9 3
m()
[1] 6 4 5 3 13
breakdown:
b=combn(1:23,5) # combine the numbers into 5
d = colMeans(b) # find all the means
e = b[,d>5.5 &d<6.5] # select only the means that are within a 0.5 range of 6
sample(e[,sample(ncol(e),1)]) # sample the values the you need

R keep randomly generating numbers until all numbers within specified range are present

My aim is to randomly generate a vector of integers using R, which is populated by numbers between 1-8. However, I want to keep growing the vector until all the numbers between 1:8 are represented at least once, e.g. 1,4,6,2,2,3,5,1,4,7,6,8.
I am able to generate single numbers or a sequence of numbers using sample
x=sample(1:8,1, replace=T)
>x
[1] 6
I have played around with the repeat function to see how it might work with sample and I can at least get the generation to stop when one specific number occurs, e.g.
repeat {
print(x)
x = sample(1:8, 1, replace=T)
if (x == 3){
break
}
}
Which gives:
[1] 3
[1] 6
[1] 6
[1] 6
[1] 6
[1] 6
[1] 2
I am struggling now to work out how to stop number generation once all numbers between 1:8 are present. Additionally, I know that the above code is only printing the sequence as it is generated and not storing it as a vector. Any advice pointing me in the right direction would be really appreciated!
This is fine for 1:8 but might not always be a good idea.
foo = integer(0)
set.seed(42)
while(TRUE){
foo = c(foo, sample(1:8, 1))
if(all(1:8 %in% foo)) break
}
foo
# [1] 8 8 3 7 6 5 6 2 6 6 4 6 8 3 4 8 8 1
If you have more than 1:8, it may be better to obtain the average number of tries (N) required to get all the numbers at least once and then sample N numbers such that all numbers are sampled at least once.
set.seed(42)
vec = 1:8
N = ceiling(sum(length(vec)/(1:length(vec))))
foo = sample(c(vec, sample(vec, N - length(vec), TRUE)))
foo
# [1] 3 6 8 3 8 8 6 4 5 6 1 6 4 6 6 3 5 7 2 2 7 8
Taking cue off of d.b, here's a slightly more verbose method that is more memory-efficient (and a little faster too, though I doubt speed is your issue):
Differences:
pre-allocate memory in chunks (size 100 here), mitigates the problem with extend-by-one vector work; allocating and extending 100 (or even 1000) at a time is much lower cost
compare only the newest number instead of all numbers each time (the first n-1 numbers have already been tabulated, no need to do that again)
Code:
microbenchmark(
r2evans = {
emptyvec100 <- integer(100)
counter <- 0
out <- integer(0)
unseen <- seq_len(n)
set.seed(42)
repeat {
if (counter %% 100 == 0) out <- c(out, emptyvec100)
counter <- counter+1
num <- sample(n, size=1)
unseen <- unseen[unseen != num]
out[counter] <- num
if (!length(unseen)) break
}
out <- out[1:counter]
},
d.b = {
foo = integer(0)
set.seed(42)
while(TRUE){
foo = c(foo, sample(1:n, 1))
if(all(1:n %in% foo)) break
}
}, times = 100, unit = 'us')
# Unit: microseconds
# expr min lq mean median uq max neval
# r2evans 1090.007 1184.639 1411.531 1228.947 1320.845 11344.24 1000
# d.b 1242.440 1372.264 1835.974 1441.916 1597.267 14592.74 1000
(This is intended neither as code-golf nor speed-optimization. My primary goal is to argue against extend-by-one vector work, and suggest a more efficient comparison technique.)
As d.b further suggested, this works fine for 1:8 but may run into trouble with larger numbers. If we extend n up:
(Edit: with d.b's code changes, the execution times are much closer, and not nearly as exponential looking. Apparently the removal of unique had significant benefits to his code.)

How to skip a step and increase the number of iterations in a for loop in R

We have a big for loop in R for simulating various data where for some iterations the data generate in such a way that a quantity comes 0 inside the loop, which is not desirable and we should skip that step of data generation. But at the same time we also need to increase the number of iterations by one step because of such skip, otherwise we will have fewer observations than required.
For example, while running the following code, we get z=0 in iteration 1, 8 and 9.
rm(list=ls())
n <- 10
z <- NULL
for(i in 1:n){
set.seed(i)
a <- rbinom(1,1,0.5)
b <- rbinom(1,1,0.5)
z[i] <- a+b
}
z
[1] 0 1 1 1 1 2 1 0 0 1
We desire to skip these steps so that we do not have any z=0 but we also want a vector z of length 10. It may be done in many ways. But what I particularly want to see is how we can stop the iteration and skip the current step when z=0 is encountered and go to the next step, ultimately obtaining 10 observations for z.
Normally we do this via a while loop, as the number of iterations required is unknown beforehand.
n <- 10L
z <- integer(n)
m <- 1L; i <- 0L
while (m <= n) {
set.seed(i)
z_i <- sum(rbinom(2L, 1, 0.5))
if (z_i > 0L) {z[m] <- z_i; m <- m + 1L}
i <- i + 1L
}
Output:
z
# [1] 1 1 1 1 1 2 1 1 1 1
i
# [1] 14
So we sample 14 times, 4 of which are 0 and the rest 10 are retained.
More efficient vectorized method
set.seed(0)
n <- 10L
z <- rbinom(n, 1, 0.5) + rbinom(n, 1, 0.5)
m <- length(z <- z[z > 0L]) ## filtered samples
p <- m / n ## estimated success probability
k <- round(1.5 * (n - m) / p) ## further number of samples to ensure successful (n - m) non-zero samples
z_more <- rbinom(k, 1, 0.5) + rbinom(k, 1, 0.5)
z <- c(z, z_more[which(z_more > 0)[seq_len(n - m)]])
Some probability theory of geometric distribution has been used here. Initially we sample n samples, m of which are retained. So the estimated probability of success in accepting samples is p <- m/n. According to theory of Geometric distribution, on average, we need at least 1/p samples to observe a success. Therefore, we should at least sample (n-m)/p more times to expect (n-m) success. The 1.5 is just an inflation factor. By sampling 1.5 times more samples we hopefully can ensure (n-m) success.
According to Law of large numbers, the estimate of p is more precise when n is large. Therefore, this approach is stable for large n.
If you feel that 1.5 is not large enough, use 2 or 3. But my feeling is that it is sufficient.

Resources