In a coin flip game, you flip a fair coin until the difference between the number of heads and the number of tails is 3. You are paid $8 at the end, but you have to pay $1 for each flip of the coin. Use N = 100,000 simulations to find the expected amount you win.
The following code works for 1 game, but it becomes impractically slow for N = 100,000. How can I turn this into a function that gives me the expected payoff over 100,000 games?
# For 1 game
# number of games to play
n_games <- 1
bias <- 0.5
game_payoff <- c()
for (i in seq_len(n_games)) {
  cost <- 0
  flip_record <- c()
  payoff <- c()
  repeat {
    cost <- cost + 1
    flip <- rbinom(1, 1, prob = bias)
    flip_record <- c(flip_record, flip)
    # number of 0s/tails
    n_tails <- length(flip_record) - sum(flip_record)
    # number of 1s/heads
    n_heads <- sum(flip_record)
    if (abs(n_tails - n_heads) == 3) {
      # record game payoff
      game_payoff <- c(game_payoff, 8 - cost)
      # print game payoff
      print(paste0("single game payoff: ", 8 - cost))
      break
    }
  }
}
Here is a solution. I have added a bankruptcy argument to cut off very long games.
In the example run below, only 10,000 games (n_games) are played.
game <- function(prob = 0.5, pay = 8L, bankruptcy = 1e6) {
  cost <- 0L
  difference <- 0L
  flip_record <- integer(bankruptcy)
  # loop until the difference between the number of heads and
  # number of tails is 3, or the coin was flipped
  # more times than what the player can afford to flip
  while (abs(difference) < 3 && cost < bankruptcy) {
    flip <- sample(c(-1, 1), 1, prob = c(prob, 1 - prob))
    cost <- cost + 1L
    difference <- difference + flip
    flip_record[cost] <- flip
  }
  win <- (abs(difference) == 3L) * pay
  flip_record <- flip_record[seq.int(cost)]
  list(win = win, cost = cost, flips = flip_record)
}
set.seed(2022)
n_games <- 1e4
prob <- 0.5
# the games are independent, this could use parallelization
system.time(
  games_played <- replicate(n_games, game(), simplify = FALSE)
)
#> user system elapsed
#> 22.62 16.36 39.00
# extract the relevant values from the returned list
Wins <- sapply(games_played, `[[`, 'win')
Costs <- sapply(games_played, `[[`, 'cost')
# example differences sequences (game 1 and game 10)
Flips <- sapply(games_played, `[[`, 'flips')
cumsum(Flips[[1]])
#> [1] -1 -2 -1 -2 -1 -2 -1 0 1 0 1 2 3
cumsum(Flips[[10]])
#> [1] 1 0 -1 0 1 2 3
# it pays more times than not
table(Wins - Costs > 0)
#>
#> FALSE TRUE
#> 4258 5742
# the proportion of times the player turns a profit
mean(Wins - Costs > 0)
#> [1] 0.5742
# but when the player loses, they lose more
# than they gain when they win
mean(Wins - Costs) * n_games
#> [1] -9898
# the same number, different way of computing it
# and with formatted output
formatC(sum(Wins - Costs), big.mark = ",")
#> [1] "-9,898"
# the distribution of wins/losses is left skewed
# with a long tail => it doesn't pay to play
profit <- table(Wins - Costs)
barplot(profit)
Created on 2022-10-06 with reprex v2.0.2
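As an aside on the comment above that the games are independent: a minimal parallel sketch, assuming a Unix-alike where parallel::mclapply can fork (on Windows a PSOCK cluster with parLapply would be needed); the 4-core count is an arbitrary choice.
library(parallel)
RNGkind("L'Ecuyer-CMRG")   # reproducible RNG streams across workers
set.seed(2022)
games_par <- mclapply(seq_len(n_games), function(i) game(), mc.cores = 4)
Wins_par  <- sapply(games_par, `[[`, "win")
Costs_par <- sapply(games_par, `[[`, "cost")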
It will be more performant to vectorize over the games:
N <- 1e5
d <- rep(1L, N) # |difference| after the first flip is always 1; by symmetry every game can start at +1
n <- 1L # the number of flips so far
flip <- c(-1L, 1L)
cost <- 0L # initialize the total cost of playing N games
while (length(d)) {
  d <- d + sample(flip, length(d), TRUE)  # each flip adds or subtracts 1 from the difference
  idx <- which(abs(d) < 3L)               # which games have not ended?
  cost <- cost + (n <- n + 1L) * (length(d) - length(idx)) # add the cost from the ended games
  d <- d[idx]                             # remove the ended games from the simulation
}
# expected payoff
8 - cost/N
#> [1] -0.99824
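As a sanity check on the simulation: for a fair coin the number of flips is the first time a simple symmetric random walk hits ±3, and the expected hitting time of ±a for such a walk is a² (the standard gambler's-ruin / optional-stopping result). So E[flips] = 3² = 9 and the theoretical expected payoff is 8 − 9 = −1, in line with the simulated −0.99824.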
I just saw a YouTube video from Numberphile on the Yellowstone sequence (A098550). It's based on a sequence starting with 1 and 2, with subsequent terms generated by these rules:
no repeated terms
always pick the lowest integer
gcd(a_n, a_(n-1)) = 1
gcd(a_n, a_(n-2)) > 1
The first 15 terms would be: 1 2 3 4 9 8 15 14 5 6 25 12 35 16 7
A quick-and-dirty approach in R could look like this, but understandably it becomes very slow when generating longer sequences. It also makes an assumption about the highest number that can appear in the sequence (for reference: the first 10,000 terms never go higher than 5000).
What can we do to make this faster?
library(DescTools)
a <- c(1, 2, 3)
p <- length(a)
# all natural numbers
all_ints <- 1:5000
for (n in p:1000) {
  # rule 1 - remove all numbers that are in the sequence already
  next_a_set <- all_ints[which(!all_ints %in% a)]
  # rule 3 - search the remaining set for numbers that have gcd == 1
  next_a_option <- next_a_set[which(
    sapply(
      next_a_set,
      function(x) GCD(a[n], x)
    ) == 1
  )]
  # rule 4 - search the remaining numbers for gcd > 1
  next_a <- next_a_option[which(
    sapply(
      next_a_option,
      function(x) GCD(a[n - 1], x)
    ) > 1
  )]
  # select the lowest
  a <- c(a, min(next_a))
  n <- n + 1
}
Here's a version that's about 20 times faster than yours, with comments about the changes:
# Set a to the final length from the start.
a <- c(1, 2, 3, rep(NA, 997))
p <- 3
# Define a vectorized gcd() function. We'll be testing
# lots of gcds at once. This uses the Euclidean algorithm.
gcd <- function(x, y) { # vectorized gcd
  while (any(y != 0)) {
    x1 <- ifelse(y == 0, x, y)
    y <- ifelse(y == 0, 0, x %% y)
    x <- x1
  }
  x
}
# Guess at a reasonably large vector to work from,
# but we'll grow it later if not big enough.
allnum <- 1:1000
# Keep a logical record of what has been used
used <- c(rep(TRUE, 3), rep(FALSE, length(allnum) - 3))
for (n in p:1000) {
  # rule 1 - remove all numbers that are in the sequence already
  # nothing to do -- used already records that.
  repeat {
    # rule 3 - search the remaining set for numbers that have gcd == 1
    keep <- !used & gcd(a[n], allnum) == 1
    # rule 4 - search the remaining numbers for gcd > 1
    keep <- keep & gcd(a[n - 1], allnum) > 1
    # If we found anything, break out of this loop
    if (any(keep))
      break
    # Otherwise, make the set of possible values twice as big,
    # and try again
    allnum <- seq_len(2 * length(allnum))
    used <- c(used, rep(FALSE, length(used)))
  }
  # select the lowest
  newval <- which.max(keep)
  # Assign into the appropriate place
  a[n + 1] <- newval
  # Record that it has been used
  used[newval] <- TRUE
}
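As a quick check of this faster version, the first 15 stored values should reproduce the terms listed in the question:
a[1:15]
# expected (the terms listed in the question): 1 2 3 4 9 8 15 14 5 6 25 12 35 16 7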
If you profile it, you'll see it spends most of its time in the gcd() function. You could probably make that a lot faster by redoing it in C or C++.
The biggest change here is pre-allocation and restricting the search to numbers that have not yet been used.
library(numbers)
N <- 5e3
a <- integer(N)
a[1:3] <- 1:3
b <- logical(N) # which numbers have been used already?
b[1:3] <- TRUE
NN <- 1:N
system.time({
  for (n in 4:N) {
    a1 <- a[n - 1L]
    a2 <- a[n - 2L]
    for (k in NN[!b]) {
      if (GCD(k, a1) == 1L && GCD(k, a2) > 1L) {
        a[n] <- k
        b[k] <- TRUE
        break
      }
    }
    if (!a[n]) {
      a <- a[1:(n - 1L)]
      break
    }
  }
})
#>    user  system elapsed
#>    1.28    0.00    1.28
length(a)
#> [1] 1137
For a fast C++ algorithm, see here.
A survey shows an average score of 4.2 out of 5 with a sample size of 14. How do I create a data frame that provides the combinations of results that achieve an average score of 4.2?
I tried this, but it got too big:
library(tidyverse)
n <- 14
avg <- 4.2
df <- expand.grid(rep(list(1:5), n))
df <- df %>%
  rowwise() %>%
  mutate(avge = mean(c_across())) %>%
  filter(avge >= 4)
The aim, given the limited information above, is to know the distribution of combinations of individual scores, see which combinations are more likely to occur, and see how many low and high scores are needed to reach the average above.
Thanks!
If you can tolerate doing this randomly, then
set.seed(42) # only so that you get the same results I show here
n <- 14
iter <- 1000000
scores <- integer(0)
while (iter > 0) {
  tmp <- sample(1:5, size = n, replace = TRUE)
  if (mean(tmp) > 4) {
    scores <- tmp
    break
  }
  iter <- iter - 1
}
mean(scores)
# [1] 4.142857
scores
# [1] 5 3 5 5 5 3 3 5 5 2 5 5 4 3
Notes:
The reason I use iter in there is to preclude the possibility of an "infinite" loop. While here it finds a match quickly and is highly unlikely to hit that limit, if you change the conditions they could become infeasible or just highly improbable. If you don't need this, then remove iter and use while (TRUE) instead; you can always interrupt R with Escape (or whichever mechanism your IDE provides).
The reason I prefill scores with an empty vector and use tmp is so that you won't accidentally assume that scores having values means you have your average. That is, if the constraints are too tight, then you should find nothing, and therefore scores should not have values.
FYI: if you're looking for an average of 4.2, two things to note:
change the conditional to be what you need, such as looking for 4.2 ... but ...
looking for floating-point equality is going to bite you hard (see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754), I suggest looking within a tolerance, perhaps
tol <- 0.02
# ...
if (abs(mean(tmp) - 4.2) < tol) {
  scores <- tmp
  break
}
# ...
where tol is some meaningful number. Unfortunately, no combination of 14 votes (of 1 to 5) produces a mean within tol = 0.01 of 4.2: the closest attainable means are 58/14 ≈ 4.143 and 59/14 ≈ 4.214, both more than 0.01 away, so the loop runs into its iteration limit:
set.seed(42)
n <- 14
iter <- 100000
scores <- integer(0)
tol <- 0.01
while (iter > 0) {
  tmp <- sample(1:5, size = n, replace = TRUE)
  # if (mean(tmp) > 4) {
  if (abs(mean(tmp) - 4.2) < tol) {
    scores <- tmp
    break
  }
  iter <- iter - 1
}
iter
# [1] 0 # <-- this means the loop exited on the iteration-limit, not something found
scores
# integer(0)
if you instead set tol = 0.02 then you will find something:
tol <- 0.02
# ...
scores
# [1] 4 4 4 4 4 5 4 5 5 5 3 4 3 5
mean(scores)
# [1] 4.214286
You can try the code below
n <- 14
avg <- 4.2
repeat {
  x <- sample(1:5, n, replace = TRUE)
  if (sum(x) == round(avg * n)) break
}
and you will see
> x
[1] 5 5 5 5 5 5 4 5 5 4 1 5 1 4
> mean(x)
[1] 4.214286
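A side note on feasibility: 4.2 × 14 = 58.8 is not an integer, so no 14 whole-number scores average exactly 4.2; the reported 4.2 must itself be a rounded value, and the nearest attainable sums are 58 and 59. If you want every distinct combination rather than random draws, a small sketch (my own addition, not from the answers above) that enumerates the counts of each score value instead of all 5^14 response vectors stays manageable:
n <- 14
target_sum <- round(4.2 * n)  # 59; use 58 for the other nearby sum
# n1..n4 = how many respondents gave a 1, 2, 3 or 4; n5 is implied
combos <- expand.grid(n1 = 0:n, n2 = 0:n, n3 = 0:n, n4 = 0:n)
combos$n5 <- n - rowSums(combos)
combos <- subset(combos, n5 >= 0 &
                   1 * n1 + 2 * n2 + 3 * n3 + 4 * n4 + 5 * n5 == target_sum)
nrow(combos)   # number of distinct score-count combinations
head(combos)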
Is there a method to generate random integers in R such that any two consecutive numbers are different? It is probably something along the lines of x[k+1] != x[k], but I can't work out how to put it all together.
Not sure if there is a function available for that. Maybe this function can do what you want:
# n = number of elements
# sample_from = draw random numbers from this range
random_non_consecutive <- function(n = 10, sample_from = seq(1, 5)) {
  y <- c()
  while (length(y) != n) {
    y <- c(y, sample(sample_from, n - length(y), replace = TRUE))
    y <- y[!c(FALSE, diff(y) == 0)]
  }
  return(y)
}
Example:
random_non_consecutive(20,c(2,4,6,8))
[1] 6 4 6 2 6 4 2 8 4 2 6 2 8 2 8 2 8 4 8 6
Hope this helps.
The function above has a long worst-case runtime. We can make the worst case more predictable with, for example, the following implementation:
# n = number of elements
# sample_from = draw random numbers from this range
random_non_consecutive <- function(n = 10, sample_from = seq(1, 5)) {
  y <- rep(NA, n)
  prev <- -1 # change this if -1 is in your range, to e.g. max(sample_from) + 1
  for (i in seq(n)) {
    candidates <- setdiff(sample_from, prev)
    # index into candidates instead of calling sample(candidates, 1):
    # if candidates has length 1, sample() would draw from 1:candidates instead
    y[i] <- candidates[sample.int(length(candidates), 1)]
    prev <- y[i]
  }
  return(y)
}
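A quick usage check (an illustration I added, not part of the original answer): with only two values in the range there is exactly one admissible candidate after the first draw, so the output must strictly alternate.
random_non_consecutive(8, c(2, 4))
# e.g. 2 4 2 4 2 4 2 4 (or the same pattern starting with 4)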
Another approach is to over-sample and remove the disqualifying ones as follows:
# assumptions
n <- 5 # population size
sample_size <- 1000
# answer
mu <- sample_size * 1/n
vr <- sample_size * 1/n * (1 - 1/n)
addl_draws <- round(mu + vr, 0)
index <- seq(1:n)
sample_index <- sample(index, sample_size + addl_draws, replace = TRUE)
qualified_sample_index <- sample_index[which(diff(sample_index) != 0)]
qualified_sample_index <- qualified_sample_index[1:sample_size]
# In the very unlikely event the number of qualified samples < sample size,
# NA's will fill the vector. This will print those N/A's
print(which(is.na(qualified_sample_index) == TRUE))
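A quick verification I would add (not from the original answer): after the filtering, no two neighbouring elements should be equal.
any(diff(qualified_sample_index) == 0, na.rm = TRUE)
# expected FALSE: the kept elements never repeat consecutively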
Let's assume an integer x. I want to split this quantity into n mostly equal chunks and save the values in a vector. E.g. if x = 10 and n = 4, the resulting vector would be:
(3,3,2,2)
and if n = 3:
(4,3,3)
Note: The order of the resulting vector does not matter
While this will create a (probably unnecessarily) large object when x is large, it is still pretty quick:
x <- 10
n <- 4
tabulate(cut(1:x, n))
#[1] 3 2 2 3
On a decent modern machine dividing 10M records into 100K groups, it takes only 5 seconds:
x <- 1e7
n <- 1e5
system.time(tabulate(cut(1:x, n)))
# user system elapsed
# 5.07 0.06 5.13
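Since the chunk sizes can only take the two values x %/% n and x %/% n + 1, a direct arithmetic construction (a sketch I'm adding for comparison, no packages needed) avoids materialising the length-x vector entirely:
x <- 10
n <- 4
chunks <- rep(x %/% n, n) + (seq_len(n) <= x %% n)  # the first x %% n chunks get one extra unit
chunks
#> [1] 3 3 2 2
sum(chunks) == x
#> [1] TRUE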
Here are some solutions.
1) lpSolve Solve this integer linear program. It should be fast even for large x (but not if n is also large). I also tried it for x = 10,000 and n = 3 and it returned the solution immediately.
For example, for n = 4 and x = 10 it corresponds to
min x4 - x1 such that 0 <= x1 <= x2 <= x3 <= x4 and
x1 + x2 + x3 + x4 = 10 and
x1, x2, x3, x4 are all integer
The R code is:
library(lpSolve)
x <- 10
n <- 4
D <- diag(n)
mat <- (col(D) - row(D) == 1) - D
mat[n, ] <- 1
obj <- replace(numeric(n), c(1, n), c(-1, 1))
dir <- replace(rep(">=", n), n, "=")
rhs <- replace(numeric(n), n, x)
result <- lp("min", obj, mat, dir, rhs, all.int = TRUE)
result$solution
## [1] 2 2 3 3
and if we repeat the above with n = 3 we get:
## [1] 3 3 4
2) lpSolveAPI The lpSolveAPI package's interface to lpSolve supports a sparse matrix specification, which may reduce storage if n is large, although it may still be slow if n is sufficiently large. Rewriting (1) using this package we have:
library(lpSolveAPI)
x <- 10
n <- 4
mod <- make.lp(n, n)
set.type(mod, 1:n, "integer")
set.objfn(mod, c(-1, 1), c(1, n))
for(i in 2:n) add.constraint(mod, c(-1, 1), ">=", 0, c(i-1, i))
add.constraint(mod, rep(1, n), "=", x)
solve(mod)
get.variables(mod)
## [1] 2 2 3 3
3) Greedy Heuristic This alternative uses no packages. It starts with a candidate solution having n-1 values of x/n rounded down and one remaining value. On each iteration it tries to improve the current solution by subtracting one from the largest values and adding 1 to the same number of smallest values. It stops when it can make no further improvement in the objective, diff(range(soln)).
Note that for x <- 1e7 and n <- 1e5 it is quite easy to solve since n divides evenly into x. In particular system.time(tabulate(cut(...))) reports 18 sec on my machine, and for the same problem the code below takes 0.06 seconds as it gets the answer after 1 iteration.
For x <- 1e7 and n <- 1e5-1 system.time(tabulate(cut(...))) reports 16 seconds on my machine and for the same problem the code below takes 4 seconds finishing after 100 iterations.
In the example below, taken from the question, 10/4 rounded down is 2 so it starts out with c(2, 2, 2, 4). On the first iteration it gets c(2, 2, 3, 3). On the second iteration it cannot get any improvement and so returns the answer.
x <- 10
n <- 4
a <- x %/% n
soln <- replace(rep(a, n), n, x - (n - 1) * a)
obj <- diff(range(soln))
iter <- 0
while (TRUE) {
  iter <- iter + 1
  soln_new <- soln
  mx <- which(soln == max(soln))
  ix <- seq_along(mx)
  soln_new[ix] <- soln_new[ix] + 1
  soln_new[mx] <- soln_new[mx] - 1
  soln_new <- sort(soln_new)
  obj_new <- diff(range(soln_new))
  if (obj_new >= obj) break
  soln <- soln_new
  obj <- obj_new
}
iter
## [1] 2
soln
## [1] 2 2 3 3
We have a big for loop in R for simulating various data. In some iterations the data are generated in such a way that a quantity becomes 0 inside the loop, which is not desirable, so we should skip that step of data generation. At the same time we need to add one extra iteration for each such skip, otherwise we will end up with fewer observations than required.
For example, while running the following code, we get z = 0 in iterations 1, 8 and 9.
rm(list = ls())
n <- 10
z <- NULL
for (i in 1:n) {
  set.seed(i)
  a <- rbinom(1, 1, 0.5)
  b <- rbinom(1, 1, 0.5)
  z[i] <- a + b
}
z
[1] 0 1 1 1 1 2 1 0 0 1
We want to skip these steps so that we do not have any z = 0, while still getting a vector z of length 10. It could be done in many ways, but what I particularly want to see is how to skip the current step when z = 0 is encountered and move on to the next one, ultimately obtaining 10 observations for z.
Normally we do this via a while loop, as the number of iterations required is unknown beforehand.
n <- 10L
z <- integer(n)
m <- 1L; i <- 0L
while (m <= n) {
  set.seed(i)
  z_i <- sum(rbinom(2L, 1, 0.5))
  if (z_i > 0L) {z[m] <- z_i; m <- m + 1L}
  i <- i + 1L
}
Output:
z
# [1] 1 1 1 1 1 2 1 1 1 1
i
# [1] 14
So we sample 14 times; 4 of the draws are 0 and the remaining 10 are retained.
More efficient vectorized method
set.seed(0)
n <- 10L
z <- rbinom(n, 1, 0.5) + rbinom(n, 1, 0.5)
m <- length(z <- z[z > 0L]) ## filtered samples
p <- m / n ## estimated success probability
k <- round(1.5 * (n - m) / p) ## additional samples, aiming to yield the remaining (n - m) non-zero draws
z_more <- rbinom(k, 1, 0.5) + rbinom(k, 1, 0.5)
z <- c(z, z_more[which(z_more > 0)[seq_len(n - m)]])
Some probability theory of the geometric distribution has been used here. Initially we draw n samples, of which m are retained, so the estimated probability of accepting a sample is p <- m / n. By the theory of the geometric distribution, we need on average 1/p draws to observe one success, so we should draw at least (n - m) / p more samples to expect (n - m) successes. The 1.5 is just an inflation factor: by drawing 1.5 times that many samples we can hopefully ensure (n - m) successes.
According to the law of large numbers, the estimate of p is more precise when n is large, so this approach is stable for large n.
If you feel that 1.5 is not large enough, use 2 or 3. But my feeling is that it is sufficient.
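In this particular example the acceptance probability is actually known exactly: z = a + b is zero only when both Bernoulli(0.5) draws are zero, so p = 1 − 0.5² = 0.75 and no estimation step is needed. A small sketch along those lines (the top-up loop and the cushion of 5 extra draws are my own choices, not from the answer above):
set.seed(0)
n <- 10L
p <- 1 - 0.5^2            # exact P(z > 0): z = 0 only if both flips are 0
z <- integer(0)
while (length(z) < n) {   # top up until n accepted draws are available
  k <- ceiling((n - length(z)) / p) + 5L  # expected need plus a small cushion
  z_new <- rbinom(k, 1, 0.5) + rbinom(k, 1, 0.5)
  z <- c(z, z_new[z_new > 0L])
}
z <- z[seq_len(n)]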