I got a question related to probability theory and I tried to solve it by simulating it in R. However, I ran into a problem as the while loop does not seem to break.
The question is asking: How many people are needed such that there is at least a 70% chance that one of them is born on the last day of December?
Here is my code:
prob <- 0
people <- 1
while (prob <= 0.7) {
people <- people + 1 #start the iteration with 2 people in the room and increase 1 for every iteration
birthday <- sample(365, size = people, replace = TRUE)
prob <- length(which(birthday == 365)) / people
}
return(prob)
My guess is that it could never hit 70%, therefore the while loop never breaks, am I right? If so, did I interpret the question wrongly?
I did not want to post this on stats.stackexchange.com because I thought this is more related to code rather than math itself, but I will move it if necessary, thanks.
This is a case where an analytical solution based on probability is easier and more accurate than trying to simulate. I agree with Harshvardhan that your formulation is solving the wrong problem.
The probability that at least one person in a pool of n has their birthday on a particular target date is 1 - P{all n miss the target date}. This probability is at least 0.7 when P{all n miss the target date} <= 0.3. The probability of each individual missing the target is assumed to be P{miss} = 1 - 1/365 (365 days per year, all birthdates equally likely). If the individual birthdays are independent, then P{all n miss the target date} = P{miss}^n.
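The inequality above can also be solved for n directly by taking logarithms. Here is a minimal Python sketch of that, under the same assumptions (365 equally likely days, independent birthdays); the function name `min_people` is just something I made up for illustration:

```python
import math

def min_people(target=0.7, days=365):
    """Smallest n such that 1 - (1 - 1/days)**n >= target."""
    p_miss = 1 - 1 / days
    # p_miss**n <= 1 - target  <=>  n >= log(1 - target) / log(p_miss)
    # (the inequality flips because log(p_miss) is negative).
    return math.ceil(math.log(1 - target) / math.log(p_miss))

n = min_people()
print(n, 1 - (1 - 1 / 365) ** n)
```

This uses floats rather than exact rationals, so it is only a quick cross-check of the loop-based calculation, not a replacement for it.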
I am not an R programmer, but the following Ruby should translate pretty easily:
# Use rationals to avoid cumulative float errors.
# Makes it slower but accurate.
P_MISS_TARGET = 1 - 1/365r
p_all_miss = P_MISS_TARGET
threshold = 3r / 10 # seeking P{all miss target} < 0.3
n = 1
while p_all_miss > threshold
p_all_miss *= P_MISS_TARGET
n += 1
end
puts "With #{n} people, the probability all miss is #{p_all_miss.to_f}"
which produces:
With 439 people, the probability all miss is 0.29987476838793214
Addendum
I got curious, since my answer differs from the accepted one, so I wrote a small simulation. Again, I think it's straightforward enough to understand even though it's not in R:
require 'quickstats' # Stats "gem" available from rubygems.org
def trial
n = 1
# Keep adding people to the count until one of them hits the target
n += 1 while rand(1..365) != 365
return n
end
def quantile(percentile = 0.7, number_of_trials = 1_000)
# Create an array containing results from specified number of trials.
# Defaults to 1000 trials
counts = Array.new(number_of_trials) { trial }
# Sort the array and determine the empirical target percentile.
# Defaults to 70th percentile
return counts.sort[(percentile * number_of_trials).to_i]
end
# Tally the statistics of 100 quantiles and report results,
# including margin of error, formatted to 3 decimal places.
stats = QuickStats.new
100.times { stats.new_obs(quantile) }
puts "#{"%.3f" % stats.avg}+/-#{"%.3f" % (1.96*stats.std_err)}"
Five runs produce outputs such as:
440.120+/-3.336
440.650+/-3.495
435.820+/-3.558
439.500+/-3.738
442.290+/-3.909
which is strongly consistent with the analytical result derived earlier and seems to differ significantly from the other responders' answers.
Note that on my machine the simulation takes roughly 40 times longer than the analytical calculation, is more complex, and introduces uncertainty. To increase the precision you would need larger sample sizes, and thus longer run times. Given these considerations, I would reiterate my advice to go for the direct solution in this case.
Indeed, your probability will (almost) never reach 0.7, because prob is the fraction of people in a single sample whose birthday is 365. As people gets larger, that fraction settles near 1/365, which is far below 0.7.
Furthermore, to calculate a probability for a given number of persons, you should draw many samples and then calculate the probability. Here is a way to achieve that:
N = 450 # max. number of people to try
probs = array(numeric(), N) # empty array to store found probabilities
# try for all people numbers in range 1:N
for(people in 1:N){
# draw 200 samples to estimate the probability
samples = 200
successes = 0
for(i in 1:samples){
birthday <- sample(365, size = people, replace = TRUE)
total_last_day <- sum(birthday == 365)
if(total_last_day >= 1){
successes <- successes + 1
}
}
# store the estimated probability in the array
probs[people] = successes/samples
}
# output of those people numbers that achieved a probability of > 0.7
which(probs>0.7)
As this is a simulation, the result depends on the run. Increasing the number of samples per group size would make the result more stable.
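For comparison, here is the same idea as a small Python sketch: estimate, for one fixed group size, how often at least one person is born on the last day. The function name, the seed and the number of trials are my own choices for illustration.

```python
import random

def estimate_hit_probability(people, trials=2000, days=365, seed=1):
    """Fraction of simulated groups with at least one birthday on the last day."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # any() stops at the first person born on the last day (index days - 1).
        if any(rng.randrange(days) == days - 1 for _ in range(people)):
            hits += 1
    return hits / trials

print(estimate_hit_probability(439))
```

With only 2000 trials the estimate still wobbles by a percentage point or so between seeds, which is the run-to-run variation mentioned above.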
You are solving the wrong problem. The question is, "How many people are needed such that there is at least a 70% chance that one of them is born on the last day of December?". What you are finding now is "How many people are needed such that 70% have their birthdays on the last day of December?". The answer to the second question is close to zero. But the first one is much simpler.
Replace prob <- length(which(birthday == 365)) / people with check = any(birthday == 365) in your logic because at least one of them has to be born on Dec 31. Then, you will be able to find if that number of people will have at least one person born on Dec 31.
After that, you will have to rerun the simulation multiple times to generate empirical probability distribution (kind of Monte Carlo). Only then you can check for probability.
Simulation Code
people_count = function(i)
{
set.seed(i)
for (people in 1:10000)
{
birthday = sample(365, size = people, replace = TRUE)
check = any(birthday == 365)
if(check == TRUE)
{
pf = people
break
}
}
return(pf)
}
The people_count() function returns the smallest group size at which the simulated group contained at least one person born on Dec 31. I then rerun the simulation 10,000 times.
library(magrittr) # provides the %>% pipe
# Number of simulations
nsim = 10000
l = lapply(1:nsim, people_count) %>%
unlist()
Let's see the distribution of the number of people required.
To find actual probability, I'll use cumsum().
> cdf = cumsum(l/nsim)
> which(cdf>0.7)[1]
[1] 292
So, on average, you would need 292 people to have more than a 70% chance.
In addition to @pjs's answer, I would like to provide one myself, written in R. I attempted to solve this question by simulation rather than by an analytical approach, and I am sharing it in case it helps someone else with the same problem. It's not that well written, but the idea is there:
# create a function which will find if anyone is born on last day
last_day <- function(x){
birthdays <- sample(365, size = x, replace = TRUE) #randomly get everyone's birthdays
if(length(which(birthdays == 365)) >= 1) {
TRUE #return TRUE if at least one person was born on the last day
} else {
FALSE
}
}
# find out how many people needed to get 70%
people <- 0 #set number of people to zero
prob <- 0 #set prob to zero
while (prob <= 0.7) { #loop does not stop until it hits 70%
people <- people + 1 #increase the number of people every iteration
prob <- mean(replicate(10000, last_day(people))) #run last_day 10000 times to find the mean of probability
}
print(people)
last_day() only returns TRUE or FALSE. So for every iteration of the loop I run last_day() 10000 times to find out how many of those 10000 runs have one or more people born on the last day (this gives the probability). I keep the loop running until the probability exceeds 70%, then print the number of people.
The answer I get from running the loop once is 440, which is quite close to the answer provided by @pjs.
R and probability noob here. I'm looking to create a histogram that shows the distribution of how many attempts it took to return a heads, repeated over 1000+ simulated runs on the equivalent of an unfairly weighted coin (0.1 heads, 0.9 tails).
From my understanding, this is not a geometric distribution or binomial distribution (but might make use of either of these to create the simulated results).
The real-world (ish) scenario I am looking to model this for is a speedrun of the game Zelda: Ocarina of Time. One of the goals in this speedrun is to obtain an item from a character that has a 1 in 10 chance of giving the player the item each attempt. As such, the player stops attempting once they receive the item (which they have a 1/10 chance of receiving each attempt). Every run, runners/viewers will keep track of how many attempts it took to receive the item during that run, as this affects the time it takes the runner to complete the game.
This is an example of what I'm looking to create:
(though with more detailed labels on the x axis if possible). For this, I manually flipped a virtual coin with a 1/10 chance of heads over and over. Once I got a successful result, I recorded how many attempts it took in a vector in R, and repeated this about 100 times. I then plotted the vector as a histogram to visualise the distribution of the number of attempts it usually takes to get a successful result. Basically, I'd like to automate this simulation instead of having to manually flip the virtual unfair coin, write down how many attempts it took before heads, and enter it into R myself.
I'm not sure if this is quite what you're looking for, but if you create a function for your manual coin flipping, you can just use replicate() to call it many times:
foo <- function(p = 0.1) {
i <- 0
failure <- TRUE
while ( failure ) {
i <- i + 1
if ( sample(x = c(TRUE, FALSE), size = 1, prob = c(p, 1-p)) ) {
failure <- FALSE
}
}
return(i)
}
set.seed(42)
number_of_attempts <- replicate(1000, foo())
hist(number_of_attempts, xlab = "Number of Attempts Until First Success")
As I alluded to in my comment though, I'm not sure why you think the geometric distribution is inappropriate.
It "is used for modeling the number of failures until the first success" (from the Wikipedia on it).
So, we can just sample from it and add one; the approaches are equivalent, but this will be faster when your number of samples is high:
number_of_attempts2 <- rgeom(1000, 0.1) + 1
hist(number_of_attempts2, xlab = "Number of Attempts Until First Success")
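To see why sampling from the geometric distribution is equivalent to flipping until the first success, here is a hedged Python sketch that draws the number of attempts by inverse-transform sampling. The function name is hypothetical, and this is not how R implements rgeom() internally as far as I know; it is only meant to illustrate the equivalence.

```python
import math
import random

def attempts_until_success(p, rng):
    """Number of attempts up to and including the first success."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is always defined
    # floor(log(u) / log(1 - p)) counts the failures before the first success.
    return math.floor(math.log(u) / math.log(1.0 - p)) + 1

rng = random.Random(42)
samples = [attempts_until_success(0.1, rng) for _ in range(100_000)]
mean_attempts = sum(samples) / len(samples)
print(mean_attempts)
```

The sample mean should land close to 1/p = 10, the expected number of attempts for a 1-in-10 chance.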
I would use the 'rle' function since you can make a lot of simulations in a short period of time. Use this to count the run of tails before a head:
> n <- 1e6
> # generate a long string of flips with unfair coin
> flips <- sample(0:1,
+ n,
+ replace = TRUE,
+ .... [TRUNCATED]
> counts <- rle(flips)
> # now pull out the "lengths" of "0" which will be the tails before
> # a head is flipped
> runs <- counts$lengths[counts$value == 0]
> sprintf("# of simulations: %d max run of tails: %d mean: %.1f\n",
+ length(runs),
+ max(runs),
+ mean(runs))
[1] "# of simulations: 90326 max run of tails: 115 mean: 10.0\n"
> ggplot()+
+ geom_histogram(aes(runs),
+ binwidth = 1,
+ fill = 'blue')
and you get a chart like this:
[Figure: histogram of runs]
I would tabulate the cumsum.
p=.1
N <- 1e8
set.seed(42)
tosses <- sample(0:1, N, T, prob=c(1-p, p))
attempts <- tabulate(cumsum(tosses))
length(attempts)
# [1] 10003599
hist(attempts, freq=F, col="#F48024")
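The idea behind tabulating the cumulative sum is that one long stream of flips gets cut into consecutive attempts, one per head. Here is a rough Python sketch of closely related bookkeeping. It is not a line-by-line translation: it counts the flips up to and including each head, whereas the R code counts from each head up to the next one. The function name is made up.

```python
import random

def attempts_per_success(flips):
    """Split a 0/1 flip sequence into attempt counts, one per head (1)."""
    attempts, current = [], 0
    for flip in flips:
        current += 1
        if flip == 1:
            attempts.append(current)
            current = 0
    return attempts  # trailing tails with no closing head are dropped

rng = random.Random(42)
flips = [1 if rng.random() < 0.1 else 0 for _ in range(200_000)]
attempts = attempts_per_success(flips)
print(len(attempts), sum(attempts) / len(attempts))
```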
New to R. Not sure how to go about this problem. Essentially I want to create a Monte Carlo simulation. However this is how it's supposed to go:
There are only 3 people (A, B, C), and they only come into contact once, in this order: (A-->B) (B-->C). Starting out, A is 100% sick, while B and C are 0% sick. Each time a person comes into contact with another, they have a 50% chance of being sick. How would I go about replicating that in R? I understand the mathematics behind this but am unsure how to code it. This is what I have so far:
a='positive'
corona = c('positive','negative')
sample(x = corona, size = 1, replace = TRUE)
I know the output will only give me the results of whether B is sick or not. How would i continue to see if C is sick?
You could write a helper function to do the "infecting". Let's assume that 1 is "sick" and 0 is "not sick"
infect <- function(x, rate=.5) {
current_sick <- x==1
new_sick <- runif(sum(current_sick)) < rate
x[current_sick] <- as.numeric(new_sick) # turn TRUE/FALSE to 1/0
x
}
This function looks where all the 1's are, then for each of those individuals, it tosses a coin to see if the next person will be sick (not changing any of the non-sick values).
Then to test with 1000 people, you can do
A <- rep(1, 1000) # all sick
B <- infect(A)
C <- infect(B)
This draws 1000 A's, and we go through two transmission steps, doing all 1000 samples at once. If you run mean(C) you should get something close to 0.25.
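The same chain can be written out one trial at a time, which may make the dependence of C on B easier to see. This is a minimal Python sketch, assuming (as above) that only a sick person can pass the illness on; the names are mine.

```python
import random

def simulate_chain(trials=100_000, rate=0.5, seed=0):
    """Return the simulated fractions of trials in which B and C end up sick."""
    rng = random.Random(seed)
    b_sick = c_sick = 0
    for _ in range(trials):
        b = rng.random() < rate        # A is always sick and meets B
        c = b and rng.random() < rate  # C can only be infected by a sick B
        b_sick += b
        c_sick += c
    return b_sick / trials, c_sick / trials

p_b, p_c = simulate_chain()
print(p_b, p_c)
```

The two fractions should come out near 0.5 and 0.5 * 0.5 = 0.25.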
I am trying to develop code that will tell me the likelihood of rolling at least one six when using 1 through 20 dice. Specifically, I am trying to build a single piece of code that loops through the problem space and generates this information. The question has left me at a loss.
I have tried using the sample function and looked at contingency tables.
die1 = sample(1:6,n,replace=T)
die2 = sample(1:6,n,replace=T)
sum_of_dice = die1 + die2
counts = table(sum_of_dice)
proba_empiric = counts/sum(counts)
barplot(proba_empiric)
The above provides the basis for a probability but not for the joint probability of two die.
The final code should be able to tell me the likelihood of rolling a six on 1 die, 2 die, 3 die, all the way to twenty die.
One way to simulate the probability of rolling at least one 6 using 1 to 20 dice is to use rbinom():
sapply(1:20, function(x) mean(rbinom(10000, x, 1/6) > 0))
[1] 0.1675 0.3008 0.4174 0.5176 0.5982 0.6700 0.7157 0.7704 0.8001 0.8345 0.8643 0.8916 0.9094 0.9220 0.9310
[16] 0.9471 0.9547 0.9623 0.9697 0.9718
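As a sanity check on the simulated values, the exact probability of at least one six with n fair, independent dice is 1 - (5/6)^n. A short Python sketch (the function name is just illustrative):

```python
def p_at_least_one_six(n_dice):
    """Exact probability of at least one six among n fair, independent dice."""
    return 1 - (5 / 6) ** n_dice

probs = [p_at_least_one_six(n) for n in range(1, 21)]
for n, p in enumerate(probs, start=1):
    print(n, round(p, 4))
```

The simulated numbers above should agree with these to within about two decimal places.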
If I am understanding you correctly, you have 20 dice and you want to know the probability of at least one six happening in them.
We can write a function to roll one die
roll_die <- function() sample(6, 1)
Then write another function which rolls 20 dice and checks if there is at least one six in it
roll_20_die <- function() {
any(replicate(20, roll_die()) == 6)
}
and replicate this function sufficient number of times to get the probability ratio
n <- 10000
table(replicate(n, roll_20_die()))/n
# FALSE TRUE
#0.0244 0.9756
I found this piece of code in the textbook "Statistics and Data Analysis for Financial Engineering," but I am confused about certain lines in it.
This code tries to answer the question: what is the probability that the value of the stock will be below $950,000 at the close of at least one of the next 45 trading days? The mean and SD are provided too.
Code:
niter = 1e5 # number of iterations
below = rep(0,niter) # set up storage
set.seed(2009)
for (i in 1:niter)
{
r = rnorm(45,mean=.05/253,
sd=.23/sqrt(253)) # generate random numbers
logPrice = log(1e6) + cumsum(r)
minlogP = min(logPrice) # minimum price over next 45 days
below[i] = as.numeric(minlogP < log(950000))
}
mean(below)
A few questions:
I don't understand logPrice = log(1e6) + cumsum(r). Why do we use log(1e6), and why do we have cumsum(r)?
What is the purpose of this: below[i] = as.numeric(minlogP < log(950000))
why do we use log(950000)? why do we need to log?
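I'm guessing that the current price is $1,000,000, hence log(1e6). The daily log returns have to be accumulated over the 45-day period, hence cumsum(r), which gives the log price at each close.
That line records, as 1 or 0, whether the minimum price over the 45 days fell below $950,000.
In quantitative finance the log return is commonly modeled as normally distributed, which makes the stock price (always positive) log-normal. Working on the log scale lets the returns simply be added up, so the threshold is compared as log(950000).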
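To make the role of the logs concrete, here is a hedged Python sketch of the same simulation, written as an explicit day-by-day walk on the log price. It is not the textbook's code: the function name, the smaller number of iterations and the early exit are my own choices.

```python
import math
import random

def prob_below(barrier=950_000, start=1_000_000, days=45,
               mu=0.05 / 253, sigma=0.23 / math.sqrt(253),
               n_iter=20_000, seed=2009):
    """Estimate P(price closes below barrier on at least one of the days)."""
    rng = random.Random(seed)
    log_barrier = math.log(barrier)
    below = 0
    for _ in range(n_iter):
        log_price = math.log(start)
        for _day in range(days):
            log_price += rng.gauss(mu, sigma)  # add one daily log return
            if log_price < log_barrier:        # same test as minlogP < log(950000)
                below += 1
                break
    return below / n_iter

print(prob_below())
```

Stopping at the first close below the barrier is equivalent to checking the minimum over all 45 days, because the minimum is below the barrier exactly when some daily close is.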
In some cases, a loop needs to run for a random number of iterations that ranges from min to max, inclusive. One working solution is to do something like this:
int numIterations = randomInteger(min, max);
for (int i = 0; i < numIterations; i++) {
/* ... fun and exciting things! ... */
}
A common mistake that many beginning programmers make is to do this:
for (int i = 0; i < randomInteger(min, max); i++) {
/* ... fun and exciting things! ... */
}
This recomputes the loop upper bound on each iteration.
I suspect that this does not give a uniform distribution of the number of times the loop will iterate that ranges from min to max, but I'm not sure exactly what distribution you do get when you do something like this. Does anyone know what the distribution of the number of loop iterations will be?
As a specific example: suppose that min = 0 and max = 2. Then there are the following possibilities:
When i = 0, the random value is 0. The loop runs 0 times.
When i = 0, the random value is nonzero. Then:
When i = 1, the random value is 0 or 1. Then the loop runs 1 time.
When i = 1, the random value is 2. Then the loop runs 2 times.
The probability of the first event is 1/3. The second event has probability 2/3, and within it, the first subcase has probability 2/3 and the second subcase has probability 1/3. Therefore, the average number of iterations is
0 × 1/3 + 1 × 2/3 × 2/3 + 2 × 2/3 × 1/3
= 0 + 4/9 + 4/9
= 8/9
Note that if the distribution were indeed uniform, we'd expect to get 1 loop iteration, but now we only get 8/9 on average. My question is whether it's possible to generalize this result to get a more exact value on the number of iterations.
Thanks!
Final edit (maybe!). I'm 95% sure that this isn't one of the standard distributions that are appropriate. I've put what the distribution is at the bottom of this post, as I think the code that gives the probabilities is more readable! A plot for the mean number of iterations against max is given below.
Interestingly, the number of iterations tails off as you increase max. Would be interesting if someone else could confirm this with their code.
If I were to start modelling this, I would start with the geometric distribution, and try to modify that. Essentially we're looking at a discrete, bounded distribution. So we have zero or more "failures" (not meeting the stopping condition), followed by one "success". The catch here, compared to the geometric or Poisson, is that the probability of success changes (also, like the Poisson, the geometric distribution is unbounded, but I think structurally the geometric is a good base). Assuming min=0, the basic mathematical form for P(X=k), 0 <= k <= max, where k is the number of iterations the loop runs, is, like the geometric distribution, the product of k failure terms and 1 success term, corresponding to k "false"s on the loop condition and 1 "true". (Note that this holds even to calculate the last probability, as the chance of stopping is then 1, which obviously makes no difference to a product).
Following on from this, an attempt to implement this in code, in R, looks like this:
fx = function(k,maximum)
{
n=maximum+1;
failure = factorial(n-1)/factorial(n-1-k) / n^k;
success = (k+1) / n;
failure * success
}
This assumes min=0, but generalizing to arbitrary mins isn't difficult (see my comment on the OP). To explain the code: first, as shown by the OP, the probabilities all have (max+1) as a denominator, so we calculate that denominator, n. Next, we calculate the product of the failure terms. Here factorial(n-1)/factorial(n-1-k) means, for example, for max=2, n=3 and k=2: 2*1. It generalises to give you (n-1)(n-2)... for the numerator of the total probability of failure, which is then divided by n^k. The probability of success increases as you get further into the loop, until finally, when k=maximum, it is 1.
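The same product of failure terms and one success term can be built up step by step with exact fractions, which avoids the large factorials. This is a small Python sketch assuming min=0; the function name is hypothetical.

```python
from fractions import Fraction

def iteration_pmf(maximum):
    """Exact P(X = k) for k = 0..maximum, assuming min = 0."""
    n = maximum + 1
    pmf = []
    survive = Fraction(1)  # probability the loop is still running when i = k
    for k in range(maximum + 1):
        stop = Fraction(k + 1, n)  # a random value <= k ends the loop here
        pmf.append(survive * stop)
        survive *= 1 - stop        # one more "failure" term in the product
    return pmf

pmf = iteration_pmf(2)
mean = sum(k * p for k, p in enumerate(pmf))
print(pmf, mean)
```

For max=2 this reproduces the OP's hand calculation: probabilities 1/3, 4/9 and 2/9, with mean 8/9.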
Plotting this analytic formula gives the same results as the OP, and the same shape as the simulation plotted by John Kugelman.
Incidentally the R code to do this is as follows
plot_probability_mass_function = function(maximum)
{
x=0:maximum;
barplot(fx(x,max(x)), names.arg=x, main=paste("max",maximum), ylab="P(X=x)");
}
par(mfrow=c(3,1))
plot_probability_mass_function(2)
plot_probability_mass_function(10)
plot_probability_mass_function(100)
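Mathematically, the distribution is, if I've got my maths right, given by (with m = max and min = 0):
$$P(X = x) = \left( \prod_{i=0}^{x-1} \frac{m - i}{m + 1} \right) \frac{x + 1}{m + 1}, \qquad 0 \le x \le m$$
which simplifies to
$$P(X = x) = \frac{m! \, (x + 1)}{(m - x)! \, (m + 1)^{x + 1}}$$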
(thanks a bunch to http://www.codecogs.com/latex/eqneditor.php)
The latter is given by the R function
f = function(x,m) { factorial(m)*(x+1)/(factorial(m-x)*(m+1)^(x+1)) }
Plotting the mean number of iterations is done like this in R
meanf = function(maximum)
{
x = 0:maximum
probs = f(x,maximum)
x %*% probs
}
par(mfrow=c(2,1))
max_range = 1:10
plot(sapply(max_range, meanf) ~ max_range, ylab="Mean number of iterations", xlab="max")
max_range = 1:100
plot(sapply(max_range, meanf) ~ max_range, ylab="Mean number of iterations", xlab="max")
Here are some concrete results I plotted with matplotlib. The X axis is the value i reached. The Y axis is the number of times that value was reached.
The distribution is clearly not uniform. I don't know what distribution it is offhand; my statistics knowledge is quite rusty.
1. min = 10, max = 20, iterations = 100,000
2. min = 100, max = 200, iterations = 100,000
I believe that it would still, given a sufficient amount of executions, conform to the distribution of the randomInteger function.
But this is probably a question better suited to be asked on MATHEMATICS.
I don’t know the math behind it, but I know how to compute it! In Haskell:
import Numeric.Probability.Distribution
iterations min max = iteration 0
where
iteration i = do
x <- uniform [min..max]
if i < x
then iteration (i + 1)
else return i
Now expected (iterations 0 2) gives you the expected value of ~0.89. Maybe someone with the requisite math knowledge can explain what I’m actually doing here. Because you start at 0, the loop will always run at least min times.
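For anyone who would rather check this by brute force, here is a minimal Python sketch that simply runs the loop with the bound re-drawn on every check. The function name is made up, and `rng.randint` stands in for the question's `randomInteger(min, max)`.

```python
import random

def count_iterations(lo, hi, rng):
    """Run the loop with a freshly drawn upper bound at every check."""
    i = 0
    while i < rng.randint(lo, hi):  # randint is inclusive on both ends
        i += 1
    return i

rng = random.Random(0)
runs = [count_iterations(0, 2, rng) for _ in range(100_000)]
mean_iterations = sum(runs) / len(runs)
print(mean_iterations)
```

The mean should sit near 8/9 for min=0, max=2. With a nonzero min, every run lasts at least min iterations, because the drawn bound can never be smaller than min.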