How to get this probability outcome in R?

New to R. Not sure how to go about this problem. Essentially I want to create a Monte Carlo simulation. However, this is how it's supposed to go:
There are only 3 people (A, B, C), and they only come into contact once, in this order: (A --> B), (B --> C). Starting out, A is 100% sick, while B and C are 0% sick. Each time a person comes into contact with another, that person has a 50% chance of becoming sick. How would I go about replicating that in R? I understand the mathematics behind this but am unsure how to code it. This is what I have so far:
a='positive'
corona = c('positive','negative')
sample(x = corona, size = 1, replace = TRUE)
I know the output will only give me the result of whether B is sick or not. How would I continue to see if C is sick?

You could write a helper function to do the "infecting". Let's assume that 1 means "sick" and 0 means "not sick":
infect <- function(x, rate = 0.5) {
  current_sick <- x == 1
  new_sick <- runif(sum(current_sick)) < rate
  x[current_sick] <- as.numeric(new_sick)  # turn TRUE/FALSE into 1/0
  x
}
This function looks where all the 1's are; then, for each of those individuals, it tosses a coin to see if the next person will be sick (leaving any non-sick values unchanged).
Then to test with 1000 people, you can do
A <- rep(1, 1000) # all sick
B <- infect(A)
C <- infect(B)
This draws 1000 A's, and we go through two transmission steps, doing all 1000 samples at once. If you run mean(C), you should get something close to 0.25.
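If you prefer to stay closer to the question's "positive"/"negative" wording, here is a minimal single-chain sketch (my addition, not part of the answer above; 'transmit' is an illustrative name):
# Alternative single-chain sketch (not from the answer above), closer to the question's code.
corona <- c("positive", "negative")
transmit <- function(carrier) {
  # Only a "positive" carrier can pass the infection on, with probability 0.5.
  if (carrier == "positive") sample(corona, size = 1) else "negative"
}
set.seed(1)
A <- "positive"
B <- transmit(A)
C <- transmit(B)
c(A = A, B = B, C = C)
# Repeating the whole chain many times estimates P(C sick), which should be about 0.25:
mean(replicate(10000, transmit(transmit("positive"))) == "positive")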


Random Serial Dictatorship algorithm vs linear programming in solving an assignment problem

I just came across this post, which left me quite baffled.
If I had to assign N objects to N users, knowing that each user has expressed a rank preference for the objects, with the goal of achieving the lowest possible average rank (which is what the post states), my automatic choice would be linear programming.
Instead, this Random Serial Dictatorship is suggested, and it is stated that it achieves some form of optimality.
I was not sure how what looked to me like a greedy stochastic algorithm could guarantee optimality, so I tried it out.
Suppose that 3 users A, B, C have expressed rank preferences for 3 houses H1, H2, H3:
      user
house  A  B  C
   H1  3  2  1
   H2  1  1  2
   H3  2  3  3
We want to assign a house to each user, so that the average (or sum) of ranks is minimal.
If I understood it correctly, the Random Serial Dictatorship algorithm requires choosing a random order for the users and allowing each of them, in turn, to select the house they prefer.
For me it's clear that this kind of strategy may: 1) not result in the same sum of ranks each time, 2) consequently not guarantee optimality.
Imagine that A chooses first, B second and C third. A will select H2 (rank 1). But that's also B's preferred house, so B will have to go for H1 (rank 2), and C will be left with their least preferred house H3 (rank 3). Ranks = 1, 2, 3. Sum = 6, average = 2.
If, on the other hand, B chooses first, A second and C third: B = H2 (1), A = H3 (2), C = H1 (1). Ranks = 1, 2, 1. Sum = 4, average = 4/3.
As an R simulation:
# 3 users A, B, C want to buy a house each, chosen from H1, H2, H3.
# Their preferences are expressed by 'rank' (1 = first choice, 2 = second choice, etc.).
d0 <- data.frame("user"  = rep(c("A", "B", "C"), each = 3),
                 "house" = rep(c("H1", "H2", "H3"), 3),
                 "rank"  = c(3, 1, 2, 2, 1, 3, 1, 2, 3))
# 1. Assignment by random serial dictatorship
set.seed(232425)
all_ranks <- numeric(0)
for (i in 1:1000) {
  d <- d0
  # Create a random order of priority for the users.
  o <- setNames(sample(1:3), c("A", "B", "C"))
  # Let users choose their preferred house in turn, according to the created order.
  d["order"] <- o[d$user]
  d <- d[order(d$order, d$rank), ]
  for (j in 1:2) {  # 'j' avoids shadowing the outer loop index
    h <- d[j, "house"]
    d <- rbind(d[1:j, ], d[(d$order > j) & (d$house != h), ])
  }
  ranks <- d$rank
  ranks <- ranks[order(d$user)]
  all_ranks <- rbind(all_ranks, ranks)
  # print(d)
}
all_ranks <- setNames(as.data.frame(all_ranks), c("A", "B", "C"))
all_ranks_summary <- cbind("ID" = 1, setNames(stack(all_ranks), c("rank", "user")))
all_ranks_summary <- aggregate(ID ~ user + rank, all_ranks_summary, length)
barplot(ID ~ rank + user, all_ranks_summary, beside = TRUE, col = 2:4, legend.text = TRUE)
boxplot(rowMeans(all_ranks), main = "average rank")
Frequency of each rank for the 3 users and the average rank:
Clearly the average rank is not minimal in all cases.
Using a linear programming assignment method instead:
# 2. Assignment by linear programming
require(lpSolve)
cm <- xtabs(rank ~ house + user, d0)
lp.out <- lp.assign(cm)
lp.out$solution * cm
yields a guaranteed optimal solution:
      user
house  A  B  C
   H1  0  0  1
   H2  0  1  0
   H3  2  0  0
My questions are:
1. Am I misunderstanding the Random Serial Dictatorship algorithm? Can it actually be written in a way that guarantees optimality?
2. Is a linear programming assignment method almost as bad, computationally, as the brute force enumeration of all combinatorial possibilities?
Perhaps I am wrong, but I'm just puzzled that one should start from a goal like:
"assign each user an option so that the average rank of the assigned option in that user's ranked list is minimized across all users"
and then settle for an algorithm that, as 'fair' and as 'fast' as it may be, does not seem to achieve that goal at all.
Unless I am completely missing the point, which is possible.
It's "Pareto optimal", which is to say, not optimal in general. Pareto optimal just means that you would need to harm one or more of the participants to improve the objective.

Having trouble solving simulation

I got a question related to probability theory and I tried to solve it by simulating it in R. However, I ran into a problem as the while loop does not seem to break.
The question is asking: How many people are needed such that there is at least a 70% chance that one of them is born on the last day of December?
Here is my code:
prob <- 0
people <- 1
while (prob <= 0.7) {
  people <- people + 1  # start the iteration with 2 people in the room and add 1 every iteration
  birthday <- sample(365, size = people, replace = TRUE)
  prob <- length(which(birthday == 365)) / people
}
return(prob)
My guess is that prob can never hit 70%, and therefore the while loop never breaks. Am I right? If so, did I interpret the question wrongly?
I did not want to post this on stats.stackexchange.com because I thought this is more about the code than the math itself, but I will move it if necessary. Thanks.
This is a case where an analytical solution based on probability is easier and more accurate than trying to simulate. I agree with Harshvardhan that your formulation is solving the wrong problem.
The probability of having at least one person in a pool of n have their birthday on a particular target date is 1-P{all n miss the target date}. This probability is at least 0.7 when P{all n miss the target date} < 0.3. The probability of each individual missing the target is assumed to be P{miss} = 1-1/365 (365 days per year, all birthdates equally likely). If the individual birthdays are independent, then P{all n miss the target date} = P{miss}^n.
I am not an R programmer, but the following Ruby should translate pretty easily:
# Use rationals to avoid cumulative float errors.
# Makes it slower but accurate.
P_MISS_TARGET = 1 - 1/365r
p_all_miss = P_MISS_TARGET
threshold = 3r / 10   # seeking P{all miss target} < 0.3
n = 1
while p_all_miss > threshold
  p_all_miss *= P_MISS_TARGET
  n += 1
end
puts "With #{n} people, the probability all miss is #{p_all_miss.to_f}"
which produces:
With 439 people, the probability all miss is 0.29987476838793214
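Since the question is tagged R, here is one possible direct translation of that loop (my own sketch, not part of the original answer); a closed-form calculation gives the same count.
# A possible R translation of the Ruby loop above (a sketch, not from the original answer).
p_miss <- 1 - 1 / 365          # probability one person misses Dec 31
p_all_miss <- p_miss           # n = 1 case
n <- 1
while (p_all_miss > 0.3) {     # stop once P{all n miss} <= 0.3, i.e. P{at least one hit} >= 0.7
  p_all_miss <- p_all_miss * p_miss
  n <- n + 1
}
c(n = n, p_all_miss = p_all_miss)
# The closed form agrees: ceiling(log(0.3) / log(364/365)) is 439.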
Addendum
I got curious, since my answer differs from the accepted one, so I wrote a small simulation. Again, I think it's straightforward enough to understand even though it's not in R:
require 'quickstats'  # Stats "gem" available from rubygems.org

def trial
  n = 1
  # Keep adding people to the count until one of them hits the target
  n += 1 while rand(1..365) != 365
  return n
end

def quantile(percentile = 0.7, number_of_trials = 1_000)
  # Create an array containing results from specified number of trials.
  # Defaults to 1000 trials
  counts = Array.new(number_of_trials) { trial }
  # Sort the array and determine the empirical target percentile.
  # Defaults to 70th percentile
  return counts.sort[(percentile * number_of_trials).to_i]
end

# Tally the statistics of 100 quantiles and report results,
# including margin of error, formatted to 3 decimal places.
stats = QuickStats.new
100.times { stats.new_obs(quantile) }
puts "#{"%.3f" % stats.avg}+/-#{"%.3f" % (1.96*stats.std_err)}"
Five runs produce outputs such as:
440.120+/-3.336
440.650+/-3.495
435.820+/-3.558
439.500+/-3.738
442.290+/-3.909
which is strongly consistent with the analytical result derived earlier and seems to differ significantly from other responders' answers.
Note that on my machine the simulation takes roughly 40 times longer than the analytical calculation, is more complex, and introduces uncertainty. To increase the precision you would need larger sample sizes, and thus longer run times. Given these considerations, I would reiterate my advice to go for the direct solution in this case.
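For completeness, a quick R cross-check of the same quantity (my own addition, not part of the answer): the number of people needed is the 70th percentile of a geometric waiting time with success probability 1/365.
# Quick R cross-check (my addition, not from the answer above).
# rgeom() counts failures before the first success, so add 1 to get the number of people.
qgeom(0.7, prob = 1 / 365) + 1                   # exact quantile: 439
set.seed(1)
quantile(rgeom(1e5, prob = 1 / 365) + 1, 0.7)    # simulated: close to 439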
Indeed, your probability will (almost) never reach 0.7: what you compute is the share of people whose birthday is day 365, and as people grows that share settles around 1/365, so the while loop never breaks.
Furthermore, to calculate a probability for a given number of people, you should draw many samples and average over them. Here is one way to achieve that:
N = 450                       # max. number of people to try
probs = array(numeric(), N)   # empty array to store the estimated probabilities

# try all people counts in the range 1:N
for(people in 1:N){
  # draw 200 samples to estimate the probability
  samples = 200
  successes = 0
  for(i in 1:samples){
    birthday <- sample(365, size = people, replace = TRUE)
    total_last_day <- sum(birthday == 365)
    if(total_last_day >= 1){
      successes <- successes + 1
    }
  }
  # store the estimated probability in the array
  probs[people] = successes/samples
}

# people counts that achieved a probability > 0.7
which(probs > 0.7)
As this is a simulation, the result depends on the run. Increasing the number of samples would make the result more stable.
You are solving the wrong problem. The question is, "How many people are needed such that there is at least a 70% chance that one of them is born on the last day of December?". What you are finding now is "How many people are needed such that 70% have their birthdays on the last day of December?". The answer to the second question is close to zero. But the first one is much simpler.
Replace prob <- length(which(birthday == 365)) / people with check = any(birthday == 365) in your logic because at least one of them has to be born on Dec 31. Then, you will be able to find if that number of people will have at least one person born on Dec 31.
After that, you will have to rerun the simulation multiple times to generate an empirical probability distribution (a kind of Monte Carlo). Only then can you check the probability.
Simulation Code
people_count = function(i)
{
  set.seed(i)
  for (people in 1:10000)
  {
    birthday = sample(365, size = people, replace = TRUE)
    check = any(birthday == 365)
    if(check == TRUE)
    {
      pf = people
      break
    }
  }
  return(pf)
}
The people_count() function returns the number of people required so that at least one of them was born on Dec 31 (its argument i is just used to set the seed). Then I rerun the simulation 10,000 times.
# Number of simulations
nsim = 10000
library(magrittr)  # needed for the %>% pipe (any package exporting it will do)
l = lapply(1:nsim, people_count) %>%
  unlist()
Let's see the distribution of the number of people required.
To find actual probability, I'll use cumsum().
> cdf = cumsum(l/nsim)
> which(cdf>0.7)[1]
[1] 292
So, on average, you would need 292 people to have more than a 70% chance.
In addition to #pjs's answer, I would like to provide one myself, written in R. I attempted to solve this question by simulation rather than analytically, and I am sharing it in case it helps someone else with the same problem. It's not that well written, but the idea is there:
# create a function which will find if anyone is born on the last day
last_day <- function(x){
  birthdays <- sample(365, size = x, replace = TRUE)  # randomly draw everyone's birthdays
  if(length(which(birthdays == 365)) >= 1) {
    TRUE   # return TRUE if at least one person is born on the last day
  } else {
    FALSE
  }
}
# find out how many people are needed to get 70%
people <- 0  # set number of people to zero
prob <- 0    # set prob to zero
while (prob <= 0.7) {                               # loop does not stop until it hits 70%
  people <- people + 1                              # increase the number of people every iteration
  prob <- mean(replicate(10000, last_day(people)))  # run last_day 10000 times to estimate the probability
}
print(people)
last_day() only returns TRUE or FALSE, so in every iteration of the loop I run last_day() 10000 times to find out how often, out of those 10000 runs, one or more people are born on the last day (this gives the probability). I keep the loop running until the probability is 70% or more, then print the number of people.
The answer I get from running the loop once is 440, which is quite close to the answer provided by #pjs.

How to find the probability of extinction = 1 using Galton-Watson process in R?

I am simulating a basic Galton-Watson process (GWP) using a geometric distribution. I'm using this to find the probability of extinction for each generation. My question is, how do I find the generation at which the probability of extinction is equal to 1?
For example, I can create a function for the GWP like so:
# Galton-Watson Process for geometric distribution
GWP <- function(n, p) {
Sn <- c(1, rep(0, n))
for (i in 2:(n + 1)) {
Sn[i] <- sum(rgeom(Sn[i - 1], p))
}
return(Sn)
}
where n is the number of generations.
Then, if I set the geometric distribution parameter p = 0.25, to calculate the probability of extinction for, say, generation 10, I just do this:
N <- 10 # Number of elements in the initial population.
GWn <- replicate(N, GWP(10, 0.25)[10])
probExtinction <- sum(GWn==0)/N
probExtinction
This will give me the probability of extinction for generation 10. To find the probability of extinction for each generation, I have to change the index value (to the corresponding generation number) when creating GWn. But what I'm trying to do is find at which generation the probability of extinction equals 1.
Any suggestions as to how I might go about solving this problem?
I can tell you how you would do this problem in principle, but I'm going to suggest that you may run into some difficulties (if you already know everything I'm about to say, just take it as advice to the next reader ...)
- Theoretically, the Galton-Watson process extinction probability never goes exactly to 1 (unless prob==1, or in the infinite-time limit).
- Of course, for any given replicate and random-number seed you can compute the first time point (if any) at which all of your lineages have gone extinct. This will be highly variable across runs, depending on the random-number seed ...
- The distribution of extinction times is extremely skewed; lineages that don't go extinct immediately will last a loooong time ...
- I modified your GWP function in two ways to make it more efficient: (1) stop the simulation when the lineage goes extinct; (2) replace the sum of geometric deviates with a single negative binomial deviate (see here).
GWP <- function(n, p) {
  Sn <- c(1, rep(0, n))
  for (i in 2:(n + 1)) {
    Sn[i] <- rnbinom(1, size = Sn[i - 1], prob = p)
    if (Sn[i] == 0) break  ## extinct, bail out
  }
  return(Sn)
}
The basic strategy now is: (1) run the simulations for a while, keep the entire trajectory; (2) compute extinction probability in every generation; (3) find the first generation such that p==1.
set.seed(101)
N <- 10 # Number of elements in the initial population.
maxgen <- 100
GWn <- replicate(N, GWP(maxgen, 0.5), simplify="array")
probExtinction <- rowSums(GWn==0)/N
which(probExtinction==1)[1]
(Subtract 1 from the last result if you want to start indexing from generation 0.) In this case the answer is NA, because 1 of the 10 lineages manages to stay alive (and indeed gets very large, so it will probably persist almost forever).
plot(0:maxgen, probExtinction, type="s") ## plot extinction probability
matplot(1+GWn,type="l",lty=1,col=1,log="y") ## plot lineage sizes (log(1+x) scale)
## demonstration that sum(rgeom(n, ...)) is equivalent to rnbinom(1, size = n, ...)
nmax <- 70
plot(prop.table(table(replicate(10000, sum(rgeom(10, prob = 0.3))))),
     xlim = c(0, nmax))
points(0:nmax, dnbinom(0:nmax, size = 10, prob = 0.3), col = 2, pch = 16)
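As a side note (my own addition, not part of the answer above), the per-generation extinction probability can also be computed exactly, without simulation, by iterating the offspring probability generating function f(s) = p / (1 - (1 - p) s) (the pgf of rgeom()'s geometric distribution on 0, 1, 2, ...), starting from q0 = 0. This makes the first bullet point above concrete.
# Exact extinction probability by generation via the offspring pgf (my sketch, not from the answer).
extinction_prob <- function(p, maxgen) {
  q <- numeric(maxgen)
  s <- 0
  for (g in 1:maxgen) {
    s <- p / (1 - (1 - p) * s)  # q_g = f(q_{g-1})
    q[g] <- s
  }
  q
}
round(extinction_prob(0.5, 10), 4)    # critical case: creeps towards 1 but never reaches it
tail(extinction_prob(0.25, 1000), 1)  # supercritical case: converges to 1/3, not 1
For p = 0.5 (the value used above) the probabilities approach 1 only in the limit, and for p = 0.25 (the value in the question) they converge to about 1/3, so no finite generation ever has extinction probability exactly equal to 1.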

R: How would I repeatedly simulate how many attempts before a success on a 1/10 chance? (and record how many attempts it took?)

R and probability noob here. I'm looking to create a histogram that shows the distribution of how many attempts it took to return a heads, repeated over 1000+ simulated runs on the equivalent of an unfairly weighted coin (0.1 heads, 0.9 tails).
From my understanding, this is not a geometric distribution or binomial distribution (but might make use of either of these to create the simulated results).
The real-world (ish) scenario I am looking to model this for is a speedrun of the game Zelda: Ocarina of Time. One of the goals in this speedrun is to obtain an item from a character that has a 1 in 10 chance of giving the player the item each attempt. As such, the player stops attempting once they receive the item (which they have a 1/10 chance of receiving each attempt). Every run, runners/viewers will keep track of how many attempts it took to receive the item during that run, as this affects the time it takes the runner to complete the game.
This is an example of what I'm looking to create:
(though with more detailed labels on the x axis if possible). In this, I manually flipped a virtual coin with a 1/10 chance of heads over and over; once I got a successful result, I recorded how many attempts it took into a vector in R, and repeated this about 100 times. I then mapped this vector onto a histogram to visualise the distribution of how many attempts it usually takes to get a successful result. Basically, I'd like to automate this simulation instead of having to manually flip the virtual unfair coin, write down how many attempts it took before heads, and enter it into R myself.
I'm not sure if this is quite what you're looking for, but if you create a function for your manual coin flipping, you can just use replicate() to call it many times:
foo <- function(p = 0.1) {
  i <- 0
  failure <- TRUE
  while (failure) {
    i <- i + 1
    if (sample(x = c(TRUE, FALSE), size = 1, prob = c(p, 1 - p))) {
      failure <- FALSE
    }
  }
  return(i)
}
set.seed(42)
number_of_attempts <- replicate(1000, foo())
hist(number_of_attempts, xlab = "Number of Attempts Until First Success")
As I alluded to in my comment though, I'm not sure why you think the geometric distribution is inappropriate.
It "is used for modeling the number of failures until the first success" (from the Wikipedia on it).
So, we can just sample from it and add one; the approaches are equivalent, but this will be faster when your number of samples is high:
number_of_attempts2 <- rgeom(1000, 0.1) + 1
hist(number_of_attempts2, xlab = "Number of Attempts Until First Success")
I would use the 'rle' function since you can make a lot of simulations in a short period of time. Use this to count the run of tails before a head:
> n <- 1e6
> # generate a long string of flips with unfair coin
> flips <- sample(0:1,
+ n,
+ replace = TRUE,
+ .... [TRUNCATED]
> counts <- rle(flips)
> # now pull out the "lengths" of "0" which will be the tails before
> # a head is flipped
> runs <- counts$lengths[counts$value == 0]
> sprintf("# of simulations: %d max run of tails: %d mean: %.1f\n",
+ length(runs),
+ max(runs),
+ mean(runs))
[1] "# of simulations: 90326 max run of tails: 115 mean: 10.0\n"
> ggplot()+
+ geom_histogram(aes(runs),
+ binwidth = 1,
+ fill = 'blue')
and you get a chart like this:
Histogram of runs
I would tabulate the cumsum.
p <- 0.1
N <- 1e8
set.seed(42)
tosses <- sample(0:1, N, TRUE, prob = c(1 - p, p))
attempts <- tabulate(cumsum(tosses))
length(attempts)
# [1] 10003599
hist(attempts, freq = FALSE, col = "#F48024")

Count the Number of 6s Rolled on a Number of Dice in R

I am trying to develop code that will tell me the likelihood of rolling at least one six when using 1 through 20 dice. I am specifically trying to build a single piece of code that loops through the problem space and generates this information. The question has left me at a loss.
I have tried using the sample function and looked at contingency tables.
n = 10000                         # number of simulated rolls (value not given in the question)
die1 = sample(1:6, n, replace = T)
die2 = sample(1:6, n, replace = T)
sum_of_dice = die1 + die2
counts = table(sum_of_dice)
proba_empiric = counts / sum(counts)
barplot(proba_empiric)
The above provides the basis for a probability, but not for the joint probability of two dice.
The final code should be able to tell me the likelihood of rolling a six on 1 die, 2 dice, 3 dice, all the way up to twenty dice.
One way to simulate the probability of rolling at least one 6 using 1 to 20 dice is to use rbinom():
sapply(1:20, function(x) mean(rbinom(10000, x, 1/6) > 0))
[1] 0.1675 0.3008 0.4174 0.5176 0.5982 0.6700 0.7157 0.7704 0.8001 0.8345 0.8643 0.8916 0.9094 0.9220 0.9310
[16] 0.9471 0.9547 0.9623 0.9697 0.9718
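For comparison (my addition, not part of the answer), the exact probabilities are 1 - (5/6)^n, which the simulated values track closely:
# Exact probability of at least one six with n dice (added for comparison, not from the answer).
round(1 - (5 / 6)^(1:20), 4)
# e.g. the n = 20 value is about 0.9739, close to the simulated 0.9718 above.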
If I am understanding you correctly, you have 20 dice and you want to know the probability of at least one six among them.
We can write a function to roll one die
roll_die <- function() sample(6, 1)
Then write another function which rolls 20 dice and checks if there is at least one six among them
roll_20_die <- function() {
  any(replicate(20, roll_die()) == 6)
}
and replicate this function a sufficient number of times to estimate the probability
n <- 10000
table(replicate(n, roll_20_die()))/n
# FALSE TRUE
#0.0244 0.9756
