Estimate the chance of rolling at least one 1 in n rolls of a fair six-sided die - r

Similar to the de Méré problem.
I want to run a Monte Carlo simulation to estimate the probability of rolling at least one 1 in n rolls of a fair six-sided die, using m simulation runs.
My code:
m <- 5000
n <- 3
x <- replicate(m, sample(1:6, n, TRUE) == 1)
p <- sum(x) / m
p is the estimated probability. Here I get the value 0.4822.
My questions:
1) Is there another way to do this without using sum?
2) I suspect the code is wrong, because the probability seems too high.

Although the question as stated is a little unclear, the code suggests you want to estimate the chance of obtaining at least one outcome of "1" among n independent dice and that you aim to estimate this by simulating the experiment m times.
Program simulations from the inside out. Begin with a single iteration. You started well, but to be perfectly clear let's redo it using a highly suggestive syntax. Try this:
1 %in% sample(1:6,n,TRUE)
This uses sample to realize the results of n independent fair dice and checks whether the outcome 1 appears among any of them.
Once you are satisfied that this emulates your experiment (run it a bunch of times), then indeed replicate will perform the simulation:
x <- replicate(m, 1 %in% sample(1:6,n,TRUE))
That produces m results. Each will be TRUE (interpreted as 1) in iterations where a 1 appeared and FALSE (interpreted as 0) otherwise. Consequently, the proportion of iterations in which a 1 appeared can be obtained as
mean(x)
This empirical frequency is a good estimate of the theoretical probability.
As a check, note that a 1 will fail to appear on a single die with probability 1 - 1/6 = 5/6 and therefore, because the n dice are independent, will fail to appear on all of them with probability (5/6)^n. Consequently the chance that at least one 1 appears must be 1 - (5/6)^n. Let us output both values: the simulation mean and the theoretical result. We might also include a Z score, which measures how far the mean is from the theoretical result. Typically, Z scores between -2 and 2 aren't significant evidence of any discrepancy.
Here's the full code. Although there are faster ways to write it, this is very fast already and is about as clear as one could make it.
m <- 5000    # Number of simulation iterations
n <- 3       # Number of dice per iteration
set.seed(17) # For reproducible results
x <- replicate(m, 1 %in% sample(1:6, n, TRUE))
# Compare to the theoretical result.
theory <- 1 - (5/6)^n
avg <- mean(x)
Z <- (avg - theory) / sd(x) * sqrt(length(x))
c(Mean = signif(avg, 5), Theoretical = signif(theory, 5), Z.score = signif(Z, 3))
The output is
       Mean Theoretical     Z.score
     0.4132      0.4213     -1.1600
Notice that neither result is anywhere near n/6, which would be 1/2 = 0.500.
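As an aside on the "faster ways" mentioned above (this variant is my addition, not part of the original answer), the simulation can be vectorized by drawing all m * n dice at once and checking each column of the resulting matrix for a 1:
rolls <- matrix(sample(1:6, m * n, replace = TRUE), nrow = n)  # one column per iteration
mean(colSums(rolls == 1) > 0)  # proportion of iterations containing at least one 1
At m = 5000 the speed difference is negligible, but it grows with the iteration count.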

Decimal precision problems with runif

I'm running into issues when simulating low-probability events with runif in R, and I am wondering how to solve this.
Consider the following example of an experiment where we simulate TRUE values with probability 5e-10 in a sample of size 1e9, and check whether any sample came out TRUE. This experiment is repeated 10 times:
set.seed(123)
probability <- 0.0000000005  # 5e-10
n_samples <- 1000000000      # 1e9
n_tries <- 10
for (i in 1:n_tries) {
  print(any(runif(n = n_samples, min = 0, max = 1) < probability))
}
The code above runs relatively fast, and nearly half of the replicates return TRUE, as expected (the expected event count per replicate is 1e9 * 5e-10 = 0.5, so the chance of at least one event is 1 - exp(-0.5) ≈ 0.39).
However, as soon as the probability becomes 5e-11 (probability <- 0.00000000005), that expectation fails and no TRUE values are returned even if the number of replicates is increased (I used n_tries <- 100 twice with no luck; the whole run took about an hour).
This suggests runif is not returning values with precision down to 11 decimal places. That was unexpected, as R to my understanding works with up to 16 decimal digits of precision, and we may need to simulate processes with probabilities that small (around 15 decimal places).
Is this why runif fails to produce the expected output? Are there any alternatives/solutions to this problem?
Thank you
EDIT: I have run a test to check whether this problem could be related to boundary bias (a reduced probability density near the extreme values 0 and 1). To do so, a constant (e.g. k <- 0.5) is added to the result of runif, and the sum is compared against probability plus that same constant. However, that does not fix the issue.
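A likely explanation and a workaround (my note; this was not part of the original question): R's supplied uniform generators return at most 32 random bits, so runif values lie on a grid with spacing of roughly 2^-32 ≈ 2.3e-10, and events rarer than that essentially never trigger. One way around it is to skip runif and draw the count of events directly with rbinom, whose sampling algorithm handles tiny probabilities:
probability <- 0.00000000005  # 5e-11
n_samples <- 1000000000       # 1e9
n_tries <- 10
set.seed(123)
for (i in 1:n_tries) {
  # number of events among n_samples trials, drawn in one call
  print(rbinom(1, size = n_samples, prob = probability) > 0)
}
With an expected count of 1e9 * 5e-11 = 0.05 per replicate, most replicates should correctly print FALSE.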

Birth/death model over time simulation

I'm trying to plot a graph that will show species number over time using a birth/death model. So far, none of the packages and models I've seen allow me to input a starting number of species, which is problematic as my extinction rate is higher than my speciation rate.
The data I have for this question is:
speciation rate (or b) = 0.0157,
extinction rate (or d) = 53.3,
starting number of species = 250000.
Currently I've tried using rbdtree and simbdtree.
Thank you in advance
Assuming you don't need the phylogenetic tree of the species that are generated, and just want to know the number of species over time, you don't really need a package for this: the code is fairly simple (if you know what equations to implement).
I'll also assume that you want a stochastic simulation model.
Let's say for simplicity that you only want to know the numbers of species at the times at which speciation or extinction events occur (this is a little easier than figuring them out at equally spaced times). Let's say the per capita speciation/birth rate is b and the per capita extinction/death rate is d (it's fairly standard to assume these rates are constant per capita; you could make other assumptions if you wanted). Then if there are currently s species present, the total rate of events is (b+d)*s, the probability that the next event is a birth is b/(b+d), and the waiting time until the next event is exponentially distributed with rate (b+d)*s. (This is essentially the Gillespie algorithm.)
num_events <- 1000000
s <- rep(NA, num_events + 1)  # species counts
t <- rep(NA, num_events + 1)  # event times
t[1] <- 0
s[1] <- 250000
b <- 0.0157
d <- 53.3
set.seed(101)
for (i in 1:num_events) {
  if (s[i] == 0) break  ## stop if extinct
  delta_t <- rexp(1, rate = (b + d) * s[i])
  t[i + 1] <- t[i] + delta_t
  if (runif(1) < b/(b + d)) s[i + 1] <- s[i] + 1 else s[i + 1] <- s[i] - 1
}
plot(t, s, type = "s", log = "y")
curve(s[1] * exp((b - d) * x), add = TRUE, lwd = 3, col = adjustcolor("red", alpha = 0.4))
You can see that until you get down to a few hundred species, it's barely worth bothering with a stochastic simulation: the theoretical exponential decay curve matches the population dynamics almost exactly while the numbers are large.
The code could easily be sped up a little, at some small cost to interpretability (though it should take no more than a few seconds as written).
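For example (a sketch of that speed-up, my own variant rather than the original answer's): because the birth probability b/(b+d) does not depend on the current state, the event types and unit-rate exponential draws can be generated up front, replacing the loop with cumsum:
set.seed(101)
steps <- ifelse(runif(num_events) < b/(b + d), 1, -1)  # +1 speciation, -1 extinction
s <- c(250000, 250000 + cumsum(steps))                 # species count after each event
# rexp(1, rate) has the same distribution as rexp(1)/rate, so scale unit exponentials
t <- c(0, cumsum(rexp(num_events) / ((b + d) * head(s, -1))))
ext <- match(0, s)  # truncate the trajectory at extinction, if it is reached
if (!is.na(ext)) { s <- s[1:ext]; t <- t[1:ext] }
plot(t, s, type = "s", log = "y")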

Is there an R function to find the maximum run length within a given interval for a coin simulation?

Hope you are doing well. I was asked to design and conduct a simulation study to estimate the probability that the observed maximum run length (a run is a sequence of consecutive heads or tails) for a fair coin-flipping experiment is in the interval [9, 11] with a sample size of n = 1000. This is my attempt so far:
result <- replicate(10000, {  # replicate 10000 times
  experiment <- sample(c("T", "H"), size = 1000, replace = TRUE)  # 1000 flips
  expe_run <- rle(experiment)        # find the runs
  expe_val <- expe_run$values        # values of the runs
  expe_length <- expe_run$lengths    # lengths of the runs
  as <- list(expe_length, expe_val)  # make a list for the sapply function
  max_run <- sapply(as, FUN = max)   # apply max to both
  head_run <- expe_length[which(expe_val == 'H')]  # the head runs
  tail_run <- expe_length[which(expe_val == 'T')]  # the tail runs
  max_run
})
probability <- table(result) / 10000  # probability for each run
probability
The problem is that I don't know how to finish the question, which is to estimate the probability that the observed maximum run length for a fair coin-flipping experiment is in the interval [9, 11], even though I have the table of every possible outcome. Can you please help me out? Thank you.
Are you looking for the probability, across the 10,000 trials, that the maximum value in expe_length is contained in the interval [9,11]?
If so, something like:
result <- replicate(10000, {
  ...
  expe_length <- expe_run$lengths  # lengths of the runs
  max(expe_length) %in% 9:11
})
should give you result as a vector of TRUE/FALSE values indicating whether that trial had a maximum run length in that interval.
Afterwards,
sum(result) / length(result)
will give you the proportion you're after.
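Putting the pieces together (a sketch; the seed here is my own addition for reproducibility):
set.seed(1)
result <- replicate(10000, {
  experiment <- sample(c("T", "H"), size = 1000, replace = TRUE)
  max(rle(experiment)$lengths) %in% 9:11  # TRUE if the max run is in [9, 11]
})
mean(result)  # the estimated probability
mean(result) computes the same proportion as sum(result) / length(result).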

rbinom function in r, trying to get the probability

I am trying to estimate the probability that a family with two children has two girls, using the rbinom function. The trouble is that rbinom does not give a probability; it just gives random values. How can I turn its output into a probability?
The usage of rbinom is
rbinom(
  n,    # The number of samples (or iterations if you like)
  size, # The size of each sample
  prob  # The probability of "success" for your definition of success
)
If you assume the probability of having a girl is 50% (it's not - there are a number of confounding variables and the actual birth ratio of girls to boys is closer to 100:105), then you can calculate the outcome of one "trial" with a sample size of 2 children.
rbinom(1, 2, 0.5)
You will get an outcome of 0, 1, or 2 girls (it is random). This does not give you the probability that they are both girls. You have to complete multiple "trials".
Here is an example with n = 10. I am using set.seed to provide a specific initial state to the RNG to make the results reproducible.
set.seed(100)
num_girls <- rbinom(10, 2, 0.5)
You can do a little trick to find out how many times there are two girls.
sum(num_girls == 2)  # Does the comparison and adds; TRUE = 1 and FALSE = 0
# [1] 1
So two girls occurred in 1 of the 10 trials. Ten trials is not enough to approach the true probability, but you should get the idea.
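With more trials the estimate converges on the exact value, which for two independent children is 0.5^2 = 0.25 (a quick sketch; the trial count of 1e5 is arbitrary):
set.seed(100)
mean(rbinom(1e5, 2, 0.5) == 2)  # should be close to 0.25
dbinom(2, 2, 0.5)               # the exact probability: 0.25
Here mean() does the compare, sum, and divide in one step.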

Programming probability of patients being well in R

I'm going to preface this with the fact that I am a complete R novice.
I have the following problem:
Consider a simple model that progresses year-by-year. In year i, let W_i = patient is well, I_i = patient is ill, and D_i = patient is dead. Transitions can be modeled as a set of conditional probabilities.
Let L = number of years that the patient is well.
I have come up with the probability mass function of L to be P(L)=(1-p)(p)^{L-1}.
The given information is that the patient is well in year 1, and given their age and risk factors, P(W_{i+1}|W_{i}) = 0.2 for all i.
The problem is to write a function in R that simulates the trajectory of a single patient and returns the number of years the patient is well.
I thought that this could be programmed in R as a binomial distribution using the rbinom function. For a single patient,
rbinom(1, 1, 0.2)
but I don't think that this would return the number of years that the patient is well. I'm thinking that the rbinom function should be the start, and that it would need to be paired with a way to count the number of years that a patient is well, but I don't know how to do that.
The next problem is to use R to simulate 1000 patient trajectories and find the sample mean of years of wellness. I'm assuming this would be an extension of the previous part, just replacing the 1 patient with 1000. However, I can't quite figure out whether the 1 should be replaced by 1000 in n or in size:
rbinom(n, size, prob)
This is assuming that using rbinom is the correct thing to do in the first place...
If I were to do this in another programming language (say Python), I would use a while loop conditional on patient_status == W, start with L = 0, and add 1 on each successful iteration. I'm not sure whether R works the same way.
Let's start with what rbinom(1, 1, 0.2) does: it returns 1 draw of a sum of 1 independent Bernoulli (that is, 0-1) random variable that has probability 0.2 of equaling 1. So that line will only ever output 0 (which it does 80% of the time) or 1 (which it does the other 20% of the time). As you noted, this isn't what you want.
The issue here is the selection of a random variable. A binomial variable is great for something like, "I roll ten dice. How many land on 6?" because it has the following essential components:
outcomes dichotomized into success / failure
a fixed number (ten) of trials
a consistent probability of success (1/6)
independent trials (dice do not affect each other)
The situation you're describing doesn't have those features. So, what to do?
Option 1: Go with your instinct for a while() loop. I'll preface this by saying that while() loops are discouraged in R for various reasons (chiefly inefficiency). But, since you already understand the concept, let's run with it.
one_patient <- function() {
  status <- 1  # 1 = healthy, 0 = ill
  years <- -1  # count how many years completed while healthy
  while (status == 1) {
    years <- years + 1           # this line will run at least one time
    status <- rbinom(1, 1, 0.2)  # your rbinom(1, 1, 0.2) line makes an appearance!
  }
  return(years)
}
Now, executing one_patient() will result in the number of the years the patient successfully transitioned from well to well. This will be at least 0, since years starts at -1 and is incremented at least one time. It could be very high, if the patient is lucky, though it most likely won't. You can experiment with this by changing the 0.2 parameter to something more optimistic like 0.99 to simulate long life spans.
Option 2: Rethink the random variable. I mentioned above that the variable wasn't binomial; in fact, it's geometric. A situation like, "I roll a die until it lands on 6. How many rolls did it take?" is geometric because it has the following essential components:
outcomes dichotomized into success / failure
a consistent probability of success
repeated trials that terminate when the first success is reached
independent trials
Much like how binomial variables have useful functions in R such as rbinom(), pbinom(), qbinom(), dbinom(), there is a corresponding collection for geometric variables: rgeom(), pgeom(), qgeom(), dgeom().
To use rgeom(), we need to be careful about one detail: here, a "success" is characterized as the patient becoming ill, because that's when the experiment ends. (Above, by encoding the patient being well as 1, we're implicitly using the reverse perspective.) This means that the "success" probability is 0.8. rgeom(1, 0.8) will return the number of draws strictly before the first success, which is equivalent to the number of years the patient went from well to well, as above. Note that the 1 parameter refers to the number of times we want to run this experiment and not something else. Hence:
rgeom(1, 0.8)
will accomplish the same task as the one_patient() function we defined above. (That is, the distribution of outputs for each will be the same.)
For multiple patients, you can either wrap the one_patient() function inside replicate(), or you can just directly adjust the first parameter of rgeom(1, 0.8). The second option is much faster, though both are fast if just simulating 1000 patients.
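For instance, the second problem in the question can be done in two lines (a sketch; the seed is my own choice):
set.seed(42)
years_well <- rgeom(1000, 0.8)  # 1000 patient trajectories at once
mean(years_well)                # sample mean of years of wellness
The theoretical mean for comparison is 0.2/0.8 = 0.25.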
Addendum
Proof that both have the same effect:
sims1 <- replicate(10000, one_patient())
hist(sims1, breaks = seq(-0.5, max(sims1) + 0.5, by = 1))
sims2 <- rgeom(10000, 0.8)
hist(sims2, breaks = seq(-0.5, max(sims2) + 0.5, by = 1))
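A quicker numerical check than the histograms (my addition): compare the simulated frequencies against the exact geometric probabilities, which are 0.8, 0.16, 0.032, and so on.
sim_freq <- prop.table(table(factor(sims2, levels = 0:4)))  # simulated P(X = 0..4)
round(rbind(simulated = as.numeric(sim_freq), exact = dgeom(0:4, 0.8)), 4)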
Proof that rgeom() is faster:
library(microbenchmark)
microbenchmark(
  replicate(10000, one_patient()),
  rgeom(10000, 0.8)
)
# Unit: milliseconds
#                            expr     min       lq      mean   median       uq     max neval
# replicate(10000, one_patient()) 35.4520 38.77585 44.135562 43.82195 46.05920 73.5090   100
#               rgeom(10000, 0.8)  1.1978  1.22540  1.273766  1.23640  1.27485  1.9734   100
