rbinom function in R, trying to get the probability

I am trying to estimate the probability that a family with two children has two girls, using the rbinom function. However, rbinom does not return a probability; it just returns random values. How can I get a probability out of it?

The usage of rbinom is
rbinom(
n, # The number of random values to draw (simulated "trials")
size, # The number of success/failure draws added up in each trial
prob # The probability of "success" on each draw, for your definition of success
)
If you assume the probability of having a girl is 50% (it's not - there are a number of confounding variables and the actual birth ratio of girls to boys is closer to 100:105), then you can calculate the outcome of one "trial" with a sample size of 2 children.
rbinom(1, 2, 0.5)
You will get an outcome of 0, 1, or 2 girls (it is random). This does not give you the probability that they are both girls. You have to complete multiple "trials".
Here is an example with n = 10. I am using set.seed to provide a specific initial state to the RNG to make the results reproducible.
set.seed(100)
num_girls <- rbinom(10, 2, 0.5)
You can do a little trick to find out how many times there are two girls.
sum(num_girls == 2) # Does the comparison and adds; TRUE = 1 and FALSE = 0
1
So two girls occurred in 1 of the 10 trials, which gives an estimate of 0.1. Ten trials are not enough to approach the true probability yet, but you should get the idea.
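To see the estimate settle down, here is a minimal sketch with many more trials, compared against the exact binomial probability from dbinom. The seed and the trial count of 100000 are arbitrary choices for illustration, and the 50% chance of a girl is the same simplifying assumption as above.

```r
set.seed(100)                  # arbitrary seed, only for reproducibility
n_trials <- 100000             # hypothetical number of simulated two-child families

# One random count of girls (0, 1, or 2) per simulated family
num_girls <- rbinom(n_trials, size = 2, prob = 0.5)

# Proportion of families with two girls: the simulated probability
est_prob <- mean(num_girls == 2)

# Exact probability of 2 successes in 2 draws, for comparison
exact_prob <- dbinom(2, size = 2, prob = 0.5)

c(estimate = est_prob, exact = exact_prob)
```

Taking mean() of the logical vector is the same trick as sum(), divided by the number of trials. If you only need the exact answer, dbinom alone is enough and no simulation is required.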

Related

Weighted Likelihood of an Event Occurring

I want to identify the probability of certain events occurring for a range.
Min = 600 Max = 50,000 Most frequent outcome = 600
I generated a sequence of events: numbers <- seq(600,50000,by=1)
This is where I get stuck. Not sure if using the wrong distribution or attempt at execution is going down the wrong path.
qpois(numbers, lambda = 600) produces NaNs
So the desired outcome is a set of weighted probabilities (weighted to the mean of 600), from which I can assess, for example, whether the likelihood of an outlier event above 30000 is 5%, or make different cuts like that, by summing the probabilities for those numbers.
A bit rusty, haven't used this for a few years so any online resources to refresh is also appreciated!
Firstly, I think you're looking for ppois rather than qpois. The function qpois(p, 600) takes a vector p of probabilities. If you do qpois(0.75, 600) you will get 616, meaning that 75% of observations will be at or below 616.
ppois is the inverse of qpois: it takes a value and returns the probability of an observation at or below it. If you do ppois(616, 600) you will get (approximately) 0.75.
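A small sketch of that round trip, in case it helps. Because the Poisson distribution is discrete, qpois returns the smallest count whose cumulative probability reaches the requested level, so the value you get back from ppois is at least 0.75 rather than exactly 0.75. The variable names are just for illustration.

```r
lambda <- 600

# Smallest count k such that P(X <= k) >= 0.75
q75 <- qpois(0.75, lambda)

# Cumulative probability at that count: at least 0.75
p_at_q <- ppois(q75, lambda)

# One count lower falls short of 0.75
p_below <- ppois(q75 - 1, lambda)

c(q75 = q75, p_at_q = p_at_q, p_below = p_below)
```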
As for your specific distribution, it can't be a Poisson distribution. Let's see what a Poisson distribution with a mean of 600 looks like:
x <- 500:700
plot(x, dpois(x, 600), type = "h")
Getting a value of greater than even 900 has (essentially) a zero probability:
1 - ppois(900, 600)
#> [1] 0
So if your data contains values of 30,000 or 50,000 as well as 600, it's certainly not a Poisson distribution.
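As a side note, the 0 above comes from floating-point subtraction: the true tail probability is tiny but positive. A minimal sketch of how to see it, using the lower.tail argument that the p-functions in R accept:

```r
# Subtracting from 1 loses the tiny tail: ppois(900, 600) is stored as exactly 1
upper_naive <- 1 - ppois(900, 600)

# Asking for the upper tail directly, P(X > 900), keeps the small value
upper_tail <- ppois(900, 600, lower.tail = FALSE)

c(naive = upper_naive, direct = upper_tail)
```

Either way the conclusion is the same: values anywhere near 30,000 are effectively impossible under a Poisson distribution with mean 600.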
Without knowing more about your actual data, it's not really possible to say what distribution you have. Perhaps if you include a sample of it in your question we could be of more help.
EDIT
With the sample of numbers provided in the comments, we can have a look at the actual empirical distribution:
hist(numbers, 200)
and if we want to know the probability at any point, we can create the empirical cumulative distribution function like this:
get_probability_of <- ecdf(numbers)
This allows us to do:
number <- 1:50000
plot(number, get_probability_of(number), ylab = "probability", type = "l")
and
get_probability_of(30000)
#> [1] 0.83588
Which means that the probability of getting a number higher than 30,000 is
1 - get_probability_of(30000)
#> [1] 0.16412
However, in this case, we know how the distribution is generated, so we can calculate the exact theoretical cdf just using some simple geometry (I won't show my working here because although it is simple, it is rather lengthy, dull, and not applicable to other distributions):
cdf <- function(x) ifelse(x < 600, 0, 1 - ((49400 - (x - 600)) / 49400)^2)
and
cdf(30000)
#> [1] 0.8360898
which is very close to, but more theoretically accurate than the empirical value.
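If you want to check the two approaches against each other without the original sample, here is a rough sketch. It rests on an assumption of mine: the minimum of two independent uniform draws on [600, 50000] happens to have exactly the cdf written above, so I use that as a hypothetical stand-in for the data. This is not necessarily how the original numbers were generated.

```r
set.seed(1)        # arbitrary seed
n_sim <- 100000    # hypothetical sample size

# ASSUMPTION: stand-in data, the minimum of two uniform draws on [600, 50000]
sim_numbers <- pmin(runif(n_sim, 600, 50000), runif(n_sim, 600, 50000))

# Theoretical cdf, copied from above
cdf <- function(x) ifelse(x < 600, 0, 1 - ((49400 - (x - 600)) / 49400)^2)

emp  <- ecdf(sim_numbers)(30000)   # empirical P(X <= 30000)
theo <- cdf(30000)                 # theoretical P(X <= 30000)

c(empirical = emp, theoretical = theo)
```

With a sample this large the two values should agree to roughly two decimal places; the empirical one will wobble from seed to seed.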

Programming probability of patients being well in R

I'm going to preface this with the fact that I am a complete R novice.
I have the following problem:
Consider a simple model that progresses year-by-year. In year i, let W_i = patient is well, I_i = patient is ill, and D_i = patient is dead. Transitions can be modeled as a set of conditional probabilities.
Let L = number of years that the patient is well.
I have come up with the probability mass function of L to be P(L)=(1-p)(p)^{L-1}.
The given information is that a patient is well in year 1 and given their age and risk factors, P(W_{i+1}|W_{i})=0.2 for all i
The problem is to write a function in R that simulates the trajectory of a single patient and returns the number of years the patient is well.
I thought that this could be programmed in R as a binomial distribution using the rbinom function. For a single patient,
rbinom(1, 1, 0.2)
but I don't think that this would return the number of years that the patient is well. I'm thinking that the rbinom function should be the start, and that it would need to be paired with a way to count the number of years that a patient is well, but I don't know how to do that.
The next problem is to use R to simulate 1000 patient trajectories and find the sample mean of years of wellness. I'm assuming that this would be an extension of the previous part, just replacing the 1 patient with 1000. However I can't quite figure out where to replace the 1 with 1000: n or size
rbinom(n, size, prob)
This is assuming that using rbinom is the correct thing to do in the first place...
If I were to do this in another programming language (say Python) I would use a while loop conditional on patient_status=W and starting with L=0 iterate through the loop and add 1 each successful iteration. I'm not sure if R works in the same way.
Let's start with what rbinom(1, 1, 0.2) does: it returns 1 instance of 1 independent Bernoulli (that is, 0-1) random variables added together that have a probability of 0.2 of being equal to 1. So, that line will only give outputs 0 (which it will do 80% of the time) or 1 (which it will do the other 20% of the time). As you noted, this isn't what you want.
The issue here is the selection of a random variable. A binomial variable is great for something like, "I roll ten dice. How many land on 6?" because it has the following essential components:
outcomes dichotomized into success / failure
a fixed number (ten) of trials
a consistent probability of success (1/6)
independent trials (dice do not affect each other)
The situation you're describing doesn't have those features. So, what to do?
Option 1: Go with your instinct for a while() loop. I'll preface this by saying that while() loops are discouraged in R for various reasons (chiefly inefficiency). But, since you already understand the concept, let's run with it.
one_patient <- function(){
  status <- 1 # 1 = healthy, 0 = ill
  years <- (-1) # count how many years completed while healthy
  while(status == 1){
    years <- years + 1 # this line will run at least one time
    status <- rbinom(1, 1, 0.2) # your rbinom(1, 1, 0.2) line makes an appearance!
  }
  return(years)
}
Now, executing one_patient() will return the number of years in which the patient successfully transitioned from well to well. This will be at least 0, since years starts at -1 and is incremented at least one time. It could be very high if the patient is lucky, though it most likely won't be. You can experiment with this by changing the 0.2 parameter to something more optimistic like 0.99 to simulate long life spans.
Option 2: Rethink the random variable. I mentioned above that the variable wasn't binomial; in fact, it's geometric. A situation like, "I roll a die until it lands on 6. How many rolls did it take?" is geometric because it has the following essential components:
outcomes dichotomized into success / failure
a consistent probability of success
repeated trials that terminate when the first success is reached
independent trials
Much like how binomial variables have useful functions in R such as rbinom(), pbinom(), qbinom(), dbinom(), there is a corresponding collection for geometric variables: rgeom(), pgeom(), qgeom(), dgeom().
To use rgeom(), we need to be careful about one detail: here, a "success" is characterized as the patient becoming ill, because that's when the experiment ends. (Above, by encoding the patient being well as 1, we're implicitly using the reverse perspective.) This means that the "success" probability is 0.8. rgeom(1, 0.8) will return the number of draws strictly before the first success, which is equivalent to the number of years the patient went from well to well, as above. Note that the 1 parameter refers to the number of times we want to run this experiment and not something else. Hence:
rgeom(1, 0.8)
will accomplish the same task as the one_patient() function we defined above. (That is, the distribution of outputs for each will be the same.)
For multiple patients, you can either wrap the one_patient() function inside replicate(), or you can just directly adjust the first parameter of rgeom(1, 0.8). The second option is much faster, though both are fast if just simulating 1000 patients.
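For the 1000-patient part of the question, a minimal sketch using rgeom directly might look like this. The seed is arbitrary. One assumption to flag: rgeom counts well-to-well transitions, so if year 1 (when the patient is known to be well) should itself count as a well year, as the pmf in the question suggests, you would add 1 to each simulated value.

```r
set.seed(42)          # arbitrary seed
n_patients <- 1000

# Number of well-to-well transitions for each patient ("success" = becoming ill)
well_transitions <- rgeom(n_patients, prob = 0.8)

# Sample mean of transitions; the theoretical mean is (1 - 0.8) / 0.8 = 0.25
mean_transitions <- mean(well_transitions)

# ASSUMPTION: counting year 1 as a well year, L = transitions + 1,
# with theoretical mean 1 / 0.8 = 1.25
mean_years_well <- mean(well_transitions + 1)

c(transitions = mean_transitions, years_well = mean_years_well)
```

So the 1000 goes in the first argument, the number of simulated patients, and not in the probability.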
Addendum
Proof that both have the same effect:
sims1 <- replicate(10000, one_patient())
hist(sims1, breaks = seq(-0.5, max(sims1) + 0.5, by = 1))
sims2 <- rgeom(10000, 0.8)
hist(sims2, breaks = seq(-0.5, max(sims2) + 0.5, by = 1))
Proof that rgeom() is faster:
library(microbenchmark)
microbenchmark(
replicate(10000, one_patient()),
rgeom(10000, 0.8)
)
#Unit: milliseconds
#                            expr     min       lq      mean   median       uq     max neval
# replicate(10000, one_patient()) 35.4520 38.77585 44.135562 43.82195 46.05920 73.5090   100
#               rgeom(10000, 0.8)  1.1978  1.22540  1.273766  1.23640  1.27485  1.9734   100

Not sure about R's dbinom()

I am looking to use dbinom() in R to generate a probability. The default documentation gives dbinom(x, size, prob, log = FALSE), and I understand what they all mean except the x, where x is said to be "vector of quantiles". Can anyone explain what that means in context of let's say that I would like to find the probability of obtaining the number 5 twice if I sample 10 times from the numbers 1-5. In this case the binomial probability would be
choose(10, 2) * (1/5)^2 * (4/5)^8
In your example the "number of times you see a five" is the quantile of interest. Loosely speaking, a "quantile" is a possible value of a random variable. So if you want to find the probability of seeing a 5 x = 2 times out of size = 10 draws where each number has prob = 1 / 5 of being drawn you would enter dbinom(2, 10, 1 / 5).
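A quick sketch to confirm that the hand-written formula and dbinom agree, and to show that x really can be a vector of values. The variable names are only for illustration.

```r
# Probability of exactly two 5s in 10 draws, written out by hand
by_hand <- choose(10, 2) * (1/5)^2 * (4/5)^8

# The same probability from dbinom
by_dbinom <- dbinom(2, size = 10, prob = 1/5)

all.equal(by_hand, by_dbinom)

# x can be a vector: probabilities of seeing 0, 1, ..., 10 fives
all_probs <- dbinom(0:10, size = 10, prob = 1/5)
all_probs
```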

Estimate the chance n rolls of m fair six-sided dice

Similar to the de Méré problem
I want to generate a Monte Carlo simulation to estimate the probability of rolling at least one from n rolls of m fair six-sided dice.
My code:
m<-5000
n<-3
x<-replicate(m, sample(1:6,n,TRUE)==1)
p<-sum(x)/m
p is the probability estimated. Here I get the value 0.4822.
My questions:
1) Is there any other way without using sum to do it?
2) I suspect the code is wrong, as the probability may be too high.
Although the question as stated is a little unclear, the code suggests you want to estimate the chance of obtaining at least one outcome of "1" among n independent dice and that you aim to estimate this by simulating the experiment m times.
Program simulations from the inside out. Begin with a single iteration. You started well, but to be perfectly clear let's redo it using a highly suggestive syntax. Try this:
1 %in% sample(1:6,n,TRUE)
This uses sample to realize the results of n independent fair dice and checks whether the outcome 1 appears among any of them.
Once you are satisfied that this emulates your experiment (run it a bunch of times), then indeed replicate will perform the simulation:
x <- replicate(m, 1 %in% sample(1:6,n,TRUE))
That produces m results. Each will be TRUE (interpreted as equal to 1) in every iteration where a 1 appeared and FALSE (interpreted as 0) otherwise. Consequently, the proportion of iterations in which a 1 appeared can be obtained as
mean(x)
This empirical frequency is a good estimate of the theoretical probability.
As a check, note that 1 will not appear on a single die with a probability of 1-1/6 = 5/6 and therefore--because the n dice are independent--will not appear on any of them with a probability of (5/6)^n. Consequently the chance a 1 will appear must be 1 - (5/6)^n. Let us output those two values: the simulation mean and theoretical result. We might also include a Z score, which is a measure of how far away from the theoretical result the mean is. Typically, Z scores between -2 and 2 aren't significant evidence of any discrepancy.
Here's the full code. Although there are faster ways to write it, this is very fast already and is about as clear as one could make it.
m <- 5000 # Number of simulation iterations
n <- 3 # Number of dice per iteration
set.seed(17) # For reproducible results
x <- replicate(m, 1 %in% sample(1:6,n,TRUE))
# Compare to a theoretical result.
theory <- 1-(5/6)^n
avg <- mean(x)
Z <- (avg - theory) / sd(x) * sqrt(length(x))
c(Mean=signif(avg, 5), Theoretical=signif(theory, 5), Z.score=signif(Z, 3))
The output is
       Mean Theoretical     Z.score
     0.4132      0.4213     -1.1600
Notice that neither result is anywhere near n/6, which would be 1/2 = 0.500.
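For what it's worth, here is one possible vectorized variant, plus a sketch of what I believe the code in the question was actually estimating. Summing every comparison and dividing by m gives the average number of 1s per iteration, which is near n/6, rather than the chance of at least one 1. The matrix layout is my own choice, not the only way to do it.

```r
set.seed(17)   # arbitrary seed
m <- 5000      # number of simulation iterations
n <- 3         # number of dice per iteration

# All rolls at once: one row per iteration, one column per die
rolls <- matrix(sample(1:6, m * n, replace = TRUE), nrow = m)

# TRUE for each iteration in which at least one die shows a 1
any_one <- rowSums(rolls == 1) > 0
est <- mean(any_one)

# What sum(x) / m in the question computes: the average count of 1s per iteration
avg_count <- sum(rolls == 1) / m

theory <- 1 - (5/6)^n
c(estimate = est, theoretical = theory, avg_count = avg_count, n_over_6 = n / 6)
```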

Approximation of results

I'm just learning how to use R. I'm practicing some statistics, such as the Normal distribution, Poisson, etc.
When I try to calculate probabilities and the answer is a number very close to zero (0), the program shows as result 0, so I can't see the full answer, and I need the full answer. There is always a probability, even a small one!!
My question is: can I turn off the self-approximation or which code can I use to get a full answer?
Example:
1-pbinom(q =10, size = 10,prob = 0.8)
Result:
0
The pbinom function gives the cumulative distribution function, that is, the probability that a value is less than or equal to a particular value. So with a discrete distribution like the binomial distribution with 10 draws
pbinom(10, 10, .8)
# [1] 1
tells you that there is a 100% chance you will observe 10 or fewer successes.
Perhaps you're thinking of the probability density function (or probability mass function since this is a discrete distribution) dbinom
dbinom(10, 10, .8)
# [1] 0.1073742
means that there is a roughly 11% chance that all your draws will be successes. It's also true that
sum(dbinom(0:10, 10, .8))
# [1] 1
that is, the probabilities of getting 0 through 10 successes sum to 1.
So with these cases you are getting the exact answer. R does round values in the console according to the options(digits=) setting, but that's not what's happening here.
pbinom is the distribution function for the binomial distribution, which is discrete, so its value can be exactly 1 (as in your example). You might have been thinking of continuous distributions like the normal or gamma distributions. There the upper-tail probability is never exactly zero, but the distribution function can be so close to 1 that subtracting it from 1 in floating-point arithmetic yields 0, for example
> 1 - pnorm(10, 0, 1)
[1] 0
However, the p[dist] functions have a lower.tail argument; setting lower.tail=FALSE computes the upper tail directly and avoids this problem:
> pnorm(10, 0, 1, lower.tail=FALSE)
[1] 7.619853e-24
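Pulling the two answers together, here is a short sketch. For the binomial example, if the intent was the chance of 10 or more successes, the cutoff has to be 9, because lower.tail = FALSE gives P(X > q) and not P(X >= q). The last line uses the log.p argument, which can help when even the upper tail is too small to store as an ordinary number; the value 40 is just an arbitrary extreme example.

```r
# P(X > 9) = P(X = 10) for 10 draws with success probability 0.8
p_all <- pbinom(9, size = 10, prob = 0.8, lower.tail = FALSE)

# Upper tail of the standard normal beyond 10, computed directly
p_norm_tail <- pnorm(10, lower.tail = FALSE)

# Natural log of the upper tail beyond 40; the tail itself underflows to 0
log_tail <- pnorm(40, lower.tail = FALSE, log.p = TRUE)

c(p_all = p_all, p_norm_tail = p_norm_tail, log_tail = log_tail)
```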
