I'm just learning how to use R. I'm practicing some statistics, such as the Normal and Poisson distributions.
When I try to calculate probabilities and the answer is a number very close to zero (0), the program shows 0 as the result, so I can't see the full answer, and I need the full answer. There is always a probability, even a small one!!
My question is: can I turn off the automatic rounding, or what code can I use to get the full answer?
Example:
1-pbinom(q =10, size = 10,prob = 0.8)
Result:
0
The pbinom function gives the cumulative distribution function. That is, the probability that a value is less than or equal to a particular value. So with a discrete distribution like the binomial distribution with 10 draws
pbinom(10, 10, .8)
# [1] 1
tells you that there is a 100% chance you will observe 10 or fewer successes.
Perhaps you're thinking of the probability density function (or probability mass function since this is a discrete distribution) dbinom
dbinom(10, 10, .8)
# [1] 0.1073742
means that there is a roughly 11% chance that all your draws will be successes. It's also true that
sum(dbinom(0:10, 10, .8))
# [1] 1
that is, the sum of the probabilities of getting 0 through 10 successes is exactly 1.
So with these cases you are getting the exact answer. R does round values in the console according to the options(digits=) setting, but that's not what's happening here.
pbinom is the distribution function for the binomial distribution, which is discrete and can thus be exactly 1 (as in your example). You might have been thinking of continuous distributions like the normal or gamma distributions. With those, floating-point cancellation can wipe out a tiny upper tail: pnorm(10) is so close to 1 that it is stored as exactly 1, so subtracting it from 1 gives 0. For example
> 1 - pnorm(10, 0, 1)
[1] 0
However, the p[dist] functions have a lower.tail argument designed to address this problem. Setting lower.tail=FALSE computes the upper tail directly instead of by subtraction:
> pnorm(10, 0, 1, lower.tail=FALSE)
[1] 7.619853e-24
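The same idea carries over to the binomial example from the question. Below is a minimal sketch (the variable names are my own): it asks for the upper tail directly, shows the log.p = TRUE option for cases where even the upper tail might underflow, and confirms that P(X > 10) is genuinely 0 when there are only 10 trials.

```r
# Upper tail of the standard normal, computed directly (no subtraction from 1)
upper <- pnorm(10, mean = 0, sd = 1, lower.tail = FALSE)

# The same tail on the log scale, useful when the probability itself could underflow
log_upper <- pnorm(10, mean = 0, sd = 1, lower.tail = FALSE, log.p = TRUE)

# Binomial: P(X > 9) is the same event as P(X = 10), so these two agree
tail_binom <- pbinom(9, size = 10, prob = 0.8, lower.tail = FALSE)
point_binom <- dbinom(10, size = 10, prob = 0.8)

# P(X > 10) really is 0 when there are only 10 trials
impossible <- pbinom(10, size = 10, prob = 0.8, lower.tail = FALSE)

print(c(upper = upper, log_upper = log_upper))
print(c(tail_binom = tail_binom, point_binom = point_binom, impossible = impossible))
```

So the 0 in the original binomial example is the exact answer, while the 0 in the normal example is only an artefact of the subtraction.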
I'm new to R, and it's been a while for me and statistics. I don't understand what the parameters of rbinom are. What is the difference between n (="number of observations") and size (="number of trials")?
This question is related, but I don't understand the answer there.
Thank you very much.
A binomial distribution usually has two parameters, an integer which indicates the number of attempts and so the maximum possible value (here called the size) and a success probability for each attempt between 0 and 1. The expectation is then the product of these two parameters.
For a random sample from this distribution, you are also interested in having a particular number of observations.
So in rbinom(n, size, prob) you have
n being the number of sample observations
size being the integer parameter of the binomial distribution, using 1 if you want a Bernoulli distribution
prob for the probability parameter of the binomial distribution
As an example, you might get
set.seed(2021)
rbinom(5, 100, 0.2)
# 19 23 22 19 21
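To make the roles of n and size concrete, here is a small sketch. The seed and the variable names are arbitrary choices of mine, not anything from the question.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# n controls how many numbers come back; size caps each one
five_draws <- rbinom(n = 5, size = 100, prob = 0.2)
length(five_draws)        # 5 observations
all(five_draws <= 100)    # each is a count of successes out of 100 trials

# size = 1 gives Bernoulli draws: every value is 0 or 1
bernoulli <- rbinom(n = 20, size = 1, prob = 0.2)

# With many observations the sample mean approaches size * prob = 20
many_draws <- rbinom(n = 100000, size = 100, prob = 0.2)
mean(many_draws)
```

Changing n changes the length of the result, while changing size changes the range of each value.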
I want to identify the probability of certain events occurring for a range.
Min = 600 Max = 50,000 Most frequent outcome = 600
I generated a sequence of events: numbers <- seq(600,50000,by=1)
This is where I get stuck. Not sure if using the wrong distribution or attempt at execution is going down the wrong path.
qpois(numbers, lambda = 600) produces NaNs
So the desired outcome is a set of weighted probabilities (weighted towards the most frequent value of 600), so that I can then assess, for example, whether the likelihood of an outlier event above 30000 is 5%, or make different cuts like that, by summing the probabilities for those numbers.
A bit rusty, haven't used this for a few years so any online resources to refresh is also appreciated!
Firstly, I think you're looking for ppois rather than qpois. The function qpois(p, 600) takes a vector p of probabilities. If you do qpois(0.75, 600) you will get 616, meaning that 75% of observations will be at or below 616.
ppois is the inverse of qpois: it takes a value and returns the cumulative probability. If you do ppois(616, 600) you will get (approximately) 0.75.
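A short sketch of that relationship, using my own variable names: because the Poisson distribution is discrete, qpois returns the smallest count whose cumulative probability reaches p, so ppois at that count is at least p and ppois one step lower is below p.

```r
lambda <- 600
p <- 0.75

q <- qpois(p, lambda)             # smallest count whose cumulative probability is >= p
at_q <- ppois(q, lambda)          # cumulative probability at that count
below_q <- ppois(q - 1, lambda)   # one step lower falls short of p

print(c(q = q, at_q = at_q, below_q = below_q))
```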
As for your specific distribution, it can't be a Poisson distribution. Let's see what a Poisson distribution with a mean of 600 looks like:
x <- 500:700
plot(x, dpois(x, 600), type = "h")
Getting a value of greater than even 900 has (essentially) a zero probability:
1 - ppois(900, 600)
#> [1] 0
So if your data contains values of 30,000 or 50,000 as well as 600, it's certainly not a Poisson distribution.
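The printed 0 is a floating-point artefact rather than a true zero: ppois(900, 600) is so close to 1 that the subtraction cancels. As a sketch (variable names are mine), asking for the upper tail directly with lower.tail = FALSE shows how small the probability really is:

```r
# 1 - ppois(...) loses the tiny tail to floating-point cancellation
by_subtraction <- 1 - ppois(900, lambda = 600)

# Asking for the upper tail directly keeps it
direct_tail <- ppois(900, lambda = 600, lower.tail = FALSE)

print(by_subtraction)
print(direct_tail)
```

Either way the conclusion stands: values in the tens of thousands are effectively impossible under a Poisson distribution with mean 600.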
Without knowing more about your actual data, it's not really possible to say what distribution you have. Perhaps if you include a sample of it in your question we could be of more help.
EDIT
With the sample of numbers provided in the comments, we can have a look at the actual empirical distribution:
hist(numbers, 200)
and if we want to know the probability at any point, we can create the empirical cumulative distribution function like this:
get_probability_of <- ecdf(numbers)
This allows us to do:
number <- 1:50000
plot(number, get_probability_of(number), ylab = "probability", type = "l")
and
get_probability_of(30000)
#> [1] 0.83588
Which means that the probability of getting a number higher than 30,000 is
1 - get_probability_of(30000)
#> [1] 0.16412
However, in this case, we know how the distribution is generated, so we can calculate the exact theoretical cdf just using some simple geometry (I won't show my working here because although it is simple, it is rather lengthy, dull, and not applicable to other distributions):
cdf <- function(x) ifelse(x < 600, 0, 1 - ((49400 - (x - 600)) / 49400)^2)
and
cdf(30000)
#> [1] 0.8360898
which is very close to, but more theoretically accurate than the empirical value.
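The data from the comments is not reproduced here, so the following is only a sketch on simulated data. It assumes the numbers follow exactly the theoretical cdf given above and draws them by inverse-transform sampling; simulated and empirical_cdf are names I made up, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Theoretical cdf from the answer
cdf <- function(x) ifelse(x < 600, 0, 1 - ((49400 - (x - 600)) / 49400)^2)

# Inverse-transform sampling: solve u = cdf(x) for x on [600, 50000]
u <- runif(100000)
simulated <- 50000 - 49400 * sqrt(1 - u)

empirical_cdf <- ecdf(simulated)

empirical_cdf(30000)       # estimate from the simulated sample
cdf(30000)                 # exact value under the assumed cdf
1 - empirical_cdf(30000)   # estimated P(X > 30000)
```

With a sample this large, the empirical and theoretical values should agree to about two decimal places.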
I am trying to estimate the probability that a family with two children has two girls, using the rbinom function. However, rbinom does not give the probability; it just gives random values. How can I turn those into a probability?
The usage of rbinom is
rbinom(
n, # The number of random values to draw (or iterations if you like)
size, # The number of trials behind each value
prob # The probability of "success" for your definition of success
)
If you assume the probability of having a girl is 50% (it's not - there are a number of confounding variables and the actual birth ratio of girls to boys is closer to 100:105), then you can calculate the outcome of one "trial" with a sample size of 2 children.
rbinom(1, 2, 0.5)
You will get an outcome of 0, 1, or 2 girls (it is random). This does not give you the probability that they are both girls. You have to complete multiple "trials".
Here is an example with n = 10. I am using set.seed to provide a specific initial state to the RNG to make the results reproducible.
set.seed(100)
num_girls <- rbinom(10, 2, 0.5)
You can do a little trick to find out how many times there are two girls.
sum(num_girls == 2) # Does the comparison and adds; TRUE = 1 and FALSE = 0
1
So there are 1/10 times when you get two girls. The number of trials is not enough to approach the true probability yet, but you should get the idea.
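As a rough sketch of what happens with many more trials (still assuming a 50% chance of a girl and independent births), the simulated proportion can be compared with the exact binomial probability from dbinom. The names estimate and exact are mine.

```r
set.seed(100)

# Many simulated two-child families, assuming P(girl) = 0.5 and independent births
num_girls <- rbinom(100000, size = 2, prob = 0.5)
estimate <- mean(num_girls == 2)   # proportion of families with two girls

# Exact binomial probability of 2 successes in 2 trials
exact <- dbinom(2, size = 2, prob = 0.5)

print(c(estimate = estimate, exact = exact))
```

Here mean() of the logical vector gives the proportion directly, and the exact value is 0.5 * 0.5 = 0.25.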
This may be some basic/fundamental question on 'dnorm' function in R. Let's say I create some z scores through z transformation and try to get the sum out of 'dnorm'.
data=c(232323,4444,22,2220929,22323,13)
z=(data-mean(data))/sd(data)
result=dnorm(z,0,1)
sum(result)
[1] 1.879131
As above, the sum of 'dnorm' is not 1 nor 0.
Then let's say I use zero mean and one standard deviation even in my z transformation.
data=c(232323,4444,22,2220929,22323,13)
z=(data-0)/1
result=dnorm(z,0,1)
sum(result)
[1] 7.998828e-38
I still do not get either 0 or 1 in sum.
If my purpose is to get sum of the probability equal to one as I will need for my further usage, what method do you recommend using 'dnorm' or even using other PDF functions?
dnorm returns values of the normal probability density function. It does not return probabilities. What is your reasoning that the density values at your transformed data points should sum to one or zero? A density evaluated at a handful of arbitrary points has no reason to sum to exactly zero or one.
Integrating dnorm yields a probability. Integrating dnorm over the entire support of the random variable yields a probability of one:
integrate(dnorm, -Inf, Inf)
#1 with absolute error < 9.4e-05
In fact, integrate(dnorm, -Inf, x) conceptually equals pnorm(x) for all x.
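If the goal is a set of probabilities that sum to one, one option is to take probabilities of intervals rather than density values at points. The sketch below first checks the integrate/pnorm relationship at one point, then splits the real line into bins; the break points are hypothetical and chosen only for illustration.

```r
x <- 1.96

# Area under the density up to x, computed numerically
by_integration <- integrate(dnorm, lower = -Inf, upper = x)$value
by_pnorm <- pnorm(x)

# Probabilities of intervals that partition the real line sum to 1
breaks <- c(-Inf, -1, 0, 1, Inf)   # hypothetical bin edges
bin_probs <- diff(pnorm(breaks))

print(c(by_integration = by_integration, by_pnorm = by_pnorm))
print(sum(bin_probs))
```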
Edit: In light of your comment.
The same applies to the probability density functions (PDFs) of other continuous distributions:
integrate(dexp, 0, Inf, rate = 57)
#1 with absolute error < 1.3e-05
Note that the ... argument(s) from ?integrate are passed on to the integrand.
Recall also that the Poisson distribution, say, is a discrete probability distribution, so integrating it (in the conventional sense) makes no sense. A discrete distribution has a probability mass function (PMF) rather than a PDF, and a PMF does return probabilities. In that case, the values should sum to one.
Consider:
dpois(0.5, lambda = 2)
#[1] 0
#Warning message:
#In dpois(0.5, lambda = 2) : non-integer x = 0.500000
Summing from 0 to a 'very' large number (i.e. over the support of the Poisson distribution):
sum(dpois(0:1000000, lambda = 2))
#[1] 1
I have a bunch of random variables (X1,....,Xn) which are i.i.d. Exp(1/2) and represent the duration of a certain event. So this distribution obviously has an expected value of 2, but I am having problems defining it in R. I did some research and found something about so-called Monte Carlo simulation, but I can't seem to find what I am looking for in it.
An example of what I want to estimate: let's say we have 10 random variables (X1,..,X10) distributed as above, and we want to determine, for example, the probability P([X1+...+X10<=25]).
Thanks.
You don't actually need monte carlo simulation in this case because:
If Xi ~ Exp(λ) then the sum (X1 + ... + Xk) ~ Erlang(k, λ) which is just a Gamma(k, 1/λ) (in (k, θ) parametrization) or Gamma(k, λ) (in (α,β) parametrization) with an integer shape parameter k.
From wikipedia (https://en.wikipedia.org/wiki/Exponential_distribution#Related_distributions)
So, P([X1+...+X10<=25]) can be computed by
pgamma(25, shape=10, rate=0.5)
Are you aware of the rexp() function in R? Have a look at its documentation page by typing ?rexp in the R console.
A quick answer to your Monte Carlo estimation of desired probability:
mean(rowSums(matrix(rexp(1000 * 10, rate = 0.5), 1000, 10)) <= 25)
I have generated 1000 sets of 10 exponential samples and put them into a 1000 * 10 matrix. Taking row sums gives a vector of 1000 entries. The proportion of those sums that are at most 25 is an empirical estimate of the desired probability.
Thanks, this was helpful! Can I use replicate with this code, to make it look like this: F <- function(n, B=1000) mean(replicate(B,(rexp(10, rate = 0.5)))) but I am unable to output the right result.
replicate here generates a matrix, too, but it is a 10 * 1000 matrix (as opposed to the 1000 * 10 one in my answer), so you now need to take colSums. Also, where did you put n?
The correct function would be
F <- function(n, B=1000) mean(colSums(replicate(B, rexp(10, rate = 0.5))) <= n)
For a non-Monte Carlo approach to your example, see the other answer. The exponential distribution is a special case of the gamma distribution, and the latter has an additivity property.
I am giving you Monte Carlo method because you name it in your question, and it is applicable beyond your example.
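As a sanity check, the two approaches can be compared side by side. This is only a sketch: estimate_prob is a hypothetical helper of my own (a renamed, slightly generalized version of the function above, assuming k >= 2 so that replicate returns a matrix), and the seed and B are arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical helper: Monte Carlo estimate of P(X1 + ... + Xk <= n)
# for k i.i.d. exponential variables with the given rate (assumes k >= 2)
estimate_prob <- function(n, k = 10, rate = 0.5, B = 10000) {
  sums <- colSums(replicate(B, rexp(k, rate = rate)))  # one sum per replicate
  mean(sums <= n)
}

monte_carlo <- estimate_prob(25)
exact <- pgamma(25, shape = 10, rate = 0.5)  # the sum of the exponentials is gamma

print(c(monte_carlo = monte_carlo, exact = exact))
```

The two numbers should agree to roughly two decimal places; increasing B tightens the Monte Carlo estimate.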