I am looking to use dbinom() in R to generate a probability. The documentation gives dbinom(x, size, prob, log = FALSE), and I understand what all of the arguments mean except x, which is described as a "vector of quantiles". Can anyone explain what that means in context? Say, for example, that I would like to find the probability of obtaining the number 5 exactly twice if I sample 10 times from the numbers 1-5. In this case the binomial probability would be
choose(10, 2) * (1/5)^2 * (4/5)^8
In your example the "number of times you see a five" is the quantile of interest. Loosely speaking, a "quantile" is a possible value of a random variable. So if you want the probability of seeing a 5 exactly x = 2 times out of size = 10 draws, where each number has prob = 1/5 of being drawn, you would enter dbinom(2, 10, 1/5).
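You can check that this matches the hand calculation above:

dbinom(2, 10, 1/5)
#> [1] 0.3019899
choose(10, 2) * (1/5)^2 * (4/5)^8
#> [1] 0.3019899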
I want to identify the probability of certain events occurring within a range of values.
Min = 600; Max = 50,000; most frequent outcome = 600
I generated a sequence of events: numbers <- seq(600,50000,by=1)
This is where I get stuck. I am not sure whether I am using the wrong distribution or whether my attempt at execution is going down the wrong path.
qpois(numbers, lambda = 600) produces NaNs
The desired outcome is an output of weighted probabilities (weighted towards the mean of 600), so that I can then assess, for example, whether the likelihood of an outlier event above 30,000 is 5%, or make other cuts like that, by summing the probabilities for those numbers.
A bit rusty, I haven't used this for a few years, so any online resources to refresh would also be appreciated!
Firstly, I think you're looking for ppois rather than qpois. The function qpois(p, 600) takes a vector p of probabilities. If you do qpois(0.75, 600) you will get 616, meaning that 75% of observations will be at or below 616.
ppois is the inverse of qpois. If you do ppois(616, 600) you will get (approximately) 0.75.
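For instance (the exact output may differ slightly in the last digits):

qpois(0.75, 600)  # smallest k with P(X <= k) >= 0.75
#> [1] 616
ppois(616, 600)   # P(X <= 616), approximately 0.75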
As for your specific distribution, it can't be a Poisson distribution. Let's see what a Poisson distribution with a mean of 600 looks like:
x <- 500:700
plot(x, dpois(x, 600), type = "h")
Getting a value greater than even 900 has (essentially) zero probability:
1 - ppois(900, 600)
#> [1] 0
So if your data contains values of 30,000 or 50,000 as well as 600, it's certainly not a Poisson distribution.
Without knowing more about your actual data, it's not really possible to say what distribution you have. Perhaps if you include a sample of it in your question we could be of more help.
EDIT
With the sample of numbers provided in the comments, we can have a look at the actual empirical distribution:
hist(numbers, 200)
and if we want to know the probability at any point, we can create the empirical cumulative distribution function like this:
get_probability_of <- ecdf(numbers)
This allows us to do:
number <- 1:50000
plot(number, get_probability_of(number), ylab = "probability", type = "l")
and
get_probability_of(30000)
#> [1] 0.83588
Which means that the probability of getting a number higher than 30,000 is
1 - get_probability_of(30000)
#> [1] 0.16412
However, in this case, we know how the distribution is generated, so we can calculate the exact theoretical cdf just using some simple geometry (I won't show my working here because although it is simple, it is rather lengthy, dull, and not applicable to other distributions):
cdf <- function(x) ifelse(x < 600, 0, 1 - ((49400 - (x - 600)) / 49400)^2)
and
cdf(30000)
#> [1] 0.8360898
which is very close to, but more theoretically accurate than the empirical value.
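As a quick visual check (reusing the number vector and the two functions defined above), you can overlay the theoretical CDF on the empirical one:

plot(number, get_probability_of(number), ylab = "probability", type = "l")
lines(number, cdf(number), col = "red", lty = 2)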
I am trying to estimate the probability that a family with two children has two girls, using the rbinom function. However, rbinom does not give a probability; it just gives random values. How can I get a probability from it?
The usage of rbinom is
rbinom(
  n,    # The number of samples (or iterations if you like)
  size, # The size of each sample, i.e. the number of trials
  prob  # The probability of "success" for your definition of success
)
If you assume the probability of having a girl is 50% (it's not - there are a number of confounding variables and the actual birth ratio of girls to boys is closer to 100:105), then you can calculate the outcome of one "trial" with a sample size of 2 children.
rbinom(1, 2, 0.5)
You will get an outcome of 0, 1, or 2 girls (it is random). This does not give you the probability that they are both girls. You have to complete multiple "trials".
Here is an example with n = 10. I am using set.seed to provide a specific initial state to the RNG to make the results reproducible.
set.seed(100)
num_girls <- rbinom(10, 2, 0.5)
You can do a little trick to find out how many times there are two girls.
sum(num_girls == 2) # Does the comparison and adds; TRUE = 1 and FALSE = 0
#> [1] 1
So you get two girls in 1 out of 10 trials. The number of trials is not large enough to approach the true probability yet, but you should get the idea.
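With many more trials the estimate settles near the exact value, which you can also get directly from dbinom:

mean(rbinom(1e6, 2, 0.5) == 2)  # simulated estimate, close to 0.25
dbinom(2, 2, 0.5)               # exact probability of two girls
#> [1] 0.25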
In R, I would like to generate a multinomially distributed random number vector of a given size N, for example using rmultinom, but with a maximum size for each of the K boxes.
For example:
set.seed(1)
draw = rmultinom(n = 1, size = 1000, prob = c(67,211,264,166,144,52,2,175))
In this case, size = 1000 specifies the total number of objects put into eight boxes (the length of prob), and prob = c(67,211,264,166,144,52,2,175) is the vector of probabilities for the eight boxes (which is internally normalized to sum to 1). In addition, I would like c(67,211,264,166,144,52,2,175) to be the vector of maximum sizes for the eight boxes.
However, in this case it is possible to generate numbers that are higher than c(67,211,264,166,144,52,2,175) (for instance, in the example above draw[7,] = 4 is higher than 2), whereas I would like each number to be less than or equal to the maximum size of each box specified in prob, while draw still sums to size = 1000.
Do you know any function or any simple way to do that? I was not able to find the answer.
From Wikipedia: "For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories".
The keyword here is independent. Your constraint on the number of times each category can be drawn means the sampling is not independent. If your problem were multinomial, it would be possible - though very unlikely - that all numbers could be drawn from box 7. This is not what you want, so you can't use rmultinom.
Here's a different approach:
# vector of item counts
m <- c(67,211,264,166,144,52,2,175)
# expand the item counts in to a single vector with i repeated m[i] times
d <- unlist(lapply(1:length(m), function(x) rep(x, m[x])))
# sample from d without replacement
s <- sample(d, size=1000, replace=FALSE)
# count the number of items of each type were sampled
table(factor(s))
1 2 3 4 5 6 7 8
63 197 242 153 135 48 2 160
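Because the sampling is done without replacement, no box can exceed its cap and the draws always sum to the requested size. A quick sanity check on the objects created above:

counts <- table(factor(s, levels = seq_along(m)))
all(counts <= m)  # TRUE: no box exceeds its maximum
sum(counts)       # 1000: the requested total size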
In some code I want to choose n random numbers in [0,1) which sum to 1.
I do so by choosing the numbers independently in [0,1) and normalizing them by dividing each one by the total sum:
from random import random

numbers = [random() for i in range(n)]
numbers = [x / sum(numbers) for x in numbers]
My "problem" is that the distribution I get out is quite skewed: choosing a million numbers, not a single one gets over 1/2. With some effort I've calculated the pdf, and it's not nice.
Here is the weird looking pdf I get for 5 variables:
Do you have an idea for a nice algorithm to choose the numbers, that result in a more uniform or simple distribution?
You are looking to partition the distance from 0 to 1.
Choose n - 1 numbers from 0 to 1, sort them, and take the distances between consecutive points, including 0 and 1 as the endpoints.
This partitions the interval from 0 to 1 into n pieces, and should yield the occasional large value which you aren't currently getting.
Even so, for large values of n, you can generally expect your max value to decrease as well, just not as quickly as your method.
You might be interested in the Dirichlet distribution, which is used to generate quantities that sum to 1 if you're looking for probabilities. There's also a section on how to generate them using gamma distributions here.
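For what it's worth, here is a short sketch of both ideas in R (most of this page uses R; the same translates directly to Python). Both constructions give a flat Dirichlet, i.e. a uniform distribution over vectors that sum to 1:

n <- 5

# Method 1: sort n - 1 uniform cut points and take the gaps between them
cuts <- sort(runif(n - 1))
gaps <- diff(c(0, cuts, 1))
sum(gaps)   # 1, up to floating point

# Method 2: normalize independent gamma(1) (i.e. exponential) draws
g <- rgamma(n, shape = 1)
dirich <- g / sum(g)
sum(dirich) # 1, up to floating point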
Another way to get n random numbers which sum up to 1:
import random
def create_norm_arr(n, remaining=1.0):
    random_numbers = []
    for _ in range(n - 1):
        r = random.random()  # get a random number in [0, 1)
        r = r * remaining
        remaining -= r
        random_numbers.append(r)
    random_numbers.append(remaining)
    return random_numbers
random_numbers = create_norm_arr(5)
print(random_numbers)
print(sum(random_numbers))
This makes high values (say, above 1/2) much more likely than the normalize-by-sum approach, although the entries are not exchangeable: earlier positions in the list tend to receive the larger values.
Suppose I have the following 2 random variables:
X where mean = 6 and stdev = 3.5
Y where mean = -42 and stdev = 5
I would like to create a new random variable Z based on the first two and knowing that : X happens 90% of the time and Y happens 10% of the time.
It is easy to calculate the mean for Z : 0.9 * 6 + 0.1 * -42 = 1.2
But is it possible to generate random values for Z in a single function?
Of course, I could do something along those lines :
if (randIntBetween(1, 10) > 1)
    GenerateRandomNormalValue(6, 3.5);
else
    GenerateRandomNormalValue(-42, 5);
But I would really like to have a single function that would act as a probability density function for such a random variable (Z), which is not necessarily normal.
sorry for the crappy pseudo-code
Thanks for your help!
Edit: here is one concrete question:
Let's say we add up 5 consecutive values from Z. What would be the probability of ending up with a number higher than 10?
"But I would really like to have a single function that would act as a probability density function for such a random variable (Z), which is not necessarily normal."
Okay, if you want the density, here it is:
rho = 0.9 * density_of_x + 0.1 * density_of_y
But you cannot sample directly from this density unless you 1) compute its CDF (cumbersome, but not infeasible) and 2) invert it (you will need a numerical solver for this). Or you can do rejection sampling (or variants, e.g. importance sampling). This is costly, and cumbersome to get right.
So you should go for the "if" statement (i.e. call the generator 3 times), unless you have a very strong reason not to (using quasi-random sequences, for instance).
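In R, for example, that mixture density is just a weighted sum of two dnorm calls (using the means and standard deviations from the question):

# density of Z: N(6, 3.5) with weight 0.9, N(-42, 5) with weight 0.1
dz <- function(x) 0.9 * dnorm(x, mean = 6, sd = 3.5) + 0.1 * dnorm(x, mean = -42, sd = 5)
curve(dz, from = -60, to = 20, n = 1000, ylab = "density")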
If a random variable is denoted x = (mean, stdev), then the following algebra applies:
number * x = ( number*mean, number*stdev )
x1 + x2 = ( mean1+mean2, sqrt(stdev1^2 + stdev2^2) )
so for the case of X = (mx, sx), Y = (my, sy), the linear combination is
Z = w1*X + w2*Y = (w1*mx, w1*sx) + (w2*my, w2*sy)
  = ( w1*mx + w2*my, sqrt( (w1*sx)^2 + (w2*sy)^2 ) )
  = ( 1.2, 3.19 )
Link: Normal Distribution - look for the Miscellaneous section, item 1.
PS. Sorry for the weird notation. The new standard deviation is calculated by something similar to the Pythagorean theorem: it is the square root of the sum of squares.
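A quick simulation check of that algebra (note that this describes the weighted sum 0.9*X + 0.1*Y, not the 90/10 mixture asked about in the question):

x <- rnorm(1e6, mean = 6, sd = 3.5)
y <- rnorm(1e6, mean = -42, sd = 5)
z <- 0.9 * x + 0.1 * y
mean(z)  # close to 1.2
sd(z)    # close to sqrt((0.9 * 3.5)^2 + (0.1 * 5)^2), about 3.19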
This is the form of the distribution:
ListPlot[BinCounts[Table[If[RandomReal[] < .9,
RandomReal[NormalDistribution[6, 3.5]],
RandomReal[NormalDistribution[-42, 5]]], {1000000}], {-60, 20, .1}],
PlotRange -> Full, DataRange -> {-60, 20}]
It is NOT Normal, as you are not adding Normal variables, but just choosing one or the other with a certain probability.
Edit
This is the curve for adding five vars with this distribution:
The upper and lower peaks represent taking one of the distributions alone, and the middle peak accounts for the mixing.
The most straightforward and generically applicable solution is to simulate the problem:
Run the piecewise function you have 1,000,000 times (just some large number), split the results into bins to form a histogram, and divide the count in each bin by N times the bin width (N being 1,000,000 in my example). This will leave you with an approximation of the PDF of Z at each bin.
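For instance, a rough R sketch of that simulation (the 0.9/0.1 split and the two normal components come from the question), including the follow-up about summing five consecutive values of Z:

set.seed(1)
N <- 1e6

# draw N values of Z: with probability 0.9 from N(6, 3.5), otherwise from N(-42, 5)
pick <- runif(N) < 0.9
z <- ifelse(pick, rnorm(N, 6, 3.5), rnorm(N, -42, 5))

# a histogram with freq = FALSE approximates the pdf of Z
hist(z, breaks = 200, freq = FALSE)

# estimate P(Z1 + ... + Z5 > 10) by summing non-overlapping blocks of five draws
sums <- colSums(matrix(z, nrow = 5))
mean(sums > 10)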
Lots of unknowns here, but essentially you just wish to combine the two (or more) probability density functions, weighted appropriately, into one.
For any given probability function you could calculate a random number with that density by calculating the area under the probability curve (the integral) and then generating a random number between 0 and that area. Then move along the curve until the area is equal to your random number and use that as your value.
This process can then be generalized to any function (or sum of two or more functions).
Elaboration:
Suppose you have a distribution function f(x) defined over the range 0 to 1. You can generate a random number from that distribution by calculating the integral of f(x) from 0 to 1, giving you the area under the curve; let's call it A.
Now generate a random number between 0 and A; call that number r. You then need to find the value t such that the integral of f(x) from 0 to t equals r. That t is your random number.
This process can be used for any probability density function f(x). Including the sum of two (or more) probability density functions.
I'm not sure what your functions look like, so I don't know whether you can calculate analytic solutions for all of this, but in the worst case you could use numerical techniques to approximate the result.
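As a rough sketch of that numeric route in R: for this particular Z the CDF happens to be just a weighted sum of normal CDFs, so only the inversion needs a numerical solver (uniroot here).

# CDF of Z, i.e. the integral of the mixture density (components taken from the question)
pz <- function(x) 0.9 * pnorm(x, 6, 3.5) + 0.1 * pnorm(x, -42, 5)

# inverse-CDF sampling: draw u ~ Uniform(0, 1) and solve pz(t) = u for t
rz <- function(n) {
  sapply(runif(n), function(u)
    uniroot(function(t) pz(t) - u, lower = -100, upper = 100)$root)
}

samples <- rz(1000)
hist(samples, breaks = 50)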