I have a vector of about 100,000 probabilities, one per event.
I want to know whether it is possible to calculate the probability that exactly n of the events occur (e.g., what is the probability that exactly 1,000 events occur?).
I managed to calculate several probabilities in R, where p is the vector containing all the probabilities:
probability of none: prod(1-p)
probability of at least one: 1 - prod(1-p)
I also found how to calculate the probability of exactly one event:
sum(p * (prod(1-p) / (1-p)))
But I don't know how to generalize this to a formula for exactly n events.
I do not know R, but I know how I would solve this with programming.
This is a straightforward dynamic programming problem. We start with a vector v = [1.0]: before any events are processed, the probability of zero occurrences is 1. Then, in untested Python:
v = [1.0]
for p_i in probabilities:
    # Probability of 0 events so far: no event yet and p_i does not occur
    next_v = [(1 - p_i) * v[0]]
    v.append(0.0)
    for j in range(len(v) - 1):
        next_v.append(v[j] * p_i + v[j + 1] * (1 - p_i))
    # Renormalize to guard against accumulated roundoff error
    total = sum(next_v)
    for j in range(len(next_v)):
        next_v[j] /= total
    v = next_v
And now your answer can simply be read off the right entry in the vector: v[n] is the probability that exactly n events occur.
This approach is equivalent to calculating Pascal's triangle row by row, throwing away the old row when you're done.
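Since the question is about R, here is a minimal sketch of the same recurrence in R (the function name exact_counts is mine, for illustration; p is assumed to be the poster's vector of probabilities). Note that the update is O(n^2) overall, so with 100,000 probabilities it will take a while:

# v[j + 1] holds P(exactly j events occur) over the probabilities processed so far.
exact_counts <- function(p) {
  v <- 1.0                # before any event: P(0 occurrences) = 1
  for (p_i in p) {
    v <- c(v, 0) * (1 - p_i) + c(0, v) * p_i   # event i does not / does occur
    v <- v / sum(v)       # guard against roundoff drift
  }
  v
}

# Probability that exactly 1000 events occur:
# exact_counts(p)[1001]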
I want to use R to randomly generate an integer sequence in which each integer is picked from a pool of integers (0, 1, 2, 3, ..., k) with replacement. k is pre-determined. The selection probability for an integer x in (0, 1, 2, 3, ..., k) is p^x * (1-p), where p is pre-determined. That is, 1 has a much higher probability of being picked than k, and my final integer sequence will likely contain more 1s than ks. I am not sure how to implement this number-selecting process in R.
A generic approach to this type of problem would be:
Calculate p^x * (1-p) for each integer x in 0..k.
Create a cumulative sum of these in a table t.
Draw a number from a uniform distribution between 0 and the total (the last entry of t).
Measure how far into t that number falls and check which integer that corresponds to.
The larger the probability for an integer is, the larger the part of that range it will cover.
Here's quick and dirty example code:
draw <- function(n = 1, k, p) {
  v  <- seq(0, k)               # the possible integers 0..k
  pr <- (p ^ v) * (1 - p)       # (unnormalized) weight for each integer
  t  <- cumsum(pr)              # cumulative weights
  x  <- runif(n, min = 0, max = max(t))   # uniform draws over the total weight
  f  <- findInterval(x, vec = t)
  v[f + 1]  ## findInterval returns 0 for the first interval, hence the +1
}
Note that the proposed solution doesn't care whether your density function adds up to 1. In real life it likely will, based on your description, but that's not really important for the solution.
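As a quick check (an illustrative call, not part of the original answer), the relative frequencies of the draws should track the weights p^x * (1-p):

set.seed(1)
table(draw(n = 10000, k = 5, p = 0.5)) / 10000
# compare with (0.5 ^ (0:5) * 0.5) / sum(0.5 ^ (0:5) * 0.5)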
The answer by Sirius is good. But as far as I can tell, what you're describing is something like a truncated geometric distribution.
I should note that the geometric distribution is defined differently in different works (see MathWorld, for example), so we use the distribution defined as follows:
P(X = x) ~ p^x * (1 - p), where x is an integer in [0, k].
I am not very familiar with R, but the solution involves calling rgeom(1, 1 - p) until the result is k or less.
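A minimal sketch of that loop in R (the wrapper name rtruncgeom is mine, for illustration):

# Draw one value from the truncated geometric distribution on 0..k
# by resampling rgeom(1, 1 - p) until the result falls in range.
rtruncgeom <- function(k, p) {
  repeat {
    x <- rgeom(1, prob = 1 - p)   # P(X = x) = (1 - p) * p^x
    if (x <= k) return(x)
  }
}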
Alternatively, you can use a generic rejection sampler, since the probabilities are known (better called weights here, since they need not sum to 1). Rejection sampling is described as follows:
Assume each weight is 0 or greater. Store the weights in a list. Calculate the highest weight; call it max. Then, to choose an integer in the interval [0, k] using rejection sampling (see the sketch after these steps):
Choose a uniform random integer i in the interval [0, k].
With probability weights[i]/max (where weights[i] = p^i * (1-p) in your case), return i. Otherwise, go to step 1.
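A small R sketch of those two steps (the helper name reject_sample is mine):

# Generic rejection sampler: weights[i + 1] is the weight of integer i
# (here p^i * (1 - p), matching the question).
reject_sample <- function(weights) {
  k <- length(weights) - 1
  wmax <- max(weights)
  repeat {
    i <- sample(0:k, 1)                      # step 1: uniform candidate in [0, k]
    if (runif(1) < weights[i + 1] / wmax) {  # step 2: accept with prob weight/max
      return(i)
    }
  }
}

# e.g. reject_sample(0.5 ^ (0:10) * 0.5)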
Given the weights for each item, there are many other ways to make a weighted choice besides rejection sampling or the solution in Sirius's answer; see my note on weighted choice algorithms.
In some code I want to choose n random numbers in [0,1) which sum to 1.
I do so by choosing the numbers independently in [0,1) and normalizing them by dividing each one by the total sum:
from random import random
numbers = [random() for i in range(n)]
numbers = [x / sum(numbers) for x in numbers]
My "problem" is, that the distribution I get out is quite skew. Choosing a million numbers not a single one gets over 1/2. By some effort I've calculated the pdf, and it's not nice.
Here is the weird looking pdf I get for 5 variables:
Do you have an idea for a nice algorithm to choose the numbers, that result in a more uniform or simple distribution?
You are looking to partition the distance from 0 to 1.
Choose n - 1 numbers from 0 to 1, sort them, and take the distances between consecutive values, including the endpoints 0 and 1; that gives you n gaps that sum to 1.
This will partition the space from 0 to 1, which should yield the occasional large result which you aren't getting.
Even so, for large values of n, you can generally expect your max value to decrease as well, just not as quickly as your method.
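A minimal R sketch of that partition approach (the function name is mine, for illustration):

# n gaps that sum to 1: sort n - 1 uniform cut points and take the
# differences, including the endpoints 0 and 1.
random_partition <- function(n) {
  cuts <- sort(runif(n - 1))
  diff(c(0, cuts, 1))
}

x <- random_partition(5)
sum(x)   # 1 (up to floating point)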
You might be interested in the Dirichlet distribution, which is used to generate quantities that sum to 1 if you're looking for probabilities. There's also a section on how to generate them using gamma distributions here.
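If it helps, a minimal sketch of that gamma construction (the helper name rdirichlet1 is mine; alpha = rep(1, n) gives the uniform distribution on the simplex, the same as the sorted-uniform gaps above):

# Dirichlet(alpha) draw: independent Gamma(alpha_i, 1) variables, normalized.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

rdirichlet1(rep(1, 5))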
Another way to get n random numbers which sum up to 1:
import random

def create_norm_arr(n, remaining=1.0):
    random_numbers = []
    for _ in range(n - 1):
        r = random.random()  # get a random number in [0, 1)
        r = r * remaining
        remaining -= r
        random_numbers.append(r)
    random_numbers.append(remaining)
    return random_numbers

random_numbers = create_norm_arr(5)
print(random_numbers)
print(sum(random_numbers))
This makes large values much more likely than normalizing does, although the entries are no longer exchangeable: the earlier positions tend to get the larger values.
I am very new to programming so I apologise in advance for my lack of knowledge.
I want to find the probability of obtaining the sum k when throwing m dice. I am not looking for a direct answer; I just want to ask whether I am on the right track and what I can improve.
I begin with a function that calculates the sum of an array of m dice:
function dicesum(m)
    j = rand(1:6, m)
    sum(j)
end
Now I am trying specific values to see if I can find a pattern (but without much luck). I have tried m = 2 (two dice). What I am trying to do is write a function which checks whether the sum of the two dice is k and, if it is, calculates the probability. My attempt is very naive, but I am hoping someone can point me in the right direction:
m = 2
x, y = rand(1:6), rand(1:6)
z = x + y
if z == dicesum(m)
    Probability = ??/6^m
end
I want to somehow find the number of 'elements' in dicesum(2) in order to calculate the probability. For example, consider the case when dicesum(2) = 8. With two dice, the possible outcomes are (2,6), (6,2), (5,3), (3,5), (4,4), so the probability is 5/36.
I understand that the general case is far more complicated, but I just want an idea of how to begin this problem. Thanks in advance for any help.
If I understand correctly, you want to use simulation to approximate the probability of obtaining a sum of k when rolling m dice. What I recommend is creating a function that takes k and m as arguments and repeats the simulation a large number of times. The following might help you get started:
function Simulate(m, k, Nsim=10^4)
    # Initialize the counter
    cnt = 0
    # Repeat the experiment Nsim times
    for sim in 1:Nsim
        # Simulate a roll of m dice
        s = sum(rand(1:6, m))
        # Increment the counter if the sum matches k
        if s == k
            cnt += 1
        end
    end
    # Return the estimated probability
    return cnt / Nsim
end
prob = Simulate(3,4)
The estimate is approximately .0131 (the exact probability is 3/216 ≈ .0139).
You can also perform your simulation in a vectorized style, as shown below. It's less efficient in terms of memory allocation because it creates a vector s of length Nsim, whereas the loop code counts with a single integer, cnt. Sometimes unnecessary memory allocation can cause performance issues. In this case, it turns out that the vectorized code is about twice as fast; usually, loops are a bit faster. Someone more familiar with the internals of Julia might be able to offer an explanation.
function Simulate1(m, k, Nsim=10^4)
    # Simulate a roll of m dice Nsim times (one row per simulation)
    s = sum(rand(1:6, Nsim, m), 2)
    # Relative frequency of matches
    prob = mean(s .== k)
    return prob
end
I have been stuck on this problem for a long time.
I was thinking I should first create the two functions, like this:
n = runif(10000)
sum = 0
estimator1_fun = function(n){
  for(i in 1:10000){
    sum = sum + ((n/i)*runif(1))
  }
  return (sum)
}
and then do the same for the other function and use the MSE formula? Am I even approaching this correctly? (I tried formatting it, but found that using an image would be better.)
Assuming U(0,Theta_0) is the uniform distribution from 0 to Theta_0, and that Theta_0 is a fixed constant, I would proceed as follows:
1. Define Theta_0; give it a fixed value.
2. Write a function that returns random numbers from that distribution.
   - The draws come from runif(N, min = 0, max = Theta_0).
   - Arguments could be Theta_0 and N.
3. Sample it a few thousand (or whatever) times into a vector X.
4. Calculate the two estimates.
5. Repeat steps 3 & 4 for more samples.
6. Plot the two estimates against the number of samples and see if they approach Theta_0, as sketched below.
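A minimal R sketch of those steps, assuming for illustration that the two estimators are 2 * mean(X) and max(X) (the poster's actual estimators are in the linked image, so substitute them in):

Theta_0 <- 5                                                           # step 1: fixed true value
draw_sample <- function(N, Theta_0) runif(N, min = 0, max = Theta_0)   # step 2

Ns   <- seq(100, 10000, by = 100)
est1 <- est2 <- numeric(length(Ns))
for (j in seq_along(Ns)) {            # steps 3-5
  X <- draw_sample(Ns[j], Theta_0)
  est1[j] <- 2 * mean(X)              # illustrative estimator 1
  est2[j] <- max(X)                   # illustrative estimator 2
}

# step 6: both curves should approach Theta_0
plot(Ns, est1, type = "l", ylim = range(c(est1, est2, Theta_0)),
     xlab = "sample size", ylab = "estimate")
lines(Ns, est2, lty = 2)
abline(h = Theta_0, col = "gray")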
Suppose I have the following 2 random variables:
X where mean = 6 and stdev = 3.5
Y where mean = -42 and stdev = 5
I would like to create a new random variable Z based on the first two, knowing that X happens 90% of the time and Y happens 10% of the time.
It is easy to calculate the mean of Z: 0.9 * 6 + 0.1 * (-42) = 1.2
But is it possible to generate random values for Z in a single function?
Of course, I could do something along those lines :
if (randIntBetween(1,10) > 1)
    GenerateRandomNormalValue(6, 3.5);
else
    GenerateRandomNormalValue(-42, 5);
But I would really like to have a single function that would act as a probability density function for such a random variable (Z) that is not necessarily normal.
sorry for the crappy pseudo-code
Thanks for your help!
Edit: here is one concrete question:
Let's say we add the results of 5 consecutive values from Z. What would be the probability of ending up with a number higher than 10?
But I would really like to have a single function that would act as a probability density function for such a random variable (Z) that is not necessarily normal.
Okay, if you want the density, here it is:
rho = 0.9 * density_of_x + 0.1 * density_of_y
But you cannot sample from this density unless you 1) compute its CDF (cumbersome, but not infeasible) and 2) invert it (you will need a numerical solver for this). Or you can do rejection sampling (or variants, e.g. importance sampling). This is costly, and cumbersome to get right.
So you should go for the "if" statement (i.e. call the generator 3 times), unless you have a very strong reason not to (using quasi-random sequences, for instance).
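For reference, a small R sketch of both pieces under the stated parameters (the function names rZ and dZ are mine, for illustration):

# Sample from Z: pick the component, then draw from the corresponding normal.
rZ <- function(n) {
  from_x <- runif(n) < 0.9
  ifelse(from_x, rnorm(n, mean = 6, sd = 3.5), rnorm(n, mean = -42, sd = 5))
}

# Density of Z: the weighted sum of the two component densities.
dZ <- function(z) 0.9 * dnorm(z, 6, 3.5) + 0.1 * dnorm(z, -42, 5)

# The poster's follow-up, by simulation: P(sum of 5 independent draws from Z > 10).
mean(replicate(100000, sum(rZ(5))) > 10)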
If a random variable is denoted x = (mean, stdev), then the following algebra applies:
number * x = ( number*mean, number*stdev )
x1 + x2 = ( mean1+mean2, sqrt(stdev1^2+stdev2^2) )
so for the case of X = (mx, sx), Y = (my, sy), the linear combination is
Z = w1*X + w2*Y = (w1*mx,w1*sx) + (w2*my,w2*sy) =
( w1*mx+w2*my, sqrt( (w1*sx)^2+(w2*sy)^2 ) ) =
( 1.2, 3.19 )
Link: Normal Distribution; look for the Miscellaneous section, item 1.
PS. Sorry for the weird notation. The new standard deviation is calculated by something similar to the Pythagorean theorem: it is the square root of the sum of squares.
This is the form of the distribution:
ListPlot[BinCounts[Table[If[RandomReal[] < .9,
RandomReal[NormalDistribution[6, 3.5]],
RandomReal[NormalDistribution[-42, 5]]], {1000000}], {-60, 20, .1}],
PlotRange -> Full, DataRange -> {-60, 20}]
It is NOT Normal, as you are not adding Normal variables but just choosing one or the other with a certain probability.
Edit
This is the curve for adding five vars with this distribution:
The upper and lower peaks represent taking one of the distributions alone, and the middle peak accounts for the mixing.
The most straightforward and generically applicable solution is to simulate the problem:
Run the piecewise function you have 1,000,000 times (just a high number), generate a histogram of the results by splitting them into bins, and divide the count for each bin by your N (1,000,000 in my example). This will leave you with an approximation of the PDF of Z at every bin.
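A short R sketch of that recipe, reusing the illustrative rZ sampler from the earlier sketch:

N <- 1000000
z <- rZ(N)                                  # run the piecewise sampler N times
breaks <- seq(floor(min(z)), ceiling(max(z)), by = 0.5)   # bins of width 0.5
h <- hist(z, breaks = breaks, plot = FALSE)
p_bin <- h$counts / N                       # per-bin probability (divide by 0.5 for a density)
plot(h$mids, p_bin, type = "h")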
Lots of unknowns here, but essentially you just wish to add the two (or more) probability functions to one another.
For any given probability function you could calculate a random number with that density by calculating the area under the probability curve (the integral) and then generating a random number between 0 and that area. Then move along the curve until the area is equal to your random number and use that as your value.
This process can then be generalized to any function (or sum of two or more functions).
Elaboration:
Suppose you have a distribution function f(x) which ranges from 0 to 1. You could calculate a random number based on the distribution by calculating the integral of f(x) from 0 to 1, giving you the area under the curve; let's call it A.
Now generate a random number between 0 and A; call that number r. You then need to find a value t such that the integral of f(x) from 0 to t is equal to r. That t is your random number.
This process can be used for any probability density function f(x). Including the sum of two (or more) probability density functions.
I'm not sure what your functions look like, so I'm not sure whether you can calculate analytic solutions for all of this, but in the worst case you could use numeric techniques to approximate the effect.
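A numeric version of that procedure in R, using the mixture density dZ from the earlier sketch and a grid-based CDF (an illustrative approximation, not an exact inverse):

# Inverse-CDF sampling on a grid: tabulate the CDF, then look up where
# a uniform draw (scaled by the total area A) lands.
grid <- seq(-70, 30, by = 0.01)
cdf  <- cumsum(dZ(grid)) * 0.01    # numeric integral of the density
A    <- max(cdf)                   # total area (close to 1 here)

rZ_inverse <- function(n) {
  r <- runif(n, 0, A)
  grid[findInterval(r, cdf) + 1]
}

hist(rZ_inverse(10000), breaks = 100)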