Generating sorted random ints without the sort? O(n) - math

Just been looking at a code golf question about generating a sorted list of 100 random integers. What popped into my head, however, was the idea that you could generate instead a list of positive deltas, and just keep adding them to a running total, thus:
deltas: 1 3 2 7 2
ints: 1 4 6 13 15
In fact, you would use floats, then normalise to fit some upper limit, and round, but the effect is the same.
Although it wouldn't make for shorter code, it would certainly be faster without the sort step. But the thing I have no real handle on is this: Would the resulting distribution of integers be the same as generating 100 random integers from a uniformly distributed probability density function?
Edit: A sample script:
import random,sys
running = 0
max = 1000
deltas = [random.random() for i in range(0,11)]
floats = []
for d in deltas:
running += d
floats.append(running)
upper = floats.pop()
ints = [int(round(f/upper*max)) for f in floats]
print(ints)
Whose output (fair dice roll) was:
[24, 71, 133, 261, 308, 347, 499, 543, 722, 852]
UPDATE: Alok's answer and Dan Dyer's comment point out that using an exponential distribution for the deltas would give a uniform distribution of integers.

So you are asking if the numbers generated in this way are going to be uniformly distributed.
You are generating a series:
yj = ∑i=0j ( xi / A )
where A is the sum of all xi. xi is the list of (positive) deltas.
This can be done iff xi are exponentially distributed (with any fixed mean). So, if xi are uniformly distributed, the resulting yj will not be uniformly distributed.
Having said that, it's fairly easy to generate exponential xi values.
One example would be:
sum := 0
for I = 1 to N do:
X[I] = sum = sum - ln(RAND)
sum = sum - ln(RAND)
for I = 1 to N do:
X[I] = X[I]/sum
and you will have your random numbers sorted in the range [0, 1).
Reference: Generating Sorted Lists of Random Numbers. The paper has other (faster) algorithms as well.
Of course, this generates floating-point numbers. For uniform distribution of integers, you can replace sum above by sum/RANGE in the last step (i.e., the R.H.S becomes X[I]*RANGE/sum, and then round the numbers to the nearest integer).

A uniform distribution has an upper and a lower bound. If you use your proposed method, and your deltas happen to be chosen large enough that you run into the upper bound before you have generated all your numbers, what would your algorithm do next?
Having said that, you may want to investigate the Poisson distribution, which is the distribution of interval times between random events occurring with a given average frequency.

If you take the number range of being 1 to 1000, and you have to use 100 of these numbers, the delta will have to be as a minimum 10, otherwise you can not reach the 1000 mark. How about some working to demonstrate it in action...
The chance of any given number in an evenly distributed random selection is 100/1000 e.g. 1/10 - no shock there, take that as the basis.
Assuming you start using a delta and that delta is just 10.
The odds of getting the number 1 is 1/10 - seems fine.
The odds of getting the number 2 is 1/10 + (1/10 * 1/10) (because you could hit 2 deltas of 1 in a row, or just hit a 2 as the first delta.)
The odds of getting the number 3 is 1/10 + (1/10 * 1/10 * 1/10) + (1/10 * 1/10) + (1/10 * 1/10)
The first case was a delta of 3, the second was hitting 3 deltas of 1 in a row, the third case would be a delta of 1 followed by a 2, and the fourth case was a delta of 2 followed by a 1.
For the sake of my fingers typing, we won't generate the combinations that hit 5.
Immediately the first few numbers have a greater percentage chance than the straight random.
This could be altered by changing the delta value so the fractions are all different, but I do not believe you could find a delta that produced identical odds.
To give an analogy that might just sink it, if you consider your delta as just 6 and you run that twice it is the equivalent of throwing 2 dice - each of the deltas is independant, but you know that 7 has a higher chance of being selected than 2.

I think it will be extremely similar but the extremes will be different because of the normalization. For example, 100 numbers chosen at random between 1 and 100 could all be 1. However, 100 numbers created using your system could all have deltas of 0.01 but when you normalize them you'll scale them up to be in the range 1 -> 100 which will mean you'll never get that strange possibility of a set of very low numbers.

Alok's answer and Dan Dyer's comment point out that using an exponential distribution for the deltas would give a uniform distribution of integers.
So the new version of the code sample in the question would be:
import random,sys
running = 0
max = 1000
deltas = [random.expovariate(1.0) for i in range(0,11)]
floats = []
for d in deltas:
running += d
floats.append(running)
upper = floats.pop()
ints = [int(round(f/upper*max)) for f in floats]
print(ints)
Note the use of random.expovariate(1.0), a Python exponential distribution random number generator (very useful!). Here it's called with a mean of 1.0, but since the script normalises against the last number in the sequence, the mean itself doesn't matter.
Output (fair dice roll):
[11, 43, 148, 212, 249, 458, 539, 725, 779, 871]

Q: Would the resulting distribution of integers be the same as generating 100 random integers from a uniformly distributed probability density function?
A: Each delta will be uniformly distributed. The central limit theorem tells us that the distribution of a sum of a large number of such deviates (since they have a finite mean and variance) will tend to the normal distribution. Hence the later deviates in your sequence will not be uniformly distributed.
So the short answer is "no". Afraid I cannot give a simple solution without doing algebra I don't have time to do today!

The reference (1979) in Alok's answer is interesting. It gives an algorithm for generating the uniform order statistics not by addition but by successive multiplication:
max = 1.
for i = N downto 1 do
out[i] = max = max * RAND^(1/i)
where RAND is uniform on [0,1). This way you don't have to normalize at the end, and in fact don't even have to store the numbers in an array; you could use this as an iterator.
The Exponential distribution: theory, methods and applications
By N. Balakrishnan, Asit P. Basu gives another derivation of this algorithm on page 22 and credits Malmquist (1950).

You can do it in two passes;
in the first pass, generate deltas between 0 and (MAX_RAND/n)
in the second pass, normalise the random numbers to be within bounds
Still O(n), with good locality of reference.

Related

how to sample from an upside down bell curve

I can generate numbers with uniform distribution by using the code below:
runif(1,min=10,max=20)
How can I sample randomly generated numbers that fall more frequently closer to the minimum and maxium boundaries? (Aka an "upside down bell curve")
Well, bell curve is usually gaussian, meaning it doesn't have min and max. You could try Beta distribution and map it to desired interval. Along the lines
min <- 1
max <- 20
q <- min + (max-min)*rbeta(10000, 0.5, 0.5)
As #Gregor-reinstateMonica noted, Beta distribution is bounded on both ends, [0...1], so it could be easily mapped into any bounded interval just by scale and shift. It has two parameters, and symmetric if those parameters are equal. Above 1 parameters make it kind of bell distribution, but below 1 parameters make it into inverse bell, what you're looking for. You could play with them, put different values instead of 0.5 and see how it is going. Parameters equal to 1 makes it uniform.
Sampling from a beta distribution is a good idea. Another way is to sample a number of uniform numbers and then take the minimum or maximum of them.
According to the theory of order statistics, the cumulative distribution function for the maximum is F(x)^n where F is the cdf from which the sample is taken and n is the number of samples, and the cdf for the minimum is 1 - (1 - F(x))^n. For a uniform distribution, the cdf is a straight line from 0 to 1, i.e., F(x) = x, and therefore the cdf of the maximum is x^n and the cdf of the minimum is 1 - (1 - x)^n. As n increases, these become more and more curved, with most of the mass close to the ends.
A web search for "order statistics" will turn up some resources.
If you don't care about decimal places, a hacky way would be to generate a large sample of normally distributed datapoints using rnorm(), then count the number of times each given rounded value appears (n), and then substract n from the maximum value of n (max(n)) to get inverse counts.
You can then use the inverse count to make a new vector (that you can sample from), i.e.:
library(tidyverse)
x <- rnorm(100000, 100, 15)
x_tib <- round(x) %>%
tibble(x = .) %>%
count(x) %>%
mutate(new_n = max(n) - n)
new_x <- rep(x_tib$x, x_tib$new_n)
qplot(new_x, binwidth = 1)
An "upside-down bell curve" compared to the normal distribution can be sampled using the following algorithm. I write it in pseudocode because I'm not familiar with R. Notice that this sampler samples in a truncated interval (here, the interval [x0, x1]) because it's not possible for an upside-down bell curve extended to infinity to integrate to 1 (which is one of the requirements for a probability density).
In the pseudocode, RNDU01() is a uniform(0, 1) random number.
x0pdf = 1-exp(-(x0*x0))
x1pdf = 1-exp(-(x1*x1))
ymax = max(x0pdf, x1pdf)
while true
# Choose a random x-coordinate
x=RNDU01()*(x1-x0)+x0
# Choose a random y-coordinate
y=RNDU01()*ymax
# Return x if y falls within PDF
if y < 1-exp(-(x*x)): return x
end

Generate N random integers that are sampled from a uniform distribution and sum to M in R [duplicate]

In some code I want to choose n random numbers in [0,1) which sum to 1.
I do so by choosing the numbers independently in [0,1) and normalizing them by dividing each one by the total sum:
numbers = [random() for i in range(n)]
numbers = [n/sum(numbers) for n in numbers]
My "problem" is, that the distribution I get out is quite skew. Choosing a million numbers not a single one gets over 1/2. By some effort I've calculated the pdf, and it's not nice.
Here is the weird looking pdf I get for 5 variables:
Do you have an idea for a nice algorithm to choose the numbers, that result in a more uniform or simple distribution?
You are looking to partition the distance from 0 to 1.
Choose n - 1 numbers from 0 to 1, sort them and determine the distances between each of them.
This will partition the space 0 to 1, which should yield the occasional large result which you aren't getting.
Even so, for large values of n, you can generally expect your max value to decrease as well, just not as quickly as your method.
You might be interested in the Dirichlet distribution which is used for generate quantities that sum to 1 if you're looking for probabilities. There's also a section on how to generate them using gamma distributions here.
Another way to get n random numbers which sum up to 1:
import random
def create_norm_arr(n, remaining=1.0):
random_numbers = []
for _ in range(n - 1):
r = random.random() # get a random number in [0, 1)
r = r * remaining
remaining -= r
random_numbers.append(r)
random_numbers.append(remaining)
return random_numbers
random_numbers = create_norm_arr(5)
print(random_numbers)
print(sum(random_numbers))
This makes higher numbers more likely.

Partial sums of harmonic series

Formula:
I was told by my math teacher that it is impossible to calculate from the formula above n that is neccesary for sum to exceed 40 ( sum > 40), and know the sum in 50 decimals precision.
(in short: First n that is neccesary for sum > 40, and what would that sum be in 50 decimals precision)
I tryed writing c++ program for this, but realized after tno of optimizations that it would take just way too long.
H_n is bounded below by ln n + gamma where gamma is the Euler-Mascheroni constant (http://en.wikipedia.org/wiki/Euler%E2%80%93Mascheroni_constant). So you can start by finding n such that \ln n + gamma = 40. Solving, you get ln n = 40 - gamma, n = e^(40-gamma), which is quite straightforward to calculate. Once you know the ballpark, you can use a binary search and more accurate over and under estimates for H_n (see the asymptotic expansion at http://en.wikipedia.org/wiki/Harmonic_number#Calculation; there are many references that can provide more detail).
Why would that be impossible? It's
40.00000000000000000202186036912232961108532260403356
Steps to get there:
Ask Wolfram Alpha for the number n where the sum equals 40.
You'll get something around 1.32159290357566702732792368 10^17. Pick the next higher integer.
Compute the sum for n = 132159290357566703.
Click on "More digits" until satisfied.

Calculate original set size after hash collisions have occurred

You have an empty ice cube tray which has n little ice cube buckets, forming a natural hash space that's easy to visualize.
Your friend has k pennies which he likes to put in ice cube trays. He uses a random number generator repeatedly to choose which bucket to put each penny. If the bucket determined by the random number is already occupied by a penny, he throws the penny away and it is never seen again.
Say your ice cube tray has 100 buckets (i.e, would make 100 ice cubes). If you notice that your tray has c=80 pennies, what is the most likely number of pennies (k) that your friend had to start out with?
If c is low, the odds of collisions are low enough that the most likely number of k == c. E.g. if c = 3, then it's most like that k was 3. However, the odds of a collision are increasingly likely, after say k=14 then odds are there should be 1 collision, so maybe it's maximally likely that k = 15 if c = 14.
Of course if n == c then there would be no way of knowing, so let's set that aside and assume c < n.
What's the general formula for estimating k given n and c (given c < n)?
The problem as it stands is ill-posed.
Let n be the number of trays.
Let X be the random variable for the number of pennies your friend started with.
Let Y be the random variable for the number of filled trays.
What you are asking for is the mode of the distribution P(X|Y=c).
(Or maybe the expectation E[X|Y=c] depending on how you interpret your question.)
Let's take a really simple case: the distribution P(X|Y=1). Then
P(X=k|Y=1) = (P(Y=1|X=k) * P(X=k)) / P(Y=1)
= (1/nk-1 * P(X=k)) / P(Y=1)
Since P(Y=1) is normalizing constant, we can say P(X=k|Y=1) is proportional to 1/nk-1 * P(X=k).
But P(X=k) is a prior probability distribution. You have to assume some probability distribution on the number of coins your friend has to start with.
For example, here are two priors I could choose:
My prior belief is that P(X=k) = 1/2k for k > 0.
My prior belief is that P(X=k) = 1/2k - 100 for k > 100.
Both would be valid priors; the second assumes that X > 100. Both would give wildly different estimates for X: prior 1 would estimate X to be around 1 or 2; prior 2 would estimate X to be 100.
I would suggest if you continue to pursue this question you just go ahead and pick a prior. Something like this would work nicely: WolframAlpha. That's a geometric distribution with support k > 0 and mean 10^4.

Combining two normal random variables

suppose I have the following 2 random variables :
X where mean = 6 and stdev = 3.5
Y where mean = -42 and stdev = 5
I would like to create a new random variable Z based on the first two and knowing that : X happens 90% of the time and Y happens 10% of the time.
It is easy to calculate the mean for Z : 0.9 * 6 + 0.1 * -42 = 1.2
But is it possible to generate random values for Z in a single function?
Of course, I could do something along those lines :
if (randIntBetween(1,10) > 1)
GenerateRandomNormalValue(6, 3.5);
else
GenerateRandomNormalValue(-42, 5);
But I would really like to have a single function that would act as a probability density function for such a random variable (Z) that is not necessary normal.
sorry for the crappy pseudo-code
Thanks for your help!
Edit : here would be one concrete interrogation :
Let's say we add the result of 5 consecutives values from Z. What would be the probability of ending with a number higher than 10?
But I would really like to have a
single function that would act as a
probability density function for such
a random variable (Z) that is not
necessary normal.
Okay, if you want the density, here it is:
rho = 0.9 * density_of_x + 0.1 * density_of_y
But you cannot sample from this density if you don't 1) compute its CDF (cumbersome, but not infeasible) 2) invert it (you will need a numerical solver for this). Or you can do rejection sampling (or variants, eg. importance sampling). This is costly, and cumbersome to get right.
So you should go for the "if" statement (ie. call the generator 3 times), except if you have a very strong reason not to (using quasi-random sequences for instance).
If a random variable is denoted x=(mean,stdev) then the following algebra applies
number * x = ( number*mean, number*stdev )
x1 + x2 = ( mean1+mean2, sqrt(stdev1^2+stdev2^2) )
so for the case of X = (mx,sx), Y= (my,sy) the linear combination is
Z = w1*X + w2*Y = (w1*mx,w1*sx) + (w2*my,w2*sy) =
( w1*mx+w2*my, sqrt( (w1*sx)^2+(w2*sy)^2 ) ) =
( 1.2, 3.19 )
link: Normal Distribution look for Miscellaneous section, item 1.
PS. Sorry for the wierd notation. The new standard deviation is calculated by something similar to the pythagorian theorem. It is the square root of the sum of squares.
This is the form of the distribution:
ListPlot[BinCounts[Table[If[RandomReal[] < .9,
RandomReal[NormalDistribution[6, 3.5]],
RandomReal[NormalDistribution[-42, 5]]], {1000000}], {-60, 20, .1}],
PlotRange -> Full, DataRange -> {-60, 20}]
It is NOT Normal, as you are not adding Normal variables, but just choosing one or the other with certain probability.
Edit
This is the curve for adding five vars with this distribution:
The upper and lower peaks represent taking one of the distributions alone, and the middle peak accounts for the mixing.
The most straightforward and generically applicable solution is to simulate the problem:
Run the piecewise function you have 1,000,000 (just a high number) of times, generate a histogram of the results (by splitting them into bins, and divide the count for each bin by your N (1,000,000 in my example). This will leave you with an approximation for the PDF of Z at every given bin.
Lots of unknowns here, but essentially you just wish to add the two (or more) probability functions to one another.
For any given probability function you could calculate a random number with that density by calculating the area under the probability curve (the integral) and then generating a random number between 0 and that area. Then move along the curve until the area is equal to your random number and use that as your value.
This process can then be generalized to any function (or sum of two or more functions).
Elaboration:
If you have a distribution function f(x) which ranges from 0 to 1. You could calculate a random number based on the distribution by calculating the integral of f(x) from 0 to 1, giving you the area under the curve, lets call it A.
Now, you generate a random number between 0 and A, let's call that number, r. Now you need to find a value t, such that the integral of f(x) from 0 to t is equal to r. t is your random number.
This process can be used for any probability density function f(x). Including the sum of two (or more) probability density functions.
I'm not sure what your functions look like, so not sure if you are able to calculate analytic solutions for all this, but worse case scenario, you could use numeric techniques to approximate the effect.

Resources