mathematics behind modulo behavior - math

Preamble
This question is not about the behavior of (P)RNGs and rand(). It's about applying modulo to values uniformly distributed over a power-of-two range.
Introduction
I knew that one should not use modulo % to convert a value from one range to another, for example to get a value between 0 and 5 from the rand() function: there will be a bias. It's explained here https://bitbucket.org/haypo/hasard/src/ebf5870a1a54/doc/common_errors.rst?at=default and in this answer: Why do people say there is modulo bias when using a random number generator?
But today, after investigating some code that looked wrong, I made a tool to demonstrate the behavior of modulo: https://gitorious.org/modulo-test/modulo-test/trees/master and found that it's not clear enough.
A die is only 3 bits
I checked with 6 values in the range 0..5. Only 3 bits are needed to encode those values.
$ ./modulo-test 10000 6 3
interations = 10000, range = 6, bits = 3 (0x00000007)
[0..7] => [0..5]
theorical occurences 1666.67 probability 0.16666667
[ 0] occurences 2446 probability 0.24460000 ( +46.76%)
[ 1] occurences 2535 probability 0.25350000 ( +52.10%)
[ 2] occurences 1275 probability 0.12750000 ( -23.50%)
[ 3] occurences 1297 probability 0.12970000 ( -22.18%)
[ 4] occurences 1216 probability 0.12160000 ( -27.04%)
[ 5] occurences 1231 probability 0.12310000 ( -26.14%)
minimum occurences 1216.00 probability 0.12160000 ( -27.04%)
maximum occurences 2535.00 probability 0.25350000 ( +52.10%)
mean occurences 1666.67 probability 0.16666667 ( +0.00%)
stddev occurences 639.43 probability 0.06394256 ( 38.37%)
With 3 bits of input, the results are indeed awful, but they behave as expected. See this answer: https://stackoverflow.com/a/14614899/611560
Increasing the number of input bits
What puzzled me was that increasing the number of input bits changed the results.
Don't forget to increase the number of iterations, i.e. the number of samples, accordingly, otherwise the results are likely wrong (see Wrong Statistics below).
Let's try with 4 bits:
$ ./modulo-test 20000 6 4
interations = 20000, range = 6, bits = 4 (0x0000000f)
[0..15] => [0..5]
theorical occurences 3333.33 probability 0.16666667
[ 0] occurences 3728 probability 0.18640000 ( +11.84%)
[ 1] occurences 3763 probability 0.18815000 ( +12.89%)
[ 2] occurences 3675 probability 0.18375000 ( +10.25%)
[ 3] occurences 3721 probability 0.18605000 ( +11.63%)
[ 4] occurences 2573 probability 0.12865000 ( -22.81%)
[ 5] occurences 2540 probability 0.12700000 ( -23.80%)
minimum occurences 2540.00 probability 0.12700000 ( -23.80%)
maximum occurences 3763.00 probability 0.18815000 ( +12.89%)
mean occurences 3333.33 probability 0.16666667 ( +0.00%)
stddev occurences 602.48 probability 0.03012376 ( 18.07%)
Let's try with 5 bits:
$ ./modulo-test 40000 6 5
interations = 40000, range = 6, bits = 5 (0x0000001f)
[0..31] => [0..5]
theorical occurences 6666.67 probability 0.16666667
[ 0] occurences 7462 probability 0.18655000 ( +11.93%)
[ 1] occurences 7444 probability 0.18610000 ( +11.66%)
[ 2] occurences 6318 probability 0.15795000 ( -5.23%)
[ 3] occurences 6265 probability 0.15662500 ( -6.03%)
[ 4] occurences 6334 probability 0.15835000 ( -4.99%)
[ 5] occurences 6177 probability 0.15442500 ( -7.34%)
minimum occurences 6177.00 probability 0.15442500 ( -7.34%)
maximum occurences 7462.00 probability 0.18655000 ( +11.93%)
mean occurences 6666.67 probability 0.16666667 ( +0.00%)
stddev occurences 611.58 probability 0.01528949 ( 9.17%)
Let's try with 6 bits:
$ ./modulo-test 80000 6 6
interations = 80000, range = 6, bits = 6 (0x0000003f)
[0..63] => [0..5]
theorical occurences 13333.33 probability 0.16666667
[ 0] occurences 13741 probability 0.17176250 ( +3.06%)
[ 1] occurences 13610 probability 0.17012500 ( +2.08%)
[ 2] occurences 13890 probability 0.17362500 ( +4.18%)
[ 3] occurences 13702 probability 0.17127500 ( +2.77%)
[ 4] occurences 12492 probability 0.15615000 ( -6.31%)
[ 5] occurences 12565 probability 0.15706250 ( -5.76%)
minimum occurences 12492.00 probability 0.15615000 ( -6.31%)
maximum occurences 13890.00 probability 0.17362500 ( +4.18%)
mean occurences 13333.33 probability 0.16666667 ( +0.00%)
stddev occurences 630.35 probability 0.00787938 ( 4.73%)
Question
Please explain why the results differ when changing the number of input bits (and increasing the sample count accordingly). What is the mathematical reasoning behind this?
Wrong statistics
In the previous version of the question, I showed a test with 32 bits of input and only 1000000 iterations, i.e. 10^6 samples, and said I was surprised to get correct results.
It was so wrong I'm ashamed of it: there must be many times more samples than the 2^32 possible values of the generator to have confidence in the result. Here 10^6 is way too small compared to 2^32. Bonus for people able to explain this in mathematical/statistical language.
Here are the wrong results:
$ ./modulo-test 1000000 6 32
interations = 1000000, range = 6, bits = 32 (0xffffffff)
[0..4294967295] => [0..5]
theorical occurences 166666.67 probability 0.16666667
[ 0] occurences 166881 probability 0.16688100 ( +0.13%)
[ 1] occurences 166881 probability 0.16688100 ( +0.13%)
[ 2] occurences 166487 probability 0.16648700 ( -0.11%)
[ 3] occurences 166484 probability 0.16648400 ( -0.11%)
[ 4] occurences 166750 probability 0.16675000 ( +0.05%)
[ 5] occurences 166517 probability 0.16651700 ( -0.09%)
minimum occurences 166484.00 probability 0.16648400 ( -0.11%)
maximum occurences 166881.00 probability 0.16688100 ( +0.13%)
mean occurences 166666.67 probability 0.16666667 ( +0.00%)
stddev occurences 193.32 probability 0.00019332 ( 0.12%)
I still have to read and re-read Zed Shaw's excellent article "Programmers Need To Learn Statistics Or I Will Kill Them All".

In essence, you're doing:
(rand() & 7) % 6
Let's assume that rand() is uniformly distributed on [0; RAND_MAX], and that RAND_MAX+1 is a power of two. It is clear that rand() & 7 can evaluate to 0, 1, ..., 7, and that the outcomes are equiprobable.
Now let's look at what happens when you take the result modulo 6.
0 and 6 map to 0;
1 and 7 map to 1;
2 maps to 2;
3 maps to 3;
4 maps to 4;
5 maps to 5.
This explains why you get twice as many zeroes and ones as you get the other numbers.
The same thing is happening in the second case. However, the relative excess of the over-represented numbers is much smaller, making their contribution indistinguishable from noise.
To summarize: if you have an integer uniformly distributed on [0; M-1] and you take it modulo N, the result will be biased towards the smaller values unless M is divisible by N.
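You can check this mapping exhaustively; a small Python sketch counting how often each residue appears among the 8 equally likely 3-bit inputs:

from collections import Counter

# Count how many of the 8 equally likely 3-bit values land on each residue mod 6.
# 0 and 1 each receive 2 of the 8 inputs; 2..5 receive 1 each.
print(Counter(x % 6 for x in range(8)))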

rand() (or some other PRNG) produces values in the interval [0 .. RAND_MAX]. You want to map these to the interval [0 .. N-1] using the remainder operator.
Write
(RAND_MAX+1) = q*N + r
with 0 <= r < N.
Then for each value in the interval [0 .. N-1] there are
q+1 values of rand() that are mapped to that value if the value is smaller than r
q values of rand() that are mapped to it if the value is >= r.
Now, if q is small, the relative difference between q and q+1 is large, but if q is large (2^32 / 6, for example), the difference cannot easily be measured.
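A small Python sketch of this: the ratio (q+1)/q is the worst-case factor between outcome probabilities, and it matches the measurements in the question (2.0 for 3 bits, 1.5 for 4, 1.2 for 5, 1.1 for 6, and about 1 + 1.4e-9 for 32 bits).

# For b input bits and range N: 2**b = q*N + r, and the over-represented
# outcomes are (q+1)/q times as likely as the others.
def bias_ratio(bits, N=6):
    q, r = divmod(2**bits, N)
    return (q + 1) / q if r else 1.0

for bits in (3, 4, 5, 6, 32):
    print(bits, bias_ratio(bits))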

Related

Lua: How do I calculate a random number between 50 and 500, with an average result of 100?

I think this is a log-normal distribution? I'm not sure. The Lua I have is here:
local min = 50
local max = 500
local avg = 100
local fFloat = ( RandomInt( 0, avg ) / 100 ) ^ 2 -- float between 0 and 1, squared
local iRange = max - min -- range of min-max
local fDistribution = ( ( fFloat * iRange ) + min ) / 100
finalRandPerc = fDistribution * RandomInt( min, max ) / 100
It is close to working, but sometimes generates numbers that are slightly too large.
This can be done in an infinite number of ways. One other approach is to generate a number from a binomial distribution, multiply it by 450 and add 50. I will leave the task of finding the right parameters for the binomial distribution to you; a sketch with one possible choice follows.
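For illustration only (in Python; n = 30 is an arbitrary choice, and p = 1/9 comes from solving 50 + 450*p = 100 for the mean):

import random

def random_50_500(n=30, p=1/9):
    # Binomial(n, p) rescaled to [50, 500]; mean = 50 + 450*p = 100 when p = 1/9.
    successes = sum(random.random() < p for _ in range(n))
    return 50 + 450 * successes / n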
You can use a chi-squared distribution with 4 degrees of freedom, with its tail removed.
It is very easy to generate: if U1 and U2 are uniform on (0, 1), then -2(ln U1 + ln U2) is chi-squared with 4 degrees of freedom (mean 4), so returning x * -25 + 50 with x = ln U1 + ln U2 gives a mean near 50 + 12.5 * 4 = 100.
local function random_50_500()
    -- result is from 50 to 500
    -- mean is very near to 100
    local x
    repeat
        x = math.log(math.random()) + math.log(math.random())
    until x > -18
    return x * -25 + 50
end

Is there a function f(n) that returns the n-th combination in an ordered list of combinations without repetition?

Combinations without repetitions look like this, when the number of elements to choose from (n) is 5 and elements chosen (r) is 3:
0 1 2
0 1 3
0 1 4
0 2 3
0 2 4
0 3 4
1 2 3
1 2 4
1 3 4
2 3 4
As n and r grow, the number of combinations gets large pretty quickly. For (n, r) = (200, 4) the number of combinations is 64684950.
It is easy to iterate the list with r nested for loops, where the initial value of each loop is greater than the current value of the loop in which it is nested, as in this fiddle example:
https://dotnetfiddle.net/wHWK5o
What I would like is a function that calculates only one combination based on its index. Something like this:
tuple combination(i, n, r) {
    return [combination with index i, when the number of elements to choose from is n and elements chosen is r]
}
Does anyone know if this is doable?
You would first need to impose some sort of ordering on the set of all combinations available for a given n and r, such that a linear index makes sense. I suggest we agree to keep our combinations in increasing order (or, at least, the indices of the individual elements), as in your example. How then can we go from a linear index to a combination?
Let us first build some intuition for the problem. Suppose we have n = 5 (e.g. the set {0, 1, 2, 3, 4}) and r = 3. How many unique combinations are there in this case? The answer is of course 5-choose-3, which evaluates to 10. Since we will sort our combinations in increasing order, consider for a minute how many combinations remain once we have exhausted all those starting with 0. This must be 4-choose-3, or 4 in total. In such a case, if we are looking for the combination at index 7 initially, this implies we must subtract 10 - 4 = 6 and search for the combination at index 1 in the set {1, 2, 3, 4}. This process continues until we find a new index that is smaller than this offset.
Once this process concludes, we know the first digit. Then we only need to determine the remaining r - 1 digits! The algorithm thus takes shape as follows (in Python, but this should not be too difficult to translate),
from math import factorial

def choose(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

def combination_at_idx(idx, elems, r):
    if len(elems) == r:
        # We are looking for r elements in a list of size r - thus, we need
        # each element.
        return elems
    if len(elems) == 0 or len(elems) < r:
        return []
    combinations = choose(len(elems), r)  # total number of combinations
    remains = choose(len(elems) - 1, r)   # combinations after selection
    offset = combinations - remains
    if idx >= offset:  # combination does not start with first element
        return combination_at_idx(idx - offset, elems[1:], r)
    # We now know the first element of the combination, but *not* yet the next
    # r - 1 elements. These need to be computed as well, again recursively.
    return [elems[0]] + combination_at_idx(idx, elems[1:], r - 1)
Test-driving this with your initial input,
N = 5
R = 3
for idx in range(choose(N, R)):
    print(idx, combination_at_idx(idx, list(range(N)), R))
I find,
0 [0, 1, 2]
1 [0, 1, 3]
2 [0, 1, 4]
3 [0, 2, 3]
4 [0, 2, 4]
5 [0, 3, 4]
6 [1, 2, 3]
7 [1, 2, 4]
8 [1, 3, 4]
9 [2, 3, 4]
Where the linear index is zero-based.
Start with the first element of the result. The value of that element depends on the number of combinations you can get with smaller elements. For each such smaller first element, the number of combinations with first element k is n − k − 1 choose r − 1, with potentially some off-by-one corrections. So you would sum over a bunch of binomial coefficients. Wolfram Alpha can help you compute such a sum, but the result still has a binomial coefficient in it. Solving for the largest k such that the sum doesn't exceed your given index i is a computation you can't do with something as simple as e.g. a square root. You need a loop to test possible values, e.g. like this:
def first_naive(i, n, r):
    """Find first element and index of first combination with that first element.

    Returns a tuple of value and index.

    Example: first_naive(8, 5, 3) returns (1, 6) because the combination with
    index 8 is [1, 3, 4] so it starts with 1, and because the first combination
    that starts with 1 is [1, 2, 3] which has index 6.
    """
    s1 = 0
    for k in range(n):
        s2 = s1 + choose(n - k - 1, r - 1)
        if i < s2:
            return k, s1
        s1 = s2
You can reduce the O(n) loop iterations to O(log n) steps using bisection, which is particularly relevant for large n. In that case I find it easier to think about numbering items from the end of your list. In the case of n = 5 and r = 3 you get choose(2, 2)=1 combinations starting with 2, choose(3,2)=3 combinations starting with 1 and choose(4,2)=6 combinations starting with 0. So in the general choose(n,r) binomial coefficient you increase the n with each step, and keep the r. Taking into account that sum(choose(k,r) for k in range(r,n+1)) can be simplified to choose(n+1,r+1), you can eventually come up with bisection conditions like the following:
def first_bisect(i, n, r):
    nCr = choose(n, r)
    k1 = r - 1
    s1 = nCr
    k2 = n
    s2 = 0
    while k2 - k1 > 1:
        k3 = (k1 + k2) // 2
        s3 = nCr - choose(k3, r)
        if s3 <= i:
            k2, s2 = k3, s3
        else:
            k1, s1 = k3, s3
    return n - k2, s2
Once you know the first element to be k, you also know the index of the first combination with that same first element (also returned from my function above). You can use the difference between that first index and your actual index as input to a recursive call. The recursive call would be for r − 1 elements chosen from n − k − 1. And you'd add k + 1 to each element from the recursive call, since the top level returns values starting at 0 while the next element has to be greater than k in order to avoid duplication.
def combination(i, n, r):
    """Compute combination with a given index.

    Equivalent to list(itertools.combinations(range(n), r))[i].

    Each combination is represented as a tuple of ascending elements, and
    combinations are ordered lexicographically.

    Args:
        i: zero-based index of the combination
        n: number of possible values, will be taken from range(n)
        r: number of elements in result list
    """
    if r == 0:
        return []
    k, ik = first_bisect(i, n, r)
    return tuple([k] + [j + k + 1 for j in combination(i - ik, n - k - 1, r - 1)])
I've got a complete working example, including an implementation of choose, more detailed doc strings and tests for some basic assumptions.
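As a quick sanity check (a sketch, assuming choose from the first answer is in scope), the output can be compared against itertools:

import itertools

# Verify against itertools for a small case (n = 6, r = 3).
for i, expected in enumerate(itertools.combinations(range(6), 3)):
    assert combination(i, 6, 3) == expected

print(combination(8, 5, 3))  # (1, 3, 4), matching the first_naive docstring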

Number of actions per year. Combinatorics question

I'm writing a diploma thesis about vaccines. There is a region, its population, and 12 months. There is an array of 12 values from 0 to 1 with step 0.01; each value means which fraction of the population we should vaccinate in that month.
For example, if we have array = [0.1,0,0,0,0,0,0,0,0,0,0,0], that means we should vaccinate 0.1 of the region's population, only in the first month.
Another array = [0, 0.23,0,0,0,0,0,0, 0.02,0,0,0] means we should vaccinate 0.23 of the region's population in the second month and 0.02 of it in the 9th month.
So the question is: how do I generate (using 3 loops) 12 (months) * 12 (times of vaccinating) * 100 (number of steps from 0 to 1) = 14400 arrays that will contain every version of these combinations?
For now I have this code:
for (int month = 0; month < 12; month++) {
    for (double step = 0; step <= 1; step += 0.01) {
        double[] arr = new double[12];
        arr[month] = step;
    }
}
I need to add a third loop that will vary the number of vaccinations per year.
I have no idea how to write it.
I don't know if this is understandable; I hope you get it, otherwise please ask me.
You have 101 variants for the first month: 0.00, 0.01, ..., 1.00.
And 101 variants for the second month - the same values.
So there are 101*101 possible combinations for two months.
Continuing, for all 12 months you have 101^12 variants, about 10^24.
It is not possible to generate and store so many combinations (at least in the current decade).
If the step is larger than 0.01, then the combination count might be manageable. The general formula is P = N^M, where N is the number of variants per month and M is the number of months.
You can traverse all combinations by representing all integers in the range 0..P-1 in the base-N numeral system. Or make a digit counter:
fill array D[12] with zeros
repeat
    increment the element at the last index by the step value
    if it reaches the limit, make it zero
    and increment the element at the next index
until the first element reaches the limit
It is similar to counting 08, 09, 10: we cannot increment 9, so we reset it to 0 and carry into the next digit.
s = 1
m = 3
mx = 3
l = [0] * m
i = 0
while i < m:
    print([x / 3 for x in l])
    i = 0
    l[i] += s
    while (i < m) and l[i] > mx:
        l[i] = 0
        i += 1
        if i < m:
            l[i] += s
The Python code prints 64 ((mx/s + 1)^m = 4^3) variants like [0.3333, 0.6666, 0.0].
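For what it's worth, with a coarser step the same enumeration can be written with itertools.product; the step and month count below are illustrative only, since the N^M growth is unchanged:

from itertools import product

step = 0.25  # coarser than 0.01 to keep the count manageable
months = 3   # 12 for a full year, but mind the growth in (1/step + 1)**months
levels = [round(i * step, 2) for i in range(int(1 / step) + 1)]

for arr in product(levels, repeat=months):
    print(arr)  # (1/step + 1)**months = 5**3 = 125 arrays here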

Sorting with parity in Julia

Suppose I have the following array:
[6,3,3,5,6]
Is there an already implemented way to sort the array that also returns the number of transpositions the algorithm had to make to sort it?
For instance, I have to move the 6 three places to the right so the array is ordered, which would give me parity -1.
The general problem is to order an arbitrary array (all integers, with repeated values!) and to know the parity of the permutation performed by the algorithm to order the array.
a=[6,3,3,5,6]
sortperm(a) - [ 1:size(a)[1] ]
Results in
5-element Array{Int64,1}:
1
1
1
-3
0
sortperm shows you where each element should go; subtracting 1:size(a)[1] compares each target index with its original position.
If your array is small, you can compute the determinant of the permutation matrix:
using LinearAlgebra  # needed for det on Julia >= 0.7

function permutation_sign_1(p)
    n = length(p)
    A = zeros(n, n)
    for i in 1:n
        A[i, p[i]] = 1
    end
    det(A)
end
In general, you can decompose the permutation into a product of cycles, count the number of even-length cycles, and return the parity of that count.
function permutation_sign_2(p)
    n = length(p)
    not_seen = Set{Int}(1:n)
    seen = Set{Int}()
    cycles = Array{Int,1}[]
    while !isempty(not_seen)
        cycle = Int[]
        x = pop!(not_seen)
        while !in(x, seen)
            push!(cycle, x)
            push!(seen, x)
            x = p[x]
            pop!(not_seen, x, 0)
        end
        push!(cycles, cycle)
    end
    cycle_lengths = map(length, cycles)
    even_cycles = filter(i -> i % 2 == 0, cycle_lengths)
    length(even_cycles) % 2 == 0 ? 1 : -1
end
The parity of a permutation can also be obtained from its number of inversions, which can be computed by slightly modifying the merge sort algorithm. Since inversion counting is also used to compute Kendall's tau (see the implementation of corkendall), there is already an implementation:
using StatsBase

function permutation_sign_3(p)
    x = copy(p)
    number_of_inversions = StatsBase.swaps!(x)
    number_of_inversions % 2 == 0 ? +1 : -1
end
On your example, those three functions give the same result:
x = [6,3,3,5,6]
p = sortperm(x)
permutation_sign_1( p )
permutation_sign_2( p )
permutation_sign_3( p ) # -1
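For reference, the inversion-count definition is short enough to write directly; a minimal O(n^2) sketch (in Python, as the idea is language-independent):

def permutation_sign(p):
    # Parity = (-1) ** (number of inversions).
    inversions = sum(
        1 for i in range(len(p)) for j in range(i + 1, len(p)) if p[i] > p[j]
    )
    return 1 if inversions % 2 == 0 else -1

print(permutation_sign([2, 3, 4, 1, 5]))  # -1; this is sortperm([6,3,3,5,6])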

Minimum number of elements required to make a sequence that sums to a particular number

Suppose there is a number s = 12; now I want to make a sequence of elements with a1 + a2 + ... + an = 12.
The criteria are as follows:
n must be minimal;
a1 and an must be 1;
ai can differ from a(i-1) by only 1, 0, or -1.
For s = 12 the result is 6.
So how do I find the minimum value of n?
Algorithm for finding n from a given s:
1. Find q = FLOOR(SQRT(s - 1)).
2. Find r = q^2 + q.
3. If s <= r then n = 2q, else n = 2q + 1.
Example: s = 12
q = FLOOR(SQRT(12 - 1)) = FLOOR(SQRT(11)) = 3
r = 3^2 + 3 = 12
12 <= 12, therefore n = 2*3 = 6
Example: s = 159
q = FLOOR(SQRT(159 - 1)) = FLOOR(SQRT(158)) = 12
r = 12^2 + 12 = 156
159 > 156, therefore n = 2*12 + 1 = 25
and the 25-number sequence for 159 is:
1,2,3,4,5,6,7,8,9,10,10,10,9,10,10,10,9,8,7,6,5,4,3,2,1
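A direct Python translation of these steps (a sketch; math.isqrt is the integer square root):

from math import isqrt

def min_length(s):
    q = isqrt(s - 1)  # q = FLOOR(SQRT(s - 1))
    r = q * q + q     # r = q^2 + q
    return 2 * q if s <= r else 2 * q + 1

print(min_length(12))   # 6
print(min_length(159))  # 25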
Here's a way to visualize the solution.
First, draw the smallest triangle (rows containing successive odd numbers of stars) whose star count is greater than or equal to s. In this case, we draw a 16-star triangle.
*
***
*****
*******
Then we have to remove 16 - 12 = 4 stars. We do this diagonally, starting from the top (the digits mark the removal order).
1
**2
****3
******4
The result is:
**
****
******
Finally, add up the column heights to get the final answer:
1, 2, 3, 3, 2, 1.
There are two cases: n odd and n even. When n is odd, the maximal sequence is:
1, 2, 3, ..., (n-1)/2, (n+1)/2, (n-1)/2, ..., 1
and when n is even:
1, 2, 3, ..., n/2, n/2, n/2-1, ..., 1
So the maximum possible sum for any given series of length n is:
n even => (n^2 + 2n)/4
n odd => (n + 1)^2/4
These two results follow easily from the arithmetic sum of a series: for n even it is twice the sum of the series 1...n/2; for n odd it is twice the sum of the series 1...(n-1)/2 plus (n+1)/2 (the middle element).
Clearly you can generate any positive sum up to this maximum, as long as n > 3.
So the problem then becomes finding the smallest n whose maximum is at least your target.
Algorithmically I'd go for:
Find sqrt(4s) - 1 and round it up to the next odd number. Call this M. This is easy to work out and is the lowest odd n that will work.
Check M - 1 to see if its maximum sum is at least s. If so, your n is M - 1. Otherwise your n is M.
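A minimal sketch of this in Python (max_sum encodes the formulas above; the function names are mine):

import math

def max_sum(n):
    # n even => (n^2 + 2n)/4 ; n odd => (n + 1)^2 / 4
    return (n * n + 2 * n) // 4 if n % 2 == 0 else (n + 1) ** 2 // 4

def min_length_odd(s):
    M = math.ceil(math.sqrt(4 * s) - 1)  # round up to the next odd number
    if M % 2 == 0:
        M += 1
    return M - 1 if max_sum(M - 1) >= s else M

print(min_length_odd(12))   # 6
print(min_length_odd(159))  # 25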
Thanks to all of you for answering me. I derived a simpler solution. The algorithm looks like this:
First find the maximum sum that can be made using n elements:
if n = 1 -> 1            sum = 1
if n = 2 -> 1,1          sum = 2
if n = 3 -> 1,2,1        sum = 4
if n = 4 -> 1,2,2,1      sum = 6
if n = 5 -> 1,2,3,2,1    sum = 9
if n = 6 -> 1,2,3,3,2,1  sum = 12
So from observation it is clear that any number s with 9 < s <= 12 can be made using 6 elements; similarly, any number with 6 < s <= 9 can be made using 5 elements.
So it only requires a binary search over these maximum sums to find the number of elements that makes a particular number.
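A sketch of that binary search (max_sum repeats the formula from the earlier answer so the snippet is self-contained):

def max_sum(n):
    # Maximum sum achievable with n elements (matches the table above).
    return (n * n + 2 * n) // 4 if n % 2 == 0 else (n + 1) ** 2 // 4

def min_length_bisect(s):
    lo, hi = 1, 2 * s  # 2*s elements are always enough, so this bounds the search
    while lo < hi:
        mid = (lo + hi) // 2
        if max_sum(mid) >= s:
            hi = mid
        else:
            lo = mid + 1
    return lo

print(min_length_bisect(12))  # 6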
