3 random numbers from 2 random numbers task - math

Suppose you have some uniform-distribution rnd(x) function that returns 0 or 1.
How can you use this function to build an rnd(x, n) function that returns uniformly distributed numbers from 0 to n?
I mean, everyone seems to use this, but to me it is not obvious. For example, I can create distributions whose right border is 2^n-1 ([0-1], [0-3], [0-7], etc.), but I can't find a way to do this for ranges like [0-2] or [0-5] without using very big numbers for reasonable precision.

Suppose you need to create a function rnd(n) which returns a uniformly distributed random number in the range [0, n] by using another function rnd1() which returns 0 or 1.
Find the smallest k such that 2^k >= n+1.
Create a number consisting of k bits and fill all its bits using rnd1(). The result is a uniformly distributed number in the range [0, 2^k-1].
Compare the generated number to n. If it is smaller than or equal to n, return it. Otherwise go to step 2. (See the sketch just below.)
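A minimal Python sketch of those steps, assuming rnd1() is the given 0/1 source (random.getrandbits(1) stands in for it here):
import random

def rnd1():
    return random.getrandbits(1)   # stand-in for the given 0/1 generator

def rnd(n):
    k = n.bit_length()             # smallest k with 2**k >= n + 1
    while True:
        x = 0
        for _ in range(k):         # step 2: fill k bits using rnd1()
            x = (x << 1) | rnd1()
        if x <= n:                 # step 3: accept, otherwise redraw
            return x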
In general, this is a variation of how to generate uniform numbers in a small range by using a library function which generates numbers in a large range:
#include <limits.h>

unsigned int rnd_full_unsigned_int(void);   /* assumed elsewhere: uniform over [0, UINT_MAX] */

unsigned int rnd(unsigned int n) {
    while (1) {
        unsigned int x = rnd_full_unsigned_int();
        /* accept only values below the largest reachable multiple of (n+1) */
        if (x < UINT_MAX / (n + 1) * (n + 1)) {
            return x % (n + 1);
        }
    }
}
Explanation of the above code: if you simply return rnd_full_unsigned_int() % (n+1), the result is biased toward small values. The original answer illustrated this with a diagram: a black spiral representing all possible values from 0 to UINT_MAX, counted from the inside, with each revolution of the spiral having length (n+1); a red line marked the final, incomplete revolution, which is where the bias comes from. So, in order to remove this bias, we first create a random number x in the range [0, UINT_MAX] (this is easy with bit-fill). Then, if x falls into the bias-generating region (the incomplete last revolution), we recreate it, and keep recreating it until it doesn't. At that point x lies in [0, m*(n+1) - 1] for some integer m, so x % (n+1) is a uniformly distributed number in [0, n].
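As a tiny worked example of the bias, reducing a generator with 8 equally likely outputs mod 3 hits 0 and 1 three times each but 2 only twice:
from collections import Counter

# values 0..7 map to residues 0,1,2,0,1,2,0,1
print(Counter(x % 3 for x in range(8)))   # Counter({0: 3, 1: 3, 2: 2})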

Related

Generating Integer Sequences based on a Modified Bernoulli Distribution

I want to use R to randomly generate an integer sequence in which each integer is picked from a pool of integers (0, 1, 2, 3, ..., k) with replacement. k is pre-determined. The selection probability for an integer x in (0, 1, 2, 3, ..., k) is p^x * (1-p), where p is pre-determined. That is, 1 has a much higher probability of being picked than k, and my final integer sequence will likely contain more 1s than ks. I am not sure how to implement this number-selecting process in R.
A generic approach to this type of problem would be:
Calculate p^x * (1-p) for each integer x.
Create a cumulative sum of these in a table t.
Draw a number from a uniform distribution between 0 and the largest entry of t.
Measure how far into t that number falls and check which integer that corresponds to.
The larger the probability for an integer is, the larger part of that range it will cover.
Here's some quick and dirty example code:
draw <- function(n=1, k, p) {
  v  <- seq(0, k)                        # candidate integers 0..k
  pr <- (p ^ v) * (1 - p)                # unnormalized weight of each integer
  t  <- cumsum(pr)                       # cumulative weights
  x  <- runif(n, min = 0, max = max(t))  # draw over [0, total weight) so integer 0 can be hit
  f  <- findInterval(x, vec = t)
  v[f + 1]                               # first interval is 0, and it will likely never pass highest interval
}
Note that the proposed solution doesn't care whether your density function adds up to 1. In real life it likely will, based on your description, but that's not really important for the solution.
The answer by Sirius is good. But as far as I can tell, what you're describing is something like a truncated geometric distribution.
I should note that the geometric distribution is defined differently in different works (see MathWorld, for example), so we use the distribution defined as follows:
P(X = x) ~ p^x * (1 - p), where x is an integer in [0, k].
I am not very familiar with R, but the solution involves calling rgeom(1, 1 - p) until the result is k or less.
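For what it's worth, here is a minimal sketch of that retry loop in Python rather than R, using NumPy's geometric sampler (whose support starts at 1, hence the subtraction of 1); the function name is just for illustration:
import numpy as np

def truncated_geometric(k, p, rng=np.random.default_rng()):
    # rng.geometric(q) counts trials until the first success (support 1, 2, ...),
    # so subtracting 1 gives P(x) proportional to p**x * (1 - p) for x = 0, 1, 2, ...
    while True:
        x = rng.geometric(1 - p) - 1
        if x <= k:          # keep redrawing until the value falls in [0, k]
            return x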
Alternatively, you can use a generic rejection sampler, since the probabilities are known (better called weights here, since they need not sum to 1). Rejection sampling is described as follows:
Assume each weight is 0 or greater. Store the weights in a list and calculate the highest weight; call it max. Then, to choose an integer in the interval [0, k] using rejection sampling:
Choose a uniform random integer i in the interval [0, k].
With probability weights[i]/max (where weights[i] = p^i * (1-p) in your case), return i. Otherwise, go to step 1. (A sketch of this loop follows.)
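A minimal Python sketch of that rejection loop, assuming the weights p^i * (1-p) described above (the function name is hypothetical):
import random

def weighted_draw(k, p):
    weights = [p**i * (1 - p) for i in range(k + 1)]
    w_max = max(weights)                      # for p < 1 this is the weight of integer 0
    while True:
        i = random.randrange(k + 1)           # step 1: uniform candidate in [0, k]
        if random.random() < weights[i] / w_max:
            return i                          # step 2: accept with probability weights[i]/max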
Given the weights for each item, there are many other ways to make a weighted choice besides rejection sampling or the solution in Sirius's answer; see my note on weighted choice algorithms.

Mixing function for non power of 2 integer intervals

I'm looking for a mixing function that, given an integer from the interval [0, n), returns a random-looking integer from the same interval. The interval size n will typically be a composite number that is not a power of 2. I need the function to be one-to-one. It can only use O(1) memory; O(1) time is strongly preferred. I'm not too concerned about the randomness of the output, but visually it should look random enough (see the next paragraph).
I want to use this function as a pixel-shuffling step in a realtime-ish renderer to select the order in which pixels are rendered (the output will be displayed after a fixed time, and if rendering isn't done yet this gives me a noisy but fast partial preview). The interval size n will be the number of pixels in the render (n = 1920*1080 = 2073600 would be a typical value). The function must be one-to-one so that I can be sure every pixel is rendered exactly once when finished.
I've looked at the reversible building blocks used by hash prospector, but these are mostly specific to power-of-2 ranges.
The only other method I could think of is multiplying by a large prime, but it doesn't give particularly nice random-looking outputs.
What are some other options here?
Here is one solution based on the idea of primitive roots modulo a prime:
If a is a primitive root mod p, then the function g(i) = a^i % p is a permutation of the nonzero elements less than p. This corresponds to the Lehmer PRNG. If n < p, you can get a permutation of 0, ..., n-1 as follows: given i in that range, first add 1, then repeatedly multiply by a, taking the result mod p, until you get an element which is <= n, at which point you return that result minus 1.
To fill in the details, this paper contains a table which gives a series of primes (all of which are close to various powers of 2) and corresponding primitive roots which are chosen so that they yield a generator with good statistical properties. Here is a part of that table, encoded as a Python dictionary in which the keys are the primes and the primitive roots are the values:
d = {32749: 30805,
     65521: 32236,
     131071: 66284,
     262139: 166972,
     524287: 358899,
     1048573: 444362,
     2097143: 1372180,
     4194301: 1406151,
     8388593: 5169235,
     16777213: 9726917,
     33554393: 32544832,
     67108859: 11526618,
     134217689: 70391260,
     268435399: 150873839,
     536870909: 219118189,
     1073741789: 599290962}
Given n (in a certain range -- see the paper if you need to expand that range), you can find the smallest p which works:
def find_p_a(n):
    for p in sorted(d.keys()):
        if n < p:
            return p, d[p]
Once you know n and the matching p, a, the following function is a permutation of 0 ... n-1:
def f(i,n,p,a):
    x = a*(i+1) % p
    while x > n:
        x = a*x % p
    return x-1
For a quick test:
n = 2073600
p,a = find_p_a(n) # p = 2097143, a = 1372180
nums = [f(i,n,p,a) for i in range(n)]
print(len(set(nums)) == n) #prints True
The average number of multiplications in f() is p/n, which in this case is 1.011 and will never be much more than 2 (the primes in the table roughly double from one to the next, sitting slightly below exact powers of 2). In practice this method is not fundamentally different from your "multiply by a large prime" approach, but here the factor is chosen more carefully, and the fact that sometimes more than one multiplication is required adds to the apparent randomness.

Generate N random integers that are sampled from a uniform distribution and sum to M in R [duplicate]

In some code I want to choose n random numbers in [0,1) which sum to 1.
I do so by choosing the numbers independently in [0,1) and normalizing them by dividing each one by the total sum:
numbers = [random() for i in range(n)]
numbers = [n/sum(numbers) for n in numbers]
My "problem" is that the distribution I get out is quite skewed. Even after choosing a million numbers, not a single one gets over 1/2. With some effort I've calculated the pdf, and it's not nice.
Here is the weird-looking pdf I get for 5 variables (plot omitted).
Do you have an idea for a nice algorithm to choose the numbers, that result in a more uniform or simple distribution?
You are looking to partition the distance from 0 to 1.
Choose n - 1 numbers from 0 to 1, sort them, and take the distances between consecutive values, including the gap from 0 to the smallest and from the largest to 1; that gives n gaps which sum to 1.
This partitions the space from 0 to 1, and it should yield the occasional large result which you aren't getting.
Even so, for large values of n, you can generally expect your max value to decrease as well, just not as quickly as your method.
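A minimal Python sketch of that partitioning approach:
import random

def random_partition(n):
    # n - 1 sorted uniform cut points split [0, 1] into n gaps that sum to 1
    cuts = sorted(random.random() for _ in range(n - 1))
    points = [0.0] + cuts + [1.0]
    return [points[i + 1] - points[i] for i in range(n)]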
You might be interested in the Dirichlet distribution, which is used to generate quantities that sum to 1 if you're looking for probabilities. There's also a section on how to generate them using gamma distributions here.
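For the flat case (all concentration parameters equal to 1, which is uniform over the simplex), the gamma construction reduces to normalizing exponential draws; a small Python sketch:
import random

def dirichlet_flat(n):
    # Gamma(1) is just Exponential(1); normalizing the draws gives Dirichlet(1, ..., 1)
    g = [random.expovariate(1.0) for _ in range(n)]
    s = sum(g)
    return [x / s for x in g]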
Another way to get n random numbers which sum up to 1:
import random

def create_norm_arr(n, remaining=1.0):
    random_numbers = []
    for _ in range(n - 1):
        r = random.random()  # get a random number in [0, 1)
        r = r * remaining
        remaining -= r
        random_numbers.append(r)
    random_numbers.append(remaining)
    return random_numbers

random_numbers = create_norm_arr(5)
print(random_numbers)
print(sum(random_numbers))
This makes large values more likely than the normalization approach does, though the entries are not identically distributed: earlier entries tend to be larger.

How to transform any 64-bit integer to a uniformly distributed random number in certain interval

Can I transform integers in [Long.MIN_VALUE, Long.MAX_VALUE] to [0, 1] uniformly and randomly? The main purpose is so that I can pick a fraction, e.g. 0.1, and select a uniformly and randomly distributed 10% of longs. If this transformation method is f(x), I can check f(x) < 0.1 on any long to see if it belongs to the 10% of selected numbers. Some important requirements:
It can be controlled by seeds, so that I can get the same results every time (when I want), and can also change the seed to get a completely different result.
When I increase the fraction from x to y, I want all the numbers selected at x to also be selected at y. For example, if 894230 is selected when I pick 10% of numbers, it should also be selected when I pick 20% of numbers.
I can select seeds that won't result in obvious patterns such as double f(x) { return (x%1000)/1000.0; }.
Nice to have (but not necessary):
I can use a dummy seed that results in an obvious pattern (so that it looks obvious in a unit test).
I use Java, but I don't mind answers in any language that can be rewritten easily in Java
If the job is really just to identify whether x belongs to the set of chosen integers, then I believe the following will achieve all of your first three requirements:
(long)anyReasonableHashFunction(x ^ seed) < Long.MAX_VALUE * fraction
#include <boost/functional/hash.hpp>
#include <limits>

boost::hash<long> hasher;

double f(long x, long seed)
{
    // hash the seeded value, then scale it into [0, 1]
    return hasher(x ^ seed) / (double)std::numeric_limits<long>::max();
}
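The same idea sketched in Python, with SHA-256 standing in for "any reasonable hash function" (an assumed choice, not one from the answer); since the hash of (x, seed) is fixed, raising the fraction only ever adds numbers to the selected set, which covers the monotonicity requirement:
import hashlib

def selected(x, seed, fraction):
    # hash the seeded value to a 64-bit integer, then compare it
    # against the same fraction of the full 64-bit range
    m = (x ^ seed) & 0xFFFFFFFFFFFFFFFF
    h = hashlib.sha256(m.to_bytes(8, "little")).digest()
    value = int.from_bytes(h[:8], "little")
    return value < fraction * 2**64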

how to compare two curves (arrays of points)

I have 2 arrays of points (x, y), and with those points I can draw 2 curves.
Does anyone have ideas on how to calculate how similar those curves are?
You can always calculate the area between those two curves. (This is a bit easier if the endpoints match.) The curves are similar if the area is small, not so similar if the area is not small.
Note that I did not define 'small'. That was intentional. Then again, you didn't define 'similar'.
Edit
Sometimes area isn't the best metric. For example, consider the functions f(x)=0 and g(x)=1e6*sin(x). If the range of x is some integral multiple of 2*pi, the signed area between these curves is zero, yet a function that oscillates between plus and minus one million is not a good approximation of f(x)=0.
A better metric is needed. Here are a couple (a small sketch of both follows the list). Note: I am assuming here that the x values are identical in the two sets; the only things that differ are the y values.
Sum of squares. For each x value, compute delta_yi = y1,i - y2,i and accumulate the squares delta_yi^2. This metric is the basis for least squares optimization, where the goal is to minimize the sum of the squares of the errors. This is a widely used approach because oftentimes it is fairly easy to implement.
Maximum deviation. Find the maximum of abs_delta_yi = |y1,i - y2,i| over all x values. This metric is the basis for many implementations of the functions in the math library, where the goal is to minimize the maximum error. These math library implementations are approximations of the true function. As a consumer of such an approximation, I typically care more about the worst thing the approximation will do to my application than about how it behaves on average.
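A minimal Python sketch of both metrics, assuming the two curves share the same x values:
def sum_of_squares(y1, y2):
    # accumulate (y1_i - y2_i)**2 over all shared x values
    return sum((a - b) ** 2 for a, b in zip(y1, y2))

def max_deviation(y1, y2):
    # largest absolute pointwise difference
    return max(abs(a - b) for a, b in zip(y1, y2))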
You might want to consider using Dynamic Time Warping (DTW) or Frechet distance.
Dynamic Time Warping
Dynamic Time Warping sums the difference along the entire curve, and it can handle two arrays of different sizes. Here is a snippet from Wikipedia showing how the code might look. The solution uses a two-dimensional array; the cost is the distance between two points, and the final value DTW[n, m] contains the cumulative distance.
int DTWDistance(s: array [1..n], t: array [1..m]) {
    DTW := array [0..n, 0..m]
    for i := 1 to n
        DTW[i, 0] := infinity
    for i := 1 to m
        DTW[0, i] := infinity
    DTW[0, 0] := 0
    for i := 1 to n
        for j := 1 to m
            cost := d(s[i], t[j])
            DTW[i, j] := cost + minimum(DTW[i-1, j  ],  // insertion
                                        DTW[i  , j-1],  // deletion
                                        DTW[i-1, j-1])  // match
    return DTW[n, m]
}
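For reference, a rough Python transliteration of that pseudocode, using the absolute difference of y values as the point distance d (an assumption; any point distance can be substituted):
import math

def dtw_distance(s, t):
    n, m = len(s), len(t)
    dtw = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])            # d(s[i], t[j])
            dtw[i][j] = cost + min(dtw[i - 1][j],      # insertion
                                   dtw[i][j - 1],      # deletion
                                   dtw[i - 1][j - 1])  # match
    return dtw[n][m]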
DTW is similar to Jacopson's answer.
Frechet Distance
Frechet distance calculates the farthest that the curves are forced to separate: under the best possible traversal of the two curves, it is the largest distance between matched points, so all other matched points are closer together than this distance. The approach is typically illustrated with a dog and its owner walking along the two curves connected by a leash (the classic Frechet distance picture; the illustration is not reproduced here).
Depending on your arrays, you can compare the distances between corresponding points and use the maximum.
I assume a curve is an array of 2D points over the real numbers; the size of the array is N, and I call p[i] the i-th point of the curve, where i goes from 0 to N-1.
I also assume that the two curves have the same size and that it is meaningful to "compare" the i-th point of the first curve with the i-th point of the second curve.
I call Delta, a real number, the result of the comparison of the two curves.
Delta can be computed as follow:
Delta = 0;
for( i = 0; i < N; i++ ) {
    Delta = Delta + distance(p[i], q[i]);
}
where p are points from the first curve and q are points from the second curve.
Now you have to choose a suitable distance function depending on your problem: the function has two points as arguments and returns a real number.
For example, distance can be the usual distance between two points in the plane (the Pythagorean theorem; see http://en.wikipedia.org/wiki/Euclidean_distance).
An example of the method in C++:
#include <numeric>
#include <vector>
#include <cmath>
#include <iostream>
#include <functional>
#include <stdexcept>
#include <initializer_list>

typedef double Real_t;

class Point
{
public:
    Point() {}
    Point( std::initializer_list<Real_t> args ) : x(args.begin()[0]), y(args.begin()[1]) {}
    Point( const Real_t& xx, const Real_t& yy ) : x(xx), y(yy) {}
    Real_t x, y;
};

typedef std::vector< Point > Curve;

// Euclidean distance between two points
Real_t point_distance( const Point& a, const Point& b )
{
    return std::hypot( a.x - b.x, a.y - b.y );
}

// Sum of the pointwise distances between two curves of equal size
Real_t curve_distance( const Curve& c1, const Curve& c2 )
{
    if ( c1.size() != c2.size() ) throw std::invalid_argument("size mismatch");
    return std::inner_product( c1.begin(), c1.end(), c2.begin(), Real_t(0),
                               std::plus< Real_t >(), point_distance );
}

int main(int, char**)
{
    Curve c1{{0, 0},
             {1, 1},
             {2, 4},
             {3, 9}};
    Curve c2{{0.1, -0.1},
             {1.1, 0.9},
             {2.1, 3.9},
             {3.1, 8.9}};
    std::cout << curve_distance(c1, c2) << "\n";
    return 0;
}
If your two curves have different sizes then you have to think about how to extend the previous method; for example, you can reduce the size of the longer curve by means of a suitable algorithm (the Ramer–Douglas–Peucker algorithm can be a starting point) in order to match it to the size of the shorter curve.
I have just described a very simple method; you can also take different approaches, for example fitting curves to the two sets of points and then working with the two curves expressed as mathematical functions.
This can also be solved by thinking in terms of distributions, especially if the position of a value is interchangeable within an array.
You could then calculate the mean and the standard deviation (and other distribution characteristics) for both arrays and take the difference between those characteristics.
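A minimal Python sketch of that idea, comparing only the mean and standard deviation (other summary statistics can be added the same way):
import statistics

def distribution_difference(a, b):
    # compare summary statistics instead of point-by-point values
    mean_diff = abs(statistics.mean(a) - statistics.mean(b))
    std_diff = abs(statistics.pstdev(a) - statistics.pstdev(b))
    return mean_diff, std_diff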
