How to test randomness (case in point - Shuffling) - math

First off, this question was split out from a longer question. I did it because I think this part is bigger than a sub-part of that question. If it offends, please pardon me.
Assume that you have an algorithm that generates randomness. Now how do you test it?
Or to be more direct - Assume you have an algorithm that shuffles a deck of cards, how do you test that it's a perfectly random algorithm?
To add some theory to the problem -
A deck of cards can be shuffled in 52! (52 factorial) different ways. Take a deck of cards, shuffle it by hand and write down the order of all cards. What is the probability that you would have gotten exactly that shuffle? Answer: 1 / 52!.
What is the chance that you, after shuffling, will get A, K, Q, J ... of each suit in a sequence? Answer 1 / 52!
So, just shuffling once and looking at the result will give you absolutely no information about your shuffling algorithm's randomness. Shuffle twice and you have more information, three times even more...
How would you black box test a shuffling algorithm for randomness?

Statistics. The de facto standard for testing RNGs is the Diehard suite (originally available at http://stat.fsu.edu/pub/diehard). Alternatively, the Ent program provides tests that are simpler to interpret but less comprehensive.
As for shuffling algorithms, use a well-known algorithm such as Fisher-Yates (a.k.a "Knuth Shuffle"). The shuffle will be uniformly random so long as the underlying RNG is uniformly random. If you are using Java, this algorithm is available in the standard library (see Collections.shuffle).
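For reference, here is a minimal sketch of a Fisher-Yates shuffle in Python (my own illustration, not code from the thread), assuming the standard library's random module as the underlying RNG:
import random

def fisher_yates_shuffle(deck, rng=random):
    # Walk the list from the last index down; swap each position with a
    # uniformly chosen position at or below it.
    for i in range(len(deck) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends
        deck[i], deck[j] = deck[j], deck[i]
    return deck

cards = list(range(52))
fisher_yates_shuffle(cards)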
It probably doesn't matter for most applications, but be aware that most RNGs do not provide sufficient degrees of freedom to produce every possible permutation of a 52-card deck (explained here).

Here's one simple check that you can perform. It uses generated random numbers to estimate Pi. It's not proof of randomness, but poor RNGs typically don't do well on it (they will return something like 2.5 or 3.8 rather than ~3.14).
Ideally this would be just one of many tests that you would run to check randomness.
Something else that you can check is the standard deviation of the output. The expected standard deviation for a uniformly distributed population of values in the range 0..n approaches n/sqrt(12).
/**
* This is a rudimentary check to ensure that the output of a given RNG
* is approximately uniformly distributed. If the RNG output is not
* uniformly distributed, this method will return a poor estimate for the
* value of pi.
* @param rng The RNG to test.
* @param iterations The number of random points to generate for use in the
* calculation. This value needs to be sufficiently large in order to
* produce a reasonably accurate result (assuming the RNG is uniform).
* Less than 10,000 is not particularly useful. 100,000 should be sufficient.
* @return An approximation of pi generated using the provided RNG.
*/
public static double calculateMonteCarloValueForPi(Random rng,
int iterations)
{
// Assumes a quadrant of a circle of radius 1, bounded by a box with
// sides of length 1. The area of the square is therefore 1 square unit
// and the area of the quadrant is (pi * r^2) / 4.
int totalInsideQuadrant = 0;
// Generate the specified number of random points and count how many fall
// within the quadrant and how many do not. We expect the number of points
// in the quadrant (expressed as a fraction of the total number of points)
// to be pi/4. Therefore pi = 4 * ratio.
for (int i = 0; i < iterations; i++)
{
double x = rng.nextDouble();
double y = rng.nextDouble();
if (isInQuadrant(x, y))
{
++totalInsideQuadrant;
}
}
// From these figures we can deduce an approximate value for Pi.
return 4 * ((double) totalInsideQuadrant / iterations);
}
/**
* Uses Pythagoras' theorem to determine whether the specified coordinates
* fall within the area of the quadrant of a circle of radius 1 that is
* centered on the origin.
* @param x The x-coordinate of the point (must be between 0 and 1).
* @param y The y-coordinate of the point (must be between 0 and 1).
* @return True if the point is within the quadrant, false otherwise.
*/
private static boolean isInQuadrant(double x, double y)
{
double distance = Math.sqrt((x * x) + (y * y));
return distance <= 1;
}
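And here is a minimal sketch of the standard-deviation check mentioned above (my addition, in Python; the function name and the use of random.random() are assumptions, not part of the original answer):
import math
import random

def check_stddev(rng=random.random, n=100000):
    # For a uniform distribution on [0, 1) the expected standard
    # deviation is 1/sqrt(12), roughly 0.2887.
    samples = [rng() for _ in range(n)]
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    return math.sqrt(variance), 1 / math.sqrt(12)

observed, expected = check_stddev()
print(observed, expected)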

First, it is impossible to know for sure if a certain finite output is "truly random" since, as you point out, any output is possible.
What can be done, is to take a sequence of outputs and check various measurements of this sequence against what is more likely. You can derive a sort of confidence score that the generating algorithm is doing a good job.
For example, you could check the output of 10 different shuffles. Assign a number 0-51 to each card, and take the average of the card in position 6 across the shuffles. The convergent average is 25.5, so you would be surprised to see a value of 1 here. You could use the central limit theorem to get an estimate of how likely each average is for a given position.
But we shouldn't stop here! Because this algorithm could be fooled by a system that only alternates between two shuffles that are designed to give the exact average of 25.5 at each position. How can we do better?
We expect a uniform distribution (equal likelihood for any given card) at each position, across different shuffles. So among the 10 shuffles, we could try to verify that the choices 'look uniform.' This is basically just a reduced version of the original problem. You could check that the standard deviation looks reasonable, that the min is reasonable, and the max value as well. You could also check that other values, such as the closest two cards (by our assigned numbers), also make sense.
But we also can't just add various measurements like this ad infinitum, since, given enough statistics, any particular shuffle will appear highly unlikely for some reason (e.g. this is one of very few shuffles in which cards X,Y,Z appear in order). So the big question is: which is the right set of measurements to take? Here I have to admit that I don't know the best answer. However, if you have a certain application in mind, you can choose a good set of properties/measurements to test, and work with those -- this seems to be the way cryptographers handle things.

There's a lot of theory on testing randomness. For a very simple test on a card shuffling algorithm you could do a lot of shuffles and then run a chi squared test that the probability of each card turning up in any position was uniform. But that doesn't test that consecutive cards aren't correlated so you would also want to do tests on that.
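As a rough sketch of that chi-squared check (my own, in Python, assuming scipy is available; random.shuffle stands in for the shuffle under test): count how often each card lands in each position over many shuffles, then test one position's counts against a uniform expectation.
import random
from scipy.stats import chisquare

DECK, SHUFFLES = 52, 100000
counts = [[0] * DECK for _ in range(DECK)]  # counts[position][card]
deck = list(range(DECK))
for _ in range(SHUFFLES):
    random.shuffle(deck)  # replace with the shuffle being tested
    for pos, card in enumerate(deck):
        counts[pos][card] += 1

# Chi-squared test for position 0: are all 52 cards equally likely there?
stat, p_value = chisquare(counts[0])  # uniform expectation by default
print(stat, p_value)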
Volume 2 of Knuth's Art of Computer Programming gives a number of tests that you could use in sections 3.3.2 (Empirical tests) and 3.3.4 (The Spectral Test) and the theory behind them.

The only way to test for randomness is to write a program that attempts to build a predictive model for the data being tested, and then use that model to try to predict future data, and then showing that the uncertainty, or entropy, of its predictions tend towards maximum (i.e. the uniform distribution) over time. Of course, you'll always be uncertain whether or not your model has captured all of the necessary context; given a model, it'll always be possible to build a second model that generates non-random data that looks random to the first. But as long as you accept that the orbit of Pluto has an insignificant influence on the results of the shuffling algorithm, then you should be able to satisfy yourself that its results are acceptably random.
Of course, if you do this, you might as well use your model generatively, to actually create the data you want. And if you do that, then you're back at square one.

Shuffle a lot, and then record the outcomes (if I'm reading this correctly). I remember seeing comparisons of "random number generators". They just test it over and over, then graph the results.
If it is truly random the graph will be mostly even.

I'm not fully following your question. You say
Assume that you have a algorithm that generates randomness. Now how do you test it?
What do you mean? If you're assuming you can generate randomness, there's no need to test it.
Once you have a good random number generator, creating a random permutation is easy (e.g. call your cards 1-52, generate 52 random numbers assigning one to each card in order, and then sort according to your 52 randoms). You're not going to destroy the randomness of your good RNG by generating your permutation.
The difficult question is whether you can trust your RNG. Here's a sample link to people discussing that issue in a specific context.

Testing 52! possibilities is of course impossible. Instead, try your shuffle on smaller numbers of cards, like 3, 5, and 10. Then you can test billions of shuffles and use a histogram and the chi-square statistical test to prove that each permutation is coming up an "even" number of times.
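A sketch of that idea for a 3-card deck (my addition, in Python; random.shuffle is only a stand-in for the shuffle under test): there are 3! = 6 permutations, so shuffle many times and check that each one shows up roughly equally often.
from collections import Counter
import random

TRIALS = 600000
counts = Counter()
for _ in range(TRIALS):
    deck = [0, 1, 2]
    random.shuffle(deck)  # replace with the shuffle being tested
    counts[tuple(deck)] += 1

# With a fair shuffle each of the 6 permutations should appear
# close to TRIALS / 6 = 100000 times.
for perm, freq in sorted(counts.items()):
    print(perm, freq)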

No code so far, therefore I copy-paste a testing part from my answer to the original question.
// ...
int main() {
typedef std::map<std::pair<size_t, Deck::value_type>, size_t> Map;
Map freqs;
Deck d;
const size_t ntests = 100000;
// compute frequencies of events: card at position
for (size_t i = 0; i < ntests; ++i) {
d.shuffle();
size_t pos = 0;
for(Deck::const_iterator j = d.begin(); j != d.end(); ++j, ++pos)
++freqs[std::make_pair(pos, *j)];
}
// if Deck.shuffle() is correct then all frequencies must be similar
for (Map::const_iterator j = freqs.begin(); j != freqs.end(); ++j)
std::cout << "pos=" << j->first.first << " card=" << j->first.second
<< " freq=" << j->second << std::endl;
}
This code does not test randomness of underlying pseudorandom number generator. Testing PRNG randomness is a whole branch of science.

For a quick test, you can always try compressing it. If it doesn't compress, you can move on to other tests.
I've tried dieharder but it refuses to work for a shuffle. All tests fail. It is also really stodgy, it won't let you specify the range of values you want or anything like that.

Pondering it myself, what I would do is something like:
Setup (Pseudo code)
// A card has a Number 0-51 and a position 0-51
int[][] StatMatrix = new int[52][52]; // Assume all are set to 0 as starting values
ShuffleCards();
ForEach (card in Cards) {
StatMatrix[Card.Position][Card.Number]++;
}
This gives us a matrix 52x52 indicating how many times a card has ended up at a certain position. Repeat this a large number of times (I would start with 1000, but people better at statistics than me may give a better number).
Analyze the matrix
If we have perfect randomness and perform the shuffle an infinite number of times then for each card and for each position the number of times the card ended up in that position is the same as for any other card. Saying the same thing in a different way:
statMatrix[position][card] / numberOfShuffles = 1/52.
So I would calculate how far from that number we are.
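A rough Python sketch of that setup and analysis (random.shuffle is only a stand-in for ShuffleCards, and the largest-deviation measure is just one possible choice):
import random

N, SHUFFLES = 52, 100000
stat = [[0] * N for _ in range(N)]  # stat[position][card]
deck = list(range(N))
for _ in range(SHUFFLES):
    random.shuffle(deck)  # stand-in for ShuffleCards()
    for pos, card in enumerate(deck):
        stat[pos][card] += 1

# How far is each observed frequency from the ideal 1/52?
worst = max(abs(stat[p][c] / SHUFFLES - 1 / 52)
            for p in range(N) for c in range(N))
print("largest deviation from 1/52:", worst)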

Related

How to find number of items dropped based on individual probabilities?

My goal is to independently calculate the number of items an enemy would drop after it is killed. For example, say there are 50 potions each with a 50% chance of being dropped, I'd like to randomly return a number from 0 to 50, based on independent trials.
Currently, this is the code I'm using:
int droppedItems(int n, float probability) {
int count = 0;
for (int x = 1; x <= n; ++x) {
if (random() <= probability) {
++count;
}
}
return count;
}
Where probability is a number from 0.0 to 1.0, random() returns 0.0 to 1.0, and n is the maximum number of items to be dropped. This is C++ code; however, I'm actually using Visual Basic 6, so there are no libraries to help with this.
This code works flawlessly. However, I'd like to optimize this so that if n happens to be 999999, it doesn't take forever (which it currently does).
Use the binomial distribution. Wiki - Binomial Distribution
Ideally, use the libraries for whatever language this pseudocode will be written in. There's no sense in reinventing the wheel unless of course you are trying to learn how to invent a wheel.
Specifically, you'll want something that will let you generate random values given a binomial distribution with a probability of success in any given trial and a number of trials.
EDIT :
I went ahead and did this (in python, since that's where I live these days). It relies on the very nice numpy library (hooray, abstraction!):
>>>import numpy
>>>numpy.random.binomial(99999,0.5)
49853
>>>numpy.random.binomial(99999,0.5)
50077
And, using timeit.Timer to check execution time:
# timing it across 10,000 iterations for 99,999 items per iteration
>>>timeit.Timer(stmt="numpy.random.binomial(99999,0.5)", setup="import numpy").timeit(10000)
0.00927[... seconds]
EDIT 2 :
As it turns out, there isn't a simple way to implement a random number generator based off of the binomial distribution.
There is an algorithm you can implement without library support which will generate random variables from the binomial distribution. You can view it here as a PDF
My guess is that given what you want to use it for (having monsters drop loot in a game), implementing the algorithm is not worth your time. There's room for fudge factor here!
I would change your code like this (note: this is not a binomial distribution):
Use your current code for small values, say n up to 100.
For n greater than one hundred, calculate the value of count for 100 using your current algorithm and then multiply the result by n/100.
Again, if you really want to figure out how to implement the BTPE algorithm yourself, you can - I think the method I give above wins in the trade off between effort to write and getting "close enough".
As @IamChuckB pointed out already, the key word is binomial distribution. When the number of Bernoulli trials (number of items in your example) is large and the success probability is small, a good approximation is the Poisson distribution, which is much simpler to calculate and draw numbers from (the exact algorithm is spelled out in the linked Wikipedia article).
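Since the question rules out library support (VB6), one practical option - my suggestion, not something from the answers above - is the normal approximation to the binomial for large n: draw a normal value with mean n*p and variance n*p*(1-p), round it, and clamp it to [0, n]. A Python sketch:
import random

def approx_binomial(n, p, rng=random):
    # Normal approximation to Binomial(n, p): mean n*p,
    # standard deviation sqrt(n*p*(1-p)). Reasonable when
    # n*p and n*(1-p) are both large.
    mu = n * p
    sigma = (n * p * (1 - p)) ** 0.5
    draw = int(round(rng.gauss(mu, sigma)))
    return max(0, min(n, draw))  # clamp to the valid range [0, n]

print(approx_binomial(999999, 0.5))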

Understanding "randomness"

I can't get my head around this, which is more random?
rand()
OR:
rand() * rand()
I'm finding it a real brain teaser; could you help me out?
EDIT:
Intuitively I know that the mathematical answer will be that they are equally random, but I can't help but think that if you "run the random number algorithm" twice when you multiply the two together you'll create something more random than just doing it once.
Just a clarification
Although the previous answers are right whenever you try to spot the randomness of a pseudo-random variable or its multiplication, you should be aware that while Random() is usually uniformly distributed, Random() * Random() is not.
Example
This is a uniform random distribution sample simulated through a pseudo-random variable:
BarChart[BinCounts[RandomReal[{0, 1}, 50000], 0.01]]
While this is the distribution you get after multiplying two random variables:
BarChart[BinCounts[Table[RandomReal[{0, 1}, 50000] *
RandomReal[{0, 1}, 50000], {50000}], 0.01]]
So, both are “random”, but their distribution is very different.
Another example
While 2 * Random() is uniformly distributed:
BarChart[BinCounts[2 * RandomReal[{0, 1}, 50000], 0.01]]
Random() + Random() is not!
BarChart[BinCounts[Table[RandomReal[{0, 1}, 50000] +
RandomReal[{0, 1}, 50000], {50000}], 0.01]]
The Central Limit Theorem
The Central Limit Theorem states that the sum of independent uniform Random() values tends toward a normal distribution as the number of terms increases.
With just four terms you get:
BarChart[BinCounts[Table[RandomReal[{0, 1}, 50000] + RandomReal[{0, 1}, 50000] +
RandomReal[{0, 1}, 50000] + RandomReal[{0, 1}, 50000], {50000}], 0.01]]
And here you can see the road from a uniform to a normal distribution by adding up 1, 2, 4, 6, 10 and 20 uniformly distributed random variables:
Edit
A few credits
Thanks to Thomas Ahle for pointing out in the comments that the probability distributions shown in the last two images are known as the Irwin-Hall distribution
Thanks to Heike for her wonderful torn[] function
I guess both methods are as random, although my gut feeling would say that rand() * rand() is less random because it would yield more zeroes. As soon as one rand() is 0, the total becomes 0.
Neither is 'more random'.
rand() generates a predictable set of numbers based on a pseudo-random seed (usually based on the current time, which is always changing). Multiplying two consecutive numbers in the sequence generates a different, but equally predictable, sequence of numbers.
Addressing whether this will reduce collisions, the answer is no. It will actually increase collisions due to the effect of multiplying two numbers where 0 < n < 1. The result will be a smaller fraction, causing a bias in the result towards the lower end of the spectrum.
Some further explanations. In the following, 'unpredictable' and 'random' refer to the ability of someone to guess what the next number will be based on previous numbers, ie. an oracle.
Given seed x which generates the following list of values:
0.3, 0.6, 0.2, 0.4, 0.8, 0.1, 0.7, 0.3, ...
rand() will generate the above list, and rand() * rand() will generate:
0.18, 0.08, 0.08, 0.21, ...
Both methods will always produce the same list of numbers for the same seed, and hence are equally predictable by an oracle. But if you look at the results for multiplying the two calls, you'll see they are all under 0.3 despite a decent distribution in the original sequence. The numbers are biased because of the effect of multiplying two fractions. The resulting number is always smaller, therefore much more likely to be a collision despite still being just as unpredictable.
Oversimplification to illustrate a point.
Assume your random function only outputs 0 or 1.
random() is one of (0,1), but random()*random() is one of (0,0,0,1)
You can clearly see that the chances to get a 0 in the second case are in no way equal to those to get a 1.
When I first posted this answer I wanted to keep it as short as possible so that a person reading it will understand from a glance the difference between random() and random()*random(), but I can't keep myself from answering the original ad litteram question:
Which is more random?
Being that random(), random()*random(), random()+random(), (random()+1)/2 or any other combination that doesn't lead to a fixed result have the same source of entropy (or the same initial state in the case of pseudorandom generators), the answer would be that they are equally random (The difference is in their distribution). A perfect example we can look at is the game of Craps. The number you get would be random(1,6)+random(1,6) and we all know that getting 7 has the highest chance, but that doesn't mean the outcome of rolling two dice is more or less random than the outcome of rolling one.
Here's a simple answer. Consider Monopoly. You roll two six sided dice (or 2d6 for those of you who prefer gaming notation) and take their sum. The most common result is 7 because there are 6 possible ways you can roll a 7 (1,6 2,5 3,4 4,3 5,2 and 6,1). Whereas a 2 can only be rolled on 1,1. It's easy to see that rolling 2d6 is different than rolling 1d12, even if the range is the same (ignoring that you can get a 1 on a 1d12, the point remains the same). Multiplying your results instead of adding them is going to skew them in a similar fashion, with most of your results coming up in the middle of the range. If you're trying to reduce outliers, this is a good method, but it won't help making an even distribution.
(And oddly enough it will increase low rolls as well. Assuming your randomness starts at 0, you'll see a spike at 0 because it will turn whatever the other roll is into a 0. Consider two random numbers between 0 and 1 (inclusive) and multiplying. If either result is a 0, the whole thing becomes a 0 no matter the other result. The only way to get a 1 out of it is for both rolls to be a 1. In practice this probably wouldn't matter but it makes for a weird graph.)
The obligatory xkcd ...
It might help to think of this in more discrete numbers. Say you want to generate random numbers between 1 and 36, so you decide the easiest way is throwing two fair, 6-sided dice and multiplying the results. You get this:
1 2 3 4 5 6
-----------------------------
1| 1 2 3 4 5 6
2| 2 4 6 8 10 12
3| 3 6 9 12 15 18
4| 4 8 12 16 20 24
5| 5 10 15 20 25 30
6| 6 12 18 24 30 36
So we have 36 numbers, but not all of them are fairly represented, and some don't occur at all. Numbers near the center diagonal (bottom-left corner to top-right corner) will occur with the highest frequency.
The same principles which describe the unfair distribution between dice apply equally to floating point numbers between 0.0 and 1.0.
Some things about "randomness" are counter-intuitive.
Assuming flat distribution of rand(), the following will get you non-flat distributions:
high bias: sqrt(rand(range^2))
bias peaking in the middle: (rand(range) + rand(range))/2
low bias: range - sqrt(rand(range^2))
There are lots of other ways to create specific bias curves. I did a quick test of rand() * rand() and it gets you a very non-linear distribution.
Most rand() implementations have some period. I.e. after some enormous number of calls the sequence repeats. The sequence of outputs of rand() * rand() repeats in half the time, so it is "less random" in that sense.
Also, without careful construction, performing arithmetic on random values tends to cause less randomness. A poster above cited "rand() + rand() + rand() ..." (k times, say) which will in fact tend to k times the mean value of the range of values rand() returns. (It's a random walk with steps symmetric about that mean.)
Assume for concreteness that your rand() function returns a uniformly distributed random real number in the range [0,1). (Yes, this example allows infinite precision. This won't change the outcome.) You didn't pick a particular language and different languages may do different things, but the following analysis holds with modifications for any non-perverse implementation of rand(). The product rand() * rand() is also in the range [0,1) but is no longer uniformly distributed. In fact, the product is as likely to be in the interval [0,1/4) as in the interval [1/4,1). More multiplication will skew the result even further toward zero. This makes the outcome more predictable. In broad strokes, more predictable == less random.
Pretty much any sequence of operations on uniformly random input will be nonuniformly random, leading to increased predictability. With care, one can overcome this property, but then it would have been easier to generate a uniformly distributed random number in the range you actually wanted rather than wasting time with arithmetic.
"random" vs. "more random" is a little bit like asking which Zero is more zero'y.
In this case, rand is a PRNG, so not totally random. (in fact, quite predictable if the seed is known). Multiplying it by another value makes it no more or less random.
A true crypto-type RNG will actually be random. And running values through any sort of function cannot add more entropy to it, and may very likely remove entropy, making it no more random.
The concept you're looking for is "entropy," the "degree" of disorder of a string of bits. The idea is easiest to understand in terms of the concept of "maximum entropy".
An approximate definition of a string of bits with maximum entropy is that it cannot be expressed exactly in terms of a shorter string of bits (i.e. using some algorithm to expand the smaller string back to the original string).
The relevance of maximum entropy to randomness stems from the fact that if you pick a number "at random", you will almost certainly pick a number whose bit string is close to having maximum entropy, that is, it can't be compressed. This is our best understanding of what characterizes a "random" number.
So, if you want to make a random number out of two random samples which is "twice" as random, you'd concatenate the two bit strings together. Practically, you'd just stuff the samples into the high and low halves of a double length word.
On a more practical note, if you find yourself saddled with a crappy rand(), it can sometimes help to xor a couple of samples together --- although, if it's truly broken even that procedure won't help.
The accepted answer is quite lovely, but there's another way to answer your question. PachydermPuncher's answer already takes this alternative approach, and I'm just going to expand it out a little.
The easiest way to think about information theory is in terms of the smallest unit of information, a single bit.
In the C standard library, rand() returns an integer in the range 0 to RAND_MAX, a limit that may be defined differently depending on the platform. Suppose RAND_MAX happens to be defined as 2^n - 1 where n is some integer (this happens to be the case in Microsoft's implementation, where n is 15). Then we would say that a good implementation would return n bits of information.
Imagine that rand() constructs random numbers by flipping a coin to find the value of one bit, and then repeating until it has a batch of 15 bits. Then the bits are independent (the value of any one bit does not influence the likelihood of other bits in the same batch having a certain value). So each bit considered independently is like a random number between 0 and 1 inclusive, and is "evenly distributed" over that range (as likely to be 0 as 1).
The independence of the bits ensures that the numbers represented by batches of bits will also be evenly distributed over their range. This is intuitively obvious: if there are 15 bits, the allowed range is zero to 2^15 - 1 = 32767. Every number in that range is a unique pattern of bits, such as:
010110101110010
and if the bits are independent then no pattern is more likely to occur than any other pattern. So all possible numbers in the range are equally likely. And so the reverse is true: if rand() produces evenly distributed integers, then those numbers are made of independent bits.
So think of rand() as a production line for making bits, which just happens to serve them up in batches of arbitrary size. If you don't like the size, break the batches up into individual bits, and then put them back together in whatever quantities you like (though if you need a particular range that is not a power of 2, you need to shrink your numbers, and by far the easiest way to do that is to convert to floating point).
Returning to your original suggestion, suppose you want to go from batches of 15 to batches of 30: ask rand() for the first number, bit-shift it by 15 places, then add another rand() to it. That is a way to combine two calls to rand() without disturbing an even distribution. It works simply because there is no overlap between the locations where you place the bits of information.
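A small sketch of that bit-combining idea (mine, in Python; rand15 is a hypothetical 15-bit generator standing in for rand()):
import random

def rand15():
    # stand-in for a 15-bit rand(): an integer in [0, 2**15)
    return random.getrandbits(15)

def rand30():
    # Shift the first batch up by 15 bits and fill the low bits with a
    # second batch; no bit positions overlap, so uniformity is preserved.
    return (rand15() << 15) | rand15()

print(rand30())  # an integer in [0, 2**30)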
This is very different to "stretching" the range of rand() by multiplying by a constant. For example, if you wanted to double the range of rand() you could multiply by two - but now you'd only ever get even numbers, and never odd numbers! That's not exactly a smooth distribution and might be a serious problem depending on the application, e.g. a roulette-like game supposedly allowing odd/even bets. (By thinking in terms of bits, you'd avoid that mistake intuitively, because you'd realise that multiplying by two is the same as shifting the bits to the left (greater significance) by one place and filling in the gap with zero. So obviously the amount of information is the same - it just moved a little.)
Such gaps in number ranges can't be griped about in floating point number applications, because floating point ranges inherently have gaps in them that simply cannot be represented at all: an infinite number of missing real numbers exist in the gap between each two representable floating point numbers! So we just have to learn to live with gaps anyway.
As others have warned, intuition is risky in this area, especially because mathematicians can't resist the allure of real numbers, which are horribly confusing things full of gnarly infinities and apparent paradoxes.
But at least if you think in terms of bits, your intuition might get you a little further. Bits are really easy - even computers can understand them.
As others have said, the easy short answer is: No, it is not more random, but it does change the distribution.
Suppose you were playing a dice game. You have some completely fair, random dice. Would the die rolls be "more random" if before each die roll, you first put two dice in a bowl, shook it around, picked one of the dice at random, and then rolled that one? Clearly it would make no difference. If both dice give random numbers, then randomly choosing one of the two dice will make no difference. Either way you'll get a random number between 1 and 6 with even distribution over a sufficient number of rolls.
I suppose in real life such a procedure might be useful if you suspected that the dice might NOT be fair. If, say, the dice are slightly unbalanced so one tends to give 1 more often than 1/6 of the time, and another tends to give 6 unusually often, then randomly choosing between the two would tend to obscure the bias. (Though in this case, 1 and 6 would still come up more than 2, 3, 4, and 5. Well, I guess depending on the nature of the imbalance.)
There are many definitions of randomness. One definition of a random series is that it is a series of numbers produced by a random process. By this definition, if I roll a fair die 5 times and get the numbers 2, 4, 3, 2, 5, that is a random series. If I then roll that same fair die 5 more times and get 1, 1, 1, 1, 1, then that is also a random series.
Several posters have pointed out that random functions on a computer are not truly random but rather pseudo-random, and that if you know the algorithm and the seed they are completely predictable. This is true, but most of the time completely irrelevant. If I shuffle a deck of cards and then turn them over one at a time, this should be a random series. If someone peeks at the cards, the result will be completely predictable, but by most definitions of randomness this will not make it less random. If the series passes statistical tests of randomness, the fact that I peeked at the cards will not change that fact. In practice, if we are gambling large sums of money on your ability to guess the next card, then the fact that you peeked at the cards is highly relevant. If we are using the series to simulate the menu picks of visitors to our web site in order to test the performance of the system, then the fact that you peeked will make no difference at all. (As long as you do not modify the program to take advantage of this knowledge.)
EDIT
I don't think I could fit my response to the Monty Hall problem into a comment, so I'll update my answer.
For those who didn't read Belisarius' link, the gist of it is: A game show contestant is given a choice of 3 doors. Behind one is a valuable prize, behind the others something worthless. He picks door #1. Before revealing whether it is a winner or a loser, the host opens door #3 to reveal that it is a loser. He then gives the contestant the opportunity to switch to door #2. Should the contestant do this or not?
The answer, which offends many people's intuition, is that he should switch. The probability that his original pick was the winner is 1/3, that the other door is the winner is 2/3. My initial intuition, along with that of many other people, is that there would be no gain in switching, that the odds have just been changed to 50:50.
After all, suppose that someone switched on the TV just after the host opened the losing door. That person would see two remaining closed doors. Assuming he knows the nature of the game, he would say that there is a 1/2 chance of each door hiding the prize. How can the odds for the viewer be 1/2 : 1/2 while the odds for the contestant are 1/3 : 2/3 ?
I really had to think about this to beat my intuition into shape. To get a handle on it, understand that when we talk about probabilities in a problem like this, we mean, the probability you assign given the available information. To a member of the crew who put the prize behind, say, door #1, the probability that the prize is behind door #1 is 100% and the probability that it is behind either of the other two doors is zero.
The crew member's odds are different than the contestant's odds because he knows something the contestant doesn't, namely, which door he put the prize behind. Likewise, the contestant's odds are different than the viewer's odds because he knows something that the viewer doesn't, namely, which door he initially picked. This is not irrelevant, because the host's choice of which door to open is not random. He will not open the door the contestant picked, and he will not open the door that hides the prize. If these are the same door, that leaves him two choices. If they are different doors, that leaves only one.
So how do we come up with 1/3 and 2/3? When the contestant originally picked a door, he had a 1/3 chance of picking the winner. I think that much is obvious. That means there was a 2/3 chance that one of the other doors is the winner. If the host gave him the opportunity to switch without giving any additional information, there would be no gain. Again, this should be obvious. But one way to look at it is to say that there is a 2/3 chance that he would win by switching. But he has 2 alternatives. So each one has only 2/3 divided by 2 = 1/3 chance of being the winner, which is no better than his original pick. Of course we already knew the final result, this just calculates it a different way.
But now the host reveals that one of those two choices is not the winner. So of the 2/3 chance that a door he didn't pick is the winner, he now knows that 1 of the 2 alternatives isn't it. The other might or might not be. So he no longer has 2/3 divided by 2. He has zero for the open door and 2/3 for the closed door.
Consider you have a simple coin flip problem where even is considered heads and odd is considered tails. The logical implementation is:
rand() mod 2
Over a large enough distribution, the number of even numbers should equal the number of odd numbers.
Now consider a slight tweak:
rand() * rand() mod 2
If one of the results is even, then the entire result should be even. Consider the 4 possible outcomes (even * even = even, even * odd = even, odd * even = even, odd * odd = odd). Now, over a large enough distribution, the answer should be even 75% of the time.
I'd bet heads if I were you.
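A quick simulation of that parity argument (my addition, in Python; randint over 0..32767 stands in for rand()):
import random

TRIALS = 100000
even_single = sum(random.randint(0, 32767) % 2 == 0 for _ in range(TRIALS))
even_product = sum((random.randint(0, 32767) * random.randint(0, 32767)) % 2 == 0
                   for _ in range(TRIALS))
print("rand() even:", even_single / TRIALS)          # close to 0.5
print("rand()*rand() even:", even_product / TRIALS)  # close to 0.75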
This comment is really more of an explanation of why you shouldn't implement a custom random function based on your method than a discussion on the mathematical properties of randomness.
When in doubt about what will happen to the combinations of your random numbers, you can use the lessons you learned in statistical theory.
In OP's situation he wants to know what's the outcome of X*X = X^2 where X is a random variable distributed along Uniform[0,1]. We'll use the CDF technique since it's just a one-to-one mapping.
Since X ~ Uniform[0,1], its pdf is: fX(x) = 1
We want the transformation Y <- X^2 thus y = x^2
Find the inverse x(y): sqrt(y) = x this gives us x as a function of y.
Next, find the derivative dx/dy: d/dy (sqrt(y)) = 1/(2 sqrt(y))
The distribution of Y is given as: fY(y) = fX(x(y)) |dx/dy| = 1/(2 sqrt(y))
We're not done yet; we have to get the domain of Y. Since 0 <= x < 1, 0 <= x^2 < 1, so Y is in the range [0, 1).
If you wanna check whether the pdf of Y is indeed a pdf, integrate it over the domain: integrating 1/(2 sqrt(y)) from 0 to 1 does indeed give 1. Also, notice that the shape of the said function looks like what belisarius posted.
As for things like X1 + X2 + ... + Xn, (where Xi ~ Uniform[0,1]) we can just appeal to the Central Limit Theorem which works for any distribution whose moments exist. This is why the Z-test exists actually.
Other techniques for determining the resulting pdf include the Jacobian transformation (which is the generalized version of the cdf technique) and MGF technique.
EDIT: As a clarification, do note that I'm talking about the distribution of the resulting transformation and not its randomness. That's actually for a separate discussion. Also what I actually derived was for (rand())^2. For rand() * rand() it's much more complicated, which, in any case won't result in a uniform distribution of any sorts.
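A small numerical check of that derivation (my addition, in Python): sample X^2 for uniform X and compare the empirical fraction below y with the analytic CDF, which is the integral of 1/(2 sqrt(y)), i.e. sqrt(y).
import random

SAMPLES = 200000
ys = [random.random() ** 2 for _ in range(SAMPLES)]

for y in (0.1, 0.25, 0.5, 0.9):
    empirical = sum(v <= y for v in ys) / SAMPLES
    analytic = y ** 0.5  # P(X^2 <= y) = sqrt(y) for X ~ Uniform[0,1)
    print(y, empirical, analytic)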
It's not exactly obvious, but rand() is typically more random than rand()*rand(). What's important is that this isn't actually very important for most uses.
But firstly, they produce different distributions. This is not a problem if that is what you want, but it does matter. If you need a particular distribution, then ignore the whole “which is more random” question. So why is rand() more random?
The core of why rand() is more random (under the assumption that it is producing floating-point random numbers with the range [0..1], which is very common) is that when you multiply two FP numbers together with lots of information in the mantissa, you get some loss of information off the end; there just aren't enough bits in an IEEE double-precision float to hold all the information that was in two IEEE double-precision floats uniformly randomly selected from [0..1], and those extra bits of information are lost. Of course, it doesn't matter that much since you (probably) weren't going to use that information, but the loss is real. It also doesn't really matter which distribution you produce (i.e., which operation you use to do the combination). Each of those random numbers has (at best) 52 bits of random information – that's how much an IEEE double can hold – and if you combine two or more into one, you're still limited to having at most 52 bits of random information.
Most uses of random numbers don't use even close to as much randomness as is actually available in the random source. Get a good PRNG and don't worry too much about it. (The level of “goodness” depends on what you're doing with it; you have to be careful when doing Monte Carlo simulation or cryptography, but otherwise you can probably use the standard PRNG as that's usually much quicker.)
Floating randoms are based, in general, on an algorithm that produces an integer between zero and a certain range. As such, by using rand()*rand(), you are essentially saying int_rand()*int_rand()/rand_max^2 - meaning you are excluding any prime number / rand_max^2.
That changes the randomized distribution significantly.
rand() is uniformly distributed on most systems, and difficult to predict if properly seeded. Use that unless you have a particular reason to do math on it (i.e., shaping the distribution to a needed curve).
Multiplying numbers would end up in a smaller solution range depending on your computer architecture.
If the display of your computer shows 16 digits, rand() would be, say, 0.1234567890123;
multiplied by a second rand(), 0.1234567890123, would give 0.0152415 something
you'd definitely find fewer solutions if you'd repeat the experiment 10^14 times.
Most of these distributions happen because you have to limit or normalize the random number.
We normalize it to be all positive, fit within a range, and even to fit within the constraints of the memory size for the assigned variable type.
In other words, because we have to limit the random call between 0 and X (X being the size limit of our variable) we will have a group of "random" numbers between 0 and X.
Now when you add the random number to another random number the sum will be somewhere between 0 and 2X...this skews the values away from the edge points (the probability of adding two small numbers together and two big numbers together is very small when you have two random numbers over a large range).
Think of the case where you had a number that is close to zero and you add it with another random number it will certainly get bigger and away from 0 (this will be true of large numbers as well as it is unlikely to have two large numbers (numbers close to X) returned by the Random function twice.
Now if you were to setup the random method with negative numbers and positive numbers (spanning equally across the zero axis) this would no longer be the case.
Say for instance RandomReal[{-x, x}, 50000]; then you would get an even distribution of numbers on the negative and positive side, and if you were to add the random numbers together they would maintain their "randomness".
Now I'm not sure what would happen with the Random() * Random() with the negative to positive span...that would be an interesting graph to see...but I have to get back to writing code now. :-P
There is no such thing as more random. It is either random or not. Random means "hard to predict". It does not mean non-deterministic. Both random() and random() * random() are equally random if random() is random. Distribution is irrelevant as far as randomness goes. If a non-uniform distribution occurs, it just means that some values are more likely than others; they are still unpredictable.
Since pseudo-randomness is involved, the numbers are very much deterministic. However, pseudo-randomness is often sufficient in probability models and simulations. It is pretty well known that making a pseudo-random number generator complicated only makes it difficult to analyze. It is unlikely to improve randomness; it often causes it to fail statistical tests.
The desired properties of the random numbers are important: repeatability and reproducibility, statistical randomness, (usually) uniformly distributed, and a large period are a few.
Concerning transformations on random numbers: As someone said, the sum of two or more independent, identically distributed uniform random variables tends toward a normal distribution as the number of terms grows. This is the additive central limit theorem. It applies regardless of the source distribution as long as all distributions are independent and identical. The multiplicative central limit theorem says the product of many independent and identically distributed positive random variables tends toward a lognormal distribution. The graph someone else created looks exponential, but it is really lognormal. So random() * random() is lognormally distributed (although it may not be independent since numbers are pulled from the same stream). This may be desirable in some applications. However, it is usually better to generate one random number and transform it to a lognormally-distributed number. Random() * random() may be difficult to analyze.
For more information, consult my book at www.performorama.org. The book is under construction, but the relevant material is there. Note that chapter and section numbers may change over time. Chapter 8 (probability theory) -- sections 8.3.1 and 8.3.3, chapter 10 (random numbers).
We can compare two arrays of numbers regarding the randomness by using
Kolmogorov complexity
If the sequence of numbers can not be compressed, then it is the most random we can reach at this length...
I know that this type of measurement is more a theoretical option...
Actually, when you think about it rand() * rand() is less random than rand(). Here's why.
Essentially, there are the same number of odd numbers as even numbers. Say that 0.04325 is odd, 0.388 is even, 0.4 is even, and 0.15 is odd.
That means that rand() has an equal chance of being an even or odd decimal.
On the other hand, rand() * rand() has its odds stacked a bit differently.
Lets say:
double a = rand();
double b = rand();
double c = a * b;
a and b both have a 50% chance of being even or odd. Knowing that
even * even = even
even * odd = even
odd * odd = odd
odd * even = even
means that there is a 75% chance that c is even, while only a 25% chance it's odd, making the value of rand() * rand() more predictable than rand(), therefore less random.
Use a linear feedback shift register (LFSR) that implements a primitive polynomial.
The result will be a sequence of 2^n - 1 pseudo-random numbers (the all-zero state is excluded), i.e. none repeating in the sequence, where n is the number of bits in the LFSR .... resulting in a uniform distribution.
http://en.wikipedia.org/wiki/Linear_feedback_shift_register
http://www.xilinx.com/support/documentation/application_notes/xapp052.pdf
Use a "random" seed based on microsecs of your computer clock or maybe a subset of the md5 result on some continuously changing data in your file system.
For example, a 32-bit LFSR will generate 2^32 - 1 unique numbers in sequence (no 2 alike) starting with a given non-zero seed.
The sequence will always be in the same order, but the starting point will be different (obviously) for different seeds.
So, if a possibly repeating sequence between seedings is not a problem, this might be a good choice.
I've used 128-bit LFSR's to generate random tests in hardware simulators using a seed which is the md5 results on continuously changing system data.
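For illustration, here is a minimal Fibonacci LFSR sketch in Python, using the 16-bit example from the Wikipedia article linked above (primitive polynomial x^16 + x^14 + x^13 + x^11 + 1); the seed handling is deliberately simplified:
def lfsr16(seed=0xACE1):
    # Taps at bits 16, 14, 13 and 11 (1-based). Any non-zero seed
    # cycles through all 2**16 - 1 non-zero states before repeating.
    state = seed & 0xFFFF
    while True:
        bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state

gen = lfsr16()
print([next(gen) for _ in range(5)])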
Assuming that rand() returns a number between [0, 1) it is obvious that rand() * rand() will be biased toward 0. This is because multiplying x by a number between [0, 1) will result in a number smaller than x. Here is the distribution of 10000 more random numbers:
google.charts.load("current", { packages: ["corechart"] });
google.charts.setOnLoadCallback(drawChart);
function drawChart() {
var i;
var randomNumbers = [];
for (i = 0; i < 10000; i++) {
randomNumbers.push(Math.random() * Math.random());
}
var chart = new google.visualization.Histogram(document.getElementById("chart-1"));
var data = new google.visualization.DataTable();
data.addColumn("number", "Value");
randomNumbers.forEach(function(randomNumber) {
data.addRow([randomNumber]);
});
chart.draw(data, {
title: randomNumbers.length + " rand() * rand() values between [0, 1)",
legend: { position: "none" }
});
}
<script src="https://www.gstatic.com/charts/loader.js"></script>
<div id="chart-1" style="height: 500px">Generating chart...</div>
If rand() returns an integer between [x, y] then you have the following distribution. Notice the number of odd vs even values:
google.charts.load("current", { packages: ["corechart"] });
google.charts.setOnLoadCallback(drawChart);
document.querySelector("#draw-chart").addEventListener("click", drawChart);
function randomInt(min, max) {
return Math.floor(Math.random() * (max - min + 1)) + min;
}
function drawChart() {
var min = Number(document.querySelector("#rand-min").value);
var max = Number(document.querySelector("#rand-max").value);
if (min >= max) {
return;
}
var i;
var randomNumbers = [];
for (i = 0; i < 10000; i++) {
randomNumbers.push(randomInt(min, max) * randomInt(min, max));
}
var chart = new google.visualization.Histogram(document.getElementById("chart-1"));
var data = new google.visualization.DataTable();
data.addColumn("number", "Value");
randomNumbers.forEach(function(randomNumber) {
data.addRow([randomNumber]);
});
chart.draw(data, {
title: randomNumbers.length + " rand() * rand() values between [" + min + ", " + max + "]",
legend: { position: "none" },
histogram: { bucketSize: 1 }
});
}
<script src="https://www.gstatic.com/charts/loader.js"></script>
<input type="number" id="rand-min" value="0" min="0" max="10">
<input type="number" id="rand-max" value="9" min="0" max="10">
<input type="button" id="draw-chart" value="Apply">
<div id="chart-1" style="height: 500px">Generating chart...</div>
OK, so I will try to add some value to complement the other answers by saying that you are creating and using a random number generator.
Random number generators are devices (in a very general sense) that have multiple characteristics which can be modified to fit a purpose. Some of them (from me) are:
Entropy: as in Shannon Entropy
Distribution: statistical distribution (Poisson, normal, etc.)
Type: what is the source of the numbers (algorithm, natural event, combination of, etc.) and algorithm applied.
Efficiency: rapidity or complexity of execution.
Patterns: periodicity, sequences, runs, etc.
and probably more...
In most answers here, distribution is the main point of interest, but by mixing and matching functions and parameters, you create new ways of generating random numbers which will have different characteristics, some of which may not be obvious to evaluate at first glance.
It's easy to show that the sum of two random numbers is not necessarily uniformly random. Imagine you have a 6 sided die and roll it. Each number has a 1/6 chance of appearing. Now say you had 2 dice and summed the result. The distribution of those sums is not uniform. Why? Because certain sums appear more often than others. There are multiple partitions of them. For example the number 2 is the sum of 1+1 only, but 7 can be formed by 3+4 or 4+3 or 5+2 etc., so it has a larger chance of coming up.
Therefore, applying a transform, in this case addition on a random function does not make it more random, or necessarily preserve randomness. In the case of the dice above, the distribution is skewed to 7 and therefore less random.
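A two-line enumeration of that dice example (my addition, in Python), counting how many ways each sum can occur:
from collections import Counter

sums = Counter(a + b for a in range(1, 7) for b in range(1, 7))
print(sorted(sums.items()))  # 2 and 12 occur once each, 7 occurs six times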
As others have already pointed out, this question is hard to answer since every one of us has his own picture of randomness in his head.
That is why, I would highly recommend you to take some time and read through this site to get a better idea of randomness:
http://www.random.org/
To get back to the real question.
There is no more or less random in this term:
both only appear random!
In both cases - just rand() or rand() * rand() - the situation is the same:
After a few billion numbers the sequence will repeat(!).
It appears random to the observer, because he does not know the whole sequence, but the computer has no true random source - so it cannot produce randomness either.
e.g.: Is the weather random?
We do not have enough sensors or knowledge to determine if weather is random or not.
The answer would be: it depends. You might hope that rand()*rand() would be more random than rand(), but consider:
that both depend on the bit size of your values;
that in most cases you are generating numbers with a pseudo-random algorithm (which is mostly a number generator that depends on your computer clock, and is not that random);
that keeping your code readable matters (and not invoking some random voodoo god of random with this kind of mantra).
If you agree with any of the above, I suggest you go for the simple "rand()". Your code would be more readable (you wouldn't ask yourself why you wrote this for... well... more than 2 seconds) and easier to maintain (if you want to replace your rand function with a super_rand).
If you want better randomness, I would recommend you stream it from any source that provides enough noise (radio static), and then a simple rand() should be enough.

Do you have a better idea to simulate coin flip?

Right now I have
return 'Heads' if Math.random() < 0.5
Is there a better way to do this?
Thanks
edit: please ignore the return value; "better" means exact 50-50 probability.
there's always the dead simple
coin = rand(1);
in many scripting languages this will give you a random int between 0 and your arg, so passing 1 gives you 0 or 1 (heads or tails).
a wee homage to xkcd:
string getHeadsOrTails() {
return "heads"; //chosen by fair coin toss,
//guaranteed to be random
}
Numerical Recipes in C says not to trust the built in random number generators when it matters. You could probably implement the algorithm shown in the book as the function ran1(), which it claims passes all known statistical tests of randomness (in 1992) for less than around 10^8 calls.
The basic idea behind the ran1() algorithm is to add a shuffle to the output of the random number generator to reduce low order serial correlations. They use the Bays-Durham shuffle from section 3.2-3.3 in The Art of Computer Programming Volume 2, but I'd guess you could use the Fisher-Yates shuffle too.
If you need more random values than that, the same document also provides a generator (ran2) that should be good for at least 10^17 values (my guess based on a period of 2.3 x 10^18). They also provide a function (ran3) that uses a different method to generate random numbers, should linear congruential generators give you some sort of problem.
You can use any of these functions with your < 0.5 test to be more confident that you are getting a uniform distribution.
What you have is the way I would do it. If 0.0 <= Math.random() < 1.0, as is standard, then (Math.random() < 0.5) is going to give you heads when Math.random() is between 0.0 and 0.4999..., and tails when it's between 0.5 and 0.999... That's as fair a coin flip as you can get.
Of course I'm assuming a good implementation of Math.random().
On a linux system you could read bits in from /dev/random to get "better" random data, but an almost random method like Math.Random() is going to be fine for almost every application you can think of, short of serious cryptography work.
Try differentiating between odd and even numbers. Also, return an enumeration value (or a boolean), rather than a string.
I can't comment on people's posts because I don't have the reputation, but just an FYI about the whole <= vs. < topic addressed in Bill The Lizard's comment: because it can be effectively assumed that random generates any number between 0 and 1 (which isn't technically the case due to limitations on the size of a floating point number, but is more or less true in practice), there won't be a difference between num <= .5 and num < .5, because the probability of getting any one particular number in a continuous range is 0. I.e., P(X = .5) = 0 when X is a random variable between 0 and 1.
The only real answer to this question is that you cannot "guarantee" probability. If you think about it, a real coin flip is not guaranteed 50/50 probability, it depends on the coin, the person flipping it, and if the coin is dropped and rolls across the floor. ;)
The point is that it's "random enough". If you're simulating a coin flip then the code you posted is more than fine.
Try
return 'Heads' if Math.random() * 100 mod 2 = 0
I don't really know what language you are using, but if the random number is divisible by two then it is heads; if it is not, then it is tails.

Using an epsilon value to determine if a ball in a game is not moving?

I have balls bouncing around and each time they collide their speed vector is reduced by the Coefficient of Restitution.
Right now the CoR for my balls is 0.80. So after many bounces my balls have "stopped" rolling because their speed has become some ridiculously small number.
At what stage is it appropriate to check whether a speed value is small enough to simply call it zero (so I don't get the crazy jittering of the balls reacting to their micro-velocities)? I've read on some forums before that people will sometimes use an epsilon constant, some small number, and check against that.
Should I define an epsilon constant and do something like:
if Math.abs(velocity.x) < epsilon then velocity.x = 0
Each time I update the ball's velocity and position? Is this what is generally done? Would it be reasonable to place that in my Vector class's setters for x and y? Or should I do it outside of my vector class when I'm calculating the velocities?
Also, what would be a reasonable epsilon value if I was using floats for my speed vector?
A reasonable value for epsilon is going to depend on the constraints of your system. If you are representing the ball graphically, then your epsilon might correspond to, say, a velocity of .1 pixels a second (ensuring that your notion of stopping matches the user's experience of the screen objects stopping). If you're doing a physics simulation, you'll want to tune it to the accuracy to which you're trying to measure your system.
As for how often you check - that depends as well. If you're simulating something in real time, the extra check might be costly, and you'll want to check every 10 updates or once per second or something. Or performance might not be an issue, and you can check with every update.
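As a concrete sketch (my own, not from the answer): compare the squared magnitude of the velocity vector against epsilon squared, so a ball is not half-stopped on one axis while still moving on the other.
EPSILON = 0.01  # tune to your units, e.g. a fraction of a pixel per frame

def settle(velocity):
    # Zero the whole vector once its magnitude drops below the threshold.
    vx, vy = velocity
    if vx * vx + vy * vy < EPSILON * EPSILON:
        return (0.0, 0.0)
    return (vx, vy)

print(settle((0.004, -0.003)))  # -> (0.0, 0.0)
print(settle((0.5, 0.2)))       # unchanged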
Instead of an epsilon for an IsStillMoving function, maybe you could use an UpdatePosition function, scheduled on an object-by-object basis based on its velocity.
I'd do something like this (in my own make-it-up-as-you-go pseudocode):
void UpdatePosition(Ball b) {
TimeStamp now = Clock.GetTime();
float secondsSinceLastUpdate = now.TimeSince(b.LastUpdate).InSeconds;
Point3D oldPosition = b.Position;
Point3D newPosition = CalculatePosition(b.Position, b.Velocity, secondsSinceLastUpdate);
b.MoveTo(newPosition);
float epsilonOfAccuracy = 0.5; // Accurate to one half-pixel
float pixelDistance = Camera.PixelDistance(oldPosition, newPosition);
float fps = System.CurrentFramesPerSecond;
float secondsToMoveOnePixel = (pixelDistance * secondsSinceLastUpdate) / fps;
float nextUpdateInterval = secondsToMoveOnePixel / epsilonOfAccuracy;
b.SetNextUpdateAt(now + nextUpdateInterval);
}
Balls moving very quickly would get updated on every frame. Balls moving more slowly might update every five or ten frames. And balls that have stopped (or nearly stopped) would update only very very rarely.
IMO your epsilon approach is fine. I would just experiment to see what looks or feels natural to the animation in the game.
Epsilon by nature is the smallest possible increment. Unfortunately, computers have different "minimal" increments of their own depending on the floating point representation. I would be very careful (and might even go higher than what I would calculate just for safety) playing around with that, especially if I want a code to be portable.
You may want to write a function that figures out the minimal increment on your floats rather than use a magic value.

Find number range intersection

What is the best way to find out whether two number ranges intersect?
My number range is 3023-7430, now I want to test which of the following number ranges intersect with it: <3000, 3000-6000, 6000-8000, 8000-10000, >10000. The answer should be 3000-6000 and 6000-8000.
What's the nice, efficient mathematical way to do this in any programming language?
Just a pseudo code guess:
Set<Range> determineIntersectedRanges(Range range, Set<Range> setofRangesToTest)
{
Set<Range> results;
foreach (rangeToTest in setofRangesToTest)
do
if (rangeToTest.end < range.start) continue; // skip this one, it's below our range
if (rangeToTest.start > range.end) continue; // skip this one, it's above our range
results.add(rangeToTest);
done
return results;
}
I would make a Range class and give it a method boolean intersects(Range) . Then you can do a
foreach(Range r : rangeset) { if (range.intersects(r)) res.add(r) }
or, if you use some Java 8 style functional programming for clarity:
rangeset.stream().filter(range::intersects).collect(Collectors.toSet())
The intersection itself is something like
this.start <= other.end && this.end >= other.start
This heavily depends on your ranges. A range can be big or small, and clustered or not clustered. If you have large, clustered ranges (think of "all positive 32-bit integers that can be divided by 2"), the simple approach with Range(lower, upper) will not succeed.
I guess I can say the following:
if you have little ranges (clustering or not clustering does not matter here), consider bitvectors. These little critters are blazing fast with respect to union, intersection and membership testing, even though iteration over all elements might take a while, depending on the size. Furthermore, because they just use a single bit for each element, they are pretty small, unless you throw huge ranges at them.
if you have fewer, larger ranges, then a class Range as described by others will suffice. This class has the attributes lower and upper, and intersection(a, b) is basically a.lower <= b.upper and b.lower <= a.upper. Union and intersection can be implemented in constant time for single ranges; for composite ranges, the time grows with the number of sub-ranges (thus you do not want too many little ranges)
If you have a huge space where your numbers can be, and the ranges are distributed in a nasty fashion, you should take a look at binary decision diagrams (BDDs). These nifty diagrams have two terminal nodes, True and False, and decision nodes for each bit of the input. A decision node has a bit it looks at and two following graph nodes -- one for "bit is one" and one for "bit is zero". Given these conditions, you can encode large ranges in tiny space. All positive integers for arbitrarily large numbers can be encoded in 3 nodes in the graph -- basically a single decision node for the least significant bit which goes to False on 1 and to True on 0.
Intersection and Union are pretty elegant recursive algorithms, for example, the intersection basically takes two corresponding nodes in each BDD, traverse the 1-edge until some result pops up and checks: if one of the results is the False-Terminal, create a 1-branch to the False-terminal in the result BDD. If both are the True-Terminal, create a 1-branch to the True-terminal in the result BDD. If it is something else, create a 1-branch to this something-else in the result BDD. After that, some minimization kicks in (if the 0- and the 1-branch of a node go to the same following BDD / terminal, remove it and pull the incoming transitions to the target) and you are golden. We even went further than that, we worked on simulating addition of sets of integers on BDDs in order to enhance value prediction in order to optimize conditions.
These considerations imply that your operations are bounded by the amount of bits in your number range, that is, by log_2(MAX_NUMBER). Just think of it, you can intersect arbitrary sets of 64-bit-integers in almost constant time.
More information can be found, for example, in Wikipedia and the referenced papers.
Further, if false positives are bearable and you need an existence check only, you can look at Bloom filters. Bloom filters use a vector of hashes in order to check if an element is contained in the represented set. Intersection and Union is constant time. The major problem here is that you get an increasing false-positive rate if you fill up the bloom-filter too much.
Again, more information is in Wikipedia, for example.
Hach, set representation is a fun field. :)
In python
class nrange(object):
def __init__(self, lower = None, upper = None):
self.lower = lower
self.upper = upper
def intersection(self, aRange):
if self.upper < aRange.lower or aRange.upper < self.lower:
return None
else:
return nrange(max(self.lower,aRange.lower), \
min(self.upper,aRange.upper))
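For example, applied to the ranges from the question (my usage sketch, with closed integer bounds standing in for the open-ended <3000 and >10000 buckets):
query = nrange(3023, 7430)
candidates = [nrange(0, 2999), nrange(3000, 6000), nrange(6000, 8000),
              nrange(8000, 10000), nrange(10001, 10**9)]
hits = [c for c in candidates if query.intersection(c) is not None]
# hits contains the 3000-6000 and 6000-8000 ranges, as expected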
If you're using Java, Commons Lang Range has an overlapsRange(Range range) method.
