Hashing line segments - hashtable

I have a number of line segments given by their endpoints. I need to use hashing to store the line segments one by one in a suitable data structure T. While generating, the two endpoints of each segment should be randomly selected, and a newly generated segment should be inserted into T only if it is not already there.
Can someone suggest a suitable hash function by which I can store the endpoints of the line segments uniquely?

A simple approach would be to just use a multiplier and addition.
hash = 2011 * x + y;
The benefit is that it is very fast to calculate. Other solutions could be to use more complicated iterations over the digits, say similar to the Java string hashing algorithm:
for (int i = 0; i < n_digits; i++) {
    hash = hash * 31 + x_y_digit[i];
}
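If the container can simply be a hash set, here is a minimal sketch in Python (an illustration only, assuming integer coordinates and that a segment counts as the same segment regardless of endpoint order, so the two endpoints are put into a canonical order before hashing):
import random

def segment_key(p, q):
    # Canonical order so that (A, B) and (B, A) produce the same key.
    return tuple(sorted([p, q]))

T = set()
while len(T) < 100:                              # generate 100 distinct random segments
    p = (random.randint(0, 1000), random.randint(0, 1000))
    q = (random.randint(0, 1000), random.randint(0, 1000))
    if p != q:
        T.add(segment_key(p, q))                 # duplicates are silently ignored
The same canonical-ordering trick works if you prefer to roll your own hash, e.g. combining 2011 * x + y for each endpoint.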

Clarification on PRNG Algorithm as per ANSI X9.31

Newbie question:
I am studying ANSI X9.31-1998 for implementing a PRNG as per section 2.4. I am not able to properly understand the representation of the variables used, like "ede".
Is "ede" an operation or a variable?
Why is a * used before X? Is it some kind of standard representation?
Is there any specific document which describes all of these?
"A.2.4 Generating Pseudo Random Numbers Using the DEA
Let ede*X(Y) represent the DEA multiple encryption of Y under the key *X.
Let *K be a DEA key pair reserved only for the generation of pseudo random numbers, let V be a 64-bit seed value which is also kept secret, and let XOR be the exclusive-or operator. Let DT be a date/time vector which is updated on each iteration. I is an intermediate value. A 64-bit vector R is generated as follows:
I = ede*K(DT)
R = ede*K(I XOR V) and a new V is generated by V = ede*K(R XOR I).
Successive values of R may be concatenated to produce a pseudo random number of the desired length."
EDE means Encrypt-Decrypt-Encrypt, the usual protocol when using Triple DES.
The use of * looks a lot like the subscript notation that is common in cryptography articles:
EDE_X(Y) means running the algorithm EDE with X as the key.
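As a hedged sketch of how the quoted generation step could be coded, assuming the pycryptodome package (Crypto.Cipher.DES3) is available and that V and DT are 8-byte blocks:
from Crypto.Cipher import DES3

def x931_step(key, v, dt):
    # One iteration of the quoted step: key is a 16- or 24-byte Triple DES key (*K),
    # v is the 8-byte secret seed V, dt is the 8-byte date/time vector DT.
    ede = DES3.new(key, DES3.MODE_ECB)
    xor = lambda a, b: bytes(x ^ y for x, y in zip(a, b))
    i = ede.encrypt(dt)                  # I = ede*K(DT)
    r = ede.encrypt(xor(i, v))           # R = ede*K(I XOR V)
    v_new = ede.encrypt(xor(r, i))       # V = ede*K(R XOR I)
    return r, v_new
Successive R values would then be concatenated to produce as many pseudo random bytes as needed.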

How do I compute the approximate entropy of a bit string?

Is there a standard way to do this?
Googling -- "approximate entropy" bits -- uncovers multiple academic papers but I'd like to just find a chunk of pseudocode defining the approximate entropy for a given bit string of arbitrary length.
(In case this is easier said than done and it depends on the application, my application involves 16,320 bits of encrypted data (ciphertext). But encrypted as a puzzle and not meant to be impossible to crack. I thought I'd first check the entropy but couldn't easily find a good definition of such. So it seemed like a question that ought to be on StackOverflow! Ideas for where to begin with deciphering 16k random-seeming bits are also welcome...)
See also this related question:
What is the computer science definition of entropy?
Entropy is not a property of the string you got, but of the strings you could have obtained instead. In other words, it qualifies the process by which the string was generated.
In the simple case, you get one string among a set of N possible strings, where each string has the same probability of being chosen as every other, i.e. 1/N. In that situation, the string is said to have an entropy of N. The entropy is often expressed in bits, which is a logarithmic scale: an entropy of "n bits" is an entropy equal to 2^n.
For instance: I like to generate my passwords as two lowercase letters, then two digits, then two lowercase letters, and finally two digits (e.g. va85mw24). Letters and digits are chosen randomly, uniformly, and independently of each other. This process may produce 26*26*10*10*26*26*10*10 = 4569760000 distinct passwords, and all these passwords have equal chances to be selected. The entropy of such a password is then 4569760000, which means about 32.1 bits.
Shannon's entropy equation is the standard method of calculation. Here is a simple implementation in Python, shamelessly copied from the Revelation codebase, and thus GPL licensed:
import math

def entropy(string):
    "Calculates the Shannon entropy of a string"
    # get probability of chars in string
    prob = [float(string.count(c)) / len(string) for c in dict.fromkeys(list(string))]
    # calculate the entropy
    entropy = -sum([p * math.log(p) / math.log(2.0) for p in prob])
    return entropy

def entropy_ideal(length):
    "Calculates the ideal Shannon entropy of a string with given length"
    prob = 1.0 / length
    return -1.0 * length * prob * math.log(prob) / math.log(2.0)
Note that this implementation assumes that your input bit-stream is best represented as bytes. This may or may not be the case for your problem domain. What you really want is your bitstream converted into a string of numbers. Just how you decide on what those numbers are is domain specific. If your numbers really are just one and zeros, then convert your bitstream into an array of ones and zeros. The conversion method you choose will affect the results you get, however.
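For instance, a hedged usage sketch that keeps the bits as a string of '0'/'1' characters:
bits = "0110100111010011"      # example bit string
print(entropy(bits))           # observed per-symbol entropy of the 0/1 string
print(entropy_ideal(2))        # 1.0, the maximum for a two-symbol alphabet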
I believe the answer is the Kolmogorov Complexity of the string.
Not only is this not answerable with a chunk of pseudocode, Kolmogorov complexity is not a computable function!
One thing you can do in practice is compress the bit string with the best available data compression algorithm.
The more it compresses the lower the entropy.
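A hedged sketch of that idea in Python, using zlib as the "best available" compressor:
import zlib

def compression_ratio(data):
    # Compressed size / original size: values near 1.0 mean the data looks
    # incompressible (high entropy) to this model; much lower means low entropy.
    return len(zlib.compress(data, 9)) / len(data)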
The NIST Random Number Generator evaluation toolkit has a way of calculating "Approximate Entropy." Here's the short description:
Approximate Entropy Test Description: The focus of this test is the frequency of each and every overlapping m-bit pattern. The purpose of the test is to compare the frequency of overlapping blocks of two consecutive/adjacent lengths (m and m+1) against the expected result for a random sequence.
And a more thorough explanation is available from the PDF on this page:
http://csrc.nist.gov/groups/ST/toolkit/rng/documentation_software.html
There is no single answer. Entropy is always relative to some model. When someone talks about a password having limited entropy, they mean "relative to the ability of an intelligent attacker to predict", and it's always an upper bound.
Your problem is, you're trying to measure entropy in order to help you find a model, and that's impossible; what an entropy measurement can tell you is how good a model is.
Having said that, there are some fairly generic models that you can try; they're called compression algorithms. If gzip can compress your data well, you have found at least one model that can predict it well. And gzip is, for example, mostly insensitive to simple substitution. It can handle "wkh" frequently in the text as easily as it can handle "the".
Using the Shannon entropy of a word with this formula: H = -sum_i p_i * log2(p_i) (see http://imgur.com/a/DpcIH).
Here's an O(n) algorithm that calculates it:
import math
from collections import Counter

def entropy(s):
    l = float(len(s))
    return -sum(map(lambda a: (a / l) * math.log2(a / l), Counter(s).values()))
Here's an implementation in Python (I also added it to the Wiki page):
import numpy as np

def ApEn(U, m, r):
    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])
    def _phi(m):
        x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return -(N - m + 1.0)**(-1) * sum(np.log(C))
    N = len(U)
    return _phi(m) - _phi(m + 1)
Example:
>>> U = np.array([85, 80, 89] * 17)
>>> ApEn(U, 2, 3)
-1.0996541105257052e-05
The above example is consistent with the example given on Wikipedia.

calculate the average of three encrypted numbers

Is it possible to calculate the average of three encrypted integers? There is no constraint on the method of encryption. The point of this is just to hide the three numbers and find the average.
What you seem to be looking for is called Homomorphic Encryption: an encryption scheme which allows you to perform operations on encrypted data, with the encrypted result as the outcome.
Such a scheme would allow you to give encrypted data to a 3rd party, which could then do computations on it for you without knowing what they were computing.
In your case, you need two operations: addition and division. Until recently, homomorphic encryption schemes typically supported only one operation. But in September 2009 IBM announced the first fully homomorphic cryptosystem. Other researchers published another system soon after that.
These cryptosystems might be able to do what you want, but it is all cutting-edge computer science research.
Decrypt the numbers, then calculate their average.
I don't see any simple ways to do what you ask, apart from decrypting the numbers first.
Taking the average (or the "arithmetic mean") requires adding the numbers. Now if you wanted to multiply the numbers, then you could do that neatly with RSA encryption. If p is the plaintext, c is the ciphertext, and e is the encryption key, then in RSA, c = p^e mod n. If you have 3 separate integers, p1, p2, p3, and their product is pp, then (working mod n)
pp^e = (p1 * p2 * p3)^e = p1^e * p2^e * p3^e = c1 * c2 * c3 = cp
That is, you can either multiply the three plaintext integers together and then encrypt, or you can just multiply the three ciphertexts together, and get the same answer. This would get you some way towards the "geometric mean", where you multiply all the numbers together, and then take the cube-root (or nth root for n numbers). Unfortunately, calculating a cube root in modular arithmetic is non-trivial.
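As a toy illustration of that multiplicative property (tiny, completely insecure parameters chosen only for the example):
n, e = 3233, 17                      # toy RSA modulus (61 * 53) and public exponent
p1, p2, p3 = 4, 5, 6
c1, c2, c3 = (pow(p, e, n) for p in (p1, p2, p3))
# Multiplying the ciphertexts equals encrypting the product of the plaintexts.
assert (c1 * c2 * c3) % n == pow((p1 * p2 * p3) % n, e, n)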
With ideal encryption methods: No.
With most real-world encryption methods: No.
With some stupidly simple to undo obfuscation method especially designed to allow averaging: Yes.
Calling the latter method "encryption" really would be using the wrong term.
If you could calculate the average of encrypted numbers without decrypting them, that would make decrypting the original numbers quite a lot easier, so I would be very surprised if this works with any serious encryption algorithm.
In general, three encrypted numbers shouldn't maintain the same order once encrypted, so I'm pretty sure you have to decrypt them and calculate the average.
If, and only if, the method of encryption is a one-to-one mathematical function, then it is possible to do so while the numbers are encrypted.
For example, if my very insecure method of encryption is to multiply every number by 2, then I would do the following:
function encrypt($number){
    return $number * 2;
}
$a = encrypt(3); // a = 6
$b = encrypt(5); // b = 10
$c = encrypt(6); // c = 12
$average = ($a + $b + $c) / 6; // We divide by 6 because first we divide by 3 to get the average, then by 2 to undo the encryption. The divisor will vary based on the mathematical function.
The only other possibility is to decrypt the numbers first.

Obscure / encrypt an order number as another number: symmetrical, "random" appearance?

Client has a simple increasing order number (1, 2, 3...). He wants end-users to receive an 8- or 9-digit (digits only -- no characters) "random" number. Obviously, this "random" number actually has to be unique and reversible (it's really an encryption of the actualOrderNumber).
My first thought was to just shuffle some bits. When I showed the client a sample sequence, he complained that subsequent obfuscOrderNumbers were increasing until they hit a "shuffle" point (point where the lower-order bits came into play). He wants the obfuscOrderNumbers to be as random-seeming as possible.
My next thought was to deterministically seed a linear congruential pseudo-random-number generator and then take the actualOrderNumber th value. But in that case, I need to worry about collisions -- the client wants an algorithm that is guaranteed not to collide in at least 10^7 cycles.
My third thought was "eh, just encrypt the darn thing," but if I use a stock encryption library, I'd have to post-process it to get the 8-or-9 digits only requirement.
My fourth thought was to interpret the bits of actualOrderNumber as a Gray-coded integer and return that.
My fifth thought was: "I am probably overthinking this. I bet someone on StackOverflow can do this in a couple lines of code."
Pick an 8- or 9-digit number at random, say 839712541. Then, take your order number's binary representation (for this example, I'm not using 2's complement), pad it out to the same number of bits (30), reverse it, and XOR the flipped order number with the magic number. For example:
1 = 000000000000000000000000000001
Flip = 100000000000000000000000000000
839712541 = 110010000011001111111100011101
XOR = 010010000011001111111100011101 = 302841629
2 = 000000000000000000000000000010
Flip = 010000000000000000000000000000
839712541 = 110010000011001111111100011101
XOR = 100010000011001111111100011101 = 571277085
To get the order numbers back, xor the output number with your magic number, convert to a bit string, and reverse.
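A minimal sketch of this reverse-then-XOR scheme in Python, using the same 30-bit width and magic number as the example above:
MAGIC = 839712541
BITS = 30

def obfuscate(order_no):
    # Reverse the 30-bit binary representation, then XOR with the magic number.
    flipped = int(format(order_no, f"0{BITS}b")[::-1], 2)
    return flipped ^ MAGIC

def deobfuscate(obf):
    # Undo the XOR, then reverse the bits back.
    flipped = obf ^ MAGIC
    return int(format(flipped, f"0{BITS}b")[::-1], 2)
obfuscate(1) gives 302841629 and obfuscate(2) gives 571277085, matching the worked examples.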
Hash function? http://www.partow.net/programming/hashfunctions/index.html
Will the client require the distribution of obfuscated consecutive order numbers to look like anything in particular?
If you do not want to complicate yourself with encryption, use a combination of bit shuffling with a bit of random salting (if you have bits/digits to spare) XOR-superimposed over some fixed constant (or some function of something that would be readily available alongside the obfuscated order ID at any time, such as perhaps the customer_id who placed the order?)
EDIT
It appears that all the client desires is for an outside party not to be able to infer the progress of sales. In this case a shuffling solution (bit-mapping, e.g. original bit 1 maps to obfuscated bit 6, original bit 6 maps to obfuscated bit 3, etc.) should be more than sufficient. Add some random bits if you really want to make it harder to crack, provided that you have the additional bits available (e.g. if original order numbers go only up to 6 digits but you're allowed 8-9 in the obfuscated order number, then you can use 2-3 digits for randomness before performing the bit-mapping). Possibly XOR the result for additional intimidation. An inquisitive party might attempt to generate two consecutive obfuscated orders and XOR them against each other to get rid of the XOR constant, but would then have to deduce which of the non-zero bits come from the salt, which came from an increment, and whether he really got two consecutive order numbers at all; he would have to repeat this for a significant number of what he'd hope are consecutive order numbers in order to crack it.
EDIT2
You can, of course, allocate completely random numbers for the obfuscated order IDs, store the correspondence to persistent storage (e.g. DB) and perform collision detection as well as de-obfuscation against same storage. A bit of overkill if you ask me, but on the plus side it's the best as far as obfuscation goes (and you implement whichever distribution function your soul desires, and you can change the distribution function anytime you like.)
In a 9-digit number, the first digit is a random index between 0 and 7 (or 1-8). Put another random digit at that position (counting from the right); the rest is the "real" order number:
Orig order: 100
Random index: 5
Random digit: 4 (guaranteed, rolled a dice :) )
Result: 500040100
Orig Nr: 101
Random index: 2
Random digit: 6
Result: 200001061
You can decide that the 5th (or any other) digit is the index.
Or, if you can live with real order numbers of 6 digits, then you can introduce "secondary" index as well. And you can reverse the order of the digits in the "real" order nr.
I saw this rather late, (!) hence my rather belated response. It may be useful to others coming along later.
You said: "My third thought was "eh, just encrypt the darn thing," but if I use a stock encryption library, I'd have to post-process it to get the 8-or-9 digits only requirement."
That is correct. Encryption is reversible and guaranteed to be unique for a given input. As you point out, most standard encryptions do not have the right block size. There is one however, Hasty Pudding Cipher which can have any block size from 1 bit upwards.
Alternatively you can write your own. Given that you don't need something the NSA can't crack, then you can construct a simple Feistel cipher to meet your needs.
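As a hedged sketch of that suggestion, here is a tiny non-cryptographic Feistel construction in Python; the round keys, the SHA-256-based round function and the 4-digit half size are illustrative choices (not any standard), giving a reversible mapping over 8-digit order numbers:
import hashlib

KEYS = [b"k1", b"k2", b"k3", b"k4"]     # illustrative round keys
HALF = 10000                            # each half holds 4 decimal digits

def _round(value, key):
    # Non-cryptographic round function: hash the half together with the round key.
    digest = hashlib.sha256(key + value.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:4], "big") % HALF

def obscure(n):
    left, right = divmod(n, HALF)
    for k in KEYS:
        left, right = right, (left + _round(right, k)) % HALF
    return left * HALF + right

def reveal(n):
    left, right = divmod(n, HALF)
    for k in reversed(KEYS):
        left, right = (right - _round(left, k)) % HALF, left
    return left * HALF + right
Zero-pad the result for display; reveal(obscure(n)) == n for every n below 10**8.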
If your order ID is unique, you can simply make a prefix and add/mix that prefix with your order ID.
Something like this:
long pre = DateTime.Now.Ticks % 100;
string prefix = pre.ToString();
string number = prefix + YOURID.ToString();
<?php
$cry = array(0=>5, 1=>3, 2=>9, 3=>2, 4=>7, 5=>6, 6=>1, 7=>8, 8=>0, 9=>4);

function enc($e, $cry, $k) {
    if (strlen($e) > 10) die("max encrypt digits is 10");
    if (strlen($e) >= $k) die("Request encrypt must be lesser than its length");
    if (strlen($e) == 0) die("must pass some numbers");
    $ct = $e;
    $jump = ($k - 1) - strlen($e);
    $ency = $cry[strlen($e)];
    $n = 0;
    for ($a = 0; $a < $k - 1; $a++) {
        if ($jump > 0) {
            if ($a % 2 == 1) {
                $ency .= rand(0, 9);
                $jump -= 1;
            } else {
                if (isset($ct[$n])) {
                    $ency .= $cry[$ct[$n]];
                    $n++;
                } else {
                    $ency .= rand(0, 9);
                    $jump -= 1;
                }
            }
        } else {
            $ency .= $cry[$ct[$n]];
            $n++;
        }
    }
    return $ency;
}

function dec($e, $cry) {
    $ar = str_split($e, 1);
    $len = array_search($ar[0], $cry);
    $jump = strlen($e) - ($len + 1);
    $val = "";
    for ($i = 1; $i < strlen($e); $i++) {
        if ($i % 2 == 0) {
            if ($jump > 0) {
                $jump--;
            } else {
                $val .= array_search($e[$i], $cry);
            }
        } else {
            if ($len > 0) {
                $val .= array_search($e[$i], $cry);
                $len--;
            } else {
                $jump--;
            }
        }
    }
    return $val;
}

if (isset($_GET["n"])) {
    $n = $_GET["n"];
} else {
    $n = 1000;
}
$str = 1253;
$str = enc($str, $cry, 15);
echo "Encrypted Value : " . $str . "<br/>";
$str = dec($str, $cry);
echo "Decrypted Value : " . $str . "<br/>";
?>

How to test randomness (case in point - Shuffling)

First off, this question is ripped out from this question. I did it because I think this part is bigger than a sub-part of a longer question. If it offends, please pardon me.
Assume that you have an algorithm that generates randomness. Now how do you test it?
Or to be more direct - Assume you have an algorithm that shuffles a deck of cards, how do you test that it's a perfectly random algorithm?
To add some theory to the problem -
A deck of cards can be shuffled in 52! (52 factorial) different ways. Take a deck of cards, shuffle it by hand and write down the order of all cards. What is the probability that you would have gotten exactly that shuffle? Answer: 1 / 52!.
What is the chance that you, after shuffling, will get A, K, Q, J ... of each suit in a sequence? Answer 1 / 52!
So, just shuffling once and looking at the result will give you absolutely no information about your shuffling algorithm's randomness. Shuffle twice and you have more information, three times even more...
How would you black box test a shuffling algorithm for randomness?
Statistics. The de facto standard for testing RNGs is the Diehard suite (originally available at http://stat.fsu.edu/pub/diehard). Alternatively, the Ent program provides tests that are simpler to interpret but less comprehensive.
As for shuffling algorithms, use a well-known algorithm such as Fisher-Yates (a.k.a "Knuth Shuffle"). The shuffle will be uniformly random so long as the underlying RNG is uniformly random. If you are using Java, this algorithm is available in the standard library (see Collections.shuffle).
It probably doesn't matter for most applications, but be aware that most RNGs do not provide sufficient degrees of freedom to produce every possible permutation of a 52-card deck (explained here).
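For reference, a minimal sketch of the Fisher-Yates shuffle (shown here in Python; in Java you would just call Collections.shuffle):
import random

def fisher_yates(deck):
    # Walk backwards, swapping each position with a uniformly chosen
    # earlier (or same) position; uniform as long as the underlying RNG is.
    for i in range(len(deck) - 1, 0, -1):
        j = random.randint(0, i)
        deck[i], deck[j] = deck[j], deck[i]
    return deck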
Here's one simple check that you can perform. It uses generated random numbers to estimate Pi. It's not proof of randomness, but poor RNGs typically don't do well on it (they will return something like 2.5 or 3.8 rather than ~3.14).
Ideally this would be just one of many tests that you would run to check randomness.
Something else that you can check is the standard deviation of the output. The expected standard deviation for a uniformly distributed population of values in the range 0..n approaches n/sqrt(12).
/**
 * This is a rudimentary check to ensure that the output of a given RNG
 * is approximately uniformly distributed. If the RNG output is not
 * uniformly distributed, this method will return a poor estimate for the
 * value of pi.
 * @param rng The RNG to test.
 * @param iterations The number of random points to generate for use in the
 * calculation. This value needs to be sufficiently large in order to
 * produce a reasonably accurate result (assuming the RNG is uniform).
 * Less than 10,000 is not particularly useful. 100,000 should be sufficient.
 * @return An approximation of pi generated using the provided RNG.
 */
public static double calculateMonteCarloValueForPi(Random rng,
                                                    int iterations)
{
    // Assumes a quadrant of a circle of radius 1, bounded by a box with
    // sides of length 1. The area of the square is therefore 1 square unit
    // and the area of the quadrant is (pi * r^2) / 4.
    int totalInsideQuadrant = 0;
    // Generate the specified number of random points and count how many fall
    // within the quadrant and how many do not. We expect the number of points
    // in the quadrant (expressed as a fraction of the total number of points)
    // to be pi/4. Therefore pi = 4 * ratio.
    for (int i = 0; i < iterations; i++)
    {
        double x = rng.nextDouble();
        double y = rng.nextDouble();
        if (isInQuadrant(x, y))
        {
            ++totalInsideQuadrant;
        }
    }
    // From these figures we can deduce an approximate value for Pi.
    return 4 * ((double) totalInsideQuadrant / iterations);
}

/**
 * Uses Pythagoras' theorem to determine whether the specified coordinates
 * fall within the area of the quadrant of a circle of radius 1 that is
 * centered on the origin.
 * @param x The x-coordinate of the point (must be between 0 and 1).
 * @param y The y-coordinate of the point (must be between 0 and 1).
 * @return True if the point is within the quadrant, false otherwise.
 */
private static boolean isInQuadrant(double x, double y)
{
    double distance = Math.sqrt((x * x) + (y * y));
    return distance <= 1;
}
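The standard-deviation check mentioned above could look roughly like this (a hedged sketch in Python rather than Java, using n/sqrt(12) as the expected value for a uniform distribution on 0..n):
import random, statistics

n = 51
samples = [random.uniform(0, n) for _ in range(100000)]
observed = statistics.pstdev(samples)
expected = n / (12 ** 0.5)            # about 14.72 for n = 51
print(observed, expected)             # the two should be close for a uniform RNG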
First, it is impossible to know for sure if a certain finite output is "truly random" since, as you point out, any output is possible.
What can be done, is to take a sequence of outputs and check various measurements of this sequence against what is more likely. You can derive a sort of confidence score that the generating algorithm is doing a good job.
For example, you could check the output of 10 different shuffles. Assign a number 0-51 to each card, and take the average of the card in position 6 across the shuffles. The convergent average is 25.5, so you would be surprised to see a value of 1 here. You could use the central limit theorem to get an estimate of how likely each average is for a given position.
But we shouldn't stop here! Because this algorithm could be fooled by a system that only alternates between two shuffles that are designed to give the exact average of 25.5 at each position. How can we do better?
We expect a uniform distribution (equal likelihood for any given card) at each position, across different shuffles. So among the 10 shuffles, we could try to verify that the choices 'look uniform.' This is basically just a reduced version of the original problem. You could check that the standard deviation looks reasonable, that the min is reasonable, and the max value as well. You could also check that other values, such as the closest two cards (by our assigned numbers), also make sense.
But we also can't just add various measurements like this ad infinitum, since, given enough statistics, any particular shuffle will appear highly unlikely for some reason (e.g. this is one of very few shuffles in which cards X,Y,Z appear in order). So the big question is: which is the right set of measurements to take? Here I have to admit that I don't know the best answer. However, if you have a certain application in mind, you can choose a good set of properties/measurements to test, and work with those -- this seems to be the way cryptographers handle things.
There's a lot of theory on testing randomness. For a very simple test on a card shuffling algorithm you could do a lot of shuffles and then run a chi squared test that the probability of each card turning up in any position was uniform. But that doesn't test that consecutive cards aren't correlated so you would also want to do tests on that.
Volume 2 of Knuth's Art of Computer Programming gives a number of tests that you could use in sections 3.3.2 (Empirical tests) and 3.3.4 (The Spectral Test) and the theory behind them.
The only way to test for randomness is to write a program that attempts to build a predictive model for the data being tested, and then use that model to try to predict future data, and then showing that the uncertainty, or entropy, of its predictions tend towards maximum (i.e. the uniform distribution) over time. Of course, you'll always be uncertain whether or not your model has captured all of the necessary context; given a model, it'll always be possible to build a second model that generates non-random data that looks random to the first. But as long as you accept that the orbit of Pluto has an insignificant influence on the results of the shuffling algorithm, then you should be able to satisfy yourself that its results are acceptably random.
Of course, if you do this, you might as well use your model generatively, to actually create the data you want. And if you do that, then you're back at square one.
Shuffle a lot, and then record the outcomes (if I'm reading this correctly). I remember seeing comparisons of "random number generators". They just test it over and over, then graph the results.
If it is truly random the graph will be mostly even.
I'm not fully following your question. You say
Assume that you have a algorithm that generates randomness. Now how do you test it?
What do you mean? If you're assuming you can generate randomness, there's no need to test it.
Once you have a good random number generator, creating a random permutation is easy (e.g. Call your cards 1-52. Generate 52 random numbers assigning each one to a card in order, and then sort according to your 52 randoms) . You're not going to destroy the randomness of your good RNG by generating your permutation.
The difficult question is whether you can trust your RNG. Here's a sample link to people discussing that issue in a specific context.
Testing 52! possibilities is of course impossible. Instead, try your shuffle on smaller numbers of cards, like 3, 5, and 10. Then you can test billions of shuffles and use a histogram and the chi-square statistical test to prove that each permutation is coming up an "even" number of times.
No code so far, therefore I copy-paste a testing part from my answer to the original question.
// ...
int main() {
    typedef std::map<std::pair<size_t, Deck::value_type>, size_t> Map;
    Map freqs;
    Deck d;
    const size_t ntests = 100000;
    // compute frequencies of events: card at position
    for (size_t i = 0; i < ntests; ++i) {
        d.shuffle();
        size_t pos = 0;
        for (Deck::const_iterator j = d.begin(); j != d.end(); ++j, ++pos)
            ++freqs[std::make_pair(pos, *j)];
    }
    // if Deck.shuffle() is correct then all frequencies must be similar
    for (Map::const_iterator j = freqs.begin(); j != freqs.end(); ++j)
        std::cout << "pos=" << j->first.first << " card=" << j->first.second
                  << " freq=" << j->second << std::endl;
}
This code does not test randomness of underlying pseudorandom number generator. Testing PRNG randomness is a whole branch of science.
For a quick test, you can always try compressing it. Once it doesn't compress, then you can move onto other tests.
I've tried dieharder but it refuses to work for a shuffle. All tests fail. It is also really stodgy; it won't let you specify the range of values you want or anything like that.
Pondering it myself, what I would do is something like:
Setup (Pseudo code)
// A card has a Number 0-51 and a position 0-51
int[][] StatMatrix = new int[52][52]; // Assume all are set to 0 as starting values
ShuffleCards();
ForEach (card in Cards) {
    StatMatrix[Card.Position][Card.Number]++;
}
This gives us a matrix 52x52 indicating how many times a card has ended up at a certain position. Repeat this a large number of times (I would start with 1000, but people better at statistics than me may give a better number).
Analyze the matrix
If we have perfect randomness and perform the shuffle an infinite number of times then for each card and for each position the number of times the card ended up in that position is the same as for any other card. Saying the same thing in a different way:
statMatrix[position][card] / numberOfShuffles = 1/52.
So I would calculate how far from that number we are.
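A hedged sketch of that comparison in Python, with stat_matrix and num_shuffles as assumed names mirroring the pseudocode above; the chi-squared statistic sums the squared deviations from the expected count of num_shuffles/52 per cell, and a large value suggests the shuffle is biased:
def chi_squared(stat_matrix, num_shuffles):
    expected = num_shuffles / 52.0        # expected hits per (position, card) cell
    return sum((observed - expected) ** 2 / expected
               for row in stat_matrix
               for observed in row)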
