No such thing as a perfect hash function? [closed]

According to http://java-bytes.blogspot.com/2009/10/hashcode-of-string-in-java.html: "First off, its a known fact that there is no perfect hashing algorithm, for which there are no collisions."
The author is talking practically and not theoretically, right? Because theoretically, here is a perfect hash function: "for a given object, assign it a new number". There are infinitely many numbers, so we'll always have a unique one to assign to each object. In practice this isn't feasible, though, because we have a limited amount of memory.

Typically, a hash function maps from one set of objects (the universe) to a smaller set of objects (the codomain). Commonly, the universe is an infinite set, such as the set of all strings or the set of all numbers, and the codomain is a finite set, such as the set of all 512-bit strings, or the set of all numbers between 0 and some number k, etc. In Java, the hashCode function on objects has as its codomain the values that can be represented by an int, i.e., all 32-bit integers.
I believe that what the author is talking about when they say "there is no perfect hash function" is that there is no possible way to map the infinite set of all strings into the set of all 32-bit integers without having at least one collision. In fact, if you pick 2^32 + 1 different strings, you're guaranteed to have at least one collision.
Your argument - couldn't we just assign each object a different hash code? - makes the implicit assumption that the codomain of the hash function is infinite. For example, if you were to try this approach to build a hash function for strings, the codomain of the hash function would have to be at least as large as the set of all possible natural numbers, since there are infinitely many strings. Most programming languages don't support hash codes that work this way, though you're correct that in theory this would work. Of course, someone might object and say that this doesn't count as a valid hash function, since typically hash functions have finite codomains.
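To make the pigeonhole argument concrete, here is a small Python sketch (not part of the original answer; the 8-bit truncation is an illustrative stand-in for the 32-bit case): feed a hash more distinct inputs than its codomain has values, and a collision is guaranteed.

import hashlib

# Toy illustration of the pigeonhole principle behind "no perfect hash":
# a hash with an 8-bit codomain (256 values) must collide once we feed it
# 257 distinct inputs. The same argument applies to 32-bit hash codes and
# 2^32 + 1 distinct strings.
def tiny_hash(s):
    return hashlib.md5(s.encode()).digest()[0]   # keep 1 byte = 8 bits

seen = {}
for i in range(257):                             # 257 inputs, 256 possible outputs
    s = "string-%d" % i
    h = tiny_hash(s)
    if h in seen:
        print("collision:", seen[h], "and", s, "both hash to", h)
        break
    seen[h] = s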
Hope this helps!

Related

Write fast Common Lisp code [closed]

I'm not sure whether some weird things make my code faster:
Is it normally better to use built-in operations, or to write new specialized functions that do the same thing?
(for example a version of #'map only for vectors; my version is often faster without type declarations)
Should I define new (complicated) types to use in declarations?
(for example a typed list)
Should I define slots directly on an object? (for example px and py for a 2-dimensional object, or is it OK to use a single slot pos of type vector that I could reuse for other purposes?)
There are a few parts to this, but here is a quick braindump.
PROFILE!
Use a distribution of CL that has a profiler built in; I use SBCL, for example: http://www.sbcl.org/1.0/manual/Statistical-Profiler.html
The nice thing about the sbcl profiler is that once you have profiled a function, if you disassemble it, the machine code is annotated with statistics. This requires some knowledge of the target machine code.
Do not underestimate your implementation: it may have advanced type and flow analysis built in and be able to, for example, pick a vector-only version of map when it makes sense.
Learn compiler macros: compiler macros can shadow functions, which gives you a place to put extra optimizations based on the context of the form. This is done without replacing the function, so it can still be used in a higher-order way.
Learn Type declarations
I found that this series of blog posts helped me understand this technique: http://nklein.com/tags/optimization/page/2/ Read them all!
ONE MASSIVE NOTE: Don't ever lie to your compiler about a type. Type declarations are a way of telling your compiler that you know what the type is; the compiler doesn't even have to use them, and when it does, it doesn't have to check that you are giving it the correct thing.
Unboxed data
Some implementations are able to unbox certain datatypes under certain conditions. Sorry, that is vague, but you will need to read up on your implementation. For SBCL, the 'SBCL Internals' guide is very helpful.
For example:
(make-array 100 :element-type 'single-float :initial-element 0.0)
can be stored as a contiguous block of memory in SBCL.
PROFILE AGAIN (With realistic data)
I spent 3 hours writing a crazy compiler-macro-based n-dimensional matrix multiplication routine and then tested it against a 1-line built-in solution. For matrices below 5 dimensions there was not a big difference! For higher dimensions, yeah, it rocked, but that 'performance benefit' is purely academic because those code paths were never touched. Luckily I undertook the task for fun, as I was asking the same question you are now.
Algorithms
All the type specifiers in the world won't give you a 100x performance increase; that comes from better techniques. Read up on the maths behind the problem, implement different helper functions that have different strengths, and choose between them at runtime...then go back and use compiler macros to let Lisp choose at compile time. Or specify the technique as a higher-order function; for example, make-hash-table lets you specify the hashing function and rehash sizes, which can be crucial in getting good performance at certain sizes.
Know the limits of BigO
Algorithmic complexity means nothing if you lose all of the performance due to memory-locality issues. Conversely, we can sometimes achieve superlinear performance characteristics if, by splitting the problem among cores, the reduced dataset now fits in the L2 cache.
BigO is a great metric but it isn't the end of the story. This is the reason assoc lists are a totally valid alternative to hash-tables for low numbers of keys and certain access profiles.
Summary
There is a golden mantra I heard from somewhere in the lisp community that works so well:
Make it fast and then make it Fast!
If nothing else follow this. Chant it to yourself!
Get the program up and running quickly; in doing so, you are more likely to spot the places where you can use a better technique or algorithm to get your several-orders-of-magnitude improvement. Do use CL's own functions first. Don't trade away Lisp's higher-order nature too early by using macros; explore how far you can go with functions.
[Edit] More notes - the following is for SBCL
Type declarations on struct slots are used for optimization; type declarations on class slots are not.
With regard to types, start with what makes the program easy to write and understand (make it fast) and then look into access times if that is the bottleneck (make it Fast!)
(slot-value x 'name) is very fast when name is known. Look at how with-slots uses symbol-macrolet to its advantage.
So to kinda directly answer your original question:
built in first (also check libraries)
does it make the problem easier to write and understand?
use pos. By the time the performance of that indirection becomes an issue, you will have found a dozen other ways to speed up the problem, and the solution will be part of a wider optimization technique.

explicitly defining variable addresses (in Go) [closed]

Simplified Question:
Is it practical for a programmer to keep track of the addresses of variables, so that a variable's address can be used as a point of data on that variable?
Original Question:
I am attempting to wrap my head around how variables are stored and referenced by address using pointers in Go.
As a general principle, is it ever useful to assign a variable's address directly? I can imagine a situation in which data could be encoded in the physical (virtual) address of a variable, and not necessarily in the value of that variable.
For instance, the 1000th customer has made 500 dollars of purchases. Could I store an integer at location 1000 with a value of 500?
I know that the common way to do something like this is with an array, where the variable at position 999 corresponds to the 1000th customer, but my question is not about arrays, it's about assigning addresses directly.
Suppose I'm dealing with billions of objects. Is there an easy way to use the address as part of the data on the object, and the value stored at that location as different data?
For instance, an int at address 135851851904 holds a value of 46876, 135851851905 holds 123498761, etc. I imagine at this point an array or slice would be far too large to be efficient.
Incidentally, if my question stems from a misunderstanding, is there a resource someone can provide that explains the topic in deep but understandable detail? I have been unable to find a good resource on the subject that really explains the details.
is it ever useful to assign a variable's address directly?
You can use the unsafe package to achieve that, but the idea is that you don't do it unless you have a concrete and otherwise unsolvable use case that requires it.
Could I store an integer at location 1000 with a value of 500?
As mentioned before, it is possible, but choosing an arbitrary address won't get you far, because it may not even be mapped. If you write to such a location you'll get an access violation (and your program will crash). If you happen to hit a valid address you'll likely be overwriting other data that your program needs to run.
Is there an easy way to use the address as part of the data on the object, and the value stored at that location as different data?
In general no.
If you managed to build some kind of algebraic structure, closed under the operations by which your own pointer arithmetic is defined, over a finite set of addresses that you can guarantee to always be a valid virtual memory segment, then yes, but it defeats the purpose of using a garbage-collected language. Additionally, such a program would be hell to read.

serpent encryption - better than rijndael? [closed]

Is Serpent-256 better than Rijndael-256 in terms of security? (speed doesn't matter)
Would Serpent encryption combined with SHA-512 be enough to safeguard sensitive data?
And to what extent? (SECRET, TOP SECRET, CLASSIFIED etc.)
Moreover, Rijndael has a maximum of 14 rounds. Serpent has 32 rounds, so it must be more secure.
As I've read that the Rijndael cipher is cryptographically broken, why isn't Serpent
adopted more widely? Would it be that slow if implemented on hardware?
If there are any other technical specifications about Serpent that you can link me to, I would be very grateful.
Thank you.
The number of rounds, by itself, doesn't determine the security of a cipher. You need to take the round function into account before the number of rounds means anything.
Nonetheless, I'd agree that there's a pretty decent chance that Serpent is more secure than AES. There are attacks currently known against AES that reduce the complexity by a factor of approximately 4 compared to a pure brute-force attack.
Cryptographers count that as a successful attack, but from a practical viewpoint it's of precisely zero consequence. Even if you restrict yourself to AES-128, it's basically reducing complexity from 16 times the estimated life of the universe to only 4 times the estimated life of the universe (I'm sort of making up numbers here, but you get the general idea). With AES-256, the number is so much larger that the factor of four shrinks to a new level of utterly meaningless insignificance.
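A rough back-of-the-envelope Python sketch of why that factor of 4 is irrelevant (the 10^12 guesses-per-second rate and the universe-age figure are my own illustrative assumptions, not from the answer):

# How long a brute-force key search takes at an assumed 10^12 guesses per
# second, with and without a factor-of-4 reduction in work.
GUESSES_PER_SECOND = 10**12
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
AGE_OF_UNIVERSE_YEARS = 1.38e10      # rough figure

for label, work in [("2^128 (pure brute force)", 2**128),
                    ("2^126 (attack ~4x better)", 2**126)]:
    years = work / GUESSES_PER_SECOND / SECONDS_PER_YEAR
    print("%s: %.2e years (~%.1e universe lifetimes)"
          % (label, years, years / AGE_OF_UNIVERSE_YEARS))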
Until/unless a dramatically better attack is found, real security is completely unaffected. In essentially every case, the problems you need to deal with and worry about are in how the cipher is used, how keys are generated, stored, and exchanged, etc. Changing from AES to Serpent (or Mars, Twofish, etc.) is extremely unlikely to improve your security (or anybody else's).
I should probably add: I'm probably as strong an advocate as anybody of having more cipher algorithms available and standardized. If you do a little looking, you can find where I'm cited in the papers submitted to NIST during AES standardization on that subject, giving use cases where including more than one algorithm in the standard would have been useful. Nonetheless, I have to admit that no (publicly known) current attack even comes close to giving a real reason to choose a different cipher algorithm.

Theoretically hashing something 2^128 times with the MD5 algorithm [closed]

This is a purely hypothetical question, but if you were to start with 128 bits and then hash them 2^128 times, say with the MD5 algorithm, would you eventually come back to your original bits? Would all possible combinations have been used? And if not, are there certain numbers that "hash back to themselves" faster than others?
I assume this is practically impossible to achieve (after looking at my calculator's answer to 2^128...), and I'm pretty sure the answer would be different for different algorithms, but that doesn't stop one from theorizing, does it?
So yeah, that's it, hope someone out there will have some more knowledge on this topic. Looking forward to seeing the answer(s), thanks in advance!
Edit:
To clarify: what interests me the most in this question is whether it will go through all possible bit combinations, or whether there are instead several smaller cycles, though any additional, relevant and interesting information is appreciated.
A good cryptographic hash should have some, but not too many, cycles in it; that makes it much harder to create rainbow tables for it. This occurs in MD5; actually, a problem with MD5 is that it's a bit too easy to find hash collisions for a given hash with the algorithm. This weakness makes it computationally feasible to inject malicious data into a file that is hashed with MD5 for verification.
I think you are assuming some Fermat's-little-theorem-style property of MD5, but this is not the case. The hash function will probably start to walk in circles quite soon, and it should.
There's also a very memory-efficient way to find MD5 cycles. Also have a look at MD5CRK.
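For intuition, here is a small Python sketch (mine, not from the answer) that runs Floyd's tortoise-and-hare cycle finding on a deliberately truncated MD5; the 24-bit truncation is purely an illustrative assumption, since finding a cycle of the full 128-bit function is far out of reach.

import hashlib

# Iterate a *truncated* MD5 and find a cycle with Floyd's tortoise-and-hare.
# The point: iteration settles into a cycle long before all 2^24 possible
# values have been visited.
def h(x):
    return hashlib.md5(x).digest()[:3]        # keep only 3 bytes = 24 bits

def cycle_length(seed):
    tortoise, hare = h(seed), h(h(seed))
    while tortoise != hare:                   # advance 1x and 2x until they meet
        tortoise, hare = h(tortoise), h(h(hare))
    length, probe = 1, h(tortoise)            # the meeting point lies on the cycle
    while probe != tortoise:
        probe, length = h(probe), length + 1
    return length

print("cycle length:", cycle_length(b"seed")) # tiny compared to 2**24 states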
If you really want a unique "hashing" of a 128-bit id, you should use an ordinary encryption algorithm instead, for instance AES, applied to the particular number with a secret key. This gives you a "random" but unique sequence of values from an increasing id, since you can always decrypt the information in a unique way, given the same key that was used to encrypt the data.

Math Functions that cannot be reversed?

I am curious about developing my own Simple Hashing mechanism.
I would like to know some math functions that are irreversible.
I know that raising to a power and the modulus are functions that are irreversible, in the sense that the reverse procedure gives two answers.
e.g.: square root(4) = 2 or -2
I need a function that is not reversible because, even if someone cracked my cipher, they should not be able to produce a decrypter that can easily recover the passwords from my hashes.
Using such functions I can make my hashing more secure.
It would be helpful if someone could give more such functions with explanations.
Squaring in R is irreversible in the sense that it loses information. But that's not at all what hash functions are about.
Cryptographic hash functions have two main properties:
It's hard to find two inputs with the same output, called a collision
It's hard to find an input matching a given output, called a pre-image
Squaring on R has neither of these properties:
Finding a collision is trivial. Given x just calculate -x, both of which square to x*x.
Finding a pre-image is easy. Calculate the square-root. There are efficient algorithms for this. (Ignoring the problem that you can't output the infinite sequence of digits if the result is irrational)
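A tiny illustration of both points (mine, not from the answer), in Python:

import math

x = 7.0
print(x * x == (-x) * (-x))   # collision: x and -x are different inputs, same output
print(math.sqrt(49.0))        # pre-image: recovering an input for 49.0 is easy -> 7.0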
Unfortunately there are no "simple" functions with these properties.
Related questions:
Why are hash functions one way? If I know the algorithm, why can't I calculate the input from it?
Modulo is irreversible. Absolute value is irreversible. Rounding is irreversible.
Raising to the power of 0 is irreversible; everything maps to 1.
Imaginary numbers are good as a computer can only pass the equation if it already knows what to do with it.
Rounding numbers.
Salting "functions" should be reversible. The point of a salt is just to add extra (hard to guess) data to the value you want to hash. This way, attackers have a much harder time reverse engineering hashes with their own guesses.
One common solution is to just prepend/append the salt to the text you're going to hash.
For example, if your hidden value were "password" and your salt were a random number between 0 and 255, the thing actually stored in your database might be md5("123password") together with 123. So it doesn't really make sense for the salting step to be irreversible; the combined value gets hashed anyway, and the hash is what is (kind of) irreversible.
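Here is a minimal Python sketch of that prepend-the-salt idea (the helper names are mine, and MD5 appears only because the thread uses it; a real system should prefer a dedicated password-hashing function such as bcrypt or Argon2):

import hashlib
import secrets

def hash_password(password):
    salt = secrets.token_hex(8)            # random salt, stored in the clear
    digest = hashlib.md5((salt + password).encode()).hexdigest()
    return digest, salt                    # store both, like "md5(123password), 123"

def verify_password(password, digest, salt):
    return hashlib.md5((salt + password).encode()).hexdigest() == digest

stored_digest, stored_salt = hash_password("password")
print(verify_password("password", stored_digest, stored_salt))   # True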
