Amortized complexity of the union operation - math

There are many sources on amortized complexity and on the union-find and quick-find operations, but I couldn't find a proof of the complexity of a single union operation.
So, how can I prove that the amortized complexity of the union operation is O(log n)?
Thanks.
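The question does not say which variant is meant. One setting in which an O(log n) amortized bound per union is commonly stated is quick-find with weighted union, where the smaller set is always relabeled into the larger one. The sketch below assumes that variant; the class name and the relabel counter are hypothetical and exist only to make the argument visible. An element is relabeled only when its set is the smaller one, so the size of its set at least doubles each time, which caps relabels per element at log2(n) and the total work of any sequence of unions at n log2(n).

```python
import math


class WeightedQuickFind:
    """Quick-find with weighted union (assumed variant, for illustration)."""

    def __init__(self, n):
        self.label = list(range(n))                # label[x] = id of x's set
        self.members = {i: [i] for i in range(n)}  # set id -> elements
        self.relabels = [0] * n                    # relabel count per element

    def find(self, x):
        return self.label[x]                       # O(1)

    def union(self, a, b):
        ra, rb = self.label[a], self.label[b]
        if ra == rb:
            return
        # Always relabel the smaller set into the larger one.
        if len(self.members[ra]) < len(self.members[rb]):
            ra, rb = rb, ra
        for x in self.members[rb]:
            self.label[x] = ra
            self.relabels[x] += 1
        self.members[ra].extend(self.members.pop(rb))


# Merge equal-sized sets round by round, the worst case for relabeling.
n = 1024
uf = WeightedQuickFind(n)
step = 1
while step < n:
    for i in range(0, n, 2 * step):
        uf.union(i, i + step)
    step *= 2

max_relabels = max(uf.relabels)
total_relabels = sum(uf.relabels)
bound = int(math.log2(n))
```

Here n - 1 unions are performed and no element is relabeled more than log2(n) times, so the total cost divided by the number of unions stays within O(log n).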

Related

Detecting all cycles in a directed graph with millions of nodes in Ocaml

I have graphs with thousands of nodes to millions of nodes. I want to detect all possible cycles in such graphs.
I use a hash table to store the edges: (source node, edge weight) -> (target node).
What would be an efficient way of implementing this in OCaml?
It looks like Tarjan's algorithm is the best one.
What would be the most efficient implementation of it?
Yes, Tarjan's algorithm for strongly connected components is a good solution. You may also use so-called path-based strong component algorithms which have (when done carefully) comparable linear complexity.
If you pick reasonable data structures, they should work. It's hard to say much more before you have implemented and profiled a prototype.
I don't understand what your graph representation is: are your hash keys really (node, weight) pairs? Then how do you find all neighbors of a given node? For a large graph structure you should optimize access time, of course, but also memory efficiency.
If you really want to find all possible cycles, the problem seems at least exponential in the worst case. For a complete graph, every nonempty subset of nodes gives you a different cycle (including a link from the last node back to the first). Furthermore, every cyclic permutation of every subset gives you a different cycle. Depending on the sparsity of your graphs, the problem could be tractable in practice.
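As a rough illustration of the strongly connected components approach mentioned above, here is a minimal sketch in Python rather than OCaml. It assumes the graph is an adjacency mapping from each node to a list of successors (not the (node, weight) keyed table from the question), and it is written iteratively so that deep graphs do not hit the recursion limit. It reports which nodes share a cycle; it does not enumerate every individual cycle.

```python
def tarjan_scc(graph):
    """Return the strongly connected components of `graph`.

    `graph` is assumed to be a dict: node -> iterable of successor nodes.
    """
    index = {}        # discovery order of each visited node
    low = {}          # smallest index reachable from the node's subtree
    stack = []        # nodes of the components still being built
    on_stack = set()
    components = []
    counter = 0

    for root in graph:
        if root in index:
            continue
        index[root] = low[root] = counter
        counter += 1
        stack.append(root)
        on_stack.add(root)
        work = [(root, iter(graph.get(root, ())))]  # explicit DFS stack
        while work:
            node, successors = work[-1]
            descended = False
            for nxt in successors:
                if nxt not in index:
                    # Tree edge: visit the successor first.
                    index[nxt] = low[nxt] = counter
                    counter += 1
                    stack.append(nxt)
                    on_stack.add(nxt)
                    work.append((nxt, iter(graph.get(nxt, ()))))
                    descended = True
                    break
                if nxt in on_stack:
                    # Edge back into the component under construction.
                    low[node] = min(low[node], index[nxt])
            if descended:
                continue
            work.pop()
            if work:
                parent = work[-1][0]
                low[parent] = min(low[parent], low[node])
            if low[node] == index[node]:
                # `node` is the root of a component: pop it off the stack.
                component = []
                while True:
                    member = stack.pop()
                    on_stack.discard(member)
                    component.append(member)
                    if member == node:
                        break
                components.append(component)
    return components


example = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4], 6: []}
sccs = tarjan_scc(example)
```

Any component with more than one node (or a single node with a self-loop) contains at least one cycle, so cycle enumeration can then be restricted to each such component separately.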

Analyzing goals and choosing a good hash function

This isn’t a specific question with a specific solution; rather, it’s a response to the fact that I can’t find any good Stack Overflow questions about how to choose a good hashing function for hash tables and similar tasks.
So! Let’s talk hash functions, and how to choose one. How should a programming noob, who needs to choose a good hash function for their specific task, go about choosing one? When is the simple and quick Fowler-Noll-Vo appropriate? When should they vendor in MurmurHash3 instead? Do you have any links to good resources on comparing the various options?
The hash function for a hash table should have these two properties:
Uniformity: all outputs of H() should be distributed as evenly as possible. In other words, for a 32-bit hash function the probability of each output should be 1/2^32 (for an n-bit function, 1/2^n). With a uniform hash function, the chance of collision is as low as it can be for any input.
Low computational cost: hash functions for tables are expected to be fast, whereas cryptographic hash functions trade speed for preimage resistance (i.e. it is hard to find a message from a given hash value) and collision resistance.
For hash tables, cryptographic functions are generally a bad choice, because their computational cost is high and hashing here is used for fast access, not for security. MurmurHash is considered one of the fastest uniform functions suitable for big hash tables or hash indexes. For small tables a trivial hash function should be OK. A trivial hash is one that mixes the values of the object's fields (by multiplication, addition and subtraction with some prime).
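To make the "mix with a prime" idea concrete, here is a minimal sketch of 32-bit FNV-1a, the Fowler-Noll-Vo variant that XORs each byte into the state and then multiplies by a prime. The function names and the bucket helper are illustrative; the offset basis and prime are the published 32-bit FNV constants.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5   # 2166136261
FNV32_PRIME = 0x01000193          # 16777619


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF   # keep the state at 32 bits
    return h


def bucket_for(key: str, num_buckets: int) -> int:
    """Hypothetical helper: map a string key to a bucket index."""
    return fnv1a_32(key.encode("utf-8")) % num_buckets
```

A function like this is short enough to write inline, which is part of its appeal for small tables; whether its distribution is good enough for a given key set is something to measure rather than assume.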
If your hash keys are strings (or other variable-length data) you might look at this paper by Ramakrishna and Zobel. They benchmark a few classes of hashing functions (for speed and low collisions) and exhibit a class that is better than the usual Bernstein hashes.

Time complexity of Hash table

I am confused about the time complexity of hash tables. Many articles state that they are "amortized O(1)", not true O(1). What does this mean in real applications? What is the average time complexity of the operations in a hash table in actual implementations, not in theory, and why are the operations not true O(1)?
It's impossible to know in advance how many collisions you will get with your hash function, or when the table will need to be resized. This adds an element of unpredictability to the performance of a hash table, making it not true O(1). However, virtually all hash table implementations offer O(1) on the vast, vast, vast majority of inserts. This is the same as appending to a dynamic array: it's O(1) unless you need to resize, in which case it's O(n), plus the collision uncertainty.
In practice, with a reasonable hash function, long collision chains are rare, and the only situation in which you'd need to worry about these details is when your specific code has a very tight time window in which it must run. For virtually every use case, hash tables are O(1). More impressive than O(1) insertion is O(1) lookup.
For some uses of hash tables, it's impossible to create them with the "right" size in advance, because it is not known how many elements will need to be held simultaneously during the lifetime of the table. If you want to keep access fast, you need to resize the table from time to time as the number of elements grows. This resizing takes linear time with respect to the number of elements already in the table, and is usually done on an insertion, when the number of elements passes a threshold.
These resizing operations can be made infrequent enough that the amortized cost of insertion is still constant (by following a geometric progression for the size of the table, for instance doubling the size each time it is resized). But one insertion from time to time takes O(n) time because it triggers a resize.
In practice, this is not a problem unless you are building hard real-time applications.
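The amortized argument can be checked with a small counting sketch. It does not implement a real hash table; it only models the growth policy, assuming a table that starts with 8 slots (an arbitrary choice) and doubles when full, and it counts how many elements are moved by resizes.

```python
def count_resize_moves(num_inserts, initial_capacity=8, growth=2):
    """Count elements moved by resizes under a geometric growth policy."""
    capacity = initial_capacity
    size = 0
    moves = 0
    resizes = 0
    for _ in range(num_inserts):
        if size == capacity:
            moves += size          # every stored element is rehashed/moved
            capacity *= growth
            resizes += 1
        size += 1
    return moves, resizes


n = 1_000_000
moves, resizes = count_resize_moves(n)
moves_per_insert = moves / n
```

With doubling, the resize sizes form a geometric series, so the total number of moves stays below 2n and the average per insertion is bounded by a constant, even though each individual resize is linear.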
Inserting a value into a hash table takes O(1) time in the average case. The hash function is computed, the bucket is chosen from the hash table, and then the item is inserted. In the worst-case scenario, all of the elements will have hashed to the same value, which means either the entire bucket list must be traversed or, in the case of open addressing, the entire table must be probed until an empty spot is found. Therefore, in the worst case, insertion takes O(n) time.
Reference: http://www.cs.unc.edu/~plaisted/comp550/Neyer%20paper.pdf (Hash Table section)
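The gap between the average and the worst case can be seen with a minimal chained table whose hash function is pluggable. This is an illustrative sketch, not taken from the referenced paper; the class name and the comparison counter are assumptions made for the demonstration.

```python
class ChainedTable:
    """Toy hash table with separate chaining and a pluggable hash function."""

    def __init__(self, num_buckets, hash_fn=hash):
        self.buckets = [[] for _ in range(num_buckets)]
        self.hash_fn = hash_fn

    def insert(self, key, value):
        """Insert or update `key`; return the number of key comparisons made."""
        bucket = self.buckets[self.hash_fn(key) % len(self.buckets)]
        comparisons = 0
        for i, (existing, _) in enumerate(bucket):
            comparisons += 1
            if existing == key:
                bucket[i] = (key, value)
                return comparisons
        bucket.append((key, value))
        return comparisons


keys = range(100)

spread = ChainedTable(128)                       # ints hash to themselves
spread_cost = sum(spread.insert(k, None) for k in keys)

degenerate = ChainedTable(128, hash_fn=lambda key: 0)   # everything collides
degenerate_cost = sum(degenerate.insert(k, None) for k in keys)
```

With well-spread keys each insert finds an empty bucket, while the constant hash forces every insert to walk the whole chain, so the i-th insert costs i comparisons.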

Hash tables v self-balancing search trees

I am curious to know what reasoning could tip the balance toward using a self-balancing tree to store items rather than a hash table.
I see that hash tables cannot maintain insertion order, but I could always use a linked list on top to store the insertion-order sequence.
I see that for a small number of values there is the added cost of the hash function, but I could always save the hash value together with the key for faster lookups.
I understand that hash tables are more difficult to implement than a straightforward red-black tree, but in a practical implementation wouldn't one be willing to go the extra mile?
I see that with hash tables it is normal for collisions to occur, but with open-addressing techniques like double hashing, which allow the keys to be stored in the hash table itself, hasn't the problem been reduced to the point where it no longer tips the balance toward red-black trees?
I am curious whether I am missing a disadvantage of hash tables that still makes red-black trees a viable data structure in practical applications (like filesystems, etc.).
Here is what I can think of:
There are kinds of data which cannot be hashed (or are too expensive to hash), and therefore cannot be stored in hash tables.
Trees keep data in the order you need (sorted), not insertion order. You can't (effectively) do that with a hash table, even if you run a linked list through it.
Trees have better worst-case performance.
Storage allocation is another consideration. Every time you fill all of the buckets in a hash-table, you need to allocate new storage and re-hash everything. This can be avoided if you know the size of the data ahead of time. On the other hand, balanced trees don't suffer from this issue at all.
Just wanted to add :
Balanced binary trees have a predictable fetch time of O(log n), independent of the type of data. That predictability is often important when you need to estimate the response times of your application (hash tables may have unpredictable response times). Remember that for smaller n, as in most common use cases, the difference in performance of an in-memory lookup is hardly going to matter; the bottleneck of the system is going to be elsewhere, and sometimes you just want to make the system much simpler to debug and analyze.
Trees are generally more memory efficient than hash tables and much simpler to implement, without any analysis of the distribution of input keys, possible collisions, etc.
In my humble opinion, self-balancing trees work pretty well as academic topics. And I
do not know of anything that qualifies as a "straightforward implementation of a
red-black tree".
In the real world, the memory wall makes them far less efficient than they are on paper.
With this in mind, hash tables are decent alternatives, especially if you don't implement
them in the academic style (forget about the table size constraint and you magically resolve
the table resize issue and almost all collision issues).
In a word: keep it simple. If it's simple for you, then it's simple for your computer.
I think that if you want to query for a range of keys instead of a single key, a self-balancing tree structure will perform better than a hash table.
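The range-query point can be sketched as follows. Python's standard library has no balanced tree, so a sorted list searched with `bisect` stands in for the ordered structure here; that substitution is an assumption of this example, and the function names are hypothetical. The ordered version locates both ends in O(log n) and touches only the matching keys, while the hashed version has to inspect every key.

```python
import bisect


def range_query_sorted(sorted_keys, low, high):
    """Keys in [low, high] from a sorted list: two binary searches and a slice."""
    start = bisect.bisect_left(sorted_keys, low)
    end = bisect.bisect_right(sorted_keys, high)
    return sorted_keys[start:end]


def range_query_hashed(table, low, high):
    """Keys in [low, high] from a dict: every key must be examined."""
    return sorted(key for key in table if low <= key <= high)


keys = list(range(0, 100, 5))            # 0, 5, 10, ..., 95
table = {key: None for key in keys}

from_sorted = range_query_sorted(keys, 12, 31)
from_hashed = range_query_hashed(table, 12, 31)
```

Both calls return the same keys; the difference is how much of the collection each one has to look at to find them.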
A few reasons I can think of:
Trees are dynamic (their space complexity is N), whereas hash tables are often implemented as fixed-size arrays, which means they are often initialized with size K, where K > N. So even if you only have 1 element in a hashmap, you might still have 100 empty slots that take up memory. Another effect of this is:
Increasing the size of an array-based hash table is costly (O(N) average time, O(N log N) worst case), whereas trees can grow in constant time (O(1)) plus the time to locate the insertion point (O(log N)).
Elements in a tree can be gathered in sorted order (e.g. using an in-order traversal). You therefore often get a sorted list as a free perk with trees.
Trees can have better worst-case performance than a hashmap, depending on how the hashmap is implemented (e.g. a hashmap with chaining has an O(N) worst case, whereas self-balancing trees can guarantee an O(log N) worst case for all operations).
A hashmap can at best match the O(log N) worst case of a self-balancing tree (assuming that the hashmap handles collisions well), but hashmaps can have better average-case performance (often close to O(1)), whereas trees take O(log N) consistently. This is because even though a hashmap can locate the insertion index in O(1), it has to account for hash collisions (more than one element hashing to the same array index), so in the best case its worst-case behavior degrades to that of a self-balancing tree (as in the Java implementation of hashmap). That is, each cell of the hashmap can be implemented as a self-balancing tree that stores all elements hashed to that cell.

Which is faster to find an item in a hashtable or in a sorted list?

Which is faster to find an item in a hashtable or in a sorted list?
Algorithm complexity is a good thing to know, and hashtables are known to be O(1) while a sorted vector (in your case I guess it is better to use a sorted array than a list) will provide O(log n) access time.
But you should know that complexity notation describes how the access time grows as N goes to infinity. That means that if you know that your data will keep growing, complexity notation gives you some hint about which algorithm to choose.
When you know that your data will stay rather small, for instance only a few entries in your array or hashtable, you must take out your stopwatch and measure. So run a test.
For instance, in another problem, sorting an array: for a few entries, bubble sort, although O(N^2), may be quicker than quicksort, which is O(N log N).
Also, as other answers point out, and depending on your items, you must try to find the best hash function for your hashtable instance. Otherwise lookups in your hashtable may perform dramatically badly (as pointed out in Hank Gay's answer).
Edit: Have a look at this article to understand the meaning of Big O notation.
I'm assuming that by 'sorted list' you mean a 'random-accessible, sorted collection'. A list has the property that you can only traverse it element by element, which results in O(N) complexity.
The fastest way to find an element in a sorted indexable collection is by N-ary search, O(log N), while a hashtable without collisions has a find complexity of O(1).
Unless the hashing algorithm is extremely slow (and/or bad), the hashtable will be faster.
UPDATE: As commenters have pointed out, you could also be getting degraded performance from too many collisions not because your hash algorithm is bad but simply because the hashtable isn't big enough. Most library implementations (at least in high-level languages) will automatically grow your hashtable behind the scenes—which will cause slower-than-expected performance on the insert that triggers the growth—but if you're rolling your own, it's definitely something to consider.
The get operation on a SortedList is O(log n), while the same operation on a HashTable is O(1). So, normally, the HashTable would be much faster. But this depends on a number of factors:
The size of the list
Performance of the hashing algorithm
Number of collisions / quality of the hashing algorithm
It depends entirely on the amount of data you have stored.
Assuming you have enough memory to throw at it (so the hash table is big enough), the hash table will locate the target data in a fixed amount of time, but the need to calculate the hash will add some (also fixed) overhead.
Searching a sorted list won't have that hashing overhead, but the time required to do the work of actually locating the target data will increase as the list grows.
So, in general, a sorted list will be faster for small data sets. (For extremely small data sets which are frequently changed and/or infrequently searched, an unsorted list may be even faster, since it avoids the overhead of doing the sort.) As the data set becomes large, the growth of the list's search time overshadows the fixed overhead of hashing, and the hash table becomes faster.
Where that breakpoint is will vary depending on your specific hash table and sorted-list-search implementations. Run tests and benchmark performance on a number of typically-sized data sets to see which will actually perform better in your particular case. (Or, if the code already runs "fast enough", don't. Just use whichever you're more comfortable with and don't worry about optimizing something which doesn't need to be optimized.)
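A minimal harness for that kind of measurement might look like the sketch below. It assumes integer keys, uses `bisect` on a sorted list for the ordered lookup and a `set` for the hashed lookup, and the sizes and repeat count are arbitrary; the resulting numbers depend entirely on the machine and the implementation, so none are quoted here.

```python
import bisect
import timeit


def contains_sorted(sorted_items, target):
    """Binary search in a sorted list."""
    i = bisect.bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target


def contains_hashed(item_set, target):
    """Hash-based membership test."""
    return target in item_set


def compare(sizes=(8, 1024, 131072), repeats=20000):
    """Return {size: (seconds_sorted, seconds_hashed)} for one lookup each."""
    results = {}
    for n in sizes:
        items = list(range(0, 2 * n, 2))     # sorted even numbers
        item_set = set(items)
        target = items[n // 2]
        t_sorted = timeit.timeit(
            lambda: contains_sorted(items, target), number=repeats)
        t_hashed = timeit.timeit(
            lambda: contains_hashed(item_set, target), number=repeats)
        results[n] = (t_sorted, t_hashed)
    return results


if __name__ == "__main__":
    for size, (t_sorted, t_hashed) in compare().items():
        print(f"n={size}: sorted={t_sorted:.4f}s hashed={t_hashed:.4f}s")
```

Replacing the integer keys with the real key type matters, because the cost of hashing and of comparing keys is exactly what moves the breakpoint.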
In some cases, it depends on the size of the collection (and to a lesser degree, implementation details). If your list is very small, 5-10 items maybe, I'd guess the list would be faster. Otherwise xtofl has it right.
A HashTable would be more efficient for a list containing more than 10 items. If the list has fewer than 10 items, the overhead of the hashing algorithm will outweigh the benefit.
If you need a fast dictionary but also need to keep the items in order, use the OrderedDictionary (.NET 2.0 onwards).
