Say we have a hashtable of size m, and at each bucket we store a hashtable of size p.
What would the worst case/average case search complexity be?
I am inclined to say that since computing a hash function is still a constant-time operation, the only worst-case scenario is if the value is at the end of the linked list in the hashtable of size p, so O(n)?
I have no idea how to calculate the average case for this scenario and would appreciate any pointers!
Yes, it's still linear in the worst case, if conflicts are resolved by adding to a linked list. A two-level hashtable with m slots at the first level and p at the second is identical to a single-level hashtable with mp slots. All elements you add could end up at the same second-level slot, all be added to the same linked list, etc. Based on this observation, the average-case complexity is the same as that of a single-level hashtable with mp slots that uses linked lists to resolve collisions.
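To make the equivalence concrete, here is a minimal sketch in Java (the sizes m = 16 and p = 8 and both hash functions are made up for illustration). The pair of hashes (h1, h2) is really just an index into a flat array of m*p chains, so the analysis is identical to a single-level table of that size:

```java
import java.util.LinkedList;
import java.util.List;

// Minimal sketch of the equivalence; m, p and the two hash functions are
// illustrative assumptions, not taken from the original question.
class TwoLevelTable<K, V> {
    static class Entry<K, V> { final K key; V value; Entry(K k, V v) { key = k; value = v; } }

    private final int m = 16, p = 8;             // first- and second-level sizes
    private final List<Entry<K, V>>[] slots;     // m * p chains, flattened

    @SuppressWarnings("unchecked")
    TwoLevelTable() {
        slots = new List[m * p];
        for (int i = 0; i < slots.length; i++) slots[i] = new LinkedList<>();
    }

    // h1 picks one of m first-level buckets, h2 picks one of p second-level
    // slots; together they address exactly one of the m*p chains, the same
    // as a single-level table with m*p slots would.
    private int index(K key) {
        int h1 = Math.floorMod(key.hashCode(), m);
        int h2 = Math.floorMod(key.hashCode() * 31, p);   // a second, crude hash
        return h1 * p + h2;
    }

    V get(K key) {                                // O(1) hashing + chain length
        for (Entry<K, V> e : slots[index(key)])
            if (e.key.equals(key)) return e.value;
        return null;
    }

    void put(K key, V value) {
        List<Entry<K, V>> chain = slots[index(key)];
        for (Entry<K, V> e : chain)
            if (e.key.equals(key)) { e.value = value; return; }
        chain.add(new Entry<>(key, value));       // worst case: everything lands here
    }
}
```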
I am creating a hashtable and inserting n elements into it from an unsorted array. Since I am calling the hash function n times, wouldn't the time complexity to create/populate the hashtable be O(n)?
I tried searching everywhere, but sources only mention complexity in the case of collisions; they don't cover how I could create a hashtable in O(1) even in a perfect case, since I still have to traverse the array in order to pick the elements one by one and put them in the hashtable.
When inserting a new record into a hash table, using a perfect hash function, an unused index entry (used to point to the record) will be found immediately giving O(1). Later, the record will be found immediately when searched for.
Of course, hash functions are seldom perfect. As the hash index starts to become populated, the function will at times require two or more attempts to find an unused index entry for a new record, and every later attempt to search for that record will also require two or more attempts before the correct index entry is found. So the actual search cost of a hash table may end up averaging 1.5 probes or more, with the record most often found on the first attempt while other searches require two or more.
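To put a number on "two or more attempts", here is a rough sketch that inserts random keys with linear probing and then measures the average number of probes per successful search. The table size, the 75% load factor and the trivial modulo hash are arbitrary choices for this illustration, so the printed average is only indicative:

```java
import java.util.Random;

// Rough sketch of measuring average probes per successful search in a
// linearly probed table (size, load factor and hash are arbitrary choices).
public class ProbeCount {
    static final int SIZE = 1 << 16;

    static int home(long key) {                       // the "hash function"
        return (int) Math.floorMod(key, (long) SIZE);
    }

    public static void main(String[] args) {
        long[] table = new long[SIZE];                // 0 marks an empty slot
        Random rnd = new Random(42);
        long[] keys = new long[SIZE * 3 / 4];         // fill to 75% load
        for (int i = 0; i < keys.length; i++) {
            long k;
            do { k = rnd.nextLong(); } while (k == 0);
            keys[i] = k;
            int idx = home(k);
            while (table[idx] != 0) idx = (idx + 1) % SIZE;   // linear probing
            table[idx] = k;
        }

        long probes = 0;
        for (long k : keys) {                         // look every key up again
            int idx = home(k);
            probes++;
            while (table[idx] != k) { idx = (idx + 1) % SIZE; probes++; }
        }
        System.out.printf("average probes per successful search: %.2f%n",
                          (double) probes / keys.length);
    }
}
```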
I guess the trick is to find a hashing algorithm that is "good enough" which means a compromise between an index that isn't too big, an average complexity that is reasonably low and a worst case that is acceptable.
I posted on another search question here and showed how hashing could be used and a good hashing function determined. The question required looking up a 64-bit value in a table containing 16000 records and replacing it with another value from the record. Computing the second value from the first was not possible. My algorithm sported an average search complexity of <1.5 probes with a worst case of 14. The index size was a bit large for my taste, but I think a more exhaustive search could have found a better one. In comparison, binary searches required, at best, about twice as many clock cycles to perform a lookup as the hash function did, probably because their time complexity is greater.
I was curious why deleting a node from a doubly linked list is faster than from a singly linked list. According to my lecture, it takes O(1) for a doubly linked list compared to O(n) for a singly linked one. According to my thought process, I thought they both should be O(n), since you may have to traverse across possibly all the elements, so it depends on the size.
I understand it's going to be associated with the fact that each node has a previous pointer and a next pointer to the next node; I just can't understand how it would be a constant operation in the sense of O(1).
This partially depends on how you're interpreting the setup. Here are two different versions.
Version 1: Let's suppose that you want to delete a linked list node containing a specific value x from a singly or doubly-linked list, but you don't know where in the list it is. In that case, you would have to traverse the list, starting at the beginning, until you found the node to remove. In both a singly- and doubly-linked list, you can then remove it in O(1) time, so the overall runtime is O(n). That said, it's harder to do the remove step in the singly-linked list, since you need to update a pointer in the preceding cell (which isn't pointed at by the cell to remove), so you need to store two pointers as you do this.
Version 2: Now let's suppose you're given a pointer to the cell to remove and need to remove it. In a doubly-linked list, you can do this by using the next and previous pointers to identify the two cells around the cell to remove and then rewiring them to splice the cell out of the list. This takes time O(1). But what about a singly-linked list? To remove this cell from the list, you have to change the next pointer of the cell that appears before the cell to remove so that it no longer points to the cell to remove. Unfortunately, you don't have a pointer to that cell, since the list is only singly-linked. Therefore, you have to start at the beginning of the list, walk downwards across the nodes, and find the node that comes right before the one to remove. This takes time O(n), so the runtime for the remove step is O(n) in the worst case, rather than O(1). (That said, if you know two pointers - the cell you want to delete and the cell right before it, then you can remove the cell in O(1) time since you don't have to scan the list to find the preceding cell.)
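Here is a minimal sketch of Version 2 (class and field names are made up for illustration), showing why the doubly-linked splice is O(1) while the singly-linked removal still has to scan for the predecessor:

```java
// Removing a node you already hold a reference to.
class DoublyLinkedList<T> {
    static class Node<T> {
        T value;
        Node<T> prev, next;
        Node(T v) { value = v; }
    }

    Node<T> head, tail;

    // O(1): splice the node out using its own prev/next pointers.
    void remove(Node<T> node) {
        if (node.prev != null) node.prev.next = node.next; else head = node.next;
        if (node.next != null) node.next.prev = node.prev; else tail = node.prev;
        node.prev = node.next = null;   // help the garbage collector
    }
}

class SinglyLinkedList<T> {
    static class Node<T> {
        T value;
        Node<T> next;
        Node(T v) { value = v; }
    }

    Node<T> head;

    // O(n): we must scan from the head to find the predecessor first.
    void remove(Node<T> node) {
        if (head == node) { head = node.next; return; }
        Node<T> prev = head;
        while (prev != null && prev.next != node) prev = prev.next;
        if (prev != null) prev.next = node.next;
    }
}
```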
In short: if you know the cell to remove in advance, the doubly-linked list lets you remove it in time O(1) while a singly-linked list would require time O(n). If you don't know the cell in advance, then it's O(n) in both cases.
Hope this helps!
The list does not have to be traversed in order to connect the previous node to the following node in a doubly linked list. You simply point:
curr.Prev.Next = curr.Next
curr.Next.Prev = curr.Prev
In a singly linked list, you have to traverse the list to find the previous node. Traversal can be O(n) in an unsorted list.
An alternative approach is to use double pointers, as outlined in this excellent resource: https://github.com/mkirchner/linked-list-good-taste. This means you do not need to keep track of both a current and a previous pointer, as you only use a single pointer-to-pointer that you can modify in place directly. Please let me know if this is inaccurate, as I just learnt this.
This is a homework question, but I think there's something missing from it. It asks:
Provide a sequence of m keys to fill a hash table implemented with linear probing, such that the time to fill it is minimum.
And then
Provide another sequence of m keys, but such that the time to fill it is maximum. Repeat these two questions if the hash table implements quadratic probing.
I can only assume that the hash table has size m, both because it's the only number given and because we have been using that letter to address a hash table size before when describing the load factor. But I can't think of any sequence to do the first without knowing the hash function that hashes the sequence into the table.
If it is a bad hash function, such that, for instance, it hashes every entry to the same index, then both the minimum and maximum time to fill it will take O(n) time, regardless of what the sequence looks like. And in the average case, where I assume the hash function is OK, how am I supposed to know how long it will take for that hash function to fill the table?
Aren't these questions linked to the hash function stronger than they are to the sequence that is hashed?
As for the second question, I can assume that, regardless of the hash function, a sequence of size m with the same key repeated m times will provide the maximum time, because it will cause linear probing from the second entry on. I think that will take O(n) time. Is that correct?
Well, the idea behind these questions is to test your understanding of probing styles. For linear probing, if a collision occurs, you simply test the next cell. And it goes on like this until you find an available cell to store your data.
Your hash table doesn't need to be size m but it needs to be at least size m.
The first question is asking: if you have a perfect hash function, what is the complexity of populating the table? A perfect hashing function addresses each element without collision, so each of the m elements takes O(1) time. The total complexity is O(m).
The second question is asking about the case where hash(X) = cell(0) for every key, so every element has to probe until it reaches the first empty cell (just past the rear of the currently populated run).
For the first element, you probe once -> 1 probe
For the second element, you probe twice -> 2 probes
...
For the m-th element, you probe m times -> m probes
Overall, with m elements, that is 1 + 2 + ... + m = m(m+1)/2 probes, i.e. O(m^2).
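To see that total concretely, here is a small sketch with a deliberately bad hash that sends every key to cell 0 (the table size and the key values are made up for illustration); it counts the probes while filling the table:

```java
// Worst case: every key "hashes" to cell 0, so the i-th insertion probes i cells.
public class WorstCaseFill {
    public static void main(String[] args) {
        int m = 1000;
        Integer[] table = new Integer[m];
        long probes = 0;
        for (int key = 0; key < m; key++) {
            int idx = 0;                       // hash(key) == 0 for every key
            probes++;
            while (table[idx] != null) {       // linear probing: try the next cell
                idx = (idx + 1) % m;
                probes++;
            }
            table[idx] = key;
        }
        System.out.println("total probes: " + probes);
    }
}
```

For m = 1000 this prints 500500 probes, which is exactly m(m+1)/2.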
For quadratic probing, you follow the same strategy. The minimum case is the same, but the maximum case will be around O(m log m). (I didn't solve it; it's just my educated guess.)
This question doesn't sound terribly concerned with the hash function, but it would be nice to have. You seem to pretty much get it, though. It sounds to me like the question is more concerned with "do you know what a worst-case list of keys would be?" than "do you know how to exploit bad hash functions?"
Obviously, if you come up with a sequence where all the entries hash to different locations, then you have O(1) insertions for O(m) time in total.
As for hashing all the keys to the same location: each individual insertion could then take up to O(m), but that is not the total time for inserting all the elements. Also, you might want to consider not literally using the same key over and over but rather using distinct keys that hash to the same location in the table. I think, by convention, inserting the same key should cause a replacement, though I'm not 100% sure.
I'll apologize in advance if I gave too much information or left anything unclear. This question seems pretty cut-and-dried save the part about not actually knowing the hash function, and it was kind of hard to really say much without answering the whole question.
I am curious to know what reasoning could weigh in favor of using a self-balancing tree to store items rather than a hash table.
I see that hash tables cannot maintain the insertion-order, but I could always use a linked list on top to store the insertion-order sequence.
I see that for a small number of values there is the added cost of the hash function, but I could always save the hash together with the key for faster lookups.
I understand that hash tables are more difficult to implement than the straight-forward implementation of a red-black tree, but in a practical implementation wouldn't one be willing to go the extra mile for the trouble?
I see that with hash tables it is normal for collisions to occur, but with open-addressing techniques like double hashing, which store the keys in the hash table itself, hasn't the problem been reduced enough that it no longer tips the balance towards red-black trees?
I am curious whether I am missing a disadvantage of hash tables that still makes red-black trees a viable data structure in practical applications (like filesystems, etc.).
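For what it's worth, Java's LinkedHashMap appears to be exactly the "linked list on top" idea above: a hash table with a linked list threaded through it to preserve insertion order. A quick sketch (with arbitrary sample data) of what I mean:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Quick sketch: LinkedHashMap keeps insertion order during iteration,
// while still giving hash-table lookups (sample keys are arbitrary).
public class InsertionOrderDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new LinkedHashMap<>();
        map.put("banana", 2);
        map.put("apple", 5);
        map.put("cherry", 7);
        // Iterates in insertion order: banana, apple, cherry
        // (a plain HashMap makes no ordering guarantee here).
        for (Map.Entry<String, Integer> e : map.entrySet())
            System.out.println(e.getKey() + " -> " + e.getValue());
    }
}
```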
Here is what I can think of:
There are kinds of data which cannot be hashed (or is too expensive to hash), therefore cannot be stored in hash tables.
Trees keep data in the order you need (sorted), not insertion order. You can't (effectively) do that with a hash table, even if you run a linked list through it.
Trees have better worst-case performance.
Storage allocation is another consideration. Every time you fill all of the buckets in a hash-table, you need to allocate new storage and re-hash everything. This can be avoided if you know the size of the data ahead of time. On the other hand, balanced trees don't suffer from this issue at all.
Just wanted to add:
Balanced binary trees have a predictable fetch time [O(log n)], independent of the type of data. Many times that may be important for estimating the response times of your application. [Hash tables may have unpredictable response times.] Remember that for smaller n, as in most common use cases, the difference in performance of an in-memory lookup is hardly going to matter; the bottleneck of the system is going to be elsewhere, and sometimes you just want to make the system simpler to debug and analyze.
Trees are generally more memory-efficient than hash tables and much simpler to implement without any analysis of the distribution of input keys, possible collisions, etc.
In my humble opinion, self-balancing trees work pretty well as academic topics, and I do not know of anything that qualifies as a "straight-forward implementation of a red-black tree".
In the real world, the memory wall makes them far less efficient than they are on paper.
With this in mind, hash tables are decent alternatives, especially if you don't practice them the academic style (forget about the table size constraint and you magically resolve the table resize issue and almost all collision issues).
In a word: keep it simple. If it's simple for you, then it's simple for your computer.
I think if you want to query for a range of keys instead of a single key, a self-balancing tree structure will perform better than a hash table structure.
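For example, here is a hypothetical sketch using Java's TreeMap (a red-black tree under the hood) with made-up sample data: the range query falls naturally out of the sorted structure, while a plain HashMap can only answer it by scanning every entry.

```java
import java.util.HashMap;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of a range query on a sorted tree map (the sample data is made up).
public class RangeQueryDemo {
    public static void main(String[] args) {
        NavigableMap<Integer, String> byId = new TreeMap<>();
        byId.put(10, "a"); byId.put(25, "b"); byId.put(42, "c"); byId.put(77, "d");

        // All entries with 20 <= key <= 50, in key order: {25=b, 42=c}
        System.out.println(byId.subMap(20, true, 50, true));

        // With a HashMap you would have to scan every entry to answer the
        // same question, since there is no order to exploit.
        HashMap<Integer, String> unordered = new HashMap<>(byId);
        unordered.forEach((k, v) -> { if (k >= 20 && k <= 50) System.out.println(k + "=" + v); });
    }
}
```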
A few reasons I can think of:
Trees are dynamic (the space complexity is N), whereas hash tables are often implemented as fixed-size arrays, which means they will often be initialized with size K, where K > N; so even if you only have 1 element in a hashmap, you might still have 100 empty slots taking up memory. Another effect of this is:
Increasing the size of an array-based hash table is costly (O(N), since every element has to be rehashed into the new array), whereas a tree grows one node at a time: O(log N) to locate the insertion point plus O(1) to link the new node.
Elements in a tree can be gathered in sorted order (using e.g. an in-order traversal). Thereby you often get a sorted list as a free perk with trees (see the sketch after this list).
Trees can have better worst-case performance than a hashmap, depending on how the hashmap is implemented (e.g. a hashmap with chaining has an O(N) worst case, whereas self-balancing trees can guarantee an O(log N) worst case for all operations).
Both self-balancing trees and hashmaps have a worst-case efficiency of O(log N) in the best worst case (assuming the hashmap handles collisions that way), but hashmaps can have better average-case performance (often close to O(1)), whereas trees are always O(log N). This is because even though a hashmap can locate the insertion index in O(1), it has to account for hash collisions (more than one element hashing to the same array index), and thus at best degrades to a self-balancing tree per bucket (such as in the Java implementation of HashMap); that is, each bucket in the hashmap can be implemented as a self-balancing tree storing all elements that hash to that array cell.
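As a small illustration of the "sorted for free" point above (the sample data is arbitrary), iterating a TreeMap visits keys in sorted order, while a HashMap makes no ordering guarantee:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch: copying a HashMap into a TreeMap yields sorted-by-key iteration.
public class SortedIterationDemo {
    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("pear", 3); counts.put("apple", 1); counts.put("mango", 9);

        System.out.println(counts);                   // some hash-dependent order

        // Copying into a TreeMap re-sorts by key: {apple=1, mango=9, pear=3}
        System.out.println(new TreeMap<>(counts));
    }
}
```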
Which is faster: finding an item in a hashtable or in a sorted list?
Algorithm complexity is a good thing to know, and hashtables are known to be O(1) while a sorted vector (in your case I guess it is better to use a sorted array than a list) will provide O(log n) access time.
But you should know that complexity notation gives you the access time as N goes to infinity. That means that if you know your data will keep growing, complexity notation gives you some hint about which algorithm to choose.
When you know that your data will stay rather small, for instance only a few entries in your array/hashtable, you must go with your watch and measure. So run a test.
For instance, consider another problem: sorting an array. For a few entries, bubble sort, although O(N^2), may be quicker than quicksort, which is O(N log N).
Also, according to other answers, and depending on your items, you must try to find the best hash function for your hashtable instance. Otherwise it may lead to dramatically bad lookup performance in your hashtable (as pointed out in Hank Gay's answer).
Edit: Have a look at this article to understand the meaning of Big O notation.
Assuming that by 'sorted list' you mean a 'random-accessible, sorted collection': a plain list has the property that you can only traverse it element by element, which results in O(N) complexity.
The fastest way to find an element in a sorted indexable collection is binary (or more generally N-ary) search, O(log N), while a hashtable without collisions has a find complexity of O(1).
Unless the hashing algorithm is extremely slow (and/or bad), the hashtable will be faster.
UPDATE: As commenters have pointed out, you could also be getting degraded performance from too many collisions not because your hash algorithm is bad but simply because the hashtable isn't big enough. Most library implementations (at least in high-level languages) will automatically grow your hashtable behind the scenes—which will cause slower-than-expected performance on the insert that triggers the growth—but if you're rolling your own, it's definitely something to consider.
The get operation in a SortedList is O(log n), while the same operation in a HashTable is O(1). So, normally, the HashTable would be much faster. But this depends on a number of factors:
The size of the list
Performance of the hashing algorithm
Number of collisions / quality of the hashing algorithm
It depends entirely on the amount of data you have stored.
Assuming you have enough memory to throw at it (so the hash table is big enough), the hash table will locate the target data in a fixed amount of time, but the need to calculate the hash will add some (also fixed) overhead.
Searching a sorted list won't have that hashing overhead, but the time required to do the work of actually locating the target data will increase as the list grows.
So a sorted list will generally be faster for small data sets. (For extremely small data sets which are frequently changed and/or infrequently searched, an unsorted list may be even faster, since it avoids the overhead of doing the sort.) As the data set grows large, the growth of the list's search time overshadows the fixed overhead of hashing, and the hash table becomes faster.
Where that breakpoint is will vary depending on your specific hash table and sorted-list-search implementations. Run tests and benchmark performance on a number of typically-sized data sets to see which will actually perform better in your particular case. (Or, if the code already runs "fast enough", don't. Just use whichever you're more comfortable with and don't worry about optimizing something which doesn't need to be optimized.)
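As a starting point for such a test, here is a very rough sketch comparing Arrays.binarySearch on a sorted array with HashMap lookups. The sizes, key distribution and naive timing are all arbitrary choices, and a real comparison should use a proper harness such as JMH and account for JIT warm-up and cache effects:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Very rough lookup comparison; treat the printed times only as a hint.
public class LookupBench {
    public static void main(String[] args) {
        int n = 1_000_000;
        long[] sorted = new long[n];
        Map<Long, Long> table = new HashMap<>(2 * n);
        Random rnd = new Random(1);
        for (int i = 0; i < n; i++) sorted[i] = rnd.nextLong();
        Arrays.sort(sorted);
        for (long k : sorted) table.put(k, k);

        long t0 = System.nanoTime();
        long hits = 0;
        for (int i = 0; i < n; i++) hits += Arrays.binarySearch(sorted, sorted[i]) >= 0 ? 1 : 0;
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) hits += table.containsKey(sorted[i]) ? 1 : 0;
        long t2 = System.nanoTime();

        System.out.printf("binary search: %d ms, hash map: %d ms (hits=%d)%n",
                          (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, hits);
    }
}
```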
In some cases, it depends on the size of the collection (and to a lesser degree, implementation details). If your list is very small, 5-10 items maybe, I'd guess the list would be faster. Otherwise xtofl has it right.
A HashTable would be more efficient for a list containing more than 10 items. If the list has fewer than 10 items, the overhead of the hashing algorithm will dominate.
In case you need a fast dictionary but also need to keep the items in an ordered fashion, use the OrderedDictionary (.NET 2.0 onwards).