Hash tables: Open addressing and removing elements

I'm trying to understand open addressing in hash tables, but there is one question that isn't answered in my literature. It concerns the deletion of elements from such a hash table when quadratic probing is used. The removed element is replaced by a sentinel element; the get() operation then knows that it has to probe further, and the add() method would overwrite the first sentinel it finds. But what happens if I want to add an element with a key that is already in the hash table but sits behind a sentinel on its probing path? Instead of overwriting the value of the existing entry with the same key, the add() method would overwrite the sentinel, and then we have multiple elements with the same key in the hash table. I see that as a problem, since it costs memory and since removing the element would merely remove the first of them, so the element could still be found in the table (i.e. it is not really removed).
So it seems that it is necessary to search the whole probing path for the key of the element one wants to insert before replacing a sentinel element. Am I overlooking something? How is this problem handled in practice?

But what happens if I want to add an element with a key that is already in the hash table but behind a sentinel in a probing path? Instead of overwriting the value of the instance with the same key which is already in the table, the add() method would overwrite the sentinel.
add() has to check every element after the sentinel(s) on the probing path until it finds an empty element, as you pointed out later. If it does not find the key anywhere on the probing path and there are sentinel elements on it, it can use the first sentinel slot to store the new element.
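A minimal Java sketch of that put() logic, assuming linear probing (as in the snippet referenced below) and a shared TOMBSTONE marker; the names are invented for illustration, and resizing/load-factor checks are omitted:

// Minimal open-addressing map sketch; names and structure are illustrative.
class OpenAddressingMap {
    static final Object TOMBSTONE = new Object(); // marks removed slots

    Object[] keys = new Object[16];   // null = FREE (never used)
    Object[] values = new Object[16];

    void put(Object key, Object value) {
        int firstTombstone = -1;
        int i = (key.hashCode() & 0x7fffffff) % keys.length;
        while (keys[i] != null) {             // probe until a FREE slot
            if (keys[i] == TOMBSTONE) {
                if (firstTombstone == -1) firstTombstone = i; // remember, keep probing
            } else if (keys[i].equals(key)) {
                values[i] = value;            // key already present: overwrite in place
                return;
            }
            i = (i + 1) % keys.length;        // linear probing
        }
        // Key is definitely absent: reuse the first tombstone we passed, if any.
        if (firstTombstone != -1) i = firstTombstone;
        keys[i] = key;
        values[i] = value;
    }
}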
There is a hash table implementation on http://www.algolist.net/Data_structures/Hash_table/Open_addressing (HashMap.java). Its put() method does exactly this. (The collision resolution is linear probing in the referenced snippet but I don't think it's an important difference from the point of view of the algorithm.)
After a lot of remove operations there could be too many sentinel elements in the table. A solution for this would be to rebuild the hash table (i.e. rehash everything) occasionally (based on the number of items and the number of sentinel elements). This operation would eliminate the sentinel elements.
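Continuing the hypothetical sketch above, such a rebuild can simply re-insert the live entries into fresh arrays, which eliminates every sentinel:

// Rebuild drops every tombstone: only live entries are re-inserted.
void rehash() {
    Object[] oldKeys = keys, oldValues = values;
    keys = new Object[oldKeys.length];    // same capacity here; could also grow
    values = new Object[oldKeys.length];
    for (int i = 0; i < oldKeys.length; i++) {
        if (oldKeys[i] != null && oldKeys[i] != TOMBSTONE) {
            put(oldKeys[i], oldValues[i]);
        }
    }
}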
Another approach is eliminating the sentinel (DELETED) element from the probing path when you remove an element, by shifting displaced entries of the cluster back into the freed slot. Practically, you don't have sentinel elements in the table in this case; there are only FREE and OCCUPIED slots. This works naturally with linear probing, but it can be expensive and it does not carry over directly to quadratic probing.
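For linear probing this is the classic backward-shift deletion. A sketch in the same style as above; the reachability test checks whether an entry's home slot lies cyclically between the hole and the entry:

// Tombstone-free removal for linear probing: shift later entries of the
// cluster back so no probe path is broken. i is the slot being emptied.
void removeWithShift(int i) {
    int j = i;
    while (true) {
        keys[i] = null;                    // open the hole
        values[i] = null;
        while (true) {
            j = (j + 1) % keys.length;
            if (keys[j] == null) return;   // reached the end of the cluster
            int home = (keys[j].hashCode() & 0x7fffffff) % keys.length;
            // If the entry's home slot lies cyclically in (i, j], the hole at i
            // does not break its probe path, so leave it where it is.
            boolean stillReachable = (i <= j) ? (home > i && home <= j)
                                              : (home > i || home <= j);
            if (!stillReachable) break;    // this entry would become unfindable
        }
        keys[i] = keys[j];                 // pull it back into the hole...
        values[i] = values[j];
        i = j;                             // ...and repeat for the slot it left
    }
}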
So it seems that it is necessary to search the whole probing path for the key of the element one wants to insert before replacing a sentinel element.
Yes, it is. You have to keep probing until you find a truly empty (never-used) element.
How is this problem handled in practice?
I don't know too much about real-life hash table implementations; I suppose plenty are available in open source projects. I've just checked the Hashtable and HashMap classes in Java: both use chaining instead of open addressing.

Sorry for the late answer, but Java has an example of a hash table with open addressing: java.util.IdentityHashMap.
Also, you can use GNU Trove Project. Its maps are all open addressing hash tables, as explained on its overview page.

Related

How to have valid references to objects owned by containers that dynamically move their objects?

If you have a pointer or reference to an object contained in a container, say a dynamic array or a hash table/map, there's the problem that the objects don't permanently stay there, and so it seems any references to these objects become invalid before too long. For example a dynamic array might need to reallocate, and a hash table might have to rehash, changing the positions of the buckets in the array.
In languages like Java (and I think C#), and probably most languages, this may not be a problem. In these languages many things are references instead of the object itself. When you take a reference to the 3rd element of a dynamic array, you basically create a new reference by copying the reference to the object, which lives somewhere else; that reference stays valid even if the array reallocates.
But in, say, C++, where a dynamic array or hash table will actually store the objects directly in memory owned by the container, what are you supposed to do? There's only one place where the object I create can live. I can create the object by allocating it somewhere, and then store pointers to that object in a dynamic array, a hash table, or any other container. However, if I decide to have the container be the owner of those objects, I run into problems with having a pointer or reference to those objects.
In the case of a dynamic array like std::vector you can reference an object in the array with an index instead of a memory address. If the array is reallocated, the index is still valid. However, I run into the same problem if I remove an element from the array: the index is potentially no longer valid.
In the case of something like a hash table, the table might dynamically rehash, changing the position of all the values in the buckets. Is the only way of having references to hash table values to just search for or hash the key every time you want to access it?
What are the ways of having references to objects that live in containers like these, or any others?
There aren't any magic or generally used solutions to this; you have to make tradeoffs. If you are optimizing things at this low level, one good approach might be to use a container class that informs you when it does a reallocation. It'd be interesting to find out if there is any container library with this property.
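As a hypothetical illustration of that idea, here is a minimal Java sketch of a growable array that fires a callback whenever its backing storage moves; the class and the listener interface are invented for this example, not taken from any library:

import java.util.ArrayList;
import java.util.List;

// Hypothetical growable array that announces reallocations to listeners,
// so handle-holders can refresh whatever raw positions they cached.
class NotifyingArray<T> {
    interface RelocationListener {
        void onReallocate();   // invoked whenever the backing array is replaced
    }

    private Object[] data = new Object[4];
    private int size = 0;
    private final List<RelocationListener> listeners = new ArrayList<>();

    void addListener(RelocationListener l) { listeners.add(l); }

    void add(T element) {
        if (size == data.length) {
            Object[] bigger = new Object[data.length * 2];
            System.arraycopy(data, 0, bigger, 0, size);
            data = bigger;
            listeners.forEach(RelocationListener::onReallocate); // storage moved
        }
        data[size++] = element;
    }

    @SuppressWarnings("unchecked")
    T get(int index) { return (T) data[index]; }
}

In Java itself this matters little, since elements are references anyway; the pattern is aimed at value-storing containers like the C++ ones discussed above.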

Best way to update an entire document in MarkLogic

I would like to replace an XML document in a database without losing any metadata (e.g. permissions, properties or collections). Managed documents (DLS) are not an option.
Using xdmp:document-insert() does not retain permissions, collections etc.
Using xdmp:node-replace() works well with parts of the document but requires knowing the root node in advance.
Is there a recommended way to update an entire document in MarkLogic?
You don't really need to know the root element itself. If you know the document URI, you can do something like:
xdmp:node-replace(fn:doc($uri)/*, $new-xml)
If you have any node of the document, you can also do:
xdmp:node-replace($node/fn:root(), $new-xml)
But just using xdmp:document-insert() isn't that much more difficult either:
xdmp:document-insert(
  $uri,
  $new-xml,
  xdmp:document-get-permissions($uri),
  xdmp:document-get-collections($uri),
  xdmp:document-get-quality($uri)
)
Note: document properties are preserved by xdmp:document-insert. See also: http://docs.marklogic.com/xdmp:document-insert
Additionally, there is not much performance difference between these methods. The biggest difference in that respect is that xdmp:node-replace() requires a node from the original document, meaning it has to be retrieved from the database first. If the replacement does not depend on the original document, then xdmp:document-insert() would be fastest.
HTH!
+1 to @grtjn. The reason xdmp:node-replace is no more efficient than xdmp:document-insert is that all document updates rewrite the entire document. It is a common but understandable misconception that xdmp:node-replace operates like an RDBMS field update, only 'touching' the affected field; in the RDBMS case that is often a mistaken assumption as well.
Similar to not needing to read the old document body: if you know what the permissions, collections, and quality should be, you can supply those (or defaults) rather than querying them with xdmp:document-get-permissions() etc. It may not make a measurable difference, but as with xdmp:node-replace(), if you don't need to query a value it's simpler not to, and it removes unneeded dependencies and error opportunities (such as: what if the document doesn't exist?).

Hash tables: What's the best way to mark a bucket empty?

Ran into an annoying problem - I need some way to tell whether the bucket I'm trying to fill is empty or not (the buckets are stored as an array of value-type structs for key-value pairs).
If I were to reserve a key value for marking things empty, then that would just mean that some data unfortunate enough to stumble on that key value could never be stored.
On the other hand, including a boolean in the KVP struct would increase the size of the struct from 16 to 24 bytes (such a waste, and I'm tight on memory as it is). Has anybody figured out a good solution for this?
This is a problem that is as intrinsic to hash tables as collisions. A related problem is dealing with deleting from a hash table, again, in the context of collisions. There's no solution that doesn't involve some compromise in performance, so it's pretty common to see hash table implementations that have a particular key that is illegal.
By far the most direct solution is to just special-case the key value that you're using to mean empty. That is, if the user is trying to store a key value 0, you just put it in a special array you keep around for that purpose.
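A minimal sketch of that special-casing in Java, assuming the reserved empty-key value is 0 and a power-of-two table; all names are illustrative, and resizing is omitted:

// Open-addressing table of long->long pairs where key 0 means "empty slot".
// The one legitimate entry with key 0 lives out-of-band in two extra fields.
class LongLongMap {
    static final long EMPTY_KEY = 0L;

    long[] keys = new long[16];    // keys[i] == EMPTY_KEY means slot i is free
    long[] values = new long[16];

    boolean hasZeroKey;            // special-cased entry for the reserved key
    long zeroValue;

    void put(long key, long value) {
        if (key == EMPTY_KEY) {    // key 0 can't go into the table itself
            hasZeroKey = true;
            zeroValue = value;
            return;
        }
        int i = (int) (Long.hashCode(key) & (keys.length - 1));
        while (keys[i] != EMPTY_KEY && keys[i] != key) {
            i = (i + 1) & (keys.length - 1); // linear probing
        }
        keys[i] = key;
        values[i] = value;
    }
}

The two extra fields cost a constant amount of memory instead of a boolean per bucket.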
Really lame hash tables that only work with pointers don't usually have this issue, since you can always find a pointer value which the caller can't pass in (such as a pointer to an object you own). Obviously hash tables using linked lists or array elements don't have this problem either, but then, there's a massive performance penalty for those.
You could probably find some clever way to encode it inside the table itself, by using multiple elements. The only way this would be better is if it's somehow unified with deleted-element handling or something else, so it would be free or faster than checking some separate list.

Storing doubly linked lists in Riak without a race condition?

We want to use Riak's Links to create a doubly linked list.
The algorithm for it is quite simple, I believe:
Let 'N0' be the new element to insert
Get the head of the list, including its 'next' link (N1)
Set the 'previous' of N1 to be N0.
Set the 'next' of N0 to be N1
Set the 'next' of the head of the list to be N0.
The problem that we have is that there is an obvious race condition here, because if 2 concurrent clients get the head of the list, one of the items will likely be 'lost'. Any way to avoid that?
In terms of the CAP theorem, Riak is an eventually consistent system.
Provided you set the bucket property allow_mult=true, if two concurrent clients get the head of the list and then write, you will have sibling records. On your next read you'll receive multiple values (siblings) and will then have to resolve the conflict and write the result. Given that we don't have any sort of atomicity, this will possibly lead to additional conflicts under heavy write concurrency as you attempt to update the linked objects. Not impossible to resolve, but definitely tricky.
You're probably better off simply serializing the entire list into a single object. This makes your conflict resolution much simpler.
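As a sketch of how simple that resolution can get when the whole list is one object: siblings become two versions of the list, and resolving them is a pure merge function. The union-with-order policy below is an assumption for illustration, not something Riak prescribes:

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sibling resolution when the entire list is serialized into one Riak object.
// Policy (an assumption): union of both versions, preserving first-seen order.
class ListMerge {
    static List<String> resolveSiblings(List<String> siblingA, List<String> siblingB) {
        Set<String> merged = new LinkedHashSet<>(siblingA); // keeps A's order
        merged.addAll(siblingB);                            // append what A lacked
        return new ArrayList<>(merged);
    }
}

Note that a plain union cannot express deletions: an element removed in one sibling but still present in the other will reappear after the merge, so a real policy may need per-entry timestamps or tombstones.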

LinkedHashMap's impl: why a doubly linked list, not a singly linked list?

The documentation of LinkedHashMap says that a doubly linked list (DLL) is maintained internally.
I was trying to understand why a DLL was chosen over a singly linked list (SLL).
The biggest advantage I get with a DLL would be traversing backwards, but I don't see any use case where LinkedHashMap exploits this advantage, since there is no previous() sort of operation like next() on the Iterator interface.
Can anyone explain why a DLL was chosen, and not an SLL?
It's because with an additional hash map, you can implement deletes in O(1). If you are using a singly linked list, delete would take O(n).
Consider a hash map for storing key-value pairs, and another internal hash map whose keys point to nodes in the linked list. When deleting, if it's a doubly linked list, I can easily get to the previous element and make it point to the following element. This is not possible in O(1) with a singly linked list; you would have to walk the list from the head to find the predecessor.
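The unlink itself, given the node reference that the hash map hands you, is just a couple of pointer updates. A minimal Java sketch (illustrative only, not the actual LinkedHashMap source):

// O(1) unlink of a node from a doubly linked list, given the node itself.
class Node<K, V> {
    K key;
    V value;
    Node<K, V> prev, next;
}

class Unlink {
    static <K, V> void unlink(Node<K, V> node) {
        if (node.prev != null) node.prev.next = node.next; // bypass forwards
        if (node.next != null) node.next.prev = node.prev; // bypass backwards
        node.prev = node.next = null;
        // A singly linked node has no prev pointer, so finding the predecessor
        // would require an O(n) walk from the head of the list.
    }
}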
http://www.quora.com/Java-programming-language/Why-is-a-Java-LinkedHashMap-or-LinkedHashSet-backed-by-a-doubly-linked-list
