Hash tables: What's the best way to mark a bucket empty?

Ran into an annoying problem - I need some way to tell if the bucket I'm trying to fill is empty or not (the buckets are stored as an array of value type structs for key-value pairs).
If I were to reserve a key value for marking things empty then that would just mean that some data unfortunate enough to stumble on that hash value would never be accessible.
On the other hand, including a boolean in the KVP struct would increase the size of the struct from 16 to 24 bytes (such a waste, and I'm tight on memory as it is). Has anybody figured out a good solution for this?

This is a problem that is as intrinsic to hash tables as collisions. A related problem is dealing with deleting from a hash table, again, in the context of collisions. There's no solution that doesn't involve some compromise in performance, so it's pretty common to see hash table implementations that have a particular key that is illegal.
By far the most direct solution is to just special-case the key value that you're using to mean empty. That is, if the user is trying to store a key value 0, you just put it in a special array you keep around for that purpose.
Really lame hash tables that only work with pointers don't usually have this issue, since you can always find a pointer value which the caller can't pass in (such as a pointer to an object you own). Obviously hash tables using linked lists or array elements don't have this problem either, but then, there's a massive performance penalty for those.
You could probably find some clever way to encode it inside the table itself, by using multiple elements. The only way this would be better is if it's somehow unified with deleted element handling or something else, so it would be free or faster than checking some separate list.
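A minimal sketch of the special-casing idea, written in Java. The question does not name a language, so that choice is an assumption, as are the class name SentinelLongMap, the use of 0 as the reserved key, and linear probing. Keys and values sit in parallel long arrays (16 bytes per pair), and the one real key that equals the marker is kept in a single side slot instead of a whole special array. There is no resizing or deletion here.

```java
// Open-addressing map from long keys to long values.
// Key 0 marks an empty bucket; a real key of 0 lives in a dedicated side slot.
class SentinelLongMap {
    private static final long EMPTY = 0L;  // reserved marker (an assumption of this sketch)
    private final long[] keys;             // keys[i] == EMPTY means bucket i is free
    private final long[] values;
    private int size;                      // number of occupied buckets
    private boolean hasZeroKey;            // side slot for the key that equals EMPTY
    private long zeroValue;

    SentinelLongMap(int capacity) {
        keys = new long[capacity];         // Java zero-fills arrays, so every bucket starts empty
        values = new long[capacity];
    }

    void put(long key, long value) {
        if (key == EMPTY) {                // special case: never touches the bucket array
            hasZeroKey = true;
            zeroValue = value;
            return;
        }
        int i = indexFor(key);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) % keys.length;     // linear probing
        }
        if (keys[i] == EMPTY) {
            // Always keep one bucket empty so that probe loops terminate.
            if (size + 1 >= keys.length) throw new IllegalStateException("table full");
            size++;
        }
        keys[i] = key;
        values[i] = value;
    }

    // Returns the stored value, or null if the key is absent.
    Long get(long key) {
        if (key == EMPTY) return hasZeroKey ? Long.valueOf(zeroValue) : null;
        int i = indexFor(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) return values[i];
            i = (i + 1) % keys.length;
        }
        return null;
    }

    private int indexFor(long key) {
        return Math.floorMod(Long.hashCode(key), keys.length);
    }
}
```

The extra cost is one boolean and one value for the whole table rather than one flag per bucket, at the price of an extra branch on every put and get.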

Related

How to have valid references to objects owned by containers that dynamically move their objects?

If you have a pointer or reference to an object contained in a container, say a dynamic array or a hash table/map, there's the problem that the objects don't permanently stay there, and so it seems any references to these objects become invalid before too long. For example a dynamic array might need to reallocate, and a hash table might have to rehash, changing the positions of the buckets in the array.
In languages like Java (and I think C#), and probably most languages, this may not be a problem. In these languages many things are references instead of the object itself. When you create a reference to the 3rd element of a dynamic array, you basically create a new reference by copying the reference to the object, which lives somewhere else.
But in say C++ where a dynamic array or hash table will actually store the objects directly in its memory owned by the container what are you supposed to do? There's only one place where the object I create can live. I can create the object by allocating it somewhere, and then store pointers to that object in a dynamic array or a hash table, or any other container. However if I decide to have the container be the owner of those objects I run into problems with having a pointer or reference to those objects.
In the case of a dynamic array like an std::vector you can reference an object in the array with an index instead of a memory address. If the array is reallocated the index is still valid. However, I run into the same problem if I remove an element from the array: the index is then potentially no longer valid.
In the case of something like a hash table, the table might dynamically rehash, changing the position of all the values in the buckets. Is the only way of having references to hash table values to just search for or hash the key every time you want to access it?
What are the ways of having references to objects that live in containers like these, or any others?
There aren't any magic or generally used solutions to this. You have to make tradeoffs. If you are optimizing things at this low level, one good approach might be to use a container class that informs you when it does a reallocation. It'd be interesting to find out if there is any container library with this property.
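To illustrate the index tradeoff described in the question, here is a small sketch. It uses Java's ArrayList as a stand-in for a dynamic array (the question is about C++, so the language and the names IndexHandleDemo and run are assumptions). The index survives reallocation of the backing storage, but silently points at a different element once an earlier element is removed.

```java
import java.util.ArrayList;
import java.util.List;

class IndexHandleDemo {
    // Keeps an index as a "handle", then grows and shrinks the list.
    // Returns what the handle referred to at each stage.
    static String[] run() {
        List<String> items = new ArrayList<>(List.of("a", "b", "c", "d"));
        int handle = 2;                        // refers to "c" by position, not by address
        String before = items.get(handle);

        for (int i = 0; i < 1000; i++) {       // growth forces the backing array to be reallocated
            items.add("filler" + i);
        }
        String afterGrowth = items.get(handle);

        items.remove(0);                       // every later element shifts left by one
        String afterRemoval = items.get(handle);

        return new String[] { before, afterGrowth, afterRemoval };
    }
}
```

After the removal the handle no longer refers to "c", which is exactly the failure mode the question worries about.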

Why is HashSet good for search operations?

The underlying data structure of HashSet is a hash table. How does it identify duplicates, and why is it a good choice when our most frequent operation is search?
It uses the hash code of the object, which is a quickly computed integer. A good hash code is distributed as evenly as possible over all potential object values.
As a result, the set can distribute the inserted values into an array (the hash table) with a very low probability of conflict. The search operation is then quite quick: get the hash code, access the array slot, compare, and get the value, usually in constant time. The same steps are used for finding duplicates.
Hash code conflicts are handled as well: several values can end up in the same entry of the hash table, and that is where equals comes into play to tell them apart. Such conflicts are rather rare, so they don't affect average performance significantly.
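A short sketch of how this looks from the caller's side. The GridPoint class and the HashSetDemo wrapper are made up for illustration; HashSet.add and HashSet.contains are the standard java.util methods. Duplicate detection only works because hashCode and equals agree with each other.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical value class: two points are equal when both coordinates match.
final class GridPoint {
    final int x;
    final int y;

    GridPoint(int x, int y) { this.x = x; this.y = y; }

    @Override public int hashCode() { return Objects.hash(x, y); }   // picks the bucket

    @Override public boolean equals(Object o) {                      // confirms the match
        if (!(o instanceof GridPoint)) return false;
        GridPoint p = (GridPoint) o;
        return x == p.x && y == p.y;
    }
}

class HashSetDemo {
    // Returns { first add, duplicate add, lookup of present value, lookup of absent value }.
    static boolean[] run() {
        Set<GridPoint> seen = new HashSet<>();
        boolean first = seen.add(new GridPoint(1, 2));        // new element
        boolean duplicate = seen.add(new GridPoint(1, 2));    // equal element already present
        boolean found = seen.contains(new GridPoint(1, 2));
        boolean missing = seen.contains(new GridPoint(2, 1));
        return new boolean[] { first, duplicate, found, missing };
    }
}
```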

Unity 3d how heavy is a dictionary?

Everyone was telling me that a List is heavy on performance, so I was wondering is it the same with a dictionary? Because a dictionary doesn't have a fixed size. Is there also a dictionary with a fixed size, just like a normal array?
Thanks in advance!
A list can be heavy on performance, but it depends on your use case.
If your use case is the indexing of a very large data set, in which you plan to search for elements during runtime, then a Dictionary will behave with O(1) Time Complexity for retrievals (which is great!).
If you plan to insert/remove a little bit of data here and there at runtime then that's okay. But, if you plan to do constant insertions at runtime then you will be taking a hit on performance due to the hashing and collision handling functions.
If your use case requires a lot of insertions, removals, and iteration through consecutive data, then a list would be fast. But if you are planning to search constantly at runtime, then a list could take a hit performance-wise.
Regarding the Dictionary and size:
If you know the size/general range of your data set then you could technically account for that and initialize accordingly. Or you could write your own Dictionary and Hash Table implementation.
In all:
Each data structure has its advantages and disadvantages. So think about what you plan to do with the data at runtime, then pick accordingly.
Also, keeping a data structure time and space complexity table is always handy :P
This depends on your needs.
If you just add items to a List and then iterate over them sequentially, a List is a good choice.
If you have a key for every item and need fast random access by key - use Dictionary.
In both cases you can specify the initial size of the collection to reduce memory allocation.
If you have a varying number of items in the collection, you'll want to use a list vs recreating an array with the new number of items in the collection.
With a dictionary, it's a little easier to get to specific items in the collection, given you have a key and just need to look it up, so performance is a little better when getting an item from the collection.
List and Dictionary are part of the System.Collections.Generic namespace, and both are mutable types. There is a System.Collections.Immutable namespace, but it's not yet supported in Unity.
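The thread is about C# in Unity; the sketch below uses the analogous Java collections (ArrayList and HashMap) purely as an illustration, and the names ListVsMapDemo, findInList and indexByName are made up. It contrasts a linear search through a list with a keyed lookup, and passes a capacity hint so the map does not have to grow while it is being filled.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ListVsMapDemo {
    // Linear search: cost grows with the number of items.
    static int findInList(List<String> names, String wanted) {
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i).equals(wanted)) return i;
        }
        return -1;
    }

    // Builds a name -> position index; lookups are then constant time on average.
    static Map<String, Integer> indexByName(List<String> names) {
        // Capacity hint chosen so that the default load factor (0.75) never triggers a rehash.
        Map<String, Integer> index = new HashMap<>((int) (names.size() / 0.75f) + 1);
        for (int i = 0; i < names.size(); i++) {
            index.put(names.get(i), i);
        }
        return index;
    }
}
```

If the data is only ever appended and iterated in order, the list alone is enough and the index is wasted memory.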

Auto increment feature in Database

I use SQL Server and when I create a new table I make a specific field an auto increment primary key. The problem is some people told me making the field an auto increment for the primary key means when deleting any record (they don't care about the auto increment field number) the field increases so at some point - if the type of my field is integer for example - the range of integer will be consumed totally and I will be in trouble. So they tell me not to use this feature any more.
Their suggested solution is to do this in code by getting the max of my primary key: if no value exists the new key will be 1, otherwise max + 1.
Any suggestions about this problem? Can I use the auto increment feature?
I also want to know the cases in which auto increment is not preferable, and the alternatives.
Note: this question is general, not specific to any DBMS. I want to know whether this is also true for DBMSs like Oracle, MySQL, Informix, etc.
Thanks so much.
You should use identity (auto increment) columns. The bigint data type can store values up to 2^63-1 (9,223,372,036,854,775,807). I don't think your system is going to reach this value soon, even if you are inserting and deleting lots of records.
If you implement the method you propose properly, you will end up with a lot of locking problems. Otherwise, you will have to deal with exceptions thrown because of constraint violation (or even worse - non-unique values, if there is no primary key constraint).
An int datatype in SQL Server can hold values from -2,147,483,648 through 2,147,483,647.
If you seed your identity column with -2,147,483,648, e.g. FooId int identity(-2147483648, 1), then you have over 4 billion values to play with.
If you really think this is still not enough, you could use a bigint, which can hold values from -9,223,372,036,854,775,808 through 9,223,372,036,854,775,807, but this is almost guaranteed to be overkill. Even with large data volumes and/or a large number of transactions, you will probably either run out of disk space or exhaust the lifetime of your application before you exhaust the identity values when using an int, and almost certainly when using a bigint.
To summarise, you should use an identity column and you should not care about gaps in the values since a) you have enough candidate values and b) it's an abstract number with no logical meaning.
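To put rough numbers on "enough candidate values", here is a back-of-the-envelope calculation in Java. The class name IdentityRangeDemo, the helper yearsToExhaust, and the insert rates are assumptions picked for illustration; a year is taken as 365 days and the insert rate is assumed to be constant.

```java
class IdentityRangeDemo {
    static final long SECONDS_PER_YEAR = 365L * 24 * 60 * 60;

    // Whole years until `availableValues` identity values are used up
    // at a steady rate of `insertsPerSecond`.
    static long yearsToExhaust(long availableValues, long insertsPerSecond) {
        return availableValues / insertsPerSecond / SECONDS_PER_YEAR;
    }
}
```

Under these assumptions, the positive half of a bigint at one million inserts per second (yearsToExhaust(Long.MAX_VALUE, 1_000_000)) lasts roughly 292,000 years. An int seeded at its minimum value offers 2^32 values, which at a sustained 10 inserts per second lasts about 13 years, so the int-versus-bigint choice depends on the expected insert rate.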
If you were to implement the solution you suggest, with the code deriving the next identity column, you would have to consider concurrency, since you will have to synchronise access to the current maximum identity value between two competing transactions. Indeed, you may end up introducing a significant performance degradation, since you will have to first read the max value, calculate and then insert (not to mention the extra work involved in synchronising concurrent transactions). If, however, you use an identity column, concurrency will be handled for you by the database engine.
The solution they suggest can, and most likely will, create a concurrency problem and/or scalability problem. If two sessions use the Max technique you describe at the same time, they can come up with the same number and then both try to add it at the same time. This will create a constraint violation.
You can work around that problem by locking the table or catching exceptions, and keep re-inserting.. but that's a really bad way to do things. Locking will reduce performance and cause scalability issues (and if you're planning as many records as to be worried about overflowing an int then you will need scalability).
Identity fields are atomic operations. Two sessions cannot create the same identity field, so this problem is non-existent when using it.
If you're concerned that an identity field may overflow, then use a larger datatype, such as bigint. You would be hard pressed to generate enough records to overflow that.
Now, there are valid reasons NOT to use an identity field, but this is not one of them.
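The race described above can be sketched without a database. In this Java illustration the list of ids stands in for the table and an AtomicLong stands in for the identity column; both stand-ins and all names (IdAllocationDemo, readMaxPlusOne, and so on) are assumptions, not how SQL Server implements identity internally. The interleaving is simulated in a single thread so the outcome is deterministic.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class IdAllocationDemo {
    // "MAX + 1" in application code: the read and the later insert are separate steps.
    static long readMaxPlusOne(List<Long> existingIds) {
        return existingIds.isEmpty() ? 1L : Collections.max(existingIds) + 1;
    }

    // Two sessions that both read before either one inserts.
    static long[] interleavedMaxPlusOne(List<Long> existingIds) {
        long sessionA = readMaxPlusOne(existingIds);   // session A reads the max
        long sessionB = readMaxPlusOne(existingIds);   // session B reads before A has inserted
        return new long[] { sessionA, sessionB };      // both now hold the same candidate id
    }

    // An atomic counter hands out a value and advances in one indivisible step.
    static long[] atomicIds(AtomicLong counter) {
        return new long[] { counter.incrementAndGet(), counter.incrementAndGet() };
    }
}
```

With existing ids 1, 2 and 3, both simulated sessions come back with 4, whereas the atomic counter started at 3 hands out 4 and then 5.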
Continue to use the identity feature with the primary key in SQL Server. MySQL also has an auto-increment feature. Don't worry about running out of integer range; you will run out of hard disk space before that happens.
I would advise AGAINST using identity/auto-increment, because:
Its implementation is broken in SQL Server 2005/2008. Read more
It doesn't work well if you are going to use an ORM to map your database to objects. Read more
I would advise you to use the Hi/Lo generator if you usually access your database through a program and don't depend on sending insert statements manually to the DB. You can read more about it in the second link.
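For readers who have not met Hi/Lo before, here is a rough sketch of the general idea in Java. It is not the code of any particular ORM: the class HiLoGenerator is hypothetical, and an AtomicLong stands in for the counter row that would normally live in the database. Each client reserves a block of ids with one round trip and then hands out ids from that block locally.

```java
import java.util.concurrent.atomic.AtomicLong;

class HiLoGenerator {
    private final AtomicLong sharedHi;   // stand-in for a counter stored in the database
    private final int blockSize;         // how many ids one "hi" value covers
    private long hi;
    private int lo;

    HiLoGenerator(AtomicLong sharedHi, int blockSize) {
        this.sharedHi = sharedHi;
        this.blockSize = blockSize;
        this.lo = blockSize;             // forces a block fetch on the first call
    }

    synchronized long nextId() {
        if (lo == blockSize) {
            hi = sharedHi.getAndIncrement();   // one "round trip" reserves a whole block
            lo = 0;
        }
        return hi * blockSize + lo++;          // ids inside a block never overlap another block
    }
}
```

Two generators sharing the same counter never produce the same id, but ids are no longer strictly increasing in insertion order and a restart leaves gaps, which is the tradeoff against an identity column.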

What is a hash map in programming and where can it be used

I have often heard people talking about hashing and hash maps and hash tables. I wanted to know what they are and what they are best used for.
First you should maybe read this article.
When you use lists and you are looking for a special item you normally have to iterate over the complete list. This is very expensive when you have large lists.
A hashtable can be a lot faster, under best circumstances you will get the item you are looking for with only one access.
How does it work? Like a dictionary... when you are looking for the word "hashtable" in a dictionary, you don't start with the first word under 'a'. Instead you go straight to the letter 'h', then to 'ha', 'has' and so on, until you find your word. You are using an index within your dictionary to speed up your search.
A hashtable does basically the same. Every item gets an (ideally unique) index, the so-called hash. You use this hash for lookups. The hash may be an index into an ordinary array. For instance your hash could be a number like 2130, which means that you should look at position 2130 in your array. A lookup at a known index within an array is very easy and fast.
The problem of the whole approach is the so called hash function which assigns this index to each item. When you are looking for an item you should be able to calculate the index in advance. Just like in a real dictionary, where you see that the word 'hashtable' starts with the letter 'h' and therefore you know the approximate position.
A good hash function provides hashcodes that are evenly distributed over the space of all possible hashcodes. And of course it tries to avoid collisions. A collision happens when two different items get the same hashcode.
In C# for instance every object has a GetHashCode() method which provides a hash for it (not necessarily unique). This can be used for lookups within your dictionary.
When you start using hashtables you should always keep in mind that you have to handle collisions correctly. It can happen quite easily in large hashtables that two objects get the same hash (maybe your override of GetHashCode() is faulty, maybe something else happened).
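As a small illustration of hash codes, table positions and collisions, here is a Java sketch (Java's hashCode() plays the role of C#'s GetHashCode() here; the helper bucketIndex and the class name are made up). The strings "Aa" and "BB" are a well-known pair that share the same hashCode in Java, so they necessarily land in the same bucket.

```java
class BucketIndexDemo {
    // Maps a key's hash code to a slot in a table with `bucketCount` buckets.
    static int bucketIndex(Object key, int bucketCount) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), bucketCount);
    }
}
```

Because "Aa" and "BB" are different strings with equal hash codes, a table has to compare the keys themselves (equals) after locating the bucket.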
Basically, a HashMap allows you to store items with identifiers. They are stored in a table format with the identifier being hashed using a hashing algorithm.
Typically they are more efficient to retrieve items than search trees etc.
You may find this helpful: http://www.relisoft.com/book/lang/pointer/8hash.html
Hope it helps,
Chris
Hashing (in the noncryptographic sense) is a blanket term for taking an input and then producing an output to identify it with. A trivial example of a hash is summing the alphabet positions of the letters of a string (a = 1, b = 2, c = 3, ...), e.g.:
f(abc) = 1 + 2 + 3 = 6
Note that this trivial hash scheme would create a collision between the strings abc, bca, ae, etc. An effective hash scheme would, naturally, produce different values for different strings as often as possible.
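The letter-sum scheme above, written out in Java as a sketch. The class and method names are made up, and the code assumes lowercase ASCII letters only.

```java
class LetterSumHash {
    // Sums alphabet positions: a = 1, b = 2, ..., z = 26.
    // Assumes the input contains only lowercase ASCII letters.
    static int hash(String s) {
        int sum = 0;
        for (char c : s.toCharArray()) {
            sum += c - 'a' + 1;
        }
        return sum;
    }
}
```

Since addition ignores order, every rearrangement of the same letters collides, which is why real hash functions mix in the position of each character.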
Hashmaps and hashtables are datastructures (like arrays and lists), that use hashing to store data. In a hashtable, a hash is produced (either from a provided key, or from the object itself) that determines where in the table the object is stored. This means that as long as the user of the hashtable is aware of the key, retrieving the object is extremely fast.
In a list, in comparison, you would need to search through the list in some way in order to find the object you are looking for. This also shows the downside of hashtables, which is that it is very complicated to find an object in one without knowing the key, because where the object is stored in the table has no relation to its value nor to when it was inserted.
Hash sets are similar to hashtables, but only one instance of each object is stored (hence no separate key needs to be provided; the object itself is the key).
This is of course a very simple explanation, so I suggest you read in depth from this point on. I hope I didn't make any silly mistakes. =)
A hashmap is used for storing data as key-value pairs. We can use a hashmap for storing objects in an application and use it further in the same application for storing, updating, and deleting values. The keys and values are stored in buckets, and the bucket for a given entry is determined by a hashcode function applied to the key. The detailed explanation of how a hashmap works is described in this video: https://youtu.be/iqYC1odZSNo
Hash maps save a lot of time compared to other ways of searching. A key is turned into a hash code, which in turn is used to find its index. In terms of implementation, a hash map takes a string, converts it into an integer, and remaps that integer to an index of an array, where the required value can be found.
To go into more detail, we can look at handling collisions in hash maps. For example, instead of storing a single entry in each array slot, each slot can hold a linked list of entries.
There is a short video available to understand it.
Available here :
Implementation example --> https://www.youtube.com/watch?v=shs0KM3wKv8
Sample:
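int hashCode(String s)
{
    int h = 0;
    for (int i = 0; i < s.length(); i++)
        h = 31 * h + s.charAt(i); // polynomial scheme, the same one java.lang.String.hashCode() documents
    return h;
}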
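The linked-list approach to collisions mentioned above can be sketched like this. It is a minimal separate-chaining map in Java with a fixed number of buckets; the class name ChainedMap and its methods are made up for illustration, and there is no resizing or removal.

```java
import java.util.LinkedList;

class ChainedMap {
    private static final class Entry {
        final String key;
        int value;
        Entry(String key, int value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry>[] buckets;   // one chain of entries per array slot

    @SuppressWarnings("unchecked")
    ChainedMap(int bucketCount) {
        buckets = new LinkedList[bucketCount];
        for (int i = 0; i < bucketCount; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    private LinkedList<Entry> bucketFor(String key) {
        return buckets[Math.floorMod(key.hashCode(), buckets.length)];
    }

    void put(String key, int value) {
        for (Entry e : bucketFor(key)) {
            if (e.key.equals(key)) {             // key already present: overwrite
                e.value = value;
                return;
            }
        }
        bucketFor(key).add(new Entry(key, value));   // colliding keys share the chain
    }

    // Returns the stored value, or null if the key is absent.
    Integer get(String key) {
        for (Entry e : bucketFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}
```

Lookups stay fast as long as the chains stay short, which is why real implementations grow the bucket array as the map fills up.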
