Why doesn't the computer have to do a comparison search to find a hash table value? - hashtable

I'm brushing up on how hash tables work, and so I understand how the hash function calculates a unique (for the purpose of this question) hash table value to go with a stored value, so when the stored value is searched the hash function gives the computer the hash table value.
OK, so now we have the hash table value, but how is this better? Don't we still have to iterate through until we find the matching hash table value?

The hash value is mapped directly to an index in your array, so no search or iteration over the stored values is needed to find the right slot.

The hash table is stored in an array. The hash value is mapped to an array index. Depending on the implementation, either the hash value is the array index or it is a number from a larger range which is taken modulo the size of the array.
Then once it looks at that spot in the array, it has to check that the value there matches, since multiple values may have the same hash value. Usually, it actually navigates a linked list of all values which have been hashed to the same spot in the hash table. This is a much, much shorter list than the full list (especially if the size of the hash table is proportional to the amount of data in it).
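As a rough illustration of the lookup described above, here is a minimal C++ sketch of a string set with one linked list per array slot. The class name ChainedSet and its members are made up for this example, and std::hash stands in for whatever hash function a real implementation would use; it assumes the bucket count is at least 1.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <vector>

// Hypothetical illustration: a set of strings using one chain per array slot.
class ChainedSet {
public:
    // Assumes bucket_count >= 1.
    explicit ChainedSet(std::size_t bucket_count) : buckets_(bucket_count) {}

    void insert(const std::string& key) {
        std::list<std::string>& chain = buckets_[index_of(key)];
        for (const std::string& stored : chain)
            if (stored == key) return;  // already present
        chain.push_back(key);
    }

    bool contains(const std::string& key) const {
        // Jump straight to one slot; only that (short) chain is compared.
        const std::list<std::string>& chain = buckets_[index_of(key)];
        for (const std::string& stored : chain)
            if (stored == key) return true;
        return false;
    }

private:
    std::size_t index_of(const std::string& key) const {
        // The hash value taken modulo the array size gives the array index.
        return std::hash<std::string>{}(key) % buckets_.size();
    }

    std::vector<std::list<std::string>> buckets_;
};
```

The only comparisons happen inside the one chain the index points to, which is why the lookup does not scale with the total number of stored values as long as the chains stay short.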

There are lots of different hash tables, each with differing details about the implementation, but the simplest hash table uses a hash code as index into an array:
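#include <string.h>

#define TABLESIZE 1000
/* Each slot holds a pointer to one stored string, or NULL if the slot is empty. */
char *gHashTable[TABLESIZE];
void clearHashTable(void) {
    memset(gHashTable, 0, sizeof(gHashTable));
}
/* Unsigned arithmetic, so the modulo below can never produce a negative index. */
unsigned int calculateHashCode(const char *string) {
    unsigned int val = 0;
    for (int i = 0; string[i] != '\0'; ++i)
        val += (unsigned char)string[i];
    return val;
}
void insertInHash(char *string) {
    unsigned int hashCode = calculateHashCode(string);
    /* A colliding string simply overwrites the previous occupant. */
    gHashTable[hashCode % TABLESIZE] = string;
}
int isInHashTable(const char *string) {
    unsigned int hashCode = calculateHashCode(string);
    const char *entry = gHashTable[hashCode % TABLESIZE];
    /* The slot may hold a different string with the same index, so compare. */
    return entry != NULL && strcmp(entry, string) == 0;
}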
Now this simple hash supports fast lookup on strings. It doesn't handle collisions well, the hash function is terrible, and there are a number of other problems, but it will work.

Related

Constraint on an array with same values group together

I have two rand arrays: pointer and value. Every value in pointer should also appear in value, as many times as the value itself. For example, if pointer[i] == 2, then value should contain 2 exactly two times, and those entries should come after the entries for 1.
Expected result is shown below.
Sample code:
class ABC;
rand int unsigned pointer[$];
rand int unsigned value[20];
int count;
constraint c_mode {
pointer.size() == count;
solve pointer before value;
//======== Pointer constraints =========//
// To avoid duplicates
unique {pointer};
foreach(pointer[i]) {
// Make sure pointer is inside 1 to 4
pointer[i] inside {[1:4]};
// Make sure in increasing order
if (i>0)
pointer[i] > pointer[i-1];
}
//======== Value constraints =========//
//Make sure Pointer = 2 has to come two times in value, but this is not working as expected
foreach(pointer[i]) {
value.sum with (int'(item == pointer[i])) == pointer[i];
}
// Ensure it will be in increasing order but not making sure that pointers are not grouping together
// For eg: if pointer = 2, then 2 has to come two times together and after 1 in the array order. This is not met with the below constraint
foreach(value[i]) {
foreach(value[j]) {
((i>j) && (value[i] inside pointer) && (value[j] inside pointer)) -> value[i] >= value[j];
}
}
}
function new(int num);
count = num;
endfunction
endclass
module tb;
initial begin
int unsigned index;
ABC abc = new(4);
abc.randomize();
$display("-----------------");
$display("Pointer = %p", abc.pointer);
$display("Value = %p", abc.value);
$display("-----------------");
end
endmodule
I would implement this using a couple of helper arrays:
class pointers_and_values;
rand int unsigned pointers[];
rand int unsigned values[];
local rand int unsigned values_dictated_by_pointers[][];
local rand int unsigned filler_values[][];
// ...
endclass
The values_dictated_by_pointers array will contain the groups of values that your pointers mandate. The other array will contain the dummy values that come between these groups. So, the values array will contain filler_values[0], values_dictated_by_pointers[0], filler_values[1], values_dictated_by_pointers[1], etc.
Computing the values mandated by the pointers is easy:
constraint compute_values_dictated_by_pointers {
values_dictated_by_pointers.size() == pointers.size();
foreach (pointers[i]) {
values_dictated_by_pointers[i].size() == pointers[i];
foreach (values_dictated_by_pointers[i,j])
values_dictated_by_pointers[i][j] == pointers[i];
}
}
You need as many groups as you need pointers. In each group you have as many elements as the pointer value for that group. Also, each element of a group has the same value as the group's pointer value.
For the filler values, you didn't mention what they should look like. I interpreted your problem description to say that the values in the pointers array should only come in the patterns described above. This means that they are not allowed as filler values. Depending on whether you want to allow filler values before the first value, you will need either as many filler groups as you have pointers or one extra. In the following code I allowed filler values before the "real" values:
constraint compute_filler_values {
filler_values.size() == pointers.size() + 1;
foreach (filler_values[i, j])
!(filler_values[i][j] inside { pointers });
}
You'll also need to constrain the size of each of the filler value groups, otherwise the solver will leave them as 0. Here you can change the constraints to match your requirements. I chose to always insert filler values and to never insert more than 3 filler values.
constraint max_number_of_filler_values {
foreach (filler_values[i]) {
filler_values[i].size() > 0;
filler_values[i].size() <= 3;
}
}
For the real values array, you can compute its value in post_randomize() by interleaving the other two arrays:
function void post_randomize();
  values = filler_values[0];
  foreach (pointers[i])
    // filler_values[0] is already in place, so group i is followed by filler group i + 1
    values = { values, values_dictated_by_pointers[i], filler_values[i + 1] };
endfunction
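For readers who want to check the intended ordering outside a simulator, the interleaving can be sketched in C++. This is only an illustration of the layout described above (filler_values[0], values_dictated_by_pointers[0], filler_values[1], and so on); the function name interleave and its parameter names are made up, and it assumes there is exactly one more filler group than there are pointer groups.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical C++ counterpart of the interleaving step.
// Assumes fillers.size() == groups.size() + 1.
std::vector<unsigned> interleave(const std::vector<std::vector<unsigned>>& fillers,
                                 const std::vector<std::vector<unsigned>>& groups) {
    std::vector<unsigned> values = fillers[0];  // leading filler group
    for (std::size_t i = 0; i < groups.size(); ++i) {
        // Group i mandated by the pointers, then the filler group that follows it.
        values.insert(values.end(), groups[i].begin(), groups[i].end());
        values.insert(values.end(), fillers[i + 1].begin(), fillers[i + 1].end());
    }
    return values;
}
```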
If you need to be able to constrain values as well, then you'll have to implement this interleaving operation using constraints. I'm not going to show this, as it is probably pretty complicated in itself and warrants its own question.
Be aware that the code above might not work on all EDA tools, because of spotty support for random multi-dimensional arrays. I only got this to work on Aldec Riviera Pro on EDA Playground.

SQLite dropping zeros

I'm using a Google wrapper (sqlite3pp) to insert a char array that contains some zeros. The problem is that SQLite drops the first zero and all the elements after it.
char array[11] = {1,2,3,4,5,0,3,4,0,6,7};
sqlite3pp::command cmd(db, "INSERT INTO messages (id, payload) VALUES (?, ?)");
cmd.bind(1,index);
cmd.bind(2,&array[0],sizeof(array));
This code only inserts: 1 2 3 4 5
The payload type is varchar.
Any idea?
sqlite3pp defines, among others, these two overloads for the bind() function:
int bind(int idx, char const* value, bool fstatic = true);
int bind(int idx, void const* value, int n, bool fstatic = true);
You want the second one, with the explicit length, but the first one is selected: sizeof(array), which evaluates to 11, is implicitly converted to the bool value true and passed as fstatic instead of as a size. The wrapper then treats the value as a plain NUL-terminated string and stores only the part up to the first zero.
You can help the compiler select the right overload, e.g. by passing the defaulted parameter explicitly, like so:
cmd.bind(2, &array[0], sizeof(array), true);
(Or false when the array will be deallocated before the query is done executing.)
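The overload selection can be reproduced without SQLite at all. The sketch below uses two free functions named bind_demo, which are hypothetical stand-ins that only mirror the two parameter lists quoted above; they are not part of sqlite3pp and simply report which overload was chosen.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins mirroring the two bind() parameter lists.
std::string bind_demo(int, char const*, bool = true) { return "text"; }
std::string bind_demo(int, void const*, int, bool = true) { return "blob"; }

std::string three_args() {
    char array[11] = {1, 2, 3, 4, 5, 0, 3, 4, 0, 6, 7};
    // char* -> char const* is a better match than char* -> void const*,
    // and sizeof(array) silently converts to bool, so the text overload wins.
    return bind_demo(2, &array[0], sizeof(array));
}

std::string four_args() {
    char array[11] = {1, 2, 3, 4, 5, 0, 3, 4, 0, 6, 7};
    // Only the second overload accepts four arguments.
    return bind_demo(2, &array[0], static_cast<int>(sizeof(array)), true);
}
```

With three arguments the call resolves to the NUL-terminated string version, and with four arguments it can only resolve to the length-taking version.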
Additionally, there can be problems with reading the rows as well - e.g. the default sqlite3pp getter for std::string won't work with binary zeroes and the content needs to be retrieved explicitly like this:
payload.assign(static_cast<const char*>(i->get<const void*>(2)), i->column_bytes(2));

QMap Memory Error

I am working on a project in which I define data types like below
typedef QVector<double> QFilterDataMap1D;
typedef QMap<double, QFilterDataMap1D> QFilterDataMap2D;
Then there is a class named mono_data in which I have defined this variable
QFilterMap2D valid_filters;
mono_data Scan_data // Class
Now I am reading a variable from a .mat file and trying to save it into the above "valid_filters" QMap.
for(int i=0;i<1;i++)
{
for(int j=0;j<1;j++)
{
Scan_Data.valid_filters[i][j]=valid_filters[i][j];
printf("\nValid_filters=%f",Scan_Data.valid_filters[i][j]);
}
}
The transfer completes successfully, but then I get this run-time error:
Windows has triggered a breakpoint in SpectralDataCollector.exe.
This may be due to a corruption of the heap, and indicates a bug in
SpectralDataCollector.exe or any of the DLLs it has loaded.
The output window may have more diagnostic information
Can anyone help in solving this problem. It will be of great help to me.
Thanks
Different issues here:
1. Using double as key type for a QMap
Using a QMap<double, Foo> is a very bad idea. The reason is that this is a container that lets you access a Foo given a double. For instance:
map[0.45] = foo1;
map[15.74] = foo2;
This is problematic because, to retrieve the data contained in map[key], the map has to test whether key is equal to, smaller than, or greater than the other keys in the map. In your case, the key is a double, and testing whether two doubles are equal is not a "safe" operation: two values that are mathematically equal can differ in their last bits after rounding.
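A small sketch of why a double key is fragile. It uses std::map as a stand-in for QMap (an assumption made so the example needs no Qt; both order their keys with operator<), and the helper to_fixed_key with its scale factor of 1000 is just one hypothetical workaround, not something from the question.

```cpp
#include <cmath>
#include <map>

// std::map stands in for QMap here; both compare keys with operator<.
bool computed_key_is_found() {
    std::map<double, int> filters;
    filters[0.3] = 42;
    double key = 0.1 + 0.2;  // mathematically 0.3, but a different double on IEEE 754
    return filters.find(key) != filters.end();
}

// Hypothetical workaround: round to a fixed number of decimals and key on an integer.
long to_fixed_key(double x) {
    return std::lround(x * 1000.0);
}
```

On IEEE 754 platforms the first function does not find the entry, while the rounded integer keys for 0.1 + 0.2 and 0.3 agree.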
2. Using an int as key although you declared it as double
Here:
Scan_Data.valid_filters[i][j]=valid_filters[i][j];
i is an integer, and you said it should be a double.
3. Your loop only tests (i,j) = (0,0)
Are you aware that
for(int i=0;i<1;i++)
{
for(int j=0;j<1;j++)
{
Scan_Data.valid_filters[i][j]=valid_filters[i][j];
printf("\nValid_filters=%f",Scan_Data.valid_filters[i][j]);
}
}
is equivalent to:
Scan_Data.valid_filters[0][0]=valid_filters[0][0];
printf("\nValid_filters=%f",Scan_Data.valid_filters[0][0]);
?
4. Accessing a vector with operator[] is not safe
When you do:
Scan_Data.valid_filters[i][j]
You in fact do:
QFilterDataMap1D & v = Scan_Data.valid_filters[i]; // call QMap::operator[](double)
double d = v[j]; // call QVector::operator[](int)
The first one is safe, and creates the entry if it doesn't exist. The second one is not safe: the jth element in your vector must already exist, otherwise it will crash.
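One way to avoid indexing an element that does not exist yet is to grow the containers before writing. The sketch below uses std::vector as a stand-in for QVector (an assumption to keep the example free of Qt; operator[] requires a valid index in both), and the helper name set_cell is made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: make sure m[row][col] exists, then write to it.
void set_cell(std::vector<std::vector<double>>& m,
              std::size_t row, std::size_t col, double value) {
    if (m.size() <= row) m.resize(row + 1);                 // add missing rows
    if (m[row].size() <= col) m[row].resize(col + 1, 0.0);  // add missing columns
    m[row][col] = value;                                    // index is now valid
}
```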
Solution
It seems you in fact want a 2D array of double (i.e., a matrix). To do this, use:
typedef QVector<double> QFilterDataMap1D;
typedef QVector<QFilterDataMap1D> QFilterDataMap2D;
Then, when you want to transfer one in another, simply use:
Scan_Data.valid_filters = valid_filters;
Or if you want to do it yourself:
Scan_Data.valid_filters.clear();
for(int i=0;i<n;i++)
{
Scan_Data.valid_filters << QFilterDataMap1D();
for(int j=0;j<m;j++)
{
Scan_Data.valid_filters[i] << valid_filters[i][j];
printf("\nValid_filters=%f",Scan_Data.valid_filters[i][j]);
}
}
If you want a 3D matrix, you would use:
typedef QVector<QFilterDataMap2D> QFilterDataMap3D;

two dimensional vector

I wanted to have a linked list of nodes with below structure.
struct node
{
string word;
string color;
node *next;
};
For some reasons I decided to use a vector instead of a list. My question is: is it possible to implement a vector that is bounded in the j direction and unlimited in the i direction, so that I can append two more strings at the end of my vector?
In other words, is it possible to implement the structure below with a vector?
j
i color1 color2 …
word1 word2 …
I am not good with C/C++, so this answer will only be very general. Unless you are extremely concerned about speed or memory optimization (most of the time you shouldn't be), use encapsulation.
Make a class. Make an interface which says what you want to do. Make the simplest possible implementation of how to do it. Most of the time, the simplest implementation is good enough, unless it contains some bugs.
Let's start with the interface. You could have made it part of the question. To me it seems that you want a two-dimensional something-like-an-array of strings, where one dimension allows only values 0 and 1, and the other dimension allows any non-negative integers.
Just to make sure there is no misunderstanding: The bounded dimension is always size 2 (not at most 2), right? So we are basically speaking about 2×N "rectangles" of strings.
What methods will you need? My guesses: A constructor for a new 2×0 size rectangle. A method to append a new pair of values, which increases the size of the rectangle from 2×N to 2×(N+1) and sets the two new values. A method which returns the current length of the rectangle (only the unbounded dimension, because the other one is constant). And a pair of random-access methods for reading or writing a single value by its coordinates. Is that all?
Let's write the interface (sorry, I am not good at C/C++, so this will be some C/Java/pseudocode hybrid).
class StringPairs {
constructor StringPairs(); // creates an empty rectangle
int size(); // returns the length of the unbounded dimension
void append(string s0, string s1); // adds two strings at a new j index
string get(int i, int j); // return the string at given coordinates
void set(int i, int j, string s); // sets the string at given coordinates
}
We should specify what the functions "set" and "get" will do if the index is out of bounds. For simplicity, let's say that "set" will do nothing, and "get" will return null.
Now we have the question ready. Let's get to the answer.
I think the fastest way to write this class would be to simply use the existing C++ class for one-dimensional vector (I don't know what it is and how it is used, so I just assume that it exists, and will use some pseudocode; I will call it "StringVector") and do something like this:
class StringPairs {
private StringVector _vector0;
private StringVector _vector1;
private int _size;
constructor StringPairs() {
_vector0 = new StringVector();
_vector1 = new StringVector();
_size = 0;
}
int size() {
return _size;
}
void append(string s0, string s1) {
_vector0.append(s0);
_vector1.append(s1);
_size++;
}
string get(int i, int j) {
if (0 == i) return _vector0.get(j);
if (1 == i) return _vector1.get(j);
return null;
}
void set(int i, int j, string s) {
if (0 == i) _vector0.set(j, s);
if (1 == i) _vector1.set(j, s);
}
}
Now, translate this pseudocode to C++, and add any new methods you need (it should be obvious how).
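One possible C++ translation of the pseudocode, offered as a sketch rather than the only way to do it. It assumes std::vector<std::string> plays the role of "StringVector", returns a null pointer from get where the pseudocode returns null, and drops the separate _size counter because the vectors already know their length.

```cpp
#include <cstddef>
#include <string>
#include <vector>

class StringPairs {
public:
    // Length of the unbounded dimension.
    std::size_t size() const { return row0_.size(); }

    // Adds one string to each row at a new j index.
    void append(const std::string& s0, const std::string& s1) {
        row0_.push_back(s0);
        row1_.push_back(s1);
    }

    // Returns nullptr when (i, j) is out of bounds.
    const std::string* get(int i, std::size_t j) const {
        const std::vector<std::string>* r = row(i);
        if (r == nullptr || j >= r->size()) return nullptr;
        return &(*r)[j];
    }

    // Does nothing when (i, j) is out of bounds.
    void set(int i, std::size_t j, const std::string& s) {
        if (i != 0 && i != 1) return;
        std::vector<std::string>& r = (i == 0) ? row0_ : row1_;
        if (j < r.size()) r[j] = s;
    }

private:
    const std::vector<std::string>* row(int i) const {
        if (i == 0) return &row0_;
        if (i == 1) return &row1_;
        return nullptr;
    }

    std::vector<std::string> row0_;
    std::vector<std::string> row1_;
};
```

Note that a pointer returned by get is only valid until the next append, since growing a std::vector may move its elements.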
Using the existing classes to build your new classes can help you program faster. And if you later change your mind, you can change the implementation while keeping the interface.

Open Addressing vs. Separate Chaining

Which hashmap collision handling scheme is better when the load factor is close to 1 to ensure minimum memory wastage?
I personally think the answer is open addressing with linear probing, because it doesn't need any additional storage space in case of collisions. Is this correct?
Answering the question: Which hashmap collision handling scheme is better when the load factor is close to 1 to ensure minimum memory wastage?
Open addressing/probing, which allows a high fill, because, as you said yourself, no extra space is required for collisions (just, well, possibly time; of course this also assumes the hash function isn't perfect).
If you had not specified "load factor close to 1", or had included "cost" metrics in the question, then the answer would be entirely different.
Happy coding.
A hashmap that is that full will degrade into a linear search, so you will want to keep it under 90% full.
You are right that open addressing uses less memory; chaining needs a pointer or offset field in each node.
I have created a hasharray data structure for when I need very lightweight hashtables that will not have a lot of inserts. To keep memory usage low, all data is embedded in the same block of memory, with the HashArray structure at the start, followed by two arrays for hashes and values. A hasharray can only be used when the lookup key is stored in the value.
#include <stdint.h>

typedef uint16_t HashType; /* this can be 32bits if needed. */
typedef uint16_t HashSize; /* this can be made 32bits if large hasharrays are needed. */
typedef struct HashArray {
    HashSize length; /* hasharray length. */
    HashSize count; /* number of hash/values pairs contained in the hasharray. */
    uint16_t value_size; /* size of each value. (maximum size of value 64Kbytes) */
    /* these last two fields are just for show, they are not defined in the HashArray struct;
     * they follow it in the same block of memory:
     *   HashType hashs[length];               array of hashs for each value, this helps with resolving bucket collision
     *   uint8_t values[length * value_size];  array holding all values.
     */
} HashArray;
#define hasharray_get_hashs(array) ((HashType *)(((uint8_t *)(array)) + sizeof(HashArray)))
#define hasharray_get_values(array) (((uint8_t *)(array)) + sizeof(HashArray) + \
    ((array)->length * sizeof(HashType)))
#define hasharray_get_value(array, idx) (hasharray_get_values(array) + ((idx) * (array)->value_size))
The macros hasharray_get_hashs & hasharray_get_values are used to get the 'hashs' & 'values' arrays.
I have used this for adding fast lookup of complex objects that are already stored in an array. The objects have a string 'name' field which is used for the lookup. The names are hashed and inserted into the hasharray with the object's index. The values stored in the hasharray can be indexes/pointers/whole objects (I only use small 16bit index values).
If you want to pack the hasharray until it is almost full, then you will want to use full 32bit hashes instead of the 16bit ones defined above. Larger 32bit hashes will help keep searches fast when the hasharray is more than 90% full.
The hasharray as defined above can only hold a maximum of 65535 values, which is fine since I never use it on anything that would have more than a few hundred values. Anything that needs more than that should just use a normal hashtable. But if memory is really an issue, the HashSize type could be changed to 32bits. Also, I use power-of-2 lengths to keep the hash lookup fast. Some people prefer to use prime bucket lengths, but that is only needed if the hash function has bad distribution.
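The reason power-of-2 lengths are fast is that the modulo can be replaced by a bit mask. A small sketch of that idea follows; the helper names next_pow2 and bucket_index are hypothetical and not part of the hasharray code, and next_pow2 assumes its argument is between 1 and 2^31.

```cpp
#include <cstdint>

// Round n up to the next power of two (assumes 1 <= n <= 2^31).
std::uint32_t next_pow2(std::uint32_t n) {
    std::uint32_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// For a power-of-2 length, (hash & (length - 1)) equals (hash % length).
std::uint32_t bucket_index(std::uint32_t hash, std::uint32_t pow2_length) {
    return hash & (pow2_length - 1);
}
```

The mask only keeps the low bits of the hash, which is why a hash function with poor low-bit distribution benefits from a prime length instead.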
#define hasharray_empty_hash 0xFFFF /* hash value to mark empty slots. */
void *hasharray_search(HashArray *array, HashType hash, uint32_t *next) {
HashType *hashs = hasharray_get_hashs(array);
uint32_t mask = array->length - 1;
uint32_t start_idx;
uint32_t idx;
hash = (hash == hasharray_empty_hash) ? 0 : hash; /* need one hash value to mark empty slots. */
start_idx = (hash & mask);
if(*next == 0) {
idx = start_idx; /* new search. */
} else {
idx = *next & mask; /* continuing search to next slot. */
}
/* find hash in hash array. */
do {
/* check for hash match. */
if(hashs[idx] == hash) goto found_hash;
/* check for end of chain. */
if(hashs[idx] == hasharray_empty_hash) break;
idx++;
idx &= mask;
} while(idx != start_idx);
/* maximum tries reached (i.e. did a linear search of whole array) or end of chain. */
return NULL;
found_hash:
*next = idx + 1; /* where to continue search at, if this is not the right value. */
return hasharray_get_values(array) + (idx * array->value_size);
}
Hash collisions will happen, so the code that calls hasharray_search() needs to compare the search key with the one stored in the value object. If they don't match, then hasharray_search() is called again. Non-unique keys can also exist, since searching can continue until NULL is returned to find all values that match one key. The search function uses linear probing to be cache friendly.
typedef struct {
char *name; /* this is the lookup key. */
char *type;
/* other field info... */
} Field;
typedef struct {
Field *list; /* array of Field objects. */
HashArray *lookup; /* hasharray for fast lookup of Field objects by name. The values stored in this hasharray are 16bit indices. */
uint32_t field_count; /* number of Field objects in 'list'. */
} Fields;
extern Fields *fields_new(uint16_t count) {
    Fields *fields;
    fields = calloc(1, sizeof(Fields));
    fields->list = calloc(count, sizeof(Field));
    /* allocate hasharray to hold at most 'count' uint16_t values.
     * The hasharray will round 'count' up to the next power-of-2.
     * That power-of-2 length must be at least (count+1), so that there will always be one empty slot.
     */
    fields->lookup = hasharray_new(count, sizeof(uint16_t));
    fields->field_count = count;
    return fields;
}
extern Field *fields_lookup_by_name(Fields *fields, const char *name) {
HashType hash = str_to_hash(name);
Field *field;
uint32_t next = 0;
uint16_t *rc;
uint16_t idx;
do {
rc = hasharray_search(fields->lookup, hash, &next);
if(rc == NULL) break; /* field not found. */
/* found a possible match. */
idx = *rc;
assert(idx < fields->field_count);
field = &(fields->list[idx]);
/* compare lookup name with field's name. */
if(strcmp(name, field->name) == 0) {
/* found match. */
return field;
}
/* field didn't match continue search to next field. */
} while(1);
return NULL;
}
In the worst case, searching will degrade to a linear search of the whole array if it is 99% full and the key doesn't exist. If the keys are integers, then a linear search shouldn't be too bad; also, only keys with the same hash value will need to be compared. I try to keep the hasharrays sized so they are only about 70-80% full; the space wasted on empty slots isn't much if the values are only 16bit values. With this design you only waste 4 bytes per empty slot when using 16bit hashes and 16bit index values. The array of objects (Field structs in the above example) has no empty spots.
Also, most hashtable implementations that I have seen don't store the computed hashes and require full key compares to resolve bucket collisions. Comparing the hashes helps a lot, since only a small part of the hash value is used to look up the bucket.
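As the others said, with linear probing, when the load factor is near 1, lookup time approaches that of a linear search. (When the table is completely full, a search for a missing key can loop forever unless the implementation guards against it.) There is a memory-efficiency trade-off here, while separate chaining theoretically gives us constant expected time as long as the load factor stays bounded.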
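To put numbers on this, the classic estimates for linear probing (due to Knuth, under the assumption of uniform hashing) give about ½(1 + 1/(1 − α)) probes for a successful search and about ½(1 + 1/(1 − α)²) for an unsuccessful one, where α is the load factor. The function names below are made up for illustration.

```cpp
// Approximate expected probe counts for linear probing at load factor alpha
// (0 <= alpha < 1), assuming uniform hashing.
double probes_hit(double alpha) {
    return 0.5 * (1.0 + 1.0 / (1.0 - alpha));
}

double probes_miss(double alpha) {
    return 0.5 * (1.0 + 1.0 / ((1.0 - alpha) * (1.0 - alpha)));
}
```

At α = 0.5 these give 1.5 and 2.5 probes; at α = 0.9 they give 5.5 and 50.5, which shows how quickly the cost grows as the table fills up.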
Normally, under linear probing, it's recommended to keep the load factor between 1/8 and 1/2. When the array is 1/2 full, we resize it to double its original size (reference: Algorithms, by Robert Sedgewick and Kevin Wayne). On deletion, we likewise halve the array once the load factor drops to 1/8. If you are really interested, the book mentioned above is a good place to start.
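A minimal insert-only sketch of the doubling rule, loosely following the idea above rather than reproducing the book's code. The class name ProbingSet, the initial capacity of 4, and the use of std::hash are assumptions made for this example; deletion (and the halving at 1/8) is left out to keep it short.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical insert-only string set using linear probing.
class ProbingSet {
public:
    ProbingSet() : slots_(4), used_(4, false), count_(0) {}

    void insert(const std::string& key) {
        // Double the table if this insert could push it past half full.
        if (2 * (count_ + 1) > slots_.size()) resize(2 * slots_.size());
        if (place(slots_, used_, key)) ++count_;
    }

    bool contains(const std::string& key) const {
        std::size_t mask = slots_.size() - 1;  // capacity is a power of two
        std::size_t i = std::hash<std::string>{}(key) & mask;
        while (used_[i]) {                     // an empty slot ends the probe
            if (slots_[i] == key) return true;
            i = (i + 1) & mask;
        }
        return false;
    }

    std::size_t size() const { return count_; }
    std::size_t capacity() const { return slots_.size(); }

private:
    // Probes linearly from the home slot; returns false if key was already there.
    static bool place(std::vector<std::string>& slots, std::vector<bool>& used,
                      const std::string& key) {
        std::size_t mask = slots.size() - 1;
        std::size_t i = std::hash<std::string>{}(key) & mask;
        while (used[i]) {
            if (slots[i] == key) return false;
            i = (i + 1) & mask;
        }
        slots[i] = key;
        used[i] = true;
        return true;
    }

    // Rehashes every stored key into a table of the new capacity.
    void resize(std::size_t new_capacity) {
        std::vector<std::string> slots(new_capacity);
        std::vector<bool> used(new_capacity, false);
        for (std::size_t i = 0; i < slots_.size(); ++i)
            if (used_[i]) place(slots, used, slots_[i]);
        slots_.swap(slots);
        used_.swap(used);
    }

    std::vector<std::string> slots_;
    std::vector<bool> used_;
    std::size_t count_;
};
```

Because the table never exceeds half full, every probe sequence is guaranteed to reach an empty slot, so the loops always terminate.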
In practice, 0.72 is said to be an empirical value that is commonly used.

Resources