Does anyone know how to do this and what the pseudo code would look like?
As we all know, a hash table stores key/value pairs, and when a key is called, the function returns the value associated with that key. What I want to understand is the underlying structure that creates that mapping. For example, if we lived in a world where there were no previously defined data structures except arrays, how could we replicate the Hashmaps that we have today?
Actually, some of today's Hashmap implementations are indeed made out of arrays as you propose. Let me sketch how this works:
Hash Function
A hash function transforms your key into an index into the first array (array K). A hash function such as MD5, or a much simpler one that usually ends in a modulo operation, can be used for this.
Buckets
A simple array-based Hashmap implementation could use buckets to cope with collisions. Each element ('bucket') in array K itself contains an array (array P) of pairs. When adding or querying an element, the hash function points you to the correct bucket in K, which contains your desired array P. You then iterate over the elements in P until you find a matching key, or you append a new element at the end of P.
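To make this concrete, here is a rough Java sketch of that structure (the class and method names are mine, not from any particular library):

import java.util.ArrayList;
import java.util.List;

// Toy array-based hashmap: array K of buckets, each bucket a list P of key/value pairs.
@SuppressWarnings("unchecked")
class ToyHashMap<K, V> {

    private static class Pair<K, V> {
        final K key;
        V value;
        Pair(K key, V value) { this.key = key; this.value = value; }
    }

    private List<Pair<K, V>>[] buckets = new List[16]; // array K: 16 buckets to start
    private int size = 0;                              // number of stored pairs

    // Hash the key and pick the bucket it belongs to (simple modulo-based mapping).
    private List<Pair<K, V>> bucketFor(K key) {
        int index = Math.abs(key.hashCode() % buckets.length);
        if (buckets[index] == null) {
            buckets[index] = new ArrayList<>();        // lazily create array P
        }
        return buckets[index];
    }

    public void put(K key, V value) {
        List<Pair<K, V>> bucket = bucketFor(key);
        for (Pair<K, V> p : bucket) {                  // key already present: overwrite
            if (p.key.equals(key)) { p.value = value; return; }
        }
        bucket.add(new Pair<>(key, value));            // otherwise append a new pair to P
        size++;
    }

    public V get(K key) {
        for (Pair<K, V> p : bucketFor(key)) {          // walk P until the key matches
            if (p.key.equals(key)) { return p.value; }
        }
        return null;                                   // no such key
    }
}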
Mapping keys to buckets using the Hash
You should make sure that the number of buckets (i.e. the size of K) is a power of two, say 2^b. To find the correct bucket index for some key, compute Hash(key) but keep only the lowest b bits. Interpreted as an integer, this is your bucket index.
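As a sketch of that bit-twiddling (assuming the hash value is an int and b is the exponent of the table size):

// Maps a hash value to a bucket index when the size of K is a power of two (2^b).
// ANDing with (2^b - 1) keeps only the lowest b bits, i.e. hash mod 2^b.
static int bucketIndex(int hash, int b) {
    int numBuckets = 1 << b;          // size of array K
    return hash & (numBuckets - 1);   // always in [0, numBuckets)
}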
Rescaling
Computing the hash of a key and finding the right bucket is very quick. But as a bucket fills up, you have to iterate over more and more items before you get to the right one. So it is important to have enough buckets to properly distribute the objects, or your Hashmap will become slow.
Because you generally don't know in advance how many objects you will want to store in the Hashmap, it is desirable to grow or shrink the map dynamically. You can keep a count of the number of objects stored, and once it goes over a certain threshold you recreate the entire structure, but this time with a larger or smaller size for array K. In this way, some of the buckets in K that were very full will now have their elements divided among several buckets, so performance improves.
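Continuing the toy sketch from above (and assuming put() keeps the size field up to date), the growth step could look roughly like this; the 0.75 load-factor threshold is just a common choice, not a requirement:

// A method you could add to the ToyHashMap sketch above. Call it at the end of put():
// once the map holds more than 0.75 pairs per bucket on average, rebuild array K with
// twice as many buckets and redistribute all pairs.
private void maybeGrow() {
    if ((double) size / buckets.length <= 0.75) {
        return;                                    // still enough buckets
    }
    List<Pair<K, V>>[] oldBuckets = buckets;
    buckets = new List[oldBuckets.length * 2];     // the new, larger array K
    for (List<Pair<K, V>> bucket : oldBuckets) {
        if (bucket == null) continue;
        for (Pair<K, V> p : bucket) {              // every pair is re-hashed, so the
            bucketFor(p.key).add(p);               // contents of a full bucket spread out
        }
    }
}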
Alternatives
You may also use a two-dimensional array instead of an array-of-arrays, or you may exchange array P for a linked list. Furthermore, instead of keeping a total count of stored objects, you may simply choose to recreate (i.e. rescale) the hashmap once one of the buckets contains more than some configured number of items.
A variation of what you are asking is described as 'array hash table' in the Hash table Wikipedia entry.
Code
For code samples, take a look here.
Hope this helps.
Could you be more precise? Does one array contain the keys, the other one the values?
If so, here is an example in Java (but there is little here that is specific to this language):
for (int i = 0; i < keysArray.length; i++) {
map.put(keysArray[i], valuesArray[i]);
}
Of course, you will have to instantiate your map object (if you are using Java, I suggest using a HashMap<Object, Object> instead of the obsolete Hashtable), and also validate your arrays to avoid null objects and to check that they have the same length.
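A minimal sketch of what that could look like with the checks folded in (the class and method names here are made up for illustration):

import java.util.HashMap;
import java.util.Map;

public class ArraysToMap {
    // Builds a map from two parallel arrays, with the null and length checks mentioned above.
    public static Map<Object, Object> toMap(Object[] keysArray, Object[] valuesArray) {
        if (keysArray == null || valuesArray == null) {
            throw new IllegalArgumentException("arrays must not be null");
        }
        if (keysArray.length != valuesArray.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        Map<Object, Object> map = new HashMap<>();
        for (int i = 0; i < keysArray.length; i++) {
            map.put(keysArray[i], valuesArray[i]);  // key i maps to value i
        }
        return map;
    }
}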
Sample Explanation:
The source linked below basically does two things:
1. Map Representation
An array of X lists (i.e. a list of lists).
Choosing X as a power of two (2^N) is bad; (2^N)-1, (2^N)+1, or a prime number is good.
Example:
List myhashmap[hash_table_size];
// an array of (short) lists
// if the lists grow long, there are too many collisions
NOTE: this is an array of arrays, not two arrays (I can't see a good way to build a generic hashmap with just two arrays).
If you know Algorithms > Graph theory > Adjacency list, this looks exactly the same.
2. Hash function
The hash function converts the string (input) into a number (the hash value), which is the index into the array:
initialize the hash value to the first char (converted to int)
for each further char, left-shift the hash by 4 bits, then add the char (converted to int)
Example,
int hash = input[0];
for (int i = 1; i < input.length(); i++) {
    hash = (hash << 4) + input[i];
}
hash = hash % list.size();
// list.size() here is the first dimension of the (list of lists),
// i.e. the size of our map representation from point #1,
// which is hash_table_size
See this function in the source linked below:
int HTable::hash (char const * str) const
Source:
http://www.relisoft.com/book/lang/pointer/8hash.html
How does a hash table work?
Update
This is the best source: http://algs4.cs.princeton.edu/34hash/
You mean like this?
The following uses Ruby's irb as an illustration:
cities = ["LA", "SF", "NY"]
=> ["LA", "SF", "NY"]
items = ["Big Mac", "Hot Fudge Sundae"]
=> ["Big Mac", "Hot Fudge Sundae"]
price = {}
=> {}
price[[cities[0], items[1]]] = 1.29
=> 1.29
price
=> {["LA", "Hot Fudge Sundae"]=>1.29}
price[[cities[0], items[0]]] = 2.49
=> 2.49
price[[cities[1], items[0]]] = 2.99
=> 2.99
price
=> {["LA", "Hot Fudge Sundae"]=>1.29, ["LA", "Big Mac"]=>2.49, ["SF", "Big Mac"]=>2.99}
price[["LA", "Big Mac"]]
=> 2.49
Related
I need an efficient data structure to store a multidimensional sparse array.
There are only 2 operations over the array:
batch insert of values, usually inserting more new values than already exist in the array. A key collision on insert is very unlikely, but if it happens the existing value is not updated.
query values in a certain range (e.g. read the range from index [2, 3, 10, 2] to [2, 3, 17, 6], in order)
From the start I know the number of dimensions (usually 3 to 10), their sizes (each index fits in an Int64 and the product of all sizes doesn't exceed 2^256), and an upper limit on the possible number of array cells (usually 2^26 to 2^32).
Currently I use a balanced binary tree to store the sparse array; the UInt256 key is formed in the usual row-major way:
key = (...((index_0 * dim_size_1 + index_1) * dim_size_2 + index_2) * ... ) * dim_size_n + index_n
with operation time complexities (and I understand it can't be any better):
insert in O(log N)
search in O(log N)
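To make the key construction above concrete, here is a minimal Java sketch of the row-major encode/decode (BigInteger is used because the composite key can exceed 64 bits; the names and the 0-based indexing of the dimension-size array are mine):

import java.math.BigInteger;

// Row-major linearization of an index tuple into one composite key, and back.
// dimSizes[k] is the size of dimension k; indices[k] must lie in [0, dimSizes[k]).
final class CompositeKey {

    static BigInteger encode(long[] indices, long[] dimSizes) {
        BigInteger key = BigInteger.ZERO;
        for (int k = 0; k < indices.length; k++) {
            key = key.multiply(BigInteger.valueOf(dimSizes[k]))   // shift by this dimension
                     .add(BigInteger.valueOf(indices[k]));        // then add its index
        }
        return key;
    }

    static long[] decode(BigInteger key, long[] dimSizes) {
        long[] indices = new long[dimSizes.length];
        for (int k = dimSizes.length - 1; k >= 0; k--) {          // peel dimensions off in reverse
            BigInteger[] divRem = key.divideAndRemainder(BigInteger.valueOf(dimSizes[k]));
            indices[k] = divRem[1].longValueExact();              // remainder = this index
            key = divRem[0];                                      // quotient = the rest of the key
        }
        return indices;
    }
}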
The current implementation has problems:
expensive encoding of an index tuple into the key and a key back into the indexes
lack of locality of reference which would be beneficial during range queries
Is it a good idea to replace my tree with a skip list for the locality of reference?
Given the array's sparseness, when is it better to have a recursive (nested) structure of sparse arrays, one per dimension, instead of a single array with the composite key?
I'm interested in any examples of efficient in-memory multidimensional array implementations and in specialized literature on the topic.
It depends on how sparse your matrix is. It's hard to give numbers, but if it is "very" sparse then you may want to try using a PH-Tree (disclaimer: self advertisement). It is essentially a multidimensional radix tree.
It natively supports 64-bit integers (Java and C++). It is not balanced, but its depth is inherently limited to the number of bits per dimension (usually 64). It is natively a "map", i.e. it allows only one value per coordinate (there is also a multimap version that allows multiple values). The C++ version is limited to 62 dimensions.
Operations are on the order of O(log N) but should be significantly faster than a (balanced) binary tree.
Please note that the C++ version doesn't compile with MSVC at the moment but there is a patch coming. Let me know if you run into problems.
I'm looking for an std-style container, i.e. with iterators and such, with a structure along the lines of:
template <hashable T, hashable U> class relationships {
    relationships(std::[vector or list]<std::pair<T,U>> list);
    const std::pair<T,U>& operator [](const T index);
    const std::pair<T,U>& operator [](const U index);
};
This is for a two-way mapping onto an ordered list of pairs: every value of both T and U is unique, both are hashable, and the pairs of related T and U have a specific ordering that should be reproduced by the following loop:
for (auto it : relationships) {
// do something with it
}
would be equivalent to
for (auto it : list) {
// do something with it
}
I also want efficient lookup, i.e. operator[] should be equivalent to a std::unordered_map for both types.
Finally, I'm looking for solutions based around the Standard Library using C++14, and I DO NOT WANT TO USE BOOST.
I have previously seen how to implement a hash map using binary search trees; however, I'm looking for insight into how to efficiently maintain the structure for two indexes plus ordered elements, or an existing solution if one exists.
My current idea is to use nodes along the lines of:
template <typename T, typename U> struct node {
std::pair<T, U> value; // actual value
// hashes for sorting binary trees
size_t hashT;
size_t hashU;
// linked list for ordering
node * prevL;
node * nextL;
// binary search tree for type T lookup
node * parentT;
node * prevT;
node * nextT;
// binary search tree for type U lookup
node * parentU;
node * prevU;
node * nextU;
};
However, that seems inefficient.
My other idea is to store a vector of values, which carries the ordering, and then two sorted index vectors of std::pair<size_t, size_t>, the first element being the hash and the second the index into the value vector. However, I'm not sure how to perform a binary search on an index vector and how to handle hash collisions. I believe this solution would be more memory efficient and of similar speed, but I'm not sure about all the implementation details.
EDIT: I don't need fast insertions, just lookup and iteration; the mapping would be generated once and then used to find relationships.
Regarding performance, it all depends on the algorithm and the types of T and U you are trying to use. If you build your data once and then do not change it, a simple solution would be the following:
Use a vector<pair<T,U>> for constructing your data
duplicate this vector
sort one vector according to T, one according to U
use binary search for fast lookup, either in the first vector if looking by T, or in the second if looking by U
hide all this behind a construct/sort/access interface; you might not want to expose operator[], since you are expected to look into the data structure only once it is sorted (a sketch of the idea follows below)
Of course the solution is not perfect in the sense that you are copying the data. However, remember that you will have no extra hidden allocations as you would with a hashmap. For example, for T = U = int, I would expect no more memory in use than with a std::unordered_map, since each of its nodes needs to store a pointer.
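To illustrate the approach only: the question targets C++14, where the same idea is std::sort plus std::lower_bound on two vectors of std::pair<T,U>; the sketch below is written in Java and additionally assumes T and U are comparable (the same idea works with a comparator on their hashes).

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Two sorted copies of the same pair list: one ordered by the left element, one by
// the right. Lookup in either direction is then a binary search, roughly O(log N).
final class TwoWayLookup<T extends Comparable<T>, U extends Comparable<U>> {

    private final List<SimpleEntry<T, U>> byLeft;   // copy #1, sorted by T
    private final List<SimpleEntry<T, U>> byRight;  // copy #2, sorted by U

    TwoWayLookup(List<SimpleEntry<T, U>> pairs) {
        byLeft = new ArrayList<>(pairs);
        byRight = new ArrayList<>(pairs);
        byLeft.sort((a, b) -> a.getKey().compareTo(b.getKey()));
        byRight.sort((a, b) -> a.getValue().compareTo(b.getValue()));
    }

    SimpleEntry<T, U> findByLeft(T t) {
        int i = Collections.binarySearch(byLeft, new SimpleEntry<T, U>(t, null),
                (a, b) -> a.getKey().compareTo(b.getKey()));
        return i >= 0 ? byLeft.get(i) : null;       // null: no pair with this T
    }

    SimpleEntry<T, U> findByRight(U u) {
        int i = Collections.binarySearch(byRight, new SimpleEntry<T, U>(null, u),
                (a, b) -> a.getValue().compareTo(b.getValue()));
        return i >= 0 ? byRight.get(i) : null;      // null: no pair with this U
    }
}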
What should the size of the map be if different objects (say 3) have the same hash code and, as a result, are present in the same bucket?
The resulting size of the hash table depends on what collision resolution scheme we are using.
In the simplest case, we are using something like separate chaining (with linked lists).
In this case, we will have an array of N buckets and each bucket contains a reference to a linked list.
If we proceed to insert 3 items into the hash table, all of which share the same hash code, then the single target linked list would grow to length 3.
Thus, at a high level, we need at least N "units" of space to store bucket references plus 3 "units" of space to store the elements of the (occupied) linked list.
The exact size of these "units", depends on implementation details, such as word size (32-bit vs. 64-bit) and the exact definition of the linked list (singly- vs. doubly-linked list).
Assuming that we use singly-linked lists (one per bucket) on a 32-bit machine, the total size would be approximately 32 * N + (32 + x) * 3 bits, where x is the size (in bits) of the data type we are storing (e.g. ints, doubles, strings, etc.).
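As a rough worked example of that formula (the concrete numbers are illustrative only, not from any particular implementation):

// Rough size estimate for the scenario above: N buckets, 3 colliding entries,
// singly-linked nodes on a 32-bit machine, x bits of payload per entry.
public class HashTableSizeEstimate {
    public static void main(String[] args) {
        int n = 16;                           // number of bucket references in the array
        int x = 32;                           // payload size in bits, e.g. one 32-bit int
        int bits = 32 * n + (32 + x) * 3;     // bucket references + 3 list nodes
        System.out.println(bits + " bits = " + (bits / 8) + " bytes");  // 704 bits = 88 bytes
    }
}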
If you would like to learn more, I would suggest googling "hash table collision" for more info.
The following code creates a counter for each pair of float64 values.
Because the keys of a map cannot be slices, I have to use arrays as keys, which forces me to define the dimension with a constant.
counter := make(map[[2]float64]int)
for _, comb := range combinations { // combinations is a [n][2]float64
    for _, row := range data {
        counter[[...]float64{row[comb[0]], row[comb[1]]}]++
    }
}
Having said that, is there a way to make this map dependent on the length of the keys (i.e. dependent on the dimensions of combinations)?
I tried using a struct as key, but as far as I remember (I might be wrong), it was a bit slower... For my purposes (to apply this for all combinations ~n!) this is not the ideal solution.
Right now I'm only considering combinations of size 2 and 3, and I had to split this in two separate functions, which makes my code very verbose and harder to maintain.
Can you find a way to simplify this, so I can scale it to more dimensions?
Thanks for any input
Why not use the pointer to a slice as key?
You could create a slice with make with a big enough capacity; as long as you do not surpass its capacity, the pointer will remain the same.
Take a look here: https://play.golang.org/p/333tRMpBLv , it illustrates my suggestion. Note that while len < cap the pointer of the slice does not change; append only creates a new slice when len exceeds cap.
I coded a Java implementation of a hash table, and I want to test its complexity. The hash table is structured as an array of doubly linked lists (also implemented by me). The dimension of the array is m. I implemented a division hash function, a multiplication one, and a universal one. For now I'm testing the first one.
I've developed a testing suite made this way:
U (maximum value for a key) = 10000;
m (number of positions in the hash table) = 709;
n (number of elements to be inserted) = variable.
So I performed multiple runs, gradually inserting arrays with different n. I measured the execution time with System.nanoTime().
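Roughly, the timing loop looks like the following sketch (java.util.HashMap stands in here for my own hash table class, just so the snippet is self-contained):

import java.util.HashMap;
import java.util.Random;

// Sketch of the measurement: time n inserts of pseudo-random unique keys
// with System.nanoTime().
public class InsertTimer {
    public static void main(String[] args) {
        final int u = 10000;                      // maximum key value (universe U)
        Random rnd = new Random(42);
        for (int n = 1000; n <= 8000; n *= 2) {   // gradually larger inserts
            int[] keys = rnd.ints(0, u).distinct().limit(n).toArray();
            HashMap<Integer, Integer> table = new HashMap<>();
            long start = System.nanoTime();
            for (int key : keys) {
                table.put(key, key);              // n inserts, expected O(n) in total
            }
            long elapsed = System.nanoTime() - start;
            System.out.println(n + " inserts: " + elapsed + " ns");
        }
    }
}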
The graph that comes out is the following:
http://imgur.com/AVpKKZu
Supposing that a single insert is O(1), n inserts are O(n). So should this graph look like O(n)?
If I change my values like this:
U = 1000000
m = 1009
n = variable (one array per run, with the size increasing in steps of 25000 elements, from 25000 up to 800000 elements).
The graph I got looks a little strange:
http://imgur.com/l8OcQYJ
The unique keys of the elements to be inserted are chosen pseudo-randomly from the universe of keys U.
But across different executions, even when I store the same keys in a file and reuse them, the shape of the graph always changes, with some peaks.
I hope you can help me. If someone needs the code, comment and I will be happy to show it.