Dictionary implementation (balanced binary search tree vs. hash table)


Under what circumstances would it be better to implement a Dictionary ADT using a balanced binary search tree rather than a hash table?
My assumption was that it is always better to use a binary search tree because of its natural ordering.
But it's true that a hash table's search time can be as good as O(1), vs. O(log n) for the binary tree,
so I'm not sure what the circumstances would be.

Hash tables might have a performance issue when they fill up and need to reallocate memory (a concern in a hard real-time system). Binary trees don't have this issue.
Hash tables need more memory than they actually use, whereas binary trees use only as much memory as they need.

Your question already contains the answer:
If you don't require any intrinsic ordering then use a hashtable for better performance. If your requirements demand some kind of ordering then consider using a tree.

The time complexities for a dictionary (hash table) and a balanced BST are:
---------------------------------------------
| Operation | Dictionary     | Balanced BST |
---------------------------------------------
| Insert    | O(1) (average) | O(log(n))    |
---------------------------------------------
| Delete    | O(1) (average) | O(log(n))    |
---------------------------------------------
| Search    | O(1) (average) | O(log(n))    |
---------------------------------------------
So where do you use a BST vs. a dictionary? Here are some main advantages of a BST:
With a balanced BST every operation is always O(log(n)), whereas resizing a hash table is a costly operation.
If you need the keys in sorted order, you can get them with an in-order traversal of the tree. Sorted order is not natural to a dictionary.
Order statistics, such as finding the closest lower and greater elements, or range queries.
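To make the ordering advantage concrete, here is a minimal Python sketch. It is an illustration only: a sorted list searched with the standard `bisect` module stands in for the in-order view of a balanced BST, and the helper names (`floor_key`, `higher_key`, `range_query`) are made up for this example. A plain hash-based dict has no comparable operation short of scanning every key.

```python
import bisect

# A sorted list of keys stands in for the in-order view of a balanced BST.
# (Assumption: illustrative only; a real tree would also give O(log n) inserts.)
keys = [3, 8, 15, 23, 42, 57]


def floor_key(sorted_keys, x):
    """Largest key <= x, or None if there is none."""
    i = bisect.bisect_right(sorted_keys, x)
    return sorted_keys[i - 1] if i > 0 else None


def higher_key(sorted_keys, x):
    """Smallest key strictly greater than x, or None if there is none."""
    i = bisect.bisect_right(sorted_keys, x)
    return sorted_keys[i] if i < len(sorted_keys) else None


def range_query(sorted_keys, lo, hi):
    """All keys k with lo <= k <= hi, in sorted order."""
    left = bisect.bisect_left(sorted_keys, lo)
    right = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[left:right]
```

Each helper needs only O(log n) comparisons to locate its position, which is the same bound a balanced tree gives for these queries.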

Related

SQLite optimization: simultaneous search for lower and upper bounds

For a long running algorithm based on SQLite3, I have a simple but huge table defined like that:
CREATE TABLE T(ID INTEGER PRIMARY KEY, V INTEGER);
The inner loop of the algorithm will need to find, given some integer N, the biggest ID that is less than or equal to N, the value V associated with it, as well as the smallest ID that is strictly bigger than N.
The following pair of requests does work:
SELECT ID, V FROM T WHERE ID <= ? ORDER BY ID DESC LIMIT 1;
SELECT ID FROM T WHERE ID > ? ORDER BY ID ASC LIMIT 1;
But I feel that it should be possible to merge those two requests into a single one. When SQLite has consulted the primary index to find the ID just smaller than N (first request), the next entry in the B-tree index is already the answer to the second request.
To give an order of magnitude: the table T has more than one billion rows, and the inner requests will need to be executed more than 100 billion times. Hence each microsecond counts. Of course I will use a fast SSD on a server with plenty of RAM. PostgreSQL could also be an option if it is quicker for that usage without taking more disk space.
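For readers who want to try the pair of requests, here is a small sketch using Python's standard `sqlite3` module on an in-memory database. The three sample rows and the helper name `neighbours` are made up for illustration, and the second query carries an explicit ORDER BY ID so that its result does not depend on scan order. It shows the lookup logic only and says nothing about performance at the scale described.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T(ID INTEGER PRIMARY KEY, V INTEGER)")
# Sample data (hypothetical) with gaps between the IDs.
conn.executemany("INSERT INTO T(ID, V) VALUES (?, ?)",
                 [(10, 100), (20, 200), (30, 300)])


def neighbours(conn, n):
    """Return ((floor_id, floor_v) or None, next_id or None) for the integer n."""
    floor_row = conn.execute(
        "SELECT ID, V FROM T WHERE ID <= ? ORDER BY ID DESC LIMIT 1", (n,)
    ).fetchone()
    next_row = conn.execute(
        "SELECT ID FROM T WHERE ID > ? ORDER BY ID ASC LIMIT 1", (n,)
    ).fetchone()
    return floor_row, (next_row[0] if next_row else None)
```

For example, `neighbours(conn, 25)` looks up the row with ID 20 and reports 30 as the next ID.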

This is an answer to my own question. While I haven't yet found a better SQL request than the one posted in the question, I made some preliminary speed measurements that slightly shifted my perspective. Here are my observations:
There is a big difference in search and insert performance depending on whether the ID values are in sequential or random order. In my application, IDs will be mostly sequential, but with plenty of exceptions.
Executing the pair of SQL requests takes less time than the sum of each request taken separately. This is most visible with random order. It suggests that when the second SQL request runs, the B-tree page needed to access the next ID is already in cache memory, so walking through the index is faster the second time.
The search and insertion times per request increase with the number of rows. In sequential order the difference is small, but in random order the increase is substantial. Searching a B-tree is inherently O(log N), and in addition the OS cache becomes less effective as the file size increases.
Here are my measurements on a fast server with SSD:
         | Insertion (µs)    | Search (µs)
# rows   | sequ.    random   | sequ.    random
10^5     | 0.60     0.9      | 1.1      1.3
10^6     | 0.64     3.1      | 1.2      2.5
10^7     | 0.66     4.3      | 1.2      3.0
10^8     | 0.70     5.6      | 1.3      4.2
10^9     | 0.73              | 1.3      4.6
My conclusion is that SQLite internal logic doesn't seem to be the bottleneck for the foreseen algorithm. That bottleneck for huge tables is disk access, even on a fast SSD. I don't expect to have a better performance with another database engine, nor with a custom made B-Tree.

Hash Table Implementation - alternatives to collision detection

Other than collision detection and throwing a LinkedList in a hashtable, what are some other ways that a Hash Table can be implemented? Is collision detection the only way to achieve an efficient hash table?

Ultimately, a finite-sized hash table is going to have collisions, at least any general-purpose one. If your key type is string, then there is an infinite number of possible keys, but a hash table has only a finite number of buckets. So fundamentally there have to be collisions. If you were to implement a hash table that ignores collisions, you would have a very strange, nondeterministic data structure that would appear to remove elements at random.
Now, the data structure used on the backend doesn't have to be a linked list. You could implement it as a red-black tree and get O(log n) performance out of a collision. You should check out the article 5 Myths About Hash Tables and also this Stack Overflow question about HashMaps vs Maps.
Now, if you know something about your key type, say the key is a 2-character string, then there is only a finite number of possible keys. You can then create a "hash" function that converts the key to a relatively small integer, and build a look-up table that is guaranteed not to have collisions.
It is important to note that a well-implemented hash table will not suffer very much from collisions. There are bigger problems in the world like world hunger (or even how to implement an efficient hash function) than the computer having to traverse three nodes in a linked list once every 5 days.
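The two-character case mentioned above can be sketched in a few lines of Python. This is only an illustration under a stated assumption: both characters have code points below 256, so every key gets its own slot among 65536. The class name `TwoCharTable` and its methods are invented for this example.

```python
class TwoCharTable:
    """Collision-free lookup table for keys that are exactly two characters.

    Assumption: both characters have code points below 256 (e.g. Latin-1),
    so every key maps to its own slot among 256 * 256 = 65536.
    """

    _MISSING = object()  # sentinel, so that None can be stored as a value

    def __init__(self):
        self._slots = [self._MISSING] * (256 * 256)

    @staticmethod
    def _index(key):
        if len(key) != 2 or any(ord(c) > 255 for c in key):
            raise KeyError(key)
        # Injective "hash": distinct keys always give distinct indices.
        return ord(key[0]) * 256 + ord(key[1])

    def put(self, key, value):
        self._slots[self._index(key)] = value

    def get(self, key, default=None):
        value = self._slots[self._index(key)]
        return default if value is self._MISSING else value
```

The price for the collision-free guarantee is that the table is sized for every possible key rather than for the keys actually stored.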

Other than collision detection and throwing a LinkedList in a hashtable, what are some other ways that a Hash Table can be implemented?
Other ways include:
having another container type linked from the nodes where elements have collided, such as a balanced binary tree or vector/array
GCC's hash table underpinning std::unordered_X uses a single singly-linked list of values, and a contiguous array of buckets containing iterators into the list; that's got some great characteristics including optimal iteration speed regardless of the current load_factor()
using open addressing / closed hashing, which - when an insert/find/erase finds another key in the bucket it has hashed to, uses some algorithm to find another bucket to look in instead (and so on until it finds the key, a deleted element it can insert over, or an unused bucket); there are a number of options for this kind of "probing", the simplest being a try-the-next-bucket approach, another being quadratic 1, 4, 9, 16..., another the use of alternative hash functions.
perfect hash functions (below)
Is collision detection the only way to achieve an efficient hash table?
Sometimes it's possible to find a perfect hash function that won't have collisions, but that's generally only true for very limited input sets, whether due to the nature of the inputs (e.g. month and year of birth of living people has only on the order of a thousand possible values), or because a small number of keys are known at compile time (e.g. a set of 200 keywords for a compiler).
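As an illustration of the open addressing option listed above, here is a minimal Python sketch of the try-the-next-bucket (linear probing) approach. It is a teaching sketch under simplifying assumptions: the capacity is fixed, there is no resizing, and the class and method names are invented for the example. Deleted entries are replaced by a tombstone so that later lookups keep probing past them.

```python
class LinearProbingTable:
    """Fixed-capacity open-addressing table with linear probing (no resizing)."""

    _EMPTY = object()    # bucket never used
    _DELETED = object()  # tombstone left by erase()

    def __init__(self, capacity=8):
        self._keys = [self._EMPTY] * capacity
        self._values = [None] * capacity

    def _probe(self, key):
        """Yield bucket indices starting at hash(key), then the next bucket, and so on."""
        start = hash(key) % len(self._keys)
        for step in range(len(self._keys)):
            yield (start + step) % len(self._keys)

    def insert(self, key, value):
        reusable = None  # first tombstone or unused bucket seen while probing
        for i in self._probe(key):
            slot = self._keys[i]
            if slot is self._EMPTY:
                if reusable is None:
                    reusable = i
                break  # key cannot be stored beyond an unused bucket
            if slot is self._DELETED:
                if reusable is None:
                    reusable = i
            elif slot == key:
                self._values[i] = value  # key already present: overwrite
                return
        if reusable is None:
            raise OverflowError("table is full")
        self._keys[reusable] = key
        self._values[reusable] = value

    def find(self, key, default=None):
        for i in self._probe(key):
            slot = self._keys[i]
            if slot is self._EMPTY:
                break  # an unused bucket ends the probe sequence
            if slot is not self._DELETED and slot == key:
                return self._values[i]
        return default

    def erase(self, key):
        for i in self._probe(key):
            slot = self._keys[i]
            if slot is self._EMPTY:
                return False
            if slot is not self._DELETED and slot == key:
                self._keys[i] = self._DELETED
                self._values[i] = None
                return True
        return False
```

With a capacity of 8, the integer keys 1, 9 and 17 all hash to bucket 1, so inserting them in that order places them in buckets 1, 2 and 3, which is a convenient way to watch the probing at work.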

What exactly is table size in SAS HashTable specified by hashexp?

I would like a little clarification on the definition of a bucket in a SAS hash table. The question is specifically about the hashexp parameter.
According to the SAS DOCs, hashexp is:
The hash object's internal table size, where the size of the hash table is 2^n.
The value of HASHEXP is used as a power-of-two exponent to create the hash table size. For example, a value of 4 for HASHEXP equates to a hash table size of 2^4, or 16. The maximum value for HASHEXP is 20.
The hash table size is not equal to the number of items that can be stored. Imagine the hash table as an array of 'buckets.' A hash table size of 16 would have 16 'buckets.' Each bucket can hold an infinite number of items. The efficiency of the hash table lies in the ability of the hashing function to map items to and retrieve items from the buckets.
You should set the hash table size relative to the amount of data in the hash object in order to maximize the efficiency of the hash object lookup routines. Try different HASHEXP values until you get the best result. For example, if the hash object contains one million items, a hash table size of 16 (HASHEXP = 4) would work, but not very efficiently. A hash table size of 512 or 1024 (HASHEXP = 9 or 10) would result in the best performance.
The question is: what exactly is the hash table size, if it is not the amount of data in the hash object?
Should it be understood as allocating as much memory as may be necessary, no less and no more? It is a power of two to make things work fast. But it does not limit the amount of data that can be stored; it only indicates roughly how much is going to be used, right?

Paul Dorfman (the master of hashing) goes into a fair bit of detail on page 10 of this whitepaper:
http://www2.sas.com/proceedings/forum2008/037-2008.pdf
As I understand it, hash objects store their data in binary trees. The number of buckets created by hashexp is the number of binary trees that will be used to store the data. A hashexp of 0 would use a single tree, while a hashexp of 8 would use 256 trees. When a lookup is performed against the hash object, an internal algorithm determines which tree the key should exist in (based on the hashed value). It then checks that tree for the value. By knowing immediately which of the 256 trees to look in (for example), it saves itself about 8 comparisons (since 256 = 2^8) compared to a single binary tree.
The whole thing seems a lot more complex than that but that's my interpretation of why it works out faster.
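The arithmetic behind that interpretation can be written down in a few lines of Python. This is a rough model, not a description of SAS internals: it assumes each bucket holds a balanced binary tree and that the items are spread evenly over the buckets. The function names are made up for the example.

```python
import math


def bucket_count(hashexp):
    """Number of buckets for a given HASHEXP: 2 ** hashexp."""
    return 2 ** hashexp


def approx_comparisons(n_items, hashexp):
    """Rough comparisons per lookup if each bucket held a balanced tree
    and the items were spread evenly over the buckets (both assumptions)."""
    per_bucket = n_items / bucket_count(hashexp)
    return max(1.0, math.log2(per_bucket))
```

Under these assumptions, a table of 2^20 items needs about 20 comparisons per lookup with a single tree (HASHEXP = 0) and about 12 with 256 trees (HASHEXP = 8), a difference of 8.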

As Rob Penridge pointed out, Paul Dorfman is indeed the SAS Hash Object Guru. Hashexp is not related to the size of the hash table, again as mentioned in Rob's answer.
If you have a table with 100 observations and 10 numeric variables loaded into a hash table, then the size of the hash table is simply 100 obs * 10 vars * 8 bytes (assuming all numeric variables are stored as 8-byte fields) = 8,000 bytes, or about 7.8 KB, give or take 10%.
Remember that SAS dynamically allocates RAM as records are added to the hash table in memory, so you do not need to specify in advance what size it should be. [I've been using hash tables regularly, but can't think of any place where one can specify the size in advance.]
General tip: if you want to know how big your hash table is going to be, run PROC CONTENTS on the dataset you want to load into the hash table and multiply "Observation Length" by the number of observations in the dataset; this gives the memory needed in bytes. If you have that much memory, you can load the dataset into memory.

Berkeleydb - B-Tree versus Hash Table

I am trying to understand what should drive the choice of access method when using BerkeleyDB: B-Tree versus Hash.
A hash table provides O(1) lookup, but inserts are expensive (using linear/extensible hashing we get amortized O(1) for insert). B-trees, by contrast, provide O(log N) (base B) lookup and insert times. A B-tree can also support range queries and allow access in sorted order.
Apart from these considerations what else should be factored in?
If I don't need to support range queries can I just use a Hashtable access method?

When your data sets get very large, B-trees are still better because the majority of the internal metadata may still fit in cache. Hashes, by their nature (uniform random distribution of data) are inherently cache-unfriendly. I.e., once the total size of the data set exceeds the working memory size, hash performance drops off a cliff while B-tree performance degrades gracefully (logarithmically, actually).

It depends on your data set and keys. On small data sets your benchmarks will be close to the same; on larger data sets, however, they can vary depending on the type of keys and how much data you have. Usually a B-tree is better, until the B-tree metadata exceeds your cache and it ends up doing lots of I/O; in that case hash is better. Also, as you pointed out, B-tree inserts are more expensive. If you find you will be doing lots of inserts and few reads, hash may be better; if you do few inserts but lots of reads, a B-tree may be better.
If you are really concerned about performance you could try both methods and run your own benchmarks =]

For many applications, a database is accessed at random, interactively
or with "transactions". This might happen if you have data coming in
from a web server. However, you often have to populate a large
database all at once, as a "batch" operation. This might happen if you
are doing a data analysis project, or migrating an old database to a
new format.
When you are populating a database all at once, a B-Tree or other
sorted index is preferable because it allows the batch insertions to
be done much more efficiently. This is accomplished by sorting the
keys before putting them into the database. Populating a BerkeleyDB
database with 10 million entries might take an hour when the entries
are unsorted, because every access is a cache miss. But when the
entries are sorted, the same procedure might take only ten minutes.
The proximity of consecutive keys means you'll be utilizing various
caches for almost all of the insertions. Sorting can be done very
quickly, so the whole operation could be sped up by several times just
by sorting the data before inserting it. With hashtable indexing,
because you don't know in advance which keys will end up next to each
other, this optimization is not possible.
Update: I decided to provide an actual example. It is based on the
following script "db-test"
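#!/usr/bin/perl
# db-test: load every line of the input into a BerkeleyDB file.
# Usage: ./db-test BerkeleyDB::Btree < input   (or BerkeleyDB::Hash)
use warnings;
use strict;
use BerkeleyDB;
my %hash;
# Start from an empty database file on every run.
unlink "test.db";
# The first argument names the access-method class to tie the hash to.
tie %hash, (shift), -Filename=>"test.db", -Flags=>DB_CREATE or die;
# Each input line (including its trailing newline) becomes a key.
while(<>) { $hash{$_}=1; }
untie %hash;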
We can test it with a Wikipedia dump index file of 16 million entries. (I'm running this on an 800MHz 2-core laptop, with 3G of memory)
$ >enw.tab bunzip2 <enwiki-20151102-pages-articles-multistream-index.txt.bz2
$ wc -l enw.tab
16050432 enw.tab
$ du -shL enw.tab
698M enw.tab
$ time shuf enw.tab > test-shuf
16.05s user 6.65s system 67% cpu 33.604 total
$ time sort enw.tab > test-sort
70.99s user 10.77s system 114% cpu 1:11.47 total
$ time ./db-test BerkeleyDB::Btree < test-shuf
682.75s user 368.58s system 42% cpu 40:57.92 total
$ du -sh test.db
1.3G test.db
$ time ./db-test BerkeleyDB::Btree < test-sort
378.10s user 10.55s system 91% cpu 7:03.34 total
$ du -sh test.db
923M test.db
$ time ./db-test BerkeleyDB::Hash < test-shuf
672.21s user 387.18s system 39% cpu 44:11.73 total
$ du -sh test.db
1.1G test.db
$ time ./db-test BerkeleyDB::Hash < test-sort
665.94s user 376.65s system 36% cpu 46:58.66 total
$ du -sh test.db
1.1G test.db
You can see that pre-sorting the Btree keys drops the insertion time
down from 41 minutes to 7 minutes. Sorting takes only 1 minute, so
there's a big net gain - the database creation goes 5x faster. With
the Hash format, the creation times are equally slow whether the input
is sorted or not. Also note that the database file size is smaller for
the sorted insertions; presumably this has to do with tree balancing.
The speedup must be due to some kind of caching, but I'm not sure
where. It is likely that we have fewer cache misses in the kernel's
page cache with sorted insertions. This would be consistent with the
CPU usage numbers - when there is a page cache miss, then the process
has to wait while the page is retrieved from disk, so the CPU usage is
lower.
I ran the same tests with two smaller files as well, for comparison.
File        | WP index         | Wikt. words        | /usr/share/dict/words
Entries     | 16e6             | 4.7e6              | 1.2e5
Size        | 700M             | 65M                | 1.1M
shuf time   | 34s              | 4s                 | 0.06s
sort time   | 1m10s            | 6s                 | 0.12s
-------------------------------------------------------------------------
Each cell below: total time, DB size, CPU usage
-------------------------------------------------------------------------
Btree shuf  | 41m, 1.3G, 42%   | 5m00s, 180M, 88%   | 6.4s, 3.9M, 86%
      sort  | 7m, 920M, 91%    | 1m50s, 120M, 99%   | 2.9s, 2.6M, 97%
Hash  shuf  | 44m, 1.1G, 39%   | 5m30s, 129M, 87%   | 6.2s, 2.4M, 98%
      sort  | 47m, 1.1G, 36%   | 5m30s, 129M, 86%   | 6.2s, 2.4M, 94%
-------------------------------------------------------------------------
Speedup     | 5x               | 2.7x               | 2.2x
With the largest dataset, sorted insertions give us a 5x speedup.
With the smallest, we still get a 2x speedup - even though the data
fits easily into memory, so that CPU usage is always high. This seems
to imply that we are benefiting from another source of efficiency in
addition to the page cache, and that the 5x speedup was actually due
in equal parts to page cache and something else - perhaps the better
tree balancing?
In any case, I tend to prefer the Btree format because it allows
faster batch operations. Even if the final database is accessed at
random, I use batch operations for development, testing, and
maintenance. Life is easier if I can find a way to speed these up.
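The page-cache explanation for the sorted-insertion speedup can be illustrated with a toy model in Python. This is not BerkeleyDB and makes strong simplifying assumptions: the final sorted file is cut into fixed-size pages, each insertion touches only the page the key ends up on, and a small LRU cache holds a handful of pages. The function name and all parameters are invented for the sketch.

```python
import random
from collections import OrderedDict


def count_page_misses(keys_in_insert_order, page_size=100, cache_pages=8):
    """Count misses of an LRU page cache while 'inserting' keys.

    Toy model (assumption): the final sorted file is cut into pages of
    page_size keys, and inserting a key touches only the page it ends up on.
    """
    rank = {k: i for i, k in enumerate(sorted(keys_in_insert_order))}
    cache = OrderedDict()  # page number -> None, least recently used first
    misses = 0
    for key in keys_in_insert_order:
        page = rank[key] // page_size
        if page in cache:
            cache.move_to_end(page)        # hit: mark as most recently used
        else:
            misses += 1
            cache[page] = None
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict the least recently used page
    return misses


rng = random.Random(0)               # fixed seed so the run is repeatable
keys = [f"key{i:05d}" for i in range(10_000)]
shuffled = keys[:]
rng.shuffle(shuffled)

sorted_misses = count_page_misses(sorted(keys))
shuffled_misses = count_page_misses(shuffled)
```

In this model the sorted order misses once per page (100 misses for 10,000 keys), while the shuffled order misses on most insertions because consecutive keys rarely land on a cached page. A hash access method scatters keys by design, so sorting the input cannot help it in the same way.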

To quote the two main authors of Berkeley DB in this write up of the architecture:
The main difference between Btree and Hash access methods is that
Btree offers locality of reference for keys, while Hash does not. This
implies that Btree is the right access method for almost all data
sets; however, the Hash access method is appropriate for data sets so
large that not even the Btree indexing structures fit into memory. At
that point, it's better to use the memory for data than for indexing
structures. This trade-off made a lot more sense in 1990 when main
memory was typically much smaller than today.
So perhaps in embedded devices and specialized use cases a hash table may work. B-trees are used in modern filesystems like Btrfs, and the B-tree is pretty much the ideal data structure for building either databases or filesystems.

Hash tables v self-balancing search trees

I am curious to know what reasoning could tip the balance towards using a self-balancing tree to store items rather than a hash table.
I see that hash tables cannot maintain insertion order, but I could always use a linked list on top to store the insertion-order sequence.
I see that for a small number of values there is the added cost of the hash function, but I could always save the hash value together with the key for faster lookups.
I understand that hash tables are more difficult to implement than a straightforward red-black tree, but in a practical implementation wouldn't one be willing to go the extra mile?
I see that with hash tables it is normal for collisions to occur, but with open-addressing techniques like double hashing, which allow the keys to be stored in the hash table itself, hasn't the problem been reduced to the point where it no longer tips the balance in favor of red-black trees?
I am curious whether I am missing a disadvantage of hash tables that still makes red-black trees a viable data structure in practical applications (like filesystems, etc.).

Here is what I can think of:
There are kinds of data which cannot be hashed (or are too expensive to hash) and therefore cannot be stored in hash tables.
Trees keep data in the order you need (sorted), not insertion order. You can't (effectively) do that with a hash table, even if you run a linked list through it.
Trees have better worst-case performance.

Storage allocation is another consideration. Every time you fill all of the buckets in a hash-table, you need to allocate new storage and re-hash everything. This can be avoided if you know the size of the data ahead of time. On the other hand, balanced trees don't suffer from this issue at all.
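The cost of re-hashing can be made visible with a small counting experiment in Python. This is a sketch under stated assumptions: a chained table that starts with 8 buckets and doubles whenever it holds as many items as it has buckets. The class name and its counters are invented for the example and do not describe any particular library.

```python
class CountingHashTable:
    """Chained hash table that doubles when full and counts re-hashed items."""

    def __init__(self, capacity=8):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0
        self.moved_total = 0   # items re-hashed over the table's lifetime
        self.moved_worst = 0   # items re-hashed by the single worst insert

    def _resize(self):
        old_items = [pair for bucket in self._buckets for pair in bucket]
        self._buckets = [[] for _ in range(2 * len(self._buckets))]
        for key, value in old_items:
            self._buckets[hash(key) % len(self._buckets)].append((key, value))
        self.moved_total += len(old_items)
        self.moved_worst = max(self.moved_worst, len(old_items))

    def insert(self, key, value):
        bucket = self._buckets[hash(key) % len(self._buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # existing key: overwrite in place
                return
        if self._size == len(self._buckets):   # as many items as buckets
            self._resize()
            bucket = self._buckets[hash(key) % len(self._buckets)]
        bucket.append((key, value))
        self._size += 1


table = CountingHashTable()
for n in range(1000):
    table.insert(n, str(n))
```

After 1000 inserts the table has re-hashed 1016 items in total, so the average cost per insert stays constant, but a single insert had to move 512 items. That one slow insert is what matters in a hard real-time setting, and a balanced tree never has an equivalent.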

Just wanted to add:
Balanced binary trees have a predictable lookup time, O(log n), independent of the type of data. That can be important when you need to estimate your application's response times (hash tables may have unpredictable response times). Remember that for smaller n, as in most common use cases, the difference in performance of an in-memory lookup is hardly going to matter; the bottleneck of the system is going to be elsewhere, and sometimes you just want to make the system much simpler to debug and analyze.
Trees are generally more memory-efficient than hash tables and much simpler to implement, without any analysis of the distribution of input keys, possible collisions, etc.

In my humble opinion, self-balancing trees work pretty well as academic topics, and I
do not know of anything that qualifies as a "straightforward implementation of a
red-black tree".
In the real world, the memory wall makes them far less efficient than they are on paper.
With this in mind, hash tables are decent alternatives, especially if you don't use
them in the academic style (forget about the table size constraint and you magically resolve
the table resize issue and almost all collision issues).
In a word: keep it simple. If it's simple for you, then it's simple for your computer.

I think that if you want to query for a range of keys instead of a single key, a self-balancing tree will perform better than a hash table.

A few reasons I can think of:
Trees are dynamic (the space complexity is O(N)), whereas hash tables are often implemented as fixed-size arrays, which means they are often initialized with size K, where K > N. So even if you only have 1 element in a hashmap, you might still have 100 empty slots that take up memory. Another effect of this is:
Increasing the size of an array-based hash table is costly (O(N) average time, O(N log N) worst case), whereas trees can grow in constant time (O(1)) plus the time to locate the insertion point (O(log N)).
Elements in a tree can be gathered in sorted order (e.g. using an in-order traversal). You thereby often get a sorted list as a free perk with trees.
Trees can have better worst-case performance than a hashmap, depending on how the hashmap is implemented (e.g. a hashmap with chaining has an O(N) worst case, whereas self-balancing trees can guarantee an O(log N) worst case for all operations).
Both self-balancing trees and hashmaps can achieve a worst case of O(log N) at best (assuming the hashmap handles collisions with trees), but hashmaps can have better average-case performance (often close to O(1)), whereas trees have a constant O(log N). This is because even though a hashmap can locate the insertion index in O(1), it has to account for hash collisions (more than one element hashing to the same array index), and thus at best degrades to a self-balancing tree (as in the Java implementation of HashMap); that is, each cell of the hashmap can be implemented as a self-balancing tree storing all elements that hashed to that array cell.
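The O(N) worst case of chaining is easy to reproduce in Python. The sketch below is illustrative only: it builds a chained table with a pluggable hash function, counts key comparisons for one lookup, and compares a well-spread hash with a degenerate one that sends every key to the same bucket. The function name and the sample sizes are made up for the example.

```python
import math


def chained_lookup_comparisons(keys, target, hash_fn, n_buckets):
    """Number of key comparisons needed to find target in a chained hash table."""
    buckets = [[] for _ in range(n_buckets)]
    for key in keys:
        buckets[hash_fn(key) % n_buckets].append(key)
    comparisons = 0
    for key in buckets[hash_fn(target) % n_buckets]:
        comparisons += 1
        if key == target:
            break
    return comparisons


keys = list(range(1024))
# Keys spread over the buckets: one comparison is enough.
good = chained_lookup_comparisons(keys, 1023, hash, 1024)
# Degenerate hash: every key lands in bucket 0 and the chain is scanned linearly.
bad = chained_lookup_comparisons(keys, 1023, lambda k: 0, 1024)
# Levels in a perfectly balanced binary tree holding the same 1024 keys.
tree_levels = math.ceil(math.log2(len(keys) + 1))
```

Here the good hash needs 1 comparison, the degenerate one needs 1024, and a balanced tree would never need more than 11, which is why replacing long chains with trees caps the damage at O(log N).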
