I am trying to understand what should drive the choice of access method when using BerkeleyDB: B-Tree versus Hash.
A hash table provides O(1) lookup, but inserts are expensive (with linear/extensible hashing we get amortized O(1) inserts). B-Trees provide O(log N) (base B) lookup and insert times. A B-Tree can also support range queries and allows access in sorted order.
Apart from these considerations, what else should be factored in?
If I don't need to support range queries, can I just use the Hash access method?
When your data sets get very large, B-trees are still better because the majority of the internal metadata may still fit in cache. Hashes, by their nature (uniform random distribution of data), are inherently cache-unfriendly. I.e., once the total size of the data set exceeds the working memory size, hash performance drops off a cliff while B-tree performance degrades gracefully (logarithmically, actually).
It depends on your data set and keys. On small data sets your benchmark results will be close to the same; on larger data sets it can vary depending on what type of keys you have and how much data there is. Usually B-tree is better, until the B-tree metadata exceeds your cache and it ends up doing lots of I/O; in that case Hash is better. Also, as you pointed out, B-tree inserts are more expensive: if you will be doing lots of inserts and few reads, Hash may be better; if you do few inserts but lots of reads, B-tree may be better.
If you are really concerned about performance you could try both methods and run your own benchmarks =]
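For example, here is a minimal sketch of such a benchmark with the Perl BerkeleyDB module (the file name, the one-million-key count and the random integer keys are arbitrary choices for illustration, not part of the original question):
#!/usr/bin/perl
# Time bulk inserts into each access method; adapt the key/value shapes
# and the read/write mix to whatever your real workload looks like.
use strict;
use warnings;
use BerkeleyDB;
use Time::HiRes qw(time);
for my $method (qw(BerkeleyDB::Btree BerkeleyDB::Hash)) {
    unlink "bench.db";
    my %h;
    tie %h, $method, -Filename => "bench.db", -Flags => DB_CREATE
        or die "cannot open bench.db: $BerkeleyDB::Error";
    my $t0 = time;
    $h{ int rand 1e9 } = $_ for 1 .. 1_000_000;
    untie %h;
    printf "%-18s %.1f s\n", $method, time - $t0;
}
As the other answers point out, expect the comparison to flip depending on whether the working set still fits in cache.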
For many applications, a database is accessed at random, interactively
or with "transactions". This might happen if you have data coming in
from a web server. However, you often have to populate a large
database all at once, as a "batch" operation. This might happen if you
are doing a data analysis project, or migrating an old database to a
new format.
When you are populating a database all at once, a B-Tree or other
sorted index is preferable because it allows the batch insertions to
be done much more efficiently. This is accomplished by sorting the
keys before putting them into the database. Populating a BerkeleyDB
database with 10 million entries might take an hour when the entries
are unsorted, because every access is a cache miss. But when the
entries are sorted, the same procedure might take only ten minutes.
The proximity of consecutive keys means you'll be utilizing various
caches for almost all of the insertions. Sorting can be done very
quickly, so the whole operation could be sped up by several times just
by sorting the data before inserting it. With hashtable indexing,
because you don't know in advance which keys will end up next to each
other, this optimization is not possible.
Update: I decided to provide an actual example. It is based on the following script, "db-test":
#!/usr/bin/perl
use warnings;
use strict;
use BerkeleyDB;
my %hash;
unlink "test.db";
# The access method (BerkeleyDB::Btree or BerkeleyDB::Hash) is passed as
# the first command-line argument; each line of input becomes a key.
tie %hash, (shift), -Filename => "test.db", -Flags => DB_CREATE or die;
while (<>) { $hash{$_} = 1; }
untie %hash;
We can test it with a Wikipedia dump index file of 16 million entries. (I'm running this on an 800MHz 2-core laptop, with 3G of memory)
$ >enw.tab bunzip2 <enwiki-20151102-pages-articles-multistream-index.txt.bz2
$ wc -l enw.tab
16050432 enw.tab
$ du -shL enw.tab
698M enw.tab
$ time shuf enw.tab > test-shuf
16.05s user 6.65s system 67% cpu 33.604 total
$ time sort enw.tab > test-sort
70.99s user 10.77s system 114% cpu 1:11.47 total
$ time ./db-test BerkeleyDB::Btree < test-shuf
682.75s user 368.58s system 42% cpu 40:57.92 total
$ du -sh test.db
1.3G test.db
$ time ./db-test BerkeleyDB::Btree < test-sort
378.10s user 10.55s system 91% cpu 7:03.34 total
$ du -sh test.db
923M test.db
$ time ./db-test BerkeleyDB::Hash < test-shuf
672.21s user 387.18s system 39% cpu 44:11.73 total
$ du -sh test.db
1.1G test.db
$ time ./db-test BerkeleyDB::Hash < test-sort
665.94s user 376.65s system 36% cpu 46:58.66 total
$ du -sh test.db
1.1G test.db
You can see that pre-sorting the Btree keys drops the insertion time
down from 41 minutes to 7 minutes. Sorting takes only 1 minute, so
there's a big net gain - the database creation goes 5x faster. With
the Hash format, the creation times are equally slow whether the input
is sorted or not. Also note that the database file size is smaller for
the sorted insertions; presumably this has to do with tree balancing.
The speedup must be due to some kind of caching, but I'm not sure
where. It is likely that we have fewer cache misses in the kernel's
page cache with sorted insertions. This would be consistent with the
CPU usage numbers - when there is a page cache miss, then the process
has to wait while the page is retrieved from disk, so the CPU usage is
lower.
I ran the same tests with two smaller files as well, for comparison.
File        | WP index          | Wikt. words       | /usr/share/dict/words
Entries     | 16e6              | 4.7e6             | 1.2e5
Size        | 700M              | 65M               | 1.1M
shuf time   | 34s               | 4s                | 0.06s
sort time   | 1:10s             | 6s                | 0.12s
---------------------------------------------------------------------------
            | (total time, DB size, CPU usage in each column below)
---------------------------------------------------------------------------
Btree shuf  | 41m, 1.3G, 42%    | 5:00s, 180M, 88%  | 6.4s, 3.9M, 86%
      sort  |  7m, 920M, 91%    | 1:50s, 120M, 99%  | 2.9s, 2.6M, 97%
Hash  shuf  | 44m, 1.1G, 39%    | 5:30s, 129M, 87%  | 6.2s, 2.4M, 98%
      sort  | 47m, 1.1G, 36%    | 5:30s, 129M, 86%  | 6.2s, 2.4M, 94%
---------------------------------------------------------------------------
Speedup     | 5x                | 2.7x              | 2.2x
With the largest dataset, sorted insertions give us a 5x speedup.
With the smallest, we still get a 2x speedup - even though the data
fits easily into memory, so that CPU usage is always high. This seems
to imply that we are benefiting from another source of efficiency in
addition to the page cache, and that the 5x speedup was actually due
in equal parts to page cache and something else - perhaps the better
tree balancing?
In any case, I tend to prefer the Btree format because it allows
faster batch operations. Even if the final database is accessed at
random, I use batch operations for development, testing, and
maintenance. Life is easier if I can find a way to speed these up.
To quote the two main authors of Berkeley DB in this write-up of the architecture:
The main difference between Btree and Hash access methods is that
Btree offers locality of reference for keys, while Hash does not. This
implies that Btree is the right access method for almost all data
sets; however, the Hash access method is appropriate for data sets so
large that not even the Btree indexing structures fit into memory. At
that point, it's better to use the memory for data than for indexing
structures. This trade-off made a lot more sense in 1990 when main
memory was typically much smaller than today.
So perhaps in embedded devices and specialized use cases a hash table may still make sense. B-trees are used in modern filesystems like Btrfs, and the B-tree is pretty much the ideal data structure for building either databases or filesystems.
Related
I would like to gather information about currently provisioned throughput for all (Mongo API) collections in all databases in a Cosmos account. This is to detect any variances from an expected baseline.
For my use case the results need to report autoscale but not database provisioned throughput.
The following works but is quite slow. (I ran it in Azure Cloud Shell for an account with 94 collections across 41 databases; the first attempt took 4 minutes 48 seconds, with some noticeably lengthy delays between results, and the second attempt was 3 minutes faster at 1 minute 48 seconds.)
Even the second attempt is much too slow for my liking, though.
Set-AzContext -Subscription "..."
$rgName = "..."
$accountName = "..."
Get-AzCosmosDBMongoDBDatabase -ResourceGroupName $rgName -AccountName $accountName | ForEach-Object {
    $Dbname = $_.Name
    Get-AzCosmosDBMongoDBCollection -ResourceGroupName $rgName -AccountName $accountName -Database $Dbname | ForEach-Object {
        $collName = $_.Name
        Get-AzCosmosDBMongoDBCollectionThroughput -ResourceGroupName $rgName -AccountName $accountName -DatabaseName $Dbname -Name $collName |
            Select-Object -Property Throughput, MinimumThroughput,
                @{Name = 'DatabaseName'; Expression = { $Dbname }},
                @{Name = 'CollectionName'; Expression = { $collName }} -ExpandProperty AutoscaleSettings
    }
}
Is there any way of getting the desired results much quicker than the above?
You could possibly try the Azure Management Library for Cosmos DB, but I can't say for sure whether it would be any faster. There is a sample on GitHub here that shows how to enumerate the database and collection objects and get the throughput on each.
The overall challenge here is that control plane operations in Cosmos DB are served by a master partition within the account, which has an extremely small amount of RU/s to service requests. In fact, if you make too many metadata requests to this master partition you can get rate limited and receive 429 responses. The fact that this is taking so long is likely a good thing, in that you aren't seeing 429s.
For a long-running algorithm based on SQLite3, I have a simple but huge table defined like this:
CREATE TABLE T(ID INTEGER PRIMARY KEY, V INTEGER);
The inner loop of the algorithm needs to find, given some integer N, the biggest ID that is less than or equal to N and the value V associated with it, as well as the smallest ID that is strictly greater than N.
The following pair of queries does work:
SELECT ID, V FROM T WHERE ID <= ? ORDER BY ID DESC LIMIT 1;
SELECT ID FROM T WHERE ID > ? LIMIT 1;
But I feel that it should be possible to merge these two queries into a single one. When SQLite has consulted the primary key's B-tree to find the largest ID not exceeding N (first query), the next entry in that B-tree is already the answer to the second query.
To give an order of magnitude, the table T has more than one billion rows, and the inner queries will need to be executed more than 100 billion times. Hence every microsecond counts. Of course I will use a fast SSD on a server with plenty of RAM. PostgreSQL could also be an option if it is quicker for this usage without taking more disk space.
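For reference, here is how the inner loop might drive these two queries with prepared statements (an illustrative sketch only; Perl's DBI with DBD::SQLite, the file name "algo.db" and the probe value are my own placeholders, not part of the question):
#!/usr/bin/perl
# Prepare both statements once and only bind/execute inside the loop,
# so each iteration costs two index descents rather than SQL parsing.
use strict;
use warnings;
use DBI;
my $dbh = DBI->connect("dbi:SQLite:dbname=algo.db", "", "",
                       { RaiseError => 1 });
my $below = $dbh->prepare(
    "SELECT ID, V FROM T WHERE ID <= ? ORDER BY ID DESC LIMIT 1");
my $above = $dbh->prepare(
    "SELECT ID FROM T WHERE ID > ? LIMIT 1");
my $n = 123_456;                      # placeholder probe value
$below->execute($n);
my ($id_le, $v) = $below->fetchrow_array;
$above->execute($n);
my ($id_gt) = $above->fetchrow_array;
printf "ID<=N: %s (V=%s)  ID>N: %s\n",
       $id_le // 'none', $v // 'none', $id_gt // 'none';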
This is an answer to my own question. While I haven't yet found a better SQL query than the pair posted in the question, I made some preliminary speed measurements that shifted the perspective slightly. Here are my observations:
There is a big difference in search and insert performance depending on whether the ID values arrive in sequential or random order. In my application, IDs will be mostly sequential, but with plenty of exceptions.
Executing the pair of SQL queries takes less time than the sum of each query run separately. This is most visible with random order. It means that when the second query runs, the B-tree pages needed to reach the next ID are already in cache memory, so walking the index is faster the second time.
The search and insertion times per query increase with the number of rows. In sequential order the difference is small, but in random order the increase is substantial. A B-tree lookup is inherently O(log N), and in addition the OS page cache becomes less effective as the file size grows.
Here are my measurements on a fast server with SSD:
        |  Insertion (µs)   |  Search (µs)
# rows  |  sequ.   random   |  sequ.   random
--------+-------------------+------------------
10^5    |  0.60    0.9      |  1.1     1.3
10^6    |  0.64    3.1      |  1.2     2.5
10^7    |  0.66    4.3      |  1.2     3.0
10^8    |  0.70    5.6      |  1.3     4.2
10^9    |  0.73             |  1.3     4.6
My conclusion is that SQLite's internal logic doesn't seem to be the bottleneck for the planned algorithm. The bottleneck for huge tables is disk access, even on a fast SSD. I don't expect better performance from another database engine, nor from a custom-built B-tree.
I have a big table with around 4 billion records. The table is partitioned, but I need to perform the partitioning again. While doing the partitioning, the memory consumption of the HANA system reached its 4 TB limit and started impacting other systems.
How can we optimize the partitioning so that it completes without consuming that much memory?
To re-partition a table, both the original table structure and the new table structure need to be kept in memory at the same time.
For the target table structure, data is inserted into delta stores and later merged, which again consumes memory.
To increase performance, re-partitioning happens in parallel threads which, as you may guess, use additional memory.
The administration guide provides a hint to lower the number of parallel threads:
Parallelism and Memory Consumption
Partitioning operations consume a
high amount of memory. To reduce the memory consumption, it is
possible to configure the number of threads used.
You can change the
default value of the parameter split_threads in the partitioning
section of the indexserver.ini configuration file.
By default, 16 threads are used. In the case of a parallel partition/merge, the
individual operations use a total of the configured number of threads
for each host. Each operation takes at least one thread.
So, that's the online option to re-partition if your system does not have enough memory for parallel threads.
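For illustration only, since the guide just names the parameter: the setting lives in the partitioning section of indexserver.ini, so a lowered value would look something like this (4 is an example value, not a recommendation):
[partitioning]
split_threads = 4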
Alternatively, you may consider an offline re-partitioning that would involve exporting the table (as CSV!), truncating(!) the table, altering the partitioning on the now-empty table and re-importing the data.
Note that I wrote "truncate", as this preserves all privileges and references to the table (views, synonyms, roles, etc.) which would be lost if you dropped and recreated the table.
There's an SQLite database being used to store static-sized data in a round-robin fashion.
For example, 100 days of data are stored. On day 101, day 1 is deleted and then day 101 is inserted.
The number of rows is the same between days. The individual fields in the rows are all integers (32-bit or less) and timestamps.
The database is stored on an SD card with poor I/O speed,
something like a read speed of 30 MB/s.
VACUUM is not allowed because it can introduce a wait of several seconds
and the writers to that database can't be allowed to wait for write access.
So the concern is fragmentation, because I'm inserting and deleting records constantly
without VACUUMing.
But since I'm deleting/inserting the same set of rows each day,
will the data get fragmented?
Is SQLite fitting day 101's data in day 1's freed pages?
And although the set of rows is the same,
a given integer may take 1 byte one day and 4 bytes another.
The database also has several indexes, and I'm unsure where they're stored
and if they interfere with the perfect pattern of freeing pages and then re-using them.
(SQLite is the only technology that can be used. Can't switch to a TSDB/RRDtool, etc.)
SQLite will reuse free pages, so you will get fragmentation (if you delete so much data that entire pages become free).
However, SD cards are likely to have a flash translation layer, which introduces fragmentation whenever you write to some random sector.
Whether the first kind of fragmentation is noticeable depends on the hardware, and on the software's access pattern.
It is not possible to make useful predictions about that; you have to measure it.
In theory, WAL mode is append-only, and thus easier on the flash device.
However, checkpoints would be nearly as bad as VACUUMs.
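If you would rather measure than guess whether whole pages really are being freed and reused over the daily cycle, SQLite exposes the page counts directly. A small sketch (Perl's DBI with DBD::SQLite and the file name are my assumptions here):
#!/usr/bin/perl
# Report total pages and free-list pages; run before and after a daily
# delete/insert cycle to see whether the freed pages get reused.
use strict;
use warnings;
use DBI;
my $dbh = DBI->connect("dbi:SQLite:dbname=roundrobin.db", "", "",
                       { RaiseError => 1 });
my ($pages) = $dbh->selectrow_array("PRAGMA page_count");
my ($free)  = $dbh->selectrow_array("PRAGMA freelist_count");
my ($size)  = $dbh->selectrow_array("PRAGMA page_size");
printf "%d pages of %d bytes, %d on the free list (%.1f%% free)\n",
       $pages, $size, $free, $pages ? 100 * $free / $pages : 0;
If freelist_count settles back to roughly the same value after each cycle, that is a good sign the new day's rows are landing in the pages freed by the deleted day.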
Under what circumstances would it be better to implement a Dictionary ADT using a balanced binary search tree rather than a hash table?
My assumption was that it is always better to use a binary search tree because of its natural ordering.
But it's true that a hash table's search time can be as good as O(1), versus O(log n) for the binary tree,
so I'm not sure what the circumstances would be.
Hash tables might have a performance issue when they fill up and need to reallocate memory (a problem in the context of a hard real-time system). Binary trees don't have this issue.
Hash tables need more memory than they actually use, whereas binary trees use as much memory as they need.
Your question already contains the answer:
If you don't require any intrinsic ordering then use a hashtable for better performance. If your requirements demand some kind of ordering then consider using a tree.
The typical time complexities (average case for a hash table, worst case for a balanced BST) are:
--------------------------------------------
| Operation | Hash table | Balanced BST    |
--------------------------------------------
| Insert    | O(1)       | O(log(n))       |
--------------------------------------------
| Delete    | O(1)       | O(log(n))       |
--------------------------------------------
| Search    | O(1)       | O(log(n))       |
--------------------------------------------
So when would you use a BST rather than a hash table? Here are the main advantages of a BST:
With a balanced BST every operation is O(log(n)) even in the worst case, whereas resizing a hash table is a costly operation.
If you need the keys in sorted order, you can get them with an in-order traversal (see the sketch below); sorted iteration is not natural for a hash table.
Order statistics, like finding the closest lower and greater element, or answering range queries.
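To make the sorted-order point concrete with the Berkeley DB module used earlier on this page: keys stored through BerkeleyDB::Btree come back from keys() already in (byte-wise) sorted order, with no separate sort step. A minimal sketch; the file name and keys are arbitrary:
#!/usr/bin/perl
# Insert keys in arbitrary order, read them back sorted: the B-tree
# maintains key order as a side effect of its structure.
use strict;
use warnings;
use BerkeleyDB;
unlink "ordered.db";
my %h;
tie %h, 'BerkeleyDB::Btree', -Filename => "ordered.db", -Flags => DB_CREATE
    or die "cannot open ordered.db: $BerkeleyDB::Error";
$h{$_} = 1 for qw(pear apple quince banana);
print "$_\n" for keys %h;    # prints apple, banana, pear, quince
untie %h;
With BerkeleyDB::Hash the same loop prints the keys in an effectively arbitrary order.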