For a long-running algorithm based on SQLite3, I have a simple but huge table defined like this:
CREATE TABLE T(ID INTEGER PRIMARY KEY, V INTEGER);
The inner loop of the algorithm needs, given some integer N, to find the biggest ID that is less than or equal to N, the value V associated with it, as well as the smallest ID that is strictly greater than N.
The following pair of queries does work:
SELECT ID, V FROM T WHERE ID <= ? ORDER BY ID DESC LIMIT 1;
SELECT ID FROM T WHERE ID > ? LIMIT 1;
But I feel that it should be possible to merge those two queries into a single one: when SQLite has consulted the primary key index to find the ID just below N (first query), the next entry in the B-tree index is already the answer to the second query.
To give an order of magnitude, the table T has more than one billion rows, and the inner queries will need to be executed more than 100 billion times. Hence every microsecond counts. Of course I will use a fast SSD on a server with plenty of RAM. PostgreSQL could also be an option if it is quicker for this usage without taking more disk space.
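For illustration, here is a minimal sketch of how the inner loop could drive this pair of queries as prepared statements, using Python's sqlite3 module (the table T and its columns come from the schema above; the file name and helper function are only illustrative):

import sqlite3

con = sqlite3.connect("data.db")
cur = con.cursor()

# Biggest ID <= N together with its value V (first query from above).
Q_LE = "SELECT ID, V FROM T WHERE ID <= ? ORDER BY ID DESC LIMIT 1"
# ID just above N (second query from above). Adding ORDER BY ID would make
# "smallest ID > N" explicit; without it, LIMIT 1 relies on SQLite walking
# the primary key index in ascending order.
Q_GT = "SELECT ID FROM T WHERE ID > ? LIMIT 1"

def neighbours(n):
    below = cur.execute(Q_LE, (n,)).fetchone()  # (ID, V) or None
    above = cur.execute(Q_GT, (n,)).fetchone()  # (ID,) or None
    return below, above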
This is an answer to my own question. While I haven't yet found a better SQL query than the pair posted in the question, I made some preliminary speed measurements that shifted the perspective slightly. Here are my observations:
There is a big difference in search and insert performance depending on whether the ID values arrive in sequential or random order. In my application, the IDs will be mostly sequential, but with plenty of exceptions.
Executing the pair of SQL queries takes less time than the sum of the two queries executed separately. This is most visible with random order. It means that when the second query runs, the B-tree pages needed to reach the next ID are already in cache memory, so walking the index is faster the second time.
The search and insertion times per query increase with the number of rows. In sequential order the difference is small, but in random order the increase is substantial. Searching a B-tree is inherently O(log N), and in addition the OS page cache becomes less effective as the file size grows.
Here are my measurements on a fast server with SSD:
       | Insertion (µs)  | Search (µs)
# rows | sequ.   random  | sequ.   random
10^5   | 0.60    0.9     | 1.1     1.3
10^6   | 0.64    3.1     | 1.2     2.5
10^7   | 0.66    4.3     | 1.2     3.0
10^8   | 0.70    5.6     | 1.3     4.2
10^9   | 0.73            | 1.3     4.6
My conclusion is that SQLite's internal logic doesn't seem to be the bottleneck for the planned algorithm. The bottleneck for huge tables is disk access, even on a fast SSD. I don't expect better performance from another database engine, nor from a custom-made B-tree.
Related
I have a table with the following columns:
.create-merge table events
(
Time: datetime,
SiteId: int,
SiteCode: string,
...
)
Site ID and site code both uniquely identify a site, so theoretically it does not matter which one to use unless I need a certain data type in the output. However, I see a noticeable difference in performance between the queries:
events | summarize count() by SiteCode
~ 300 ms on a 150M-row table
events | summarize count() by SiteId
~ 560 ms on a 150M-row table
The difference is small in absolute terms, but the string query is almost twice as fast as the integer one (for consistent results, I issue the requests from a client in the same region). The string code consists of 10-20 characters and intuitively has a larger memory footprint than a 4-byte integer, so I would expect the string query to take longer, yet the opposite happens.
What could be the reason for that? Am I missing something fundamental about ADX internals?
Assuming that you are using EngineV3, you are seeing the impact of the dictionary encoding optimization implemented in this engine, where in certain cases string values are encoded as small and efficient int values, hence the better performance. As EngineV3 continues to improve, this optimization may be applied to int values as well.
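To illustrate what dictionary encoding means conceptually (this is only a sketch of the general idea, not of how ADX implements it internally): repeated string values are mapped once to small integer codes, and the aggregation then runs over those codes. The sample values below are made up.

# Conceptual sketch of dictionary encoding: map each distinct string
# to a small integer code, then aggregate over the codes.
from collections import Counter

site_codes = ["AMS-01", "FRA-02", "AMS-01", "AMS-01", "FRA-02"]  # hypothetical values

dictionary = {}                # string -> small int code
encoded = []
for s in site_codes:
    code = dictionary.setdefault(s, len(dictionary))
    encoded.append(code)

# "summarize count() by SiteCode", but computed on the int codes
counts_by_code = Counter(encoded)
decode = {code: s for s, code in dictionary.items()}
counts = {decode[c]: n for c, n in counts_by_code.items()}
print(counts)   # {'AMS-01': 3, 'FRA-02': 2}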
I have a bunch of documents. Right now only about 100,000, but I could potentially have millions. These documents are about 15 KB each.
Right now the way I calculate the partition key is to take the Id field from SQL, which is set to auto-increment by 1, and divide that number by 1000. I think this is not a good idea.
Sometimes I have to hit the CosmosDB very hard with parallel writes. When I do this, the documents usually have very closely grouped SQL Ids. For example, like this:
12000
12004
12009
12045
12080
12090
12102
As you can see, all of these documents would be written at the same time to the same partition because they would all have a partition key of 12. And from the documentation I've read, this is not good. I should be spreading my writes across partitions.
I'm considering changing this so that the PartitionKey is the SQL Id divided by 10,000, concatenated with the last digit, on the assumption that the Ids being written at the same time have roughly evenly distributed last digits (which they pretty much do).
So like this:
(12045 / 10000).ToString() + (12045 % 10).ToString()
This means, given my list above, the partition keys would be:
12000: 10
12004: 14
12009: 19
12045: 15
12080: 10
12090: 10
12102: 12
Instead of writing all 7 to a single partition, this will write all 7 to partitions 10, 12, 14, 15, and 19 (5 total). Will this result in faster write times? What are the effects on read time? Am I doing this right?
Also, is it better to have the first part of the key be the Id / 1000 or Id / 1000000? In other words, is it better to have lots of small partitions or should I aim to fill up the 10 GB limit of single partitions?
You should aim at evenly distributing the load between your partitions. 10 GB is the limit; you shouldn't aim to hit that limit (because that would mean you won't be able to add documents to the partition anymore).
Creating a synthetic partition key is a valid way to distribute your documents evenly between partitions. It's up to you to find or invent a key that fits your load pattern.
You could simply take the last digit of your Id, thus nicely spreading the documents over exactly 10 partition key values.
Regarding your comment on max partitions: the value of the partition key is hashed, and THAT hash determines the physical partition. So when your partition key has 1,000 possible values, it does not mean you have 1,000 physical partitions.
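To make the two schemes discussed here concrete, a small sketch (the function names are mine; only the arithmetic comes from the question and answer above):

def synthetic_key(sql_id: int) -> str:
    # Questioner's scheme: Id / 10,000 concatenated with the last digit.
    return str(sql_id // 10000) + str(sql_id % 10)

def last_digit_key(sql_id: int) -> str:
    # Simpler scheme from the answer: spread documents over 10 key values.
    return str(sql_id % 10)

for i in (12000, 12004, 12009, 12045, 12080, 12090, 12102):
    print(i, synthetic_key(i), last_digit_key(i))
# 12000 -> "10", 12004 -> "14", 12009 -> "19", 12045 -> "15", ...

Either way, the resulting string is only a logical partition key; as noted above, it is hashed, and the hash decides the physical partition.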
I have created a database with one single table (see the code below). I plan to insert 10 rows per minute, which is about 52 million rows over ten years.
My question is: what can I expect in terms of database capacity, and how long will it take to execute a select query? Of course, I know you cannot give me absolute values, but any tips on growth/speed rates, traps, etc. would be very welcome.
I should mention that there will be 10 different observations (this is why I will insert ten rows per minute).
create table if not exists my_table (
date_observation default current_timestamp,
observation_name text,
value_1 real(20),
value_1_name text,
value_2 real(20),
value_2_name text,
value_3 real(20),
value_3_name text);
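As a sketch of the intended write pattern against this schema, one batch of ten observations per minute, using Python's sqlite3 (the file name, observation names and values are made up):

import sqlite3

con = sqlite3.connect("observations.db")
con.execute("""create table if not exists my_table (
    date_observation default current_timestamp,
    observation_name text,
    value_1 real(20), value_1_name text,
    value_2 real(20), value_2_name text,
    value_3 real(20), value_3_name text)""")

def insert_batch(rows):
    # One transaction per minute keeps the per-row overhead low.
    with con:
        con.executemany(
            """insert into my_table
               (observation_name, value_1, value_1_name,
                value_2, value_2_name, value_3, value_3_name)
               values (?, ?, ?, ?, ?, ?, ?)""",
            rows)

# Hypothetical batch: the 10 observations for one minute.
insert_batch([(f"obs_{i}", 1.0, "temp", 2.0, "pressure", 3.0, "humidity")
              for i in range(10)])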
Database capacity exceeds known storage device capacity as per Limits In SQLite.
The more pertinent paragraphs are:
Maximum Number Of Rows In A Table
The theoretical maximum number of rows in a table is 2^64
(18446744073709551616 or about 1.8e+19). This limit is unreachable
since the maximum database size of 140 terabytes will be reached
first. A 140 terabytes database can hold no more than approximately
1e+13 rows, and then only if there are no indices and if each row
contains very little data.
Maximum Database Size
Every database consists of one or more "pages". Within a single
database, every page is the same size, but different databases can have
page sizes that are powers of two between 512 and 65536, inclusive.
The maximum size of a database file is 2147483646 pages. At the
maximum page size of 65536 bytes, this translates into a maximum
database size of approximately 1.4e+14 bytes (140 terabytes, or 128
tebibytes, or 140,000 gigabytes or 128,000 gibibytes).
This particular upper bound is untested since the developers do not
have access to hardware capable of reaching this limit. However, tests
do verify that SQLite behaves correctly and sanely when a database
reaches the maximum file size of the underlying filesystem (which is
usually much less than the maximum theoretical database size) and when
a database is unable to grow due to disk space exhaustion.
Speed has many aspects and is thus not a simple "how fast will it go", as with a car. The file system, the available memory, and the optimisations applied are all factors that need to be taken into consideration. As such, the answer amounts to the proverbial length of a piece of string.
Note that 18446744073709551616 applies only if you utilise negative rowid values; otherwise the more frequently quoted figure of 9223372036854775807 (the maximum of a 64-bit signed integer) is the limit.
To utilise negative rowid numbers, and therefore the higher range, you have to insert at least one negative value explicitly into the rowid (or an alias thereof), as per: "If no negative ROWID values are inserted explicitly, then automatically generated ROWID values will always be greater than zero."
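A minimal sketch of what inserting an explicit negative rowid looks like (the table and values are hypothetical):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table t(id integer primary key, v integer)")

# Automatically generated rowids are always greater than zero...
con.execute("insert into t(v) values (1)")
# ...so the negative half of the range is only used when a negative
# rowid is inserted explicitly, as in the quoted documentation above.
con.execute("insert into t(id, v) values (-1, 2)")
print(con.execute("select id, v from t order by id").fetchall())
# [(-1, 2), (1, 1)]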
Consider a 2-way set associative cache with the timing of its different units given in Table 2. Consider two possible designs for this cache: unpipelined and pipelined. (30 points)
a) What would be the cycle time for both caches? What is the maximum frequency at
which the processor can run in both cases?
b) Given that all memory references for some program hit in the cache, compare the
performance of the two caches in terms of cache access latency and throughput. Assume
that tag comparisons are performed using two separate comparators in both cases and
each of the units in the pipelined cache is used in one cycle only.
c) Assuming that the miss rate for both caches is 5%, and that the miss penalty is 10 ns for the unpipelined cache and 7 ns for the pipelined cache, compare the performance of the two caches in terms of AMAT.
Table 2. Cache Attributes
Unit                    | Delay (ns)
Cache Indexing          | 0.4
Reading Tags and Data   | 0.4
Writing Data into Cache | 0.3
Comparator              | 0.2
Block Multiplexor       | 0.2
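For part (c), the usual definition (not stated in the question itself, but the standard one) is:
AMAT = hit time + miss rate × miss penalty
evaluated with each design's own hit time and miss penalty.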
I am trying to understand what should drive the choice of the access method while using a BerkeleyDB : B-Tree versus HashTable.
A hash table provides O(1) lookup, but inserts are expensive (using linear/extensible hashing we get amortized O(1) inserts). B-trees provide O(log N) (logarithm base B, the branching factor) lookup and insert times. A B-tree can also support range queries and allows access in sorted order.
Apart from these considerations what else should be factored in?
If I don't need to support range queries can I just use a Hashtable access method?
When your data sets get very large, B-trees are still better because the majority of their internal metadata may still fit in cache. Hashes, by their nature (uniform random distribution of data), are inherently cache-unfriendly. That is, once the total size of the data set exceeds the working memory size, hash performance drops off a cliff while B-tree performance degrades gracefully (logarithmically, actually).
It depends on your data set and keys. On small data sets your benchmark results will be close to the same; however, on larger data sets they can vary depending on what type of keys you have and how much data. Usually B-tree is better, until the B-tree metadata exceeds your cache and it ends up doing lots of I/O; in that case hash is better. Also, as you pointed out, B-tree inserts are more expensive: if you find you will be doing lots of inserts and few reads, hash may be better; if you do few inserts but lots of reads, B-tree may be better.
If you are really concerned about performance you could try both methods and run your own benchmarks =]
For many applications, a database is accessed at random, interactively
or with "transactions". This might happen if you have data coming in
from a web server. However, you often have to populate a large
database all at once, as a "batch" operation. This might happen if you
are doing a data analysis project, or migrating an old database to a
new format.
When you are populating a database all at once, a B-Tree or other
sorted index is preferable because it allows the batch insertions to
be done much more efficiently. This is accomplished by sorting the
keys before putting them into the database. Populating a BerkeleyDB
database with 10 million entries might take an hour when the entries
are unsorted, because every access is a cache miss. But when the
entries are sorted, the same procedure might take only ten minutes.
The proximity of consecutive keys means you'll be utilizing various
caches for almost all of the insertions. Sorting can be done very
quickly, so the whole operation could be sped up by several times just
by sorting the data before inserting it. With hashtable indexing,
because you don't know in advance which keys will end up next to each
other, this optimization is not possible.
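Here is a minimal sketch of the sort-before-insert idea in Python, using the sqlite3 module as a stand-in for a B-tree-backed key store (the table and file names are made up; the point is only the sorted() call before the bulk insert):

import sqlite3

def bulk_load(keys, path="batch.db"):
    con = sqlite3.connect(path)
    con.execute("create table if not exists kv (k text primary key, v integer)")
    with con:  # one transaction for the whole batch
        con.executemany(
            "insert or ignore into kv (k, v) values (?, 1)",
            ((k,) for k in sorted(keys)))  # sorted keys touch consecutive, warm pages
    con.close()

# e.g. bulk_load(line.rstrip("\n") for line in open("enw.tab"))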
Update: I decided to provide an actual example. It is based on the
following script "db-test"
#!/usr/bin/perl
use warnings;
use strict;
use BerkeleyDB;
my %hash;
unlink "test.db";
# The first command-line argument selects the access method,
# e.g. BerkeleyDB::Btree or BerkeleyDB::Hash.
tie %hash, (shift), -Filename=>"test.db", -Flags=>DB_CREATE or die;
# Insert every input line as a key.
while(<>) { $hash{$_}=1; }
untie %hash;
We can test it with a Wikipedia dump index file of 16 million entries. (I'm running this on an 800MHz 2-core laptop, with 3G of memory)
$ >enw.tab bunzip2 <enwiki-20151102-pages-articles-multistream-index.txt.bz2
$ wc -l enw.tab
16050432 enw.tab
$ du -shL enw.tab
698M enw.tab
$ time shuf enw.tab > test-shuf
16.05s user 6.65s system 67% cpu 33.604 total
$ time sort enw.tab > test-sort
70.99s user 10.77s system 114% cpu 1:11.47 total
$ time ./db-test BerkeleyDB::Btree < test-shuf
682.75s user 368.58s system 42% cpu 40:57.92 total
$ du -sh test.db
1.3G test.db
$ time ./db-test BerkeleyDB::Btree < test-sort
378.10s user 10.55s system 91% cpu 7:03.34 total
$ du -sh test.db
923M test.db
$ time ./db-test BerkeleyDB::Hash < test-shuf
672.21s user 387.18s system 39% cpu 44:11.73 total
$ du -sh test.db
1.1G test.db
$ time ./db-test BerkeleyDB::Hash < test-sort
665.94s user 376.65s system 36% cpu 46:58.66 total
$ du -sh test.db
1.1G test.db
You can see that pre-sorting the Btree keys drops the insertion time
down from 41 minutes to 7 minutes. Sorting takes only 1 minute, so
there's a big net gain - the database creation goes 5x faster. With
the Hash format, the creation times are equally slow whether the input
is sorted or not. Also note that the database file size is smaller for
the sorted insertions; presumably this has to do with tree balancing.
The speedup must be due to some kind of caching, but I'm not sure
where. It is likely that we have fewer cache misses in the kernel's
page cache with sorted insertions. This would be consistent with the
CPU usage numbers - when there is a page cache miss, then the process
has to wait while the page is retrieved from disk, so the CPU usage is
lower.
I ran the same tests with two smaller files as well, for comparison.
File      | WP index | Wikt. words | /usr/share/dict/words
Entries   | 16e6     | 4.7e6       | 1.2e5
Size      | 700M     | 65M         | 1.1M
shuf time | 34s      | 4s          | 0.06s
sort time | 1:10s    | 6s          | 0.12s

             (total time, DB size, CPU usage)
-----------------------------------------------------------------------
           | WP index        | Wikt. words       | /usr/share/dict/words
-----------------------------------------------------------------------
Btree shuf | 41m, 1.3G, 42%  | 5:00s, 180M, 88%  | 6.4s, 3.9M, 86%
      sort |  7m, 920M, 91%  | 1:50s, 120M, 99%  | 2.9s, 2.6M, 97%
Hash  shuf | 44m, 1.1G, 39%  | 5:30s, 129M, 87%  | 6.2s, 2.4M, 98%
      sort | 47m, 1.1G, 36%  | 5:30s, 129M, 86%  | 6.2s, 2.4M, 94%
-----------------------------------------------------------------------
Speedup    | 5x              | 2.7x              | 2.2x
With the largest dataset, sorted insertions give us a 5x speedup.
With the smallest, we still get a 2x speedup - even though the data
fits easily into memory, so that CPU usage is always high. This seems
to imply that we are benefiting from another source of efficiency in
addition to the page cache, and that the 5x speedup was actually due
in equal parts to page cache and something else - perhaps the better
tree balancing?
In any case, I tend to prefer the Btree format because it allows
faster batch operations. Even if the final database is accessed at
random, I use batch operations for development, testing, and
maintenance. Life is easier if I can find a way to speed these up.
To quote the two main authors of Berkeley DB in this write-up of the architecture:
The main difference between Btree and Hash access methods is that
Btree offers locality of reference for keys, while Hash does not. This
implies that Btree is the right access method for almost all data
sets; however, the Hash access method is appropriate for data sets so
large that not even the Btree indexing structures fit into memory. At
that point, it's better to use the memory for data than for indexing
structures. This trade-off made a lot more sense in 1990 when main
memory was typically much smaller than today.
So perhaps in embedded devices and specialized use cases a hash table may work. B-trees are used in modern filesystems like Btrfs, and the B-tree is pretty much the ideal data structure for building either databases or filesystems.