LMDB: How to store large value sizes efficiently - openldap

I have been using LMDB to store key value pairs where the value sizes are of the order 200 Bytes. I am running into a scenarios where value sizes could grow upto 8KB or more.
According to: https://lmdb.readthedocs.io/en/release/#storage-efficiency-limits and https://github.com/lmdbjava/benchmarks/blob/master/results/20160710/README.md, LMDB is most efficient for value sizes in page size (4096KB) increments, otherwise it can lead to fragmentation due to overflow pages.
My main questions are:
Do I need to break down my value into page size increments for optimal performance?
Are lexicographic sorted keys in LMDB placed in adjacent pages? Let's say my value is about 14KB and I break it down into 8K, 4K and 2K chunks, with key values :key_chunk1, key_chunk_2, key_chunk_3, will they be in adjacent pages? Let's say the last chunk (The 2KB value) is on a new page, and the next lexicographically sorted key is of 4K, will this be in a new page as it cannot fit in the existing page?

Related

What will happen to my sqlite database ten years from now in terms of capacity and query speed

I have created a database with one single table (check the code bellow). I plan to insert 10 rows per minute, which is about 52 million rows in ten years from now.
My question is, what can I expect in terms of database capacity and how long it will take to execute select query. Of course, I know you can not provide me an absolute values, but if you can give me any tips on change/speed rates, traps etc. I would be very glad.
I need to tell you, there will be 10 different observations (this is why I will insert ten rows per minute).
create table if not exists my_table (
date_observation default current_timestamp,
observation_name text,
value_1 real(20),
value_1_name text,
value_2 real(20),
value_2_name text,
value_3 real(20),
value_3_name text);
Database capacity exceeds known storage device capacity as per Limits In SQLite.
The more pertinent paragraphs are :-
Maximum Number Of Rows In A Table
The theoretical maximum number of rows in a table is 2^64
(18446744073709551616 or about 1.8e+19). This limit is unreachable
since the maximum database size of 140 terabytes will be reached
first. A 140 terabytes database can hold no more than approximately
1e+13 rows, and then only if there are no indices and if each row
contains very little data.
Maximum Database Size
Every database consists of one or more "pages". Within a single
database, every page is the same size, but different database can have
page sizes that are powers of two between 512 and 65536, inclusive.
The maximum size of a database file is 2147483646 pages. At the
maximum page size of 65536 bytes, this translates into a maximum
database size of approximately 1.4e+14 bytes (140 terabytes, or 128
tebibytes, or 140,000 gigabytes or 128,000 gibibytes).
This particular upper bound is untested since the developers do not
have access to hardware capable of reaching this limit. However, tests
do verify that SQLite behaves correctly and sanely when a database
reaches the maximum file size of the underlying filesystem (which is
usually much less than the maximum theoretical database size) and when
a database is unable to grow due to disk space exhaustion.
Speed determination has many aspects and is thus not a simple how fast will it go, like a car. The file system, the memory, optimisation are all factors that need to be taken into consideration. As such the answer is the same as the length of the anecdotal piece of string.
Note 18446744073709551616 is if you utilise negative numbers otherwise the more frequently mentioned number of 9223372036854775807 is the limit (i.e a 64 bit signed integer)
To utilise negative rowid numbers and therefore the higher range you have to insert at least 1 negative value explicitly into a rowid or alias thereof as per If no negative ROWID values are inserted explicitly, then automatically generated ROWID values will always be greater than zero.

How JSON format affects storage and capacity units

I have this rudimentary question about DynamoDb -- is it worth shortening attribute names and removing whitespace in order to save on throughput and storage?
I am planning to store millions of items that looks like this:
{
"currency": "USD",
"openPrice": 0.1,
"closePrice": 0.1,
"highPrice": 0.1,
"lowPrice": 0.1
}
if I reformat this JSON fragment to look like this:
{"c":"USD","op":0.1,"cp":0.1,"hp":0.1,"lp":0.1}
would this shorter JSON result in cost savings because of lower storage and fewer throughput capacity units?
Thanks.
Yes, attribute names are factored into the total item size. This becomes important in one of two scenarios:
If your access patterns involve scans; or queries that expect to retrieve many items - because in these cases you get billed for aggregated size (ie sum of all accessed items) so small items use consumed capacity more efficiently than larger items
If you have items with many attributes such that the size of the attribute names and the size of their values pushes the total item size past the 1KB boundary - this is important because write capacity is billed in 1KB increments
I would use common sense in naming: prefer short and concise naming but try to avoid cryptic names (e.g. c is not a great name, but neither is customerDataItemIdentifier - a better choice might be custId)

Load factor of hash tables with tombstones

So the question came up about whether tombstones should be included when calculating the load factor of a hash table.
I thought that, given that the load factor is used to determine when to expand capacity, tombstones should not be included. An obvious example is if you almost fill and then remove every value in a hash table. Here insertions are super easy (no collisions) so I believe the load factor shouldn't include them.
But you could look at this and think that with all the tombstones lookups will be slow (potentially searching almost the entire space).
So I thought I'd ask the question. Should the load factor of a hashtable include tombstones in the calculation?
Load factor is not an essential part of hash table data structure -- it is the way to define rules of behaviour for the dymamic system (growing/shrinking hash table is a dynamic system).
Moreover, in my opinion, in 95% of modern hash table cases this way is over simplified, dynamic systems behave suboptimally. It's advantages:
Well, simplicity of understanding and implementation.
Hash table data structure shouldn't store many numbers with some thresholds -- likely only one number. This is meaningful when hash table is very small and the size of the header affects total data structure memory efficiency (in bytes to store an entry).
In certain (and common) case: append/update only hash table, more complex models of behaviour degenerate to the "just load factor" model, in other words, load factor model defines relatively optimal behaviour.
See also my answer on load factor model. I prefer [min load, target load, max load] + growth factor frame model.
If you develop general-purpose hash table with tombstones, I think you can just pick up my results (below). I spend maybe several weeks solely developing this model. Maybe you can make some improvements or further research, I would be glad.
Two main hash table dynamic behaviour patterns are targeted:
growing hash table (maybe in growing phase), with little or no removals
initial fill of hash table, when proper capacity was not specified (or unknown)
hash table that remains of the same or nearly the same size, number of removals is equal or nearly equal to number of insertions
caches with upper size bound, LRUs, tables with entry expires
Two thresholds are defined:
max size (i. e. number of alive entries), table size * max load
min number of free (i. e. empty, without alive entry nor tombstone) slots, computed by magic formula.
If hash table size exceeds max size, we assume we are in the "growing pattern", rehash to the table size to be able to store current size * growth factor entries, i. e. choose table size closest possible to current size * growth factor / target load.
If the number of free slots becomes below than min number of free slots, we are in "cache pattern", rehash "to the current size", i. e. to the table size closest possible to current size / target load.
Read the source where all the above logics are coded.
Also, article Tombstones purge from hashtable: theory and practice sheds some light.
If you develop specially purposed hash table, which dymanic properties are known (or could be studied), I recommend you to develop your own model, fitting your case. Don't rely on pure math and CS theory, evaluate your model in benchmarks.

How consumed throughput is influenced by write into local secondary index with no change in data?

Condider a table A with index A-index. I write around 100 items into A in batches (using PutRequest within BatchWriteItem).
If I repeat the operation with the same set of items, they will be just replacing the existing items. But how does that impact the local secondary index? Since it's a complete replace, does it replace in index also, thereby consuming throughput there too? Or does it figure out the items are exactly same and hence doesn't perform any operation, thereby resulting in no additional consumed throughput for index?
Found the answer by running a trial program and noticing the results in ConsumedCapacity attribute for table and indices.
During replace, if there are no changes, the consumed throughput is not calculated as DynamoDB figures out it's exactly the same. But if there are changes, throughput per item is calculated.

DynamoDB read and write

what constitutes an actual read in DynamoDB?
is it reading every line in a table or what data is returned?
is this why a scan is so expensive - you read the entire table and are charged for every table line that is read?
Can you put ElasticCache (Memcached) in front of DynamoDB to keep the cost down?
Finally are you charged for a query that yields no results?
See this link: http://aws.amazon.com/dynamodb/faqs/
1 Write = 1 Write per second for an item up to 1Kb in size.
1 Read = 2 Reads per second for an item up to 1Kb in size, or 1 per second if you required fully consistent results.
For example, if your items are 512 bytes and you need to read 100
items per second from your table, then you need to provision 100 units
of Read Capacity.
If your items are larger than 1KB in size, then you should calculate
the number of units of Read Capacity and Write Capacity that you need.
For example, if your items are 1.5KB and you want to do 100
reads/second, then you would need to provision 100 (read per second) x
2 (1.5KB rounded up to the nearest whole number) = 200 units of Read
Capacity.
Note that the required number of units of Read Capacity is determined
by the number of items being read per second, not the number of API
calls. For example, if you need to read 500 items per second from your
table, and if your items are 1KB or less, then you need 500 units of
Read Capacity. It doesn’t matter if you do 500 individual GetItem
calls or 50 BatchGetItem calls that each return 10 items.
The above applies to all the usual methods, GET, BATCH X & QUERY.
SCAN is a little different, they don't document exactly how they calculate the usage but they do offer the following:
The Scan API will iterate through your entire dataset and apply the
filter conditions to every row. Since only 1MB of data can be scanned
at a time, you may need to do multiple round trips (using a
continuation token) to complete the scan. Further, using this API may
consume much of your provisioned read throughput. Hence, this method
has limited scaling characteristics and we do not recommend that you
use it as a part of your application’s regular behavior.
So to answer your question directly: The calculation is made on what data is returned in all cases except for SCAN, where there isn't really any clear indication on how they charge. A query that yields no results will not cost you anything.
You can definitely set up a caching system infront of Dynamo, definitely recommend you look into that if you want to keep your reads down.
Hope that helps!

Resources