Is it possible to get the average Berkeley DB record size?

I'm using db_stat to get approximate number of records in the BDB (to avoid iteration over the whole database):
[me@home magic]$ db_stat -d random.db
Thu Mar 3 13:38:25 2016 Local time
61561 Hash magic number
8 Hash version number
Little-endian Byte order
Flags
643 Number of pages in the database
4096 Underlying database page size
0 Specified fill factor
2340 Number of keys in the database
2340 Number of data items in the database
299 Number of hash buckets
303540 Number of bytes free on bucket pages (75% ff)
15 Number of overflow pages
39282 Number of bytes free in overflow pages (36% ff)
114 Number of bucket overflow pages
322730 Number of bytes free in bucket overflow pages (30% ff)
0 Number of duplicate pages
0 Number of bytes free in duplicate pages (0% ff)
1 Number of pages on the free list
Is it possible to get the average record size as well?
I guess I can use the following info to get the overall size:
643 Number of pages in the database
4096 Underlying database page size
643*4096 = 2633728 bytes (which corresponds with the file size), giving an approximate record size of 2633728/2340 ≈ 1125 bytes.
So my question: would using additional info from the db_stat output give me a more accurate result?

You've computed the upper bound on average record size:
643 pages * 4096 bytes / page = 2633728 bytes total
2633728 bytes / 2340 keys (records) = 1126 bytes / record
You can get closer to the truth by subtracting all the "bytes free on XXX pages" from the total. This is space that's not in use by the database because of inefficiencies in how it was populated. (As an aside, this doesn't look too bad, but whenever there are a significant number of overflow pages, you could consider a larger page size. Of course, there are downsides to larger page sizes too. Yay, databases!)
2633728 bytes
- 303540 bytes free on bucket pages
- 39282 bytes free in overflow pages
- 322730 bytes free in bucket overflow pages
- 0 bytes free in duplicate pages
--------
1968176 bytes total / 2340 keys = 841 bytes / record
This figure still isn't really the average record size, but I think it's as close as you can get from db_stat. It includes the supporting database structure for each record, and other database overhead.
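For illustration, here is the same arithmetic as a small Python sketch (the numbers are copied from the db_stat output above; it only reproduces the estimate and is not something db_stat reports directly):
# Estimate the average record size from db_stat numbers (approximation only).
pages      = 643
page_size  = 4096
keys       = 2340
free_bytes = 303540 + 39282 + 322730 + 0   # bytes free on bucket/overflow/duplicate pages
total_bytes = pages * page_size            # 2633728, the upper bound
used_bytes  = total_bytes - free_bytes     # 1968176
print("upper bound per record:", round(total_bytes / keys))  # ~1126 bytes
print("closer estimate:", round(used_bytes / keys))          # ~841 bytes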

Related

What counts as one read in DynamoDB?

The AWS documentation states:
"For provisioned mode tables, you specify throughput capacity in terms of read capacity units (RCUs) and write capacity units (WCUs):
One read capacity unit represents **one strongly consistent read per second**, or two eventually consistent reads per second, for an item up to 4 KB in size."
But what counts as one read? If I loop through different partitions to read from DynamoDB, will each loop iteration count as one read? Thank you.
Reference: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html
For GetItem and BatchGetItem operations, which read an individual item, the size of the entire item is used to calculate the amount of RCUs (read capacity units) used, even if you only ask to read specific parts of the item. As you quoted, this size is then rounded up to a multiple of 4 KB: if the item is 3.9 KB you'll pay one RCU for a strongly consistent read (ConsistentRead=true), and two RCUs for a 4.1 KB item. Again, as you quoted, if you ask for an eventually consistent read (ConsistentRead=false) the number of RCUs is halved.
For transactions (TransactGetItems) the number of RCUs is double what it would have been with consistent reads.
For scans - Scan or Query - the cost is calculated the same way as reading a single item, except for one piece of good news: the rounding up happens on the entire size read, not on each individual item. This is very important for small items. For example, say you have items of 100 bytes each. Reading each one individually costs you one RCU even though it's only 100 bytes, not 4 KB. But if you Query a partition that has 40 of these items, the total size of those 40 items is 4000 bytes, so you pay just one RCU to read all 40 items, not 40 RCUs. If the size of the entire partition is 4 MB, you'll pay 1024 RCUs when ConsistentRead=true, or 512 RCUs when ConsistentRead=false, to read the entire partition, regardless of how many items it contains.
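To make the arithmetic concrete, here is a small Python sketch of the rounding rules described above (a simplified model, not an official AWS calculator):
import math
RCU_UNIT = 4 * 1024  # 4 KB
def item_read_rcu(item_size_bytes, consistent=True):
    # GetItem / BatchGetItem: each item is rounded up to a 4 KB multiple on its own.
    units = math.ceil(item_size_bytes / RCU_UNIT)
    return units if consistent else units / 2
def query_read_rcu(item_sizes_bytes, consistent=True):
    # Query / Scan: the cumulative size of the processed items is rounded up once.
    units = math.ceil(sum(item_sizes_bytes) / RCU_UNIT)
    return units if consistent else units / 2
print(item_read_rcu(100) * 40)            # 40 RCUs: 40 items of 100 bytes read one by one
print(query_read_rcu([100] * 40))         # 1 RCU: the same 40 items read with a single Query
print(query_read_rcu([4 * 1024 * 1024]))  # 1024 RCUs: a 4 MB partition, ConsistentRead=true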

DynamoDB Batch Write Item Limits

I am currently doing a batch load to DynamoDB and dividing our data items into batch units:
According to the limits documentation:
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
Some of the limits are:
There are more than 25 requests in the batch.
Any individual item in a batch exceeds 400 KB.
The total request size exceeds 16 MB.
The big unknown for me is how, with 25 items of at most 400 KB each (10 MB in total), the payload could possibly exceed 16 MB. Even accounting for table names of less than 255 bytes, etc., I don't understand the limit. Or am I missing something simple?
Thanks.
The 16 MB limit is actually on the total size of the request. Consider items made up of many small attributes: the DynamoDB request map for them can be larger than the combined size of the item data itself.
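If it helps, here is a minimal Python sketch of one way to chunk items against those limits; it assumes plain dict items and uses the JSON-serialised length only as a rough proxy for the real request size, which also includes table names, attribute names and protocol structure:
import json
MAX_ITEMS_PER_BATCH = 25
MAX_REQUEST_BYTES = 16 * 1024 * 1024  # 16 MB total request size
def batches(items):
    # Yield BatchWriteItem-sized chunks of at most 25 items, keeping a rough
    # running total of the serialised size so a batch stays under 16 MB.
    batch, batch_bytes = [], 0
    for item in items:
        item_bytes = len(json.dumps(item))
        if batch and (len(batch) == MAX_ITEMS_PER_BATCH
                      or batch_bytes + item_bytes > MAX_REQUEST_BYTES):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(item)
        batch_bytes += item_bytes
    if batch:
        yield batch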

What will happen to my sqlite database ten years from now in terms of capacity and query speed

I have created a database with one single table (see the code below). I plan to insert 10 rows per minute, which is about 52 million rows ten years from now.
My question is: what can I expect in terms of database capacity, and how long will it take to execute a SELECT query? Of course, I know you cannot give me absolute values, but I would be very glad for any tips on growth/speed rates, traps, etc.
I should add that there will be 10 different observations (which is why I will insert ten rows per minute).
create table if not exists my_table (
    date_observation default current_timestamp,
    observation_name text,
    value_1 real(20),
    value_1_name text,
    value_2 real(20),
    value_2_name text,
    value_3 real(20),
    value_3_name text
);
Database capacity exceeds known storage device capacity as per Limits In SQLite.
The more pertinent paragraphs are:
Maximum Number Of Rows In A Table
The theoretical maximum number of rows in a table is 2^64
(18446744073709551616 or about 1.8e+19). This limit is unreachable
since the maximum database size of 140 terabytes will be reached
first. A 140 terabytes database can hold no more than approximately
1e+13 rows, and then only if there are no indices and if each row
contains very little data.
Maximum Database Size
Every database consists of one or more "pages". Within a single
database, every page is the same size, but different databases can have
page sizes that are powers of two between 512 and 65536, inclusive.
The maximum size of a database file is 2147483646 pages. At the
maximum page size of 65536 bytes, this translates into a maximum
database size of approximately 1.4e+14 bytes (140 terabytes, or 128
tebibytes, or 140,000 gigabytes or 128,000 gibibytes).
This particular upper bound is untested since the developers do not
have access to hardware capable of reaching this limit. However, tests
do verify that SQLite behaves correctly and sanely when a database
reaches the maximum file size of the underlying filesystem (which is
usually much less than the maximum theoretical database size) and when
a database is unable to grow due to disk space exhaustion.
Speed determination has many aspects and is thus not a simple matter of how fast it will go, like a car. The file system, the memory and optimisation are all factors that need to be taken into consideration. As such, the answer is the same as the length of the anecdotal piece of string.
Note that 18446744073709551616 is the limit if you utilise negative rowid values; otherwise the more frequently mentioned figure of 9223372036854775807 is the limit (i.e. a 64-bit signed integer).
To utilise negative rowid values, and therefore the higher range, you have to insert at least one negative value explicitly into the rowid (or an alias thereof), as per the documentation: "If no negative ROWID values are inserted explicitly, then automatically generated ROWID values will always be greater than zero."
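For a rough feel of the numbers involved, here is a small Python sketch (the 150 bytes per row is only a guessed average for the table above, not a measured figure):
ROWS_PER_MINUTE = 10
YEARS = 10
BYTES_PER_ROW = 150  # guessed average, including per-row overhead
rows = ROWS_PER_MINUTE * 60 * 24 * 365 * YEARS
size_bytes = rows * BYTES_PER_ROW
print(f"rows after {YEARS} years: {rows:,}")                # ~52.6 million
print(f"approximate size: {size_bytes / 1024**3:.1f} GiB")  # far below SQLite's limits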

Sqlite on embedded device

Scenario: I want to use SQLite on an embedded device with an eMMC instead of a magnetic hard disk. All flash memories have a limited number of write cycles. All recent memories have a wear-levelling system that increases the lifetime of the memory: each write is distributed over the whole address space of the device (a mapping between logical and physical addresses). The main problem with flash memory is the write amplification factor (WAF): when you want to write some data to the memory, the minimum amount that will be written is a whole memory page (1, 2 or 4 KB depending on the memory). So if you want to write 1 bit or 900 bytes, you will still write 1 page of 1 KB (for example).
Suppose I have an SQLite table with id (integer autoincrement), timestamp (integer, indexed) and data (string, not indexed).
Is it possible to predict (an overestimate of) the number of bytes written for each INSERT?
Scenario example: INSERT INTO table (timestamp,data) VALUES (140909090,"The data limited to 100 bytes").
Note that in my scenario the timestamp will normally increase, because it is the real timestamp of the data logging.
It's possible to predict that for each insert 8 bytes (id) + 8 bytes (timestamp) + at most 100 bytes (data) will be written. But what about the write overhead of the id and timestamp indexes?
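As a back-of-the-envelope starting point only (it deliberately ignores the index writes and page-level write amplification that the question is really about), the raw payload estimate from the question looks like this in Python:
# Naive per-insert payload estimate; it does NOT account for the rowid and
# timestamp index B-tree writes or the flash page write amplification.
ID_BYTES = 8
TIMESTAMP_BYTES = 8
DATA_BYTES_MAX = 100
PAGE_BYTES = 1024  # assumed flash page size from the question's example
payload = ID_BYTES + TIMESTAMP_BYTES + DATA_BYTES_MAX
print("raw payload per insert:", payload, "bytes")                        # 116 bytes
print("flash pages touched (payload only):", -(-payload // PAGE_BYTES))   # at least 1 page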

Read throughput in DynamoDB

OK, so my understanding of read units is that it costs 1 read unit per item, unless the item exceeds 4 KB, in which case read units = ceiling(item size / 4 KB).
However, when I submit a Scan asking for 80 items (provisioned throughput is 100), the response returns a ConsumedCapacity of either 2.5 or 3 read units. This is frustrating because 97% of the provisioned capacity is not being used. Any idea why this might be the case?
What is your item size for the 80 items? Looking at the documentation here: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ProvisionedThroughputIntro.html
You can use the Query and Scan operations in DynamoDB to retrieve
multiple consecutive items from a table in a single request. With
these operations, DynamoDB uses the cumulative size of the processed
items to calculate provisioned throughput. For example, if a Query
operation retrieves 100 items that are 1 KB each, the read capacity
calculation is not (100 × 4 KB) = 100 read capacity units, as if those
items were retrieved individually using GetItem or BatchGetItem.
Instead, the total would be only 25 read capacity units ((100 * 1024
bytes) = 100 KB, which is then divided by 4 KB).
So if your items are small, that would explain why Scan is not consuming as much capacity as you would expect. Also, note that Scan uses eventually consistent reads by default, which consume half of the read capacity units.
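For example, here is a small Python sketch of that calculation (the 256-byte item size is purely illustrative, not taken from the question):
import math
def scan_rcu(total_bytes, consistent=False):
    # Scan/Query: the cumulative size of the processed items is rounded up to a
    # 4 KB multiple, then halved for eventually consistent reads (the Scan default).
    units = math.ceil(total_bytes / (4 * 1024))
    return units if consistent else units / 2
print(scan_rcu(80 * 256))  # 2.5 consumed capacity units for 80 items of ~256 bytes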
