I want to build a 20,000,000-record table in SQLite, but the database file comes out slightly larger than its tab-separated plaintext representation.
What are the ways to optimize storage size that are specific to SQLite?
Details:
Each record has:
8 integers
3 enums (represented for now as 1-byte text)
7 text fields
I suspect that the numbers are not stored efficiently (value range 10,000,000 to 900,000,000).
According to the docs, I expect each of them to take 3-4 bytes if stored as a number, and 8-9 bytes if stored as text (perhaps plus a termination byte or size-indicator byte), meaning roughly a 1:2 ratio between storing as int and storing as text.
But it doesn't appear so.
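As a quick sanity check, typeof() shows whether a value really went in as an integer or as text (a TSV import can easily produce the latter); a minimal sketch with hypothetical table and column names:

import sqlite3

# Minimal sketch, hypothetical names: compare how SQLite reports the same
# value stored in an INTEGER column vs. a TEXT column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (as_int INTEGER, as_text TEXT)")
con.execute("INSERT INTO t VALUES (?, ?)", (123456789, "123456789"))

row = con.execute(
    "SELECT typeof(as_int), typeof(as_text), length(as_text) FROM t"
).fetchone()
print(row)  # ('integer', 'text', 9) -- as text, the value costs one byte per digit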
Your integers in that range will take 4 bytes each: SQLite's 3-byte serial type is a signed 24-bit integer covering values only up to 2^23 - 1 (about 8,400,000), so anything from 10,000,000 upward needs the 4-byte type. Additionally, SQLite stores at least one byte per column as type/size information in the record header (also for your 1-byte texts, so those cost 2 bytes each in total).
Some questions:
Do you use a compound primary key or a primary key other than a plain integer?
Do you use other indexes?
Did you try to vacuum the database (command VACUUM)? An SQLite database is not auto-vacuumed by default, so when data is deleted, the space stays reserved.
One further question:
Do you already have all 20,000,000 entries, or fewer? For small databases the storage overhead can be much larger than the real content size.
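To see how much of the file is actually free space before deciding whether a VACUUM is worth running, the page-level pragmas are enough; a minimal sketch (the file name is a placeholder):

import sqlite3

# Minimal sketch: "mydata.db" is a placeholder file name.
con = sqlite3.connect("mydata.db")

page_size = con.execute("PRAGMA page_size").fetchone()[0]
page_count = con.execute("PRAGMA page_count").fetchone()[0]
freelist = con.execute("PRAGMA freelist_count").fetchone()[0]

print(f"file size ~ {page_size * page_count} bytes, "
      f"of which {page_size * freelist} bytes are on the free list")

# Rebuild the file to reclaim free pages (can take a while on 20M rows).
con.execute("VACUUM")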
Related
I have created a database with one single table (see the code below). I plan to insert 10 rows per minute, which adds up to about 52 million rows ten years from now.
My question is what I can expect in terms of database capacity and how long it will take to execute a select query. Of course, I know you cannot give me absolute values, but if you can give me any tips on growth/speed rates, traps, etc., I would be very glad.
I should mention that there will be 10 different observations (this is why I will insert ten rows per minute).
create table if not exists my_table (
    date_observation default current_timestamp,
    observation_name text,
    value_1 real(20),
    value_1_name text,
    value_2 real(20),
    value_2_name text,
    value_3 real(20),
    value_3_name text);
The database capacity far exceeds the capacity of any known storage device, as per Limits In SQLite.
The more pertinent paragraphs are:
Maximum Number Of Rows In A Table
The theoretical maximum number of rows in a table is 2^64
(18446744073709551616 or about 1.8e+19). This limit is unreachable
since the maximum database size of 140 terabytes will be reached
first. A 140 terabytes database can hold no more than approximately
1e+13 rows, and then only if there are no indices and if each row
contains very little data.
Maximum Database Size
Every database consists of one or more "pages". Within a single
database, every page is the same size, but different databases can have
page sizes that are powers of two between 512 and 65536, inclusive.
The maximum size of a database file is 2147483646 pages. At the
maximum page size of 65536 bytes, this translates into a maximum
database size of approximately 1.4e+14 bytes (140 terabytes, or 128
tebibytes, or 140,000 gigabytes or 128,000 gibibytes).
This particular upper bound is untested since the developers do not
have access to hardware capable of reaching this limit. However, tests
do verify that SQLite behaves correctly and sanely when a database
reaches the maximum file size of the underlying filesystem (which is
usually much less than the maximum theoretical database size) and when
a database is unable to grow due to disk space exhaustion.
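To put your numbers into that context, a back-of-the-envelope estimate of the on-disk size of 52 million rows is easy to do; a minimal sketch (the average row size and page size are assumptions):

# Minimal sketch: rough size estimate; average row size and page size are guesses.
rows = 10 * 60 * 24 * 365 * 10           # 10 rows/minute for 10 years ~ 52.6 million
avg_row_bytes = 120                      # assumed: timestamp + 4 short texts + 3 reals + overhead
page_size = 4096                         # common default; check PRAGMA page_size
payload_per_page = int(page_size * 0.9)  # rough allowance for page headers/cell pointers

pages = -(-rows * avg_row_bytes // payload_per_page)   # ceiling division
print(f"~{rows:,} rows, roughly {pages * page_size / 1e9:.1f} GB before indexes")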
Speed has many aspects and is thus not a simple "how fast will it go", as with a car. The file system, the available memory, and how the database is optimised are all factors that need to be taken into consideration. As such, the answer is the proverbial "how long is a piece of string".
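Since the only reliable numbers come from your own hardware and data, a rough benchmark is easy to put together; a minimal sketch (the index on date_observation and the query below are assumptions about how you will use the table):

import sqlite3, time

# Minimal sketch: time a range query on your own hardware.
# The index and the date range below are assumptions for illustration.
con = sqlite3.connect("observations.db")
con.execute("""create table if not exists my_table (
    date_observation default current_timestamp,
    observation_name text,
    value_1 real, value_1_name text,
    value_2 real, value_2_name text,
    value_3 real, value_3_name text)""")
con.execute("create index if not exists idx_date on my_table(date_observation)")

t0 = time.perf_counter()
rows = con.execute(
    "select * from my_table "
    "where date_observation between '2020-01-01' and '2020-02-01'"
).fetchall()
print(f"{len(rows)} rows in {time.perf_counter() - t0:.3f} s")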
Note that 18446744073709551616 applies only if you utilise negative rowids; otherwise the more frequently mentioned figure of 9223372036854775807 (i.e. a 64-bit signed integer) is the limit.
To utilise negative rowid numbers, and therefore the higher range, you have to insert at least one negative value explicitly into a rowid (or an alias thereof), as per the documentation: "If no negative ROWID values are inserted explicitly, then automatically generated ROWID values will always be greater than zero."
So I am looking for feedback on the issues and benefits around storing many hundreds of millions of rows of memory-address data.
Each address consists of 8 x UInt32 positions. I was initially going to store the data in 8 columns and put an 8-way multipart index on those columns.
CREATE TABLE address (position1 INTEGER, position2 INTEGER, positionN ......)
But it then dawned on me that I could also have a single BLOB column, just store the 32 bytes of packed data, and index that.
CREATE TABLE address (position BLOB)
So I guess it becomes a question about any known efficiencies or detrimental effects of looking up on an 8-way index vs a single larger binary index, both with high cardinality.
There will be duplicate, timestamped entries for the same positions, although they will be fairly infrequent.
The table will be read from and added to constantly, if that helps, with a 3:1 read/write ratio.
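For what it's worth, packing the 8 positions into a fixed-width BLOB is straightforward from client code; a minimal sketch in Python showing both layouts side by side (the table names and the big-endian layout are my own assumptions):

import sqlite3, struct

# Minimal sketch, hypothetical names: the 8-column layout vs. the packed BLOB layout.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE address_cols (
    position1 INTEGER, position2 INTEGER, position3 INTEGER, position4 INTEGER,
    position5 INTEGER, position6 INTEGER, position7 INTEGER, position8 INTEGER)""")
con.execute("CREATE INDEX idx_cols ON address_cols(position1, position2, position3, "
            "position4, position5, position6, position7, position8)")

con.execute("CREATE TABLE address_blob (position BLOB)")
con.execute("CREATE INDEX idx_blob ON address_blob(position)")

positions = (1, 2, 3, 4, 5, 6, 7, 0xFFFFFFFF)

con.execute("INSERT INTO address_cols VALUES (?,?,?,?,?,?,?,?)", positions)
# Big-endian packing keeps the BLOB's byte-wise sort order consistent with numeric order.
con.execute("INSERT INTO address_blob VALUES (?)", (struct.pack(">8I", *positions),))

blob = con.execute("SELECT position FROM address_blob").fetchone()[0]
print(len(blob), struct.unpack(">8I", blob))   # 32 (1, 2, ..., 4294967295)

One design note on the BLOB variant: because the packing is big-endian, range scans over the BLOB index order the rows the same way the 8-column index would.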
Many thanks.
What is the storage space for a Number type in DynamoDB versus a String type?
Say I have a number (1234789). If I store it as a Number type, will it take just 4 bytes, while as a String it will take 7 bytes?
Does DynamoDB store all numbers as BigDecimal?
DynamoDB is a managed cloud service, so I think the way they store data internally is simply not made public.
However, they transfer Numbers as Strings for cross-language compatibility, and one of the things that affects RCU/WCU is the size of the transferred data.
So, as far as your concern is calculating provisioned throughput and costs, a Number's size should be considered to be its String size.
As per the DynamoDB documentation on data types:
String
Strings are Unicode with UTF-8 binary encoding. The length of a string
must be greater than zero, and is constrained by the maximum DynamoDB
item size limit of 400 KB.
If you define a primary key attribute as a string type attribute, the
following additional constraints apply:
For a simple primary key, the maximum length of the first attribute value (the partition key) is 2048 bytes.
For a composite primary key, the maximum length of the second attribute value (the sort key) is 1024 bytes.
Number
Numbers can be positive, negative, or zero. Numbers can have up to 38
digits precision—exceeding this will result in an exception.
Positive range: 1E-130 to 9.9999999999999999999999999999999999999E+125
Negative range: -9.9999999999999999999999999999999999999E+125 to -1E-130
In DynamoDB, numbers are represented as variable length. Leading and
trailing zeroes are trimmed.
All numbers are sent across the network to DynamoDB as strings, to
maximize compatibility across languages and libraries. However,
DynamoDB treats them as number type attributes for mathematical
operations.
Note: If number precision is important, you should pass numbers to DynamoDB using strings that you convert from a number type.
I hope this helps you get your answer.
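So for the example value 1234789, the size that matters for throughput is essentially its string form; a minimal sketch of that arithmetic (the internal Number-size formula below is only the commonly cited approximation, not an official figure):

# Minimal sketch: size arithmetic for the example value 1234789.
# The attribute name's length also counts toward the item size.
value = 1234789
digits = str(value)

# As a String attribute: the UTF-8 byte length of the digits.
string_bytes = len(digits.encode("utf-8"))             # 7

# Commonly cited approximation for the stored Number attribute
# (about 1 byte per two significant digits, plus 1); not an official figure.
approx_number_bytes = (len(digits) + 1) // 2 + 1       # 5

print(string_bytes, approx_number_bytes)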
Scenario: I want to use SQLite on an embedded device with an eMMC instead of a magnetic hard disk. All flash memories have a limited number of write cycles. All recent memories have a wear-levelling system that increases the lifetime of the memory: each write is distributed over the whole address space of the device (via a mapping between logical and physical addresses). The main problem with flash memory is the write amplification factor (WAF): when you want to write some data to the memory, the minimum amount that will actually be written is a whole memory page (1, 2 or 4 KB, depending on the memory). So whether you want to write 1 bit or 900 bytes, you will write 1 page of, say, 1 KB.
Suppose I have an SQLite table with id (integer autoincrement), timestamp (integer, indexed) and data (string, not indexed).
Is it possible to predict (or at least overestimate) the number of bytes written for each INSERT?
Scenario example: INSERT INTO mytable (timestamp, data) VALUES (140909090, 'The data limited to 100 bytes').
Note that in my scenario the timestamp will normally increase, because it is the real timestamp of the data logging.
One could predict that each insert writes 8 bytes (id) + 8 bytes (timestamp) + at most 100 bytes (data). But what about the write overhead of the id and timestamp indexes?
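There is no exact published formula for this, but a rough overestimate can be built from the pages an insert can dirty (the table's B-tree leaf, the timestamp index's leaf, plus the journal/WAL copies of those pages); a minimal sketch under those assumptions:

# Minimal sketch: rough per-INSERT write overestimate, assuming a single-row
# transaction dirties one table leaf page and one index leaf page.
# Occasional page splits, the database header page and journal headers add more,
# so treat this as an order-of-magnitude figure, not an exact one.
page_size = 4096          # check PRAGMA page_size on your build
pages_dirtied = 2         # table B-tree leaf + timestamp index leaf
journal_copies = 2        # rollback-journal/WAL copy of each dirtied page

bytes_per_insert = page_size * (pages_dirtied + journal_copies)
print(f"worst case ~{bytes_per_insert} bytes written per single-row INSERT")
# Batching N inserts into one transaction amortises this roughly N-fold,
# since the same pages absorb many rows before being flushed at commit.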
My application requires a key-value store. Following are some details regarding the keys and values:
1) The number of keys (data type: string) can be either 256, 1024 or 4096.
2) The data type of the value for each key is a list of integers.
3) The list of integers (the value) for each key can vary in size.
4) The largest value can be around 10,000,000 integers.
5) Some keys might hold a very small list of integers.
The application needs fast access to the list of integers for a specified key. However, this step is not frequent in the working of the application.
I need suggestions for the best key-value stores for my case. I need fast retrieval of values by key, and a value can be around 512 MB or more.
I checked Redis, but it requires the store to be kept in memory. However, in the given scenario I think I should look for disk-based key-value stores.
LevelDB can fit your use case very well, as you have a limited number of keys (given you have enough disk space for your requirements) and might not need a distributed solution.
One thing you need to specify is whether (and how) you wish to modify the lists once they are in the DB, as LevelDB and many other general key-value stores do not offer atomic read-modify-write operations on a value.
If you are looking for a distributed DB, Cassandra is good, as it will also let you insert/remove individual list elements.
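As an illustration of how compact this can be, a whole list of integers can be packed into one binary value and stored under its key; a minimal sketch, assuming the plyvel Python bindings for LevelDB (any LevelDB binding would look much the same):

import array
import plyvel  # assumption: the plyvel LevelDB bindings are installed

# Minimal sketch: store a (possibly very large) list of integers under one key.
db = plyvel.DB("/tmp/intlists", create_if_missing=True)

values = list(range(1_000_000))               # stand-in for the real list
packed = array.array("I", values).tobytes()   # typically 4 bytes per integer

db.put(b"some-key", packed)                   # one key -> one packed value

# Retrieval: read the value back and unpack it into integers.
restored = array.array("I")
restored.frombytes(db.get(b"some-key"))
print(len(restored), restored[:5].tolist())   # 1000000 [0, 1, 2, 3, 4]

db.close()

For values in the hundreds of megabytes it may be worth splitting one list across several keys (key:0, key:1, ...), since most key-value stores behave better with moderately sized values.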