SQLite database size: more rows vs. more columns - sqlite

Initial situation
Suppose I have a simple table that looks like this:
CREATE TABLE AppData (
id INTEGER PRIMARY KEY,
elementId VARCHAR(36),
timestampMs INTEGER,
enterTypeA SMALLINT,
exitTypeA SMALLINT,
enterTypeB SMALLINT,
exitTypeB SMALLINT
);
CREATE UNIQUE INDEX app_data_index ON AppData (timestampMs DESC, elementId);
The index is added because many queries select entries based on timestampMs and elementId.
Every minute I store enter and exit values of different types for different elements, e.g.:
elementId, timestampMs, enterTypeA, exitTypeA, enterTypeB, exitTypeB
1, 1559383200000, 4, 3, 1, 5
2, 1559383200000, 8, 2, 3, 7
1, 1559383260000, 2, 2, 4, 0
2, 1559383260000, 1, 0, 9, 2
Problem description
New types need to be added to the database, and more types may be added in the future. So I tried two different approaches:
Approach 1:
Adding more columns for new types:
CREATE TABLE AppData (
id INTEGER PRIMARY KEY,
elementId VARCHAR(36),
timestampMs INTEGER,
enterTypeA SMALLINT,
exitTypeA SMALLINT,
enterTypeB SMALLINT,
exitTypeB SMALLINT,
enterTypeC SMALLINT,
exitTypeC SMALLINT
);
CREATE UNIQUE INDEX app_data_index ON AppData (timestampMs DESC, elementId);
Approach 2:
A new row for each type (which means a larger index):
CREATE TABLE AppData (
id INTEGER PRIMARY KEY,
elementId VARCHAR(36),
timestampMs INTEGER,
enterValue SMALLINT,
exitValue SMALLINT,
type SMALLINT
);
CREATE UNIQUE INDEX app_data_index ON AppData (timestampMs DESC, elementId, type);
Personally, I prefer approach 2 because it reduces duplication.
I've tested both approaches and inserted test data for 10 days with 5 elements and 3 types. The results showed that the database of approach 1 is much smaller than that of approach 2 (which seems reasonably logical to me, since approach 2 has three times as many rows):
Approach 1: 8.2 MB | 144'000 entries
Approach 2: 24.6 MB | 432'000 entries
Question
As far as I can see, the index accounts for about 50% of the database size in both solutions, so it's clear that the database of approach 2 will always be larger.
Do more rows instead of more columns always make such a big difference to database size in SQLite?
So far I haven't found a way to reduce the size of approach 2 any further. Perhaps this isn't possible because of the index?

The issue of which of the two versions takes up more space is not as important as which structure is proper for your needs. The second version is preferable, for several reasons (see the sketch after this list):
If you need to restrict the table to only certain types, a simple WHERE clause will suffice. In the first version, you basically always get back every type when querying.
Aggregation is possible in the second version: you may easily aggregate all values by type. This is much harder to do in the first version.
If you need to link any of the columns in the second version to other tables, it is fairly straightforward. In the first version, you would potentially need to link each separate enter/exit column.
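As a rough illustration of the first two points, here is a sketch against the approach 2 schema above; the type codes (1 = A, 2 = B, 3 = C) are just an assumed encoding:
SELECT * FROM AppData WHERE type = 1; -- only the type A rows
SELECT type, SUM(enterValue) AS totalEnter, SUM(exitValue) AS totalExit
FROM AppData
WHERE timestampMs >= 1559383200000
GROUP BY type; -- per-type totals over a time range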
Regarding storage, the space needed for the same amount of data should be fairly similar in either scheme, certainly within an order of magnitude and probably within a factor of 2. The design issue seems to be the bigger problem.

Related

SQLite is very slow when performing .import on a large table

I'm running the following:
.mode tabs
CREATE TABLE mytable(mytextkey TEXT PRIMARY KEY, field1 INTEGER, field2 REAL);
.import mytable.tsv mytable
mytable.tsv is approx. 6 GB and 50 million rows. The process takes an extremely long time (hours) to run and it also completely throttles the performance of the entire system, I'm guessing because of temporary disk IO.
I don't understand why it takes so long and why it thrashes the disk so much, when I have plenty of free physical RAM it could use for temporary write.
How do I improve this process?
PS: Yes, I did search for a previous question and answer, but nothing I found helped.
In SQLite, a normal rowid table uses a 64-bit integer primary key. If the PK in the table definition is anything but a single INTEGER column, it is instead treated as a unique index, and each inserted row has to update both the original table and that index, doubling the work (and in your case effectively doubling the storage requirements). If you instead make your table a WITHOUT ROWID one, the PK is a true PK and doesn't require an extra index table. That change alone should roughly halve both the time it takes to import your dataset and the size of the database.
(If you have other indexes on the table, or use that PK as a foreign key in another table, it might not be worth making the change in the long run, as it'll increase the amount of space needed for those tables by potentially a lot given the lengths of your keys; in that case, see Schwern's answer.)
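A minimal sketch of that change, reusing the table definition from the question (whether it pays off depends on the caveats above):
CREATE TABLE mytable(
mytextkey TEXT PRIMARY KEY,
field1 INTEGER,
field2 REAL
) WITHOUT ROWID; -- the text key becomes the table's own b-tree key: no hidden rowid, no separate PK index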
Sorting the input on the key column first can help too on large imports because there's less random access of b-tree pages and moving of data within those pages. Everything goes into the same page until it fills up and a new one is allocated and any needed rebalancing is done.
You can also turn on some unsafe settings that in normal usage aren't recommended because they can result in data loss or outright corruption, but if that happens during the import because of a freak power outage or whatever, you can always just start over. In particular, set the synchronous mode and journal type to OFF; that results in fewer disc writes over the course of the import.
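For example (these are standard SQLite pragmas; run them in the same session before the .import, and only for a bulk load you can redo from scratch):
PRAGMA journal_mode = OFF; -- no rollback journal: corruption risk on a crash, but far fewer writes
PRAGMA synchronous = OFF; -- don't wait for the OS to confirm that writes reached disk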
My assumption is the problem is the text primary key. This requires building a large and expensive text index.
The primary key is a long nucleotide sequence (anywhere from 20 to 300 characters), field1 is an integer (between 1 and 1500) and field2 is a relative log ratio (between -10 and +10 roughly).
Text primary keys have few advantages and many drawbacks.
They require large, slow indexes. Slow to build, slow to query, slow to insert.
Text values are tempting to change, which is exactly what you don't want a primary key to do.
Any table referencing it also has to store and index text, adding to the bloat.
Joins with this table will be slower due to the text primary key.
Consider what happens when you make a new table which references this one.
create table othertable(
myreference references mytable, -- this is text
something integer,
otherthing integer
);
othertable now must store a copy of the entire sequence for every reference, bloating the table. Instead of joining on simple integers it now has a text column, and it must build its own text index, bloating the index and slowing down joins and inserts.
Instead, use a normal, integer, autoincrementing primary key and make the sequence column unique (which is also indexed). This provides all the benefits of a text primary key with none of the drawbacks.
create table sequences(
id integer primary key autoincrement,
sequence text not null unique,
field1 integer not null,
field2 real not null
);
Now references to sequences are a simple integer.
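For instance, a hypothetical referencing table (the name measurements and its columns are made up for illustration) carries only the integer id instead of the full nucleotide sequence:
CREATE TABLE measurements(
sequence_id INTEGER NOT NULL REFERENCES sequences(id),
reading REAL
);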
Because the SQLite import process is not very customizable, getting your data into this table efficiently requires a couple of steps.
First, import your data into a table which does not yet exist. Make sure it has header fields matching your desired column names.
$ cat test.tsv
sequence field1 field2
d34db33f 1 1.1
f00bar 5 5.5
somethings 9 9.9
sqlite> .import test.tsv import_sequences
As there's no indexing happening, this process should go pretty quickly. SQLite made a table called import_sequences with every column typed as text.
sqlite> .schema import_sequences
CREATE TABLE import_sequences(
"sequence" TEXT,
"field1" TEXT,
"field2" TEXT
);
sqlite> select * from import_sequences;
sequence field1 field2
---------- ---------- ----------
d34db33f 1 1.1
f00bar 5 5.5
somethings 9 9.9
Now we create the final production table.
sqlite> create table sequences(
...> id integer primary key autoincrement,
...> sequence text not null unique,
...> field1 integer not null,
...> field2 real not null
...> );
For efficiency, normally you'd add the unique constraint after the import, but SQLite has very limited ability to alter a table and cannot alter an existing column except to change its name.
Now transfer the data from the import table into sequences. The primary key will be automatically populated.
insert into sequences (sequence, field1, field2)
select sequence, field1, field2
from import_sequences;
Because the sequence must be indexed, this might not import any faster, but it will result in a much better and more efficient schema going forward. If you want more efficiency, consider a more robust database.
Once you've confirmed the data came over correctly, drop the import table.
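A minimal sketch of that check and cleanup; the row-count comparison is only a rough sanity check, so verify against your source data however you see fit:
SELECT count(*) FROM import_sequences;
SELECT count(*) FROM sequences; -- should match the staging table's count
DROP TABLE import_sequences;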
The following settings helped speed things up tremendously.
PRAGMA journal_mode = OFF; -- no rollback journal during the bulk load
PRAGMA cache_size = 7500000; -- much larger page cache
PRAGMA synchronous = 0; -- don't wait for writes to reach disk
PRAGMA temp_store = 2; -- keep temporary tables and indices in memory

sqlite using blob for epoch datetime

I'm trying to decide what is the best way to store a datetime in sqlite. The date will be in epoch.
I've been reading on wiki about the 2038 problem (it's very much like the year 2000 problem). Taking this into account with what I've been reading on tutorialspoint:
From https://www.tutorialspoint.com/sqlite/sqlite_data_types.htm
Tutorialspoint suggests using the data types below for datetime.
SQLite does not have a separate storage class for storing dates and/or times, but SQLite is capable of storing dates and times as TEXT, REAL or INTEGER values.
But when I looked at the type descriptions, BLOB didn't have a size limit and represents the data as it is inserted into the database.
BLOB The value is a blob of data, stored exactly as it was input.
INTEGER The value is a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value.
I saw on tutorialspoint that they suggest using the SQLite type INTEGER for datetime. But taken together with the 2038 problem, I'm thinking that BLOB is a better choice if I'm focusing on future proofing, because BLOB does not depend on a specific number of bytes the way INTEGER does.
I'm new to database design, so I'm wondering what's best to do?
INTEGER, as it says, can be up to 8 bytes, i.e. a 64-bit signed integer. So your issue is not SQLite's ability to store values beyond the 32-bit 2038 limit; your issue will be retrieving the time from a source that is itself not subject to the issue, unless you are trying to protect against the year 292,277,026,596 problem.
There is no need for a BLOB, with its added complexity and the additional processing of converting between a BLOB and a time.
You may even be able to use SQLite itself to generate suitable values, e.g. if you want to store the current time ('now') or a time based upon it.
Perhaps consider the following :-
DROP TABLE IF EXISTS timevalues;
/* Create the table with 1 column with a weird type and a default value as now (seconds since Jan 1st 1970)*/
CREATE TABLE IF NOT EXISTS timevalues (dt typedoesnotmatterthtamuch DEFAULT (strftime('%s','now')));
/* INSERT 2 rows with dates of 1000 years from now */
INSERT INTO timevalues VALUES
(strftime('%s','now','+1000 years')),
((julianday('now','+1000 years') - 2440587.5)*86400.0);
/* INSERT a row using the DEFAULT */
INSERT INTO timevalues (rowid) /* specify the rowid column so there is no need to supply value for the dt column */
VALUES ((SELECT count() FROM timevalues)+1 /* get the highest rowid + 1 */);
/* Retrieve the data rowid, the value as stored in the dt column and the dt column converted to a user friendly format */
SELECT rowid,*, datetime(dt,'unixepoch') AS userfriendly FROM timevalues;
/* Cleanup the Environment */
DROP TABLE IF EXISTS timevalues;
Which results in one row per insert, showing the rowid, the raw value stored in dt (seconds since the Unix epoch), and the userfriendly column with that same value rendered as a readable date and time.
You would probably want to have a read of Date And Time Functions, e.g. for strftime, julianday and the 'now' modifier.
rowid is a special, normally hidden column that exists for every table unless it is a WITHOUT ROWID table. It wouldn't typically be used directly; if it is, it's usually aliased by declaring a column as INTEGER PRIMARY KEY.
See SQLite Autoincrement to find out about rowid and aliases thereof, and why not to use AUTOINCREMENT.
For why a column type of typedoesnotmatterthtamuch is allowed at all, see Datatypes In SQLite Version 3.

Ability to control byte size of INTEGER type

I am trying to keep an SQLite table as small as possible. My tables will only contain 1 byte unsigned integers. However, it is unclear when I create a new table what the underlying structure of the table is that gets created. For example:
CREATE TABLE test (SmallNumbers INTEGER)
Will the resulting SmallNumbers field be 1, 2, 4...8 bytes in size?
If I were to create 1 million records all containing the number "1" using the above command to create the table, would the resulting .db file be any smaller than if I inserted 1 million records all containing the value of 412,321,294,967,295?
How do I ensure that such a table can be as small as possible as I insert 1 byte unsigned integers into the table (with regards to disk space)?
Per the SQLite documentation (https://www.sqlite.org/datatype3.html):
Each value stored in an SQLite database (or manipulated by the database engine) has one of the following storage classes:
INTEGER. The value is a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value.
You don't need to do anything to ensure the table will be as small as possible. SQLite will choose the smallest storage class that can store the value you supply, on a value-by-value basis.
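A quick way to see this for yourself, as a sketch: it assumes the generate_series table-valued function that ships with the sqlite3 shell and a build with the DBSTAT virtual table enabled (SQLITE_ENABLE_DBSTAT_VTAB); the table names are made up for the comparison.
CREATE TABLE small_vals (n INTEGER);
CREATE TABLE large_vals (n INTEGER);
INSERT INTO small_vals SELECT 1 FROM generate_series(1, 1000000);
INSERT INTO large_vals SELECT 412321294967295 FROM generate_series(1, 1000000);
-- Compare the on-disk record payload per table: the large values need 8 bytes each,
-- the small ones far less, even though both columns are declared INTEGER.
SELECT name, SUM(payload) AS payload_bytes FROM dbstat GROUP BY name;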

SQLite integer size: individually sized or for the entire group

Taken straight off of SQLite's site "The value is a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value."
Does this mean that if you have one value that requires 8 bytes, ALL values in that column will be treated as 8 bytes? Or, if the rest are all 1 byte and one value is 8 bytes, will only that value use 8 bytes while the rest remain at 1?
I'm more used to SQL databases in which you specify the integer size up front.
I know the question seems trivial, but the answer will determine how I handle a piece of the database.
The SQLite database structure is different in the way it handles data types: each stored value can have a different type.
Here is the documentation from sqlite:
Most SQL database engines use static typing. A datatype is associated with each column in a table and only values of that particular datatype are allowed to be stored in that column. SQLite relaxes this restriction by using manifest typing. In manifest typing, the datatype is a property of the value itself, not of the column in which the value is stored. SQLite thus allows the user to store any value of any datatype into any column regardless of the declared type of that column. (There are some exceptions to this rule: An INTEGER PRIMARY KEY column may only store integers. And SQLite attempts to coerce values into the declared datatype of the column when it can.)
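A small sketch of what manifest typing means in practice (the table and values are made up): the declared column type only sets an affinity, and each stored value keeps its own storage class, which you can inspect with typeof().
CREATE TABLE t (x INTEGER);
INSERT INTO t VALUES (1), (9223372036854775807), ('123'), ('abc');
SELECT x, typeof(x) FROM t;
-- 1                   -> integer (small value, stored compactly)
-- 9223372036854775807 -> integer (needs the full 8 bytes)
-- 123                 -> integer ('123' was coerced to the column's INTEGER affinity)
-- abc                 -> text    (cannot be coerced, so it stays text)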

Speed up SQL select in SQLite

I'm making a large database that, for the sake of this question, let's say, contains 3 tables:
A. Table "Employees" with fields:
id = INTEGER PRIMARY KEY AUTOINCREMENT
Others don't matter
B. Table "Job_Sites" with fields:
id = INTEGER PRIMARY KEY AUTOINCREMENT
Others don't matter
C. Table "Workdays" with fields:
id = INTEGER PRIMARY KEY AUTOINCREMENT
emp_id = is a foreign key to Employees(id)
job_id = is a foreign key to Job_Sites(id)
datew = INTEGER that stands for the actual workday, represented by a Unix date in seconds since midnight of Jan 1, 1970
The most common operation in this database is to display workdays for a specific employee. I perform the following select statement:
SELECT * FROM Workdays WHERE emp_id='Actual Employee ID' AND job_id='Actual Job Site ID' AND datew>=D1 AND datew<D2
I need to point out that D1 and D2 are calculated for the beginning of the month in search and for the next month, respectively.
I actually have two questions:
Should I set any fields as indexes besides primary indexes? (Sorry, I seem to misunderstand the whole indexing concept)
Is there any way to re-write the Select statement to maybe speed it up. For instance, most of the checks in it would be to see that the actual employee ID and job site ID match. Maybe there's a way to split it up?
PS. Forgot to say, I use SQLite in a Windows C++ application.
If you use the above query often, then you may get better performance by creating a multicolumn index containing the columns in the query:
CREATE INDEX WorkdaysLookupIndex ON Workdays (emp_id, job_id, datew);
Sometimes you just have to create the index and try your queries to see what is faster.
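A quick way to confirm the index is actually used is EXPLAIN QUERY PLAN; the literal values below are placeholders, and the exact output wording varies between SQLite versions:
CREATE INDEX IF NOT EXISTS WorkdaysLookupIndex ON Workdays (emp_id, job_id, datew);
EXPLAIN QUERY PLAN
SELECT * FROM Workdays
WHERE emp_id = 1 AND job_id = 2
AND datew >= 1559347200 AND datew < 1561939200;
-- Expected: SEARCH Workdays USING INDEX WorkdaysLookupIndex (emp_id=? AND job_id=? AND datew>? AND datew<?)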