Why doesn't in-memory mode accelerate SQLite? - sqlite

There is a query that scans a full table of about 5 million records, and it takes about 60s. How can I optimize this?
I have tried SQLite's in-memory mode, which in theory should be faster since the whole database is stored in memory. However, it takes almost the same time.
The table schema looks like this:
CREATE TABLE tbl0(estimateid int, seq int, field1 int NULL, field2 int NULL, field3 int NULL, field4 int NULL);
CREATE INDEX tbl0_idx on tbl0(estimateid);
CREATE TABLE tbl1(seq int, companyid int, field1 int NULL, field2 int NULL, field3 int NULL, field4 int NULL, field5 int NULL);
CREATE INDEX tbl1_idx on tbl1(seq);
CREATE TABLE tbl2(symbolid int, relatedcompanyid int, value char(64), field1 int NULL, field2 int NULL, field3 int NULL, field4 int NULL, field5 int NULL);
CREATE INDEX tbl2_idx on tbl2(relatedcompanyid);
And this is the query, which needs to join the 3 tables:
>explain query plan select tbl0.estimateid, tbl1.seq, tbl1.companyid, tbl2.value from tbl0, tbl1, tbl2 where tbl0.seq = tbl1.seq and tbl1.companyid = tbl2.relatedcompanyid;
0|0|1|SCAN TABLE tbl1
0|1|2|SEARCH TABLE tbl2 USING INDEX tbl2_idx (relatedcompanyid=?)
0|2|0|SEARCH TABLE tbl0 USING AUTOMATIC COVERING INDEX (seq=?)
How can I accelerate this query? It seems inevitable that one table will be fully scanned. Each table contains about 5 million records, and the query takes a very long time (several minutes).
When I put the database in memory with #sqlite3 :memory:, it makes no difference in speed.
Help is very much appreciated.

A full index scan (type: index) is, according to the documentation, the second-worst possible execution plan after a full table scan, which is what you have chosen.
A full table scan is a resource-intensive operation for the database, and there's no magic behind the scenes unless you boost your memory or CPU speed, index the table, reduce the number of records, etc. That's why you haven't noticed any drastic speed increase when you moved everything into memory.
You should try to avoid the full scan by writing a better query or by optimizing the database and table structures. Please refer to EXPLAIN QUERY PLAN and Query Planning for more details on how your SQL is executed and how it can be optimized.
It's hard to say more or be more specific, as your original question didn't provide the table structures, the characteristics of your data, your query, etc.
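As a minimal sketch of the indexing point (the index name here is made up): the plan in the question shows SQLite building an AUTOMATIC COVERING INDEX on tbl0(seq) every time the query runs, so creating that index once up front avoids rebuilding it on each execution:
CREATE INDEX tbl0_seq_idx ON tbl0(seq);
-- then re-check the plan
EXPLAIN QUERY PLAN
SELECT tbl0.estimateid, tbl1.seq, tbl1.companyid, tbl2.value
FROM tbl0, tbl1, tbl2
WHERE tbl0.seq = tbl1.seq AND tbl1.companyid = tbl2.relatedcompanyid;
One table will still be scanned, but the ad-hoc index is no longer rebuilt on every run.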

Your database is not in memory; you've done something wrong. I built a program to load 5 million records in another in-memory database system and it took less than 800 milliseconds for a full sequential scan. Even if SQLite is only half as fast as the in-memory database system I used, it should only take a second or two.
Another possibility is that you're writing to the console after fetching every row, or performing some other logic, and that is what is causing the overall slowness.
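One way (a sketch; the database file name is hypothetical) to actually get the data into memory from the shell is to open an in-memory database and restore the on-disk file into it with the backup-based .restore command; running sqlite3 :memory: by itself only gives you an empty in-memory database:
$ sqlite3 :memory:
sqlite> .restore main mydata.db
sqlite> select count(*) from tbl1;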

Related

SQLite is very slow when performing .import on a large table

I'm running the following:
.mode tabs
CREATE TABLE mytable(mytextkey TEXT PRIMARY KEY, field1 INTEGER, field2 REAL);
.import mytable.tsv mytable
mytable.tsv is approx. 6 GB and 50 million rows. The process takes an extremely long time (hours) to run, and it also completely throttles the performance of the entire system, I'm guessing because of temporary disk I/O.
I don't understand why it takes so long and why it thrashes the disk so much, when I have plenty of free physical RAM it could use for temporary writes.
How do I improve this process?
PS: Yes, I did search for a previous question and answer, but nothing I found helped.
In SQLite, a normal rowid table uses a 64-bit integer primary key. If you have a PK in the table definition that's anything but a single INTEGER column, it is instead treated as a unique index, and each row inserted has to update both the original table and that index, doubling the work (and in your case effectively doubling the storage requirements). If you instead make your table a WITHOUT ROWID one, the PK is a true PK and doesn't require an extra index table. That change alone should roughly halve both the time it takes to import your dataset and the size of the database. (If you have other indexes on the table, or use that PK as a foreign key in another table, it might not be worth making the change in the long run, as it'll increase the amount of space needed for those tables, potentially by a lot given the lengths of your keys; in that case, see Schwern's answer.)
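A minimal sketch of that change, reusing the schema from the question:
CREATE TABLE mytable(
    mytextkey TEXT PRIMARY KEY,
    field1 INTEGER,
    field2 REAL
) WITHOUT ROWID;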
Sorting the input on the key column first can help too on large imports because there's less random access of b-tree pages and moving of data within those pages. Everything goes into the same page until it fills up and a new one is allocated and any needed rebalancing is done.
You can also turn on some unsafe settings that aren't recommended in normal usage because they can result in data loss or outright corruption; but if that happens during an import because of a freak power outage or whatever, you can always just start over. In particular, set the synchronous mode and journal type to OFF. That results in fewer disk writes over the course of the import.
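A short sketch of those settings, issued before the .import and only for the duration of the load:
PRAGMA journal_mode = OFF;
PRAGMA synchronous = OFF;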
My assumption is the problem is the text primary key. This requires building a large and expensive text index.
The primary key is a long nucleotide sequence (anywhere from 20 to 300 characters), field1 is an integer (between 1 and 1500) and field2 is a relative log ratio (roughly between -10 and +10).
Text primary keys have few advantages and many drawbacks.
They require large, slow indexes. Slow to build, slow to query, slow to insert.
Text is tempting to change, which is exactly what you don't want a primary key to do.
Any table referencing it also requires storing and indexing text, adding to the bloat.
Joins with this table will be slower due to the text primary key.
Consider what happens when you make a new table which references this one.
create table othertable(
    myreference references mytable, -- this is text
    something integer,
    otherthing integer
);
othertable now must store a copy of the entire sequence, bloating the table. Instead of being simple integers it now has a text column, bloating the table. And it must make its own text index, bloating the index, and slowing down joins and inserts.
Instead, use a normal, integer, autoincrementing primary key and make the sequence column unique (which is also indexed). This provides all the benefits of a text primary key with none of the drawbacks.
create table sequences(
    id integer primary key autoincrement,
    sequence text not null unique,
    field1 integer not null,
    field2 real not null
);
Now references to sequences are a simple integer.
Because the SQLite import process is not very customizable, getting your data into this table in SQLite efficiently requires a couple steps.
First, import your data into a table which does not yet exist. Make sure it has header fields matching your desired column names.
$ cat test.tsv
sequence field1 field2
d34db33f 1 1.1
f00bar 5 5.5
somethings 9 9.9
sqlite> .import test.tsv import_sequences
As there's no indexing happening, this process should go pretty quick. SQLite made a table called import_sequences with everything of type text.
sqlite> .schema import_sequences
CREATE TABLE import_sequences(
"sequence" TEXT,
"field1" TEXT,
"field2" TEXT
);
sqlite> select * from import_sequences;
sequence field1 field2
---------- ---------- ----------
d34db33f 1 1.1
f00bar 5 5.5
somethings 9 9.9
Now we create the final production table.
sqlite> create table sequences(
...> id integer primary key autoincrement,
...> sequence text not null unique,
...> field1 integer not null,
...> field2 real not null
...> );
For efficiency, normally you'd add the unique constraint after the import, but SQLite has very limited ability to alter a table and cannot alter an existing column except to change its name.
Now transfer the data from the import table into sequences. The primary key will be automatically populated.
insert into sequences (sequence, field1, field2)
select sequence, field1, field2
from import_sequences;
Because the sequence must be indexed, this might not import any faster, but it will result in a much better and more efficient schema going forward. If you want efficiency, consider a more robust database.
Once you've confirmed the data came over correctly, drop the import table.
The following settings helped speed things up tremendously.
PRAGMA journal_mode = OFF
PRAGMA cache_size = 7500000
PRAGMA synchronous = 0
PRAGMA temp_store = 2

Teradata LOCK ROW FOR ACCESS on insert query into a VOLATILE TABLE

I have a VOLATILE TABLE in Teradata that I created with the code below:
CREATE VOLATILE TABLE Temp
(
    ID VARCHAR(30),
    has_cond INT
) ON COMMIT PRESERVE ROWS;
I want to insert records from a SELECT statement that I have created. It is a pretty big SQL statement and definitely requires a row lock before proceeding:
INSERT INTO Temp
(ID ,has_cond)
SELECT * FROM....
Can anyone tell me how to safely lock the rows so I can insert the records into my VOLATILE TABLE? These are production tables, and I don't want to lock out some ETL that might be happening in the background.
I don't think you can apply a row lock for an INSERT unless you put the SELECT in a view.
Or you could switch to a table lock, but don't forget to include all the tables...
But in most production environments there's a database of 1:1 views that include LOCKING ROW FOR ACCESS; you can use those (or you might be using them already; check the EXPLAIN).
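A sketch of the view approach (the database, view, and source table names are made up; the access lock lives in the view, so the INSERT's SELECT reads through it):
REPLACE VIEW mydb.v_temp_source AS
LOCKING ROW FOR ACCESS
SELECT src.ID, src.has_cond
FROM mydb.some_production_table src;

INSERT INTO Temp (ID, has_cond)
SELECT ID, has_cond FROM mydb.v_temp_source;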

Improving SQLite Query Performance

I have run the following query in SQLite and SQL Server. On SQLite the query has never finished running; I have let it sit for hours and it still continues to run. On SQL Server it takes a little less than a minute to run. The table has several hundred thousand records. Is there a way to improve the performance of the query in SQLite?
update tmp_tbl
set prior_symbol = (select o.symbol
from options o
where o.underlying_ticker = tmp_tbl.underlying_ticker
and o.option_type = tmp_tbl.option_type
and o.expiration = tmp_tbl.expiration
and o.strike = (select max(o2.strike)
from options o2
where o2.underlying_ticker = tmp_tbl.underlying_ticker
and o2.option_type = tmp_tbl.option_type
and o2.expiration = tmp_tbl.expiration
and o2.strike < tmp_tbl.strike));
Update: I was able to get what I needed done using some Python code and handling the data mapping outside of SQL. However, I am puzzled by the performance difference between SQLite and SQL Server; I was expecting SQLite to be much faster.
When I ran the above query initially, neither table had any indexes other than a standard primary key, id, which is unrelated to the data. I created two indexes as follows:
create index options_table_index on options(underlying_ticker, option_type, expiration, strike);
and:
create index tmp_tbl_index on tmp_tbl(underlying_ticker, option_type, expiration, strike);
But that didn't help. The query still continues to clock without any output - I let it run for nearly 40 minutes.
The table definition for tmp_tbl is:
create table tmp_tbl(id integer primary key,
symbol text,
underlying_ticker text,
option_type text,
strike real,
expiration text,
mid real,
prior_symbol real,
prior_premium real,
ratio real,
error_flag bit);
The definition of options table is similar but with a few more fields.
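One rewrite worth trying (a sketch, assuming the two composite indexes above) is to collapse the nested max() lookup into an ORDER BY ... LIMIT 1, which SQLite can satisfy directly from the options index:
update tmp_tbl
set prior_symbol = (select o.symbol
                    from options o
                    where o.underlying_ticker = tmp_tbl.underlying_ticker
                    and o.option_type = tmp_tbl.option_type
                    and o.expiration = tmp_tbl.expiration
                    and o.strike < tmp_tbl.strike
                    order by o.strike desc
                    limit 1);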

Is it possible to (emulate?) AUTOINCREMENT on a compound-PK in Sqlite?

According to the SQLite docs, the only way to get an auto-increment column is on the primary key.
I need a compound primary key, but I also need auto-incrementing. Is there a way to achieve both of these in SQLite?
Relevant portion of my table as I would write it in PostgreSQL:
CREATE TABLE tstage (
id SERIAL NOT NULL,
node INT REFERENCES nodes(id) NOT NULL,
PRIMARY KEY (id,node),
-- ... other columns
);
The reason for this requirement is that all nodes eventually dump their data to a single centralized node where, with a single-column PK, there would be collisions.
The documentation is correct.
However, it is possible to reimplement the autoincrement logic in a trigger:
CREATE TABLE tstage (
id INT, -- allow NULL to be handled by the trigger
node INT REFERENCES nodes(id) NOT NULL,
PRIMARY KEY (id, node)
);
CREATE TABLE tstage_sequence (
seq INTEGER NOT NULL
);
INSERT INTO tstage_sequence VALUES(0);
CREATE TRIGGER tstage_id_autoinc
AFTER INSERT ON tstage
FOR EACH ROW
WHEN NEW.id IS NULL
BEGIN
UPDATE tstage_sequence
SET seq = seq + 1;
UPDATE tstage
SET id = (SELECT seq
FROM tstage_sequence)
WHERE rowid = NEW.rowid;
END;
(Or use a common my_sequence table with the table name if there are multiple tables.)
A trigger works, but is complex. More simply, you could avoid serial ids. One approach is to use a GUID. Unfortunately I couldn't find a way to have SQLite generate the GUID for you by default, so you'd have to generate it in your application. There also isn't a GUID type, but you could store it as a string or a binary blob.
Or, perhaps there is something in your other columns that would serve as a suitable key. If you know that inserts won't happen more frequently than the resolution of your timestamp format of choice (SQLite offers several, see section 1.2), then maybe (node, timestamp_column) is a good primary key.
Or, you could use SQLite's AUTOINCREMENT, but set the starting number on each node via the sqlite_sequence table such that the generated serials won't collide. Since the rowid in SQLite is a 64-bit number, you could do this by generating a unique 32-bit number for each node (IP addresses are a convenient, probably unique 32-bit number) and shifting it left 32 bits, or equivalently, multiplying it by 4294967296. Thus, the 64-bit rowid becomes effectively two concatenated 32-bit numbers, NODE_ID, RECORD_ID, guaranteed not to collide unless one node generates over four billion records.
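A sketch of that seeding step (the node number 42 is hypothetical; the sqlite_sequence row is written right after the table is created, before any data rows):
CREATE TABLE tstage (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    node INT REFERENCES nodes(id) NOT NULL
    -- ... other columns
);
-- 42 << 32 = 42 * 4294967296 = 180388626432
INSERT INTO sqlite_sequence (name, seq) VALUES ('tstage', 42 * 4294967296);
Generated ids on this node will then start at 180388626433 and stay within that node's 32-bit range.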
How about...
ASSUMPTIONS
Only need uniqueness in PK, not sequential-ness
Source table has a PK
Create the central table with one extra column, the node number...
CREATE TABLE tstage (
    node INTEGER NOT NULL,
    id INTEGER NOT NULL, -- or whatever the source table PK is
    PRIMARY KEY (node, id)
    -- ... other columns
);
When you rollup the data into the centralized node, insert the number of the source node into 'node' and set 'id' to the source table's PRIMARY KEY column value...
INSERT INTO tstage (nodenumber, sourcetable_id, ...);
There's no need to maintain another autoincrementing column on the central table because nodenumber+sourcetable_id will always be unique.
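A fuller sketch of that rollup insert, with a hypothetical source node number (7) and a hypothetical staged copy of that node's table (node7_tstage), showing just the key columns:
INSERT INTO tstage (node, id)
SELECT 7, id
FROM node7_tstage;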

Using SQLite how do I index columns in a CREATE TABLE statement?

How do I index a column in my CREATE TABLE statement? The table looks like
command.CommandText =
"CREATE TABLE if not exists file_hash_list( " +
"id INTEGER PRIMARY KEY, " +
"hash BLOB NOT NULL, " +
"filesize INTEGER NOT NULL);";
command.ExecuteNonQuery();
I want filesize to be indexed and would like it to be 4 bytes.
You can't do precisely what you're asking, but unlike some RDBMSs, SQLite is able to perform DDL inside of a transaction, with the appropriate result. This means that if you're really concerned about nothing ever seeing file_hash_list without an index, you can do
BEGIN;
CREATE TABLE file_hash_list (
id INTEGER PRIMARY KEY,
hash BLOB NOT NULL,
filesize INTEGER NOT NULL
);
CREATE INDEX file_hash_list_filesize_idx ON file_hash_list (filesize);
COMMIT;
or the equivalent using the transaction primitives of whatever database library you've got there.
I'm not sure how necessary that really is though, compared to just doing the two commands outside of a transaction.
As others have pointed out, SQLite's indexes are all B-trees; you don't get to choose what the type is or what portion of the column is indexed; the entire indexed column(s) are in the index. It will still be efficient for range queries, it just might take up a little more disk space than you'd really like.
I don't think you can. Why can't you use a second query? And specifying the size of the index seems to be impossible in SQLite. But then again, it is SQLite, emphasis on the Lite ;)
create index myIndex on myTable (myColumn);
