Why does this SQLite query not use an index for the correlated subquery? - sqlite

Consider a SQLite database for things with parts, containing the following tables
CREATE TABLE thing (id integer PRIMARY KEY, name text, total_cost real);
CREATE TABLE part (id integer PRIMARY KEY, cost real);
CREATE TABLE thing_part (thing_id REFERENCES thing(id), part_id REFERENCES part(id));
I have an index to find the parts of a thing
CREATE INDEX thing_part_idx ON thing_part (thing_id);
To illustrate the problem, I'm using the following queries to fill the tables with random data
INSERT INTO thing(name)
WITH RECURSIVE
cte(x) AS (
SELECT 1
UNION ALL
SELECT 1 FROM cte LIMIT 10000
)
SELECT hex(randomblob(4)) FROM cte;
INSERT INTO part(cost)
WITH RECURSIVE
cte(x) AS (
SELECT 1
UNION ALL
SELECT 1 FROM cte LIMIT 10000
)
SELECT abs(random()) % 100 FROM cte;
INSERT INTO thing_part (thing_id, part_id)
SELECT thing.id, abs(random()) % 10000 FROM thing, (SELECT 1 UNION ALL SELECT 1), (SELECT 1 UNION ALL SELECT 1);
So each thing is associated with a small number of parts (4 in this example).
At this point, I have not yet set the total cost of the things. I thought I could use the following query
UPDATE thing SET total_cost = (
SELECT sum(part.cost)
FROM thing_part, part
WHERE thing_part.thing_id = thing.id
AND thing_part.part_id = part.id);
but it is extremely slow (I did not have the patience to wait for it to complete).
EXPLAIN QUERY PLAN shows that both thing and thing_part are being scanned over, only the lookup in part is done using the rowid:
SCAN TABLE thing
EXECUTE CORRELATED SCALAR SUBQUERY 0
SCAN TABLE thing_part
SEARCH TABLE part USING INTEGER PRIMARY KEY (rowid=?)
If I look at the query plan for the inner query with a fixed thing_id, i.e.
SELECT sum(part.cost)
FROM thing_part, part
WHERE thing_part.thing_id = 1000
AND thing_part.part_id = part.id;
it does use the thing_part_idx:
SEARCH TABLE thing_part USING INDEX thing_part_idx (thing_id=?)
SEARCH TABLE part USING INTEGER PRIMARY KEY (rowid=?)
I would expect the first query to be equivalent to iterating over all rows of thing and executing the inner query each time, but obviously that's not the case. Why? Should I use a different index or rewrite my query or maybe do the iteration in the client to generate multiple queries instead?
In case it matters, I'm using SQLite version 3.22.0

SQLite might use dynamic typing, but column types still matter for affinity, and indexes can be used only when the database can prove that index lookups behave the same as comparisons with the actual table values, which often requires the affinities to be compatible.
So when you tell the database that the thing_part values are integers:
CREATE TABLE thing_part (
thing_id integer REFERENCES thing(id),
part_id integer REFERENCES part(id)
);
then the index on that will have the correct affinity, and will be used:
QUERY PLAN
|--SCAN TABLE thing
`--CORRELATED SCALAR SUBQUERY
|--SEARCH TABLE thing_part USING INDEX thing_part_idx (thing_id=?)
`--SEARCH TABLE part USING INTEGER PRIMARY KEY (rowid=?)

I would rewrite your query as:
-- calculating sum for each thing_id at once
WITH cte AS (
SELECT thing_part.thing_id, sum(part.cost) AS s
FROM thing_part
JOIN part
ON thing_part.part_id = part.id
GROUP BY thing_part.thing_id
)
UPDATE thing
SET total_cost = (SELECT s FROM cte WHERE thing.id = cte.thing_id);

Related

SQLite queryslow when using index

I have a table indexed on a text column, and I want all my queries to return results ordered by name without any performance hit.
Table has around 1 million rows if it matters.
Table -
CREATE TABLE table (Name text)
Index -
CREATE INDEX "NameIndex" ON "Files" (
"Name" COLLATE nocase ASC
);
Query 1 -
select * from table where Name like "%a%"
Query plan, as expected a full scan -
SCAN TABLE table
Time -
Result: 179202 rows returned in 53ms
Query 2, now using order by to read from index -
select * from table where Name like "%a%" order by Name collate nocase
Query plan, scan using index -
SCAN TABLE table USING INDEX NameIndex
Time -
Result: 179202 rows returned in 672ms
Used DB Browser for SQLite to get the information above, with default Pragmas.
I'd assume scanning the index would be as performant as scanning the table, is it not the case or am I doing something wrong?
Another interesting thing I noticed, that may be relevant -
Query 3 -
select * from table where Name like "a%"
Result: 23026 rows returned in 9ms
Query 4 -
select * from table where name like "a%" order by name collate nocase
Result: 23026 rows returned in 101ms
And both has them same query plan -
SEARCH TABLE table USING INDEX NameIndex (Name>? AND Name<?)
Is this expected? I'd assume the performance be the same if the plan was the same.
Thanks!
EDIT - The reason the query is slower was because I used select * and not select name, causing SQLite to go between the table and the index.
The solution was to use clustered index, thanks #Tomalak for helping me find it -
create table mytable (a text, b text, primary key (a,b)) without rowid
The table will be ordered by default using a + b combination, meaning that full scan queries will be much faster (now 90ms).
A LIKE pattern that starts with % can never use an index. It will always result in a full table scan (or index scan, if the query can be covered by the index itself).
It's logical when you think about it. Indexes are not magic. They are sorted lists of values, exactly like a keyword index in a book, and that means they are only only quick for looking up a word if you know how the given word starts. If you're searching for the middle part of a word, you would have to look at every index entry in a book as well.
Conclusion from the ensuing discussion in the comments:
The best course of action to get a table that always sorts by a non-unique column without a performance penalty is to create it without ROWID, and turn it into a clustering index over a the column in question plus a second column that makes the combination unique:
CREATE TABLE MyTable (
Name TEXT COLLATE NOCASE,
Id INTEGER,
Other TEXT,
Stuff INTEGER,
PRIMARY KEY(Name, Id) -- this will sort the whole table by Name
) WITHOUT ROWID;
This will result in a performance penalty for INSERT/UPDATE/DELETE operations, but in exchange sorting will be free since the table is already ordered.

SQLite: Does PRIMARY KEY default to ASC?

Ie, are the following two SQL statements equivalent in SQLite?
CREATE TABLE posts (
id INTEGER PRIMARY KEY
);
CREATE TABLE posts (
id INTEGER PRIMARY KEY ASC
);
Yes they are.
There is no need to specify ASC and beware that if you were to specify DESC, then NO they are then not equivalent (see 4 below) as id INTEGER PRIMARY KEY DESC is an exclusion to the column being an alias of the rowid column as per :-
The exception mentioned above is that if the declaration of a column
with declared type "INTEGER" includes an "PRIMARY KEY DESC" clause, it
does not become an alias for the rowid and is not classified as an
integer primary key. This quirk is not by design. It is due to a bug
in early versions of SQLite. But fixing the bug could result in
backwards incompatibilities. Hence, the original behavior has been
retained (and documented) because odd behavior in a corner case is far
better than a compatibility break.
ROWIDs and the INTEGER PRIMARY KEY
You can use id INTEGER, PRIMARY KEY(id, DESC), but still the order defaults to ASC when retrieving the column as it is an alias of the rowid (see 5 below )
Perhaps consider the following :-
DROP TABLE IF EXISTS posts1;
CREATE TABLE posts1 (
id INTEGER PRIMARY KEY
);
DROP TABLE IF EXISTS posts2;
CREATE TABLE posts2 (
id INTEGER PRIMARY KEY ASC
);
DROP TABLE IF EXISTS posts3;
CREATE TABLE posts3 (
id INTEGER PRIMARY KEY DESC
);
DROP TABLE IF EXISTS posts4;
CREATE TABLE posts4 (
id INTEGER, PRIMARY KEY (id DESC)
);
INSERT INTO posts1 VALUES(null),(null),(null);
INSERT INTO posts2 VALUES(null),(null),(null);
INSERT INTO posts3 VALUES(null),(null),(null);
INSERT INTO posts4 VALUES(null),(null),(null);
SELECT * FROM sqlite_master WHERE name LIKE '%posts%';
SELECT * FROM posts1;
SELECT * FROM posts2;
SELECT * FROM posts3;
SELECT * FROM posts4;
Results
1
The query SELECT * FROM sqlite_master WHERE name LIKE '%posts%'; results in :-
As you can see posts3 is significantly different as the index sqlite_autoindex_posts3_1 has been created
The others do not have a specific index created as the id column is an alias of the rowid column
The data for rowid tables is stored as a B-Tree structure containing
one entry for each table row, using the rowid value as the key. This
means that retrieving or sorting records by rowid is fast. Searching
for a record with a specific rowid, or for all records with rowids
within a specified range is around twice as fast as a similar search
made by specifying any other PRIMARY KEY or indexed value.
ROWIDs and the INTEGER PRIMARY KEY
2
The query SELECT * FROM posts1; results in :-
3
The query SELECT * FROM posts2;, confirms the initial YES answer as per :-
4
The query SELECT * FROM posts3;, may be a little confusing, but shows that id INTEGER PRIMARY KEY DESC does not result in an alias of the rowid and in the case of no value or null being inserted into the column, the value is null rather than an auto generated value. There is no UNIQUE constraint conflict (as nulls are considered as being different values).
5
The query SELECT * FROM posts4; produces the same result as for 1 and 2 even though id INTEGER, PRIMARY KEY (id DESC) was used. Confirming that even if DESC is applied via the column definition that the sort order is still defaults to ASC (unless the ORDER BY clause is used).
Note that this peculiarity is specific to the rowid column or an alias thereof.
See both https://www.sqlite.org/lang_createtable.html#rowid and https://www.sqlite.org/lang_createindex.html for a more complete answer. Shawn's link is specific to INTEGER PRIMARY KEY which matches the example code, but the more general question is not answered explicitly in either location, but can be deduced by reading both.
Under SQL Data Constraints, the first link says
In most cases, UNIQUE and PRIMARY KEY constraints are implemented by creating a unique index in the database. (The exceptions are INTEGER PRIMARY KEY and PRIMARY KEYs on WITHOUT ROWID tables.)
The CREATE INDEX page explains that originally the sort order was ignored and all indices were generated in ascending order. Only as of version 3.3.0 is the DESC order "understood". But even that description is somewhat vague, however altogether it is apparent that ASC is the default.

Get last_insert_rowid() from bulk insert

I have an sqlite database where I need to insert spatial information along with metadata into an R*tree and an accompanying regular table. Each entry needs to be uniquely defined for the lifetime of the database. Therefore the regular table have an INTEGER PRIMARY KEY AUTOINCREMENT column and my plan was to start with the insert into this table, extract the last inserted rowids and use these for the insert into the R*tree. Alas this doesn't seem possible:
>testCon <- dbConnect(RSQLite::SQLite(), ":memory:")
>dbGetQuery(testCon, 'CREATE TABLE testTable (x INTEGER PRIMARY KEY, y INTEGER)')
>dbGetQuery(testCon, 'INSERT INTO testTable (y) VALUES ($y)', bind.data=data.frame(y=1:5))
>dbGetQuery(testCon, 'SELECT last_insert_rowid() FROM testTable')
last_insert_rowid()
1 5
2 5
3 5
4 5
5 5
Only the last inserted rowid seems to be kept (probably for performance reasons). As the number of records to be inserted is hundreds of thousands, it is not feasible to do the insert line by line.
So the question is: Is there any way to make the last_insert_rowid() bend to my will? And if not, what is the best failsafe alternative? Some possibilities:
Record highest rowid before insert and 'SELECT rowid FROM testTable WHERE rowid > prevRowid'
Get the number of rows to insert, fetch the last_insert_rowid() and use seq(to=lastRowid, length.out=nInserts)
While the two above suggestion at least intuitively should work I don't feel confident enough in sqlite to know if they are failsafe.
The algorithm for generating autoincrementing IDs is documented.
For an INTEGER PRIMARY KEY column, you can simply get the current maximum value:
SELECT IFNULL(MAX(x), 0) FROM testTable
and then use the next values.

How to read the last record in SQLite table?

Is there a way to read the value of the last record inserted in an SQLite table without going through the previous records ?
I ask this question for performance reasons.
There is a function named sqlite3_last_insert_rowid() which will return the integer key for the most recent insert operation. http://www.sqlite.org/c3ref/last_insert_rowid.html
This only helps if you know the last insert happened on the table you care about.
If you need the last row on a table, regardless of wehter the last insert was on this table or not, you will have to use a SQL query
SELECT * FROM mytable WHERE ROWID IN ( SELECT max( ROWID ) FROM mytable );
When you sort the records by ID, in reverse order, the last record will be returned first.
(Because of the implicit index on the autoincrementing column, this is efficient.)
If you aren't interested in any other records, use LIMIT:
SELECT *
FROM MyTable
ORDER BY _id DESC
LIMIT 1

Speed up SQL select in SQLite

I'm making a large database that, for the sake of this question, let's say, contains 3 tables:
A. Table "Employees" with fields:
id = INTEGER PRIMARY INDEX AUTOINCREMENT
Others don't matter
B. Table "Job_Sites" with fields:
id = INTEGER PRIMARY INDEX AUTOINCREMENT
Others don't matter
C. Table "Workdays" with fields:
id = INTEGER PRIMARY INDEX AUTOINCREMENT
emp_id = is a foreign key to Employees(id)
job_id = is a foreign key to Job_Sites(id)
datew = INTEGER that stands for the actual workday, represented by a Unix date in seconds since midnight of Jan 1, 1970
The most common operation in this database is to display workdays for a specific employee. I perform the following select statement:
SELECT * FROM Workdays WHERE emp_id='Actual Employee ID' AND job_id='Actual Job Site ID' AND datew>=D1 AND datew<D2
I need to point out that D1 and D2 are calculated for the beginning of the month in search and for the next month, respectively.
I actually have two questions:
Should I set any fields as indexes besides primary indexes? (Sorry, I seem to misunderstand the whole indexing concept)
Is there any way to re-write the Select statement to maybe speed it up. For instance, most of the checks in it would be to see that the actual employee ID and job site ID match. Maybe there's a way to split it up?
PS. Forgot to say, I use SQLite in a Windows C++ application.
If you use the above query often, then you may get better performance by creating a multicolumn index containing the columns in the query:
CREATE INDEX WorkdaysLookupIndex ON Workdays (emp_id, job_id, datew);
Sometimes you just have to create the index and try your queries to see what is faster.

Resources