When to create multi-column indices in SQLite? - sqlite

Assume I have a table in an SQLite database:
CREATE TABLE orders (
id INTEGER PRIMARY KEY,
price INTEGER NOT NULL,
updateTime INTEGER NOT NULL,
) [WITHOUT ROWID];
what indices should I create to optimize the following query:
SELECT * FROM orders WHERE price > ? ORDER BY updateTime DESC;
Do I create two indices:
CREATE INDEX i_1 ON orders(price);
CREATE INDEX i_2 ON orders(updateTime);
or one complex index?
CREATE INDEX i_3 ON orders(price, updateTime);
What can be query time complexity?

From The SQLite Query Optimizer Overview/WHERE Clause Analysis:
If an index is created using a statement like this:
CREATE INDEX idx_ex1 ON ex1(a,b,c,d,e,...,y,z);
Then the index might
be used if the initial columns of the index (columns a, b, and so
forth) appear in WHERE clause terms. The initial columns of the index
must be used with the = or IN or IS operators. The right-most column
that is used can employ inequalities.
As explained also in The SQLite Query Optimizer Overview/The Skip-Scan Optimization with an example:
Because the left-most column of the index does not appear in the WHERE
clause of the query, one is tempted to conclude that the index is not
usable here. However, SQLite is able to use the index.
This means than if you create an index like:
CREATE INDEX idx_orders ON orders(updateTime, price);
it might be used to optimize the WHERE clause even though updateTime does not appear there.
Also, from The SQLite Query Optimizer Overview/ORDER BY Optimizations:
SQLite attempts to use an index to satisfy the ORDER BY clause of a
query when possible. When faced with the choice of using an index to
satisfy WHERE clause constraints or satisfying an ORDER BY clause,
SQLite does the same cost analysis described above and chooses the
index that it believes will result in the fastest answer.
Since updateTime is defined first in the composite index, the index may also be used to optimize the ORDER BY clause.

Related

SQLite queryslow when using index

I have a table indexed on a text column, and I want all my queries to return results ordered by name without any performance hit.
Table has around 1 million rows if it matters.
Table -
CREATE TABLE table (Name text)
Index -
CREATE INDEX "NameIndex" ON "Files" (
"Name" COLLATE nocase ASC
);
Query 1 -
select * from table where Name like "%a%"
Query plan, as expected a full scan -
SCAN TABLE table
Time -
Result: 179202 rows returned in 53ms
Query 2, now using order by to read from index -
select * from table where Name like "%a%" order by Name collate nocase
Query plan, scan using index -
SCAN TABLE table USING INDEX NameIndex
Time -
Result: 179202 rows returned in 672ms
Used DB Browser for SQLite to get the information above, with default Pragmas.
I'd assume scanning the index would be as performant as scanning the table, is it not the case or am I doing something wrong?
Another interesting thing I noticed, that may be relevant -
Query 3 -
select * from table where Name like "a%"
Result: 23026 rows returned in 9ms
Query 4 -
select * from table where name like "a%" order by name collate nocase
Result: 23026 rows returned in 101ms
And both has them same query plan -
SEARCH TABLE table USING INDEX NameIndex (Name>? AND Name<?)
Is this expected? I'd assume the performance be the same if the plan was the same.
Thanks!
EDIT - The reason the query is slower was because I used select * and not select name, causing SQLite to go between the table and the index.
The solution was to use clustered index, thanks #Tomalak for helping me find it -
create table mytable (a text, b text, primary key (a,b)) without rowid
The table will be ordered by default using a + b combination, meaning that full scan queries will be much faster (now 90ms).
A LIKE pattern that starts with % can never use an index. It will always result in a full table scan (or index scan, if the query can be covered by the index itself).
It's logical when you think about it. Indexes are not magic. They are sorted lists of values, exactly like a keyword index in a book, and that means they are only only quick for looking up a word if you know how the given word starts. If you're searching for the middle part of a word, you would have to look at every index entry in a book as well.
Conclusion from the ensuing discussion in the comments:
The best course of action to get a table that always sorts by a non-unique column without a performance penalty is to create it without ROWID, and turn it into a clustering index over a the column in question plus a second column that makes the combination unique:
CREATE TABLE MyTable (
Name TEXT COLLATE NOCASE,
Id INTEGER,
Other TEXT,
Stuff INTEGER,
PRIMARY KEY(Name, Id) -- this will sort the whole table by Name
) WITHOUT ROWID;
This will result in a performance penalty for INSERT/UPDATE/DELETE operations, but in exchange sorting will be free since the table is already ordered.

How to introduce indexing to sqlite query in android?

In my android application, I use Cursor c = db.rawQuery(query, null); to query data from a local sqlite database, and one of the query string looks like the following:
SELECT t1.* FROM table t1
WHERE NOT EXISTS (
SELECT 1 FROM table t2
WHERE t2.start_time = t1.start_time AND t2.stop_time > t1.stop_time
)
however, the issue is that the query gets very slow when the database gets huge. Trying to look into introducing indexing to speed up the query, but so far, not been very successful, therefore, would be great to have some help here, as it's also hard to find examples for this for android applications.
You can create a composite index for the columns start_time and stop_time:
CREATE INDEX idx_name ON table_name(start_time, stop_time);
You can read in The SQLite Query Optimizer Overview:
The ON and USING clauses of an inner join are converted into
additional terms of the WHERE clause prior to WHERE clause analysis
...
and:
If an index is created using a statement like this:
CREATE INDEX idx_ex1 ON ex1(a,b,c,d,e,...,y,z);
Then the index might be used if the initial columns of the index
(columns a, b, and so forth) appear in WHERE clause terms. The initial
columns of the index must be used with the = or IN or IS operators.
The right-most column that is used can employ inequalities.
You may have to uninstall the app from the device so that the db is deleted and rerun to recreate it, or increase the version number of the db so that you can create the index in the onUpgrade() method.

Use views and table valued functions as node or edge tables in match clauses

I like to use Table Valued functions in MATCH clauses in the same way as is possible with Node tables. Is there a way to achieve this?
The need for table valued functions
There can be various use cases for using table valued functions or views as Node tables. For instance mine is the following.
I have Node tables that contain NVarChar(max) fields that I would like to search for literal text. I need only equality searching and no full text searching, so I opted for using a index on the hash value of the text field. As suggested by Remus Rusanu in his answer to SQL server - worth indexing large string keys? and https://www.brentozar.com/archive/2013/05/indexing-wide-keys-in-sql-server/. A table valued function handles using the CHECKSUM index; see Msg 207 Invalid column name $node_id for pseudo column in inline table valued function.
Example data definitions
CREATE TABLE [Tags](
[tag] NVarChar(max),
[tagHash] AS CHECKSUM([Tag]) PERSISTED NOT NULL
) as Node;
CREATE TABLE [Sites](
[endPoint] NVarChar(max),
[endPointHash] AS CHECKSUM([endPoint]) PERSISTED NOT NULL
) as Node;
CREATE TABLE [Links] as Edge;
CREATE INDEX [IX_TagsByName] ON [Tags]([tagHash]);
GO
CREATE FUNCTION [TagsByName](
#tag NVarChar(max))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT
$node_id AS [NodeId],
[tag],
[tagHash]
FROM [dbo].[Tags]
WHERE [tagHash] = CHECKSUM(#tag) AND
[tag] = #tag;
[TagsByName] returns the $node_id with an alias NodeId as suggested by https://stackoverflow.com/a/45565410/814206. However, real Node tables contain two more internal columns which I do not know how to export.
Desired query
I would like to query the database similar to this:
SELECT *
FROM [TagsByName]('important') as t,
[Sites] as s,
[Links] as l
WHERE MATCH ([t]-([l])->[s])
However, this results in the error1:
Msg 13901, Level 16, State 2, Line ...
Identifier 't' in a MATCH clause is not a node table or an alias for a node table.
I there a way to do this?
PS. There are some workarounds but they do not look as elegant as the MATCH-query; especially considering that my actual query involves matching more relations and more string equality tests. I will post these workarounds as answers and hope that someone comes with a better idea.
1 This gives a very specific difference between views and tables for Difference between View and table in sql; which only occurs in sql-server-2017 and only when using SQL Graph.
Workaround
Revert to traditional relational joins via JOIN clauses or FROM with <table_or_view_name> and WHERE clauses. In queries that match on more relations, the latter has the advantage that sql-server-2017-graph can MATCH on FROM <table_or_view_name> but not on FROM <table_source> JOIN <table_source>.
SELECT *
FROM [TagsByName]('important') as t
[Sites] as s,
[Links] as l
WHERE t.NodeId = l.$from_id AND
l.$to_id = s.$node_id;
Workaround
Add the Node table twice to the from clause: once as table and once as table valued function and join them via the $node_id in the where clause:
SELECT *
FROM [TagsByName]('important') as t1,
[Tags] as t2,
[Sites] as s,
[Links] as l
WHERE MATCH ([t2]-([l])->[s]) AND
t1.[NodeId] = t2.$node_id
Does this affect performance?
Workaround
Do not use the table valued function, but include its expression in the WHERE clause:
SELECT *
FROM [Tags] as t,
[Sites] as s,
[Links] as l
WHERE MATCH ([t]-([l])->[s]) AND
[t].[tagHash] = CHECKSUM('important') AND
[t].[tag] = 'important'
Downside: This is easy to get wrong; for example by forgetting to join on the CHECKSUM

Explanation on index on a datetime field and included columns

I have a sqlserver table with the usual
intID(primary key),field1,field2,manyotherfields..., datetime TimeOperation
99% of my different kind of queries start with a TimeOperation BETWEEN startTime AND endTime, and then select * (or count(*)) where fieldA=xxx, and join with other smaller tables.
select * because more or less I need all the fields.
I obviusly created an index on TimeOperation ... but performance are not good enough, so I want to add some index key columns or index included columns, but I'm a little bit confused.
I get the difference between the two, but I don't get how much adding a column in each case impacts on speed and on size.
I guess that the biggest improvement would be to create an index including ALL the columns, is it right? (but I can't afford it in terms of space)
And if I often use field1=xxx for example, adding field1 to the index key columns (after TimeOperation) would give better performance right?
Also...just to be sure how an index with included columns works: if I select rows with TimeOperation in a certain range, sql seeks my TimeOperation index for the rows I'm interested in, and it is faster than scanning all the table because in the index the TimeOperation values are in ascending order, is it right? But then I need all the data now I need all the rest of the data fields of those rows...how does sql acts to retrieve the data? I guess it has a sort of bookmark to those rows in the index, right? But it has to hit the table multiple times then... so including all the columns in the index will save the time to hit the table, it it correct?
Thanks!
Mattia
We will need more information on your table examples of your queries to address this fully, but:
DateTime columns should be highly selective by themselves, so an index with TimeOperation as the first column should address the bulk of queries against TimeOperation.
Do not add all columns blindly to an index, or even on included indexes - this will make the index page density worse and be counter productive (you would be duplicating your table in an index).
If all data in your database centres around TimeOperation, you might consider building your clustered index around it.
If you have queries just on field1 = x then you need a separate index just for field1 (assuming that it is suitably selective), i.e. no TimeOperation on the index if its not in the WHERE clause of your query.
Yes, you are right, when SQL locates a record in an index, it needs to do a key (or RID) lookup back into the cluster to retrieve the rest of the columns. If your non clustered index Includes the other columns in your select statement, the lookup can be avoided. But since you are using SELECT(*), covering indexes are unlikely to help .
Edit
Explanation - Selectivity and density are explained in detail here. e.g. iff your queries against TimeOperation return only a small number of rows (rule of thumb is < 5%, but this isn't always), will the index be used, i.e. your query is selective enough for SQL to choose the index on TimeOperation.
The basic starting point would be:
CREATE TABLE [MyTable]
(
intID INT ID identity(1,1) NOT NULL,
field1 NVARCHAR(20),
-- .. More columns, which may be selected, but not filtered
TimeOperation DateTime,
CONSTRAINT PK_MyTable PRIMARY KEY (IntId)
);
And the basic indexes will be
CREATE NONCLUSTERED INDEX IX_MyTable_1 ON [MyTable](TimeOperation);
CREATE NONCLUSTERED INDEX IX_MyTable_2 ON [MyTable](Field1);
Clustering Consideration / Option
If most of your records are inserted in 'serial' ascending TimeOperation order, i.e. intId and TimeOperation will both increase in tandem, then I would leave the clustering on intID (the default) (i.e. table DDL is PRIMARY KEY CLUSTERED (IntId), which is the default anyway).
However, if there is NO correlation between IntId and TimeOperation, and IF most of your queries are of the form SELECT * FROM [MyTable] WHERE TimeOperation between xx and yy then CREATE CLUSTERED INDEX CL_MyTable ON MyTable(TimeOperation) (and changing PK to PRIMARY KEY NONCLUSTERED (IntId)) should improve this query (Rationale: since contiguous times are kept together, fewer pages need to be read, and the bookmark lookup will be avoided). Even better, if values of TimeOperation are guaranteed to be unique, then CREATE UNIQUE CLUSTERED INDEX CL_MyTable ON MyTable(TimeOperation) will improve density as it will avoid the uniqueifier.
Note - for the rest of this answer, I'm assuming that your IntId and TimeOperations ARE strongly correlated and hence the clustering is by IntId.
Covering Indexes
As others have mentioned, your use of SELECT (*) is bad practice and inter alia means covering indexes won't be of any use (the exception being COUNT(*)).
If your queries weren't SELECT(*), but instead e.g.
SELECT TimeOperation, field1
FROM
WHERE TimeOperation BETWEEN x and y -- and returns < 5% data.
Then altering your index on TimeOperation to include field1
CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation) INCLUDE(Field1);
OR adding both to the index (with the most common filter first, or the most selective first if both filters are always present)
CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation, Field1);
Either will avoid the rid / key lookup. The second (,) option will address your query where BOTH TimeOperation and Field1 are filtered in a WHERE or HAVING clause.
Re : What's the difference between index on (TimeOperation, Field1) and separate indexes?
e.g.
CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation, Field1);
will not be useful for the query
SELECT ... FROM MyTable WHERE Field1 = 'xyz';
The index will only be useful for the queries which have TimeOperation
SELECT ... FROM MyTable WHERE TimeOperation between x and y;
OR
SELECT ... FROM MyTable WHERE TimeOperation between x and y AND Field1 = 'xyz';
Hope this helps?
An index, at its most basic, creates a layer of the "hypertree" structure behind the scenes, which allows the SQL engine to more easily find rows with particular values for indexed columns. Each index creates a different way to "drill down" into the table's data using a binary search (logN performance). Each index you add makes selecting by that index faster, at the cost of slowing insertions/updates (the data must be put in and then indexes must be created).
An index, therefore, should normally be created for combinations of columns that are commonly used to filter records. I would indeed create an index on TimeOperation, and TimeOperation alone.
NEVER simply create an index including all columns of a table, especially a wide one such as this.

How do I find out if a SQLite index is unique? (With SQL)

I want to find out, with an SQL query, whether an index is UNIQUE or not. I'm using SQLite 3.
I have tried two approaches:
SELECT * FROM sqlite_master WHERE name = 'sqlite_autoindex_user_1'
This returns information about the index ("type", "name", "tbl_name", "rootpage" and "sql"). Note that the sql column is empty when the index is automatically created by SQLite.
PRAGMA index_info(sqlite_autoindex_user_1);
This returns the columns in the index ("seqno", "cid" and "name").
Any other suggestions?
Edit: The above example is for an auto-generated index, but my question is about indexes in general. For example, I can create an index with "CREATE UNIQUE INDEX index1 ON visit (user, date)". It seems no SQL command will show if my new index is UNIQUE or not.
PRAGMA INDEX_LIST('table_name');
Returns a table with 3 columns:
seq Unique numeric ID of index
name Name of the index
unique Uniqueness flag (nonzero if UNIQUE index.)
Edit
Since SQLite 3.16.0 you can also use table-valued pragma functions which have the advantage that you can JOIN them to search for a specific table and column. See #mike-scotty's answer.
Since noone's come up with a good answer, I think the best solution is this:
If the index starts with "sqlite_autoindex", it is an auto-generated index for a single UNIQUE column
Otherwise, look for the UNIQUE keyword in the sql column in the table sqlite_master, with something like this:
SELECT * FROM sqlite_master WHERE type = 'index' AND sql LIKE '%UNIQUE%'
you can programmatically build a select statement to see if any tuples point to more than one row. If you get back three columns, foo, bar and baz, create the following query
select count(*) from t
group by foo, bar, baz
having count(*) > 1
If that returns any rows, your index is not unique, since more than one row maps to the given tuple. If sqlite3 supports derived tables (I've yet to have the need, so I don't know off-hand), you can make this even more succinct:
select count(*) from (
select count(*) from t
group by foo, bar, baz
having count(*) > 1
)
This will return a single row result set, denoting the number of duplicate tuple sets. If positive, your index is not unique.
You are close:
1) If the index starts with "sqlite_autoindex", it is an auto-generated index for the primary key . However, this will be in the sqlite_master or sqlite_temp_master tables depending depending on whether the table being indexed is temporary.
2) You need to watch out for table names and columns that contain the substring unique, so you want to use:
SELECT * FROM sqlite_master WHERE type = 'index' AND sql LIKE 'CREATE UNIQUE INDEX%'
See the sqlite website documentation on Create Index
As of sqlite 3.16.0 you could also use pragma functions:
SELECT distinct il.name
FROM sqlite_master AS m,
pragma_index_list(m.name) AS il,
pragma_index_info(il.name) AS ii
WHERE m.type='table' AND il.[unique] = 1;
The above statement will list all names of unique indexes.
SELECT DISTINCT m.name as table_name, ii.name as column_name
FROM sqlite_master AS m,
pragma_index_list(m.name) AS il,
pragma_index_info(il.name) AS ii
WHERE m.type='table' AND il.[unique] = 1;
The above statement will return all tables and their columns if the column is part of a unique index.
From the docs:
The table-valued functions for PRAGMA feature was added in SQLite version 3.16.0 (2017-01-02). Prior versions of SQLite cannot use this feature.

Resources