Fast search on a blob's starting bytes in SQLite

Is there a way to index blob fields and have the index used for beginning of blob searches?
Currently I have hashes stored as hexadecimal in text fields.
These hashes in hexadecimal form are 32 characters long, and form the bulk of the data in the database.
Problem is, they are often searched by their starting bytes, as in
select * from mytable where hash like '00a1b2%'
I would like to store them as blobs, as this saves about 30% of the database size. However while
select * from mytable where hex(hash) like '00a1b2%'
works, it's also much slower and does not seem to use the index.
Searching for exact blob matches does use the index, so the index is working.
Is there a way to perform a search on a blob start (with binary/memcmp "collation") that would use the index?
I also tried substr(); it is apparently faster than hex() but still does not use the index:
select * from mytable where substr(hash, 1, 6) = x'00a1b2'

To be able to use an index for LIKE, the table column must have TEXT affinity, and the index must be case insensitive:
CREATE TABLE mytable(... hash TEXT, ...);
CREATE INDEX hash_index ON mytable(hash COLLATE NOCASE);
Functions like hex or substr prevent usage of indexes.
Blobs can be indexed and compared like other types.
This allows you to express a prefix search with two comparisons:
SELECT * FROM mytable WHERE hash >= x'00a1b2' AND hash < x'00a1b3'
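A minimal sketch of the two-comparison prefix search, run through Python's sqlite3 module. The table layout, the sample rows, and the helper names prefix_upper_bound and find_by_prefix are assumptions made for illustration, not part of the original answer. The helper derives the exclusive upper bound by dropping trailing 0xFF bytes and incrementing the last remaining byte, which covers prefixes that end in 0xFF.

```python
import sqlite3

def prefix_upper_bound(prefix: bytes):
    """Smallest blob that sorts after every blob starting with `prefix`.

    Returns None when no such bound exists (empty prefix or only 0xFF bytes).
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return None
    return trimmed[:-1] + bytes([trimmed[-1] + 1])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, hash BLOB)")
conn.execute("CREATE INDEX hash_index ON mytable(hash)")

# Hypothetical sample data, given as hex for readability.
sample = ["00a1b2c3", "00a1b2ff", "00a1b3", "00a1b1ff", "ff00"]
conn.executemany("INSERT INTO mytable (hash) VALUES (?)",
                 [(bytes.fromhex(h),) for h in sample])

RANGE_SQL = "SELECT hash FROM mytable WHERE hash >= ? AND hash < ? ORDER BY hash"

def find_by_prefix(prefix: bytes):
    upper = prefix_upper_bound(prefix)
    if upper is None:
        # No finite upper bound: only the lower comparison applies.
        sql, params = "SELECT hash FROM mytable WHERE hash >= ? ORDER BY hash", (prefix,)
    else:
        sql, params = RANGE_SQL, (prefix, upper)
    return [row[0].hex() for row in conn.execute(sql, params)]

matches = find_by_prefix(bytes.fromhex("00a1b2"))

# The last column of EXPLAIN QUERY PLAN describes how the query is executed.
plan = [row[3] for row in conn.execute(
    "EXPLAIN QUERY PLAN " + RANGE_SQL,
    (bytes.fromhex("00a1b2"), bytes.fromhex("00a1b3")))]
```

Printing plan should show a SEARCH that names hash_index, which is the check to repeat against a real database.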

Related

SQLite query slow when using index

I have a table indexed on a text column, and I want all my queries to return results ordered by name without any performance hit.
Table has around 1 million rows if it matters.
Table -
CREATE TABLE table (Name text)
Index -
CREATE INDEX "NameIndex" ON "Files" (
"Name" COLLATE nocase ASC
);
Query 1 -
select * from table where Name like "%a%"
Query plan, as expected a full scan -
SCAN TABLE table
Time -
Result: 179202 rows returned in 53ms
Query 2, now using order by to read from index -
select * from table where Name like "%a%" order by Name collate nocase
Query plan, scan using index -
SCAN TABLE table USING INDEX NameIndex
Time -
Result: 179202 rows returned in 672ms
Used DB Browser for SQLite to get the information above, with default Pragmas.
I'd assume scanning the index would be as performant as scanning the table, is it not the case or am I doing something wrong?
Another interesting thing I noticed, that may be relevant -
Query 3 -
select * from table where Name like "a%"
Result: 23026 rows returned in 9ms
Query 4 -
select * from table where name like "a%" order by name collate nocase
Result: 23026 rows returned in 101ms
And both have the same query plan -
SEARCH TABLE table USING INDEX NameIndex (Name>? AND Name<?)
Is this expected? I'd assume the performance would be the same if the plan was the same.
Thanks!
EDIT - The query was slower because I used select * rather than select name, which makes SQLite go back and forth between the index and the table.
The solution was to use a clustered index, thanks @Tomalak for helping me find it -
create table mytable (a text, b text, primary key (a,b)) without rowid
The table will be ordered by default using a + b combination, meaning that full scan queries will be much faster (now 90ms).
A LIKE pattern that starts with % can never use an index. It will always result in a full table scan (or index scan, if the query can be covered by the index itself).
It's logical when you think about it. Indexes are not magic. They are sorted lists of values, exactly like a keyword index in a book, and that means they are only quick for looking up a word if you know how the given word starts. If you're searching for the middle part of a word, you would have to look at every index entry in a book as well.
Conclusion from the ensuing discussion in the comments:
The best course of action to get a table that always sorts by a non-unique column without a performance penalty is to create it WITHOUT ROWID, which turns the primary key into a clustering index over the column in question plus a second column that makes the combination unique:
CREATE TABLE MyTable (
Name TEXT COLLATE NOCASE,
Id INTEGER,
Other TEXT,
Stuff INTEGER,
PRIMARY KEY(Name, Id) -- this will sort the whole table by Name
) WITHOUT ROWID;
This will result in a performance penalty for INSERT/UPDATE/DELETE operations, but in exchange sorting will be free since the table is already ordered.
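A small sketch to check the "sorting is free" claim with Python's sqlite3 module, using the table definition above. The sample rows and the needs_sort variable are illustrative assumptions. The idea is to look at EXPLAIN QUERY PLAN and confirm that no temporary B-tree is built for the ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE MyTable (
        Name  TEXT COLLATE NOCASE,
        Id    INTEGER,
        Other TEXT,
        Stuff INTEGER,
        PRIMARY KEY (Name, Id)   -- rows are stored in (Name, Id) order
    ) WITHOUT ROWID
""")

# Hypothetical rows, inserted out of order on purpose.
rows = [("delta", 1), ("Alpha", 2), ("charlie", 3), ("Bravo", 4)]
conn.executemany("INSERT INTO MyTable (Name, Id) VALUES (?, ?)", rows)

# A pattern with a leading % still scans every row, but the scan follows key order.
query = "SELECT * FROM MyTable WHERE Name LIKE '%a%' ORDER BY Name"
names = [row[0] for row in conn.execute(query)]

plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
# A separate sort step would appear as "USE TEMP B-TREE FOR ORDER BY".
needs_sort = any("TEMP B-TREE" in detail for detail in plan)
```

Because Name is declared COLLATE NOCASE, a plain ORDER BY Name uses the same collation as the primary key, so the rows come back in stored order.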

Searching for blob field in SQLite

I have a column in my database that stores a BLOB.
I want to run a query to check if a specific byte array value is present in the table.
The value is b'\xf4\x8f\xc6{\xc2mH(\x97\x9c\x83hkE\x8b\x95' (python bytes).
I tried to run this query:
SELECT * from received_message
WHERE "EphemeralID"
LIKE HEX('\xf4\x8f\xc6{\xc2mH(\x97\x9c\x83hkE\x8b\x95');
But I get 0 results even though I am 100% sure that this value is stored in the database.
Is there something wrong with my query?
Your search string is a bit weird: you appear to have some complex things in there like { and (. Maybe you should search through the blob the way it is stored instead?
From the Sqlite documentation:
BLOB literals are string literals containing hexadecimal data and
preceded by a single "x" or "X" character. Example: X'53514C697465'
So maybe do a like with the ascii representation of the hex value you want? Maybe start with looking for just f48f or F48F if your sqlite stores it upper case.
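Since the value is already a Python bytes object, a minimal sketch of two ways to match it exactly with the sqlite3 module follows. The one-column schema is an assumption; only the table and column names come from the question. Binding the bytes as a parameter avoids building a literal at all, while the second variant builds the X'...' BLOB literal described in the quoted documentation.

```python
import sqlite3

value = b"\xf4\x8f\xc6{\xc2mH(\x97\x9c\x83hkE\x8b\x95"

conn = sqlite3.connect(":memory:")
# Assumed schema: a single BLOB column named as in the question.
conn.execute('CREATE TABLE received_message ("EphemeralID" BLOB)')
conn.execute("INSERT INTO received_message VALUES (?)", (value,))

# Option 1: bind the bytes object; sqlite3 passes it to SQLite as a BLOB.
by_param = conn.execute(
    'SELECT * FROM received_message WHERE "EphemeralID" = ?', (value,)
).fetchall()

# Option 2: build a BLOB literal. bytes.hex() yields only hex digits,
# so the string is safe to place inside X'...'.
literal = "X'" + value.hex() + "'"
by_literal = conn.execute(
    'SELECT * FROM received_message WHERE "EphemeralID" = ' + literal
).fetchall()
```

Both queries compare BLOB against BLOB, so neither LIKE nor HEX() is needed for an exact match.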

How to search the database for a field which is a substring of the query by using sqlite

Problem description
I want to search for the query = Angela in a database table called Variations. The problem is that the database does not contain Angela. It contains Angel. As you can see, the a is missing.
Searching procedure
The table that I want to query is the following:
"CREATE TABLE IF NOT EXISTS VARIATIONS
(ID INTEGER PRIMARY KEY NOT NULL,
ID_ENTITE INTEGER,
NAME TEXT,
TYPE TEXT,
LANGUAGE TEXT);"
To search for the query I am using fts4 because it is faster than LIKE with % wildcards, especially on a big database with more than 10 million rows. I also cannot use equality, since I am looking for substrings.
I create a virtual table:
create virtual table variation_virtual using fts4(ID, ID_ENTITE, NAME, TYPE, LANGUAGE);
Then I fill the virtual table from VARIATIONS:
insert into variation_virtual select * from VARIATIONS;
The selection query is represented as follow:
SELECT ID_ENTITE, NAME FROM variation_virtual WHERE NAME MATCH "Angela";
Question
What am I missing in the query? What I am doing is the opposite of checking whether a query is a substring of a string in a table.
You can't use fts4 for this. From the documentation:
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux'; /* 0.03 seconds */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
Of course, the two queries above are not
entirely equivalent. For example the LIKE query matches rows that
contain terms such as "linuxophobe" or "EnterpriseLinux" (as it
happens, the Enron E-Mail Dataset does not actually contain any such
terms), whereas the MATCH query on the FTS3 table selects only those
rows that contain "linux" as a discrete token. Both searches are
case-insensitive.
So your query will only match strings that have 'Angela' as a word (at least that is how I interpret 'discrete token').
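One possible non-FTS alternative, which is not part of the original answer, is to test whether the stored NAME occurs inside the query string using SQLite's built-in instr() function. The sketch below uses Python's sqlite3 module with hypothetical sample rows. Note the trade-offs: instr() is case-sensitive, and the query has to scan every row because no index can help.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS VARIATIONS
    (ID INTEGER PRIMARY KEY NOT NULL,
     ID_ENTITE INTEGER,
     NAME TEXT,
     TYPE TEXT,
     LANGUAGE TEXT)
""")

# Hypothetical rows for illustration only.
conn.executemany(
    "INSERT INTO VARIATIONS (ID_ENTITE, NAME) VALUES (?, ?)",
    [(1, "Angel"), (2, "Angelina"), (3, "Bob")],
)

query = "Angela"
# instr(X, Y) returns the 1-based position of Y inside X, or 0 if absent.
# Empty names are excluded so they do not match every query.
hits = conn.execute(
    "SELECT ID_ENTITE, NAME FROM VARIATIONS "
    "WHERE NAME <> '' AND instr(?, NAME) > 0",
    (query,),
).fetchall()
```

With more than 10 million rows this full scan may be too slow, so it is worth measuring before relying on it.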

Indexes with custom collations in sqlite

Assuming I have a schema like this:
CREATE TABLE abc(
id INTEGER PRIMARY KEY AUTOINCREMENT,
txt TEXT
);
CREATE INDEX "txtCS" ON "abc"("txt" COLLATE MY_CUSTOM_SORT);
when will sqlite use my index on txt ?
because I ran:
EXPLAIN QUERY PLAN SELECT * FROM abc ORDER BY txt COLLATE MY_CUSTOM_SORT DESC ...
and it tells me that it scans the table, twice, using the txtCS index (It doesn't search like I expected.)
MY_CUSTOM_SORT is my own sorting function that I hooked with sqliteCreateCollation. I just need that index for some queries that involve special ordering and I want them to be fast
In the EXPLAIN QUERY PLAN output, SEARCH means that the database tries to look up some particular record(s) with specific values, while SCAN means that the database goes through the entire table.
This query returns all records, so the most efficient operation is a SCAN.
Either operation can be sped up with an index.
(In a SCAN, the database just goes through all index entries in order.)
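To see the difference between the two plan types, here is a small sketch with Python's sqlite3 module that registers a stand-in collation and compares the plans for an ORDER BY over all rows and for a lookup of one value. The collation body (a case-insensitive comparison) and the helper plan are assumptions for illustration; the schema is the one from the question.

```python
import sqlite3

def my_custom_sort(a: str, b: str) -> int:
    # Stand-in for the asker's collation: compare case-insensitively.
    a, b = a.lower(), b.lower()
    return (a > b) - (a < b)

conn = sqlite3.connect(":memory:")
conn.create_collation("MY_CUSTOM_SORT", my_custom_sort)
conn.execute("CREATE TABLE abc (id INTEGER PRIMARY KEY AUTOINCREMENT, txt TEXT)")
conn.execute('CREATE INDEX "txtCS" ON "abc"("txt" COLLATE MY_CUSTOM_SORT)')

def plan(sql, params=()):
    # The last column of each EXPLAIN QUERY PLAN row holds the description.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

# Returns every row, so the plan is a SCAN.
order_plan = plan("SELECT * FROM abc ORDER BY txt COLLATE MY_CUSTOM_SORT DESC")

# Looks up one value under the same collation, so the plan is a SEARCH.
lookup_plan = plan(
    "SELECT * FROM abc WHERE txt COLLATE MY_CUSTOM_SORT = ?", ("x",)
)
```

The comparison has to use the same collation as the index; a plain txt = ? would compare with the default BINARY collation and could not use txtCS.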

sqlite Query optimisation

The query
SELECT * FROM Table WHERE Path LIKE 'geo-Africa-Egypt-%'
can be optimized as:
SELECT * FROM Table WHERE Path >= 'geo-Africa-Egypt-' AND Path < 'geo-Africa-Egypt-zzz'
But how can this be done:
select * from foodDb where Food LIKE '%apples%';
How can this be optimized?
One option is redundant data. If you're querying a lot for some fixed set of strings occurring in the middle of some column, add another column that records whether a particular string can be found in the other column.
Another option, for arbitrary but still tokenizable strings is to create a dictionary table where you have the tokens (e.g. apples) and foreign key references to the actual table where the token occurs.
In general, sqlite is by design not very good at full text searches.
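The dictionary-table option could be sketched as follows with Python's sqlite3 module. The food_tokens table, its index, the add_food helper, the whitespace tokenizer, and the sample rows are all assumptions made for illustration. The lookup becomes an indexed equality on the token instead of a LIKE with a leading wildcard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foodDb (id INTEGER PRIMARY KEY, Food TEXT);
    -- Hypothetical dictionary table: one row per (token, food) pair.
    CREATE TABLE food_tokens (
        token   TEXT NOT NULL,
        food_id INTEGER NOT NULL REFERENCES foodDb(id)
    );
    CREATE INDEX food_tokens_token ON food_tokens(token);
""")

def add_food(text: str) -> None:
    cur = conn.execute("INSERT INTO foodDb (Food) VALUES (?)", (text,))
    # Naive tokenizer: lowercase and split on whitespace.
    tokens = set(text.lower().split())
    conn.executemany(
        "INSERT INTO food_tokens (token, food_id) VALUES (?, ?)",
        [(token, cur.lastrowid) for token in tokens],
    )

for text in ["green apples and pears", "baked apples", "banana bread"]:
    add_food(text)

sql = """
    SELECT f.Food
    FROM food_tokens AS t
    JOIN foodDb AS f ON f.id = t.food_id
    WHERE t.token = ?
    ORDER BY f.id
"""
found = [row[0] for row in conn.execute(sql, ("apples",))]
```

Unlike LIKE '%apples%', this only matches whole tokens, so a value such as "pineapples" would not be found unless the tokenizer is changed to emit it.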
It would surprise me if it were faster, but you could try GLOB instead of LIKE and compare:
SELECT * FROM foodDb WHERE Food GLOB '*apples*';

Resources