Lossless SQLite FTS5 search of a substring - sqlite

An FTS5 virtual table returns nothing for suffix searches.
It can only search for entire word tokens, or for token prefixes if I append * to the search term.
For example, it does not find the qwerty.png row if I search for werty.
CREATE TABLE IF NOT EXISTS files (name TEXT, id INTEGER);
INSERT INTO files (name, id) VALUES ('qwerty.png', 1), ('asdfgh.png', 2);
CREATE VIRTUAL TABLE IF NOT EXISTS names USING FTS5(name);
INSERT INTO names (name) SELECT name FROM files;
SELECT *
FROM names
WHERE name MATCH 'werty';
It only works for whole-token and prefix searches (qwerty, qwer*, qwe*, ...).
I can't use * at the start of the search (*werty), since it produces an error.
Is it possible to make the indexed text search work as if I had used the following query?
SELECT *
FROM names
WHERE name LIKE '%wert%';
I just want a fast substring search without a full table scan.

Perhaps try the experimental trigram tokenizer. From the FTS5 documentation:
When using the trigram tokenizer, a query or phrase token may match any sequence of characters within a row, not just a complete token.
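A minimal sketch of that suggestion, using Python's sqlite3 module and the files/names tables from the question. It assumes an FTS5 build of SQLite 3.34.0 or later for the trigram tokenizer. The helper names build_trigram_index and substring_search are made up for illustration, and the LIKE fallback for older builds is my own addition, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (name TEXT, id INTEGER);
    INSERT INTO files (name, id) VALUES ('qwerty.png', 1), ('asdfgh.png', 2);
""")


def build_trigram_index(conn):
    """Create and fill the FTS5 table; return False if trigram is unavailable."""
    try:
        conn.execute(
            "CREATE VIRTUAL TABLE names USING fts5(name, tokenize='trigram')"
        )
    except sqlite3.OperationalError:
        # No FTS5 module or no trigram tokenizer in this SQLite build.
        return False
    conn.execute("INSERT INTO names (name) SELECT name FROM files")
    return True


def substring_search(conn, term, have_trigram):
    """Return the names that contain `term` anywhere in the string."""
    if have_trigram:
        # Quote the term as an FTS5 string so punctuation is not parsed
        # as query syntax.
        query = '"' + term.replace('"', '""') + '"'
        rows = conn.execute("SELECT name FROM names WHERE names MATCH ?", (query,))
    else:
        # Assumed fallback: the full-scan LIKE query from the question.
        rows = conn.execute(
            "SELECT name FROM files WHERE name LIKE '%' || ? || '%'", (term,)
        )
    return [row[0] for row in rows]


have_trigram = build_trigram_index(conn)
matches = substring_search(conn, "werty", have_trigram)
print(matches)
```

As far as I can tell from the documentation, a MATCH term shorter than three characters finds nothing with this tokenizer, so very short searches would still need another approach.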

Related

How to create virtual table FTS with external sqlite content table?

I want to create a SQLite virtual table with a content of a real one.
I have a small sample which demonstrates my problem. I have already read the official tutorial, but can't find anything wrong in this code. Some users use a rebuild option, but it doesn't work for me.
CREATE TABLE if NOT EXISTS posts (a INTEGER PRIMARY KEY);
INSERT OR IGNORE INTO posts (a) VALUES(510000);
INSERT OR IGNORE INTO posts (a) VALUES(510001);
INSERT OR IGNORE INTO posts (a) VALUES(510300);
CREATE VIRTUAL TABLE IF NOT EXISTS posts_fts using fts5(content=posts, content_rowid=a, a);
SELECT * FROM posts_fts where posts_fts MATCH '10' ORDER BY a ASC;
If I run this, I get:
0 rows returned in 2ms from: SELECT * FROM posts_fts where posts_fts match '10' ORDER BY a ASC;
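Does anyone have an idea what I am doing wrong?
"10" is not a token in the FTS table. The tokenizer produces whole values such as 510000 as tokens, and MATCH compares whole tokens, not substrings.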
From the doc:
4.3.1. Unicode61 Tokenizer
The unicode tokenizer classifies all unicode characters as either
"separator" or "token" characters. By default all space and
punctuation characters, as defined by Unicode 6.1, are considered
separators, and all other characters as token characters. More
specifically, all unicode characters assigned to a general category
beginning with "L" or "N" (letters and numbers, specifically) or to
category "Co" ("other, private use") are considered tokens. All other
characters are separators.
Each contiguous run of one or more token characters is considered to
be a token. The tokenizer is case-insensitive according to the rules
defined by Unicode 6.1.
Also from the doc:
3.2. FTS5 Phrases
FTS queries are made up of phrases. A phrase is an ordered list of one
or more tokens.
You might try a "prefix query", e.g. MATCH '5*', to see that you get results.
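The token behaviour described above can be checked with a short script. This is only a sketch: it assumes an SQLite build with FTS5, it fills the external-content index with the 'rebuild' command (the question says rebuild did not help there, so treat that step as my assumption about a working setup), and the match helper is a made-up name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (a INTEGER PRIMARY KEY);
    INSERT INTO posts (a) VALUES (510000), (510001), (510300);
    CREATE VIRTUAL TABLE posts_fts
        USING fts5(a, content='posts', content_rowid='a');
    -- Fill the full-text index from the external content table.
    INSERT INTO posts_fts (posts_fts) VALUES ('rebuild');
""")


def match(conn, query):
    """Return the rowids of posts_fts rows matching an FTS5 query string."""
    sql = "SELECT rowid FROM posts_fts WHERE posts_fts MATCH ? ORDER BY rowid"
    return [row[0] for row in conn.execute(sql, (query,))]


whole_token = match(conn, "510300")  # a complete token
substring = match(conn, "10")        # only part of a token
prefix = match(conn, "5*")           # prefix query

print(whole_token, substring, prefix)
```

The whole token and the prefix query find rows, while the bare 10 finds none because no token is exactly 10.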

Retrieve distinct tokens from SQLite3 column

I want to add a new feature to my bookmarking utility, Buku: retrieve all distinct tags.
Buku uses SQLite3.
A bookmark entry can have multiple tags separated by commas (,) in the same column tags.
Instead of retrieving the distinct values from column tags and then parsing them, is there any way I can tokenize the tags by comma and retrieve the distinct tags?
Any help is much appreciated.
There is no 'split' function in SQLite. There is only instr(X, Y), which returns the position of the first occurrence only, and substr. If the number of tags per field is constant, you can write a complicated query to split the string into rows and then select the distinct values from them.
So the answer is no: don't try to do it in the database engine. You should change the structure or parse the values after retrieving them from the database.
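Parsing after retrieval could look roughly like this. The bookmarks table, its columns and the sample rows are hypothetical stand-ins (I have not checked Buku's real schema), and distinct_tags is a made-up helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per bookmark, tags kept as one comma-separated string.
conn.executescript("""
    CREATE TABLE bookmarks (id INTEGER PRIMARY KEY, url TEXT, tags TEXT);
    INSERT INTO bookmarks (url, tags) VALUES
        ('https://example.com/a', ',linux,sqlite,'),
        ('https://example.com/b', 'sqlite, python'),
        ('https://example.com/c', '');
""")


def distinct_tags(conn):
    """Collect every distinct tag by splitting the tags column in Python."""
    tags = set()
    # DISTINCT drops identical tag strings before they reach Python.
    for (field,) in conn.execute("SELECT DISTINCT tags FROM bookmarks"):
        for tag in (field or "").split(","):
            tag = tag.strip()
            if tag:  # skip empty pieces from leading or trailing commas
                tags.add(tag)
    return sorted(tags)


all_tags = distinct_tags(conn)
print(all_tags)
```

The set does the de-duplication, so the database only has to hand over the raw column values.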

How to search the database for a field which is a substring of the query by using sqlite

Problem description
I want to search for the query = Angela in a table called Variations. The problem is that the database does not contain Angela. It contains Angel. As you can see, the final a is missing.
Searching procedure
The table that I want to query is the following:
"CREATE TABLE IF NOT EXISTS VARIATIONS
(ID INTEGER PRIMARY KEY NOT NULL,
ID_ENTITE INTEGER,
NAME TEXT,
TYPE TEXT,
LANGUAGE TEXT);"
To search for the query I am using fts4 because it is faster than LIKE with % wildcards, especially on a big database with more than 10 million rows. I also cannot use equality, since I am looking for substrings.
I created a virtual table:
create virtual table variation_virtual using fts4(ID, ID_ENTITE, NAME, TYPE, LANGUAGE);
I filled the virtual table from VARIATIONS:
insert into variation_virtual select * from VARIATIONS;
The selection query is as follows:
SELECT ID_ENTITE, NAME FROM variation_virtual WHERE NAME MATCH "Angela";
Question
What am I missing in the query? What I am doing is the opposite of checking whether a query is a substring of a string in a table.
You can't use fts4 for this. From the documentation:
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux'; /* 0.03 seconds */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* 22.5 seconds */
Of course, the two queries above are not
entirely equivalent. For example the LIKE query matches rows that
contain terms such as "linuxophobe" or "EnterpriseLinux" (as it
happens, the Enron E-Mail Dataset does not actually contain any such
terms), whereas the MATCH query on the FTS3 table selects only those
rows that contain "linux" as a discrete token. Both searches are
case-insensitive.
So your query will only match strings that have 'Angela' as a word (at least that is how I interpret 'discrete token').
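For the reverse direction the question asks about (the stored NAME is a substring of the query), one option outside FTS is a plain instr() test. To be clear, this is a different technique from the one in the question: it scans the whole table, so it will not be fast on 10 million rows. The sample rows and the names_contained_in helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE VARIATIONS (
        ID INTEGER PRIMARY KEY NOT NULL,
        ID_ENTITE INTEGER,
        NAME TEXT,
        TYPE TEXT,
        LANGUAGE TEXT
    );
    -- Hypothetical sample rows.
    INSERT INTO VARIATIONS (ID_ENTITE, NAME) VALUES
        (1, 'Angel'), (2, 'Angela'), (3, 'Angelina'), (4, 'Bob');
""")


def names_contained_in(conn, query):
    """Return rows whose NAME occurs somewhere inside `query` (case-sensitive)."""
    # instr(X, Y) > 0 when Y occurs inside X; X is the query, Y the stored name.
    sql = """
        SELECT ID_ENTITE, NAME FROM VARIATIONS
        WHERE NAME <> '' AND instr(?, NAME) > 0
        ORDER BY ID_ENTITE
    """
    return conn.execute(sql, (query,)).fetchall()


hits = names_contained_in(conn, "Angela")
print(hits)
```

Angelina is not returned because it is longer than the query, which is the behaviour the question seems to want.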

Efficiently match the first characters of an indexed SQLite field

I have two tables in SQLite:
CREATE TABLE article (body TEXT);
CREATE TABLE article_word (
word TEXT,
article_rowid INTEGER,
FOREIGN KEY(article_rowid) REFERENCES article(rowid),
PRIMARY KEY(word, article_rowid)
);
The program stores long strings in article.body and, for each word in the string, it stores a lowercase version of the word along with the article's rowid in article_word.
I want to let the user search for articles by the first case-insensitive characters in a word, so a search for baz yields an article containing foobar Bazquux spambacon.
How can I modify the tables/add more (if necessary) and query them for matches optimally? Does
SELECT a.rowid, a.body FROM article a, article_word w WHERE w.word LIKE "baz%" AND a.rowid = w.article_rowid
take advantage of the PRIMARY KEY index on article_word.word or does it naïvely search every row?
Use NSPredicate to retrieve the specific attribute you need; you can also map to SQLite as in Core Data.
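On the index question itself, one way to check is EXPLAIN QUERY PLAN. As far as I know, the default case-insensitive LIKE cannot use an index with the default BINARY collation, but since article_word.word is already stored in lowercase, the prefix test can be written as a range comparison that can. The sketch below leaves out the foreign key for brevity and assumes a non-empty prefix; add_article and prefix_bounds are made-up helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article (body TEXT);
    CREATE TABLE article_word (
        word TEXT,
        article_rowid INTEGER,
        PRIMARY KEY (word, article_rowid)
    );
""")


def add_article(conn, body):
    """Store the article and one lowercase row per distinct word."""
    rowid = conn.execute("INSERT INTO article (body) VALUES (?)", (body,)).lastrowid
    words = {w.lower() for w in body.split()}
    conn.executemany(
        "INSERT INTO article_word (word, article_rowid) VALUES (?, ?)",
        [(w, rowid) for w in words],
    )
    return rowid


def prefix_bounds(prefix):
    """Return (low, high) so that low <= word < high means word starts with prefix."""
    prefix = prefix.lower()
    # Upper bound: the prefix with its last character bumped by one code point.
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)


RANGE_SQL = (
    "SELECT DISTINCT article_rowid FROM article_word "
    "WHERE word >= ? AND word < ? ORDER BY article_rowid"
)

first = add_article(conn, "foobar Bazquux spambacon")
second = add_article(conn, "nothing relevant here")

found = [r[0] for r in conn.execute(RANGE_SQL, prefix_bounds("baz"))]
# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(
    str(r[-1])
    for r in conn.execute("EXPLAIN QUERY PLAN " + RANGE_SQL, prefix_bounds("baz"))
)
print(found)
print(plan)
```

If the plan text mentions SEARCH with an index rather than a plain SCAN of the table, the primary-key index is being used for the range.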

sqlite Query optimisation

The query
SELECT * FROM Table WHERE Path LIKE 'geo-Africa-Egypt-%'
can be optimized as:
SELECT * FROM Table WHERE Path >= 'geo-Africa-Egypt-' AND Path < 'geo-Africa-Egypt-zzz'
But how can this be done:
SELECT * FROM foodDb WHERE Food LIKE '%apples%';
How can this be optimized?
One option is redundant data. If you often query for a fixed set of strings occurring in the middle of some column, add another column that records whether a particular string can be found in the other column.
Another option, for arbitrary but still tokenizable strings, is to create a dictionary table that holds the tokens (e.g. apples) and foreign key references to the rows of the actual table where each token occurs.
In general, SQLite is by design not very good at full text searches.
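The dictionary-table option might be sketched as follows. The food_token table, the tokenizing regex and the helper names add_food and find_by_token are assumptions for illustration; only foodDb and its Food column come from the question.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foodDb (Food TEXT);
    -- Hypothetical dictionary table: one row per (token, source row).
    CREATE TABLE food_token (
        token TEXT NOT NULL,
        food_rowid INTEGER NOT NULL,
        PRIMARY KEY (token, food_rowid)
    );
""")


def add_food(conn, text):
    """Insert a row and index each of its lowercase alphanumeric tokens."""
    rowid = conn.execute("INSERT INTO foodDb (Food) VALUES (?)", (text,)).lastrowid
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    conn.executemany(
        "INSERT INTO food_token (token, food_rowid) VALUES (?, ?)",
        [(t, rowid) for t in tokens],
    )
    return rowid


def find_by_token(conn, token):
    """Look up rows through the token index instead of scanning with LIKE."""
    sql = """
        SELECT f.Food FROM food_token AS t
        JOIN foodDb AS f ON f.rowid = t.food_rowid
        WHERE t.token = ?
        ORDER BY f.rowid
    """
    return [r[0] for r in conn.execute(sql, (token.lower(),))]


add_food(conn, "Green apples and pears")
add_food(conn, "Pineapples")
add_food(conn, "Dried apples")
result = find_by_token(conn, "apples")
print(result)
```

Note that this matches whole tokens only, so Pineapples is not found for apples. That is the trade-off against LIKE '%apples%'.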
It would surprise me if it were faster, but you could try GLOB instead of LIKE and compare. Note that GLOB is case-sensitive, unlike LIKE:
SELECT * FROM foodDb WHERE Food GLOB '*apples*';
