SQLite3, FTS3 and stop-words - sqlite

How do you prevent SQLite3 from not indexing certain key words, or "stop-words", during the build of a virtual FTS3 table?
Examples I'd like to not index include "is", "the", "a" etc.

Unfortunately there is no built in tokenizer that handles stop words, so you will either need to implement your own tokenizer in C and filter out the stop words from the list manually, insert pre-tokenized/pre-filtered text into the relevant FTS table column or use a somewhat convoluted scheme where you insert the text into the FTS column, fetch it back after its been tokenized, filter it and then update the column value.

Related

Lossless SQLite FTS5 search of a substring

Using a FTS5 virtual table returns nothing for postfix searches.
It only can search for the entire word tokens, or for the prefixes of the word tokens if I append * to the search.
For example, it does not find qwerty.png row, if I search for werty.
CREATE TABLE IF NOT EXISTS files (name TEXT, id INTEGER);
INSERT INTO files (name, id) VALUES ('qwerty.png', 1), ('asdfgh.png', 2);
CREATE VIRTUAL TABLE IF NOT EXISTS names USING FTS5(name);
INSERT INTO names (name) SELECT name FROM files;
SELECT *
FROM names
WHERE name MATCH 'werty';
It only works for prefix searches (qwerty, qwer*, qwe*, ...).
I can't use * at the start of the search (*werty), since it produces an error.
Is possibly to make the indexed text search working as if I would use
SELECT *
FROM names
WHERE name like '%wert%';
?
I just want to have the fast search for a substring without the full table scan.
Perhaps try the experimental trigram tokenizer
When using the trigram tokenizer, a query or phrase token may match any sequence of characters within a row, not just a complete token.

Retrieve distinct tokens from SQLite3 column

I want to add a new feature to my bookmarking utility, Buku: retrieve all distinct tags.
Buku uses SQLite3.
A bookmark entry can have multiple tags separated by commas (,) in the same column tags.
Instead of retrieving the distinct values from column tags and then parsing them, is there any way I can tokenize the tags by comma and retrieve the distinct tags?
Any help is much appreciated.
There isn't function 'split' in sqlite3 database. Only instr(X, Y) which returns position of only first occurrence. And there is function substr. If number of tags in field is constant value you can create complicated query to split you string into rows and then select distinct from them.
So answer is no, don't try to do it by database engine. You should change structure or parse values after retrieving from database.

Strange sqlite3 behavior

I'm working on a small SQLite database using the Unix command line sqlite3 command tool. My schema is:
sqlite> .schema
CREATE TABLE status (id text, date integer, status text, mode text);
Now I want to set the column 'mode' to the string "Status" for all entries. However, if I type this:
sqlite> UPDATE status SET mode="Status";
Instead of setting column 'mode' to the string "Status", it sets every entry to the value that is currently in the column 'status'. Instead, if I type the following it does the expected behavior:
sqlite> UPDATE status SET mode='Status';
Is this normal behavior?
This is also a FAQ :-
My WHERE clause expression column1="column1" does not work. It causes every row of the table to be returned, not just the rows where column1 has the value "column1".
Use single-quotes, not double-quotes, around string literals in SQL. This is what the SQL standard requires. Your WHERE clause expression should read: column1='column2'
SQL uses double-quotes around identifiers (column or table names) that contains special characters or which are keywords. So double-quotes are a way of escaping identifier names. Hence, when you say column1="column1" that is equivalent to column1=column1 which is obviously always true.
http://www.sqlite.org/faq.html#q24
Yes, that's normal in SQL.
Single quotes are used for string values; double quotes are used for identifiers (like table or column names).
(See the documentation.)

full text searching

I have an application that allow users to search on multiple columns (prod_name,prod_desc)
So I used full text search like below, but it does not return all the records, for excample I tried to find 'o' character in 2 columns (prod_name,prod_desc)but it can not find for some records.
Also when I do not use wildcard for the 'o' character it can not find any thing while contains means like %o%.
I am a bit confused about full text search.
Please help what is the problem.
CREATE FULLTEXT CATALOG catalog_crashcourse3;
CREATE FULLTEXT INDEX ON products(prod_name,prod_desc)
KEY INDEX pk_products ON catalog_crashcourse3;
SELECT prod_name, prod_desc
FROM products
WHERE CONTAINS((prod_name,prod_desc), '"*o*"');
SQL Server FTS is a word-based search process. When you create a full-text index on a column, the indexing engine crawls the content and breaks it into individual words in a process known as tokenization. The index then stored the word, the primary key of the row it was found in, and the word's position in the content (i.e. is is the first word in the field, the 57th word, or whatever).
When you specify a CONTAINS predicate such as
CONTAINS((prod_name,prod_desc), '"o"');
the SQL Server FTS engine looks for tokens (i.e. words) in its index that are "o". If your content does not have the word "o" in it (which is probably doesn't) then no matches will be found.
As you point out, you can do wildcard searches, where you try and matched patterns in the indexed word. For example, if you specify a predicate such as
CONTAINS((prod_name,prod_desc), '"o*"');
then the search will return all words in the indexed content that start with the letter "o"
FTS is best used when you want to search for groups of words in your indexed content. It can do sophisticated word stemming (such as searching for "ran" and "running" when you specify "run"). It also provides a ranking of the search result content so that you can find the best match. If you just want to search for a specified word in your content and your content is not too large, you may not need FTS. As MikeSmithDev pointed out in the comments, you may be able to just get away with a LIKE clause.
Note added: In response to your comment, if you have a table with 8 columns that you want to search using FTS, then you would create full-text indexes on each of these columns and search them as follows:
CONTAINS(*, '"Word"')
where the asterik indicates that all 8 indexed columns in the table should be included in the search.
You have two issues:
You are using a prefix wildcard *o which Sql Server FTS is
helpless with. It only works with suffix wildcards like word*.
You are using a single-character search term. Single character words
are excluded from the FT index by default, which is a good thing.
Unless specified otherwise, SQL Server associates the system
full-text stoplist by default when creating the index.
To see the default stoplist your database is using behind your back, use this query
Select SysStop.stopword, Langs.name
From sys.fulltext_system_stopwords SysStop
Inner Join sys.fulltext_languages Langs
On Langs.lcid = SysStop.language_id;
If you really want to search for single characters, you can drop and
recreate the FT index using the option WITH STOPLIST OFF, but be prepared
for a lot of noise. See Create FullText Index.

Selecting empty character column in SQLite

I want to select some rows from a SQLite table, and add an empty character column at the same time, but I get an error. The statement is SELECT firstname, SPACE(100) AS mytext FROM Customers, and the error message is "No such function: space".
I can run the same command in SQL-server without any probems, and in SQLite I can select additional numeric columns without problems (eg. SELECT firstname, 8 AS newfield ...), but not character columns.
Any help would be appreciated.
Regards, Alan
Functions are not standard across database engines; some will be the same, but most are not. A complete list of standard functions is here http://www.sqlite.org/lang_corefunc.html. You can also create custom functions in C, C#, or whatever you're using.
There is no built in version of SPACE. You need to create a custom function or use a string literal.

Resources