How to create an FTS virtual table with an external SQLite content table? - sqlite

I want to create a SQLite FTS virtual table whose content comes from a real table.
I have a small sample that demonstrates my problem. I have already read the official tutorial, but I can't find anything wrong with this code. Some users suggest the rebuild option, but it doesn't work for me.
CREATE TABLE if NOT EXISTS posts (a INTEGER PRIMARY KEY);
INSERT OR IGNORE INTO posts (a) VALUES(510000);
INSERT OR IGNORE INTO posts (a) VALUES(510001);
INSERT OR IGNORE INTO posts (a) VALUES(510300);
CREATE VIRTUAL TABLE IF NOT EXISTS posts_fts using fts5(content=posts, content_rowid=a, a);
SELECT * FROM posts_fts where posts_fts MATCH '10' ORDER BY a ASC;
If I run this, I get:
0 rows returned in 2ms from: SELECT * FROM posts_fts where posts_fts match '10' ORDER BY a ASC;
Does anyone have an idea what I am doing wrong?

"10" is not a token in the FTS table.
From the doc:
4.3.1. Unicode61 Tokenizer
The unicode tokenizer classifies all unicode characters as either
"separator" or "token" characters. By default all space and
punctuation characters, as defined by Unicode 6.1, are considered
separators, and all other characters as token characters. More
specifically, all unicode characters assigned to a general category
beginning with "L" or "N" (letters and numbers, specifically) or to
category "Co" ("other, private use") are considered tokens. All other
characters are separators.
Each contiguous run of one or more token characters is considered to
be a token. The tokenizer is case-insensitive according to the rules
defined by Unicode 6.1.
Also from the doc:
3.2. FTS5 Phrases
FTS queries are made up of phrases. A phrase is an ordered list of one
or more tokens.
You might try a "prefix query", i.e. MATCH '5*', to see that you get results.
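To see the difference between a whole token, a fragment, and a prefix, here is a minimal sketch using Python's built-in sqlite3 module. This is an assumption on my part: the question shows only SQL, and the sketch needs a SQLite build that includes FTS5. It reuses the question's table, fills the external-content index with the 'rebuild' command, and runs three queries. The helper name search is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (a INTEGER PRIMARY KEY);
INSERT INTO posts (a) VALUES (510000), (510001), (510300);
-- External-content table: the full-text index starts out empty.
CREATE VIRTUAL TABLE posts_fts USING fts5(a, content=posts, content_rowid=a);
""")

def search(query):
    """Return rowids of posts_fts rows matching an FTS5 query (hypothetical helper)."""
    sql = "SELECT rowid FROM posts_fts WHERE posts_fts MATCH ? ORDER BY rowid"
    return [row[0] for row in conn.execute(sql, (query,))]

before_rebuild = search("5*")   # nothing has been indexed yet

# Populate the index from the content table.
conn.execute("INSERT INTO posts_fts(posts_fts) VALUES ('rebuild')")

whole_token = search("510300")  # matches the complete token "510300"
fragment = search("10")         # "10" is not a token on its own
prefix = search("5*")           # every token that starts with "5"

print(before_rebuild, whole_token, fragment, prefix)
```

In this sketch only the whole-token and prefix queries return rows after the rebuild; the '10' query stays empty because no indexed token is exactly "10".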

Related

Lossless SQLite FTS5 search of a substring

An FTS5 virtual table returns nothing for postfix (suffix) searches.
It can only match entire word tokens, or prefixes of word tokens if I append * to the search term.
For example, it does not find the qwerty.png row if I search for werty.
CREATE TABLE IF NOT EXISTS files (name TEXT, id INTEGER);
INSERT INTO files (name, id) VALUES ('qwerty.png', 1), ('asdfgh.png', 2);
CREATE VIRTUAL TABLE IF NOT EXISTS names USING FTS5(name);
INSERT INTO names (name) SELECT name FROM files;
SELECT *
FROM names
WHERE name MATCH 'werty';
It only works for prefix searches (qwerty, qwer*, qwe*, ...).
I can't use * at the start of the search (*werty), since it produces an error.
Is it possible to make the indexed text search work as if I used
SELECT *
FROM names
WHERE name like '%wert%';
?
I just want to have the fast search for a substring without the full table scan.
Perhaps try the experimental trigram tokenizer. From the doc:
When using the trigram tokenizer, a query or phrase token may match any sequence of characters within a row, not just a complete token.
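As a rough illustration of that suggestion, the sketch below builds the question's names table with tokenize='trigram' through Python's sqlite3 module. The Python wrapper and the helper name substring_search are assumptions for illustration, and the trigram tokenizer is only available when the underlying SQLite library is version 3.34.0 or later with FTS5 enabled.

```python
import sqlite3

def substring_search(term):
    """Return names containing `term`, using an FTS5 trigram index (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    # The trigram tokenizer needs SQLite 3.34.0 or later.
    conn.execute("CREATE VIRTUAL TABLE names USING fts5(name, tokenize='trigram')")
    conn.executemany(
        "INSERT INTO names (name) VALUES (?)",
        [("qwerty.png",), ("asdfgh.png",)],
    )
    # Wrap the term in double quotes so FTS5 reads it as a single string,
    # even if it contains punctuation such as ".".
    phrase = '"' + term.replace('"', '""') + '"'
    rows = conn.execute("SELECT name FROM names WHERE names MATCH ?", (phrase,))
    return [row[0] for row in rows]

if sqlite3.sqlite_version_info >= (3, 34, 0):
    print(substring_search("werty"))  # prints ['qwerty.png']
```

Note that with this tokenizer a full-text query term shorter than three characters matches no rows, so very short fragments still need another approach.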

sqlite3 fts3 multiple columns search including special characters

I am using SQLite FTS3 (SQLite version 3.7.17).
I tried to search for keywords that include special characters (e.g. #, ?) in multiple columns.
This is my example:
SELECT * FROM table_name WHERE table_name MATCH
'EMAIL:aaa#test.com OR SUBJECT:is it a question?'
This query should return rows where the email address is 'aaa#test.com' or the subject is 'is it a question?'.
But it does not return the correct results.
I think that SQLite FTS3 can't recognize special characters...
How can I solve this problem? :(
To do a phrase query, you must use quotes.
Special characters are filtered out by the default tokenizer; aaa#test.com must be handled as a phrase with three words.

Why does SQLite full-text search (FTS4) treat angle brackets differently in a compound search?

I have an SQLite database using FTS4. It is used to store emails with message IDs of the form <8200#comms.io>.
Searching for messages using the FTS MATCH syntax, I get a result from:
SELECT rowid FROM emails WHERE emails MATCH '<8200#comms.io>'
This returns the correct row. But when I try to find multiple emails, I get an empty response:
SELECT rowid FROM emails WHERE emails MATCH '<8200#comms.io> OR <8188#comms.io>'
Strangely though, I can search without the angle bracket characters. This returns both rows:
SELECT rowid FROM emails WHERE emails MATCH '8200#comms.io OR 8188#comms.io'
This works even though the angle brackets are present in the stored columns. I can find no mention of these being special characters in SQLite, and without the 'OR', the single-term search works fine.
Why are these characters treated differently in my compound search?
The default (simple) tokenizer reads alphanumerical characters and treats all others as word separators to be ignored.
So when searching for a message ID, you have to actually search for a phrase with multiple words (8200, comms, and io).
If you want to treat the entire message ID as a word, you have to write a custom tokenizer.
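To make the splitting concrete, here is a small Python approximation of how the simple tokenizer breaks up a message ID. It is only a stand-in written for illustration, not SQLite's own code: it treats runs of ASCII letters and digits (plus any character with a code point of 128 or higher) as tokens and lower-cases ASCII letters. The names TOKEN, ASCII_LOWER, and simple_tokens are hypothetical.

```python
import re
import string

# Approximation of the "simple" tokenizer's rule: a token is a run of ASCII
# alphanumerics or of characters with code point >= 128; everything else
# is a separator and is dropped.
TOKEN = re.compile(r"[0-9A-Za-z\u0080-\U0010FFFF]+")
# Only ASCII upper-case letters are folded to lower case.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def simple_tokens(text):
    """Split text into tokens roughly the way the simple tokenizer would."""
    return [tok.translate(ASCII_LOWER) for tok in TOKEN.findall(text)]

print(simple_tokens("<8200#comms.io>"))  # ['8200', 'comms', 'io']
print(simple_tokens("8200#comms.io"))    # ['8200', 'comms', 'io']
```

Both spellings reduce to the same three words, which is why the angle brackets make no difference to what is stored in the index.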

full text searching

I have an application that allow users to search on multiple columns (prod_name,prod_desc)
So I used full-text search as shown below, but it does not return all the records. For example, I tried to find the character 'o' in the two columns (prod_name, prod_desc), but it misses some records.
Also, when I do not use a wildcard with the 'o' character, it finds nothing at all, although I thought CONTAINS behaves like LIKE '%o%'.
I am a bit confused about full-text search.
Please help me understand what the problem is.
CREATE FULLTEXT CATALOG catalog_crashcourse3;
CREATE FULLTEXT INDEX ON products(prod_name,prod_desc)
KEY INDEX pk_products ON catalog_crashcourse3;
SELECT prod_name, prod_desc
FROM products
WHERE CONTAINS((prod_name,prod_desc), '"*o*"');
SQL Server FTS is a word-based search process. When you create a full-text index on a column, the indexing engine crawls the content and breaks it into individual words in a process known as tokenization. The index then stores the word, the primary key of the row it was found in, and the word's position in the content (i.e. whether it is the first word in the field, the 57th word, or whatever).
When you specify a CONTAINS predicate such as
CONTAINS((prod_name,prod_desc), '"o"');
the SQL Server FTS engine looks for tokens (i.e. words) in its index that are "o". If your content does not have the word "o" in it (which it probably doesn't), then no matches will be found.
As you point out, you can do wildcard searches, where you try to match patterns in the indexed words. For example, if you specify a predicate such as
CONTAINS((prod_name,prod_desc), '"o*"');
then the search will return all rows whose indexed content contains a word that starts with the letter "o".
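The difference between a whole-word match, a prefix match, and a LIKE-style substring match can be sketched with a toy model in Python. This is not how SQL Server is implemented; the sample rows and the function names are invented purely to illustrate the word-based idea described above.

```python
import re

# Made-up sample data standing in for product descriptions.
rows = {
    1: "Oil can, red",
    2: "Carrots and onions",
    3: "Toy robot",
}

def words(text):
    """Break text into lower-case words, as a word-based index would."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_word(term):
    """Rows that have `term` as a complete word, like CONTAINS(..., '"o"')."""
    return sorted(k for k, v in rows.items() if term in words(v))

def contains_prefix(prefix):
    """Rows with a word starting with `prefix`, like CONTAINS(..., '"o*"')."""
    return sorted(k for k, v in rows.items()
                  if any(w.startswith(prefix) for w in words(v)))

def like_substring(fragment):
    """Rows containing `fragment` anywhere, like LIKE '%o%'."""
    return sorted(k for k, v in rows.items() if fragment in v.lower())

print(contains_word("o"))    # []
print(contains_prefix("o"))  # [1, 2]
print(like_substring("o"))   # [1, 2, 3]
```

Only the substring scan finds "Toy robot", because neither of its words is "o" or starts with "o".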
FTS is best used when you want to search for groups of words in your indexed content. It can do sophisticated word stemming (such as searching for "ran" and "running" when you specify "run"). It also provides a ranking of the search result content so that you can find the best match. If you just want to search for a specified word in your content and your content is not too large, you may not need FTS. As MikeSmithDev pointed out in the comments, you may be able to just get away with a LIKE clause.
Note added: In response to your comment, if you have a table with 8 columns that you want to search using FTS, then you would include each of these columns in the table's full-text index and search them as follows:
CONTAINS(*, '"Word"')
where the asterisk indicates that all 8 indexed columns in the table should be included in the search.
You have two issues:
1. You are using a leading wildcard (*o), which SQL Server FTS cannot handle. It only supports trailing wildcards like word*.
2. You are using a single-character search term. Single-character words are excluded from the FT index by default, which is a good thing. Unless specified otherwise, SQL Server associates the system full-text stoplist with the index when it is created.
To see the default stoplist your database is using behind your back, use this query
Select SysStop.stopword, Langs.name
From sys.fulltext_system_stopwords SysStop
Inner Join sys.fulltext_languages Langs
On Langs.lcid = SysStop.language_id;
If you really want to search for single characters, you can drop and recreate the FT index using the option WITH STOPLIST = OFF, but be prepared for a lot of noise. See Create FullText Index.

SQL Server & ASP .NET encoding issue

My page has a UTF-8 meta element, and the SQL Server encoding is also UTF. However, when I create a record and then issue a SELECT statement with a condition that contains Polish characters like 'ń', I see no results. Any ideas what I am missing?
Also, SQL Server Management Studio shows the result with Polish characters, but I don't trust it... I guess something goes wrong when the record is put into the database.
Or how can I troubleshoot it?
Thanks, Paweł
I had the same issue, and I solved it by prefixing the text in the WHERE clause with "N".
For example, I have a table 'Person' containing a bit over 21,000 names of people. A person with the last name "Krzemiński" was recently added to the database, and the name appears normal when the row is displayed (i.e., the "ń" character is displayed correctly). However, neither of the following statements returned any records:
SELECT * FROM Person WHERE FamilyName='Krzemiński'
SELECT * FROM Person WHERE FamilyName LIKE 'Krzemiń%'
...but these statements both returned the correct record:
SELECT * FROM Person WHERE FamilyName LIKE 'Krzemi%'
SELECT * FROM Person WHERE FamilyName LIKE 'Krzemi%ski'
When I executed the following statement:
SELECT * FROM Person WHERE FamilyName LIKE '%ń%'
I get all 8900 records that contain the letter "n" (no diacritic), but I do not get the record that contains the "ń" character. I tried this last query with all of the Polish characters (ąćęłńóśźż), and all of them except "ó" exhibit the same behavior (i.e., return all records with the lower-ASCII equivalent character). Weirdly, "ó" works as it should, returning only those records with an "ó" in the FamilyName field.
In any case, the solution was to prefix the search criterion with "N", to explicitly declare it as Unicode.
Thus, the following statements:
SELECT * FROM Person WHERE FamilyName LIKE N'%ń%'
SELECT * FROM Person WHERE FamilyName=N'Krzemiński'
...both return the correct set of records.
The reason I was confused is that I have MANY records with weird diacritics, and they all return the correct records even without the "N" prefix. So far, the only characters I've found that require the explicit "N" prefix are the Polish characters.
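One plausible explanation for this pattern, which the answer itself does not state, is that a literal without the "N" prefix is converted to the database's non-Unicode code page before the comparison. The Python sketch below checks which Polish letters exist in Windows code page 1252. That code page is an assumption about the server's default collation, and the helper name fits_code_page is made up for illustration.

```python
def fits_code_page(ch, code_page="cp1252"):
    """Return True if the character can be encoded in the given code page."""
    try:
        ch.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False

polish = "ąćęłńóśźż"
survivors = [ch for ch in polish if fits_code_page(ch)]
print(survivors)  # ['ó']

# Common Western European diacritics are all present in code page 1252.
print(all(fits_code_page(ch) for ch in "éüñ"))  # True
```

Under that assumption only "ó" survives the conversion, which would be consistent with the observation that "ó" worked without the prefix while the other Polish letters behaved like their plain ASCII counterparts.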
According to this (Archived) Microsoft Support Issue:
You must precede all Unicode strings with a prefix N when you deal with Unicode string constants in SQL Server
Simply use nvarchar instead of varchar as the data type of the column that stores the record.
