NHibernate, SQLite, and Cyrillic characters: case sensitivity and fallback queries

I'm querying an SQLite database using NHibernate. Generally, I want to do case-insensitive string queries. Recently, I've discovered that although I can insert a row with Cyrillic characters, I cannot select it using a case-insensitive query. This is what the query looks like:
string foo = "foo";
IList<T> list = session.CreateCriteria(typeof(T))
    .Add(Expression.Eq("Foo", foo).IgnoreCase())
    .List<T>();
I can, however, select the row using the above query if IgnoreCase() is removed. A naive fix would be to check if list.Count == 0 after the first query, and make a subsequent case-sensitive query. The major downside of this approach is that querying for non-existent rows is a reasonably common operation, and it would now take two queries.
The question is, how can I construct a single query against the Foo column that is case-insensitive yet still matches rows that contain Cyrillic characters?

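By default, case-insensitive comparisons in SQLite only work for ASCII characters. This applies to the built-in NOCASE collation as well as to LIKE, upper(), and lower(); full Unicode case folding requires the ICU extension or application-supplied replacements for them.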
See this FAQ: Case-insensitive matching of Unicode characters does not work.
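As a rough illustration of the "application-supplied collation" route, here is a minimal sketch using Python's sqlite3 module rather than NHibernate. The collation name UNICODE_NOCASE, the table items, and its column foo are made up for the example; an NHibernate application would presumably have to register an equivalent collation or function on its underlying SQLite connection.

```python
import sqlite3


def casefold_compare(a: str, b: str) -> int:
    """Compare two strings after Unicode case folding; returns -1, 0, or 1."""
    x, y = a.casefold(), b.casefold()
    return (x > y) - (x < y)


conn = sqlite3.connect(":memory:")
# UNICODE_NOCASE is a hypothetical name chosen for this example.
conn.create_collation("UNICODE_NOCASE", casefold_compare)
conn.execute("CREATE TABLE items (foo TEXT)")
conn.execute("INSERT INTO items (foo) VALUES (?)", ("Привет",))

# Built-in NOCASE folds only ASCII letters, so the lowercase Cyrillic form does not match.
ascii_only = conn.execute(
    "SELECT foo FROM items WHERE foo = ? COLLATE NOCASE", ("привет",)
).fetchall()

# The custom collation folds case via Python's str.casefold, so the same lookup finds the row.
unicode_aware = conn.execute(
    "SELECT foo FROM items WHERE foo = ? COLLATE UNICODE_NOCASE", ("привет",)
).fetchall()
```

With a collation like this, a single query covers both ASCII and Cyrillic values, so no fallback query is needed. Note that a collation implemented in application code is only available on connections that register it.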

Related

sqlite3 fts3 multiple columns search including special characters

I am using SQLite3 FTS3 (SQLite version 3.7.17).
I tried to search for keywords that include special characters (e.g. #, ?) in multiple columns.
Here is my example:
SELECT * FROM table_name WHERE table_name MATCH
'EMAIL:aaa#test.com OR SUBJECT:is it a question?'
This query should return rows whose email address is 'aaa#test.com' or whose subject is 'is it a question?'.
But it does not return the correct results.
I think that SQLite3 FTS3 can't recognize special characters...
How can I solve this problem?
To do a phrase query, you must use quotes.
Special characters are filtered out by the default tokenizer; aaa#test.com must be handled as a phrase with three words (aaa, test, and com).
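To make the "phrase with three words" point concrete, the sketch below approximates what the default "simple" tokenizer does and builds quoted phrases from the result. It is a pure-Python imitation written for illustration, not SQLite's actual code, and the helper names simple_tokens and as_phrase are invented here. It also leaves out the column filters from the question.

```python
import re
import string

# ASCII letters and digits, plus any non-ASCII character, count as token characters.
# This mirrors the documented behaviour of the "simple" tokenizer only approximately.
_TOKEN = re.compile(r"[0-9A-Za-z\u0080-\U0010FFFF]+")
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def simple_tokens(text: str) -> list[str]:
    """Split text on separator characters and lowercase ASCII letters only."""
    return [tok.translate(_ASCII_LOWER) for tok in _TOKEN.findall(text)]


def as_phrase(text: str) -> str:
    """Turn arbitrary text into a double-quoted FTS phrase made of its tokens."""
    return '"' + " ".join(simple_tokens(text)) + '"'


email_tokens = simple_tokens("aaa#test.com")
match_expr = as_phrase("aaa#test.com") + " OR " + as_phrase("is it a question?")
```

Here email_tokens holds the three words the answer refers to, and match_expr is a MATCH expression made of two quoted phrases joined by OR.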

Can I use Order By and ToLower to perform a case-insensitive string sort on DocumentDB?

I would like to sort the records in my DocumentDB collection alphabetically by their title. At first I thought this was working:
SELECT c.Title FROM c ORDER BY c.Title
But, as would be expected, this sorts lowercase letters after uppercase ones. I would like the sort to be case-insensitive, so I tried this:
SELECT c.Title FROM c order by LOWER(c.Title)
and this:
SELECT LOWER(c.Title) AS title FROM c ORDER BY title
but both of these generate errors. How can I perform a case-insensitive string sort?
The best way to do a case-insensitive search or sort is to add a separate field that holds the lowercased value of the corresponding field (in this case, Title). DocumentDB's automatic indexing is efficient enough that the extra field adds little to no overhead.
Once you have the extra field, point your case-insensitive queries at the new field.
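A minimal sketch of that idea, in plain Python with no DocumentDB client involved: the field name TitleLower and the helper with_lowercase_title are hypothetical, and the in-memory sort only stands in for a query such as SELECT c.Title FROM c ORDER BY c.TitleLower.

```python
def with_lowercase_title(doc: dict) -> dict:
    """Return a copy of doc with a TitleLower field derived from Title."""
    enriched = dict(doc)
    enriched["TitleLower"] = doc["Title"].lower()
    return enriched


# Add the shadow field before the documents are written to the collection.
docs = [
    with_lowercase_title(d)
    for d in ({"Title": "banana"}, {"Title": "Apple"}, {"Title": "cherry"})
]

# Sorting on the shadow field ignores case; the original Title is what gets returned.
sorted_titles = [d["Title"] for d in sorted(docs, key=lambda d: d["TitleLower"])]
```

The shadow field has to be kept in sync whenever Title changes, which is the main cost of this approach.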

How to escape string for SQLite FTS query

I'm trying to perform an SQLite FTS query with untrusted user input. I do not want to give the user access to the query syntax; that is, they should not be able to perform a match query like foo OR bar AND cats. If they tried to query with that string, I would want to interpret it as something more like foo \OR bar \AND cats.
There doesn't seem to be anything built in to SQLite for this, so I'll probably end up building my own escaping function, but this seems dangerous and error-prone. Is there a preferred way to do this?
The FTS MATCH syntax is its own little language. For FTS5, verbatim string literals are well defined:
Within an FTS expression a string may be specified in one of two ways:
By enclosing it in double quotes ("). Within a string, any embedded double quote characters may be escaped SQL-style - by adding a second double-quote character.
(redacted special case)
It turns out that correctly escaping a string for an FTS query is simple enough to implement completely and reliably: Replace " with "" and enclose the result in " on both ends.
In my case it then works perfectly when I put it into a prepared statement such as SELECT stuff FROM fts_table WHERE fts_table MATCH ?. I would then .bind(fts_escape(user_input)) where fts_escape is the function I described above.
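The escaping rule and the prepared-statement usage described above could look roughly like this in Python. This is a sketch that assumes an SQLite build with FTS5 enabled; the table fts_table and column stuff follow the names used in the answer, and the sample rows are invented.

```python
import sqlite3


def fts_escape(user_input: str) -> str:
    """Wrap user input as a single FTS5 string literal, doubling embedded quotes."""
    return '"' + user_input.replace('"', '""') + '"'


conn = sqlite3.connect(":memory:")
# Requires FTS5; the sample rows below are made up for this example.
conn.execute("CREATE VIRTUAL TABLE fts_table USING fts5(stuff)")
conn.executemany(
    "INSERT INTO fts_table (stuff) VALUES (?)",
    [("foo",), ("foo or bar and cats",)],
)

sql = "SELECT stuff FROM fts_table WHERE fts_table MATCH ?"
user_input = "foo OR bar AND cats"

# Unescaped, OR and AND act as operators, so the row containing only "foo" matches too.
raw_hits = conn.execute(sql, (user_input,)).fetchall()

# Escaped, the whole input is one phrase and only the row containing that phrase matches.
escaped_hits = conn.execute(sql, (fts_escape(user_input),)).fetchall()
```

Binding the escaped string as a parameter keeps the SQL layer and the FTS query layer separate: the parameter binding protects the SQL statement, and the quoting protects the MATCH expression.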
OK, I've investigated further, and with some heavy magic you can access the actual tokenizer used by SQLite's FTS. The "simple" tokenizer takes your string, splits it on any character that is not in [A-Za-z0-9], and lowercases what remains. If you perform this same operation, you will get a nicely "escaped" string suitable for FTS.
You can write your own, but you can access SQLite's internal one as well. See this question for details on that: Automatic OR queries using SQLite FTS4

Why does SQLite full-text search (FTS4) treat angle brackets differently in a compound search?

I have an SQLite database using FTS4. It is used to store emails with message IDs of the form <8200#comms.io>.
Searching for messages using the FTS MATCH syntax, I get a result from:
SELECT rowid FROM emails WHERE emails MATCH '<8200#comms.io>'
This returns the correct row. But when I try to find multiple emails, I get an empty response:
SELECT rowid FROM emails WHERE emails MATCH '<8200#comms.io> OR <8188#comms.io>'
Strangely though, I can search without the angle bracket characters. This returns both rows:
SELECT rowid FROM emails WHERE emails MATCH '8200#comms.io OR 8188#comms.io'
This happens even though the angle brackets are present in the stored columns. I can find no mention that these are special characters in SQLite, and without the 'OR', the single-term search works fine.
Why are these characters treated differently in my compound search?
The default (simple) tokenizer reads alphanumerical characters and treats all others as word separators to be ignored.
So when searching for a message ID, you have to actually search for a phrase with multiple words (8200, comms, and io).
If you want to treat the entire message ID as a word, you have to write a custom tokenizer.
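A small sketch of the phrase-based search, assuming an SQLite build with FTS3/FTS4 enabled. The single-column table layout and the third sample row are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Requires FTS3/FTS4; the table layout is made up for this example.
conn.execute("CREATE VIRTUAL TABLE emails USING fts4(message_id)")
conn.executemany(
    "INSERT INTO emails (message_id) VALUES (?)",
    [("<8200#comms.io>",), ("<8188#comms.io>",), ("<9000#comms.io>",)],
)

# Each message ID is searched as a quoted phrase of its three tokens.
match_expr = '"8200 comms io" OR "8188 comms io"'
rowids = [
    row[0]
    for row in conn.execute(
        "SELECT rowid FROM emails WHERE emails MATCH ? ORDER BY rowid",
        (match_expr,),
    )
]
```

Because the angle brackets, '#', and '.' never reach the index, quoting the three words as a phrase is what ties them back together into one message ID.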

full text searching

I have an application that allows users to search on multiple columns (prod_name, prod_desc).
So I used full-text search as shown below, but it does not return all the records. For example, I tried to find the 'o' character in the two columns (prod_name, prod_desc), but some records are not found.
Also, when I do not use a wildcard for the 'o' character, it finds nothing at all, although I thought CONTAINS meant the same as LIKE '%o%'.
I am a bit confused about full-text search.
What is the problem?
CREATE FULLTEXT CATALOG catalog_crashcourse3;
CREATE FULLTEXT INDEX ON products(prod_name,prod_desc)
KEY INDEX pk_products ON catalog_crashcourse3;
SELECT prod_name, prod_desc
FROM products
WHERE CONTAINS((prod_name,prod_desc), '"*o*"');
SQL Server FTS is a word-based search process. When you create a full-text index on a column, the indexing engine crawls the content and breaks it into individual words in a process known as tokenization. The index then stores the word, the primary key of the row it was found in, and the word's position in the content (i.e. whether it is the first word in the field, the 57th word, and so on).
When you specify a CONTAINS predicate such as
CONTAINS((prod_name,prod_desc), '"o"');
the SQL Server FTS engine looks for tokens (i.e. words) in its index that are "o". If your content does not have the word "o" in it (which it probably doesn't), then no matches will be found.
As you point out, you can do wildcard searches, where you try to match patterns against the indexed words. For example, if you specify a predicate such as
CONTAINS((prod_name,prod_desc), '"o*"');
then the search will return the rows whose indexed content contains a word that starts with the letter "o".
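The difference between word matching, prefix matching, and LIKE-style substring matching can be seen in a toy model. The sketch below is a deliberately simplified inverted index in Python, not SQL Server's implementation; the sample rows and the helper names contains_word and contains_prefix are invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical sample rows keyed by primary key.
rows = {
    1: "Fish bean bag toy",
    2: "Raggedy Ann doll",
    3: "One ton anvil, orange handle",
}

# Map each lowercased word to the (row key, word position) pairs where it occurs.
index = defaultdict(list)
for key, text in rows.items():
    for position, word in enumerate(re.findall(r"\w+", text.lower()), start=1):
        index[word].append((key, position))


def contains_word(term: str) -> set[int]:
    """Row keys holding a word exactly equal to term, similar to CONTAINS(..., '"term"')."""
    return {key for key, _ in index.get(term.lower(), [])}


def contains_prefix(prefix: str) -> set[int]:
    """Row keys holding a word that starts with prefix, similar to CONTAINS(..., '"prefix*"')."""
    prefix = prefix.lower()
    return {
        key
        for word, hits in index.items()
        if word.startswith(prefix)
        for key, _ in hits
    }


exact_o = contains_word("o")     # no row has the standalone word "o"
prefix_o = contains_prefix("o")  # only "one" and "orange" start with "o"
like_o = {key for key, text in rows.items() if "o" in text.lower()}  # LIKE '%o%' analogue
```

In this model the exact search finds nothing, the prefix search finds only row 3, and the substring scan finds all three rows, which is the gap between CONTAINS and LIKE '%o%' that the question runs into.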
FTS is best used when you want to search for groups of words in your indexed content. It can do sophisticated word stemming (such as searching for "ran" and "running" when you specify "run"). It also provides a ranking of the search result content so that you can find the best match. If you just want to search for a specified word in your content and your content is not too large, you may not need FTS. As MikeSmithDev pointed out in the comments, you may be able to just get away with a LIKE clause.
Note added: In response to your comment, if you have a table with 8 columns that you want to search using FTS, then you would include each of these columns in the table's full-text index (a table can have only one) and search them as follows:
CONTAINS(*, '"Word"')
where the asterisk indicates that all 8 indexed columns in the table should be included in the search.
You have two issues:
1. You are using a leading wildcard (*o), which SQL Server FTS is helpless with. It only works with trailing wildcards (prefix terms) like word*.
2. You are using a single-character search term. Single-character words are excluded from the FT index by default, which is a good thing. Unless specified otherwise, SQL Server associates the system full-text stoplist with the index when it is created.
To see the default stoplist your database is using behind your back, use this query:
Select SysStop.stopword, Langs.name
From sys.fulltext_system_stopwords SysStop
Inner Join sys.fulltext_languages Langs
On Langs.lcid = SysStop.language_id;
If you really want to search for single characters, you can drop and recreate the FT index using the option WITH STOPLIST OFF, but be prepared for a lot of noise. See Create FullText Index.
