Limit fulltext search in MariaDB (innodb) - mariadb

I'm having trouble making a search on a fairly large (5 million entries) table fast.
This is innodb on MariaDB (10.4.25).
Structure of the table my_table is like so:
id
text
1
some text
2
some more text
I now have a fulltext index on "text" and search for:
SELECT id FROM my_table WHERE MATCH ('text') AGAINST ("some* tex*" IN BOOLEAN MODE);
This is not super slow but can yield to millions of results. Retrieving them in my Java application takes forever but I need the matching ids.
Therefore, I wanted to limit the number already by the ids I know can only be relevant and tried something like this (id is primary index):
SELECT id FROM my_table WHERE id IN (1,2) AND MATCH ('text') AGAINST ("some* tex*" IN BOOLEAN MODE);
hoping that it would first limit to the 2 ids and then apply the fulltext search and give me the two results super quickly. Alas, that's not what happened and I don't understand why.
How can I limit the query if I already know some ids to only search through those AND make the query faster by doing so?

When you use a FULLTEXT (or SPATIAL) index together with some 'regular' index, the Optimizer assumes that the former will run faster, so it does that first.
Furthermore, it is nontrivial (maybe impossible) to run MATCH against a subset of a table.
Both of those conspire to say that the MATCH will happen first. (Of course, you were hoping to do the opposite.)
Is there a workaround? I doubt it. Especially if there a lot of rows with words starting with 'some' or 'tex'.
One thing to try is "+":
MATCH ('text') AGAINST ("+some* +tex*" IN BOOLEAN MODE);
Please report back whether this helped.
Hmmmm... Perhaps you want
MATCH (`text`) -- this
MATCH ('text') -- NOT this
There are two features in MariaDB:
max time spent in query
max number of rows accessed (may not apply to FULLTEXT)

Related

Cosmos DB Query Adds RUs when using OR

I have a simple core sql query that gets a count of rows. If i do the EXISTS and the IN separately, it's around 2/3RUs, but if i do a (EXISTS "OR" IN) -- I can even do (EXISTS "OR" TRUE), then it jumps up to 45RU. It makes more since for me to do 2 different queries than 1. Why does the OR cause the RU consuption to go up?
These are my queries that I've tried and I've experimented on.
SELECT VALUE COUNT(1) FROM ROOT r. -- 850 rows, 2-3RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) -- 830 rows, 2-3RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND r.id IN (......). 830 rows, 2-3RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND EXISTS(SELECT s FROM s IN r.shared WHERE s.id = ID) -- 840rows, 2-3RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND (EXISTS(SELECT s FROM s IN r.shared WHERE s.id = ID) OR r.id IN (...)) -- 840rows, 45RUs
This is also cross-listed on Microsoft Q/A as well.
Disclaimer: I have no internal view on CosmosdB engine and below is just a general guess.
While there may be tricks involved regarding data cardinality, how your index is set up and if/how the predicate tree could be pruned, but overall it is not too surprising that OR is a harder query. You can't have a covering index for OR-predicate and that requires data lookups.
For index-covered ANDs only, basically:
get matching entries from indexes for indexable predicates and take intersection.
return count
With OR-s you can't work on indexes alone:
get matching entries from indexes for indexable predicates and take intersection.
look up documents (or required parts)
Evaluate non-indexable predicates (like A OR B) on all matching documents
return count
Obviously the second requires a lot more computation and memory. Hence, higher RU. Query engine can do all kind of tricks, but the fact is that they must get extra data to make sure your "hard" predicates are taken into account.
BTW, if unhappy with RU, then you should always check which/how indexes were applied and if you can improve anything by setting up different indexes.
See: Indexing metrics in Azure Cosmos DB.
Having more complex queries have higher RU is still to be expected, though.

SQLite: re-arrange physical position of rows inside file

My problem is that my querys are too slow.
I have a fairly large sqlite database. The table is:
CREATE TABLE results (
timestamp TEXT,
name TEXT,
result float,
)
(I know that timestamps as TEXT is not optimal, but please ignore that for the purposes of this question. I'll have to fix that when I have the time)
"name" is a category. This calculation holds the results of a calculation that has to be done at each timestamp for all "name"s. So the inserts are done at equal-timestamps, but the querys will be done at equal-names (i.e. I want given a name, get its time series), like:
SELECT timestamp,result WHERE name='some_name';
Now, the way I'm doing things now is to have no indexes, calculate all results, then create an index on name CREATE INDEX index_name ON results (name). The reasoning is that I don't need the index when I'm inserting, but having the index will make querys on the index really fast.
But it's not. The database is fairly large. It has about half a million timestamps, and for each timestamp I have about 1000 names.
I suspect, although I'm not sure, that the reason why it's slow is that every though I've indexed the names, they're still scattered all around the physical disk. Something like:
timestamp1,name1,result
timestamp1,name2,result
timestamp1,name3,result
...
timestamp1,name999,result
timestamp1,name1000,result
timestamp2,name1,result
timestamp2,name2,result
etc...
I'm sure this is slower to query with NAME='some_name' than if the rows were physically ordered as:
timestamp1,name1,result
timestamp2,name1,result
timestamp3,name1,result
...
timestamp499997,name1000,result
timestamp499998,name1000,result
timestamp499999,name1000,result
timestamp500000,namee1000,result
etc...
So, how do I tell SQLite that the order in which I'd like the rows in disk isn't the one they were written in?
UPDATE: I'm further convinced that the slowness in doing a select with such an index comes exclusively from non-contiguous disk access. Doing SELECT * FROM results WHERE name=<something_that_doesnt_exist> immediately returns zero results. This suggests that it's not finding the names that's slow, it's actually reading them from the disk.
Normal sqlite tables have, as a primary key, a 64-bit integer (Known as rowid and a few other aliases). That determines the order that rows are stored in a B*-tree (Which puts all actual data in leaf node pages). You can change this with a WITHOUT ROWID table, but that requires an explicit primary key which is used to place rows in a B-tree. So if every row's (name, timestamp) columns make a unique value, that's a possibility that will leave all rows with the same name on a smaller set of pages instead of scattered all over.
You'd want the composite PK to be in that order if you're searching for a particular name most of the time, so something like:
CREATE TABLE results (
timestamp TEXT
, name TEXT
, result REAL
, PRIMARY KEY (name, timestamp)
) WITHOUT ROWID
(And of course not bothering with a second index on name.) The tradeoff is that inserts are likely to be slower as the chances of needing to split a page in the B-tree go up.
Some pragmas worth looking into to tune things:
cache_size
mmap_size
optimize (After creating your index; also consider building sqlite with SQLITE_ENABLE_STAT4.)
Since you don't have an INTEGER PRIMARY KEY, consider VACUUM after deleting a lot of rows if you ever do that.

CustTableListPage filtering is too slow

When I'm trying to filter CustAccount field on CustTableListPage it's taking too long to filter. On the other fields there is no latency. I'm trying to filter just part of account number like "*123".
I have done reindexing for custtable and also updated statics but not appreciable difference at all.
When i have added listpage's query in a view it's filtering custAccount field normally like the other fields.
Any suggestion?
Edit:
Our version is AX 2012 r2 cu8, not a user based problem it occurs for every user, Interaction class has some custimizations but just for setting some buttons enable/disable props. etc... i tryed to look query execution what i found is not clear. something like FETCH_API_CURSOR_000000..x
Record a trace of this execution and locate what is a bottleneck.
Keep in mind that that wildcards (such as *) have to be used with care. Using a filter string that starts with a wildcard kills all performance because the SQL indexes cannot be used.
Using a wildcard at the end
Imagine that you have a dictionnary and have to list all the words starting with 'Foo'. You can skip all entries before 'F', then all those before 'Fo', then all those before 'Foo' and start your result list from there.
Similarly, asking the underlying SQL engine to list all CustAccount entries starting with '123' (= filter string '123*') allows using an index on CustAccount to quickly skip to the relevant data.
Using a wildcard at the start
Imagine that you still have that dictionnary and have to list all the words ending with 'ing'. You would have no other choice than going through the entire dictionnary and checking the ending of every word (due to the alphabetical sorting).
This explains why asking the SQL engine to list all CustAccount entries ending with '123' (= filter string '*123') means that all CustAccount values must be investigated. So the AOS loops through all the entries and uses an SQL cursor to do this. That is the FETCH_API_CURSOR statement you see on the SQL level.
Possible solutions
Educate your end user that using a wildcard at the beginning of a filter string will always be slow on a large table.
Step up the SQL server hardware / allocated resources (faster CPU, more RAM, faster disk, ...).
Create a full text index on CustAccount (not a fan of this one and performance impact should be thoroughly investigated).
I've solve the problem. CustTableListPage query had a sorting over DirPartyTable.Name field. When I remove this sorting, filtering with wildcard working like a charm.

how can I get faster FTS4 query results ordered by a field in another table?

Background
I'm implementing full-text search over a body of email messages stored in SQLite, making use of its fantastic built-in FTS4 engine. I'm getting some rather poor query performance, although not exactly where I would expect. Let's take a look.
Representative schema
I'll give some simplified examples of the code in question, with links to the full code where applicable.
We've got a MessageTable that stores the data about an email message (full version spread out over several files here, here, and here):
CREATE TABLE MessageTable (
id INTEGER PRIMARY KEY,
internaldate_time_t INTEGER
);
CREATE INDEX MessageTableInternalDateTimeTIndex
ON MessageTable(internaldate_time_t);
The searchable text is added to an FTS4 table named MessageSearchTable (full version here):
CREATE VIRTUAL TABLE MessageSearchTable USING fts4(
id INTEGER PRIMARY KEY,
body
);
The id in the search table acts as a foreign key to the message table.
I'll leave it as an exercise for the reader to insert data into these tables (I certainly can't give out my private email). I have just under 26k records in each table.
Problem query
When we retrieve search results, we need them to be ordered descending by internaldate_time_t so we can pluck out only the most recent few results. Here's an example search query (full version here):
SELECT id
FROM MessageSearchTable
JOIN MessageTable USING (id)
WHERE MessageSearchTable MATCH 'a'
ORDER BY internaldate_time_t DESC
LIMIT 10 OFFSET 0
On my machine, with my email, that runs in about 150 milliseconds, as measured via:
time sqlite3 test.db <<<"..." > /dev/null
150 milliseconds is no beast of a query, but for a simple FTS lookup and indexed order, it's sluggish. If I omit the ORDER BY, it completes in 10 milliseconds, for example. Also keep in mind that the actual query has one more sub-select, so there's a little more work going on in general: the full version of the query runs in about 600 milliseconds, which is into beast territory, and omitting the ORDER BY in that case shaves 500 milliseconds off the time.
If I turn on stats inside sqlite3 and run the query, I notice the line:
Sort Operations: 1
If my interpretation of the docs about those stats is correct, it looks like the query is completely skipping using the MessageTableInternalDateTimeTIndex. The full version of the query also has the line:
Fullscan Steps: 25824
Sounds like it's walking the table somewhere, but let's ignore that for now.
What I've discovered
So let's work on optimizing that a little bit. I can rearrange the query into a sub-select and force SQLite to use our index with the INDEXED BY extension:
SELECT id
FROM MessageTable
INDEXED BY MessageTableInternalDateTimeTIndex
WHERE id IN (
SELECT id
FROM MessageSearchTable
WHERE MessageSearchTable MATCH 'a'
)
ORDER BY internaldate_time_t DESC
LIMIT 10 OFFSET 0
Lo and behold, the running time has dropped to around 100 milliseconds (300 milliseconds in the full version of the query, a 50% reduction in running time), and there are no sort operations reported. Note that with just reorganizing the query like this but not forcing the index with INDEXED BY, there's still a sort operation (though we've still shaved off a few milliseconds oddly enough), so it appears that SQLite is indeed ignoring our index unless we force it.
I've also tried some other things to see if they'd make a difference, but they didn't:
Explicitly making the index DESC as described here, with and without INDEXED BY
Explicitly adding the id column in the index, with and without internaldate_time_t ordered DESC, with and without INDEXED BY
Probably several other things I can't remember at this moment
Questions
100 milliseconds here still seems awfully slow for what seems like it should be a simple FTS lookup and indexed order.
What's going on here? Why is it ignoring the obvious index unless you force its hand?
Am I hitting some limitation with combining data from virtual and regular tables?
Why is it still so relatively slow, and is there anything else I can do to get FTS matches ordered by a field in another table?
Thanks!
An index is useful for looking up a table row based on the value of the indexed column.
Once a table row is found, indexes are no longer useful because it is not efficient to look up a table row in an index by any other criterium.
An implication of this is that it is not possible to use more than one index for each table accessed in a query.
Also see the documentation: Query Planning, Query Optimizer.
Your first query has the following EXPLAIN QUERY PLAN output:
0 0 0 SCAN TABLE MessageSearchTable VIRTUAL TABLE INDEX 4: (~0 rows)
0 1 1 SEARCH TABLE MessageTable USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0 0 0 USE TEMP B-TREE FOR ORDER BY
What happens is that
the FTS index is used to find all matching MessageSearchTable rows;
for each row found in 1., the MessageTable primary key index is used to find the matching row;
all rows found in 2. are sorted with a temporary table;
the first 10 rows are returned.
Your second query has the following EXPLAIN QUERY PLAN output:
0 0 0 SCAN TABLE MessageTable USING COVERING INDEX MessageTableInternalDateTimeTIndex (~100000 rows)
0 0 0 EXECUTE LIST SUBQUERY 1
1 0 0 SCAN TABLE MessageSearchTable VIRTUAL TABLE INDEX 4: (~0 rows)
What happens is that
the FTS index is used to find all matching MessageSearchTable rows;
SQLite goes through all entries in the MessageTableInternalDateTimeTIndex in the index order, and returns a row when the id value is one of the values found in step 1.
SQLite stops after the tenth such row.
In this query, it is possible to use the index for (implied) sorting, but only because no other index is used for looking up rows in this table.
Using an index in this way implies that SQLite has to go through all entries, instead of lookup up the few rows that match some other condition.
When you omit the INDEXED BY clause from your second query, you get the following EXPLAIN QUERY PLAN output:
0 0 0 SEARCH TABLE MessageTable USING INTEGER PRIMARY KEY (rowid=?) (~25 rows)
0 0 0 EXECUTE LIST SUBQUERY 1
1 0 0 SCAN TABLE MessageSearchTable VIRTUAL TABLE INDEX 4: (~0 rows)
0 0 0 USE TEMP B-TREE FOR ORDER BY
which is essentially the same as your first query, except that joins and subqueries are handled slightly differently.
With your table structure, it is not really possible to get faster.
You are doing three operations:
looking up rows in MessageSearchTable;
looking up corresponding rows in MessageTable;
sorting rows by a MessageTable value.
As far as indexes are concerned, steps 2 and 3 conflict with each other.
The database has to choose whether to use an index for step 2 (in which case sorting must be done explicitly) or for step 3 (in which case it has to go through all MessageTable entries).
You could try to return fewer records from the FTS search by making the message time a part of the FTS table and searching only for the last few days (and increasing or dropping the time if you don't get enough results).

How Optimize sql query make it faster

I have a very simple small database, 2 of tables are:
Node (Node_ID, Node_name, Node_Date) : Node_ID is primary key
Citation (Origin_Id, Target_Id) : PRIMARY KEY (Origin_Id, Target_Id) each is FK in Node
Now I write a query that first find all citations that their Origin_Id has a specific date and then I want to know what are the target dates of these records.
I'm using sqlite in python the Node table has 3000 record and Citation has 9000 records,
and my query is like this in a function:
def cited_years_list(self, date):
c=self.cur
try:
c.execute("""select n.Node_Date,count(*) from Node n INNER JOIN
(select c.Origin_Id AS Origin_Id, c.Target_Id AS Target_Id, n.Node_Date AS
Date from CITATION c INNER JOIN NODE n ON c.Origin_Id=n.Node_Id where
CAST(n.Node_Date as INT)={0}) VW ON VW.Target_Id=n.Node_Id
GROUP BY n.Node_Date;""".format(date))
cited_years=c.fetchall()
self.conn.commit()
print('Cited Years are : \n ',str(cited_years))
except Exception as e:
print('Cited Years retrival failed ',e)
return cited_years
Then I call this function for some specific years, But it's crazy slowwwwwwwww :( (around 1 min for a specific year)
Although my query works fine, it is slow. would you please give me a suggestion to make it faster? I'd appreciate any idea about optimizing this query :)
I also should mention that I have indices on Origin_Id and Target_Id, so the inner join should be pretty fast, but it's not!!!
If this script runs over a period of time, you may consider loading the database into memory. Since you seem to be coding in python, there is a connection function called connection.backup that can backup an entire database into memory. Since memory is much faster than disk, this should increase speed. Of course, this doesn''t do anything to optimize the statement itself, since I don't have enough of the code to evaluate what it is you are doing with the code.
Instead of COUNT(*) use MAX(n.Node_Date)
SQLite doesn't keep a counter on number of tables like mysql does but instead it scans all your rows everytime you call COUNT meaning extremely slow.. yet you can use MAX() to fix that problem.

Resources