I have a simple Core (SQL) query that gets a count of rows. If I do the EXISTS and the IN separately, each is around 2-3 RUs, but if I do (EXISTS OR IN) -- I can even do (EXISTS OR TRUE) -- then it jumps up to 45 RUs. It makes more sense for me to do 2 different queries than 1. Why does the OR cause the RU consumption to go up?
These are the queries I've tried and experimented with.
SELECT VALUE COUNT(1) FROM ROOT r -- 850 rows, 2-3 RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) -- 830 rows, 2-3 RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND r.id IN (......) -- 830 rows, 2-3 RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND EXISTS(SELECT s FROM s IN r.shared WHERE s.id = ID) -- 840 rows, 2-3 RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND (EXISTS(SELECT s FROM s IN r.shared WHERE s.id = ID) OR r.id IN (...)) -- 840 rows, 45 RUs
This is also cross-posted on Microsoft Q&A.
Disclaimer: I have no internal view of the Cosmos DB engine and the below is just a general guess.
While there may be tricks involved regarding data cardinality, how your index is set up, and if/how the predicate tree can be pruned, overall it is not too surprising that OR makes for a harder query. You can't have a covering index for an OR predicate, and that requires document lookups.
For index-covered ANDs only, basically:
get matching entries from indexes for indexable predicates and take intersection.
return count
With ORs you can't work on the indexes alone:
get matching entries from indexes for indexable predicates and take intersection.
look up documents (or required parts)
evaluate non-indexable predicates (like A OR B) on all matching documents
return count
Obviously the second approach requires a lot more computation and memory; hence the higher RU charge. The query engine can do all kinds of tricks, but the fact is that it must fetch extra data to make sure your "hard" predicates are taken into account.
BTW, if you are unhappy with the RU charge, you should always check which indexes were applied and how, and whether you can improve anything by setting up different indexes.
See: Indexing metrics in Azure Cosmos DB.
More complex queries having a higher RU charge is still to be expected, though.
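For completeness, this is the kind of two-query split mentioned in the question (copied from the queries above; the id lists stay elided). Note that if a document can match both predicates, the two counts can't simply be added; you would have to select ids instead and de-duplicate client-side.
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND r.id IN (...)
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND EXISTS(SELECT s FROM s IN r.shared WHERE s.id = ID)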
I'm having trouble making a search on a fairly large (5 million entries) table fast.
This is InnoDB on MariaDB (10.4.25).
Structure of the table my_table is like so:
id | text
1  | some text
2  | some more text
I now have a fulltext index on "text" and search for:
SELECT id FROM my_table WHERE MATCH ('text') AGAINST ("some* tex*" IN BOOLEAN MODE);
This is not super slow but can yield millions of results. Retrieving them in my Java application takes forever, but I need the matching ids.
Therefore, I wanted to limit the results up front to the ids I already know can be relevant and tried something like this (id is the primary key):
SELECT id FROM my_table WHERE id IN (1,2) AND MATCH ('text') AGAINST ("some* tex*" IN BOOLEAN MODE);
hoping that it would first limit the search to the 2 ids and then apply the fulltext search, giving me the two results super quickly. Alas, that's not what happened and I don't understand why.
How can I limit the query if I already know some ids to only search through those AND make the query faster by doing so?
When you use a FULLTEXT (or SPATIAL) index together with some 'regular' index, the Optimizer assumes that the former will run faster, so it does that first.
Furthermore, it is nontrivial (maybe impossible) to run MATCH against a subset of a table.
Both of those conspire to say that the MATCH will happen first. (Of course, you were hoping to do the opposite.)
Is there a workaround? I doubt it. Especially if there are a lot of rows with words starting with 'some' or 'tex'.
One thing to try is "+":
MATCH ('text') AGAINST ("+some* +tex*" IN BOOLEAN MODE);
Please report back whether this helped.
Hmmmm... Perhaps you want
MATCH (`text`) -- this
MATCH ('text') -- NOT this
With single quotes, 'text' is a string literal rather than a reference to the text column; backticks name the column.
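Putting the two suggestions together, the query from the question would look something like this (an untested sketch):
SELECT id FROM my_table
WHERE id IN (1,2)
  AND MATCH (`text`) AGAINST ('+some* +tex*' IN BOOLEAN MODE);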
There are two features in MariaDB that can cap the cost of a query:
max time spent in a query (the max_statement_time variable)
max number of rows accessed (LIMIT ROWS EXAMINED; may not apply to FULLTEXT)
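A rough sketch of both, assuming current MariaDB syntax (check the documentation for your version, especially regarding FULLTEXT behaviour):
SET SESSION max_statement_time = 2;  -- abort statements running longer than ~2 seconds
SELECT id FROM my_table
WHERE MATCH (`text`) AGAINST ('+some* +tex*' IN BOOLEAN MODE)
LIMIT ROWS EXAMINED 100000;          -- stop after examining at most 100000 rows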
Considering the following query:
SELECT TOP 1 * FROM c
WHERE c.Type = 'Case'
AND c.Entity.SomeField = #someValue
AND c.Entity.CreatedTimeUtc > #someTime
ORDER BY c.Entity.CreatedTimeUtc DESC
Until recently, when I ran this query, the number of documents processed by the query (RetrievedDocumentCount in the query metrics) was the number of documents that satisfy the first two conditions, regardless of the CreatedTimeUtc filter or the TOP 1.
Only when I added a composite index of (Type DESC, Entity.SomeField DESC, Entity.CreatedTimeUtc DESC) and added those paths to the ORDER BY clause did the retrieved document count drop to the number of documents that satisfy all 3 conditions (still not one document as expected, but better).
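In other words, the query became roughly the following (the ORDER BY mirroring the composite index paths described above):
SELECT TOP 1 * FROM c
WHERE c.Type = 'Case'
AND c.Entity.SomeField = #someValue
AND c.Entity.CreatedTimeUtc > #someTime
ORDER BY c.Type DESC, c.Entity.SomeField DESC, c.Entity.CreatedTimeUtc DESC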
Then, starting a few days ago, we noticed in our dev environment that the composite index is no longer needed, as the retrieved document count changed to only one document (= the number in the TOP clause, as expected), and the RU/s consumed dropped significantly.
My question: is this a new improvement/fix in Cosmos DB? I couldn't find any announcement/documentation on this matter.
If so, is the roll-out completed or still in-progress? We have several production instances in different regions.
Thanks
There have not been any recent changes to our query engine that would explain why this query is suddenly less expensive.
The only thing that would explain this is that fewer results match the filter than before, and that our query engine was able to perform an optimization that it would not otherwise have been able to do with a larger set of results.
Thanks.
Let's say I have a list of URLs and I want to find the domain that appears the fewest times. Here is an example of the database:
3598 ('www.emp.de/blog/tag/fear-factory/')
3599 ('www.emp.de/blog/tag/white-russian/')
3600 ('www.emp.de/blog/musik/die-emp-plattenkiste-zum-07-august-2015/')
3601 ('www.emp.de/Warenkorb/car_/')
3602 ('www.emp.de/ter_dataprotection/')
3603 ('hilfe.monster.de/my20/faq.aspx#help_1_211589')
3604 ('jobs.monster.de/l-nordrhein-westfalen.aspx')
3605 ('karriere-beratung.monster.de')
3606 ('karriere-beratung.monster.de')
In this case it should return jobs.monster.de or hilfe.monster.de. I only want one return value. Is that possible with pure SQLite3?
It should involve some kind of counting of the main URL part before the ".de".
At this moment I do it this way:
con.execute("select url, date from urls_to_visit ORDER BY RANDOM() LIMIT 1")
Here's a query which should handle this correctly:
SELECT substr(url, 1, instr(url, '.de')-1)
FROM urls_to_visit
WHERE url LIKE '%.de%'
-- insurance, can leave out if you're sure the whole table matches
GROUP BY substr(url, 1, instr(url, '.de')-1)
ORDER BY count(*) ASC, RANDOM()
LIMIT 1;
Group on the thing we want to sort by, then order by count(*). This expression extracts the part of the URL before the '.de':
substr(url, 1, instr(url, '.de')-1)
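As a quick sanity check, here is the expression applied to one of the sample rows (a standalone statement, just for illustration):
SELECT substr('hilfe.monster.de/my20/faq.aspx', 1,
              instr('hilfe.monster.de/my20/faq.aspx', '.de') - 1);
-- returns 'hilfe.monster'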
The RANDOM() ensures that ties are broken randomly instead of by following the table's natural ordering.* It only comes into play if there is a tie, as described in the SQLite documentation.
* Technically, the rows would not appear in natural order, but in arbitrary order. That means whatever order is most convenient for the query planner. Database systems often use merge sort or a variant, which is a stable sort, so ties will be consistently broken in the order the rows were fed into the sorting algorithm. Unless the query can benefit significantly from index lookups, which this one almost certainly can't, the most likely query plan is a full table scan, so the sort will typically end up following natural order. But you can't rely on any of this, since the standard does not formally require it.
I executed a query on SQLite and the plan part is
0|1|5|SCAN TABLE edges AS e1 (~250000 rows)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE dihedral USING AUTOMATIC
COVERING INDEX (TYPE=? AND EDGE=?) (~7 rows)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 3
3|0|0|SEARCH TABLE bounds USING AUTOMATIC
COVERING INDEX (FACE=? AND EDGE=?) (~7 rows)
where the relevant part of the WHERE clause is
exists (select dihedral.edge from dihedral where dihedral.type=2 and dihedral.edge=e1.edge) and
exists (select bounds.edge from bounds where bounds.face=f1.face and bounds.edge=e1.edge) and
I understand this is not a high-efficiency query; I just want to increase the performance.
This is my guess:
There is no subquery flattening, right?
The two EXISTS subqueries introduce correlated subqueries, and they are actually executed as indexed nested loops, right?
Reading the query: because the dihedral and bounds tables are independent and both are correlated with the outer edges table, the computational complexity would be O(n^2) with no index. However, as there are covering indexes, the performance should be much better, right? I found on the wiki that an index lookup is O(log(N)) or even better, so the overall complexity should be O(n*log(N)); is this right?
Could anyone help me understand what happened? Thanks.
SQLite does support subquery flattening, but it is not possible for an EXISTS subquery like here.
The AUTOMATIC shows that the database creates a temporary index just for this query.
This is a strong indication that you should create these indexes permanently:
CREATE INDEX dihedral_type_edge ON dihedral(type, edge);
CREATE INDEX bounds_face_edge ON bounds(face, edge);
The outer query goes through all edge rows, and for each row, searches in the indexes.
This would result in O(edge * (log(dihedral) + log(bounds))).
The temporary index creation requires sorting these tables, so the entire runtime ends up being O(dihedral*log(dihedral) + bounds*log(bounds) + edge*(log(dihedral)+log(bounds))).
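To confirm that the permanent indexes are picked up, re-check the plan. A minimal, self-contained sketch (tiny hypothetical schema, not the original tables):
CREATE TABLE edges(edge INTEGER);
CREATE TABLE dihedral(type INTEGER, edge INTEGER);
CREATE INDEX dihedral_type_edge ON dihedral(type, edge);
EXPLAIN QUERY PLAN
SELECT * FROM edges AS e1
WHERE EXISTS (SELECT 1 FROM dihedral
              WHERE dihedral.type = 2 AND dihedral.edge = e1.edge);
-- expected: a SEARCH on dihedral USING COVERING INDEX dihedral_type_edge,
-- with no AUTOMATIC keyword in the plan output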
This question already has answers here:
Count(*) vs Count(1) - SQL Server (13 answers)
COUNT(*) vs. COUNT(1) vs. COUNT(pk): which is better? [duplicate] (5 answers)
Closed 8 years ago.
Yes I know this question is similar to this thread: COUNT(*) vs. COUNT(1) vs. COUNT(pk): which is better?, but this is a bit different.
My senior said that getting the result from count(PrimaryKey), assuming that the PrimaryKey cannot be NULL, is somehow faster than doing a normal count(*). Is this true?
If this is true, is it true for all RDBMSs? Please refer to (semi-)official documentation if possible.
No. This appears to be a persistent misconception, based on a confusion between the syntax
SELECT * FROM ...
and
SELECT COUNT(*) FROM ...
In the first case, * refers to all columns, and returning those certainly requires more resources than returning a single column. In the second case, COUNT(*) is simply shorthand for "count all rows". The mistaken belief is that COUNT(*) somehow instructs the database engine to examine all columns in all rows, whereas COUNT(<pk_field>) would only have to look at one column.
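For concreteness, the only semantic difference between the two forms is that COUNT(<field>) skips NULLs, which a primary key can never contain. A minimal example (hypothetical table):
CREATE TABLE t (pk INTEGER PRIMARY KEY, val INTEGER);
INSERT INTO t (pk, val) VALUES (1, 10), (2, NULL), (3, 30);
SELECT COUNT(*)   AS all_rows,     -- 3: counts every row
       COUNT(val) AS non_null_val, -- 2: NULLs are skipped
       COUNT(pk)  AS non_null_pk   -- 3: a primary key is never NULL
FROM t;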
There are a number of other comments on SO here that reference the SQL-92 standard, which explicitly states that COUNT(*) should just refer to the cardinality of the table, so, at least in theory, database engines should be able to recognize and optimize that.
As far as I can tell, in both cases, most database engines (Postgres, Oracle, MySQL InnoDB) will just perform an index scan to count the number of rows. If you specify the PK, then that index will be used; if you just use COUNT(*), then the query planner will pick an index which spans the entire table*, but the performance should be identical.
The only exception to this that I can find is MySQL with MyISAM tables -- those tables cache the number of rows, so COUNT(*) is very fast. However, the query planner also recognizes COUNT(<field>), where <field> is any non-null column, as a request for the full table size, and uses the cache in that case as well. (source) So again, no difference in performance.
* Theoretically, if you had no such indexes, then COUNT(*) would be very slow, but in that case, COUNT(<pk>) would be impossible by definition
It doesn't matter, for several reasons. First, both notations -- COUNT(1) and COUNT(*) -- are badly designed syntax. Consider the same question about the SUM aggregate. Oh, SUM(*) doesn't make any sense; why? Because summation is the iterative execution of the assignment
for( int columnValue : columnList )
currentSum = currentSum + columnValue;
whereas for the COUNT aggregate it looks like this
for( Tuple t : tupleList )
currentSum = currentSum + 1;
Therefore, the COUNT aggregate shouldn't have any parameters at all!
Then there are all kinds of syntactic quirks, such as COUNT(DISTINCT ...). This simply demonstrates the incompetence of the SQL designers, who tried to squeeze two consecutive actions (selecting distinct tuples, then aggregating) into one operation.
The second reason why it doesn't matter is that, in practice, you'll encounter myriad poorly performing queries, and COUNT(1) vs COUNT(*) is never the bottleneck.