Cosmos DB: TOP vs OFFSET LIMIT

I'm new to Cosmos DB. Is there any difference between the TOP x keyword and the OFFSET 0 LIMIT x clause in a plain query (one without a GROUP BY clause)? It's not clear from the documentation.
From what I can see, the results of these 2 queries are identical.
SELECT * FROM ROOT AS m ORDER BY m.id OFFSET 0 LIMIT 1
SELECT TOP 1 * FROM ROOT AS m ORDER BY m.id

You are right: it doesn't make any difference in terms of performance and cost. You can try the two queries out and measure the Request Charge. It was identical for the various combinations I tried, which is a good indication that they are treated identically.
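For instance, one such pair (the m.type = 'order' filter here is a made-up placeholder; the Query Stats pane in the Data Explorer should report the same Request Charge for both forms):

SELECT TOP 10 * FROM ROOT AS m WHERE m.type = 'order' ORDER BY m.id

SELECT * FROM ROOT AS m WHERE m.type = 'order' ORDER BY m.id OFFSET 0 LIMIT 10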

I couldn't find a clear description of it in the Microsoft documentation either, so I did a test.

Related

Cosmos DB Query Adds RUs when using OR

I have a simple Core SQL query that gets a count of rows. If I do the EXISTS and the IN separately, each is around 2-3 RUs, but if I do (EXISTS OR IN) -- I can even do (EXISTS OR TRUE) -- then it jumps up to 45 RUs. It makes more sense for me to do 2 different queries than 1. Why does the OR cause the RU consumption to go up?
These are my queries that I've tried and I've experimented on.
SELECT VALUE COUNT(1) FROM ROOT r -- 850 rows, 2-3 RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) -- 830 rows, 2-3 RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND r.id IN (......) -- 830 rows, 2-3 RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND EXISTS(SELECT s FROM s IN r.shared WHERE s.id = ID) -- 840 rows, 2-3 RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND (EXISTS(SELECT s FROM s IN r.shared WHERE s.id = ID) OR r.id IN (...)) -- 840 rows, 45 RUs
This is also cross-posted on Microsoft Q&A.
Disclaimer: I have no internal view of the Cosmos DB engine, and the below is just a general guess.
There may be tricks involved regarding data cardinality, how your index is set up, and if/how the predicate tree can be pruned, but overall it is not too surprising that OR is a harder query. You can't have a covering index for an OR predicate, and that requires document lookups.
For index-covered ANDs only, it's basically:
1. Get matching entries from the indexes for the indexable predicates and take the intersection.
2. Return the count.
With ORs you can't work on indexes alone:
1. Get matching entries from the indexes for the indexable predicates and take the intersection.
2. Look up the documents (or the required parts of them).
3. Evaluate the non-indexable predicates (like A OR B) on all matching documents.
4. Return the count.
Obviously the second requires a lot more computation and memory, hence the higher RU charge. The query engine can do all kinds of tricks, but the fact is that it must fetch extra data to make sure your "hard" predicates are taken into account.
BTW, if you're unhappy with the RU charge, you should always check which indexes were applied and how, and whether you can improve anything by setting up different indexes.
See: Indexing metrics in Azure Cosmos DB.
More complex queries having a higher RU charge is still to be expected, though.
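As the asker suspected, running separate queries can indeed be the cheaper route here. A sketch using inclusion-exclusion, count(A OR B) = count(A) + count(B) - count(A AND B), reusing the question's own queries; each piece stays on the cheap, index-friendly path, and the arithmetic happens client-side:

-- A: the EXISTS branch alone (2-3 RUs per the measurements above)
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND EXISTS(SELECT s FROM s IN r.shared WHERE s.id = ID)
-- B: the IN branch alone (2-3 RUs)
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND r.id IN (...)
-- A AND B: the overlap, so documents matching both branches aren't double-counted (2-3 RUs)
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND EXISTS(SELECT s FROM s IN r.shared WHERE s.id = ID) AND r.id IN (...)
-- client-side: count(A) + count(B) - count(A AND B), for roughly 6-9 RUs total instead of 45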

CosmosDB - TOP 1 query with ORDER BY - Retrieved document count and RU

Considering the following query:
SELECT TOP 1 * FROM c
WHERE c.Type = 'Case'
AND c.Entity.SomeField = @someValue
AND c.Entity.CreatedTimeUtc > @someTime
ORDER BY c.Entity.CreatedTimeUtc DESC
Until recently, when I ran this query, the number of documents processed by the query (RetrievedDocumentCount in the query metrics) was the number of documents that satisfy the first two conditions, regardless of the CreatedTimeUtc filter or the TOP 1.
Only when I added a composite index of (Type DESC, Entity.SomeField DESC, Entity.CreatedTimeUtc DESC) and added those fields to the ORDER BY clause did the retrieved document count drop to the number of documents that satisfy all 3 conditions (still not one document as expected, but better).
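For reference, this is roughly what the query looks like once the ORDER BY is aligned with that composite index (a sketch based on the index described above):

SELECT TOP 1 * FROM c
WHERE c.Type = 'Case'
AND c.Entity.SomeField = @someValue
AND c.Entity.CreatedTimeUtc > @someTime
ORDER BY c.Type DESC, c.Entity.SomeField DESC, c.Entity.CreatedTimeUtc DESC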
Then, starting a few days ago, we noticed in our dev environment that the composite index is no longer needed, as the retrieved document count dropped to only one document (= the number in the TOP, as expected), and the RU charge was reduced significantly.
My question: is this a new improvement/fix in Cosmos DB? I couldn't find any announcement/documentation on this matter.
If so, is the roll-out completed or still in progress? We have several production instances in different regions.
Thanks
There have not been any recent changes to our query engine that would explain why this query is suddenly less expensive.
The only thing that would explain this is that fewer results match the filter than before, and that our query engine was able to perform an optimization that it could not have done with a larger set of results.
Thanks.

OrientDB Query Result set of vertices with an empty collection of edges, vertices

This might be a simple question, but I am confused; please help.
I am using OrientDB 2.1.9 and I am experimenting with VehicleHistoryGraph database.
From Studio, in Browse mode, I set the limit to 9 records only. Now I am entering this simple query:
select out() from Person
The result set I get back is 9 records, BUT only two of them have bought a vehicle. The rest are displayed with empty collections []. This is no good; I am confused. I would expect to get back only those two vertices with collections of edges!
How do I get back only those two persons that bought something?
I also noticed that there is an UNWIND operator in SELECT. Is it useful in this case? Can you give an example?
Your query asks for out(), so out() is computed in all cases, and you're shown the results. If you only want the rows for which out().size() > 0, then you can construct a query like this:
select out() from v let $n = out().size() where $n > 0
If you think that one ought to be able to write this more succinctly, e.g. like so:
select out() as n from v where n > 0
then join the club (e.g. by supporting this enhancement request).
(select out() from v where out().size() > 0 is supported.)
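As for the UNWIND question: UNWIND (available since OrientDB 2.1) flattens a collection field into one row per element. A sketch, assuming the Person class from the question:

select out() as bought from Person unwind bought

Each resulting row then carries a single element of the out() collection rather than the whole collection; how rows with empty collections come out is worth verifying on 2.1.9.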

How can I understand the sqlite query plan?

I executed a query on SQLite, and this is the relevant part of the plan:
0|1|5|SCAN TABLE edges AS e1 (~250000 rows)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE dihedral USING AUTOMATIC COVERING INDEX (TYPE=? AND EDGE=?) (~7 rows)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 3
3|0|0|SEARCH TABLE bounds USING AUTOMATIC COVERING INDEX (FACE=? AND EDGE=?) (~7 rows)
and the WHERE clause of the query is:
exists (select dihedral.edge from dihedral where dihedral.type=2 and dihedral.edge=e1.edge) and
exists (select bounds.edge from bounds where bounds.face=f1.face and bounds.edge=e1.edge) and
I understand this is not a high-efficiency query; I just want to increase the performance.
This is my guess:
There is no subquery flattening, right?
The two EXISTS subqueries are correlated subqueries, and they are actually executed as indexed nested loops, right?
Reading the query: the tables dihedral and bounds are independent of each other, but both are correlated with the outer edges table, so the computational complexity would be O(n^2) with no indexes. However, since there are covering indexes, the performance should be much better, right? I found on the wiki that an index lookup performs in O(log(N)) or even better, so the overall performance should be O(n*log(N)), is this right?
Could anyone help me understand what is happening? Thanks.
SQLite does support subquery flattening, but it is not possible for EXISTS subqueries like the ones here.
The AUTOMATIC shows that the database creates a temporary index just for this query.
This is a strong indication that you should create these indexes permanently:
CREATE INDEX dihedral_type_edge ON dihedral(type, edge);
CREATE INDEX bounds_face_edge ON bounds(face, edge);
The outer query goes through all edge rows, and for each row, searches in the indexes.
This would result in O(edge * (log(dihedral) + log(bounds))).
The temporary index creation requires sorting these tables, so the entire runtime ends up being O(dihedral*log(dihedral) + bounds*log(bounds) + edge*(log(dihedral)+log(bounds))).
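To confirm that the permanent indexes get picked up, you can re-check the plan. A sketch using a reduced version of the question's query (only the dihedral predicate, since the full statement isn't shown):

CREATE INDEX dihedral_type_edge ON dihedral(type, edge);
EXPLAIN QUERY PLAN
SELECT * FROM edges AS e1
WHERE EXISTS (SELECT dihedral.edge FROM dihedral
              WHERE dihedral.type = 2 AND dihedral.edge = e1.edge);

The SEARCH line should now name dihedral_type_edge and no longer say AUTOMATIC.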

Sqlite Query Optimization (using Limit and Offset)

Following is the query that I use for getting a fixed number of records from a database with millions of records:
select * from myTable LIMIT 100 OFFSET 0
What I observed is that if the offset is very high, say 90000, then the query takes more time to execute. Following is the time difference between 2 queries with different offsets:
select * from myTable LIMIT 100 OFFSET 0 //Execution Time is less than 1sec
select * from myTable LIMIT 100 OFFSET 95000 //Execution Time is almost 15secs
Can anyone suggest how to optimize this query? I mean, the query execution time should be equally fast no matter how many records I retrieve, or from which OFFSET.
Newly added:
The actual scenario is that I have a database with more than 1 million records. But since it's an embedded device, I just can't do "select * from myTable" and then fetch all the records from the query; my device crashes. Instead, I keep fetching records batch by batch (batch size = 100 or 1000 records) using the query mentioned above. But as I mentioned, it becomes slow as the offset increases. So my ultimate aim is to read all the records from the database, but since I can't fetch them all in a single execution, I need some other, more efficient way to achieve this.
As JvdBerg said, indexes are not used for LIMIT/OFFSET; simply adding 'ORDER BY indexed_field' will not help either.
To speed up pagination you should avoid LIMIT/OFFSET and use a WHERE clause instead. For example, if your primary key field is named 'id' and has no gaps, then your code above can be rewritten like this:
SELECT * FROM myTable WHERE id>=0 AND id<100 //very fast!
SELECT * FROM myTable WHERE id>=95000 AND id<95100 //as fast as previous line!
With an offset of 95000, all of the previous 95000 records must be processed first. You should create an index on the table and use it for selecting records.
As @user318750 said, if you know you have a contiguous index, you can simply use
select * from Table where index >= %start and index < %(start+size)
However, those cases are rare. If you don't want to rely on that assumption, use a sub-query, for example one based on rowid, which is always indexed:
select * from Table where rowid in (
select rowid from Table limit %size offset %start)
This speeds things up especially if you have "fat" rows (e.g. rows that contain blobs).
If maintaining the record order is important (it usually isn't), you need to order the indices first:
select * from Table where rowid in (
select rowid from Table order by rowid limit %size offset %start)
A related trick for jumping straight to a single row at a large offset:
select * from data where rowid = (select rowid from data limit 1 offset 999999);
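Putting these ideas together for the batch-by-batch scenario in the question: instead of passing an ever-growing OFFSET, remember the largest rowid of the previous batch and seek past it (a sketch; :lastRowid is a placeholder parameter):

-- first batch
SELECT rowid, * FROM myTable ORDER BY rowid LIMIT 100;
-- every following batch: seek directly past the last row already read
SELECT rowid, * FROM myTable WHERE rowid > :lastRowid ORDER BY rowid LIMIT 100;

Each batch then costs a single index seek plus 100 steps, so it no longer matters how deep into the table you are.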
With SQLite, you don't need to get all rows returned at once in a big fat array; you can get called back for every row. This way, you can process the results as they come in, which should address both your crashing and performance issues.
I guess you're not using C as you would already be using a callback, but this technique should be available in any other language.
JavaScript example (from https://www.npmjs.com/package/sqlite3):
db.each("SELECT rowid AS id, info FROM lorem", function(err, row) {
  console.log(row.id + ": " + row.info);
});
