Cosmos DB Partition Query and RU Charges - azure-cosmosdb

I'm using the Volcano JSON sample with 1571 documents. Using the Cosmos DB Emulator, I created one container partitioned by ID and one container partitioned by Country. I expected the query against the Country-partitioned container to cost fewer RUs when I ran these queries:
select *
from VolcanoesById c
where c.Country = 'Japan'
select *
from VolcanoesByCountry c
where c.Country = 'Japan'
However, the emulator reports the same RU charge for both:
Partitioned by ID:
Request Charge - 6.25 RUs
Results - 111
Retrieved document size - 56255 bytes
Output document count - 111
Output document size - 56416 bytes
Index hit document count - 111
Index lookup time - 0.13 ms
Document load time - 0.5 ms
Query engine execution time - 0.09 ms
Vs the partition by country:
Request Charge - 6.25 RUs
Results - 111
Retrieved document size - 56255 bytes
Output document count - 111
Output document size - 56416 bytes
Index hit document count - 111
Index lookup time - 10.96 ms
Document load time - 0.46 ms
Query engine execution time - 0.11 ms
Shouldn't the query by Country cost fewer RUs on the container partitioned by Country?

The reason could be that each container you created has a single physical partition, so both queries are served by the same partition.
Try creating containers with multiple physical partitions in the Emulator (you can achieve that by provisioning more than 10K RU/s) and repeat the same exercise.
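To see why a single physical partition hides the difference, here is a toy routing model in Python. It is only a sketch: the hash function, the function names, and the partition counts are assumptions for illustration and do not reflect how Cosmos DB actually places data.

```python
import hashlib


def physical_partition(key_value: str, partition_count: int) -> int:
    """Toy placement: hash a logical partition key value onto a physical partition.

    Assumption: Cosmos DB uses its own hashing scheme, not MD5 modulo N.
    """
    digest = hashlib.md5(key_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % partition_count


def partitions_to_visit(partition_key_field: str, filter_field: str,
                        filter_value: str, partition_count: int) -> set:
    """Return the physical partitions a query with one equality filter must contact."""
    if filter_field == partition_key_field:
        # Filter is on the partition key: route to exactly one physical partition.
        return {physical_partition(filter_value, partition_count)}
    # Filter is on another property: any physical partition may hold matches.
    return set(range(partition_count))


for count in (1, 4):
    by_country = partitions_to_visit("Country", "Country", "Japan", count)
    by_id = partitions_to_visit("id", "Country", "Japan", count)
    print(count, len(by_country), len(by_id))
```

With one physical partition both containers contact a single partition, which matches the identical charges above. With four, the Country-partitioned container still contacts one partition while the ID-partitioned container contacts all four.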

Related

CosmosDB is not returning all results

I need to run a query to find all documents with duplicated e-mails.
SELECT * FROM (SELECT c.Email, COUNT(1) as cnt FROM c GROUP BY c.Email) a WHERE a.cnt > 1
When I run it in Data Explorer in the Azure Portal it finds 4 results, but that is not the complete list of duplicated emails. I already know one email that is duplicated, and it is returned when the query is narrowed (where email = 'x'). There are about 70 duplicated emails in the collection.
Currently, throughput is set to autoscale with a maximum of 6000 RU/s, and the collection has about 4 million documents. While the query runs I observe an increased count of 429 responses on this collection.
Query Statistics shows that all documents are retrieved from the collection, but the output is only 4 (it should be around 70).
The query used 277324 RUs and took 71 seconds, which gives roughly 3906 RU/s on average, so it shouldn't be throttled.
Why does Cosmos DB return only limited results for this query?
What can I do to get all duplicates?
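One possible workaround, sketched here under assumptions rather than as a confirmed fix, is to do the counting client-side after reading every page of a simpler query (for example one that projects only c.Email). The function name and the `documents` iterable are hypothetical, how the pages are fetched is left to whichever SDK is in use, and reading about 4 million documents this way would still consume a significant number of RUs.

```python
from collections import Counter


def duplicated_emails(documents):
    """Return {email: count} for every email that appears more than once.

    `documents` is any iterable of dicts with an "Email" field, e.g. the
    fully drained result pages of a query that projects only the email.
    """
    counts = Counter(doc["Email"] for doc in documents if "Email" in doc)
    return {email: n for email, n in counts.items() if n > 1}


# Hypothetical sample data standing in for the query results.
sample = [
    {"Email": "a@example.com"},
    {"Email": "b@example.com"},
    {"Email": "a@example.com"},
    {"Name": "no email on this document"},
]
print(duplicated_emails(sample))
```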

In Azure Cosmos DB do I need to add the partition key to my query where clause?

I have a collection in Azure Cosmos DB with IoT messages (called DeviceEvents). The partition key is the application id. I want to query by device id (each device belongs to exactly one application). So I have a query like this:
SELECT VALUE root
FROM root
WHERE root["ApplicationId"] = 69 AND root["DeviceId"] = 2978
AND root["TimeStamp"] >= "2021-01-30T20:30:05.1635579Z"
AND root["TimeStamp"] <= "2021-02-19T20:30:05.1635969Z"
ORDER BY root["TimeStamp"] DESC OFFSET 0 LIMIT 30
When I execute the query like this I get Request Charge 10.96 RUs, Index lookup time 2.22 ms, Document load time 0.41 ms and Query engine execution time 0.24 ms.
When I execute the query without the partition key:
SELECT VALUE root
FROM root
WHERE root["DeviceId"] = 2978
AND root["TimeStamp"] >= "2021-01-30T20:30:05.1635579Z"
AND root["TimeStamp"] <= "2021-02-19T20:30:05.1635969Z"
ORDER BY root["TimeStamp"] DESC OFFSET 0 LIMIT 30
When I execute the query like this I get Request Charge 10.45 RUs, Index lookup time 1.91 ms, Document load time 0.5 ms and Query engine execution time 0.24 ms.
While the numbers vary, the query with the partition key consistently consumes more RUs and has a higher index lookup time.
I don't have enough data for Cosmos DB to create different physical partitions right now, but I will probably need them in the future. My relevant indexing policy is this:
"compositeIndexes": [
  [
    {
      "path": "/DeviceId",
      "order": "ascending"
    },
    {
      "path": "/TimeStamp",
      "order": "descending"
    }
  ]
]
So my questions are
Do I need the partition key in the query?
Do I need the partition key in the index definition?

The reason you're getting confusing query stats is that the amount of data is too small to provide meaningful results.
With a small amount of data (approx 20GB or less) you'll only be on a single physical partition. Cross-partition queries run just as fast as single-partition queries when everything is on the same physical partition.
Where things start to blow up is when the database grows (scales). If you design your database around a high number of cross-partition queries, it will, by design, not scale. So you definitely need (or should try as much as possible) to use the partition key in your queries, especially high-volume queries.
I would also add TimeStamp in both an ascending and a descending composite index.
The other thing you mentioned is that every device belongs to the same applicationId. If that is the case then your container cannot grow larger than 20GB, because a single logical partition is capped at 20GB. If every device in this app has an applicationId of 69 then you should redesign this container and find a new partition key. If your queries are always by device Id then that would make a much better partition key.
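As an illustration of keeping the partition key in the query, the sketch below builds the question's query in parameterized form. The helper name `build_device_query` is hypothetical, and the commented call at the end assumes the azure-cosmos Python package and an existing container client, neither of which appears in the original thread.

```python
def build_device_query(application_id: int, device_id: int,
                       start: str, end: str, limit: int = 30):
    """Return (query_text, parameters) for the device query from the question."""
    query = (
        'SELECT VALUE root FROM root '
        'WHERE root["ApplicationId"] = @applicationId '
        'AND root["DeviceId"] = @deviceId '
        'AND root["TimeStamp"] >= @start '
        'AND root["TimeStamp"] <= @end '
        # The limit is inlined as a validated integer rather than a parameter.
        f'ORDER BY root["TimeStamp"] DESC OFFSET 0 LIMIT {int(limit)}'
    )
    parameters = [
        {"name": "@applicationId", "value": application_id},
        {"name": "@deviceId", "value": device_id},
        {"name": "@start", "value": start},
        {"name": "@end", "value": end},
    ]
    return query, parameters


query, parameters = build_device_query(
    69, 2978, "2021-01-30T20:30:05.1635579Z", "2021-02-19T20:30:05.1635969Z"
)
print(query)

# Assumption: with azure-cosmos installed and `container` already created,
# the partition key can also be passed explicitly so the request is routed
# to a single partition:
# items = container.query_items(query=query, parameters=parameters, partition_key=69)
```

Passing the partition key explicitly makes the single-partition intent visible in code, which matters once the container spans more than one physical partition.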

Mirrored data in two databases produces vastly different RU consumption

I am using Azure Cosmos to store customer data in my multi-tenant app. One of my customers started complaining about long wait times when querying their data. As a quick fix, I created a dedicated Cosmos instance for them and copied their data to that dedicated instance. Now I have two databases that contain exact copies of this customer's data. We'll call these databases db1 and db2. db1 contains all data for all customers, including this customer in question. db2 contains only data for this customer in question. Also, for both databases the partition key is an aggregate of tenant id and date, called ownerTime. Also, each database contains a single container named "call".
I then run this query in both databases:
select c.id,c.callTime,c.direction,c.action,c.result,c.duration,c.hasR,c.hasV,c.callersIndexed,c.callers,c.files,c.tags_s,c.ownerTime
from c
where
c.ownerTime = '352897067_202011'
and c.callTime>='2020-11-01T00:00:00'
and c.callTime<='2020-11-30T59:59:59'
and (CONTAINS(c.phoneNums_s, '7941521523'))
As you can see, I am isolating one partition (ownerTime: 352897067_202011). In this partition, there are about 50,000 records in each database.
In db1 (the database with all customer data), this uses 5116.38 RUs. In db2 (the dedicated instance), this query uses 65.8 RUs.
Why is there this discrepancy? The data in these two partitions is exactly the same across the two databases. The indexing policy is exactly the same as well. I suspect that db1 is trying to do a fan-out query. But why would it do that? I have the query set up so that it will only look in this one partition.
Here are the stats I retrieved for the above query after running on each database:
db1
Request Charge: 5116.38 RUs
Retrieved document count: 8
Retrieved document size: 18168 bytes
Output document count: 7
Output document size: 11793 bytes
Index hit document count: 7
Index lookup time: 5521.42 ms
Document load time: 7.81 ms
Query engine execution time: 0.23 ms
System function execution time: 0.01 ms
User defined function execution time: 0 ms
Document write time: 0.07 ms
Round Trips: 1
db2
Request Charge: 65.8 RUs
Showing Results: 1 - 7
Retrieved document count: 7
Retrieved document size: 16585 bytes
Output document count: 7
Output document size: 11744 bytes
Index hit document count: 7
Index lookup time: 20.72 ms
Document load time: 4.8099 ms
Query engine execution time: 0.2001 ms
System function execution time: 0.01 ms
User defined function execution time: 0 ms
Document write time: 0.05 ms
Round Trips: 1
The indexing policy for both databases is:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{
"path": "/*"
}
],
"excludedPaths": [
{
"path": "/\"_etag\"/?"
},
{
"path": "/callers/*"
},
{
"path": "/files/*"
}
]
}
*Update: I recently updated this question with a clearer query, and the query stats returned.

Following up from comments.
Since db1 has all customers' data, the physical partition will have a lot more unique values for callTime, so the number of index pages scanned to evaluate the callTime range will be high. In db2, since only one customer's data is there, the logical and physical partitions will be the same. So while this is not a fan-out, in db1 the query engine still needs to evaluate the range filter on callTime against all the other customers' data.
To fix/improve the performance on db1 you should create a composite index on /ownerTime and /callTime, see below.
"compositeIndexes":[
[
{
"path":"/ownerTime"
},
{
"path":"/callTime"
}
]
],
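The effect described in this answer can be illustrated with a toy model that counts how many sorted index entries a range scan touches. This is only a sketch with made-up data: the tenant names, document counts, and both functions are assumptions, and a sorted Python list is a stand-in, not a description of Cosmos DB's actual index structures.

```python
from bisect import bisect_left, bisect_right


def scanned_single(call_times_sorted, start, end):
    """Entries touched by a range scan over an index on callTime alone."""
    return bisect_right(call_times_sorted, end) - bisect_left(call_times_sorted, start)


def scanned_composite(pairs_sorted, owner, start, end):
    """Entries touched by a range scan over an (ownerTime, callTime) index."""
    low = bisect_left(pairs_sorted, (owner, start))
    high = bisect_right(pairs_sorted, (owner, end))
    return high - low


# Hypothetical shared database: 20 tenants, 30 calls each in November 2020.
docs = [
    (f"tenant{t}_202011", f"2020-11-{day:02d}T12:00:00")
    for t in range(20)
    for day in range(1, 31)
]
call_time_index = sorted(call_time for _, call_time in docs)
composite_index = sorted(docs)

start, end = "2020-11-01T00:00:00", "2020-11-30T23:59:59"
all_tenants = scanned_single(call_time_index, start, end)
one_tenant = scanned_composite(composite_index, "tenant7_202011", start, end)
print(all_tenants, one_tenant)
```

In this model the callTime-only scan touches the entries of every tenant (600), while the composite scan touches only the 30 entries of the requested ownerTime.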
Thanks.

Cosmos DB to store and retrieve thousands of documents within seconds

I am storing millions of documents in Cosmos DB with a proper partition key. I need to retrieve, say, 500,000 documents to do some calculations and display the output in the UI, and this should happen within, say, 10 seconds.
Would this be possible? I have tried this, but it takes nearly a minute. Is this the correct approach for this kind of requirement?
"id": "Latest_100_Sku1_1496188800",
"PartitionKey": "Latest_100_Sku1",
"SnapshotType": 2,
"AccountCode": "100",
"SkuCode": "Sku1",
"Date": "2017-05-31T00:00:00",
"DateEpoch": 1496188800,
"Body": "rVNBa4MwFP4v72xHElxbvYkbo4dBwXaX0UOw6ZRFIyaBFfG/7zlT0EkPrYUcku+9fO/7kvca"
Size of one document: 825 bytes
I am using autoscale throughput with a maximum of 4000 RU/s.
Query statistics: I am using 2 queries.
Query 1 - select * from c where c.id in ({ids})
Here I pass the PartitionKey in the query options.
Query Statistics
Request Charge: 102.11 RUs
Showing Results: 1 - 100
Retrieved document count: 200
Retrieved document size: 221672 bytes
Output document count: 200
Output document size: 221972 bytes
Index hit document count: 200
Index lookup time: 17.0499 ms
Document load time: 1.59 ms
Query engine execution time: 0.3401 ms
System function execution time: 0.06 ms
User defined function execution time: 0 ms
Document write time: 0.16 ms
Round Trips: 1
Query 2 --
select * from c where c.PartitionKey in ({keys}) and c.DateEpoch>={startDate.ToEpoch()} and c.DateEpoch<={endDate.ToEpoch()}
Query Statistics
Request Charge: 226.32 RUs
Showing Results: 1 - 100
Retrieved document count: 200
Retrieved document size: 176580 bytes
Output document count: 200
Output document size: 176880 bytes
Index hit document count: 200
Index lookup time: 88.31 ms
Document load time: 4.2399 ms
Query engine execution time: 0.4701 ms
System function execution time: 0.06 ms
User defined function execution time: 0 ms
Document write time: 0.19 ms
Round Trips: 1

Query #1 looks fine. Query #2 would most likely benefit from a composite index on DateEpoch. I'm not sure what the UDF is, but if you're converting dates to epoch you may want to read the blog post "New date and time system functions in Azure Cosmos DB".
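As a rough sanity check on the numbers in the question, the statistics for Query 1 can be extrapolated to 500,000 documents. This is a back-of-the-envelope sketch that assumes the RU cost scales linearly with document count and that the full 4000 RU/s is available, neither of which is guaranteed; the function name is hypothetical.

```python
def estimate_read_cost(sample_ru: float, sample_docs: int,
                       target_docs: int, ru_per_second: float):
    """Linearly extrapolate a measured query charge to a larger result set."""
    ru_per_doc = sample_ru / sample_docs
    total_ru = ru_per_doc * target_docs
    seconds = total_ru / ru_per_second
    return total_ru, seconds


# 102.11 RUs for 200 documents (Query 1), 4000 RU/s autoscale maximum.
total_ru, seconds = estimate_read_cost(102.11, 200, 500_000, 4000)
ru_per_second_for_10s = total_ru / 10

print(round(total_ru), round(seconds, 1), round(ru_per_second_for_10s))
```

Under these assumptions the read costs about 255,000 RUs, which takes about 64 seconds at 4000 RU/s and is in line with the "nearly a minute" reported. Finishing in 10 seconds would need on the order of 25,500 RU/s.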
Overall, retrieving 500K documents in 1-2 queries to do some calculations seems like a strange use case. Typically, people pre-calculate values and persist them using a materialized view pattern driven by the change feed. Depending on how often you run these two queries, this is often a more efficient use of compute resources.
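The pre-calculation idea can be sketched as a running aggregate that is updated once per changed document instead of being recomputed from 500K documents on every request. This is a minimal in-memory sketch: the class, the `Quantity` field, and the grouping by `SkuCode` are assumptions for illustration, and both the wiring to the change feed and the persistence of the totals are left out.

```python
from collections import defaultdict


class RunningTotals:
    """Keeps per-SKU totals up to date as document changes arrive."""

    def __init__(self):
        self.totals = defaultdict(float)
        self._last_seen = {}  # document id -> (sku, quantity) last applied

    def apply_change(self, doc: dict) -> None:
        """Apply one inserted or updated document (its latest version)."""
        previous = self._last_seen.get(doc["id"])
        if previous is not None:
            # An update: remove the old contribution before adding the new one.
            old_sku, old_quantity = previous
            self.totals[old_sku] -= old_quantity
        sku, quantity = doc["SkuCode"], float(doc["Quantity"])
        self.totals[sku] += quantity
        self._last_seen[doc["id"]] = (sku, quantity)


view = RunningTotals()
view.apply_change({"id": "a", "SkuCode": "Sku1", "Quantity": 5})
view.apply_change({"id": "b", "SkuCode": "Sku1", "Quantity": 7})
view.apply_change({"id": "a", "SkuCode": "Sku1", "Quantity": 2})  # update of "a"
print(dict(view.totals))  # {'Sku1': 9.0}
```

The UI would then read the small pre-computed totals rather than the raw documents.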

Azure Cosmos DB is faster without partition key

I'm confused about partition keys in Cosmos DB. I have a database/container with about 4000 small records. If I run a SQL statement with a filter on my partition key, the RUs and the duration are larger than without it.
Does someone understand this?
In this sample the partition key of the container is /partitionKey.
I tried this statement:
SELECT * FROM c where c.partitionKey = 'userSettings' And c.deleted =false
Request Charge 50 RUs
Document load time 2.15 ms
and then this
SELECT * FROM c where c.cosmosEntityName = 'userSettings' And c.deleted =false
Request Charge 5 RUs
Document load time 0.38 ms
I expect exactly the opposite results.

This question is very specific to the topology of your collection (which Azure support can help with), but generally speaking there are two cases where the latter query, on a non-partition-key property, can be lower in RUs than the query on the partition key property:
1. If the query on the non-partition-key property is incomplete, the RUs may appear lower, but you still need to read results from the other partitions to ascertain there are no more results. You would have to click "More Results" in Data Explorer until it is grayed out.
2. For this specific query, where c.partitionKey = 'userSettings' And c.deleted =false, you should compare RUs with and without a composite index on /partitionKey/? and /deleted/? (https://learn.microsoft.com/azure/cosmos-db/how-to-manage-indexing-policy#composite-indexing-policy-examples). In some cases you will get lower RUs with the composite index than with the default of /*, which only indexes them individually, potentially close to ~5 RUs.
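For the composite index comparison, a policy could be produced along these lines. This is a hedged sketch: the helper `add_composite_index` and the starting policy are assumptions (the question does not show the container's actual policy), and the composite paths are written without a trailing /? because composite index paths are scalar by definition.

```python
import copy
import json


def add_composite_index(policy: dict, paths: list) -> dict:
    """Return a copy of `policy` with one ascending composite index on `paths`."""
    updated = copy.deepcopy(policy)
    composite = [{"path": path, "order": "ascending"} for path in paths]
    indexes = updated.setdefault("compositeIndexes", [])
    if composite not in indexes:  # avoid adding the same index twice
        indexes.append(composite)
    return updated


# Assumed starting point: a default-style policy that indexes every path.
default_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": "/\"_etag\"/?"}],
}

policy = add_composite_index(default_policy, ["/partitionKey", "/deleted"])
print(json.dumps(policy, indent=2))
```

The printed JSON can be applied as the container's indexing policy, after which the same query can be rerun to compare the request charge.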
