I'm confusing about the partition key with cosmos db. I have a database/container with about 4000 small records. If I try a sql statement with my partition key filter, the RUs and the duration time is larger then without.
Does someone understand this?
in this sample my partition key of the container is /partitionKey
I tried this statement:
SELECT * FROM c where c.partitionKey = 'userSettings' And c.deleted =false
Request Charge 50 RUs
Document load time 2.15 ms
and then this
SELECT * FROM c where c.cosmosEntityName = 'userSettings' And c.deleted =false
Request Charge 5 RUs
Document load time 0.38 ms
I expect exactly the opposite results.
Here some screenshots:
This question is very specific to the topology of your collection (which Azure support can help with), but generally speaking there are two cases where the latter query on non-partition key property can be lower in RUs than the partition key property:
List item
If the query on non-partition key property is incomplete, the RUs may appear lower, but you still need to read results from other partitions to ascertain there are no more results. You would have to click "More Results" in Data Explorer until it is grayed out
For this specific query where c.partitionKey = 'userSettings' And c.deleted =false, you should compare RUs with and without a composite index on /partitionKey/? and /deleted/? (https://learn.microsoft.com/azure/cosmos-db/how-to-manage-indexing-policy#composite-indexing-policy-examples). In some cases, you will get lower RUs with the composite index than with the default of /* which only indexes them individually, potentially close to ~5 RUs
Related
I am new to Cosmos DB and facing issue in designing my DB.
I have a data similar to below structure
{
"userId": "64_CHAR_ID",
"gpId": "34_CHAR_ID"
... Other data
}
Currently my DB is having partition on userId as all the queries were by userId so far. Now I want to query my DB based on gpId when userId is not known. So it is ending up as cross partition query and it takes lot of clock-time(more than 5 min) and RUs (more than 3k RUs).
Query I am using is
SELECT * FROM c WHERE c.gpId='SOME_GPID'
According to Microsoft Doc we should avoid cross partition queries when dataset is large, and in my case the dataset is quite large (~80 GB).
So what would be a better design / strategy to query the data by gpId in cosmos db. My requirement is to query by gpId in almost real time.
Note: Current limit of RUs is set to 500000 RUs/s and also put on AutoScale.
I need to run a query to find all documents with duplicated e-mails.
SELECT * FROM (SELECT c.Email, COUNT(1) as cnt FROM c GROUP BY c.Email) a WHERE a.cnt > 1
When I run it in Data Explorer in Azure Portal it finds 4 results, but it's not a complete list of duplicated emails, because I already know one email that is duplicated and when the query is narrowed (where email = 'x') it is returned and there are about 70 duplicated emails in the collection.
Currently, throughput is set to autoscale with 6000 Max RU/s, the collection has about 4kk of documents. When running the query I observe an increased count of 429s responses on this collection.
Query Statistics shows that all documents are retrieved from the collection, but output is only 4 (should be around 70).
Query used 277324 RUs and took 71 seconds which gives 3905 RU/s in average, so it shouldn't be throttled.
Why cosmos returns only limited results for this query?
What can I do to get all duplicates?
I have a collection in Azure Cosmos DB with iot messages (called DeviceEvents). The partition key is application id. I want to do a query by device id (each device belongs to exactly one application). So I have a query like this
SELECT VALUE root
FROM root
WHERE root["ApplicationId"] = 69 AND root["DeviceId"] = 2978
AND root["TimeStamp"] >= "2021-01-30T20:30:05.1635579Z"
AND root["TimeStamp"] <= "2021-02-19T20:30:05.1635969Z"
ORDER BY root["TimeStamp"] DESC OFFSET 0 LIMIT 30
When I execute the query like this I get Request Charge 10.96 RUs, Index lookup time
2.22 ms, Document load time 0.41 ms and Query engine execution time 0.24 ms
When I execute the query without the partition key
SELECT VALUE root
FROM root
WHERE root["DeviceId"] = 2978
AND root["TimeStamp"] >= "2021-01-30T20:30:05.1635579Z"
AND root["TimeStamp"] <= "2021-02-19T20:30:05.1635969Z"
ORDER BY root["TimeStamp"] DESC OFFSET 0 LIMIT 30
When I execute the query like this I get Request Charge 10.45 RUs, Index lookup time
1.91 ms, Document load time 0.5 ms and Query engine execution time 0.24 ms
While the numbers vary the query with the partition key consistently consumes more RU and has higher index lookup time.
I don't have enough data for Cosmos DB to create different physical partitions right now but I will probably need it in the future. My relevant indexing policy is this
"compositeIndexes": [
[
{
"path": "/DeviceId",
"order": "ascending"
},
{
"path": "/TimeStamp",
"order": "descending"
}
]
So my questions are
Do I need the partition key in the query?
Do I need the partition key in the index definition?
The reason you're getting confusing query stats is because the amount data is too small to provide meaningful results.
With a small amount of data (approx 20GB or less) you'll only be on a single physical partition. Cross-partition queries run just as fast as partitioned queries when on the same physical partition.
Where things start to blow up is when the database grows (scales). If you design your database to have a high number of cross-partition queries your database, by design, will not scale. So you definitely need (or should try as much as possible) to use the partition key in your queries, especially high volume queries.
I would also add TimeStamp in both an ascending and descending composite index.
The other thing you mentioned is every device belongs to the same applicationId. If that is the case then your container cannot grow larger than 20GB. If every device in this app has applicationId of 69 then you should redesign this container and find a new partition key. If your queries are always by device Id then that would make a much better partition key.
I have a CosmosDB collection with id field and a partition key ManagerName.
When I run two queries.
SELECT * FROM c
where c.id = '76e24380-71cb-45d5-807a-ce2374f57624' and c.ManagerName ='Darin Jast2'
SELECT * FROM c
where c.id = '76e24380-71cb-45d5-807a-ce2374f57624'
in data explorer the RU's result is sort of strange. For the first query I get 3.070 RUs and the second I get 2.9 RUs. Almost every time I run the two queries?
That is strange to me because from what I read when you have a partition id in the where clause the query will run on a single partition.
The stranger thing is that when I run a
SELECT * FROM c
where c.ManagerName ='Darin Jast2'
I get 2.9 in fact any field I get the same number. It seams to be related to the number of where conditions instead of having or not having partitions?
Can someone explain to me what is going on here and why am I getting the results. Dose this have something to do with indexing? Size of the collection? Number of partitions?
All the resources I found on CosmosDb say you should include the partition key in your query and if you can do single partition queries.
In Azure Cosmos DB (SQL API) the following query charges 9356.66 RU's:
SELECT * FROM Core c WHERE c.id = #id -- #id is a GUID
In contrast the following more complex query charges only 6.84 RU's:
SELECT TOP 10 * FROM Core c WHERE c.type = "Agent"
The documents in both examples are pretty small having a handful of attributes. Also the document collection does not use any custom indexing policy. The collection contains 105685 documents.
To me this sounds as if there is no properly working index on the "id" field in place.
How is this possible and how can this be fixed?
Updates:
Without the TOP keyword the second query charges 3516.35 RU's and returns 100000 records.
The partition key is "/partition" and its values are 0 or 1 (evenly distributed).
If you have partition collection you need to specify partition keyif you want to do request most efficiently. Cross-partition queries is really expensive (and slower) in cosmos, because partitions data can be stored in different places.
Try following:
SELECT * FROM Core c WHERE c.id = #id AND c.partition = #partition
Or, specify partition key in feed options if you're using CosmosDB SDK.
Let me know, if this helps.
I assume the solution is the same as posted here:
Azure DocumentDB Query by Id is very slow
I will close my own question once I am able to verify this with Microsoft Support.