I need to run a query that finds all documents with duplicate emails.
SELECT * FROM (SELECT c.Email, COUNT(1) as cnt FROM c GROUP BY c.Email) a WHERE a.cnt > 1
When I run it in Data Explorer in the Azure Portal it finds 4 results, but that is not the complete list of duplicated emails: I already know of one duplicated email that is returned when the query is narrowed (where email = 'x'), and there are about 70 duplicated emails in the collection.
Currently, throughput is set to autoscale with 6,000 max RU/s, and the collection has about 4 million documents. When running the query I observe an increased count of 429 (throttled) responses on this collection.
Query Statistics shows that all documents are retrieved from the collection, but the output count is only 4 (it should be around 70).
The query used 277,324 RUs and took 71 seconds, which is 3,905 RU/s on average, so it shouldn't be throttled.
Why does Cosmos DB return only partial results for this query?
What can I do to get all duplicates?
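A likely explanation: cross-partition GROUP BY queries are evaluated per continuation page, and Data Explorer shows only the first page, so counts for the same email can be split across pages and must be drained and merged client-side before applying the cnt > 1 filter. Below is a minimal sketch of that client-side merge; the page contents and email addresses are made up for illustration, and real code would drain pages via the SDK's continuation tokens instead.

```python
from collections import Counter

# Hypothetical partial results: each page carries per-page (email, count)
# rows, e.g. one page per physical partition. The same email can appear
# on several pages, so counts must be merged before filtering.
pages = [
    [("a@x.com", 1), ("b@x.com", 1)],   # page 1
    [("a@x.com", 1), ("c@x.com", 3)],   # page 2
]

merged = Counter()
for page in pages:
    for email, cnt in page:
        merged[email] += cnt

# Only after merging does a@x.com cross the duplicate threshold.
duplicates = {email: cnt for email, cnt in merged.items() if cnt > 1}
print(duplicates)  # {'a@x.com': 2, 'c@x.com': 3}
```

Looking at only the first page would miss a@x.com entirely, which matches seeing 4 results instead of ~70.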
I am storing millions of documents in Cosmos DB with a proper partition key. I need to retrieve, say, 500,000 documents to do some calculations and display the output in the UI, and this should happen within about 10 seconds.
Would this be possible? I have tried it, but it takes nearly a minute. Is this the correct approach for this kind of requirement?
"id": "Latest_100_Sku1_1496188800",
"PartitionKey": "Latest_100_Sku1",
"SnapshotType": 2,
"AccountCode": "100",
"SkuCode": "Sku1",
"Date": "2017-05-31T00:00:00",
"DateEpoch": 1496188800,
"Body": "rVNBa4MwFP4v72xHElxbvYkbo4dBwXaX0UOw6ZRFIyaBFfG/7zlT0EkPrYUcku+9fO/7kvca"
Size of one document: 825 bytes
I am using autoscale with 4,000 max RU/s.
Query statistics: I am using two queries.
Query 1: select * from c where c.id in ({ids})
Here I set the PartitionKey in the query options.
Query Statistics

Metric                                  Value
Request Charge                          102.11 RUs
Showing Results                         1 - 100
Retrieved document count                200
Retrieved document size                 221672 bytes
Output document count                   200
Output document size                    221972 bytes
Index hit document count                200
Index lookup time                       17.0499 ms
Document load time                      1.59 ms
Query engine execution time             0.3401 ms
System function execution time          0.06 ms
User defined function execution time    0 ms
Document write time                     0.16 ms
Round Trips                             1
Query 2:
select * from c where c.PartitionKey in ({keys}) and c.DateEpoch>={startDate.ToEpoch()} and c.DateEpoch<={endDate.ToEpoch()}
Query Statistics

Metric                                  Value
Request Charge                          226.32 RUs
Showing Results                         1 - 100
Retrieved document count                200
Retrieved document size                 176580 bytes
Output document count                   200
Output document size                    176880 bytes
Index hit document count                200
Index lookup time                       88.31 ms
Document load time                      4.2399 ms
Query engine execution time             0.4701 ms
System function execution time          0.06 ms
User defined function execution time    0 ms
Document write time                     0.19 ms
Round Trips                             1
Query #1 looks fine. Query #2 would most likely benefit from a composite index on DateEpoch. I'm not sure what the UDF is, but if you're converting dates to epoch values you may want to read the blog post New date and time system functions in Azure Cosmos DB.
Overall, retrieving 500K documents in 1-2 queries to do some calculations seems like a strange use case. Typically people will pre-calculate values and persist them using a materialized view pattern, kept up to date with the change feed. Depending on how often you run these two queries, this is often a more efficient use of compute resources.
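The pre-calculation idea above can be sketched without the SDK: a hypothetical change-feed handler that maintains a running summary per (AccountCode, SkuCode, DateEpoch), so the UI reads a handful of summary entries instead of 500K documents. The field names are taken from the sample document earlier in the question; the handler shape and the "size" field are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical change-feed handler: every changed document folds into a
# pre-aggregated summary keyed by (AccountCode, SkuCode, DateEpoch).
# "size" stands in for whatever the UI calculation actually needs.
summaries = defaultdict(lambda: {"count": 0, "bytes": 0})

def on_changes(docs):
    for doc in docs:
        key = (doc["AccountCode"], doc["SkuCode"], doc["DateEpoch"])
        summaries[key]["count"] += 1
        summaries[key]["bytes"] += doc["size"]

# Two snapshot documents for the same SKU and day roll up into one entry.
on_changes([
    {"AccountCode": "100", "SkuCode": "Sku1", "DateEpoch": 1496188800, "size": 825},
    {"AccountCode": "100", "SkuCode": "Sku1", "DateEpoch": 1496188800, "size": 825},
])
print(summaries[("100", "Sku1", 1496188800)])  # {'count': 2, 'bytes': 1650}
```

In a real deployment the summaries would be written back to a container (or another store) by the change-feed processor, and the UI query would hit only those small summary documents.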
Considering the following query:
SELECT TOP 1 * FROM c
WHERE c.Type = 'Case'
AND c.Entity.SomeField = #someValue
AND c.Entity.CreatedTimeUtc > #someTime
ORDER BY c.Entity.CreatedTimeUtc DESC
Until recently, when I ran this query, the number of documents processed (RetrievedDocumentCount in the query metrics) was the number of documents that satisfy the first two conditions, regardless of the CreatedTimeUtc filter or the TOP 1.
Only when I added a composite index of (Type DESC, Entity.SomeField DESC, Entity.CreatedTimeUtc DESC) and added those properties to the ORDER BY clause did the retrieved document count drop to the number of documents that satisfy all three conditions (still not one document as expected, but better).
Then, starting a few days ago, we noticed in our dev environment that the composite index is no longer needed: the retrieved document count changed to only one document (= the number in TOP, as expected), and the RU charge dropped significantly.
My question: is this a new improvement/fix in Cosmos DB? I couldn't find any announcement/documentation on this matter.
If so, is the roll-out completed or still in-progress? We have several production instances in different regions.
Thanks
There have not been any recent changes to our query engine that would explain why this query is suddenly less expensive.
The only thing that would explain this is that fewer results match the filter than before, and our query engine was able to perform an optimization that it could not have done with a larger set of results.
Thanks.
I'm confused about the partition key in Cosmos DB. I have a database/container with about 4,000 small records. If I run a SQL statement with a filter on my partition key, the RU charge and the duration are larger than without it.
Does someone understand this?
in this sample my partition key of the container is /partitionKey
I tried this statement:
SELECT * FROM c where c.partitionKey = 'userSettings' And c.deleted =false
Request Charge 50 RUs
Document load time 2.15 ms
and then this
SELECT * FROM c where c.cosmosEntityName = 'userSettings' And c.deleted =false
Request Charge 5 RUs
Document load time 0.38 ms
I expected exactly the opposite results.
Here are some screenshots:
This question is very specific to the topology of your collection (which Azure support can help with), but generally speaking there are two cases where the latter query on a non-partition-key property can cost fewer RUs than the partition-key query:
1. If the query on the non-partition-key property is incomplete, the RUs may appear lower, but you still need to read results from the other partitions to ascertain there are no more results. You would have to click "More Results" in Data Explorer until it is grayed out.
2. For this specific query (where c.partitionKey = 'userSettings' And c.deleted = false), you should compare RUs with and without a composite index on /partitionKey/? and /deleted/? (https://learn.microsoft.com/azure/cosmos-db/how-to-manage-indexing-policy#composite-indexing-policy-examples). In some cases you will get lower RUs with the composite index than with the default of /*, which only indexes the properties individually, potentially close to ~5 RUs.
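For reference, here is a sketch of what such a composite index could look like in the container's indexing policy, following the format documented at the link above (path names match this question's properties):

```json
{
  "indexingMode": "consistent",
  "includedPaths": [
    { "path": "/*" }
  ],
  "compositeIndexes": [
    [
      { "path": "/partitionKey", "order": "ascending" },
      { "path": "/deleted", "order": "ascending" }
    ]
  ]
}
```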
I have a Cosmos DB collection with an id field and a partition key ManagerName.
I run two queries:
SELECT * FROM c
where c.id = '76e24380-71cb-45d5-807a-ce2374f57624' and c.ManagerName ='Darin Jast2'
SELECT * FROM c
where c.id = '76e24380-71cb-45d5-807a-ce2374f57624'
In Data Explorer the RU results are sort of strange: for the first query I get 3.07 RUs and for the second 2.9 RUs, almost every time I run the two queries.
That is strange to me, because from what I have read, when you have the partition key in the WHERE clause the query will run on a single partition.
The stranger thing is that when I run a
SELECT * FROM c
where c.ManagerName ='Darin Jast2'
I get 2.9 RUs; in fact, with any field I get the same number. It seems to be related to the number of WHERE conditions rather than whether the partition key is included.
Can someone explain to me what is going on here and why I am getting these results? Does this have something to do with indexing? The size of the collection? The number of partitions?
All the resources I found on Cosmos DB say you should include the partition key in your query and, if you can, do single-partition queries.
I have about 100 thousand records in Cosmos DB. I want to get the distinct records by some property. I am using a stored procedure to achieve this and set the page size to -1 to get the maximum number of records. When I fire a query without DISTINCT, I get about 19 thousand records. If I fire the DISTINCT query, it gives me distinct records, but the DISTINCT is applied within those 19 thousand records instead of the entire 100 thousand.
Below are the queries I have used:
SELECT r.[[FieldName]] FROM r -> returns 19,000 records with duplicates
SELECT DISTINCT r.[[FieldName]] FROM r -> returns distinct records (about 5,000), deduplicated from the 19,000 records above instead of the full 100 thousand
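This is consistent with page-limited DISTINCT behavior: deduplication is applied to the page of results the engine returned, not to the whole collection, so you have to drain all continuation pages and deduplicate across them. A minimal sketch of the difference, with made-up field values standing in for r.[[FieldName]]:

```python
# Hypothetical pages of field values returned by the query engine.
# Deduplicating only the first page (as a page-limited query effectively
# does) misses values that appear only on later pages.
pages = [
    ["red", "blue", "red"],       # first page
    ["green", "blue", "yellow"],  # later pages, only seen if you continue
]

first_page_distinct = sorted(set(pages[0]))
all_distinct = sorted(set(v for page in pages for v in page))

print(first_page_distinct)  # ['blue', 'red']
print(all_distinct)         # ['blue', 'green', 'red', 'yellow']
```

In practice this means draining the query via continuation tokens (or, inside a stored procedure, re-issuing the query with the returned continuation) and applying the deduplication over the accumulated results.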