I am working on an existing Cosmos DB where the number of physical partitions is less than 100. Each contains around 30,000,000 documents. There is an indexing policy in place on "/*".
I'm just trying to get a total count from SQL API like so:
SELECT VALUE COUNT(1) FROM mycollection c
I have set EnableCrossPartitionQuery to true, and MaxDegreeOfParallelism to 100 (so as to at least cover the number of physical partitions AKA key ranges). The database is scaled to 50,000 RU. The query is running for HOURS. This does not make sense to me. An equivalent relational database would answer this question almost immediately. This is ridiculous.
What, if anything, can I change here? Am I doing something wrong?
Microsoft support ended up applying an update to the underlying instance. In this case, the update was in the development pipeline to be rolled out gradually. This instance got it earlier as a result of the support case. The update related to using indexes to service this type of query.
Related
when I am trying to insert the data from cosmos to Azure DWH , it is inserting well for most of the databases but for some it is giving some strange issues.
Later we found out that it is due to the size of the Cosmos DB document.
Like we have 75GB of size of one of our cosmos DB.
Then if we are trying to insert all the data in initial load , it gives Null Pointer error. But if we try to limit the rows say , first 3000 and then increment the count of records by 3000 then it is able to insert but it takes significant amount of time.
Also, this is our ACC data , we are not sure of our PRD data. and now for some of the DBs we need to set it to 50000 rows per load and for some we have set 3000(like for above example).
So to load the data iterative way is the only solution ? or is there any other way?
Also, how can we determine the incremental value to load in each iteration for new DBs to be added?
P.S. I also tried increasing DWUs and IR cores to maximum but no luck.
I have use case where I write data in Dynamo db in two table say t1 and t2 in transaction.My app needs to read data from these tables lot of times (1 write, at least 4 reads). I am considering DAX vs Elastic Cache. Anyone has any suggestions?
Thanks in advance
K
ElastiCache is not intended for use with DynamoDB.
DAX is good for read-heavy apps, like yours. But be aware that DAX is only good for eventually consistent reads, so don't use it with banking apps, etc. where the info always needs to be perfectly up to date. Without further info it's hard to tell more, these are just two general points to consider.
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache that can reduce Amazon DynamoDB response times from milliseconds to microseconds, even at millions of requests per second. While DynamoDB offers consistent single-digit millisecond latency, DynamoDB with DAX takes performance to the next level with response times in microseconds for millions of requests per second for read-heavy workloads. With DAX, your applications remain fast and responsive, even when a popular event or news story drives unprecedented request volumes your way. No tuning required. https://aws.amazon.com/dynamodb/dax/
AWS recommends that you use **DAX as solution for this requirement.
Elastic Cache is an old method and it is used to store the session states in addition to the cache data.
DAX is extensively used for intensive reads through eventual consistent reads and for latency sensitive applications. Also DAX stores cache using these parameters:-
Item cache - populated with items with based on GetItem results.
Query cache - based on parameters used while using query or scan method
Cheers!
I'd recommend to use DAX with DynamoDB, provided you're having more read calls using item level API (and NOT query level API), such as GetItem API.
Why? DAX has one weird behavior as follows. From, AWS,
"Every write to DAX alters the state of the item cache. However, writes to the item cache don't affect the query cache. (The DAX item cache and query cache serve different purposes, and operate independently from one another.)"
Hence, If I elaborate, If your query operation is cached, and thereafter if you've write operation that affect's result of previously cached query and if same is not yet expired, in that case your query cache result would be outdated.
This out of sync issue, is also discussed here.
I find DAX useful only for cached queries, put item and get item. In general very difficult to find a use case for it.
DAX separates queries, scans from CRUD for individual items. That means, if you update an item and then do a query/scan, it will not reflect changes.
You can't invalidate cache, it only invalidates when ttl is reached or nodes memory is full and it is dropping old items.
Take Aways:
doing puts/updates and then queries - two seperate caches so out of sync
looking for single item - you are left only with primary key and default index and getItem request (no query and limit 1). You can't use any indexes for gets/updates/deletes.
Using ConsistentRead option when using query to get latest data - it works, but only for primary index.
Writing through DAX is slower than writing directly to Dynamodb since you have a hop in the middle.
XRay does not work with DAX
Use Case
You have queries that you don't really care they are not up to date
You are doing few putItem/updateItem and a lot of getItem
I'm trying to execute a very large query that needs to return millions of records, so I want to partition the query and use multiple machines to process the results.
My logical partition key would be a UUID of a document, so that will not be very helpful for me to allocate different parts to each worker node. Can I get the physical partition ID and execute my query only within a particular physical partition?
Here's what I have tried:
FeedOptions feedOptions = new FeedOptions();
feedOptions.setEnableCrossPartitionQuery(false);
feedOptions.setPartitionKeyRangeIdInternal("0");
client.queryDocuments(collectionPath, "SELECT * FROM e where e.docType
= 'address'", feedOptions).flatMapIterable(FeedResponse::getResults);
But changing the partitionKeyRangeId doesn't seem to change the results at all.
Please advise.
Per my knowledge, it can't be performed within a particular physical partition so far. I could't find any parameters related to physical partition in Cosmos DB Rest Api. The PartitionKeyRangeId you mentioned in your code is used in change feed requests.
Based on the statement in official doc, we can't manage physical partitions in cosmos db:
Azure Cosmos DB will automatically scale the number of physical
partitions based on your workload. So you shouldn’t corelate your
database design based on the number of physical partitions instead you
should make sure to choose the right partition key which determines
the logical partitions.
However, since cosmos db is flexible,available and enlightened, you could submit feedback to ask for further assistant if you do have such requirements related to physical partitions.
Hope it helps you.
Update Answer:
There are many ways to improve the performance of processing large volumes of data, I just give some personal advice here.
1.You could tried to consider choosing a partition key that is more appropriate than the UUID for greatly improve performance.
2.Try using page size to limit the number of items per query, then implement query and process parallelism by multithreading.
3.Increase the RUs setting to promote performance.
More ideas,please refer to this doc.
I'm trying to use Cosmos DB and I'm having some trouble making a simple count in a collection.
My collection schema is below and I have 80.000 documents in this collection.
{
"_id" : ObjectId("5aca8ea670ed86102488d39d"),
"UserID" : "5ac161d742092040783a4ee1",
"ReferenceID" : 87396,
"ReferenceDate" : ISODate("2018-04-08T21:50:30.167Z"),
"ElapsedTime" : 1694,
"CreatedDate" : ISODate("2018-04-08T21:50:30.168Z")
}
If I run this command below to count all documents in collection, I have the result so quickly:
db.Tests.count()
But when I run this same command but to a specific user, I've got a message "Request rate is large".
db.Tests.find({UserID:"5ac161d742092040783a4ee1"}).count()
In the Cosmos DB documentation I found this cenario and the suggestion is increase RU. Currently I have 400 RU/s, when I increase to 10.000 RU/s I'm capable to run the command with no errors but in 5 seconds.
I already tryed to create index explicity, but it seems Cosmos DB doesn't use the index to make count.
I do not think it is reasonable to have to pay 10,000 RU / s for a simple count in a collection with approximately 100,000 documents, although it takes about 5 seconds.
Count by filter queries ARE using indexes if they are available.
If you try count by filter on a not indexed column the query would not time out, but fail. Try it. You should get error along the lines of:
{"Errors":["An invalid query has been specified with filters against path(s) excluded from indexing. Consider adding allow scan header in the request."]}
So definitely add a suitable index on UserID.
If you don't have index coverage and don't get the above error then you probably have set the enableScanInQuery flag. This is almost always a bad idea, and full scan would not scale. Meaning - it would consume increasingly large amounts of RU as your dataset grows. So make sure it is off and index instead.
When you DO have index on the selected column your query should run. You can verify that index is actually being used by sending the x-ms-documentdb-populatequerymetrics header. Which should return you confirmation with indexLookupTimeInMs and indexUtilizationRatio field. Example output:
"totalExecutionTimeInMs=8.44;queryCompileTimeInMs=8.01;queryLogicalPlanBuildTimeInMs=0.04;queryPhysicalPlanBuildTimeInMs=0.06;queryOptimizationTimeInMs=0.00;VMExecutionTimeInMs=0.14;indexLookupTimeInMs=0.11;documentLoadTimeInMs=0.00;systemFunctionExecuteTimeInMs=0.00;userFunctionExecuteTimeInMs=0.00;retrievedDocumentCount=0;retrievedDocumentSize=0;outputDocumentCount=1;outputDocumentSize=0;writeOutputTimeInMs=0.01;indexUtilizationRatio=0.00"
It also provides you some insight where the effort has gone if you feel like RU charge is too large.
If index lookup time itself is too high, consider if you index is selective enough and if the index settings are suitable. Look at your UserId values and distribution and adjust the index accordingly.
Another wild guess to consider is to check if the API you are using would defer executing find(..) until it knows that count() is really what you are after. It is unclear which API you are using. If it turns out it is fetching all matching documents to client side before doing the counting then that would explain unexpectedly high RU cost, especially if there are large amount of matching documents or large documents involved. Check the API documentation.
I also suggest executing the same query directly in Azure Portal to compare the RU cost and verify if the issue is client-related or not.
I think it just doesn't work.
The index seems to be used when selecting the documents to be counted, but then the count is done by reading each document, so effectively consuming a lot of RU.
This query is cheap and fast:
db.Tests.count({ UserID: { '$eq': '5ac161d742092040783a4ee1' }})
but this one is slow and expensive:
db.Tests.count({ ReferenceID: { '$gt': 10 }})
even though this query is fast:
db.Tests.find({ ReferenceID: { '$gt': 10 }}).sort({ ReferenceID: 1 })
I also found this: https://feedback.azure.com/forums/263030-azure-cosmos-db/suggestions/36142468-make-count-aware-of-indexes. Note the status: "We have started work on this feature. Will update here when this becomes generally available."
Pretty disappointing to be honest, especially since this limitation hasn't been addressed for almost 2 years. Note - I am not an expert in this matter and I'd love to be proven wrong, since I also need this feature.
BTW: I noticed that simple indexes seem to be created automatically for each individual field, so no need to create them manually.
I experienced a scenario where a select count(*) on a table every minute (yes, this should definitely be avoided) caused a huge increase in Cassandra writes to around 150K writes per second.
Can anyone explain this weird behavior? Why would a Select query significantly increase write count in Cassandra?
Thanks!
If you check
org.apache.cassandra.metrics:type=ReadRepair,name=RepairedBackground
and
org.apache.cassandra.metrics:type=ReadRepair,name=RepairedBlocking
metrics you can see if its read repairs sending mutations. Perhaps reading all the data to service the count(*) is causing a lot of read repairs if your data is inconsistent. If thats the case lowering the read_repair_chance and dclocal_read_repair_chance on the table (ALTER TABLE) could reduce load.
Other likely possibilities are:
You have tracing enabled (either globally or on the table) as some %.
Or if you use DSE and you have slow query's enabled.
A possible explanation could be found in the write path of an update:
During a write , Cassandra adds each new row to the database without checking on whether a duplicate record exists. This policy makes it possible that many versions of the same row may exist in the database.
Then
Most Cassandra installations store replicas of each row on two or more nodes. Each node performs compaction independently. This means that even though out-of-date versions of a row have been dropped from one node, they may still exist on another node.
And finally:
This is why Cassandra performs another round of comparisons during a read process. When a client requests data with a particular primary key, Cassandra retrieves many versions of the row from one or more replicas.