I use an Azure Cosmos DB API for MongoDB account, version 3.6. In queries using skip and limit I noticed high throughput usage: the higher the skip, the more costly the query.
db.MyCollection.find({Property:"testtest"}).skip(12000).limit(10)
The query above costs around 3000 RU. The property in the find clause is my partition key. I have read that Cosmos DB can do queries with offset and limit, but as far as I can tell only the SQL API for Cosmos DB officially has an OFFSET LIMIT clause. Is something similar possible with the MongoDB API, or do I have to live with costly skip queries?
The SQL API will behave the same way with OFFSET LIMIT. You'll see an almost linear increase in RU as you increase the offset, because each query still iterates over all of the skipped documents.
If possible in your context, try to use a continuation token instead. You could also adjust your filter criteria on an indexed property to page through your data (see the sketch below).
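For example, a minimal sketch of range-based paging in the MongoDB shell, assuming the documents are ordered by an indexed field such as _id: instead of skipping, remember the last _id of the previous page and filter on it.

db.MyCollection.find({Property: "testtest"}).sort({_id: 1}).limit(10)
// note the last _id that was returned, then fetch the next page with:
db.MyCollection.find({Property: "testtest", _id: {$gt: lastSeenId}}).sort({_id: 1}).limit(10)

Here lastSeenId is a placeholder for the _id of the last document on the previous page; because the filter is served by the index, the RU charge stays roughly constant no matter how deep you page.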
The RU charge of a query with OFFSET LIMIT will increase as the number of terms being offset increases. For queries that have multiple pages of results, we typically recommend using continuation tokens. Continuation tokens are a "bookmark" for the place where the query can later resume. If you use OFFSET LIMIT, there is no "bookmark". If you wanted to return the query's next page, you would have to start from the beginning.
We use Cosmos DB to track all our devices. Data that is related to a device (but not stored in the device document itself) is stored in the same container under the same partition key.
Both the device document and the related documents have /deviceId as the partition key. When a device is removed, I remove the device document. I actually want to remove the entire partition, but this doesn't seem to be possible, so I fall back to a query that returns all items with this partition key and remove them from the database.
This works fine, but it may consume a lot of RUs if there is a lot of related data (which may be true in some cases). I would rather just remove the device and schedule all related data for removal later (it doesn't hurt to keep it in the database for a while). When RU utilization is low, I would start removing these items. Is there a standard solution for this?
The best solution would be to schedule this and have Cosmos DB process these commands when it has spare RUs, just like TTL-based deletion. Is this even possible?
A feature to delete all items by partition key is now in preview. It uses a fire-and-forget background processing model with a limited amount of spare throughput. There's a signup link on the feature request page to get access to the preview.
Currently, the API looks like a new DeleteAllItemsByPartitionKey method in the SDK.
It definitely is possible to set a TTL and then let Cosmos DB handle expiring data out of the container when it is idle. However, the cost of updating the document to set the TTL is about the same as the cost of deleting it, so you're not gaining much.
An approach, as you suggest, may be to have a separate container (or even a queue) into which you insert a new item with the deviceId to retire. Then, in the evenings or at a time when you know the system is idle, run a job that reads the next deviceId from the queue, queries for all items with that partition key, and then either deletes them or sets a TTL so they expire, as in the sketch below.
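A minimal sketch of such a cleanup job in Node.js, assuming the @azure/cosmos SDK; the database/container names and the ttl value are placeholders:

const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({ endpoint: process.env.COSMOS_ENDPOINT, key: process.env.COSMOS_KEY });
const container = client.database("devices-db").container("devices");

// Expire (or delete) every item in one device's partition.
async function retireDevice(deviceId) {
  const { resources } = await container.items
    .query(
      { query: "SELECT * FROM c WHERE c.deviceId = @id", parameters: [{ name: "@id", value: deviceId }] },
      { partitionKey: deviceId }
    )
    .fetchAll();

  for (const doc of resources) {
    // Option 1: delete outright.
    // await container.item(doc.id, deviceId).delete();
    // Option 2: stamp a short TTL and let Cosmos DB expire the item in the background
    // (the container must have TTL enabled).
    await container.item(doc.id, deviceId).replace({ ...doc, ttl: 60 });
  }
}

Run retireDevice for each queued deviceId during the idle window; both variants still consume RUs per item, so the TTL option mainly shifts when that cost is paid.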
There is a feature to delete an entire partition in the works that would be perfect for this scenario (in fact, it's designed for it) but no ETA on availability.
I am evaluating Cosmos DB for a project and working through the documentation. I have created a sample collection following the documentation on this page https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-getting-started. When I run the first query on this page in the local emulator I get the following results:
Why is the Request Charge 2.89 RUs? From all of the documentation I have read, this should be 1 RU. The collection is partitioned on the id field, is auto-indexed, and Cross Partition Queries are enabled. I have even tried putting both items in the same partition and I get the same results.
1 RU is the cost of a Point-Read operation, not a query. Reference: https://learn.microsoft.com/azure/cosmos-db/request-units:
The cost to read a 1 KB item is 1 Request Unit (or 1 RU).
Also there:
Query patterns: The complexity of a query affects how many RUs are consumed for an operation. Factors that affect the cost of query operations include
If you want to read a single document and you know the id and partition key, just do a point read; it will always be cheaper than a query with WHERE c.id = "something". If you don't know the partition key, then yes, you need a cross-partition query, because you don't know under which partition key the document is stored, and there could be multiple documents with the same id (as long as their partition keys are different, see https://learn.microsoft.com/azure/cosmos-db/partitioning-overview).
You can use any of the available SDKs or work with the REST API.
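For example, a minimal sketch in Node.js with the @azure/cosmos SDK, using the Families sample from the getting-started page (endpoint, key, and names are placeholders); both calls return the same document, but only the point read is billed at the fixed ~1 RU rate:

const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({ endpoint: "https://localhost:8081", key: "<emulator-key>" });
const container = client.database("FamilyDatabase").container("FamilyContainer");

async function main() {
  // Point read: id plus partition key value (here the container is partitioned on id).
  const pointRead = await container.item("AndersenFamily", "AndersenFamily").read();
  console.log("point read RU:", pointRead.requestCharge);

  // Equivalent query: goes through the query engine, so the RU charge is higher.
  const query = await container.items
    .query("SELECT * FROM f WHERE f.id = 'AndersenFamily'")
    .fetchAll();
  console.log("query RU:", query.requestCharge);
}

main().catch(console.error);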
I have a use case where I write data to DynamoDB in two tables, say t1 and t2, in a transaction. My app needs to read data from these tables a lot (1 write, at least 4 reads). I am considering DAX vs ElastiCache. Does anyone have any suggestions?
Thanks in advance
K
ElastiCache is not intended for use with DynamoDB.
DAX is good for read-heavy apps, like yours. But be aware that DAX is only good for eventually consistent reads, so don't use it with banking apps, etc. where the info always needs to be perfectly up to date. Without further info it's hard to tell more, these are just two general points to consider.
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache that can reduce Amazon DynamoDB response times from milliseconds to microseconds, even at millions of requests per second. While DynamoDB offers consistent single-digit millisecond latency, DynamoDB with DAX takes performance to the next level with response times in microseconds for millions of requests per second for read-heavy workloads. With DAX, your applications remain fast and responsive, even when a popular event or news story drives unprecedented request volumes your way. No tuning required. https://aws.amazon.com/dynamodb/dax/
AWS recommends DAX as the solution for this requirement.
ElastiCache is the older, more general option; it is also used to store session state in addition to cached data.
DAX is widely used for read-intensive, latency-sensitive applications that can tolerate eventually consistent reads. DAX maintains two caches:
Item cache - populated with items based on GetItem results.
Query cache - keyed on the parameters used in Query or Scan requests.
Cheers!
I'd recommend using DAX with DynamoDB, provided most of your reads go through the item-level API (and NOT the query-level API), such as GetItem.
Why? DAX has one surprising behavior, described by AWS as follows:
"Every write to DAX alters the state of the item cache. However, writes to the item cache don't affect the query cache. (The DAX item cache and query cache serve different purposes, and operate independently from one another.)"
To elaborate: if a query result is cached and you then perform a write that would change the result of that previously cached query, the cached entry is not invalidated; until it expires, the query cache keeps returning the outdated result.
This out-of-sync issue is also discussed here.
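To illustrate, a minimal sketch using the amazon-dax-client package for Node.js together with the AWS SDK DocumentClient; the cluster endpoint, table, and attribute names are made up:

const AmazonDaxClient = require("amazon-dax-client");
const AWS = require("aws-sdk");

const dax = new AmazonDaxClient({
  endpoints: ["my-cluster.xxxxxx.dax-clusters.us-east-1.amazonaws.com:8111"],
  region: "us-east-1",
});
const daxDoc = new AWS.DynamoDB.DocumentClient({ service: dax });

async function demo() {
  const queryParams = {
    TableName: "Orders",
    KeyConditionExpression: "customerId = :c",
    ExpressionAttributeValues: { ":c": "42" },
  };

  // First query: the result set is stored in DAX's query cache.
  const before = await daxDoc.query(queryParams).promise();

  // Write a new item: this updates the item cache, but does NOT invalidate the cached query.
  await daxDoc.put({ TableName: "Orders", Item: { customerId: "42", orderId: "o-2" } }).promise();

  // Until the query-cache TTL expires, this still returns the old, stale result set.
  const after = await daxDoc.query(queryParams).promise();
  console.log(before.Count, after.Count); // likely equal while the cached entry is still live
}

demo().catch(console.error);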
I find DAX useful only for cached queries, PutItem, and GetItem. In general it is very difficult to find a use case for it.
DAX keeps queries and scans separate from CRUD on individual items. That means if you update an item and then run a query or scan, it will not reflect the change.
You can't invalidate the cache explicitly; entries are only evicted when their TTL is reached or when a node's memory is full and old items are dropped.
Takeaways:
Doing puts/updates and then queries - two separate caches, so they go out of sync.
Looking up a single item - you are left with only the primary key, the default index, and a GetItem request (no query with limit 1). You can't use any indexes for gets/updates/deletes.
Using the ConsistentRead option on a query to get the latest data - it works, but only for the primary index.
Writing through DAX is slower than writing directly to DynamoDB, since you have an extra hop in the middle.
X-Ray does not work with DAX.
Use cases:
You have queries where you don't really care that the results are not up to date.
You are doing few PutItem/UpdateItem calls and a lot of GetItem calls.
I am working on an existing Cosmos DB where the number of physical partitions is less than 100. Each contains around 30,000,000 documents. There is an indexing policy in place on "/*".
I'm just trying to get a total count from the SQL API like so:
SELECT VALUE COUNT(1) FROM mycollection c
I have set EnableCrossPartitionQuery to true and MaxDegreeOfParallelism to 100 (so as to at least cover the number of physical partitions, a.k.a. key ranges). The database is scaled to 50,000 RU/s. The query runs for HOURS. This does not make sense to me; an equivalent relational database would answer this question almost immediately. This is ridiculous.
What, if anything, can I change here? Am I doing something wrong?
Microsoft support ended up applying an update to the underlying instance. In this case, the update was in the development pipeline to be rolled out gradually. This instance got it earlier as a result of the support case. The update related to using indexes to service this type of query.
We are using the dynogels library to query a DynamoDB table. Unfortunately, as DynamoDB does not have a pagination feature, for a specific need we are retrieving all data from the table through loadAll to get all items of the table (18K items), and we are facing an error because we exceed the provisioned read capacity.
Apart from this query that retrieves the entire content of the table, we only have very small read usage of the table. We also tried to dynamically update the provisioned I/O capacity, but we are limited to 4 changes per hour.
Can you suggest a solution? Do you know how to use pagination in DynamoDB? Is it possible to use DAX as a local DynamoDB cache?
Thank you
Unfortunately, as DynamoDB does not have a pagination feature, for a specific need,
DynamoDB does have a pagination feature: you can specify a limit on the number of items returned per request, and as part of the result the Query/Scan API returns a LastEvaluatedKey, which you pass as the ExclusiveStartKey of the next request. Repeat until no LastEvaluatedKey is returned, which indicates the end of the results: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html#Query.Pagination
Doesn't the dynogels library support pagination?
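For reference, a minimal sketch of that loop in Node.js using the plain AWS SDK DocumentClient (table, key, and region are made up; dynogels builds on the same underlying API):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient({ region: "eu-west-1" });

// Read one partition page by page instead of loading the whole table at once.
async function readAllPages(deviceId) {
  const items = [];
  let startKey = undefined;
  do {
    const page = await docClient.query({
      TableName: "Devices",                      // hypothetical table name
      KeyConditionExpression: "deviceId = :d",
      ExpressionAttributeValues: { ":d": deviceId },
      Limit: 100,                                // small pages keep the consumed read capacity low
      ExclusiveStartKey: startKey,
    }).promise();
    items.push(...page.Items);
    startKey = page.LastEvaluatedKey;            // undefined when there are no more pages
  } while (startKey);
  return items;
}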