We use Cosmos DB to track all our devices, and data that is related to a device (but not stored in the device document itself) is stored in the same container under the same partition key value.
Both the device document and the related documents have /deviceId as the partition key. When a device is removed, I remove the device document. I actually want to remove the entire partition, but that doesn't seem to be possible, so I fall back to a query for all items with that partition key and remove them from the database one by one.
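Roughly, the cleanup I run today looks like this (a minimal sketch using the Python SDK; the account, database, and container names are placeholders):

```python
from azure.cosmos import CosmosClient

# Placeholder connection details
client = CosmosClient("https://my-account.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("devices-db").get_container_client("devices")

def delete_device_partition(device_id: str) -> None:
    """Query every item in the /deviceId partition and delete it one by one.
    RU cost grows with the amount of related data."""
    items = container.query_items(
        query="SELECT c.id FROM c",
        partition_key=device_id,  # restricts the query to a single partition
    )
    for item in items:
        container.delete_item(item["id"], partition_key=device_id)
```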
This works fine, but it may consume a lot of RUs if there is a lot of related data (which may be true in some cases). I would rather just remove the device and schedule all related data for removal later (it doesn't hurt to have it in the database for a while). When RU utilization is low, I would start removing these items. Is there a standard solution for this?
The ideal solution would be to schedule the deletions and have Cosmos DB process them when it has spare RUs, just like TTL-based deletion. Is that even possible?
A feature to delete all items by partition key is now in preview. It uses a fire-and-forget background processing model with a limited amount of available throughput. There's a signup link on the feature request page to get access to the preview.
Currently, the API looks like a new DeleteAllItemsByPartitionKey method in the SDK.
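There is an equivalent preview call in the Python SDK as well; roughly like this (a sketch, assuming a recent azure-cosmos version and the preview feature enabled on the account; names are placeholders):

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://my-account.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("devices-db").get_container_client("devices")

# Fire-and-forget: the service deletes the partition's items in the background,
# using a bounded slice of the container's throughput. Preview feature.
container.delete_all_items_by_partition_key("device-123")
```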
It definitely is possible to set a TTL and then let Cosmos handle expiring data out of the container when it is idle. However, the cost to update the document in the first place is about what it costs to delete it anyway so you're not gaining much.
An approach along the lines you suggest would be to have a separate container (or even a queue) where you insert a new item with the deviceId to retire. Then, in the evenings or at a time when you know the system is idle, run a job that reads the next deviceId from the queue, queries for all the items with that partition key, and then deletes the data or sets a TTL to expire it.
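A minimal sketch of that idea, assuming a hypothetical "devices-to-retire" container acting as the queue and a per-item `ttl` property on the data (all names are placeholders):

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://my-account.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("devices-db")
retire_queue = db.get_container_client("devices-to-retire")  # hypothetical queue container
data = db.get_container_client("devices")

def retire_next_device() -> None:
    """Pick one queued deviceId and mark all of its items for TTL-based expiry.
    Assumes the data container has a default TTL configured (e.g. -1) so the
    per-item ttl value takes effect."""
    queued = list(retire_queue.query_items(
        query="SELECT TOP 1 * FROM c",
        enable_cross_partition_query=True,
    ))
    if not queued:
        return
    device_id = queued[0]["deviceId"]

    for item in data.query_items(query="SELECT * FROM c", partition_key=device_id):
        item["ttl"] = 60  # expire in 60 seconds; Cosmos removes it in the background
        data.upsert_item(item)

    retire_queue.delete_item(queued[0]["id"], partition_key=device_id)
```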
There is a feature in the works to delete an entire partition that would be perfect for this scenario (in fact, it's designed for it), but there's no ETA on availability.
I have a multi-tenant product offering and use DynamoDB, so all our web requests are served from DynamoDB. I have a use case where I want to move a tenant's data from one region to another; this would be a background process.
How do I ensure the background process does not hog the database? Otherwise it will degrade the user experience and may bring the website down.
Is there a way I can provision dedicated read and write capacity for the background process?
You cannot dedicate read and write capacity units to specific processes, but you could temporarily change the table's capacity mode to on-demand for the move, and then switch it back to provisioned mode later when the move is complete. You can make this capacity mode switch once every 24 hours. By changing to on-demand capacity mode, you are less likely to be throttled in this specific situation.
That said, without knowing your current table capacity mode and capacity settings on those tables, it is difficult for me to make concrete recommendations.
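A sketch of that capacity-mode switch with boto3 (the table name and throughput values are placeholders; remember the once-per-24-hours limit on switching):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Switch the table to on-demand before starting the tenant move...
dynamodb.update_table(TableName="tenant-data", BillingMode="PAY_PER_REQUEST")

# ...and back to provisioned afterwards (allowed again only after 24 hours).
dynamodb.update_table(
    TableName="tenant-data",
    BillingMode="PROVISIONED",
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 50},
)
```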
Sorry, but the answer from Kirk is not a good idea for saving $$$. DynamoDB has a TTL feature: say you want to delete something, you expire the item, meaning queries that used to return that item no longer retrieve it because the TTL has expired.
But it is not yet DELETED! It is scheduled for deletion later, saving you those precious capacity units: DynamoDB removes expired items in the background instead of you deleting them one by one, which greatly saves you money and is what the feature is for.
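A sketch of that approach with boto3, assuming an epoch-seconds attribute named `expireAt` and a placeholder key schema (both are assumptions):

```python
import time
import boto3

client = boto3.client("dynamodb")
table = boto3.resource("dynamodb").Table("tenant-data")  # placeholder table

# One-time setup: tell DynamoDB which attribute holds the expiry timestamp.
client.update_time_to_live(
    TableName="tenant-data",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expireAt"},
)

# Instead of deleting an item, stamp it with an expiry time. DynamoDB's
# background TTL process removes it later without consuming your write capacity.
table.update_item(
    Key={"pk": "tenant#42", "sk": "item#1"},  # placeholder key schema
    UpdateExpression="SET expireAt = :t",
    ExpressionAttributeValues={":t": int(time.time())},
)
```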
This is the log of my Azure Cosmos DB for the last write operations:
Is it possible that write operations for documents between 400 KB and 600 KB in size have these costs?
Here is my document (a list of coordinates):
Basically, I thought at the beginning that it was a hot-partition problem, but afterwards I understood (I hope) that it is a problem with loading documents ranging in size from 400 KB to 600 KB. I wanted to understand whether there is something wrong in the database settings, in the indexing policy, or elsewhere, because it seems anomalous to me that about 3000 RUs are used to load a 400 KB JSON document, when the documentation indicates that loading a 100 KB document takes about 50 RUs. The document to be loaded is a road route, so I don't know how else to model it.
This is my indexing policy:
Thanks to everybody. I have spent months on this problem without finding a solution...
It's hard to know for sure what the expected RU cost should be to ingest a 400 KB-600 KB item. The cost of this operation will depend on the size of the item, your indexing policy, and the structure of the item itself. Greater hierarchy depth is more expensive to index.
You can get a good estimate for what the cost for a single write for an item will be using the Cosmos Capacity Calculator. In the calculator, click Sign-In, cut/paste your index policy, upload a sample document, reduce the writes per second to 1, then click calculate. This should give you the cost to insert a single item.
One thing to note here: if you have frequent updates to a small number of properties, I would recommend splitting the document into two, one with the static properties and another that is frequently updated. This can drastically reduce the cost of updates on large documents.
Hope this is helpful.
You can also pull the RU cost for a write using the SDK.
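For example, with the Python SDK the charge comes back on the response headers (a rough sketch; the account, container, and document shape are placeholders):

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://my-account.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("routes-db").get_container_client("routes")

# Write a sample document (placeholder shape for a road route).
container.upsert_item({"id": "route-1", "deviceId": "device-123", "coordinates": [[9.19, 45.46]]})

# The x-ms-request-charge header reports the RUs consumed by the last operation.
charge = container.client_connection.last_response_headers["x-ms-request-charge"]
print(f"Write consumed {charge} RUs")
```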
Check storage consumed
To check the storage consumed by an Azure Cosmos container, you can run a HEAD or GET request on the container and inspect the x-ms-resource-quota and x-ms-resource-usage headers. Alternatively, when working with the .NET SDK, you can use the DocumentSizeQuota and DocumentSizeUsage properties to get the storage consumed.
Link.
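With the Python SDK you can read the same quota/usage headers off a container read (a sketch; names are placeholders and the exact headers may vary by API version):

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://my-account.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("routes-db").get_container_client("routes")

# GET on the container, asking the service to populate quota information.
container.read(populate_quota_info=True)
headers = container.client_connection.last_response_headers
print(headers.get("x-ms-resource-quota"))
print(headers.get("x-ms-resource-usage"))
```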
I have a use case where I write data to DynamoDB in two tables, say t1 and t2, in a transaction. My app needs to read data from these tables a lot (1 write, at least 4 reads). I am considering DAX vs. ElastiCache. Does anyone have any suggestions?
Thanks in advance
K
ElastiCache is not intended for use with DynamoDB.
DAX is good for read-heavy apps, like yours. But be aware that DAX is only good for eventually consistent reads, so don't use it with banking apps, etc. where the info always needs to be perfectly up to date. Without further info it's hard to tell more, these are just two general points to consider.
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache that can reduce Amazon DynamoDB response times from milliseconds to microseconds, even at millions of requests per second. While DynamoDB offers consistent single-digit millisecond latency, DynamoDB with DAX takes performance to the next level with response times in microseconds for millions of requests per second for read-heavy workloads. With DAX, your applications remain fast and responsive, even when a popular event or news story drives unprecedented request volumes your way. No tuning required. https://aws.amazon.com/dynamodb/dax/
AWS recommends that you use DAX as the solution for this requirement.
ElastiCache is the older option and is typically used to store session state in addition to cached data.
DAX is used extensively for read-intensive workloads with eventually consistent reads and for latency-sensitive applications. DAX maintains two caches:
Item cache - populated with items based on GetItem results.
Query cache - populated with result sets based on the parameters of Query or Scan calls.
Cheers!
I'd recommend using DAX with DynamoDB, provided most of your read calls use the item-level API (and NOT the query-level API), such as GetItem.
Why? DAX has one odd behavior, described as follows. From AWS:
"Every write to DAX alters the state of the item cache. However, writes to the item cache don't affect the query cache. (The DAX item cache and query cache serve different purposes, and operate independently from one another.)"
To elaborate: if a query operation is cached and a later write affects the result of that previously cached query, then until the cached entry expires, the query cache will keep returning the outdated result.
This out-of-sync issue is also discussed here.
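A sketch of how that plays out with the Python DAX client (the cluster endpoint, table, and key names are placeholders):

```python
from boto3.dynamodb.conditions import Key
from amazondax import AmazonDaxClient

# Placeholder DAX cluster endpoint and table/key names.
dax = AmazonDaxClient.resource(
    endpoint_url="daxs://my-cluster.xxxxxx.dax-clusters.us-east-1.amazonaws.com"
)
table = dax.Table("orders")

# 1. Prime the query cache with a result set.
before = table.query(KeyConditionExpression=Key("pk").eq("user#1"))

# 2. Write through DAX: this updates the item cache, but NOT the query cache.
table.put_item(Item={"pk": "user#1", "sk": "order#2", "total": 42})

# 3. Until the query-cache TTL expires, the same query can still return the
#    stale, pre-write result set.
after = table.query(KeyConditionExpression=Key("pk").eq("user#1"))

# 4. GetItem goes through the item cache, so it does see the new item.
item = table.get_item(Key={"pk": "user#1", "sk": "order#2"})
```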
I find DAX useful only for cached queries, PutItem, and GetItem. In general, it is very difficult to find a use case for it.
DAX separates queries and scans from CRUD operations on individual items. That means if you update an item and then do a query/scan, the result will not reflect the change.
You can't invalidate the cache; entries are only invalidated when the TTL is reached or when a node's memory is full and it starts dropping old items.
Takeaways:
Doing puts/updates and then queries - two separate caches, so they get out of sync.
Looking up a single item - you are left with only the primary key, the default index, and a GetItem request (no query with limit 1). You can't use any indexes for gets/updates/deletes.
Using the ConsistentRead option on a query to get the latest data - it works, but only for the primary index.
Writing through DAX is slower than writing directly to DynamoDB, since you have an extra hop in the middle.
X-Ray does not work with DAX.
Use Case
You have queries where you don't really care that the results are not up to date.
You are doing few PutItem/UpdateItem calls and a lot of GetItem calls.
My use case is that I want to provide the user with an auto-suggest feature in a drop-down box: the user starts typing the first few characters and should be shown suggestions.
The problem is that the field I want suggestions on is also the hash key of my DynamoDB table, and queries on the hash key have to specify its full value, not a prefix.
Can anyone suggest a good DynamoDB pattern for this use case?
10,000 entries at, say, 20 characters each = about 200 KB. That is totally feasible to keep in memory and would be very fast to access.
Compare this with performing a database query every time the user types a character in the drop-down box: you'll be making maybe 10 database calls as they type. Multiply that by the number of concurrent users and you could conceivably be hitting hundreds of database accesses per second. The DynamoDB table would need to be provisioned with high read capacity to support this.
It would be much more sensible to keep the list in memory, or to use Amazon DynamoDB Accelerator (DAX), a fully managed in-memory cache for DynamoDB, or an Amazon ElastiCache cluster.
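A minimal sketch of the in-memory approach: scan the hash-key values once at startup, keep them sorted, and answer prefix lookups with a binary search (the table and attribute names are placeholders):

```python
import bisect
import boto3

table = boto3.resource("dynamodb").Table("products")  # placeholder table

def load_suggestions() -> list[str]:
    """Scan the hash-key values once and keep them sorted in memory (~200 KB for 10k names)."""
    names, kwargs = [], {}
    while True:
        page = table.scan(
            ProjectionExpression="#n",
            ExpressionAttributeNames={"#n": "name"},  # placeholder attribute (reserved word)
            **kwargs,
        )
        names.extend(item["name"] for item in page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    return sorted(names)

SUGGESTIONS = load_suggestions()

def suggest(prefix: str, limit: int = 10) -> list[str]:
    """Binary-search the sorted list for entries starting with the prefix."""
    start = bisect.bisect_left(SUGGESTIONS, prefix)
    end = bisect.bisect_right(SUGGESTIONS, prefix + "\uffff")
    return SUGGESTIONS[start:end][:limit]
```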
Suppose I am creating a transaction app.
How will I store transactions?
I know I need to denormalize.
Would I save the transaction within a transactions node at the first level of the database? Or would I save it under each user's node? Or would I save it in both: the transactions node at the first level and a transactions node under each user's node?
What if the user changed their name? How would I reflect that change in both the user's transaction history and the business's?
I feel like the best way is to put transactions only at the first level of the database and have users query the entire list to see their transaction history.
But if I have a lot of users, wouldn't this be extremely slow?
Or is Firebase smart enough and fast enough to handle such queries?
Does the user's internet speed affect this querying, especially on a mobile device?
Can you display the transaction on the screen as it is being loaded?
Would Firebase indexing allow me to do these very-large-dataset queries easily? Perhaps by indexing the user's username that is contained inside each transaction?
First, rather than filtering transaction history by username, I would suggest using the userId, which never changes and is always unique.
Second, I think saving transactions globally (not under '/userId') is better, because:
We need to be able to summarize all transactions for accounting reasons.
If you think the query will be slow even after adding an index, you can consider loading only part of the query result using limitToFirst(), just like pagination on the web (infinite scroll on Android). There is a great tutorial here.
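A sketch with the Firebase Admin SDK for Python, assuming a top-level /transactions node whose children carry a userId field and an '.indexOn': ['userId'] rule in the security rules (the paths, field names, and credentials file are placeholders):

```python
import firebase_admin
from firebase_admin import credentials, db

cred = credentials.Certificate("service-account.json")  # placeholder credentials file
firebase_admin.initialize_app(cred, {"databaseURL": "https://my-app.firebaseio.com"})

def first_page_of_transactions(user_id: str, page_size: int = 25) -> dict:
    """Read only one page of a user's transactions from the global /transactions node.
    Assumes '.indexOn': ['userId'] is defined under /transactions in the rules."""
    return (
        db.reference("transactions")
        .order_by_child("userId")
        .equal_to(user_id)
        .limit_to_first(page_size)
        .get()
    )
```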