Pricing of DynamoDB continuous backups

The DynamoDB pricing page contains the following text explaining how much storing continuous backups (a.k.a. PITR, point-in-time recovery) costs:
DynamoDB charges for PITR based on the size of each DynamoDB table (table data and local secondary indexes) on which it is enabled. DynamoDB monitors the size of your PITR-enabled tables continuously throughout the month to determine your backup charges and continues to bill you until you disable PITR on each table.
This seems to say that the user is charged for continuous backups based on the size of the table they are enabled on, not the size of the backup stored. It means that if a user continuously modifies existing data instead of adding new data, Amazon may need huge amounts of storage to hold 35 days' worth of modifications, space the user does not pay for. That doesn't make sense to me. I suspect their pricing needs to correspond to the size of the backup, not the table, but this is not claimed in the above text or in any of the similar variants I found on Amazon's site.
So my question is - how does Amazon charge for continuous-backup storage? By the table size, or by the backup size (i.e., the amount of changes)? Is this documented anywhere?
Curiously, I couldn't find any other source on the web which discusses this question.
I found many slightly modified versions of the above text copied into all sorts of tutorials, but none of them gives an example or answers my question. It's as if nobody really cares how much this feature will cost before they start using it :-)

Pricing
Your assumptions are correct: it is indeed the size of the table that you pay for. This means that PITR is extremely cost-effective compared to taking multiple on-demand backups. Moreover, PITR lets you restore to any point in time in the previous 35 days.
But How?
How is it done? Simply put, it's smarts from the DynamoDB team, which uses S3 and snapshotting to store your backups. Learn more from this re:Invent presentation, which goes into further detail.

The pricing model seems to be clearly stated in your quote: it is based on the size of the table. An on-demand backup's size can easily be estimated from the size of the table, but in the absence of details about how continuous backups are implemented, it seems that the uncertainty about the total backup size is accounted for by setting the price at double the price of on-demand backup storage, perhaps with the help of analysis of usual/average levels of activity.
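To make the arithmetic concrete, here is a tiny sketch comparing the two storage prices for a hypothetical table. The per-GB rates are assumptions (roughly the published us-east-1 figures); check the pricing page for your region.

```python
# Illustrative only: assumed us-east-1 rates; check the DynamoDB pricing page for your region.
PITR_RATE_PER_GB_MONTH = 0.20       # continuous backups (PITR), USD per GB-month (assumed)
ON_DEMAND_RATE_PER_GB_MONTH = 0.10  # on-demand backup storage, USD per GB-month (assumed)

table_size_gb = 100  # table data + local secondary indexes

pitr_cost = table_size_gb * PITR_RATE_PER_GB_MONTH
# Rough on-demand equivalent: 5 weekly snapshots retained to cover a 35-day window.
on_demand_cost = table_size_gb * ON_DEMAND_RATE_PER_GB_MONTH * 5

print(f"PITR for a {table_size_gb} GB table:  ~${pitr_cost:.2f}/month")
print(f"5 retained on-demand backups: ~${on_demand_cost:.2f}/month")
```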

Related

Huge amount of RU to write document of 400kb - 600kb on Azure cosmos db

This is the log of my Azure Cosmos DB for the last write operations:
Is it possible that write operations for documents between 400 KB and 600 KB in size have these costs?
Here is my document (a list of coordinates):
Initially I thought it was a hot-partition problem, but afterwards I understood (I hope) that it is a problem with loading documents ranging in size from 400 KB to 600 KB. I wanted to understand whether there is something wrong in the database settings, in the indexing policy, or elsewhere, as it seems anomalous to me that about 3000 RU are used to load a 400 KB JSON document, when the documentation indicates that loading an item of about 100 KB takes roughly 50 RU. The document to be loaded is a road route, so I don't know how else to model it.
This is my indexing policy:
Thanks to everybody. I have spent months on this problem without finding a solution...
It's hard to know for sure what the expected RU/s cost should be to ingest a 400KB-600KB item. The cost of this operation will depend on the size of the item, your indexing policy and the structure of the item itself. Greater hierarchy depth is more expensive to index.
You can get a good estimate for what the cost for a single write for an item will be using the Cosmos Capacity Calculator. In the calculator, click Sign-In, cut/paste your index policy, upload a sample document, reduce the writes per second to 1, then click calculate. This should give you the cost to insert a single item.
One thing to note here: if you have frequent updates to a small number of properties, I would recommend you split the document into two - one with static properties, and another that is frequently updated. This can drastically reduce the cost of updates on large documents.
Hope this is helpful.
You can also pull the RU cost for a write using the SDK.
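For instance, with the Python SDK (azure-cosmos) the charge of the last operation is available from the response headers. A minimal sketch, assuming an existing account, database and container (all names below are placeholders):

```python
from azure.cosmos import CosmosClient

# Placeholders: substitute your own endpoint, key, database and container names.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("routes")

doc = {"id": "route-1", "coordinates": [[44.1, 9.8], [44.2, 9.9]]}  # sample item
container.upsert_item(doc)

# The RU charge of the last request is reported in the x-ms-request-charge header.
charge = container.client_connection.last_response_headers.get("x-ms-request-charge")
print(f"Write consumed {charge} RUs")
```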
Check storage consumed
To check the storage consumption of an Azure Cosmos container, you can run a HEAD or GET request on the container, and inspect the x-ms-request-quota and the x-ms-request-usage headers. Alternatively, when working with the .NET SDK, you can use the DocumentSizeQuota and DocumentSizeUsage properties to get the storage consumed.
Link.
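As a rough sketch of that header-inspection approach with the Python SDK (filtering for quota/usage headers rather than hard-coding their exact names; connection details are placeholders):

```python
from azure.cosmos import CosmosClient

# Placeholders: substitute your own endpoint, key, database and container names.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("routes")

# Reading the container with quota info populates the response headers
# that carry the quota/usage figures mentioned above.
container.read(populate_quota_info=True)
headers = container.client_connection.last_response_headers
for name, value in headers.items():
    if "quota" in name.lower() or "usage" in name.lower():
        print(name, "=", value)
```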

Are CosmosDB attachments (still) a good way to store payloads larger than the document limit of 2MB?

I need to save CosmosDB documents that contain a large list - larger than the document limit of 2 MB.
I'd like to split that list into an 'attachment' associated to the "main document".
But this documentation page briefly mentions that
Attachment feature is being depreciated [sic]
What's the deprecation plan? Will newly created collections (in the future) stop supporting attachments?
The same page of documentation mentions a limit of 2GB for "Maximum attachment size per Account".
What does that mean? 2GB per attachment? 2 GB total for all attachments?
I recommend not taking a dependency on attachments. We are still planning on deprecating them but have not started in earnest on this.
Depending on your access patterns for this data you may want to break this up as individual documents or modeled in some other way. CRUD operations on large documents can be very costly and will experience high latency because of the large payload in each request.
If you have an unbounded array these should definitely be stored as individual documents or modeled such that increasing size does not cause eventual performance issues. If your data is updated frequently it should be modeled such that the frequently updated portions are separate from properties that are static.
This article here describes scenarios and considerations when modeling data in Cosmos and may help you come up with a more efficient model.
https://learn.microsoft.com/en-us/azure/cosmos-db/modeling-data
Hope this is helpful.
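To illustrate the "individual documents" approach: instead of one multi-MB document holding the whole list, each entry can be stored as its own small document under a shared partition key, and the set read back with a single-partition query. A minimal sketch with the Python SDK, assuming a container partitioned on /parentId (all names are placeholders):

```python
from azure.cosmos import CosmosClient

# Placeholders: substitute your own endpoint, key, database and container names.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("items")  # partition key: /parentId (assumed)

parent_id = "order-123"
big_list = [{"seq": i, "payload": f"entry {i}"} for i in range(10_000)]

# Write each entry as its own small document, all under the parent's partition key.
for entry in big_list:
    container.upsert_item({"id": f"{parent_id}-{entry['seq']}", "parentId": parent_id, **entry})

# Read the whole set back with a single-partition query.
entries = container.query_items(
    query="SELECT * FROM c WHERE c.parentId = @p ORDER BY c.seq",
    parameters=[{"name": "@p", "value": parent_id}],
    partition_key=parent_id,
)
```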

DynamoDB tables per customer considering DynamoDB's advanced recovery abilities

I am deciding whether to have a table per customer, or have every customer share a table with everybody else. Creating a table for every customer seems problematic, as it is just another thing to manage.
But then I thought about backing up the database. There could be a situation where a customer does not have strong IT security, or even a disgruntled employee, and that this person goes and deletes a whole bunch of crucial data of the customer.
In this scenario, if all the customers are in the same table, one couldn't just restore from a DynamoDB snapshot from 2 days ago, for instance, as all other customers would then lose the past 2 days of data. Before the cloud this really wasn't such a prevalent consideration IMO, because backups were not as straightforward, and offering such functionality to customers who are not tier-1 businesses wasn't really on the table.
But this functionality could be a huge selling point for my SaaS application, so now I am thinking it will be worth the hassle to have a table per customer. Is this the right line of thinking?
Sounds like a good line of thinking to me. A couple of other things you might want to consider:
Having all customer data in one table will probably be cheaper, as you can distribute RCUs and WCUs more efficiently. From your customers' point of view this might be good or bad, because one customer can consume another customer's RCUs/WCUs (if you want to think about it like that). If you split customer data into separate tables you can provision them independently.
Fine-grained security isn't great in DynamoDB. You can only really implement row (item) level security if the partition key of the table is an Amazon user ID. If this isn't possible you are relying on application code to protect customer data. Splitting customer data into separate tables will improve security (if you can't use item-level security).
On to your question. DynamoDB backups don't actually have to be restored into the same table. So potentially you could have all your customer data in one table which is backed up. If one customer requests a restore you could load the data into a new table, sync their data into the live table and then remove the restore table. This wouldn't necessarily be easy, but you could give it a try. Also, you would be paying for the RCUs/WCUs as you perform the sync - a cost you don't incur on a plain restore.
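A rough boto3 sketch of that restore-then-sync idea, using a point-in-time restore into a new table and copying one customer's items back (the table names, key name and customer id are assumptions about your schema):

```python
from datetime import datetime, timedelta

import boto3
from boto3.dynamodb.conditions import Key

client = boto3.client("dynamodb")
dynamodb = boto3.resource("dynamodb")

# 1. Restore the shared table into a *new* table as of 2 days ago (requires PITR to be enabled).
client.restore_table_to_point_in_time(
    SourceTableName="app-data",
    TargetTableName="app-data-restore",
    RestoreDateTime=datetime.utcnow() - timedelta(days=2),
)
# Restores can take a while; a real script would widen the waiter's timeout.
client.get_waiter("table_exists").wait(TableName="app-data-restore")

# 2. Copy just the affected customer's items back into the live table.
restored = dynamodb.Table("app-data-restore")
live = dynamodb.Table("app-data")

response = restored.query(KeyConditionExpression=Key("customerId").eq("cust-42"))
with live.batch_writer() as batch:
    for item in response["Items"]:
        batch.put_item(Item=item)
# (Real code would paginate the query and delete the restore table afterwards.)
```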
Hope some of that is useful.
Separate tables:
Max number of tables. It's probably a soft limit, but you'd have to contact support rather often - extra overhead for you, because they prefer to raise limits in small (reasonable) increments.
A lot more things to manage, secure, monitor etc.
There's probably a lot more RCU and WCU waste.
Just throwing another idea up in the air, haven't tried it or considered every pro and con.
Pick up all the write ops with Lambda and write them to backup table(s). Use TTL (set to how long users can restore their stuff) to delete old entries for free. You could even adjust the TTL on a per-customer basis if you e.g. provide longer backups for different price classes of your service.
You need a good schema to avoid hot keys.
customer-id (partition key) | time-of-operation#uuid (sort key) | data, source table, etc.
-------------------------------------------------------------------------------------------
E.g. this schema might be problematic if some of your customers are a lot more active than others.
Possible solution: use a known range of integers to suffix the IDs, e.g. customer-id#1, customer-id#2 ... customer-id#100 etc. This spreads the writes, and since your app knows the range it can still query.
Anyway, this is just a quick and dirty example off the top of my head.
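For what it's worth, a sketch of what that Lambda could look like, assuming it is wired to the main table's DynamoDB stream and writes into a backup table using the schema above with a TTL attribute (every name here is made up):

```python
import time
import uuid

import boto3

backup = boto3.resource("dynamodb").Table("backup-table")  # assumed backup table name
RETENTION_DAYS = 35  # could be varied per customer / price class

def handler(event, context):
    """Triggered by the main table's DynamoDB stream; copies each write into the backup table."""
    now = int(time.time())
    with backup.batch_writer() as batch:
        for record in event["Records"]:
            if record["eventName"] == "REMOVE":
                continue  # this sketch only backs up inserts and updates
            new_image = record["dynamodb"]["NewImage"]    # DynamoDB attribute-value format
            customer_id = new_image["customerId"]["S"]    # assumed attribute on the source item
            batch.put_item(Item={
                "pk": customer_id,  # could be suffixed, e.g. f"{customer_id}#{shard}", to spread writes
                "sk": f"{record['dynamodb']['ApproximateCreationDateTime']}#{uuid.uuid4()}",
                "image": new_image,
                "expiresAt": now + RETENTION_DAYS * 86400,  # TTL attribute deletes old entries for free
            })
```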
A few pros and cons that come to mind:
Probably more expensive unless separate tables have big RCU/WCU headroom.
Restoring from a regular backup might be a huge headache, e.g. which items to sync?
This is very granular: users can pick any moment in your TTL range to restore.
Restore specific items, or revert specific ops, with very low cost if your schema allows it.
Could use that backup data to e.g. show the history of items in the front end.

Is there a way to estimate how much space will be consumed by archivelogs?

We are trying to get a grasp of the space requirements of switching from a backup process based on Data Pump exports to one based on RMAN and archivelogs. Currently we're a bit limited by space constraints, even though the database is only about 20 GB, give or take, if you include temporary tablespaces, etc. I need to know how to estimate how large my archivelogs will become if I switch over to using them, so I can tell our sysadmins how much space they need to give me (due to the backup measures they take on the server side, space increases consume disproportionate amounts of additional space on the tape drives, so they get irritatingly stingy about it).
Space consumed by archivelogs mostly depends upon the type of transactions happening on your database. If transactions are mostly select queries, you won't need that much space for archive storage as the generation frequency will be low.
If the database transactions are mostly dml or write based queries, you will need a larger storage area for archives as the frequency of generation will be high.
This also depends on how frequent checkpoints occur in the database and the size of your undo segments.
Since you have space constraints, I would suggest having a decent amount of space for archives and have frequent tape backups of the archivelogs so that backed up archives can be deleted from the server storage to gain space for newer archives.
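If you want a concrete number rather than a guess, one common approach is to measure how much redo the database already generates, since archivelog volume roughly tracks redo generation. A sketch with python-oracledb (credentials and DSN are placeholders), sampling the 'redo size' statistic over a representative period:

```python
import time

import oracledb  # pip install oracledb

# Placeholders: substitute your own credentials and DSN.
conn = oracledb.connect(user="system", password="secret", dsn="dbhost/ORCLPDB1")
cur = conn.cursor()

def redo_bytes():
    # Cumulative redo generated since instance startup.
    cur.execute("SELECT value FROM v$sysstat WHERE name = 'redo size'")
    return cur.fetchone()[0]

start = redo_bytes()
time.sleep(3600)  # sample an hour of representative activity
generated = redo_bytes() - start
print(f"~{generated / 1024**2:.1f} MB of redo per hour, "
      f"roughly {generated * 24 / 1024**3:.2f} GB of archivelogs per day at this rate")
```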

Is it ok to build architecture around regular creation/deletion of tables in DynamoDB?

I have a messaging app where all messages are arranged into seasons by creation time. There could be billions of messages in each season. I have a task to delete the messages of old seasons. I thought of a solution which involves DynamoDB table creation/deletion, like this:
Each table contains messages of only one season
When a season becomes 'old' and its messages are no longer needed, the table is deleted
Is this a good pattern, and is it encouraged by Amazon?
PS: I'm asking because I'm wary of two things I've encountered in different Amazon services -
In Amazon S3 you have to delete each object before you can fully delete a bucket. When you have billions of objects, that becomes a real pain.
In Amazon SQS there is a notion of 'unwanted behaviour': when using the SQS API you can act badly with regard to the SQS infrastructure (for example, not polling messages) and could thus be penalized for it.
Yes, this is an acceptable design pattern; it actually follows a best practice put forward by the AWS team, but there are things to consider for your specific use case.
AWS has a limit of 256 tables per region, but this can be raised. If you are expecting to need multiple orders of magnitude more than this you should probably re-evaluate.
You can delete a DynamoDB table that still contains records. If you have a large number of records that you have to delete regularly, this is actually a best practice: use a rolling set of tables.
Creating and deleting tables is an asynchronous operation, so you do not want your application to depend on the time it takes for these operations to complete. Make sure you create tables well in advance of needing them. Under normal circumstances tables are created in a few seconds to a few minutes, but under very, very rare outage circumstances I've seen it take hours.
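A minimal boto3 sketch of the rolling-table approach, creating the next season's table ahead of time and dropping an old one (table names, key schema and billing mode are illustrative):

```python
import boto3

client = boto3.client("dynamodb")

def create_season_table(season: str):
    """Create the next season's table well before it is needed; creation is asynchronous."""
    name = f"messages-{season}"
    client.create_table(
        TableName=name,
        KeySchema=[
            {"AttributeName": "conversationId", "KeyType": "HASH"},
            {"AttributeName": "messageId", "KeyType": "RANGE"},
        ],
        AttributeDefinitions=[
            {"AttributeName": "conversationId", "AttributeType": "S"},
            {"AttributeName": "messageId", "AttributeType": "S"},
        ],
        BillingMode="PAY_PER_REQUEST",
    )
    client.get_waiter("table_exists").wait(TableName=name)

def drop_season_table(season: str):
    """Dropping the whole table removes billions of items in a single call."""
    name = f"messages-{season}"
    client.delete_table(TableName=name)
    client.get_waiter("table_not_exists").wait(TableName=name)

# e.g. create next season's table in advance, then drop one that is out of retention.
create_season_table("2024-q3")
drop_season_table("2023-q3")
```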
The DynamoDB best practices documentation on Understand Access Patterns for Time Series Data states...
You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.
It's perfectly acceptable to split your data the way you describe. You can delete a DynamoDB table regardless of its size or how many items it contains.
As far as I know there are no explicit SLAs for the time it takes to delete or create tables (meaning there is no way to know if it's going to take 2 seconds or 2 minutes or 20 minutes), but as long as your solution does not depend on this sort of timing you're fine.
In fact the idea of sharding your data based on age has the potential of significantly improving the performance of your application and will definitely help you control your costs.
