What does DynamoDB's scan really cost? - amazon-dynamodb

I'm using DynamoDB to store users like this:
Users: uuid (Partition Key), joined_at (Sort Key), full_name
I'd like to paginate through thousands of users sorted by joined_at, in batches of 20 users.
I wanted to use the Scan operation for this, but I heard it literally reads the whole table (which would be very costly in read units).
Is that true? How do I consume only 20 reads at a time?
P.S. Using Query, on the other hand, requires me to filter by the ID of a single user, which is not what I want.

Pagination is available with Query in DynamoDB; you have to implement it as explained here. The problem with pagination in DynamoDB is that you cannot get the total count of items in the table. To work around that, you can maintain a separate table that stores the item counts of your tables.
The next or previous page is requested like this (setting ScanIndexForward to false returns the items before the sort key value instead of after it):
{
  "TableName": "users",
  "ScanIndexForward": true,
  "KeyConditionExpression": "#hkey = :hvalue AND #rkey > :rvalue",
  "ExpressionAttributeNames": {
    "#hkey": "uuid",
    "#rkey": "joined_at"
  },
  "ExpressionAttributeValues": {
    ":hvalue": "<your-uuid>",
    ":rvalue": 0
  },
  "ExclusiveStartKey": {
    "uuid": "<your-uuid>",
    "joined_at": "<your joined_at value>"
  },
  "Limit": 10
}
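As a rough sketch of how such a request can be driven from code, here is a hedged boto3 version (the table and attribute names come from the question; the page size and the way the uuid is supplied are assumptions):

import boto3

dynamodb = boto3.client("dynamodb")

def get_page(user_uuid, last_evaluated_key=None, page_size=20):
    # Fetch one page of items for this partition key, 20 at a time.
    kwargs = {
        "TableName": "users",
        "KeyConditionExpression": "#hkey = :hvalue",
        "ExpressionAttributeNames": {"#hkey": "uuid"},
        "ExpressionAttributeValues": {":hvalue": {"S": user_uuid}},
        "Limit": page_size,
    }
    if last_evaluated_key:
        kwargs["ExclusiveStartKey"] = last_evaluated_key
    response = dynamodb.query(**kwargs)
    # Pass response.get("LastEvaluatedKey") back in to fetch the next page;
    # when it is missing, you have reached the end.
    return response["Items"], response.get("LastEvaluatedKey")

Note that a Query like this only pages within a single partition key, which is exactly the limitation the question points out.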

Amazon DynamoDB pricing is based on provisioned capacity. You choose the amount of provisioned throughput for reads and writes, and you are charged for that capacity whether or not you use it.
When performing a large scan, you will consume a lot of that capacity. It will not cost you extra money, but it will use up the capacity you have provisioned, so other users of the table might be impacted because there is insufficient capacity left to service their requests at the same time. Your scan might also take a long time, because it can only read at the rate defined by your provisioned capacity.
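If you want to see how much capacity a scan actually consumes, you can ask DynamoDB to report it. A minimal sketch with boto3 (the table name is an assumption):

import boto3

dynamodb = boto3.client("dynamodb")

total_rcu = 0.0
kwargs = {"TableName": "users", "ReturnConsumedCapacity": "TOTAL"}
while True:
    page = dynamodb.scan(**kwargs)
    total_rcu += page["ConsumedCapacity"]["CapacityUnits"]
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

print(f"Scan consumed {total_rcu} read capacity units")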

Related

Filtering a large dynamodb table for data analytics purposes

We have a request from our compliance department asking us to scan a DynamoDB table which has millions of records. We need to be able to filter all the records for approximately 1,300 email addresses; the email address on this table is not the partition key but is in a global secondary index.
This is not a one-time request and we need to be able to repeat the process with minimal effort in the future, by which time the table might have grown or the number of requested emails might be larger.
What would be the best approach to filter the data and only take the records related to these emails?
I can only think of the following two approaches, maybe utilizing a Lambda or Step Functions if the work needs to be done in batches, but I am open to any scalable alternatives:
Should we export the whole table to S3 and then process that?
Go through each email and call DynamoDB?
You say that the emails are in a GSI. If the email is in the primary key of the GSI, then the easiest solution is to call DynamoDB once for each email, and you can make these calls in parallel (though you may want to do them in chunks of 1,000 to avoid throttles or exhausting file handles on your host).
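A hedged sketch of that per-email approach with boto3 (the table name, the index name EmailIndex, and the attribute name email are assumptions):

import boto3
from concurrent.futures import ThreadPoolExecutor

dynamodb = boto3.client("dynamodb")

def records_for_email(email):
    # Query the (assumed) EmailIndex GSI for every record with this email.
    items = []
    kwargs = {
        "TableName": "records",
        "IndexName": "EmailIndex",
        "KeyConditionExpression": "email = :email",
        "ExpressionAttributeValues": {":email": {"S": email}},
    }
    while True:
        response = dynamodb.query(**kwargs)
        items.extend(response["Items"])
        if "LastEvaluatedKey" not in response:
            return items
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]

emails = ["a@example.com", "b@example.com"]  # ~1,300 addresses in practice
with ThreadPoolExecutor(max_workers=20) as pool:
    results = dict(zip(emails, pool.map(records_for_email, emails)))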
If the email is not in the PK, then running a scan on the GSI and returning only the keys (a KEYS_ONLY projection) can be OK, depending on your table size and how often you run the task. If you have 10 million records with a 1 KB average record size in the GSI, this will cost roughly $0.30 USD each time it is run. You can run a parallel scan to make it run faster. You can judge whether the time/money trade-off makes sense versus another solution that takes more engineering effort, such as exporting to S3.
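And a minimal sketch of the parallel-scan alternative, filtering by email on the client side (again, the table, index, and attribute names are assumptions, it assumes the email attribute is available on the scanned items, and every scanned item is still paid for):

import boto3
from concurrent.futures import ThreadPoolExecutor

dynamodb = boto3.client("dynamodb")
SEGMENTS = 8  # number of scan workers running in parallel

def scan_segment(segment, wanted_emails):
    # Scan one segment of the GSI, keeping only items whose email matches.
    matches = []
    kwargs = {
        "TableName": "records",
        "IndexName": "EmailIndex",
        "Segment": segment,
        "TotalSegments": SEGMENTS,
    }
    while True:
        page = dynamodb.scan(**kwargs)
        matches += [i for i in page["Items"] if i["email"]["S"] in wanted_emails]
        if "LastEvaluatedKey" not in page:
            return matches
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

wanted = {"a@example.com", "b@example.com"}
with ThreadPoolExecutor(max_workers=SEGMENTS) as pool:
    segments = pool.map(scan_segment, range(SEGMENTS), [wanted] * SEGMENTS)
    results = [item for seg in segments for item in seg]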

Delete items in Cosmos DB with spare RUs

We use Cosmos DB to track all our devices, and data that is related to a device (but not stored in the device document itself) is stored in the same container with the same partition key.
Both the device document and the related documents have /deviceId as the partition key. When a device is removed, I remove the device document. I actually want to remove the entire partition, but this doesn't seem to be possible, so I fall back to a query that finds all items with this partition key and removes them from the database.
This works fine, but it may consume a lot of RUs if there is a lot of related data (which is true in some cases). I would rather just remove the device and schedule all related data for removal later (it doesn't hurt to have it in the database for a while); when RU utilization is low, I would start removing these items. Is there a standard solution for this?
The best solution would be to schedule this and have Cosmos DB process these commands when it has spare RUs, just like TTL deletion. Is this even possible?
A feature is now in preview to delete all items by partition key, using a fire-and-forget background processing model with a limited amount of available throughput. There's a signup link on the feature request page to get access to the preview.
Currently, the API looks like a new DeleteAllItemsByPartitionKey method in the SDK.
It definitely is possible to set a TTL and then let Cosmos handle expiring data out of the container when it is idle. However, the cost of updating the document in the first place is about the same as the cost of deleting it anyway, so you're not gaining much.
An approach along the lines you suggest may be to have a separate container (or even a queue) into which you insert a new item with the deviceId to retire. Then, in the evenings or at a time when you know the system is idle, run a job that reads the next deviceId from the queue, queries for all the items with that partition key, and then either deletes the data or sets a TTL to expire it, as sketched below.
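A rough sketch of that cleanup job with the azure-cosmos Python SDK; the database and container names, the queue layout, and the 60-second TTL are all assumptions, and TTL must be enabled on the data container for the expiry to take effect:

from azure.cosmos import CosmosClient

client = CosmosClient(url="<account-uri>", credential="<key>")
db = client.get_database_client("devices-db")
data = db.get_container_client("device-data")         # main container, partitioned on /deviceId
retired = db.get_container_client("retired-devices")  # queue of deviceIds awaiting cleanup

def retire_next_device():
    # Take one queued deviceId and mark every item in its partition for TTL expiry.
    queued = list(retired.query_items(
        query="SELECT * FROM c", max_item_count=1, enable_cross_partition_query=True))
    if not queued:
        return
    entry = queued[0]
    device_id = entry["deviceId"]
    for item in data.query_items(query="SELECT * FROM c", partition_key=device_id):
        item["ttl"] = 60  # seconds; Cosmos expires the item in the background
        data.replace_item(item=item["id"], body=item)
    retired.delete_item(item=entry["id"], partition_key=device_id)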
There is a feature to delete an entire partition in the works that would be perfect for this scenario (in fact, it's designed for it) but no ETA on availability.

DynamoDB partition key design with On-Demand

How much do I need to care about partition key design with DynamoDB On-Demand and Adaptive Capacity? What would happen if I tried to write to single partition key 40,000 times in one second? Does the per-partition write request unit cap of 1,000 still exist such that it would throttle those 40,000 requests, or is there some magic that boosts that single partition temporarily up to the table limit?
It's not an arbitrary question, as I'd like to use incrementing integers for all our entities in DynamoDB via the method suggested within this SO post, but that would require maintaining the latest id for an entity on a single partition key. Every new item created would get their ID by writing to that partition key and inspecting the new value returned in the response. If I were writing something like a chat app and using this method to get the new ID for each message, would my app only be able to create 1,000 new messages a second?
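For reference, the counter technique described in the question usually looks something like this in boto3 (the table and attribute names here are assumptions); every new message performs one UpdateItem against the same counter item, which is why it is bounded by that single item's partition throughput:

import boto3

dynamodb = boto3.client("dynamodb")

def next_message_id(chat_id):
    # Atomically increment the counter item for this chat and return the new value.
    response = dynamodb.update_item(
        TableName="counters",
        Key={"counter_name": {"S": f"chat#{chat_id}"}},
        UpdateExpression="ADD last_id :one",
        ExpressionAttributeValues={":one": {"N": "1"}},
        ReturnValues="UPDATED_NEW",
    )
    return int(response["Attributes"]["last_id"]["N"])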

Azure CosmosDB - partition strategy for dictionary-like object collections

We need to move a huge amount of data out of our memory cache, as it takes up too much space. For that purpose, we are considering Cosmos DB. While testing it I hit a few issues I can't solve: single item retrieval takes too long (around 2 seconds), transactions seem to cost more RUs than they should, and I can't decide on the optimal throughput. The data structure and use cases are provided at the bottom.
So, I have these questions:
How should partitioning be handled with the provided data structure? And would it even have an effect?
General throughput during the week should be low (a few hundred requests per second), but we anticipate occasional spikes of requests (dozens of times more). How can we configure the container to avoid the risk of throttling without overpaying when usage is low?
Should I consider an alternative?
[
  {
    id: '<unique_id>',
    hash: '<custom_hash>',
    data: [{}, {}, ...]
  },
  ...
]
There are three use cases for the collection:
Read the whole collection, taking the ids and hashes to identify which items have changed
Replace/insert a batch of items if there are changes
Read a single item, retrieving the values of its data property

Model daily game ranking in DynamoDB

I have a question. I'm pretty new to DynamoDB, but I have been working on large-scale aggregation on SQL databases for a long time.
Suppose you have a table called GamePoints (PlayerId, GameId, Points) and would like to create a ranking table Rankings (PlayerId, Points) sorted by points.
This table needs to be updated on an hourly basis but keeping the previous version of its contents is not required. Just the current Rankings.
The query will always be: give me the ranking table (with paging).
The GamePoints table will get very very large over time.
Questions:
Is this the best-practice schema for DynamoDB?
How would you do this kind of aggregation?
Thanks
You can enable a DynamoDB Stream on the GamePoints table and read stream records from it to maintain materialized views, including aggregations like the Rankings table. Set StreamViewType=NEW_IMAGE on your GamePoints table, and set up a Lambda function to consume stream records from the stream and update the points per player using atomic counters (UpdateItem with the player ID as the hash key, UpdateExpression="ADD Points :stream_record_points", ExpressionAttributeValues={":stream_record_points": <the value from the stream record>}). As the hash key of the Rankings table would still be the player ID, you could do a full table scan of the Rankings table every hour to get the n highest players, or fetch all the players and sort.
However, considering the size of fields (player_id and number of points probably do not take more than 100 bytes), an in memory cache updated by a Lambda function could equally well be used to track the descending order list of players and their total number of points in real time. Finally, if your application requires stateful processing of Stream records, you could use the Kinesis Client Library combined with the DynamoDB Streams Kinesis Adapter on your application server to achieve the same effect as subscribing a Lambda function to the Stream of the GamePoints table.
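A hedged sketch of such a Lambda handler (the Rankings table name and attribute names are assumptions, and the function is assumed to be subscribed to the GamePoints stream with the NEW_IMAGE view type):

import boto3

dynamodb = boto3.client("dynamodb")

def handler(event, context):
    # For each newly inserted GamePoints record, add its points to the player's total.
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        image = record["dynamodb"]["NewImage"]  # attributes arrive in DynamoDB JSON form
        dynamodb.update_item(
            TableName="Rankings",
            Key={"PlayerId": image["PlayerId"]},
            UpdateExpression="ADD Points :p",
            ExpressionAttributeValues={":p": image["Points"]},
        )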
An easy way to do this is by using DynamoDB's hash key and sort key. For example, the hash key is the GameId and the sort key is the Score. You then query the table with a descending sort and a limit to get the real-time top players in O(1).
To get the rank of a given player, you can use the same technique: you get the top 1000 scores in O(1) and then use binary search to find the player's rank amongst those top 1000 scores in O(log n) on your application server.
If the player is not within the top 1000, you can simply report their rank as 1000+. You can also, obviously, change 1000 to a greater number (100,000 for example).
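A small sketch of that leaderboard query with boto3 (the table name is an assumption; GameId as hash key and Score as sort key follow the answer above):

import boto3

dynamodb = boto3.client("dynamodb")

response = dynamodb.query(
    TableName="Rankings",
    KeyConditionExpression="GameId = :g",
    ExpressionAttributeValues={":g": {"S": "game-123"}},
    ScanIndexForward=False,  # highest Score first
    Limit=10,                # top 10 players
)
top_players = response["Items"]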
Hope this helps.
Henri
PutItem can be helpful for implementing the persistence logic according to your use case:
PutItem creates a new item, or replaces an old item with a new item. If an item that has the same primary key as the new item already exists in the specified table, the new item completely replaces the existing item. You can perform a conditional put operation (add a new item if one with the specified primary key doesn't exist), or replace an existing item if it has certain attribute values.
Source: http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_PutItem.html
In terms of querying the data, if you know for sure that you are going to be reading the entire Rankings table, I would suggest doing it through several read operations with the minimum acceptable page size, so you can make the best use of your provisioned throughput. See the guidelines below for more details:
Instead of using a large Scan operation, you can use the following techniques to minimize the impact of a scan on a table's provisioned throughput.
Reduce Page Size
Because a Scan operation reads an entire page (by default, 1 MB), you can reduce the impact of the scan operation by setting a smaller page size. The Scan operation provides a Limit parameter that you can use to set the page size for your request. Each Scan or Query request that has a smaller page size uses fewer read operations and creates a "pause" between each request. For example, if each item is 4 KB and you set the page size to 40 items, then a Query request would consume only 40 strongly consistent read operations or 20 eventually consistent read operations. A larger number of smaller Scan or Query operations would allow your other critical requests to succeed without throttling.
Isolate Scan Operations
DynamoDB is designed for easy scalability. As a result, an application can create tables for distinct purposes, possibly even duplicating content across several tables. You want to perform scans on a table that is not taking "mission-critical" traffic. Some applications handle this load by rotating traffic hourly between two tables – one for critical traffic, and one for bookkeeping. Other applications can do this by performing every write on two tables: a "mission-critical" table, and a "shadow" table.
SOURCE: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScanGuidelines.html#QueryAndScanGuidelines.BurstsOfActivity
You can also segment your tables by GameId (e.g. Ranking_GameId) to distribute the data more evenly and give you more granularity in terms of provisioned throughput.
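A hedged sketch of that reduced-page-size scan with boto3 (the table name, page size, and pause length are assumptions):

import time
import boto3

dynamodb = boto3.client("dynamodb")

kwargs = {"TableName": "Rankings", "Limit": 40}  # small pages
while True:
    page = dynamodb.scan(**kwargs)
    for item in page["Items"]:
        pass  # process the ranking row here
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    time.sleep(0.1)  # brief pause so other requests can use the throughput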
