This is a question in a certification exam which has me stumped: a DynamoDB table is getting requests from two gaming apps. App A is sending 500,000 requests per second and App B is sending 10,000 requests per second, each request is 20 KB, and users are complaining about ItemCollectionSizeLimitExceededException.
Current design
Primary key: game_name as the partition key and the event identifier (uid) as the sort key
LSI: player_id and event_time
What would be the correct choice? It looks like the LSI is the problem here, but I am not 100% certain.
Choice A: Use the player identifier as the partition key. Use the event time as the sort key. Add a global secondary index with the game name as the partition key and the event time as the sort key.
Choice B: Create one table for each game. Use the player identifier as the partition key. Use the event time as the sort key.
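For concreteness, Choice A's key schema would look roughly like this with the AWS SDK for .NET (the table name, index name, and billing mode are placeholders I added, not part of the question):

using System.Collections.Generic;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

var client = new AmazonDynamoDBClient();

// Choice A: player_id as the partition key, event_time as the sort key,
// plus a GSI keyed on game_name / event_time.
await client.CreateTableAsync(new CreateTableRequest
{
    TableName = "GameEvents",                   // placeholder name
    BillingMode = BillingMode.PAY_PER_REQUEST,  // assumption; the question does not specify capacity mode
    AttributeDefinitions = new List<AttributeDefinition>
    {
        new AttributeDefinition("player_id", ScalarAttributeType.S),
        new AttributeDefinition("event_time", ScalarAttributeType.S),
        new AttributeDefinition("game_name", ScalarAttributeType.S),
    },
    KeySchema = new List<KeySchemaElement>
    {
        new KeySchemaElement("player_id", KeyType.HASH),
        new KeySchemaElement("event_time", KeyType.RANGE),
    },
    GlobalSecondaryIndexes = new List<GlobalSecondaryIndex>
    {
        new GlobalSecondaryIndex
        {
            IndexName = "game_name-event_time-index",
            KeySchema = new List<KeySchemaElement>
            {
                new KeySchemaElement("game_name", KeyType.HASH),
                new KeySchemaElement("event_time", KeyType.RANGE),
            },
            Projection = new Projection { ProjectionType = ProjectionType.ALL },
        },
    },
});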
I have a cross-partition query that returns the rows for each partition in turn, which makes sense: all of partition 1's results, then all of partition 2's, and so on.
For each row returned I need to perform an action, could be a delete or update.
There are too many records to read them all in and then perform the actions, so I need to stream in the results and perform the actions at the same time.
The issue is that I run out of RUs very quickly, because my actions run against each partition in turn and a single partition only has a tenth of the allocated RUs.
I can specify a PartitionKey in the FeedOptions but that does not help me as I don’t know what the key will be.
My query looks like:
select r.* from r where r.deleted
The collection is partitioned on a field called container.
Imagine I have the following items
container|title |deleted
jamjar |jam |true <--- stored in partition 5
jar |pickles |true <--- stored in partition 5
tin |cookies |true <--- stored in partition 8
tub |sweets |true <--- stored in partition 9
I do select r.title from r where r.deleted
my query will return the rows in the following order
jam <--- stored in partition 5
pickles <--- stored in partition 5
cookies <--- stored in partition 8
sweets <--- stored in partition 9
I use an ActionBlock (sketched at the end of this question) to spin up 2 threads to perform my action on each row returned, so I work on jam and pickles, then cookies and sweets, thus consuming RUs from partition 5 while I am carrying out the action on jam and pickles.
I would like the results to be returned as:
jam <--- stored in partition 5
cookies <--- stored in partition 8
sweets <--- stored in partition 9
pickles <--- stored in partition 5
For normal API calls we always know the container; this is a requirement for a bulk and very infrequent delete.
If I knew the number of partitions and could supply the partition number to the query, that would be fine; I would be happy to issue 10 queries and just treat this as 10 separate jobs.
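For reference, the consuming side looks roughly like this (a simplified sketch; the delete/update action is passed in as a delegate rather than shown):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Consumer: two concurrent workers run the delete/update action on rows as the
// cross-partition query streams them back. Because rows arrive partition by
// partition, both workers end up hitting the same physical partition's share
// of the RUs at the same time.
static async Task ProcessRowsAsync<T>(IEnumerable<T> rows, Func<T, Task> deleteOrUpdateAsync)
{
    var worker = new ActionBlock<T>(
        deleteOrUpdateAsync,
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2 });

    foreach (var row in rows)
        worker.Post(row);

    worker.Complete();
    await worker.Completion;
}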
You need to set MaxDegreeOfParallelism, which is part of FeedOptions:
FeedOptions queryOptions = new FeedOptions
{
EnableCrossPartitionQuery = true,
MaxDegreeOfParallelism = 10,
};
It will create a client thread for each partition; you can see what is happening if you inspect the HTTP headers:
x-ms-documentdb-query-enablecrosspartition: True
x-ms-documentdb-query-parallelizecrosspartitionquery: True
x-ms-documentdb-populatequerymetrics: False
x-ms-documentdb-partitionkeyrangeid: QQlvANNcKgA=,3
Notice the QQlvANNcKgA=,3: you see 10 of these, with ,0 through ,9. I suspect the first part is some page tracking and the second part is the partition.
See the docs Parallel query execution
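As a rough sketch of how those options plug into a query with the v2 DocumentClient (the account endpoint, key, database/collection names, and the dynamic item type are placeholders):

using System;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;

var client = new DocumentClient(new Uri("https://myaccount.documents.azure.com"), "authKey");

var queryOptions = new FeedOptions
{
    EnableCrossPartitionQuery = true,
    MaxDegreeOfParallelism = 10,
};

var query = client.CreateDocumentQuery<dynamic>(
        UriFactory.CreateDocumentCollectionUri("mydb", "mycoll"),
        "SELECT r.title FROM r WHERE r.deleted",
        queryOptions)
    .AsDocumentQuery();

while (query.HasMoreResults)
{
    // Each page can contain results from any of the parallel per-partition readers.
    foreach (var item in await query.ExecuteNextAsync<dynamic>())
    {
        // hand each row to the consumer as it streams in
    }
}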
Here's the timeline view of 3 queries in Fiddler:
MaxDegreeOfParallelism = 10: slower and not quite parallel, while the threads and connections are spun up (you can see the 5 extra SSL handshakes in the listing on the left, and a gap before the last 5 requests of the 'green' set in the timeline). There are also 2 (for some reason) requests to get the PK ranges for the collection.
MaxDegreeOfParallelism = 10 (again): almost optimally parallel. The PK range info seems to be cached from the previous request and reused here without making any extraneous requests.
MaxDegreeOfParallelism = 0: completely sequential.
Interestingly, these requests don't specify a x-ms-documentdb-partitionkeyrangeid header.
The query is run against a collection with 6 physical partitions, using DocumentClient v2.x.
Notice also that 7 requests are fired for every query, the 1st one is a 'query plan request' (not parallelizable) while the following 6 return the actual data.
I have a pipeline like this -
table 1(dynamo db) -> aws lambda -> table 2 (dynamo db)
Whenever an update happens in table 1, the Lambda gets triggered. The Lambda does a batch read (1,000 records) from table 1, then performs a batch compute to come up with the list of records that need to be updated in table 2. Table 2 maintains the count of certain events happening in table 1.
The problem is that if we process the same batch of records twice, the count in table 2 is incremented twice.
Why am I considering this? During an outage on one of the Lambda functions (the number of Lambdas running has a 1:1 relation with the number of partitions in DynamoDB), if the function had already performed some of its write operations, the last batch read will be resent.
One way to avoid this would be to store the sequence numbers of the records we have already computed in table 2, so that whenever we update we can check whether a record has already been processed (see the sketch below). But we need to bound the size of that list or we will run into performance issues, and what that size should be is itself an issue.
What would be the right approach to handle this kind of issue?
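To make the sequence-number idea concrete, here is a minimal sketch, assuming a separate dedup table keyed on the stream record's sequence number with a TTL attribute so it does not grow without bound (table and attribute names are placeholders):

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

var client = new AmazonDynamoDBClient();

// Returns true only the first time a given sequence number is seen;
// the counter in table 2 is incremented only in that case.
async Task<bool> TryMarkProcessedAsync(string sequenceNumber)
{
    try
    {
        await client.PutItemAsync(new PutItemRequest
        {
            TableName = "ProcessedRecords",   // placeholder dedup table
            Item = new Dictionary<string, AttributeValue>
            {
                ["seq"] = new AttributeValue { S = sequenceNumber },
                // TTL attribute so old markers expire instead of accumulating forever
                ["expires_at"] = new AttributeValue
                {
                    N = DateTimeOffset.UtcNow.AddDays(1).ToUnixTimeSeconds().ToString()
                },
            },
            ConditionExpression = "attribute_not_exists(seq)",
        });
        return true;
    }
    catch (ConditionalCheckFailedException)
    {
        return false;   // duplicate delivery: skip the count update
    }
}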
I am getting throttled update requests on a DynamoDB table even though there is provisioned capacity to spare.
What could be causing this?
I have a hunch that this must be related to "hot keys" in the table, but I want to get an opinion before going down that rabbit-hole. If that is the problem, any suggestions on tools or processes to help visualize/debug the issue would be appreciated.
Frequent updates to the same hash key but different range key?
i.e. userId + timeStamp
userId = Hash Key
timeStamp = Range Key
e.g.
user1 + 2016-06-23:23:00:01
user1 + 2016-06-23:23:00:02
user1 + 2016-06-23:23:00:03
user1 + 2016-06-23:23:00:04
user1 + 2016-06-23:23:00:05
This causes hotkeys.
There are no non-invasive techniques that I know of. If you have access to the code base to make changes, I suggest logging the hash and range key; this is one way you can determine whether you have hot rows / hot keys.
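A minimal sketch of that kind of logging with the AWS SDK for .NET (the wrapper and the attribute names are hypothetical):

using System;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

// Hypothetical wrapper around UpdateItemAsync that logs the keys of every write,
// so the logs can later be aggregated to spot hash keys that dominate the traffic.
static async Task UpdateWithKeyLoggingAsync(IAmazonDynamoDB client, UpdateItemRequest request)
{
    var hashKey = request.Key["userId"].S;       // placeholder attribute names
    var rangeKey = request.Key["timeStamp"].S;
    Console.WriteLine($"dynamodb-write table={request.TableName} hash={hashKey} range={rangeKey}");
    await client.UpdateItemAsync(request);
}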
Problem
Suppose I have a system of nodes that can communicate with a parent node, but not among each other. Suppose then a file on the parent node is split up into blocks and divided among the children. The file is then deleted from the parent node.
If the parent were to then request the blocks back from the children, how can the original order be reconstructed without retaining a list of all the files on the parent? Additionally, to prevent one of the nodes from maliciously modifying a block, the parent would also have to validate the blocks coming back.
Optimal Solution
A system of naming the blocks of a file, where the list of files can be generated on any node given a seed. Given the list, a parent should be able to use the list somehow to validate the blocks coming back from children.
Attempt #1
So what I have got so far is the ability to minimally store a list of the blocks. I do so by naming the blocks as such:
block_0 = hash(file_contents)
block_n = hash(block_n-1) [hashing the name of the previous file]
This enables the order of the files to be retained by just keeping the seed (name of block_0), and the number of blocks (e.g. 5d41402abc4b2a76b9719d911017c592,5 --> seed,files). However this will not allow the files to be validated independently.
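A minimal sketch of that naming scheme (MD5 only because the example seed above is an MD5 digest; any hash function would work):

using System.Linq;
using System.Security.Cryptography;
using System.Text;

// block_0 is named after the file contents; every later block is named after
// the *name* of the previous block, so only (seed, block count) must be kept.
static string[] BlockNames(byte[] fileContents, int blockCount)
{
    using var md5 = MD5.Create();
    string Hex(byte[] bytes) => string.Concat(bytes.Select(b => b.ToString("x2")));

    var names = new string[blockCount];
    names[0] = Hex(md5.ComputeHash(fileContents));                        // the seed
    for (int n = 1; n < blockCount; n++)
        names[n] = Hex(md5.ComputeHash(Encoding.ASCII.GetBytes(names[n - 1])));
    return names;
}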
Attempt #2
Simply take the hash of each block and store that in a list. However this is not efficient and will result in a large amount of memory allocated to this task alone if a large number of blocks need to be tracked. This will not do.
I'm not sure I've fully understood the problem, but I guess this is a possible solution:
| Distribution:
parent | buffer = [hash(key, id), data[id]]; send(buffer);
nodes | recv(buffer); h_id, data = buffer;
The parent node uses some local key to generate a hashed value (h_id) to the id part of data it is sending, and the local nodes will receive both the resulting h_id and the data itself.
| Reduction:
nodes | buffer = [h_id, data]; send(buffer);
parent | recv(buffer); h_id, data_id = buffer;
On the counter flow, the nodes must send both the original h_id and the data formerly received, otherwise, the following verification will fail:
hash(key, data) == h_id
Since key is only known in the parent node, it would be hard for the local nodes to alter data and h_id in such a way that hash(key, data_id), in the parent node, would still be valid.
Concerning the ordering, you could simply assume that the four initial bytes of data store the number of the partition -- for later reconstruction.
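A sketch of that keyed-hash idea, using an HMAC as the hash(key, data) construction (my choice; the answer does not prescribe a specific function):

using System.Security.Cryptography;

// Parent, before distribution: tag each shard with hash(key, shard).
static byte[] Tag(byte[] key, byte[] shard)
{
    using var hmac = new HMACSHA256(key);
    return hmac.ComputeHash(shard);
}

// Parent, on the way back: accept a shard only if its tag still matches.
static bool Verify(byte[] key, byte[] shard, byte[] tag) =>
    CryptographicOperations.FixedTimeEquals(Tag(key, shard), tag);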
Edit:
I may not have noticed the extra storage you pointed out, but here is what I've tried to propose. Consider four machines, A, B, C, and P, with the initial data:
P{key, data[3]}
____|____
/ | \
A{} B{} C{}
Then, P distributes the data among the machines, sending both the data shard itself, and the generated hash:
P{key, data[3]}
____|____
/ | \
A | C
{data[0], hash(key, data[0])} | {data[2], hash(key, data[2])}
B
{data[1], hash(key, data[1])}
If you assume the first bytes in data[i] store a global index, you're able to rebuild the initial base data[3] in the original order. Also, if you allow each machine to store/receive key, you'll be able to later unhash data[i] and rebuild data[3] on every local node.
Notice that errors can only be introduced in the data shards data[i] and in the received hashes hash(key, data[i]), as you must assume key to be globally valid. The main point here is that the hash(key, data[i]) values are also distributed among the machines, not only the data partitions themselves, i.e., no single machine needs to store a list of all the files.
Considering you can afford to maintain key on every node, or at least to send key to the one node trying to rebuild the original data, here is an example of a reduction step, say, for node B. A and C send their local {data[i], hash(key, data[i])} to node B, and P sends key to B, so this node can unhash the received data:
P{key, data[3]}
|
A | C
{data[0], hash(key, data[0])} | {data[2], hash(key, data[2])}
\ | /
B
{data[1], hash(key, data[1])}
Then, B computes:
/ {data[1], hash(key, data[1])} \ {data[1]}
unhash( {data[0], hash(key, data[0])} ) => {data[0]} => {data[3]}
\ {data[2], hash(key, data[2])} / {data[2]}
Which restores the original data with the correct ordering.