How to fix ItemCollectionSizeLimitExceededException for this scenario?

This is a question on a certification exam that has me stumped: a DynamoDB table is getting requests from two gaming apps. App A is sending 500,000 requests per second and App B is sending 10,000 requests per second, and each request is 20 KB. Users are complaining about ItemCollectionSizeLimitExceededException.
Current design
Partition key: game_name; sort key: event identifier (uid)
LSI: player_id and event_time
What would be the correct choice? It looks like the LSI is the problem here, but I am not 100% certain.
Choice A: Use the player identifier as the partition key. Use the event time as the sort key. Add a global secondary index with the game name as the partition key and the event time as the sort key.
Choice B: Create one table for each game. Use the player identifier as the partition key. Use the event time as the sort key.

Related

Using a GUID as entity Id vs the entity's "actual" Id

In every cosmos db repository example I've seen, the id/row key has been generated like this: {partitionKey}:{Guid.newGuid()}. I'm working on a web api where the user won't necessarily have any way of knowing what this random GUID is. But they will know the EmployeeId, ProjectId etc. of the respective object, so I'm wondering if there are any issues with using e.g. EmployeeId as both the partition key and id?
There's nothing technically wrong with setting the id and partition key to the same value; however, you will have just one document per partition, and that's bad design IMHO, as all your read queries will be cross-partition queries (e.g. listing all employees).
One approach could be to set the partition key as the type of the entity (Employee, Project etc.) and then set the id as the unique identifier of the entity (employee id, project id etc.).
To be honest, if you know the partition key AND the item id, you can do a point read, which is the fastest.
We used to also take the approach of using random GUIDs for all item IDs, but this means you will always need to know both this id and the partition key. Sometimes a more functional key as the item ID makes more sense, so give it some thought!
And remember, an item ID is not globally unique; uniqueness only applies within a partition key value.
So you could have two items with the same item ID and different partition keys.
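For illustration, a point read in the JavaScript SDK looks roughly like this (a minimal sketch; the hr database, employees container, and the "Employee" partition key value are made-up examples following the type-as-partition-key idea above):

const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT,
  key: process.env.COSMOS_KEY,
});
const container = client.database("hr").container("employees");

async function getEmployee(employeeId) {
  // Point read: known id + known partition key, no query engine involved.
  const { resource } = await container.item(employeeId, "Employee").read();
  return resource;
}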

Recommended Schema for DynamoDB calendar/event like structure

I'm pretty new to DynamoDB design and trying to get the correct schema for my application. In this app different users will enter various attributes about their day. For example "User X, March 1st 12:00-2:00, Tired". There could be multiple entries for a given time, or overlapping times (e.g. tired from 12-2 and eating lunch from 12-1).
I'll need to query based on user and time ranges. Common queries:
Give me all the "actions" for user X between time t1 and t2
Give me all the start times for action Z for user X
My initial thought was that the partition key would be the user id and the range key the start time, but that won't work because of duplicate start times, right?
A second thought:
UserID - Partition Key
StartTime - RangeKey
Action - JSON document of all actions for that start time
[{ action: "Lunch", endTime:"1pm"},{action:tired, endTime:"2pm"}]
Any recommendation on a proper schema?
This doesn't really have one solution; you will need to evaluate multiple options depending on your use case: how much data you have, how often you will query, by which fields, etc.
But one good option is a schema like this.
Generated UUID as partition key
UserID
Start time (in unix epoch time or ISO8601 time format)
Advantages
Can handle multiple time zones
Can easily query by userID and start time (you will need a secondary index with partition key userID and sort key start time; see the sketch below)
More even distribution and fewer hot keys across DynamoDB partitions because of the randomly generated partition key.
Disadvantages
More data for every item (because of UUID) (+16 bytes)
Additional cost for the new secondary index; note that scanning the data in a table is generally much more expensive than maintaining a secondary index.
This is pretty close to your initial thought. To give a more precise answer we would need a lot more information about how many writes and reads you are planning and what kind of queries you will need.
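For the secondary-index query mentioned under Advantages, a minimal sketch with the AWS SDK for JavaScript (the table name, index name, and attribute names are assumptions):

var AWS = require("aws-sdk");
var doc = new AWS.DynamoDB.DocumentClient();

doc.query({
  TableName: "user_actions",
  IndexName: "userId-startTime-index",
  // "All actions for user X between t1 and t2"
  KeyConditionExpression: "userId = :u AND startTime BETWEEN :t1 AND :t2",
  ExpressionAttributeValues: {
    ":u": "user-x",
    ":t1": "2017-03-01T12:00:00Z",
    ":t2": "2017-03-01T14:00:00Z",
  },
}, function (err, data) {
  if (err) console.error(err);
  else console.log(data.Items);
});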
You are right that UserID as partition key and StartTime as range key would be the obvious choice, if it weren't for your overlapping activities.
I would consider going for
UserID - Partition Key
StartTime + uuid - RangeKey
StartTime - Plain old attribute
Datetimes in DynamoDB just get stored as strings anyway. So the idea here is that you use StartTime + some uuid as your range key, which gives you a table sortable by datetime while also ensuring unique primary keys. You can then store the StartTime in a separate attribute, or have a function for adding/removing the uuid from the StartTime + uuid attribute.
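A minimal sketch of a write under this schema (the table name and the "#" separator are illustrative choices, not prescribed):

var AWS = require("aws-sdk");
var uuidv4 = require("uuid").v4;
var doc = new AWS.DynamoDB.DocumentClient();

var startTime = "2017-03-01T12:00:00Z";
doc.put({
  TableName: "user_actions",
  Item: {
    userId: "user-x",                        // partition key
    startTimeId: startTime + "#" + uuidv4(), // range key: sortable and unique
    startTime: startTime,                    // plain attribute for easy reads
    action: "Lunch",
    endTime: "2017-03-01T13:00:00Z",
  },
}, function (err) {
  if (err) console.error(err);
});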

DynamoDB NoSQL design for queries

I am looking to store a log of user events. It is going to be a lot of entries so I thought DynamoDB would be good as everything else is hosted there.
I need to query these events in two ways: the total number of events for a user for a date (or date range), and occasionally all the events for a date.
I was thinking to store it in one table as user id (hash key), sequence number (range key), date, time and duration.
Should it be multiple tables? How can this be done most efficient?
For a small amount of data this structure is ok.
Keep in mind that the sequence number (your range key) has to be provided by you. It seems a good idea to choose the date as a unix timestamp with millisecond accuracy as the sort key.
There is no need for extra tables.
However, your structure depends largely on the read/write capacity that you want to achieve and on the data size.
Supposing your user_id is your partition key.
For every distinct partition key value, the total sizes of all table and index items cannot exceed 10 GB.
A single partition can support a maximum of 3,000 read capacity units or 1,000 write capacity units.
You need to create your partition keys by taking into consideration these limitations.
For example, a very active user has many events, so you may need more than 1,000 write capacity units; unfortunately, you have chosen the user id as your partition key.
In this case you are limited to 1,000 write capacity units per user, and therefore you might see throttled writes.
You need a different structure: for example, partition key values like
user_id_1, user_id_2, etc., i.e. a partition naming mechanism that spreads the data across partitions according to your application's needs, as sketched below.
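A minimal sketch of that sharding mechanism (the shard count and key format are illustrative assumptions):

// Append a random suffix to spread one user's writes across N partition
// key values. Pick NUM_SHARDS based on the hottest user's write rate.
var NUM_SHARDS = 10;

function shardedPartitionKey(userId) {
  var shard = Math.floor(Math.random() * NUM_SHARDS);
  return userId + "_" + shard; // e.g. "user-x_7"
}

// The trade-off: to read all of one user's events you must query every
// shard ("user-x_0" .. "user-x_9") and merge the results client-side.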
Check these links on DynamoDB limitations: Tables guidance, Partition distribution.
I would suggest the following structure for your events table:
user id -- hash key
event date/time (timestamp with milliseconds) -- range key
duration
Having event timestamp as a range key should be sufficient to provide uniqueness for an event (unless a user can have multiple events right in the same millisecond), so you don't need a sequence number.
Having such a schema, you can get all events for a user for a date by using simple query.
Unfortunately, DynamoDB does not support aggregate queries, so you can't get the total number of events for a user quickly (you would have to query all matching records and count them yourself).
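Counting on the fly would look roughly like this (a sketch; the table and attribute names are assumptions, and Select: "COUNT" avoids returning the items but still consumes read capacity for every matching record):

var AWS = require("aws-sdk");
var doc = new AWS.DynamoDB.DocumentClient();

doc.query({
  TableName: "user_events",
  KeyConditionExpression: "userId = :u AND eventTime BETWEEN :from AND :to",
  ExpressionAttributeValues: {
    ":u": "65716110-f4df-11e6-bc64-92361f002671",
    ":from": 1487289600000, // 2017-02-17T00:00:00.000Z in epoch milliseconds
    ":to": 1487375999999,   // 2017-02-17T23:59:59.999Z
  },
  Select: "COUNT", // return only the count, not the items
}, function (err, data) {
  if (err) console.error(err);
  else console.log("events:", data.Count); // per-page count if paginated
});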
So I would suggest creating a separate table for user events statistics like this:
user id -- hash key
date -- range key
events_cnt (total number of events for a user for a date)
So, after you add a new record to your events table, you have to increment the events counter for the user in the statistics table, as shown below:
var AWS = require("aws-sdk");
var dynamodbDoc = new AWS.DynamoDB.DocumentClient();
var params = {
  TableName: "user_events_stats",
  Key: {
    userId: "65716110-f4df-11e6-bc64-92361f002671",
    date: "2017-02-17",
  },
  // Initialize the counter to 0 on the first write, then add 1 atomically.
  UpdateExpression: "SET #events_cnt = if_not_exists(#events_cnt, :zero) + :one",
  ExpressionAttributeNames: {
    "#events_cnt": "events_cnt",
  },
  ExpressionAttributeValues: {
    ":one": 1,
    ":zero": 0,
  },
};
dynamodbDoc.update(params, function (err, data) {
  if (err) console.error(err); // handle a failed increment
});

NoSQL Structuring of Data

Coming from a relational database background, I sometimes find it a challenge to structure my NoSQL databases the right way (yes, I realize that sounds silly). I work with DynamoDB.
If I have 3 entities - a user, a report and a building and many users can submit many reports on a building, would the following structure be acceptable?
User - index on userId
Building - index on buildingId
Report - index on reportId, userId and buildingId
Or do I need a fourth table to keep track of reports submitted by users? My points of concern are performance, throughput and storage space.
When using DynamoDB, global secondary indexes provide alternative ways to query data in a table.
Based on the tables you have described, here is a structure that may work:
User Table
Hash Key: userId
Building Table
Hash Key: buildingId
Report Table
Hash Key: reportId
ReportUser GSI
Hash Key: userId
BuildingUser GSI
Hash Key: buildingId
The key to the above design is the pair of global secondary indexes on the Report table. Unlike the hash key (and optional range key) on the main table, the hash key (and optional range key) on a GSI do not have to be unique. This means you can query all of the reports submitted by a specific userId, or all of the reports for a specific buildingId.
In real life these GSIs would probably want to include a Range key (such as date) to allow for ordering of the records when they are queried.
The other thing to remember about GSIs is that you need to choose which attributes are projected (i.e. retrievable through the index), as a GSI is actually a physical copy of the data. This also means a GSI is updated asynchronously, so reads from it are always eventually consistent.
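As a concrete sketch, the Report table with those two GSIs could be created like this (attribute types, projection choices, and throughput values here are assumptions):

var AWS = require("aws-sdk");
var dynamodb = new AWS.DynamoDB();

dynamodb.createTable({
  TableName: "Report",
  AttributeDefinitions: [
    { AttributeName: "reportId", AttributeType: "S" },
    { AttributeName: "userId", AttributeType: "S" },
    { AttributeName: "buildingId", AttributeType: "S" },
  ],
  KeySchema: [{ AttributeName: "reportId", KeyType: "HASH" }],
  GlobalSecondaryIndexes: [
    {
      IndexName: "ReportUser",
      KeySchema: [{ AttributeName: "userId", KeyType: "HASH" }],
      Projection: { ProjectionType: "ALL" }, // copy every attribute into the index
      ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 },
    },
    {
      IndexName: "BuildingUser",
      KeySchema: [{ AttributeName: "buildingId", KeyType: "HASH" }],
      Projection: { ProjectionType: "KEYS_ONLY" }, // or project specific attributes
      ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 },
    },
  ],
  ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 },
}, function (err, data) {
  if (err) console.error(err);
});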

What's the proper way to implement a unique secondary index on a DynamoDB table that has no range key?

I'm a bit confused by how to properly set up secondary indexes in DynamoDB.
The documentation states that secondary indexes are for tables which have a hash and a range key, but in my case I have no need of the range key.
The scenario is basically like this: I have a list of mobile clients which will call into my API. Those clients are identified by a 6-character unique client ID. Each client also has a unique device ID, which is basically a long GUID, quite long and inconvenient to use as the primary key.
The question comes when a client registers itself: it sends its device ID (the long GUID) in a registration request, and the server generates the unique client ID (the six-char unique ID), which it returns to the client for future communication. One of the checks the server side must do is make sure the request is not a duplicate registration, i.e. that the device ID is not already present in the table under another client ID.
In a SQL table, I would have the clientID be the primary key and I'd just define a unique index on the deviceID field, but it seems like I can't do that in DynamoDB, since I only have a hash key on the table, not a hash and range key. I could do a query to find out if there's a dupe deviceID somewhere, but that would seem to require a table scan, which I'd like to avoid.
What's the proper way to set up something like this in DynamoDB? Do I just use a dummy range key like "foo" on all my rows and use a local secondary index? Seems inefficient somehow.
I personally don't like to use indexes.
What I recommend is to keep two tables.
DEVICES
Hash: device_id
attribute: client_id
CLIENT_DEVICES
Hash: client_id
Range: device_id
This allows you to reason about whether a client has devices and which ones, as well as to check whether a given device is already attached to a client.
This IMO is more readable than global/local secondary indexes.
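A minimal sketch of the duplicate-registration check with these two tables; the ConditionExpression makes the put on DEVICES fail atomically if the device is already registered (generateClientId is a hypothetical stand-in for however the server mints the six-character ID):

var AWS = require("aws-sdk");
var doc = new AWS.DynamoDB.DocumentClient();

// Hypothetical stand-in for the server's real 6-char client ID generator.
function generateClientId() {
  return Math.random().toString(36).slice(2, 8).toUpperCase();
}

function registerDevice(deviceId, callback) {
  var clientId = generateClientId();
  doc.put({
    TableName: "DEVICES",
    Item: { device_id: deviceId, client_id: clientId },
    // Rejects the write if this device_id already exists, so a duplicate
    // registration surfaces as a ConditionalCheckFailedException.
    ConditionExpression: "attribute_not_exists(device_id)",
  }, function (err) {
    if (err) return callback(err);
    doc.put({
      TableName: "CLIENT_DEVICES",
      Item: { client_id: clientId, device_id: deviceId },
    }, function (err2) {
      callback(err2, clientId);
    });
  });
}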
