DynamoDB NoSQL design for queries

I am looking to store a log of user events. There will be a lot of entries, so I thought DynamoDB would be a good fit, as everything else is hosted there.
I need to query these events in two ways: the total number of events for a user for a date (or date range), and occasionally all the events for a date.
I was thinking of storing it in one table as user id (key), sequence number (key), date, time and duration.
Should it be multiple tables? How can this be done most efficiently?

For a small amount of data this structure is ok.
Keep in mind that the sequence number (your range key) has to be provided by you. Choosing the date as a Unix timestamp with millisecond accuracy seems a good idea for the sort key.
There is no need for extra tables.
However, your structure depends largely on the read/write capacity that you want to achieve and on the data size.
Supposing your user_id is your partition key.
For every distinct partition key value, the total sizes of all table and index items cannot exceed 10 GB.
A single partition can support a maximum of 3,000 read capacity units or 1,000 write capacity units.
You need to create your partition keys by taking into consideration these limitations.
For example, a very active user has many events and thus needs more than 1,000 write capacity units. Unfortunately, you have chosen the user id as the partition key.
In this case you are limited to 1,000 write capacity units per user, so you might see throttled writes.
You need a different structure: for example, partition key values like user_id_1, user_id_2, etc., i.e. a partition naming mechanism that spreads the data across partitions according to your application's needs.
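As a rough sketch (the shard count and key format here are assumptions, not from the original answer), write sharding can be as simple as appending a random suffix on write and querying every suffix on read:

var NUM_SHARDS = 10; // tuned to your write volume

// On write: pick a random shard so no single partition absorbs all the writes.
function shardedKey(userId) {
  var shard = Math.floor(Math.random() * NUM_SHARDS) + 1;
  return userId + "_" + shard; // e.g. "user_id_3"
}

// On read: totals for a user require querying every shard and merging the results.
function allShardKeys(userId) {
  var keys = [];
  for (var i = 1; i <= NUM_SHARDS; i++) {
    keys.push(userId + "_" + i);
  }
  return keys;
}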
Check these links on DynamoDB limitations:
Tables guidance
Partition distribution

I would suggest the following structure for your events table:
user id -- hash key
event date/time (timestamp with milliseconds) -- range key
duration
Having event timestamp as a range key should be sufficient to provide uniqueness for an event (unless a user can have multiple events right in the same millisecond), so you don't need a sequence number.
With such a schema, you can get all events for a user for a date using a simple Query.
Unfortunately, DynamoDB does not support aggregate queries, so you can't get the total number of events for a user quickly (you would have to query all the records and compute the total manually).
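A sketch of that query, counting client-side (the table and attribute names are assumptions):

var AWS = require("aws-sdk");
var doc = new AWS.DynamoDB.DocumentClient();

// All events for one user on 2017-02-17, with timestamps stored in milliseconds.
doc.query({
  TableName: "user_events", // assumed table name
  KeyConditionExpression: "userId = :u AND eventTime BETWEEN :from AND :to",
  ExpressionAttributeValues: {
    ":u": "65716110-f4df-11e6-bc64-92361f002671",
    ":from": new Date("2017-02-17T00:00:00Z").getTime(),
    ":to": new Date("2017-02-17T23:59:59.999Z").getTime(),
  },
}, function (err, data) {
  if (err) return console.error(err);
  // No aggregates in DynamoDB, so the total is computed client-side.
  // (For large result sets you would also page through data.LastEvaluatedKey.)
  console.log("events:", data.Items.length);
});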
So I would suggest creating a separate table for user events statistics like this:
user id -- hash key
date -- range key
events_cnt (total number of events for a user for a date)
So, after you add a new record to your events table, you have to increment the events counter for the user in the statistics table, as shown below:
var AWS = require("aws-sdk");
var dynamodbDoc = new AWS.DynamoDB.DocumentClient();

var params = {
  TableName: "user_events_stats",
  Key: {
    userId: "65716110-f4df-11e6-bc64-92361f002671",
    date: "2017-02-17",
  },
  // Initialize the counter to 0 on first use, then increment it atomically.
  UpdateExpression: "SET #events_cnt = if_not_exists(#events_cnt, :zero) + :one",
  ExpressionAttributeNames: {
    "#events_cnt": "events_cnt",
  },
  ExpressionAttributeValues: {
    ":one": 1,
    ":zero": 0,
  },
};

dynamodbDoc.update(params, function (err, data) {
  if (err) console.error("Counter update failed:", err);
});

Related

Is using "Current Date" a good partition key for data that will be queried by date and id?

I'm new to Azure Cosmos DB and I have this new project where I decided to give it a go.
My DB has only one collection, where around 6,000 new items are added every day, and each looks like this:
{
"Result": "Pass",
"date": "23-Sep-2021",
"id": "user1#example.com"
}
The date is the partition key; it is the date on which the item was added to the collection, and the same id can be added again every day, as follows:
{
"Result": "Fail",
"date": "24-Sep-2021",
"id": "user1#example.com"
}
The application that uses this DB will query by id and date to retrieve the Result.
I read some Azure Cosmos DB documentation and found that selecting the partition key carefully can improve the performance of the database and the RUs used for each request.
I tried running this query and it consumed 2.9 RUs, with about 23,000 items in the collection:
SELECT * FROM c
WHERE c.id = 'user1#example.com' AND c.date = '24-Sep-2021'
Here are my questions:
Is using date a good partition key for my scenario? Any room for improvement?
Will the consumed RUs per request increase over time as the number of items in the collection grows?
Thanks.
For a write-heavy workload, using date as a partition key is a bad choice because you will always have a hot partition on the current date. However, if the amount of data being written is consistent and the write volume is low, it can work, and you will get a good distribution of data across storage.
In read-heavy scenarios, date can be a good partition key if it is used to answer most of the queries in the app.
The value for id must be unique per partition key value, so for your data model to work you can only have one id value per day.
If that is the case for your app, you can make one additional optimization and replace your query with a point read, ReadItemAsync(). This takes the partition key value and the id. It is the fastest and most efficient way to read data because it does not go through the query engine and reads directly from the backend data store. A point read for 1 KB of data or less always costs 1 RU.
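ReadItemAsync() is the .NET method; in the JavaScript SDK (@azure/cosmos) the equivalent point read might look like the sketch below, where the endpoint, key, and database/container names are placeholders:

const { CosmosClient } = require("@azure/cosmos");

// Placeholder connection details; substitute your own.
const client = new CosmosClient({
  endpoint: "https://your-account.documents.azure.com",
  key: "your-key",
});
const container = client.database("your-db").container("your-collection");

// Point read: id plus partition key value, no query engine involved.
async function getResult() {
  const { resource } = await container
    .item("user1#example.com", "24-Sep-2021")
    .read();
  console.log(resource.Result); // "Fail"
}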

How to query and order on two separate sort keys in DynamoDB?

GROUPS
userID: string
groupID: string
lastActive: number
birthday: number
Assume I have a DynamoDB table called GROUPS which stores items with these attributes. The table records which users are joined to which groups. Users can be in multiple groups at the same time. Therefore, the composite primary key would most commonly be:
partition key: userID
sort key: groupID
However, if I wanted to query for all users in a specific group, within a specific birthday range, sorted by lastActive, is this possible and if so what index would I need to create?
Could I synthesize lastActive and userID to create a synthetic sort key, like so:
GROUPS
groupID: string
lastActiveUserID: string (e.g. "20201230T09:45:59-abc123")
birthday: number
Which would make for a different composite primary key where the partition key is groupID and the sort key is lastActiveUserID, which would sort the participants by when they were last active, and then a secondary index to filter by birthday?
As written, no, this isn't possible.
"within a specific birthday range" implies sk_birthday BETWEEN :start AND :end.
"sorted by lastActive" implies lastActive as a sort key.
These are mutually exclusive; I can't devise a sort key that could contain both values in a usable format.
You could have a Global Secondary Index with a hash key of group-id and lastActive as the sort key, then filter on birthday. But a filter only affects the data returned; it doesn't reduce the data read nor the cost of reading it. Additionally, since DDB reads at most 1MB of data at a time, you'd have to call it repeatedly in a loop if a given group might have more than 1MB worth of members.
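A sketch of that query (the GSI name here is an assumption), keeping in mind the filter is applied after the read:

var AWS = require("aws-sdk");
var doc = new AWS.DynamoDB.DocumentClient();

doc.query({
  TableName: "GROUPS",
  IndexName: "groupID-lastActive-index", // assumed GSI name
  KeyConditionExpression: "groupID = :g",
  // The filter runs after the read: you still pay for filtered-out items.
  FilterExpression: "birthday BETWEEN :start AND :end",
  ExpressionAttributeValues: {
    ":g": "GROUPA",
    ":start": 19900101,
    ":end": 19991231,
  },
  ScanIndexForward: false, // most recently active first
}, function (err, data) {
  if (err) return console.error(err);
  console.log(data.Items); // one page; loop on data.LastEvaluatedKey for more
});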
Also, when your index has a different partition (hash) key than the table, it is a global secondary index (GSI). If your index has the same partition key but a different sort key than the table, it can be a local secondary index (LSI).
However, for any given query you can only use the table or a single index; you can't use multiple indexes at the same time.
Now, having said all that, what exactly do you mean by "specific birthday range"? If the range in question is a defined period (by month, by week), perhaps you could have a GSI whose hash key is "group-id#birthday-period" and whose sort key is lastActive.
So for instance, "give me GROUPA birthdays for next month"
Query(hs = "GROUPA#NOVEMBER")
But if you wanted November and December, you'd have to make two queries and combine & sort the results yourself.
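A sketch of the single-period query (the GSI and its key names are assumptions):

var AWS = require("aws-sdk");
var doc = new AWS.DynamoDB.DocumentClient();

// One Query per period; November plus December would be two calls,
// merged and sorted client-side.
doc.query({
  TableName: "GROUPS",
  IndexName: "groupPeriod-lastActive-index", // assumed GSI name
  KeyConditionExpression: "groupPeriod = :gp",
  ExpressionAttributeValues: { ":gp": "GROUPA#NOVEMBER" },
}, function (err, data) {
  if (err) return console.error(err);
  console.log(data.Items);
});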
Effective and efficient use of DDB means avoiding Scan() and avoiding the use of filterExpressions that you know will throw away lots of the data read.

Querying for a range of dates, when date is the only key

I want to use DynamoDB to store historical stock closing values.
My store will have a few stocks, and grow to include more as requirements change.
I figured I'd have a single table where the only key is "DATE", formatted as YYYY-MM-DD.
This means that each item in the table will have a date key and several attributes of the form { TICKER = CLOSING_VALUE }
Queries for a given date will also filter by a subset of desired stock tickers, e.g. ["INTC", "AAPL"].
I am a bit confused, since this single key would have to work as both partition and sort key.
How should I query to retrieve a subset of stock tickers for a given date range ?
Update:
I'm creating the table with...
{
  AttributeDefinitions: [
    {
      AttributeName: "Date",
      AttributeType: "S"
    }
  ],
  TableName: "Historic",
  KeySchema: [
    {
      AttributeName: "Date",
      KeyType: "HASH"
    }
  ]
}
And the query:
{
  table_name: "Historic",
  projection_expression: "USD,CAD",
  filter_expression: "#k between :val1 and :val2",
  expression_attribute_names: { "#k" => "Date" },
  expression_attribute_values: {
    ":val1" => "2019-12-01",
    ":val2" => "2020-01-10"
  }
}
And I get an error:
Aws::DynamoDB::Errors::ValidationException: Either the KeyConditions or KeyConditionExpression parameter must be specified in the request.
You cannot sort by - or efficiently retrieve a range of - the partition key, you can only sort by the sort key. To understand why, you need to understand how DynamoDB stores its data.
The "partition key" is also called in the CreateTable operation a "hash key" - and indeed it works like a key in a hash table: DynamoDB runs a hash function on this key, and using the resulting number, decides which node(s) of its large cluster should hold this partition. This approach allows distributing the table across the cluster, but it makes it impossible to efficiently retrieve the different partitions ordered by their key. The "Scan" operation will return the partitions in seemingly-random order (they are likely to be sorted by the hash function of their key), and it's impossible to efficiently scan just a range of partition keys. It's possible to do this inefficiently - by scanning the entire table and filtering just the partitions you want. If I understand correctly, this is what you were trying to do. But this only makes sense for tiny databases - would that be your case?
As you noticed, the other component of the key is the "sort key". Inside a partition, on one node, the different items of that partition are kept sequentially sorted by the "sort key" order. This allows DynamoDB to efficiently retrieve them sorted in this order, or to efficiently retrieve only a range of these sort keys - the Query request can do both of these things.
So to achieve what you want, you need the date to be the sort key, not the partition key. How to do the rest of the data modeling depends on what your typical queries look like:
If you have a large number of stocks but a typical query only asks for a handful of them, the most reasonable approach is to use the stock name as the partition key and, as I said, the date as the sort key. This will allow you to efficiently Query a date range for one particular stock - and if you need 3 different stocks, you'll need to do 3 Querys (you can and should do them in parallel!), but each of these queries will be efficient and you'll only be paying for the actual data you retrieve, without any post-filtering.
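A sketch of such a per-stock range Query (the ticker attribute name is an assumption; note that Date is a DynamoDB reserved word and needs aliasing):

var AWS = require("aws-sdk");
var doc = new AWS.DynamoDB.DocumentClient();

// ticker is the partition key, Date the sort key: an efficient range Query.
doc.query({
  TableName: "Historic",
  KeyConditionExpression: "ticker = :t AND #d BETWEEN :from AND :to",
  ExpressionAttributeNames: { "#d": "Date" },
  ExpressionAttributeValues: {
    ":t": "INTC",
    ":from": "2019-12-01",
    ":to": "2020-01-10",
  },
}, function (err, data) {
  if (err) return console.error(err);
  console.log(data.Items);
});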
If there is a huge number of different dates (e.g., you keep data at 1-second resolution), your partitions can grow huge, and for various reasons this is not recommended. In such a case, you can split each partition into multiple partitions by some coarse time window. For example, instead of having one huge partition for stock "GOOG", have one partition "GOOG Nov 2019", one "GOOG Dec 2019", etc. When you query a small date range, you'll know which specific partition to read from. But when the query spans more than one month, you'll need to query several of these partitions. Note that very large queries will read (and return) huge amounts of data and so will be very expensive; you're only likely to want to do this in large analytic jobs.
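A minimal sketch of that bucketing (the "TICKER#YYYY-MM" key format is an assumption):

// Enumerate the month-bucket partition keys covering a date range;
// a multi-month range then needs one Query per bucket.
function monthBuckets(ticker, from, to) { // from/to as "YYYY-MM"
  var buckets = [];
  var y = parseInt(from.slice(0, 4), 10);
  var m = parseInt(from.slice(5, 7), 10);
  var endY = parseInt(to.slice(0, 4), 10);
  var endM = parseInt(to.slice(5, 7), 10);
  while (y < endY || (y === endY && m <= endM)) {
    buckets.push(ticker + "#" + y + "-" + (m < 10 ? "0" + m : m));
    m++;
    if (m > 12) { m = 1; y++; }
  }
  return buckets;
}

// monthBuckets("GOOG", "2019-11", "2020-01")
// => ["GOOG#2019-11", "GOOG#2019-12", "GOOG#2020-01"]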

Recommended Schema for DynamoDB calendar/event like structure

I'm pretty new to DynamoDB design and trying to get the correct schema for my application. In this app different users will enter various attributes about their day. For example "User X, March 1st 12:00-2:00, Tired". There could be multiple entries for a given time, or overlapping times (e.g. tired from 12-2 and eating lunch from 12-1).
I'll need to query based on user and time ranges. Common queries:
Give me all the "actions" for user X between time t1 and t2
Give me all the start times for action Z for user X
My initial thought was that the partition key would be userid and the range key the start time, but that won't work because of duplicate start times, right?
A second thought:
UserID - Partition Key
StartTime - RangeKey
Action - JSON document of all actions for that start time
[{ action: "Lunch", endTime: "1pm" }, { action: "Tired", endTime: "2pm" }]
Any recommendation on a proper schema?
There isn't really a single solution to this, and you will need to evaluate multiple options depending on your use case: how much data you have, how often you will query it, by which fields, etc.
But one good solution is to partition your schema like this:
Generated UUID as partition key
UserID
Start time (in unix epoch time or ISO8601 time format)
Advantages
Can handle multiple time zones
Can easily query by userID and start date (you will need a secondary index with partition key userID and sort key start time)
More even distribution and fewer hot keys across DynamoDB partitions, because of the randomly generated partition key
Disadvantages
More data for every item (because of the UUID: +16 bytes)
Additional cost for the new secondary index; note that scanning the data in the table is generally much more expensive than having a secondary index
This is pretty close to your initial thought; to give a more precise answer we would need a lot more information about how many writes and reads you are planning, and what kind of queries you will need.
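As a rough illustration of this layout (the table name and attributes are assumptions):

var AWS = require("aws-sdk");
var { v4: uuidv4 } = require("uuid");
var doc = new AWS.DynamoDB.DocumentClient();

// Random partition key; userID + startTime would live in the secondary index.
doc.put({
  TableName: "day_entries", // assumed table name
  Item: {
    id: uuidv4(),                       // partition key
    userID: "userX",
    startTime: "2021-03-01T12:00:00Z",  // ISO8601 sorts lexicographically
    action: "Tired",
    endTime: "2021-03-01T14:00:00Z",
  },
}, function (err) {
  if (err) console.error(err);
});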
You are right that UserID as partition key and StartTime as range key would be the obvious choice, were it not for your overlapping activities.
I would consider going for
UserID - Partition Key
StartTime + uuid - RangeKey
StartTime - Plain old attribute
Datetimes in DynamoDB are just stored as strings anyway. So the idea here is that you use StartTime plus some uuid as your range key, which gives you a table sortable by datetime while still guaranteeing unique primary keys. You could then store the StartTime in a separate attribute, or have a function for adding/removing the uuid from the StartTime + uuid attribute.
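A small sketch of composing and splitting such a key (the "#" separator is an assumption):

// Compose "2020-12-30T09:45:59Z#abc123" so items sort by time yet stay unique.
function makeRangeKey(startTimeISO, uuid) {
  return startTimeISO + "#" + uuid;
}

// Recover the plain StartTime from the composite range key.
function startTimeFromRangeKey(rangeKey) {
  return rangeKey.split("#")[0];
}

// makeRangeKey("2020-12-30T09:45:59Z", "abc123")
// => "2020-12-30T09:45:59Z#abc123"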

Performing a conditional expression query on GSI in dynamodb

I know the query below is not supported in DynamoDB since you must use an equality expression on the HASH key.
query({
  TableName,
  IndexName,
  KeyConditionExpression: 'purchases >= :p',
  ExpressionAttributeValues: { ':p': 6 }
});
How can I organize my data so I can efficiently make a query for all items purchased >= 6 times?
Right now I only have 3 attributes: orderID (primary key), address, and confirmations (GSI).
Would it be better to use a different type of database for this type of query?
You would probably want to use the DynamoDB Streams feature to perform aggregation into another DynamoDB table. The Streams feature publishes an event for each change to your data, which you can then process with a Lambda function.
I'm assuming that in your primary table you are tracking each purchase by incrementing a counter. A simple version of the logic might be: on each update, check the purchases count for the item, and if it is >= 6, add the item ID to a list attribute (itemIDs or similar) in another DynamoDB table. Depending on how you want to query this statistic, you might create a new entry every day, hour, etc.
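A rough sketch of such a stream-triggered Lambda (the table, attribute, and key names are assumptions):

var AWS = require("aws-sdk");
var doc = new AWS.DynamoDB.DocumentClient();

// Invoked by a DynamoDB Streams trigger on the purchases table.
exports.handler = async function (event) {
  for (var record of event.Records) {
    if (record.eventName !== "MODIFY") continue;
    // Stream images arrive in the low-level attribute-value format.
    var purchases = Number(record.dynamodb.NewImage.purchases.N);
    var orderID = record.dynamodb.NewImage.orderID.S;
    if (purchases >= 6) {
      // Append the order to today's aggregate entry in a stats table.
      await doc.update({
        TableName: "purchase_stats", // assumed table name
        Key: { day: new Date().toISOString().slice(0, 10) },
        UpdateExpression: "ADD itemIDs :id",
        ExpressionAttributeValues: { ":id": doc.createSet([orderID]) },
      }).promise();
    }
  }
};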
Bear in mind DynamoDB has a 400KB limit per item, so this may not be the best solution depending on how many items you would need to capture in the itemIDs attribute for a given time period.
You would also need to consider how you reset your purchases counter (this might be a scheduled batch job where you reset purchase count back to zero every x time period).
Alternatively, you could capture the time period in your primary table and create a GSI that is partitioned by time period and has purchases as the sort key. This way you could efficiently query (rather than scan), for a given time period, all items that have a purchase count >= 6.
You don't need to reorganise your data; just use a Scan instead of a Query:
scan({
  TableName,
  IndexName,
  FilterExpression: 'purchases >= :p',
  ExpressionAttributeValues: { ':p': 6 }
});
