One-to-Many Schema in DynamoDB

I am currently designing DynamoDB schemas for the following use case:
A company can have many channels, and each channel has multiple channel cards.
I am thinking of having the following tables:
Channel Table:
Partition Key: CompanyId
Sort Key: combination of timestamp and a deleted flag
Now, after getting the channels for a company, I need to fetch their channel cards. For this, I am thinking of the following table schema for ChannelCard.
ChannelCard Table:
Partition Key: channelId
Sort Key: combination of timestamp and a deleted flag
Now, to get the channel cards for a company, I need to do the following:
1. Query the channels for the company using the partition key (1 query)
2. Get the channel cards for each channel (one query per channel)
So in this case we will be making many queries. Is there a way to reduce the number of queries?
Any suggestions for modifying the database tables or about how to query the database are welcome.

You could also have
Channel Table
Partition Key: CompanyId
Sort Key: Deleted+timestamp
Channel Card Table
Partition Key: CompanyId
Sort Key: Deleted+ChannelCardTimeStamp
GSI #1:
Partition Key: ChannelId
Sort Key: Deleted+ChannelCardTimeStamp
This way you can fetch the most recent channel cards for any given company with one query, and you can also query for the most recent channel cards for any channel.
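As a rough sketch with the AWS SDK for JavaScript DocumentClient (table, index, and attribute names here are illustrative, since they aren't pinned down above), those two queries could look like this:

var AWS = require('aws-sdk');
var docClient = new AWS.DynamoDB.DocumentClient();

// Most recent, non-deleted channel cards for a company (base table).
docClient.query({
  TableName: 'ChannelCard',
  KeyConditionExpression: 'CompanyId = :c AND begins_with(SK, :live)',
  ExpressionAttributeValues: { ':c': 'company-123', ':live': 'ACTIVE#' },
  ScanIndexForward: false // newest first
}, function (err, data) { /* data.Items */ });

// Most recent channel cards for a single channel, via GSI #1.
docClient.query({
  TableName: 'ChannelCard',
  IndexName: 'GSI1',
  KeyConditionExpression: 'ChannelId = :ch AND begins_with(SK, :live)',
  ExpressionAttributeValues: { ':ch': 'channel-456', ':live': 'ACTIVE#' },
  ScanIndexForward: false
}, function (err, data) { /* data.Items */ });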

Related

Is using "Current Date" a good partition key for data that will be queried by date and id?

I'm new to Azure Cosmos DB and I have this new project where I decided to give it a go.
My DB has only one collection, where around 6,000 new items are added every day, and each looks like this:
{
  "Result": "Pass",
  "date": "23-Sep-2021",
  "id": "user1#example.com"
}
The date is the partition key; it is the date on which the item was added to the collection, and the same id can be added again every day, as follows:
{
  "Result": "Fail",
  "date": "24-Sep-2021",
  "id": "user1#example.com"
}
The application that uses this DB will query by id and date to retrieve the Result.
I read some Azure Cosmos DB documentation and found that selecting the partition key carefully can improve the performance of the database and the RUs used for each request.
I tried running this query, and it consumed 2.9 RUs with the collection at about 23,000 items.
SELECT * FROM c
WHERE c.id = 'user1#example.com' AND c.date = '24-Sep-2021'
Here are my questions:
Is using date a good partition key for my scenario? Any room for improvement?
Will the consumed RUs per request increase over time as the number of items in the collection increases?
Thanks.
For a write-heavy workload, using date as a partition key is a bad choice because you will always have a hot partition on the current date. However, if the amount of data being written is consistent and the write volume is low, it can work, and you will get good distribution of data across storage.
In read-heavy scenarios, date can be a good partition key if it is used to answer most of the queries in the app.
The value for id must be unique per partition key value, so for your data model to work you can only have one "id" value per day.
If this is the case for your app, then you can make one additional optimization and replace the query you have with a point read, ReadItemAsync(). This takes the partition key value and the id. It is the fastest and most efficient way to read data because it does not go through the query engine and reads directly from the backend data store. A point read of an item of 1 KB or less always costs 1 RU.
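ReadItemAsync() is the .NET SDK method; as a rough sketch, the equivalent point read with the @azure/cosmos JavaScript SDK looks like this (database, container, and credential names are illustrative):

const { CosmosClient } = require('@azure/cosmos');

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT,
  key: process.env.COSMOS_KEY
});
const container = client.database('mydb').container('results');

// Point read: id plus partition key value, no query engine involved.
async function getResult(id, date) {
  const { resource, requestCharge } = await container.item(id, date).read();
  console.log(requestCharge); // ~1 RU for items of 1 KB or less
  return resource;
}

getResult('user1#example.com', '24-Sep-2021');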

Single table DynamoDB design tips

I have an old application I am modernizing and bringing to AWS. I will be using DynamoDB for the database and am looking to go with a single table design. This is a multitenant application.
The application will consist of Organisations, Outlets, Customers & Transactions.
Everything stems from an organisation: an organisation can have multiple outlets, outlets can have multiple customers, and customers can have multiple transactions.
Access patterns are expected to be as follows:
Fetch a customer by its ID
Search for a customer by name or email
Get all customers for a given outlet
Get all transactions for a customer
Get all transactions for an outlet
Get all transactions for an outlet during a given time period (timestamps will be stored with each transaction)
Get all outlets for a given organisation
Get an outlet by its ID
I've been reading into single-table designs and utilizing the partition and sort keys to enable this sort of access, but right now I can't quite figure out the table/schema design.
The customer will have the OutletID and OrganisationID attached, so I should always know those IDs.
Data Structure (can be modified)
Organisations:
id
Name
Owner
List of Outlets
createdAt (timestamp)
Outlets:
OrganisationId
Outlet Name
Number of customers
Number of transactions
createdAt (timestamp)
Customers:
id
OrganisationID
OutletID
firstName
lastName
email
total transactions
total spent
createdAt (timestamp)
Transactions:
id
customerID
OrganisationID
OutletID
createdAt (timestamp)
type
value
You're off to a great start by having a thorough understanding of your entities and access patterns! I've taken a stab at modeling these access patterns, but keep in mind this is not the only way to model a solution. Data modeling in DynamoDB is iterative, so it is very likely that this specific design will not fit 100% of your use cases.
With that disclaimer out of the way, let's get into it!
I've modeled your access patterns using a single table named data with global secondary indexes (GSI) named GSI1 and GSI2. Each GSI has partition and sort keys named GSI#PK and GSI#SK respectively.
The base table models the following access patterns:
Fetch customer by ID: getItem where PK=CUST#<id> and SK = A
Fetch all transactions for a customer: query where PK=CUST#<id> and SK begins_with TX
Fetch an outlet by ID: getItem where PK=OUT#<id> and SK = A
Fetch all customers for an outlet: query where PK=OUT#<id>#CUST
That last access pattern may require a bit more explanation. I've chosen to model the relationship between outlets and customers using a unique PK/SK pattern where PK is OUT#<id>#CUST and SK is CUST#<id>. When your application records a transaction for a particular customer, it can insert two records in DDB using a batch write operation (a sketch follows the list below). The batch write would perform two operations:
Write a new Transaction into the Customer partition (e.g. PK = CUST#1 and SK = TX#<id>)
Write a new record to the CUSTOMERLIST partition (e.g. PK = OUT#<id>#CUST and SK = CUST#<id>). If this record already exists, DynamoDB will just overwrite it, which is fine for your use case.
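A minimal sketch of that batch write with the DocumentClient (the ids and the non-key attributes are illustrative; the PK/SK and GSI1 attribute names follow the model above):

var AWS = require('aws-sdk');
var docClient = new AWS.DynamoDB.DocumentClient();

docClient.batchWrite({
  RequestItems: {
    data: [
      // 1. The transaction item in the customer partition.
      { PutRequest: { Item: {
          PK: 'CUST#1', SK: 'TX#9001',
          GSI1PK: 'OUT#42', GSI1SK: '2023-01-15T10:00:00Z',
          type: 'sale', value: 25.0
      } } },
      // 2. The outlet-to-customer membership item (idempotent overwrite).
      { PutRequest: { Item: { PK: 'OUT#42#CUST', SK: 'CUST#1' } } }
    ]
  }
}, function (err, data) { /* check data.UnprocessedItems */ });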
Moving on to GSI1:
GSI1 supports the following operations:
Fetch outlets by organization: query GSI1 where GSI1PK = ORG#<id>
Fetch transactions by outlet: query GSI1 where GSI1PK = OUT#<id>
Fetch transactions by outlet for a given time period: query GSI1 where GSI1PK = OUT#<id> and GSI1SK between <period1> and <period2>
And finally, there's GSI2
GSI2 supports the following operations:
Fetch transactions by organization: query GSI2 where GSI2PK = ORG#<id>
Fetch transactions by organization for a given time period: query GSI2 where GSI2PK = ORG#<id> and GSI2SK between <period1> and <period2>
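As a sketch, the time-period query (shown here against GSI1 for an outlet; the GSI2 variant for an organization has the same shape) might look like this, assuming the GSI sort key holds an ISO-8601 timestamp so it sorts lexicographically:

var AWS = require('aws-sdk');
var docClient = new AWS.DynamoDB.DocumentClient();

docClient.query({
  TableName: 'data',
  IndexName: 'GSI1',
  KeyConditionExpression: 'GSI1PK = :out AND GSI1SK BETWEEN :from AND :to',
  ExpressionAttributeValues: {
    ':out': 'OUT#42',
    ':from': '2023-01-01T00:00:00Z',
    ':to': '2023-01-31T23:59:59Z'
  }
}, function (err, data) { /* data.Items */ });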
For your final access pattern, you've asked to support searching for customers by email or name. DynamoDB is really good at finding items by their primary key. It is not good for search, where fuzzy or partial matches are expected. If you need an exact match on email or name, you could do that in DynamoDB by incorporating email/name in the primary key of the User item.
I hope this gives you some ideas on how to model your access patterns!

Auto-increment integer in dynamodb

I'm modeling a DynamoDB diagram for an invoice app and I'm looking to generate the unique invoice id, which needs to be incremented from 1 to X. Is there (in 2019) a solution to this kind of problem with AWS AppSync and DynamoDB as the data source?
Auto-incrementing integers are not a recommended pattern in DynamoDB, although it is possible to implement something similar using application-level logic. A DynamoDB table is distributed across many logical partitions according to the table's partition key, and items are then sorted within each partition according to their sort key. You will need to decide what structure makes sense for your app and what an auto-incrementing id means for it. The simplest case would be to omit a sort key and treat the auto-incremented id as the partition key. That guarantees uniqueness, but it also means every row lives in its own partition, so listing all invoices would require a Scan, which does not preserve order; that may or may not be acceptable for your app.
As mentioned in this SO post (How to use auto increment for primary key id in dynamodb) you can use code like this:
const params = {
  TableName: 'CounterTable',
  Key: { HashKey: 'auto-incrementing-counter' },
  UpdateExpression: 'ADD #a :x',
  ExpressionAttributeNames: { '#a': 'counter_value' },
  ExpressionAttributeValues: { ':x': 1 },
  ReturnValues: 'UPDATED_NEW' // ensures you get back the newly incremented value
};
new AWS.DynamoDB.DocumentClient().update(params, function(err, data) {});
to atomically increment the integer stored in the CounterTable row designated by the partition key "auto-incrementing-counter". After the atomic increment, you can use the returned value as the id for the new Invoice.
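A rough sketch of the full flow (the Invoices table name and its attributes are illustrative):

var AWS = require('aws-sdk');
var docClient = new AWS.DynamoDB.DocumentClient();

// 1. Atomically bump the counter and read back the new value.
docClient.update({
  TableName: 'CounterTable',
  Key: { HashKey: 'auto-incrementing-counter' },
  UpdateExpression: 'ADD #a :x',
  ExpressionAttributeNames: { '#a': 'counter_value' },
  ExpressionAttributeValues: { ':x': 1 },
  ReturnValues: 'UPDATED_NEW'
}, function (err, data) {
  if (err) throw err;
  var invoiceId = data.Attributes.counter_value;

  // 2. Use the new value as the invoice id.
  docClient.put({
    TableName: 'Invoices',
    Item: { invoiceId: invoiceId, createdAt: new Date().toISOString() }
  }, function (putErr) { /* handle putErr */ });
});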
You can implement this pattern using DynamoDB & AppSync, but the first thing to decide is whether it suits your use case. You may also be interested in the RDS integration via the RDS Data API, which has more native support for auto-incrementing IDs but loses the set-it-and-forget-it scaling of DynamoDB.

DynamoDB NoSQL design for queries

I am looking to store a log of user events. There are going to be a lot of entries, so I thought DynamoDB would be a good fit since everything else is hosted there.
I need to query these events in two ways: the total number of events for a user for a date (range), and occasionally all the events for a date.
I was thinking of storing it in one table with user id (key), sequence number (key), date, time and duration.
Should it be multiple tables? How can this be done most efficiently?
For a small amount of data this structure is ok.
Keep in mind that the sequence number (your range key) has to be provided by you. A good choice for the sort key would be the event date as a Unix timestamp with millisecond accuracy.
There is no need for extra tables.
However, your structure depends largely on the read/write capacity you want to achieve and on the data size.
Supposing your user_id is your partition key.
For every distinct partition key value, the total sizes of all table and index items cannot exceed 10 GB.
A single partition can support a maximum of 3,000 read capacity units or 1,000 write capacity units.
You need to design your partition keys with these limitations in mind.
For example, a very active user produces many events, so you may need more than 1,000 write capacity units. Unfortunately, you have chosen the user id as the partition key, so you are limited to 1,000 write capacity units per user and might therefore see throttled writes.
In that case you need a different structure, for example partition key values like user_id_1, user_id_2, etc.: a key-naming mechanism that spreads the data across partitions according to your application's needs.
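A minimal sketch of such a write-sharding scheme (the shard count is illustrative and should be sized to your write throughput):

// Spread one logical user's events across N partitions by suffixing the key.
var SHARD_COUNT = 10;

function shardedPartitionKey(userId) {
  var shard = Math.floor(Math.random() * SHARD_COUNT) + 1;
  return userId + '_' + shard; // e.g. "user_id_3"
}

// Note: reads then have to query all N suffixes (user_id_1 ... user_id_10) and merge the results.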
Check these links on DynamoDB limitations:
Tables guidance,
Partition distribution
I would suggest the following structure for your events table:
user id -- hash key
event date/time (timestamp with milliseconds) -- range key
duration
Having event timestamp as a range key should be sufficient to provide uniqueness for an event (unless a user can have multiple events right in the same millisecond), so you don't need a sequence number.
With such a schema, you can get all events for a user for a date using a simple query.
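For example (table and attribute names are illustrative), with the timestamp stored in milliseconds:

var AWS = require('aws-sdk');
var docClient = new AWS.DynamoDB.DocumentClient();

// All events for one user on 2017-02-17, using the timestamp range key.
docClient.query({
  TableName: 'user_events',
  KeyConditionExpression: 'userId = :u AND eventTime BETWEEN :from AND :to',
  ExpressionAttributeValues: {
    ':u': '65716110-f4df-11e6-bc64-92361f002671',
    ':from': 1487289600000, // 2017-02-17T00:00:00.000Z
    ':to': 1487375999999    // 2017-02-17T23:59:59.999Z
  }
}, function (err, data) { /* data.Items, data.Count */ });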
Unfortunately, DynamoDB does not support aggregate queries, so you can't get the total number of events for a user quickly (you would have to query all the records and calculate the total manually).
So I would suggest creating a separate table for user events statistics like this:
user id -- hash key
date -- range key
events_cnt (total number of events for a user for a date)
So, after you add a new record to your events table, you have to increment the events counter for the user in the statistics table, as shown below:
var dynamodbDoc = new AWS.DynamoDB.DocumentClient();
var params = {
  TableName: "user_events_stats",
  Key: {
    userId: "65716110-f4df-11e6-bc64-92361f002671",
    date: "2017-02-17"
  },
  UpdateExpression: "SET #events_cnt = if_not_exists(#events_cnt, :zero) + :one",
  ExpressionAttributeNames: {
    "#events_cnt": "events_cnt"
  },
  ExpressionAttributeValues: {
    ":one": 1,
    ":zero": 0
  }
};
dynamodbDoc.update(params, function(err, data) {
});

NoSQL Structuring of Data

Coming from a relational database background, I find that sometimes finding the right way to structure my NoSQL databases is a challenge (yes, I realize the statement sounds silly). I work with DynamoDB.
If I have 3 entities - a user, a report and a building and many users can submit many reports on a building, would the following structure be acceptable?
User - index on userId
Building - index on buildingId
Report - index on reportId, userId and buildingId
Or do I need a fourth table to keep track of reports submitted by users? My points of concern are performance, throughput and storage space.
When using DynamoDB, global secondary indexes (GSIs) provide alternative ways to query data from a table.
Based on the tables you have described, here is a structure that may work:
User Table
Hash Key: userId
Building Table
Hash Key: buildingId
Report Table
Hash Key: reportId
ReportUser GSI
Hash Key: userId
BuildingUser GSI
Hash Key: buildingId
The key to the above design is the global secondary indexes on the Report table. Unlike the hash key (and optional range key) on the main table, the hash key (and optional range key) on a GSI does not have to be unique. This means you can query all of the reports submitted by a specific userId, or all of the reports for a specific buildingId.
In real life these GSIs would probably also include a range key (such as date) so the records can be ordered when they are queried.
The other thing to remember about GSIs is that you need to choose which attributes are projected (i.e. available to be retrieved), as a GSI is actually a physical copy of the data. This also means a GSI is updated asynchronously, so reads from it are always eventually consistent.
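As a sketch, querying one of those GSIs (attribute names and the projected range key are illustrative):

var AWS = require('aws-sdk');
var docClient = new AWS.DynamoDB.DocumentClient();

// All reports submitted by one user, newest first, via the ReportUser GSI.
docClient.query({
  TableName: 'Report',
  IndexName: 'ReportUser',
  KeyConditionExpression: 'userId = :u',
  ExpressionAttributeValues: { ':u': 'user-123' },
  ScanIndexForward: false // descending by the GSI range key (e.g. date)
}, function (err, data) { /* data.Items */ });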
