Single table DynamoDB design tips - amazon-dynamodb

I have an old application I am modernizing and bringing to AWS. I will be using DynamoDB for the database and am looking to go with a single table design. This is a multitenant application.
The application will consist of Organisations, Outlets, Customers & Transactions.
Everything stems from an organisation: an organisation can have multiple outlets, outlets can have multiple customers, and customers can have multiple transactions.
Access patterns are expected to be as follows:
Fetch a customer by its ID
Search for a customer by name or email
Get all customers for a given outlet
Get all transactions for a customer
Get all transactions for an outlet
Get all transactions for an outlet during a given time period (timestamps will be stored with each transaction)
Get all outlets for a given organisation
Get an outlet by its ID
I've been reading up on single-table designs and on using the partition key and sort keys to enable this sort of access, but right now I can't quite figure out the table/schema design.
The customer will have the OutletID and OrganisationID attached, so I should always know those IDs.
Data Structure (can be modified)
Organisations:
id
Name
Owner
List of Outlets
createdAt (timestamp)
Outlets:
OrganisationId
Outlet Name
Number of customers
Number of transactions
createdAt (timestamp)
Customers:
id
OrganisationID
OutletID
firstName
lastName
email
total transactions
total spent
createdAt (timestamp)
Transactions:
id
customerID
OrganisationID
OutletID
createdAt (timestamp)
type
value

You're off to a great start by having a thorough understanding of your entities and access patterns! I've taken a stab at modeling for these access patterns, but keep in mind this is not the only way to model a solution. Data modeling in DynamoDB is iterative, so it's very likely that this specific design won't fit 100% of your use cases.
With that disclaimer out of the way, let's get into it!
I've modeled your access patterns using a single table named data with global secondary indexes (GSIs) named GSI1 and GSI2. Each GSI has partition and sort keys named GSI1PK/GSI1SK and GSI2PK/GSI2SK respectively.
The base table models the following access patterns:
Fetch customer by ID: getItem where PK=CUST#<id> and SK = A
Fetch all transactions for a customer: query where PK=CUST#<id> and SK begins_with TX
Fetch an outlet by ID: getItem where PK=OUT#<id> and SK = A
Fetch all customers for an outlet: query where PK=OUT#<id>#CUST
That last access pattern may require a bit more explanation. I've chosen to model the relationship between outlets and customers using a unique PK/SK pattern where PK is OUT#<id>#CUST and SK is CUST#<id>. When your application records a transaction for a particular customer, it can insert two records in DDB using a batch write operation (see the sketch after this list). The batch write operation would perform two operations:
Write a new Transaction into the Customer partition (e.g. PK = CUST#1 and SK = TX#<id>)
Write a new record to the CUSTOMERLIST partition (e.g. PK = OUT#<id>#CUST and SK = CUST#<id>). If this record already exists, DynamoDB will just overwrite the existing record, which is fine for your use case.
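Here's a minimal sketch of that dual write using boto3. The key patterns and the data table name follow the design above; the helper name, the uuid-based transaction ID, and the ISO-8601 timestamps are illustrative assumptions:

import uuid
from datetime import datetime, timezone

import boto3

table = boto3.resource("dynamodb").Table("data")

def record_transaction(org_id, outlet_id, customer_id, tx_type, value):
    # Hypothetical helper illustrating the dual write described above
    tx_id = str(uuid.uuid4())
    now = datetime.now(timezone.utc).isoformat()
    with table.batch_writer() as batch:
        # 1. New Transaction item in the Customer partition
        batch.put_item(Item={
            "PK": f"CUST#{customer_id}",
            "SK": f"TX#{tx_id}",
            "GSI1PK": f"OUT#{outlet_id}",  # enables transactions-by-outlet
            "GSI1SK": now,
            "GSI2PK": f"ORG#{org_id}",     # enables transactions-by-organisation
            "GSI2SK": now,
            "type": tx_type,
            "value": value,  # pass numeric amounts as decimal.Decimal, not float
        })
        # 2. Customer-list item in the outlet partition (idempotent overwrite)
        batch.put_item(Item={
            "PK": f"OUT#{outlet_id}#CUST",
            "SK": f"CUST#{customer_id}",
        })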
Moving on to GSI1:
GSI1 supports the following operations:
Fetch outlets by organization: query GSI1 where GSI1PK = ORG#<id>
Fetch transactions by outlet: query GSI1 where GSI1PK = OUT#<id>
Fetch transactions by outlet for a given time period: query GSI1 where GSI1PK=OUT#<id> and GSI1SK between <period1> and <period2> (see the sketch below)
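As a sketch, that time-bounded query could look like this with boto3. The index and key names come from the design above; storing timestamps as ISO-8601 strings so they sort lexicographically is an assumption:

from boto3.dynamodb.conditions import Key

def transactions_for_outlet(table, outlet_id, start_iso, end_iso):
    # Hypothetical helper: an outlet's transactions within [start_iso, end_iso]
    resp = table.query(
        IndexName="GSI1",
        KeyConditionExpression=(
            Key("GSI1PK").eq(f"OUT#{outlet_id}")
            & Key("GSI1SK").between(start_iso, end_iso)
        ),
    )
    return resp["Items"]

The GSI2 queries below are identical apart from the index and key names (GSI2, GSI2PK=ORG#<id>, GSI2SK).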
And finally, there's GSI2.
GSI2 supports the following operations:
Fetch transactions by organization: query GSI2 where GSI2PK = ORG#<id>
Fetch transactions by organization for a given time period: query GSI2 where GSI2PK=ORG#<id> and GSI2SK between <period1> and <period2>
For your final access pattern, you've asked to support searching for customers by email or name. DynamoDB is really good at finding items by their primary key, but it is not good for search, where fuzzy or partial matches are expected. If you need an exact match on email or name, you could do that in DynamoDB by incorporating the email/name in the primary key of the Customer item.
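For example, here's a minimal sketch of an exact-match email lookup, assuming the application writes one extra lookup item per customer; the CUSTEMAIL# prefix and attribute names are illustrative assumptions, not part of the design above:

# Using the `table` handle from the earlier sketch.
# Written alongside the customer item:
table.put_item(Item={
    "PK": f"CUSTEMAIL#{email.lower()}",
    "SK": "A",
    "customerId": customer_id,
})

# Exact-match lookup by email:
resp = table.get_item(Key={"PK": f"CUSTEMAIL#{email.lower()}", "SK": "A"})
customer_id = resp["Item"]["customerId"]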
I hope this gives you some ideas on how to model your access patterns!

Related

Is using "Current Date" a good partition key for data that will be queried by date and id?

I'm new to Azure Cosmos DB and I have this new project where I decided to give it a go.
My DB has only one collection where around 6,000 new items are added everyday and each looks like this
{
  "Result": "Pass",
  "date": "23-Sep-2021",
  "id": "user1#example.com"
}
The date is the partition key; it is the date on which the item was added to the collection. The same id can be added again every day, as follows:
{
  "Result": "Fail",
  "date": "24-Sep-2021",
  "id": "user1#example.com"
}
The application that uses this DB will query by id and date to retrieve the Result.
I read some Azure Cosmos DB documentations and found that selecting the partition key very carefully can improve the performance of the database and RUs used for each request.
I tried running this query and it consumed 2.9 RUs where the collection has about 23,000 items.
SELECT * FROM c
WHERE c.id = 'user1#example.com' AND c.date = '24-Sep-2021'
Here are my questions
Is using date a good partition key for my scenario? Any room for improvement?
Will consumed RUs per request increase over time if number of items in collection increase?
Thanks.
For a write-heavy workload, using date as a partition key is a bad choice because you will always have a hot partition on the current date. However, if the amount of data being written is consistent and the write volume is low, then it can work, and you will get good distribution of data across storage.
In read-heavy scenarios, date can be a good partition key if it is used to answer most of the queries in the app.
The value for id must be unique per partition key value so for your data model to work you can only have one "id" value per day.
If this is the case for your app, then you can make one additional optimization and replace the query with a point read, ReadItemAsync(). This takes the partition key value and the id. It is the fastest and most efficient way to read data because it does not go through the query engine and reads directly from the backend data store. A point read for 1 KB of data or less always costs 1 RU.
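The answer names the .NET ReadItemAsync(); as a sketch, the equivalent point read with the Python azure-cosmos SDK looks like this (the endpoint, key, and database/container names are placeholders):

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com", credential="<key>")
container = client.get_database_client("<db>").get_container_client("<coll>")

# Point read: id plus partition key value, no query engine involved
item = container.read_item(item="user1#example.com", partition_key="24-Sep-2021")
print(item["Result"])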

How best to perform a query on primary partition key only, for a table which has both partition key and sort key?

Ok, I have a table with primary partition key (Employee ID) and sort key (Project ID). Now I want a list of all projects an employee works on, and also a list of all employees working on a project. The relationship is many-to-many. I have created the schema in AppSync (GraphQL), and AppSync created the required queries and mutations for the type (EmployeeProjects). Now ListEmployeeProjects takes a filter input with different attributes. My question is: when I do the two searches on Employee ID or Project ID only, will it be a complete table scan? How efficient will that be? If it is a table scan, can I reduce the time complexity by creating indexes (GSI or LSI)? The end product will have a huge amount of data, so I cannot test the app with such data beforehand. My project works fine, but I am worried about the problems that might arise later on with a lot of data. Can someone please help.
You don't need to (and should not) perform a Scan for this.
To get all of the projects an employee is working on, you just need to perform a Query on the base table, specifying employee ID as the partition key.
To get all of the employees on a project, you should create a GSI on the table. The partition key should be project ID and sort key should be employee ID. Then perform a Query on the GSI, using partition key of project ID.
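As a sketch with boto3 (the table, index, and attribute names here are assumptions, not the AppSync-generated ones):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("EmployeeProjects")

# All projects for an employee: Query the base table on its partition key
projects = table.query(
    KeyConditionExpression=Key("employeeId").eq("emp-123")
)["Items"]

# All employees on a project: Query the GSI keyed by project ID
employees = table.query(
    IndexName="projectId-employeeId-index",  # hypothetical GSI name
    KeyConditionExpression=Key("projectId").eq("proj-456"),
)["Items"]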
In order to model this correctly, you will probably want three tables:
Employee Table
Project Table
Employee-Project reference table (i.e. just two attributes of employee ID and project ID)

One to Many Schema in DynamoDB

I am currently designing dynamo DB schemas for the following use case:
A company can have many channels, and then for each channel, I have multiple channel cards.
I am thinking of having following tables:
Channel Table:
Partition Key: CompanyId
Sort Key: Combination of Time Stamp and deleted or not
Now after getting the channels for a company, I need to fetch its channel cards, for this, I am thinking to have following Table Schema for ChannelCard.
ChannelCard Table:
Partition Key: channelId
Sort Key: Combination of Time Stamp and deleted or not
Now to get the channel cards for a company, I need to do the following:
1. First query the channels for the company using the partition key (1 query)
2. Then get the channel cards for each channel (one query per channel)
So in this case we will be making many queries; can we reduce the number of queries?
Any suggestions for modifying the database tables or about how to query the database are welcome.
You could also have
Channel Table
Partition Key: CompanyId
Sort Key: Deleted+timestamp
Channel Card Table
Partition Key: CompanyId
Sort Key: Deleted+ChannelCardTimeStamp
GSI #1:
Partition Key: ChannelId
Sort Key: Deleted+ChannelCardTimeStamp
This way you can have one query for the most recent channel cards for any given company, and you can also query for the most recent channel cards for any channel.
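As a sketch of the company-level query with boto3, assuming the Deleted+timestamp sort key is stored as a single string attribute with a status prefix (the attribute, prefix, and table names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

cards_table = boto3.resource("dynamodb").Table("ChannelCard")

# Most recent non-deleted channel cards for a company, newest first
cards = cards_table.query(
    KeyConditionExpression=(
        Key("CompanyId").eq("company-1")
        & Key("DeletedTimestamp").begins_with("ACTIVE#")
    ),
    ScanIndexForward=False,  # descending sort-key order
)["Items"]

The per-channel variant is the same query against GSI #1 with ChannelId as the partition key.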

Modeling an invite schema with embedded collections with dynamodb or documentdb

I'm investigating whether to use AWS DynamoDb or Azure DocumentDb or google cloud for price and simplicity for my app and am wondering what the best approach is for a typical invite schema.
An invite has
userId : key (who created the invite)
gameId : key
invitationList : collection of userIds
The queries I would be running are
Get invites where userId == me
Get invites where my userId is in the invitationList
In Mongo, I would just set an index on the embedded invitationList, and in SQL I would set up a join table of gameId and invited UserIds.
Using dynamodb or documentdb, could I do this in one "table", or would I have to set up a second denormalized table, one that has an invited UserId per row with a set of invitedGameIds?
e.g.
A secondary table with
InvitedUserId : key
GameIds : Collection
Similar to hslriksen's answer, if certain criteria are met, I recommend that you denormalize all of this into a single document. Those criteria are:
The invitationList for games cannot grow unbounded.
Even if it's bounded, a maximum-length array must still fit within the document and transaction size limits.
However, different from hslriksen, I recommend that an example document look like this:
{
  gameId: <some game key>,
  userId: <some user id>,
  invitationList: [<user id 1>, <user id 2>, ...]
}
You might also decide to use the built-in id field for games in which case the name above is wrong.
The key difference between what I propose and hslriksen's answer is that the invitationList is a pure array of foreign keys. This will allow indexes to be used for an ARRAY_CONTAINS clause in your query.
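As a sketch, both invite queries with the Python azure-cosmos SDK (container setup as in the earlier point-read example; the user ID value is a placeholder):

# Invites I created
mine = list(container.query_items(
    query="SELECT * FROM c WHERE c.userId = @me",
    parameters=[{"name": "@me", "value": "user-42"}],
    enable_cross_partition_query=True,
))

# Invites where I appear in the embedded list (can use the array index)
for_me = list(container.query_items(
    query="SELECT * FROM c WHERE ARRAY_CONTAINS(c.invitationList, @me)",
    parameters=[{"name": "@me", "value": "user-42"}],
    enable_cross_partition_query=True,
))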
Note, in DocumentDB, you would tend to store all entity types in the same big bucket and just distinguish them with a string type field or slightly better, an is_my_type boolean field.
For DocumentDB you could probably just keep this in one document per inviting user, where the document id could equal the key of the inviting user. If you have many games, you could use gameId as partitionKey.
{
  "id": "gameKey+invitingUserKey",
  "gameKey": "someGameKey",
  "invitingUserId": "key",
  "invites": ["inviteKey1", "inviteKey2"]
}
This is based on a limited number of invites per user/gameKey. It is, however, hard to determine the structure without knowing your query patterns; I find that the query patterns often dictate the document structure.

NoSQL Structuring of Data

Coming from a relational database background, I find that sometimes finding the right way to structure my NoSQL databases is a challenge (yes, I realize the statement sounds silly). I work with DynamoDB.
If I have 3 entities - a user, a report and a building and many users can submit many reports on a building, would the following structure be acceptable?
User - index on userId
Building - index on buildingId
Report - index on reportId, userId and buildingId
Or do I need a fourth table to keep track of reports submitted by users? My points of concern are performance, throughput and storage space.
When using DynamoDB, global secondary indexes provide alternative ways to query data from a table.
Based on the tables you have described, here is a structure that may work:
User Table
Hash Key: userId
Building Table
Hash Key: buildingId
Report Table
Hash Key: reportId
ReportUser GSI
Hash Key: userId
BuildingUser GSI
Hash Key: buildingId
The key to the above design is the global secondary indexes on the Report table. Unlike the hash key (and optional range key) on the main table, the hash key (and optional range key) on a GSI does not have to be unique. This means you can query all of the reports submitted by a specific userId or all of the reports for a specific buildingId.
In real life these GSIs would probably want to include a range key (such as date) to allow for ordering of the records when they are queried.
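For instance, a sketch of querying the ReportUser GSI with boto3, assuming it includes a date range key as suggested (the names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

report_table = boto3.resource("dynamodb").Table("Report")

# All reports submitted by a user, newest first via the date range key
reports = report_table.query(
    IndexName="ReportUser",
    KeyConditionExpression=Key("userId").eq("user-1"),
    ScanIndexForward=False,
)["Items"]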
The other thing to remember about GSIs is that you need to choose which attributes are projected (i.e. available to be retrieved), as a GSI is actually a physical copy of the data. This also means the GSI is always updated asynchronously, so reads from it are always eventually consistent.
