I have a relationship whereby each SITE can have one or more CAMERAs.
So the parent-child relationship would be that of SITE->CAMERA[s].
99% of my queries will be "Give me all the cameras at a given site", "Give me camera XYZ", and "Give me all cameras where enabled===true" -- at roughly a 1:1:1 ratio.
The DynamoDB design, if I understand correctly, would be to have the partition key be 'SITE_ID' and the sort key be 'CAMERA_ID'. Done and done.
....
However, not every CAMERA belongs to a SITE. About 10% of my CAMERAs are not associated with a SITE. I could just put 'noSite' or something as the partition key, but that seems like a kludge... or is it?
I'm new to DynamoDB and unsure how best to set up this relationship. I've always just used MongoDB and never spent time in the SQL world, so needing to worry about indexes isn't something I have experience with. Cost is more important than raw speed and the DB will remain somewhat small (currently around 500 cameras and likely never more than 10k).
What is the best way to set up this table?
Detailed question first: a noSite key is not a bad design choice for unassigned cameras. SiteID is important, and the key cannot be blank.
Your access patterns give you flexibility. Your low data volumes reduce the stakes of the design decisions.
What are the Partition Key and Sort Key names?
Regardless of which "columns" you end up selecting for the keys, naming the keys PK and SK gives you the option to add other record types in a single-table design later. This is a common practice.
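For example, a camera item under generic key names might look like this (the entity prefixes and attribute values are illustrative, not prescribed):
# Illustrative item shape for a single-table layout with generic keys.
# Prefixes like "SITE#" and "CAMERA#" let several record types coexist.
camera_item = {
    "PK": "SITE#site-123",
    "SK": "CAMERA#cam-042",
    "Enabled": True,
    "Model": "Acme X1",  # hypothetical extra attribute
}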
What are the PK and SK columns?
You have two good options for PK and SK for your Camera records:
# Option 1 - marginally better, CameraID has the higher cardinality
PK: CameraID, SK: SiteID
# Option 2
PK: SiteID, SK: CameraID
At this point, 1 of your "queries" will be executed as a query (faster and cheaper) and the other 2 as scans (slower and more expensive). Scanning 500 records is nothing, though, so you could be "done and done" as you say.
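To make the query/scan split concrete, here is a minimal boto3 sketch assuming Option 2; the table name Cameras and the attribute names are placeholders:
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("Cameras")  # name assumed

# Query: "all the cameras at a given site" (fast and cheap)
site_cameras = table.query(
    KeyConditionExpression=Key("SiteID").eq("site-123")
)["Items"]

# Scan: "give me camera XYZ" (reads the whole table, then filters)
camera = table.scan(
    FilterExpression=Attr("CameraID").eq("cam-xyz")
)["Items"]

# Scan: "all cameras where enabled===true"
enabled = table.scan(
    FilterExpression=Attr("Enabled").eq(True)
)["Items"]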
Sooner, Later or Never
If required, we can remove the scan operations by adding secondary indexes. Secondary indexes add storage cost (records are literally duplicated) but reduce access costs. The net change is case-dependent. Performance will improve.
# Add an index to query "Give me all the cameras at a given site"
GSI1PK: SiteID, GSI1SK: CameraID # reverse your choice for primary keys
# or, to get fancy and be able to query enabled cameras by site, too, use a concatenated SK with a begins_with query
GSI1PK: SiteID, GSI1SK: Enabled#True#CameraID
# Add an index to query "Give me all cameras where enabled===true"
# Concatenate SiteID and CameraID in the GSI Sort Key to enable 2 types of queries
# 1. all enabled cameras? GSI2PK = true and GSI2SK > ""
# 2. all enabled cameras at Site123? GSI2PK = true and GSI2SK begins_with("Site123")
GSI2PK: Enabled, GSI2SK: SiteID#CameraID
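A hedged sketch of querying GSI2, assuming an index by that name exists and that Enabled is stored as the string "true" (index key attributes must be strings, numbers, or binary, so a boolean cannot be a key directly):
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Cameras")  # as above

# 1. All enabled cameras
enabled = table.query(
    IndexName="GSI2",
    KeyConditionExpression=Key("GSI2PK").eq("true"),
)["Items"]

# 2. All enabled cameras at Site123, via the concatenated sort key
enabled_at_site = table.query(
    IndexName="GSI2",
    KeyConditionExpression=Key("GSI2PK").eq("true")
    & Key("GSI2SK").begins_with("Site123"),
)["Items"]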
Related
I believe this would be easier with PostgreSQL or MongoDB, both of which I'm familiar with, but I'm using DynamoDB for this project for the sake of learning how to use it and getting comfortable with it. I've never used it before.
I want to use DynamoDB to store high scores for my typing test project. There are 4 data attributes to be stored:
name (doesn't need to be unique)
WPM
number of errors
test type (because I have 2 different kinds of typing tests)
At first, my partition key was testType, and my sort key was WPM. Then I realized that if anyone got the same WPM as a previous user, it would overwrite the previous user's data, because testType and WPM, the two key components, were identical. So ties did not work.
So, now, name is my partition key, and WPM is my sort key. In order to filter by testType, I just use JS array filter methods. This still doesn't seem optimal though for multiple reasons. For my small typing test project, I think it's ok, but I can see that it's possible for 2 people to input the same name and get the same WPM and overwrite each other.
What would be a better way to set this up with DynamoDB?
Assuming you want the top X WPM results for a given test type:
Set the partition key to be the test type. Set the sort key as <WPM>#<username>. Make sure to zero-pad the WPM so it's always 3 digits even if the score is below 100; that keeps the string sort order consistent with the numeric order.
With this key structure you have a sorted list (in the sort key) of all the scores for a given test type. You can Query against the test type and use ScanIndexForward=false to get descending high scores.
Notice how multiple identical scores by different usernames won’t overwrite each other. The username can be pulled from the returned sort key or from an attribute on the item, along with other metadata about the high score event.
If you have multiple users with the same username, well, that’s kinda weird. Presumably you have an internal identifier. You can use that as the suffix in the sort key instead of the username.
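Putting it together, a minimal boto3 sketch; the table name HighScores and the attribute names are assumptions, not a fixed API:
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("HighScores")  # name assumed

def record_score(test_type, username, wpm, errors):
    # Zero-padding keeps the string sort order equal to numeric order.
    table.put_item(Item={
        "TestType": test_type,                # partition key
        "ScoreKey": f"{wpm:03d}#{username}",  # sort key: <WPM>#<username>
        "WPM": wpm,
        "Errors": errors,
        "Name": username,
    })

def top_scores(test_type, limit=10):
    # ScanIndexForward=False walks the sort key descending: best first.
    return table.query(
        KeyConditionExpression=Key("TestType").eq(test_type),
        ScanIndexForward=False,
        Limit=limit,
    )["Items"]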
A bit of context: I am trying to build an inventory to list my AWS resources in various accounts, and I am planning to use DynamoDB to store the data. These will be the columns for my table: ResourceARN, ResourceName, ResourceType, StandardTag, IsDeleted, LastUpdateTime and ResourceCreationDate (this field is available only for a few resource types, like EC2).
Question: I want to query my DDB table using account ID, resource type, and tag name. I am stumped on choosing the primary key for the table. The primary key has to be unique, so I cannot use a combination of resourceType and account ID (a 1:many relationship). Nor can I use resourceArn as my primary key, since it is a 1:1 relationship. Also, using the resourceARN as the sort key does not make sense to me. I understand that I could use a simple Scan operation, but that is very costly and will take time as I add more data to my DDB.
I would appreciate any suggestions or guidance over the same.
Short answer
Partition key: Account ID
Sort key: <resource type>/<resource ID>
Rationale
It's a common pattern for a sort key to be a string concatenating multiple attributes. Since sort keys can be queried by prefix, you can leverage this in your queries:
Get all account resources: query all sort keys on the Account ID partition key
Get all EC2 instances of an account: query with partition key = <your account ID> and sort key begins_with('ec2-instance').
You may notice that ARNs follow such a hierarchy as well (which is probably not a coincidence). This would effectively be using a subset of the ARN as the sort key.
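A short boto3 sketch of both queries; the table name ResourceInventory and the key attribute names AccountId/ResourceKey are assumptions:
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ResourceInventory")

# All resources in an account
all_resources = table.query(
    KeyConditionExpression=Key("AccountId").eq("123456789012")
)["Items"]

# All EC2 instances in an account, via the sort key prefix
ec2_instances = table.query(
    KeyConditionExpression=Key("AccountId").eq("123456789012")
    & Key("ResourceKey").begins_with("ec2-instance/")
)["Items"]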
Some notes:
DynamoDB is about attributes rather than columns. You don't need to include ResourceCreationDate in the records that don't have it, and omitting it will save you space (see next point).
Attribute names count as storage for every record, which impacts cost and also throughput. It's common to use shorthand names for this reason (rcd instead of ResourceCreationDate, for example).
You can use LSIs (Local Secondary Indexes) to order by creation and update times if you need this.
Say I had a DynamoDB table:
UserId: S
BookName: S
BorrowedTimestamp: S
HasReturned: B
UserId (partition) and BookName (range) would be keys on the base table.
However, I want to query using the other, non-key fields, e.g. BorrowedTimestamp > 3 days and HasReturned is false.
I think I'd need to setup a GSI for this query to work, but it doesn't sound right having a binary field, HasReturned, as the partition key (with BorrowedTimestamp as range key). Is that correct, or am I missing something?
No, you don't need a GSI, but it might be more efficient depending on your circumstances.
Let's take your example of BorrowedTimestamp > 3 days. I'm going to assume this is for a particular user, so you have a userid to query.
You could do a query with a KeyConditionExpression of userid, then a FilterExpression of BorrowedTimestamp > 3 days. Let's say the user has 10 books and 2 of them have a BorrowedTimestamp > 3 days. This query will cost you 10 RCUs (Read Capacity Units). That's because a FilterExpression just filters out items in your result set - DynamoDB actually read all 10 items in the query.
Now let's say you have a GSI where the partition key is userid and the range key is BorrowedTimestamp. Your KeyConditionExpression could specify both the partition key of userid and the range key condition BorrowedTimestamp > 3 days. The result would be exactly the same. However, this time it would only cost you 2 RCUs, and those RCUs would come from the index capacity, not the table capacity.
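Side by side, the two approaches look like this in boto3 (table and index names are hypothetical; ISO 8601 timestamps compare correctly as strings):
import boto3
from datetime import datetime, timedelta, timezone
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("BorrowedBooks")  # name assumed

# "Borrowed more than 3 days ago" means a timestamp before this cutoff.
cutoff = (datetime.now(timezone.utc) - timedelta(days=3)).isoformat()

# Base table: reads all of the user's items (10 RCUs in the example),
# then the FilterExpression discards the non-matching ones.
filtered = table.query(
    KeyConditionExpression=Key("UserId").eq("user-1"),
    FilterExpression=Attr("BorrowedTimestamp").lt(cutoff),
)["Items"]

# GSI keyed on (UserId, BorrowedTimestamp): reads only the 2 matches.
indexed = table.query(
    IndexName="UserBorrowedIndex",  # hypothetical index name
    KeyConditionExpression=Key("UserId").eq("user-1")
    & Key("BorrowedTimestamp").lt(cutoff),
)["Items"]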
Fewer RCUs sounds good, but remember you have to purchase throughput capacity for your primary index and your GSI separately. This can be less efficient, because you can't share purchased throughput between queries that use your primary key and queries that use your GSI.
Finally, if you didn't want to specify a userid at all, you would use a scan. Scans often don't scale well because they always evaluate every item in the table, but whether one works for you really depends on a lot of things (how often you will run the scan, how many items you will have in the table, etc.).
I'm pretty new to DynamoDB design and trying to get the correct schema for my application. In this app different users will enter various attributes about their day. For example "User X, March 1st 12:00-2:00, Tired". There could be multiple entries for a given time, or overlapping times (e.g. tired from 12-2 and eating lunch from 12-1).
I'll need to query based on user and time ranges. Common queries:
Give me all the "actions" for user X between time t1 and t2
Give me all the start times for action Z for user X
My initial thought was that the partition key would be userid and the range key the start time, but that won't work because of duplicate start times, right?
A second thought:
UserID - Partition Key
StartTime - RangeKey
Action - JSON document of all actions for that start time
[{"action": "Lunch", "endTime": "1pm"}, {"action": "Tired", "endTime": "2pm"}]
Any recommendation on a proper schema?
This doesn't really have one solution. You will need to evaluate multiple options depending on your use case: how much data you have, how often you will query it, which fields you will query by, and so on.
But one good solution is to partition your schema like this:
Generated UUID as partition key
UserID
Start time (in unix epoch time or ISO8601 time format)
Advantages
Can handle multiple time zones
Can easily query by userID and start date (you will need a secondary index with partition key userID and sort key start time; see the sketch below)
More even distribution and fewer hot keys across DynamoDB partitions, because of the randomly generated partition key
Disadvantages
More data for every item (+16 bytes for the UUID)
Additional cost for the new secondary index; note that scanning the whole table is generally much more expensive than maintaining a secondary index.
This is pretty close to your initial thought. To give a more precise answer, we would need a lot more information about how many writes and reads you are planning and what kinds of queries you will need.
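For concreteness, a sketch of this layout under assumed names (table UserActions, GSI UserStartTimeIndex with partition key UserID and sort key StartTime):
import uuid
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("UserActions")

# Write: a random UUID partition key spreads items evenly.
table.put_item(Item={
    "Id": str(uuid.uuid4()),
    "UserID": "userX",
    "StartTime": "2024-03-01T12:00:00Z",
    "EndTime": "2024-03-01T14:00:00Z",
    "Action": "Tired",
})

# Read: all actions for user X between t1 and t2, via the GSI.
actions = table.query(
    IndexName="UserStartTimeIndex",
    KeyConditionExpression=Key("UserID").eq("userX")
    & Key("StartTime").between(
        "2024-03-01T00:00:00Z", "2024-03-02T00:00:00Z"
    ),
)["Items"]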
You are right that UserID as partition key and StartTime as range key would be the obvious choice, were it not for your overlapping activities.
I would consider going for
UserID - Partition Key
StartTime + uuid - RangeKey
StartTime - Plain old attribute
Datetimes in DynamoDB are just stored as strings anyway. So the idea here is that you use StartTime plus some uuid as your range key, which gives you a table sortable by datetime while also ensuring unique primary keys. You can then store the StartTime in a separate attribute, or have a function for adding/removing the uuid from the StartTime + uuid attribute.
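A tiny sketch of that idea; the helper names are made up:
import uuid

# The ISO 8601 prefix dominates the string comparison, so items still
# sort by datetime even with the uuid suffix appended.
def make_range_key(start_time_iso):
    return f"{start_time_iso}#{uuid.uuid4()}"

def strip_uuid(range_key):
    return range_key.split("#", 1)[0]

item = {
    "UserID": "userX",                                      # partition key
    "StartTimeId": make_range_key("2024-03-01T12:00:00Z"),  # range key
    "StartTime": "2024-03-01T12:00:00Z",                    # plain attribute
    "Action": "Lunch",
    "EndTime": "2024-03-01T13:00:00Z",
}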
I'm new to AWS DynamoDB and wanted to clarify something. Is it possible to query a table and filter based on a non-primary-key attribute? My table looks like the following:
Store
Id: PrimaryKey
Name: simple string
Location: simple string
Now I want to query on the Name, but from what I know I have to supply the key as well? Apart from that, I could use a Scan, but then I would be loading all the data.
From the docs:
The Query operation finds items based on primary key values. You can query any table or secondary index that has a composite primary key (a partition key and a sort key).
DynamoDB requires queries to always use the partition key.
In your case your options are:
create a Global Secondary Index that uses Name as a primary key
use a Scan + Filter if the table is relatively small, or if you expect the result set will include the majority of the records in the table
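Both options in boto3, assuming a table named Store and a GSI hypothetically named NameIndex keyed on Name:
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("Store")

# Option 1: query the GSI directly on Name.
by_name = table.query(
    IndexName="NameIndex",
    KeyConditionExpression=Key("Name").eq("My Store"),
)["Items"]

# Option 2: Scan + Filter; reads every item, acceptable for small tables.
by_name_scan = table.scan(
    FilterExpression=Attr("Name").eq("My Store"),
)["Items"]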
There are a few design principles you can follow when using DynamoDB. If you are coming from a relational background, you have already seen the query limitations around primary key attributes.
Design your tables for querying, and separate hot and cold data.
Create indexes for querying by non-key attributes. (You have two options: a Global Secondary Index, which you can define at any time, and a Local Secondary Index, which you must specify at table creation time.)
With a Global Secondary Index you can promote any non-key attribute to be the partition key of the index and select another attribute as its sort key. With a Local Secondary Index, you can promote any non-key attribute to be the sort key while keeping the same partition key.
Using indexes for queries is also important for making efficient use of provisioned throughput.
Although reading from an index consumes read throughput from the index, it can also save read throughput: if you project only the attributes you need to read, the saving can be huge. Check the following example.
Let's say you have a DynamoDB table whose items are 40 KB each. If you read 10 items directly from the table, that consumes 100 Read Capacity Units (10 units per item, since one unit can read 4 KB, multiplied by 10 items). If you instead define an index that projects just the attributes needed for the listing, say 4 KB per item, it consumes only 10 Read Capacity Units (one unit per item), which makes a huge difference in terms of cost.
With DynamoDB it's really important how you define indexes, to optimize not only query capability but also throughput.
You cannot query on a non-primary-key attribute in DynamoDB.
If you still want to do that, you can use a Scan, but Scan is a costly operation in DynamoDB: it evaluates every item in the table, AWS charges you for every item scanned, and performance suffers as the table grows, so it is not recommended.
There are two ways to achieve what you want:
Keep Store Id as the partition key of your DynamoDB table and add Name or Location as the sort key (only one of them, as DynamoDB accepts only a single attribute as the sort key by design).
Create Global Secondary Indexes for querying by the non-key attributes you need most frequently.
There are three projection options when creating a GSI in DynamoDB. In your case, choose a GSI with the INCLUDE option and add Name, Location, and Store Id to the index (a sketch follows the list below).
KEYS_ONLY – Each item in the index consists only of the table partition key and sort key values, plus the index key values. The KEYS_ONLY option results in the smallest possible secondary index.
INCLUDE – In addition to the attributes described in KEYS_ONLY, the secondary index will include other non-key attributes that you specify.
ALL – The secondary index includes all of the attributes from the source table. Because all of the table data is duplicated in the index, an ALL projection results in the largest possible secondary index.
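A sketch of adding such an index to an existing table; the index name is made up, and a provisioned-capacity table would also need a ProvisionedThroughput block inside Create:
import boto3

client = boto3.client("dynamodb")

client.update_table(
    TableName="Store",  # table name assumed
    AttributeDefinitions=[{"AttributeName": "Name", "AttributeType": "S"}],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "NameIndex",  # hypothetical
            "KeySchema": [{"AttributeName": "Name", "KeyType": "HASH"}],
            "Projection": {
                "ProjectionType": "INCLUDE",
                # Table keys (Id) are always projected; add Location too.
                "NonKeyAttributes": ["Location"],
            },
        }
    }],
)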