I have a table structure consisting of cities and comments. I need to get all comments related to a city, so I have made the city name the primary key for comments. Now when I query my table I can get all the comments related to the city, but I need them in order of votes for each comment, and the vote values are constantly changing. I have considered adding an "order by" to my query, or adding the vote count as a range key and deleting and re-adding the record every time the votes change. These solutions don't seem that efficient, and I was wondering if there is a better way of doing it?
One easy thing you could do is use a local secondary index (LSI): this DynamoDB feature creates a second table whose hash key is the same (the city name) but whose sort key is the number of votes, which remains just an ordinary attribute in your original table. DynamoDB will automatically, and consistently, take care of the second table for you as you modify the first one.
Using an LSI is easier than coding the extra deletion and addition, and more efficient in the sense of less network activity and client work, but it may not be significantly cheaper on your Amazon bill, because DynamoDB charges you extra for that LSI work.
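A minimal boto3 sketch of that idea, assuming a hypothetical Comments table with a commentId sort key (an LSI must be declared at table-creation time; all names here are illustrative):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

# Hypothetical table: city is the hash key, commentId the range key,
# and the LSI re-sorts each city's comments by the votes attribute.
dynamodb.create_table(
    TableName="Comments",
    AttributeDefinitions=[
        {"AttributeName": "city", "AttributeType": "S"},
        {"AttributeName": "commentId", "AttributeType": "S"},
        {"AttributeName": "votes", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "city", "KeyType": "HASH"},
        {"AttributeName": "commentId", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[{
        "IndexName": "VotesIndex",
        "KeySchema": [
            {"AttributeName": "city", "KeyType": "HASH"},
            {"AttributeName": "votes", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)

# Comments for a city, highest votes first -- no client-side re-sorting.
comments = dynamodb.Table("Comments").query(
    IndexName="VotesIndex",
    KeyConditionExpression=Key("city").eq("Athens"),
    ScanIndexForward=False,  # descending by the index sort key (votes)
)["Items"]
```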
Related
I have a question about many-to-many relationships within DynamoDB and what happens on a GSI versus shallow duplication.
Say I want to model the standard many-to-many within social media: a user can follow many pages, and a page has many followers. So your access patterns are that you need to pull all the followers for a page, and you need to see all the pages that a user follows.
If you create an item that has a primary key of the id of the page and a sort key of the user id, this lets you pull all followers for that page.
You could then place a GSI on that item with an inverted index. This would let you see all pages a user is following.
What exactly is happening there? Is DynamoDB duplicating that data somewhere with the keys rearranged? Is this any different than just creating a second item in the table with a primary key of the user and the sort key of the page?
So, you have this item:
Item 1:
PK: FOLLOWEDPAGE#<PageID>
SK: USER#<UserId>
And you can create a GSI and invert SK and PK, or you could simply create this second item:
Item 2:
PK: FOLLOWINGUSER#<UserId>
SK: PAGE#<PageID>
Other than the fact that you now have to maintain this second item, how is this functionally different?
Does a GSI duplicate items with that index?
Does it duplicate items without that index?
Is DynamoDB duplicating that data somewhere with the keys rearranged?
Yes, a secondary index is an opaque copy of your data. As the docs say: "A secondary index is a data structure that contains a subset of attributes from a table, along with an alternate key to support Query operations." You choose what data gets copied (DynamoDB speak: projected) to the index.
Is this any different than just creating a second item in the table with a primary key of the user and the sort key of the page?
Apart from the maintenance burden you mention, conceptually they are similar. There are some technical differences between a Global Secondary Index and DIY replication:
A GSI requires its own separately provisioned throughput, although the read and write units consumed and the storage costs incurred are the same for both approaches.
A GSI is eventually consistent.
A Scan operation will be ~2x worse with the DIY approach, because the table is ~2x bigger.
See Best practices for using secondary indexes in DynamoDB in the AWS documentation for optimization patterns.
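To make the comparison concrete, here is a hedged boto3 sketch of both access patterns against the inverted-index design; the table and index names (SocialTable, InvertedIndex) are assumptions, not anything from the question:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("SocialTable")  # hypothetical name

# Followers of a page: query the base table by its partition key.
followers = table.query(
    KeyConditionExpression=Key("PK").eq("FOLLOWEDPAGE#page-123"),
)["Items"]

# Pages a user follows: query the GSI whose keys are the base table's
# keys swapped (GSI PK = SK, GSI SK = PK). Reads here are eventually
# consistent; ConsistentRead=True is not allowed on a GSI.
following = table.query(
    IndexName="InvertedIndex",  # hypothetical inverted GSI
    KeyConditionExpression=Key("SK").eq("USER#user-456"),
)["Items"]
```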
Does a GSI duplicate items with that index?
Yes.
Does it duplicate items without that index?
No.
Let's say I make a GSI for 'Name' and I have two people in my database who just happen to have the same name:
Tim Cook
Tim Cook
Now this will fail a uniqueness constraint on insert for duplicate values, hence we need another approach.
I was thinking about hashing the name values at the end so that the BEGINS_WITH operator can still be used to search / match on, but that puts you in a weird position. What do you salt with? How many characters? The longer the salt, the more memory and potentially compute you waste cleaning up the salt before returning the results to the user. The shorter the salt, the more likely you are to have collisions. After all, there are some incredibly common names out there.
Here's an example of the values salted:
Tim Cook#ABCDEF
Tim Cook#ZYXWVU
This is great, as I can insert both values now, and I can create a 'search user by name' endpoint via the BEGINS_WITH('Tim Cook') operation, but it feels weird.
I did a bit of searching on sorting and searching by names in DynamoDB and didn't come up with anything meaningful on how to proceed from here. Wondering what you guys think.
My one and final issue is that names are not evenly spread out, so you're inevitably going to have hotter partitions, but I just don't see another way around this, short of exfiltrating the data to another data store, like a full-text search store, and querying it there.
You can't insert into a GSI, so your concern is kind of misplaced.
You also can't GetItem on a GSI, only Query, and that's because there isn't necessarily a single matching item for a given key.
Note: The GSI always projects the primary key over from the base table.
You can use the following schema pattern to achieve your goal:
Partition key: Name
Sort/Range key: createdAt (The creation time of that row)
In this case, if the same name belongs to more than one person, the query will return all of them, sorted automatically. This schema also gives every item a unique composite key, so each item can still be addressed individually.
Partition key -> Sort key
Name -> createdAt
Tim Cook -> "HH:mm:ss"
Each row will have a different creation time and will provide unique composite key values for each item of the table.
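As a rough boto3 illustration of that schema (the Users table name is an assumption), the query needs no salt at all:

```python
import boto3
from boto3.dynamodb.conditions import Key

users = boto3.resource("dynamodb").Table("Users")  # hypothetical table

# Every person named "Tim Cook", sorted by the createdAt sort key.
resp = users.query(KeyConditionExpression=Key("Name").eq("Tim Cook"))
for person in resp["Items"]:
    print(person["Name"], person["createdAt"])
```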
For some reason I thought GSIs had the same uniqueness constraint as partition keys; however, that's not the case: you can have duplicates.
In a DynamoDB table, each key value must be unique. However, the key values in a global secondary index do not need to be unique.
Source
So a GSI is a perfectly good way to store duplicated information. Not sure this question is helpful now, since it came about through ignorance, so it might be worth deleting.
I've been thinking a lot about the possible strategies for querying an unbounded number of items.
For example, think of a forum - you could have any number of forum posts categorized by topic. You need to support at least 2 access patterns: post details view and list of posts by topic.
// legend
PK = partition key, SK = sort key
PK = postId
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
PK = topic and SK = postId#addedDateTime
Great for querying all the posts for a given topic, but they all land in the same partition ("hot partition").
PK = topic#date and SK = postId#addedDateTime
Store items in buckets, e.g. a new bucket for each day. This would push a lot of logic to the application layer and add latency. E.g. if you need to get 10 posts, you'd have to query today's bucket and, if the bucket contains fewer than 10 items, query yesterday's bucket, and so on. Don't even get me started on pagination; that would probably be a nightmare if it crosses buckets.
So my question is: how do you store and query an unbounded list of items the "DynamoDB way"?
I think you've got a good understanding about your options.
I can't profess to know the One True Way™ to solve this particular problem in DynamoDB, but I'll throw out a few thoughts for the sake of discussion.
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
This would definitely be the case if your Primary Key consists solely of the postId (I'll use POST#<postId> to make it easier to read). That table would be a flat collection of items, each keyed by its POST#<postId> alone.
This would be super efficient for the 'fetch post details view' (aka 'fetch post by ID') access pattern. However, we haven't built in any way to access a group of Posts by topic. Let's give that a shot next.
There are a few ways to model the one-to-many relationship between Posts and topics. The first thing that comes to mind is creating a secondary index on the topic field, so that each Post is logically grouped under its topic.
Now we can get an item collection of Posts by topic using the efficient query operation. Pagination will help you if your number of Posts per topic grows larger. This may be enough for your application. For the sake of this discussion, let's assume it creates a hot partition and consider what strategies we can introduce to reduce the problem.
One Option
You said
Store items in buckets, e.g new bucket for each day.
This is a great idea! Let's update our secondary index partition key to be <topic>#<truncated_timestamp> so we can group posts by topic for a given time frame (day/week/month/etc).
I've done a few things here:
Introduced two new attributes to represent the secondary index PK and SK (GSIPK and GSISK respectively).
Introduced a truncated timestamp into the partition key to represent a given month. For example, POST#1 and POST#2 both have a posted_at timestamp in September. I truncated both of those timestamps to 2020-09-01 to represent the entire month of September (or whatever time boundary makes sense for your application).
This will help distribute your data across partitions, reducing the hot key issue. As you correctly note, this will increase the complexity of your application logic and increase latency, since you may need to make multiple requests to retrieve enough results for your application's needs. However, this might be a reasonable trade-off in this situation. If the increased latency is a problem, you could pre-populate a partition to contain the results of the prior N months' worth of a topic discussion (e.g. PK = TOPIC_CACHE#<topic> with a list attribute that contains the postIds from the prior N months).
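A sketch of what that multi-bucket read loop might look like in boto3; the index name (GSI1), table name, and month-bucket format are assumptions carried over from the example above:

```python
import boto3
from boto3.dynamodb.conditions import Key

posts = boto3.resource("dynamodb").Table("Posts")  # hypothetical table

def latest_posts(topic, month_buckets, limit=10):
    """Walk month buckets newest-first until `limit` posts are collected."""
    results = []
    for month in month_buckets:  # e.g. ["2020-10-01", "2020-09-01", ...]
        resp = posts.query(
            IndexName="GSI1",  # hypothetical secondary index
            KeyConditionExpression=Key("GSIPK").eq(f"{topic}#{month}"),
            ScanIndexForward=False,       # newest first within the bucket
            Limit=limit - len(results),
        )
        results.extend(resp["Items"])
        if len(results) >= limit:
            break  # enough posts; no need to touch older buckets
    return results
```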
If the TOPIC_CACHE ends up being a hot partition, you could always shard the partition using a calculated suffix (e.g. TOPIC_CACHE#<topic>#1 through TOPIC_CACHE#<topic>#N).
Your application could randomly select a TOPIC_CACHE between 1..N when retrieving the topic cache.
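A tiny sketch of that read-side shard selection; the shard count and key format are assumptions, and the writer would need to keep all N copies up to date:

```python
import random

N_SHARDS = 4  # assumption: tune to your read volume

def topic_cache_key(topic: str) -> str:
    # Spread cache reads across N duplicated partitions.
    return f"TOPIC_CACHE#{topic}#{random.randint(1, N_SHARDS)}"

print(topic_cache_key("dynamodb"))  # e.g. TOPIC_CACHE#dynamodb#3
```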
There are numerous ways to approach this access pattern, and these options represent only a few possibilities. If it were my application, I would start by creating a secondary index using the Post topic as the partition key. It's the easiest to implement and would give me an opportunity to see how my application access patterns performed in a production environment. If the hot key issue started to become a problem, I'd dive deeper into some sort of caching solution.
For my DynamoDB table, I currently have a schema like this:
Partition key - Unique ID, so every item has a completely unique ID
Sort key - none
Attribute - JSON that contains some values
Now, I want to add a new field that will be required for every item and will indicate the specific region (e.g. NA-1, NA-2, JP-1, and so on) and I want to be able to do queries on just this field. For example, I might want to perform a query on my table to retrieve all items with the region NA-1.
My question is should I make this field a GSI? I'm new to DynamoDB so I've been researching online and it seems that using a GSI is preferred when that field may only be present for select items in the table, but my field will be required for every item, so I think using a GSI is not an option.
The other possible option I've seen is performing a scan operation and using a filter expression, but from what I've seen, that's a costly operation because DynamoDB has to look at the entire table part-by-part and then filter afterwards. My table isn't very big right now, but it may become quite large in the future, so I would like a scalable option.
TL;DR Is there some way I can add a mandatory regionID field to my table and perform efficient queries on it? What are some good options I should look into?
Yeah, a GSI might not be the best fit here. Maybe you can somehow make it part of the partition key?
Yes. Perform two writes on the table. The first row will be what you are currently writing, and the second row will have your region as the partition key. Do not forget to use transactions, as it is possible that one of the writes fails on its own.
While you can use a GSI, you have to realize that it is eventually consistent. It will take some time to update, and you might get inconsistent data if you query soon after writing.
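A hedged sketch of that two-write approach using DynamoDB transactions; the table name and key shapes are assumptions, and it presumes the table has a sort key so multiple items can share a region partition key:

```python
import boto3

client = boto3.client("dynamodb")

# Both puts succeed or fail together, so readers never see half the pair.
client.transact_write_items(
    TransactItems=[
        {"Put": {"TableName": "MyTable",  # hypothetical table
                 "Item": {"PK": {"S": "ITEM#123"},
                          "SK": {"S": "META"},
                          "region": {"S": "NA-1"}}}},
        {"Put": {"TableName": "MyTable",  # duplicate keyed by region
                 "Item": {"PK": {"S": "REGION#NA-1"},
                          "SK": {"S": "ITEM#123"}}}},
    ],
)
```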
DynamoDB is a distributed data store, i.e. it stores the data not on a single server but partitions it using the provided partition key (PK). This means your data is spread across multiple servers, which brings the limitation that you can query only a single partition at a time.
Coming back to your query pattern,
retrieve all items with the region X
You need to add the region ID as an attribute in the main table and make it part of the GSI. Do note that to avoid key collisions you need to make the GSI sort key a composite key.
I would recommend using <region>#<unique-id>.
This way you can query the GSI like:
where begins_with(SK, 'X')
Also, if any of your entries moves to a new region, or a new entry is created in a region, the change will automatically be reflected in the GSI and in your query results.
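Roughly, that query could look like this in boto3; the index name, the constant GSI partition key, and the attribute names are all assumptions, since the answer leaves the GSI partition key open:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("MyTable")  # hypothetical table

# GSI (assumed): GSIPK = a constant grouping value,
#                GSISK = <region>#<unique-id>
resp = table.query(
    IndexName="RegionIndex",  # hypothetical GSI
    KeyConditionExpression=Key("GSIPK").eq("ITEM")
    & Key("GSISK").begins_with("NA-1#"),
)
na1_items = resp["Items"]
```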
I have a use case where I have to query on more than two attributes of a DynamoDB table. As far as I know, we can only query on up to two attributes (partition key, sort key) of a DDB table using a GSI. Is there anything which allows us to query on multiple attributes (say invoiceId, clientId, invoiceStatus) using a GSI?
Yes, this is possible, but you need to take into account every access pattern you want to support when you design your table.
This topic has been discussed at re:Invent multiple times. Here is a video from a few years ago, https://youtu.be/HaEPXoXVf2k?t=2102, but similar talks have been given on the topic every year.
Two main options are using composite keys or query filters.
Composite keys are very powerful and boil down to making new 'synthetic' keys that simply concatenate other fields that you have in your record and then using these in your GSI.
For example, if you have a client where you want to be able to get all of their open invoices but also want to be able to get an individual invoice, you could use clientId as the partition key and concatenate invoiceStatus and invoiceId together as the sort key. You can then use begins_with to have only invoices with a certain status returned. In this example, you'd have to know both the invoiceStatus and the invoiceId to fetch an individual invoice, which makes this not the best example.
The composite key pattern is also useful for dates, as you can use greater-than or less-than comparisons to search certain time ranges. However, it is also possible to fetch a record directly with the full concatenated value.
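For instance, a hedged boto3 sketch of the composite-key idea; the table, index, and attribute names are assumptions:

```python
import boto3
from boto3.dynamodb.conditions import Key

invoices = boto3.resource("dynamodb").Table("Invoices")  # hypothetical

# GSI (assumed): PK = clientId,
#                SK = statusInvoiceId = <invoiceStatus>#<invoiceId>
open_invoices = invoices.query(
    IndexName="ClientStatusIndex",  # hypothetical GSI
    KeyConditionExpression=Key("clientId").eq("client-42")
    & Key("statusInvoiceId").begins_with("OPEN#"),
)["Items"]
```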
An alternative design is using query filters. This is less efficient as DynamoDB will have to scan every record that matches the partition and sort key. However, the filter can be applied to any attribute and reduces the amount of data transmitted from DynamoDB to your application. This is useful when your main keys are mostly selective, but multiple matches are possible and the filter gets you the rest of the way there.
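The query-filter variant of the same lookup might look like this (same hypothetical names as above, repeated so the snippet stands alone):

```python
import boto3
from boto3.dynamodb.conditions import Attr, Key

invoices = boto3.resource("dynamodb").Table("Invoices")  # hypothetical

# The filter runs after the key lookup: matching items are still read
# (and billed for), but only those passing the filter are returned.
resp = invoices.query(
    IndexName="ClientIndex",  # hypothetical GSI keyed by clientId
    KeyConditionExpression=Key("clientId").eq("client-42"),
    FilterExpression=Attr("invoiceStatus").eq("OPEN"),
)
```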
The other aspect of using a GSI that can help reduce cost is projecting only the attributes you care about. When a record is updated, the GSI is only updated if one of the projected attributes changed. By keeping the GSI skinny, you make the previously listed strategies more cost-effective.
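As a sketch, a skinny projection is declared when the GSI is created; everything here (names, projected attributes) is illustrative, and a provisioned-mode table would also need ProvisionedThroughput in the Create block:

```python
import boto3

boto3.client("dynamodb").update_table(
    TableName="Invoices",  # hypothetical on-demand table
    AttributeDefinitions=[
        {"AttributeName": "clientId", "AttributeType": "S"},
        {"AttributeName": "statusInvoiceId", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "ClientStatusIndex",
            "KeySchema": [
                {"AttributeName": "clientId", "KeyType": "HASH"},
                {"AttributeName": "statusInvoiceId", "KeyType": "RANGE"},
            ],
            # Copy only what the queries need; updates to unprojected
            # attributes never touch (or bill) the index.
            "Projection": {"ProjectionType": "INCLUDE",
                           "NonKeyAttributes": ["amount", "dueDate"]},
        },
    }],
)
```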