DynamoDB - GSI versus duplication - amazon-dynamodb

I have a question about many-to-many relationships within DynamoDB and what happens on a GSI versus shallow duplication.
Say I want to model the standard many-to-many relationship in social media: a user can follow many pages and a page has many followers. So the access patterns are that you need to pull all the followers for a page and all the pages that a user follows.
If you create an item that has a partition key of the page ID and a sort key of the user ID, this lets you pull all followers for that page.
You could then place a GSI on that item with an inverted index. This would let you query all pages a user is following.
What exactly is happening there? Is DynamoDB duplicating that data somewhere with the keys rearranged? Is this any different than just creating a second item in the table with a partition key of the user and a sort key of the page?
So, you have this item:
Item 1:
PK SK
FOLLOWEDPAGE#<PageID> USER#<UserId>
And you can create a GSI and invert SK and PK, or you could simply create this second item:
Item 2:
FOLLOWINGUSER#<UserId> PAGE#<PageID>
Other than the fact that you now have to maintain this second item, how is this functionally different?
Does a GSI duplicate items with that index?
Does it duplicate items without that index?

Is DynamoDB duplicating that data somewhere with the keys rearranged?
Yes, a secondary index is an opaque copy of your data. As the docs say: A secondary index is a data structure that contains a subset of attributes from a table, along with an alternate key to support Query operations. You choose what data gets copied (DynamoDB speak: projected) to the index.
Is this any different than just creating a second item in the table with a partition key of the user and a sort key of the page?
Apart from the maintenance burden you mention, conceptually they are similar. There are some technical differences between a Global Secondary Index and DIY replication:
A GSI requires its own separately provisioned throughput, although the read and write units consumed and storage costs incurred are the same for both approaches.
A GSI is eventually consistent.
A Scan operation will be ~2x worse with the DIY approach, because the table is ~2x bigger.
See the Best practices for using secondary indexes in DynamoDB for optimization patterns.
Does a GSI duplicate items with that index?
Yes.
Does it duplicate items without that index?
No.
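To make the "copy with the keys rearranged" idea concrete, here is a minimal in-memory sketch in Python. It is not how DynamoDB is implemented and it calls no AWS API; the names `base_table`, `build_index`, and `query` are hypothetical. It only illustrates that an inverted index holds the same items grouped by SK and sorted by PK, so you store each relationship once and still serve both access patterns.

```python
from collections import defaultdict

# Base table: each follow relationship is written once, keyed by (PK, SK).
base_table = [
    {"PK": "FOLLOWEDPAGE#p1", "SK": "USER#u1"},
    {"PK": "FOLLOWEDPAGE#p1", "SK": "USER#u2"},
    {"PK": "FOLLOWEDPAGE#p2", "SK": "USER#u1"},
]


def build_index(items, partition_attr, sort_attr):
    """Group items by one attribute and sort each group by another."""
    index = defaultdict(list)
    for item in items:
        index[item[partition_attr]].append(item)
    for group in index.values():
        group.sort(key=lambda item: item[sort_attr])
    return index


def query(index, partition_value, project):
    """Return one attribute from every item stored under a partition value."""
    return [item[project] for item in index.get(partition_value, [])]


primary = build_index(base_table, "PK", "SK")   # followers of a page
inverted = build_index(base_table, "SK", "PK")  # pages followed by a user

followers_of_p1 = query(primary, "FOLLOWEDPAGE#p1", "SK")
pages_followed_by_u1 = query(inverted, "USER#u1", "PK")
```

With the DIY approach you would instead write a second item yourself for every follow, and keep the two in sync on every update and delete.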

Related

DynamoDB GSI table replication when partition key is same as the main table

Case 1) When we create a GSI with a partition key different from the main table's partition key, DynamoDB replicates the data into another table under the hood. That much is understood.
Case 2) What if I create a GSI with the same partition key as the main table's PK but just with a different sort key? Will it replicate the data the same way as in Case 1? This situation sounds similar to an LSI because they also share the partition key with the main table. If I created an LSI instead, would it save me any data replication and hence the cost associated with it?
Yes, it replicates the same as Case 1. In general people should use GSIs unless they absolutely require LSIs.
Pros of an LSI:
Enables strongly consistent reads out of the index
Cons of an LSI:
Cannot be added or deleted after table creation
Prevents an item collection (items having the same PK) from growing beyond 10 GB (because to maintain strong reads the item collection has to be co-located)
Prevents adaptive capacity from isolating hot items in the item collection across different partitions (again, due to the need to be co-located)
Increases the likelihood of a hot partition because the base table write and LSI writes always go to the same partition, limiting write throughput to that partition (whereas a GSI has its own write capacity)
It's not actually true to say LSIs don't cost extra. They still consume write capacity, just out of the base table's allotment.
Any GSI regardless of the key is a separate table you pay extra for.
An LSI doesn't cost quite as much extra as a GSI, especially if you are using a provisioned table. Additionally, an LSI has strongly consistent reads available, just like the base table. GSIs only offer eventually consistent reads.
However, the downside of using an LSI instead of a GSI is that a table with an LSI is limited to an item collection size of 10 GB.
In other words, if you try to add data beyond 10 GB under the same partition (aka hash) key and the table has any LSIs, the write will fail.
If there are no LSIs, then it will succeed.
Item collection size limit
The maximum size of any item collection for a table which has one or more local secondary indexes is 10 GB. This does not apply to item collections in tables without local secondary indexes, and also does not apply to item collections in global secondary indexes. Only tables that have one or more local secondary indexes are affected.
So depending on your data, it might behoove you to pay for the GSI even if an LSI would work instead.
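As a rough sketch of Case 2 (same partition key, different sort key), the function below builds arguments in the shape of a CreateTable request with either an LSI or a GSI. The table name, index name, and the `AltSK` attribute are made up for illustration, and nothing is sent to AWS here; with boto3 you would pass the result to a client's `create_table(**kwargs)`. Note that the LSI variant can only be declared at table creation, whereas a GSI could also be added later.

```python
def table_definition(index_kind):
    """Build CreateTable-style arguments with one secondary index.

    index_kind is "LSI" or "GSI". Either index reuses the table's partition
    key (PK) and differs from the table only in its sort key (AltSK).
    """
    index = {
        "IndexName": "by-alt-sk",  # hypothetical index name
        "KeySchema": [
            {"AttributeName": "PK", "KeyType": "HASH"},
            {"AttributeName": "AltSK", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }
    definition = {
        "TableName": "example-table",  # hypothetical table name
        "BillingMode": "PAY_PER_REQUEST",
        "AttributeDefinitions": [
            {"AttributeName": "PK", "AttributeType": "S"},
            {"AttributeName": "SK", "AttributeType": "S"},
            {"AttributeName": "AltSK", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "PK", "KeyType": "HASH"},
            {"AttributeName": "SK", "KeyType": "RANGE"},
        ],
    }
    if index_kind == "LSI":
        # Shares partitions with the base table: strongly consistent reads,
        # but the 10 GB item collection limit applies.
        definition["LocalSecondaryIndexes"] = [index]
    elif index_kind == "GSI":
        # Maintained as a separate copy: eventually consistent reads only.
        definition["GlobalSecondaryIndexes"] = [index]
    else:
        raise ValueError("index_kind must be 'LSI' or 'GSI'")
    return definition


lsi_table = table_definition("LSI")
gsi_table = table_definition("GSI")
```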

Dynamodb using partition key in a global secondary index

New to DynamoDB. I have the partition key group_id and the sort key groupid_storeid_sortk.
I want to set up an additional access pattern with group_id and store_addrss_sortk.
Will using the partition key in the secondary index have any impact on performance, or would it be better to create a new attribute as the secondary key, even though it would be duplicate data?
Thank you
It’s fine to use the same partition key attribute again as the PK for the GSI. No problem there.
For the future: You may want to watch some videos on single-table design and start using PK/SK as generic names since you might want to overload what’s inside them for different items. And then you might want GSI1PK/GSI1SK as the GSI keys.
That’s a style thing when you aim for some optimizations single-table design can bring.
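For illustration only, an item using those generic key names might be built like this. The `GROUP#`, `STORE#`, and `ADDR#` prefixes and the function name are my own assumptions, not something from the question; the point is just that the GSI reuses the same partition value while sorting on a different attribute.

```python
def store_item(group_id, store_id, store_address):
    """Build one item with generic table keys and generic GSI keys."""
    return {
        # Base table keys: all stores of a group, sorted by store ID.
        "PK": f"GROUP#{group_id}",
        "SK": f"STORE#{store_id}",
        # GSI keys: same partition value, sorted by address instead.
        "GSI1PK": f"GROUP#{group_id}",
        "GSI1SK": f"ADDR#{store_address}",
        "store_address": store_address,
    }


item = store_item("g42", "s7", "12 Main St")
```

Because the key attributes are generic, a different item type in the same table can put something else entirely into PK/SK and GSI1PK/GSI1SK.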
An index is simply another table that you don't have to manage yourself. When you create an index, the service (DynamoDB, for example) creates a new table for you and manages the synchronization of the data between the tables.
In DynamoDB you have two types of secondary indexes, global and local. If you use the same partition key, you can use either of these options. However, you have to define a local secondary index (LSI) when you create the table and you can't add it later. Only global secondary indexes (GSI) can be added after the creation of the table. You can read more about it in the DynamoDB documentation.
Regarding performance, you need to consider the cost (read/write capacity) on top of the usual time considerations. Check whether you write to the table a lot, not only read from it a lot. Based on that you can carefully plan the projection of the data into the new index. Remember that writes are several times more expensive than reads: a write capacity unit costs about five times as much as a read capacity unit and covers 1 KB rather than 4 KB. You can read more about projection best practices here.

Should I make this field a GSI, a regular attribute, or something else in order to have efficient queries?

For my DynamoDB table, I currently have a schema like this:
Partition key - Unique ID, so every item has a completely unique ID
Sort key - none
Attribute - JSON that contains some values
Now, I want to add a new field that will be required for every item and will indicate the specific region (e.g. NA-1, NA-2, JP-1, and so on) and I want to be able to do queries on just this field. For example, I might want to perform a query on my table to retrieve all items with the region NA-1.
My question is should I make this field a GSI? I'm new to DynamoDB so I've been researching online and it seems that using a GSI is preferred when that field may only be present for select items in the table, but my field will be required for every item, so I think using a GSI is not an option.
The other possible option I've seen is performing a scan operation and using a filter expression, but from what I've seen, that's a costly operation because DynamoDB has to look at the entire table part-by-part and then filter afterwards. My table isn't very big right now, but it may become quite large in the future, so I would like a scalable option.
TL;DR Is there some way I can add a mandatory regionID field to my table and perform efficient queries on it? What are some good options I should look into?
Yeah, a GSI might not be the best fit here. Maybe you can somehow make it part of the partition key?
Yes. Perform 2 writes on the table. The first row will be what you are currently writing, and the second row will have your region as the partition key. Do not forget to use transactions, as it is possible that one of the writes does not succeed.
While you can use a GSI, you have to realize that it is eventually consistent. It will take some time to update, and you might get stale data if you query too soon after writing.
DynamoDB is a distributed data store, i.e. it does not store the data on a single server but partitions it using the provided partition key (PK). This means your data is spread across multiple servers, which brings the limitation that a Query can target only a single partition key at a time.
Coming back to your query pattern,
retrieve all items with the region X
You need to add region-id as an attribute in the main table and make it part of the GSI. Do note that to avoid conflicts you need to make the GSI SK a composite SK.
I would recommend using <region>#<unique-id>
This way you can query the GSI like,
where begins_with(SK, 'X')
Also, if any of your entries moves to a new region, or a new entry is created in a region, it will automatically be reflected in the GSI and in your query results.
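As a sketch of what "retrieve all items with the region X" could look like against such an index, the function below builds arguments in the shape of a low-level Query request. I am assuming a GSI called `region-index` whose partition key is an attribute named `regionId`; both names, and the table name, are hypothetical. The attribute is referenced through an expression attribute name placeholder, which sidesteps any clash with DynamoDB's reserved words.

```python
def region_query_kwargs(region):
    """Build Query arguments that fetch every item in one region via a GSI."""
    return {
        "TableName": "items",          # hypothetical table name
        "IndexName": "region-index",   # hypothetical GSI keyed on regionId
        "KeyConditionExpression": "#region = :region",
        "ExpressionAttributeNames": {"#region": "regionId"},
        "ExpressionAttributeValues": {":region": {"S": region}},
    }


kwargs = region_query_kwargs("NA-1")
```

With boto3 this would be passed as `client.query(**kwargs)`. Unlike a Scan with a filter expression, the request only reads items stored under that one region in the index.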

How to query on more than 2 attributes in DynamoDB using GSI?

I have a use case where I have to query on more than 2 attributes of a DynamoDB table. As far as I know, we can only query on up to 2 attributes (partition key, sort key) of a DynamoDB table using a GSI. Is there anything that allows us to query on multiple attributes (say invoiceId, clientId, invoiceStatus) using a GSI?
Yes, this is possible, but you need to take into account every access pattern you want to support when you design your table.
This topic has been discussed at re:Invent multiple times. Here is a video from a few years ago https://youtu.be/HaEPXoXVf2k?t=2102 but similar talks have been given on the topic every year.
Two main options are using composite keys or query filters.
Composite keys are very powerful and boil down to making new 'synthetic' keys that simply concatenate other fields that you have in your record and then using these in your GSI.
For example, if you have a client where you want to be able to get all of their open invoices but also want to be able to get an individual invoice, you could use clientId as the partition key and concatenate invoiceStatus and invoiceId together as the sort key. You can then use begins_with to have only a certain invoice status returned. In this example, fetching an individual invoice means you'd have to know both its invoiceStatus and its invoiceId, which makes this not the best example.
The composite key pattern is also useful for dates as you can use greater than or less than to search certain time ranges. However, it is also possible just to directly get the records with the concatenation.
An alternative design is using query filters. This is less efficient, as DynamoDB still has to read every item that matches the partition and sort key condition before the filter is applied. However, the filter can be applied to any attribute and reduces the amount of data transmitted from DynamoDB to your application. This is useful when your main keys are mostly selective, but multiple matches are possible and the filter gets you the rest of the way there.
The other aspect of using a GSI that can help reduce cost is projecting only the attributes you care about. When a record is updated the GSI only updates if one of the projected attributes is updated. By keeping the GSI skinny it makes the previously listed strategies more cost effective.
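Here is a small in-memory sketch of the composite key idea, with plain Python standing in for DynamoDB. The helper names are hypothetical and `str.startswith` plays the role of `begins_with` on the sort key; no AWS call is made.

```python
def invoice_item(client_id, invoice_status, invoice_id):
    """Build an item whose sort key concatenates status and invoice ID."""
    return {
        "PK": client_id,
        "SK": f"{invoice_status}#{invoice_id}",
        "invoiceStatus": invoice_status,
        "invoiceId": invoice_id,
    }


def query_begins_with(items, pk, sk_prefix):
    """Mimic a Query: exact partition key match plus a sort key prefix."""
    matches = [
        item for item in items
        if item["PK"] == pk and item["SK"].startswith(sk_prefix)
    ]
    return sorted(matches, key=lambda item: item["SK"])


invoices = [
    invoice_item("client-1", "OPEN", "inv-001"),
    invoice_item("client-1", "PAID", "inv-002"),
    invoice_item("client-1", "OPEN", "inv-003"),
    invoice_item("client-2", "OPEN", "inv-004"),
]

# All open invoices for one client: prefix match on the status part.
open_ids = [i["invoiceId"] for i in query_begins_with(invoices, "client-1", "OPEN#")]

# A single invoice: requires knowing both status and ID, as noted above.
single = query_begins_with(invoices, "client-1", "PAID#inv-002")
```

The trailing `#` in the prefix matters: without it, a status such as `OPENED` would also match a search for `OPEN`.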

What's the recommended index schema for dynamo for a typical crud application?

I've been reading some DynamoDB index docs and they've left me more confused than anything. Let's clear the air with a concrete example.
I have a simple calendar application, where I have an events table. Here are the columns I have:
id: guid,
name: string,
startTimestamp: integer,
calendarId: guid (foreign key in a traditional RDBMS model)
ownerId: guid (foreign key in a traditional RDBMS model)
I'd like to perform queries such as:
Get an event by ID
Get all events where calendarId = x and ownerId = y
Get all events where startTimestamp is between x and y and calendarId = z
DynamoDB docs seem to heavily suggest avoiding using the event's ID as a partition/sort key here, so what's the recommended schema?
This is a problem that everyone wrestles with when they start with (and indeed when they are experienced with) DynamoDB.
Pricing and throughput
Let's start with how DynamoDB is priced (it's related - honestly). Ignoring the free tier for a moment, you pay $0.25 per GB per month for data at rest. You also pay $0.47 per Write Capacity Unit (WCU) per month and $0.09 per Read Capacity Unit (RCU) per month. Throughput is the number of WCUs and RCUs on your table. You have to specify throughput up front on your table - the volume of writes and reads you can perform on your table is limited by your throughput provision. Pay more money and you can do more reads and writes per second. The exact details of how DynamoDB partitions tables can be found in this answer.
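To put numbers on that, here is a back-of-the-envelope calculation using the prices quoted above. Treat the constants as the figures from this answer rather than current AWS pricing, which varies by region and changes over time; the function name is my own.

```python
# Monthly prices as quoted in this answer (USD); may well be out of date.
PRICE_PER_GB = 0.25
PRICE_PER_WCU = 0.47
PRICE_PER_RCU = 0.09


def monthly_cost(storage_gb, wcus, rcus):
    """Estimate the monthly bill for a provisioned table, ignoring free tier."""
    return (
        storage_gb * PRICE_PER_GB
        + wcus * PRICE_PER_WCU
        + rcus * PRICE_PER_RCU
    )


# 10 GB at rest with 10 WCUs and 10 RCUs provisioned.
estimate = round(monthly_cost(10, 10, 10), 2)
```

A GSI gets its own storage and throughput line items on top of this, which is why projecting fewer attributes into it saves money.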
Keys
Now we need to consider table partitioning. Tables must have a primary key. A primary key must have a hash key (aka a partition key) and may optionally have a sort key (aka a range key). DynamoDB creates partitions based on your hash key values. Within a partition key value the data is sorted by range key, if you have specified one.
Data Access
If you have the exact primary key (hash key and range key if there is one), you can instantly access an item using GetItem. If you have multiple items to get, you can use BatchGetItem.
DynamoDB can only 'search' data in two ways. A Query reads from a single partition key value per call; because it uses the partition key (and optionally a sort key condition), it is quick. A Scan always evaluates every item in the table, so it's typically slow and doesn't scale well on large tables.
Throughput distribution
This is where it gets interesting. DynamoDB takes all the throughput you have purchased and evenly spreads it over all of your table partitions. Imagine you have 10 WCUs and 10 RCUs on your table, and 5 partitions; that means you have 2 WCUs and 2 RCUs per partition. That's fine if you access each partition evenly, as you get to use all of your purchased throughput. But imagine you only ever access one partition. Now you've purchased 10 WCUs and RCUs but you are only using 2. Your table is going to be much slower than you thought. One option is to just buy more throughput; that will work, but it's probably not very satisfactory to most engineers.
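The arithmetic in that example, written out under the even-split model described in this answer (the function names are mine):

```python
def per_partition_units(total_units, partitions):
    """Capacity each partition receives when throughput is split evenly."""
    return total_units / partitions


def usable_units(total_units, partitions, partitions_accessed):
    """Capacity you can actually use if only some partitions receive traffic."""
    return per_partition_units(total_units, partitions) * partitions_accessed


each = per_partition_units(10, 5)     # 10 WCUs over 5 partitions
hot_only = usable_units(10, 5, 1)     # all traffic lands on one partition
even = usable_units(10, 5, 5)         # traffic spread over every partition
```

So under this model a single hot partition leaves most of the purchased capacity idle, while evenly spread traffic can use all of it.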
Uniform Access v Natural Access
Based on the above we know we want to design a table where each partition gets accessed evenly. However, in my experience people get too hung up about this, which is not surprising if you read the article I just linked (which you also linked).
Remember that partition keys are what we use in a Query to get our data fast and avoid regular Scans. Some people get too focussed on making their partition access perfectly uniform, and end up with a table they can't query quickly.
The answer
I like to refer to the Best Practices for Tables guide, and particularly the table where it says User ID is a good partition key so long as many users access your application regularly. (It actually says where you have many users - which is not correct, the size of the table is irrelevant).
It's a balance between uniform access and being able to use intuitive, natural queries for your application, but what I am saying is, if you are new to DynamoDB, the right answer probably is to design your table based on intuitive access. After you've done that successfully, have a think about uniform access and hot partitions, but just remember access doesn't have to be perfectly uniform. There are various design patterns to achieve both intuitive and uniform access, but these can be complicated for those starting out and in many cases can probably discourage people from using DynamoDB if they get too focussed on the uniform access idea.
Tips
Most applications will have users. In most applications, the most common query you will do is get data for a user. So the first option for most applications' primary partition key will often be a user id. That's fine, as long as you don't have a few very high hitting users and many users that never log in.
Another tip. If your table is called vegetables, your primary partition key will probably be vegetable id. If your table is called shoes, your primary partition key will probably be shoe id.
Most applications will have many items for each user (or vegetable or shoe). The primary key has to be unique. A good option often is to add a date range (sort) key - perhaps the datetime the item was created. This then orders the items within the user partition by creation date, and also gives each item a unique composite primary key (i.e. hash key + range key). It's also fine to use a generated UUID as a range key; you won't use the ordering it gives you, but you can then have many items per user and still use the Query function.
Indexes are not a solution
Aha! But I can just make my partition key totally random, then apply an index with a partition key of the attribute I really want to query on. That way I get uniform access AND fast intuitive queries.
Sadly not. Indexes have their own throughput and partitioning, separate from the table the index is built on. Just imagine indexes as a whole new table - that's basically what they are. Indexes are not a workaround for uneven partition access.
Finally - your schema
Primary Key
Hash Key: Event ID
Range Key: None
Global Secondary index
Hash Key: Calendar ID
Range Key: startTimestamp
Assuming Event ID is uniformly accessed, it would be a great hash key. You would really need to describe how your data is distributed to discuss this much more. Other things that come into play are how fast you want queries to work and how much you are willing to pay (e.g. secondary indexes are expensive).
And your queries:
Get an event by ID
GetItem using Event ID
Get all events where calendarId = x and ownerId = y
Query by GSI partition key, add a condition on ownerId
Get all events where startTimestamp is between x and y and calendarId = z
Query by GSI partition key, add a condition on range key
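As a sketch, the two GSI queries could be expressed as low-level Query arguments like this. The table name `events` and index name `calendar-start-index` are assumptions of mine, and I am reading "a condition on ownerId" as a filter expression, since ownerId is not part of the index key in the schema above. Nothing is sent to AWS here; with boto3 each dict would be passed to `client.query(**kwargs)`.

```python
TABLE_NAME = "events"                 # hypothetical table name
INDEX_NAME = "calendar-start-index"   # hypothetical GSI: calendarId + startTimestamp


def events_in_range_kwargs(calendar_id, start_ts, end_ts):
    """Events of one calendar whose startTimestamp lies in [start_ts, end_ts]."""
    return {
        "TableName": TABLE_NAME,
        "IndexName": INDEX_NAME,
        "KeyConditionExpression": (
            "calendarId = :cal AND startTimestamp BETWEEN :lo AND :hi"
        ),
        "ExpressionAttributeValues": {
            ":cal": {"S": calendar_id},
            ":lo": {"N": str(start_ts)},   # numbers travel as strings
            ":hi": {"N": str(end_ts)},
        },
    }


def events_for_owner_kwargs(calendar_id, owner_id):
    """Events of one calendar, filtered to one owner after the read."""
    return {
        "TableName": TABLE_NAME,
        "IndexName": INDEX_NAME,
        "KeyConditionExpression": "calendarId = :cal",
        "FilterExpression": "ownerId = :owner",
        "ExpressionAttributeValues": {
            ":cal": {"S": calendar_id},
            ":owner": {"S": owner_id},
        },
    }


range_kwargs = events_in_range_kwargs("cal-1", 1700000000, 1700086400)
owner_kwargs = events_for_owner_kwargs("cal-1", "owner-9")
```

The first query is narrowed by the index itself; the second still reads every event in the calendar and only discards non-matching owners afterwards, so it is billed for the full read.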
I just want to add something to the accepted answer:
Get all events where calendarId = x and ownerId = y
Query by GSI partition key, add a condition on ownerId
This method is not reliable. I guess that when you say "add a condition on ownerId", you mean "add a Filter expression on ownerId" (Definition by Alex DeBrie)
But the 1 MB read limit in DynamoDB makes it unreliable.
It is better explained in the link above, but here is the summary:
If your calendar has a lot of events, amounting to more than 1 MB of data, a single Query call reads only the first 1 MB and applies the ownerId filter to that page alone. Matching events beyond that point are not returned unless you keep requesting further pages with LastEvaluatedKey.
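One way to cope with the 1 MB limit is to keep querying until the response no longer carries a `LastEvaluatedKey`. The sketch below assumes a client object with a boto3-style `query` method; `FakeClient` is a stand-in I made up so the loop can be run without AWS, and it deliberately returns an empty first page to show that a filtered page can contain no items and still not be the last one.

```python
def query_all_pages(client, **query_kwargs):
    """Run a Query repeatedly until DynamoDB reports no further pages."""
    kwargs = dict(query_kwargs)
    items = []
    while True:
        page = client.query(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        # Resume the next request where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


class FakeClient:
    """Hypothetical stand-in for a DynamoDB client, for demonstration only."""

    def __init__(self):
        self.start_keys = []

    def query(self, **kwargs):
        start = kwargs.get("ExclusiveStartKey")
        self.start_keys.append(start)
        if start is None:
            # First page: the filter removed everything, but more data exists.
            return {"Items": [], "LastEvaluatedKey": {"id": {"S": "e1"}}}
        return {"Items": [{"id": {"S": "e2"}}]}


fake = FakeClient()
all_items = query_all_pages(fake, TableName="events")
```

Pagination fixes correctness, not cost: every page is still read and billed before the filter is applied, so an index keyed on the attribute you filter by remains the cheaper design for large calendars.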
