I have a requirement where, when I update a row identified by a partition key and sort key, I need to fetch back all old rows with the same partition key. Can this be done in a single request? Note: say my partition key is userid and my sort key is activitytype. When I update the row for one activitytype of a user, I need to fetch all rows for that user across all activity types. Is this possible?
TL;DR: no.
The longer answer is that this is not supported by DynamoDB. You can't even always get all items with the same partition key in one request (that only works when the result set fits in a single response page, i.e. the items are relatively few and not too big), much less issue a single request that updates an item and executes a query with potentially many result pages, all in one go.
The underlying answer to your unasked question is that DynamoDB transactions (TransactWriteItems/TransactGetItems) operate on individual items only; a Query cannot take part in a transaction, so there is no way to combine an update and a multi-item read into one atomic operation.
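As a hedged sketch of the two-step workaround (assumptions: a boto3 Table resource named `table`, and the `userid` key name from the question): you issue the UpdateItem, then run a separate, possibly paginated Query over the whole partition.

```python
def build_query_params(userid, start_key=None):
    """Parameters for a Query over every row in one user's partition."""
    params = {
        "KeyConditionExpression": "userid = :u",
        "ExpressionAttributeValues": {":u": userid},
    }
    if start_key is not None:
        # resume after the previous response page
        params["ExclusiveStartKey"] = start_key
    return params

def fetch_all_activities(table, userid):
    """Second request(s): loop over result pages until LastEvaluatedKey is gone."""
    items, start_key = [], None
    while True:
        resp = table.query(**build_query_params(userid, start_key))
        items.extend(resp["Items"])
        start_key = resp.get("LastEvaluatedKey")
        if start_key is None:
            return items
```

Nothing ties the update and the query together: another writer can change the partition between the two calls, which is exactly the atomicity gap described above.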
Related
I have a question about many-to-many relationships within DynamoDB and what happens on a GSI versus shallow duplication.
Say I want to model the standard many-to-many within social media : a user can follow many other pages and a page has many followers. So, your access patterns are that you need to pull all the followers for a page and you need to see all the pages that a user follows.
If you create an item that has a primary key of the id of the page and a sort key of the user id, this lets you pull all followers for that page.
You could then place a GSI on that table with an inverted index. This would let you list all pages a user is following.
What exactly is happening there? Is DynamoDB duplicating that data somewhere with the keys rearranged? Is this any different than just creating a second item in the table with a partition key of the user and a sort key of the page?
So, you have this item:
Item 1:
PK: FOLLOWEDPAGE#<PageID>
SK: USER#<UserId>
And you can create a GSI that inverts SK and PK, or you could simply create this second item:
Item 2:
PK: FOLLOWINGUSER#<UserId>
SK: PAGE#<PageID>
Other than the fact that you now have to maintain this second item, how is this functionally different?
Does a GSI duplicate items with that index?
Does it duplicate items without that index?
Is DynamoDB duplicating that data somewhere with the keys rearranged?
Yes, a secondary index is an opaque copy of your data. As the docs say: "A secondary index is a data structure that contains a subset of attributes from a table, along with an alternate key to support Query operations." You choose what data gets copied (DynamoDB-speak: projected) to the index.
Is this any different than just creating a second item in the table with a partition key of the user and a sort key of the page?
Apart from the maintenance burden you mention, conceptually they are similar. There are some technical differences between a Global Secondary Index and DIY replication:
A GSI requires its own separately provisioned throughput, although the read and write units consumed and the storage costs incurred are the same for both approaches.
A GSI is only eventually consistent: reads from it can briefly lag writes to the base table.
A Scan operation will be ~2x worse with the DIY approach, because the table is ~2x bigger.
See the Best practices for using secondary indexes in DynamoDB for optimization patterns.
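To make the comparison concrete, here is a sketch of the DIY approach using the item shapes from the question (nothing here is DynamoDB-specific API): each follow relationship is written twice, once per direction. With a GSI you would write only the first item and DynamoDB would maintain the inverted copy for you.

```python
def follow_items(page_id, user_id):
    """The two mirrored items the application must keep in sync itself."""
    return [
        # forward item: all followers of a page live in one partition
        {"PK": f"FOLLOWEDPAGE#{page_id}", "SK": f"USER#{user_id}"},
        # inverted item: all pages a user follows live in one partition
        {"PK": f"FOLLOWINGUSER#{user_id}", "SK": f"PAGE#{page_id}"},
    ]
```

Every create, update and delete of a follow relationship has to touch both items, which is exactly the maintenance burden the question mentions.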
Does a GSI duplicate items with that index?
Yes.
Does it duplicate items without that index?
No.
For my DynamoDB table, I currently have a schema like this:
Partition key - Unique ID, so every item has a completely unique ID
Sort key - none
Attribute - JSON that contains some values
Now, I want to add a new field that will be required for every item and will indicate the specific region (e.g. NA-1, NA-2, JP-1, and so on) and I want to be able to do queries on just this field. For example, I might want to perform a query on my table to retrieve all items with the region NA-1.
My question is should I make this field a GSI? I'm new to DynamoDB so I've been researching online and it seems that using a GSI is preferred when that field may only be present for select items in the table, but my field will be required for every item, so I think using a GSI is not an option.
The other possible option I've seen is performing a scan operation and using a filter expression, but from what I've seen, that's a costly operation because DynamoDB has to look at the entire table part-by-part and then filter afterwards. My table isn't very big right now, but it may become quite large in the future, so I would like a scalable option.
TL;DR Is there some way I can add a mandatory regionID field to my table and perform efficient queries on it? What are some good options I should look into?
Yeah, a GSI might not be the best fit here. Maybe you can somehow make it part of the partition key?
Yes. Perform 2 writes on the table. The first row will be what you are currently writing, and the second row will have your region as the partition key. Do not forget to use transactions, as it is possible that one of the writes does not succeed.
While you can use a GSI, you have to realize that it is eventually consistent. It takes some time to update, and you might get stale data if you query too soon after writing.
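A sketch of the two-write approach using TransactWriteItems (the table name, attribute names, and the assumption that the table is keyed by a generic PK/SK pair are all invented for illustration):

```python
def build_transact_write(table_name, item_id, region, payload):
    """Request body writing the natural row and the region-keyed row
    atomically: either both land or neither does."""
    return {
        "TransactItems": [
            {"Put": {"TableName": table_name,
                     "Item": {"PK": {"S": item_id},
                              "SK": {"S": "META"},
                              "region": {"S": region}, **payload}}},
            {"Put": {"TableName": table_name,
                     "Item": {"PK": {"S": f"REGION#{region}"},
                              "SK": {"S": item_id}, **payload}}},
        ]
    }

# usage with an assumed boto3 client:
# boto3.client("dynamodb").transact_write_items(**build_transact_write(
#     "MyTable", "item-123", "NA-1", {"data": {"S": "..."}}))
```

Querying PK = "REGION#NA-1" then returns every item in that region, strongly consistently, without a GSI.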
DynamoDB is a distributed data store, i.e. it stores data not on a single server but partitions it using the provided partition key (PK). This means your data is spread across multiple servers, which brings the limitation that you can query only a single partition at a time.
Coming back to your query pattern,
retrieve all items with the region X
You need to add region-id as an attribute on the main table and make it part of the GSI. Note that, to avoid key collisions, you need to make the GSI sort key a composite value.
I would recommend using <region>#<unique-id> as the sort key.
This way you can query the GSI like,
where begins_with(SK, 'X#')
Also, if an existing entry moves to a new region or a new entry is created in a region, it will automatically be reflected in the GSI and in your query results.
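One way to make this concrete (a hedged sketch; the index name RegionIndex, the constant partition-key value, and the GSI1PK/GSI1SK attribute names are all assumptions, not DynamoDB requirements): the GSI sort key holds <region>#<unique-id> and begins_with narrows the query to one region.

```python
def region_query_params(region):
    """Query parameters for "all items in region X" via the assumed GSI."""
    return {
        "IndexName": "RegionIndex",
        "KeyConditionExpression": "GSI1PK = :pk AND begins_with(GSI1SK, :prefix)",
        "ExpressionAttributeValues": {
            ":pk": "REGION",          # constant GSI partition key
            ":prefix": f"{region}#",  # e.g. "NA-1#"
        },
    }
```

Design caveat: a constant partition key funnels the whole index into a single partition; at scale it is usually better to make the region itself the GSI partition key and drop the begins_with entirely.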
I have a use case where I have to query on more than 2 attributes of a DynamoDB table. As far as I know, we can only query on up to 2 attributes (partition key, sort key) of a DDB table using a GSI. Is there anything that allows us to query on multiple attributes (say invoiceId, clientId, invoiceStatus) using a GSI?
Yes, this is possible, but you need to take into account every access pattern you want to support when you design your table.
This topic has been discussed at re:Invent multiple times. Here is a video from a few years ago, https://youtu.be/HaEPXoXVf2k?t=2102, but similar talks have been given on the topic every year.
Two main options are using composite keys or query filters.
Composite keys are very powerful and boil down to making new 'synthetic' keys that simply concatenate other fields from your record, then using these in your GSI.
For example, if you have a client for whom you want to be able to get all open invoices, but also want to be able to get an individual invoice, you could use clientId as the partition key and concatenate invoiceStatus and invoiceId together as the sort key. You can then use begins_with to have only invoices with a certain status returned. To fetch one specific invoice, though, you'd have to know both the invoiceStatus and the invoiceId, which makes this not the best example.
The composite key pattern is also useful for dates as you can use greater than or less than to search certain time ranges. However, it is also possible just to directly get the records with the concatenation.
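A hedged sketch of the composite-key idea for the invoice example (the index name ClientStatusIndex and the statusInvoice attribute name are invented): the synthetic sort key concatenates invoiceStatus and invoiceId, and begins_with selects one status.

```python
def invoice_sort_key(status, invoice_id):
    """Synthetic sort key the application maintains on every write."""
    return f"{status}#{invoice_id}"

def open_invoices_params(client_id):
    """Query parameters for all OPEN invoices of one client via the assumed GSI."""
    return {
        "IndexName": "ClientStatusIndex",
        "KeyConditionExpression": "clientId = :c AND begins_with(statusInvoice, :s)",
        "ExpressionAttributeValues": {":c": client_id, ":s": "OPEN#"},
    }
```

The same shape works for the date-range case: with a sort key like "2024-01-15#inv-42", a BETWEEN condition on the sort key selects a time window.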
An alternative design is using query filters. This is less efficient, as DynamoDB has to read every record that matches the partition and sort key condition before applying the filter. However, the filter can be applied to any attribute, and it reduces the amount of data transmitted from DynamoDB to your application. This is useful when your main keys are mostly selective but multiple matches are possible, and the filter gets you the rest of the way there.
The other aspect of a GSI that can help reduce cost is projecting only the attributes you care about. A record update propagates to the GSI only if one of the projected attributes changed. Keeping the GSI skinny makes the previously listed strategies more cost effective.
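The query-filter alternative can be sketched like this (attribute names assumed, as above): the key condition narrows the read to one partition, and the filter on a non-key attribute is applied after the items are read, so capacity is still consumed for every item the key condition matches.

```python
def filtered_invoice_params(client_id, status):
    """Query one client's partition, filtering by status afterwards."""
    return {
        "KeyConditionExpression": "clientId = :c",
        # applied AFTER items are read: saves bytes on the wire,
        # not read capacity units
        "FilterExpression": "invoiceStatus = :s",
        "ExpressionAttributeValues": {":c": client_id, ":s": status},
    }
```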
I would like to be able to filter a paginated result of a Query operation before the limit is taken into consideration. Is there any suggestion for getting correct pagination on filtered results?
I would like to implement a DynamoDB Scan OR Query with the following logic:
Scanning -> Filtering(boolean true or false) -> Limiting(for pagination)
However, I have only been able to implement a Scan OR Query with this logic:
Scanning -> Limiting(for pagination) -> Filtering(boolean true or false)
Note: I have already tried a Global Secondary Index, but it didn't work in my case because I have 5 different attributes to filter and limit on.
Unfortunately DynamoDB cannot do this: once you run a Query on one of your indexes, it reads every item that satisfies your partition and sort key condition.
Let's check your example: you have a boolean and an index over that field. Say 50% of items are false and 50% are true. A query on that index reads through 50% of all items in the table (so it's almost like a Scan). If you set a limit, it reads only that number of items and then stops. You cannot use a limit/offset (skip, page) combination like in other databases.
There is some level of pagination (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.Pagination.html), but it does not allow you to jump to, say, page 10; it only allows you to walk through the pages one by one. On pricing: you pay for the items actually read on each page, so stopping the iteration early does save read capacity.
There is also the limitation that a key can have at most 2 attributes (partition and sort).
EXAMPLE
You wrote that you have 5 parameters you want to query on. The workaround used to address these limitations is to create and maintain extra fields that combine the parameters you want to query. Say you have a table of users with gender, age, name, surname and position. Say it's a huge database, so you have to think about the amount of data you load. Then, if you want to use DynamoDB, you have to think about all the queries you want to run.
You most likely want to search by name and surname, so you create an index with surname as partition key and name as sort key (in such a case you can search by surname alone, or by both surname and name). This works for a lot of names, but you find that some name combinations are too common and you need to filter by position as well. In that case you create a new field (column), e.g. name-surname, and whenever you create or update an item, your app must ensure it contains both parts, e.g. will-smith. Then you can make another index with name-surname as partition key and position as sort key, and use it for such searches.
However, you find that for some name-surname-position combinations you still get too many results, you don't want to handle that at the application level, and you want to limit results by age as well. Then you can create an index with name-surname-position as partition key and age as sort key. At this point you may also notice that the old name-surname field and index can be removed, as they serve no purpose anymore (name and surname are handled by the first index, and for searching just name-surname-position you can use this new one).
Do you also want to query by gender sometimes? It's probably better to handle that at the application level (or as an extra filter in the db query) rather than creating yet another index that must be maintained and paid for. There are only a handful of values, so it's probably cheaper to hide a few results at the application level when someone wants only one gender, but load all of them. An extra index costs you on every single insert, while this filter is used only occasionally. Also, when someone already searches by name, surname and position, you don't expect that many results anyway, so whether you get 20 results (all genders) or just 10 does not make much difference.
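The synthetic-field maintenance described above might look like this in application code (the field names follow the example; the helper itself is hypothetical):

```python
def user_item(name, surname, position, age, gender):
    """Build the item to write; the synthetic key is recomputed on
    every create/update so the assumed index stays consistent."""
    return {
        "name": name,
        "surname": surname,
        "position": position,
        "age": age,
        "gender": gender,
        # partition key of the assumed name-surname-position index
        "name_surname_position": f"{name}-{surname}-{position}".lower(),
    }
```

Routing every write through one builder like this is what keeps the denormalized key from drifting out of sync with its source fields.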
This was just an example of how you can think about and work with DynamoDB. How exactly you use it depends on your business logic.
Very important note: DynamoDB is a very simple database that can only do very simple queries. It has a little more functionality than Redis but a lot less than traditional databases. A valid outcome of thinking through your business model and use cases is that maybe you should NOT use DynamoDB at all, because it simply cannot satisfy your needs and queries.
Some basic thinking can look like this:
Is key-value persistent storage enough? Use DynamoDB
Is key-value persistent storage, where one item can have multiple keys and I can search and filter by a maximum of 2 fields, enough? Use DynamoDB
Is persistent storage, where I want to search a single Table/Collection by many keys with lots of options, enough? Use MongoDB
Do I need to search through multiple tables or do complex joins or need transactions? Use traditional SQL database
I am writing an API, which has a data model with a status field which is boolean.
And 90% of the calls to the API will require a filter on status = "active".
Context:
Currently, I have it as a DynamoDB boolean field and use a filter expression over it. But I am weighing that against creating a separate table whose hash key is the relevant identifier, storing there the item information corresponding to "active" status, since there can be only one item with "active" status for a particular hash key.
Now my questions are:
1. Data integrity is a big question here, since I will be updating two tables depending on the request.
2. Is using separate tables a good practice in DynamoDB in this use case, or am I using the wrong DB?
3. Is query execution with a filter expression efficient enough that I can use the current setup?
Scale of the API usage is medium right now but it is expected to increase.
A filter expression is going to be inefficient, because filter expressions are applied to results after the scan or query is processed. They can save network bandwidth in some cases, but otherwise you could just as well apply the filter in your own code with much the same results and efficiency.
Your other option would be to create a Global Secondary Index (GSI) with a partition key on the boolean field, which might be better if you have significantly fewer "active" records than "inactive" ones. In that case a useful pattern is to create a surrogate field, say "status_active", which you set to TRUE only for active records and leave unset for the others. If you then create a GSI with a partition key on the "status_active" field, it will contain only the active records (items where the key attribute is missing do not get indexed).
The index on a surrogate field is probably the best option as long as you expect the set of active records to be sparse in the table (i.e. there are fewer actives than inactives).
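A sketch of the surrogate-field pattern (field names assumed): the attribute is present only on active records, so a GSI keyed on it indexes only the active subset, a so-called sparse index.

```python
def to_item(record):
    """Copy the record, adding the sparse-index key only when active."""
    item = dict(record)
    if item.get("active"):
        # any value works; it is the attribute's PRESENCE that puts
        # the item into the sparse GSI
        item["status_active"] = "TRUE"
    return item
```

When a record transitions to inactive, the application must remove the "status_active" attribute, which drops the item out of the index automatically.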
If you expect that about 50% of records would be active and 50% would be inactive then having two tables and dealing with transaction integrity on your own might be a better choice. This is especially attractive if records are only infrequently expected to transition between states. DynamoDB provides very powerful atomic counters and conditional checks that you can use to craft a solution that ensures state transitions are consistent.
If you expect that many records would be active and only a few inactive, then using a filter might actually be the best option, but keep in mind that filtered records still count towards your provisioned throughput, so again, you could simply filter them out in the application with much the same result.
In summary, the answer depends on the distribution of values in the status attribute.