Dynamodb - automatically updating GSI - amazon-dynamodb

When you update an item on a table, will all items be reflected on the index, including its primary key(s) and projections? The docs mention that it does, but it wasn't specific enough.
Also, the image below shows an example of a composite key made up of Status and Date fields in the table. It was specifically to show how modeling can drastically improve performance. However, I couldn't help but wonder: Does StatusDate update itself when Status changes? How does this work?

Yes, when you update an item, all projections and indexes are updated as well.
However, both Local Secondary Index and Global Secondary Index behave in an eventually consistent fashion. This means you can end up reading data that is stale.
Fortunately, at least on Local Secondary Indexes, you can request a strongly consistent read.
Now, about StatusDate and composite indexes, this is done on the application level.

Related

Is it sensible to create unused generic local secondary indexes on my DynamoDB tables in case I need them later?

I currently have a need to add a local secondary index to a DynamoDB table but I see that they can't be added after the table is created. It's fine for me to re-create the table now while my project is in development, but it would be painful to do that later if I need another index when the project is publicly deployed.
That's made me wonder whether it would be sensible to re-create the table with the maximum number of secondary indexes allowed even though I don't need them now. The indexes would have generically-named attributes that I am not currently using as their sort keys. That way, if I ever need another local secondary index on this table in the future, I could just bring one of the unused ones into service.
I don't think it would be waste of storage or a performance problem because I understand that the indexes will only be written to when an item is written that includes the attribute that they index on.
I Googled to see if this idea was a common practice, but haven't found anyone talking about it. Is there some reason why this wouldn't be a good idea?
Don’t do that. If a table has any LSIs it follows different rules and cannot grow an item collection beyond 10 GB or isolate hot items within an item collection. Why incur these downsides if you don’t need to? Plus later you can always create a GSI instead of an LSI.

Dynamodb using partition key in a global secondary index

New to DynamoDB, I have the partition group_id, and sort key groupid_storeid_sortk.
I am wanting to setup additional access pattern with the group_id and store_addrss_sortk.
Will this have any impact on performance using the partition key in the secondary index, or would it be better to create a new attribute as the secondary key, even though it would be duplicate data.
ThankYou
It’s fine to use the same partition key attribute again as the PK for the GSI. No problem there.
For the future: You may want to watch some videos on single-table design and start using PK/SK as generic names since you might want to overload what’s inside them for different items. And then you might want GSI1PK/GSI1SK as the GSI keys.
That’s a style thing when you aim for some optimizations single-table design can bring.
An index is simply another table that you don't have to manage yourself. When you create an index, the service (DynamoDB, for example) creates a new table for you and manages the synchronization of the data between the tables.
In DynamoDB you have two types of secondary indexes, Global and Local. If you use the same partition key, you can use both of these options. However, you have to define the secondary local index (SLI) when you create the table and you can't add it later. Only secondary global indexes (SGI) can be added after the creation of the table. You can read more about it in DyanmoDB documentation.
Regarding performance, you need to consider the cost (read/write capacity) on top of the usual time considerations. You need to see if you are writing a lot to the table and not only reading a lot. Based on that you can plan carefully the projection of the data into the new index. Remember that writes are about 10 times more expensive and slower than reads. You can read more about projection best practices here.

DynamoDB: Using filtered expression vs creating separate table with selected data for efficiency

I am writing an API, which has a data model with a status field which is boolean.
And 90% of the calls to the API will require filter over that status = “active"
Context:
Currently, I have it as a DyanmoDB Boolean field and use a filtered expression over it but I am contending the decision over creating a separate table with the relevant identifier which acts as a hash key for the query and saving corresponding item information corresponding to "active" status, as there can be only one item with "active" status in the item for a particular hash key.
Now my questions are:
Data integrity is a big question here since I will be updating two
tables depending upon the request.
Is using separate tables a good practice in Dynamo DB in this use
case or I am using a wrong DB?
Is the query execution over filtered expression efficient enough and
I can use the current setup?
Scale of the API usage is medium right now but it is expected to increase.
A filter expression is going to be inefficient because filter expressions are applied to results after the scan or query is processed. They could save on network bandwidth in some cases but otherwise you could just as well apply the filter in your own code with pretty much the same results and efficiency.
You other option would be to create a Global Secondary Index (GSI) with a partition key on the boolean field, which might be better if you have significantly less "active" records than "inactive". In that case a useful pattern is to create a surrogate field, say "status_active", which you set to TRUE only for active fields, and to NULL for others. Then, if you create a GSI with a partition key on the "status_active" field it will contain only the active records (NULL values do not get indexed).
The index on a surrogate field is probably the best option as long as you expect than the set of active records is sparse in the table (ie. there's less actives than inactives).
If you expect that about 50% of records would be active and 50% would be inactive then having two tables and dealing with transaction integrity on your own might be a better choice. This is especially attractive if records are only infrequently expected to transition between states. DynamoDB provides very powerful atomic counters and conditional checks that you can use to craft a solution that ensures state transitions are consistent.
If you expect that many records would be active and only a few inactive, then using a filter might actually be the best option, but keep in mind that filtered records still count towards your provisioned throughput, so again, you could simply filter them out in the application with much the same result.
In summary, the answer depends on the distribution of values in the status attribute.

Riak solution for querying data by books or unique pages

Consider a set of data called Library, which contains a set of Books and each book contains a set of Pages.
Let's say you are using Riak to store this data, and you need to be access the data in two possible ways:
- Query for a particular page (with a unique id)
- Query for all pages in a particular book (with a unique name)
Additionally, you need to be able to easily update and delete pages of a particular Book.
What would be the best way to accomplish this in Riak?
Obviously Riak Search will do the trick, but maybe is inefficient for what I am trying to do. I am wondering if it makes sense to set up buckets where each bucket can be a Book (which would make for potentially millions of "Book" buckets). Maybe that is a bad idea...
Can this be accomplished with secondary indexes?
I am trying to keep this simple...
I am new to Riak and I am trying to find the best way to accomplish something that is probably relatively simple. I would appreciate any help from the Stack Overflow community. Thanks!
A common way to model master-detail relationships in Riak is to have the master record contain a list of detail record IDs, possibly together with some information about the detail record that may be useful when deciding which detail records to retrieve.
In your example, you could have two buckets called 'books' and 'pages'. The master record in the 'books' bucket will contain metadata and information about the book as a whole together with a list of pages that are included in the book. Each page would contain the ID of the 'pages' record holding the page data as well as the corresponding page number. If you e.g. wanted to be able to query by chapter, you could also add information about which chapters a certain page belongs to.
The 'pages' bucket would contain the text of the page and possibly links to images and other media data that are included on that page. This data could be stored in yet another bucket.
In order to get a specific page or a range of pages, one would first retrieve the master record from the 'books' bucket and then based on the contents of the record the appropriate pages. Even though this requires several GET operations, they are all direct lookups based on keys, which is the most efficient and scalable way to retrieve data from Riak, so it is will perform and scale well.
This approach also makes it simple to change the order of pages and/or chapters as only the master record needs to be updated. Adding, deleting or modifying pages would however require both the master record as well as one or more detail records to be updated, added or deleted.
You can most certainly also solve this problem by adding secondary indexes to the objects and query based on this. Secondary index queries in Riak does however have to include processing on a covering set (generally ring size / n_val) of partitions in order to fulfil the request, and therefore puts a bit more load on the system and generally results in higher latencies than retrieving a single object containing keys through a direct key lookup (which only needs to involve the partitions where the object is actually stored).
Although maintaining a separate object containing indexes adds a bit of extra work when inserting or deleting pages/entries, this approach will generally result in more efficient reads, as only direct key lookups are required. If your application is heavy on reads, it probably makes sense to use this approach, while secondary indexes could be more efficient for a write heavy application as inserts and modifications are made cheaper at the expense of more expensive reads. You can however always add secondary indexes just in case in order to keep your options open.
In cases like this I would usually recommend performing some benchmarks to test the solutions and chech which solution that best matches you particular performance and scaling requirements.
The most efficient way will be to store hole book as an one object, and duplicate it's pages as another separate objects.
Pros:
you will be able to select any object by its key(the most cheapest op
in riak is kv query)
any query will be predicted by latency
this is natural way of storing for riak
Cons:
If you need to update any page you must update whole book, and then page. As riak doesn't have atomic ops, you must to think how to recover any failure situation (like this: book was updated, but page was not).
Riak is about availability predictable latency, so if you will use something like 2i to collect results, it will make unpredictable time query, which will grow with page numbers

DynamoDB secondary sort

I'm assessing whether if I can use DynamoDB for our next project, what we are building is quite similar to a blogging platform, here is a simple table
Blog Post
ID - primary hash key
Title
DateCreated - primary range key
Votes
I've read enough to know how to List - list of blog posts, Paging - using last fetched index, Get post details - get a row, I will be sorting using DateCreate, which is my range key.
I'm struggling on how do do sort on a secondary index. For example, if we have a column called Votes, how do you do Most Votes? My interpretation is that you can only sort using the range index which I'm already using.
Update
AWS has just announced general availability of the much anticipated Global Secondary Indexes for Amazon DynamoDB, which are addressing the limitations of Local Secondary Indexes discussed further below:
You can now create indexes and perform lookups using attributes other than the item's primary key. [...]
You can now create up to five Global Secondary Indexes when you create a table, each referencing either a hash key or a hash key and a range key. You can also create up to five Local Secondary Indexes, and you can choose to project some or all of the table's attributes into each of the table’s indexes.
Please refer to the blog post for more details on the choice between these two models.
Correction
As rightly pointed out by vartec, I've been getting ahead of myself adding this information at the day Local Secondary Indexes had been announced without properly analyzing the problem at hand, where those are in fact not applicable - ironically I've stressed just that myself in a later comment on another question:
[...] however, please note that local is a crucial limitation: A local secondary index is a data structure that maintains an alternate range key for a given hash key - while this covers many real world scenarios, it doesn't apply to arbitrary non primary key field queries like those of the question at hand.
Thanks vartec for spotting this error and apologies for being misleading here.
Initial (erroneous) answer
Amazon DynamoDB has just announced Support for Local Secondary Indexes to address your use case:
[...] We call the newest capability Local
Secondary Indexes (LSI). While DynamoDB already allows you to perform
low-latency queries based on your table’s primary key, even at
tremendous scale, LSI will now give you the ability to perform fast
queries against other attributes (or columns) in your table. This
gives you the ability to perform richer queries while still meeting
the low-latency demands of responsive, scalable applications.
See also the introductory blog post Local Secondary Indexes for Amazon DynamoDB for a more detailed explanation.
As usual for AWS, the new functionality is released with a constrained feature set at first, which is going to be expanded over time:
Today, local secondary indexes must be defined at the time you create
your DynamoDB tables. In the future, we plan to provide you with an
ability to add or drop LSI for existing tables. If you want to equip
an existing DynamoDB table to local secondary indexes immediately, you
can export the data from your existing table using Elastic Map Reduce,
and import it to a new table with LSI. [emphasis mine]
looks like this isn't possible, you can only sort by the range hashkey
I'm going to load up the table in memory and sort it in memory.

Resources