This article (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-gsi-sharding.html) talks about a technique for sharding global secondary index values across multiple partitions, by introducing a random integer as the partition key.
That makes sense to me, but the article does not clearly explain how to then query that index. Let's say I'm using a random integer from 1-10 as the partition key, and a number as the sort key, and I want to fetch the 3 records with the highest sort key value (from all partitions).
Would I need to do 10 separate queries, sorting each one, with a limit of 3 items, then do an in-memory sort of the resulting 30 items and pick the first 3? That seems needlessly complicated, and not very efficient for the client.
Is there some way to do a single DynamoDB operation that queries all 10 partitions, does the sorting, and just returns the 3 records with the highest value?
Would I need to do 10 separate queries
Yes. This is called a scatter read in the Dynamo docs...
Normally the client would do so with multiple threads...so while it adds complexity, efficiency is usually good.
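To make the scatter read concrete, here's a minimal Python/boto3 sketch. It assumes a GSI named gsi-sharded with partition key shard (1-10) and a numeric sort key score; the table, index, and attribute names are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table/index/attribute names; real code might also give
# each thread its own boto3 session/client.
table = boto3.resource("dynamodb").Table("Orders")

def query_shard(shard):
    # Each shard query returns its own top 3, already sorted by DynamoDB.
    resp = table.query(
        IndexName="gsi-sharded",
        KeyConditionExpression=Key("shard").eq(shard),
        ScanIndexForward=False,  # descending by sort key
        Limit=3,
    )
    return resp["Items"]

# Scatter: hit all 10 shards in parallel; gather: merge and re-sort.
with ThreadPoolExecutor(max_workers=10) as pool:
    candidates = [item for items in pool.map(query_shard, range(1, 11))
                  for item in items]

top_3 = sorted(candidates, key=lambda i: i["score"], reverse=True)[:3]
```

The in-memory merge only ever sees 30 items, so the client-side sort is negligible; the 10 parallel queries are the real cost.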
Why the limit 3? That requirement seems to be the bigger cause of inefficiency.
Is there some way to do a single DynamoDB operation that queries all 10 partitions, does the sorting, and just returns the 3 records with the highest value?
The only way to query all partitions is with a full table Scan. But that doesn't provide sorting & ordering. You'd still need to do it in your app. The scan would be a lot less efficient than the scatter read.
If this is a "Top 3 sellers" type list...I believe the recommended practice is to (periodically) calculate & store the results, rather than having to constantly derive them. Take a look here: Using Global Secondary Indexes for Materialized Aggregation Queries
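As a rough illustration of that materialized approach (the TOP_SELLERS key, attribute names, and values below are all hypothetical):

```python
import boto3

table = boto3.resource("dynamodb").Table("Orders")  # hypothetical name

# Writer (run on a schedule): overwrite one well-known item with the
# precomputed result of the scatter read.
table.put_item(Item={
    "pk": "TOP_SELLERS",  # hypothetical fixed key
    "sk": "latest",
    "top3": ["item-17", "item-4", "item-92"],  # placeholder values
})

# Reader: a single cheap GetItem instead of 10 queries.
resp = table.get_item(Key={"pk": "TOP_SELLERS", "sk": "latest"})
top3 = resp["Item"]["top3"]
```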
Related
Keeping in mind the best practices of having a single table and evenly distributing items across partitions using partition keys that are as unique as possible in DynamoDB, I am stuck on one problem.
Say my table stores items such as users, items and devices. I am storing the id for each of these items as the partition key. Each id is prefixed with its type such as user-XXXX, item-XXXX & device-XXXX.
Now the problem is: how can I query only a certain type of object? For example, I want to retrieve all users; how do I do that? It would have been possible if the begins_with operator were allowed for partition keys, so I could search for the prefix, but partition keys only allow the equality operator.
If I instead use my types as partition keys, for example user as the partition key and the user id as the sort key, it would work, but it would result in only a few partition keys and thus a hot-key problem. And creating multiple tables is a bad practice.
Any suggestions are welcome.
This is a great question. I'm also interested to hear what others are doing to solve this problem.
If you're storing your data with a Partition Key of <type>-<id>, you're supporting the access pattern "retrieve an item by ID". You've correctly noted that you cannot use begins_with on a Partition Key, leaving you without a clear cut way to get a collection of items of that type.
I think you're on the right track with creating a Partition Key of <type> (e.g. Users, Devices, etc) with a meaningful Sort Key. However, since your items aren't evenly distributed across the table, you're faced with the possibility of a hot partition.
One way to solve the problem of a hot partition is to use an external cache, which would prevent your DB from being hit every time. This comes with added complexity that you may not want to introduce to your application, but it's an option.
You also have the option of distributing the data across partitions in DynamoDB, effectively implementing your own cache. For example, let's say you have a web application that has a list of "top 10 devices" directly on the homepage. You could create partitions DEVICES#1, DEVICES#2, DEVICES#3, ..., DEVICES#N that each store the top 10 devices. When your application needs to fetch the top 10 devices, it could randomly select one of these partitions to get the data, as sketched below. This may not work for a partition as large as Users, but it is a pretty neat pattern to consider.
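A sketch of the read path for that pattern might look like this (the table name AppTable and the pk attribute are assumptions):

```python
import random

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppTable")  # hypothetical name

# N identical copies of the "top 10 devices" list live under
# DEVICES#1 .. DEVICES#N; reading a random copy spreads the load.
N = 10
partition = f"DEVICES#{random.randint(1, N)}"
top_devices = table.query(
    KeyConditionExpression=Key("pk").eq(partition)
)["Items"]
```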
Extending this idea further, you could partition Devices by some other meaningful metric (e.g. <manufactured_date> or <created_at>). This would distribute your Device items more uniformly throughout the database. Your application would be responsible for querying all the partitions and merging the results, but you'd reduce/eliminate the hot partition problem. The AWS DynamoDB docs discuss this pattern in greater depth.
There's hardly a one-size-fits-all approach to DynamoDB data modeling, which can make the modeling super tricky! Your specific access patterns will dictate which solution fits your scenario best.
Keeping in mind the best practices of having a single table and to evenly distribute items across partitions
Quickly highlighting the two things mentioned here.
Even distribution of partition keys is definitely a best practice.
Having the records in a single table, in a generic sense, is about avoiding the need to normalize as in a relational database. In other words, it's fine to build with duplicate/redundant information. So it's not necessarily a notion to club all possible data into a single table.
Now the problem is: how can I query only a certain type of object? For example, I want to retrieve all users; how do I do that?
Let's imagine that you had this table with only "user" data in it. Would this allow you to retrieve all users? Of course not, unless there is a single partition whose key is the type user, with the rest behind, say, a sort key of the user id.
And creating multiple tables is a bad practice
I don't think it's considered bad to have more than one table. It's bad if we store data just like normalized tables and have to use JOINs to get the data together.
Having said that, what would be a better approach to follow?
The fundamental difference is to think about the queries first and derive the table design from them. That will even suggest whether DynamoDB is the right choice at all. For example, the requirement to select every user might be a bad use case altogether for DynamoDB to solve.
The query patterns will further suggest what the best partition key at hand is. Is DynamoDB the choice here because of high ingest and mostly immutable writes?
Do I always have the partition key in hand to perform the selects I need?
What would the update statements look like? Will they again have the partition key available to perform updates?
Do I need to further filter by additional columns and can that be the default sort order?
As you start answering some of these questions, a better model might appear altogether.
We are designing an application which will use DynamoDB as its storage system.
We identified the different access patterns and, after reviewing the Global Secondary Indexes documentation, we got stuck on making a decision about which approach to use: index overloading or having 2 sparse indexes.
To give more context, our application stores orders, we can have internal or external orders. Based on that, they will be linked to a Customer or a Warehouse:
As we would like to search by customer and/or warehouse we thought about 2 solutions.
First solution would be, keeping the above data structure, creating 2 indexes on:
GSI1 - Customer (PK)
GSI2 - Warehouse (PK)
Second solution is to overload another column like:
So only 1 index is required: Destination (PK), and queries apply a prefix to the key value.
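As a sketch of how the overloaded index would be queried (the GSI name gsi-destination and the CUSTOMER#/WAREHOUSE# prefixes are assumptions, not the poster's actual schema):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Orders")  # hypothetical name

# Same index serves both access patterns; only the key prefix changes.
by_customer = table.query(
    IndexName="gsi-destination",
    KeyConditionExpression=Key("Destination").eq("CUSTOMER#C123"),
)["Items"]

by_warehouse = table.query(
    IndexName="gsi-destination",
    KeyConditionExpression=Key("Destination").eq("WAREHOUSE#W456"),
)["Items"]
```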
The question is: "Is there any benefit between index overloading over having 2 different sparse Global Secondary Indexes?" (Cost saving on capacity provisioning, data transport, query times, data complexity...)
As I didn't get any answer yet, I'll add my opinion.
There is no big difference between the 2 approaches; in both cases all items will end up being indexed and similar attributes stored.
Some benefits I could find are:
Benefits using 2 GSI
Data schema is easier to understand (no overloading)
More flexibility for evolving the schema: if requirements change, an order can be assigned to both a customer and a warehouse.
Capacity to tune projections better (may not always be applicable, but you may only need 2 fields for the Customer access pattern and 3 for the Warehouse one)
Smaller indexes perform better
Benefits using 1 GSI
No need to worry about capacity units; they can simply mirror the main table's. When using 2 indexes, you need an estimate of how many records will fall under each of them; otherwise you have to over-provision them.
Example: if you give each index 50% of the main table's RCU and WCU, but 70% of orders are for customers, some requests will be throttled.
In summary, even though using 2 indexes allows a more precise configuration, it may end up having a higher cost and needing periodic reviews of the index configuration to keep it adjusted to actual access-pattern usage.
I have a use case where I have to return all elements of a table in Dynamo DB.
Suppose my table has a partition key (Column X) having the same value in all rows, say "monitor", and a sort key (Column Y) with distinct values.
Will there be any difference in execution time in the below approaches or is it the same?
Scanning whole table.
Querying data based on the partition key having "monitor".
You should use the parallel scan concept. Basically you run multiple Scans at once, each on a different segment of the table. Watch out for higher RCU usage, though.
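A minimal sketch of a parallel scan with boto3, assuming 4 segments; the table name is hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

table = boto3.resource("dynamodb").Table("Monitors")  # hypothetical name
TOTAL_SEGMENTS = 4

def scan_segment(segment):
    # Each worker scans one disjoint segment of the table, paginating
    # with LastEvaluatedKey until the segment is exhausted.
    items, start_key = [], None
    while True:
        kwargs = {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        resp = table.scan(**kwargs)
        items.extend(resp["Items"])
        start_key = resp.get("LastEvaluatedKey")
        if not start_key:
            return items

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    all_items = [i for seg in pool.map(scan_segment, range(TOTAL_SEGMENTS))
                 for i in seg]
```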
Avoid using scan as far as possible.
Scan will fetch all the rows from the table, and you will also have to use pagination to iterate over all of them. It is more like a select * from table; operation in SQL.
Use query if you want to fetch all the rows for a particular partition key. If you know which partition key you want the results for, you should use query, because it can read just that partition rather than the whole table.
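To make the difference concrete, a boto3 sketch of both approaches for the table described above (the table name is an assumption):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Devices")  # hypothetical name

# Scan: reads every item in every partition (first 1 MB page shown;
# real code would paginate with LastEvaluatedKey).
all_rows = table.scan()["Items"]

# Query: goes straight to the "monitor" partition and reads only it.
monitors = table.query(
    KeyConditionExpression=Key("X").eq("monitor")  # X = partition key column
)["Items"]
```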
Direct answer
To the best of my knowledge, in the specific case you are describing, scan will be marginally slower (esp. in first response). This is when assuming you do not do any filtering (i.e., FilterExpression is empty).
Further thoughts
DynamoDB can potentially store huge amounts of data. By "huge" I mean "more than can fit in any machine's RAM". If you need to 'return all elements of a table' you should ask yourself: what happens if that table grows such that all elements no longer fit in memory? You do not have to handle this right now (I believe that as of now the table is rather small), but you do need to keep in mind the possibility of going back to this code and fixing it so that it addresses this concern.
Questions I would ask myself if I were in your position:
(1) Can I somehow set a limit on the number of items I need to read (say, read only the first 1000 items)?
(2) How is this information (the list of items) used? Is it sent back to a JS application running inside a browser which displays it to a user? If the answer is yes, then what will the user do with a huge list of items?
(3) Can you work on the items one at a time (or 10 or 100 at a time)? If the answer is yes, then you only need to store one (or 10 or 100) items in memory, but not the entire list of items.
In general, in DDB, scan operations are used as described in (3): read one item (or several items) at a time, do some processing, and then move on to the next item.
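A sketch of that streaming style with boto3's scan paginator, which holds only one page of results in memory at a time (the table name is assumed):

```python
import boto3

client = boto3.client("dynamodb")
paginator = client.get_paginator("scan")

def process(item):
    ...  # per-item work goes here

# Handle one page (up to 1 MB of items) at a time instead of
# accumulating the entire table in memory.
for page in paginator.paginate(TableName="Devices"):  # hypothetical name
    for item in page["Items"]:
        process(item)
```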
I have a table in DynamoDB which contains attributes like this:
OrderId, OrderJson, OrderStatus.
The value of order status can be 0 or 1.
I need to be able to update the status of the specified order and also to fetch orders based on the status field.
One of the options is to use scan; the other is to have a secondary index with status as the partition key, but the status field has a small range of values.
Please suggest what is the best practice for described requirements?
Thanks!
I wouldn't go with scan since it's not cost effective or particularly efficient unless you have very few Orders.
In short, you were on the right track with Global Secondary Indexes. (I assume you were talking about Global Secondary Indexes. There are Local Secondary Indexes also, but I don't see how those would be of much help for this case.)
Anyway, I would create a GSI with OrderStatus as the Hash key and OrderId as the Range key. There are a couple of things you need to be careful of though.
1) Write throughput. Remember that Orders with the same OrderStatus will be written to the same disk on the GSI. This is just the way Dynamo works; documents with the same Hash key go to the same place. This means that no matter what the write throughput is set to for the table, there is an upper limit to the write throughput on a single disk. Make sure you won't exceed that upper limit.
2) Read throughput. Pretty much the same thing as write throughput, but for reads. The read limit is higher than the write limit, but it is still something to be aware of.
3) Paging. Whenever you Query a Dynamo table using a Hash key, in this case, OrderStatus, it will automatically limit the size of the response to 1 MB. Because of this, you might need to make multiple sequential Query requests to read all of the Orders for a particular OrderStatus.
The nice thing is all of these problems have basically the same solution, "sharding". What I mean by sharding in this case is adding a suffix to your OrderStatus. For example, if OrderStatus can be either 1 or 0, you would create another field, e.g. OrderShard, which can be 1_0, 1_1, 1_2, ..., 1_9, 0_0, 0_1, 0_2, ..., 0_9. We basically just add a random integer between 0 and 9 to the end of the OrderStatus to create more possible Hash key values on the GSI. This will mean that your data gets spread out over more disks, solving 1 and 2, and you can make parallel Query requests, solving 3 for the most part.
Instead of using OrderStatus as your Hash key on the GSI, you will now use OrderShard. Still use OrderId as the Range key. Also, if 10 shards per OrderStatus value isn't enough, just increase the number of shards. For example, add a random number between 0-99. How many shards you will need depends on your scale and throughput.
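A sketch of both sides of that sharding scheme (the GSI name OrderShardIndex is an assumption; the attribute names come from the question):

```python
import random

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Orders")  # hypothetical name

# Write side: pick a random shard suffix so items with the same
# OrderStatus spread across 10 Hash key values on the GSI.
def put_order(order_id, order_json, status):
    table.put_item(Item={
        "OrderId": order_id,
        "OrderJson": order_json,
        "OrderStatus": status,
        "OrderShard": f"{status}_{random.randint(0, 9)}",
    })

# Read side: query every shard for a status and merge the results
# (pagination omitted for brevity; shard queries could run in parallel).
def orders_with_status(status):
    items = []
    for shard in range(10):
        resp = table.query(
            IndexName="OrderShardIndex",  # hypothetical GSI name
            KeyConditionExpression=Key("OrderShard").eq(f"{status}_{shard}"),
        )
        items.extend(resp["Items"])
    return items
```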
I already have an index set up with the second sort key set to what I want (an integer timestamp). The API keeps complaining that I'm not giving it a KeyConditionExpression. Then if I give it one, it says id must be specified. I've tried forcing it to just give me everything using id <> null and it STILL won't do it. Is this even possible?? Maybe it's time to get rid of Dynamo if it can't do this utterly simple task.
For the love of god, all I'm trying to do is query the entire table AND have it use my sort key. I would have had this going in SQL hours ago..
First of all, DynamoDB is a NOSQL database, so it's intentionally NOT SQL. Perhaps you shouldn't expect to be able to perform SQL like queries that you are used to, and be frustrated by the fact that these are two completely different types of databases, each with its strengths and weaknesses.
Records in DynamoDB are partitioned using the hash key, and may optionally be sorted within each partition.
The hash key should be picked so that items are as evenly distributed over partitions as possible. The use of partitions is what makes DynamoDB extremely scalable and fast. But if what you need is to scan over all your items and get them in sorted order, then you probably either are using the wrong tool for the job, or you need to sort the items on the client side.
The scan operation will simply go through all partitions, returning all items from each partition. At this point, the items can only be sorted within their respective partition.
As an example, consider a set of data being partitioned into 3 partitions:
Partition A    Partition B    Partition C
Sort key       Sort key       Sort key
A              D              C
C              E              K
P              G              L
As you can see, you can easily query each partition and get the items in it in sorted order. But if you scan, you will probably get items sorted as
[A, C, P, D, E, G, C, K, L], if the sort order is at all deterministic. At this point you would have to sort the items yourself.
A "trick" that is sometimes seen is to use a "dummy" hash key with an equal value for all items, like you mentioned in your own answer. This way you can query for "dummy = 1" and get the items sorted according to the sort key. However, this completely defeats the purpose of the hash key as all items will be put in the same partition, thus not making the table scale at all. But if you find yourself using DynamoDB even though you have a really small dataset, by all means it would work. But again, with a small data set and use-cases like this, you should probably be using another tool such as RDS in the first place.
Just to elaborate on #JHH though. In general I'd say he is correct that you shouldn't need to sort all elements in DynamoDB. I also have a requirement similar to this, as I need to get the top N number of elements, which could all be in different partitions.
DynamoDB does have a way of doing this; it just isn't out of the box. I don't think it's correct to say you should therefore switch to an SQL database; by that argument you'd never use a NoSQL database, because you will always run into one of these limitations. And if you only ever use NoSQL for large data sets, then you will always have to rework your application later.
What to do then? Well, you do have a few options, and it depends on your use case. Let's assume that you at least have sorting within your partitions; this makes it easier. We'll also assume you are looking for the max.
The simplest way would be to get the first value from every partition and find the max. If you needed, say, the top 10 values, you could still utilise this strategy, but it would get too complicated.
The next option is to make use of DynamoDB Streams. Say we want to keep a list of the top 100 elements. These would sit ready and waiting in their own top-values partition, sorted and ready for instant retrieval. You would need to maintain this list yourself by checking, whenever an item is inserted or updated, whether it is greater than the 100th element. If it is, you would insert it into the top-values partition and delete the last value. This, I think, would be the most likely way to approach this problem.
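A rough sketch of such a stream handler; the table layout (a fixed "TOP" partition with score as the sort key), attribute names, and limits here are all assumptions, not a production design:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Scores")  # hypothetical name

def handler(event, context):
    # Lambda handler attached to the table's DynamoDB Stream.
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]
        score = int(image["score"]["N"])  # hypothetical numeric attribute

        # Current top-100 list, lowest score first (score is the sort key
        # here, so ties would need a tiebreaker in a real implementation).
        top = table.query(
            KeyConditionExpression=Key("pk").eq("TOP"),
            ScanIndexForward=True,
            Limit=100,
        )["Items"]

        if len(top) < 100 or score > int(top[0]["score"]):
            table.put_item(Item={"pk": "TOP", "score": score,
                                 "id": image["id"]["S"]})  # hypothetical id
            if len(top) >= 100:
                # Evict the previous lowest entry to keep the list at 100.
                table.delete_item(Key={"pk": "TOP",
                                       "score": int(top[0]["score"])})
```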
So in NoSQL, if there is some query you would love to do which is oh so easy in SQL, and you can't use your Table/GSI/LSI, then you pretty much need to compute the result manually and have it ready for consumption.
Now, if you weren't going to make use of these top values very often, you might go with the first method and scan every partition's top values until you had the list you wanted, but depending on how scattered the values are across partitions, this could take many capacity units.
Hope that helps.
Turns out, you can also add an IndexName to a scan. That helps. Furthermore, if you create an index with a sort key, the partition key values must all be identical for the sort to apply across items.
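For reference, a sketch of a scan against a GSI with the remaining sort done client-side (index, table, and attribute names are assumptions):

```python
import boto3

table = boto3.resource("dynamodb").Table("Events")  # hypothetical name

# Scan can target a GSI directly, but results still arrive in no
# guaranteed global order, so the final sort happens client-side.
resp = table.scan(IndexName="timestamp-index")  # hypothetical GSI
rows = sorted(resp["Items"], key=lambda i: i["timestamp"])
```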