Let's assume that I have a table with the following attributes:
UNIQUE user_id (primary hash key)
category_id (GSI hash index)
timestamp
I will have a lot of users, but only few categories.
user_id  | category_id
1        | 1
3        | 1
4        | 1
5        | 3
..       | ..
50000000 | 1
Is it OK to store millions of records with the same category_id value as a Global Secondary Index hash key? Should I expect any restrictions?
I'm also wondering whether a Scan might actually be a reasonable choice: I will filter by category_id only once a day. What is the cost (in time and money) of scanning millions of records?
Thanks!
According to the Limits documentation, the only limitation is:
No practical limit for tables without local secondary indexes.
For a table with local secondary indexes, there is a limit on item collection sizes: For every distinct hash key value, the total sizes of all table and index items cannot exceed 10 GB. Depending on your item sizes, this may constrain the number of range keys per hash value. For more information, see Item Collection Size Limit.
Now for your second question of whether you should use Query or Scan: you asked about both performance and monetary cost. Maintaining a GSI is expensive, because you have to pay for its throughput (and, if I recall correctly, also its storage), so it's like paying for another table; it's also another table whose throughput you have to monitor to make sure you aren't being throttled. On the other hand, the performance is much better.
If you're planning on going through all categories once a day (which means reading every document in the table), then Scan is the way to go. You aren't gaining anything from Querying, plus it's cheaper (no extra GSI) and you don't have to worry about projections.
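For illustration, that once-a-day pass might look like the following boto3 sketch; the table name is an assumption, and the attribute names follow the question:

```python
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("users")  # hypothetical table name

def items_in_category(category_id):
    """Scan the whole table, keeping only one category's items.
    Note: you pay RCUs for every item scanned, not just those returned."""
    kwargs = {"FilterExpression": Attr("category_id").eq(category_id)}
    while True:  # follow the 1 MB pages until the Scan is exhausted
        page = table.scan(**kwargs)
        yield from page["Items"]
        if "LastEvaluatedKey" not in page:
            return
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```

Note that the FilterExpression only reduces what is returned, not what is read: a Scan over millions of records consumes read capacity for all of them, which is why it only makes sense for an infrequent batch job like this one.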
Related
I have a DynamoDB table containing:
productID (PK), name, description, url, createTimestamp, <constant>
I'm trying to retrieve the latest 10 products by createTimestamp (unix timestamp).
In SQL, I would probably pull out the data like:
select * from [table] order by createTimestamp desc limit 10;
Q: How can I achieve the same result using DynamoDB without using scan?
The table can be pretty large and the data will be accessed often (e.g., whenever a user accesses the e-commerce website), so using Scan wouldn't be optimal. I'm thinking of creating a GSI using a constant value as the PK (because there isn't any other attribute we could use to narrow the results) and createTimestamp as the sort key, but this is considered an anti-pattern. Is there a better alternative?
That's the way to go: a GSI with a single constant PK value and the timestamps in the SK.
If your write rate will exceed 1,000 write units per second, then you'll want to shard the PK value to one of N randomly chosen values, increasing throughput to N × 1,000 writes per second.
That means you'll need to do N Query calls to get your unified answer, but each Query will be highly efficient and index-optimized.
This is a common design pattern.
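A minimal boto3 sketch of that pattern, assuming N = 4 shards; the table, index, and shard attribute names are assumptions, while createTimestamp follows the question:

```python
import heapq
import random
import boto3
from boto3.dynamodb.conditions import Key

N_SHARDS = 4
table = boto3.resource("dynamodb").Table("products")  # hypothetical name

def put_product(item):
    # Spread writes over N shard values instead of one constant PK.
    item["gsiShard"] = f"PRODUCT#{random.randrange(N_SHARDS)}"
    table.put_item(Item=item)

def latest_products(limit=10):
    # One Query per shard, newest first, then merge the shard results.
    candidates = []
    for shard in range(N_SHARDS):
        page = table.query(
            IndexName="byCreateTimestamp",  # hypothetical GSI: gsiShard (PK), createTimestamp (SK)
            KeyConditionExpression=Key("gsiShard").eq(f"PRODUCT#{shard}"),
            ScanIndexForward=False,  # descending by sort key
            Limit=limit,
        )
        candidates.extend(page["Items"])
    return heapq.nlargest(limit, candidates, key=lambda i: i["createTimestamp"])
```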
We are designing an application which will use DynamoDB as storage system.
We identified the different access patterns and, after reviewing the Global Secondary Indexes documentation, we got stuck making a decision about which approach to use: index overloading or having 2 sparse indexes.
To give more context, our application stores orders; we can have internal or external orders, and based on that they will be linked to either a Customer or a Warehouse.
As we would like to search by customer and/or warehouse, we thought about 2 solutions.
The first solution would be to keep the above data structure and create 2 indexes:
GSI1 - Customer (PK)
GSI2 - Warehouse (PK)
The second solution is to overload another column, e.g. a single Destination attribute holding either the customer or the warehouse identifier, distinguished by a prefix.
This way only 1 index is required: Destination (PK), and queries apply a prefix.
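For illustration, a query against the overloaded index might look like this boto3 sketch (the table name, index name, and prefix format are assumptions):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("orders")  # hypothetical name

def orders_for_customer(customer_id):
    # The same index would serve warehouses via a "WAREHOUSE#" prefix.
    return table.query(
        IndexName="byDestination",  # hypothetical GSI: Destination (PK)
        KeyConditionExpression=Key("Destination").eq(f"CUSTOMER#{customer_id}"),
    )["Items"]
```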
The question is: "Is there any benefit between index overloading over having 2 different sparse Global Secondary Indexes?" (Cost saving on capacity provisioning, data transport, query times, data complexity...)
Since I didn't get any answer yet, I'll add my own opinion.
There is no big difference between the 2 approaches: in both cases, all items end up being indexed and similar attributes are stored.
Some benefits I could find are:
Benefits using 2 GSI
Data schema is easier to understand (no overloading)
More flexibility for evolving the schema: if requirements change, an order can be assigned to both a customer and a warehouse.
Ability to fine-tune projections (may not always be applicable, but you may only need 2 fields for the Customer access pattern and 3 for the Warehouse one)
Smaller indexes give better performance
Benefits using 1 GSI
No need to worry about splitting capacity units: the index capacity can simply mirror the main table's. When using 2 indexes, you need an estimate of how many records will fall under each of them; otherwise you need to over-provision them.
Example: if you give each index 50% of the main table's RCU and WCU, but 70% of your orders are for customers, some requests will be throttled.
In summary, even though using 2 indexes allows a more precise configuration, it may end up costing more and requiring you to review the index configuration from time to time to keep it aligned with actual access-pattern usage.
This article (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-gsi-sharding.html) talks about a technique for sharding global secondary index values across multiple partitions, by introducing a random integer as the partition key.
That makes sense to me, but the article does not clearly explain how to then query that index. Let's say I'm using a random integer from 1-10 as the partition key, and a number as the sort key, and I want to fetch the 3 records with the highest sort key value (from all partitions).
Would I need to do 10 separate queries, sorting each one, with a limit of 3 items, then do an in-memory sort of the resulting 30 items and pick the first 3? That seems needlessly complicated, and not very efficient for the client.
Is there some way to do a single DynamoDB operation that queries all 10 partitions, does the sorting, and just returns the 3 records with the highest value?
Would I need to do 10 separate queries
Yes. This is called a scatter read in the Dynamo docs...
Normally the client would do so with multiple threads, so while it adds complexity, efficiency is usually good.
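A hedged sketch of such a scatter read in Python with boto3; the table, index, and attribute names are assumptions, and the low-level client is used because it is safe to share across threads:

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

client = boto3.client("dynamodb")  # low-level clients are thread-safe

def query_shard(shard):
    # Highest sort-key values first, at most 3 items per shard.
    resp = client.query(
        TableName="scores",            # hypothetical table name
        IndexName="byScore",           # hypothetical GSI name
        KeyConditionExpression="#p = :p",
        ExpressionAttributeNames={"#p": "shardId"},
        ExpressionAttributeValues={":p": {"N": str(shard)}},
        ScanIndexForward=False,
        Limit=3,
    )
    return resp["Items"]

def top_three():
    with ThreadPoolExecutor(max_workers=10) as pool:
        candidates = [item for items in pool.map(query_shard, range(1, 11))
                      for item in items]
    # At most 30 candidates; sort in memory and keep the global top 3.
    return sorted(candidates, key=lambda i: int(i["score"]["N"]), reverse=True)[:3]
```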
Why the limit 3? That requirement seems to be the bigger cause of inefficiency.
Is there some way to do a single DynamoDB operation that queries all 10 partitions, does the sorting, and just returns the 3 records with the highest value?
The only way to query all partitions is with a full table Scan, but that doesn't provide any ordering; you'd still need to sort in your app, and the Scan would be a lot less efficient than the scatter read.
If this is a "Top 3 sellers" type list, I believe the recommended practice is to (periodically) calculate and store the results rather than constantly deriving them. Take a look here: Using Global Secondary Indexes for Materialized Aggregation Queries
I have a table in DynamoDB which contains attributes like this:
OrderId, OrderJson, OrderStatus.
The value of order status can be 0 or 1.
I need to be able to update the status of the specified order and also to fetch orders based on the status field.
One option is to use Scan; the other is to have a secondary index with status as the partition key, but the status field has a small range of values.
Please suggest what is the best practice for described requirements?
Thanks!
I wouldn't go with Scan, since it's not cost-effective or particularly efficient unless you have very few Orders.
In short, you were on the right track with Global Secondary Indexes. (I assume you were talking about Global Secondary Indexes; there are Local Secondary Indexes too, but I don't see how those would be of much help for this case.)
Anyway, I would create a GSI with OrderStatus as the Hash key and OrderId as the Range key. There are a couple of things you need to be careful of, though.
1) Write throughput. Remember that Orders with the same OrderStatus will be written to the same disk on the GSI. This is just the way Dynamo works: documents with the same Hash key go to the same place. This means that no matter what the write throughput on your table is set to, there is an upper limit to the write throughput of a single disk. Make sure you won't exceed that upper limit.
2) Read throughput. Pretty much the same thing as write throughput, but for reads. The read limit is higher than the write limit, but it is still something to be aware of.
3) Paging. Whenever you Query a Dynamo table using a Hash key (in this case, OrderStatus), the response is automatically limited to 1 MB. Because of this, you might need to make multiple sequential Query requests to read all of the Orders for a particular OrderStatus.
The nice thing is that all of these problems have basically the same solution: "sharding". What I mean by sharding in this case is adding a suffix to your OrderStatus. For example, if OrderStatus can be either 1 or 0, you would create another field, e.g. OrderShard, which can be 1_0, 1_1, 1_2, ..., 1_9, 0_0, 0_1, 0_2, ..., 0_9. We basically just add a random integer between 0 and 9 to the end of the OrderStatus to create more possible Hash key values on the GSI. This means your data gets spread out over more disks, solving 1 and 2, and you can make parallel Query requests, solving 3 for the most part.
Instead of using OrderStatus as your Hash key on the GSI, you will now use OrderShard, still with OrderId as the Range key. Also, if 10 shards per OrderStatus value isn't enough, just increase the number of shards, for example by adding a random number between 0 and 99. How many shards you will need depends on your scale and throughput.
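A boto3 sketch of that OrderShard scheme; the table and index names are assumptions, while the attribute names and shard layout follow the answer above:

```python
import random
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("orders")  # hypothetical name

def put_order(order_id, order_json, status):
    table.put_item(Item={
        "OrderId": order_id,
        "OrderJson": order_json,
        "OrderStatus": status,
        # e.g. "1_7": the status plus a random 0-9 suffix, spreading GSI load
        "OrderShard": f"{status}_{random.randrange(10)}",
    })

def orders_with_status(status):
    # One Query per shard; these could also run on parallel threads.
    for shard in range(10):
        kwargs = {
            "IndexName": "byOrderShard",  # hypothetical GSI: OrderShard (PK), OrderId (SK)
            "KeyConditionExpression": Key("OrderShard").eq(f"{status}_{shard}"),
        }
        while True:  # follow the 1 MB pages mentioned in point 3
            page = table.query(**kwargs)
            yield from page["Items"]
            if "LastEvaluatedKey" not in page:
                break
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```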
I've been reading some DynamoDB index docs and they've left me more confused than anything. Let's clear the air with a concrete example.
I have a simple calendar application, where I have an events table. Here are the columns I have:
id: guid,
name: string,
startTimestamp: integer,
calendarId: guid (foreign key in a traditional RDBMS model)
ownerId: guid (foreign key in a traditional RDBMS model)
I'd like to perform queries such as:
Get an event by ID
Get all events where calendarId = x and ownerId = y
Get all events where startTimestamp is between x and y and calendarId = z
DynamoDB docs seem to heavily suggest avoiding using the event's ID as a partition/sort key here, so what's the recommended schema?
This is a problem that everyone wrestles with when they start with (and indeed when they are experienced with) DynamoDB.
Pricing and throughput
Let's start with how DynamoDB is priced (it's related, honestly). Ignoring the free tier for a moment, you pay $0.25 per GB per month for data at rest. You also pay $0.47 per Write Capacity Unit (WCU) per month and $0.09 per Read Capacity Unit (RCU) per month. Throughput is the number of WCUs and RCUs on your table. You have to specify throughput up front on your table: the volume of writes and reads you can perform is limited by your throughput provision. Pay more money and you can do more reads and writes per second. The exact details of how DynamoDB partitions tables can be found in this answer.
Keys
Now we need to consider table partitioning. Tables must have a primary key. A primary key must have a hash key (aka a partition key) and may optionally have a sort key (aka a range key). DynamoDB creates partitions based on your hash key values. Within a partition key value the data is sorted by range key, if you have specified one.
Data Access
If you have the exact primary key (hash key and range key if there is one), you can instantly access an item using GetItem. If you have multiple items to get, you can use BatchGetItem.
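For example, with boto3 (the table and key names here are illustrative, not from the question):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("events")  # hypothetical table name

# One item by its exact primary key:
item = table.get_item(Key={"id": "some-event-guid"}).get("Item")

# Several items in one round trip:
batch = dynamodb.batch_get_item(
    RequestItems={"events": {"Keys": [{"id": "guid-1"}, {"id": "guid-2"}]}}
)
items = batch["Responses"]["events"]
```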
DynamoDB can only 'search' data in two ways. A Query can only take data from one partition in one call; because it uses the partition key (and optionally a sort key), it is quick. A Scan always evaluates every item in the table, so it's typically slow and doesn't scale well on large tables.
Throughput distribution
This is where it gets interesting. DynamoDB takes all the throughput you have purchased and spreads it evenly over all of your table partitions. Imagine you have 10 WCUs and 10 RCUs on your table and 5 partitions: that means you have 2 WCUs and 2 RCUs per partition. That's fine if you access each partition evenly; you get to use all of your purchased throughput. But imagine you only ever access one partition. Now you've purchased 10 WCUs and RCUs but you are only using 2. Your table is going to be much slower than you thought. One option is to just buy more throughput. That will work, but it's probably not very satisfactory to most engineers.
Uniform Access v Natural Access
Based on the above we know we want to design a table where each partition gets accessed evenly. However, in my experience people get too hung up about this, which is not surprising if you read the article I just linked (which you also linked).
Remember that the partition key is what we use in a Query to get our data fast and avoid regular Scans. Some people get too focused on making their partition access perfectly uniform, and end up with a table they can't query quickly.
The answer
I like to refer to the Best Practices for Tables guide, and particularly the table where it says User ID is a good partition key so long as many users access your application regularly. (It actually says where you have many users, which is not correct; the size of the table is irrelevant.)
It's a balance between uniform access and being able to use intuitive, natural queries for your application. What I am saying is: if you are new to DynamoDB, the right answer is probably to design your table based on intuitive access. After you've done that successfully, have a think about uniform access and hot partitions, but just remember access doesn't have to be perfectly uniform. There are various design patterns that achieve both intuitive and uniform access, but these can be complicated for those starting out, and in many cases will probably discourage people from using DynamoDB if they get too focused on the uniform access idea.
Tips
Most applications will have users, and in most applications the most common query is to get data for a user. So the first choice of primary partition key will often be a user ID. That's fine, as long as you don't have a few very heavily hitting users and many users that never log in.
Another tip. If your table is called vegetables, your primary partition key will probably be vegetable id. If your table is called shoes, your primary partition key will probably be shoe id.
Most applications will have many items for each user (or vegetable or shoe), and the primary key has to be unique. A good option is often to add a date range (sort) key, perhaps the datetime the item was created. This orders the items within the user partition by creation date, and also gives each item a unique composite primary key (i.e. hash key + range key). It's also fine to use a generated UUID as a range key; you won't use the ordering it gives you, but you can then have many items per user and still use the Query function.
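A small boto3 sketch of that composite-key idea (all names here are illustrative):

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("items")  # hypothetical name

# Unique per (userId, createdAt), and sorted by creation time within a user.
table.put_item(Item={"userId": "user-123", "createdAt": int(time.time()), "note": "hi"})

# All of one user's items, newest first - still a fast Query, never a Scan:
recent = table.query(
    KeyConditionExpression=Key("userId").eq("user-123"),
    ScanIndexForward=False,
)["Items"]
```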
Indexes are not a solution
Aha! But I can just make my partition key totally random, then apply an index with a partition key of the attribute I really want to query on. That way I get uniform access AND fast, intuitive queries.
Sadly not. Indexes have their own throughput and partitioning, separate from the table the index is built on. Just imagine indexes as a whole new table; that's basically what they are. Indexes are not a workaround for uneven partition access.
Finally - your schema
Primary Key
Hash Key: Event ID
Range Key: None
Global Secondary index
Hash Key: Calendar ID
Range Key: startTimestamp
Assuming Event ID is uniformly accessed, it would be a great hash key. You would really need to describe how your data is distributed to discuss this much further. Other things that come into play are how fast you want queries to work and how much you are willing to pay (e.g. secondary indexes are expensive).
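For concreteness, that schema might be declared like this with boto3 (the index name and throughput figures are assumptions):

```python
import boto3

boto3.client("dynamodb").create_table(
    TableName="events",
    AttributeDefinitions=[
        {"AttributeName": "id", "AttributeType": "S"},
        {"AttributeName": "calendarId", "AttributeType": "S"},
        {"AttributeName": "startTimestamp", "AttributeType": "N"},
    ],
    KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
    GlobalSecondaryIndexes=[{
        "IndexName": "byCalendar",  # hypothetical index name
        "KeySchema": [
            {"AttributeName": "calendarId", "KeyType": "HASH"},
            {"AttributeName": "startTimestamp", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
        "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    }],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
```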
And your queries:
Get an event by ID
GetItem using Event ID
Get all events where calendarId = x and ownerId = y
Query by GSI partition key, add a condition on ownerId
Get all events where startTimestamp is between x and y and calendarId = z
Query by GSI partition key, add a condition on the range key
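Those two queries might look like this in boto3 (the table and index names follow the hypothetical sketch above; note the ownerId condition is a filter expression, see the caveat in the next answer):

```python
import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("events")  # hypothetical name

# Get all events where calendarId = x and ownerId = y:
by_owner = table.query(
    IndexName="byCalendar",  # hypothetical GSI from the schema above
    KeyConditionExpression=Key("calendarId").eq("x"),
    FilterExpression=Attr("ownerId").eq("y"),  # applied after the read
)["Items"]

# Get all events where calendarId = z and startTimestamp is between x and y:
in_range = table.query(
    IndexName="byCalendar",
    KeyConditionExpression=(
        Key("calendarId").eq("z") & Key("startTimestamp").between(1000, 2000)
    ),
)["Items"]
```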
I just want to add something to the accepted answer:
Get all events where calendarId = x and ownerId = y
Query by GSI partition key, add a condition on ownerId
This method is not reliable. I guess that when you say "add a condition on ownerId", you mean "add a filter expression on ownerId" (definition by Alex DeBrie).
But the 1MB read limit by DynamoDB makes it unreliable.
It is better explained in the link above, but here is the summary:
If your calendar has a lot of events, representing more than 1 MB of data, DynamoDB reads at most 1 MB per request and only then applies the ownerId == X filter, so any matching events beyond that first 1 MB are silently excluded from the results.
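If you do rely on the filter approach anyway, the workaround is to keep following LastEvaluatedKey so the filter is applied to every 1 MB page rather than only the first. A boto3 sketch, reusing the hypothetical names from the example above:

```python
import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("events")  # hypothetical name

def events_for_owner(calendar_id, owner_id):
    kwargs = {
        "IndexName": "byCalendar",  # hypothetical GSI name
        "KeyConditionExpression": Key("calendarId").eq(calendar_id),
        "FilterExpression": Attr("ownerId").eq(owner_id),
    }
    while True:  # each page is at most 1 MB before filtering
        page = table.query(**kwargs)
        yield from page["Items"]
        if "LastEvaluatedKey" not in page:
            return
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```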