I am new to DynamoDB and I have a table partitioned by company ID. It receives news about the companies every day, so I insert a new record for every news item I get, using the respective company ID. I would like to know if there is an easy way to find out which company has the most news. I thought I could do it by finding the biggest partition, but I can't find any information about this. Do I have to query every company and count the items returned?
There's no way for you to know anything about the physical partitions in use by DDB. I assume AWS Engineers can find out, but it's not something they are open about.
Unless your DDB data is more than 10 GB, or you've configured (or used) more than 3000 RCU / 1000 WCU, it's highly probable that your data is in fact in a single physical partition, regardless of the number of partition key values in that data.
100 partition key values don't translate into 100 physical partitions.
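To make those thresholds concrete, here is a rough sketch of the rule of thumb they imply. It is only an estimate built from the figures above (about 10 GB, 3000 RCU and 1000 WCU per partition); the function name is made up, and DynamoDB does not expose the real partition count.

```python
import math


def estimate_physical_partitions(size_gb: float, rcu: int, wcu: int) -> int:
    """Rough estimate only: the real layout is internal to DynamoDB."""
    # Assumption: roughly 10 GB of data per physical partition.
    by_size = math.ceil(size_gb / 10)
    # Assumption: roughly 3000 RCU / 1000 WCU per physical partition.
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    return max(1, by_size, by_throughput)


# A small table with 100 distinct company IDs still fits one partition.
print(estimate_physical_partitions(size_gb=2, rcu=500, wcu=100))
```

The number of distinct partition key values never appears in the calculation, which is the point being made above.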
Related
I've been thinking a lot about the possible strategies for querying an unbounded number of items.
For example, think of a forum - you could have any number of forum posts categorized by topic. You need to support at least 2 access patterns: post details view and list of posts by topic.
// legend
PK = partition key, SK = sort key
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
PK = postId
Great for querying all the posts for a given topic, but they are all in the same partition ("hot partition").
PK = topic and SK = postId#addedDateTime
Store items in buckets, e.g new bucket for each day. This would push a lot of logic to the application layer and add latency. E.g. if you need to get 10 posts, you'd have to query today's bucket and, if that bucket contains fewer than 10 items, query yesterday's bucket, etc. Don't even get me started on pagination. That would probably be a nightmare if it crosses buckets.
PK = topic#date and SK = postId#addedDateTime
So my question is: how do you store and query an unbounded list of items the "DynamoDB way"?
I think you've got a good understanding about your options.
I can't profess to know the One True Way™ to solve this particular problem in DynamoDB, but I'll throw out a few thoughts for the sake of discussion.
While it's easy to get a single post, you can't effectively query a list of posts without a scan.
This would definitely be the case if your Primary Key consists solely of the postId (I'll use POST#<postId> to make it easier to read). That table would look something like this:
This would be super efficient for the "fetch post details view" (aka fetch post by ID) access pattern. However, we haven't built in any way to access a group of Posts by topic. Let's give that a shot next.
There are a few ways to model the one-to-many relationship between Posts and topics. The first thing that comes to mind is creating a secondary index on the topic field. Logically, that would look like this:
Now we can get an item collection of Posts by topic using the efficient query operation. Pagination will help you if your number of Posts per topic grows larger. This may be enough for your application. For the sake of this discussion, let's assume it creates a hot partition and consider what strategies we can introduce to reduce the problem.
One Option
You said
Store items in buckets, e.g new bucket for each day.
This is a great idea! Let's update our secondary index partition key to be <topic>#<truncated_timestamp> so we can group posts by topic for a given time frame (day/week/month/etc).
I've done a few things here:
Introduced two new attributes to represent the secondary index PK and SK (GSIPK and GSISK respectively).
Introduced a truncated timestamp into the partition key to represent a given month. For example, POST#1 and POST#2 both have a posted_at timestamp in September. I truncated both of those timestamps to 2020-09-01 to represent the entire month of September (or whatever time boundary that makes sense for your application).
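As a minimal sketch of how those two attributes could be derived for a post: the GSIPK layout follows the <topic>#<truncated_timestamp> pattern above, while the GSISK layout (full timestamp, then post ID) and the function name are my own assumptions for illustration.

```python
from datetime import datetime


def gsi_keys(topic: str, post_id: str, posted_at: datetime) -> dict:
    """Build the secondary index key attributes for one post."""
    # Truncate the timestamp to the first day of its month (the bucket).
    bucket = posted_at.strftime("%Y-%m-01")
    return {
        "GSIPK": f"{topic}#{bucket}",
        # Assumption: sort by full timestamp, with the post ID as a tiebreaker.
        "GSISK": f"{posted_at.isoformat()}#POST#{post_id}",
    }


print(gsi_keys("python", "1", datetime(2020, 9, 14, 8, 30)))
```

Switching to daily or weekly buckets only changes the truncation line.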
This will help distribute your data across partitions, reducing the hot key issue. As you correctly note, this will increase the complexity of your application logic and increase latency, since you may need to make multiple requests to retrieve enough results for your application's needs. However, this might be a reasonable trade-off in this situation. If the increased latency is a problem, you could pre-populate a partition to contain the results of the prior N months' worth of a topic discussion (e.g. PK = TOPIC_CACHE#<topic> with a list attribute that contains a list of postIds from the prior N months).
If the TOPIC_CACHE ends up being a hot partition, you could always shard the partition using a calculated suffix:
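The "multiple requests" part could look roughly like the sketch below. It assumes monthly buckets and a hypothetical query_bucket(partition_key, limit) callable that returns that bucket's posts newest first (in a real application it would wrap a Query against the secondary index); max_buckets is an invented safety cap.

```python
from datetime import date
from typing import Callable, Iterator, List


def month_buckets(start: date) -> Iterator[str]:
    """Yield bucket labels from the start month backwards, one per month."""
    year, month = start.year, start.month
    while True:
        yield f"{year:04d}-{month:02d}-01"
        month -= 1
        if month == 0:
            year, month = year - 1, 12


def latest_posts(topic: str, limit: int,
                 query_bucket: Callable[[str, int], List[str]],
                 today: date, max_buckets: int = 12) -> List[str]:
    """Walk buckets newest to oldest until enough posts are collected."""
    posts: List[str] = []
    for n, bucket in enumerate(month_buckets(today)):
        if n >= max_buckets or len(posts) >= limit:
            break
        # Ask each bucket only for what is still missing.
        posts.extend(query_bucket(f"{topic}#{bucket}", limit - len(posts)))
    return posts[:limit]
```

The cap matters for quiet topics: without it, a topic with fewer than limit posts would keep querying older and older empty buckets.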
Your application could randomly select a TOPIC_CACHE between 1..N when retrieving the topic cache.
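Here is a small sketch of that read-side selection. The key layout TOPIC_CACHE#<topic>#<n> is an assumption on my part (any suffix scheme would do), and both function names are hypothetical.

```python
import random
from typing import List, Optional


def topic_cache_keys(topic: str, shards: int) -> List[str]:
    """All shard keys for a topic; a writer would update every one of them."""
    return [f"TOPIC_CACHE#{topic}#{n}" for n in range(1, shards + 1)]


def pick_topic_cache_key(topic: str, shards: int,
                         rng: Optional[random.Random] = None) -> str:
    """Pick one shard at random so reads spread across the N copies."""
    rng = rng or random.Random()
    return f"TOPIC_CACHE#{topic}#{rng.randint(1, shards)}"


print(topic_cache_keys("python", 3))
print(pick_topic_cache_key("python", 3))
```

Because any shard may be chosen on read, each shard has to hold a full copy of the cached list, so this trades extra writes for spread-out reads.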
There are numerous ways to approach this access pattern, and these options represent only a few possibilities. If it were my application, I would start by creating a secondary index using the Post topic as the partition key. It's the easiest to implement and would give me an opportunity to see how my application access patterns performed in a production environment. If the hot key issue started to become a problem, I'd dive deeper into some sort of caching solution.
I am trying to determine the best partition key for a CosmosDB table that has both a customer ID (unique value for each customer) and customer city (in North America, which yields thousands of possible values).
Reading the Azure documentation, I see a lot of conflicting information about which one is best. Some of the documents say that the more unique value will provide a better spread of items across partitions, while others state that using city would be best.
So my questions are:
Is each partition key hashed, and does each partition contain items with keys in a range of hashes? I.e. if customer ID is the partition key, would one partition have IDs 1 through 1000, another partition 1000 through 2000, etc.? Same with city: would one partition have multiple cities? Or would each partition be mapped 1:1 to a specific partition key, i.e. ID or city?
Based on the above, which one would be better (more performant, lower cost)? Having as granular a partition key as possible (i.e. customer ID)? Or customer city?
Thank you!
Yes, partition keys are hashed and those hashes determine where logical partitions are physically stored.
No, a logical partition will only ever contain records with the same partition key value (that's basically the point: co-locate associated records). So in your example, logical partitions are mapped 1:1 to key values, whether that's an ID or a city; a single physical partition can still host many logical partitions.
Cost is irrelevant because you aren't charged for partitions (although they do have a size limit), so the question comes down to performance, and again that all depends on how your application queries the data.
A good analogy for understanding how partitioning works is to think about finding someone's address:
If I gave you the key to my house (item ID) but nothing else, you would need to try every door in the world until you happened to stumble upon the right one (aka a cross-partition query). If I told you the country (partition key), you could immediately eliminate millions of doors, but you'd still have millions of doors to check, so still not very efficient. If I gave you the city, fewer again, but still a lot to check. But if I gave you my postcode, then we've just optimized a query from billions of records to 15-20.
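To illustrate how hashed keys end up grouped, here is a toy model. It is not Cosmos DB's actual hashing scheme: SHA-256 with a modulo is just a stand-in I picked, and the function names are invented. Each key value is treated as its own logical partition, and several of them land on the same physical partition.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Set


def physical_partition(partition_key: str, physical_count: int) -> int:
    """Map a key value to a physical partition (stand-in hash, not the real one)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % physical_count


def place(keys: Iterable[str], physical_count: int) -> Dict[int, Set[str]]:
    """Group key values (logical partitions) by the physical partition they hash to."""
    layout: Dict[int, Set[str]] = defaultdict(set)
    for key in keys:
        layout[physical_partition(key, physical_count)].add(key)
    return dict(layout)


layout = place([f"city-{i}" for i in range(1000)], physical_count=4)
for number, cities in sorted(layout.items()):
    print(number, len(cities))
```

The same key always hashes to the same place, so every record for one city (or one customer ID) stays together no matter how many other keys share that physical partition.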
I’ve been dabbling with CosmosDb and am now starting to get in the range of over 10k documents instead of just a few.
I’m struggling with how best to partition.
Some background
• I will have 10-50k documents in CosmosDb (maybe more in later phases)
• I have an index on top of those in Azure Search, for a small subset of these documents' properties
• I will NOT be performing complex searches in CosmosDb
except:
• I will be fetching documents from cosmosDb by their Id (most likely coming from Azure Search results, when the user clicks one of the results)
o Initially only 1 document will be requested
o Possibly, in the future, I might ask for e.g. 10 documents at the same time, all by their Id.
I currently have 1 partition, which feels like a waste of a good system.
I could partition on e.g. the last digit of the document number, which would give a nice spread of documents across 10 partitions.
My concrete question:
If I spread data equally (almost randomly, to be honest) across 10 partitions, does that speed up fetching documents by Id (assuming many simultaneous calls to the system, each fetching 1 document by Id)?
My reasoning: The last digit would determine the partition, so only 1 partition would be accessed to find the document, which is better than searching all partitions at the same time?
Spreading data across partitions does not make things faster on the read path in a partitioned data store. Where it helps is on the write path, because you are spreading the load out horizontally across many computers simultaneously. And this only matters where the amount of throughput overloads what a single partition can achieve. For Cosmos DB this is 10,000 RU/s per physical partition.
The key to fast reads is to indicate the partition key value in your read. The partition key is basically a router to where your data is stored. Once there it uses the index (or id in your case) to find the data.
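A toy in-memory model makes the routing point visible. FakeContainer is invented for illustration and is not the Cosmos DB SDK; it only counts how many partitions a lookup has to touch, using the "last digit" key from the question.

```python
from collections import defaultdict
from typing import Optional, Tuple


class FakeContainer:
    """Toy stand-in for a partitioned container: partition key value -> {id: doc}."""

    def __init__(self) -> None:
        self.partitions = defaultdict(dict)

    def upsert(self, doc: dict, partition_key: str) -> None:
        self.partitions[partition_key][doc["id"]] = doc

    def point_read(self, doc_id: str, partition_key: str) -> Tuple[Optional[dict], int]:
        """Id plus partition key: routed straight to one partition."""
        return self.partitions.get(partition_key, {}).get(doc_id), 1

    def find_by_id(self, doc_id: str) -> Tuple[Optional[dict], int]:
        """Id only: every partition may have to be checked."""
        touched = 0
        for docs in self.partitions.values():
            touched += 1
            if doc_id in docs:
                return docs[doc_id], touched
        return None, touched


container = FakeContainer()
for n in range(100):
    # Partition key = last digit of the document number.
    container.upsert({"id": f"doc-{n}"}, partition_key=str(n % 10))

print(container.point_read("doc-57", "7")[1])
print(container.find_by_id("doc-57")[1])
```

Since the last digit can be computed from the Id, the application can always supply the partition key and stay on the point-read path.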
There are some articles on partitioning that provide helpful details.
Partitioning in Azure Cosmos DB
How to model and partition data on Azure Cosmos DB using a real-world example
Hope this helps.
I am planning to create a merchant table, which will have store locations of the merchant. Most merchants are small businesses and they only have a few stores. However, there is the odd multi-chain/franchise who may have hundreds of locations.
What would be my solution if I want to include location attributes within the merchant table? If I have to split it into multiple tables, how do I achieve that?
Thank you!
EDIT: How about splitting the table. To cater for the majority, say up to 5 locations I can place them inside the same table. But beyond 5, it will spill over to a normalised table with an indicator on the main table to say there are more than 5 locations. Any thoughts on how to achieve that?
You have a couple of options depending on your access patterns:
Compress the data and store the binary object in DynamoDB.
Store basic details in DynamoDB along with a link to S3 for the larger things. There's no transactional support across DynamoDB and S3 so there's a chance your data could become inconsistent.
Rather than embed location attributes, you could normalise your tables and put that data in a separate table with the equivalent of a foreign key to your merchant table. But, you may then need two queries to retrieve data for each merchant, which would count towards your throughput costs.
Catering for a spill-over table would have to be handled in the application code rather than at the database level: if (store_count > 5) then execute another query to retrieve more data
If you don't need the performance and scalability of DynamoDB, perhaps RDS is a better solution.
A bit late to the party, but I believe the right schema would be to have partitionKey as merchantId with sortKey as storeId. This would create individual, separate records for each store, and you can store the geo location on each one. This way:
You would not cross the 400KB threshold
Queries become efficient if you want to fetch the location for just 1 of the stores of the merchant. If you want to fetch all the stores, there is no impact with this schema.
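As a hedged sketch of that schema, the helpers below build one item per store plus the request parameters for the two reads, in the shape the low-level DynamoDB API expects. The table name, the lat/lon attribute names and the function names are placeholders of mine.

```python
def store_item(merchant_id: str, store_id: str, lat: float, lon: float) -> dict:
    """One small item per store, so no single item approaches the 400 KB limit."""
    return {
        "merchantId": {"S": merchant_id},  # partition key
        "storeId": {"S": store_id},        # sort key
        "lat": {"N": str(lat)},            # assumed attribute names
        "lon": {"N": str(lon)},
    }


def one_store_request(table: str, merchant_id: str, store_id: str) -> dict:
    """GetItem parameters: fetch a single store by its full primary key."""
    return {
        "TableName": table,
        "Key": {"merchantId": {"S": merchant_id}, "storeId": {"S": store_id}},
    }


def all_stores_request(table: str, merchant_id: str) -> dict:
    """Query parameters: fetch every store of one merchant."""
    return {
        "TableName": table,
        "KeyConditionExpression": "merchantId = :m",
        "ExpressionAttributeValues": {":m": {"S": merchant_id}},
    }


print(all_stores_request("merchants", "M-1"))
```

A franchise with hundreds of locations is then just a longer Query result, which may be paginated, rather than one oversized item.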
PS: I am a software engineer working on Amazon DynamoDB.
I've been reading some DynamoDB index docs and they've left me more confused than anything. Let's clear the air with a concrete example.
I have a simple calendar application, where I have an events table. Here are the columns I have:
id: guid,
name: string,
startTimestamp: integer,
calendarId: guid (foreign key in a traditional RDBMS model)
ownerId: guid (foreign key in a traditional RDBMS model)
I'd like to perform queries such as:
Get an event by ID
Get all events where calendarId = x and ownerId = y
Get all events where startTimestamp is between x and y and calendarId = z
DynamoDB docs seem to heavily suggest avoiding using the event's ID as a partition/sort key here, so what's the recommended schema?
This is a problem that everyone wrestles with when they start with (and indeed when they are experienced with) DynamoDB.
Pricing and throughput
Let's start with how DynamoDB is priced (it's related, honestly). Ignoring the free tier for a moment, you pay $0.25 per GB per month for data at rest. You also pay $0.47 per Write Capacity Unit (WCU) per month and $0.09 per Read Capacity Unit (RCU) per month. Throughput is the number of WCUs and RCUs on your table. You have to specify throughput up front on your table: the volume of writes and reads you can perform on your table is limited by your throughput provision. Pay more money and you can do more reads and writes per second. The exact details of how DynamoDB partitions tables can be found in this answer.
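As a quick worked example of that pricing, using only the rates quoted above (they may well differ by region or have changed since; the function name is made up):

```python
def monthly_cost(storage_gb: float, wcu: int, rcu: int,
                 gb_rate: float = 0.25,
                 wcu_rate: float = 0.47,
                 rcu_rate: float = 0.09) -> float:
    """Monthly bill in dollars for provisioned throughput plus storage."""
    return storage_gb * gb_rate + wcu * wcu_rate + rcu * rcu_rate


# 10 GB at rest with 10 WCUs and 10 RCUs provisioned.
print(round(monthly_cost(10, 10, 10), 2))
```

With these rates a write unit costs roughly five times a read unit, which is worth remembering when secondary indexes come up later.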
Keys
Now we need to consider table partitioning. Tables must have a primary key. A primary key must have a hash key (aka a partition key) and may optionally have a sort key (aka a range key). DynamoDB creates partitions based on your hash key values. Within a partition key value the data is sorted by range key, if you have specified one.
Data Access
If you have the exact primary key (hash key and range key if there is one), you can instantly access an item using GetItem. If you have multiple items to get, you can use BatchGetItem.
DynamoDB can only 'search' data in two ways. A Query can only take data from one partition in one call; because it uses the partition key (and optionally a sort key), it is quick. A Scan always evaluates every item in the table, so it's typically slow and doesn't scale well on large tables.
Throughput distribution
This is where it gets interesting. DynamoDB takes all the throughput you have purchased and spreads it evenly over all of your table partitions. Imagine you have 10 WCUs and 10 RCUs on your table, and 5 partitions: that means you have 2 WCUs and 2 RCUs per partition. That's fine if you access each partition evenly, since you get to use all of your purchased throughput. But imagine you only ever access one partition. Now you've purchased 10 WCUs and RCUs but you are only using 2. Your table is going to be much slower than you thought. One option is to just buy more throughput. That will work, but it's probably not very satisfactory to most engineers.
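The arithmetic in that example, as a tiny sketch. It models only the even split described above and ignores any rebalancing DynamoDB may do on its own; the function names are invented.

```python
def per_partition(total_units: float, partitions: int) -> float:
    """Throughput each partition receives under an even split."""
    return total_units / partitions


def usable_units(total_units: float, partitions: int, accessed: int) -> float:
    """Throughput you can actually use if only some partitions are accessed."""
    return per_partition(total_units, partitions) * accessed


print(per_partition(10, 5))    # units on each partition
print(usable_units(10, 5, 1))  # only one hot partition in use
print(usable_units(10, 5, 5))  # perfectly even access
```

The gap between the last two numbers is the throughput you pay for but cannot reach when access is concentrated on one partition.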
Uniform Access v Natural Access
Based on the above we know we want to design a table where each partition gets accessed evenly. However, in my experience people get too hung up about this, which is not surprising if you read the article I just linked (which you also linked).
Remember that partition keys are what we use in a Query to get our data fast and avoid regular Scans. Some people get too focussed on making their partition access perfectly uniform, and end up with a table they can't query quickly.
The answer
I like to refer to the Best Practices for Tables guide, and particularly the table where it says User ID is a good partition key so long as many users access your application regularly. (It actually says where you have many users, which is not correct; the size of the table is irrelevant.)
It's a balance between uniform access and being able to use intuitive, natural queries for your application, but what I am saying is, if you are new to DynamoDB, the right answer is probably to design your table based on intuitive access. After you've done that successfully, have a think about uniform access and hot partitions, but just remember access doesn't have to be perfectly uniform. There are various design patterns to achieve both intuitive and uniform access, but these can be complicated for those starting out and in many cases can discourage people from using DynamoDB if they get too focussed on the uniform access idea.
Tips
Most applications will have users. In most applications, the most common query you will do is get data for a user. So the first option for most applications' primary partition key will often be a user id. That's fine, as long as you don't have a few very high hitting users and many users that never log in.
Another tip. If your table is called vegetables, your primary partition key will probably be vegetable id. If your table is called shoes, your primary partition key will probably be shoe id.
Most applications will have many items for each user (or vegetable or shoe). The primary key has to be unique. A good option often is to add a date range (sort) key, perhaps the datetime the item was created. This then orders the items within the user partition by creation date, and also gives each item a unique composite primary key (i.e. hash key + range key). It's also fine to use a generated UUID as a range key; you won't use the ordering it gives you, but you can then have many items per user and still use the Query function.
Indexes are not a solution
Aha! But I can just make my partition key totally random, then apply an index with a partition key of the attribute I really want to query on. That way I get uniform access AND fast intuitive queries.
Sadly not. Indexes have their own throughput and partitioning, separate to the table the index is built on. Just imagine indexes as a whole new table, because that's basically what they are. Indexes are not a workaround for uneven partition access.
Finally - your schema
Primary Key
Hash Key: Event ID
Range Key: None
Global Secondary index
Hash Key: Calendar ID
Range Key: startTimestamp
Assuming Event ID is uniformly accessed, it would be a great hash key. You would really need to describe how your data is distributed to discuss this much more. Other things that come into play are how fast you want queries to work and how much you are willing to pay (e.g. secondary indexes are expensive).
And your queries:
Get an event by ID
GetItem using Event ID
Get all events where calendarId = x and ownerId = y
Query by GSI partition key, add a condition on ownerId
Get all events where startTimestamp is between x and y and calendarId = z
Query by GSI partition key, add a condition on range key
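For reference, here is a sketch of the request parameters those three queries would take in the low-level DynamoDB API shape. The table name "events" and the index name are placeholders I made up, and I've assumed the condition on ownerId is a filter expression, since ownerId is not part of the index key.

```python
TABLE = "events"                          # assumed table name
INDEX = "calendarId-startTimestamp-gsi"   # assumed index name


def get_event(event_id: str) -> dict:
    """GetItem parameters: fetch one event by its hash key."""
    return {"TableName": TABLE, "Key": {"id": {"S": event_id}}}


def events_by_calendar_and_owner(calendar_id: str, owner_id: str) -> dict:
    """Query parameters: GSI partition key plus a filter on ownerId."""
    return {
        "TableName": TABLE,
        "IndexName": INDEX,
        "KeyConditionExpression": "calendarId = :c",
        "FilterExpression": "ownerId = :o",
        "ExpressionAttributeValues": {":c": {"S": calendar_id}, ":o": {"S": owner_id}},
    }


def events_in_range(calendar_id: str, start: int, end: int) -> dict:
    """Query parameters: GSI partition key plus a range condition on the sort key."""
    return {
        "TableName": TABLE,
        "IndexName": INDEX,
        "KeyConditionExpression": "calendarId = :c AND startTimestamp BETWEEN :x AND :y",
        "ExpressionAttributeValues": {
            ":c": {"S": calendar_id},
            ":x": {"N": str(start)},  # numbers travel as strings in this API shape
            ":y": {"N": str(end)},
        },
    }
```

The range query narrows the read using the key itself, while the ownerId filter is applied only after items have been read from the index.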
I just want to add something to the accepted answer:
Get all events where calendarId = x and ownerId = y
Query by GSI partition key, add a condition on ownerId
This method is not reliable. I guess that when you say "add a condition on ownerId", you mean "add a Filter expression on ownerId" (Definition by Alex DeBrie)
But DynamoDB's 1 MB read limit makes it unreliable.
It is better explained in the link above, but here is the summary:
If your calendar has a lot of events, representing more than 1 MB of data, the condition ownerId == X is applied only to the first 1 MB that is read, so a single Query call leaves out any matches in the rest of the data.
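One way to cope with this is to keep requesting pages until DynamoDB stops returning a LastEvaluatedKey. The sketch below assumes a query_page callable with the same keyword interface as a Query call (for example a boto3 client's query method); the helper name is mine.

```python
from typing import Any, Callable, Dict, List


def query_all(query_page: Callable[..., Dict[str, Any]],
              params: Dict[str, Any]) -> List[Any]:
    """Collect filtered items from every page, not just the first 1 MB."""
    items: List[Any] = []
    start_key = None
    while True:
        page_params = dict(params)
        if start_key is not None:
            # Resume where the previous page stopped reading.
            page_params["ExclusiveStartKey"] = start_key
        page = query_page(**page_params)
        # A page can be empty after filtering and still not be the last one.
        items.extend(page.get("Items", []))
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            break
    return items
```

This returns the complete result, but every page is still read and billed before the filter is applied, so a key design that avoids the filter remains cheaper.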