how to create dynamoDB efficiently with my table? - amazon-dynamodb

If each of my database's an overview has only two types (state: pending, appended), is it efficient to designate these two types as partition keys? Or is it effective to index this state value?

It would be more effective to use a sparse index. In your case, you might add an attribute called isPending. You can add this attribute to items that are pending, and remove it once they are appended. If you create a GSI with tid as the hash key and isPending as the sort key, then only items that are pending will be in the GSI.

It will depend on how would you search for these records!
For example, if you will always search by record ID, it never minds. But if you will search every time by the set of records pending, or appended, you should think in use partitions.
You also could research in this Best practice guide from AWS: https://docs.aws.amazon.com/en_us/amazondynamodb/latest/developerguide/best-practices.html
Updating:
In this section of best practice guide, it recommends the following:
Keep related data together. Research on routing-table optimization
20 years ago found that "locality of reference" was the single most
important factor in speeding up response time: keeping related data
together in one place. This is equally true in NoSQL systems today,
where keeping related data in close proximity has a major impact on
cost and performance. Instead of distributing related data items
across multiple tables, you should keep related items in your NoSQL
system as close together as possible.
As a general rule, you should maintain as few tables as possible in a
DynamoDB application. As emphasized earlier, most well designed
applications require only one table, unless there is a specific reason
for using multiple tables.
Exceptions are cases where high-volume time series data are involved,
or datasets that have very different access patterns—but these are
exceptions. A single table with inverted indexes can usually enable
simple queries to create and retrieve the complex hierarchical data
structures required by your application.
Use sort order. Related items can be grouped together and queried
efficiently if their key design causes them to sort together. This is
an important NoSQL design strategy.
Distribute queries. It is also important that a high volume of
queries not be focused on one part of the database, where they can
exceed I/O capacity. Instead, you should design data keys to
distribute traffic evenly across partitions as much as possible,
avoiding "hot spots."
Use global secondary indexes. By creating specific global secondary
indexes, you can enable different queries than your main table can
support, and that are still fast and relatively inexpensive.
I hope I could help you!

Related

DynamoDB: Querying all similar items of a certain type

Keeping in mind the best practices of having a single table and to evenly distribute items across partitions using as unique partition keys as possible in DynamoDB, I am stuck at one problem.
Say my table stores items such as users, items and devices. I am storing the id for each of these items as the partition key. Each id is prefixed with its type such as user-XXXX, item-XXXX & device-XXXX.
Now the problem is how can I query only a certain type of object? For example I want to retrieve all users, how do I do that? It would have been possible if the begin_with operator was allowed for partition keys so I could search for the prefix but the partition keys only allow the equality operator.
If now I use my types as partition keys, for example, user as partition key and then the user-id as the sort key, it would work but it would result in only a few partition keys and thus resulting in the hot keys issue. And creating multiple tables is a bad practice.
Any suggestions are welcome.
This is a great question. I'm also interested to hear what others are doing to solve this problem.
If you're storing your data with a Partition Key of <type>-<id>, you're supporting the access pattern "retrieve an item by ID". You've correctly noted that you cannot use begins_with on a Partition Key, leaving you without a clear cut way to get a collection of items of that type.
I think you're on the right track with creating a Partition Key of <type> (e.g. Users, Devices, etc) with a meaningful Sort Key. However, since your items aren't evenly distributed across the table, you're faced with the possibility of a hot partition.
One way to solve the problem of a hot partition is to use an external cache, which would prevent your DB from being hit every time. This comes with added complexity that you may not want to introduce to your application, but it's an option.
You also have the option of distributing the data across partitions in DynamoDB, effectively implementing your own cache. For example, lets say you have a web application that has a list of "top 10 devices" directly on the homepage. You could create partitions DEVICES#1,DEVICES#2,DEVICES#3,...,DEVICES#N that each stores the top 10 devices. When your application needs to fetch the top 10 devices, it could randomly select one of these partitions to get the data. This may not work for a partition as large as Users, but is a pretty neat pattern to consider.
Extending this idea further, you could partition Devices by some other meaningful metric (e.g. <manufactured_date> or <created_at>). This would more uniformly distribution your Device items throughout the database. Your application would be responsible for querying all the partitions and merging the results, but you'd reduce/eliminate the hot partition problem. The AWS DynamoDB docs discuss this pattern in greater depth.
There's hardly a one size fits all approach to DynamoDB data modeling, which can make the data modeling super tricky! Your specific access patterns will dictate which solution fits best for your scenario.
Keeping in mind the best practices of having a single table and to evenly distribute items across partitions
Quickly highlighting the two things mentioned here.
Definitely even distribution of partitions keys is a best practice.
Having the records in a single table, in a generic sense is to avoid having to Normalize like in a relational database. In other words its fine to build with duplicate/redundant information. So its not necessarily a notion to club all possible data into a single table.
Now the problem is how can I query only a certain type of object? For
example I want to retrieve all users, how do I do that?
Let's imagine that you had this table with only "user" data in it. Would this allow to retrieve all users? Ofcourse not, unless there is a single partition with type called user and rest of it say behind a sort key of userid.
And creating multiple tables is a bad practice
I don't think so its considered bad to have more than one table. Its bad if we store just like normalized tables and having to use JOIN to get the data together.
Having said that, what would be a better approach to follow.
The fundamental difference is to think about the queries first to derive at the table design. That will even suggest if DynamoDB is the right choice. For example, the requirement to select every user might be a bad use case altogether for DynamoDB to solve.
The query patterns will further suggest, what is the best partition key in hand. The choice of DynamoDB here is it because of high ingest and mostly immutable writes?
Do I always have the partition key in hand to perform the select that I need to perform?
What would the update statements look like, will it have again the partition key to perform updates?
Do I need to further filter by additional columns and can that be the default sort order?
As you start answering some of these questions, a better model might appear altogether.

Does DynamoDB GSI overloading give performance benefits or just flexibility

Does GSI Overloading provide any performance benefits, e.g. by allowing cached partition keys to be more efficiently routed? Or is it mostly about preventing you from running out of GSIs? Or maybe opening up other query patterns that might not be so immediately obvious.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html
e.g. I you have a base table and you want to partition it so you can query a specific attribute (which becomes the PK of the GSI) over two dimensions, does it make any difference if you create 1 overloaded GSI, or 2 non-overloaded GSIs.
For an example of what I'm referring to see the attached image:
https://drive.google.com/file/d/1fsI50oUOFIx-CFp7zcYMij7KQc5hJGIa/view?usp=sharing
The base table has documents which can be in a published or draft state. Each document is owned by a single user. I want to be able to query by user to find:
Published documents by date
Draft documents by date
I'm asking in relation to the more recent DynamoDB best practice that implies that all applications only require one table. Some of the techniques being shown in this documentation show how a reasonably complex relational model can be squashed into 1 DynamoDB table and 2 GSIs and yet still support 10-15 query patterns.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-relational-modeling.html
I'm trying to understand why someone would go down this route as it seems incredibly complicated.
The idea – in a nutshell – is to not have the overhead of doing joins on the database layer or having to go back to the database to effectively try to do the join on the application layer. By having the data sliced already in the format that your application requires, all you really need to do is basically do one select * from table where x = y call which returns multiple entities in one call (in your example that could be Users and Documents). This means that it will be extremely efficient and scalable on the db level. But also means that you'll be less flexible as you need to know the access patterns in advance and model your data accordingly.
See Rick Houlihan's excellent talk on this https://www.youtube.com/watch?v=HaEPXoXVf2k for why you'd want to do this.
I don't think it has any performance benefits, at least none that's not called out – which makes sense since it's the same query and storage engine.
That being said, I think there are some practical reasons for why you'd want to go with a single table as it allows you to keep your infrastructure somewhat simple: you don't have to keep track of metrics and/or provisioning settings for separate tables.
My opinion would be cost of storage and provisioned throughput.
Apart from that not sure with new limit of 20

Can we avoid scan in dynamodb

I am new the noSQL data modelling so please excuse me if my question is trivial. One advise I found in dynamodb is always supply 'PartitionId' while querying otherwise, it will scan the whole table. But there could be cases where we need listing our items, for instance in case of ecom website, where we need to list our products on list page (with pagination).
How should we perform this listing by avoiding scan or using is efficiently?
Basically, there are three ways of reading data from DynamoDB:
GetItem – Retrieves a single item from a table. This is the most efficient way to read a single item, because it provides direct access to the physical location of the item.
Query – Retrieves all of the items that have a specific partition key. Within those items, you can apply a condition to the sort key and retrieve only a subset of the data. Query provides quick, efficient access to the partitions where the data is stored.
Scan – Retrieves all of the items in the specified table. (This operation should not be used with large tables, because it can consume large amounts of system resources.
And that's it. As you see, you should always prefer GetItem (BatchGetItem) to Query, and Query — to Scan.
You could use queries if you add a sort key to your data. I.e. you can use category as a hash key and product name as a sort key, so that the page showing items for a particular category could use querying by that category and product name. But that design is fragile, as you may need other keys for other pages, for example, you may need a vendor + price query if the user looks for a particular mobile phones. Indexes can help here, but they come with their own tradeofs and limitations.
Moreover, filtering by arbitrary expressions is applied after the query / scan operation completes but before you get the results, so you're charged for the whole query / scan. It's literally like filtering the data yourself in the application and not on the database side.
I would say that DynamoDB just is not intended for many kinds of workloads. Probably, it's not suited for your case too. Think of it as of a rich key-value (key to object) store, and not a "classic" RDBMS where indexes come at a lower cost and with less limitations and who provide developers rich querying capabilities.
There is a good article describing potential issues with DynamoDB, take a look. It contains an awesome decision tree that guides you through the DynamoDB argumentation. I'm pasting it here, but please note, that the original author is Forrest Brazeal.
Another article worth reading.
Finally, check out this short answer on SO about DynamoDB usecases and issues.
P.S. There is nothing criminal in doing scans (and I actually do them by schedule once per day in one of my projects), but that's an exceptional case and I regret about the decision to use DynamoDB in that case. It's not efficient in terms of speed, money, support and "dirtiness". I had to increase the capacity before the job and reduce it after, but that's another story…

CosmosDB/DocumentDB partitioning with multiple types in same collection

Official recommendation from the team is, to my knowledge, to put all datatypes into single collection that have something like type=someType field on documents to distinguish types.
Now, if we assume large databases with partitioning where different object types can be:
Completely different fields (so no common field for partitioning)
Related (through reference)
How to organize things so that things that should go together end up in same partition?
For example, lets say we have:
User
BlogPost
BlogPostComment
If we store them as separate types with type=user|blogPost|blogPostComment, in same collection, how do we ensure that user, his blogposts and all the corresponding comments end up in same partition?
Is there some best practice for this?
[UPDATE]
Can you ever avoid cross-partition queries completely? Should that be a goal? Or you just try to minimize them?
For example, you can partition your data perfectly for 99% of cases/queries but then you need some dashboard to show aggregates from all-the-data. Is that something you just accept as inevitable and try to minimize or is it possible to avoid it completely?
I've written about this somewhat extensively in other similar questions regarding Cosmos.
Basically, when dealing with many different logical entity types in a single Cosmos collection the easiest option is to put a generic (or abstract, as you refer to it) partition key on all your documents. At this point it's the concern of the application to make sure that at runtime the appropriate value is chosen. I usually name this document property either partitionKey, routingKey or something similar.
This is extremely important when designing for optimal query efficiency as your choice of partition keys can have a huge impact on query and throughput performance. A generic key like this lets you design the optimal storage of your data as it benefits whatever application you're building.
Even something like tenant does not make sense as different tenants might have wildly different data size and access patterns. Instead you could include the tenantId at runtime as part of your partition key as a kind of composite.
UPDATE:
For certain query patterns it might be possible to serve them entirely out of a single partition. It's definitely not the end of the world if things end up going cross partition though. The system is still quick. If possible, limiting the amount of partitions that need to be touched for a given query is ideal but you're never going to get away from it 100% of the time.
A partition should hold data related to a group that is expected to grow, for instance a Tenant which will group many documents (which can be of different types as you have mentioned) So the Partition Key in this instance should be the TenantId. The partitioning is more about the data relating to a group than the type of data. If the data is related to a User then you could use the UserId, however many users may comment on the same posts so it doesn't seem like a good candidate for a partition key unless there is some de-normalization of the user info so it doest have to relate back to the other users directly.. if that makes sense?

How to strike a performance balance with documentDB collection for multiple tenants?

Say I have:
My data stored in documetDB's collection for all of my tenants. (i.e. multiple tenants).
I configured the collection in such a way that all of my data is distributed uniformly across all partitions.
But partitions are NOT by each tenant. I use some other scheme.
Because of this data for a particular tenant is distributed across multiple partitions.
Here are my questions:
Is this the right thing to do to maximum performance for both reading and writing data?
What if I want to query for a particular tenant? What are the caveats in writing this query?
Any other things that I need to consider?
I would avoid queries across partitions, they come with quite a cost (basically multiply index and parsing costs with number of partitions - defaults to 25). It's fairly easy to try out.
I would prefer a solution where one can query on a specific partition, typically partitioning by tenant ID.
Remember that with partitioned collections, there's stil limits on each partition (10K RU and 10GB) - I have written about it here http://blog.ulriksen.net/notes-on-documentdb-partitioning/
It depends upon your usage patterns as well as the variation in tenant size.
In general for multi-tenant systems, 99% of all operations are within a single tenant. If you make the tenantID your partition key, then those operations will only touch a single partition. This won't make a single operation any faster (latency) but could provide huge throughput gains when under load by multiple tenants. However, if you only have 5 tenants and 1 of them is 10x bigger than all the others, then using the tenantID as your key will lead to a very unbalanced system.
We use the tenantID as the partition key for our system and it seems to work well. We've talked about what we would do if it became very unbalanced and one idea is to make the partition key be the tenantID + to split the large tenants up. We haven't had to do that yet though so we haven't worked out all of those details to know if that would actually be possible and performant, but we think it would work.
What you have described is a sensible solution, where you avoid data skews and load-balance across partitions well. Since the query for a particular tenant needs to touch all partitions, please remember to set FeedOptions.EnableCrossPartitionQuery to true (x-ms-documentdb-query-enablecrosspartition in the REST API).
DocumentDB site also has an excellent article on partitioned collections and tips for choosing a partition key in general. https://azure.microsoft.com/en-us/documentation/articles/documentdb-partition-data/

Resources