DynamoDB - Querying Partitions that have Records with Different Partition Key Values

I have been reading the DynamoDB documentation and understand that the partition key determines the physical partition a record will be stored in. However, if the partition key has n distinct values, that does not imply there will be n partitions. This is illustrated by the following:
"Fish" and "Lizard" reside in the same partition despite having different partition key values. So how does DynamoDB deal with queries for "Fish" records? Will it just sort all the records in this partition by the sort key (Name) and perform a binary search? But in that case there can be multiple records with the same Name (Sort key). Is it fair to say the time complexity of a read on a composite key query is always O(log n) where n is the size of the partition?

DynamoDB stores items that share the same partition key (an item collection) physically close together, sorted by the sort key.
When you look up by partition key, finding the item collection is effectively constant time, O(1), because DynamoDB can point straight to that location on disk (SSD); no other partition keys need to be read during the lookup. Within the collection, the sort key narrows the result.
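A toy model may make this concrete. This is a sketch of the idea, not DynamoDB's actual storage engine: an item collection groups all items with one partition key value, sorted by sort key, so a composite-key lookup is a hash lookup to the collection plus an O(log n) binary search within it. (Note that the composite key partition key + sort key must be unique, so there cannot be two items with the same partition key and sort key.)

```python
import bisect

# Toy sketch (not DynamoDB's real internals): each partition key value
# maps to an item collection kept sorted by sort key.
class ToyTable:
    def __init__(self):
        self.collections = {}  # partition key value -> sorted [(sort_key, item)]

    def put(self, pk, sk, item):
        coll = self.collections.setdefault(pk, [])
        # Binary search for the sort key's position in the collection.
        i = bisect.bisect_left(coll, (sk,))
        if i < len(coll) and coll[i][0] == sk:
            coll[i] = (sk, item)      # composite key is unique: overwrite
        else:
            coll.insert(i, (sk, item))

    def query(self, pk, sk):
        # O(1) hash to the item collection, then O(log n) within it.
        coll = self.collections.get(pk, [])
        i = bisect.bisect_left(coll, (sk,))
        if i < len(coll) and coll[i][0] == sk:
            return coll[i][1]
        return None
```

So the O(log n) cost in the question applies only within one item collection, not to the whole partition's data for all keys.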

Related

AWS DynamoDB table Partition Key

Question about the partition key in a DynamoDB table.
The documentation says: "Partition key – A simple primary key, composed of one attribute known as the partition key."
DynamoDB uses the partition key's value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored.
Question:
So if I have 1 million records in an Orders table with OrderId as the partition key, does that mean the records of my Orders table are stored across 1 million servers? How is that possible?
The hash output determines the physical partition for placement. Say you have four partitions backing the table: if the hash output falls in the first quarter of the keyspace, the item goes into the first partition, and so on. The hash value alone determines into which of the four it goes.
Partitions can then split as needed, each new partition taking a subset of the keyspace of the old one.
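The keyspace-quarters idea can be sketched in a few lines. DynamoDB's actual hash function is internal and undocumented, so MD5 stands in for it here; the point is only that the hash value, divided into equal slices, picks the partition:

```python
import hashlib

# Hypothetical sketch: MD5 stands in for DynamoDB's internal hash.
NUM_PARTITIONS = 4
KEYSPACE = 2 ** 128  # MD5 output space

def partition_for(pk_value: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Hash the partition key value, then map it to one of the
    # equal-sized slices of the keyspace.
    h = int(hashlib.md5(pk_value.encode()).hexdigest(), 16)
    return h * num_partitions // KEYSPACE

# A split just doubles the slices: each new partition takes half the
# keyspace of the old one, so an item never leaves its old slice.
```

So 1 million OrderId values do not mean 1 million servers: all of them hash into however many partitions currently back the table.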

Differentiate between partition keys & partition key ranges in Azure Cosmos DB

I'm having difficulty understanding the difference between partition keys and partition key ranges in Cosmos DB. I understand generally that a partition key in Cosmos DB is a JSON property/path within each document that is used to distribute data evenly among multiple partitions and avoid "hot partitions" -- and that the partition key decides the physical placement of documents.
But it's not clear to me what a partition key range is. Is it just a range of literal partition keys, from first to last, grouped by each physical partition in the collection? I know the ranges can be found by performing a GET request to the endpoint https://{databaseaccount}.documents.azure.com/dbs/{db-id}/colls/{coll-id}/pkranges, but I want to be sure I understand it conceptually. I'm also still not clear on how to granularly view the specific partition key range that a given document belongs to.
https://learn.microsoft.com/en-us/rest/api/cosmos-db/get-partition-key-ranges
You define a property on your documents that you want to use as the partition key.
Cosmos DB hashes that property's value for every document in the collection and maps different partition keys to different physical partitions.
Over time your collection grows, and you might end up with, for example, 100 logical partitions distributed over 5 physical partitions.
Partition key ranges are just collections of partition keys grouped by the physical partition they are mapped to.
So, in this example, you would get 5 pkranges, with a min/max partition key value for each.
Note that the pkranges can change: as your collection grows, physical partitions get split, moving some partition keys to a new physical partition, so part of a previous range moves to a new location.
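Conceptually, a pkrange is a contiguous slice of the hashed keyspace owned by one physical partition, and every logical partition key hashes into exactly one range. The sketch below is illustrative only (the field names mimic the REST response shape, but this is not the Cosmos DB API, and MD5 stands in for the real hash):

```python
import hashlib

KEYSPACE = 2 ** 32  # illustrative hashed-keyspace size

def make_pkranges(num_physical: int):
    # Each physical partition owns one contiguous slice [min, max).
    step = KEYSPACE // num_physical
    return [
        {"id": str(i),
         "minInclusive": i * step,
         "maxExclusive": KEYSPACE if i == num_physical - 1 else (i + 1) * step}
        for i in range(num_physical)
    ]

def range_for(pk_value: str, pkranges) -> str:
    # Hash the logical partition key, then find the owning range.
    h = int(hashlib.md5(pk_value.encode()).hexdigest(), 16) % KEYSPACE
    for r in pkranges:
        if r["minInclusive"] <= h < r["maxExclusive"]:
            return r["id"]
```

This also shows why a document's pkrange can change over time: a split rewrites the range boundaries, but the document's own partition key value never changes.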

DynamoDB: Keys and what they mean

I'm confused as to how to use DynamoDB table keys. The documentation mentions HASH (which seem to also be referred to as Partition) keys and RANGE (or SORT?) keys. I'm trying to roughly align these with my previous understanding of database indexing theories.
My current, mostly guess-based understanding is that a HASH key is essentially a primary key - it must be unique and is automatically indexed for fast reads - and a RANGE key is basically something you apply to any other field you plan to query on (either in a WHERE-like or sorting context).
This is then somewhat confused by the introductions of Local and Global Secondary Indexes. How do they play into things?
If anyone could nudge me in the right direction, bearing in mind my current, probably flawed understanding has come from the docs, I'd be super grateful.
Thanks!
Basically, the DynamoDB table is partitioned based on partition key (otherwise called hash key).
1) If the table has only a partition key, then the key has to be unique. DynamoDB table performance depends largely on the partition key: a good partition key has well-scattered values (avoid sequential values, such as an RDBMS auto-increment primary key in legacy systems).
2) If the table has both a partition key and a sort key (otherwise called a RANGE key), then the combination of the two must be unique. It is a kind of concatenated (composite) key in RDBMS terms.
However, the usage differs in a DynamoDB table. DynamoDB has no sorting functionality (i.e. no ORDER BY clause) across partition keys. For example, if you have 10 items with the same partition key value and different sort key values, you can sort the result by the sort key attribute, but you can't sort on any other attribute, including the partition key.
All sort key values of a partition key will be maintained in the same partition for better performance (i.e. physically co-located).
LSI - A table can have up to five LSIs, and they must be defined when you create the table. An LSI is a kind of alternate sort key for the table.
GSI - In order to understand GSI, you need to understand the difference between SCAN and QUERY API in DynamoDB.
SCAN - is used when you don't know the partition key (i.e. full table scan to get the item)
QUERY - is used when you know the partition key (i.e. sort key is optional)
As DynamoDB pricing is based on read/write capacity units, and for performance reasons, Scan is not the best option for most use cases. So there is an option to create a GSI with an alternate partition key based on the query access pattern.
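The cost difference between Scan and Query can be illustrated with a toy in-memory model (this is not the real DynamoDB or boto3 API; the table shape is made up for illustration). Scan must read, and bill for, every item in the table; Query jumps straight to one item collection via the partition key:

```python
# Toy illustration of why Query beats Scan for known partition keys.
class ToyOrders:
    def __init__(self):
        self.by_customer = {}  # partition-key lookup: customer -> orders
        self.all_items = []    # everything, in insertion order

    def put(self, customer_id, order):
        self.by_customer.setdefault(customer_id, []).append(order)
        self.all_items.append((customer_id, order))

    def scan(self, customer_id):
        # Full table scan: touches (and pays for) every item,
        # filtering afterwards.
        items_read = len(self.all_items)
        matches = [o for c, o in self.all_items if c == customer_id]
        return matches, items_read

    def query(self, customer_id):
        # Partition-key lookup: reads only the matching collection.
        matches = self.by_customer.get(customer_id, [])
        return matches, len(matches)
```

Both return the same matches, but the "items read" count (which is what capacity units charge for) is the whole table for Scan versus just the matching items for Query. A GSI effectively gives you this cheap lookup path for an attribute that isn't the table's partition key.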

Does a sort key in dynamo sort even with different partition keys?

Getting acclimated to DynamoDB : )
If I have a table with a unique partition key, like a unique id, and I use a time stamp as a sort key, how will Dynamo sort my data?
Will I have the most recent things in one partition, and the older things in other partitions?
I ask because I want to know how to assign throughput, and I'm certain my recently created and edited items will be most likely to be accessed, and the old stuff can pretty much be archived.
DynamoDB keeps all the items for a particular partition key in one partition. For example, if there are 10 items with a given partition key and different timestamps, all 10 items will be on a single partition, so when the data for that partition key is retrieved, all the items come from a single partition. This makes retrieval faster.
Regarding the sorting: DynamoDB sorts the data within each partition key. You can use the ScanIndexForward parameter to return the data in ascending or descending order.
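A small sketch of that behavior (a toy model, not the real Query API): items are sorted by sort key only within one partition key value, and ScanIndexForward just controls whether that order is walked forwards or backwards:

```python
# Toy model of Query with ScanIndexForward: sorting applies only
# within a single partition key's items.
def toy_query(items, pk, scan_index_forward=True):
    matches = sorted(
        (it for it in items if it["pk"] == pk),
        key=lambda it: it["sk"],
    )
    return matches if scan_index_forward else list(reversed(matches))
```

Note that if every item has a unique partition key (as in the question), each item collection holds one item, so there is no cross-item ordering at all: recent and old items are scattered across partitions by the hash, not grouped by time.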

Partitioned Collection partition key

I'm confused about what to choose for the PartitionKey and what effect it has. If I use a partitioned collection, then I must define a partition key that DocumentDB can use to distribute the data among multiple servers. But let's say I choose a partition key that is always the same for all documents. Will I still be able to get up to 250k RU/s for a single partitioned collection?
In my case the main query is to get all documents, with paging, in a timeline (newest first):
SELECT TOP 10 c.id, c.someValue, u.id FROM c
JOIN u IN c.users ORDER BY c.createdDate DESC
A minified version of the document looks like this
{
  id: "1",
  someValue: "Foo",
  createdDate: "2016-14-4-14:38:00.00",
  // Max 100 users
  users: [{id: "1"}, {id: "2"}]
}
No, you need to have multiple distinct partition key values in order to achieve high throughput levels in DocumentDB.
A partition in DocumentDB supports up to 10,000 RU/s, so you need at least 25* distinct partition key values to reach 250,000 RU/s. DocumentDB divides the partition keys evenly across the available partitions, i.e. a partition might contain documents with multiple partition keys, but the data for a given partition key is guaranteed to stay within a single partition. You must also structure your workload so that reads/writes are distributed across these partition keys.
*You may need slightly more than 25 partition keys (50-100) in practice, since some partition keys might hash to the same partition.
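The arithmetic behind the answer is just a ceiling division over the per-partition limit (the 10,000 RU/s figure is from the answer above; treat it as the limit at the time of writing):

```python
import math

# Per-physical-partition throughput cap, per the answer above.
RU_PER_PARTITION = 10_000

def min_partition_keys(target_ru: int) -> int:
    # Lower bound on distinct partition key values needed: each key's
    # data lives on exactly one partition, so each key can draw at
    # most RU_PER_PARTITION of throughput.
    return math.ceil(target_ru / RU_PER_PARTITION)
```

In practice you want well beyond this lower bound, both because of hash collisions onto the same partition and because the workload must actually spread evenly across the keys.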
So, we have a partitioned (10 partitions) collection with a throughput of 10,000 RU/s. The partition key is CountryCode and we only have data for 5 countries. Data for two countries was hashed into the same physical partition. As per the documentation found in the following link, we were expecting the data to be reorganized onto the empty partitions once the 10GB limit was hit for the said partition. That didn't happen, and we could no longer add data for those two countries.
Obviously, the right thing to do would be to choose a partition key with high cardinality, but the documentation is misleading.
https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
When a physical partition p reaches its storage limit, Cosmos DB seamlessly splits p into two new partitions p1 and p2 and distributes values corresponding to roughly half the keys to each of the partitions. This split operation is invisible to your application.
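The split described in that quote can be sketched as follows (purely illustrative): the full physical partition hands roughly half of its partition *keys* to each child. Crucially, all data for one key moves together, which is why a split cannot help when a single logical partition key's data has itself reached the storage limit, as in the CountryCode case above:

```python
# Illustrative sketch of a physical-partition split: keys are divided
# between the two children, but each key's data stays whole.
def split(partition: dict):
    keys = sorted(partition)
    mid = len(keys) // 2
    p1 = {k: partition[k] for k in keys[:mid]}
    p2 = {k: partition[k] for k in keys[mid:]}
    return p1, p2
```

A partition holding only one key would come out of `split` unchanged on one side and empty on the other, which is exactly the dead end the comment describes.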
