AWS DynamoDB table Partition Key - amazon-dynamodb

Question about Partition Key in Dynamodb table.
It says Partition key – A simple primary key, composed of one attribute known as the partition key.
DynamoDB uses the partition key's value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored.
Question:
So if I have 1 million records in an Orders table with Orderid as the partition key, does that mean the records of my Orders table are stored across 1 million servers? How is that possible?

The hash output determines the physical partition for placement. Say you have four partitions backing the table. If the hash output value is in the first quarter of the keyspace it goes into the first partition. And so on. The hash value output will determine into which of the four it goes.
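Partitions can then split as needed, with each new partition taking a subset of the old partition's keyspace. So 1 million distinct key values does not mean 1 million partitions: many keys hash into the same partition.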
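The four-partition example above can be sketched in a few lines of Python. This is only an illustration: DynamoDB's real hash function and keyspace are internal, so MD5, the 32-bit keyspace, and the names `hash_key` and `partition_for` are all assumptions made for the sketch.

```python
import hashlib

KEYSPACE = 2 ** 32  # assumed keyspace size, for illustration only


def hash_key(partition_key: str) -> int:
    """Map a partition key value to a point in the keyspace (MD5 is a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def partition_for(partition_key: str, num_partitions: int = 4) -> int:
    """Return the index of the partition whose slice of the keyspace holds the hash."""
    slice_size = KEYSPACE // num_partitions
    return min(hash_key(partition_key) // slice_size, num_partitions - 1)


# One million distinct order IDs still land in only four partitions.
counts = [0, 0, 0, 0]
for order_id in range(1_000_000):
    counts[partition_for(f"order-{order_id}")] += 1
print(counts)
```

Each partition ends up with roughly a quarter of the items; the number of distinct key values has no bearing on the number of partitions.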

Related

DynamoDB - Querying Partitions that have Records with different Partitions Key Values

I have been reading the DynamoDB documentation and understand that the partition key determines the physical partition in which a record will be stored. However, if the partition key has n values, that does not imply there will be n partitions. This is illustrated by the following:
"Fish" and "Lizard" reside in the same partition despite having different partition key values. So how does DynamoDB deal with queries for "Fish" records? Will it just sort all the records in this partition by the sort key (Name) and perform a binary search? But in that case there can be multiple records with the same Name (Sort key). Is it fair to say the time complexity of a read on a composite key query is always O(log n) where n is the size of the partition?
DynamoDB first stores items which share the same partition key (Item collection) physically close together and sorted by a sort key.
When you do a look up by partition key it's a constant O(1) to look up the first item because DynamoDB can point straight to that location on the disk (SSD). No other partition key needs to be read during the lookup.
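A rough in-memory model of this, assuming a plain dict for the partition-key lookup and a sorted list for each item collection (DynamoDB's actual storage layout is internal, and the table contents and the name `get_item` here are hypothetical):

```python
from bisect import bisect_left

# Hypothetical model: partition key -> item collection, stored as a list of
# (sort_key, attributes) tuples kept sorted by sort key.
table = {
    "Fish": [("Bubbles", {"Age": 1}), ("Nemo", {"Age": 2})],
    "Lizard": [("Iggy", {"Age": 4})],
}


def get_item(partition_key, sort_key):
    """Jump to the item collection by hash lookup, then binary-search the sort key."""
    collection = table.get(partition_key, [])  # no other partition key is touched
    # (sort_key,) orders just before (sort_key, attributes), so bisect_left
    # lands on the matching entry when it exists.
    i = bisect_left(collection, (sort_key,))
    if i < len(collection) and collection[i][0] == sort_key:
        return collection[i][1]
    return None


print(get_item("Fish", "Nemo"))
print(get_item("Lizard", "Nemo"))
```

In this model the search cost depends on the size of the "Fish" item collection, not on everything that happens to share the physical partition. Within one partition key value the sort key is unique, so a full composite key matches at most one item.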

Differentiate between partition keys & partition key ranges in Azure Cosmos DB

I'm having difficulty understanding the difference between the partition keys & the partition key ranges in Cosmos DB. I understand generally that a partition key in cosmos db is a JSON property/path within each document that is used to evenly distribute data among multiple partitions to avoid any uneven "hot partitions" -- and partition key decides the physical placement of documents.
But it's not clear to me what the partition key range is...is this just a range of literal partition keys starting from first to last grouped by each individual partition in the collection? I know the ranges can be found by performing a GET request to the endpoint https://{databaseaccount}.documents.azure.com/dbs/{db-id}/colls/{coll-id}/pkranges but I just want to be sure I understand conceptually. I'm also still not clear on how to view, at a granular level, the specific partition key that a specific document belongs to.
https://learn.microsoft.com/en-us/rest/api/cosmos-db/get-partition-key-ranges
You define a property on your documents that you want to use as the partition key.
Cosmos DB hashes the value of that property for every document in the collection and maps different partition keys to different physical partitions.
Over time, your collection will grow and you might end up with, for example, 100 logical partitions distributed over 5 physical partitions.
Partition key ranges are just collections of partition keys grouped by the physical partitions they are mapped to.
So, in this example, you would get 5 pkranges, each with a min/max partition key value.
Note that pkranges might change: as your collection grows, physical partitions will get split, so some partition keys move to a new physical partition and part of the previous range moves to a new location.
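As a sketch of that idea, the snippet below maps 100 logical partition keys onto 5 ranges and then splits one range. It is a toy model, not the Cosmos DB implementation: the hash function (MD5), the 32-bit hash space, the field names `min_inclusive` and `max_exclusive`, and the helper functions are all assumptions for illustration.

```python
import hashlib
from bisect import bisect_right

HASH_SPACE = 2 ** 32  # assumed hash space size, for illustration only


def hash_value(partition_key: str) -> int:
    """Stand-in hash; the real service uses its own internal function."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def even_ranges(count: int) -> list:
    """Cut the hash space into `count` contiguous ranges, one per physical partition."""
    bounds = [HASH_SPACE * i // count for i in range(count + 1)]
    return [
        {"id": str(i), "min_inclusive": bounds[i], "max_exclusive": bounds[i + 1]}
        for i in range(count)
    ]


def range_for(partition_key: str, pk_ranges: list) -> dict:
    """Find the range whose [min, max) interval contains the key's hash."""
    starts = [r["min_inclusive"] for r in pk_ranges]
    return pk_ranges[bisect_right(starts, hash_value(partition_key)) - 1]


def split(pk_ranges: list, range_id: str) -> list:
    """Replace one range with two halves, as happens when a physical partition splits."""
    result = []
    for r in pk_ranges:
        if r["id"] != range_id:
            result.append(r)
            continue
        mid = (r["min_inclusive"] + r["max_exclusive"]) // 2
        result.append({"id": r["id"] + "a", "min_inclusive": r["min_inclusive"], "max_exclusive": mid})
        result.append({"id": r["id"] + "b", "min_inclusive": mid, "max_exclusive": r["max_exclusive"]})
    return result


pk_ranges = even_ranges(5)
logical_keys = [f"tenant-{i}" for i in range(100)]
owners = {key: range_for(key, pk_ranges)["id"] for key in logical_keys}
print(len(set(owners.values())), "ranges hold", len(owners), "logical partitions")

pk_ranges_after = split(pk_ranges, "0")
```

After the split, only the keys that hashed into range "0" get a new owner; every other key stays where it was.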

What is the difference between an AWS DynamoDB local vs. global secondary index?

From the DynamoDB documentation:
Global secondary index — an index with a partition key and a sort key that can be different from those on the base table. A global secondary index is considered "global" because queries on the index can span all of the data in the base table, across all partitions.
Local secondary index — an index that has the same partition key as the base table, but a different sort key. A local secondary index is "local" in the sense that every partition of a local secondary index is scoped to a base table partition that has the same partition key value.
This just isn't making sense to me and no amount of searches is able to aptly explain it to me.
Could someone help me w/understanding this?
When you insert data into DynamoDB, it partitions the data internally and stores it on different storage nodes. This is based on the partition key.
Let's say you want to query an item based on a non-key attribute (neither partition key nor sort key). You need to use a Scan, which is expensive since it checks all the items in the table.
This is where GSIs and LSIs come in. Let's take the example of a Student table with StudentsId as the sort key and SchoolId as the partition key.
An LSI is useful if your application has queries like getting all the grade 5 students of a given school.
If you need to query all grade 5 students across all schools (across all school partitions), you will need a GSI.
Local secondary index (LSI):
can only be created when the table is created
shares capacity units with the table
the index's partition key has to be the same as the table's partition key
a table can have up to 5 LSIs
Global secondary index (GSI):
can be created at any time, but takes time to set up (because the original table's items are copied into the index table, which costs read capacity units on the table)
has its own separate set of capacity units
any attribute can be the partition key
a table can have up to 20 GSIs by default (the limit was 5 when this answer was first written)
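To make the Student example concrete, here is a sketch of the table definition with one LSI and one GSI, written as the parameter dictionary that boto3's `create_table` accepts. The index names, the `Grade` attribute, and the choice of on-demand billing are assumptions for illustration; the call itself is left commented out so the snippet runs without AWS credentials.

```python
# Hypothetical table definition for the Student example.
create_table_params = {
    "TableName": "Student",
    "BillingMode": "PAY_PER_REQUEST",  # assumed; avoids per-index throughput settings
    "AttributeDefinitions": [
        {"AttributeName": "SchoolId", "AttributeType": "S"},
        {"AttributeName": "StudentsId", "AttributeType": "S"},
        {"AttributeName": "Grade", "AttributeType": "N"},
    ],
    # Base table: SchoolId is the partition key, StudentsId the sort key.
    "KeySchema": [
        {"AttributeName": "SchoolId", "KeyType": "HASH"},
        {"AttributeName": "StudentsId", "KeyType": "RANGE"},
    ],
    # LSI: same partition key as the table, different sort key.
    # Answers "all grade 5 students of a given school".
    "LocalSecondaryIndexes": [
        {
            "IndexName": "GradeWithinSchool",
            "KeySchema": [
                {"AttributeName": "SchoolId", "KeyType": "HASH"},
                {"AttributeName": "Grade", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    # GSI: a different partition key.
    # Answers "all grade 5 students across all schools".
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "GradeAcrossSchools",
            "KeySchema": [
                {"AttributeName": "Grade", "KeyType": "HASH"},
                {"AttributeName": "StudentsId", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
}

# Not executed here; requires boto3 and AWS credentials:
# import boto3
# boto3.client("dynamodb").create_table(**create_table_params)

table_pk = create_table_params["KeySchema"][0]["AttributeName"]
lsi_pk = create_table_params["LocalSecondaryIndexes"][0]["KeySchema"][0]["AttributeName"]
gsi_pk = create_table_params["GlobalSecondaryIndexes"][0]["KeySchema"][0]["AttributeName"]
print(table_pk, lsi_pk, gsi_pk)
```

The LSI reuses SchoolId as its partition key, so it stays scoped to one school's items, while the GSI regroups every item by Grade regardless of school.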

DynamoDB: Keys and what they mean

I'm confused as to how to use DynamoDB table keys. The documentation mentions HASH (which seem to also be referred to as Partition) keys and RANGE (or SORT?) keys. I'm trying to roughly align these with my previous understanding of database indexing theories.
My current, mostly guess-based understanding is that a HASH key is essentially a primary key - it must be unique and is automatically indexed for fast-reading - and a RANGE key is basically something you should apply to any other field you plan on querying on (either in a WHERE-like or sorting context).
This is then somewhat confused by the introductions of Local and Global Secondary Indexes. How do they play into things?
If anyone could nudge me in the right direction, bearing in mind my current, probably flawed understanding has come from the docs, I'd be super grateful.
Thanks!
Basically, the DynamoDB table is partitioned based on partition key (otherwise called hash key).
1) If the table has only a partition key, then it has to be unique. DynamoDB table performance depends largely on the partition key. A good partition key should have well-scattered values (do not use a sequence number as the partition key, as you would for an RDBMS primary key in legacy systems).
2) If the table has both a partition key and a sort key (otherwise called a RANGE key), then the combination of the two needs to be unique. It is similar to a composite key in RDBMS terms.
However, the usage differs in DynamoDB table. DynamoDB doesn't have a sorting functionality (i.e. ORDER BY clause) across the partition keys. For example, if you have 10 items with same partition key value and different sort key values, then you can sort the result based on the sort key attribute. You can't apply sorting on any other attributes including partition key.
All sort key values of a partition key will be maintained in the same partition for better performance (i.e. physically co-located).
LSI - A table can have up to 5 LSIs. They must be defined when you create the table. An LSI is a kind of alternate sort key for the table.
GSI - In order to understand GSI, you need to understand the difference between SCAN and QUERY API in DynamoDB.
SCAN - is used when you don't know the partition key (i.e. full table scan to get the item)
QUERY - is used when you know the partition key (i.e. sort key is optional)
As DynamoDB costing is based on read/write capacity units and for better performance, scan is not the best option for most of the use cases. So, there is an option to create the GSI with alternate partition keys based on the Query Access Pattern (QAP).
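The cost difference can be seen in a small in-memory simulation. This is not the DynamoDB API: the Orders data, the `build_index`, `scan` and `query` helpers, and the idea of counting "items examined" as a stand-in for consumed read capacity are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical Orders data: 1000 orders spread over 100 customers.
orders = [
    {
        "CustomerId": f"c{i % 100}",
        "OrderId": f"o{i}",
        "Status": "SHIPPED" if i % 10 == 0 else "OPEN",
    }
    for i in range(1000)
]


def build_index(items, partition_attr):
    """Group items by the chosen partition key attribute."""
    index = defaultdict(list)
    for item in items:
        index[item[partition_attr]].append(item)
    return index


def scan(index, predicate):
    """Examine every item; return (matches, number of items examined)."""
    examined, matches = 0, []
    for collection in index.values():
        for item in collection:
            examined += 1
            if predicate(item):
                matches.append(item)
    return matches, examined


def query(index, partition_key):
    """Read only the item collection for one partition key."""
    collection = index.get(partition_key, [])
    return list(collection), len(collection)


base_table = build_index(orders, "CustomerId")  # table keyed by CustomerId
status_gsi = build_index(orders, "Status")      # GSI keyed by Status

print(scan(base_table, lambda o: o["CustomerId"] == "c7")[1])  # 1000 items examined
print(query(base_table, "c7")[1])                              # 10 items examined
print(scan(base_table, lambda o: o["Status"] == "SHIPPED")[1]) # 1000 items examined
print(query(status_gsi, "SHIPPED")[1])                         # 100 items examined
```

Looking up by Status on the base table forces a scan, because Status is not its partition key; the index keyed by Status turns the same access pattern into a query.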

Does a sort key in dynamo sort even with different partition keys?

Getting acclimated to DynamoDB : )
If I have a table with a unique partition key, like a unique id, and I use a time stamp as a sort key, how will Dynamo sort my data?
Will I have the most recent things in one partition, and the older things in other partitions?
I ask because I want to know how to assign throughput, and I'm certain my recently created and edited items will be most likely to be accessed, and the old stuff can pretty much be archived.
DynamoDB keeps all the items for a particular partition key in one partition. For example, if there are 10 items for a specific partition key with different timestamps, all 10 items will be on a single partition. When the data for a partition key is retrieved, all the items can be read from that single partition, which makes retrieval faster.
Regarding sorting, DynamoDB sorts the data only within a particular partition key, not across partition keys. You can use the ScanIndexForward parameter to return the data in ascending or descending order.
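A minimal simulation of that behaviour, assuming a dict of item collections and a hypothetical `query` helper whose `scan_index_forward` argument mirrors the ScanIndexForward parameter (true means ascending sort key order, which is the default):

```python
# Hypothetical data: partition key -> items, each with a Timestamp sort key.
events = {
    "user-1": [
        {"Timestamp": 1700000300, "Action": "edit"},
        {"Timestamp": 1700000100, "Action": "create"},
        {"Timestamp": 1700000200, "Action": "view"},
    ],
    "user-2": [
        {"Timestamp": 1700000150, "Action": "create"},
    ],
}


def query(table, partition_key, scan_index_forward=True):
    """Return one partition key's items ordered by sort key, ascending or descending."""
    items = sorted(table.get(partition_key, []), key=lambda item: item["Timestamp"])
    return items if scan_index_forward else items[::-1]


oldest_first = [e["Action"] for e in query(events, "user-1")]
newest_first = [e["Action"] for e in query(events, "user-1", scan_index_forward=False)]
print(oldest_first)  # ['create', 'view', 'edit']
print(newest_first)  # ['edit', 'view', 'create']
```

The ordering applies only inside "user-1"; items under "user-2" are never interleaved with them. If every item has its own unique partition key, each item collection holds a single item and the sort key gives no ordering across items.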

Resources