May two DynamoDB scan segments contain the same hash key? - amazon-dynamodb

I'm scanning a huge table (> 1B docs) so I'm using a parallel scan (using one segment per worker).
The table has a hash key and a sort key.
Intuitively a segment should contain a set of hash keys (including all their sort keys), so one hash key shouldn't appear in more than one segment, but I haven't found any documentation indicating this.
Does anyone know how DynamoDB behaves in this scenario?
Thanks

This is an interesting question. I thought it would be easy to find a document stating that each segment contains a disjoint range of hash keys, and the same hash key cannot appear in more than one segment - but I too failed to find any such document. I am curious if anyone else can find such a document. In the meantime, I can try to offer additional intuitions on why your conjecture is likely correct - but also might be wrong:
My first intuition would be that you are right:
DynamoDB uses the hash key, also known as the partition key, to decide on which of the many storage nodes to store a copy of this data. All of the items sharing the same partition key (with different sort key values) are stored together, in sort-key order, so a Query can read them together in order. DynamoDB uses a hash function on the partition key to decide the placement of each item (hence the name "hash key").
Now, if DynamoDB needs to divide the task of scanning all the data into "segments", the most sensible thing for it to do is to divide the space of hash values (i.e., the hash function of the hash keys) into equal-sized pieces. This division is easy to do (just a numeric division by TotalSegments), it ensures roughly the same number of items in each segment (assuming there are many different partitions), and it ensures that the scanning of each segment involves a different storage node, so the parallel scan can proceed faster than what a single storage node is capable of.
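To make the intuition above concrete, here is a minimal sketch of dividing a hash space into equal-sized segments. It is not DynamoDB's actual implementation, which is not documented: the MD5 hash, the 32-bit hash space, and the names hash_value and segment_for are all assumptions made for illustration.

```python
import hashlib

HASH_BITS = 32  # assumed size of the hash space, chosen only for illustration


def hash_value(partition_key: str) -> int:
    """Map a partition key to a point in a 2**HASH_BITS hash space (MD5 is a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[: HASH_BITS // 8], "big")


def segment_for(partition_key: str, total_segments: int) -> int:
    """Return the index of the equal-sized slice of the hash space that holds the key."""
    return hash_value(partition_key) * total_segments // (2 ** HASH_BITS)


if __name__ == "__main__":
    for key in ["user#1", "user#2", "user#3"]:
        print(key, "->", segment_for(key, 4))
```

Under this model the segment depends only on the partition key, so all sort keys of one partition key would land in the same segment. Whether the real service behaves this way for very large partitions is exactly the open question.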
However, there is one indication that this might not be the entire story.
The DynamoDB documentation claims that
In general, there is no practical limit on the number of distinct sort key values per partition key value.
This means that in theory at least, your entire database, perhaps one petabyte of it, may be in a single partition with billions of different sort keys. Since a single Amazon storage node does have a size limit, it means DynamoDB must (unless the above statement is false) support splitting a single huge partition across multiple storage nodes. This means that when GetItem is looking for a particular item, DynamoDB needs to know which sort key is on which storage node. It also means that a parallel scan might - possibly - divide this huge partition into pieces, with several segments scanning the same partition but different sort-key ranges in it. I am not sure we can completely rule out this possibility. I am guessing it will never happen when you only have smallish partitions.

Every DynamoDB table has a "hashspace" and data is partitioned as per the hash value of the partition key. When a ParallelScan is intended and the TotalSegments and Segment values are provided, the table's complete hashspace is logically divided into these "Segments" such that TotalSegments cover the complete hash space, without overlapping. It is quite possible some segments here do not actually have any data corresponding to them, since there may not be any data in the hashspace allocated to the segment. This can be observed if the TotalSegments value chosen is very high for instance.
And for each Segment value passed in the Scan request (with TotalSegments value being constant), each Segment would return distinct items without any overlap.
FAQs
Q. What is the ideal number for TotalSegments?
-> You might need to experiment with values, find the sweet spot for your table, and the number of workers you use, until your application achieves its best performance.
Q. One or more segments do not return any records. Why?
-> This is possible if the hash range that is allocated as per the TotalSegments value does not have any items. In this case, the TotalSegments value can be decreased, for better performance.
Q. The Scan for a segment failed midway. Can the Scan for that segment alone be retried?
-> As long as the TotalSegments value remains the same, a Scan for one of the segments can be re-run, since it would have the same hash range allocated at any given time.
Q. Can I perform a Scan for a single segment, without performing the Scan for other segments as per TotalSegments value?
-> Yes. Multiple Scan operations for different Segments are not linked/do not depend on previous/other Segment Scans.
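The FAQ above can be sketched as one worker per segment. This is a rough sketch, not a production implementation: the function names scan_segment and parallel_scan are made up here, and client is assumed to be an object exposing a scan method with the same keyword arguments and response keys as the boto3 DynamoDB client (TableName, Segment, TotalSegments, ExclusiveStartKey, Items, LastEvaluatedKey).

```python
from concurrent.futures import ThreadPoolExecutor


def scan_segment(client, table_name, segment, total_segments):
    """Scan one segment to completion, following LastEvaluatedKey pagination."""
    items = []
    kwargs = {
        "TableName": table_name,
        "Segment": segment,
        "TotalSegments": total_segments,
    }
    while True:
        page = client.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        # Continue this segment from where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


def parallel_scan(client, table_name, total_segments):
    """Run one worker thread per segment and return one list of items per segment."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [
            pool.submit(scan_segment, client, table_name, s, total_segments)
            for s in range(total_segments)
        ]
        return [f.result() for f in futures]
```

Because each segment is scanned independently, a failed segment can be retried on its own by calling scan_segment again with the same segment and total_segments values.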

Related

Are logical nodes responsible for a continuous range of keys or a random set of keys

I was reading the Dynamo whitepaper. In it, it is explained how keys obtained from a hash function create a (circular) range. Then, logical nodes are responsible for continuous segments of that range.
Dynamo’s partitioning scheme relies on consistent hashing to
distribute the load across multiple storage hosts. In consistent
hashing [10], the output range of a hash function is treated as a
fixed circular space or “ring” (i.e. the largest hash value wraps
around to the smallest hash value). Each node in the system is
assigned a random value within this space which represents its
“position” on the ring. Each data item identified by a key is assigned
to a node by hashing the data item’s key to yield its position on the
ring, and then walking the ring clockwise to find the first node with
a position larger than the item’s position.
However, under Uniform Load Distribution, some strategies are detailed:
Strategy 1: T random tokens per node and partition by token value
Strategy 2: T random tokens per node and equal sized partitions
So then these tokens (which I'm assuming are keys?) are distributed randomly to nodes?
So are logical nodes responsible for a continuous range of keys or a random set of keys?
Disclaimer: I've just read the paper, I'm no expert.
I understand tokens and keys to both occupy key space (i.e. position on the key ring), however they are not the same thing.
Dynamo uses the concept of “virtual nodes”. A virtual node looks like
a single node in the system, but each node can be responsible for more
than one virtual node. Effectively, when a new node is added to the
system, it is assigned multiple positions (henceforth, “tokens”) in
the ring.
So, in the "basic consistent hashing algorithm" approach, you take each node in your system and randomly assign it a position in the key ring. Therefore each node is responsible for a single continuous range in key space. A wedge of the circle if you will.
The authors note this has some problems around uniformity of access. So instead they came up with a "variant of the consistent hashing algorithm".
In the alternative scheme each node is given a set of 'Tokens'. A token is a virtual node. Conceptually you can imagine lots of small pieces of key space, taken from all around the ring, and assigned to the node. Or in my head - lots of tiny wedges from all around the circle.
In the actual scheme they went for, each virtual node (token) is a continuous set of keys. However each actual node has multiple non-continuous virtual nodes.
Therefore each node has many sections of continuous key space, but taken from all over the total key space. Not quite random and not quite continuous either!
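The virtual-node idea described above can be sketched in a few lines. This is only an illustration of consistent hashing with several tokens per node, not the paper's actual code: the Ring class, the MD5 hash, and deriving token positions from the node name (instead of picking them at random) are assumptions made to keep the example reproducible.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Hash a string to a position on the ring (first 8 bytes of MD5, as a stand-in)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, tokens_per_node=8):
        # Each physical node gets several tokens (virtual nodes) scattered around the ring.
        self._tokens = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self._positions = [pos for pos, _ in self._tokens]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first token with a position larger than the key's,
        # wrapping around to the smallest token at the end of the ring.
        index = bisect.bisect_right(self._positions, ring_position(key))
        return self._tokens[index % len(self._tokens)][1]

    def tokens_of(self, node: str):
        """Ring positions owned by a node: several separate points, not one wedge."""
        return [pos for pos, owner in self._tokens if owner == node]


if __name__ == "__main__":
    ring = Ring(["A", "B", "C"])
    print(ring.node_for("some-key"))
    print(len(ring.tokens_of("A")), "tokens for node A")
```

Each token covers one continuous range of key space, while the tokens of a single node are spread all over the ring, which matches the "not quite random and not quite continuous" description.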

Is a fixed partition key bad practice?

I have a DynamoDB table with:
Timestamp (HASH)
Text (String)
I want to be able to get the latest item via a query, but doing so requires that I sort by Timestamp rather than partition by it. I was considering doing this instead:
Partition (HASH, hard-coded as whatever)
Timestamp (RANGE)
Text (String)
That way I can query and pass a hard-coded partition in.
But is this bad practice?
It depends.
The main thing to consider is that partitions have a finite throughput for both reads and writes. This is independent from the provisioned throughput for the table. Partition throughput is constrained by the hard disk's read and write speeds. Remember that all items with the same hash value will live on the same partition and therefore will be written to the same disk (discounting replication).
So, it depends on your scale. It will work for a small scale, low throughput use case but it won't be able to scale beyond a single disk.
It is usually bad practice to use a single, hard coded value for your hash key. Rather than a hard coded hash key value, you should consider using year_month_day (or some variation) as your hash key for this use case. It's still not great, but it's much better than a single value.
If you do want to use hard coded hash key values, consider using multiple hard coded values to shard your data across partitions.
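As a rough sketch of the sharding suggestion, the snippet below spreads writes over a few hard-coded partition values and finds the latest item by asking every shard. Everything here is hypothetical: the shard count, the "shard-N" naming, and the query_latest callable, which stands in for a Query against one partition that returns its newest item (or None).

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count; tune for your write volume


def shard_partition_key(text: str, num_shards: int = NUM_SHARDS) -> str:
    """Pick one of a fixed set of partition key values for a new item."""
    shard = zlib.crc32(text.encode("utf-8")) % num_shards
    return f"shard-{shard}"


def all_partition_keys(num_shards: int = NUM_SHARDS):
    """Every hard-coded partition key value that a reader has to query."""
    return [f"shard-{i}" for i in range(num_shards)]


def latest_item(query_latest, num_shards: int = NUM_SHARDS):
    """Ask each shard for its newest item and keep the newest overall.

    query_latest(partition_key) is an assumed callable, not a real API.
    """
    candidates = [query_latest(pk) for pk in all_partition_keys(num_shards)]
    candidates = [item for item in candidates if item is not None]
    return max(candidates, key=lambda item: item["Timestamp"], default=None)
```

The trade-off is that reading the latest item now costs one query per shard instead of one query in total.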

Key Value Store for large list of integer values

My application requires a key value store. Following are some of the details regarding key values:
1) Number of keys (data type: string) can either be 256, 1024 or 4096.
2) Data type of values against each key is a list of integers.
3) The list of integers (value) against each key can vary in size
4) The largest size of the value can be around 10,000,000 integers
5) Some keys might contain very small list of integers
The application needs fast access to the list of integers against a specified key . However, this step is not frequent in the working of the application.
I need suggestions for best Key value stores for my case. I need fast retrieval of values against key and value size can be around 512 MB or more.
I checked Redis but it requires the store to be stored in memory. However, in the given scenario I think I should look for disk based key value stores.
LevelDB can fit your use case very well, as you have limited number of keys (given you have enough disk space for your requirements), and might not need a distributed solution.
One thing you need to specify is if (and how) you wish to modify the lists once in the db, as LevelDB and many other general key-value stores do not have such atomic transactions.
If you are looking for a distributed db, Cassandra is good, as it will also let you insert/remove individual list elements.

ratio between unique hash key and range key in dynamo db

Is it a problem if I choose my hash key and range key so that the number of unique hash keys is very low (maximum: 1000), while there are many more unique range keys?
Does the ratio between the number of unique hash and range keys affect the performance of retrieval of information?
It should not be a problem to have few hash keys with many range keys for each if:
The number of hash keys is not too low
Your access is randomly spread across the hash keys
You don't need to scale to extreme levels
According to the AWS Developer Guidelines for Working with Tables:
Provisioned throughput is dependent on the primary key selection, and
the workload patterns on individual items. When storing data, DynamoDB
divides a table's items into multiple partitions, and distributes the
data primarily based on the hash key element. The provisioned
throughput associated with a table is also divided evenly among the
partitions, with no sharing of provisioned throughput across
partitions.
Essentially, each hash key resides on a single node (i.e. server). Actually, it is redundantly stored to prevent data loss, but that can be ignored for this discussion. When you provision throughput you are indirectly determining the number of nodes to spread the hash keys across. However, no matter how much throughput you provision, it is limited for a single hash key by what a single node can handle.
To explain my three caveats:
1. The number of hash keys is not too low
You mention a max of 1000 hash keys, but the concern is what the minimum is. If for example there were only 10 hash keys then you would quickly reach the throughput limit for each key and would not actually realize the provisioned throughput.
2. Your access is randomly spread across the hash keys
It doesn't matter how many hash keys you have if there are a small number of keys that are "hot". That is if you are frequently reading or writing to only a small subset of the hash keys then you will reach the throughput limit of the nodes those keys are stored on.
3. You don't need to scale to extreme levels
Even assuming you have 1000 distinct hash keys and your access is randomly spread across them, if you need to scale to extreme levels you will eventually reach a point where each hash key is on a separate node. That is, if you provision enough throughput that each hash key is allocated to a separate node (i.e. you have 1000+ nodes), then any throughput provisioned beyond that level will not be realized because you will reach the limit of each node for each key.
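The caveats above can be turned into a back-of-the-envelope calculation. This is a simplified model built only on the quoted statement that provisioned throughput is divided evenly among partitions; the function name and the example numbers are invented for illustration, and any other limits or balancing the service may apply are ignored.

```python
def sustainable_throughput(provisioned: float, partitions: int, hottest_share: float) -> float:
    """Total request rate that can be sustained before the hottest partition throttles.

    provisioned   -- throughput provisioned for the whole table
    partitions    -- number of partitions the throughput is split across evenly
    hottest_share -- fraction (0 < share <= 1) of all traffic hitting the busiest partition
    """
    per_partition = provisioned / partitions
    # The busiest partition receives total * hottest_share and is capped at per_partition.
    return min(provisioned, per_partition / hottest_share)


if __name__ == "__main__":
    # Uniform access over 10 partitions: the full provisioned rate is usable.
    print(sustainable_throughput(1000, 10, 0.1))
    # Half of all traffic on one partition: only a fraction is usable.
    print(sustainable_throughput(1000, 10, 0.5))
```

In this model, skewed access is what wastes provisioned throughput, not the raw number of hash keys.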
The ratio of range keys to hash keys should have little to no effect on get, scan and query performance.
It is my understanding that the range keys for each hash key are efficiently stored in some kind of index that will scale well. However, remember that all the rows for a given hash key are stored together on the same node, so you can reach a point where there is too much data for a given hash key. The AWS Limits in DynamoDB states:
For a table with local secondary indexes, there is a limit on item
collection sizes: For every distinct hash key value, the total sizes
of all table and index items cannot exceed 10 GB. Depending on your
item sizes, this may constrain the number of range keys per hash
value.
As far as I know, this doesn't matter. The load distribution depends on the "frequency" of access and not on the "possible combinations". If your access is uniformly distributed across the 1000 keys you are talking about, then it is OK - this means the probability of fetching by key1 should be similar to the probability of fetching by key10 or key100. Internally I guess they would be bucketing your 1000 keys into, say, 3 groups, and each of these groups "might" be served by a separate machine. You need to ensure that your access is nearly uniform so that all 3 machines get a uniform load share.

What's the point of a hash table?

I don't have experience with hash tables outside of arrays/dictionaries in dynamic languages, so I recently found out that internally they're implemented by making a hash of the key and using that to store the value. What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key, instead of making a hash of it and storing that.
This is a near duplicate: Why do we use a hashcode in a hashtable instead of an index?
Long story short, you can check if a key is already stored VERY quickly, and equally rapidly store a new mapping. Otherwise you'd have to keep a sorted list of keys, which is much slower to store and retrieve mappings from.
what is hash table?
A hash table, also known as a hash map, is a data structure used to implement an associative array. It is a structure that can map keys to values.
How it works?
A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
Advantages:
In a well-dimensioned hash table, the average cost for each lookup is independent of the number of elements stored in the table.
Many hash table designs also allow arbitrary insertions and deletions of key-value pairs.
In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure.
Disadvantages:
The hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)
Uses:
They are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches and sets.
What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key
And how do you implement that?
Computers know only numbers. A hash table is a table, i.e. an array, and when we get right down to it, an array can only be addressed via an integral nonnegative index. Everything else is trickery. Dynamic languages that let you use string keys – they use trickery.
And one such trickery, and often the most elegant, is just computing a numerical, reproducible “hash” number of the key and using that as the index.
(There are other considerations such as compaction of the key range but that’s the foremost issue.)
In a nutshell: Hashing allows O(1) queries/inserts/deletes to the table. OTOH, a sorted structure (usually implemented as a balanced BST) makes the same operations take O(logn) time.
Why take a hash, you ask? How do you propose to store the key "as the key"? Ask yourself this, if you plan to store simply (key,value) pairs, how fast will your lookups/insertions/deletions be? Will you be running a O(n) loop over the entire array/list?
The whole point of having a hash value is that it allows all keys to be transformed into a finite set of hash values. This allows us to store keys in slots of a finite array (enabling fast operations - instead of searching the whole list you only search those keys that have the same hash value) even though the set of possible keys may be extremely large or infinite (e.g. keys can be strings, very large numbers, etc.) With a good hash function, very few keys will ever have the same hash values, and all operations are effectively O(1).
This will probably not make much sense if you are not familiar with hashing and how hashtables work. The best thing to do in that case is to consult the relevant chapter of a good algorithms/data structures book (I recommend CLRS).
The idea of a hash table is to provide direct access to its items. That is why it calculates the "hash code" of the key and uses it to store the item, instead of the key itself.
The idea is to have only one hash code per key. Often the hash function divides the key by a prime number and uses the remainder as the hash code.
For example, suppose you have a table with 13 positions, and an integer as the key, so you can use the following hash function
f(x) = x % 13
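Here is a small sketch of a 13-slot table using f(x) = x % 13. The put and get helpers and the choice to keep colliding keys in a per-slot list are assumptions added for illustration; the answer only specifies the hash function.

```python
TABLE_SIZE = 13  # the 13 positions from the example


def f(x: int) -> int:
    """The hash function from the example: the remainder after dividing by 13."""
    return x % TABLE_SIZE


# One list per slot, so keys with the same hash code can share a slot.
table = [[] for _ in range(TABLE_SIZE)]


def put(key: int, value) -> None:
    bucket = table[f(key)]
    for i, (existing_key, _) in enumerate(bucket):
        if existing_key == key:
            bucket[i] = (key, value)  # overwrite an existing mapping
            return
    bucket.append((key, value))


def get(key: int):
    for existing_key, value in table[f(key)]:
        if existing_key == key:
            return value
    return None


if __name__ == "__main__":
    put(5, "five")
    put(18, "eighteen")  # 18 % 13 == 5, so it shares slot 5 with key 5
    print(f(5), f(18), get(5), get(18))
```

Keys 5 and 18 map to the same slot, which is why the full key is stored next to the value and compared on lookup.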
What I don't understand is why aren't
the values stored with the key
(string, number, whatever) as the,
well, key, instead of making a hash of
it and storing that.
Well, how do you propose to do that, with O(1) lookup?
The point of hashtables is basically to provide O(1) lookup by turning the key into an array index and then returning the content of the array at that index. To make that possible for arbitrary keys you need
A way to turn the key into an array index (this is the hash's purpose)
A way to deal with collisions (keys that have the same hash code)
A way to adjust the array size when it's too small (causing too many collisions) or too big (wasting space)
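The three requirements listed above can be sketched together in one small class. This is a minimal illustration, not how any particular language implements its dictionaries: it uses Python's built-in hash for the index, separate chaining for collisions, and doubling when a load factor is exceeded (shrinking is left out for brevity).

```python
class HashTable:
    def __init__(self, capacity: int = 8, max_load: float = 0.75):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0
        self._max_load = max_load

    def _index(self, key) -> int:
        # 1. Turn the key into an array index.
        return hash(key) % len(self._buckets)

    def put(self, key, value) -> None:
        bucket = self._buckets[self._index(key)]
        # 2. Deal with collisions: each bucket is a list of (key, value) pairs.
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self._size += 1
        # 3. Grow the array when it gets too full.
        if self._size > self._max_load * len(self._buckets):
            self._resize(2 * len(self._buckets))

    def get(self, key, default=None):
        for existing_key, value in self._buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default

    def _resize(self, new_capacity: int) -> None:
        old_buckets = self._buckets
        self._buckets = [[] for _ in range(new_capacity)]
        for bucket in old_buckets:
            for key, value in bucket:
                # Indexes change with the array size, so every pair is re-placed.
                self._buckets[self._index(key)].append((key, value))

    def __len__(self) -> int:
        return self._size
```

Lookups only compare keys within one bucket, which stays short as long as the load factor is kept low.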
Generally the point of a hash table is to store some sparse value -- i.e. there is a large space of keys and a small number of things to store. Think about strings. There is an unbounded number of possible strings. If you are storing the variable names used in a program then there is a relatively small number of those possible strings that you are actually using, even though you don't know in advance what they are.
In some cases, it's possible that the key is very long or large, making it impractical to keep copies of these keys. Hashing them first allows for less memory usage as well as quicker lookup times.
A hashtable is used to store a set of values and their keys in a (for some amount of time) constant number of spots. In a simple case, let's say you wanted to save every integer from 0 to 9999 using the hash function i / 10 (integer division).
This would make a hashtable of 1000 blocks (often an array), each having a list 10 elements deep. So if you were to search for 1234, it would immediately know to search in the table entry for 123, then start comparing to find the exact match. Granted, this isn't much better than just using an array of 10000 elements, but it's just to demonstrate.
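The layout described here (1000 blocks of 10 elements, with 1234 landing in entry 123) corresponds to integer division by 10, so that is what this short sketch uses. The names bucket_of and contains are made up for the example, and the range is taken as 0 to 9999 so that every block holds exactly 10 values.

```python
NUM_BLOCKS = 1000


def bucket_of(i: int) -> int:
    """Integer division by 10: values 1230..1239 all land in block 123."""
    return i // 10


# Build the table: 1000 blocks, each a list that ends up 10 elements deep.
table = [[] for _ in range(NUM_BLOCKS)]
for i in range(10000):
    table[bucket_of(i)].append(i)


def contains(value: int) -> bool:
    """Jump straight to one block, then compare at most 10 elements."""
    block = bucket_of(value)
    return 0 <= block < NUM_BLOCKS and value in table[block]


if __name__ == "__main__":
    print(bucket_of(1234), contains(1234), contains(10000))
```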
Hashtables are very useful for when you don't know exactly how many elements you'll have, but there will be a good number fewer collisions on the hash function than your total number of elements. (Which makes the hash function "hash(x) = 0" very, very bad.) You may have empty spots in your table, but ideally a majority of them will have some data.
The main advantage of using a hash for the purpose of finding items in the table, as opposed to using the original key of the key-value pair (which, BTW, is typically stored in the table as well, since the hash is not reversible), is that..
...it allows mapping the whole namespace of the [original] keys to the relatively small namespace of the hash values, allowing the hash-table to provide O(1) performance for retrieving items.
This O(1) performance gets a bit eroded when considering the extra time spent dealing with collisions and such, but on the whole the hash table is very fast for storing and retrieving items, as opposed to a system based solely on the [original] key value, which would then typically be O(log N), with for example a binary tree (although such a tree is more efficient, space-wise).
Also consider speed. If your key is a string and your values are stored in an array, your hash can access any element in 'near' constant time. Compare that to searching for the string and its value.
