I'm scanning a huge table (> 1B docs) so I'm using a parallel scan (using one segment per worker).
The table has a hash key and a sort key.
Intuitively a segment should contain a set of hash keys (including all their sort keys), so one hash key shouldn't appear in more than one segment, but I haven't found any documentation indicating this.
Does anyone know how DynamoDB behaves in this scenario?
Thanks
This is an interesting question. I thought it would be easy to find a document stating that each segment contains a disjoint range of hash keys, and the same hash key cannot appear in more than one segment - but I too failed to find any such document. I am curious if anyone else can find such a document. In the meantime, I can try to offer additional intuitions on why your conjecture is likely correct - but also might be wrong:
My first intuition would be that you are right:
DynamoDB uses the hash key, also known as a partition key, to decide on which of the many storage nodes to store a copy of this data. All of the items sharing the same partition key (with different sort key values) are stored together, in sort-key order, so they can be queried together in order. DynamoDB uses a hash function on the partition key to decide the placement of each item (hence the name "hash key").
Now, if DynamoDB needs to divide the task of scanning all the data into "segments", the most sensible thing for it to do is to divide the space of hash values (i.e., the hash function of the hash keys) into equal-sized pieces. This division is easy to do (just a numeric division by TotalSegments), it ensures roughly the same number of items in each segment (assuming there are many different partitions), and it ensures that the scanning of each segment involves a different storage node, so the parallel scan can proceed faster than what a single storage node is capable of.
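As a rough illustration of that intuition, here is a minimal sketch of dividing a hash space into TotalSegments equal ranges. DynamoDB's real hash function and segment assignment are not documented, so SHA-256 and the `segment_for` helper are stand-ins chosen for this example, not the service's actual behavior.

```python
import hashlib

HASH_BITS = 256  # SHA-256 digest size; a stand-in for DynamoDB's internal hash


def segment_for(partition_key: str, total_segments: int) -> int:
    """Map a partition key to a segment by splitting the hash space evenly."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Segment i covers hash values in [i * 2^256 / N, (i + 1) * 2^256 / N).
    return (hash_value * total_segments) >> HASH_BITS


# Only the partition key is hashed, so under this scheme every sort key
# that shares a partition key lands in the same segment.
items = [("user-1", "2020-01-01"), ("user-1", "2020-01-02"), ("user-2", "2020-01-01")]
segments = {(pk, sk): segment_for(pk, 8) for pk, sk in items}
```

If the service worked exactly like this, a hash key could never straddle two segments; the caveat discussed next is about whether that assumption always holds.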
However, there is one indication that this might not be the entire story.
The DynamoDB documentation claims that
In general, there is no practical limit on the number of distinct sort key values per partition key value.
This means that in theory at least, your entire database, perhaps one petabyte of it, may be in a single partition with billions of different sort keys. Since a single Amazon storage node does have a size limit, DynamoDB must (unless the above statement is false) support splitting a single huge partition across multiple storage nodes. This means that when GetItem is looking for a particular item, DynamoDB needs to know which sort key is on which storage node. It also means that a parallel scan might, possibly, divide this huge partition into pieces, with several segments scanning the same partition but different sort-key ranges in it. I am not sure we can completely rule out this possibility. I am guessing it will never happen when you only have smallish partitions.
Every DynamoDB table has a "hashspace" and data is partitioned as per the hash value of the partition key. When a ParallelScan is intended and the TotalSegments and Segment values are provided, the table's complete hashspace is logically divided into these "Segments" such that TotalSegments cover the complete hash space, without overlapping. It is quite possible some segments here do not actually have any data corresponding to them, since there may not be any data in the hashspace allocated to the segment. This can be observed if the TotalSegments value chosen is very high for instance.
And for each Segment value passed in the Scan request (with TotalSegments value being constant), each Segment would return distinct items without any overlap.
FAQs
Q. What is the ideal number for TotalSegments?
-> You might need to experiment with values, find the sweet spot for your table, and the number of workers you use, until your application achieves its best performance.
Q. One or more segments do not return any records. Why?
-> This is possible if the hash range that is allocated as per the TotalSegments value does not have any items. In this case, the TotalSegments value can be decreased, for better performance.
Q. A Scan for a segment failed midway. Can the Scan for that segment alone be retried?
-> As long as the TotalSegments value remains the same, a Scan for one of the segments can be re-run, since it would have the same hash range allocated at any given time.
Q. Can I perform a Scan for a single segment, without performing the Scan for other segments as per TotalSegments value?
-> Yes. Multiple Scan operations for different Segments are not linked/do not depend on previous/other Segment Scans.
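The points above (independent segments, retrying a single segment) can be sketched as a worker-per-segment loop. This is only a sketch: `scan` is assumed to be a callable that takes the same `Segment`, `TotalSegments` and `ExclusiveStartKey` keyword arguments as boto3's `Table.scan` and returns a dict with `Items` and, when more pages remain, `LastEvaluatedKey`. `make_fake_scan` is a hypothetical in-memory stand-in so the code can be tried without AWS.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_segment(scan, segment, total_segments):
    """Read every item of one segment, following pagination to the end."""
    items = []
    kwargs = {"Segment": segment, "TotalSegments": total_segments}
    while True:
        page = scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return items
        kwargs["ExclusiveStartKey"] = last_key


def parallel_scan(scan, total_segments):
    """Run one worker per segment; each segment is scanned independently."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [pool.submit(scan_segment, scan, s, total_segments)
                   for s in range(total_segments)]
        return [f.result() for f in futures]


def make_fake_scan(all_items, page_size=2):
    """Hypothetical in-memory stand-in for Table.scan (integers as items)."""
    def scan(Segment, TotalSegments, ExclusiveStartKey=None):
        mine = [x for x in all_items if x % TotalSegments == Segment]
        start = ExclusiveStartKey or 0
        page = {"Items": mine[start:start + page_size]}
        if start + page_size < len(mine):
            page["LastEvaluatedKey"] = start + page_size
        return page
    return scan


results = parallel_scan(make_fake_scan(list(range(10))), total_segments=3)
```

With a real table, `scan` would be something like `boto3.resource("dynamodb").Table("my-table").scan` (the table name is made up). Because `scan_segment` depends only on its own `Segment` value, a failed segment can be re-run alone as long as `TotalSegments` is unchanged.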
I am trying to design a system in which I need to store users' secret values in a database (private and public key strings). The secrets themselves will be stored with the help of HashiCorp Vault. But I have one more requirement: storing two identical pairs (private key + public key) must not be allowed.
Since I cannot check key uniqueness before storing, I have to store a hash of the original secrets. My idea is to calculate a SHA hash from the secret data and compare it with already saved hashes. So I wonder: is this a workable solution, and can I use this digest as an external ID for accessing the data (because a hash implies the uniqueness of the data entry)? Hope for your help.
My idea is to calculate a SHA hash from the secret data and compare it with already saved hashes
I'd assume a cryptographic hash is the best option you have when there is no other unique identifier.
(because a hash implies the uniqueness of the data entry)
And that's a wrong assumption. Although cryptographic hashes are designed to have a negligible collision probability (the probability that two inputs have the same hash value), in principle there is still some (very small) probability.
For controlled (formatted) inputs I'd say the collision probability is so minuscule that you could boldly use the hashes as unique identifiers, but be prepared to handle the very rare case that a collision occurs (you could probably publish it and become famous).
calculate a SHA hash from the secret data
Concerning security: it is very hard (in practice, infeasible) to compute the input value from its hash (assuming a cryptographic hash currently considered secure).
Beware of the size of the input space: if you have, say, 1000 known candidate values, it is trivial to check which secret value has a certain hash. Assuming you store key pairs, it should be OK.
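A minimal sketch of the digest-based duplicate check, assuming SHA-256 and string keys. The function names, the length-prefix framing, and the in-memory `seen` set (standing in for a uniquely indexed database column) are choices made for this example, not part of the question.

```python
import hashlib


def keypair_digest(private_key: str, public_key: str) -> str:
    """Return a hex SHA-256 digest identifying a (private, public) pair."""
    h = hashlib.sha256()
    for part in (private_key, public_key):
        data = part.encode("utf-8")
        # Length prefix keeps ("ab", "c") and ("a", "bc") from hashing alike.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


seen = set()  # stand-in for a unique-indexed column in the real database


def register(private_key: str, public_key: str) -> bool:
    """Return True if the pair is new, False if its digest is already stored."""
    digest = keypair_digest(private_key, public_key)
    if digest in seen:
        return False
    seen.add(digest)
    return True


first = register("priv-1", "pub-1")
second = register("priv-1", "pub-1")
```

In a real system the duplicate check and the insert should be one atomic operation (for example a unique constraint), otherwise two concurrent writers could both pass the `in seen` test.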
I have a DynamoDB table with:
Timestamp (HASH)
Text (String)
I want to be able to get the latest item via a query, but doing so requires that I sort by Timestamp rather than partition by it. I was considering doing this instead:
Partition (HASH, hard-coded as whatever)
Timestamp (RANGE)
Text (String)
That way I can query and pass a hard-coded partition in.
But is this bad practice?
It depends.
The main thing to consider is that partitions have a finite throughput for both reads and writes. This is independent from the provisioned throughput for the table. Partition throughput is constrained by the hard disk's read and write speeds. Remember that all items with the same hash value will live on the same partition and therefore will be written to the same disk (discounting replication).
So, it depends on your scale. It will work for a small scale, low throughput use case but it won't be able to scale beyond a single disk.
It is usually bad practice to use a single, hard coded value for your hash key. Rather than a hard coded hash key value, you should consider using year_month_day (or some variation) as your hash key for this use case. It's still not great, but it's much better than a single value.
If you do want to use hard coded hash key values, consider using multiple hard coded values to shard your data across partitions.
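The two suggestions above could look roughly like this. It is a sketch under assumptions: `NUM_SHARDS`, the `shard_N` naming, and choosing the shard from the timestamp are invented for the example, and the dict of lists stands in for the table.

```python
from datetime import datetime, timezone

NUM_SHARDS = 4  # hypothetical; pick based on expected write throughput


def day_partition_key(ts: datetime) -> str:
    """Hash key of the form year_month_day, e.g. '2024_03_15'."""
    return ts.strftime("%Y_%m_%d")


def sharded_partition_key(ts: datetime) -> str:
    """Alternative: spread writes over a fixed set of hard-coded values."""
    return f"shard_{int(ts.timestamp()) % NUM_SHARDS}"


def latest(items_by_partition: dict) -> tuple:
    """Newest item overall: newest of each partition, then the max of those.

    Against DynamoDB this would be one Query per partition key
    (descending by Timestamp, limit 1) followed by the same max().
    """
    newest = [max(items) for items in items_by_partition.values() if items]
    return max(newest)


table = {}  # partition key -> list of (timestamp, text) items
for second, text in enumerate(["a", "b", "c", "d", "e"]):
    ts = datetime(2024, 3, 15, 12, 0, second, tzinfo=timezone.utc)
    table.setdefault(sharded_partition_key(ts), []).append((ts, text))

newest_item = latest(table)
```

The trade-off is that reading the latest item now costs one query per shard instead of one query in total.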
Is it a problem if I choose my hash key and range key so that the number of unique hash keys is very low (maximum: 1000), while there are many more unique range keys?
Does the ratio between the number of unique hash and range keys affect the performance of retrieval of information?
It should not be a problem to have few hash keys with many range keys for each if:
The number of hash keys is not too low
Your access is randomly spread across the hash keys
You don't need to scale to extreme levels
According to the AWS Developer Guidelines for Working with Tables:
Provisioned throughput is dependent on the primary key selection, and
the workload patterns on individual items. When storing data, DynamoDB
divides a table's items into multiple partitions, and distributes the
data primarily based on the hash key element. The provisioned
throughput associated with a table is also divided evenly among the
partitions, with no sharing of provisioned throughput across
partitions.
Essentially, each hash key resides on a single node (i.e. server). Actually, it is redundantly stored to prevent data loss, but that can be ignored for this discussion. When you provision throughput you are indirectly determining the number of nodes to spread the hash keys across. However, no matter how much throughput you provision, it is limited for a single hash key by what a single node can handle.
To explain my three caveats:
1. The number of hash keys is not too low
You mention a max of 1000 hash keys, but the concern is what the minimum is. If for example there were only 10 hash keys then you would quickly reach the throughput limit for each key and would not actually realize the provisioned throughput.
2. Your access is randomly spread across the hash keys
It doesn't matter how many hash keys you have if there are a small number of keys that are "hot". That is if you are frequently reading or writing to only a small subset of the hash keys then you will reach the throughput limit of the nodes those keys are stored on.
3. You don't need to scale to extreme levels
Even assuming you have 1000 distinct hash keys and your access is randomly spread across them, if you need to scale to extreme levels you will eventually reach a point where each hash key is on a separate node. That is, if you provision enough throughput that each hash key is allocated to a separate node (i.e. you have 1000+ nodes), then any throughput provisioned beyond that level will not be realized because you will reach the limit of each node for each key.
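The third caveat is simple arithmetic, sketched below. The per-node limit of 1000 units is a made-up number for illustration only; the real per-node limit is not stated in this answer.

```python
def realizable_throughput(provisioned: float, num_hash_keys: int,
                          per_node_limit: float) -> float:
    """Upper bound on usable throughput with uniform access across keys.

    Each hash key lives on one node, so the table as a whole cannot exceed
    num_hash_keys * per_node_limit no matter how much is provisioned.
    """
    return min(provisioned, num_hash_keys * per_node_limit)


PER_NODE_LIMIT = 1000  # hypothetical units per second per node

few_keys = realizable_throughput(50_000, 10, PER_NODE_LIMIT)
many_keys = realizable_throughput(50_000, 1000, PER_NODE_LIMIT)
extreme = realizable_throughput(2_000_000, 1000, PER_NODE_LIMIT)
```

With only 10 keys most of the provisioned capacity is unreachable, with 1000 keys all of it is usable, and at the extreme level the 1000 keys themselves become the ceiling.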
The ratio of range keys to hash keys should have little to no effect on get, scan and query performance.
It is my understanding that the range keys for each hash key are efficiently stored in some kind of index that will scale well. However, remember that all the rows for a given hash key are stored together on the same node, so you can reach a point where there is too much data for a given hash key. The AWS Limits in DynamoDB states:
For a table with local secondary indexes, there is a limit on item
collection sizes: For every distinct hash key value, the total sizes
of all table and index items cannot exceed 10 GB. Depending on your
item sizes, this may constrain the number of range keys per hash
value.
As far as I know, this doesn't matter. The load distribution depends on the "frequency" of access and not on the "possible combinations". If your access is uniformly distributed across the 1000 keys you are talking about, then it is OK. This means the probability of fetching by key1 should be similar to the probability of fetching key10 or key100. Internally I guess they would be bucketing your 1000 keys into, say, 3 groups, and each of these groups "might" be served by 3 machines. You need to ensure that your access is nearly uniform so that all 3 machines get a uniform share of the load.
I don't have experience with hash tables outside of arrays/dictionaries in dynamic languages, so I recently found out that internally they're implemented by making a hash of the key and using that to store the value. What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key, instead of making a hash of it and storing that.
This is a near duplicate: Why do we use a hashcode in a hashtable instead of an index?
Long story short, you can check if a key is already stored VERY quickly, and equally rapidly store a new mapping. Otherwise you'd have to keep a sorted list of keys, which is much slower to store and retrieve mappings from.
what is hash table?
A hash table, also known as a hash map, is a data structure used to implement an associative array. It is a structure that can map keys to values.
How it works?
A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
See the diagram below; it explains this clearly.
Advantages:
In a well-dimensioned hash table, the average cost for each lookup is independent of the number of elements stored in the table.
Many hash table designs also allow arbitrary insertions and deletions of key-value pairs.
In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure.
Disadvantages:
The hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)
Uses:
They are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches and sets.
What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key
And how do you implement that?
Computers know only numbers. A hash table is a table, i.e. an array, and when we get right down to it, an array can only be addressed via a nonnegative integer index. Everything else is trickery. Dynamic languages that let you use string keys – they use trickery.
And one such trickery, and often the most elegant, is just computing a numerical, reproducible “hash” number of the key and using that as the index.
(There are other considerations such as compaction of the key range but that’s the foremost issue.)
In a nutshell: Hashing allows O(1) queries/inserts/deletes to the table. OTOH, a sorted structure (usually implemented as a balanced BST) makes the same operations take O(log n) time.
Why take a hash, you ask? How do you propose to store the key "as the key"? Ask yourself this, if you plan to store simply (key,value) pairs, how fast will your lookups/insertions/deletions be? Will you be running a O(n) loop over the entire array/list?
The whole point of having a hash value is that it allows all keys to be transformed into a finite set of hash values. This allows us to store keys in slots of a finite array (enabling fast operations - instead of searching the whole list you only search those keys that have the same hash value) even though the set of possible keys may be extremely large or infinite (e.g. keys can be strings, very large numbers, etc.) With a good hash function, very few keys will ever have the same hash values, and all operations are effectively O(1).
This will probably not make much sense if you are not familiar with hashing and how hashtables work. The best thing to do in that case is to consult the relevant chapter of a good algorithms/data structures book (I recommend CLRS).
The idea of a hash table is to provide direct access to its items. That is why it calculates the "hash code" of the key and uses it to store the item, instead of the key itself.
The idea is to have only one hash code per key. A common hash function divides the key by a prime number and uses the remainder as the hash code.
For example, suppose you have a table with 13 positions, and an integer as the key, so you can use the following hash function
f(x) = x % 13
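A quick sketch of that function in use, with a few arbitrarily chosen integer keys. Keys whose remainders coincide end up in the same slot, which is why each slot here is a list.

```python
TABLE_SIZE = 13


def f(x: int) -> int:
    """Hash function from the text: remainder after dividing by 13."""
    return x % TABLE_SIZE


# One list per position, so colliding keys can share a position.
slots = [[] for _ in range(TABLE_SIZE)]
for key in (5, 18, 27, 40):
    slots[f(key)].append(key)

# 5 and 18 both map to position 5; 27 and 40 both map to position 1.
```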
What I don't understand is why aren't
the values stored with the key
(string, number, whatever) as the,
well, key, instead of making a hash of
it and storing that.
Well, how do you propose to do that, with O(1) lookup?
The point of hashtables is basically to provide O(1) lookup by turning the key into an array index and then returning the content of the array at that index. To make that possible for arbitrary keys you need
A way to turn the key into an array index (this is the hash's purpose)
A way to deal with collisions (keys that have the same hash code)
A way to adjust the array size when it's too small (causing too many collisions) or too big (wasting space)
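Those three ingredients fit in a small sketch. This is not how any particular language implements its dictionaries: the class name, separate chaining, the 0.75 load factor and doubling on growth are choices made for this example, and shrinking is left out for brevity.

```python
class HashTable:
    """Toy hash table: hash -> index, chaining for collisions, growth."""

    def __init__(self, capacity: int = 8):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0

    def _index(self, key) -> int:
        # 1. Turn the key into an array index.
        return hash(key) % len(self._buckets)

    def put(self, key, value) -> None:
        bucket = self._buckets[self._index(key)]
        # 2. Collisions: keys sharing an index live in the same list.
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self._size += 1
        # 3. Grow when the table gets too full.
        if self._size > 0.75 * len(self._buckets):
            self._resize(2 * len(self._buckets))

    def get(self, key, default=None):
        for k, v in self._buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def _resize(self, new_capacity: int) -> None:
        old = self._buckets
        self._buckets = [[] for _ in range(new_capacity)]
        for bucket in old:
            for k, v in bucket:
                self._buckets[hash(k) % new_capacity].append((k, v))


t = HashTable(capacity=4)
for i in range(20):
    t.put(f"key{i}", i)
```

Lookups stay close to O(1) because each bucket holds only a handful of entries once the table has grown.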
Generally the point of a hash table is to store some sparse value, i.e. there is a large space of keys and a small number of things to store. Think about strings. There is an infinite number of possible strings. If you are storing the variable names used in a program, then there is a relatively small number of those possible strings that you are actually using, even though you don't know in advance what they are.
In some cases, it's possible that the key is very long or large, making it impractical to keep copies of these keys. Hashing them first allows for less memory usage as well as quicker lookup times.
A hashtable is used to store a set of values and their keys in a (for some amount of time) constant number of spots. In a simple case, let's say you wanted to save every integer from 0 to 9999 using the hash function i / 10 (integer division).
This would make a hashtable of 1000 blocks (often an array), each having a list 10 elements deep. So if you were to search for 1234, it would immediately know to search in the table entry for 123, then start comparing to find the exact match. Granted, this isn't much better than just using an array of 10000 elements, but it's just to demonstrate.
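The bucketing described above can be sketched as follows, assuming integer division by 10 is the bucketing function. That is what gives 1000 buckets of 10 entries each and sends 1234 to entry 123; a modulo such as i % 10 would instead give 10 buckets of 1000 entries.

```python
NUM_KEYS = 10_000
NUM_BUCKETS = 1000


def bucket_of(i: int) -> int:
    """Integer division by 10: 1234 goes to bucket 123."""
    return i // 10


buckets = [[] for _ in range(NUM_BUCKETS)]
for i in range(NUM_KEYS):
    buckets[bucket_of(i)].append(i)


def contains(i: int) -> bool:
    """Jump straight to one bucket, then compare at most 10 entries."""
    b = bucket_of(i)
    return 0 <= b < NUM_BUCKETS and i in buckets[b]
```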
Hashtables are very useful for when you don't know exactly how many elements you'll have, but there will be a good number fewer collisions on the hash function than your total number of elements. (Which makes the hash function "hash(x) = 0" very, very bad.) You may have empty spots in your table, but ideally a majority of them will have some data.
The main advantage of using a hash for the purpose of finding items in the table, as opposed to using the original key of the key-value pair (which, BTW, is typically stored in the table as well, since the hash is not reversible), is that...
...it allows mapping the whole namespace of the [original] keys to the relatively small namespace of the hash values, allowing the hash-table to provide O(1) performance for retrieving items.
This O(1) performance gets a bit eroded when considering the extra time spent dealing with collisions and such, but on the whole the hash table is very fast for storing and retrieving items, as opposed to a system based solely on the [original] key value, which would then typically be O(log N), as with, for example, a binary tree (although such a tree is more efficient, space-wise).
Also consider speed. If your key is a string and your values are stored in an array, your hash can access any element in 'near' constant time. Compare that to searching for the string and its value.