Optimal data structure for k-k-v mapping - dictionary

I have a large mapping table with 1.4 billion records. The data structure is currently {<Key1, Key2>: List<Value>}.
Key1 and Value are drawn from the same set, let's say A, with ~0.1 billion unique elements.
Key2 is drawn from another set, let's say B, with only 32 unique elements.
List<Value> is a variable-length list with at most 200 elements.
Can someone recommend a better data structure or retrieval algorithm for quick online retrieval and reasonable space consumption?

You could use an extendible hash table for this:
https://en.wikipedia.org/wiki/Extendible_hashing
If you don't want to implement it yourself, then you could try using something like Redis or Memcached to serve as an external implementation of a persistable hash table.
To create the hashing key, just combine Key1 and Key2 (concatenate? XOR?) and use that as the hash key.
If the data lives in RAM, use a hash table with a dynamic array for your list. That should work well.
Unless you care about the order of the keys, hash tables should do the job.
If you want to get all Key1s associated with a given Key2, you can do that as well by maintaining a separate hash table for it. Or, if you're implementing the table yourself, you could link the keys so that all keys containing a given Key2 form a linked list.
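As a minimal in-RAM sketch in Python (the names and the reverse index here are illustrative, not a prescription; at 1.4 billion records you would shard this or move it into Redis as suggested above):

# Tuple (key1, key2) serves as the combined hash key; a plain list is the dynamic array.
from collections import defaultdict

table = defaultdict(list)    # {(key1, key2): [values]}
by_key2 = defaultdict(set)   # reverse index: key2 -> set of key1s

def insert(key1, key2, value):
    table[(key1, key2)].append(value)
    by_key2[key2].add(key1)

def lookup(key1, key2):
    return table.get((key1, key2), [])

insert(42, 3, 7)
insert(42, 3, 9)
print(lookup(42, 3))  # [7, 9]
print(by_key2[3])     # {42}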


Best way to model high score data in DynamoDB

I believe this would be easier with PostgreSQL or MongoDB, both of which I'm familiar with, but I'm using DynamoDB with my project for the sake of learning how to use it and getting comfortable with it. I've never used it before.
I want to use DynamoDB to store high scores for my typing test project. There are 4 data attributes to be stored:
name (doesn't need to be unique)
WPM
number of errors
test type (because I have 2 different kinds of typing tests)
At first, my partition key was testType, and my sort key was WPM. Then I realized that if anyone got the same WPM as a previous user, it would overwrite the previous user's data, because testType and WPM, the two key components, were identical. So ties did not work.
So now name is my partition key and WPM is my sort key, and to filter by testType I just use JS array filter methods. This still doesn't seem optimal, though, for multiple reasons. For my small typing test project I think it's OK, but I can see that it's possible for two people to enter the same name and get the same WPM, overwriting each other.
What would be a better way to set this up with DynamoDB?
Assuming you want the top X many WPM results for a given test type:
Set the partition key to be the test type and the sort key to <WPM>#<username>. Make sure to zero-pad the WPM so it's always three digits, even when the score is below 100; that keeps the lexicographic sort of the string consistent with numeric order.
With this key structure you have a sorted list (in the sort key) of all the scores for a given test type. You can Query against the test type and use ScanIndexForward=false to get descending high scores.
Notice how multiple identical scores by different usernames won’t overwrite each other. The username can be pulled from the returned sort key or from an attribute on the item, along with other metadata about the high score event.
If you have multiple users with the same username, well, that’s kinda weird. Presumably you have an internal identifier. You can use that as the suffix in the sort key instead of the username.
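In boto3, for instance, writing and reading with this key structure could look like the following sketch (the table name "HighScores" and the attribute names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("HighScores")  # hypothetical table

def put_score(test_type, wpm, username):
    # Zero-pad WPM to three digits so lexicographic order matches numeric order.
    table.put_item(Item={
        "testType": test_type,                # partition key
        "scoreKey": f"{wpm:03d}#{username}",  # sort key: <WPM>#<username>
        "wpm": wpm,
        "name": username,
    })

def top_scores(test_type, limit=10):
    resp = table.query(
        KeyConditionExpression=Key("testType").eq(test_type),
        ScanIndexForward=False,  # descending sort key order: highest WPM first
        Limit=limit,
    )
    return resp["Items"]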

AWS DynamoDB Query based on non-primary keys

I'm new to AWS DynamoDB and wanted to clarify something. Is it possible to query a table and filter based on a non-primary key attribute? My table looks like the following:
Store
Id: PrimaryKey
Name: simple string
Location: simple string
Now I want to query on Name, but from what I know I have to supply the key as well? Apart from that I could use Scan, but then I would be loading all the data.
From the docs:
The Query operation finds items based on primary key values. You can query any table or secondary index that has a composite primary key (a partition key and a sort key).
DynamoDB requires queries to always use the partition key.
In your case your options are:
create a Global Secondary Index that uses Name as its partition key (a sketch follows this list)
use a Scan + Filter if the table is relatively small, or if you expect the result set will include the majority of the records in the table
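For the first option, a hedged boto3 sketch, assuming a GSI named NameIndex already exists with Name as its partition key:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Store")

# Query the assumed GSI instead of the base table's primary key.
resp = table.query(
    IndexName="NameIndex",
    KeyConditionExpression=Key("Name").eq("some store name"),
)
print(resp["Items"])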
There are a few design principles you can follow when using DynamoDB. If you are coming from a relational background, you have already witnessed the query limitations of primary key attributes.
Design your tables for your queries, and separate hot and cold data.
Create indexes for querying on non-key attributes. You have two options: a Global Secondary Index, which you can define at any time, and a Local Secondary Index, which you must specify at table creation time.
With a Global Secondary Index you can promote any non-key attribute to be the partition key of the index and select another attribute as its sort key. With a Local Secondary Index, you can promote any non-key attribute to be the sort key while keeping the same partition key.
Using indexes for queries is also important for using provisioned throughput efficiently.
Although an index consumes read throughput of its own, projecting only the attributes you actually need to read can yield a huge saving. Check the following example.
Let's say you have a DynamoDB table whose items are 40 KB each. Reading 10 items directly from the table consumes 100 read capacity units: one unit reads 4 KB, so each item costs 10 units, times 10 items. If you define an index that projects only the attributes needed for the listing, say 4 KB per item, the same read consumes only 10 read capacity units (one per item), which makes a huge difference in cost.
With DynamoDB, how you define indexes matters a great deal, not only for query capability but also for throughput.
You cannot query on a non-primary key attribute in DynamoDB.
If you still want to do that, you can use a Scan, but Scan is a costly operation in DynamoDB: it reads every item in the table, AWS charges you for every item scanned, and on a large table it will hurt performance, so it is not recommended.
There are two ways to achieve what you want:
Keep Store Id as the partition key of your DynamoDB table and add Name or Location as the sort key (only one of them, since DynamoDB by design accepts only one attribute as a sort key).
Create Global Secondary Indexes for querying on the non-key attributes you need most frequently.
A GSI can project attributes in one of three ways; in your case, choose the INCLUDE option and add Name, Location, and store Id to the index (a sketch follows the list below):
KEYS_ONLY – Each item in the index consists only of the table partition key and sort key values, plus the index key values. The KEYS_ONLY option results in the smallest possible secondary index.
INCLUDE – In addition to the attributes described in KEYS_ONLY, the secondary index will include other non-key attributes that you specify.
ALL – The secondary index includes all of the attributes from the source table. Because all of the table data is duplicated in the index, an ALL projection results in the largest possible secondary index.
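As a sketch of that second option, adding the index with an INCLUDE projection via boto3 might look like this (the index name is an assumption):

import boto3

client = boto3.client("dynamodb")

client.update_table(
    TableName="Store",
    AttributeDefinitions=[{"AttributeName": "Name", "AttributeType": "S"}],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "NameIndex",
            "KeySchema": [{"AttributeName": "Name", "KeyType": "HASH"}],
            "Projection": {
                "ProjectionType": "INCLUDE",
                # The table key (Id) and the index key (Name) are always projected;
                # INCLUDE adds the listed non-key attributes on top.
                "NonKeyAttributes": ["Location"],
            },
            # Add ProvisionedThroughput here if the table is not in on-demand mode.
        },
    }],
)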

What is a hash table that doesn't know the keys?

For example, if I create a dictionary in python I can use d.keys() to retrieve the keys.
What is a hash table/dictionary without this kind of access? Storage might be an issue and the keys may be of least importance.
Edit (clarification): I want a data structure that can access values through the key but doesn't know the key, only the hash. For example:
Hash                                                             | Value
-----------------------------------------------------------------+-------
2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae | hey!
c9fc5d06292274fd98bcb57882657bf71de1eda4df902c519d915fc585b10190 | hello!
If I try and access the data structure with the key "this is a key", it will hash that and get "hello!". If I try to access it with the key "foo", I will get "hey!".
We cannot retrieve the keys from this hash table, but we can access the data. This would be useful in cases where storage is important.
Normally, this would be the table:
Hash                                                             | Value  | Key
-----------------------------------------------------------------+--------+--------------
2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae | hey!   | foo
c9fc5d06292274fd98bcb57882657bf71de1eda4df902c519d915fc585b10190 | hello! | this is a key
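A minimal Python sketch of the structure described above; only the SHA-256 digest of each key is stored, never the key itself:

import hashlib

store = {}  # hex digest of the key -> value

def _digest(key):
    return hashlib.sha256(key.encode()).hexdigest()

def put(key, value):
    store[_digest(key)] = value

def get(key):
    return store[_digest(key)]

put("foo", "hey!")
put("this is a key", "hello!")
print(get("foo"))             # hey!
print(get("this is a key"))   # hello!
# store.keys() yields only digests; the original keys cannot be recovered.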
This is called a Set - in this case the value is the key, and implementations generally use the hashcode and equality operations on the items before adding them to the set.
Some implementations of Set can be sorted; those are generally referred to as SortedSet. Think of Set<T> as an equivalent of Dictionary<T,T> (and SortedSet<T> as an approximation of SortedDictionary<T,T>, in C# parlance).
Sorted variants are generally implemented using binary trees, whereas unsorted implementations use hash tables. As the key is the value, most implementations only store the value itself.
Which platform / language are you using? Java?

limit offset, sorting and aggregation challenges in DynamoDB

I am using DynamoDB to store my device events (in JSON format) for further analysis, and I use the Scan API to display the result set on a UI. I need to:
Define a limit/offset for records, say 10 records per page, so that the result set is paginated (page 1 has records 0-10, page 2 has records 11-20, and so on). I found an API like scanRequest.withLimit(10), but its limit has a different meaning than a page offset. Does the DynamoDB API support limit/offset pagination?
Sort the result set on user-selected fields such as Date or Serial Number, but so far I haven't found any sorting/order-by APIs.
Possibly aggregate, e.g. on Device Name or Date, which also doesn't seem to be available in DynamoDB.
The above situation has led me to consider other NoSQL database solutions. Please advise me on the issues mentioned above.
The right way to think about DynamoDB is as a key-value store with support for indexes.
"Amazon DynamoDB supports key-value data structures. Each item (row) is a key-value pair where the primary key is the only required attribute for items in a table and uniquely identifies each item. DynamoDB is schema-less. Each item can have any number of attributes (columns). In addition to querying the primary key, you can query non-primary key attributes using Global Secondary Indexes and Local Secondary Indexes."
https://aws.amazon.com/dynamodb/details/
A table can have two types of keys:
Hash Type Primary Key: the primary key is made of one attribute, a hash attribute. DynamoDB builds an unordered hash index on this primary key attribute. Each item in the table is uniquely identified by its hash key value.
Hash and Range Type Primary Key: the primary key is made of two attributes. The first attribute is the hash attribute and the second one is the range attribute. DynamoDB builds an unordered hash index on the hash primary key attribute, and a sorted range index on the range primary key attribute. Each item in the table is uniquely identified by the combination of its hash and range key values. It is possible for two items to have the same hash key value, but those two items must have different range key values.
What kind of primary key have you set up for your Device Events table? I would suggest that you denormalize your data (i.e. pull specific attributes out of the json) and build additional indexes on those attributes that you want to sort and aggregate on: Date, Serial Number, etc. If I know what kind of primary key you have set up on your table, I can point you in the right direction to build these indices so that you can get what you need via the query method. The scan method will be inefficient for you because it reads every row in the table.
Lastly, with regard to your "limit offset" question, I think you're looking for the ExclusiveStartKey and LastEvaluatedKey parameters.
These are what enable pagination. Whenever a Query or Scan stops before the end of the results, either because your Limit was reached or because the response hit the 1 MB cap, the response includes a LastEvaluatedKey. Pass that value back as the ExclusiveStartKey of the next request to continue where the previous page left off; when LastEvaluatedKey is absent, you've reached the last page.
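A hedged boto3 sketch of that pattern (the table name and page size are assumptions):

import boto3

table = boto3.resource("dynamodb").Table("DeviceEvents")  # hypothetical table

def get_page(page_size, start_key=None):
    kwargs = {"Limit": page_size}
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    resp = table.scan(**kwargs)
    # LastEvaluatedKey is present whenever more items remain.
    return resp["Items"], resp.get("LastEvaluatedKey")

items, cursor = get_page(10)              # page 1
while cursor:
    items, cursor = get_page(10, cursor)  # next pages, resuming at the cursor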

What exactly are hashtables?

What are they and how do they work?
Where are they used?
When should I (not) use them?
I've heard the word over and over again, yet I don't know its exact meaning.
What I heard is that they allow associative arrays by sending the array key through a hash function that converts it into an int and then uses a regular array. Am I right with that?
(Note: this is not my homework; I go to school, but they teach us only the BASICs in informatics.)
Wikipedia seems to have a pretty nice answer to what they are.
You should use them when you want to look up values by some index.
As for when you shouldn't use them... when you don't want to look up values by some index (for example, if all you want to ever do is iterate over them.)
You've about got it. They're a very good way of mapping from arbitrary things (keys) to arbitrary things (values). The idea is that you apply a function (a hash function) that translates the key to an index into the array where you store the values; the hash function's speed is typically linear in the size of the key, which is great when key sizes are much smaller than the number of entries (i.e., the typical case).
The tricky bit is that hash functions are usually imperfect. (Perfect hash functions exist, but they tend to be very specific to particular applications and datasets, and are hardly ever worthwhile.) There are two approaches to dealing with collisions, and each requires storing the key with the value: one (open addressing) uses a pre-determined pattern to probe onward from the hashed location for a slot that is free; the other (chaining) hangs a linked list off each entry in the array, so you do a linear lookup over what is hopefully a short list. The production code whose source I've read has all used chaining, with dynamic rebuilding of the hash table when the load factor gets too high.
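To make the chaining approach concrete, here is a toy Python sketch (fixed bucket count, no rebuilding on high load factor):

class ChainedHashTable:
    def __init__(self, buckets=64):
        self._buckets = [[] for _ in range(buckets)]

    def _bucket(self, key):
        # hash() maps the key to an int; modulo picks the bucket index.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key (or collision): chain it

    def get(self, key):
        for k, v in self._bucket(key):   # linear scan of a hopefully short chain
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.put("apple", 1)
t.put("banana", 2)
print(t.get("banana"))  # 2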
Good hash functions are one-way functions that produce a well-distributed value from any given input, so you get a somewhat unique value for each input. They are also repeatable: a given input always generates the same output.
Examples of good hash functions are SHA-1 and SHA-256.
Let's say that you have a database table of users. The columns are id, last_name, first_name, telephone_number, and address.
While any of these columns could have duplicates, let's assume that no rows are exactly the same.
In this case, id is simply a unique primary key of our making (a surrogate key). The id field doesn't actually contain any user data because we couldn't find a natural key that was unique for users, but we use the id field for building foreign key relationships with other tables.
We could look up the user record like this from our database:
SELECT * FROM users
WHERE last_name = 'Adams'
AND first_name = 'Marcus'
AND address = '1234 Main St'
AND telephone_number = '555-1212';
We have to search through 4 different columns, using 4 different indexes, to find my record.
However, you could create a new "hash" column, and store the hash value of all four columns combined.
String myHash = myHashFunction("Marcus" + "Adams" + "1234 Main St" + "555-1212");
You might get a hash value like AE32ABC31234CAD984EA8.
You store this hash value as a column in the database and index on that. You now only have to search one index.
SELECT * FROM users
WHERE hash_value = 'AE32ABC31234CAD984EA8';
Once we have the id for the requested user, we can use that value to look up related data in other tables.
The idea is that the hash function offloads work from the database server.
Collisions are not likely. If two users have the same hash, it's most likely that they have duplicate data.
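A hedged Python version of building such a hash column, assuming SHA-256 and adding a field separator so that adjacent fields cannot run together and collide:

import hashlib

def user_hash(first_name, last_name, address, phone):
    # The separator keeps ("ab", "c") and ("a", "bc") from hashing identically.
    raw = "|".join([first_name, last_name, address, phone])
    return hashlib.sha256(raw.encode()).hexdigest()

h = user_hash("Marcus", "Adams", "1234 Main St", "555-1212")
# Store h in the indexed hash_value column, then look the row up with:
#   SELECT * FROM users WHERE hash_value = '<h>';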
