For example, if I create a dictionary in Python, I can use d.keys() to retrieve the keys.
What is a hash table/dictionary without this kind of access? Storage might be an issue, and the keys might be of little importance.
Edit (clarification): I want a data structure that can access values through the key but doesn't know the key, only the hash. For example:
Hash                                                             | Value
-----------------------------------------------------------------+--------
2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae | hey!
c9fc5d06292274fd98bcb57882657bf71de1eda4df902c519d915fc585b10190 | hello!
If I try to access the data structure with the key "this is a key", it will hash that key and return "hello!". If I try to access it with the key "foo", I will get "hey!".
We cannot retrieve the keys from this hash table, but we can still access the data. This would be useful in cases where storage is at a premium.
Normally, this would be the table:
Hash                                                             | Value  | Key
-----------------------------------------------------------------+--------+--------------
2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae | hey!   | foo
c9fc5d06292274fd98bcb57882657bf71de1eda4df902c519d915fc585b10190 | hello! | this is a key
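Here is a minimal Python sketch of the idea, assuming SHA-256 as the hash function (which matches the digests in the example above):

import hashlib

# A store that keeps only hash(key) -> value; the original keys are discarded.
class HashKeyedStore:
    def __init__(self):
        self._table = {}

    @staticmethod
    def _digest(key):
        return hashlib.sha256(key.encode("utf-8")).hexdigest()

    def set(self, key, value):
        self._table[self._digest(key)] = value

    def get(self, key):
        return self._table[self._digest(key)]

store = HashKeyedStore()
store.set("foo", "hey!")
store.set("this is a key", "hello!")
print(store.get("foo"))            # hey!
print(store.get("this is a key"))  # hello!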
This is called a Set - in this case the value is the key, and implementations generally use the hashcode and equality operations on the items before adding them to the set.
Some implementations of Set can be sorted; those are generally referred to as SortedSet. Think of Set&lt;T&gt; as an equivalent of Dictionary&lt;T,T&gt; (and SortedSet&lt;T&gt; as the approximate equivalent of SortedDictionary&lt;T,T&gt;, in C# parlance).
Sorted variants are generally implemented using binary trees, whereas unsorted implementations use hash tables. As the key is the value, most implementations only store the value itself.
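In Python terms (purely an illustration; the answer above uses C# names), a set stores only the elements themselves and finds them again via hash and equality:

s = {"hey!", "hello!"}
print("hey!" in s)   # True: membership is found via hash + equality

# conceptually, a set behaves like a dictionary mapping each value to itself:
d = {v: v for v in s}
print(d["hey!"])     # hey!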
Which platform / language are you using? Java?
I have a large mapping table with 1.4 billion records. The data structure is currently {&lt;Key1, Key2&gt;: List&lt;Value&gt;}.
Key1 and Value come from the same set, call it A, with ~0.1 billion unique elements.
Key2 comes from another set, call it B, with only 32 unique elements.
List&lt;Value&gt; is a variable-length list with at most 200 elements.
Can anyone recommend a better data structure or retrieval algorithm for fast online retrieval and reasonable space consumption?
You could use an extendible hash table for this:
https://en.wikipedia.org/wiki/Extendible_hashing
If you don't want to implement it yourself, then you could try using something like Redis or Memcached to serve as an external implementation of a persistable hash table.
To create the hash key, just combine Key1 and Key2 (concatenate? XOR?) and use the result as the key, as in the sketch below.
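For illustration, a rough Python sketch of one way to combine them, assuming Key1 and Key2 are non-negative integers (since Key2 has only 32 distinct values, 5 bits suffice to pack it next to Key1); adapt this if your keys are strings, e.g. by hashing "key1|key2":

# Assumption: Key1 and Key2 are non-negative integers. Key2 has only
# 32 distinct values, so 5 bits are enough to hold it.
def combined_key(key1, key2):
    return (key1 << 5) | key2

table = {}
table[combined_key(123456, 7)] = ["value-a", "value-b"]
print(table[combined_key(123456, 7)])  # ['value-a', 'value-b']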
If the data is in RAM, use a hash table with a dynamic array for each list. That should work well.
Unless you care about the order of the keys, hash tables should do the job.
If you want to get all Key1s associated with a given Key2, you can do that as well by maintaining a separate hash table for it. Or, if you're actually implementing this yourself, you could link the keys so that all keys containing a given Key2 form a linked list. A sketch of the separate-hash-table option follows.
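A rough Python sketch of the separate-hash-table option (all names here are illustrative):

from collections import defaultdict

# main table keyed by (Key1, Key2); the secondary table answers
# "which Key1s appear with this Key2?" without a full scan.
main = {}                   # (key1, key2) -> list of values
by_key2 = defaultdict(set)  # key2 -> set of key1

def put(key1, key2, values):
    main[(key1, key2)] = values
    by_key2[key2].add(key1)

put(42, 3, ["a", "b"])
put(99, 3, ["c"])
print(by_key2[3])  # {42, 99}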
I have a DynamoDB table with a primary hash key and a range key. The range key needs to carry two attributes; say those attribute names are name1 and name2, with values value1 and value2.
Plan A: combine the two attributes into one string, using a comma as the delimiter
Primary hash key: id
Range key: value1,value2
Cons
1. A comma may not work if some weird values contain the delimiter
Plan B: serialize the map to a String for the range key
Primary hash key: id
Range key: "{\"name1\": \"value1\", \"name2\": \"value2\"}"
Cons
1. Different SDKs may serialize the same value into different JSON strings (not sure), and we need to support reads and writes from multiple SDKs, e.g. Java and Ruby
So, which solution works better? Or are there any better suggestions?
Thanks!
Ray
You're on the right track. The AWS docs on key design promote your first suggestion, but they also include some warnings about the situation you referred to as a con.
I don't think you would have problems with different SDK parsers, but I also think a little precaution here would be a good idea. So instead of serializing JSON to a string directly with the SDK, I would manually concatenate the values using a custom function that generates a deterministic value like "name1-value1-name2-value2" or "name1:value1-name2:value2".
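A hedged Python sketch of such a deterministic key builder; the separator and escaping scheme are my own choices, not anything from the AWS docs:

# Escape the delimiter so values containing it cannot produce ambiguous keys
# (the original "Plan A" concern).
def make_range_key(value1, value2, sep="-"):
    def escape(v):
        return v.replace("\\", "\\\\").replace(sep, "\\" + sep)
    return "name1:" + escape(value1) + sep + "name2:" + escape(value2)

print(make_range_key("value1", "value2"))  # name1:value1-name2:value2
print(make_range_key("a-b", "c"))          # name1:a\-b-name2:c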
I have set the partition key of one of my Cosmos DBs to /partition.
For example: we have a Chat document that contains a list of Subscribers, and we have ChatMessages that contain a text, a reference to the author, and some other properties. Both document types have a partition property that contains the type 'chat' and the chat's id.
Chat example:
{
    "id" : "955f3eca-d28d-4f83-976a-f5ff26d0cf2c",
    "name" : "SO questions",
    "isChat" : true,
    "partition" : "chat_955f3eca-d28d-4f83-976a-f5ff26d0cf2c",
    "subscribers" : [
        ...
    ]
}
We then have Message documents like this:
{
    "id" : "4d1c7b8c-bf89-47e0-83e1-a8cf0d71ce5a",
    "authorId" : "some guid",
    "isMessage" : true,
    "partition" : "chat_955f3eca-d28d-4f83-976a-f5ff26d0cf2c",
    "text" : "What should I do?"
}
It is now very convenient to return all messages for a specific chat: I just query all documents of the partition chat_955f3eca-d28d-4f83-976a-f5ff26d0cf2c with the property isMessage = true. All good...
But if I now want to query my db for a specific message by id, I usually know only the id, not the partition, and therefore have to run a slow cross-partition query. That led me to wonder whether I should add the partition key to the message id, so I can split the id when querying the db for a faster lookup. I saw that the _rid property of a document looks like a combination of the id of the db, the id of the collection, and a document-specific id. What I mean by this is (simplified):
Chat.Id = "abc"
Chat.Partition = "chat_abc"    // [type]_[chatId]
Message.Id = "chat_abc|123"    // [Chat.Partition]|[Message.Id]
Message.Partition = "chat_abc" // [Chat.Partition]
Let's assume that I now want to get the Message document by its id: I just split the id on the | symbol and then query with the first part of the id as the partition and the full id as the key, as in the sketch below.
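A small Python sketch of that split, assuming the | convention above:

# Recover (partition key, full id) from a composite id like "chat_abc|123".
def split_message_id(message_id):
    partition, _, _ = message_id.partition("|")
    return partition, message_id

partition_key, item_id = split_message_id("chat_abc|123")
print(partition_key)  # chat_abc
print(item_id)        # chat_abc|123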
Does that make sense? Are there better ways to do this? Should I just always pass the partition key of a document along, not just its id? Should I just use the _rid properties instead?
Any experience is highly appreciated!
UPDATE
I have found the following answer here:
Some applications encode partition key as part of the ID, e.g.
partition key would be customer ID, and ID = "customer_id.order_id",
so you can extract the partition key from the ID value.
I have further asked the Cosmos team by email whether this is a recommended pattern, and will post an answer in case I get one.
Yes, your proposal to extract partition key from id (via a convention like a prefix/delimiter) makes sense. This is common among applications that have a single key and want to refactor it to use Cosmos DB from a different storage system.
If you're building your application from scratch, you should consider wiring the composite key (partition key + item key ("id")) through your API/application.
First, if you know your data (and index) size will remain within the 10 GB limit and your RU/sec limit is OK, then a fixed, partition-less collection will bypass this problem. The OP has probably knowingly decided that partitioning is required, but it is an important consideration to note for the general case. If possible, KISS ;)
If partitioning is a must, then AFAIK you cannot avoid a cross-partition query and its overhead unless you know the partition key.
Imho the OP's suggestion of merging the duplicated partition key into the id field is a rather ugly solution, because:
1. The name id implies it is a unique key; the partition key is not part of it, nor necessary for its uniqueness. Anyone using this key upstream would incur the forced excess cost of a longer key, be blocked from using the simpler Guid type, etc.
2. It will become a mess should your partitioning key change in the future.
3. The internal structure of the merged id would not be intuitive without documentation: its parts are not named, and even if they seem to follow a pattern, new devs would not know for sure without finding external documentation to reliably understand what's going on.
4. Your data model does not require this duplication at the semantic level; it would exist purely for your application's querying comfort, so such hacks should belong in your application code, not the data model. Such leaking concerns should be avoided if possible.
5. Data duplication within the document would unnecessarily increase document size, bandwidth, etc. (which may or may not be notable, depending on scale and usage). In-document duplication is necessary at times, but imho not in this case.
A better design would be to ensure the partition key is always present in the logical context and can be passed along to lookups. If you don't have it available, then maybe you should refactor your application code (not your data design) to explicitly pass the chatId around along with the id where needed - WITHOUT merging them together into some opaque string format.
Also, I don't see a good way to use _rid for this: if I remember correctly, it does not contain any internal reference to a partition or partition key.
Disclaimer: I don't have any access or deep insight into internal CosmosDB index design or _rid logic on partitioned collections. I may have misunderstood how it works.
I am working on a project that requires a key/value caching function, but the application will live in a very limited environment that does not support any of the go-to, industry-standard memory caching options such as ASP.NET Cache, memcached, or AppFabric.
The only option we have in this restrictive environment is an MS SQL database, so we have to create a simple key/value table to meet our caching needs. We will most likely serialize the data as JSON, but I am not sure what the best data type for the key would be. It obviously needs to be unique and readable by the programmer getting and setting the cache. Lookups also need to be fast, since we are already losing performance by not having an in-memory cache solution.
I am used to having my primary key column be an int or bigint value. In this case, should the primary key (the cache key) be a char or varchar data type, given that all queries will be:
SELECT value FROM CacheTable WHERE [key] = 'keyname'
I also saw posts about using an MD5 hash, but other posts pointed out that hashing cannot be relied on to produce unique keys all the time. I'm basically after advice on the data type, and on whether the 'key' column should be the primary key or whether I should still create an int or bigint primary key (even though it probably would not be used).
The end result we are after is a caching class similar to .NET's native caching, where we can create a static class that pulls from the database table, such as:
CustomDatabaseCache.Set(string key, object value);
CustomDatabaseCache.Get(string key)
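To make that concrete, here is a minimal Python sketch of such a class, with sqlite3 standing in for MS SQL (the real implementation would target SQL Server; table and column names are illustrative):

import json
import sqlite3

class CustomDatabaseCache:
    # sqlite3 in-memory db stands in for the MS SQL table described above;
    # the cache key is a TEXT (varchar-equivalent) primary key.
    _conn = sqlite3.connect(":memory:")
    _conn.execute(
        "CREATE TABLE IF NOT EXISTS CacheTable ("
        "cache_key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )

    @classmethod
    def set(cls, key, value):
        # store the value serialized as JSON under a textual key
        cls._conn.execute(
            "INSERT OR REPLACE INTO CacheTable (cache_key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )

    @classmethod
    def get(cls, key):
        row = cls._conn.execute(
            "SELECT value FROM CacheTable WHERE cache_key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

CustomDatabaseCache.set("user:42", {"name": "Ada"})
print(CustomDatabaseCache.get("user:42"))  # {'name': 'Ada'}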
I think that in your scenario a clustered primary key on your keyname column would work fine. However, it's worth experimenting with fill factors: you want a fill factor low enough that you don't cause excessive page splits, but high enough to keep the number of page reads low.
A clustered IDENTITY index works better in terms of eliminating page splits on the clustered index, and you could use a unique index on keyname with an INCLUDE clause covering your value column. However, in your case I don't see the benefit of doing that, because you'd have exactly the same page-split problem on your unique index, and the clustered index on keyname would be no more expensive to read since you wouldn't have any extra columns. Plus, you would then pay the index update cost on two indexes for every write.
Hope that helps.
What are they and how do they work?
Where are they used?
When should I (not) use them?
I've heard the word over and over again, yet I don't know its exact meaning.
What I heard is that they allow associative arrays by sending the array key through a hash function that converts it into an int and then uses a regular array. Am I right with that?
(Notice: this is not my homework; I do go to school, but they only teach us the basics of informatics)
Wikipedia seems to have a pretty nice answer to what they are.
You should use them when you want to look up values by some index.
As for when you shouldn't use them... when you don't want to look up values by some index (for example, if all you want to ever do is iterate over them.)
You've about got it. They're a very good way of mapping from arbitrary things (keys) to arbitrary things (values). The idea is that you apply a function (a hash function) that translates the key to an index into the array where you store the values; the hash function's speed is typically linear in the size of the key, which is great when key sizes are much smaller than the number of entries (i.e., the typical case).
The tricky bit is that hash functions are usually imperfect. (Perfect hash functions exist, but they tend to be very specific to particular applications and particular datasets; they're hardly ever worthwhile.) There are two approaches to dealing with collisions, and each requires storing the key alongside the value: one (open addressing) uses a predetermined probe pattern to search onward from the hashed location for a free slot; the other (chaining) hangs a linked list off each entry in the array, so a lookup does a linear scan over what is hopefully a short list. The production code whose source I've read has all used chaining, with dynamic rebuilding of the hash table when the load factor gets too high.
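To make the chaining variant concrete, here is a toy Python hash table that keeps (key, value) pairs in per-bucket lists and rebuilds itself when the load factor passes a threshold (the 0.75 threshold and starting capacity are arbitrary choices):

class ChainedHashTable:
    def __init__(self, capacity=8):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0

    def _bucket(self, key):
        # pick a bucket from the key's hash
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self._size += 1
        if self._size > 0.75 * len(self._buckets):  # load factor check
            self._rehash()

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def _rehash(self):
        # double the bucket count and redistribute all entries
        old_buckets = self._buckets
        self._buckets = [[] for _ in range(2 * len(old_buckets))]
        for bucket in old_buckets:
            for key, value in bucket:
                self._bucket(key).append((key, value))

table = ChainedHashTable()
table.put("foo", 1)
table.put("bar", 2)
print(table.get("foo"))  # 1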
Good hash functions are one-way functions that produce a well-distributed value from any given input, so you get a more or less unique value for each input. They are also repeatable: any given input will always generate the same output.
Examples of good hash functions are SHA-1 and SHA-256.
Let's say that you have a database table of users. The columns are id, last_name, first_name, telephone_number, and address.
While any of these columns could have duplicates, let's assume that no rows are exactly the same.
In this case, id is simply a unique primary key of our making (a surrogate key). The id field doesn't actually contain any user data because we couldn't find a natural key that was unique for users, but we use the id field for building foreign key relationships with other tables.
We could look up the user record like this from our database:
SELECT * FROM users
WHERE last_name = 'Adams'
AND first_name = 'Marcus'
AND address = '1234 Main St'
AND telephone_number = '555-1212';
We have to search through four different columns, using four different indexes, to find the record.
However, you could create a new "hash" column, and store the hash value of all four columns combined.
String myHash = myHashFunction("Marcus" + "Adams" + "1234 Main St" + "555-1212");
You might get a hash value like AE32ABC31234CAD984EA8.
You store this hash value as a column in the database and index on that. You now only have to search one index.
SELECT * FROM users
WHERE hash_value = 'AE32ABC31234CAD984EA8';
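For illustration, here is the combined hash computed in Python with hashlib; the delimiter between columns is my own addition, to keep two different rows from concatenating into the same string, and row_hash is a made-up name:

import hashlib

# the "|" delimiter prevents e.g. "Marc" + "usAdams" and "Marcus" + "Adams"
# from producing the same concatenated input
def row_hash(first_name, last_name, address, telephone_number):
    joined = "|".join([first_name, last_name, address, telephone_number])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

print(row_hash("Marcus", "Adams", "1234 Main St", "555-1212"))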
Once we have the id for the requested user, we can use that value to look up related data in other tables.
The idea is that the hash function offloads work from the database server.
Collisions are unlikely. If two users end up with the same hash, it almost certainly means they have duplicate data.