What does it mean when people say 'search a hash table'? [duplicate] - hashtable

This question already has answers here:
How does a hash table work?
(17 answers)
How do I search a hash table?
(2 answers)
Closed 3 years ago.
Let's say you have a hash table which contains employee names as keys and in each bucket the employee salary.
When people say to write something to search this hash table what are they referring to? What is the input to the search function?
Do they mean search up what the salary is for a specific employee? How would collisions be handled in this case?
My question is different because, I'm asking for clarification about a colloquial term I hear often, specifically in the context of an example problem.
I'm not really asking for a code implementation. The question about collisions is probably the closest thing related to an implementation like in the possible duplicate answers provided but even then that question can be answered theoretically.

You are asking for a guess as to what people mean by inexact terminology. Hash tables are not really "searched" in any real sense other than to determine if an associated value was already stored.
In your example, the keys are the employee name.
The hash of a name as input computes a value that resolves to a
bucket.
That bucket can be empty.
If the bucket isn't empty, then return the value stored in the bucket.
This has an obvious problem to it, and you might notice that associative arrays that are part of the standard datatypes or class libraries of many popular languages, can not deal with this problem.
If you have "John Smith" as an employee and hire a 2nd employee named "John Smith" there is no way for a hash table to tell the difference between the two people. So to be clear, if you try and add a new John Smith, it will either reject the new John Smith or overwrite the salary value for the one that was first stored. This is an implementation detail that might fall within the scope of your question, and explains why we have id#'s associated with our identities or in essentially any database system.
The main issue that hash tables have to deal with is hash collisions. This is when there are 2 or more inputs that hash to the same value.
Typically what happens is that after the hash computation occurs, it looks into the bucket data structure. There are 2 possibilities:
Bucket is empty
Bucket has one or more entries
So what is needed to resolve multiple entries? Some of the other answers enumerate or go into detail as to the schemes that have been employed to deal with collisions. The obvious and simple answer is that often the stored data structures include the key: value pair together.
Assume that "Bob Jones" and "Sam Smith" both hash to bucket 'Z'. In bucket Z you might find a linked list of values like this:
Bucket Z: ["Sam Smith": 23000] -> ["Bob Jones": 48500]
The -> represents a pointer that the implementation uses to find the next entry.
Hopefully this illustrates the way a bucket can actually store multiple salary values for different names and still return the correct salary associated with the original key. This could be considered an actual "search" because the system needs to search for a the value associated with the original key, but the hash itself is simply a computation. Look at well known hash algorithms like MD5 or SHA1 for examples. A Hash is a computation and not a search. A Hash table is a data structure that combines hashed values as the index for finding a previously stored value, or storing a new value.

Related

How to sort and query DynamoDB by non unique values? I.E. Names

Let's say I make a GSI for 'Name' and I have two people in my database who just happen to have the same name:
Tim Cook
Tim Cook
Now this will fail a consistency constraint on insert for duplicate values hence we need another approach.
I was thinking about hashing the name values at the end so that the BEGINS_WITH operator can still be used to search / match on but that puts you in a weird position. What do you salt with? How many characters? The longer the salt the more memory and potentially compute you waste cleaning up the salt before returning the results to the user. The shorter the salt the more likely you are to have collisions. After all there are some incredibly common names out there.
Here's an example of the values salted:
Tim Cook#ABCDEF
Tim Cook#ZYXWVU
This is great as I can insert both values now and now I can create a 'search user by name' endpoint for the user via the BEGINS_WITH('Tim Cook') operation but it feels weird.
I did a bit of searching though on sorting and searching by names in DynamoDB and didn't come up with anything meaningful on how to proceed from here. Wondering what you guys think.
My one and final issue is that names are not evenly spread out so you're inevitably going to have hotter partitions but I just don't see another way around this. Minus of course exfiltrating the data to another data store and querying it there like a full text search store.
You can’t insert to a GSI. So your concern is kind of misplaced.
You also can’t Get Item on a GSI, only Query, and that’s because there’s not necessarily one matching value for a given key.
Note: The GSI always projects the primary key over from the base table.
You can follow the following schema pattern to achieve your goal:
Partition key: Name
Sort/Range key: createdAt (The creation time of that row)
In this case, if the name is same for more than 1 people, you will be returned with all the names sorted automatically. This schema will also allow you to create a unique access pattern for each item of your table.
Partition key -> Sort key
Name -> createdAt
Tim Cook -> "HH:mm:ss"
Each row will have a different creation time and will provide unique composite key values for each item of the table.
For some reason I thought GSI's had the same uniqueness constraint as partition keys however that's not the case - you can have duplicates.
In a DynamoDB table, each key value must be unique. However, the key values in a global secondary index do not need to be unique.
Source
So a GSI is a perfectly good way to store duplicated information. Not sure this question is helpful now since it came about through ignorance so it might be worth deleting now.

Use a person's name to create a unique, non-reversible ID

Is there a way (an algorithm) to take someone's name (first name and surname) and turn it into a unique, non-reversible ID? I know I could just start at unique id = 1 and just add one. I was wondering if there was a way that I could generate IDs using someone's name.
I'm just after a pointer to the way to do it and I'll code it myself.
What you're asking for is called a hash function. Hash functions will always produce the same fixed-length output for a given input.
For non-cryptographic endeavours, SHA256 is often a good choice. It will produce a 256-bit output and is available natively in most standard libraries.
No.
You say you want to generate a non-reversible ID from a name. While you can generate an ID from a name, this will always be reversible. That is because there is not enough information in a name.
Consider that any name is chosen from a set of 7 billion names (that's how many people there are), then the maximum possible Shannon entropy of an ID is about 32.7 bits. In other words, it will take about 2^32.7 tries before you have tried all names on Earth. For a computer, that's easy.
For example, if we would create the IDs by hashing with SHA256, as Luke Joshua Park suggests, we can easily find the name corresponding to an ID using a regular brute force attack. My laptop needs only 0.5µs to compute a SHA256 hash. So to find the IDs off all humans on earth, I would need less than an hour if I use a somewhat beefy CPU.

(CHORD) Peer-2-Peer How does it work/What does it do?

https://en.wikipedia.org/wiki/Chord_(peer-to-peer)
I've looked into Chord and i'm having trouble understanding exactly what it does.
It's a protocol for a distributed hash table which stores various keys/values for later usage? Is it just an efficient way to look up in the hash table what value for a given key?
Any help such as a basic example would be much appreciated
An example question is say if I hashed inserting string "Hi" to 3 and there were no peers at 3 it would go to the next available peer and store it there right? Or where does it store it's values to?
I already answered a similar question for bittorrent/kademlia, so just to summarize in a more general sense:
DHTs store the values with some redundancy on N nodes whose ID is closest to the target hash.
Considering the vastness of >= 128bit keyspaces it would be extremely unlikely for nodes to exactly match the key. At least in routing schemes where nodes don't adjust their IDs based on content, and chord is one of those.
It's pretty much the same as regular hash tables, hence distributed hash table. You have a limited set of buckets into which the entries are hashed, where the bucket space is much smaller than the potential input keyspace and thus does not precisely match the keys either.

Alternative to GUID with Scalablity in mind and Friendly URL

I've decided to use GUID as primary key for many of my project DB tables. I think it is a good practice, especially for scalability, backup and restore in mind. The problem is that I don't want to use the regular GUID and search for an alternative approach. I was actually interested to know what Pinterest i using as primary key. When you look at the URL you see something like this:
http://pinterest.com/pin/275001120966638272/
I prefer the numerical representation, even it it is stores as string. Is there any way to achieve this?
Furthermore, youtube also use a different kind of hashing technique which I can't figure it out:
http://www.youtube.com/watch?v=kOXFLI6fd5A
This reminds me shorten url like scheme.
I prefer the shortest one, but I know that it won't guarantee to be unique. I first thought about doing something like this:
DateTime dt1970 = new DateTime(1970, 1, 1);
DateTime current = DateTime.Now;
TimeSpan span = current - dt1970;
Result Example:
1350433430523.66
Prints the total milliseconds since 1970, But what happens if I have hundreds thousands of writes per second.
I mainly prefer the non BIGINT Auto-Increment solution because it makes a lot less headache to scale the DB using 3rd party tools as well as less problematic backup/restore functionality because I can transfer data between servers and such if I want.
Another sophisticated approach is to tailor the solution towards my application. In the database, the primary key will also contain the username (unique and can't be changed by the user), so I can combine the numerical value of the name with the millisecond number which will give me a unique numerical string. Because the user doesn't insert data as such a high rate, the numerical ID is guarantee to be unique. I can also remove the last 5 figures and still get a unique ID, because I assume that the user won't insert data at more than 1 per second the most, but I would probably won't do that (what do you think about this idea?)
So I ask for your help. My data is assumes to grow very big, 2TB a year with ten of thousands new rows each second. I want URLs to look as "friendly" as possible, and prefer not to use the 'regular' GUID.
I am developing my app using ASP.NET 4.5 and MySQL
Thanks.
Collision Table
For YouTube like GUID's you can see this answer. They are basically keeping a database table of all random video ID's they are generating. When they request a new one, they check the table for any collisions. If they find a collision, they try to generate a new one.
Long Primary Keys
You could use a long (e.g. 275001120966638272) as a primary key, however if you have multiple servers generating unique identifiers you'll have to partition them somehow or introduce a global lock, so each server doesn't generate the same unique identifier.
Twitter Snowflake ID's
One solution to the partitioning problem with long ID's is to use snowflake ID's. This is what Twitter uses to generate it's ID's. All generated ID's are made up of the following parts:
Epoch timestamp in millisecond precision - 41 bits (gives us 69 years with a custom epoch)
Configured machine id - 10 bits (gives us up to 1024 machines)
Sequence number - 12 bits (A local counter per machine that rolls over every 4096)
One extra bit is reserved for future purposes. Since the ID's use timestamp as the first component, they are time sortable (which is very important for query performance).
Base64 Encoded GUID's
You can use ShortGuid which encodes a GUID as a base64 string. The downside is that the output is a little ugly (e.g. 00amyWGct0y_ze4lIsj2Mw) and it's case sensitive which may not be good for URL's if you are lower-casing them.
Base32 Encoded GUID's
There is also base32 encoding of GUID's, which you can see this answer for. These are slightly longer than ShortGuid above (e.g. lt7fz44kdqlu5pt7wnyzmu4ov4) but the advantage is that they can be all lower case.
Multiple Factors
One alternative I have been thinking about is to introduce multiple factors e.g. If Pintrest used a username and an ID for extra uniqueness:
https://pinterest.com/some-user/1
Here the ID 1 is unique to the user some-user and could be the number of posts they've made i.e. their next post would be 2. You could also use YouTube's approach with their video ID but specific to a user, this could lead to some ridiculously short URL's.
The first, simplest and practical scenario for unique keys
is the increasing numbering sequence of the write order,
This represent the record number inside one database providing unique numbering on a local scale : this is the -- often met -- application level requirement.
Next, the numerical approach based on a concatenation of time and counters is commonly used to ensure that concurrent transactions in same wagons will have unique ids before writing.
When the system gets highly threaded and distributed, like in highly concurrent situations, do some constraints need to be relaxed, before they become a penalty for scaling.
Universally unique identifier as primary key
Yes, it's a good practice.
A key reference system can provide independence from the underlying database system.
This provides one more level of integrity for the database when the evoked scenario occurs : backup, restore, scale, migrate and perhaps prove some authenticity.
This article Generating Globally Unique Identifiers for Use with MongoDB
by Alexander Marquardt (a Senior Consulting Engineer at MongoDB) covers the question in detail and gives some insight about database and informatics.
UUID are 128 bits length. They introduce an amount of entropy
high enough to ensure a practical uniqueness of labels.
They can be represented by a 32 hex character strings.
Enough to write several thousands of billions of billions
of decimal number.
Here are a few more questions that can occur when considering the overall principle and the analysis:
should primary keys of database
and Unique Resource Location be kept as two different entities ?
does this numbering destruct the sequentiality in the system ?
Does providing a machine host number (h),
followed by a user number (u) and time (t) along a write index (i)
guarantee the PK huti to stay unique ?
Now considering the DB system:
primary keys should be preserved as numerical (be it hexa)
the database system relies on it and this implies performance considerations.
their size should be fixed,
the system must answer rapidly to tell if it's potentially dealing with a PK or not.
Hashids
The hashing technique of Youtube is hashids.
It's a good choice :
the hash are shorts and the length can be controlled,
the alphabet can be customized,
it is reversible (and as such interesting as short reference to the primary keys),
it can use salt.
it's design to hash positive numbers.
However it is a hash and as such the probability exists that a collision happen. They can be detected : unique constraint is violated before they are stored and in such case, should be run again.
Consider the comment to this answer to figure out how much entropy it's possible to get from a shorten sha1+b64 recipe.
To anticipate on the colliding scenario,
calls for the estimation of the future dimension of the database, that is, the potential number of records. Recommended reading : Z.Bloom, How Long Does An ID Need To Be ?
Milliseconds since epoch
Cited from the previous article, which provides most of the answer to the problem at hand with a nice synthetic style
It may not be necessary for you to encode every time since 1970
however. If you are only interested in keeping recent records close to
each other, you only need enough values to ensure that you don’t have
more values with the same prefix than your database can cache at once
What you could do is convert a GUID into only numeric by converting all the letters into numbers in the guid. Here is a example of what that would look like. It's abit long but if that is not a problem this could be one way of going about generating the keys.
1004234499987310234371029731000544986101469898102
Here is the code i used to generate the string above. But i would probably recommend you using a long primary key insteed although it can be abit of a pain it's probably a safer way to do it then the function below.
string generateKey()
{
Guid guid = Guid.NewGuid();
string newKey = "";
foreach(char c in guid.ToString().Replace("-", "").ToCharArray())
{
if(char.IsLetter(c))
{
newKey += (int)c;
}
else
{
newKey += c;
}
}
return newKey;
}
Edit:
I did some testing with only taking the 20 first numbers and out of 5000000 generated keys 4999978 was uniqe. But when using 25 first numbers it is 5000000 out of 5000000. I would recommend you to do some more testing if going with this method.

What is a hash map in programming and where can it be used

I have often heard people talking about hashing and hash maps and hash tables. I wanted to know what they are and where you can best use them for.
First you shoud maybe read this article.
When you use lists and you are looking for a special item you normally have to iterate over the complete list. This is very expensive when you have large lists.
A hashtable can be a lot faster, under best circumstances you will get the item you are looking for with only one access.
How is it working? Like a dictionary ... when you are looking for the word "hashtable" in a dictionary, you are not starting with the first word under 'a'. But rather you go straight forward to the letter 'h'. Then to 'ha', 'has' and so on, until you found your word. You are using an index within your dictionary to speed up your search.
A hashtable does basically the same. Every item gets an unique index (the so called hash). You use this hash for lookups. The hash may be an index in a normal linked list. For instance your hash could be a number like 2130 which means that you should look at position 2130 in your list. A lookup at a known index within a normal list is very easy and fast.
The problem of the whole approach is the so called hash function which assigns this index to each item. When you are looking for an item you should be able to calculate the index in advance. Just like in a real dictionary, where you see that the word 'hashtable' starts with the letter 'h' and therefore you know the approximate position.
A good hash function provides hashcodes that are evenly distrubuted over the space of all possible hashcodes. And of course it tries to avoid collisions. A collision happens when two different items get the same hashcode.
In C# for instance every object has a GetHashcode() method which provides a hash for it (not necessarily unique). This can be used for lookups and sorting with in your dictionary.
When you start using hashtables you should always keep in mind, that you handle collisions correctly. It can happen quite easily in large hashtables that two objects got the same hash (maybe your overload of GetHashcode() is faulty, maybe something else happened).
Basically, a HashMap allows you to store items with identifiers. They are stored in a table format with the identifier being hashed using a hashing algorithm.
Typically they are more efficient to retrieve items than search trees etc.
You may find this helpful: http://www.relisoft.com/book/lang/pointer/8hash.html
Hope it helps,
Chris
Hashing (in the noncryptographic sense) is a blanket term for taking an input and then producing an output to identify it with. A trivial example of a hash is adding the sum of the letters of a string, i.e:
f(abc) = 6
Note that this trivial hash scheme would create a collision between the strings abc, bca, ae, etc. An effective hash scheme would produce different values for each string, naturally.
Hashmaps and hashtables are datastructures (like arrays and lists), that use hashing to store data. In a hashtable, a hash is produced (either from a provided key, or from the object itself) that determines where in the table the object is stored. This means that as long as the user of the hashtable is aware of the key, retrieving the object is extremely fast.
In a list, in comparison, you would need to in some way search through the list in order to find your sought object. This also represents the backside of hashtables, which is that it is very complicated to find an object in it without knowing the key, because where the object is stored in the table has no relevance to its value nor when it was inputed.
Hashmaps are similar to hashtables, but only one example of each object is stored in it (hence no key needs to be provided, the object itself is the key).
This is of course a very simple explanation, so I suggest you read in depth from this point on. I hope I didn't make any silly mistakes. =)
Hashmap is used for storing data in key value pairs. We can use a hashmap for storing objects in a application and use it further in the same application for storing, updating, deleting values. Hashmap key and values are stored in a bucket to a specific entry, this entry location is determined using Hashcode function. This hashcode function determines the hash where the value is stored. The detailed explanantion of how hashmap works is described in this video: https://youtu.be/iqYC1odZSNo
Hash maps saves a lot of time as compared to other search criteria. We have a hash key that corresponds to a hash code which further helps to find its index value. In terms of implementation, hash maps takes a string converts it into an integer and remaps it to convert it into an index of an array which helps to find the required value.
To go in detail we can look for handling collisions in hash maps. Like instead of using array we can go with the linked list.
There is a short video available to understand it.
Available here :
Implementation example --> https://www.youtube.com/watch?v=shs0KM3wKv8
Sample:
int hashCode(String s)
{
logic
}

Resources