Since it is possible for two criteria to hash to the same bucket, does that mean that the function you end up with might not have all the criteria?

Our lecturer said that for each criterion, we design a hash function that maps each input to a number between 0 and n. If a selection criterion hashes to a particular bit, we set that bit to one, and several criteria can be sent to the same bit. So does that mean that, in the end, the function does not check all criteria, only a few of them?
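What the lecturer describes is essentially a Bloom filter, and no criterion is lost: every criterion still sets its bit when it is added, so the structure reflects all of them. What you give up when two criteria share a bit is precision on lookup: you can get false positives, never false negatives. A minimal Java sketch of the idea (names are illustrative, and a real Bloom filter would use several independent hash functions per item rather than the single one shown here):

```java
import java.util.BitSet;

// Minimal Bloom-filter-style structure: every inserted item sets one bit.
// Two items may set the same bit; nothing is ever "dropped", but lookups
// can report false positives.
class TinyBloomFilter {
    private final BitSet bits;
    private final int n;

    TinyBloomFilter(int n) {
        this.bits = new BitSet(n);
        this.n = n;
    }

    // Map an item to a bucket in [0, n).
    private int bucket(String item) {
        return Math.floorMod(item.hashCode(), n);
    }

    void add(String item) {
        bits.set(bucket(item)); // may land on a bit another item already set
    }

    // false => definitely absent; true => possibly present (false positive).
    boolean mightContain(String item) {
        return bits.get(bucket(item));
    }

    public static void main(String[] args) {
        TinyBloomFilter f = new TinyBloomFilter(8);
        f.add("criterionA");
        f.add("criterionB"); // may share a bit with criterionA
        System.out.println(f.mightContain("criterionA")); // true
        System.out.println(f.mightContain("criterionC")); // usually false, sometimes true
    }
}
```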

Partition By & Clustered & Distributed By in U-SQL - need to know their meaning and when to use them

I can see that while creating a table in U-SQL we can use the Partition By, Clustered, and Distributed By clauses.
As per my understanding, partitioning will store data with the same key (the one we partition on) together or close together (maybe in the same structured stream in the background), so that our queries will be faster when we use that key in joins and filters.
Clustering, I guess, stores the data of those columns together or close together inside each partition.
And distribution is some method, like hash or round robin, of storing data inside each partition. If you have an integer column and you frequently query within some range, use range; otherwise use hash. If your data is not distributed evenly, you may face a data-skew issue, in which case use round robin.
Question 1: Please let me know whether my understanding above is correct.
Question 2: There is an INTO clause - how should we identify the value to use for this INTO clause for DISTRIBUTION?
Question 3: Also, which of these is vertical partitioning and which is horizontal?
Question 4: I don't see any good online documentation for learning these concepts with examples. If you know of any, please send me links.
Peter and Bob have given you links to documentation.
To quickly answer your questions here:
Partitions and distributions both partition the data based on the partitioning scheme, and both provide data scale-out and partition elimination.
Partitions are optional and individually manageable for data life-cycle management (besides giving you the ability to get partition elimination); they currently only support value-based partitioning on the partition columns' values.
Each partition then gets further partitioned based on the distribution scheme, where you have different schemes available (HASH, RANGE, etc.). The system decides on the number of distribution buckets based on some heuristic; in the case of HASH distributions, you can also specify the number of buckets with the INTO clause.
Clustering then specifies the order of the rows within a distribution bucket and lets you further improve query performance (you can do a range scan instead of a full scan, for example).
Vertical and horizontal partitioning are terms sometimes used to separate these two levels of partitioning. I try to avoid them, since it can be confusing to remember which one is which.
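As a hedged illustration of the two levels, here is a small Java sketch modeling what DISTRIBUTED BY HASH (key) INTO n does conceptually: rows are assigned to one of n distribution buckets by hashing the distribution key, and clustering then orders the rows within each bucket. (U-SQL's actual hashing is internal to the engine; this only models the idea.)

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Conceptual model of DISTRIBUTED BY HASH (key) INTO n plus clustering:
// a row goes to the bucket chosen by hashing its key, and each bucket is
// then kept sorted so range scans can replace full scans within a bucket.
class DistributionSketch {
    public static void main(String[] args) {
        int n = 4; // what the INTO clause controls: the number of buckets
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < n; i++) buckets.add(new ArrayList<>());

        int[] keys = {42, 7, 99, 7, 13, 42, 58};
        for (int key : keys) {
            int b = Math.floorMod(Integer.hashCode(key), n);
            buckets.get(b).add(key); // the same key always lands in the same bucket
        }

        // "Clustering": order rows within each distribution bucket.
        for (List<Integer> bucket : buckets) Collections.sort(bucket);

        System.out.println(buckets);
    }
}
```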

Why is HashSet good for search operations?

A HashSet's underlying data structure is a hash table. How does it identify duplicates, and why is it good when our most frequent operation is search?
It uses the hash code of the object, which is a quickly computed integer. This hash code tries to be distributed as evenly as possible over all potential object values.
As a result, it can distribute the inserted values into an array (the hash table) with a very low probability of conflict. The search operation is then quite quick: get the hash code, access the array, compare, and get the value, usually in constant time. The same actually happens when checking for duplicates.
Hash-code conflicts are resolved as well: there can be more than one value behind the same entry of the hash table, and that is where equals() comes into play. But conflicts are rare enough that they don't significantly affect average performance.
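A short Java example of both points: add() detects duplicates via the hash lookup (its boolean return says whether the element was new), and contains() is the same near-constant-time bucket lookup:

```java
import java.util.HashSet;
import java.util.Set;

public class HashSetDemo {
    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();

        // add() hashes the value, jumps to its bucket, and calls equals()
        // only against the few entries already in that bucket.
        System.out.println(seen.add("alice")); // true  (newly inserted)
        System.out.println(seen.add("bob"));   // true
        System.out.println(seen.add("alice")); // false (duplicate detected)

        // contains() does the same bucket lookup: O(1) on average.
        System.out.println(seen.contains("bob"));   // true
        System.out.println(seen.contains("carol")); // false
    }
}
```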

Confusion about finding information in a hash table when there is a collision

I understand that if there is a collision in a hash table, you have a few options for storing the data. You could use some prime-number step to linearly traverse the array until you find a free spot, or you could rehash the entire table into a larger array; I'm sure there are other ways. What I don't understand is: if there is a collision in the first place, how would you know which row of data is the one you were looking for? Would I just not allow duplicate keys to be used?
There's a big difference between a hash and a key (although they could sometimes be the same).
The key could be a very large number, a complex object consisting of many fields or anything really.
You apply your hash function to this key to get a hash.
So even if you disallow duplicate keys, you could still have duplicate hashes.
You often can't use your key as a hash directly: array indices are consecutive integers starting at 0, so a key that is too large, negative, or not an integer won't work, and you have to apply some sort of hash function.
If you want to store numbers between 1 and 10000, you would let the key be the number itself and could make the hash the remainder after dividing by 1000 (you'd thus have an array of size 1000 for the hash table).
Inserting 1001 will put it at index 1. If you try to insert 2001, it will also try to go to index 1 and you'll have a collision.
* The key could either be the entire value you want to store or only an identifier for it.
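Here is a Java sketch of the "mod 1000" example above, using separate chaining (an alternative to the linear probing mentioned in the question; all names are illustrative). The hash only narrows the search to one slot; the full key stored alongside each value is what disambiguates colliding entries:

```java
import java.util.ArrayList;
import java.util.List;

// Hash table with separate chaining: each slot holds a small list of
// (key, value) pairs, so colliding keys coexist and lookups compare keys.
class ChainedTable {
    record Entry(int key, String value) {}

    private final List<List<Entry>> slots = new ArrayList<>();

    ChainedTable(int size) {
        for (int i = 0; i < size; i++) slots.add(new ArrayList<>());
    }

    void put(int key, String value) {
        slots.get(key % slots.size()).add(new Entry(key, value));
    }

    String get(int key) {
        // The hash picks the slot; the stored key identifies the entry.
        for (Entry e : slots.get(key % slots.size())) {
            if (e.key() == key) return e.value();
        }
        return null; // not present
    }

    public static void main(String[] args) {
        ChainedTable t = new ChainedTable(1000);
        t.put(1001, "first");
        t.put(2001, "second"); // collides with 1001 at index 1
        System.out.println(t.get(1001)); // first
        System.out.println(t.get(2001)); // second
    }
}
```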

ASP.NET: Generating Order ID?

I am getting ready to launch a website I designed in ASP.NET.
The problem is, I don't want my customers to have a super low order ID (for example, #00000001).
How would I generate a unique (and random) order ID, so the customer would get an order number like K20434034?
Set the identity seed for your OrderId to a large number. Then, when you present an order number to the user, you could have a constant that you prepend to the order ID (like all orders starting with K), or you could generate a random character string and store that on the order record as well.
There are multiple options from both the business tier and the database.
Consider:
- a random number has a chance of collision
- it is probably best not to expose an internal ID, especially a sequential one
- a long value will annoy users if they ever have to type or speak it
Options:
- generate a cryptographically random number (an Int64 generated with RNGCryptoServiceProvider has a very low chance of collision or predictability); see the sketch after this list
- use an auto-incremented column which begins at some arbitrary number other than zero
- use UNIQUEIDENTIFIER (or System.Guid) and base-62 encode the bytes
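Since the question is about ASP.NET but the idea is language-neutral, here is a hedged sketch of the cryptographically random option in Java syntax (in .NET you would draw the randomness from RNGCryptoServiceProvider instead; the "K" prefix and 8-digit width are assumptions matching the example in the question):

```java
import java.security.SecureRandom;

// Sketch of the "cryptographically random order number" option:
// a constant prefix letter plus 8 cryptographically random digits.
public class OrderIdDemo {
    private static final SecureRandom RNG = new SecureRandom();

    static String newOrderId() {
        // Collisions are still possible, so back this with a UNIQUE
        // constraint in the database and retry on the rare duplicate.
        int number = RNG.nextInt(100_000_000);
        return String.format("K%08d", number); // e.g. K20434034
    }

    public static void main(String[] args) {
        System.out.println(newOrderId());
    }
}
```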
I suggest you just start the identity seed at some higher number if all you care about is that they don't think the number is low. The problem with random is that there is always the chance for collisions, and it gets more and more expensive to check for duplicates as the number of existing order IDs piles up.
Make the column's data type UNIQUEIDENTIFIER. This data type will give you IDs in the format shown below. Hope this fulfills the need.
B85E62C3-DC56-40C0-852A-49F759AC68FB

Big O of Hash Table vs. Binary Search Tree

Which would take longer: printing all items stored in a binary search tree in sorted order, or printing all items stored in a hash table in sorted order?
It would take longer to print the items of a hash table in sorted order, because a hash table is never sorted, correct? And a BST is?
You are correct. Hash tables are arranged by some hash function, not by their natural sort order, so you'd have to extract all the entries, O(N), and then sort them, O(N log N), whereas you can traverse a binary search tree in natural order in O(N).
Note, however, that in Java, for instance, there are LinkedHashSet and LinkedHashMap, which give you some of the advantages of hashing but can be traversed in insertion order; you could sort the data once and then traverse it in that sorted order while still being able to extract items by hash.
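In Java terms the difference looks like this: a TreeSet (a red-black tree, i.e. a balanced BST) iterates in sorted order for free, while a HashSet has to be copied out and sorted first:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class SortedPrintDemo {
    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(5, 1, 4, 2, 3);

        // TreeSet is a balanced BST: in-order iteration is already
        // sorted, so printing all n items is O(n).
        Set<Integer> tree = new TreeSet<>(data);
        System.out.println(tree); // [1, 2, 3, 4, 5]

        // HashSet iterates in hash order; to print sorted we must copy
        // out all n entries and sort them: O(n log n).
        Set<Integer> hash = new HashSet<>(data);
        List<Integer> sorted = new ArrayList<>(hash);
        Collections.sort(sorted);
        System.out.println(sorted); // [1, 2, 3, 4, 5]
    }
}
```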
Correct, a hash table is not "sorted" in the way you probably want. Elements in a hash table are arranged according to the hash function, which usually produces wildly different positions for similar values; it is not a sort by any metric a human would use.
If the main thing you are doing with your collection is printing it in sorted order, you're best off using some type of BST.
A binary search tree is stored in such a way that if you do an in-order depth-first traversal, you will visit the items in sorted order (assuming you have a consistent compare function). The Big O of simply returning the items already in the tree is the Big O of traversing the tree.
You are correct about hash tables: they are not sorted. In fact, to enumerate everything in a plain hash table, you have to check every bucket to see what is in there, pull it out, and then sort what you get. That is a lot of work to get a sorted list.
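A minimal in-order traversal in Java shows why the tree's output comes out sorted in O(n); this is a toy unbalanced BST for illustration, not a production structure:

```java
// Toy BST node: an in-order traversal (left subtree, node, right subtree)
// visits values in ascending order by construction.
class Node {
    int value;
    Node left, right;

    Node(int value) { this.value = value; }

    void insert(int v) {
        if (v < value) {
            if (left == null) left = new Node(v); else left.insert(v);
        } else {
            if (right == null) right = new Node(v); else right.insert(v);
        }
    }

    void printInOrder() {
        if (left != null) left.printInOrder();
        System.out.print(value + " ");
        if (right != null) right.printInOrder();
    }

    public static void main(String[] args) {
        Node root = new Node(4);
        for (int v : new int[]{2, 6, 1, 3, 5, 7}) root.insert(v);
        root.printInOrder(); // 1 2 3 4 5 6 7
        System.out.println();
    }
}
```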
Correct, printing sorted data stored in a hash table would be slower because a hash table does not store sorted data; it just gives you a quick way to find a particular item. In Big O notation, an item can be found in constant time, i.e. O(1).
On the other hand, you can find an item in a binary search tree in logarithmic time, O(log n), because the data is kept in sorted order for you.
So if your goal is to print a sorted list, you are much better off having the data stored in sorted order (i.e. in a binary tree).
This brings up a couple of interesting questions. Is a search tree still faster considering the following?
- Incorporating the setup time for both the hash table and the BST?
- What if the hash algorithm produces a sorted list of words? Technically, you could create a hash table that uses such an algorithm, in which case the speed of the BST vs. the hash table would come down to the amount of time it takes to fill the hash table in sorted order.
Also check out the related considerations of Skip List vs. Binary Tree.
