Which would take longer: printing all items stored in a binary search tree in sorted order, or printing all items stored in a hash table in sorted order?
It would take longer to print the items of a hash table in sorted order because a hash table is never sorted, correct? And a BST is?
You are correct. Hash tables are arranged by some hash function, not by the natural sort order of their keys, so you'd have to extract all the entries, O(N), and sort them, O(N log N), whereas you can traverse a binary search tree in natural order in O(N).
Note, however, that Java, for instance, has LinkedHashSet and LinkedHashMap, which give you some of the advantages of hashing but can be traversed in the order in which the entries were added. So you could add the entries in sorted order and then traverse them in that order, as well as look items up by hash.
Correct, a hash table is not "sorted" in the way you probably want. Elements in a hash table are arranged according to the hash function, which usually gives wildly different results for similar phrases. Any resemblance to a sorted order is accidental. It's not a sort by any metric a human would use.
If the main thing you are doing with your collection is printing it in sorted order, you're best off using some type of BST.
A binary search tree is stored in such a way that an in-order (depth-first) traversal visits the items in sorted order (assuming you have a consistent compare function). The Big O of returning the items already in the tree is simply the Big O of traversing the tree, which is O(N).
You are correct about hash tables: they are not sorted. In fact, in order to enumerate everything in a plain hash table, you have to check every bucket to see what is in there, pull it out, then sort what you get. That is a lot of work to get a sorted list.
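The comparison above can be sketched in a few lines of Python. This is only an illustration under assumptions: the tree is a minimal, unbalanced BST, the `Node`, `insert` and `in_order` names are made up for the example, and a built-in `set` stands in for the hash table.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int) -> Node:
    """Insert key into an (unbalanced) binary search tree and return the root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root


def in_order(root: Optional[Node]) -> Iterator[int]:
    """Yield keys in sorted order: left subtree, node, right subtree. O(N)."""
    if root is not None:
        yield from in_order(root.left)
        yield root.key
        yield from in_order(root.right)


items = [41, 7, 99, 23, 58, 3]

# BST: the order is built in, so printing sorted is a single O(N) traversal.
root = None
for item in items:
    root = insert(root, item)
tree_sorted = list(in_order(root))

# Hash-based container: extract everything, O(N), then sort, O(N log N).
table = set(items)
table_sorted = sorted(table)

print(tree_sorted)
print(table_sorted)
```

Both lists come out the same; the difference is only in how much work is needed to produce them.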
Correct, printing sorted data stored in a hash table would be slower because a hash table is not sorted data. It just gives you a quick way to find a particular item. In "Big O Notation" it is said that the item can be found in constant time on average, i.e. O(1) time.
On the other hand, you can find an item in a balanced binary search tree in "logarithmic time" (O(log n)) because the data has already been sorted for you.
So if your goal is to print a sorted list, you are much better off having the data stored in sorted order (i.e. in a binary search tree).
This brings up a couple of interesting questions. Is a search tree still faster considering the following?
Incorporating the setup time for both the Hash Table and the BST?
What if the hash algorithm produces a sorted list of words? Technically, you could create a hash table which uses an order-preserving hash function. In that case the speed of the BST vs. the hash table would come down to the amount of time it takes to fill the hash table in sorted order.
Also check out the related considerations in: Skip List vs. Binary Tree
I have a use case where I have to return all elements of a table in DynamoDB.
Suppose my table has a partition key (Column X) that has the same value in all rows, say "monitor", and a sort key (Column Y) with distinct values.
Will there be any difference in execution time in the below approaches or is it the same?
Scanning whole table.
Querying data based on the partition key having "monitor".
You should use the parallel scan concept. Basically, you run multiple scans at once on different segments of the table. Watch out for higher RCU usage though.
Avoid using scan as far as possible.
Scan will fetch all the rows from a table, and you will also have to use pagination to iterate over all of them. It is much like a select * from table; SQL operation.
Use query if you want to fetch all the rows for a given partition key. If you know which partition key you want the results for, you should use query, because it goes straight to the items with that partition key instead of reading the whole table.
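The pagination mentioned above can be sketched roughly as follows. This is a hedged illustration, not a definitive implementation: `iter_pages` is a made-up helper, and `FakeTable` is a hypothetical in-memory stand-in so the sketch runs without AWS. The only DynamoDB details relied on are that Scan and Query responses carry `Items` and, while more data remains, `LastEvaluatedKey`, which is passed back as `ExclusiveStartKey`.

```python
from typing import Any, Callable, Dict, Iterator, List


def iter_pages(fetch_page: Callable[..., Dict[str, Any]]) -> Iterator[List[dict]]:
    """Yield one page of items at a time from a DynamoDB-style Scan or Query.

    fetch_page is assumed to behave like boto3's Table.scan / Table.query.
    """
    kwargs: Dict[str, Any] = {}
    while True:
        response = fetch_page(**kwargs)
        yield response.get("Items", [])
        last_key = response.get("LastEvaluatedKey")
        if last_key is None:
            break  # no more pages
        kwargs = {"ExclusiveStartKey": last_key}


class FakeTable:
    """Hypothetical stand-in for a table; returns page_size items per call."""

    def __init__(self, items: List[dict], page_size: int) -> None:
        self.items = items
        self.page_size = page_size

    def scan(self, ExclusiveStartKey: Any = None) -> Dict[str, Any]:
        start = 0 if ExclusiveStartKey is None else ExclusiveStartKey["offset"]
        end = start + self.page_size
        response: Dict[str, Any] = {"Items": self.items[start:end]}
        if end < len(self.items):
            response["LastEvaluatedKey"] = {"offset": end}
        return response


table = FakeTable([{"Y": i} for i in range(5)], page_size=2)
pages = list(iter_pages(table.scan))
print([len(page) for page in pages])
```

With boto3, one would presumably pass `table.scan`, or `functools.partial(table.query, KeyConditionExpression=Key("X").eq("monitor"))` with `Key` imported from `boto3.dynamodb.conditions`, in place of the fake.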
Direct answer
To the best of my knowledge, in the specific case you are describing, scan will be marginally slower (esp. in first response). This is when assuming you do not do any filtering (i.e., FilterExpression is empty).
Further thoughts
DynamoDB can potentially store huge amounts of data. By "huge" I mean "more than can fit in any machine's RAM". If you need to 'return all elements of a table' you should ask yourself: what happens if that table grows such that all elements will no longer fit in memory? You do not have to handle this right now (I believe that as of now the table is rather small), but you do need to keep in mind the possibility of going back to this code and fixing it so that it addresses this concern.
Questions I would ask myself if I were in your position:
(1) Can I somehow set a limit on the number of items I need to read (say, read only the first 1000 items)?
(2) How is this information (the list of items) used? Is it sent back to a JS application running inside a browser which displays it to a user? If the answer is yes, then what will the user do with a huge list of items?
(3) Can you work on the items one at a time (or 10 or 100 at a time)? If the answer is yes, then you only need to store one (or 10 or 100) items in memory, not the entire list of items.
In general, in DDB scan operations are used as described in (3): read one item (or several items) at a time, do some processing, and then move on to the next item.
I already have an index set up with the second sort key set to what I want (an integer timestamp). The API keeps complaining that I'm not giving it a KeyConditionExpression. Then if I give it one, it says id must be specified. I've tried forcing it to just give me everything using id <> null and it STILL won't do it. Is this even possible?? Maybe it's time to get rid of Dynamo if it can't do this utterly simple task.
For the love of god, all I'm trying to do is query the entire table AND have it use my sort key. I would have had this going in SQL hours ago.
First of all, DynamoDB is a NoSQL database, so it's intentionally NOT SQL. You shouldn't expect to be able to run the SQL-like queries you are used to, and there is little point in being frustrated that these are two completely different types of databases, each with its strengths and weaknesses.
Records in DynamoDB are partitioned using the hash key, and may optionally be sorted within each partition.
The hash key should be picked so that items are as evenly distributed over partitions as possible. The use of partitions is what makes DynamoDB extremely scalable and fast. But if what you need is to scan over all your items and get them in sorted order, then you probably either are using the wrong tool for the job, or you need to sort the items on the client side.
The scan operation will simply go through all partitions, returning all items from each partition. At this point, the items can only be sorted within their respective partition.
As an example, consider a set of data being partitioned into 3 partitions:
Partition A    Partition B    Partition C
Sort key       Sort key       Sort key
A              D              C
C              E              K
P              G              L
As you can see, you can easily query each partition and get the items in it in sorted order. But if you scan, you will probably get the items back as [A, C, P, D, E, G, C, K, L], if the order is at all deterministic. At this point you would have to sort the items yourself.
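As a rough sketch of that client-side step, using the three example partitions (the dictionary keys are just labels for the illustration): either sort the whole scan result, or, since each partition is already sorted, merge the partitions with `heapq.merge`.

```python
import heapq

# The example partitions, each already sorted by its sort key.
partitions = {
    "first": ["A", "C", "P"],
    "second": ["D", "E", "G"],
    "third": ["C", "K", "L"],
}

# A scan returns partition after partition, so the overall order is not sorted.
scan_result = [key for items in partitions.values() for key in items]

# Option 1: sort everything on the client, O(N log N).
sorted_result = sorted(scan_result)

# Option 2: k-way merge of the already-sorted partitions, O(N log k).
merged_result = list(heapq.merge(*partitions.values()))

print(scan_result)
print(sorted_result)
```

The merge only pays off if you can actually read each partition separately in sorted order; a plain scan gives you no such guarantee.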
A "trick" that is sometimes seen is to use a "dummy" hash key with an equal value for all items, like you mentioned in your own answer. This way you can query for "dummy = 1" and get the items sorted according to the sort key. However, this completely defeats the purpose of the hash key as all items will be put in the same partition, thus not making the table scale at all. But if you find yourself using DynamoDB even though you have a really small dataset, by all means it would work. But again, with a small data set and use-cases like this, you should probably be using another tool such as RDS in the first place.
Just to elaborate on @JHH's answer. In general I'd say he is correct that you shouldn't need to sort all elements in DynamoDB. I also have a requirement similar to this, as I need to get the top N elements, which could all be in different partitions.
DynamoDB does have a way of doing this, it just isn't out of the box. I don't think that it's so correct to say you should then need an SQL database, as arguably you'd never use a NoSQL database because you will always have one of these limitations. Also if you only ever use NoSQL for large data-sets then you will always have to rework your application later.
What to do then? Well, you do have a few options, and it depends on your use case. Let's assume that you at least have sorting within your partitions, which makes it easier. We'll also assume you are looking for the max.
The simplest way would be to get the first value from every partition and find the max. If you needed, say, the top 10 values, you could still use this strategy, but it would get more complicated.
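A minimal sketch of that strategy for the top N, assuming each partition can be read in ascending sort-key order and is represented here by a plain list (the `top_n` helper and the sample values are made up for illustration):

```python
import heapq
from itertools import chain
from typing import Dict, List


def top_n(partitions: Dict[str, List[int]], n: int) -> List[int]:
    """Return the n largest values, given partitions sorted in ascending order."""
    # The global top n can only come from the last n items of each partition.
    candidates = chain.from_iterable(items[-n:] for items in partitions.values())
    return heapq.nlargest(n, candidates)


partitions = {"p1": [1, 5, 9], "p2": [2, 8, 10], "p3": [3, 4, 7]}
print(top_n(partitions, 1))
print(top_n(partitions, 3))
```

The cost grows with the number of partitions times N, which is the "more complicated" part once the values are scattered widely.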
Next option is to make use of DynamoDB Streams. Say we want to keep a list of the top 100 elements. These would sit ready and waiting on their own top values partition, sorted and ready for instant retrieval. You would need to maintain this list yourself by checking when items are inserted or updated, that they are greater than the 100th element. If that is the case you would insert the element into the top values partition, and delete the last value. This I think would be the most likely way to approach this problem.
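The bookkeeping for such a top-values list could look something like the sketch below. It leaves out DynamoDB Streams entirely and only shows the "is it greater than the smallest kept element?" check, using a min-heap. The `TopValues` class is hypothetical, and updates or deletions of items already in the list are not handled.

```python
import heapq
from typing import List


class TopValues:
    """Keep the `capacity` largest values seen so far."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._heap: List[int] = []  # min-heap: _heap[0] is the smallest kept value

    def on_insert(self, value: int) -> None:
        """Call for every newly inserted item (e.g. from a stream handler)."""
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, value)
        elif value > self._heap[0]:
            # New value beats the current smallest: swap it in, drop the smallest.
            heapq.heapreplace(self._heap, value)

    def sorted_desc(self) -> List[int]:
        return sorted(self._heap, reverse=True)


top = TopValues(capacity=3)
for value in [5, 1, 9, 3, 7, 8]:
    top.on_insert(value)
print(top.sorted_desc())
```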
So in NoSQL, if there is some sort of query you would love to do that is oh so easy in SQL, and you can't use your Table/GSI/LSI, then you pretty much need to compute the result manually and have it ready for consumption.
Now if you weren't going to make use of these top values very often, then you might go with the first method and scan every partition's top values until you had the list you wanted, but depending on how much the values are scattered across partitions this could take many capacity units.
Hope that helps.
Turns out, you can also add an IndexName to a scan. That helps. Furthermore, if you create an index with a sort key, all items MUST have an identical partition key value for the sort to apply across them.
I don't understand what an index is or does in SQLite. (NOT SQL) I think it allows for sorting in ascending and descending order and access to data quicker. But I'm just guessing here.
Why not SQL? The answer is the same, though the internal details will differ between implementations.
Putting an index on a column tells the database engine to build, unsurprisingly, an index that allows it to quickly locate rows when you search for certain values in a column, without having to scan every row in the table.
A simple (and probably suboptimal) index might be built with an ordinary binary search tree.
Yes, indexes are all about improved data access performance (but at the cost of storage).
http://en.wikipedia.org/wiki/Index_(database)
An index (in any database) is a list of some kind which associates a sorted (or at least, quickly searchable) list of keys with information about where to find the rest of the data associated with the key.
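To make that definition concrete, here is a small sketch of an "index" as a sorted list of keys, each paired with the position of its row in an unsorted table. The rows and the `lookup` helper are invented for the example; real database indexes are usually B-trees rather than flat sorted lists.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

# The "table": rows in arbitrary (insertion) order. Made-up data.
rows = [
    ("Verdi", "555-0101"),
    ("Rossi", "555-0102"),
    ("Bianchi", "555-0103"),
    ("Neri", "555-0104"),
]

# The index: keys in sorted order, each paired with the position of its row.
index: List[Tuple[str, int]] = sorted((name, pos) for pos, (name, _) in enumerate(rows))
keys = [name for name, _ in index]


def lookup(name: str) -> Optional[Tuple[str, str]]:
    """Binary-search the index, O(log N), then jump straight to the row."""
    i = bisect_left(keys, name)
    if i < len(keys) and keys[i] == name:
        return rows[index[i][1]]
    return None


print(lookup("Rossi"))
print(lookup("Zeta"))
```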
You may not be finding information about this on the Internet because you're assuming it's a SQLite concept, but it's not - it's a general computer engineering concept.
Think about an address book. If you are searching for the phone number of Rossi Mario, you know that surnames are ordered alphabetically, so you can go to the letter R, then search for the letter o and so on. Indexes do the same: they are collections of references to entries that speed up some operations a lot.
Searching in an unordered address book would be much slower: you would have to start from the first name on the first page and search through all the pages until you find the name you are looking for.
I think it allows for sorting in ascending and descending order and access to data quicker.
Yes, that's what it's for. Indexes create the abstraction of having sorted data, which speeds up searches significantly. With an index using a balanced binary search tree, searches take O(log N) instead of O(N) time.
What the other answers haven't mentioned is that most databases use indexes in order to implement UNIQUE (and therefore also PRIMARY KEY) constraints. In order to ensure uniqueness, you have to be able to detect whether the key is already there, and this means you want fast searches for it.
Take a look in your SQLite database. Those sqlite_autoindex_ indices were created to enforce UNIQUE constraints.
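One way to take that look is with Python's built-in `sqlite3` module. The sketch below uses an in-memory database with a made-up `users` table and a made-up `idx_users_name` index; it lists the indexes SQLite has recorded and asks for the query plan of a lookup by name. The exact wording of the plan text differs between SQLite versions, so it is only printed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The UNIQUE constraint on email makes SQLite create an automatic index.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE, name TEXT)")
# An explicit index on name, to speed up searches by name.
conn.execute("CREATE INDEX idx_users_name ON users (name)")

index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name"
    )
]
print(index_names)

# Ask the query planner how it would run a search on the indexed column.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE name = ?", ("Mario",)
).fetchall()
plan_detail = " ".join(row[-1] for row in plan)
print(plan_detail)
```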
The same as an index in any SQL (YES SQL) RDBMS.
You can see the SQLite query optimizer considers indexes: http://www.sqlite.org/optoverview.html
Speed up searching and sorting
Different types of SQLite indices speed up searching and sorting in different ways.
The following tutorial explains this in a great way.
What property makes a hash table, a hash list and a hash tree different from each other? Which one is used when? When is a table superior to a tree?
Hashtable: it's a data structure into which you insert (key, value) pairs; the key is used to compute a hash code that decides where to store the value associated with that key. This kind of structure is useful because calculating a hash code does not depend on how many items are stored, so you can find or place an item in constant time on average. (Mind that there are caveats and different implementations that change this performance slightly.)
Hashlist: it is just a list of hashcodes calculated on various chunks of data. Eg: you split a file in many parts and you calculate a hashcode for each part, then you store all of them in a list. Then you can use that list to verify integrity of the data.
Hashtree: it is similar to a hashlist but instead of having a list of hashes you have got a tree, so every node in the tree is a hashcode that is calculated on its children. Of course leaves will be the data from which you start calculating the hashcodes.
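A hash list and a hash tree can be sketched with `hashlib` as below. This is an illustration under assumptions, not any particular protocol's scheme: SHA-256 is an arbitrary choice, the function names are invented, and when a level has an odd number of hashes the last one is simply duplicated.

```python
import hashlib
from typing import List


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def hash_list(chunks: List[bytes]) -> List[bytes]:
    """Hash list: one hash per chunk of data."""
    return [sha256(chunk) for chunk in chunks]


def merkle_root(chunks: List[bytes]) -> bytes:
    """Hash tree: repeatedly hash pairs of child hashes until one root remains."""
    level = hash_list(chunks)  # the leaves' hashes
    if not level:
        raise ValueError("need at least one chunk")
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last hash
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


chunks = [b"part-1", b"part-2", b"part-3"]
print(len(hash_list(chunks)))
print(merkle_root(chunks).hex())
```

Changing any single chunk changes its entry in the hash list and also the root of the tree, which is what makes both usable for integrity checks.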
Hashtables (also called hashmaps) are broadly useful, while hashlists and hashtrees are more specialized and serve specific purposes.
I have often heard people talking about hashing and hash maps and hash tables. I wanted to know what they are and where they are best used.
First you should maybe read this article.
When you use lists and you are looking for a special item you normally have to iterate over the complete list. This is very expensive when you have large lists.
A hashtable can be a lot faster: under the best circumstances you will get the item you are looking for with only one access.
How does it work? Like a dictionary ... when you are looking for the word "hashtable" in a dictionary, you do not start with the first word under 'a'. Rather, you go straight to the letter 'h', then to 'ha', 'has' and so on, until you find your word. You are using an index within your dictionary to speed up your search.
A hashtable does basically the same. Every item gets an index (the so-called hash), which is ideally unique. You use this hash for lookups. The hash may be an index into a normal array. For instance, your hash could be a number like 2130, which means that you should look at position 2130 in your array. A lookup at a known index within an array is very easy and fast.
The problem of the whole approach is the so called hash function which assigns this index to each item. When you are looking for an item you should be able to calculate the index in advance. Just like in a real dictionary, where you see that the word 'hashtable' starts with the letter 'h' and therefore you know the approximate position.
A good hash function provides hashcodes that are evenly distributed over the space of all possible hashcodes. And of course it tries to avoid collisions. A collision happens when two different items get the same hashcode.
In C#, for instance, every object has a GetHashCode() method which provides a hash for it (not necessarily unique). This can be used for lookups and for placing items within your dictionary.
When you start using hashtables you should always keep in mind that you have to handle collisions correctly. It can happen quite easily in large hashtables that two objects get the same hash (maybe your override of GetHashCode() is faulty, maybe something else happened).
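The same idea can be shown in Python, where `__hash__` and `__eq__` play roughly the role of GetHashCode() and Equals(). In this sketch the hypothetical `Word` class has a deliberately poor hash (the length of the text), so all three keys collide, yet lookups stay correct because the dictionary falls back on equality to tell colliding keys apart.

```python
class Word:
    """Hypothetical key type with a deliberately poor hash function."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __hash__(self) -> int:
        return len(self.text)  # every four-letter word gets the same hash

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Word) and self.text == other.text


lookup = {Word("tree"): 1, Word("hash"): 2, Word("list"): 3}

print(hash(Word("tree")) == hash(Word("hash")))  # the hashes collide
print(lookup[Word("hash")])                      # the lookup is still correct
```

Collisions like these cost speed, not correctness, as long as equality is implemented consistently with the hash.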
Basically, a HashMap allows you to store items with identifiers. They are stored in a table format with the identifier being hashed using a hashing algorithm.
Typically they retrieve items more efficiently than search trees etc.
You may find this helpful: http://www.relisoft.com/book/lang/pointer/8hash.html
Hope it helps,
Chris
Hashing (in the noncryptographic sense) is a blanket term for taking an input and then producing an output to identify it with. A trivial example of a hash is summing the letters of a string (with a = 1, b = 2, and so on), i.e.:
f(abc) = 6
Note that this trivial hash scheme would create a collision between the strings abc, bca, ae, etc. An effective hash scheme would only rarely produce the same value for different strings.
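The letter-sum hash is short enough to write out. This sketch assumes lowercase a-z input with a = 1, b = 2, and so on; the function name is made up.

```python
def letter_sum_hash(text: str) -> int:
    """Sum of letter positions in the alphabet (a = 1, b = 2, ...)."""
    return sum(ord(ch) - ord("a") + 1 for ch in text)


for word in ["abc", "bca", "ae"]:
    print(word, letter_sum_hash(word))
```

All three strings hash to 6, which is exactly the collision described above.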
Hashmaps and hashtables are datastructures (like arrays and lists), that use hashing to store data. In a hashtable, a hash is produced (either from a provided key, or from the object itself) that determines where in the table the object is stored. This means that as long as the user of the hashtable is aware of the key, retrieving the object is extremely fast.
In a list, in comparison, you would need to search through the list in some way in order to find the object you are after. This also points to the downside of hashtables, which is that it is very complicated to find an object in one without knowing the key, because where the object is stored in the table bears no relation to its value or to when it was inserted.
Hash sets are similar to hashtables, but only one instance of each object is stored in them (hence no separate key needs to be provided; the object itself is the key). Hashmaps, like hashtables, store key-value pairs.
This is of course a very simple explanation, so I suggest you read in depth from this point on. I hope I didn't make any silly mistakes. =)
A hashmap is used for storing data in key-value pairs. We can use a hashmap for storing objects in an application and use it further in the same application for storing, updating and deleting values. Each key and its value are stored as an entry in a bucket, and the bucket is determined by applying a hash code function to the key. A detailed explanation of how a hashmap works is given in this video: https://youtu.be/iqYC1odZSNo
Hash maps save a lot of time compared to other ways of searching. A hash key corresponds to a hash code, which in turn helps to find its index. In terms of implementation, a hash map takes a string, converts it into an integer, and remaps that integer to an index into an array, where the required value is found.
To go into more detail, we can look at handling collisions in hash maps. For example, instead of storing a single entry in each array slot, we can store a linked list of entries.
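Those steps (string to integer, integer to array index, a chain per slot for collisions) can be sketched as follows. This is a minimal illustration, not how any particular library does it: the `ChainedHashMap` class is invented, the string hash is a simple polynomial one, and a Python list stands in for each slot's linked list.

```python
from typing import Any, List, Tuple


class ChainedHashMap:
    def __init__(self, capacity: int = 8) -> None:
        # One chain (here a Python list) per array slot.
        self._buckets: List[List[Tuple[str, Any]]] = [[] for _ in range(capacity)]

    def _index(self, key: str) -> int:
        code = 0
        for ch in key:                         # string -> integer
            code = 31 * code + ord(ch)
        return code % len(self._buckets)       # integer -> array index

    def put(self, key: str, value: Any) -> None:
        bucket = self._buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)       # overwrite an existing entry
                return
        bucket.append((key, value))            # a collision just extends the chain

    def get(self, key: str, default: Any = None) -> Any:
        for existing_key, value in self._buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default


# capacity=1 forces every key into the same slot, so every insert collides.
ages = ChainedHashMap(capacity=1)
ages.put("alice", 30)
ages.put("bob", 25)
ages.put("alice", 31)
print(ages.get("alice"), ages.get("bob"), ages.get("carol"))
```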
There is a short video that helps to understand it, available here:
Implementation example --> https://www.youtube.com/watch?v=shs0KM3wKv8
Sample:
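int hashCode(String s)
{
    // Assumed body: a polynomial hash, the scheme Java's String.hashCode() uses.
    int h = 0;
    for (int i = 0; i < s.length(); i++)
        h = 31 * h + s.charAt(i);
    return h;
}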