Can someone explain the difference between dictionaries and hashtables? In Java, I've read that dictionaries are a superset of hashtables, but I always thought it was the other way around. Other languages seem to treat the two as the same. When should one be used over the other, and what's the difference?
The Oxford Dictionary of Computing defines a dictionary as...
Any data structure representing a set of elements that can support the insertion and deletion of elements as well as a test for membership.
As such, dictionaries are an abstract idea that can be implemented reasonably efficiently as, e.g., binary trees, hash tables, tries, or even direct array indexing if the keys are numeric and not too sparse. That said, Python uses a closed-hashing (open-addressing) hash table for its dict implementation, and C# seems to use some kind of hash table too (hence the need for a separate SortedDictionary type).
A hash table is a much more specific and concrete data structure: there are several implementation options (closed vs. open hashing being perhaps the most fundamental), but they're all characterised by O(1) amortised insertion, lookup and deletion, and there's no excuse for begin-to-end iteration worse than O(n + #buckets), while implementations may achieve better (e.g. GCC's C++ library has O(n) container iteration). The implementations necessarily depend on a hash function leading to an indexed probe into an array.
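To make the ADT/implementation split concrete, here's a minimal Python sketch (illustrative only; the class and method names are my own) of the dictionary operations from the definition above, backed by an open-hashing (separate-chaining) hash table. A binary tree or trie could implement the exact same interface:

class ChainedHashTable:
    def __init__(self, n_buckets=16):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # the hash function leads to an indexed probe into an array
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # amortised O(1) on average

    def contains(self, key):             # the membership test
        return any(k == key for k, _ in self._bucket(key))

    def delete(self, key):
        bucket = self._bucket(key)
        bucket[:] = [(k, v) for k, v in bucket if k != key]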
The way I see it, a hashtable is one way of implementing a dictionary: the key is hashfunction(x) and the value is any Object. The Java Dictionary can use any key as long as .equals(y) has been implemented for that object.
The 'answer' will also change depending on the language (C#? Java? JS?) you're using. In JS, the 'dictionary' is implemented as a hashtable and there is no difference. In another language (I believe it's C#), a Dictionary must be strongly typed, with a fixed key type and a fixed value type, while a Hashtable's values can be of any type, and the two are not extended from one another.
Related
I know this question is very similar to this one I asked some time ago: Why F#'s default set collection is sorted while C#'s isn't?
However, I'd like to confirm if the reason given there is the same for this case? And I wonder if there's an implementation of an immutable Map in F# out there, written by someone, which is less strict and doesn't require K to be comparable? I'd happily use it because I don't care so much about performance.
Why does F#'s idiomatic dictionary collection Map<K,V> need the type K
to be comparable while C#'s Dictionary<K,V> doesn't?
F# Map<Key, Value> requires keys to be comparable because the map is implemented as a tree structure (you have to decide which subtree to descend into depending on the comparison result). C# Dictionary<Key, Value> is implemented as an array of buckets of linked entries: you find a bucket by the hash code of the key and then iterate the list until you find (or fail to find) an equal key. In both data structures keys are compared; the only difference is that for the dictionary, equality comparison is enough.
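To make the structural difference concrete, here is a language-neutral sketch in Python (Node and the function names are illustrative, not either library's actual code):

class Node:
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

# Tree-based map lookup: every step needs an ordering decision (<),
# which is why F#'s Map constrains keys to be comparable.
def tree_lookup(node, key):
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    raise KeyError(key)

# Hash-based lookup: hashing picks the bucket, then only equality is
# used -- which is why C#'s Dictionary only needs key equality.
def hash_lookup(buckets, key):
    for k, v in buckets[hash(key) % len(buckets)]:
        if k == key:
            return v
    raise KeyError(key)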
So, the question is: why does F# Map have an explicit comparison constraint, while C# Dictionary has only an implicit equality requirement?
Let's start with C#. What would happen if the dictionary had an IEquatable key constraint? Well, you would have to manually implement this interface for every custom data type used as a dictionary key. But what if you want different implementations of equality? E.g. in some dictionaries you want your string keys to be case-insensitive. Of course, you can pass an IEqualityComparer implementation to be used for key comparison (not only to the dictionary, but anywhere you need comparison). But why then require the key to be equatable if an external comparer may be used? Note that there is always a default comparer which is used if you don't pass anything to the dictionary; it checks whether the key implements IEquatable<T> and uses that implementation.
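Python has no direct counterpart of IEqualityComparer, but the idea of plugging in a different notion of key equality (such as C#'s StringComparer.OrdinalIgnoreCase) can be sketched by wrapping keys; the wrapper class below is purely illustrative:

class CaseInsensitiveKey:
    def __init__(self, s):
        self.s = s
    def __hash__(self):                  # hashing ignores case...
        return hash(self.s.lower())
    def __eq__(self, other):             # ...and so does equality
        return self.s.lower() == other.s.lower()

d = {CaseInsensitiveKey("Hello"): 1}
print(CaseInsensitiveKey("HELLO") in d)  # True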
Why then does F# have an explicit comparable constraint on the key data type? Because this constraint does not force you to manually implement IComparable for every custom data type used as a map key. One big difference between the C# and F# type systems is that F# types are comparable and equatable by default: the F# compiler generates IComparable, IComparable<T>, and IStructuralComparable implementations unless you explicitly mark the type with the NoComparison attribute. So this constraint doesn't require you to write any additional code when you use F# data types.
One more benefit of using comparison/equality constraints: F# has a number of predefined generic operations on types that implement comparison or equality (=, <, <=, >, >=, max, min), which makes code with generic comparable/equatable types much more readable.
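As a loose parallel (not the same mechanism), Python's dataclasses can also auto-generate equality and ordering from field definitions, which gives a feel for what the F# compiler does for you:

from dataclasses import dataclass

@dataclass(order=True)   # generates __eq__, __lt__, __le__, __gt__, __ge__
class Point:
    x: int
    y: int

print(Point(1, 2) < Point(1, 3))      # True: fields compared in order
print(max(Point(0, 0), Point(2, 1)))  # Point(x=2, y=1), no hand-written code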
I am looking for an Array-like type with the following properties:
stores elements on disk
elements can have composite type
elements are read into memory, not the whole array
it is possible to write individual elements without writing the whole array
supports setindex!, getindex, push!, pop!, shift!, unshift! and maybe vcat
is reasonably efficient
So far I have found the following leads:
https://docs.julialang.org/en/latest/stdlib/SharedArrays/
http://juliadb.org
https://github.com/JuliaIO/JLD.jl
The first one seems promising, but it seems the type of the elements has to be isbits (i.e., plain bits types such as numbers and certain immutable structs, but not, e.g., an Array{Float64,1}). And it's not clear whether the whole array's contents are loaded into memory.
If it does not exist yet, I will of course try to construct it myself.
NCDatasets.jl addresses part of the requirements:
stores elements on disk: yes
elements can have composite type: no (some support for composite types exists in NetCDF4, but not yet in NCDatasets.jl). Currently you can have only Arrays of basic types and Arrays of Vectors (of basic types).
elements are read into memory, not the whole array: yes
it is possible to write individual elements without writing the whole array: yes
supports setindex!, getindex, push!, pop!, shift!, unshift! and maybe vcat: just setindex! and getindex
is reasonably efficient: the efficiency is reasonable for me :-)
The project of making it yourself sounds very interesting. I think it would certainly fill a gap in the current ecosystem.
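As a sketch of what rolling your own could look like (written in Python for brevity; a Julia version would follow the same seek/read/write pattern with getindex/setindex!), here is a fixed-record file that reads and writes one element at a time:

import struct

class DiskArray:
    # Minimal sketch: fixed-size Float64 records in a flat file; each
    # read/write touches a single element, never the whole array.
    FMT = "<d"                        # one little-endian Float64
    SIZE = struct.calcsize(FMT)

    def __init__(self, path, length):
        self.path, self.length = path, length
        with open(path, "wb") as f:   # pre-allocate length records
            f.write(b"\x00" * (self.SIZE * length))

    def __getitem__(self, i):         # Julia's getindex
        with open(self.path, "rb") as f:
            f.seek(i * self.SIZE)
            return struct.unpack(self.FMT, f.read(self.SIZE))[0]

    def __setitem__(self, i, value):  # Julia's setindex!
        with open(self.path, "r+b") as f:
            f.seek(i * self.SIZE)
            f.write(struct.pack(self.FMT, value))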
Some storage technologies that might be good to have a look at are:
HDF5 (for storage, cross-platform and cross-language)
JLD2 (successor of JLD) https://github.com/simonster/JLD2.jl
rasdaman (a "database" for arrays) http://www.rasdaman.org/
possibly also BSON http://bsonspec.org/
Maybe you can also reach out to the JuliaIO group.
In some dynamic programming problems, I notice that my cache table is very sparse. In other words, if I define a table as DP[i][j], i<=10^6, j<=10^2, only a fraction of the table is used and the rest is -1.
So my question is, is it common practice to use a hashmap instead to store (i, j) pairs with their DP value and access them in average O(1) time rather than storing them in the sparse table to save memory?
First of all, yes, you can use a hashmap instead of an array for dynamic programming problems. But there are some limitations as well as benefits to using a hashmap.
When you use a hashmap for this particular case (dynamic programming), it reduces memory usage, but it also increases the constant factor of your code. That means if you can perform around 10^8 operations/second with the help of an array, you will only be able to perform around 10^7 operations/second with a hashmap, due to its constant factor, even though the algorithmic complexity is the same.
So if it is possible to declare an array of the required size, use an array; otherwise use a hashmap.
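For concreteness, here is the usual pattern sketched in Python (the base case and recurrence are placeholders, not any particular problem): a dict keyed by (i, j) stores only the states that are actually reached, instead of a dense 10^6 x 10^2 table:

memo = {}  # (i, j) -> DP value; only visited states are stored

def dp(i, j):
    if (i, j) in memo:
        return memo[(i, j)]
    if i == 0 or j == 0:                      # placeholder base case
        result = 0
    else:                                     # placeholder recurrence
        result = max(dp(i - 1, j), dp(i, j - 1)) + 1
    memo[(i, j)] = result
    return result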
Yes, it is definitely common practice to use hashmaps, particularly in the case of sparsity.
It is even possible to go beyond that... For even larger problems, approximate dynamic programming draws from tools such as function approximation.
Do containers in D have value or reference semantics by default? If they have reference semantics, doesn't that fundamentally hinder the use of a functional programming style in D (compared to C++11's move semantics), such as in the following (academic) example:
auto Y = reverse(sort(X));
where X is a container.
Whether containers have value semantics or reference semantics depends entirely on the container. The only built-in containers are dynamic arrays, static arrays, and associative arrays. Static arrays have strict value semantics, because they sit on the stack. Associative arrays have strict reference semantics. And dynamic arrays mostly have reference semantics: their elements don't get copied, but the slice itself does, so they end up with semantics which are a bit particular. I'd advise reading this article on D arrays for more details.
As for containers which are official but not built-in, the containers in std.container all have reference semantics, and in general, that's how containers should be, because it's highly inefficient to do otherwise. But since anyone can implement their own containers, anyone can create containers which are value types if they want to.
However, like C++, D does not take the route of having algorithms operate on containers, so as far as algorithms go, whether containers have reference or value semantics is pretty much irrelevant. In C++, algorithms operate on iterators, so if you wanted to sort a container, you'd do something like sort(container.begin(), container.end()). In D, they operate on ranges, so you'd do sort(container[]). In neither language would you actually sort a container directly. Whether containers themselves have value or reference semantics is therefore irrelevant to your typical algorithm.
However, D does better at functional programming with algorithms than C++ does, because ranges are better suited for it. Iterators have to be passed around in pairs, which doesn't work very well for chaining functions. Ranges, on the other hand, chain quite well, and Phobos takes advantage of this. One of its primary design principles is that most of its functions operate on ranges, letting you do in code what you typically do on the Unix command line with pipes: lots of generic tools/functions generate output which you can pipe/pass to other tools/functions to operate on, so you can chain independent operations to do something specific to your needs, rather than relying on someone having written a program/function which does exactly what you want directly. Walter Bright discussed it recently in this article.
So, in D, it's easy to do something like:
// requires: import std.algorithm, std.array, std.random, std.range;
auto stuff = sort(array(take(map!"a % 1000"(rndGen()), 100)));
or if you prefer UFCS (Uniform Function Call Syntax):
auto stuff = rndGen().map!"a % 1000"().take(100).array().sort();
In either case, it generates a sorted list of 100 random numbers between 0 and 999, and the code is in a functional style, which C++ would have a much harder time doing, and libraries which operate on containers rather than iterators or ranges would have an even harder time doing.
In-Built Containers
The only in-built containers in D are slices (also called arrays/dynamic arrays) and static arrays. The latter have value semantics (unlike in C and C++) - the entire array is (shallow) copied when passed around.
As for slices, they are value types with indirection, so you could say they have both value and reference semantics.
Imagine T[] as a struct like this:
struct Slice(T)
{
    size_t length;
    T* ptr;
}
Where ptr is a pointer to the first element of the slice, and length is the number of elements within the bounds of the slice. You can access the .ptr and .length fields of a slice, but while the data structure is identical to the above, it's actually a compiler built-in and thus not defined anywhere (the name Slice is just for demonstrative purposes).
Knowing this, you can see that copying a slice (assigning to another variable, passing to a function, etc.) just copies a length (no indirection - value semantics) and a pointer (has indirection - reference semantics).
In other words, a slice is a view into an array (located anywhere in memory), and there can be multiple views into the same array.
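If it helps, Python's memoryview gives a rough analogy (an analogy only, not D's semantics): slicing it copies the view (offset and length) but not the underlying data, so mutations are visible through every view:

buf = bytearray(b"hello world")
view = memoryview(buf)[0:5]  # copies the view, not the bytes
view[0] = ord("H")           # mutate through the view...
print(buf)                   # ...bytearray(b'Hello world')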
Algorithms
sort and reverse from std.algorithm work in-place to cater to as many users as possible. If the user wanted to put the result in a GC-allocated copy of the slice and leave the original unchanged, that can easily be done (X.dup). If the user wanted to put the result in a custom-allocated buffer, that can be done too. Finally, if the user wanted to sort in-place, this is an option. At any rate, any extra overhead is made explicit.
However, it's important to note that most algorithms in the standard library don't require mutation, instead returning lazily-evaluated range results, which is characteristic of functional programming.
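The same explicit-overhead design shows up in Python, for comparison: list.sort() works in place, while sorted() pays for a copy, and the caller chooses which cost to incur:

xs = [3, 1, 2]
ys = sorted(xs)  # like sort(X.dup): copies first, xs untouched
xs.sort()        # like sort(X[]): in place, no extra allocation
print(xs, ys)    # [1, 2, 3] [1, 2, 3]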
User-Defined Containers
When it comes to user-defined containers, they can have whatever semantics they want - any configuration is possible in D.
The containers in std.container are reference types with .dup methods for making copies, thus slightly emulating slices.
Lisp programmers tend to use lists to represent all other data types.
However, I have heard that lists are not a good universal representation for data types.
What are the disadvantages of lists being used in this manner, in contrast to using records?
You mention "record". By this I take it that you're referring to fixed-element structs/objects/compound data. For instance, in HtDP syntax:
;; a packet is (make-packet destination source text) where destination is a number,
;; source is a number, and text is a string.
... and you're asking about the pros and cons of representing a packet as a list of length three, rather than as a piece of compound data (or "record").
In instances where compound data is appropriate--the values have specific roles and names, and there are a fixed number of them--compound data is generally preferable; they help you to catch errors in your programs, which is the sine qua non of programming.
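To illustrate the error-catching point, here is the packet example sketched in Python (a namedtuple standing in for the record):

from collections import namedtuple

# As a list of length three, roles are positional and unchecked:
packet = [42, 7, "hello"]
source = packet[2]            # oops -- silently grabs the text field

# As a record, fields have names and mistakes fail loudly:
Packet = namedtuple("Packet", ["destination", "source", "text"])
p = Packet(destination=42, source=7, text="hello")
print(p.source)               # clear at the call site
# p.sorce would raise AttributeError instead of returning wrong data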
The disadvantage is that it isn't universal. Sometimes this is performance-related: you want constant-time lookups (array, hash table). Sometimes this is organization-related: you want to name your data locations (hash table, record... although you could use name/value pairs in the list). It requires a little diligence on the part of the author to make the code self-documenting (more diligence than with a record). Sometimes you want the type system to catch mistakes made by putting things in the wrong spot (record, typed tuples).
However, most issues can be addressed with OptimizeLater. The list is a versatile little data structure.
You're talking about what Peter Seibel addresses in Chapter 11 of Practical Common Lisp:
[Starting] discussion of Lisp's collections
with lists . . . often leads readers to the mistaken
conclusion that lists are Lisp's only collection type. To make matters
worse, because Lisp's lists are such a flexible data structure, it is
possible to use them for many of the things arrays and hash tables are
used for in other languages. But it's a mistake to focus too much on
lists; while they're a crucial data structure for representing Lisp
code as Lisp data, in many situations other data structures are more
appropriate.
Once you're familiar with all the data types Common Lisp offers,
you'll also see that lists can be useful for prototyping data
structures that will later be replaced with something more efficient
once it becomes clear how exactly the data is to be used.
Some reasons I see are:
A large hashtable, for example, has faster access than the equivalent alist
A vector of a single datatype is more compact and, again, faster to access
Vectors are more efficiently and easily accessed by index
Objects and structures allow you to access data by name, not position
It boils down to using the right datatype for the task at hand. When it's not obvious, you have two options: guess and fix it later, or figure it out now; either of those is sometimes the right approach.
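To make the first bullet above concrete, a quick Python sketch (standing in for Lisp; the alist is just a list of pairs) shows the access-time gap:

import timeit

alist = [(i, i * i) for i in range(10_000)]  # association list: linear scan
table = dict(alist)                          # hash table: O(1) on average

def alist_lookup(k):
    return next(v for key, v in alist if key == k)

print(timeit.timeit(lambda: alist_lookup(9_999), number=100))  # scans everything
print(timeit.timeit(lambda: table[9_999], number=100))         # hashes once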