Count the frequency of bytes in a purely functional language - functional-programming

If we had an assignment:
Given a block of binary data, count the frequency of the bytes within it.
And you were supposed to do this in C, the answer would be trivial and reasonably fast even for larger binary blocks. How would one go about implementing this in a purely functional language, without side effects?
For example, if you wrote a function that accepted frequency counts for each byte and the rest of the list of bytes, and returned modified frequency counts, it would have to do an awful lot of work for a data set of 100M bytes.
Also, if you sorted the data and then somehow counted the runs of consecutive same-valued bytes, the sort itself would take a lot of time.
Is there a reasonable way to implement this?

The straightforward way to do it is indeed to pass in and return data structures mapping bytes to counts. This would probably be implemented as some kind of tree (since that's what you get out of the standard library containers, as far as I know). In pure functional programming when you're passed in a tree and you need to return a new tree with a difference in only one node, the returned tree ends up sharing almost all of its structure and data with the original tree.
There is some overhead in traversing the tree to get to the count, but since you're counting bytes the tree never holds more than 256 elements, so the overhead is about log2(256) = 8 steps per update, which is a constant. It doesn't get larger for large data sets, so it doesn't change the big-O complexity of the algorithm. That's true even if you accept the greatest possible overhead and copy a full 256-entry array of counts on every update, with no sharing.
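As a rough illustration of the "pass counts in, return new counts" approach, here is a minimal Python sketch. Python is used only for illustration, and the names `bump` and `byte_frequencies` are made up for this example. It deliberately takes the most expensive variant, copying a full 256-entry tuple on every step, to show that the cost per byte is still bounded by a constant.

```python
from functools import reduce

def bump(counts, byte):
    """Return a NEW 256-entry tuple with the count for `byte` increased by one.

    The input tuple is never modified, so this is a pure function.
    """
    return counts[:byte] + (counts[byte] + 1,) + counts[byte + 1:]

def byte_frequencies(data):
    """Fold `bump` over the bytes, starting from all-zero counts."""
    return reduce(bump, data, (0,) * 256)
```

For example, `byte_frequencies(b"abca")[ord("a")]` is 2. Each step copies 256 entries no matter how long `data` is, so the total work grows linearly with the input.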
If you want to optimise this, you can take advantage of the fact that the "intermediate" frequency counts are never needed except as part of the computation of the next set of counts. That means you can use various techniques for getting the implementation to use destructive updates even while you're still semantically writing functional code. An STRef in Haskell basically lets you do this manually.
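Python has no STRef, but the idea of keeping destructive updates hidden behind a pure interface can be sketched loosely like this. The function name is hypothetical, and this is only an analogy to what the ST monad guarantees in Haskell: nothing here is enforced by the language.

```python
def byte_frequencies_local_mutation(data):
    """Count bytes using a local mutable list that never escapes.

    From the caller's point of view this is still a pure function:
    same input, same output, and no externally visible side effects.
    """
    counts = [0] * 256          # mutable scratch space, local to this call
    for byte in data:
        counts[byte] += 1       # in-place update, invisible to the caller
    return tuple(counts)        # hand back an immutable result
```

The intermediate counts are overwritten in place precisely because nobody else can observe them, which is the property described above.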
Theoretically the compiler could notice that you're replacing a never-needed-again value with a new one, so it could do the update in place for you. I don't know whether or not any actual production ready compilers are currently able to make this optimisation.

Related

Allocating datastore long ID's - but segmented so different Kinds have different ranges

My program has 3 kinds that are closely related and I want to be able to store and manipulate their long id's interchangeably, e.g. I might have an array of long id's that can be for any of the 3 Kind's.
Using the allocateIds API I can allocate the ID's for the 3 kinds in the same namespace, but I also sometimes need to be able to tell which Kind one of these id's referred to (e.g. in order to do a datastore operation on the right Kind).
I understand that the 'normal' way to do this is to store the whole Key type, rather than just the long id, but there will be a huge number of these - it will be more efficient if I can just use 'long' values rather than Key values.
So, I'd like to be able to segment the ID ranges, so I can call a simple function with an ID and it will tell me which of the 3 Kind's the ID is for.
(I'm using Java, but I don't think that matters.)
Allocate my own ID's
I guess the most straight-forward way to do this is to simply allocate my own ID's. I believe that, in order to allocate sequential ID's, I would need to do an extra datastore write for every allocation (to track the allocations), or get into some complicated system of pre-allocating ranges of ID's to each live instance. This sounds like a bad idea.
So I could generate random 54 bit ID's - reserving 2 bits to use as flags to indicate the type. But it is my understanding that random or hash allocation dramatically reduces the number of allocations that can be made safely. The Internet tells me that the chance of a collision is approximately k^2 / 2N, where k is the number of allocations and N is the size of the allocation space. So, if I'm willing to accept a 0.1% chance of collision, then k = sqrt(2 * 2^54 / 1000), which is about 6 million. Since I really have no idea how many entities I will need to store, this is unacceptable.
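The birthday-bound estimate quoted above can be checked numerically. This is a small sketch, assuming the approximation p ≈ k^2 / 2N holds (it is only reasonable for small p); the function name is made up for illustration.

```python
import math

def max_safe_allocations(id_bits, collision_probability):
    """Solve p = k^2 / (2N) for k, where N = 2**id_bits.

    Returns the approximate number of random IDs that can be drawn
    before the chance of any collision reaches `collision_probability`.
    """
    space = 2 ** id_bits
    return math.sqrt(2 * space * collision_probability)
```

With 54 random bits and a 0.1% tolerance, `max_safe_allocations(54, 0.001)` comes out at roughly 6 million allocations.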
Reserve some bits in the Long ID to indicate the Kind
Another solution would be to use 2 bits of the long value as flags to indicate the type. The easiest way to do this would be to take advantage of the fact that the allocator now only uses the low 56 bits of a long. So I could use the high bits as flags to indicate the Kind. The problem with that solution is that I lose the ability to manipulate these numbers in javascript - the reason for the 56 bit limit in the first place.
An alternative to this - to maintain the option of manipulating these numbers in js - is to use allocateIdRange and pre-allocate (and throw away) the ID ranges corresponding to bits 54 and 55. Actually, I could use any bits, but specifying the ID ranges is much easier if I use the high bits.
But I know little about how the datastore and the allocator actually work, so I don't know if this 'pre-allocate and discard' technique is a good idea.
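Independently of how the IDs get allocated, the "reserve two bits for the Kind" idea can be sketched as plain bit arithmetic. Everything below is an assumption made for illustration: the constant names, the choice of bits 54 and 55, and the helper functions are hypothetical, and nothing here calls the datastore API.

```python
KIND_SHIFT = 54                    # assumption: bits 54 and 55 carry the Kind
KIND_MASK = 0b11                   # two bits, so Kind values 0..3
ID_MASK = (1 << KIND_SHIFT) - 1    # the low 54 bits hold the plain ID

def tag_id(kind, raw_id):
    """Combine a Kind number and a plain ID into one integer."""
    if not 0 <= kind <= KIND_MASK:
        raise ValueError("kind must fit in two bits")
    if not 0 <= raw_id <= ID_MASK:
        raise ValueError("raw_id must fit in 54 bits")
    return (kind << KIND_SHIFT) | raw_id

def kind_of(tagged_id):
    """Recover the Kind number from a tagged ID."""
    return (tagged_id >> KIND_SHIFT) & KIND_MASK

def raw_id_of(tagged_id):
    """Recover the plain ID from a tagged ID."""
    return tagged_id & ID_MASK
```

In this sketch Kind 0 leaves the ID unchanged, so `tag_id(0, 5)` is simply 5.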

Transfer files using checksums only?

Would it be possible to transfer large files using only a system of checksums, and then reconstruct the original file by calculations?
Say that you transfer the MD5 checksum of a file and the size of the file. By making a "virtual file" and calculating its checksum, trying every single bit combination, you should eventually "reach" the original file. But on the way you would also get a lot of "collisions" where the checksum also matches.
So we change the first byte of the original file to some specified value, calculate the checksum again, and send this too. If we make the same substitution in the virtual file we can test each "collision" to see if it still matches. This should narrow it down a bit, and we can do this several times.
Of course, the computing power to do this would be enormous. But is it theoretically possible, and how many checksums would you need to transfer something (say 1 MB)? Or would perhaps the amount of data needed to transfer the checksums be almost as large as the file, making it pointless?
The amount of data you need to transfer would most certainly be the same size as the file. Consider: if you could communicate an n-byte file with n-1 bytes of data, that means you've got 256^(n-1) possible patterns of data you may have sent, but are selecting from a space of size 256^n. This means that at most one out of every 256 files can be expressed using this method - this is often referred to as the pigeonhole principle.
Now, even if that wasn't a problem, there's no guarantee that you won't have a collision after any given amount of checksumming. Checksum algorithms are designed to avoid collisions, but for most checksum/hash algorithms there's no strong proof that after X hashes you can guarantee no collisions in an N-byte space.
Finally, hash algorithms, at least, are designed to be hard to reverse, so even if it were possible it would take an impossibly huge amount of CPU power to do so.
That said, for a similar approach, you might be interested in reading about Forward Error Correction codes - they're not at all hash algorithms, but I think you may find them interesting.
What you have here is a problem of information. A checksum is not necessarily unique to a particular set of data; in fact, to be so it would effectively need to have as many bits of information as the source. What it can indicate is that the data received is not the exact data that the checksum was generated from; what it cannot do, in most cases, is prove that the data is the same.
In short "no".
To take a hypothetical example, consider a 24 bpp photo with 6 pixels: there are 2^(24 * 6) = 2^144 possible combinations of intensities across the colour channels of those six pixels, so if you were to evaluate every possibility you are guaranteed MD5 collisions (as an MD5 digest is only a 128-bit number).
Short answer: Not in any meaningful form.
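The same counting argument can be tried at a scale small enough to enumerate. The sketch below is a scaled-down stand-in, not the MD5 scenario itself: it assumes a toy 1-byte "checksum" (the first byte of a SHA-256 digest) over all 2-byte inputs, and the function names are made up for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import product

def tiny_checksum(data):
    """Toy 8-bit checksum: the first byte of the SHA-256 digest."""
    return hashlib.sha256(data).digest()[0]

def group_by_checksum():
    """Group every possible 2-byte input by its toy checksum."""
    groups = defaultdict(list)
    for first, second in product(range(256), repeat=2):
        data = bytes((first, second))
        groups[tiny_checksum(data)].append(data)
    return groups
```

There are 65536 inputs but at most 256 checksum values, so some checksum value must be shared by at least 256 different inputs. Brute-forcing the checksum alone can never single out the original.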
Long answer:
Let us assume an arbitrary file file.bin with a 1000-byte size. There are 2^(8*1000) different combinations that could be its actual contents. By sending e.g. a 1000-bit checksum, you still have about 2^(7*1000) colliding alternatives.
By sending a single additional bit, you might be able to cut those down by half... and you still have 2^6999 collisions. By the time you eliminate the collisions, you will have sent at least 8000 bits, i.e. an amount equal to or greater than the file size.
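The arithmetic above can be written down directly. This is a sketch under the most optimistic assumption, namely that every bit sent halves the set of candidate files exactly; the function name is hypothetical.

```python
def remaining_candidates(file_bits, bits_sent):
    """Best-case number of files still consistent with what was sent.

    Assumes each transmitted bit rules out exactly half of the
    remaining candidates, which is the most any single bit can do.
    """
    return 2 ** max(file_bits - bits_sent, 0)
```

For the 1000-byte file, `remaining_candidates(8000, 1000)` equals 2^7000, and the count only reaches 1 once `bits_sent` is 8000, the size of the file itself.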
The only way for this to be theoretically possible (Note: I did not say "feasible", let alone "practical") would be if the file did not really contain random data and you could use that knowledge to prune alternatives. In that case you'd be better off using compression, anyway. Content-aware compression algorithms (e.g. FLAC for audio) use a-priori knowledge of the properties of the input data to improve the compression ratio.
I think what you are thinking of is in fact an interesting topic, but you haven't hit upon the right method. If I can try and rephrase your question, you are asking if there is a way to apply a function to some data, transmit the result of the function, and then reconstruct the original data from the terser function result. For a single MD5 checksum the answer is no, but with other functions, provided you are willing to send several function results, it is possible. In general this area of research is called compressed sensing. Sometimes exact reconstruction is possible, but more often it is used as a lossy compression scheme for images and other visual or sound data.

Hash tables v self-balancing search trees

I am curious to know what reasoning could tip the balance towards storing items in a self-balancing tree rather than in a hash table.
I see that hash tables cannot maintain the insertion-order, but I could always use a linked list on top to store the insertion-order sequence.
I see that for a small number of values, there is an added cost of the hash-function, but I could always save the hash value together with the key for faster lookups.
I understand that hash tables are more difficult to implement than the straight-forward implementation of a red-black tree, but in a practical implementation wouldn't one be willing to go the extra mile for the trouble?
I see that with hash tables it is normal for collisions to occur, but with open-addressing techniques like double hashing, which allow the keys to be stored in the hash table itself, hasn't the problem been reduced to the point where it no longer tips the balance towards red-black trees for such implementations?
I am curious if I am strictly missing a disadvantage of hash table that still makes red black trees quite viable data structure in practical applications (like filesystems, etc.).
Here is what I can think of:
There are kinds of data which cannot be hashed (or are too expensive to hash), and therefore cannot be stored in hash tables.
Trees keep data in the order you need (sorted), not insertion order. You can't (effectively) do that with a hash table, even if you run a linked list through it.
Trees have better worst-case performance
Storage allocation is another consideration. Every time you fill all of the buckets in a hash-table, you need to allocate new storage and re-hash everything. This can be avoided if you know the size of the data ahead of time. On the other hand, balanced trees don't suffer from this issue at all.
Just wanted to add :
Balanced binary trees have a predictable lookup time [log n] that is independent of the type of data. That is often important when you need to estimate the response times of your application [hash tables may have unpredictable response times]. Remember that for smaller n, as in most common use cases, the difference in performance of an in-memory lookup is hardly going to matter; the bottleneck of the system is going to be elsewhere, and sometimes you just want to make the system much simpler to debug and analyze.
Trees are generally more memory efficient compared to hash tables and much simpler to implement without any analysis on the distribution of input keys and possible collisions etc.
In my humble opinion, self-balancing trees work pretty well as Academic topics. And I do not know anything that can be qualified as a "straight-forward implementation of a red-black tree".
In the real world, the memory wall makes them far less efficient than they are on paper. With this in mind, hash tables are decent alternatives, especially if you don't practice them the Academic style (forget about the table size constraint and you magically resolve the table resize issue and almost all collision issues).
In a word: keep it simple. If that's simple for you then that's simple for your computer.
I think if you want to query for a range of keys instead of one key, self balanced tree structure will perform better than a hash table structure.
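To make the range-query point concrete, here is a small sketch. Python's standard library has no balanced tree, so a sorted list searched with `bisect` stands in for the ordered structure (an assumption of this example, as are the function names); a plain `dict` plays the hash table.

```python
from bisect import bisect_left, bisect_right

def range_query_sorted(sorted_keys, low, high):
    """Keys in [low, high] from an ordered collection.

    Two binary searches find the boundaries; only matching keys are touched.
    """
    start = bisect_left(sorted_keys, low)
    end = bisect_right(sorted_keys, high)
    return sorted_keys[start:end]

def range_query_hashed(table, low, high):
    """Keys in [low, high] from a hash table.

    Hashing destroys ordering, so every key has to be inspected.
    """
    return sorted(key for key in table if low <= key <= high)
```

Both return the same keys, but the ordered version does O(log N) work plus the size of the result, while the hashed version always scans all N keys.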
A few reasons I can think of:
Trees are dynamic (the space complexity is N), whereas hash tables are often implemented as arrays which are fixed size, which means they will often be initialized with K size, where K > N, so even if you only have 1 element in a hashmap, you might still have 100 empty slots that take up memory. Another effect of this is:
Increasing the size of an array-based hash table is costly (O(N) average time, O(N log N) worst case), whereas trees can grow in constant time (O(1)) plus the time to locate the insertion point (O(log N)).
Elements in a tree can be gathered in sorted order (for example, with an in-order traversal). Thereby you often get a sorted list as a free perk with trees.
Trees can have a better worst-case performance vs a hashmap depending on how the hashmap is implemented (ex: hashmap with chaining will have O(N) worst case, whereas self-balanced trees can guarantee O(log N) worst case for all operations).
Both self-balanced trees and hashmaps can achieve a worst-case efficiency of O(log N) at best (assuming that the hashmap handles collisions well), but hashmaps can have a better average-case performance (often close to O(1)), whereas trees stay at O(log N). This is because, even though a hashmap can locate the insertion index in O(1), it has to account for hash collisions (more than one element hashing to the same array index), and so in the best case it degrades to a self-balanced tree (as in the Java implementation of HashMap); that is, each cell in the hashmap can be implemented as a self-balanced tree storing all elements that hashed to that cell.

A couple of questions about Hash Tables

I've been reading a lot about Hash Tables and how to implement one in C, and I think I have almost all the concepts in my head so I can start to code my own. I just have a couple of questions that I have yet to properly understand.
As a reference, I've been reading this:
http://eternallyconfuzzled.com/jsw_home.aspx
1) As I've read on the site above, a power of two or a prime number is recommended for the Hash Table size. This is basically an array, and an array has a fixed size so I can quickly look up the value I'm looking for. I can't declare a small array if I have a large input as it won't fit, and I can't declare a very large array if my input data is not that large because that's wasted memory.
What is the optimum size for the Hash Table? What should I base my decision on?
2) Also, on that site, there are a couple of hashing functions, not all of which I have read yet. It also states that it's always best to use a good known algorithm rather than to roll my own. And I might do just that: I'll pick one from that site, test it out in my code, and see if it minimizes collisions based on my input data.
What's bugging me is how I control the hash range. The hash can't return an integer larger than the Hash Table size or we'll have a serious problem. How do I deal with this?
1) What you are referring to is the load factor of the hash table - the percentage of buckets that are expected to be filled. Wikipedia has this to say:
"With a good hash function, the average lookup cost is nearly constant as the load factor increases from 0 up to 0.7 or so. Beyond that point, the probability of collisions and the cost of handling them increases."
I believe the Java implementation (and probably others) resizes periodically to keep the load factor within an acceptable range.
2) Just use the modulo operator (%) to keep the bucket index legal. The second operand should be the size of your bucket array.
Pick a small size for your hash table. As you add stuff to your table, check to see what percentage of the table is being used; when it is greater than 70% full, make the table bigger. This also holds true as you remove elements-- make the table smaller when it is less than 60% full, for instance. Wikipedia has a good description of some strategies for dynamic resizing, but that's the general idea.
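The two pieces of advice above (reduce the hash with modulo, and grow the table once it passes about 70% full) can be sketched together. This is a minimal illustration in Python rather than C, using separate chaining; the class name, the starting size of 8, and the doubling policy are all assumptions of this example, and shrinking on removal is left out to keep it short.

```python
class ChainedHashTable:
    """Tiny hash table: modulo indexing plus load-factor-driven growth."""

    def __init__(self, initial_buckets=8, max_load=0.7):
        self._buckets = [[] for _ in range(initial_buckets)]
        self._count = 0
        self._max_load = max_load

    def _index(self, key, n_buckets):
        # Modulo keeps the index inside the bucket array, whatever the hash is.
        return hash(key) % n_buckets

    def put(self, key, value):
        bucket = self._buckets[self._index(key, len(self._buckets))]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)    # overwrite an existing entry
                return
        bucket.append((key, value))
        self._count += 1
        # Grow once the load factor (entries / buckets) passes the limit.
        if self._count / len(self._buckets) > self._max_load:
            self._resize(2 * len(self._buckets))

    def get(self, key, default=None):
        bucket = self._buckets[self._index(key, len(self._buckets))]
        for existing_key, value in bucket:
            if existing_key == key:
                return value
        return default

    def _resize(self, new_size):
        # Every entry is re-indexed, because the modulus has changed.
        old_buckets = self._buckets
        self._buckets = [[] for _ in range(new_size)]
        for bucket in old_buckets:
            for key, value in bucket:
                self._buckets[self._index(key, new_size)].append((key, value))

    def bucket_count(self):
        return len(self._buckets)

    def __len__(self):
        return self._count
```

In C the same modulo step works as long as the hash value is an unsigned type; with a signed value, `%` can produce a negative index.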
I only say this because you seem to have known input data:
If you know the rough order of magnitude of the amount of data you will be storing in the hash table, it's generally good enough to just create a table about that big. (You shouldn't worry about whether everything will fit. Instead, the right thing to think about is how many collisions you will have and how you will handle them.)
As for the right hash function, it's possible that the structure of your input will suggest which one will be correct. For instance, what aspects of your input are likely to be evenly distributed?

Which is faster to find an item in a hashtable or in a sorted list?

Which is faster to find an item in a hashtable or in a sorted list?
Algorithm complexity is a good thing to know, and hashtables are known to be O(1) while a sorted vector (in your case I guess it is better to use a sorted array than a list) will provide O(log n) access time.
But you should know that complexity notation describes how access time grows as N goes to infinity. That means that if you know that your data will keep growing, complexity notation gives you some hint about which algorithm to choose.
When you know that your data will stay rather small, for instance only a few entries in your array/hashtable, you must take out your watch and measure. So run a test.
For instance, in another problem, sorting an array: for a few entries, bubble sort, although O(N^2), may be quicker than quicksort, although that is O(n log n).
Also, according to other answers, and depending on your items, you must try to find the best hash function for your hashtable instance. Otherwise lookups in your hashtable may perform dramatically badly (as pointed out in Hank Gay's answer).
Edit: Have a look at this article to understand the meaning of Big O notation.
Assuming that by 'sorted list' you mean 'random-accessible, sorted collection'. A list has the property that you can only traverse it element by element, which will result in an O(N) complexity.
The fastest way to find an element in a sorted indexable collection is by N-ary search, O(log N), while a hashtable without collisions has a find complexity of O(1).
Unless the hashing algorithm is extremely slow (and/or bad), the hashtable will be faster.
UPDATE: As commenters have pointed out, you could also be getting degraded performance from too many collisions not because your hash algorithm is bad but simply because the hashtable isn't big enough. Most library implementations (at least in high-level languages) will automatically grow your hashtable behind the scenes—which will cause slower-than-expected performance on the insert that triggers the growth—but if you're rolling your own, it's definitely something to consider.
The get operation in a SortedList is O(log n) while the same operation on a HashTable is O(1). So, normally, the HashTable would be much faster. But this depends on a number of factors:
The size of the list
Performance of the hashing algorithm
Number of collisions / quality of the hashing algorithm
It depends entirely on the amount of data you have stored.
Assuming you have enough memory to throw at it (so the hash table is big enough), the hash table will locate the target data in a fixed amount of time, but the need to calculate the hash will add some (also fixed) overhead.
Searching a sorted list won't have that hashing overhead, but the time required to do the work of actually locating the target data will increase as the list grows.
So, a sorted list will generally be faster for small data sets. (For extremely small data sets which are frequently changed and/or infrequently searched, an unsorted list may be even faster, since it avoids the overhead of doing the sort.) As the data set becomes large, the growth of the list's search time overshadows the fixed overhead of hashing, and the hash table becomes faster.
Where that breakpoint is will vary depending on your specific hash table and sorted-list-search implementations. Run tests and benchmark performance on a number of typically-sized data sets to see which will actually perform better in your particular case. (Or, if the code already runs "fast enough", don't. Just use whichever you're more comfortable with and don't worry about optimizing something which doesn't need to be optimized.)
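A benchmark of this kind might be sketched as follows. It is written in Python for illustration, with a sorted `list` searched by `bisect` and a `set` standing in for the hash table; the function names and the choice of test data are assumptions, and the timings it returns will differ between machines and runtimes.

```python
import timeit
from bisect import bisect_left

def find_sorted(sorted_items, target):
    """Binary search in a sorted, indexable collection: O(log n)."""
    i = bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target

def find_hashed(item_set, target):
    """Hash lookup: O(1) on average, plus the cost of hashing."""
    return target in item_set

def compare(n, repeats=100_000):
    """Time both lookups on n items; returns (sorted_seconds, hashed_seconds)."""
    items = list(range(0, 2 * n, 2))    # n sorted even numbers
    item_set = set(items)
    target = items[n // 2]              # a value that is present
    t_sorted = timeit.timeit(lambda: find_sorted(items, target), number=repeats)
    t_hashed = timeit.timeit(lambda: find_hashed(item_set, target), number=repeats)
    return t_sorted, t_hashed
```

Calling `compare` for several values of `n` that match your real data sizes is one way to see where, or whether, the crossover happens in your environment.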
In some cases, it depends on the size of the collection (and to a lesser degree, implementation details). If your list is very small, 5-10 items maybe, I'd guess the list would be faster. Otherwise xtofl has it right.
HashTable would be more efficient for list containing more than 10 items. If the list has fewer than 10 items, the overhead due to hashing algo will be more.
In case you need a fast dictionary but also need to keep the items in an ordered fashion use the OrderedDictionary. (.Net 2.0 onwards)
