BerkeleyDB - implications of incorrect sorting order? - berkeley-db

According to this FAQ, page fill factor can be adversely affected by not specifying a sorting function for binary data on little-endian systems. I understand that it will also result in cursors not returning data in the "correct" sorted order.
Other than excessive page usage, would this cause any other performance issues? For example, does a poor page fill factor adversely affect the speed of key lookups?
Furthermore, if I have data already stored in a BTREE without a sorting function, will anything break if I subsequently start using a sorting function to add new records? i.e. would a mismatch between the originally used sort order and a new sort function break key lookups?

Yes, incorrect endianness can reduce your fill factor, and as a result your database will be bigger and slower to access. Today I was inserting about 30 million records with a sequential integer key and noticed quite a poor btree fill factor (60%). I then changed the endianness of the key (using the htonl() function) and the fill factor jumped to 99%. At the same time the database size was reduced from 1.3 GB to 700 MB.
Endianness is important when your key is sequential or shows some locality (common prefix for related data). For some keys changing the endianness could worsen the performance (I experienced this with mobile phone numbers).
BTW you don't have to provide a sorting function - you can just convert the keys to correct endianness when inserting and searching by key.
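To illustrate why converting the key works, here is a minimal Python sketch (not Berkeley DB code). It assumes unsigned 32-bit keys and a default byte-wise (lexicographic) comparison, and the helper names be_key and le_key are made up for the example. Big-endian keys sort in numeric order under that comparison, little-endian keys do not.

```python
import struct

def be_key(n: int) -> bytes:
    """Encode an unsigned 32-bit integer big-endian (what htonl() produces)."""
    return struct.pack(">I", n)

def le_key(n: int) -> bytes:
    """Encode the same integer little-endian, as a little-endian machine stores it."""
    return struct.pack("<I", n)

ids = [1, 2, 255, 256, 257, 65536]

# Sort the encoded keys byte by byte, then decode them again to see the resulting order.
be_sorted = [struct.unpack(">I", k)[0] for k in sorted(be_key(n) for n in ids)]
le_sorted = [struct.unpack("<I", k)[0] for k in sorted(le_key(n) for n in ids)]

print(be_sorted)  # numeric order is preserved
print(le_sorted)  # sequential keys end up scattered across the key space
```

With sequential keys the big-endian form always appends at the right-hand edge of the tree, which is the case where the fill factor benefits.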

Related

Is a fixed partition key bad practice?

I have a DynamoDB table with:
Timestamp (HASH)
Text (String)
I want to be able to get the latest item via a query, but doing so requires that I sort by Timestamp rather than partition by it. I was considering doing this instead:
Partition (HASH, hard-coded as whatever)
Timestamp (RANGE)
Text (String)
That way I can query and pass a hard-coded partition in.
But is this bad practice?
It depends.
The main thing to consider is that partitions have a finite throughput for both reads and writes. This is independent from the provisioned throughput for the table. Partition throughput is constrained by the hard disk's read and write speeds. Remember that all items with the same hash value will live on the same partition and therefore will be written to the same disk (discounting replication).
So, it depends on your scale. It will work for a small scale, low throughput use case but it won't be able to scale beyond a single disk.
It is usually bad practice to use a single, hard coded value for your hash key. Rather than a hard coded hash key value, you should consider using year_month_day (or some variation) as your hash key for this use case. It's still not great, but it's much better than a single value.
If you do want to use hard coded hash key values, consider using multiple hard coded values to shard your data across partitions.
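A rough Python sketch of the two key schemes mentioned above, with no DynamoDB calls involved. The shard count, the helper names and the use of crc32 are assumptions made for illustration only. Note that with more than one partition value, "get the latest item" becomes one query per partition followed by a merge.

```python
import zlib
from datetime import datetime, timezone

NUM_SHARDS = 4  # hypothetical shard count, chosen only for this example

def day_partition(ts: datetime) -> str:
    """Partition key in year_month_day form, e.g. '2024_01_31'."""
    return ts.strftime("%Y_%m_%d")

def shard_partition(item_id: str, num_shards: int = NUM_SHARDS) -> str:
    """Spread items over a fixed set of hard-coded partition key values."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return f"shard_{zlib.crc32(item_id.encode('utf-8')) % num_shards}"

def latest(per_partition_latest):
    """Merge step: pick the newest item among the newest item of each partition."""
    return max(per_partition_latest, key=lambda item: item["Timestamp"])

print(day_partition(datetime(2024, 1, 31, tzinfo=timezone.utc)))
print(shard_partition("item-1"))
```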

Load factor of hash tables with tombstones

So the question came up about whether tombstones should be included when calculating the load factor of a hash table.
I thought that, given that the load factor is used to determine when to expand capacity, tombstones should not be included. An obvious example is if you almost fill and then remove every value in a hash table. Here insertions are super easy (no collisions) so I believe the load factor shouldn't include them.
But you could look at this and think that with all the tombstones lookups will be slow (potentially searching almost the entire space).
So I thought I'd ask the question. Should the load factor of a hashtable include tombstones in the calculation?
Load factor is not an essential part of the hash table data structure -- it is a way to define rules of behaviour for a dynamic system (a growing/shrinking hash table is a dynamic system).
Moreover, in my opinion, in 95% of modern hash table cases this approach is oversimplified and the dynamic system behaves suboptimally. Its advantages:
Well, simplicity of understanding and implementation.
The hash table data structure doesn't have to store several threshold numbers, most likely only one. This matters when the hash table is very small and the size of the header affects the total memory efficiency of the data structure (in bytes per stored entry).
In a certain (and common) case, the append/update-only hash table, more complex models of behaviour degenerate to the "just load factor" model; in other words, the load factor model defines relatively optimal behaviour.
See also my answer on load factor model. I prefer [min load, target load, max load] + growth factor frame model.
If you are developing a general-purpose hash table with tombstones, I think you can just pick up my results (below). I spent maybe several weeks solely developing this model. Maybe you can make some improvements or do further research; I would be glad.
Two main hash table dynamic behaviour patterns are targeted:
a growing hash table (maybe in a growing phase), with little or no removals, e.g. the initial fill of a hash table when the proper capacity was not specified (or is unknown)
a hash table that remains the same or nearly the same size, where the number of removals is equal or nearly equal to the number of insertions, e.g. caches with an upper size bound, LRUs, tables with entry expiry
Two thresholds are defined:
max size (i. e. number of alive entries), table size * max load
min number of free (i. e. empty, without alive entry nor tombstone) slots, computed by magic formula.
If the hash table size exceeds max size, we assume we are in the "growing pattern" and rehash to a table size able to store current size * growth factor entries, i. e. we choose the table size closest possible to current size * growth factor / target load.
If the number of free slots falls below the min number of free slots, we are in the "cache pattern" and rehash "to the current size", i. e. to the table size closest possible to current size / target load.
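The two rules above can be sketched roughly as follows. This is only an illustration: the numeric values are assumptions, and min_free_fraction is a simple stand-in for the "magic formula", which is not given here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    max_load: float = 0.75          # assumed value
    target_load: float = 0.5        # assumed value
    growth_factor: float = 2.0      # assumed value
    min_free_fraction: float = 0.1  # placeholder, NOT the formula from the answer

def rehash_capacity(capacity, live, tombstones, t=Thresholds()):
    """Return the table size to rehash to, or None if no rehash is needed."""
    free = capacity - live - tombstones  # slots holding neither an entry nor a tombstone
    if live > capacity * t.max_load:
        # "Growing pattern": make room for live * growth_factor entries.
        return round(live * t.growth_factor / t.target_load)
    if free < capacity * t.min_free_fraction:
        # "Cache pattern": tombstones have eaten the free slots, so rehash
        # to a size that fits the current live entries and drops the tombstones.
        return round(live / t.target_load)
    return None

print(rehash_capacity(100, 80, 0))   # too many live entries
print(rehash_capacity(100, 40, 55))  # too few free slots because of tombstones
print(rehash_capacity(100, 40, 10))  # nothing to do
```

Tombstones never count towards the size here; they only matter through the number of free slots they leave.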
Read the source where all the above logic is coded.
Also, article Tombstones purge from hashtable: theory and practice sheds some light.
If you are developing a special-purpose hash table whose dynamic properties are known (or can be studied), I recommend developing your own model that fits your case. Don't rely on pure math and CS theory; evaluate your model in benchmarks.

Allocating datastore long ID's - but segmented so different Kinds have different ranges

My program has 3 kinds that are closely related and I want to be able to store and manipulate their long id's interchangeably, e.g. I might have an array of long id's that can be for any of the 3 Kind's.
Using the allocateIds API I can allocate the ID's for the 3 kinds in the same namespace, but I also sometimes need to be able to tell which Kind one of these id's referred to (e.g. in order to do a datastore operation on the right Kind).
I understand that the 'normal' way to do this is to store the whole Key type, rather than just the long id, but there will be a huge number of these - it will be more efficient if I can just use 'long' values rather than Key values.
So, I'd like to be able to segment the ID ranges, so I can call a simple function with an ID and it will tell me which of the 3 Kind's the ID is for.
(I'm using Java, but I don't think that matters.)
Allocate my own ID's
I guess the most straight-forward way to do this is to simply allocate my own ID's. I believe that, in order to allocate sequential ID's, I would need to do an extra datastore write for every allocation (to track the allocations), or get into some complicated system of pre-allocating ranges of ID's to each live instance. This sounds like a bad idea.
So I could generate random 54 bit ID's - reserving 2 bits to use as flags to indicate the type. But it is my understanding that random or hash allocation dramatically reduces the number of allocations that can be made safely. The Internet tells me that the chance of a collision is approximately k^2 / 2N, where k is the number of allocations and N is the size of the allocation space. So, if I'm willing to accept a 0.1% chance of collision then k=sqrt(2*2^54/1000) = ~6.0 million. Since I really have no idea how many entities I will need to store, this is unacceptable.
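The arithmetic can be checked with a few lines of Python. This just solves the approximation p = k^2 / 2N for k; the function name is made up for the example. Evaluated as written, the formula gives about 6.0 million allocations at a 0.1% collision chance, and about 1.9 million at 0.01%.

```python
import math

def max_allocations(id_bits: int, collision_probability: float) -> float:
    """Solve p ~= k^2 / (2N) for k, where N = 2**id_bits."""
    n = 2 ** id_bits
    return math.sqrt(2 * n * collision_probability)

print(f"{max_allocations(54, 0.001):,.0f}")   # 0.1% collision chance
print(f"{max_allocations(54, 0.0001):,.0f}")  # 0.01% collision chance
```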
Reserve some bits in the Long ID to indicate the Kind
Another solution would be to use 2 bits of the long value as flags to indicate the type. The easiest way to do this would be to take advantage of the fact that the allocator now only uses the low 56 bits of a long. So I could use the high bits as flags to indicate the Kind. The problem with that solution is that I lose the ability to manipulate these numbers in javascript - the reason for the 56 bit limit in the first place.
An alternative to this - to maintain the option of manipulating these numbers in js - is to use allocateIdRange and pre-allocate (and throw away) the ID ranges corresponding to bits 54 and 55. Actually, I could use any bits, but specifying the ID ranges is much easier if I use the high bits.
But I know little of how the datastore and how the allocator actually work, so I don't know if this 'pre-allocate and discard' technique is a good idea.
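For reference, the bit layout being described could look like the sketch below. It is written in Python for brevity and does not touch the datastore API. The Kind names are hypothetical, and it assumes the allocated ID itself fits in the low 54 bits, with bits 54 and 55 carrying the Kind.

```python
KIND_SHIFT = 54                      # bits 54 and 55 carry the Kind
KIND_MASK = 0b11
ID_MASK = (1 << KIND_SHIFT) - 1      # low 54 bits hold the allocated ID
KINDS = ("KindA", "KindB", "KindC")  # hypothetical Kind names

def tag_id(raw_id: int, kind_index: int) -> int:
    """Combine an allocated ID with a 2-bit Kind tag."""
    if not 0 <= raw_id <= ID_MASK:
        raise ValueError("raw_id does not fit in 54 bits")
    if not 0 <= kind_index < len(KINDS):
        raise ValueError("unknown kind index")
    return (kind_index << KIND_SHIFT) | raw_id

def kind_of(tagged_id: int) -> str:
    """Recover the Kind name from a tagged ID."""
    index = (tagged_id >> KIND_SHIFT) & KIND_MASK
    if index >= len(KINDS):
        raise ValueError("tag value is not assigned to any Kind")
    return KINDS[index]

def raw_id_of(tagged_id: int) -> int:
    """Recover the original allocated ID."""
    return tagged_id & ID_MASK

tagged = tag_id(12345, 2)
print(kind_of(tagged), raw_id_of(tagged))
```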

A couple of questions about Hash Tables

I've been reading a lot about Hash Tables and how to implement one in C, and I think I have almost all the concepts in my head so I can start to code my own. I just have a couple of questions that I have yet to properly understand.
As a reference, I've been reading this:
http://eternallyconfuzzled.com/jsw_home.aspx
1) As I've read on the site above, a power of two or a prime number is recommended for the Hash Table size. This is basically an array and an array has a fixed size so I can quickly look up for the value I'm looking for. I can't declare a small array if I have a large input as it won't fit and I can't declare a very large array if my input data is not that large cause it's wasted memory.
What is the optimum size for the Hash Table? What should I base my decision on?
2) Also, on that site, there are a couple of hashing functions, which I have yet to read in full. It also states that it's always best to use a good known algorithm rather than to roll my own. And I might do just that: I'll pick one from that site, test it out in my code and see if it minimizes collisions based on my input data.
What's bugging me is how I control the hash range. The hash can't return an integer larger than the Hash Table size or we'll have a serious problem. How do I deal with this?
1) What you are referring to is the load factor of the hash table - the percentage of buckets that are expected to be filled. Wikipedia has this to say:
With a good hash function, the average lookup cost is nearly constant as the load factor increases from 0 up to 0.7 or so. Beyond that point, the probability of collisions and the cost of handling them increases.
I believe the Java implementation (and probably others) resizes periodically to keep the load factor within an acceptable range.
2) Just use the modulo operator (%) to keep the bucket index legal. The second operand should be the size of your bucket array.
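A minimal sketch of the modulo step, in Python rather than C. The function name is made up, and Python's built-in hash() stands in for whichever hash function you pick.

```python
def bucket_index(key, num_buckets: int) -> int:
    """Map any hash value into the range 0 .. num_buckets - 1."""
    return hash(key) % num_buckets

num_buckets = 16
for key in ("apple", "banana", 42, -7):
    print(key, "->", bucket_index(key, num_buckets))
```

In Python the result of % is never negative when the second operand is positive. In C the result can be negative for a negative left operand, so there the hash function should return an unsigned type.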
Pick a small size for your hash table. As you add stuff to your table, check to see what percentage of the table is being used; when it is greater than 70% full, make the table bigger. This also holds true as you remove elements-- make the table smaller when it is less than 60% full, for instance. Wikipedia has a good description of some strategies for dynamic resizing, but that's the general idea.
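A small sketch of that resizing policy, using separate chaining. The 70% growth threshold is taken from the text above; the doubling/halving step, the minimum size and the shrink threshold are assumptions. With doubling, a table that has just grown at 70% is only 35% full, so this sketch shrinks at 20% rather than 60% to avoid resizing back and forth.

```python
class ResizingTable:
    """Separate-chaining hash table that resizes to keep its load factor in range."""

    GROW_AT = 0.7    # grow when more than 70% full
    SHRINK_AT = 0.2  # assumption: low enough that a resize never triggers another one
    MIN_BUCKETS = 8  # assumption: never shrink below this size

    def __init__(self):
        self._buckets = [[] for _ in range(self.MIN_BUCKETS)]
        self._size = 0

    def _resize(self, num_buckets):
        # Rehash every entry into a fresh bucket array of the new size.
        old = self._buckets
        self._buckets = [[] for _ in range(num_buckets)]
        for bucket in old:
            for key, value in bucket:
                self._buckets[hash(key) % num_buckets].append((key, value))

    def put(self, key, value):
        bucket = self._buckets[hash(key) % len(self._buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))
        self._size += 1
        if self._size > self.GROW_AT * len(self._buckets):
            self._resize(2 * len(self._buckets))

    def get(self, key, default=None):
        for k, v in self._buckets[hash(key) % len(self._buckets)]:
            if k == key:
                return v
        return default

    def remove(self, key):
        bucket = self._buckets[hash(key) % len(self._buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self._size -= 1
                n = len(self._buckets)
                if n > self.MIN_BUCKETS and self._size < self.SHRINK_AT * n:
                    self._resize(n // 2)
                return True
        return False
```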
I only say this because you seem to have known input data:
If you know the rough order of magnitude of the amount of data you will be storing in the hash table, it's generally good enough to just create a table about that big. (You shouldn't worry about whether everything will fit. Instead, the right thing to think about is how many collisions you will have and how you will handle them.)
As for the right hash function, it's possible that the structure of your input will suggest which one will be correct. For instance, what aspects of your input are likely to be evenly distributed?

What's the point of a hash table?

I don't have experience with hash tables outside of arrays/dictionaries in dynamic languages, so I recently found out that internally they're implemented by making a hash of the key and using that to store the value. What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key, instead of making a hash of it and storing that.
This is a near duplicate: Why do we use a hashcode in a hashtable instead of an index?
Long story short, you can check if a key is already stored VERY quickly, and equally rapidly store a new mapping. Otherwise you'd have to keep a sorted list of keys, which is much slower to store and retrieve mappings from.
What is a hash table?
Also known as a hash map, it is a data structure used to implement an associative array. It is a structure that can map keys to values.
How does it work?
A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
Advantages:
In a well-dimensioned hash table, the average cost for each lookup is independent of the number of elements stored in the table.
Many hash table designs also allow arbitrary insertions and deletions of key-value pairs.
In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure.
Disadvantages:
The hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)
Uses:
They are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches and sets.
What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key
And how do you implement that?
Computers know only numbers. A hash table is a table, i.e. an array, and when we get right down to it, an array can only be addressed via an integral nonnegative index. Everything else is trickery. Dynamic languages that let you use string keys – they use trickery.
And one such trickery, and often the most elegant, is just computing a numerical, reproducible “hash” number of the key and using that as the index.
(There are other considerations such as compaction of the key range but that’s the foremost issue.)
In a nutshell: Hashing allows O(1) queries/inserts/deletes to the table. OTOH, a sorted structure (usually implemented as a balanced BST) makes the same operations take O(log n) time.
Why take a hash, you ask? How do you propose to store the key "as the key"? Ask yourself this, if you plan to store simply (key,value) pairs, how fast will your lookups/insertions/deletions be? Will you be running a O(n) loop over the entire array/list?
The whole point of having a hash value is that it allows all keys to be transformed into a finite set of hash values. This allows us to store keys in slots of a finite array (enabling fast operations - instead of searching the whole list you only search those keys that have the same hash value) even though the set of possible keys may be extremely large or infinite (e.g. keys can be strings, very large numbers, etc.) With a good hash function, very few keys will ever have the same hash values, and all operations are effectively O(1).
This will probably not make much sense if you are not familiar with hashing and how hashtables work. The best thing to do in that case is to consult the relevant chapter of a good algorithms/data structures book (I recommend CLRS).
The idea of a hash table is to provide direct access to its items. That is why it calculates the "hash code" of the key and uses it to store the item, instead of the key itself.
The idea is to have only one hash code per key. Often the hash function that generates the hash code divides the key by a prime number and uses the remainder as the hash code.
For example, suppose you have a table with 13 positions, and an integer as the key, so you can use the following hash function
f(x) = x % 13
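Worked through in Python for a few sample keys (the keys are arbitrary). Each key gets exactly one hash code, but two different keys can still share a code, which is why a table also needs a collision strategy.

```python
TABLE_SIZE = 13

def f(x: int) -> int:
    """Hash function from the example: remainder after dividing by the table size."""
    return x % TABLE_SIZE

slots = {key: f(key) for key in (27, 40, 5)}
print(slots)  # 27 and 40 land in the same slot, 5 in another
```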
What I don't understand is why aren't the values stored with the key (string, number, whatever) as the, well, key, instead of making a hash of it and storing that.
Well, how do you propose to do that, with O(1) lookup?
The point of hashtables is basically to provide O(1) lookup by turning the key into an array index and then returning the content of the array at that index. To make that possible for arbitrary keys you need
A way to turn the key into an array index (this is the hash's purpose)
A way to deal with collisions (keys that have the same hash code)
A way to adjust the array size when it's too small (causing too many collisions) or too big (wasting space)
Generally the point of a hash table is to store sparse values -- i.e. there is a large space of keys and a small number of things to store. Think about strings. There is an effectively unlimited number of possible strings. If you are storing the variable names used in a program then there is a relatively small number of those possible strings that you are actually using, even though you don't know in advance what they are.
In some cases, it's possible that the key is very long or large, making it impractical to keep copies of these keys. Hashing them first allows for less memory usage as well as quicker lookup times.
A hashtable is used to store a set of values and their keys in a (for some amount of time) constant number of spots. In a simple case, let's say you wanted to save every integer from 0 to 9999 using the hash function i / 10 (integer division).
This would make a hashtable of 1000 blocks (often an array), each having a list 10 elements deep. So if you were to search for 1234, it would immediately know to search in the table entry for 123, then start comparing to find the exact match. Granted, this isn't much better than just using an array of 10000 elements, but it's just to demonstrate.
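A sketch of the layout described above (1000 blocks, 10 integers each, with 1234 landing in entry 123). That layout corresponds to bucketing by integer division, written i // 10 in Python. The names are made up for the example.

```python
NUM_BUCKETS = 1000

def bucket_of(i: int) -> int:
    """Integer division by 10: values 1230..1239 all share bucket 123."""
    return i // 10

# Build the table: 1000 buckets, each a short list.
buckets = [[] for _ in range(NUM_BUCKETS)]
for i in range(10000):
    buckets[bucket_of(i)].append(i)

def contains(value: int) -> bool:
    """Jump straight to one bucket, then compare against at most 10 entries."""
    if not 0 <= value < 10 * NUM_BUCKETS:
        return False
    return value in buckets[bucket_of(value)]

print(bucket_of(1234), buckets[123], contains(1234))
```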
Hashtables are very useful for when you don't know exactly how many elements you'll have, but there will be a good number fewer collisions on the hash function than your total number of elements. (Which makes the hash function "hash(x) = 0" very, very bad.) You may have empty spots in your table, but ideally a majority of them will have some data.
The main advantage of using a hash for the purpose of finding items in the table, as opposed to using the original key of the key-value pair (which BTW is typically stored in the table as well, since the hash is not reversible), is that..
...it allows mapping the whole namespace of the [original] keys to the relatively small namespace of the hash values, allowing the hash-table to provide O(1) performance for retrieving items.
This O(1) performance gets a bit eroded when considering the extra time needed to deal with collisions and such, but on the whole the hash table is very fast for storing and retrieving items, as opposed to a system based solely on the [original] key value, which would then typically be O(log N), with for example a binary tree (although such a tree is more efficient, space-wise).
Also consider speed. If your key is a string and your values are stored in an array, your hash can access any element in 'near' constant time. Compare that to searching for the string and its value.
