Enhance a multidimensional sparse array data structure - multidimensional-array

I need an efficient data structure to store a multidimensional sparse array.
There are only 2 operations over the array:
batch insert of values, usually of a larger number of new values that existed in the array before. Very unlikely that there is a key collision on insert, however if it happens then the value is not updated.
query values in certain range (e.g. read range from index [2, 3, 10, 2] to [2, 3, 17, 6] in order)
From the start I know number of dimensions (usually between 3 to 10) and their sizes (each index can be stored in Int64 and product of all sizes doesn't exceed 2^256) and the upper limit on possible number of the array cells (usually 2^26-2^32).
Currently I use a balanced binary tree for storing the sparse array, the UInt256 key is formed as usual:
key = (...(index_0 * dim_size_1 + index_1) + ... + index_n-1) * dim_size_n + index_n
with operation time complexities (and I understand it can't be any better):
insert in O(log N)
search in O(log N)
Current implementation has problems:
expensive encoding of an index tuple into the key and a key back into the indexes
lack of locality of reference which would be beneficial during range queries
Is it a good idea to replace my tree with a skip list for the locality of reference?
When is it better to have a recursive (nested) structure of sparse arrays for each dimension instead of a single array with the composite key if the array sparseness is given?
I'm interested in any examples of efficient in-memory multidimensional array implementations and in specialized literature on the topic.

It depends on how sparse your matrix is. It's hard to give give numbers, but if it is "very" sparse then you may want to try using a PH-Tree (disclaimer: self advertisement). It is essentially a multidimensional radix-tree.
It natively supports 64bit integers (Java and C++). It is not balanced but depth is inherently limited to the number of bits per dimension (usually 64). It is natively a "map", i.e. it allows only one value per coordinate (there is also a multimap version that allows multiple values). The C++ version is limited to 62 dimensions.
Operations are in the order of O(log N) but should be significantly faster than a (balanced) binary tree.
Please note that the C++ version doesn't compile with MSVC at the moment but there is a patch coming. Let me know if you run into problems.

Related

Is it always necessary to make hash table number of buckets a prime number for performance reason?

https://www.quora.com/Why-should-the-size-of-a-hash-table-be-a-prime-number?share=1
I see that people mention that the number of buckets of a hash table is better to be prime numbers.
Is it always the case? When the hash values are already evenly distributed, there is no need to use prime numbers then?
https://github.com/rui314/chibicc/blob/main/hashmap.c
For example, the above hash table code does not use prime numbers as the number of buckets.
https://github.com/rui314/chibicc/blob/main/hashmap.c#L37
But the hash values are generated from strings using fnv_hash.
https://github.com/rui314/chibicc/blob/main/hashmap.c#L17
So there is a reason why it makes sense to use bucket sizes that are not necessarily prime numbers?
The answer is "usually you don't need a table whose size is a prime number, but there are some implementation reasons why you might want to do this."
Fundamentally, hash tables work best when hash codes are spread out as close to uniformly at random as possible. That prevents items from clustering in any one location within the table. At some level, provided that you have a good enough hash function to make this happen, the size of the table doesn't matter.
So why do folks say to pick tables whose size is a prime? There are two main reasons for this, and they're due to specific cases that don't arise in all hash tables.
One reason why you sometimes see prime-sized tables is due to a specific way of building hash functions. You can build reasonable hash functions by picking functions of the form h(x) = (ax + b) mod p, where a is a number in {1, 2, ..., p-1} and b is a number in the {0, 1, 2, ..., p-1}, assuming that p is a prime. If p isn't prime, hash functions of this form don't spread items out uniformly. As a result, if you're using a hash function like this one, then it makes sense to pick a table whose size is a prime number.
The second reason you see advice about prime-sized tables is if you're using an open-addressing strategy like quadratic probing or double hashing. These hashing strategies work by hashing items to some initial location k. If that slot is full, we look at slot (k + r) mod T, where T is the table size and r is some offset. If that slot is full, we then check (k + 2r) mod T, then (k + 3r) mod T, etc. If the table size is a prime number and r isn't zero, this has the nice, desirable property that these indices will cycle through all the different positions in the table without ever repeating, ensuring that items are nicely distributed over the table. With non-prime table sizes, it's possible that this strategy gets stuck cycling through a small number of slots, which gives less flexibility in positions and can cause insertions to fail well before the table fills up.
So assuming you aren't using double hashing or quadratic probing, and assuming you have a strong enough hash function, feel free to size your table however you'd like.
templatetypedef has some excellent points as always - just adding a couple more and some examples...
Is it always necessary to make hash table number of buckets a prime number for performance reason?
No. Firstly, using prime numbers for bucket count tends to mean you need to spend more CPU cycles to fold/mod a hash value returned by the hash function into the current bucket count. A popular alternative is to use powers of two for the bucket count (e.g. 8, 16, 32, 64... as you resize), because then you can do a bitwise AND operation to map from a hash value to a bucket in 1 CPU cycle. That answers your "So there is a reason why it makes sense to use bucket sizes that are not necessarily prime numbers?"
Tuning a hash table for performance often means weighing the cost of a stronger hash function and modding by prime numbers against the cost of higher collisions.
Prime bucket counts often help reduce collisions when the hash function is unable to produce a very good distribution for the keys its fed.
For example, if you hashed a bunch of pointers to 64-bit doubles using an identity hash (basically, casting the pointer address to a size_t), then the hash values would all be multiples of 8 (due to alignment), and if you had a hash table size like say 1024 or 2048 (powers of 2), then all your pointers would hash onto 1/8th of the bucket indices (specifically, buckets 0, 8, 16, 25, 32 etc.). With a prime number of buckets, at least the pointer values - which if the load factor is high are inevitably spread out over a much larger range than the range of bucket indices - tend to wrap around the hash table hitting different indices.
When you use a very strong hash function - where the low order bits are effectively random but repeatable, you'll already get a good distribution across buckets regardless of the bucket count. There are also times when even with a terribly weak hash function - like an identity hash - h(x) == x - all the bits in the keys are so random that they produce as good a distribution as a cryptographic hash could produce, so there's no point spending extra time on a stronger hash - that may even increase collisions.
There a also times when the distribution isn't inherently great, but you can afford to use extra memory to keep the load factor low, so it's not worth using primes or a better hash function. Still, extra buckets puts more strain on the CPU caches too - so things can end up slower than hoped for.
Other times, keys with an identity hash have an inherent tendency to fall into distinct buckets (e.g. because they might have been generated by an incrementing counter, even if some of the values are no longer in use). In that case, a strong hash function increases collisions and worsens CPU cache access patterns. Whether you use powers of two or prime bucket counts makes little difference here.
When the hash values are already evenly distributed, there is no need to use prime numbers then?
That statement is trivially true but kind of pointless if you're talking about hash values after the mod-to-current-hash-table-size operation: even distribution there directly relates to few collisions.
If you're talking about the more interesting case of hash values evenly distributed in the hash function return type value space (e.g. a 64-bit integer), before those values are modded into whatever the current hash table bucket count is, then there's till room for prime numbers to help, but only when the hashed key space a larger range than the hash bucket indices. The pointer example above illustrated that: if you had say 800 distinct 8-byte-aligned pointers going into ~1000 bucket, then the difference between the numerically lowest pointer and the higher address would be at least 799*8 = 6392... you're wrapping around the table more than 6 times at a minimum (for close-as-possible pointers), and a prime number of buckets would increase the odds of each of "wrap" modding onto previously unused buckets.
Note that some of the above benefits to prime bucket counts apply to any kind of collision handling - separate chaining, linear probing, quadratic probing, double hashing, cuckoo hashing, robin hood hashing etc.

Simple function to generate random number sequence without knowing previous number but know current index (no variable assignment)?

Is there any (simple) random generation function that can work without variable assignment? Most functions I read look like this current = next(current). However currently I have a restriction (from SQLite) that I cannot use any variable at all.
Is there a way to generate a number sequence (for example, from 1 to max) with only n (current number index in the sequence) and seed?
Currently I am using this:
cast(((1103515245 * Seed * ROWID + 12345) % 2147483648) / 2147483648.0 * Max as int) + 1
with max being 47, ROWID being n. However for some seed, the repeat rate is too high (3 unique out of 47).
In my requirements, repetition is ok as long as it's not too much (<50%). Is there any better function that meets my need?
The question has sqlite tag but any language/pseudo-code is ok.
P.s: I have tried using Linear congruential generators with some a/c/m triplets and Seed * ROWID as Seed, but it does not work well, it's even worse.
EDIT: I currently use this one, but I do not know where it's from. The rate looks better than mine:
((((Seed * ROWID) % 79) * 53) % "Max") + 1
I am not sure if you still have the same problem but I might have a solution for you.
What you could do is use Pseudo Random M-sequence generators based on shifting registers. Where you just have to take high enough order of you primitive polynomial and you don't need to store any variables really.
For more info you can check the wiki page
What you would need to code is just the primitive polynomial shifting equation and I have checked in an online editor it should be very easy to do. I think the easiest way for you would be to use Binary base and use PRBS sequences and depending on how many elements you will have you can choose your sequence length. For example this is the implementation for length of 2^15 = 32768 (PRBS15), the primitive polynomial I took from the wiki page (There youcan find the primitive polynomials all the way to PRBS31 what would be 2^31=2.1475e+09)
Basically what you need to do is:
SELECT (((ROWID << 1) | (((ROWID >> 14) <> (ROWID >> 13)) & 1)) & 0x7fff)
The beauty of this approach is if you take the sequence of the PRBS with longer period than your ROWID largest value you will have unique random index. Very simple. :)
If you need help with searching for primitive polynomials you can see my github repo which deals exactly with finding primitive polynomials and unique m-sequences. It is currently written in Matlab, but I plan to write it in python in next few days.
Cheers!
What about using good hash function and map result into [1...max] range?
Along the lines (in pseudocode). sha1 was added to SQLite 3.17.
sha1(ROWID) % Max + 1
Or use any external C code for hash (murmur, chacha, ...) as shown here
A linear congruential generator with appropriately-chosen parameters (a, c, and modulus m) will be a full-period generator, such that it cycles pseudorandomly through every integer in its period before repeating. Although you may have tried this idea before, have you considered that m is equivalent to max in your case? For a list of parameter choices for such generators, see L'Ecuyer, P., "Tables of Linear Congruential Generators of Different Sizes and Good Lattice Structure", Mathematics of Computation 68(225), January 1999.
Note that there are some practical issues to implementing this in SQLite, especially if your SQLite version supports only 32-bit integers and 64-bit floating-point numbers (with 52 bits of precision). Namely, there may be a risk of—
overflow if an intermediate multiplication exceeds 32 bits for integers, and
precision loss if an intermediate multiplication results in a greater-than-52-bit number.
Also, consider why you are creating the random number sequence:
Is the sequence intended to be unpredictable? In that case, a linear congruential generator alone is not enough, and you should generate unique identifiers by other means, such as by combining unique numbers with cryptographically random numbers.
Will the numbers generated this way be exposed in any way to end users? If not, there is no need to obfuscate them by "shuffling" them.
Also, depending on the SQLite API you're using (for your programming language), there may be a way to write a custom function to convert the seed and ROWID to a random unique number. The details, however, depend heavily on the specific SQLite API. Another answer shows an example for Perl.

What should be size of map if different objects(say 3) have same hash code, and as a result, present in same bucket?

What should be size of map if different objects(say 3) have same hash code, and as a result, present in same bucket?
The resulting size of the hash table depends on what collision resolution scheme we are using.
In the simplest case, we are using something like separate chaining (with linked lists).
In this case, we will have an array of N buckets and each bucket contains a reference to a linked list.
If we proceed to insert 3 items into the hash table, all of which share the same hash code, then the single target linked list would grow to length 3.
Thus, at a high level, we need at least N "units" of space to store bucket references plus 3 "units" of space to store the elements of the (occupied) linked list.
The exact size of these "units", depends on implementation details, such as word size (32-bit vs. 64-bit) and the exact definition of the linked list (singly- vs. doubly-linked list).
Assuming that we use singly-linked lists (for each bucket) on a 32-bit machine, the total size would be (approximately) 32 * N + (32 + x) * 3, where x refers to the size of the data type we are storing (e.g. ints, doubles, string, etc.)
If you would like to learn more, I would suggest googling "hash table collision" for more info.

map.find() time complexity

If I have a map:
map myMap<string,vector<int>>
What would the best,average, and worst case time complexity be to find a key, and then iterate through the vector to find a specific int?
I know the map.find() method is O(log n), but does the fact that I have to then search for an int within a vector change the time complexity?
Thanks
It'd help if you had stated this was C++. std::map is usually implemented as a binary tree, so that's where the O(log(n)) on the find operation comes from.
It'd be O(log(n) + m), where n is the size of the map and m is the size of (each) vector. I'm assuming that you only do one lookup and then iterate through the whole vector that corresponds to your key.
Since log(n) grows slowly, unless you have extremely small vectors, log(n) should be < m, therefore your algorithm should be O(m).

Difference between number and integer datatype in oracle dictionary views

I used oracle dictionary views to find out column differences if any between two schema's. While syncing data type discrepancies I found that both NUMBER and INTEGER data types stored in all_tab_columns/user_tab_columns/dba_tab_columns as NUMBER only so it is difficult to sync data type discrepancies where one schema/column has number datatype and another schema/column has integer data type.
While comparison of schema's it show datatype mismatch. Please suggest if there is any other alternative apart form using dictionary views or if any specific properties from dictionary views can be used to identify if data type is integer.
the best explanation i've found is this:
What is the difference betwen INTEGER and NUMBER? When should we use NUMBER and when should we use INTEGER? I just wanted to update my comments here...
NUMBER always stores as we entered. Scale is -84 to 127. But INTEGER rounds to whole number. The scale for INTEGER is 0. INTEGER is equivalent to NUMBER(38,0). It means, INTEGER is constrained number. The decimal place will be rounded. But NUMBER is not constrained.
INTEGER(12.2) => 12
INTEGER(12.5) => 13
INTEGER(12.9) => 13
INTEGER(12.4) => 12
NUMBER(12.2) => 12.2
NUMBER(12.5) => 12.5
NUMBER(12.9) => 12.9
NUMBER(12.4) => 12.4
INTEGER is always slower then NUMBER. Since integer is a number with added constraint. It takes additional CPU cycles to enforce the constraint. I never watched any difference, but there might be a difference when we load several millions of records on the INTEGER column. If we need to ensure that the input is whole numbers, then INTEGER is best option to go. Otherwise, we can stick with NUMBER data type.
Here is the link
Integer is only there for the sql standard ie deprecated by Oracle.
You should use Number instead.
Integers get stored as Number anyway by Oracle behind the scenes.
Most commonly when ints are stored for IDs and such they are defined with no params - so in theory you could look at the scale and precision columns of the metadata views to see of no decimal values can be stored - however 99% of the time this will not help.
As was commented above you could look for number(38,0) columns or similar (ie columns with no decimal points allowed) but this will only tell you which columns cannot take decimals, and not what columns were defined so that INTS can be stored.
Suggestion:
do a data profile on the number columns. Something like this:
select max( case when trunc(column_name,0)=column_name then 0 else 1 end ) as has_dec_vals
from table_name
This is what I got from oracle documentation, but it is for oracle 10g release 2:
When you define a NUMBER variable, you can specify its precision (p) and scale (s) so that it is sufficiently, but not unnecessarily, large. Precision is the number of significant digits. Scale can be positive or negative. Positive scale identifies the number of digits to the right of the decimal point; negative scale identifies the number of digits to the left of the decimal point that can be rounded up or down.
The NUMBER data type is supported by Oracle Database standard libraries and operates the same way as it does in SQL. It is used for dimensions and surrogates when a text or INTEGER data type is not appropriate. It is typically assigned to variables that are not used for calculations (like forecasts and aggregations), and it is used for variables that must match the rounding behavior of the database or require a high degree of precision. When deciding whether to assign the NUMBER data type to a variable, keep the following facts in mind in order to maximize performance:
Analytic workspace calculations on NUMBER variables is slower than other numerical data types because NUMBER values are calculated in software (for accuracy) rather than in hardware (for speed).
When data is fetched from an analytic workspace to a relational column that has the NUMBER data type, performance is best when the data already has the NUMBER data type in the analytic workspace because a conversion step is not required.

Resources