Impala - maximum number of tuple inserts - cloudera

What is the maximum number of tuples you can insert at a time in Impala?
INSERT INTO sample_table values ('john', 'high',....value 6, value 7, value 8 ......value 25), ('Kim', 'low',... value 6, value 7, value 8 ......value 25),
given that a tuple is
('john', 'high',....value 6, value 7, value 8 ......value 25)

Well, the limit on n should depend on how much stack the Impala frontend's JVM has, since this style of INSERT statement causes JFlex (which Impala uses for its SQL parser) to recurse at least n times, and all the tuples are stored in one deep parse tree. Even if you successfully construct this nasty tree, the next step is serializing it as a Thrift message and passing it around; I can only imagine how slow that could be.
I'd suggest using LOAD DATA for large amounts of insertion, which boils down to raw file moves, or using INSERT INTO ... SELECT ... FROM, which internally performs distributed reads and writes over HDFS.

Related

Enhance a multidimensional sparse array data structure

I need an efficient data structure to store a multidimensional sparse array.
There are only 2 operations over the array:
batch insert of values, usually adding more new values than the array held before. A key collision on insert is very unlikely; if one happens, the existing value is not updated.
query values in a certain range (e.g. read the range from index [2, 3, 10, 2] to [2, 3, 17, 6] in order)
From the start I know the number of dimensions (usually between 3 and 10), their sizes (each index can be stored in an Int64 and the product of all sizes doesn't exceed 2^256), and the upper limit on the possible number of array cells (usually 2^26-2^32).
Currently I use a balanced binary tree for storing the sparse array; the UInt256 key is formed in the usual row-major way (see the sketch after the complexity list below):
key = (...((index_0 * dim_size_1 + index_1) * dim_size_2 + index_2) * ... ) * dim_size_n + index_n
with operation time complexities (and I understand it can't be any better):
insert in O(log N)
search in O(log N)
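For reference, a minimal sketch of that composite-key encoding and its inverse in C; the function names are illustrative, and a 64-bit key is used for brevity where the question actually needs UInt256:

#include <stdint.h>
#include <stddef.h>

/* Row-major encoding of an index tuple into a single key, and the inverse.
   dim_size[d] is the size of dimension d; dim_size[0] is never needed for encoding. */
uint64_t encode(const uint64_t *index, const uint64_t *dim_size, size_t n_dims) {
    uint64_t key = index[0];
    for (size_t d = 1; d < n_dims; d++)
        key = key * dim_size[d] + index[d];
    return key;
}

void decode(uint64_t key, const uint64_t *dim_size, size_t n_dims, uint64_t *index_out) {
    for (size_t d = n_dims; d-- > 1; ) {   /* peel off dimensions from the last to the second */
        index_out[d] = key % dim_size[d];
        key /= dim_size[d];
    }
    index_out[0] = key;
}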
Current implementation has problems:
expensive encoding of an index tuple into the key and a key back into the indexes
lack of locality of reference which would be beneficial during range queries
Is it a good idea to replace my tree with a skip list for the locality of reference?
Given the array's sparseness, when is it better to have a recursive (nested) structure of sparse arrays, one per dimension, instead of a single array with the composite key?
I'm interested in any examples of efficient in-memory multidimensional array implementations and in specialized literature on the topic.
It depends on how sparse your matrix is. It's hard to give numbers, but if it is "very" sparse then you may want to try using a PH-Tree (disclaimer: self-advertisement). It is essentially a multidimensional radix tree.
It natively supports 64-bit integers (Java and C++). It is not balanced, but its depth is inherently limited to the number of bits per dimension (usually 64). It is natively a "map", i.e. it allows only one value per coordinate (there is also a multimap version that allows multiple values). The C++ version is limited to 62 dimensions.
Operations are in the order of O(log N) but should be significantly faster than a (balanced) binary tree.
Please note that the C++ version doesn't compile with MSVC at the moment but there is a patch coming. Let me know if you run into problems.

mclapply encounters errors depending on core id?

I have a set of genes for which I need to calculate some coefficients in parallel.
Coefficients are calculated inside GeneTo_GeneCoeffs_filtered, which takes a gene name as input and returns a list of 2 data frames.
With a 100-element gene_array, I ran this command with different numbers of cores: 5, 6 and 7.
Coeffslist = mclapply(gene_array, GeneTo_GeneCoeffs_filtered, mc.cores = no_cores)
I encounter errors on different gene names depending on the number of cores assigned to mclapply.
The indexes of the genes for which GeneTo_GeneCoeffs_filtered cannot return the list of data frames follow a pattern.
With 7 cores assigned to mclapply, the failing elements of gene_array are 4, 11, 18, 25, ..., 95 (every 7th); with 6 cores the indexes are 2, 8, 14, ..., 98 (every 6th); and likewise with 5 cores - every 5th.
Most importantly, the failing indexes differ between these runs, which means the problem is not with particular genes.
I suspect there might be a "broken" core that cannot properly run my function and that it alone generates these errors. Is there a way to trace back its id and exclude it from the list of cores that R can use?
A close reading of mclapply's manpage reveals that this behavior is by design and arises from the interaction between:
(a)
"the input X is split into as many parts as there are cores (currently
the values are spread across the cores sequentially, i.e. first value
to core 1, second to core 2, ... (core + 1)-th value to core 1 etc.)
and then one process is forked to each core and the results are
collected."
(b)
a "try-error" object will be returned for all the values involved in
the failure, even if not all of them failed.
In your case, by virtue of (a), your gene_array is spread "round-robin" style across the cores (with a gap of mc.cores between the indexes of successive elements), and by virtue of (b), if any gene_array element raises an error, you get back an error for each gene_array element sent to that core (having a gap of mc.cores between the indices of those elements).
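To make the pattern concrete, here is a small sketch (in C rather than R) of the pre-scheduling rule quoted in (a); which element actually fails is a hypothetical chosen for illustration:

/* With k cores, 1-based element i is handled by worker ((i - 1) % k) + 1,
   so one genuinely failing element marks every k-th element as a "try-error". */
#include <stdio.h>

int main(void) {
    const int n_elements = 100; /* length of gene_array in the question */
    const int k = 7;            /* mc.cores */
    const int failing = 4;      /* hypothetical index of the one element that really fails */

    int bad_worker = (failing - 1) % k + 1;
    printf("worker %d receives elements:", bad_worker);
    for (int i = 1; i <= n_elements; i++)
        if ((i - 1) % k + 1 == bad_worker)
            printf(" %d", i);   /* prints 4 11 18 ... 95, matching the observed pattern */
    printf("\n");
    return 0;
}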
I refreshed my understanding of this in an exchange yesterday with Simon Urbanek: https://stat.ethz.ch/pipermail/r-sig-hpc/2019-September/002098.html in which I also provide an error-handling approach yielding errors only for the indices that generate an error.
You can also get errors only for the indices that generate an error by passing mc.preschedule=FALSE.

What should the size of a map be if different objects (say 3) have the same hash code and, as a result, end up in the same bucket?

The resulting size of the hash table depends on what collision resolution scheme we are using.
In the simplest case, we are using something like separate chaining (with linked lists).
In this case, we will have an array of N buckets and each bucket contains a reference to a linked list.
If we proceed to insert 3 items into the hash table, all of which share the same hash code, then the single target linked list would grow to length 3.
Thus, at a high level, we need at least N "units" of space to store bucket references plus 3 "units" of space to store the elements of the (occupied) linked list.
The exact size of these "units" depends on implementation details, such as word size (32-bit vs. 64-bit) and the exact definition of the linked list (singly- vs. doubly-linked).
Assuming that we use singly-linked lists (for each bucket) on a 32-bit machine, the total size would be approximately 32 * N + (32 + x) * 3 bits, where x refers to the size (in bits) of the data type we are storing (e.g. ints, doubles, strings, etc.).
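As a rough sketch of that accounting in C (assuming separate chaining with singly-linked lists; the struct and function names are illustrative, and sizes are counted in bytes via sizeof rather than in bits):

#include <stdlib.h>

/* One chained entry: the stored payload plus a pointer to the next entry in the same bucket. */
struct entry {
    int value;              /* the "x"-sized payload; an int is used here for illustration */
    struct entry *next;     /* one next-pointer per stored item */
};

/* The table itself is an array of N bucket heads. */
struct hash_table {
    size_t n_buckets;       /* N */
    struct entry **buckets; /* N pointers, one per bucket */
};

/* Approximate footprint for k stored items: N bucket pointers plus, per item,
   one payload and one next-pointer -- the same shape as 32 * N + (32 + x) * 3 when k = 3. */
size_t approx_size(const struct hash_table *t, size_t k, size_t payload_bytes) {
    return t->n_buckets * sizeof(struct entry *) + k * (payload_bytes + sizeof(struct entry *));
}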
If you would like to learn more, I would suggest googling "hash table collisions".

time complexity of closed hashing algorithm

Assume we have applied a closed hashing algorithm to (4, 2, 12, 3, 9, 11, 7, 8, 13, and 18), and assume the length of the hash table is 7 initially.
How can a search on such a hash table be achieved in O(1) time in the worst case?
It really doesn't matter what you do. Because the data set is predetermined, there is a constant upper bound on the worst-case lookup for any hash function (as long as the hash function is guaranteed to terminate): whichever element takes longest to find sets that upper bound. A constant upper bound implies O(1) complexity. QED.
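To make this concrete, here is one of many schemes that gives a constant-time lookup for this particular fixed key set - a sketch in C using a direct-address table sized to the largest key (an illustration, not the only approach that works):

#include <stdbool.h>
#include <stdio.h>

/* The keys are fixed in advance: 4, 2, 12, 3, 9, 11, 7, 8, 13, 18.
   The largest key is 18, so an array of 19 flags answers a membership
   query with a single array access -- O(1) in the worst case. */
static bool present[19];

static void build(void) {
    const int keys[] = {4, 2, 12, 3, 9, 11, 7, 8, 13, 18};
    for (size_t i = 0; i < sizeof keys / sizeof keys[0]; i++)
        present[keys[i]] = true;
}

static bool contains(int k) {
    return k >= 0 && k < 19 && present[k];  /* bounds check plus one array read */
}

int main(void) {
    build();
    printf("%d %d\n", contains(9), contains(10));  /* prints: 1 0 */
    return 0;
}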

SQLITE_INTEGER value bytes

I have a question about the data types in sqlite3.
Since a value of SQLITE_INTEGER can be stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value, if I only know that a column in an SQLite database stores SQLITE_INTEGER, how can I tell whether a value in that column is a 4-byte or a 6-8-byte integer, i.e. whether sqlite3_column_int() or sqlite3_column_int64() should be used to get the value?
Can I use sqlite3_column_bytes() in this case? According to the documentation, though, sqlite3_column_bytes() is primarily intended for TEXT or BLOB values.
Thanks!
When SQLite steps into a record, all integer values are expanded to 64 bits.
sqlite3_column_int() returns the lower 32 bits of that value without checking for overflows.
When you call sqlite3_column_bytes(), SQLite will convert the value to text, and return the number of characters.
You cannot know how large an integer value is before reading it.
Afterwards, you can check the list in the record format documentation for the smallest possible format for that value, but if you want to be sure that integer values are never truncated to 32 bits, you have to always use sqlite3_column_int64(), or ensure that large values never get written to the DB in the first place (if that is possible).
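A minimal sketch of that advice in C (the database file, table, and column names are made up for illustration):

#include <stdio.h>
#include <sqlite3.h>

/* Read an integer column safely: always use sqlite3_column_int64(), which cannot
   truncate, regardless of how many bytes the value occupies in the record on disk. */
int main(void) {
    sqlite3 *db = NULL;
    sqlite3_stmt *stmt = NULL;

    if (sqlite3_open("example.db", &db) != SQLITE_OK) return 1;
    if (sqlite3_prepare_v2(db, "SELECT n FROM t", -1, &stmt, NULL) != SQLITE_OK) {
        sqlite3_close(db);
        return 1;
    }
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        /* sqlite3_column_type() reports the storage class (e.g. SQLITE_INTEGER),
           but not how many bytes the integer used in the record. */
        if (sqlite3_column_type(stmt, 0) == SQLITE_INTEGER) {
            sqlite3_int64 v = sqlite3_column_int64(stmt, 0);  /* full 64-bit value */
            printf("%lld\n", (long long)v);
        }
    }
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}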
