Kusto: ingest from a query - azure-data-explorer

Hello all,
I am currently trying to ingest data using a batch operation. My query is written as follows:
.set-or-append tableName with (folder = "rocky") <|
let _Scope = () {
    let N = 4;
    range p from 0 to N-1 step 1
    | partition by p
    {
        functionName((list_of_ids()
            | where hash(something, N) == toscalar(p)), datetime(2020-05-03))
        | extend batch_num = toscalar(p)
    }
};
union (_Scope())
I want to understand whether this would run in parallel for each partition or sequentially. If it runs in parallel, how can I optimize it further? Any help is much appreciated.

The partition operator (which you use in your function) allows you to provide hints to control concurrency:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/partitionoperator
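For example, the partition step in the question could state the intended concurrency explicitly (a sketch based on the query above; hint.concurrency=4 simply mirrors N=4 and can be tuned):
let N = 4;
range p from 0 to N-1 step 1
| partition hint.concurrency=4 by p
{
    functionName((list_of_ids()
        | where hash(something, N) == toscalar(p)), datetime(2020-05-03))
    | extend batch_num = toscalar(p)
}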
Regardless, depending on what functionName() does (it's not mentioned in the original question), you could consider using the distributed option:
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/data-ingestion/ingest-from-query
Setting the distributed flag to true is useful when the amount of data being produced by the query is large (exceeds 1GB of data) and the query doesn't require serialization (so that multiple nodes can produce output in parallel). When the query results are small it's not recommended to use this flag, as it might generate a lot of small data shards needlessly.
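For example, the ingestion command from the question would then look roughly like this (a sketch; worthwhile only if the produced output really is large):
.set-or-append tableName with (folder = "rocky", distributed = true) <|
let _Scope = () {
    // ... same partitioned query as in the question ...
};
union (_Scope())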

Related

How to migrate Presto `map` function to hive

The Presto map() function is quite a bit easier to use than Hive's. A Presto map() invocation takes two lists: the first one for the keys, the second for the values.
A Hive map() takes a varargs (variable-length) parameter list of alternating keys and values.
Here is a query snippet that I need to migrate (backwards?) from Presto to Hive:
, map(
concat(map_keys(decision_feature_importance), array['id_queue', 'queue_disposition']),
concat(map_values(decision_feature_importance), array[CAST(id_queue AS VARCHAR), queue_disposition])) other_info
The core of it is that map() accepts two parallel arrays, but Hive objects rather strongly to that. What is the pattern to [reverse-?] migrate the map()?
There are several questions about zipping lists in Hive, e.g. "hive create map or key/value pair from two arrays". They are pretty complicated, may involve UDFs (which I do not have the ability to create) or libraries (Brickhouse) that I do not have the ability to install (shared cluster with hundreds of users). Also, they address only a portion of the problem here.
The following toy query shows how to build the Hive-format map entries from two parallel lists. Basically we need to zip the lists manually, since Hive has no such built-in function.
Hive partial equivalent
with mydata as (
    select 1 id, map('key11','val11','key12','val12','key13','val13') as mymap
    union all
    select 2 id, map('key21','val21','key22','val22','key13','val13') as mymap
)
select split(concat_ws(',', collect_list(concat(key, ',', value))), ',') keyval
from (
    select * from mydata lateral view outer explode(mymap) m
) d;
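If an actual map column is wanted rather than the flattened key/value array, the same zipped string can be converted back with Hive's built-in str_to_map. A sketch along the lines of the toy query above (it assumes keys and values contain neither ',' nor ':'):
with mydata as (
    select 1 id, map('key11','val11','key12','val12') as mymap
    union all
    select 2 id, map('key21','val21','key22','val22') as mymap
)
select
    id,
    -- zip each row's keys/values into "k:v" pairs, join with ',', then rebuild a map
    str_to_map(concat_ws(',', collect_list(concat(k, ':', v))), ',', ':') as rebuilt_map
from (
    select id, k, v from mydata lateral view outer explode(mymap) m as k, v
) d
group by id;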

tsqlt testing an output parameter from a stored procedure

I have a stored procedure that I am trying to test for the proper generation of an output parameter. I experimented with tSQLt.ExpectException, but that did not work, so I am instead trying tSQLt.AssertEqualsTable.
CREATE TABLE #actual (msg NVARCHAR(MAX));
CREATE TABLE #expected (msg NVARCHAR(MAX));
INSERT #expected (msg) VALUES (N'Location w1005 has LPNs that were produced against a different production order than 1');
--EXEC tSQLt.ExpectException @ExpectedMessage = N'Location 1 has LPNs that were produced agains a different production order than orderNumber';
EXEC dbo.wms_whse_check_location
    @command = @Command, @operLocationHasOtherLPN = @operLocationHasOtherLPN OUTPUT;
INSERT #actual (msg) VALUES (@operLocationHasOtherLPN);
EXEC tSQLt.AssertEqualsTable @Expected = '#expected', @Actual = '#actual';
The test fails, and the output from tsqlt is:
Unexpected/missing resultset rows!
|_m_|msg                                                                                    |
+---+--------------------------------------------------------------------------------------+
|< |Location w1005 has LPNs that were produced against a different production order than 1|
|> |Location w1005 has LPNs that were produced against a different production order than l|
It may be hard to see in the snippet above, but the < (expected) row is identical to the > (actual) row -- tSQLt finds a difference that in fact doesn't exist. It seems I'm not choosing the correct method.
Has anyone written tests to check output parameters? What is the appropriate method? Thanks.
P.S. Apologies for the messy formatting. I'm not a regular poster.
tSQLt.AssertEqualsString is in fact the appropriate test. I don't know where I went wrong, but when I concatenated the appropriate expected message in code (as opposed to typing it out), then ran the test, it succeeded.
Use tSQLt.AssertEqualsString, as you found out already.
Also, your two strings are not actually identical: one ends in “1”, the other in a lowercase “l”.
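For illustration, a minimal sketch of such a test using tSQLt.AssertEqualsString (the test class name and the @Command value are placeholders invented here; the procedure, parameter names, and message text come from the question):
-- assumes a test class created beforehand with: EXEC tSQLt.NewTestClass 'testWhseCheckLocation';
CREATE PROCEDURE testWhseCheckLocation.[test output message when location has foreign LPNs]
AS
BEGIN
    -- placeholder: the real command value is not shown in the question
    DECLARE @Command NVARCHAR(MAX) = N'<command>';
    DECLARE @operLocationHasOtherLPN NVARCHAR(MAX);
    DECLARE @expected NVARCHAR(MAX) =
        N'Location w1005 has LPNs that were produced against a different production order than 1';

    EXEC dbo.wms_whse_check_location
        @command = @Command,
        @operLocationHasOtherLPN = @operLocationHasOtherLPN OUTPUT;

    EXEC tSQLt.AssertEqualsString
        @Expected = @expected,
        @Actual   = @operLocationHasOtherLPN;
END;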

Multiple variables in return object of function in R. Want to run it for multiple argument cases

How do I retrieve outputs from objects in an array as described in the background?
I have a function in R that returns multiple variables. For example, if my function is called function_ABC, then:
a<-function_ABC (input_var)
gives a such that a$var1, a$var2, and a$var3 exist.
I have multiple cases to run, so I have put them in an array:
input_var <- c(1, 2, ...15)
For storing the outputs, I declared var as:
var <- c(v1, v2, v3, .... v15)
Then I run:
assign(v1[i],function(input_var(i)))
However, after that I am unable to access these variables as v1[1]$var1. I can access them as: v1$var1, or v3$var1, etc. But this means I need to write 15*3 commands to retrieve my output.
Is there an easier way to do this?
Push your whole input set into an array Arr[].
Create a multi-threaded executor E of a certain size N.
Using a for loop over the input array Arr[], submit your function calls as Callable jobs to the executor E. While submitting each job, hold the reference to the FutureTask in another array FTArr[].
When all the FutureTask jobs have executed, you can retrieve the output for each of them by running another for loop over FTArr[] (see the sketch after these notes).
Note:
• Make sure to add a synchronized block in your func_ABC wherever it accesses shared resources, to avoid concurrency issues.
• Please refer to the link below if you want to know more about the usage of a CountDownLatch. A CountDownLatch helps you find out when exactly all the child threads have finished execution.
https://www.geeksforgeeks.org/countdownlatch-in-java/
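For illustration, a minimal Java sketch of the pattern described above (funcABC and Result are placeholders standing in for the question's function_ABC and its three outputs):
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCalls {

    // Stand-in for the function that returns multiple values (var1, var2, var3).
    record Result(int var1, int var2, int var3) {}

    static Result funcABC(int input) {
        return new Result(input, input * 2, input * 3);
    }

    public static void main(String[] args) throws Exception {
        int[] arr = {1, 2, 3, 4, 5};                          // the input set Arr[]
        ExecutorService e = Executors.newFixedThreadPool(4);  // executor E of size N

        List<Future<Result>> ftArr = new ArrayList<>();       // FTArr[]
        for (int input : arr) {
            ftArr.add(e.submit(() -> funcABC(input)));        // submit each call as a Callable
        }

        for (Future<Result> ft : ftArr) {
            Result r = ft.get();                              // blocks until that job has finished
            System.out.println(r.var1() + " " + r.var2() + " " + r.var3());
        }
        e.shutdown();
    }
}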

Text streaming in Gensim

Gensim uses text streaming to minimize memory requirements. This comes at the cost of performance due to endless disk IO. Is there a trick to copy the complete file from disk into a temporary in-memory file on the fly (one disk IO)?
I'd like to keep the code as is (no recoding into list structures), but streaming this way is not a great way of debugging functionality.
Expected result: much faster code
Some more background on the question
The original code is at https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb. The example code is taken from the phrase modelling section.
I'm calculating the unigrams. All reviews are at
review_txt_filepath = os.path.join(intermediate_directory,'review_text_all.txt'),
all unigrams should go to
unigram_sentences_filepath = os.path.join(intermediate_directory, 'unigram_sentences_all.txt')
The crucial routines are
def punct_space(token):
    return token.is_punct or token.is_space

def line_review(filename):
    # generator function to read in reviews from the file
    with codecs.open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')

def lemmatized_sentence_corpus(filename):
    # generator function to use spaCy to parse reviews, lemmatize the text, and yield sentences
    for parsed_review in nlp.pipe(line_review(filename),
                                  batch_size=10000, n_threads=4):
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])
The unigrams are calculated as
with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
    for sentence in lemmatized_sentence_corpus(review_txt_filepath):
        f.write(sentence + '\n')
Doing this for 5000 lines requires some patience, 1h30m ;-)
I'm not that familiar with iterables, but do I understand it correctly that I first have to read the actual file (on disk) into a variable "list_of_data" and process that?
with codecs.open(review_txt_filepath, encoding='utf_8') as f:
    list_of_data = f.read()
with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
    for sentence in lemmatized_sentence_corpus(list_of_data):
        f.write(sentence + '\n')
So the strategy is
1. read all data into a list in memory
2. process the data
3. write the results to disc
4. delete the list from memory by setting list_of_data = ()
An obvious problem with this is that line_review() is doing the file reading.
Most gensim interfaces actually take iterable sequences. Examples which emphasize streaming-from-disk just happen to use iterables that read each item as needed, but you could use an in-memory list instead.
Essentially, if you do have enough RAM to have your whole dataset in memory, just use the IO-reading iterable to read things once into a list. Then, feed that list to the gensim class where it expects any iterable sequence.
This shouldn't involve any "recoding into a list structure" – but it is using the Python list type to hold things in memory. It's the most natural way to do it, and likely the most efficient, especially in algorithms which do multiple passes over tokenized text.
(The less-idiomatic approach of, say, loading the entire file into a raw byte array, then performing repeated reading of that, file-style, to get the individual items needed by the algorithm, is clunkier. It may similarly save on repeated IO cost, but will likely waste effort on repeated re-parsing/tokenizing of items that are processed repeatedly. You'll want to keep each item as a Python object in memory, if you have the memory, and that requires putting them in a list.)
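As a minimal sketch of the list-based approach (it assumes the lemmatized_sentence_corpus() and review_txt_filepath definitions from the question; the Phrases call at the end merely stands in for whichever gensim class will consume the corpus):
from gensim.models.phrases import Phrases

# Read the whole corpus once: each sentence becomes a list of tokens held in RAM.
sentences = [
    sentence.split()
    for sentence in lemmatized_sentence_corpus(review_txt_filepath)
]

# Any iterable is accepted, so the in-memory list can be reused for multiple
# passes without touching the disk again.
bigram_model = Phrases(sentences)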
To be more specific in answering, you'd need to provide more details in the question, like which specific algorithms/corpus-reading-styles you're using, ideally with example code.

XQuery result duplicated

I'm not getting the output I want. I don't understand why the result is duplicated. Can someone help me?
for $i in 1 to 2
let $rng:=random-number-generator()
let $rng1:=$rng('permute')(1 to 10)
let $rng:=$rng('next')()
let $rng2:=$rng('permute')(1 to 10)
let $rng:=$rng('next')()
let $rng3:=$rng('permute')(1 to 10)
return (string-join($rng1),string-join($rng2),string-join($rng3),",")
result:
23496815107
31018674529
31017684259
23496815107
31018674529
31017684259
The result is duplicated because of the initial for $i in 1 to 2, and because the variable $i is not actually used anywhere.
I edited the query based on your comment (getting 10 numbers). From what I understand, the difficulty here is to chain the calls (alternating between 'next' and 'permute'). Chaining calls can be done with tail recursion.
declare function local:multiple-calls(
  $rng as function(*),
  $number-of-times as xs:integer) as item()* {
  if ($number-of-times le 0)
  then ()
  else
    let $rng := $rng('next')()
    return ($rng('permute')(1 to 10),
            local:multiple-calls($rng, $number-of-times - 1))
};
local:multiple-calls(random-number-generator(), 10)
Note: I am not sure whether (1 to 10) is what actually needs to be passed to the call to $rng('permute'), or whether it was an attempt to output ten numbers. As I'm not sure, I haven't changed it.
The specification is here:
http://www.w3.org/TR/xpath-functions-31/#func-random-number-generator
It says:
Both forms of the function are ·deterministic·: calling the function
twice with the same arguments, within a single ·execution scope·,
produces the same results.
If you supply $i as the $seed argument to random-number-generator then the two sequences should be different.
I think I now understand what confuses you in the original query. One could indeed expect the random numbers to be generated differently for each iteration of $i.
However, XQuery is (to put it simply, with a few exceptions) deterministic. This means that the random generator probably gets initialized in each iteration with the same, default seed.
Thus, I have a second potential answer:
If you have a way to pass a different seed to $rng, you could slightly modify your initial query by constructing a seed based on $i and maybe current-dateTime() in each iteration before generating the numbers. But it will still be the same if you execute the query several times unless you involve the current date/time.
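For illustration, a minimal sketch of that idea, simply passing $i as the seed (any atomic value is accepted as a seed, so current-dateTime() could be folded in as well):
for $i in 1 to 2
let $rng  := random-number-generator($i)   (: a different seed per iteration :)
let $rng1 := $rng('permute')(1 to 10)
let $rng  := $rng('next')()
let $rng2 := $rng('permute')(1 to 10)
let $rng  := $rng('next')()
let $rng3 := $rng('permute')(1 to 10)
return (string-join($rng1), string-join($rng2), string-join($rng3), ",")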
