Does materialize function not work with tabular function argument in Kusto? - azure-data-explorer

I created a simple function MaterializeRemoteInputTable below which accepts the output of a table as an input.
To test if the input is materialized, I am using it twice to calculate two scalar values - x and y:
.create-or-alter function MaterializeRemoteInputTable(inputTable: (State:string)) {
let materializedInput = materialize(inputTable);
let x = toscalar(materializedInput | count );
let y = toscalar(materializedInput | where State contains "S" | count);
print x, y;
}
I am using the sample kusto database's function output as an input to the above function:
cluster('https://help.kusto.windows.net').database('Samples').
GetStatesWithPopulationSmallerThan(1000000)
| invoke MaterializeRemoteInputTable()
On querying the help.kusto.windows.net cluster with my ClientActivityId, I see two queries are executed:
.show commands-and-queries
| where ClientActivityId contains "KE.RunQuery;<guid...>"
Above query outputs two rows:
GetStatesWithPopulationSmallerThan(long(1000000))|__executeAndCache|count as Count|limit long(1)|project ["b2fb..."]=["Count"]
GetStatesWithPopulationSmallerThan(long(1000000))|__executeAndCache|where (["State"] contains ("S"))|count as Count|limit long(1)|project ["5d0e2..."]=["Count"]
Since I have materialized the input to my MaterializeRemoteInputTable, why are two queries executed on the remote cluster, once each for x and y?

There is no issue with materialize & tabular argument to a function.
The issue is a known limitation in materialize with cross cluster query that in some cases the materialization is not performed.
Please note that this is only a performance issue and not a logical issue (query results are guaranteed to be correct).

Related

Kusto: Handling null parameter in user defined funtion

I am writing a custom User defined function in kusto where I want to have some optional parameters to the function which will be used in the "where" clause. I want to dynamically handle this. For example: If the value is present, then my where clause should consider that column filter, else it should not be included in the "where" clause>
Eg (psuedo code where value is not null):
function Calculate(string:test)
{
T | where test == test | order by timestamp
}
Eg (psuedo code where value is null or empty. My final query should look like this):
function Calculate(string:test)
{
T | order by timestamp
}
What is the efficient way to implement this. I will call this function from my c# class.
you can define a function with a default value for its argument, and use the logical or operator and the isempty() function to implement the condition you've described.
for example:
(note: the following is demonstrated using a let statement, but can be applied similarly to stored functions)
let T = range x from 1 to 5 step 1 | project x = tostring(x)
;
let F = (_x: string = "") {
T
| where isempty(_x) or _x == x
| order by x desc
}
;
F("abc")
if you run F() or F("") (i.e. no argument, or an empty string as the argument) - it will return all values 1-5.
if you run F("3") - it will return a single record with the value 3.
if you run F("abc") - it will return no records.

Different array types, for comparison in the set_difference() function

I'm having some issues getting expected results from set_difference(). I assumed I was comparing two dynamic arrays, but I'm not sure where the gap is. The only additional insight I have is that when I compare the two arrays using the gettype() function, I get the following:
First array
Created using a make_list aggregation, e.g.
| summarize inv_list = make_list(Date)
When I run gettype() on the array:
"type_inv_list": array
Second array
Created through a scalar function
let period_check_range = todynamic(range(make_datetime(start_date), datetime_add('day',8,make_datetime(start_date)),1d));
When I run gettype() on the array:
"type_range___scalar_90e56a216d8942f28e6797e5abc35dd9": array
Any guidance on how to make these arrays work so I can use the set_difference() function?
You're missing toscalar() (see doc) in the first array. When you run | summarize ... you get a table as a result, but what you actually want is a single scalar, what's why toscalar() is needed.
Here's how to achieve what you want:
let StartDate = ago(10d);
let Array1 = toscalar(MyTable | summarize make_set(Timestamp));
let Array2 = todynamic(range(make_datetime(StartDate), datetime_add('day',8,make_datetime(StartDate)),1d));
print set_difference(Array2, Array1)
By the way, you probably want to use make_set and not make_list as you're not interested in duplicate values.

How do I represent sparse arrays in Pari/GP?

I have a function that returns integer values to integer input. The output values are relatively sparse; the function only returns around 2^14 unique outputs for input values 1....2^16. I want to create a dataset that lets me quickly find the inputs that produce any given output.
At present, I'm storing my dataset in a Map of Lists, with each output value serving as the key for a List of input values. This seems slow and appears to use a whole of stack space. Is there a more efficient way to create/store/access my dataset?
Added:
It turns out the time taken by my sparesearray() function varies hugely on the ratio of output values (i.e., keys) to input values (values stored in the lists). Here's the time taken for a function that requires many lists, each with only a few values:
? sparsearray(2^16,x->x\7);
time = 126 ms.
Here's the time taken for a function that requires only a few lists, each with many values:
? sparsearray(2^12,x->x%7);
time = 218 ms.
? sparsearray(2^13,x->x%7);
time = 892 ms.
? sparsearray(2^14,x->x%7);
time = 3,609 ms.
As you can see, the time increases exponentially!
Here's my code:
\\ sparsearray takes two arguments, an integer "n" and a closure "myfun",
\\ and returns a Map() in which each key a number, and each key is associated
\\ with a List() of the input numbers for which the closure produces that output.
\\ E.g.:
\\ ? sparsearray(10,x->x%3)
\\ %1 = Map([0, List([3, 6, 9]); 1, List([1, 4, 7, 10]); 2, List([2, 5, 8])])
sparsearray(n,myfun=(x)->x)=
{
my(m=Map(),output,oldvalue=List());
for(loop=1,n,
output=myfun(loop);
if(!mapisdefined(m,output),
/* then */
oldvalue=List(),
/* else */
oldvalue=mapget(m,output));
listput(oldvalue,loop);
mapput(m,output,oldvalue));
m
}
To some extent, the behavior you are seeing is to be expected. PARI appears to pass lists and maps by value rather than reference except to the special inbuilt functions for manipulating them. This can be seen by creating a wrapper function like mylistput(list,item)=listput(list,item);. When you try to use this function you will discover that it doesn't work because it is operating on a copy of the list. Arguably, this is a bug in PARI, but perhaps they have their reasons. The upshot of this behavior is each time you add an element to one of the lists stored in the map, the entire list is being copied, possibly twice.
The following is a solution that avoids this issue.
sparsearray(n,myfun=(x)->x)=
{
my(vi=vector(n, i, i)); \\ input values
my(vo=vector(n, i, myfun(vi[i]))); \\ output values
my(perm=vecsort(vo,,1)); \\ obtain order of output values as a permutation
my(list=List(), bucket=List(), key);
for(loop=1, #perm,
if(loop==1||vo[perm[loop]]<>key,
if(#bucket, listput(list,[key,Vec(bucket)]);bucket=List()); key=vo[perm[loop]]);
listput(bucket,vi[perm[loop]])
);
if(#bucket, listput(list,[key,Vec(bucket)]));
Mat(Col(list))
}
The output is a matrix in the same format as a map - if you would rather a map then it can be converted with Map(...), but you probably want a matrix for processing since there is no built in function on a map to get the list of keys.
I did a little bit of reworking of the above to try and make something more akin to GroupBy in C#. (a function that could have utility for many things)
VecGroupBy(v, f)={
my(g=vector(#v, i, f(v[i]))); \\ groups
my(perm=vecsort(g,,1));
my(list=List(), bucket=List(), key);
for(loop=1, #perm,
if(loop==1||g[perm[loop]]<>key,
if(#bucket, listput(list,[key,Vec(bucket)]);bucket=List()); key=g[perm[loop]]);
listput(bucket, v[perm[loop]])
);
if(#bucket, listput(list,[key,Vec(bucket)]));
Mat(Col(list))
}
You would use this like VecGroupBy([1..300],i->i%7).
There is no good native GP solution because of the way garbage collection occurs because passing arguments by reference has to be restricted in GP's memory model (from version 2.13 on, it is supported for function arguments using the ~ modifier, but not for map components).
Here is a solution using the libpari function vec_equiv(), which returns the equivalence classes of identical objects in a vector.
install(vec_equiv,G);
sparsearray(n, f=x->x)=
{
my(v = vector(n, x, f(x)), e = vec_equiv(v));
[vector(#e, i, v[e[i][1]]), e];
}
? sparsearray(10, x->x%3)
%1 = [[0, 1, 2], [Vecsmall([3, 6, 9]), Vecsmall([1, 4, 7, 10]), Vecsmall([2, 5, 8])]]
(you have 3 values corresponding to the 3 given sets of indices)
The behaviour is linear as expected
? sparsearray(2^20,x->x%7);
time = 307 ms.
? sparsearray(2^21,x->x%7);
time = 670 ms.
? sparsearray(2^22,x->x%7);
time = 1,353 ms.
Use mapput, mapget and mapisdefined methods on a map created with Map(). If multiple dimensions are required, then use a polynomial or vector key.
I guess that is what you are already doing, and I'm not sure there is a better way. Do you have some code? From personal experience, 2^16 values with 2^14 keys should not be an issue with regards to speed or memory - there may be some unnecessary copying going on in your implementation.

SELECT COUNT(*) doesn't work in QML

I'm trying to get the number of records with QML LocalStorage, which uses sqlite. Let's take this snippet in account:
function f() {
var db = LocalStorage.openDatabaseSync(...)
db.transaction (
function(tx) {
var b = tx.executeSql("SELECT * FROM t")
console.log(b.rows.length)
var c = tx.executeSql("SELECT COUNT(*) FROM t")
console.log(JSON.stringify(c))
}
)
}
The output is:
qml: 3
qml: {"rowsAffected":0,"insertId":"","rows":{}}
What am I doing wrong that the SELECT COUNT(*) doesn't output anything?
EDIT: rows only seems empty in the second command. Calling
console.log(JSON.stringify(c.rows.item(0)))
gives
qml: {"COUNT(*)":3}
Two questions now:
Why is rows shown as empty
How can I access the property inside c.rows.item(0)
In order to visit the items, you have to use:
b.rows.item(i)
Where i is the index of the item you want to get (in your first example, i belongs to [0, 1, 2] for you have 3 items, in the second one it is 0 and you can query it as c.rows.item(0)).
The rows field appears empty and it is a valid result, for the items are not part of the rows field itself (indeed you have to use a method to get them, as far as I know that method could also be a memento that completely enclose the response data) and the item method is probably defined as not enumerable (I cannot verify it, I'm on the beach and it's quite difficult to explore the Qt code now :-)). You can safely rely on the length parameter to know if there are returned values, thus you can iterate over them to print them out. I did something like that in a project of mine and it works fine.
The properties inside item(0) have the same names given for the query. I suggest to rewrite that query as:
select count(*) as cnt from t
Then, you can get the count as:
c.rows.item(0).cnt

pyparsing for querying a database of chemical elements

I would like to parse a query for a database of chemical elements.
The database is stored in a xml file. Parsing that file produces a nested dictionary that is stored in a singleton object that inherit from collections.OrderedDict.
Asking for an element will give me an ordered dictionary of its corresponding properties
(i.e. ELEMENTS['C'] --> {'name':'carbon','neutron' : 0,'proton':6, ...}).
Conversely, asking for a propery will give me an ordered dictionary of its values for all the elements (i.e. ELEMENTS['proton'] --> {'H' : 1, 'He' : 2} ...).
A typical query could be:
mass > 10 or (nucleon < 20 and atomic_radius < 5)
where each 'subquery' (i.e. mass > 10) will return the set of elements that matches it.
Then, the query will be converted and transformed internally to a string that will be evaluated further to produce a set of the indexes of the elements that matched it. In that context the operators and/or are not boolean operator but rather ensemble operator that acts upon python sets.
I recently sent a post for building such a query. Thanks to the useful answers I got, I think that I did more or less the job (I hope on a nice way !) but I still have some questions related to pyparsing.
Here is my code:
import numpy
from pyparsing import *
# This import a singleton object storing the datase dictionary as
# described earlier
from ElementsDatabase import ELEMENTS
and_operator = oneOf(['and','&'], caseless=True)
or_operator = oneOf(['or' ,'|'], caseless=True)
# ELEMENTS.properties is a property getter that returns the list of
# registered properties in the database
props = oneOf(ELEMENTS.properties, caseless=True)
# A property keyword can be quoted or not.
props = Suppress('"') + props + Suppress('"') | props
# When parsed, it must be replaced by the following expression that
# will be eval later.
props.setParseAction(lambda t : "numpy.array(ELEMENTS['%s'].values())" % t[0].lower())
quote = QuotedString('"')
integer = Regex(r'[+-]?\d+').setParseAction(lambda t:int(t[0]))
float_ = Regex(r'[+-]?(\d+(\.\d*)?)?([eE][+-]?\d+)?').setParseAction(lambda t:float(t[0]))
comparison_operator = oneOf(['==','!=','>','>=','<', '<='])
comparison_expr = props + comparison_operator + (quote | float_ | integer)
comparison_expr.setParseAction(lambda t : "set(numpy.where(%s)%s%s)" % tuple(t))
grammar = Combine(operatorPrecedence(comparison_expr, [(and_operator, 2, opAssoc.LEFT) (or_operator, 2, opAssoc.LEFT)]))
# A test query
res = grammar.parseString('"mass " > 30 or (nucleon == 1)',parseAll=True)
print eval(' '.join(res._asStringList()))
My question are the following:
1 using 'transformString' instead of 'parseString' never triggers any
exception even when the string to be parsed does not match the grammar.
However, it is exactly the functionnality I need. Is there is a way to do so ?
2 I would like to reintroduce white spaces between my tokens in order
that my eval does not fail. The only way I found to do so it the one
implemented above. Would you see a better way using pyparsing ?
sorry for the long post but I wanted to introduce in deeper details its context. BTW, if you find this approach bad, do not hesitate to tell it me!
thank you very much for your help.
Eric
do not worry about my concern, I found a work around. I used the SimpleBool.py example shipped with pyparsing (thanks for the hint Paul).
Basically, I used the following approach:
1 for each subquery (i.e. mass > 10), using the setParseAction method,
I joined a function that returns the set of eleements that matched
the subquery
2 then, I joined the following functions for each logical operator (and,
or and not):
def not_operator(token):
_, s = token[0]
# ELEMENTS is the singleton described in my original post
return set(ELEMENTS.keys()).difference(s)
def and_operator(token):
s1, _, s2 = token[0]
return (s1 and s2)
def or_operator(token):
s1, _, s2 = token[0]
return (s1 or s2)
# Thanks for Paul for the hint.
grammar = operatorPrecedence(comparison_expr,
[(not_token, 1,opAssoc.RIGHT,not_operator),
(and_token, 2, opAssoc.LEFT,and_operator),
(or_token, 2, opAssoc.LEFT,or_operator)])
Please not that these operators acts upon python sets rather than
on booleans.
And that does the job.
I hope that this approach will help anyone of you.
Eric

Resources