Kusto: How to summarize calculated data - azure-data-explorer

I have calculated start and end columns that I read from Table1, and I want to count how many events happened between each start and end time.
Input data:
let Mytable1 = datatable (Vin:string, Start_Time:datetime, End_Time:datetime)
["ABC", datetime(2021-03-18 08:49:08.467), datetime(2021-03-18 13:32:28.000),
 "ABC", datetime(2021-03-18 13:41:59.323), datetime(2021-03-18 13:41:59.323),
 "ABC", datetime(2021-03-18 13:46:59.239), datetime(2021-03-18 14:58:02.000)];
let Mytable2 = datatable (Vin:string, Timestamp:datetime)
["ABC", datetime(2021-03-18 08:49:08.467),
 "ABC", datetime(2021-03-18 08:59:08.466),
 "ABC", datetime(2021-03-18 09:04:08.460),
 "ABC", datetime(2021-03-18 13:24:27.0000000)];
Query:
let Test = Mytable1
| where Vin == "ABC"
| distinct Vin, Start_Time, End_Time;
let min1 = toscalar(Test | summarize min(Start_Time));
let max1 = toscalar(Test | summarize max(End_Time));
Mytable2
| where Vin == "ABC" and Timestamp between (todatetime(min1) .. todatetime(max1))
| join kind=fullouter Test
    on $left.Vin == $right.Vin and $left.Timestamp == $right.Start_Time
| summarize Events = count() by Timestamp, Vin, Start_Time, End_Time
| project Timestamp, Start_Time, End_Time, Events
The output of the above query is not what I expect. My expected output is the count of events that fall between each Start_Time and End_Time.

You should not have the timestamp in your final aggregation. A working example could look like this:
let measurement_range = datatable (vin:string, start_time:datetime, end_time:datetime)
["ABC",datetime(2021-03-18 08:49:08.467),datetime(2021-03-18 13:32:28.000),
"ABC",datetime(2021-03-18 13:41:59.323),datetime(2021-03-18 13:44:59.323),
"ABC",datetime(2021-03-18 13:46:59.239),datetime(2021-03-18 14:58:02.000),
];
let measurement=datatable(vin:string,timestamp:datetime)
["ABC",datetime(2021-03-18 08:49:08.467),
"ABC",datetime(2021-03-18 08:59:08.466),
"ABC",datetime(2021-03-18 09:04:08.460),
"ABC",datetime(2021-03-18 13:42:27.0000000)];
measurement_range
| join kind=inner (measurement)
on vin
| where timestamp between (start_time..end_time)
| summarize event = count() by vin, start_time, end_time
With this you get a count for each measurement window. Note that this example produces a large intermediate result set, because the time range is only applied in the where statement after the join.
Please see the Azure Data Explorer documentation on how to optimize time window joins (the example above is not efficient for larger datasets).
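For reference, here is a minimal sketch of that time-window-join idea, adapted to the tables above. It assumes a fixed bucket size (the 1h value is illustrative and should be chosen based on your data); the point is to join on a coarse, discretized time key first, and only then apply the exact range filter.
let lookupBin = 1h;  // illustrative bucket size
measurement_range
// expand each range into the set of buckets it covers
| extend timeKey = range(bin(start_time, lookupBin), bin(end_time, lookupBin), lookupBin)
| mv-expand timeKey to typeof(datetime)
| join kind=inner (
    measurement
    | extend timeKey = bin(timestamp, lookupBin)  // each event falls into exactly one bucket
) on vin, timeKey
| where timestamp between (start_time .. end_time)  // exact filter after the coarse join
| summarize event = count() by vin, start_time, end_time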

Related

table counterpart of column_ifexists()

We have a function column_ifexists(), which refers to a given column if it exists and otherwise falls back to an alternative we provide. Is there a similar function for tables? I want to reference a table and run some logic against it in the query if the table exists, but if it doesn't exist there shouldn't be a failure -- it should simply return no data.
e.g.
table_ifexists('sometable') | ...<logic>...
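For reference, this is roughly how column_ifexists() behaves for columns (a minimal sketch; the Severity column and the "unknown" fallback are made up for illustration):
datatable (EventType:string) ["Drought"]
| extend Severity = column_ifexists("Severity", "unknown")  // Severity doesn't exist, so "unknown" is used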
Please note that the fields referenced in the query should be defined in the dummy table; otherwise, if the table does not exist, the query will fail with an exception such as:
Failed to resolve scalar expression named '...'
In the following examples these fields are StartTime, EndTime, and EventType.
Table exists
let requested_table = "StormEvents";
let dummy_table = datatable (StartTime:datetime, EndTime:datetime, EventType:string)[];
union isfuzzy=true table(requested_table), dummy_table
| where EndTime - StartTime > 30d
| summarize count() by EventType
EventType    count_
Drought      1635
Flood        20
Heat         14
Wildfire     4
Table does not exist
let requested_table = "StormEventsXXX";
let dummy_table = datatable (StartTime:datetime, EndTime:datetime, EventType:string)[];
union isfuzzy=true table(requested_table), dummy_table
| where EndTime - StartTime > 30d
| summarize count() by EventType
EventType    count_
(no rows)

Kusto Query Dynamic sort Order

I have started working with Azure Data Explorer (Kusto) recently.
My requirement is to make the sort order of a Kusto table dynamic.
// Variable declaration
let SortColumn ="run_date";
let OrderBy="desc";
// Actual Code
tblOleMeasurments
| take 10
|distinct column1,column2,column3,run_date
|order by SortColumn OrderBy
The code works fine up to [SortColumn], but when I try to add [OrderBy] after [SortColumn], Kusto gives me an error.
My requirement here is to pass the asc/desc value from the variable [OrderBy].
Any workarounds or solutions would be appreciated.
The sort column and sort order cannot be expressions; they must be literals ("asc" or "desc" for the order). If you want to pass the sort column and sort order as variables, create a union instead, where filtering on the variables yields the desired outcome. Here is an example:
let OrderBy = "desc";
let sortColumn = "run_date";
let Query = tblOleMeasurments | take 10 |distinct column1,column2,column3,run_date;
union
(Query | where OrderBy == "desc" and sortColumn == "run_date" | order by run_date desc),
(Query | where OrderBy == "asc" and sortColumn == "run_date" | order by run_date asc)
The number of union legs would be the number of candidate sort columns times two (one leg per sort order).
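For instance, supporting a second candidate sort column (column3 below, taken from the query above, is just an illustration) adds two more legs:
union
(Query | where OrderBy == "desc" and sortColumn == "run_date" | order by run_date desc),
(Query | where OrderBy == "asc"  and sortColumn == "run_date" | order by run_date asc),
(Query | where OrderBy == "desc" and sortColumn == "column3"  | order by column3 desc),
(Query | where OrderBy == "asc"  and sortColumn == "column3"  | order by column3 asc)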
An alternative would be sorting by a calculated column based on your sort_order and sort_column. Since order by defaults to descending, negating the value when an ascending order is requested gives the right direction. The example below works for numeric columns:
let T = range x from 1 to 5 step 1 | extend y = -10 * x;
let sort_order = "asc";
let sort_column = "y";
T
| order by column_ifexists(sort_column, "") * case(sort_order == "asc", -1, 1)

KQL Join on max value

I need to join on a table and return the row with the MAX value from that right-hand table. I tried to mock it up using datatable but failed miserably :(. I'll try to describe it in words.
let T1 = datatable(ID:int, Properties:string, ConfigTime:datetime) [1, 'a,b,c', datetime(2021-03-04 00:00:00)];
let T2 = datatable(ID:int, Properties:string, ConfigTime:datetime) [2, 'a,b,c', datetime(2021-03-02 00:00:00), 3, 'a,b', datetime(2021-03-01 00:00:00), 4, 'c', datetime(2021-03-20 00:00:00)];
I'm using this as an update policy on T2, which has a source of T1. So I want to select the rows from T1 and then join the rows from T2 that have the highest timestamp. My first attempt was below:
T1 | join kind=inner T2 on Id
| summarize arg_max(ConfigTime1, Id, Properties, Properties1, ConfigTime) by Id
| project Id, Properties, ConfigTime
In my actual update policy, I merge the properties from T1 and T2 then write to T2, but for simplicity, I've left that for now.
Currently, I'm not getting any output in my T2 from the update policy. Any guidance on another way I should be doing this would be appreciated. Thanks
It seems that you want to push the arg_max calculation into the T2 side of the join, something like this:
T1
| join kind=inner (
    T2
    | summarize arg_max(ConfigTime, Properties) by ID
    | project ID, Properties, ConfigTime
) on ID
Note that to ensure acceptable performance you want to limit the timeframe for the arg_max search, so you should consider adding a time-based filter before the arg_max.
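A minimal sketch of that filter, assuming a one-day lookback is enough for your data (the 1d value is illustrative):
T1
| join kind=inner (
    T2
    | where ConfigTime > ago(1d)  // limit how far back the arg_max has to search
    | summarize arg_max(ConfigTime, Properties) by ID
) on ID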
I think what you're looking for is a union
let T1 = datatable(ID:int, Properties:string, ConfigTime:datetime) [
1,'a,b,c','2021-03-04 00:00:00'
];
let T2 = datatable(ID:int, Properties:string, ConfigTime:datetime) [
2,'a,b,c','2021-03-02 00:00:00',
3,'a,b','2021-03-01 00:00:00',
4,'c','2021-03-20 00:00:00'
];
Here is an example using a variable with summarize max:
let Latest = toscalar(T2 | summarize max(ConfigTime));
T1
| union (T2 | where ConfigTime == Latest)
The result will keep the entries from T1 and only the latest entries from T2.
If this doesn't reflect your expected results please show your expected output.

Calculate Count of users every month in Kusto query language

I have a table named tab1:
Timestamp    Username    sessionid
12-12-2020   Ravi        abc123
12-12-2020   Hari        oipio878
12-12-2020   Ravi        ytut987
11-12-2020   Ram         def123
10-12-2020   Ravi        jhgj54
10-12-2020   Shiv        qwee090
10-12-2020   bob         rtet4535
30-12-2020   sita        jgjye56
I want to count the number of distinct Usernames per day, so that the output would be:
day           count
10-12-2020    3
11-12-2020    1
12-12-2020    2
30-12-2020    1
Tried query:
tab1
| where timestamp > datetime(01-08-2020)
| range timestamp from datetime(01-08-2020) to now() step 1d
| extend day = dayofmonth(timestamp)
| distinct Username
| count
| project day, count
To get a very close estimate of the number of usernames per day, just run this (the number won't be exact, since dcount() is an approximation; see the dcount() documentation for details):
tab1
| summarize dcount(Username) by bin(Timestamp, 1d)
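If the estimate needs to be tighter, dcount() also accepts an optional accuracy level (0-4, where 4 is the most accurate and the most expensive); for example:
tab1
| summarize dcount(Username, 4) by bin(Timestamp, 1d)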
If you want accurate results, then you should do this (just note that the query will be less performant than the previous one, and will only work if you have up to 1,000,000 usernames / day):
tab1
| summarize make_set(Username) by bin(Timestamp, 1d)
| project Timestamp, Count = array_length(set_Username)
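Another exact option, which avoids building per-day sets (and with them the 1,000,000-usernames-per-day limit of make_set()), is to deduplicate first and then count the deduplicated rows; a minimal sketch:
tab1
| summarize by Day = bin(Timestamp, 1d), Username  // one row per (day, user) combination
| summarize Count = count() by Day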

query with max and second factor [duplicate]

I have:
TABLE MESSAGES
message_id | conversation_id | from_user | timestamp | message
I want to:
1. SELECT * WHERE from_user <> id
2. GROUP BY conversation_id
3. select, in every group, the row with MAX(timestamp) (if two rows in a group share the same timestamp, use the highest message_id as the tie-breaker)
4. then sort the results by timestamp
to get this result:
2|145|xxx|10000|message
6|1743|yyy|999|message
7|14|bbb|899|message
with these rows eliminated:
1|145|xxx|10000|message <- has the same timestamp (10000) as message 2 and belongs to the same conversation (145), but its message_id is lower
5|1743|me|1200|message <- has message_from = 'me'
Here is an example group with the same timestamp; I want row 3 from this group, but the query below gives me row 2:
SELECT max(message_timestamp), message_id, message_text, message_conversationId
FROM MESSAGES
WHERE message_from <> 'me'
GROUP BY message_conversationId
ORDER by message_Timestamp DESC
What I have in mind is to combine message_id and timestamp into one value and then take the max of that -- is that the way to go?
Your query is based on a non-standard use of GROUP BY (I think SQLite allows that only for compatibility with MySQL), and I'm not at all sure that it will produce deterministic results every time.
It also uses MAX() on concatenated columns. Unless you somehow ensure that the two concatenated columns have fixed widths, the results will not be accurate for that reason either: for example, '999' || '2' = '9992' compares as greater than '1000' || '5' = '10005' in a text comparison, even though the second timestamp is larger.
I would write the query like this:
SELECT
m.message_timestamp,
m.message_id,
m.message_text,
m.message_conversationId
FROM
( SELECT message_conversationId -- for every conversation
FROM messages as m
WHERE message_from <> 'me'
GROUP BY message_conversationId
) AS mc
JOIN
messages AS m -- join to the messages
ON m.message_id =
( SELECT mi.message_id -- and find one message id
FROM messages AS mi
WHERE mi.message_conversationId -- for that conversation
= mc.message_conversationId
AND mi.message_from <> 'me'
ORDER BY mi.message_timestamp DESC, -- according to the
mi.message_id DESC -- specified order
LIMIT 1 -- (this is the one part)
) ;
Try the SQL below, which achieves this by grouping twice.
select m.*
from
Messages m
-- 3. finally, join back to Messages to get the full output columns
inner join
(
-- 2. then pick the max timestamp (and its message_id) per conversation
select conversation_id, max(timestamp), message_id
from
(
-- 1. first take the max message_id per (conversation_id, timestamp) pair
select conversation_id, timestamp, max(message_id) message_id
from Messages
where from_user <> 'me'
group by conversation_id, timestamp
) max_mid
group by conversation_id
) max_mid_ts on max_mid_ts.message_id = m.message_id
order by m.message_id;
http://goo.gl/MyZjyU
OK, it was simpler than I thought: basically, change the select from:
max(message_timestamp)
to:
max(message_timestamp || message_id)
or max(message_timestamp + message_id)
so it searches for the max of the concatenation (or sum) of timestamp and message_id.
P.S. After some digging: this only works if message_id grows with timestamp (i.e. insertion order is preserved).
Edit: so why does this work?
SELECT max(message_timestamp+message_id), message_timestamp, message_id, message_conversationId, message_from,message_text
FROM MESSAGES
WHERE message_conversationId = 1521521
AND message_from <> 'me'
ORDER by message_Timestamp DESC
