unpivot (wide to long) with dynamic column names - azure-data-explorer

I need a function wide_to_long that turns a wide table into a long table and that accepts an argument id_vars for which the values have to be repeated (see example).
Sample input
let T_wide = datatable(name: string, timestamp: datetime, A: int, B: int) [
'abc','2022-01-01 12:00:00',1,2,
'def','2022-01-01 13:00:00',3,4
];
Desired output
Calling wide_to_long(T_wide, dynamic(['name', 'timestamp'])) should produce the following table.
let T_long = datatable(name: string, timestamp: datetime, variable: string, value: int) [
'abc','2022-01-01 12:00:00','A',1,
'abc','2022-01-01 12:00:00','B',2,
'def','2022-01-01 13:00:00','A',3,
'def','2022-01-01 13:00:00','B',4
];
Attempt
I've come pretty far with the following code.
let wide_to_long = (T:(*), id_vars: dynamic) {
// get names of keys to remove later
let all_columns = toscalar(T | getschema | summarize make_list(ColumnName));
let remove = set_difference(all_columns, id_vars);
// expand columns not contained in id_vars
T
| extend packed1 = pack_all()
| extend packed1 = bag_remove_keys(packed1, id_vars)
| mv-expand kind=array packed1
| extend variable = packed1[0], value = packed1[1]
// remove unwanted columns
| project packed2 = pack_all()
| project packed2 = bag_remove_keys(packed2, remove)
| evaluate bag_unpack(packed2)
| project-away packed1
};
The problems are that the solution feels clunky (is there a better way?) and the columns in the result are ordered randomly. The second issue is minor, but annoying.

Related

Filter columns by dashboard multi-select parameter

I'm trying to render a time series, but I have too many columns to show by default. To remedy this I figured I would present the user with a multi-select of all the columns and downselect the columns I render to that list, but I can't for the life of me figure out or find an answer on how to do it.
I have data with, say, columns Time, X1, X2, ... X120 and a multi-select parameter _columns of that table | getschema | project ColumnName | where ColumnName != "Time". I want to project to Time and the contents of _columns.
I can only find how to filter rows based on some column's value vs the multi-select. I feel like I'm missing something very simple.
Updated
There is also a simple solution for data that looks something like that:
This kind of data might be created by a make-series operator with multiple aggregation functions, e.g. -
make-series series_001 = count(), series_002 = min(x), series_003 = sum(x), series_004 = avg(x), series_005 = countif(type == 1), series_006 = countif(subtype == 123) on Timestamp from ago(7d) to now() step 1d
// Data sample generation, including series creation.
// Not part of the solution.
let p_series_num = 100;
let data = materialize
(
range i from 1 to p_series_num step 1
| project series_name = strcat("series_", substring(strcat("00", i), -3))
| mv-apply range(1, 7, 1) on (summarize make_list(rand()))
| evaluate pivot(series_name, take_any(list_))
| extend Timestamp = range(now() - 6d, now(), 1d)
| project-reorder Timestamp, * granny-asc
);
// Solution starts here
// We assume the creation of a parameter named _series, in the dashboard
// Uncomment the following line when executed outside the context of the dashboard
let ['_series'] = 'series_001';
data
| project Timestamp, column_ifexists(['_series'], real(null))
| render timechart
Timestamp
series_001
["2022-10-08T15:59:51.4634127Z","2022-10-09T15:59:51.4634127Z","2022-10-10T15:59:51.4634127Z","2022-10-11T15:59:51.4634127Z","2022-10-12T15:59:51.4634127Z","2022-10-13T15:59:51.4634127Z","2022-10-14T15:59:51.4634127Z"]
["0.35039128090096505","0.79027849410190631","0.023939659111657484","0.14207071795033441","0.91242141133745414","0.33016368441829869","0.50674771943297525"]
Fiddle
This solution supports multi-selection.
The original data looks something like that:
// Data sample generation. Not part of the solution.
let p_start_time = startofday(ago(1d));
let p_interval = 5m;
let p_rows = 15;
let p_cols = 120;
let data = materialize
(
range Timestamp from p_start_time to p_start_time + p_rows * p_interval step p_interval
| mv-expand MetricID = range(1, p_cols) to typeof(int)
| extend MetricVal = rand(), MetricName = strcat("x", tostring(MetricID))
| evaluate pivot(MetricName, take_any(MetricVal), Timestamp)
| project-reorder Timestamp, * granny-asc
);
// Solution starts here
// We assume the creation of a parameter named _MetricName, in the dashboard
// Uncomment the following line when executed outside the context of the dashboard
let ['_series'] = dynamic(['x1', 'x3', 'x7', 'x100', 'x120']);
data
| project Timestamp, pa = pack_all()
| project Timestamp, cols = bag_remove_keys(pa, set_difference(bag_keys(pa), _series))
| evaluate bag_unpack(cols)
| render timechart
Timestamp
x1
x100
x120
x3
x7
2022-10-20T00:00:00Z
0.40517703772298719
0.86952520047109094
0.67458442932790974
0.20662741864260561
0.19230161743580523
2022-10-20T00:05:00Z
0.098438188653858671
0.14095230636982198
0.10269711129443576
0.99361020447683746
0.093624077251808144
2022-10-20T00:10:00Z
0.3779198279036311
0.095647188329308852
0.38967218915903867
0.62601873006422182
0.18486009896828509
2022-10-20T00:15:00Z
0.141551736845493
0.64623737123678782
... etc.
Fiddle
It might be very simple, if our tabular data (post creation of the series) looks something like that:
This kind of data might be created by a make-series operator with by clause, e.g. -
make-series count() on Timestamp from ago(7d) to now() step 1d by series_name
In that case, all we need to do is add a filter on the series name, E.g. -
// Data sample generation, including series creation.
// Not part of the solution.
let p_series_num = 100;
let data = materialize
(
range i from 1 to 1000000 step 1
| extend Timestamp = ago(rand()*7d)
,series_name = strcat("series_", substring(strcat("00", tostring(toint(rand(p_series_num)))), -3))
| make-series count() on Timestamp from ago(7d) to now() step 1d by series_name
);
// Solution starts here
// We assume the creation of a parameter named _series, in the dashboard
// Uncomment the following line when executed outside the context of the dashboard
let ['_series'] = 'series_001';
data
| where series_name == _series
| render timechart
series_name
count_
Timestamp
series_001
[1434,1439,1430,1428,1422,1372,1475]
["2022-10-07T15:54:57.3677580Z","2022-10-08T15:54:57.3677580Z","2022-10-09T15:54:57.3677580Z","2022-10-10T15:54:57.3677580Z","2022-10-11T15:54:57.3677580Z","2022-10-12T15:54:57.3677580Z","2022-10-13T15:54:57.3677580Z"]
Fiddle
Here is a solution that match the data structure in your scenario.
* It is the same solution is the other solution I just modified, but since the source data structure is different, I posted an additional answer for learning purposes.
The original data looks something like that:
The code is actually very simple, leveraging column_ifexists():
// Data sample generation. Not part of the solution.
let p_start_time = startofday(ago(1d));
let p_interval = 5m;
let p_rows = 15;
let p_cols = 120;
let data = materialize
(
range Timestamp from p_start_time to p_start_time + p_rows * p_interval step p_interval
| mv-expand MetricID = range(1, p_cols) to typeof(int)
| extend MetricVal = rand(), MetricName = strcat("x", tostring(MetricID))
| evaluate pivot(MetricName, take_any(MetricVal), Timestamp)
| project-reorder Timestamp, * granny-asc
);
// Solution starts here
// We assume the creation of a parameter named _MetricName, in the dashboard
// Uncomment the following line when executed outside the context of the dashboard
let ['_MetricName'] = "x42";
data
| project Timestamp, column_ifexists(['_MetricName'], real(null))
| render timechart
Timestamp
x42
2022-10-13T00:00:00Z
0.89472385054721115
2022-10-13T00:05:00Z
0.11275174098360444
2022-10-13T00:10:00Z
0.96233152692333268
2022-10-13T00:15:00Z
0.21751913633816042
2022-10-13T00:20:00Z
0.69591667527850931
2022-10-13T00:25:00Z
0.36802228024058203
2022-10-13T00:30:00Z
0.29060518653083045
2022-10-13T00:35:00Z
0.13362332423562559
2022-10-13T00:40:00Z
0.013920161307282448
2022-10-13T00:45:00Z
0.05909880950497
2022-10-13T00:50:00Z
0.146454957813311
2022-10-13T00:55:00Z
0.318823204227693
2022-10-13T01:00:00Z
0.020087435985750794
2022-10-13T01:05:00Z
0.31110660126024159
2022-10-13T01:10:00Z
0.75531136771424379
2022-10-13T01:15:00Z
0.99289833682620265
Fiddle

How can I access elements in the make_list() array in Azure Data Explorer

I have a kusto query which summarize an array based on values in an id column.
The query is
Table
| where {some condition}
| extend d = parse_json(events)
| mv-expand d
| extend value=parse_json(d)["intValue"]
| project value, Id
| summarize make_list(value) by Id
The result is
Id
value
ab8e
[1664379494002, 1664379485020, 1664379487998, 1664379473022]
3dBc
[ 1664383366022,1664383372025, 1664381572017, 1664381566025]
My question is can I do math operation in the array 'value'?
e.g. I want to array[1] - array[0] + array[1] - array[2] for each row?
yes, you can do that:
datatable(Id:string, value:dynamic)
[
'ab8e', dynamic([1664379494002, 1664379485020, 1664379487998, 1664379473022]),
'3dBc', dynamic([ 1664383366022,1664383372025, 1664381572017, 1664381566025]),
]
| extend result = tolong(value[1]) - tolong(value[0]) + tolong(value[1]) - tolong(value[2])
Id
value
result
ab8e
[1664379494002,1664379485020,1664379487998,1664379473022]
-11960
3dBc
[1664383366022,1664383372025,1664381572017,1664381566025]
1806011

KQL, time difference between separate rows in same table

I have Sessions table
Sessions
|Timespan|Name |No|
|12:00:00|Start|1 |
|12:01:00|End |2 |
|12:02:00|Start|3 |
|12:04:00|Start|4 |
|12:04:30|Error|5 |
I need to extract from it duration of each session using KQL (but if you could give me suggestion how I can do it with some other query language it would be also very helpful). But if next row after start is also start, it means session was abandoned and we should ignore it.
Expected result:
|Duration|SessionNo|
|00:01:00| 1 |
|00:00:30| 4 |
You can try something like this:
Sessions
| order by No asc
| extend nextName = next(Name), nextTimestamp = next(timestamp)
| where Name == "Start" and nextName != "Start"
| project Duration = nextTimestamp - timestamp, No
When using the operator order by, you are getting a Serialized row set, which then you can use operators such as next and prev. Basically you are seeking rows with No == "Start" and next(Name) == "End", so this is what I did,
You can find this query running at Kusto Samples open database.
let Sessions = datatable(Timestamp: datetime, Name: string, No: long) [
datetime(12:00:00),"Start",1,
datetime(12:01:00),"End",2,
datetime(12:02:00),"Start",3,
datetime(12:04:00),"Start",4,
datetime(12:04:30),"Error",5
];
Sessions
| order by No asc
| extend Duration = iff(Name != "Start" and prev(Name) == "Start", Timestamp - prev(Timestamp), timespan(null)), SessionNo = prev(No)
| where isnotnull(Duration)
| project Duration, SessionNo

is it possible for better optimization of my kusto query

below is my Kusto query, it takes 2+ mins in lens dashboard to show the data, I have optimized my query to have materialize() in let statements and contains with has. is there anyother way to optimize it in a better way.
let C_masfunteams = materialize(find withsource=source in (cluster(X).database('oci-*').['TextFileLogs']) where AttemptedIngestTime > ago(7d)
and FileLineContent has "<li>Build Number:" | summarize min(AttemptedIngestTime) by source, FileLineContent);//, AttemptedIngestTime
let n = C_masfunteams | extend databaseName = extract(#"""(oci-[^""]*)""", 1, source)
| extend BuildNumber = extract(#"([A-Z]\w*\.[0-9]\d*\.[0-9]\d*\.[0-9]\d*)",1,FileLineContent)
| extend StampVersion = extract(#"([0-9]\d*\.[0-9]\d*\.[0-9]\d*\.[0-9]\d*)",1,FileLineContent)
|extend cluster ='masfunteams'
| project BuildNumber , StampVersion , min_AttemptedIngestTime
| summarize NumberOfRuns=count() , ingestedtime = min(min_AttemptedIngestTime) by BuildNumber,StampVersion;
let C_masfun= materialize(find withsource=source in (cluster(Y).database('oci-*').['TextFileLogs']) where AttemptedIngestTime > ago(7d)
and FileLineContent has "<li>Build Number:" | summarize min(AttemptedIngestTime) by source, FileLineContent);//, AttemptedIngestTime
let m = C_masfun | extend databaseName = extract(#"""(oci-[^""]*)""", 1, source)
| extend BuildNumber = extract(#"([A-Z]\w*\.[0-9]\d*\.[0-9]\d*\.[0-9]\d*)",1,FileLineContent)
| extend StampVersion = extract(#"([0-9]\d*\.[0-9]\d*\.[0-9]\d*\.[0-9]\d*)",1,FileLineContent)
|extend cluster ='masfunteams'
| project BuildNumber , StampVersion , min_AttemptedIngestTime
| summarize NumberOfRuns=count() , ingestedtime = min(min_AttemptedIngestTime) by BuildNumber,StampVersion;
let C_masvaas = materialize(find withsource=source in (cluster(z).database('oci-*').['TextFileLogs']) where AttemptedIngestTime > ago(7d)
and FileLineContent has "<li>Build Number:" | summarize min(AttemptedIngestTime) by source, FileLineContent);//, AttemptedIngestTime
let o= C_masvaas | extend databaseName = extract(#"""(oci-[^""]*)""", 1, source)
| extend BuildNumber = extract(#"([A-Z]\w*\.[0-9]\d*\.[0-9]\d*\.[0-9]\d*)",1,FileLineContent)
| extend StampVersion = extract(#"([0-9]\d*\.[0-9]\d*\.[0-9]\d*\.[0-9]\d*)",1,FileLineContent)
|extend cluster ='masfunteams'
| project BuildNumber , StampVersion , min_AttemptedIngestTime
| summarize NumberOfRuns=count() , ingestedtime = min(min_AttemptedIngestTime) by BuildNumber,StampVersion;
union isfuzzy=true m,n,o
| summarize Ingestedtime =min(ingestedtime) by BuildNumber,StampVersion
Hi the query is quite complex and without running it on the actual cluster it is hard to figure out what is the expected results. So here are a few tips:
Consider starting the union operator as the first operator with a uniform logic for the filtering, parsing and summarize operations
Consider removing the materialize() if you are only using each dataset only once
Consider removing the 'find' as you are not doing search across multiple columns, If you are using it to get the source table in your output records set, consider adding "withsource" to the union statement
If possible consider using the 'parse' operator instead of the regular expression
Hope this helps!

How to: Run a user defined function for a range of (date) values

So let’s say I want to test a function that finds outliers over past data. I’d love to end up with a table that looks like this:
Time Outliers_At_Time
<somedate> 0
<somedate + interval> 1
The function looks like this:
let OutliersAt = (TheDate:datetime) {
<… outputs zero or a positive integer>
}
My instinct would be to do something like this:
let SomeDates = range AtTime from ago(10d) to now() step 10m;
SomeDates | extend NumOutliers = OutliersAt (AtTime)
… but that gives me this error message:
Error Semantic error: '' has the following semantic error: Unresolved
reference binding: 'AtTime'. clientRequestId:
KustoWebV2;1ea28ba0-12f1-4a52-95e7-975db3310f59
Suggestions?
If you are looking on finding outliers - there is a built-in function in Kusto to do it:
https://learn.microsoft.com/en-us/azure/kusto/query/series-outliersfunction
Example:
let _data =
range Timestamp from ago(7d) to now() step 1min
| extend Value=case(rand(1000)==10, 1200.0, rand(100));
//
_data
| make-series AvgValue=avg(Value) default=0 on Timestamp in range(ago(7d), now(), 5min)
| extend outliers=series_outliers(AvgValue)
| render timechart
If the question is about general way to provide parameters to user-defined functions,
see more info here:
https://learn.microsoft.com/en-us/azure/kusto/query/functions/user-defined-functions
In particular, you can pass a serie into a user-defined-function (e.g. to get statistics):
let OutliersAt = (_serie:dynamic) {
let stats = series_stats_dynamic(_serie);
todouble(stats.max_idx) >= 0
};
let _data =
range Timestamp from ago(7d) to now() step 1min
| extend Value=case(rand(1000)==10, 1200.0, rand(100));
//
_data
| make-series AvgValue=avg(Value) default=0 on Timestamp in range(ago(7d), now(), 5min)
| extend outliers=series_outliers(AvgValue)
| project hasOutliers=OutliersAt(outliers)

Resources