Kusto - Avgif, Min, Max and Median - azure-data-explorer

I am converting the below Splunk query to Kusto:
avg(eval(if(Test="Success", Duration, null()))) as AvgDuration
This query returns the average of Duration when Test is "Success", and null otherwise. Could you please advise whether the below Kusto query will return the same result? I don't see the numbers matching.
| summarize AvgDuration = avgif (Duration, Test = "Success")
Also, how do I calculate the Min, Max, and Median with the same condition? Thanks.

Note that equality in a Kusto predicate is ==, not the single = used in your attempt. For min and max you can use the minif() and maxif() aggregation functions:
let T = datatable(Test:string, Duration:timespan)["Success", timespan(05:03:01.78),"Success", timespan(15:00:06.28),"Success", timespan(02:03:05.98),"Fail", timespan(00:03:01.28)];
T
| summarize AvgDuration = avgif(Duration, Test == "Success"),
            MinDuration = minif(Duration, Test == "Success"),
            MaxDuration = maxif(Duration, Test == "Success")
AvgDuration | MinDuration | MaxDuration
07:22:04.6800000 | 02:03:05.9800000 | 15:00:06.2800000
The percentile() aggregation function does not have an "if" variant, so you will need a separate calculation for it. The simplest approach is to filter before the aggregation, for example:
let T = datatable(Test:string, Duration:timespan)["Success", timespan(05:03:01.78),"Success", timespan(15:00:06.28),"Success", timespan(02:03:05.98),"Fail", timespan(00:03:01.28)];
T
| where Test == "Success"
| summarize AvgDuration = avg(Duration),
            MinDuration = min(Duration),
            MaxDuration = max(Duration),
            Median = percentile(Duration, 50)
AvgDuration | MinDuration | MaxDuration | Median
07:22:04.6800000 | 02:03:05.9800000 | 15:00:06.2800000 | 05:03:01.7800000
However, sometimes you want aggregations over the full dataset at the same time as the conditional aggregation. In that case you will need to run two queries and join them. For example, say that you want to include the full count:
let T = datatable(Test:string, Duration:timespan)["Success", timespan(05:03:01.78),"Success", timespan(15:00:06.28),"Success", timespan(02:03:05.98),"Fail", timespan(00:03:01.28)];
let T1 = T
| summarize AvgDuration = avgif(Duration, Test == "Success"),
            MinDuration = minif(Duration, Test == "Success"),
            MaxDuration = maxif(Duration, Test == "Success"),
            TotalCount = count()
| extend Dummy = 1;
let T2 = T
| where Test == "Success"
| summarize Median = percentile(Duration, 50)
| extend Dummy = 1;
T1
| lookup T2 on Dummy
| project-away Dummy
AvgDuration | MinDuration | MaxDuration | TotalCount | Median
07:22:04.6800000 | 02:03:05.9800000 | 15:00:06.2800000 | 4 | 05:03:01.7800000
If there is heavy processing before the aggregation, you might want to consider wrapping the calculation of T in the materialize() function so that it only runs once.
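For example, a minimal sketch (ExpensiveSource and its columns are hypothetical placeholders for whatever heavy processing produces T):
let T = materialize(
    ExpensiveSource                  // hypothetical heavy upstream query
    | where Timestamp > ago(1d)
    | project Test, Duration
);
let T1 = T
| summarize AvgDuration = avgif(Duration, Test == "Success"),
            TotalCount = count()
| extend Dummy = 1;
let T2 = T
| where Test == "Success"
| summarize Median = percentile(Duration, 50)
| extend Dummy = 1;
T1
| lookup T2 on Dummy
| project-away Dummy
Because T is materialized, the expensive part is evaluated once and its cached result is reused by both T1 and T2.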

Related

Kusto / Azure Data Explorer - Distinct count in kusto queries

I'm using Application Insights with a customEvent and need to get the number of events with a distinct field.
The event looks something like this:
{
"statusCode" : 200,
"some_field": "ABC123QWERTY"
}
I want the number of unique some_field values with statusCode 200. I've looked at this question and tried a couple of different queries, some of which give different answers. In SQL it would look something like this:
SELECT COUNT(DISTINCT my_field) AS Count
FROM customEvents
WHERE statusCode=200
Which one is correct?
1 - dcount with default accuracy
customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize dcount(tostring(customDimensions.some_field))
17,853 items
2 - Count by my_field and count number of rows
customEvents
| extend my_field = tostring(customDimensions.some_field)
| where customDimensions.statusCode == 200 and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize Count = count() by my_field
17,774 items.
3 - summarize with by some_field
customEvents
| extend StatusCode = tostring(customDimensions["statusCode"]), MyField = tostring(customDimensions["some_field"])
| where timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize any(StatusCode) by MyField
| summarize Count = count() by any_StatusCode
17,626 items.
4 - dcount with higher accuracy?
customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize dcount(tostring(customDimensions.some_field),4)
17,736 items
5 - count_distinct from preview
customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize count_distinct(tostring(customDimensions.some_field))
17,744 items
The documentation on learn.microsoft.com states:
Use dcount and dcountif to count distinct values in a specific column.
And dcount-aggfunction mentions the accuracy:
Returns an estimate of the number of distinct values of expr in the group.
count_distinct seems to be the correct way:
Counts unique values specified by the scalar expression per summary group, or the total number of unique values if the summary group is omitted.
count_distinct() is a new KQL function that returns an accurate result.
dcount() returns an approximate result.
It can be used with a 2nd argument, a constant integer with value 0, 1, 2, 3 or 4 (0 = fast, 1 = default, 2 = accurate, 3 = extra accurate, 4 = super accurate). In your first example ("1 - dcount with default accuracy") no 2nd argument is used, so the default level of 1 applies; your fourth example passes 4, the highest level.
Higher accuracy means higher accuracy statistically: the error is bound to a lower value.
Theoretically (and in practice) dcount() with lower accuracy may in some scenarios yield a result that is closer to the real number than dcount() with higher accuracy.
Having said that, I would guess that you executed your queries with a UI filter of "last 24 hours" or something similar, which means that each execution ran over a different timespan.
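Since the original runs may have covered shifting windows, a side-by-side comparison over one fixed range is more telling. A minimal sketch, assuming the same customEvents schema as above, that computes the estimate and the exact count in a single pass:
customEvents
| extend my_field = tostring(customDimensions.some_field)
| where customDimensions.statusCode == 200
    and timestamp between (startofday(datetime(2022-12-01)) .. endofday(datetime(2022-12-31)))
| summarize Approximate = dcount(my_field, 4),  // estimate, accuracy level 4
            Exact = count_distinct(my_field)    // exact, but more resource-intensive
Because both aggregations see exactly the same rows, any remaining difference is the dcount() estimation error.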

Return decimal value using Kusto Query Language that is always returning 0

I have a Kusto query that I am using to query Application Insights. The goal is to get the number of failed requests in 5-minute buckets and divide it by the total number of requests in the same bucket. I will eventually build an alert that triggers if this percentage is greater than a certain value, but I can't seem to get the query right.
In the example below, I hardcode a specific timestamp to make sure I get some failures.
Here is the query:
let fn = "APP_NAME";
requests
| where success == "False" and cloud_RoleName == fn
| summarize failed=sum(itemCount) by bin(timestamp, 5m)
| where timestamp == "2021-05-17T20:20:00Z"
| join (
requests
| where cloud_RoleName == fn
| summarize reqs=sum(itemCount) by bin(timestamp, 5m)
| where timestamp == "2021-05-17T20:20:00Z"
) on timestamp
| project timestamp, failed, reqs
| extend p=round(failed/reqs, 2)
It currently returns:
timestamp [UTC] | p | failed | reqs
5/17/2021, 8:20:00.000 PM | 0 | 1,220 | 6,649
Can anyone give me insight into how to get the decimal value (~0.18) I expect for p?
I had to cast the values to doubles to get it to return something other than 0:
let fn = "APP_NAME";
requests
| where success == "False" and cloud_RoleName == fn
| summarize failed=sum(itemCount) by bin(timestamp, 5m)
| join (
requests
| where cloud_RoleName == fn
| summarize reqs=sum(itemCount) by bin(timestamp, 5m)
) on timestamp
| project timestamp, failedReqs=failed, totalRequests=reqs, percentage=(todouble(failed) / todouble(reqs) * 100)
Another option that is a bit less verbose is to multiply by 100.0 (which is a real literal):
percentage = failed * 100.0 / reqs
Note that the multiplication has to happen before the division; otherwise the integer division runs first and the result is truncated to 0.
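You can see the truncation with a quick print, using the numbers from the question:
print wrong = 1220 / 6649 * 100.0,  // integer division runs first: 1220 / 6649 = 0
      right = 1220 * 100.0 / 6649   // 100.0 promotes the expression to real: ~18.35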

Kusto: How to convert table value to scalar and return from user defined function

I have the following user-defined functions with the intention of using a case conditional to output a table of 0s or 1s saying whether or not an account is active.
case() needs scalar values as its arguments, i.e. pro_account_active(account) and basic_account_active(account) need to be scalar values.
I'm struggling to get around this limitation of toscalar():
User-defined functions can't pass into toscalar() invocation information that depends on the row-context in which the function is called.
I think it would work if there were a function I could use in place of the "??????" that would convert active to a scalar and return it from the function.
Any help greatly appreciated.
let basic_account_active=(account:string) {
basic_check_1(account) // returns 0 or 1 row only
| union basic_check_2(account)
| summarize result_count = count()
| extend active = iff(result_count == 2, 1, 0)
| ??????
};
let pro_account_active=(account:string) {
pro_check_1(account) // returns 0 or 1 row only
| union pro_check_2(account)
| summarize result_count = count()
| extend active = iff(result_count == 2, 1, 0)
| ??????
};
let is_active=(account_type:string, account:string) {
case(
account_type == 'pro', pro_account_active(account),
account_type == 'basic', basic_account_active(account),
-1
)
};
datatable(account_type:string, account:string)
[
'pro', '89e5678a92',
'basic', '9d8263da45',
'pro', '0b975f2454a',
'basic', '112a3f4753',
]
| extend result = is_active(account_type, account)
You can convert the output of a query to a scalar by using the toscalar() function, i.e.:
let basic_account_active=(account:string) {
toscalar(basic_check_1(account) // returns 0 or 1 row only
| union basic_check_2(account)
| summarize result_count = count()
| extend active = iff(result_count == 2, 1, 0))
};
From your example it looks like you have two tables per account type, and if both have entries for a specific account then the account is considered active. Is that correct? If so, I would use the join operator to find all the entries in the applicable tables and count them. Here is an example of one way to do it (there are other ways as well):
let basicAccounts1 = datatable(account_type:string, account:string)[ 'basic', '9d8263da45', 'basic', '111111'];
let basicAccounts2 = datatable(account_type:string, account:string)[ 'basic', '9d8263da45', 'basic', '222222'];
let proAccounts1 = datatable(account_type:string, account:string)[ 'pro', '89e5678a92', 'pro', '111111'];
let proAccounts2 = datatable(account_type:string, account:string)[ 'pro', '89e5678a92', 'pro', '222222'];
let AllAccounts = union basicAccounts1, basicAccounts2, proAccounts1, proAccounts2
| summarize count() by account, account_type;
datatable(account_type:string, account:string)
[
'pro', '89e5678a92',
'basic', '9d8263da45',
'pro', '0b975f2454a',
'basic', '112a3f4753',
]
| join kind=leftouter hint.strategy=broadcast (AllAccounts) on account, account_type
| extend IsActive = count_ >= 2
| project-away count_, account1, account_type1
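One detail to watch with kind=leftouter: accounts that match nothing on the right side get a null count_, and null >= 2 evaluates to null rather than false. If you want a clean true/false flag, coalesce the count first; a small variation on the last two lines above:
| extend IsActive = coalesce(count_, 0) >= 2  // unmatched rows become false instead of null
| project-away count_, account1, account_type1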

computing offset for prev dynamically

I want to set the offset for prev() dynamically, based on the number of items in a group, e.g.:
T
| make-series value = sum(value) on timestamp from .. to .. step 5m by customer
| summarize by bin(timestamp,1h), customer
| extend prev_value = prev(value,<offset>)
The offset here should be equal to the number of distinct customers. How can I compute this offset dynamically?
If you can split the query into small parts, you can use the toscalar() function to get the number of unique customers. This would be my approach:
let tab_series =
T
| make-series value = sum(value) on timestamp from .. to .. step 5m by customer
;
let no_of_distinct_customers =
toscalar(tab_series | distinct customer | summarize count())
;
tab_series
| summarize by bin(timestamp, 1h), customer
| extend prev_value = prev(value, no_of_distinct_customers)
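One caveat: prev() only works on a serialized row set, and make-series packs timestamp and value into arrays (one row per customer). A sketch of the extra steps that may be needed before the extend, assuming value is numeric:
tab_series
| mv-expand timestamp to typeof(datetime), value to typeof(long)  // back to one row per data point
| order by timestamp asc, customer asc                            // serializes the rows so prev() is allowed
| extend prev_value = prev(value, no_of_distinct_customers)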

Use values from one table in the bin operator of another table

Consider the following query, which generates a single-cell result for a fixed value of bin_duration:
events
| summarize count() by id, bin(time, bin_duration)
| count
I wish to generate a table with variable values of bin_duration.
bin_duration will take values from the following table:
range bin_duration from 0 to 600 step 10;
So that the final table has one row per value of bin_duration with the corresponding count. How do I go about achieving this?
Thanks
bin(value, roundTo), aka floor(value, roundTo), will round value down to the nearest multiple of roundTo, so you don't need an external table:
events
| summarize n = count() by bin(duration,10)
| where duration between(0 .. 600)
| order by duration asc
You can try this out on the StormEvents tutorial table:
let events = StormEvents | extend duration = (EndTime - StartTime) / 1h;
events
| summarize n = count() by bin(duration,10)
| where duration between(0 .. 600)
| order by duration asc
When dealing with time series data, bin() also understands the handy timespan literals, e.g.:
let events = StormEvents | extend duration = (EndTime - StartTime);
events
| summarize n = count() by bin(duration,10h)
| where duration between(0h .. 600h)
| order by duration asc
