Kusto / Azure Data Explorer - Distinct count in kusto queries

I'm using Application Insights with a customEvent and need to get the number of events with a distinct field.
The event looks something like this:
{
"statusCode" : 200,
"some_field": "ABC123QWERTY"
}
I want the number of unique some_field values with statusCode 200. I've looked at this question and tried several different queries, and they give different answers. In SQL it would look something like this:
SELECT COUNT(DISTINCT my_field) AS Count
FROM customEvents
WHERE statusCode=200
Which one is correct?
1 - dcount with default accuracy
customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize dcount(tostring(customDimensions.some_field))
17,853 items
2 - Count by my_field and count number of rows
customEvents
| extend my_field = tostring(customDimensions.some_field)
| where customDimensions.statusCode == 200 and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize Count = count() by my_field
17,774 items.
3 - summarize with by some_field
customEvents
| extend StatusCode = tostring(customDimensions["statusCode"]), MyField = tostring(customDimensions["some_field"])
| where timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize any(StatusCode) by MyField
| summarize Count = count() by any_StatusCode
17,626 items.
4 - dcount with higher accuracy?
customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize dcount(tostring(customDimensions.some_field),4)
17,736 items
5 - count_distinct from preview
customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize count_distinct(tostring(customDimensions.some_field))
17,744 items
The documentation on learn.microsoft.com states:
Use dcount and dcountif to count distinct values in a specific column.
And dcount-aggfunction mentions the accuracy:
Returns an estimate of the number of distinct values of expr in the group.
count_distinct seems to be the correct way:
Counts unique values specified by the scalar expression per summary group, or the total number of unique values if the summary group is omitted.

count_distinct() is a new KQL function that returns an accurate result.
dcount() returns an approximate result.
dcount() accepts an optional 2nd argument, a constant integer with value 0, 1, 2, 3 or 4 (0 = fast, 1 = default, 2 = accurate, 3 = extra accurate, 4 = super accurate).
In your examples (specifically "4 - dcount with higher accuracy?") you have not used a 2nd argument.
Higher accuracy means higher accuracy in the statistical sense: the error is bounded by a lower value.
It does not guarantee a closer result on every data set.
Theoretically (and in practice), dcount() with lower accuracy may in some scenarios yield a result that is closer to the real number than dcount() with higher accuracy.
Having said that -
I would guess that you executed your queries with a UI filter of last 24 hours or something similar.
This means that each execution ran over a different timespan.
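
To compare the exact and approximate aggregations on data where the true answer is known, a small sketch along these lines may help. It assumes a synthetic column (the name some_field and the "id-" prefix are only illustrative) generated with range so that it holds exactly 17,744 distinct values. count_distinct() should report that number, while the two dcount() calls return estimates that may land slightly above or below it.

```kusto
// Synthetic data: 100,000 rows whose some_field cycles through 17,744 distinct strings.
range i from 1 to 100000 step 1
| extend some_field = strcat("id-", tostring(i % 17744))
| summarize
    exact          = count_distinct(some_field),   // exact distinct count
    approx_default = dcount(some_field),           // estimate, accuracy 1 (default)
    approx_high    = dcount(some_field, 4)         // estimate, accuracy 4
```

Running all aggregations in a single summarize also guarantees they see the same rows and the same time range, which removes one possible source of differing results.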

Related

Return decimal value using Kusto Query Language that is always returning 0

I have a Kusto query that I am using to query Application Insights. The goal is to get the number of failed requests in 5-minute buckets and divide that by the total number of requests in the same 5-minute bucket. I will eventually build an alert to trigger if this percentage is greater than a certain value. But I can't seem to get the query right.
In the example below, I hardcode a specific timestamp to make sure I get some failures.
Here is the query:
let fn = "APP_NAME";
requests
| where success == "False" and cloud_RoleName == fn
| summarize failed=sum(itemCount) by bin(timestamp, 5m)
| where timestamp == "2021-05-17T20:20:00Z"
| join (
requests
| where cloud_RoleName == fn
| summarize reqs=sum(itemCount) by bin(timestamp, 5m)
| where timestamp == "2021-05-17T20:20:00Z"
) on timestamp
| project timestamp, failed, reqs
| extend p=round(failed/reqs, 2)
It currently returns:
timestamp [UTC]            | p | failed | reqs
5/17/2021, 8:20:00.000 PM  | 0 | 1,220  | 6,649
Can anyone give me insight into how to get the decimal value (~0.18) I expect for p?

I had to cast the values to double to get it to return something other than 0, because dividing one integer (long) value by another performs integer division.
let fn = "APP_NAME";
requests
| where success == "False" and cloud_RoleName == fn
| summarize failed=sum(itemCount) by bin(timestamp, 5m)
| join (
requests
| where cloud_RoleName == fn
| summarize reqs=sum(itemCount) by bin(timestamp, 5m)
) on timestamp
| project timestamp, failedReqs=failed, totalRequests=reqs, percentage=(todouble(failed) / todouble(reqs) * 100)

Another option that is a bit less verbose is to multiply by 100.0 (which is a double literal):
percentage = failed * 100.0 / reqs
Note that the multiplication has to happen before the division; failed / reqs * 100.0 would still perform the integer division first and yield 0.
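
The effect can be reproduced without any table. The sketch below uses print with the two numbers from the result row in the question (1,220 failed out of 6,649 requests); the column names int_ratio, real_ratio and percentage are only illustrative.

```kusto
// Both literals are of type long, like the output of sum(itemCount).
print failed = 1220, reqs = 6649
| extend int_ratio  = failed / reqs                       // long / long: integer division, 0
| extend real_ratio = round(todouble(failed) / reqs, 2)   // 0.18
| extend percentage = round(failed * 100.0 / reqs, 2)     // 18.35
```

Casting only one operand is enough, since an expression that mixes a real with a long is evaluated as real.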

Application Insights Summarize with Having clause

I need to summarize an Application Insights query where the count > 1. I don't see any "Having" clause like SQL has. How can I limit my query to only include records when count > 1?
traces
| extend MessageId = tostring(customDimensions.MessageId)
| summarize Count = count() by MessageId
| order by Count desc

After the summarize operator, Count is an ordinary column, so you can filter on it with a where clause:
traces
| extend MessageId = tostring(customDimensions.MessageId)
| summarize Count = count() by MessageId
| where Count > 1
| order by Count desc
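
The same pattern can be tried on inline data. This is a minimal sketch that assumes a made-up datatable in place of the traces table; only the MessageId column is modeled.

```kusto
// Hypothetical sample: "a" appears 3 times, "b" twice, "c" once.
datatable(MessageId: string) ["a", "b", "a", "c", "a", "b"]
| summarize Count = count() by MessageId
| where Count > 1          // plays the role of SQL's HAVING COUNT(*) > 1
| order by Count desc      // leaves a (3) and b (2); c is filtered out
```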

Calculate Count of users every month in Kusto query language

I have a table named tab1:
Timestamp    Username  sessionid
12-12-2020   Ravi      abc123
12-12-2020   Hari      oipio878
12-12-2020   Ravi      ytut987
11-12-2020   Ram       def123
10-12-2020   Ravi      jhgj54
10-12-2020   Shiv      qwee090
10-12-2020   bob       rtet4535
30-12-2020   sita      jgjye56
I want to count the number of distinct Usernames per day, so that the output would be:
day          count
10-12-2020   3
11-12-2020   1
12-12-2020   2
30-12-2020   1
Tried query:
tab1
| where timestamp > datetime(01-08-2020)
| range timestamp from datetime(01-08-2020) to now() step 1d
| extend day = dayofmonth(timestamp)
| distinct Username
| count
| project day, count

To get a very close estimation of the number of Usernames per day, just run this (the number won't be accurate, see details here):
tab1
| summarize dcount(Username) by bin(Timestamp, 1d)
If you want accurate results, then you should do this (just note that the query will be less performant than the previous one, and will only work if you have up to 1,000,000 usernames / day):
tab1
| summarize make_set(Username) by bin(Timestamp, 1d)
| project Timestamp, Count = array_length(set_Username)
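
Where the count_distinct() aggregation is available, it may be a simpler way to get exact per-day numbers. The sketch below is self-contained: it rebuilds the sample rows from the question as a datatable, assuming the dates are written day-month-year, and the output column names Day and Count are illustrative.

```kusto
// Sample rows from the question (dates assumed to be dd-MM-yyyy).
let tab1 = datatable(Timestamp: datetime, Username: string, sessionid: string) [
    datetime(2020-12-12), "Ravi", "abc123",
    datetime(2020-12-12), "Hari", "oipio878",
    datetime(2020-12-12), "Ravi", "ytut987",
    datetime(2020-12-11), "Ram",  "def123",
    datetime(2020-12-10), "Ravi", "jhgj54",
    datetime(2020-12-10), "Shiv", "qwee090",
    datetime(2020-12-10), "bob",  "rtet4535",
    datetime(2020-12-30), "sita", "jgjye56"
];
tab1
| summarize Count = count_distinct(Username) by Day = bin(Timestamp, 1d)
| order by Day asc
// Expected counts: 10 Dec = 3, 11 Dec = 1, 12 Dec = 2, 30 Dec = 1
```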

computing offset for prev dynamically

I want to set the offset for prev() dynamically, based on the number of items in a group. For example:
T
| make-series value = sum(value) on timestamp from .. to .. step 5m by customer
| summarize by bin(timestamp,1h), customer
| extend prev_value = prev(value,<offset>)
The offset here should be equal to the number of distinct customers. How can I compute this offset dynamically?

If you can split the query into smaller parts, you can use the toscalar() function to get the number of unique customers.
This would be my approach:
let tab_series =
T
| make-series value = sum(value) on timestamp from .. to .. step 5m by customer
;
let no_of_distinct_customers =
toscalar(tab_series | distinct customer | summarize count())
;
tab_series
| summarize by bin(timestamp, 1h), customer
| extend prev_value = prev(value, no_of_distinct_customers)
You can find example here.
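
As a self-contained illustration of the same idea, the sketch below skips make-series and works on plain rows. The table contents, the name n_customers, and the assumption that every customer has exactly one row per timestamp are all made up for the example; with that layout, going back n_customers rows reaches the same customer's previous reading.

```kusto
// Hypothetical data: two customers, one row per customer per hour.
let T = datatable(timestamp: datetime, customer: string, value: long) [
    datetime(2023-01-01 00:00:00), "A", 1,
    datetime(2023-01-01 00:00:00), "B", 10,
    datetime(2023-01-01 01:00:00), "A", 2,
    datetime(2023-01-01 01:00:00), "B", 20
];
// toscalar() evaluates the sub-query once and exposes its result as a constant.
let n_customers = toscalar(T | distinct customer | count);
T
| order by timestamp asc, customer asc      // prev() needs a serialized (ordered) row set
| extend prev_value = prev(value, n_customers)
```

If some customers have gaps in their rows, a fixed offset no longer lines up, and partitioning by customer before calling prev() may be a better fit.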

Use values from one table in the bin operator of another table

Consider the following query:
This will generate a 1 cell result for a fixed value of bin_duration:
events
| summarize count() by id, bin(time , bin_duration) | count
I wish to generate a table with variable values of bin_duration.
bin_duration will take values from the following table:
range bin_duration from 0 to 600 step 10;
So that the final table looks something like this:
How do I go about achieving this?
Thanks

bin(value, roundTo), also known as floor(value, roundTo), rounds value down to the nearest multiple of roundTo, so you don't need an external table.
events
| summarize n = count() by bin(duration,10)
| where duration between(0 .. 600)
| order by duration asc
You can try this out on the StormEvents table from the tutorial:
let events = StormEvents | extend duration = (EndTime - StartTime) / 1h;
events
| summarize n = count() by bin(duration,10)
| where duration between(0 .. 600)
| order by duration asc
When dealing with time-series data, bin() also understands timespan literals, for example:
let events = StormEvents | extend duration = (EndTime - StartTime);
events
| summarize n = count() by bin(duration,10h)
| where duration between(0h .. 600h)
| order by duration asc
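
To see the rounding behaviour of bin() in isolation, a tiny generated input is enough. This sketch assumes nothing beyond the range operator; the column names duration and bucket are illustrative.

```kusto
// Generates 0, 7, 14, 21, 28, 35, 42 and rounds each value down to a multiple of 10.
range duration from 0 to 45 step 7
| extend bucket = bin(duration, 10)
| summarize n = count() by bucket
| order by bucket asc
// Buckets and counts: 0 -> 2, 10 -> 1, 20 -> 2, 30 -> 1, 40 -> 1
```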
