Get Max of date column without using summarise in Kusto - azure-data-explorer

How can I convert the following SQL query to Kusto without grouping or using the summarize syntax? Thanks.
SELECT Max(BirthDate) FROM [Employees]

I assume you want to get the max value of a column without using summarize because you want to use this value in per-record calculations.
The way to achieve this is to use a let statement to calculate the max value, after which you can write a query that will use the calculated value:
let MaxTimestamp = toscalar(MyTable | summarize max(Timestamp));
<Query with MaxTimestamp>
Example:
let MyData = datatable(Fruit: string, Count: long) [
"banana", 30,
"apple", 60,
"watermelon", 20
];
let NumFruit = toscalar(MyData | summarize sum(Count));
MyData
| extend Percentage = Count * 100.0 / NumFruit
Result:
Fruit Count Percentage
banana 30 27.2727272727273
apple 60 54.5454545454545
watermelon 20 18.1818181818182
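Applied to the question's own case, the same pattern might look like the sketch below. The Employees rows and the DaysOlderThanYoungest column are made up for illustration; only the table and column names (Employees, BirthDate) come from the question.

```kusto
// Hypothetical sample data standing in for the real Employees table.
let Employees = datatable(Name: string, BirthDate: datetime) [
    "Ann",  datetime(1985-03-14),
    "Bob",  datetime(1992-11-02),
    "Cleo", datetime(1978-07-30)
];
// Compute the maximum once and store it as a scalar.
let MaxBirthDate = toscalar(Employees | summarize max(BirthDate));
// Use the scalar in a per-record calculation.
Employees
| extend DaysOlderThanYoungest = (MaxBirthDate - BirthDate) / 1d
```

The summarize still runs, but only inside the let statement, so the main query keeps one row per employee.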

Related

Kusto / Azure Data Explorer - Distinct count in kusto queries

I'm using Application Insights with a customEvent and need to get the number of events with a distinct field.
The event looks something like this:
{
"statusCode" : 200,
"some_field": "ABC123QWERTY"
}
I want the number of unique some_field values with statusCode 200. I've looked at this question and tried a couple of different queries, some of which give different answers. In SQL it would have looked something like this:
SELECT COUNT(DISTINCT my_field) AS Count
FROM customEvents
WHERE statusCode=200
Which one is correct?
1 - dcount with default accuracy
customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize dcount(tostring(customDimensions.some_field))
17,853 items
2 - Count by my_field and count number of rows
customEvents
| extend my_field = tostring(customDimensions.some_field)
| where customDimensions.statusCode == 200 and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize Count = count() by my_field
17,774 items.
3 - summarize with by some_field
customEvents
| extend StatusCode = tostring(customDimensions["statusCode"]), MyField = tostring(customDimensions["some_field"])
| where timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize any(StatusCode) by MyField
| summarize Count = count() by any_StatusCode
17,626 items.
4 - dcount with higher accuracy?
customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize dcount(tostring(customDimensions.some_field),4)
17,736 items
5 - count_distinct from preview
customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize count_distinct(tostring(customDimensions.some_field))
17,744 items
The documentation on learn.microsoft.com states:
Use dcount and dcountif to count distinct values in a specific column.
And dcount-aggfunction mentions the accuracy:
Returns an estimate of the number of distinct values of expr in the group.
count_distinct seems to be the correct way:
Counts unique values specified by the scalar expression per summary group, or the total number of unique values if the summary group is omitted.
count_distinct() is a new KQL function that returns an accurate result.
dcount() returns an approximate result.
It can be used with a 2nd argument, a constant integer with value 0, 1, 2, 3 or 4 (0 = fast , 1 = default, 2 = accurate, 3 = extra accurate, 4 = super accurate).
In your examples (specifically "4 - dcount with higher accuracy?") you have not used a 2nd argument.
Higher accuracy means higher accuracy - statistically.
It means that the error will be bound to a lower value.
Theoretically (and in practice) dcount() with lower accuracy may yield in some scenarios a result that is closer to the real number than dcount() with higher accuracy.
Having said that, I would guess that you executed your queries with a UI filter of last 24 hours or something similar.
This means that each execution ran over a different timespan.
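To compare the exact and approximate functions on identical input, a minimal sketch over synthetic data can help. The Events table, its generated some_field values, and the output column names are assumptions for illustration, not the asker's data.

```kusto
// Synthetic input: 100,000 rows with 20,000 distinct some_field values.
let Events = range i from 1 to 100000 step 1
    | extend some_field = strcat("id-", i % 20000);
Events
| summarize
    Exact = count_distinct(some_field),          // exact distinct count
    ApproxDefault = dcount(some_field),          // estimate, default accuracy (1)
    ApproxAccuracy4 = dcount(some_field, 4)      // estimate, highest accuracy level
```

Because all three aggregates run over the same rows in a single query, any difference between the columns comes from the estimation alone and not from a shifting time window.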

Kusto Query Language: How to save column of results into a variable?

Let's say I have a query like:
cluster("cluster1").database("db2").Table3
| distinct * // distinct combinations of data
| take 5 // take 5
How do I save the values from a column in the results output to a pack_array variable?
I want to use this pack_array variable for follow on queries like:
cluster("cluster2").database("db3").Table1
| where ColumnofInterest in (pack_array_var from above)
| take 5 // take 5
Provide the "*" argument to the function and use the "let" statement. Here is an example:
let ValuesFromTheOtherCluster = cluster('cluster1').database('db2').Table3
| extend tempArray = pack_array(*)
| summarize filters = make_set(tempArray);
cluster('cluster2').database("db3").Table1
| where ColumnofInterest in (ValuesFromTheOtherCluster)
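If only one column's values are needed, an alternative sketch is to project that single column into a let variable and pass it to the in operator directly. The datatable contents and the Name column below are hypothetical stand-ins for the cross-cluster tables in the question.

```kusto
// Hypothetical stand-ins for cluster('cluster1').database('db2').Table3
// and cluster('cluster2').database('db3').Table1.
let Table3 = datatable(Name: string, Score: long) [
    "a", 1,
    "b", 2,
    "c", 3
];
let Table1 = datatable(ColumnofInterest: string, Value: long) [
    "a", 10,
    "z", 20,
    "c", 30
];
// Option 1: a single-column tabular expression used directly with `in`.
let WantedValues = Table3 | distinct Name | take 2;
// Option 2: the same values collected into one dynamic array.
let WantedArray = toscalar(Table3 | summarize make_set(Name));
Table1
| where ColumnofInterest in (WantedValues)
// | where ColumnofInterest in (WantedArray)   // equivalent filter using the array
```

Note that take without a preceding sort does not guarantee which rows are returned, so the filtered result can differ between runs.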

How do you access a value in a kusto table by row and column number?

I have a Kusto table counts with 4 rows and 3 columns that has the following elements
HasFailure FunnelPhase count_
0 Experienced 172425
0 NewSubs 25399
1 Experienced 3289
1 NewSubs 643
I would like to access the 3rd element in the 2nd column and save it to a scalar. I have tried the following code:
let value = counts | project count_ lookup 3;
But I am not able to obtain the desired result. What would be the correct way in which to obtain this value?
You'll need to order the records in your table (according to an order you define), then access the 3rd record (according to that same order), and finally project the specific column you're interested in.
For example:
let T =
datatable(HasFailure:bool, FunnelPhase:string, count_:long)
[
0, 'Experienced', 172425,
0, 'NewSubs', 25399,
1, 'Experienced', 3289,
1, 'NewSubs', 643,
]
;
let 3rd_element_in_2nd_column = toscalar(
T
| order by count_ desc
| where row_number() == 3
| project FunnelPhase
)
;
print result = 3rd_element_in_2nd_column

kusto query to show the third column after using distinct for two other columns

Hi, I'm trying to display the ingested time (AttemptedIngestTime) in the Kusto query below. Can you please provide a suggestion?
find withsource=source in (cluster(X).database('y*').['TextFileLogs'])
where AttemptedIngestTime > ago(7d)
and FileLineContent contains "<li>Build Number:"
| distinct source , FileLineContent //, AttemptedIngestTime
| extend databaseName = extract(#"""(oci-[^""]*)""", 1, source)
| extend BuildNumber = extract(#"([A-Z]\w*\.[0-9]\d*\.[0-9]\d*\.[0-9]\d*)",1,FileLineContent)
| extend StampVersion = extract(#"([0-9]\d*\.[0-9]\d*\.[0-9]\d*\.[0-9]\d*)",1,FileLineContent)
| extend cluster = X
//| extend IngestedTime = AttemptedIngestTime
| summarize NumberOfRuns=count() by BuildNumber , StampVersion
You could replace distinct source, FileLineContent with summarize min(AttemptedIngestTime) by source, FileLineContent
(or replace min with max, depending on the semantics you want).
Then you'll still need to decide how to aggregate it in your final summarize (either as min(AttemptedIngestTime), or as a group-by key, e.g. startofday(AttemptedIngestTime)).
Regardless, you should consider following the query best practices:
Replace usage of contains with has.
Replace usage of extract with parse.
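The suggestion above can be sketched roughly as follows. The Logs datatable, its rows, and the output column names IngestedTime and FirstIngestedTime are invented for illustration; the real query would keep its find ... in (...) source.

```kusto
// Hypothetical sample rows in place of the find ... in (...) source.
let Logs = datatable(source: string, FileLineContent: string, AttemptedIngestTime: datetime) [
    "db1", "<li>Build Number: A.1.2.3", datetime(2023-01-01 10:00),
    "db1", "<li>Build Number: A.1.2.3", datetime(2023-01-02 11:00),
    "db2", "<li>Build Number: B.4.5.6", datetime(2023-01-03 12:00)
];
Logs
// Replaces `distinct source, FileLineContent` while keeping the earliest ingest time.
| summarize IngestedTime = min(AttemptedIngestTime) by source, FileLineContent
// parse instead of extract: capture everything after the literal prefix.
| parse FileLineContent with * "Build Number: " BuildNumber
// Carry the time through the final aggregation as well.
| summarize NumberOfRuns = count(), FirstIngestedTime = min(IngestedTime) by BuildNumber
```

Swapping min for max in both summarize steps would report the latest ingest time instead.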

Use values from one table in the bin operator of another table

Consider the following query, which generates a single-cell result for a fixed value of bin_duration:
events
| summarize count() by id, bin(time , bin_duration) | count
I wish to generate a table with variable values of bin_duration.
bin_duration will take values from the following table:
range bin_duration from 0 to 600 step 10;
So that the final table looks something like this:
How do I go about achieving this?
Thanks
The bin(value, roundTo) function, also known as floor(value, roundTo), rounds value down to the nearest multiple of roundTo, so you don't need an external table.
events
| summarize n = count() by bin(duration,10)
| where duration between(0 .. 600)
| order by duration asc
You can try this out on the StormEvents tutorial data:
let events = StormEvents | extend duration = (EndTime - StartTime) / 1h;
events
| summarize n = count() by bin(duration,10)
| where duration between(0 .. 600)
| order by duration asc
When dealing with time series data, bin() also understands the handy timespan literals, for example:
let events = StormEvents | extend duration = (EndTime - StartTime);
events
| summarize n = count() by bin(duration,10h)
| where duration between(0h .. 600h)
| order by duration asc
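To see the rounding behaviour of bin() in isolation, a minimal sketch with literal inputs can be run on its own. The column names are chosen only for this illustration.

```kusto
// bin() rounds each value down to a multiple of the second argument.
print rounded_number = bin(37, 10),        // 30
      rounded_timespan = bin(270m, 1h)     // 04:00:00
```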
