I have written two queries below to extract the distinct count of records from a table, but they are giving me different results: the first query returns more records than the second.
query 1:
ReachOptimization_L0
| where CurrentSubscriptionStatus == "ACTIVE"
| where SnapshotDate == "2019-11-29"
| where IsOptIn == 1
| where CampaignName != "" or CampaignId != ""
| where ReachedFlag == 1
| summarize dcount(UserPUID)
query 2:
ReachOptimization_L0
| where CurrentSubscriptionStatus == "ACTIVE"
| where SnapshotDate == "2019-11-29"
| where IsOptIn == 1
| where CampaignName != "" or CampaignId != ""
| where ReachedFlag == 1
| distinct UserPUID
The dcount() aggregation function produces an estimate of the distinct count, as outlined in
https://learn.microsoft.com/en-us/azure/kusto/query/dcount-aggfunction
"Returns an estimate for the number of distinct values taken by a scalar expression in the summary group."
The estimation accuracy can be found on the same page:
https://learn.microsoft.com/en-us/azure/kusto/query/dcount-aggfunction#estimation-accuracy
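To see the gap on your own data, you can run both computations in a single query. Here is a sketch that reuses your exact filters; toscalar() turns each tabular count into a scalar:

let Filtered = ReachOptimization_L0
| where CurrentSubscriptionStatus == "ACTIVE"
| where SnapshotDate == "2019-11-29"
| where IsOptIn == 1
| where CampaignName != "" or CampaignId != ""
| where ReachedFlag == 1;
print Estimated = toscalar(Filtered | summarize dcount(UserPUID)),
      Exact = toscalar(Filtered | distinct UserPUID | count)

The Exact column matches your second query, so the difference between the two columns is the estimation error of dcount().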
I'm using Application Insights with a customEvent and need to get the number of events with a distinct field.
The event looks something like this:
{
"statusCode" : 200,
"some_field": "ABC123QWERTY"
}
I want the number of unique some_field values with statusCode 200. I've looked at this question and tried a couple of different queries, some of which give different answers. In SQL it would have looked something like this:
SELECT COUNT(DISTINCT my_field) AS Count
FROM customEvents
WHERE statusCode=200
Which one is correct?
1 - dcount with default accuracy
customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize dcount(tostring(customDimensions.some_field))
17,853 items
2 - Count by my_field and count number of rows
customEvents
| extend my_field = tostring(customDimensions.some_field)
| where customDimensions.statusCode == 200 and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize Count = count() by my_field
17,774 items.
3 - summarize by some_field
customEvents
| extend StatusCode = tostring(customDimensions["statusCode"]), MyField = tostring(customDimensions["some_field"])
| where timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize any(StatusCode) by MyField
| summarize Count = count() by any_StatusCode
17,626 items.
4 - dcount with higher accuracy?
customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize dcount(tostring(customDimensions.some_field),4)
17,736 items
5 - count_distinct from preview
customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize count_distinct(tostring(customDimensions.some_field))
17,744 items
According to learn.microsoft.com:
Use dcount and dcountif to count distinct values in a specific column.
And dcount-aggfunction mentions the accuracy:
Returns an estimate of the number of distinct values of expr in the group.
count_distinct seems to be the correct way:
Counts unique values specified by the scalar expression per summary group, or the total number of unique values if the summary group is omitted.
count_distinct() is a new KQL function that returns an accurate result.
dcount() returns an approximate result.
dcount() can be used with a 2nd argument, a constant integer with value 0, 1, 2, 3 or 4 (0 = fast, 1 = default, 2 = accurate, 3 = extra accurate, 4 = super accurate).
In your example "4 - dcount with higher accuracy?" you passed 4 as the 2nd argument, which is the super accurate level.
Higher accuracy means higher accuracy statistically: the error will be bound to a lower value.
Theoretically (and in practice) dcount() with lower accuracy may, in some scenarios, yield a result that is closer to the real number than dcount() with higher accuracy.
Having said that -
I would guess that you executed your queries with a UI filter of last 24 hours or something similar.
This means that each execution ran over a different timespan.
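To rule that out, you can pin the window inside the query and compute all the variants in a single execution, so any remaining difference is attributable to the functions alone. A sketch, assuming the same customEvents shape as in your examples and the accuracy levels described above:

customEvents
| where timestamp between (datetime(2022-12-01T00:00:00Z) .. endofday(datetime(2022-12-31T00:00:00Z)))
| where customDimensions.statusCode == 200
| extend f = tostring(customDimensions.some_field)
| summarize fast = dcount(f, 0), default_acc = dcount(f, 1), accurate = dcount(f, 2),
            extra = dcount(f, 3), super = dcount(f, 4), exact = count_distinct(f)

Here exact should reproduce your option 5, and each dcount() column should land within its documented error bound of that value.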
I have recently started working with Azure Data Explorer (Kusto).
My requirement is to make the sort order of a Kusto table dynamic.
// Variable declaration
let SortColumn = "run_date";
let OrderBy = "desc";
// Actual code
tblOleMeasurments
| take 10
| distinct column1, column2, column3, run_date
| order by SortColumn OrderBy
My code works fine up to [SortColumn], but when I try to add [OrderBy] after [SortColumn], Kusto gives me an error.
My requirement is to pass the asc/desc value from the variable [OrderBy].
Kindly assist with workarounds and solutions.
The sort column and order cannot be expressions; they must be literals ("asc" or "desc" for the order). If you want to pass the sort column and sort order as variables, create a union instead, where the filter on the variables yields the desired outcome. Here is an example:
let OrderBy = "desc";
let sortColumn = "run_date";
let Query = tblOleMeasurments | take 10 | distinct column1, column2, column3, run_date;
union
(Query | where OrderBy == "desc" and sortColumn == "run_date" | order by run_date desc),
(Query | where OrderBy == "asc" and sortColumn == "run_date" | order by run_date asc)
The number of union legs would be the product of the number of candidate sort columns times two (the two sort order options).
An alternative is to sort by a calculated column, based on your sort_order and sort_column. The example below works for numeric columns:
let T = range x from 1 to 5 step 1 | extend y = -10 * x;
let sort_order = "asc";
let sort_column = "y";
T
| order by column_ifexists(sort_column, "") * case(sort_order == "asc", -1, 1)
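The same trick extends to several candidate numeric columns by nesting case() for the column choice as well (a sketch reusing the x and y sample columns from above):

let T = range x from 1 to 5 step 1 | extend y = -10 * x;
let sort_order = "desc";
let sort_column = "x";
T
| order by case(sort_column == "x", x, y) * case(sort_order == "asc", -1, 1)

Because order by defaults to descending, multiplying the value by -1 when sort_order is "asc" flips the result into ascending order.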
I have a where condition that I want to run over a set of tables in my Azure Data Explorer DB. I found the "find in ()" operator in Kusto quite useful; it works fine when I pass the list of tables as intended.
find withsource=DataType in (AppServiceFileAuditLogs,AzureDiagnostics)
where TimeGenerated > ago(31d)
project _ResourceId, _BilledSize, _IsBillable
| where _IsBillable == true
| summarize BillableDataBytes = sum(_BilledSize) by _ResourceId, DataType
| sort by BillableDataBytes nulls last
However, in my scenario, I would like to decide the list of tables at run time using another query.
Usage
| where TimeGenerated > ago(32d)
| where StartTime >= startofday(ago(31d)) and EndTime < startofday(now())
| where IsBillable == true
| summarize BillableDataGB = sum(Quantity) / 1000 by DataType
| sort by BillableDataGB desc
| project DataType
find withsource=DataType in (<pass resulting table expression from above query here as comma separated list of tables>)
where TimeGenerated > ago(31d)
project _ResourceId, _BilledSize, _IsBillable
| where _IsBillable == true
| summarize BillableDataBytes = sum(_BilledSize) by _ResourceId, DataType
| sort by BillableDataBytes nulls last
I found some examples of passing all tables in a database or cluster using wildcards, but that does not fit my scenario. Can somebody help me here?
Here is one way to achieve this:
let Tables = toscalar(Usage
| where TimeGenerated > ago(32d)
| where StartTime >= startofday(ago(31d)) and EndTime < startofday(now())
| where IsBillable == true
| summarize make_set(DataType));
union withsource=T *
| where T in (Tables)
| count
Note that the toscalar expression is significant: it precalculates the list of tables and optimizes the filter on the union expression. I also updated your query to avoid unnecessary work.
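If you need the original billing breakdown rather than a plain count, the same precalculated list composes with the rest of your pipeline. A sketch under the same assumptions, using union in place of find:

let Tables = toscalar(Usage
| where TimeGenerated > ago(32d)
| where StartTime >= startofday(ago(31d)) and EndTime < startofday(now())
| where IsBillable == true
| summarize make_set(DataType));
union withsource=DataType *
| where DataType in (Tables)
| where TimeGenerated > ago(31d) and _IsBillable == true
| summarize BillableDataBytes = sum(_BilledSize) by _ResourceId, DataType
| sort by BillableDataBytes nulls last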
I have two queries that find the total number of distinct occurrences using the count operator. I want to display the percentage of total distinct occurrences of those faults, so I tried dividing DistinctRB by DistinctFaults to get a ratio.
let DistinctFaults = materialize(FaultView
| distinct ha, Ba, ga
| count);
let DistinctRB = materialize(FaultView
| where RB =~ "yes"
| distinct ha, Ba, ga
| count);
print DistinctRB / DistinctFaults
You could try this (using toscalar()):
let DistinctFaults = toscalar(
FaultView
| distinct ha, Ba, ga
| count
);
let DistinctRB = toscalar(
FaultView
| where RB =~ "yes"
| distinct ha, Ba, ga
| count
);
print result = todouble(DistinctRB) / DistinctFaults // todouble avoids integer division
or, if an estimation of the distinct count (using dcountif()) is an option:
FaultView
| summarize result = todouble(dcountif(strcat_delim("_", ha, Ba, ga), RB =~ "yes")) /
            dcount(strcat_delim("_", ha, Ba, ga))
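If an exact ratio is required, the newer count_distinct()/count_distinctif() aggregates can replace the estimates, assuming they are available in your environment:

FaultView
| summarize result = todouble(count_distinctif(strcat_delim("_", ha, Ba, ga), RB =~ "yes")) /
            count_distinct(strcat_delim("_", ha, Ba, ga))

Note that the strcat_delim() key treats rows as equal whenever their concatenations match, so pick a delimiter that cannot occur inside ha, Ba, or ga.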
I have the following requirement: I have a table in the following format, and this is what I want it to be transformed into. Basically, I want the number of users for various combinations of activities. I want this format because I want to create a TreeMap visualization out of it.
This is what I have done so far.
First, find the number of users per activity grouping:
WITH lookup AS
(
SELECT LISTAGG(name, ',') AS groupings,
processed_date,
guid
FROM warehouse.test
GROUP BY processed_date,
guid
)
SELECT groupings AS activity_groupings,
LENGTH(groupings) - LENGTH(REPLACE(groupings, ',', '')) + 1 AS count,
processed_date,
COUNT( guid) AS users
FROM lookup
GROUP BY processed_date,
groupings
I put the results in a separate table
Then I do a split and coalesce like this:
SELECT grouping_1,
       grouping_2,
       grouping_3,
       SUM(num_users) AS num_users
FROM (SELECT NULLIF(SPLIT_PART(groupings, ',', 1), '') AS grouping_1,
             COALESCE(NULLIF(SPLIT_PART(groupings, ',', 2), ''), grouping_1) AS grouping_2,
             COALESCE(NULLIF(SPLIT_PART(groupings, ',', 3), ''), grouping_2, grouping_1) AS grouping_3,
             num_users
      FROM warehouse.groupings) AS expr_qry
GROUP BY grouping_1,
         grouping_2,
         grouping_3
The problem is the first query takes more than 90 minutes to execute as I have more than 250M rows.
There must be a better, more efficient way to do this.
Any heads-up would be greatly appreciated.
Thanks
You do not need to use complex string manipulation functions (LISTAGG(), SPLIT_PART()). You can achieve what you're after with the ROW_NUMBER() function and simple aggregates.
-- Create sample data
CREATE TEMP TABLE test_data (id, guid, name)
AS SELECT 1::INT, 1::INT, 'cooking'
UNION ALL SELECT 2::INT, 1::INT, 'cleaning'
UNION ALL SELECT 3::INT, 2::INT, 'washing'
UNION ALL SELECT 4::INT, 4::INT, 'cooking'
UNION ALL SELECT 6::INT, 5::INT, 'cooking'
UNION ALL SELECT 7::INT, 3::INT, 'cooking'
UNION ALL SELECT 8::INT, 3::INT, 'cleaning'
;
-- Assign a row number to each name per guid
WITH name_order AS (
SELECT guid
, name
, ROW_NUMBER() OVER(PARTITION BY guid ORDER BY id) row_n
FROM test_data
) -- Use MAX() to collapse each guid's data to 1 row
, groupings AS (
SELECT guid
, MAX(CASE WHEN row_n = 1 THEN name END) grouping_1
, MAX(CASE WHEN row_n = 2 THEN name END) grouping_2
FROM name_order
GROUP BY guid
) -- Count the guids per each grouping
SELECT grouping_1
, COALESCE(grouping_2, grouping_1) AS grouping_2
, COUNT(guid) num_users
FROM groupings
GROUP BY 1,2
;
-- Output
grouping_1 | grouping_2 | num_users
------------+------------+-----------
washing | washing | 1
cooking | cleaning | 2
cooking | cooking | 2