Calculate Count of users every month in Kusto query language - azure-data-explorer

I have a table named tab1:
| Timestamp  | Username | sessionid |
| ---------- | -------- | --------- |
| 12-12-2020 | Ravi     | abc123    |
| 12-12-2020 | Hari     | oipio878  |
| 12-12-2020 | Ravi     | ytut987   |
| 11-12-2020 | Ram      | def123    |
| 10-12-2020 | Ravi     | jhgj54    |
| 10-12-2020 | Shiv     | qwee090   |
| 10-12-2020 | bob      | rtet4535  |
| 30-12-2020 | sita     | jgjye56   |
I want to count the number of distinct Usernames per day, so that the output would be:
| day        | count |
| ---------- | ----- |
| 10-12-2020 | 3     |
| 11-12-2020 | 1     |
| 12-12-2020 | 2     |
| 30-12-2020 | 1     |
Tried query:
tab1
| where timestamp > datetime(01-08-2020)
| range timestamp from datetime(01-08-2020) to now() step 1d
| extend day = dayofmonth(timestamp)
| distinct Username
| count
| project day, count

To get a very close estimate of the number of distinct Usernames per day, run the query below. Note that dcount() returns an approximation, so the number is not guaranteed to be exact:
tab1
| summarize dcount(Username) by bin(Timestamp, 1d)
If you need exact results, use the query below instead. Note that it is less performant than the previous one and only works if you have up to 1,000,000 usernames per day:
tab1
| summarize make_set(Username) by bin(Timestamp, 1d)
| project Timestamp, Count = array_length(set_Username)
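To make the semantics of the exact variant concrete, here is a minimal Python sketch (not KQL) of what make_set() followed by array_length() computes: one set of usernames per day, then the size of each set. The rows are the sample data from the question; the function name `distinct_users_per_day` is made up for this illustration.

```python
from collections import defaultdict

# Sample rows from the question: (Timestamp, Username, sessionid).
rows = [
    ("12-12-2020", "Ravi", "abc123"),
    ("12-12-2020", "Hari", "oipio878"),
    ("12-12-2020", "Ravi", "ytut987"),
    ("11-12-2020", "Ram", "def123"),
    ("10-12-2020", "Ravi", "jhgj54"),
    ("10-12-2020", "Shiv", "qwee090"),
    ("10-12-2020", "bob", "rtet4535"),
    ("30-12-2020", "sita", "jgjye56"),
]


def distinct_users_per_day(rows):
    """Group usernames into one set per day and return the set sizes."""
    users_by_day = defaultdict(set)
    for day, username, _session in rows:
        users_by_day[day].add(username)  # duplicates collapse in the set
    return {day: len(users) for day, users in sorted(users_by_day.items())}


counts = distinct_users_per_day(rows)
print(counts)
```

For the sample data this reproduces the table the question asks for: Ravi appears twice on 12-12-2020 but is counted once.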

Related

Kusto / Azure Data Explorer - Distinct count in kusto queries

I'm using Application Insights with a customEvent and need to get the number of events with a distinct field.
The event looks something like this:
{
"statusCode" : 200,
"some_field": "ABC123QWERTY"
}
I want the number of unique some_field values with statusCode 200. I've looked at this question and tried a couple of different queries, some of which give different answers. In SQL it would look something like this:
SELECT COUNT(DISTINCT my_field) AS Count
FROM customEvents
WHERE statusCode=200
Which one is correct?
1 - dcount with default accuracy
customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize dcount(tostring(customDimensions.some_field))
17,853 items
2 - Count by my_field and count number of rows
customEvents
| extend my_field = tostring(customDimensions.some_field)
| where customDimensions.statusCode == 200 and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize Count = count() by my_field
17,774 items.
3 - summarize with by some_field
customEvents
| extend StatusCode = tostring(customDimensions["statusCode"]), MyField = tostring(customDimensions["some_field"])
| where timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize any(StatusCode) by MyField
| summarize Count = count() by any_StatusCode
17,626 items.
4 - dcount with higher accuracy?
customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize dcount(tostring(customDimensions.some_field),4)
17,736 items
5 - count_distinct from preview
customEvents
| where (customDimensions.statusCode == 200) and timestamp >= startofday(datetime(2022-12-01T00:00:00Z)) and timestamp <= endofday(datetime(2022-12-31T00:00:00Z))
| summarize count_distinct(tostring(customDimensions.some_field))
17,744 items
The documentation on learn.microsoft.com states:
Use dcount and dcountif to count distinct values in a specific column.
And the dcount() aggregation function page mentions the accuracy:
Returns an estimate of the number of distinct values of expr in the group.
count_distinct seems to be the correct way:
Counts unique values specified by the scalar expression per summary group, or the total number of unique values if the summary group is omitted.
count_distinct() is a new KQL function that returns an accurate result.
dcount() returns an approximate result.
It can be used with a 2nd argument, a constant integer with value 0, 1, 2, 3 or 4 (0 = fast, 1 = default, 2 = accurate, 3 = extra accurate, 4 = super accurate).
In your examples (specifically "4 - dcount with higher accuracy?") you have not used a 2nd argument.
Higher accuracy here is a statistical statement: the error is bound to a lower value.
In theory (and in practice), dcount() with lower accuracy may in some scenarios yield a result that is closer to the real number than dcount() with higher accuracy.
Having said that, I would guess that you executed your queries with a UI time filter of "last 24 hours" or something similar.
This means that each execution ran over a different timespan.
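The difference between an exact distinct count and a bounded-error estimate can be sketched in Python. The estimator below is linear counting over a bitmap, chosen only because it is short; it is a stand-in and not the algorithm dcount() uses. The names `exact_distinct` and `linear_count`, the bitmap sizes, and the generated values are all assumptions made for this illustration.

```python
import hashlib
import math


def exact_distinct(values):
    """Exact distinct count, analogous to SQL COUNT(DISTINCT ...)."""
    return len(set(values))


def linear_count(values, m):
    """Estimate the distinct count with an m-bit bitmap (linear counting)."""
    bitmap = [False] * m
    for v in values:
        digest = hashlib.sha256(v.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # deterministic 64-bit hash
        bitmap[h % m] = True
    zeros = bitmap.count(False)
    if zeros == 0:
        raise ValueError("bitmap saturated; use a larger m")
    # Standard linear-counting estimate: -m * ln(fraction of unset bits).
    return -m * math.log(zeros / m)


# Hypothetical data: 20,000 events carrying 5,000 distinct field values.
values = [f"id-{i % 5000}" for i in range(20000)]

exact = exact_distinct(values)
coarse = linear_count(values, 1 << 13)  # smaller bitmap, looser error bound
fine = linear_count(values, 1 << 16)    # larger bitmap, tighter error bound
print(exact, round(coarse), round(fine))
```

A larger bitmap tightens the expected error, but for one particular data set the coarse estimate can still happen to land closer to the true value, which is the point made above about dcount() accuracy levels.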

Kusto : Summarize count by hours of the day (hours in column)

I have a list of metrics that I want to visualize by name (rows) and count by hour of the current day (columns).
The example below creates one row per hour and metric name:
customMetrics
| extend hour= floor( timestamp % 1d , 1h)
| where name contains "WebServiceCall-"
| summarize event_count = sum(value) by hour, name
I want the data display like this:
MetricName | Count Hour0 | Count Hour1 | Count Hour2 | ... | Count Hour23
Is it possible to do it with Kusto?
Yes, you can use the pivot plugin for this.
Thank you, Avnera. The resulting query:
customMetrics
| where name contains "WebServiceCall-"
| extend Hour= floor( timestamp % 1d , 1h)
| project name, Hour, value
| evaluate pivot(Hour, sum(value))
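As a rough illustration of what the pivot produces, the following Python sketch turns (name, hour, value) records into one row per metric name with one column per hour. The sample metrics and the function name `pivot_by_hour` are hypothetical, and filling missing cells with 0 is a choice made in this sketch.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample metrics: (timestamp, name, value).
metrics = [
    (datetime(2023, 1, 1, 0, 15), "WebServiceCall-A", 2),
    (datetime(2023, 1, 1, 0, 45), "WebServiceCall-A", 3),
    (datetime(2023, 1, 1, 2, 5), "WebServiceCall-A", 1),
    (datetime(2023, 1, 1, 2, 30), "WebServiceCall-B", 4),
    (datetime(2023, 1, 1, 5, 0), "Other", 9),
]


def pivot_by_hour(metrics, needle="WebServiceCall-"):
    """Sum values per (name, hour) and spread the hours into columns."""
    table = defaultdict(lambda: defaultdict(int))
    for ts, name, value in metrics:
        # Case-insensitive substring match, like the KQL `contains` operator.
        if needle.lower() in name.lower():
            table[name][ts.hour] += value
    hours = sorted({h for row in table.values() for h in row})
    # Every row gets every hour column; missing cells are filled with 0 here.
    return {name: {h: row.get(h, 0) for h in hours} for name, row in table.items()}


pivoted = pivot_by_hour(metrics)
print(pivoted)
```

Only hours that occur in the data become columns, so a quiet hour produces no column at all unless it is added explicitly.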

Application Insights Summarize with Having clause

I need to summarize an Application Insights query where the count > 1. I don't see any "Having" clause like SQL has. How can I limit my query to only include records when count > 1?
traces
| extend MessageId = tostring(customDimensions.MessageId)
| summarize Count = count() by MessageId
| order by Count desc
After the summarize operator, Count is an ordinary column, so you can filter on it with a where clause:
traces
| extend MessageId = tostring(customDimensions.MessageId)
| summarize Count = count() by MessageId
| where Count > 1
| order by Count desc
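The same aggregate-then-filter pattern, which SQL expresses with HAVING, can be sketched in Python. The message IDs and the function name `counts_above` are made up for this illustration.

```python
from collections import Counter

# Hypothetical MessageId values extracted from trace records.
message_ids = ["m1", "m2", "m1", "m3", "m2", "m1"]


def counts_above(ids, threshold=1):
    """Count occurrences per id, keep counts above threshold, sort descending."""
    counts = Counter(ids)                                        # summarize count() by id
    kept = [(i, n) for i, n in counts.items() if n > threshold]  # where Count > threshold
    return sorted(kept, key=lambda pair: pair[1], reverse=True)  # order by Count desc


result = counts_above(message_ids)
print(result)
```

The filter runs on the aggregated counts, not on the raw rows, which is why it has to come after the grouping step.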

Multiple Separate WHERE clauses in single VIEW

I need help creating a single SELECT statement as part of a CREATE VIEW statement that contains multiple, separate filtering or grouping requirements.
I am working on an SQLite database to track usage of our local food pantry, where we have two types of visitors, “Scheduled” or “Drop-In”, visiting on different days. One of the central tables is the “visit_log” table that tracks each visit by date, time, type of visit, and people in the household.
I’m trying to create a VIEW that summarizes the “visit_log” table grouped by visit_date and that shows, for both the number of records and the SUM of household size, the “Drop-In” figure, the “Scheduled” figure, and the total of the two types.
Here is the “visit_log”
CREATE TABLE "visit_log" ("visit_date" DATE, "visit_time" TIME, "client_relation" TEXT, "household_size" INTEGER)
Here is a sample of the “visit_log” table’s content. (We have not started recording the visit_time yet, so those values are blank).
"visit_date","visit_time","client_relation","household_size"
"6/9/20","","Scheduled","1"
"6/9/20","","Scheduled","1"
"6/9/20","","Drop-In","2"
"6/9/20","","Drop-In","3"
"6/9/20","","Drop-In","8"
"6/9/20","","Drop-In","5"
"6/16/2020","","Scheduled","1"
"6/16/2020","","Scheduled","1"
"6/16/2020","","Drop-In","4"
"6/16/2020","","Drop-In","5"
"6/16/2020","","Drop-In","2"
"6/16/2020","","Drop-In","2"
"6/16/2020","","Drop-In","5"
"6/16/2020","","Drop-In","1"
I can create three separate VIEWs, one for each type and one for the two combined. But my goal is to have the results of these three VIEWs in one.
Here are the three VIEWs. First is for the two client types combined.
CREATE VIEW "visit_summary" AS SELECT
visit_date,
COUNT (*) AS households_total,
SUM (household_size) AS individuals_total
FROM
"visit_log"
GROUP By visit_date
This yields
"visit_date","households_total","individuals_total"
"06/09/2020","12","44"
"06/16/2020","8","21"
"06/23/2020","7","20"
"06/30/2020","10","22"
"07/07/2020","7","18"
Next is the VIEW for the Drop-Ins
CREATE VIEW "visit_summary_dropin" AS SELECT
visit_date,
COUNT (*) AS households_dropin,
SUM (household_size) AS individuals_dropin
FROM
"visit_log"
WHERE client_relation = 'Drop-In'
GROUP By visit_date
This yields
"visit_date","households_dropin","individuals_dropin"
"06/09/2020","10","42"
"06/16/2020","6","19"
"06/23/2020","4","13"
"06/30/2020","6","12"
"07/07/2020","6","16"
Finally, here is the VIEW for the Scheduled visits
CREATE VIEW "visit_summary_scheduled" AS SELECT
visit_date,
COUNT (*) AS households_schedualed,
SUM (household_size) AS individuals_scheduled
FROM
"visit_log"
WHERE client_relation = 'Scheduled'
GROUP By visit_date
This yields
"visit_date","households_schedualed","individuals_scheduled"
"06/09/2020","2","2"
"06/16/2020","2","2"
"06/23/2020","3","7"
"06/30/2020","4","10"
"07/07/2020","1","2"
What I'm hoping to create is a single VIEW that yields
"visit_date","households_total","individuals_total","households_dropin","individuals_dropin","households_schedualed","individuals_scheduled"
"06/09/2020","12","44","10","42","2","2"
etc…
So my ultimate question, finally, is how to create a single VIEW containing something like multiple WHERE clauses to define different columns?
You can do it with conditional aggregation:
CREATE VIEW visit_summary_scheduled_all AS
SELECT visit_date,
COUNT(*) households_total,
SUM(household_size) individuals_total,
SUM(client_relation = 'Drop-In') households_dropin,
SUM(CASE WHEN client_relation = 'Drop-In' THEN household_size END) individuals_dropin,
SUM(client_relation = 'Scheduled') households_scheduled,
SUM(CASE WHEN client_relation = 'Scheduled' THEN household_size END) individuals_scheduled
FROM visit_log
GROUP By visit_date
Results for the sample rows in the question:
| visit_date | households_total | individuals_total | households_dropin | individuals_dropin | households_scheduled | individuals_scheduled |
| ---------- | ---------------- | ----------------- | ----------------- | ------------------ | -------------------- | --------------------- |
| 6/16/2020 | 8 | 21 | 6 | 19 | 2 | 2 |
| 6/9/20 | 6 | 20 | 4 | 18 | 2 | 2 |
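The conditional aggregation can be checked against the sample rows with Python's built-in sqlite3 module. This is a minimal sketch: the in-memory database, the view name `visit_summary_all`, and the added ORDER BY are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visit_log (visit_date DATE, visit_time TIME, "
    "client_relation TEXT, household_size INTEGER)"
)

# Sample rows from the question: (visit_date, client_relation, household_size).
sample = [
    ("6/9/20", "Scheduled", 1), ("6/9/20", "Scheduled", 1),
    ("6/9/20", "Drop-In", 2), ("6/9/20", "Drop-In", 3),
    ("6/9/20", "Drop-In", 8), ("6/9/20", "Drop-In", 5),
    ("6/16/2020", "Scheduled", 1), ("6/16/2020", "Scheduled", 1),
    ("6/16/2020", "Drop-In", 4), ("6/16/2020", "Drop-In", 5),
    ("6/16/2020", "Drop-In", 2), ("6/16/2020", "Drop-In", 2),
    ("6/16/2020", "Drop-In", 5), ("6/16/2020", "Drop-In", 1),
]
conn.executemany("INSERT INTO visit_log VALUES (?, '', ?, ?)", sample)

# In SQLite a comparison evaluates to 1 or 0, so SUM(condition) counts matches.
conn.execute("""
    CREATE VIEW visit_summary_all AS
    SELECT visit_date,
           COUNT(*) AS households_total,
           SUM(household_size) AS individuals_total,
           SUM(client_relation = 'Drop-In') AS households_dropin,
           SUM(CASE WHEN client_relation = 'Drop-In' THEN household_size END) AS individuals_dropin,
           SUM(client_relation = 'Scheduled') AS households_scheduled,
           SUM(CASE WHEN client_relation = 'Scheduled' THEN household_size END) AS individuals_scheduled
    FROM visit_log
    GROUP BY visit_date
""")

summary = conn.execute(
    "SELECT * FROM visit_summary_all ORDER BY visit_date"
).fetchall()
for row in summary:
    print(row)
```

Because visit_date is stored as text, the rows sort as strings, so '6/16/2020' comes before '6/9/20'. On databases that do not treat comparisons as integers, SUM(CASE WHEN ... THEN 1 ELSE 0 END) is the portable way to count matches.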

Use values from one table in the bin operator of another table

Consider the following query, which generates a single-cell result for a fixed value of bin_duration:
events
| summarize count() by id, bin(time , bin_duration) | count
I wish to generate a table with variable values of bin_duration.
bin_duration will take values from the following table:
range bin_duration from 0 to 600 step 10;
So that the final table looks something like this:
How do I go about achieving this?
Thanks
bin(value, roundTo), also known as floor(value, roundTo), rounds value down to the nearest multiple of roundTo, so you don't need an external table.
events
| summarize n = count() by bin(duration,10)
| where duration between(0 .. 600)
| order by duration asc
You can try this out on the StormEvents tutorial data:
let events = StormEvents | extend duration = (EndTime - StartTime) / 1h;
events
| summarize n = count() by bin(duration,10)
| where duration between(0 .. 600)
| order by duration asc
When dealing with time-series data, bin() also understands the handy timespan literals, e.g.:
let events = StormEvents | extend duration = (EndTime - StartTime);
events
| summarize n = count() by bin(duration,10h)
| where duration between(0h .. 600h)
| order by duration asc
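The rounding that bin() performs, and the histogram the query builds from it, can be sketched in Python for non-negative numeric durations. The duration values and the helper name `bin_value` are hypothetical.

```python
import math
from collections import Counter


def bin_value(value, round_to):
    """Round value down to the nearest multiple of round_to."""
    return math.floor(value / round_to) * round_to


# Hypothetical event durations in hours.
durations = [0.5, 3, 9.9, 10, 14, 27, 600, 605, 700]

# summarize n = count() by bin(duration, 10)
histogram = Counter(bin_value(d, 10) for d in durations)

# where duration between (0 .. 600) | order by duration asc
result = sorted((b, n) for b, n in histogram.items() if 0 <= b <= 600)
print(result)
```

As in the query, the range filter is applied to the binned value, so the 605-hour event is kept (its bin is 600) while the 700-hour event is dropped.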
