Kusto - Render Column chart as per bucket values (extend operator) - azure-data-explorer

If I run the query below against the following datatable, I only get results in the 10E bucket for event_count > 10. Is there any reason the other categories are not being displayed in the buckets? I would like to render a column chart per event-count category. Thanks.
| summarize event_count=count() by State
| where event_count > 10
| extend bucket = case (
event_count > 10, "10E",
event_count > 100, "100E",
event_count > 500, "500E",
event_count > 1000, "1000E",
event_count > 5000, ">5000E",
"N/A")
| project bucket
datatable (State: string, event_count: long) [
"VIRGIN ISLANDS",long(12),
"AMERICAN SAMOA",long(16),
"DISTRICT OF COLUMBIA",long(22),
"LAKE ERIE",long(27),
"LAKE ST CLAIR",long(32),
"LAKE SUPERIOR",long(34),
"RHODE ISLAND",long(51),
"LAKE HURON",long(63),
"CONNECTICUT",long(148)
]

When a condition is true in a "case" function, it does not continue to the next one. Since all of your counts are bigger than 10, the first category is correct for all of them. It seems that you wanted the conditions to be "less than or equal to"; here is an example:
datatable (State: string, event_count: long) [
"VIRGIN ISLANDS",long(12),
"AMERICAN SAMOA",long(16),
"DISTRICT OF COLUMBIA",long(22),
"LAKE ERIE",long(27),
"LAKE ST CLAIR",long(32),
"LAKE SUPERIOR",long(34),
"RHODE ISLAND",long(51),
"LAKE HURON",long(63),
"CONNECTICUT",long(148)
]
| where event_count > 10
| extend bucket = case (
event_count <= 10, "10E",
event_count <= 100, "100E",
event_count <= 500, "500E",
event_count <= 1000, "1000E",
event_count <= 5000, ">5000E",
"N/A")
| summarize sum(event_count) by bucket
| render columnchart
| bucket | sum_event_count |
| 100E   | 257             |
| 500E   | 148             |
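For reference, here is a minimal sketch (with a single hypothetical value) of the first-match behavior of case(), which is why the original query labeled everything as 10E:
print event_count = 148
| extend bucket = case(
    event_count > 10, "10E",    // true, so evaluation stops here
    event_count > 100, "100E",  // also true, but never reached
    "N/A")
// bucket == "10E"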

Related

Limiting Azure data explorer update policy input

We have a use case where we are saving telemetry and statistics data from the machines, but the update policy, which is supposed to process the raw data, is giving us trouble and running out of memory.
Aggregation over string column exceeded the memory budget of 8GB during evaluation
We have two tables, the 'ingest-table' where the data is initially being ingested to and the 'main-table' where it should end up.
We are in a process of migrating from another solution to ADX and have to ingest a high volume of data.
The raw data is in a matrix format, which means that one message from a machine will end up as multiple rows/records in the ADX database. We use mv-expand for the breakdown, and the query is pretty much doing just that, along with some data formatting.
So, our update policy looks like the following:
['ingest-table']
| mv-expand Counter = Data.Counters
| mv-expand with_itemindex = r Row = Rows
| mv-expand Column = Rows[r].Data
| project ...
I don't see any way to improve the processing query itself, so I'm looking for a way to somehow limit the number of records which the update policy function receives.
I've tried playing around with the ingestion batching (MaximumNumberOfItems = 1000) and also the sharding policy (MaxRowCount = 1000) for the 'ingest-table', but it does not have any effect on the number of records the update policy pulls in at once.
My idea is to let only 1000 items at a time be processed by the update policy function, because I've tested manually and it works fine up to about 5k records but fails slightly above that.
Any suggestions on what we could do in this case and how I can achieve that?
EDIT:
An example raw message which has to be processed by the update policy.
The number of rows the policy has to generate is COUNTERS * ROWS * COLUMNS. In this case (2 counters x 50 rows x 12 columns) it means we end up with ~1200 rows after this single message is processed.
I do not see any other way than doing an mv-expand here.
{
"Name": "StatisicName",
"TimeInterval": {
"StartUtc": 1654221156.285,
"EndUtc": 1654221216.286
},
"Legend": {
"RowLabels": [
"0",
"0.04",
"0.08",
"0.12",
"0.16",
"0.2",
"0.24",
"0.28",
"0.32",
"0.36",
"0.4",
"0.44",
"0.48",
"0.52",
"0.56",
"0.6",
"0.64",
"0.68",
"0.72",
"0.76",
"0.8",
"0.84",
"0.88",
"0.92",
"0.96",
"1",
"1.04",
"1.08",
"1.12",
"1.16",
"1.2",
"1.24",
"1.28",
"1.32",
"1.36",
"1.4",
"1.44",
"1.48",
"1.52",
"1.56",
"1.6",
"1.64",
"1.68",
"1.72",
"1.76",
"1.8",
"1.84",
"1.88",
"1.92",
"1.96"
],
"ColumnLabels": [
"Material1",
"Material2",
"Material3",
"Material4",
"Material5",
"Material6",
"Material7",
"Material8",
"Material9",
"Material10",
"Material11",
"Material12"
]
},
"Counters": [
{
"Type": "Cumulative",
"Matrix": {
"Rows": [
{
"Data": [
6.69771873292923,
0,
0,
0,
0.01994649920463562,
0.017650499296188355,
0.007246749711036683,
0.003443999862670899,
0.1422802443265915,
0,
0,
0.0008609999656677247
]
}
//,{...} ... for each row of the matrix
]
}
},
{
"Type": "Count",
"Matrix": {
"Rows": [
{
"Data": [
0.0001434999942779541,
0,
0,
0,
0.0001434999942779541,
0.0001434999942779541,
0.0001317590856552124,
0.0001434999942779541,
0.00014285165093031273,
0,
0,
0.0001434999942779541
]
}
//,{...} ... for each row of the matrix
]
}
}
]
}
The main issue I see in your code is this:
| mv-expand with_itemindex = r Row = Rows
| mv-expand Column = Rows[r].Data
You explode Rows and get the exploded values in a new column called Row, but then instead of working with Row.Data, you keep using the original unexploded Rows, traversing through the elements using r.
This leads to unnecessary duplication of Rows and it is probably what creates the memory pressure.
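To illustrate the difference with a minimal, self-contained sketch (toy data, not your schema): expand Rows once and read Data from the expanded element, instead of indexing back into the full Rows array.
datatable (Rows: dynamic) [
    dynamic([{"Data": [1, 2]}, {"Data": [3, 4]}])
]
| mv-expand Row = Rows          // Row holds one element of Rows
| mv-expand Column = Row.Data   // read Data from the expanded Row, not from Rows[r]
| project Column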
Check out the following code.
You can use the whole code and get the data formatted as a table with columns Material1, Material2, etc., or exclude the last two lines of the query and simply get the exploded values, each in a separate row.
// Data sample generation. Not part of the solution
let p_matrixes = 3;
let p_columns = 12;
let p_rows = 50;
let ['ingest-table'] =
range i from 1 to p_matrixes step 1
| extend StartUtc = floor((ago(28d + rand()*7d) - datetime(1970))/1ms/1000,0.001)
| extend EndUtc = floor((ago(rand()*7d) - datetime(1970))/1ms/1000,0.001)
| extend RowLabels = toscalar(range x from todecimal(0) to todecimal(0.04 * (p_rows - 1)) step todecimal(0.04) | summarize make_list(tostring(x)))
| extend ColumnLabels = toscalar(range x from 1 to p_columns step 1 | summarize make_list(strcat("Material",tostring(x))))
| extend Counters_Cumulative = toscalar(range x from 1 to p_rows step 1 | mv-apply range(1, p_columns) on (summarize Data = pack_dictionary("Data", make_list(rand()))) | summarize make_list(Data))
| extend Counters_Count = toscalar(range x from 1 to p_rows step 1 | mv-apply range(1, p_columns) on (summarize Data = pack_dictionary("Data", make_list(rand()))) | summarize make_list(Data))
| project i, Data = pack_dictionary("Name", "StatisicName", "TimeInterval", pack_dictionary("StartUtc", StartUtc, "EndUtc",EndUtc), "Legend", pack_dictionary("RowLabels", RowLabels, "ColumnLabels", ColumnLabels), "Counters", pack_array(pack_dictionary("Type", "Cumulative", "Matrix", pack_dictionary("Rows", Counters_Cumulative)), pack_dictionary("Type", "Count", "Matrix", pack_dictionary("Rows", Counters_Count))))
;
// Solution starts here
// Explode values
['ingest-table']
| project Name = tostring(Data.Name), StartUtc = todecimal(Data.TimeInterval.StartUtc), EndUtc = todecimal(Data.TimeInterval.EndUtc), RowLabels = Data.Legend.RowLabels, ColumnLabels = Data.Legend.ColumnLabels, Counters = Data.Counters
| mv-apply Counters on (project Type = tostring(Counters.Type), Rows = Counters.Matrix.Rows)
| mv-apply RowLabels to typeof(decimal), Rows on (project RowLabels, Data = Rows.Data)
| mv-expand ColumnLabels to typeof(string), Data to typeof(real)
// Format as table
| evaluate pivot(ColumnLabels, take_any(Data))
| project-reorder Name, StartUtc, EndUtc, RowLabels, Type, * granny-asc
"Explode values" sample
Name
StartUtc
EndUtc
ColumnLabels
RowLabels
Type
Data
StatisicName
1658601891.654
1660953273.898
Material4
0.88
Count
0.33479977032253788
StatisicName
1658601891.654
1660953273.898
Material7
0.6
Cumulative
0.58620965468565811
StatisicName
1658801257.201
1660941025.56
Material1
0.72
Count
0.23164306814350025
StatisicName
1658601891.654
1660953273.898
Material4
1.68
Cumulative
0.47149864409592157
StatisicName
1658601891.654
1660953273.898
Material12
1.08
Cumulative
0.777589612330022
"Format as table" Sample
Name
StartUtc
EndUtc
RowLabels
Type
Material1
Material2
Material3
Material4
Material5
Material6
Material7
Material8
Material9
Material10
Material11
Material12
StatisicName
1658581605.446
1660891617.665
0.52
Cumulative
0.80568785763966921
0.69112398516227513
0.45844947991605256
0.87975011678339887
0.19607303271777138
0.76728212781319993
0.27520162657976527
0.48612400400362971
0.23810927904958085
0.53986865017468966
0.31225384042818344
0.99380179164514848
StatisicName
1658581605.446
1660891617.665
0.72
Count
0.77601864161716061
0.351768361021601
0.59345888695494731
0.92329751241805491
0.80811999338933449
0.49117503870065837
0.97871902062153937
0.94241064167069055
0.52950523227349289
0.39281849330041424
0.080759530370922858
0.8995622227351241
StatisicName
1658345203.482
1660893443.968
1.92
Count
0.78327575542772387
0.16795871437570925
0.01201541525964204
0.96029371013283549
0.60248327254185241
0.019315208353334352
0.4828009899119266
0.75923221663483853
0.29630236707606555
0.23977292819044668
0.94531978804572625
0.54626985282267437
StatisicName
1658345203.482
1660893443.968
1
Count
0.65268575186841382
0.61471913013853441
0.80536656853846211
0.380104887115314
0.84979344481966745
0.68790819414895632
0.80862491082567767
0.083687871352600765
0.16707928827946666
0.4071460045501768
0.94115460659910444
0.25011225557898314
StatisicName
1658581605.446
1660891617.665
1.6
Count
0.75532393959433786
0.71081551001527776
0.9757484452705758
0.55510969429009
0.055800808878012885
0.74924458240427783
0.78706505608871058
0.18745675452118818
0.70192553697345517
0.39429935579653647
0.4048784200404818
0.14888395753558561

Make the value return timespan(0) if rows are not available or the condition is not met in Kusto?

I have the following code:
Telemetry
| where DataMetadata["category"] == "Warning"
| summarize
Duration = sum(case(Name == "Event", totimespan(Value), totimespan(0))),
Text = min(case(Name == "Information", tostring(Value), "N/A")),
DeviceID = min(case(Name == "Ident", tostring(Value), "N/A"))
by Timestamp
| summarize TotalDuration = sum(Duration) by Text,DeviceID
| top 2 by TotalDuration
| summarize Duration = max(case(isnotnull(TotalDuration) or isnotempty(TotalDuration), strcat("Duration: ",format_timespan(TotalDuration, 'dd:hh:mm:ss'), "[sec] ",DeviceID," - ",Text), tostring(timespan(0))))
Checking the last hour of data, the condition DataMetadata["category"] == "Warning" is not met, and in this case I want to display 00:00:00:00 as the result, as shown in the summarize at the end of the code.
However, that is not the result I get.
What is the issue here and how can I solve it?
I assume that you do want the top 2 records by TotalDuration, in case there are any.
let Telemetry = datatable(DataMetadata:dynamic, Name:string, Timestamp:datetime, Value:string)[];
Telemetry
| where DataMetadata["category"] == "Warning"
| summarize
Duration = sum(case(Name == "Event", totimespan(Value), totimespan(0))),
Text = min(case(Name == "Information", tostring(Value), "N/A")),
DeviceID = min(case(Name == "Ident", tostring(Value), "N/A"))
by Timestamp
| summarize TotalDuration = sum(Duration) by Text,DeviceID
| union (print TotalDuration = 0s, Text = "NA", DeviceID = "NA")
| top 2 by TotalDuration
| project Duration = strcat("Duration: ",format_timespan(TotalDuration, 'dd:hh:mm:ss'), "[sec] ",DeviceID," - ",Text)
| Duration |
| Duration: 00:00:00:00[sec] NA - NA |
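The key trick above is the union with a synthetic default row, which guarantees that at least one record survives even when the Warning filter matches nothing. A stripped-down sketch of just that pattern, with a toy empty table and hypothetical column names:
datatable (TotalDuration: timespan, Text: string, DeviceID: string) []    // empty, like a filter that matched nothing
| union (print TotalDuration = 0s, Text = "NA", DeviceID = "NA")          // synthetic default row
| top 1 by TotalDuration                                                  // real rows, when present, outrank the 0s default
| project Duration = strcat("Duration: ", format_timespan(TotalDuration, 'dd:hh:mm:ss'))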

Kusto for sliding window

I am new to Kusto Query Language. The requirement is to alert when the machine status has been 1 continuously for 15 minutes.
I have two columns, column1: timestamp (one value per second) and column2: machine status (values 1 and 0). How can I use a sliding window to find if the machine status is 1 for 15 continuous minutes?
Currently I have used the bin function, but it does not seem to be the proper one.
summarize avg_value = avg(status) by customer, machine, bin(timestamp, 15m)
What would be a better solution for this?
Thanks in advance
Here is another option using time series functions:
let dt = 1s;
let n_bins = tolong(15m/dt);      // number of samples in a 15-minute window (900)
let coeffs = repeat(1, n_bins);   // all-ones filter = moving sum over the window
let T = view(M:string) {
    range Timestamp from datetime(2022-01-11) to datetime(2022-01-11 01:00) step dt
    | extend machine = M
    | extend status = iif(rand()<0.002, 0, 1)
};
union T("A"), T("B")
| make-series status=any(status) on Timestamp step dt by machine
| extend rolling_status = series_fir(status, coeffs, false)   // rolling sum of the last n_bins samples
| extend alerts = series_equals(rolling_status, n_bins)       // 1 when every sample in the window is 1
| project machine, Timestamp, alerts
| mv-expand Timestamp to typeof(datetime), alerts to typeof(bool)
| where alerts == 1
You can also do it using the scan operator.
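For reference, a minimal sketch of that scan alternative, reusing the generated T("A"), T("B") data from above; with one sample per second, a 15-minute streak corresponds to 900 consecutive rows. Treat this as a sketch of the idea rather than a drop-in query:
union T("A"), T("B")
| sort by machine asc, Timestamp asc
| scan declare (streak: long = 0) with (
    // grow the streak while status stays 1 on the same machine, otherwise restart it
    step s: true => streak = iif(status == 1 and machine == s.machine, s.streak + 1, tolong(status));
)
| where streak >= 15 * 60
| project machine, Timestamp, streak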
thanks
Here is one way to do it; the example uses generated data, hopefully it fits your scenario:
range x from datetime(2022-01-10 13:00:10) to datetime(2022-01-10 13:10:10) step 1s
| extend status = iif(rand()<0.01, 0, 1)
| extend current_sum = row_cumsum(status)
| extend prior_sum = prev(current_sum, 15)
| extend should_alert = (current_sum - prior_sum == 15 and isnotempty(prior_sum))
If you have multiple machines, you need to sort first by machine and restart the row_cumsum operation:
let T = view(M:string) {
range Timestamp from datetime(2022-01-10 13:00:10) to datetime(2022-01-10 13:10:10) step 1s
| extend machine = M
| extend status = iif(rand()<0.01, 0, 1)
};
union T("A"), T("B")
| sort by machine asc, Timestamp asc
| extend current_sum = row_cumsum(status, machine != prev(machine))
| extend prior_sum = iif(machine == prev(machine, 15), prev(current_sum, 15), int(null))
| extend should_alert = (current_sum - prior_sum == 15 and isnotempty(prior_sum))
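Note that the example above uses a 15-row (15-second) offset for brevity; with one sample per second, as described in the question, a full 15-minute window corresponds to 900 rows. A sketch of that adaptation, regenerating the toy data over a full hour so a 900-row window can actually fill:
let T = view(M:string) {
    range Timestamp from datetime(2022-01-10 13:00:00) to datetime(2022-01-10 14:00:00) step 1s
    | extend machine = M
    | extend status = iif(rand() < 0.0005, 0, 1)   // mostly 1, with occasional 0s
};
union T("A"), T("B")
| sort by machine asc, Timestamp asc
| extend current_sum = row_cumsum(status, machine != prev(machine))
| extend prior_sum = iif(machine == prev(machine, 900), prev(current_sum, 900), int(null))
| extend should_alert = (current_sum - prior_sum == 900 and isnotempty(prior_sum))
| where should_alert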

Update element on map depending on other update

I want to update a map by doing the following:
def updateInfo() do
person = %{ person | name : "new name", age : person.age +1, observation : changeObs(person.age)}
end
def changeObs(age), when age >= 18, do: "Adult"
def changeObs(age), do: "kid"
If I call updateInfo() and the age is 17, I would expect the person's observation to change to "Adult". But it is not working. I thought the updates of the map were done sequentially, but apparently not, so I cannot rely on the age already being 18. I can do this if I split the update like so:
person = %{ person | name : "new name", age : person.age +1}
person = %{ person | observation : changeObs(person.age)}
Is there a way to keep the whole update in one line, relying on previous updates of the attributes in the map?
Well, as far as I can tell, you can just cache the new value in a variable:
new_age = person.age + 1
person = %{ person | name: "new name", age: new_age, observation: changeObs(new_age)}
Or you could pipe the changes like this:
person
|> Map.put(:name, "new_name")
|> Map.put(:age, new_age)
|> Map.put(:observation, changeObs(new_age))
First of all, the syntax for updating maps uses colons, not equal signs. Your code raises a SyntaxError exception.
There are plenty of ways to accomplish the task as a one-liner:
person = with age <- person.age + 1,
do: %{person | age: age,
observation: (if age >= 18, do: "Adult", else: "Kid")}
Or:
person = (age = person.age + 1;
%{person | age: age,
observation: (if age >= 18, do: "Adult", else: "Kid")})
Or a pipe chain:
{_, person} = person
|> Map.put(:age, person.age + 1)
|> Map.get_and_update(:age, fn age ->
{age, age + 1}
end)
The most idiomatic would be the last one. My fave would be the first one.

Marketing Channel Flow Map

I have data on all marketing engagements (links clicked, etc.), their 'marketing channel', and their 'engagement position'.
Engagement positions are the following: first touch [the first time they ever engage with us], lead create [when they fill a form and give us enough info], opportunity create [the engagement that happened right before an opportunity was created], and closed won [the engagement that happened right before they signed and purchased].
What I want to do is take these 'paths' through our marketing channels and create a flow map which shows all the possible marketing paths someone has taken.
The data I have contains the ID of the engagement, the channel, and the position, like so:
______________________________
| id | channel | position |
| 1 | direct | FT |
| 1 | SEM | LC |
| 1 | email | OC |
| 1 | video | CW |
______________________________
That would be an example of one prospect's 'marketing path', and I have a couple hundred thousand of those unique paths. This particular lead would have gone direct > SEM > email > video, and this would be one path.
I'd like to map this out by having the channels be the 'destinations' and the positions determine the order of movement, with the most common path being the boldest (or brightest) and the least common being the least bold (or the flattest color), probably done in ggplot2.
I understand this is a bit broad, but I have very, very limited experience in visualizing a 'mapping' type of data set, so I don't even know which packages would be useful to me.
I am using R.
Here's a try using ggplot. First, make some example data:
library(tidyverse)
tbl1 <- tibble(
id=1:100,
channel = sample(c("direct", "SEM", "email", "video"),
size=100, replace=TRUE, prob=c(.1,.2,.3,.4)),
position = "1-FT")
tbl2 <- tibble(
id=1:100,
channel = sample(c("direct", "SEM", "email", "video"),
size=100, replace=TRUE, prob=c(.2,.1,.3,.4)),
position = "2-LC")
tbl3 <- tibble(
id=1:100,
channel = sample(c("direct", "SEM", "email", "video"),
size=100, replace=TRUE, prob=c(.3,.2,.1,.4)),
position = "3-OC")
tbl4 <- tibble(
id=1:100,
channel = sample(c("direct", "SEM", "email", "video"),
size=100, replace=TRUE, prob=c(.4, .3,.2,.1)),
position = "4-CW")
tbl= bind_rows(tbl1, tbl2, tbl3, tbl4)
Then, make an example graph:
ggplot(tbl, aes(x=position, y=channel, group=id)) +
geom_line(alpha=.1, size=3)
I think it would be cooler to vary the size by the count; another option would be to use a color scale with the count. Here, I'm using a single alpha value as a hack for a scale.
