I am trying to track some network query responses.
In some cases, the measured data is way out of the norm, around 5x the normal values. Maybe a developer was debugging something, but that data is still getting logged.
Is there a way to exclude outlier data points in Kusto, so they don't skew the p95/p5 numbers?
This can be achieved by first computing the 95th percentile of your metric in TableX:
let Percentile95 = toscalar(
TableX
| summarize percentile(metric, 95));
Then you can use this value to filter out the outliers:
TableX
| where metric <= Percentile95
To trim both the bottom and top 5% of statistical outliers, you would therefore use:
let Percentile05 = toscalar(
TableX
| summarize percentile(metric, 5));
let Percentile95 = toscalar(
TableX
| summarize percentile(metric, 95));
TableX
| where metric <= Percentile95 and metric >= Percentile05
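If you want to sanity-check the trimming logic outside of Kusto, here is a small illustrative Python sketch of the same two-sided percentile trim (nearest-rank percentiles; the data is made up):

```python
# Illustrative two-sided percentile trim in plain Python (not Kusto);
# `metric` and the 5/95 cut points mirror the query above.
def percentile(values, p):
    """Nearest-rank percentile of `values` (0 <= p <= 100)."""
    s = sorted(values)
    k = int(round(p / 100 * (len(s) - 1)))
    return s[k]

def trim_outliers(values, lo=5, hi=95):
    """Keep only the values within the [lo, hi] percentile band."""
    p_lo, p_hi = percentile(values, lo), percentile(values, hi)
    return [v for v in values if p_lo <= v <= p_hi]

metric = [10, 11, 12, 13, 12, 11, 10, 12, 11, 13,
          10, 12, 11, 13, 12, 10, 11, 12, 13, 11, 60]  # 60 is a ~5x outlier
trimmed = trim_outliers(metric)
```

With the outlier removed, a subsequent p95 computed over `trimmed` reflects the normal traffic rather than the debug spike.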
I am trying to make a line graph in Azure Data Explorer. I have a single table, and I want each line in the graph to come from a column in that table.
Using a single column works just fine with the query below:
scandevicedata
| where SuccessFullScan == 1
| summarize SuccessFulScans = count() by scans = bin(todatetime(TransactionTimeStampUtc), 30s)
The problem is that I now want to add a second column from the same table, like this:
scandevicedata
| where UnSuccessFullScan == 1
| summarize UnSuccessFulScans = count() by scans = bin(todatetime(TransactionTimeStampUtc), 30s)
As you can see, the first query counts successful scans and the second counts unsuccessful scans. I want to combine them in the same output and graph both, but I can't figure out how, since they are different columns.
How can I achieve this?
You can use the countif() aggregation function:
scandevicedata
| summarize
Successful = countif(SuccessFullScan == 1),
Unsuccessful = countif(UnSuccessFullScan == 1)
by bin(todatetime(TransactionTimeStampUtc), 30s)
| render timechart
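The same conditional-count-per-time-bin idea can be sketched in plain Python (illustrative only; the timestamps and flags below are made up):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def bin_30s(ts):
    """Floor a timestamp to its 30-second bin, like bin(ts, 30s) in Kusto."""
    return ts - timedelta(seconds=ts.second % 30, microseconds=ts.microsecond)

# Rows mimic scandevicedata: (timestamp, success flag, failure flag).
rows = [
    (datetime(2021, 1, 1, 0, 0, 5), 1, 0),
    (datetime(2021, 1, 1, 0, 0, 20), 0, 1),
    (datetime(2021, 1, 1, 0, 0, 40), 1, 0),
]

counts = defaultdict(lambda: {"Successful": 0, "Unsuccessful": 0})
for ts, ok, fail in rows:
    b = bin_30s(ts)
    counts[b]["Successful"] += 1 if ok == 1 else 0      # countif(SuccessFullScan == 1)
    counts[b]["Unsuccessful"] += 1 if fail == 1 else 0  # countif(UnSuccessFullScan == 1)
```

Each bin ends up with both counters in the same row, which is exactly what lets the timechart draw two lines from one result set.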
I'm trying to make a table with these columns
type | count
I tried this with no luck:
exceptions
| where timestamp > ago(144h)
| extend
type = type, count = summarize count() by type
| limit 100
Any idea on what I'm doing wrong?
You should do this instead:
exceptions
| where timestamp > ago(144h)
| summarize count = count() by type
| limit 100
Explanation:
You should use extend when you want to add new columns (or replace existing ones) in the result, for example extend day_of_month = dayofmonth(Timestamp). The record count stays exactly the same. See the extend documentation for more info.
You should use summarize when you want to aggregate multiple records (so the record count after summarize will usually be smaller than the original), as in your case. See the summarize documentation for more info.
By the way, instead of 144h you can use 6d, which is exactly the same, but is more natural to the human eye :)
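If the extend/summarize distinction is easier to see in code, here is an illustrative Python sketch (the field names are made up): an extend-like step keeps the record count, while a summarize-like grouping reduces it to one row per key.

```python
from collections import Counter

exceptions = [
    {"timestamp": "2021-01-01", "type": "NullReferenceException"},
    {"timestamp": "2021-01-02", "type": "TimeoutException"},
    {"timestamp": "2021-01-03", "type": "NullReferenceException"},
]

# extend: add a column to every record; the record count is unchanged.
extended = [dict(row, day=row["timestamp"][-2:]) for row in exceptions]

# summarize count() by type: one output row per distinct key.
summarized = Counter(row["type"] for row in exceptions)
```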
How do I write my query to create the data result in the proper format to be plotted in multiple panels using the | render timechart with (ysplit=panels) output?
Looking at Microsoft's examples, I need my IPPrefix column to produce multiple columns in a single row. Instead, my query produces a separate row for each grouping of IPPrefix.
I have the following query:
let startTime = datetime('2020.07.23 20:00:00');
let endTime = datetime('2020.07.23 23:59:00');
AzureDiagnostics
| where TimeGenerated between (startTime..endTime)
| where ResourceType == "APPLICATIONGATEWAYS" and OperationName == "ApplicationGatewayAccess"
| where requestUri_s contains "api/auth/ping"
| extend IPParts = split(clientIP_s, '.')
| extend IPPrefix = strcat(IPParts[0], '.', IPParts[1], '.', IPParts[2])
| make-series Count = count() on TimeGenerated in range(startTime, endTime, 5m) by IPPrefix
//| summarize AggregatedValue = count() by IPPrefix, bin(TimeGenerated, 1m)
| render timechart with (ysplit=panels)
I want each IPPrefix series rendered in its own panel, but instead all the y-series are plotted in a single panel.
I suppose I am not using make-series correctly to produce the result I need, but I have not been able to apply it in a different way that works.
I realized that I needed to pivot the data before rendering. I also learned there is a limit of five panels with the ysplit=panels option, so I had to limit the series to five and then pivot the aggregated data.
...
| make-series Count = count() on TimeGenerated in range(startTime, endTime, 1m) by IPPrefix
| take 5
| evaluate pivot(IPPrefix, any(Count), TimeGenerated)
| render timechart with(ysplit=panels)
Resulting chart with five panels.
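For anyone unfamiliar with what the pivot step accomplishes, here is an illustrative Python sketch (times and prefixes are made up): it reshapes long-format rows, one per (time, IPPrefix) pair, into wide-format rows with one column per IPPrefix, which is the shape ysplit=panels needs.

```python
# Long-to-wide pivot: one input row per (time bin, IP prefix) pair,
# one output row per time bin with a column per IP prefix.
long_rows = [
    ("00:00", "10.0.1", 7),
    ("00:00", "10.0.2", 3),
    ("00:05", "10.0.1", 9),
]

wide = {}
for t, prefix, count in long_rows:
    wide.setdefault(t, {})[prefix] = count
```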
I'm optimizing a query in SQLite3. It uses CASE WHEN, GROUP BY, and COUNT, but it is very slow (about 14 seconds).
Here is my database file information.
size: about 2 GB
rows: about 3 million
columns: 55
What can I do to improve the query's performance?
Is there a better query for this result?
Please help me. Thanks.
select
case
when score = 100 then 'A'
when score < 100 and score >= 40 then 'B'
else 'C'
end as range,
count(*) as count
from grade_info
where type < 9 and
(date >= '2019-07-09 00:00:00' and date <= '2019-07-09 23:59:59') and
is_new = 1
group by
case
when score = 100 then 'A'
when score < 100 and score >= 40 then 'B'
else 'C'
end;
Table grade_info has a multi-column index: (type, date, is_new, score).
The conditions on the columns (type, date, is_new) are always used in this query. Here is the EXPLAIN QUERY PLAN result:
selectid | order | from | detail
---------+-------+------+-----------------------------------------------------------------
0        | 0     | 0    | SEARCH TABLE grade_info USING INDEX idx_03 (type<?) (~2777 rows)
0        | 0     | 0    | USE TEMP B-TREE FOR GROUP BY
and I want the result to look like this:
range | count
------+------
A     | 5124
B     | 124
C     | 12354
As Shawn suggests, try changing the index to have the date column as the first column:
CREATE INDEX [idx_cover] ON [grade_info] ([date], [is_new], [type], [score]);
SQLite allows referring to aliased expressions in the WHERE and GROUP BY clauses, so you can simply say GROUP BY range rather than repeating the CASE expression. This probably won't change the efficiency, but it makes the query shorter and more readable.
If you run ANALYZE as MikeT suggests, the execution plan should change to say "COVERING INDEX...". If I understand correctly, that indicates that the entire query can be executed by traversing the single multi-column index without going back to the table data.
Try date BETWEEN '2019-07-09 00:00:00' AND '2019-07-09 23:59:59'.
Finally, CASE ... WHEN is short-circuited, so place the more likely cases first to avoid unnecessary comparisons, and eliminate redundant conditional checks: if a previous condition has already ruled out a range, there is no need to re-test it in the next branch. (If you have already ruled out score = 100, checking score < 100 is unnecessary, assuming all scores lie in the range 0 to 100.) For instance, if scores are uniformly distributed, the following could be faster, possibly eliminating over 17,000 conditional checks.
SELECT
CASE
WHEN score < 40 then 'C'
WHEN score < 100 then 'B' -- already tested to be >= 40
ELSE 'A' -- already tested to be >= 100
END AS range,
count(*) AS count
FROM grade_info
WHERE type < 9 AND
(date BETWEEN '2019-07-09 00:00:00' AND '2019-07-09 23:59:59') AND
is_new = 1
GROUP BY
range;
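To sanity-check the rewritten query, here is a self-contained sketch using Python's sqlite3 module against a tiny in-memory table. The data is synthetic; only the schema, the covering index, and the query shape follow the question above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE grade_info (type INTEGER, date TEXT, is_new INTEGER, score INTEGER)"
)
# Covering index with the date column first, as suggested above.
conn.execute(
    "CREATE INDEX idx_cover ON grade_info (date, is_new, type, score)"
)
rows = [
    (1, '2019-07-09 10:00:00', 1, 100),  # -> A
    (2, '2019-07-09 11:00:00', 1, 55),   # -> B
    (3, '2019-07-09 12:00:00', 1, 10),   # -> C
    (4, '2019-07-10 09:00:00', 1, 99),   # outside the date range, excluded
]
conn.executemany("INSERT INTO grade_info VALUES (?, ?, ?, ?)", rows)

result = conn.execute("""
    SELECT CASE
             WHEN score < 40 THEN 'C'
             WHEN score < 100 THEN 'B'
             ELSE 'A'
           END AS range,
           count(*) AS count
    FROM grade_info
    WHERE type < 9
      AND date BETWEEN '2019-07-09 00:00:00' AND '2019-07-09 23:59:59'
      AND is_new = 1
    GROUP BY range
    ORDER BY range
""").fetchall()
```

Running EXPLAIN QUERY PLAN on the SELECT after an ANALYZE should confirm whether the covering index is used on your real data.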
I've got a simple KQL query that plots the (log of the) count of all exceptions over 90 days:
exceptions
| where timestamp > ago(90d)
| summarize log(count()) by bin(timestamp, 1d)
| render timechart
What I'd like to do is add some reference lines to the timechart this generates. Based on the docs, this is pretty straightforward:
| extend ReferenceLine = 8
The complicating factor is that I'd like these reference lines to be based on aggregations of the value I'm plotting. For instance, I'd like a reference line each for the minimum, the mean, and the 3rd quartile.
Focusing on the first of these (the minimum), it turns out that you can't use min() outside of summarize(); nor can I use it within an extend().
I was drawn to min_of(), but this expects a list of arguments instead of a column. I'm thinking I could probably expand the column into a series of values, but this feels hacky and would fall down beyond a certain number of values.
What's the idiomatic way of doing this?
You could try something like the following:
exceptions
| where timestamp > ago(90d)
| summarize c = log(count()) by bin(timestamp, 1d)
| as hint.materialized=true T
| extend _min = toscalar(T | summarize min(c)),
_perc_50 = toscalar(T | summarize percentile(c, 50))
| render timechart
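Conceptually, the trick is to compute the series once and then attach scalar aggregates of it as constant columns on every row, which is what the timechart renders as flat reference lines. A plain-Python illustration of that idea (the daily counts are made up):

```python
import math

# Daily exception counts (synthetic), plotted as log(count) like the query.
daily_counts = [120, 95, 300, 180, 240]
series = [math.log(c) for c in daily_counts]

# Scalar aggregates of the series, computed once over the whole series.
_min = min(series)
_perc_50 = sorted(series)[len(series) // 2]  # median of an odd-length series

# Attach the scalars as constant columns on every row, like the extend step.
rows = [{"c": c, "_min": _min, "_perc_50": _perc_50} for c in series]
```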