I want to show a graph of the minimum value, the maximum value, and the difference between the two for each timeslice.
It works OK for min and max:
| parse "FromPosition *)" as FromPosition
| timeslice 2h
| max(FromPosition), min(FromPosition) group by _timeslice
but I couldn't find the correct way to specify the difference.
e.g.
| (max(FromPosition) - min(FromPosition)) as diffFromPosition by _timeslice
returns the error: Unexpected token 'b' found.
I've tried a few different combinations to declare them on different lines as suggested on https://help.sumologic.com/05Search/Search-Query-Language/aaGroup. e.g.
| int(FromPosition) as intFromPosition
| max(intFromPosition) as maxFromPosition , min(intFromPosition) as minFromPosition
| (maxFromPosition - minFromPosition) as diffFromPosition
| diffFromPosition by _timeslice
without success.
Can anyone suggest the correct syntax?
Try this:
| parse "FromPosition *)" as FromPosition
| timeslice 2h
| max(FromPosition), min(FromPosition) by _timeslice
| _max - _min as diffFromPosition
| fields _timeslice, diffFromPosition
The group by there is for the min and max aggregate functions, telling them what range to work over; it is not a group by for the overall search query. That's why you were getting the syntax errors, and it's one reason I prefer to just use by, as above.
For these kinds of queries I usually prefer a box plot where you would just do:
| min(FromPosition), pct(FromPosition, 25), pct(FromPosition, 50), pct(FromPosition, 75), max(FromPosition) by _timeslice
Then select box plot as the graph type. It looks great on a dashboard and provides a lot of detailed information about deviation and the like at a glance.
Related
I'm trying to make a table with these columns
type | count
I tried this with no luck:
exceptions
| where timestamp > ago(144h)
| extend
type = type, count = summarize count() by type
| limit 100
Any idea on what I'm doing wrong?
You should do this instead:
exceptions
| where timestamp > ago(144h)
| summarize count = count() by type
| limit 100
Explanation:
You should use extend when you want to add new columns (or replace existing ones) in the result, for example extend day_of_month = dayofmonth(Timestamp); you keep exactly the same record count in this case. See more info in the doc.
You should use summarize when you want to aggregate multiple records (so the record count after the summarize will usually be smaller than the original record count), as in your case. See more info in the doc.
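A quick way to see the difference on the same table (a sketch; it applies dayofmonth() to the question's timestamp column and groups the summarize by both columns just for illustration):
exceptions
| where timestamp > ago(6d)
| extend day_of_month = dayofmonth(timestamp) // extend: adds a column, record count unchanged
| summarize count = count() by type, day_of_month // summarize: collapses to one record per group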
By the way, instead of 144h you can use 6d, which is exactly the same, but is more natural to the human eye :)
How do I write my query to create the data result in the proper format to be plotted in multiple panels using the | render timechart with (ysplit=panels) output?
Looking at Microsoft's examples, I need my IPPrefix column to produce multiple columns in a single row. Instead, my query is producing a separate row for each grouping in IPPrefix.
I have the following query:
let startTime = datetime('2020.07.23 20:00:00');
let endTime = datetime('2020.07.23 23:59:00');
AzureDiagnostics
| where TimeGenerated between (startTime..endTime)
| where ResourceType == "APPLICATIONGATEWAYS" and OperationName == "ApplicationGatewayAccess"
| where requestUri_s contains "api/auth/ping"
| extend IPParts = split(clientIP_s, '.')
| extend IPPrefix = strcat(IPParts[0], '.', IPParts[1], '.', IPParts[2])
| make-series Count = count() on TimeGenerated in range(startTime, endTime, 5m) by IPPrefix
//| summarize AggregatedValue = count() by IPPrefix, bin(TimeGenerated, 1m)
| render timechart with (ysplit=panels)
I want the result to be one panel per IPPrefix series, but instead all the y-series are plotted in a single panel. I suppose that I am not using make-series in the correct way to produce the result I need, but I have not been able to apply it differently to make it work.
I realized that I needed to pivot on the data before rendering. I also learned there is a limit of 5 panels on the ysplit=panels option. I had to limit the series to five and then perform a pivot on the aggregated data.
...
| make-series Count = count() on TimeGenerated in range(startTime, endTime, 1m) by IPPrefix
| take 5
| evaluate pivot(IPPrefix, any(Count), TimeGenerated)
| render timechart with(ysplit=panels)
Resulting chart with five panels.
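For intuition on why the pivot is needed: make-series emits one row per IPPrefix with Count as an array, so the chart gets one y-series per row and draws them all in a single panel. pivot transposes this so each IPPrefix value becomes its own column, which is the "multiple columns in a single row" shape that ysplit=panels expects: one panel per y-column.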
I've got a simple KQL query that plots the (log of the) count of all exceptions over 90 days:
exceptions
| where timestamp > ago(90d)
| summarize log(count()) by bin(timestamp, 1d)
| render timechart
What I'd like to do is add some reference lines to the timechart this generates. Based on the docs, this is pretty straightforward:
| extend ReferenceLine = 8
The complicating factor is that I'd like these reference lines to be based on aggregations of the value I'm plotting. For instance, I'd like a reference line for the minimum, mean, and 3rd quartile values.
Focusing on the first of these (minimum), it turns out that you can't use min() outside of summarize(), and it doesn't work inside an extend() either.
I was drawn to min_of(), but this expects a list of arguments instead of a column. I'm thinking I could probably expand the column into a series of values, but this feels hacky and would fall down beyond a certain number of values.
What's the idiomatic way of doing this?
you could try something like the following:
exceptions
| where timestamp > ago(90d)
| summarize c = log(count()) by bin(timestamp, 1d)
| as hint.materialized=true T
| extend _min = toscalar(T | summarize min(c)),
_perc_50 = toscalar(T | summarize percentile(c, 50))
| render timechart
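The hint.materialized=true caches the named result T, so the toscalar() subqueries don't re-evaluate the summarize. The same pattern covers the other reference lines mentioned in the question; the mean and third quartile follow the same shape, e.g. by adding another extend before the render (a sketch):
| extend _avg = toscalar(T | summarize avg(c)),
    _perc_75 = toscalar(T | summarize percentile(c, 75))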
I have a stream of numbers, and on every cycle I need to compute the average of the last N of them. This can, of course, be solved using an array where I store the last N numbers: on every cycle I shift it, add the new number, and compute the average.
N = 3
+---+-----+
| a | avg |
+---+-----+
| 1 | |
| 2 | |
| 3 | 2.0 |
| 4 | 3.0 |
| 3 | 3.3 |
| 3 | 3.3 |
| 5 | 3.7 |
| 4 | 4.0 |
| 5 | 4.7 |
+---+-----+
The first N numbers (where there "isn't enough data for computing the average") don't interest me much, so the results there may be anything/undefined.
My question is: can this be done without using an array, that is, with a static amount of memory? If so, how?
I'll do the coding myself - I just need to know the theory.
Thanks
Think of this as a black box containing some state. If you control the input stream, you can draw conclusions about that state. In your sliding-window, array-based approach, it is fairly obvious that if you feed a bunch of zeros into the algorithm after the original input, you get a series of averages with a decreasing number of non-zero values taken into account. The last of these takes just one original non-zero value into account, so if you multiply it by N you get the last input back. Using that and the second-to-last output, which accounts for two non-zero inputs, you can reconstruct the second-to-last input, and so on.
So essentially your algorithm needs to maintain sufficient state to reconstruct the last N elements of the input, at least if you formulate it as an online algorithm. I don't think an offline algorithm can do any better, except if you allow it to read the input multiple times, but I don't have as strong an argument for that.
Of course, in some theoretical models you can avoid the array and, say, encode all the state into a single arbitrary-length integer, but that's just cheating the theory and makes no difference in practice.
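For reference, the array-based approach from the question can at least be made O(1) per update (while still keeping the N elements of state argued for above) with a running sum over a ring buffer. A minimal sketch in Python (names are mine):
class MovingAverage:
    def __init__(self, n):
        self.n = n
        self.buf = [0.0] * n  # ring buffer holding the last n inputs
        self.i = 0            # index of the slot to overwrite next
        self.total = 0.0      # running sum of the buffer contents
        self.count = 0        # total inputs seen so far

    def push(self, x):
        self.total += x - self.buf[self.i]  # drop the oldest value, add the new one
        self.buf[self.i] = x
        self.i = (self.i + 1) % self.n
        self.count += 1
        # undefined for the first n-1 inputs, as in the question
        return self.total / self.n if self.count >= self.n else None
Feeding it the question's input (1, 2, 3, 4, 3, 3, 5, 4, 5 with N = 3) reproduces the table above: None, None, 2.0, 3.0, 3.33, and so on.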
I am quite a beginner in data warehouse design. I have read some theory, but recently met a practical problem with the design of an OLAP cube. I use a star schema.
Let's say I have 2 dimension tables and 1 fact table:
Dimension Gazetteer:
dimension_id
country_name
province_name
district_name
Dimension Device:
dimension_id
device_category
device_subcategory
Fact table:
gazetteer_id
device_dimension_id
hazard_id (measure column)
area_m2 (measure column)
A "business object" (which is a mine field actually) can have multiple devices, is located in a single location (Gazetteer) and ocuppies X square meters.
So in order to know which device categories there are, I created a fact per each device in hazard like this:
+--------------+---------------------+-----------------------+-----------+
| gazetteer_id | device_dimension_id | hazard_id | area_m2 |
+--------------+---------------------+-----------------------+-----------+
| 123 | 321 | 0a0a-502c-11aa1331e98 | 6000 |
+--------------+---------------------+-----------------------+-----------+
| 123 | 654 | 0a0a-502c-11aa1331e98 | 6000 |
+--------------+---------------------+-----------------------+-----------+
| 123 | 987 | 0a0a-502c-11aa1331e98 | 6000 |
+--------------+---------------------+-----------------------+-----------+
I defined a measure "number of hazards" as distinct-count of hazard_id.
I also defined a "total area occupied" measure as a sum of area_m2.
Now I can use the dimension gazetteer and device and know how many hazards there are with given dimension members.
But there is a problem with area_m2: because it is defined as a sum, it gives a value n times higher than the actual area, where n is the number of devices in the hazard object. For example, the data above would give 18,000 m2.
How would you solve this problem?
I am using the Pentaho stack.
Thanks in advance
If a hazard_id is a minefield, and you're looking at mines-by-region (gazetteer) and size-of-minefields-by-gazetteer, maybe you could make a Hazard dimension which holds the area of the hazard; or possibly make a Null-device entry in the Device dimension table, where only the Null-device entry gets area_m2 set and the real devices get area_m2 = 0.
If you need to answer queries like "total area of minefields containing device 321", the second approach isn't going to answer them easily, which suggests that making a Hazard dimension is the better approach.
I would also consider adding a device-count fact, which could hold the number of devices of each type per hazard.
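For the Hazard-dimension approach, the reworked schema might look something like this (a sketch; names are illustrative):
Dimension Hazard:
hazard_dimension_id
hazard_code
area_m2
Fact table:
gazetteer_id
device_dimension_id
hazard_dimension_id
device_count (measure column)
"Number of hazards" remains a distinct count of hazard_dimension_id, while "total area occupied" is now read from the hazards involved rather than summed across device rows, so it is no longer multiplied by the device count.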