Parse data in Kusto - azure-data-explorer

I am trying to parse the below data in Kusto. Need help.
[[ObjectCount][LinkCount][DurationInUs]]
[ChangeEnumeration][[88][9][346194]]
[ModifyTargetInLive][[3][6][595903]]
Need generic implementation without any hardcoding.

Ideally, you'd be able to change the component that produces the source data in that format to use a standard format (e.g. CSV, JSON) instead.
The following could work, but you should consider it very inefficient:
let T = datatable(s:string)
[
'[[ObjectCount][LinkCount][DurationInUs]]',
'[ChangeEnumeration][[88][9][346194]]',
'[ModifyTargetInLive][[3][6][595903]]',
];
let keys = toscalar(
T
| where s startswith "[["
| take 1
| project extract_all(@'\[([^\[\]]+)\]', s)
);
T
| where s !startswith "[["
| project values = extract_all(@'\[([^\[\]]+)\]', s)
| mv-apply with_itemindex = i keys on (
extend Category = tostring(values[0]), p = pack(tostring(keys[i]), values[i + 1])
| summarize b = make_bag(p) by Category
)
| project-away values
| evaluate bag_unpack(b)
--->
| Category | ObjectCount | LinkCount | DurationInUs |
|--------------------|-------------|-----------|--------------|
| ChangeEnumeration | 88 | 9 | 346194 |
| ModifyTargetInLive | 3 | 6 | 595903 |
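If the parsing can move out of Kusto and into the component that produces or forwards the data, the same idea is easy to sketch in Python. This is only an illustration and not part of the original answer: the function name `parse_bracket_rows` is hypothetical, and it assumes that the header is the first line starting with `[[` and that every other line has the shape `[Category][[v1][v2]...]`.

```python
import re

# Matches the innermost bracketed tokens, like the regex in the query above.
TOKEN = re.compile(r"\[([^\[\]]+)\]")

def parse_bracket_rows(lines):
    """Pair the header keys with the values of every data line (hypothetical helper)."""
    # The header is assumed to be the first line that starts with "[[".
    header = next(line for line in lines if line.startswith("[["))
    keys = TOKEN.findall(header)
    rows = []
    for line in lines:
        if line.startswith("[["):
            continue  # skip header lines
        # First token is the category, the rest are the values in header order.
        category, *values = TOKEN.findall(line)
        row = {"Category": category}
        row.update(zip(keys, values))
        rows.append(row)
    return rows

sample = [
    "[[ObjectCount][LinkCount][DurationInUs]]",
    "[ChangeEnumeration][[88][9][346194]]",
    "[ModifyTargetInLive][[3][6][595903]]",
]
rows = parse_bracket_rows(sample)
```

Nothing is hardcoded here: the column names come from the header line, and the values stay strings until the caller converts them.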

Related

MariaDB DATETIME Index not working with Between FROM_UNIXTIME()

I have a table with a DATETIME field, which is indexed by a BTree. Now I want to query it with the following statement:
SELECT
count(us.CITY) as metric,
us.CITY as Name,
us.LATITUDE as latitude,
us.LONGITUDE as longitude
FROM
FACT
LEFT JOIN
USER us
ON
us.ID_USER = FACT.USER
WHERE
ASSESSMENT_DATE BETWEEN FROM_UNIXTIME(1601568552) AND FROM_UNIXTIME(1604028277)
GROUP BY us.CITY, us.LATITUDE, us.LONGITUDE;
EXPLAIN:
+------+-------------+-------+--------+----------------------------+---------+---------+------------------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+----------------------------+---------+---------+------------------------------+--------+----------------------------------------------+
| 1 | SIMPLE | FACT | ALL | INDEX_FACT_ASSESSMENT_DATE | NULL | NULL | NULL | 762621 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | us | eq_ref | PRIMARY | PRIMARY | 46 | dwh0.FACT.USER,dwh0.FACT.ENV | 1 | |
+------+-------------+-------+--------+----------------------------+---------+---------+------------------------------+--------+----------------------------------------------+
2 rows in set (0.001 sec)
Interestingly, if I only change the dates manually into DATETIME format strings, it uses the index. But in my opinion the FROM_UNIXTIME() function should return exactly the same thing...
SELECT
count(us.CITY) as metric,
us.CITY as Name,
us.LATITUDE as latitude,
us.LONGITUDE as longitude
FROM
FACT
LEFT JOIN
USER us
ON
us.ENV = FACT.ENV AND us.ID_USER = FACT.USER
WHERE
-- ASSESSMENT_DATE BETWEEN FROM_UNIXTIME(1596649101) AND FROM_UNIXTIME(1599108827)
ASSESSMENT_DATE BETWEEN '2020-08-05 11:30:11.987' AND '2020-09-03 11:30:11.987'
GROUP BY us.CITY, us.LATITUDE, us.LONGITUDE;
EXPLAIN:
+------+-------------+-------+--------+----------------------------+----------------------------+---------+------------------------------+--------+--------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+----------------------------+----------------------------+---------+------------------------------+--------+--------------------------------------------------------+
| 1 | SIMPLE | FACT | range | INDEX_FACT_ASSESSMENT_DATE | INDEX_FACT_ASSESSMENT_DATE | 5 | NULL | 132008 | Using index condition; Using temporary; Using filesort |
| 1 | SIMPLE | us | eq_ref | PRIMARY | PRIMARY | 46 | dwh0.FACT.USER,dwh0.FACT.ENV | 1 | |
+------+-------------+-------+--------+----------------------------+----------------------------+---------+------------------------------+--------+--------------------------------------------------------+
2 rows in set (0.001 sec)
Has anyone run into such a problem? The WHERE clause is generated by Grafana, so I cannot change that, but I can change the rest if it makes a difference.
Thanks for suggestions!
Sorry for the bother. After around 10^5 more inserts, it works in both cases. Maybe it was just bad luck.

Is there a KQL query to limit the number of sub results I get per a particular category?

I’m trying to generate a query where I limit the number of sub results I get per a particular category, and could use some help on if there is a good function for this.
Quick Example:
| ID | Category | Value | A bunch of other important columns |
|-----------|-----------------|--------------|-------------------------------------------|
| 1 | A | GUID | |
| 2 | A | GUID | |
| 3 | A | GUID | |
| 4 | A | GUID | |
| 5 | B | GUID | |
| 6 | B | GUID | |
I want to return only N GUIDs per category. (Largely because I’m hitting the 64MB Kusto query limits for some Categories that won’t be useful anyway)
The Top-nested operator looks good at first, BUT I don’t want to do any aggregation, and it filters out other important columns. Per the note on the page, I can use Ignore=max(1) to remove the aggregation, then do some serializing of all my other columns to a certain value, then unpack after the filter. But that feels like I’m doing something very wrong.
I've also tried something like:
| partition by Category ( top 3 by Value)
But it's limited to 64 partitions, and I need closer to 500.
Any idea of a good pattern to do this?
Here you go:
let NumItemsPerCategory = 3;
datatable(ID:long, Category:string, Value:guid)
[
1, "A", guid(40b73f8f-78d2-4eae-bd5b-b3e00f38ac33),
2, "A", guid(043ee507-aadf-4453-bcc6-d8f4f541b043),
3, "A", guid(f71d3cc0-ce46-474f-9dcd-f3883fa08859),
4, "A", guid(bf259fc8-e9fe-4a99-a296-ca81e1fa250a),
5, "B", guid(d8ee3ac7-da76-4e87-a9ed-e5a37c943ad2),
6, "B", guid(282e74ff-3b71-407c-a2a7-92bb1cb17b27),
]
| summarize PackedItems = make_list(pack_all(), NumItemsPerCategory) by Category
| project-away Category
| mv-expand PackedItem = PackedItems
| evaluate bag_unpack(PackedItem)
| project-away PackedItems
Result:
| ID | Category | Value |
|----|----------|--------------------------------------|
| 1 | A | 40b73f8f-78d2-4eae-bd5b-b3e00f38ac33 |
| 2 | A | 043ee507-aadf-4453-bcc6-d8f4f541b043 |
| 3 | A | f71d3cc0-ce46-474f-9dcd-f3883fa08859 |
| 5 | B | d8ee3ac7-da76-4e87-a9ed-e5a37c943ad2 |
| 6 | B | 282e74ff-3b71-407c-a2a7-92bb1cb17b27 |
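For readers who want to see the "at most N rows per category, all columns kept" logic outside KQL, here is a minimal Python sketch. It is an illustration under assumptions, not the answer's method: the helper name `first_n_per_category` is hypothetical, rows are plain dicts, and "the first N rows encountered" is chosen as the selection rule.

```python
from collections import defaultdict

def first_n_per_category(rows, n, key="Category"):
    """Keep at most n rows per category, preserving input order and all columns."""
    seen = defaultdict(int)  # how many rows were already kept per category
    kept = []
    for row in rows:
        if seen[row[key]] < n:
            seen[row[key]] += 1
            kept.append(row)
    return kept

rows = [
    {"ID": 1, "Category": "A"},
    {"ID": 2, "Category": "A"},
    {"ID": 3, "Category": "A"},
    {"ID": 4, "Category": "A"},
    {"ID": 5, "Category": "B"},
    {"ID": 6, "Category": "B"},
]
limited = first_n_per_category(rows, 3)
```

With the sample rows, the fourth row of category A is dropped and both rows of category B are kept, which matches the result table above.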

Cumulative count of occurrences per value in array in Kusto

I'm looking to get the count of query param usage from the query string from page views stored in app insights using KQL. My query currently looks like:
pageViews
| project parsed=parseurl(url)
| project keys=bag_keys(parsed["Query Parameters"])
and the results look like [image not preserved], with each row looking like [image not preserved].
I'm looking to get the count of each value in the list when it is contained in the URL, in order to answer the question "How many times does page appear in the query string?". So the results might look like:
Page | From | ...
1000 | 67 | ...
Thanks in advance
You could try something along the following lines:
datatable(url:string)
[
"https://a.b.c/d?p1=hello&p2=world",
"https://a.b.c/d?p2=world&p3=foo&p4=bar"
]
| project parsed = parseurl(url)
| project keys = bag_keys(parsed["Query Parameters"])
| mv-expand key = ['keys'] to typeof(string)
| summarize count() by key
which returns:
| key | count_ |
|-----|--------|
| p1 | 1 |
| p2 | 2 |
| p3 | 1 |
| p4 | 1 |
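The same counting can be sketched in Python with the standard library, which may help when checking the numbers on a small export. This is a rough equivalent and not part of the original answer: the function name `count_query_keys` is hypothetical, and each key is counted once per URL even if it is repeated inside that URL.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

def count_query_keys(urls):
    """Count in how many URLs each query parameter name appears."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # parse_qs groups repeated parameters under one key,
        # so a key contributes at most 1 per URL.
        counts.update(parse_qs(query, keep_blank_values=True).keys())
    return counts

urls = [
    "https://a.b.c/d?p1=hello&p2=world",
    "https://a.b.c/d?p2=world&p3=foo&p4=bar",
]
key_counts = count_query_keys(urls)
```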

Addition of calculated field in rpivotTable

I want to create a calculated field to use with the rpivotTable package, similar to the functionality seen in Excel.
For instance, consider the following table:
+--------------+--------+---------+-------------+-----------------+
| Manufacturer | Vendor | Shipper | Total Units | Defective Units |
+--------------+--------+---------+-------------+-----------------+
| A | P | X | 173247 | 34649 |
| A | P | Y | 451598 | 225799 |
| A | P | Z | 759695 | 463414 |
| A | Q | X | 358040 | 225565 |
| A | Q | Y | 102068 | 36744 |
| A | Q | Z | 994961 | 228841 |
| A | R | X | 454672 | 231883 |
| A | R | Y | 275994 | 124197 |
| A | R | Z | 691100 | 165864 |
| B | P | X | 755594 | 302238 |
| . | . | . | . | . |
| . | . | . | . | . |
+--------------+--------+---------+-------------+-----------------+
(my actual table has many more columns, both dimensions and measures, time, etc. and I need to define multiple such "calculated columns")
If I want to calculate defect rate (which would be Defective Units/Total Units) and I want to aggregate by either of the first three columns, I'm not able to.
I tried assignment by reference (:=), but that still didn't seem to work and summed up defect rates (i.e., sum(Defective_Units/Total_Units)), instead of sum(Defective_Units)/sum(Total_Units):
myData[, Defect.Rate := Defective_Units / Total_Units]
This ended up giving me defect rates greater than 1. Is there anywhere I can declare a calculated field, which is just a formula evaluated post aggregation?
You're lucky - the creator of pivottable.js foresaw cases like yours (and mine, earlier today) by implementing an aggregator called "Sum over Sum" and a few more, likewise, cf. https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L111 and https://github.com/nicolaskruchten/pivottable/blob/master/src/pivot.coffee#L169.
So we'll use "Sum over Sum" as parameter "aggregatorName", and the columns whose quotient we want in the "vals" parameter.
Here's a meaningless usage example from the mtcars data for reproducibility:
require(rpivotTable)
data(mtcars)
rpivotTable(mtcars,rows="gear", cols=c("cyl","carb"),
aggregatorName = "Sum over Sum",
vals =c("mpg","disp"),
width="100%", height="400px")

R apply script output in different formats for similar inputs

I'm using a double apply function to get a list of p-values for cor.test between any two columns of two tables.
bc_plist<-apply(bc, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
The otud data.frame is 90X11 (90 rows, 11 columns, i.e. dim(otud) is 90 11) and will be used with different data.frames.
bc and hel are both 90X2 data.frames, so for both I get 2*11=22 p-values out of the functions:
bc_plist<-apply(bc, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
hel_plist<-apply(hel, 2, function(x) { apply(otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}}) })
For bc I get an output with dim=NULL: a nested list with one p-value per element, e.g. $axis1$Otu037 (a format that I have always got from these scripts and am happy with).
But for hel I get an output with dim(hel_plist) 11 2: an 11X2 table with the p-values written inside.
Shortened examples of output.
hel_plist
+--------+--------------+--------------+
| | axis1 | axis2 |
+--------+--------------+--------------+
| Otu037 | 1.126362e-18 | 0.01158251 |
| Otu005 | 3.017458e-2 | NULL |
| Otu068 | 0.00476002 | NULL |
| Otu070 | 1.27646e-15 | 5.252419e-07 |
+--------+--------------+--------------+
bc_plist
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
Why is it like that when the input formats are all the same? (Shortened examples)
bc
+-------+-----------+-----------+
| group | axis1 | axis2 |
+-------+-----------+-----------+
| 1B041 | 0.125219 | 0.246319 |
| 1B060 | -0.022412 | -0.030227 |
| 1B197 | -0.088005 | -0.305351 |
| 1B222 | -0.119624 | -0.144123 |
| 1B227 | -0.148946 | -0.061741 |
+-------+-----------+-----------+
hel
+-------+---------------+---------------+
| group | axis1 | axis2 |
+-------+---------------+---------------+
| 1B041 | -0.0667782322 | -0.1660606406 |
| 1B060 | 0.0214470932 | -0.0611351008 |
| 1B197 | 0.1761876858 | 0.0927570627 |
| 1B222 | 0.0681058251 | 0.0549292399 |
| 1B227 | 0.0516864361 | 0.0774155225 |
| 1B235 | 0.1205676221 | 0.0181712761 |
+-------+---------------+---------------+
How could I force my scripts to always produce "flat" outputs as in the case of bc?
OK, the different outputs are caused by the NULL results from the conditional function in the bc_plist case. If I modify the code to replace possible NULLs with NAs, I get 2D tables in every case.
So, to keep things consistent:
bc_nmds_plist<-apply(bc_nmds, 2, function(x) { apply(stoma_otud, 2, function(y) { if (cor.test(x,y,method="spearman", exact=FALSE)$p.value<0.05){cor.test(x,y,method="spearman", exact=FALSE)$p.value}else NA}) })
And I get a 2D table out for bc_nmds_plist too.
So I guess this can be called solved, as I now have a piece of code that produces predictable output on any correct input.
If anyone has an idea how to force the output to conform to the previous bc_plist format instead, I would still be interested, as I actually prefer that form:
$axis1
$axis1$Otu037
[1] 1.247717e-06
$axis1$Otu005
[1] 1.990313e-05
$axis1$Otu068
[1] 5.664597e-07
