Kusto - Materialized View on latest version of rows - azure-data-explorer

I have a table called 'metadata' that contains a list of Parameter and ParameterValue pairs, partitioned by a TestId. Every time a Test is changed, the Test is reingested into Azure Data Explorer with a newer Version.
My overall goal is to:
Define a function (GetTestsFromSearch) that takes a dynamic parameter of key/value pairs and lets me query all Tests (of the latest version) that match those key/value pairs:
{{"Search param1", "Search value1"},{"Search param2", "Search value2"}}
Example:
GetTestsFromSearch({{"ProjectId", "SturnProject"},{"Product Name", "Nacelle"}})
Should return
| TestId | Version |
|---|---|
| fc76aa10-5cf8-447e-95f6-3bd801ef2ed0 | 3 |
| ea5b688c-b61f-4c5b-bb87-af2eac94d454 | 1 |
from the example metadata table below.
Another goal is to create a materialized view that contains only the latest metadata for each Test (explained below the table).
Example of metadata table:

| TestId | TestName | Parameter | ParameterValue | Version |
|---|---|---|---|---|
| fc76aa10-5cf8-447e-95f6-3bd801ef2ed0 | MyTest | ProjectId | SturnProject | 1 |
| fc76aa10-5cf8-447e-95f6-3bd801ef2ed0 | MyTest | Product Category | 2MW | 1 |
| fc76aa10-5cf8-447e-95f6-3bd801ef2ed0 | MyTest | Project Start Date | 2022-02-03 | 1 |
| fc76aa10-5cf8-447e-95f6-3bd801ef2ed0 | MyTest | ProjectId | SturnProject | 2 |
| fc76aa10-5cf8-447e-95f6-3bd801ef2ed0 | MyTest | Product Category | 2MW | 2 |
| fc76aa10-5cf8-447e-95f6-3bd801ef2ed0 | MyTest | Project Start Date | 2022-02-03 | 2 |
| fc76aa10-5cf8-447e-95f6-3bd801ef2ed0 | MyTest | ProjectId | SturnProject | 3 |
| fc76aa10-5cf8-447e-95f6-3bd801ef2ed0 | MyTest | Product Category | 2MW | 3 |
| fc76aa10-5cf8-447e-95f6-3bd801ef2ed0 | MyTest | Project Start Date | 2022-02-03 | 3 |
| ea5b688c-b61f-4c5b-bb87-af2eac94d454 | MyTest | ProjectId | SturnProject | 1 |
| ea5b688c-b61f-4c5b-bb87-af2eac94d454 | MyTest | Project State | Open | 1 |
| ea5b688c-b61f-4c5b-bb87-af2eac94d454 | MyTest | Product Name | Nacelle | 1 |
Over time there will be thousands of Tests in several different Versions, so I anticipate it would be a good idea to create a materialized view that only maintains the latest Version of each Test. I have tried to create the view as:
metadata
| summarize arg_max(Version, *) by TestId
But this only gives me one Parameter and ParameterValue for each TestId/Version, not the entire result set of the Test.
Can anyone point me in the right direction for this materialized view?
I have included an example of the metadata table as a datatable, which can be used in Kusto directly.
Metadata table as a datatable:
datatable (TestId: string, Name: string, Parameter: string, ParameterValue: string, Version: int) [
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Test Report, DMS number","1234-231",int(3),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Project name","Thor3",int(3),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","GTRS reference","gtrs",int(3),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Product Category","2MW",int(3),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Project number","TE-12321",int(3),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","DUT responsible person","ANFRB3",int(3),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Test execution person","ANFRB3",int(3),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Project Manager","ANFRB3",int(3),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","DVPR, DMS number","1234-1234",int(3),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","DVRE, DMS number","1231-1213",int(3),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Test Start Date","2022-02-23T00:00:00.0000000Z",int(3),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Test Category","Verification safety",int(3),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","GTRS reference","gtrs",int(2),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Project number","TE-12321",int(2),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","DUT responsible person","ANFRB2",int(2),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Test execution person","ANFRB2",int(2),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Project Manager","ANFRB2",int(2),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","DVPR, DMS number","1234-1234",int(2),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","DVRE, DMS number","1231-1213",int(2),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Test Start Date","2022-02-23T00:00:00.0000000Z",int(2),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Test Category","Verification safety",int(2),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Product Category","2MW",int(2),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Project name","Thor3",int(2),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Test Report, DMS number","1234-231",int(2),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Test Category","Verification safety",int(1),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Project Manager","ANFRB",int(1),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","GTRS reference","gtrs",int(1),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Product Category","2MW",int(1),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Project name","Thor3",int(1),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Project number","TE-12321",int(1),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","DUT responsible person","ANFRB",int(1),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Test execution person","ANFRB",int(1),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","DVPR, DMS number","1234-1234",int(1),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Test Report, DMS number","1234-231",int(1),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","DVRE, DMS number","1231-1213",int(1),
"fc76aa10-5cf8-447e-95f6-3bd801ef2ed0","ANFRB-FILEVIEW-TEST","Test Start Date","2022-02-23T00:00:00.0000000Z",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","GTRS reference","gtrs232",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","Product Category","4MW",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","Project name","Myproject",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","Project number","43324534",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","DUT responsible person","ANFRB",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","Test execution person","ANFRB",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","Project Manager","ANFRB",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","DVPR, DMS number","435123454",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","Test Report, DMS number","123123123",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","DVRE, DMS number","12312312312",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","Test Start Date","2022-03-01T00:00:00.0000000Z",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","Test Category","Verification functionality",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","Test facility","CHE",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","Test rig","rig23",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","Sample ID","1",int(1),
"ea5b688c-b61f-4c5b-bb87-af2eac94d454","ANFRB2-TEST","Link to test data","asdfsafdsdfa",int(1)
]
Thanks

If most tests include the same properties (as in your example), you can consider changing to a wide schema, in which each run (Version) of a TestId and Name is a single record. The resulting schema would look like the output of the following:
datatable // the sample datatable from the question
| extend pack(Parameter, ParameterValue)
| summarize make_bag(Column1) by TestId, Name, Version
| evaluate bag_unpack(bag_Column1)
Then, you can set up a materialized view with the following aggregation that will provide what you're looking for IIUC:
T | summarize arg_max(Version, *) by TestId, Name
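For reference, the view creation itself could look something like the following. This is a minimal sketch, where T stands for the wide-schema table and TestsLatest is a placeholder view name:
.create materialized-view TestsLatest on table T
{
    T
    | summarize arg_max(Version, *) by TestId, Name
}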
To switch from the schema in your example to the suggested one, you can either change your ingestion pipeline to ingest in the new schema format, or use an update policy for the transformation. If you choose the latter, avoid using the bag_unpack plugin in the update policy function; instead, project the columns you need explicitly, to avoid a non-deterministic schema.
Another alternative is to keep all properties in a single dynamic column, as in the result of:
datatable // the sample datatable from the question
| extend pack(Parameter, ParameterValue)
| summarize make_bag(Column1) by TestId, Name, Version
and use the same materialized view definition as above.
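Querying such a view stays simple, since individual properties can still be reached with bag accessors. A sketch, assuming the view is named TestsLatest and the bag column is named Properties (both names are placeholders):
TestsLatest
| where tostring(Properties["Project name"]) == "Thor3"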
For the first question, using the second suggested schema, you can try something like the following:
let GetTestsFromSearch = (Filter: dynamic)
{
    T
    | extend pack(Parameter, ParameterValue)
    | summarize Properties = make_bag(Column1) by TestId, Name, Version
    | summarize arg_max(Version, *) by TestId, Name // keep only the latest Version of each Test
    | extend Filter
    | mv-apply Filter on // one row per key/value pair in the filter
    (
        extend key = tostring(bag_keys(Filter)[0])
        | extend expected = tostring(Filter[key]), actual = tostring(Properties[key])
        | summarize count(), countif(actual == expected) // requested pairs vs. matching pairs
        | where count_ == countif_ // all requested pairs must match
    )
};
GetTestsFromSearch(dynamic({"Test Category": "Verification safety", "Project name": "Thor3"}));
| TestId | Name | Version | Properties | count_ | countif_ |
|---|---|---|---|---|---|
| fc76aa10-5cf8-447e-95f6-3bd801ef2ed0 | ANFRB-FILEVIEW-TEST | 3 | { "Test Report, DMS number": "1234-231", "Project name": "Thor3", "GTRS reference": "gtrs", "Product Category": "2MW", "Project number": "TE-12321", "DUT responsible person": "ANFRB3", "Test execution person": "ANFRB3", "Project Manager": "ANFRB3", "DVPR, DMS number": "1234-1234", "DVRE, DMS number": "1231-1213", "Test Start Date": "2022-02-23T00:00:00.0000000Z", "Test Category": "Verification safety"} | 2 | 2 |

Related

Data ingestion issue with KQL update policy; Query schema does not match table schema

I'm writing a function which takes in a raw data table (containing multijson telemetry data) and reformats it into multiple columns. I use .set MyTable <| myfunction | limit 0 to create my target table based off of the function, and use an update policy to populate my target table.
Here is the code :
.set-or-append MyTargetTable <|
myfunction
| limit 0

.alter table MyTargetTable policy update
@'[{ "IsEnabled": true, "Source": "raw", "Query": "myfunction()", "IsTransactional": false, "PropagateIngestionProperties": false}]'
But I'm getting ingestion failures. Here is the ingestion failure message:
Failed to invoke update policy. Target Table = 'MyTargetTable', Query = '
let raw = __table("raw", 'All', 'AllButRowStore')
| where extent_id() in (guid(659e3b3c-6859-426d-9c37-003623834455));
myfunction()': Query schema does not match table schema
I double-checked the query schema and the target table; they are the same. I'm not sure what this error means.
Also, I ran count on both the raw and target tables; there are relatively large discrepancies (400 rows for my target table and 2000 rows in the raw table).
Any advice will be appreciated.
Generally speaking, to find the root of the mismatch between the schemas, you can run something along the following lines and filter for the differences:
myfunction
| getschema
| join kind=leftouter (
    table('MyTargetTable')
    | getschema
) on ColumnOrdinal, ColumnType
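For example, appending a filter on the right-hand side's auto-renamed ColumnName keeps only the columns that have no counterpart of the same ordinal and type in the target table (a sketch):
myfunction
| getschema
| join kind=leftouter (
    table('MyTargetTable')
    | getschema
) on ColumnOrdinal, ColumnType
| where isempty(ColumnName1) // ColumnName1 is the target table's ColumnName after the join auto-renames it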
In addition, you should make sure the output schema of the function you use in your update policy is 'stable', i.e. isn't affected by the input data.
The output schema of some query plugins, such as pivot() and bag_unpack(), depends on the input data, and therefore it isn't recommended to use those in update policies.
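A schema-stable alternative is to project each column explicitly instead of unpacking a bag. A minimal sketch, assuming a hypothetical source table raw with a single dynamic column named Data; the column and field names here are invented for illustration:
.create-or-alter function myfunction() {
    raw
    | project Timestamp = todatetime(Data.timestamp), // 'Data' and all field names are assumptions
              Severity = tostring(Data.severity),
              Hostname = tostring(Data.hostname)
}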

Improve Kusto Query - mailbox audit log search

I am trying to identify shared mailboxes that aren't in use. I have already checked "Search-MailboxAuditLog", and some mailboxes do not return any results even though auditing is enabled, yet I can see activity in Azure Sentinel.
Is there a way to improve the Kusto code below? (During testing I tried mailboxes with activities, but sometimes got no results from the query.)
With Kusto, is there a way to loop through "mbs" like PowerShell's "foreach ($item in $mbs)"?
Thanks,
let mbs = datatable (name: string)
[
    "xxx1#something.com",
    "xxx2#something.com",
    "xxx3#something.com"
];
OfficeActivity
| where OfficeWorkload == "Exchange" and TimeGenerated > ago(30d)
| where MailboxOwnerUPN in~ (mbs)
| distinct MailboxOwnerUPN
Update: I need help with the query.
Input would be a list of shared mailbox UPNs.
Output would be a list of shared mailboxes with any activity, e.g. MBs with any action in the "Operation" field.
"in" doesn't work on datatables (tabular inputs) like that; it is not a "filter", it is an "operator". The "where" is effectively the "foreach" you are referring to.
Given the sample input, the query could probably be written as:
OfficeActivity // tabular input with many records
| where TimeGenerated > ago(30d) // filter records to the window of interest first
| where OfficeWorkload == "Exchange" // foreach row
| where MailboxOwnerUPN in~ ( // foreach row
    "xxx1#something.com", "xxx2#something.com", "xxx3#something.com"
)
| distinct MailboxOwnerUPN
You can see it in the docs at https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/inoperator#arguments, where "col" is the "column to filter".
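Since the original goal was to find shared mailboxes that are not in use, it may also help to flip the logic with a leftanti join, which returns the mailboxes from the list that had no matching activity. A sketch (note that join matching is case-sensitive, so consider normalizing both sides with tolower() if casing varies):
let mbs = datatable (name: string)
[
    "xxx1#something.com",
    "xxx2#something.com",
    "xxx3#something.com"
];
mbs
| join kind=leftanti (
    OfficeActivity
    | where TimeGenerated > ago(30d) and OfficeWorkload == "Exchange"
    | distinct MailboxOwnerUPN
) on $left.name == $right.MailboxOwnerUPN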

how to create a new table having json record from a kusto table

We receive multiline JSON (in the format below), and we store it in a Kusto table "OldT" after using a multiline JSON mapping.
{"severity":"0","hostname":"Test.login","sender":"Test.login","body":"2a09dfa1","facility":"1","version":"1","timestamp":"2020-04-23T07:07:06.077963Z"}
{"severity":"0","hostname":"Test.login","sender":"Test.login","body":"2a09dfa1","facility":"1","version":"1","timestamp":"2020-04-23T07:07:00.893151Z"}
Records in table "OldT":

| sender | timestamp | severity | version | body | priority | facility | hostname |
|---|---|---|---|---|---|---|---|
| Test.login | 2020-04-23T07:07:06.077963Z | 0 | 1 | 2a09dfa1 | | 1 | Test.login |
| Test.login | 2020-04-23T07:07:00.893151Z | 0 | 1 | 2a09dfa1 | | 1 | Test.login |
Now I need to move the data into another table, say "NewT" with only one column, say "Rawrecord"
Rawrecord:
{"severity":"0","hostname":"Test.login","sender":"Test.login","body":"2a09dfa1","facility":"1","version":"1","timestamp":"2020-04-23T07:07:06.077963Z"}
{"severity":"0","hostname":"Test.login","sender":"Test.login","body":"2a09dfa1","facility":"1","version":"1","timestamp":"2020-04-23T07:07:00.893151Z"}
How can I move this data to NewT?
You can use the pack_all() function. For example:
OldT | project Rawrecord = pack_all()
To move it to another table, you can use the .set-or-append command. For example:
.set-or-append NewT <| OldT | project Rawrecord = pack_all()
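If NewT should keep receiving the raw records as new data lands in OldT, an update policy can run the same projection at ingestion time. A sketch, assuming this exact table pair:
.alter table NewT policy update
@'[{"IsEnabled": true, "Source": "OldT", "Query": "OldT | project Rawrecord = pack_all()", "IsTransactional": false}]'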

How to set new item's attribute as (from the max amongst all items incremented by n)?

I want the following table structure, to store an auto-incrementing pid* attribute per row:
| id | timestamp | pid* |
|---|---|---|
| 00000000-1111-2222-93bb-0371fcb45674 | 0 | 1 |
| 00000000-1111-2222-93bb-ee742a825e88 | 1 | 2 |
| 00000000-1111-2222-93bb-bfac8753c0ba | 2 | 3 |

PutItem -> autoId() | timestamp() | max(pid) + 1 = 4 ??
For a PutItem operation, is something like the following 1) possible, and 2) acceptable in DynamoDB land?
"pid" : $util.dynamodb.toDynamoDBJson(records.findMax(pid) + 1) # just pseudo code
3) How might one implement the above using a DynamoDB resolver mapping template?
Use case:
I'm trying to use AWS DynamoDB to back GraphQL, managed by AWS AppSync. The following is the request mapping template for Mutation.createFoo:
{
    "version" : "2018-05-29",
    "operation" : "PutItem",
    "key" : {
        "id": $util.dynamodb.toDynamoDBJson($util.autoId()),
    },
    "attributeValues" : {
        "timestamp" : $util.dynamodb.toDynamoDBJson($util.time.nowEpochMilliSeconds()),
        "pid" : $util.dynamodb.toDynamoDBJson($ctx.args.input.pid), # currently client provided but it is not acceptable
        ...
    }
}
The primary key id is a UUID generated by $util.autoId(), which is fine. But our use case requires an incrementing pid for each new Foo in our FooTable. The business model requires a unique pid, at least for display, while under the hood queries like GetItem will use the UUID and timestamp instead, business as usual.
I'm also wary of calling for a change in the business model because of an implementation-detail issue.
References:
Resolver Mapping Template Reference for DynamoDB

How would I return a count of all rows in the table and then the count of each time a specific status is found?

Please forgive my ignorance on sqlalchemy; up until this point I've been able to navigate the seas just fine. What I'm looking to do is this:
Return a count of how many items are in the table.
Return a count of how many times each distinct status appears in the table.
I'm currently using sqlalchemy, but even a pure sqlite solution would be beneficial in figuring out what I'm missing.
Here is how my table is configured:
class KbStatus(db.Model):
id = db.Column(db.Integer, primary_key=True)
status = db.Column(db.String, nullable=False)
It's a very basic table but I'm having a hard time getting back the data I'm looking for. I have this working with 2 separate queries, but I have to believe there is a way to do this all in one query.
Here are the separate queries I'm running:
total = len(cls.query.all())
status_count = cls.query.with_entities(KbStatus.status, func.count(KbStatus.id).label("total")).group_by(KbStatus.status).all()
From here I'm converting it to a dict and combining it to make the output look like so:
{
    "data": {
        "status_count": {
            "Assigned": 1,
            "In Progress": 1,
            "Peer Review": 1,
            "Ready to Publish": 1,
            "Unassigned": 4
        },
        "total_requests": 8
    }
}
Any help is greatly appreciated.
I don't know about sqlalchemy, but it's possible to generate the results you want in a single query with pure sqlite using the JSON1 extension:
Given the following table and data:
CREATE TABLE data(id INTEGER PRIMARY KEY, status TEXT);
INSERT INTO data(status) VALUES ('Assigned'),('In Progress'),('Peer Review'),('Ready to Publish')
,('Unassigned'),('Unassigned'),('Unassigned'),('Unassigned');
CREATE INDEX data_idx_status ON data(status);
this query
WITH individuals AS (SELECT status, count(status) AS total FROM data GROUP BY status)
SELECT json_object('data'
, json_object('status_count'
, json_group_object(status, total)
, 'total_requests'
, (SELECT sum(total) FROM individuals)))
FROM individuals;
will return one row holding (after running through a JSON pretty printer; the actual string is more compact):
{
    "data": {
        "status_count": {
            "Assigned": 1,
            "In Progress": 1,
            "Peer Review": 1,
            "Ready to Publish": 1,
            "Unassigned": 4
        },
        "total_requests": 8
    }
}
If the sqlite instance you're using wasn't built with support for JSON1:
SELECT status, count(status) AS total FROM data GROUP BY status;
will give
status total
-------------------- ----------
Assigned 1
In Progress 1
Peer Review 1
Ready to Publish 1
Unassigned 4
which you can iterate through in python, inserting each row into your dict and adding up all total values in another variable as you go to get the total_requests value at the end. No need for another query just to calculate that number; do it manually. I bet it's really easy to do the same thing with your existing second sqlalchemy query.
