Using both 'distinct' and 'project' - azure-data-explorer

In Azure Data Explorer, I am trying to use both the 'project' and 'distinct' keywords.
The table records have 4 fields I want to use 'project' on:
CowName
CowType
CowNum
CowLabel
But there are many other fields in the table such as Date, Measurement, etc, that I do not want to return.
Cows
| project CowName, CowType, CowNum, CowLabel
However, I want to avoid duplicate records of CowName and CowNum, so I included
Cows
| project CowName, CowType, CowNum, CowLabel
| distinct CowName, CowNum
But when I do this, the only columns that are returned are CowName and CowNum. I am now missing CowType and CowLabel entirely.
Is there a way to use both 'project' and 'distinct' without them interfering with each other?
Is there a different approach I should take?

You can do:
Cows
| distinct CowName, CowType, CowNum
or, if you don't want to have distinct values of CowType - and just have any value of it:
Cows
| summarize any(CowType) by CowName, CowNum
References:
Summarize operator: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/summarizeoperator
Distinct operator: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/distinctoperator
any() aggregation function: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/any-aggfunction

You can use this
| summarize any(CowType, CowLabel) by CowName, CowNum
To visualize how this will work take the following sample table/query:
let CowTable = datatable(CowNum:int, CowName:string, CowType:string, CowLabel:string, DontWantThis:int)
[
1, "Bob", "Bull", "label1", 99,
2, "Tipsy", "Heifer", "label1", 98,
3, "Milly", "Heifer", "label2", 99,
4, "Bob", "Bull", "label2", 87,
4, "Bob", "Bull", "label2", 77,
2, "Hanna", "Heifer", "label1", 98,
];
CowTable
| summarize any(CowType, CowLabel) by CowName, CowNum
Results:
Note that CowNum 4 is not listed twice, but CowNum 2 is, because its two rows have different CowName values (Tipsy and Hanna) and so form two distinct CowName/CowNum pairs. Bob is listed twice rather than 3 times: the two Bob rows with CowNum 4 share the same CowName/CowNum pair and collapse into one row, while the Bob row with CowNum 1 is a separate pair.
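To make the grouping concrete outside Kusto, here is a minimal plain-Python sketch of what summarize any(...) by CowName, CowNum does to the sample rows. It is only a model: the function name summarize_any is made up for illustration, and it keeps the first row seen per key, whereas Kusto's any() is free to pick any row of the group.

```python
# Sample rows: (CowNum, CowName, CowType, CowLabel, DontWantThis)
rows = [
    (1, "Bob", "Bull", "label1", 99),
    (2, "Tipsy", "Heifer", "label1", 98),
    (3, "Milly", "Heifer", "label2", 99),
    (4, "Bob", "Bull", "label2", 87),
    (4, "Bob", "Bull", "label2", 77),
    (2, "Hanna", "Heifer", "label1", 98),
]


def summarize_any(rows):
    """Model of: summarize any(CowType, CowLabel) by CowName, CowNum."""
    groups = {}
    for cow_num, cow_name, cow_type, cow_label, _unwanted in rows:
        # Assumption of this sketch: keep the first row seen for each key.
        groups.setdefault((cow_name, cow_num), (cow_type, cow_label))
    return groups


result = summarize_any(rows)
for (name, num), (cow_type, label) in result.items():
    print(name, num, cow_type, label)
```

The six input rows collapse to five groups: only the two (Bob, 4) rows share a key.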
If you truly only want results where the CowName is unique and the CowNum is also distinct you can do this in a 2-step summarize:
CowTable
| summarize any(CowName, CowType, CowLabel) by CowNum
| summarize any(CowNum, any_CowType, any_CowLabel) by any_CowName
//normalize column names
| project CowNum = any_CowNum, CowName = any_CowName, CowType = any_any_CowType, CowLabel = any_any_CowLabel
Results:
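As a rough model of the two-step summarize, the Python sketch below deduplicates first by CowNum and then by CowName. The helper first_per_key is hypothetical and always keeps the first row it sees; Kusto's any() may choose differently, so the surviving rows in a real query can vary.

```python
# Sample rows: (CowNum, CowName, CowType, CowLabel, DontWantThis)
rows = [
    (1, "Bob", "Bull", "label1", 99),
    (2, "Tipsy", "Heifer", "label1", 98),
    (3, "Milly", "Heifer", "label2", 99),
    (4, "Bob", "Bull", "label2", 87),
    (4, "Bob", "Bull", "label2", 77),
    (2, "Hanna", "Heifer", "label1", 98),
]

records = [
    {"CowNum": num, "CowName": name, "CowType": cow_type, "CowLabel": label}
    for num, name, cow_type, label, _unwanted in rows
]


def first_per_key(records, key):
    """Keep one record per distinct value of `key` (the first one seen)."""
    kept = {}
    for rec in records:
        kept.setdefault(rec[key], rec)
    return list(kept.values())


by_num = first_per_key(records, "CowNum")    # step 1: one row per CowNum
unique = first_per_key(by_num, "CowName")    # step 2: one row per CowName

for rec in unique:
    print(rec["CowNum"], rec["CowName"], rec["CowType"], rec["CowLabel"])
```

With this first-seen rule, three rows remain (Bob, Tipsy, Milly). Hanna disappears entirely because CowNum 2 was already assigned to Tipsy in step 1, which is worth keeping in mind before relying on this pattern.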

Related

How to summarize a dynamic object column?

Say I have an exceptions table which I know contains some data like the below, where details is a dynamic object
|operation_id|details|
|1|{"cause": "sometext"}|
|1|{"other_info": 240}|
|1|{"message": "blabal" }|
|2|{"cause": "some other text"}|
|2|{"other_info": 88}|
|2|{"message": "blabal2" }|
How can I query these results so that they are grouped by operation_id, with everything in the details column aggregated somehow, perhaps something like
|operation_id|details_1|details_2|details_3|
|1|{"cause": "sometext"}|{"other_info": 240}|{"message": "blabal" }|
|2|{"cause": "some other text"}|{"other_info": 88}|{"message": "blabal2" }|
or even just join all details into a single column
I tried doing it with summarize, but it just shows each entry on a separate line (since each details is unique):
exceptions
| where timestamp > now() - 10m
| summarize by operation_Id, dynamic_to_json(['details'])
Does anyone have any advice about this?
You can use the make_bag() aggregation function.
For example:
datatable(operation_id:int, details:dynamic)
[
1, dynamic({"cause": "sometext"}),
1, dynamic({"other_info": 240}),
1, dynamic({"message": "blabal" }),
2, dynamic({"cause": "some other text"}),
2, dynamic({"other_info": 88}),
2, dynamic({"message": "blabal2" }),
]
| summarize details = make_bag(details) by operation_id
|operation_id|details|
|1|{ "cause": "sometext", "other_info": 240, "message": "blabal"}|
|2|{ "cause": "some other text", "other_info": 88, "message": "blabal2"}|
I also got it working like this (using make_set())
exceptions
| project
operation_Id,
details
| summarize Details=make_set(details) by operation_Id
Note, however, that this returns details as an array of objects rather than a single merged object.
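The difference between the two aggregations can be sketched in plain Python. This is only a model of the idea, not Kusto's implementation: the Python functions make_bag and make_set below are illustrative stand-ins, and on a key collision this sketch lets the later value win, which is an assumption rather than documented Kusto behavior.

```python
from collections import defaultdict

rows = [
    (1, {"cause": "sometext"}),
    (1, {"other_info": 240}),
    (1, {"message": "blabal"}),
    (2, {"cause": "some other text"}),
    (2, {"other_info": 88}),
    (2, {"message": "blabal2"}),
]


def make_bag(rows):
    """Merge all property bags of a group into one dictionary."""
    bags = defaultdict(dict)
    for operation_id, details in rows:
        # Assumption: on duplicate keys the later value overwrites the earlier.
        bags[operation_id].update(details)
    return dict(bags)


def make_set(rows):
    """Collect the distinct property bags of a group into a list."""
    sets = defaultdict(list)
    for operation_id, details in rows:
        if details not in sets[operation_id]:
            sets[operation_id].append(details)
    return dict(sets)


print(make_bag(rows)[1])   # one merged object
print(make_set(rows)[1])   # a list of three separate objects
```

The merged form is convenient when each key appears once per group; the list form preserves every object when keys may repeat.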

How to convert to dynamic type/ apply multiple functions on same 'pack' in KQL/Kusto

I am absolutely in love with ADX time series capabilities; having worked tons on sensor data with Python. Below are the requirements for my case:
Handle sensor data tags at different frequencies -- bring them all to a 1 sec frequency (if in milliseconds, aggregate over a 1 sec interval)
Convert stacked data to unstacked data.
Join with another dataset which has multiple "string-labels" by timestamp, after unstack.
Do linear interpolation on some columns, and forward fill in others (around 10-12 in all).
I think the query below gets the first three done, but I am unable to use series_fill_linear directly on a column. The docs say this function requires a dynamic type as input. The error message is helpful:
series_fill_linear(): argument #1 was not of an expected data type: dynamic
Is it possible to apply series_fill_linear where I'm already using pack, instead of using pack again? How can I apply this function selectively by tag, and make my overall query more readable? It's important to note that only the sensor_data table requires both series_fill_linear and series_fill_forward; label_data only requires series_fill_forward.
sensor_data
| where timestamp > datetime(2020-11-24 00:59:59) and timestamp <datetime(2020-11-24 12:00:00)
| where device_number =='PRESSURE_599'
| where tag_name in ("tag1", "tag2", "tag3", "tag4")
| make-series agg_value = avg(value) default = double(null) on timestamp in range (datetime(2020-11-24 00:59:59), datetime(2020-11-24 12:00:00), 1s) by tag_name
| extend series_fill_linear(agg_value, double(null), false) //EDIT
| mv-expand timestamp to typeof(datetime), agg_value to typeof(double)
| summarize b = make_bag(pack(tag_name, agg_value)) by timestamp
| evaluate bag_unpack(b)
|join kind = leftouter (label_data
| where timestamp > datetime(2020-11-24 00:58:59) and timestamp <datetime(2020-11-24 12:00:01)
| where device_number =='PRESSURE_599'
| where tag != "PRESSURE_599_label_Raw"
| summarize x = make_bag(pack(tag, value)) by timestamp
| evaluate bag_unpack(x)) on timestamp
| project timestamp,
MY_LINEAR_COL_1 = series_fill_linear(tag1, double(null), false),
MY_LINEAR_COL_2 = series_fill_forward(tag2),
MY_LABEL_1 = series_fill_forward(PRESSURE_599_label_level1),
MY_LABEL_2 = series_fill_forward(PRESSURE_599_label_level2)
EDIT: I ended up using extend with case to handle different cases of interpolation.
// let forward_tags = dynamic({"tags": ["tag2","tag4"]}); unable to use this in query as "forward_tags.tags"
sensor_data
| where timestamp > datetime(2020-11-24 00:59:59) and timestamp <datetime(2020-11-24 12:00:00)
| where device_number == "PRESSURE_599"
| where tag_name in ("tag1", "tag2", "tag3", "tag4") // use a variable here instead?
| make-series agg_value = avg(value)
default = double(null)
on timestamp
in range (datetime(2020-11-24 00:59:59), datetime(2020-11-24 12:00:00), 1s)
by tag_name
| extend agg_value = case (tag_name in ("tag2", "tag3"), // use a variable here instead?
series_fill_forward(agg_value, double(null)),
series_fill_linear(agg_value, double(null), false)
)
| mv-expand timestamp to typeof(datetime), agg_value to typeof(double)
| summarize b = make_bag(pack(tag_name, agg_value)) by timestamp
| evaluate bag_unpack(b)
| join kind = leftouter (
label_data // don't want to use make-series here; it would generate unnecessary data since this table is already in 'ss' format.
| where timestamp > datetime(2020-11-24 00:58:59) and timestamp <datetime(2020-11-24 12:00:01)
| where tag != "PRESSURE_599_label_Raw"
| summarize x = make_bag(pack(tag, value)) by timestamp
| evaluate bag_unpack(x)
)
on timestamp
I was wondering if it is possible in KQL to pass a list of strings inside a query/function, to use as shown above. I have added comments where I think a list of strings could be passed to make the code more readable.
Now I just need to fill forward the label columns (MY_LABEL_1, MY_LABEL_2), which come out of this query. I would prefer that the code is added on to the main query, so that the final result is a table with all columns. Here is a sample table based on my case's result.
datatable (timestamp:datetime, tag1:double, tag2:double, tag3:double, tag4:double, MY_LABEL_1: string, MY_LABEL_2: string)
[
datetime(2020-11-24T00:01:00Z), 1, 3, 6, 9, "x", "foo",
datetime(2020-11-24T00:01:01Z), 1, 3, 6, 9, "", "",
datetime(2020-11-24T00:01:02Z), 1, 3, 6, 9,"", "",
datetime(2020-11-24T00:01:03Z), 1, 3, 6, 9,"y", "bar",
datetime(2020-11-24T00:01:04Z), 1, 3, 6, 9,"", "",
datetime(2020-11-24T00:01:05Z), 1, 3, 6, 9,"", "",
]
Series functions in ADX only work on dynamic arrays. You can apply a selective fill function using case() function, by replacing this line:
| extend series_fill_linear(agg_value, double(null), false) //EDIT
With something like the following:
| extend agg_value = case(
tag_name == "tag1", series_fill_linear(agg_value, double(null), false),
tag_name == "tag2", series_fill_forward(agg_value),
series_fill_forward(agg_value)
)
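The per-tag choice made by case() can be modeled in plain Python to show what each branch does to a series with gaps. This is a simplified sketch, not Kusto's implementation: fill_forward, fill_linear, fill_by_tag and FORWARD_TAGS are hypothetical names, None stands in for double(null), and edge gaps are left unfilled in the linear branch.

```python
from typing import List, Optional

Series = List[Optional[float]]

# Hypothetical list of tags that should be forward-filled.
FORWARD_TAGS = {"tag2", "tag3"}


def fill_forward(series: Series) -> Series:
    """Replace each gap with the last known value before it."""
    out: Series = []
    last: Optional[float] = None
    for value in series:
        if value is not None:
            last = value
        out.append(last)
    return out


def fill_linear(series: Series) -> Series:
    """Linearly interpolate gaps between known points; edges stay empty."""
    out = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = series[left] + step * (i - left)
    return out


def fill_by_tag(tag: str, series: Series) -> Series:
    """Pick the fill strategy from the tag name, as the case() call does."""
    return fill_forward(series) if tag in FORWARD_TAGS else fill_linear(series)


print(fill_by_tag("tag1", [1.0, None, 3.0, None]))
print(fill_by_tag("tag2", [None, 1.0, None, 3.0, None]))
```

Keeping the tag list in one named set is the same readability idea the question asks about: the strategy selection lives in one place instead of being repeated in the query.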
Edit:
Here is an example of a fill-forward workaround for a string column:
let T = datatable ( Timestamp: datetime, Employee: string )
[ datetime(2021-01-01), "Bob",
datetime(2021-01-02), "",
datetime(2021-01-03), "Alice",
datetime(2021-01-04), "",
datetime(2021-01-05), "",
datetime(2021-01-06), "Alan",
datetime(2021-01-07), "",
datetime(2021-01-08), "" ]
| sort by Timestamp asc;
let employeeLookup = toscalar(T | where isnotempty(Employee) | summarize make_list(Employee));
T
| extend idx = row_cumsum(tolong(isnotempty(Employee)))
| extend EmployeeFilled = employeeLookup[idx - 1]
| project-away idx
|Timestamp|Employee|EmployeeFilled|
|2021-01-01 00:00:00.0000000|Bob|Bob|
|2021-01-02 00:00:00.0000000| |Bob|
|2021-01-03 00:00:00.0000000|Alice|Alice|
|2021-01-04 00:00:00.0000000| |Alice|
|2021-01-05 00:00:00.0000000| |Alice|
|2021-01-06 00:00:00.0000000|Alan|Alan|
|2021-01-07 00:00:00.0000000| |Alan|
|2021-01-08 00:00:00.0000000| |Alan|
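The trick in this workaround is that a running count of non-empty values doubles as an index into the list of non-empty values. A minimal Python sketch of the same idea follows; it only mirrors the logic of row_cumsum and make_list, and returning an empty string for leading gaps is a choice made in this sketch.

```python
from itertools import accumulate

employees = ["Bob", "", "Alice", "", "", "Alan", "", ""]

# Counterpart of make_list(Employee) over the non-empty rows.
lookup = [name for name in employees if name]

# Counterpart of row_cumsum(tolong(isnotempty(Employee))):
# how many non-empty values have been seen up to and including each row.
idx = list(accumulate(1 if name else 0 for name in employees))

# idx - 1 points at the most recent non-empty value.
# Assumption: rows before the first value stay empty.
filled = [lookup[i - 1] if i > 0 else "" for i in idx]

for original, result in zip(employees, filled):
    print(repr(original), "->", result)
```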
Regarding your requirement to convert time series at many frequencies to a common one, have a look at the series_downsample_fl() function in the functions library.

KQL, time difference between separate rows in same table

I have a Sessions table
Sessions
|Timespan|Name |No|
|12:00:00|Start|1 |
|12:01:00|End |2 |
|12:02:00|Start|3 |
|12:04:00|Start|4 |
|12:04:30|Error|5 |
I need to extract the duration of each session from it using KQL (a suggestion for how to do it in some other query language would also be very helpful). If the next row after a Start is also a Start, the session was abandoned and should be ignored.
Expected result:
|Duration|SessionNo|
|00:01:00| 1 |
|00:00:30| 4 |
You can try something like this:
Sessions
| order by No asc
| extend nextName = next(Name), nextTimestamp = next(timestamp)
| where Name == "Start" and nextName != "Start"
| project Duration = nextTimestamp - timestamp, No
When you use the order by operator, you get a serialized row set, on which you can then use functions such as next and prev. Basically, you are looking for rows with Name == "Start" whose next row is not another "Start", which is what the query above does.
You can run the following query against the Kusto Samples open database.
let Sessions = datatable(Timestamp: datetime, Name: string, No: long) [
datetime(12:00:00),"Start",1,
datetime(12:01:00),"End",2,
datetime(12:02:00),"Start",3,
datetime(12:04:00),"Start",4,
datetime(12:04:30),"Error",5
];
Sessions
| order by No asc
| extend Duration = iff(Name != "Start" and prev(Name) == "Start", Timestamp - prev(Timestamp), timespan(null)), SessionNo = prev(No)
| where isnotnull(Duration)
| project Duration, SessionNo
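The same next/prev pairing can be sketched in plain Python by walking adjacent rows of the ordered table. This is an illustrative model only: session_durations is a made-up name, times are represented as timedelta offsets within the day, and a trailing Start with no following row is simply ignored.

```python
from datetime import timedelta

# (time of day, Name, No)
sessions = [
    (timedelta(hours=12), "Start", 1),
    (timedelta(hours=12, minutes=1), "End", 2),
    (timedelta(hours=12, minutes=2), "Start", 3),
    (timedelta(hours=12, minutes=4), "Start", 4),
    (timedelta(hours=12, minutes=4, seconds=30), "Error", 5),
]


def session_durations(sessions):
    """Return (duration, session number) for each Start followed by a non-Start row."""
    ordered = sorted(sessions, key=lambda row: row[2])  # order by No asc
    result = []
    # zip(ordered, ordered[1:]) pairs each row with its successor, like next().
    for (time, name, number), (next_time, next_name, _next_no) in zip(ordered, ordered[1:]):
        if name == "Start" and next_name != "Start":
            result.append((next_time - time, number))
    return result


for duration, number in session_durations(sessions):
    print(duration, number)
```

Session 3 is skipped because the row after it is another Start, which matches the "abandoned session" rule in the question.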

How do you combine Lapply() and dbListFields() to get all column names for every table in a DATABASE?

I would like to create a short catalog of a database for myself, showing which tables and fields are available, using a combination of sapply(), lapply() etc. and dbListFields().
This is as far as I have got, but it returns a character variable of length 0:
catalog <- lapply(list_of_tables, function(t) dbListFields(con, name = paste0(t)))
So I would like to create an output like this:
+------------+--------------------+
| DB TABLES | FIELDS |
+------------+--------------------+
| ORDERS | "PRODUCT", "TIME" |
| CLIENTS | "ID", "NAME" |
| PROMOTIONS | "DATE", "DISCOUNT" |
+------------+--------------------+
I haven't used this kind of loop before and I would like to start here.
Thank you for your support in advance!

Conditional mutating of the R data frame based on the strings

I am using R and trying to create a new column based on the string information from the existing columns.
My data is like:
risk_code | area
-----------------------------------
DEEP DIGGING ALL | --
CONSTRUCTION PRO | Construction
CLAIMS ONSHORE | --
OFFSHORE CLAIMS | --
And the result I need is:
risk_code | area | area_new
-------------------------------------------------
DEEP DIGGING ALL | -- | Digging
CONSTRUCTION PRO | Construction | Construction
CLAIMS ONSHORE | -- | Onshore
OFFSHORE CLAIMS | -- | Offshore
I understand that I have made several mistakes in the code, but after a whole week of staring at it and searching the internet, I cannot get the result I need.
I appreciate your help.
Thanks in advance.
Occupancy <- read_excel("Occupancy.xlsx")
OccupancyMutated <- mutate(Occupancy, area_new = area)
OccupancyMutated <- as.data.frame(OccupancyMutated)
OccupancyMutated$area_new[Occupancy$area == "--"] <-
{
if (OccupancyMutated$risk_code == %Digging%) {"Digging"}
else if (OccupancyMutated$risk_code == %ONSHORE%) {"Onshore"}
else if (OccupancyMutated$risk_code == %OFFSHORE%) {"Offshore"}
else {"empty"}
}
View(OccupancyMutated)
We can use stringr for this operation. The function word extracts the first word of each string in risk_code, and the function str_to_title converts it to title case. Both functions are vectorized, so simply:
library(stringr)
str_to_title(word(df$risk_code, 1, 1))
#[1] "Deep"         "Construction" "Claims"       "Offshore"
With the sample data shown, the keyword is not always the first word, so to pick out specific words only, you can do:
str_to_title(str_extract(tolower(df$risk_code), 'digging|offshore|onshore'))
#[1] "Digging" NA "Onshore" "Offshore"
So, this is the answer (thanks to Sotos):
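library(readxl)
library(dplyr)
library(stringr)
Occupancy <- read_excel("Occupancy.xlsx")
OccupancyMutated <- mutate(Occupancy, area_new = area)
OccupancyMutated <- as.data.frame(OccupancyMutated)
# Only fill rows whose area is the "--" placeholder; the pattern is lower case because risk_code is lower-cased first
missing_area <- which(OccupancyMutated$area == "--")
OccupancyMutated$area_new[missing_area] <-
str_to_title(str_extract(tolower(OccupancyMutated$risk_code[missing_area]), 'digging|offshore|onshore'))
View(OccupancyMutated)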
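For readers who want to check the logic independently of R, the same rule can be sketched in Python: keep area when it is set, otherwise take the first matching keyword from risk_code. The function name area_new and the keyword list are taken from the example data and are assumptions of this sketch.

```python
import re

# Keywords are lower case because risk_code is lower-cased before matching.
KEYWORDS = re.compile(r"digging|offshore|onshore")


def area_new(risk_code, area):
    """Return area, or a title-cased keyword from risk_code when area is "--"."""
    if area != "--":
        return area
    match = KEYWORDS.search(risk_code.lower())
    return match.group(0).title() if match else None


rows = [
    ("DEEP DIGGING ALL", "--"),
    ("CONSTRUCTION PRO", "Construction"),
    ("CLAIMS ONSHORE", "--"),
    ("OFFSHORE CLAIMS", "--"),
]

result = [area_new(risk_code, area) for risk_code, area in rows]
print(result)
```

Rows with no matching keyword come back as None, which corresponds to the NA that str_extract returns in R.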
