Kusto summarize total count from different rows - azure-data-explorer

The count in the data table below is split into different rows for the same build, device, and Tier because the os versions differ. How do I summarize the total count, ignoring the os column? For example, I need the total count 1388 + 1739 + 2070 for build "19.50.20", device "Google", Tier 3.
datatable (build_ver: string, device: string, Tier: long, count: long, os: string)
[
"19.50.20","Google",long(3),long(1388),"10.1",
"19.50.20","Google",long(3),long(1739),"10.2",
"19.50.20","Google",long(3),long(2070),"10.3",
"19.50.20","Windows",long(2),long(1486),"11",
"19.50.20","Windows",long(2),long(1476),"11.2",
]

If I understood correctly, this could work:
datatable (build_ver: string, device: string, tier: long, count: long, os: string)
[
"19.50.20","Google",long(3),long(1388),"10.1",
"19.50.20","Google",long(3),long(1739),"10.2",
"19.50.20","Google",long(3),long(2070),"10.3",
"19.50.20","Windows",long(2),long(1486),"11",
"19.50.20","Windows",long(2),long(1476),"11.2",
]
| summarize sum(count) by build_ver, device, tier
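For reference, the same aggregation can be sketched in Python (not Kusto): group by build, device, and tier while deliberately leaving os out of the key, so rows that differ only in os collapse into one total.

```python
from collections import defaultdict

# the rows from the datatable above
rows = [
    ("19.50.20", "Google", 3, 1388, "10.1"),
    ("19.50.20", "Google", 3, 1739, "10.2"),
    ("19.50.20", "Google", 3, 2070, "10.3"),
    ("19.50.20", "Windows", 2, 1486, "11"),
    ("19.50.20", "Windows", 2, 1476, "11.2"),
]

totals = defaultdict(int)
for build_ver, device, tier, count, os_ver in rows:
    # os_ver is intentionally excluded from the grouping key
    totals[(build_ver, device, tier)] += count
```

This mirrors what `summarize sum(count) by build_ver, device, tier` produces: one row per (build, device, tier) with the summed count.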

Preventing padded virtual columns in external table pathformat

My data is partitioned in ADL in the following manner:
year=2022/month=9/day=7/hour=5/city=chicago/ [...a bunch of data files...]
year=2022/month=9/day=7/hour=5/city=london/ [...a bunch of data files...]
So it's partitioned by year, month, day, hour, and city. Hour uses a 24-hour format with values ranging from 0 to 23. Also note that the directory names are not zero-padded: as you can see in the example, month=9 appears as a single digit rather than 09, and the same goes for day and hour. This data is produced by another process, so we can't change how the partition directories are named.
My goal is to create an external table to read this data from Kusto. To optimize queries on this table, I would like to take advantage of the virtual column feature. In our data, the actual year, month, day, hour, and city values are not part of the data itself but appear only in the partition directory names (as is common in many big data scenarios). With that in mind, I created the external table as follows:
.create-or-alter external table myexternaltable
(
...data column fields...
...data column fields...
...data column fields...
... etc. ...
)
kind=adl
partition by (date_id:datetime, city:string)
pathformat = (
"year=" datetime_pattern("yyyy",date_id)
"/month=" datetime_pattern("MM",date_id)
"/day=" datetime_pattern("dd",date_id)
"/hour=" datetime_pattern("HH",date_id)
"/city=" city
)
dataformat=parquet
(
...ADL endpoint...
)
As you can see, I defined two virtual columns: date_id and city.
The external table creation succeeded, but it doesn't seem to look in the right partitions when I query it:
external_table('myexternaltable') | where date_id == datetime(2022-9-3-5) | where city == 'london' | take 1
This returned no rows, even though there is data in the corresponding locations. I suspect the issue is that my pathformat uses zero-padded digits, i.e. it probably searches year=2022/month=09/day=03/hour=05 whereas the data exists in year=2022/month=9/day=3/hour=5. Is that the reason? If so, what is the correct pathformat for this sort of requirement?
After some experimentation, I found the following workaround:
.create-or-alter external table myexternaltable
(
...data column fields...
...data column fields...
...data column fields...
... etc. ...
)
kind=adl
partition by (year:string,
month:string,
day:string,
hour:string,
city:string)
pathformat = (
"year=" year
"/month=" month
"/day=" day
"/hour=" hour
"/city=" city
)
dataformat=parquet
(
...ADL endpoint...
)
Now both of the following queries work quite fast. It doesn't matter whether I include the leading 0 in the query, because behind the scenes Kusto removes it when converting the number to a string, since I defined year, month, day, and hour as strings:
external_table('myexternaltable') | where year==2022 | where month == 09 | where day==03 | where hour==05 | where city=='london' | take 1
external_table('myexternaltable') | where year==2022 | where month == 9 | where day==3 | where hour==5 | where city=='london' | take 1
It would still be good to have a single virtual column like date_id of type datetime, so one could perform datetime arithmetic; it would also be neater to have one virtual column instead of four.
Updated: I believe the following should work. See example in the docs:
pathformat = (datetime_pattern("'year='yyyy'/month='M'/day='d",date_id))
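To see why the padded pattern misses the directories, here is a small sketch in Python: two-digit patterns like "MM"/"dd"/"HH" produce zero-padded components, while the unpadded directory names correspond to the raw integer values (which is what the single-letter "M"/"d" patterns generate).

```python
from datetime import datetime

d = datetime(2022, 9, 3, 5)

# what "yyyy"/"MM"/"dd"/"HH"-style padded patterns generate
padded = f"year={d:%Y}/month={d:%m}/day={d:%d}/hour={d:%H}"

# what actually exists in ADL: raw, unpadded integer values
unpadded = f"year={d.year}/month={d.month}/day={d.day}/hour={d.hour}"
```

A lookup built from `padded` will never match a directory named like `unpadded`, which is exactly the mismatch described above.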

Kusto - If else condition with Kusto

I am trying to convert the below Splunk query to Kusto.
| eval result=if(match(Status,"Success|Passed"), "success", "failed")
Below is an example from the Kusto docs that isn't clear to me. How do I modify this Kusto example to match the Splunk query above? Thanks.
| extend day = iff(floor(Timestamp, 1d)==floor(now(), 1d), "today", "anotherday")
You could try this:
...
| summarize success = countif(Status in ("Success", "Passed")), total = count()
| project success, failure = total - success
In case the values in the Status column can have different casing, you can use in~().
In case the values in the Status column are longer strings in which you want to look for a substring, you can use, for example: Status contains "Success" or Status contains "Passed"
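The distinction between the two matching modes can be sketched in Python: Splunk's match() is a regex/substring test, while Kusto's in () is an exact comparison, so they can disagree on the same value.

```python
import re

def result_regex(status: str) -> str:
    # substring/regex semantics, like Splunk's match(Status, "Success|Passed")
    # or Kusto's: Status contains "Success" or Status contains "Passed"
    return "success" if re.search(r"Success|Passed", status) else "failed"

def result_exact(status: str) -> str:
    # exact-match semantics, like Kusto's: Status in ("Success", "Passed")
    return "success" if status in ("Success", "Passed") else "failed"
```

A value such as "Partially Passed" is a success under the regex version but a failure under the exact version, so pick the one that matches your data.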

Tag key & value using Teradata Regular Expression

I have a Teradata dataset that resembles the one below:
'Project: Hercules IssueType: Improvement Components: core AffectsVersions: 2.4.1 Priority: Minor Time: 15:25:23 04/06/2020'
I want to extract tag value from the above based on the key.
Ex:
with comm as
(
select 'Project: Hercules IssueType: Improvement Components: core AffectsVersions: 2.4.1 Priority: Minor' as text
)
select regexp_substr(comm.text,'[^: ]+',1,4)
from comm where regexp_substr(comm.text,'[^: ]+',1,3) = 'IssueType';
Is there a way to query without having to change the position arguments for every tag?
I'm also finding the last field a little tricky because of the date and time values.
Any help is appreciated.
Thank you.
There's the NVP function to access name/value-pair data, but to split into multiple rows you need either STRTOK_SPLIT_TO_TABLE or REGEXP_SPLIT_TO_TABLE. The tricky part in your case is the delimiters; it would be easier if they were unique instead of ' ' and ':':
WITH comm AS
(
SELECT 1 as keycol, -- should be a key column in your table, either numeric or varchar
'Project: Hercules IssueType: Improvement Components: core AffectsVersions: 2.4.1 Priority: Minor Time: 15:25:23 04/06/2020' AS text
)
SELECT id, tokennum, token,
-- get the key
StrTok(token,':', 1) AS "Key",
-- get the value (can't use StrTok because of ':' delimiter)
Substring(token From Position(': ' IN token)+2) AS "Value"
FROM TABLE
( RegExp_Split_To_Table(comm.keycol
,comm.text
,'( )(?=[^ ]+: )' -- assuming names don't contain spaces: split at the last space before ': '
, 'c')
RETURNS (id INT , tokennum INTEGER, token VARCHAR(1000) CHARACTER SET Latin)) AS dt
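The same split-then-parse idea can be sketched in Python using the identical lookahead pattern (assuming, as above, that key names contain no spaces): split at each space that immediately precedes a "key: " token, then cut each token at its first ": ".

```python
import re

text = ("Project: Hercules IssueType: Improvement Components: core "
        "AffectsVersions: 2.4.1 Priority: Minor Time: 15:25:23 04/06/2020")

# split at the space immediately before each "key: " token
tokens = re.split(r' (?=[^ ]+: )', text)

pairs = {}
for token in tokens:
    # partition at the FIRST ": ", so colons inside values survive
    key, _, value = token.partition(": ")
    pairs[key] = value
```

Because the split only happens before a "key: " token, the tricky Time field keeps its colons and its trailing date intact as a single value.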

How would I return a count of all rows in the table and then the count of each time a specific status is found?

Please forgive my ignorance on sqlalchemy, up until this point I've been able to navigate the seas just fine. What I'm looking to do is this:
Return a count of how many items are in the table.
Return a count of how many times each distinct status appears in the table.
I'm currently using sqlalchemy, but even a pure sqlite solution would be beneficial in figuring out what I'm missing.
Here is how my table is configured:
class KbStatus(db.Model):
id = db.Column(db.Integer, primary_key=True)
status = db.Column(db.String, nullable=False)
It's a very basic table, but I'm having a hard time getting back the data I'm looking for. I have this working with two separate queries, but I have to believe there is a way to do it all in one query.
Here are the separate queries I'm running:
total = len(cls.query.all())
status_count = cls.query.with_entities(KbStatus.status, func.count(KbStatus.id).label("total")).group_by(KbStatus.status).all()
From here I'm converting it to a dict and combining it to make the output look like so:
{
"data": {
"status_count": {
"Assigned": 1,
"In Progress": 1,
"Peer Review": 1,
"Ready to Publish": 1,
"Unassigned": 4
},
"total_requests": 8
}
}
Any help is greatly appreciated.
I don't know about sqlalchemy, but it's possible to generate the results you want in a single query with pure sqlite using the JSON1 extension:
Given the following table and data:
CREATE TABLE data(id INTEGER PRIMARY KEY, status TEXT);
INSERT INTO data(status) VALUES ('Assigned'),('In Progress'),('Peer Review'),('Ready to Publish')
,('Unassigned'),('Unassigned'),('Unassigned'),('Unassigned');
CREATE INDEX data_idx_status ON data(status);
this query
WITH individuals AS (SELECT status, count(status) AS total FROM data GROUP BY status)
SELECT json_object('data'
, json_object('status_count'
, json_group_object(status, total)
, 'total_requests'
, (SELECT sum(total) FROM individuals)))
FROM individuals;
will return one row holding the following (after running it through a JSON pretty printer; the actual string is more compact):
{
"data": {
"status_count": {
"Assigned": 1,
"In Progress": 1,
"Peer Review": 1,
"Ready to Publish": 1,
"Unassigned": 4
},
"total_requests": 8
}
}
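The same query can be run from Python's built-in sqlite3 module (assuming your SQLite build includes the JSON1 functions, as most modern builds do), which avoids sqlalchemy entirely:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# same table and data as above
conn.executescript("""
    CREATE TABLE data(id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO data(status) VALUES ('Assigned'),('In Progress'),('Peer Review'),
        ('Ready to Publish'),('Unassigned'),('Unassigned'),('Unassigned'),('Unassigned');
""")

row = conn.execute("""
    WITH individuals AS (SELECT status, count(status) AS total FROM data GROUP BY status)
    SELECT json_object('data',
        json_object('status_count', json_group_object(status, total),
                    'total_requests', (SELECT sum(total) FROM individuals)))
    FROM individuals
""").fetchone()

result = json.loads(row[0])  # the single returned row holds the JSON string
```

`result` is then the nested dict shown above, ready to return as-is.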
If the sqlite instance you're using wasn't built with support for JSON1:
SELECT status, count(status) AS total FROM data GROUP BY status;
will give
status total
-------------------- ----------
Assigned 1
In Progress 1
Peer Review 1
Ready to Publish 1
Unassigned 4
which you can iterate through in Python, inserting each row into your dict and adding up all the total values in another variable as you go, giving you the total_requests value at the end. There's no need for another query just to calculate that number. It should be just as easy to do the same thing with the rows from your existing second sqlalchemy query.
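That iteration might look like this sketch (shown with plain sqlite3; the same loop works over the rows returned by the second sqlalchemy query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# same table and data as in the answer above
conn.executescript("""
    CREATE TABLE data(id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO data(status) VALUES ('Assigned'),('In Progress'),('Peer Review'),
        ('Ready to Publish'),('Unassigned'),('Unassigned'),('Unassigned'),('Unassigned');
""")

status_count = {}
total_requests = 0
for status, total in conn.execute(
        "SELECT status, count(status) AS total FROM data GROUP BY status"):
    status_count[status] = total
    total_requests += total  # accumulate instead of issuing a second query

result = {"data": {"status_count": status_count, "total_requests": total_requests}}
```

One pass over the grouped rows produces both the per-status counts and the grand total.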

Why is this query answered instantly by the sqlite3 application, but slowly using sqlitejdbc?

I'm using Mac OS X v10.4.11 with the standard Java (1.5.0_19) and sqlite3 (3.1.3) that came with it. (Yes, a little old... but see the comment below.)
I have a sqlite3 database with a table with a few hundred thousand rows, with "name" and "stored" text columns. Name is one of six (so far) short strings; stored is a 19-character standard date-time string. Each column has a stand-alone index. There are only six unique name values. The following query:
select distinct name from myTable where stored >= date("now");
lists the relevant names instantly when I run it through the Mac OS X "sqlite3" application, but it takes over 2 seconds to find each name (about 15 seconds in total) when I do the same thing in the usual way in my application:
String q = "SELECT DISTINCT name FROM myTable " +
"WHERE stored >= DATE('now');" ;
try {
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(q);
while (rs.next()) {
final String s = rs.getString("name");
System.err.println("Got " + s);
}
rs.close();
}
I've tried this with both sqlitejdbc-v054 and sqlitejdbc-v055; no perceptible difference.
Is this a known deficiency? If not, does anyone have any suggestions for how to attack it?
