Preventing padded virtual columns in external table pathformat - azure-data-explorer

My data is partitioned in ADL in the following manner:
year=2022/month=9/day=7/hour=5/city=chicago/ [...a bunch of data files...]
year=2022/month=9/day=7/hour=5/city=london/ [...a bunch of data file...]
So it's partitioned by year, month, day, hour & city. Hour is in 24-hour format, with values ranging from 0 to 23. Also, please note that the directory names are not zero-padded: as you can see in the example, month=9 appears as a single digit and not 09, and the same goes for day and hour. This data is produced by another process, so we can't change the way the partition directories are named.
My goal is to create an external table to read this data from Kusto. To optimize queries on this table, I would like to take advantage of the virtual column feature. In our case the actual year, month, day, hour, and city values are not part of the data itself but appear only in the partition directory names (as is common in many big data scenarios). With this in mind, I created the external table as follows:
.create-or-alter external table myexternaltable
(
...data column fields...
...data column fields...
...data column fields...
... etc. ...
)
kind=adl
partition by (date_id:datetime, city:string)
pathformat = (
"year=" datetime_pattern("yyyy",date_id)
"/month=" datetime_pattern("MM",date_id)
"/day=" datetime_pattern("dd",date_id)
"/hour=" datetime_pattern("HH",date_id)
"/city=" city
)
dataformat=parquet
(
...ADL endpoint...
)
As you can see, I defined two virtual columns, date_id and city.
The external table creation was successful, but it doesn't seem to look into the right partitions when I query it:
external_table('myexternaltable') | where date_id == datetime(2022-09-03 05:00:00) | where city == 'london' | take 1
This returned no rows, even though there is data in the corresponding location. I suspect the issue is that the pathformat I used assumes zero-padded digits, i.e. it probably searches year=2022/month=09/day=03/hour=05, whereas the data lives in year=2022/month=9/day=3/hour=5. Is that the reason? If so, what is the correct pathformat for this sort of requirement?

After some playing around I found the following workaround:
.create-or-alter external table myexternaltable
(
...data column fields...
...data column fields...
...data column fields...
... etc. ...
)
kind=adl
partition by (year:string,
month:string,
day:string,
hour:string,
city:string)
pathformat = (
"year=" year
"/month=" month
"/day=" day
"/hour=" hour
"/city=" city
)
dataformat=parquet
(
...ADL endpoint...
)
Now both of the following queries work quite fast. As you can see, it no longer matters whether I use a leading 0 in the query, because behind the scenes Kusto removes it while converting the number to a string, since I defined year, month, day, and hour as strings:
external_table('myexternaltable') | where year==2022 | where month == 09 | where day==03 | where hour==05 | where city=='london' | take 1
external_table('myexternaltable') | where year==2022 | where month == 9 | where day==3 | where hour==5 | where city=='london' | take 1
It would still be good to have a single virtual column like date_id of type datetime, so one could perform datetime arithmetic; it would also look neater to have a single virtual column instead of four.

Update: I believe the following should work. See the example in the docs:
pathformat = (datetime_pattern("'year='yyyy'/month='M'/day='d",date_id))
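Extending that example to the full path in the question (hour and city included) should look roughly like this; it's a sketch, assuming datetime_pattern accepts the non-padded single-letter H specifier for the 24-hour value the same way the docs example uses M and d:
pathformat = (
    datetime_pattern("'year='yyyy'/month='M'/day='d'/hour='H", date_id)
    "/city=" city
)
With this, queries filtering on date_id and city should prune directly to the unpadded directories.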

Related

Kusto sub query selection using toscalar - returns only last matching record

I am referring to sqlcheatsheet - Nested queries
Query 1:
traces
| where customDimensions.Domain == "someDomain"
| where message contains "some-text"
| project itemId=substring(itemId,indexof(itemId,"-"),strlen(itemId))
Result:
itemId
-c580-11e9-888a-8776d3f65945
-c580-11e9-888a-8776d3f65945
-c580-11e9-9b01-c3be0f4a2bf2
Query 2:
traces
| where customDimensions.Domain == "someDomain"
| where itemId has toscalar(
traces
| where customDimensions.Domain == "someDomain"
| where message contains "some-text"
| project itemId=substring(itemId,indexof(itemId,"-"),strlen(itemId)))
The second query returns only records matching the last record of the subquery, i.e.:
-c580-11e9-9b01-c3be0f4a2bf2
Question:
How do I get the entire result set that matches all three items?
My requirement is to take the entire sequence of logs for a particular request.
To get that I have the inputs below: I am able to take one log, and from that I can find the ItemId.
The itemId looks like "b5066283-c7ea-11e9-9e9b-2ff40863cba4". All other logs related to this request will have "-c7ea-11e9-9e9b-2ff40863cba4" as this value. Only the first part gets incremented: b5066284, b5066285, b5066286, and so on.
toscalar(), as its name implies, returns a scalar value.
Given a tabular argument with N columns and M rows it'll return the value in the 1st column and the 1st row.
For example, the following will return a single value, 1:
let T = datatable(a:int, b:int, c:int)
[
1,2,3,
4,5,6,
7,8,9,
]
;
print toscalar(T)
If I understand the intention in your 2nd query correctly, you should be able to achieve your requirement by using has_any.
For example:
let T = datatable(item_id:string)
[
"c580-11e9-888a-8776d3f65945",
"c580-11e9-888a-8776d3f65945",
"c580-11e9-9b01-c3be0f4a2bf2",
]
;
T
| where item_id has_any (
(
T
| parse item_id with * "-" item_id
)
)
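Applied to the traces query from the question, that substitution would look roughly like this (a sketch: it swaps has toscalar(...) for has_any (...) and keeps the question's filters unchanged):
traces
| where customDimensions.Domain == "someDomain"
| where itemId has_any ((
    traces
    | where customDimensions.Domain == "someDomain"
    | where message contains "some-text"
    | project itemId = substring(itemId, indexof(itemId, "-"), strlen(itemId))
))
Since has_any matches a record if itemId contains any of the subquery's values, all three suffixes from the first query's output are honored, not just the last one.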

How would I return a count of all rows in the table and then the count of each time a specific status is found?

Please forgive my ignorance of sqlalchemy; up until this point I've been able to navigate the seas just fine. What I'm looking to do is this:
Return a count of how many items are in the table.
Return a count of how many times each distinct status appears in the table.
I'm currently using sqlalchemy, but even a pure sqlite solution would be beneficial in figuring out what I'm missing.
Here is how my table is configured:
class KbStatus(db.Model):
id = db.Column(db.Integer, primary_key=True)
status = db.Column(db.String, nullable=False)
It's a very basic table but I'm having a hard time getting back the data I'm looking for. I have this working with 2 separate queries, but I have to believe there is a way to do this all in one query.
Here are the separate queries I'm running:
total = len(cls.query.all())
status_count = cls.query.with_entities(KbStatus.status, func.count(KbStatus.id).label("total")).group_by(KbStatus.status).all()
From here I'm converting it to a dict and combining it to make the output look like so:
{
"data": {
"status_count": {
"Assigned": 1,
"In Progress": 1,
"Peer Review": 1,
"Ready to Publish": 1,
"Unassigned": 4
},
"total_requests": 8
}
}
Any help is greatly appreciated.
I don't know about sqlalchemy, but it's possible to generate the results you want in a single query with pure sqlite using the JSON1 extension:
Given the following table and data:
CREATE TABLE data(id INTEGER PRIMARY KEY, status TEXT);
INSERT INTO data(status) VALUES ('Assigned'),('In Progress'),('Peer Review'),('Ready to Publish')
,('Unassigned'),('Unassigned'),('Unassigned'),('Unassigned');
CREATE INDEX data_idx_status ON data(status);
this query
WITH individuals AS (SELECT status, count(status) AS total FROM data GROUP BY status)
SELECT json_object('data'
, json_object('status_count'
, json_group_object(status, total)
, 'total_requests'
, (SELECT sum(total) FROM individuals)))
FROM individuals;
will return one row holding the following (after running it through a JSON pretty printer; the actual string is more compact):
{
"data": {
"status_count": {
"Assigned": 1,
"In Progress": 1,
"Peer Review": 1,
"Ready to Publish": 1,
"Unassigned": 4
},
"total_requests": 8
}
}
If the sqlite instance you're using wasn't built with support for JSON1:
SELECT status, count(status) AS total FROM data GROUP BY status;
will give
status total
-------------------- ----------
Assigned 1
In Progress 1
Peer Review 1
Ready to Publish 1
Unassigned 4
which you can iterate through in Python, inserting each row into your dict and adding up the total values in another variable as you go to get the total_requests value at the end. There's no need for another query just to calculate that number; do it manually. I bet it's really easy to do the same thing with your existing second sqlalchemy query.
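For instance, a minimal sketch of that idea against the question's existing second query (assuming the same KbStatus model, func import, and classmethod context shown in the question):
# one grouped query; the grand total is derived in Python instead of a second query
rows = (cls.query
        .with_entities(KbStatus.status, func.count(KbStatus.id).label("total"))
        .group_by(KbStatus.status)
        .all())
status_count = {status: total for status, total in rows}
result = {
    "data": {
        "status_count": status_count,
        "total_requests": sum(status_count.values()),
    }
}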

MS Access: max(Date/Time field) in a query when the field may contain 00:00:00

I am trying to build a Query in MS Access that returns the last date/time for a given entity ID. Research shows that using the MAX() function on the corresponding field and using GROUP BY on the remaining fields appears to be the way to go.
However, this doesn't seem to work in the presence of values that hold 0 hours, 0 minutes, and 0 seconds, as those rows show up as well. The query's SQL is as follows:
SELECT Int(Historico_Classificacoes.ID_Entidade) AS ID_Entidade, Max(Historico_Classificacoes.Timestamp_Classificacao) AS [Data da última classificação], Historico_Classificacoes.US_Indicia_Pais_Constituicao, Historico_Classificacoes.US_Indicia_Responsabilidades_Fiscais, Historico_Classificacoes.US_Indicia_Morada_Coletiva, Historico_Classificacoes.US_Indicia_Telefone, Historico_Classificacoes.US_Indicia_Proveniencia_Capital, Historico_Classificacoes.US_Indicia_Beneficiários, Historico_Classificacoes.US_Indicia_Naturalidade, Historico_Classificacoes.US_Indicia_Nacionalidade, Historico_Classificacoes.US_Indicia_Morada_Singular, Historico_Classificacoes.US_Indicia_Laboral
FROM Historico_Classificacoes
GROUP BY Int(Historico_Classificacoes.ID_Entidade), Historico_Classificacoes.US_Indicia_Pais_Constituicao, Historico_Classificacoes.US_Indicia_Responsabilidades_Fiscais, Historico_Classificacoes.US_Indicia_Morada_Coletiva, Historico_Classificacoes.US_Indicia_Telefone, Historico_Classificacoes.US_Indicia_Proveniencia_Capital, Historico_Classificacoes.US_Indicia_Beneficiários, Historico_Classificacoes.US_Indicia_Naturalidade, Historico_Classificacoes.US_Indicia_Nacionalidade, Historico_Classificacoes.US_Indicia_Morada_Singular, Historico_Classificacoes.US_Indicia_Laboral
ORDER BY Int(Historico_Classificacoes.ID_Entidade);
The Historico_Classificacoes table currently holds the following data:
"ID_Entidade";"Timestamp_Classificacao";"Classificacao_DMIF";"Notacao_Risco_BCFT";"US_Indicia_Pais_Constituicao";"US_Indicia_Responsabilidades_Fiscais";"US_Indicia_Morada_Coletiva";"US_Indicia_Telefone";"US_Indicia_Proveniencia_Capital";"US_Indicia_Beneficiários";"US_Indicia_Naturalidade";"US_Indicia_Nacionalidade";"US_Indicia_Morada_Singular";"US_Indicia_Laboral"
"62";20/9/2015 00:00:00;1;30;1;1;1;1;1;1;1;1;1;0
"62";28/9/2015 10:43:38;1;30;1;1;1;1;1;1;1;1;1;1
"62";29/9/2015 17:52:24;1;30;1;1;1;1;1;1;1;1;1;1
"62";29/9/2015 17:52:40;1;30;1;1;1;1;1;1;1;1;1;1
"98";20/9/2015 00:00:00;2;15;1;1;1;1;1;1;0;0;0;0
"98";20/9/2015 00:00:01;0;0;0;0;0;0;0;0;0;0;0;0
The query, when executed in Datasheet View, outputs the following:
"ID_Entidade";"Data da última classificação";"US_Indicia_Pais_Constituicao";"US_Indicia_Responsabilidades_Fiscais";"US_Indicia_Morada_Coletiva";"US_Indicia_Telefone";"US_Indicia_Proveniencia_Capital";"US_Indicia_Beneficiários";"US_Indicia_Naturalidade";"US_Indicia_Nacionalidade";"US_Indicia_Morada_Singular";"US_Indicia_Laboral"
62;29/9/2015 17:52:40;1;1;1;1;1;1;1;1;1;1
62;20/9/2015 00:00:00;1;1;1;1;1;1;1;1;1;0
98;20/9/2015 00:00:00;1;1;1;1;1;1;0;0;0;0
98;20/9/2015 00:00:01;0;0;0;0;0;0;0;0;0;0
There are duplicated records for entities 62 and 98, when only one record for each was expected. Am I missing something here? Why are the entries whose timestamps hold 00:00:00 present?
You may want to consider using an additional query as an intermediate step that identifies the MAX Date/Time combination for each group ID first, then a follow-up query that pulls the entire record where that group ID, date, and time match. This ensures you won't have to use First or Min on the rest of your fields, and you will always get the correct data.
You use GROUP BY on the remaining fields like US_Indicia_Morada_Singular and US_Indicia_Laboral, so rows that differ in any of those values end up in separate groups; you'll have to use First, Last, Min, or Max on these as well.
Here is your attempt (without the repeated alias):
SELECT INT(ID_Entidade) AS ID_Entidade
, MAX(Timestamp_Classificacao) AS [Data da última classificação]
, US_Indicia_Pais_Constituicao
, US_Indicia_Responsabilidades_Fiscais
, US_Indicia_Morada_Coletiva
, US_Indicia_Telefone
, US_Indicia_Proveniencia_Capital
, US_Indicia_Beneficiários
, US_Indicia_Naturalidade
, US_Indicia_Nacionalidade
, US_Indicia_Morada_Singular
, US_Indicia_Laboral
FROM Historico_Classificacoes
GROUP BY INT(ID_Entidade)
, US_Indicia_Pais_Constituicao
, US_Indicia_Responsabilidades_Fiscais
, US_Indicia_Morada_Coletiva
, US_Indicia_Telefone
, US_Indicia_Proveniencia_Capital
, US_Indicia_Beneficiários
, US_Indicia_Naturalidade
, US_Indicia_Nacionalidade
, US_Indicia_Morada_Singular
, US_Indicia_Laboral
ORDER BY INT(ID_Entidade);
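Applying that advice while keeping a single grouped query would mean aggregating the flag columns instead of grouping by them, for example (a sketch showing just two of the flag columns, with renamed aliases; note that FIRST picks an arbitrary row's value, which is why the join approach below is more reliable):
SELECT INT(ID_Entidade) AS ID_Entidade
     , MAX(Timestamp_Classificacao) AS [Data da última classificação]
     , FIRST(US_Indicia_Morada_Singular) AS [US Morada Singular]
     , FIRST(US_Indicia_Laboral) AS [US Laboral]
FROM Historico_Classificacoes
GROUP BY INT(ID_Entidade)
ORDER BY INT(ID_Entidade);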
From your comments, here is SQL that is close to what you need. I have added the field "AnotherField" as a placeholder; you may or may not need to add fields here.
This currently selects the whole record from the table, but only the single "most recent" record for each value found in AnotherField is listed.
It may be that you need more than one field where AnotherField appears in the SQL. Think of the fields you use instead of AnotherField as the fields that are needed to find the maximum-date record.
SELECT Main.*
FROM Historico_Classificacoes AS Main
INNER JOIN ( SELECT AnotherField
, MAX(Timestamp_Classificacao) AS [MaxDate]
FROM Historico_Classificacoes
GROUP BY AnotherField
)
AS MostRecent
ON ( Main.AnotherField = MostRecent.AnotherField
AND
Main.Timestamp_Classificacao = MostRecent.MaxDate
)
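For instance, with ID_Entidade from the question as the grouping field, that pattern becomes (a sketch; it returns the full most-recent record per entity):
SELECT Main.*
FROM Historico_Classificacoes AS Main
INNER JOIN ( SELECT ID_Entidade
           , MAX(Timestamp_Classificacao) AS [MaxDate]
           FROM Historico_Classificacoes
           GROUP BY ID_Entidade
           )
           AS MostRecent
ON ( Main.ID_Entidade = MostRecent.ID_Entidade
   AND
   Main.Timestamp_Classificacao = MostRecent.MaxDate
   )
Against the sample data above, this yields exactly one row per entity: 62 with 29/9/2015 17:52:40 and 98 with 20/9/2015 00:00:01.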

SQLite best way to save and traverse arrays of strings

I have a table that looks like this:
| id | coords                        |
| 0  | [1,0],[4,3],[4,9],[9,3],[1,8] |
| 1  | [3,6],[3,8],[7,4],[5,2],[2,1] |
... and more
There will be around 70k-100k rows at most, and the CPU is not very powerful.
What is the fastest and least CPU-intensive SQLite statement I can use to determine which id has any given coordinate? No two ids share a coordinate.
Example:
SELECT * FROM mytable WHERE coords LIKE '%[[]3,8]%'
I imagine the LIKE statement above will get pretty CPU-intensive, right?
You should always try to have a properly normalized database.
In this case, the coordinate list is not in first normal form.
If you move the coordinates to a separate table, you can search for coordinates with a simple and obvious query, which can be sped up with an index:
CREATE TABLE MyTable (
    ID,
    [...]
);
CREATE TABLE MyCoordinates (
    MyTableID,
    CoordX,
    CoordY
);
-- the index that speeds up coordinate lookups
CREATE INDEX MyCoordinatesIndex ON MyCoordinates(CoordX, CoordY);
SELECT MyTableID FROM MyCoordinates WHERE CoordX = ? AND CoordY = ?;
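For illustration, loading the sample rows from the question into the normalized table and then looking up coordinate [3,8] would go like this (a sketch; the values are taken from the sample data above):
-- migrate the two sample rows into the normalized layout
INSERT INTO MyCoordinates (MyTableID, CoordX, CoordY) VALUES
    (0, 1, 0), (0, 4, 3), (0, 4, 9), (0, 9, 3), (0, 1, 8),
    (1, 3, 6), (1, 3, 8), (1, 7, 4), (1, 5, 2), (1, 2, 1);
-- which id owns coordinate [3,8]? Returns 1.
SELECT MyTableID FROM MyCoordinates WHERE CoordX = 3 AND CoordY = 8;
With the index in place this is a logarithmic lookup instead of a full-table LIKE scan, which matters at 70k-100k rows on a weak CPU.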

Query to find 'most watched' [COUNT()] from one table while returning the results from another

The question is probably quite confusing.
In effect, I have the following:
WatchList table
UserId | FilmId
3      | 77
etc    | etc
These are foreign keys for the following tables:
FilmDB - Film_title, Film_plot, Film_Id etc.
and
aspnet_memberships - UserId, Username etc..
Now, I presume I will need to use a join, but I am struggling with the syntax.
I would like to use COUNT on the WatchList and return the most frequent FilmIds along with their counterpart information, but I'd then like to return the REST of the FilmDB results, essentially giving me a list of ALL films, with those found in the WatchList sorted to the top by watch frequency.
Does that make sense? Thanks.
SELECT *
FROM filmdb
LEFT JOIN (
SELECT filmid, count(*) AS cnt
FROM watch_list
GROUP BY filmid) AS a
ON filmdb.film_id = a.filmid
ORDER BY isnull(cnt, 0) DESC;
http://sqlfiddle.com/#!3/46b16/10
You did not specify whether the query should be grouped by film_id or user_id. The example I have provided is grouped by user; if you change that to film_id, you will get the watch count across all users per film.
You need to use a subquery to get the count and then order the results by the count descending to get an ordered list.
SELECT
    *
FROM
    (
    SELECT
        WatchList.Film_Id,
        WatchCount = COUNT(*),
        FilmDB.Film_Title
    FROM
        WatchList
        INNER JOIN FilmDB ON FilmDB.Film_Id = WatchList.Film_Id
    GROUP BY
        WatchList.UserID,
        WatchList.Film_Id,
        FilmDB.Film_Title
    ) AS X
ORDER BY
    WatchCount DESC
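For the per-film count across all users mentioned above, drop UserID from the grouping; a sketch:
SELECT
    *
FROM
    (
    SELECT
        WatchList.Film_Id,
        WatchCount = COUNT(*),
        FilmDB.Film_Title
    FROM
        WatchList
        INNER JOIN FilmDB ON FilmDB.Film_Id = WatchList.Film_Id
    GROUP BY
        WatchList.Film_Id,
        FilmDB.Film_Title
    ) AS X
ORDER BY
    WatchCount DESC
Note that this INNER JOIN form only lists films that appear in WatchList at least once; the LEFT JOIN query in the first answer is what returns ALL films, with unwatched ones at the bottom.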
