What is the best way to do a range-based cross join in Presto?

I have a table event_log in Athena which holds logs collected from an event processing system. There are various stages in the system, and each stage processes these events in sequential order. The start_time column indicates the time at which an event entered the system, and end_time is the time at which it exited. The system processes millions of events per day, and we have a year of data in the table below.
event_id  event_type  start_time  end_time
E1        TypeA       T1          T4
E2        TypeB       T2          T6
M1        TypeM       T2          T6
E3        TypeA       T3          T7
E4        TypeB       T4          T7
E5        TypeA       T5          T8
M2        TypeM       T5          T8
E6        TypeB       T6          T9
E7        TypeA       T7          T10
E8        TypeB       T8          T11
M3        TypeM       T8          T11
There is a special type of event, TypeM (marker events). I have to calculate the processing latency of these special events from these logs. From the table above, this can be achieved by filtering events of that type and computing the latency as end_time - start_time. In addition, I want to augment the latency with an extra piece of info: the number of events that were actively being processed in the various stages of the system while this event was being processed.
-- sample event_log table
CREATE TABLE event_log AS
SELECT * FROM (
VALUES
('E1','TypeA', 1, 4),
('E2','TypeB', 2, 6),
('M1','TypeM', 2, 6),
('E3','TypeA', 3, 7),
('E4','TypeB', 4, 7),
('E5','TypeA', 5, 8),
('M2','TypeM', 5, 8),
('E6','TypeB', 6, 9),
('E7','TypeA', 7, 10),
('E8','TypeB', 8, 11),
('M3','TypeM', 8, 11)
) AS t (event_id, event_type, start_time, end_time)
-- filtered marker table
CREATE TABLE marker_table AS
SELECT * FROM event_log
WHERE event_type = 'TypeM'
-- Join with the filtered marker table on the marker's start and end time
SELECT mark.event_id, mark.event_type, mark.start_time, mark.end_time,
       count(processed_events_in_band.event_id) AS events_processed_count
FROM event_log processed_events_in_band
JOIN marker_table mark
  ON processed_events_in_band.end_time BETWEEN mark.start_time AND mark.end_time
WHERE processed_events_in_band.event_type != 'TypeM'
GROUP BY mark.event_id, mark.event_type, mark.start_time, mark.end_time
Expected result:

event_id  event_type  start_time  end_time  events_processed_count
M1        TypeM       T2          T6        2 (E1, E2)
M2        TypeM       T5          T8        4 (E2, E3, E4, E5)
M3        TypeM       T8          T11       4 (E5, E6, E7, E8)
The table is partitioned by end_time (daily), and I have been using the partitions to reduce the data scanned. A single day's data can be up to 10M rows, and the query should scale to that. The query took around 17 minutes with the marker table at 18K rows and the event log at 10M rows. There are around 2K Parquet files to scan for those 10M rows, so I don't think S3 read latency is the issue here.
How do I optimize this query? What is the best way to get this data efficiently?

To improve performance:

- Be aware that CREATE TABLE writes the output of the query to disk (doc). Consider using a common table expression instead:
  with marker_table as (
      SELECT * FROM event_log
      WHERE event_type = 'TypeM'
  )
  select ...
- Try using a condition with an = sign in the join: Presto will then do a hash join, which is a lot more efficient. In your case I would truncate the start times down and the end times up, and write the ON condition as an equality of the truncated times (see the sketch after this list).
- Always place the largest table on the left side of the join (doc).
- If you want the list of events being processed, and not just the count, you can use the array_agg function (doc) combined with array_distinct to produce a list of unique entries and array_join to join it into a string (also included in the sketch below).
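Here is a minimal sketch of the bucketed hash join. It assumes the production table has real timestamp columns (the toy table above uses plain integers), and the hourly bucket size, the marker_buckets name, and the events_processed string column are illustrative choices rather than anything from the question. Each marker is expanded into the hourly buckets it spans, the join key is bucket equality (which Presto can hash), and the exact BETWEEN filter is re-applied on top:

WITH marker_table AS (
    SELECT * FROM event_log WHERE event_type = 'TypeM'
),
marker_buckets AS (
    -- one row per marker per hour bucket it covers
    SELECT m.*, b AS bucket
    FROM marker_table m
    CROSS JOIN UNNEST(
        sequence(date_trunc('hour', m.start_time),
                 date_trunc('hour', m.end_time),
                 INTERVAL '1' HOUR)
    ) AS t (b)
)
SELECT mb.event_id, mb.event_type, mb.start_time, mb.end_time,
       count(e.event_id) AS events_processed_count,
       array_join(array_distinct(array_agg(e.event_id)), ', ') AS events_processed
FROM event_log e                                  -- largest table on the left
JOIN marker_buckets mb
  ON date_trunc('hour', e.end_time) = mb.bucket   -- equality key enables a hash join
 AND e.end_time BETWEEN mb.start_time AND mb.end_time
WHERE e.event_type != 'TypeM'
GROUP BY mb.event_id, mb.event_type, mb.start_time, mb.end_time

Each qualifying event falls into exactly one bucket per marker, so the counts match the original query; pick a bucket size close to the typical marker duration to keep the fan-out small.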

Related

KQL Join on max value

I need to join on a table to return the MAX value from that right-hand table. I have tried to mock it up using 'datatable' but have failed miserably :(. I'll try to describe it in words.
T1 = datatable(ID:int, Properties:string, ConfigTime:datetime) [1,'a,b,c','2021-03-04 00:00:00']
T2 = datatable(ID:int, Properties:string, ConfigTime:datetime) [2,'a,b,c','2021-03-02 00:00:00', 3,'a,b','2021-03-01 00:00:00', 4,'c','2021-03-20 00:00:00']
I'm using this as an update policy on T2, which has a source of T1. So I want to select the rows from T1 and then join the rows from T2 that have the highest timestamp. My first attempt was below:
T1 | join kind=inner T2 on Id
| summarize arg_max(ConfigTime1, Id, Properties, Properties1, ConfigTime) by Id
| project Id, Properties, ConfigTime
In my actual update policy, I merge the properties from T1 and T2 and then write to T2, but for simplicity I've left that out for now.
Currently, I'm not getting any output in T2 from the update policy. Any guidance on another way I should be doing this would be appreciated. Thanks
It seems that you want to push the arg_max calculation into the T2 side of the join, something like this:
T1
| join kind=inner (
    T2
    | summarize arg_max(ConfigTime, *) by ID
    | project ID, Properties, ConfigTime
) on ID
Note that to ensure acceptable performance you want to limit the timeframe for the arg_max search, so you should consider a time-based filter before the arg_max, as in the sketch below.
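For example (a sketch; the 7-day lookback window is an assumed value, not something from the question):
T1
| join kind=inner (
    T2
    | where ConfigTime > ago(7d)   // limit the arg_max search window
    | summarize arg_max(ConfigTime, *) by ID
) on ID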
I think what you're looking for is a union:
let T1 = datatable(ID:int, Properties:string, ConfigTime:datetime) [
1,'a,b,c','2021-03-04 00:00:00'
];
let T2 = datatable(ID:int, Properties:string, ConfigTime:datetime) [
2,'a,b,c','2021-03-02 00:00:00',
3,'a,b','2021-03-01 00:00:00',
4,'c','2021-03-20 00:00:00'
];
Here is an example using a variable with summarize max:
let Latest = toscalar(T2 | summarize max(ConfigTime));
T1
| union (T2 | where ConfigTime == Latest)
The result will keep the entries from T1 and only the latest entries from T2.
If this doesn't reflect your expected results, please show your expected output.

How to evaluate Application Insights requests "own" duration, without considering duration of dependencies?

I'm trying to produce a Kusto query to measure the "own" duration of requests, i.e. subtracting out the durations of their dependencies. However, I can't really figure out how to work this out with a pure Kusto query.
To better understand what is expected, below is a sample case:
High level view (where R is the request and Dx the dependencies)
R    =============================== (31ms)
D1     *******                       (7ms)
D2          ********                 (8ms)
D3                        ******     (6ms)
D4                          **       (2ms)
D5         ****                      (4ms)
Proj ==*************======******====
- D1 overlaps D2 for 2ms
- D5 and D4 shouldn't be taken into account, as they are completely overlapped by other dependencies
- Proj is a projection of a potential intermediate step in which only the meaningful dependency segments are shown
Given the following testbed dataset:
let reqs = datatable (timestamp: datetime, id:string, duration: real)
[
datetime("2020-12-15T08:00:00.000Z"), "r1", 31 // R
];
let deps = datatable (timestamp: datetime, operation_ParentId:string, duration: real)
[
datetime("2020-12-15T08:00:00.002Z"), "r1", 7, // D1
datetime("2020-12-15T08:00:00.007Z"), "r1", 8, // D2
datetime("2020-12-15T08:00:00.021Z"), "r1", 6, // D3
datetime("2020-12-15T08:00:00.023Z"), "r1", 2, // D4
datetime("2020-12-15T08:00:00.006Z"), "r1", 4, // D5
];
In this particular case, the Kusto query joining the two data tables should be able to retrieve 12 (the duration of the request after removing all dependencies), i.e.
Expected total duration = 31 - (7 + 8 - 2) - 6 = 12
Any help moving this forward would be greatly appreciated <3
I succeeded in solving this using row_window_session(), which is a window function. You can read more about it in the window functions overview.
The solution is:
let reqs = datatable (timestamp: datetime, operation_ParentId:string, duration: real)
[
datetime("2020-12-15T08:00:00.000Z"), "r1", 31 // R
];
let deps = datatable (timestamp: datetime, operation_ParentId:string, duration: real)
[
datetime("2020-12-15T08:00:00.002Z"), "r1", 7, // D1
datetime("2020-12-15T08:00:00.007Z"), "r1", 8, // D2
datetime("2020-12-15T08:00:00.021Z"), "r1", 6, // D3
datetime("2020-12-15T08:00:00.006Z"), "r1", 4, // D5
datetime("2020-12-15T08:00:00.023Z"), "r1", 2, // D4
];
deps
| extend endTime = timestamp + totimespan(duration * 10000)
| sort by timestamp asc
| serialize | extend SessionStarted = row_window_session(timestamp, 1h, 1h, timestamp > prev(endTime))
| summarize max(endTime) by operation_ParentId, SessionStarted
| extend diff = max_endTime - SessionStarted
| summarize todouble(sum(diff)) by operation_ParentId
| join reqs on operation_ParentId
| extend diff = duration - sum_diff / 10000
| project diff
The idea here is to sort the entries by start time and, as long as the previous end time is later than the current start time, not open a new session. Let's go over each step of the query to see how this is done:
1. Calculate the endTime based on the duration. To normalize the data, multiply the duration by 10000 (1ms = 10000 ticks):
| extend endTime = timestamp + totimespan(duration * 10000)
2. Sort by start time:
| sort by timestamp asc
3. This is the key to the solution. row_window_session is calculated on the timestamp column. The next two parameters are limits on when to start a new bucket; since we don't want to seal a bucket just because time has passed, I provided 1 hour, which will never be hit with this input. The fourth argument starts a new session based on the data: a new session begins whenever timestamp > prev(endTime), so consecutive rows that overlap the previous segment keep the same session start time.
| serialize | extend SessionStarted = row_window_session(timestamp, 1h, 1h, timestamp > prev(endTime))
4. Now we have multiple rows per session start, so we keep only the latest end time per session. We also keep operation_ParentId so we can join on that key later:
| summarize max(endTime) by operation_ParentId, SessionStarted
5. Calculate the duration of each session:
| extend diff = max_endTime - SessionStarted
6. Sum up all session durations:
| summarize todouble(sum(diff)) by operation_ParentId
7. Join with reqs to get the request's total duration:
| join reqs on operation_ParentId
8. Calculate the difference between the total duration and the summed session durations, and unnormalize the data:
| extend diff = duration - sum_diff / 10000
9. Project the final result:
| project diff
You can find this query running at Kusto Samples open database.
Having said that, please note that this is a linear pass over the data, meaning that if there are two consecutive segments that should belong to the same covered region but do not themselves intersect, it will fail. For example, adding the following to deps:
datetime("2020-12-15T08:00:00.026Z"), "r1", 1, // D6
which should not add anything to the calculation, causes it to misbehave. This is because D4 is the previous point and it has no point of contact with D6, although D3 covers them both.
To solve that, you need to repeat the logic of steps 3-5. Unfortunately Kusto does not have recursion, so you cannot solve this for arbitrary input. But assuming there are no deeply nested cases that break this logic, I think it is good enough.
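For illustration, one extra merge pass might look like the following (my own sketch, not part of the original answer): rename the merged segments produced by step 4 back to timestamp/endTime and run steps 2-4 again, which handles one level of nesting such as the D6 case above:

| project operation_ParentId, timestamp = SessionStarted, endTime = max_endTime
| sort by timestamp asc
| extend SessionStarted = row_window_session(timestamp, 1h, 1h, timestamp > prev(endTime))
| summarize max(endTime) by operation_ParentId, SessionStarted

Steps 5 onwards then continue unchanged; each extra pass flattens one more level of nesting.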
Take a look at the query below to see if it can meet your requirement:
let reqs = datatable (timestamp: datetime, id:string, duration: real, key1:string)
[
datetime("2020-12-15T08:00:00.000Z"), "r1", 31 , "k1" // R
];
let deps = datatable (timestamp: datetime, operation_ParentId:string, duration: real,name:string)
[
datetime("2020-12-15T08:00:00.002Z"), "r1", 7, "D1",
datetime("2020-12-15T08:00:00.007Z"), "r1", 8, "D2",
datetime("2020-12-15T08:00:00.021Z"), "r1", 6, "D3",
datetime("2020-12-15T08:00:00.023Z"), "r1", 2, "D4",
datetime("2020-12-15T08:00:00.006Z"), "r1", 4, "D5"
];
let d2 = deps
| where name !in ("D4","D5")
| summarize a=sum(duration)-2
| extend key1="k1";
reqs
| join d2 on key1
| extend result = duration - a
| project result
Test result: 12, matching the expected value.

Sqlite3 repeats value in other dates

The involved tables:
- data_incidencia
- data_ticket
My query is the following:
select t1.hurtos, t2.fallas,t3.ticket, t1.fecha_carga
from
(select count(ttc) as hurtos,
fecha_carga from data_incidencia
where campo_key_id = 2
group by fecha_carga) t1,
(select count(ttc) as fallas,
fecha_carga from data_incidencia
where campo_key_id = 1
group by fecha_carga) t2,
(select count(ticket) as ticket,
fecha_solicitud as fecha_carga from data_ticket ) t3
where t1.fecha_carga =t2.fecha_carga;
and the output:
but the desired output is:
Notice that "ticket" repeats its value on 2018-05-16, where there are no tickets. It is probably something dumb like a CASE WHEN or GROUP BY, but I can't figure it out.
Any ideas on how I should fix this query?
You have three subqueries, t1, t2, and t3.
t1 and t2 are joined, but t3 is not, so you get an implicit cross join.
The column names you're using look as if you want to join all three on the same column:
SELECT t1.hurtos, t2.fallas, t3.ticket, t1.fecha_carga
FROM (...) AS t1,
(...) AS t2,
(...) AS t3
WHERE t1.fecha_carga = t2.fecha_carga
AND t1.fecha_carga = t3.fecha_carga;
And implicit joins have been outdated since 1992; better to use explicit joins:
SELECT t1.hurtos, t2.fallas, t3.ticket, fecha_carga
FROM (...) AS t1
JOIN (...) AS t2 USING (fecha_carga)
JOIN (...) AS t3 USING (fecha_carga);
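Note also that your t3 subquery has no GROUP BY, so it collapses data_ticket into a single row, which is why its value repeats on every date. A sketch of the full rewritten query (assuming data_ticket should be counted per fecha_solicitud, and using a LEFT JOIN so dates without tickets still appear):

SELECT t1.hurtos, t2.fallas, t3.ticket, t1.fecha_carga
FROM (SELECT count(ttc) AS hurtos, fecha_carga
      FROM data_incidencia
      WHERE campo_key_id = 2
      GROUP BY fecha_carga) AS t1
JOIN (SELECT count(ttc) AS fallas, fecha_carga
      FROM data_incidencia
      WHERE campo_key_id = 1
      GROUP BY fecha_carga) AS t2
  ON t1.fecha_carga = t2.fecha_carga
LEFT JOIN (SELECT count(ticket) AS ticket, fecha_solicitud AS fecha_carga
           FROM data_ticket
           GROUP BY fecha_solicitud) AS t3
  ON t1.fecha_carga = t3.fecha_carga;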

How to write the sqlite command to choose different records?

There are two tables (t1 and t2) in my database; both of them have three fields: code, date, and price. t1 has 800 records and t2 has 790 records, and the sets of code values are the same:
select distinct code from t1 = select distinct code from t2
I want to choose records from t1 and from t2, as follows.
Suppose that in t1:
code date
x1 d1
x1 d2
x1 d3
in t2:
code date
x1 d4
x1 d2
x1 d5
I want to choose these records from t1 (the rows not in t2):
code date
x1 d1
x1 d3
And these records from t2 (the rows not in t1):
code date
x1 d4
x1 d5
How do I write the sqlite command?
Thanks to CL, it works fine for me, but it is difficult for me to understand the query.
1. What is the meaning of
SELECT 1
FROM t2
WHERE t2.code = t1.code
AND t2.date = t1.date ?
Why not write it as
SELECT 1
FROM t1
WHERE t2.code = t1.code
AND t2.date = t1.date ?
2. Why can't I write it as
SELECT *
FROM t1
WHERE NOT EXISTS (SELECT 1
FROM t2,t1
WHERE t2.code = t1.code
AND t2.date = t1.date)
The query SELECT 1 FROM t2 WHERE t2.code = t1.code AND t2.date = t1.date will get two 1s. And what is the meaning of the two 1s existing?
To check whether a record exists, use EXISTS with a correlated subquery.
This selects all t1 rows whose combination of code/date values do not exist in t2:
SELECT *
FROM t1
WHERE NOT EXISTS (SELECT 1
FROM t2
WHERE t2.code = t1.code
AND t2.date = t1.date)
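To answer the follow-up questions: the subquery is correlated, so t1.code and t1.date inside it refer to the current row of the outer t1; that is why its FROM clause must mention only t2 (adding t1 inside, as in question 2, introduces a second, independent copy of t1 and breaks the correlation). The SELECT 1 value itself is irrelevant; EXISTS only checks whether any row comes back. For the t2 direction the same pattern works with the roles swapped, and SQLite's EXCEPT operator is an alternative way to express the set difference (a sketch):

SELECT *
FROM t2
WHERE NOT EXISTS (SELECT 1
                  FROM t1
                  WHERE t1.code = t2.code
                    AND t1.date = t2.date);

-- equivalently, on the selected columns only:
SELECT code, date FROM t1
EXCEPT
SELECT code, date FROM t2;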
You can also try this:
SELECT *
FROM t1
WHERE (code || date) NOT IN
(
    SELECT (code || date)
    FROM t2
);
What I am doing is just concatenating code and date in t1, concatenating code and date in t2, and keeping the combinations that are in t1 but not in t2.

How to design a database (sqlite) for multi-condition queries?

Suppose 1,000,000 records arranged as:
c1_v1 c2_v1 c3_v1 d1
c1_v1 c2_v1 c3_v2 d2
c1_v1 c2_v1 c3_v3 d3
...
c1_v1 c2_v2 c3_v1 d999
c1_v1 c2_v2 c3_v2 d1000
...
c1_v999 c2_v999 c3_v998 d999999
c1_v999 c2_v999 c3_v999 d1000000
Say that we need three conditions (c1_vx, c2_vx, c3_vx) to query the result (dx), but a single condition value such as c1_v1 may be the same across different records. An alternative presentation of the same records:
c1_v1
c2_v1
c3_v1 : d1
c3_v2 : d2
c3_v3 : d3
...
c2_v2
c3_v1 : d999
c3_v2 : d1000
...
c1_v999
c2_v999
c3_v998: d999999
c3_v1000: d1000000
How should I design the tables for the fastest queries? (Just queries; I don't care about insert/update/delete.)
Thanks!
A typical query operation looks like: select d from t_table where c1 = 'UA1000_2048X32_MCSYN' and c2 = '1.234' and c3 = '2.345';
Well, then you just need a composite index on {c1, c2, c3}.
Ideally, you'd also cluster the table, so retrieving d just involves an index seek without a table heap access, but I don't think SQLite supports clustering. Alternatively, consider creating a covering index on {c1, c2, c3, d} instead.
c1 is a string like UA1000_2048X32_MCSYN; c2 and c3 are real (double) numbers.
I'd refrain from trying to equate numbers with strings in your query; some DBMSes can't use an index in these situations and SQLite might be one of them. Instead, just write the query in the most natural way, without single quotes around the number literals:
select d from t_table
where c1 = 'UA1000_2048X32_MCSYN' and c2 = 1.234 and c3 = 2.345;
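A minimal sketch of the suggested setup, using the table and column names from the question (the TEXT/REAL column types are assumptions; adjust them to your data). The index on (c1, c2, c3, d) is covering, so SQLite can answer the query from the index alone, which you can verify with EXPLAIN QUERY PLAN:

CREATE TABLE t_table (c1 TEXT, c2 REAL, c3 REAL, d TEXT);
CREATE INDEX t_table_c1_c2_c3_d ON t_table (c1, c2, c3, d);

-- EXPLAIN QUERY PLAN should report "USING COVERING INDEX" for:
SELECT d FROM t_table
WHERE c1 = 'UA1000_2048X32_MCSYN' AND c2 = 1.234 AND c3 = 2.345;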
