relationship between ingestion latency and ingestion_time() - azure-data-explorer

Let's say I run the following query precisely at datetime(2021-11-02 05:05:00)
MyTable | where ingestion_time() between (datetime(2021-11-02 05:00:00)..5m) | ..
Is it guaranteed that all the rows with their ingestion_time() falling within this time window are already available in the table? Experimentally I found that, for roughly up to 10 minutes after any given time, data for the latest 5-minute window still keeps flowing in. So in this example, even at datetime(2021-11-02 05:10:00), new rows are still being added to the table with an ingestion_time() value falling between datetime(2021-11-02 05:00:00) and datetime(2021-11-02 05:05:00). Is this expected (I guess some basic ingestion latency could be the reason)? And if it's expected, should we set some delay, say 10-15 minutes, in our automation so that it runs this query for a time range that is 10-15 minutes old?
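For illustration, the delayed automation I have in mind would look roughly like this (the 15-minute delay and the 5-minute window are just placeholder values):
// Run at 05:20 to cover the 05:00-05:05 window, i.e. with a 15-minute safety delay.
let delay = 15m;
let windowStart = bin(now() - delay - 5m, 5m);
MyTable
| where ingestion_time() between (windowStart .. 5m)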

No, there is no such guarantee. If you are looking for a way to perform exactly-once processing of all data, with a guarantee of no data loss, you should be using database cursors.
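For example, a minimal sketch of cursor-based incremental processing (MyTable is carried over from your question; the persisted cursor value is a placeholder your automation would store between runs):
// Each run processes only the rows ingested after the cursor recorded by the previous run;
// pass an empty string '' on the very first run.
MyTable
| where cursor_after('<cursor-from-previous-run>')
| summarize ProcessedRows = count()
// After the run, record the new high-water mark to persist for the next run:
print NextCursor = cursor_current()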

Related

What is the best way to detect anomalies in exception count, from logs, in Azure

I have an ASP.NET application deployed in Azure. It generates plenty of logs, some of which are exceptions. I do have a query in a Log Analytics workspace that picks up exceptions from the logs.
I would like to know what is the best and/or cheapest way to detect anomalies in the exception count over a time period.
For example, if the average number of exceptions per hour is N (based on information collected over the past month or so), and the average goes above N+20 at any time (checked every hour or so), then I need to be notified.
N would change dynamically based on the trend.
I would like to know what is the best and/or cheapest way to detect anomalies in the exception count over a time period.
Yes, we can achieve this with the following steps:
Store the average value in a Stored Query Result in Azure.
Using stored query result
.set stored_query_result
There are some limitations on keeping the result; refer to the MSDOC for detailed information.
Note: the stored query result is available for only 24 hours.
The workaround follows:
1. Set the stored query result:
// Here I am using a stored query result to store the average trace message count per 5-hour bin
.set stored_query_result average <|
traces
| summarize events = count() by bin(timestamp, 5h)
| summarize avg(events)
2. Once the query result is set, you can use the stored query result value in another KQL query (the stored value is available for up to 24 hours):
// Retrieve the stored query result
stored_query_result("<StoredQueryResultName>")
| <query follows as per your need>
3. Schedule the alert.
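For example, a minimal sketch of steps 2-3 combined, assuming the stored query result was created as average above, that summarize avg(events) produced a column named avg_events, and that the question's threshold of average + 20 is the alert condition:
// Compare the last hour's trace count against the stored average (placeholder +20 threshold).
// This assumes stored_query_result() can be referenced inside toscalar() in your environment.
let avgEvents = toscalar(stored_query_result("average") | project avg_events);
traces
| where timestamp > ago(1h)
| summarize events = count()
| where events > avgEvents + 20
An alert rule scheduled on this query would then fire whenever it returns a row.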

Slow query on table | WHERE x | ORDER by timestamp | DISTINCT a,b,c,d | TAKE 20 when table large

We are experiencing a sudden performance drop with a query structured like this:
table(tablename)
| where MeasurementName in ('ActiveJobId')
    and MachineId == machineId
    and SourceTimestamp <= from
    and isnotnull(Value)
| order by SourceTimestamp desc
| distinct SourceTimestamp, MeasurementName, tostring(Value), SourceTimestampUtc
| take rows
tablename, machineId, from, and rows are all query parameters. rows is typically 20. The Value column is of type dynamic.
The table contains 240 million entries, with about 64,000 matching the WHERE criteria. The goal of the query is to get the last 20 UNIQUE, non-empty entries for a given machine and data point, starting after a specific date.
The query runs smoothly on the Staging database system, but performance started to degrade on the Dev system, possibly because of the increased data volume.
If we remove the distinct clause, or move it after the take clause, the query completes very fast (<1 s). The data contains about 5-10% duplicate entries.
To our understanding, the query should be executed like this:
Prepare a filter for the source table, starting at a specific datetime range
Order desc: walk backwards
Walk down the table and stop once you have 20 distinct rows
From the time it sometimes takes, it looks almost as if ADX walks down the whole table, performs the distinct, and only then takes the topmost 20 rows.
The problem persists if we swap | order and | distinct around.
The problem disappears if we move | distinct to the end of the query, but then we often receive 1-2 items fewer than required.
Is there a logical error we make, can this query be rewritten, or are there better options at hand?
The goal of the query is to get the last 20 UNIQUE, non-empty entries for a given machine and data point, starting after a specific date.
This part of the description doesn't match the filter in your query (and SourceTimestamp <= from): did you mean to use >= instead of <=?
Is there a logical error we make, can this query be rewritten, or are there better options at hand?
If you can't eliminate the duplicates upstream, you can consider setting up a materialized view that performs the deduplication, and then querying the view directly instead of the raw data. Also see Handle duplicate data.
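For example, a rough sketch of such a view (the view name MeasurementsDedup, the source table name RawMeasurements, and the choice of deduplication keys are placeholders, not your actual schema):
// Keep one arbitrary row per (MachineId, MeasurementName, SourceTimestamp) combination.
.create materialized-view MeasurementsDedup on table RawMeasurements
{
    RawMeasurements
    | summarize take_any(Value, SourceTimestampUtc) by MachineId, MeasurementName, SourceTimestamp
}
Your original query would then run against MeasurementsDedup instead of the raw table.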

Executing SQL to get a few rows from a huge table is throwing an error: "No more spool space"

SELECT row_id FROM table_name
SAMPLE 20;
This throws a "no more spool space" error...
Is there any query to get 20 arbitrary rows from a table in less time? Assume the table is a very huge one.
This is pretty common depending on the primary index of the table. If you add a predicate using the primary index you should return results.
Just add a WHERE clause with the primary index in it to limit the results, and you should see results without a spool issue.
Assuming Table_T is the table name and Table_v is a view on Table_T, the following is the EXPLAIN output for the executed query:
First, we lock Table_T in view Table_v for access.
Next, we do an all-AMPs retrieve step from the table in view Table_v by way of an all-rows scan with a condition of (Table_T in view Table_v.col2 is null) into spool2 (all_amps), which is built locally on all AMPs. The input table will not be cached in memory, but is eligible for synchronized scanning. The size of the spool is estimated with high confidence to be 1 row (35 bytes). The estimated time for this step is 2 minutes and 16 seconds.
We do an all-AMPs stat function step from spool2 by way of an all-rows scan into spool5, which is redistributed by hash code to all AMPs. The result rows are put into spool1 (group_amps), which is built locally on the AMPs. This step is used to retrieve the top 20 rows, then execute step 4. The size is estimated with high confidence to be 1 row (41 bytes).
We do an all-AMPs stat function step from spool2 (last use) by way of an all-rows scan into spool5 (last use), which is redistributed by hash code to all AMPs. The result rows are put into spool1 (group_amps), which is built locally on the AMPs. This step is used to retrieve the top 20 rows. The step is estimated with high confidence to be 1 row (41 bytes).
Finally, we send an END TRANSACTION to all AMPs involved.
The contents of spool1 are sent back to the user.

NHibernate Query slows other queries

I'm writing a program in which I use two database queries via NHibernate. The first query is a large one: a select with two joins (the big SELECT query) whose result is about 50,000 records. The query takes about 30 seconds. The next step in the program is iterating through these 50,000 records and invoking a query on each of them. This second query is a pretty small COUNT.
There are two interesting things, though:
If I run the small COUNT query before the big SELECT, the COUNT query takes about 10 ms, but if I run it after the big SELECT query it takes 8-9 seconds. Furthermore, if I reduce the complexity of the big SELECT query, I also reduce the execution time of the COUNT query afterwards.
If I run the big SELECT query in SQL Server Management Studio it takes 1 second, but from the ASP.NET application it takes 30 seconds.
So there are two main questions: Why does the query take so long to execute in code when it's so fast in SSMS? Why does the big SELECT query affect the small COUNT queries afterwards?
I know there are many possible answers to this problem, but I have googled a lot and this is what I have tried:
Setting the SET parameters of the ASP.NET application and SSMS so they are the same, to avoid different query plans
Clearing the SSMS cache, so the good SSMS result is not caused by SSMS caching - same 1-second result after the cache clear
The big SELECT query:
var subjects = Query
    .FetchMany(x => x.Registrations)
    .FetchMany(x => x.Aliases)
    .Where(x => x.InvalidationDate == null)
    .ToList();
The small COUNT query:
Query.Count(x => debtorIRNs.Contains(x.DebtorIRN.CodIRN) && x.CurrentAmount > 0 && !x.ArchivationDate.HasValue && x.InvalidationDate == null);
As it turned out, the above-mentioned FetchMany calls were unavoidable for the program, so I couldn't just skip them. The first significant improvement I achieved was turning off the application's logging (as I mentioned, the above code is just a fragment); performance without logs was about 50% better, but it still took a considerable amount of time. So I decided to avoid using NHibernate for this query and wrote a plain SQL query executed through a data reader, which I then parsed into my objects. I was able to reduce the execution time from 2.5 days (50000 * 4 sec -> number of small queries * former execution time of one small query) to 8 minutes.

Applying calculations on large data set

I'm currently optimizing our data warehouse and the processes that use it, and I'm looking for some suggestions.
The problem is that I'm not sure how to handle the calculations on the retrieved data.
To make things clearer, for example, we have the following data structure:
id : 1
param: static_value
param2: static_value
And let's say we have about 50 million entries with this structure.
Also let's assume that we query this data set about 30 times per minute, and each query returns at least 10k entries.
So, in short, we have these stats:
Data set: 50 million entries.
Access frequency: about 30 queries per minute.
Result size: ~10k entries per query.
For every query, I have to go through every entry in the result set and apply some calculations to it, which produce a field (for example param3) with a dynamic value. For example:
Query 2 (2k results) and one of its entries:
id : 2
param: static_value_2
param2: static_value_2
param3: dynamic_value_2
Query 3 (10k results) and one of its entries:
id : 3
param: static_value_3
param2: static_value_3
param3: dynamic_value_3
And so on..
The problem is that I can't prepare the param3 value before the query retrieves the data, because of the many dynamic values that are used in the calculations.
Main question:
Are there any guidelines, practices, or even technologies for optimizing this kind of problem or implementing this kind of solution?
Thanks for any information.
Update 1:
The field param3 is calculated on every query for every entry in the result set; the calculated value is not stored anywhere, it is just computed on every query. I can't store this value because it's dynamic and depends on many variables, which is why I can't persist it as a static value.
I guess it's not good practice to have such an implementation?
