KQL Help: Need to trim the Datetime value - azure-data-explorer

I need to trim a datetime value in KQL.
I have a Timer Trigger based Azure Function which runs every 30 minutes ("0 */30 * * * *").
I have two datetime columns, StartTime and EndTime, and I am getting the runtimes of the Azure Function by summarizing min(StartTime) and max(EndTime).
I want the min(StartTime) to be trimmed to the actual start time of the Azure Function.
Example: if the min(StartTime) column value is "2021-10-25 10:02:26.7630995", then StartTime should be "2021-10-25 10:00:00.000000".
And if the min(StartTime) column value is "2021-10-25 10:32:26.7630995", then StartTime should be "2021-10-25 10:30:00.000000".
My code so far (I need help with line #4):
MyKustoTable | where isnotempty(RunID) and RunID > 41
| project RunID, CollectionTime, IngestionTime = ingestion_time()-30m
| summarize StartTime = min(CollectionTime), EndTime = max(IngestionTime) by RunID
| extend RBACDurationInMins = case((EndTime - StartTime)/1m > 30, "Trimmed StartTime", StartTime)
| extend RBACDurationInMins = (EndTime - StartTime)/1m, ResourceType = "RBAC"
| project ResourceType, RunID, StartTime, EndTime, RBACDurationInMins
| sort by RunID desc

You could use the bin() function: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/binfunction
If the min(StartTime) column value is "2021-10-25 10:02:26.7630995", then StartTime should be "2021-10-25 10:00:00.000000"; if it is "2021-10-25 10:32:26.7630995", then StartTime should be "2021-10-25 10:30:00.000000":
print dt1 = datetime(2021-10-25 10:02:26.7630995),
dt2 = datetime(2021-10-25 10:32:26.7630995)
| project result1 = bin(dt1, 30m),
result2 = bin(dt2, 30m)
result1 | result2
2021-10-25 10:00:00.0000000 | 2021-10-25 10:30:00.0000000
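Applied to your query, the same rounding replaces the case() on line #4; here is a sketch against your snippet (assuming CollectionTime is what feeds StartTime, as in your code):
MyKustoTable
| where isnotempty(RunID) and RunID > 41
| project RunID, CollectionTime, IngestionTime = ingestion_time() - 30m
| summarize StartTime = bin(min(CollectionTime), 30m), EndTime = max(IngestionTime) by RunID   // snap the start to the 30-minute boundary
| extend RBACDurationInMins = (EndTime - StartTime) / 1m, ResourceType = "RBAC"
| project ResourceType, RunID, StartTime, EndTime, RBACDurationInMins
| sort by RunID desc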

Related

How to predict when a disk runs out of space?

I collect free disk space metrics at regular intervals and would like to predict when the disk will be full.
I thought I could use series_decompose_forecast.
Here's a sample query:
let DiskSpace =
range Timestamp from ago(60d) to now() step 1d
| order by Timestamp desc
| serialize rn=row_number() + 10
| extend FreeSpace = case
(
rn % 5 == 0, rn + 5
, rn % 3 == 0, rn -4
, rn % 7 == 0, rn +3
, rn
)
| project Timestamp, FreeSpace;
DiskSpace
| make-series FreeSpace = max(FreeSpace) default=long(null) on Timestamp from ago(60d) to now() step 12h
| extend FreeSpace = series_fill_backward(FreeSpace)
| extend series_decompose_forecast(FreeSpace, 24)
| render timechart
And the result:
The baseline seems like it could show me when it will hit zero (or some other threshold), but if I specify more points, it excludes more points from the learning process (I'm still unsure whether it excludes them from the start or the end).
I don't even care about the whole time series, just the date of running out of free space. Is this the correct approach?
It seems that series_fit_line() is more than enough in this scenario.
Once you have the slope and the interception, you can calculate any point on the line.
range Timestamp from now() to ago(60d) step -1d
| extend rn = row_number() + 10
| extend FreeSpace = rn + case(rn % 5 == 0, 5, rn % 3 == 0, -4, rn % 7 == 0, 3, 0)
| make-series FreeSpace = max(FreeSpace) default= long(null) on Timestamp from ago(60d) to now() step 12h
| extend FreeSpace = series_fill_forward(series_fill_backward(FreeSpace))
| extend (rsquare, slope, variance, rvariance, interception, line_fit) = series_fit_line(FreeSpace)
| project slope, interception, Timestamp, FreeSpace, line_fit
| extend x_intercept = todatetime(Timestamp[0]) - 12h*(1 + interception / slope)
| project-reorder x_intercept
| render timechart with (xcolumn=Timestamp, ycolumns=FreeSpace,line_fit)
x_intercept
2022-12-06T01:56:54.0389796Z
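To unpack the x_intercept arithmetic: series_fit_line() fits y = slope * x + interception over the array index x = 0, 1, 2, ..., so the fitted line crosses zero at index -interception / slope, and an index converts back to a timestamp as Timestamp[0] + step * index. A standalone sketch with made-up fitted values (slope, interception, and t0 are illustrative, not taken from the query above):
// hypothetical fitted values, for illustration only
print slope = -0.5, interception = 120.0, step = 12h, t0 = datetime(2022-10-01)
| extend zero_index = -1.0 * interception / slope   // 240: index where the fitted line crosses zero
| extend zero_time = t0 + step * zero_index         // 240 steps of 12h = 120 days after t0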
P.S.
No need for serialize after order by.
No need for order by if you create the range backwards.
Null values in a time series break a lot of functionality (fixed here with an additional series_fill_forward).
If you look at the example in the documentation (https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/series-decompose-forecastfunction), you will see that they add empty slots into the "future" of the original series, which the forecast then predicts.
This is also stated in the notes:
The dynamic array of the original input series should include a number of points slots to be forecasted. The forecast is typically done by using make-series and specifying the end time in the range that includes the timeframe to forecast.
To make your example work:
let DiskSpace =
range Timestamp from ago(60d) to now() step 1d
| order by Timestamp desc
| serialize rn=row_number() + 10
| extend FreeSpace = case
(
rn % 5 == 0, rn + 5
, rn % 3 == 0, rn -4
, rn % 7 == 0, rn +3
, rn
)
| project Timestamp, FreeSpace;
DiskSpace
// add 4 weeks of empty slots in the "future" - these slots will be forecast
| make-series FreeSpace = max(FreeSpace) default=long(null) on Timestamp from ago(60d) to now()+24h*7*4 step 12h
| extend FreeSpace = series_fill_backward(FreeSpace)
| extend forecast=series_decompose_forecast(FreeSpace, 7*4*2)
| render timechart
The documentation could be a bit clearer, but I think what the points parameter does is simply omit the last N points from training (since they are empty, and you don't want to include them in your forecast model).
Output:
To get when you hit close to 0:
let DiskSpace =
range Timestamp from ago(60d) to now() step 1d
| order by Timestamp desc
| serialize rn=row_number() + 10
| extend FreeSpace = case
(
rn % 5 == 0, rn + 5
, rn % 3 == 0, rn -4
, rn % 7 == 0, rn +3
, rn
)
| project Timestamp, FreeSpace;
DiskSpace
| make-series FreeSpace = max(FreeSpace) default=long(null) on Timestamp from ago(60d) to now()+24h*7*4 step 12h
| extend FreeSpace = series_fill_backward(FreeSpace)
| extend forecast=series_decompose_forecast(FreeSpace, 7*4*2)
| mv-apply with_itemindex=idx f=forecast to typeof(double) on (
where f <= 0.5
| summarize min(idx)
)
| project AlmostOutOfDiskSpace = Timestamp[min_idx], PredictedDiskSpaceAtThatPoint = forecast[min_idx]
AlmostOutOfDiskSpace | PredictedDiskSpaceAtThatPoint
5/12/2022 13:02:24 | 0.32277009977544
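The 0.5 cutoff in the mv-apply step is arbitrary; any threshold works the same way. Here is a minimal, self-contained sketch of the same pattern with made-up forecast numbers (the threshold and both arrays are illustrative):
let threshold = 5.0;   // illustrative cutoff
print forecast = pack_array(9.0, 6.5, 4.2, 1.1),
      Timestamp = pack_array(datetime(2022-05-10), datetime(2022-05-11), datetime(2022-05-12), datetime(2022-05-13))
| mv-apply with_itemindex=idx f = forecast to typeof(double) on (
    where f <= threshold      // first index where the forecast dips below the cutoff
    | summarize min(idx)
)
| project AlmostOutOfDiskSpace = Timestamp[min_idx], PredictedDiskSpaceAtThatPoint = forecast[min_idx]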

table counterpart of column_ifexists()

We have the function column_ifexists(), which refers to a certain column if it exists and otherwise falls back to another option we provide. Is there a similar function for tables? I want to refer to a table and run some logic against it in the query if the table exists; but if it doesn't exist, there shouldn't be a failure, it should simply return no data.
e.g.
table_ifexists('sometable') | ...<logic>...
Please note that the fields referenced in the query should be defined in the dummy table; otherwise, in the case of a non-existing table, the query will yield an exception:
Failed to resolve scalar expression named '...'
In the following example these fields are StartTime, EndTime & EventType.
Table exists
let requested_table = "StormEvents";
let dummy_table = datatable (StartTime:datetime, EndTime:datetime, EventType:string)[];
union isfuzzy=true table(requested_table), dummy_table
| where EndTime - StartTime > 30d
| summarize count() by EventType
EventType | count_
Drought | 1635
Flood | 20
Heat | 14
Wildfire | 4
Table does not exist
let requested_table = "StormEventsXXX";
let dummy_table = datatable (StartTime:datetime, EndTime:datetime, EventType:string)[];
union isfuzzy=true table(requested_table), dummy_table
| where EndTime - StartTime > 30d
| summarize count() by EventType
EventType | count_
(no rows)
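If you need this pattern in several queries, you could wrap it in a stored function; a sketch (the function name is illustrative, and the dummy schema must mirror the fields your queries reference):
.create-or-alter function StormEventsOrEmpty() {
    union isfuzzy=true
        table("StormEvents"),
        datatable (StartTime:datetime, EndTime:datetime, EventType:string)[]
}
Callers can then write StormEventsOrEmpty() | where EndTime - StartTime > 30d | ... without caring whether the underlying table exists.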

Use values from one table in the bin operator of another table

Consider the following query, which generates a one-cell result for a fixed value of bin_duration:
events
| summarize count() by id, bin(time, bin_duration)
| count
I wish to generate a table with variable values of bin_duration.
bin_duration will take values from the following table:
range bin_duration from 0 to 600 step 10;
So that the final table looks something like this, with one row per bin_duration value:
How do I go about achieving this?
Thanks
bin(value, roundTo), aka floor(value, roundTo), rounds value down to the nearest multiple of roundTo, so you don't need an external table.
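For instance, a quick check of the rounding:
print bin(237, 10)   // returns 230, the nearest multiple of 10 at or below 237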
events
| summarize n = count() by bin(duration,10)
| where duration between(0 .. 600)
| order by duration asc
You can try this out on the StormEvents tutorial data:
let events = StormEvents | extend duration = (EndTime - StartTime) / 1h;
events
| summarize n = count() by bin(duration,10)
| where duration between(0 .. 600)
| order by duration asc
When dealing with time-series data, bin() also understands the handy timespan literals, e.g.:
let events = StormEvents | extend duration = (EndTime - StartTime);
events
| summarize n = count() by bin(duration,10h)
| where duration between(0h .. 600h)
| order by duration asc

R: Writing loops to replace NULL with Dates

Here is an example of my table:
custID | StartDate | EndDate | ReasonForEnd | TransactionType | TransactionDate
1a | NULL | 2/12/2014 | AccountClosed | AccountOpened | 1/15/2004
1a | NULL | 2/12/2014 | AccountClosed | Purchase | 3/16/2004
.......
2b | 7/7/2011 | 6/14/2013 | AccountClosed | AccountOpened | 8/1/2010
The problem has to do with the StartDate field. For each custID, if the entry is NULL, then I want to replace it with the TransactionDate where TransactionType = AccountOpened. If StartDate is after the TransactionDate where TransactionType = AccountOpened, then I also want to replace it with that date.
The actual data is over 250,000 rows. I really need some help figuring out how to write this in R.
You could try the following; however, I haven't tested it yet. I assume your data.frame is called df:
require(dplyr)
df %>%
  mutate_each(funs(as.Date(as.character(.), format = "%m/%d/%Y")),
              StartDate, EndDate, TransactionDate) %>%
  group_by(custID) %>%
  mutate(StartDate = ifelse(is.na(StartDate) | StartDate > TransactionDate[TransactionType == "AccountOpened"],
                            TransactionDate[TransactionType == "AccountOpened"],
                            StartDate))
This code first converts several columns to Date format (in this step, NULL entries are converted to NA), groups by custID, and then checks whether StartDate is either NA or greater than the TransactionDate where TransactionType == "AccountOpened"; if TRUE, it replaces StartDate with that TransactionDate.

Fetching records by sampling epoch from MySQL table - selecting nearest value?

I have a table with fields like
TimeStamp | Field1 | Field2
--------------------------------------
1902909002 | xyddtz | 233447
1902901003 | xytzff | 233442
1902901105 | xytzdd | 233443
1902909007 | xytzdd | 233443
1902909909 | xytsqz | 233436
Now I want to query it and fetch records like timestamp = 1902909002 (approx.) OR 1902901003 (approx.) OR ...
I want one record for each epoch time, whichever is nearest to that epoch.
I have written something:
string sqlQuery = "SELECT TimeStamp, FwdHr, W FROM Meter_Data WHERE (TimeStamp <= " + timeSt[0].ToString();
for (int i = 1; i < timeSt.Count; i++)
{
    sqlQuery = sqlQuery + " OR TimeStamp <= " + timeSt[i].ToString();
}
sqlQuery = sqlQuery + ") AND MeterID = @meterID AND DeviceID = @deviceID ORDER BY TimeStamp";
This is returning null, and if the from and to dates have a large difference, the ORs will number in the thousands. Can anybody suggest a better way?
Can you apply a range? I mean something like this:
var max = timeSt.Max();
var min = timeSt.Min();
var sqlString = string.Format(@"
SELECT
    TimeStamp,
    FwdHr,
    W
FROM
    Meter_Data
WHERE
    TimeStamp BETWEEN {0} AND {1}
    AND MeterID = @meterID AND DeviceID = @deviceID
ORDER BY
    TimeStamp", min, max);
