How to predict when a disk runs out of space? - azure-data-explorer

I collect Free disk space metrics at regular intervals and would like to predict when the disk will be full.
I thought I could use series_decompose_forecast().
Here's a sample query:
let DiskSpace =
range Timestamp from ago(60d) to now() step 1d
| order by Timestamp desc
| serialize rn=row_number() + 10
| extend FreeSpace = case
(
rn % 5 == 0, rn + 5
, rn % 3 == 0, rn -4
, rn % 7 == 0, rn +3
, rn
)
| project Timestamp, FreeSpace;
DiskSpace
| make-series
FreeSpace = max(FreeSpace) default= long(null)
on Timestamp from ago(60d) to now() step 12h
| extend FreeSpace = series_fill_backward(FreeSpace)
| extend series_decompose_forecast(FreeSpace, 24)
| render timechart
And the resulting timechart:
The baseline seems like it could show me when it will hit zero (or some other threshold), but if I specify more points, it excludes more of them from the learning process (I'm still unsure whether it excludes them from the start or the end).
I don't even care about the whole time series, just the date when free space runs out. Is this the correct approach?

It seems that series_fit_line() is more than enough in this scenario.
Once you have the slope and the interception, you can calculate any point on the line.
range Timestamp from now() to ago(60d) step -1d
| extend rn = row_number() + 10
| extend FreeSpace = rn + case(rn % 5 == 0, 5, rn % 3 == 0, -4, rn % 7 == 0, 3, 0)
| make-series FreeSpace = max(FreeSpace) default= long(null) on Timestamp from ago(60d) to now() step 12h
| extend FreeSpace = series_fill_forward(series_fill_backward(FreeSpace))
| extend (rsquare, slope, variance, rvariance, interception, line_fit) = series_fit_line(FreeSpace)
| project slope, interception, Timestamp, FreeSpace, line_fit
| extend x_intercept = todatetime(Timestamp[0]) - 12h*(1 + interception / slope)
| project-reorder x_intercept
| render timechart with (xcolumn=Timestamp, ycolumns=FreeSpace,line_fit)
x_intercept
2022-12-06T01:56:54.0389796Z
Fiddle
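If you need the date for a non-zero threshold, the same pair of values can be solved directly: interception + slope * i = threshold gives the slot index i, which you then convert back to a timestamp. A minimal sketch along those lines (the Threshold value and the slots_to_threshold/threshold_date names are made up for illustration):
let Threshold = 10.0;
range Timestamp from now() to ago(60d) step -1d
| extend rn = row_number() + 10
| extend FreeSpace = rn + case(rn % 5 == 0, 5, rn % 3 == 0, -4, rn % 7 == 0, 3, 0)
| make-series FreeSpace = max(FreeSpace) default=long(null) on Timestamp from ago(60d) to now() step 12h
| extend FreeSpace = series_fill_forward(series_fill_backward(FreeSpace))
| extend (rsquare, slope, variance, rvariance, interception, line_fit) = series_fit_line(FreeSpace)
// slot index where the fitted line crosses the threshold, then back to a datetime (12h per slot)
| extend slots_to_threshold = (Threshold - interception) / slope
| project threshold_date = todatetime(Timestamp[0]) + 12h * slots_to_threshold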
P.S.
No need for serialize after order by.
No need for order by if you create the range backwards.
A null value in a time series breaks a lot of functionality (fixed with an additional series_fill_forward; see the small illustration below).
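A tiny stand-alone illustration of that last note (the array is made up): series_fill_backward replaces a missing value with the next valid one, so it covers leading nulls, while series_fill_forward uses the previous valid one and covers trailing nulls.
print s = dynamic([null, 2, null, 4, null])
| extend filled = series_fill_forward(series_fill_backward(s))
// filled == [2, 2, 4, 4, 4]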

If you look at the example: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/series-decompose-forecastfunction
You will see that they add empty slots into the "future" of the original series, which the forecast then predicts.
This is also stated in the notes:
The dynamic array of the original input series should include a number of points slots to be forecasted. The forecast is typically done by using make-series and specifying the end time in the range that includes the timeframe to forecast.
To make your example work:
let DiskSpace =
range Timestamp from ago(60d) to now() step 1d
| order by Timestamp desc
| serialize rn=row_number() + 10
| extend FreeSpace = case
(
rn % 5 == 0, rn + 5
, rn % 3 == 0, rn -4
, rn % 7 == 0, rn +3
, rn
)
| project Timestamp, FreeSpace;
DiskSpace
// add 4 weeks of empty slots in the "future" - these slots will be forecast
| make-series FreeSpace = max(FreeSpace) default=long(null) on Timestamp from ago(60d) to now()+24h*7*4 step 12h
| extend FreeSpace = series_fill_backward(FreeSpace)
| extend forecast=series_decompose_forecast(FreeSpace, 7*4*2)
| render timechart
The documentation could be a bit clearer, but I think what the points parameter does is simply omit the last N points from training (since they are empty and you don't want to include them in your forecast model). Here N = 7*4*2 = 56, which is exactly the number of empty 12-hour slots appended for the four "future" weeks.
Output:
To find the point where you get close to 0:
let DiskSpace =
range Timestamp from ago(60d) to now() step 1d
| order by Timestamp desc
| serialize rn=row_number() + 10
| extend FreeSpace = case
(
rn % 5 == 0, rn + 5
, rn % 3 == 0, rn -4
, rn % 7 == 0, rn +3
, rn
)
| project Timestamp, FreeSpace;
DiskSpace
| make-series FreeSpace = max(FreeSpace) default=long(null) on Timestamp from ago(60d) to now()+24h*7*4 step 12h
| extend FreeSpace = series_fill_backward(FreeSpace)
| extend forecast=series_decompose_forecast(FreeSpace, 7*4*2)
| mv-apply with_itemindex=idx f=forecast to typeof(double) on (
where f <= 0.5
| summarize min(idx)
)
| project AlmostOutOfDiskSpace = Timestamp[min_idx], PredictedDiskSpaceAtThatPoint = forecast[min_idx]
AlmostOutOfDiskSpace | PredictedDiskSpaceAtThatPoint
5/12/2022 13:02:24 | 0.32277009977544
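The mv-apply pattern above can be tried in isolation on a made-up forecast array; with_itemindex exposes the slot index, and summarize min(idx) keeps the first slot at or below the threshold:
print forecast = dynamic([5.0, 3.2, 1.1, 0.4, -0.2])
| mv-apply with_itemindex=idx f = forecast to typeof(double) on (
    where f <= 0.5
    | summarize min(idx)
)
| project FirstLowSlot = min_idx, ForecastThere = forecast[min_idx]
// FirstLowSlot == 3, ForecastThere == 0.4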

Related

Return decimal value using Kusto Query Language that is always returning 0

I have a Kusto query that I am using to query Application Insights. The goal is to get the number of failed requests in 5-minute buckets and divide that by the total number of requests in the same 5-minute bucket. I will eventually build an alert that triggers if this percentage is greater than a certain value, but I can't seem to get the query right.
In the example below, I hardcode a specific timestamp to make sure I get some failures.
Here is the query:
let fn = "APP_NAME";
requests
| where success == "False" and cloud_RoleName == fn
| summarize failed=sum(itemCount) by bin(timestamp, 5m)
| where timestamp == "2021-05-17T20:20:00Z"
| join (
requests
| where cloud_RoleName == fn
| summarize reqs=sum(itemCount) by bin(timestamp, 5m)
| where timestamp == "2021-05-17T20:20:00Z"
) on timestamp
| project timestamp, failed, reqs
| extend p=round(failed/reqs, 2)
It currently returns:
timestamp [UTC] | p | failed | reqs
5/17/2021, 8:20:00.000 PM | 0 | 1,220 | 6,649
Can anyone give me insight into how to get the decimal value (~0.18) I expect for p?
I had to cast the values to doubles to get it to return something other than 0.
let fn = "APP_NAME";
requests
| where success == "False" and cloud_RoleName == fn
| summarize failed=sum(itemCount) by bin(timestamp, 5m)
| join (
requests
| where cloud_RoleName == fn
| summarize reqs=sum(itemCount) by bin(timestamp, 5m)
) on timestamp
| project timestamp, failedReqs=failed, totalRequests=reqs, percentage=(todouble(failed) / todouble(reqs) * 100)
Another option that is a bit less verbose is to multiply by 100.0 (which is a double literal):
percentage = failed * 100.0 / reqs
Note that the multiplication has to happen before the division, so the expression is promoted to double before dividing.
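A quick stand-alone check of the behaviour, using the figures from the table above:
print failed = 1220, reqs = 6649
| extend int_division   = failed / reqs                      // 0, because long / long stays long
| extend cast_first     = round(todouble(failed) / reqs, 2)  // 0.18
| extend multiply_first = round(failed * 100.0 / reqs, 2)    // 18.35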

computing offset for prev dynamically

I want to set the offset for prev() dynamically, based on the number of items in a group. For example:
T
| make-series value = sum(value) on timestamp from .. to .. step 5m by customer
| summarize by bin(timestamp,1h), customer
| extend prev_value = prev(value,<offset>)
The offset here should be equal to the number of distinct customers. How can I compute this offset dynamically?
If you can split the query into smaller parts, you can use the toscalar() function to get the number of unique customers.
This would be my approach:
let tab_series =
T
| make-series value = sum(value) on timestamp from .. to .. step 5m by customer
;
let no_of_distinct_customers =
toscalar(tab_series | distinct customer | summarize count())
;
tab_series
| summarize by bin(timestamp, 1h), customer
| extend prev_value = prev(value, no_of_distinct_customers)
You can find an example here.
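As a stand-alone sketch of the toscalar() idea (the table and values here are hypothetical), the sub-query result becomes a constant that the rest of the query can reference:
let T = datatable(customer:string, value:long)
[
    "A", 10,
    "B", 20,
    "A", 30
];
let no_of_distinct_customers = toscalar(T | summarize dcount(customer));
T
| extend offset = no_of_distinct_customers   // 2 for this sample data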

How can I fix wrong date format in SQLite

I'm working on an app where I use SQLite to store data.
I created a column Date. Since I'm a beginner, I made the mistake of inputting dates as %m/%d/%Y (for example: 2/20/2020).
Now I've got a problem when selecting rows between two dates.
I tried using this code:
SELECT * FROM Table WHERE Date BETWEEN strftime('%m/%d/%Y','2/5/2019') AND strftime('%m/%d/%Y','2/20/2020')
But that's not working.
Example table:
ID | Date
01 | 9/2/2019
02 | 2/20/2020
Thank you in advance for your help.
Update your dates to the only date format valid for SQLite, which is YYYY-MM-DD:
update tablename
set date = substr(date, -4) || '-' ||
substr('00' || (date + 0), -2, 2) || '-' ||
substr('00' || (substr(date, instr(date, '/') + 1) + 0), -2, 2);
See the demo.
Results:
| ID | Date |
| --- | ---------- |
| 1 | 2019-09-02 |
| 2 | 2020-02-20 |
Now you can set the conditions like:
Date BETWEEN '2019-02-05' AND '2020-02-20'
If you make this change, you can use the function strftime() in SELECT statements to return the dates in any format that you want:
SELECT strftime('%m/%d/%Y', date) date FROM Table
If you don't change the format of the Date column, then every time you need to compare dates you will have to transform the value with the expression used in the UPDATE statement, and this is the worst choice that you could make.
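For illustration, this is roughly what every comparison would look like if you kept the m/d/Y text format (a sketch, not from the answer above; it just reuses the UPDATE expression inline):
SELECT *
FROM tablename
WHERE substr(Date, -4) || '-' ||
      substr('00' || (Date + 0), -2, 2) || '-' ||
      substr('00' || (substr(Date, instr(Date, '/') + 1) + 0), -2, 2)
      BETWEEN '2019-02-05' AND '2020-02-20';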

Use values from one table in the bin operator of another table

Consider the following query, which generates a single-cell result for a fixed value of bin_duration:
events
| summarize count() by id, bin(time, bin_duration) | count
I wish to generate a table with variable values of bin_duration.
bin_duration will take values from the following table:
range bin_duration from 0 to 600 step 10;
So that the final table looks something like this:
How do I go about achieving this?
Thanks
bin(value, roundTo), a.k.a. floor(value, roundTo), rounds value down to the nearest multiple of roundTo, so you don't need an external table.
events
| summarize n = count() by bin(duration,10)
| where duration between(0 .. 600)
| order by duration asc
You can try this out with the StormEvents tutorial data:
let events = StormEvents | extend duration = (EndTime - StartTime) / 1h;
events
| summarize n = count() by bin(duration,10)
| where duration between(0 .. 600)
| order by duration asc
When dealing with time series data, bin() also understands the handy timespan literals, for example:
let events = StormEvents | extend duration = (EndTime - StartTime);
events
| summarize n = count() by bin(duration,10h)
| where duration between(0h .. 600h)
| order by duration asc
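A quick stand-alone check of how bin() rounds down to the nearest multiple, including with timespans:
print bin(23, 10), bin(4.7, 1), bin(25h, 10h)
// -> 20, 4, 20:00:00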

Order of columns after pivot in application insights

A user wants a count of unique sessions per week in Application Insights. I have the query working, including a pivot, but the Week columns are out of order. I would prefer them to be in order.
pageViews
| where timestamp < now()
| summarize Sessions= dcount(session_Id)
by Week=bin(datepart("weekOfYear", timestamp), 1), user_AuthenticatedId
| order by Week
| evaluate pivot(Week, sum(Sessions))
| join kind=innerunique (pageViews
| summarize MostRecentRequest = max(timestamp) by user_AuthenticatedId)
on $right.user_AuthenticatedId == $left.user_AuthenticatedId
| project-away user_AuthenticatedId1
I've tried ordering by timestamp before the summarize, and ordering by Week after the summarize (still in the query above), with no luck.
There's currently a "trick" that will work: add serialize right after your order by:
pageViews
| where timestamp < now()
| where isnotempty(user_AuthenticatedId)
| summarize Sessions= dcount(session_Id)
by Week=bin(datepart("weekOfYear", timestamp), 1), user_AuthenticatedId
| order by Week
| serialize // <--------------------------------- RIGHT HERE
| evaluate pivot(Week, sum(Sessions))
| join kind=innerunique (pageViews
| summarize TotalSessions=dcount(session_Id), MostRecentRequest = max(timestamp) by user_AuthenticatedId)
on $right.user_AuthenticatedId == $left.user_AuthenticatedId
| project-away user_AuthenticatedId1
| top 100 by TotalSessions desc
This gets me the following in workbooks, with the weeks in descending order (I also added a total session count to sort/top by, with some custom column settings set).
The custom column settings I use in workbooks: delete all the #'d columns that are there by default and add one for ^[0-9]+$ set to heatmap.
I refactored the query a bit for my own comprehension, splitting the left and right sides of the join into "views". Thought I'd share.
let users_MostRecent_Session =
pageViews
| summarize
TotalSessions=dcount(session_Id)
, MostRecentRequest = max(timestamp)
by
user_AuthenticatedId
;
//
let users_sessions_ByWeek =
pageViews
| where timestamp < now()
| where isnotempty(user_AuthenticatedId)
| summarize
Sessions= dcount(session_Id)
by
Week=bin(datepart("weekOfYear", timestamp), 1)
, user_AuthenticatedId
| order by Week
| serialize
| evaluate pivot(Week, sum(Sessions))
;
//
//
users_sessions_ByWeek
| join kind=innerunique
users_MostRecent_Session
on user_AuthenticatedId
| project-away user_AuthenticatedId1
| top 100 by TotalSessions desc
