My query uses the count() function to return the count of rows summarized by day. When there are no rows in the table for a given day, I get no result for that day; instead I need rows for all days, with the count as zero. I tried coalesce() but it didn't work. Any help is much appreciated!
Thanks!
Here is my query:
exceptions
| where name == 'my_scheduler' and timestamp > ago(30d)
| extend day = split(tostring(timestamp + 19800s), 'T')[0]
| summarize schedulerFailed = coalesce(count(),tolong("0")) by tostring(day)
Instead of summarize, you need to use make-series, which will fill the gaps with a default value for you.
exceptions
| where name == 'my_scheduler' and timestamp > ago(30d)
| extend day = todatetime(split(tostring(timestamp + 19800s), 'T')[0])
| make-series schedulerFailed = count() default = 0 on day step 1d
You might want to add from and to to make-series in order for it to also fill gaps at the beginning and the end of the 30d period.
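For example (a sketch, not tested against your data; it keeps your 19800-second offset, casts day to datetime so make-series can use it as the axis, and then expands the series back into one row per day):
exceptions
| where name == 'my_scheduler' and timestamp > ago(30d)
| extend day = todatetime(split(tostring(timestamp + 19800s), 'T')[0])
| make-series schedulerFailed = count() default = 0
    on day from startofday(ago(30d) + 19800s) to startofday(now() + 19800s) step 1d
| mv-expand day to typeof(datetime), schedulerFailed to typeof(long)
The mv-expand at the end turns the arrays produced by make-series back into one row per day, matching the shape of your original summarize output.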
I have data with a start and end date e.g.
+---+----------+------------+
| id| start| end|
+---+----------+------------+
| 1|2021-05-01| 2022-02-01|
| 2|2021-10-01| 2021-12-01|
| 3|2021-11-01| 2022-01-01|
| 4|2021-06-01| 2021-10-01|
| 5|2022-01-01| 2022-02-01|
| 6|2021-08-01| 2021-12-01|
+---+----------+------------+
I want a count for each month of how many observations were "active", in order to display that in a plot. By "active" I mean observations whose start and end dates include the given month. The result for the example data should look like this:
Example of a plot for the active times
I have looked into the pyspark Window function, but I don't think that can help me with my problem. So far my only idea is to specify an extra column for each month in the data and indicate whether the observation is active in that month and work from there. But I feel like there must be a much more efficient way to do this.
You can use the sequence SQL function. sequence creates a date range from a start, an end, and an interval, and returns it as a list.
Then you can use explode to flatten the list and count the rows per month.
from pyspark.sql import SparkSession, functions as F
# Make sure your spark session is set to UTC.
# This SQL won't work well with a month interval if timezone is set to a place that has a daylight saving.
spark = (SparkSession
         .builder
         .config('spark.sql.session.timeZone', 'UTC')
         # ... other config
         .getOrCreate())
df = (df.withColumn('range', F.expr('sequence(to_date(`start`), to_date(`end`), interval 1 month)'))
        .withColumn('observation', F.explode('range')))
df = df.groupby('observation').count()
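If you want the months in chronological order for the plot (a small follow-up sketch; df here is the grouped result from the line above):
df = df.orderBy('observation')
df.show()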
I am running KQL (Kusto query language) queries against Azure Application Insights. I have certain measurements that I want to aggregate weekly. I am trying to figure out how to split my data into weeks.
To illustrate what I seek, here is a query that computes daily averages of the duration column.
requests
| where timestamp > ago(7d)
| summarize
avg(duration)
by
Date = format_datetime(timestamp, "yyyy-MM-dd")
This produces a table with one row per day and the average duration for that day.
In the above I have converted datetimes to string and thus effectively "rounded them down" to the precision of one day. This may be ugly, but it's the easiest way I could think of in order to group all results from a given day. It would be trivial to round down to months or years with the same technique.
But what if I want to group datetimes by week? Is there a nice way to do that?
I do not care whether my "weeks" start on Monday or Sunday or January 1st or whatever. I just want to group a collection of KQL datetimes into 7-day chunks. How can I do that?
Thanks in advance!
Looks like you are looking for the "bin()" function:
requests
| where timestamp > ago(7d)
| summarize
avg(duration)
by
bin(timestamp, 1d) // one day, for 7 days change it to 7d
I found out that I can use the week_of_year function to split datetimes by week number:
requests
| where timestamp > ago(30d)
| summarize
avg(duration)
by
Week = week_of_year(timestamp)
| sort by Week
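If you would rather group on an actual date than on a week number, another option (a sketch using the standard startofweek() function, which is not from the answers above) is to round each timestamp down to the beginning of its week:
requests
| where timestamp > ago(30d)
| summarize
avg(duration)
by
Week = startofweek(timestamp)
| sort by Week asc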
In the query below I am looking at the 80th-percentile duration of one API (foo/bar1) called over a given date range, so that I can see if there is any spike or degradation.
let dataset = requests
| where name == "GET foo/bar1"
and timestamp between(datetime("2020-10-15") .. datetime('2020-10-28'));
dataset
| summarize loadTime = round(percentile(duration, 80)) by format_datetime(timestamp, 'yyyy-MM-dd')
| order by timestamp desc
The challenge I'm facing is that there can be more than one API (there are about 150 in my environment), and I also want to get those APIs' 80th percentiles, but I'm having difficulty figuring out how to do that, or whether it's even possible.
I might have figured this out: by removing 'name' from the dataset and then adding 'name' to the grouping section at the end of the summarize row.
let dataset = requests
| where timestamp between(datetime("2020-10-25") .. datetime('2020-10-28'));
dataset
| summarize loadTime = round(percentile(duration, 80)) by format_datetime(timestamp, 'yyyy-MM-dd'), name
| order by timestamp desc
I am trying to get the sum of all categories from a certain month from my transactions table in my sqlite database. Here is how the table is set up...
| id | transactionDate | transactionAmount | transactionCategory | transactionAccount |
Now, I want to specify three things:
The account name
The month
The year
And get the sum of the transactionAmount grouped by transactionCategory from the specified account, year, and month.
Here is what my SELECT statement looks like...
SELECT SUM(transactionAmount) AS total, transactionDate, transactionCategory
FROM transactions
WHERE transactionAccount=? AND strftime('%m', transactionDate)=? AND strftime('%y', transactionDate)=?
GROUP BY transactionCategory ORDER BY transactionCategory
Unfortunately, this returns zero rows. I am able to get accurate results if I don't try and select the month and year, but I would like to see the data from specific ranges of time...
I figured out the issue. I was simply formatting the year incorrectly. It should have been strftime('%Y', transactionDate)=? NOT strftime('%y', transactionDate)=? - the difference being a capital Y vs. a lowercase one.
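For reference, here is the statement with only that change applied (the rest is the same query from the question):
SELECT SUM(transactionAmount) AS total, transactionDate, transactionCategory
FROM transactions
WHERE transactionAccount=? AND strftime('%m', transactionDate)=? AND strftime('%Y', transactionDate)=?
GROUP BY transactionCategory ORDER BY transactionCategory
Note that strftime('%m', ...) returns a zero-padded string ('01' through '12') and strftime('%Y', ...) a four-digit year, so the bound parameters need to be in those formats.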
I have a data set that looks like the following that I'd like to expand to a monthly panel data set.
ID | start_date | end_date | event_type |
1 | 01/01/97 | 08/01/98 | 1 |
2 | 02/01/97 | 10/01/97 | 1 |
3 | 01/01/96 | 12/01/04 | 2 |
Some cases last longer than others. I've figured out how to expand the data to a yearly configuration by pulling out the year from each date and then using:
year <- ddply(df, c("ID"), summarize, year = seq(startyear, endyear))
followed by:
month <- ddply(year, c("ID"), summarize, month = seq(1, 12))
The problem with this approach is that it doesn't assign the correct number to each month (i.e. January = 1), so it doesn't play well with an event data set that I would like to eventually merge it with, where I would be matching on year, ID, and month. Help would be appreciated. Here is a direct link to the data set I am trying to expand (.xls): http://db.tt/KeLRCzr9. Hopefully I've included enough information, but please let me know if there is any other information needed.
You could try something more like this:
ddply(df, .(ID), transform, dt = seq.Date(as.Date(start_date, "%m/%d/%Y"), as.Date(end_date, "%m/%d/%Y"), by = "month"))
There will probably be a lot of warnings having to do with the row names, and I can't guarantee that this will work, since the data set you link to does not match the example you provide. For starters, I'm assuming that you cleaned up the start and end dates, since they appear in various formats in the .xls file.
ddply(df, .(ID), summarize, dt = seq.Date(start_date, end_date, by = "month"))
Assuming start_date and end_date are date objects already. Joran got me close though, so again, thanks for the help on that.
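To then get numeric year and month columns for matching against the event data set (a sketch; it just reruns the line above into a data frame called monthly and formats the dt column):
monthly <- ddply(df, .(ID), summarize, dt = seq.Date(start_date, end_date, by = "month"))
monthly$year <- as.integer(format(monthly$dt, "%Y"))
monthly$month <- as.integer(format(monthly$dt, "%m"))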