Assuming I have a definition of a user, I can calculate the sum of all daily users and of all monthly users.
customEvents
| where timestamp > ago(30d)
| where <condition>
| summarize by <user>, bin(timestamp, 1d)
| summarize count() by bin(timestamp, 1d)
| summarize DAU=sum(count_)
customEvents
| where timestamp > ago(30d)
| where <condition>
| summarize by <user>
| summarize MAU=30*count()
The question is: how do I calculate DAU/MAU? Some join magic?
Edit:
There is a much easier way to calculate usage metrics now - "evaluate activity_engagement":
union *
| where timestamp > ago(90d)
| evaluate activity_engagement(user_Id, timestamp, 1d, 28d)
| project timestamp, Dau_Mau=activity_ratio*100
| render timechart
-------
The DAU is really straightforward in Analytics - just use a dcount.
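For reference, a minimal sketch of the plain dcount-based DAU, using the same pageViews/user_Id columns as the rolling query below:
pageViews
| where timestamp > ago(30d)
// distinct users per day
| summarize DAU = dcount(user_Id) by bin(timestamp, 1d)
| render timechart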
The tricky part of course is calculating the 28-day rolling MAU.
I wrote a post a few weeks back detailing exactly how to calculate stickiness in app analytics. The trick is that you have to use hll() and hll_merge() to calculate the intermediate dcount results for each day, and then merge them together.
The end result looks like this:
let start=ago(60d);
let period=1d;
let RollingDcount = (rolling:timespan)
{
pageViews
| where timestamp > start
| summarize hll(user_Id) by bin(timestamp, period)
| extend periodKey = range(bin(timestamp, period), timestamp+rolling, period)
| mvexpand periodKey
| summarize rollingUsers = dcount_hll(hll_merge(hll_user_Id)) by todatetime(periodKey)
};
RollingDcount(28d)
| join RollingDcount(0d) on periodKey
| where periodKey < now() and periodKey > start + 28d
| project Stickiness = rollingUsers1 *1.0/rollingUsers, periodKey
| render timechart
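For a different engagement window, e.g. weekly stickiness (DAU/WAU), only the tail of the query changes; a sketch reusing the RollingDcount function defined above:
RollingDcount(7d)
| join RollingDcount(0d) on periodKey
| where periodKey < now() and periodKey > start + 7d
| project Stickiness = rollingUsers1 * 1.0 / rollingUsers, periodKey
| render timechart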
Looks like this query does it:
let query = customEvents
| where timestamp > datetime("2017-02-01T00:00:00Z") and timestamp < datetime("2017-03-01T00:00:00Z")
| where <optional condition>;
let DAU = query
| summarize by <user>, bin(timestamp, 1d)
| summarize count() by bin(timestamp, 1d)
| summarize DAU=sum(count_), _id=1;
let MAU = query
| summarize by <user>
| summarize MAU=count(), _id=1;
DAU | join (MAU) on _id
| project ["DAU/MAU"] = todouble(DAU)/30/MAU*100, ["Sum DAU"] = DAU, ["MAU"] = MAU
Any suggestions on how to calculate it over the last few months?
Zaki, your queries calculate a point-in-time MAU/DAU. If you need a rolling MAU you can use the HLL approach suggested by Asaf, or the following, which is my preferred rolling MAU and uses make-series and fir(). You can play with it hands-on using this link to the Analytics demo portal.
The two approaches require some time to get used to... and from what I have seen both are blazing fast. One advantage of the make-series and fir() approach is that it is 100% accurate, while the HLL approach is heuristic and has some level of error. Another bonus is that it is really easy to configure the level of user engagement that makes a user eligible for the count.
let endtime=endofday(datetime(2017-03-01T00:00:00Z));
let window=60d;
let starttime=endtime-window;
let interval=1d;
let user_bins_to_analyze=28;
let moving_sum_filter=toscalar(range x from 1 to user_bins_to_analyze step 1 | extend v=1 | summarize makelist(v));
let min_activity=1;
customEvents
| where timestamp > starttime
| where customDimensions["sourceapp"]=="ai-loganalyticsui-prod"
| where (name == "Checkout")
| where user_AuthenticatedId <> ""
| make-series UserClicks=count() default=0 on timestamp in range(starttime, endtime-1s, interval) by user_AuthenticatedId
// create a new column containing a sliding sum. Passing 'false' as the last parameter to fir() prevents normalization of the calculation by the size of the window.
| extend RollingUserClicks=fir(UserClicks, moving_sum_filter, false)
| project User_AuthenticatedId=user_AuthenticatedId , RollingUserClicksByDay=zip(timestamp, RollingUserClicks)
| mvexpand RollingUserClicksByDay
| extend Timestamp=todatetime(RollingUserClicksByDay[0])
| extend RollingActiveUsersByDay=iff(toint(RollingUserClicksByDay[1]) >= min_activity, 1, 0)
| summarize sum(RollingActiveUsersByDay) by Timestamp
| where Timestamp > starttime + 28d
| render timechart
Background
First, let me know if this is more appropriate for the DBA StackExchange. Happy to move it there.
I've got a dataset, db1_dummy, with ~100 million rows' worth of car and motorcycle insurance claims that I'm prepping for statistical analysis. It's in PostgreSQL v13, which I have running on a local 64-bit Windows machine and access through DataGrip. db1_dummy has ~15 variables, but only 3 are relevant for this question. Here's a toy version of the dataset:
+-------------------+------------+--+
|member_composite_id|service_date|id|
+-------------------+------------+--+
|eof81j4 |2010-01-12 |1 |
|eof81j4 |2010-06-03 |2 |
|eof81j4 |2011-01-06 |3 |
|eof81j4 |2011-05-21 |4 |
|j42roit |2015-11-29 |5 |
|j42roit |2015-11-29 |6 |
|j42roit |2015-11-29 |7 |
|p8ur0fq |2014-01-13 |8 |
|p8ur0fq |2014-01-13 |9 |
|p8ur0fq |2016-04-04 |10|
|vplhbun |2019-08-15 |11|
|vplhbun |2019-08-15 |12|
|vplhbun |2019-08-15 |13|
|akj3vie |2009-03-31 |14|
+-------------------+------------+--+
id is unique (a primary key), and as you can see member_composite_id identifies policyholders and can have multiple entries (an insurance policyholder can have multiple claims). service_date is just the date a policyholder's vehicle was serviced for an insurance claim.
I need to get the data into a certain format in order to run my analyses, all of which are regression-based implementations of survival analysis in R (Cox proportional hazards models with shared frailty, if anyone's interested). Three main things need to happen:
service_date needs to be converted into an integer counted up from 2009-01-01 -- days since January 1st, 2009, in other words. service_date needs to be renamed service_date_2.
A new column, service_date_1, needs to be created, and it needs to contain one of two things for each row: the cell should be 0 if that row is the first for that member_composite_id, or, if it isn't the first, it should contain the value of service_date_2 for that member_composite_id's previous row.
Since the interval (the difference) between service_date_1 and service_date_2 cannot equal zero, a small amount (0.1) should be subtracted from service_date_1 in such cases.
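(For concreteness, the day count in the first point is plain date subtraction; a one-line sanity check against the first toy row:)
-- 2010-01-12 is day 376 counted from 2009-01-01, matching id 1 in the expected output below
SELECT DATE '2010-01-12' - DATE '2009-01-01' AS service_date_2;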
That may sound confusing, so let me just show you. Here's what I need the dataset to look like:
+--+-------------------+--------------+--------------+
|id|member_composite_id|service_date_1|service_date_2|
+--+-------------------+--------------+--------------+
|1 |eof81j4 |0 |376 |
|2 |eof81j4 |376 |518 |
|3 |eof81j4 |518 |735 |
|4 |eof81j4 |735 |870 |
|5 |j42roit |0 |2523 |
|6 |j42roit |2522.9 |2523 |
|7 |j42roit |2522.9 |2523 |
|8 |p8ur0fq |0 |1838 |
|9 |p8ur0fq |1837.9 |1838 |
|10|p8ur0fq |1838 |2650 |
|11|vplhbun |0 |3878 |
|12|vplhbun |3877.9 |3878 |
|13|vplhbun |3877.9 |3878 |
|14|akj3vie |0 |89 |
+--+-------------------+--------------+--------------+
The good news: I have a query that can do this -- indeed, this query spat out the output above. Here's the query:
CREATE TABLE db1_dummy_2 AS
SELECT
d1.id
, d1.member_composite_id
,
CASE
WHEN (COALESCE(MAX(d2.service_date)::TEXT,'') = '') THEN 0
WHEN (MAX(d2.service_date) - '2009-01-01'::DATE = d1.service_date - '2009-01-01'::DATE) THEN d1.service_date - '2009-01-01'::DATE - 0.1
ELSE MAX(d2.service_date) - '2009-01-01'::DATE
END service_date_1
, d1.service_date - '2009-01-01'::DATE service_date_2
FROM db1_dummy d1
LEFT JOIN db1_dummy d2
ON d2.member_composite_id = d1.member_composite_id
AND d2.service_date <= d1.service_date
AND d2.id < d1.id
GROUP BY
d1.id
, d1.member_composite_id
, d1.service_date
ORDER BY
d1.id;
The Problem
The bad news is that while this query runs very speedily on the dummy dataset I've given you all here, it takes interminably long on the "real" dataset of ~100 million rows. I've waited as much as 9.5 hours for this thing to finish working, but have had zero luck.
My question is mainly: is there a faster way to do what I'm asking Postgres to do?
What I've tried
I'm no database genius by any means, so the best I've come up with here is to index the variables being used in the query:
create index index_member_comp_id on db1_dummy(member_composite_id)
And so on like that for id, too. But it doesn't seem to make a dent, time-wise. I'm not sure how to benchmark code in Postgres, but it's a bit of a moot point if I can't get the query to run after 10 hours. I've also thought of trimming some variables in the dataset (ones I won't need for analysis), but that only gets me down from ~15 columns to ~11.
I had outside help with the query above, but they're unsure (for now) about how to approach this issue, too. So I decided to see if the boffins on SO have any ideas. Thanks in advance for your kind help.
EDIT
Per Laurenz's request, here's the output for EXPLAIN on the version of the query I've given you here:
+-------------------------------------------------------------------------------------+
|QUERY PLAN |
+-------------------------------------------------------------------------------------+
|GroupAggregate (cost=2.98..3.72 rows=14 width=76) |
| Group Key: d1.id |
| -> Sort (cost=2.98..3.02 rows=14 width=44) |
| Sort Key: d1.id |
| -> Hash Left Join (cost=1.32..2.72 rows=14 width=44) |
| Hash Cond: (d1.member_composite_id = d2.member_composite_id) |
| Join Filter: ((d2.service_date <= d1.service_date) AND (d2.id < d1.id))|
| -> Seq Scan on db1_dummy d1 (cost=0.00..1.14 rows=14 width=40) |
| -> Hash (cost=1.14..1.14 rows=14 width=40) |
| -> Seq Scan on db1_dummy d2 (cost=0.00..1.14 rows=14 width=40) |
+-------------------------------------------------------------------------------------+
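(EXPLAIN alone only shows estimates; if actual timings would help, EXPLAIN (ANALYZE, BUFFERS) runs the statement and reports real row counts and durations. A minimal sketch restricted to a single policyholder so it comes back quickly; the same prefix can be put in front of the full query once a small run looks sane:)
-- run the statement and report actual times; the filter keeps the sample small
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM db1_dummy
WHERE member_composite_id = 'eof81j4';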
Your query is a real server killer(*). Use the window function lag().
select
id,
member_composite_id,
case service_date_1
when service_date_2 then service_date_1- .1
else service_date_1
end as service_date_1,
service_date_2
from (
select
id,
member_composite_id,
lag(service_date, 1, '2009-01-01') over w - '2009-01-01' as service_date_1,
service_date - '2009-01-01' as service_date_2
from db1_dummy
window w as (partition by member_composite_id order by id)
) main_query
order by id;
Create the index before running the query
create index on db1_dummy(member_composite_id, id)
Read more in the docs:
3.5. Window Functions
9.22. Window Functions
4.2.8. Window Function Calls
(*) The query produces several additional records for each member_composite_id. In the worst case, this is half the Cartesian product. So before the server can group and calculate aggregates, it has to create several hundred million rows. My laptop couldn't stand it; the server ran out of memory on a table with a million rows. Self-joins are always suspicious, especially on large tables.
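A quick way to gauge that fan-out on the real table (a sketch; the self-join materializes roughly n*(n-1)/2 rows for a policyholder with n claims, before the GROUP BY collapses them):
-- approximate number of rows the self-join has to build per member, summed over all members
SELECT sum(n * (n - 1) / 2) AS approx_joined_rows
FROM (
    SELECT count(*) AS n
    FROM db1_dummy
    GROUP BY member_composite_id
) per_member;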
I'm writing a query in Kusto on Azure to get the memory fragmentation value of Redis. This value is obtained by dividing the RSS memory by the memory used. The problem is that I'm not able to do the calculation using these two different fields, because I need to filter on the value of the "Average" field for the "usedmemoryRss" and "usedmemory" metrics. When I put the filter on the extend line, the query returns no value. The code looks like this:
AzureMetrics
| extend m1 = Average | where MetricName == "usedmemoryRss"
| extend m2 = Average | where MetricName == "usedmemory"
| extend teste = m1 / m2
When I remove the "where" clause from the lines, it divides the value of each record by itself and returns 1. Is it possible to do this? Thank you in advance for your help.
Thanks for the answer, Justin. You gave me an idea and I solved it this way:
let m1 = AzureMetrics | where MetricName == "usedmemoryRss" | where Average != 0 | project Average;
let m2 = AzureMetrics | where MetricName == "usedmemory" | where Average != 0 | project Average;
print memory_fragmentation=toscalar(m1) / toscalar(m2)
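If I later need the fragmentation over time rather than a single number, a sketch along the same lines should work (assuming the usual AzureMetrics columns TimeGenerated, MetricName and Average):
AzureMetrics
| where MetricName in ("usedmemoryRss", "usedmemory")
// average each metric per 5-minute bin, then divide
| summarize rss = avgif(Average, MetricName == "usedmemoryRss"),
            used = avgif(Average, MetricName == "usedmemory")
    by bin(TimeGenerated, 5m)
| extend memory_fragmentation = rss / used
| project TimeGenerated, memory_fragmentation
| render timechart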
let Average=datatable (MetricName:string, Value:long)
["usedmemoryRss", 10,
"usedmemory", 5];
let m1=Average
| where MetricName =="usedmemoryRss" | project Value;
let m2=Average
| where MetricName =="usedmemory" | project Value;
print teste=toscalar(m1) / toscalar (m2)
I have a Sessions table:
Sessions
|Timestamp|Name |No|
|12:00:00|Start|1 |
|12:01:00|End |2 |
|12:02:00|Start|3 |
|12:04:00|Start|4 |
|12:04:30|Error|5 |
I need to extract the duration of each session from it using KQL (but if you could suggest how to do it with some other query language, that would also be very helpful). If the next row after a Start is also a Start, it means the session was abandoned and we should ignore it.
Expected result:
|Duration|SessionNo|
|00:01:00| 1 |
|00:00:30| 4 |
You can try something like this:
Sessions
| order by No asc
| extend nextName = next(Name), nextTimestamp = next(Timestamp)
| where Name == "Start" and nextName != "Start"
| project Duration = nextTimestamp - Timestamp, No
When using the order by operator, you get a serialized row set, on which you can then use operators such as next and prev. Basically you are seeking rows with Name == "Start" that are followed by a row that is not another "Start", so this is what I did.
You can find this query running at Kusto Samples open database.
let Sessions = datatable(Timestamp: datetime, Name: string, No: long) [
datetime(12:00:00),"Start",1,
datetime(12:01:00),"End",2,
datetime(12:02:00),"Start",3,
datetime(12:04:00),"Start",4,
datetime(12:04:30),"Error",5
];
Sessions
| order by No asc
| extend Duration = iff(Name != "Start" and prev(Name) == "Start", Timestamp - prev(Timestamp), timespan(null)), SessionNo = prev(No)
| where isnotnull(Duration)
| project Duration, SessionNo
The following query returns the data that I need:
let timeSpn = bin(ago(60m),1m);
requests
| where cloud_RoleName == "myApp"
| where success == "False"
| where timestamp > timeSpn
| make-series count() on timestamp from timeSpn to now() step 1m by application_Version
The problem is that the result consists of 2 rows (one for each application_Version) and not 120 rows (one for each minute for each version).
I have to use make-series and not the simple summarize because I need the "zero" values.
You can do it using the mv-expand operator.
Here's an example from Back-fill Missing Dates With Zeros in a Time Chart:
let start=floor(ago(3d), 1d);
let end=floor(now(), 1d);
let interval=5m;
requests
| where timestamp > start
| make-series counter=count() default=0
on timestamp in range(start, end, interval)
| mvexpand timestamp, counter
| project todatetime(timestamp), toint(counter)
| render timechart
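Applied to the original query (keeping the split by application_Version), the same trick would look roughly like this; note that make-series names the count column count_:
let timeSpn = bin(ago(60m), 1m);
requests
| where cloud_RoleName == "myApp"
| where success == "False"
| where timestamp > timeSpn
| make-series count() default=0 on timestamp from timeSpn to now() step 1m by application_Version
// expand the arrays back into one row per minute per version, zeros included
| mv-expand timestamp to typeof(datetime), count_ to typeof(long)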
I'm sending customEvents to Azure Application Insights that look like this:
timestamp | name | customDimensions
----------------------------------------------------------------------------
2017-06-22T14:10:07.391Z | StatusChange | {"Status":"3000","Id":"49315"}
2017-06-22T14:10:14.699Z | StatusChange | {"Status":"3000","Id":"49315"}
2017-06-22T14:10:15.716Z | StatusChange | {"Status":"2000","Id":"49315"}
2017-06-22T14:10:21.164Z | StatusChange | {"Status":"1000","Id":"41986"}
2017-06-22T14:10:24.994Z | StatusChange | {"Status":"3000","Id":"41986"}
2017-06-22T14:10:25.604Z | StatusChange | {"Status":"2000","Id":"41986"}
2017-06-22T14:10:29.964Z | StatusChange | {"Status":"3000","Id":"54234"}
2017-06-22T14:10:35.192Z | StatusChange | {"Status":"2000","Id":"54234"}
2017-06-22T14:10:35.809Z | StatusChange | {"Status":"3000","Id":"54234"}
2017-06-22T14:10:39.22Z | StatusChange | {"Status":"1000","Id":"74458"}
Assuming that status 3000 is an error status, I'd like to get an alert when a certain percentage of Ids end up in the error status during the past hour.
As far as I know, Insights cannot do this by default, so I would like to try the approach described here to write an Analytics query that could trigger the alert. This is the best I've been able to come up with:
customEvents
| where timestamp > ago(1h)
| extend isError = iff(toint(customDimensions.Status) == 3000, 1, 0)
| summarize failures = sum(isError), successes = sum(1 - isError) by timestamp bin = 1h
| extend ratio = todouble(failures) / todouble(failures+successes)
| extend failure_Percent = ratio * 100
| project iff(failure_Percent < 50, "PASSED", "FAILED")
However, for my alert to work properly, the query should:
Return "PASSED" even if there are no events within the hour (another alert will take care of the absence of events)
Only take into account the final status of each Id within the hour.
As the query is written, if there are no events, it returns neither "PASSED" nor "FAILED".
It also takes into account any records with Status == 3000, which means that the example above would return "FAILED" (5 out of 10 records have Status 3000), while in reality only 1 out of 4 Ids ended up in error state.
Can someone help me figure out the correct query?
(And optional secondary questions: Has anyone setup a similar alert using Insights? Is this a correct approach?)
As mentioned, since you're only querying on a single hour you don't need to bin the timestamp, or use it as part of your aggregation at all.
To answer your questions:
The way to overcome having no data at all would be to inject a synthetic row into your table, which will translate to a success result if no other result is found.
If you want your pass/fail criteria to be based on the final status for each Id, then you need to use argmax in your summarize - it will return the status corresponding to the maximal timestamp.
So to wrap it all up:
customEvents
| where timestamp > ago(1h)
| extend isError = iff(toint(customDimensions.Status) == 3000, 1, 0)
| summarize argmax(timestamp, isError) by tostring(customDimensions.Id)
| summarize failures = sum(max_timestamp_isError), successes = sum(1 - max_timestamp_isError)
| extend ratio = todouble(failures) / todouble(failures+successes)
| extend failure_Percent = ratio * 100
| project Result = iff(failure_Percent < 50, "PASSED", "FAILED"), IsSynthetic = 0
| union (datatable(Result:string, IsSynthetic:long) ["PASSED", 1])
| top 1 by IsSynthetic asc
| project Result
Regarding the bonus question - you can set up alerting based on Analytics queries using Flow. See here for a related question/answer.
I'm presuming that the query returns no rows if you have no data in the hour, because the timestamp bin = 1h (aka bin(timestamp,1h)) doesn't return any bins?
But if you're only querying the last hour, I don't think you need the bin on timestamp at all?
Without having your data it's hard to repro exactly, but... you could try something like (beware syntax errors):
customEvents
| where timestamp > ago(1h)
| extend isError = iff(toint(customDimensions.Status) == 3000, 1, 0)
| summarize totalCount = count(), failures = countif(isError == 1), successes = countif(isError ==0)
| extend ratio = iff(totalCount == 0, 0.0, todouble(failures) / todouble(failures+successes))
| extend failure_Percent = ratio * 100
| project iff(failure_Percent < 50, "PASSED", "FAILED")
Hypothetically, getting rid of the hour binning should just give you back a single row of
totalCount = 0, failures = 0, successes = 0, so the math for failure percent should give you back a 0 failure ratio, which should get you "PASSED".
Without being able to try it, I'm not sure if that works or still returns no row if there's no data?
For your second question, you could use something like:
let maxTimestamp = toscalar(customEvents | where timestamp > ago(1h)
| summarize max(timestamp));
customEvents | where timestamp == maxTimestamp ...
// ... more query here
to get just the row(s) that have the timestamp of the last event in the hour?
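A sketch combining both ideas (the final status per Id via arg_max, plus a defined result for an empty hour), with column names as in the question; untested against live data, so treat it as a starting point:
customEvents
| where timestamp > ago(1h)
| extend Id = tostring(customDimensions.Id), Status = toint(customDimensions.Status)
// keep only each Id's latest event in the hour
| summarize arg_max(timestamp, Status) by Id
| summarize failures = countif(Status == 3000), total = count()
// an empty hour yields total == 0, which is treated as a pass
| extend failure_Percent = iff(total == 0, 0.0, failures * 100.0 / total)
| project Result = iff(failure_Percent < 50, "PASSED", "FAILED")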