Remove duplicate custom metric events from application insights before plotting in Azure portal - azure-application-insights

I'm logging some custom metrics in Application insights using the TelemetryClient.TrackMetric method in .NET, and I've noticed that occasionally some of the events are duplicated when I view them in the Azure portal.
I've drilled into the data, and the duplicate events have the same itemId and timestamp, but if I show the ingestion time by adding | extend ingestionTime = ingestion_time() to the query then I can see that the ingestion times are different.
This GitHub issue indicates that this behavior is expected, as AI uses at-least-once delivery.
I plot these metrics in charts in the Azure portal using a sum aggregation, however these duplicates are creating trust issues with the charts as the duplicates are simply treated as two separate events.
Is there a way to de-dupe the events based on itemId before plotting the data in the Azure portal?
Update
A more specific example:
I'm running an algorithm, triggered by an event, which results in a reward. The algorithm may be triggered several dozen times a day, and the reward is a positive or negative floating point value. It logs the reward each time to Application Insights as a custom metric (called say custom-reward), along with some additional properties for data splitting.
In the Azure portal I'm creating a simple chart by going to Application Insights -> Metrics and customising the chart. I select my custom-reward metric in the Metric dropdown, and select Sum as the aggregation. I may or may not apply splitting. I save the chart to my dashboard.
This simple chart gives me a nice way of monitoring the system to make sure nothing unexpected is happening, and the Sum value in the bottom left of the chart allows me to quickly see whether the sum of the rewards is positive or negative over the chart's range, and by how much.
However, on occasion I've been surprised by the result (say over the last 12 hours the sum of the rewards was surprisingly negative), and on closer inspection I discovered that a few large negative results have been duplicated. Further investigation shows this has been happening with other events, but with smaller results I tend not to notice.
I'm not that familiar with the advanced querying bit of Application Insights, I actually just used it for the first time today to dig into the events. But it does sound like there might be something I can do there to create a query that I can then plot, with the results deduped?
Update 2
I've managed to make progress with this thanks to the tips by #JohnGardner, so I'll mark that as the answer. I've deduped and plotted the results by adding the following line to the query:
| summarize timestamp=any(timestamp), value=any(value), name=any(name), customDimensions=any(customDimensions) by itemId
Update 3
Adding the following line to the query allowed me to split on custom data (in this case splitting by algorithm ID):
| extend algorithmId = tostring(customDimensions.["algorithm-id"])
With that line added, when you select "Chart" in the query results, algorithmId now shows up as an option in the split dropdown. After that you can click "Pin to dashboard". You lose the handy "sum over the time period" indicator in the bottom left of the chart which you get via the simple "Metrics" chart, however I'm sure I'll be able to recreate that in other ways.

if you are doing your own queries, you would generally be using something like summarize or makeseries to do this deduping for a chart. you wouldn't generally plot individual items unless you are looking at a very small time range?
so instead of something like
summarize count() ...
you could do
summarize dcount(itemId) ...
or you might add a "fake" summarize to a query that didn't need it before with by itemId to coalesce multiple rows into just one, using any(x) to grab any individual row's value for each column for each itemId.
but it really depends on what you are doing in your specific query. if you were using something like sum(itemCount) to also deal with sampling, you have other odd cases now, where the at-least-once delivery might have duplicated sampled items? (updating your question to add a specific query and hypothetical result would possibly lead to a more specific answer).

Related

Average scroll rate in Google Studio

I want to calculate and display the average scroll depth in Data Studio from analytics.
I’m looking to get an average scroll depth in Studio. I’ve got the 10%,25%, etc scroll depth data coming in, but I now need to be able to calculate the average scroll % from this data.
To calculate the average scroll depth:
multiply the scrolled threshold by the number of events (10x500) + (20x400) + (30x475) +(40x300) + (50x200) + (60x100) +(70x75) +(80x60) + (90x20) + (100x10)
Then, take that total divided by the total number of events. 500 + 400 + 475... etc
Because I can’t reference cells in Studio I can’t get it to work. I’ve also tried Google Sheets, which does work to do the calculation, but then I can’t use Data Studios filter to provide a specific page path?
I'm thinking that perhaps the calculation will need to be done at data source, but I am not sure how to reference a 'cell'?
Data Studio doesn't work based on a concept of "cells", it works based on a concept of "fields"—which are basically properties of the data source. Similarly, you don't have "formulas" per se, but rather "calculated fields". These fields can be created either at the chart-level (single-use, but doesn't require permissions to modify the data source) or in the data source (reusable across many charts, requires permissions to modify the data source). Most fields also have an aggregation type, which tells the report how to aggregate it in charts by default (e.g. Sum or Average).
When you either edit your data source and hit "Add Field" or the option with the same name under the "Add metric" or "Add dimension" menus on a chart, you'll be presented with a box to input the formula. To access a field, just type its name (of if you're in the data source, select it from the list on the left). The editor will also typically give you an auto-complete list below your cursor based on what you're typing. Once your entry matches a field, it will get a highlight box around it (the color is based on the type; green = dimension/string,blue = metric/number). The functions available are sort of a mash-up of something between what you'd expect in Google Sheets and in a SQL query, but with more constraints on when you're allowed to use certain functions.
The documentation for calculated fields is pretty simple, so I'd recommend starting there before you try to do too much heavy-lifting in Data Studio. Because of constraints in Data Studio's data model, you'll often find that you need to create separate calculated fields for different parts of the formula, and then combine them in a new calculated field. I'll warn you that the error messages in the field editor aren't super helpful sometimes, so you may need to re-read the documentation for the functions and field types you're working with to ensure you get a valid result.
If you're running into problems, including the field names and values that you need in your calculation may help, including the source of the data (are these GA events?). The more details you give, including what you've already tried, the more helpful we can be. Also, make sure to read the docs first to make sure you have a good handle on the product you're using and the terminology the community is most likely to understand.

Unique Users in Google Analytics

I'm trying to get all unique visitors for a selected time period, but I want to filter them by date on the server. However, the sum of unique visitors for each day isn't the number of unique visitors for the time period.
For example:
Monday: 2 unique visitors
Tuesday: 3 unique visitors
The unique visitors for the two days period isn't necessarily 5.
Is there a way to get the results I want using the Google Analytics API (v3)?
You're right that Users aren't additive, so you can't simply add them day by day. There are several ways around this.
The fist and most obvious is that if you've implemented the User-ID you should be able to straight up pull and interrogate the data about which users saw your site on which days.
Another way I've implemented before is to dynamically pull the number of Users from the Google Analytics API whenever you need it. Obviously this only works if you're populating a live web dashboard or similar, but since it's just the one figure you're asking for, it wouldn't slow down the load time by much. Eg. if you're using a dashboarding tool such as Klipfolio, you may be able to define a dynamic data source, and query Google whenever you needthe figure (https://support.klipfolio.com/hc/en-us/articles/216183237-BETA-Working-with-dynamic-data-sources)
You could also limit the number of ways that the data can be interrogated, and calculate all of them. For example, if you only allow users to look at data month-by-month or day-by-day, then you only need those figures.
Finally, you can estimate the figure with reasonable accuracy by splitting it into two parts. New Users are equal to New Sessions (you're only new on your first Session), which is additive, so that figure can be separated out and combined as required.
Then, you could take a rough ratio of new to returning Users (% New Users) from, say, 1 year of data, and use that with the New Users figure to generate an average on any level.

Discrepancy in Google Analytics data when using segments

I'm having a tough time with Google Analytics, trying to understand why the value of metrics changes when segments are applied.
There is a standard audience overview report, which is based on 100% of sessions (no sampling) and the view is not filtered. The period is March of 2017.
Standard "All visitors" segment looks like this:
Then, there is another built-in segment called "Bounced Sessions". When I apply this segment, the "All visitors" values changes:
Amount of users increases, but the count of pageviews decreases.
Any ideas how to explain this?.. Thank you in advance!
Oki, there can be, multiple reasons. Let me explain first how these numbers are calculated, then we move on to your query.
There two types of data gathering and manipulation from google.
Pre-calculated data -- pre-aggregated tables
These are the precalculated data that Google uses to speed up the UI. Google does not specify when this is done but it can be at any point of the time. These are known as pre-aggregated tables
Data calculated on the fly
Some that you do which result in computation or manipulation falls under this category. Like using segments, creating custom reports etc.
Coming to your problem. When you apply segment, every metric that it effects will be calculated again. Thus it may result in numbers greater than you see in normal view.
Standard audience overview report is pre-aggregated at some point of the day. When you apply segment, the results will be calculated with the fresh data. Since latter is the latest, it will automatically give you increased number of the metrics. Even you can see a decrease as well, all depends on your data and user behavior.
Resolution: If you are a premium user, use Big Query. You must rely on big query for every metric as they are fresh and calculated on the fly

Single-stat percentage change from initial value in graphite/grafana?

Is there a way to simply show the change of a value over the selected time period? All I'm interested in is the offset of the last value compared to the initial one. The values can vary above and below these over the time period, it's not really relevant (and would be exceptions in my case).
For an initial value of 100 and an final value of 105, I'd expect a single stat box displaying 5%.
I have the feeling I'm missing something obvious obvious, but can't find a method to display this deceptively simple task.
Edit:
I'm trying to create a scripted Grafana dashboard that will automatically populate disk consumption growth for all our various volumes. The data is already in Graphite, but for purposes of capacity management and finance planning (which projects/departments gets billed) it would be helpful for managers to have a simple and coarse overview of which volumes grow outside expected parameters.
The idea was to create a list of single-stat values with color coding that could easily be scrolled through to find abnormalities. Disk usage would obviously never be negative, but volatility in usage between the start and end of the time period would be lost in this view. That's not a big concern for us as this is all shared storage and such usage is expected to a certain degree.
The perfect solution would be to have the calculations change dynamically based on the selected time period.
I'm thinking that this is not really possible (at least not easily) to do with just Graphite and Grafana and have started looking for alternative methods. We might have to implement a different reporting system for this purpose.
Edit 2
I've tried implementing the suggested solution from Leonid, and it works after a fashion. The calculations seems somewhat off from what I expected though.
My test dashboard looks like follows:
If I were to calculate the change manually, I'd end up with roughly 24% change between the start (7,23) and end (8.96) value. Graphite calculates this to 19%. It's probably a reason for the discrepancy, probably something to do with it being a time-series and not discreet values?
As a sidenote: The example is only 30 days, even though the most interesting number would be a year. We don't have quite a year of data in Graphite yet and having a 30 day view is also interesting. It seems I have to implement several dashboards with static times.
You certainly can do that for some fixed period. For example following query take absolute difference betweent current metric value and value that metric has one minute ago (i.e. initial value) and then calculate it's percentage of inital value.
asPercent(absolute(diffSeries(my_metric, timeShift(my_metric, '1m'))), timeShift(my_metric, '1m'))
I believe you can't do that for time period selected in Grafana picker.
But is that really what you need? It's seems strange because as you said value can change in both directions. Maybe standard deviation would be more suitable for you? It's available in Graphite as stdev function.

Charting a count of records returned

I am trying to create some charts in Flex/FlashBuilder 4.5
The issue I have it that the information I wish to display is the number of events within an area. I am using HTTP service to access a rails controller which is returning an XML list of the events.
I need to figure out how to chart the number of events, eg number of records returned. There is no numeric value within chart, as there would be if I was charting sales or prices for example.
I'm a little stumped as to the best way to do this?
Use a dataFunction. Check out http://flexdiary.blogspot.com/2008/08/charting-example.html for an example that uses one.

Resources