How to create a percentile-based metric chart? - stackdriver

My application generates "score" values for a particular use case. These scores generally are anywhere in the range of 0-120, but most cluster in the range of 60-95.
I currently have a stat chart using counts with cardinality, e.g., 0, 1-12, 13-24, 25-36, ... 97-108, and 109+.
I'd like to instead create a percentile chart with time series lines showing percentile scores in increments of 10%: a 10th-percentile line, a 20th-percentile line, and so on up to a 90th-percentile line.
Is that even possible? How do I do that, beginning with recording the stat using OpenCensus Java?

Cloud Monitoring doesn't really have the ability to calculate percentiles at display time the way you're looking for. You can use OpenCensus to write a distribution metric with buckets, and you could then query the bucket boundaries and counts - here's an example:
https://cloud.google.com/solutions/identifying-causes-of-app-latency-with-stackdriver-and-opencensus
Specifically, I'm quoting from the Accuracy section:
Monitoring computes latency percentiles for distribution metrics based on bucket boundaries at numeric intervals. This method is a common method used by Monitoring and OpenCensus, where OpenCensus represents and exports metrics data to Monitoring. The TimeSeries.list method for the Cloud Monitoring API returns the bucket counts and boundaries for your project and metric types. You can retrieve the bucket boundaries in the Cloud Monitoring API BucketOptions object, which you can experiment with in the API Explorer for TimeSeries.list.
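As a concrete starting point, here is a minimal sketch of recording the score as a distribution with OpenCensus Java. The measure and view names are made up, and the bucket boundaries simply mirror the ranges in the question:

import java.util.Arrays;
import java.util.Collections;

import io.opencensus.stats.Aggregation;
import io.opencensus.stats.BucketBoundaries;
import io.opencensus.stats.Measure.MeasureDouble;
import io.opencensus.stats.Stats;
import io.opencensus.stats.StatsRecorder;
import io.opencensus.stats.View;

public class ScoreRecorder {
  // Double-valued measure for the raw score (name/description are hypothetical).
  private static final MeasureDouble SCORE =
      MeasureDouble.create("myapp/score", "Use-case score", "1");

  private static final StatsRecorder RECORDER = Stats.getStatsRecorder();

  public static void registerScoreView() {
    // Explicit bucket boundaries covering the 0-120 score range.
    Aggregation distribution = Aggregation.Distribution.create(
        BucketBoundaries.create(Arrays.asList(
            0.0, 12.0, 24.0, 36.0, 48.0, 60.0, 72.0, 84.0, 96.0, 108.0, 120.0)));

    View view = View.create(
        View.Name.create("myapp/score_distribution"),
        "Distribution of use-case scores",
        SCORE,
        distribution,
        Collections.emptyList()); // no tag keys in this sketch

    Stats.getViewManager().registerView(view);
    // With the Stackdriver stats exporter registered
    // (StackdriverStatsExporter.createAndRegister()), this view is written to
    // Cloud Monitoring as a distribution metric; its percentile estimates are
    // then derived from these bucket boundaries, per the Accuracy quote above.
  }

  public static void recordScore(double score) {
    RECORDER.newMeasureMap().put(SCORE, score).record();
  }
}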

Related

Is there a suitable way to cluster time series where only four values are possible (0, 1, 2, 4) and the length is not fixed?

I am trying to cluster customers consumption behaviors using time series techniques. Customers buy tokens and use them whenever they want (a max of 4 tokens per day).
Here is a sample of what the customer-journey time series look like (x = days after the first order, y = number of tokens consumed per day); see the image below.
I tried clustering with derived variables (median delay between two events, standard deviation of the delays, total number of tokens, time between first and last consumption, mean number of tokens consumed per consumption event, ...). I used K-means, and this gave me some good results, but it wasn't enough to spot all the patterns in the data. I looked at some papers about the use of dynamic time warping in such cases, but I have never used such algorithms.
Are there any materials (demos) on the use of such algorithms to cluster such time series?
Yes.
There are many techniques that can be useful here.
The obvious approach from the literature would be hierarchical agglomerative clustering (HAC) with dynamic time warping (DTW) as the distance.
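If it helps as a starting point, here is a minimal sketch of the DTW distance such a clustering would be built on (plain Java with hypothetical data; the pairwise distances would then feed a hierarchical-clustering linkage):

import java.util.Arrays;

public class Dtw {
  // Classic O(n*m) dynamic time warping distance between two
  // variable-length sequences, using absolute difference as the local cost.
  // Handles series of unequal length, as in the token-consumption case.
  static double dtw(double[] a, double[] b) {
    int n = a.length, m = b.length;
    double[][] d = new double[n + 1][m + 1];
    for (double[] row : d) Arrays.fill(row, Double.POSITIVE_INFINITY);
    d[0][0] = 0.0;
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        double cost = Math.abs(a[i - 1] - b[j - 1]);
        // extend the cheapest of match, insertion, deletion
        d[i][j] = cost + Math.min(d[i - 1][j - 1],
                   Math.min(d[i - 1][j], d[i][j - 1]));
      }
    }
    return d[n][m];
  }

  public static void main(String[] args) {
    // two hypothetical daily token-consumption series of unequal length
    double[] c1 = {0, 2, 4, 0, 1};
    double[] c2 = {0, 0, 2, 4, 1, 0};
    System.out.println(dtw(c1, c2));
  }
}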

Setting minimum sample size for multiple sub-populations based on smallest sub-population

So I have one population of users, and this population is split into sub-populations based on date of birth. There are about 20 different buckets of users that fall into the desired age groups.
The question is to see how the different buckets interact with a system over time.
Each bucket has a different size: the biggest bucket has about 20,000 users (at the midpoint of the distribution), while both tail ends have fewer than 200 users each.
To answer the question of system usage over time, I have cleaned the data and am taking a sample of 0.9 of the smallest sub-population from each of the buckets.
Then I re-sample with replacement N times (N can be between 100 and 10,000 or whatnot); the average of these re-samples closely approaches the sub-population mean of each bucket. What I find is that, over time (1, 2, 3, 4, 5, 6 months), for most interaction metrics the tail-end bucket with the lowest number of users is the most active. (This could suggest that the higher-membership buckets contain a large proportion of users who are not active, or that their active users are simply less active than those in other buckets.)
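To make the re-sampling step concrete, here is a minimal sketch of it in plain Java with synthetic data (purely illustrative; the data and sizes are made up):

import java.util.Random;
import java.util.stream.DoubleStream;

public class Bootstrap {
  // One bootstrap re-sample (with replacement) of size k; returns its mean.
  static double resampleMean(double[] bucket, int k, Random rng) {
    double sum = 0;
    for (int i = 0; i < k; i++) {
      sum += bucket[rng.nextInt(bucket.length)];
    }
    return sum / k;
  }

  public static void main(String[] args) {
    Random rng = new Random(42);
    // synthetic per-user usage counts for one bucket
    double[] bucket = rng.doubles(20_000, 0, 30).toArray();
    int k = (int) (0.9 * 200); // 90% of the smallest bucket (~200 users)
    int n = 10_000;            // number of re-samples
    double avgOfMeans = DoubleStream.generate(() -> resampleMean(bucket, k, rng))
        .limit(n)
        .average()
        .orElse(Double.NaN);
    // the average of the re-sample means approaches the bucket mean as n grows
    System.out.println(avgOfMeans);
  }
}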
I took a quick summary of each of the buckets to make sure that there are no irregularities, and the data indeed shows that the smallest bucket has higher quartiles, mean, and lowest and highest data values compared to the other buckets.
I also went over the data-collection methodology to make sure there are no errors in obtaining the data, and looking through various data points does support the result of graphing the re-sampled values.
My question is: should I set the sample size for each individual bucket independently? My gut tells me no, as all the buckets belong to the same population; if I sample from the buckets, each sample has to be fair and should thus use the same number of data points as the smallest bucket.
There is no modelling involved; this is just looking at the average usage of each user bucket per month.
Is my approach more or less on the right track?

How does Graphite handle oversamples

I am trying to understand how Graphite treats oversampling. I read the documentation but could not find the answer.
For example, if I specify in Graphite that the retention policy should be 1 sample per 60 seconds and Graphite receives something like 200 values in those 60 seconds, what exactly will be stored? Will Graphite take an average, or a random point out of those 200?
Short answer: it depends on the configuration; the default is to take the last one.
Long answer: Graphite can be configured, using regular expressions, with a strategy for aggregating several points into one sample.
These strategies are configured in the storage-aggregation.conf file, using a regexp to select metrics:
[all_min]
pattern = \.min$
aggregationMethod = min
This example configuration will aggregate matching points using their minimum.
Within a single retention period, the last point to arrive wins by default.
The configured strategy is used whenever points are aggregated from a higher resolution down to a lower resolution.
For example, if storage-schemas.conf contains:
[all]
pattern = .*
retentions = 1s:8d,1h:1y
With the sum aggregation method configured, points are kept at one-second resolution for 8 days; once they are older than 8 days, they are summed down to one-hour resolution. For instance, under the retention above, the 3,600 one-second points in an hour become a single one-hour point whose value is their sum.
Note that the aggregation configuration only applies when moving from archive i to archive i+1; for oversampling within a single period, Graphite always keeps the last sample.
The recommendation is therefore to match your sampling rate to the configuration (e.g., send at most one point per 60-second period if your finest retention is 60s).
See the related Graphite issue.

Transform for graphite counter

I'm using the incr function from the Python statsd client. The key I'm sending as the name is registered in Graphite, but it shows up as a flat line on the graph. What filters or transforms do I need to apply to get the rate of the increments over time? I've tried an apply function > transform > integral and an apply function > special > aggregate by sum, but no success yet.
The function you want is summarize - see it over here: http://graphite.readthedocs.org/en/latest/functions.html
To get the totals over time, just use the summarize function with alignToFrom=true.
For example, you can use the following target for a one-day period:
summarize(stats_counts.your.metrics.path,"1d","sum",true)
See graphite summarize datapoints for more details.
The data is there; it just needs hundreds of counts before you start to be able to see it on the graph. Taking the integral also works and shows the cumulative number of hits over time, though I had to multiply it by 100 to get approximately the correct value.

Graphite does not graph values correctly when using long durations?

I'm trying to graph data using statsd and graphite. I have a simple counter, I increment it by 1, and then when I graph the values for the counter over the day, I see strange values like 0.09 as the peak in my graph (see http://i.stack.imgur.com/o4gmz.png)
This graph should be showing 2 logins, but instead it's showing 0.09. If I change the time scale from 1 day to the last 15 minutes, then it correctly shows the two logins (see http://i.stack.imgur.com/23vDJ.png)
I've set up my finest retention to be in 10s increments in storage-schemas.conf:
retentions = 10s:7d,1m:21d,24h:5y
I've set up my storage-aggregation.conf file to sum counts:
[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum
(And, before you ask: yes, this is a .count.)
If I try my URL with &rawData=true then in either case I see some Nones, some 0.0s, and a pair of 1.0s separated by some 0.0s. I never see these fractional values that somehow show up on the graph. So... Is this a bug? Am I doing something wrong?
There's also the consolidateBy function, which tells Graphite what to do when there are not enough pixels to draw every datapoint accurately. By default it uses the 'average' function, hence the strange results at longer time ranges. Here is an excerpt from the documentation:
When a graph is drawn where the width of the graph in pixels is smaller than the number of datapoints to be graphed, Graphite consolidates the values to prevent line overlap. The consolidateBy() function changes the consolidation function from the default of ‘average’ to one of ‘sum’, ‘max’, or ‘min’. This is especially useful in sales graphs, where fractional values make no sense and a ‘sum’ of consolidated values is appropriate.
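For example, forcing sum consolidation on a counter series (metric path hypothetical):
consolidateBy(stats_counts.logins.count, 'sum')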
Another function that could be useful is hitcount. Here is a short excerpt from the documentation on why it's useful:
This function is like summarize(), except that it compensates automatically for different time scales (so that a similar graph results from using either fine-grained or coarse-grained records) and handles rarely-occurring events gracefully.
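For example, counting hits per day regardless of the underlying resolution (metric path hypothetical):
hitcount(stats_counts.logins.count, '1day')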
I spent some time scratching my head over why I was getting fractions for my counter at time ranges longer than a couple of hours when my aggregation rule is max. It's pretty confusing, especially at the beginning when you're playing with single counters to see if everything works. Checking rawData is quite a good sanity check for debugging ;)
