graphite: how to get per-second metrics from batch metrics? - graphite

I'm trying to measure a online mini-batch processing system with a per-second metrics (total query per second). For every batch, a metric (e.g. "stats.gauges.<host>.query.count") will be send to graphite. batches are processed in several different hosts in parallel and a batch of data take about 5 seconds to process.
I've tried:
simply sum series: sumSeries(stats.gauges.*.query.count),
the result metrics is many times greater than the actual value;
scale to 1 second:
scaleToSeconds(sumSeries(stats.gauges.*.query.count),
1), the result metrics is much less than the actual value;
integral then derivative: nonNegativeDerivative(sumSeries(integral(stats.gauges.*.query.count))), same as the first case ...
send gauges with
delta=True param, then derivative. the result is about 20% greater
than the actual value
so, how to get per-second metrics from batch metrics? what is the best practice?

You should use carbon-aggregator service to add several metrics together as they come in. There is an example which fits your case at http://graphite.readthedocs.io/en/latest/config-carbon.html#aggregation-rules-conf
As your batch takes 5 secs to process, frequency should be 5 to buffer all the metrics. After five seconds, aggregator will sum them up and write to graphite.

Related

DynamoDB ConsumedWriteCapacityUnits vs Consumed CloudWatch Metrics for 1 second Period

I am confused by this live chart, ConsumedWriteCapacityUnits is exceeding the provisioned units, while "consumed" is way below. Do I have a real problem or not?
This seems to only show for the
One Second Period
One Minute Period
Your period is wrong for the metrics. DynamoDB emits metrics at the following periods:
ConsumedCapacity: 1 min
ProvisionedCapacity: 5 min
For ConsumedCapacity you should divide the metric by the period but only at a minimum of 1min.
Exceeding provisioned capacity for short periods of time is fine, as burst capacity will allow you to do so. But if you exceed it for long periods it will lead to throttling.

AWS DynamoDB matrix doesn't work correctly

I am using dynamoDB and curious of this
When I see table matrix with period of 10, 30 seconds, it seems that it exceeds provisioned value
Then, when I see table matrix with period of 1 minutes, it doesn't reach provisioned value at all
I want to know why this happens.
This isn't a DynamoDB thing. It's a CloudWatch rendering thing.
Consumed capacity has a natural period of 1 minute. When you set your graph period to 1 minute everything is correct. Consumed is below provisioned.
If you change the graph period to 30 seconds, your consumed view adjusts and you see consumption that's double what's real. The math behind the scenes divides by the wrong period. Graph period of 10 seconds, you get 6x reality. Graph period of 5 minutes, you get 1/5th reality.
The Provisioned line isn't based on an equation involving Period so it's not affected by the chosen period.
Maybe someone can comment on why the user is allowed to control the Period of the view but it just messes things up when it doesn't match the natural period of the data.

How to find overall CPU usage in a multi-tenant environment?

I have a single server/multiple workers architecture to run distributed queries. Once a query finishes, it reports back to the server the total time taken for completion and I maintain a running counter of the total CPU time (totalCpuTime) for completion of ALL the queries. With this count I want to expose the overall CPU Usage of the cluster on the server.
I was thinking of polling for the totalCpuTime every 3 minutes (sampling rate). Lets say I have two data-points d1 and d2 which are sampling_rate = 1800s apart. I did cpuUsage = (d2 - d1)/total_number_of_workers * sampling_rate but this is giving me a number greater than 100 and I don't know how to make sense of it.
Any ideas or other approaches?

Graphite: show change from previous value

I am sending Graphite the time spent in Garbage Collection (getting this from jvm via jmx). This is a counter that increases. Is their a way to have Graphite graph the change every minute so I can see a graph that shows time spent in GC by minute?
You should be able to turn the counter into a hit-rate with the Derivative function, then use the summarize function to the counter into the time frame that your after.
&target=summarize(derivative(java.gc_time), "1min") # time spent per minute
derivative(seriesList)
This is the opposite of the integral function. This is useful for taking a
running totalmetric and showing how many requests per minute were handled.
&target=derivative(company.server.application01.ifconfig.TXPackets)
Each time you run ifconfig, the RX and TXPackets are higher (assuming there is network traffic.)
By applying the derivative function, you can get an idea of the packets per minute sent or received, even though you’re only recording the total.
summarize(seriesList, intervalString, func='sum', alignToFrom=False)
Summarize the data into interval buckets of a certain size.
By default, the contents of each interval bucket are summed together.
This is useful for counters where each increment represents a discrete event and
retrieving a “per X” value requires summing all the events in that interval.
Source: http://graphite.readthedocs.org/en/0.9.10/functions.html

How to intrepret values of the ASP.NET Requests/sec performance counter?

If I monitor ASP.NET Requests/sec performance counter every N seconds, how should I interpret the values? Is it number of requests processed during sample interval divided by N? or is it current requests/sec regardless the sample interval?
It is dependent upon your N number of seconds value. Which coincidentally is also why the performance counter will always read 0 on its first nextValue() after being initialized. It makes all of its calculations relatively from the last point at which you called nextValue().
You may also want to beware of calling your counters nextValue() function at intervals less than a second as it can tend to produce some very outlandish outlier results. For my uses, calling the function every 5 seconds provides a good balance between up to date information, and smooth average.

Resources