How do you sum a statsd counter over a large time range with correct values? - graphite

Background
A basic use case for statsd & grafana is learning how many times a function has been called over a time range-- whether that is "last 6h", "since beginning of today", "since beginning of time", etc.
What I'm struggling to find is the correct function to achieve this. I'm using a hosted solution; however, I can confirm that data is being flushed from StatsD to Graphite in 10s intervals.
Current Setup
StatsD Flush: 10s
Graph Function: hitcount(counters.login.employer.count, "10seconds")
Time Range: 24h
Problem
When using hitcount(counters.login.employer.count, "10seconds"), the data returned is incorrect. In fact, I can do 24h, 23h, 22h, and note the values are actually increasing.
I've performed all testing here in a controlled environment, only my machine is sending metrics to StatsD. This is not yet in production code.
Any idea what could be going on here?

The way counters work is that on each interval the value of the counter is sent to graphite and reset in statsd, so what you're looking for is the sum of the series.
You can do that using consolidateBy('sum') combined with maxDataPoints=1.
Be aware that if your series is being aggregated in graphite you'll need to make sure that the aggregation is by sum, otherwise when values get rolled up from the individual values reported by statsd into aggregated buckets they'll be averaged, and your sum won't work across longer intervals. You can read more about configuring aggregation in Graphite here.

Related

Build a cumulative variable/aggregate from raw data

I got a power consumption sensor (kWh) sending data to my TSI Gen2 environment, and it is malfunctioning in a way that it is losing its accumulated measuremente value when it is shut down. I need to create a new aggregate/variable that would "stack" the measurements , never letting it drop to zero, but always adding to the last greatest value.
I thought about creating a dataset with values from differences from right to left over a fixed timespan, if positive, and then I could create a SUM aggregation over the bucket period on top of it. I am clueless on how to do such thing based on the poor official documentation provided by Microsoft. Any Ideas?
Here are a couple of pictures illustrating my problem and What I am trying to accomplish:
You probably need to add something in the middle (before the IoT Hub/Event Hub) to save the last state of the sensor, and do the appropriate sum if if detects the device was rebooted.

Averaging when events are slower than measurement time

I'm trying to come up with a better way of providing an instantaneous average when input signal is very slow. This seems like a math-y kinda question so if it should be over there let me know.
I have events that are measured as a pulse. Normally I can collect the pulses in a counter and then read the counter value at a fixed interval, say 1/4 second. I can then take the count value and divide by the number of seconds so n/0.25 and get a rate. I then apply a low pass filter to clean up the average and that works great normally.
What do I do when the events happen once every 1-60 seconds? The obvious choice is to wait until I have a sufficient number of counts and divide by total time. However, I need to provide the user with a reading every few seconds so waiting is not an option. I need some way to estimate the value.
I've thought of one solution that's kinda hard to explain. I was wondering if there was a standard way of doing this. I'm pretty sure I have to utilize a different kind of "data," the lack of an event. The goal is to estimate until enough time/events have passed to really calculate a rate and to transition from estimate to real rate seamlessly.

How can I determine the amount of time a rvest query is for the http response

I have been using the rvest package to perform screen scraping for some data analytics, but there are some queries that are taking a few seconds each to actually collect the data. e.g.
sectorurl = paste("http://finance.yahoo.com/q/pr?s=,",ticker,"+Profile", sep= "")
index <- read_html( sectorurl)
The second step is that one that is taking the time, so I was wondering if there were any diagnostics in the background of R or a clever package that could be run that will determine "network wait time" as opposed to CPU time, or something similar.
I would like to know if I'm stuck with the performance I have, or if actually my R code is performing well and it is http response that is limiting my process speed.
I don't think you will be able to separate the REST call from the client side code. However, my experience with accessing web services is that the network time generally dominates the total running time, with the "CPU" time being an order of magnitude, or more, less.
One option for you to try would be to paste your URL, which appears to be a GET, into a web browser and see how long it takes to complete from the console. You can compare this time against the total time taken in R for the same call. For this, try using system.time, which returns the CPU time used by a given expression.
require(stats)
system.time(read_html(sectorurl))
Check out the documentation for more information.

Graphite, datapoints disappear if I choose a wider time range

If I ask for this data:
https://graphite.it.daliaresearch.com/render?from=-2hours&until=now&target=my.key&format=json
I get, among other datapoints, this one:
[
2867588,
1398790800
]
If I ask for this data:
https://graphite.it.daliaresearch.com/render?from=-10hours&until=now&target=my.key&format=json
The datapoint looks like this:
[
null,
1398790800
]
Why this datapoint is being nullified when I choose a wider time range?
Update
I'm seeing that for a chosen date range smaller than 7 hours the resolution of the datapoints are every 10 seconds and when the date range chosen is 7 hours or bigger the the resolution goes to one datapoint every 1 minute.. and continue this diretion as the date range chosen is getting bigger to one datapoint every 10 minutes and so.
So when the resolution of the datapoints is every 10 seconds the data is there, when the resolution is every 1 minute or more, then the datapoint has not the value :/
I'm sending a data point every 1 hour, maybe it is a conflict with the resolutions configuration and me sending only one datapoint per hour
There are several things happening here, but basically the problem is that you have misconfigured graphite (or at least, configured it in a way that makes it do things that you aren't expecting!)
Specifically, you should set xFilesFactor = 0.0 in your storage-aggregation.conf file. Since you are new at this, you probably just want this (mine is in /opt/graphite/conf/storage-aggregation.conf):
[default]
pattern = .*
xFilesFactor = 0.0
aggregationMethod = average
The graphite docs describe xFilesFactor like this:
xFilesFactor should be a floating point number between 0 and 1, and specifies what fraction of the previous retention level’s slots must have non-null values in order to aggregate to a non-null value. The default is 0.5.
But wait! This wont change existing statistics! These aggregation settings are set once per metric at the time the metric is created. Since you are new at this, the easy way out is to just go to your whisper directory and delete the prior data and start over:
cd /opt/graphite/storage/whisper/my/
rm key.wsp
your root whisper directory may be different depending on platform, etc. After removing the data files graphite should recreate them automatically upon the next metric write and they should get your updated settings (dont forget to restart carbon-cache after changing your storage-aggregation settings).
Alternatively, if you need to keep your old data you will need to run whisper-resize.py against your whisper (.wsp) data files with --xFilesFactor=0.0 and also likely all of your retention settings from storage-schemas.conf (also viewable with whisper-info.py)
Finally, I should add that the reason you get non-null data in your first query, but null data in your second is because graphite will try to pick the best available retention period from which to serve your request based on the time window you requested. For the smaller window, graphite is deciding that it can serve your request using the highest precision data (i.e., non aggregated) and so you are seeing your raw metrics. For the longer time window, graphite is finding that the high precision, non-aggregated data is not available for the entire window -- these periods are configured in storage-schemas.conf -- so it skips to the next highest-precision data set available (i.e. first aggregation tier) and returns only aggregated data. Because your aggregation config is writing null data, you are therefore seeing null metrics! So fix the aggregation, and you should fix the null data problem. But remember that graphite never combines aggregation tiers in a single request/response, so anytime you see differences between results from the same query when all you are changing is the from / to params, the problem is pretty much always due to aggregation configs.
I'm not quite sure about your specific situation, but I think I can give you some general pointers.
First off, you are right about the changing resolution depending on the time range. This is configured in storage-schemas.conf and is done to save space when storing data over large periods of time. An example could be: 15s:7d,1m:21d,15m:5y, meaning 15 seconds resolution for 7 days, then 1 minute resolution for 21 days, then 15min for 5 years.
Then there is the way Graphite does the actual aggregation from one resolution to the other. This is configured in: storage-aggregation.conf. The default settings are: xFilesFactor=0.5 and aggregationMethod=average. The xFilesFactor setting is saying that a minimum of 50% of the slots in the previous retention level must have values for next retention level to contain an aggregate. The aggregationMethod is saying that all the values of the slots in the previous retention level will be combined by averaging. My guess is that your stat doesn't have enough data points to fulfill the 50% requirement, resulting in a null value.
For more information, check out the docs, they are pretty complete: http://graphite.readthedocs.org/en/latest/config-carbon.html

Graphite: show change from previous value

I am sending Graphite the time spent in Garbage Collection (getting this from jvm via jmx). This is a counter that increases. Is their a way to have Graphite graph the change every minute so I can see a graph that shows time spent in GC by minute?
You should be able to turn the counter into a hit-rate with the Derivative function, then use the summarize function to the counter into the time frame that your after.
&target=summarize(derivative(java.gc_time), "1min") # time spent per minute
derivative(seriesList)
This is the opposite of the integral function. This is useful for taking a
running totalmetric and showing how many requests per minute were handled.
&target=derivative(company.server.application01.ifconfig.TXPackets)
Each time you run ifconfig, the RX and TXPackets are higher (assuming there is network traffic.)
By applying the derivative function, you can get an idea of the packets per minute sent or received, even though you’re only recording the total.
summarize(seriesList, intervalString, func='sum', alignToFrom=False)
Summarize the data into interval buckets of a certain size.
By default, the contents of each interval bucket are summed together.
This is useful for counters where each increment represents a discrete event and
retrieving a “per X” value requires summing all the events in that interval.
Source: http://graphite.readthedocs.org/en/0.9.10/functions.html

Resources