Graphite does not graph values correctly when using long durations? - graphite

I'm trying to graph data using statsd and graphite. I have a simple counter, I increment it by 1, and then when I graph the values for the counter over the day, I see strange values like 0.09 as the peak in my graph (see http://i.stack.imgur.com/o4gmz.png)
This graph should be showing 2 logins, but instead it's showing 0.09. If I change the time scale from 1 day to the last 15 minutes, then it correctly shows the two logins (see http://i.stack.imgur.com/23vDJ.png)
I've set up my finest retention to be in 10s increments in storage-schemas.conf:
retentions = 10s:7d,1m:21d,24h:5y
I've set up my storage-aggregation.conf file to sum counts:
[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum
(And, before you ask, yes; this is a .count).
If I try my URL with &rawData=true then in either case I see some Nones, some 0.0s, and a pair of 1.0s separated by some 0.0s. I never see these fractional values that somehow show up on the graph. So... Is this a bug? Am I doing something wrong?

There's also consolidateBy function which tells graphite what to do if there's no enough pixels to draw everything accurately. By default it's using "avg" function and therefore strange results when time ranges are greater. Here excerpt from documentation:
When a graph is drawn where width of the graph size in pixels is
smaller than the number of datapoints to be graphed, Graphite
consolidates the values to to prevent line overlap. The
consolidateBy() function changes the consolidation function from the
default of ‘average’ to one of ‘sum’, ‘max’, or ‘min’. This is
especially useful in sales graphs, where fractional values make no
sense and a ‘sum’ of consolidated values is appropriate.
Another function that could be useful is hitcount. Short excerpt from here why it's useful:
This function is like summarize(), except that it compensates
automatically for different time scales (so that a similar graph
results from using either fine-grained or coarse-grained records) and
handles rarely-occurring events gracefully.
I spent some time scratching my head why I get fractions for my counter with time ranges longer than couple hours when my aggregation rule is max. It's pretty confusing, especially at the beginning when you play with single counters to see if everything works. Checking rawData is quite a good way for debugging sanity check ;)

Related

How to stop munin from calculating the values in millis?

I wrote my own munin-plugin and I'm confused, how munin represents values between 0 and 1.
The 2nd line of values uses an notation with 'n'. What does it mean and how can I avoid it? I just want a common floating point value like 0.33!
What I'm doing wrong?
Here is the plugin configuration:
graph_title Title
graph_args --base 1000 -l 0
graph_vlabel label
graph_category backup
Every hint is welcome.
UPDATE
Ok, I finally found it: https://serverfault.com/questions/123761/what-does-the-m-unit-in-munin-mean
'm' stands for milli!
I am a bit confused, why munin is using it in this context!
Is there any possibility to avoid this?
Some configuration items that may be of interest. See Munin Plugin Reference.
You may change the resolution in numeric resolution in the table data with the below. Use C/C++ formatting, and spacing may be an issue in end result.
graph_printf %5.2le # default is %6.2lf
# appears l (for long) is needed
For certain graph types you can also alter the display period for the munin graph. Munin graphs the average data between last two (nominally 5 minute spaced) measurements - with default of per/sec units on the result. This is based on the var.type attribute which defaults to GAUGE. If your for example data is reporting accumulated counts you can use:
type DERIVE # data displayed is change from last
# reading
type COUNTER # accumulating counter with reset on rollover
for each of these you can change the vertical axis to use per/minute or per/hour with:
graph_period minute # if type DERIVE or COUNTER
# display avg/period vs default of avg/sec
Note that the table shows the derived data for the graph_period.
See RRDCreate == rrdtool create for implementation particulars- munin used rrd underneath.

How does Graphite handle oversamples

I am trying to understand how Graphite treats over samples. I read the documentation but could not find the answer.
For example, If I specify in Graphite that the retention policy should be 1 sample in 60 seconds and graphite receives something like 200 values in 60 seconds, what will be stored exactly ? Will graphite take an average or a random point in those 200 points ?
Short answer: it depends on the configuration, default is to take the last one.
Long answer, Graphite can configure, using regexp a strategy to aggregate several points in one sample.
These strategies are configured in storage-aggregations.conf file, using regexp to select metrics:
[all_min]
pattern = \.min$
aggregationMethod = min
This example conf, will aggregate points using their minimum.
By default, the last point to arrive wins.
This strategy will always be used to aggregate from higher resolutions to lower resolutions.
For example, if storage-schemas.conf contains:
[all]
pattern = .*
retentions = 1s:8d,1h:1y
Given the sum aggregation method, all points arrived for the same second will be summed and stored with a second resolution.
Points older than 8 days will be summed again to one hour resolution.
The aggregation configuration only applies when moving from archive i to archive i+1. For oversampling, it's always pick the last sample in the period.
The recommendation is to match sampling rate with the configuration.
see graphite issue

RRDTOOL one second logging, values missing

I spent more than two months with RRDTOOL to find out how to store and visualize data on graph. I'm very close now to my goal, but for some reason I don't understand why it is happening that some data are considered to be NaN in my case.
I counting lines in gigabytes sized of log files and have feeding the result to an rrd database to visualize events occurrence. The stepping of the database is 60 seconds, the data is inserted in seconds base whenever it is available, so no guarantee the the next timestamp will be withing the heartbeat or within the stepping. Sometimes no data for minutes.
If have such big distance mostly my data is considered to be NaN.
b1_5D.rrd
1420068436:1
1420069461:1
1420073558:1
1420074583:1
1420076632:1
1420077656:1
1420079707:1
1420080732:1
1420082782:1
1420083807:1
1420086881:1
1420087907:1
1420089959:1
1420090983:1
1420094055:1
1420095080:1
1420097132:1
1420098158:1
1420103284:1
1420104308:1
1420107380:1
1420108403:1
1420117622:1
1420118646:1
1420121717:1
1420122743:1
1420124792:1
1420125815:1
1420131960:1
1420134007:1
1420147326:1
1420148352:1
rrdtool create b1_4A.rrd --start 1420066799 --step 60 DS:Value:GAUGE:120:0:U RRA:AVERAGE:0.5:1:1440 RRA:AVERAGE:0.5:10:1008 RRA:AVERAGE:0.5:30:1440 RRA:AVERAGE:0.5:360:1460
The above gives me an empty graph for the input above.
If I extend the heart beat, than it will fill the time gaps with the same data. I've tried to insert zero values, but that will average out the counts and bring results in mils.
Maybe I taking something wrong regarding RRDTool.
It would be great if someone could explain what I doing wrong.
Thank you.
It sounds as if your data - which is event-based at irregular timings - is not suitable for an RRD structure. RRD prefers to have its data at constant, regular intervals, and will coerce the incoming data to match its requirements.
Your RRD is defined to have a 60s step, and a 120s heartbeat. This means that it expects one sample every 60s, and no further apart than 120s.
Your DS is a gauge, and so the values you enter (all of them '1' in your example) will be the values stored, after any time normalisation.
If you increase the heartbeat, then a value received within this time will be used to make a linear approximation to fill in all samples since the last one. This is why doing so fills the gaps with the same data.
Since your step is 60s, the smallest sample time sidth will be 1 minute.
Since you are always storing '1's, your graph will therefore either show '1' (when the sample was received in the heartbeart window) or Unknown (when the heartbeat expired).
In other words, your graph is showing exactly what you gave it. You data are being coerced into a regular set of numerical values at a 1-minute step, each being 1 or Unknown.

how to do a sum in graphite but exclude cases where not all data is present

We have 4 data series and once in a while one of the 4 has a null as we missed reading the data point. This makes the graph look like we have awful spikes in loss of volume coming in which is not true as we were just missing the data point.
I am doing a basic sumSeries(server*.InboundCount) right now for server 1, 2, 3, 4 where the * is.
Is there a way where graphite can NOT sum the locations on the line and just have sum for those points in time be also null so it connects the line from the point where there is data to the next point where there is data.
NOTE: We also display the graphs server*.InboundCount individually to watch for spikes on individual servers.
or perhaps there is function such that it looks at all the series and if any of the values is null, it returns null for every series that it takes X series and returns X series points to the sum function as null+null+null+null hopefully doesn't result in a spike and shows null.
thanks,
Dean
This is an old question but still deserves an answer as a point of reference, what you're after I believe is the function KeepLastValue
Takes one metric or a wildcard seriesList, and optionally a limit to the number of ‘None’ values to skip over. Continues the line with the last received value when gaps (‘None’ values) appear in your data, rather than breaking your line.
This would make your function
sumSeries(keepLastValue(server*.InboundCount))
This will work ok if you have a single null datapoint here and there. If you have multiple consecutive null data points you can specify how far back before a null breaks your data. For example, the following will look back up to 10 values before the sumSeries breaks:
sumSeries(keepLastValue(server*.InboundCount, 10))
I'm sure you've since solved your problems, but I hope this helps someone.

Graphite: append a "current time" point to the end of a series

I have a "succeeded" metric that is just the timestamp. I want to see the time between successive successes (this is how long the data is stale for). I have
derivative(Success)
but I also want to know how long between the last success time and the current time. since derivative transforms xs[n] to xs[n+1] - xs[n], the "last" delta doesn't exist. How can I do this? Something like:
derivative(append(Success, now()))
I don't see any graphite functions for appending series, and I don't see any user-defined graphite functions.
The general problem is to be alerted when the data is stale, via graphite monitoring. There may be a better solution than the one I'm thinking about.
identity is a function whose value at any given time is the timestamp of that time.
keepLastValue is a function that takes a series and replicates data points forward over gaps in the data.
So then diffSeries(identity("now"), keepLastValue(Success)) will be a "sawtooth" series that climbs steadily while Success isn't updated, and jumps down to zero (or close to it — there might be some time skew) every time Success has a data point. If you use graphite monitoring to get the current value of that expression and compare it to some threshold, it will probably do what you want.

Resources