How does Graphite handle oversamples

How does Graphite handle oversamples - graphite

I am trying to understand how Graphite treats over samples. I read the documentation but could not find the answer.
For example, If I specify in Graphite that the retention policy should be 1 sample in 60 seconds and graphite receives something like 200 values in 60 seconds, what will be stored exactly ? Will graphite take an average or a random point in those 200 points ?

Short answer: it depends on the configuration, default is to take the last one.
Long answer, Graphite can configure, using regexp a strategy to aggregate several points in one sample.
These strategies are configured in storage-aggregations.conf file, using regexp to select metrics:
[all_min]
pattern = \.min$
aggregationMethod = min
This example conf, will aggregate points using their minimum.
By default, the last point to arrive wins.
This strategy will always be used to aggregate from higher resolutions to lower resolutions.
For example, if storage-schemas.conf contains:
[all]
pattern = .*
retentions = 1s:8d,1h:1y
Given the sum aggregation method, all points arrived for the same second will be summed and stored with a second resolution.
Points older than 8 days will be summed again to one hour resolution.

The aggregation configuration only applies when moving from archive i to archive i+1. For oversampling, it's always pick the last sample in the period.
The recommendation is to match sampling rate with the configuration.
see graphite issue

Related

RRDTOOL one second logging, values missing

I spent more than two months with RRDTOOL to find out how to store and visualize data on graph. I'm very close now to my goal, but for some reason I don't understand why it is happening that some data are considered to be NaN in my case.
I counting lines in gigabytes sized of log files and have feeding the result to an rrd database to visualize events occurrence. The stepping of the database is 60 seconds, the data is inserted in seconds base whenever it is available, so no guarantee the the next timestamp will be withing the heartbeat or within the stepping. Sometimes no data for minutes.
If have such big distance mostly my data is considered to be NaN.
b1_5D.rrd
1420068436:1
1420069461:1
1420073558:1
1420074583:1
1420076632:1
1420077656:1
1420079707:1
1420080732:1
1420082782:1
1420083807:1
1420086881:1
1420087907:1
1420089959:1
1420090983:1
1420094055:1
1420095080:1
1420097132:1
1420098158:1
1420103284:1
1420104308:1
1420107380:1
1420108403:1
1420117622:1
1420118646:1
1420121717:1
1420122743:1
1420124792:1
1420125815:1
1420131960:1
1420134007:1
1420147326:1
1420148352:1
rrdtool create b1_4A.rrd --start 1420066799 --step 60 DS:Value:GAUGE:120:0:U RRA:AVERAGE:0.5:1:1440 RRA:AVERAGE:0.5:10:1008 RRA:AVERAGE:0.5:30:1440 RRA:AVERAGE:0.5:360:1460
The above gives me an empty graph for the input above.
If I extend the heart beat, than it will fill the time gaps with the same data. I've tried to insert zero values, but that will average out the counts and bring results in mils.
Maybe I taking something wrong regarding RRDTool.
It would be great if someone could explain what I doing wrong.
Thank you.

It sounds as if your data - which is event-based at irregular timings - is not suitable for an RRD structure. RRD prefers to have its data at constant, regular intervals, and will coerce the incoming data to match its requirements.
Your RRD is defined to have a 60s step, and a 120s heartbeat. This means that it expects one sample every 60s, and no further apart than 120s.
Your DS is a gauge, and so the values you enter (all of them '1' in your example) will be the values stored, after any time normalisation.
If you increase the heartbeat, then a value received within this time will be used to make a linear approximation to fill in all samples since the last one. This is why doing so fills the gaps with the same data.
Since your step is 60s, the smallest sample time sidth will be 1 minute.
Since you are always storing '1's, your graph will therefore either show '1' (when the sample was received in the heartbeart window) or Unknown (when the heartbeat expired).
In other words, your graph is showing exactly what you gave it. You data are being coerced into a regular set of numerical values at a 1-minute step, each being 1 or Unknown.

Graphite: append a "current time" point to the end of a series

I have a "succeeded" metric that is just the timestamp. I want to see the time between successive successes (this is how long the data is stale for). I have
derivative(Success)
but I also want to know how long between the last success time and the current time. since derivative transforms xs[n] to xs[n+1] - xs[n], the "last" delta doesn't exist. How can I do this? Something like:
derivative(append(Success, now()))
I don't see any graphite functions for appending series, and I don't see any user-defined graphite functions.
The general problem is to be alerted when the data is stale, via graphite monitoring. There may be a better solution than the one I'm thinking about.

identity is a function whose value at any given time is the timestamp of that time.
keepLastValue is a function that takes a series and replicates data points forward over gaps in the data.
So then diffSeries(identity("now"), keepLastValue(Success)) will be a "sawtooth" series that climbs steadily while Success isn't updated, and jumps down to zero (or close to it — there might be some time skew) every time Success has a data point. If you use graphite monitoring to get the current value of that expression and compare it to some threshold, it will probably do what you want.

graphite summarize function not working as expected

I am feeding data into a metric, let say it is "local.junk". What I send is just that metric, a 1 for the value and the timestamp
local.junk 1 1394724217
Where the timestamp changes of course. I want to graph the total number of these instances over a period of time so I used
summarize(local.junk, "1min")
Then I went and made some data entries, I expected to see the number of requests that it received in each minute but it always just shows the line at 1. If I summarize over a longer period like 5 mins, It is showing me some random number... I tried 10 requests and I see the graph at like 4 or 5. Am I loading the data wrong? Or using the summarize function wrong?

The method summarize() just sums up your data values so co-relate and verify that you indeed are sending correct values.
Also, to localize weather the function or data has issues, you can run it on metricsReceived:
summarize(carbon.agents.ip-10-0-0-1-a.metricsReceived,"1hour")
Which version of Grahite are you running?

You may want to check your carbon aggregator settings. By default carbon aggregates data for every 10 seconds. Without adding any entry in aggregation-rules.conf, Graphite only saves last metric it receives in the 10second duration.
You are seeing above problem because of that behaviour. You need to add an entry for your metric in the aggregation-rules.conf with sum method like this
local.junk (10) = sum local.junk

Graphite does not graph values correctly when using long durations?

I'm trying to graph data using statsd and graphite. I have a simple counter, I increment it by 1, and then when I graph the values for the counter over the day, I see strange values like 0.09 as the peak in my graph (see http://i.stack.imgur.com/o4gmz.png)
This graph should be showing 2 logins, but instead it's showing 0.09. If I change the time scale from 1 day to the last 15 minutes, then it correctly shows the two logins (see http://i.stack.imgur.com/23vDJ.png)
I've set up my finest retention to be in 10s increments in storage-schemas.conf:
retentions = 10s:7d,1m:21d,24h:5y
I've set up my storage-aggregation.conf file to sum counts:
[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum
(And, before you ask, yes; this is a .count).
If I try my URL with &rawData=true then in either case I see some Nones, some 0.0s, and a pair of 1.0s separated by some 0.0s. I never see these fractional values that somehow show up on the graph. So... Is this a bug? Am I doing something wrong?

There's also consolidateBy function which tells graphite what to do if there's no enough pixels to draw everything accurately. By default it's using "avg" function and therefore strange results when time ranges are greater. Here excerpt from documentation:
When a graph is drawn where width of the graph size in pixels is
smaller than the number of datapoints to be graphed, Graphite
consolidates the values to to prevent line overlap. The
consolidateBy() function changes the consolidation function from the
default of ‘average’ to one of ‘sum’, ‘max’, or ‘min’. This is
especially useful in sales graphs, where fractional values make no
sense and a ‘sum’ of consolidated values is appropriate.
Another function that could be useful is hitcount. Short excerpt from here why it's useful:
This function is like summarize(), except that it compensates
automatically for different time scales (so that a similar graph
results from using either fine-grained or coarse-grained records) and
handles rarely-occurring events gracefully.
I spent some time scratching my head why I get fractions for my counter with time ranges longer than couple hours when my aggregation rule is max. It's pretty confusing, especially at the beginning when you play with single counters to see if everything works. Checking rawData is quite a good way for debugging sanity check ;)

Graphite is not graphing anything for ranges bigger than 7 hours

My current retention rule is like so:
[whatever]
priority = 110
pattern = ^stats\.whatever\..*
retentions = 60:10080,600:262974
If I understand correctly, this will save 2 days of 1 minute data and 5 years of ten minute data.
I have been sending data to graphite for the last couple of hours and I can see the a graph of this data but only for ranges less than 7 hours. If I try to visualize this data for a range of, for example, 1 day, the resulting graph doesn't show a single data point.
Is this caused by my retention rule?
thanks in advance.

I had this same problem. After you change your retention rules, you need to restart carbon-cache.py. If you want to keep the data you have you need to run whisper-resize.py on your whisper files (.wsp).
This link should help too:
https://answers.launchpad.net/graphite/+question/140289
However in that link, the parameters passed to whisper-resize.py are in the wrong order. It should be whisper-resize.py <file> <retention rate>
Here's a helpful command for resizing:
find /opt/graphite/storage/whisper -type f -name "*.wsp" -exec whisper-resize.py {} <retention rate> \;
Adjust it as needed.

I had a similar problem; for me it wasn't the retention rules, but the aggregation rules. By default, my counters were being assigned to --agggregationMethod average and -xFilesFactor 0.5. But my data was nowhere near that dense, so the aggregator was throwing away my data on the grounds that there wasn't a statistically significant sample available.
In my particular use case, I was interested in the peak value over some time period, so I used whisper-resize.py to reconfigure my database: --aggregationMethod max, --xFilesFactor 0.0 gave me the behavior I was expecting.
See also storage-aggregation.conf

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex