Graphite: sending multiple values for the same key, only one remains

I have this configuration in my Graphite:
# go-carbon.aggregation
[sum_counts]
pattern = ^stats_counts.*
xFilesFactor = 0
aggregationMethod = sum
# /go-carbon.schemas
[default_1min_for_7days_and_1hour_for_5years]
pattern = .*
retentions = 1m:30d,1h:5y
I am sending different values for the same key within the same minute, but I see that some of them are ignored.
I was expecting the values to be summed, as defined by the aggregationMethod.
To be specific: I am sending 1 signal every 10 seconds directly to Graphite under this key:
stats_counts.test.monitor.remote.datapoint
But when I check what Graphite is storing, I see that it is only counting 1 signal per minute.
Is there any way to ask Graphite to aggregate the incoming signals?
Note: we are using StatsD to aggregate these signals and it works. The problem is that we now want to add several StatsD daemons, so we need Graphite to aggregate the signals coming from the different daemons.

It looks to me like you have mixed up storage aggregation, which happens when a metric becomes too old according to the retention rules, with the carbon-aggregator daemon, which acts as a proxy and does exactly what you need.
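To illustrate the difference, here is a minimal Python sketch. It is not Graphite's code, and the function names are made up for illustration. It assumes that storage keeps a single value per time slot, so several writes within the same minute overwrite each other, whereas an aggregating proxy sums them first and writes once.

```python
from collections import defaultdict

STEP = 60  # seconds per slot, matching the 1m retention in the question


def store_last_write_wins(points, step=STEP):
    """Mimic storage without pre-aggregation: one value per slot, last write wins."""
    slots = {}
    for timestamp, value in points:
        slots[timestamp - timestamp % step] = value
    return slots


def aggregate_then_store(points, step=STEP):
    """Mimic an aggregating proxy: sum everything in a slot, then store it once."""
    slots = defaultdict(int)
    for timestamp, value in points:
        slots[timestamp - timestamp % step] += value
    return dict(slots)


# One signal of value 1 every 10 seconds during a single minute
points = [(1400000040 + offset, 1) for offset in range(0, 60, 10)]

print(store_last_write_wins(points))  # a single slot holding 1
print(aggregate_then_store(points))   # a single slot holding 6
```

The second function is what you would want a proxy in front of storage to do when several StatsD daemons report the same key.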

Related

Understanding retention policies for the Graphite DB

I have the retention policy below in my storage-schemas.conf file:
[metrics]
pattern = ^metrics.api.*
retentions = 10s:5m,1m:1d,1h:30d,1d:1y,30d:10y
Below is my understanding:
This policy applies to metrics matching the pattern, i.e. those starting with metrics.api.
1st, 10s:5m -> if one or more records are inserted within 10s, it takes the latest record and keeps 1 datapoint. It keeps this history for 5 minutes; suppose 5 datapoints are added for the metric key in those 5m.
2nd, 1m:1d -> this second retention runs for the same metric key once the 5 minutes are over. If one or more records are inserted within 1m, it takes the latest record and keeps 1 datapoint. It keeps this history for 1 day; suppose 15 datapoints are added for the metric key in that 1d.
So my question is: what happens between these 2 retentions? Will it take the average, (5+15)/2 = 10, and produce one averaged datapoint out of the 1st and 2nd retentions?
... and so on, up to 10 years of stored data.
Can you please explain the retention policy above?
aggregationMethod is applied to this retention policy when data crosses archive boundaries.
The first retention, 10s:5m, means Graphite will store 30 datapoints (one every 10 seconds for the last 5 minutes) in archive 0.
Please note that it will always store these datapoints, even if no data arrived. In that case Graphite will put NULLs there.
The next retention, 1m:1d, means that every minute Whisper will take 6 of these 10s datapoints from archive 0, apply the average() function, and store the result in archive 1.
But please note that Whisper will do so only if at least 3 (the number of datapoints, 6, multiplied by xFilesFactor = 0.5) or more points in archive 0 have values (i.e. are not NULLs). Otherwise Whisper decides that it does not have enough data to propagate and stores NULL instead.
And so on: the third retention, 1h:30d, means that 60 datapoints from archive 1 will be aggregated using the average function and propagated to archive 2, but only if at least 30 of them have values.
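The arithmetic above can be reproduced with a short Python sketch. This is an illustration only, not Whisper's implementation; the helper names are hypothetical, and the default xFilesFactor of 0.5 is assumed.

```python
import math

# Seconds per unit suffix used in retention strings
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(spec):
    """Convert a value such as '10s' or '5m' to seconds."""
    return int(spec[:-1]) * UNITS[spec[-1]]


def describe(retentions, x_files_factor=0.5):
    """Return, per archive, its step, its number of points and its rollup threshold."""
    archives = [
        tuple(to_seconds(part) for part in retention.split(":"))
        for retention in retentions.split(",")
    ]
    rows = []
    for index, (step, duration) in enumerate(archives):
        row = {"archive": index, "step": step, "points": duration // step}
        if index > 0:
            # How many points of the previous archive feed one point of this one
            source_points = step // archives[index - 1][0]
            row["source_points"] = source_points
            row["min_non_null"] = math.ceil(source_points * x_files_factor)
        rows.append(row)
    return rows


for row in describe("10s:5m,1m:1d,1h:30d,1d:1y,30d:10y"):
    print(row)
```

For the first three archives this gives 30 points in archive 0, 6 source points with a threshold of 3 for archive 1, and 60 source points with a threshold of 30 for archive 2, matching the explanation.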

How do you sum a statsd counter over a large time range with correct values?

Background
A basic use case for statsd & grafana is learning how many times a function has been called over a time range-- whether that is "last 6h", "since beginning of today", "since beginning of time", etc.
What I'm struggling to find is the correct function to achieve this. I'm using a hosted solution; however, I can confirm that data is being flushed from StatsD to Graphite in 10s intervals.
Current Setup
StatsD Flush: 10s
Graph Function: hitcount(counters.login.employer.count, "10seconds")
Time Range: 24h
Problem
When using hitcount(counters.login.employer.count, "10seconds"), the data returned is incorrect. In fact, I can do 24h, 23h, 22h, and note the values are actually increasing.
I've performed all testing here in a controlled environment, only my machine is sending metrics to StatsD. This is not yet in production code.
Any idea what could be going on here?
The way counters work is that on each interval the value of the counter is sent to graphite and reset in statsd, so what you're looking for is the sum of the series.
You can do that using consolidateBy('sum') combined with maxDataPoints=1.
Be aware that if your series is being aggregated in Graphite, you'll need to make sure that the aggregation is by sum. Otherwise, when values get rolled up from the individual values reported by StatsD into aggregated buckets, they'll be averaged, and your sum won't work across longer intervals. You can read more about configuring aggregation in the Graphite documentation on storage-aggregation.conf.
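As a rough sketch, a render request along these lines could be built in Python as follows. The host name is a placeholder and the helper name is made up; only consolidateBy and maxDataPoints come from the answer above.

```python
from urllib.parse import urlencode

GRAPHITE_URL = "https://graphite.example.com"  # hypothetical host, replace with yours


def total_url(metric, time_from="-24h"):
    """Build a render URL asking for the whole range consolidated into one summed point."""
    params = {
        "target": f"consolidateBy({metric}, 'sum')",
        "from": time_from,
        "maxDataPoints": 1,
        "format": "json",
    }
    return f"{GRAPHITE_URL}/render?{urlencode(params)}"


print(total_url("counters.login.employer.count"))
```

Changing time_from to "-6h" or another range gives the total for that window, provided the stored archives were rolled up with sum.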

StatsD sends Average or Graphite stores Average instead of Sum

I am using StatsD to record requests sent to my server, and Graphite to collect the statistics. But when I try to display the statistics, instead of a sum aggregated over a minute, I get averages.
My retention rate for the requests is 1m:7d,5m:35d,1d:1y. My xFilesFactor is 0 and my aggregationMethod is sum. The FlushInterval of StatsD is set to 1m. What am I doing wrong?
StatsD normalizes the point it sends to Graphite over that time period to a per-second rate.
The aggregationMethod is a setting for Graphite retention schemas which specifies how points are aggregated as you go from (in your case) a 1 minute per point representation to a 5 minute per point representation (and so on).
If you want the number of requests over that one-minute period, you can multiply the series by a constant 60 in Graphite and get the result.
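A small worked example of that arithmetic, written in Python with made-up numbers (120 requests during one 60-second flush interval):

```python
FLUSH_INTERVAL = 60  # seconds, matching the 1m StatsD flush interval in the question


def per_second_rate(count, interval=FLUSH_INTERVAL):
    """What gets reported: the count normalized to a per-second rate."""
    return count / interval


def count_from_rate(rate, interval=FLUSH_INTERVAL):
    """Recover the per-interval count by multiplying the rate by the interval."""
    return rate * interval


rate = per_second_rate(120)        # 2.0 requests per second
restored = count_from_rate(rate)   # 120.0 requests in the minute
print(rate, restored)
```

In Graphite itself, the multiplication can be done with the scale() function, e.g. scale(my.series, 60), where my.series stands for your metric.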

Graphite, datapoints disappear if I choose a wider time range

If I ask for this data:
https://graphite.it.daliaresearch.com/render?from=-2hours&until=now&target=my.key&format=json
I get, among other datapoints, this one:
[
2867588,
1398790800
]
If I ask for this data:
https://graphite.it.daliaresearch.com/render?from=-10hours&until=now&target=my.key&format=json
The datapoint looks like this:
[
null,
1398790800
]
Why is this datapoint being nullified when I choose a wider time range?
Update
I'm seeing that for a date range smaller than 7 hours the resolution is one datapoint every 10 seconds, and when the date range is 7 hours or bigger the resolution drops to one datapoint every 1 minute. It continues in this direction as the date range gets bigger, down to one datapoint every 10 minutes and so on.
So when the resolution is one datapoint every 10 seconds the data is there, but when the resolution is one datapoint every 1 minute or more, the datapoint has no value :/
I'm sending one datapoint every hour, so maybe there is a conflict between the resolution configuration and the fact that I send only one datapoint per hour.
There are several things happening here, but basically the problem is that you have misconfigured graphite (or at least, configured it in a way that makes it do things that you aren't expecting!)
Specifically, you should set xFilesFactor = 0.0 in your storage-aggregation.conf file. Since you are new at this, you probably just want this (mine is in /opt/graphite/conf/storage-aggregation.conf):
[default]
pattern = .*
xFilesFactor = 0.0
aggregationMethod = average
The graphite docs describe xFilesFactor like this:
xFilesFactor should be a floating point number between 0 and 1, and specifies what fraction of the previous retention level’s slots must have non-null values in order to aggregate to a non-null value. The default is 0.5.
But wait! This won't change existing statistics! These aggregation settings are set once per metric, at the time the metric is created. Since you are new at this, the easy way out is to just go to your whisper directory, delete the prior data, and start over:
cd /opt/graphite/storage/whisper/my/
rm key.wsp
Your root whisper directory may be different depending on platform, etc. After removing the data files, Graphite should recreate them automatically upon the next metric write and they should get your updated settings (don't forget to restart carbon-cache after changing your storage-aggregation settings).
Alternatively, if you need to keep your old data you will need to run whisper-resize.py against your whisper (.wsp) data files with --xFilesFactor=0.0 and also likely all of your retention settings from storage-schemas.conf (also viewable with whisper-info.py)
Finally, I should add that the reason you get non-null data in your first query, but null data in your second is because graphite will try to pick the best available retention period from which to serve your request based on the time window you requested. For the smaller window, graphite is deciding that it can serve your request using the highest precision data (i.e., non aggregated) and so you are seeing your raw metrics. For the longer time window, graphite is finding that the high precision, non-aggregated data is not available for the entire window -- these periods are configured in storage-schemas.conf -- so it skips to the next highest-precision data set available (i.e. first aggregation tier) and returns only aggregated data. Because your aggregation config is writing null data, you are therefore seeing null metrics! So fix the aggregation, and you should fix the null data problem. But remember that graphite never combines aggregation tiers in a single request/response, so anytime you see differences between results from the same query when all you are changing is the from / to params, the problem is pretty much always due to aggregation configs.
I'm not quite sure about your specific situation, but I think I can give you some general pointers.
First off, you are right about the changing resolution depending on the time range. This is configured in storage-schemas.conf and is done to save space when storing data over large periods of time. An example could be: 15s:7d,1m:21d,15m:5y, meaning 15 seconds resolution for 7 days, then 1 minute resolution for 21 days, then 15min for 5 years.
Then there is the way Graphite does the actual aggregation from one resolution to the other. This is configured in: storage-aggregation.conf. The default settings are: xFilesFactor=0.5 and aggregationMethod=average. The xFilesFactor setting is saying that a minimum of 50% of the slots in the previous retention level must have values for next retention level to contain an aggregate. The aggregationMethod is saying that all the values of the slots in the previous retention level will be combined by averaging. My guess is that your stat doesn't have enough data points to fulfill the 50% requirement, resulting in a null value.
For more information, check out the docs, they are pretty complete: http://graphite.readthedocs.org/en/latest/config-carbon.html
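To make the effect of xFilesFactor concrete, here is a simplified Python sketch of the rollup decision. It is an illustration under the rules described above, not Whisper's actual code, and the function name is made up.

```python
def rollup(slots, x_files_factor=0.5, method="average"):
    """Aggregate one bucket of higher-resolution slots, or return None if too sparse."""
    known = [value for value in slots if value is not None]
    # Too few non-null slots: the aggregated point stays null
    if not known or len(known) / len(slots) < x_files_factor:
        return None
    if method == "sum":
        return sum(known)
    return sum(known) / len(known)


# One value sent per hour into a 10-second archive:
# the 1-minute bucket containing it has only 1 of 6 slots filled.
minute = [2867588, None, None, None, None, None]

print(rollup(minute))                      # None with the default 0.5
print(rollup(minute, x_files_factor=0.0))  # the value survives with 0.0
```

With a factor of 0.5 the bucket is dropped because only about 17% of its slots have values; with 0.0 a single value is enough.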

Graph old data using graphite and statsd

Can I enter timestamp to send data to graphite via statsd(javascript statsd)? I need to graph old data.
No, you can't do that with StatsD. However, you can achieve the same by sending your data directly to carbon, which accepts timestamps.
StatsD just collects real-time data and, at a configured interval, sums or averages each metric received during that interval and sends it to the Graphite carbon daemon with the current timestamp.
Sending data to the carbon daemon is very straightforward: you just need to open a socket to carbon's plaintext port (2003 by default; the pickle protocol uses a separate port, 2004 by default), and then write to that socket one metric per line with the following values:
metric_name metric_value metric_timestamp
Carbon will store that value at that timestamp, and you can use any timestamp you want as long as it's in the range configured for the storage of that metric.
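A minimal Python sketch of this, assuming carbon listens for the plaintext protocol on localhost:2003 (adjust the host and port for your setup; the function names are made up for illustration):

```python
import socket
import time

CARBON_HOST = "localhost"  # assumption: carbon runs locally
CARBON_PORT = 2003         # carbon's default plaintext port


def format_metric(name, value, timestamp=None):
    """Build one plaintext line: 'metric_name metric_value metric_timestamp'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{name} {value} {int(timestamp)}\n"


def send_metrics(lines, host=CARBON_HOST, port=CARBON_PORT):
    """Open a TCP connection to carbon and write all lines in one go."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

To backfill an old datapoint you would call something like send_metrics([format_metric("my.key", 42, 1398790800)]), where the last argument is the Unix timestamp the value belongs to.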
There are many examples around, like this one to send with netcat
There's also a Graphite client written in C
I wanted to use statsd but not in real time, because I process log files once an hour. So I modified the server code to accept a timestamp, and modified the client code to send one. It ended up working for me although it feels very "home grown" and I can't update to newer versions of statsd without extra work. The tricky part is that the server does some aggregation into 10-second buckets. In real time, this is pretty easy to do, but if you are going to accept a timestamp, you have to keep a lot more data around. For me, since my data can only be around an hour old, it wasn't too hard, but my solution doesn't really work for a general case.
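It looks like there is a way to send raw data via StatsD, but it won't be aggregated:
import time
import statsd

# Subclass of statsd.Client, as in the python-statsd raw module linked below
class Raw(statsd.Client):
    def send(self, subname, value, timestamp=None):
        '''Send the data to statsd via self.connection
        :keyword subname: The subname to report the data to (appended to the
            client name)
        :keyword value: The raw value to send
        :keyword timestamp: Unix timestamp to report; defaults to now
        '''
        # Fall back to the current time when no timestamp is given
        ts = int(time.time()) if timestamp is None else timestamp
        name = self._get_name(self.name, subname)
        return statsd.Client._send(self, {name: '%s|r|%s' % (value, ts)})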
see:
http://python-statsd.readthedocs.org/en/latest/_modules/statsd/raw.html
https://github.com/chuyskywalker/statsd/blob/master/README.md
