Is Graphite usable with very occasional counters? - graphite

I'm using statsd, configured with flushInterval: 1000, communicating with Graphite's carbon-cache. I would like to be able to see very occasional counters.
I have the following configuration for carbon:
storage-schemas.conf:
[carbon]
pattern = ^carbon\.
retentions = 60:90d
[default_30s_for_1day]
pattern = .*
retentions = 30s:1d
I send a single counter like this:
$ echo "foobar:1|c" > /dev/udp/127.0.0.1/8125
I can see the packet received by statsd:
9 Jul 14:43:05 - DEBUG: foobar:1|c
as well as the data sent to carbon-cache (tcpdump extract):
stats.foobar 1 1404909785
In Graphite, looking at the data for "foobar", I can see that something happened at that moment (thin line, see the red circle in the picture), but the value is always "0":
Am I missing something?
When results are much more frequent, I see numbers that look correct.
Is there a minimum number of stats that must be sent for them to be taken into account? Is it configurable?
Note: Maybe StatsD / Graphite is not worth it for such occasional data, but since other, very frequent data is collected for the same project, those tools will be used anyway, and I hope a single solution can be used even for rare counters.

The issue in my setup was that the flushInterval (1 sec.) was lower than my smallest retention (30 sec.).
Every retention interval (in my case, 30 sec.), carbon-cache stores one value. However, if, e.g., at second "10" the count "1" is sent for my foobar key, statsd will send the count "0" every second afterwards (every flushInterval, to be more precise). This means that at second "30", the last value seen for foobar is "0". The only chance for the value "1" to be taken into account is if it was sent at the very last moment, just before second "30".
One could blame me for not using statsd's deleteIdleStats or deleteCounters settings to avoid sending "0". But with those settings, if the counter were set to "1" twice in the same 30 sec. slot, the first one would be lost, and a count of "1" would be registered instead of "2".
The real solution is to align statsd's flushInterval with the minimum retention of carbon:
statsd config.js:
flushInterval: 10000
carbon's storage-schemas.conf:
retentions = 10s:1d,1m:30d,5m:90d
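The effect described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: `statsd_flushes` and `stored_values` are hypothetical helpers, not statsd or carbon code, and the storage side is modeled as "the last value written into a retention slot wins".

```python
def statsd_flushes(increments, flush_interval_s, duration_s):
    """Return (timestamp, count) pairs, one per flush interval.

    Models a statsd-like daemon that sends 0 when nothing was counted.
    """
    flushes = []
    for t in range(flush_interval_s, duration_s + 1, flush_interval_s):
        count = sum(n for ts, n in increments if t - flush_interval_s <= ts < t)
        flushes.append((t, count))
    return flushes


def stored_values(flushes, retention_s):
    """Model storage where each retention slot keeps the last value written."""
    slots = {}
    for timestamp, value in flushes:
        slot = timestamp - (timestamp % retention_s)
        slots[slot] = value  # a later flush in the same slot overwrites
    return slots


increments = [(10, 1)]  # a single "foobar:1|c" sent at second 10

# flushInterval of 1 s against a 30 s retention: the "1" is overwritten by zeros
mismatched = stored_values(statsd_flushes(increments, 1, 60), 30)

# flushInterval of 10 s against a 10 s retention: the "1" survives
aligned = stored_values(statsd_flushes(increments, 10, 60), 10)

print(sum(mismatched.values()), sum(aligned.values()))
```

With the mismatched settings the increment disappears from the stored data, while the aligned settings keep it in its own slot.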

Related

How to get the accumulated value of an OpenTSDB counter using Bosun?

I have a counter named, for example, "mysvr.method_name1", with 3 tagk/tagv pairs. It is an OpenTSDB counter which, in my situation, counts queries. How can I get its accumulated value over the past 30 days (in my situation, the total number of requests in 30 days)?
I use the q function like below:
q("sum:mysvr.method_name1{tag1=v1}", "1590940800", "1593532800")
but it looks like the series is not monotonically increasing, due to server restarts, missing tagk/tagv pairs, or some other reasons.
So it seems the query below will not meet my requirement:
diff(q("sum:mysvr.method_name1{tag1=v1}", "1590940800", "1593532800"))
What should I do to fetch the accumulated value of a counter over a given time period?
The only thing I can be sure of is that the query below gives the mean QPS in my situation:
avg(q("sum:rate{counter}:mysvr.method_name1{tag1=v1}", "1590940800", "1593532800"))
sum(q("sum:rate{counter}:mysvr.method_name1{tag1=v1}", "1590940800", "1593532800"))
works for my situation; the remaining gap is that the result must be multiplied by the sample interval, which in my situation is 30 seconds.
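The arithmetic behind this can be sketched in Python. This is an illustration only, not Bosun's or OpenTSDB's implementation: `total_from_rates` multiplies summed per-second rates by the sample interval, and `total_increase` is a hypothetical alternative that works on raw counter samples and assumes that any drop in value means the counter restarted from zero.

```python
def total_from_rates(rates_per_second, sample_interval_s):
    """Turn per-second rates sampled at a fixed interval into a total count."""
    return sum(rates_per_second) * sample_interval_s


def total_increase(samples):
    """Sum counter increments, treating any decrease as a reset to zero."""
    total = 0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # Assumption: the counter restarted, so everything seen since
            # the restart is new traffic.
            total += current
    return total


# Three samples of 1, 1 and 2 requests/second at a 30 s interval
print(total_from_rates([1.0, 1.0, 2.0], 30))

# A counter that restarts between 160 and 20
print(total_increase([100, 130, 160, 20, 50]))
```

Requests that happened between the last sample before a restart and the restart itself are not visible to either approach.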

Monitor and alert prometheus anomaly in number of metrics

We have a number of prometheus servers, each one monitors its own region (actually 2 per region), there are also thanos servers that can query multiple regions, and we also use alertmanager for the alerting.
Lately, we had an issue where a few metrics stopped reporting, and we only discovered it when we needed the metrics.
We are trying to find out how to monitor changes in the number of reported metrics in a scalable system that grows and shrinks as required.
I'd be glad to have your advice.
You can either count the number of timeseries in the head chunk (last 0-2 hours) or the rate at which you're ingesting samples:
prometheus_tsdb_head_series
or
rate(prometheus_tsdb_head_samples_appended_total[5m])
Then you compare said value with itself a few minutes/hours ago, e.g.
prometheus_tsdb_head_series / prometheus_tsdb_head_series offset 5m
and see whether it fits within an expected range (say 90-110%) and alert otherwise.
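The range check itself is simple; a minimal Python sketch of the same comparison follows. The function name and the default 90-110% bounds are assumptions taken from the example range above, and the two inputs stand for the current and the offset value of prometheus_tsdb_head_series.

```python
def series_count_anomalous(current, previous, low=0.90, high=1.10):
    """Return True when current/previous falls outside [low, high]."""
    if previous == 0:
        # No baseline to compare against; flag only if series appeared.
        return current != 0
    ratio = current / previous
    return not (low <= ratio <= high)


print(series_count_anomalous(150_000, 148_000))  # small growth
print(series_count_anomalous(120_000, 150_000))  # 20% drop
```

Widening the bounds or comparing against a longer offset trades sensitivity for fewer false positives.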
Or you can look at the metrics with the highest cardinality only:
topk(100, count({__name__=~".+"}) by (__name__))
Note however that this last expression can be quite costly to compute, so you may want to avoid it. Plus the comparison with 5 minutes ago will not be as straightforward:
label_replace(topk(100, count({__name__=~".+"}) by (__name__)), "metric", "$1", "__name__", "(.*)")
/
label_replace(count({__name__=~".+"} offset 5m) by (__name__), "metric", "$1", "__name__", "(.*)")
You need the label_replace there because the match for the division is done on labels other than __name__. Computing this last expression takes ~10s on my Prometheus instance with 150k series, so it's anything but fast.
And finally, whichever approach you choose, you're likely to get a lot of false positives (whenever a large job is started or taken down), to the point that it's not going to be all that useful. I would personally not bother trying.

Graphite: sumSeries() does not sum

Since 6 this morning I have been experiencing strange Graphite behavior.
We have two machines that collect data about calls received. I plot a chart for each and also plot the sum of the two.
While the charts for the individual machines are fine, the sum is not working anymore.
This is a screenshot of Graphite and also Grafana that shows how 4+5=5 (my math teacher is going to die for this)
This wrong sum also happens for other metrics, and I don't get why.
storage-schemas.conf
# Schema definitions for whisper files. Entries are scanned in order,
# and first match wins.
#
# [name]
# pattern = regex
# retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...
[default_1min_for_1day]
pattern = .*
retentions = 60s:1d,1h:7d,1d:1y,7d:5y
storage-aggregation.conf
# Schema definitions for whisper files. Entries are scanned in order,
# and first match wins.
#
# [name]
# pattern = regex
# retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...
[time_data]
pattern = ^stats\.timers.*
xFilesFactor = 0.5
aggregationMethod = average
[storage_space]
pattern = \.postgresql\..*
xFilesFactor = 0.1
aggregationMethod = average
[default_1min_for_1day]
pattern = .*
xFilesFactor = 0
aggregationMethod = sum
aggregation-rules.conf
This may be the cause, but it was working before 6 AM. In any case, I don't see the stats_counts.all metric.
stats_counts.all.rest.req (60) = sum stats_counts.srv_*_*.rest.req
stats_counts.all.rest.res (60) = sum stats_counts.srv_*_*.rest.res
It seems that the two series were not aligned by timestamp, so the sum could not combine the points. This is visible in the following chart, where selecting a time highlights points in two different minutes (charts from Grafana).
I don't know why this happened. I restarted some services (these charts come from statsd for Python and bucky). Maybe one of those was at fault.
NOTE: Now this works. However, I would like to know if someone knows the reason and how I can solve it.
One thing you need to ensure is that the services sending metrics to Graphite do it at the same granularity as your smallest retention period or the period you will be rendering your graphs in. If the data points in the graph will be every 60 seconds, you need to send metrics every 60 seconds from each service. If the graph will be showing a data point for every hour, you can send your metrics every hour. In your case the smallest period is every 60 seconds.
I encountered a similar problem in our system - graphite was configured with the smallest retention period of 10s:6h, but we had 7 instances of the same service generating lots of metrics and configured them to send data every 20 seconds in order to avoid overloading our monitoring. This caused an almost unavoidable misalignment, where the series from the different instances will have a datapoint every 20 seconds, but some would have it at 10, 30, 50 and others will have it at 0, 20, 40. Depending on how many services were aligned, we would get a very jagged graph, looking similar to yours.
What I did to solve this problem for time periods that were returning data in 10 second increments was to use the keepLastValue function -> keepLastValue(1). I used 1 as parameter, because I only wanted to skip 1 None value, because I knew our service causes this by sending once every 20 seconds rather than every 10. This way the series generated by different services never had gaps, so sums were closer to the real number and the graphs stopped having the jagged look. I guess this introduced a bit of extra lag in the monitoring, but this is acceptable for our use case.
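To make the misalignment concrete, here is a small Python sketch of two series that report on alternating slots, summed with and without gap filling. Both functions are simplified re-implementations written for this illustration (they are not Graphite's code), and they assume that a missing point is represented as None and that a sum ignores None values.

```python
def keep_last_value(series, limit=1):
    """Fill runs of at most `limit` consecutive None values with the previous value."""
    result = list(series)
    i = 0
    while i < len(result):
        if result[i] is None and i > 0 and result[i - 1] is not None:
            j = i
            while j < len(result) and result[j] is None:
                j += 1
            if j - i <= limit:  # leave longer gaps untouched
                for k in range(i, j):
                    result[k] = result[i - 1]
            i = j
        else:
            i += 1
    return result


def sum_series(*series_list):
    """Sum point by point, skipping None; None only if every point is missing."""
    summed = []
    for points in zip(*series_list):
        present = [p for p in points if p is not None]
        summed.append(sum(present) if present else None)
    return summed


# Two instances sending every 20 s into 10 s slots, offset by one slot
server_a = [4, None, 4, None, 4, None]
server_b = [None, 5, None, 5, None, 5]

raw = sum_series(server_a, server_b)
filled = sum_series(keep_last_value(server_a), keep_last_value(server_b))
print(raw)
print(filled)
```

The raw sum alternates between the two individual values, which is the "4+5=5" effect, while the filled sum settles at the combined value after the first slot.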

How do Cron "Steps" Work?

I'm running into a situation where a cron job I thought was running every 55 minutes is actually running at 55 minutes after the hour and at the top of the hour. Actually, it's not a cron job, but it's a PHP scheduling application that uses cron syntax.
When I ask this application to schedule a job every 55 minutes, it creates a crontab line like the following.
*/55 * * * *
This crontab line ends up not running a job every 55 minutes. Instead, a job runs at 55 minutes after the hour, and at the top of the hour. I do not desire this. I've run this through a cron tester, and it verifies that the undesired behavior is correct cron behavior.
This leads me to look up what the / actually means. When I looked at the cron manual I learned the slash indicates "steps", but the manual itself is a little fuzzy on what that means:
Step values can be used in conjunction with ranges. Following a range with "/<number>" specifies skips of the number's value through the range. For example, "0-23/2" can be used in the hours field to specify command execution every other hour (the alternative in the V7 standard is "0,2,4,6,8,10,12,14,16,18,20,22"). Steps are also permitted after an asterisk, so if you want to say "every two hours", just use "*/2".
The manual's description ("specifies skips of the number's value through the range") is a little vague, and the "every two hours" example is a little misleading (which is probably what led to the bug in the application).
So, two questions:
How does the unix cron program use the "step" information (the number after a slash) to decide if it should skip running a job? (modular division? If so, on what? With what conditions deciding a "true" run, and which decisions not? Or is it something else?)
Is it possible to configure a unix cron job to run every "N" minutes?
Step values can be used in conjunction with ranges. Following a range
with "/<number>" specifies skips of the number's value through the range. For
example, "0-23/2" can be used in the hours field to specify command
execution every other hour (the alternative in the V7 standard is
"0,2,4,6,8,10,12,14,16,18,20,22"). Steps are also permitted after an
asterisk, so if you want to say "every two hours", just use "*/2".
The "range" being referred to here is the range given before the /, which is a subrange of the range of times for the particular field. The first field specifies minutes within an hour, so */... specifies a range from 0 to 59. A first field of */55 specifies all minutes (within the range 0-59) that are multiples of 55 -- i.e., 0 and 55 minutes after each hour.
Similarly, 0-23/2 or */2 in the second (hours) field specifies all hours (within the range 0-23) that are multiples of 2.
If you specify a range starting other than at 0, the number (say N) after the / specifies every Nth minute/hour/etc starting at the lower bound of the range. For example, 3-23/7 in the second field means every 7th hour starting at 03:00 (03:00, 10:00, 17:00).
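The rule above can be sketched as a short Python function. This is a simplified model for illustration, not cron's actual parser: `expand_cron_field` is a hypothetical helper that handles only a single `*`, `a-b`, or plain number, optionally followed by `/step`, and ignores comma-separated lists and names.

```python
def expand_cron_field(field, low, high):
    """Expand one cron field (e.g. '*/55' or '3-23/7') into the values it matches."""
    if "/" in field:
        range_part, step_text = field.split("/", 1)
        step = int(step_text)
    else:
        range_part, step = field, 1

    if range_part == "*":
        start, end = low, high
    elif "-" in range_part:
        start_text, end_text = range_part.split("-", 1)
        start, end = int(start_text), int(end_text)
    else:
        start = end = int(range_part)

    # Every step-th value, counted from the lower bound of the range
    return list(range(start, end + 1, step))


print(expand_cron_field("*/55", 0, 59))    # minutes field
print(expand_cron_field("3-23/7", 0, 23))  # hours field
```

The first call yields minutes 0 and 55, and the second yields hours 3, 10 and 17, matching the examples in the text.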
This works best when the interval you want happens to divide evenly into the next higher unit of time. For example, you can easily specify an event to occur every 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, or 30 minutes, or every 1, 2, 3, 4, 6, or 12 hours. (Thank the Babylonians for choosing time units with so many nice divisors.)
Unfortunately, cron has no concept of "every 55 minutes" within a time range longer than an hour.
If you want to run a job every 55 minutes (say, at 00:00, 00:55, 01:50, 02:45, etc.), you'll have to do it indirectly. One approach is to schedule a script to run every 5 minutes; the script then checks the current time, and does its work only once every 11 times it's called.
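One way to sketch that check in Python is to count minutes since the Unix epoch instead of keeping a call counter on disk. This is an assumption-laden illustration: `is_due` is a hypothetical helper, the wrapper script is assumed to be started by a `*/5 * * * *` crontab line, and using the epoch as the reference point is a choice made here, not something the original approach prescribes.

```python
import time


def is_due(now_epoch_s, interval_min=55, tick_min=5):
    """Return True on the 5-minute tick that lands on a multiple of interval_min.

    Minutes are counted from the Unix epoch, so exactly one tick in every
    eleven qualifies when interval_min is 55 and tick_min is 5.
    """
    minutes = int(now_epoch_s // 60)
    tick = minutes - (minutes % tick_min)  # tolerate a late start within the tick
    return tick % interval_min == 0


def main():
    # Called by cron every 5 minutes; the real work happens only when due.
    if is_due(time.time()):
        print("running the job")
```

Calling `main()` from the scheduled script keeps the crontab entry trivial while the interval logic lives in one place.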
Or you can use multiple lines in your crontab file to run the same job at 00:00, 00:55, 01:50, etc. -- except that a day is not a multiple of 55 minutes. If you don't mind having a longer or shorter interval once a day, week, or month, you can write a program to generate a large crontab with as many entries as you need, all running the same command at a specified time.
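A generator for such a crontab could look like the following Python sketch. It covers a single day only, so the last interval before midnight is shorter than 55 minutes; the function name and the command path are hypothetical placeholders.

```python
def crontab_lines(command, interval_min=55):
    """Build one crontab line per run time, restarting the cycle at midnight."""
    lines = []
    for minute_of_day in range(0, 24 * 60, interval_min):
        hour, minute = divmod(minute_of_day, 60)
        lines.append(f"{minute} {hour} * * * {command}")
    return lines


# "/usr/local/bin/job.sh" is a placeholder, not a path from the original post
lines = crontab_lines("/usr/local/bin/job.sh")
print(len(lines))
print(lines[0])
print(lines[-1])
```

The final entry runs at 23:50, which leaves a 10-minute gap before the cycle restarts at 00:00.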
I came across this website that is helpful with regard to cron jobs.
https://crontab.guru
And specific to your case with */55:
https://crontab.guru/#*/55_*_*_*_*
It helped to get a better understanding of the concept behind it.
There is another tool named at that should be considered. It can be used instead of cron to achieve what the topic starter wants. As far as I remember, it is pre-installed in OS X but it isn't bundled with some Linux distros like Debian (simply apt install at).
It runs a job at a specific time of day and that time can be calculated using a complex specification. In our case the following can be used:
You can also give times like now + count time-units, where the time-units can be minutes, hours, days, or weeks and you
can tell at to run the job today by suffixing the time with today and to run the job tomorrow by suffixing the time with tomorrow.
The script every2min.sh is executed every 2 minutes. Each run schedules the next execution before doing its work:
#!/bin/sh
at -f ./every2min.sh now + 2 minutes
echo "$(date +'%F %T') running..." >> /tmp/every2min.log
Which outputs
2019-06-27 14:14:23 running...
2019-06-27 14:16:00 running...
2019-06-27 14:18:00 running...
As at does not know about a "seconds" unit, the execution time will be rounded to a full minute after the first run. For the task at hand (a 55 minute interval) this should not be a big problem.
There might also be security considerations:
For both at and batch, commands are read from standard input or the file specified with the -f option and executed. The working directory, the environment (except for the variables BASH_VERSINFO, DISPLAY, EUID, GROUPS, SHELLOPTS, TERM, UID, and _) and the umask are retained from the time of invocation.
This is the easiest way to schedule something to be run every X minutes I've seen so far.

Getting accurate graphite stats_counts

We have an etsy/statsd Node application running that flushes stats to carbon/whisper every 10 seconds. If you send 100 increments (counts) in the first 10 seconds, Graphite displays them properly, like:
localhost:3000/render?from=-20min&target=stats_counts.test.count&format=json
[{"target": "stats_counts.test.count", "datapoints": [
[0.0, 1372951380], [0.0, 1372951440], ...
[0.0, 1372952460], [100.0, 1372952520]]}]
However, 10 seconds later, this number falls to 0, null, or 33.3. Eventually it settles at a value 1/6th of the initial number of increments, in this case 16.6.
/opt/graphite/conf/storage-schemas.conf is:
[sixty_secs_for_1_days_then_15m_for_a_month]
pattern = .*
retentions = 10s:10m,1m:1d,15m:30d
I would like to get accurate counts. Is Graphite perhaps averaging the data over the 60 second windows rather than summing it? Using the integral function, after some time has passed, obviously gives:
localhost:3000/render?from=-20min&target=integral(stats_counts.test.count)&format=json
[{"target": "stats_counts.test.count", "datapoints": [
[0.0, 1372951380], [16.6, 1372951440], ...
[16.6, 1372952460], [16.6, 1372952520]]}]
Graphite data storage
Graphite manages the retention of data using a combination of the settings stored in storage-schemas.conf and storage-aggregation.conf. I see that your retention policy (the snippet from your storage-schemas.conf) is telling Graphite to store 1 data point per 10 seconds at its highest resolution (e.g. 10s:10m) and that it should manage the aggregation of those data points as the data ages and moves into the older intervals (with the lower resolution defined - e.g. 1m:1d). In your case, the data crosses into the next retention interval at 10 minutes, and after 10 minutes the data will roll up according to the settings in storage-aggregation.conf.
Aggregation / Downsampling
Aggregation/downsampling happens when data ages and falls into a time interval that has lower resolution retention specified. In your case, you'll have been storing 1 data point for each 10 second interval but once that data is over 10 minutes old graphite now will store the data as 1 data point for a 1 minute interval. This means you must tell graphite how it should take the 10 second data points (of which you have 6 for the minute) and aggregate them into 1 data point for the entire minute. Should it average? Should it sum? Depending on the type of data (e.g. timing, counter) this can make a big difference, as you hinted at in your post.
By default graphite will average data as it aggregates into lower resolution data. Using average to perform the aggregation makes sense when applied to timer (and even gauge) data. That said, you are dealing with counters so you'll want to sum.
For example, in storage-aggregation.conf:
[count]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum
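The difference between the two methods is easy to see on the numbers from the question. The following Python sketch is a simplified model of a rollup, not whisper's implementation; `rollup` is a hypothetical helper, and its handling of xFilesFactor (return None when too few points are known) is an approximation.

```python
def rollup(points, method, x_files_factor=0.0):
    """Aggregate high-resolution points into one lower-resolution point."""
    known = [p for p in points if p is not None]
    if not known or len(known) / len(points) < x_files_factor:
        return None  # not enough known data to produce a value
    if method == "sum":
        return sum(known)
    if method == "average":
        return sum(known) / len(known)
    raise ValueError(f"unsupported method: {method}")


# Six 10-second points in one minute: 100 increments, then nothing
ten_second_points = [100, 0, 0, 0, 0, 0]

print(rollup(ten_second_points, "average"))
print(rollup(ten_second_points, "sum"))
```

Averaging produces roughly 16.67, the value the question ends up seeing, while summing preserves the 100 increments.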
UI (and raw data) aggregation / downsampling
It is also important to understand how the aggregated/downsampled data is represented when viewing a graph or looking at raw (json) data for different time periods, as the data retention schema thresholds directly impact the graphs. In your case you are querying render?from=-20min which crosses your 10s:10m boundary.
Graphite will display (and perform realtime downsampling of) data according to the lowest-resolution precision defined. Stated another way, it means if you graph data that spans one or more retention intervals you will get rollups accordingly. An example will help (assuming the retention of: retentions = 10s:10m,1m:1d,15m:30d)
Any graph with data no older than the last 10 minutes will be displaying 10 second aggregations. When you cross the 10 minute threshold, you will begin seeing 1 minute worth of count data rolled up according to the policy set in the storage-aggregation.conf.
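Which resolution a query gets can be approximated with a few lines of Python. This is a simplified model that assumes the highest-resolution archive whose retention covers the whole requested range is the one used; the helper names are hypothetical and the parser only understands s, m, h and d suffixes.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(token):
    """Convert a token such as '10s' or '30d' to seconds."""
    return int(token[:-1]) * UNIT_SECONDS[token[-1]]


def parse_retentions(text):
    """Parse '10s:10m,1m:1d' into a list of (precision_s, retention_s) pairs."""
    archives = []
    for part in text.split(","):
        precision, retention = part.strip().split(":")
        archives.append((to_seconds(precision), to_seconds(retention)))
    return archives


def precision_for_query(archives, lookback_s):
    """Pick the first archive that still covers the requested look-back."""
    for precision_s, retention_s in archives:
        if retention_s >= lookback_s:
            return precision_s
    return archives[-1][0]  # older than everything: coarsest archive


archives = parse_retentions("10s:10m,1m:1d,15m:30d")
print(precision_for_query(archives, 5 * 60))   # from=-5min
print(precision_for_query(archives, 20 * 60))  # from=-20min
```

Under this model a 5 minute query is served at 10 second precision, while the 20 minute query from the question falls into the 1 minute archive.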
Summary / tldr;
Because you are graphing/querying for 20 minutes worth of data (e.g. render?from=-20min) you are definitely falling into a lower precision storage setting (i.e. 10s:10m,1m:1d,15m:30d) which means that aggregation is occurring according to your aggregation policy. You should confirm that you are using sum for the correct pattern in the storage-aggregation.conf file. Additionally, you can shorten the graph/query time range to less than 10min which would avoid the dynamic rollup.
