Storing an ever-increasing value in Graphite

I am monitoring a queue to track the number of error messages over time in Graphite. The errored messages sit in an end state of 'error', so simply sending the count to a gauge or counter does not work as expected.
To illustrate:
minute  path          value  change
------  ------------  -----  --------------------------
1       queue.errors  3000   3000 errors (the baseline)
2       queue.errors  3002   2 errors
3       queue.errors  3005   3 errors
4       queue.errors  3010   5 errors
Is there a way via statsd or graphite to handle this sort of metric? The queue count of errors will increase forever but the meaningful value is really only the change since the last value.
I've read about statsd's Gauge Delta which looks to support the opposite of what I need:
statsd.gauge('foo', 70) # Set the 'foo' gauge to 70.
statsd.gauge('foo', 1, delta=True) # Set 'foo' to 71.
But, what I really need is this:
statsd.gauge('foo', 70) # Set the 'foo' gauge to 70.
statsd.gauge('foo', 71, ?????=True) # Set 'foo' to 1.

Funny that you should ask: there is an ongoing thread about the ability to track absolute values in statsd and get the derivative. See https://github.com/etsy/statsd/issues/324
The gist of it is that statsd does not support counters defined as an ever-increasing natural number. Older tools like rrdtool do, but they don't offer nearly the same flexibility as statsd.
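A common client-side workaround (a minimal sketch, not taken from that thread, and assuming the usual statsd Python package with its StatsClient API) is to remember the previous absolute reading yourself and send only the difference as a regular counter:
import statsd

client = statsd.StatsClient('localhost', 8125)
_previous = None  # last absolute queue.errors reading we saw

def report_error_delta(current_total):
    """Send only the change since the previous absolute reading."""
    global _previous
    if _previous is not None and current_total >= _previous:
        client.incr('queue.errors', current_total - _previous)  # e.g. 3002 - 3000 -> 2
    _previous = current_total
With the example above this sends 2, 3, 5, ... instead of the running total.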

If you want to get the rate of change from a gauge or counter value, just enclose it in either the derivative() or nonNegativeDerivative() functions.
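For example, assuming the absolute count is stored at the queue.errors path from the question, the render target would be:
nonNegativeDerivative(queue.errors)
which plots the per-interval change (2, 3, 5, ...) instead of the ever-growing total, and unlike a plain derivative() it ignores the negative spike you would otherwise see if the counter ever resets.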

Related

How to get the accumulated value of an OpenTSDB counter using Bosun?

I have a counter named, for example, "mysvr.method_name1", with 3 tag key/value pairs. It is an OpenTSDB counter type, which in my situation counts query calls. How can I get its accumulated value over the past 30 days (in my case, the total number of requests in those 30 days)?
I use the q() function like below:
q("sum:mysvr.method_name1{tag1=v1}", "1590940800", "1593532800")
but it looks like the series is not monotonically increasing, due to server restarts, missing tag keys/values, or other reasons.
So it seems like the query below will not meet my requirement:
diff(q("sum:mysvr.method_name1{tag1=v1}", "1590940800", "1593532800"))
How can I fetch the accumulated value of the counter for the given time period?
The only thing I am sure of is that the expression below gives the mean QPS in my situation:
avg(q("sum:rate{counter}:mysvr.method_name1{tag1=v1}", "1590940800", "1593532800"))
sum(q("sum:rate{counter}:mysvr.method_name1{tag1=v1}", "1590940800", "1593532800"))
works for my situation; the remaining gap is to multiply by the sample time duration, which in my situation is 30 seconds.
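Putting that together, a sketch of the full expression for the total over the period, under the 30-second sample interval mentioned above, would be:
sum(q("sum:rate{counter}:mysvr.method_name1{tag1=v1}", "1590940800", "1593532800")) * 30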

Display used CPU hours with Slurm

I have a user account on a super computer where jobs are handled with slurm.
I would like to know the total amount of CPU hours that I have consumed on this super computer. I think that's an understandable question, because there is only a limited number of CPU hours available per project. I'm surprised that an answer is not easy to find.
I know that there are all these commands like sacct, sreport, sshare, etc... but it seems that there is no simple command that displays the used CPU hours.
Can someone help me out?
As others have commented, sacct should give you that information. You will need to look at the man page to get information for past jobs. You can specify a --starttime and --endtime to restrict your query to match your allocation as it ends/renews. The -l option should give you more information than you need, so you can get a smaller set of fields by specifying what you need with --format.
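For example (a sketch only; the dates are placeholders for your own allocation window, and the field list is just one sensible choice):
sacct -u $USER --starttime 2024-01-01 --endtime 2024-06-30 --format=JobID,Elapsed,AllocCPUS,CPUTime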
In your instance, the correct answer is to ask the administrators. You have been given an allocation of time to draw from. They likely have a system that will show you your balance and you can reconcile your balance against the output of sacct. Also, if the system you are using has different node types such as high memory, GPU, MIC, or old, they will likely charge you differently for those resources.
You can get an overview of the used CPU hours with the following:
sacct -SYYYY-mm-dd -u username -ojobid,start,end,alloccpu,cputime | column -t
You could calculate the total accounting units (SBU in our system) by multiplying CPUTime by AllocCPUS, which means multiplying the total (system+user) CPU time by the number of CPUs allocated.
An example:
JobID         NodeList         State       Start                End                  AllocCPUS  CPUTime
------------  ---------------  ----------  -------------------  -------------------  ---------  ------------
6328552       tcn[595-604]     CANCELLED+  2019-05-21T14:07:57  2019-05-23T16:48:15  240        506-17:12:00
6328552.bat+  tcn595           CANCELLED   2019-05-21T14:07:57  2019-05-23T16:48:16  24         50-16:07:36
6328552.0     tcn[595-604]     FAILED      2019-05-21T14:10:37  2019-05-23T16:48:18  240        506-06:44:00
6332520       tcn[384,386,45+  COMPLETED   2019-05-23T16:06:04  2019-05-24T00:26:36  72         25-00:38:24
6332520.bat+  tcn384           COMPLETED   2019-05-23T16:06:04  2019-05-24T00:26:36  24         8-08:12:48
6332520.0     tcn[384,386,45+  COMPLETED   2019-05-23T16:06:09  2019-05-24T00:26:33  60         20-20:24:00
6332530       tcn[37,41,44,4+  FAILED      2019-05-23T17:11:31  2019-05-25T09:13:34  240        400-08:12:00
6332530.bat+  tcn37            FAILED      2019-05-23T17:11:31  2019-05-25T09:13:34  24         40-00:49:12
6332530.0     tcn[37,41,44,4+  CANCELLED+  2019-05-23T17:11:35  2019-05-25T09:13:34  240        400-07:56:00
The fields are described in the man page. They can be selected as -oOPTION (in lower case) or in proper POSIX notation with --format='Option,AnotherOption,...' (the full list is in the man page).
So far so good. But there is a big caveat here:
What you see here is perfect for getting an idea of what you have run, or what to expect in terms of CPU hours. But it will not necessarily reflect your real budget status: in many cases each node / partition has an extra parameter, the weight, which is set for accounting purposes and is not part of Slurm. For instance, the GPU nodes may have a weight of x3, which means that each GPU hour is billed as 3 SBU instead of 1. In other words, you can use sacct to gain insight into the CPU times, but this will not necessarily tell you how many SBU credits you still have.
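To make that weighting concrete (the x3 factor is just the hypothetical value from above): a job that allocated 240 cores on such a GPU partition for 10 hours shows up in sacct as 2400 CPU hours, but would be charged 2400 × 3 = 7200 SBU against your budget.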

Monitor and alert on anomalies in the number of Prometheus metrics

We have a number of prometheus servers, each one monitors its own region (actually 2 per region), there are also thanos servers that can query multiple regions, and we also use alertmanager for the alerting.
Lately we had an issue where a few metrics stopped reporting, and we only discovered it when we actually needed them.
We are trying to find out how to monitor changes in the number of reported metrics in a scalable system that grows and shrinks as required.
I'd be glad for your advice.
You can either count the number of timeseries in the head chunk (last 0-2 hours) or the rate at which you're ingesting samples:
prometheus_tsdb_head_series
or
rate(prometheus_tsdb_head_samples_appended_total[5m])
Then you compare said value with itself a few minutes/hours ago, e.g.
prometheus_tsdb_head_series / prometheus_tsdb_head_series offset 5m
and see whether it fits within an expected range (say 90-110%) and alert otherwise.
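A minimal sketch of an alerting rule built on that ratio (the group and alert names here are made up, and 0.9 is just the 90% lower bound mentioned above):
groups:
  - name: series-count-anomaly
    rules:
      - alert: ActiveSeriesDropped
        expr: prometheus_tsdb_head_series / (prometheus_tsdb_head_series offset 5m) < 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Active series count dropped by more than 10% over the last 5 minutes"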
Or you can look at the metrics with the highest cardinality only:
topk(100, count({__name__=~".+"}) by (__name__))
Note however that this last expression can be quite costly to compute, so you may want to avoid it. Plus the comparison with 5 minutes ago will not be as straightforward:
label_replace(topk(100, count({__name__=~".+"}) by (__name__)), "metric", "$1", "__name__", "(.*)")
/
label_replace(count({__name__=~".+"} offset 5m) by (__name__), "metric", "$1", "__name__", "(.*)")
You need the label_replace there because the match for the division is done on labels other than __name__. Computing this latest expression takes ~10s on my Prometheus instance with 150k series, so it's anything but fast.
And finally, whichever approach you choose, you're likely to get a lot of false positives (whenever a large job is started or taken down), to the point that it's not going to be all that useful. I would personally not bother trying.

Graphite: sumSeries() does not sum

Since this morning at 6 I have been experiencing strange behavior in Graphite.
We have two machines that collect data about received calls; I plot their charts and I also plot the sum of those two charts.
While the charts of the single machines are fine, the sum is not working anymore.
This is a screenshot of Graphite and also Grafana that shows how 4+5=5 (my math teacher is going to die for this).
This wrong sum also happens for other metrics, and I don't get why.
storage-schemas.conf
# Schema definitions for whisper files. Entries are scanned in order,
# and first match wins.
#
# [name]
# pattern = regex
# retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...
[default_1min_for_1day]
pattern = .*
retentions = 60s:1d,1h:7d,1d:1y,7d:5y
storage-aggregations.conf
# Schema definitions for whisper files. Entries are scanned in order,
# and first match wins.
#
# [name]
# pattern = regex
# retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...
[time_data]
pattern = ^stats\.timers.*
xFilesFactor = 0.5
aggregationMethod = average
[storage_space]
pattern = \.postgresql\..*
xFilesFactor = 0.1
aggregationMethod = average
[default_1min_for_1day]
pattern = .*
xFilesFactor = 0
aggregationMethod = sum
aggregation-rules.conf This may be the cause, but it was working before 6 AM. Anyway, I don't see the stats_count.all metric.
stats_counts.all.rest.req (60) = sum stats_counts.srv_*_*.rest.req
stats_counts.all.rest.res (60) = sum stats_counts.srv_*_*.rest.res
It seems that the two series were not aligned by timestamp, so the sum could not add up the points. This is visible in the following chart, where selecting a time highlights points in two different minutes (charts from Grafana).
I don't know why this happened. I restarted some services (these charts come from statsd for Python and bucky); maybe it was the fault of one of those.
NOTE: this works now. However, I would like to know if someone knows the reason and how I can solve it.
One thing you need to ensure is that the services sending metrics to Graphite do it at the same granularity as your smallest retention period or the period you will be rendering your graphs in. If the data points in the graph will be every 60 seconds, you need to send metrics every 60 seconds from each service. If the graph will be showing a data point for every hour, you can send your metrics every hour. In your case the smallest period is every 60 seconds.
I encountered a similar problem in our system: Graphite was configured with a smallest retention period of 10s:6h, but we had 7 instances of the same service generating lots of metrics and configured them to send data every 20 seconds in order to avoid overloading our monitoring. This caused an almost unavoidable misalignment, where the series from the different instances would have a datapoint every 20 seconds, but some would have it at 10, 30, 50 and others at 0, 20, 40. Depending on how many services happened to be aligned, we would get a very jagged graph, looking similar to yours.
What I did to solve this problem for time periods that were returning data in 10-second increments was to use the keepLastValue function -> keepLastValue(1). I used 1 as the parameter because I only wanted to skip one None value, since I knew our services cause this by sending once every 20 seconds rather than every 10. This way the series generated by different services never had gaps, so sums were closer to the real number and the graphs stopped having the jagged look. I guess this introduced a bit of extra lag in the monitoring, but that is acceptable for our use case.
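For example, assuming the series names from the aggregation rules above, the combined target would look something like:
sumSeries(keepLastValue(stats_counts.srv_*_*.rest.req, 1))
so each input series carries its last known value over a single missing slot before the sum is computed.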

Calculation of data delta

I'm writing a server that sends a "coordinates buffer" of game objects to clients every 300ms. But I don't want to send the full data each time. For example, suppose I have an array with elements that change over time:
0 0 100 50 -100 -50 at time t
0 10 100 51 -101 -50 at time t + 300ms
You can see that only the 2nd, 4th, and 5th elements have changed.
What is the right way to send not all the elements, but only the delta? Ideally I'd like a function that returns the complete data the first time and empty data when there are no changes.
Thanks.
Are you looking to optimize for efficiency, or is this a learning exercise? Some thoughts:
Unless there's a lot of data, it's probably easiest, and not terribly inefficient, to send all the data each time.
If you send deltas for all of the data points each time, you won't save much by sending zeroes for unchanged points instead of re-sending the previous values.
If you send data for only those points that change, you'll need to provide an index for each value. For example, if point 3 increases by 5 and point 8 decreases by 2, then you might send 3 5 8 -2. But now, since you're sending two values for each point that changes, you'll only win if fewer than half the points change.
If the values change relatively slowly, as compared to the rate at which you transmit updates, you might increase efficiency by transmitting the delta for each data point, but using only a few bits. For example, with 4 bits you can transmit values from -8 to +7. That would work as long as the deltas are never larger than that, or if it's ok to transmit several deltas before they "catch up" to the actual values.
It may not be worthwhile to have 2 different mechanisms: one to send the initial values, and another to send deltas. If you can tolerate the lag, it may make more sense to assume some constant initial value for every point, and then transmit only deltas.
There are lots of options. If most data isn't changing, just send (index,value) pairs of the changed elements. If most values change but the changes are small, compute deltas and gzip (or run length encode, or lots of other possibilities) the result.
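As a sketch of the (index, value) approach in plain Python (the function name is made up and the wire encoding is left to you):
def diff_state(prev, curr):
    # First send: return the complete state as (index, value) pairs.
    if prev is None:
        return list(enumerate(curr))
    # Later sends: only the elements that changed; an empty list if nothing did.
    return [(i, v) for i, (p, v) in enumerate(zip(prev, curr)) if p != v]

# With the example from the question:
# diff_state([0, 0, 100, 50, -100, -50], [0, 10, 100, 51, -101, -50])
# returns [(1, 10), (3, 51), (4, -101)] -- i.e. the 2nd, 4th, and 5th elements.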
