How to get the accumulated value of an OpenTSDB counter using Bosun? - opentsdb

I have a counter named, for example, "mysvr.method_name1", with 3 tag key/value pairs. It is an OpenTSDB counter that, in my situation, counts query requests. How can I get its accumulated value over the past 30 days (in my situation, the total number of requests in 30 days)?
I use the q() function like below:
q("sum:mysvr.method_name1{tag1=v1}", "1590940800", "1593532800")
but it looks like the series is not monotonically increasing, due to server restarts, missing tag key/values, or other reasons.
So it seems the query below will not meet my requirement:
diff(q("sum:mysvr.method_name1{tag1=v1}", "1590940800", "1593532800"))
How should I fetch the accumulated value of the counter over the given time period?
The only thing I can confirm is that the following gives the mean QPS in my situation:
avg(q("sum:rate{counter}:mysvr.method_name1{tag1=v1}", "1590940800", "1593532800"))

sum(q("sum:rate{counter}:mysvr.method_name1{tag1=v1}", "1590940800", "1593532800"))
works for my situation; the result just needs to be multiplied by the sample interval, which in my case is 30 seconds.
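For reference, the arithmetic behind that last expression can be sketched in Python (an illustration only; rate_samples stands in for the per-second rate series the query returns, sampled every 30 seconds):

# Made-up per-second request rates, one value per 30-second sample.
rate_samples = [120.0, 118.5, 130.2]
sample_interval = 30  # seconds between consecutive samples

# Each sample covers roughly one interval, so total requests ~= sum(rates) * interval.
total_requests = sum(rate_samples) * sample_interval
print(total_requests)  # 11061.0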

Related

Equation to distribute items unevenly

I'm writing a javascript program that sends a list of MIDI signals over a specified period of time.
If the signals are sent evenly, it's easy to determine how long to wait in between each signal: it's just the total duration divided by the number of signals.
However, I want to be able to offer a setting where the signals aren't sent equally: either the signals are sent with increasing or decreasing speed. In either case, the number of signals and the total amount of time remain the same.
Here's a picture to visualize what I'm talking about
Is there a simple logarithmic/exponential function where I can compute what these values are? I'm especially hoping it might be possible to use the same equation for both, simply changing a variable.
Thank you so much!
Since you do not give any method for computing a pulse time, from the previous value or any other way, I assume we are free to come up with our own.
In both of your cases, it looks like you start with an initial time interval: let's call it a. Then the next interval is that value multiplied by a constant ratio: let's call it r. In the first decreasing case, your value of r is between zero and one (it looks like around 0.6), while in the second case your value of r is greater than one (around 1.6). So your time intervals, in Python notation, are
a, a*r, a*r**2, a*r**3, ...
Then the time of each signal is the sum of a geometric series,
a * (1 - r**n) / (1 - r)
where n is the number of the pulse (1 for the first, 2 for the second, etc.). That formula is valid if r is not one, but if r is one then the sequence is a trivial sequence of a regular signal and the nth signal is given at time
a * n
This is not a "fixed result" since you have two degrees of freedom--you can choose values of a and of r.
If you want to spread the signals more evenly, just bring r closer to one. A value of one is perfectly even, a value farther from one is more clumped at one end. One disadvantage of this method is that if the signal intervals are decreasing then the signals will completely stop at some point, namely at
a / (1 - r)
If you have signals already sent or received and you want to find the value of r, just look at the time intervals between three consecutive signals: r is the interval between the 2nd and 3rd signals divided by the interval between the 1st and 2nd signals. If you want to see whether this model is a good one for a given set of signals, check the value of r at multiple signals; if the value of r is nearly constant, then this is a good model.
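A small Python sketch of the answer above, assuming you are given the number of signals n, the total duration, and a ratio r (the function name and test values are illustrative):

def signal_times(n, total, r):
    """Times of n signals spread over `total` seconds, with consecutive
    intervals scaled by the constant ratio r."""
    if r == 1.0:
        a = total / n                       # even spacing
        return [a * k for k in range(1, n + 1)]
    a = total * (1 - r) / (1 - r ** n)      # first interval, from the geometric sum
    return [a * (1 - r ** k) / (1 - r) for k in range(1, n + 1)]

print(signal_times(5, 10.0, 0.6))   # intervals shrink, so the signals speed up
print(signal_times(5, 10.0, 1.6))   # intervals grow, so the signals slow down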

Graphite: sumSeries() does not sum

Since this morning at 6 I've been experiencing strange behavior from Graphite.
We have two machines that collect data about received calls; I plot their charts and I also plot the sum of the two.
While the charts for each single machine are fine, the sum is not working anymore.
Here is a screenshot from Graphite (and also Grafana) that shows how 4 + 5 = 5 (my math teacher is going to die over this).
This wrong sum also happens for other metrics, and I don't get why.
storage-schemas.conf
# Schema definitions for whisper files. Entries are scanned in order,
# and first match wins.
#
# [name]
# pattern = regex
# retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...
[default_1min_for_1day]
pattern = .*
retentions = 60s:1d,1h:7d,1d:1y,7d:5y
storage-aggregation.conf
# Schema definitions for whisper files. Entries are scanned in order,
# and first match wins.
#
# [name]
# pattern = regex
# retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...
[time_data]
pattern = ^stats\.timers.*
xFilesFactor = 0.5
aggregationMethod = average
[storage_space]
pattern = \.postgresql\..*
xFilesFactor = 0.1
aggregationMethod = average
[default_1min_for_1day]
pattern = .*
xFilesFactor = 0
aggregationMethod = sum
aggregation-rules.conf (this may be the cause, but it was working before 6 AM; anyway, I don't see the stats_counts.all metric):
stats_counts.all.rest.req (60) = sum stats_counts.srv_*_*.rest.req
stats_counts.all.rest.res (60) = sum stats_counts.srv_*_*.rest.res
It seems that the two series were not aligned by timestamp, so the sum could not add the points together. This is visible in the following chart, where selecting a time highlights points in two different minutes (charts from Grafana).
I don't know why this happened. I restarted some services (these charts come from statsd for Python and Bucky); maybe one of those was at fault.
NOTE: it works now; however, I would like to know if someone knows the reason and how I can solve it.
One thing you need to ensure is that the services sending metrics to Graphite do it at the same granularity as your smallest retention period or the period you will be rendering your graphs in. If the data points in the graph will be every 60 seconds, you need to send metrics every 60 seconds from each service. If the graph will be showing a data point for every hour, you can send your metrics every hour. In your case the smallest period is every 60 seconds.
I encountered a similar problem in our system: Graphite was configured with a smallest retention period of 10s:6h, but we had 7 instances of the same service generating lots of metrics, and we configured them to send data every 20 seconds in order to avoid overloading our monitoring. This caused an almost unavoidable misalignment, where the series from the different instances each had a datapoint every 20 seconds, but some had them at 10, 30, 50 and others at 0, 20, 40. Depending on how many services were aligned, we would get a very jagged graph, looking similar to yours.
What I did to solve this problem for time periods that were returning data in 10 second increments was to use the keepLastValue function -> keepLastValue(1). I used 1 as parameter, because I only wanted to skip 1 None value, because I knew our service causes this by sending once every 20 seconds rather than every 10. This way the series generated by different services never had gaps, so sums were closer to the real number and the graphs stopped having the jagged look. I guess this introduced a bit of extra lag in the monitoring, but this is acceptable for our use case.
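A minimal Python sketch of the misalignment effect described above (timestamps and values are made up): two series report every 20 seconds but on clocks offset by 10 seconds, so a naive per-timestamp sum only ever sees one of them, while carrying the last value forward (what keepLastValue does) restores the expected total.

# Service A reports at 0, 20, 40; service B at 10, 30, 50.
series_a = {0: 4, 20: 4, 40: 4}
series_b = {10: 5, 30: 5, 50: 5}
timestamps = sorted(set(series_a) | set(series_b))

# Naive sum: at any timestamp only one series has a point, so 4 + 5 "becomes" 4 or 5.
naive = [series_a.get(t, 0) + series_b.get(t, 0) for t in timestamps]

# keepLastValue-style sum: fill each gap with the last known value before adding.
def forward_fill(series, ts):
    last, out = None, {}
    for t in ts:
        last = series.get(t, last)
        out[t] = last if last is not None else 0
    return out

a_filled = forward_fill(series_a, timestamps)
b_filled = forward_fill(series_b, timestamps)
aligned = [a_filled[t] + b_filled[t] for t in timestamps]

print(naive)    # [4, 5, 4, 5, 4, 5] -- never the true total
print(aligned)  # [4, 9, 9, 9, 9, 9] -- the expected 4 + 5 once both have reported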

How to multiply two series lists in Grafana / Graphite?

I have data in graphite in following format:
app.service.method_*.m1_rate (rate of calls per minute)
app.service.method_*.avg_time (avg response time per minute)
I would like to get a graph with the total estimated time a given method spends running per minute. In other words, multiply the rate by the average time, so I can see from one graph which calls are taking the most time. Once I have that going, I can limit it (I know how :) ) to the top N results of the multiplication.
Neither the rate by itself gives me that information (a high rate of very fast calls is not a problem), nor does the average time (a high average time on a service called once every 5 minutes is also not a problem).
Any suggestions?
This can be done with multiplySeriesWithWildcards:
multiplySeriesWithWildcards(app.service.method_*.{m1_rate,avg_time}, 3)
Here 3 is the zero-based position of the node to collapse into a wildcard (the m1_rate/avg_time leaf), so each method's rate series is multiplied by its own avg_time series.
Maybe multiplySeries could help you.
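As a quick illustration of why the product is the interesting quantity, here is a tiny Python example with made-up numbers (the method names and values are hypothetical):

# Made-up per-minute stats for two methods.
methods = {
    "fast_but_hot":  {"m1_rate": 6000, "avg_time": 0.002},  # many cheap calls
    "slow_but_rare": {"m1_rate": 3,    "avg_time": 0.500},  # few expensive calls
}

for name, m in methods.items():
    busy_seconds_per_minute = m["m1_rate"] * m["avg_time"]
    print(name, busy_seconds_per_minute)
# fast_but_hot 12.0   <- dominates total time despite its low avg_time
# slow_but_rare 1.5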

Say a customer could enter a bank randomly every 2-6 seconds, what would be the statistical percentage of a person entering each second?

I'm writing a bank simulation program and I'm trying to find that percentage so I know how often to generate a new person, based on a timer that executes code every second. Sorry if it sounds kind of confusing, but I appreciate any help!
If you need to generate a new person entity every 2-6 seconds, why not generate a random number between 2 and 6, and set the timer to wait that amount of time. When the timer expires, generate the new customer.
However, if you really want the equivalent probability, you can get it by asking what it represents: the stochastic experiment is "at any given second of the clock, what is the probability of a client entering, such that it results in one client every 2-6 seconds?". Pick a specific incidence: say one client every 2 seconds. If on average you get 1 client every 2 seconds, then clearly the probability of getting a client at any given second is 1/2. If on average you get 1 client every 6 seconds, the probability of getting a client at any given second is 1/6.
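A quick Python sanity check of that reasoning (the function and values are illustrative): if each second has an independent 1/k chance of a client entering, the observed average gap comes out close to k seconds.

import random

def average_gap(avg_gap_seconds, total_seconds=100_000):
    """Each second a client enters with probability 1/avg_gap_seconds;
    return the observed average gap between clients."""
    p = 1.0 / avg_gap_seconds
    arrivals = sum(1 for _ in range(total_seconds) if random.random() < p)
    return total_seconds / arrivals

print(average_gap(2))  # close to 2 seconds between clients on average
print(average_gap(6))  # close to 6 seconds between clients on average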
The Poisson distribution gives the probability of observing k independent events in a period for which the average number of events is λ:
P(k) = λ^k * e^(-λ) / k!
This covers the case of more than one customer arriving at the same time.
The easiest way to generate Poisson distributed random numbers is to repeatedly draw from the exponential distribution, which yields the waiting time for the next event, until the total time exceeds the period.
int k = 0;          /* number of arrivals counted in the period */
double t = 0.0;     /* running time in seconds */
while (t < period)
{
    /* exponential waiting time until the next arrival */
    t += -log(1.0 - rnd()) / lambda;
    if (t < period) ++k;
}
where rnd returns a uniform random number between 0 and (strictly less than) 1, period is the number of seconds and lambda is the average number of arrivals per second (or, as noted in the previous answer, 1 divided by the average number of seconds between arrivals).
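For comparison, here is a rough Python equivalent of that loop, using the standard library's exponential generator (a sketch; the function name and the example rate are illustrative):

import random

def poisson_arrivals(lam, period):
    """Count arrivals in `period` seconds when the average rate is `lam` per second,
    by accumulating exponential waiting times until the period is exceeded."""
    k, t = 0, 0.0
    while True:
        t += random.expovariate(lam)   # waiting time until the next arrival
        if t >= period:
            return k
        k += 1

print(poisson_arrivals(1 / 4.0, 60))   # arrivals in one minute, one every ~4 s on average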

Getting accurate graphite stats_counts

We have an etsy/statsd Node application running that flushes stats to Carbon/Whisper every 10 seconds. If you send 100 increments (counts) in the first 10 seconds, Graphite displays them properly, like:
localhost:3000/render?from=-20min&target=stats_counts.test.count&format=json
[{"target": "stats_counts.test.count", "datapoints": [
[0.0, 1372951380], [0.0, 1372951440], ...
[0.0, 1372952460], [100.0, 1372952520]]}]
However, 10 seconds later this number falls to 0, null, and/or 33.3. Eventually it settles at a value 1/6th of the initial number of increments, in this case 16.6.
/opt/graphite/conf/storage-schemas.conf is:
[sixty_secs_for_1_days_then_15m_for_a_month]
pattern = .*
retentions = 10s:10m,1m:1d,15m:30d
I would like to get accurate counts. Is Graphite perhaps averaging the data over the 60-second windows rather than summing it? Using the integral function, after some time has passed, obviously gives:
localhost:3000/render?from=-20min&target=integral(stats_counts.test.count)&format=json
[{"target": "stats_counts.test.count", "datapoints": [
[0.0, 1372951380], [16.6, 1372951440], ...
[16.6, 1372952460], [16.6, 1372952520]]}]
Graphite data storage
Graphite manages the retention of data using a combination of the settings stored in storage-schemas.conf and storage-aggregation.conf. I see that your retention policy (the snippet from your storage-schemas.conf) is telling Graphite to store 1 data point per 10 seconds at its highest resolution (the 10s:10m rule) and to manage the aggregation of those data points as the data ages and moves into the older, lower-resolution intervals (e.g. 1m:1d). In your case, the data crosses into the next retention interval at 10 minutes, and after 10 minutes it will roll up according to the settings in storage-aggregation.conf.
Aggregation / Downsampling
Aggregation/downsampling happens when data ages and falls into a time interval that has a lower-resolution retention specified. In your case, you'll have been storing 1 data point for each 10-second interval, but once that data is over 10 minutes old Graphite will store it as 1 data point per 1-minute interval. This means you must tell Graphite how to take the 10-second data points (of which you have 6 per minute) and aggregate them into 1 data point for the entire minute. Should it average? Should it sum? Depending on the type of data (e.g. timing, counter) this can make a big difference, as you hinted at in your post.
By default graphite will average data as it aggregates into lower resolution data. Using average to perform the aggregation makes sense when applied to timer (and even gauge) data. That said, you are dealing with counters so you'll want to sum.
For example, in storage-aggregation.conf:
[count]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum
UI (and raw data) aggregation / downsampling
It is also important to understand how the aggregated/downsampled data is represented when viewing a graph or looking at raw (json) data for different time periods, as the data retention schema thresholds directly impact the graphs. In your case you are querying render?from=-20min which crosses your 10s:10m boundary.
Graphite will display (and perform realtime downsampling of) data according to the lowest-resolution precision defined. Stated another way, it means if you graph data that spans one or more retention intervals you will get rollups accordingly. An example will help (assuming the retention of: retentions = 10s:10m,1m:1d,15m:30d)
Any graph with data no older than the last 10 minutes will be displaying 10 second aggregations. When you cross the 10 minute threshold, you will begin seeing 1 minute worth of count data rolled up according to the policy set in the storage-aggregation.conf.
Summary / tldr;
Because you are graphing/querying for 20 minutes worth of data (e.g. render?from=-20min) you are definitely falling into a lower precision storage setting (i.e. 10s:10m,1m:1d,15m:30d) which means that aggregation is occurring according to your aggregation policy. You should confirm that you are using sum for the correct pattern in the storage-aggregation.conf file. Additionally, you can shorten the graph/query time range to less than 10min which would avoid the dynamic rollup.
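The 16.6 value in the question drops straight out of that arithmetic; a sketch, assuming all 100 increments landed in a single 10-second slot of the minute:

# 100 increments arrive in one 10-second flush; the other five 10 s slots of that minute are 0.
ten_second_counts = [100, 0, 0, 0, 0, 0]

# Default aggregation (average) when rolling 10 s points up to 1 minute:
print(sum(ten_second_counts) / len(ten_second_counts))  # 16.666... -> the "1/6th" value

# aggregationMethod = sum keeps the true count:
print(sum(ten_second_counts))  # 100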