graphite + statsd, missing stats? - graphite

We use statsd as an aggregator that flushes to Graphite every 60 seconds.
I can see Graphite filling the "stats.timers" buckets, but not all of the expected ones.
on the graphite machine:
graphite:/opt/graphite # find .../xxx/desktopapp/members/contacting -name "*.wsp"
.../xxx/desktopapp/members/contacting/lastVisitors/mean_90.wsp
.../xxx/desktopapp/members/contacting/lastVisitors/sum.wsp
.../xxx/desktopapp/members/contacting/lastVisitors/std.wsp
.../xxx/desktopapp/members/contacting/welcome/count_ps.wsp
.../xxx/desktopapp/members/contacting/feditWelcome/mean.wsp
.../xxx/desktopapp/members/contacting/contacts/count.wsp
.../xxx/desktopapp/members/contacting/contacts/sum_90.wsp
.../xxx/desktopapp/members/contacting/preContact/count_ps.wsp
.../xxx/desktopapp/members/contacting/preContact/mean_90.wsp
.../xxx/desktopapp/members/contacting/preContact/sum.wsp
.../xxx/desktopapp/members/contacting/preContact/std.wsp
.../xxx/desktopapp/members/contacting/preContact/count.wsp
.../xxx/desktopapp/members/contacting/preContact/sum_90.wsp
.../xxx/desktopapp/members/contacting/fedit/upper.wsp
.../xxx/desktopapp/members/contacting/preWelcome/count_ps.wsp
.../xxx/desktopapp/members/contacting/preWelcome/sum.wsp
.../xxx/desktopapp/members/contacting/preWelcome/std.wsp
.../xxx/desktopapp/members/contacting/contact/count_ps.wsp
.../xxx/desktopapp/members/contacting/contact/sum.wsp
.../xxx/desktopapp/members/contacting/contact/std.wsp
.../xxx/desktopapp/members/contacting/favorite/median.wsp
Looking at the statsd source code (https://github.com/etsy/statsd/blob/master/lib/process_metrics.js) I would expect the following metrics to appear (each as its own bucket) for each thing I time.
source:
current_timer_data["std"] = stddev;
current_timer_data["upper"] = max;
current_timer_data["lower"] = min;
current_timer_data["count"] = timer_counters[key];
current_timer_data["count_ps"] = timer_counters[key] / (flushInterval / 1000);
current_timer_data["sum"] = sum;
current_timer_data["mean"] = mean;
current_timer_data["median"] = median;
Does anybody have any idea why for some timers I only get "count_ps" and for others only "upper"? Does it take some time for Graphite to process its internal statistics queue(s)?
The statsd log says roughly 500 numStats per minute are sent:
13 Mar 10:13:53 - DEBUG: numStats: 498
13 Mar 10:14:53 - DEBUG: numStats: 506
13 Mar 10:15:53 - DEBUG: numStats: 491
13 Mar 10:16:53 - DEBUG: numStats: 500
13 Mar 10:17:53 - DEBUG: numStats: 488
13 Mar 10:18:53 - DEBUG: numStats: 482
13 Mar 10:19:53 - DEBUG: numStats: 486
any help highly appreciated
cheers
marcel

I have seen that for sparse data sets it seems to take a while for all the stats to show up in Graphite. I don't know the exact threshold, but in my experience a certain amount of data has to be pushed into Graphite for a metric before all the different timer stats appear.

#marcel, did you configure percentThreshold in the local.js of your statsd?
Also, to get the "upper" metrics, first check how the metrics are being sent to statsd.
For example, to use statsd's timer buckets you have to mark the metric as a timer when sending it:
echo "xx.yy.zz:<data point>|ms"

Related

Power BI DAX - Throughput Board Time Intelligence Metrics with different time zones

Dealing with a bit of a head scratcher. This is more of a logic based issue rather than actual Power BI code. Hoping someone can help out! Here's the scenario:
Site  Shift Num  Start Time  End Time  Daily Output
A     1          8:00AM      4:00PM    10000
B     1          7:00AM      3:00PM    12000
B     2          4:00PM      2:00AM    7000
C     1          6:00AM      2:00PM    5000
This table contains the sites as well as their respective shift times. The master table above is part of an effort to capture throughput data from each of these sites. It is connected to tables with a running log of output for each site and shift, like so:
Site  Shift Number  Output  Timestamp
A     1             2500    9:45 AM
A     1             4200    11:15 AM
A     1             5600    12:37 PM
A     1             7500    2:15 PM
So there is a one-to-many relationship between the master table and these child throughput tables. The goal is to use a gauge chart with the following metrics:
Value: Latest Throughput Value (Latest Output in Child Table)
Maximum Value: Throughput Target for the Day (Shift Target in Master Table)
Target Value: Time-dependent project target
i.e. if we are halfway through Site A's shift 1, we should be at 5000 units: (time passed in shift / total shift time) * shift output
if the shift is currently in non-working hours, then the target value = maximum value
Easy enough, but the problem we are facing is that the target value errors out for shifts that cross into the next day (i.e. Site B's shift 2).
The shift times are stored as date-independent time values. Here's the code for the measure to get the target value:
Var CurrentTime = HOUR(UTCNOW()) * 60 + MINUTE(UTCNOW())
VAR ShiftStart = HOUR(MAX('mtb MasterTableUTC'[ShiftStartTimeUTC])) * 60 + MINUTE(MAX('mtb MasterTableUTC'[ShiftStartTimeUTC]))
VAR ShiftEnd = HOUR(MAX('mtb MasterTableUTC'[ShiftEndTimeUTC])) * 60 + MINUTE(MAX('mtb MasterTableUTC'[ShiftEndTimeUTC]))
VAR ShiftDiff = ShiftEnd - ShiftStart
Return
IF(CurrentTime > ShiftEnd || CurrentTime < ShiftStart, MAX('mtb MasterTableUTC'[OutputTarget]),
( (CurrentTime - ShiftStart) / ShiftDiff) * MAX('mtb MasterTableUTC'[OutputTarget]))
Basically, if the current time is outside the range of the shift, the target value should equal the total shift target, but if it is within the shift time, it is calculated as the ratio of time passed within the shift. This does not work for shifts that cross midnight, because the shift end time value is technically earlier than the shift start time value. Any ideas on how to modify the measure to account for these shifts?
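One way to reason about the wrap-around, sketched here in Python rather than DAX (the function and the numbers are illustrative only): when the end time is earlier than the start time the shift crosses midnight, so unwrap it by adding 24 hours of minutes to the end, and do the same to the current time when it falls before the start.

# Minutes-since-midnight arithmetic for a shift that may cross midnight.
def shift_target(current_min, start_min, end_min, output_target):
    if end_min < start_min:             # shift crosses midnight: unwrap onto 0..2880
        end_min += 24 * 60
        if current_min < start_min:
            current_min += 24 * 60
    if current_min < start_min or current_min > end_min:
        return output_target            # outside working hours: target = maximum
    return (current_min - start_min) / (end_min - start_min) * output_target

# Site B, shift 2 (4:00PM-2:00AM, target 7000); at 9:00PM we are halfway through:
print(shift_target(21 * 60, 16 * 60, 2 * 60, 7000))   # -> 3500.0

The same conditional adds can be expressed in the DAX measure by adding 1440 to ShiftEnd (and to CurrentTime when CurrentTime < ShiftStart) whenever ShiftEnd < ShiftStart.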

No returned data for a range of more than 6h

I use graphite 0.9.15.
I have metrics with sparse data (many nulls) over the last 6 hours. I can plot them when querying the last 6 hours, but if I ask for more (like 7 hours), I get no data back. I tried changing the consolidation function with consolidateBy to max, but I still get no data. What should I do to plot my data over more than 6 hours?
I configured the storage aggregation with an xFilesFactor of 0 and the aggregationMethod set to max, but my data are too young to be aggregated.
It looks like something is wrong with your storage-schemas.conf, e.g. you only keep 6 hours of data there: retentions = 1m:6h
The aggregation scheme can also be wrong: if your lowest retention is 10 seconds and you send data, say, every 60 seconds, then you get one data point followed by five empty ones, so about 83% of the points are empty. The default xFilesFactor is 0.5, so aggregation discards your data if more than 50% of the points are empty.
Usually your lowest retention should match your metric rate, or you need to set up proper aggregation for different metrics by regex.
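For reference, a sketch of what matching configs could look like; the section name and retention values are only illustrative, not taken from the question:

# storage-schemas.conf
[sparse_stats]
pattern = ^stats\.
retentions = 60s:7d,5m:30d,1h:1y

# storage-aggregation.conf
[sparse_stats]
pattern = ^stats\.
xFilesFactor = 0
aggregationMethod = max

Note that schema changes only apply to Whisper files created afterwards; existing .wsp files keep their old retentions unless you resize them (e.g. with whisper-resize.py).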

How to calculate network system downtime

Here are two systems, A and B. How do I calculate the downtime of each?
System A has 10 physical nodes; if any of those nodes fails, the whole system goes down. The probability of failure for an individual node is 1% per month, and the downtime is 6h for fixing. What is the downtime for the whole system per year?
System B has 10 physical nodes; if 9 out of 10 nodes are running, the whole system can function as normal. The failure probability and repair time are the same as for A. What is the downtime for the whole system per year?
For A, should it be: 0.01 * 10 * 6 * 12 = 7.2 hours/year?
We are talking about expected downtimes here, so we'll have to take a probabilistic approach.
We can take a Poisson approach to this problem. The expected failure rate is 1% per month for a single node, or 120% (1.2 failures) for 10 nodes over 12 months. So you are correct that 1.2 failures/year * 6 hours/failure = 7.2 hours/year is the expected value for A.
You can figure out how likely a given amount of downtime is by using 7.2 as the lambda value for the Poisson distribution.
Using R: ppois(6, lambda=7.2) = 0.42, meaning there is a 42% chance that you will have less than 6 hours of downtime in a year.
For B, it's also a Poisson, but what's important is the probability that a second node will fail in the six hours after the first failure.
The failure rate (assuming a 30 day month, with 120 6 hour periods) is 0.0083% per 6 hour period per node.
So we look at the chances of two failures within six hours, times the number of six hour periods in a year.
Using R: dpois(2.0, lambda=(0.01/120)) * 365 * 4 = 0.000005069
0.000005069 * 3 expected hours/failure = 54.75 milliseconds expected downtime per year. (3 expected hours per failure because the second failure should occur on average half way through the first failure.)
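As a sanity check, here are the numbers for A using only Python's standard library (a small sketch mirroring the R call above):

import math

nodes, p_month, repair_h = 10, 0.01, 6
lam = nodes * p_month * 12               # expected failures per year = 1.2
print(lam * repair_h)                     # expected downtime: 7.2 hours/year

def ppois(k, lam):
    # P(X <= k) for a Poisson(lam) variable, same as R's ppois(k, lambda=lam)
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

print(round(ppois(6, 7.2), 2))            # ~0.42, matching the R result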
A 1% failure rate per month per node means a probability of 0.00138889% of failing in any given hour. I used the binomial distribution in Excel to model the probability of N node failures when there are 8760 h/y * 10 nodes = 87600 "trials". I got these results:
0 failure: 29.62134067 %
1 failure: 36.03979837 %
2 failure: 21.92426490 %
3 failure: 8.89142792 %
4 failure: 2.70442094 %
5 failure: 0.65805485 %
6 failure: 0.13343314 %
...and so forth
N failures would cause 6N hours of single-node downtime (assuming they are independent). Then, for each 6N hours of single-node downtime, the probability that none of the other 9 nodes fails is (100% - 0.00138889%) ^ (9 * 6N).
Thus the expected two-node downtime is P(1 node down) * (1 - P(no other node down)) * 6 hours / 2 (divided by two because on average the 2nd failure occurs at the mid-point of the other node being repaired). Summed over all N numbers of failures I got an expected two-node downtime of 9.8 seconds / year. I have no idea how accurate this estimate is, but it should give a rough idea. Quite a brute-force solution :/
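The binomial table above can also be reproduced with the standard library; this sketch keeps the same 30-day-month assumption as the answer:

from math import comb

p = 0.01 / (30 * 24)           # 1%/month over a 720-hour month = 0.00138889% per node-hour
n = 8760 * 10                  # 87600 node-hours of "trials" per year

for k in range(7):
    prob = comb(n, k) * p**k * (1 - p)**(n - k)
    print(f"{k} failure(s): {prob:.8%}")
# prints ~29.62%, 36.04%, 21.92%, 8.89%, 2.70%, 0.66%, 0.13% for k = 0..6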

Understanding the graphite response (datapoint) for a summarize query for last 24 hours

Request:
http://example.com:8081/render?format=json&target=summarize(stats.development.com.xxx.operation.yyy.*.*.rate, "24hours", "sum", true)&from=-24hours&tz=UTC
Response:
[{"datapoints":[[0.1,1386198900]],"target":"summarize(stats.development.com.xxx.operation.yyy.5.4.rate,
"24hours", "sum", true)"}]
What I wanted was the summary of the last 24 hours for the stat provided in the query.
Can you please interpret the "datapoints" entry for me?
What does "0.1" mean? Is it on some logarithmic scale?
What does 1386198900 mean?
[{
"datapoints" :
[[0.1,1386198900]],
"target":
"summarize(stats.development.com.xxx.operation.yyy.5.4.rate, "24hours", "sum", true)"
}]
Here, the datapoints sent to the metric stats.development.com.xxx.operation.yyy.5.4.rate, when numerically summed over a 24-hour window, add up to 0.1 for the epoch timestamp 1386198900, which is the system's way of saying Wed, 04 Dec 2013 23:15:00 GMT. No logarithmic scale is involved.
Consider the following example-
You create a metric- website.about-us-page.hits and start sending data every 10 seconds-
1386198900: 3
1386198910: 23
1386198920: 12
1386198930: 1
1386198940: 0
1386198950: 180
1386198960: 12
This URL API request to Graphite:
target=summarize(stats.website.about-us-page.hits, "20seconds", "sum", true)
will return something like:
[{
"datapoints" :
[[26,1386198900], // sum of first two points
[13,1386198920], // sum of next two points
[180,1386198940],
[12,1386198960]],
"target":
"summarize(stats.website.about-us-page.hits, "20seconds", "sum", true)"
}]
summarize() basically helps you see the data at a different granularity, e.g. when you need to know day-wise or hour-wise traffic rather than looking at it on a 10-second basis.
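To make the bucketing concrete, here is a small sketch (plain Python, not Graphite code) that reproduces the 20-second summarize above from the sample points in the example:

# Group (timestamp, value) samples into fixed-width buckets and sum each bucket,
# roughly what summarize(..., "20seconds", "sum", true) does.
samples = [(1386198900, 3), (1386198910, 23), (1386198920, 12),
           (1386198930, 1), (1386198940, 0), (1386198950, 180),
           (1386198960, 12)]
bucket = 20   # seconds
start = samples[0][0]

sums = {}
for ts, value in samples:
    key = start + ((ts - start) // bucket) * bucket
    sums[key] = sums.get(key, 0) + value

print(sorted(sums.items()))
# [(1386198900, 26), (1386198920, 13), (1386198940, 180), (1386198960, 12)]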

Number of seconds since January 1, 1970 00:00:00 GMT Erlang

I am interacting with a Remote Server. This Remote Server is in a different Time Zone. Part of the Authentication requires me to produce the:
"The number of seconds since January 1, 1970 00:00:00 GMT
The server will only accept requests where the timestamp
is within 600s of the current time"
The documentation of erlang:now() reveals that it can get me the elapsed time since 00:00 GMT, January 1, 1970 (zero hour), on the assumption that the underlying OS supports this. It returns a size-3 tuple, {MegaSecs, Secs, MicroSecs}. I tried using element(2, erlang:now()) but the remote server sends me this message:
Timestamp expired: Given timestamp (1970-01-07T14:44:42Z)
not within 600s of server time (2012-01-26T09:51:26Z)
Which of these 3 parameters is the required number of seconds since Jan 1, 1970? What am I not doing right? Is there something I have to do with universal time, as in calendar:universal_time()? UPDATE: I managed to work around the timestamp-expired problem by using this:
seconds_1970() ->
    T1 = {{1970,1,1},{0,0,0}},
    T2 = calendar:universal_time(),
    {Days, {HH, Mins, Secs}} = calendar:time_difference(T1, T2),
    (Days * 24 * 60 * 60) + (HH * 60 * 60) + (Mins * 60) + Secs.
However, the question still remains. There must be a way, a fundamental Erlang way of getting this, probably a BIF, right?
You have to calculate the UNIX time (seconds since 1970) from the results of now(), like this:
{MegaSecs, Secs, MicroSecs} = now().
UnixTime = MegaSecs * 1000000 + Secs.
Just using the second entry of the tuple only tells you the number of seconds since the last whole megasecond, i.e. since the last multiple of 1,000,000 seconds of the UNIX epoch.
[2017 Edit]
now is deprecated, but erlang:timestamp() is not and returns the same format as now did.
Which of these 3 parameters is the required number of seconds since Jan 1, 1970 ?
All three of them, collectively. Look at the given timestamp. It's January 7, 1970. Presumably Secs will be between 0 (inclusive) and 1,000,000 (exclusive). One million seconds is only 11.574 days. You need to use the megaseconds as well as the seconds. Since the error tolerance is 600 seconds you can ignore the microseconds part of the response from erlang:now().
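For illustration, the same arithmetic as a quick sketch; the tuple values below are made up to land near the server time in the error message, not real now() output:

# Collapse an Erlang-style {MegaSecs, Secs, MicroSecs} tuple into plain UNIX seconds.
mega_secs, secs, micro_secs = 1327, 571886, 123456   # hypothetical example values
unix_time = mega_secs * 1_000_000 + secs              # microseconds can be ignored here
print(unix_time)   # 1327571886 -> Thu, 26 Jan 2012 09:58:06 GMT, within 600 s of the server time above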

Resources