How to calculate percentiles for TTFB from multiple Nginx instances in Prometheus?

Let's assume I have 3 Nginx instances, and I wrote a log-analyzing tool capable of calculating latency percentiles for all replies in one minute. I can ship this data to Prometheus and plot a chart - one curve per percentile (p50, p75, p90, etc.).
How can I plot percentiles/quantiles in Prometheus for all replies from all three instances of Nginx? AFAIK I can't just recompute overall percentiles/quantiles from percentiles calculated for each instance independently.
I really appreciate any help you can provide.
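The point that per-instance percentiles can't simply be recombined is easy to demonstrate. A minimal sketch (made-up latency samples, nearest-rank percentiles) showing that averaging per-instance p90s disagrees with the p90 of the pooled samples:

```python
import math

def percentile(data, q):
    """Nearest-rank percentile: value at rank ceil(q * n) in sorted order."""
    s = sorted(data)
    return s[math.ceil(q * len(s)) - 1]

# Hypothetical per-instance TTFB samples (ms) for one minute:
instance_a = list(range(1, 101))   # 100 requests, 1..100 ms
instance_b = list(range(1, 11))    # only 10 requests, all fast

p90_a = percentile(instance_a, 0.9)                   # 90
p90_b = percentile(instance_b, 0.9)                   # 9
naive = (p90_a + p90_b) / 2                           # 49.5, averaging per-instance p90s
true_p90 = percentile(instance_a + instance_b, 0.9)   # 89, p90 over the pooled samples

print(naive, true_p90)  # they disagree badly because the instances saw different loads
```

This is why the usual Prometheus approach is to export histogram buckets from each instance and aggregate the buckets before computing the quantile (e.g. `histogram_quantile` over a `sum by (le)` of bucket rates), rather than shipping pre-computed percentiles.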

Related

perform calculation on more than one series using kibana TSVB

I am using Open Distro Kibana to show some visuals,
I am using TSVB to build some KPIs,
the issue is that I want to do some calculation on two indexes, so I created two series and did the calculation for each one, as in the image below.
What I need is to get the average of those two series ... any idea or advice ... thanks in advance.

Find the radius of a cluster, given that its center is the average of the centers of two other clusters

I do not know if it is possible to find it, but I am using k-means clustering with Mahout, and I am stuck on the following.
In my implementation, I create with two different threads the following clusters:
CL-1{n=4 c=[1.75] r=[0.82916]}
CL-1{n=2 c=[4.5] r=[0.5]}
So, I would like to finally combine these two clusters into one final cluster.
In my code, I manage to find that for the final cluster the total number of points is n=6 and the new weighted average of the centers is c=2.666, but I am not able to find the final combined radius.
I know that the radius is the Population Standard Deviation, and I can calculate it if I previously know each point that belongs to the cluster.
However, in my case I do not have prior knowledge of the points, so I need the "average" of the two radii I mentioned before, in order to finally have this: CL-1{n=6 c=[2.666] r=[???]}.
Any ideas?
Thanks for your help.
It's not hard. Remember how the "radius" (not a very good name) is computed.
It's probably the standard deviation; so if you square this value and multiply it by the number of objects, you get that cluster's sum of squared deviations. You can aggregate those sums of squares - remembering to re-center on the combined mean, which adds an n_i * (c_i - c)^2 term for each merged cluster - and then reverse the process to get a standard deviation again. It's pretty basic statistics knowledge: you combine the second moments about the new center, just like you computed the weighted arithmetic mean for the new center.
However, since your data is 1 dimensional, I'm pretty sure it will fit into main memory. As long as your data fits into memory, stay away from Mahout. It's slooooow. Use something like ELKI instead, or SciPy, or R. Run benchmarks. Mahout will perform several orders of magnitude slower than all the others. You won't need all of this Canopy-thing then either.
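A sketch of that merge in Python, using the numbers from the question. Note the shift term: the squared deviations have to be taken about the new combined center, so each cluster contributes n * (c_i - c)^2 on top of its within-cluster sum of squares:

```python
import math

def merge_clusters(clusters):
    """Merge (n, center, radius) triples, where radius is the population std dev."""
    n_total = sum(n for n, c, r in clusters)
    c_total = sum(n * c for n, c, r in clusters) / n_total
    # Sum of squared deviations about the combined center:
    # the within-cluster part (n * r^2) plus the shift of each center.
    ss = sum(n * r**2 + n * (c - c_total)**2 for n, c, r in clusters)
    return n_total, c_total, math.sqrt(ss / n_total)

n, c, r = merge_clusters([(4, 1.75, 0.82916), (2, 4.5, 0.5)])
print(n, round(c, 3), round(r, 4))  # 6 2.667 1.4907
```

Sanity check: the inputs are consistent with points [1, 1, 2, 3] and [4, 5], whose pooled population standard deviation is indeed about 1.49.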

How to find total number of nodes in a Distributed hash table

How can I find the total number of nodes in a distributed hash table in an efficient way?
You generally do that by estimating from a small sample of the network, since enumerating all nodes of a large network is prohibitively expensive for most use-cases - and it would still be inaccurate due to NAT anyway. So you have to consider that you are sampling only the reachable nodes.
Assuming that nodes are randomly distributed throughout the keyspace and you have some sort of distance metric in your DHT (e.g. the XOR metric in Kademlia's case), you can take the median of the distances in a sample and then estimate the node count as the keyspace size divided by the average distance between neighboring nodes.
If you use the median you may have to compensate by some factor due to the skewness of the distribution - but my statistics are rusty; maybe someone else can chip in on that.
The result will be very noisy, so you'll want to keep enough samples around for averaging, especially given the skewed distribution and the fact that everything happens at an exponential scale (twiddle one bit and the population estimate suddenly doubles or halves).
I would also suggest to only base estimates on outgoing queries that you control, not on incoming traffic, as incoming traffic may be biased by some implementation details.
Another, crude way to get rough estimates is simply extrapolating from your routing table structure, assuming it scales with the network size.
Depending on your statistics prowess you might want to do some of the following: read scientific papers describing the network, borrow code from existing implementations that already do estimation, or run simulations over broad ranges of population sizes - simply fitting a few million random node addresses into RAM and doing some calculations on them shouldn't be too difficult.
Maybe also talk to developers of existing implementations.
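The density-based estimate described above can be sanity-checked with a small simulation. This sketch uses a 1-D circular keyspace [0, 1) instead of 160-bit IDs, and uniform random nodes; all parameters are made up:

```python
import random

def estimate_population(sample_size, true_n, trials=200, seed=42):
    """Estimate network size from the distances of the k nearest nodes to random targets."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        nodes = [rng.random() for _ in range(true_n)]
        target = rng.random()
        # Distance metric: circular distance on the unit keyspace.
        dists = sorted(min(abs(x - target), 1 - abs(x - target)) for x in nodes)
        k = sample_size
        # With uniformly distributed nodes, about 2 * d * N nodes lie within
        # distance d of the target (both directions), so the k-th nearest
        # node sits at d_k ~ k / (2 * N), giving N ~ k / (2 * d_k).
        estimates.append(k / (2 * dists[k - 1]))
    estimates.sort()
    return estimates[len(estimates) // 2]  # median over trials to tame the noise

print(estimate_population(sample_size=8, true_n=1000))  # roughly 1000
```

As the answer warns, individual estimates are very noisy and skewed; the median over many trials lands in the right ballpark but still carries a small bias factor.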

How to normalize benchmark results to obtain distribution of ratios correctly?

To give a bit of context: I am measuring the performance of virtual machines (VMs), or systems software in general, and usually want to compare different optimizations for a performance problem. Performance is measured as absolute runtime for a number of benchmarks, and usually for a number of configurations of a VM, varying over the number of CPU cores used, different benchmark parameters, etc. To get reliable results, each configuration is measured about 100 times. Thus, I end up with quite a number of measurements over all kinds of different parameters, where I am usually interested in the speedup for all of them, comparing the VM with and without a certain optimization.
What I currently do is pick one specific series of measurements - let's say the measurements for a VM with and without an optimization (VM-norm/VM-opt) running benchmark A on 1 core.
Since I want to compare the results across different benchmarks and numbers of cores, I cannot use absolute runtime, but need to normalize it somehow. Thus, I pair up the 100 measurements for benchmark A on 1 core for VM-norm with the corresponding 100 measurements of VM-opt to calculate the VM-opt/VM-norm ratios.
When I do that, taking the measurements just in the order I got them, I obviously get quite a high variation in my 100 resulting VM-opt/VM-norm ratios. So I thought: OK, let's assume the variation in my measurements comes from non-deterministic effects, and the same effects cause variation in the same way for VM-opt and VM-norm. Then, naively, it should be OK to sort the measurements before pairing them up. And, as expected, that reduces the variation, of course.
However, my half-knowledge tells me that this is not the best way, and perhaps not even correct.
Since I am eventually interested in the distribution of those ratios, to visualize them with beanplots, a colleague suggested using the Cartesian product instead of pairing sorted measurements. That sounds like it would account better for the random nature of two arbitrary measurements being paired up for comparison. But I am still wondering what a statistician would suggest for such a problem.
In the end, I am really interested in plotting the distribution of ratios with R as bean or violin plots. Simple boxplots, or just mean ± stddev, tell me too little about what is going on. These distributions usually point at artifacts produced by the complex interactions in these much too complex computers, and that's what I am interested in.
Any pointers to approaches for working with and producing such ratios in a correct way are very welcome.
PS: This is a repost, the original was posted at https://stats.stackexchange.com/questions/15947/how-to-normalize-benchmark-results-to-obtain-distribution-of-ratios-correctly
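For concreteness, here is a sketch of the three pairing strategies discussed above (order pairing, sorted pairing, and the full Cartesian product), on made-up runtime data:

```python
import random
import statistics

rng = random.Random(1)
# Made-up runtimes (seconds) for 100 runs of each VM variant.
vm_norm = [10 + rng.gauss(0, 1) for _ in range(100)]
vm_opt = [8 + rng.gauss(0, 1) for _ in range(100)]

paired = [o / n for o, n in zip(vm_opt, vm_norm)]                # 100 ratios, run i vs run i
sorted_pair = [o / n for o, n in zip(sorted(vm_opt), sorted(vm_norm))]
cartesian = [o / n for o in vm_opt for n in vm_norm]             # 100 * 100 = 10000 ratios

for name, ratios in [("paired", paired), ("sorted", sorted_pair), ("cartesian", cartesian)]:
    print(name, round(statistics.mean(ratios), 3), round(statistics.stdev(ratios), 3))
# Sorted pairing matches quantiles, so it artificially shrinks the spread;
# the Cartesian product retains the full run-to-run variation of both sides.
```

This only illustrates the mechanics of the three options; which one is statistically sound for a given experiment is exactly the question posed above.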
I found it puzzling that you got such a minimal response on Cross Validated. This does not seem like a specific R question, but rather a request for how to design an analysis. Perhaps the audience there thought you were asking too broad a question; but if that is the case, then the [R] forum is even worse, since we generally tackle problems where data is actually provided, and we deal with requests for implementation in our language. I agree that violin plots are preferable to boxplots for examining distributions (when there is sufficient data - and I am not sure that 100 samples per group makes the grade here), but in any case that means the "R answer" is that you just need to refer to the proper R help pages:
library(lattice)
?xyplot
?panel.violin
Further comments would require more details and preferably some data examples constructed in R. You may want to refer to the page where "great question design is outlined".
One further graphical method: if you are interested in the ratios of two paired variates but do not want to "commit" to just x/y, you can examine them by plotting the pairs and then drawing iso-ratio lines by repeatedly using abline(a=0, b= ). I think 100 samples is pretty "thin" for doing density estimates, but there are 2D density methods if you can gather more data.

What is statistically significant latency variation?

Consider the case where I have four identical routers, A, B, C, and D, running busybox and ptpd. A and B are connected by cable 1; C and D are connected by cable 2. I have a small C program on routers A and C that sends a very small packet over UDP to the opposite router, and I use pcap to detect the times that the packet was sent, and the times it arrived at the other end, and calculate the average and deviation for a thousand of these tests.
How do I tell if these cables are different?
Obviously if one is 500μs and the other is 10ms, they're different. But what if the results for one have average 200μs with standard deviation 8, and the results for the other have average 210μs and standard deviation 10. How probable is it that they are different? What calculations should I do to test this? And, on a more technical note, what is the expected variability in latency?
I understand any intermediate switches, hubs, routers etc will add to the latency and the variability of it, but if they are directly connected by a single cable, what is a normal variance?
Edit: Just to clarify a point - this isn't just a statistics question. I can use a t-test to determine the probability of a difference (thanks), but I'd also like to know how much variance can normally be attributed to different qualities in the network equipment. For example, if the two means are 208.4 and 208.5, I would suspect that, whatever the t-test might say, the cables are the same and the difference comes from the test machines. Or am I wrong? Do cables often vary by small amounts? I don't know - what's a normal variance between latencies? What test do I need to distinguish between a difference in the cables and a difference in the equipment? (I can't switch the cables.)
First, you need a primer on statistical hypothesis testing.
Then, there are several ways to answer your question, but the most classical one is to consider that the observed latency is a real-valued variable (call it T, for time) which has a non-random component explained by the behaviour of each cable (call it C, for cable) and a random component which you cannot explain, which may come from random fluctuations or other things you forgot to take into account (call it E, for error).
Then, you will make a series of observations, for cable A-B, and your model is:
T1_i = C1 + E1_i
Where you believe the contribution of the cable remains fixed and only the random variable E1 is changing.
You will also make a series of observations for cable C-D, and your model is:
T2_i = C2 + E2_i
Where you believe the contribution of the cable remains fixed and only the random variable E2 is changing.
Now you are pretty much set. You'll ensure all systematic influences are eliminated, so that E1 and E2 are really just fluctuations. Under those conditions, you can assume they are normal (Gaussian).
Using this model you can use the independent two-sample t-test to check if C1 and C2 are different to any confidence you set beforehand.
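Using the illustrative numbers from the question (mean 200 μs, s = 8 vs. mean 210 μs, s = 10, n = 1000 packets each), the two-sample t-statistic in its Welch form can be computed directly from the summary statistics:

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t-statistic for two samples, given only their summary statistics."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)  # standard error of the mean difference
    return (mean2 - mean1) / se

t = welch_t(200, 8, 1000, 210, 10, 1000)
print(round(t, 2))  # about 24.7, far beyond any usual critical value:
# with n = 1000 per cable, a 10 microsecond gap is overwhelmingly significant
```

With samples this large even tiny mean differences become statistically significant, which is exactly why the question's follow-up about practically meaningful variance matters.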
What you want is a two-sample t-test. You don't need to make any of the assumptions about typical variance that you are worried about; they are built into the test. The Wikipedia page on the two-sample t-test covers the details. Statistically different, however, isn't necessarily the same as economically different. You can confirm that the latency times between the two routers are indeed different - but different by enough to matter? Hard to say without knowing more about your situation, but be wary of getting too far into the statistical weeds.
I honestly don't think statistics will contribute a great deal to what you're doing here. Your cost of collecting a datum is essentially zero, and you can collect arbitrarily huge volumes of it. Fire off a few million/billion packets through each cable and then plot the latencies on two histograms with the same scale. If you can't see a difference, there probably isn't a meaningful one.
Summary statistics destroy information. There are a lot of reasons why one might want to use them anyway, but I don't think they'll be all that useful here. If you want to learn the stats, I certainly applaud that - I think statistical literacy is a fundamental skill for people who want to be able to tell when somebody is feeding them a line of bullshit. But if you just want to understand the differences in latencies between these two cables, a well-done pair of histograms will be vastly more informative.
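The suggested histogram comparison can be sketched with plain text histograms on a shared scale (made-up latency samples; the bin width and range are arbitrary choices):

```python
import random
from collections import Counter

def text_hist(samples, bin_width, lo, hi, width=50):
    """Return a text histogram: one line per bin, bar length scaled to the peak bin."""
    counts = Counter(min(hi, max(lo, int(s // bin_width) * bin_width)) for s in samples)
    peak = max(counts.values())
    lines = []
    for edge in range(lo, hi + 1, bin_width):
        bar = "#" * (counts.get(edge, 0) * width // peak)
        lines.append(f"{edge:>5} us | {bar}")
    return lines

rng = random.Random(0)
# Made-up one-way latencies (us) for the two cables.
cable_1 = [rng.gauss(200, 8) for _ in range(5000)]
cable_2 = [rng.gauss(210, 10) for _ in range(5000)]

for name, data in [("cable 1", cable_1), ("cable 2", cable_2)]:
    print(name)
    print("\n".join(text_hist(data, bin_width=5, lo=170, hi=245)))
```

Because both histograms use the same bins and range, a shift or widening of one distribution relative to the other is visible at a glance, which is the point of the answer above.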
