Consider the case where I have four identical routers, A, B, C, and D, running busybox and ptpd. A and B are connected by cable 1; C and D are connected by cable 2. I have a small C program on routers A and C that sends a very small packet over UDP to the opposite router, and I use pcap to detect the times that the packet was sent, and the times it arrived at the other end, and calculate the average and deviation for a thousand of these tests.
How do I tell if these cables are different?
Obviously if one is 500μs and the other is 10ms, they're different. But what if the results for one have an average of 200μs with a standard deviation of 8μs, and the results for the other have an average of 210μs and a standard deviation of 10μs? How probable is it that they are different? What calculations should I do to test this? And, on a more technical note, what is the expected variability in latency?
I understand any intermediate switches, hubs, routers etc will add to the latency and the variability of it, but if they are directly connected by a single cable, what is a normal variance?
Edit: Just to clarify a point - this isn't just a statistics question. I can use a t-test to determine the probability of a difference (thanks), but I'd also like to know how much variance can normally be attributed to different qualities in the network equipment. For example, if the two means are 208.4 and 208.5, I would suspect that whatever the t-test might say, the cables are the same and the difference comes from the test machines. Or am I wrong? Do cables often vary by small amounts? I don't know - what's a normal variance between latencies? What test do I need to distinguish between a difference in the cables and a difference in the equipment? (I can't switch the cables)
First, you need a primer on statistical hypothesis testing.
Then, there are several ways to answer your question, but the most classical one is to consider that the observed latency is a real variable (let's call those T, for time) which has a non-random component explained by the behaviour of each cable (let's call those C, for cable) and a random component which you cannot explain, which may come from random fluctuations or other things you forgot to take into account (let's call those E, for error).
Then, you will make a series of observations, for cable A-B, and your model is:
T1_i = C1 + E1_i
Where you believe the contribution of the cable remains fixed and only the random variable E1 is changing.
You will also make a series of observations for cable C-D, and your model is:
T2_i = C2 + E2_i
Where you believe the contribution of the cable remains fixed and only the random variable E2 is changing.
Now your problem is pretty much solved. You'll ensure all systematic influences are eliminated, so E1 and E2 are really just fluctuations. Under those conditions, you can assume they are normal (Gaussian).
Using this model you can use the independent two-sample t-test to check if C1 and C2 are different to any confidence you set beforehand.
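As a rough illustration, here is a minimal Python sketch of that test computed directly from summary statistics. It uses Welch's unequal-variance form of the two-sample t-test, which is my choice since the two standard deviations in the question differ, and it approximates the p-value with the normal distribution, which is only reasonable because each sample has about a thousand observations. The function name welch_t_test is made up for this example; the numbers are the ones from the question.

```python
import math

def welch_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's two-sample t-test from summary statistics.

    Returns (t, degrees_of_freedom, approx_two_sided_p).
    The p-value uses a normal approximation (assumption: large samples).
    """
    v1 = sd1 ** 2 / n1  # squared standard error of the first mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    # Two-sided tail probability of a standard normal at |t|
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, df, p

# Cable 1: mean 200 us, sd 8 us; cable 2: mean 210 us, sd 10 us; 1000 runs each
t, df, p = welch_t_test(200.0, 8.0, 1000, 210.0, 10.0, 1000)
print(f"t = {t:.2f}, df = {df:.0f}, p = {p:.3g}")
```

With these numbers the difference in means is many standard errors wide, so the test says the two sets of measurements differ. It cannot say whether the cable or the machines at either end are responsible.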
What you want is a two-sample t-test. You don't need to make any of the assumptions about typical variance that you are worried about; they are built into the test. The Wikipedia page on Student's t-test describes it. Statistically different, however, isn't necessarily the same as economically different. You can confirm that the latency times between the two routers are indeed different, but different by enough to matter? Hard to say without knowing more about your situation, but be wary of getting too far into the statistical weeds.
I honestly don't think statistics will contribute a great deal to what you're doing here. Your cost of collecting a datum is essentially zero, and you can collect arbitrarily huge volumes of it. Fire off a few million/billion packets through each cable and then plot the latencies on two histograms with the same scale. If you can't see a difference, there probably isn't a meaningful one.
Summary statistics destroy information. There are a lot of reasons why one might want to use them anyway, but I don't think they'll be all that useful here. If you want to learn the stats, I certainly applaud that - I think statistical literacy is a fundamental skill for people who want to be able to tell when somebody is feeding them a line of bullshit. But if you just want to understand the differences in latencies between these two cables, a well-done pair of histograms will be vastly more informative.
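To make "two histograms with the same scale" concrete, here is a small Python sketch that bins both samples on shared bin edges and prints them side by side as text. The latencies are synthetic stand-ins generated from the means and deviations in the question (that is an assumption; real measurements need not be Gaussian), and shared_histograms is a made-up helper name.

```python
import random

def shared_histograms(a, b, bins=20):
    """Bin two samples on the same edges so the shapes are directly comparable."""
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # all values identical; avoid dividing by zero

    def count(xs):
        counts = [0] * bins
        for x in xs:
            # The maximum value falls on the last edge; keep it in the last bin
            counts[min(int((x - lo) / width), bins - 1)] += 1
        return counts

    edges = [lo + i * width for i in range(bins + 1)]
    return edges, count(a), count(b)

rng = random.Random(0)
cable1 = [rng.gauss(200, 8) for _ in range(10000)]   # synthetic, in microseconds
cable2 = [rng.gauss(210, 10) for _ in range(10000)]  # synthetic, in microseconds

edges, h1, h2 = shared_histograms(cable1, cable2)
for left, c1, c2 in zip(edges, h1, h2):
    print(f"{left:7.1f}  {'#' * (c1 // 100):<25} {'*' * (c2 // 100)}")
```

With real captures you would feed in the pcap-derived latencies instead and use a proper plotting tool, but the shared edges are the important part.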
I am simulating Transport of Diluted Species inside a pipe segment in COMSOL Multiphysics. I have specified an initial concentration which produces a concentration distribution around a slice through the pipe at t=0. Moreover, I have a point probe a little bit upstream (I am using laminar flow for convection). I am plotting the concentration at this point dependent on time.
To investigate whether the model produces accurate (i.e. physically realistic) results, I am varying the diffusion coefficient D. This is where I noticed unrealistic behavior: for a large range of different diffusion coefficients, the concentration graph at the point probe does not change. This is unphysical, since e.g. higher diffusion coefficients should lead to a more spread-out distribution at the point probe.
I already did a mesh refinement study and found that the result strongly depends on mesh resolution. Therefore, I am now using the highest mesh resolution (extremely fine). Regardless, the concentration results still do not change for varying diffusion coefficients.
What could be the reason for this unphysical behavior? I already know it is not due to mesh resolution or relative tolerance of the solver.
After a lot of time spent on this simulation, I concluded that the undesired effects are indeed due to numerical diffusion, as suggested by 2b-t. Of course, it is impossible to be certain that this is actually the reason. However, I investigated pretty much any other potential culprit in the simulation - without any new insights.
To work around this issue of numerical diffusion, I switched to Particle-Based Simulation (PBS) and approximated the concentration as the normalized number of particles inside a small receiver volume. This method provides a good approximation for the concentration for large particle numbers and a small receiver volume.
By doing this, I produced results that are in very good agreement with results known from the literature.
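For readers unfamiliar with numerical diffusion, here is a small Python sketch of the effect. It is not COMSOL's scheme (COMSOL uses stabilized finite elements). It is the textbook first-order upwind scheme on a 1D grid, chosen only because its artificial diffusion has a known closed form, D_num = u*dx/2 * (1 - C) with Courant number C = u*dt/dx. The physical diffusion coefficient is set to zero, so any spreading of the pulse is purely numerical. All names and grid parameters are made up for the illustration.

```python
import math

def upwind_advect(c, courant, steps):
    """Advect profile c in +x with first-order upwind differences (zero inflow)."""
    for _ in range(steps):
        c = [(1 - courant) * c[j] + courant * (c[j - 1] if j > 0 else 0.0)
             for j in range(len(c))]
    return c

def variance(c, dx):
    """Spatial variance of a concentration profile (its second central moment)."""
    total = sum(c)
    mean = sum(j * dx * v for j, v in enumerate(c)) / total
    return sum((j * dx - mean) ** 2 * v for j, v in enumerate(c)) / total

dx, u, courant, steps = 1.0, 1.0, 0.5, 200
dt = courant * dx / u

# Gaussian pulse, 5 cells wide, well away from both ends of a 400-cell grid
start = [math.exp(-0.5 * ((j - 60) / 5.0) ** 2) for j in range(400)]
end = upwind_advect(start, courant, steps)

measured_growth = variance(end, dx) - variance(start, dx)
d_num = u * dx / 2 * (1 - courant)           # artificial diffusion coefficient
predicted_growth = 2 * d_num * steps * dt    # variance grows as 2*D*t

print(f"measured {measured_growth:.4f}, predicted {predicted_growth:.4f}")
```

If a physical D much smaller than d_num were added to this sketch, the result would barely change, which is the same symptom as in the question.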
How to find total number of nodes in a Distributed hash table in efficient way?
You generally do that by estimating from a small sample of the network, as enumerating all nodes of a large network is prohibitively expensive for most use-cases, and the result would still be inaccurate due to NAT anyway. So you have to keep in mind that you are sampling the reachable nodes.
Assuming that nodes are randomly distributed throughout the keyspace and you have some sort of distance metric in your DHT (e.g. the XOR metric in Kademlia's case), you can find the median of the distances in a sample and then estimate the node count as the keyspace size divided by the average distance between nodes.
If you use the median you may have to compensate by some factor due to the skewness of the distribution, but my statistics are rusty; maybe someone else can chip in on that.
The result will be very noisy, so you'll want to keep enough samples around for averaging. The noise comes from the skewed distribution together with the fact that everything happens at an exponential scale (twiddle one bit to the left and the population estimate suddenly doubles or halves).
I would also suggest to only base estimates on outgoing queries that you control, not on incoming traffic, as incoming traffic may be biased by some implementation details.
Another, crude way to get rough estimates is simply extrapolating from your routing table structure, assuming it scales with the network size.
Depending on your statistics prowess you might want to do some of the following: read scientific papers describing the network, steal code from existing implementations that already do estimation, or run simulations over broad ranges of population sizes - simply fitting a few million random node addresses into RAM and doing some calculations on them shouldn't be too difficult.
Maybe also talk to developers of existing implementations.
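As a starting point for the simulation route, here is a minimal Python sketch. It is one possible variant of the distance-based estimate, not the method of any particular implementation: for each random target it takes the XOR distance to the k-th closest node and estimates the population as (k - 1) * keyspace / distance, then averages over many targets. The (k - 1) factor is the correction that makes this particular estimator unbiased for uniformly distributed IDs. A real client would get the k closest nodes from an iterative lookup; here they come from brute force over simulated IDs. All names and parameters are made up.

```python
import heapq
import random

KEY_BITS = 160              # Kademlia-style 160-bit IDs (assumption)
KEYSPACE = 1 << KEY_BITS

def estimate_population(node_ids, rng, lookups=100, k=20):
    """Average of per-lookup estimates based on the k-th closest XOR distance."""
    estimates = []
    for _ in range(lookups):
        target = rng.getrandbits(KEY_BITS)
        # Distance from the target to its k-th closest node
        kth = heapq.nsmallest(k, (n ^ target for n in node_ids))[-1]
        estimates.append((k - 1) * KEYSPACE / kth)
    return sum(estimates) / len(estimates)

rng = random.Random(42)
NODE_COUNT = 5000
nodes = [rng.getrandbits(KEY_BITS) for _ in range(NODE_COUNT)]

estimate = estimate_population(nodes, rng)
print(f"true size {NODE_COUNT}, estimated {estimate:.0f}")
```

A single lookup with k = 20 is off by roughly a quarter of the true value on average, which is why the averaging matters.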
I'm trying to determine the best DSP method for what I'm trying to accomplish, which is the following:
In real-time, detect the presence of a frequency from a set of different predefined frequencies (no more than 40 different frequencies all within a 1000Hz range). I need to be able to do this even when there are other frequencies (outside of this set or range) that are more dominant.
It is my understanding that FFT might not be the best method for this, because it tells you the most dominant frequency (magnitude) at any given time. This seems like it wouldn't work because if I'm trying to detect say a frequency at 1650Hz (which is present), but there's also a frequency at 500Hz which is stronger, then it's not going to tell me the current frequency is 1650Hz.
I've heard that the Goertzel algorithm might be better for what I'm trying to do, which is to detect single frequencies or a set of frequencies in real-time, even within sounds that have more dominant frequencies than the ones I am trying to detect.
Any guidance is greatly appreciated and please correct me if I'm wrong on these assumptions. Thanks!
In vague and somewhat inaccurate terms, the output of the FFT is the magnitude and phase of all[1] frequencies. That is, your statement, "[The FFT] tells you the most dominant frequency (magnitude) at any given time" is incorrect. The FFT is often used as a first step to determine the most dominant frequency, but that's not what it does. In fact, if you are interested in the most dominant frequency, you need to take extra steps over and beyond the FFT: you take the magnitude of all frequencies output by the FFT, and then find the maximum. The corresponding frequency is the dominant frequency.
For your application as I understand it, the FFT is the correct algorithm.
The Goertzel algorithm is closely related to the FFT. It allows for some optimization over the FFT if you are only interested in the magnitude and/or phase of a small subset of frequencies. It might be the right choice for your application depending on the number of frequencies in question, but only as an optimization -- other than performance, it won't solve any problems the FFT won't solve. Because there is more written about the FFT, I suggest you start there and use the Goertzel algorithm only if the FFT proves to not be fast enough and you can establish the Goertzel will be faster in your case.
[1] For practical purposes, what's most inaccurate about this statement is that the frequencies are grouped together in "bins". There's a limited resolution to the analysis which depends on a variety of factors.
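To illustrate the point that a weaker tone is still measurable next to a stronger one, here is a short Python sketch of the Goertzel algorithm, which computes the power in a single DFT bin. The sample rate, block length and test signal are made-up values chosen so that every frequency of interest falls exactly on a bin; with real audio there will be leakage between bins and you would need a window and a threshold.

```python
import math

def goertzel_power(samples, sample_rate, target_hz):
    """Squared magnitude of the DFT bin nearest to target_hz."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)   # index of the nearest bin
    coeff = 2 * math.cos(2 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2     # second-order resonator update
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2

RATE, N = 8000, 800   # 10 Hz per bin (assumed values)
# Strong 500 Hz tone plus a five times weaker 1650 Hz tone
signal = [1.0 * math.sin(2 * math.pi * 500 * i / RATE)
          + 0.2 * math.sin(2 * math.pi * 1650 * i / RATE)
          for i in range(N)]

for freq in (500, 1200, 1650):
    print(freq, "Hz power:", round(goertzel_power(signal, RATE, freq), 3))
```

The 1650 Hz bin shows clear power even though 500 Hz dominates, while the 1200 Hz bin, where nothing is present, stays near zero. An FFT would report the same magnitudes for those bins; Goertzel just avoids computing the bins you do not care about.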
I am leaving my other answer as-is because I think it stands on its own.
Based on your comments and private email, the problem you are facing is most likely this: sounds, like speech, that are principally in one frequency range, have harmonics that stretch into higher frequency ranges. This problem is exacerbated by low quality microphones and electronics, but it is not caused by them and wouldn't go away even with perfect equipment. Once your signal is cluttered with noise in the same band, you can't really distinguish on from off in a simple and reliable way, because on could be caused by the noise. You could try to do some adaptive thresholding based on noise in other bands, and you'll probably get somewhere, but that's no way to build a robust system.
There are a number of ways to solve this problem, but they all involve modulating your signal and using error detection and correction. Basically, you are building a modem and/or radio. Ultimately, what I'm saying is this: you can't solve your problem on the detector alone. You need to build some redundancy into your signal, and you may need to think about other methods of detection. I know of three methods of sending complex signals:
Amplitude modulation, which is what it sounds like you are doing now.
Frequency modulation, which tends to be more robust in the face of ambient noise. (compare FM and AM radio)
Phase modulation, which is more subtle and tricky.
These methods can be combined and multiplexed in various ways. Read about them on wikipedia. Moreover, once your base signal is transmitted, you can add error correction and detection on top.
I am not an expert in this area, but off the top of my head, I am not sure you'll be able to use PM silently, and AM is simply too sensitive to noise, as you've discovered, although it might work with the right kind of redundancy. FM is probably your best bet.
To give a bit of context, I am measuring the performance of virtual machines (VMs), or systems software in general, and usually want to compare different optimizations for a performance problem. Performance is measured as absolute runtime for a number of benchmarks, and usually for a number of configurations of a VM, varying the number of CPU cores used, different benchmark parameters, etc. To get reliable results, each configuration is measured about 100 times. Thus, I end up with quite a number of measurements for all kinds of different parameters, where I am usually interested in the speedup for all of them, comparing the VM with and the VM without a certain optimization.
What I currently do is to pick one specific series of measurements. Let's say the measurements for a VM with and without optimization (VM-norm/VM-opt) running benchmark A, on 1 core.
Since I want to compare the results of the different benchmarks and number of cores, I can not use absolute runtime, but need to normalize it somehow. Thus, I pair up the 100 measurements for benchmark A on 1 core for VM-norm with the corresponding 100 measurements of VM-opt to calculate the VM-opt/VM-norm ratios.
When I do that, taking the measurements just in the order I got them, I obviously have quite a high variation in my 100 resulting VM-opt/VM-norm ratios. So, I thought, ok, let's assume the variation in my measurements comes from non-deterministic effects and the same effects cause variation in the same way for VM-opt and VM-norm. So, naively, it should be ok to sort the measurements before pairing them up. And, as expected, that reduces the variation.
However, my half-knowledge tells me that is not the best way and perhaps not even correct.
Since I am eventually interested in the distribution of those ratios, to visualize them with beanplots, a colleague suggested to use the cartesian product instead of pairing sorted measurements. That sounds like it would account better for the random nature of two arbitrary measurements paired up for comparison. But, I am still wondering what a statistician would suggest for such a problem.
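To make the three pairing strategies concrete, here is a small Python sketch (the plots themselves would still be done in R). The runtimes are made-up numbers and the function names are hypothetical. The sketch only shows what each strategy computes; it does not settle which one is statistically appropriate.

```python
from itertools import product

def ratios_in_order(opt, norm):
    """Pair the i-th run of VM-opt with the i-th run of VM-norm."""
    return [o / n for o, n in zip(opt, norm)]

def ratios_sorted(opt, norm):
    """Pair fastest with fastest, slowest with slowest (quantile matching)."""
    return [o / n for o, n in zip(sorted(opt), sorted(norm))]

def ratios_cartesian(opt, norm):
    """Pair every VM-opt run with every VM-norm run."""
    return [o / n for o, n in product(opt, norm)]

vm_opt = [8.0, 9.0, 10.0]    # made-up runtimes in seconds
vm_norm = [10.0, 12.0, 8.0]  # made-up runtimes in seconds

print("in order :", ratios_in_order(vm_opt, vm_norm))
print("sorted   :", ratios_sorted(vm_opt, vm_norm))
print("cartesian:", len(ratios_cartesian(vm_opt, vm_norm)), "ratios")
```

Sorted pairing compares matching quantiles of the two samples, which bakes in the assumption that the same disturbances hit both VMs in the same order of severity. The cartesian product treats any run of one VM as equally comparable with any run of the other, at the cost of 100 x 100 = 10,000 non-independent ratios for 100 runs each.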
In the end, I am really interested in plotting the distribution of ratios with R as bean or violin plots. Simple boxplots, or just mean+stddev, tell me too little about what is going on. These distributions usually point at artifacts that are produced by the complex interactions on these far too complex computers, and that's what I am interested in.
Any pointers to approaches for working with and producing such ratios in a correct way are very welcome.
PS: This is a repost, the original was posted at https://stats.stackexchange.com/questions/15947/how-to-normalize-benchmark-results-to-obtain-distribution-of-ratios-correctly
I found it puzzling that you got such a minimal response on "Cross Validated". This does not seem like a specific R question, but rather a request for how to design an analysis. Perhaps the audience there thought you were asking too broad a question, but if that is the case then the [R] forum is even worse, since we generally tackle problems where data is actually provided. We deal with the requests for implementation construction in our language. I agree that violin plots are preferred to boxplots for the examination of distributions (when there is sufficient data and I am not sure that 100 samples per group makes the grade in that instance), but in any case that means the "R answer" is that you just need to refer to the proper R help page:
library(lattice)
?xyplot
?panel.violin
Further comments would require more details and preferably some data examples constructed in R. You may want to refer to the page where "great question design is outlined".
One further graphical method: If you are interested in the ratios of two paired variates but do not want to "commit" to just x/y, then you can examine them by plotting one against the other and then adding iso-ratio lines by repeatedly using abline(a=0, b= ). I think 100 samples is pretty "thin" for doing density estimates, but there are 2d density methods if you can gather more data.
I am designing a RPG game like final fantasy.
I have the programming part done but what I lack is the maths. I am ok at maths but I am having trouble incorporating the player's stats into my sums.
How can I make an action timer that is based on the players speed?
How can I use attack and defence so that it is not always exactly the same damage?
How can I add randomness into the equations?
Can anyone point me to some resources that I can read to learn this sort of stuff.
EDIT: Clarification Of what I am looking for
for the damage I have (player attack x move strength) / enemy defence.
This works and scales well, but I got a look at the algorithms from Final Fantasy 4 a while ago and this sum alone was over 15 steps. Mine has only 2.
I am looking for real game examples if possible but would settle for papers or books that have sections that explain how they get these complex sums and why they don't use simple ones.
I eventually intend to implement this but am looking for more academic knowledge at the moment.
Not knowing Final fantasy at all, here are some thoughts.
Attack/Defence could either be a 'chance to hit/block' or 'damage done/mitigated' (or, possibly, a blend of both). If you decide to go for 'damage done/mitigated', you'll probably want to do one of:
Generate a random number in a suitable range, added/subtracted from the base attack/defence value.
Generate a number in the range 0-1, multiplied by the attack/defence
Generate a number (with a Gaussian or Poisson distribution and a suitable standard deviation) in the range 0-2 (or so, to account for the occasional crit), multiplied by the attack/defence
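The three options above could be sketched in Python roughly as follows. This is just one way to read them: the randomness is applied to the attack stat, which then goes through the (attack x move strength) / defence formula from the question. The function names, the spread of 5 and the sigma of 0.3 are arbitrary placeholders you would tune by playtesting.

```python
import random

def damage(attack, move_strength, defence):
    """The questioner's formula: (attack x move strength) / defence."""
    return attack * move_strength / defence

def roll_additive(stat, rng, spread=5.0):
    """Option 1: add or subtract a random amount within +/- spread."""
    return stat + rng.uniform(-spread, spread)

def roll_scaled(stat, rng):
    """Option 2: multiply by a uniform random number in the range 0-1."""
    return stat * rng.random()

def roll_gaussian(stat, rng, sigma=0.3):
    """Option 3: multiply by a Gaussian factor around 1, clamped to 0-2."""
    factor = min(2.0, max(0.0, rng.gauss(1.0, sigma)))
    return stat * factor

rng = random.Random(7)
attack, move_strength, defence = 40.0, 3.0, 12.0
for roll in (roll_additive, roll_scaled, roll_gaussian):
    hits = [damage(roll(attack, rng), move_strength, defence) for _ in range(5)]
    print(roll.__name__, [round(h, 1) for h in hits])
```

Printing a handful of hits for each variant like this is a quick way to get a feel for how swingy each one is before putting it in the game.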
For attack timers, decide what "double speed" and "triple speed" should do for the number of attacks in a given time. That should give you a decent lead for how to implement it. I can, off-hand, think of three methods.
Use N/speed as a base for the timer (that means double/triple speed gives 2/3 times the number of attacks in a given interval).
Use Basetime - Speed as the timer (requires a cap on speed, may not be an issue, most probably has an unintuitive relation between speed stat and timer, not much difference at low levels, a lot of difference at high levels).
Use Basetime - Sqrt(Speed) as the timer.
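Here is a minimal Python sketch of those three timer formulas so you can compare how they react to the speed stat. The constants (N = 100, base time 100, minimum delay 1) and the function names are placeholders, and the minimum delay is my addition to stand in for the cap mentioned above.

```python
import math

def timer_inverse(speed, n=100.0):
    """N / speed: doubling speed halves the delay, so twice as many actions."""
    return n / speed

def timer_linear(speed, base_time=100.0, min_time=1.0):
    """Basetime - speed, floored so a very high speed cannot go below min_time."""
    return max(min_time, base_time - speed)

def timer_sqrt(speed, base_time=100.0, min_time=1.0):
    """Basetime - sqrt(speed): each extra point of speed matters less and less."""
    return max(min_time, base_time - math.sqrt(speed))

for speed in (10, 20, 40, 80):
    print(speed,
          round(timer_inverse(speed), 2),
          round(timer_linear(speed), 2),
          round(timer_sqrt(speed), 2))
```

Printing a table like this for the range of speed values your game actually uses is a quick way to see which curve matches what "double speed" should mean.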
I doubt you'll find academic work on this. Determining formulae for damage, say, is heuristic. People just make stuff up based on their experience with various functions and then tweak the result based on gameplay.
It's important to have a good feel for what the function looks like when plotted on a graph. The best advice I can give for this is to study a course on sketching graphs of functions. A Google search on "sketching functions" will get you started.
Take a look at printed role-playing games like Dungeons & Dragons and how they handle these issues. They are the inspiration for computer RPGs. I don't know of any academic work.
Some thoughts: you don't have to have an actual "formula". It can be rules like "roll a 20 sided die, weapon does 2 points of damage if the roll is <12 and 3 points of damage if the roll is >=12".
You might want to simplify continuous variables down to small ranges of integers for testing. That way you can calculate out tables with all the possible permutations and see if the results look reasonable. Once you have something good, you can interpolate the formulas for continuous inputs.
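As a tiny Python illustration of both ideas, here is the 20-sided die rule written as code, followed by a table over every possible roll so you can check the average damage by hand. The function names are made up.

```python
import random

def weapon_damage(roll):
    """Rule from above: 2 points if the d20 roll is < 12, otherwise 3 points."""
    return 2 if roll < 12 else 3

def roll_d20(rng):
    """One roll of a 20-sided die."""
    return rng.randint(1, 20)

# Enumerate every outcome instead of simulating, since there are only 20
table = {roll: weapon_damage(roll) for roll in range(1, 21)}
average = sum(table.values()) / len(table)
print("average damage per hit:", average)

rng = random.Random(3)
print("sample hits:", [weapon_damage(roll_d20(rng)) for _ in range(10)])
```

With small integer ranges like this you can tabulate every combination of attacker and defender and eyeball whether the numbers look reasonable before generalizing the rule.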
Another key issue is play balance. There aren't necessarily formulas for telling you whether your game mechanics are balanced, you have to test.