How to compute the global variance (squared standard deviation) in a parallel application?

I have a parallel application in which each node computes the variance of its partition of the datapoints based on the calculated mean, but how can I compute the global variance (the sum of all the variances)?
I thought it would simply be the sum of the variances divided by the number of nodes, but that does not give me a result close to the correct one...

The global variance can be computed from sums.
You can compute parts of the sum in parallel trivially, and then add them together.
sum(x1...x100) = sum(x1...x50) + sum(x51...x100)
In the same way, you can compute the global average: compute the global sum and the global object count, then divide (don't divide by the number of nodes, but by the total number of objects).
mean = sum/count
Once you have the mean, you can compute the sum of squared deviations using the distributed sum formula above (applied to (xi-mean)^2), then divide by count-1 to get the variance.
Do not use E[X^2] - (E[X])^2
While this formula "mean of square minus square of mean" is highly popular, it is numerically unstable when you are using floating-point math; the failure mode is known as catastrophic cancellation.
Because the two values can be very close, you lose a lot of digits of precision when computing the difference. I've seen people get a negative variance this way...
With "big data", these numerical problems get worse...
Two ways to avoid these problems:
Use two passes. Computing the mean first is stable, and it avoids the subtraction of squares entirely.
Use an online algorithm such as the one by Knuth and Welford, then use weighted sums to combine the per-partition means and variances. Details are on Wikipedia. In my experience, this is often slower, but it may be beneficial on Hadoop due to startup and IO costs.
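As a concrete sketch of the two-pass approach, here is a small pure-Python example. The partition layout and function names are invented for illustration, with ordinary function calls standing in for the per-node work:

```python
from math import fsum  # compensated summation, a good idea for long sums

def partition_sum_count(chunk):
    """Pass 1, per node: local sum and local count."""
    return fsum(chunk), len(chunk)

def partition_sq_dev(chunk, global_mean):
    """Pass 2, per node: local sum of squared deviations from the global mean."""
    return fsum((x - global_mean) ** 2 for x in chunk)

# Hypothetical partitions living on three different nodes
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

# Pass 1: combine the local sums and counts, then divide by the total
# object count (not by the number of nodes!)
total = sum(partition_sum_count(p)[0] for p in partitions)
count = sum(partition_sum_count(p)[1] for p in partitions)
mean = total / count

# Pass 2: combine the local squared deviations, divide by count - 1
variance = sum(partition_sq_dev(p, mean) for p in partitions) / (count - 1)
print(mean, variance)  # 3.5 3.5
```

In a real system each per-node call would be a distributed task; only the (sum, count) and squared-deviation scalars need to travel over the network.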

You need to add the sums and sums of squares of each partition to get the global sum and sum of squares and then use them to calculate the global mean and variance.
UPDATE: E[X²] - E[X]² and cancellation...
To figure out how important cancellation error is when calculating the standard deviation with
σ = √(E[X²] - E[X]²)
let us assume that we have both E[X²] and E[X]² accurate to 12 significant decimal figures. This implies that σ² has an error of order 10⁻¹² × E[X²] or, if there has been significant cancellation, equivalently 10⁻¹² × E[X]², in which case σ will have an error of approximate order 10⁻⁶ × E[X]: one millionth of the mean.
For many, if not most, statistical analyses this is negligible, in the sense that it falls within other sources of error (like measurement error), and so you can in good conscience simply set negative variances to zero before you take the square root.
If you really do care about deviations of this magnitude (and can show that it's a feature of the thing you are measuring and not, for example, an artifact of the method of measurement) then you can start worrying about cancellation. That said, the most likely explanation is that you have used an inappropriate scale for your data, such as measuring daily temperatures in Kelvin rather than Celsius!
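The cancellation is easy to reproduce. In this Python sketch the offset 1e9 is an arbitrary choice that makes the mean large relative to the spread:

```python
data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]  # true population variance: 22.5
n = len(data)
mean = sum(data) / n

# One-pass formula E[X^2] - E[X]^2: the two terms agree to ~16 digits,
# so the subtraction cancels away essentially all the real information.
naive_var = sum(x * x for x in data) / n - mean * mean

# Two-pass formula: subtract the mean first, then square. Stable.
two_pass_var = sum((x - mean) ** 2 for x in data) / n

print(two_pass_var)  # 22.5
print(naive_var)     # garbage of the order of the rounding error (may even be negative)
```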

Related

Is it feasible to denoise time-independent sensor readings with a Kalman filter, and how would one code it?

After doing some research, I understand how to implement it for time-dependent functions. However, I'm not sure whether I can apply it to time-independent scenarios.
Say we have a simple function y = a*x^2, where both y and x are measured at a constant interval (say 1 min/sample) and a is a constant. However, both the y and x measurements contain white noise.
More specifically, x and y are two independently measured variables. For example, x is the air flow rate in a duct and y is the pressure drop across the duct. Because the air flow varies with the fan speed, the pressure drop across the duct also varies. The relation between the pressure drop y and the flow rate x is y = a*x^2, but both measurements contain white noise. Is it possible to use a Kalman filter to estimate a more accurate y? Both x and y are recorded at a constant time interval.
Here are my questions:
Is it feasible to implement a Kalman filter for reducing the noise in the y readings? Or, in other words, to get a better estimate of y?
If this is feasible, how to code it in R or C?
P.S.
I tried applying the Kalman filter to a single variable and it works well; the result is shown below. I'll try Ben's suggestion and see whether I can make it work.
I think you can apply some Kalman Filter like ideas here.
Make your state a, with variance P_a. Your update is just F=[1], and your measurement is just H=[1] with observation y/x^2. In other words, you measure x and y and estimate a by solving for a in your original equation. Update your scalar KF as usual. Approximating R will be important. If x and y both have zero mean Gaussian noise, then y/x^2 certainly doesn't, but you can come up with an approximation.
Now that you have a running estimate of a (which is a random constant, so Q=0 ideally, but maybe Q=[tiny] to avoid numerical issues) you can use it to get a better y.
You have y_meas and y_est=a*x_meas^2. Combine those using your variances as (R_y * a * x^2 + (P_a + R_x2) * y_meas) / (R_y + P_a + R_x2). Over time as P_a goes to zero (you become certain of your estimate of a) you can see you end up combining information from your x and y measurements proportional to your trust in them individually. Early on, when P_a is high you are mostly trusting the direct measurement of y_meas because you don't know the relationship.
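A minimal Python sketch of the scalar filter described above. Everything here is an assumption for illustration: the synthetic data, the constant pseudo-measurement noise r (a real implementation would derive R for y/x^2 from the x and y noise levels, as noted above), and the initial values a0 and p0:

```python
import random

def kalman_constant(observations, a0=0.0, p0=1e6, r=0.1):
    """Scalar Kalman filter for a random constant: F = 1, H = 1, Q = 0.

    The pseudo-measurement of the state a is z = y / x^2."""
    a_est, p = a0, p0
    for x_meas, y_meas in observations:
        z = y_meas / x_meas ** 2         # solve y = a*x^2 for a
        k = p / (p + r)                  # Kalman gain
        a_est = a_est + k * (z - a_est)  # state update
        p = (1.0 - k) * p                # covariance update
    return a_est, p

# Synthetic data: y = 2*x^2 plus white noise on y (noise on x is omitted
# here to keep the sketch short)
random.seed(0)
true_a = 2.0
obs = [(x, true_a * x ** 2 + random.gauss(0.0, 0.1))
       for x in (random.uniform(1.0, 3.0) for _ in range(500))]

a_hat, p_final = kalman_constant(obs)
print(round(a_hat, 2))  # close to 2.0
```

Once p_final is small (you have become certain of a), the estimate can be plugged into y_est = a*x_meas^2 and blended with y_meas as described.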

FFT frequency domain values depend on sequence length?

I am extracting Heart Rate Variability (HRV) frequency-domain features, e.g. LF and HF, using the FFT. I have found that the LF and HF values for a longer sequence, e.g. 3 minutes, are larger than for a shorter sequence, e.g. 30 seconds. I wonder whether this is a common observation or whether there are bugs in my code. Thanks in advance
Yes, the frequency in each bin depends on N, the sequence length.
See this related answer: https://stackoverflow.com/a/4371627/119527
An FFT by itself is a dimensionless basis transform. But if you know the sample rate (Fs) of the input data and the length (N) of the FFT, then the center frequency represented by each FFT result element or result bin is bin_index * (Fs/N).
Normally (with baseband sampling) the resulting range is from 0 (DC) up to Fs/2 (for strictly real input the rest of the FFT results are just a complex conjugate mirroring of the first half).
Added: Also, many forward FFT implementations (but not all) are energy preserving. Since a longer signal of the same amplitude contains more total energy, the FFT result energy will be greater by the same proportion: either in bin magnitude (for sufficiently narrow-band components), or spread across more bins, or both.
What you are observing is to be expected, at least with most common FFT implementations. Typically there is a scale factor of N in the forward direction and 1 in the reverse direction, so you need to scale the output of the FFT bins by a factor of 1/N if you are interested in calculating spectral energy (or power).
Note however that these scale factors are just conventions, e.g. some implementations use a sqrt(N) scale factor in both the forward and reverse directions, so you need to check the documentation for your FFT library to be absolutely certain.
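The N-dependence is easy to check numerically. A small pure-Python sketch (the sample rate and tone frequency are arbitrary choices, picked so the tone lands exactly on a DFT bin):

```python
import cmath
import math

FS = 100.0  # sample rate in Hz (arbitrary)
F0 = 5.0    # tone frequency in Hz, chosen to fall exactly on a bin

def dft_bin(x, k):
    """One DFT result element: X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)."""
    n_len = len(x)
    return sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_len)
               for n in range(n_len))

def peak_magnitude(n_samples):
    x = [math.sin(2 * math.pi * F0 * t / FS) for t in range(n_samples)]
    k = int(F0 * n_samples / FS)  # bin index = f / (Fs / N)
    return abs(dft_bin(x, k))

m_short = peak_magnitude(100)  # short recording
m_long = peak_magnitude(400)   # 4x longer recording, same amplitude

print(round(m_long / m_short))      # 4: raw bin magnitude grows with N
print(m_short / 100, m_long / 400)  # ~0.5 both: scaling by 1/N makes them comparable
```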

Error probability function

I have DNA amplicons with base mismatches which can arise during the PCR amplification process. I want to know the probability that a sequence contains errors, given the per-base error rate, the number of mismatches, and the number of bases in the amplicon.
I came across an article [Cummings, S. M. et al (2010). Solutions for PCR, cloning and sequencing errors in population genetic analysis. Conservation Genetics, 11(3), 1095–1097. doi:10.1007/s10592-009-9864-6]
that proposes this formula to calculate the probability mass function in such cases.
I implemented the formula with R as shown here
pcr.prob <- function(k, N, eps) {
  v <- numeric(k)
  for (i in 1:k) {
    v[i] <- choose(N, k - i) * eps^(k - i) * (1 - eps)^(N - (k - i))
  }
  1 - sum(v)
}
From the article: suppose we analysed an 800 bp amplicon using a PCR of 30 cycles with 1.85 × 10⁻⁵ misincorporations per base per cycle, and found 10 unique sequences that are each 3 bp different from their most similar sequence. The probability that a novel sequence was generated by three independent PCR errors equals P = 0.0011.
However when I use my implementation of the formula I get a different value.
pcr.prob(3,800,0.0000185)
[1] 5.323567e-07
What could I be doing wrong in my implementation? Am I misinterpreting something?
Thanks
I think they've got the right number (0.00113), but badly explained in their paper.
The calculation you want to be doing is:
pbinom(3, 800, 1 - (1 - 1.85e-5)^30, lower.tail = FALSE)
That is: the probability of seeing more than three modified bases among 800 independent bases, where each base has a 1 - (1 - 1.85e-5)^30 chance of being mis-incorporated at least once over 30 cycles, i.e. the probability that it doesn't stay correct 30 times.
Thinking about this more, you will start to see floating-point inaccuracies when working with very small probabilities here: an expression like 1 - x starts to lose precision once x drops below about 1e-10. Working with log-probabilities is a good idea at this point; specifically, the log1p function is a great help. Using:
pbinom(3, 800, 1 - exp(log1p(-1.85e-5) * 30), lower.tail = FALSE)
will continue to work even when the error incorporation rate is very low.
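For cross-checking outside R, the same tail probability can be computed with nothing but the Python standard library. The function name `p_at_least` is mine, and `math.comb` needs Python 3.8+:

```python
import math

def p_at_least(k, n_bases, eps_per_cycle, cycles):
    """P(at least k erroneous bases in an n_bases-long amplicon), where each
    base must survive `cycles` rounds of per-cycle error rate eps_per_cycle."""
    # Per-base error probability over all cycles; log1p/expm1 keep this
    # accurate even for very small eps_per_cycle.
    p = -math.expm1(math.log1p(-eps_per_cycle) * cycles)
    # 1 - P(X <= k-1) for X ~ Binomial(n_bases, p)
    below = sum(math.comb(n_bases, i) * p ** i * (1.0 - p) ** (n_bases - i)
                for i in range(k))
    return 1.0 - below

# Equivalent to pbinom(3, 800, 1-(1-1.85e-5)^30, lower.tail = FALSE):
print(p_at_least(4, 800, 1.85e-5, 30))  # ~0.0011, the value from the paper
```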

how to cluster curve with kmeans?

I want to cluster some curves that contain daily click rates.
The dataset is click-rate data as time series.
y1 = [time1:0.10,time2:0.22,time3:0.344,...]
y2 = [time1:0.10,time2:0.22,time3:0.344,...]
I don't know how to measure the similarity of two curves when using k-means.
Is there any paper for this purpose or some library?
For similarity, you could use any kind of time-series distance. Many of these also perform alignment, including for sequences of different lengths.
However, k-means will not get you anywhere.
K-means is not meant to be used with arbitrary distances. It does not actually use a distance for the assignment step, but the least sum of squares (which happens to be squared Euclidean distance), i.e. variance.
The mean must be consistent with this objective. It's not hard to see that the mean also minimizes the sum of squares. This guarantees convergence of k-means: in each single step (both assignment and mean update), the objective is reduced, thus it must converge after a finite number of steps (as there are only a finite number of discrete assignments).
But what is the mean of multiple time series of different length?
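When all curves are sampled at the same time points (equal length), the question has a clean answer: treat each curve as a plain vector, pair squared Euclidean distance with the pointwise mean, and ordinary k-means is consistent. A toy sketch with invented data and naive initialization:

```python
def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans_curves(curves, k, iters=20):
    centers = [list(c) for c in curves[:k]]  # naive init: first k curves
    labels = [0] * len(curves)
    for _ in range(iters):
        # Assignment step: least sum of squared differences.
        labels = [min(range(k), key=lambda j: sqdist(c, centers[j]))
                  for c in curves]
        # Update step: pointwise mean of each cluster. The mean minimizes
        # the same sum-of-squares objective, which is what guarantees
        # convergence.
        for j in range(k):
            members = [c for c, lab in zip(curves, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members)
                              for col in zip(*members)]
    return labels, centers

curves = [[0.10, 0.22, 0.34], [0.11, 0.20, 0.35],  # low click rates
          [0.90, 0.80, 0.70], [0.88, 0.82, 0.69]]  # high click rates
labels, centers = kmeans_curves(curves, 2)
print(labels)  # first two curves share one label, last two the other
```

For curves of different lengths, resample them onto a common time grid first, or switch to a method such as k-medoids that only needs pairwise distances.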

Merging two statistical result sets

The processing can produce a large number of results, so I would rather not store all of the data just to recalculate additional statistics later on.
Say I have two sets of statistics that describe two different sessions of runs over a process.
Each set contains
Statistics: { mean, median, standard deviation, number of runs of the process }
How would I merge the two sets' medians and standard deviations to get a combined summary describing both sets of statistics?
Remember, I can't preserve both sets of data that the statistics are describing.
Artelius is mathematically right, but the way he suggests computing the variance is numerically unstable. You want to compute the variance as follows:
new_var = (n(0)*(var(0) + (mean(0) - new_mean)**2) + n(1)*(var(1) + (mean(1) - new_mean)**2) + ...) / new_n
The problem with the original formula is that if your deviation is small compared to your mean, you end up subtracting one large number from another to get a relatively small number, which loses floating-point precision. The new formula avoids this problem; rather than converting to E(X^2) and back, it simply adds all the contributions to the total variance together, properly weighted according to their sample sizes.
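A sketch of that merge in Python, taking one (n, mean, var) triple per data set; population variances are assumed, matching the formulas in this answer:

```python
def merge_stats(groups):
    """Merge (n, mean, var) summaries of disjoint data sets into a global
    (n, mean, var), using the numerically stable weighted form:
    new_var = sum(n_i * (var_i + (mean_i - new_mean)**2)) / new_n
    """
    new_n = sum(n for n, _, _ in groups)
    new_mean = sum(n * m for n, m, _ in groups) / new_n
    new_var = sum(n * (v + (m - new_mean) ** 2)
                  for n, m, v in groups) / new_n
    return new_n, new_mean, new_var

# Two halves of [1, 2, 3, 4]: n=2, mean=1.5, var=0.25 and n=2, mean=3.5,
# var=0.25 merge into the full data set's n=4, mean=2.5, var=1.25.
print(merge_stats([(2, 1.5, 0.25), (2, 3.5, 0.25)]))  # (4, 2.5, 1.25)
```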
You can get the mean and standard deviation, but not the median.
new_n = (n(0) + n(1) + ...)
new_mean = (mean(0)*n(0) + mean(1)*n(1) + ...) / new_n
new_var = ((var(0)+mean(0)**2)*n(0) + (var(1)+mean(1)**2)*n(1) + ...) / new_n - new_mean**2
where n(0) is the number of runs in the first data set, n(1) is the number of runs in the second, and so on, mean is the mean, and var is the variance (which is just standard deviation squared). n**2 means "n squared".
Getting the combined variance relies on the fact that the variance of a data set is equal to the mean of the square of the data set minus the square of the mean of the data set. In statistical language,
Var(X) = E(X^2) - E(X)^2
The var(n)+mean(n)**2 terms above give us the E(X^2) portion which we can then combine with other data sets, and then get the desired result.
In terms of medians:
If you are combining exactly two data sets, then you can be certain that the combined median lies somewhere between the two medians (or equal to one of them), but there is little more that you can say. Taking their average should be OK unless you want to avoid the median not being equal to some data point.
If you are combining many data sets in one go, you can either take the median of the medians or take their average. If there may be significant systematic differences between the different data sets, then taking their average is probably better, as taking the median reduces the effect of outliers. But if you have systematic differences between runs, disregarding them is probably not a good thing to do.
Recovering the exact median is not possible. Say you have two tuples, (1, 1, 1, 2) and (0, 0, 2, 3, 3). Their medians are 1 and 2, and the overall median is 1; from the summaries alone there is no way to tell.
