How to obtain the mean and covariance of the estimated normal with noisy points in PCL? - point-cloud-library

I have a point cloud in which each point is represented by its mean and its covariance. I would like to obtain the mean and the covariance of the normal vector estimated for each point. How can I do this in PCL?
Thanks!

I think the following process can give you a good estimate:
Create a new point cloud that is sampled from the distribution of your input cloud (the mean and covariance of each point).
Estimate normals for the above (sampled) cloud.
Repeat N times.
At the end of this process you will get N different normal vectors for each point. Now you just need to calculate their mean & covariance.
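For illustration, here is a minimal NumPy sketch of this procedure (it is not PCL code; the per-point normal is estimated by local PCA of the k nearest neighbours, which is the same idea behind PCL's NormalEstimation, and all function names are my own):

import numpy as np

def estimate_normal_pca(neighbors):
    # Normal = eigenvector of the neighbourhood covariance with the
    # smallest eigenvalue (local PCA, as in PCL's NormalEstimation).
    centered = neighbors - neighbors.mean(axis=0)
    cov = centered.T @ centered / max(len(neighbors) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, 0]

def monte_carlo_normals(means, covs, k=10, n_runs=50, seed=0):
    # means: (N, 3) point means; covs: (N, 3, 3) per-point covariances.
    rng = np.random.default_rng(seed)
    n_pts = len(means)
    normals = np.empty((n_runs, n_pts, 3))
    for run in range(n_runs):
        # 1. Sample a cloud from each point's Gaussian (mean, covariance).
        cloud = np.array([rng.multivariate_normal(m, c) for m, c in zip(means, covs)])
        # 2. Estimate a normal for every point from its k nearest neighbours.
        d2 = ((cloud[:, None, :] - cloud[None, :, :]) ** 2).sum(-1)
        knn = np.argsort(d2, axis=1)[:, :k]
        for i in range(n_pts):
            n = estimate_normal_pca(cloud[knn[i]])
            # Resolve the sign ambiguity so normals can be averaged across runs.
            if run > 0 and np.dot(n, normals[0, i]) < 0:
                n = -n
            normals[run, i] = n
    # 3. Mean and covariance of the N normals obtained for each point.
    mean_n = normals.mean(axis=0)
    mean_n /= np.linalg.norm(mean_n, axis=1, keepdims=True)
    cov_n = np.array([np.cov(normals[:, i, :].T) for i in range(n_pts)])
    return mean_n, cov_n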

Related

Interpreting metaMDS procrustes rotation

Does the metaMDS function in vegan rotate the ordination solution so the first axis explains the most variance? If not, is there a way to achieve this?
Run 20 stress 0.09957583
... Procrustes: rmse 0.0001349268 max resid 0.0009665635
... Similar to previous best
I am also unsure about how to interpret the procrustes data. What do the values for RMSE and max residual represent?
Thanks!
Yes, it does. Documentation (?metaMDS) expresses it this way:
Principal components rotate the configuration so that the variance of points is maximized on first dimension
The output also prints when PC rotation was used:
Scaling: centring, PC rotation, halfchange scaling
However, this has nothing to do with the Procrustes rotation that is used to assess the similarity of solutions during iteration. PC rotation only concerns the final returned result.
About the interpretation of the RMSE and max residual: they are statistics that compare two solutions during iteration. RMSE is a kind of average difference, and the max residual is the maximum difference. If they are small, two iterations yielded similar results. To see what they mean, run metaMDS with the option plot = TRUE and the Procrustes rotations are plotted as the iterations run. The blue arrows in that plot show the differences: RMSE is an averaged arrow length, and the max residual is the length of the longest arrow. If you don't see many arrows, the two solutions are so similar that the differences are not visible.
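To see concretely what those two statistics measure, here is a small NumPy sketch (not vegan itself, which also allows scaling that I omit here; the function name and the toy configurations are made up) that rotates one configuration onto another and reports the Procrustes RMSE and maximum residual:

import numpy as np

def procrustes_stats(conf_a, conf_b):
    # Centre both (n, k) configurations of ordination scores.
    a = conf_a - conf_a.mean(axis=0)
    b = conf_b - conf_b.mean(axis=0)
    # Optimal rotation of b onto a (orthogonal Procrustes via SVD).
    u, _, vt = np.linalg.svd(a.T @ b)
    b_rot = b @ (vt.T @ u.T)
    # Per-point residual distances between the two solutions.
    residuals = np.linalg.norm(a - b_rot, axis=1)
    return np.sqrt(np.mean(residuals**2)), residuals.max()

# Two nearly identical 2-D configurations that differ only by a rotation and
# tiny noise: both statistics come out tiny ("similar to previous best").
rng = np.random.default_rng(0)
conf1 = rng.normal(size=(30, 2))
angle = 0.3
rot = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
conf2 = conf1 @ rot + rng.normal(scale=1e-4, size=(30, 2))
print(procrustes_stats(conf1, conf2))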

Compare variances between two populations with different means

I would like to compare two populations which have different means. I want to find a way to compare their variances, to get an idea of which of the two populations has values that disperse further from its mean.
The issue is that I think I need a variance that is standardized/normalized on the mean value of each distribution.
Suggestions?
The next step would be to find a function in R that is able to do that.
You don't need to standardise/normalise, because the variance is calculated from deviations about the mean, so it is already centred on the sample mean and unaffected by a shift in location.
To demonstrate this, run the following code:
x<-runif(10000,min=100,max=101)
y<-runif(10000,min=1,max=2)
mean(x)
mean(y)
var(x)
var(y)
You'll see that while the means are different, the variances of the two samples are essentially identical (allowing for some difference due to pseudo-random number generation and sample size).

how to compute the global variance (squared standard deviation) in a parallel application?

I have a parallel application in which each node computes the variance of its partition of the data points based on that partition's mean, but how can I compute the global variance over all the data?
I thought it would be a simple sum of the variances divided by the number of nodes, but that does not give a result close to the true value...
The global variation (the sum of squared deviations) is itself a sum.
You can compute parts of the sum in parallel trivially, and then add them together.
sum(x1...x100) = sum(x1...x50) + sum(x51...x100)
In the same way, you can compute the global average: compute the global sum, compute the total object count, and divide (don't divide by the number of nodes, but by the total number of objects).
mean = sum/count
Once you have the mean, you can compute the sum of squared deviations using the distributed sum formula above (applied to (xi-mean)^2), then divide by count-1 to get the variance.
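Here is a plain NumPy sketch of that two-pass scheme, with each array standing in for the partition held by one node (the helper names are mine):

import numpy as np

def local_sums(part):
    # Pass 1 on each node: local sum and local count.
    return part.sum(), len(part)

def local_sq_dev(part, global_mean):
    # Pass 2 on each node: local sum of squared deviations from the global mean.
    return ((part - global_mean) ** 2).sum()

rng = np.random.default_rng(1)
partitions = [rng.normal(10, 3, size=n) for n in (1000, 2500, 400)]

# Combine the local sums and counts, then divide by the total count
# (not by the number of nodes).
sums, counts = zip(*(local_sums(p) for p in partitions))
global_mean = sum(sums) / sum(counts)

# Combine the local sums of squared deviations, then divide by count - 1.
ss = sum(local_sq_dev(p, global_mean) for p in partitions)
global_var = ss / (sum(counts) - 1)

print(global_var, np.var(np.concatenate(partitions), ddof=1))  # should agree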
Do not use E[X^2] - (E[X])^2
While this formula, "mean of square minus square of mean", is highly popular, it is numerically unstable when you are using floating-point math; the problem is known as catastrophic cancellation.
Because the two values can be very close, you lose a lot of digits in precision when computing the difference. I've seen people get a negative variance this way...
With "big data", numerical problems gets worse...
Two ways to avoid these problems:
Use two passes. Computing the mean first is stable, and subtracting it before squaring gets rid of the cancellation-prone subtraction of squares.
Use an online algorithm such as the one by Knuth and Welford, then use weighted sums to combine the per-partition means and variances (details are on Wikipedia), as sketched below. In my experience this is often slower, but it may be beneficial on Hadoop due to startup and IO costs.
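A minimal sketch of that second option, assuming a Welford-style accumulator per partition plus the usual parallel merge formula (class and method names are my own):

from dataclasses import dataclass

@dataclass
class RunningStats:
    # count, mean, and M2 = sum of squared deviations from the current mean.
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def push(self, x):
        # Welford's online update for a single value.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Combine two partial results (Chan et al. parallel formula).
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta**2 * self.count * other.count / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self):
        return self.m2 / (self.count - 1)

# Each node pushes its own values; the driver merges the partial results.
a, b = RunningStats(), RunningStats()
for x in [1.0, 2.0, 3.0]:
    a.push(x)
for x in [10.0, 11.0]:
    b.push(x)
print(a.merge(b).variance)  # same as the variance of [1, 2, 3, 10, 11]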
You need to add the sums and sums of squares of each partition to get the global sum and sum of squares and then use them to calculate the global mean and variance.
UPDATE: E[X^2] - (E[X])^2 and cancellation...
To figure out how important cancellation error is when calculating the standard deviation with
σ = √(E[X^2] - (E[X])^2)
let us assume that we have both E[X^2] and (E[X])^2 accurate to 12 significant decimal figures. This implies that σ^2 has an error of order 10^-12 × E[X^2] or, if there has been significant cancellation, equivalently of order 10^-12 × (E[X])^2, in which case σ will have an error of approximate order 10^-6 × E[X]: one millionth of the mean.
For many, if not most, statistical analyses this is negligible, in the sense that it falls within other sources of error (like measurement error), and so you can in good conscience simply set negative variances to zero before you take the square root.
If you really do care about deviations of this magnitude (and can show that it's a feature of the thing you are measuring and not, for example, an artifact of the method of measurement), then you can start worrying about cancellation. That said, the most likely explanation is that you have used an inappropriate scale for your data, such as measuring daily temperatures in Kelvin rather than Celsius!
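To see the cancellation concretely, here is a small NumPy experiment in double precision; the magnitudes (mean 10^9, s.d. 10^-3) are deliberately extreme to make the effect obvious:

import numpy as np

rng = np.random.default_rng(42)
# A huge mean relative to the spread: the true variance is about 1e-6.
x = 1e9 + rng.normal(scale=1e-3, size=1_000_000)

naive = np.mean(x**2) - np.mean(x)**2  # E[X^2] - (E[X])^2, single pass
two_pass = np.var(x)                   # subtract the mean first, then square

print("naive   :", naive)     # far from the truth, and can even be negative
print("two-pass:", two_pass)  # close to 1e-6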

how to cluster curves with k-means?

I want to cluster some curves which contain daily click rates.
The dataset is click rate data in time series.
y1 = [time1:0.10,time2:0.22,time3:0.344,...]
y2 = [time1:0.10,time2:0.22,time3:0.344,...]
I don't know how to measure two curves' similarity when using k-means.
Is there any paper for this purpose or some library?
For similarity, you could use any kind of time series distance. Many of these will perform alignment, also of sequences of different length.
However, k-means will not get you anywhere.
K-means is not meant to be used with arbitrary distances. It does not actually use a distance for assignment, but the least sum of squares (which happens to be squared Euclidean distance), a.k.a. variance.
The mean must be consistent with this objective. It's not hard to see that the mean also minimizes the sum of squares. This guarantees convergence of k-means: in each single step (both assignment and mean update), the objective is reduced, thus it must converge after a finite number of steps (as there are only a finite number of discrete assignments).
But what is the mean of multiple time series of different length?
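As an illustration of such a time series distance, here is a plain dynamic time warping sketch in NumPy (no library; the function name and the toy curves are made up) that compares two click-rate curves of different length:

import numpy as np

def dtw_distance(a, b):
    # Dynamic time warping: aligns the two series, so they may differ in length.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Two click-rate curves with the same shape but shifted in time and of
# different length: DTW still sees them as close.
y1 = np.array([0.10, 0.22, 0.34, 0.30, 0.22, 0.12])
y2 = np.array([0.10, 0.11, 0.22, 0.34, 0.31, 0.22, 0.12])
print(dtw_distance(y1, y2))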

Scaling of covariance matrices

For the question "Ellipse around the data in MATLAB", in the answer given by Amro, he says the following:
"If you want the ellipse to represent
a specific level of standard
deviation, the correct way of doing is
by scaling the covariance matrix"
and the code to scale it was given as
STD = 2; %# 2 standard deviations
conf = 2*normcdf(STD)-1; %# covers around 95% of population
scale = chi2inv(conf,2); %# inverse chi-squared with dof=#dimensions
Cov = cov(X0) * scale;
[V D] = eig(Cov);
I don't understand the first 3 lines of the above code snippet. How is the scale calculated by chi2inv(conf,2), and what is the rationale behind multiplying it with the covariance matrix?
Additional Question:
I also found that if I scale it with 1.5 STD (i.e. the ~86th percentile), the ellipse covers all of the points, since my point sets are clumped together in almost all cases. On the other hand, if I scale it with 3 STD (i.e. the ~99th percentile), the ellipse is far too big. How can I choose an STD that just tightly covers the clumped points?
Here is an example:
The inner ellipse corresponds to 1.5 STD and the outer to 2.5 STD. Why does 1.5 STD tightly cover the clumped white points? Is there any principled approach to choosing it?
The objective of displaying an ellipse around the data points is to show the confidence region, or in other words, "how much of the data is within a certain number of standard deviations away from the mean".
In the above code, he has chosen to display an ellipse that covers 95% of the data points. For a normal distribution, ~68% of the data lies within 1 s.d. of the mean, ~95% within 2 s.d. and ~99.7% within 3 s.d. (you can easily verify this by calculating the area under the curve). Hence the value STD=2; you'll find that conf is approximately 0.95.
The distance of the data points from the centroid of the data goes something like (xi^2+yi^2)^0.5, ignoring coefficients. Sums of squares of Gaussian random variables follow a chi-square distribution, and hence to get the corresponding 95th percentile he uses the inverse chi-square function with 2 d.o.f., as there are two variables.
Lastly, the rationale behind multiplying by the scaling constant follows from the fact that for a square matrix A with eigenvalues a1,...,an, the eigenvalues of kA, where k is a scalar, are simply ka1,...,kan. The eigenvalues give the corresponding lengths of the major/minor axes of the ellipse, so scaling the ellipse (or the eigenvalues) to the 95th percentile is equivalent to multiplying the covariance matrix by the scaling factor.
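For reference, here is a rough NumPy/SciPy sketch of the same recipe (scipy.stats.norm.cdf and chi2.ppf stand in for normcdf and chi2inv; the function name is my own):

import numpy as np
from scipy.stats import norm, chi2

def covariance_ellipse(points, n_std=2.0, n_segments=100):
    # Coverage matching +/- n_std in one dimension, as in the MATLAB snippet.
    conf = 2 * norm.cdf(n_std) - 1          # ~0.954 for n_std = 2
    scale = chi2.ppf(conf, df=2)            # inverse chi-square, 2 d.o.f.
    centre = points.mean(axis=0)
    cov = np.cov(points, rowvar=False) * scale
    eigvals, eigvecs = np.linalg.eigh(cov)  # squared axis lengths, directions
    theta = np.linspace(0, 2 * np.pi, n_segments)
    circle = np.column_stack([np.cos(theta), np.sin(theta)])
    # Map the unit circle onto the ellipse defined by the scaled covariance.
    return centre + circle @ np.diag(np.sqrt(eigvals)) @ eigvecs.T

# Points from a correlated 2-D Gaussian: the ellipse returned for n_std = 2
# should enclose roughly 95% of them.
rng = np.random.default_rng(0)
pts = rng.multivariate_normal([1, 2], [[2.0, 1.2], [1.2, 1.0]], size=500)
ellipse = covariance_ellipse(pts, n_std=2.0)  # (n_segments, 2) x/y coordinates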
EDIT
Cheng, although you might already know this, I suggest that you also read this answer to a question on randomness. Consider a Gaussian random variable with zero mean, unit variance. The PDF of a collection of such random variables looks like this
Now, if I were to take two such collections of random variables, square them separately and add them to form a single collection of a new random variable, its distribution looks like this
This is the chi-square distribution with 2 degrees of freedom (since we added two collections).
The equation of the ellipse in the above code can be written as x^2/a^2 +y^2/b^2=k, where x,y are the two random variables, a and b are the major/minor axes, and k is some scaling constant that we need to figure out. As you can see, the above can be interpreted as squaring and adding two collections of Gaussian random variables, and we just saw above what its distribution looks like. So, we can say that k is a random variable that is chi-square distributed with 2 degrees of freedom.
Now all that needs to be done is to find a value for k such that 95% of the data is within it. Just like the 1 s.d., 2 s.d., 3 s.d. percentiles that we're familiar with for Gaussians, the percentile corresponding to 2 s.d. (~95.4%) for a chi-square with 2 degrees of freedom is around 6.18. This is what Amro obtains from the chi2inv function. He could just as well have written scale=chi2inv(0.95,2) and it would have been nearly the same. It's just that talking in terms of n s.d. away from the mean is more intuitive.
Just to illustrate, here's the PDF of the chi-square distribution above, with ~95% of the area below some x shaded in red. This x is ~6.18.
Hope this helped.
