Mathematical representation of a set of points in N dimensional space?

Given x data points in an N dimensional space, I am trying to find a fixed-length representation that could describe any subset s of those x points. For example, the mean of the subset s describes that subset, but it is not unique to that subset: other points in the space could yield the same mean, so the mean is not a unique identifier. Could anyone tell me of a unique measure that could describe the points without depending on the number of points?

In short: it is impossible (you would achieve infinite noiseless compression). You either have to use a variable-length representation (or a fixed-length one whose length is proportional to the maximum number of points), or deal with "collisions" (as your mapping will not be injective). In the first scenario you can simply store the coordinates of each point. In the second you approximate your point clouds with more and more complex descriptors to balance collisions against memory usage; some possibilities are:
storing the mean and covariance (basically performing maximum likelihood estimation over the Gaussian family; see the sketch after this list)
performing some fixed-complexity density estimation, like a Gaussian Mixture Model, or training a generative neural network
using a set of simple geometrical/algebraic properties, such as:
number of points
the mean, max, min, and median of the distances between pairs of points
etc.
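
As a minimal sketch of the mean-and-covariance descriptor (assuming numpy; the function name is just for illustration):

    import numpy as np

    def gaussian_descriptor(points):
        # Fixed-length descriptor of a point cloud: the ML estimates of a
        # Gaussian's mean and covariance. Its length depends only on the
        # dimension N, never on the number of points, so distinct clouds
        # can collide on the same descriptor.
        points = np.asarray(points, dtype=float)   # shape (num_points, N)
        mean = points.mean(axis=0)                 # N values
        cov = np.cov(points, rowvar=False)         # N*N values (symmetric)
        return np.concatenate([mean, cov.ravel()])

    # Two clouds of different sizes in 3D map to descriptors of equal length:
    a = gaussian_descriptor(np.random.rand(10, 3))
    b = gaussian_descriptor(np.random.rand(500, 3))
    assert a.shape == b.shape == (12,)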

Any subset can be identified by a bit mask of length x, where bit i is 1 if the i-th point belongs to the subset (there are 2^x possible subsets, so x bits are needed). There is no fixed-length representation that is not a function of x.
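For concreteness, a minimal sketch of that bitmask encoding (x bits for x points; the helper names are assumptions):

    def subset_to_mask(subset_indices, x):
        # Encode a subset of x indexed points as an integer bitmask of x
        # bits. Bit i is 1 iff point i belongs to the subset, so the
        # encoding is injective over all 2**x subsets.
        mask = 0
        for i in subset_indices:
            mask |= 1 << i
        return mask

    def mask_to_subset(mask, x):
        # Decode the bitmask back into the set of point indices.
        return {i for i in range(x) if mask & (1 << i)}

    assert mask_to_subset(subset_to_mask({0, 2, 5}, 8), 8) == {0, 2, 5}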
EDIT
I was wrong. PCA is a good way to perform dimensionality reduction for this problem, but it won't work for some sets.
However, you can almost do it, where "almost" is formally defined by the Johnson-Lindenstrauss lemma. It states that for a given large dimension N, there exists a much lower dimension n and a linear transformation that maps each point from N dimensions to n dimensions while keeping the Euclidean distance between every pair of points within some error ε of the original. Such a linear transformation is called the JL transform.
In other words, your problem is only solvable for sets of points where each pair of points is separated by at least ε. For this case, the JL transform gives you one possible solution. Moreover, there exists a relationship between N, n and ε (see the lemma) such that, for example, if N=100, the JL transform can map each point to a point in 5D (n=5) and uniquely identify each subset, if and only if the minimum distance between any pair of points in the original set is at least ~2.8 (i.e. the points are sufficiently different).
Note that n depends only on N and the minimum distance between any pair of points in the original set. It does not depend on the number of points x, so it is a solution to your problem, albeit with some constraints.
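A sketch of such a projection using a Gaussian random matrix (one standard JL construction; the dimensions and seed below are arbitrary, and the actual distortion guarantee depends on n and ε as stated in the lemma):

    import numpy as np

    rng = np.random.default_rng(0)
    N, n, num_points = 100, 20, 50

    points = rng.normal(size=(num_points, N))
    # Gaussian random projection, scaled so squared distances are
    # preserved in expectation.
    R = rng.normal(size=(N, n)) / np.sqrt(n)
    projected = points @ R

    # Compare one pairwise distance before and after projection.
    d_orig = np.linalg.norm(points[0] - points[1])
    d_proj = np.linalg.norm(projected[0] - projected[1])
    print(f"original: {d_orig:.3f}, projected: {d_proj:.3f}")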

Related

How to calculate NME (Normalized Mean Error) between ground-truth and predicted landmarks when some ground-truth points have no correspondence in the prediction?

I am trying to train a facial landmark detection model, and I notice that many of them use NME (Normalized Mean Error) as the performance metric.
The formula is straightforward: it computes the L2 distance between the ground-truth points and the model's predictions, then divides by a normalization factor, which varies between datasets.
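In one common form (the notation here is an assumption: M landmarks, predicted points p_i, ground-truth points g_i, normalization factor d):

    \mathrm{NME} = \frac{1}{M} \sum_{i=1}^{M} \frac{\lVert p_i - g_i \rVert_2}{d}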
However, when applying this formula to a landmark detector that someone developed, I have to deal with a non-trivial situation: a detector may not be able to generate the full number of landmarks for some input images (perhaps because of NMS, inherent model problems, image quality, etc.). Thus some ground-truth points may not have a corresponding point in the prediction result.
So how should I solve this problem? Should I just add such missing points to a "failure result set", use the failure rate (FR) to measure the model, and ignore them in the NME calculation?
If the output of your neural network is, for example, a 10x1 vector, those are your points laid out as [x1,y1,x2,y2,...,x5,y5]. This vector has a fixed length because of the number of output neurons in your model.
If you have missing points (say you get 4 out of 5), it is usually because some points fall beyond the image width and height, or have negative coordinates like [-0.1, -0.2, 0.5, 0.7, ...]: the first two points there are not visible in the image, so they appear to be missing, but they are still in the vector and you can still calculate the NME.
In some custom neural nets this can be handled by replacing missing values with maximum-error points.
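
As a hypothetical sketch of that idea (the visibility test and maximum-error convention here are assumptions for illustration, not a standard):

    import numpy as np

    def nme(pred, gt, norm_factor, max_error=1.0):
        # Per-image NME where points with negative coordinates are treated
        # as maximum-error points instead of being dropped, so the metric
        # stays defined when the detector "misses" landmarks.
        pred, gt = np.asarray(pred, float), np.asarray(gt, float)
        errors = np.linalg.norm(pred - gt, axis=1)
        visible = (pred >= 0).all(axis=1)           # crude visibility test
        errors[~visible] = max_error * norm_factor  # penalize missing points
        return errors.mean() / norm_factor

    gt = [[10, 10], [20, 20], [30, 30]]
    pred = [[11, 9], [-5, -5], [29, 31]]            # second point "missing"
    print(nme(pred, gt, norm_factor=50.0))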

Find vector in which points are more sparse

(1) I have n points in 3D space
(2) I have a random vector
(3) I project all n points onto the vector
Then I find the average distance between all points
How could I find the vector such that, after projecting the points onto it, the average distance between points is greatest?
Can this be done in O(n)?
There is one method which you can use from machine learning, specifically dimensionality reduction. (This is based on PCA which was mentioned in one of the comments.)
Compute the covariance matrix.
Find the eigenvalues and the eigenvectors.
The eigenvector with the largest eigenvalue will correspond to the direction of the most variance, so the direction in which the points are most spread out.
Map the points onto the line defined by the vector.
Centering the points around 0 before the projection, and moving them back afterwards, may be needed as well. The issue with this is that it is quite expensive in terms of time. For more details look at this question: How is the complexity of PCA O(min(p^3,n^3))?
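A sketch of those steps with numpy (the function name is an assumption):

    import numpy as np

    def most_spread_direction(points):
        # Unit vector along which the projected points have the greatest
        # variance: the eigenvector of the covariance matrix with the
        # largest eigenvalue (the first principal component).
        points = np.asarray(points, dtype=float)
        centered = points - points.mean(axis=0)   # center around 0 first
        cov = np.cov(centered, rowvar=False)      # 3x3 covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: cov is symmetric
        return eigvecs[:, np.argmax(eigvals)]

    points = np.random.rand(1000, 3) * [10, 1, 1]  # spread mostly along x
    v = most_spread_direction(points)
    projections = points @ v                       # map points onto the line
    print(v)                                       # roughly +/- [1, 0, 0]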

FFT frequency domain values depend on sequence length?

I am extracting Heart Rate Variability (HRV) frequency-domain features, e.g. LF and HF, using the FFT. Currently I have found that the LF and HF values for a longer sequence, e.g. 3 minutes, are larger than for a shorter sequence, e.g. 30 seconds. I wonder if this is a common observation, or whether there are bugs in my code? Thanks in advance.
Yes, the frequency in each bin depends on N, the sequence length.
See this related answer: https://stackoverflow.com/a/4371627/119527
An FFT by itself is a dimensionless basis transform. But if you know the sample rate (Fs) of the input data and the length (N) of the FFT, then the center frequency represented by each FFT result element or result bin is bin_index * (Fs/N).
Normally (with baseband sampling) the resulting range is from 0 (DC) up to Fs/2 (for strictly real input the rest of the FFT results are just a complex conjugate mirroring of the first half).
Added: Also, many forward FFT implementations (but not all) are energy preserving. Since a longer signal of the same amplitude fed into a longer FFT contains more total energy, the FFT result energy will also be greater in the same proportion, either in bin magnitude for sufficiently narrow-band components, and/or by distribution into more bins.
What you are observing is to be expected, at least with most common FFT implementations. Typically the forward transform is unnormalized (an implicit factor of N relative to the signal amplitude) and the 1/N factor appears in the inverse, so you need to scale the forward FFT bins by 1/N yourself if you are interested in calculating spectral energy (or power).
Note however that these scale factors are just a convention; e.g. some implementations apply a 1/sqrt(N) factor in both the forward and inverse directions, so you need to check the documentation for your FFT library to be absolutely certain.
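A quick numpy check of this effect (numpy's np.fft leaves the forward transform unnormalized, so the raw bin magnitude grows with the record length until you divide by N):

    import numpy as np

    fs = 4.0                                # sample rate in Hz
    for seconds in (30, 180):               # a 30 s vs a 3 min record
        n = int(fs * seconds)
        t = np.arange(n) / fs
        x = np.sin(2 * np.pi * 0.1 * t)     # 0.1 Hz tone (LF band)
        spectrum = np.abs(np.fft.rfft(x))
        k = round(0.1 * n / fs)             # bin index: f * N / Fs
        print(f"{seconds:>3} s: raw bin = {spectrum[k]:7.1f}, "
              f"scaled by 1/N = {spectrum[k] / n:.3f}")

The raw magnitudes differ by the same factor as the record lengths, while the 1/N-scaled values agree.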

How to cluster curves with k-means?

I want to cluster some curves containing daily click rates.
The dataset is click-rate data as time series, e.g.:
y1 = [time1:0.10,time2:0.22,time3:0.344,...]
y2 = [time1:0.10,time2:0.22,time3:0.344,...]
I don't know how to measure the similarity of two curves when using k-means.
Is there any paper for this purpose or some library?
For similarity, you could use any kind of time-series distance. Many of these perform alignment, including of sequences of different lengths.
However, k-means will not get you anywhere.
K-means is not meant to be used with arbitrary distances. It actually does not use distance for assignment, but the least sum of squares (which happens to be squared Euclidean distance), i.e. variance.
The mean must be consistent with this objective. It's not hard to see that the mean also minimizes the sum of squares. This guarantees convergence of k-means: in each single step (both assignment and mean update), the objective is reduced, thus it must converge after a finite number of steps (as there are only a finite number of discrete assignments).
But what is the mean of multiple time series of different length?
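One way to follow this advice (a sketch, not the only option) is to compute a DTW distance matrix, which aligns series of different lengths, and feed it to an algorithm that accepts arbitrary distances, such as hierarchical clustering from scipy:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def dtw(a, b):
        # Plain O(len(a) * len(b)) dynamic-time-warping distance.
        d = np.full((len(a) + 1, len(b) + 1), np.inf)
        d[0, 0] = 0.0
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = abs(a[i - 1] - b[j - 1])
                d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
        return d[-1, -1]

    # Five curves of different lengths: three sine-like, two cosine-like.
    curves = [np.sin(np.linspace(0, 6, n)) for n in (50, 60, 55)]
    curves += [np.cos(np.linspace(0, 6, n)) for n in (50, 70)]
    m = len(curves)
    dist = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            dist[i, j] = dist[j, i] = dtw(curves[i], curves[j])

    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=2, criterion="maxclust")
    print(labels)  # the sine and cosine curves should form two groups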

An "asymmetric" pairwise distance matrix

Suppose there are three sequences to be compared: a, b, and c. Traditionally, the resulting 3-by-3 pairwise distance matrix is symmetric, indicating that the distance from a to b is equal to the distance from b to a.
I am wondering if TraMineR provides some way to produce an asymmetric pairwise distance matrix.
No, TraMineR does not produce 'asymmetric' dissimilarities, precisely for the reasons stressed in Pat's comment.
The main interest of computing pairwise dissimilarities between sequences is that once we have such dissimilarities we can for instance
measure the discrepancy among sequences, determine neighborhoods, find medoids, ...
run cluster algorithms, self-organizing maps, MDS, ...
make ANOVA-like analysis of the sequences
grow regression trees for the sequences
Inputting a non-symmetric dissimilarity matrix into those processes would most probably generate irrelevant outcomes.
It is because of this symmetry requirement that the substitution costs used for computing Optimal Matching distances MUST be symmetric. It is important not to interpret substitution costs as the cost of switching from one state to the other, but to understand them for what they are, i.e., edit costs. When comparing two sequences, for example aabcc and aadcc, we can make them equal either by replacing b with d in the first one or d with b in the second one. It would then make no sense to give different costs to the two substitutions.
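To make the symmetry argument concrete, here is a small Levenshtein-style sketch in Python (TraMineR itself is an R package; the costs here are illustrative assumptions):

    import numpy as np

    def edit_distance(s1, s2, sub_cost=lambda a, b: 0 if a == b else 2,
                      indel_cost=1):
        # Optimal-matching-style edit distance. Because sub_cost(a, b)
        # equals sub_cost(b, a), the resulting distance is symmetric:
        # d(s1, s2) == d(s2, s1).
        m, n = len(s1), len(s2)
        d = np.zeros((m + 1, n + 1))
        d[:, 0] = np.arange(m + 1) * indel_cost
        d[0, :] = np.arange(n + 1) * indel_cost
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i, j] = min(d[i - 1, j] + indel_cost,
                              d[i, j - 1] + indel_cost,
                              d[i - 1, j - 1] + sub_cost(s1[i - 1], s2[j - 1]))
        return d[m, n]

    assert edit_distance("aabcc", "aadcc") == edit_distance("aadcc", "aabcc") == 2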
Hope this helps.
