Find the vector along which points are most sparse - math

(1) I have n points in 3D space.
(2) I have a random vector.
(3) I project all n points onto the vector.
Then I find the average distance between all projected points.
How can I find the vector such that, after projecting the points onto it, the average distance between the projected points is the greatest?
Can this be done in O(n)?

There is one method you can use from machine learning, specifically dimensionality reduction. (This is based on PCA, which was mentioned in one of the comments.)
Compute the covariance matrix of the points.
Find its eigenvalues and eigenvectors.
The eigenvector with the largest eigenvalue corresponds to the direction of greatest variance, i.e. the direction in which the points are most spread out.
Map the points onto the line defined by that eigenvector.
Centring the points around 0 before the projection, and moving them back afterwards, may be needed as well. The issue with this is that it is quite expensive in terms of time. For more details look at this question: How is the complexity of PCA O(min(p^3,n^3))?
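As a rough illustration, the steps above might look like this in R (a minimal sketch; the variable names are made up here):

pts <- matrix(rnorm(300), ncol = 3)                   # n x 3 matrix of 3D points
centered <- scale(pts, center = TRUE, scale = FALSE)  # centre the points around 0
C <- cov(centered)                                    # 3 x 3 covariance matrix
e <- eigen(C)                                         # eigenvalues and eigenvectors
v <- e$vectors[, which.max(e$values)]                 # direction of greatest variance
projected <- centered %*% v                           # scalar projections onto v

Note that for 3D points the covariance matrix is only 3x3, so the eigendecomposition itself is cheap; the cost concern above applies to PCA in general, for high-dimensional data.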

Related

Why find the Hamming Distance in Dynamical Networks?

In dynamical networks, one may calculate the Hamming distance to compare the similarity between two graphs; can anyone explain how?
Assuming that the two graphs have equal edge density, what is the difference between the Hamming distance and the expected Hamming distance between two independent Erdos-Renyi random graphs? How does the latter arise?
The Hamming distance measures the minimum number of substitutions required to change (transform) one mathematical 'object' (e.g. a string or binary vector) into another; for example, the strings 10110 and 11100 differ in two positions, so their Hamming distance is 2.
So in network theory it can be defined as the number of connections that differ between two networks (it can also be formulated for networks of different sizes and for weighted or directed graphs). In the simple case in which you have two Erdos-Renyi networks (the adjacency matrix has 1 if a node pair is connected and 0 if not), the distance between graphs with adjacency matrices A and B is defined as follows:

H(A, B) = (1 / (N(N-1))) * Σ_{i≠j} |a_ij - b_ij|

where the values that are subtracted, a_ij and b_ij, are the entries of the two adjacency matrices and N is the number of nodes. If you take two Erdos-Renyi networks with wiring probability 0.5 and compute the Hamming distance between them, you should get a value around 0.5. I generated different Erdos-Renyi graphs, and their Hamming distances produced a Gaussian curve centred around 0.5, as we would expect.
If needed, I can give you the code I used.
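For illustration, a minimal R sketch of such an experiment (this is not the original code mentioned above; the normalization follows the formula given earlier):

set.seed(1)
N <- 100
p <- 0.5
er_graph <- function(N, p) {
  A <- matrix(0, N, N)
  A[upper.tri(A)] <- rbinom(N * (N - 1) / 2, 1, p)  # random upper triangle
  A + t(A)                                          # symmetrize; diagonal stays 0
}
A <- er_graph(N, p)
B <- er_graph(N, p)
H <- sum(abs(A - B)) / (N * (N - 1))  # fraction of differing connections
H  # close to 2 * p * (1 - p) = 0.5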

Mathematical representation of a set of points in N dimensional space?

Given some x data points in an N-dimensional space, I am trying to find a fixed-length representation that could describe any subset s of those x points. For example, the mean of the subset s could describe that subset, but it is not unique to that subset: other points in the space could yield the same mean, so the mean is not a unique identifier. Could anyone tell me of a unique measure that could describe the points without depending on the number of points?
In short - it is impossible (as you would achieve infinite noiseless compression). You either have to use a variable-length representation (or a fixed length proportional to the maximum number of points), or deal with "collisions" (as your mapping will not be injective). In the first scenario you can simply store the coordinates of each point. In the second, you approximate your point clouds with more and more complex descriptors to balance collisions against memory usage; some possibilities are:
storing the mean and covariance (basically performing maximum-likelihood estimation over Gaussian families; see the sketch after this list)
performing some fixed-complexity density estimation, like a Gaussian Mixture Model, or training a generative neural network
using a set of simple geometric/algebraic properties, such as:
number of points
mean, max, min, or median distance over all pairs of points
etc.
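A minimal sketch of the first option in R (describe_subset is a made-up name): the mean plus the upper triangle of the covariance matrix yields a descriptor whose length depends only on the dimension N, not on the number of points.

describe_subset <- function(points) {
  m <- colMeans(points)                  # N values
  C <- cov(points)                       # N x N, symmetric
  c(m, C[upper.tri(C, diag = TRUE)])     # N + N(N+1)/2 values in total
}
s <- matrix(rnorm(50 * 4), ncol = 4)     # a subset of 50 points in 4D
length(describe_subset(s))               # 4 + 10 = 14, however many points s holds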
Any subset can be identified by a bit mask of length x, where bit i is 1 if the i-th element belongs to the subset. There is no fixed-length representation that is not a function of x.
EDIT
I was wrong. PCA is a good way to perform dimensionality reduction for this problem, but it won't work for some sets.
However, you can almost do it, where "almost" is formally defined by the Johnson-Lindenstrauss lemma: for a given large dimension N, there exist a much lower dimension n and a linear transformation mapping each point from N to n dimensions while keeping the Euclidean distance between every pair of points of the set within some error ε of the original. Such a linear transformation is called the JL transform.
In other words, your problem is only solvable for sets of points where each pair of points is separated by at least ε. For this case, the JL transform gives you one possible solution. Moreover, there is a relationship between N, n and ε (see the lemma) such that, for example, if N=100, the JL transform can map each point to a point in 5D (n=5) and uniquely identify each subset, if and only if the minimum distance between any pair of points in the original set is at least ~2.8 (i.e. the points are sufficiently different).
Note that n depends only on N and the minimum distance between any pair of points in the original set. It does not depend on the number of points x, so it is a solution to your problem, albeit with some constraints.
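In practice such a transform is commonly realized with a random Gaussian projection. A minimal sketch (this standard construction is an assumption here, not something spelled out in the answer above):

set.seed(42)
N <- 100; n <- 5
X <- matrix(rnorm(20 * N), ncol = N)           # 20 points in 100D
R <- matrix(rnorm(N * n), nrow = N) / sqrt(n)  # random projection matrix
Y <- X %*% R                                   # the same 20 points in 5D
# pairwise distances are approximately preserved (ratios close to 1):
summary(as.vector(dist(Y)) / as.vector(dist(X)))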

Calculate Euclidean Distance of pairs over 3 points?

MY DATA
I have a matrix Median that contains three qualities, Speed, Angle & Acceleration, in virtual 3D space. Each set of qualities belongs to an individual person, termed Class.
Speed<-c(18,21,25,19)
Angle<-c(90,45,90,120)
Acceleration<-c(4,5,9,4)
Class<-c("Nigel","Paul","Kelly","Steve")
Median = data.frame(Class,Speed,Angle,Acceleration)
mm = as.matrix(Median)
In the example above, Nigel's Speed, Angle and Acceleration qualities would be (18,90,4).
MY PROBLEM
I wish to know the Euclidean distance between each individual person/class; for example, the Euclidean distance between Nigel and Paul, Nigel and Kelly, etc. I then wish to display the results in a dendrogram, as the result of hierarchical clustering.
WHAT I HAVE (UNSUCCESSFULLY) ATTEMPTED
I first used hc = hclust(dist(mm)) and then plot(hc), although this results in a dendrogram of Speed only. It seems the function pdist() can compute distance between two matrices of observations, but I have three matrices. Is this possible in R? I am new to the language and found a similar question in MATLAB (Calculating Euclidean distance of pairs of 3D points in matlab), but how do I write this in R code?
Many thanks.
When you transform your data.frame into a matrix, all values become characters; I don't think that is what you want... (Moreover, you are trying to compute distances with the "Class" names as one of the variables...)
The best approach would be to put your "Class" values as row.names and then compute your distances and the hclust:
mm<-Median[,-1]
row.names(mm)<-Median[,1]
Then you can compute the Euclidean distances between the classes with dist(mm, method="euclidean"):
> dist(mm,method="euclidean")
          Nigel      Paul     Kelly
Paul  45.110974
Kelly  8.602325 45.354162
Steve 30.016662 75.033326 31.000000
Finally, perform your hierarchical classification :
hac<-hclust(dist(mm,method="euclidean"))
and plot(hac,hang=-1) to display the dendrogram.

Mahalanobis distance in R, error: system is computationally singular

I'd like to calculate multivariate distance from a set of points to the centroid of those points. Mahalanobis distance seems to be suited for this. However, I get an error (see below).
Can anyone tell me why I am getting this error, and if there is a way to work around it?
If you download the coordinate data and the associated environmental data, you can run the following code.
require(maptools)
occ <- readShapeSpatial('occurrences.shp')
load('envDat.Rdata')
#standardize the data to scale the variables
dat <- as.matrix(scale(dat))
centroid <- dat[1547,] #let's assume this is the centroid in this case
#Calculate multivariate distance from all points to centroid
mahalanobis(dat,center=centroid,cov=cov(dat))
Error in solve.default(cov, ...) :
system is computationally singular: reciprocal condition number = 9.50116e-19
The Mahalanobis distance requires you to calculate the inverse of the covariance matrix. The function mahalanobis internally uses solve, which is a numerical way to calculate the inverse. Unfortunately, if some of the numbers used in the inverse calculation are very small, it assumes that they are zero, leading to the assumption that the matrix is singular. This is why the error says computationally singular: the matrix might not be singular given a different tolerance.
The solution is to set the tolerance for when it assumes that they are zero. Fortunately, mahalanobis allows you to pass this parameter (tol) to solve:
mahalanobis(dat,center=centroid,cov=cov(dat),tol=1e-20)
# [1] 24.215494 28.394913 6.984101 28.004975 11.095357 14.401967 ...
mahalanobis uses the covariance matrix cov (more precisely, its inverse) to transform the coordinate system, then computes the Euclidean distance in the new coordinates. A standard reference is Duda & Hart, "Pattern Classification and Scene Analysis".
It looks like your cov matrix is singular. Perhaps there are linearly dependent columns in dat that are unnecessary? Setting the tolerance to zero won't help if the covariance matrix is truly singular. The first thing to do, instead, is to look for columns that might be a rescaling of some other column, or a sum of two or more other columns, and remove them. Such columns are redundant for the Mahalanobis distance.
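A minimal sketch of one way to look for such columns, assuming the standardized matrix dat from the question: a pivoted QR decomposition reveals the rank and flags candidate columns.

qr_dat <- qr(dat)
qr_dat$rank   # if this is smaller than ncol(dat), some columns are dependent
# the columns pivoted past the rank are candidates to inspect and remove:
colnames(dat)[qr_dat$pivot[-seq_len(qr_dat$rank)]]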
BTW, since the Mahalanobis distance is effectively a rescaling and rotation, calling the scaling function looks superfluous - any reason why you want that?

How to square a symmetric matrix using only upper or lower triangle?

I am attempting to calculate a topological overlap measure matrix without using TOMsimilarity() in the WGCNA package. In this calculation, I need to square a large (44000x44000) symmetric matrix.
Is it possible to do this by only using either the upper or lower triangle of the matrix?
I've seen it completed by creating a distance matrix of the symmetric matrix, but I was hoping someone could guide me in another direction.
The goal is to complete the calculation as quickly as possible.
Currently, the code is as follows:
correlation <- cor(data)                      # correlation between the columns of data
adjacency <- (0.5 * (1 + correlation))^2      # signed adjacency with soft power 2
sum <- apply(adjacency, 1, sum)               # connectivity of each node (note: shadows base::sum)
summatrix <- matrix(sum, ncol = length(sum), nrow = length(sum))
min.k <- pmin(summatrix, t(summatrix))        # pairwise minimum connectivity
num <- adjacency %*% adjacency + adjacency    # numerator: the expensive squaring step
den <- min.k + 1 - adjacency                  # denominator
tom <- num / den                              # topological overlap measure
diag(tom) <- 1
disstom <- 1 - tom                            # TOM dissimilarity
Thanks in advance!
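One observation that may help with the squaring step (a sketch, not from the original thread): since adjacency is symmetric, t(adjacency) %*% adjacency equals adjacency %*% adjacency, so base R's crossprod, which computes t(A) %*% A and internally fills only one triangle of the result, can stand in for the explicit product and is typically faster:

# adjacency is symmetric, so this equals adjacency %*% adjacency + adjacency
num <- crossprod(adjacency) + adjacency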
