Why find the Hamming Distance in Dynamical Networks?

In dynamical networks, one may calculate the Hamming distance to compare the similarity between two graphs. Can anyone explain how?
Assuming the two graphs have equal edge density, what is the difference between their Hamming distance and the expected Hamming distance between two independent Erdos-Renyi random graphs? How does the latter arise?

The Hamming distance measures the minimum number of substitutions required to change (transform) one mathematical 'object' (e.g. a binary string) into another.
So in network theory it can be defined as the number of connections that differ between two networks (it can also be formulated for networks of different sizes and for weighted or directed graphs). In the simple case of two Erdos-Renyi networks on the same N nodes (the adjacency matrix has 1 if the node pair is connected and 0 if not), the distance is mathematically defined as follows:

d_H(A, B) = (1 / (N(N-1))) * sum over i != j of |A_ij - B_ij|

The values that are subtracted are the entries of the two adjacency matrices A and B. If you take two Erdos-Renyi networks with wiring probability 0.5 and compute the Hamming distance between them, you should get a value around 0.5, since each node pair differs with probability 2 * 0.5 * 0.5 = 0.5. I generated many pairs of Erdos-Renyi graphs, and their Hamming distances formed a roughly Gaussian curve around 0.5, as expected.
If needed, I can share the code I used.
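For example, here is a minimal sketch in R (assuming the igraph package; the function name hamming_distance is just for illustration) that generates two independent Erdos-Renyi graphs and computes their normalised Hamming distance:

library(igraph)

# Normalised Hamming distance between two graphs on the same node set:
# the fraction of ordered node pairs whose connection status differs.
hamming_distance <- function(g1, g2) {
  a <- as.matrix(as_adjacency_matrix(g1))
  b <- as.matrix(as_adjacency_matrix(g2))
  n <- nrow(a)
  sum(abs(a - b)) / (n * (n - 1))
}

set.seed(1)
g1 <- sample_gnp(100, p = 0.5)  # Erdos-Renyi graph, wiring probability 0.5
g2 <- sample_gnp(100, p = 0.5)
hamming_distance(g1, g2)        # should come out close to 0.5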

Related

Find vector in which points are more sparse

(1) I have n points in 3D space.
(2) I have a random vector.
(3) I project all n points onto the vector.
Then I find the average distance between all the projected points.
How could I find the vector onto which, after projecting the points, the average distance between them is greatest?
Can this be done in O(n)?
There is one method you can use from machine learning, specifically dimensionality reduction. (This is based on PCA, which was mentioned in one of the comments.)
Compute the covariance matrix.
Find its eigenvalues and eigenvectors.
The eigenvector with the largest eigenvalue corresponds to the direction of greatest variance, i.e. the direction in which the points are most spread out.
Map the points onto the line defined by that eigenvector.
Centring the points around 0 before the projection, and moving them back afterwards, may be needed as well (see the sketch below). The issue with this approach is that it is quite expensive in terms of time. For more details, look at this question: How is the complexity of PCA O(min(p^3,n^3))?
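A minimal sketch of those steps in R (the variable names are just for illustration):

set.seed(1)
pts <- matrix(rnorm(300), ncol = 3)                  # n = 100 points in 3D
centred <- scale(pts, center = TRUE, scale = FALSE)  # centre the points around 0
ev <- eigen(cov(centred))                            # eigendecomposition of the covariance matrix
v <- ev$vectors[, 1]                                 # eigenvector with the largest eigenvalue
proj <- centred %*% v                                # 1D coordinates of the points along v
mean(dist(proj))                                     # average pairwise distance after projection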

Mathematical representation of a set of points in N dimensional space?

Given x data points in an N-dimensional space, I am trying to find a fixed-length representation that could describe any subset s of those x points. For example, the mean of the subset s describes it, but not uniquely: other points in the space could yield the same mean, so the mean is not a unique identifier. Could anyone tell me of a unique measure that describes the points without depending on the number of points?
In short, it is impossible (as you would achieve infinite noiseless compression). You either have to use a variable-length representation (or a fixed length proportional to the maximum number of points), or deal with "collisions" (as your mapping will not be injective). In the first scenario you can simply store the coordinates of each point. In the second you approximate your point cloud with more and more complex descriptors to balance collisions against memory usage; some possibilities are:
storing the mean and covariance (basically performing maximum likelihood estimation over the Gaussian family); see the sketch after this list
performing some fixed-complexity density estimation, like a Gaussian Mixture Model, or training a generative neural network
using a set of simple geometrical/algebraic properties, such as:
number of points
mean, max, min, median distance between each pair of points
etc.
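As an illustration of the first option, a minimal sketch in R (the function name describe_subset is just for illustration):

# Fixed-length but lossy descriptor: the maximum-likelihood Gaussian fit,
# i.e. the mean vector plus the flattened covariance matrix.
describe_subset <- function(pts) {
  c(colMeans(pts), as.vector(cov(pts)))  # length d + d^2, independent of the number of points
}

subset_pts <- matrix(rnorm(30), ncol = 3)  # a subset of 10 points in 3D
describe_subset(subset_pts)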
Any subset can be identified by a bit mask of length x, where bit i is 1 if the corresponding element belongs to the subset. There is no fixed-length representation that is not a function of x.
EDIT
I was wrong. PCA is a good way to perform dimensionality reduction for this problem, but it won't work for some sets.
However, you can almost do it, where "almost" is formally defined by the Johnson-Lindenstrauss Lemma. It states that for a given large dimension N, there exist a much lower dimension n and a linear transformation that maps each point from N to n dimensions while keeping the Euclidean distance between every pair of points of the set within some error ε of the original. Such a linear transformation is called the JL Transform.
In other words, your problem is only solvable for sets of points where each pair of points is separated by at least ε. For this case, the JL Transform gives you one possible solution. Moreover, there exists a relationship between N, n and ε (see the lemma) such that, for example, if N=100, the JL Transform can map each point to a point in 5D (n=5) and uniquely identify each subset if, and only if, the minimum distance between any pair of points in the original set is at least ~2.8 (i.e. the points are sufficiently different).
Note that n depends only on N and the minimum distance between any pair of points in the original set. It does not depend on the number of points x, so it is a solution to your problem, albeit with some constraints.
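A minimal sketch of a random projection of the JL type in R (this is a generic Gaussian random projection, not a specific library's implementation):

set.seed(1)
N <- 100
n <- 5
pts <- matrix(rnorm(20 * N), ncol = N)                # 20 points in N dimensions
proj_mat <- matrix(rnorm(N * n), nrow = N) / sqrt(n)  # random Gaussian projection matrix
low <- pts %*% proj_mat                               # the same points in n dimensions
range(dist(low) / dist(pts))                          # pairwise-distance ratios stay near 1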

Calculate Euclidean Distance of pairs over 3 points?

MY DATA
I have a data frame Median that contains three qualities, Speed, Angle and Acceleration, in a virtual 3D space. Each set of qualities belongs to an individual person, termed Class.
Speed<-c(18,21,25,19)
Angle<-c(90,45,90,120)
Acceleration<-c(4,5,9,4)
Class<-c("Nigel","Paul","Kelly","Steve")
Median = data.frame(Class,Speed,Angle,Acceleration)
mm = as.matrix(Median)
In the example above, Nigel's Speed, Angle and Acceleration qualities would be (18,90,4).
MY PROBLEM
I wish to know the Euclidean distance between each individual person/class; for example, the Euclidean distance between Nigel and Paul, Nigel and Kelly, etc. I then wish to display the results in a dendrogram, as the result of hierarchical clustering.
WHAT I HAVE (UNSUCCESSFULLY) ATTEMPTED
I first used hc = hclust(dist(mm)) and then plot(hc), but this results in a dendrogram of Speed only. It seems the function pdist() can compute the distance between two matrices of observations, but I have three variables. Is this possible in R? I am new to the language and have found a similar question for MATLAB here: Calculating Euclidean distance of pairs of 3D points in matlab. But how do I write this in R code?
Many thanks.
When you transform your data.frame into a matrix, all values become characters; I don't think that is what you want... (moreover, you're trying to compute distances with the "Class" names as one of the variables...)
The best approach would be to put your "Class" values as row.names and then compute your distances and hclust:
mm<-Median[,-1]
row.names(mm)<-Median[,1]
Then you can compute the Euclidean distances between classes with dist(mm, method="euclidean"):
> dist(mm,method="euclidean")
          Nigel      Paul     Kelly
Paul  45.110974
Kelly  8.602325 45.354162
Steve 30.016662 75.033326 31.000000
Finally, perform your hierarchical classification:
hac<-hclust(dist(mm,method="euclidean"))
and plot(hac,hang=-1) to display the dendrogram.

how to cluster curve with kmeans?

I want to cluster some curves that contain daily click rates.
The dataset is click-rate data in time series:
y1 = [time1:0.10,time2:0.22,time3:0.344,...]
y2 = [time1:0.10,time2:0.22,time3:0.344,...]
I don't know how to measure two curves' similarity using kmeans.
Is there any paper for this purpose or some library?
For similarity, you could use any kind of time series distance. Many of these perform alignment, even for sequences of different lengths.
However, k-means will not get you anywhere.
K-means is not meant to be used with arbitrary distances. It does not actually use a distance for assignment, but the least sum of squares (which happens to be squared Euclidean distance), a.k.a. variance.
The mean must be consistent with this objective. It is not hard to see that the mean also minimizes the sum of squares. This guarantees the convergence of k-means: in every step (both assignment and mean update) the objective is reduced, so it must converge after a finite number of steps (as there are only finitely many discrete assignments).
But what is the mean of multiple time series of different lengths?
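As a concrete illustration of the alternative, here is a minimal sketch in R that pairs a time series distance with hierarchical clustering instead of k-means (assuming the dtw package for Dynamic Time Warping, which copes with curves of different lengths):

library(dtw)

# Toy click-rate curves; note they may have different lengths.
curves <- list(
  y1 = c(0.10, 0.22, 0.34, 0.41),
  y2 = c(0.12, 0.20, 0.33, 0.45, 0.50),
  y3 = c(0.50, 0.42, 0.30, 0.21)
)

# Pairwise DTW distance matrix.
m <- length(curves)
d <- matrix(0, m, m, dimnames = list(names(curves), names(curves)))
for (i in seq_len(m)) {
  for (j in seq_len(m)) {
    d[i, j] <- dtw(curves[[i]], curves[[j]])$distance
  }
}

# Hierarchical clustering on the distance matrix, then plot the dendrogram.
hc <- hclust(as.dist(d))
plot(hc)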

An "asymmetric" pairwise distance matrix

Suppose there are three sequences to be compared: a, b, and c. Traditionally, the resulting 3-by-3 pairwise distance matrix is symmetric, indicating that the distance from a to b is equal to the distance from b to a.
I am wondering if TraMineR provides some way to produce an asymmetric pairwise distance matrix.
No, TraMineR does not produce 'asymmetric' dissimilarities, precisely for the reasons stressed in Pat's comment.
The main interest of computing pairwise dissimilarities between sequences is that once we have such dissimilarities we can, for instance:
measure the discrepancy among sequences, determine neighborhoods, find medoids, ...
run cluster algorithms, self-organizing maps, MDS, ...
make ANOVA-like analysis of the sequences
grow regression trees for the sequences
Inputting a non-symmetric dissimilarity matrix into those processes would most probably generate irrelevant outcomes.
It is because of this symmetry requirement that the substitution costs used for computing Optimal Matching distances MUST be symmetric. It is important not to interpret substitution costs as the cost of switching from one state to the other, but to understand them for what they are, i.e., edit costs. When comparing two sequences, for example
aabcc and aadcc, we can make them equal either by arbitrarily replacing b with d in the first one or d with b in the second one. It would then make no sense to give different costs to the two substitutions.
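A minimal sketch of this in R, assuming TraMineR's seqdef, seqsubm and seqdist functions (the constant substitution cost of 2 and the indel cost of 1 are just for illustration):

library(TraMineR)

# The two example sequences, aabcc and aadcc, in STS format.
seqs <- seqdef(c("a-a-b-c-c", "a-a-d-c-c"))

# A symmetric substitution-cost matrix: every substitution costs 2.
sm <- seqsubm(seqs, method = "CONSTANT", cval = 2)

# Symmetric Optimal Matching distances between the sequences.
seqdist(seqs, method = "OM", indel = 1, sm = sm)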
Hope this helps.
