Can we test if distance matrices are significantly farther apart? (R)

I work in the field of linguistics, and my current project involves many distance matrices generated from language data, which measure the distance (dissimilarity) between dialects. Concretely, my distance matrices range between 0 and 1, where 0 represents no distance and 1 represents maximal distance between dialects. I am wondering whether there exist statistical significance tests, or something like that, with which we can test whether dialects A and B are significantly farther apart than B and C. Alternatively, is there a customary threshold, say 0.5, such that distances > 0.5 indicate that dialects are more different than similar? For instance, consider the distances between CMM_Press and CTM_Press on the one hand, and CMM_Other and CTM_Other on the other hand, in the distance matrices below.
Of the four distance matrices, I am especially interested in the distances of the following pairs:
CMM_Press and CTM_Press: 0.2, 0.19, 0.5, 0.4;
CMM_Other and CTM_Other: 0.6, 0.41, 0.4, 0.69.
Are there significance tests with which I can test whether, say, CMM_Other and CTM_Other are significantly farther apart than CMM_Press and CTM_Press?
To facilitate answering these questions, you can find the dataset, an R Markdown file containing the distance matrices, and the scripts for the analysis in this OSF link.
In addition, I would like to know whether there exists a good reference on how to interpret distance matrices (e.g., from an ecology point of view, where the Mantel test was invented).

Related

substitution matrix based on spatial autocorrelation transformation

I would like to measure the Hamming sequence similarity in which the substitution costs are not based on the substitution rates in the observed sequences, but on the spatial autocorrelation of the different states within the study area (the states are thus not related to DNA but to something else).
I divided my study area into grid cells of equal size (e.g., 1000 m) and measured how often the same "state" is observed in a neighbouring cell (rook case). Consequently, the weight matrix indicates that moving from state A to A (i.e., staying within the same state) has a much higher probability than moving from A to B, from B to C, or from A to C. This already indicates that the states have high spatial autocorrelation.
The problem is that, if you want to measure sequence similarity, the substitution matrix should be 0 on the diagonal. Therefore I was wondering whether there is a kind of transformation to go from an "autocorrelation matrix" to a substitution matrix with 0 values along the diagonal. By means of this, we would like to account for spatial autocorrelation in the study area in our sequence similarity measure. To do my analysis I am using the package TraMineR.
Example matrix in R for sequences consisting of four states (A, B, C, D):
Sequence example: AAAAAABBBBCCCCCCCCCCCCDDDDDDDDDDDDDDDDDDDDDDDAAAAAAAAA
Autocorrelation matrix:
A = c(17.50,3.00,1.00,0.05)
B = c(3.00,10.00,2.00,1.00)
C = c(1.00,2.00,30.00,3.00)
D = c(0.05,1.00,3.00,20.00)
subm = rbind(A,B,C,D)
colnames(subm) = c("A","B","C","D")
How can I transform this matrix into a substitution matrix?
First, TraMineR computes the Hamming distance, i.e., a dissimilarity, not a similarity.
The simple Hamming distance is just the count of mismatches between two sequences. For example, the Hamming distance between AABBCC and ABBBAC is 2, and between AAAAAA and AAAAAA it is 0 since there are no mismatches.
The generalized Hamming distance allows weighting mismatches (not matches!) with substitution costs. For example, if the substitution cost between A and B is 1.5, and 2 between A and C, then the distance between the first two sequences would be the weighted sum of the mismatch costs, i.e., 3.5. It would still be zero between a sequence and itself.
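To make the weighting explicit, here is a small plain-R sketch of that generalized Hamming distance (illustrative only; the function and the cost matrix are made up for this example, and TraMineR computes the distance for you):
gen_hamming <- function(s1, s2, sm) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  stopifnot(length(a) == length(b))   # Hamming requires equal-length sequences
  mism <- a != b
  sum(sm[cbind(a[mism], b[mism])])    # sum the substitution costs over the mismatches
}
states <- c("A", "B", "C")
sm <- matrix(1, 3, 3, dimnames = list(states, states))
diag(sm) <- 0                         # matches cost nothing
sm["A", "B"] <- sm["B", "A"] <- 1.5
sm["A", "C"] <- sm["C", "A"] <- 2
gen_hamming("AABBCC", "ABBBAC", sm)   # 1.5 + 2 = 3.5
gen_hamming("AAAAAA", "AAAAAA", sm)   # 0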
From what I understand, the shown matrix is not the matrix of substitution costs. It is the matrix of what you call 'spatial autocorrelations', and you are looking for a way to turn this information into substitution costs.
The idea is to assign a high substitution cost (mismatch weight) when the autocorrelation (a rate, in your case) is low, i.e., when there is a low probability of finding, say, state B in the neighborhood of state A, and to assign a low substitution cost when the probability is high. Since your probability matrix is symmetric, a simple solution is to use $1 - p(A|B)$ for all off-diagonal terms and leave 0 on the diagonal, for the reason explained above.
sm <- 1 - subm/100
diag(sm) <- 0
sm
For non-symmetric probabilities, you could use a formula similar to the one used for deriving costs from transition rates, i.e., $2 - p(A|B) - p(B|A)$.
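As a rough sketch of that formula, reusing the rates from the question scaled to [0, 1] purely for illustration (with a genuinely non-symmetric rate matrix the two probability terms would differ):
pm <- subm / 100            # treat the rates as rough probabilities, as above
sm2 <- 2 - pm - t(pm)       # cost(A,B) = 2 - p(A|B) - p(B|A)
diag(sm2) <- 0              # matches still cost nothing
sm2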

Can a very large (or very small) value in feature vector using SVC bias results? [scikit-learn]

I am trying to better understand how the values of my feature vector may influence the result. For example, let's say I have the following vector with the final value being the result (this is a classification problem using an SVC, for example):
0.713, -0.076, -0.921, 0.498, 2.526, 0.573, -1.117, 1.682, -1.918, 0.251, 0.376, 0.025291666666667, -200, 9, 1
You'll notice that most of the values center around 0; however, there is one value, -200, that is orders of magnitude larger in absolute value.
I'm concerned that this value is skewing the prediction and is being weighted unfairly heavily compared to the rest, simply because its magnitude is so different.
Is this something to be concerned about when creating a feature vector? Or will the statistical test I use to evaluate my vector control for this large (or small) value, based on the training set I provide it with? Are there methods available in scikit-learn specifically that you would recommend for normalizing the vector?
Thank you for your help!
Yes, it is something you should be concerned about. SVMs are heavily influenced by differences in feature scale, so you need a preprocessing technique to make this less of a problem. The most popular ones are:
Linearly rescale each feature dimension to the [0,1] or [-1,1] interval
Normalize each feature dimension so it has mean=0 and variance=1
Decorrelate (whiten) the values by the transformation $\Sigma^{-1/2} X$, where $\Sigma = \mathrm{cov}(X)$ is the data covariance matrix
Each of these can easily be performed with scikit-learn (although for the third one you will need SciPy for the matrix square root and inversion).
I am trying to better understand how the values of my feature vector may influence the result.
Then here's the math for you. Let's take the linear kernel as a simple example. It takes a sample x and a support vector sv, and computes the dot product between them. A naive Python implementation of a dot product would be
def dot(x, sv):
    return sum(x_i * sv_i for x_i, sv_i in zip(x, sv))
Now if one of the features has a much more extreme range than all the others (either in x or in sv, or worse, in both), then the term corresponding to this feature will dominate the sum.
A similar situation arises with the polynomial and RBF kernels. The poly kernel is just a (shifted) power of the linear kernel:
def poly_kernel(x, sv, d, gamma):
    return (dot(x, sv) + gamma) ** d
and the RBF kernel is a decreasing function of the squared distance between x and sv (scaled by a constant):
from math import exp

def rbf_kernel(x, sv, gamma):
    diff = [x_i - sv_i for x_i, sv_i in zip(x, sv)]
    return exp(-gamma * dot(diff, diff))
In each of these cases, if one feature has an extreme range, it will dominate the result and the other features will effectively be ignored, except to break ties.
scikit-learn tools to deal with this live in the sklearn.preprocessing module: MinMaxScaler, StandardScaler, Normalizer.

An "asymmetric" pairwise distance matrix

Suppose there are three sequences to be compared: a, b, and c. Traditionally, the resulting 3-by-3 pairwise distance matrix is symmetric, indicating that the distance from a to b is equal to the distance from b to a.
I am wondering if TraMineR provides some way to produce an asymmetric pairwise distance matrix.
No, TraMineR does not produce 'asymmetric' dissimilarities, precisely for the reasons stressed in Pat's comment.
The main interest of computing pairwise dissimilarities between sequences is that once we have such dissimilarities we can for instance
measure the discrepancy among sequences, determine neighborhoods, find medoids, ...
run cluster algorithms, self-organizing maps, MDS, ...
make ANOVA-like analysis of the sequences
grow regression trees for the sequences
Inputting a non-symmetric dissimilarity matrix into those processes would most probably generate irrelevant outcomes.
It is because of this symmetry requirement that the substitution costs used for computing Optimal Matching distances MUST be symmetrical. It is important not to interpret substitution costs as the cost of switching from one state to the other, but to understand them for what they are, i.e., edit costs. When comparing two sequences, for example aabcc and aadcc, we can make them equal either by replacing b with d in the first one or, just as arbitrarily, d with b in the second one. It would then make no sense to assign different costs to the two substitutions.
Hope this helps.

Choosing eps and minpts for DBSCAN (R)?

I've been searching for an answer for this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data set and am using dbscan on it as follows:
library(fpc)
ds <- dbscan(USArrests,eps=20)
Choosing eps was merely trial and error in this case. However, I am wondering whether there is a function or code available to automate the choice of the best eps/minPts. I know some books recommend producing a sorted plot of the distance to the kth nearest neighbour; that is, the x-axis represents "points sorted according to distance to the kth nearest neighbour" and the y-axis represents the "kth nearest neighbour distance".
This type of plot is useful for helping to choose appropriate values for eps and minPts. I hope I have provided enough information for someone to help me out. I wanted to post a picture of what I meant; however, I'm still a newbie, so I can't post an image just yet.
There is no general way of choosing minPts. It depends on what you want to find. A low minPts means it will build more clusters from noise, so don't choose it too small.
For epsilon, there are various aspects. It again boils down to choosing whatever works for this data set, this minPts, this distance function, and this normalization. You can try a k-NN distance histogram and choose a "knee" there, but there might be no visible knee, or several.
OPTICS is a successor to DBSCAN that does not need the epsilon parameter (except for performance reasons with index support, see Wikipedia). It's much nicer, but I believe it is a pain to implement in R, because it needs advanced data structures (ideally, a data index tree for acceleration and an updatable heap for the priority queue), and R is all about matrix operations.
Naively, one can imagine OPTICS as doing all values of Epsilon at the same time, and putting the results in a cluster hierarchy.
The first thing you need to check, however (pretty much independent of whatever clustering algorithm you are going to use), is that you have a useful distance function and appropriate data normalization. If your distance degenerates, no clustering algorithm will work.
MinPts
As Anony-Mousse explained, 'A low minPts means it will build more clusters from noise, so don't choose it too small.'
minPts is best set by a domain expert who understands the data well. Unfortunately, in many cases we don't have that domain knowledge, especially after the data is normalized. One heuristic approach is to use ln(n), where n is the total number of points to be clustered.
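For instance, a rough R version of that heuristic (using the USArrests data from the question as a stand-in for your own data):
x <- scale(USArrests)             # normalized data matrix
minPts <- round(log(nrow(x)))     # ln(n) heuristic; n = number of points to cluster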
epsilon
There are several ways to determine it:
1) k-distance plot
In a clustering with minPts = k, we expect that the k-distances of core points and border points fall within a certain range, while noise points can have a much greater k-distance, so we can observe a knee point in the k-distance plot. However, sometimes there may be no obvious knee, or there can be multiple knees, which makes it hard to decide.
2) DBSCAN extensions like OPTICS
OPTICS produces hierarchical clusters, from which we can extract significant flat clusters by visual inspection; an OPTICS implementation is available in the Python module pyclustering. One of the original authors of DBSCAN and OPTICS also proposed an automatic way to extract flat clusters that requires no human intervention; for more information, you can read this paper.
3) sensitivity analysis
Basically, we want to choose a radius that clusters more of the truly regular points (points that are similar to other points) while at the same time detecting more noise (outlier points). We can plot the percentage of regular points (points belonging to a cluster) against epsilon: the x-axis shows different epsilon values and the y-axis the corresponding percentage of regular points. Hopefully we can spot a segment where this percentage is especially sensitive to the epsilon value, and we choose the upper-bound epsilon value of that segment as our parameter (a rough sketch follows this list).
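A rough R sketch of that sensitivity analysis, assuming the dbscan package (not fpc) and the scaled USArrests data from the question; the eps grid and the minPts value are placeholders:
library(dbscan)
x <- scale(USArrests)                      # normalize first
eps_grid <- seq(0.2, 2, by = 0.1)          # candidate epsilon values (placeholder range)
regular <- sapply(eps_grid, function(e) {
  cl <- dbscan(x, eps = e, minPts = 4)$cluster
  mean(cl != 0)                            # share of regular (non-noise) points; cluster 0 = noise
})
plot(eps_grid, regular, type = "b",
     xlab = "eps", ylab = "fraction of regular points")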
One common and popular way of managing the epsilon parameter of DBSCAN is to compute a k-distance plot of your dataset. Basically, you compute the k-nearest neighbours (k-NN) for each data point to understand the density distribution of your data, for different k. The k-NN is handy because it is a non-parametric method. Once you choose minPts (which strongly depends on your data), you fix k to that value. Then you use as epsilon the k-distance corresponding to the area of the k-distance plot (for your fixed k) with a low slope.
For details on choosing parameters, see the paper below on p. 11:
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 19.
For two-dimensional data: use default value of minPts=4 (Ester et al., 1996)
For more than 2 dimensions: minPts=2*dim (Sander et al., 1998)
Once you know which MinPts to choose, you can determine Epsilon:
Plot the k-distances with k=minPts (Ester et al., 1996)
Find the 'elbow' in the graph: the k-distance value at the elbow is your epsilon value.
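A minimal R sketch of that recipe, assuming the dbscan package (its kNNdistplot() function) rather than fpc, again on the scaled USArrests data from the question; the eps value read off the elbow is a placeholder:
library(dbscan)
x <- scale(USArrests)
minPts <- 2 * ncol(x)            # rule of thumb for more than 2 dimensions (Sander et al., 1998)
kNNdistplot(x, k = minPts)       # sorted k-distances; look for the elbow
abline(h = 1.2, lty = 2)         # 1.2 is a placeholder read off the plot by eye
ds <- dbscan(x, eps = 1.2, minPts = minPts)
ds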
If you have the resources, you can also test a bunch of epsilon and minPts values and see what works. I do this using expand.grid and mapply.
# Establish search parameters.
k   <- c(25, 50, 100, 200, 500, 1000)       # candidate minPts values
eps <- c(0.001, 0.01, 0.02, 0.05, 0.1, 0.2) # candidate eps values
# Perform grid search. Here dbscan() is dbscan::dbscan(), whose argument is
# minPts (fpc::dbscan uses MinPts instead), and 'data' is your numeric data matrix.
grid <- expand.grid(k = k, eps = eps)
results <- mapply(grid$k, grid$eps, FUN = function(k, eps) {
  cluster <- dbscan(data, minPts = k, eps = eps)$cluster
  counts  <- table(cluster)                 # cluster sizes; cluster 0 is noise
  cat(c("k =", k, "; eps =", eps, ";", counts, "\n"))
})
See this webpage, section 5: http://www.sthda.com/english/wiki/dbscan-density-based-clustering-for-discovering-clusters-in-large-datasets-with-noise-unsupervised-machine-learning
It gives detailed instructions on how to find epsilon. MinPts ... not so much.

how to generate pseudo-random positive definite matrix with constraints on the off-diagonal elements? [duplicate]

The user wants to impose a unique, non-trivial upper/lower bound on the correlation between every pair of variables in a variance/covariance matrix.
For example, I want a variance matrix in which all pairs of variables satisfy $0.6 < |\rho(x_i, x_j)| < 0.9$, where $\rho(x_i, x_j)$ is the correlation between variables $x_i$ and $x_j$.
Thanks.
There are MANY issues here.
First of all, are the pseudo-random deviates assumed to be normally distributed? I'll assume they are, as any discussion of correlation matrices gets nasty if we diverge into non-normal distributions.
Next, it is rather simple to generate pseudo-random normal deviates, given a covariance matrix. Generate standard normal (independent) deviates, and then transform by multiplying by the Cholesky factor of the covariance matrix. Add in the mean at the end if the mean was not zero.
And, a covariance matrix is also rather simple to generate given a correlation matrix. Just pre and post multiply the correlation matrix by a diagonal matrix composed of the standard deviations. This scales a correlation matrix into a covariance matrix.
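In R, the two steps just described look roughly like this (a minimal sketch with a hypothetical 2x2 correlation matrix; the worked example below uses MATLAB instead):
set.seed(1)
C   <- matrix(c(1, 0.7, 0.7, 1), 2, 2)    # a correlation matrix
sds <- c(2, 5)                            # standard deviations
S   <- diag(sds) %*% C %*% diag(sds)      # pre- and post-multiply: correlation -> covariance
L   <- chol(S)                            # upper-triangular Cholesky factor, t(L) %*% L == S
Z   <- matrix(rnorm(1000 * 2), ncol = 2)  # independent standard normal deviates
X   <- Z %*% L                            # correlated deviates (add the mean here if nonzero)
cov(X); cor(X)                            # close to S and C for large samples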
I'm still not sure where the problem lies in this question, since it would seem easy enough to generate a "random" correlation matrix, with elements uniformly distributed in the desired range.
So all of the above is rather trivial by any reasonable standards, and there are many tools out there to generate pseudo-random normal deviates given the above information.
Perhaps the issue is the user insists that the resulting random matrix of deviates must have correlations in the specified range. You must recognize that a set of random numbers will only have the desired distribution parameters in an asymptotic sense. Thus, as the sample size goes to infinity, you should expect to see the specified distribution parameters. But any small sample set will not necessarily have the desired parameters, in the desired ranges.
For example, here is a simple positive definite 3x3 matrix (in MATLAB). As such, it makes a very nice covariance matrix.
S = randn(3);
S = S'*S
S =
0.78863 0.01123 -0.27879
0.01123 4.9316 3.5732
-0.27879 3.5732 2.7872
I'll convert S into a correlation matrix.
s = sqrt(diag(S));
C = diag(1./s)*S*diag(1./s)
C =
1 0.0056945 -0.18804
0.0056945 1 0.96377
-0.18804 0.96377 1
Now, I can sample from a normal distribution using the Statistics Toolbox (mvnrnd should do the trick). Just as easy is to use a Cholesky factor.
L = chol(S)
L =
0.88805 0.012646 -0.31394
0 2.2207 1.6108
0 0 0.30643
Now, generate pseudo-random deviates, then transform them as desired.
X = randn(20,3)*L;
cov(X)
ans =
0.79069 -0.14297 -0.45032
-0.14297 6.0607 4.5459
-0.45032 4.5459 3.6549
corr(X)
ans =
1 -0.06531 -0.2649
-0.06531 1 0.96587
-0.2649 0.96587 1
If your desire was that the correlations must ALWAYS be greater than -0.188, then this sampling technique has failed, since the numbers are pseudo-random. In fact, that goal will be a difficult one to achieve unless your sample size is large enough.
You might employ a simple rejection scheme, whereby you do the sampling, then redo it repeatedly until the sample has the desired properties, with the correlations in the desired ranges. This may get tiring.
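A minimal R sketch of such a rejection loop (illustrative only; the covariance matrix, sample size, and correlation bounds lo/hi below are placeholders):
set.seed(2)
S  <- matrix(c(1, 0.7, 0.5, 0.7, 1, 0.7, 0.5, 0.7, 1), 3, 3)  # a positive definite covariance matrix
n  <- 50; lo <- 0.3; hi <- 0.9                                # desired range for |correlation|
L  <- chol(S)
repeat {
  X  <- matrix(rnorm(n * ncol(S)), ncol = ncol(S)) %*% L      # one candidate sample
  cm <- cor(X)
  r  <- cm[lower.tri(cm)]                                     # off-diagonal sample correlations
  if (all(abs(r) > lo & abs(r) < hi)) break                   # accept only if every pair is in range
}
cm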
An approach that might work (but one that I've not totally thought out at this point) is to use the standard scheme above to generate a random sample. Compute the correlations. If they fail to lie in the proper ranges, then identify the perturbation one would need to make to the actual (measured) covariance matrix of your data so that the correlations would be as desired. Then find a zero-mean random perturbation to your sampled data that would move the sample covariance matrix in the desired direction.
This might work, but unless I knew that this is actually the question at hand, I won't bother to go any more deeply into it. (Edit: I've thought some more about this problem, and it appears to be a quadratic programming problem, with quadratic constraints, to find the smallest perturbation to a matrix X, such that the resulting covariance (or correlation) matrix has the desired properties.)
This is not a complete answer, but a suggestion of a possible constructive method:
Looking at the characterizations of the positive definite matrices (http://en.wikipedia.org/wiki/Positive-definite_matrix) I think one of the most affordable approaches could be using the Sylvester criterion.
You can start with a trivial 1x1 random matrix with a positive determinant and expand it by one row and one column at a time, while ensuring that the new matrix also has a positive determinant (how to achieve that is up to you ^_^).
Woodship,
"First of all, are the pseudo-random deviates assumed to be normally distributed?"
yes.
"Perhaps the issue is the user insists that the resulting random matrix of deviates must have correlations in the specified range."
Yes, that's the whole difficulty
"You must recognize that a set of random numbers will only have the desired distribution parameters in an asymptotic sense."
True, but this is not the problem here: your strategy works for p=2, but fails for p>2, regardless of sample size.
"If your desire was that the correlations must ALWAYS be greater than -0.188, then this sampling technique has failed, since the numbers are pseudo-random. In fact, that goal will be a difficult one to achieve unless your sample size is large enough."
It is not a sample-size issue, because with p>2 you do not even observe convergence to the right range for the correlations as the sample size grows: I tried the technique you suggest before posting here, and it is obviously flawed.
"You might employ a simple rejection scheme, whereby you do the sampling, then redo it repeatedly until the sample has the desired properties, with the correlations in the desired ranges. This may get tiring."
Not an option: for large p (say, larger than 10) this is intractable.
"Compute the correlations. I they fail to lie in the proper ranges, then identify the perturbation one would need to make to the actual (measured) covariance matrix of your data, so that the correlations would be as desired."
Ditto
As for the QP, I understand the constraints, but I'm not sure about the way you define the objective function: by using the "smallest perturbation" of some initial matrix, you will always end up with the same (solution) matrix, with all the off-diagonal entries exactly equal to one of the two bounds (i.e., not pseudo-random). Plus, it is kind of overkill, isn't it?
Come on people, there must be something simpler
