How do I generate data from a similarity matrix? - similarity

Suppose there are 14 objects, each of which have or do not have 1000 binary features. I have a 14x14 similarity matrix, but not the raw 14x1000 data. Is there a way to reconstruct or generate something similar to the raw data, given the similarity matrix?
I tried Monte Carlo simulations, but unconstrained they would take way too much time to achieve even a low level of consistency with the original similarity matrix.
I saw this relevant question: Similarity matrix -> feature vectors algorithm?. However, they wanted to reduce not increase dimensionality. Also, I am not sure (1) which matrix or matrices to use, and (2) how to convert into a binary matrix.

It's impossible to say for sure unless you describe how the similarity scores were computed.
In general, for the usual kind of similarity scoring this is not possible: information has been lost in the transformation from individual features to aggregate statistics. The best you can hope to do is to arrive at a set of features that are consistent with the similarity scores.
I think that is what you are talking about when you say "similar to" the original. That problem is pretty interesting. Suppose similarity was computed as the dot-product of two feature vectors (ie the count of features for a pair of objects that both have value = 1/true). This is not the only choice: it is consistent with value of 0 (false) meaning no information. But it may generalize to other similarity measures.
In such a case, the problem is really a linear programming problem: a naive approach is to exhaustively search the space of possible objects - not randomly, but guided by the constraints. For example, suppose SIM(A,B) := similarity of object A and object B. Define an order on these vectors.
If SIM(A,B) = N, then choose A=B minimal (like (1,....,1 (N times), 0, .... 0 (1000-N times)), and then choose the minimum C s.t. (A,C), (B,C) have the given values. Once you find an inconsistency, backtrack, and increment.
This will find a consistent answer, although the complexity is very high (but probably better than monte carlo).
Finding a better algorithm is an interesting problem, but more than this I can't say in a SO post - that's probably a topic for a CS thesis!

Related

Different bandwidth specification in mean-shift clustering with different packages in R

I want to perform mean-shift clustering in R and found out that there are at least two packages that have this functionality: MeanShift and meanShiftR. As showed here the latter is much faster and as I tried out the first one and it took a long time to perform a clustering, I'm keen on choosing meanShiftR. However meanShiftR::meanShift function has rather uncommon way of bandwidth specification, see part of documentation:
queryData A matrix or vector of points to be classified by the mean
shift algorithm. Values must be finite and non-missing.
bandwidth A vector of length equal to the number of columns in the queryData matrix, or length one when queryData is a vector. This
value will be used in the kernel density estimate for steepest ascent
classification. The default is one for each dimension.
I'm not an expert in mean-shift clustering, but the only banwidth specifications I have found in the literature is that bandwidth is scalar or positive definite, symmetric matrix, not a vector. So is this the technical trick to represent the bandwidth and the value of bandwidth have to be the same for each dimension? Or maybe it can vary?
The other issue is that even setting the same value of bandwidth in meanShiftR package as in MeanShift::msClustering, but just replicated to match the number of columns, I've obtained totally different results, in particular much larger number of cluster. Also, the modes were rather very similar and not representative of the dataset. That made me wonder if this package works correct. Have someone even used meanShiftR? If so, maybe you could present any example as the documentation is not clear enough for me?
This isn't actually different.
One scalar per query point.

Slepc shell matrices with multiple processes

I currently use explicit matrix storage for my generalized Eigenvalue equation of the form $AX = \lambda BX$ with eigenvalue lambda and eigenvector $X$. $A$ and $B$ are pentadiagonal by blocks, Hermitian and every block is Hermitian as well.
The problem is that for large simulations memory usage gets out of hand. I would therefore like to switch to Shell matrices. An added advantage would be that then I can avoid the duplication of a lot of information, as A and B are both filled through finite differences. I.e., the first derivative of a function X can be approximated by $X_i' = \frac{X_{i+1}-X_{i-1}}{\Delta}$, so that the same piece of information appears in two places. It gets (much) worse for higher orders.
When I try to implement this in Fortran, using multiple MPI processes that each contain a part of the rows of $A$ and $B$, I stumble upon the following issue: To perform matrix multiplication, one needs the vector information of $X$ from other ranks at the end of each rank's interval, due to the off-diagonal elements of $A$ and $B$.
I found a conceptual solution by using MPI all to all commands that pass the information from these "ghosted" regions to the ranks next-door. However, I fear that this might not be most portable, and also not too elegant.
Is there any way to automate this process of taking the information from ghost zones in Petsc / Slepc?

Document similarity selfplagiarism

I have thousands of small documents from 100 different authors. Using quanteda package, I calculated cosine similarity between the authors with themselves. For example, author x has 100 texts, so I have come up with a 100 x 100 matrix of similarity. Author y has 50 texts, so I have come up with a 50 x 50 similarity matrix.
Now I want to compare these two authors. In other words, which author copies himself more? If I take the average the columns or rows and then average again the vector of means, I arrive at a number so I can compare these two means of means, but I am not sure if these proceeding is right. I hope I made myself clear.
I think the answer depends on what exactly is your quantity of interest. If this is a single summary of how similar are an author's documents to one another, then some distribution across the document similarities, within author, is probably your best means of comparing this quantity between authors.
You could save and plot the cosine similarities across an author's documents as a density, for instance, in addition to your strategy of summarising this distribution using a mean. To capture the variance I would also characterise the standard deviation of this similarity.
I'd be cautious about calling cosine similarity within author as "self-plagiarism". Cosine similarity computes a measure of distance across vector representations of bags of words, and is not viewed as a method for identifying "plagiarism". In addition, there are very pejorative connotations to the term "plagiarism", which means the dishonest representation of someone else's ideas as your own. (I don't even believe that the term "self-plagiarism" makes sense at all, but then I have academic colleagues who disagree.)
Added:
Consider the textreuse package for R, it is designed for the sort of text analysis of reuse that you are looking for.
I don't think Levenshtein distance is what you are looking for. As the Wikipedia page points out, the LD between kitten and sitting is 3, but this means absolutely nothing in substantive terms about their semantic relationship or one being an example of "re-use" of the other. An argument could be made that LD based on words might show re-use, but that's not how most algorithms e.g. http://turnitin.com implement detection for plagiarism.

How to choose k in Shi-Malik Algorithm?

I'm wondering how one chooses a specific k in Shi-Malik Algo.
Do we choose several ks and rank them via their SSE measures?
Does k reflect the number of clusters we assume for the data?
kind regards Mikey
Yes, K is the number of natural grouping we believe their is in the data.
You can find K by exploring the eigenvalues.
One tool which is particularly designed for spectral clustering is the Eigengap heuristic (also called spectral gap) - number of clusters k is usually given by the value of k that maximizes the eigengap (difference between consecutive eigenvalues). i.e., choose the number k such that all eigenvalues λ1, . . . , λk are very small, but λk+1 is relatively large.
The larger this eigengap is, the closer the eigenvectors of the ideal case and hence the better spectral clustering works. If you're interested on the justifications for this procedure, it is based on perturbation theory and spectral graph theory.
You can read more here: A Tutorial on Spectral Clustering - Ulrike von Luxburg
Other way to explore the natural grouping: number of connected components and the spectrum of the Laplacian matrix - the number of times 0 appears as an eigenvalue in the Laplacian is the number of connected components in the graph. Your affinity matrix can be considered as a graph and then, try to look how many connected components you have in the graph. That will give you a sense of the neutral structure of your data..
In addition, as you mentioned, we can set a validation criterion (for example, SSE) and see its value under different values of K. That's fine once you have a labeled data (which is not always the case in clustering) and you know that this criterion/quality measure is really meaningful.

What is the meaning of "Inf" in S_Dbw output in R commander?

I have ran clv package which consists of S_Dbw and SD validity indexes for clustering purposes in R commander. (http://cran.r-project.org/web/packages/clv/index.html)
I evaluated my clustering results from DBSCAN, K-Means, Kohonen algorithms with S_Dbw index. but for all these three algorithms S_Dbw is "Inf".
Is it "Infinite" meaning? Why did i confront with "Inf". Is there any problem in my clustering results?
In general, when is S_Dbw index result "Inf"?
Be careful when comparing different algorithms with such an index.
The reason is that the index is pretty much an algorithm in itself. One particular clustering will necessarily be the "best" for each index. The main difference between an index and an actual clustering algorithm is that the index doesn't tell you how to find the "best" solution.
Some examples: k-means minimizes the distances from cluster members to cluster centers. Single-link hierarchical clustering will find the partition with the optimal minimum distance between partitions. Well, DBSCAN will find the partitioning of the dataset, where all density-connected points are in the same partition. As such, DBSCAN is optimal - if you use the appropriate measure.
Seriously. Do not assume that because one algorithm scores higher than another in a particular measure means that the algorithm works better. All that you find out this way is that a particular algorithm is more (cor-)related to a particular measure. Think of it as a kind of correlation between the measure and the algorithm, on a conceptual level.
Using a measure for comparing different results of the same algorithm is different. Then obviously there shouldn't be a benefit from one algorithm over itself. There might still be a similar effect with respect to parameters. For example the in-cluster distances in k-means obviously should go down when you increase k.
In fact, many of the measures are not even well-defined on DBSCAN results. Because DBSCAN has the concept of noise points, which the indexes do not AFAIK.
Do not assume that the measure will either give you an indication of what is "true" or "correct". And even less, what is useful or new. Because you should be using cluster analysis not to find a mathematical optimum of a particular measure, but to learn something new and useful about your data. Which probably is not some measure number.
Back to the indices. They usually are totally designed around k-means. From a short look at S_Dbw I have the impression that the moment one "cluster" consists of a single object (e.g. a noise object in DBSCAN), the value will become infinity - aka: undefined. It seems as if the authors of that index did not consider this corner case, but only used it on toy data sets where such situations did not arise. The R implementation can't fix this, without diverting from the original index and instead turning it into yet another index. Handling noise objects and singletons is far from trivial. I have not yet seen an index that doesn't fail in one way or another - typically, a solution such as "all objects are noise" will either score perfect, or every clustering can trivially be improved by putting each noise object to the nearest non-singleton cluster. If you want your algorithm to be able to say "this object doesn't belong to any cluster" then I do not know any appropriate index.
The IEEE floating point standard defines Inf and -Inf as positive and negative infinity respectively. It means your result was too large to represent in the given number of bits.

Resources