Proper similarity measure for clustering - math

I am having trouble finding a proper similarity measure for clustering. I have around 3000 arrays of sets, where each set contains features of a certain domain (e.g., numbers, colors, days, letters). I'll explain my problem with an example.
Let's assume I have only 2 arrays (a1 and a2) and I want to find the similarity between them. Each array contains 4 sets (in my actual problem there are 250 sets (domains) per array), and a set can be empty.
a1: {a,b}, {1,4,6}, {mon, tue, wed}, {red, blue,green}
a2: {b,c}, {2,4,6}, {}, {blue, black}
I have come up with a similarity measure using the Jaccard index (denoted as J):
sim(a1, a2) = [J(a1[0], a2[0]) + J(a1[1], a2[1]) + J(a1[2], a2[2]) + J(a1[3], a2[3])] / 4
Note: I divide by the total number of sets (4 in the above example) to keep the similarity between 0 and 1.
Is this a proper similarity measure, and are there any flaws in this approach? I am applying the Jaccard index to each set separately because I want to compare the similarity between related domains (i.e., color with color, etc.).
I am not aware of any other suitable similarity measure for my problem.
Further, can I use this similarity measure for clustering?
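For concreteness, here is a sketch of the measure in Python (note that J of two empty sets is 0/0, so some convention is needed; I treat two empty sets as identical, but that is a modeling choice):

    def jaccard(s1, s2):
        # Jaccard index; two empty sets are treated as identical
        # (a convention, since |union| = 0 makes J undefined there).
        if not s1 and not s2:
            return 1.0
        return len(s1 & s2) / len(s1 | s2)

    def sim(a1, a2):
        # Average of the per-domain Jaccard indices.
        return sum(jaccard(s1, s2) for s1, s2 in zip(a1, a2)) / len(a1)

    a1 = [{"a", "b"}, {1, 4, 6}, {"mon", "tue", "wed"}, {"red", "blue", "green"}]
    a2 = [{"b", "c"}, {2, 4, 6}, set(), {"blue", "black"}]
    print(sim(a1, a2))  # (1/3 + 1/2 + 0 + 1/4) / 4 ≈ 0.27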

This should work with most clustering algorithms. Don't use k-means, though: it can handle numeric vector spaces only, whereas you have vector-of-sets data.
You may want to use a different mean than the arithmetic average for combining the per-domain Jaccard measures. Try the harmonic or geometric mean. The average over 250 values will likely sit somewhere close to 0.5 all the time, so you need a mean that is more "aggressive".
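For example, a sketch of swapping in the geometric or harmonic mean (reusing the jaccard helper from the question's sketch; note that a single zero-valued domain zeroes the whole similarity, which is exactly the "aggressive" behavior):

    import math

    def sim_geometric(a1, a2):
        # Geometric mean of the per-domain Jaccard indices; any zero
        # term drags the whole similarity to 0.
        vals = [jaccard(s1, s2) for s1, s2 in zip(a1, a2)]
        return math.prod(vals) ** (1.0 / len(vals))

    def sim_harmonic(a1, a2):
        # Harmonic mean; also dominated by the smallest terms.
        # Undefined at exact zeros, so guard for them.
        vals = [jaccard(s1, s2) for s1, s2 in zip(a1, a2)]
        if min(vals) == 0:
            return 0.0
        return len(vals) / sum(1.0 / v for v in vals)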
So the plan sounds good. Just try it: implement this similarity, plug it into various clustering algorithms, and see if they find something. I like OPTICS for exploring data and distance functions, as the OPTICS plot can be very indicative of whether (or not!) there is something to be found with a given distance function. If the plot is too flat, there just is not much to be found; it is like a representative sample of the distances in the data set.
I use ELKI, and they even have a tutorial on adding custom distance functions: http://elki.dbs.ifi.lmu.de/wiki/Tutorial/DistanceFunctions although you can probably just compute the distances with whatever tool you like and write them to a similarity matrix. At 3000 objects this remains very manageable: 3000 * 2999 / 2 ≈ 4.5 million doubles is only about 36 MB.
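If you go the precomputed-matrix route, something along these lines works with scikit-learn (a sketch on toy data; sklearn.cluster.OPTICS accepts metric="precomputed", and sim is the Jaccard-average function from the question's sketch):

    import numpy as np
    from sklearn.cluster import OPTICS

    # Toy stand-in for the real data: 30 arrays of 4 set-valued domains each.
    rng = np.random.default_rng(0)
    arrays = [[set(rng.choice(10, size=rng.integers(0, 4), replace=False))
               for _ in range(4)] for _ in range(30)]

    # Distance = 1 - similarity, filled into a symmetric matrix.
    n = len(arrays)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = 1.0 - sim(arrays[i], arrays[j])

    labels = OPTICS(metric="precomputed", min_samples=5).fit_predict(dist)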

Related

Setting the "tpow" and "expcost" arguments in TraMineR::seqdist

I'm working on the pathways of inpatients during their hospital stay. These pathways are represented as state sequences (the current medical unit at each time unit), and I'm trying to find typical pathways through clustering algorithms.
I create the distance matrix using the seqdist function from the R package TraMineR, with the method "OMspell". I've already read the R documentation and the related articles, but I can't figure out how to set the arguments tpow and expcost.
As the time unit is an hour, I don't want small differences in duration to have a big impact on the clustering result (unlike a transfer between medical units, for example). But I don't want duration to have no impact either...
Also, is there a proper way to choose their values, or do I just keep groping around for a good configuration? (I'm using the Dunn, Davies-Bouldin, and Silhouette criteria to compare the results of hierarchical clustering, in addition to medical opinion on the resulting clusters.)
The parameter tpow is an exponential coefficient applied to transform the actual spell lengths (durations). The default value is 1, for which the spell lengths are taken as they are. With tpow=0 you would just ignore spell durations, and with tpow=0.5 you would consider the square root of the spell lengths.
The expcost parameter is the expansion cost, i.e., the cost of expanding a (transformed) spell length by one unit. In other words, when the editing of one sequence into the other requires expanding a spell of length t1 to length t2, it costs expcost * |t2^tpow - t1^tpow|. With expcost=0, spells in the same state (e.g., AA and AAAAA) are equivalent whatever their lengths.
With tpow=.5, for example, increasing the spell length from 1 to 2 costs more than increasing it from 3 to 4. If you do not want to give too much importance to small differences in spell lengths, use a low expcost. However, note that expcost applies to the transformed spell lengths, so you may want to adjust it when you change tpow.
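As a quick illustration of the interplay (a sketch in Python rather than R, with the cost formula taken from the description above):

    def expansion_cost(t1, t2, tpow=1.0, expcost=0.5):
        # Cost of expanding a spell of length t1 to length t2:
        # expcost * |t2**tpow - t1**tpow|
        return expcost * abs(t2 ** tpow - t1 ** tpow)

    # With tpow=0.5, stretching a spell from 1h to 2h costs more
    # than stretching one from 3h to 4h:
    print(expansion_cost(1, 2, tpow=0.5))  # ~0.207
    print(expansion_cost(3, 4, tpow=0.5))  # ~0.134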

What is the most efficient way to store a set of points (embeddings) such that queries for closest points are computed quickly

Given a set of embeddings, i.e., a set of [name, vector representation] pairs,
how should I store them so that queries for the closest points are computed quickly? For example, given 100 embeddings in 2-d space, if I query the data structure for the 5 closest points to (10, 12), it returns { [a, (9, 11.5)], [b, (12, 14)], ... }.
The trivial approach is to calculate all distances, sort, and return the top-k points. Alternatively, one might store the points in a 2-d array of m×n blocks covering the range of the embedding space. I don't think this is extensible to higher dimensions, but I'm willing to be corrected.
There are standard approximate nearest neighbor libraries such as faiss, flann, java-lsh, etc. (which are either LSH- or product-quantization-based) that you may use.
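For instance, a minimal faiss sketch (an exact, brute-force flat index; the names and data here are made up for illustration):

    import numpy as np
    import faiss  # pip install faiss-cpu

    d = 2                                  # dimensionality of the embeddings
    names = ["a", "b", "c"]
    vecs = np.array([[9.0, 11.5],
                     [12.0, 14.0],
                     [0.0, 0.0]], dtype="float32")

    index = faiss.IndexFlatL2(d)           # exact L2 search
    index.add(vecs)

    query = np.array([[10.0, 12.0]], dtype="float32")
    dists, ids = index.search(query, 2)    # the 2 nearest neighbors
    print([(names[i], vecs[i]) for i in ids[0]])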
The quickest solution (which I found useful) is to transform a vector (of, say, 100 dimensions) into a single long variable (64 bits) by using the Johnson–Lindenstrauss transform. You can then use Hamming similarity (i.e., 64 minus the number of bits set in a XOR b) to compute the similarity between bit vectors a and b. You can use the POPCOUNT machine instruction to this effect (which is very fast).
In effect, if you use POPCOUNT in C, even a complete iteration over the whole set of binary-transformed vectors (64-bit long variables) will still be very fast.
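A rough sketch of that idea in Python (sign random projections, sometimes called SimHash; the projection matrix and sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 100
    planes = rng.standard_normal((64, d))  # one random direction per output bit

    def to_code(v):
        # Pack the signs of 64 random projections into one 64-bit integer.
        code = 0
        for bit in (planes @ v) >= 0:
            code = (code << 1) | int(bit)
        return code

    def hamming_sim(a, b):
        # 64 minus the popcount of a XOR b; int.bit_count() (Python 3.10+)
        # is the counterpart of the POPCOUNT instruction you'd use in C.
        return 64 - (a ^ b).bit_count()

    u, v = rng.standard_normal(d), rng.standard_normal(d)
    print(hamming_sim(to_code(u), to_code(v)))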

cosine similarity LSH and random hyperplane

I have read a few solutions about nearest neighbor search in high dimensions using random hyperplanes, but I am still confused about how the buckets work. I have 100 million documents in the form of 100-dimension vectors and 1 million queries. For each query, I need to find the nearest neighbor based on cosine similarity. The brute-force approach is to compute the cosine value of the query against all 100 million documents and select the ones with a value close to 1. I am struggling with the concept of random hyperplanes that let me put the documents in buckets so that I don't have to calculate the cosine value 100 million times for each query.
Think in a geometric way: imagine your data as points in a high-dimensional space.
Create random hyperplanes (just planes in a higher dimension); use your imagination to picture the lower-dimensional analogue.
These hyperplanes cut your data (the points) into partitions, positioning some points apart from others (every point lands in one partition; this is a rough approximation).
Now the buckets are populated according to the partitions formed by the hyperplanes. As a result, every bucket contains far fewer points than the total size of the point set (because each partition contains fewer points than the whole set).
As a consequence, when you pose a query, you check far fewer points (with the assistance of the buckets) than the total size. That's the entire gain here: checking fewer points means you do much better (faster) than the brute-force approach, which checks all the points.
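A compact sketch of the bucketing (signed random projections as hash bits; the sizes are illustrative, and real systems use several such tables to reduce the chance of a true neighbor landing on the wrong side of a hyperplane):

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(42)
    d, n_bits = 100, 16                    # 16 hyperplanes -> up to 2^16 buckets
    planes = rng.standard_normal((n_bits, d))

    def bucket_key(v):
        # Each hyperplane contributes one bit: which side of it v falls on.
        return tuple((planes @ v) >= 0)

    # Index: hash every document into its bucket once (stand-in corpus here).
    docs = rng.standard_normal((100_000, d))
    buckets = defaultdict(list)
    for i, doc in enumerate(docs):
        buckets[bucket_key(doc)].append(i)

    # Query: only score the documents sharing the query's bucket.
    q = rng.standard_normal(d)
    candidates = buckets[bucket_key(q)]
    cosine = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    best = max(candidates, key=lambda i: cosine(docs[i], q), default=None)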

Calculating Correlation Coefficient Between Two Files - Hex Value Histogram Correlation

I'm a new CS student, and my teacher has asked us to take 2 txt files and compare their hex values. The content of each file is "abcde ... XYZ" and "accde ... XYZ" respectively. I've gotten the percentage value of each character's occurrence into an Excel sheet; now I need to know what he means by "calculate the correlation coefficient" between these 2 files.
If you need more to understand my question feel free to ask.
A histogram is a graphic representation of a distribution.
A [discrete] distribution is an ordered series of counts of the number of samples with a particular value, or, in the case of a probability distribution, of probability values: the probability that a sample taken at random would have this particular value.
First you need to produce the two binary files by applying the same chain of cryptographic encryption to them, precisely as described in the assignment. This in and of itself seems to be quite a hands-on refresher on these cryptographic algorithms and on the various block encryption modes (ECB, CBC, etc.).
Then, for each file, you need to count the number of each individual hex value, giving you an array indexed from 0 to 255 (or, speaking "hex", from $00 to $FF), containing the count for each corresponding binary octet found in the file. Note that the number of cells (also called "bins" in histogram lingo) in the array is precisely 256, and the value of a cell is 0 if no byte with the corresponding hex value was found in the file.
These arrays are the discrete distributions of hex values found in each file. It is customary to normalize them: a typical approach is to produce another array of the same size (here, 256 cells) containing real values, where each value is the ratio of the number of samples for that cell to the total number of samples. Such an array therefore contains the *probability distribution of the hex values found in the file* (though, this being the distribution of choice, we often call it simply "the distribution" rather than "the probability distribution"; some pedantic types may sneer at calling these probabilities, but let's not confuse things at this point).
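A sketch of those counting and normalization steps in Python (the file names are placeholders for the two encrypted files from the assignment):

    import numpy as np

    def byte_distribution(path):
        # 256-bin count of byte values ($00..$FF), normalized so the bins
        # sum to 1, i.e., the probability distribution described above.
        data = np.fromfile(path, dtype=np.uint8)
        counts = np.bincount(data, minlength=256)
        return counts / counts.sum()

    p = byte_distribution("file1.enc")  # hypothetical file names
    q = byte_distribution("file2.enc")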
I suggest you then plot these distributions in the typical bar-chart/histogram format; that alone will give you a visual indication of how similar the two distributions are. I hesitate to spoil the fun of the discovery, but I may hint that you should not be disappointed if these two graphs turn out quite different.
The final step is to compute a formal correlation value for these two distributions, i.e., a single value "summarizing" how similar they are. That's where I fall short of giving you the full detail for your assignment, in part because I'm shy about suggesting a particular correlation function; there are a few suited to the purpose, so see your instructor or TA for suggestions.
Bonus, for fun: you can compute and plot the same distributions, histograms, and correlation factor for the unencrypted files (here, obviously, you'd expect them to be quite similar).

Problem for LSI

I am using latent semantic analysis for text similarity. I have 2 questions.
How do I select the k value for dimension reduction?
I read everywhere that LSI works for words with similar meanings, for example car and automobile. How is that possible? What is the magic step I am missing here?
The typical choice for k is 300. Ideally, you set k based on an evaluation metric that uses the reduced vectors. For example, if you're clustering documents, you could select the k that maximizes the clustering solution score. If you don't have a benchmark to measure against, then I would set k based on how big your data set is. If you only have 100 documents, then you wouldn't expect to need several hundred latent factors to represent them. Likewise, if you have a million documents, then 300 may be too small. However, in my experience the resulting vectors are fairly robust to large changes in k, provided that k is not too small (i.e., k = 300 does about as well as k = 1000).
You might be confusing LSI with Latent Semantic Analysis (LSA). They're very related techniques, with the difference being that LSI operates on documents, and LSA operates on words. Both approaches use the same input (a term x document matrix). There are several good open source LSA implementations if you would like to try them. The LSA wikipedia page has a comprehensive list.
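For the first question, a sketch of picking k by a downstream score, using scikit-learn's TruncatedSVD (commonly used for LSI/LSA); the toy corpus, the k grid, and the cluster count are all illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    docs = ["the car drove fast", "an automobile on the road",
            "cats purr softly", "a kitten sleeps in the sun",
            "dogs bark loudly", "the puppy chased the car"]

    X = TfidfVectorizer().fit_transform(docs)

    # Keep the k whose reduced vectors cluster best (silhouette as yardstick).
    best_k, best_score = None, -1.0
    for k in (2, 3, 4):                    # with real data, try e.g. 100..1000
        Z = TruncatedSVD(n_components=k, random_state=0).fit_transform(X)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
        score = silhouette_score(Z, labels)
        if score > best_score:
            best_k, best_score = k, score
    print(f"best k = {best_k} (silhouette = {best_score:.3f})")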
Try a couple of different values from [1..n] and see what works for whatever task you are trying to accomplish.
Make a word-word correlation matrix (i.e., cell (i,j) holds the number of docs where words i and j co-occur) and use something like PCA on it, as sketched below.
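A sketch of that second suggestion (a toy incidence matrix; with real data you would build it from your corpus):

    import numpy as np
    from sklearn.decomposition import PCA

    # Rows = docs, columns = words; 1 means the word appears in the doc.
    D = np.array([[1, 1, 0, 0],
                  [1, 1, 1, 0],
                  [0, 0, 1, 1],
                  [0, 1, 1, 1]])

    C = D.T @ D    # cell (i, j) = number of docs where words i and j co-occur
    coords = PCA(n_components=2).fit_transform(C)
    print(coords)  # words with similar co-occurrence patterns land close together

This is also a decent intuition for the second question: "car" and "automobile" rarely co-occur in the same document, but they co-occur with the same *other* words, so the dimension reduction maps them to nearby points.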
