Similarity of sets (considering also the set sizes)

I know that you can use the Jaccard index/distance to measure the similarity or distance of two sets. However, I am looking for some way to scale the raw Jaccard values with respect to the sizes of the sets. For example, I want two large sets with a significant overlap to score a higher similarity than two small sets with the same overlap ratio.
Of course, I could simply divide the value of the Jaccard distance by the size of the union of both sets, but is there a standard scaling scheme for that purpose?
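For illustration, here is one ad-hoc way in R to damp the score for small sets; the weight |union| / (|union| + k) and the constant k are arbitrary choices of mine, not a standard scheme:

    # Ad-hoc size-weighted Jaccard similarity; the damping weight is
    # illustrative, not a standard scheme.
    weighted_jaccard <- function(a, b, k = 10) {
      u <- union(a, b)
      j <- length(intersect(a, b)) / length(u)  # raw Jaccard similarity
      j * length(u) / (length(u) + k)           # shrinks the score for small unions
    }
    weighted_jaccard(1:100, 51:150)  # raw Jaccard 1/3, large union -> ~0.31
    weighted_jaccard(1:4, 3:6)       # raw Jaccard 1/3, small union -> ~0.13

Both pairs have the same raw Jaccard similarity (1/3), but the larger pair scores higher.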

Related

Computing Jaccard index of similarity on rasters

I want to compute the Jaccard index of similarity based on continuous quantities. I found that the vegan package can compute the Jaccard index for continuous cases, based on the Bray-Curtis measure, through the function vegdist. I was able to do it by choosing a number of sites at random, computing the Jaccard index between all pairs of the chosen sites, and taking the average afterwards. This procedure takes a lot of time, especially since I have many scenarios to treat. I wonder if there is a way to do it using the rasters directly (using all the non-NA pixels), without using binary maps, in feasible time.
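For reference, a minimal sketch of the direct-raster route I have in mind (terra and vegan are assumptions about the setup, and r1, r2 stand for two scenario rasters):

    library(terra)
    library(vegan)
    s <- c(r1, r2)                 # two scenario rasters, assumed SpatRaster objects
    m <- values(s)                 # pixels x layers matrix
    m <- m[complete.cases(m), ]    # keep only pixels that are non-NA in every layer
    # vegdist works on rows, so transpose: one row per scenario
    # (vegdist assumes non-negative quantities)
    d <- vegdist(t(m), method = "jaccard")  # Jaccard from Bray-Curtis: 2B/(1+B)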

Is there a way to perform agglomerative clustering in batches in R?

I have several groups of data, with row counts ranging up to 24,000. I have manually calculated pairwise distances between the points, where the distance is based on custom text-matching rules.
I have been able to perform agglomerative clustering using hclust on groups of size ~1000, but my system's resources cannot handle the 24K x 24K / 2 comparison needed for the larger groups.
The representation of the distances takes up O(n^2) space, but the clustering representation should only take up O(n log n) space. Are there any packages in R that can perform agglomerative clustering in batches for large amounts of data?
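For scale, a minimal sketch of the single-batch workflow that does work (dmat stands in for the custom text-matching distances), plus the memory arithmetic behind the failure:

    # Single-batch baseline: hclust on a precomputed custom distance matrix.
    # dmat is a stand-in for the text-matching distances described above.
    n <- 1000
    dmat <- matrix(runif(n^2), n, n)
    dmat <- (dmat + t(dmat)) / 2   # symmetrize
    diag(dmat) <- 0
    hc <- hclust(as.dist(dmat), method = "average")
    # A dist object stores n*(n-1)/2 doubles: for n = 24000 that is
    # ~2.88e8 values, i.e. roughly 2.3 GB, before hclust's own working copies.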

Mathematical representation of a set of points in N dimensional space?

Given some x data points in an N-dimensional space, I am trying to find a fixed-length representation that could describe any subset s of those x points. For example, the mean of the subset s could describe it, but it is not unique to that subset: other points in the space could yield the same mean, so the mean is not a unique identifier. Could anyone tell me of a unique measure that could describe the points without depending on the number of points?
In short, it is impossible (as you would achieve infinite noiseless compression). You either have to use a variable-length representation (or a fixed length proportional to the maximum number of points), or deal with "collisions" (as your mapping will not be injective). In the first scenario you can simply store the coordinates of each point. In the second you approximate your point cloud with more and more complex descriptors to balance collisions against memory usage; some possibilities are:
storing the mean and covariance (so basically performing maximum likelihood estimation over Gaussian families)
performing some fixed-complexity density estimation, like a Gaussian Mixture Model, or training a generative neural network
using a set of simple geometrical/algebraic properties (see the sketch after this list), such as:
number of points
mean, max, min, and median of the pairwise distances
etc.
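A minimal sketch of the last option in R (the particular summaries kept are an arbitrary illustrative choice):

    # Fixed-length descriptor from simple geometric properties; different
    # subsets can collide, as noted above.
    describe_points <- function(X) {        # X: one point per row
      d <- as.vector(dist(X))               # all pairwise Euclidean distances
      c(n_points  = nrow(X),
        mean_dist = mean(d),
        min_dist  = min(d),
        max_dist  = max(d),
        med_dist  = median(d))
    }
    describe_points(matrix(rnorm(20), ncol = 2))  # 10 random 2-D points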
Any subset can be identified by a bit mask of length x, where bit i is 1 if the i-th element belongs to the subset. There is no fixed-length representation that is not a function of x.
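A one-line sketch of that encoding in R (the helper name and indices are illustrative):

    # Length-x bit mask for a subset given by its element indices.
    subset_mask <- function(x, idx) { m <- integer(x); m[idx] <- 1L; m }
    subset_mask(8, c(2, 5, 7))
    # [1] 0 1 0 0 1 0 1 0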
EDIT
I was wrong. PCA is a good way to perform dimensionality reduction for this problem, but it won't work for some sets.
However, you can almost do it, where "almost" is formally captured by the Johnson-Lindenstrauss lemma. It states that for a given large dimension N, there exist a much lower dimension n and a linear transformation mapping each point from N dimensions to n dimensions while keeping the Euclidean distance between every pair of points of the set within some error ε of the original. Such a linear transformation is called the JL transform.
In other words, your problem is only solvable for sets of points in which each pair of points is separated by at least ε. For that case, the JL transform gives you one possible solution. Moreover, there is a relationship between N, n and ε (see the lemma) such that, for example, if N = 100, the JL transform can map each point to a point in 5 dimensions (n = 5) and uniquely identify each subset if, and only if, the minimum distance between any pair of points in the original set is at least ~2.8 (i.e. the points are sufficiently different).
Note that n depends only on N and on the minimum distance between any pair of points in the original set. It does not depend on the number of points x, so it is a solution to your problem, albeit with some constraints.
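A minimal random-projection sketch in R (a Gaussian projection is one standard JL construction; the dimensions below are illustrative rather than derived from the lemma's bound):

    set.seed(1)
    N <- 100; n <- 5
    X <- matrix(rnorm(20 * N), nrow = 20)          # 20 points in N dimensions
    P <- matrix(rnorm(N * n), nrow = N) / sqrt(n)  # random JL-style projection
    Y <- X %*% P                                   # the same points in n dimensions
    # Distance ratios scatter around 1; the spread tightens as n grows.
    summary(as.vector(dist(Y)) / as.vector(dist(X)))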

How does the command dist(x, method="binary") calculate the distance matrix?

I have been trying to figure that out, but without much success. I am working with a table of binary data (0s and 1s). I managed to estimate a distance matrix from my data using the R function dist(x, method="binary"), but I am not quite sure how exactly this function estimates the distance matrix. Is it using the Jaccard coefficient J = M11/(M10 + M01 + M11)?
This is easily found in the help page ?dist:
This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.
[...]
binary (aka asymmetric binary): The vectors are regarded as binary bits, so non-zero elements are ‘on’ and zero elements are ‘off’. The distance is the proportion of bits in which only one is on amongst those in which at least one is on.
This is equivalent to the Jaccard distance as described in Wikipedia:
An alternate interpretation of the Jaccard distance is as the ratio of the size of the symmetric difference to the union.
In your notation, it is 1 - J = (M01 + M10)/(M01 + M10 + M11).
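A quick check in R that the two agree (the toy vectors are arbitrary):

    x <- rbind(a = c(1, 1, 0, 1, 0),
               b = c(1, 0, 0, 1, 1))
    dist(x, method = "binary")                 # 0.5
    M11 <- sum(x["a", ] == 1 & x["b", ] == 1)  # 2
    M10 <- sum(x["a", ] == 1 & x["b", ] == 0)  # 1
    M01 <- sum(x["a", ] == 0 & x["b", ] == 1)  # 1
    (M01 + M10) / (M01 + M10 + M11)            # 0.5, matching 1 - J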

An "asymmetric" pairwise distance matrix

Suppose there are three sequences to be compared: a, b, and c. Traditionally, the resulting 3-by-3 pairwise distance matrix is symmetric, indicating that the distance from a to b is equal to the distance from b to a.
I am wondering if TraMineR provides some way to produce an asymmetric pairwise distance matrix.
No, TraMineR does not produce 'asymmetric' dissimilarities, precisely for the reasons stressed in Pat's comment.
The main interest of computing pairwise dissimilarities between sequences is that once we have such dissimilarities we can for instance
measure the discrepancy among sequences, determine neighborhoods, find medoids, ...
run cluster algorithms, self-organizing maps, MDS, ...
make ANOVA-like analysis of the sequences
grow regression trees for the sequences
Inputting a non-symmetric dissimilarity matrix into those processes would most probably generate irrelevant outcomes.
It is because of this symmetry requirement that the substitution costs used for computing Optimal Matching distances MUST be symmetric. It is important not to interpret substitution costs as the cost of switching from one state to the other, but to understand them for what they are, i.e., edit costs. When comparing two sequences, for example
aabcc and aadcc, we can make them equal either by arbitrarily replacing b with d in the first one, or d with b in the second one. It would then make no sense to give the two substitutions different costs.
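A minimal TraMineR sketch of that example with a constant, hence symmetric, substitution cost:

    library(TraMineR)
    seqs <- seqdef(rbind(c("a", "a", "b", "c", "c"),
                         c("a", "a", "d", "c", "c")))
    sm <- seqsubm(seqs, method = "CONSTANT", cval = 2)  # symmetric cost matrix
    seqdist(seqs, method = "OM", indel = 1, sm = sm)    # symmetric 2 x 2 matrix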
Hope this helps.
