Calculating Cosine Similarity in Julia for K-Means

I am working on an implementation of K-means clustering in Julia.
The task is to figure out and implement a modification of k-means that instead measures similarity by the angle between vectors.
I assumed that one could use cosine similarity for this. I have made the code work with regular K-means by calculating the squared Euclidean distance, like this:
Distances[:, i] = sum((X .- C[[i], :]).^2, dims=2)  # C holds the centers; column i gets each point's squared distance to center i
I tried to do the same using cosine similarity, like this:
Distances[:, i] = sum(1 .- ((X*C[[i], :]).^2 /(sum(X.^2, dims=2).*(C[[i],:]'*C[[i],:]))))
But this does not seem to work.
Have I misunderstood the question, or am I implementing it wrong?

In my Beta Machine Learning package (BetaML), in the Utils module, I implemented the distances as:
using LinearAlgebra
"""L1 norm distance (aka _Manhattan Distance_)"""
l1_distance(x,y) = sum(abs.(x-y))
"""Euclidean (L2) distance"""
l2_distance(x,y) = norm(x-y)
"""Squared Euclidean (L2) distance"""
l2²_distance(x,y) = norm(x-y)^2
"""Cosine distance"""
cosine_distance(x,y) = dot(x,y)/(norm(x)*norm(y))
I then use them in the cluster module.
Note that you need the standard library package LinearAlgebra.

I managed to solve it by using the CosineDist function from the Distances.jl package, although one could also calculate the distance manually using the code supplied in that package's GitHub repository or other implementations.
What I did was calculate the distance from each data point to the i-th cluster center:
Distances[:, i] = [evaluate(CosineDist(), X[j, :], C[i, :]) for j in 1:size(X, 1)]  # one distance per row of X

Related

How do I implement a non-default dissimilarity metric with vegan function metaMDS()?

I have constructed a distance matrix from phylogenetic data using the Claddis function MorphDistMatrix() with the distance metric "MORD" (Maximum Observable Rescaled Distance). I now want to use this dissimilarity matrix to run an NMDS using the vegan function metaMDS(). However, although metaMDS has many distance metrics to choose from, "MORD" is not one of them. How do I enable metaMDS() to have this metric as an option?
Edit: here is some example code:
nexus.data <- ReadMorphNexus("example.nex")             # read in the Nexus file
dist <- MorphDistMatrix(nexus.data, distance = "MORD")
This is the Claddis command for creating the distance matrix. Instead of using Gower dissimilarity (distance = "GC"), I would like to use the Maximum Observable Rescaled Distance (distance = "MORD"), which is a modified form of Gower for use with ordered characters (Lloyd 2016). So far so good.
nmds<-metaMDS(dist$DistanceMatrix, k=2, trymax=1000, distance = "GC")
Here is where I run into trouble: as I understand it, the distance used for the metaMDS command should be the same as the one used to construct the distance matrix, but "MORD" is not an option for distance in metaMDS. If I had constructed the distance matrix with Gower dissimilarity it wouldn't be a problem, as that is also available in metaMDS.
Lloyd, G. T., 2016. Estimating morphological diversity and tempo with discrete character-taxon matrices: implementation, challenges, progress, and future directions. Biological Journal of the Linnean Society, 118, 131-151.
metaMDS has an argument distfun to select dissimilarity functions other than vegdist. Such a function should accept an argument method to select the dissimilarity measure used. Further, it should return a regular dissimilarity object that inherits from the standard R "dist" class. I do not know the Claddis package: does it return regular dissimilarities or something peculiar? Your example hints that it returns something that is not a regular "dist" object. Alternatively, you can use pre-calculated dissimilarities as input in metaMDS. Again, these should be regular dissimilarities, like in any decent R implementation. So you need to check the following with your dissimilarities:
inherits(dist, "dist") # your dist result: should be TRUE
inherits(dist$DistanceMatrix, "dist") # alternatively this should be TRUE
## if the latter was TRUE, you can extract that with
d <- dist$DistanceMatrix
## if d is not a "dist" object, you can see if it can be turned into one
d <- as.dist(dist$DistanceMatrix)
inherits(d, "dist") # TRUE: OK, FALSE: no hope
## if it was OK, you just do
metaMDS(d)
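Putting this together, here is a minimal sketch of the pre-calculated route, assuming (as your example suggests) that MorphDistMatrix() returns a list whose $DistanceMatrix element is a plain symmetric matrix:
library(Claddis)
library(vegan)
nexus.data <- ReadMorphNexus("example.nex")
mord <- MorphDistMatrix(nexus.data, distance = "MORD")
d <- as.dist(mord$DistanceMatrix)  # coerce the plain matrix to a "dist" object
stopifnot(inherits(d, "dist"))
nmds <- metaMDS(d, k = 2, trymax = 1000)  # pre-computed dissimilarities: no distance argument needed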

kNN using cosine and Jaccard distance

I am new to R and machine learning. I am running a kNN classification using the Euclidean distance. I was wondering how I can use cosine and Jaccard distance instead of Euclidean in R? Are there any packages I can use?
Thank you
First of all, what you can do is run the following from within an R session:
library(sos)
findFn("knn", maxPages=10, sortby="MaxScore")
to search for knn packages sorted by maximum score (you can adjust the parameters accordingly).
If you don't find a package that offers the cosine or Jaccard distance, then I would suggest first computing the distance matrix and then giving this as input to the knn.
There are some packages, like kNN or FastKnn, that accept a distance matrix as input (you can google this using "distance matrix knn r").
Lastly, the KernelKnn package allows the computation of the Jaccard distance, but only for binary data (I'm the author; you can have a look at the other distance metrics too).
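If you go the distance-matrix route, here is a minimal sketch of a majority-vote kNN over a pre-computed matrix. The helper name knn_from_dist is illustrative, not an existing function, and it assumes the proxy package, whose measure registry includes "cosine" and (for binary data) "Jaccard":
library(proxy)  # extends dist() with extra measures such as "cosine" and "Jaccard"
knn_from_dist <- function(train, test, labels, k = 3, method = "cosine") {
  D <- as.matrix(proxy::dist(test, train, method = method))  # test-by-train distances
  apply(D, 1, function(row) {
    nn <- order(row)[seq_len(k)]         # indices of the k nearest training points
    names(which.max(table(labels[nn])))  # majority class among them
  })
}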
I hope it helps.

R: clustering with a similarity or dissimilarity matrix? And visualizing the results

I have a similarity matrix that I created using Harry, a tool for string similarity, and I wanted to plot some dendrograms from it to see if I could find some clusters/groups in the data. I'm using the following similarity measures:
Normalized compression distance (NCD)
Damerau-Levenshtein distance
Jaro-Winkler distance
Levenshtein distance
Optimal string alignment distance (OSA)
("For comparison Harry loads a set of strings from input, computes the specified similarity measure and writes a matrix of similarity values to output")
At first (it was my first time using R) I didn't pay too much attention to the documentation of hclust, so I used it with a similarity matrix. I know I should have used a dissimilarity matrix, and since my similarity matrix is normalized to [0,1], I know I could just do dissimilarity = 1 - similarity and then use hclust.
But the groups I get using hclust with the similarity matrix are much better than the ones I get using hclust with the corresponding dissimilarity matrix.
I tried the proxy package as well, and the same problem happens: the groups I get aren't what I expected.
To get the dendrogram from the similarity matrix I do:
plot(hclust(as.dist(similarityMATRIX), "average"))      # (1)
With the dissimilarity matrix I tried:
plot(hclust(as.dist(dissimilarityMATRIX), "average"))   # (2)
and
plot(hclust(as.sim(dissimilarityMATRIX), "average"))    # (3)
From (1) I get what I believe to be a very good dendrogram, and so I can get very good groups out of it. From (2) and (3) I get the same dendrogram, and the groups I can get out of it aren't as good as the ones from (1).
I'm judging the groups as good or bad because at the moment I have a fairly small volume of data to analyse, so I can check them very easily.
Does what I'm getting make any sense? Is there something that justifies this? Any suggestions on how to cluster with a similarity matrix? Is there a better way to visualize a similarity matrix than a dendrogram?
You can visualize a similarity matrix using a heatmap (for example, using the heatmaply R package).
You can check how well a dendrogram fits by using the dendextend R package's cor_cophenetic function (use the most recent version from GitHub).
Clustering based on distance can be done using hclust, but also using cluster::pam (k-medoids).
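For instance, a minimal sketch, assuming sim is your normalized [0,1] similarity matrix:
library(heatmaply)
d <- as.dist(1 - sim)               # convert similarity to dissimilarity, as hclust expects
hc <- hclust(d, method = "average")
plot(hc)                            # the dendrogram
heatmaply(sim)                      # interactive heatmap of the similarity matrix
cor(cophenetic(hc), d)              # cophenetic correlation: how faithfully the tree preserves d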

Using different metric for hclust linkage?

In R you can use all sorts of metrics to build a distance matrix prior to clustering, e.g. binary distance, Manhattan distance, etc...
However, when it comes to choosing a linkage method (complete, average, single, etc.), these linkage methods all seem to use Euclidean distance. This does not seem particularly appropriate if you rely on a different metric to build the distance matrix.
Is there a way (or a library...) to apply other distances to linkage methods when building a clustering tree?
Thanks!
I don't really get your question. For example, suppose I have the following data:
x <- matrix(rnorm(100), nrow=5)
then I can build a distance matrix using dist
##Changing the distance measure
d_e = dist(x, method="euclidean")
d_m = dist(x, method="maximum")
I can then cluster however I want:
##Changing the clustering method
hclust(d_m, method="median")
If you have constructed a matrix that already represents the pairwise distances, use e.g.
hclust(as.dist(mx), method="single")
You might want to try using agnes, rather than hclust, and hand it a distance matrix. There's a nice tutorial on this here:
http://strata.uga.edu/software/pdf/clusterTutorial.pdf
From the tutorial, here's how you would generate and use a distance matrix for clustering:
library(vegan)                                  # load library for distance functions
mydata.bray <- vegdist(mydata, method = "bray") # calculate Bray-Curtis (= Sørensen) distances among samples
mydata.bray.agnes <- agnes(mydata.bray)         # run the cluster analysis
I myself use Prof. Daniel Müllner's fastcluster package, which provides an hclust with the same interface as the standard one but is orders of magnitude faster for large data sets.
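For example, a sketch of the tutorial's workflow with fastcluster (assuming mydata as above):
library(vegan)
library(fastcluster)  # its hclust() masks stats::hclust with the same interface
mydata.bray <- vegdist(mydata, method = "bray")  # Bray-Curtis distances
hc <- hclust(mydata.bray, method = "average")    # fastcluster's drop-in hclust
plot(hc)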

How to specify the distance metric for kmeans in R?

I'm doing kmeans clustering in R with two requirements:
I need to specify my own distance function; right now it's the Pearson correlation coefficient.
I want the clustering to use the average of the group members as the centroid, rather than some actual member.
The reason for the second requirement is that I think using the average as the centroid makes more sense than using an actual member, since the members are not always near the real centroid. Please correct me if I'm wrong about this.
First I tried the kmeans function in the stats package, but this function doesn't allow a custom distance method.
Then I found the pam function in the cluster package. The pam function does allow a custom distance metric by taking a dist object as a parameter, but it seems to me that by doing this it takes actual members as centroids, which is not what I expect, since I don't think it can do all the distance computation with just a distance matrix.
So is there some easy way in R to do k-means clustering that satisfies both of my requirements?
Check the flexclust package:
The main function kcca implements a general framework for
k-centroids cluster analysis supporting arbitrary distance measures
and centroid computation.
The package also includes a function distCor:
R> flexclust::distCor
function (x, centers)
{
    z <- matrix(0, nrow(x), ncol = nrow(centers))
    for (k in 1:nrow(centers)) {
        z[, k] <- 1 - .Internal(cor(t(x), centers[k, ], 1, 0))
    }
    z
}
<environment: namespace:flexclust>
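A minimal sketch of how the pieces fit together (random data, purely for illustration): kccaFamily() lets you pair a distance function with a centroid function, so combining distCor with colMeans gives mean centroids under a correlation-based distance, matching both requirements above.
library(flexclust)
set.seed(1)
x <- matrix(rnorm(200), ncol = 4)
fam <- kccaFamily(dist = distCor, cent = colMeans)  # custom distance + mean centroids
cl <- kcca(x, k = 3, family = fam)
clusters(cl)    # cluster membership for each row
parameters(cl)  # the centroids (group means)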
