I am new to R and machine learning. I am running a KNN classification using the Euclidean distance. I was wondering how I can use the cosine and Jaccard distances instead of the Euclidean distance in R. Are there any packages I can use?
Thank you
First of all, you can run the following from within an R session:
library(sos)
findFn("knn", maxPages=10, sortby="MaxScore")
to search for knn-related packages sorted by score (you can adjust the parameters accordingly).
If you don't find a package that offers the cosine or Jaccard distance, then I would suggest first computing the distance matrix and then giving it as input to the knn function.
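For instance, a minimal base-R sketch of precomputing such matrices (X and B here are small placeholder datasets):
X <- matrix(rnorm(20), nrow = 5)            # 5 observations, 4 features
sim <- tcrossprod(X) / (sqrt(rowSums(X^2)) %o% sqrt(rowSums(X^2)))
cosine_dist <- as.dist(1 - sim)             # cosine distance = 1 - cosine similarity
B <- matrix(rbinom(20, 1, 0.5), nrow = 5)   # binary data for Jaccard
jaccard_dist <- dist(B, method = "binary")  # base R's "binary" method is the Jaccard distance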
There are some packages, like kNN or FastKnn, which accept a distance matrix as input (you can google this using "distance matrix knn r").
Lastly, the KernelKnn package allows computation of the Jaccard distance, but only for binary data (I'm the author; you can have a look at the other distance metrics too).
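As a rough sketch (the argument names are from my reading of the package docs, so check ?KernelKnn::KernelKnn before relying on them; X_train, X_test and y_train are placeholders):
library(KernelKnn)
## 'jaccard_coefficient' expects binary data; y must be coded as positive integers
preds <- KernelKnn(data = X_train, TEST_data = X_test, y = y_train, k = 5,
                   method = 'jaccard_coefficient', regression = FALSE,
                   Levels = unique(y_train))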
I hope it helps.
Related
For some reason, I have to find the 10~30 nearest neighbors for each sample in a geo-dataset (lat, lon, and some categorical features; >10M rows) with various kinds of distance metrics, mostly Haversine distance or Gower distance.
Here, I need a fast implementation/package for obtaining the indices and actual distances of the neighbors for each data point. Actually, the function get.knn in the FNN package works very well and meets my requirements. Unfortunately, it does not support custom distance settings and only provides Euclidean distance.
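For reference, a minimal sketch of what I'm doing now (coords stands in for my lat/lon matrix):
library(FNN)
res <- get.knn(coords, k = 10)  # Euclidean only; returns nn.index and nn.dist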
I was wondering: is there any other package that can perform knn, at least with Haversine distance, and output the indices and distances very fast?
Many thanks!
I am working on an implementation of K-means clustering in Julia.
Figure out and implement a modification of k-means that instead measures similarity by the angle between vectors.
So I assumed that one could use cosine similarity for this. I have made the code work with regular K-means by calculating the squared Euclidean distance, like this:
Distances[:, i] = sum((X .- C[[i], :]).^2, dims=2)  # C holds the centers; column i gets each point's distance to the i-th center
I tried to do the same using cosine similarity, like this:
Distances[:, i] = sum(1 .- ((X*C[[i], :]).^2 /(sum(X.^2, dims=2).*(C[[i],:]'*C[[i],:]))))
But this does not seem to work.
Have I misunderstood the question or am I implementing it wrong?
In my Beta Machine Learning Package, module Utils, I implemented the distances as:
using LinearAlgebra
"""L1 norm distance (aka _Manhattan Distance_)"""
l1_distance(x,y) = sum(abs.(x-y))
"""Euclidean (L2) distance"""
l2_distance(x,y) = norm(x-y)
"""Squared Euclidean (L2) distance"""
l2²_distance(x,y) = norm(x-y)^2
"""Cosine distance"""
cosine_distance(x,y) = dot(x,y)/(norm(x)*norm(y))
I then use them in the cluster module.
Note that you need the standard library package LinearAlgebra.
I managed to solve it by using the CosineDist function from the Distances.jl package, although one could also calculate the distance manually using the code supplied in that GitHub repository or other implementations.
What I did was calculate the distance from each data point to the i-th cluster center:
Distances[:, i] = [evaluate(CosineDist(), X[j, :], C[i, :]) for j in 1:size(X, 1)]  # one distance per row of X
I have constructed a distance matrix from phylogenetic data using the Claddis function MorphDistMatrix() with the distance metric "MORD" (Maximum Observable Rescaled Distance). I now want to use this dissimilarity matrix to run an NMDS using the vegan function metaMDS(). However, although metaMDS has many distance metrics to choose from, "MORD" is not one of them. How do I enable metaMDS() to have this metric as an option?
Edit: here is some example code:
nexus.data <- ReadMorphNexus("example.nex")  # read in the Nexus file
dist <- MorphDistMatrix(nexus.data, distance = "MORD")
This is the Claddis command for creating the distance matrix. Instead of using the Gower dissimilarity (distance = "GC"), I would like to use the Maximum Observable Rescaled Distance (distance = "MORD"), which is a modified form of Gower for use with ordered characters (Lloyd 2016). So far so good.
nmds <- metaMDS(dist$DistanceMatrix, k = 2, trymax = 1000, distance = "GC")
Here is where I run into trouble: as I understand it, the distance used in the metaMDS command should be the same as the one used to construct the distance matrix, but MORD is not an option for "distance" in metaMDS. If I had constructed the distance matrix under Gower dissimilarity it wouldn't be a problem, as that is also available in metaMDS.
Lloyd, G. T., 2016. Estimating morphological diversity and tempo with discrete character-taxon matrices: implementation, challenges, progress, and future directions. Biological Journal of the Linnean Society, 118, 131-151.
metaMDS has the argument distfun to select dissimilarity functions other than vegdist. Such a function should accept an argument method to select the dissimilarity measure used. Further, it should return a regular dissimilarity object that inherits from the standard R dist class. I do not know the Claddis package: does it return regular dissimilarities or something peculiar? Your example hints that it returns something that is not a regular R dissimilarity object. Alternatively, you can use pre-calculated dissimilarities as input to metaMDS. Again, these should be regular dissimilarities, like in any decent R implementation. So you need to check the following with your dissimilarities:
inherits(dist, "dist") # your dist result: should be TRUE
inherits(dist$DistanceMatrix, "dist") # alternatively this should be TRUE
## if the latter was TRUE, you can extract that with
d <- dist$DistanceMatrix
## if d is not a "dist" object, you can see if it can be turned into one
d <- as.dist(dist$DistanceMatrix)
inherits(d, "dist") # TRUE: OK, FALSE: no hope
## if it was OK, you just do
metaMDS(d)
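If you instead want metaMDS to compute the dissimilarities itself, a hypothetical sketch of a distfun wrapper could look like the following; the wrapper name and the call into Claddis are my illustrative assumptions, not a documented API:
## a custom distfun must accept a 'method' argument and return a "dist" object
mordfun <- function(x, method, ...) {
    as.dist(MorphDistMatrix(x, distance = "MORD")$DistanceMatrix)
}
## autotransform = FALSE to stop metaMDS treating the input as community data
nmds <- metaMDS(nexus.data, distfun = mordfun, k = 2, trymax = 1000, autotransform = FALSE)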
I have a similarity matrix that I created using Harry, a tool for string similarity, and I wanted to plot some dendrograms out of it to see if I could find some clusters/groups in the data. I'm using the following similarity measures:
Normalized compression distance (NCD)
Damerau-Levenshtein distance
Jaro-Winkler distance
Levenshtein distance
Optimal string alignment distance (OSA)
("For comparison Harry loads a set of strings from input, computes the specified similarity measure and writes a matrix of similarity values to output")
At first (it was my first time using R) I didn't pay too much attention to the documentation of hclust, so I used it with a similarity matrix. I know I should have used a dissimilarity matrix, and since my similarity matrix is normalized to [0,1], I know I could just do dissimilarity = 1 - similarity and then use hclust.
But the groups that I get using hclust with the similarity matrix are much better than the ones I get using hclust with the corresponding dissimilarity matrix.
I tried the proxy package as well, and the same problem happens: the groups I get aren't what I expected.
To get the dendrogram using the similarity matrix I do:
plot(hclust(as.dist(""similarityMATRIX""), "average"))
With the dissimilarity matrix I tried:
plot(hclust(as.dist(""dissimilarityMATRIX""), "average"))
and
plot(hclust(as.sim(dissimilarityMATRIX), "average"))  # (3)
From (1) I get what I believe to be a very good dendrogram, and so I can get very good groups out of it. From (2) and (3) I get the same dendrogram, and the groups I can get out of it aren't as good as the ones from (1).
I'm calling the groups good or bad because at the moment I have a fairly small volume of data to analyse, so I can check them very easily.
Does what I'm getting make any sense? Is there something that justifies it? Any suggestions on how to cluster with a similarity matrix? Is there a better way to visualize a similarity matrix than a dendrogram?
You can visualize a similarity matrix using a heatmap (for example, using the heatmaply R package).
You can check how well a dendrogram fits by using the dendextend R package's cor_cophenetic function (use the most recent version from GitHub).
Clustering which is based on distance can be done using hclust, but also using cluster::pam (k-medoids).
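A minimal sketch tying these three suggestions together (similarityMATRIX and k = 3 are placeholders, and I use base R's cophenetic correlation as a stand-in for dendextend's cor_cophenetic; check each package's docs before relying on the exact signatures):
library(heatmaply)
library(cluster)
d <- as.dist(1 - similarityMATRIX)   # dissimilarity from a normalized similarity matrix
heatmaply(similarityMATRIX)          # heatmap view of the similarity matrix
hc <- hclust(d, method = "average")
cor(d, cophenetic(hc))               # cophenetic correlation: how well the tree preserves d
pam_fit <- pam(d, k = 3)             # k-medoids directly on the dissimilarities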
In R you can use all sorts of metrics to build a distance matrix prior to clustering, e.g. binary distance, Manhattan distance, etc.
However, when it comes to choosing a linkage method (complete, average, single, etc.), these linkage methods all seem to use Euclidean distance. This does not seem particularly appropriate if you rely on a different metric to build the distance matrix.
Is there a way (or a library...) to apply other distances to linkage methods when building a clustering tree?
Thanks!
I don't really get your question. For example, suppose I have the following data:
x <- matrix(rnorm(100), nrow=5)
then I can build a distance matrix using dist
##Changing the distance measure
d_e = dist(x, method="euclidean")
d_m = dist(x, method="maximum")
I can then cluster however I want:
##Changing the clustering method
hclust(d_m, method="median")
If you have constructed a matrix that already represents the pairwise distances, use e.g.
hclust(as.dist(mx), method="single")
You might want to try using agnes, rather than hclust, and hand it a distance matrix. There's a nice tutorial on this here:
http://strata.uga.edu/software/pdf/clusterTutorial.pdf
From the tutorial, here's how you would generate and use a distance matrix for clustering:
library(vegan)                                 # load library for distance functions
library(cluster)                               # agnes() lives in the cluster package
mydata.bray <- vegdist(mydata, method="bray")  # calculate Bray (= Sørensen) distances among samples
mydata.bray.agnes <- agnes(mydata.bray)        # run the cluster analysis
I myself use Prof. Daniel Müllner's fastcluster library, which has exactly the same API as hclust but is orders of magnitude faster for large data sets.
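If I read its documentation correctly, fastcluster's hclust is a drop-in replacement for stats::hclust, so a precomputed dissimilarity (mx as a placeholder pairwise-distance matrix) works as before:
library(fastcluster)  # masks stats::hclust on load
hc <- hclust(as.dist(mx), method = "single")
plot(hc)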