How to specify a distance metric for kmeans in R?

I'm doing kmeans clustering in R with two requirements:
I need to specify my own distance function; currently it is the Pearson correlation coefficient.
I want the clustering to use the average of the group members as the centroid, rather than some actual member.
The reason for the second requirement is that I think using the average as the centroid makes more sense than using an actual member, since the members are not always near the real centroid. Please correct me if I'm wrong about this.
First I tried the kmeans function in the stats package, but it doesn't allow a custom distance method.
Then I found the pam function in the cluster package. The pam function does allow a custom distance metric, by taking a dist object as a parameter, but it seems to me that it then uses actual members as centroids, which is not what I want; I don't think it can compute distances to a mean centroid from just a distance matrix.
So is there an easy way in R to do k-means clustering that satisfies both of my requirements?

Check the flexclust package:
The main function kcca implements a general framework for
k-centroids cluster analysis supporting arbitrary distance measures
and centroid computation.
The package also includes a function distCor:
R> flexclust::distCor
function (x, centers)
{
    z <- matrix(0, nrow(x), ncol = nrow(centers))
    for (k in 1:nrow(centers)) {
        z[, k] <- 1 - .Internal(cor(t(x), centers[k, ], 1, 0))
    }
    z
}
<environment: namespace:flexclust>
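A minimal usage sketch of kcca with that distance (the data below is arbitrary, and passing colMeans as the centroid function is an assumption that matches the questioner's requirement of mean centroids):
library(flexclust)
set.seed(1)
x <- matrix(rnorm(100 * 4), ncol = 4)
# kccaFamily() pairs a distance function with a centroid function:
# distCor gives 1 - Pearson correlation; colMeans gives mean centroids
fam <- kccaFamily(dist = distCor, cent = colMeans)
cl <- kcca(x, k = 3, family = fam)
clusters(cl)    # cluster membership
parameters(cl)  # the centroids, i.e. the group means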

Related

Calculating Cosine Similarity in Julia for K-Means

I am working on an implementation of K-means clustering in Julia. The task is:
Figure out, and implement, a modification of k-means that alternatively measures similarity by the angle between vectors.
So I assumed that one could use cosine similarity for this. I have made the code work with regular K-means by calculating the squared Euclidean distance, like this:
Distances[:,i] = sum((X.-C[[i],:]).^2, dims=2) # Where C is center, Distances are added using the i-th center
I tried to do this by using cosine similarity such as this:
Distances[:, i] = sum(1 .- ((X*C[[i], :]).^2 /(sum(X.^2, dims=2).*(C[[i],:]'*C[[i],:]))))
But this does not seem to work.
Have I misunderstood the question or am I implementing it wrong?
In my Beta Machine Learning Package, module Utils, I implemented the distances as:
using LinearAlgebra
"""L1 norm distance (aka _Manhattan Distance_)"""
l1_distance(x,y) = sum(abs.(x-y))
"""Euclidean (L2) distance"""
l2_distance(x,y) = norm(x-y)
"""Squared Euclidean (L2) distance"""
l2²_distance(x,y) = norm(x-y)^2
"""Cosine distance"""
cosine_distance(x,y) = 1 - dot(x,y)/(norm(x)*norm(y))
I then use them in the cluster module.
Note that you need the standard library package LinearAlgebra.
I managed to solve it by using the CosineDist function from the Distances.jl package, although one could also calculate the distance manually using the code supplied in that package's GitHub repository or other implementations.
How I did this, was to calculate the distance for each data point to the i-th cluster center.
Distances[:, i] = [evaluate(CosineDist(), X[j, :], C[i, :]) for j in 1:size(X, 1)] # one distance per data point

nstart for k-means in R

Search results in numerous places report that the argument nstart in R's function kmeans sets a number of iterations of the algorithm and chooses 'the best one', see e.g. https://datascience.stackexchange.com/questions/11485/k-means-in-r-usage-of-nstart-parameter. Can anyone provide any clarity on how it does this, i.e. by what measure does it define best?
Secondly: R's kmeans function takes an argument centers. Here, as typical in k-means, it is possible to initialise the centroids before the algorithm begins expectation-maximisation, by choosing as initial centroids rows (data-points) from within your data-set. (You could supply, in vector form, points not present in your data-set as well, with considerably greater effort. In this case you could in theory choose the global optimum as your centroids. This is not what I'm asking for.) When nstart or the seed randomises initializations, I am quite sure that it does so by picking a random choice of centroids from your data-set and starting from those (not just a random set of points within the space).
In general, therefore, I'm looking for a way to get a good (e.g. best out of $n$ trials, or best from nstart) set of starting data-instances from the data-set as initial centroids. Is there any way of extracting the 'winning' (=best) set of initial centroids from nstart (which I could then use, say, in the centers parameter in future)? Any other streamlined & quick way to get a very good set of starting centroids (presumably, reasonably close to where the cluster centres will end up being)?
Is there perhaps, at least, a way to extract from a given kmeans run, what initial centroids it chose to start with?
The criterion that kmeans tries to minimize is the trace of the within scatter matrix, i.e. (unfortunately, this forum does not support LaTeX, but you hopefully can read it nevertheless):
$$ \operatorname{trace}(S_w) = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2 $$
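In R, this quantity is what kmeans() reports as tot.withinss. A quick illustrative check, recomputing the criterion by hand from the returned assignments and centers:
km <- kmeans(iris[, -5], 3)
tws <- sum(sapply(1:3, function(k) {
    members <- iris[km$cluster == k, -5]
    sum(sweep(members, 2, km$centers[k, ])^2)
}))
all.equal(tws, km$tot.withinss)  # should be TRUE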
Concerning the best starting point: obviously, the "best" starting point would be the cluster centers eventually chosen by kmeans. These are returned in the attribute centers:
km <- kmeans(iris[,-5], 3)
print(km$centers)
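Those centers can then be fed back through the centers argument; a run started there converges essentially immediately (illustrative):
km2 <- kmeans(iris[, -5], centers = km$centers)
km2$iter  # typically 1, since the algorithm starts at (or near) a fixed point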
If you are looking for the best random start point, you can create random start points yourself (with runif), do this nstart times and evaluate which initial configuration leads to the smallest km$tot.withinss:
nstart <- 10
K <- 3  # number of clusters
D <- 4  # data point dimension
# select the possible range per dimension
r.min <- apply(iris[, -5], MARGIN = 2, FUN = min)
r.max <- apply(iris[, -5], MARGIN = 2, FUN = max)
for (i in 1:nstart) {
    centers <- data.frame(runif(K, r.min[1], r.max[1]))
    for (d in 2:D) {
        centers <- cbind(centers, data.frame(runif(K, r.min[d], r.max[d])))
    }
    names(centers) <- names(iris[, -5])
    # call kmeans with centers and compare tot.withinss
    # ...
}
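Putting it together, a minimal self-contained version of that comparison (the bookkeeping names best.withinss and best.centers are illustrative, not part of the original answer):
r.min <- apply(iris[, -5], MARGIN = 2, FUN = min)
r.max <- apply(iris[, -5], MARGIN = 2, FUN = max)
best.withinss <- Inf
best.centers <- NULL
for (i in 1:10) {
    # draw K = 3 random points uniformly within the data range
    centers <- as.data.frame(mapply(function(lo, hi) runif(3, lo, hi), r.min, r.max))
    # note: purely random starts can occasionally produce an empty cluster,
    # in which case kmeans() errors and you would simply redraw
    km <- kmeans(iris[, -5], centers = centers)
    if (km$tot.withinss < best.withinss) {
        best.withinss <- km$tot.withinss
        best.centers <- centers  # the 'winning' initial configuration
    }
}
print(best.centers)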

How do I implement a non-default dissimilarity metric with vegan function metaMDS()?

I have constructed a distance matrix from phylogenetic data using the Claddis function MorphDistMatrix() with the distance metric "MORD" (Maximum Observable Rescaled Distance). I now want to use this dissimilarity matrix to run an NMDS using the vegan function metaMDS(). However, although metaMDS has many distance metrics to choose from, "MORD" is not one of them. How do I enable metaMDS() to have this metric as an option?
Edit: here is some example code:
nexus.data <- ReadMorphNexus("example.nex")
This reads in the Nexus file.
dist <- MorphDistMatrix(nexus.data, distance = "MORD")
This is the Claddis command for creating the distance matrix. Instead of using the Gower dissimilarity (distance = "GC"), I would like to use the Maximum Observable Rescaled Distance (distance = "MORD"), which is a modified form of Gower for use with ordered characters (Lloyd 2016). So far so good.
nmds <- metaMDS(dist$DistanceMatrix, k = 2, trymax = 1000, distance = "GC")
Here is where I run into trouble: as I understand it, the distance used for the metaMDS command should be the same as the one used to construct the distance matrix, but MORD is not an option for "distance" in metaMDS. If I were to construct the distance matrix under Gower dissimilarity, this wouldn't be a problem, as that distance is also available in metaMDS.
Lloyd, G. T., 2016. Estimating morphological diversity and tempo with discrete character-taxon matrices: implementation, challenges, progress, and future directions. Biological Journal of the Linnean Society, 118, 131-151.
metaMDS has the argument distfun for selecting dissimilarity functions other than vegdist. Such a function should accept an argument method to select the dissimilarity measure used. Further, it should return a regular dissimilarity object that inherits from the standard R dist class. I do not know the Claddis package: does it return regular dissimilarities or something peculiar? Your example hints that it returns something that is not a regular R dist object. Alternatively, you can use pre-calculated dissimilarities as input to metaMDS. Again, these should be regular dissimilarities, like in any decent R implementation. So you need to check the following with your dissimilarities:
inherits(dist, "dist") # your dist result: should be TRUE
inherits(dist$DistanceMatrix, "dist") # alternatively this should be TRUE
## if the latter was TRUE, you can extract that with
d <- dist$DistanceMatrix
## if d is not a "dist" object, you can see if it can be turned into one
d <- as.dist(dist$DistanceMatrix)
inherits(d, "dist") # TRUE: OK, FALSE: no hope
## if it was OK, you just do
metaMDS(d)
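If the coercion succeeds, an end-to-end sketch for the question's data would look like this (assuming as.dist() on dist$DistanceMatrix yields a valid "dist" object):
library(vegan)
d <- as.dist(dist$DistanceMatrix)  # coerce the pre-computed MORD matrix
stopifnot(inherits(d, "dist"))
nmds <- metaMDS(d, k = 2, trymax = 1000)  # no distance argument needed: d is already a dissimilarity
plot(nmds)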

Time taken to krige in gstat package in R

The following R program creates an interpolated surface from 470 data points, using the Walker Lake data in the gstat package.
source("D:/kriging/allfunctions.r") # Reads in all functions.
source("D:/kriging/panel.gamma0.r") # Reads in panel function for xyplot.
library(lattice) # Needed for "xyplot" function.
library(geoR) # Needed for "polygrid" function.
library(akima)
library(gstat);
library(sp);
walk470 <- read.table("D:/kriging/walk470.txt",header=T)
attach(walk470)
coordinates(walk470) = ~x+y
walk.var1 <- variogram(v ~ x+y,data=walk470,width=10) #the width has to be tuned resulting different point pairs
plot(walk.var1,xlab="Distance",ylab="Semivariance",main="Variogram for V, Lag Spacing = 5")
model1.out <- fit.variogram(walk.var1,vgm(70000,"Sph",40,20000))
plot(walk.var1, model=model1.out,xlab="Distance",ylab="Semivariance",main="Variogram for V, Lag Spacing = 10")
poly <- chull(coordinates(walk470))
plot(coordinates(walk470),type="n",xlab="X",ylab="Y",cex.lab=1.6,main="Plot of Sample and Prediction Sites",cex.axis=1.5,cex.main=1.6)
lines(coordinates(walk470)[poly,])
poly.in <- polygrid(seq(2.5,247.5,5),seq(2.5,297.5,5),coordinates(walk470)[poly,])
points(poly.in)
points(coordinates(walk470),pch=16)
coordinates(poly.in) <- ~ x+y
krige.out <- krige(v ~ 1, walk470,poly.in, model=model1.out)
print(krige.out)
For each of the 2688 prediction points, this program would seem to require:
a (470x470) matrix inversion
a (470x470) by (470x1) matrix multiplication
Is the gstat package using some smart way of doing this calculation? I knew from a previous Stack Overflow question that it uses the Cholesky decomposition for matrix inversion. Is it normal for one machine to calculate this so quickly?
It uses the LDL' decomposition, which is similar to the Cholesky decomposition. As you are using global kriging, the covariance matrix needs to be decomposed only once; then, for each prediction point, a system is solved, which is O(n). No 470x470 matrix ever gets inverted, nor are solutions obtained by multiplying by an inverse. Inverses are notational devices, but are avoided as a computational strategy whenever possible. In R, for instance, compare the runtime of solve(A, b) with solve(A) %*% b.
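You can see the difference yourself with a small benchmark sketch (the size here is arbitrary):
set.seed(42)
n <- 1000
A <- crossprod(matrix(rnorm(n * n), n))  # a symmetric positive-definite matrix
b <- rnorm(n)
system.time(x1 <- solve(A, b))     # solve the system directly
system.time(x2 <- solve(A) %*% b)  # explicit inverse, then multiply: noticeably slower
max(abs(x1 - x2))                  # same answer up to rounding error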
Use the source, Luke!

Using different metric for hclust linkage?

In R you can use all sorts of metrics to build a distance matrix prior to clustering, e.g. binary distance, Manhattan distance, etc...
However, when it comes to choosing a linkage method (complete, average, single, etc.), these linkage methods all seem to use Euclidean distance. This does not seem particularly appropriate if you rely on a different metric to build the distance matrix.
Is there a way (or a library...) to apply other distances to linkage methods when building a clustering tree?
Thanks!
I don't really get your question. For example, suppose I have the following data:
x <- matrix(rnorm(100), nrow=5)
then I can build a distance matrix using dist
##Changing the distance measure
d_e = dist(x, method="euclidean")
d_m = dist(x, method="maximum")
I can then cluster however I want:
##Changing the clustering method
hclust(d_m, method="median")
If you have constructed a matrix that already represents the pairwise distances, use e.g.
hclust(as.dist(mx), method="single")
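For example, a correlation-based dissimilarity matrix can be fed straight into hclust this way (reusing the x defined above):
mx <- 1 - cor(t(x))  # pairwise 1 - Pearson correlation between the rows of x
hclust(as.dist(mx), method = "single")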
You might want to try using agnes, rather than hclust, and hand it a distance matrix. There's a nice tutorial on this here:
http://strata.uga.edu/software/pdf/clusterTutorial.pdf
From the tutorial, here's how you would generate and use a distance matrix for clustering:
library(vegan)    # load library for distance functions
library(cluster)  # provides agnes()
mydata.bray <- vegdist(mydata, method = "bray")  # calculates Bray (= Sørensen) distances among samples
mydata.bray.agnes <- agnes(mydata.bray)          # run the cluster analysis
I myself use Prof. Daniel Müllner's fastcluster library, which is a drop-in replacement for hclust but orders of magnitude faster on large data sets.
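For illustration, a minimal sketch of that drop-in use (fastcluster masks stats::hclust when loaded; the data here is arbitrary):
library(fastcluster)  # loading it masks stats::hclust with a faster implementation
x <- matrix(rnorm(1000 * 10), nrow = 1000)
d <- dist(x, method = "manhattan")   # any dissimilarity measure works
hc <- hclust(d, method = "average")  # same call as with stats::hclust
plot(hc, labels = FALSE)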

Resources