R clustering results not as expected - have I misunderstood/misused anything?

I am learning to use R to cluster data points and I created a toy example. I use the silhouette statistic to determine an optimal number of clusters, but the optimal number it determines is not what I expect. I include all my steps and data below. I wonder if I have misunderstood or misused anything? I would really appreciate any comments!
First, the data matrix "m", loaded from a file, looks like this. Each row is the feature vector of one object.
Then R code:
d <- dist(m, method="euclidean")
The distance matrix looks like this:
Next perform clustering:
clustering <- hclust(d, "average")
Then calculate the silhouette for every possible number of clusters, i.e. 1 <= i <= 10:
library(cluster)                 # provides silhouette()
sub <- cutree(clustering, k=i)   # replace i with 1, 2, 3, ..., 10
si <- silhouette(sub, d)
sm <- summary(si, FUN=mean)
sm                               # print the summary, including the mean silhouette width
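The whole sweep over k can also be written as one loop (skipping k = 1, where the silhouette is undefined):
avg.sil <- sapply(2:10, function(i) {
  sub <- cutree(clustering, k = i)
  mean(silhouette(sub, d)[, "sil_width"])   # mean silhouette width for this k
})
names(avg.sil) <- 2:10
avg.sil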
For example, I get the following mean silhouette values for each i:
i=1, NaN
i=2, 0.19
i=3, 0.157
....
i=8, 0.09
...
The maximum is i=2, suggesting there are two clusters, as below:
i.e.,
cluster1 = {4}
cluster2 = {all else}
I wonder why it does not predict 3 clusters, as below, which is what I would expect to be reasonable:
cluster1 = {4}
cluster2 = {1,2,5,6,7}
cluster3 = {3,8,9,10}
I obtain this outcome by looking at the feature vectors of each object and grouping objects that share at least one non-zero feature. So I cannot understand why cluster2 and cluster3 should be merged, as the highest silhouette value suggests.

Euclidean distance always considers all features.
It does not look for 0 values. They are not special.
Given the large number of 0 values, you should be using a different distance and/or algorithm.
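For instance (a sketch only; the right choice depends on what the features mean), R's built-in "binary" distance, a Jaccard-type measure, looks only at which features are non-zero, which is much closer to the grouping you had in mind:
library(cluster)                     # for silhouette()
d.bin <- dist(m, method = "binary")  # distance based on the non-zero pattern only
h.bin <- hclust(d.bin, "average")
summary(silhouette(cutree(h.bin, k = 3), d.bin), FUN = mean)
Whether the silhouette then peaks at k = 3 still depends on the actual values in m.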

Related

Working with spatial data: How to find the nearest neighbour of points without replacement?

I am currently working with some forest inventory data.
The data were collected on sample plots whose positions are available as point data (spatial data).
I have two datasets:
dataset dat.1 with n sample plots of species A
dataset dat.2 with k sample plots of species B
with n < k
What I want to do is to match every point of dat.1 with a point of dat.2. The result should be n pairs of points. So n of k plots from dat.2 should be selected.
The criteria for matching are:
spatial distance between a pair of points is as close as possible
one point of dat.2 can only be matched with one point in dat.1 and vice versa. So if there is a pair of points, these points should not be used in any other pair, even if it would be useful in terms of shortest distance. The "occupied" points should not be replaced and should not be used in the further matching process.
I have been looking for a long time for ways to perform this analysis. There are functions like st_nn from 'nngeo' or nn2 from 'RANN' which return the k nearest neighbours of a point. However, these functions cannot exclude points that have already been matched, i.e. they match with replacement.
In the package 'MatchIt' there are possibilities to perform nearest neighbour matching without replacement. Yet these functions are designed to find the closest distance between control variables, not between spatial locations.
Could anyone come up with an idea for a possibility to match my requirements?
I would really appreciate any hints or suggestions for packages and / or functions that could help me with this issue.
The first thing you should do is create your own distance matrix. The rows should correspond to those in dat.1 and the columns to those in dat.2, and each entry in the matrix is the distance between the plot in the row and the plot in the column. You can do this manually by looping through your datasets and computing the Euclidean (or other) distance between the points. You can also use the match_on function in the optmatch package to do this with the following code:
d <- rbind(dat.1, dat.2)
d$dat <- c(rep(1, nrow(dat.1)), rep(0, nrow(dat.2)))
dist <- optmatch::match_on(dat ~ x.coord + y.coord, data = d,
method = "euclidean")
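The manual loop mentioned above would look something like this (just a sketch; x.coord and y.coord are assumed coordinate column names in both datasets):
# build the distance matrix by hand: rows = dat.1 plots, columns = dat.2 plots
dist_mat <- matrix(NA_real_, nrow = nrow(dat.1), ncol = nrow(dat.2))
for (i in seq_len(nrow(dat.1))) {
  for (j in seq_len(nrow(dat.2))) {
    dist_mat[i, j] <- sqrt((dat.1$x.coord[i] - dat.2$x.coord[j])^2 +
                           (dat.1$y.coord[i] - dat.2$y.coord[j])^2)
  }
}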
Once you have a distance matrix in this form, you can supply it to pairmatch in the optmatch package. pairmatch performs K:1 optimal matching without replacement. The matching is optimal in that the sum of the absolute distances between matched pairs in the matched sample is as low as possible. It doesn't guarantee that any one unit will get its nearest neighbour, but it does yield matched samples in which no unit is matched to another unit too far away from it. You can specify the controls argument to choose how many dat.2 units you want matched to each dat.1 unit. For a basic 1:1 match, you can use
d$pairs <- optmatch::pairmatch(dist)
The output is a factor containing pair membership for each unit. Unmatched units will have a value of NA.
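For example, to match 2 plots from dat.2 to each plot in dat.1 instead, set controls accordingly (a one-line variation on the call above):
d$pairs <- optmatch::pairmatch(dist, controls = 2)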
You can also do this in one single step with
d$pairs <- optmatch::pairmatch(dat ~ x.coord + y.coord, data = d,
method = "euclidean")
Then you can subset your dataset so only matched plots remain:
matched <- d[!is.na(d$pairs),]

How to remove outliers from distance matrix or Hierarchical clustering in R?

I have some questions.
First, I don't know how to find and remove outliers in a distance matrix (a symmetric matrix).
Second, I used hierarchical clustering with average linkage.
My data is engmale161 (a symmetric distance matrix already computed with DTW):
engmale161 <- na.omit(engmale161)
engmale161 <- scale(engmale161)
d <- dist(engmale161, method = "euclidean")
hc1_engmale161 <- hclust(d, method="average")
and I found the optimal number of clusters to be 4 using the silhouette, WSS and gap statistics.
>sub_grp <- cutree(hc1_engmale161,h=60, k = 4)
>table(sub_grp)
sub_grp
1 2 3 4
741 16 7 1
> subset(sub_grp,sub_grp==4)
4165634865
4
>fviz_cluster(list(data = engmale161, cluster = sub_grp), geom = "point")
So I think the upper-right point (4165634865) is an outlier, and its cluster contains only that one point.
How can I remove this outlier when using hierarchical clustering?
Just some ideas. In a nutshell:
1. don't do na.omit on engmale161
2. find the outlier(s) using quantiles and box-and-whiskers
3. set the outliers to NA in the distance matrix
4. proceed with your processing
long version:
"dist" behaves nicely with NAs (from the R documentation, "Missing
values are allowed, and are excluded from all computations involving
the rows within which they occur. Further, when Inf values are
involved, all pairs of values are excluded when their contribution to
the distance gave NaN or NA)"
To find an outlier I would use concepts from exploratory statistics: use quantile() with the default probs and na.rm = TRUE (because your distance matrix still contains NAs). That gives you the quartiles, i.e. the data split in four (0-25%, 25-50%, and so on); the 25-75% range is the "box". How to define the "whiskers" is a debated topic. The standard approach is to compute the interquartile range (IQR = third quartile minus first quartile); then first quartile - 1.5*IQR is the lower whisker and third quartile + 1.5*IQR is the upper whisker. Any value outside the whiskers is considered an outlier. Mark those values as NA and proceed.
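A minimal sketch of that recipe, keeping the variable names from the question (the 1.5*IQR whiskers are just the conventional default):
# skip na.omit(); instead mark outlying values in the DTW matrix as NA
q   <- quantile(engmale161, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
bad <- which(engmale161 < q[1] - 1.5 * iqr | engmale161 > q[2] + 1.5 * iqr)
engmale161[bad] <- NA                           # values outside the whiskers
d  <- dist(engmale161, method = "euclidean")    # dist() excludes NAs pairwise
hc <- hclust(d, method = "average")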
Best of luck, and my compliments for being someone who actually looks at the data!

Clustering and distance calculation in Julia

I have a collection of n coordinate points of the form (x,y,z). These are stored in an n x 3 matrix M.
Is there a built-in function in Julia to calculate the distance between each point and every other point? I'm working with a small number of points, so calculation time isn't too important.
My overall goal is to run a clustering algorithm, so if there is a clustering algorithm that doesn't require me to first calculate these distances, please suggest that too. An example of the data I would like to perform clustering on is below. Obviously I'd only need to do this for the z coordinate.
To calculate distances, use the Distances package.
Given a matrix X, you can calculate pairwise distances between its columns. This means that you should supply your input points (your n objects) as the columns of the matrix. (In your question you mention an n x 3 matrix, so you would have to transpose it, e.g. with transpose().)
Here is an example on how to use it:
>using Distances # install with Pkg.add("Distances")
>x = rand(3,2)
3x2 Array{Float64,2}:
0.27436 0.589142
0.234363 0.728687
0.265896 0.455243
>pairwise(Euclidean(), x, x)
2x2 Array{Float64,2}:
0.0 0.615871
0.615871 0.0
As you can see, the above returns the distance matrix between the columns of x. You can use other distance metrics if you need to; just check the docs for the package.
Just to complement @niczky12's answer, there is a package in Julia called Clustering which, as the name suggests, allows you to perform clustering.
A sample kmeans algorithm:
>>> using Clustering # Pkg.add("Clustering") if not installed
>>> X = rand(3, 100) # data, each column is a sample
>>> k = 10 # number of clusters
>>> r = kmeans(X, k)
>>> fieldnames(r)
8-element Array{Symbol,1}:
:centers
:assignments
:costs
:counts
:cweights
:totalcost
:iterations
:converged
The result is stored in the return value of kmeans (r above), which contains the fields listed. The two probably most interesting fields: r.centers contains the centers detected by the kmeans algorithm and r.assignments contains the cluster to which each of the 100 samples belongs.
There are several other clustering methods in the same package. Feel free to dive into the documentation and apply the one that best suits your needs.
In your case, as your data is an N x 3 matrix you only need to transpose it:
M = rand(100, 3)
kmeans(M', k)

summing 2 distance matrices for getting a third 'overall' distance matrix (ecological context)

I am an ecologist, mainly using the vegan R package.
I have 2 matrices (samples x abundances) (see data below):
matrix 1: nrow = 6 replicates x 24 sites, ncol = 15 species abundances (fish)
matrix 2: nrow = 3 replicates x 24 sites, ncol = 10 species abundances (invertebrates)
The sites are the same in both matrices. I want to get the overall bray-curtis dissimilarity (considering both matrices) among pairs of sites. I see 2 options:
option 1: averaging fish and macro-invertebrate abundances over replicates (at the site scale), cbind-ing the two mean-abundance matrices (nrow = 24 sites, ncol = 15 + 10 mean abundances) and calculating Bray-Curtis.
option 2: for each assemblage, computing the Bray-Curtis dissimilarity among pairs of sites and the distances among site centroids, then summing up the 2 distance matrices.
In case I am not clear, I did these 2 operations in the R code below.
Please, could you tell me whether option 2 is correct and more appropriate than option 1?
Thank you in advance.
Pierre
Here are the R code examples:
generating data
library(plyr);library(vegan)
#assemblage 1: 15 fish species, 6 replicates per site
a1.env=data.frame(
Habitat=paste("H",gl(2,12*6),sep=""),
Site=paste("S",gl(24,6),sep=""),
Replicate=rep(paste("R",1:6,sep=""),24))
summary(a1.env)
a1.bio=as.data.frame(replicate(15,rpois(144,sample(1:10,1))))
names(a1.bio)=paste("F",1:15,sep="")
a1.bio[1:72,]=2*a1.bio[1:72,]
#assemblage 2: 10 taxa of macro-invertebrates, 3 replicates per site
a2.env=a1.env[a1.env$Replicate%in%c("R1","R2","R3"),]
summary(a2.env)
a2.bio=as.data.frame(replicate(10,rpois(72,sample(10:100,1))))
names(a2.bio)=paste("I",1:10,sep="")
a2.bio[1:36,]=0.5*a2.bio[1:36,]
#environmental data at the site scale
env=unique(a1.env[,c("Habitat","Site")])
env=env[order(env$Site),]
OPTION 1, averaging abundances and cbind
a1.bio.mean=ddply(cbind(a1.bio,a1.env),.(Habitat,Site),numcolwise(mean))
a1.bio.mean=a1.bio.mean[order(a1.bio.mean$Site),]
a2.bio.mean=ddply(cbind(a2.bio,a2.env),.(Habitat,Site),numcolwise(mean))
a2.bio.mean=a2.bio.mean[order(a2.bio.mean$Site),]
bio.mean=cbind(a1.bio.mean[,-c(1:2)],a2.bio.mean[,-c(1:2)])
dist.mean=vegdist(sqrt(bio.mean),"bray")
OPTION 2, computing for each assemblage distance among centroids and summing the 2 distances matrix
a1.dist=vegdist(sqrt(a1.bio),"bray")
a1.coord.centroid=betadisper(a1.dist,a1.env$Site)$centroids
a1.dist.centroid=vegdist(a1.coord.centroid,"eucl")
a2.dist=vegdist(sqrt(a2.bio),"bray")
a2.coord.centroid=betadisper(a2.dist,a2.env$Site)$centroids
a2.dist.centroid=vegdist(a2.coord.centroid,"eucl")
summing up the two distance matrices using Gavin Simpson's fuse()
dist.centroid=fuse(a1.dist.centroid,a2.dist.centroid,weights=c(15/25,10/25))
summing up the two Euclidean distance matrices (thanks to Jari Oksanen's correction)
dist.centroid=sqrt(a1.dist.centroid^2 + a2.dist.centroid^2)
and the 'coord.centroid' below for further distance-based analysis (is it correct?)
coord.centroid=cmdscale(dist.centroid,k=23,add=TRUE)
COMPARING OPTION 1 AND 2
pco.mean=cmdscale(vegdist(sqrt(bio.mean),"bray"))
pco.centroid=cmdscale(dist.centroid)
comparison=procrustes(pco.centroid,pco.mean)
protest(pco.centroid,pco.mean)
An easier solution is just to flexibly combine the two dissimilarity matrices, by weighting each matrix. The weights need to sum to 1. For two dissimilarity matrices the fused dissimilarity matrix is
d.fused = (w * d.x) + ((1 - w) * d.y)
where w is a numeric scalar (length 1 vector) weight. If you have no reason to weight one of the sets of dissimilarities more than the other, just use w = 0.5.
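Spelled out with two toy dissimilarity matrices (a minimal sketch; dist objects can be combined with ordinary arithmetic because they are stored as vectors of the lower triangle):
set.seed(1)
d.x <- dist(matrix(runif(50), nrow = 10))   # toy dissimilarities over 10 samples
d.y <- dist(matrix(runif(50), nrow = 10))
w <- 0.5
d.fused <- (w * d.x) + ((1 - w) * d.y)      # element-wise weighted combination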
I have a function to do this for you in my analogue package; fuse(). The example from ?fuse is
train1 <- data.frame(matrix(abs(runif(100)), ncol = 10))
train2 <- data.frame(matrix(sample(c(0,1), 100, replace = TRUE),
ncol = 10))
rownames(train1) <- rownames(train2) <- LETTERS[1:10]
colnames(train1) <- colnames(train2) <- as.character(1:10)
d1 <- vegdist(train1, method = "bray")
d2 <- vegdist(train2, method = "jaccard")
dd <- fuse(d1, d2, weights = c(0.6, 0.4))
dd
str(dd)
This idea is used in supervised Kohonen networks (supervised SOMs) to bring multiple layers of data into a single analysis.
analogue works closely with vegan so there won't be any issues running the two packages side by side.
The correctness of averaging distances depends on what you are doing with those distances. In some applications you may expect that they really are distances, that is, that they satisfy certain metric properties and have a defined relation to the original data. Combined dissimilarities may not satisfy these requirements.
This issue is related to the controversy of partial Mantel type analysis of dissimilarities vs. analysis of rectangular data that is really hot (and I mean red hot) in studies of beta diversity. We in vegan provide tools for both, but I think that in most cases analysis of rectangular data is more robust and more powerful. With rectangular data I mean a normal sampling units times species matrix. The preferred dissimilarity-based methods in vegan map dissimilarities onto rectangular form. These methods in vegan include db-RDA (capscale), permutational MANOVA (adonis) and analysis of within-group dispersion (betadisper). Methods working with dissimilarities as such include mantel, anosim, mrpp, and meandist.
The mean of dissimilarities or distances usually has no clear correspondence to the original rectangular data. That is: mean of the dissimilarities does not correspond to the mean of the data. I think that in general it is better to average or handle data and then get dissimilarities from transformed data.
If you want to combine dissimilarities, the analogue::fuse() style approach is most practical. However, you should understand that fuse() also scales the dissimilarity matrices to equal maxima. If your dissimilarity measures are on a 0..1 scale, this is usually a minor issue, unless one of the data sets is more homogeneous and has a lower maximum dissimilarity than the others. In fuse() they are all equalized, so it is not a simple averaging but an averaging after range equalization. Moreover, you must remember that averaging dissimilarities usually destroys the geometry, and this will matter if you use analysis methods for rectangularized data (adonis, betadisper, capscale in vegan).
Finally about geometry of combining dissimilarities. Dissimilarity indices in scale 0..1 are fractions of type A/B. Two fractions can be added (and then divided to get the average) directly only if the denominators are equal. If you ignore this and directly average the fractions, then the result will not be equal to the same fraction from averaged data. This is what I mean with destroying geometry. Some open-scaled indices are not fractions and may be additive. Manhattan distances are additive. Euclidean distances are square roots of squared differences, and their squares are additive but not the distances directly.
I demonstrate these things by showing the effect of adding together two dissimilarities (and averaging would mean dividing the result by two, or by suitable weights). I take the Barro Colorado Island data of vegan and divide it into two subsets of slightly unequal sizes. A geometry preserving addition of distances of subsets of the data will give the same result as the analysis of the complete data:
library(vegan) ## data and vegdist
library(analogue) ## fuse
data(BCI)
dim(BCI) ## [1] 50 225
x1 <- BCI[, 1:100]
x2 <- BCI[, 101:225]
## Bray-Curtis and fuse: not additive
plot(vegdist(BCI), fuse(vegdist(x1), vegdist(x2), weights = c(100/225, 125/225)))
## summing distances is straightforward (they are vectors), but preserving
## their attributes and keeping the dissimilarities needs fuse or some trick
## like below where we make dist structure dtmp to be replaced with the result
dtmp <- dist(BCI) ## dist skeleton with attributes
dtmp[] <- dist(x1, "manhattan") + dist(x2, "manhattan")
## manhattans are additive and can be averaged
plot(dist(BCI, "manhattan"), dtmp)
## Fuse rescales dissimilarities and they are no more additive
dfuse <- fuse(dist(x1, "man"), dist(x2, "man"), weights=c(100/225, 125/225))
plot(dist(BCI, "manhattan"), dfuse)
## Euclidean distances are not additive
dtmp[] <- dist(x1) + dist(x2)
plot(dist(BCI), dtmp)
## ... but squared Euclidean distances are additive
dtmp[] <- sqrt(dist(x1)^2 + dist(x2)^2)
plot(dist(BCI), dtmp)
## dfuse would rescale squared Euclidean distances like Manhattan (not shown)
I only considered addition above, but if you cannot add, you cannot average. It is a matter of taste whether this is important. Brave people will average things that cannot be averaged, but some people are more timid and want to follow the rules. I would rather be in the second group.
I like the simplicity of this answer, but it only applies to adding 2 distance matrices:
d.fused = (w * d.x) + ((1 - w) * d.y)
so I wrote my own snippet to combine an array of multiple distance matrices (not just 2), using standard R packages:
# generate array of distance matrices
x <- matrix(rnorm(100), nrow = 5)
y <- matrix(rnorm(100), nrow = 5)
z <- matrix(rnorm(100), nrow = 5)
dst_array <- list(dist(x),dist(y),dist(z))
# create new distance matrix with first element of array
dst <- dst_array[[1]]
# loop over remaining array elements, add them to distance matrix
for (jj in 2:length(dst_array)) {
  dst <- dst + dst_array[[jj]]
}
You could also use a numeric vector of scaling factors, my_scale, of the same length as dst_array, to weight each matrix inside the loop:
dst <- dst + my_scale[[jj]] * dst_array[[jj]]

Clustering - how to find the nearest to a cluster

Hints I got on a different question puzzled me quite a bit.
I got an exercise, actually part of a larger exercise:
Cluster some data, using hclust (done)
Given a totally new vector, find out which of the clusters from step 1 it is nearest to.
According to the exercise, this should be doable in quite a short time.
However, after weeks I am still puzzled whether this can be done at all, as apparently all I really get from hclust is a tree - and not, as I assumed, a number of clusters.
As I suppose I was unclear:
Say, for instance, I feed hclust a matrix consisting of 15 1x5 vectors: 5 times (1 1 1 1 1), 5 times (2 2 2 2 2) and 5 times (3 3 3 3 3). This should give me three quite distinct clusters of size 5, which anyone could easily find by hand. Is there a command I can use to actually find out from the program that there are 3 such clusters in my hclust object and what they contain?
You'll have to think about what the right metric is to define closeness to the cluster. Building on the example in the hclust doc, here's a way to compute the means for each cluster and then measure the distance between the new data point and the set of means.
# Leave out one state
A <-USArrests
B <-A[rownames(A)!="Kentucky",]
KY <- A[rownames(A)=="Kentucky",]
# Put the B data into 10 clusters
hc <- hclust(dist(B), "ave")
memb <- cutree(hc, k = 10)
B$cluster = memb[rownames(B)==names(memb)]
# Compute the averages over the clusters
M <-aggregate( .~cluster, data=B, FUN=mean)
M$cluster=NULL
# Now add the hold out state to the set of averages
M <-rbind(M,KY)
# Compute the distance between the clusters and the hold out state.
# This is a pretty silly way to do this but it works.
D <- as.matrix(dist(as.matrix(M),diag=TRUE,upper=TRUE))["Kentucky",]
names(D) = rownames(M)
KYclust = which.min(D[-length(D)])
memb[memb==KYclust]
# Now cluster the full set of states and compare the results.
hc <- hclust(dist(A), "ave")
memb <- cutree(hc, k = 10)
a=memb[which(names(memb)=="Kentucky")]
memb[memb==a]
In contrast to k-means, clusters found by hclust can be of arbitrary shape.
The distance to the nearest cluster center therefore is not always meaningful.
Doing a 1-nearest-neighbor style assignment is probably better.
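A minimal, self-contained sketch of that 1-nearest-neighbor assignment, reusing the USArrests example from the answer above (knn() from the class package is one way to do it; an illustration, not the only option):
library(class)                               # provides knn()
A  <- USArrests
B  <- A[rownames(A) != "Kentucky", ]         # data that was clustered
KY <- A[rownames(A) == "Kentucky", ]         # the "new" vector
memb <- cutree(hclust(dist(B), "ave"), k = 10)
# label the new point with the cluster of its single nearest clustered point
knn(train = B, test = KY, cl = factor(memb), k = 1)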
