How do I assign weights using a kernel function based on a vector of pairwise Euclidean distances? - r

I want to quantify the dissimilarity between two groups. Each group has 5 observations, so there are 25 pairwise combinations.
For each combination I have calculated the pairwise Euclidean distance (in feature space), so I have a vector of 25 pairwise Euclidean distances as follows:
set.seed(1)
runif(n=25, min=50, max=90)
[1] 60.62035 64.88496 72.91413 86.32831 58.06728 85.93559 87.78701 76.43191 75.16456 52.47145 58.23898 57.06227 77.48091
[14] 65.36415 80.79366 69.90797 78.70474 89.67624 65.20141 81.09781 87.38821 58.48570 76.06695 55.02220 60.68883
I want to use a kernel function to assign weights to the 25 combinations based on this vector of pairwise Euclidean distances: the shorter the distance, the larger the weight.
How can I do this in R?
I have limited knowledge about kernels. Thank you in advance for any suggestions!
I would really appreciate it even if you can only give me some hints about the mathematical formula, without any programming.
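A minimal sketch of one common choice, a Gaussian (RBF) kernel K(d) = exp(-d^2 / (2*h^2)). The bandwidth h is an assumption here (the median distance is just a convenient default); smaller h concentrates the weight on the closest pairs:
## Gaussian kernel weights from a vector of pairwise distances
set.seed(1)
d <- runif(n = 25, min = 50, max = 90)   # the pairwise Euclidean distances above

h <- median(d)                           # bandwidth: placeholder choice, tune as needed
w <- exp(-(d / h)^2 / 2)                 # K(d) = exp(-d^2 / (2 * h^2))
w <- w / sum(w)                          # optional: normalise so the weights sum to 1

plot(d, w, xlab = "Euclidean distance", ylab = "weight")  # shorter distance, larger weight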

Related

Weighted observation frequency clustering using hclust in R

I have a large matrix of 500K observations to cluster using hierarchical clustering. Due to the large size, I do not have the computing power to calculate the distance matrix.
To overcome this problem I chose to aggregate my matrix, merging observations that were identical, which reduced it to about 10K observations. I have the frequency of each row in this aggregated matrix, and I now need to incorporate this frequency as a weight in my hierarchical clustering.
The data is a mixture of numerical and categorical variables for the 500K observations, so I have used the daisy() function (cluster package) to calculate the Gower dissimilarity for my aggregated dataset. I want to use hclust in the stats package on the aggregated dataset, but I want to take into account the frequency of each observation. From the help information for hclust the arguments are as follows:
hclust(d, method = "complete", members = NULL)
The information for the members argument is: NULL or a vector with length size of d. See the ‘Details’ section. When you look at the Details section you get: If members != NULL, then d is taken to be a dissimilarity matrix between clusters instead of dissimilarities between singletons, and members gives the number of observations per cluster. This way the hierarchical cluster algorithm can be ‘started in the middle of the dendrogram’, e.g., in order to reconstruct the part of the tree above a cut (see examples). Dissimilarities between clusters can be efficiently computed (i.e., without hclust itself) only for a limited number of distance/linkage combinations, the simplest one being squared Euclidean distance and centroid linkage. In this case the dissimilarities between the clusters are the squared Euclidean distances between cluster means.
From the above description, I am unsure if I can assign my frequency weights to the members argument, as it is not clear whether that is the purpose of this argument. I would like to use it like this:
hclust(d, method = "complete", members = df$freq)
Where df$freq is the frequency of each row in the aggregated matrix. So if a row is duplicated 10 times this value would be 10.
If anyone can help me that would be great,
Thanks
Yes, this should work fine for most linkages, in particular single, group average and complete linkage. For Ward etc. you need to take the weights into account correctly yourself.
But even that part is not hard. Just make sure to use the cluster sizes, because you need to pass the distance of two clusters, not of two points. So the matrix should contain the distance of n1 points at location x and n2 points at location y. For min/max/mean this n disappears or cancels out. For Ward, you should get an SSQ-like formula.
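A minimal sketch of the plan above, with a made-up aggregated data frame (the column names and the freq column are hypothetical): compute the Gower dissimilarity on the aggregated rows and pass the row frequencies as members.
library(cluster)   # daisy() for the Gower dissimilarity

## df: aggregated data, one row per unique observation; df$freq counts how many
## original rows each aggregated row represents (illustrative values only)
df <- data.frame(num  = c(1.2, 3.4, 2.2, 0.5),
                 cat  = factor(c("a", "b", "a", "c")),
                 freq = c(10, 3, 7, 1))

d  <- daisy(df[, c("num", "cat")], metric = "gower")   # leave the frequency column out
hc <- hclust(d, method = "complete", members = df$freq)
plot(hc)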

k means clustering on matrix

I am trying to cluster a multidimensional functional object with the "kmeans" algorithm. What does that mean? I no longer have a single vector per row or individual; instead I have a 3x3 observation matrix per individual. For example, Individual 1 has the following observations:
(x1, x2, x3),(y1,y2,y3),(z1,z2,z3).
The same structure of observations is also given for the other individuals. So do you know how I can cluster with "kmeans" including all 3 observation vectors, and not only one observation vector as is normally done in "kmeans" clustering?
Would you cluster each observation vector, e.g. (x1, x2, x3), separately and then somehow combine the information? I want to do this with the kmeans() function in R.
Many thanks for your answers!
Using k-means you interpret each observation as a point in an N-dimensional vector space. Then you minimize the distances between your observations and the cluster centers.
Since the data is viewed as dots in an N-dimensional space, the actual arrangement of the values does not matter.
You can therefore either tell your k-means routine to use a matrix norm, for example the Frobenius norm, to compute the distances, or flatten your observations from 3-by-3 matrices into 1-by-9 vectors. The Frobenius norm of an NxN matrix is equivalent to the Euclidean norm of the corresponding 1xN^2 vector.
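A minimal sketch of the flattening approach with made-up data (obs_list, the number of individuals and the number of clusters are all placeholders):
## each individual has a 3x3 observation matrix; flatten each one into a row of 9
set.seed(42)
n_individuals <- 20
obs_list <- replicate(n_individuals, matrix(rnorm(9), nrow = 3), simplify = FALSE)

X <- t(sapply(obs_list, as.vector))   # 20 x 9 matrix, one flattened individual per row

## the Frobenius norm on the 3x3 matrices equals the Euclidean norm on these rows,
## so plain kmeans() now does the job
km <- kmeans(X, centers = 3, nstart = 25)
km$cluster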
Just pass all three columns to kmeans() and it will calculate the distances in 3 dimensions, if that is what you are looking for.

Calculate Euclidean Distance of pairs over 3 points?

MY DATA
I have a matrix Median that contains three qualities, Speed, Angle & Acceleration, in virtual 3D space. Each set of qualities belongs to an individual person, termed Class.
Speed<-c(18,21,25,19)
Angle<-c(90,45,90,120)
Acceleration<-c(4,5,9,4)
Class<-c("Nigel","Paul","Kelly","Steve")
Median = data.frame(Class,Speed,Angle,Acceleration)
mm = as.matrix(Median)
In the example above, Nigel's Speed, Angle and Acceleration qualities would be (18,90,4).
MY PROBLEM
I wish to know the euclidean distance between each individual person/class. For example, the euclidean distance between Nigel and Paul, Nigel and Kelly etc. I then wish to display the results in a dendrogram, as a result of hierarchical clustering.
WHAT I HAVE (UNSUCCESSFULLY) ATTEMPTED
I first used hc = hclust(dist(mm)) and then plot(hc), although this results in a dendrogram of Speed only. It seems the function pdist() can compute the distance between two matrices of observations, but I have three matrices. Is this possible in R? I am new to the language and have found a similar MATLAB question here: Calculating Euclidean distance of pairs of 3D points in matlab, but how do I write this in R code?
Many thanks.
When you transform your data.frame into a matrix, all values become characters; I don't think that is what you want... (moreover, you're trying to compute distances with the "Class" names as one of the variables...)
The best approach would be to put your "Class" values as row.names and then compute your distances and hclust:
mm<-Median[,-1]
row.names(mm)<-Median[,1]
Then you can compute the Euclidean distances between the Classes with
dist(mm, method = "euclidean"):
> dist(mm, method = "euclidean")
          Nigel      Paul     Kelly
Paul  45.110974
Kelly  8.602325 45.354162
Steve 30.016662 75.033326 31.000000
Finally, perform your hierarchical classification:
hac<-hclust(dist(mm,method="euclidean"))
and plot(hac,hang=-1) to display the dendrogram.
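For convenience, a consolidated, self-contained version of the steps above (same data and same calls, just gathered in one runnable block):
Speed <- c(18, 21, 25, 19)
Angle <- c(90, 45, 90, 120)
Acceleration <- c(4, 5, 9, 4)
Class <- c("Nigel", "Paul", "Kelly", "Steve")
Median <- data.frame(Class, Speed, Angle, Acceleration)

mm <- Median[, -1]              # keep only the numeric columns
row.names(mm) <- Median[, 1]    # use the Class names as row labels

hac <- hclust(dist(mm, method = "euclidean"))
plot(hac, hang = -1)            # dendrogram over all three variables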

How the command dist(x,method="binary") calculates the distance matrix?

I have been trying to figure that out, but without much success. I am working with a table of binary data (0s and 1s). I managed to estimate a distance matrix from my data using the R function dist(x, method = "binary"), but I am not quite sure how exactly this function estimates the distance matrix. Is it using the Jaccard coefficient J = M11 / (M10 + M01 + M11)?
This is easily found in the help page ?dist:
This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.
[...]
binary (aka asymmetric binary): The vectors are regarded as binary bits, so non-zero elements are ‘on’ and zero elements are ‘off’. The distance is the proportion of bits in which only one is on amongst those in which at least one is on.
This is equivalent to the Jaccard distance as described in Wikipedia:
An alternate interpretation of the Jaccard distance is as the ratio of the size of the symmetric difference to the union.
In your notation, it is 1 - J = (M01 + M10)/(M01 + M10 + M11).
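A quick check of this with a small made-up 0/1 matrix, comparing dist(x, method = "binary") against the formula above:
x <- rbind(a = c(1, 1, 0, 1, 0),
           b = c(1, 0, 0, 1, 1),
           c = c(0, 1, 1, 0, 0))

dist(x, method = "binary")

## manual computation for rows a and b
a <- x["a", ]; b <- x["b", ]
M11 <- sum(a == 1 & b == 1)
M10 <- sum(a == 1 & b == 0)
M01 <- sum(a == 0 & b == 1)
(M01 + M10) / (M01 + M10 + M11)   # matches the a-b entry from dist(): 0.5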

summing 2 distance matrices for getting a third 'overall' distance matrix (ecological context)

I am an ecologist, mainly using the vegan R package.
I have 2 matrices (sample x abundances) (See data below):
matrix 1: nrow = 6 replicates x 24 sites, ncol = 15 species abundances (fish)
matrix 2: nrow = 3 replicates x 24 sites, ncol = 10 species abundances (invertebrates)
The sites are the same in both matrices. I want to get the overall Bray-Curtis dissimilarity (considering both matrices) among pairs of sites. I see 2 options:
option 1: average the fish and macro-invertebrate abundances over replicates (at the site scale), cbind the two mean abundance matrices (nrow = 24 sites, ncol = 15 + 10 mean abundances) and calculate Bray-Curtis.
option 2: for each assemblage, compute the Bray-Curtis dissimilarities, then compute the distances among site centroids, and finally sum the two distance matrices.
In case I am not clear, I have done these 2 operations in the R code below.
Please could you tell me whether option 2 is correct and more appropriate than option 1?
thank you in advance.
Pierre
Here are the R code examples:
Generating the data
library(plyr);library(vegan)
#assemblage 1: 15 fish species, 6 replicates per site
a1.env=data.frame(
Habitat=paste("H",gl(2,12*6),sep=""),
Site=paste("S",gl(24,6),sep=""),
Replicate=rep(paste("R",1:6,sep=""),24))
summary(a1.env)
a1.bio=as.data.frame(replicate(15,rpois(144,sample(1:10,1))))
names(a1.bio)=paste("F",1:15,sep="")
a1.bio[1:72,]=2*a1.bio[1:72,]
#assemblage 2: 10 taxa of macro-invertebrates, 3 replicates per site
a2.env=a1.env[a1.env$Replicate%in%c("R1","R2","R3"),]
summary(a2.env)
a2.bio=as.data.frame(replicate(10,rpois(72,sample(10:100,1))))
names(a2.bio)=paste("I",1:10,sep="")
a2.bio[1:36,]=0.5*a2.bio[1:36,]
#environmental data at the site scale
env=unique(a1.env[,c("Habitat","Site")])
env=env[order(env$Site),]
OPTION 1, averaging abundances and cbind
a1.bio.mean=ddply(cbind(a1.bio,a1.env),.(Habitat,Site),numcolwise(mean))
a1.bio.mean=a1.bio.mean[order(a1.bio.mean$Site),]
a2.bio.mean=ddply(cbind(a2.bio,a2.env),.(Habitat,Site),numcolwise(mean))
a2.bio.mean=a2.bio.mean[order(a2.bio.mean$Site),]
bio.mean=cbind(a1.bio.mean[,-c(1:2)],a2.bio.mean[,-c(1:2)])
dist.mean=vegdist(sqrt(bio.mean),"bray")
OPTION 2, computing for each assemblage the distances among centroids and summing the 2 distance matrices
a1.dist=vegdist(sqrt(a1.bio),"bray")
a1.coord.centroid=betadisper(a1.dist,a1.env$Site)$centroids
a1.dist.centroid=vegdist(a1.coord.centroid,"eucl")
a2.dist=vegdist(sqrt(a2.bio),"bray")
a2.coord.centroid=betadisper(a2.dist,a2.env$Site)$centroids
a2.dist.centroid=vegdist(a2.coord.centroid,"eucl")
Summing up the two distance matrices using Gavin Simpson's fuse()
dist.centroid=fuse(a1.dist.centroid,a2.dist.centroid,weights=c(15/25,10/25))
Summing up the two Euclidean distance matrices (thanks to Jari Oksanen's correction)
dist.centroid=sqrt(a1.dist.centroid^2 + a2.dist.centroid^2)
and the 'coord.centroid' below for further distance-based analysis (is it correct?)
coord.centroid=cmdscale(dist.centroid,k=23,add=TRUE)
COMPARING OPTION 1 AND 2
pco.mean=cmdscale(vegdist(sqrt(bio.mean),"bray"))
pco.centroid=cmdscale(dist.centroid)
comparison=procrustes(pco.centroid,pco.mean)
protest(pco.centroid,pco.mean)
An easier solution is just to flexibly combine the two dissimilarity matrices, by weighting each matrix. The weights need to sum to 1. For two dissimilarity matrices the fused dissimilarity matrix is
d.fused = (w * d.x) + ((1 - w) * d.y)
where w is a numeric scalar (length 1 vector) weight. If you have no reason to weight one of the sets of dissimilarities more than the other, just use w = 0.5.
I have a function to do this for you in my analogue package; fuse(). The example from ?fuse is
train1 <- data.frame(matrix(abs(runif(100)), ncol = 10))
train2 <- data.frame(matrix(sample(c(0,1), 100, replace = TRUE),
                            ncol = 10))
rownames(train1) <- rownames(train2) <- LETTERS[1:10]
colnames(train1) <- colnames(train2) <- as.character(1:10)
d1 <- vegdist(train1, method = "bray")
d2 <- vegdist(train2, method = "jaccard")
dd <- fuse(d1, d2, weights = c(0.6, 0.4))
dd
str(dd)
This idea is used in supervised Kohonen networks (supervised SOMs) to bring multiple layers of data into a single analysis.
analogue works closely with vegan so there won't be any issues running the two packages side by side.
The correctness of averaging distances depends on what you are doing with those distances. In some applications you may expect that they really are distances; that is, that they satisfy certain metric properties and have a defined relation to the original data. Combined dissimilarities may not satisfy these requirements.
This issue is related to the controversy over partial Mantel type analysis of dissimilarities vs. analysis of rectangular data, which is really hot (and I mean red hot) in studies of beta diversity. We in vegan provide tools for both, but I think that in most cases analysis of rectangular data is more robust and more powerful. By rectangular data I mean the normal sampling units times species matrix. The preferred dissimilarity-based methods in vegan map dissimilarities onto a rectangular form. These methods in vegan include db-RDA (capscale), permutational MANOVA (adonis) and analysis of within-group dispersion (betadisper). Methods working with dissimilarities as such include mantel, anosim, mrpp and meandist.
The mean of dissimilarities or distances usually has no clear correspondence to the original rectangular data. That is: mean of the dissimilarities does not correspond to the mean of the data. I think that in general it is better to average or handle data and then get dissimilarities from transformed data.
If you want to combine dissimilarities, the analogue::fuse() style approach is the most practical. However, you should understand that fuse() also scales the dissimilarity matrices to equal maxima. If you have dissimilarity measures on a 0..1 scale, this is usually a minor issue, unless one of the data sets is more homogeneous and has a lower maximum dissimilarity than the others. In fuse() they are all equalized, so it is not a simple averaging but an averaging after range equalization. Moreover, you must remember that averaging dissimilarities usually destroys the geometry, and this will matter if you use analysis methods intended for rectangularized data (adonis, betadisper, capscale in vegan).
Finally, about the geometry of combining dissimilarities. Dissimilarity indices on a 0..1 scale are fractions of type A/B. Two fractions can be added (and then divided to get the average) directly only if the denominators are equal. If you ignore this and directly average the fractions, then the result will not be equal to the same fraction computed from the averaged data. This is what I mean by destroying the geometry. Some open-scaled indices are not fractions and may be additive. Manhattan distances are additive. Euclidean distances are square roots of squared differences; their squares are additive, but the distances themselves are not.
I demonstrate these things by showing the effect of adding together two dissimilarities (averaging would mean dividing the result by two, or by suitable weights). I take the Barro Colorado Island data of vegan and divide it into two subsets of slightly unequal sizes. A geometry-preserving addition of the distances of the subsets will give the same result as the analysis of the complete data:
library(vegan) ## data and vegdist
library(analogue) ## fuse
data(BCI)
dim(BCI) ## [1] 50 225
x1 <- BCI[, 1:100]
x2 <- BCI[, 101:225]
## Bray-Curtis and fuse: not additive
plot(vegdist(BCI), fuse(vegdist(x1), vegdist(x2), weights = c(100/225, 125/225)))
## summing distances is straightforward (they are vectors), but preserving
## their attributes and keeping the dissimilarities needs fuse or some trick
## like below where we make dist structure dtmp to be replaced with the result
dtmp <- dist(BCI) ## dist skeleton with attributes
dtmp[] <- dist(x1, "manhattan") + dist(x2, "manhattan")
## manhattans are additive and can be averaged
plot(dist(BCI, "manhattan"), dtmp)
## fuse rescales dissimilarities and they are no longer additive
dfuse <- fuse(dist(x1, "man"), dist(x2, "man"), weights=c(100/225, 125/225))
plot(dist(BCI, "manhattan"), dfuse)
## Euclidean distances are not additive
dtmp[] <- dist(x1) + dist(x2)
plot(dist(BCI), dtmp)
## ... but squared Euclidean distances are additive
dtmp[] <- sqrt(dist(x1)^2 + dist(x2)^2)
plot(dist(BCI), dtmp)
## dfuse would rescale squared Euclidean distances like Manhattan (not shown)
I only considered addition above, but if you cannot add, you cannot average. It is a matter of taste whether this is important. Brave people will average things that cannot be averaged, but some people are more timid and want to follow the rules. I would rather go with the second group.
I like the simplicity of this answer, but it only applies to adding 2 distance matrices:
d.fused = (w * d.x) + ((1 - w) * d.y)
so I wrote my own snippet to combine an array of multiple distance matrices (not just 2), using only standard R packages:
# generate array of distance matrices
x <- matrix(rnorm(100), nrow = 5)
y <- matrix(rnorm(100), nrow = 5)
z <- matrix(rnorm(100), nrow = 5)
dst_array <- list(dist(x),dist(y),dist(z))
# create new distance matrix with first element of array
dst <- dst_array[[1]]
# loop over remaining array elements, add them to distance matrix
for (jj in 2:length(dst_array)) {
  dst <- dst + dst_array[[jj]]
}
You could also use a vector of the same length as dst_array to define scaling factors:
dst <- dst + my_scale[[jj]] * dst_array[[jj]]
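A fully runnable version of the loop above, with the scaling vector (here called my_scale as in the line above; the weights themselves are arbitrary and purely illustrative) made explicit:
set.seed(1)
x <- matrix(rnorm(100), nrow = 5)
y <- matrix(rnorm(100), nrow = 5)
z <- matrix(rnorm(100), nrow = 5)

dst_array <- list(dist(x), dist(y), dist(z))
my_scale  <- c(0.5, 0.3, 0.2)          # one weight per distance matrix, summing to 1

dst <- my_scale[[1]] * dst_array[[1]]  # start from the first weighted matrix
for (jj in 2:length(dst_array)) {
  dst <- dst + my_scale[[jj]] * dst_array[[jj]]
}

class(dst)   # still a "dist" object, so it can go straight into hclust(), cmdscale(), etc.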
