K-means and Mahalanobis distance - r

I'd like to use the Mahalanobis distance in the K-means algorithm, because I have 4 variables which are highly correlated (0.85).
It appears to me that it's better to use the Mahalanobis distance in this case.
The problem is I don't know how to implement it in R with the K-means algorithm.
I think I need to "fake" it by transforming the data before the clustering step, but I don't know how.
I tried the classical kmeans with the Euclidean distance on standardized data, but as I said, there is too much correlation.
fit <- kmeans(mydata.standardize, 4)
I also tried to find a distance parameter, but I don't think one exists in the kmeans() function.
The expected result is a way to apply the K-means algorithm with the Mahalanobis distance.

You can rescale the data before running the algorithm,
using the Cholesky decomposition of the variance matrix:
the Euclidean distance after the transformation
is the Mahalanobis distance before it.
# Sample data
n <- 100
k <- 5
x <- matrix( rnorm(k*n), nrow=n, ncol=k )
x[,1:2] <- x[,1:2] %*% matrix( c(.9,1,1,.9), 2, 2 )  # make the first two columns highly correlated
var(x)
# Rescale the data
C <- chol( var(x) )
y <- x %*% solve(C)
var(y) # The identity matrix
kmeans(y, 4)
But this assumes that all the clusters have the same shape and orientation as the whole data.
If this is not the case, you may want to look at models that explicitly allow for elliptical clusters,
e.g., in the mclust package.
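For reference, here is a minimal sketch of that mclust route, reusing x from above (the choice of G = 4 simply mirrors the 4 clusters requested earlier and is an assumption, not a recommendation):
# Gaussian mixture with ellipsoidal components, fitted to the sample data x
library(mclust)
fit.mc <- Mclust(x, G = 4)
summary(fit.mc)
fit.mc$classification  # cluster assignments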

You can see on page 10 of Brian S. Everitt's book "An R and S-PLUS® Companion to Multivariate Analysis" the formula for the Mahalanobis distance. Euclidean distance is a special case of the Mahalanobis distance, obtained when the sample covariance is the identity matrix. The Euclidean distance on the rescaled data 'y' is then the Mahalanobis distance on the original data.
# Rescale the data
C <- chol( var(x) )
y <- x %*% solve(C)
var(y) # The identity matrix
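To check the equivalence, here is a quick sanity check (a sketch reusing x and y from above) comparing the Euclidean distance between two rescaled rows with the Mahalanobis distance between the corresponding original rows:
# Euclidean distance on y should equal Mahalanobis distance on x
d.eucl <- as.numeric(dist(y[1:2, ]))
d.maha <- sqrt(mahalanobis(x[1, , drop = FALSE], center = x[2, ], cov = var(x)))
all.equal(d.eucl, as.numeric(d.maha))  # TRUE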

Related

How to calculate the volume under a surface defined by discrete data?

I need to determine the volume beneath a series of surfaces represented by discrete data points. In my data, each sample is stored as a separate data frame within a list of data frames. Here is some (small) example data:
df1 <- data.frame(x=c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
                  y=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
                  z=c(0,2,0,4,6,7,3,2,1,2,7,8,9,4,2))
df2 <- data.frame(x=c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
                  y=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
                  z=c(1,1,2,3,5,6,2,1,3,3,8,9,8,3,1))
DF <- list(df1,df2)
Answers to similar questions are either in other languages (MATLAB, Python), or they do not contain usable script to address the problem (as here). I can think of two acceptable ways to estimate the volume beneath each surface: 1) write out a discretized version of Simpson's rule as a function in R that is applied across the list of data frames (DF); 2) fit an arbitrary relationship between x, y, and z and use multivariate numerical integration to find the volume under the surface (with functions like simpson2d / quad2d in the package pracma or adaptIntegrate in cubature).
Regarding the first approach, the formula for the composite Simpson's rule (that I would like to use) is here, but due to its complexity, I have been unsuccessful in writing a working double-summation function. In this expression, I(lambda(em), lambda(ex)) is equal to z in the above datasets at each x,y grid point, and Delta(em) and Delta(ex) represent the intervals between the x and y points.
The second approach would essentially extend the approach found here to multivariate spline fits and then pass the predicted z values as a function for integration. Here's what I have tried so far for this approach:
require(pracma)
df1.loess <- loess(z ~ x + y, data=DF[[1]])
mod.fun <- function(x,y) predict(df1.loess, newdata=x,y)
simpson2d(mod.fun, x=c(2,6), y=c(1,3))
But this does not yield useful results.
In reality, I have a list of almost 100 data frames for individual samples, so I really need to be able to express the solution as a series of lapply functions that automate these calculations across all data frames in the list. An example looks something like this:
require(akima)
DF.splines <- lapply(DF, function(x,y,z) interp(x = "x", y = "y", z = "z",
                                                linear=F, nx=4, ny=2))
Unfortunately, this produces an exception for missing values and Infs. I'm extremely open to any suggestions for how to successfully implement one of these strategies, or to utilize a different (simpler?) approach. Could a kriging function (like km in the DiceKriging package) produce a better fit that could be passed on for numerical integration?
I am assuming that the surface mesh is defined by connecting points via straight lines. Then you can find the volume beneath that surface via:
1. triangular tessellation of the (x,y) grid into triangles T_i with area A_i
2. finding the corresponding z values Z_i for each of the triangles T_i
3. calculating the volume V_i of the truncated prisms (defined by T_i and Z_i) via V_i = A_i * sum(Z_i)/3 (see https://en.wikipedia.org/wiki/Prism_(geometry) and https://math.stackexchange.com/questions/2371139/volume-of-truncated-prism)
4. summing up all truncated prism volumes V_i
Keep in mind, however, that the volume does depend on your tessellation and that the tessellation is not unique. But your problem is not fully defined in the sense that it does not describe how one should interpolate between points. So any approach to calculate a volume will have to make additional assumptions.
Going back to my solution approach, points 1 and 2 can be achieved via the geometry package.
Here is some code:
library(geometry)
getVolume=function(df) {
  # find triangular tessellation of (x,y) grid
  res=delaunayn(as.matrix(df[,-3]),full=TRUE,options="Qz")
  # calculate sum of truncated prism volumes
  sum(mapply(function(triPoints,A) A/3*sum(df[triPoints,"z"]),
             split.data.frame(res$tri,seq_along(res$areas)),
             res$areas))
}
sapply(DF,getVolume)
#[1] 32.50000 30.33333
Since it's hard to check whether the results are consistent, here is a simple example where we know the right answer: a cube with side length 2 from which we have cut out a wedge along the x axis. The cut-out region is 1/4 of the total volume.
cutOutCube=expand.grid(c(0,1,2),c(0,1,2))
colnames(cutOutCube)=c("x","y")
cutOutCube$z=ifelse(cutOutCube$x==1,1,2)
sapply(list(cutOutCube),getVolume)
#[1] 6
That's correct since 2^3*(1-1/4)=6.
Another sanity check can be performed by calculating the "complement" of the volume w.r.t. a simple cuboid where all z values are set to the maximum z value (in your case max(z)=9 for both data sets). The simple cuboid volumes are 72 in both of your cases. Now let's define the complement surfaces and sum up volume and complement volume.
df1c=df1
df1c$z=max(df1c$z)-df1c$z
df2c=df2
df2c$z=max(df2c$z)-df2c$z
DFc=list(df1c,df2c)
sapply(DFc,getVolume)+sapply(DF,getVolume)
#[1] 72 72
So volume and complement volume give the right simple cuboid volume in both cases.
You could approximate the surface through a "barycentric Lagrangian" approach as implemented in function barylag2d in the pracma package. Then, to avoid any vectorization problems, apply the Gaussian quadrature rules explicitly.
library(pracma)
df1 <- data.frame(x=c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
                  y=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3),
                  z=c(0,2,0,4,6,7,3,2,1,2,7,8,9,4,2))
# Define the nodes in x- and y-direction
xn <- df1$x[c(1,4,7,10,13)]
yn <- df1$y[1:3]
# Define the matrix representing the function
m1 <- matrix(df1$z, nrow=5, byrow=TRUE)
f <- function(x, y)
    c(pracma::barylag2d(m1, xn, yn, x, y))
# 32 nodes in integration intervals
n <- 32
xa <- 2; xb <- 6; ya <- 1; yb <- 3
# Apply quadrature rules explicitly
cx <- gaussLegendre(n, xa, xb)
x <- cx$x; wx <- cx$w
cy <- gaussLegendre(n, ya, yb)
y <- cy$x; wy <- cy$w
# Sum weights * values over all nodes
I <- 0
for (i in 1:n) {
    for (j in 1:n) {
        I <- I + wx[i] * wy[j] * f(x[i], y[j])
    }
}
I # 40.37037
The integral value of 40 seems reasonable given the data. simpson2d or quad2d will not work in this setting.
You may try whether adaptIntegrate works with the function f defined in this way.
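A quick sketch of that suggestion, reusing m1, xn, yn and the limits xa, xb, ya, yb defined above (adaptIntegrate in cubature expects a function of a single coordinate vector, so the interpolant is wrapped accordingly; whether the result matches the Gauss-Legendre value is not guaranteed):
library(cubature)
# wrap the interpolant so it takes one coordinate vector p = c(x, y)
f.vec <- function(p) c(pracma::barylag2d(m1, xn, yn, p[1], p[2]))
adaptIntegrate(f.vec, lowerLimit = c(xa, ya), upperLimit = c(xb, yb))$integral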

Calculate Rao's quadratic entropy

Rao QE is a weighted Euclidean distance matrix. I have the vectors for the elements of the d_ij's in a data table dt, one column per element (say there are x of them). p is the final column. nrow = S. The double sums run over the lower-left (or upper-right, since it is symmetric) elements of the distance matrix.
If I only needed an unweighted distance matrix I could simply run dist() over the x columns. How do I weight the d_ij's by the product of p_i and p_j?
An example data set is at https://github.com/GeraldCNelson/nutmod/blob/master/RaoD_example.csv with the p's in the column called foodQ.ratio.
You still start with dist for the raw Euclidean distance matrix. Let it be D. As you will read from R - How to get row & column subscripts of matched elements from a distance matrix, a "dist" object is not a real matrix, but a 1D array. So first do D <- as.matrix(D) or D <- dist2mat(D) to convert it to a complete matrix before the following.
Now, let p be the vector of weights; Rao's QE is just the quadratic form p'Dp / 2:
c(crossprod(p, D %*% p)) / 2
Note, I am not doing everything in the most efficient way. I have performed a symmetric matrix-vector multiplication D %*% p using the full D rather than just its lower triangular part. However, R does not have a routine for triangular matrix-vector multiplication, so I compute the full version and then divide by 2.
This doubles the amount of computation necessary; also, making D a full matrix doubles the memory cost. But if your problem is of small to medium size this is absolutely fine. For a large problem, if you are an R and C wizard, call the BLAS routine dtrmv or even dtpmv for the triangular matrix-vector computation.
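For concreteness, here is a self-contained sketch with made-up data (the column names trait1, trait2 and p are placeholders, not the columns of the linked CSV):
set.seed(1)
dt <- data.frame(trait1 = rnorm(5), trait2 = rnorm(5), p = runif(5))
dt$p <- dt$p / sum(dt$p)                             # weights sum to 1
D <- as.matrix(dist(dt[, c("trait1", "trait2")]))    # Euclidean distance matrix
raoQ <- c(crossprod(dt$p, D %*% dt$p)) / 2           # quadratic form p'Dp / 2
raoQ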
Update
I just found this simple paper, "Rao's quadratic entropy as a measure of functional diversity based on multiple traits", for the definition and use of Rao's QE. It mentions that we can replace the Euclidean distance with the Mahalanobis distance. In case we want to do this, use my code in Mahalanobis distance of each pair of observations for fast computation of the Mahalanobis distance matrix.
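This is not the code from that linked answer, but one quick sketch of a pairwise Mahalanobis distance matrix uses the same Cholesky rescaling shown in the k-means answer above (continuing with the toy dt from the previous example):
X  <- as.matrix(dt[, c("trait1", "trait2")])
Xs <- X %*% solve(chol(cov(X)))          # whiten the traits
Dm <- as.matrix(dist(Xs))                # pairwise Mahalanobis distances
raoQ.m <- c(crossprod(dt$p, Dm %*% dt$p)) / 2
raoQ.m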

Time taken to krige in gstat package in R

The following R program creates an interpolated surface from 470 data points, using the Walker Lake data in the gstat package.
source("D:/kriging/allfunctions.r") # Reads in all functions.
source("D:/kriging/panel.gamma0.r") # Reads in panel function for xyplot.
library(lattice) # Needed for "xyplot" function.
library(geoR) # Needed for "polygrid" function.
library(akima)
library(gstat);
library(sp);
walk470 <- read.table("D:/kriging/walk470.txt",header=T)
attach(walk470)
coordinates(walk470) = ~x+y
walk.var1 <- variogram(v ~ x+y,data=walk470,width=10) # the width has to be tuned, resulting in different point pairs
plot(walk.var1,xlab="Distance",ylab="Semivariance",main="Variogram for V, Lag Spacing = 5")
model1.out <- fit.variogram(walk.var1,vgm(70000,"Sph",40,20000))
plot(walk.var1, model=model1.out,xlab="Distance",ylab="Semivariance",main="Variogram for V, Lag Spacing = 10")
poly <- chull(coordinates(walk470))
plot(coordinates(walk470),type="n",xlab="X",ylab="Y",cex.lab=1.6,main="Plot of Sample and Prediction Sites",cex.axis=1.5,cex.main=1.6)
lines(coordinates(walk470)[poly,])
poly.in <- polygrid(seq(2.5,247.5,5),seq(2.5,297.5,5),coordinates(walk470)[poly,])
points(poly.in)
points(coordinates(walk470),pch=16)
coordinates(poly.in) <- ~ x+y
krige.out <- krige(v ~ 1, walk470,poly.in, model=model1.out)
print(krige.out)
This program calculates the following for each of the 2688 prediction points:
(470x470) matrix inversion
(470x470) and (470x1) matrix multiplication
Is the gstat package using some smart way of doing this calculation? I learned from a previous Stack Overflow query that it uses Cholesky decomposition for matrix inversion. Is it normal for one machine to calculate this so quickly?
It uses the LDL' decomposition, which is similar to Cholesky. As you are using global kriging, the covariance matrix needs to be decomposed only once; then, for each prediction point, a system is solved, which is O(n). No 470x470 matrix ever gets inverted, nor are solutions obtained by multiplying with an inverse. Inverses are notational devices, but are avoided as a computational strategy whenever possible. In R, for instance, compare the runtime of solve(A, b) with solve(A) %*% b.
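A quick way to see that point (a sketch with an arbitrary symmetric positive-definite matrix of the same order as your data, not the actual kriging matrices):
set.seed(42)
n <- 470
A <- crossprod(matrix(rnorm(n * n), n))   # symmetric positive-definite matrix
b <- rnorm(n)
system.time(solve(A, b))       # factorize and solve the system directly
system.time(solve(A) %*% b)    # compute the full inverse first: slower and less stable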
Use the source, Luke!

Principal component analysis with EQUAMAX rotation

I need to do a principal component analysis (PCA) with EQUAMAX-rotation in R.
Unfortunately the function principal() I normally use for PCA does not offer this kind of rotation.
I found out that it may be possible somehow with the package GPArotation, but I could not yet figure out how to use this in the PCA.
Maybe someone can give an example on how to do an equamax-rotation PCA?
Or is there a function for PCA in another package that offers the use of equamax-rotation directly?
The package psych, from which I guess you are using principal(), has the rotations varimax, quartimax, promax, oblimin, simplimax, and cluster, but not equamax (psych manual p. 232), which is a compromise between varimax and quartimax.
An excerpt from the STATA manual (mvrotate, p. 3):
Rotation criteria
In the descriptions below, the matrix to be rotated is denoted as A, p denotes the number of rows of A, and f denotes the number of columns of A (factors or components). If A is a loading matrix from factor or pca, p is the number of variables, and f is the number of factors or components.
Criteria suitable only for orthogonal rotations
varimax and vgpf apply the orthogonal varimax rotation (Kaiser 1958). varimax maximizes the variance of the squared loadings within factors (columns of A). It is equivalent to cf(1/p) and to oblimin(1). varimax, the most popular rotation, is implemented with a dedicated fast algorithm and ignores all optimize options. Specify vgpf to switch to the general GPF algorithm used for the other criteria.
quartimax uses the quartimax criterion (Harman 1976). quartimax maximizes the variance of
the squared loadings within the variables (rows of A). For orthogonal rotations, quartimax is equivalent to cf(0) and to oblimax.
equamax specifies the orthogonal equamax rotation. equamax maximizes a weighted sum of the
varimax and quartimax criteria, reflecting a concern for simple structure within variables (rows of A) as well as within factors (columns of A). equamax is equivalent to oblimin(p/2) and cf(#), where # = f /(2p).
Now the cf (Crawford-Ferguson) method is also available in GPArotation:
cfT orthogonal Crawford-Ferguson family
cfT(L, Tmat=diag(ncol(L)), kappa=0, normalize=FALSE, eps=1e-5, maxit=1000)
The argument kappa parameterizes the family for the Crawford-Ferguson method. If m is the number of factors and p is the number of indicators then kappa values having special names are 0=Quartimax, 1/p=Varimax, m/(2*p)=Equamax, (m-1)/(p+m-2)=Parsimax, 1=Factor parsimony.
X <- matrix(rnorm(500), ncol=10)
C <- cor(X)
eig <- eigen(C)
# PCA by hand scaled by sqrt
eig$vectors * t(matrix(rep(sqrt(eig$values), 10), ncol=10))
require(psych)
PCA0 <- principal(C, rotate='none', nfactors=10) #PCA by psych
PCA0
# as the original loadings in PCA0 are scaled by the square root of their eigenvalue
apply(PCA0$loadings^2, 2, sum) # SS loadings
## PCA with Equamax rotation
# now I think the Equamax rotation can be performed by cfT with kappa = m/(2*p)
# p number of variables (10)
# m (or f in STATA manual) number of components (10)
# gives m==p --> kappa=0.5
PCA.EQ <- cfT(PCA0$loadings, kappa=0.5)
PCA.EQ
I upgraded some of my PCA knowledge thanks to your question. Hope it helps, good luck!
Walter's answer helped a great deal!
I'll add some sidenotes for what it's worth:
R's psych::principal says under the option "rotate" that more rotations are available. Under the linked "fa", there is in fact an "equamax". Sadly, the results are neither replicable with STATA nor with SPSS, at least not with the standard syntax I tried:
# R:
PCA.5f=principal(data, nfactors=5, rotate="equamax", use="complete.obs")
Walter's solution replicates SPSS's equamax rotation (Kaiser-normalized by default) to the first 3 decimal places (i.e. loadings and rotation matrix fairly equivalent), using the following syntax with m = no. of factors and p = no. of indicators:
# R:
PCA.5f=principal(data, nfactors=5, rotate="none", use="complete.obs")
PCA.5f.eq = cfT(PCA.5f$loadings, kappa=m/(2*p), normalize=TRUE) # replace kappa factor formula with your actual numbers!
# SPSS:
FACTOR
/VARIABLES listofvariables
/MISSING LISTWISE
/ANALYSIS listofvariables
/PRINT ROTATION
/CRITERIA FACTORS(5) ITERATE(1000)
/EXTRACTION PC
/CRITERIA ITERATE(1000)
/ROTATION EQUAMAX
/METHOD=CORRELATION.
STATA's equamax - Kaiser-normalized and unnormalized - is replicable at least to the first 4 decimal places with kappa = .5 irrespective of your actual number of factors and indicators, which seems to contradict their manual (cf. Walter's citation).
# R:
PCA.5f=principal(data, nfactors=5, rotate="none", use="complete.obs")
PCA.5f.eq = cfT(PCA.5f$loadings, kappa=.5, normalize=TRUE)
# STATA:
factor listofvars, pcf factors(5)
rotate, equamax normalize # drop the "normalize" to replicate R's "normalize=FALSE"
mat list e(r_L)

summing 2 distance matrices for getting a third 'overall' distance matrix (ecological context)

I am an ecologist, using mainly the vegan R package.
I have 2 matrices (samples x abundances) (see data below):
matrix 1: nrow = 6 replicates * 24 sites, ncol = 15 species abundances (fish)
matrix 2: nrow = 3 replicates * 24 sites, ncol = 10 species abundances (invertebrates)
The sites are the same in both matrices. I want to get the overall Bray-Curtis dissimilarity (considering both matrices) among pairs of sites. I see 2 options:
option 1: averaging fish and macro-invertebrate abundances over replicates (at the site scale), cbind-ing the two mean abundance matrices (nrow = 24 sites, ncol = 15+10 mean abundances) and calculating Bray-Curtis.
option 2: for each assemblage, computing the Bray-Curtis dissimilarity among pairs of sites, computing distances among site centroids, then summing up the 2 distance matrices.
In case I am not clear, I did these 2 operations in the R code below.
Please, could you tell me if option 2 is correct and more appropriate than option 1?
Thank you in advance.
Pierre
Below are the R code examples.
generating data
library(plyr);library(vegan)
#assemblage 1: 15 fish species, 6 replicates per site
a1.env=data.frame(
  Habitat=paste("H",gl(2,12*6),sep=""),
  Site=paste("S",gl(24,6),sep=""),
  Replicate=rep(paste("R",1:6,sep=""),24))
summary(a1.env)
a1.bio=as.data.frame(replicate(15,rpois(144,sample(1:10,1))))
names(a1.bio)=paste("F",1:15,sep="")
a1.bio[1:72,]=2*a1.bio[1:72,]
#assemblage 2: 10 taxa of macro-invertebrates, 3 replicates per site
a2.env=a1.env[a1.env$Replicate%in%c("R1","R2","R3"),]
summary(a2.env)
a2.bio=as.data.frame(replicate(10,rpois(72,sample(10:100,1))))
names(a2.bio)=paste("I",1:10,sep="")
a2.bio[1:36,]=0.5*a2.bio[1:36,]
#environmental data at the site scale
env=unique(a1.env[,c("Habitat","Site")])
env=env[order(env$Site),]
OPTION 1, averaging abundances and cbind
a1.bio.mean=ddply(cbind(a1.bio,a1.env),.(Habitat,Site),numcolwise(mean))
a1.bio.mean=a1.bio.mean[order(a1.bio.mean$Site),]
a2.bio.mean=ddply(cbind(a2.bio,a2.env),.(Habitat,Site),numcolwise(mean))
a2.bio.mean=a2.bio.mean[order(a2.bio.mean$Site),]
bio.mean=cbind(a1.bio.mean[,-c(1:2)],a2.bio.mean[,-c(1:2)])
dist.mean=vegdist(sqrt(bio.mean),"bray")
OPTION 2, computing for each assemblage distance among centroids and summing the 2 distances matrix
a1.dist=vegdist(sqrt(a1.bio),"bray")
a1.coord.centroid=betadisper(a1.dist,a1.env$Site)$centroids
a1.dist.centroid=vegdist(a1.coord.centroid,"eucl")
a2.dist=vegdist(sqrt(a2.bio),"bray")
a2.coord.centroid=betadisper(a2.dist,a2.env$Site)$centroids
a2.dist.centroid=vegdist(a2.coord.centroid,"eucl")
summing up the two distance matrices using Gavin Simpson's fuse()
dist.centroid=fuse(a1.dist.centroid,a2.dist.centroid,weights=c(15/25,10/25))
summing up the two Euclidean distance matrices (thanks to Jari Oksanen's correction)
dist.centroid=sqrt(a1.dist.centroid^2 + a2.dist.centroid^2)
and the 'coord.centroid' below for further distance-based analysis (is it correct?)
coord.centroid=cmdscale(dist.centroid,k=23,add=TRUE)
COMPARING OPTION 1 AND 2
pco.mean=cmdscale(vegdist(sqrt(bio.mean),"bray"))
pco.centroid=cmdscale(dist.centroid)
comparison=procrustes(pco.centroid,pco.mean)
protest(pco.centroid,pco.mean)
An easier solution is just to flexibly combine the two dissimilarity matrices, by weighting each matrix. The weights need to sum to 1. For two dissimilarity matrices the fused dissimilarity matrix is
d.fused = (w * d.x) + ((1 - w) * d.y)
where w is a numeric scalar (length 1 vector) weight. If you have no reason to weight one of the sets of dissimilarities more than the other, just use w = 0.5.
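As a direct illustration of that formula, here is a minimal sketch with two toy dissimilarity matrices (the data are random placeholders):
# weighted combination of two dist objects with w = 0.5
set.seed(123)
d.x <- dist(matrix(runif(20), ncol = 2))
d.y <- dist(matrix(runif(20), ncol = 2))
w <- 0.5
d.fused <- (w * d.x) + ((1 - w) * d.y)
d.fused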
I have a function to do this for you in my analogue package; fuse(). The example from ?fuse is
train1 <- data.frame(matrix(abs(runif(100)), ncol = 10))
train2 <- data.frame(matrix(sample(c(0,1), 100, replace = TRUE),
                            ncol = 10))
rownames(train1) <- rownames(train2) <- LETTERS[1:10]
colnames(train1) <- colnames(train2) <- as.character(1:10)
d1 <- vegdist(train1, method = "bray")
d2 <- vegdist(train2, method = "jaccard")
dd <- fuse(d1, d2, weights = c(0.6, 0.4))
dd
str(dd)
This idea is used in supervised Kohonen networks (supervised SOMs) to bring multiple layers of data into a single analysis.
analogue works closely with vegan so there won't be any issues running the two packages side by side.
The correctness of averaging distances depends on what you are doing with those distances. In some applications you may expect that they really are distances, that is, they satisfy some metric properties and have a defined relation to the original data. Combined dissimilarities may not satisfy these requirements.
This issue is related to the controversy of partial Mantel-type analysis of dissimilarities vs. analysis of rectangular data, which is really hot (and I mean red hot) in studies of beta diversity. We in vegan provide tools for both, but I think that in most cases analysis of rectangular data is more robust and more powerful. With rectangular data I mean the normal sampling units times species matrix. The preferred dissimilarity-based methods in vegan map dissimilarities onto rectangular form. These methods in vegan include db-RDA (capscale), permutational MANOVA (adonis) and analysis of within-group dispersion (betadisper). Methods working with dissimilarities as such include mantel, anosim, mrpp, meandis.
The mean of dissimilarities or distances usually has no clear correspondence to the original rectangular data. That is: mean of the dissimilarities does not correspond to the mean of the data. I think that in general it is better to average or handle data and then get dissimilarities from transformed data.
If you want to combine dissimilarities, the analogue::fuse() style approach is most practical. However, you should understand that fuse() also scales the dissimilarity matrices to equal maxima. If you have dissimilarity measures on the scale 0..1, this is usually a minor issue, unless one of the data sets is more homogeneous and has a lower maximum dissimilarity than the others. In fuse() they are all equalized, so it is not a simple averaging but averaging after range equalization. Moreover, you must remember that averaging dissimilarities usually destroys the geometry, and this will matter if you use analysis methods for rectangularized data (adonis, betadisper, capscale in vegan).
Finally about geometry of combining dissimilarities. Dissimilarity indices in scale 0..1 are fractions of type A/B. Two fractions can be added (and then divided to get the average) directly only if the denominators are equal. If you ignore this and directly average the fractions, then the result will not be equal to the same fraction from averaged data. This is what I mean with destroying geometry. Some open-scaled indices are not fractions and may be additive. Manhattan distances are additive. Euclidean distances are square roots of squared differences, and their squares are additive but not the distances directly.
I demonstrate these things by showing the effect of adding together two dissimilarities (and averaging would mean dividing the result by two, or by suitable weights). I take the Barro Colorado Island data of vegan and divide it into two subsets of slightly unequal sizes. A geometry-preserving addition of the distances of the subsets will give the same result as the analysis of the complete data:
library(vegan) ## data and vegdist
library(analogue) ## fuse
data(BCI)
dim(BCI) ## [1] 50 225
x1 <- BCI[, 1:100]
x2 <- BCI[, 101:225]
## Bray-Curtis and fuse: not additive
plot(vegdist(BCI), fuse(vegdist(x1), vegdist(x2), weights = c(100/225, 125/225)))
## summing distances is straightforward (they are vectors), but preserving
## their attributes and keeping the dissimilarities needs fuse or some trick
## like below where we make dist structure dtmp to be replaced with the result
dtmp <- dist(BCI) ## dist skeleton with attributes
dtmp[] <- dist(x1, "manhattan") + dist(x2, "manhattan")
## manhattans are additive and can be averaged
plot(dist(BCI, "manhattan"), dtmp)
## Fuse rescales dissimilarities and they are no more additive
dfuse <- fuse(dist(x1, "man"), dist(x2, "man"), weights=c(100/225, 125/225))
plot(dist(BCI, "manhattan"), dfuse)
## Euclidean distances are not additive
dtmp[] <- dist(x1) + dist(x2)
plot(dist(BCI), dtmp)
## ... but squared Euclidean distances are additive
dtmp[] <- sqrt(dist(x1)^2 + dist(x2)^2)
plot(dist(BCI), dtmp)
## dfuse would rescale squared Euclidean distances like Manhattan (not shown)
I only considered addition above, but if you cannot add, you cannot average. It is a matter of taste whether this is important. Brave people will average things that cannot be averaged, but some people are more timid and want to follow the rules. I rather belong to the second group.
I like the simplicity of this answer, but it only applies to adding 2 distance matrices:
d.fused = (w * d.x) + ((1 - w) * d.y)
so I wrote my own snippet to combine an array of multiple distance matrices (not just 2), and using standard R packages:
# generate array of distance matrices
x <- matrix(rnorm(100), nrow = 5)
y <- matrix(rnorm(100), nrow = 5)
z <- matrix(rnorm(100), nrow = 5)
dst_array <- list(dist(x),dist(y),dist(z))
# create new distance matrix with first element of array
dst <- dst_array[[1]]
# loop over remaining array elements, add them to distance matrix
for (jj in 2:length(dst_array)){
  dst <- dst + dst_array[[jj]]
}
You could also use a vector of the same length as dst_array in order to define scaling factors:
dst <- dst + my_scale[[jj]] * dst_array[[jj]]
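An equivalent, more compact sketch using Reduce and Map (my_scale here is an assumed numeric vector of weights, one per matrix in dst_array; the values are placeholders):
my_scale <- c(0.5, 0.3, 0.2)                       # example weights
dst <- Reduce(`+`, Map(`*`, my_scale, dst_array))  # weighted sum of all distance matrices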
