How to remove outliers from a distance matrix or hierarchical clustering in R?

I have two questions.
First, I don't know how to find and remove outliers in a distance matrix (a symmetric matrix).
Second, I used hierarchical clustering with average linkage.
My data is engmale161 (already a symmetric matrix computed with DTW):
engmale161 <- na.omit(engmale161)
engmale161 <- scale(engmale161)
d <- dist(engmale161, method = "euclidean")
hc1_engmale161 <- hclust(d, method="average")
I found the optimal number of clusters, k = 4, using the silhouette, WSS and gap statistics:
> sub_grp <- cutree(hc1_engmale161, h = 60, k = 4)
> table(sub_grp)
sub_grp
  1   2   3   4
741  16   7   1
> subset(sub_grp, sub_grp == 4)
4165634865
         4
> fviz_cluster(list(data = engmale161, cluster = sub_grp), geom = "point")
So I think the upper-right point (4165634865) is an outlier, since its cluster contains only that single point.
How can I remove this outlier in the hierarchical clustering workflow?

Just some ideas.
In a nutshell:
- don't run na.omit on engmale161;
- find the outlier(s) using quantiles and the box-and-whiskers rule;
- set the outliers to NA in the distance matrix;
- proceed with your processing.
Long version:
dist() behaves nicely with NAs (from the R documentation: "Missing values are allowed, and are excluded from all computations involving the rows within which they occur. Further, when Inf values are involved, all pairs of values are excluded when their contribution to the distance gave NaN or NA.").
To find an outlier I would use concepts from exploratory statistics. Use quantile() with the default probs and na.rm = TRUE (because your matrix still contains NAs); you'll get the quartile values (the dataset split in four: 0-25%, 25-50%, and so on). The 25-75% range is the "box".
How to find the "whiskers" is a debated topic. The standard approach is to compute the interquartile range (IQR), i.e. the third quartile minus the first quartile; then first quartile - 1.5*IQR is the "lower" whisker and third quartile + 1.5*IQR is the "upper" whisker. Any value outside the whiskers is to be considered an outlier. Mark them as NA and proceed.
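A minimal sketch of this recipe, assuming engmale161 is the numeric DTW matrix from the question (all other variable names are mine):
v     <- as.vector(engmale161)
qs    <- quantile(v, probs = c(0.25, 0.75), na.rm = TRUE)
iqr   <- qs[[2]] - qs[[1]]                      # interquartile range
lower <- qs[[1]] - 1.5 * iqr                    # lower whisker
upper <- qs[[2]] + 1.5 * iqr                    # upper whisker
engmale161[engmale161 < lower | engmale161 > upper] <- NA
d  <- dist(engmale161, method = "euclidean")    # NAs are excluded pairwise by dist()
hc <- hclust(d, method = "average")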
Best of luck, and my compliments for being someone who actually looks at the data!

Related

How does distance weighting work in KNN?

I'm writing a KNN classifier in R. I want to add a weighting scheme, e.g. inverse distances 1/d. As it is, for the Iris dataset I get almost exactly 66% accuracy (no matter the metric used), since class no. 3 ("virginica") almost never shows up in the predictions, and I want to improve this with weighting. My question is: what exactly do I weight, and how? I've read that I should weight the classes of the K nearest neighbours by those distances.
I've tried creating vectors of the classes of and distances to the K nearest neighbours and then taking a weighted mean of them:
inverted <- function(vals, distances)
{
  inv_distances <- 1 / distances
  # eliminate division-by-zero errors
  inv_distances <- ifelse((inv_distances < 0.01), 0.01, inv_distances)
  weighted.mean(vals, inv_distances)
}
My results are weird: for correct vectors vals (classes) and distances I sometimes get NaN (Not a Number) or NA values. Also, my weights don't sum to 1, and... they probably should? I'm not sure. I just need someone to clear up this weighting scheme for me.
EDIT:
I've debugged the above code: it applied the weights too late (and therefore did not eliminate zero distances, which caused the NaNs). I've also changed it to harmonic-series weights that don't use the distances at all (so the first neighbour has weight 1, the second 1/2, the third 1/3, etc.). I still don't know exactly how the weighting is supposed to work and what other weighting schemes there may be.
inverted <- function(vals)
{
  weights <- 1 / seq(length(vals))
  res <- weighted.mean(vals, weights)
  res
}
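For what it's worth, one common reading of "weight the classes of the K nearest neighbours" is a weighted vote over class labels rather than a weighted mean of class codes. A minimal sketch (the function name and the eps cutoff are my own, not from the question):
weighted_vote <- function(classes, distances, eps = 1e-6) {
  w <- 1 / pmax(distances, eps)     # cap near-zero distances so 1/d stays finite
  w <- w / sum(w)                   # normalise so the weights sum to 1
  votes <- tapply(w, classes, sum)  # total weight accumulated by each class
  names(which.max(votes))           # predicted class label
}
# e.g. weighted_vote(c("setosa", "setosa", "virginica"), c(0.2, 0.5, 0.1))
# returns "virginica": its single close neighbour outweighs two more distant ones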

How does SMOTE create new data from categorical data?

I have used SMOTE in R to create new data, and this worked fine. When I did further research on how exactly SMOTE works, I couldn't find an answer to how SMOTE handles categorical data.
In the paper, an example is shown (page 10) with just numeric values, but I still do not know how SMOTE creates new data from categorical example data.
This is the link to the paper:
https://arxiv.org/pdf/1106.1813.pdf
That indeed is an important thing to be aware of. In terms of the paper that you are referring to, Sections 6.1 and 6.2 describe possible procedures for the cases of nominal-continuous and just nominal variables. However, DMwR does not use something like that.
If you look at the source code of SMOTE, you can see that the main work is done by DMwR:::smote.exs. I'll now briefly explain the procedure.
The summary is that the order of factor levels matters and that currently there seems to be a bug regarding factor variables which makes things work oppositely. That is, if we want to find an observation close to one with a factor level "A", then anything other than "A" is treated as "close" and those with level "A" are treated as "distant". Hence, the more factor variables there are, the fewer levels they have, and the fewer continuous variables there are, the more drastic the effect of this bug should be.
So, unless I'm wrong, the function should not be used with factors.
As an example, let's consider the case of perc.over = 600 with one continuous and one factor variable. We then arrive at smote.exs with the sub-data frame corresponding to the undersampled class (say, 50 rows) and proceed as follows.
1. Matrix T contains all but the class variables. Columns corresponding to the continuous variables remain unchanged, while factors or characters are coerced into integers. This means that the order of factor levels is essential.
2. Next we generate 50 * 6 = 300 new observations. We do so by creating 6 new observations (n = 1, ..., 6) for each of the 50 present ones (i = 1, ..., 50).
3. We scale the data by xd <- scale(T, T[i, ], ranges) so that xd shows deviations from the i-th observation. E.g., for i = 1 we may have
#             [,1] [,2]
# [1,]  0.00000000 0.00
# [2,] -0.13333333 0.25
# [3,] -0.26666667 0.25
meaning that the continuous variable for i = 2, 3 is smaller than for i = 1, but that the factor levels of i = 2, 3 are "higher".
4. Then, by running for (a in nomatr) xd[, a] <- xd[, a] == 0, we discard most of the information in the second column about factor-level deviations: we set the deviation to 1 for those cases that have the same factor level as the i-th observation, and to 0 otherwise. (I believe it should be the opposite, meaning that it's a bug; I'm going to report it.)
5. Then we set dd <- drop(xd^2 %*% rep(1, ncol(xd))), which can be seen as a vector of squared distances of each observation from the i-th one, and kNNs <- order(dd)[2:(k + 1)] gives the indices of the k nearest neighbours. It purposefully is 2:(k + 1), as the first element should be i itself (its distance should be zero). However, due to point 4, the first element is not always i in this case, which confirms the bug.
6. Now we create the n-th new observation, similar to the i-th one. First we pick one of the nearest neighbours, neig <- sample(1:k, 1). Then difs <- T[kNNs[neig], ] - T[i, ] is the component-wise difference between this neighbour and the i-th observation, e.g.,
difs
# [1] -0.1 -3.0
meaning that the neighbour has lower values in terms of both variables.
7. The new case is constructed by running T[i, ] + runif(1) * difs, which is indeed a convex combination between the i-th observation and the neighbour. This line is for the continuous variable(s) only. For the factors we have c(T[kNNs[neig], a], T[i, a])[1 + round(runif(1), 0)], meaning that the new observation gets the same factor level as the i-th observation with 50% probability, and the same as the chosen neighbour with 50% probability. So this is a kind of discrete interpolation.
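To see the effect of point 4 concretely, here is a hedged toy reconstruction of the deviation and distance steps (points 3-5). This is not the DMwR source; the toy values and the column names cont/fac are mine. Row 1 has factor level "A", rows 2-3 have "B":
T <- cbind(cont = c(1.0, 0.8, 0.6),
           fac  = as.integer(factor(c("A", "B", "B"))))
ranges <- apply(T, 2, function(v) max(v) - min(v))
i  <- 1
xd <- scale(T, center = T[i, ], scale = ranges)  # deviations from row i
nomatr <- 2                                      # column index of the factor
xd[, nomatr] <- xd[, nomatr] == 0                # 1 = same level as row i, 0 = different
dd <- drop(xd^2 %*% rep(1, ncol(xd)))            # squared "distances" from row i
order(dd)                                        # row i itself is not ranked first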

Weighted correlation in R

I am trying to output a correlation matrix for various locations. The row names 'PC1', 'PC2', etc. represent principal components. Since the percentage of variance explained (and thus the weight) of the principal components decreases from PC1 to PC4, I need to run a Pearson correlation that takes the weights of the PCs into account.
In other words, row 1 is more important in determining the correlation among locations than row 2, and row 2 is more important than row 3, and so on...
A simple weight vector for the 4 rows can be as follows:
w = [1.00, 0.75, 0.50, 0.25]
I did go through this, but I am not fully clear on the solution, and unlike this question, I need to find the correlation within the columns of a SINGLE matrix while weighting its rows.
OK, this is very easy to do in R using cov.wt() (available in the stats package):
weighted_corr <- cov.wt(DF, wt = w, cor = TRUE)
corr_matrix <- weighted_corr$cor
That's it!
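For completeness, a self-contained toy example; the data frame and location names are made up, and, as far as I can tell, cov.wt() normalises the weights internally, so w does not need to sum to 1:
set.seed(1)
DF <- data.frame(loc_A = rnorm(4), loc_B = rnorm(4), loc_C = rnorm(4))
rownames(DF) <- paste0("PC", 1:4)      # rows = principal components
w <- c(1.00, 0.75, 0.50, 0.25)         # row weights, PC1 most important
weighted_corr <- cov.wt(DF, wt = w, cor = TRUE)
corr_matrix <- weighted_corr$cor       # weighted correlation between locations
corr_matrix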

Summing 2 distance matrices to get a third 'overall' distance matrix (ecological context)

I am an ecologist, mainly using the vegan R package.
I have 2 matrices (samples x abundances) (see data below):
matrix 1: nrow = 6 replicates x 24 sites, ncol = 15 species abundances (fish)
matrix 2: nrow = 3 replicates x 24 sites, ncol = 10 species abundances (invertebrates)
The sites are the same in both matrices. I want to get the overall Bray-Curtis dissimilarity (considering both matrices) among pairs of sites. I see 2 options:
Option 1: average fish and macro-invertebrate abundances over replicates (at the site scale), cbind the two mean-abundance matrices (nrow = 24 sites, ncol = 15 + 10 mean abundances) and calculate Bray-Curtis.
Option 2: for each assemblage, compute the Bray-Curtis dissimilarity among pairs of sites and the distances among site centroids, then sum up the 2 distance matrices.
In case I am not clear, I did these 2 operations in the R code below.
Please, could you tell me whether option 2 is correct and more appropriate than option 1?
Thank you in advance.
Pierre
Below are the R code examples.
generating data
library(plyr);library(vegan)
#assemblage 1: 15 fish species, 6 replicates per site
a1.env=data.frame(
  Habitat=paste("H",gl(2,12*6),sep=""),
  Site=paste("S",gl(24,6),sep=""),
  Replicate=rep(paste("R",1:6,sep=""),24))
summary(a1.env)
a1.bio=as.data.frame(replicate(15,rpois(144,sample(1:10,1))))
names(a1.bio)=paste("F",1:15,sep="")
a1.bio[1:72,]=2*a1.bio[1:72,]
#assemblage 2: 10 taxa of macro-invertebrates, 3 replicates per site
a2.env=a1.env[a1.env$Replicate%in%c("R1","R2","R3"),]
summary(a2.env)
a2.bio=as.data.frame(replicate(10,rpois(72,sample(10:100,1))))
names(a2.bio)=paste("I",1:10,sep="")
a2.bio[1:36,]=0.5*a2.bio[1:36,]
#environmental data at the site scale
env=unique(a1.env[,c("Habitat","Site")])
env=env[order(env$Site),]
OPTION 1, averaging abundances and cbind
a1.bio.mean=ddply(cbind(a1.bio,a1.env),.(Habitat,Site),numcolwise(mean))
a1.bio.mean=a1.bio.mean[order(a1.bio.mean$Site),]
a2.bio.mean=ddply(cbind(a2.bio,a2.env),.(Habitat,Site),numcolwise(mean))
a2.bio.mean=a2.bio.mean[order(a2.bio.mean$Site),]
bio.mean=cbind(a1.bio.mean[,-c(1:2)],a2.bio.mean[,-c(1:2)])
dist.mean=vegdist(sqrt(bio.mean),"bray")
OPTION 2, computing for each assemblage distance among centroids and summing the 2 distances matrix
a1.dist=vegdist(sqrt(a1.bio),"bray")
a1.coord.centroid=betadisper(a1.dist,a1.env$Site)$centroids
a1.dist.centroid=vegdist(a1.coord.centroid,"eucl")
a2.dist=vegdist(sqrt(a2.bio),"bray")
a2.coord.centroid=betadisper(a2.dist,a2.env$Site)$centroids
a2.dist.centroid=vegdist(a2.coord.centroid,"eucl")
summing up the two distance matrices using Gavin Simpson's fuse()
dist.centroid=fuse(a1.dist.centroid,a2.dist.centroid,weights=c(15/25,10/25))
summing up the two Euclidean distance matrices (thanks to Jari Oksanen's correction)
dist.centroid=sqrt(a1.dist.centroid^2 + a2.dist.centroid^2)
and the 'coord.centroid' below for further distance-based analysis (is it correct?)
coord.centroid=cmdscale(dist.centroid,k=23,add=TRUE)
COMPARING OPTION 1 AND 2
pco.mean=cmdscale(vegdist(sqrt(bio.mean),"bray"))
pco.centroid=cmdscale(dist.centroid)
comparison=procrustes(pco.centroid,pco.mean)
protest(pco.centroid,pco.mean)
An easier solution is just to flexibly combine the two dissimilarity matrices, by weighting each matrix. The weights need to sum to 1. For two dissimilarity matrices the fused dissimilarity matrix is
d.fused = (w * d.x) + ((1 - w) * d.y)
where w is a numeric scalar (length 1 vector) weight. If you have no reason to weight one of the sets of dissimilarities more than the other, just use w = 0.5.
I have a function to do this for you in my analogue package; fuse(). The example from ?fuse is
train1 <- data.frame(matrix(abs(runif(100)), ncol = 10))
train2 <- data.frame(matrix(sample(c(0,1), 100, replace = TRUE),
                            ncol = 10))
rownames(train1) <- rownames(train2) <- LETTERS[1:10]
colnames(train1) <- colnames(train2) <- as.character(1:10)
d1 <- vegdist(train1, method = "bray")
d2 <- vegdist(train2, method = "jaccard")
dd <- fuse(d1, d2, weights = c(0.6, 0.4))
dd
str(dd)
This idea is used in supervised Kohonen networks (supervised SOMs) to bring multiple layers of data into a single analysis.
analogue works closely with vegan so there won't be any issues running the two packages side by side.
The correctness of averaging distances depends on what you are doing with those distances. In some applications you may expect that they really are distances, that is, that they satisfy some metric properties and have a defined relation to the original data. Combined dissimilarities may not satisfy these requirements.
This issue is related to the controversy of partial-Mantel-type analysis of dissimilarities vs. analysis of rectangular data, which is really hot (and I mean red hot) in studies of beta diversity. We in vegan provide tools for both, but I think that in most cases analysis of rectangular data is more robust and more powerful. By rectangular data I mean the normal sampling-units-by-species matrix. The preferred dissimilarity-based methods in vegan map dissimilarities onto rectangular form. These methods in vegan include db-RDA (capscale), permutational MANOVA (adonis) and analysis of within-group dispersion (betadisper). Methods working with dissimilarities as such include mantel, anosim, mrpp and meandist.
The mean of dissimilarities or distances usually has no clear correspondence to the original rectangular data. That is: mean of the dissimilarities does not correspond to the mean of the data. I think that in general it is better to average or handle data and then get dissimilarities from transformed data.
If you want to combine dissimilarities, the analogue::fuse() style approach is the most practical. However, you should understand that fuse() also scales the dissimilarity matrices to equal maxima. If you have dissimilarity measures on a 0..1 scale, this is usually a minor issue, unless one of the data sets is more homogeneous and has a lower maximum dissimilarity than the others. In fuse() they are all equalized, so it is not a simple averaging but an averaging after range equalization. Moreover, you must remember that averaging dissimilarities usually destroys the geometry, and this will matter if you use analysis methods for rectangularized data (adonis, betadisper, capscale in vegan).
Finally about geometry of combining dissimilarities. Dissimilarity indices in scale 0..1 are fractions of type A/B. Two fractions can be added (and then divided to get the average) directly only if the denominators are equal. If you ignore this and directly average the fractions, then the result will not be equal to the same fraction from averaged data. This is what I mean with destroying geometry. Some open-scaled indices are not fractions and may be additive. Manhattan distances are additive. Euclidean distances are square roots of squared differences, and their squares are additive but not the distances directly.
I demonstrate these things by showing the effect of adding together two dissimilarities (and averaging would mean dividing the result by two, or by suitable weights). I take the Barro Colorado Island data of vegan and divide it into two subsets of slightly unequal sizes. A geometry preserving addition of distances of subsets of the data will give the same result as the analysis of the complete data:
library(vegan) ## data and vegdist
library(analogue) ## fuse
data(BCI)
dim(BCI) ## [1] 50 225
x1 <- BCI[, 1:100]
x2 <- BCI[, 101:225]
## Bray-Curtis and fuse: not additive
plot(vegdist(BCI), fuse(vegdist(x1), vegdist(x2), weights = c(100/225, 125/225)))
## summing distances is straightforward (they are vectors), but preserving
## their attributes and keeping the dissimilarities needs fuse or some trick
## like below where we make dist structure dtmp to be replaced with the result
dtmp <- dist(BCI) ## dist skeleton with attributes
dtmp[] <- dist(x1, "manhattan") + dist(x2, "manhattan")
## manhattans are additive and can be averaged
plot(dist(BCI, "manhattan"), dtmp)
## Fuse rescales dissimilarities and they are no longer additive
dfuse <- fuse(dist(x1, "man"), dist(x2, "man"), weights=c(100/225, 125/225))
plot(dist(BCI, "manhattan"), dfuse)
## Euclidean distances are not additive
dtmp[] <- dist(x1) + dist(x2)
plot(dist(BCI), dtmp)
## ... but squared Euclidean distances are additive
dtmp[] <- sqrt(dist(x1)^2 + dist(x2)^2)
plot(dist(BCI), dtmp)
## dfuse would rescale squared Euclidean distances like Manhattan (not shown)
I only considered addition above, but if you cannot add, you cannot average. It is a matter of taste whether this is important. Brave people will average things that cannot be averaged, but some people are more timid and want to follow the rules. I would rather join the second group.
I like the simplicity of this answer, but it only applies to adding 2 distance matrices:
d.fused = (w * d.x) + ((1 - w) * d.y)
so I wrote my own snippet to combine an array of multiple distance matrices (not just 2), using only standard R packages:
# generate array of distance matrices
x <- matrix(rnorm(100), nrow = 5)
y <- matrix(rnorm(100), nrow = 5)
z <- matrix(rnorm(100), nrow = 5)
dst_array <- list(dist(x),dist(y),dist(z))
# create new distance matrix with first element of array
dst <- dst_array[[1]]
# loop over remaining array elements, add them to distance matrix
for (jj in 2:length(dst_array)){
  dst <- dst + dst_array[[jj]]
}
You could also use a vector of the same length as dst_array to define scaling factors:
dst <- dst + my_scale[[jj]] * dst_array[[jj]]
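And a short, self-contained variant of that idea, with a hypothetical my_scale vector holding one weight per matrix (the values here are arbitrary):
my_scale <- c(0.5, 0.3, 0.2)                 # one scaling factor per distance matrix
dst <- my_scale[[1]] * dst_array[[1]]
for (jj in 2:length(dst_array)) {
  dst <- dst + my_scale[[jj]] * dst_array[[jj]]
}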

How to show the value of the AUC from geom_density/stat_density

I have produced some density plots using ggplot2 and stat_density. My colleague mentioned he wasn't convinced that the area under each curve would sum to 1. So, I set out to calculate the area under the curve, and I am wondering if there might be a better approach than what I did.
Here is an example of what I did:
data(iris)
library(ggplot2)
p <- ggplot(iris, aes(x = Petal.Length)) +
  stat_density(aes(colour = Species), geom = "line", position = "identity")
q <- ggplot_build(p)   # access the computed density estimates
q <- q$data[[1]]
# calculate the interval between density estimates for a given point;
# assume it is the same interval for all estimates
interval <- q$x[2] - q$x[1]
# calculate AUC by summing interval*height for the density estimate at each point
tapply(q$density * interval,
       q$group,
       sum)
The result:
1 2 3
0.9913514 1.0009785 0.9817040
It seems to work decently, but I wonder if there is a better way of doing this. In particular, my calculation of the interval (i.e. dx, I suppose) seems like it could be a problem, especially if the different density curves use different intervals.
Your way is already good.
Another way to do it is to use the trapezoid rule, e.g. trapz() from the pracma package:
library(pracma)  # provides trapz()
data <- cbind(q$x, q$y)
by(data, q$group, FUN = function(x) trapz(x[, 1], x[, 2]))
The results are nearly the same:
INDICES: 1
[1] 0.9903457
INDICES: 2
[1] 1.000978
INDICES: 3
[1] 0.9811152
This is because at the bandwidth needed to make the graph of the densities look reasonable (interval in your code), you are very close to what you would get if you could do the actual integral.
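If you are worried that the grid spacing differs between curves, you can compute dx per group instead of reusing a single interval. A small sketch using the same q data frame built above, taking a left Riemann sum per curve with each group's own x spacing:
sapply(split(q, q$group),
       function(g) sum(diff(g$x) * head(g$density, -1)))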
