K-means clustering with my own distance function in R

I have defined a distance function as follows:
jaccard.rules.dist <- function(x, y) {
  # Implements a feature distance. The "Aerolinea" (airline) column gets a
  # different treatment; the rest are booleans coded as 1/0. Airline column
  # distance = 0 if same airline, 1 otherwise. For the remaining attributes
  # the distance is 0 iff both are 1, and 1 otherwise.
  airline.column <- which(colnames(x) == "Aerolinea")
  xmod <- x
  ymod <- y
  xmod[airline.column] <- ifelse(x[airline.column] == y[airline.column], 1, 0)
  ymod[airline.column] <- 1 # if the airlines match, both entries are 1; otherwise they differ
  andval <- sum(xmod & ymod)
  orval <- sum(xmod | ymod)
  return(1 - andval / orval)
}
which modifies the Jaccard distance slightly, for data frames of the form
t <- data.frame(Aerolinea=c("A","B","C","A"),atr2=c(1,1,0,0),atr3=c(0,0,0,1))
Now I would like to perform some k-means clustering on my dataset, using the distance just defined. If I try to use the function kmeans, there is no way to specify my distance function. I tried to use hclust, which accepts a distance matrix, which I calculated as follows:
distmat <- matrix(nrow = nrow(t), ncol = nrow(t))
for (i in 1:nrow(t))
  for (j in i:nrow(t))
    distmat[j, i] <- jaccard.rules.dist(t[j, ], t[i, ])
distmat <- as.dist(distmat)
and then invoked hclust
hclust(distmat)
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
missing value where TRUE/FALSE needed
What am I doing wrong? Is there another way to do clustering that just accepts an arbitrary distance function as its input?
Thanks in advance.

I think distmat (from your code) has to be a distance structure (which is different from a matrix). Try this instead:
require(proxy)
d <- dist(t, jaccard.rules.dist)
clust <- hclust(d=d)
clust$centers
[,1] [,2]
[1,] 0.044128322 -0.039518142
[2,] -0.986798495 0.975132418
[3,] -0.006441892 0.001099211
[4,] 1.487829642 1.000431146
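hclust gives a hierarchy, so you would still need cutree() to get a flat set of k groups. If you want a k-means-style partition directly from an arbitrary dissimilarity, k-medoids is the usual substitute; a minimal sketch with the cluster package, reusing the d computed above:
# Sketch: k-medoids (PAM) accepts a precomputed dissimilarity directly.
library(cluster)
clusters <- pam(d, k = 2, diss = TRUE) # partition into 2 groups
clusters$clustering # cluster membership for each row of t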

Related

Calculate Euclidean distance between multiple pairs of points in dataframe in R

I'm trying to calculate the Euclidean distance between pairs of points in a dataframe in R, and there's an ID for each pair:
ID <- sample(1:10, 10, replace=FALSE)
P <- runif(10, min=1, max=3)
S <- runif(10, min=1, max=3)
testdf <- data.frame(ID, P, S)
I found several ways to calculate the Euclidean distance in R, but I'm either getting an error, getting only one value (so it's computing the distance over the entire vectors), or ending up with a matrix, when all I need is a fourth column with the distance between each pair (columns 'P' and 'S'). I'm a bit confused by matrices, so I'm not sure how to work with that result.
I tried making a function and applying it to the two columns, but I get an error:
testdf$V <- apply(testdf[ , c('P', 'S')], 1, function(P, S) sqrt(sum((P^2, S^2)))
# Error in FUN(newX[, i], ...) : argument "S" is missing, with no default
Then I tried using the dist() function in the stats package, but it only returns one value (same problem if I follow the method here: https://www.statology.org/euclidean-distance-in-r/):
P <- testdf$P
S <- testdf$S
testProbMatrix <- rbind(P, S)
stats::dist(testProbMatrix, method = "euclidean")
# returns only 1 distance
Returns a matrix
(Here's a nice explanation why: Calculate the distances between pairs of points in r)
stats::dist(cbind(P, S), method = "euclidean")
But I'm confused about how to pull the distances out of the matrix and attach them to the correct ID for each pair of points. I don't understand why I have to make a matrix instead of just applying the function to the dataframe - matrices have always confused me.
I think this is the same question as here (Finding euclidean distance between all pairs of points) but for R instead of Python.
Thanks for the help!
Try this out if you would just like to add another column to your dataframe:
testdf$distance <- sqrt(P^2 + S^2)
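Note that this gives each row's distance from the origin, which is what the apply attempt above was going for. If what you actually need are the distances between every pair of rows, with the IDs preserved, one way (a base-R sketch) is to convert the dist result to a matrix and index it by ID:
# Sketch: full pairwise distance matrix, indexed by the ID column.
dmat <- as.matrix(stats::dist(testdf[, c("P", "S")], method = "euclidean"))
rownames(dmat) <- colnames(dmat) <- testdf$ID
dmat[as.character(testdf$ID[1]), as.character(testdf$ID[2])] # distance between the first two IDs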

GRG Nonlinear R

I want to transform my Excel Solver model into a model in R. I need to find 3 sets of coordinates which minimize the distance to the 5 other given coordinates. I've made a program which calculates a distance matrix and outputs the minimal distance from each input to the given coordinates. I want to minimize this function by changing the input; that is, I want to find the coordinates such that the sum of minimal distances is minimized. I tried several methods, see the code below. (Yes, my distance matrix function might be somewhat clunky, but that is because I had to reduce the input to one variable in order to run some algorithms such as nloptr, which would give warnings otherwise.) I've also seen some other questions (such as GRG Non-Linear Least Squares (Optimization)) but they did not change/improve the solution.
# First half of p describes the x coordinates, second half the y coordinates (yes, that's clunky)
p<-c(2,4,6,5,3,2) # initial points
x_given <- c(2,2.5,4,4,5)
y_given <- c(9,5,7,1,2)
f <- function(Coordinates){
  # Predefining
  Term_1 <- NULL
  Term_2 <- NULL
  Distance <- NULL
  min_prob <- NULL
  l <- length(Coordinates)
  l2 <- length(x_given)
  half_length <- l/2
  s <- l2*half_length
  Distance_Matrix <- matrix(rep(1, s), nrow = half_length)
  # Creating the distance matrix
  for (k in 1:half_length){
    for (i in 1:l2){
      Term_1[i] <- (Coordinates[k] - x_given[i])^2
      Term_2[i] <- (Coordinates[k + half_length] - y_given[i])^2
      Distance[i] <- sqrt(Term_1[i] + Term_2[i])
      Distance_Matrix[k, i] <- Distance[i]
    }
  }
  d <- Distance_Matrix
  # Find the minimum in each row; that is what we want to obtain and minimize
  for (l in 1:nrow(d)){
    min_prob[l] <- min(d[l, ])
  }
  som <- sum(min_prob)
  return(som)
}
# Minimise
sol <- optim(p, f)
x <- sol$par[1:3]
y <- sol$par[4:6]
plot(x_given, y_given)
points(x, y, pch = 19)
The solution, however, is clearly not optimal. I've tried to use the nloptr function, but I'm not sure which algorithm to use. Which algorithm can I use, or can I use/program another function which solves this problem? Thanks in advance (and sorry for the long, detailed question).
Look at the output of optim. It reached the iteration limit and had not yet converged.
> optim(p, f)
$`par`
[1] 2.501441 5.002441 5.003209 5.001237 1.995857 2.000265
$value
[1] 0.009927249
$counts
function gradient
501 NA
$convergence
[1] 1
$message
NULL
Although the result is not that different, you will need to increase the number of iterations to get convergence. If that is still unacceptable, then try different starting values.
> optim(p, f, control = list(maxit = 1000))
$`par`
[1] 2.502806 4.999866 5.000000 5.003009 1.999112 2.000000
$value
[1] 0.005012449
$counts
function gradient
755 NA
$convergence
[1] 0
$message
NULL
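If you do want to try nloptr, a derivative-free local algorithm is a sensible choice, since the min() inside f makes the objective non-smooth; a minimal sketch, reusing p and f from the question:
# Sketch: derivative-free optimisation with nloptr. Gradient-based algorithms
# are unreliable here because the min() in f makes the objective non-smooth.
library(nloptr)
sol <- nloptr(x0 = p, eval_f = f,
              opts = list(algorithm = "NLOPT_LN_NELDERMEAD",
                          xtol_rel = 1e-8, maxeval = 5000))
sol$solution # first three entries are the x coordinates, last three the y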

Implementing KNN with different distance metrics using R

I am working on a dataset in order to compare the effect of different distance metrics. I am using the KNN algorithm.
The KNN algorithm in R uses the Euclidean distance by default, so I wrote my own. I would like to find the number of correct class label matches between the nearest neighbour and the target.
I prepared the data first, then called the data (wdbc_n) and chose K=1. I used Euclidean distance as a test.
library(philentropy)
knn <- function(xmat, k, method){
  n <- nrow(xmat)
  if (n <= k) stop("k can not be more than n-1")
  neigh <- matrix(0, nrow = n, ncol = k)
  for(i in 1:n) {
    ddist <- distance(xmat, method)
    neigh[i, ] <- order(ddist)[2:(k + 1)]
  }
  return(neigh)
}
wdbc_nn <- knn(wdbc_n, 1, method = "euclidean")
I am hoping to get a similar result to the paper "On the Surprising Behavior of Distance Metrics in High Dimensional Space" (https://bib.dbvis.de/uploadedFiles/155.pdf, page 431, Table 3).
My question is: am I right or wrong with the code? Any suggestions or references that will guide me will be highly appreciated.
EDIT
My data (breast-cancer-wisconsin) (wdbc) dimension is
569 32
After normalizing and removing the id and target columns, the dimension is
dim(wdbc_n)
569 30
The train and test split is given by
wdbc_train<-wdbc_n[1:469,]
wdbc_test<-wdbc_n[470:569,]
Am I right or wrong with the code?
Your code is wrong.
The call to the distance function took about 3 seconds every time on my rather recent PC, so I only did the first 30 rows for k=3, and noticed that every row of the neigh matrix was identical. Why is that? Take a look at this line:
ddist<- distance(xmat, method)
Each loop iteration feeds the whole xmat matrix into the distance function, then uses only the first row of the resulting matrix. This calculates the distances between the training set rows, and does that n times, discarding every row except the first, which is not what you want to do. The knn algorithm is supposed to calculate, for each row in the test set, the distance to each row in the training set.
Let's take a look at the documentation for the distance function:
distance(x, method = "euclidean", p = NULL, test.na = TRUE, unit = "log", est.prob = NULL)
x: a numeric data.frame or matrix (storing probability vectors) or a numeric data.frame or matrix storing counts (if est.prob is specified).
(...)
Value: in case nrow(x) = 2, a single distance value; in case nrow(x) > 2, a distance matrix storing distance values for all pairwise probability vector comparisons.
In your specific case (knn classification), you want to use the 2-row version.
One last thing: you used order, which returns the positions of the k smallest distances in the ddist vector, not the distances themselves. I think what you want is the distances, so you need to use sort instead of order.
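A tiny illustration of the difference:
d <- c(0.9, 0.2, 0.5)
order(d) # 2 3 1 (positions of the distances, smallest first)
sort(d)  # 0.2 0.5 0.9 (the distances themselves)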
Based on your code and the example in Lantz (2013) that your code seems to be based on, here is a complete working solution. I took the liberty of adding a few lines to make a standalone program.
Standalone working solution(s)
library(philentropy)
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
knn <- function(train, test, k, method){
  n.test <- nrow(test)
  n.train <- nrow(train)
  if (n.train + n.test <= k) stop("k can not be more than n-1")
  neigh <- matrix(0, nrow = n.test, ncol = k)
  ddist <- NULL
  for(i in 1:n.test) {
    for(j in 1:n.train) {
      # make a 2-row matrix combining the current test and train rows
      xmat <- rbind(test[i, ], train[j, ])
      # calculate the distance and append it to the ddist vector
      # (distance()'s third argument is p, the Minkowski power, so k is not passed here)
      ddist[j] <- distance(as.data.frame(xmat), method)
    }
    neigh[i, ] <- sort(ddist)[2:(k + 1)]
  }
  return(neigh)
}
wbcd <- read.csv("https://resources.oreilly.com/examples/9781784393908/raw/ac9fe41596dd42fc3877cfa8ed410dd346c43548/Machine%20Learning%20with%20R,%20Second%20Edition_Code/Chapter%2003/wisc_bc_data.csv")
rownames(wbcd) <- wbcd$id
wbcd$id <- NULL
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
wbcd_train<-wbcd_n[1:469,]
wbcd_test<-wbcd_n[470:549,]
wbcd_nn <-knn(wbcd_train, wbcd_test ,3, method="euclidean")
Do note that this solution might be slow because of the numerous (80 times 469) calls to the distance function. However, since we are only feeding 2 rows at a time into the distance function, it makes the execution time manageable.
Now does that work?
The first two test rows using the custom knn function:
[,1] [,2] [,3]
[1,] 0.3887346 0.4051762 0.4397497
[2,] 0.2518766 0.2758161 0.2790369
Let us compare with the equivalent function in the FNN package:
library(FNN)
alt.class <- get.knnx(wbcd_train, wbcd_test, k=3, algorithm = "brute")
alt.class$nn.dist
[,1] [,2] [,3]
[1,] 0.3815984 0.3887346 0.4051762
[2,] 0.2392102 0.2518766 0.2758161
Conclusion: not too shabby.
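Since the stated goal was counting correct class-label matches, you need the neighbour indices rather than the distances; a sketch using FNN's nn.index, assuming the diagnosis column of the csv is the target:
# Sketch: classification accuracy from nearest-neighbour indices (assumes the
# wbcd data frame and the train/test split from above, with wbcd$diagnosis as target).
library(FNN)
train_lab <- wbcd$diagnosis[1:469]
test_lab <- wbcd$diagnosis[470:549]
nn <- get.knnx(wbcd_train, wbcd_test, k = 1, algorithm = "brute")
pred <- train_lab[nn$nn.index[, 1]] # label of each test row's nearest training neighbour
mean(pred == test_lab) # fraction of correct matches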

How to input dissimilarity matrix in spatial analysis in spdep R

Aim: I want to create a dissimilarity matrix between pairs of coordinates. I want to use this matrix as an input to calculate local spatial clusters using Moran's I (LISA) and later in geographically weighted regression (GWR).
Problem: I know I can use dnearneigh{spdep} to calculate a distance matrix. However, I want to use the travel times between polygons I have already estimated. In practice, I think this would be like inputting a dissimilarity matrix that tells the distance/difference between polygons based on another characteristic. I've tried inputting my matrix to dnearneigh{spdep}, but I get the error Error: ncol(x) == 2 is not TRUE
dist_matrix <- dnearneigh(diss_matrix_invers, d1=0, d2=5, longlat = F, row.names=rn)
Any suggestions? There is a reproducible example below:
EDIT: Digging a bit further, I think I could use mat2listw{spdep}, but I'm still not sure it keeps the correspondence between the matrix and the polygons. If I add row.names = T it returns the error row.names wrong length :(
listw_dissi <- mat2listw(diss_matrix_invers)
lmoran <- localmoran(oregon.tract@data$white, listw_dissi,
                     zero.policy=T, alternative= "two.sided")
Reproducible example
library(UScensus2000tract)
library(spdep)
library(ggplot2)
library(dplyr)
library(reshape2)
library(magrittr)
library(data.table)
library(reshape)
library(rgeos)
library(geosphere)
# load data
data("oregon.tract")
# get centroids as a data.frame
centroids <- as.data.frame( gCentroid(oregon.tract, byid=TRUE) )
# Convert row names into first column
setDT(centroids, keep.rownames = TRUE)[]
# create Origin-destination pairs
od_pairs <- expand.grid.df(centroids, centroids) %>% setDT()
colnames(od_pairs) <- c("origi_id", "long_orig", "lat_orig", "dest_id", "long_dest", "lat_dest")
# calculate dissimilarity between each pair.
# For the sake of this example, let's use ellipsoid distances. In my real case I have travel-time estimates
od_pairs[ , dist := distGeo(matrix(c(long_orig, lat_orig), ncol = 2),
matrix(c(long_dest, lat_dest), ncol = 2))]
# This is the format in which my travel-time estimates are organized; there are some missing values for origin-destination pairs that are too far apart (more than 2 hours)
od_pairs <- od_pairs[, .(origi_id, dest_id, dist)]
od_pairs$dist[3] <- NA
> origi_id dest_id dist
> 1: oregon_0 oregon_0 0.00000
> 2: oregon_1 oregon_0 NA
> 3: oregon_2 oregon_0 39874.63673
> 4: oregon_3 oregon_0 31259.63100
> 5: oregon_4 oregon_0 33047.84249
# Convert to matrix
diss_matrix <- acast(od_pairs, origi_id~dest_id, value.var="dist") %>% as.matrix()
# get an inverse matrix of distances, make sure diagonal=0
diss_matrix_invers <- 1/diss_matrix
diag(diss_matrix_invers) <- 0
Calculate simple distance matrix
# get row names
rn <- sapply(slot(oregon.tract, "polygons"), function(x) slot(x, "ID"))
# get centroids coordinates
coords <- coordinates(oregon.tract)
# get distance matrix
diss_matrix <- dnearneigh(coords, d1=0, d2=5, longlat =T, row.names=rn)
class(diss_matrix)
> [1] "nb"
Now how to use my diss_matrix_invers here?
You are right about the use of mat2listw{spdep}. By default the function preserves the row names to keep the correspondence with the matrix. You can also specify the row.names like so:
listw_dissi <- mat2listw(diss_matrix_invers, row.names = row.names(diss_matrix_invers))
The list that is created will contain the appropriate names for the neighbours along with their distance as weights. You can check this by looking at the neighbours.
listw_dissi$neighbours[[1]][1:5]
And you should be able to use this directly to calculate Moran's I.
dnearneigh{spdep}
There is no way you can use diss_matrix within dnearneigh{spdep}, as this function takes in a list of coordinates.
However, if you need to define a set of neighbours given a distance threshold (d1, d2) using your own distance matrix (travel time), I think this function can do the trick:
dis.neigh <- function(x, d1 = 0, d2 = 50){
  # x must be a symmetrical distance matrix
  style = "M" # style unknown
  # create empty neighbour and weight lists
  neighbours <- list()
  weights <- list()
  # set attributes of the neighbours list
  attr(neighbours, "class") <- "nb"
  attr(neighbours, "distances") <- c(d1, d2)
  attr(neighbours, "region.id") <- colnames(x)
  # check each row for neighbours that satisfy the distance threshold
  i <- 1
  for(row in c(1:nrow(x))){
    # reset for each row so stale entries from the previous row don't persist
    neighbour <- c()
    weight <- c()
    j <- 1
    for(col in c(1:ncol(x))){
      if(x[row, col] > d1 && x[row, col] < d2){
        neighbour[j] <- col
        weight[j] <- 1/x[row, col] # inverse distance (dissimilarity)
        j <- 1 + j
      }
    }
    neighbours[i] <- list(neighbour)
    weights[i] <- list(weight)
    i <- 1 + i
  }
  # assemble the neighbour and weight lists into a listw object
  res <- list(style = style, neighbours = neighbours, weights = weights)
  class(res) <- c("listw", "nb")
  attr(res, "region.id") <- attr(neighbours, "region.id")
  attr(res, "call") <- match.call()
  return(res)
}
And use it like so:
nb_list <- dis.neigh(diss_matrix, d1=0, d2=10000)
lmoran <- localmoran(oregon.tract@data$white, nb_list, alternative= "two.sided")
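A quick sanity check (a sketch): the first neighbour entry of the hand-built object should match a direct scan of the corresponding matrix row.
nb_list$neighbours[[1]] # neighbour columns found for row 1
which(diss_matrix[1, ] > 0 & diss_matrix[1, ] < 10000) # should be the same set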

How to compute distances between centroids and data matrix (for kmeans algorithm)

I am a student of clustering and R. In order to get a better grip on both, I would like to compute the distance between the centroids and my xy-matrix for each iteration until it "converges". How can I solve steps 2 and 3 below using R?
library(fields)
x <- c(3,6,8,1,2,2,6,6,7,7,8,8)
y <- c(5,2,3,5,4,6,1,8,3,6,1,7)
df <- data.frame(x,y) # initial matrix
a <- c(3,6,8)
b <- c(5,2,3)
df1 <- data.frame(a,b) # initial centroids
Here is what I want to do:
1. Compute the initial distances: I0 <- t(rdist(df, df1)) # after zero iterations
2. Cluster objects based on minimum distance
3. Determine the centroids based on the cluster averages
4. Repeat with I1
I tried the kmeans function, but for some reason it produces the centroids that are supposed to come up at the end. That is, I defined the start:
start <- matrix(c(3,5,6,2,8,3), 3, byrow = TRUE)
cluster <- kmeans(df,centers = start, iter.max = 1) # one iteration
kmeans doesn't allow me to track the movement of the centroids. Therefore I would like to do it "manually" by applying steps 2 and 3 using R.
Your main question seems to be how to calculate distances between a data matrix and some set of points ("centers").
For this you can write a function that takes as input a data matrix and your set of points and returns distances for each row (point) in the data matrix to all the "centers".
Here is such a function:
myEuclid <- function(points1, points2) {
  distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
  for(i in 1:nrow(points2)) {
    distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
  }
  distanceMatrix
}
points1 is the data matrix with points as rows and dimensions as columns. points2 is the matrix of centers (points as rows again). The first line of code just defines the answer matrix (which will have as many rows as there are rows in the data matrix and as many columns as there are centers). So the point i,j in the result matrix will be the distance from the ith point to the jth center.
Then the for loop iterates over all centers. For each center it computes the Euclidean distance from each point to that center. The line sqrt(rowSums(t(t(points1)-points2[i,])^2)) is the Euclidean distance; inspect it more closely and look up the formula if you have any trouble with it (the transposes are mainly there to make sure the subtraction is done row-wise).
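As a quick check (a sketch reusing the toy df and df1 from the question), this reproduces what fields::rdist gives:
# Sketch: myEuclid on the question's toy data; should match rdist(df, df1).
myEuclid(as.matrix(df), as.matrix(df1)) # 12 x 3: each point's distance to the 3 centers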
Now you can also implement the k-means algorithm:
myKmeans <- function(x, centers, distFun, nItter=10) {
  clusterHistory <- vector(nItter, mode="list")
  centerHistory <- vector(nItter, mode="list")
  for(i in 1:nItter) {
    distsToCenters <- distFun(x, centers)
    clusters <- apply(distsToCenters, 1, which.min)
    centers <- apply(x, 2, tapply, clusters, mean)
    # Saving history
    clusterHistory[[i]] <- clusters
    centerHistory[[i]] <- centers
  }
  list(clusters=clusterHistory, centers=centerHistory)
}
As you can see, it's also a very simple function - it takes the data matrix, the centers, your distance function (the one defined above) and the desired number of iterations.
The clusters are defined by assigning the closest center to each point, and the centers are updated as the mean of the points assigned to them (which is the basic k-means algorithm).
Let's try it out. Define some random points (in 2d, so number of columns = 2)
mat <- matrix(rnorm(100), ncol=2)
Assign 5 random points from that matrix as initial centers:
centers <- mat[sample(nrow(mat), 5),]
Now run the algorithm:
theResult <- myKmeans(mat, centers, myEuclid, 10)
Here are the centers in the 10th iteration:
theResult$centers[[10]]
[,1] [,2]
1 -0.1343239 1.27925285
2 -0.8004432 -0.77838017
3 0.1956119 -0.19193849
4 0.3886721 -1.80298698
5 1.3640693 -0.04091114
Compare that with the built-in kmeans function:
theResult2 <- kmeans(mat, centers, 10, algorithm="Forgy")
theResult2$centers
[,1] [,2]
1 -0.1343239 1.27925285
2 -0.8004432 -0.77838017
3 0.1956119 -0.19193849
4 0.3886721 -1.80298698
5 1.3640693 -0.04091114
Works fine. Our function however tracks the iterations. We can plot the progress over the first 4 iterations like this:
par(mfrow=c(2,2))
for(i in 1:4) {
  plot(mat, col=theResult$clusters[[i]], main=paste("iteration:", i), xlab="x", ylab="y")
  points(theResult$centers[[i]], cex=3, pch=19, col=1:nrow(theResult$centers[[i]]))
}
Nice.
However, this simple design allows for much more. For example, if we want to use another kind of distance (not Euclidean), we can just use any function that takes data and centers as inputs. Here is one for correlation distances:
myCor <- function(points1, points2) {
return(1 - ((cor(t(points1), t(points2))+1)/2))
}
And we then can do Kmeans based on those:
theResult <- myKmeans(mat, centers, myCor, 10)
Plotting the first 4 iterations the same way as before shows that even though we specified 5 clusters, there were only 2 left at the end. That is because in 2 dimensions the correlation can take only two values, either +1 or -1. When looking for the clusters, each point gets assigned to one center; even if it has the same distance to multiple centers, the first one gets chosen.
Anyway, this is now getting out of scope. The bottom line is that there are many possible distance metrics, and one simple function allows you to use any distance you want and track the results over iterations.
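For example, here is a Manhattan (L1) distance with the same plug-in interface (a sketch; any function of this shape will work with myKmeans):
# Sketch: Manhattan (L1) distance with the same interface as myEuclid.
myManhattan <- function(points1, points2) {
  distanceMatrix <- matrix(NA, nrow=nrow(points1), ncol=nrow(points2))
  for(i in 1:nrow(points2)) {
    distanceMatrix[,i] <- rowSums(abs(t(t(points1) - points2[i,])))
  }
  distanceMatrix
}
theResultL1 <- myKmeans(mat, centers, myManhattan, 10)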
I modified the distance matrix function above (added another loop over the number of points), as the function above displays only the distance of the first point from all clusters and not all points, which is what the question is looking for:
myEuclid <- function(points1, points2) {
  distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
  for(i in 1:nrow(points2)) {
    for(j in c(1:dim(t(points1))[2])) {
      distanceMatrix[j,i] <- sqrt(rowSums(t(t(points1)[,j]-t(points2[i,]))^2))
    }
  }
  distanceMatrix
}
Do let me know if this works fine!
