I need to get the k nearest neighbors from a distance matrix. Example:
I have two "training" vectors, a <- c(1,1) and b <- c(2,2), both two-dimensional. I have to classify c(3,3), and I can't use an ordinary distance because the numbers are codes for characteristics: here distance(2,3) > distance(1,3), so the nearest neighbor of c(3,3) is "a". Later I have to generalize and output the n nearest neighbors, but only for one vector at a time.
This was most promising at first, but when I looked into documentation for k.nearest.neighbors I realized it won't help me. I can't do this with Python's scikit-learn, but have some hope for R implementation, any suggestions?
I need speed, so if I'm going to implement this in a high-level language I need to do it with some library. I could easily code this up in Python's numpy, but it would almost certainly be too slow.
EDIT:
library(FNN)
distance_matrix <- matrix( rep( 0, len=9), nrow = 3)
distance_matrix[1,3] <- 2
distance_matrix[3,1] <- 2
distance_matrix[2,3] <- 3
distance_matrix[3,2] <- 3
train <- rbind(c(1,1), c(2,2))
test <- rbind(c(3,3))
y <- c("one", "two")
fit <- knn(train, test, y, distance_matrix, k=1, prob=TRUE)
result <- data.frame(test, pred=fit, prob=attr(fit, "prob"))
But when I look at the data frame result, I see a result based on the Euclidean metric or something similar, not on my distance matrix.
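For what it's worth, when the full distance matrix is already available, the nearest neighbors of one point can be read off its row directly, with no kNN package involved. A minimal sketch in base R, rebuilding the distance_matrix and y from the example above (nearest_from_dist is a hypothetical helper name, not an FNN function):

```r
# Precomputed custom distances: points 1 and 2 are training, point 3 is the query
distance_matrix <- matrix(0, nrow = 3, ncol = 3)
distance_matrix[1, 3] <- distance_matrix[3, 1] <- 2
distance_matrix[2, 3] <- distance_matrix[3, 2] <- 3
y <- c("one", "two")

# Hypothetical helper: indices of the k training points closest to `query`
nearest_from_dist <- function(d, query, train_idx, k) {
  ranked <- order(d[query, train_idx])   # positions, smallest distance first
  train_idx[ranked[seq_len(k)]]
}

nn <- nearest_from_dist(distance_matrix, query = 3, train_idx = 1:2, k = 1)
y[nn]   # label of the nearest training point
```

Here the query row is ordered once, so asking for n neighbors instead of one only changes k.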
Related
I am calculating an index that needs a matrix of species x sites and a matrix of cophenetic distances between species (generated from a phylogenetic tree). This block of code gives the objects needed to calculate it (site and tree):
library(ape)#phylogenetic tree
library(picante)#ses.mpd calculation
library(purrr)#list of distance matrices
#Sample matrix
set.seed(0000)
site <- matrix(data = sample(c(0, 1), 15, prob = c(0.4, 0.6), replace = T), ncol = 5, nrow = 3)
colnames(site) <- c("t1", "t2", "t3", "t4", "t5")
rownames(site) <- c("samp1", "samp2", "samp3")
#Sample phylogenetic tree
tree <- rcoal(5)
#Reordering species names in the community to match the order in the tree
site <- site[, tree$tip.label]
Given the objects above, I need to calculate ses.mpd 100 times using the same community matrix but a different distance matrix each time (the 100 matrices are stored in a 4 GB list). I used for loops to calculate ses.mpd, but I realised that it would take more than a month to get the output! I have used lapply before, but I do not know how to use it this time, nor purrr::map. I have seen similar questions here: Apply a function to list of matrices and here: Calculate function for all row combinations of two matrices in R, but neither of them really resembles my problem. Here is the code I used with the for loop (updated by @Parfait). Is there any way faster than a loop to get the same output? Any suggestions? Thank you very much!
#Empty list for the resolved trees
many.trees <- list()
#Creates 5 resolved trees with the function ape::multi2di
for(i in 1:5){
many.trees[[i]] <- multi2di(tree)
}
#For each resolved tree, creates a distance matrix
many.dists <- map(many.trees, cophenetic)
#ses.mpd using each of the distance matrices above
out <- list()
for(i in 1:5){
out[[i]] <- ses.mpd(site, many.dists[[i]])# Thanks, @Parfait.
}
Consider an apply-family solution for more compact code that avoids the bookkeeping of initializing empty lists and assigning to them.
# Creates 5 resolved trees with the function ape::multi2di
many.trees <- replicate(5, multi2di(tree), simplify = FALSE)
# For each resolved tree, creates a distance matrix
many.dists <- lapply(many.trees, cophenetic)
# ses.mpd using each of the distance matrices above
out_nested <- lapply(many.dists, function(d) ses.mpd(site, d))
To retain names (if the inputs above are named), change lapply to sapply with simplify = FALSE (equivalent to lapply, but with USE.NAMES = TRUE). The results would then be named lists.
# For each resolved tree, creates a distance matrix
many.dists <- sapply(many.trees, cophenetic, simplify = FALSE)
# ses.mpd using each of the distance matrices above
out <- sapply(many.dists, function(d) ses.mpd(site, d), simplify = FALSE)
out$first_name
out$second_name
out$third_name
out$fourth_name
out$fifth_name
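As a small self-contained check of how names carry over, the sketch below uses toy matrices in place of the distance matrices (mats and ids are made-up illustration objects). As far as base R goes, names on a named input list survive both calls, while the USE.NAMES behaviour of sapply only makes a difference when the input is an unnamed character vector:

```r
# Toy stand-ins for a named list of distance matrices
mats <- list(first_name = diag(2), second_name = diag(3))

res_lapply <- lapply(mats, nrow)
res_sapply <- sapply(mats, nrow, simplify = FALSE)

names(res_lapply)        # names come from the input list
names(res_sapply)        # same names
res_sapply$second_name   # elements can be accessed with $

# With an unnamed character vector, only sapply supplies names
ids <- c("a", "bb")
names(lapply(ids, nchar))                     # NULL
names(sapply(ids, nchar, simplify = FALSE))   # the strings themselves
```

So for out$first_name and friends to work, the list of trees (or of distance matrices) has to be named to begin with.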
I am working on a dataset in order to compare the effect of different distance metrics. I am using the KNN algorithm.
The KNN algorithm in R uses the Euclidean distance by default, so I wrote my own. I would like to find the number of correct class label matches between the nearest neighbor and the target.
I first prepared the data, which I named wdbc_n. I chose K=1 and used the Euclidean distance as a test.
library(philentropy)
knn <- function(xmat, k,method){
n <- nrow(xmat)
if (n <= k) stop("k can not be more than n-1")
neigh <- matrix(0, nrow = n, ncol = k)
for(i in 1:n) {
ddist<- distance(xmat, method)
neigh[i, ] <- order(ddist)[2:(k + 1)]
}
return(neigh)
}
wdbc_nn <-knn(wdbc_n ,1,method="euclidean")
I am hoping to get a result similar to the one in the paper "On the Surprising Behavior of Distance Metrics in High Dimensional Space" (https://bib.dbvis.de/uploadedFiles/155.pdf, page 431, table 3).
My question is
Am I right or wrong with the codes?
Any suggestions or reference that will guide me will be highly appreciated.
EDIT
My data (breast-cancer-wisconsin)(wdbc) dimension is
569 32
After normalizing and removing the id and target column the dimension is
dim(wdbc_n)
569 30
The train and test split is given by
wdbc_train<-wdbc_n[1:469,]
wdbc_test<-wdbc_n[470:569,]
Am I right or wrong with the codes?
Your code is wrong.
The call to the distance function took about 3 seconds every time on my rather recent PC, so I only did the first 30 rows for k=3 and noticed that every row of the neigh matrix was identical. Why is that? Take a look at this line:
ddist<- distance(xmat, method)
Each iteration feeds the whole xmat matrix to the distance function and then uses only the first line from the resulting matrix. This calculates the distances between the training set rows n times over, discarding every row except the first, which is not what you want. The knn algorithm is supposed to calculate, for each row in the test set, the distance to each row in the training set.
Let's take a look at the documentation for the distance function:
distance(x, method = "euclidean", p = NULL, test.na = TRUE, unit =
"log", est.prob = NULL)
x a numeric data.frame or matrix (storing probability vectors) or a
numeric data.frame or matrix storing counts (if est.prob is
specified).
(...)
in case nrow(x) = 2 : a single distance value. in case nrow(x) > 2 :
a distance matrix storing distance values for all pairwise probability
vector comparisons.
In your specific case (knn classification), you want to use the 2 row version.
One last thing: you used order, which returns the positions (indices) of the elements of the ddist vector in ascending order, not the distances themselves. I think what you want is the distances themselves, so you need to use sort instead of order.
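The difference between the two functions is easy to see on a toy vector (the values below are made up for illustration):

```r
# Toy distances from one query point to four training points
ddist <- c(0.7, 0.2, 0.9, 0.4)

idx  <- order(ddist)   # positions of the elements, smallest distance first
vals <- sort(ddist)    # the distances themselves, in ascending order

idx            # 2 4 1 3
vals           # 0.2 0.4 0.7 0.9
ddist[idx]     # indexing by order() reproduces sort()
```

If the goal is to look up class labels of the neighbors, the indices from order are the useful part; if the goal is to report distances, sort is.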
Based on your code and the example in Lantz (2013) that your code seemed to be based on, here is a complete working solution. I took the liberty to add a few lines to make a standalone program.
Standalone working solution(s)
library(philentropy)
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
knn <- function(train, test, k, method){
n.test <- nrow(test)
n.train <- nrow(train)
if (n.train + n.test <= k) stop("k can not be more than n-1")
neigh <- matrix(0, nrow = n.test, ncol = k)
ddist <- NULL
for(i in 1:n.test) {
for(j in 1:n.train) {
xmat <- rbind(test[i,], train[j,]) #we make a 2 row matrix combining the current test and train rows
ddist[j] <- distance(as.data.frame(xmat), method = method) #then we calculate the distance and store it in the ddist vector (k is not passed here: the third argument of distance is the Minkowski power p)
}
neigh[i, ] <- sort(ddist)[2:(k + 1)]
}
return(neigh)
}
wbcd <- read.csv("https://resources.oreilly.com/examples/9781784393908/raw/ac9fe41596dd42fc3877cfa8ed410dd346c43548/Machine%20Learning%20with%20R,%20Second%20Edition_Code/Chapter%2003/wisc_bc_data.csv")
rownames(wbcd) <- wbcd$id
wbcd$id <- NULL
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
wbcd_train<-wbcd_n[1:469,]
wbcd_test<-wbcd_n[470:569,]
wbcd_nn <-knn(wbcd_train, wbcd_test ,3, method="euclidean")
Do note that this solution might be slow because of the numerous (100 times 469) calls to the distance function. However, since we are only feeding 2 rows at a time into the distance function, it makes the execution time manageable.
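If only the Euclidean metric is needed, the inner loop can be avoided altogether. The following is a rough base R sketch, not the philentropy approach used above: knn_dists_euclid is a hypothetical helper name, and it keeps the k smallest distances starting from the nearest one (indices 1 to k), on the assumption that the test rows are not part of the training set.

```r
# Hypothetical helper: for each test row, the k smallest Euclidean
# distances to the training rows, computed without an inner loop
knn_dists_euclid <- function(train, test, k) {
  train <- as.matrix(train)
  test  <- as.matrix(test)
  if (k > nrow(train)) stop("k can not be more than the number of training rows")
  out <- vapply(seq_len(nrow(test)), function(i) {
    # t(train) has one column per training row, so subtracting the test
    # row recycles it down every column
    d <- sqrt(colSums((t(train) - test[i, ])^2))
    unname(sort(d)[seq_len(k)])
  }, numeric(k))
  matrix(out, ncol = k, byrow = TRUE)   # one row per test observation
}

# Toy data to show the shape of the result
train <- rbind(c(0, 0), c(3, 4), c(6, 8))
test  <- rbind(c(0, 0), c(6, 8))
res   <- knn_dists_euclid(train, test, k = 2)
res
```

On the toy data each test point coincides with one training point, so the first column is 0 and the second is 5.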
Now does that work?
The two first test rows using the custom knn function:
          [,1]      [,2]      [,3]
[1,] 0.3887346 0.4051762 0.4397497
[2,] 0.2518766 0.2758161 0.2790369
Let us compare with the equivalent function in the FNN package:
library(FNN)
alt.class <- get.knnx(wbcd_train, wbcd_test, k=3, algorithm = "brute")
alt.class$nn.dist
          [,1]      [,2]      [,3]
[1,] 0.3815984 0.3887346 0.4051762
[2,] 0.2392102 0.2518766 0.2758161
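Conclusion: not too shabby. Note, though, that the custom function's columns are shifted by one relative to FNN's: its first two distances match FNN's second and third. That is because sort(ddist)[2:(k + 1)] skips the smallest distance, which only makes sense when the query row is itself part of the training set. With separate train and test sets, sort(ddist)[1:k] should line up with the FNN output.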
I'm looking to perform classification on data with mostly categorical features. For that purpose, Euclidean distance (or any other distance that assumes numerical features) doesn't fit.
I'm looking for a kNN implementation for R where it is possible to select different distance methods, like Hamming distance.
Is there a way to use common kNN implementations like the one in {class} with different distance metric functions?
I'm using R 2.15
As long as you can calculate a distance/dissimilarity matrix (in whatever way you like) you can easily perform kNN classification without the need of any special package.
# Generate dummy data
y <- rep(1:2, each=50) # True class memberships
x <- y %*% t(rep(1, 20)) + rnorm(100*20) < 1.5 # Dataset with 20 variables
design.set <- sample(length(y), 50)
test.set <- setdiff(1:100, design.set)
# Calculate distance and nearest neighbors
library(e1071)
d <- hamming.distance(x)
NN <- t(apply(d[test.set, design.set], 1, order)) # apply() returns one column per test row, so transpose to get one row per test row
# Predict class membership of the test set
k <- 5
pred <- apply(NN[, 1:k, drop=FALSE], 1, function(nn){
tab <- table(y[design.set][nn])
as.integer(names(tab)[which.max(tab)]) # This is a pretty dirty line
})
# Inspect the results
table(pred, y[test.set])
If anybody knows a better way of finding the most common value in a vector than the dirty line above, I'd be happy to know.
The drop=FALSE argument is needed to keep the subset of NN a matrix in the case k=1. Otherwise it is converted to a vector and apply throws an error.
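One possible alternative for finding the most common value, sketched in base R: count occurrences with tabulate and match, then index back into the unique values so the original type is kept and no conversion through names is needed. most_common is a hypothetical helper name, and ties go to whichever value appears first.

```r
# Hypothetical helper: most frequent value in a vector, keeping its type
most_common <- function(v) {
  u <- unique(v)                       # candidate values, in order of appearance
  counts <- tabulate(match(v, u))      # how often each candidate occurs
  u[which.max(counts)]                 # first candidate with the highest count
}

votes <- c(2L, 1L, 2L, 1L, 1L)         # toy class labels of k = 5 neighbors
most_common(votes)
most_common(c("b", "a", "b"))          # works for character labels too
```

Inside the prediction step this would replace the table/names/as.integer line with most_common(y[design.set][nn]).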
I have a matrix with a hundred rows.
Is there a way to obtain a subset of the ten rows which are most similar to the first row?
res2 <- matrix(rexp(200, rate=.1), ncol=10, nrow=100)
set1 <- subset(res2, res2 >condition1)
set1[with(set1, order(condition)), ]
set2 <- head(set1,10)
Perhaps:
Generate data:
set.seed(101)
res2 <- matrix(rexp(200, rate=.1), ncol=10, nrow=100)
Calculate the distance matrix. This is very inefficient because we're computing all of the pairwise distances, but it's efficiently coded and easy to use and you have lots of choices of distance metric (see ?dist, look for method). For this size problem it's very quick.
dd <- dist(res2)
rr <- rank(as.matrix(dd)[1,])
You'll notice that the rank of the first element of the first row (which is the distance between row 1 and itself) is 1, and its value (as.matrix(dd)[1,1]) is zero. So all we need now are the rows with the next ten smallest distances ...
res2[rr>1 & rr<=11,]
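A variation on the same idea uses order instead of rank, which always yields exactly ten rows even if some distances are tied (rank averages tied ranks by default, so the rr <= 11 filter can then select more or fewer rows). This is only a sketch: the data below are regenerated with 1000 draws so that the matrix is filled without recycling, which is my own assumption and not part of the code above.

```r
set.seed(101)
res2 <- matrix(rexp(1000, rate = 0.1), ncol = 10, nrow = 100)

# Distances from row 1 to every row (including itself, at distance 0)
d1 <- as.matrix(dist(res2))[1, ]

# Order all rows by distance, drop row 1 itself, keep the first ten
nearest10 <- setdiff(order(d1), 1)[1:10]

res2[nearest10, ]   # the ten most similar rows, nearest first
```

Unlike the logical subset, this returns the rows already sorted from most to least similar.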
I'm using R to perform a hierarchical clustering. As a first approach I used hclust and performed the following steps:
I imported the distance matrix
I used the as.dist function to transform it into a dist object
I ran hclust on the dist object
Here's the R code:
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
hclust(d, "ward")
At this point I would like to do something similar with the function pvclust; however, I cannot because it's not possible to pass a precomputed dist object. How can I proceed considering that I'm using a distance not available among those provided by the dist function of R?
I've tested the suggestion of Vincent, you can do the following (my data set is a dissimilarity matrix):
# Import your data
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
# Compute the eigenvalues
x <- cmdscale(d,1,eig=T)
# Plot the eigenvalues and choose the number of dimensions (keep those whose eigenvalues are clearly above 0)
plot(x$eig,
type="h", lwd=5, las=1,
xlab="Number of dimensions",
ylab="Eigenvalues")
# Recover the coordinates that give the same distance matrix with the correct number of dimensions
x <- cmdscale(d,nb_dimensions)
# As mentioned by Stéphane, pvclust() clusters columns
pvclust(t(x))
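The nb_dimensions value above is picked by eye from the plot. A rough way to pick it programmatically is to count the eigenvalues that are clearly positive. The sketch below does this on simulated points, so the 5-dimensional toy data and the 1e-8 relative tolerance are assumptions for illustration, not part of the original workflow:

```r
set.seed(1)
pts <- matrix(rnorm(20 * 5), ncol = 5)   # 20 toy points in 5 dimensions
d   <- dist(pts)                         # Euclidean distances between them

# Eigenvalues from classical multidimensional scaling
fit <- cmdscale(d, k = 1, eig = TRUE)

# Count eigenvalues that are clearly above zero (tolerance is an assumption)
nb_dimensions <- sum(fit$eig > 1e-8 * max(fit$eig))

# Coordinates in that many dimensions reproduce the distance matrix
x <- cmdscale(d, nb_dimensions)
max(abs(dist(x) - d))   # should be numerically zero for Euclidean input
```

For dissimilarities that are not Euclidean, some eigenvalues can be markedly negative, and the recovered coordinates then only approximate the original matrix.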
If the dataset is not too large, you can embed your n points in a space of dimension n-1, with the same distance matrix.
# Sample distance matrix
n <- 100
k <- 1000
d <- dist( matrix( rnorm(k*n), nc=k ), method="manhattan" )
# Recover some coordinates that give the same distance matrix
x <- cmdscale(d, n-1)
stopifnot( sum(abs(dist(x) - d)) < 1e-6 )
# You can then use x or d interchangeably
r1 <- hclust(d)
r2 <- hclust(dist(x)) # identical to r1
library(pvclust)
r3 <- pvclust(x)
If the dataset is large, you may have to check how pvclust is implemented.
It's not clear to me whether you only have a distance matrix, or you computed it beforehand. In the former case, as already suggested by @Vincent, it would not be too difficult to tweak the R code of pvclust itself (using fix() or whatever; I provided some hints on another question on CrossValidated). In the latter case, the authors of pvclust provide an example on how to use a custom distance function, although that means you will have to install their "unofficial version".