Implementing KNN with different distance metrics using R

I am working on a dataset to compare the effect of different distance metrics, using the KNN algorithm.
The KNN implementation in R uses the Euclidean distance by default, so I wrote my own. I would like to find the number of correct class label matches between the nearest neighbor and the target.
I prepared the data first, then called it (wdbc_n) and chose K=1. I used the Euclidean distance as a test.
library(philentropy)
knn <- function(xmat, k, method){
  n <- nrow(xmat)
  if (n <= k) stop("k can not be more than n-1")
  neigh <- matrix(0, nrow = n, ncol = k)
  for(i in 1:n) {
    ddist <- distance(xmat, method)
    neigh[i, ] <- order(ddist)[2:(k + 1)]
  }
  return(neigh)
}
wdbc_nn <- knn(wdbc_n, 1, method = "euclidean")
I am hoping to get a result similar to the one in the paper "On the Surprising Behavior of Distance Metrics in High Dimensional Space" (https://bib.dbvis.de/uploadedFiles/155.pdf, page 431, Table 3).
My questions are:
Am I right or wrong with the code?
Any suggestions or references that can guide me would be highly appreciated.
EDIT
My data (breast-cancer-wisconsin, wdbc) has dimensions
569 32
After normalizing and removing the id and target columns, the dimensions are
dim(wdbc_n)
569 30
The train and test split is given by
wdbc_train<-wdbc_n[1:469,]
wdbc_test<-wdbc_n[470:569,]

Am I right or wrong with the code?
Your code is wrong.
The call to the distance function took about 3 seconds every time on my rather recent PC, so I only did the first 30 rows for k=3 and noticed that every row of the neigh matrix was identical. Why is that? Take a look at this line:
ddist <- distance(xmat, method)
Each loop iteration feeds the whole xmat matrix to the distance function, then uses only the first row of the resulting matrix. This calculates the distances between the training set rows, and does so n times, discarding every row except the first, which is not what you want. The knn algorithm is supposed to calculate, for each row in the test set, the distance to each row in the training set.
Let's take a look at the documentation for the distance function:
distance(x, method = "euclidean", p = NULL, test.na = TRUE, unit = "log", est.prob = NULL)
x: a numeric data.frame or matrix (storing probability vectors) or a numeric data.frame or matrix storing counts (if est.prob is specified).
(...)
Value: in case nrow(x) = 2, a single distance value; in case nrow(x) > 2, a distance matrix storing distance values for all pairwise probability vector comparisons.
In your specific case (knn classification), you want to use the 2 row version.
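For illustration, here is a minimal sketch of that 2-row usage (toy vectors, not from the dataset):
library(philentropy)
a <- c(0.1, 0.5, 0.9)
b <- c(0.2, 0.4, 0.7)
distance(rbind(a, b), method = "euclidean")  # a single distance value, as the documentation describes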
One last thing: you used order, which returns the positions of the distances in the ddist vector, sorted from smallest to largest, not the distances themselves. I think what you want is the distances, so you need to use sort instead of order.
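A tiny base-R illustration (toy numbers) of the difference:
d <- c(0.40, 0.10, 0.25)
order(d)  # 2 3 1            -- positions of the distances, nearest first
sort(d)   # 0.10 0.25 0.40   -- the distances themselves, smallest first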
Based on your code and the example in Lantz (2013) that your code seems to be based on, here is a complete working solution. I took the liberty of adding a few lines to make it a standalone program.
Standalone working solution(s)
library(philentropy)

normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

knn <- function(train, test, k, method){
  n.test <- nrow(test)
  n.train <- nrow(train)
  if (n.train + n.test <= k) stop("k can not be more than n-1")
  neigh <- matrix(0, nrow = n.test, ncol = k)
  ddist <- NULL
  for(i in 1:n.test) {
    for(j in 1:n.train) {
      # make a 2-row matrix combining the current test and train rows
      xmat <- rbind(test[i,], train[j,])
      # calculate the distance and append it to the ddist vector
      ddist[j] <- distance(as.data.frame(xmat), method, k)
    }
    neigh[i, ] <- sort(ddist)[2:(k + 1)]
  }
  return(neigh)
}

wbcd <- read.csv("https://resources.oreilly.com/examples/9781784393908/raw/ac9fe41596dd42fc3877cfa8ed410dd346c43548/Machine%20Learning%20with%20R,%20Second%20Edition_Code/Chapter%2003/wisc_bc_data.csv")
rownames(wbcd) <- wbcd$id
wbcd$id <- NULL
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
wbcd_train <- wbcd_n[1:469,]
wbcd_test <- wbcd_n[470:549,]
wbcd_nn <- knn(wbcd_train, wbcd_test, 3, method = "euclidean")
Do note that this solution might be slow because of the numerous calls to the distance function (one per test row / training row pair). However, since we only feed 2 rows at a time into the distance function, the execution time stays manageable.
Now does that work?
The first two test rows using the custom knn function:
[,1] [,2] [,3]
[1,] 0.3887346 0.4051762 0.4397497
[2,] 0.2518766 0.2758161 0.2790369
Let us compare with the equivalent function in the FNN package:
library(FNN)
alt.class <- get.knnx(wbcd_train, wbcd_test, k=3, algorithm = "brute")
alt.class$nn.dist
[,1] [,2] [,3]
[1,] 0.3815984 0.3887346 0.4051762
[2,] 0.2392102 0.2518766 0.2758161
Conclusion: not too shabby.
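Since the original goal was to count correct class label matches between the nearest neighbour and the target, here is one hedged sketch of that last step for k = 1, reusing the FNN call above; it assumes the label column in the raw file is wbcd$diagnosis (as in the Lantz data) and the same 469/80 split:
library(FNN)
nn1 <- get.knnx(wbcd_train, wbcd_test, k = 1, algorithm = "brute")
pred <- wbcd$diagnosis[1:469][nn1$nn.index[, 1]]  # label of the single nearest training row
truth <- wbcd$diagnosis[470:549]
mean(pred == truth)  # proportion of correct nearest-neighbour label matches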

Related

How to write Bray-Curtis function?

I tried to write a Bray-Curtis function on my own. My data is economic and social data about different regions (each row is a different region and each column is an economic index). A sample is below (the data is already normalized to the range 0-1; that is why the best region, the standard region, has the value 1). The real data has more regions and values :P
Region= c("A", "B", "C")
Sp1 =c(0.43, 1, 0.5)
Sp2 = c(0.53, 0.12, 0.75)
...
Sp23 = c(0.97, 0.2, 1)
Sp24 = c(0.34, 0.72, 0.23)
I need a synthetic index of development; that is why I am trying to use Bray-Curtis. This is the code of my function:
bray_curtis <- function(x, na.rm = FALSE) {
  return(1 - (rowSums(abs(x - max(x)))) / rowSums(x + max(x)))
}
gus2016_braycurtis <- as.data.frame(lapply(gus2016_norm, bray_curtis))
The formula that I tried to implement: https://i.stack.imgur.com/LRrBb.png
What should I change? As output I need one column with the synthetic index of development for each region.
While I think the Bray-Curtis index is not what you want, I will show how to get it. Your math needs fixing: your calculations have little to do with the formula you linked to (and even that linked formula botches its indices, but let's ignore that and implement the formula that was intended).
The BC index is a dissimilarity or similarity index calculated between two rows (or columns). When calculated for just one observation (that is, when the indices are equal and the row/column is simply duplicated) it will be 0 (if a distance) or 1 (if a similarity). The linked formula defines a similarity index. Here is a straightforward implementation of the linked formula for a matrix z, doing the calculations between rows. You need to switch the indexing if you want to calculate similarities between columns:
N <- nrow(z) # assuming that matrix/data.frame z exists
d <- matrix(0, N, N)
for(i in 1:N) for(j in 1:N) d[i,j] <- 1 - sum(abs(z[i,]-z[j,]))/sum(z[i,]+z[j,])
## d[i,j] <- 2*sum(pmin(z[i,], z[j,]))/sum(z[i,]+z[j,]) is mathematically equivalent
This will give you a symmetric matrix with diagonal of 1 (so a lot of redundant work was done).
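For example, with a small toy matrix built from the sample data in the question (only three of the listed columns, purely for illustration):
z <- cbind(Sp1 = c(0.43, 1, 0.5),
           Sp2 = c(0.53, 0.12, 0.75),
           Sp24 = c(0.34, 0.72, 0.23))
rownames(z) <- c("A", "B", "C")
N <- nrow(z)
d <- matrix(0, N, N, dimnames = list(rownames(z), rownames(z)))
for(i in 1:N) for(j in 1:N) d[i,j] <- 1 - sum(abs(z[i,]-z[j,]))/sum(z[i,]+z[j,])
d  # symmetric similarity matrix with 1 on the diagonal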
Then there is the easy way, as the BC index has been implemented in several packages. I will just show how to get it in vegan, where we need to change the dissimilarity into a similarity, and I also cast the distance structure to a symmetric matrix as above:
library(vegan)
d <- 1 - as.matrix(vegdist(z)) # assuming that z exists
You probably don't want this index, but at least you can now see how to get it.

Simulation of a binomial distribution and storing values in a matrix in R

set.seed(123)
for(m in 1:40)
{
  u <- rbinom(1e3, 40, 0.30)
  result[[m]] = u
}
result
for (m in 1:40) if (any(result[[m]] == 1)) break
m
m is the exit time for the company; as we change the probability it will give a different result. Using this m as the exit, I have to find whether there was a funding round in between, so I created a random binomial distribution with some probability: when you get a 1, that means there is a funding round (j). If there is a funding round I have to find the limit of the round using a random uniform distribution. I am not sure whether the code for rbinom is right and whether it runs until m. I am getting the y value for all 40 iterations; I need it so that when I get rbinom == 1 it goes to the next loop. I am also trying to store the values in a matrix, but they are not getting stored. Please help me with that.
mat1 <- matrix(0, nrow = 40, ncol = 2)  # empty matrix
for(j in 1:m) {
  k <- if(any(rbinom(1e3, 40, 0.42) == 1))  # funding round
  {
    y <- runif(j, min = 0, max = 1)  # lower and upper bound
    mat1[l][0] <- j
    mat1[l][1] <- y  # matrix storing the value
  }
}
resl
mat1
y
The answer to your first question:
result <- vector("list", 40)
for(m in 1:40)
{
  u <- rbinom(1e3, 40, 0.05)
  print(u)
  result[[m]] = u
}
u
The second question is not clear. Could you rephrase it?
To generate 40 vectors of random binomial numbers you don't need a loop at all, use ?replicate.
u <- replicate(40, rbinom(1e3, 40, 0.05))
As for your second question, there are several problems with your code. I will try to address them; it will be up to you to say whether the proposed corrections are right.
The following does basically nothing
for(k in 1:40)
{
  n <- (any(rbinom(1e3, 40, 0.05) == 1))  # n is TRUE/FALSE
}
k  # at this point, equal to 40
There are better ways of creating a T/F variable.
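For instance, one hedged alternative is a single draw per check, assigned once (the name funded is just illustrative):
funded <- any(rbinom(1e3, 40, 0.05) == 1)  # a single TRUE/FALSE, no loop needed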
# don't use list() here
mat1 <- matrix(0, nrow = 40, ncol = 2)  # or maybe NA instead of 0
Then you set l = 0, even though indices in R start at 1. Anyway, I don't believe you'll need this variable l.
if(any(rbinom(1e3, 40, 0.30) == 1))  # probably TRUE, left as an exercise in probability theory
Then, finally,
mat1[l][0]<-j # index `0` doesn't exist
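If the intent is to record the round number j and the uniform draw y in row j, a hedged sketch of the matrix assignment (row/column indexing instead of [l][0]) could look like this; whether a single draw per round is what you want is for you to say:
mat1 <- matrix(NA_real_, nrow = 40, ncol = 2)
for(j in 1:m) {
  if(any(rbinom(1e3, 40, 0.42) == 1)) {  # a funding round happened
    y <- runif(1, min = 0, max = 1)      # one draw for the round limit
    mat1[j, 1] <- j
    mat1[j, 2] <- y
  }
}
mat1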
Please revise your code, and tell us what you want to do, we're glad to help.

Find K nearest neighbors, starting from a distance matrix

I'm looking for a well-optimized function that accepts an n X n distance matrix and returns an n X k matrix with the indices of the k nearest neighbors of the ith datapoint in the ith row.
I find a gazillion different R packages that let you do KNN, but they all seem to include the distance computations along with the sorting algorithm within the same function. In particular, for most routines the main argument is the original data matrix, not a distance matrix. In my case, I'm using a nonstandard distance on mixed variable types, so I need to separate the sorting problem from the distance computations.
This is not exactly a daunting problem -- I obviously could just use the order function inside a loop to get what I want (see my solution below), but this is far from optimal. For example, the sort function with partial = 1:k when k is small (less than 11) goes much faster, but unfortunately returns only sorted values rather than the desired indices.
Try the FastKNN CRAN package (although it is not well documented). It offers a k.nearest.neighbors function to which an arbitrary distance matrix can be given. Below is an example that computes the matrix you need.
library(FastKNN)
# arbitrary data
train <- matrix(sample(c("a","b","c"), 12, replace = TRUE), ncol = 2)  # n x 2
n <- dim(train)[1]
distMatrix <- matrix(runif(n^2, 0, 1), ncol = n)  # n x n
# matrix of neighbours
k <- 3
nn <- matrix(0, n, k)  # n x k
for (i in 1:n)
  nn[i,] <- k.nearest.neighbors(i, distMatrix, k = k)
Note: you can always search the CRAN package list (Ctrl+F for 'knn') for related functions:
https://cran.r-project.org/web/packages/available_packages_by_name.html
For the record (I won't mark this as the answer), here is a quick-and-dirty solution. Suppose sd.dist is the special distance matrix. Suppose k.for.nn is the number of nearest neighbors.
n <- nrow(sd.dist)
knn.mat <- matrix(0, ncol = k.for.nn, nrow = n)
knd.mat <- knn.mat
for(i in 1:n){
  knn.mat[i,] <- order(sd.dist[i,])[1:k.for.nn]
  knd.mat[i,] <- sd.dist[i, knn.mat[i,]]
}
Now knn.mat is the matrix with the indices of the k nearest neighbors in each row, and for convenience knd.mat stores the corresponding distances.
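For what it's worth, an equivalent but slightly more compact variant of the same loop (for k.for.nn > 1) uses apply over the rows of the distance matrix:
knn.mat <- t(apply(sd.dist, 1, function(row) order(row)[1:k.for.nn]))
knd.mat <- t(apply(sd.dist, 1, function(row) sort(row)[1:k.for.nn]))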

To find the distance between two roots in R

Suppose I have a function f(x) that is well defined on an interval I. I want to find the greatest and smallest roots of f(x), then taking the difference of them. What is a good way to program it?
To be precise, f can at worst be a rational function like (1+x)/(1-x). It should be a (high degree) polynomial most of the times. I only need to know the result numerically to some precision.
I am thinking about the following:
Convert f(x) into a form recognizable by R. (I can do this.)
Use R to list all roots of f(x) on I. (I found that the uniroot function only gives me one root.)
Use R to find the maximum and minimum elements in the list. (Should be possible once I convert it to a vector.)
Take the difference of the two roots. (Should be trivial.)
I am stuck on step (2) and I do not know what to do. My professor gave a brute-force solution, suggesting that I:
Divide the interval I into one million pieces.
Evaluate f at each end point and find the end points where f >= 0.
Choose the maximum and minimum elements from the set formed in step 2.
Take the difference between them.
I feel this approach is not very efficient and might not work for all f in general, but I am having trouble implementing it even for quadratics. I also do not know how to do step (2). So I want to ask for a hint or some toy examples.
At this point I am trying to implement the following code:
Y = rep(0, 200)
dim(Y) = c(100, 2)
for(i in 1:100){
  X = rnorm(9, 0, 1)
  Z = rnorm(16, 0, 1)
  a = 0.64
  b = a * sum(Z^2) / sum(X^2)
  root_intervals <- function(f, interval, n = 1e6) {
    xvals <- seq(interval[1], interval[2], length = n)
    yvals <- f(xvals)
    ypos <- yvals > 0
    x1 <- which(tail(ypos, -1) != head(ypos, -1))
    x2 <- x1 + 1
    ## so all the zeroes we can see are between x1 and x2
    return(cbind(xvals[x1], xvals[x2]))
  }
Up to here everything is okay, but when I try to extract the roots into Y[i,1] and Y[i,2] with
Y[i,1]=(ri<-root intervals(function(x)(x/(a*x+b))^{9/2}*(1/((1-a)+a*(1-a)/b*x))^4-0.235505, c(0,40),n=1e6)[1]
I find that I cannot evaluate it anymore. R keeps telling me
Error: unexpected symbol in:
"}
Y[i,1]=(ri<-root intervals"
and I am stuck. I really appreciate everyone's help as I am feeling lost.
I checked the function's expression many times using the plot function and it has no syntax mistakes. Also, I believe it is well defined for all X in the interval.
This should give you a good start on the brute force solution. You're right, it's not elegant, but for relatively simple univariate functions, evaluating 1 million points is trivial.
root_intervals <- function(f, interval, n = 1e6) {
  xvals <- seq(interval[1], interval[2], length = n)
  yvals <- f(xvals)
  ypos <- yvals > 0
  x1 <- which(ypos[-1] != head(ypos, -1))
  x2 <- x1 + 1
  ## so all the zeroes we can see are between x1 and x2
  return(cbind(xvals[x1], xvals[x2]))
}
This function returns a two column matrix of x values, where the function changes sign between column 1 and column 2:
f1 <- function (x) 0.05 * x^5 - 2 * x^4 + x^3 - x^2 + 1
> (ri <- root_intervals(f1, c(-10, 10), n = 1e6))
[,1] [,2]
[1,] -0.6372706 -0.6372506
[2,] 0.8182708 0.8182908
> f1(ri)
[,1] [,2]
[1,] -3.045326e-05 6.163467e-05
[2,] 2.218895e-05 -5.579081e-05
Wolfram Alpha confirms results on the specified interval.
The top and bottom rows will be the min and max intervals found. These intervals (over which the function changes sign) are precisely what uniroot wants for its interval, so you could use it to solve for the (more) exact roots. Of course, if the function changes sign twice within one interval (or any even number of times), it won't be picked up, so choose a big n!
Response to edited question:
Looks like you're trying to define a bunch of functions, but your edits have syntax errors. Here's what I think you're trying to do (this first part might take some more work to get right):
my_funs <- list()
Y = rep(0, 200)
dim(Y) = c(100, 2)
for(i in 1:100){
  X = rnorm(9, 0, 1)
  Z = rnorm(16, 0, 1)
  a = 0.64
  b = a * sum(Z^2) / sum(X^2)
  my_funs[[i]] <- function(x){ (x/(a*x+b))^{9/2} * (1/((1-a)+a*(1-a)/b*x))^4 - 0.235505 }
}
Here is root_intervals applied to the first of your generated functions:
> root_intervals(my_funs[[1]], interval = c(0, 40))
[,1] [,2]
[1,] 0.8581609 0.8582009
[2,] 11.4401314 11.4401714
Notice the output: a matrix, with each root of the function lying between the value in the first column and the value in the second column. Being a matrix, it can't be assigned to a vector. If you want a single root, use uniroot, with each row supplying the lower and upper bounds. This is left as an exercise for the reader.
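For completeness, here is a hedged sketch of that exercise, reusing the names defined above: refine each bracketing interval with uniroot and take the spread between the extreme roots, which is what the original question asked for.
f <- my_funs[[1]]
ri <- root_intervals(f, interval = c(0, 40))
roots <- apply(ri, 1, function(iv) uniroot(f, lower = iv[1], upper = iv[2])$root)
max(roots) - min(roots)  # difference between the greatest and smallest root found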

Compute sample variance without loops

Here is what I want to do:
I have a time-series data frame with, let us say, 100 time series of length 600, each in one column of the data frame.
I want to pick 10 of the time series at random and then assign them random weights that sum to one. Using those, I want to compute the variance of the sum of the 10 weighted time-series variables (i.e. a convex combination).
The df is in the form
v1,v2,v2.....v100
1,5,6,.......9
2,4,6,.......10
3,5,8,.......6
2,2,8,.......2
etc
I can compute it inside a loop, but R is vector-oriented and the loop is not efficient.
ntrials = 10000
ts.sd = NULL
for (x in 1:ntrials)
{
  temp = t(weights[, x]) %*% cov(df[, samples[, x]]) %*% weights[, x]
  ts.sd = cbind(ts.sd, temp)
}
Not sure what type of "random" you want for your weights... so I'll use uniform draws scaled so that they sum to one:
x = as.data.frame(matrix(sample(1:20, 100*600, replace = TRUE), ncol = 100))
myfun <- function(inc, DF = x) {
  w = runif(10)
  w = w / sum(w)
  t(w) %*% cov(DF[, sample(seq_along(DF), 10)]) %*% w
}
lapply(1:ntrials, myfun)
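As a small aside, sapply collects the same results directly into a numeric vector rather than a list (ntrials as defined in the question):
ts.var <- sapply(1:ntrials, myfun)  # numeric vector of length ntrials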
However, this isn't really avoiding loops per se, since lapply is just an efficient looping construct. That said, for loops in R aren't inherently bad or inefficient. Growing a data structure, like you're doing with cbind, however, is.
But in this case, since you're only growing it by appending a single element each time, it really won't change things much. The "correct" version would be to pre-allocate your vector ts.sd using ntrials.
ts.sd = vector(mode='numeric', length=ntrials)
Then, in your loop, assign into it using the loop index:
for (x in 1:ntrials)
{
  temp = t(weights[, x]) %*% cov(df[, samples[, x]]) %*% weights[, x]
  ts.sd[x] = temp
}
