How to make a fast pairwise Tanimoto distance function in R

I have a data.frame of items identified by an integer property ID, which is also the row number of the data.frame.
Each item has a vector of features FP associated with it. The elements of each FP are unique (within that FP): for instance c(1,2,7), but never c(1,7,7).
The Tanimoto distance between any two ID's is defined as 1 minus the number of unique elements in the intersection of their FP's, divided by the number of unique elements in the union of their FP's. For example, for FP's c(1,2,7) and c(2,7,9), the intersection has 2 elements and the union has 4, so the distance is 1 - 2/4 = 0.5.
I need to calculate such distances in the context of a 'maxmin' algorithm. See for instance this blog post.
The most important point to note is that I must NOT compute a full distance matrix (even with the best algorithms it would be infeasible at the scale of the datasets I am working with).
As explained in the above post, the strength of the iterative maxmin picker according to Roger Sayle's method is that one can avoid computing most of the pairwise distances, and instead calculate only the few relevant ones. Hence my question.
Here's what I could come up with so far:
# make a random dataset
set.seed(1234567)
d <- sample(30:45, 1000, replace = T)
dd <- setNames(
  data.frame(do.call(rbind,
                     sapply(d, function(n) list(sample(as.character(1:(45*2)), n, replace = F)),
                            simplify = F))),
  "FP")
dd["ID"] <- 1:NROW(dd)
# define a pairwise distance function for ID's
distfun <- function(ID1, ID2) {
  FP1 <- dd$FP[[ID1]]
  FP2 <- dd$FP[[ID2]]
  int <- length(intersect(FP1, FP2))
  1 - int / (d[ID1] + d[ID2] - int)
}
# test performance of distance function
x <- sample(dd$ID, 200, replace = F)
y <- sample(dd$ID[!(dd$ID %in% x)], 200, replace = F)
pairwise.dist <- NULL
system.time(
  for(i in x) {
    for (j in y) {
      dij <- distfun(i, j)
      #pairwise.dist <- rbind(pairwise.dist, c(min(i, j), max(i, j), dij))
    }
  }
)
# user system elapsed
# 0.86 0.00 0.86
Question 1 : do you think the distance function could be made faster?
I tried making a sparse matrix of the features (ddu.tab in the code below, where I omitted the denominator, which is trivial to compute from the intersection) and defining the distance function as vector operations, but that was much slower (a bit to my surprise, I must say).
ddu <- do.call(rbind, sapply(dd$ID, function(x) {
  data.frame("ID" = x, "FP" = dd$FP[[x]], stringsAsFactors = F)
}, simplify = F))
ddu.tab <- xtabs(~ID+FP, ddu, sparse = T)
system.time(
  for(i in x) {
    for (j in y) {
      dij <- t(ddu.tab[i, ]) %*% ddu.tab[j, ]
      #pairwise.dist <- rbind(pairwise.dist, c(min(i, j), max(i, j), dij))
    }
  }
)
# user system elapsed
# 32.35 0.03 32.66
Question 2 : actually less important than the distance calculation, but if anyone can advise... Updating pairwise.dist by rbind is (apparently) very costly. I don't know whether I can do it differently (meaning without adding new elements at each iteration), because in the maxmin application the pairs of ID's whose distances must be calculated are not known upfront as they are in this example, and pairwise.dist is continuously read and has new elements appended to it.
Someone once suggested to me that lists may be better than matrices for read/write. If that is the case, I could store pairwise.dist as a named list.
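For illustration only (a hedged sketch with made-up names res and k, not necessarily optimal for the maxmin case): a common pattern is to append each result to a list, which is cheap, and bind everything into a matrix once at the end, instead of rbind-ing row by row.
# collect result rows in a list, bind once at the end
res <- list()
k <- 0
for (i in x) {
  for (j in y) {
    k <- k + 1
    res[[k]] <- c(min(i, j), max(i, j), distfun(i, j))
  }
}
pairwise.dist <- do.call(rbind, res)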
BTW, just FYI, in this specific example the full distance matrix is calculated quite fast:
system.time(ddu.dist <- dist(ddu.tab, method = "binary"))
# user system elapsed
# 0.61 0.00 0.61
which seems to indicate that there is indeed a fast method to calculate binary distances.
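As a sanity check (hedged; it relies on dist()'s "binary" method being the Jaccard/Tanimoto distance on the 0/1 feature matrix, and on the rows of ddu.tab being ordered by ID), the two approaches should agree pair by pair:
m <- as.matrix(ddu.dist)          # hypothetical name for the dense distance matrix
all.equal(m[1, 2], distfun(1, 2)) # expected TRUE
all.equal(m[5, 9], distfun(5, 9)) # expected TRUE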
If anyone could please advise and/or point me to relevant resources, it would be great.
Thanks!

Not sure about speeding up the distance function itself, but you could replace your double loop, using the tidyverse, with
library(tidyverse)
results <- crossing(x = x, y = y) %>%                  # all x,y combinations
  filter(x < y) %>%                                    # remove duplicates
  mutate(pairwise.dist = map2_dbl(x, y, distfun))      # apply distance function

Related

Speed of Daisy Function

I'm working on improving the speed of a function (for a dissimilarity measure) I'm writing, which is mathematically quite similar to the Euclidean distance function. However, when I time my function against the one implemented in the daisy function from the cluster package, I find a significant difference in speed, with daisy performing much better.
Given that (I'm assuming) a dissimilarity measure would require O(n^2 x p) time, because each object has to be compared with every other object over all variables (where n is the number of objects and p the number of variables), I find it difficult to understand how the daisy function performs so well (near constant time, from the few experiments I've done) relative to my simple and direct implementation.
I present the code I used both to implement and to test this below. I have tried looking through the R source code for the implementation of the daisy function, but I found it difficult to understand; I found no nested for loop. Any help with understanding why this function performs so fast, and how I could modify my code to reach a similar speed, would be highly appreciated.
euclidean <- function (df){
  no_obj <- nrow(df)
  dist <- array(0, dim = c(no_obj, no_obj))
  for (i in 1:no_obj){
    for (j in 1:no_obj){
      dist_v <- 0
      if(i != j){
        for (v in 1:ncol(df)){
          dist_v <- dist_v + sqrt((df[i,v] - df[j,v])^2)
        }
      }
      dist[i,j] <- dist_v
    }
  }
  return(dist)
}
data("iris")
tic <- Sys.time()
dst <- euclidean(iris[,1:4])
time <- difftime(Sys.time(), tic, units = "secs")[[1]]
print(paste("Time taken [Euclidean]: ", time))
tic <- Sys.time()
dst <- daisy(iris[,1:4])
time <- difftime(Sys.time(), tic, units = "secs")[[1]]
print(paste("Time taken [Daisy]: ", time))
One option:
euclidean3 <- function(df) {
  require(data.table)
  n <- nrow(df)
  i <- CJ(1:n, 1:n) # generate all row combinations
  dl <- sapply(df, function(x) sqrt((x[i[[1]]] - x[i[[2]]])^2)) # loop over columns
  dv <- rowSums(dl) # sum values of columns
  d <- matrix(dv, n, n) # fill in matrix
  d
}
dst3 <- euclidean3(iris[,1:4])
all.equal(euclidean(iris[,1:4]), dst3) # TRUE
[1] "Time taken [Euclidean3]: 0.008"
[1] "Time taken [Daisy]: 0.002"
The largest bottleneck in your code is selecting data.frame elements inside the loop (df[j, v]). Changing df to a matrix could also improve speed. I believe a more performant approach probably already exists on Stack Overflow; you just need to search with the right keywords...
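To illustrate the matrix suggestion (a hedged sketch with the hypothetical name euclidean_mat; same arithmetic as the original loop, just indexing a matrix instead of a data.frame):
euclidean_mat <- function(df) {
  m <- as.matrix(df)            # convert once, outside the loops
  no_obj <- nrow(m)
  dist <- matrix(0, no_obj, no_obj)
  for (i in 1:no_obj) {
    for (j in 1:no_obj) {
      if (i != j) {
        dist[i, j] <- sum(sqrt((m[i, ] - m[j, ])^2))  # same per-variable terms as before
      }
    }
  }
  dist
}
all.equal(euclidean(iris[, 1:4]), euclidean_mat(iris[, 1:4]))  # should be TRUE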

How to return possible pairs of variables based on a function?

I have a pair of variables (x, y), and for each variable there is a possible range of values (xmin, xmax and ymin, ymax). I am looking for the pairs that, according to a given function, yield the same probability.
This is my function that would return probabilities.
f <- function(x, y) 1-exp(-(x^(1/0.9)+y^(1/0.9))^0.9)
Now suppose that for a certain probability, say 0.01, I want to know the possible pairs of x and y yielding it (subject to their constraints, i.e. the min and max values).
(What I have already tried is doing the whole thing the other way around by creating a matrix first for x and y and then for each combination I calculated the probability, but then I would need to find the same probabilities in the matrix, which seems to be even more difficult.)
So by doing some math (sorry latex formatting is not supported in SO):
P = 1 - exp(-(x^(1/0.9) + y^(1/0.9))^0.9)
ln(1-P) = -(x^(1/0.9) + y^(1/0.9))^0.9
(-ln(1-P))^(1/0.9) - y^(1/0.9) = x^(1/0.9)
((-ln(1-P))^(1/0.9) - y^(1/0.9))^0.9 = x
Now if we put it in some R code, and check when results do not exists :
get_x <- function(P, y)
{
  x = ((-log(1-P))^(1/0.9) - y^(1/0.9))^0.9
  # Verification of the results.
  # If a result is not real (then x[i] = NaN), or if it does not
  # match the given probability (which should never happen),
  # the result is set to NaN.
  # This verification is for debugging only and should be removed.
  for (i in c(1:length(y))){
    if(is.na(x[i]) | abs(P - 1 + exp(-(x[i]^(1/0.9) + y[i]^(1/0.9))^0.9)) > 0.00001)
    {
      x[i] = NaN
      print(paste0("Oops, something went wrong with y=", y[i]))
    }
  }
  return(x)
}
y_values=seq(0.01,0.99,by=0.001)
get_x(0.09,y_values)
This is pretty fast: only one loop (over y) is used instead of two loops to fill a matrix, so the cost is of order n rather than n^2.
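A hedged usage sketch (hypothetical names x_values and xy_pairs), collecting the valid (x, y) pairs for P = 0.09 from the y_values grid defined above:
x_values <- get_x(0.09, y_values)             # y_values as defined above
xy_pairs <- data.frame(x = x_values, y = y_values)
xy_pairs <- xy_pairs[!is.nan(xy_pairs$x), ]   # drop y's with no real solution
head(xy_pairs)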
Alternatively, we can calculate the probability for all possible combinations and build a data frame with the combinations that satisfy our criterion within some tolerance (for floating-point comparison):
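Here matrix2 is assumed to be the grid of probabilities over all (x, y) combinations; a minimal sketch of how it could be built (the x_values/y_values grids are hypothetical):
x_values <- seq(0.01, 0.99, by = 0.01)
y_values <- seq(0.01, 0.99, by = 0.01)
matrix2 <- outer(x_values, y_values, f)   # f(x, y) for every combination
rownames(matrix2) <- x_values
colnames(matrix2) <- y_values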
tol <- 0.0001
mat <- which((matrix2 >= 0.01 - tol) & (matrix2 <= 0.01 + tol), arr.ind = TRUE)
data.frame(comb1 = rownames(matrix2)[mat[, 1]], comb2 = colnames(matrix2)[mat[, 2]])

Implementing KNN with different distance metrics using R

I am working on a dataset in order to compare the effect of different distance metrics. I am using the KNN algorithm.
The KNN algorithm in R uses the Euclidean distance by default, so I wrote my own. I would like to find the number of correct class-label matches between the nearest neighbour and the target.
I prepared the data first, then called the data (wdbc_n) and chose K=1. I used the Euclidean distance as a test.
library(philentropy)
knn <- function(xmat, k, method){
  n <- nrow(xmat)
  if (n <= k) stop("k can not be more than n-1")
  neigh <- matrix(0, nrow = n, ncol = k)
  for(i in 1:n) {
    ddist <- distance(xmat, method)
    neigh[i, ] <- order(ddist)[2:(k + 1)]
  }
  return(neigh)
}
wdbc_nn <-knn(wdbc_n ,1,method="euclidean")
I am hoping to get results similar to those in the paper "On the Surprising Behavior of Distance Metrics in High Dimensional Space" (https://bib.dbvis.de/uploadedFiles/155.pdf, page 431, table 3).
My question is
Am I right or wrong with this code?
Any suggestions or reference that will guide me will be highly appreciated.
EDIT
My data (breast-cancer-wisconsin, wdbc) has dimensions
569 32
After normalizing and removing the id and target columns, the dimensions are
dim(wdbc_n)
569 30
The train and test split is given by
wdbc_train<-wdbc_n[1:469,]
wdbc_test<-wdbc_n[470:569,]
Am I right or wrong with this code?
Your code is wrong.
The call to the distance function took about 3 seconds every time on my fairly recent PC, so I only ran the first 30 rows with k=3, and I noticed that every row of the neigh matrix was identical. Why is that? Take a look at this line:
ddist <- distance(xmat, method)
Each iteration feeds the whole xmat matrix to the distance function and then uses only the first row of the resulting matrix. This calculates the distances between the training-set rows, does so n times, and discards every row except the first, which is not what you want. The knn algorithm is supposed to calculate, for each row in the test set, the distance to each row in the training set.
Let's take a look at the documentation for the distance function:
distance(x, method = "euclidean", p = NULL, test.na = TRUE, unit = "log", est.prob = NULL)
x: a numeric data.frame or matrix (storing probability vectors) or a numeric data.frame or matrix storing counts (if est.prob is specified).
(...)
in case nrow(x) = 2 : a single distance value.
in case nrow(x) > 2 : a distance matrix storing distance values for all pairwise probability vector comparisons.
In your specific case (knn classification), you want to use the 2 row version.
One last thing: you used order, which returns the positions (indices) of the sorted distances in the ddist vector, not the distances themselves. Since you want the distances, you need sort instead of order.
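A quick illustration of the difference between the two:
v <- c(0.40, 0.12, 0.33)
order(v)  # 2 3 1            -- positions of the values in ascending order
sort(v)   # 0.12 0.33 0.40   -- the values themselves, ascending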
Based on your code and the example in Lantz (2013) that your code seems to be based on, here is a complete working solution. I took the liberty of adding a few lines to make it a standalone program.
Standalone working solution(s)
library(philentropy)

normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

knn <- function(train, test, k, method){
  n.test <- nrow(test)
  n.train <- nrow(train)
  if (n.train + n.test <= k) stop("k can not be more than n-1")
  neigh <- matrix(0, nrow = n.test, ncol = k)
  ddist <- NULL
  for(i in 1:n.test) {
    for(j in 1:n.train) {
      xmat <- rbind(test[i,], train[j,]) # we make a 2 row matrix combining the current test and train rows
      ddist[j] <- distance(as.data.frame(xmat), method, k) # then we calculate the distance and append it to the ddist vector.
    }
    neigh[i, ] <- sort(ddist)[2:(k + 1)]
  }
  return(neigh)
}
wbcd <- read.csv("https://resources.oreilly.com/examples/9781784393908/raw/ac9fe41596dd42fc3877cfa8ed410dd346c43548/Machine%20Learning%20with%20R,%20Second%20Edition_Code/Chapter%2003/wisc_bc_data.csv")
rownames(wbcd) <- wbcd$id
wbcd$id <- NULL
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
wbcd_train<-wbcd_n[1:469,]
wbcd_test<-wbcd_n[470:549,]
wbcd_nn <-knn(wbcd_train, wbcd_test ,3, method="euclidean")
Do note that this solution might be slow because of the numerous (80 x 469) calls to the distance function. However, since we only feed 2 rows at a time into the distance function, the execution time stays manageable.
Now does that work?
The first two test rows using the custom knn function:
[,1] [,2] [,3]
[1,] 0.3887346 0.4051762 0.4397497
[2,] 0.2518766 0.2758161 0.2790369
Let us compare with the equivalent function in the FNN package:
library(FNN)
alt.class <- get.knnx(wbcd_train, wbcd_test, k=3, algorithm = "brute")
alt.class$nn.dist
[,1] [,2] [,3]
[1,] 0.3815984 0.3887346 0.4051762
[2,] 0.2392102 0.2518766 0.2758161
Conclusion: not too shabby.

Faster way to generate large list of vectors from permuted datasets [R]

Setup: For the purposes of my simulation, I'm generating a list of B = 2000 elements. Each element is the output of a permutation procedure in which I first permute the rows of a 200 x 8000 matrix and then, for each column, calculate the Kolmogorov-Smirnov test statistic between the first and second 100 rows (you can think of the first 100 rows as data from one group and the second 100 rows as data from another group).
Question: This process takes a very long time (about 30-40 minutes) to generate the list. Is there a much faster way? In the future, I'd like to increase B to a larger value.
Code:
#Modified KS-test function (from base package); defined first so the loop below can call it
ks.test.stat <- function(x, y){
  x <- x[!is.na(x)]
  n <- length(x)
  y <- y[!is.na(y)]
  n.x <- as.double(n)
  n.y <- length(y)
  w <- c(x, y)
  z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
  z <- z[c(which(diff(sort(w)) != 0), n.x + n.y)] #exclude ties
  STATISTIC <- max(abs(z))
  return(STATISTIC)
}
B = 2000
n.row = 200; n.col = 8000
#Generate sample data
samp.dat = matrix(rnorm(n.row*n.col), nrow = n.row)
perm.KS.list = NULL
for (b in 1:B){
  #permute the rows
  perm.dat.tmp = samp.dat[sample(nrow(samp.dat)), ]
  #Compute the permutation-based test statistics
  perm.KS.list[[b]] = apply(perm.dat.tmp, 2, function(y) ks.test.stat(y[1:100], y[101:200]))
}
The 1:B loop has several places to optimize, but I agree that the real time consumer is that inner function. Because you're simulating well-behaved permutation samples, you can make two simplifying assumptions that the general base function can't:
There aren't any missing values, which obviates the is.na() adjustments.
The two sides (i.e. x and y) have the same number of elements, so you don't need to count them separately. Instead of splitting y in the loop and then joining the pieces back together in the function (into w), just keep them together. The balanced sides also permit simplifications such as removing the ifelse() clause: it produces a bunch of 0/1s, which can be rescaled to -1/1s with integer arithmetic.
The reduced function saves about 25% of the time. I also used integers, rather than doubles, inside cumsum().
ks.test.stat.balanced <- function(w){
  n <- as.integer(length(w) * .5)
  # z <- cumsum(ifelse(order(w) <= n, 1L, -1L)) / n
  z <- cumsum((order(w) <= n)*2L - 1L) / n
  # z <- z[c(which(diff(sort(w)) != 0), n + n)] #exclude ties
  return( max(abs(z)) )
}
Ties shouldn't occur often with your Gaussian RNG, and the diff(sort(.)) step is very expensive. If you're willing to remove that protection, the time is reduced by about 65%.
If you move the equation for z inside abs(), it saves a little more time over all those reps. I kept it separate above so it's easier to read.
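For reference, the inlined variant would look like this (a hedged sketch with a hypothetical name):
ks.test.stat.balanced2 <- function(w){
  n <- as.integer(length(w) * .5)
  max(abs(cumsum((order(w) <= n)*2L - 1L) / n))  # same statistic, z folded into abs()
}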
Edit: in case of an unbalanced simulation, I'd recommend that you:
still leave out the is.na handling,
still pass w,
still keep as much as possible in integer rather than numeric arithmetic, but
now include arguments n1 and n2 for the two group sizes (a sketch follows below).
Also, experiment with precalculating 1/n1 and 1/n2 before cumsum() to avoid a lot of expensive divisions, and try to think of other math-y ways to pull calculations out of an inner loop so they occur less frequently.
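A hedged sketch of the unbalanced variant described above (hypothetical name and signature): w is the two groups concatenated, with the n1 group first.
ks.test.stat.unbalanced <- function(w, n1, n2){
  inv1 <- 1 / n1                                  # precalculate the reciprocals once
  inv2 <- 1 / n2
  z <- cumsum(ifelse(order(w) <= n1, inv1, -inv2))
  max(abs(z))                                     # tie protection omitted, as above
}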

Compute sample variance without loops

Here is what I want to do:
I have a time-series data frame with, let us say, 100 time series of length 600, each in one column of the data frame.
I want to pick 10 of the time series at random and assign them random weights that sum to one. Using those, I want to compute the variance of the sum of the 10 weighted time-series variables (i.e. a convex combination).
The df is in the form
v1,v2,v2.....v100
1,5,6,.......9
2,4,6,.......10
3,5,8,.......6
2,2,8,.......2
etc
I can compute it inside a loop, but R is vector-oriented and the loop is not efficient.
ntrials = 10000
ts.sd = NULL
for (x in 1:ntrials)
{
  temp = t(weights[, x]) %*% cov(df[, samples[, x]]) %*% weights[, x]
  ts.sd = cbind(ts.sd, temp)
}
Not sure what type of "random" you want for your weights... so I'll draw them from a uniform distribution and scale them so that they sum to one:
x = as.data.frame(matrix(sample(1:20, 100*600, replace=TRUE), ncol=100))
myfun <- function(inc, DF=x) {
  w = runif(10)
  w = w / sum(w)
  t(w) %*% cov(DF[, sample(seq_along(DF), 10)]) %*% w
}
lapply(1:ntrials, myfun)
However, this isn't really avoiding loops per se, since lapply is just an efficient looping construct. That said, for loops in R aren't inherently bad or inefficient. Growing a data structure, as you're doing with cbind, however, is.
But in this case, since you're only growing it by appending a single element at a time, it really won't change things much. The "correct" version would be to pre-allocate your vector ts.sd using ntrials:
ts.sd = vector(mode='numeric', length=ntrials)
Then, in your loop, assign into it using the loop index:
for (x in 1:ntrials)
{
  temp = t(weights[, x]) %*% cov(df[, samples[, x]]) %*% weights[, x]
  ts.sd[x] = temp
}
