Computing pairwise distances between a set of intervals - r

Let's say I have a set of closed linear intervals represented by this matrix:
interval.mat = matrix(c(1,2,3,5,4,6,8,9), byrow = TRUE, ncol = 2)
where interval.mat[,1] are the interval start points and interval.mat[,2] are their corresponding end points.
I'm looking for an efficient way to produce a matrix that holds all the pairwise non-negative distances between the intervals (this example matrix is a toy; in reality my matrix contains a few thousand intervals). The distance between a pair of intervals should be the start of the interval with the larger end minus the end of the interval with the smaller end. For example, the distance between intervals c(1,2) and c(3,5) should be 3 - 2 = 1, since the second interval ends after the first one. If the intervals overlap, the distance should be 0; for example, for c(3,5) and c(4,6) the distance is 0.
So, the pairwise distance matrix for the intervals above would be:
> matrix(c(0,1,2,6,1,0,0,3,2,0,0,2,6,3,2,0), byrow = TRUE, nrow = 4, ncol = 4)
     [,1] [,2] [,3] [,4]
[1,]    0    1    2    6
[2,]    1    0    0    3
[3,]    2    0    0    2
[4,]    6    3    2    0

Here's an Rcpp solution. It will be fast and memory efficient (for details see below).
First let's define a helper function which calculates all the pairwise distances. If n is the number of intervals to consider, we have n*(n-1)/2 unique pairs of intervals (we don't pair an interval with itself, of course, as that distance is 0).
library('Rcpp')
library('inline')
cppFunction("
   NumericVector distint_help(NumericMatrix x) {
      int n = x.nrow(); // number of rows (intervals)
      NumericVector out(n*(n-1)/2); // result numeric vector, one entry per unique pair
      int k = 0;
      for (int i=0; i<n-1; ++i) {
         for (int j=i+1; j<n; ++j) {
            if (x(i,0) >= x(j,1))        // interval i lies to the right of interval j
               out[k++] = x(i,0)-x(j,1);
            else if (x(j,0) > x(i,1))    // interval j lies to the right of interval i
               out[k++] = x(j,0)-x(i,1);
            else                         // the intervals overlap
               out[k++] = 0.0;
         }
      }
      return out;
   }
")
The above function returns a numeric vector with the calculated distances. Let's try to mimic the output of the built-in dist function (check out the result of x <- dist(interval.mat); unclass(x)).
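To make the target layout concrete, here is a small sketch that inspects what base R's dist() returns for the toy matrix. The helper name pos is hypothetical; its formula is the one given on the ?dist help page. The pairs are stored in the order (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), which is the same order in which the C++ loops above fill the output vector.

```r
# Toy input from the question
m <- matrix(c(1, 2, 3, 5, 4, 6, 8, 9), byrow = TRUE, ncol = 2)
n <- nrow(m)

d <- stats::dist(m)     # ordinary Euclidean distances, used only to inspect the layout
v <- unclass(d)         # plain numeric vector plus attributes such as Size

# Position of the pair (i, j), with i < j, inside the vector (formula from ?dist).
# 'pos' is a hypothetical helper name used only for this illustration.
pos <- function(i, j, n) n * (i - 1) - i * (i - 1) / 2 + j - i

size_ok <- length(v) == n * (n - 1) / 2             # one entry per unique pair
same_el <- v[pos(2, 4, n)] == as.matrix(d)[2, 4]    # same element, addressed two ways
attr(d, "Size")                                     # number of observations
```

Because only these attributes and the class distinguish a dist object from a plain vector, setting them with structure() should be enough for print() and as.matrix() to treat the result accordingly.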
Now the main function:
distint <- function(interval) {
   stopifnot(is.numeric(interval), is.matrix(interval), ncol(interval) == 2)
   res <- distint_help(interval) # use Rcpp to calculate the distances
   # return the result similar to the one of dist()
   structure(res, class='dist', Size=nrow(interval), Diag=FALSE, Upper=FALSE)
}
distint(interval.mat)
##   1 2 3
## 2 1
## 3 2 0
## 4 6 3 2
The above object may be converted to an "ordinary" square matrix:
as.matrix(distint(interval.mat))
##   1 2 3 4
## 1 0 1 2 6
## 2 1 0 0 3
## 3 2 0 0 2
## 4 6 3 2 0
Unless the distance matrix is sparse (i.e., it contains many zeros), the above solution is storage efficient: only the n*(n-1)/2 entries for the unique pairs are stored.
A benchmark:
test <- matrix(runif(1000), ncol=2)
library('microbenchmark')
library(proxy)
f <- function(x,y) max(min(x)-max(y),0)
microbenchmark(distint(test), as.matrix(dist(test, method=f)), times=10)
## Unit: milliseconds
## expr min lq median uq max neval
## distint(test) 1.584548 1.615146 1.650645 3.071433 3.164231 10
## as.matrix(dist(test, method = f)) 455.300974 546.438875 551.596582 599.977164 609.418194 10

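You can use the proxy package, which has a dist(...) method that accepts a user-defined distance function. Note that loading this library masks the dist(...) function in base R.
library(proxy)
# gap between intervals x = c(start, end) and y = c(start, end); written symmetrically so that
# the result does not depend on argument order, and 0 when the intervals overlap
f <- function(x,y) max(x[1]-y[2], y[1]-x[2], 0)
as.matrix(dist(interval.mat,method=f))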
#   1 2 3 4
# 1 0 1 2 6
# 2 1 0 0 3
# 3 2 0 0 2
# 4 6 3 2 0
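As a cross-check for either answer, the distance rule from the question can also be written in base R with outer(). This is only a sketch: it builds full n x n matrices, so it uses more memory than the Rcpp version, and the variable names s, e, gap and d are made up for the illustration.

```r
# Toy intervals from the question: column 1 = start, column 2 = end
interval.mat <- matrix(c(1, 2, 3, 5, 4, 6, 8, 9), byrow = TRUE, ncol = 2)
s <- interval.mat[, 1]
e <- interval.mat[, 2]

# gap[i, j] = start of interval i minus end of interval j
gap <- outer(s, e, "-")

# The distance is the positive gap in whichever direction it exists;
# overlapping intervals have no positive gap in either direction, hence 0
d <- pmax(gap, t(gap), 0)
d
```

For a few thousand intervals this should still be workable, but the compiled loop avoids the intermediate matrices.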

Related

Confusion matrix using table in k-means and hierarchical clustering

I have some problems with calculating a confusion matrix. I have created three sets of points from multivariate normal distributions:
library('MASS')
library('ggplot2')
library('reshape2')
library("ClusterR")
library("cluster")
library("dplyr")
library ("factoextra")
library("dendextend")
library("circlize")
mu1<-c(1,1)
mu2<-c(1,-9)
mu3<-c(-7,-2)
sigma1<-matrix(c(1,1,1,2), nrow=2, ncol=2, byrow = TRUE)
sigma2<-matrix(c(1,-1,-1,2), nrow=2, ncol=2, byrow = TRUE)
sigma3<-matrix(c(2,0.5,0.5,0.3), nrow=2, ncol=2, byrow = TRUE)
simulation1<-mvrnorm(100,mu1,sigma1)
simulation2<-mvrnorm(100,mu2,sigma2)
simulation3<-mvrnorm(100,mu3,sigma3)
X<-rbind(simulation1,simulation2,simulation3)
colnames(X)<-c("x","y")
X<-data.frame(X)
I have also constructed clusters using k-means clustering and hierarchical clustering with k initial centers (k=3):
# k-means clustering
k<-3
B<-kmeans(X, centers = k, nstart = 10)
x_cluster = data.frame(X, group=factor(B$cluster))
ggplot(x_cluster, aes(x, y, color = group)) + geom_point()
# hierarchical clustering
single<-hclust(dist(X), method = "single")
clusters2<-cutree(single, k = 3)
fviz_cluster(list (data = X, cluster=clusters2))
How can I calculate a confusion matrix for the full dataset (X) using table in both of these cases?
Using your data, insert set.seed(42) just before you create sigma1 so that we have a reproducible example. Then after you created X:
X.df <- data.frame(Grp=rep(1:3, each=100), x=X[, 1], y=X[, 2])
k <- 3
B <- kmeans(X, centers = k, nstart = 10)
table(X.df$Grp, B$cluster)
#
#       1   2   3
#   1   1   0  99
#   2   0 100   0
#   3 100   0   0
Original group 1 is identified as group 3, with one specimen assigned to group 1. Original group 2 is assigned to group 2 and original group 3 is assigned to group 1. The group numbers are irrelevant. The classification is perfect if each row/column contains all values in a single cell. In this case only 1 specimen was misplaced.
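Because the cluster numbers are arbitrary, one rough way to summarise such a table is to credit each original group with its largest cell. The sketch below hard-codes the table shown above; the names tab, best and agreement are made up for the illustration, and the measure assumes that each group is dominated by a different cluster.

```r
# Cross-tabulation copied from the k-means output above
# (rows: original group, columns: k-means cluster)
tab <- matrix(c(  1,   0, 99,
                  0, 100,  0,
                100,   0,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(Grp = 1:3, Cluster = 1:3))

best <- apply(tab, 1, max)         # largest cell in each row
agreement <- sum(best) / sum(tab)  # share of specimens in their group's main cluster
agreement
```

With a real run you would pass table(X.df$Grp, B$cluster) instead of the hard-coded matrix.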
single <- hclust(dist(X), method = "single")
clusters2 <- cutree(single, k = 3)
table(X.df$Grp, clusters2)
#    clusters2
#       1   2   3
#   1  99   1   0
#   2   0   0 100
#   3   0 100   0
The results are the same, but the cluster numbers are different. One specimen from the original group 1 was assigned to the same group as the group 3 specimens. To compare these results:
table(Kmeans=B$cluster, Hierarch=clusters2)
#       Hierarch
# Kmeans   1   2   3
#      1   0 101   0
#      2   0   0 100
#      3  99   0   0
Notice that each row/column contains only one cell that is nonzero. The two cluster analyses agree with one another even though the cluster designations differ.
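If you prefer to see the agreement on the diagonal, the columns can be reordered so that each row's dominant cluster comes first. This is a sketch on the hard-coded table above; it only works when every row has a different dominant column (so that ord is a permutation), and the names tab, ord and aligned are hypothetical.

```r
# Cross-tabulation copied from the comparison above
tab <- matrix(c( 0, 101,   0,
                 0,   0, 100,
                99,   0,   0),
              nrow = 3, byrow = TRUE,
              dimnames = list(Kmeans = 1:3, Hierarch = 1:3))

# Column holding the largest count in each row
ord <- max.col(tab, ties.method = "first")

# Reorder the columns so that matching clusters line up on the diagonal
aligned <- tab[, ord]
aligned
```

After reordering, sum(diag(aligned)) / sum(aligned) gives the share of specimens on which the two clusterings agree.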
D <- lda(Grp~x + y, X.df)
table(X.df$Grp, predict(D)$class)
#
#       1   2   3
#   1  99   0   1
#   2   0 100   0
#   3   0   0 100
Linear discriminant analysis tries to predict the group of each specimen given the values of x and y. Because of this, the predicted group numbers are not arbitrary and the correct predictions all fall on the diagonal of the table. This is what is usually described as a confusion matrix.

How to build a graph using random distance matrix?

I want to generate random transportation graphs in R with n vertices.
The graphs should be bipartite and connected (there is a path between any 2 vertices, though not usually a direct one), without "loops".
In addition, the edge weights should be random and strictly positive.
I tried doing the following :
n=6 # number of vertices
F <- erdos.renyi.game(n, p.or.m=0.5, directed=FALSE)
m=ecount(F)
min = 1 # 1 km
max = 50 # 50 km
F <- set.edge.attribute(F, name="distance", value=runif(m , min , max))
plot(F, layout=layout.fruchterman.reingold)
distances(F)
The problem is that I failed to get the wanted distances in the graph distance matrix:
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    2    1    1    2    1
[2,]    2    0    2    1    1    1
[3,]    1    2    0    1    1    1
[4,]    1    1    1    0    2    1
[5,]    2    1    1    2    0    1
[6,]    1    1    1    1    1    0
Normally, I should get uniform random values between 1 and 50.
Also, I'm not sure whether the graph could contain a loop if I run this code again in the future.
I hope my questions are clear.
Thanks for any help!
The solution:
library(igraph)
n=6 # number of vertices
F <- erdos.renyi.game(n, p.or.m=0.5, directed=FALSE) # we create the graph F
m=ecount(F) # number of edges we obtained using erdos.renyi.game()
min = 1 # 1 km # min and max values for the edge distances
max = 50 # 50 km
F <- set.edge.attribute(F, name="distance", value=runif(m , min , max)) # the distances are drawn at random
plot(F, layout=layout.fruchterman.reingold) # we plot the graph F
distances(F) # matrix of lengths (number of edges) of the shortest paths between i & j
distances(F, weights = E(F)$distance) # matrix of total distances along the shortest paths
distances(F, v = 1, to = 6, weights = E(F)$distance) # same if we want a specific minimum path between two vertices (e.g. 1--->6)
get.all.shortest.paths(F, 1, to = V(F)) # list of all shortest paths
graph.maxflow(F, 1, 6) # the maximum possible flow between 1--->6
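A note on why the original distances(F) call returned hop counts: as far as I know, igraph only uses edge weights automatically when they are stored in an edge attribute named weight; an attribute called distance is ignored unless it is passed through the weights argument. The sketch below stores the random values under weight and also checks the two properties the question worries about. It uses the newer function names sample_gnp(), which_loop() and is_connected(); the seed and the names g, d, no_loops and connected are arbitrary choices for the illustration.

```r
library(igraph)

set.seed(1)                                   # arbitrary seed, for reproducibility only
n <- 6                                        # number of vertices
g <- sample_gnp(n, p = 0.5, directed = FALSE) # random graph; loops = FALSE is the default

# Store the random lengths under the name "weight" so that distances() picks them up
E(g)$weight <- runif(ecount(g), min = 1, max = 50)

d <- distances(g)                # weighted shortest-path lengths
no_loops  <- !any(which_loop(g)) # TRUE when no edge joins a vertex to itself
connected <- is_connected(g)     # a random graph is not guaranteed to be connected
```

If connected comes out FALSE, one simple option is to redraw the graph until it is TRUE. Nothing here makes the graph bipartite; that would need a different generator.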

Better way to apply rolling function to zoo or xts object?

I am wondering if there isn’t a more elegant way to do this. I tried rollapply but could never get it to respond to more than the first column of the zoo object.
I want to access a 2-dimensional zoo or xts object, create a rolling window that includes all columns, perform some operation on each instance of the rolling window, and return a matrix containing the result of the operation on each of the rolling windows. I want the operation on a windowed snippet to be assignable to a function that I externally define.
Here is an example that works, but is not very elegant:
rolling_function <- function(my_data, w, FUN = my_func)
{
    ## Produce a rolling window of width w starting at
    ## w, ending at nrow(my_data), with window width w.
    ## FUN is some function passed that performs some
    ## operation on 'snippet' and returns a value for
    ## each column of snippet. That is assembled into
    ## a matrix and returned.
    ## Set up a matrix to hold results
    results <- matrix(ncol = ncol(my_data),
                      nrow = (nrow(my_data) - w + 1))
    nn <- nrow(my_data)
    for(jstart in 1:(nn - w + 1))
    {
        snippet <- window(my_data,
                          start = index(my_data[jstart]),
                          end = index(my_data[jstart + w - 1]))
        ## Do something with snippet here
        # print(my_func(snippet))
        results[jstart, ] <- FUN(snippet)
    }
    return(results)
}
my_func <- function(x)
{
    # An example function that takes the difference between
    # the first and last rows of the snippet, x
    result <- as.vector(x[1,]) - as.vector(x[nrow(x),])
    return(result)
}
A small test case is given below:
## Main code
library(zoo)
## Define a zoo object with dummy dates
my_data <- zoo(matrix(data = c(1,5,6,5,3,7,8,8,8,2,4,5),
                      nrow = 4, ncol = 3), order.by = as.Date(100:103))
## Define a window width of 2 and call the rolling function
width <- 2
print(rolling_function(my_data, width))
The test zoo object is:
1970-04-11 1 3 8
1970-04-12 5 7 2
1970-04-13 6 8 4
1970-04-14 5 8 5
and the test output is:
     [,1] [,2] [,3]
[1,]   -4   -4    6
[2,]   -1   -1   -2
[3,]    1    0   -1
Is there a more elegant/straightforward/faster way to perform this operation, perhaps using rollapply (I could not make this work)?
Assuming the input z shown reproducibly in the Note at the end, if the width is 2 then:
library(zoo)
-diff(z)
## V2 V3 V4
## 1970-04-12 -4 -4 6
## 1970-04-13 -1 -1 -2
## 1970-04-14 1 0 -1
and in general:
w <- 2 # modify as needed
-diff(z, w-1)
## V2 V3 V4
## 1970-04-12 -4 -4 6
## 1970-04-13 -1 -1 -2
## 1970-04-14 1 0 -1
or using rollapplyr:
w <- 2 # modify as needed
rollapplyr(z, w, function(x) x[1] - x[w])
## V2 V3 V4
## 1970-04-12 -4 -4 6
## 1970-04-13 -1 -1 -2
## 1970-04-14 1 0 -1
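For an arbitrary function that needs the whole window (all columns at once), which is what the question's rolling_function does, rollapply has a by.column argument. The sketch below assumes that setting by.column = FALSE is what was missing in the original attempt; my_func is the function from the question and the data are rebuilt inline so the example is self-contained.

```r
library(zoo)

# Test data from the question (origin given explicitly so as.Date works without relying on zoo's method)
z <- zoo(matrix(c(1, 5, 6, 5, 3, 7, 8, 8, 8, 2, 4, 5), nrow = 4, ncol = 3),
         order.by = as.Date(100:103, origin = "1970-01-01"))

# Function from the question: first row of the window minus its last row
my_func <- function(x) as.vector(x[1, ]) - as.vector(x[nrow(x), ])

w <- 2  # window width
# by.column = FALSE hands every column of each window to my_func in one call
res <- rollapplyr(z, w, my_func, by.column = FALSE)
res
```

The result is again a zoo object indexed by the last date of each window, so it can be converted with coredata(res) if a plain matrix is needed.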
Note
Lines <- "
1970-04-11 1 3 8
1970-04-12 5 7 2
1970-04-13 6 8 4
1970-04-14 5 8 5"
library(zoo)
z <- read.zoo(text = Lines)

conditional which.min function

I have two sets of data: one holds the coordinates of machines, the other the coordinates of repair shops (stores).
I have a working model that assigns each machine to the nearest store. However, one store has only 1 machine assigned to it and another has 7.
What I want is to add a condition so that each store is assigned at least 2 machines but no more than 4.
library(geosphere)
library(ggplot2)
#machine Locations
machine.x <- c(-122.37, -111.72, -111.87, -112.05, -87.17, -86.57, -86.54, -88.04, -86.61, -88.04, -86.61)
machine.y <- c(37.56, 35.23, 33.38, 33.57, 30.36, 30.75, 30.46, 30.68, 30.42, 30.68, 30.42)
machines <- data.frame(machine.x, machine.y)
#store locations
store.x <- c(-121.98, -112.17, -86.57)
store.y <- c(37.56, 33.59, 30.75)
stores <- data.frame(store.x, store.y)
centers<-data.frame(x=stores$store.x, y=stores$store.y)
pts<-data.frame(x=(machines$machine.x), y=(machines$machine.y))
#allocate space
distance<-matrix(-1, nrow = length(pts$x), ncol= length(centers$x))
#calculate the distance matrix from the defined centers to each point
#columns represent centers and the rows are the data points
dm<-apply(data.frame(1:length(centers$x)), 1, function(x){ replace(distance[,x], 1:length(pts$x), distGeo(centers[x,], pts))})
#find the column with the smallest distance
closestcenter<-apply(dm, 1, which.min)
#color code the original data for verification
colors<-c(stores)
#create a scatter plot of assets color coded by which fe they belong to
plot(pts, col=closestcenter, pch=9)
So what I want is for each group to have a minimum count of 2 and a maximum count of 4. I tried adding an if/else statement to the closestcenter variable, but it didn't come close to working the way I thought it would. I've looked around online but can't find any way to add a counting condition to the which.min statement.
Note: My actual data set has several thousand machines and over 100 stores.
Let M be an 11 x 3 zero-one matrix where M[i,j] = 1 if machine i is assigned to store j and 0 otherwise. Then the rows of M must each sum to 1, the columns must each sum to between 2 and 4 inclusive, and we want to choose the M that minimizes the total distance, sum(M * dm). This gives the 0-1 linear program shown below. Here A is such that A %*% c(M) is the same as rowSums(M), and B is such that B %*% c(M) is the same as colSums(M).
library(lpSolve)
k <- 3
n <- 11
dir <- "min"
objective.in <- c(dm)
A <- t(rep(1, k)) %x% diag(n)
B <- diag(k) %x% t(rep(1, n))
const.mat <- rbind(A, B, B)
const.dir <- c(rep("==", n), rep(">=", k), rep("<=", k))
const.rhs <- c(rep(1, n), rep(2, k), rep(4, k))
res <- lp(dir, objective.in, const.mat, const.dir, const.rhs, all.bin = TRUE)
res
## Success: the objective function is 9025807
soln <- matrix(res$solution, n, k)
and this solution:
> soln
      [,1] [,2] [,3]
 [1,]    1    0    0
 [2,]    1    0    0
 [3,]    0    1    0
 [4,]    0    1    0
 [5,]    0    1    0
 [6,]    0    0    1
 [7,]    0    0    1
 [8,]    1    0    0
 [9,]    0    0    1
[10,]    0    1    0
[11,]    0    0    1
or in terms of the vector of store numbers assigned to each machine:
c(soln %*% (1:k))
## [1] 1 1 2 2 2 3 3 1 3 2 3
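The Kronecker-product construction of A and B can be checked on a small made-up assignment matrix, and the capacity limits can be verified on the assignment vector printed above. This is only a sanity-check sketch: the 4 x 2 matrix M is hypothetical, and the assignment vector is copied by hand from the output.

```r
# Hypothetical toy problem: 4 machines, 2 stores
n <- 4
k <- 2
M <- matrix(c(1, 0, 1, 0,    # column 1: machines assigned to store 1
              0, 1, 0, 1),   # column 2: machines assigned to store 2
            nrow = n, ncol = k)

A <- t(rep(1, k)) %x% diag(n)   # n x (n*k): picks out row sums of M
B <- diag(k) %x% t(rep(1, n))   # k x (n*k): picks out column sums of M

rows_ok <- all(A %*% c(M) == rowSums(M))
cols_ok <- all(B %*% c(M) == colSums(M))

# Store numbers per machine, copied from the solution above
assignment <- c(1, 1, 2, 2, 2, 3, 3, 1, 3, 2, 3)
per_store <- table(assignment)  # machines per store
per_store
```

Every store ends up with between 2 and 4 machines, as the constraints require.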

R: condense indexes

I have a vector like the following:
xx <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1)
I want to find the indexes that have ones and combine them together. In this case, I want the output to look like 1 6 and 11 14 in a 2x2 matrix. My vector is actually very long so I can't do this by hand. Can anyone help me with this? Thanks.
Since the question originally had a tag 'bioinformatics' I'll mention the Bioconductor package IRanges (and its companion for ranges on genomes, GenomicRanges):
> library(IRanges)
> xx <- c(1,1,1,1,1,1,0,0,0,0,1,1,1,1)
> sl = slice(Rle(xx), 1)
> sl
Views on a 14-length Rle subject
views:
start end width
[1] 1 6 6 [1 1 1 1 1 1]
[2] 11 14 4 [1 1 1 1]
which could be coerced to a matrix, but that would often not be convenient for whatever the next step is
> matrix(c(start(sl), end(sl)), ncol=2)
     [,1] [,2]
[1,]    1    6
[2,]   11   14
Other operations might start on the Rle, e.g.,
> xx = c(2,2,2,3,3,3,0,0,0,0,4,4,1,1)
> r = Rle(xx)
> m = cbind(start(r), end(r))[runValue(r) != 0,,drop=FALSE]
> m
     [,1] [,2]
[1,]    1    3
[2,]    4    6
[3,]   11   12
[4,]   13   14
See the help page ?Rle for the full flexibility of the Rle class; to go from a matrix like that above to a new Rle as asked in the comment below, one might create a new Rle of appropriate length and then subset-assign using an IRanges as index
> r = Rle(0L, max(m))
> r[IRanges(m[,1], m[,2])] = 1L
> r
integer-Rle of length 14 with 3 runs
Lengths: 6 4 4
Values : 1 0 1
One could expand this to a full vector
> as(r, "integer")
[1] 1 1 1 1 1 1 0 0 0 0 1 1 1 1
but often it's better to continue the analysis on the Rle. The class is very flexible, so one way of going from xx to an integer vector of 1's and 0's is
> as(Rle(xx) > 0, "integer")
[1] 1 1 1 1 1 1 0 0 0 0 1 1 1 1
Again, though, it often makes sense to stay in Rle space. And Arun's answer to your separate question is probably best of all.
Performance (speed) is important, although in this case I think the Rle class provides a lot of flexibility that would outweigh poor performance, and ending up at a matrix is an unlikely end-point for a typical analysis. Nonetheless, the IRanges infrastructure is performant:
eddi <- function(xx)
    matrix(which(diff(c(0,xx,0)) != 0) - c(0,1),
           ncol = 2, byrow = TRUE)
iranges = function(xx) {
    sl = slice(Rle(xx), 1)
    matrix(c(start(sl), end(sl)), ncol=2)
}
iranges.1 = function(xx) {
    r = Rle(xx)
    cbind(start(r), end(r))[runValue(r) != 0, , drop=FALSE]
}
with
> xx = sample(c(0, 1), 1e5, TRUE)
> microbenchmark(eddi(xx), iranges(xx), iranges.1(xx), times=10)
Unit: milliseconds
expr min lq median uq max neval
eddi(xx) 45.88009 46.69360 47.67374 226.15084 234.8138 10
iranges(xx) 112.09530 114.36889 229.90911 292.84153 294.7348 10
iranges.1(xx) 31.64954 31.72658 33.26242 35.52092 226.7817 10
Something like this, maybe?
if (xx[1] == 1) {
    rr <- cumsum(c(0, rle(xx)$lengths))
} else {
    rr <- cumsum(rle(xx)$lengths)
}
if (length(rr) %% 2 == 1) {
    rr <- head(rr, -1)
}
oo <- matrix(rr, ncol=2, byrow=TRUE)
oo[, 1] <- oo[, 1] + 1
     [,1] [,2]
[1,]    1    6
[2,]   11   14
This edit takes care of cases where (1) the vector starts with a 0 rather than a 1 and (2) the number of run boundaries in rr is odd, as happens when the vector ends with a run of 0s. For example: xx <- c(1,1,1,1,1,1,0,0,0,0).
Another, short one:
cbind(start = which(diff(c(0, xx)) == +1),
end = which(diff(c(xx, 0)) == -1))
# start end
# [1,] 1 6
# [2,] 11 14
I tested on a very long vector and it is marginally slower than using rle. But more readable IMHO. If speed were really a concern, you could also do:
xx.diff <- diff(c(0, xx, 0))
cbind(start = which(head(xx.diff, -1) == +1),
end = which(tail(xx.diff, -1) == -1))
# start end
# [1,] 1 6
# [2,] 11 14
Here's another solution that's built upon the others' ideas, and is a bit shorter and faster:
matrix(which(diff(c(0,xx,0)) != 0) - c(0,1), ncol = 2, byrow = TRUE)
# [,1] [,2]
#[1,] 1 6
#[2,] 11 14
I didn't test the non-base solution, but here's a comparison of base ones:
xx = sample(c(0,1), 1e5, TRUE)
microbenchmark(arun(xx), flodel(xx), flodel.fast(xx), eddi(xx))
#Unit: milliseconds
# expr min lq median uq max neval
# arun(xx) 14.021134 14.181134 14.246415 14.332655 15.220496 100
# flodel(xx) 12.885134 13.186254 13.248334 13.432974 14.367695 100
# flodel.fast(xx) 9.704010 9.952810 10.063691 10.211371 11.108171 100
# eddi(xx) 7.029448 7.276008 7.328968 7.439528 8.361609 100
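The benchmark calls functions named arun, flodel, flodel.fast and eddi without showing their definitions. A plausible reconstruction, assuming each one simply wraps the corresponding snippet from the answers above, is sketched below together with a check that all four agree on the example vector.

```r
# Assumed wrappers: each body is the snippet from the matching answer above
arun <- function(xx) {
  r <- rle(xx)$lengths
  rr <- if (xx[1] == 1) cumsum(c(0, r)) else cumsum(r)
  if (length(rr) %% 2 == 1) rr <- head(rr, -1)   # drop the boundary of a trailing run of 0s
  oo <- matrix(rr, ncol = 2, byrow = TRUE)
  oo[, 1] <- oo[, 1] + 1
  oo
}
flodel <- function(xx) {
  cbind(start = which(diff(c(0, xx)) == +1),
        end   = which(diff(c(xx, 0)) == -1))
}
flodel.fast <- function(xx) {
  xx.diff <- diff(c(0, xx, 0))
  cbind(start = which(head(xx.diff, -1) == +1),
        end   = which(tail(xx.diff, -1) == -1))
}
eddi <- function(xx) {
  matrix(which(diff(c(0, xx, 0)) != 0) - c(0, 1), ncol = 2, byrow = TRUE)
}

# All four should return the same start/end pairs for the example vector
xx <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1)
results <- lapply(list(arun, flodel, flodel.fast, eddi),
                  function(f) unname(f(xx)))
agree <- all(sapply(results, function(r) all(r == results[[1]])))
agree
```

With these definitions in place the microbenchmark call above can be run as written, although the timings will of course differ by machine.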
