R: Faster way of computing large distance matrix

I am computing a distance matrix between a large number of locations (5,000) on a sphere, using the Haversine distance function.
Here is my code:
require(geosphere)
x=rnorm(5000)
y=rnorm(5000)
xy1=cbind(x,y)
The time taken for computing the distance matrix is
system.time( outer(1:nrow(xy1), 1:nrow(xy1), function(i,j) distHaversine(xy1[i,1:2],xy1[j,1:2])))
This takes a long time to run. Any suggestions on how to reduce the computation time? Thanks.

Try the built-in function in the geosphere package?
z <- distm( xy1 )
The default distance function for distm() - which calculates a distance matrix between a set of points - is the Haversine ("distHaversine") formula, but you may specify another using the fun argument.
On my 2.6GHz Core i7 rMBP this takes about 5 seconds for 5,000 points.
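For example (a minimal sketch, not part of the original answer), another formula can be plugged in via the fun argument:
# use the spherical Vincenty formula instead of the default Haversine
z2 <- distm(xy1, fun = distVincentySphere)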

Below is a solution using the spatialrisk package. The key functions in this package are written in C++ (Rcpp), and are therefore very fast.
library(geosphere)
library(spatialrisk)
library(data.table)
x=rnorm(5000)
y=rnorm(5000)
xy1 = data.table(x,y)
# Cross join two data tables
coordinates_dt <- optiRum::CJ.dt(xy1, xy1)
system.time({
z <- distm( xy1 )
})
# user system elapsed
# 14.163 3.700 19.072
system.time({
distances_m <- coordinates_dt[, dist_m := spatialrisk::haversine(y, x, i.y, i.x)]
})
# user system elapsed
# 2.027 0.848 2.913

Related

How to make a fast pairwise Tanimoto distance function in R

I have a data.frame of items identified by an integer property ID, which is also the row number of the data.frame.
Each item has a vector of features FP associated to it. The elements of each FP are unique (within that FP). So for instance c(1,2,7) but never c(1,7,7).
The Tanimoto distance between any two ID's is defined as 1 minus the number of unique elements in the intersection of their FP's, divided by the number of unique elements in the union of their FP's.
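For example (a small worked illustration, not part of the original question):
FP1 <- c(1, 2, 7)
FP2 <- c(2, 7, 9, 12)
int <- length(intersect(FP1, FP2))  # 2 elements in common
uni <- length(union(FP1, FP2))      # 5 distinct elements in total
1 - int / uni                       # Tanimoto distance = 0.6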
I need to calculate such distances in the context of a 'maxmin' algorithm. See for instance this blog post.
The most important point to note is that I must NOT compute a full distance matrix (even with the best algorithms it would be infeasible on the scale of the datasets I am working with).
As explained in the above post, the strength of the iterative maxmin picker according to Roger Sayle's method is that one can avoid computing most of the pairwise distances, and instead calculate only the few relevant ones. Hence my question.
Here's what I could come up with so far:
# make a random dataset
set.seed(1234567)
d <- sample(30:45, 1000, replace = T)
dd <- setNames(data.frame(do.call(rbind, sapply(d,function(n) list(sample(as.character(1:(45*2)), n, replace = F)), simplify = F))), "FP")
dd["ID"] <- 1:NROW(dd)
# define a pairwise distance function for ID's
distfun <- function(ID1,ID2) {
FP1 <- dd$FP[[ID1]]
FP2 <- dd$FP[[ID2]]
int <- length(intersect(FP1,FP2))
1 - int/(d[ID1]+d[ID2]-int)
}
# test performance of distance function
x <- sample(dd$ID, 200, replace = F)
y <- sample(dd$ID[!(dd$ID %in% x)], 200, replace = F)
pairwise.dist <- NULL
system.time(
for(i in x) {
for (j in y) {
dij <- distfun(i,j)
#pairwise.dist <- rbind(pairwise.dist,c(min(i,j),max(i,j),dij))
}
}
)
# user system elapsed
# 0.86 0.00 0.86
Question 1: do you think the distance function could be made faster?
I tried making a sparse matrix of the features (ddu.tab in the code below, where I omitted the denominator, which is trivial to compute from the intersection) and defining the distance function as vector operations, but that was much slower (a bit to my surprise, I must say).
ddu <- do.call(rbind, sapply(dd$ID, function(x) {data.frame("ID"=x, "FP"=dd$FP[[x]], stringsAsFactors = F)}, simplify = F))
ddu.tab <- xtabs(~ID+FP, ddu, sparse = T)
system.time(
for(i in x) {
for (j in y) {
dij <- t(ddu.tab[i,]) %*% ddu.tab[j,]
#pairwise.dist <- rbind(pairwise.dist,c(min(i,j),max(i,j),dij))
}
}
)
# user system elapsed
# 32.35 0.03 32.66
Question 2: actually less important than the distance calculation, but if anyone can advise... Updating pairwise.dist with rbind is (apparently) very costly. I don't know if I can do it differently (i.e. not growing the object at each iteration), because in the maxmin application the pairs of ID's whose distances are to be calculated are not known upfront as in this example, and pairwise.dist is continuously read and has new elements appended to it.
Someone in the past suggested to me that lists may be better than matrices for read/write. If that is the case, I could write out pairwise.dist as a named list.
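For what it's worth, a minimal sketch of that named-list idea (purely illustrative, not benchmarked here):
pairwise.dist <- list()
for (i in x) {
  for (j in y) {
    # key each entry by the ordered ID pair instead of rbind-ing a growing matrix
    pairwise.dist[[paste(min(i, j), max(i, j), sep = "_")]] <- distfun(i, j)
  }
}
# assemble into a data.frame only once, at the end
pairs.df <- data.frame(pair = names(pairwise.dist),
                       dist = unlist(pairwise.dist, use.names = FALSE))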
BTW, just FYI, in this specific example the full distance matrix is calculated quite fast:
system.time(ddu.dist <- dist(ddu.tab, method = "binary"))
# user system elapsed
# 0.61 0.00 0.61
which seems to indicate that there is indeed a fast method to calculate binary distances.
If anyone could please advise and/or point me to relevant resources, it would be great.
Thanks!
Not sure about speeding up the distance function itself, but you could replace your double loop, using the tidyverse, with
library(tidyverse)
results <- crossing(x = x, y = y) %>% #all x,y combinations
filter(x < y) %>% #remove duplicates
mutate(pairwise.dist = map2_dbl(x, y, distfun)) #apply distance function

How to use doParallel for calculating distance between zipcodes in R?

I have a large dataset (2.6M rows) with two zip codes and the corresponding latitudes and longitudes, and I am trying to compute the distance between them. I am primarily using the geosphere package to calculate the Vincenty ellipsoid distance between the zip codes, but it takes a massive amount of time on my dataset. What would be a faster way to implement this?
What I tried
library(tidyverse)
library(geosphere)
zipdata <- select(fulldata,originlat,originlong,destlat,destlong)
## Very basic approach
for(i in seq_len(nrow(zipdata))){
zipdata$dist1[i] <- distm(c(zipdata$originlat[i],zipdata$originlong[i]),
c(zipdata$destlat[i],zipdata$destlong[i]),
fun=distVincentyEllipsoid)
}
## Tidyverse approach
zipdata <- zipdata%>%
mutate(dist2 = distm(cbind(originlat,originlong), cbind(destlat,destlong),
fun = distHaversine))
Both of these methods are extremely slow. I understand that 2.1M rows will never be a "fast" calculation, but I think it can be made faster. I have tried the following approach on smaller test data, without any luck:
library(doParallel)
cores <- 15
cl <- makeCluster(cores)
registerDoParallel(cl)
test <- select(head(fulldata,n=1000),originlat,originlong,destlat,destlong)
foreach(i = seq_len(nrow(test))) %dopar% {
library(geosphere)
zipdata$dist1[i] <- distm(c(zipdata$originlat[i],zipdata$originlong[i]),
c(zipdata$destlat[i],zipdata$destlong[i]),
fun=distVincentyEllipsoid)
}
stopCluster(cl)
Can anyone help me out with either the correct way to use doParallel with geosphere or a better way to handle this?
Edit: Benchmarks from (some) replies
## benchmark
library(microbenchmark)
zipsamp <- sample_n(zip,size=1000000)
microbenchmark(
dave = {
# Dave2e
zipsamp$dist1 <- distHaversine(cbind(zipsamp$patlong,zipsamp$patlat),
cbind(zipsamp$faclong,zipsamp$faclat))
},
geohav = {
zipsamp$dist2 <- geodist(cbind(long=zipsamp$patlong,lat=zipsamp$patlat),
cbind(long=zipsamp$faclong,lat=zipsamp$faclat),
paired = T,measure = "haversine")
},
geovin = {
zipsamp$dist3 <- geodist(cbind(long=zipsamp$patlong,lat=zipsamp$patlat),
cbind(long=zipsamp$faclong,lat=zipsamp$faclat),
paired = T,measure = "vincenty")
},
geocheap = {
zipsamp$dist4 <- geodist(cbind(long=zipsamp$patlong,lat=zipsamp$patlat),
cbind(long=zipsamp$faclong,lat=zipsamp$faclat),
paired = T,measure = "cheap")
}
,unit = "s",times = 100)
# Unit: seconds
# expr min lq mean median uq max neval cld
# dave 0.28289613 0.32010753 0.36724810 0.32407858 0.32991396 2.52930556 100 d
# geohav 0.15820531 0.17053853 0.18271300 0.17307864 0.17531687 1.14478521 100 b
# geovin 0.23401878 0.24261274 0.26612401 0.24572869 0.24800670 1.26936889 100 c
# geocheap 0.01910599 0.03094614 0.03142404 0.03126502 0.03203542 0.03607961 100 a
A simple all.equal test showed that for my dataset the haversine method is equal to the vincenty method, but has a "Mean relative difference: 0.01002573" with the "cheap" method from the geodist package.
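That check, as a minimal sketch on the columns computed in the benchmark above:
all.equal(zipsamp$dist2, zipsamp$dist3)  # haversine vs vincenty
all.equal(zipsamp$dist2, zipsamp$dist4)  # haversine vs "cheap"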
R is a vectorized language, thus the function will operate over all of the elements in the vectors. Since you are calculating the distance between the original and destination for each row, the loop is unnecessary. The vectorized approach is approximately 1000x the performance of the loop.
Also, using distVincentyEllipsoid (or distHaversine, etc.) directly and bypassing the distm function should further improve the performance.
Without any sample data this snippet is untested.
library(geosphere)
zipdata <- select(fulldata,originlat,originlong,destlat,destlong)
## Very basic approach
zipdata$dist1 <- distVincentyEllipsoid(cbind(zipdata$originlong, zipdata$originlat),
cbind(zipdata$destlong, zipdata$destlat))
Note: For most of the geosphere functions to work correctly, the proper order is: longitude first then latitude.
The reason the tidyverse approach listed above is slow is that distm calculates the distance between every origin and every destination, which would produce a 2 million by 2 million element matrix.
I used #SymbolixAU's suggestion to use the geodist package to perform the 2.1M distance calculations on my dataset. I found it to be significantly faster than the geosphere package in every test (I have added one of them to my main question). The measure = "cheap" option in geodist uses the cheap ruler method, which has low error rates for distances below 100 km. See the geodist vignette for more information. Since some of my distances were longer than 100 km, I settled on the Vincenty ellipsoid measure.
If you are going to use geosphere, I would either use a fast approximate method like distHaversine, or the still fast and very precise distGeo method. (The distVincenty* functions are mainly implemented for curiosity.)
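For example, a minimal (untested) sketch using distGeo on the same columns as above:
library(geosphere)
# distGeo is vectorized row-wise: one distance per origin/destination pair
zipdata$dist_geo <- distGeo(cbind(zipdata$originlong, zipdata$originlat),
                            cbind(zipdata$destlong, zipdata$destlat))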

Silhouette calculation in R for large data

I want to calculate the silhouette for cluster evaluation. There are some packages in R for this, for example cluster and clValid. Here is my code using the cluster package:
# load the data
# a data from the UCI website with 434874 obs. and 3 variables
data <- read.csv("./data/spatial_network.txt",sep="\t",header = F)
# apply kmeans
km_res <- kmeans(data,20,iter.max = 1000,
nstart=20,algorithm="MacQueen")
# calculate silhouette
library(cluster)
sil <- silhouette(km_res$cluster, dist(data))
# plot silhouette
library(factoextra)
fviz_silhouette(sil)
The code works well for smaller data, say 50,000 observations, but I get an error like "Error: cannot allocate vector of size 704.5 Gb" when the data is larger. This is likely also a problem for the Dunn index and other internal indices on large datasets.
I have 32GB of RAM in my computer. The problem comes from computing dist(data). I am wondering whether it is possible not to compute dist(data) in advance, and instead calculate the corresponding distances only when they are required by the silhouette formula.
I appreciate your help regarding this problem and how I can calculate silhouette for large and very large datasets.
You can implement Silhouette yourself.
It only needs every distance twice, so storing an entire distance matrix is not necessary. It may run a bit slower because it computes distances twice, but at the same time the better memory efficiency may well make up for that.
It will still take a LONG time though.
You should consider using only a subsample (do you really need to consider all points?), as well as alternatives such as the Simplified Silhouette, in particular with k-means... You gain very little from extra data with such methods, so you may as well use a subsample.
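As an illustration only (a minimal sketch, not from the original answer, assuming numeric data X and integer cluster labels cl such as km_res$cluster), the silhouette can be computed one point at a time so that the full distance matrix is never stored:
silhouette_lowmem <- function(X, cl) {
  X <- as.matrix(X)
  n <- nrow(X)
  ks <- sort(unique(cl))   # assumes at least two clusters
  s <- numeric(n)
  for (i in seq_len(n)) {
    # Euclidean distances from point i to all points, computed on the fly
    di <- sqrt(rowSums(sweep(X, 2, X[i, ])^2))
    a <- sum(di[cl == cl[i]]) / max(sum(cl == cl[i]) - 1, 1)  # mean distance within own cluster
    b <- min(vapply(ks[ks != cl[i]], function(k) mean(di[cl == k]), numeric(1)))
    s[i] <- if (max(a, b) == 0) 0 else (b - a) / max(a, b)
  }
  s
}
# e.g. mean(silhouette_lowmem(data, km_res$cluster)) gives the average silhouette width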
Anony-Mousse's answer is perfect, particularly regarding subsampling. This is very important for very large datasets due to the increase in computational cost.
Here is another solution for calculating internal measures such as the silhouette and the Dunn index, using the R package clusterCrit. clusterCrit calculates clustering validation indices and does not require the entire distance matrix in advance. However, it might be slow, as Anony-Mousse discussed. Please see the link below for the clusterCrit documentation:
https://www.rdocumentation.org/packages/clusterCrit/versions/1.2.8/topics/intCriteria
clusterCrit also calculates most internal measures for cluster validation.
Example:
intCriteria(data,km_res$cluster,c("Silhouette","Calinski_Harabasz","Dunn"))
Alternatively, you can calculate the Silhouette index without building the full distance matrix by using the clues package, which reduces both the time and the memory used compared to the cluster package. Here is an example:
library(rbenchmark)
library(cluster)
library(clues)
set.seed(123)
x = c(rnorm(1000,0,0.9), rnorm(1000,4,1), rnorm(1000,-5,1))
y = c(rnorm(1000,0,0.9), rnorm(1000,6,1), rnorm(1000, 5,1))
cluster = rep(as.factor(1:3),each = 1000)
df <- cbind(x,y)
head(df)
x y
[1,] -0.50442808 -0.13527673
[2,] -0.20715974 -0.29498142
[3,] 1.40283748 -1.30334876
[4,] 0.06345755 -0.62755613
[5,] 0.11635896 2.33864121
[6,] 1.54355849 -0.03367351
Runtime comparison between the two functions
benchmark(f1 = silhouette(as.integer(cluster), dist = dist(df)),
f2 = get_Silhouette(y = df, mem = cluster))
test replications elapsed relative user.self sys.self user.child sys.child
1 f1 100 15.16 1.902 13.00 1.64 NA NA
2 f2 100 7.97 1.000 7.76 0.00 NA NA
Comparison in memory usage between the two functions
library(pryr)
object_size(silhouette(as.integer(cluster), dist = dist(df)))
73.9 kB
object_size(get_Silhouette(y = df, mem = cluster))
36.6 kB
In conclusion, clues::get_Silhouette computes the same silhouette while reducing both the time and the memory used.

Speed up raster::extract with weights in r

I want to extract the precise mean value of raster values from an area extent defined by a polygon in r. This works using raster::extract with the option weights=TRUE. However, this operation becomes prohibitively slow with large rasters and the function doesn't seem to be parallelized, thus beginCluster() ... endCluster() does not speed up the process.
I need to extract the values for a range of rasters, exemplified here as r, r10 and r100. Is there a way to speed this up in r, or is there an alternative way of doing this in GDAL?
r <- raster(nrow=1000, ncol=1000, vals=sample(seq(0,0.8,0.01),1000000,replace=TRUE))
r10 <- aggregate(r, fact=10)
r100 <- aggregate(r, fact=100)
v = Polygons(list(Polygon(cbind(c(-100,100,80,-120), c(-70,0,70,0)))), ID = "a")
v = SpatialPolygons(list(v))
plot(r)
plot(r10)
plot(r100)
plot(v, add=T)
system.time({
precise.mean <- raster::extract(r100, v, method="simple",weights=T, normalizeWeights=T, fun=mean)
})
user system elapsed
0.251 0.000 0.253
> precise.mean
[,1]
[1,] 0.3994278
system.time({
precise.mean <- raster::extract(r10, v, method="simple",weights=T, normalizeWeights=T, fun=mean)
})
user system elapsed
7.447 0.000 7.446
precise.mean
[,1]
[1,] 0.3995429
In the end I solved the problem using gdalUtils, working directly on the hard disk.
I used gdalwarp() to reduce the raster resolution to that of r10 and r100.
Then gdalwarp() again to increase the resolution of the resulting raster back to the original resolution of r.
Then gdalwarp() with cutline = "v.shp", crop_to_cutline = TRUE to mask the raster to the vector v.
And then gdalinfo() combined with x[grep("Mean=", x)] to extract the mean values.
All of this was packed in a foreach() %dopar% loop to process a number of rasters and resolution.
While complicated and probably not as precise as raster::extract, it did the job.
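A rough sketch of that workflow (the file names are placeholders, the tr values correspond to the fact = 10 aggregation of the example raster, and gdalUtils needs a working GDAL installation):
library(gdalUtils)
# 1. coarsen the original raster (assumed to have been written to disk as r.tif)
gdalwarp("r.tif", "r10.tif", tr = c(3.6, 1.8), r = "average", overwrite = TRUE)
# 2. resample back to the original resolution
gdalwarp("r10.tif", "r10_fine.tif", tr = c(0.36, 0.18), r = "near", overwrite = TRUE)
# 3. mask/crop to the polygon
gdalwarp("r10_fine.tif", "r10_masked.tif", cutline = "v.shp",
         crop_to_cutline = TRUE, overwrite = TRUE)
# 4. pull the mean out of the reported statistics
x <- gdalinfo("r10_masked.tif", stats = TRUE)
as.numeric(sub(".*Mean=([-0-9.eE+]+).*", "\\1", x[grep("Mean=", x)][1]))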
It should actually run faster if you first call beginCluster (the function then deals with the parallelization). Even better would be to use version 2.7-14 which has a much faster implementation. It is currently under review at CRAN, but you can also get it here: https://github.com/rspatial/raster
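For example, a minimal sketch of that suggestion (the number of cores is arbitrary):
library(raster)
beginCluster(4)   # extract() then uses the cluster internally
precise.mean <- raster::extract(r10, v, method = "simple", weights = TRUE,
                                normalizeWeights = TRUE, fun = mean)
endCluster()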

Efficient multidimensional dynamic time warping implementation

Here's how the literature explains how to compute multidimensional dynamic time warping of two time series:
library(dtw)
x<- cbind(1:10,1)
y<- cbind(11:15,2)
cxdist <-dist(x,y,method="euclidean")
dtw(cxdist)$distance
In fact it first computes the cross-distance matrix and then uses it as input to the dtw function.
I'd like to use multidimensional dynamic time warping in image classification with quite large images.
Image values are stored in a data frame that could look like this:
inDf <- data.frame(matrix(rnorm(60), ncol = 6))
colnames(inDf) <- c('var1t1','var2t1','var1t2','var2t2','var1t3','var2t3')
In this example, there are two variables (var1 and var2) observed three times.
The question is how to compute the dtw distance matrix as efficiently as possible?
Here are some thoughts:
- iterate through each value of the input image matrices, reshape the vectors into matrices so that the cross distances can be computed, then compute the dtw distance and store it in a dedicated matrix.
This is certainly the most computing-intensive solution.
When dealing with intensive computations it always makes sense to consider the Rcpp package. If you want to get a distance matrix of Euclidean distances faster, you can implement a corresponding Rcpp function:
library(Rcpp)
library(inline)
# Rcpp function for euclidean distance
fastdist <- cxxfunction(signature(x="matrix", y="matrix"), plugin="Rcpp",
body='
Rcpp::NumericMatrix dx(x);
Rcpp::NumericMatrix dy(y);
const int N = dx.nrow();
const int M = dy.nrow();
Rcpp::NumericMatrix res(N, M);
for(int i=0; i<N; i++){
for(int j=0; j<M; j++){
res(i,j) = sqrt(sum((dx(i,_)-dy(j,_))*(dx(i,_)-dy(j,_))));
}
}
return res;
')
It uses Rcpp syntactic sugar in order to make the code more compact and readable. However, sometimes it's better to have a wrapper function for checking types, coercing, etc. It's not necessary - you can call fastdist directly. But, anyway, a wrapper can look like this:
# Wrapper R function
fast.dist <- function(x, y){
stopifnot(class(x) %in% c("data.frame","matrix") &
class(y) %in% c("data.frame","matrix") &
ncol(x)==ncol(y))
fastdist(as.matrix(x), as.matrix(y))
}
Now we can turn to literature example.
library(dtw)
# EXAMPLE 1
x<- cbind(1:10,1)
y<- cbind(11:15,2)
# Check results
all.equal(fast.dist(x,y), dist(x,y,method="euclidean"), check.attributes=F)
# [1] "target is matrix, current is crossdist"
all.equal(fast.dist(x,y), matrix(dist(x,y,method="euclidean"), ncol=nrow(y)))
# [1] TRUE
Note that dist returns a result of class crossdist. So, for comparison it should be coerced to a matrix.
And now to your primary question - we generate the data first:
# EXAMPLE 2
set.seed(1234)
N <- 100
inDf <- data.frame(matrix(rnorm(6*N), ncol = 6))
colnames(inDf) <- c('var1t1','var2t1','var1t2','var2t2','var1t3','var2t3')
# Extracting variables
var1 <- inDf[,c("var1t1","var1t2","var1t3")]
var2 <- inDf[,c("var2t1","var2t2","var2t3")]
I'm not completely sure about your data structure, but in any case you can always prepare variables according to your needs.
Comparison and benchmarking:
library(rbenchmark)
all.equal(fast.dist(var1,var2), matrix(dist(var1,var2), ncol=N))
# [1] TRUE
benchmark(fast.dist(var1,var2), dist(var1,var2), order="relative")[,1:4]
# test replications elapsed relative
# 1 fast.dist(var1, var2) 100 0.081 1.000
# 2 dist(var1, var2) 100 0.246 3.037
fast.dist is roughly 3 times faster than dist in this case. However, as N grows, the relative speed-up will decrease.
Also note that, as was mentioned in the comments, dtw can compute the distance matrix by itself. Nevertheless, it's more efficient to have the distance matrix precomputed. See the quick test below:
cxdist <- fast.dist(var1,var2)
benchmark(dtw(cxdist)$distance, dtw(var1,var2)$distance, order="relative")[,1:4]
# test replications elapsed relative
# 1 dtw(cxdist)$distance 100 0.476 1.000
# 2 dtw(var1, var2)$distance 100 0.736 1.546
Also, if you're interested only in $distance, you can pass distance.only = TRUE to dtw() - it gives some speed-up.
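For instance (a minimal illustration of that option):
# skip the backtracking step when only the distance value is needed
dtw(cxdist, distance.only = TRUE)$distance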
