Recalculating a distance matrix in R

I’ve got a large input matrix (4000x10000). I use dist() to calculate the Euclidean distance matrix for it (it takes about 5 hours).
I need to calculate the distance matrix for the "same" matrix with an additional row (for a 4001x10000 matrix). What is the fastest way to determine the distance matrix without recalculating the whole matrix?

I'll assume your extra row means an extra point. If it means an extra variable/dimension, it will call for a different answer.
First of all, for Euclidean distances between the rows of a matrix, I'd recommend the rdist function from the fields package. It is written in Fortran and is a lot faster than the dist function. It returns a matrix instead of a dist object, but you can always go from one to the other using as.matrix and as.dist.
Here is some sample data (smaller than yours):
library(fields) # for rdist
num.points <- 400
num.vars <- 1000
original.points <- matrix(runif(num.points * num.vars),
                          nrow = num.points, ncol = num.vars)
and the distance matrix you already computed:
d0 <- rdist(original.points)
For the extra point(s), you only need to compute the distances among the extra points and the distances between the extra points and the original points. I will use two extra points to show that the solution is general to any number of extra points:
extra.points <- matrix(runif(2 * num.vars), nrow = 2)
inner.dist <- rdist(extra.points) # 2 x 2: among the extra points
outer.dist <- rdist(extra.points, original.points) # 2 x 400: extra vs. original
so you can bind them to your bigger distance matrix:
d1 <- rbind(cbind(d0, t(outer.dist)),
            cbind(outer.dist, inner.dist))
Let's check that it matches what a full, long rerun would have produced:
d2 <- rdist(rbind(original.points, extra.points))
identical(d1, d2)
# [1] TRUE
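If you need a dist object again downstream (for example, for hclust), you can convert the combined matrix back with as.dist, as mentioned above. A minimal sketch:
# convert the combined symmetric matrix back to a "dist" object
d1.dist <- as.dist(d1)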

Related

Calculate Euclidean distance between multiple pairs of points in dataframe in R

I'm trying to calculate the Euclidean distance between pairs of points in a dataframe in R, and there's an ID for each pair:
ID <- sample(1:10, 10, replace=FALSE)
P <- runif(10, min=1, max=3)
S <- runif(10, min=1, max=3)
testdf <- data.frame(ID, P, S)
I found several ways to calculate the Euclidean distance in R, but I'm either getting an error, getting only 1 value (so it's computing the distance across the entire vector), or ending up with a matrix when all I need is a 4th column with the distance for each pair (columns 'P' and 'S'). I'm a bit confused by matrices, so I'm not sure how to work with that result.
Tried making a function and applying it to the 2 columns but I get an error:
testdf$V <- apply(testdf[ , c('P', 'S')], 1, function(P, S) sqrt(sum(P^2, S^2)))
# Error in FUN(newX[, i], ...) : argument "S" is missing, with no default
Then tried using the dist() function in the stats package but it only returns 1 value:
(Same problem if I follow the method here: https://www.statology.org/euclidean-distance-in-r/)
P <- testdf$P
S <- testdf$S
testProbMatrix <- rbind(P, S)
stats::dist(testProbMatrix, method = "euclidean")
# returns only 1 distance
This attempt returns a matrix
(here's a nice explanation why: Calculate the distances between pairs of points in r):
stats::dist(cbind(P, S), method = "euclidean")
But I'm confused about how to pull the distances out of the matrix and attach them to the correct ID for each pair of points. I don't understand why I have to make a matrix instead of just applying the function to the dataframe - matrices have always confused me.
I think this is the same question as here (Finding euclidean distance between all pair of points) but for R instead of Python
Thanks for the help!
Try this out if you would just like to add another column to your dataframe:
testdf$distance <- sqrt(testdf$P^2 + testdf$S^2)
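For reference, the apply() attempt in the question fails because apply passes each row to the function as a single vector, so the two-argument function never receives S. A minimal row-wise sketch using the same formula (assuming the testdf from the question):
# apply hands each row over as one vector, so take a single
# argument and sum its squared elements
testdf$V <- apply(testdf[, c('P', 'S')], 1, function(row) sqrt(sum(row^2)))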

What causes the difference between calc and cellStats in raster calculations in R?

I am working with a dataset that consists of 20 layers, stacked in a RasterBrick (originating from an array). I have looked into the sum of the layers, calculated with both 'calc' and 'cellStats'. I have used calc to calculate the sum of the total values and cellStats to look at the average of the values per layer (useful for a time series).
However, when I sum the average of each layer, it is half the value of the other calculated sum. What causes this difference? What am I overlooking?
Code looks like this:
testarray <- runif(54214776, 0, 1)
# testarray should represent a 127 x 147 raster with 2904 time layers,
# though I'm not really sure how to create that yet.
for (i in 1830:1849){
  slice <- array2[,,i]
  r <- raster(nrow = (127*5), ncol = (147*5), resolution = 5, ext = ext1, vals = slice)
  x <- stack(x, r)
}
brickhp2 <- brick(x)
r_sumhp2 <- calc(brickhp2, sum, na.rm=TRUE)
r_sumhp2[r_sumhp2<= 0] <- NA
SWEavgpertimestepM <- cellStats(brickhp2, stat='mean', na.rm=TRUE)
The goal is to compare the sum of the layers calculated with 'calc(x, sum)' with the sum of the mean calculated with 'cellStats(x, mean)'.
Rasterbrick looks like this (600kb, GTiff) : http://www.filedropper.com/brickhp2
*If there is a better way to share this, please let me know.
The confusion comes from the fact that calc operates pixel-wise on a brick (i.e. it performs the calculation on the 20 values at each pixel and returns a single raster layer), whereas cellStats performs the calculation on each raster layer individually and returns a single value per layer. You can see that the results are comparable if you use this code:
library(raster)
##set seed so you get the same runif vals
set.seed(999)
##create example rasters
ls <- list()
for (i in 1:20){
  r <- raster(nrow = (127*5), ncol = (147*5), vals = runif(127*5*147*5))
  ls[[i]] <- r
}
##create raster brick
brickhp2 <- brick(ls)
##calc sum (pixel-wise)
r_sumhp2 <- calc(brickhp2, sum, na.rm=TRUE)
r_sumhp2 ##returns raster layer
##calc mean (layer-wise)
r_meanhp2 <- cellStats(brickhp2, stat='mean', na.rm=TRUE)
r_meanhp2 ##returns vector of length nlayers(brickhp2)
##to get equivalent values you need to divide r_sumhp2 by the number of layers
##and then calculate the mean
cellStats(r_sumhp2/nlayers(brickhp2),stat="mean")
# [1] 0.4999381
##and for r_meanhp2 you need to calculate the mean of the means
mean(r_meanhp2)
# [1] 0.4999381
You will need to determine for yourself if you want to use the pixel or layer wise result for your application.
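Note that, by linearity of the mean, the pixel-wise mean of the per-pixel sums equals the sum of the per-layer means (as long as no cells are set to NA in between). You can check this directly with the objects from above:
# mean over pixels of the per-pixel sums should match the
# sum of the per-layer means when no NAs are introduced
all.equal(cellStats(r_sumhp2, stat = "mean"), sum(r_meanhp2))
# [1] TRUE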

R - Different approach to speed up 3 dimension array/matrix creation

My question is one of approach. Using SO, I iterated through methods to create a 3-dimensional array in R (this is my first question; R is a constraint). The use case is that this final array needs to be updated often, but the two input arrays are updated at different periods. The goal is to minimize the final array's creation time, and the intermediary steps too if possible.
I know I can reach out with Rcpp, and I assign more than I need to for readability, but what I am wondering is:
Is there a better approach to completing this operation?
if (!require("geosphere")) install.packages("geosphere")
#simulate real data
dimLength <- 418
latLong <- cbind(rep(40, dimLength), rep(2, dimLength))
potentialChurn <- as.matrix(rep(500, dimLength))
#create 2D matrix
valueMat <- matrix(0, dimLength, dimLength)
value <- potentialChurn
valueTranspose <- t(value)
for (s in 1:dimLength){ valueMat[s,] <- value + valueTranspose[s] }
diag(valueMat) <- 0
#create 3D matrix from copying 2D matrix
bigValMat <- array(0, dim = c(dimLength, dimLength, dimLength))
for (d in 1:dimLength){ bigValMat[,d,] <- valueMat }
#get crow fly distance between locations, create 2D matrix
distMat <- as.matrix(outer(seq(dimLength), seq(dimLength),
                           Vectorize(function(i, j) distCosine(latLong[i,], latLong[j,]))))
###create 3D matrix by calculating distance between any two locations;
# create 2D matrix from each column in original 2D matrix
# add this column-replicated 2D matrix to the original
bigDistMat <- array(0, dim = c(dimLength, dimLength, dimLength))
for (p in 1:dimLength){
  addCol <- distMat[,p]
  addMatrix <- as.matrix(addCol)
  for (y in 2:dimLength){ addMatrix <- cbind(addMatrix, addCol) }
  bigDistMat[,p,] <- data.matrix(distMat) + data.matrix(addMatrix)
}
#Final matrix calculation
bigValDistMat <- bigValMat / bigDistMat
...as context, this is part of a two-step-ahead forecast policy developed for a class using Barcelona bike-sharing (Bicing) data. The project is over and I am interested in how I could have done better.
In general, if you want to speed up your code, you want to identify the bottlenecks and fix them, as explained here. Putting all your code in a function beforehand would be a good idea.
In your specific case, you use far too many for loops for R code. You need to vectorize your code much more.
Edit
Now for the long answer:
library(geosphere) # for distCosine
#simulate real data; you want it to be random
dimLength <- 418
latLong <- cbind(rnorm(dimLength, 40, 0.5), rnorm(dimLength, 2, 0.5))
potentialChurn <- as.matrix(rnorm(dimLength, 500, 10))
value <- potentialChurn
#create 2D matrix; outer is designed for this operation
valueMat <- outer(value, t(value), FUN = "+")[,1,1,]
diag(valueMat) <- 0
#create 3D matrix by copying the 2D matrix; again, avoid a for loop
bigValMat <- array(rep(valueMat, dimLength), dim = c(dimLength, dimLength, dimLength))
#and use aperm to permute the dimensions
bigValMat <- aperm(bigValMat, c(1, 3, 2))
#get crow fly distance between locations, create 2D matrix
# other packages are available to compute that kind of distance matrix
# but let's stay in plain R
# wordy but so much faster (and easier to read)
longs1 <- rep(latLong[,1],dimLength)
lats1 <- rep(latLong[,2],dimLength)
latLong1 <- cbind(longs1,lats1)
longs2 <- rep(latLong[,1],each=dimLength)
lats2 <- rep(latLong[,2],each=dimLength)
latLong2 <- cbind(longs2,lats2)
distMat <- matrix(distCosine(latLong1,latLong2),ncol=dimLength)
###create 3D matrix by calculating distance between any two locations;
# same logic than for bigValMat
addMatrix <- array(rep(distMat,dimLength),dim=rep(dimLength,3))
distMat3D <- aperm(addMatrix,c(1,3,2))
bigDistMat <- addMatrix + distMat3D
#Final matrix calculation
bigValDistMat <- bigValMat / bigDistMat
Here it is 25x faster than your initial code (76 s -> 3 s). It could still be improved, but you get the idea: avoid for loops, cbind and co at all costs.
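As a side note (a small variation of my own, not part of the answer above): since potentialChurn is really a column vector, outer can work on a plain vector, which avoids the awkward [,1,1,] indexing:
# same pairwise-sum matrix built from a plain vector
v <- as.vector(potentialChurn)
valueMat <- outer(v, v, FUN = "+") # valueMat[i,j] == v[i] + v[j]
diag(valueMat) <- 0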

How to combine data from different columns, e.g. mean of surrounding columns for a given column

I am trying to smooth a matrix by attributing to each column the mean value of a window covering n columns around it. I've managed to do it, but I'd like to see what 'the R way' of doing it would be, as I am making use of for loops. Is there a way to get this using apply or some function of the same family?
Example:
# create a toy matrix
mat <- matrix(nrow = 0, ncol = 200) # start empty so there is no stray NA row
for (i in 1:100){ mat <- rbind(mat, sample(1:200, 200)) }
# quick visualization
image(t(mat))
This is the matrix before smoothing:
I wrote the function smooth_row_mat, which takes a matrix and the length of the smoothing kernel:
smooth_row_mat <- function(k, k.d = 5){
  k.range <- (k.d + 2):(ncol(k) - k.d - 1)
  k.smooth <- matrix(nrow = nrow(k))
  for (i in k.range){
    if (i %% 10 == 0) cat('\r', round(i / length(k.range), 2))
    # average column i together with its k.d + 1 neighbours on each side
    k.smooth <- cbind(k.smooth, rowMeans(k[, c((i - 1 - k.d):(i - 1), i, (i + 1):(i + 1 + k.d))]))
  }
  return(k.smooth)
}
Now we use smooth_row_mat() with mat
mat.smooth <- smooth_row_mat(mat)
And we have successfully smoothed, on a row basis, the content of the matrix.
This is the matrix after:
This method is fine for such a small matrix, and it still works on my real matrices (around 40,000 x 400), but I'd like to improve my R skills.
Thanks!
You can apply a filter (running mean) across each row of your matrix as follows:
apply(k, 1, filter, rep(1/k.d, k.d))
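Note that apply over rows collects its results as columns, so you will usually want to transpose the result back; stats::filter also leaves NAs at the edges where the window does not fit. A minimal sketch, assuming the mat from the question and a window of k.d = 5:
# row-wise running mean; t() restores the original orientation,
# and the first/last few columns are NA where the window is incomplete
k.d <- 5
mat.smooth <- t(apply(mat, 1, stats::filter, rep(1/k.d, k.d)))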
Here's how I'd do it, with the raster package.
First, create a matrix filled with random data and coerce it to a raster object.
library(raster)
r <- raster(matrix(sample(200, 200*200, replace=TRUE), nc=200))
plot(r)
Then use the focal function to calculate a neighbourhood mean over a window of n cells centred on the focal cell. The values in the matrix of weights you provide to the focal function determine how much the value of each cell contributes to the focal summary. For a mean, we want each cell to contribute 1/n, so we fill a matrix of n columns with the value 1/n. Note that n must be an odd number, and the cell in the centre of the matrix is considered the focal cell.
n <- 3
smooth_r <- focal(r, matrix(1/n, nc=n))
plot(smooth_r)
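If the rest of your pipeline needs a plain matrix rather than a raster object, you can coerce the result back (a one-line sketch, assuming the smooth_r from above):
# back to a base R matrix for downstream use
smooth_mat <- as.matrix(smooth_r)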

Find K nearest neighbors, starting from a distance matrix

I'm looking for a well-optimized function that accepts an n X n distance matrix and returns an n X k matrix with the indices of the k nearest neighbors of the ith datapoint in the ith row.
I find a gazillion different R packages that let you do KNN, but they all seem to include the distance computations along with the sorting algorithm within the same function. In particular, for most routines the main argument is the original data matrix, not a distance matrix. In my case, I'm using a nonstandard distance on mixed variable types, so I need to separate the sorting problem from the distance computations.
This is not exactly a daunting problem -- I obviously could just use the order function inside a loop to get what I want (see my solution below), but this is far from optimal. For example, the sort function with partial = 1:k when k is small (less than 11) goes much faster, but unfortunately returns only sorted values rather than the desired indices.
Try the FastKNN CRAN package (although it is not well documented). It offers a k.nearest.neighbors function to which an arbitrary distance matrix can be given. Below is an example that computes the matrix you need.
library(FastKNN)
# arbitrary data
train <- matrix(sample(c("a", "b", "c"), 12, replace = TRUE), ncol = 2) # n x 2
n <- dim(train)[1]
distMatrix <- matrix(runif(n^2, 0, 1), ncol = n) # n x n
# matrix of neighbours
k <- 3
nn <- matrix(0, n, k) # n x k
for (i in 1:n)
  nn[i, ] <- k.nearest.neighbors(i, distMatrix, k = k)
Note: you can always search the CRAN package list (Ctrl+F for 'knn') for related functions:
https://cran.r-project.org/web/packages/available_packages_by_name.html
For the record (I won't mark this as the answer), here is a quick-and-dirty solution. Suppose sd.dist is the special distance matrix. Suppose k.for.nn is the number of nearest neighbors.
n <- nrow(sd.dist)
knn.mat <- matrix(0, ncol = k.for.nn, nrow = n)
knd.mat <- knn.mat
for (i in 1:n){
  knn.mat[i, ] <- order(sd.dist[i, ])[1:k.for.nn]
  knd.mat[i, ] <- sd.dist[i, knn.mat[i, ]]
}
Now knn.mat is the matrix with the indices of the k nearest neighbors in each row, and for convenience knd.mat stores the corresponding distances.
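One caveat with this quick-and-dirty version: because sd.dist[i, i] is 0, every point comes back as its own first neighbor. If you want to exclude the point itself, shift the index window. A sketch of the adjusted loop:
for (i in 1:n){
  # skip position 1, which is the point itself at distance 0
  knn.mat[i, ] <- order(sd.dist[i, ])[2:(k.for.nn + 1)]
  knd.mat[i, ] <- sd.dist[i, knn.mat[i, ]]
}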
