How to use apply function to calculate the distance between two matrices

I'm trying to calculate the Euclidean distance between the rows of two matrices. I have already done this with two for loops, but I'm trying to vectorize the calculation to speed it up. I'm using pdist as a benchmark to validate that the distances are calculated correctly.
Thanks to this post, https://medium.com/#souravdey/l2-distance-matrix-vectorization-trick-26aa3247ac6c, I tried to achieve the same thing in R with this code:
dist <- sqrt(rowSums(xtest**2)+rowSums(xtrain**2)-2*xtrain %*% t(xtest))
But the result is different from what comes out of pdist. I am not sure what's wrong with this.
Here is the code.
Create some data:
xtest=matrix(cbind(c(0,0),c(1,31)),2,2,byrow=TRUE)
xtrain=matrix(cbind(c(9,2),c(4,15),c(7,8),c(-22,-2)),4,2,byrow=TRUE)
Calculate using double loops:
mydist <- function(xtest, xtrain) {
  euc.dist <- function(x1, x2) sqrt(sum((x1 - x2) ^ 2))
  dist <- matrix(, nrow = nrow(xtrain), ncol = nrow(xtest))
  for (i in 1:nrow(xtrain)) {
    for (j in 1:nrow(xtest)) {
      dist[i, j] <- euc.dist(xtrain[i, ], xtest[j, ])
    }
  }
  return(dist)
}
> mydist(xtest,xtrain)
[,1] [,2]
[1,] 9.219544 30.08322
[2,] 15.524175 16.27882
[3,] 10.630146 23.76973
[4,] 22.090722 40.22437
The result is the same as with pdist:
> libdists <- pdist(xtrain,xtest)
> as.matrix(libdists)
[,1] [,2]
[1,] 9.219544 30.08322
[2,] 15.524175 16.27882
[3,] 10.630146 23.76973
[4,] 22.090721 40.22437
But the matrix multiplication method gives wrong results:
> mydist2 <- function(xtest,xtrain) {
+ dist <- sqrt(rowSums(xtest**2)+rowSums(xtrain**2)-2*xtrain %*% t(xtest))
+ return (dist)
+ }
> mydist2(xtest,xtrain)
[,1] [,2]
[1,] 9.219544 NaN
[2,] 34.684290 16.27882
[3,] 10.630146 NaN
[4,] 38.078866 40.22437
I have also tried the mapply function:
> mydist3 <- function(xtest,xtrain) {
+ euc.dist <- function(x1, x2) sqrt(sum((x1 - x2) ^ 2))
+ dist <- mapply(euc.dist, xtest,xtrain)
+ return (dist)
+ }
> mydist3(xtest,xtrain)
[1] 9 3 7 53 2 14 8 33
I think it works element-wise rather than taking each row as a vector and calculating the distance between two vectors.
Any suggestions will be appreciated!

Use two apply instances with the second nested in the first:
d1 <- apply(xtest, 1, function(x) apply(xtrain, 1, function(y) sqrt(crossprod(x-y))))
Check against pdist:
library(pdist)
d2 <- as.matrix(pdist(xtrain, xtest))
all.equal(d1, d2, tolerance = 1e-7)
## [1] TRUE
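As for why the matrix multiplication attempt in the question fails: rowSums(xtest**2) has length 2 and rowSums(xtrain**2) has length 4, so R recycles both down the columns of the 4 x 2 cross-product instead of adding the xtest norms across columns. The argument of sqrt can then even become negative, hence the NaN. A minimal sketch of a corrected version follows, using outer to build the 4 x 2 matrix of summed squared norms. The name mydist_vec and the pmax guard against tiny negative rounding errors are illustrative additions, not part of the original posts.

```r
xtest  <- matrix(c(0, 0, 1, 31), 2, 2, byrow = TRUE)
xtrain <- matrix(c(9, 2, 4, 15, 7, 8, -22, -2), 4, 2, byrow = TRUE)

mydist_vec <- function(xtest, xtrain) {
  # sq[i, j] = |xtrain[i, ]|^2 + |xtest[j, ]|^2
  sq <- outer(rowSums(xtrain^2), rowSums(xtest^2), "+")
  # |a - b|^2 = |a|^2 + |b|^2 - 2 * a.b ; clamp rounding noise at zero
  d2 <- pmax(sq - 2 * xtrain %*% t(xtest), 0)
  sqrt(d2)
}

d <- mydist_vec(xtest, xtrain)
d
```

The result has one row per row of xtrain and one column per row of xtest, the same layout as the double loop and as.matrix(pdist(xtrain, xtest)).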

Related

double sum calculation, what is the most efficient approach?

I need to calculate this
where x is a vector of length n and f is a function.
What is the most efficient calculation for this in R?
One method is a double for loop, but that is obviously slow.

One fast way to do it is the following.
Assume we have this vector:
x = c(0,1,2)
i.e. n = 3, and assume f is multiplication.
Now, we use the custom function expand.grid.unique, which produces the unique combinations within a vector; in other words, it is similar to the base function expand.grid but returns only unique combinations:
expand.grid.unique <- function(x, y, include.equals = FALSE)
{
  x <- unique(x)
  y <- unique(y)
  g <- function(i)
  {
    z <- setdiff(y, x[seq_len(i - include.equals)])
    if (length(z)) cbind(x[i], z, deparse.level = 0)
  }
  do.call(rbind, lapply(seq_along(x), g))
}
In our case, calling expand.grid.unique(x, x) produces the following result:
> expand.grid.unique(x,x)
[,1] [,2]
[1,] 0 1
[2,] 0 2
[3,] 1 2
Let's assign it to two_by_two:
two_by_two <- expand.grid.unique(x,x)
Since our function is assumed to be multiplication, we need the sum of products, i.e. the dot product of the first and second columns of two_by_two. For this we use the %*% operator:
output <- two_by_two[,1] %*% two_by_two[,2]
> output
[,1]
[1,] 2

See ?combn
x <- 0:2
combn(x, 2)
# unique combos
#     [,1] [,2] [,3]
#[1,]    0    0    1
#[2,]    1    2    2
sum(combn(x, 2))
#[1] 6
combn() creates all the unique combinations. If you have a function that you want to sum, you can add a FUN to the call:
random_f <- function(x){x[1] + 2 * x[2]}
combn(x, 2, FUN = random_f)
#[1] 2 4 5
sum(combn(x, 2, FUN = random_f))
#[1] 11
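To tie the two answers together for the multiplication example, here is a short sketch, assuming the sum runs over the unique pairs i < j as both answers above do. The first version calls the function once per pair through combn. The second builds the full table of products with outer and keeps the entries above the diagonal, which can be faster when f is vectorised but needs memory for an n x n matrix. The variable names are illustrative.

```r
x <- c(0, 1, 2)

# Sum of x[i] * x[j] over unique pairs i < j, one call to prod per pair
pair_sum_combn <- sum(combn(x, 2, FUN = prod))

# Same sum from the full table of products, keeping entries above the diagonal
prods <- outer(x, x)            # prods[i, j] = x[i] * x[j]
pair_sum_outer <- sum(prods[upper.tri(prods)])

pair_sum_combn   # 2
pair_sum_outer   # 2
```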

Calculating distance between two points for multiple records for matching rows - loop over rows of two matrices

I have two matrices of coordinates and I am trying to compute the distance between points in matching rows, i.e. between row 1 of the first matrix and row 1 of the second matrix.
What I am getting instead is the distance between row 1 and all the other rows. This causes memory issues, as I have 800,000 rows. Does anyone know how to compute only the matching-row distances?
I am using
dist1 <- distm(FareStageMatrix[1:25000,], LSOACentroidMatrix[1:25000,], fun=distHaversine)
I am trying to create something like this, but it doesn't seem to work:
for(i in 1:nrow(FareStageMatrix)) {
for(j in 1:nrow(LSOACentroidMatrix)) {
my_matrix[i] <- my_matrix[distm(FareStageMatrix[i], LSOACentroidMatrix[i], fun=distHaversine)]
}
}
I changed it to:
for (i in 1:nrow(FareStageMatrix)){
for (i in 1:nrow(LSOACentroidMatrix)){
r1<-FareStageMatrix[i,]
r2<-LSOACentroidMatrix[i,]
results[i]<-distm(r1, r2, fun=distHaversine)
}
}
Is that something that should be working?

It seems I have managed to find a solution:
results <- matrix(NA, nrow(FareStageMatrix))
# a single loop over row indices is enough, since row i is paired with row i
for (i in seq_len(nrow(FareStageMatrix))) {
  r1 <- FareStageMatrix[i, ]
  r2 <- LSOACentroidMatrix[i, ]
  results[i] <- distm(r1, r2, fun = distHaversine) ## Example function
}
where FareStageMatrix and LSOACentroidMatrix are matrices of coordinates.
This calculates one distance for each pair of matching points.
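The loop can be avoided altogether, because the haversine formula works elementwise on the coordinates. As far as I know, geosphere's distHaversine already accepts two matrices with the same number of rows and returns one distance per matching row, so distHaversine(FareStageMatrix, LSOACentroidMatrix) may be all that is needed. The sketch below shows the same idea in base R. haversine_rows is a hypothetical helper written for illustration, assuming the columns are (longitude, latitude) in degrees and an earth radius of 6378137 m.

```r
# Hand-written haversine for illustration (not the geosphere implementation).
# p1, p2: two-column matrices of (longitude, latitude) in degrees, same nrow.
haversine_rows <- function(p1, p2, r = 6378137) {
  to_rad <- pi / 180
  lon1 <- p1[, 1] * to_rad
  lat1 <- p1[, 2] * to_rad
  lon2 <- p2[, 1] * to_rad
  lat2 <- p2[, 2] * to_rad
  a <- sin((lat2 - lat1) / 2)^2 +
    cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)^2
  # pmin guards against sqrt(a) drifting slightly above 1 through rounding
  2 * r * asin(pmin(1, sqrt(a)))
}

# Toy data: row 1 goes from the equator to the pole, row 2 is the same point twice
m1 <- rbind(c(0, 0), c(10, 50))
m2 <- rbind(c(0, 90), c(10, 50))
d <- haversine_rows(m1, m2)
d
```

The result is a vector with one distance in metres per row, so memory use grows linearly with the number of rows instead of quadratically as with distm.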

I've adapted geosphere's distGeo function (geodesic distance) for this purpose.
library(geosphere)
source("https://raw.githubusercontent.com/RomanAbashin/distGeo_v/master/distGeo_v.R")
Data
set.seed(1702)
m1 <- matrix(runif(20000, -10, 10), ncol = 2)
m2 <- matrix(runif(20000, -10, 10), ncol = 2)
Code
result <- distGeo_v(m1[, 1], m1[, 2],
m2[, 1], m2[, 2])
Result
> head(m1)
[,1] [,2]
[1,] 8.087152 9.227607
[2,] 9.528334 9.103403
[3,] 5.637921 -2.213228
[4,] -2.473758 -9.812986
[5,] -2.844036 -5.245779
[6,] -4.824615 -4.330890
> head(m2)
[,1] [,2]
[1,] 0.1673027 0.6483745
[2,] -2.5033184 0.1386050
[3,] 4.8589785 5.1996968
[4,] 8.3239454 -8.9810949
[5,] 0.8280422 -7.8272613
[6,] -6.2633738 -5.8725562
> head(result)
[1] 1292351.3 1661739.3 824260.0 1189476.4 496403.2 233480.2

How to create random vectors of another vector?

I am performing calculations with constants and vectors (approximate length = 100) for which I need to simulate normal distributions N (with rnorm). For constants (K, with standard deviation = KU) I use rnorm() in the standard way:
K <- 2
KU <- 0.2
set.seed(123)
KN <- rnorm(n = 3, mean = K, sd = KU)
which gives a vector of length 3 (KN):
[1] 1.887905 1.953965 2.311742
Now, I need to do the same thing with a vector (V, standard deviation VU). My first guess is to use:
V <- c(1, 2, 3)
VU <- 0.1 * V
set.seed(123)
VN <- rnorm(3, V, VU)
but only a vector of 3 elements is produced, one for each vector element:
[1] 0.9439524 1.9539645 3.4676125
This is actually the first simulation of the vector, but I need 3 times this vector. One solution is to create 9 numbers, but VN is a vector of 9 elements:
[1] 0.9439524 1.9539645 3.4676125 1.0070508 2.0258575 3.5145195 1.0460916 1.7469878 2.7939441
not 3 vectors of 3 elements. What I want is VN =
[1] 0.9439524 1.0070508 1.0460916
[2] 1.9539645 2.0258575 1.7469878
[3] 3.4676125 3.5145195 2.7939441
so, VN are 3 vectors which I can subsequently use in other calculations, such as KN * VN. The solution that I have found is:
set.seed(123)
VN <- as.data.frame(t(matrix(rnorm(3 * length(V), V, VU), nrow = length(V))))
but in my opinion this is a rather cumbersome expression (which I need to repeat several times in different places with rather long variable names). Is there a simpler way in base R to produce random vectors? I would like to see something like:
VN <- rnorm.vector(3, V, VU)

We can use replicate
set.seed(123)
replicate(3, rnorm(3, V, VU))
# [,1] [,2] [,3]
#[1,] 0.9439524 1.007051 1.046092
#[2,] 1.9539645 2.025858 1.746988
#[3,] 3.4676125 3.514519 2.793944
Or it could be
mapply(rnorm, n = 3, mean = V, sd = VU)

In addition to @akrun's great options, you may also use something slightly simpler than your approach:
n <- 3
matrix(rnorm(n * length(V), V, VU), nrow = n, byrow = TRUE)
# [,1] [,2] [,3]
# [1,] 0.9439524 1.953965 3.467612
# [2,] 1.0070508 2.025858 3.514519
# [3,] 1.0460916 1.746988 2.793944
or the MASS package, whose mvrnorm samples from a multivariate normal distribution. Note that mvrnorm takes the mean vector and a variance-covariance matrix, so the means are V and the standard deviations VU must be squared:
library(MASS)
mvrnorm(n, V, diag(VU^2))
where
diag(VU^2)
#      [,1] [,2] [,3]
# [1,] 0.01 0.00 0.00
# [2,] 0.00 0.04 0.00
# [3,] 0.00 0.00 0.09
The latter option is the way to go if you want a variance-covariance matrix that is not diagonal.
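If the expression has to be repeated in many places, it can be wrapped in a small helper along the lines of the rnorm.vector the question asks for. This is only a sketch: rnorm_vector is a hypothetical name, not a base R function. It returns a matrix with one row per element of V and one column per simulation, which is the layout requested in the question.

```r
# Hypothetical helper: n simulations of a vector of means with matching sds.
# rnorm recycles mean and sd, so the draws come out simulation by simulation
# and fill the matrix column by column.
rnorm_vector <- function(n, mean, sd) {
  matrix(rnorm(n * length(mean), mean, sd), nrow = length(mean))
}

V <- c(1, 2, 3)
VU <- 0.1 * V
set.seed(123)
VN <- rnorm_vector(3, V, VU)
VN
```

Column k of VN is the k-th simulated vector, so VN[, 1] reproduces the first simulation shown in the question.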

Multiply a matrix' columns by its columns

I have a 4x100 matrix and I would like to multiply column 1 by row 1 of its transpose, and so on, and store the resulting matrices somewhere so that I can take their sum later on.
I really don't know where to start, because each column-row multiplication gives a 4x4 matrix, so I cannot store them in a single matrix.
data:
mm num[1:4,1:100]
mm_t num[1:100,1:4]
I'm thinking of creating a list in some way
list1=list()
for(i in 1:100){
list1[i] <- mm[,i]%*%mm_t[i,]
}
but I think I need some more indices, because this just leaves me with a single number in each element.

First, your description of the data is not clear. Second, are you trying to multiply each value by itself, or to do matrix multiplication?
We create a 4x100 matrix and its transpose:
mm <- matrix(1:400, nrow = 4, ncol = 100)
mm.t <- t(mm)
Then we can do the matrix multiplication (which is what you did; by the definition of matrix multiplication, https://www.wikiwand.com/en/Matrix_multiplication, the result is a 4 x 4 matrix).
If we want to multiply each element by itself (so mm[1,1] by mm[1,1]), then:
mm * mm
This will result in 4x100 matrix where each value is the square of the original value.
If we want the product of each column with itself, then:
sapply(1:100, function(x) {
  mm[, x] %*% mm[, x]
})
This results in 100 values: each one is the dot product of a column (a vector of length 4) with itself.

Let's start with some sample data. Please get in the habit of including things like this in your question:
nr = 4
nc = 100
set.seed(47)
mm = matrix(runif(nr * nc), nrow = nr)
Here's a working answer, very similar to your attempt:
result = list()
for (i in 1:ncol(mm)) result[[i]] = mm[, i] %*% t(mm[, i])
result[1:2]
# [[1]]
# [,1] [,2] [,3] [,4]
# [1,] 0.9544547 0.3653018 0.7439585 0.8035430
# [2,] 0.3653018 0.1398132 0.2847378 0.3075428
# [3,] 0.7439585 0.2847378 0.5798853 0.6263290
# [4,] 0.8035430 0.3075428 0.6263290 0.6764924
#
# [[2]]
# [,1] [,2] [,3] [,4]
# [1,] 0.3289532 0.3965557 0.2231443 0.2689613
# [2,] 0.3965557 0.4780511 0.2690022 0.3242351
# [3,] 0.2231443 0.2690022 0.1513691 0.1824490
# [4,] 0.2689613 0.3242351 0.1824490 0.2199103
As to why yours didn't work, we can experiment and see that indeed we get a number rather than a matrix. The reason is that when you subset a single row or column of a matrix, the dimensions are "dropped" and it is coerced to a plain vector. And when you matrix multiply two vectors, you get their dot product.
mmt = t(mm)
mm[, 1] %*% mmt[1, ]
# [,1]
# [1,] 2.350646
dim(mm[, 1])
# NULL
dim(mmt[1, ])
# NULL
We can avoid this by specifying drop = FALSE in the subset code
dim(mmt[1, , drop = FALSE])
# [1] 1 4
So a slight modification of your attempt, just adding drop = FALSE, makes it work:
res2 = list()
for (i in 1:ncol(mm)) res2[[i]] = mm[, i] %*% mmt[i, , drop = FALSE]
identical(result, res2)
# [1] TRUE
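Since the stated goal is the sum of these 4x4 matrices, it may be worth noting that the list is not needed for that step. The sum of the outer products of the columns of mm equals mm %*% t(mm), which tcrossprod computes directly. Here is a quick check on the same sample data; the variable names are illustrative.

```r
set.seed(47)
mm <- matrix(runif(4 * 100), nrow = 4)

# One 4x4 outer product per column, then add them up
outer_list <- lapply(seq_len(ncol(mm)), function(i) outer(mm[, i], mm[, i]))
total_from_list <- Reduce(`+`, outer_list)

# The same sum in a single matrix product
total_direct <- tcrossprod(mm)   # equivalent to mm %*% t(mm)

all.equal(total_from_list, total_direct)
```

The list is still the right structure if the individual 4x4 matrices are needed for something other than their sum.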

How to calculate correlation between matrices with different column dimensions in R

I have two matrices with same number of rows and different number of columns as:
mat1 <- matrix(rnorm(20), 4, 5)
mat2 <- matrix(rnorm(12), 4, 3)
Since I have the same number of rows, I want to calculate the following correlations between the columns of the matrices:
cor.test(mat1[,1], mat2[,1])
cor.test(mat1[,1], mat2[,2])
cor.test(mat1[,1], mat2[,3])
cor.test(mat1[,2], mat2[,1])
cor.test(mat1[,2], mat2[,2])
cor.test(mat1[,2], mat2[,3])
...........
...........
cor.test(mat1[,5], mat2[,3])
for(i in 1:5){
for(j in 1:3){
pv[i,j] <- cor.test(mat1[, i], mat2[ , j])$p.value
}
}
At the end I want a 5 x 3 matrix or a vector containing the correlation values. Can anyone help?
Can I use this to return both the p-value and the estimate?
FUN <- function(x, y) {
res <- cor.test(x, y, method="spearman", exact=F)
return(list(c = res$estimate, p = res$p.value))
}
r1 <- outer(colnames(mat1), colnames(mat2), Vectorize(function(i,j) FUN(mat1[,i], mat2[,j])$p))
r2 <- outer(colnames(mat1), colnames(mat2), Vectorize(function(i,j) FUN(mat1[,i], mat2[,j])$c))
Thank you.

Why don't you just use the cor function to calculate the Pearson correlation?
set.seed(1)
mat1 <- matrix(rnorm(20), 4, 5)
mat2 <- matrix(rnorm(12), 4, 3)
cor(mat1, mat2)
[,1] [,2] [,3]
[1,] 0.4406765 -0.70959590 0.10731768
[2,] -0.2566199 -0.01588993 -0.63630159
[3,] -0.9813313 0.85082165 -0.77172317
[4,] 0.6121358 -0.38564314 0.87077092
[5,] -0.6897573 0.66272015 -0.08380553
To double check,
> col_1 <- 3
> col_2 <- 2
# all.equal is used to compare numeric equality where `==` is discouraged
> all.equal(cor(mat1, mat2)[col_1, col_2], cor(mat1[,col_1], mat2[,col_2]))
[1] TRUE
They are equal!
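cor gives the estimates but not the p-values. To answer the follow-up about returning both, one option is to run outer over column indices rather than colnames (the matrices in the question have no column names, so colnames returns NULL). A minimal sketch with Pearson correlation follows; cor_stat, est and pval are names made up for this example.

```r
set.seed(1)
mat1 <- matrix(rnorm(20), 4, 5)
mat2 <- matrix(rnorm(12), 4, 3)

# Pull one component ("estimate" or "p.value") out of cor.test for columns i, j
cor_stat <- function(i, j, what) unname(cor.test(mat1[, i], mat2[, j])[[what]])

idx1 <- seq_len(ncol(mat1))
idx2 <- seq_len(ncol(mat2))

# outer needs a vectorised function, hence Vectorize
est  <- outer(idx1, idx2, Vectorize(function(i, j) cor_stat(i, j, "estimate")))
pval <- outer(idx1, idx2, Vectorize(function(i, j) cor_stat(i, j, "p.value")))

dim(est)                          # 5 3
all.equal(est, cor(mat1, mat2))   # estimates agree with cor()
```

This runs cor.test twice per pair of columns; if that matters, the loop or apply versions can store both components from a single call.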

An alternative, slightly easier to understand than loops in my opinion:
sapply(
  data.frame(mat1),
  function(x) Map(function(a, b) cor.test(a, b)$p.value,
                  list(x),
                  as.data.frame(mat2))
)
Result:
# X1 X2 X3 X4 X5
#[1,] 0.7400541 0.8000358 0.5084979 0.4441933 0.9104712
#[2,] 0.2918163 0.2764817 0.956807 0.6072979 0.4395218
#[3,] 0.2866105 0.4095909 0.5648188 0.1746428 0.9125866

I suppose you would like to do it without for loops. With base R, here is the double apply approach:
apply(mat1, 2, function(col_mat1) {
  apply(mat2, 2, function(col2, col1) {
    cor.test(col2, col1)$p.value
  }, col1 = col_mat1)
})
The outer apply iterates over the columns of mat1 and supplies one side of cor.test(). The inner one does the same for mat2 and fills the other side of cor.test(). In practice, apply replaces the for loops.

I think all you need is to define your matrix first:
mat_cor <- matrix(nrow = ncol(mat1), ncol = ncol(mat2))
for (i in 1:5) {
  for (j in 1:3) {
    mat_cor[i, j] <- cor.test(mat1[, i], mat2[, j])$p.value
  }
}
Output
mat_cor
[,1] [,2] [,3]
[1,] 0.9455569 0.8362242 0.162569342
[2,] 0.7755360 0.9849619 0.775006329
[3,] 0.8799139 0.8050564 0.001358697
[4,] 0.1574388 0.1808167 0.618624825
[5,] 0.8571844 0.8897125 0.879818822

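You can try something like this:
pv <- c()
for (i in 1:dim(mat1)[2]) {
  for (j in 1:dim(mat2)[2]) {
    # append each estimate to pv
    pv <- c(pv, cor.test(mat1[, i], mat2[, j])$estimate)
  }
}
# the values were collected row by row (j varies fastest), so fill by row
pv <- matrix(pv, nrow = dim(mat1)[2], ncol = dim(mat2)[2], byrow = TRUE)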
