Calculate Euclidean distance between multiple pairs of points in dataframe in R - r

I'm trying to calculate the Euclidean distance between pairs of points in a dataframe in R, and there's an ID for each pair:
ID <- sample(1:10, 10, replace=FALSE)
P <- runif(10, min=1, max=3)
S <- runif(10, min=1, max=3)
testdf <- data.frame(ID, P, S)
I found several ways to calculate the Euclidean distance in R, but I'm either getting an error, returning only 1 value (so it's computing the distance between the entire vector), or I end up with a matrix when all I need is a 4th column with the distance between each pair (columns 'P' and 'S.') I'm a bit confused by matrices so I'm not sure how to work with that result.
Tried making a function and applying it to the 2 columns but I get an error:
testdf$V <- apply(testdf[ , c('P', 'S')], 1, function(P, S) sqrt(sum((P^2, S^2)))
# Error in FUN(newX[, i], ...) : argument "S" is missing, with no default
Then tried using the dist() function in the stats package but it only returns 1 value:
(Same problem if I follow the method here: https://www.statology.org/euclidean-distance-in-r/)
P <- testdf$P
S <- testdf$S
testProbMatrix <- rbind(P, S)
stats::dist(testProbMatrix, method = "euclidean")
# returns only 1 distance
Returns a matrix
(Here's a nice explanation why: Calculate the distances between pairs of points in r)
stats::dist(cbind(P, S), method = "euclidean")
But I'm confused how to pull the distances out of the matrix and attach them to the correct ID for each pair of points. I don't understand why I have to make a matrix instead of just applying the function to the dataframe - matrices have always confused me.
I think this is the same question as here (Finding euclidean distance between all pair of points) but for R instead of Python
Thanks for the help!

Try this out if you would just like to add another column to your dataframe
testdf$distance <- sqrt((P^2 + S^2))

Related

What causes the difference between calc and cellStats in raster calculations in R?

I am working with a dataset that consists of 20 layers, stacked in a RasterBrick (originating from an array). I have looked into the sum of the layers, calculated with both 'calc' and 'cellStats'. I have used calc to calculate the sum of the total values and cellStats to look at the average of the values per layer (useful for a time series).
However, when I sum the average of each layer, it is half the value of the other calculated sum. What causes this difference? What am I overlooking?
Code looks like this:
testarray <- runif(54214776,0,1)
# Although testarray should contain a raster of 127x147 with 2904 time layers.
# Not really sure how to create that yet.
for (i in 1830:1849){
slice<-array2[,,i]
r <- raster(nrow=(127*5), ncol=(147*5), resolution =5, ext=ext1, vals=slice)
x <- stack(x , r)
}
brickhp2 <- brick(x)
r_sumhp2 <- calc(brickhp2, sum, na.rm=TRUE)
r_sumhp2[r_sumhp2<= 0] <- NA
SWEavgpertimestepM <- cellStats(brickhp2, stat='mean', na.rm=TRUE)
The goal is to compare the sum of the layers calculated with 'calc(x, sum)' with the sum of the mean calculated with 'cellStats(x, mean)'.
Rasterbrick looks like this (600kb, GTiff) : http://www.filedropper.com/brickhp2
*If there is a better way to share this, please let me know.
The confusion comes as you are using calc which operates pixel-wise on a brick (i.e. performs the calculation on the 20 values at each pixel and returns a single raster layer) and cellStats which performs the calculation on each raster layer individually and returns a single values for each layer. You can see that the results are comparable if you use this code:
library(raster)
##set seed so you get the same runif vals
set.seed(999)
##create example rasters
ls=list()
for (i in 1:20){
r <- raster(nrow=(127*5), ncol=(147*5), vals=runif(127*5*147*5))
ls[[i]] <- r
}
##create raster brick
brickhp2 <- brick(ls)
##calc sum (pixel-wise)
r_sumhp2 <- calc(brickhp2, sum, na.rm=TRUE)
r_sumhp2 ##returns raster layer
##calc mean (layer-wise)
r_meanhp2 <- cellStats(brickhp2, stat='mean', na.rm=TRUE)
r_meanhp2 ##returns vector of length nlayers(brickhp2)
##to get equivalent values you need to divide r_sumhp2 by the number of layers
##and then calculate the mean
cellStats(r_sumhp2/nlayers(brickhp2),stat="mean")
[1] 0.4999381
##and for r_meanhp2 you need to calculate the mean of the means
mean(r_meanhp2)
[1] 0.4999381
You will need to determine for yourself if you want to use the pixel or layer wise result for your application.

Implementing KNN with different distance metrics using R

I am working on a dataset in order to compare the effect of different distance metrics. I am using the KNN algorithm.
The KNN algorithm in R uses the Euclidian distance by default. So I wrote my own one. I would like to find the number of correct class label matches between the nearest neighbor and target.
I have prepared the data at first. Then I called the data (wdbc_n), I chose K=1. I have used Euclidian distance as a test.
library(philentropy)
knn <- function(xmat, k,method){
n <- nrow(xmat)
if (n <= k) stop("k can not be more than n-1")
neigh <- matrix(0, nrow = n, ncol = k)
for(i in 1:n) {
ddist<- distance(xmat, method)
neigh[i, ] <- order(ddist)[2:(k + 1)]
}
return(neigh)
}
wdbc_nn <-knn(wdbc_n ,1,method="euclidean")
Hoping to get a similar result to the paper ("on the surprising behavior of distance metrics in high dimensional space") (https://bib.dbvis.de/uploadedFiles/155.pdf, page 431, table 3).
My question is
Am I right or wrong with the codes?
Any suggestions or reference that will guide me will be highly appreciated.
EDIT
My data (breast-cancer-wisconsin)(wdbc) dimension is
569 32
After normalizing and removing the id and target column the dimension is
dim(wdbc_n)
569 30
The train and test split is given by
wdbc_train<-wdbc_n[1:469,]
wdbc_test<-wdbc_n[470:569,]
Am I right or wrong with the codes?
Your code is wrong.
The call to the distance function taked about 3 seconds every time on my rather recent PC so I only did the first 30 rows for k=3 and noticed that every row of the neigh matrix was identical. Why is that? Take a look at this line:
ddist<- distance(xmat, method)
Each loop feeds the whole xmat matrix at the distance function, then uses only the first line from the resulting matrix. This calculates the distance between the training set rows, and does that n times, discarding every row except the first. Which is not what you want to do. The knn algorithm is supposed to calculate, for each row in the test set, the distance with each row in the training set.
Let's take a look at the documentation for the distance function:
distance(x, method = "euclidean", p = NULL, test.na = TRUE, unit =
"log", est.prob = NULL)
x a numeric data.frame or matrix (storing probability vectors) or a
numeric data.frame or matrix storing counts (if est.prob is
specified).
(...)
in case nrow(x) = 2 : a single distance value. in case nrow(x) > 2 :
a distance matrix storing distance values for all pairwise probability
vector comparisons.
In your specific case (knn classification), you want to use the 2 row version.
One last thing: you used order, which will return the position of the k largest distances in the ddist vector. I think what you want is the distances themselves, so you need to use sort instead of order.
Based on your code and the example in Lantz (2013) that your code seemed to be based on, here is a complete working solution. I took the liberty to add a few lines to make a standalone program.
Standalone working solution(s)
library(philentropy)
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
knn <- function(train, test, k, method){
n.test <- nrow(test)
n.train <- nrow(train)
if (n.train + n.test <= k) stop("k can not be more than n-1")
neigh <- matrix(0, nrow = n.test, ncol = k)
ddist <- NULL
for(i in 1:n.test) {
for(j in 1:n.train) {
xmat <- rbind(test[i,], train[j,]) #we make a 2 row matrix combining the current test and train rows
ddist[j] <- distance(as.data.frame(xmat), method, k) #then we calculate the distance and append it to the ddist vector.
}
neigh[i, ] <- sort(ddist)[2:(k + 1)]
}
return(neigh)
}
wbcd <- read.csv("https://resources.oreilly.com/examples/9781784393908/raw/ac9fe41596dd42fc3877cfa8ed410dd346c43548/Machine%20Learning%20with%20R,%20Second%20Edition_Code/Chapter%2003/wisc_bc_data.csv")
rownames(wbcd) <- wbcd$id
wbcd$id <- NULL
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
wbcd_train<-wbcd_n[1:469,]
wbcd_test<-wbcd_n[470:549,]
wbcd_nn <-knn(wbcd_train, wbcd_test ,3, method="euclidean")
Do note that this solution might be slow because of the numerous (100 times 469) calls to the distance function. However, since we are only feeding 2 rows at a time into the distance function, it makes the execution time manageable.
Now does that work?
The two first test rows using the custom knn function:
[,1] [,2] [,3]
[1,] 0.3887346 0.4051762 0.4397497
[2,] 0.2518766 0.2758161 0.2790369
Let us compare with the equivalent function in the FNN package:
library(FNN)
alt.class <- get.knnx(wbcd_train, wbcd_test, k=3, algorithm = "brute")
alt.class$nn.dist
[,1] [,2] [,3]
[1,] 0.3815984 0.3887346 0.4051762
[2,] 0.2392102 0.2518766 0.2758161
Conclusion: not too shabby.

How to combine data from different columns, e.g. mean of surrounding columns for a given column

I am trying to smooth a matrix by attributing the mean value of a window covering n columns around a given column. I've managed to do it but I'd like to see how would be 'the R way' of doing it as I am making use of for loops. Is there a way to get this using apply or some function of the same family?
Example:
# create a toy matrix
mat <- matrix(ncol=200);
for(i in 1:100){ mat <- rbind(mat,sample(1:200, 200) )}
# quick visualization
image(t(mat))
This is the matrix before smoothing:
I wrote the function smooth_mat that takes a matrix and the length of the smoothing kernel:
smooth_row_mat <- function(k, k.d=5){
k.range <- (k.d + 2):(ncol(k) - k.d - 1)
k.smooth <- matrix(nrow=nrow(k))
for( i in k.range){
if (i %% 10 == 0) cat('\r',round(i/length(k.range), 2))
k.smooth <- cbind( k.smooth, rowMeans(k[,c( (i-1-k.d):(i-1) ,i, (i+1):(i + 1 - k.d) )]) )
}
return(k.smooth)
}
Now we use smooth_row_mat() with mat
mat.smooth <- smooth_mat(mat)
And we have successfully smoothed, on a row basis, the content of the matrix.
This is the matrix after:
This method is good for such a small matrix although my real matrices are around 40,000 x 400, still works but I'd like to improve my R skills.
Thanks!
You can apply a filter (running mean) across each row of your matrix as follows:
apply(k, 1, filter, rep(1/k.d, k.d))
Here's how I'd do it, with the raster package.
First, create a matrix filled with random data and coerce it to a raster object.
library(raster)
r <- raster(matrix(sample(200, 200*200, replace=TRUE), nc=200))
plot(r)
Then use the focal function to calculate a neighbourhood mean for a neighbourhood of n cells either side of the focal cell. The values in the matrix of weights you provide to the focal function determine how much the value of each cell contributes to the focal summary. For a mean, we say we want each cell to contribute 1/n, so we fill a matrix of n columns, with values 1/n. Note that n must be an odd number, and the cell in the centre of the matrix is considered the focal cell.
n <- 3
smooth_r <- focal(r, matrix(1/n, nc=n))
plot(smooth_r)

Recalculating distance matrix

I’ve got a large input matrix (4000x10000). I use dist() to calculate the Euclidean distance matrix for it (it takes about 5 hours).
I need to calculate the distance matrix for the "same" matrix with an additional row (for a 4001x10000 matrix). What is the fastest way to determine the distance matrix without recalculating the whole matrix?
I'll assume your extra row means an extra point. If it means an extra variable/dimension, it will call for a different answer.
First of all, for euclidean distance of matrices, I'd recommend the rdist function from the fields package. It is written in Fortran and is a lot faster than the dist function. It returns a matrix instead of a dist object, but you can always go from one to the other using as.matrix and as.dist.
Here is (smaller than yours) sample data
num.points <- 400
num.vars <- 1000
original.points <- matrix(runif(num.points * num.vars),
nrow = num.points, ncol = num.vars)
and the distance matrix you already computed:
d0 <- rdist(original.points)
For the extra point(s), you only need to compute the distances among the extra points and the distances between the extra points and the original points. I will use two extra points to show that the solution is general to any number of extra points:
extra.points <- matrix(runif(2 * num.vars), nrow = 2)
inner.dist <- rdist(extra.points)
outer.dist <- rdist(extra.points, original.points)
so you can bind them to your bigger distance matrix:
d1 <- rbind(cbind(d0, t(outer.dist)),
cbind(outer.dist, inner.dist))
Let's check that it matches what a full, long rerun would have produced:
d2 <- rdist(rbind(original.points, extra.points))
identical(d1, d2)
# [1] TRUE

applying the pvclust R function to a precomputed dist object

I'm using R to perform an hierarchical clustering. As a first approach I used hclust and performed the following steps:
I imported the distance matrix
I used the as.dist function to transform it in a dist object
I run hclust on the dist object
Here's the R code:
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
hclust(d, "ward")
At this point I would like to do something similar with the function pvclust; however, I cannot because it's not possible to pass a precomputed dist object. How can I proceed considering that I'm using a distance not available among those provided by the dist function of R?
I've tested the suggestion of Vincent, you can do the following (my data set is a dissimilarity matrix):
# Import you data
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
# Compute the eigenvalues
x <- cmdscale(d,1,eig=T)
# Plot the eigenvalues and choose the correct number of dimensions (eigenvalues close to 0)
plot(x$eig,
type="h", lwd=5, las=1,
xlab="Number of dimensions",
ylab="Eigenvalues")
# Recover the coordinates that give the same distance matrix with the correct number of dimensions
x <- cmdscale(d,nb_dimensions)
# As mentioned by Stéphane, pvclust() clusters columns
pvclust(t(x))
If the dataset is not too large, you can embed your n points in a space of dimension n-1, with the same distance matrix.
# Sample distance matrix
n <- 100
k <- 1000
d <- dist( matrix( rnorm(k*n), nc=k ), method="manhattan" )
# Recover some coordinates that give the same distance matrix
x <- cmdscale(d, n-1)
stopifnot( sum(abs(dist(x) - d)) < 1e-6 )
# You can then indifferently use x or d
r1 <- hclust(d)
r2 <- hclust(dist(x)) # identical to r1
library(pvclust)
r3 <- pvclust(x)
If the dataset is large, you may have to check how pvclust is implemented.
It's not clear to me whether you only have a distance matrix, or you computed it beforehand. In the former case, as already suggested by #Vincent, it would not be too difficult to tweak the R code of pvclust itself (using fix() or whatever; I provided some hints on another question on CrossValidated). In the latter case, the authors of pvclust provide an example on how to use a custom distance function, although that means you will have to install their "unofficial version".

Resources