Computing distances relative to other rows (using R)

I have a set of columns with numerical values that describe a given object (row) in a 5-dimensional space. I want to compute the distance of each object from a fixed reference object at various times. I can group_by time, but I'm not sure how to do the distance computation itself. I want to use the squared Euclidean distance between these objects in 5-dimensional space, so at each time the reference object should be at distance 0 from itself.
The metric should be: the distance from object x to the reference object is
(x1 - Reference1)^2 + (x2 - Reference2)^2 + ...
I'm VERY new to working in R (and to programming in general), so I was hoping this exercise would help me learn; I apologize if my question is not appropriate.
My data looks like

Distances from rows to other rows can be done in base R with this:
mtx <- structure(c(2.8, 6.4, 1.7, 3.2, 24.2, 25.5, 5.4, 16.2, 15.6, 25.1, 8.6, 15.4, 0.7, 0.8, 0.1, 0.5, 0.1, 0.4, 0.04, 0.2), .Dim = 4:5)
outer(seq_len(nrow(mtx)), seq_len(nrow(mtx)),
      function(a, b) rowSums((mtx[a,] - mtx[b,])^2))
# [,1] [,2] [,3] [,4]
# [1,] 0.0000 105.0000 404.0136 64.2500
# [2,] 105.0000 0.0000 698.9696 190.9500
# [3,] 404.0136 698.9696 0.0000 165.3156
# [4,] 64.2500 190.9500 165.3156 0.0000
Granted, you only need to calculate less than half of that matrix, since the diagonal is always zero and the upper and lower triangles mirror each other, but this gives you what you need. For instance, the distances from the third row to all other rows are in the third row (and in the third column).
If all you need is one row compared to all others, then
rowSums((mtx[rep(3,nrow(mtx)),] - mtx)^2)
# [1] 404.0136 698.9696 0.0000 165.3156
The mtx[rep(3,nrow(mtx)),] creates a same-size matrix so that subtraction works seamlessly.
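Since the question mentions group_by, here is a minimal sketch of how the one-row-versus-all idea could be applied per time group with dplyr. The data frame df, the column names time and x1..x5, and the rule that the reference object is the first row of each group are illustrative assumptions, not details from the original data:
library(dplyr)

# Toy data: two time points with three objects each and five numeric columns
# (these names are made up for the example).
df <- data.frame(
  time = rep(1:2, each = 3),
  x1 = runif(6), x2 = runif(6), x3 = runif(6),
  x4 = runif(6), x5 = runif(6)
)

df %>%
  group_by(time) %>%
  mutate(dist_sq = (x1 - first(x1))^2 + (x2 - first(x2))^2 +
                   (x3 - first(x3))^2 + (x4 - first(x4))^2 +
                   (x5 - first(x5))^2) %>%
  ungroup()
Within each time group the first row then has dist_sq equal to 0, as required.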

Related

Clustering values with given threshold

I have several vectors:
a <- c(1.1, 2.9, 3.9, 5.2)
b <- c(1.0, 1.9, 4.0, 5.1)
c <- c(0.9, 2.1, 3.1, 4.1, 5.0, 11.13)
They can have different lengths.
I want to combine them into one general vector: where similar values appear in all of the vectors (or in any pair of them) they should be averaged, and where a value appears in only one vector it should be kept as is. For the averaging I would like to use a threshold of 0.2.
My explanation could be a bit confusing, but here is the general vector I want to obtain:
d <- c(1, 2, 3, 4, 5.1, 11.13)
I have around 12 vectors and about 2000 values in each vector.
I will be glad for any help.
This seems like a clustering problem (clustering by distance). You can try the code below:
library(igraph)
v <- sort(c(a, b, c))
tapply(
  v,
  membership(components(graph_from_adjacency_matrix(
    as.matrix(dist(v)) <= 0.2 + sqrt(.Machine$double.eps)
  ))),
  mean
)
which gives
1 2 3 4 5 6
1.00 2.00 3.00 4.00 5.10 11.13
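If you would rather avoid the igraph dependency, here is a minimal base-R sketch of the same single-linkage idea (consecutive sorted values whose gap is within the threshold end up in the same group); it assumes the vectors a, b and c from the question are already defined:
v <- sort(c(a, b, c))
grp <- cumsum(c(TRUE, diff(v) > 0.2 + sqrt(.Machine$double.eps)))
tapply(v, grp, mean)
which should give the same averages as the igraph version above.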

Why does distHaversine return NaN for some pairs of coordinates?

I am trying to calculate the distance between all possible pairs of coordinates in a list of coordinates. However, I am surprisingly getting NaN for some pairs of coordinates whose distance should definitely be computable.
My coordinates are in decimal longitude and latitude. I am using the distHaversine function from the geosphere package in R.
Here is a sample pair of coordinates for which distHaversine returns NaN. With many other pairs of coordinates the same code works correctly.
library(geosphere)

# Create a long/lat matrix with two sets of coordinates
coord_list <- matrix(c(2.5, -177.5, 5.5, -5.5), ncol = 2)
coord_list

# Create a matrix with the distance between each pair of points (zero on the diagonal)
dist <- apply(coord_list, 1,
              FUN = function(X) {
                distHaversine(X, coord_list)
              })
dist
# [,1] [,2]
#[1,] 0 NaN
#[2,] NaN 0
In case it's relevant, I need these distances for an inverse distance weighting matrix for spatially weighted regressions. However, I'd rather figure out why distHaversine is occasionally returning NaN than calculate the matrix differently (which I know how to do).
Thanks for your help!
Thanks for reporting. This was fixed in geosphere 1.5-7.
I would use the very precise geosphere::distGeo function instead of geosphere::distHaversine (which is mostly of historical interest).
To get the distances from all points to all points, you can use the distm function:
coords <- matrix(c(2.5, -177.5, 5.5, -5.5, 0, 0), ncol=2)
library(geosphere)
distm(coords)
# [,1] [,2] [,3]
#[1,] 0.0 19395754 693590.1
#[2,] 19395754.2 0 19703549.9
#[3,] 693590.1 19703550 0.0
Or the pointDistance function in the raster package (which uses the same algorithm as distGeo):
library(raster)
pointDistance(coords, lonlat=TRUE)
# [,1] [,2] [,3]
#[1,] 0.0 NA NA
#[2,] 19395754.2 0 NA
#[3,] 693590.1 19703550 0
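For what it is worth, here is a rough sketch of why a naive haversine implementation can return NaN for (near-)antipodal pairs such as the one above: floating-point rounding can push the intermediate value just above 1, so asin(sqrt(.)) is given an argument outside its domain. This is only an illustration of the kind of issue fixed in geosphere 1.5-7, not the package's actual code; the clamped variant shows one common fix.
haversine_naive <- function(p1, p2, r = 6378137) {
  # p1, p2 are c(lon, lat) in degrees
  p1 <- p1 * pi / 180; p2 <- p2 * pi / 180
  a <- sin((p2[2] - p1[2]) / 2)^2 +
       cos(p1[2]) * cos(p2[2]) * sin((p2[1] - p1[1]) / 2)^2
  2 * r * asin(sqrt(a))            # NaN whenever rounding makes a > 1
}
haversine_clamped <- function(p1, p2, r = 6378137) {
  p1 <- p1 * pi / 180; p2 <- p2 * pi / 180
  a <- sin((p2[2] - p1[2]) / 2)^2 +
       cos(p1[2]) * cos(p2[2]) * sin((p2[1] - p1[1]) / 2)^2
  2 * r * asin(sqrt(pmin(1, a)))   # clamp to the valid domain
}
haversine_naive(c(2.5, 5.5), c(-177.5, -5.5))    # can be NaN for antipodal points
haversine_clamped(c(2.5, 5.5), c(-177.5, -5.5))  # about half the sphere's circumference, ~2e7 m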

Poorly implemented two-sample Kolmogorov-Smirnov test (kstest2) in Matlab?

Am I missing something obvious, or is Matlab's kstest2 giving very poor p-values?
By "very poor" I mean that I suspect it may even be implemented incorrectly.
The help page of kstest2 states that the function calculates an asymptotic p-value, though I did not find any reference to which method is used exactly. Anyway, the description further states:
asymptotic p-value becomes very accurate for large sample sizes, and is believed to be reasonably accurate for sample sizes n1 and n2, such that (n1*n2)/(n1 + n2) ≥ 4
Example 1
Let's take Example 6 from Lehmann and D'Abrera (1975):
sampleA = [6.8, 3.1, 5.8, 4.5, 3.3, 4.7, 4.2, 4.9];
sampleB = [4.4, 2.5, 2.8, 2.1, 6.6, 0.0, 4.8, 2.3];
[h,p,ks2stat] = kstest2(sampleA, sampleB, 'Tail', 'unequal');
(n1*n2)/(n1 + n2) = 4 in this case so the p-value should be reasonably accurate.
Matlab yields p = 0.0497, while the solution given in the book is 0.0870.
To validate the solution I used R, which I trust more than Matlab, especially in statistics.
Using ks.test from the stats package and ks.boot from the Matching package:
ks.test(sampleA, sampleB, alternative = "two.sided")
ks.boot(sampleA, sampleB, alternative = "two.sided")
Both give p = 0.0870.
Example 2
Let's use kstest2's own example to compare the Matlab and R results for a larger sample size:
rng(1); % For reproducibility
x1 = wblrnd(1,1,1,50);
x2 = wblrnd(1.2,2,1,50);
[h,p,ks2stat] = kstest2(x1,x2);
This yields p = 0.0317. Now, using the same x1 and x2 vectors R gives p = 0.03968.
That is about a 20% difference, even though a very accurate result is expected: (n1*n2)/(n1 + n2) = 25.
Am I missing or messing up something?
Is it possible that Matlab's kstest2 performs as poorly as these examples indicate? What approximation or algorithm does kstest2 use? (I can see the implemented code for kstest2, but a reference to a book or paper would be much better for understanding what is going on.)
I am using Matlab 2016a.
Lehmann, E. L., and D'Abrera, H. J. M. (1975). Nonparametrics: Statistical Methods Based on Ranks. 1st edition. Springer.
I think that the correct test to compare with R's ks.test would be Octave's kolmogorov_smirnov_test_2:
sampleA = [6.8, 3.1, 5.8, 4.5, 3.3, 4.7, 4.2, 4.9];
sampleB = [4.4, 2.5, 2.8, 2.1, 6.6, 0.0, 4.8, 2.3];
kolmogorov_smirnov_test_2(sampleA, sampleB)
pval: 0.0878664
The difference appears to be the use of ks versus lambda as the value passed to the asymptotic CDF, i.e.
ks = sqrt (n) * d;
pval = 1 - kolmogorov_smirnov_cdf (ks);
versus
lambda = max((sqrt(n) + 0.12 + 0.11/sqrt(n)) * d , 0);
pval = 1 - kolmogorov_smirnov_cdf (lambda);
I presume the different test statistics arise from differences in the research papers cited by these two functions. If you want a deeper dive into the statistical theory, you may want to ask on Cross Validated.
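As a quick cross-check from the R side (a sketch, not part of the original answer): explicitly requesting the exact two-sample p-value from ks.test reproduces the book's value for Example 1, since for small samples without ties R can compute the exact distribution of the statistic.
sampleA <- c(6.8, 3.1, 5.8, 4.5, 3.3, 4.7, 4.2, 4.9)
sampleB <- c(4.4, 2.5, 2.8, 2.1, 6.6, 0.0, 4.8, 2.3)
ks.test(sampleA, sampleB, alternative = "two.sided", exact = TRUE)$p.value
# ~ 0.087, matching the value quoted above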

Matrix of pairwise distances

I have a set of point coordinates and I want to use it to generate a matrix of distances. More specifically, I have two sets of points given as 2-d coordinates, A of size n and B of size m, and I want a matrix containing all Euclidean distances between points from A and points from B, and no other distances.
Edit: what if the situation is more complicated: suppose I have my matrix, but now I want to divide each row of it by the sum of the Euclidean distances from the corresponding point of A to all the points in set B, i.e. normalise each row of distances. Is there an efficient way to do that?
set.seed(101)
n <- 10; m <- 20
A <- data.frame(x=runif(n),y=runif(n))
B <- data.frame(x=runif(m),y=runif(m))
We want
sqrt((x_{1,i}-x_{2,j})^2+(y_{1,i}-y_{2,j})^2)
for every i=1:n and j=1:m.
You can do this via
dists <- sqrt(outer(A$x,B$x,"-")^2 + outer(A$y,B$y,"-")^2)
which in this case is a 10x20 matrix. In words, we're finding the difference ("-" is a reference to the subtraction operator) between each pair of x values and each pair of y values, squaring, adding, and taking the square root.
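To convince yourself that the outer() construction matches the formula, a small sanity check against an explicit double loop could look like this (a sketch reusing the A, B, n, m and dists objects defined above):
check <- matrix(NA_real_, n, m)
for (i in seq_len(n)) for (j in seq_len(m))
  check[i, j] <- sqrt((A$x[i] - B$x[j])^2 + (A$y[i] - B$y[j])^2)
all.equal(dists, check)
# [1] TRUE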
If you want to normalize every row by its sum, I would suggest
norm.dists <- sweep(dists,MARGIN=1,STATS=rowSums(dists),FUN="/")
The dist(...) function in base R will not be helpful here, because it calculates auto-distances (the distance from every point to every other point within a single dataset). You want cross-distances. There is a dist(...) function in the proxy package which is designed for this.
Using the dataset kindly provided by @BenBolker:
library(proxy) # note that this masks the dist(...) fn in base R...
result <- dist(A,B)
result[1:5,1:5]
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.5529902 0.7303561 0.1985409 0.6184414 0.7344280
# [2,] 0.7109408 0.9506428 0.1778637 0.7216595 0.9333687
# [3,] 0.2971463 0.3809688 0.4971621 0.4019629 0.3995298
# [4,] 0.4985324 0.5737397 0.4760870 0.5986826 0.5993541
# [5,] 0.4513063 0.7071025 0.3077415 0.4289675 0.6761988
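As a small follow-up on the normalisation step (assuming dists and norm.dists from the first answer are still in the workspace): the sweep() call can equivalently be written as a plain division, because dividing a matrix by a vector recycles the vector down the rows.
norm.dists2 <- dists / rowSums(dists)
all.equal(norm.dists, norm.dists2)
# [1] TRUE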

In a data set, I want to change each element to a p-value

c<- c(1.88, 2.33, -2.4, -0.6)
dim(c)<-c(2,2)
I have a data set, a 9x12 matrix.
The data set is standardized to be normal, so I can compare the elements.
For easier comparison, I want to convert each value to a p-value.
How can I do that? (Please use the matrix above.)
Please let me know.
Don't use c for a variable name (you know better):
A <- c(1.88, 2.33, -2.4, -0.6)
dim(A) <- c(2,2)
You are looking for pnorm:
pnorm(A)
# [,1] [,2]
# [1,] 0.9699460 0.008197536
# [2,] 0.9900969 0.274253118
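Note that pnorm(A) gives the lower-tail probability P(Z <= z) for each value. If what is actually wanted is a two-sided p-value for each standardised value, a small variant (an assumption about the intent, not stated in the question) would be:
2 * pnorm(-abs(A))
# roughly 0.060 and 0.020 in column 1, 0.016 and 0.549 in column 2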
