R: creating a distance matrix quickly (using mapply() or similar)

I am looking to create a distance matrix for any arbitrary non-standard distance function.
I can do this the slow way as follows:
set.seed(1000)
DF <- data.frame(x=rnorm(10), y=rnorm(10)) # ten random points on the x-y plane
L <- nrow(DF) # number of points in DF
F <- function(P1, P2){sqrt((P2$x-P1$x)^2 + (P2$y-P1$y)^2 + 1)}
# Almost the Euclidean distance, but with an added 1 to make it nonstandard
M <- matrix(nrow=L, ncol=L)
# Find the distances between every point in DF and every other point in DF
for(i in 1:L){
  for(j in 1:L){
    M[i,j] <- F(DF[i,], DF[j,])
  }
}
M
which gives:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1.000000 1.326971 1.566994 1.708761 1.114078 1.527042 1.514868 1.836636 1.521510 1.813663
[2,] 1.326971 1.000000 1.735444 2.143117 1.336652 1.482555 1.427014 2.245816 2.153173 1.271712
[3,] 1.566994 1.735444 1.000000 1.190212 1.951701 1.088288 1.126241 1.212367 2.388228 1.734505
[4,] 1.708761 2.143117 1.190212 1.000000 2.123664 1.461169 1.523137 1.013764 2.267420 2.271950
[5,] 1.114078 1.336652 1.951701 2.123664 1.000000 1.851806 1.822077 2.263007 1.447333 1.934958
[6,] 1.527042 1.482555 1.088288 1.461169 1.851806 1.000000 1.004188 1.497537 2.459305 1.406153
[7,] 1.514868 1.427014 1.126241 1.523137 1.822077 1.004188 1.000000 1.564111 2.460997 1.344779
[8,] 1.836636 2.245816 1.212367 1.013764 2.263007 1.497537 1.564111 1.000000 2.415824 2.327128
[9,] 1.521510 2.153173 2.388228 2.267420 1.447333 2.459305 2.460997 2.415824 1.000000 2.818048
[10,] 1.813663 1.271712 1.734505 2.271950 1.934958 1.406153 1.344779 2.327128 2.818048 1.000000
Obviously, with two nested for loops in R, this will be very slow for a dataset of any real size.
I would like to speed this up using a function such as mapply() or outer(), but I am unsure how to do it.
I've had a good look for similar questions, but I can't find one that gives an adequate answer without resorting to Rcpp:
Create a distance matrix in R using parallelization
Create custom distance matrix function in R
Speed Up Distance Calculations
Trying the advice given in the link below gives me:
pairwise comparison with all vectors of a list
outer(DF,DF,FUN=Vectorize(F))
Error: $ operator is invalid for atomic vectors
or
outer(DF,DF,FUN=F)
Error in dim(robj) <- c(dX, dY) :
dims [product 4] do not match the length of object [10]

You can use outer() to replace the nested loop, but it has to be applied to row indices rather than to the data frame itself: outer() treats its first two arguments as vectors, so passing DF directly hands your function atomic vectors (hence the $ error above). Computing over index vectors and subsetting DF inside the anonymous function works:
set.seed(1000)
L <- 10
DF <- data.frame(x=rnorm(L), y=rnorm(L))
FN <- function (P1,P2) {sqrt((P2$x - P1$x)^2 + (P2$y - P1$y)^2 + 1)}
outer(1:L, 1:L, FUN=function(x, y) FN(DF[x,], DF[y,]))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1.000000 1.326971 1.566994 1.708761 1.114078 1.527042 1.514868 1.836636 1.521510 1.813663
[2,] 1.326971 1.000000 1.735444 2.143117 1.336652 1.482555 1.427014 2.245816 2.153173 1.271712
[3,] 1.566994 1.735444 1.000000 1.190212 1.951701 1.088288 1.126241 1.212367 2.388228 1.734505
[4,] 1.708761 2.143117 1.190212 1.000000 2.123664 1.461169 1.523137 1.013764 2.267420 2.271950
[5,] 1.114078 1.336652 1.951701 2.123664 1.000000 1.851806 1.822077 2.263007 1.447333 1.934958
[6,] 1.527042 1.482555 1.088288 1.461169 1.851806 1.000000 1.004188 1.497537 2.459305 1.406153
[7,] 1.514868 1.427014 1.126241 1.523137 1.822077 1.004188 1.000000 1.564111 2.460997 1.344779
[8,] 1.836636 2.245816 1.212367 1.013764 2.263007 1.497537 1.564111 1.000000 2.415824 2.327128
[9,] 1.521510 2.153173 2.388228 2.267420 1.447333 2.459305 2.460997 2.415824 1.000000 2.818048
[10,] 1.813663 1.271712 1.734505 2.271950 1.934958 1.406153 1.344779 2.327128 2.818048 1.000000
Benchmark with DF <- data.frame(x=rnorm(100), y=rnorm(100)), i.e. a 100 x 100 distance matrix:
Unit: milliseconds
expr min lq mean median uq max neval
loop 647.080268 681.283754 720.842738 695.972994 728.078378 1057.16015 100
outer 7.892903 8.145765 8.661221 8.307392 8.710785 14.07253 100
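For reference, here is a sketch of how such a benchmark could be produced (this assumes the microbenchmark package; the wrapper names loop_fn and outer_fn are introduced here for illustration):
library(microbenchmark)
set.seed(1000)
DF <- data.frame(x=rnorm(100), y=rnorm(100))
L <- nrow(DF)
FN <- function(P1, P2){sqrt((P2$x - P1$x)^2 + (P2$y - P1$y)^2 + 1)}
loop_fn <- function() {
  M <- matrix(nrow=L, ncol=L)
  for(i in 1:L) for(j in 1:L) M[i,j] <- FN(DF[i,], DF[j,])
  M
}
outer_fn <- function() outer(1:L, 1:L, FUN=function(x, y) FN(DF[x,], DF[y,]))
microbenchmark(loop=loop_fn(), outer=outer_fn(), times=100L)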

You can use a nice simple method included in base R to calculate distances between the rows of a data frame or matrix of points:
dist(DF, method = "euclidean", diag = TRUE, upper = TRUE)
If you only want the lower triangle, leave out upper = TRUE; if you do not want to see the zero values on the diagonal of your triangle, set diag = FALSE.
This function can also do "manhattan", "minkowski" and "canberra" distances. Super simple.
For what you actually want (a custom metric), there is a package for R called usedist, which offers methods for defining distance matrices and functions for applying distance measures.
It has a function dist_make(), which applies a function to each pair of rows in a matrix (not a data frame).
You will need to retool your function to work on the rows of a matrix of your data, as sketched below.
Here is the documentation
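A minimal sketch of that retooling (assuming usedist is installed; FN_rows is a row-vector version of the metric, introduced here for illustration):
library(usedist)
m <- as.matrix(DF) # dist_make() wants a matrix, one point per row
FN_rows <- function(v1, v2) sqrt(sum((v2 - v1)^2) + 1) # same nonstandard metric on two row vectors
dist_make(m, FN_rows)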

Related

Bootstrapping elements of a matrix

I have a procedure that takes data and creates a square matrix M, where the elements of M correspond to certain features of the data. I wish to get a confidence interval for each element of the matrix to measure the uncertainty around each feature that I'm estimating. To do this, I want to bootstrap, so I initialize a list in R, resample my data, and store the resulting matrices in the list. How can I then estimate the 95% confidence interval for each element?
For example, say my original matrix M is 10 by 10, and the list of bootstrapped matrices is
boot_matrices <- list()
for(i in 1:1000){
  boot_matrices[[i]] <- matrix(rnorm(10*10, mean=0, sd=1), nrow=10, ncol=10)
}
I wish to calculate a confidence interval around each (i,j) element in M, where i=1...10 and j=1...10 by using the 1000 bootstrap matrices. In this toy example, I know the parametric distribution of each (i,j) element, and I'd expect the 95% confidence interval for each element to be around (-1.96, 1.96). However, in my real data, I don't know as much, and I don't want to assume as much (hence, using bootstrap). Is there a way to automatically calculate confidence intervals around each matrix element in this case?
How about something like this?
Create nBS bootstrap 10x10 matrices (I'm using set.seed(...) to ensure reproducibility of the sample data), and store the resulting matrices in a list.
# List of bootstrap matrices
nBS <- 1000;
set.seed(2017);
lst <- lapply(1:nBS, function(x)
matrix(rnorm(10 * 10, mean = 0, sd = 1), nrow = 10, ncol = 10));
Calculate the mean and standard deviation for every matrix element (i,j) across all bootstrap samples:
# Calculate mean and sd of every matrix element (i,j)
mat.mean <- apply(simplify2array(lst), c(1, 2), mean);
mat.sd <- apply(simplify2array(lst), c(1, 2), sd);
Calculate the 95% confidence interval as mean ± 1.96 × SEM (standard error of the mean):
# Calculate lower and upper 95% confidence interval
mat.lowerCI <- mat.mean - 1.96 * mat.sd / sqrt(nBS);
mat.upperCI <- mat.mean + 1.96 * mat.sd / sqrt(nBS);
Show mat.mean:
mat.mean;
[,1] [,2] [,3] [,4] [,5]
[1,] -0.011862801 -0.017872385 -2.059780e-02 -0.056602452 -0.077408704
[2,] 0.083863805 -0.057467756 -7.920189e-03 0.001923072 -0.010616517
[3,] -0.021193913 -0.021594100 -3.069827e-03 0.082500345 -0.015010818
[4,] -0.001063529 -0.028606045 6.366336e-02 0.021871973 0.014491280
[5,] -0.042912905 -0.020031203 7.075698e-03 0.032309070 0.051875125
[6,] -0.028336190 -0.055650895 -1.119998e-02 -0.030252861 -0.008670326
[7,] 0.006555878 -0.008686383 -1.928690e-02 -0.027290181 -0.002037219
[8,] 0.001513634 -0.057669094 -6.025466e-03 0.028409560 0.052159330
[9,] 0.044741065 -0.026265301 3.915427e-02 -0.011599341 0.006817949
[10,] 0.035356686 -0.039949595 -5.468612e-05 0.007272050 0.013150241
[,6] [,7] [,8] [,9] [,10]
[1,] 0.054420568 0.0050127337 -0.046358349 -0.029833662 -0.0525282034
[2,] -0.033703118 -0.0623761140 -0.029511715 -0.048816905 -0.0189984349
[3,] -0.013218223 -0.0278959480 -0.036351073 0.028833428 -0.0001538902
[4,] 0.029236408 -0.0046022995 0.019077031 0.069887669 -0.0283910941
[5,] -0.035474785 0.0372263523 0.021329823 0.006252149 0.0395028012
[6,] 0.008978299 0.0266740599 -0.006252266 -0.005793750 0.0072594645
[7,] 0.092958577 0.0047135528 0.019320387 0.011766436 -0.0021045223
[8,] 0.014867452 -0.0001325218 0.014760887 -0.027671024 0.0610503856
[9,] -0.031151561 0.0373095832 0.016197685 -0.050206244 -0.0561044648
[10,] 0.059817479 -0.0669659941 0.020218135 -0.039548025 0.0115156843
Lower and upper 95% confidence intervals are given in mat.lowerCI and mat.upperCI.
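If you would rather avoid the normal approximation entirely (which matches the spirit of the question), a percentile interval takes element-wise quantiles across the bootstrap samples; a minimal sketch (mat.lowerCI2 and mat.upperCI2 are new names):
# Element-wise 2.5% and 97.5% quantiles across all bootstrap samples
mat.lowerCI2 <- apply(simplify2array(lst), c(1, 2), quantile, probs = 0.025);
mat.upperCI2 <- apply(simplify2array(lst), c(1, 2), quantile, probs = 0.975);
For this toy example the percentile interval should come out near (-1.96, 1.96), whereas mean ± 1.96 × SEM describes the much narrower uncertainty of the element-wise mean.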

A for loop with rowSums function

I am a beginner in R and I have written a double for loop to calculate chi-squared values for selecting features among 6610 terms and 10 classes.
Here are my for loops:
library(raster)
# for chi2vals[n,r]: term n, class r. Terms start at col 7 and classes at col 6617
chi2vals <- matrix(0, 6610, 10)
chi2avgs <- vector("numeric", 6610L)
for(r in 1:10){
  for(n in 1:6610){
    A = sum(data1.sub.added[,6+n]==1 & data1.sub.added[,6616+r]==1)
    M = sum(data1.sub.added[,6+n]==1)
    P = sum(data1.sub.added[,6616+r]==1)
    N = nrow(data1.sub.added)
    E = ((A*N)-(M*P))**2
    F = (N-P)*(N-M)
    chi2vals[n,r] = (N/(P*M))*(E/F) # for term n
  }
  Prcj = sum(data1.sub.added[,6616+r]==1)/sum(data1.sub.added[,6616:6626]==1) # probability of class c_r
  pchi <- Prcj * chi2vals
  chi2avgs[n] = rowSums(pchi)[n]
}
The code correctly calculates everything up to the line pchi <- Prcj * chi2vals. The result is a nice matrix of p*chi2 values:
> head(pchi)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 128.36551442 0.239308113 0.683517530 1.5038665 0.6145058 3.656857e-01 1.3311564 2.6977448 0.410702803
[2,] 0.06632758 0.067970859 0.019178551 0.2900692 1.5300639 4.430705e-08 0.2599859 0.6362953 0.098745147
[3,] 1.85641330 1.411925435 3.590747764 7.3018416 38.8044465 4.102248e-01 6.4118078 13.0164994 1.709506238
[4,] 0.11063892 0.005039029 0.244964758 0.1622654 0.1156411 8.274468e+00 0.2564959 0.0577651 0.242946022
[5,] 0.04788648 0.049072885 0.001420669 0.2094211 1.7200152 2.045923e-01 0.1877019 0.1468187 0.005493183
[6,] 5.39946188 6.899336618 60.735646913 7.4351538 10.7005784 9.946261e+00 35.8868899 178.7112406 11.382740754
[,10]
[1,] 0.26436516
[2,] 0.14414444
[3,] 0.90292073
[4,] 0.01168997
[5,] 0.06641298
[6,] 19.68599142
But the final chi2avgs values mostly turn out to be zeros:
> head(chi2avgs)
[1] 0.000000 0.000000 0.000000 0.000000 2.638835 0.000000
However, when, outside the loop, I replace n with a literal index, the last line works fine:
chi2avgs[1] = rowSums(pchi)[1]
chi2avgs[2] = rowSums(pchi)[2]
chi2avgs[3] = rowSums(pchi)[3]
chi2avgs[4] = rowSums(pchi)[4]
chi2avgs[5] = rowSums(pchi)[5]
> head(chi2avgs)
[1] 136.476367 3.112781 75.416334 9.481914 2.638835 0.000000
I wonder what causes this problem. Do you have an idea how I can fix it?
You can call rowSums directly, without indexing by [n]. The assignment chi2avgs[n] = rowSums(pchi)[n] sits after the inner loop, so by that point n simply holds its final value and at most one element gets written on each pass. The vectorized call fills the whole vector at once:
chi2avgs = rowSums(pchi)
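A minimal sketch of the corrected tail of the r loop, assuming the same objects as in the question:
Prcj = sum(data1.sub.added[,6616+r]==1)/sum(data1.sub.added[,6616:6626]==1) # probability of class c_r
pchi <- Prcj * chi2vals
chi2avgs <- rowSums(pchi) # fills all 6610 entries in one vectorized call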

Get maximum distance between points in a vector (R)

I have two vectors of latitudes and longitudes. I would like to find the maximum distance between the points. The way I see it, I should get a matrix of distances between all points and get the max of those.
So far I’ve done (using geosphere package for the last command):
> lat = dt[assetId == u_assetIds[1000], latitude]
> lon = dt[assetId == u_assetIds[1000], longitude]
>
> head(cbind(lat, lon))
lat lon
[1,] 0.7266145 -1.512977
[2,] 0.7270650 -1.504216
[3,] 0.7267265 -1.499622
[4,] 0.7233676 -1.487970
[5,] 0.7232196 -1.443160
[6,] 0.7225059 -1.434848
>
> distm(c(lat_1K[1], lon_1K[1]), c(lat_1K[4], lon_1K[4]), fun = distHaversine)
[,1]
[1,] 2807.119
How do I convert the last command into giving me a matrix of all pairwise distances? I am not familiar of how to do that in R, having more experience in Python.
Thanks.
A quick read of the help document of distm turns up the following:
distm(x, y, fun=distHaversine)
x: longitude/latitude of point(s). Can be a vector of two numbers, a matrix of 2 columns (first one is longitude, second is latitude) or a SpatialPoints* object
y: Same as x. If missing, y is the same as x
So you should simply pass cbind(lon, lat) as the first argument x (note the order: distm expects longitude in the first column and latitude in the second). Here is a test:
> lat <- c(0.7266145, 0.7270650, 0.7267265, 0.7233676, 0.7232196, 0.7225059)
> lon <- c(-1.512977, -1.504216, -1.499622, -1.487970, -1.443160, -1.434848)
> distm(cbind(lon,lat))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.0000 976.4802 1486.6045 2806.912 7780.5544 8708.6036
[2,] 976.4802 0.0000 512.7471 1854.601 6809.6464 7738.0538
[3,] 1486.6045 512.7471 0.0000 1349.813 6296.9308 7225.3240
[4,] 2806.9123 1854.6008 1349.8129 0.000 4987.8561 5913.8213
[5,] 7780.5544 6809.6464 6296.9308 4987.856 0.0000 928.6189
[6,] 8708.6036 7738.0538 7225.3240 5913.821 928.6189 0.0000
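To finish answering the original question, take the maximum of this matrix:
d <- distm(cbind(lon, lat)) # pairwise Haversine distances, in metres
max(d)
# [1] 8708.604 for the six test points above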

R: How to avoid 2 'for' loops in R in this function

I know there are many topics on how to avoid R loops, but I was not able to understand how I could vectorize my iterations.
I have a data set which here I represent by m. I want to generate a new matrix with this function, that will be composed by the p.values of the correlation coefficients of each column of the data (m).
m <- matrix(rnorm(100), nrow=10, ncol=10)
sig.p <- function(x){
  n <- ncol(x)
  p.values <- matrix(nrow=n, ncol=n)
  for (i in 1:n){
    for (t in 1:n){
      p.values[t,i] <- cor.test(x[,i], x[,t])$p.value
    }
  }
  p.values
}
sig.p(m)
I was not able to work out how to use mapply here (if that is even the right tool).
Could anyone suggest how I could vectorize these iterations (with mapply or otherwise)?
Thanks in advance!
Cesar
You could use rcorr from library(Hmisc)
library(Hmisc)
rcorr(m)$P
Or use
library(psych)
corr.test(as.data.frame(m))$p
Or using outer from base R
outer(1:ncol(m),1:ncol(m), FUN= Vectorize(function(x,y)
cor.test(m[,x], m[,y])$p.value))
Benchmarks
I tried on a smaller dataset (100*100) and a slightly bigger dataset (1e3*1e3). Here are the functions:
akrun <- function() {outer(1:ncol(m1),1:ncol(m1),
FUN= Vectorize(function(x,y) cor.test(m1[,x],
m1[,y])$p.value))}
akrun2 <- function(){rcorr(m1)$P}
agstudy <- function() {M <- expand.grid(seq_len(ncol(m1)),
seq_len(ncol(m1)))
mapply(function(x,y)cor.test(m1[,x], m1[,y])$p.value,M$Var1,M$Var2)}
vpipk <-function(){
n <- ncol(m1)
p.values<-matrix(nrow=n,ncol=n)
for (i in 1:(n-1)){
for (t in (i+1):n){
p.values[t,i]<-cor.test(m1[,i],m1[,t])$p.value
}
}
p.values
}
nrussell <- function(){
sapply(1:ncol(m1), function(z){
sapply(1:ncol(m1), function(x,Y=z){
cor.test(m1[,Y],m1[,x])$p.value
})
})
}
On a 100*100 dataset
library(microbenchmark)
set.seed(25)
m1 <- matrix(rnorm(1e2*1e2),nrow=1e2,ncol=1e2)
microbenchmark(akrun(), akrun2(), agstudy(), vpipk(),
nrussell(), unit='relative', times=10L)
#Unit: relative
# expr min lq mean median uq max neval cld
# akrun() 257.2310 255.9766 252.2163 254.4946 248.9807 246.5429 10 c
# akrun2() 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 10 a
# agstudy() 255.5920 258.0813 253.5411 256.0581 250.4833 249.0503 10 c
# vpipk() 125.8218 126.3337 125.4592 126.8479 124.9835 124.1383 10 b
#nrussell() 257.9283 256.8480 252.5297 256.0160 250.8853 242.0896 10 c
If I change 1e2 to 1e3 (I didn't have time to run microbenchmark, but here are the system.time results):
system.time(akrun())
# user system elapsed
#403.563 0.751 404.198
system.time(akrun2())
# user system elapsed
# 3.110 0.008 3.117
system.time(agstudy())
# user system elapsed
#445.108 0.877 445.947
system.time(vpipk())
# user system elapsed
#155.597 0.224 155.760
system.time(nrussell())
# user system elapsed
#452.524 1.220 453.713
Not nearly as succinct as @akrun's answer, but here's a base R solution:
sig.p <- function(M){
sapply(1:ncol(M), function(z){
sapply(1:ncol(M), function(x,Y=z){
cor.test(M[,Y],M[,x])$p.value
})
})
}
##
R> sig.p(m)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.00000000 0.08034470 0.244411381 0.03293644 0.3234899 0.80352003 0.5326317 0.03896285 0.702987267 0.57721440
[2,] 0.08034470 0.00000000 0.087168145 0.44828479 0.4824117 0.76469973 0.8222813 0.17662866 0.607145382 0.41460977
[3,] 0.24441138 0.08716815 0.000000000 0.20634394 0.9504582 0.11864029 0.2148186 0.28450468 0.009396629 0.51450066
[4,] 0.03293644 0.44828479 0.206343943 0.00000000 0.8378530 0.78122849 0.0544312 0.22943728 0.524608029 0.66329385
[5,] 0.32348990 0.48241166 0.950458153 0.83785303 0.0000000 0.66105999 0.3157296 0.35715193 0.927945195 0.63163949
[6,] 0.80352003 0.76469973 0.118640294 0.78122849 0.6610600 0.00000000 0.7181462 0.67602651 0.749641726 0.03218081
[7,] 0.53263166 0.82228134 0.214818607 0.05443120 0.3157296 0.71814620 0.0000000 0.39393423 0.266039043 0.38619000
[8,] 0.03896285 0.17662866 0.284504679 0.22943728 0.3571519 0.67602651 0.3939342 0.00000000 0.512083873 0.30980598
[9,] 0.70298727 0.60714538 0.009396629 0.52460803 0.9279452 0.74964173 0.2660390 0.51208387 0.000000000 0.92533524
[10,] 0.57721440 0.41460977 0.514500658 0.66329385 0.6316395 0.03218081 0.3861900 0.30980598 0.925335242 0.00000000
This is a typical use of mapply:
M <- expand.grid(seq_len(ncol(m)), seq_len(ncol(m)))
mapply(function(x,y) cor.test(m[,x], m[,y])$p.value, M$Var1, M$Var2)
(This returns a vector of length ncol(m)^2; wrap it in matrix(..., ncol=ncol(m)) to get the square matrix back.)
Vectorizing is not always all it is cracked up to be. Not sure how large your actual matrix is, but for this size or even 100 x 100 it is a reasonably small one-time cost.
You can more than double performance by modifying your loop structure as follows:
sig.p<-function(x){
n <- ncol(x)
p.values<-matrix(nrow=n,ncol=n)
for (i in 1:(n-1)){
for (t in (i+1):n){
p.values[t,i]<-cor.test(x[,i],x[,t])$p.value
}
}
p.values
}
Basically, only compute the lower triangle, since the diagonal p-values are zero and the matrix is symmetric. mapply or sapply applied over the whole matrix may not perform better than this. If you need the full square matrix afterwards, see the sketch below.
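A small follow-up sketch that mirrors the lower triangle and zeroes the diagonal:
p <- sig.p(m)
p[upper.tri(p)] <- t(p)[upper.tri(p)] # copy the lower triangle into the upper
diag(p) <- 0 # matches the zero diagonal of the full double-loop version
p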

Replicate each time with different standard deviation

I have a vector of standard deviations:
sd_vec <- runif(10, 0, 20), with ten values between 0 and 20:
[1] 11.658106 9.693493 12.695608 4.091922 5.761061 18.410951 14.710990 12.095944 18.023123
[10] 13.294963
I would like to replicate the following process:
a<-rnorm(10,0,30)
[1] -21.265083 85.557147 23.958170 -32.843328 6.629831 -23.745339 46.094324 51.020059
[9] 1.041724 13.757235
n_columns=50
replicate(n_columns, a+rnorm(length(a), mean=0,sd=sd_vec))
The result should have one column per standard deviation:
column 1: a + rnorm(length(a),0,11.658106)
column 2: a + rnorm(length(a),0,9.693493)
column 3: a + rnorm(length(a),0,12.695608)
.
.
.
column 10: a + rnorm(length(a),0,13.294963)
Will this use a different value of sd_vec for each replication, or will it spread the values across the random numbers within each replication?
Given your edit, you may want to try:
a+sapply(sd_vec, rnorm, n=100, mean=0)
# example
> set.seed(1)
> sd_vec <-runif(10,0,20)
> set.seed(1)
> a<-rnorm(100,0,30)
> n_columns=10
> head(a+sapply(sd_vec, rnorm, n=100, mean=0))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] -22.087869 -15.746650 -8.554735 0.7226986 -18.481801 -24.921835 -32.16206 -33.158153 -38.187974
[2,] 5.732942 18.078702 -6.489666 39.9422684 4.311839 32.504554 42.75921 -18.624133 7.954302
[3,] -29.906010 -13.260709 -2.483113 -36.0217953 -29.841630 -15.576334 -26.76925 -11.915258 -21.741820
[4,] 48.697584 45.395650 43.463125 40.7586401 47.903975 57.600406 47.59359 47.701659 33.782184
[5,] 6.409275 -7.122582 28.836887 2.3249113 13.884993 7.429514 -11.34081 1.960571 18.075706
[6,] -15.229450 -6.025260 -7.288529 -31.4375515 -18.184563 -45.038651 -50.00938 -26.965804 -37.610292
[,10]
[1,] -17.391109
[2,] 6.883342
[3,] -26.144900
[4,] 48.118830
[5,] 9.970987
[6,] -26.668629
Your current call recycles sd_vec across the elements within each replicate (sd_vec[1] for the first element, sd_vec[2] for the second, and so on), rather than using one sd per column.
If you want one column for each sd, you can work with matrices. Create a matrix of rnorm draws with the desired sds (with byrow=TRUE, sd_vec is recycled along each row, so column j ends up using sd_vec[j]; this relies on n_columns equalling length(sd_vec)):
X <- rnorm(length(a)*n_columns, mean=0, sd=sd_vec)
X <- matrix(X, nrow=length(a), ncol=n_columns, byrow=TRUE)
Then add it to a, recycled into a matrix of the same shape:
matrix(a, nrow=length(a), ncol=n_columns) + X
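For comparison, a one-line sketch that builds one column per sd directly and sidesteps the byrow bookkeeping:
mapply(function(s) a + rnorm(length(a), mean = 0, sd = s), sd_vec)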
