I want to compute the distances between all pairs of points in a very large matrix using distm from geosphere.
See a minimal example:
library(geosphere)
library(data.table)
coords <- data.table(coordX=c(1,2,5,9), coordY=c(2,2,0,1))
distances <- distm(coords, coords, fun = distGeo)
The issue is that, given the nature of the distances I am computing, distm gives me back a symmetric matrix, so I could avoid calculating more than half of the distances:
structure(c(0, 111252.129800202, 497091.059564718, 897081.91986428,
111252.129800202, 0, 400487.621661164, 786770.053508848, 497091.059564718,
400487.621661164, 0, 458780.072878927, 897081.91986428, 786770.053508848,
458780.072878927, 0), .Dim = c(4L, 4L))
Could you help me find a more efficient way to compute all those distances without computing each one twice?
If you want to compute all pairwise distances for points x, it is better to use distm(x) rather than distm(x, x). The distm function returns the same symmetric matrix in both cases, but when you pass it a single argument it knows the matrix is symmetric and skips the redundant computations.
You can time it.
library("geosphere")
n <- 500
xy <- matrix(runif(n*2, -90, 90), n, 2)
system.time( replicate(100, distm(xy, xy) ) )
# user system elapsed
# 61.44 0.23 62.79
system.time( replicate(100, distm(xy) ) )
# user system elapsed
# 36.27 0.39 38.05
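As a quick sanity check that the two calls really agree (a sketch reusing the xy matrix from above), the following comparison should come out TRUE:
all.equal(distm(xy), distm(xy, xy))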
You can also look at the R code for geosphere::distm to check that it treats the two cases differently.
Aside: Quick google search finds parallelDist: Parallel Distance Matrix Computation on CRAN. The geodesic distance is an option.
You can prepare a table of all possible pairs without repetitions (with the gtools package) and then compute the distances only for those pairs. Here is the code:
library(gtools)
library(geosphere)
library(data.table)
coords <- data.table(coordX = c(1, 2, 5, 9), coordY = c(2, 2, 0, 1))
pairs <- combinations(n = nrow(coords), r = 2, repeats.allowed = F, v = c(1:nrow(coords)))
distances <- apply(pairs, 1, function(x) {
distm(coords[x[1], ], coords[x[2], ], fun = distGeo)
})
# Construct the symmetric distance matrix
dist_mat <- matrix(0, nrow = nrow(coords), ncol = nrow(coords))
dist_mat[lower.tri(dist_mat)] <- distances  # pairs come in lower-triangle (column-major) order
dist_mat <- dist_mat + t(dist_mat)          # mirror to obtain the full symmetric matrix
print(dist_mat)
The results:
         [,1]     [,2]     [,3]     [,4]
[1,]      0.0 111252.1 497091.1 897081.9
[2,] 111252.1      0.0 400487.6 786770.1
[3,] 497091.1 400487.6      0.0 458780.1
[4,] 897081.9 786770.1 458780.1      0.0
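As a side note (a sketch reusing the pairs and distances objects from above): combinations() emits the pairs in exactly the order in which a "dist" object stores its lower triangle, so the vector of distances can also be wrapped into a dist object directly, which many clustering functions accept without ever building the full matrix:
d <- structure(distances,
               Size = nrow(coords),
               Diag = FALSE, Upper = FALSE,
               class = "dist")
as.matrix(d)  # the same symmetric matrix as dist_mat (possibly with row/column names added)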
Using combn() from base R might be slightly simpler and probably faster than loading additional packages. Also, distm() just calls the distance function you pass it (distGeo() here), so using distGeo() directly should be even faster.
coords <- as.data.frame(coords) # this won't work with data.tables though, hence the conversion
pairs <- t(combn(nrow(coords), 2))  # all index pairs, one per row
cbind(pairs, geosphere::distGeo(coords[pairs[, 1], ], coords[pairs[, 2], ]))
#      [,1] [,2]     [,3]
# [1,]    1    2 111252.1
# [2,]    1    3 497091.1
# [3,]    1    4 897081.9
# [4,]    2    3 400487.6
# [5,]    2    4 786770.1
# [6,]    3    4 458780.1
We could check it out with a benchmark.
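A sketch of how such a benchmark could be set up (the expression names match the table below; exact timings will depend on the machine):
microbenchmark::microbenchmark(
  distm   = distm(coords, coords, fun = distGeo),
  distGeo = cbind(pairs, geosphere::distGeo(coords[pairs[, 1], ], coords[pairs[, 2], ])),
  times = 100
)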
Unit: microseconds
expr min lq mean median uq max neval cld
distm 555.690 575.846 597.7672 582.352 596.1295 904.718 100 b
distGeo 426.335 434.372 450.0196 441.516 451.8490 609.524 100 a
Looks good.
I can make one pseudo-random matrix with the following:
nc=14
nr=14
set.seed(111)
M=matrix(sample(
c(runif(58,min=-1,max=0),runif(71, min=0,max=0),
runif(nr*nc-129,min=0,max=+1))), nrow=nr, nc=nc)
The more important question: I need 1000 matrices with the same number of negative, positive and zero values; only the locations within the matrices need to vary.
I can make matrices one by one, but I want to do this task faster.
The less important question: once I have the 1000 matrices, I need to determine, for every position in the matrices, how many positive, negative or zero values occur there. For example:
MATRIX_A
     [,1]
[9,] -0.2
MATRIX_B
     [,1]
[9,] -0.5
MATRIX_C
     [,1]
[9,]  0.1
MATRIX_D
     [,1]
[9,]  0.0
MATRIX_E
     [,1]
[9,]  0.9
What I need:
FINAL_MATRIX_positive
     [,1]
[9,] (2/5)*100 = 40%, or 0.4, or 2
because out of the 5 matrices, 2 had a positive value at this position. I also need the same for negative and zero values.
If it isn't possible to do this in R, I can compare them "manually" in Excel.
Thank you for your help!
Actually you are almost there!
You can try the code below, where replicate generates the random matrix 1000 times and Reduce collects the statistics for each position:
nc <- 14
nr <- 14
N <- 1000
lst <- replicate(
N,
matrix(sample(
c(
runif(58, min = -1, max = 0),
runif(71, min = 0, max = 0),
runif(nr * nc - 129, min = 0, max = +1)
)
), nrow = nr, nc = nc),
simplify = FALSE
)
pos <- Reduce(`+`,lapply(lst,function(M) M > 0))/N
neg <- Reduce(`+`,lapply(lst,function(M) M < 0))/N
zero <- Reduce(`+`,lapply(lst,function(M) M == 0))/N
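A quick usage example and sanity check (reusing the pos, neg and zero matrices from above): at every position the three proportions add up to 1 (up to floating point), and single cells can be read off directly:
range(pos + neg + zero)  # both ends should be (numerically) 1
pos[9, 1]                # share of positive values at row 9, column 1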
I use a function for your simulation scheme:
my_sim <- function(n_neg = 58, n_0 = 71, n_pos = 67){
res <- c(runif(n_neg, min=-1, max=0),
rep(0, n_0),
runif(n_pos, min=0, max=+1))
return(sample(res))
}
Then, I simulate your matrices (I store them in a list):
N <- 1000
nr <- 14
nc <- nr
set.seed(111)
my_matrices <- list()
for(i in 1:N){
my_matrices[[i]] <- matrix(my_sim(), nrow = nr, ncol = nc)
}
Finally, I compute the proportion of positive numbers for the position row 1 and column 9:
sum(sapply(my_matrices, function(x) x[1,9]) > 0)/N
# [1] 0.366
However, if you are interested in all the positions, these lines will do the job:
aux <- lapply(my_matrices, function(x) x > 0)
FINAL_MATRIX_positive <- 0
for(i in 1:N){
FINAL_MATRIX_positive <- FINAL_MATRIX_positive + aux[[i]]
}
FINAL_MATRIX_positive <- FINAL_MATRIX_positive/N
# row 1, column 9
FINAL_MATRIX_positive[1, 9]
# [1] 0.366
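The same pattern gives the proportions of negative and zero values (a sketch reusing my_matrices and N from above; the Reduce idiom from the previous answer works just as well here):
FINAL_MATRIX_negative <- Reduce(`+`, lapply(my_matrices, function(x) x < 0)) / N
FINAL_MATRIX_zero     <- Reduce(`+`, lapply(my_matrices, function(x) x == 0)) / N
# at any position the three proportions add up to 1, e.g. row 1, column 9:
FINAL_MATRIX_positive[1, 9] + FINAL_MATRIX_negative[1, 9] + FINAL_MATRIX_zero[1, 9]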
I have a matrix filled with somewhat random elements. I need every row sorted in decreasing order, then a function is called on the matrix, and finally the resulting matrix needs to be unsorted to the original order.
This is quickly accomplished vector-wise as shown here, but what's the fastest way to do this to every row in a matrix? Right now I'm doing:
# Example matrix
m <- matrix(runif(100), nrow = 25, ncol = 4)
# Get the initial order by row
om <- t(apply(m, 1, order, decreasing = T))
sm <- m
for (i in seq_len(nrow(m))) {
sm[i, ] <- sm[i, om[i, ]]
}
# ** Operations performed on sm **
# Then unsort
for (i in seq_len(nrow(m))) {
sm[i, ] <- sm[i, order(om[i, ])]
}
# sm is now sorted by-row in the same order as m
Is there some way, given om in the above, to sort and unsort while avoiding the for loop or an apply function (both of which make this operation very slow for a big m)? Thanks!
Edit: There are pointers here: Fastest way to sort each row of a large matrix in R
The operation is done inside a function that is already called using parallel, so this operation must be done using serial code.
Row-wise sorting is straightforward. To get the original order back (un-sort) we need the row-wise ranks rather than the order. After that, we can adapt what works for column sorting in @Josh O'Brien's answer to rows.
Base R solution:
rr <- t(apply(m, 1, rank)) ## get initial RANKS by row
sm <- t(apply(m, 1, sort)) ## sort m
## DOING STUFF HERE ##
sm[] <- sm[cbind(as.vector(row(rr)), as.vector(rr))] ## un-sort
all(m == sm) ## check
# [1] TRUE
Seems to work.
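To see why indexing by rank undoes the sort, here is the idea on a single vector (a minimal illustration):
v  <- c(0.3, 0.9, 0.1)
sv <- sort(v)  # 0.1 0.3 0.9
rv <- rank(v)  # 2 3 1
sv[rv]         # 0.3 0.9 0.1 -- the original v, restored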
In your linked answer, the rowSort function of the Rfast package stands out in terms of performance and covers the sorting step. There is also a rowRanks function that covers the ranking step, so we can avoid apply altogether.
Let's try it out.
m[1:3, ]
# [,1] [,2] [,3] [,4]
# [1,] 0.9148060 0.5142118 0.3334272 0.719355838
# [2,] 0.9370754 0.3902035 0.3467482 0.007884739
# [3,] 0.2861395 0.9057381 0.3984854 0.375489965
library(Rfast)
rr <- rowRanks(m) ## get initial RANKS by row
sm <- rowSort(m) ## sort m
sm[1:3, ] # check
# [,1] [,2] [,3] [,4]
# [1,] 0.36106962 0.4112159 0.6262453 0.6311956
# [2,] 0.01405302 0.2171577 0.5459867 0.6836634
# [3,] 0.07196981 0.2165673 0.5739766 0.6737271
## DOING STUFF HERE ##
sm[] <- sm[cbind(as.vector(row(rr)), as.vector(rr))] ## un-sort
all(sm == m) ## check
# [1] TRUE
Ditto.
Benchmark
m.test <- matrix(runif(4e6), ncol = 4)
dim(m.test)
# [1] 1000000 4
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# Rfast 897.6286 910.91 956.6259 924.1914 986.1246 1048.058 3 a
# baseR 87931.2824 88004.73 95659.8671 88078.1737 99524.1594 110970.145 3 c
# forloop 58927.7784 59434.54 60317.3903 59941.2930 61012.1963 62083.100 3 b
Not so bad!!
Data/Code:
set.seed(42)
m <- matrix(runif(100), nrow = 25, ncol = 4)
## benchmark
m.test <- matrix(runif(4e6), ncol = 4)
microbenchmark::microbenchmark(
Rfast={
rr <- rowRanks(m.test)
sm <- rowSort(m.test)
sm[] <- sm[cbind(as.vector(row(rr)), as.vector(rr))]},
baseR={
rr <- t(apply(m.test, 1, rank))
sm <- t(apply(m.test, 1, sort))
sm[] <- sm[cbind(as.vector(row(rr)), as.vector(rr))]
},
forloop={
om <- t(apply(m.test, 1, order, decreasing = T))
sm <- m.test
for (i in seq_len(nrow(m.test))) {
sm[i, ] <- sm[i, om[i, ]]
}
for (i in seq_len(nrow(m.test))) {
sm[i, ] <- sm[i, order(om[i, ])]
}
}, times=3L
)
I am performing calculations with constants and vectors (approximate length = 100) for which I need to simulate normal distributions N (with rnorm). For constants (K, with standard deviation = KU) I use rnorm() in the standard way:
K <- 2
KU <- 0.2
set.seed(123)
KN <- rnorm(n = 3, mean = K, sd = KU)
which produces a vector of length 3 (KN):
[1] 1.887905 1.953965 2.311742
Now, I need to do the same thing with a vector (V, standard deviation VU). My first guess is to use:
V <- c(1, 2, 3)
VU <- 0.1 * V
set.seed(123)
VN <- rnorm(3, V, VU)
but only a vector of 3 elements is produced, one for each vector element:
[1] 0.9439524 1.9539645 3.4676125
This is actually only the first simulation of the vector, but I need this vector 3 times. One option is to draw 9 numbers at once (rnorm(9, V, VU)), but then VN is a single vector of 9 elements:
[1] 0.9439524 1.9539645 3.4676125 1.0070508 2.0258575 3.5145195 1.0460916 1.7469878 2.7939441
not 3 vectors of 3 elements. What I want is VN =
[1] 0.9439524 1.0070508 1.0460916
[2] 1.9539645 2.0258575 1.7469878
[3] 3.4676125 3.5145195 2.7939441
so, VN are 3 vectors which I can subsequently use in other calculations, such as KN * VN. The solution that I have found is:
set.seed(123)
VN <- as.data.frame(t(matrix(rnorm(3 * length(V), V, VU), nrow = length(V))))
but in my opinion this is a rather cumbersome expression (which I need to repeat several times in different places with rather long variable names). Is there a simpler way in base R to produce random vectors? I would like to see something like:
VN <- rnorm.vector(3, V, VU)
We can use replicate
set.seed(123)
replicate(3, rnorm(3, V, VU))
# [,1] [,2] [,3]
#[1,] 0.9439524 1.007051 1.046092
#[2,] 1.9539645 2.025858 1.746988
#[3,] 3.4676125 3.514519 2.793944
Or it could be
mapply(rnorm, n = 3, mean = V, sd = VU)
In addition to @akrun's great options, you may also use something slightly simpler than your approach:
matrix(rnorm(n * length(V), V, VU), nrow = n, byrow = TRUE)
# [,1] [,2] [,3]
# [1,] 0.9439524 1.953965 3.467612
# [2,] 1.0070508 2.025858 3.514519
# [3,] 1.0460916 1.746988 2.793944
or also the MASS package with mvrnorm, which samples from a multivariate normal distribution. To match rnorm(3, V, VU), the mean vector is V and the covariance matrix is diagonal with variances VU^2:
library(MASS)
mvrnorm(3, mu = V, Sigma = diag(VU^2))
# a 3 x 3 matrix; each row is one simulated copy of V (exact values depend on the RNG state)
where
diag(VU^2)
#      [,1] [,2] [,3]
# [1,] 0.01 0.00 0.00
# [2,] 0.00 0.04 0.00
# [3,] 0.00 0.00 0.09
The latter option is the way to go in case you want the variance-covariance matrix not to be diagonal.
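If you really want the one-liner interface from the question, a thin wrapper around the byrow idiom above will do (rnorm.vector is a hypothetical name, not an existing base R function):
rnorm.vector <- function(n, mean, sd) {
  matrix(rnorm(n * length(mean), mean, sd), nrow = n, byrow = TRUE)
}
set.seed(123)
rnorm.vector(3, V, VU)  # each row is one simulated copy of V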
I would like to compute the standard deviation of the nearest neighbors (3*3 moving window) of each element in a matrix. I wrote some code in R to implement it:
library(FNN)
df <- matrix(1:10000, nrow = 100, ncol = 100, byrow = TRUE)
df_ <- reshape2::melt(df)
df_index <- df_[, c(1,2)]
df_query <- df_index
neighbor_index <- knnx.index(df_index, df_query, k = 9, algorithm = 'kd_tree')
neighbor_coor<- apply(neighbor_index, 1, function(x) df_query[x, ])
neighbor_sd <- lapply(neighbor_coor, function(x) sd(df[x[, 1], x[, 2]]))
sd <- do.call(rbind, neighbor_sd)
But it is too slow. Could you give me some advice on how to speed it up? Are there other ways to implement it?
As @romanlustrik proposed in his comment, we can use raster::focal() for this problem.
library(raster)
df <- matrix(1:10000, nrow = 100, ncol = 100, byrow = TRUE)
dfR <- raster(df)
dfSD <- as.matrix(focal(dfR, w = matrix(1,3,3), fun = sd))
where w is a matrix representing the nearest neighbors and their weighting within fun (in this case a 3x3 window, i.e. the cell itself and its 8 neighbors). Thus, any neighborhood pattern is imaginable as long as it can be represented by a matrix.
matrix(1,3,3)
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 1 1 1
# [3,] 1 1 1
An example with only the 4 neighbors (excluding diagonals and the cell itself):
matrix(c(0,1,0,1,0,1,0,1,0), 3, 3)
# [,1] [,2] [,3]
# [1,] 0 1 0
# [2,] 1 0 1
# [3,] 0 1 0
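As a quick check that focal() does what we expect (a sketch reusing df and dfSD from above), the value at an interior cell should equal the sd of the surrounding 3x3 block:
sd(as.vector(df[1:3, 1:3]))  # sd of the 3x3 block centred on df[2, 2]
dfSD[2, 2]                   # should agree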
I have a matrix with n rows of observations. Observations are frequency distributions of the features. I would like to transform the frequency distributions to probability distributions where the sum of each row is 1. Therefore each element in the matrix should be divided by the sum of the row of the element.
I wrote the following R function that does the work but it is very slow with large matrices:
prob_dist <- function(x) {
row_prob_dist <- function(row) {
return (t(lapply(row, function(x,y=sum(row)) x/y)))
}
for (i in 1:nrow(x)) {
if (i==1) p_dist <- row_prob_dist(x[i,])
else p_dist <- rbind(p_dist, row_prob_dist(x[i,]))
}
return(p_dist)
}
B = matrix(c(2, 4, 3, 1, 5, 7), nrow=3, ncol=2)
B
[,1] [,2]
[1,] 2 1
[2,] 4 5
[3,] 3 7
prob_dist(B)
[,1] [,2]
[1,] 0.6666667 0.3333333
[2,] 0.4444444 0.5555556
[3,] 0.3 0.7
Could you suggest R function that does the job and/or tell me how can I optimise my function to perform faster?
Here's an attempt, but on a dataframe instead of a matrix:
df <- data.frame(replicate(100,sample(1:10, 10e4, rep=TRUE)))
I tried a dplyr approach:
library(dplyr)
df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs)
Here are the results:
library(microbenchmark)
mbm = microbenchmark(
dplyr = df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs),
t = t(t(df) / rep(rowSums(df), each=ncol(df))),
apply = t(apply(df, 1, prop.table)),
times = 100
)
#> mbm
#Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 123.1894 124.1664 137.7076 127.3376 131.1523 445.8857 100
# t 384.6002 390.2353 415.6141 394.8121 408.6669 787.2694 100
# apply 1425.0576 1520.7925 1646.0082 1599.1109 1734.3689 2196.5003 100
Edit: @David's benchmark is more in line with the OP's setup, so I suggest you consider his approach if you are working with matrices.
Without apply, a vectorized solution in one line:
t(t(B) / rep(rowSums(B), each=ncol(B)))
[,1] [,2]
[1,] 0.6666667 0.3333333
[2,] 0.4444444 0.5555556
[3,] 0.3000000 0.7000000
Or:
diag(1/rowSums(B)) %*% B
Actually, on second thought, the best vectorization would be simply
B/rowSums(B)
# [,1] [,2]
# [1,] 0.6666667 0.3333333
# [2,] 0.4444444 0.5555556
# [3,] 0.3000000 0.7000000
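The reason this works is R's recycling: rowSums(B) has length nrow(B), and because a matrix is stored column by column, the vector lines up so that every element is divided by its own row's sum. A quick way to convince yourself (a sketch):
all.equal(B / rowSums(B), sweep(B, 1, rowSums(B), "/"))  # should be TRUE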
Actually, @Steven's benchmark is misleading because the OP has a matrix, while Steven benchmarked on a data frame.
Here's a benchmark with a matrix. For matrices, both vectorized solutions will be better than dplyr, which doesn't work with matrices directly:
set.seed(123)
m <- matrix(sample(1e6), ncol = 100)
library(dplyr)
library(microbenchmark)
Res <- microbenchmark(
dplyr = as.data.frame(m) %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs),
t = t(t(m) / rep(rowSums(m), each=ncol(m))),
apply = t(apply(m, 1, prop.table)),
DA = m/rowSums(m),
times = 100
)
I'm not sure that your function has any value, since you could just use the hist or density functions to accomplish the same result. Also, the use of apply would work as mentioned. But it serves as a reasonable programming example.
There are several inefficiencies in your code:
1. You use a for loop instead of vectorizing your code. This is very expensive. You should use apply as mentioned in the comments above.
2. You are using rbind instead of pre-allocating space for your output. This is extremely expensive as well.
out <- matrix(NA, nrow = nrow(B), ncol = ncol(B))
for (i in 1:nrow(B)) {
  out[i, ] <- unlist(row_prob_dist(B[i, ]))  # unlist() because row_prob_dist() returns a list matrix
}
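For completeness, base R also has loop-free ways to do the whole thing in one call (a sketch; both reproduce the prob_dist(B) output shown above):
prop.table(B, margin = 1)     # divide each row by its row sum
sweep(B, 1, rowSums(B), "/")  # same idea, spelled out with sweep()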