Most efficient way to calculate Euclidean distances in a large matrix - R

I would like to find the most memory- and time-efficient way to calculate Euclidean distances on a large matrix. I've run the small benchmark below comparing a few packages I know: parallelDist, geodist, fields and stats. I've also considered a customized function that combines Rcpp and bigmemory. Here are the results I found (reprex below), but I'd like to know whether there are other efficient packages / solutions for this task:
Results
benchmrk
#>    package   time        alloc
#> 1: parDist  0.298 5.369186e-04
#> 2:  fields  1.079 9.486198e-03
#> 3:    rcpp 54.422 2.161113e+00
#> 4:   stats  0.770 5.788603e+01
#> 5: geodist  2.513 1.157635e+02
# plot
ggplot(benchmrk, aes(x = alloc, y = time, color = package, label = package)) +
  geom_label(alpha = .5) +
  coord_trans(x = "log10", y = "log10") +
  theme(legend.position = "none")
Reprex
library(parallelDist)
library(geodist)
library(fields)
library(stats)
library(bigmemory)
library(Rcpp)
library(lineprof)
library(geobr)
library(sf)
library(ggplot2)
library(data.table)
# data input
df <- geobr::read_weighting_area()
gc(reset = T)
# convert projection to UTM
df <- st_transform(df, crs = 3857)
# get spatial coordinates
coords <- suppressWarnings(st_coordinates( st_centroid(df) ))
# prepare customized rcpp function
sourceCpp("euc_dist.cpp")
bigMatrixEuc <- function(bigMat){
  zeros <- big.matrix(nrow = nrow(bigMat) - 1,
                      ncol = nrow(bigMat) - 1,
                      init = 0,
                      type = typeof(bigMat))
  BigArmaEuc(bigMat@address, zeros@address)
  return(zeros)
}
### Start tests
perf_fields <- lineprof(dist_fields <- fields::rdist(coords) )
perf_geodist <- lineprof(dist_geodist <- geodist::geodist(coords, measure = "cheap") )
perf_stats <- lineprof(dist_stats <- stats::dist(coords) )
perf_parDist <- lineprof(dist_parDist <- parallelDist::parDist(coords, method = "euclidean") )
perf_rcpp <- lineprof(dist_rcpp <- bigMatrixEuc( as.big.matrix(coords) ) )
perf_fields$package <- 'fields'
perf_geodist$package <- 'geodist'
perf_stats$package <- 'stats'
perf_parDist$package <- 'parDist'
perf_rcpp$package <- 'rcpp'
# gather results
benchmrk <- rbind(perf_fields, perf_geodist, perf_stats , perf_parDist, perf_rcpp)
benchmrk <- setDT(benchmrk)[, .(time = sum(time), alloc = sum(alloc)), by = package][order(alloc)]
benchmrk

Here, I try to propose an answer 'theoretically'.
I think a combination of the Rcpp approach (here) and parDist (here) might allow working on very large data sets while keeping execution times at an acceptable level?
Unfortunately, I have not worked with Rcpp, RcppParallel or RcppArmadillo yet. But it seems parDist and the Rcpp-big.matrix approach build upon the same 'infrastructure'.
Maybe some more experienced users will take up the challenge.
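As a rough, package-agnostic sanity check on the memory side (a sketch, not part of the benchmark above): the result object alone differs by roughly a factor of two between a dense n x n matrix (as returned by fields::rdist() or geodist::geodist()) and the lower-triangle 'dist' object returned by stats::dist() and parallelDist::parDist():
n <- nrow(coords)                              # number of points in the reprex
dense_gb    <- n^2 * 8 / 1024^3                # full n x n matrix of doubles
triangle_gb <- n * (n - 1) / 2 * 8 / 1024^3    # "dist" object (lower triangle only)
c(dense = dense_gb, lower_triangle = triangle_gb)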

Related

Efficiently create many polygons

I want to create polygons inside an apply call, and I want to do this as quickly as possible from a matrix of coordinates. I have some code and realized this is one of the slowest parts of my code. How can I do this efficiently? I tried two different approaches:
Approach 1
library(sp)
library(terra)
t0 <- Sys.time()
poly_list <- apply(matrix(1:10000), 1, function(idx){
  # set coordinates
  coords <- cbind(rnorm(100), rnorm(100))
  # create polygon
  Polygons(list(Polygon(coords)), idx)
})
# convert to terra polygons
poly_terra <- vect(SpatialPolygons(poly_list))
# show time passed
print(Sys.time() - t0)
# Time difference of 2.082166 secs
Approach 2
t0 <- Sys.time()
poly_list <- apply(matrix(1:10000), 1, function(idx){
  # set coordinates
  coords <- cbind(rnorm(100), rnorm(100))
  # create polygon
  vect(coords, type = "polygon")
})
# convert to terra polygons
poly_terra <- vect(poly_list)
print(Sys.time() - t0)
# Time difference of 16.38044 secs
Why is it faster to create sp polygons and convert them afterwards than directly creating terra polygons? The code with vect(SpatialPolygons(Polygons(list(Polygon(coords)), idx))) seems somewhat complicated. Is there a faster or at least more elegant way?
Edit Currently my fastest option, although it feels illegal:
t0 <- Sys.time()
dummy <- Polygons(list(Polygon(cbind(rep(0,4), rep(0,4)))), "0")
poly_list <- apply(matrix(1:10000), 1, function(idx){
  # set coordinates
  coords <- cbind(rnorm(100), rnorm(100))
  # create polygon
  new <- dummy
  new@ID <- as.character(idx)
  new@Polygons[[1]]@coords <- coords
  return(new)
})
# convert to terra polygons
poly_terra <- vect(SpatialPolygons(poly_list))
print(Sys.time() - t0)
# Time difference of 0.7147191 secs
This is faster than your examples
t0 <- Sys.time()
p <- lapply(1:10000, function(idx){
  cbind(id = idx, x = rnorm(100), y = rnorm(100))
})
p <- do.call(rbind, p)
v <- vect(p, "polygons")
print(Sys.time() - t0)
#Time difference of 0.483578 secs
This uses lapply and you state that you want to use apply; but in the context of your example apply does not seem to be a good choice.
I do not see much performance difference between your two sp approaches. Below I use a streamlined version of the one you say is fastest and benchmark it against my approach:
with_terra <- function() {
  p <- lapply(1:10000, function(idx){
    cbind(id = idx, x = rnorm(100), y = rnorm(100))
  })
  p <- do.call(rbind, p)
  vect(p, "polygons")
}
with_sp <- function() {
  dummy <- Polygons(list(Polygon(cbind(rep(0, 4), rep(0, 4)))), "0")
  poly_list <- apply(matrix(1:10000), 1, function(idx){
    dummy@ID <- as.character(idx)
    dummy@Polygons[[1]]@coords <- cbind(rnorm(100), rnorm(100))
    dummy
  })
  vect(SpatialPolygons(poly_list))
}
bm <- microbenchmark::microbenchmark(
  sp = with_sp(),
  terra = with_terra(),
  times = 10
)
bm
# Unit: milliseconds
#   expr      min       lq     mean   median       uq       max neval
#     sp 836.8434 892.8411 930.5261 935.3788 968.2724 1039.2840    10
#  terra 261.2191 276.0770 298.3603 282.7462 296.3674  437.0505    10
I'm not really sure if this will be of help, but I had a good time experimenting and fine-tuning and thought I'd share my preliminary results at least.
Foremost, let me share some input for further reading:
This article was the starting point for some code optimization I worked on some time ago: FasteR! HigheR! StrongeR!
Maybe it's just me, but I prefer for-loops over apply, and this approach does not seem to be slower (cf. here). On the contrary, the median execution time of your first approach was ~0.12 s faster on my machine after I used a loop instead, but maybe there is another reason for you to use apply here.
If you choose to go for a loop, here is another great guide on how to reduce execution time.
Making use of namespaces actually slows down your code (c.f. here), so better attach the packages you are going to use - like you did.
The native pipe does not seem to have any overhead (c.f. here), so this might be a great way to un-nest your functions and make them look tidier without penalties.
For timing purposes, I came across {tictoc} some time ago, which from my point of view is a nice wrapper around the Sys.time() - t0 idea; for actual benchmarking, a dedicated package such as {microbenchmark} (used below) is great.
Noam Ross suggests finding better (= faster) packages for your purpose. You already noticed that {sp} operates faster than {terra} in your example. Let me present a third option:
library(sp)
library(terra)
library(sf)
# first node has to be equal to the last node for a polygon to be closed
coords <- cbind(rnorm(99), rnorm(99))
coords <- rbind(coords, coords[1, ])
mbm <- microbenchmark::microbenchmark(
  sp = Polygon(coords) |> list() |> Polygons(1),
  terra = vect(coords, type = "polygons"),
  sf = list(coords) |> st_polygon(),
  times = 100
)
ggplot2::autoplot(mbm)
If your target object has to be of class SpatVector, you may consider applying terra::vect() once as a final step. However, what exactly is your goal once you have created your polygon objects? This might affect which package / workflow to use, e.g. if you only need geometries in a specific order, you might drop attributes etc.
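For illustration, a minimal sketch of that 'convert once at the end' idea, assuming the same 10,000 random 100-point rings as in the question and that your terra version accepts sf objects in vect():
library(sf)
library(terra)
polys <- vector("list", 10000)
for (i in seq_along(polys)) {
  ring <- cbind(rnorm(99), rnorm(99))
  ring <- rbind(ring, ring[1, ])                 # sf requires closed rings
  polys[[i]] <- st_polygon(list(ring))
}
poly_terra <- st_sf(geometry = st_sfc(polys)) |> vect()  # single conversion to SpatVector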
Considering your third approach: it should not feel illegal from my point of view; pre-allocating objects and exchanging some attributes seems like a smart move. A condensed version might combine loops, pipes and maybe the {sf} package (I have not managed to work the latter in yet), but at least I did not slow down your code so far:
# your take on this
illegal_approach_a <- function() {
  dummy <- Polygons(list(Polygon(cbind(rep(0, 4), rep(0, 4)))), "0")
  poly_list <- apply(matrix(1:10000), 1, function(idx){
    # set coordinates
    coords <- cbind(rnorm(100), rnorm(100))
    # create polygon
    new <- dummy
    new@ID <- as.character(idx)
    new@Polygons[[1]]@coords <- coords
    return(new)
  })
  # convert to terra polygons
  poly_terra <- vect(SpatialPolygons(poly_list))
}
# my take on this
illegal_approach_b <- function() {
  dummy <- cbind(rep(0, 4), rep(0, 4)) |> Polygon() |> list() |> Polygons("0")
  poly_list <- list()
  for (i in 1:10000) {
    # set coordinates
    coords <- cbind(rnorm(100), rnorm(100))
    # create polygon
    new <- dummy
    new@ID <- as.character(i)
    new@Polygons[[1]]@coords <- coords
    poly_list[[i]] <- new
  }
  # convert to terra polygons
  poly_terra <- SpatialPolygons(poly_list) |> vect()
}
mbm <- microbenchmark::microbenchmark(
  your_take = illegal_approach_a(),
  my_take = illegal_approach_b(),
  times = 100
)
ggplot2::autoplot(mbm)

Out of memory error when projecting a bipartite network in igraph

I have a directed, bipartite graph g with 215473 vertices and 2326714 edges. When creating a bipartite.projection of g, I keep running out of memory (it uses ~35 GB of RAM before crashing).
I tried to calculate how much memory I need, by following a previous thread on nongnu.org.
From the information provided in this thread, to store a graph in memory costs (in bytes):
(4*|E|+2*|V|) * 8 + 4*|V|
To calculate the projection requires the following memory (in bytes):
16*|V| + (2*|V|+2*|E|) * 8
Thus, for my graph g, it would cost:
((4*2326714+2*215473) * 8 + 4*215473) + (16*215473 + (2*215473+2*2326714) * 8)
= 78764308 + 44122560
= 122886868 (bytes)
= 122.886868 (mb)
Clearly, this isn't correct, and I must be doing something wrong.
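For reference, the same back-of-the-envelope estimate expressed in R (a sketch; it simply reproduces the arithmetic above):
V <- 215473; E <- 2326714
graph_bytes <- (4 * E + 2 * V) * 8 + 4 * V   # 78764308
proj_bytes  <- 16 * V + (2 * V + 2 * E) * 8  # 44122560
(graph_bytes + proj_bytes) / 1e6             # ~122.9 MB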
Can anyone please help figure out how to create a bipartite projection of my graph?
Working with sparse matrices could possibly solve your problem.
# Load tiny toy data as an edge list
df <- data.frame(person = c('Sam','Sam','Sam','Greg','Tom','Tom','Tom','Mary','Mary'),
                 group  = c('a','b','c','a','b','c','d','b','d'),
                 stringsAsFactors = FALSE)
# Transform data to a sparse matrix
library(Matrix)
A <- spMatrix(nrow = length(unique(df$person)),
              ncol = length(unique(df$group)),
              i = as.numeric(factor(df$person)),
              j = as.numeric(factor(df$group)),
              x = rep(1, length(as.numeric(df$person))))
row.names(A) <- levels(factor(df$person))
colnames(A) <- levels(factor(df$group))
To do the projection you actually have multiple possibilities; here are two:
# Use base r
Arow <- tcrossprod(A)
# Alternatively, if you want to project on the other mode:
Acol <- tcrossprod(t(A))
# Use the igraph package, which works with sparse matrices
library(igraph)
g <- graph.incidence(A)
# The command bipartite.projection does both possible projections at once
proj <- bipartite.projection(g)
#proj[[1]]
#proj[[2]]
You can also read in the data and do the transformation within the spMatrix() call using data.table, which will speed up those operations as well.
UPDATE:
Here is an example with a larger graph and some memory benchmarks:
# Load packages
library(Matrix)
library(data.table)
library(igraph)
# Scientific collaboration dataset
# Descriptives as reported on https://toreopsahl.com/datasets/#newman2001
# mode 1 elements: 16726
# mode 2 elements: 22016
# two mode ties: 58595
# one mode ties: 47594
d <- fread("http://opsahl.co.uk/tnet/datasets/Newman-Cond_mat_95-99-two_mode.txt",
           stringsAsFactors = TRUE, colClasses = "factor", header = FALSE)
# Transform data to a sparse matrix
A <- spMatrix(nrow = length(unique(d[, V1])),
              ncol = length(unique(d[, V2])),
              i = as.numeric(d[, V1]),
              j = as.numeric(d[, V2]),
              x = rep(1, length(as.numeric(d[, V1]))))
row.names(A) <- levels(d[, V1])
colnames(A) <- levels(d[, V2])
# To do the projection you actually have multiple possibilities, here are two:
# Use base r
Arow <- tcrossprod(A)
# Alternatively, if you want to project on the other mode:
Acol <- tcrossprod(t(A))
Here is an overview of how much memory was used. The sparse matrix approach worked on my laptop for this network, but the approach using regular matrices gave a memory allocation error (even after removing the Brow object from memory with rm(Brow) and then calling the garbage collector with gc()):
object.size(A) # Sparse matrix: 3108520 bytes
object.size(Arow) # 2713768 bytes
object.size(Acol) # 5542104 bytes
# For comparison
object.size(B <- as.matrix(A)) # Regular matrix: 2945783320 bytes
object.size(Brow <- tcrossprod(B)) # 2239946368 bytes
object.size(Bcol <- tcrossprod(t(B))) # Memory allocation error on my laptop
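If you need the projections back as igraph objects rather than sparse matrices, graph_from_adjacency_matrix() also accepts sparse matrices; a minimal sketch:
# Weighted one-mode projection graphs; diag = FALSE drops the self-ties
# that tcrossprod() places on the diagonal
g_row <- graph_from_adjacency_matrix(Arow, mode = "undirected", weighted = TRUE, diag = FALSE)
g_col <- graph_from_adjacency_matrix(Acol, mode = "undirected", weighted = TRUE, diag = FALSE)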

dplyr - error message after applying function

I am trying to apply an IDW (inverse distance weighting) interpolation to different groups in a database. I am trying to use dplyr to apply this function to each group, but I am making a mistake in the split-apply-combine step. The current function returns 10 values for each group of 10 observations, but dplyr tries to insert all 10 return values into each mutated cell, rather than one new value per cell.
The problem is likely function-agnostic, but I unfortunately could not find a simpler function that showcases the same error.
I get an error message that the data frame is corrupt, and the new column is filled with list values:
  group N      Lat     Long Obs   idw_val
1     A 1 49.43952 20.42646  11 <dbl[10]>
2     B 1 49.76982 19.70493   8 <dbl[10]>
The example hopefully clarifies this. The solution is probably very simple; some pointers to help me would be much appreciated...
require(ggmap)
require(dplyr)
require(raster)
require(sp)
require(gstat)
require(lattice)
####create dataset
set.seed(123)
dh = expand.grid(group = c("A","B","C"),
N=1:10)
dh$Lat=rnorm(nrow(dh),50,1)
dh$Long=rnorm(nrow(dh),20,1)
dh$Obs=rpois(nrow(dh),10)
dh
#####create grid
pixels <- 10
##### function definition
idw_w <- function(x, y, z){
  geog2 <- data.frame(x, y, z)
  coordinates(geog2) <- ~x+y
  geog.grd <- expand.grid(x = seq(floor(min(coordinates(geog2)[, 1])),
                                  ceiling(max(coordinates(geog2)[, 1])),
                                  length.out = pixels),
                          y = seq(floor(min(coordinates(geog2)[, 2])),
                                  ceiling(max(coordinates(geog2)[, 2])),
                                  length.out = pixels))
  # Assigning coordinates results in spdataframe.
  grd.pts <- SpatialPixels(SpatialPoints((geog.grd)))
  grd <- as(grd.pts, "SpatialGrid")
  ##### IDW interpolation.
  geog2.idw <- idw(z ~ 1, geog2, grd, idp = 4)
  #### overlay
  pts <- SpatialPoints(cbind(x, y))
  over(pts, geog2.idw["var1.pred"])
}
#### test function
idw_w(dh$Lat,dh$Long,dh$Obs)
####groupwise dplyr
dh2 = dh %>%
  # arrange(Block, Species, Date) %>%
  group_by(group) %>%
  mutate(idw_val = idw_w(x = Lat, y = Long, z = Obs))
dh2
str(dh2)
If I understand what you want correctly, it's just a matter of making sure your function returns a vector of values rather than a data.frame object. I think this function will do what you want when run through the mutate() step:
idw_w <- function(x, y, z){
  geog2 <- data.frame(x, y, z)
  coordinates(geog2) <- ~x+y
  geog.grd <- expand.grid(x = seq(floor(min(coordinates(geog2)[, 1])),
                                  ceiling(max(coordinates(geog2)[, 1])),
                                  length.out = pixels),
                          y = seq(floor(min(coordinates(geog2)[, 2])),
                                  ceiling(max(coordinates(geog2)[, 2])),
                                  length.out = pixels))
  # Assigning coordinates results in spdataframe.
  grd.pts <- SpatialPixels(SpatialPoints((geog.grd)))
  grd <- as(grd.pts, "SpatialGrid")
  ##### IDW interpolation.
  geog2.idw <- idw(z ~ 1, geog2, grd, idp = 4)
  #### overlay
  pts <- SpatialPoints(cbind(x, y))
  # return just the predicted values as a plain vector, not a data.frame
  (over(pts, geog2.idw["var1.pred"]))[, 1]
}
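With that change, the grouped mutate from the question should attach one interpolated value per row; for completeness, the call (unchanged apart from an ungroup()) would look like:
dh2 <- dh %>%
  group_by(group) %>%
  mutate(idw_val = idw_w(x = Lat, y = Long, z = Obs)) %>%
  ungroup()
dh2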

R Simulation Programming Efficiency

I am a relatively new R programmer and have written a script that takes some statistical results and ultimately compares them to a vector of results in which the target variable has been randomized. The result vector contains the statistical results of n simulations. As the number of simulations increases (I would like to run at least 10,000 simulations), the run time becomes longer than I would like. I have tried to improve performance in the ways I know, but would love the help of others in optimizing it. The relevant part of the code is below.
#CREATE DATA
require(plyr)
Simulations <- 10001
Variation <- c("Control", "A", "B","C")
Trials <- c(727,724,723,720)
NonResponse <- c(692,669,679,682)
Response <- c(35,55,44,38)
ConfLevel <- .95
#PERFORM INITIAL CALCS
NonResponse <- Trials-Response
Data <-data.frame(Variation, NonResponse, Response, Trials)
total <- ddply(Data,.(Variation),function(x){data.frame(value = rep(c(0,1),times = c(x$NonResponse,x$Response)))})
total <- total[sample(1:nrow(total)), ]
colnames(total) <- c("Variation","Response")
#CREATE FUNCTION TO PERFORM SIMULATIONS
targetshuffle <- function(x) {
  shuffle_target <- x[, "Response"]
  shuffle_target <- data.frame(sample(shuffle_target))
  revised <- cbind(x[, "Variation"], shuffle_target)
  colnames(revised) <- c("Variation", "Yes")
  yes_variation <- data.frame(table(revised$Yes, revised$Variation))
  colnames(yes_variation) <- c("Yes", "Variation", "Shuffled_Response")
  Shuffled_Data <- subset(yes_variation, yes_variation$Yes == 1)
  Shuffled_Data <- Shuffled_Data[match(Variation, Shuffled_Data$Variation), ]
  yes_variation <- cbind(Data, Shuffled_Data)
  VectorPTest_All <- yes_variation[, c("Variation", "NonResponse", "Response", "Trials", "Shuffled_Response")]
  Control_Only <- yes_variation[yes_variation$Variation == "Control", ]
  VectorPTest_Chall <- subset(yes_variation, !(Variation == "Control"))
  VectorPTest_Chall <- VectorPTest_Chall[, c("Variation", "NonResponse", "Response", "Trials", "Shuffled_Response")]
  ControlResponse <- Control_Only$Response
  ControlResponseRevised <- Control_Only$Shuffled_Response
  ControlTotal <- Control_Only$Trials
  VariationCount <- length(VectorPTest_Chall$Variation)
  VP <- data.frame(c(VectorPTest_Chall, rep(ControlResponse), rep(ControlResponseRevised), rep(ControlTotal)))
  names(VP) <- c("Variation", "NonResponse", "Response", "Trials", "ResponseShuffled", "ControlReponse",
                 "ControlResponseShuffled", "ControlTotal")
  VP1 <<- data.frame(VP[, c(5, 7, 4, 8)])
  VP2 <<- data.frame(VP[, c(3, 6, 4, 8)])
  ptest <- apply(VP1, 1, function(column) prop.test(x = c(column[1], column[2]),
                                                    n = c(column[3], column[4]), alternative = "two.sided",
                                                    conf.level = ConfLevel, correct = FALSE)$p.value)
  min_p_value <- min(ptest)
  return(min_p_value)
}
#CALL FUNCTION
sim_result <- do.call(rbind, rlply(Simulations, targetshuffle(total)))
Offhand, one thing to look at is all the data frames you create. Each time you do that, you're copying all the data in the constituent objects. If the dimensions are predictable, you might consider creating empty matrices at the beginning of the function and populating them as you go.
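As an illustration of the pre-allocation idea applied to collecting the simulation results (a sketch only; the same principle applies to the data frames built inside targetshuffle()):
sim_result <- numeric(Simulations)          # pre-allocate the output vector once
for (i in seq_len(Simulations)) {
  sim_result[i] <- targetshuffle(total)     # each call returns a single p-value
}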

get coordinates of a patch in a raster map (raster package in R)

I have a raster map with many patches (clumps of contiguous cells with the same value). What I need to do is to obtain the coordinates of the center (or a point close to the center) of each patch.
I am very inexperienced with the raster package, but it seems I can get coordinates only if I know the position of the cells in the map. Is there any way to get coordinates given a value of the cells instead? Thank you.
If by patch you mean clumps, the raster package allows you to find, and isolate, clumps. Taking the clump() example from the raster package and extending it:
library(raster)
library(igraph)
detach("package:coin", unload=TRUE)
r <- raster(ncols=12, nrows=12)
set.seed(0)
r[] <- round(runif(ncell(r))*0.7 )
rc <- clump(r)
clump_id <- getValues(rc)
xy <- xyFromCell(rc,1:ncell(rc))
df <- data.frame(xy, clump_id, is_clump = rc[] %in% freq(rc, useNA = 'no')[,1])
df[df$is_clump == T, ]
plot(r)
plot(rc)
text(df[df$is_clump == T, 1:2], labels = df[df$is_clump == T, 3])
This may not be as interesting as you might expect. You can do it all over again with directions = 4:
rc <- clump(r, directions = 4)
clump_id <- getValues(rc)
xy <- xyFromCell(rc,1:ncell(rc))
df <- data.frame(xy, clump_id, is_clump = rc[] %in% freq(rc, useNA = 'no')[,1])
df[df$is_clump == T, ]
to get the new clump IDs, and maybe compute clump 'centroids':
library(plyr)
dfm <- ddply(df[df$is_clump == T, ], .(clump_id), summarise, xm = mean(x), ym = mean(y))
plot(rc)
text(dfm[, 2:3], labels = dfm$clump_id)
Notes:
There will be an error if you try to use clump() without first detaching the modeltools library; modeltools is loaded by coin and maybe other statistical libraries (hence the detach() call above).
You could take the mean of the coordinates of each patch:
# some dummy data
m <- matrix(c(
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,4,4,0,
0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,4,4,0,
0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,4,4,0,
0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,
0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,
0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,
0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,
0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,
0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,
0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,
0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,
0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,
0,0,2,3,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,
0,0,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), nrow=20, byrow=T)
# create a raster
r <- raster(m)
# convert raster to points
p <- data.frame(rasterToPoints(r))
# filter out background
p <- p[p$layer > 0,]
# for each patch calc mean coordinates
sapply(split(p[, c("x", "y")], p$layer), colMeans)
