Pairwise Dijkstra's algorithm with early termination according to hop count in R

I am looking for the most computationally and memory-efficient approach to computing particular entries of the distance matrix D obtained from pairwise Dijkstra's algorithm in R. More precisely, I only need D[i,j] if the hop count (unweighted distance) between node i and node j is at most a particular integer k (D[i,j] itself may be computed as a weighted shortest-path length for which the number of hops may be greater than k). D should be encoded as a sparse matrix for memory efficiency.
I was wondering if there has been some work done on this, or if there is an efficient way to adapt the current igraph functions to account for this restriction. For example, an early exit in pairwise Dijkstra's algorithm could really improve the efficiency of solving my problem.
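To make precise what I mean by early exit, here is an illustration in plain R (not meant to be fast): a single-source Dijkstra that runs over the whole graph, so weighted shortest paths may still use more than k edges, but simply stops as soon as every target in the k-hop neighbourhood of the source has been settled.
dijkstra_early_exit <- function(G, source, targets) {
  n <- vcount(G)
  w <- E(G)$weight
  adj <- adjacent_vertices(G, V(G))          # neighbour list for every vertex
  dist <- rep(Inf, n)
  dist[source] <- 0
  done <- rep(FALSE, n)
  remaining <- setdiff(as.integer(targets), source)
  repeat {
    masked <- ifelse(done, Inf, dist)
    u <- which.min(masked)                   # naive extract-min, O(n) per step
    if (!is.finite(masked[u])) break         # nothing reachable is left to settle
    done[u] <- TRUE
    remaining <- setdiff(remaining, u)
    if (length(remaining) == 0) break        # early exit: all k-hop targets settled
    nb <- as.integer(adj[[u]])
    if (length(nb) == 0) next
    eid <- get.edge.ids(G, as.vector(rbind(u, nb)))
    cand <- dist[u] + w[eid]                 # relax all edges out of u
    better <- cand < dist[nb]
    dist[nb[better]] <- cand[better]
  }
  dist[as.integer(targets)]
}
# e.g. dijkstra_early_exit(G, 1, ego(G, order = 2, nodes = 1)[[1]])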
I have tried to make this as efficient as possible myself, but with no luck so far. A first attempt is illustrated below.
library(igraph)
library(Matrix)
library(spam)
# Hoped to be the more efficient one
bounded_hop_pairG_1 <- function(G, k = 2){
  to <- ego(G, order = k)
  D <- sparseMatrix(i = unlist(lapply(1:length(V(G)), function(v) rep(v, length(to[[v]])))),
                    j = unlist(to),
                    x = unlist(lapply(1:length(V(G)), function(v) distances(G, v = v, to = to[[v]]))))
  return(D)
}
# Expected to be the less efficient one
bounded_hop_pairG_2 <- function(G, k = 2){
  D <- distances(G)
  D[distances(G, weights = NA) > k] <- 0
  return(as.spam(D))
}
# Sample graph
set.seed(42)
G <- sample_bipartite(500, 500, p=0.1)
E(G)$weight <- runif(length(E(G)))
# Check whether 'distances' actually implements early termination
start_time <- Sys.time()
d1 <- distances(G, v=1)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 0.00497961 secs
start_time <- Sys.time()
d2 <- distances(G, v=1, to=521)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 0.002238274 secs (consistently smaller than above)
start_time <- Sys.time()
D1 <- bounded_hop_pairG_1(G)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 2.671333 secs
start_time <- Sys.time()
D2 <- bounded_hop_pairG_2(G)
end_time <- Sys.time()
print(end_time - start_time)
# Time difference of 1.101419 secs
Though I suspect that my first function applies early termination and never stores the full pairwise distance matrix, it appears to be much slower than my second function (which also performs a full unweighted pairwise distance computation). Hence, I was hoping somebody could point out the most efficient way to implement the first function in R.

You could try the cppRouting package, available via GitHub.
It provides functions like get_distance_matrix(), which can use all cores.
library(cppRouting)
library(igraph)
library(spam)
library(Matrix)

# Sample graph
set.seed(42)
G <- sample_bipartite(500, 500, p = 0.1)
E(G)$weight <- runif(length(E(G)))

# Graph to data frame
G2 <- as_long_data_frame(G)

# Weighted graph
graph1 <- makegraph(G2[, 1:3], directed = FALSE)

# Unweighted graph
graph2 <- makegraph(cbind(G2[, 1:2], rep(1, nrow(G2))), directed = FALSE)

nodes <- sort(unique(c(G2$from, G2$to)))

myfunc <- function(Gr1, Gr2, nd, k = 2, cores = FALSE){
  test  <- get_distance_matrix(Gr1, nd, nd, allcores = cores)
  test2 <- get_distance_matrix(Gr2, nd, nd, allcores = cores)
  test[test2 > k] <- 0
  return(as.spam(test))
}
# Your first function
system.time(
  D1 <- bounded_hop_pairG_1(G)
)
# 2.18 s

# Your second function
system.time(
  D2 <- bounded_hop_pairG_2(G)
)
# 1.01 s

# One core
system.time(
  D3 <- myfunc(graph1, graph2, nodes)
)
# 0.69 s

# Parallel
system.time(
  D4 <- myfunc(graph1, graph2, nodes, cores = TRUE)
)
# 0.32 s
If you really want to stop the algorithm once the k-hop neighbourhood has been reached and you have a little knowledge of C++, it seems rather simple to slightly modify the original Dijkstra algorithm and use it via Rcpp.
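If you would rather stay in pure R, a middle ground between your two functions could be to process the sources in chunks, so that only a chunk_size-by-|V(G)| dense block is ever held in memory before being sparsified. A rough sketch (chunk_size is just a tuning knob of this sketch, not an igraph parameter):
library(igraph)
library(Matrix)

bounded_hop_chunked <- function(G, k = 2, chunk_size = 100) {
  n <- vcount(G)
  chunks <- split(seq_len(n), ceiling(seq_len(n) / chunk_size))
  blocks <- lapply(chunks, function(vs) {
    dw <- distances(G, v = vs)                # weighted block, chunk_size x n
    dh <- distances(G, v = vs, weights = NA)  # hop counts for the same block
    dw[dh > k] <- 0                           # keep only pairs within k hops
    Matrix(dw, sparse = TRUE)
  })
  do.call(rbind, blocks)
}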

Related

Speed of Daisy Function

I'm working on improving the speed of a function I'm writing for a dissimilarity measure that is mathematically quite similar to the Euclidean distance. However, when I time my function against the one implemented in the daisy function from the cluster package, I find a significant difference in speed, with daisy performing much better.
Given that (I'm assuming) a dissimilarity measure would require O(n^2 * p) time, since every pair of objects has to be compared over all variables (where n is the number of objects and p the number of variables), I find it difficult to understand how the daisy function performs so well (near constant time, from the few experiments I've done) relative to my simple and direct implementation. The code I have used both to implement and to test is shown below.
I have tried looking through the R source code for the implementation of the daisy function, but I found it difficult to understand, and I found no nested for loop. Any help with understanding why this function performs so fast, and how I could modify my code to reach similar speed, would be highly appreciated.
euclidean <- function(df){
  no_obj <- nrow(df)
  dist <- array(0, dim = c(no_obj, no_obj))
  for (i in 1:no_obj){
    for (j in 1:no_obj){
      dist_v <- 0
      if (i != j){
        for (v in 1:ncol(df)){
          dist_v <- dist_v + sqrt((df[i, v] - df[j, v])^2)
        }
      }
      dist[i, j] <- dist_v
    }
  }
  return(dist)
}
library(cluster)  # for daisy()
data("iris")

tic <- Sys.time()
dst <- euclidean(iris[, 1:4])
time <- difftime(Sys.time(), tic, units = "secs")[[1]]
print(paste("Time taken [Euclidean]: ", time))

tic <- Sys.time()
dst <- daisy(iris[, 1:4])
time <- difftime(Sys.time(), tic, units = "secs")[[1]]
print(paste("Time taken [Daisy]: ", time))
One option:
euclidean3 <- function(df) {
  require(data.table)
  n <- nrow(df)
  i <- CJ(1:n, 1:n)  # generate all row combinations
  dl <- sapply(df, function(x) sqrt((x[i[[1]]] - x[i[[2]]])^2))  # loop over columns
  dv <- rowSums(dl)  # sum values over columns
  d <- matrix(dv, n, n)  # fill in matrix
  d
}
dst3 <- euclidean3(iris[,1:4])
all.equal(euclidean(iris[,1:4]), dst3) # TRUE
[1] "Time taken [Euclidean3]: 0.008"
[1] "Time taken [Daisy]: 0.002"
The largest bottleneck in your code is selecting data.frame elements inside the loop (df[j, v]). Converting the data.frame to a matrix could also improve speed. I believe there is an even more performant approach somewhere on Stack Overflow; you just need to search with the right keywords...
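Following up on the matrix suggestion, here is a minimal sketch: since sqrt((a - b)^2) is simply |a - b|, your euclidean() is really computing the Manhattan (L1) distance, so base R's dist() on a matrix reproduces it directly.
euclidean_mat <- function(df) {
  m <- as.matrix(df)                        # matrix indexing is far cheaper than df[i, v]
  as.matrix(dist(m, method = "manhattan"))  # full symmetric matrix with zero diagonal
}
dst_mat <- euclidean_mat(iris[, 1:4])
all.equal(euclidean(iris[, 1:4]), unname(dst_mat))  # should be TRUE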

R aborts when using function DIST (110 GB vector)

I need to run a hierarchical clustering algorithm in R on a dataset with 173000 rows and 17 columns.
When running the function dist() on the dataset, R aborts. I have also tried it with a Windows pc and the error message I get is "cannot allocate vector of size 110.5 Gb".
My Mac and my Windows pc have 4 GB of RAM.
Is there a way to still do this in R? I know hierarchical algorithms are not the best for large datasets, but it is required by a university assignment.
Thank you
The problem can be solved by writing a function that computes the pairwise Euclidean distances between columns of the data set, assumed below to be in tabular form. For other distances, a similar function can be written.
dist2 <- function(X){
  cmb <- combn(seq_len(ncol(X)), 2)
  d <- matrix(NA_real_, nrow = ncol(X), ncol = ncol(X))
  if (!is.null(colnames(X)))
    dimnames(d) <- list(colnames(X), colnames(X))
  for (i in seq_len(ncol(cmb))){
    ix <- cmb[1, i]
    iy <- cmb[2, i]
    res <- sqrt(sum((X[, ix] - X[, iy])^2))
    d[ix, iy] <- d[iy, ix] <- res
  }
  diag(d) <- 0
  d
}
Now test the function with a data.frame of the dimensions in the question.
set.seed(2021)
m <- replicate(17, rnorm(173000))
m <- as.data.frame(m)
dist2(m)
First and foremost, it would be very nice of you to provide a reprex (reproducible example). Make sure to do so next time.
Regarding the issue itself, you can use the sample_frac function (if I am not mistaken, it comes from the tidyverse, specifically dplyr). For example, sample_frac(your_data, .5) will sample 50% of your data frame. This reduces the amount of data to be clustered and makes things easier for your laptop.
The other way is to extend the memory limit with memory.limit(size = n), where n is a number in megabytes (this works on Windows only).
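A minimal sketch of the sampling route (sample_frac() is from dplyr; your_data stands for your 173000 x 17 data set). The fraction has to be small enough that the resulting n-by-n distance object fits in RAM; with 4 GB, roughly 5% (about 8,650 rows, around 0.3 GB for the distance object) is a safer starting point than 50%.
library(dplyr)

sub <- sample_frac(your_data, 0.05)   # keep 5% of the rows
d   <- dist(sub)                      # ~8650 x 8650 lower triangle
hc  <- hclust(d, method = "ward.D2")
plot(hc)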

Poor speed gain in using `future` for parallelization

I find that the speed gain from using the future (and furrr) package for parallelization in R is not satisfactory. In particular, the speed improvement is not close to linear. My machine has 4 cores, so I thought the speed gain should be roughly linear as long as the number of workers I specify does not exceed the number of cores available on my machine. However, that is not the case.
The following example illustrates the problem: I draw 10^7 random numbers 500 times.
library(future)
library(furrr)

# Parameters
n <- 1e7
m <- 500

# Compute the mean
rmean <- function(x, n) {
  rand.vec <- runif(n)
  rand.mean <- mean(rand.vec)
  return(rand.mean)
}

# Record the time used to compute the mean of n numbers for m times
rtime <- function(m, n) {
  t1 <- Sys.time()
  temp <- future_map(.x = 1:m,
                     .f = rmean,
                     n = n,
                     .options = furrr::furrr_options(seed = TRUE))
  t2 <- Sys.time()
  # Print the time used
  print(t2 - t1)
  return(temp)
}
# Print the time used for different number of workers
plan(multisession, workers = 1)
set.seed(1)
x <- rtime(m, n)
# Time difference of 2.503885 mins
plan(multisession, workers = 2)
set.seed(1)
x <- rtime(m, n)
# Time difference of 1.341357 mins
plan(multisession, workers = 3)
set.seed(1)
x <- rtime(m, n)
# Time difference of 57.25641 secs
plan(multisession, workers = 4)
set.seed(1)
x <- rtime(m, n)
# Time difference of 47.31929 secs
In the above example, the speed gains that I get are:
1.87x for 2 workers
2.62x for 3 workers
3.17x for 4 workers
The speed gain in the above example is not close to linear, especially when I use 4 workers. I thought this might be because of the overhead time from the plan function. However, the speed gain is similar if I run the procedure multiple times after setting the number of workers. This is illustrated as follows:
plan(multisession, workers = 3)
set.seed(1)
x <- rtime(m, n)
# Time difference of 58.07243 secs
set.seed(1)
x <- rtime(m, n)
# Time difference of 1.012799 mins
set.seed(1)
x <- rtime(m, n)
# Time difference of 57.96777 secs
I also tried to use the future_lapply function from the future.apply package instead of the future_map function from the furrr package. However, its speed gain is similar as well. Therefore, I would appreciate any advice on what is going on here. Thank you!
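For reference, the future.apply variant I tried looks roughly like this (a sketch; it reuses rmean(), m, and n from above, with future.seed = TRUE for parallel-safe random numbers):
library(future.apply)

plan(multisession, workers = 4)
rtime_lapply <- function(m, n) {
  t1 <- Sys.time()
  temp <- future_lapply(1:m, rmean, n = n, future.seed = TRUE)
  print(Sys.time() - t1)
  return(temp)
}
set.seed(1)
x <- rtime_lapply(m, n)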

Improve processing performance in R for Social Network Analysis

I am doing social network analysis using the igraph package in R and I am dealing with close to 2 million vertices and edges. I am also calculating degrees of separation for nearly 8 million pairs of vertices. Usually, execution takes somewhere between 2 and 3 hours, which is way too long. I need some input and suggestions to improve this performance. Below is the sample code I am using:
g <- graph.data.frame( ids, directed = F) # ids contains approximately 2 million records
distances(graph = g, v = t_ids$ID_from[x], to = t_ids$ID_to[x], weights = NA)
# t_ids contains approximately 8 million records for which degrees of separation are to be calculated using shortest-path algorithms
Thanks in advance!
I don't think so, but I'd be very happy to be proven wrong.
You should look into other ways of optimising the code that is running.
If your data is fixed, you could compute the distances once, save the (probably rather big) distance matrix, and ask that for degrees of separation.
If your analysis does not require distances between all x vertices, you should look into making optimisations in your code by shortening t_ids$ID_from[x]. Get only the distances you need. I suspect that you're already doing this, though.
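A sketch of "get only the distances you need", using the objects from your question (g and t_ids): call distances() once with all distinct source vertices, then do a vectorised lookup for the 8 million pairs. This only pays off if the number of distinct sources times the number of vertices fits in memory.
src <- unique(as.character(t_ids$ID_from))
D   <- distances(g, v = src, weights = NA)   # one row per distinct source vertex
sep <- D[cbind(as.character(t_ids$ID_from),
               as.character(t_ids$ID_to))]   # degrees of separation for every pair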
distances() actually computes rather quickly. At 10,000 nodes (which amounts to roughly 5 * 10^7 undirected distances), my crappy machine gets the full, roughly 700 MB distance matrix in a few seconds.
I first thought about the different algorithms you can choose in distances(), but now I doubt that they will help you. I ran a speed test on the different algorithms to see if I could recommend any of them to you, but they all seem to run at more or less the same speed (the results below are relative to the time taken with the automatic algorithm, which is what your code above would use):
sample automatic unweighted dijkstra bellman-ford johnson
1 10 1 0.9416667 0.9750000 1.0750000 1.0833333
2 100 1 0.9427083 0.9062500 0.8906250 0.8958333
3 1000 1 0.9965636 0.9656357 0.9977090 0.9873998
4 5000 1 0.9674200 0.9947269 0.9691149 1.0007533
5 10000 1 1.0070885 0.9938136 0.9974223 0.9953602
I don't think anything can be concluded from this, but it's running on an Erdős-Rényi model. It's possible that your network structure favours one algorithm over another, but they would still not give you anywhere near the performance boost that you're hoping for.
The code is here:
# igraph
library(igraph)

# Setup:
samplesizes <- c(10, 100, 1000, 5000, 10000)
reps <- c(100, 100, 15, 3, 1)
algorithms <- c("automatic", "unweighted", "dijkstra", "bellman-ford", "johnson")
df <- as.data.frame(matrix(ncol = length(algorithms), nrow = 0), stringsAsFactors = FALSE)
names(df) <- algorithms

# Any random graph
g <- erdos.renyi.game(10000, 10000, "gnm")

# These are the different algorithms used by distances():
m.auto <- distances(g, v = V(g), to = V(g), weights = NA, algorithm = "automatic")
m.unwg <- distances(g, v = V(g), to = V(g), weights = NA, algorithm = "unweighted")
m.dijk <- distances(g, v = V(g), to = V(g), weights = NA, algorithm = "dijkstra")
m.belm <- distances(g, v = V(g), to = V(g), weights = NA, algorithm = "bellman-ford")
m.john <- distances(g, v = V(g), to = V(g), weights = NA, algorithm = "johnson")

# They produce the same result:
sum(m.auto == m.unwg & m.auto == m.dijk & m.auto == m.belm & m.auto == m.john) == length(m.auto)

# This function is used to test the speed of distances() run with different algorithms
test_distances <- function(alg){
  m.auto <- distances(g, v = V(g), to = V(g), weights = NA, algorithm = alg)
  (TRUE)
}

# Build test results
for (i.sample in 1:length(samplesizes)){
  # Create a random network to test
  g <- erdos.renyi.game(samplesizes[i.sample], (samplesizes[i.sample] * 1.5),
                        type = "gnm", directed = FALSE, loops = FALSE)
  i.rep <- reps[i.sample]
  for (i.alg in 1:length(algorithms)){
    df[i.sample, i.alg] <- system.time(replicate(i.rep, test_distances(algorithms[i.alg])))[['elapsed']]
  }
}

# Normalize benchmark results
dfn <- df
dfn[, 1:length(df[,])] <- df[, 1:length(df[,])] / df[, 1]
dfn$sample <- samplesizes
dfn <- dfn[, c(6, 1:5)]
dfn

Parallel Monte Carlo Simulation in R using snowfall

I am trying to compare up to thousands of estimated beta distributions. Each beta distribution is characterized by its two shape parameters, alpha and beta.
I now draw 100,000 samples from every distribution. As a final result I want to get an ordering of the distributions according to which ones have the highest probability in each sample draw.
My first approach was to use lapply to generate a matrix of N * NDRAWS numeric values, which consumes too much memory as N grows beyond 10,000 (10,000 * 100,000 * 8 bytes is roughly 8 GB).
So I decided on a sequential approach: order every single draw, accumulate the orders over all draws, and take the final order, as shown in the example below:
set.seed(12345)
N <- 100
NDRAWS <- 100000
df <- data.frame(alpha = sample(1:20, N, replace = TRUE),
                 beta = sample(1:200, N, replace = TRUE))
vec <- vector(mode = "integer", length = N)

for (i in 1:NDRAWS){
  # order probabilities after a single draw for every theta
  pos <- order(rbeta(N, shape1 = df$alpha, shape2 = df$beta))
  # sum up winning positions for every theta
  vec[pos] <- vec[pos] + 1:N
}

# order thetas
ord <- order(-vec)
df[ord,]
This consumes only N * 4 bytes of memory, as there is no giant matrix, just a single vector of length N. My question now is: how can I speed up this operation using snowfall (or any other multicore package), taking advantage of my 4 CPU cores instead of using just one?
# parallelize using the snowfall package
library(snowfall)
sfInit(parallel = TRUE, cpus = 4, type = "SOCK")
sfLapply(1:NDRAWS, function(x) ??????)
sfStop()
Any help is appreciated!
This can be parallelized in the same way that one would parallelize random forest or bootstrapping. You just perform the sequential code on each of the workers but with each using a smaller number of iterations. That is much more efficient than splitting each iteration of the for loop into a separate parallel task.
Here's your complete example converted to use the foreach package with the doParallel backend:
set.seed(12345)
N <- 100
NDRAWS <- 100000
df <- data.frame(alpha = sample(1:20, N, replace = TRUE),
                 beta = sample(1:200, N, replace = TRUE))

library(doParallel)
nworkers <- detectCores()
cl <- makePSOCKcluster(nworkers)
clusterSetRNGStream(cl, c(1,2,3,4,5,6,7))
registerDoParallel(cl)

vec <- foreach(ndraws = rep(ceiling(NDRAWS / nworkers), nworkers),
               .combine = '+') %dopar% {
  v <- integer(N)
  for (i in 1:ndraws) {
    pos <- order(rbeta(N, shape1 = df$alpha, shape2 = df$beta))
    v[pos] <- v[pos] + 1:N
  }
  v
}

ord <- order(-vec)
df[ord,]
Note that this gives different results than the sequential version because different random numbers are generated by the workers. I used the parallel random number support provided by the parallel package since that is good practice.
Well, the functionality is there. I'm not sure though what you'd be returning with each iteration.
Perhaps try this?
myFunc <- function(xx, N) {
  # xx is the iteration index (unused); return this draw's contribution
  pos <- order(rbeta(N, shape1 = df$alpha, shape2 = df$beta))
  v <- integer(N)
  v[pos] <- 1:N
  v
}
Using doParallel will allow you to add results:
require(doParallel)
registerDoParallel(cores = 4)
vec <- foreach(i = 1:NDRAWS, .combine = '+') %dopar% myFunc(i, N)
