Local/nodal and global efficiency using igraph's shortest.paths function in R

I am trying to calculate local efficiency of a graph using shortest.paths of igraph package.
The local efficiency of a vertex v is, by definition, the "global efficiency" computed among all direct neighbours of v (Latora & Marchiori, 2001).
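For reference, in symbols these are the standard Latora-Marchiori definitions (with d_{ij} the shortest-path length between vertices i and j, and N the number of vertices of the graph in question):

E_{glob}(G) = \frac{1}{N(N-1)} \sum_{i \neq j} \frac{1}{d_{ij}}, \qquad E_{loc}(v) = E_{glob}(G_v),

where G_v is the subgraph induced by the direct neighbours of v, with v itself excluded.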
I came up with the code below for global and local efficiency. However, the latter includes the target vertex in its calculation, and the paper above says the target vertex has to be taken out.
#Global Efficiency (average inverse shortest paths between all u--v vertices)
eff<-1/(shortest.paths(my.graph))
eff[!is.finite(eff)]<-0
gl.eff<-mean(eff,na.rm=TRUE)
#Mean local efficiency (global efficiency for each node)
gn<-graph.neighborhood(my.graph,1) #list with subgraphs of directly connected graphs
names(gn)<-colnames(my.corr.matrix)
local.eff<-numeric(length(gn))
for (i in 1:length(gn)){
  gn[[i]] <- gn[[i]] - vertex(V(gn[[i]])[grep(names(gn[i]), V(gn[[i]]))]) # doesn't match
  eff.gn <- 1/shortest.paths(gn[[i]])
  eff.gn[!is.finite(eff.gn)] <- 0
  eff.gn <- mean(eff.gn, na.rm=TRUE)
  local.eff[i] <- eff.gn
  mean.local.eff <- mean(local.eff, na.rm=TRUE)
}
I am trying to match the list name (each element of the list is a subgraph) with the name of the vertex inside that subgraph. I am trying to use 'grep()', but haven't been able to get it right. Could someone give me a hand on that?
Thanks in advance,

I have already written a function to do this that is many times faster than what you've written. See if the following suits your needs. For smaller graphs (or if you are using Windows), you will probably want to replace simplify2array(mclapply(nodes, ...)) with sapply(nodes, ...) and, of course, remove the argument mc.cores=detectCores(). However, the parallelisation really helps performance on large graphs.
You can see the code at the following link:
Local efficiency code
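For reference, here is a minimal sketch along those lines (it is not the linked code, just an illustration of the approach: local efficiency per node, with the node itself excluded from its neighbourhood, parallelised with mclapply). It assumes an undirected, unweighted graph g.

library(igraph)
library(parallel)

local_efficiency <- function(g) {
  eff_one <- function(v) {
    nb <- neighbors(g, v)                       # direct neighbours of v
    if (length(nb) < 2) return(0)               # fewer than 2 neighbours -> efficiency 0
    sub <- induced_subgraph(g, nb)              # subgraph of neighbours, v itself excluded
    inv <- 1/distances(sub)
    inv[!is.finite(inv)] <- 0                   # disconnected pairs and the diagonal
    sum(inv) / (length(nb) * (length(nb) - 1))  # mean over ordered pairs
  }
  # on Windows or for small graphs: sapply(seq_len(vcount(g)), eff_one)
  simplify2array(mclapply(seq_len(vcount(g)), eff_one, mc.cores = detectCores()))
}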
EDIT: Included some benchmark info (where the function f is yours and g is the one linked above). This was done on a laptop with 4 cores @ 2.10 GHz (Intel i3-2310m).
g.rand <- sample_gnp(100, .1)
V(g.rand)$degree <- degree(g.rand)
compare <- microbenchmark(f(g.rand), g(g.rand), times=1e2)
compare
Unit: milliseconds
      expr      min        lq     mean   median       uq      max neval cld
 f(g.rand) 476.9853 4097.2202 4544.720 4539.911 4895.020 9346.873   100   b
 g(g.rand) 299.2696  329.6629 1319.377 1114.054 2314.304 3003.966   100  a

In case someone needs the local efficiency in Python, here is my code for that:
Python version
import numpy as np
from igraph import *
np.seterr(divide='ignore')
def nodal_eff(g):
    """
    This function calculates the nodal efficiency of a weighted graph object.
    Created by: Loukas Serafeim (seralouk), Nov 2017

    Args:
        g: An igraph Graph() object.

    Returns:
        The nodal efficiency of each node of the graph
    """
    sp = g.shortest_paths_dijkstra(weights=g.es["weight"])
    sp = np.asarray(sp)
    temp = 1.0 / sp
    np.fill_diagonal(temp, 0)
    N = temp.shape[0]
    ne = (1.0 / (N - 1)) * np.apply_along_axis(sum, 0, temp)
    return ne

I know that my title doesn't fully account for what my question really was. So I'm answering it myself since I just got it to work.
#Mean local efficiency (global efficiency for each node)
gn<-graph.neighborhood(my.graph,1) #list with subgraphs of directly connected graphs
names(gn)<-V(my.graph)$name
local.eff<-numeric(length(gn))
for (i in 1:length(gn)){
  gn[[i]] <- gn[[i]] - vertex(V(gn[[i]])[match(names(gn[i]), V(gn[[i]])$name)]) #MATCHES
  eff.gn <- 1/shortest.paths(gn[[i]])
  eff.gn[!is.finite(eff.gn)] <- 0
  eff.gn <- mean(eff.gn, na.rm=TRUE)
  local.eff[i] <- eff.gn
}
local.eff[local.eff %in% NaN]<-NA
mean.local.eff<-mean(local.eff, na.rm=TRUE)

Related

time complexity issue when checking the connectivity between nodes in R igraph

I'd like to see which first degree nodes are connected to which second degree nodes for every node in a given graph. Suppose I generate a graph with 1000 nodes.
library(igraph)
g <- erdos.renyi.game(1000, 0.2)
When I calculate the set of adjacent nodes and second degree nodes for every node in the graph, there's no issue.
The code runs pretty quickly, independent of the size of the network. But once I add an if statement to check whether two nodes are connected, as in:
for(j in adjacent){
  for(k in secondDegreNodes){
    if( are.connected(g, j, k) ){
    }
  }
}
My code takes forever. What is the exact complexity issue that I am facing? Is there a better way to conduct this operation?
There is definitely something odd happening. Even the following code block does not finish, though it is the simplest of operations.
g <- erdos.renyi.game(1000, 0.3)
A <- as_adjacency_matrix(g)
from<- 1
to <- nrow(A)
for(i in from:to){
  for(j in i:to){
    if(A[i,j] == 1){
      # do nothing
    }
  }
}
EDIT: I believe there could be some issue with the igraph package. I generated the graph with igraph in R and coded everything in Java, and it worked. As I expected, there is no complexity issue with the algorithm itself. However, I have no idea what is wrong with igraph.
In the example that you give, indexing matrix A repeatedly in a for-loop is rather inefficient. In this particular instance it is due to A being of class dgCMatrix from package Matrix.
When you compare the performance before and after converting A to another class, you will notice the difference. With N = 300 nodes, the duration of the for-loop decreases from 23.5 seconds to 0.1 seconds on my machine once I convert to the standard matrix class. There are moreover (N^2 + N) / 2 comparisons to be made. The squared term means going from 300 nodes to 3 x 300 = 900 nodes will roughly increase the computing time ninefold (at the very least).
If you further Rprofile the code, you will see that when subsetting an object of class dgCMatrix (i.e. A[i, j]) a number of further R functions are called, whereas [ is implemented straight in C for the basic matrix class. In addition, dgCMatrix belongs to the S4 object system, which means, among other things, that finding the right method for [ is a little more costly than usual.
Finally, if you rely on R, you will in general be much better off using vectorised operations. These typically avoid a deep call stack of further (inefficient) R functions and are often implemented in C. With the adjacency matrix, you can quickly find second-degree nodes by inspecting A2 = A %*% A, which will be very fast also (or, I suspect, especially so) for an object of class dgCMatrix.
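A small sketch of that vectorised idea, reusing g from the question (the conversion to a plain dense matrix follows the advice above; a dgCMatrix would work for the matrix product as well):

A  <- as.matrix(as_adjacency_matrix(g))   # ordinary dense matrix for fast indexing
A2 <- A %*% A                             # A2[i, j] > 0 <=> there is a walk of length 2 from i to j
second_deg <- (A2 > 0) & (A == 0)         # distance exactly 2: two-step walk but no direct edge
diag(second_deg) <- FALSE                 # a node is not its own second-degree neighbour
which(A[1, ] == 1)                        # direct (first-degree) neighbours of node 1
which(second_deg[1, ])                    # second-degree neighbours of node 1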
Timing:
library(igraph)
N <- 300
g <- erdos.renyi.game(N, 0.3)
A <- as_adjacency_matrix(g)
from<- 1
to <- nrow(A)
class(A)
# [1] "dgCMatrix"
# attr(,"package")
# [1] "Matrix"
# run through matrix via for loop
# 23.5 seconds
system.time({
  for(i in from:to){
    for(j in i:to){
      if(A[i,j] == 1) {}
    }
  }
})
# change class
A <- as.matrix(A)
class(A)
# [1] "matrix"
# run for loop again
# 0.097 seconds
system.time({
  for(i in from:to){
    for(j in i:to){
      if(A[i,j] == 1) {}
    }
  }
})
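As a follow-up to the vectorisation point, here is a sketch of how the whole double loop above collapses into a single vectorised call (using the dense A from the last step); instead of merely testing A[i, j] == 1, it lists every connected pair with i <= j:

pairs <- which(A == 1, arr.ind = TRUE)                            # all edges as (row, col) indices
pairs <- pairs[pairs[, "row"] <= pairs[, "col"], , drop = FALSE]  # keep the upper triangle only
head(pairs)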

How to use doParallel for calculating distance between zipcodes in R?

I have a large dataset (2.6M rows) with two zip codes and the corresponding latitudes and longitudes, and I am trying to compute the distance between them. I am primarily using the geosphere package to calculate the Vincenty ellipsoid distance between the zip codes, but it is taking a massive amount of time for my dataset. What would be a fast way to implement this?
What I tried
library(tidyverse)
library(geosphere)
zipdata <- select(fulldata,originlat,originlong,destlat,destlong)
## Very basic approach
for(i in seq_len(nrow(zipdata))){
  zipdata$dist1[i] <- distm(c(zipdata$originlat[i], zipdata$originlong[i]),
                            c(zipdata$destlat[i], zipdata$destlong[i]),
                            fun = distVincentyEllipsoid)
}
## Tidyverse approach
zipdata <- zipdata %>%
  mutate(dist2 = distm(cbind(originlat, originlong), cbind(destlat, destlong),
                       fun = distHaversine))
Both of these methods are extremely slow. I understand that 2.1M rows will never be a "fast" calculation, but I think it can be made faster. I have tried the following approach on a smaller test data without any luck,
library(doParallel)
cores <- 15
cl <- makeCluster(cores)
registerDoParallel(cl)
test <- select(head(fulldata,n=1000),originlat,originlong,destlat,destlong)
foreach(i = seq_len(nrow(test))) %dopar% {
  library(geosphere)
  zipdata$dist1[i] <- distm(c(zipdata$originlat[i], zipdata$originlong[i]),
                            c(zipdata$destlat[i], zipdata$destlong[i]),
                            fun = distVincentyEllipsoid)
}
stopCluster(cl)
Can anyone help me out with either the correct way to use doParallel with geosphere or a better way to handle this?
Edit: Benchmarks from (some) replies
## benchmark
library(microbenchmark)
library(geodist)
zipsamp <- sample_n(zip, size=1000000)
microbenchmark(
  dave = {
    # Dave2e
    zipsamp$dist1 <- distHaversine(cbind(zipsamp$patlong, zipsamp$patlat),
                                   cbind(zipsamp$faclong, zipsamp$faclat))
  },
  geohav = {
    zipsamp$dist2 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "haversine")
  },
  geovin = {
    zipsamp$dist3 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "vincenty")
  },
  geocheap = {
    zipsamp$dist4 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "cheap")
  },
  unit = "s", times = 100)
# Unit: seconds
#      expr        min         lq       mean     median         uq        max neval cld
#      dave 0.28289613 0.32010753 0.36724810 0.32407858 0.32991396 2.52930556   100   d
#    geohav 0.15820531 0.17053853 0.18271300 0.17307864 0.17531687 1.14478521   100  b
#    geovin 0.23401878 0.24261274 0.26612401 0.24572869 0.24800670 1.26936889   100   c
#  geocheap 0.01910599 0.03094614 0.03142404 0.03126502 0.03203542 0.03607961   100 a
A simple all.equal test showed that for my dataset the haversine method is equal to the vincenty method, but has a "Mean relative difference: 0.01002573" with the "cheap" method from the geodist package.
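For reference, the comparison above was along these lines (a sketch using the distance columns computed in the benchmark; which pair of columns you compare is up to you):

all.equal(zipsamp$dist2, zipsamp$dist3)   # geodist haversine vs vincenty
all.equal(zipsamp$dist2, zipsamp$dist4)   # geodist haversine vs "cheap"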
R is a vectorized language, so the function will operate over all of the elements in the vectors. Since you are calculating the distance between the origin and destination for each row, the loop is unnecessary. The vectorized approach gives approximately 1000x the performance of the loop.
Also, using distVincentyEllipsoid (or distHaversine, etc.) directly and bypassing the distm function should improve performance further.
Without any sample data this snippet is untested.
library(geosphere)
zipdata <- select(fulldata,originlat,originlong,destlat,destlong)
## Vectorised approach (no loop)
zipdata$dist1 <- distVincentyEllipsoid(cbind(zipdata$originlong, zipdata$originlat),
                                       cbind(zipdata$destlong, zipdata$destlat))
Note: For most of the geosphere functions to work correctly, the proper order is: longitude first then latitude.
The reason the tidyverse approach listed above is slow is that distm calculates the distance between every origin and every destination, which would result in a 2 million by 2 million element matrix.
I used #SymbolixAU's suggestion to use the geodist package to perform the 2.1M distance calculations on my datasets. I found it to be significantly faster than the geosphere package for every test (I have added one of them in my main question). The measure = "cheap" option in geodist uses the cheap ruler method, which has low error rates for distances below 100 km. See the geodist vignette for more information. Since some of my distances were longer than 100 km, I settled on using the Vincenty ellipsoid measure.
If you are going to use geosphere, I would either use a fast approximate method like distHaversine, or the still fast and very precise distGeo method. (The distVincenty* functions are mainly implemented for curiosity.)
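A minimal sketch of that paired, vectorised geosphere call (column order is longitude, then latitude; the column names are the ones from the question's zipdata):

library(geosphere)

# very precise geodesic distance, one value per row of zipdata
zipdata$dist_geo <- distGeo(cbind(zipdata$originlong, zipdata$originlat),
                            cbind(zipdata$destlong, zipdata$destlat))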

How could I speed up my R code? [closed]

Disclaimer
Hello everyone! I recently started programming in R. My code works just fine, but in terms of speed some of it takes far too long to be of practical use. I hope someone can help me make this code run faster, either by optimising the code or by using one of the multicore packages.
About the code
I have large datasets, each containing about 15,000 numeric values. The code takes two parameters (p, n), where p >= n, and makes subsets of the data. It applies the zyp.yuepilon function (from the zyp package) to each row of the subsets. The parameter n is then used to apply the same function to an n-sized subset.
The problem is that I run this code in a nested for loop (p in 10:40 and n in 10:40), so it takes an eternity to get the results, and this is just one dataset among many.
sp <- function(p, n){
  library(zyp)
  data <- runif(15000, 1, 4)
  lower <- seq(80 - p + 1, by=1, length.out=length(data)-81)
  upper <- lower + p - 1
  subsets <- matrix(nrow=length(lower), ncol=p)
  for(j in 1:length(lower)){
    subsets[j, ] <- data[lower[j]:upper[j]]
  }
  ret <- apply(subsets, 1, zyp.yuepilon)
  subset_n <- subsets[, 1:n]
  ret2 <- apply(subset_n, 1, zyp.yuepilon)
  return(list(ret, ret2))
}
Benchmark results in seconds:
     expr      min       lq   median       uq      max neval
 sp(7, 6) 92.77266 94.24901 94.53346 95.10363 95.64914    10
Here is a series of comments, rather than an answer.
Looking at the zyp.yuepilon function body (by calling the function without parentheses in an R session), you can see that this function, along with zyp.sen, is written in plain R code (as opposed to compiled code).
The biggest speed-up is likely attained by using the Rcpp package, which facilitates calling (compiled) C++ code within R. In fact, there is a small linear model example here: Fast LM model using Rcpp/RcppArmadillo.
I would be inclined to rewrite the two functions zyp.yuepilon and zyp.sen in C++ using Rcpp, including the loop over subset vectors (which you are currently doing with apply).
For general R speed-up issues see this question R loop performance, as well as the R package plyr, which may provide an entry point for taking a map-reduce type of approach to your problem.
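To illustrate the Rcpp mechanism only (this is not a rewrite of zyp.yuepilon, just a toy function showing how compiled C++ is defined and called from R):

library(Rcpp)

cppFunction('
  double rowSlope(NumericVector y) {
    // toy example: least-squares slope of y against 1..n, computed in C++
    int n = y.size();
    double xbar = (n + 1) / 2.0, ybar = mean(y);
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n; ++i) {
      num += (i + 1 - xbar) * (y[i] - ybar);
      den += (i + 1 - xbar) * (i + 1 - xbar);
    }
    return num / den;
  }
')

rowSlope(runif(10))   # called like any other R function

A real port would do the same for the bodies of zyp.yuepilon and zyp.sen.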
If you want to steer clear of C++, then a series of micro-optimisations would be your quickest win. To speed up the apply aspect of your code, you could use something like this
library(doParallel)
library(parallel)
library(foreach)
library(zyp)

cl <- makeCluster(4)
registerDoParallel(cl)

sp_1 <- function(p=7, n=6){
  N_ob <- 15000
  off_set <- 81
  N_ob_o <- N_ob - off_set
  am <- matrix(runif(N_ob*p), ncol=p)
  subsets <- am[-(1:off_set), ]
  ret <- matrix(unlist(foreach(i=1:N_ob_o) %dopar% zyp::zyp.yuepilon(subsets[i,]),
                       use.names=FALSE), ncol=11, byrow=TRUE)
  subset_n <- subsets[, 1:n]
  ret2 <- matrix(unlist(foreach(i=1:N_ob_o) %dopar% zyp::zyp.yuepilon(subset_n[i,]),
                        use.names=FALSE), nrow=11)
  return(list(ret, ret2))
}
sp <- function(p=7, n=6){
  data <- runif(15000, 1, 4)
  lower <- seq(80 - p + 1, by=1, length.out=length(data)-81)
  upper <- lower + p - 1
  subsets <- matrix(nrow=length(lower), ncol=p)
  for(j in 1:length(lower)){
    subsets[j, ] <- data[lower[j]:upper[j]]
  }
  ret <- apply(subsets, 1, zyp.yuepilon)
  subset_n <- subsets[, 1:n]
  ret2 <- apply(subset_n, 1, zyp.yuepilon)
  return(list(ret, ret2))
}
system.time(sp_1())
system.time(sp())
This gives me a speed-up of around a factor of 2. But this will depend on your platform, etc. Check out the help files for the functions and packages above, and tune the number of clusters using makeCluster to see what works best for your platform (in the absence of any information about your particular set-up).
Another route might be to make use of the byte-code compiler via library(compiler), to see whether the various functions can be optimised that way.
library(compiler)
enableJIT(3)
zyp_comp <- cmpfun(zyp.yuepilon)

Condensing a for-loop in R

I have the following code in R
library(mvtnorm)
m = matrix(rnorm(2000000),nrow=200)
A = matrix(rnorm(40000),ncol=200)
A = A%*%t(A)
C = array(A,c(200,200,10000))
B = 10000
S = 100
postpred = array(NA,c(200,S,B))
for(i in 1:B){
  postpred[,,i] = t(rmvnorm(S, m[,i], C[,,i], method="svd"))
}
This code is extremely slow because I have to loop 10,000 times, simulating 100 draws from the multivariate normal each time, and m and C can be very large as well. What I would like to do is calculate postpred without a loop. I have tried using the apply function, but to no avail. Any help or suggestions are greatly appreciated.
Others have pointed out that apply (and similar functions) won't help you much in your case, and they are right.
For what it is worth, I checked whether you would gain performance by compiling your code. Here is a little benchmark that I made with your problem (I reduced the size of the matrices, because otherwise I could not run them):
library(mvtnorm)
func = function()
{
  m = matrix(rnorm(200000), nrow=100)
  A = matrix(rnorm(10000), ncol=100)
  A = A %*% t(A)
  C = array(A, c(100,100,1000))
  B = 1000
  S = 10
  postpred = array(NA, c(100,S,B))  # one nrow(m) x S slice per iteration
  for(i in 1:B){
    postpred[,,i] = t(rmvnorm(S, m[,i], C[,,i], method="svd"))
  }
}
require(compiler)
func_compiled <- cmpfun(func)
require(microbenchmark)
microbenchmark(func_compiled(), func(), times=10) # grab a coffee, this takes some time
The results show that compiling won't give you any advantage:
Unit: seconds
            expr      min       lq   median       uq      max neval
 func_compiled() 9.938632 10.12269 10.18237 10.48215 15.43299    10
          func() 9.969320 10.07676 10.21916 15.44664 15.66109    10
(this could have been expected, as the mvtnorm library is already compiled code)
Overall, you have only two ways left to optimize your code in R:
use smaller numbers (if acceptable)
parallelize your code
As Josillber says, vectorisation (the apply family of functions) isn't going to do much for you; it is a bit of an R myth that it gives significant speed improvements.
I suggest you look at parallel options: the parallel package's mclapply, and the snow package. Read more here: http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
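As a rough sketch of that route applied to the loop in the question (mclapply forks, so this is for Unix-alikes; on Windows use parLapply with a cluster instead; m, C, B and S are the objects from the question's code):

library(parallel)
library(mvtnorm)

draws <- mclapply(seq_len(B), function(i) {
  t(rmvnorm(S, m[, i], C[, , i], method = "svd"))
}, mc.cores = detectCores())

# each list element is an nrow(m) x S matrix; stack them into the 3-d array
postpred <- array(unlist(draws), dim = c(nrow(m), S, B))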

Algorithm to generate stars-and-bars alike solutions from a known one?

In pursuing a simulated annealing heuristic to solve a problem, I am trying to find the best way to generate neighbours of my current proposed solution.
Solutions come in the form of a vector of integer positions (p1, ..., pN), which I read as a binary chain
0 0 0 0 1 0 ... 0 0 1 0 ... 0 1 0 0
        p1          pj        pN
with the restrictions that pj - p(j-1) > D for all j, p1 > D/2, and length - pN > D/2.
Now, my idea is to use something similar to the Levenshtein distance to create new solutions, so if I have [0,1,0,0,1,0] (D=3) and I want a new state within a distance less than or equal to 1, then I can get [1,0,0,0,1,0], for example, but not [1,0,0,1,0,0].
What I do (in R) is the following:
GenNewSeq <- function(seq, D, dist){
  for(i in 1:dist){
    Diffs <- c((ceiling(D/2)+seq[1]), diff(seq), (ceiling(D/2)+seq[length(seq)]))
    position <- sample((2:length(seq))[Diffs > D], size=1)
    move <- sample(c(-1*as.integer(Diffs[position-1]>D), 0, 1*as.integer(Diffs[position]>D)), size = 1)
    seq[position-1] <- seq[position-1]+move
  }
  seq
}
Maybe it is a bit obscure; if you want, I can explain better what it does. The problem is that this is 1) slow (I don't know how to avoid the for loop), and 2) weirdly not working as intended: it tends to move only the last positions, and/or gets stuck moving the same element forward and backward all the time, so I get biased results in my simulated annealing.
I have thought of removing the distance restriction and putting it in the fitness function instead (something like exp(D - (pj - p(j-1)))), so I can simply perturb the positions with normal noise, or move them all together and let them oscillate... and I am starting to think that would be the easiest way. However, I would very much appreciate a reference on how to write an efficient and reliable algorithm that does what I ask for; I don't mind if I have to do it in C. I have checked this but wasn't able to resolve my doubts.
Thank you very much for your help.
The bug in your program is this. When you select position at random, you are choosing a segment at random from the set of segments of length at least D. The element you are going to end up moving is the right-hand endpoint of this segment.
And, although it seems as though you are choosing the direction of the move at random, in fact the move is more likely to be in the downward direction than upward. This is because Diffs[position-1] is guaranteed to be greater than D (due to the way position was selected), but Diffs[position] is not. Which means that in some cases move is going to be chosen at random from c(-1,0,1) and in other cases it is going to be chosen at random from c(-1,0,0). So, over time, downwards moves will occur more than upwards moves.
Your algorithm can be fixed by selecting at random among all points for which either adjacent segment has length at least D, so that there won't be any bias in the move direction:
GenNewSeq2 <- function(seq, D, dist){
  for(i in 1:dist){
    Diffs <- c((ceiling(D/2)+seq[1]), diff(seq))
    bigGaps <- Diffs > D
    moveable <- bigGaps[-1] | head(bigGaps, -1)
    position <- sample(which(moveable), 1)
    move <- sample(c(-1*(Diffs[position]>D), 1*(Diffs[position+1]>D)), size = 1)
    seq[position] <- seq[position]+move
  }
  seq
}
It is also possible to generate a random new sequence without a for loop. Here is one possible implementation:
newseq <- function(seq, D, dist){
  diffs <- c((ceiling(D/2)+seq[1]), diff(seq))
  bigGaps <- diffs > D
  selected <- sample(which(bigGaps), min(length(bigGaps), dist))
  directions <- sample(c(-1, 1), length(selected), TRUE)
  down <- directions < 0
  up <- directions > 0
  selected[up] <- selected[up] - 1
  move <- rep(0, length(seq))
  move[selected[up]] <- 1
  move[selected[down]] <- move[selected[down]] - 1
  move[length(seq)] <- 0  ## the last element of seq stays fixed always
  seq + move
}
This implementation is more efficient, and it doesn't slow down nearly as much when dist grows.
> set.seed(123)
> seq<-sort(sample(1000,20))
> microbenchmark(newseq(seq,20,3),GenNewSeq2(seq,20,3))
Unit: microseconds
expr min lq median uq max neval
newseq(seq, 20, 3) 53.503 55.0965 56.026 56.761 68.804 100
GenNewSeq2(seq, 20, 3) 183.091 188.0490 189.492 191.249 367.094 100
> microbenchmark(newseq(seq,20,6),GenNewSeq2(seq,20,6))
Unit: microseconds
expr min lq median uq max neval
newseq(seq, 20, 6) 54.027 56.4960 57.3865 58.2955 70.258 100
GenNewSeq2(seq, 20, 6) 368.306 373.7745 377.5225 381.4565 559.037 100
>
We can also verify that GenNewSeq2 and newseq don't drift towards zero by running the following code for each of the three functions, and then plotting the mean of seq over time:
set.seed(12345)
seq<-sort(sample(1000,20))
x<-rep(0,20000)
for(i in 1:20000){
  x[i] <- mean(seq)
  seq <- GenNewSeq(seq, 20, 3)
}
plot(x,type='l')
