R igraph: shortest path extraction

This is the first time I am working with graphs and the R igraph package, and I need some help with processing graph objects.
What I want to achieve:
From a given contact matrix, extract the shortest confident path between nodes. By confident I mean that the edge weights are higher than those of neighbouring edges.
Examples:
A
library(igraph)
m <- read.table(row.names = 1, header = TRUE, text =
" A B C D E F
A 0 1 1 1 1 5
B 1 0 1 1e2 1e2 1
C 1 1 0 1 1 1
D 1 1e2 1 0 1e2 1
E 1 1e2 1 1e2 0 1
F 5 1 1 1 1 0")
m <- as.matrix(m)
ig <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)
sp <- shortest.paths(ig, algorithm = "dijkstra")
In matrix m there is one cluster (clique?) between B-D-E (i.e., the edge weights between those nodes are high). However, as there is a weight between A and F, I am also getting a cluster there, even though the edge weight is low (only 5).
Question A: How to extract only those clusters that have high edge weights? I can set those contacts to 0 with m[which(m <= 5)] <- 0, but I hope there is a more "mathy" solution for this in the igraph package.
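(As a minimal sketch of an igraph-side equivalent of that matrix transform, using the threshold of 5 from above: delete the weak edges and inspect the connected components.)
# drop weak edges instead of zeroing matrix entries, then inspect components
ig_strong <- delete_edges(ig, E(ig)[weight <= 5])
components(ig_strong)$membership
B, D and E should then form their own component.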
B
m <- read.table(row.names = 1, header = TRUE, text =
" A B C D E F
A 0 1 1 5 1 1
B 1 0 1 1e2 1e2 1
C 1 1 0 1 1 1
D 5 1e2 1 0 1e2 1
E 1 1e2 1 1e2 0 1
F 1 1 1 1 1 0")
m <- as.matrix(m)
ig <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)
sp <- shortest.paths(ig, algorithm = "dijkstra")
In matrix m there is a cluster between B-D-E, but as there is a low weight between A and D, A is also connected to this cluster.
Question B: How do I avoid assigning a node to a cluster if its edge weight is low?
This is my first question here; if you need clarification or better examples I will improve my questions.

First, it is good to know that when looking up paths, igraph understands weights as costs, i.e. on edges with higher weight it costs more to travel, so it considers paths with lower total weight as shorter. It is easy to turn this into the opposite: just take the reciprocal of your weights (1 / E(ig)$weight). Between 2 vertices there might be only one shortest path, but sometimes there are several equally short paths. You can look up all of them (all_shortest_paths), or tell igraph to return only one of the shortest for each pair of vertices (shortest_paths). Each call of these methods returns the paths from one selected vertex; to have the paths between all pairs, you need to call them once for each vertex (well, for an undirected graph, it is enough to call them for half of the vertices). To formulate what I explained up to this point:
spaths <- lapply(V(ig), function(v) {
  all_shortest_paths(ig, v, weights = 1 / E(ig)$weight)
})
Here spaths will be a list of lists; access the paths from C to all the vertices like this:
spaths$C$res
[[1]]
+ 2/6 vertices, named:
[1] C A
[[2]]
+ 2/6 vertices, named:
[1] C B
[[3]]
+ 1/6 vertex, named:
[1] C
[[4]]
+ 2/6 vertices, named:
[1] C D
[[5]]
+ 2/6 vertices, named:
[1] C E
[[6]]
+ 2/6 vertices, named:
[1] C F
spaths$C$res[[2]] # this is the path from `C` to `B`,
# a vector of 2 vertices
Note, the third element is actually the path from C to itself; you can either ignore it, or provide a vector of all other vertices to the to parameter of all_shortest_paths. Also, in your example all shortest paths will be of length 1, but if I set, for example, the weight of B--E to 1 instead of 100, we see that the method works, and from B to E the shortest path will be B-D-E.
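(A quick sketch of that check, as an addition; set_edge_attr and the %--% edge selector are current igraph API:)
# lower the B--E weight to 1, then the reciprocal-weight shortest path avoids it
ig2 <- set_edge_attr(ig, "weight", E(ig)["B" %--% "E"], 1)
all_shortest_paths(ig2, "B", "E", weights = 1 / E(ig2)$weight)$res
This should return the single path B D E.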
Regarding your second question, it is not completely clear what you want to achieve, especially how you get these clusters. If you want to find communities, i.e. groups of vertices that are more densely connected with each other, taking the edge weights into account, there are many methods for this, all those named cluster_[...] or community.[...] in igraph. For example, if we run the fastgreedy method on your graph, it will detect the cluster you mentioned:
fg <- fastgreedy.community(ig, weights = E(ig)$weight)
fg
IGRAPH clustering fast greedy, groups: 2, mod: 0.059
+ groups:
$`1`
[1] "A" "C" "F"
$`2`
[1] "B" "D" "E"
So here we have the B, D, E cluster, which is connected by higher-weight edges. If we run the same method without weights, all the vertices will belong to one group (fastgreedy.community(ig, weights = NULL)). Note, in community detection igraph understands weights as strength, so vertices connected by higher-weight edges are more likely to cluster together; this is the opposite of how weights are treated when calculating paths.
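(A one-line sketch of that unweighted run, following the note above:)
fg_nw <- fastgreedy.community(ig, weights = NULL)
membership(fg_nw)  # all six vertices should fall into the same group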

Related

igraph: summarize each node's neighbours' characteristics

With an igraph object I would like to capture some features of each node's neighbours, for example the average degree of its neighbours.
I came up with this code, which is inelegant and quite slow.
How should I rethink it for large and complex networks?
library(igraph)
# Toy example
set.seed(123)
g <- erdos.renyi.game(10, 0.2)
# Loop to calculate average degree of each node's neighbourhood
s <- character(0)
for (i in 1:gorder(g)) {
  n <- ego_size(g, nodes = i, order = 1, mindist = 1)
  node_of_interest <- unique(unlist(ego(g, nodes = i, order = 1, mindist = 1)))
  m <- mean(degree(g, v = node_of_interest, loops = TRUE, normalized = FALSE) - 1)
  s <- rbind(s, data.frame(node = i, neighbours = n, mean = m))
}
Expanding the data structure with rbind in a loop can get quite slow in R, because at every step it needs to allocate the space for the new object and then copy it (see section 24.6 here). Also, you might be computing the degree of a node many times, if it is the neighbour of multiple nodes.
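(As a minimal illustration of the preallocation point, a sketch not part of the original answer: collect the rows in a pre-sized list and bind once at the end, so nothing is copied repeatedly.)
rows <- vector("list", gorder(g))
for (i in seq_len(gorder(g))) {
  nbrs <- unique(unlist(ego(g, nodes = i, order = 1, mindist = 1)))
  rows[[i]] <- data.frame(node = i, neighbours = length(nbrs),
                          mean = mean(degree(g, v = nbrs) - 1))
}
s <- do.call(rbind, rows)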
A possibly better alternative could be:
# add vertex id (not really necessary)
V(g)$name <- V(g)
# add degree to the graph
V(g)$degree <- degree(g, loops = TRUE, normalized = FALSE)
# get a list of neighbours, for each node
g_ngh <- neighborhood(g, mindist = 1)
# write a function that gets the means
get.mean <- function(x) {
  mean(V(g)$degree[x] - 1)
}
# apply the function, add result to the graph
V(g)$av_degr_nei <- sapply(g_ngh, get.mean)
# get data into dataframe, if necessary
d_vert_attr <- as_data_frame(g, what = "vertices")
d_vert_attr
name degree av_degr_nei
1 1 0 NaN
2 2 1 2.0000000
3 3 2 1.0000000
4 4 1 1.0000000
5 5 2 1.0000000
6 6 1 1.0000000
7 7 3 0.6666667
8 8 1 0.0000000
9 9 1 0.0000000
10 10 0 NaN
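(For what it's worth, igraph also ships knn(), which computes the average degree of each vertex's neighbours directly; note it does not apply the -1 adjustment used above. A one-line sketch:)
knn(g)$knn  # NaN for isolated vertices, as above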

Shortest path of length l in R

I want to find shortest path of length l or smaller with least cost in a weighted graph consisting of vertices and edges.
shortest_paths(g,from,to,output="both",weights=wts)
in R (from the igraph package) gives the least-cost shortest path between the from and to vertices, with no constraint on the length l.
For example, in this graph the shortest path between 2 and 7 is 2 1 3 7, of length 3, but I want the shortest path of length 2, i.e. 2 1 7, of minimum cost.
Can someone guide me on how to proceed?
In your example, there is only one path of length two from 2 to 7. That makes it difficult to test if we are really getting the minimum cost path. So, I have added a link to create an extra path of length 2.
## Extended example
library(igraph)
to = c(1,1,1,1,1,1,2,2,3,3,6)
from = c(2,3,4,5,6,7,4,6,5,7,7)
weight = c(19,39,40,38,67,68,14,98,38,12,10)
EDF = data.frame(to, from, weight)
G = graph_from_data_frame(EDF, directed = FALSE)
LO = layout_as_star(G, center=1, order = c(1,4,2,5,6,3,7))
plot(G, layout=LO, edge.label=E(G)$weight)
The idea is to start with all paths from 2 to 7 and select only those that meet the constraint: length of path <= 2 (note that this means the number of vertices is <= 3). For those paths, we compute the weight and select the one with the minimum cost.
maxlength = 2 ## maximum number of links
ASP = all_simple_paths(G, "2", "7")
ShortPaths = which(sapply(ASP, length) <= maxlength+1)
ASP[ShortPaths]
[[1]]
+ 3/7 vertices, named, from af35df8:
[1] 2 1 7
[[2]]
+ 3/7 vertices, named, from af35df8:
[1] 2 6 7
As you can see there are two paths with length two. We need to find the one with minimum cost. To make this easy, we create a function to compute the weight of a path.
PathWeight = function(VP) {
  ## turn the vertex path into a sequence of vertex pairs, one per edge
  EP = rep(VP, each = 2)[-1]
  EP = EP[-length(EP)]
  ## look up the edge ids and sum their weights
  sum(E(G)$weight[get.edge.ids(G, EP)])
}
Now it is easy to get all of the path weights.
sapply(ASP[ShortPaths], PathWeight)
[1] 87 108
And to choose the smallest one
SP = which.min(sapply(ASP[ShortPaths], PathWeight))
ASP[ShortPaths[SP]]
[[1]]
+ 3/7 vertices, named, from af35df8:
[1] 2 1 7
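(If you need this repeatedly, the steps above can be wrapped into a small helper; a sketch under the same assumptions, with the weight computation inlined:)
shortest_within <- function(g, from, to, maxlen) {
  asp <- all_simple_paths(g, from, to)
  asp <- asp[sapply(asp, length) <= maxlen + 1]  # enforce the length cap
  w <- sapply(asp, function(VP) {
    EP <- rep(VP, each = 2)[-1]
    EP <- EP[-length(EP)]
    sum(E(g)$weight[get.edge.ids(g, EP)])
  })
  asp[[which.min(w)]]
}
shortest_within(G, "2", "7", 2)  # should return 2 1 7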

Turning a weighted edgelist into an unweighted one in R

I have data on every interaction that could and did happen at a university club's weekly social hour:
id1 id2 timestalked date
1 2 1 1/1/2010
1 3 0 1/1/2010
...
100 2 4 1/8/2010
...
I want to first load this in as a directed graph for the entire time period, for visualization. For the weighted matrix I did:
library(igraph);
el <- read.csv("el.csv", header = TRUE);
G <- graph.data.frame(el,directed=TRUE);
A <- as_adjacency_matrix(G,type="both",names=TRUE,sparse=FALSE,attr="timestalked");
I thought removing attr = "timestalked" would turn the weights > 0 into 1, but that does not seem to work:
library(igraph);
el <- read.csv("el.csv", header = TRUE);
G_unweight <- graph.data.frame(el,directed=TRUE);
A_unweight <- as_adjacency_matrix(G_unweight,type="both",names=TRUE,sparse=FALSE)
as_adjacency_matrix() doesn't provide any argument to control weights; without attr, it just gives the number of edges between each pair of nodes in the graph.
To turn the weighted edgelist into an unweighted one, try this
A <- as_adjacency_matrix(G, type = "both", names = TRUE, sparse = FALSE)
A[A > 1] <- 1
Note that you can also use the graph_from_adjacency_matrix() function to create an unweighted igraph graph from the adjacency matrix by specifying weighted = NULL.
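(A short sketch of that; mode = "directed" mirrors the directed graph above:)
# rebuild an unweighted directed graph from the 0/1 matrix A
G_unw <- graph_from_adjacency_matrix(A, mode = "directed", weighted = NULL)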

Summing vector elements

I have a named numeric vector of probabilities, like this:
[image: the named probability vector]
As you can see, the sum of this vector's elements is 1. I have to generate a random number between 0 and 1 and get the element of the vector whose region this random number falls into, for example:
The random number generated: 0.01
I will get the water element, because water is between 0.09 and 0.11. I attach a graphic example:
[image: graphic example]
I don't know how to get the element corresponding to this probability.
I am not going to type in all of your data, so I will use a comparable, small example:
set.seed(2017)
probabilidad = runif(5)
probabilidad= probabilidad/sum(probabilidad)
names(probabilidad) = LETTERS[1:5]
probabilidad
A B C D E
0.30918062 0.17969799 0.15695684 0.09655216 0.25761238
sum(probabilidad)
[1] 1
We can use cumsum to set up a vector to make the choices you want. But cumsum will give the upper bounds for the regions and we want the lower bounds, so we adjust the output a little.
Test = c(0, cumsum(probabilidad)[-length(probabilidad)])
names(Test) = names(probabilidad)
Test
A B C D E
0.0000000 0.2533938 0.5129561 0.5922145 0.8222417
Now you can easily test random numbers against the distribution.
(Selector = runif(1))
[1] 0.5190959
names(probabilidad)[max(which(Selector > Test))]
[1] "C"

R lpsolve how to define constraints travelling salesman

I want to code the travelling salesman problem in R. I am going to begin with 3 cities at first, then I will expand to more cities. The distance matrix below gives the distances between the 3 cities. The objective (if someone doesn't know) is that a salesman will start from a city and visit 2 other cities such that he travels the minimum distance.
In the case below he should start either from ny or LA, then travel to chicago, and then to the remaining city. I need help to define A_ (my constraint matrix).
My decision variables will be of the same dimension as the distances matrix. It will be a 0/1 matrix where 1 represents travel from the city equal to the row name to the city equal to the column name. For instance, if the salesman travels from ny to chicago, the 2nd element in row 1 will be 1. My column and row names are ny, chicago and LA.
By looking at the solution of the problem I concluded that my constraints will be:
Row sums have to be less than 1 as he cannot leave from same city twice
Column sums have to be less than 1 as he cannot enter the same city twice
Total sum of matrix elements has to be 2, as the salesman will be visiting 2 cities and leaving from 2 cities.
I need help to define A_ (my constraint matrix). How should I tie in my decision variables into constraints?
ny = c(999, 9, 20)
chicago = c(9, 999, 11)
LA = c(20, 11, 999)
distances = cbind(ny, chicago, LA)
dv = matrix(c("a11","a12","a13","a21","a22","a23","a31","a32","a33"), nrow = 3, ncol = 3)
c_ = c(distances[1,], distances[2,], distances[3,])
signs = rep('<=', 7)
b = c(1, 1, 1, 1, 1, 1, 2)
res = lpSolve::lp('min', c_, A_, signs, b, all.bin = TRUE)
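(For what it's worth, here is a sketch of the A_ the question describes, just to make the formulation concrete; the answer below explains why these constraints alone are not enough. The decision-vector order matches c_, i.e. a11, a12, a13, a21, ..., a33, and the last sign should be '==' or the minimizer simply selects no legs at all:)
A_ <- rbind(
  kronecker(diag(3), t(rep(1, 3))),  # row sums: cannot leave a city twice
  kronecker(t(rep(1, 3)), diag(3)),  # column sums: cannot enter a city twice
  rep(1, 9)                          # total number of legs travelled
)
signs <- c(rep('<=', 6), '==')
b <- c(rep(1, 6), 2)
res <- lpSolve::lp('min', c_, A_, signs, b, all.bin = TRUE)
matrix(res$solution, 3, 3, byrow = TRUE)
With these constraints the optimum is in fact ny -> chicago -> ny, a there-and-back trip rather than a path through all three cities, which is exactly the first problem the answer below points out.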
There are some problems with your solution. The first is that the constraints you have in mind don't guarantee that all the cities will be visited -- for example, the path could just go from NY to LA and then back. This could be solved fairly easily, for example, by requiring that each row and column sum to exactly one rather than at most 1 (although in that case you'd be finding a traveling salesman tour rather than just a path).
The bigger problem is that, even if we fix this problem, your constraints wouldn't guarantee that the selected vertices actually form one cycle through the graph, rather than multiple smaller cycles. And I don't think that your representation of the problem can be made to address this issue.
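(To make the multiple-small-cycles problem concrete, a sketch not in the original answer: lpSolve's lp.assign solves exactly the "each row and column sums to one" relaxation, and with four cities the optimum can split into two 2-cycles:)
library(lpSolve)
d <- matrix(10, 4, 4)
d[1, 2] <- d[2, 1] <- d[3, 4] <- d[4, 3] <- 1
diag(d) <- 1e6        # forbid staying in place
round(lp.assign(d)$solution)
# the optimal assignment is the two disjoint 2-cycles 1<->2 and 3<->4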
Here is an implementation of Travelling Salesman using LP. The solution space is of size n^3, where n is the number of rows in the distance matrix. This represents n consecutive copies of the nxn matrix, each of which represents the edge traversed at time t for 1<=t<=n. The constraints guarantee that
At most one edge is traversed each step
Every vertex is visited exactly once
The start point of the i'th edge traversed is the same as the endpoint of the (i-1)'st
This avoids the problem of multiple small cycles. For example, with four vertices, the sequence (12)(21)(34)(43) would not be a valid solution because the endpoint of the second edge (21) does not match the start point of the third (34).
library(lpSolve)

tspsolve <- function(x) {
  diag(x) <- 1e10
  ## define some basic constants
  nx <- nrow(x)
  lx <- length(x)
  objective <- matrix(x, lx, nx)
  rowNum <- rep(row(x), nx)
  colNum <- rep(col(x), nx)
  stepNum <- rep(1:nx, each = lx)
  ## these constraints ensure that at most one edge is traversed each step
  onePerStep.con <- do.call(cbind, lapply(1:nx, function(i) 1 * (stepNum == i)))
  onePerRow.rhs <- rep(1, nx)
  ## these constraints ensure that each vertex is visited exactly once
  onceEach.con <- do.call(cbind, lapply(1:nx, function(i) 1 * (rowNum == i)))
  onceEach.rhs <- rep(1, nx)
  ## these constraints ensure that the start point of the i'th edge
  ## is equal to the endpoint of the (i-1)'st edge
  edge.con <- c()
  for (s in 1:nx) {
    s1 <- (s %% nx) + 1
    stepMask <- (stepNum == s) * 1
    nextStepMask <- -(stepNum == s1)
    for (i in 1:nx) {
      edge.con <- cbind(edge.con, stepMask * (colNum == i) + nextStepMask * (rowNum == i))
    }
  }
  edge.rhs <- rep(0, ncol(edge.con))
  ## now bind all the constraints together, along with right-hand sides and signs
  constraints <- cbind(onePerStep.con, onceEach.con, edge.con)
  rhs <- c(onePerRow.rhs, onceEach.rhs, edge.rhs)
  signs <- rep("==", length(rhs))
  ## call the lp solver (constraints are stored as columns, hence no transposition)
  res <- lp("min", objective, constraints, signs, rhs,
            transpose.constraints = FALSE, all.bin = TRUE)
  ## print the output of lp
  print(res)
  ## return the results as a sequence of vertices, and the score = total cycle length
  list(cycle = colNum[res$solution == 1], score = res$objval)
}
Here is an example:
set.seed(123)
x <- matrix(runif(16), 4, 4)
x
## [,1] [,2] [,3] [,4]
## [1,] 0.2875775 0.9404673 0.5514350 0.6775706
## [2,] 0.7883051 0.0455565 0.4566147 0.5726334
## [3,] 0.4089769 0.5281055 0.9568333 0.1029247
## [4,] 0.8830174 0.8924190 0.4533342 0.8998250
tspsolve(x)
## Success: the objective function is 2.335084
## $cycle
## [1] 1 3 4 2
##
## $score
## [1] 2.335084
We can check the correctness of this answer by using a primitive brute force search:
tspscore <- function(x, solution) {
  sum(sapply(1:nrow(x), function(i) x[solution[i], solution[(i %% nrow(x)) + 1]]))
}
tspbrute <- function(x, trials) {
  score <- Inf
  cycle <- c()
  nx <- nrow(x)
  for (i in 1:trials) {
    temp <- sample(nx)
    tempscore <- tspscore(x, temp)
    if (tempscore < score) {
      score <- tempscore
      cycle <- temp
    }
  }
  list(cycle = cycle, score = score)
}
tspbrute(x,100)
## $cycle
## [1] 3 4 2 1
##
## $score
## [1] 2.335084
Note that, even though these answers are nominally different, they represent the same cycle.
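(A tiny sketch, not in the original answer, that normalizes both cycles to start at vertex 1 to confirm they match:)
rotate_to_1 <- function(cyc) {
  i <- which(cyc == 1)
  c(cyc[i:length(cyc)], cyc[seq_len(i - 1)])
}
rotate_to_1(c(1, 3, 4, 2))  # 1 3 4 2
rotate_to_1(c(3, 4, 2, 1))  # 1 3 4 2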
For larger graphs, though, the brute force approach doesn't work:
> set.seed(123)
> x<-matrix(runif(100),10,10)
> tspsolve(x)
Success: the objective function is 1.296656
$cycle
[1] 1 10 3 9 5 4 8 2 7 6
$score
[1] 1.296656
> tspbrute(x,1000)
$cycle
[1] 1 5 4 8 10 9 2 7 6 3
$score
[1] 2.104487
This implementation is pretty efficient for small matrices, but, as expected, it starts to deteriorate severely as they get larger. At about 15x15 it starts slowing down quite a bit:
timetsp <- function(x, seed = 123) {
  set.seed(seed)
  m <- matrix(runif(x * x), x, x)
  gc()
  system.time(tspsolve(m))[3]
}
sapply(6:16,timetsp)
## elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed elapsed
## 0.011 0.010 0.018 0.153 0.058 0.252 0.984 0.404 1.984 20.003
## elapsed
## 5.565
You can use the gaoptim package to solve permutation/real-valued problems - it's pure R, so it's not so fast:
# Euro tour problem (see ?optim)
eurodistmat = as.matrix(eurodist)
# Fitness function (we'll perform a maximization, so invert it)
distance = function(sq) {
  sq = c(sq, sq[1])
  sq2 <- embed(sq, 2)
  1 / sum(eurodistmat[cbind(sq2[, 2], sq2[, 1])])
}
loc = -cmdscale(eurodist, add = TRUE)$points
x = loc[, 1]
y = loc[, 2]
n = nrow(eurodistmat)
set.seed(1)
# solving code
require(gaoptim)
ga2 = GAPerm(distance, n, popSize = 100, mutRate = 0.3)
ga2$evolve(200)
best = ga2$bestIndividual()
# just transform and plot the results
best = c(best, best[1])
best.dist = 1/max(ga2$bestFit())
res = loc[best, ]
i = 1:n
plot(x, y, type = 'n', axes = FALSE, ylab = '', xlab = '')
title ('Euro tour: TSP with 21 cities')
mtext(paste('Best distance found:', best.dist))
arrows(res[i, 1], res[i, 2], res[i + 1, 1], res[i + 1, 2], col = 'red', angle = 10)
text(x, y, labels(eurodist), cex = 0.8, col = 'gray20')
