Computing the un-weighted length of a weighted shortest path

I have started investigating whether igraph would be a more efficient way to calculate the length of a least cost path. Using the gdistance package it is straightforward to supply a cost surface and generate least cost paths between two (or many) points. The function costDistance returns the actual length of each path as the sum of its segment lengths (i.e. not the cumulative COST of the least cost path).
My question is whether there is a way to do this in igraph so that I can compare computation times. Using get.shortest.paths, I can obtain the length of the shortest path between vertices, but, when edge weights are provided, the path length is reported as the weighted path length.
In short: I would like to find shortest paths on a weighted network but have the lengths reported in terms of edge length, not weighted edge length.
Note: I can see how this is possible by looping through each shortest path and writing some extra code to add up the unweighted edge lengths, but I fear this would cancel out my original need for efficient pairwise distance calculations over massive networks.

In get.shortest.paths, there is a weights argument! If you read ?get.shortest.paths you will find that weights is
Possibly a numeric vector giving edge weights. If this is NULL and the graph has a weight edge attribute, then the attribute is used. If this is NA then no weights are used (even if the graph has a weight attribute).
So you should set weights = NA. See below for an example:
require(igraph)
# make a reproducible example
el <- matrix(c(1, 2, .5,
               1, 3, 2,
               2, 3, .5), ncol = 3, byrow = TRUE)
g2 <- add.edges(graph.empty(3), t(el[, 1:2]), weight = el[, 3])
# weighted shortest path between vertices 1 and 3
get.shortest.paths(g2, 1, 3)
# unweighted shortest path between vertices 1 and 3
get.shortest.paths(g2, 1, 3, weights = NA)
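If you want the numeric path lengths rather than the vertex sequences, shortest.paths() (called distances() in newer igraph versions) accepts the same weights argument; a quick sketch using the graph above:
# weighted distance between vertices 1 and 3
shortest.paths(g2, 1, 3)
# unweighted distance (number of edges) between vertices 1 and 3
shortest.paths(g2, 1, 3, weights = NA)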

I'm not sure whether I completely understand what "edge length" and "weighted edge length" mean in your post (I guess that "edge length" is simply "the number of edges along the path" and "weighted edge length" is "the total weight of the edges along the path"), but if I'm right, your problem boils down to "finding shortest paths where edges are weighted by one criterion, then returning for each path the sum of some other property of the edges involved".
If this is the case, you can pass the output="epath" parameter to get.shortest.paths; in this case, igraph will report the IDs of the edges along the weighted shortest path between two nodes. You can then use these IDs as indices into a vector containing the values of that other property that you wish to use when the lengths are calculated. E.g.:
> g <- grg.game(100, 0.2)
> E(g)$weight <- runif(ecount(g), min=1, max=20)
> E(g)$length <- runif(ecount(g), min=1, max=20)
> path <- get.shortest.paths(g, from=1, to=100, output="epath")$epath[[1]]
> sum(E(g)$length[path])
This will give you the sum of the length attributes of the edges involved in the shortest path between nodes 1 and 100, while the shortest paths are calculated using the weight attribute (which is the default for get.shortest.paths, but you can also override it with the weights=... argument).
If you are simply interested in the number of edges on the path, you can either use a constant 1 for the lengths, or simply call length(path) in the last line.
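For example, continuing the session above (the hops attribute name is just for illustration):
> E(g)$hops <- 1           # a constant "length" of 1 for every edge
> sum(E(g)$hops[path])     # equals length(path), i.e. the number of edges on the path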

Related

Is eigenvector centrality in igraph wrong?

I am trying to improve my understanding of eigenvector centrality. This overview from the University of Washington was very helpful, especially when read in conjunction with this R code. However, when I use evcent(graph_from_adjacency_matrix(A)), the result differs.
The code below reproduces the calculation:
library(matrixcalc)
library(igraph)
# specify the adjacency matrix
A <- matrix(c(0,1,0,0,0,0,
1,0,1,0,0,0,
0,1,0,1,1,1,
0,0,1,0,1,0,
0,0,1,1,0,1,
0,0,1,0,1,0 ),6,6, byrow= TRUE)
EV <- eigen(A) # compute eigenvalues and eigenvectors
max(EV$values) # find the maximum eigenvalue
centrality <- data.frame(EV$vectors[,1])
names(centrality) <- "Centrality"
print(centrality)
B <- A + diag(6) # Add self loops
EVB <- eigen(B) # compute eigenvalues and eigenvectors
# they are the same as EV(A)
c <- matrix(c(2,3,5,3,4,3)) # Degree of each node + self loop
ck <- function(k){
n <- (k-2)
B_K <- B # B is the original adjacency matrix, w/ self-loops
for (i in 1:n){
B_K <- B_K%*%B #
#print(B_K)
}
c_k <- B_K%*%c
return(c_k)
}
# derive EV centrality as k -> infinity
# k = 100
ck(100)/frobenius.norm(ck(100)) # .09195198, .2487806, .58115487, .40478177, .51401731, .40478177
# Does igraph match?
evcent(graph_from_adjacency_matrix(A))$vector # No: 0.1582229 0.4280856 1.0000000 0.6965127 0.8844756 0.6965127
The rank order is the same, but it is still bothersome that the values are not. What is going on?
The result returned by igraph is not wrong, but note that there are subtleties to defining eigenvector centrality, and not all implementations handle self-loops in the same way.
Please see what I wrote here.
One way to define eigenvector centrality is simply as "the leading eigenvector of the adjacency matrix". But this is imprecise without specifying what the adjacency matrix is, especially what its diagonal elements should be when there are self-loops present. Depending on application, diagonal entries of the adjacency matrix of an undirected graph are sometimes defined as the number of self-loops, and sometimes as twice the number of self-loops. igraph uses the second definition when computing eigenvector centrality. This is the source of the difference you see.
A more intuitive definition of eigenvector centrality is that the centrality of each vertex is proportional to the sum of its neighbours' centralities. Thus the details of the computation hinge on who the neighbours are. Consider a single vertex with a self-loop. It is its own neighbour, but how many times? We can traverse the self-loop in both directions, so it is reasonable to say that it is its own neighbour twice. Indeed, its degree is conventionally taken to be 2, not 1.
You will find that different software packages treat self-loops differently when computing eigenvector centrality. In igraph, we made a choice by looking at the intuitive interpretation of eigenvector centrality rather than rigidly following a formal definition with no regard for the motivation behind that definition.
Note: What I wrote above refers to how the eigenvector centrality computation works internally, not to what as_adjacency_matrix() returns. as_adjacency_matrix() adds one (not two) to the diagonal for each self-loop.
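To see the effect concretely, here is a minimal sketch (the three-vertex toy graph and the lead_ev helper are mine, not from the question) comparing the leading eigenvector of the adjacency matrix with the loop counted once versus twice against igraph's result:
library(igraph)
# toy graph: a self-loop on vertex 1, plus edges 1-2 and 2-3
g <- make_graph(c(1,1, 1,2, 2,3), directed = FALSE)
# adjacency matrix with the loop counted once vs. twice on the diagonal
A_once  <- matrix(c(1,1,0, 1,0,1, 0,1,0), 3, 3, byrow = TRUE)
A_twice <- A_once; diag(A_twice) <- 2 * diag(A_once)
# leading eigenvector, rescaled so that the largest entry is 1 (igraph's scaling)
lead_ev <- function(M) { v <- abs(eigen(M)$vectors[, 1]); v / max(v) }
lead_ev(A_once)             # loop counted once
lead_ev(A_twice)            # loop counted twice
eigen_centrality(g)$vector  # per the explanation above, this should match the "twice" version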

R Igraph Error: "Weight vector must be positive, Invalid value"

I've built several graphs in iGraph. In each graph, nodes represent words, and edge weights represent the number of times Word A was given as a response (in a word association task) to Word B. In each graph, I've normalised the edge weights so that they vary between 0 and 1 using the following code:
E(G)$weight <- E(G)$weight / max(E(G)$weight)
These values are appropriate when analysing node/network strength, but when calculating functions pertaining to betweenness (e.g. calling the betweenness function, or using betweenness-based community detection), they need to be changed into distances, i.e. inverted:
G2 = G
E(G2)$weight = 1 - E(G2)$weight
The problem is that this results in weight vectors which contain several 0's (i.e. for those edges which had a strength of 1 before being inverted). This results (at least, I think this is the cause) in error messages such as:
Error in cluster_edge_betweenness(G2.JHJ.strong, weights = E(G2.JHJ.strong)$weight, :
At community.c:455 : weights must be strictly positive, Invalid value
What can be done about this?
Thanks,
Peter
If you want to play it safe, you can try sum instead of max to normalize the weights, e.g.,
E(G)$weight <- E(G)$weight / sum(E(G)$weight)
or
E(G)$weight <- 2**((E(G)$weight - min(E(G)$weight)) / diff(range(E(G)$weight)))
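For instance, a minimal sketch (with a made-up random toy graph and a count edge attribute standing in for the raw response counts) showing why the max-normalisation breaks and the sum-normalisation does not:
library(igraph)
set.seed(42)
G <- sample_gnp(20, 0.2)
E(G)$count <- sample(1:10, ecount(G), replace = TRUE)   # raw response counts
# max-normalisation: the largest weight becomes exactly 1, so 1 - w yields zeros
E(G)$weight <- E(G)$count / max(E(G)$count)
any((1 - E(G)$weight) == 0)                             # TRUE: these zero weights trigger the error
# sum-normalisation: every weight is strictly below 1, so 1 - w stays positive
E(G)$weight <- E(G)$count / sum(E(G)$count)
G2 <- G
E(G2)$weight <- 1 - E(G2)$weight
cluster_edge_betweenness(G2)                            # runs without the "strictly positive" error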

How do we define betweenness centrality when weights have a positive meaning?

I've read that betweenness centrality is defined as the number of times a vertex lies on the shortest path of the other pairs of nodes.
However, in case weights have a positive meaning (i.e. the more the weight of an edge the merrier), then how does one define betweenness centrality?
In this case, is there another way to calculate betweenness centrality? Or is it simply interpreted in a different way?
Computing the betweenness centrality of a vertex v relies on the following fraction, for any u and w: s(u,w,v) / s(u,w) where s(u,w,v) is the number of shortest paths between u and w that involve v, and s(u,w) is the total number of shortest paths between u and w.
With positive edge weights, I would suggest that you count each shortest path with its own weight: replace s(u,w,v) by the sum of weights of shortest paths between u and w that involve v; and s(u,w) by the sum of weights of all shortest paths between u and w.
Then, you have to define the weight of a path, and this depends on what you have in mind. You may for instance consider the sum of its edge weights, their product, their minimum or maximum value, etc.
Warning: this definition still relies on shortest unweighted paths; if longer paths with higher weights exist, they will be ignored, which means that graph structure prevails. This may not be satisfactory.
Note: if edges have integer weights and a path's weight is defined as the product of its edge weights, this approach is roughly equivalent to using the classical definition on a multi-graph (an unweighted graph where several edges may exist between the same two vertices).
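For small graphs, this definition can be prototyped directly in R with igraph. The sketch below is only an illustration under my own assumptions (an undirected toy graph, path weight taken as the product of its edge weights, and a made-up function name weighted_path_betweenness); it enumerates all unweighted shortest paths for every pair, so it will not scale to large networks:
library(igraph)
# small weighted toy graph
g <- graph_from_literal(A-B, A-C, B-C, B-D, C-D)
E(g)$weight <- c(3, 1, 2, 4, 1)
weighted_path_betweenness <- function(g) {
  n  <- vcount(g)
  bc <- setNames(numeric(n), V(g)$name)
  for (u in 1:(n - 1)) {
    for (w in (u + 1):n) {
      # all *unweighted* shortest paths between u and w
      paths <- lapply(all_shortest_paths(g, from = u, to = w, weights = NA)$res,
                      as.integer)
      if (length(paths) == 0) next
      pw   <- sapply(paths, function(p) prod(E(g, path = p)$weight))  # weight of each path
      s_uw <- sum(pw)                                   # total weight of all shortest u-w paths
      for (i in seq_along(paths)) {
        inner <- paths[[i]][-c(1, length(paths[[i]]))]  # intermediate vertices v
        bc[inner] <- bc[inner] + pw[i] / s_uw           # contribution s(u,w,v) / s(u,w)
      }
    }
  }
  bc
}
weighted_path_betweenness(g)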

How are weights treated in the cluster_walktrap function in igraph?

I am working with character networks of plays. Nodes represent characters, edges represent speeches they address to one another. It is a directed network, and the edge weights are equal to the number of words the source character says to the target.
In iGraph, edge weight sometimes means distance, and sometimes means closeness. To get the correct results for betweenness, for instance, I need to invert the edge weights so the more words a character says to another, the 'closer' they are in the network:
edgeData <- data.frame(source, target, weight = numWords)
graph <- graph_from_data_frame(edgeData)
betweenness(graph, weights = 1/E(graph)$weight)
Now I want to study the community structure of my plays, and I don't know how to use the algorithms correctly. Should I treat edge weights as distances, and invert the weights so characters who talk more are 'closer' to one another?
cluster_walktrap(graph, weights = 1/E(graph)$weight)
Or should I treat the weights as, well, weights, and use the algorithm in its default state?
cluster_walktrap(graph)
Thanks for the help!
cluster_walktrap(graph) is fine. For community detection, a weight is a weight (tie strength), not a distance.
I was also confused by this when calculating some graph indices.
But if you want to calculate shortest paths (or measures built on them, such as betweenness), you should use
weights = 1/E(graph)$weight
You can design a small demo to check this; see the sketch below.
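A minimal sketch of such a demo (the toy edge list is made up, not from the question), contrasting the two uses of the weights:
library(igraph)
edgeData <- data.frame(
  source = c("A", "A", "B", "C", "C", "D"),
  target = c("B", "C", "C", "D", "E", "E"),
  weight = c(50, 10, 40, 5, 30, 60)              # number of words spoken
)
graph <- graph_from_data_frame(edgeData)
cluster_walktrap(graph)                          # weights used as-is: stronger tie = more likely same community
betweenness(graph, weights = 1/E(graph)$weight)  # weights inverted: stronger tie = shorter distance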

Mathematical representation of a set of points in N dimensional space?

Given x data points in an N-dimensional space, I am trying to find a fixed-length representation that could describe any subset s of those x points. For example, the mean of the subset s could describe it, but it is not unique to that subset: other points in the space could yield the same mean, so the mean is not a unique identifier. Could anyone tell me of a unique measure that could describe the points without depending on the number of points?
In short - it is impossible (as you would achieve infinite noiseless compression). You either have to use a variable-length representation (or a fixed length proportional to the maximum number of points), or accept "collisions" (since your mapping will not be injective). In the first scenario you can simply store the coordinates of each point. In the second you approximate your point cloud with more and more complex descriptors to balance collisions and memory usage; some possibilities are:
storing the mean and covariance (so basically performing maximum likelihood estimation over Gaussian families)
performing some fixed-complexity density estimation, such as a Gaussian Mixture Model, or training a generative neural network
using a set of simple geometric/algebraic properties such as:
the number of points
the mean, max, min, or median distance between pairs of points
etc.
Any subset can be identified by a bit mask of length x, where bit i is 1 if the corresponding element belongs to the subset. There is no fixed-length representation whose length is not a function of x.
EDIT
I was wrong. PCA is a good way to perform dimensionality reduction for this problem, but it won't work for some sets.
However, you can almost do it, where "almost" is formally defined by the Johnson-Lindenstrauss Lemma. It states that for a given large dimension N, there exists a much lower dimension n and a linear transformation that maps each point from N to n dimensions, while keeping the Euclidean distance between every pair of points of the set within some error ε of the original. Such a linear transformation is called a JL Transform.
In other words, your problem is only solvable for sets of points where each pair of points is separated by at least ε. For this case, the JL Transform gives you one possible solution. Moreover, there is a relationship between N, n and ε (see the lemma) such that, for example, if N=100, the JL Transform can map each point to a point in 5D (n=5) and uniquely identify each subset, if and only if the minimum distance between any pair of points in the original set is at least ~2.8 (i.e. the points are sufficiently different).
Note that n depends only on N and the minimum distance between any pair of points in the original set. It does not depend on the number of points x, so it is a solution to your problem, albeit with some constraints.
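As a rough illustration of the distance-preservation part of the lemma (not the uniqueness argument), here is a small R sketch with made-up sizes; the Gaussian projection matrix scaled by 1/sqrt(n) is one common construction:
set.seed(1)
N <- 1000; n <- 50; x <- 30                       # made-up dimensions for illustration
P <- matrix(rnorm(x * N), nrow = x)               # x points in N-dimensional space
R <- matrix(rnorm(N * n), nrow = N) / sqrt(n)     # random linear map from N to n dimensions
Q <- P %*% R                                      # the projected points
summary(as.vector(dist(Q)) / as.vector(dist(P)))  # pairwise distance ratios stay close to 1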
