R igraph error: "Weight vector must be positive, Invalid value"

I've built several graphs in igraph. In each graph, nodes represent words, and edge weights represent the number of times Word A was given as a response (in a word association task) to Word B. In each graph, I've normalised the edge weights so that they vary between 0 and 1 using the following code:
E(G)$weight <- E(G)$weight / max(E(G)$weight)
These values are appropriate when analysing node/network strength, but for functions pertaining to betweenness (e.g. calling the betweenness function, or using betweenness-based community detection), they need to be converted into distances, i.e. inverted:
G2 = G
E(G2)$weight = 1 - E(G2)$weight
The problem is that this produces weight vectors containing several 0s (for the edges which had a strength of 1 before being inverted). This results (at least, I think this is the cause) in error messages such as:
Error in cluster_edge_betweenness(G2.JHJ.strong, weights = E(G2.JHJ.strong)$weight, :
At community.c:455 : weights must be strictly positive, Invalid value
What can be done about this?
Thanks,
Peter

If you want to play it safe, you can try sum instead of max to normalize the weights, e.g.,
E(G)$weight <- E(G)$weight / sum(E(G)$weight)
or
E(G)$weight <- 2^((E(G)$weight - min(E(G)$weight)) / diff(range(E(G)$weight)))
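As a quick illustrative check (my toy numbers, not from the original post): with sum-normalisation the maximum weight is strictly below 1 whenever the graph has more than one edge, so 1 - weight stays positive, and the exponential mapping lands in [1, 2], which is strictly positive by construction.
w <- c(1, 3, 5, 5)                          # toy response counts
w_sum <- w / sum(w)                         # max is 5/14, so 1 - w_sum > 0
range(1 - w_sum)
w_exp <- 2^((w - min(w)) / diff(range(w)))
range(w_exp)                                # [1, 2], strictly positive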

Related

Is eigenvector centrality in igraph wrong?

I am trying to improve my understanding of eigenvector centrality. This overview from the University of Washington was very helpful, especially when read in conjunction with this R code. However, when I use evcent(graph_from_adjacency_matrix(A)), the result differs.
The code below:
library(matrixcalc)
library(igraph)
# specify the adjacency matrix
A <- matrix(c(0,1,0,0,0,0,
              1,0,1,0,0,0,
              0,1,0,1,1,1,
              0,0,1,0,1,0,
              0,0,1,1,0,1,
              0,0,1,0,1,0), 6, 6, byrow = TRUE)
EV <- eigen(A) # compute eigenvalues and eigenvectors
max(EV$values) # find the maximum eigenvalue
centrality <- data.frame(EV$vectors[,1])
names(centrality) <- "Centrality"
print(centrality)
B <- A + diag(6) # Add self loops
EVB <- eigen(B) # compute eigenvalues and eigenvectors
# the eigenvectors are the same as those of A (adding I only shifts the eigenvalues by 1)
c <- matrix(c(2,3,5,3,4,3)) # Degree of each node + self loop
ck <- function(k){
  n <- k - 2
  B_K <- B  # B is the original adjacency matrix, w/ self-loops
  for (i in 1:n){
    B_K <- B_K %*% B
    # print(B_K)
  }
  c_k <- B_K %*% c
  return(c_k)
}
# derive EV centrality as k -> infinity
# k = 100
ck(100)/frobenius.norm(ck(100)) # .09195198, .2487806, .58115487, .40478177, .51401731, .40478177
# Does igraph match?
evcent(graph_from_adjacency_matrix(A))$vector # No: 0.1582229 0.4280856 1.0000000 0.6965127 0.8844756 0.6965127
The rank order is the same, but it is still bothersome that the values are not. What is going on?
The result returned by igraph is not wrong, but note that there are subtleties to defining eigenvector centrality, and not all implementations handle self-loops in the same way.
Please see what I wrote here.
One way to define eigenvector centrality is simply as "the leading eigenvector of the adjacency matrix". But this is imprecise without specifying what the adjacency matrix is, especially what its diagonal elements should be when there are self-loops present. Depending on application, diagonal entries of the adjacency matrix of an undirected graph are sometimes defined as the number of self-loops, and sometimes as twice the number of self-loops. igraph uses the second definition when computing eigenvector centrality. This is the source of the difference you see.
A more intuitive definition of eigenvector centrality is that the centrality of each vertex is proportional to the sum of its neighbours' centralities. Thus the details of the computation hinge on who the neighbours are. Consider a single vertex with a self-loop. It is its own neighbour, but how many times? We can traverse the self-loop in both directions, so it is reasonable to say that it is its own neighbour twice. Indeed, its degree is conventionally taken to be 2, not 1.
You will find that different software packages treat self-loops differently when computing eigenvector centrality. In igraph, we made a choice based on the intuitive interpretation of eigenvector centrality, rather than rigidly following a formal definition with no regard for the motivation behind it.
Note: What I wrote above refers to how the eigenvector centrality computation works internally, not to what as_adjacency_matrix() returns. as_adjacency_matrix() adds one (not two) to the diagonal for each self-loop.
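To make the convention concrete, here is a minimal sketch of my own (not part of the original answer), assuming the igraph behaviour described above: a two-vertex graph in which vertex 1 has a self-loop. Counting the loop twice on the diagonal reproduces igraph's result; counting it once does not.
library(igraph)
g <- graph(c(1,1, 1,2), directed = FALSE)  # vertex 1 has a self-loop
eigen_centrality(g)$vector                 # igraph's answer, scaled to max = 1
M2 <- matrix(c(2, 1, 1, 0), 2, 2)          # loop counted twice on the diagonal
v2 <- abs(eigen(M2)$vectors[, 1])
v2 / max(v2)                               # matches igraph
M1 <- matrix(c(1, 1, 1, 0), 2, 2)          # loop counted once
v1 <- abs(eigen(M1)$vectors[, 1])
v1 / max(v1)                               # does not match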

eigen() and the correct eigenvectors

My problem is the following:
I'm trying to use R to solve this problem numerically.
So I've correctly set up the problem in my console, and then I tried to compute the eigenvectors.
But I expect the eigenvector associated with lambda = 1 to be (1, 2, 1) instead of what I've got here. So the scaling is correct (0.4082483 is indeed half of 0.8164966), but I would like to obtain a consistent result.
My original problem is to find the stationary distribution of a Markov chain using R instead of doing it on paper. From a probabilistic point of view, my stationary distribution is a vector whose components sum to 1. For that reason I was trying to change the scale in order to obtain what I've called "a consistent result".
How can I do that ?
The eigenvectors returned by R are normalized (to unit Euclidean norm). If V is an eigenvector, then s * V is an eigenvector as well, for any non-zero scalar s. If you want the stationary distribution as in your link, divide by the sum:
V / sum(V)
and you will get (1/4, 1/2, 1/4).
So:
ev <- eigen(t(C))$vectors
sweep(ev, 2, colSums(ev), `/`)
to get all the solutions in one shot. (Note that ev / colSums(ev) would recycle the sums down the rows rather than across the columns, hence sweep().)
C <- matrix(c(0.5, 0.25, 0, 0.5, 0.5, 0.5, 0, 0.25, 0.5),
            nrow = 3)
ee <- eigen(t(C))$vectors
As suggested by @Stéphane Laurent in the comments, the scaling of eigenvectors is arbitrary; only the relative values are specified. The default in R is that each eigenvector has unit Euclidean norm, i.e. the sum of squares of its components is 1; colSums(ee^2) is a vector of 1s.
Following the link, we can see that you want each eigenvector to sum to 1.
ee2 <- sweep(ee,MARGIN=2,STATS=colSums(ee),FUN=`/`)
(i.e., divide each eigenvector by its sum).
(This is a good general solution, but in this case the sums of the second and third eigenvectors are both approximately zero [theoretically, they are exactly zero], so the rescaling only really makes sense for the first eigenvector.)
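As a quick check (my addition, not in the original answer), the first column of ee corresponds to the eigenvalue 1, and rescaling it to sum to 1 reproduces the stationary distribution from the question:
p_stat <- ee[, 1] / sum(ee[, 1])
p_stat                   # 0.25 0.50 0.25, i.e. (1/4, 1/2, 1/4)
as.vector(p_stat %*% C)  # unchanged by one step of the chain, so stationary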

How does distances weighting work in KNN?

I'm writing a KNN classifier in R. I want to add a weighting scheme, e.g. inverse distances 1/d. As it is, for the Iris dataset I get an almost constant 66% accuracy (no matter the metric used), because class no. 3 ("virginica") almost never shows up, and I want to improve on that with weighting. My question is: what exactly do I weight, and how? I've read that I should weight the classes of the K nearest neighbours by those distances.
I've tried creating vectors of classes and distances to K nearest neighbours and then taking weighted mean from it:
inverted <- function(vals, distances)
{
  inv_distances <- 1 / distances
  # eliminate division-by-zero errors
  inv_distances <- ifelse((inv_distances < 0.01), 0.01, inv_distances)
  weighted.mean(vals, inv_distances)
}
My results are weird: for correct vectors vals (classes) and distances I sometimes get NaN (Not a Number) or NA values. Also, my weights don't sum to 1, and... they probably should? I'm not sure. I just need someone to clear up this weighting scheme for me.
EDIT:
I've debugged the code above: it applied the weighting too late (and therefore did not eliminate the zero distances, causing the NaNs). I've also changed it to harmonic-series weights that don't use the distances at all (so the first neighbour has weight 1, the second 1/2, the third 1/3, etc.). I still don't know exactly how this should work or what other weighting schemes there may be.
inverted <- function(vals)
{
  weights <- 1 / seq_along(vals)  # harmonic weights: 1, 1/2, 1/3, ...
  weighted.mean(vals, weights)
}
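For reference, the usual distance-weighted KNN scheme is a weighted majority vote, not a weighted mean of class labels: each of the K neighbours votes for its own class with weight 1/d, the weights are summed per class, and the class with the largest total wins. The weights need not sum to 1; only their ratios matter. A minimal sketch (my own; weighted_vote is a hypothetical helper, not a library function):
weighted_vote <- function(classes, distances, eps = 1e-8) {
  w <- 1 / pmax(distances, eps)      # floor the distances BEFORE inverting, so no Inf
  totals <- tapply(w, classes, sum)  # summed vote weight per class
  names(which.max(totals))
}
# two nearby virginica neighbours outvote a single closer setosa:
weighted_vote(c("setosa", "virginica", "virginica"), c(0.5, 0.2, 0.3))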

How are weights treated in the cluster_walktrap function in igraph?

I am working with character networks of plays. Nodes represent characters, edges represent speeches they address to one another. It is a directed network, and the edge weights are equal to the number of words the source character says to the target.
In igraph, an edge weight sometimes means a distance and sometimes a closeness. To get correct results for betweenness, for instance, I need to invert the edge weights, so that the more words a character says to another, the 'closer' they are in the network:
edgeData <- data.frame(source, target, weight = numWords)
graph <- graph_from_data_frame(edgeData)
betweenness(graph, weights = 1/E(graph)$weight)
Now I want to study the community structure of my plays, and I don't know how to use the algorithms correctly. Should I treat edge weights as distances, and invert the weights so characters who talk more are 'closer' to one another?
cluster_walktrap(graph, weights = 1/E(graph)$weight)
Or should I treat the weights as, well, weights, and use the algorithm in its default state?
cluster_walktrap(graph)
Thanks for the help!
cluster_walktrap(graph) is OK: for community detection, a weight is a weight (a connection strength), not a distance. The distinction is also confusing when calculating some other graph indices. But if you want to calculate shortest paths (or other distance-based measures), you should use
weights = 1/E(graph)$weight
You can design a small demo to check this.
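For instance, here is a small demo of my own (the names and word counts are invented): two triads of characters that exchange many words internally, joined by a single low-weight edge. Because walktrap treats weights as connection strengths, the raw weights recover the two triads:
library(igraph)
el <- data.frame(
  source = c("a", "a", "b", "d", "d", "e", "c"),
  target = c("b", "c", "c", "e", "f", "f", "d"),
  weight = c(50, 60, 55, 45, 65, 50, 5)  # the c-d bridge is weak
)
g <- graph_from_data_frame(el, directed = FALSE)
membership(cluster_walktrap(g))  # {a,b,c} and {d,e,f} end up in separate communities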

Computing the un-weighted length of a weighted shortest path

I have started investigating whether igraph would be a more efficient method for calculating the length of a least cost path. Using the package gdistance it is straightforward to supply a cost surface and generate least cost paths between two (or many) points. The function costDistance returns the actual length of the paths as the sum of all the segment lengths (i.e. not the cumulative COST of the least cost path).
My question is whether there is a way to do this in igraph so that I can compare computation times. Using get.shortest.paths, I can obtain the length of the shortest path between vertices, but, when edge weights are supplied, the path length is reported as the weighted path length.
In short: I would like to find shortest paths on a weighted network, but have the lengths reported in terms of edge length, not weighted edge length.
Note: I can see how this would be possible by looping through each shortest path and writing some extra code to add up the unweighted edge lengths, but I fear this would cancel out my original need: increased efficiency of pairwise distance calculations over massive networks.
In get.shortest.paths, there is a weights argument! If you read ?get.shortest.paths you will find that weights is
Possibly a numeric vector giving edge weights. If this is NULL and the graph has a weight edge attribute, then the attribute is used. If this is NA then no weights are used (even if the graph has a weight attribute).
So you should set weights = NA. See below for an example:
require(igraph)
# make a reproducible example
el <- matrix(nc = 3, byrow = TRUE,
             c(1,2,.5, 1,3,2, 2,3,.5))
g2 <- add.edges(graph.empty(3), t(el[,1:2]), weight = el[,3])
# calculate the weighted shortest path between vertices 1 and 3
get.shortest.paths(g2, 1, 3)
# calculate the unweighted shortest path between vertices 1 and 3
get.shortest.paths(g2, 1, 3, weights = NA)
I'm not sure whether I completely understand what "edge length" and "weighted edge length" mean in your post (I guess that "edge length" is simply "the number of edges along the path" and "weighted edge length" is "the total weight of the edges along the path"), but if I'm right, your problem simply boils down to "finding shortest paths where edges are weighted by one particular criterion, and then returning, for each path, a length which is the sum of some other property of the edges involved".
If this is the case, you can pass the output="epath" parameter to get.shortest.paths; in this case, igraph will report the IDs of the edges along the weighted shortest path between two nodes. You can then use these IDs as indices into a vector containing the values of that other property that you wish to use when the lengths are calculated. E.g.:
> g <- grg.game(100, 0.2)
> E(g)$weight <- runif(ecount(g), min=1, max=20)
> E(g)$length <- runif(ecount(g), min=1, max=20)
> path <- unlist(get.shortest.paths(g, from=1, to=100, output="epath")$epath[[1]])
> sum(E(g)$length[path])
This will give you the sum of the length attributes of the edges involved in the shortest path between nodes 1 and 100, while the shortest paths are calculated using the weight attribute (which is the default for get.shortest.paths, but you can also override it with the weights=... argument).
If you are simply interested in the number of edges on the path, you can either use a constant 1 for the lengths, or simply call length(path) in the last line.
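For example, continuing the sketch above:
> sum(E(g)$weight[path])  # weighted length of the shortest path
> length(path)            # number of edges on that same path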
