How are weights treated in the cluster_walktrap function in igraph?

I am working with character networks of plays. Nodes represent characters, edges represent speeches they address to one another. It is a directed network, and the edge weights are equal to the number of words the source character says to the target.
In igraph, edge weight sometimes means distance and sometimes means closeness. To get correct results for betweenness, for instance, I need to invert the edge weights so that the more words a character says to another, the 'closer' they are in the network:
edgeData <- data.frame(source, target, weight = numWords)
graph <- graph_from_data_frame(edgeData)
betweenness(graph, weights = 1/E(graph)$weight)
Now I want to study the community structure of my plays, and I don't know how to use the algorithms correctly. Should I treat edge weights as distances, and invert the weights so characters who talk more are 'closer' to one another?
cluster_walktrap(graph, weights = 1/E(graph)$weight)
Or should I treat the weights as, well, weights, and use the algorithm in its default state?
cluster_walktrap(graph)
Thanks for the help!

cluster_walktrap(graph) is OK: for community detection, weight means weight (connection strength), not distance. I am also sometimes confused when calculating graph indices.
But if you want to calculate shortest paths (or other distance-based measures), you should use
weights = 1/E(graph)$weight
You can check this with a small demo; see the sketch below.
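A rough sketch of such a demo (the character names and word counts are invented, not from the question):
library(igraph)
# Two tightly talking groups, (A,B,C) and (D,E,F), joined by a weak C->D edge;
# edge weight = number of words the source says to the target.
edgeData <- data.frame(source = c("A","A","B","D","D","E","C"),
                       target = c("B","C","C","E","F","F","D"),
                       weight = c(500, 400, 450, 520, 380, 410, 30))
graph <- graph_from_data_frame(edgeData, directed = TRUE)
# Community detection: weights are strengths, so use them as-is
# (walktrap ignores edge directions):
membership(cluster_walktrap(graph, weights = E(graph)$weight))
# Distance-based measures: invert, so talking more means being closer:
betweenness(graph, weights = 1 / E(graph)$weight)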

Related

How to calculate NME (Normalized Mean Error) between ground-truth and predicted landmarks when some ground-truth points have no counterpart in the prediction?

I am trying to learn about facial landmark detection models, and I notice that many of them use NME (Normalized Mean Error) as a performance metric:
The formula is straightforward: it calculates the L2 distance between the ground-truth points and the model's predictions, then divides by a normalization factor, which varies between datasets.
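For reference (adding the formula itself, which the post only describes): with M landmarks, predictions p_i, ground truth g_i, and a dataset-specific normalization factor d (e.g. the inter-ocular distance),
\mathrm{NME} = \frac{1}{M} \sum_{i=1}^{M} \frac{\lVert p_i - g_i \rVert_2}{d}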
However, when applying this formula to a landmark detector that someone else developed, I have to deal with a non-trivial situation: the detector may not be able to generate the full number of landmarks for some input images (because of NMS, problems inherent to the model, image quality, etc.). Thus some ground-truth points may have no corresponding point in the prediction result.
How should I solve this? Should I just add such missing-point results to a "failure result set" and use FR (failure rate) to measure the model, ignoring them in the NME calculation?
If the output of your neural network is, for example, a 10x1 vector, those are your points, like [x1, y1, x2, y2, ..., x5, y5]. This vector has a fixed length because of the number of output neurons in your model.
If you have missing points (say you see 4 of 5 points), that is because some points go beyond the image width and height, or have negative coordinates like [-0.1, -0.2, 0.5, 0.7, ...]: the first two points there are not visible in the image, so they look missing, but they are still in the vector and you can still calculate NME.
In some custom neural nets missing points can still occur, because missing values are replaced with worst-case (largest-error) points.
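One way to implement the failure-set idea, as a rough sketch (the helper function and its NA convention are hypothetical, not from any standard benchmark): compute NME over the landmarks the detector actually returned, and report the rest as a failure rate (FR).
nme_with_failures <- function(pred, gt, norm_factor) {
  # pred, gt: M x 2 matrices; a row of pred is NA when the detector
  # produced no corresponding landmark (hypothetical convention)
  found <- complete.cases(pred)
  errs <- sqrt(rowSums((pred[found, , drop = FALSE] - gt[found, , drop = FALSE])^2))
  list(nme = mean(errs) / norm_factor,  # NME over the detected points only
       failure_rate = mean(!found))     # share of missing landmarks
}
gt <- matrix(runif(10), ncol = 2)       # 5 ground-truth landmarks
pred <- gt + matrix(rnorm(10, sd = 0.01), ncol = 2)
pred[3, ] <- NA                         # the detector missed landmark 3
nme_with_failures(pred, gt, norm_factor = 1)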

Why betweenness centrality differs in igraph?

I'm not a pro programmer, but I've been learning on my own. I have a problem with betweenness centrality. I have an undirected weighted graph of 28 actors, stored as an adjacency matrix. When I run betweenness(PG_Network4, v = V(PG_Network4), directed = FALSE, nobigint = TRUE, normalized = TRUE), the results differ significantly from those I get in UCINET, Pajek, and Gephi (which all agree with each other). What is weird is that if I do not load my network into R as "weighted", the results are the same. And the differences are considerable: for instance, the nodes ranked 2nd, 3rd, and 4th by betweenness in UCINET, Gephi, or Pajek drop to the 6th, 10th, and 13th positions in igraph. What am I doing wrong here? I thought it was the weighted specification, but UCINET considers that too and still matches Pajek and Gephi.
Thank you so much.
The issue you're having is caused by the weighted edges in the graph. igraph's betweenness() interprets edge weights as distances (costs), so a heavily weighted edge is treated as long rather than strong, whereas UCINET and the other programs appear to treat weights as tie strengths. To make the interpretations match, invert the weights when you call the function: betweenness(g, weights = 1/E(g)$weight). To ignore the weights entirely (reproducing your unweighted run), use weights = NA.
This should resolve your issue and get your values to match up with the other programs.
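A quick way to see the effect (random toy graph, not your data): the three weight interpretations can produce three different rankings.
library(igraph)
set.seed(1)
g <- sample_gnp(28, 0.2)                   # toy stand-in for the 28-actor network
E(g)$weight <- sample(1:10, ecount(g), replace = TRUE)
b_cost <- betweenness(g, weights = E(g)$weight)         # weights as distances (igraph default)
b_strength <- betweenness(g, weights = 1 / E(g)$weight) # weights as tie strengths
b_none <- betweenness(g, weights = NA)                  # weights ignored
# Compare the rankings of the first few nodes:
cbind(cost = rank(-b_cost), strength = rank(-b_strength), none = rank(-b_none))[1:5, ]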

Mathematical representation of a set of points in N dimensional space?

Given x data points in an N-dimensional space, I am trying to find a fixed-length representation that can describe any subset s of those x points. For example, the mean of the subset s describes it, but not uniquely: other points in the space could yield the same mean, so the mean is not a unique identifier. Can anyone suggest a unique measure that describes the points without depending on the number of points?
In short, it is impossible (you would achieve infinite noiseless compression). You have to either use a variable-length representation (or a fixed length proportional to the maximum number of points), or deal with "collisions" (your mapping will not be injective). In the first scenario you can simply store the coordinates of each point. In the second, you approximate your point clouds with more and more complex descriptors to balance collisions against memory usage. Some possibilities (see the sketch after this list):
storing the mean and covariance (basically performing maximum-likelihood estimation over Gaussian families)
performing some fixed-complexity density estimation, like a Gaussian Mixture Model, or training a generative neural network
using a set of simple geometrical/algebraic properties, such as:
number of points
mean, max, min, median distance between each pair of points
etc.
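A minimal sketch of the last option (the helper name and data are made up): a fixed-length descriptor whose length depends on the dimension N, not on the number of points.
describe_points <- function(pts) {
  # pts: an x-by-N matrix, one point per row
  d <- as.vector(dist(pts))               # all pairwise Euclidean distances
  c(n = nrow(pts),                        # number of points
    mu = colMeans(pts),                   # mean point
    cov = as.vector(cov(pts)),            # flattened covariance matrix
    d_min = min(d), d_max = max(d),
    d_mean = mean(d), d_med = median(d))  # pairwise-distance summaries
}
s <- matrix(rnorm(20), ncol = 4)          # a subset of 5 points in 4-D
describe_points(s)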
Any subset can be identified by a bit mask of length x, where bit i is 1 if the corresponding element belongs to the subset (there are 2^x subsets, so x bits are needed). There is no fixed-length representation that is not a function of x.
EDIT
I was wrong. PCA is a good way to perform dimensionality reduction for this problem, but it won't work for some sets.
However, you can almost do it, where "almost" is formally defined by the Johnson-Lindenstrauss lemma. It states that for a given large dimension N there exists a much lower dimension n, and a linear transformation mapping each point from N to n dimensions, that keeps the Euclidean distance between every pair of points in the set within some error ε of the original. Such a linear transformation is called the JL transform.
In other words, your problem is only solvable for sets of points where each pair of points is separated by at least ε. For this case, the JL transform gives you one possible solution. Moreover, there is a relationship between N, n and ε (see the lemma) such that, for example, if N=100, the JL transform can map each point to a point in 5D (n=5) and uniquely identify each subset if and only if the minimum distance between any pair of points in the original set is at least ~2.8 (i.e. the points are sufficiently different).
Note that n depends only on N and the minimum distance between any pair of points in the original set. It does not depend on the number of points x, so it is a solution to your problem, albeit with some constraints.
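A minimal sketch of a JL-style random (Gaussian) projection; the target dimension n here is picked by hand rather than derived from the lemma's bound.
set.seed(7)
N <- 100; n <- 5; x <- 50
pts <- matrix(rnorm(x * N), nrow = x)    # x points in N dimensions
R <- matrix(rnorm(N * n), nrow = N) / sqrt(n)
proj <- pts %*% R                        # the same x points in n dimensions
# Ratios near 1 mean pairwise distances survived the projection:
summary(as.vector(dist(proj)) / as.vector(dist(pts)))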

Computing the un-weighted length of a weighted shortest path

I have started investigating whether igraph would be a more efficient method for calculating the length of a least-cost path. Using the package gdistance, it is straightforward to supply a cost surface and generate least-cost paths between two (or many) points. The function costDistance returns the actual length of the paths as the sum of all the segment lengths (i.e. not the cumulative COST of the least-cost path).
My question is whether there is a way to do this in igraph so that I can compare computation time. Using get.shortest.paths, I can obtain the length of the shortest path between vertices, but, when edge weights are provided, the path length is reported as the weighted path length.
In short: I would like to find shortest paths on a weighted network but have the lengths reported in terms of edge length, not weighted edge length.
Note: I can see how this is possible by looping through each shortest path and then writing some extra code to add up the unweighted edge lengths, but I fear this will cancel out my original need for efficient pairwise distance calculations over massive networks.
In get.shortest.paths, there is a weights argument! If you read ?get.shortest.paths you will find that weights is
Possibly a numeric vector giving edge weights. If this is NULL and the graph has a weight edge attribute, then the attribute is used. If this is NA then no weights are used (even if the graph has a weight attribute).
So you should set weights = NA. See below for an example:
require(igraph)
# make a reproducible example
el <- matrix(nc=3, byrow=TRUE,
c(1,2,.5, 1,3,2, 2,3,.5) )
g2 <- add.edges(graph.empty(3), t(el[,1:2]), weight=el[,3])
# weighted shortest path between vertices 1 and 3
get.shortest.paths(g2, 1, 3)
# unweighted shortest path between vertices 1 and 3
get.shortest.paths(g2, 1, 3, weights=NA)
I'm not sure whether I completely understand what "edge length" and "weighted edge length" mean in your post (I guess "edge length" is simply "the number of edges along the path" and "weighted edge length" is "the total weight of the edges along the path"), but if I'm right, your problem boils down to "finding shortest paths where edges are weighted by one particular criterion, then returning for each path a length which is the sum of some other property of the edges involved".
If this is the case, you can pass the output="epath" parameter to get.shortest.paths; in this case, igraph will report the IDs of the edges along the weighted shortest path between two nodes. You can then use these IDs as indices into a vector containing the values of that other property that you wish to use when the lengths are calculated. E.g.:
> g <- grg.game(100, 0.2)
> E(g)$weight <- runif(ecount(g), min=1, max=20)
> E(g)$length <- runif(ecount(g), min=1, max=20)
> path <- get.shortest.paths(g, from=1, to=100, output="epath")$epath[[1]]
> sum(E(g)$length[path])
This will give you the sum of the length attributes of the edges involved in the shortest path between nodes 1 and 100, while the shortest paths are calculated using the weight attribute (which is the default for get.shortest.paths, but you can also override it with the weights=... argument).
If you are simply interested in the number of edges on the path, you can either use a constant 1 for the lengths, or simply call length(path) in the last line.

Probability to visit nodes in a random walk on graph

I have a finite undirected graph in which a node is marked as "start" and another is marked as "goal".
An agent is initially placed at the start node and it navigates through the graph randomly, i.e. at each step it chooses uniformly at random a neighbor node and moves to it.
When it reaches the goal node it stops.
I am looking for an algorithm that, for each node, gives an indication about the probability that the agent visits it, while traveling from start to goal.
Thank you.
As is often the case with graphs, it's simply a matter of knowing an appropriate way to describe the problem.
One way of writing a graph is as an adjacency matrix. If your graph G = (V, E) has |V| nodes (where |V| is the number of vertices), then this matrix will be |V| x |V|. If an edge exists between a pair of vertices, you set the corresponding entry in the adjacency matrix to 1, and to 0 if it isn't present.
A natural extension of this is to weighted graphs. Here, rather than 0 or 1, the adjacency matrix has some notion of weight.
In the case you're describing, you have a weighted graph where the weights are the probability of transitioning from one node to another. This type of matrix has a special name: it is a stochastic matrix. Depending on how you've arranged your matrix, either its rows or its columns will sum to 1 (a right or left stochastic matrix, respectively).
One link between stochastic matrices and graphs is Markov Chains. In Markov chain literature the critical thing you need to have is a transition matrix (the adjacency matrix with weights equal to the probability of transition after one time-step). Let's call the transition matrix P.
The probability of transitioning from one state to another after k time-steps is given by P^k. If you have a known source state i, then the i-th row of P^k gives you the probability of transitioning to any other state. This gives you an estimate of the probability of being in a given state in the short term.
Depending on your source graph, it may be that P^k reaches a steady-state distribution, that is, P^k = P^(k+1) for some value of k. This gives you an estimate of the probability of being in a given state in the long term; see the sketch below.
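A minimal sketch of this (toy 5-node graph, not from the question; the expm package supplies the %^% matrix-power operator), with the goal node made absorbing so the walk stops there, as the question specifies:
library(expm)                        # provides the %^% matrix-power operator
A <- matrix(c(0,1,1,0,0,
              1,0,1,1,0,
              1,1,0,1,0,
              0,1,1,0,1,
              0,0,0,1,0), nrow = 5, byrow = TRUE)
P <- A / rowSums(A)                  # row-stochastic transition matrix
start <- 1; goal <- 5
P[goal, ] <- 0; P[goal, goal] <- 1   # the agent stops at the goal (absorbing state)
(P %^% 3)[start, ]                   # distribution over nodes after 3 steps
(P %^% 50)[start, ]                  # long run: probability mass piles up at the goal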
As an aside, before you do any of this, you should be able to look at your graph, and say some things about what the probability of being in a given state is at some time.
If your graph has disjoint components, the probability of being in a component that you didn't start in is zero.
If your graph has some states that are absorbing, that is, some states (or groups of states) are inescapable once you've entered them, then you'll need to account for that. This may happen if your graph is tree-like.
