Find the highest scoring pair of nodes in a graph

I'm trying to solve an optimization problem where I want to find the combination of two nodes with the highest impact/importance in a graph. Let's say I want to base this on betweenness centrality (BC). I guess the more sensible approach is to select one node (maybe one with a high BC), then calculate the BC for the resulting network and remove the node with the highest value for BC. My goal is to generate a list of the highest scoring combinations of nodes when removed from the original graph. I've implemented a simplified method that picks out random nodes, and if the score is higher than the previous one, one of the two nodes is reused in the next combination. I'm not sure if this approach is good enough or if the code will "get stuck" at locally optimal combinations.
Any pointers to steer me in the right direction would be appreciated.

Unless there are properties of the graph and/or function that you can exploit, you have to check all pairs to be sure that the maximum is found.
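For modest graph sizes that exhaustive check is easy to write down. Below is a minimal sketch with networkx for an undirected graph; the impact measure (drop in the size of the largest connected component after removing the pair) is only a placeholder assumption, so swap in whatever scoring function fits your problem.

```python
# Exhaustive search over all node pairs: remove each pair, score the damage,
# and keep the best pairs. Assumes an undirected graph; the impact measure
# below (drop in largest-component size) is just a stand-in for your own score.
from itertools import combinations

import networkx as nx


def pair_impact(G, u, v):
    """Placeholder impact of removing the pair (u, v) from G."""
    largest_before = len(max(nx.connected_components(G), key=len))
    H = G.copy()
    H.remove_nodes_from([u, v])
    largest_after = max((len(c) for c in nx.connected_components(H)), default=0)
    return largest_before - largest_after


def best_pairs(G, top=10):
    """Score every pair of nodes and return the `top` highest-impact pairs."""
    scored = [((u, v), pair_impact(G, u, v)) for u, v in combinations(G.nodes(), 2)]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    G = nx.karate_club_graph()
    for pair, score in best_pairs(G, top=5):
        print(pair, score)
```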

Several approximate betweenness centrality calculation algorithms have been proposed.
Your general method is reasonable, and it is somewhat similar to the one used in "Fast approximation of betweenness centrality through sampling" (Riondato & Kornaropoulos, 2015).
Quoting:
"Since exact computation in large networks is prohibitively expensive,
we present two efficient randomized algorithms for betweenness
estimation. The algorithms are based on random sampling of shortest
paths and offer probabilistic guarantees on the quality of the
approximation. [...] The first algorithm estimates the betweenness of all vertices (or edges): all approximate values are within an additive factor ε ∈ (0, 1) from the real values, with probability at least 1 − δ. The second algorithm focuses on the top-K vertices (or edges) with highest betweenness and estimate their betweenness value to within a multiplicative factor ε, with probability at least 1 − δ. This is the first algorithm that can compute such approximation for the top-K vertices (or edges). "
The time complexity for both algorithms is O(r*(|E|+|V| log |V|)), where r is the sample size (which determines the accuracy).
Their second algorithm is quite relevant for your use case (K=2):
"This is the first algorithm that can compute such approximation for
the top-K vertices (or edges)."

First, calculate the betweenness centrality (BC) value for all nodes and sort them in descending order. Select the node with the highest BC value and remove it from the network. Recalculate BC on the remaining network and repeat the process. This will let you pick out the nodes with the highest BC in the network.
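If you follow this greedy recipe, the loop is only a few lines. A sketch with networkx, assuming you simply want the sequence of removed nodes and that recomputing betweenness after every removal is affordable:

```python
# Greedy variant of the procedure above: repeatedly take the node with the
# highest betweenness centrality, remove it, and recompute BC on what is left.
import networkx as nx


def greedy_high_bc_nodes(G, k=2):
    """Return k nodes picked greedily by betweenness centrality with recomputation."""
    H = G.copy()
    picked = []
    for _ in range(k):
        bc = nx.betweenness_centrality(H)   # Brandes, roughly O(|V||E|) per pass
        node = max(bc, key=bc.get)          # current highest-BC node
        picked.append(node)
        H.remove_node(node)                 # next pass runs on the reduced graph
    return picked


if __name__ == "__main__":
    G = nx.karate_club_graph()
    print(greedy_high_bc_nodes(G, k=2))
```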

Related

Is PageRank always better than eigenvector or Katz centrality?

As far as I understand, there is classical eigenvector centrality, and there are variants such as Katz centrality or PageRank. I wonder if the latter is the "latest stage" in the evolution of eigenvector centrality and therefore always superior. Or are there certain conditions under which one should use one or the other? If so, what conditions would those be?
Might be a little bit late, but
Eigenvector centrality assumes that nodes with more important connections are themselves important. For example, people who know the president are probably important. Mathematically, this is done by taking the centrality scores from the eigenvector corresponding to the largest eigenvalue of the adjacency matrix.
The problem with eigenvector centrality is that it does not handle directed graphs well: nodes with no incoming edges get zero centrality, and they then contribute nothing to the nodes they point to, so large parts of the graph can end up with zero centrality despite having many outgoing edges. Katz centrality fixes this by adding a small bias term so that no node has strictly zero centrality, which in turn affects the centralities of the neighboring nodes as well.
However, the problem with Katz centrality is that when a node becomes very central, it passes its full centrality to all of its outgoing links, making all of those nodes look important. For example, even though people who know the president are important, not all of them are (the president's car driver, for example). To fix this, PageRank divides the centrality a node passes on by its out-degree, combining the Katz idea with a degree normalization to balance this problem.
In conclusion: if the graph is undirected, use eigenvector centrality. If the graph is directed, the choice between Katz and PageRank depends on the situation. If you want extremely central nodes to strongly influence all of their neighbors, use Katz; otherwise, use PageRank.
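To get a feel for how the three measures differ on a concrete directed graph, you can compute them side by side. A small networkx sketch (the toy graph and the alpha/beta values are purely illustrative assumptions):

```python
# Compare eigenvector, Katz and PageRank centrality on a small directed graph.
import networkx as nx

G = nx.DiGraph([
    ("minister", "president"), ("journalist", "president"),
    ("president", "driver"), ("driver", "journalist"),
])

# Eigenvector centrality may fail to converge (or zero out) on directed graphs,
# which is exactly the weakness described above.
try:
    eig = nx.eigenvector_centrality(G, max_iter=1000)
except nx.PowerIterationFailedConvergence:
    eig = {n: float("nan") for n in G}

katz = nx.katz_centrality(G, alpha=0.1, beta=1.0)  # beta bias keeps every node nonzero
pr = nx.pagerank(G, alpha=0.85)                    # damping + division by out-degree

for n in G:
    print(f"{n:>10}  eig={eig[n]:.3f}  katz={katz[n]:.3f}  pr={pr[n]:.3f}")
```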
You cannot really compare these three, because they are based on different perspectives and definitions of centrality. PageRank uses the eigenvector centrality concept to determine how important a website is.
For instance: in eigenvector centrality we use the right eigenvector in the power iteration algorithm. In the PageRank algorithm we are interested in the in-links of nodes rather than the out-links (directed graph), so instead of the right eigenvector we use the left eigenvector.
Also read about Katz centrality.
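To make the left-versus-right eigenvector point concrete, here is a small numpy power-iteration sketch. The adjacency convention (A[i, j] = 1 means an edge from i to j) and the toy matrix are assumptions for illustration only.

```python
# Power iteration on the adjacency matrix. With x <- A x, a node's score is
# driven by the nodes it points to (out-links); with x <- A.T x (the left
# eigenvector), a node's score is driven by the nodes pointing to it (in-links),
# which is the PageRank-style point of view.
import numpy as np

A = np.array([[0, 1, 1, 0],     # A[i, j] = 1 means an edge i -> j
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)


def power_iteration(M, iters=200):
    x = np.ones(M.shape[0]) / M.shape[0]
    for _ in range(iters):
        x = M @ x
        x = x / np.linalg.norm(x, 1)    # renormalize each step
    return x


right = power_iteration(A)      # classic eigenvector-centrality convention
left = power_iteration(A.T)     # scores accumulated over in-links
print("right eigenvector:", np.round(right, 3))
print("left  eigenvector:", np.round(left, 3))
```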

How to choose k in Shi-Malik Algorithm?

I'm wondering how one chooses a specific k in the Shi-Malik algorithm.
Do we choose several values of k and rank them via their SSE measures?
Does k reflect the number of clusters we assume for the data?
Kind regards, Mikey
Yes, k is the number of natural groupings we believe there are in the data.
You can find k by exploring the eigenvalues.
One tool designed specifically for spectral clustering is the eigengap heuristic (also called the spectral gap): the number of clusters k is usually given by the value of k that maximizes the eigengap (the difference between consecutive eigenvalues), i.e., choose k such that the eigenvalues λ1, ..., λk are all very small, but λk+1 is relatively large.
The larger this eigengap is, the closer the eigenvectors are to those of the ideal case, and hence the better spectral clustering works. If you're interested in the justification for this procedure, it is based on perturbation theory and spectral graph theory.
You can read more here: A Tutorial on Spectral Clustering - Ulrike von Luxburg
Another way to explore the natural grouping: the number of connected components and the spectrum of the Laplacian matrix. The number of times 0 appears as an eigenvalue of the Laplacian equals the number of connected components in the graph. Your affinity matrix can be treated as a graph, so check how many connected components it has; that will give you a sense of the natural structure of your data.
In addition, as you mentioned, you can set a validation criterion (for example, SSE) and see its value under different values of k. That's fine once you have labeled data (which is not always the case in clustering) and you know that this criterion/quality measure is really meaningful.
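To make both spectral diagnostics concrete (counting zero eigenvalues and locating the eigengap), here is a short numpy sketch; the affinity matrix W and the tolerance are made-up illustrative values.

```python
# Inspect the Laplacian spectrum of an affinity matrix W: the number of
# (near-)zero eigenvalues counts connected components, and the largest gap
# between consecutive eigenvalues suggests a value for k.
import numpy as np


def laplacian_spectrum_diagnostics(W, tol=1e-10):
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # unnormalized graph Laplacian
    eigvals = np.sort(np.linalg.eigvalsh(L))    # ascending eigenvalues

    n_components = int(np.sum(eigvals < tol))   # multiplicity of eigenvalue 0
    gaps = np.diff(eigvals)                     # gaps lambda_{k+1} - lambda_k
    k_by_eigengap = int(np.argmax(gaps)) + 1    # k that maximizes the eigengap
    return eigvals, n_components, k_by_eigengap


if __name__ == "__main__":
    # Two obvious blocks of 3 nodes each, weakly tied together.
    W = np.array([
        [0, 1, 1, 0.0, 0, 0],
        [1, 0, 1, 0.0, 0, 0],
        [1, 1, 0, 0.1, 0, 0],
        [0, 0, 0.1, 0, 1, 1],
        [0, 0, 0.0, 1, 0, 1],
        [0, 0, 0.0, 1, 1, 0],
    ])
    eigvals, ncomp, k = laplacian_spectrum_diagnostics(W)
    print("eigenvalues:", np.round(eigvals, 3))
    print("connected components:", ncomp, " suggested k:", k)
```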

What are the differences between community detection algorithms in igraph?

I have a list of about 100 igraph objects with a typical object having about 700 vertices and 3500 edges.
I would like to identify groups of vertices within which ties are more likely. My plan is to then use a mixed model to predict how many within-group ties vertices have using vertex and group attributes.
Some people may want to respond to other aspects of my project, which would be great, but the thing I'm most interested in is information about functions in igraph for grouping vertices. I've come across these community detection algorithms but I'm not sure of their advantages and disadvantages, or whether some other function would be better for my case. I saw the links here as well, but they aren't specific to igraph. Thanks for your advice.
Here is a short summary about the community detection algorithms currently implemented in igraph:
edge.betweenness.community is a hierarchical decomposition process where edges are removed in the decreasing order of their edge betweenness scores (i.e. the number of shortest paths that pass through a given edge). This is motivated by the fact that edges connecting different groups are more likely to be contained in multiple shortest paths simply because in many cases they are the only option to go from one group to another. This method yields good results but is very slow because of the computational complexity of edge betweenness calculations and because the betweenness scores have to be re-calculated after every edge removal. Your graphs with ~700 vertices and ~3500 edges are around the upper size limit of graphs that are feasible to be analyzed with this approach. Another disadvantage is that edge.betweenness.community builds a full dendrogram and does not give you any guidance about where to cut the dendrogram to obtain the final groups, so you'll have to use some other measure to decide that (e.g., the modularity score of the partitions at each level of the dendrogram).
fastgreedy.community is another hierarchical approach, but it is bottom-up instead of top-down. It tries to optimize a quality function called modularity in a greedy manner. Initially, every vertex belongs to a separate community, and communities are merged iteratively such that each merge is locally optimal (i.e. yields the largest increase in the current value of modularity). The algorithm stops when it is not possible to increase the modularity any more, so it gives you a grouping as well as a dendrogram. The method is fast and it is the method that is usually tried as a first approximation because it has no parameters to tune. However, it is known to suffer from a resolution limit, i.e. communities below a given size threshold (depending on the number of nodes and edges if I remember correctly) will always be merged with neighboring communities.
walktrap.community is an approach based on random walks. The general idea is that if you perform random walks on the graph, then the walks are more likely to stay within the same community because there are only a few edges that lead outside a given community. Walktrap runs short random walks of 3-4-5 steps (depending on one of its parameters) and uses the results of these random walks to merge separate communities in a bottom-up manner like fastgreedy.community. Again, you can use the modularity score to select where to cut the dendrogram. It is a bit slower than the fast greedy approach but also a bit more accurate (according to the original publication).
spinglass.community is an approach from statistical physics, based on the so-called Potts model. In this model, each particle (i.e. vertex) can be in one of c spin states, and the interactions between the particles (i.e. the edges of the graph) specify which pairs of vertices would prefer to stay in the same spin state and which ones prefer to have different spin states. The model is then simulated for a given number of steps, and the spin states of the particles in the end define the communities. The consequences are as follows: 1) There will never be more than c communities in the end, although you can set c to as high as 200, which is likely to be enough for your purposes. 2) There may be fewer than c communities in the end as some of the spin states may become empty. 3) It is not guaranteed that nodes in completely remote (or disconnected) parts of the network have different spin states. This is more likely to be a problem for disconnected graphs only, so I would not worry about that. The method is not particularly fast and not deterministic (because of the simulation itself), but has a tunable resolution parameter that determines the cluster sizes. A variant of the spinglass method can also take into account negative links (i.e. links whose endpoints prefer to be in different communities).
leading.eigenvector.community is a top-down hierarchical approach that optimizes the modularity function again. In each step, the graph is split into two parts in a way that the separation itself yields a significant increase in the modularity. The split is determined by evaluating the leading eigenvector of the so-called modularity matrix, and there is also a stopping condition which prevents tightly connected groups from being split further. Due to the eigenvector calculations involved, it might not work on degenerate graphs where the ARPACK eigenvector solver is unstable. On non-degenerate graphs, it is likely to yield a higher modularity score than the fast greedy method, although it is a bit slower.
label.propagation.community is a simple approach in which every node is assigned one of k labels. The method then proceeds iteratively and re-assigns labels to nodes in a way that each node takes the most frequent label of its neighbors in a synchronous manner. The method stops when the label of each node is one of the most frequent labels in its neighborhood. It is very fast but yields different results based on the initial configuration (which is decided randomly), therefore one should run the method a large number of times (say, 1000 times for a graph) and then build a consensus labeling, which could be tedious.
igraph 0.6 will also include the state-of-the-art Infomap community detection algorithm, which is based on information theoretic principles; it tries to build a grouping which provides the shortest description length for a random walk on the graph, where the description length is measured by the expected number of bits per vertex required to encode the path of a random walk.
Anyway, I would probably go with fastgreedy.community or walktrap.community as a first approximation and then evaluate other methods when it turns out that these two are not suitable for a particular problem for some reason.
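The function names above are the R ones; if you happen to use python-igraph, the equivalents are methods such as community_fastgreedy and community_walktrap. A hedged sketch of the "try these two first" advice (method names may differ slightly between igraph versions):

```python
# Try the two "first approximation" methods suggested above and compare their
# modularity scores, using the Python interface of igraph.
import igraph as ig

g = ig.Graph.Famous("Zachary")                       # small benchmark graph

fg = g.community_fastgreedy().as_clustering()        # greedy modularity optimization
wt = g.community_walktrap(steps=4).as_clustering()   # short random walks, then merge

print("fastgreedy: %d communities, modularity %.3f" % (len(fg), fg.modularity))
print("walktrap:   %d communities, modularity %.3f" % (len(wt), wt.modularity))
```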
A summary of the different community detection algorithms can be found here: http://www.r-bloggers.com/summary-of-community-detection-algorithms-in-igraph-0-6/
Notably, the InfoMAP algorithm is a recent newcomer that could be useful (it supports directed graphs too).

JUNG graphs: vertex similarity?

I have a JUNG graph containing about 10K vertices and 100K edges, and I'd like to get a measure of similarity between any pair of vertices.
The vertices represent concepts (e.g. dog, house, etc), and the links represent relations between concepts (e.g. related, is_a, is_part_of, etc).
The vertices are densely inter-linked, so a shortest-path approach doesn't give good results (the shortest paths are always very short).
What approaches would you recommend to rank the connectivity between vertices?
JUNG has some algorithms to score the importance of vertices, but I can't tell whether it includes measures of similarity between two vertices.
SimPack seems also promising.
Any hints?
The centrality scores don't measure the similarity of pairs of vertices; rather, they measure some kind of centrality (depending on the method) of single nodes of the network. Therefore this approach is probably not what you want.
SimPack indeed has a nice goal set out, but for graphs it implements isomorphism-based comparisons, which compare multiple graphs for similarity rather than pairs of nodes within one given graph. Therefore this is out of scope for now.
What you are seeking are so-called graph clustering methods (also called network module determination or network community determination methods), which divide the graph (network) into multiple partitions so that the nodes in each partition are more strongly interconnected with each other than with nodes of other partitions.
The most classic method is maybe the betweenness centrality clustering of Newman & Girvan, where you can exploit the dendrogram for similarity calculation, and it is available in JUNG. Of course there are throngs of methods nowadays. You may want to try (shameless plug) our ModuLand method, or read the fine table of module detection algorithms at the end of its Electronic Supplementary Material. ModuLand is an overlapping graph clustering method family: its result for each node is a vector containing the strength of belonging to each respective cluster of the network. Pairwise node similarity is easy to derive from pairs of these node-to-cluster vectors.
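Once you have such node-to-cluster membership vectors (from ModuLand or any other overlapping clustering), pairwise node similarity is just a vector comparison. A small sketch with a made-up membership matrix and cosine similarity:

```python
# Derive pairwise node similarity from node-to-cluster membership vectors.
# The membership matrix below is invented purely for illustration: row i holds
# node i's strength of belonging to each of 3 clusters.
import numpy as np

membership = np.array([
    [0.9, 0.1, 0.0],   # node 0: mostly cluster A
    [0.8, 0.2, 0.0],   # node 1: mostly cluster A
    [0.1, 0.7, 0.2],   # node 2: mostly cluster B
    [0.0, 0.1, 0.9],   # node 3: mostly cluster C
])


def cosine_similarity_matrix(M):
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    unit = M / norms
    return unit @ unit.T        # entry (i, j) = cosine similarity of nodes i and j


print(np.round(cosine_similarity_matrix(membership), 2))
```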
Graph clustering is non-trivial, and possibly you will need to adapt whichever method you choose for very precise domain-specific results, but that's up to the reader ;) Good luck!

In a graph, how to find the nearest node to a group of nodes?

I have an undirected, unweighted graph, which doesn't have to be planar. I also have a subset of graph's nodes (true subset) and I need to find a node not belonging to the subset, with minimum sum of distances to all nodes in the subset.
So far, I have implemented breadth-first search starting from each node in the subset, and the intersection that occurs first is the node I am looking for. Unfortunately, it is running too slowly since the graph contains a large number of nodes.
An all-pairs shortest path algorithm lets you find the distance between every pair of nodes in O(V^3) time; see Floyd-Warshall. Summing the distances afterwards takes at most quadratic time (for each candidate node, sum its distances to the subset). It's a very straightforward and not terribly fast way of doing it, but it may well be an order of magnitude faster than what you're doing right now.
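As a concrete sketch of that idea with networkx: since the graph is unweighted, the example below gets the all-pairs distances from a BFS per node (all_pairs_shortest_path_length) rather than Floyd-Warshall, but the sum-and-minimize step is the same. The example graph and subset are arbitrary.

```python
# All-pairs distances, then pick the non-subset node with the smallest sum of
# distances to the subset. Straightforward, not the fastest possible approach.
import networkx as nx


def nearest_to_group(G, subset):
    subset = set(subset)
    dist = dict(nx.all_pairs_shortest_path_length(G))  # BFS from every node (unweighted)
    best_node, best_total = None, float("inf")
    for node in G:
        if node in subset:
            continue
        total = sum(dist[node].get(s, float("inf")) for s in subset)
        if total < best_total:
            best_node, best_total = node, total
    return best_node, best_total


if __name__ == "__main__":
    G = nx.path_graph(10)                  # nodes 0..9 in a line
    print(nearest_to_group(G, {0, 4, 8}))  # a node near the "middle" of the subset
```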
