How to find the most important nodes in each Louvain community? - nebula-graph

I use the distributed NebulaGraph to run community detection and want to find the important nodes in each community and mine the hidden relationships.
I can think of two ways to try: one uses the nebula-algorithm package, and the other uses Python, except that the Python approach would need to be adapted to run against the distributed database.
For the first way, using the nebula-algorithm package:
I run the Louvain community detection algorithm by submitting the algorithm package directly:
spark-submit --master "local" --class com.vesoft.nebula.algorithm.Main /opt/offline/nebula/nebula-algorithm-3.0.0.jar -p /opt/offline/nebula/application.conf
My question is that once the community division has run and I want to dig out the important nodes in each community, this feels like a job for degree centrality, betweenness centrality, closeness centrality, etc.
I see that the application.conf config file lists these algorithms, but it is not clear whether I can write multiple algorithms to execute after executeAlgo.
algorithm: {
    # the algorithm that you are going to execute, pick one from [pagerank, louvain, connectedcomponent,
    # labelpropagation, shortestpaths, degreestatic, kcore, stronglyconnectedcomponent, trianglecount,
    # betweenness, graphtriangleCount, clusteringcoefficient, bfs, hanp, closeness, jaccard, node2vec]
    executeAlgo: louvain

    # PageRank parameter
    pagerank: {
        maxIter: 10
        resetProb: 0.15  # default 0.15
    }

    # Louvain parameter
    louvain: {
        maxIter: 20
        internalIter: 10
        tol: 0.5
    }
    ...
}
If I write louvain and then betweenness, will the Louvain algorithm be executed first and then the betweenness algorithm? And even if the betweenness algorithm is executed, will it run on the entire graph rather than on top of the Louvain community division result?
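For the second way (Python), a minimal sketch with NetworkX of what finding important nodes per community could look like: compute the Louvain partition, then run centrality on each community's induced subgraph rather than on the whole graph. The edges.csv path and format are hypothetical stand-ins for data exported from NebulaGraph:

import networkx as nx

# Hypothetical export of the graph's edges as "src,dst" lines.
G = nx.read_edgelist("edges.csv", delimiter=",")

# Louvain partition (networkx >= 3.0 ships louvain_communities).
communities = nx.community.louvain_communities(G, seed=42)

# Rank nodes inside each community on its induced subgraph, so
# betweenness is computed per community, not over the whole graph.
for i, members in enumerate(communities):
    sub = G.subgraph(members)
    bc = nx.betweenness_centrality(sub)
    top = sorted(bc, key=bc.get, reverse=True)[:5]
    print(f"community {i}: top nodes {top}")

As for executeAlgo: the config comment ("pick one from [...]") suggests a single algorithm per spark-submit run, so louvain and betweenness would be two separate runs, and the betweenness run would score the entire input graph.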

Related

How to set resolution parameter for Cluster Info Map in R-igraph?

I am applying the Cluster Info Map algorithm for community detection to a set of large networks. I can achieve high modularity scores of around 0.6; however, this comes with very high numbers of communities (18-24), which complicates my final analysis. Ideally, I would like to work with 10-15 communities. Is there any way to adjust how many communities are ultimately detected in my network using a resolution/modularity parameter or otherwise?
I've seen similar questions for Louvain and other algorithms, but nothing specifically on the Cluster Info Map algorithm. Nor does the official documentation for the standard function show anything about a resolution parameter. https://search.r-project.org/CRAN/refmans/igraph/html/cluster_infomap.html
I'm wondering if there is some other workaround available? Example of my code below for a weighted, directed network.
ci <- cluster_infomap(network,
                      e.weights = E(network)$weight,
                      nb.trials = 50)
The Infomap implementation included in igraph does not have any parameters that allow controlling the number of communities or the resolution. igraph's cluster_infomap is based on the old Infomap package.
The new Python Infomap package supports several features which you might find helpful, including hierarchical partitioning.
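For reference, a minimal sketch using the infomap Python package (pip install infomap); the CLI-style flags below mirror that package's documented options, and the toy links stand in for your weighted, directed network:

from infomap import Infomap

# --two-level forces a flat partition; drop it for hierarchical modules.
im = Infomap("--two-level --directed --num-trials 50")

# add_link(source, target, weight)
im.add_link(0, 1, 1.0)
im.add_link(1, 2, 2.0)
im.add_link(2, 0, 0.5)

im.run()
print(f"found {im.num_top_modules} modules")
for node_id, module_id in im.get_modules().items():
    print(node_id, module_id)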

DAG-shortest path vs Dijkstra algorithm

I have implemented Dijkstra's algorithm from the pseudocode in the reference "Introduction to Algorithms", 3rd edition, by Cormen et al., for the single-source shortest path problem.
My implementation is in Python, using linked lists to represent graphs in an adjacency-list representation: the list of nodes is a linked list, and each node has a linked list representing its edges. Furthermore, I didn't implement or use a binary heap or Fibonacci heap for the min-priority queue the algorithm needs, so whenever the procedure extracts the node with the smallest distance from the source, I search the linked list of nodes in O(V) time.
On the other hand, the reference also provides an algorithm for DAGs (which I have implemented) that applies a topological sort before relaxing all the edges.
With all this context, I have a Dijkstra algorithm with a complexity of
O(V^2)
And a DAG-shortest path algorithm with a complexity of
O(V+E)
By using the timeit.default_timer() function to measure the running times of the algorithms, I have found that the Dijkstra algorithm is faster than the DAG algorithm when applied to DAGs with positive edge weights and different graph densities, all for 100 and 1000 nodes.
Shouldn't the DAG-shortest path algorithm be faster than Dijkstra for DAGs?
Your running time analysis for both algorithms is correct, and it's true that the DAG shortest-path algorithm is asymptotically faster than Dijkstra's algorithm for DAGs.
However, there are 3 possible reasons for your testing results:
The graph you used for testing is very dense. When the graph is very dense, E ≈ V^2, so the running time for both algorithms approach O(V^2).
The number of vertices is still not large enough. To solve this, you can use a much larger graph for further testing.
The initialization of the DAG costs a lot of running time.
Anyway, DAG shortest path algorithm should be faster than Dijkstra's algorithm theoretically.
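To make the comparison concrete, here is a minimal sketch of the DAG algorithm, assuming the graph is a plain dict of adjacency lists; a dict also removes the O(V) node lookups of the linked-list representation described in the question:

import math

def dag_shortest_path(graph, source):
    # graph: {u: [(v, w), ...]} adjacency lists of a DAG.
    # Topological order = reversed DFS post-order.
    order, visited = [], set()
    def dfs(u):
        visited.add(u)
        for v, _ in graph.get(u, []):
            if v not in visited:
                dfs(v)
        order.append(u)
    for u in graph:
        if u not in visited:
            dfs(u)
    order.reverse()
    # One relaxation pass over the edges in topological order: O(V + E).
    dist = {u: math.inf for u in order}
    dist[source] = 0.0
    for u in order:
        for v, w in graph.get(u, []):
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist

On large, sparse DAGs this should beat the O(V^2) list-based Dijkstra; on dense graphs or small V, constant factors can mask the asymptotic difference, consistent with the reasons above.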

Clustering algorithm for unweighted graphs

I have an unweighted, undirected graph as my network, which is basically a network of proteins. I want to cluster this graph and divide it into disjoint clusters. Can anyone suggest clustering algorithms that I can apply to this biological network?
Several graph partitioning algorithms exist; they use different paradigms to tackle the same problem.
The most common is Louvain's method, which optimizes Newman's modularity.
In Python, using NetworkX as the graph library, you can use the community package (python-louvain) to partition your graph.
The fastest graph partitioning uses METIS. It is based on hierarchical graph coarsening.
You also have N-cut, originally designed to segment images.
Finally, you can partition graphs using stochastic block-model techniques. A very good Python implementation of Louvain and several block-model techniques can be found in graph-tool.
My favorite is the latter: it is fast (based on the Boost Graph Library), relatively easy to use, and tunable.
EDIT: Note that in graph-tool, what is called Louvain modularity is in fact Newman's algorithm; the docs are here.
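To illustrate the NetworkX route mentioned above, a minimal sketch using the python-louvain package (installed as python-louvain, imported as community); the karate club graph is a toy stand-in for the protein network:

import networkx as nx
import community as community_louvain  # pip install python-louvain

G = nx.karate_club_graph()  # toy unweighted, undirected graph

# best_partition greedily maximizes Newman's modularity and
# returns a {node: community_id} dict of disjoint clusters.
partition = community_louvain.best_partition(G)
print("modularity:", community_louvain.modularity(partition, G))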

Graph Shortest Paths w/Dynamic Weights (Repeated Dijkstra? Distance Vector Routing Algorithm?) in R / Python / Matlab

I have a graph of a road network with avg. traffic speed measures that change throughout the day. Nodes are locations on a road, and edges connect different locations on the same road or intersections between 2 roads. I need an algorithm that solves the shortest travel time path between any two nodes given a start time.
Clearly, the graph has dynamic weights, as the travel time for an edge i is a function of the speed of traffic at this edge, which depends on how long your path takes to reach edge i.
I have implemented Dijkstra's algorithm with
edge weights = (edge_distance / edge_speed_at_start_time)
but this ignores that edge speed changes over time.
My questions are:
Is there a heuristic way to use repeated calls to Dijkstra's algorithm to approximate the true solution?
I believe the 'Distance Vector Routing Algorithm' is the proper way to solve such a problem. Is there a way to use the Igraph library or another library in R, Python, or Matlab to implement this algorithm?
EDIT
I am currently using Igraph in R. The graph is an igraph object. The igraph object was created using the igraph command graph.data.frame(Edges), where Edges looks like this (but with many more rows):
I also have a matrix of the speed (in MPH) of every edge for each time, which looks like this (except with many more rows and columns):
Since I want to find shortest travel time paths, then the weights for a given edge are edge_distance / edge_speed. But edge_speed changes depending on time (i.e. how long you've already driven on this path).
The graph has 7048 nodes and 7572 edges (so it's pretty sparse).
There exists an exact algorithm that solves this problem! It is called time-dependent Dijkstra (TDD) and runs about as fast as Dijkstra itself.
Unfortunately, as far as I know, neither igraph nor NetworkX have implemented this algorithm so you will have to do some coding yourself.
Luckily, you can implement it yourself! You only need to adapt Dijkstra in a single place.
In normal Dijkstra, you assign the tentative distance as follows, with dist your current distance map, u the node you are considering, and v its neighbor:
alt = dist[u] + travel_time(u, v)
In time-dependent Dijkstra we get the following:
current_time = start_time + dist[u]
cost = weight(u, v, current_time)
alt = dist[u] + cost
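Putting the pieces together, a minimal sketch of time-dependent Dijkstra in Python; weight(u, v, t) is assumed to look up edge_distance / edge_speed_at_time_t from the speed matrix described in the question:

import heapq
import math

def td_dijkstra(graph, weight, source, start_time):
    # graph: {u: [v, ...]} adjacency lists.
    # weight(u, v, t): travel time of edge (u, v) when entered at time t.
    dist = {source: 0.0}
    pq = [(0.0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        for v in graph.get(u, []):
            current_time = start_time + d
            alt = d + weight(u, v, current_time)
            if alt < dist.get(v, math.inf):
                dist[v] = alt
                heapq.heappush(pq, (alt, v))
    return dist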
TDD Dijkstra was described in: Stuart E. Dreyfus, "An appraisal of some shortest-path algorithms", Operations Research, 17(3):395–412, 1969.
Currently, much faster heuristics are already in use. They can be found with the search term: 'Time dependent routing'.
What about the igraph package in R? You can try the get.shortest.paths or get.all.shortest.paths function.
library(igraph)
?get.all.shortest.paths
get.shortest.paths()
get.all.shortest.paths()  # if weights are NULL, then Dijkstra is used

What are the differences between community detection algorithms in igraph?

I have a list of about 100 igraph objects with a typical object having about 700 vertices and 3500 edges.
I would like to identify groups of vertices within which ties are more likely. My plan is to then use a mixed model to predict how many within-group ties vertices have using vertex and group attributes.
Some people may want to respond to other aspects of my project, which would be great, but the thing I'm most interested in is information about functions in igraph for grouping vertices. I've come across these community detection algorithms but I'm not sure of their advantages and disadvantages, or whether some other function would be better for my case. I saw the links here as well, but they aren't specific to igraph. Thanks for your advice.
Here is a short summary about the community detection algorithms currently implemented in igraph:
edge.betweenness.community is a hierarchical decomposition process where edges are removed in the decreasing order of their edge betweenness scores (i.e. the number of shortest paths that pass through a given edge). This is motivated by the fact that edges connecting different groups are more likely to be contained in multiple shortest paths simply because in many cases they are the only option to go from one group to another. This method yields good results but is very slow because of the computational complexity of edge betweenness calculations and because the betweenness scores have to be re-calculated after every edge removal. Your graphs with ~700 vertices and ~3500 edges are around the upper size limit of graphs that are feasible to be analyzed with this approach. Another disadvantage is that edge.betweenness.community builds a full dendrogram and does not give you any guidance about where to cut the dendrogram to obtain the final groups, so you'll have to use some other measure to decide that (e.g., the modularity score of the partitions at each level of the dendrogram).
fastgreedy.community is another hierarchical approach, but it is bottom-up instead of top-down. It tries to optimize a quality function called modularity in a greedy manner. Initially, every vertex belongs to a separate community, and communities are merged iteratively such that each merge is locally optimal (i.e. yields the largest increase in the current value of modularity). The algorithm stops when it is not possible to increase the modularity any more, so it gives you a grouping as well as a dendrogram. The method is fast and it is the method that is usually tried as a first approximation because it has no parameters to tune. However, it is known to suffer from a resolution limit, i.e. communities below a given size threshold (depending on the number of nodes and edges if I remember correctly) will always be merged with neighboring communities.
walktrap.community is an approach based on random walks. The general idea is that if you perform random walks on the graph, then the walks are more likely to stay within the same community because there are only a few edges that lead outside a given community. Walktrap runs short random walks of 3-4-5 steps (depending on one of its parameters) and uses the results of these random walks to merge separate communities in a bottom-up manner like fastgreedy.community. Again, you can use the modularity score to select where to cut the dendrogram. It is a bit slower than the fast greedy approach but also a bit more accurate (according to the original publication).
spinglass.community is an approach from statistical physics, based on the so-called Potts model. In this model, each particle (i.e. vertex) can be in one of c spin states, and the interactions between the particles (i.e. the edges of the graph) specify which pairs of vertices would prefer to stay in the same spin state and which ones prefer to have different spin states. The model is then simulated for a given number of steps, and the spin states of the particles in the end define the communities. The consequences are as follows: 1) There will never be more than c communities in the end, although you can set c to as high as 200, which is likely to be enough for your purposes. 2) There may be fewer than c communities in the end as some of the spin states may become empty. 3) It is not guaranteed that nodes in completely remote (or disconnected) parts of the network have different spin states. This is more likely to be a problem for disconnected graphs only, so I would not worry about that. The method is not particularly fast and not deterministic (because of the simulation itself), but it has a tunable resolution parameter that determines the cluster sizes. A variant of the spinglass method can also take into account negative links (i.e. links whose endpoints prefer to be in different communities).
leading.eigenvector.community is a top-down hierarchical approach that optimizes the modularity function again. In each step, the graph is split into two parts in a way that the separation itself yields a significant increase in the modularity. The split is determined by evaluating the leading eigenvector of the so-called modularity matrix, and there is also a stopping condition which prevents tightly connected groups from being split further. Due to the eigenvector calculations involved, it might not work on degenerate graphs where the ARPACK eigenvector solver is unstable. On non-degenerate graphs, it is likely to yield a higher modularity score than the fast greedy method, although it is a bit slower.
label.propagation.community is a simple approach in which every node is assigned one of k labels. The method then proceeds iteratively and re-assigns labels to nodes in a way that each node takes the most frequent label of its neighbors in a synchronous manner. The method stops when the label of each node is one of the most frequent labels in its neighborhood. It is very fast but yields different results based on the initial configuration (which is decided randomly), therefore one should run the method a large number of times (say, 1000 times for a graph) and then build a consensus labeling, which could be tedious.
igraph 0.6 will also include the state-of-the-art Infomap community detection algorithm, which is based on information theoretic principles; it tries to build a grouping which provides the shortest description length for a random walk on the graph, where the description length is measured by the expected number of bits per vertex required to encode the path of a random walk.
Anyway, I would probably go with fastgreedy.community or walktrap.community as a first approximation and then evaluate other methods when it turns out that these two are not suitable for a particular problem for some reason.
A summary of the different community detection algorithms can be found here: http://www.r-bloggers.com/summary-of-community-detection-algorithms-in-igraph-0-6/
Notably, the Infomap algorithm is a recent newcomer that could be useful (it supports directed graphs too).
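To try several of these side by side, a minimal sketch with python-igraph, whose method names mirror the R functions discussed above; modularity is used to score each partition:

import igraph as ig

g = ig.Graph.Famous("Zachary")  # toy stand-in for a ~700-vertex graph

clusterings = {
    "fastgreedy": g.community_fastgreedy().as_clustering(),
    "walktrap": g.community_walktrap().as_clustering(),
    "infomap": g.community_infomap(),
    "label_propagation": g.community_label_propagation(),
}
for name, c in clusterings.items():
    print(f"{name}: {len(c)} communities, modularity {c.modularity:.3f}")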
