I have a network of about 2k nodes and 4.8k edges that exhibits strong small-world properties. I want to get its cohesive blocks with cohesive.blocks() in igraph. However, it has been running for days and has still not produced any output.
There is a mailing-list thread about a similar issue (https://lists.nongnu.org/archive/html/igraph-help/2008-01/msg00020.html), in which MPI was mentioned as a way to run cohesive.blocks() in parallel. So I wonder how to run cohesive.blocks(), or other igraph functions, in parallel.
Thanks.
Related
I have a graph with 480k nodes and 34M edges. I want to create node embeddings using Node2Vec on this graph, but it is not even able to calculate the transition probabilities. I am using a Google Cloud machine with 32 cores and 120 GB RAM. Infrastructure is not the problem; the problem is that the function _precompute_probabilities in the node2vec pip library is not parallel. It uses only a single thread to calculate the transition probabilities. Is there a way to make this parallel, or is there any other parallel version of Node2Vec?
TLDR
To compute embeddings on large graphs use GRAPE.
pip install grape
Example:
from grape import Graph
from grape.embedders import Node2VecGloVeEnsmallen
graph = Graph.from_csv(
    # Path to the edge list file
    edge_path="edges.csv",
    sources_column="source",
    destinations_column="destination",
    directed=False,
)
embedding = Node2VecGloVeEnsmallen().fit_transform(graph)
Longer answer
To solve this issue, we developed GRAPE, our Rust library with Python bindings. We needed to run node2vec on big graphs and the libraries we found weren't fast enough. We re-implemented and optimized many models, including node2vec, from the ground up without using TensorFlow or PyTorch.
Here are some benchmarks on our server with 12 cores (24 threads) and 128 GB of RAM.
Info about the tested graphs:
WikiEN has 130M edges and 17M nodes.
CTD has 45M edges and 100K nodes.
PheKnowLator has 7M edges and 800K nodes.
Here's the tutorial on how to load your own custom graph (we support CSV-like formats):
https://github.com/AnacletoLAB/grape/blob/main/tutorials/Loading_a_Graph_in_Ensmallen.ipynb
and here's the complete tutorial on how to run and visualize node2vec with Glove on a given graph:
https://github.com/AnacletoLAB/grape/blob/main/tutorials/Using_Node2Vec_GloVe_to_embed_Cora.ipynb
Our paper is still under review ( https://arxiv.org/abs/2110.06196 ) and we are just two developers, so, if you need any help contact us on Discord, Github, or Twitter #GRAPElib.
I found a library, Graph2Vec, that uses a CSR matrix to generate walks instead of jumping from node to node in memory. It is much faster than Node2Vec.
Link: https://www.singlelunch.com/2019/08/01/700x-faster-node2vec-models-fastest-random-walks-on-a-graph/
Github: https://github.com/VHRanger/graph2vec
Also, you can refer to this issue and try the mentioned libraries:
https://github.com/aditya-grover/node2vec/issues/10
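To make the CSR idea concrete, here is a minimal pure-Python sketch of walks over a CSR layout. It is not the code of any of the libraries mentioned above; the helper names `build_csr` and `random_walk` are made up for illustration, and the walk is a plain uniform one, without node2vec's p/q bias. The point is only that all neighbours of a node sit in one contiguous slice, so a walk step is an index lookup rather than a dictionary traversal.

```python
import random


def build_csr(num_nodes, edges):
    """Build CSR arrays (indptr, indices) for an undirected edge list."""
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # indptr[n]..indptr[n + 1] delimits the neighbours of node n in `indices`.
    indptr = [0] * (num_nodes + 1)
    for node in range(num_nodes):
        indptr[node + 1] = indptr[node] + degree[node]
    indices = [0] * indptr[-1]
    cursor = indptr[:-1].copy()  # next free slot for each node
    for u, v in edges:
        indices[cursor[u]] = v
        cursor[u] += 1
        indices[cursor[v]] = u
        cursor[v] += 1
    return indptr, indices


def random_walk(indptr, indices, start, length, rng):
    """Uniform random walk visiting at most `length` nodes (no p/q bias)."""
    walk = [start]
    current = start
    for _ in range(length - 1):
        begin, end = indptr[current], indptr[current + 1]
        if begin == end:  # node has no neighbours, the walk stops here
            break
        current = indices[rng.randrange(begin, end)]
        walk.append(current)
    return walk


if __name__ == "__main__":
    indptr, indices = build_csr(4, [(0, 1), (1, 2), (2, 0)])
    print(random_walk(indptr, indices, 0, 5, random.Random(0)))
```

With NumPy arrays in place of the lists, many walks can be advanced at once, which is presumably where most of the speed-up comes from in practice.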
I've tried https://github.com/eliorc/node2vec with the "temp_folder" property. Though I didn't feel that it was much faster, so I ended up with the version with CSR matrices.
Oh... was it yourself, who answered the question? :)
Good to know, thank you for the tip
I have several i7 desktops which I would like to use to speed up the computation time of the function genoud in the package rgenoud. In genoud, you can pass in a cluster that was created with the parallel package (and, I assume, snow as well).
What kind of clustering software would you recommend for this? I have tried Beowulf clusters, but the documentation for them is mostly outdated, so I am looking for an up-to-date guide that shows me how to set one up.
This cluster is going to be over LAN, and I have assigned IPs to the nodes.
Thanks
I am using the InfoMap algorithm in the igraph package to perform community detection on a directed and non-weighted graph (34943 vertices, 206366 edges). In the graph, vertices represent websites and edges represent the existence of a hyperlink between websites.
A problem I have encountered after running the algorithm is that the majority of vertices have a membership in a single massive community (32920 or 94%). The rest of the vertices are dispersed into hundreds of other tiny communities.
I have tried different settings with the nb.trials parameter (i.e. 50, 100, and now running 500). However, this doesn't seem to change the result much.
I am feeling rather exasperated because the run-time on the algorithm is quite high, so I have to wait each time for the results (with no luck yet!!).
Many thanks.
Thanks for all the excellent comments. In the end, I got it working by downloading and running the source code for Infomap, which is available at: http://www.mapequation.org/code.html.
Due to licence issues, the latest code has not been integrated with igraph.
This solved the problem of too many nodes being 'lumped' into a single massive community.
Specifically, I used the following options from the command line: -N 10 --directed --two-level --map
Kudos to Martin Rosvall from the Infomap project for providing me with detailed help to resolve this problem.
For the interested reader, here is more information about this issue:
When a network collapses into one major cluster, it is most often because of a very dense and random link structure ... In the code for directed networks implemented in igraph, teleportation is encoded. If many nodes have no outlinks, the effect of teleportation can be significant because it randomly connects nodes. We have made new code available here: http://www.mapequation.org/code.html that can cluster networks without encoding the random teleportation necessary to make the dynamics ergodic. For details, see this paper: http://pre.aps.org/abstract/PRE/v85/i5/e056107
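A quick way to check whether this explanation could apply to your own graph is to count how many vertices have no outgoing links. The sketch below is only a diagnostic, not part of Infomap or igraph; `dangling_fraction` is a hypothetical helper name, and vertices are assumed to be numbered from 0 to num_nodes - 1.

```python
def dangling_fraction(num_nodes, edges):
    """Return (fraction, list) of nodes without outgoing links.

    `edges` is an iterable of directed (source, target) pairs.
    """
    has_outlink = [False] * num_nodes
    for source, _target in edges:
        has_outlink[source] = True
    dangling = [node for node in range(num_nodes) if not has_outlink[node]]
    return len(dangling) / num_nodes, dangling


if __name__ == "__main__":
    # Toy directed graph: nodes 2 and 3 never appear as a source.
    fraction, nodes = dangling_fraction(4, [(0, 1), (0, 2), (1, 2)])
    print(f"{fraction:.0%} of nodes have no outlinks: {nodes}")
```

If the fraction is large, teleportation from those vertices is a plausible suspect for the single massive community.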
I was going to put this in a comment, but it ended up being too long and hard to read in that format, so this is a tangentially related answer.
One thing you should do is assess whether the algorithm is doing a good job at finding community structure. You can try to visualise your network to establish:
Is the algorithm returning community structures that make sense? Maybe there is one massive community?
If not, does the visualisation provide insight as to why?
This will help inform your next steps. Maybe the structure of the network requires a different algorithm?
One thing I find useful for large networks is plotting your edges as a heatmap. This is simple to do if you have your edges stored in an adjacency matrix.
For this, you can use the image function, passing in your matrix of edges as the argument z. Hopefully this will allow you to see by eye the community structure.
However you also want to assess the correctness of your algorithm, so you want to sort the nodes (rows and columns of your adjacency matrix) by the community they've been assigned to.
Another thing to note is that if your edges are directed it may be more difficult to assess by eye as edges can appear on either side of the diagonal of the heatmap. One thing you can do is instead plot the underlying graph -- that is the adjacency matrix assuming your edges are undirected.
If your algorithm is doing a good job, you would expect to see square blocks along the diagonal, one for each detected community.
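Here is a rough sketch of the reordering step in Python. The text above refers to R's image function; matplotlib's `imshow` is swapped in here as the nearest equivalent. The helper names (`symmetrize`, `sort_by_community`, `plot_heatmap`) are made up for this example, and the adjacency matrix is assumed to be a small dense list of lists, which will not scale to very large graphs.

```python
def symmetrize(adjacency):
    """Underlying undirected graph: keep an edge if it exists in either direction."""
    n = len(adjacency)
    return [[1 if adjacency[i][j] or adjacency[j][i] else 0 for j in range(n)]
            for i in range(n)]


def sort_by_community(adjacency, membership):
    """Reorder rows and columns so nodes of the same community are adjacent."""
    order = sorted(range(len(membership)), key=lambda node: membership[node])
    reordered = [[adjacency[i][j] for j in order] for i in order]
    return reordered, order


def plot_heatmap(matrix):
    """Draw the matrix as a heatmap (needs matplotlib, imported lazily)."""
    import matplotlib.pyplot as plt
    plt.imshow(matrix, cmap="Greys", interpolation="nearest")
    plt.show()


if __name__ == "__main__":
    # Toy directed graph with two communities: {0, 2} and {1, 3}.
    adjacency = [
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 1, 0, 0],
    ]
    membership = [0, 1, 0, 1]
    blocks, order = sort_by_community(symmetrize(adjacency), membership)
    for row in blocks:
        print(row)
    # plot_heatmap(blocks)  # uncomment to view the heatmap
```

After sorting, the non-zero entries of the toy matrix fall into two 2x2 blocks on the diagonal, which is the pattern to look for on the real graph.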
Recent research proposes the classification or characterization of graphs (instead of flat feature vectors) with Support vector machines. Is there any open source available in C/C++ which can perform such classification?
ChemCPP is oriented towards chemoinformatics but has some graph kernel computation code that you may find useful.
Karsten Borgwardt has done a lot of work on graph kernels. His page, which has code for computing those kernels, is here.
I want to carry out graph clustering on a huge undirected graph with millions of edges and nodes. The graph is almost clustered, with different clusters joined together only by a few nodes (ambiguous nodes that can belong to multiple clusters). There will be very few or almost no edges between two clusters. This problem is similar to finding a vertex cut set of a graph, with one exception: the graph needs to be partitioned into many components (their number being unknown). (See this picture: https://docs.google.com/file/d/0B7_3zLD0XdtAd3ZwMFAwWDZuU00/edit?pli=1)
It's almost like different strongly connected components sharing a couple of nodes between them, and I am supposed to remove those nodes to separate the strongly connected components. Edges are weighted, but this problem is more about finding structure in a graph, so edge weights won't be relevant. (Another way to think about the problem is to visualize solid spheres touching each other at some points, with the spheres being the strongly connected components and the touching points being the ambiguous nodes.)
I am prototyping something, so I am quite short of time to study graph clustering algorithms myself and select the best one. Plus, I need a solution that cuts nodes and not edges, since different clusters share nodes and not edges in my case.
Is there any research paper or blog that addresses this or a somewhat related problem? Or can anyone come up with a solution to this problem, however dirty?
Since millions of nodes and edges are involved, I would need a MapReduce implementation of the solution. Any inputs, links for that too?
Is there any current open source implementation in MapReduce that I can use directly?
I think this problem is analogous to Finding Communities in online social networks by removing vertices.
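As a cheap first experiment for the simplest version of this (two clusters touching in exactly one node), you could compute the articulation points (cut vertices) of the graph and look at what remains after removing them. The sketch below is a standard DFS low-link computation written iteratively so it does not hit Python's recursion limit; it is an assumption that this fits your data. It will not find shared nodes when two clusters touch in two or more nodes, and on sparse, tree-like regions it will flag many nodes that are not "ambiguous" in your sense. The function names are made up for the example.

```python
def articulation_points(num_nodes, edges):
    """Return the set of cut vertices of an undirected graph (iterative DFS)."""
    adjacency = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    discovery = [-1] * num_nodes  # DFS discovery time, -1 means unvisited
    low = [0] * num_nodes         # lowest discovery time reachable from the subtree
    parent = [-1] * num_nodes
    cut = set()
    timer = 0
    for root in range(num_nodes):
        if discovery[root] != -1:
            continue
        discovery[root] = low[root] = timer
        timer += 1
        root_children = 0
        stack = [(root, iter(adjacency[root]))]
        while stack:
            node, neighbours = stack[-1]
            descended = False
            for nxt in neighbours:
                if discovery[nxt] == -1:      # tree edge: go deeper
                    parent[nxt] = node
                    discovery[nxt] = low[nxt] = timer
                    timer += 1
                    if node == root:
                        root_children += 1
                    stack.append((nxt, iter(adjacency[nxt])))
                    descended = True
                    break
                if nxt != parent[node]:       # back edge
                    low[node] = min(low[node], discovery[nxt])
            if not descended:                 # node is finished
                stack.pop()
                if stack:
                    above = stack[-1][0]
                    low[above] = min(low[above], low[node])
                    if above != root and low[node] >= discovery[above]:
                        cut.add(above)
        if root_children > 1:                 # the root needs two DFS subtrees
            cut.add(root)
    return cut


def components_without(num_nodes, edges, removed):
    """Connected components (sorted node lists) after deleting `removed` nodes."""
    adjacency = [[] for _ in range(num_nodes)]
    for u, v in edges:
        if u not in removed and v not in removed:
            adjacency[u].append(v)
            adjacency[v].append(u)
    seen = set(removed)
    components = []
    for start in range(num_nodes):
        if start in seen:
            continue
        seen.add(start)
        component, frontier = [start], [start]
        while frontier:
            node = frontier.pop()
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.append(nxt)
                    frontier.append(nxt)
        components.append(sorted(component))
    return components


if __name__ == "__main__":
    # Two triangles that share node 2.
    edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 2)]
    shared = articulation_points(5, edges)
    print(shared, components_without(5, edges, shared))
```

Both passes are linear in the number of nodes and edges, so a few million edges should be workable on a single machine, memory permitting.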
Your problem is not so simple. I am afraid that it is related to the clique problem, which is NP-complete, so unless you somehow quantify the statement "there are almost no edges between the clusters", your problem might still be very difficult. But what I would do in your shoes is try one dirty, greedy approach, namely treating the nodes as the following kind of quasi-neural net:
I would consider each vertex to have inputs, outputs and a sigmoid activation function which converts the input value (sum of inputs) into the output value. The output value, and I consider this important, would not be cloned and sent to all the neighbors, but rather divided evenly between the neighbors. In addition to this, I would define a logarithmic decay of activity in a neuron (self-suppression, a suppressive connection to itself), controlled by a decay parameter that is global for the net.
Now, I would start the simulation with all the neurons at activity 0.5 (the activity range is 0 to 1) and a very high decay parameter, which would lead to all the neurons quickly stabilizing in the 0 state. I would then gradually decrease the decay parameter until the steady-state result yields the first clique with non-zero stable activity.
The question is what to do next. One possibility is to subtract the found clique from the graph and run the same process again until we find all the cliques. This greedy approach might succeed if your graph is indeed as well behaved (really almost clustered) as you say, but might lead to unexpected results otherwise. Another possibility is to give the found clique a unique "clique smell" that would be repulsive (mutual suppression) to other cliques and rerun the algorithm until the second clique is found, give it a different clique smell repulsive to all others, and so on, until each node has its own assigned smell.
I think these are all the big ideas I have about this.
The key is that, since it is probably not possible to solve this problem in the general case (likely NP-complete), you need to make use of whatever special properties your graph has. That means you need to play with the parameters for a while until the algorithm solves 99% of the cases that you encounter. I don't think that it is possible to give a numerically precise answer to your question without long experimentation with the actual datasets that you encounter.
Since millions of nodes and edges are involved, I would need a MapReduce implementation of the solution. Any inputs, links for that too?
In my experience, I doubt that using Map/Reduce here would be truly advantageous. First, a graph on the order of 10^6 nodes isn't really that large (especially a graph that is not hyper-connected, since you are considering clustering), and the overhead of using Map/Reduce (unless you have already set up your hardware/software for it) will not be worth it for your problem.
Map/Reduce will work much better once you have solved the clustering issue and want to run a similar analysis on each cluster. Basically, it pays off when you can break your task into relatively isolated sub-tasks that can be performed in parallel. This can of course be cascaded over several layers.
In a relatively similar situation, I first modelled my graph in a graph database (I used Neo4J, and would recommend it highly) and then ran my analytics and queries on it. You will be surprised at how whiteboard-friendly this solution is, and even massively joined and connected queries will execute nearly instantaneously, especially at the scale of only a few million nodes. For example, you can do a filtered analysis based on degrees of separation, followed by a listing of commons.