fast graph partitioning for mpi parallel - graph

I am new to graph partitioning, but I think my question should already have a good answer. I want to partition a huge network (billions of nodes) into a few subgraphs so that, when using MPI, each subgraph is processed by a different processor. I currently use the adjacency list representation of the graph. Which algorithms can do this? Thank you!

Yes, you can do this, and there are several open-source tools available. The tool I use most frequently is ParMETIS.
It is an MPI-based parallel library that provides a variety of functions, including graph partitioning. How you use it depends entirely on your application. I generally prefer to feed the input graph to ParMETIS, obtain the partition, and then pass the partitions as input to my MPI programs. However, you could also call the library functions from within your application, for example for graphs that change in real time.
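For context, ParMETIS takes the graph in a distributed CSR form (the vtxdist, xadj and adjncy arrays) rather than as an adjacency list. Below is a minimal Python sketch of that conversion, assuming vertices are numbered 0..n-1 and are initially split into contiguous blocks across ranks. The helper names to_distributed_csr and edge_cut are made up for illustration and are not part of ParMETIS; the partitioner itself is not called here.

```python
def to_distributed_csr(adj, num_ranks):
    """Convert an adjacency list into per-rank CSR arrays.

    adj[v] is the list of neighbours of vertex v (vertices are 0..n-1).
    Returns (vtxdist, local), where rank r initially owns vertices
    vtxdist[r] .. vtxdist[r + 1] - 1 and local[r] is its (xadj, adjncy) pair.
    """
    n = len(adj)
    # Contiguous block distribution of vertices over the ranks.
    vtxdist = [(r * n) // num_ranks for r in range(num_ranks + 1)]
    local = []
    for r in range(num_ranks):
        xadj, adjncy = [0], []
        for v in range(vtxdist[r], vtxdist[r + 1]):
            adjncy.extend(adj[v])      # neighbours keep their global ids
            xadj.append(len(adjncy))   # end offset of v's neighbour list
        local.append((xadj, adjncy))
    return vtxdist, local


def edge_cut(adj, part):
    """Count undirected edges whose endpoints lie in different parts."""
    return sum(
        1
        for v, nbrs in enumerate(adj)
        for u in nbrs
        if v < u and part[v] != part[u]
    )


# Two triangles joined by the single edge 2-3.
adj = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
vtxdist, local = to_distributed_csr(adj, num_ranks=2)
cut = edge_cut(adj, [0, 0, 0, 1, 1, 1])
```

The partition vector passed to edge_cut has the same shape as what a partitioner returns (one part id per vertex), so the same function can be used to sanity-check a partition before handing it to the MPI program.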

Related

NetworkX vs GraphDB: do they serve similar purposes? When to use one or the other and when to use them together?

I am trying to understand whether I should use a GraphDB for my project. I am mapping a computer network and I use NetworkX. The relationships are physical or logical adjacency (L2 and L3). In its current incarnation, my program scans the network and dumps the adjacency info into a Postgres RDB. From there I use Python to build my graphs with NetworkX.
I am trying to understand whether I should change my approach and whether there is any benefit in storing the info in a GraphDB. Postgres has AgensGraph, which seems to be built on top of Postgres as a GraphDB overlay or add-on. I do not know yet whether installing it on top will make my life easier. I barely survived the migration from SQLite to Postgres :-) and to SQLAlchemy, so now, in not even 3 months, I am reconsidering things while I can (the migration is not complete).
I could choose to use a mix, but I am not sure whether it makes sense to use a GraphDB. From what I understand, these have advantages such as not needing a schema (which helps a lot for a DB newbie like me).
I am also wondering whether NetworkX (a Python library) and GraphDB overlap in any way. As far as I understand these things, NetworkX could be instrumental in analyzing the topology of the graph, while GraphDB is mainly used to query the data stored in the DB. Do they overlap in any way? Can they be used together?
TLDR: Use Neo4j or OrientDB to store the data and networkx to process it (if you need complicated algorithms). That will be the best solution.
I strongly recommend against using GraphDB for your purposes. GraphDB is based on RDF, which is used for the semantic web and common knowledge storage. It is not meant for problems like yours. There are many graph databases that will fit you much better. I can recommend Neo4j (the most popular graph database, as you can see; free, but non-open-source) or OrientDB (the most popular open-source graph database).
I used a graph database when I had a similar problem (but I used HP UCMDB, which is corporate software and is not free). It was really MUCH better than average relational DBs. So the idea of using a graph database is good, and it fits this kind of problem naturally.
I am not sure you really need networkx to analyze the graph (you can use graph query languages for that), but if you want, you can load the data from your DB into networkx via GraphML or some other method (OrientDB is similar) and process it there.
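As a rough illustration of the "store in a database, process in memory" split, here is a minimal Python sketch that reads adjacency rows from a database into an in-memory graph. It uses an in-memory SQLite database as a stand-in for the real store, and the table name adjacency, its columns, and the load_graph helper are hypothetical, chosen only for this example.

```python
import sqlite3
from collections import defaultdict

# In-memory SQLite stands in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adjacency (src TEXT, dst TEXT, layer TEXT)")
conn.executemany(
    "INSERT INTO adjacency VALUES (?, ?, ?)",
    [("sw1", "sw2", "L2"), ("sw2", "r1", "L3"), ("r1", "r2", "L3")],
)


def load_graph(conn, layer=None):
    """Build an undirected adjacency dict, optionally for one layer only."""
    query = "SELECT src, dst FROM adjacency"
    params = ()
    if layer is not None:
        query += " WHERE layer = ?"
        params = (layer,)
    graph = defaultdict(set)
    for src, dst in conn.execute(query, params):
        graph[src].add(dst)
        graph[dst].add(src)
    return graph


l3 = load_graph(conn, "L3")
full = load_graph(conn)
```

The same (src, dst) rows could be passed to networkx's Graph.add_edges_from if the topology analysis is done there instead of on a plain dictionary.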
And a little question-and-answer quiz to finish:
"As far as I understand these things NetworkX could be instrumental in analyzing the topology of the graph"
Absolutely right.
"while GraphDB is mainly used to query the data stored in the DB."
It is a database. And, yes, it is mainly used to query the data.
"Do they overlap in any way?"
They are both about graphs. Of course they overlap :)
"Can they be used together?"
Yes, they can. No, they should not be used together for your problem.

Graph generation tool

Is there a free tool out there that would allow me to generate a directed/undirected weighted graph?
I'm thinking of something that lets me draw the graph on some sort of canvas and then save it to a file in adjacency list format, edge list format, or any other common format.
Have you heard of Neo4j? Check it out; maybe it's what you need:
http://www.neo4j.org/
This is Wikipedia's definition:
Neo4j is an open-source graph database, implemented in Java. The developers describe Neo4j as "embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables".
If you're using Java (or Jython or other Java-compatible language) and want a Swing graphical component, JGraphX is a decent library, though this may be higher-level than what you want (since the adjacency list/matrix is a detail internal to the library).
JGraph is also available in .NET, PHP, and JS.
Code: https://github.com/jgraph/jgraphx
Documentation: http://jgraph.github.io/mxgraph/
I found something: Gephi.
I can draw a graph and save/export it in multiple formats.
Hope this answer helps others.
http://gephi.github.io/
Although asking for tools is off-topic at Stack Overflow, I can recommend Bandage for visualization of sequence assemblies.

Memory virtualization with R on cluster

I know almost nothing about parallel computing, so this question might be very stupid, and what I would like to do may be impossible.
I am using a Linux cluster with 40 nodes. However, since I don't know how to write parallel code in R, I am limited to using only one. On this node I am trying to analyse data that floods the memory (around 64 GB). So my problem isn't a lack of computational power but rather a memory limitation.
My question is whether it is even possible to use some R package (like doSnow) for implicit parallelisation, using 2-3 nodes to increase the RAM limit, or whether I would have to rewrite the script from the ground up to make it explicitly parallelised?
Sorry if my question is naive; any suggestions are welcome.
Thanks,
Simon
I don't think there is such a package. The reason is that it would not make much sense to have one. Memory access is very fast, and accessing data from another computer over the network is very slow compared to that. So if such a package existed it would be almost useless, since the processor would need to wait for data over the network all the time, and this would make the computation very very slow.
This is true for common computing clusters built from off-the-shelf hardware. If you happen to have a special cluster where remote memory access is fast and is provided as a service of the operating system, then of course it might not be that bad.
Otherwise, what you need to do is to try to divide up the problem into multiple pieces, manually, and then parallelize, either using R, or another tool.
An alternative to this would be to keep some of the data on the disk, instead of loading all of it into the memory. You still need to (kind of) divide up the problem, to make sure that the part of the data in the memory is used for a reasonable amount of time for computation, before loading another part of the data.
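The keep-it-on-disk idea can be sketched as follows. The question is about R, but this sketch is in Python purely for illustration: it computes a column mean while holding only a fixed number of rows in memory at a time. The function name chunked_mean, the column name value, and the tiny demo file are assumptions made for the example.

```python
import csv
import os
import tempfile
from itertools import islice


def chunked_mean(path, column, chunk_rows=1000):
    """Mean of one CSV column, reading at most chunk_rows rows at a time."""
    total, count = 0.0, 0
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        while True:
            chunk = list(islice(reader, chunk_rows))  # next slice of the file
            if not chunk:
                break
            # Do all the work needed on this chunk before loading the next.
            total += sum(float(row[column]) for row in chunk)
            count += len(chunk)
    return total / count if count else float("nan")


# Small demo file so the sketch runs end to end.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["value"])
    writer.writerows([[i] for i in range(1, 11)])
    demo_path = tmp.name

demo_mean = chunked_mean(demo_path, "value", chunk_rows=3)
os.remove(demo_path)
```

This only works directly for computations that can be expressed as running totals over chunks; anything that needs all the data at once has to be restructured first, which is the "divide up the problem" step mentioned above.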
Whether it is worth (or possible at all) doing either of these options, depends completely on your application.
Btw. a good list of high performance computing tools in R is here:
http://cran.r-project.org/web/views/HighPerformanceComputing.html
For future inquiry:
You may want to have a look at two packages "snow" and "parallel".
Library "snow" extends the functionality of apply/lapply/sapply... to work on more than one core and/or one node.
Of course, you can perform simple parallel computing using more than one core:
#SBATCH --cpus-per-task= (enter some number here)
You can also perform parallel computing using more than one node (preferably with the previously mentioned libraries) using:
#SBATCH --ntasks-per-node= (enter some number here)
However, for several reasons, you may want to consider using Python instead of R, where parallelism can be much more efficient using "Dask" workers.
You might want to take a look at TidalScale, which can allow you to aggregate nodes on your cluster to run a single instance of Linux with the collective resources of the underlying nodes. www.tidalscale.com. Though the R application may be inherently single threaded, you'll be able to provide your R application with a single, simple coherent memory space across the nodes that will be transparent to your application.
Good luck with your project!

Finding Connected Components using Hadoop/MapReduce

I need to find the connected components of a huge dataset (the graph is undirected).
One obvious choice is MapReduce. But I'm a newbie to MapReduce and am quite short of time to pick it up and code it myself.
I was wondering whether there is an existing API for this, since it is a very common problem in social network analysis?
Or, at least, is anyone aware of a reliable (tried and tested) source that I can use to get started with the implementation myself?
Thanks
I blogged about it for myself:
http://codingwiththomas.blogspot.de/2011/04/graph-exploration-with-hadoop-mapreduce.html
But MapReduce isn't a good fit for these Graph analysis things. Better use BSP (bulk synchronous parallel) for that, Apache Hama provides a good graph API on top of Hadoop HDFS.
I've written a connected components algorithm with MapReduce here: (Mindist search)
https://github.com/thomasjungblut/tjungblut-graph/tree/master/src/de/jungblut/graph/mapreduce
Also a BSP version for Apache Hama can be found here:
https://github.com/thomasjungblut/tjungblut-graph/blob/master/src/de/jungblut/graph/bsp/MindistSearch.java
The implementation isn't as difficult as in MapReduce and it is at least 10 times faster.
If you're interested, check out the latest version in TRUNK and visit our mailing list.
http://hama.apache.org/
http://apache.org/hama/mail-lists.html
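To give a feel for the BSP style, here is a minimal single-process Python sketch of connected components by minimum-label propagation: in each superstep, active vertices send their current label to their neighbours, and any vertex that receives a smaller label adopts it and stays active. This is an assumption about the general idea behind a "mindist" style search, not a port of the linked code, and the function name connected_components is made up for the example.

```python
def connected_components(adj):
    """Label each vertex with the smallest vertex id in its component.

    adj maps every vertex to an iterable of its neighbours (undirected,
    so each edge must appear in both directions).
    Returns (labels, number_of_supersteps).
    """
    label = {v: v for v in adj}
    active = set(adj)
    supersteps = 0
    while active:
        # Message phase: active vertices send their label to neighbours;
        # each receiver keeps only the smallest message.
        inbox = {}
        for v in active:
            for u in adj[v]:
                if u not in inbox or label[v] < inbox[u]:
                    inbox[u] = label[v]
        # Compute phase: adopt a smaller label and stay active, else halt.
        active = set()
        for u, smallest in inbox.items():
            if smallest < label[u]:
                label[u] = smallest
                active.add(u)
        supersteps += 1
    return label, supersteps


graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
labels, steps = connected_components(graph)
```

Vertices that share a label belong to the same component. In a real BSP framework the two phases run in parallel on every worker, with a barrier between supersteps.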
I don't really know whether an API is available that has methods for finding strongly connected components. But I implemented the BFS algorithm to find the distance from a source node to all other nodes in the graph (a directed graph with as many as 65 million nodes).
The idea was to explore the neighbors (distance of 1) of each node in one iteration and feed the output of reduce back into map until the distances converge. The map emits the shortest distances possible from each node, and the reduce updates each node with the shortest distance from the list.
I would suggest to check this out. Also, this could help. These two links would give you the basic idea of graph algorithms in the map-reduce paradigm (if you are not already familiar with it). Essentially, you need to twist the algorithm to use DFS instead of BFS.
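The iteration described above can be sketched in plain Python as follows. This is a single-process imitation of the map and reduce phases, not Hadoop code, and the names map_phase, reduce_phase and bfs_distances are made up for the example.

```python
from collections import defaultdict

INF = float("inf")


def map_phase(nodes):
    """Emit each node's own record plus a candidate distance per neighbour."""
    for node, (dist, neighbours) in nodes.items():
        yield node, ("node", dist, neighbours)  # pass the structure along
        if dist != INF:
            for n in neighbours:
                yield n, ("dist", dist + 1, None)


def reduce_phase(pairs):
    """Keep, for every node, the smallest distance seen for it."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    result = {}
    for node, values in grouped.items():
        best, neighbours = INF, []
        for kind, dist, nbrs in values:
            if kind == "node":
                neighbours = nbrs
            best = min(best, dist)
        result[node] = (best, neighbours)
    return result


def bfs_distances(adj, source):
    """Repeat map + reduce until no distance changes."""
    nodes = {v: (0 if v == source else INF, list(n)) for v, n in adj.items()}
    while True:
        updated = reduce_phase(map_phase(nodes))
        if updated == nodes:  # converged
            return {v: d for v, (d, _) in updated.items()}
        nodes = updated


adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}
distances = bfs_distances(adj, "a")
```

Each pass of the loop corresponds to one MapReduce job, so the number of jobs grows with the longest shortest path from the source, which is one reason this style is slow on graphs with a large diameter.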
You may want to look at the Pegasus project from Carnegie Mellon University. They provide an efficient - and elegant - implementation using MapReduce. They also provide binaries, samples and a very detailed documentation.
The implementation itself is based on the Generalized Iterative Matrix-Vector multiplication (GIM-V) proposed by U Kang in 2009.
U Kang, Charalampos E. Tsourakakis, Christos Faloutsos. PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. In IEEE International Conference on Data Mining (ICDM 2009).
EDIT:
The official implementation is actually limited to 2.1 billion nodes (node IDs are stored as integers). I'm creating a fork on GitHub (https://github.com/placeiq/pegasus) to share my patch and other enhancements (e.g. Snappy compression).
This is a rather old question, but here is something you may want to check out. We implemented connected components using map-reduce on the Spark platform.
https://github.com/kwartile/connected-component

query language for graph sets: data modeling question

Suppose I have a set of directed graphs. I need to query those graphs. I would like to get a feeling for my best choice for the graph modeling task. So far I have these options, but please don't hesitate to suggest others:
Proprietary implementation (matrix) and graph traversal algorithms.
RDBMS and SQL option (too space consuming)
RDF and SPARQL option (too slow)
What would you guys suggest? Regards.
EDIT: Just to answer Mad's questions:
Each one is relatively small, no more than 200 vertices, 400 edges. However, there are hundreds of them.
Frequency of querying: hard to say, it's an experimental system.
Speed: not real time, but practical, say 4-5 seconds tops.
You didn't give us enough information to respond with a well thought out answer. For example: what size are these graphs? With what frequencies do you expect to query these graphs? Do you need real-time response to these queries? More information on what your application is for, what is your purpose, will be helpful.
Anyway, to counter the usual responses that suppose SQL-based DBMSes are unable to handle graphs structures effectively, I will give some references:
Graph Transformation in Relational Databases (.pdf), by G. Varro, K. Friedl, D. Varro, presented at International Workshop on Graph-Based Tools (GraBaTs) 2004;
5 Conclusion and Future Work
In the paper, we proposed a new graph transformation engine based on off-the-shelf
relational databases. After sketching the main concepts of our approach, we carried
out several test cases to evaluate our prototype implementation by comparing it to
the transformation engines of the AGG [5] and PROGRES [18] tools.
The main conclusion that can be drawn from our experiments is that relational
databases provide a promising candidate as an implementation framework for graph
transformation engines. We call attention to the fact that our promising experimental
results were obtained using a worst-case assessment method i.e. by recalculating
the views of the next rule to be applied from scratch which is still highly inefficient,
especially, for model transformations with a large number of independent matches
of the same rule. ...
They used PostgreSQL as DBMS, which is probably not particularly good at this kind of applications. You can try LucidDB and see if it is better, as I suspect.
Incremental SQL Queries (more than one paper here; you should concentrate on "Maintaining Transitive Closure of Graphs in SQL"):
".. we showed that transitive closure, alternating paths, same generation, and other recursive queries, can be maintained in SQL if some auxiliary relations are allowed. In fact, they can all be maintained using at most auxiliary relations of arity 2. .."
Incremental Maintenance of Shortest Distance and Transitive Closure in First Order Logic and SQL.
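As a small illustration that graph queries are expressible in SQL at all, here is a sketch of a reachability query using a recursive common table expression, run through Python's sqlite3 module. Note that this is a different technique from the papers above: it recomputes reachability from scratch on every call instead of incrementally maintaining auxiliary relations. The edge table, its columns, and the reachable helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per directed edge; graph_id separates the individual graphs.
conn.execute("CREATE TABLE edge (graph_id INTEGER, src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?)",
    [(1, "a", "b"), (1, "b", "c"), (1, "c", "a"), (1, "c", "d"), (2, "x", "y")],
)

# UNION (not UNION ALL) drops duplicates, so cycles terminate.
REACHABLE = """
WITH RECURSIVE reach(node) AS (
    SELECT :start
    UNION
    SELECT e.dst
    FROM edge AS e JOIN reach AS r ON e.src = r.node
    WHERE e.graph_id = :graph_id
)
SELECT node FROM reach WHERE node <> :start ORDER BY node
"""


def reachable(conn, graph_id, start):
    """Other vertices reachable from start inside one graph."""
    rows = conn.execute(REACHABLE, {"graph_id": graph_id, "start": start})
    return [node for (node,) in rows]


from_a = reachable(conn, 1, "a")
from_y = reachable(conn, 2, "y")
```

For graphs of a few hundred vertices, as described in the question, a query like this is likely to be fast enough to try before committing to a dedicated graph store, but that is something to measure rather than assume.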
Edit: you gave more details, so... I think the best way is to experiment a little with both a dedicated main-memory graph library and a DBMS-based solution, then carefully evaluate the pros and cons of both.
For example: a DBMS needs to be installed (unless you use an "embeddable" DBMS like SQLite), and only you know if and where your application needs to be deployed and who your users are. On the other hand, a DBMS gives you immediate benefits such as persistence (I don't know what support graph libraries give for persisting their graphs), transaction management, and countless others. Are these relevant for your application? Again, only you know.
The first option you mentioned seems best. If your graphs won't have many edges (|E|=O(|V|)), then you might get better time and space complexity using a Dictionary instead of a matrix:
var graph = new Dictionary<Vertex, HashSet<Vertex>>();
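For comparison, here is the same dictionary-of-sets layout sketched in Python, together with a simple breadth-first traversal as an example query. Storage is proportional to |V| + |E| rather than |V|^2 as with a matrix. The helper names add_edge and reachable_from are made up for this example.

```python
from collections import deque


def add_edge(graph, src, dst):
    """Insert a directed edge, creating both vertices if needed."""
    graph.setdefault(src, set()).add(dst)
    graph.setdefault(dst, set())


def reachable_from(graph, start):
    """Breadth-first search: all vertices reachable from start, inclusive."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in graph.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


g = {}
for s, d in [("a", "b"), ("b", "c"), ("d", "a")]:
    add_edge(g, s, d)
result = reachable_from(g, "a")
```

With hundreds of graphs of at most a few hundred vertices each, one such dictionary per graph fits comfortably in memory.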
An interesting graph library is QuickGraph. Never used it but it seems promising :)
I wrote and designed quite a few graph algorithms for various programming contests and in production code. And I noticed that every time I need one, I have to develop it from scratch, assembling together concepts from graph theory (BFS, DFS, topological sorting etc).
Perhaps a lack of experience is a reason, but it seems to me that there's still no reasonable general-purpose query language to solve graph problems. Pick a couple of general-purpose graph libraries and solve your particular task in a programming (not query!) language. That will give you best performance and space consumption, but will also require understanding of graph theory basic concepts and of their limitations.
And the last one: do not use SQL for graphs.

Resources