Finding Connected Components using Hadoop/MapReduce

I need to find connected components in a huge dataset (the graph is undirected).
One obvious choice is MapReduce, but I'm a newbie to MapReduce and am quite short of time to pick it up and code it myself.
I was wondering if there is an existing API for this, since it is a very common problem in social network analysis.
Or, at least, is anyone aware of a reliable (tried and tested) source that I can use to get started on the implementation myself?
Thanks

I blogged about it for myself:
http://codingwiththomas.blogspot.de/2011/04/graph-exploration-with-hadoop-mapreduce.html
But MapReduce isn't a good fit for this kind of graph analysis. It is better to use BSP (bulk synchronous parallel) for that; Apache Hama provides a good graph API on top of Hadoop HDFS.
I've written a connected components algorithm with MapReduce here: (Mindist search)
https://github.com/thomasjungblut/tjungblut-graph/tree/master/src/de/jungblut/graph/mapreduce
Also a BSP version for Apache Hama can be found here:
https://github.com/thomasjungblut/tjungblut-graph/blob/master/src/de/jungblut/graph/bsp/MindistSearch.java
The implementation isn't as difficult as in MapReduce and it is at least 10 times faster.
If you're interested, check out the latest version in TRUNK and visit our mailing list.
http://hama.apache.org/
http://apache.org/hama/mail-lists.html
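
To give a feel for the approach, here is a minimal single-machine sketch of minimum-label propagation in map/reduce rounds. It is not the code from the repositories linked above; the function names are made up for illustration, and I am assuming the general idea is that each vertex repeatedly adopts the smallest vertex ID seen in its neighbourhood until nothing changes.

```python
from collections import defaultdict

def map_phase(labels, adjacency):
    """Map: every vertex sends its current label to itself and to each neighbour."""
    for vertex, label in labels.items():
        yield vertex, label
        for neighbour in adjacency[vertex]:
            yield neighbour, label

def reduce_phase(messages):
    """Reduce: every vertex keeps the smallest label it received."""
    best = {}
    for vertex, label in messages:
        if vertex not in best or label < best[vertex]:
            best[vertex] = label
    return best

def connected_components(edges):
    """Return {vertex: component_label} for an undirected edge list."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    labels = {v: v for v in adjacency}  # start with every vertex as its own label
    while True:
        new_labels = reduce_phase(map_phase(labels, adjacency))
        if new_labels == labels:        # converged: no label changed in this round
            return labels
        labels = new_labels

print(connected_components([(1, 2), (2, 3), (4, 5)]))
```

Each loop iteration corresponds to one MapReduce job, so the number of jobs grows with the diameter of the largest component. That per-iteration job overhead is one reason a BSP-style framework can be a better fit.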

I don't really know if an API is available which has methods to find strongly connected components. But, I implemented the BFS algorithm to find distance from source node to all other nodes in the graph (the graph was a directed graph as big as 65 million nodes).
The idea was to explore the neighbors (distance of 1) of each node in one iteration and feed the output of reduce back to map until the distances converge. The map emits the shortest distances possible from each node, and reduce updates the node with the shortest distance from the list.
I would suggest checking this out. Also, this could help. These two links should give you the basic idea about graph algorithms in the MapReduce paradigm (if you are not already familiar with it). Essentially, you need to twist the algorithm to use DFS instead of BFS.
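
The iteration described above can be sketched roughly like this on a single machine. This is only an illustration of the idea, not my original Hadoop code; the function names and the dict-based graph format are assumptions made for the example.

```python
import math

def bfs_map(distances, adjacency):
    """Map: emit each node's own distance plus a candidate distance for each neighbour."""
    for node, dist in distances.items():
        yield node, dist
        if dist != math.inf:
            for neighbour in adjacency.get(node, ()):
                yield neighbour, dist + 1

def bfs_reduce(pairs):
    """Reduce: keep the shortest candidate distance per node."""
    shortest = {}
    for node, dist in pairs:
        shortest[node] = min(dist, shortest.get(node, math.inf))
    return shortest

def bfs_distances(adjacency, source):
    """Feed reduce output back into map until the distances stop changing."""
    nodes = set(adjacency) | {n for targets in adjacency.values() for n in targets}
    distances = {n: math.inf for n in nodes}
    distances[source] = 0
    while True:
        updated = bfs_reduce(bfs_map(distances, adjacency))
        if updated == distances:
            return distances
        distances = updated

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}
print(bfs_distances(graph, "a"))
```

Nodes that cannot be reached from the source keep an infinite distance.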

You may want to look at the Pegasus project from Carnegie Mellon University. They provide an efficient and elegant implementation using MapReduce. They also provide binaries, samples and very detailed documentation.
The implementation itself is based on the Generalized Iterative Matrix-Vector multiplication (GIM-V) proposed by U Kang in 2009.
U Kang, Charalampos E. Tsourakakis, Christos Faloutsos. PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. In IEEE International Conference on Data Mining (ICDM 2009).
EDIT:
The official implementation is actually limited to 2.1 billion nodes (node IDs are stored as integers). I'm creating a fork on GitHub (https://github.com/placeiq/pegasus) to share my patch and other enhancements (e.g. Snappy compression).

This is a fairly old question, but here is something you may want to check out. We implemented connected components using a map-reduce approach on the Spark platform.
https://github.com/kwartile/connected-component

Related

fast graph partitioning for mpi parallel

I am new to graph partitioning, but I think the question I am asking should already have a good answer. I just want to partition a huge network (billions of nodes) into a few sub-graphs, so that, when using MPI, each sub-graph is processed by a different processor. I am currently using the adjacency list representation of the graph. What algorithms can do this? Thank you!

Yes, you can do this, and there are several open source tools available. The tool I use most frequently is parMETIS.
It is an MPI-based parallel library which provides a variety of functions, including graph partitioning. How you use this library depends entirely on your application. Generally, I prefer feeding the input graph to parMETIS, obtaining the partition and then feeding the partitions as input to my MPI programs; however, you could also call the functions from your application for graphs which change in real time.
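
Since you already have an adjacency list, the main preparation step is converting it to the compressed (CSR) arrays that, as far as I know, the METIS family expects, conventionally named xadj and adjncy. Below is a small sketch of that conversion; the helper name to_csr is made up for illustration, and the distributed parMETIS routines need additional arguments describing how vertices are spread across ranks, so check the manual for the exact call.

```python
def to_csr(adjacency):
    """Convert an adjacency list into CSR arrays.

    adjacency[v] is the list of neighbours of vertex v (0-based IDs).
    The neighbours of vertex v end up in adjncy[xadj[v]:xadj[v + 1]].
    """
    xadj = [0]
    adjncy = []
    for neighbours in adjacency:
        adjncy.extend(neighbours)
        xadj.append(len(adjncy))
    return xadj, adjncy

# Undirected star: vertex 0 connected to 1 and 2 (each edge listed in both directions).
xadj, adjncy = to_csr([[1, 2], [0], [0]])
print(xadj, adjncy)
```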

GraphX Explanation

I have a couple of fundamental questions related to GraphX on Spark
Is there a resource that can help me understand how GraphX works under the covers in terms of
- how is parallelism done
- how is the graph partitioned
- can any graph algorithm be implemented in GraphX, or only specific kinds of problems? For example, for bipartite graphs, can we write a matching algorithm using path augmentation?
I have basic working knowledge of GraphX and the methods and operators available there, and I have worked through the basic problems in the examples using Scala.
Any help would be much appreciated.

(Answer was provided to me by Michael Malak, author of the upcoming book GraphX in Action, Manning Press.)
These are great questions, and ones I should make sure are addressed in the book.
Three major caveats to GraphX:
1. It's graph processing, not a graph database (this one is already mentioned in the book)
2. It's suited for massively parallel vertex-to-vertex communications in a SIMD-style execution model. It is not suited for classic graph algorithms, which is why the implementations in chapter 6 are not a great fit for GraphX
3. The dirty little secret is that although there is API control to partition the vertices (PartitionStrategy), edges are always randomly partitioned. Worst of all, edges and vertices are partitioned independently, so all opportunity for data locality is lost.
There is, however, a slightly unexpected optimization intrinsic to GraphX internals, and that is that each edge has routing information to the vertices.

Improve a puzzle game AI using machine learning

My motivation for asking this question is that I have found an interesting problem using machine learning on a graph data set. There are papers out there on this subject. For example, "Learning from Labeled and Unlabeled Data on a Directed Graph" (Zhou, Huang, Scholkopf). However I do not have a background in artificial intelligence or machine learning so I would like to write a smaller program for a more general audience before working on anything scientific.
Several years ago I wrote a game called Solumns. It is an evil variant of the classic Sega game Columns. Inspired by bastet, it bruteforces for colour combinations disadvantageous to the player. It is hard.
I would like to improve upon its AI. I figure the game space (grid of coloured blocks, column positions, column colours) fits a graph structure better than a list of attributes. If that is the case, then this problem is similar to my research problem.
I am considering using the following high-level plan to solve this problem:
1. I think it would be useful if the AI opponent could assign a fitness rating to a possible move based on more data than the number of existing squares on the board after the move. I'm thinking of using a categoriser: train on the move and all past moves, using the course of the rest of the game as a measure of success.
2. I am also thinking of developing a player bot that can beat the standard AI opponent. This could be useful when generating data for 1.
3. Use a sample of the player bot's games to build an AI that beats the strategic player. Maybe use this data for 1, too.
4. Write a fun AI that delegates to a possible combination of 1, 3, and the original AI, when appropriate, which I will determine using experimentation to find heuristic fudge factors.
To build the player bot, I figured I could use brute force to compute the sample space. Then use machine learning techniques such as those used in building Random Forests to create some kind of decision maker.
Building the AI opponent is where I am most perplexed.
Specific questions then:
Rating moves sounds like the kind of thing people do with chess, and although I'll admit my approach may be ignorant, there is a lot about this in literature and I can learn from that. Question is, should the player bot and AI opponent create the data sample? It sounds like I'm getting confused between different sample sets, which sounds like a recipe for bad training. Should I just play the game a bunch?
What kind of algorithm should I consider for training the player bot against the current AI?
What kind of algorithm should I consider for training an AI opponent against the player bot?
Extra info:
I am deliberately not asking if this strategy is a good fit for programming a game AI. Sure, you might be able to write a great AI more simply (after all it is already difficult to play). I want to learn about machine learning by doing something fun.
The original game was written in a mixture of Racket and C. I am porting it to JRuby for various reasons, likely with extensions or RPC calls to another, faster language. I am not so interested in existing language-specific solutions here. I want to develop skills in this area and am not afraid to implement an algorithm for myself.
You can get the source for the original game here

I would not go for machine learning here. Look at game playing AIs.
You have an adversarial game (like Go) with two asymmetric players:
The user who places the pieces,
and the computer who chooses the pieces (instead of choosing pieces by chance).
I would probably start with Monte Carlo Tree Search.
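
To make that a bit more concrete, here is a sketch of the UCB1 rule that is commonly used in the selection step of Monte Carlo Tree Search (the UCT variant). It is only the selection step, not a full search, and the tuple layout and function names are assumptions chosen for the example.

```python
import math

def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Average reward (exploitation) plus a bonus for rarely tried moves (exploration)."""
    if visits == 0:
        return math.inf  # always try an unvisited move first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits):
    """children is a list of (move, total_reward, visits); return the move to explore next."""
    return max(children, key=lambda ch: ucb1(ch[1], ch[2], parent_visits))[0]

children = [("left", 6.0, 10), ("right", 1.0, 1)]
print(select_child(children, parent_visits=11))
```

In your game the two players would use different move sets (placing a column versus choosing its colours), but the same selection rule can be applied at both kinds of node.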

RaptorQ FEC Implementation Obstacle

I am trying to implement the RaptorQ Forward Error Correction Scheme in java as specified here:
https://datatracker.ietf.org/doc/html/draft-ietf-rmt-bb-fec-raptorq-04#section-5.3.3
The core of the problem is to perform Gaussian elimination on a matrix A in a smart way so that it is fast.
The matrix A is composed of submatrices, among others these are G_LDPC,1 and G_LDPC,2.
(Generator matrices for Low Density Parity Checks)
On page 22, in section "5.3.3.3. Pre-coding relationships", it is stated that these matrices can be deduced from the code snippet on the same page.
My problem: I am not able to derive the structure of these two submatrices from the code snippet.
Does someone see how to do that, or what the structure looks like?
Thanks for any kind of help!
Max

I'm also trying to implement RaptorQ, and ran into exactly the same problem. My suggestion is this book:
Raptor Codes (Foundations and Trends(R) in Communications and Information Theory) [Paperback]
Amin Shokrollahi (Author), Michael Luby (Author)
It has a better explanation on constructing the constraint matrix in section 3.3.3 (I'd quote it, but I don't have it digital).
@Max, is there any way we can chat, or can you share your RFC 5053 implementation? I really could use someone familiar with these difficulties to talk to and share some doubts/ideas.

After being stuck with the problem, I decided to implement the Raptor codec according to RFC 5053 as described here:
https://www.rfc-editor.org/rfc/rfc5053
This is actually the predecessor version of RaptorQ.
The general working principle seems to be the same, but it is less optimized and therefore has worse properties, especially in terms of reception efficiency.
On the other hand, it was less complex and more intuitive to me, and therefore I was able to code a working implementation in Java.
And after all, I have to admit that I'm very astonished by the capabilities of the resulting codec!
With the deeper understanding gained while coding the RFC 5053 implementation, I would probably also be able to implement the RaptorQ codec now.
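
Coming back to the original question, this is how I would now read the LDPC pseudocode, sketched in Python rather than Java. It is based on my reading of the published RaptorQ specification (RFC 6330, section 5.3.3.3) and I am assuming the draft linked above uses the same loops, so please verify against the text. The function name is made up, and S, B and P must come from the parameter derivation in the spec; the values in the example are toy values, not a valid RaptorQ configuration.

```python
def ldpc_matrices(S, B, P):
    """Build G_LDPC,1 (S x B) and G_LDPC,2 (S x P) over GF(2) as lists of rows.

    Each "D[b] = D[b] + C[i]" in the pseudocode means row b has a 1 in column i.
    Addition is XOR, so entries are toggled rather than set.
    """
    g1 = [[0] * B for _ in range(S)]
    for i in range(B):
        a = 1 + i // S
        b = i % S
        for _ in range(3):          # three additions of C[i] per column
            g1[b][i] ^= 1
            b = (b + a) % S
    g2 = [[0] * P for _ in range(S)]
    for i in range(S):              # D[i] = D[i] + C[W+a] + C[W+b]
        g2[i][i % P] ^= 1
        g2[i][(i + 1) % P] ^= 1
    return g1, g2

g1, g2 = ldpc_matrices(S=7, B=10, P=5)
for row in g1:
    print(row)
```

Read this way, every column of G_LDPC,1 has three ones spaced a rows apart (a circulant-like pattern that changes every S columns), and every row of G_LDPC,2 has two adjacent ones, wrapping around modulo P.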

query language for graph sets: data modeling question

Suppose I have a set of directed graphs. I need to query those graphs. I would like to get a feeling for my best choice for the graph modeling task. So far I have these options, but please don't hesitate to suggest others:
Proprietary implementation (matrix) and graph traversal algorithms.
RDBMS and SQL option (too space consuming)
RDF and SPARQL option (too slow)
What would you guys suggest? Regards.
EDIT: Just to answer Mad's questions:
Each one is relatively small, no more than 200 vertices, 400 edges. However, there are hundreds of them.
Frequency of querying: hard to say, it's an experimental system.
Speed: not real time, but practical, say 4-5 seconds tops.

You didn't give us enough information to respond with a well thought out answer. For example: what size are these graphs? How frequently do you expect to query them? Do you need real-time responses to these queries? More information on what your application is for and what your purpose is would be helpful.
Anyway, to counter the usual responses that suppose SQL-based DBMSes are unable to handle graph structures effectively, I will give some references:
Graph Transformation in Relational Databases (.pdf), by G. Varro, K. Friedl, D. Varro, presented at International Workshop on Graph-Based Tools (GraBaTs) 2004;
5 Conclusion and Future Work
In the paper, we proposed a new graph transformation engine based on off-the-shelf
relational databases. After sketching the main concepts of our approach, we carried
out several test cases to evaluate our prototype implementation by comparing it to
the transformation engines of the AGG [5] and PROGRES [18] tools.
The main conclusion that can be drawn from our experiments is that relational
databases provide a promising candidate as an implementation framework for graph
transformation engines. We call attention to the fact that our promising experimental
results were obtained using a worst-case assessment method i.e. by recalculating
the views of the next rule to be applied from scratch which is still highly inefficient,
especially, for model transformations with a large number of independent matches
of the same rule. ...
They used PostgreSQL as the DBMS, which is probably not particularly good at this kind of application. You can try LucidDB and see if it is better, as I suspect.
Incremental SQL Queries (more than one paper here; you should concentrate on "Maintaining Transitive Closure of Graphs in SQL"):
"... we showed that transitive closure, alternating paths, same generation, and other recursive queries, can be maintained in SQL if some auxiliary relations are allowed. In fact, they can all be maintained using at most auxiliary relations of arity 2. ..."
Incremental Maintenance of Shortest Distance and Transitive Closure in First Order Logic and SQL.
Edit: you gave more details, so... I think the best way is to experiment a little with both a dedicated main-memory graph library and a DBMS-based solution, then carefully evaluate the pros and cons of both.
For example: a DBMS needs to be installed (if you don't use an "embeddable" DBMS like SQLite), and only you know if/where your application needs to be deployed and who your users are. On the other hand, a DBMS gives you immediate benefits, like persistence (I don't know what support graph libraries give for persisting their graphs), transaction management and countless others. Are these relevant for your application? Again, only you know.
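
If you want a quick feel for the DBMS route, here is a small experiment with an embedded SQLite database driven from Python. It is only a sketch: the table and column names are made up, and it uses a recursive common table expression (a different technique from the incremental maintenance described in the papers above) to find every vertex reachable from a start vertex.

```python
import sqlite3

def reachable(edges, start):
    """Return the sorted list of vertices reachable from start in a directed graph."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
    conn.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT ?
            UNION  -- UNION (not UNION ALL) drops duplicates, so cycles terminate
            SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.node
        )
        SELECT node FROM reach ORDER BY node
        """,
        (start,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

print(reachable([(1, 2), (2, 3), (3, 1), (4, 5)], 1))
```

With graphs of a few hundred vertices each, a query like this should be easy to time against an in-memory library before committing to either approach.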

The first option you mentioned seems best. If your graph won't have many edges (|E|=O(|V|)), then you might get better time and space complexity using a Dictionary:
var graph = new Dictionary<Vertex, HashSet<Vertex>>();
An interesting graph library is QuickGraph. Never used it but it seems promising :)
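
For comparison, here is the same dictionary-of-sets idea as a Python sketch, with a breadth-first path query on top. The helper names are made up for illustration; the point is only that for sparse graphs of this size, an adjacency map plus plain traversal code goes a long way.

```python
from collections import deque

def build_graph(edges):
    """Directed graph as {vertex: set of successors}; memory is O(|V| + |E|)."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())   # make sure sink vertices are present too
    return graph

def has_path(graph, start, goal):
    """Breadth-first search from start; True if goal is reachable."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

g = build_graph([("a", "b"), ("b", "c"), ("d", "a")])
print(has_path(g, "a", "c"), has_path(g, "c", "a"))
```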

I wrote and designed quite a few graph algorithms for various programming contests and in production code. And I noticed that every time I need one, I have to develop it from scratch, assembling together concepts from graph theory (BFS, DFS, topological sorting etc).
Perhaps a lack of experience is a reason, but it seems to me that there's still no reasonable general-purpose query language to solve graph problems. Pick a couple of general-purpose graph libraries and solve your particular task in a programming (not query!) language. That will give you best performance and space consumption, but will also require understanding of graph theory basic concepts and of their limitations.
And the last one: do not use SQL for graphs.
