Hadoop MapReduce implementation of shortest PATH in a graph, not just the distance

I have been looking for a MapReduce implementation of shortest-path search algorithms.
However, every example I could find computes only the shortest distance from node x to node y; none outputs the actual shortest path, such as x-a-b-c-y.
What I am trying to achieve: I have graphs with hundreds of thousands of nodes, and I need to perform frequent pattern analysis on the shortest paths among the various nodes. This is for a research project I am working on.
It would be a great help if someone could point me to an implementation (if one exists) or give some pointers on how to hack the existing SSSP implementations to generate the paths along with the distances.

Basically these implementations work with some kind of messaging: messages are sent to HDFS between the map and reduce stages.
In the reducer they are grouped and filtered by distance, and the lowest distance wins. Whenever the distance is updated, you also have to record the vertex (well, probably some ID) that the winning message came from.
So you have an additional space requirement per vertex, but you can reconstruct the shortest path to any vertex by following these IDs back to the source.
Based on your comment:
yes probably
I will need to write another class of the vertex object to hold this additional information. Thanks for the tip, though it would be very helpful if you could point out where and when I can capture this information of where the minimum weight came from, anything from your blog maybe :-)
Yea, could be a quite cool theme, also for Apache Hama. Most of the implementations are just considering the costs not the real path. In your case (from the blog you've linked above) you will have to extract a vertex class which actually holds the adjacent vertices as LongWritable (maybe a list instead of this split on the text object) and simply add a parent or source id as field (of course also LongWritable).
You will set this when propagating in the mapper, that is, in the for loop that iterates over the adjacent vertices of the current key node.
In the reducer you update the lowest distance while iterating over the grouped values; at that point you have to set the source vertex of the key vertex to the vertex whose message produced the new minimum.
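To make the mapper/reducer bookkeeping concrete, here is a minimal in-memory sketch in plain Python. It is not Hadoop code and not the code from the blog; the names map_phase, reduce_phase, sssp and path_to are made up for illustration, and it assumes positive edge weights and that every vertex appears as a key in the input.

```python
from collections import defaultdict

INF = float("inf")

def map_phase(vertices):
    """Emit each vertex's own state plus one (distance, sender) message per neighbour."""
    for vid, (dist, parent, adj) in vertices.items():
        yield vid, ("STATE", dist, parent, adj)
        if dist < INF:
            for nbr, weight in adj.items():
                # The sender id travels with the candidate distance.
                yield nbr, ("MSG", dist + weight, vid)

def reduce_phase(grouped):
    """Keep the lowest distance per vertex and remember which vertex sent it."""
    result, changed = {}, False
    for vid, values in grouped.items():
        dist, parent, adj = INF, None, {}
        for value in values:
            if value[0] == "STATE":
                _, dist, parent, adj = value
        for value in values:
            if value[0] == "MSG" and value[1] < dist:
                dist, parent, changed = value[1], value[2], True
        result[vid] = (dist, parent, adj)
    return result, changed

def sssp(graph, source):
    """Repeat map/reduce rounds until no distance changes (hypothetical driver)."""
    vertices = {v: (0 if v == source else INF, None, dict(adj))
                for v, adj in graph.items()}
    changed = True
    while changed:
        grouped = defaultdict(list)
        for key, value in map_phase(vertices):
            grouped[key].append(value)
        vertices, changed = reduce_phase(grouped)
    return vertices

def path_to(vertices, target):
    """Follow the stored parent ids back to the source."""
    path = []
    while target is not None:
        path.append(target)
        target = vertices[target][1]
    return path[::-1]
```

Calling sssp(graph, "x") on an adjacency dict such as {"x": {"a": 1}, "a": {}} returns a dict of (distance, parent, adjacency) tuples, and path_to walks the parent field. For an unreachable vertex the distance stays infinite and path_to returns just that vertex, so check the distance first.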
You can actually get some of the vertices classes here from my blog:
Source
or directly from the repository:
Source
Maybe it helps you, it is quite unmaintained so please come back to me if you have a specific question.
Here is the same algorithm in BSP with Apache Hama:
https://github.com/thomasjungblut/tjungblut-graph/blob/master/src/de/jungblut/graph/bsp/SSSP.java

Related

Jagged Result Array Gremlin Query

Please may you help me to write a query that returns each source vertex in my traversal along with its associated edges and vertices as arrays on each such source vertex? In short, I need a result set comprising an array of 3-tuples with item 1 of each tuple being the source vertex and items 2 and 3 being the associated arrays.
Thanks!
EDIT 1: Expanded on the graph data and added my current problem query.
EDIT 2: Improved Gremlin sample graph code (apologies, didn't think anyone would actually run it.)
Sample Graph
g.addV("blueprint").property("name","Mall").
addV("blueprint").property("name","HousingComplex").
addV("blueprint").property("name","Airfield").
addV("architect").property("name","Tom").
addV("architect").property("name","Jerry").
addV("architect").property("name","Sylvester").
addV("buildingCategory").property("name","Civil").
addV("buildingCategory").property("name","Commercial").
addV("buildingCategory").property("name","Industrial").
addV("buildingCategory").property("name","Military").
addV("buildingCategory").property("name","Residential").
V().has("name","Tom").addE("designed").to(V().has("name","HousingComplex")).
V().has("name","Tom").addE("assisted").to(V().has("name","Mall")).
V().has("name","Jerry").addE("designed").to(V().has("name","Airfield")).
V().has("name","Jerry").addE("assisted").to(V().has("name","HousingComplex")).
V().has("name","Sylvester").addE("designed").to(V().has("name","Mall")).
V().has("name","Sylvester").addE("assisted").to(V().has("name","Airfield")).
V().has("name","Sylvester").addE("assisted").to(V().has("name","HousingComplex")).
V().has("name","Mall").addE("classification").to(V().has("name","Commercial")).
V().has("name","HousingComplex").addE("classification").to(V().has("name","Residential")).
V().has("name","Airfield").addE("classification").to(V().has("name","Civil"))
Please note that the above is a very simplified rendering of our data.
Needed Query Results
I need to bring back each blueprint vertex as a base with each of its associated edges / vertices as arrays.
My Current Solution
Currently I do this very cumbersome query that gets the blueprints and assigns a label, gets the architects and assigns a label, then selects both labels. The solution is ok; however, it gets messy when I need to include edges or I need to get blueprint classification vertices (industrial, military, residential, commercial, etc.). In effect, the more associated data that I need to pull back for each blueprint, the sloppier my solution becomes.
My current query looks something like this:
g.V().hasLabel("blueprint").as("blueprints").
outE().or(hasLabel("designed"),hasLabel("assisted")).inV().as("architects").
select("blueprints").coalesce(out("classification"),constant()).as("classifications").
select("blueprints","architects","classifications")
The above produces a lot of duplication. If the number of: blueprints is b, architects is a, and classifications is c, the result set comprises b * a * c results. I'd like one blueprint with an array of its associated architects and an array of its associated classifications, if any.
Complications
I'm trying to do this in one query so that I can get all blueprint data from the graph to populate a filtered list. Once I have the list comprising all of the vertices, edges, and their properties, users can then click links to blobs, browse to project sites, etc. Accordingly, I've got pagination as well as filtering to think about and I'd prefer to make one trip to the server each time I get a new page or the filters change.
I figured out an answer; however, it quadruples the compute charge for the query. Not sure if this can be optimized further.
g.V().hasLabel("blueprint").
project("blueprints","architects").
by().
by(outE().or(hasLabel("designed"),hasLabel("assisted")).inV().dedup().fold())
I just solved for blueprints and architects, but classifications just needs another by(...traversal...) and projection label.
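For anyone unsure what shape the project()/fold() version returns compared with the select() version, here is a rough plain-Python analogy. It does not talk to a graph server; the function name project_blueprints and the cut-down edge lists are made up for illustration. The point is one row per blueprint with deduplicated lists, instead of b * a * c flat rows.

```python
from collections import defaultdict

def project_blueprints(blueprints, architect_edges, classification_edges):
    """Return one row per blueprint, with its related names folded into lists."""
    architects = defaultdict(list)
    for blueprint, architect in architect_edges:
        if architect not in architects[blueprint]:  # mimics dedup() before fold()
            architects[blueprint].append(architect)
    classifications = defaultdict(list)
    for blueprint, category in classification_edges:
        if category not in classifications[blueprint]:
            classifications[blueprint].append(category)
    return [
        {
            "blueprint": blueprint,
            "architects": architects[blueprint],
            # Stays an empty list when a blueprint has no classification.
            "classifications": classifications[blueprint],
        }
        for blueprint in blueprints
    ]

# Hypothetical, cut-down copy of the sample data as (blueprint, related name) pairs.
rows = project_blueprints(
    ["Mall", "HousingComplex", "Airfield"],
    [("Mall", "Sylvester"), ("Mall", "Tom"), ("HousingComplex", "Tom"),
     ("HousingComplex", "Jerry"), ("Airfield", "Jerry"), ("Airfield", "Jerry")],
    [("Mall", "Commercial"), ("Airfield", "Civil")],
)
```

Each extra by(...fold()) in the Gremlin query corresponds to one more list-valued key per row here, so the row count stays at b no matter how many kinds of associated data are added.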
I may have to just get the blueprints in one query, get each of their associated items in parallel queries, then put it all together in the API. That would be very bad design for the API data layer but may be necessary for performance reasons.

Gremlin: how to get all of the graph structure surrounding a single vertex into a subgraph

I would like to get all of the graph structure surrounding a single vertex into a subgraph.
The TinkerPop documentation shows how to do this for a fixed number of traversal steps.
My question: are there any recipes for getting the entire surrounding structure (which may include cycles) without knowing ahead of time how many traversal steps are needed?
I have updated this question for the benefit of anyone who might land on it. Here is a Gremlin statement that will capture an arbitrary graph structure surrounding a vertex with id='xxx':
g.V('xxx').repeat(out().simplePath()).until(outE().count().is(eq(0))).path()
-- this incorporates Stephen Mallette's suggestion to use simplePath within the repeat step.
That example uses repeat()...times() but you could easily replace times() with until() and not know the number of steps ahead of time. See more information in the Reference Documentation on how repeat() works to see how to express different types of flow control and traversal stop conditions.
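If it helps to see the stop condition outside of Gremlin, the same idea (keep following outgoing edges until nothing new is reachable, without revisiting vertices so that cycles terminate) can be sketched as a breadth-first walk in plain Python. This is only an analogy over an adjacency dict, not TinkerPop API; surrounding_subgraph is a made-up name.

```python
from collections import deque

def surrounding_subgraph(adjacency, start):
    """Collect every vertex and edge reachable from start via outgoing edges."""
    seen = {start}
    edges = []
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbour in adjacency.get(vertex, ()):
            edges.append((vertex, neighbour))
            if neighbour not in seen:  # the visited set is what makes cycles safe
                seen.add(neighbour)
                queue.append(neighbour)
    return seen, edges
```

Each vertex is expanded once, so the number of steps never has to be known in advance. Unlike path(), this returns the set of vertices and edges rather than one row per traversal path.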

Working with redis-graph

I'm a beginner in redis-graph, and presently I'm working on a K-shortest path algorithm implemented in Java (where the graph is created using a hashmap). As the dataset is quite large (27 million rows), I need a database to store the graph, and for that reason I plan to use redis-graph; however, redis-graph uses the Cypher query language. How can I integrate the two?
Any other suggestion(s) would be welcome.
Although you can use RedisGraph to hold the graph for you, at the moment there's no way of finding K shortest paths from node A to node B. I've implemented a shortest path algorithm within RedisGraph but have yet to expose it to clients. I'm not sure of the approach you had in mind for finding K shortest paths; I've implemented one using a cost edge flow-network, you can find my javascript implementation here.
I might include a k-shortest path algorithm within RedisGraph, but I need some time to think about that. In any case, finding K shortest paths is not possible using the current subset of Cypher supported by RedisGraph.
You might be able to retrieve a relevant sub-graph from RedisGraph into your Java application, find path i out of K, and, once no additional paths can be found, extend that sub-graph by retrieving additional nodes / edges from RedisGraph.
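As a rough idea of what the client-side part could look like once a sub-graph has been pulled out, here is a minimal Python sketch that enumerates the K cheapest loop-free paths with a priority queue. It is a plain best-first enumeration, not the flow-network approach mentioned above and not anything RedisGraph provides; it assumes non-negative edge weights and a sub-graph held as a dict of dicts, and k_shortest_paths is a made-up name.

```python
import heapq

def k_shortest_paths(graph, source, target, k):
    """Return up to k (cost, path) pairs for loop-free paths, cheapest first."""
    heap = [(0, [source])]
    found = []
    while heap and len(found) < k:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            found.append((cost, path))
            continue
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in path:  # keep paths simple (no repeated nodes)
                heapq.heappush(heap, (cost + weight, path + [neighbour]))
    return found
```

Because partial paths are popped in order of cost and weights are non-negative, complete paths come out cheapest first. The queue can grow exponentially on dense graphs, so on 27 million rows this is only workable on a small retrieved sub-graph; if fewer than K paths come back, that is the cue to fetch more nodes / edges and retry.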

Graph building algorithm given an infinite walk

I need help writing a resilient, mapping (graph building) algorithm. Here is the problem:
Imagine you have a text-oriented virtual reality (TORG/MUD) where you can send movement commands (n, s, w, e, up, down, in, out, etc.) through telnet to move your character from room to room, and the server sends back the corresponding room description after each movement step. Your task is to generate a graph that represents the underlying map structure, so that you can simply do a DFS on the client side to figure out how to get from one room to another. You also want to design the system so that minimal user input is required.
Here are the assumptions:
The underlying graph topology on the server never changes.
Room descriptions are not unique. Most of the rooms have distinct descriptions, but some of the rooms have the exact same description. Room descriptions are changed slightly once in a while (days or weeks).
Your movement may fail randomly with a small probability, and you will get an error message instead of the new room description, such as "You stop to wait for the wagon to pass" or "The door is locked", and your character will still be in the current room.
You cannot assume unit spatial distance for each movement. For example, you may have a topology like the one shown below, so assuming unit distance between neighboring rooms and assigning a hard coordinate to each room is not going to work. However, you may assume that relative directions are consistent, that is, there will be no loop in a topological sort along X (west, east) or Y (south, north).
Objective: given a destination that you have visited before, the algorithm guarantees to eventually move you to that location, and will find the shortest path most of the time. Mistakes are allowed, but the algorithm should be able to detect and correct the mistakes on the fly.
Example graph:
A--B---B
| | <- k
C--D-E-F
I have already implemented a very simple solution that records the room descriptions and constructs a graph. The following is an example of the graph representation my program generates in JSON. The "exits" map each movement direction to a node id; -1 represents an unmapped room. If the user walks in a direction and the program detects a -1 in the graph representation, the algorithm attempts to find matching nodes already in the graph. If nodes with the same description are found, it prompts the user to decide whether the newly seen room is one of the old nodes. If not, it adds a new node and connects it to the graph.
"node": [
  {
    "description": "You are standing in the heart of the Example city. There is a fountain with large marble statue in it...",
    "exits": {
      "east": -1,
      "north": 31,
      "south": 574,
      "west": 42
    },
    "id": 0,
    "name": "cot",
    "tags": [],
    "title": "Center of Town",
    "title_color": "\u001b[1m\u001b[36m Center of Town\u001b[0;37;40m"
  },
  {
    ...
This simple solution requires human input to detect loops when building the graph. For example, in the graph shown above, assume the same letters represent the same room descriptions. If you start mapping at the first B and go left, down, right, and so on until you perform movement k, you now see B again, but the mapper cannot determine whether it is the B it has seen before.
In short, I want to write a resilient graph-building algorithm that takes a walk (possibly infinite) in a hidden target graph and generates (and keeps updating) a graph that is (hopefully) as similar as possible to the target graph. We then use the generated graph to help navigate in the target graph. Is there an existing algorithm for this category of problems?
I also thought about applying some machine learning techniques to this problem, but I am unable to write out a concrete model. I am thinking along the lines of defining a list of features for each room we see (room description, exits, neighboring nodes); each time we see a room, we attempt to find the graph node that best fits the features and, based on some update rule (like Winnow or Perceptron), update the stored description according to some mistake-detection metric.
Any thoughts/suggestions would be very much appreciated!
Many MU*s will give you a way to get a unique identifier for rooms. (On MUSH and its offshoots, that’s think %L.) Others might be set up to describe rooms you’ve already been to in an abbreviated form. If not, you need some way to determine whether you have been in a room before. A simple way would be to compute a hash of enough information about each room to get a unique key. However, a maze might be deliberately set up to trick you into thinking you’re in a different location. Wizardry in particular was designed to make old-school players mapping the dungeon by hand tear their hair out when they realized their map had to be wrong, and the Zork series had a puzzle where objects you dropped to mark your place in the maze would get moved around while you were elsewhere. In practice, coding your tool to solve these puzzles is unlikely to be worth it.
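A minimal sketch of the hashing idea in Python, assuming the title, the description text and the set of exit names are what you pick as "enough information" (that choice, and the name room_key, are assumptions for illustration):

```python
import hashlib

def room_key(title, description, exits):
    """Build a stable key from a room's title, description and exit names."""
    parts = [
        title.strip(),
        " ".join(description.split()),  # collapse whitespace and line wrapping
        ",".join(sorted(exits)),        # exit order should not matter
    ]
    return hashlib.sha1("\n".join(parts).encode("utf-8")).hexdigest()
```

Two rooms with identical text and exits still collide, which is exactly the ambiguous case from the question, and a description that gets edited slightly produces a new key, so a key like this is a first filter rather than a proof of identity.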
You should be able to memoize a table of all-pairs-shortest-paths and update it to match the subgraph you’ve explored whenever you search a new exit. This could be implemented as an N×N table where row i, column j tells you the next step on the shortest path from location i to location j. Normally, for a directed graph. Even running Dijkstra’s algorithm each time should suffice, but in practice each move adds one room to the map and doesn’t add a shorter path between many other rooms. You would want to automatically map connections between rooms you’ve already been to (unless they’re supposed to be hidden) and not force the explorer to tediously walk through each individual exit and back to see where it goes.
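One way to build such a next-step table is Floyd-Warshall with a successor matrix. The sketch below is just that, in Python, recomputing the whole table rather than updating it incrementally; it assumes rooms are numbered 0..n-1, every move costs 1, and exits are directed. The names next_hop_table and route are made up.

```python
def next_hop_table(n, edges):
    """nxt[i][j] is the room to step into next when going from i to j, or None."""
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    nxt = [[j if i == j else None for j in range(n)] for i in range(n)]
    for u, v in edges:  # directed exits, unit cost per move
        if u != v:
            dist[u][v] = 1
            nxt[u][v] = v
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]  # first step is the first step towards k
    return nxt

def route(nxt, start, goal):
    """Follow the table from start to goal; None if goal is unreachable."""
    if nxt[start][goal] is None:
        return None
    path = [start]
    while start != goal:
        start = nxt[start][goal]
        path.append(start)
    return path
```

Rebuilding costs O(N^3), which is fine for a few hundred rooms; for larger maps, a single-source search per query is the simpler fallback.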
If you can design the map, you can also lay out the areas so that they’re easy to navigate between, and then you can keep your tables small: each only needs to contain maps of individual areas you’ve deliberately laid out as mazes to explore. That is, if you want to go from one dungeon to another, you just need to look up the nearest exit, and then directions between the dungeons on the world map, not one huge table that grows quadratically with the number of locations in the entire game. For example, if you’ve laid out the world as a nested hierarchy where a room is in a building on a street in a neighborhood of a city in a region of a country on a continent on a planet, you can just store locations as a tree, and navigating from one area to the others is just a matter of walking up the tree until you reach a branch on the path to your destination.
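The "walk up the tree" navigation can be sketched in a few lines of Python. This assumes the hierarchy is stored as a child-to-parent dict with a single root; the name tree_route and the sample location names are hypothetical.

```python
def tree_route(parent, start, goal):
    """Route between two locations via their lowest common ancestor."""
    def ancestors(node):
        chain = [node]
        while node in parent:
            node = parent[node]
            chain.append(node)
        return chain

    up = ancestors(start)
    down = ancestors(goal)
    goal_depth = {node: i for i, node in enumerate(down)}
    for i, node in enumerate(up):
        if node in goal_depth:
            # Climb to the shared ancestor, then descend towards the goal.
            return up[:i + 1] + down[:goal_depth[node]][::-1]
    return None  # the two locations are in different trees

parent = {
    "room1": "buildingA", "buildingA": "street",
    "room2": "buildingB", "buildingB": "street",
    "street": "city",
}
```

Storage here is linear in the number of locations, which is the point of the hierarchy: only the deliberately maze-like areas need their own shortest-path tables.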
I’m not sure how machine learning with neural networks or such would be helpful here; if the kind of trick you’re looking out for is the possibility that a room that appears to be the same as one you’ve seen before is really a duplicate, the way to handle that would be to maintain multiple possible maps at once on the assumption that apparently-identical rooms are or are not duplicates, a garden of forked paths.

How to ensure there is a single edge in a graph for a given order_id?

In my current scenario I have product, customer and seller nodes in my graph ecosystem. The problem I am facing is that I have to ensure uniqueness of
(customer)-[buys]->product
with order_item_id as a property of the buys relation edge. I have to ensure that there is a unique buys edge for a given order_item_id. This way my graph insertion remains idempotent and no repeated buys edges are created for a given order_item_id.
Creating an order_item_id property:
if(!mgmt.getPropertyKey("order_item_id")){
order_item_id=mgmt.makePropertyKey("order_item_id").dataType(Integer.class).make();
}else{
order_item_id=mgmt.getPropertyKey("order_item_id");
}
What I have found so far is that building a unique index might solve my problem, like this:
if(mgmt.getGraphIndex('order_item_id')){
ridIndexBuilder=mgmt.getGraphIndex('order_item_id')
}else{
ridIndexBuilder=mgmt.buildIndex("order_item_id",Edge.class).addKey(order_item_id).unique().buildCompositeIndex();
}
Or I can also use something like
mgmt.buildEdgeIndex(graph.getOrCreateEdgeLabel("product"),"uniqueOrderItemId",Direction.BOTH,order_item_id)
How should I ensure the uniqueness of a single buys edge for a given order_item_id? (I don't have a use case to search based on order_item_id.)
What is the basic difference between creating an index on an edge using buildIndex and using buildEdgeIndex?
You cannot enforce the uniqueness of properties at the edge-level, ie. between two vertices (see this question on the mailing list). If I understand your problem correctly, building a CompositeIndex on edge with a uniqueness constraint for a given property should address your problem, even though you do not plan to search these edges by the indexed property. However, this could lead to performance issues when inserting many edges simultaneously due to locking. Depending on the rate at which you insert data, you may have to skip the locking (= skip the uniqueness constraint) and risk duplicate edges, then handle the deduplication yourself at read time and/or run post-import batch jobs to clean up potential duplicates.
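If you do end up skipping the constraint, the read-time or post-import cleanup boils down to keeping the first edge seen per order_item_id. Here is a minimal Python sketch of that logic over edges already fetched as dicts; it is not Titan API, and the function name and dict layout are assumptions for illustration.

```python
def dedupe_buys_edges(edges):
    """Split edges into the first edge per order_item_id and the later duplicates."""
    seen = set()
    unique, duplicates = [], []
    for edge in edges:
        key = edge["order_item_id"]
        if key in seen:
            duplicates.append(edge)  # candidates for removal in a cleanup job
        else:
            seen.add(key)
            unique.append(edge)
    return unique, duplicates
```

At read time you would use only the unique list; a batch job would instead drop the edges in the duplicates list from the graph.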
buildIndex() builds a global graph index (either a CompositeIndex or a MixedIndex). These kinds of indexes typically allow you to quickly find the starting points of your Traversal within a graph.
However, buildEdgeIndex() allows you to build a local, "vertex-centric index" which is meant to speed-up the traversal between vertices with a potentially high degree (= huge number of incident edges, incoming and/or outgoing). Such vertices are called "super nodes" (see A solution to the supernode problem blog post from Aurelius) - and although they tend to be quite rare, the likelihood of traversing them isn't that low.
Reference: Titan documentation, Ch. 8 - Indexing for better Performance.

Resources