Graph building algorithm given an infinite walk

I need help writing a resilient mapping (graph-building) algorithm. Here is the problem:
Imagine you have a text-oriented virtual reality (TORG/MUD) where you can send movement commands (n, s, w, e, up, down, in, out, etc.) through telnet to move your character from room to room, and the server sends back the corresponding room description after each movement step. Your task is to generate a graph that represents the underlying map structure, so that you can simply do a DFS on the client side to figure out how to get from one room to another. You also want to design the system so that minimal user input is required.
Here are the assumptions:
The underlying graph topology on the server never changes.
Room descriptions are not unique. Most of the rooms have distinct descriptions, but some rooms have the exact same description. Room descriptions also change slightly once in a while (days or weeks).
Your movement may fail randomly with a small probability. In that case you get an error message instead of the new room description, such as "You stop to wait for the wagon to pass" or "The door is locked", and your character will still be in the current room.
You cannot assume a unit spatial distance for each movement. For example, you may have a topology like the one shown below, so assuming unit distance between neighboring rooms and assigning a hard coordinate to each room is not going to work. However, you may assume that relative directions are consistent, that is, there will be no loop in a topological sort along X (west, east) or Y (south, north).
Objective: given a destination that you have visited before, the algorithm guarantees to eventually move you to that location, and will find the shortest path most of the time. Mistakes are allowed, but the algorithm should be able to detect and correct the mistakes on the fly.
Example graph:
A--B---B
|  |   | <- k
C--D-E-F
I have already implemented a very simple solution that records room descriptions and constructs a graph. The following is an example of the graph representation my program generates in JSON. The "exits" map movement directions to node ids; -1 represents an unmapped room. If the user walks in a direction and the algorithm detects a -1 in the graph representation, it attempts to find matching nodes already in the graph. If nodes with the same description are found, it prompts the user to decide whether the newly seen room is one of those old nodes. If not, it adds a new node and connects it to the graph.
"node": [
{
"description": "You are standing in the heart of the Example city. There is a fountain with large marble statue in it...",
"exits": {
"east": -1,
"north": 31,
"south": 574,
"west": 42
},
"id": 0,
"name": "cot",
"tags": [],
"title": "Center of Town",
"title_color": "\u001b[1m\u001b[36m Center of Town\u001b[0;37;40m"
},
{
...
This simple solution requires human input to detect loops while building the graph. For example, in the graph shown above, assume the same letters represent the same room description. If you start mapping at the first B and go left, down, then right... until you perform movement k, you now see B again, but the mapper cannot determine whether it is the B it has seen before.
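The core of the update step looks roughly like this (a simplified sketch, not my actual implementation):

UNMAPPED = -1

def on_move(graph, current_id, direction, new_desc, ask_user):
    """Update the graph after a successful move out of current_id."""
    node = graph[current_id]
    target = node["exits"].get(direction, UNMAPPED)
    if target != UNMAPPED:
        return target  # already mapped; trust the existing edge
    # Unmapped exit: look for existing nodes with the same description.
    candidates = [nid for nid, n in graph.items()
                  if n["description"] == new_desc]
    for nid in candidates:
        if ask_user(f"Is this room the same as node {nid}?"):
            node["exits"][direction] = nid
            return nid
    # No match confirmed by the user: create a new node.
    new_id = max(graph) + 1
    graph[new_id] = {"description": new_desc, "exits": {}}
    node["exits"][direction] = new_id
    return new_id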
In short, I want to write a resilient graph-building algorithm that takes a walk (possibly infinite) in a hidden target graph and generates (and keeps updating) a graph that is (hopefully) as similar as possible to the target graph. We then use the generated graph to help navigate in the target graph. Is there an existing algorithm for this category of problems?
I also thought about applying some machine learning techniques to this problem, but I am unable to write out a concrete model. I am thinking along the lines of defining a list of features for each room we see (room description, exits, neighboring nodes); each time we see a room, we attempt to find the graph node that best fits the features and, based on some update rule (like Winnow or Perceptron), update the stored description according to some mistake-detection metric.
Any thoughts/suggestions would be very much appreciated!

Many MU*s will give you a way to get a unique identifier for rooms. (On MUSH and its offshoots, that's think %L.) Others might be set up to describe rooms you've already been to in an abbreviated form. If not, you need some way to determine whether you have been in a room before. A simple way would be to compute a hash of enough information about each room to get a unique key. However, a maze might be deliberately set up to trick you into thinking you're in a different location. Wizardry in particular was designed to make old-school players mapping the dungeon by hand tear their hair out when they realized their map had to be wrong, and the Zork series had a puzzle where objects you dropped to mark your place in the maze would get moved around while you were elsewhere. In practice, coding your tool to solve these puzzles is unlikely to be worth it.
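For instance, a minimal sketch of such a hash key (Python; the fields chosen are illustrative, and note that the question's slowly drifting descriptions would defeat an exact hash like this):

import hashlib

def room_key(title, description, exits):
    """Hash enough room information to get a (hopefully) unique key.
    Deliberately identical rooms will still collide, as noted above."""
    payload = "\x1f".join([title, description, ",".join(sorted(exits))])
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()

key = room_key("Center of Town",
               "You are standing in the heart of the Example city...",
               ["north", "south", "east", "west"])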
You should be able to memoize a table of all-pairs shortest paths and update it to match the subgraph you've explored whenever you search a new exit. This could be implemented as an N×N table where row i, column j tells you the next step on the shortest path from location i to location j (for a directed graph). Even re-running Dijkstra's algorithm each time should suffice, since in practice each move adds only one room to the map and rarely creates a shorter path between many other rooms. You would want to automatically map connections between rooms you've already been to (unless they're supposed to be hidden) and not force the explorer to tediously walk through each individual exit and back to see where it goes.
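Since the map is unweighted, BFS per source is enough; here is a rough sketch of filling such a next-hop table (Python, mirroring the direction-to-id exits structure from the question's JSON):

from collections import deque

def next_hop_table(graph):
    """table[i][j] = first move on a shortest path i -> j.
    graph: {node_id: {direction: node_id}}; -1 marks unmapped exits."""
    table = {}
    for src in graph:
        row, frontier, seen = {}, deque([src]), {src}
        first_move = {src: None}
        while frontier:
            cur = frontier.popleft()
            for direction, nxt in graph[cur].items():
                if nxt in seen or nxt == -1:
                    continue
                seen.add(nxt)
                # remember the move taken out of src on this path
                first_move[nxt] = first_move[cur] or direction
                row[nxt] = first_move[nxt]
                frontier.append(nxt)
        table[src] = row
    return table

g = {0: {"east": 1}, 1: {"west": 0, "north": 2}, 2: {"south": 1}}
print(next_hop_table(g)[0][2])   # 'east': first move from 0 toward 2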
If you can design the map, you can also lay out the areas so that they’re easy to navigate between, and then you can keep your tables small: each only needs to contain maps of individual areas you’ve deliberately laid out as mazes to explore. That is, if you want to go from one dungeon to another, you just need to look up the nearest exit, and then directions between the dungeons on the world map, not one huge table that grows quadratically with the number of locations in the entire game. For example, if you’ve laid out the world as a nested hierarchy where a room is in a building on a street in a neighborhood of a city in a region of a country on a continent on a planet, you can just store locations as a tree, and navigating from one area to the others is just a matter of walking up the tree until you reach a branch on the path to your destination.
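A minimal sketch of that upward walk (Python; the area names are hypothetical):

def route_up_tree(parents, src, dst):
    """Walk up the location tree from src until we hit an ancestor of
    dst, then walk down to dst. parents maps each area to its container."""
    def chain(x):
        out = [x]
        while x in parents:
            x = parents[x]
            out.append(x)
        return out
    up, down = chain(src), chain(dst)
    down_set = set(down)
    common = next(a for a in up if a in down_set)  # nearest shared branch
    climb = up[:up.index(common)]
    descend = list(reversed(down[:down.index(common)]))
    return climb + [common] + descend

# hypothetical areas: a room in a building on a street in a city, etc.
parents = {"room": "building", "building": "street", "street": "city",
           "tavern": "plaza", "plaza": "city"}
print(route_up_tree(parents, "room", "tavern"))
# ['room', 'building', 'street', 'city', 'plaza', 'tavern']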
I’m not sure how machine learning with neural networks or such would be helpful here; if the kind of trick you’re looking out for is the possibility that a room that appears to be the same as one you’ve seen before is really a duplicate, the way to handle that would be to maintain multiple possible maps at once on the assumption that apparently-identical rooms are or are not duplicates, a garden of forked paths.
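A loose sketch of that forking idea (Python; the position bookkeeping is deliberately simplified, unmapped exits are None here, and a real mapper would also need to bound the number of live hypotheses):

import copy

# Each hypothesis is (graph, position): a candidate map plus where we
# believe we currently are in it.

def fork_on_ambiguous_room(hypotheses, direction, desc):
    """Seen a room matching an old description: keep every interpretation
    alive, both 'we looped back to an old node' and 'it is a duplicate'."""
    forked = []
    for graph, pos in hypotheses:
        matches = [nid for nid, n in graph.items()
                   if n["description"] == desc]
        for nid in matches:                      # looped back to nid
            g = copy.deepcopy(graph)
            g[pos]["exits"][direction] = nid
            forked.append((g, nid))
        g = copy.deepcopy(graph)                 # brand-new duplicate room
        new_id = max(g) + 1
        g[new_id] = {"description": desc, "exits": {}}
        g[pos]["exits"][direction] = new_id
        forked.append((g, new_id))
    return forked

def prune(hypotheses, direction, observed_desc):
    """Later moves kill candidates whose map predicts the wrong room."""
    keep = []
    for graph, pos in hypotheses:
        target = graph[pos]["exits"].get(direction)
        if target is None:                       # exit not mapped yet:
            keep.append((graph, pos))            # no prediction to test
        elif graph[target]["description"] == observed_desc:
            keep.append((graph, target))
    return keep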

Related

How does the Titan Twitter Example work?

I'm trying to wrap my mind around graph data right now. I'm finding it difficult to think in terms of property graphs. On the vertex-centric indices docs page, there is an example involving twitter data. The Gremlin code is:
g = TitanFactory.open(conf)

// graph schema construction
g.makeKey('name').dataType(String.class).indexed(Vertex.class).make()
time = g.makeKey('time').dataType(Long.class).make()
if (useVertexCentricIndices)
    g.makeLabel('tweets').sortKey(time).make()
else
    g.makeLabel('tweets').make()
g.commit()

// graph instance construction
g.addVertex([name:'v1000']);
g.addVertex([name:'v10000']);
g.addVertex([name:'v100000']);
g.addVertex([name:'v1000000']);
for (i = 1000; i < 1000001; i = i * 10) {
    v = g.V('name', 'v' + i).next();
    (1..i).each {
        v.addEdge('tweets', g.addVertex(), [time:it])
        if (it % 10000 == 0) g.commit()
    }
    g.commit()
}
The explanation is that each edge represents someone tweeting a tweet vertex. This doesn't make sense to me as a schema. Why should any two nodes be connected? If the answer is that the edge connects different tweets that a user has tweeted, then one edge connects more than one node. This would mean that Titan is a hypergraph, which I thought it wasn't.
In short, can someone explain this example better than the docs?
The example in the wiki is a bit over-simplified and is designed to convey the concept of vertex-centric indices. On its own, it might not be the best thing to use for purposes of understanding how to model a schema in general. That said, I think the model still makes basic sense (at least in that light).
If the answer is that the edge connects different tweets that a user has tweeted, then one edge connects more than one node.
I'm not sure where you see that in the code. I see 4 user vertices who are doing the tweeting (v1000, v10000, etc.). The for loop iterates over each user and adds tweets edges for each. On each creation of an edge, a new vertex is created to represent the tweet. Perhaps I'm misunderstanding you, but in that sense an edge never connects more than two vertices: it only connects a user vertex to a tweet vertex.

Property graph (Neo4j) design: Single node with multiple relations or new nodes for each event occurence?

Let us say I have two leagues, L1 and L2. Each league can have multiple rounds, like Playoffs, Quarterfinals, Semifinals and Finals. Moreover, I also need to represent the happens_after fact: Quarterfinals happen after Playoffs, Semifinals happen after Quarterfinals, and Finals happen after Semifinals.
Questions
Should my graph have one node for each of these rounds, with each league linking to these rounds? This way we are just creating new relationships (e.g. both L1 and L2 will have a relationship to Playoffs), but there is only one Playoffs node. However, this limits the happens_after relationship, because some leagues can have more rounds (e.g. Round 2 can come before Quarterfinals). Is there a better way to represent this?
Use-cases
Need to be able to find all the rounds of a given league.
Need to be able to find the order of all the rounds of a given league and the dates each of these happened.
EDIT
In general, everything that has an identity of its own should become a node. Relationships tie the "things" together.
I'm not sure I fully understand your domain. L1, L2 and each round would be nodes. The relationship league -> round indicates that a given league takes part in that round.
The temporal order of the rounds can be modeled by having BEFORE and/or AFTER relationships among them. This way you build a linked (or doubly linked) list of rounds.
Another way to express temporal order would be to store an indexed timestamp property on each round. If you're just interested in before/after and not in absolute time, the first approach (linked list) seems the better fit.
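A minimal sketch of that model in plain Python (Cypher omitted; the names and dates are purely illustrative), covering both use-cases from the question:

# One node per round per league, chained with a NEXT (i.e. BEFORE/AFTER)
# link to its successor.
rounds = {
    "L1_playoffs":      {"league": "L1", "date": "2014-03-01", "next": "L1_quarterfinals"},
    "L1_quarterfinals": {"league": "L1", "date": "2014-03-08", "next": "L1_semifinals"},
    "L1_semifinals":    {"league": "L1", "date": "2014-03-15", "next": "L1_finals"},
    "L1_finals":        {"league": "L1", "date": "2014-03-22", "next": None},
}

def rounds_of(league):
    """Use-case 1: all rounds of a given league."""
    return [name for name, r in rounds.items() if r["league"] == league]

def ordered_rounds(first):
    """Use-case 2: rounds in temporal order, with their dates,
    by following the linked list from the first round."""
    out, cur = [], first
    while cur is not None:
        out.append((cur, rounds[cur]["date"]))
        cur = rounds[cur]["next"]
    return out

print(rounds_of("L1"))
print(ordered_rounds("L1_playoffs"))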

Hierarchy in a graph

How can one compute or quantify the hierarchical nature of a given graph, if there is a hierarchy in that graph?
More specifically, I want to know whether some sort of hierarchy exists in an artificial neural network (with some number of hidden layers), and I also want to measure it.
This is an interesting open-ended question, so I'll answer loosely and off the cuff.
So you want to find out whether the graph is logically like a tree? There is no up or down in a graph, so what you are probably really looking for is a way to find the node or nodes in your graph that are most highly connected to other highly connected nodes, then suppose one of them is the trunk or root and determine whether a tree-like graph makes sense from that perspective.
Another thing you could do is just choose any random node, suppose it is the root of the tree, and see what happens. You can re-balance the tree if the node connections lead you to want to do that, based on a certain number of connections or traversals, and you can attempt to find the "real root", if there is such a thing. If you re-balance a certain number of times, or detect that a circular path has been traversed, you may decide that the graph is not very hierarchical at all. If you do find a "real root", you might then look at depth, average branch counts, balance statistics, etc. If you refine the question, I'll refine my answer.

Build an undirected weighted graph by matching N vertices

Problem:
I want to suggest the top 10 most compatible matches for a particular user by comparing his/her 'interests' with the interests of all other users. I'm building an undirected weighted graph between users, where the weight = the match score between the two users.
I already have a set of N users, S. For any user U in S, I have a set of interests I. After a long time (a week?) I create a new user U with a set of interests and add it to S. To generate the graph entries for this new user, I compare the interest set I of the new user with the interest sets of all the users in S, iteratively. The problem is with this "all the users" part.
Let's talk about the function for comparing interests. An interest in a set of interests I is a string. I'm comparing two strings/interests using WikipediaMiner (it uses Wikipedia links to infer how closely related two strings are, e.g. Billy Jean & Thriller => high match, Brad Pitt & Jamaica => low match). I've asked a question about this too (to see if there's a better solution than the one I'm currently using).
So, the above function takes non-negligible time, and in total it will take a huge amount of time when we compare thousands (maybe millions?) of users and their hundreds of interests. For 100,000 users, I can't afford to make 100,000 user comparisons within a small time (<30 sec) this way. But I have to give the top 10 recommendations within 30 secs, possibly as a preliminary recommendation that I then improve on over the next minute or so. Simply comparing the 1 new user against the N users sequentially is too slow.
Question:
Please suggest an algorithm, method or tool with which I can improve my situation or solve my problem.
I can only sketch an approach, since the outcome of the steps below depends on the nature of the inter-relations between interests.
Step 1: As your title says, build an undirected weighted graph with interests as vertices and the match scores between them as edge weights.
Step 2: Cluster the interests (the most complex part). K-means is a commonly used clustering algorithm, but it operates on a k-dimensional vector space (see the Wikipedia article for how it works): it minimizes, over all clusters, the sum of squared distances from each point to its cluster's center. In your case there are no dimensions available, so try to apply the same minimizing logic by defining a distance rule between two vertices: a higher match => a smaller distance, and vice versa (what are the different match levels provided by Wikipedia Miner?). Choose as the cluster's "mean" the most connected vertex in the chosen set; PageRank sounds like a good option for figuring out the most connected vertex. "Pair-counting F-measure" sounds like it suits your need (a weighted graph); check the other options available. (Note: keep modifying this step until you find the right clustering algorithm and the right calibration for the distance rule, number of clusters, etc.)
Step 3: Evaluate the clusters. From here on it's a matter of calibrating a couple of things to fit your needs. Examine the clusters and re-evaluate: the number of clusters, the inter-cluster distance, the distances between vertices inside clusters, the sizes of the clusters, and the time/precision trade-off (compare the final match results against results without any clustering). Go back to step 2 until this evaluation is satisfactory.
Step 4: Examine the new interest. Iterate through all clusters, calculate the connectivity within each one, sort clusters by connectivity, and for the top x% of sorted clusters sort and filter out the highly connected interests.
Step 5: Match users. Reverse-look-up the set of all users via the interests obtained in step 4, compare all interests for both users, and generate a score.
Step 6: Apart from the above, you can distribute the load across multiple systems/processors (e.g. one machine per cluster), based on the traffic and so on. What is the application for this problem, and what is the expected traffic?
Another solution for finding the connectivity between a new interest and the set of interests in a cluster C:
Wiki Miner runs on a set of wiki documents; let me call it the UNIVERSE.
1. For each cluster, fetch and maintain (an index; Lucene might be handy) the set of highly relevant docs (I'll call it the HRDC) out of the UNIVERSE. So you have N HRDCs if you have N clusters.
2. When a new interest comes in, compute "connectivity with cluster" = (hit ratio of the interest in the HRDC) / (hit ratio of the interest in the UNIVERSE), for each HRDC.
3. Sort the "connectivity with cluster" scores and choose the highly connected clusters.
4. Either compare all the vertices in the cluster with the new interest, or only the highly connected vertices (using PageRank), depending on the time/precision trade-off that suits you.
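A toy illustration of step 2's ratio (Python; in reality the hit counts would come from the Wikipedia-Miner/Lucene index rather than from toy lists):

def hit_ratio(docs, interest):
    """Fraction of documents mentioning the interest."""
    return sum(interest in d for d in docs) / len(docs)

def connectivity_with_cluster(hrdc, universe, interest):
    base = hit_ratio(universe, interest)
    return hit_ratio(hrdc, interest) / base if base else 0.0

universe = ["thriller billie jean", "jamaica travel", "brad pitt film"]
hrdc = ["thriller billie jean"]  # highly relevant docs for one cluster
print(connectivity_with_cluster(hrdc, universe, "thriller"))  # 3.0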
One flaw is that you're basing your algorithm's complexity on the wrong thing. The real issue is that you have to compare each unique interest against every other unique interest (and each interest against itself).
If all of the interests are unique, then there is probably nothing you can do. However, if you have a lot of duplicate interests, you can perhaps speed up the algorithm in the following way.
Create a graph that associates each interest with the users that have that interest, in such a way that allows for fast look-ups.
Create a graph that shows how each interest relates to each other interest, also in such a way that allows for fast look-ups.
Then, when a new user is added, their interests are compared to all other interests and stored in a graph. You can then use that information to build a list of users with similar interests; that list will then need to be filtered somehow to bring it down to the top 10.
Finally, add that user and their interests to the graph of users and interests. This is done last so that the user with the most closely matched interests isn't the user themselves.
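A minimal sketch of those two look-up structures and the add-a-user flow (Python; compare(a, b) stands in for the Wikipedia-Miner match score, and the names are illustrative):

from collections import defaultdict

users_by_interest = defaultdict(set)   # interest -> users who have it
related = defaultdict(dict)            # interest -> {interest: score}

def add_user(user, interests, all_interests, compare, top_k=10):
    """Each *unique* interest is scored against the others only once."""
    scores = defaultdict(float)
    for i in interests:
        if i not in related:                  # first time we see this
            for j in all_interests:           # unique interest: compare
                if j != i:                    # it once, then reuse
                    related[i][j] = compare(i, j)
        for other in users_by_interest[i]:    # exact shared interest
            scores[other] += 1.0
        for j, s in related[i].items():       # related interests
            for other in users_by_interest[j]:
                scores[other] += s
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    for i in interests:                       # register last, so the
        users_by_interest[i].add(user)        # best match is never the
    return top                                # new user themselves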
Note: there might be some statistical shortcuts you could take, something like: A is related to B, B is related to C, C is related to D, therefore A is related to B, C, and D. However, using those kinds of shortcuts likely requires a much better understanding of how your comparison function works, which is a bit beyond my expertise.
Approximate solution:
I forgot to mention it earlier, but what you're looking at when comparing users or interests is a "nearest neighbor search" in higher dimensions. Meaning that for exact solutions, a linear search generally works better than data structures; so approximation is probably the way to go if you need it faster.
To obtain a quick approximate solution (without guarantees as to how close it is), you'll need a data structure that allows you to quickly determine which users are likely to be similar to a new user.
One way to build that structure:
Pick 300 random users. These will be the seed users for 300 clusters. Ideally, you'd use the 300 users that are least closely related, but that's probably not practical; still, it might be wise to ensure that no seed user is too closely related to the other seed users (measured as a sum or average of its comparisons to the others).
The clusters are then filled by each user joining the cluster whose representative user most closely matches them.
The top ten can then be determined by picking the 10 most closely related users from that cluster.
If you ensure that the number of clusters and the users per cluster are always fairly close to sqrt(number of users), then you obtain a fair approximation in O(sqrt(N)) by only checking the points within one cluster. You can improve that approximation by including users in additional clusters and checking the representative users for each cluster. The more clusters you check, the closer you get to O(N) and an exact solution, although there's probably no way to say how close the current solution is to the exact one. Chances are you start to hit diminishing returns after checking more than about log(sqrt(N)) clusters in total, which would put you at O(sqrt(N) log(sqrt(N))).
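A rough sketch of that clustering scheme (Python; compare() stands in for the expensive pairwise match score, and the numbers are illustrative):

import math, random

def build_clusters(users, compare, k=None):
    """Seed ~sqrt(N) clusters with random users, then assign everyone
    to the seed they match best."""
    k = k or int(math.sqrt(len(users)))
    seeds = random.sample(users, k)
    clusters = {s: [] for s in seeds}
    for u in users:
        best = max(seeds, key=lambda s: compare(u, s))
        clusters[best].append(u)
    return clusters

def top10(new_user, clusters, compare, probes=1):
    """Check only the best-matching cluster(s) instead of all users;
    raising probes trades speed for a better approximation."""
    seeds = sorted(clusters, key=lambda s: compare(new_user, s),
                   reverse=True)[:probes]
    candidates = [u for s in seeds for u in clusters[s]]
    return sorted(candidates, key=lambda u: compare(new_user, u),
                  reverse=True)[:10]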
A few thoughts... not exactly a graph theory solution:
Assuming a finite set of interests, maintain for each user a bit sequence where each bit represents whether the user has that interest or not.
For a new user, simply AND their bit sequence with each existing user's bit sequence and count the set bits in the result, which gives an idea of how closely their interests match.
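For example (Python; bitwise AND plays the role of the multiply, and a popcount gives the match size):

interests = ["music", "movies", "cricket", "travel"]   # the finite set
bit = {name: 1 << i for i, name in enumerate(interests)}

def to_mask(user_interests):
    m = 0
    for name in user_interests:
        m |= bit[name]
    return m

a = to_mask(["music", "movies"])
b = to_mask(["movies", "travel"])
print(bin(a & b).count("1"))   # 1 shared interest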

Hadoop MapReduce implementation of shortest PATH in a graph, not just the distance

I have been looking for a "MapReduce implementation of shortest path search algorithms".
However, all the instances I could find computed the "shortest distance from node x to y", and none actually output the actual shortest path, like x-a-b-c-y.
As for what I am trying to achieve: I have graphs with hundreds of thousands of nodes, and I need to perform frequent-pattern analysis on shortest paths among the various nodes. This is for a research project I am working on.
It would be a great help if someone could point me to an implementation (if one exists) or give some pointers as to how to hack the existing SSSP implementations to output the paths along with the distances.
Basically these implementations work with some kind of messaging: messages are sent to HDFS between the map and reduce stages.
In the reducer they are grouped and filtered by distance, and the lowest distance wins. When the distance is updated in this case, you have to record the vertex (well, some ID, probably) that the message came from.
So you have an additional space requirement per vertex, but you can reconstruct every possible shortest path in the graph.
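Here is a pure-Python sketch of one round of that message passing (illustrative names, not actual Hadoop code); each message carries (distance, via) so that afterwards the full path, not just the distance, can be rebuilt by following the parent pointers:

from collections import defaultdict

def map_phase(vertices):
    for vid, v in vertices.items():
        yield vid, ("state", v)                    # re-emit own state
        if v["dist"] is not None:
            for nb in v["edges"]:
                yield nb, ("msg", (v["dist"] + 1, vid))

def reduce_phase(grouped):
    out = {}
    for vid, values in grouped.items():
        state = next(v for t, v in values if t == "state")
        msgs = [m for t, m in values if t == "msg"]
        if msgs:
            best_dist, best_via = min(msgs)        # lowest distance wins
            if state["dist"] is None or best_dist < state["dist"]:
                state["dist"], state["parent"] = best_dist, best_via
        out[vid] = state
    return out

def run(vertices, rounds):
    for _ in range(rounds):
        grouped = defaultdict(list)
        for vid, val in map_phase(vertices):
            grouped[vid].append(val)
        vertices = reduce_phase(grouped)
    return vertices

graph = {"x": {"edges": ["a"], "dist": 0, "parent": None},
         "a": {"edges": ["b"], "dist": None, "parent": None},
         "b": {"edges": ["y"], "dist": None, "parent": None},
         "y": {"edges": [], "dist": None, "parent": None}}
g = run(graph, 3)
node, path = "y", []
while node:
    path.append(node)
    node = g[node]["parent"]
print("-".join(reversed(path)))                    # x-a-b-y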
Based on your comment:

Yes, probably I will need to write another class of the vertex object to hold this additional information. Thanks for the tip, though it would be very helpful if you could point out where and when I can capture this information of where the minimum weight came from, anything from your blog maybe :-)
Yeah, that could be quite a cool topic, also for Apache Hama. Most of the implementations just consider the costs, not the real path. In your case (from the blog you've linked above) you will have to extract a vertex class that actually holds the adjacent vertices as LongWritable (maybe a list instead of the split on the Text object) and simply add a parent or source id as a field (of course also a LongWritable).
You set this when propagating in the mapper, that is, in the for loop that iterates over the adjacent vertices of the current key node.
In the reducer you update the lowest distance while iterating over the grouped values; there you have to set the source vertex field of the key vertex to the vertex that updated it to the minimum.
You can actually get some of the vertex classes here from my blog:
Source
or directly from the repository:
Source
Maybe it helps you; it is quite unmaintained, so please come back to me if you have a specific question.
Here is the same algorithm in BSP with Apache Hama:
https://github.com/thomasjungblut/tjungblut-graph/blob/master/src/de/jungblut/graph/bsp/SSSP.java
