Hierarchy in a graph

How can one compute or quantify the hierarchical nature of a given graph, if there is a hierarchy in that graph?
More specifically, I want to know whether some sort of hierarchy exists in an artificial neural network (with some number of hidden layers), and I also want to measure it.

This is an interesting open-ended question, so I'll answer loosely and off the cuff.
So you want to find out whether the graph is logically like a tree? There is no up or down in a graph, so what you are probably really looking for is a way to find the node or nodes in your graph that are most highly connected to other highly connected nodes, then adopt that perspective and determine whether a tree-like graph makes sense with that node as the trunk or root. Another thing you could do is pick a random node, suppose it is the root, and see what happens. You can re-balance the tree if the node connections lead you to want to (based on some number of connections or traversals), and you can attempt to find the "real root", if there is such a thing. If you re-balance a certain number of times, or detect that a circular path has been traversed, you may decide that the graph is not very hierarchical at all. If you do find a "real root", you might then look at depth, average branching factor, balance statistics, etc. If you refine the question, I'll refine my answer.
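To make the "suppose a root and measure" idea concrete, here is a rough Python sketch (plain dictionaries, no graph library; the scoring heuristic is my own ad-hoc choice, not a standard metric): pick the best-connected node as a candidate root, build a BFS tree from it, and count the edges that fall outside that tree. A perfectly hierarchical graph has none.

# Sketch: quantify how tree-like an undirected graph is.
from collections import deque

def tree_likeness(adj):
    """adj: dict mapping node -> set of neighbor nodes (undirected)."""
    # Candidate root: the node whose neighbors are themselves best connected.
    root = max(adj, key=lambda n: sum(len(adj[m]) for m in adj[n]))

    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in depth:          # first visit -> tree edge
                depth[nb] = depth[node] + 1
                queue.append(nb)

    n_edges = sum(len(v) for v in adj.values()) // 2
    tree_edges = len(depth) - 1          # a BFS tree has |V|-1 edges
    extra = n_edges - tree_edges         # edges that close cycles
    return {
        "root": root,
        "depth": max(depth.values()),
        "non_tree_edges": extra,
        "hierarchy_score": tree_edges / n_edges if n_edges else 1.0,
    }

# Example: a tree plus one shortcut edge (3-5), so the score drops below 1.0.
adj = {1: {2, 3}, 2: {1, 4, 5}, 3: {1, 5}, 4: {2}, 5: {2, 3}}
print(tree_likeness(adj))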

Related

Does an Increased Number of Node Types Impact Performance of Graph DBs?

I am in the process of creating a graph database, a simple one for movies with several types of information such as actors, producers, directors and so on.
What I would like to know is: is it better to break your nodes down to a more granular level? For example, is it better to have two kinds of nodes, 'actors' and 'directors', or is it better to have one node type, say 'person', and use different kinds of relationships like 'acted_in' and 'directed'? Does this matter at all?
Further, is there any impact on traversal queries? Does having more types of nodes make traversal slower?
Note: I intend to implement this using the Gremlin console in Amazon Neptune.
The answer really is: it depends. If I were building such a model, I would break the key "nouns" out into their own nodes. I would also label the edges appropriately, such as ACTED_IN or DIRECTED.
The performance of any graph query depends on how much data it needs to touch (the fan-out factor as you go from depth to depth).
The best advice I can give you is to think about the questions you will need the graph to answer, and try to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate on your data model multiple times; that is common and expected.
Properties can be useful when you want to add a unique piece of information to a node - perhaps the birthday of the director.
Edge properties can be useful for filtering out unneeded edges, but so can edge labels. In some cases you may find that a label such as DIRECTED-IN-2005 is a useful shortcut that avoids checking both a label and a property on an edge.
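As a hedged illustration of this model (the endpoint URL and names are placeholders I made up, not anything from the question), here is how it might look with gremlinpython against Neptune: one 'person' vertex type, with the role pushed onto the edge label.

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# One 'person' vertex type; the role lives on the edge label.
g.addV('person').property('name', 'Ridley Scott').next()
g.addV('movie').property('title', 'Alien').next()
(g.V().has('person', 'name', 'Ridley Scott').as_('p')
  .V().has('movie', 'title', 'Alien')
  .addE('DIRECTED').from_('p').next())

# Query cost is driven by fan-out, not by how many labels exist.
directors = (g.V().has('movie', 'title', 'Alien')
              .in_('DIRECTED').values('name').toList())
print(directors)
conn.close()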

Graph building algorithm given an infinite walk

I need help writing a resilient, mapping (graph building) algorithm. Here is the problem:
Imagine you have a text-oriented virtual reality (TORG/MUD) where you can send movement commands (n, s, w, e, up, down, in, out, etc.) through telnet to move your character from room to room, and the server sends back the corresponding room description after each movement step. Your task is to generate a graph that represents the underlying map structure, so that you can simply do a DFS on the client side to figure out how to get from one room to another. You also want to design the system so that minimal user input is required.
Here are the assumptions:
The underlying graph topology on the server never changes.
Room descriptions are not unique. Most rooms have distinct descriptions, but some rooms have exactly the same description. Room descriptions change slightly once in a while (days or weeks).
Your movement may fail randomly with a small probability, in which case you get an error message instead of the new room description, such as "You stop to wait for the wagon to pass" or "The door is locked", and your character stays in the current room.
You cannot assume a unit spatial distance for each movement. For example, you may have a topology like the one shown below, so assuming unit distance between neighboring rooms and assigning a hard coordinate to each room is not going to work. However, you may assume that relative directions are consistent, i.e. there will be no loop in a topological sort along X (west, east) or Y (south, north).
Objective: given a destination that you have visited before, the algorithm guarantees to eventually move you to that location, and will find the shortest path most of the time. Mistakes are allowed, but the algorithm should be able to detect and correct the mistakes on the fly.
Example graph:
A--B---B
| | <- k
C--D-E-F
I have already implemented a very simple solution that records the room descriptions and constructs a graph. The following is an example of the graph representation my program generates, in JSON. The "exits" map movement directions to node ids; -1 represents an unmapped room. If the user walks in a direction and the algorithm detects a -1 in the graph representation, it attempts to find matching nodes already in the graph. If nodes with the same description are found, it prompts the user to decide whether the newly seen room is one of the old nodes. If not, it adds a new node and connects it to the graph.
"node": [
{
"description": "You are standing in the heart of the Example city. There is a fountain with large marble statue in it...",
"exits": {
"east": -1,
"north": 31,
"south": 574,
"west": 42
},
"id": 0,
"name": "cot",
"tags": [],
"title": "Center of Town",
"title_color": "\u001b[1m\u001b[36m Center of Town\u001b[0;37;40m"
},
{
...
This simple solution requires human input to detect loops while building the graph. For example, in the graph shown above, assume identical letters represent identical room descriptions. If you start mapping at the first B and move left, down, right... until you perform movement k, you now see B again, but the mapper cannot determine whether it is the B it has seen before.
In short, I want to write a resilient graph-building algorithm that takes a (possibly infinite) walk in a hidden target graph and generates (and keeps updating) a graph that is, hopefully, as similar to the target graph as possible. We then use the generated graph to help navigate the target graph. Is there an existing algorithm for this category of problems?
I have also thought about applying machine-learning techniques to this problem, but I am unable to write out a concrete model. I am thinking along the lines of defining a list of features for each room we see (room description, exits, neighboring nodes); each time we see a room, we attempt to find the graph node that best fits the features, and, based on some update rule (like Winnow or Perceptron) and some mistake-detection metric, update the stored description.
Any thoughts/suggestions would be very much appreciated!
Many MU*s will give you a way to get a unique identifier for rooms. (On MUSH and its offshoots, that’s think %L.) Others might be set up to describe rooms you’ve already been to in an abbreviated form. If not, you need some way to determine whether you have been in a room before. A simple way would be to compute a hash of enough information about each room to get a unique key. However, a maze might be deliberately set up to trick you into thinking you’re in a different location. Wizardry in particular was designed to make old-school players mapping the dungeon by hand tear their hair out when they realized their map had to be wrong, and the Zork series had a puzzle where objects you dropped to mark your place in the maze would get moved around while you were elsewhere. In practice, coding your tool to solve these puzzles is unlikely to be worth it.
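As a minimal sketch of that hashing idea (field names borrowed from the JSON above; the function itself is my own invention), you could fingerprint each room from whatever stable information the server exposes. Genuinely identical rooms will still collide, which is exactly the ambiguity discussed in the question.

import hashlib

def room_key(title, description, exits):
    """exits: iterable of direction strings, e.g. {'north', 'east'}."""
    payload = "|".join([title, description, ",".join(sorted(exits))])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key = room_key("Center of Town",
               "You are standing in the heart of the Example city...",
               {"north", "south", "east", "west"})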
You should be able to memoize a table of all-pairs shortest paths and update it to match the subgraph you’ve explored whenever you search a new exit. This could be implemented as an N×N table where row i, column j tells you the next step on the shortest path from location i to location j (for a directed graph, since exits need not be two-way). Even re-running Dijkstra’s algorithm each time should suffice: in practice each move adds one room to the map and rarely creates a shorter path between many other rooms. You would want to automatically map connections between rooms you’ve already been to (unless they’re supposed to be hidden) rather than force the explorer to tediously walk through each individual exit and back to see where it goes.
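A rough sketch of that next-step table in Python (unweighted BFS per row, since each move costs the same; all names here are made up for illustration):

from collections import deque

def next_step_row(adj, src):
    """adj: dict room -> {direction: room}. Returns {dest: first_move}."""
    row = {}
    seen = {src}
    queue = deque([(src, None)])
    while queue:
        room, move0 = queue.popleft()
        for direction, nxt in adj[room].items():
            if nxt not in seen:
                seen.add(nxt)
                first = move0 if move0 is not None else direction
                row[nxt] = first            # first step from src towards nxt
                queue.append((nxt, first))
    return row

adj = {0: {"north": 1}, 1: {"south": 0, "east": 2}, 2: {"west": 1}}
table = {room: next_step_row(adj, room) for room in adj}
print(table[0][2])  # "north": first move on the shortest path from 0 to 2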
If you can design the map, you can also lay out the areas so that they’re easy to navigate between, and then you can keep your tables small: each table only needs to cover an individual area you’ve deliberately laid out as a maze to explore. That is, if you want to go from one dungeon to another, you just need to look up the nearest exit and then the directions between the dungeons on the world map, not one huge table that grows quadratically with the number of locations in the entire game. For example, if you’ve laid out the world as a nested hierarchy, where a room is in a building on a street in a neighborhood of a city in a region of a country on a continent on a planet, you can store locations as a tree, and navigating from one area to another is just a matter of walking up the tree until you reach a branch on the path to your destination.
I’m not sure how machine learning with neural networks or such would be helpful here. If the kind of trick you’re looking out for is the possibility that a room that appears to be the same as one you’ve seen before is really a duplicate, the way to handle that would be to maintain multiple possible maps at once, under the alternative assumptions that apparently-identical rooms are or are not duplicates: a garden of forking paths.

Number of 'graphs' in an ArangoDB database

I am exploring ArangoDB and I am not sure I understand correctly how to use the graph concept in ArangoDB.
Let's say I am modelling a social network. Should I create a single graph for the whole social network, or should I create a graph for every person and their connections?
I've got the feeling I should use a single graph... but is there any performance/functionality issue related to that choice?
Maybe the underlying question is this: should I consider the graph concept in ArangoDB as a technical or as a business-related concept?
Thanks
You should not use a graph per person; the quick answer is to use a single graph.
In general, I think you should treat the graph concept as a technical concept. Having said that, quite often, a (mathematical) graph models a relationship arising from business very naturally. Thus, the technical concept graph in a graph database maps very well to the business logic.
A social network is one of the prime examples. Typical questions here are "find the friends of a user", "find the friends of the friends of a user", or "what is the shortest path from person A to person B?". A graph database will be most useful for questions involving an a priori unknown path length, as in the shortest-path example.
Coming back to the original question: you should start by looking at the queries you will run against your data. You then want to arrange things so that these queries map conveniently onto the standard graph operations (or other queries) your data store can answer. This tells you what kind of information should be in the same graph, and which bits belong in separate graphs.
In your original use case of a social network, I would assume that you want to run queries involving chains of friendship relations, so the edges in those chains must be in the same graph. However, in more complicated cases it is conceivable, for example, that you have a "friendship" graph and a "follows" graph, both using different edges but the same vertices. In that case you might have two graphs for your social network.
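To illustrate the two-graphs-over-shared-vertices idea, here is a hedged sketch with python-arango (the database, collection, and graph names are invented for the example):

from arango import ArangoClient

db = ArangoClient().db("social", username="root", password="")

# One vertex collection ("people") shared by two named graphs,
# one per edge type.
friendship = db.create_graph("friendship")
friendship.create_edge_definition(
    edge_collection="is_friend_of",
    from_vertex_collections=["people"],
    to_vertex_collections=["people"],
)

follows = db.create_graph("follows")
follows.create_edge_definition(
    edge_collection="follows_edge",
    from_vertex_collections=["people"],   # same vertices, different edges
    to_vertex_collections=["people"],
)

# "Friends of friends of Alice" as an AQL traversal over one graph:
cursor = db.aql.execute(
    "FOR v IN 2..2 ANY 'people/alice' GRAPH 'friendship' RETURN v.name"
)
print(list(cursor))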

Can a network of nodes (graph) be represented as one or more trees?

When I convert my graph into trees, I don't mind having duplicate nodes in the trees. Let me explain it the other way around: suppose I have two trees with a common element; I can join them on the common element to create a graph.
Can I do this in the opposite direction, i.e., start with the graph and split an element into duplicates to create multiple trees?
If I understand correctly, you want a tree containing all the links from the original graph, allowing original nodes to appear several times in the tree. It's some kind of spanning tree, except the constraint is on keeping all links (instead of keeping all nodes, as in a classic spanning tree).
So you can use an approach similar to those used for discovering classic spanning trees. The basic approach is based on either breadth-first or depth-first search. You start from a random node and add it to your tree. You then explore its neighbors using the search algorithm, and each time you reach a node, you add it to your tree (with the corresponding link). You must maintain a list of all processed links so that you don't end up processing the same link twice. Also, you need a way to identify each node uniquely, to allow subsequent processing on your tree; for instance, simply number the nodes in the original graph.
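Here is a rough Python sketch of that edge-covering traversal (the DFS variant; the nested-tuple encoding of the tree is my own ad-hoc choice): every arrival at a node emits a fresh tree node carrying the original node's number, so a graph node can appear several times.

def edge_covering_tree(adj, start):
    """adj: dict node -> set of neighbors. Returns nested (id, children) tuples."""
    done = set()  # undirected links already placed in the tree

    def visit(node):
        children = []
        for nb in sorted(adj[node]):
            link = frozenset((node, nb))
            if link not in done:          # process each link exactly once
                done.add(link)
                children.append(visit(nb))
        return (node, children)

    return visit(start)

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}   # a triangle
print(edge_covering_tree(adj, 1))
# (1, [(2, [(3, [(1, [])])])]) -- all 3 links kept; node 1 appears twice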

Build an undirected weighted graph by matching N vertices

Problem:
I want to suggest the top 10 most compatible matches for a particular user, by comparing his/her 'interests' with the interests of all other users. I'm building an undirected weighted graph between users, where the weight is the match score between the two users.
I already have a set S of N users. For any user U in S, I have a set of interests I. After a long time (a week?), I create a new user U with a set of interests and add it to S. To generate the graph edges for this new user, I compare the interest set I of the new user with the interest sets of all the users in S, iteratively. The problem is with this "all the users" part.
Let's talk about the function for comparing interests. An interest in a set of interests I is a string. I'm comparing two strings/interests using WikipediaMiner (it uses Wikipedia links to infer how closely related two strings are, e.g. Billy Jean & Thriller => high match, Brad Pitt & Jamaica => low match). I've asked a question about this too (to see if there's a better solution than the one I'm currently using).
So, the above function takes non-negligible time, and in total it will take a huge amount of time when we compare thousands (maybe millions?) of users and their hundreds of interests. For 100,000 users, I can't afford to make 100,000 user comparisons within a small time budget (<30 sec) this way. But I have to give the top 10 recommendations within 30 seconds, possibly as a preliminary recommendation that I then improve over the next minute or so. Simply comparing 1 user vs. the N users sequentially is too slow.
Question:
Please suggest an algorithm, method, or tool with which I can improve my situation or solve my problem.
I can only sketch an approach here, since the outcome of the steps below depends on the nature of the inter-relations between interests.
Step 1: As your title says, build an undirected weighted graph with interests as vertices and the match score between them as edge weights.
Step 2: Cluster the interests (the most complex part).
K-means is a commonly used clustering algorithm, but it works on a k-dimensional vector space (see the Wikipedia article for how it works): it minimizes, over all clusters, the sum of squared distances from each point to the center of its cluster. In your case there are no dimensions available, so try to apply the same minimization logic by defining a distance rule between two vertices: a higher match means a smaller distance, and vice versa. (What are the different matching levels provided by Wikipedia Miner?) Choose the "mean" of a cluster to be, say, the most connected vertex in the cluster; PageRank sounds like a good option for finding the most connected vertex.
"Pair-counting F-measure" sounds like it suits your need (a weighted graph); check for other options as well.
(Note: keep revisiting this step until you find the right clustering algorithm and the right calibration for the distance rule, number of clusters, and so on.)
Step 3: Evaluate the clusters.
From here on it's a matter of calibrating a couple of things to fit your needs. Examine the clusters and re-evaluate: the number of clusters, the inter-cluster distance, the distances between vertices inside clusters, the sizes of clusters, and the time/precision trade-off (compare the final match results with and without clustering). Go back to step 2 until this evaluation is satisfactory.
Step 4: Examine a new interest.
Iterate through all clusters, calculate the connectivity within each cluster, sort the clusters by connectivity, and for the top x% of sorted clusters, sort and filter out the highly connected interests.
Step 5: Match the user.
Reverse look up the set of all users via the interests obtained in step 4, compare all interests for both users, and generate a score.
Step 6: Apart from the above, you can distribute the load across multiple systems/processors (for example, one machine per cluster), depending on the traffic.
What is the application for this problem, and what is the expected traffic?
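As a rough illustration of steps 1-2 only: the sketch below greedily clusters interests with a k-medoids-style loop over the "higher match => smaller distance" rule. match_score is a hypothetical stand-in for the Wikipedia Miner call, not its real API.

import random

def match_score(a, b):
    """Placeholder for the Wikipedia Miner comparison (0..1, higher = closer)."""
    random.seed(hash((min(a, b), max(a, b))))
    return random.random()

def cluster_interests(interests, k, rounds=10):
    """Greedy k-medoids on the distance rule: distance = 1 - match_score."""
    centers = list(interests[:k])
    for _ in range(rounds):
        clusters = {c: [] for c in centers}
        for it in interests:
            best = min(centers, key=lambda c: 1 - match_score(it, c))
            clusters[best].append(it)
        # New medoid: the member closest to everything in its cluster.
        centers = [
            min(members, key=lambda m: sum(1 - match_score(m, x) for x in members))
            for members in clusters.values() if members
        ]
    return clusters

interests = ["thriller", "billy jean", "brad pitt", "jamaica", "reggae"]
print(cluster_interests(interests, k=2))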
Another way to find the connectivity between a new interest and the set of interests in a cluster C:
Wikipedia Miner runs on a set of wiki documents; let me call it the UNIVERSE.
1. For each cluster, fetch and maintain (index it; Lucene might be handy) the set of highly relevant documents (call it the HRDC) out of the UNIVERSE, so you have N HRDCs if you have N clusters.
2. When a new interest comes in, compute "connectivity with cluster" = (hit ratio of the interest in the HRDC) / (hit ratio of the interest in the UNIVERSE) for each HRDC.
3. Sort the "connectivity with cluster" values and choose the most highly connected clusters.
4. Either compare all the vertices in the cluster with the new interest, or only the highly connected vertices (using PageRank), depending on the time/precision trade-off that suits you.
One flaw is that you're basing your algorithm's complexity on the wrong thing. The real cost is that you have to compare each unique interest against every other unique interest (and each interest against itself).
If all of the interests are unique, then there is probably nothing you can do. However, if you have a lot of duplicate interests, you can perhaps speed up the algorithm as follows.
Create a graph that associates each interest with the users who have that interest, in a way that allows fast look-ups.
Create a graph that records how each interest relates to every other interest, also in a way that allows fast look-ups.
Then, when a new user is added, their interests are compared to all other interests once and stored in that graph. You can then use that information to build a list of users with similar interests; that list will need to be filtered somehow to bring it down to the top 10.
Finally, add the new user and their interests to the graph of users and interests. This is done last so that the user whose interests most closely match isn't the user themselves.
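A small Python sketch of those two look-up structures (similarity is a hypothetical stand-in for the real comparison function; the caching ensures each pair of interests is compared only once):

from collections import defaultdict

users_by_interest = defaultdict(set)   # interest -> {user ids}
sim_cache = {}                         # frozenset({a, b}) -> score

def similarity(a, b):
    """Placeholder comparison; the real one would call WikipediaMiner."""
    return 1.0 if a == b else 0.0

def sim(a, b):
    key = frozenset((a, b))
    if key not in sim_cache:
        sim_cache[key] = similarity(a, b)
    return sim_cache[key]

def top_matches(new_interests, n=10):
    scores = defaultdict(float)
    for interest in new_interests:
        for known, users in users_by_interest.items():
            s = sim(interest, known)
            for u in users:
                scores[u] += s
    return sorted(scores, key=scores.get, reverse=True)[:n]

def add_user(user_id, interests):
    # Register the new user only after scoring, so they don't match themselves.
    for interest in interests:
        users_by_interest[interest].add(user_id)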
Note: there might be some statistical shortcuts along these lines: A is related to B, B is related to C, and C is related to D, therefore A is related to B, C, and D. However, using those kinds of shortcuts probably requires a much better understanding of how your comparison function works, which is a bit beyond my expertise.
Approximate solution:
I forgot to mention it earlier, but what you're doing when comparing users or interests is a "nearest neighbor search" in high dimensions, meaning that for exact solutions a linear search generally beats specialized data structures. So approximation is probably the way to go if you need it faster.
To obtain a quick approximate solution (without guarantees as to how close it is), you'll need a data structure that lets you quickly determine which users are likely to be similar to a new user.
One way to build that structure:
Pick 300 random users. These will be the seed users for 300 clusters. Ideally you'd use the 300 users that are least closely related to each other, but that's probably not practical; still, it might be wise to ensure that no seed user is too closely related to the other seeds (measured as a sum or average of its comparisons to the other seeds).
The clusters are then filled by having each user join the cluster whose seed user most closely matches them.
The top ten can then be determined by picking the 10 most closely related users from the new user's cluster.
If you ensure that the number of clusters and the number of users per cluster both stay fairly close to sqrt(number of users), then you get a fair approximation in O(sqrt(N)) by only checking the points within one cluster. You can improve the approximation by including users from additional clusters, using the seed users to pick which clusters to check. The more clusters you check, the closer you get to O(N) and an exact solution, although there's probably no way to say how close the current solution is to the exact one. Chances are you hit diminishing returns after checking more than about log(sqrt(N)) clusters in total, which would put you at O(sqrt(N) log(sqrt(N))).
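A toy sketch of that sqrt(N)-cluster scheme (compare is a made-up stand-in for the real user-comparison function, here just comparing numbers):

import math
import random

def compare(u, v):
    """Placeholder user similarity: closer numbers = more similar."""
    return -abs(u - v)

def build_clusters(users):
    k = max(1, math.isqrt(len(users)))          # ~sqrt(N) seed users
    seeds = random.sample(users, k)
    clusters = {s: [] for s in seeds}
    for u in users:
        best = max(seeds, key=lambda s: compare(u, s))
        clusters[best].append(u)
    return clusters

def top10(new_user, clusters, n_clusters=1):
    # Compare against seeds first, then only inside the closest cluster(s).
    seeds = sorted(clusters, key=lambda s: compare(new_user, s), reverse=True)
    candidates = [u for s in seeds[:n_clusters] for u in clusters[s]]
    return sorted(candidates, key=lambda u: compare(new_user, u), reverse=True)[:10]

users = list(range(10_000))
clusters = build_clusters(users)   # built once, offline
print(top10(4242, clusters))      # checks ~sqrt(N) seeds plus one cluster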
A few thoughts... not exactly a graph-theory solution.
Assume a finite set of interests. For each user, maintain a bit sequence where each bit represents whether the user has that interest or not.
For a new user, simply AND their bit sequence with each existing user's bit sequence and count the set bits in the result, which gives an idea of how closely their interests match.
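For example, a minimal sketch using Python integers as the bit sequences:

INTERESTS = ["movies", "reggae", "cricket", "chess"]  # finite, fixed order
INDEX = {name: i for i, name in enumerate(INTERESTS)}

def to_mask(interests):
    """Bit i is 1 iff the user has interest i."""
    mask = 0
    for name in interests:
        mask |= 1 << INDEX[name]
    return mask

alice = to_mask(["movies", "chess"])
bob = to_mask(["chess", "reggae", "movies"])
shared = bin(alice & bob).count("1")   # popcount of the AND
print(shared)                          # -> 2 (movies and chess)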
