Is there an easy way to modify the PageRank algorithm so that being connected to many other nodes still increases a node's PageRank, but connections to less important nodes are worth more than connections to more important ones?
I'm not sure if I'm explaining this well, but what I'm thinking of is applying this to hockey scoring. So if Gretzky, for example, has 4 edges, but none of his connected nodes are connected to anything else, and Lemieux also has 4 edges, but his are interconnected (more important), I'd want Gretzky to have a higher score.
In other words, I want PageRank to "adjust" for the quality of your teammates, so you get a higher score by being connected to lower-quality teammates, in contrast to how the algorithm normally adjusts so you get a higher score by being connected to higher-quality teammates.
Below is an ugly diagram of what I'm trying to explain:
Any ideas if something like this exists, and, if so, how to implement it? I use R and igraph for most of my graph theory-related analyses, for whatever it's worth, and if it isn't clear, I'm not really knowledgeable about graph theory.
EDIT: I've looked into personalized PageRank, and it seems somewhat relevant, though I'm not sure how I could set the weights so that the algorithm performs as I want it to.
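One rough direction, sketched below, is a two-pass hack rather than an established algorithm: run ordinary PageRank once to estimate each node's importance, then re-weight every edge by the inverse of its endpoints' scores and run weighted PageRank again. The toy graph, the weighting function 1/(score_u * score_v), and the two-pass idea itself are all assumptions for illustration; the example uses Python/networkx for concreteness even though the question mentions R/igraph.

```python
import networkx as nx

# Toy graph: Gretzky's neighbours are linked only to him; Lemieux's are interlinked.
G = nx.Graph()
G.add_edges_from([("Gretzky", f"G{i}") for i in range(4)])
G.add_edges_from([("Lemieux", f"L{i}") for i in range(4)])
G.add_edges_from([("L0", "L1"), ("L1", "L2"), ("L2", "L3"), ("L3", "L0")])

# Pass 1: ordinary PageRank to estimate each node's importance.
base = nx.pagerank(G)

# Pass 2: weight each edge by the inverse of its endpoints' importance,
# so links involving "unimportant" nodes contribute more, then rerun PageRank.
for u, v in G.edges():
    G[u][v]["weight"] = 1.0 / (base[u] * base[v])

adjusted = nx.pagerank(G, weight="weight")
print(sorted(adjusted.items(), key=lambda kv: -kv[1]))
```

In R/igraph the same two-pass idea could be expressed with page_rank() and its weights argument.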
Related
I know there are some well-known graph partitioning tools like METIS, which is implemented by the Karypis Lab (http://glaros.dtc.umn.edu/gkhome/metis/metis/overview),
but I want to know: is there any method to partition a graph stored in Neo4j?
Or do I have to dump Neo4j's data and manually transform the node and edge format to fit the METIS input format?
Regarding new-ish and interesting algorithms, this is by no means exhaustive or state of the art, but these are the first places I would look:
Specific Algorithm: DiDiC (Distributed Diffusive Clustering) - I used it once in my thesis (Partitioning Graph Databases)
You iterate over all nodes and, for each node, retrieve all of its neighbors in order to spread some amount of "some unit" to each of them (a minimal sketch of this diffusion step is shown after this list)
Easy to implement.
Can be made deterministic
Iterative - as it's based on iterations (like Super Steps in Pregel) you can stop it at any time. The longer you leave it the better the result, in theory (though in some cases, on certain graph shapes it can be unstable)
When we implemented this we ran it for 100 iterations on a machine with ~30GB RAM, for up to ~4 million nodes - it took no more than two days to complete.
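For concreteness, here is a heavily simplified sketch of the diffusive idea: each node carries one load value per cluster, keeps a fraction of it each iteration, and spreads the rest evenly over its neighbours. This is not the full DiDiC algorithm from the thesis (which uses a more elaborate two-system diffusion); the fraction alpha, the random initial seeding, and the dict-of-lists graph representation are arbitrary choices for the example.

```python
import random

def diffusive_clustering(adj, num_clusters, iterations=100, alpha=0.5, seed=0):
    """Toy diffusive clustering sketch (not the full DiDiC algorithm).

    adj: dict mapping node -> list of neighbour nodes.
    Each node carries one load value per cluster; every iteration it keeps a
    fraction alpha of each load and spreads the rest evenly over its
    neighbours.  A node is finally assigned to the cluster for which it holds
    the most load.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    # Seed each node's load entirely in one random cluster.
    load = {v: [0.0] * num_clusters for v in nodes}
    for v in nodes:
        load[v][rng.randrange(num_clusters)] = 1.0

    for _ in range(iterations):
        new_load = {v: [0.0] * num_clusters for v in nodes}
        for v in nodes:
            nbrs = adj[v]
            for c in range(num_clusters):
                if not nbrs:                     # isolated node keeps its load
                    new_load[v][c] += load[v][c]
                    continue
                new_load[v][c] += alpha * load[v][c]
                share = (1 - alpha) * load[v][c] / len(nbrs)
                for w in nbrs:
                    new_load[w][c] += share
        load = new_load

    return {v: max(range(num_clusters), key=lambda c: load[v][c]) for v in nodes}
```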
Specific Algorithm: EvoCut "Finding sparse cuts locally using evolving sets" - local probabilistic algorithm from Microsoft - related to these papers
Difficult to implement
Local algorithm - BFS-like access patterns (random walks)
It's been a while since I read that paper, but I remember it was built on clean abstractions:
EvoNibble (pluggable - decides how much of the neighborhood to add to the current cluster)
EvoCut (calls EvoNibble multiple times to find the local cluster)
EvoPartition (calls EvoCut repeatedly to partition entire graph)
Not deterministic
General Algorithm Family: Hierarchical Graph Clustering
From a high level:
Coarsen the graph by collapsing nodes into aggregate nodes (a minimal sketch of this step appears after the notes below)
coarsening strategy is selectable
Find clusters in the coarsened/smaller graph
clustering strategy is selectable
Incrementally decoarsen the graph, refining the clustering at each step
refining strategy is selectable
Notes:
If the graph changes slowly (or results don't need to be right up to date) it may be possible to coarsen once (or infrequently) then work with the coarsened graph - to save computation
I don't know of a specific algorithm to recommend
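As an illustration of the coarsening step referenced above, here is a minimal sketch that greedily matches each node with an unmatched neighbour and collapses every matched pair into one aggregate node; the clustering and refinement stages would plug in around it. The matching strategy and the dict-of-sets graph representation are assumptions for the example, not taken from any particular paper.

```python
def coarsen_once(adj):
    """One coarsening pass: greedily match nodes with unmatched neighbours and
    collapse each matched pair into a single aggregate node.

    adj: dict mapping node -> set of neighbour nodes (undirected).
    Returns (coarse_adj, mapping) where mapping sends each original node to
    the aggregate node (a tuple of originals) it belongs to.
    """
    matched, mapping = set(), {}
    for v in adj:
        if v in matched:
            continue
        partner = next((w for w in adj[v] if w not in matched and w != v), None)
        group = (v, partner) if partner is not None else (v,)
        for u in group:
            matched.add(u)
            mapping[u] = group

    # Build the coarsened adjacency: an edge between groups whenever any
    # original edge crosses between them.
    coarse_adj = {g: set() for g in set(mapping.values())}
    for v, neighbours in adj.items():
        for w in neighbours:
            if mapping[v] != mapping[w]:
                coarse_adj[mapping[v]].add(mapping[w])
    return coarse_adj, mapping
```

Repeating coarsen_once until the graph is small enough, clustering the smallest graph, and then projecting the cluster labels back through each mapping (with a local refinement pass at every level) gives the overall multilevel scheme.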
General limitations - the things few clustering algorithms do:
Node types not acknowledged - i.e., all nodes treated equally
Relationship types not acknowledged - i.e., all relationships treated equally
Relationship direction not acknowledged - i.e., relationships treated as undirected
Having worked independently with METIS and Neo4j in the past, I am not aware of any tool for generating a METIS file from Neo4j. That being said, writing such a tool should be an easy task and would be a great community contribution.
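As a sketch of what such a tool could look like (using the official Neo4j Python driver and plain Cypher; the queries, the use of id(), which is deprecated in newer Neo4j versions, and the fact that labels, relationship types and weights are ignored are all simplifying assumptions):

```python
from neo4j import GraphDatabase

def export_metis(uri, user, password, out_path):
    """Dump an undirected view of a Neo4j graph into METIS input format.

    METIS format: the first line is "<#vertices> <#edges>"; line i then lists
    the (1-based) indices of the neighbours of vertex i.
    """
    driver = GraphDatabase.driver(uri, auth=(user, password))
    with driver.session() as session:
        node_ids = [r["id"] for r in session.run("MATCH (n) RETURN id(n) AS id")]
        index = {nid: i + 1 for i, nid in enumerate(node_ids)}  # 1-based METIS ids

        adj = {i: set() for i in index.values()}
        for rec in session.run("MATCH (a)-[r]-(b) RETURN id(a) AS a, id(b) AS b"):
            u, v = index[rec["a"]], index[rec["b"]]
            if u != v:                     # METIS does not allow self-loops
                adj[u].add(v)
                adj[v].add(u)
    driver.close()

    num_edges = sum(len(ns) for ns in adj.values()) // 2
    with open(out_path, "w") as f:
        f.write(f"{len(adj)} {num_edges}\n")
        for i in sorted(adj):
            f.write(" ".join(map(str, sorted(adj[i]))) + "\n")
```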
Another approach for integrating METIS with Neo4j might be to connect METIS to Neo4j from C++ via JNI. However, this is going to be much more involved, as it would have to take care of things like transactions, concurrency, etc.
On the more general question of partitioning graphs, it is quite possible to implement some of the better-known and simpler algorithms with reasonable effort.
What is the difference (if any) between a node and a vertex? I can't find the answer after looking at countless sites! Even my book doesn't specify it so I am kind of lost!
It is worth mentioning that I am looking for the difference besides the fact that it is called a 'vertex' when used in a graph and a 'node' when used in a tree.
There is no difference between the words node and vertex. Even some books that explain graph theory and graph algorithms put it like this:
A vertex, denoted by v, is sometimes also called a node.
There are no major or minor differences between them.
This is mentioned in the book Data Structures and Algorithms with Object-Oriented Design Patterns in C# by Bruno R. Preiss.
In "The Practitioner's Guide to Graph Data", the author avoid the term "node/nodes" and only use vertex/vertices and they explain it as below:
...because we are focusing on distributed graphs, and nodes has different meanings in distributed systems, graph theory and computer science.
In distributed systems, a node can be a client, server or peer, while in computer network it can be a computer or a modem. In computer science, as you already point out, it could be used either for graph theory or tree system.
So in the context of graph theory, node and vertex are used interchangeable. But if you would like to make it clear and avoid any misunderstanding, vertex/vertices is the way to go.
I think both terminologies come from different perceptions of graphs and networks. Albert-László Barabási writes in his recent textbook:
"In the scientific literature the terms network and graph are used interchangeably:
Network science / Graph theory
Network / Graph
Node / Vertex
Link / Edge
Yet, there is a subtle distinction between the two terminologies: the {network, node, link} combination often refers to real systems: The WWW is a network of web documents linked by URLs; society is a network of individuals linked by family, friendship or professional ties; the metabolic network is the sum of all chemical reactions that take place in a cell. In contrast, we use the terms {graph, vertex, edge} when we discuss the mathematical representation of these networks: We talk about the web graph, the social graph (a term made popular by Facebook), or the metabolic graph. Yet, this distinction is rarely made, so these two terminologies are often synonyms of each other."
<tl;dr> Same, same, but different.
There is no difference between a node and a vertex. Most books use V to represent the set of vertices of a graph. I've seen node mostly associated with trees.
For instance, you may have come across O(V + E) being used to represent the time complexity for depth first search and breadth first search graph traversals.
Similarly, V is used as part of time complexity analysis for other graph algorithms like Prim's, Kruskal's, etc.
So I have an undirected, unweighted graph. It contains cycles. I would like to find the path which visits the most nodes with no repeat visits to any node. Since this is a graph traversal, you can start and end at any node you like.
Background Research:
I have looked at the Travelling Salesman Problem (TSP); this problem is different in that it does NOT allow you to finish where you started, and there are no weights. I have looked at several other algorithms, but have found none suitable for this problem.
Graph Size: There are 100 nodes in the graph, 10 of which are disconnected.
UPDATE: I have moved this to: https://math.stackexchange.com/questions/243375/what-is-the-maximum-number-of-nodes-i-can-traverse-in-an-undirected-graph-visiti
Look for the Hamiltonian Cycle problem
http://en.wikipedia.org/wiki/Hamiltonian_cycle
You should take a look at the Wikipedia entry, which has an algorithm for acyclic graphs. Your graph has cycles, which makes your problem NP-hard.
I would try to create a DAG with nodes representing strongly connected components. Then you could at least find the path that visits the most strongly connected components. You could then expand that path by replacing the individual (strongly connected component) nodes with the longest paths in each of the subgraphs.
Finding the longest paths in the subgraphs is now the same as your original problem, but at least your graphs are smaller. If you're in luck, the subproblems are easy and you're done. In the general case they might not be so small, and you could use some advanced heuristics. Maybe have a look at this paper or this question (you could use the answer there to solve your problem completely, but I'm not sure).
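For reference, a minimal sketch of the acyclic case mentioned above: longest path in a DAG by dynamic programming over a topological order. The (nodes, edges) input format is just an assumption for the example; this is only valid when the input really is acyclic.

```python
from collections import defaultdict

def longest_path_dag(nodes, edges):
    """Longest (most-nodes) path in a DAG via DP over a topological order.

    nodes: iterable of hashable node ids; edges: list of directed (u, v) pairs.
    Returns the list of nodes on one longest path.  With cycles the general
    problem is NP-hard, as noted above.
    """
    succ = defaultdict(list)
    indeg = {v: 0 for v in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    # Kahn's algorithm for a topological order.
    order, queue = [], [v for v in indeg if indeg[v] == 0]
    while queue:
        u = queue.pop()
        order.append(u)
        for w in succ[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)

    # best[v] = length of the longest path ending at v; pred for reconstruction.
    best = {v: 1 for v in best} if False else {v: 1 for v in order}
    pred = {v: None for v in order}
    for u in order:
        for w in succ[u]:
            if best[u] + 1 > best[w]:
                best[w] = best[u] + 1
                pred[w] = u

    end = max(best, key=best.get)
    path = []
    while end is not None:
        path.append(end)
        end = pred[end]
    return path[::-1]
```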
I have graphs with thousands of nodes to millions of nodes. I want to detect all possible cycles in such graphs.
I use a hash table to store the edges: (source node, edge weight) -> (target node).
What can be the efficient way of implementing it in OCaml?
It looks like Tarjan's algorithm is the best one.
What would be the most efficient implementation for the same?
Yes, Tarjan's algorithm for strongly connected components is a good solution. You may also use so-called path-based strong component algorithms which have (when done carefully) comparable linear complexity.
If you pick reasonable data structures, they should work. It's hard to say much more before you have implemented and profiled a prototype.
I don't understand what your graph representation is: are your hashed keys really a (node, weight) pair? Then how do you find all the neighbors of a given node? For a large graph structure you should of course optimize access time, but also memory efficiency.
If you really want to find all possible cycles, the problem seems at least exponential in the worst case. For a complete graph, every nonempty subset of nodes gives you a different cycle (including a link from the last node back to the first). Furthermore, every cyclic permutation of every subset gives you a different cycle. Depending on the sparsity of your graphs, the problem could be tractable in practice.
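For the SCC route suggested above, here is a compact sketch of Tarjan's algorithm. It is written in Python only to show the structure; an OCaml version would follow the same scheme with Hashtbl and explicit stacks, and for graphs with millions of nodes the recursion should be rewritten iteratively.

```python
import sys

def tarjan_scc(adj):
    """Tarjan's strongly connected components (recursive sketch).

    adj: dict mapping node -> list of successor nodes.
    Returns a list of SCCs (each a list of nodes).  Every SCC with more than
    one node, or with a self-loop, contains at least one cycle.
    """
    sys.setrecursionlimit(1_000_000)   # sketch only; rewrite iteratively for huge graphs
    index, low, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in adj.get(v, []):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in adj:
        if v not in index:
            strongconnect(v)
    return sccs
```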
I didn't think this was mathsy enough for MathOverflow, so I thought I would try here. It's for a programming problem though, so it's sort of on topic.
I have a graph (in the mathematical sense), where I have a number of vertices, A, B, C, ..., each with a load "value". These are connected by some arbitrary topology (there is at least a spanning tree). The objective of the problem is to transfer load between the vertices, preferably using the minimal flow possible.
I wish to know the transfer values along each edge.
The solution to the problem I was thinking of was to treat it as a heat transfer problem, and iteratively transfer load, or solve the heat equation in some way, counting the amount of load dissipated along each edge. The amount of heat transfer until the network reaches steady state should thus yield the result.
Whilst I think this will work, it seems like the stupid solution. I am wondering if there is a reference or sample problem that someone can point me to -- I am unsure what keywords to search for.
I could not see how to cast the problem as either a simplex problem or a network flow problem -- each edge has unlimited capacity, and so does each node. There are two simultaneous minimisation problems to solve, so simplex seems not to apply??
If you have a spanning tree, then there is an O(n) solution.
Calculate the average value. (In the end, every node will have this value.) Then iterate over the leaves: calculate the change in value necessary for that leaf (average - value), add that change to the leaf, assign it (as flow) to the edge that connects the leaf to the tree, and subtract it from the other node. Remove that leaf and edge from the tree (not from the graph, of course); the other node may then become a new leaf, if it has only one remaining edge in the tree.
When you reach the last node, if you've done the arithmetic right it will end up with the average value, just like all the other nodes.
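A minimal sketch of this leaf-peeling procedure, assuming the spanning tree is given as a list of edges and the loads as a dict (both assumptions about the input format):

```python
def balance_on_tree(values, tree_edges):
    """Leaf-peeling sketch of the O(n) balancing described above.

    values: dict node -> current load; tree_edges: list of (u, v) spanning-tree
    edges.  Returns a dict mapping each processed edge (parent, leaf) to the
    flow sent from parent to leaf (negative means flow in the other direction),
    such that every node ends up at the average value.
    """
    n = len(values)
    avg = sum(values.values()) / n
    adj = {v: set() for v in values}
    for u, v in tree_edges:
        adj[u].add(v)
        adj[v].add(u)

    load = dict(values)
    flow = {}
    leaves = [v for v in adj if len(adj[v]) == 1]
    processed = 0
    while leaves and processed < n - 1:
        leaf = leaves.pop()
        parent = next(iter(adj[leaf]))
        deficit = avg - load[leaf]          # what the leaf still needs
        flow[(parent, leaf)] = deficit      # positive: flow from parent to leaf
        load[leaf] += deficit
        load[parent] -= deficit
        adj[parent].discard(leaf)           # peel the leaf off the tree
        adj[leaf].discard(parent)
        processed += 1
        if len(adj[parent]) == 1:           # parent may have become a new leaf
            leaves.append(parent)

    return flow
```

After the loop, the single remaining node holds exactly the average, which matches the check described in the answer above.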