Gremlin: need an efficient query to check whether two vertices are connected?

Our app needs to check whether two vertices are connected via any path.
The app does not care about the segments in a path, or the shortest path.
The app only needs to know if two vertices share a common sub-graph.
My question: given two vertices with IDs A and B, respectively, what Gremlin query works well to answer the question "are A and B connected, somehow?"

This one should do the trick:
g.V(A).
repeat(both().dedup()).
until(hasId(B)).
hasNext()
Start at A, keep visiting neighbors (both() follows edges in either direction), don't visit any vertex twice, and stop as soon as B is reached. Obviously, this can run into timeouts (or memory issues) if you are dealing with huge subgraphs.
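To make the behaviour of that traversal concrete, here is a minimal sketch in plain Python rather than Gremlin. It assumes the graph fits in memory as an adjacency dict (the names `adj` and `connected` are made up for illustration) and that each edge is listed in both directions, which mirrors both(). Treating a vertex as connected to itself is also an assumption of the sketch.

```python
from collections import deque

def connected(adj, a, b):
    """Breadth-first search from a; True as soon as b is reached.

    adj is a hypothetical in-memory adjacency dict {vertex: [neighbors]},
    with every edge listed in both directions (like both() in Gremlin).
    """
    if a == b:
        return True  # assumption: a vertex counts as connected to itself
    seen = {a}                # plays the role of dedup()
    queue = deque([a])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w == b:        # plays the role of until(hasId(B))
                return True
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False              # the whole component of a was explored without meeting b

adj = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
print(connected(adj, 1, 3), connected(adj, 1, 4))
```

The `seen` set is what grows with the size of the subgraph, which is the memory issue mentioned above.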

Related

How to compute the average(or sum) of node values in a network?

Consider a network (graph) of N nodes, each of them holding a value. How would you design a program/algorithm (run by each node) that allows every node to compute the average (or sum) of all the node values in the network?
Assumptions are:
Direct communication between nodes is constrained by the graph topology, which is not a complete graph. Any other assumptions, if necessary for your algorithm, are allowed. The weakest one I assume is that there's a loop in the graph that contains all the nodes.
N is finite.
N is sufficiently large that you can't store all the values and then compute their average (or sum). For the same reason, you can't "remember" whose values you've received (thus you can't just redistribute the values you've received, add those you've not seen to a buffer, and get a result).
(The tags may not be right since I don't know which field this kind of problem belongs to, or whether it's some kind of general problem.)

That is an interesting question. Here are some assumptions I've made before I present a partial solution:
The graph is connected (in case of a directed graph, strongly connected)
The nodes only communicate with their direct neighbours
It is possible to hold and send the sum of all numbers; this means the sum either won't exceed a long, or you have a data structure large enough that the sum won't overflow it
I'd go with depth-first search. Node N0 would initiate the algorithm and send its value + the count to the first neighbour (N0.1). N0.1 would add its own value, increment the count, and forward the message to the next neighbour (N0.1.1). In case the message comes back to either N0 or N0.1, they just forward it to another neighbour of theirs (N0.2 or N0.1.2).
The problem now is to know when to terminate the algorithm. Preferably you want to terminate it as soon as you've reached all nodes, and afterwards just broadcast the final message. If you know how many nodes there are in the graph, just keep forwarding the message to the next node until every node has been reached. The last node will know that it has been reached (it can compare the count variable with the number of nodes in the graph) and broadcast the message.
If you don't know how many nodes there are, and it's an undirected graph, then it will just be a depth-first search implementation in a graph. This means that if N0.1 gets a message from anyone other than N0.1.1 it will just bounce the message back, as you can't send messages to the parent when performing depth-first search. If it is a directed graph and you don't know the number of nodes, then you either come up with a mathematical model to prove when the algorithm has finished, or you learn the number of nodes.
I've found a paper here, proposing a gossip based algorithm to count the number of nodes in a dynamic network: https://gnunet.org/sites/default/files/Gossipico.pdf maybe that will help. Maybe you can even use it to sum up the nodes.
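For the undirected case, the token-passing idea can be simulated in a few lines. This is only a sketch under the assumptions listed at the top of this answer (connected, undirected graph, neighbours only); `dfs_token_sum`, `adj` and `values` are made-up names, and the dictionaries stand in for state that each node would keep locally (a visited flag, its parent, and which neighbour to try next). No node stores the other nodes' values; only the running sum and count travel with the token.

```python
def dfs_token_sum(adj, values, start):
    """Simulate one token walking an undirected, connected graph depth-first.

    The token carries (total, count). Per-node state is kept in dicts here,
    but each entry would live on the node itself in a real network.
    """
    visited = {start}                    # one boolean per node
    parent = {start: None}               # who first sent me the token
    next_idx = {v: 0 for v in adj}       # next neighbour this node will try
    total, count = values[start], 1
    node = start
    while True:
        nbrs = adj[node]
        if next_idx[node] < len(nbrs):
            nxt = nbrs[next_idx[node]]
            next_idx[node] += 1
            if nxt not in visited:
                visited.add(nxt)
                parent[nxt] = node
                total += values[nxt]     # the new node adds its own value
                count += 1
                node = nxt
            # else: an already-visited neighbour bounces the token straight back
        elif parent[node] is not None:
            node = parent[node]          # all neighbours tried: return token to parent
        else:
            return total, count          # back at the initiator with nothing left to try

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
values = {0: 1, 1: 2, 2: 3, 3: 4}
total, count = dfs_token_sum(adj, values, 0)
print(total, count, total / count)
```

Termination needs no knowledge of N here: the initiator knows it is done when the token comes back and it has no untried neighbour left, at which point it can broadcast the result.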

Visiting graph edges with independent paths

Given a directed graph with multiple start nodes and multiple end nodes, I need to form paths that visit every reachable edge, but I cannot visit any edge (or vertex) more than once during a single pass. [This is to electrically test every connection in a network by sending signals from start to end nodes, but I cannot allow paths to short together.]
Because I cannot re-visit edges during a single pass:
I can safely ignore the cycles in the graph.
I know each path I form will block other paths.
Consequently, I cannot visit every reachable edge in one pass, so multiple passes are necessary.
From context, I know that the minimum number of passes will be the maximum number of edges entering any vertex. Once I finish a given pass, I am free to re-visit edges that were visited in previous passes, but never-visited edges are the ones that I most want to visit.
I would like to visit "many" edges per pass, so that I can reduce the total number of passes, but I do not strictly need to minimize the number of passes.
Any suggestions on algorithms to accomplish this? It sounds a little like the route inspection problem, except that my graph is directed.

It is not clear from the question whether you have one or many start points and one or many end points. For simplicity let me assume a "one-to-many" network. Then your requirement (not visiting any edge or vertex more than once) means you actually generate a spanning tree of your graph with the given root.
A simple but not 100% solution that comes to mind is the following:
Assign some initial weights to the edges and apply random spanning tree algorithm. Then decrease the weight (actually, relative probability) of visited edges. It is very likely all edges will be visited.
In the case of a "many-to-many" connection you can play with different starting points. If some sources are not connected to some sinks the algorithm would throw an exception. If this is not what you expect, you can run a regular DFS first to collect all reachable vertices into some set; then you can use this set as a filter to form a boost::filtered_graph.
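A rough sketch of the weighted random-tree idea for the one-to-many case, in Python rather than Boost. Everything here is an assumption made for illustration: the function name `tree_passes`, the way the tree is grown (pick a random frontier edge, weighted), and the halving of weights for used edges. It is not a uniform random spanning tree algorithm, just a simple stand-in with the same flavour.

```python
import random

def tree_passes(edges, root, max_passes=100, decay=0.5, rng=None):
    """Repeatedly grow a random tree from root, favouring rarely used edges.

    edges: list of distinct directed edges (u, v). Returns (passes, covered),
    where each pass is a list of edges in which no vertex is entered twice.
    """
    rng = rng or random.Random()
    weight = {e: 1.0 for e in edges}
    out = {}
    for u, v in edges:
        out.setdefault(u, []).append((u, v))
    covered, passes = set(), []
    for _ in range(max_passes):
        in_tree, tree = {root}, []
        frontier = list(out.get(root, []))
        while frontier:
            # keep only edges whose head is still outside the tree
            frontier = [e for e in frontier if e[1] not in in_tree]
            if not frontier:
                break
            e = rng.choices(frontier, weights=[weight[x] for x in frontier])[0]
            tree.append(e)
            in_tree.add(e[1])
            frontier.extend(out.get(e[1], []))
        for e in tree:
            weight[e] *= decay           # visited edges become less likely next time
        covered.update(tree)
        passes.append(tree)
        if covered == set(edges):
            break
    return passes, covered

edges = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c")]
passes, covered = tree_passes(edges, "s", rng=random.Random(0))
print(len(passes), sorted(covered))
```

Edges that enter the root, or that are unreachable from it, can never be part of such a tree, which is why the loop is capped by `max_passes` instead of running until everything is covered.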

Who can explain the 'Replication' in dynamo paper?

In the Dynamo paper: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
The Replication section says:
To account for node failures, preference list contains more than N nodes.
I want to know why. Also, does this 'node' mean a virtual node?

It is there to increase Dynamo's availability. If the top N nodes in the preference list are healthy, the other nodes will not be used. But if some of the top N nodes are unavailable, the other nodes will be used in their place. For write operations, this is called hinted handoff.
The diagram makes sense both for physical nodes and virtual nodes.
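As a rough illustration of that behaviour (this is not code from the paper), the sketch below walks a ring of virtual-node positions clockwise from a key and collects distinct physical nodes. The names `preference_list` and `replicas`, the `extra` parameter, and the ring layout are all assumptions; the paper does not say how many nodes beyond N the list holds.

```python
from bisect import bisect_right

def preference_list(ring, key_pos, n, extra=2):
    """Collect n + extra distinct physical nodes clockwise from key_pos.

    ring: list of (position, physical_node) sorted by position; one physical
    node may appear at several positions (its virtual nodes).
    """
    positions = [p for p, _ in ring]
    start = bisect_right(positions, key_pos) % len(ring)  # first position after the key
    result = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in result:           # skip positions owned by a node already listed
            result.append(node)
        if len(result) == n + extra:
            break
    return result

def replicas(pref, n, is_up):
    """First n healthy nodes of the preference list; later entries are fallbacks."""
    return [node for node in pref if is_up(node)][:n]

ring = [(10, "A"), (20, "B"), (30, "A"), (40, "C"), (50, "B"), (60, "D"), (70, "E")]
pref = preference_list(ring, key_pos=15, n=3, extra=1)
print(pref)                                   # ['B', 'A', 'C', 'D']
print(replicas(pref, 3, lambda node: node != "A"))  # ['B', 'C', 'D']
```

With A down, D takes its place for the key, which is the situation hinted handoff covers for writes.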

I also don't understand the part you're talking about.
Background:
My understanding of the paper is that, since Dynamo's default replication factor is 3, each node N is responsible for the ring range from N-3 to N (while also being the coordinator for ring range N-1 to N).
That explains why:
node B holds keys from F to B
node C holds keys from G to C
node D holds keys from A to D
And since range A-B falls within all those ranges, nodes B, C and D are the ones that have that range of key hashes.
The paper states:
The section 4.3 Replication:
To address this, the preference list for a key is constructed by skipping positions in the ring to ensure that the list contains only distinct physical nodes.
How can preference list contain more than N nodes if it is constructed by skipping virtual ones?
IMHO they should have stated something like this:
To account for node failures, ring range N-3 to N may contain more than N nodes, N physical nodes plus x virtual nodes.

The distributed DBMS Dynamo DB falls in the class that sacrifices Consistency. Refer to the image below:
So, the system is inconsistent even though it is highly available. Because network partitions are a given in Distributed Systems you cannot not pick Partition Tolerance.
Addressing your questions:
To account for node failures, preference list contains more than N nodes. I want to know why?
One fact of large-scale distributed systems is that in a system of thousands of nodes, node failure is the norm.
You are bound to have a few nodes failing in such a big system. You don't treat it as an exceptional condition. You prepare for such situations. How do you prepare?
For Data: You simply replicate your data on multiple nodes.
For Execution: You perform the same execution on multiple nodes. This is called speculative execution. As soon as you get the first result from the multiple executions you ran, you cancel the other executions.
That's the answer right there - you replicate your data to prepare for the case when node(s) may fail.
To account for node failures, preference list contains more than N nodes. Does this 'node' mean virtual node?
I wanted to ensure that I always have access to my house. So I copied my house's keys and gave them to another family member of mine. This guy put those keys in a safe in our house. Now when we all go out, I'm under the illusion that we have another set of keys, so in case I lose mine, we can still get into the house. But... those keys are in the house itself. Losing my keys simply means I lose access to my house. This is what would happen if we replicated the data on virtual nodes instead of physical nodes.
A virtual node is not a separate physical node, so when the physical node to which this virtual node is mapped fails, the virtual node goes away as well.
This 'node' cannot mean virtual node if the aim is high availability, which is the aim in Dynamo DB.

Javascript library for graph operations

Are there any suggested JavaScript alternatives to Python's pygraph or NetworkX? It should be noted that visualization is not necessary (it is even preferred not to have it).
The library should be able to parse a format capable of retaining labeling and attributes on nodes and edges (DOT, GraphML?). It should support operations such as:
Listing nodes and edges.
Given a node, the edges which point in/out to/from it.
Given a node or edge, return the attached attributes.
Given two nodes that are connected, determine the most complete path. When running this operation a predicate function should be provided to determine if a node should be included in the search or not.
To put it in context, the web browser based application will traverse the graph from a pre-determined start node. Each node holds an attribute 'userValue' which is compared against conditions (rules?) held as attributes on the node's out-edges. For the traversal to continue, the edge condition must evaluate to true against 'userValue'. The graph will always contain a predetermined start and end (or goal) node.

You could try
JSNetworkX
A port of the NetworkX graph library to JavaScript
http://felix-kling.de/JSNetworkX/

Fixed length path between two graph nodes

Is there an algorithm that will, if given two nodes on a graph, find a route between them that takes the specified number of hops? Any node can be connected to any other.
The points at the moment are located in 2D space, so I'm not sure if a graph is the best approach.

Have you tried iterative-deepening DFS?

If you have nodes and are seeking routes in terms of hops, then a graph is probably the right approach. I'm not sure I understand what you are trying to do and what the constraints are, though, especially with respect to "Any node can be connected to any other", which seems a bit open-ended.
Disregarding that, however, with some graph representation:
Start at the first node and do a depth-first search from there, terminating a branch when (a) the hops taken exceed your specified number or (b) it arrives at the second node. This finds the first (not the only) path connecting the two nodes in at most that many hops.
If it has to be exactly the specified number of hops, terminate any branch of the search once the hops have gone over, and report success only when a branch arrives at the second node with exactly that many hops.
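A minimal sketch of the exact-hops variant, assuming the graph is given as an adjacency dict. The names `path_with_hops` and `allow_revisit` are made up for the example, and whether nodes may repeat along the route is an assumption the question leaves open.

```python
def path_with_hops(adj, src, dst, hops, allow_revisit=False):
    """Depth-limited DFS: return one path src -> dst with exactly `hops` edges, or None.

    adj: hypothetical adjacency dict {node: [neighbors]}.
    allow_revisit: whether a node may appear more than once on the path.
    """
    def dfs(node, remaining, path):
        if remaining == 0:
            # hop budget used up: succeed only if we are standing on dst
            return list(path) if node == dst else None
        for nxt in adj.get(node, ()):
            if not allow_revisit and nxt in path:
                continue
            path.append(nxt)
            found = dfs(nxt, remaining - 1, path)
            path.pop()
            if found:
                return found
        return None

    return dfs(src, hops, [src])

adj = {"a": ["b", "c"], "b": ["c"], "c": ["d"], "d": []}
print(path_with_hops(adj, "a", "d", 3))  # ['a', 'b', 'c', 'd']
print(path_with_hops(adj, "a", "d", 2))  # ['a', 'c', 'd']
print(path_with_hops(adj, "a", "d", 4))  # None
```

The search is exponential in the number of hops in the worst case, so it is only practical for small hop counts or sparse graphs.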

Dumb approach: (data structure is array of stacks). This is basically doing Breadth First Search (BFS) to depth N, except that if loops are allowed (you did not clarify but I assume they are), you don't exclude the visited nodes from further searching.
Push starting node on a stack stored in the array at index 0 (index=depth)
For each level/index "l" 0-N:
For each node on a stack stored at level "l", find all its neighbors, and push them onto a stack stored in level "l+1".
Important: if your task allows finding paths that contain loops, do NOT check whether you already visited any node you add. If it does not allow loops, use a hash of visited nodes to avoid adding any node twice.
Stop when you end level "N-1".
Loop over all the nodes you just added to stack at index "N" and find your destination node. If found: success, if not, no such path.
Please note that if by "every node can be connected" you are implying a FULLY CONNECTED graph, then there exists a mathematical answer not involving actually visiting nodes
(however, the formula is too long to write in the text-entry field of StackOverflow)
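For the case where loops are allowed, the level-by-level expansion can be sketched as follows. One deliberate deviation from the steps above: each level is kept as a set rather than a stack, so a node is stored at most once per level (this only answers whether such a route exists, not which one). The name `reachable_in_exactly` and the adjacency dict `adj` are assumptions for the example.

```python
def reachable_in_exactly(adj, src, dst, hops):
    """True if some walk of exactly `hops` edges leads from src to dst.

    Nodes may repeat along the walk (loops allowed). adj is a hypothetical
    adjacency dict {node: [neighbors]}.
    """
    level = {src}                        # nodes reachable in exactly 0 hops
    for _ in range(hops):
        # nodes reachable in exactly one more hop than the previous level
        level = {nxt for node in level for nxt in adj.get(node, ())}
        if not level:
            return False                 # dead end before using all hops
    return dst in level

adj = {1: [2], 2: [1, 3], 3: [2]}
print(reachable_in_exactly(adj, 1, 3, 2))  # True  (1 -> 2 -> 3)
print(reachable_in_exactly(adj, 1, 3, 3))  # False
print(reachable_in_exactly(adj, 1, 3, 4))  # True  (1 -> 2 -> 1 -> 2 -> 3)
```

Each level holds at most one entry per node, so the work is bounded by the number of hops times the number of edges.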
