Given two Gremlin queries q1 and q2 and their results ri = qi.toSet(), I want to find all nodes in r1 that have a connection to a node in r2 - ignoring edge labels and direction.
My current approach included the calculation of shortest paths between the two result sets:
q1.shortestPath().with_(ShortestPath.target, q2).toList()
However, I found the shortest path calculation in Tinkerpop is unsuitable for this purpose because the result will be empty if there are nodes in r1 without any connection to any node in r2.
Instead, I thought about connected components, but the connectedComponents() step will yield all connected components found and I would have to filter them to find the connected component that meets the above requirements.
Do you have suggestions on how I could tackle this problem in gremlin-python?
Here is one way of doing what I think you need in Gremlin Python. This may or may not be efficient depending on the size and shape of your graph. In my test graph only vertices 1, 2 and 3 have a route to either 12 or 13. This example does not show you the route taken, just that at least one path exists (if any exist). Note that it follows outgoing edges only; replace out() with both() to ignore edge direction.
>>> ids = g.V('1','2','3','99999').id().toList()
>>> ids
['1', '2', '3', '99999']
>>> ids2 = g.V('12','13').id().toList()
>>> ids2
['12', '13']
>>> g.V(ids).filter(__.repeat(__.out().simplePath()).until(__.hasId(within(ids2))).limit(1)).toList()
[v[1], v[2], v[3]]
You can also use dedup() instead of simplePath() and limit() if you only care that any route exists.
g.V(ids).filter(__.repeat(__.out().dedup()).until(__.hasId(within(ids2)))).toList()
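If the edge list is small enough to pull to the client, the same "does any route exist" check can be sketched in plain Python. This is not the Gremlin approach above but a different technique: a single breadth-first search started from all target nodes at once, treating every edge as undirected. The function name and the sample edges are hypothetical and only illustrate the idea.

```python
from collections import defaultdict, deque

def sources_connected_to_targets(edges, sources, targets):
    """Return the sources that have an undirected route to any target."""
    # Build an undirected adjacency map: direction and labels are ignored.
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Multi-source BFS: start from every target and spread outwards.
    reached = set(targets)
    queue = deque(reached)
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in reached:
                reached.add(neighbour)
                queue.append(neighbour)

    # A source that is itself a target counts as connected here.
    return [s for s in sources if s in reached]

# Hypothetical edge list in which only 1, 2 and 3 can reach 12 or 13.
edges = [("1", "2"), ("2", "12"), ("13", "3"), ("4", "5")]
result = sources_connected_to_targets(edges, ["1", "2", "3", "99999"], ["12", "13"])
print(result)
```

Because the search runs once from the targets rather than once per source, the cost is linear in the number of edges regardless of how many nodes r1 contains.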
Goal
The objective is to efficiently generate random walks on a relatively large graph with uneven probabilities of going through edges depending on their type.
Configuration
Ubuntu VM, 23 GB RAM
JanusGraph 0.6.1 full
Local graph (default conf/remote.yaml file used)
~1.8m vertices (~28k will be start nodes for the random walks)
~21m relationships (they can all be used in the random walks)
What I am doing
I am currently generating random walks with the sample command:
g.V(<startnode_id>).
  repeat( local( both().sample(1) ) ).
    times(<desired_randomwalk_length>).
  path()
What I tried
I tried using a gremlinpython script to create a random walk generator that would first get all edges connected to the current node, then pick randomly an edge to go through and repeat <desired_randomwalk_length> times.
from random import randint
from typing import List

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.structure.graph import Vertex

connection = DriverRemoteConnection(<URL>, "g")
g = traversal().withRemote(connection)

def get_next_node(start: Vertex) -> Vertex:
    # Fetch every neighbour (either direction), then pick one uniformly at random.
    next_vertices = g.V(start.id).both().fold().next()
    return next_vertices[randint(0, len(next_vertices) - 1)]

def get_random_walk(start: Vertex, length: int = 10) -> List[Vertex]:
    # Each step costs one round trip to the server.
    current_node = start
    random_walk = [current_node]
    for _ in range(length):
        current_node = get_next_node(current_node)
        random_walk.append(current_node)
    return random_walk
Issues
While testing on a subset of the total graph (400k vertices, 1.5m rel), I got these results
Sample query, <desired_randomwalk_length> of 10: 100k random walks in 1h10
Gremlinpython function, <desired_randomwalk_length> of 4: 2k random walks in 1h+
The sample command is really fast, but there are a few problems:
It doesn't seem to truly be a uniform distribution pick amongst the edges (it seems to be successive coin tosses) which could lead to certain paths being taken more often, which then diminishes the interest of generating random walks. (I can't directly do what is recommended here as the nodes ids aren't in a sequence, thus I have to acquire them first.)
I haven't found a way to give different probabilities to different types of relationships.
Is there a better way to do random walks with Gremlin?
If there is none, is there a way to modify the sample query to assign probabilities to types of edges? Maybe even a way to have a better distribution of the sampling?
As a last resort, is there a way to improve the queries to make this "by hand" with a gremlinpython script?
Thanks to everyone reading/replying!
EDIT
Is there a way to do the following:
Given a r_type1, r_type2, r_type3, ... the acceptable relationship type for this random walk
Given a proba1, proba2, proba3, ... the probabilities of going through these relationship types
For each step
Sample a node for each relationship type r_type1, r_type2, r_type3, ...
Keep only one according to the probabilities proba1, proba2, proba3, ...
I think the second step could be done by sampling multiple nodes for each relationship type, in accordance with the probabilities (which could be done by using a gremlinpython script to build the query). This still leaves the question of how to sample on multiple relationships from a single node, and how to randomly pick one of the sampled nodes.
I hope this is clear!
Thanks to Kelvin Lawrence's Practical Gremlin (especially the union section), I managed to do what I wanted (or close enough).
The Gremlin query I get is the following:
g.V(<vertex_id>).
  repeat(
    local(
      union(
        both(<relationship_type1>).sample(N1),
        both(<relationship_type2>).sample(N2),
        ...
      ).
      sample(1)
    )
  ).times(<walk_length>).
  path()
The N_ values are set independently of the node, such that the least probable transition yields exactly 1 sample. This also means that the probabilities are not exactly respected when the number of relationships of a given type is less than the corresponding N_ value.
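One possible way to derive those N_ values from the target probabilities is sketched below. It is an assumption about how the counts could be computed, not the code actually used: each probability is divided by the smallest one and rounded, so the least probable type gets 1. The function name and the relationship labels are hypothetical.

```python
def samples_per_type(probabilities):
    """Scale probabilities so that the least probable type gets 1 sample."""
    smallest = min(probabilities.values())
    # Rounding means the resulting ratios only approximate the probabilities.
    return {
        rel_type: max(1, round(p / smallest))
        for rel_type, p in probabilities.items()
    }

# Hypothetical relationship types and probabilities.
nb_samples = samples_per_type({"knows": 0.6, "likes": 0.3, "owns": 0.1})
print(nb_samples)
```

With these inputs the ratios are 6:3:1, so a vertex that has at least six "knows" edges, three "likes" edges and one "owns" edge contributes ten candidates to the final sample(1).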
The union part is built in python using gremlinpython (nb_samples is the dictionary storing the number of samples needed for each relationship type)
from gremlin_python.process.graph_traversal import __, GraphTraversal

next_node_traversal: GraphTraversal = __.union(
    *[
        __.both(key).sample(nb_samples[key])
        for key in nb_samples
    ]
).sample(1)
(Here we are using the * operator to unpack the list when passing it as argument to the union method)
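For comparison, the per-step selection described in the question (pick a relationship type according to its probability, then pick one neighbour of that type) can be simulated client-side on an in-memory adjacency map. This is a minimal sketch of a different technique from the union/sample query above, and it assumes the relevant part of the graph fits in memory. The graph layout, the function names and the rule of skipping types with no neighbours (which renormalises the remaining probabilities) are all assumptions made for illustration.

```python
import random

def weighted_step(neighbours_by_type, type_probs, rng):
    """Pick a relationship type by probability, then a neighbour uniformly."""
    # Keep only types that have neighbours here and a non-zero probability.
    candidates = {
        t: ns for t, ns in neighbours_by_type.items()
        if ns and type_probs.get(t, 0) > 0
    }
    if not candidates:
        return None  # dead end: nothing to walk to
    types = list(candidates)
    weights = [type_probs[t] for t in types]
    chosen_type = rng.choices(types, weights=weights, k=1)[0]
    return rng.choice(candidates[chosen_type])

def random_walk(graph, start, length, type_probs, rng=None):
    """Walk up to `length` steps; stops early at a dead end."""
    rng = rng or random.Random()
    walk = [start]
    for _ in range(length):
        nxt = weighted_step(graph.get(walk[-1], {}), type_probs, rng)
        if nxt is None:
            break
        walk.append(nxt)
    return walk

# Hypothetical graph: vertex -> {relationship type -> neighbours}.
graph = {
    "a": {"knows": ["b"], "likes": ["c"]},
    "b": {"knows": ["a"]},
    "c": {"likes": ["a"]},
}
walk = random_walk(graph, "a", 5, {"knows": 0.9, "likes": 0.1}, random.Random(0))
print(walk)
```

Unlike the sample(N) construction, this respects the probabilities even when a vertex has fewer edges of a type than the corresponding N_ value, at the cost of holding the adjacency data on the client.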
Can someone help me please with this simple query...Many thanks in advance...
I am using the following Gremlin query and it works well, giving me the original vertex (v) (with id=12345), its edges (e) and the child vertex (id property). However, if the original vertex 'v' (with id=12345) has no outgoing edges, the query returns nothing. I still want the properties of the original vertex ('v') even if it has no outgoing edges and no child. How can I do that?
g.V().has('id', '12345').as('v').
outE().as('e').
inV().
as('child_v').
select('v', 'e', 'child_v').
by(valueMap()).by(id).by(id)
There are a couple of things going on here but the major update you need to the traversal is to use a project() step instead of a select().
select() and project() steps are similar in that they both allow you to format the results of a traversal; however, they differ in (at least) one significant way. select() steps function by allowing you to access previously traversed and labeled elements (via as()). project() steps allow you to take the current traverser and branch it to manipulate the output moving forward.
In your original traversal, when the original v has no outgoing edges, all the traversers are filtered out at the outE() step. Since no traversers remain after outE(), the remainder of the traversal has no input stream, so there is no data to return. If you use a project() step after the original v, you are able to return the original traverser as well as the edges and incident vertices. This does lead to a slight complication when handling cases where no out edges exist. Gremlin does not handle null values, such as no out edges existing, so you need to return some constant value for these cases using a coalesce() step.
Here is a functioning version of this traversal:
g.V().hasId(3).
project('v', 'e', 'child_v').
by(valueMap()).
by(coalesce(outE().id(), constant(''))).
by(coalesce(out().id(), constant('')))
Currently you will get a lot of duplicate data: in the above query you will get the vertex properties E times, once per edge. It will probably be better to use project():
g.V('12345').project('v', 'children').
by(valueMap()).
by(outE().as('e').
inV().as('child').
select('e', 'child').by(id).fold())
example: https://gremlify.com/a1
You can get the original data format if you do something like this:
g.V('12345').as('v').
coalesce(
outE().as('e').
inV().
as('child_v').
select('v', 'e', 'child_v').
by(valueMap()).by(id).by(id),
project('v').by(valueMap())
)
example: https://gremlify.com/a2
I have a Neo4j graph with directed cycles. I have had no issue finding all descendants of A assuming I don't care about loops using this Cypher query:
match (n:TEST{name:"A"})-[r:MOVEMENT*]->(m:TEST)
return n,m,last(r).movement_time
The relationships between my nodes have a timestamp property on them, movement_time. I've simulated that in my test data below using numbers that I've imported as floats. I would like to traverse the graph using the timestamp as a constraint. Only follow relationships that have a greater movement_time than the movement_time of the relationship that brought us to this node.
Here is the CSV sample data:
from,to,movement_time
A,B,0
B,C,1
B,D,1
B,E,1
B,X,2
E,A,3
Z,B,5
C,X,6
X,A,7
D,A,7
Here is what the graph looks like:
I would like to calculate the descendants of every node in the graph and include the timestamp from the last relationship using Cypher; so I'd like my output data to look something like this:
Node:[{Descendant,Movement Time},...]
A:[{B,0},{C,1},{D,1},{E,1},{X,2}]
B:[{C,1},{D,1},{E,1},{X,2},{A,7}]
C:[{X,6},{A,7}]
D:[{A,7}]
E:[{A,3}]
X:[{A,7}]
Z:[{B,5}]
This non-Neo4J implementation looks similar to what I'm trying to do: Cycle enumeration of a directed graph with multi edges
This one is not 100% what you want, but very close:
MATCH (n:TEST)-[r:MOVEMENT*]->(m:TEST)
WITH n, m, r, [x IN range(0,length(r)-2) |
(r[x+1]).movement_time - (r[x]).movement_time] AS deltas
WHERE ALL (x IN deltas WHERE x>0)
RETURN n, collect(m), collect(last(r).movement_time)
ORDER BY n.name
We basically find all the paths between any of your nodes (beware: cartesian products get very expensive on non-trivial datasets). In the WITH we build a collection, deltas, that holds the differences between consecutive movement_time properties.
The WHERE applies an ALL predicate to filter out those having any non-positive value - aka we guarantee increasing values of movement_time along the path.
The RETURN then just assembles the results - but not as a map, instead one collection for the reachable nodes and the last value of movement_time.
The current issue is that we have duplicates since e.g. there are multiple paths from B to A.
As a general note: this problem can be solved more elegantly and with better performance using the Java traversal API (http://neo4j.com/docs/stable/tutorial-traversal.html). There you would have a PathExpander that skips paths with decreasing movement_time early, instead of collecting all paths and filtering them afterwards (as Cypher does).
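The early-pruning idea can be sketched outside Neo4j in plain Python, using the CSV sample data from the question. This is not the Java traversal API, only an illustration of the same principle: an edge is followed only if its movement_time is strictly greater than that of the edge used to arrive. The function names are hypothetical, and the result is the set of (descendant, last movement_time) pairs rather than the exact output format requested.

```python
def build_adjacency(edges):
    """Map each node to its outgoing (target, movement_time) pairs."""
    adjacency = {}
    for src, dst, t in edges:
        adjacency.setdefault(src, []).append((dst, t))
    return adjacency

def time_respecting_descendants(adjacency, start):
    """Collect (node, time) pairs reachable with strictly increasing times."""
    # What can still be reached depends only on (node, arrival time), so each
    # such state is expanded once. This also guarantees termination on cycles.
    seen = set()
    stack = [(start, None)]
    while stack:
        node, last_time = stack.pop()
        for nxt, t in adjacency.get(node, ()):
            if last_time is not None and t <= last_time:
                continue  # prune early: time does not increase
            if (nxt, t) not in seen:
                seen.add((nxt, t))
                stack.append((nxt, t))
    return seen

# The sample data from the question.
EDGES = [
    ("A", "B", 0), ("B", "C", 1), ("B", "D", 1), ("B", "E", 1), ("B", "X", 2),
    ("E", "A", 3), ("Z", "B", 5), ("C", "X", 6), ("X", "A", 7), ("D", "A", 7),
]
adjacency = build_adjacency(EDGES)
for name in sorted(adjacency):
    print(name, sorted(time_respecting_descendants(adjacency, name)))
```

A node can appear more than once with different times (for example, A is reachable from B at both 3 and 7), which is the same duplication noted for the Cypher query; keeping only the minimum or maximum time per node is a simple post-processing step.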
I have an undirected graph in which the weight of each edge is 1. The graph may have cycles. I need to find a longest path in the graph (each node appears at most once). The length of the path is the number of nodes. Is there any simple/effective solution? Thanks!
According to http://en.wikipedia.org/wiki/Longest_path_problem, finding the longest path is NP-hard, so it is considered hard to solve for big instances unless P = NP. This is in contrast to finding the shortest path, which BFS solves in linear time on an unweighted graph.
I had a similar case but my nodes were limited, the number was less than 50.
I modelled it in a SQL database table (from, to and length columns) and tried to find each path between two given nodes and calculated the length of the path to identify the longest path.
On SQL Server, I built a recursive CTE query to find the longest path. Please refer to the "find the longest path" article in the referred document.
Please note that, even with 50 nodes the query calculated over 70m possible paths from start node to end node without passing the same node twice and it took about 2 hours for SQL Server engine on my development computer to complete the execution of this query.
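The same exhaustive idea (enumerate every path that never visits a node twice and keep the longest) can be sketched in Python with backtracking. This is a minimal illustration for small graphs only: the running time is exponential in the worst case, as the figures above suggest. The function names and the sample edges are hypothetical.

```python
def undirected_adjacency(edges):
    """Build a node -> set of neighbours map from undirected edges."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def longest_simple_path(adjacency):
    """Return one longest path (as a list of nodes) with no repeated node."""
    best = []

    def extend(path, visited):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for neighbour in adjacency.get(path[-1], ()):
            if neighbour not in visited:
                visited.add(neighbour)
                path.append(neighbour)
                extend(path, visited)
                # Backtrack so other branches can reuse this node.
                path.pop()
                visited.remove(neighbour)

    # Try every node as a starting point.
    for start in adjacency:
        extend([start], {start})
    return best

# Hypothetical graph: a 4-cycle (1-2-3-4) with node 5 attached to node 4.
adjacency = undirected_adjacency([(1, 2), (2, 3), (3, 4), (4, 1), (4, 5)])
path = longest_simple_path(adjacency)
print(len(path), path)
```

The length is measured in nodes, as in the question. For larger graphs, the recursion depth and the number of paths both grow quickly, so this only serves as a baseline.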
Given a directed graph, I need to find all vertices v such that, if u is reachable from v, then v is also reachable from u. I know that the vertices can be found using BFS or DFS, but it seems to be inefficient. I was wondering whether there is a better solution for this problem. Any help would be appreciated.
Fundamentally, you're not going to do any better than some kind of search (as you alluded to). I wouldn't worry too much about efficiency: these algorithms are generally linear in the number of nodes + edges.
The problem is a bit underspecified, so I'll make some assumptions about your data structure:
1. You know vertex u (because you didn't ask to find it)
2. You can iterate both the inbound and outbound edges of a node (efficiently)
3. You can iterate all nodes in the graph
4. You can (efficiently) associate a couple bits of data along with each node
In this case, use a convenient search starting from vertex u (depth/breadth, doesn't matter) twice: once following the outbound edges (marking nodes as "reachable from u") and once following the inbound edges (marking nodes as "reaching u"). Finally, iterate through all nodes and compare the two bits according to your purpose.
Note: as worded, your result set includes all nodes that do not reach vertex u. If you intended the conjunction instead of the implication, then you can save a little time by incorporating the test in the second search, rather than scanning all nodes in the graph. This also relieves assumption 3.
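The two-pass search described above can be sketched in Python as follows. It is a minimal illustration under the stated assumptions, with sets standing in for the per-node bits; the function names and the sample graph are hypothetical.

```python
def reachable(adjacency, start):
    """Iterative depth-first search returning every node reachable from start."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def classify(nodes, edges, u):
    """Return (implication set, conjunction set) for vertex u."""
    outbound, inbound = {}, {}
    for src, dst in edges:
        outbound.setdefault(src, []).append(dst)
        inbound.setdefault(dst, []).append(src)

    reachable_from_u = reachable(outbound, u)  # follow outbound edges
    reaching_u = reachable(inbound, u)         # follow inbound edges

    # Implication: "if v reaches u, then u reaches v". Needs a scan of all nodes.
    implication = {v for v in nodes if v not in reaching_u or v in reachable_from_u}
    # Conjunction: v reaches u and u reaches v. No full scan needed.
    conjunction = reaching_u & reachable_from_u
    return implication, conjunction

# Hypothetical graph: u and a form a cycle, b only reaches u, c is only
# reached from u, and d is isolated.
nodes = ["u", "a", "b", "c", "d"]
edges = [("u", "a"), ("a", "u"), ("b", "u"), ("u", "c")]
implication, conjunction = classify(nodes, edges, "u")
print(sorted(implication), sorted(conjunction))
```

Here c and d satisfy the implication vacuously because they never reach u, which is the behaviour the note above warns about; the conjunction keeps only u and a.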