Gremlin: how to return all vertex pairs that are connected by an edge with a specific label

Take a simple example of an airline connection graph, as in the picture below.
Can we come up with a Gremlin query that returns the pairs of cities connected by SW, like [{ATL,CHI},{SFO,CHI},{DAL,CHI},{HSV,DAL}]?

Looks like all you probably need is:
g.V().outE('SW').inV().path()
If you don't want the edge in the result, you can use a flatMap:
g.V().flatMap(outE('SW').inV()).path()
To get back some properties rather than just vertices all you need to do is add a by modulator to the path step.
g.V().flatMap(outE('SW').inV()).path().by(valueMap())
This will return all the properties for every vertex. On a large result set this is not considered best practice; you should explicitly ask for the properties you care about. There are many ways to do this using values, project or valueMap. If you had a property called code representing the airport code, you might do this:
g.V().
  flatMap(outE('SW').inV()).
  path().
    by(valueMap('code'))
or just
g.V().flatMap(outE('SW').inV()).
  path().
    by('code')
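As a rough illustration of what the last traversal computes, here is a plain-Python model (not Gremlin). The edge directions and the extra DL edge are assumptions made up for the example, since only the expected pairs are given in the question; the order of the pairs follows the edge list here, whereas the real traversal follows the order in which vertices are iterated.

```python
# Plain-Python model of g.V().flatMap(outE('SW').inV()).path().by('code').
# Each tuple is (source airport code, edge label, target airport code).
# Edge directions and the DL edge are assumptions for illustration only.
edges = [
    ("ATL", "SW", "CHI"),
    ("SFO", "SW", "CHI"),
    ("DAL", "SW", "CHI"),
    ("HSV", "SW", "DAL"),
    ("ATL", "DL", "SFO"),  # hypothetical edge with a different label
]


def pairs_connected_by(edges, label):
    """Return [source, target] for every edge carrying the given label."""
    return [[src, dst] for src, edge_label, dst in edges if edge_label == label]


sw_pairs = pairs_connected_by(edges, "SW")
print(sw_pairs)
```

Only edges with the requested label contribute a pair, which is the same filtering that outE('SW') performs before inV() picks up the target vertex.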

Related

Is there a way to reuse aggregate steps?

I have a graph database storing different types of entities and I am building an API to fetch entities from the graph. It is however a bit more complicated since for each type of entity there is a set of rules that are used for fetching related entities as well as the original.
To do this I used an aggregate step to aggregate all related entities that I am fetching into a collection.
An added requirement is to fetch a batch of entities (and related entities). I was going to do this by changing the has step that is fetching entities to use P.within and map the aggregation to each of the found entities.
This works if I continue fetching a single entity, but if I want to fetch two then my result set will be correct for the first one but the result set for the second contains the results of the first one as well as its own results.
I think this is because the second one simply adds to the aggregated collection from the first one, since the aggregation key is the same.
I haven't found any way to clear the collection between the first and the second, nor any way to have a dynamic aggregation side effect key.
Code:
return graph.traversal().V()
    .hasLabel(ENTITY_LABEL)
    .has("entity_ref", P.within(entityRefs)) // entityRefs is a list of entities I am looking for
    .flatMap(
        __.aggregate("match")
            .sideEffect(
                // The logic that applies the rules lives here. It will add items to "match" collection.
            )
            .select("match")
            .fold()
    )
    .toStream()
    ...
The result should be a list of lists of entities where the first list of entities in the outer list contains results for the first entity in entityRefs, and the second list of entities contains results for the second entity in entityRefs.
Example:
I want to fetch the vertices for entity refs A and B and their related entities.
Let's say I expect the results to then be [[A, C], [B, D, E]], but I get the results [[A, C], [A, C, B, D, E]] (The second results contain the results from the first one).
Questions:
Is there a way to clear the "match" collection after the selection?
Is there a way to have dynamic side effect keys such that I create a collection for each entityRef?
Is there perhaps a different way I can do this?
Have I misidentified the problem?
EDIT:
This example is a miniature version of the problem. The graph is set up like so:
g.addV('entity').property('id',1).property('type', 'customer').as('1').
  addV('entity').property('id',2).property('type', 'email').as('2').
  addV('entity').property('id',6).property('type', 'customer').as('6').
  addV('entity').property('id',3).property('type', 'product').as('3').
  addV('entity').property('id',4).property('type', 'subLocation').as('4').
  addV('entity').property('id',7).property('type', 'location').as('7').
  addV('entity').property('id',5).property('type', 'productGroup').as('5').
  addE('AKA').from('1').to('2').
  addE('AKA').from('2').to('6').
  addE('HOSTED_AT').from('3').to('4').
  addE('LOCATED_AT').from('4').to('7').
  addE('PART_OF').from('3').to('5').iterate()
I want to fetch a batch of entities, given their ids and fetch related entities. Which related entities should be returned is a function of the type of the original entity.
My current query is like this (slightly modified for this example):
g.V().
  hasLabel('entity').
  has('id', P.within(1,3)).
  flatMap(
    aggregate('match').
    sideEffect(
      choose(values('type')).
        option('customer',
          both('AKA').
          has('type', P.within('email', 'phone')).
          sideEffect(
            has('type', 'email').
            aggregate('match')).
          both('AKA').
          has('type', 'customer').
          aggregate('match')).
        option('product',
          bothE('HOSTED_AT', 'PART_OF').
          choose(label()).
            option('PART_OF',
              bothV().
              has('type', P.eq('productGroup')).
              aggregate('match')).
            option('HOSTED_AT',
              bothV().
              has('type', P.eq('subLocation')).
              aggregate('match').
              both('LOCATED_AT').
              has('type', P.eq('location')).
              aggregate('match')))
    ).
    select('match').
    unfold().
    dedup().
    values('id').
    fold()
  ).
  toList()
If I only fetch one entity I get correct results: for id 1 I get [1,2,6] and for id 3 I get [3,5,4,7]. However, when I fetch both I get:
==>[3,5,4,7]
==>[3,5,4,7,1,2,6]
The first result is correct, but the second contains the results for both ids.
You can leverage group().by(key).by(value), a traversal step that is, to be honest, not too well documented but seemingly powerful.
That way you can drop the aggregate() side-effect step that is causing you trouble. To collect multiple vertices matching some traversal into a list, I used union() instead.
An example that uses the graph you posted (I only included the customer option for brevity):
g.V().
  hasLabel('entity').
  has('id', P.within(1,3)).
  <String, List<Entity>>group()
    .by("id")
    .by(choose(values("type"))
      .option('customer', union(
          identity(),
          both('AKA').has('type', 'email'),
          both('AKA').has('type', P.within('email', 'phone')).both('AKA').has('type', 'customer'))
        .map((traversal) -> new Entity(traversal.get())) // Or whatever business class you have
        .fold()) // Closes option('customer'); fold() is important to collect all 3 paths in the union together
      .option('product', union()))
    .next()
This traversal has the obvious drawback of being a bit more verbose: it declares the step over 'AKA' from a customer twice, whereas your traversal declared it only once.
It does, however, keep the by(value) part of the group() step separate for each key, which is what we wanted.
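To make the difference concrete, here is a small plain-Python model (not Gremlin) of the two behaviours. The related dictionary is stand-in data copied from the single-entity results reported in the question, and the processing order [3, 1] is an assumption chosen to match the output shown there.

```python
# Stand-in data: the per-entity results reported for single-entity fetches.
related = {3: [3, 5, 4, 7], 1: [1, 2, 6]}


def shared_accumulator(ids):
    """Mimics one 'match' side effect shared by every traverser."""
    match, results = [], []
    for entity_id in ids:
        for found in related[entity_id]:
            if found not in match:   # plays the role of dedup()
                match.append(found)
        results.append(list(match))  # select('match') sees everything so far
    return results


def grouped(ids):
    """Mimics group().by(key).by(value): one collection per key."""
    return {entity_id: list(related[entity_id]) for entity_id in ids}


print(shared_accumulator([3, 1]))
print(grouped([3, 1]))
```

The shared list keeps growing across entities, which is why the second result contains the first; the grouped version builds a separate collection for every key.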

Gremlin find all vertices that have "any" property with a given value

The properties in my graph are dynamic: there can be any number of properties on the vertices. This also means that, when I do a search, I will not know which property to look at. Is it possible in Gremlin to query the graph for all vertices that have any property with a given value?
E.g., with name and desc as properties, if the incoming search request is 'test', the query would be g.V().has('name', 'test').or().has('desc', 'test'). How can I achieve similar functionality when I do not know which properties exist? I need to be able to search across all the properties and check whether any of their values is 'test'.
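You can find the matching properties using the following syntax:
g.V().properties().hasValue('test')
That traversal returns the properties themselves. To get back the vertices that own them, apply the same check as a filter:
g.V().where(properties().hasValue('test'))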
However, on a dataset of any real size I would expect this to be a very slow traversal, as it is the equivalent of asking an RDBMS "Find me any cell in any column in any table where the value equals 'test'". If this is a high-frequency request, I would suggest refactoring your graph model or using a database optimized for searches, such as Elasticsearch.
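To illustrate why the cost matters, here is a plain-Python sketch (not Gremlin). The vertices and their properties are invented for the example, and the inverted index is only a much-simplified picture of what a search-optimized store does, not a description of any particular product.

```python
# Hypothetical vertices with arbitrary, differing property sets.
vertices = {
    "v1": {"name": "test", "desc": "first"},
    "v2": {"title": "other", "note": "test"},
    "v3": {"name": "nothing here"},
}


def vertices_with_value(vertices, wanted):
    """Full scan: looks at every property of every vertex on each search."""
    return [vid for vid, props in vertices.items() if wanted in props.values()]


def build_value_index(vertices):
    """Inverted index from property value to the ids of vertices holding it."""
    index = {}
    for vid, props in vertices.items():
        for value in props.values():
            index.setdefault(value, set()).add(vid)
    return index


scan_hits = vertices_with_value(vertices, "test")
index_hits = build_value_index(vertices).get("test", set())
print(scan_hits, sorted(index_hits))
```

The scan does work proportional to the total number of properties for every query, while the index pays that cost once up front and then answers each lookup directly.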

Given a set of nodes of a strongly connected graph as input can we get subgraph and path traversal between them

I have this requirement in ArangoDB AQL: I have a graph created with a document collection for nodes and an edge collection for directed edge relations.
I want to pass a subset of the nodes as input to an AQL query and get all the node traversals / the subgraph as the output.
How can I achieve this in AQL?
That way I want to find out the relations between the given nodes. Please comment if more details are needed.
The only query I know so far is:
FOR v IN 1..1 INBOUND[or OUTBOUND] 'Collection/_key' EdgeCollection
OPTIONS {bfs: true}
RETURN v
I'd recommend reviewing the queries on the ArangoDB sample page where it shows how it performs graph queries, and how to review the results.
In your sample query above you are only returning v (vertex information), as in FOR v IN.
That returns only the last vertex of every path the query finds; it doesn't return edge or path information.
For that you need to use FOR v, e, p IN, which also returns the last edge (e) and the path (p) it took.
In particular look at the results of p as it contains a JSON object that holds path information, which is a collection of vertices and edges.
By iterating through that data you should be able to extract the information you require.
AQL gives you many tools to aggregate, group, filter, de-duplicate, and reduce data sets, so make sure you look at the wider language functions and practice building more complex queries.

ArangoDB AQL: effectively traversing from a start vertex to an end vertex and finding the connections between them

I'm very new to graph concepts and ArangoDB. I plan to use both of them in a project related to communication analysis. I have set up the data in ArangoDB with one document collection named object and one edge collection named object_routing.
In object, the data structure is as follows:
{
  "img": "assets/img/default_message.png",
  "label": "some label",
  "obj_id": "45a92a7344ee4f758841b5466c010ed9",
  "type": "message"
}
...
{
  "img": "assets/img/default_person.png",
  "label": "some label",
  "obj_id": "45a92a7344ee4f758841b5466c01111",
  "type": "user"
}
In object_routing, the data structure is as follows:
{
  "message_id": "no_data",
  "source": "45a92a7344ee4f758841b5466c010ed9",
  "target": "45a92a7344ee4f758841b5466c01111",
  "type": "has_contacted"
}
with _from: object/45a92a7344ee4f758841b5466c010ed9 and _to: object/45a92a7344ee4f758841b5466c01111
The total amount of data is 23k for object and 127k for object_routing.
My question is: how can I effectively traverse from a start vertex to an end vertex, so that I get all the connected vertices, their edges, their children and so on between them, until there is nothing left to traverse?
I'm afraid my question is not clear enough and that my understanding of graph concepts is not heading in the right direction, so please bear with me.
Note: the BFS algorithm is not an option because that is not what I need. If possible, I would like to get the longest path. My current ArangoDB version is 3.1.7, running on a cluster with 1 coordinator and 3 DB servers.
It is worth trying a few queries to get a feel for how AQL traversals work, but maybe start with this example from the AQL Traversal documentation page:
FOR v, e, p IN 1..10 OUTBOUND 'object/45a92a7344ee4f758841b5466c010ed9' GRAPH 'insert_my_graph_name'
LET last_vertex_in_path = LAST(p.vertices)
FILTER last_vertex_in_path.obj_id == '45a92a7344ee4f758841b5466c01111'
RETURN p
This sample query follows outbound edges in your graph called insert_my_graph_name, starting from the vertex with an _id of object/45a92a7344ee4f758841b5466c010ed9, for path lengths from 1 to 10.
The query exposes three variables for every path found:
v contains the last vertex of the path (the vertex the traversal just reached)
e contains the last edge of the path (the edge that led to v)
p contains the whole path that was found
A path consists of vertices connected to each other by edges.
If you want to explore the variables, try this version of the query:
FOR v, e, p IN 1..10 OUTBOUND 'object/45a92a7344ee4f758841b5466c010ed9' GRAPH 'insert_my_graph_name'
  RETURN {
    vertex: v,
    edge: e,
    path: p
  }
Conveniently, AQL returns all of this as JSON documents and arrays.
When a path is returned, it is stored as a document with two attributes, edges and vertices, where the edges attribute is an array of the edge documents the path went down, and the vertices attribute is an array of vertex documents.
The interesting thing about the vertices array is that the order of its elements is important. The first document in the vertices array is the starting vertex, and the last document is the ending vertex.
So in the example query above, the starting vertex will always be the FIRST element of the array stored at p.vertices, and the end of the path will always be the LAST element of that array.
It doesn't matter how many vertices are traversed in your path, that rule still works.
If your query were an INBOUND one, the logic stays the same: FIRST(p.vertices) is still the starting vertex, the one whose _id you specified in the query, and LAST(p.vertices) is the vertex where that path ends.
So, back to your use case: if you want to select all OUTBOUND paths from your starting vertex to a specific vertex, you can add the LET last_vertex_in_path = LAST(p.vertices) declaration to set a reference to the last vertex in each path.
Then you can easily provide a FILTER that references this variable and filters on any attribute of that terminating vertex. You could filter on last_vertex_in_path._id, last_vertex_in_path.obj_id, or any other attribute of that final vertex document.
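If it helps to see the shape of the data, here is a plain-Python sketch (not AQL) of the same "filter on the last vertex" idea. The paths list is a hand-written stand-in for what the traversal might return; the second end vertex (made_up_key) is invented purely for the example.

```python
START = "object/45a92a7344ee4f758841b5466c010ed9"
TARGET_OBJ_ID = "45a92a7344ee4f758841b5466c01111"

# Hypothetical query result: each path holds ordered 'vertices' and 'edges' arrays.
paths = [
    {
        "vertices": [
            {"_id": START, "obj_id": "45a92a7344ee4f758841b5466c010ed9"},
            {"_id": "object/" + TARGET_OBJ_ID, "obj_id": TARGET_OBJ_ID},
        ],
        "edges": [{"type": "has_contacted"}],
    },
    {
        "vertices": [
            {"_id": START, "obj_id": "45a92a7344ee4f758841b5466c010ed9"},
            {"_id": "object/made_up_key", "obj_id": "made_up_key"},  # invented vertex
        ],
        "edges": [{"type": "has_contacted"}],
    },
]


def paths_ending_at(paths, obj_id):
    """Keep only the paths whose last vertex has the given obj_id."""
    return [p for p in paths if p["vertices"][-1]["obj_id"] == obj_id]


matching = paths_ending_at(paths, TARGET_OBJ_ID)
print(len(matching))
```

Indexing the vertices array with [-1] here plays the role of LAST(p.vertices), and the list comprehension plays the role of the FILTER.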
Play with it and practice some. Once you see that a graph traversal query only provides you with these three key variables, v, e, and p, and that there is nothing special about them (v and e are plain documents, and p holds arrays of vertices and edges), you can do some pretty powerful filtering.
You could put filters on properties of any of the vertices, edges, or path positions to do some pretty flexible filtering and aggregation of the results it sends through.
Also have a look at the traversal options, they can be useful.
To get started, just make sure you have your documents and edges loaded, and that you've created a graph with those document and edge collections in it.
And yes.. you can have many document and edge collections in a single graph, even sharing document/edge collections over multiple graphs if that suits your use cases.
Have fun!

Tinkerpop3/Gremlin. Find (A) Upsert (B) add Edge A to B

I am looking for upsert functionality in Gremlin.
The client program has a stream of (personId, favoriteMovieNodeId) pairs. For each pair it needs to look up the favoriteMovieNodeId vertex, then UPSERT a Person vertex and create the [favoriteMovie] edge.
This will create duplicate Person nodes:
g.V().has(label,'movies').has('uid',$favoriteMovieNodeId).as('fm')
  .addV('Person').property('personId', $personId).addE('favMovie').to('fm')
Is there a way to check for the existence of a node based on its properties before adding it? I can't seem to find the documentation on this very basic graph function that is part of every underlying graph DB.
If the movie is guaranteed to exist, then it's:
g.V().has('movies','uid',$favoriteMovieNodeId).as('fm').
  coalesce(V().has('Person','personId', $personId),
           addV('Person').property('personId', $personId)).
  addE('favMovie').to('fm')
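The coalesce() step above is a "return the first branch that yields something" construct: look the person up, and only create one if the lookup comes back empty. A plain-Python model (not Gremlin) of that get-or-create logic, with made-up ids p1, m1 and m2 and dictionaries standing in for the graph:

```python
people = {}           # personId -> person record (stands in for Person vertices)
fav_movie_edges = []  # (personId, movie uid) pairs (stands in for favMovie edges)


def upsert_person(person_id):
    """Model of coalesce(find, create): reuse the person or add a new one."""
    if person_id not in people:
        people[person_id] = {"personId": person_id}
    return people[person_id]


def add_favorite(person_id, movie_uid):
    """Upsert the person, then record an edge to the (assumed existing) movie."""
    person = upsert_person(person_id)
    fav_movie_edges.append((person["personId"], movie_uid))


add_favorite("p1", "m1")
add_favorite("p1", "m2")
print(len(people), fav_movie_edges)
```

Calling it twice for the same person adds two edges but only one person record, which is the behaviour the original addV-only traversal was missing.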

Resources