Gremlin - optimize query - graph

I have a graph that represents database objects, parent-child relations, and dataflow relations (only between columns).
Here is my current Gremlin query (in Python), which should find the dataflow impact of a column:
g.V().has('fqn', 'some fully qualified name').
repeat(outE("flows_into").dedup().store('edges').inV()).
until(
or_(
outE("flows_into").count().is_(eq(0)),
cyclicPath(),
)
).
cap('edges').
unfold().
dedup().
map(lambda: "g.V(it.get().getVertex(0).id()).in('child').in('child').id().next().toString() + ',' + g.V(it.get().getVertex(1).id()).in('child').in('child').id().next().toString()").
toList()
This query should return all edges that are impacted in any way by the initial column.
In some cases, however, I do not care about column-level detail and want the edges at 'schema level'. That is what the lambda does: for both vertices of each edge, it traverses up the object tree twice, which yields the schema node.
The problem is in this lambda function. I cannot just do this:
it.get().getVertex(1).in('child').in('child').id().next().toString()
because getVertex(1) does not return a traversable instance, so I need to start a new traversal with g.V().... According to my debugging, this line causes a horrible slowdown: the query gets about 50x slower if I leave this transformation in.
Do you have any ideas how to optimize this query?

You might consider not using a lambda at all, given they tend to not be portable between implementations. Perhaps the map step could be replaced with a project step something like:
project('v0','v1').
by(outV().in('child').in('child').id()).
by(inV().in('child').in('child').id())
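To make the intended result concrete, here is a minimal plain-Python model of what that projection computes for each stored edge. It does not talk to a graph; the `parent` map, the node names, and the helper functions are all made up for illustration.

```python
# Toy model: 'parent' maps each node to its parent in the object tree
# (column -> table -> schema). All names are hypothetical.
parent = {
    "s1.t1.c1": "s1.t1", "s1.t1.c2": "s1.t1", "s1.t1": "s1",
    "s2.t9.c1": "s2.t9", "s2.t9": "s2",
}

# Column-level 'flows_into' edges as (source column, target column).
column_edges = [
    ("s1.t1.c1", "s2.t9.c1"),
    ("s1.t1.c2", "s2.t9.c1"),
]

def schema_of(column):
    """Walk two 'child' edges upwards: column -> table -> schema."""
    return parent[parent[column]]

def lift_to_schema(edges):
    """Project each edge to {'v0': ..., 'v1': ...}, dropping repeated pairs."""
    seen, result = set(), []
    for src, dst in edges:
        pair = (schema_of(src), schema_of(dst))
        if pair not in seen:
            seen.add(pair)
            result.append({"v0": pair[0], "v1": pair[1]})
    return result

schema_edges = lift_to_schema(column_edges)
print(schema_edges)
```

Two practical notes, both to be checked against your setup: in gremlinpython the in step is spelled in_('child'), and because several column edges can collapse into the same schema pair, a dedup() after the project step is probably what you want.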

Related

How to get properties hasid/key with vertexes info in one gremlin query or Gremlin.Net

I am trying to get the properties that have a given key or id with the following Gremlin.Net query, but the vertex info (id and label) in the returned VertexProperty is null.
g.V().Properties<VertexProperty>().HasKey(somekey).Promise(p => p.ToList())
So I tried another way, but its return type is Path, and I had to write ugly code for the type conversion.
g.V().Properties<VertexProperty>().HasKey(somekey).Path().By(__.ValueMap<object, object>(true))
Is there a better way to achieve this?
I think basically the only thing missing to get what you want is the Project() step.
In order to find all vertices that have a certain property key and then get their id, label, and then all information about that property, you can use this traversal:
g.V().
Has(someKey).
Project<object>("vertexId", "vertexLabel", "property").
By(T.Id).
By(T.Label).
By(__.Properties<object>(someKey).ElementMap<object>()).
Promise(t => t.ToList());
Each result is a Dictionary whose keys are the names passed to the Project step.
If you want to filter by a certain property id instead of a property key, you can do it in a very similar way:
g.V().
Where(__.Properties<object>().HasId(propertyId)).
Project<object>("vertexId", "vertexLabel", "property").
By(T.Id).
By(T.Label).
By(__.Properties<object>(someKey).ElementMap<object>()).
Promise(t => t.ToList());
In both cases, the traversal first filters the vertices down to those that have the property we are looking for. That way, we can use the Project() step afterwards to get the desired data back.
ElementMap should return all the information about the properties that you want.
Note however that these traversals will most likely require a full graph scan in JanusGraph, meaning that it has to iterate over all vertices in your graph. The reason is that these traversals cannot use an index which would make them much more efficient. So, for larger graphs, the traversals will probably not be feasible.
If you had the vertex ids available instead of the property ids in the second traversal, then you could make the traversal a lot more efficient by replacing g.V().Where([...]) simply with g.V(id).

Gremlin OLAP traversal query error regarding local star-graph

I'm trying to execute an OLAP traversal for a query that needs to check whether a vertex has a neighbour of a certain type.
I keep getting:
Local traversals may not traverse past the local star-graph on GraphComputer
My query looks something like:
g.V().hasLabel('label1').
where(__.out().hasLabel('label2'))
I'm using the TraversalVertexProgram.
Needless to say, when running the same query in OLTP mode there is no problem.
Is there a way to execute such logic?
That is a limitation of the TinkerPop OLAP GraphComputer. It operates on 'star-graph' objects, that is, a vertex and its connected edges only, and it uses a message-passing engine internally. So you have to rewrite your query.
Option 1: start from label2 and go to label1. This should return the same result:
g.V().hasLabel('label2').in().hasLabel('label1')
Option 2: try to use unique edge labels and check the edge label instead:
g.V().hasLabel('label1').where(__.outE('label1_label2'))
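One detail about option 1 is easy to miss: a label1 vertex with several label2 neighbours comes back once per neighbour, so a trailing dedup() may be needed to match the original where() filter exactly. The following is a minimal plain-Python model of both traversals on a made-up toy graph (the vertex ids, labels, and function names are assumptions for illustration, not TinkerPop code).

```python
# Toy graph: vertex id -> label, plus directed edges (out vertex, in vertex).
labels = {1: "label1", 2: "label1", 3: "label2", 4: "label2", 5: "label3"}
edges = [(1, 3), (1, 4), (2, 5)]

def where_out_has_label2():
    """Model of g.V().hasLabel('label1').where(__.out().hasLabel('label2'))."""
    return [
        v for v, lab in labels.items()
        if lab == "label1"
        and any(src == v and labels[dst] == "label2" for src, dst in edges)
    ]

def start_from_label2():
    """Model of g.V().hasLabel('label2').in().hasLabel('label1')."""
    result = []
    for v, lab in labels.items():
        if lab != "label2":
            continue
        for src, dst in edges:
            if dst == v and labels[src] == "label1":
                result.append(src)  # one traverser per incoming edge
    return result

original = where_out_has_label2()
rewritten = start_from_label2()
deduplicated = sorted(set(rewritten))  # what a trailing dedup() would leave
print(original, rewritten, deduplicated)
```

In this model vertex 1 has two label2 neighbours, so the rewritten form yields it twice, while the deduplicated list matches the original filter.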

Gremlin breadcrumb query

I'm new to Gremlin and still learning.
I'd like to include the starting vertex in the results of the following:
g.V('leafNode').repeat(out()).emit()
This gives me a collection of vertexes starting from an arbitrary leaf node "upwards" to the root vertex. However this collection excludes the V('leafNode') vertex itself.
How do I include the V('leafNode') in this collection?
Thanks
-John
There are two places for the emit in this statement: either before the repeat or after. If it comes before the repeat, it will be performed before evaluating the next loop.
Source: http://tinkerpop.apache.org/docs/current/reference/#repeat-step
So the following should do what you asked for:
g.V('leafNode').emit().repeat(out())
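The difference between the two emit positions can be sketched with a small plain-Python model. It only imitates a single chain of out() hops; the vertex names in `out_edge` and the `repeat_out` helper are hypothetical and not part of TinkerPop.

```python
# Toy tree: each vertex points to its parent via one 'out' edge; the root has none.
out_edge = {"leafNode": "mid", "mid": "root"}

def repeat_out(start, emit_before):
    """Model of emit().repeat(out()) versus repeat(out()).emit() on a chain."""
    results = []
    current = start
    while current is not None:
        if emit_before:
            results.append(current)      # emit first, then run the loop body
        current = out_edge.get(current)  # loop body: one out() hop
        if not emit_before and current is not None:
            results.append(current)      # run the loop body first, then emit
    return results

emit_after = repeat_out("leafNode", emit_before=False)
emit_first = repeat_out("leafNode", emit_before=True)
print(emit_after)
print(emit_first)
```

With emit placed after the repeat, the starting vertex is never emitted because it has already taken one hop; with emit placed first, it is emitted before the first hop.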

Is it better to get a few properties from a vertex or the whole vertex in Gremlin?

Which of these queries is the most cost-effective in terms of network calls and query processing time?
1)
g.V().has('personId','1234')
=> v[4232]
2)
g.V().has('personId','1234').values('name','age')
=> chris
=> 24
3)
g.V().has('personId','1234').valueMap('name','age')
=> [name:[chris],age:[24]]
4)
g.V().has('personId','1234').properties('name','age')
=> vp[name->chris]
=> vp[age->24]
Does getting the whole vertex cost more network bandwidth? I usually fetch whole vertices, as the query processing is done quickly.
The cheapest will be options two and three. Returning an Element (a Vertex, Edge, or VertexProperty) will be more costly than returning an individual value or a Map of results. Even the following is less expensive than returning an entire Element:
g.V(1).valueMap(true)
which from a data perspective is basically the same as g.V(1).
The basic rule ends up being not so different than SQL. You likely wouldn't do a SELECT * FROM table, nor should you return all the data from an element - only retrieve the data you need.

How to insert large number of nodes into Neo4J

I need to insert about 1 million nodes into Neo4j. Each node must be unique, so every time I insert a node it has to be checked that the same node does not exist yet. The relationships must be unique as well.
I'm using Python and Cypher:
uq = 'CREATE CONSTRAINT ON (a:ipNode8) ASSERT a.ip IS UNIQUE'
...
queryProbe = 'MERGE (a:ipNode8 {ip:"' + prev + '"})'
...
queryUpdateRelationship= 'MATCH (a:ipNode8 {ip:"' + prev + '"}),(b:ipNode8 {ip:"' + next + '"}) MERGE (a)-[:precede]->(b)'
The problem is that after putting 40-50K nodes into Neo4j, the insertion speed drops quickly and I cannot insert anything else.
Your question is quite open-ended. In addition to @InverseFalcon's recommendations, here are some other things you can investigate to speed things up.
Read the Performance Tuning documentation, and follow the recommendations. In particular, you might be running into memory-related issues, so the Memory Tuning section may be very helpful.
Your Cypher query(ies) can probably be sped up. For instance, if it makes sense, you can try something like the following. The data parameter is expected to be a list of objects having the format {a: 123, b: 234}. You can make the list as long as appropriate (e.g., 20K) to avoid running out of memory on the server while it processes the list within a single transaction. (This query assumes that you also want to create b if it does not exist.)
UNWIND {data} AS d
MERGE (a:ipNode8 {ip: d.a})
MERGE (b:ipNode8 {ip: d.b})
MERGE (a)-[:precede]->(b)
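On the Python side, the batching described above could be sketched roughly as follows. This is only an illustration under some assumptions: the `batches` helper, the sample IP pairs, and the batch size are made up, and the query uses the `$data` parameter spelling expected by newer Neo4j versions rather than the older `{data}` form.

```python
from itertools import islice

# Parameterized query: values travel as a parameter, not via string concatenation.
QUERY = (
    "UNWIND $data AS d "
    "MERGE (a:ipNode8 {ip: d.a}) "
    "MERGE (b:ipNode8 {ip: d.b}) "
    "MERGE (a)-[:precede]->(b)"
)

def batches(pairs, size):
    """Yield lists of {'a': prev, 'b': next} maps, at most `size` per list."""
    it = iter(pairs)
    while True:
        chunk = [{"a": a, "b": b} for a, b in islice(it, size)]
        if not chunk:
            return
        yield chunk

# Hypothetical input: consecutive (prev, next) IP pairs.
pairs = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.2", "10.0.0.3"),
    ("10.0.0.3", "10.0.0.4"),
]

all_batches = list(batches(pairs, size=2))
for data in all_batches:
    # With a real driver, each batch would be sent in its own transaction,
    # passing `data` as the query parameter.
    print(len(data), data[0])
```

Each batch is then one round trip and one transaction instead of one per node, which is where most of the speed-up would be expected to come from.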
There are also periodic execution APOC procedures that you might be able to use.
For mass inserts like this, it's best to use LOAD CSV with periodic commit or the import tool.
I believe it's also best practice to use a parameterized query instead of appending values into a string.
Also, you created a unique property constraint on :ipNode8, but not :ipNode, which is the first one you MERGE. Seems like you'll need a unique constraint for that one too.
