How to get count of intermediate vertices? - gremlin

Let's say there is a tree-like structure.
Top level: Warehouse
Next level: Storage space
Last level: Stored item
I want to get the count of Storage spaces and Stored items per warehouse.
I've already managed to get the number of Stored items; it's done pretty easily using groupCount:
g.V().
  hasLabel('Warehouse').as('w').
  out('HAS_SPACE').hasLabel('Space').as('s').
  out('HAS_ITEM').hasLabel('Item').
  groupCount().by(select('w')).
  unfold().
  order().by(values, desc).
  limit(100).
  project('WarehouseName', 'ItemsCount').
    by(select(keys).values('Name')).
    by(select(values))
However, I also want the count of 's', and I can't think of any fast way to get it. I've thought about counting traversals with something like:
g.V().
  hasLabel('Warehouse').
  project('WarehouseName', 'SpaceCount', 'ItemCount').
    by('Name').
    by(out('HAS_SPACE').count()).
    by(out('HAS_SPACE').out('HAS_ITEM').count())
but it runs extremely slowly on a large number of vertices (there are about 26M).
Is there any other way to get that count?

You can use groupCount() as a side effect by giving it a name:
g.V().hasLabel('Warehouse').as('w')
.out('HAS_SPACE').hasLabel('Space').as('s')
.groupCount('spaceCount').by(select('w'))
.out('HAS_ITEM').hasLabel('Item')
.groupCount('itemsCount').by(select('w'))
.count().select('spaceCount', 'itemsCount')
Note that this will return two maps, one for spaces and one for items.
If you need to get it as a single map, you can replace the last line with:
.count().union(select('spaceCount'),select('itemsCount')).unfold()
.group().by(keys).by(select(values).fold()).unfold()
The result will be a map of arrays, where the first value is the space count and the second is the item count.
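As a side note, TinkerPop's cap() step can emit multiple named side effects as a single map of maps, which removes the need for the trailing count()/select() pair. A minimal sketch, assuming your provider supports cap() with multiple keys (worth verifying on your backend):
g.V().hasLabel('Warehouse').as('w')
.out('HAS_SPACE').hasLabel('Space')
.groupCount('spaceCount').by(select('w'))
.out('HAS_ITEM').hasLabel('Item')
.groupCount('itemsCount').by(select('w'))
.cap('spaceCount', 'itemsCount')
Here the result is one map with the keys 'spaceCount' and 'itemsCount', each holding its own warehouse-to-count map.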

Related

Efficient way to create relationships from a csv in neo4j

I am trying to populate my graph DB with relationships that I currently have in a file.
Each line in the relationship CSV has the unique IDs of the two nodes the relationship connects, as well as the kind of relationship it is.
Each line is something along the lines of:
uniqueid1,uniqueid2,relationship_name,property1_value,..., propertyn_value
I had already created all the nodes and was matching the nodes against the uniqueids specified in each file, then creating the relationship between them.
However, the relationships tend to take a long time to create, and my suspicion is that I am doing something wrong.
The CSV file has about 2.5 million lines with different relationship types, so I manually set the relationships.rela property to one of them, create all the relationships of that type, and follow up with the next one using my WHERE clause.
The number of properties on each node has been reduced with an ellipsis (...) and the names redacted.
I currently have the query to create the relationships set up as follows:
:auto USING PERIODIC COMMIT 100 LOAD CSV WITH HEADERS FROM 'file:///filename.csv' as relationships
WITH relationships.uniqueid1 as uniqueid1, relationships.uniqueid2 as uniqueid2, relationships.extraproperty1 as extraproperty1, relationships.rela as rela... , relationships.extrapropertyN as extrapropertyN
WHERE rela = "manager_relationship"
MATCH (a:Item {uniqueid: uniqueid1})
MATCH (b:Item {uniqueid: uniqueid2})
MERGE (b) - [rel: relationship_name {propertyvalue1: extraproperty1,...propertyvalueN: extrapropertyN }] -> (a)
RETURN count(rel)
I would appreciate it if alternate patterns could be recommended.
Indexing is a mechanism that databases use to speed up data lookups. In your case, since Item nodes are not indexed, these two matches can take a lot of time, especially if the number of Item nodes is very large.
MATCH (a:Item {uniqueid: uniqueid1})
MATCH (b:Item {uniqueid: uniqueid2})
To speed this up, you can create an index on the Item nodes' uniqueid property, like this:
CREATE INDEX unique_id_index FOR (n:Item) ON (n.uniqueid)
When you run your import query after creating the index, it will be much faster, though it will still take a bit of time as there are 2.5 million relationships. Read more about indexing in Neo4j here.
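To confirm the index is actually being used, you can prefix a single lookup with PROFILE (or EXPLAIN) and check for a NodeIndexSeek operator in the plan. A quick sanity check, with a placeholder ID:
PROFILE
MATCH (a:Item {uniqueid: "some-example-id"})
RETURN a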
Aside from the suggestion from Charchit to create an index, I recommend using the APOC procedure apoc.periodic.iterate, which will execute the query in parallel batches of 10k rows.
https://neo4j.com/labs/apoc/4.4/overview/apoc.periodic/apoc.periodic.iterate/
For example:
CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM 'file:///filename.csv' AS relationships RETURN relationships",
  "WITH relationships.uniqueid1 AS uniqueid1, relationships.uniqueid2 AS uniqueid2, relationships.extraproperty1 AS extraproperty1, relationships.rela AS rela... , relationships.extrapropertyN AS extrapropertyN
   WHERE rela = 'manager_relationship'
   MATCH (a:Item {uniqueid: uniqueid1})
   MATCH (b:Item {uniqueid: uniqueid2})
   MERGE (b)-[rel:relationship_name {propertyvalue1: extraproperty1,...propertyvalueN: extrapropertyN}]->(a)",
  {batchSize:10000, parallel:true})
The first query returns all the rows in the CSV file; APOC then divides the rows into batches of 10k and runs the second query over each batch in parallel, using the default concurrency (50 workers).
I use it often and load 40M nodes/edges in about 30 minutes.
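One caveat (my addition, not from the original answer): parallel:true combined with MERGE on relationships can deadlock when two batches lock the same nodes, surfacing as Neo.TransientError.Transaction.DeadlockDetected. A hedged fallback sketch, trading speed for safety by serializing the batches and letting APOC retry failed ones:
CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM 'file:///filename.csv' AS relationships RETURN relationships",
  "... same inner query as above ...",
  {batchSize:10000, parallel:false, retries:3})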

How do I do simple calculations in CosmosDB using the Gremlin API?

I am using CosmosDB with the Gremlin API and I would like to perform simple calculations even though CosmosDB does not support the math() step.
Imagine a vertex "Person" with the property Age that can have an "Owns" edge to another vertex "Pet" that also has the property Age.
I would like to know whether a given person has a cat that is younger than the person, but not more than 10 years younger.
The query (I know this is just a part of it, but this is where my problem is)
g.V().hasLabel("Person").has("Name", "Jonathan Q. Arbuckle").as("owner").values("age").inject(-10).sum().as("minAge").select("owner")
Returns an empty result but
g.V().hasLabel("Person").has("Name", "Jonathan Q. Arbuckle").as("owner").values("age").inject(-10).as("minAge").select("owner")
Returns the selected owner.
It seems that if I do a sum() or a count() in the query, then I cannot select('owner') anymore.
I do not understand this behaviour. What should I do to be able to select('owner') and filter the Pets based on their age?
Is there some other way I can write this query?
Thank you in advance.
Steps like sum(), count() and max() are known as reducing barrier steps. They cause what has happened earlier in the traversal to essentially be forgotten. One way to work around this is to use a project() step. As I do not have your data, I used the air-routes data set and used airport elevation as a substitute for age in your graph.
gremlin> g.V(3).
project("elev","minelev","city").
by("elev").
by(values("elev").inject(-10).sum()).
by("city")
==>[elev:542,minelev:532,city:Austin]
I wrote some notes about reducing barrier steps here: http://kelvinlawrence.net/book/PracticalGremlin.html#rbarriers
UPDATED
If you want to find airports with an elevation less than that of the starting airport by no more than 10, while avoiding the math() step, you can use this formulation:
g.V(3).as('a').
project('min').by(values('elev').inject(-10).sum()).as('p').
select('a').
out().
where(lt('a')).by('elev').
where(gt('p')).by('elev').by('min')
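Translated back to the Person/Pet schema from the question (a sketch only; I am assuming the edge label 'Owns' and the lowercase property 'age' used in your queries):
g.V().hasLabel('Person').has('Name', 'Jonathan Q. Arbuckle').as('owner').
  project('minAge').by(values('age').inject(-10).sum()).as('p').
  select('owner').
  out('Owns').hasLabel('Pet').
  where(lt('owner')).by('age').
  where(gt('p')).by('age').by('minAge')
The first where() keeps pets younger than the owner; the second keeps pets whose age is above the computed minimum, i.e. not more than 10 years younger.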

gremlin - projecting to separate lists from single list with different where clauses

If I had this small graph:
addV('location').property('name','loc 1').property('active',true).property('size',2000).as('a')
addV('location').property('name','loc 2').property('active',true).property('size',1200).as('b')
addV('location').property('name','loc 3').property('active',true).property('size',1800).as('c')
addV('location').property('name','loc 4').property('active',true).property('size',2400).as('d')
addV('location').property('name','loc 5').property('active',true).property('size',2800).as('e')
addV('location').property('name','loc 6').property('active',true).property('size',4200).as('f')
addV('owner').property('name','john doe').as('o')
addV('building').property('name','building').property('active',true).as('building')
addE('on-property').from('a').to('building')
addE('owns').from('o').to('a')
addE('owns').from('o').to('b')
addE('owns').from('o').to('c')
addE('owns').from('o').to('d')
addE('owns').from('o').to('e')
addE('owns').from('o').to('f')
If I start with a base query like this:
//if 1 is the id of "john doe"
g.V(1).out('owns').has('active',true)
I want to add further where clauses but keep their results separate, so combining them with or() into a single list won't work.
The other consideration is that this is a small subset of a larger graph, so I don't really want to loop through the locations multiple times; i.e., I am thinking that a project would loop through the 1000+ locations once per projection, rather than doing a single pass:
g.V(1).
  project('sizeLimit','withBuildings').
    by(out('owns').where(and(has('active',true), has('size',lt(2000))))).
    by(out('owns').and(
        coalesce(
          out('on-property').count(local),
          constant(0)).is(gt(0)),
        has('active',true)))
I tried to do the related portions of the query first, rather than having to repeat the out() in each by(), like:
g.V(1).out('owns').
  has('active',true).
  project('sizeLimit','withBuildings').
    by(has('size',lt(2000))).
    by(coalesce(
        out('on-property').count(local),
        constant(0)).is(gt(0)))
but the projection didn't take properly and ended up producing a single item per projection, even if I added a fold or unfold.
So I am wondering whether the first way, putting the out() inside the by() rather than before it, is the only option, or whether I can aggregate or store the list and then do a where off that aggregate, or somehow loop through and push to separate lists.
I am using Azure Cosmos DB graph.
The end result would hopefully be JSON like this:
"sizeLimit":[
locationVertex,
locationVertex,
locationVertex
],
"withBuildings":[
locationVertex,
locationVertex
]
The "locationVertex" being the vertex that is returned by the filter for the set of the project.
I'm not sure I fully follow your problem, but it sounds like you're mostly interested in keeping the out('owns') from the by() modulator because for each by() you add you will be re-traversing the same portion of the graph. You're quite close with your second traversal example (though you say it wasn't working quite right) - with some minor modifications it seemed to behave for me:
gremlin> g.V().has('name','john doe').
......1> out('owns').has('active',true).
......2> fold().
......3> project('sizeLimit','withBuildings').
......4> by(unfold().has('size',lt(2000)).fold()).
......5> by(unfold().filter(out('on-property').count().is(gt(0))).fold())
==>[sizeLimit:[v[8],v[4]],withBuildings:[v[0]]]
I don't think you need the coalesce(), because count() will reduce the stream to zero when there are no outgoing edges.
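If your provider supports filtering by an anonymous traversal (worth checking on Cosmos DB), the whole count().is(gt(0)) existence test can usually be collapsed into a plain where(), which also short-circuits after the first matching edge. A sketch under that assumption:
g.V().has('name','john doe').
  out('owns').has('active',true).
  fold().
  project('sizeLimit','withBuildings').
    by(unfold().has('size',lt(2000)).fold()).
    by(unfold().where(out('on-property')).fold())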

Smart way to generate edges in Neo4J for big graphs

I want to generate a graph from a CSV file. The rows are the vertices and the columns the attributes. I want to generate edges by similarity between vertices (not necessarily with weights), such that when two vertices have the same value of some attribute, the edge between them gets that attribute with value 1 or true.
The simplest cypher query that occurs to me looks somewhat like this:
Match (a:LABEL), (b:LABEL)
WHERE a.attr = b.attr
CREATE (a)-[r:SIMILAR {attr : 1}]->(b)
The graph has about 148,000 vertices, and the Java heap size option is: dynamically calculated based on available system resources.
The query I posted gives a Neo.DatabaseError.General.UnknownFailure with a hint about the Java heap space setting above.
A problem I can think of is that a huge Cartesian product is built first in order to then look for matches to create edges. Is there a smarter, maybe an incremental, way to do that?
I think you need to change the model a little: there is no need to connect every node to each other by the value of a particular attribute. It is better to have an intermediate node to which you bind the nodes with the same attribute value.
This can be done at import time or later.
For example:
Match (A:LABEL) Where A.attr Is Not Null
Merge (S:Similar {propName: 'attr', propValue: A.attr})
Merge (A)-[r:Similar]->(S)
Later, with a separate query, you can remove Similar nodes with only one connection (i.e. no other nodes with an equal value of this attribute):
Match (S:Similar)<-[r]-()
With S, count(r) As r Where r=1
Detach Delete S
If you need to connect by all properties, you can use this query:
Match (A:LABEL) Where A.attr Is Not Null
With A, Keys(A) As keys
Unwind keys as key
Merge (S:Similar {propName: key, propValue: A[key]})
Merge (A)-[:Similar]->(S)
You're right that a huge Cartesian product will be produced.
You can iterate the a nodes in batches of 1000, for example, and run the query while incrementing the SKIP value on every iteration until it returns 0.
MATCH (a:Label)
WITH a SKIP 0 LIMIT 1000
MATCH (b:Label)
WHERE b.attr = a.attr AND id(b) > id(a)
CREATE (a)-[:SIMILAR_TO {attr: 1}]->(b)
RETURN count(*) as c
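Alternatively, a sketch assuming the APOC plugin is available (the same apoc.periodic.iterate procedure recommended in an answer above): let APOC do the batching instead of incrementing SKIP by hand:
CALL apoc.periodic.iterate(
  "MATCH (a:LABEL) RETURN a",
  "MATCH (b:LABEL)
   WHERE b.attr = a.attr AND id(b) > id(a)
   CREATE (a)-[:SIMILAR_TO {attr: 1}]->(b)",
  {batchSize: 1000})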

neo4j cypher : how to query a linked list

I'm having a bit of trouble designing a cypher query.
I have a graph data structure that records some data in time, using
(starting_node)-[:last]->(data1)-[:previous]->(data2)-[:previous]->(data3)->...
Each of the data nodes has a date, plus some data as attributes that I want to sum.
Now, for the sake of the example, let's say I want to query what happened last week.
The closest I got is to query something like
start n= ... // definition of the many starting nodes here
match n-[:last]->d1, path = d1-[:previous*0..7]->dn
where dn.date > some_date_a_week_ago
Unfortunately, as well as the right path, I also get all the intermediate paths (from 2 days ago, from 3 days ago, etc.).
Since there are many starting nodes, and thus many possible path lengths, I cannot ask for the longest path in my query. Furthermore, dn.date can be different from date_a_week_ago (if there is only one data node this week and one data node last month, then the expected path is of length 1).
Any tips on how to filter out the intermediate paths in my query?
Thanks in advance!
ps: by the way, I'm quite new to graph modeling, and I'd be interested in any answer even if it requires changing the graph structure.
You can add a further node "dnnext" to your path, and add a condition to ensure that "dn" is the last one that satisfies the condition:
start n= ... // definition of the many starting nodes here
match n-[:last]->d1, path = d1-[:previous*0..7]->dn-[:previous*0..1]->dnnext
where dn.date > some_date_a_week_ago and dnnext.date < some_date_a_week_ago
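In more recent Cypher (the START clause has long been removed), the same "dn is the last in-window node" idea can be written with an existential subquery. A sketch only, assuming Neo4j 4.x+, a $aWeekAgo parameter, and a hypothetical amount property standing in for the data you want to sum:
MATCH (n)-[:last]->(d1)          // however you identify your starting nodes
MATCH path = (d1)-[:previous*0..7]->(dn)
WHERE dn.date > $aWeekAgo
  AND NOT EXISTS {
    MATCH (dn)-[:previous]->(older)
    WHERE older.date > $aWeekAgo
  }
UNWIND nodes(path) AS d
RETURN n, sum(d.amount) AS total
Because the surviving dn has no in-window successor, only the maximal path per starting node is returned.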
