neo4j cypher : how to query a linked list - graph

I'm having a bit of trouble to design a cypher query.
I have a graph data structures that records some data in time, using
(starting_node)-[:last]->(data1)-[:previous]->(data2)-[:previous]->(data3)->...
Each of the data nodes has a date, and some data as attributes that I want to sum.
Now, for the sake of the example, let's say I want to query what happened last week.
The closer I got is to query something like
start n= ... // definition of the many starting nodes here
match n-[:last]->d1, path = d1-[:previous*0..7]->dn
where dn.date > some_date_a_week_ago
Unfortunately, as I get the right path, I also get all the intermediate paths (from 2 days ago, from 3 days ago...etc).
Since there is many starting nodes, and thus many possible path lengths, I cannot ask for the longest path in my query. Furthermore, dn.date can be different from date_a_week_ago ( if there is only one data node this week, and one data node last month, then the expected path is of length=1).
Any tip on how to filter the intermediate paths in my query ?
Thanks in advance !
ps : by the way, I'm quite new with the graph modeling, and I'd be interested with any answer that would require to change the graph structure if needed.

You can add a further point "dnnext" in your path, and add a condition to ensure the "dn" is the last one that satisfis the condifition,
start n= ... // definition of the many starting nodes here
match n-[:last]->d1, path = d1-[:previous*0..7]->dn-[:previous*0..1]->dnnext
where dn.date > some_date_a_week_ago and dnnext < some_date_a_week

Related

all_simple_paths() in igraph in R is taking too much time

I am trying to solve a trip assignment problem (transport planning). Available data is this: trips between nodes and links shapefile with 'from' and 'to' (match with those in trips data) codes. The approach i am adopting is this:
take each Origin-Destination pair from trip data
find all the possible paths between that OD pair
sort those paths based on length
start with smallest path and assign trips to that path until its capacity is reached
then take 2nd smallest and assign the trips...and so on
the problem i am facing is at 2nd step. I am using all_simple_paths() to get all possible paths between two nodes but it is taking too long. Here is that line
paths_between <- all_simple_paths(g_2, from = "240", to = "14")
how to work around this? is there any algorithm that I can use to get all possible paths? any help would be appreciated. thank you.

Efficient way to create relationships from a csv in neo4j

I am trying to populate my graph db with relationships that I currently have access to in a file.
They are in the form were each line in the relationship csv has the unique IDs of the two nodes that relationship is describing as well as the kind of the relationship it is.
Each line in the relationship csv has something along of the lines of:
uniqueid1,uniqueid2,relationship_name,property1_value,..., propertyn_value
I already had all nodes created and was working on matching the nodes that match the uniqueids specified in each of the files and then creating the relationship between them.
However, the tend to be taking a long time to be creating for each of the relationships and my suspicion is that I am doing something wrong.
The csv file has about 2.5 million lines with different relationship types. So i manually set the relationships.rela property to one of them and try to run through creating all nodes involved in that relationship and follow up with the next using my where clause.
The amount of properties each node has has been reduced by an ellipsis(...) and the names redacted.
I currently have the query to create the relationships set up in the following way
:auto USING PERIODIC COMMIT 100 LOAD CSV WITH HEADERS FROM 'file:///filename.csv' as relationships
WITH relationships.uniqueid1 as uniqueid1, relationships.uniqueid2 as uniqueid2, relationships.extraproperty1 as extraproperty1, relationships.rela as rela... , relationships.extrapropertyN as extrapropertyN
WHERE relations.rela = "manager_relationship"
MATCH (a:Item {uniqueid: uniqueid1})
MATCH (b:Item {uniqueid: uniqueid2})
MERGE (b) - [rel: relationship_name {propertyvalue1: extraproperty1,...propertyvalueN: extrapropertyN }] -> (a)
RETURN count(rel)
Would appreciate if alternate patterns could be recommended.
Indexing is a mechanism that databases use, to speed up data lookups. In your case, since Item nodes are not indexed, these two matches can take a lot of time, especially if the number of Item nodes is very large.
MATCH (a:Item {uniqueid: uniqueid1})
MATCH (b:Item {uniqueid: uniqueid2})
To speed this up, you can create an index on Item nodes uniqueid property, like this:
CREATE INDEX unique_id_index FOR (n:Item) ON (n.uniqueid)
When you'll run your import query after creating the index, it will be much faster. But it will still take a bit of time as there are 2.5 million relationships. Read more about indexing in neo4j here.
Aside from suggestion from Charchit about create an index, I will recommend using an APOC function apoc.periodic.iterate which will execute the query into parallel batches of 10k rows.
https://neo4j.com/labs/apoc/4.4/overview/apoc.periodic/apoc.periodic.iterate/
For example:
CALL apoc.periodic.iterate(
"LOAD CSV WITH HEADERS FROM 'file:///filename.csv' as relationships RETURN relationships",
"WITH relationships.uniqueid1 as uniqueid1, relationships.uniqueid2 as uniqueid2, relationships.extraproperty1 as extraproperty1, relationships.rela as rela... , relationships.extrapropertyN as extrapropertyN
WHERE relations.rela = "manager_relationship"
MATCH (a:Item {uniqueid: uniqueid1})
MATCH (b:Item {uniqueid: uniqueid2})
MERGE (b) - [rel: relationship_name {propertyvalue1: extraproperty1,...propertyvalueN: extrapropertyN }] -> (a)",{batchSize:10000, parallel:true})
The first parameter will return all the data in the csv file then it will divide the rows into 10k per batch and it will run it in parallel using default concurrency (50 workers).
I use it often where I load 40M nodes/edges in about 30mins.

Do I need to create all nodes by hand in Neo4j?

I am probably missing something because I am very new to Neo4j, but looking at their Movie graph - probably the very first graph to play with when you are learning the platform - they give us a really big piece of code where every node and labels and properties are imputed by hand, one after the other. Ok, it seems fair to a small graph for learning purpose. But, how should I proceed when I want to import a CSV and create a graph from this data? I believe a hand-imput is not expected at all.
My data look something like this:
date
origin
destiny
value
type
balance
01-05-2021
A
B
500
transf
2500
It has more than 10 thousand rows like this.
I loaded it as:
LOAD CSV FROM "file:///MyData.csv" AS data
RETURN data;
and it worked. The data was loaded etc. But now I have some questions:
1- How do I proceeed if I want origin to be a node and destiny to be another node with type to be edges with value as property? I mean, I know how to create it like (a)->[]->(b) but how to create the entire graph without creating edge by edge, node by node, property by property etc...?
2- Am I able to select the date and see something like a time evolution for this graph? I want to see all transactions in 20-05-2021, 01-05-2021 etc and see how it evolves. Is it possible?
As example in the official docs says here: https://neo4j.com/docs/operations-manual/current/tutorial/neo4j-admin-import/#tutorial-neo4j-admin-import
You may want to create 3 separate files for the import:
First: you need the movies.csv to import nodes with label :Movie
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
Second: you need actors.csv to import nodes with label :Actor
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
Finally, you can import relationships
As you see, actors and movies are already imported. So now you just need to specify the relationships. In the example, you're importing ROLE relationship in the given format:
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
So as you see in the header, you've got values:
START_ID - where the relationship starts, from which node
role - property name (you can specify multiple properties here, just make sure the csv format contains data for it)
:END_IN - where the relationship ends, to which node
:TYPE - type of the relationship
That's all :)

Smart way to generate edges in Neo4J for big graphs

I want to generate a graph from a csv file. The rows are the vertices and the columns the attributes. I want to generate the edges by similarity on the vertices (not necessarily with weights) in a way, that when two vertices have the same value of some attribute, an edge between those two will have the same attribute with value 1 or true.
The simplest cypher query that occurs to me looks somewhat like this:
Match (a:LABEL), (b:LABEL)
WHERE a.attr = b.attr
CREATE (a)-[r:SIMILAR {attr : 1}]->(b)
The graph has about 148000 vertices and the Java Heap Sizeoption is: dynamically calculated based on available system resources.
The query I posted gives a Neo.DatabaseError.General.UnknownFailure with a hint to Java Heap Space above.
A problem I could think of, is that a huge cartesian product is build first to then look for matches to create edges. Is there a smarter, maybe a consecutive way to do that?
I think you need a little change model: no need to connect every node to each other by the value of a particular attribute. It is better to have a an intermediate node to which you will bind the nodes with the same value attribute.
This can be done at the export time or later.
For example:
Match (A:LABEL) Where A.attr Is Not Null
Merge (S:Similar {propName: 'attr', propValue: A.attr})
Merge (A)-[r:Similar]->(S)
Later with separate query you can remove similar node with only one connection (no other nodes with an equal value of this attribute):
Match (S:Similar)<-[r]-()
With S, count(r) As r Where r=1
Detach Delete S
If you need connect by all props, you can use next query:
Match (A:LABEL) Where A.attr Is Not Null
With A, Keys(A) As keys
Unwind keys as key
Merge (S:Similar {propName: key, propValue: A[key]})
Merge (A)-[:Similar]->(S)
You're right that a huuuge cartesian product will be produced.
You can iterate the a nodes in batches of 1000 for eg and run the query by incrementing the SKIP value on every iteration until it returns 0.
MATCH (a:Label)
WITH a LIMIT SKIP 0 LIMIT 1000
MATCH (b:Label)
WHERE b.attr = a.attr AND id(b) > id(a)
CREATE (a)-[:SIMILAR_TO {attr: 1}]->(b)
RETURN count(*) as c

Neo4j matching nodes based on multiple criteria

We are using Neo4j to find best-matches for a speed-dating type of conference. Prior to the conference each Person fills out a form that specifies:
Languages (one or more)
Location (one to five preferred)
Interests (one to five interests)
We've ingested the data into Neo4j such that People, Languages, Locations, and Interests are all node types. The labels on the nodes represent the literal values e.g. (Person:Dave)->[r:knows]->(Language:English).
We would like to iterate through all Person nodes and find all matches to other Person nodes who have the same Language and Location and Interests.
In pseudocode: Languages(English||Spanish) && Location (Maryland||DC||Virginia) && Interests(Books||Movies||Food||Sports)
I'm pretty new to Cypher so I would appreciate any help. Thanks!
Try something like this
Match (p:Person) with p
// skip 0 limit 1000 -optional if you have big data
Match (p)-[r:SPEAKS]->(l:Language)<-[r2:SPEAKS]-(p2:Person),
(p)-[r3:LIVES]->(l:Location)<-[r4:LIVES]-(p2),
(p)-[r5:LIKES]->(l:Interest)<-[r6:LIKES]-(p2)
where id(p) < id(p2)
Return distinct p, p2
You might use different way to exclude duplicates (id(p) < id(p2) ) as it woun't work with limit

Resources