Efficient way to create relationships from a csv in neo4j - graph

I am trying to populate my graph db with relationships that I currently have access to in a file.
They are in the form were each line in the relationship csv has the unique IDs of the two nodes that relationship is describing as well as the kind of the relationship it is.
Each line in the relationship csv has something along of the lines of:
uniqueid1,uniqueid2,relationship_name,property1_value,..., propertyn_value
I already had all nodes created and was working on matching the nodes that match the uniqueids specified in each of the files and then creating the relationship between them.
However, the tend to be taking a long time to be creating for each of the relationships and my suspicion is that I am doing something wrong.
The csv file has about 2.5 million lines with different relationship types. So i manually set the relationships.rela property to one of them and try to run through creating all nodes involved in that relationship and follow up with the next using my where clause.
The amount of properties each node has has been reduced by an ellipsis(...) and the names redacted.
I currently have the query to create the relationships set up in the following way
:auto USING PERIODIC COMMIT 100 LOAD CSV WITH HEADERS FROM 'file:///filename.csv' as relationships
WITH relationships.uniqueid1 as uniqueid1, relationships.uniqueid2 as uniqueid2, relationships.extraproperty1 as extraproperty1, relationships.rela as rela... , relationships.extrapropertyN as extrapropertyN
WHERE relations.rela = "manager_relationship"
MATCH (a:Item {uniqueid: uniqueid1})
MATCH (b:Item {uniqueid: uniqueid2})
MERGE (b) - [rel: relationship_name {propertyvalue1: extraproperty1,...propertyvalueN: extrapropertyN }] -> (a)
RETURN count(rel)
Would appreciate if alternate patterns could be recommended.

Indexing is a mechanism that databases use, to speed up data lookups. In your case, since Item nodes are not indexed, these two matches can take a lot of time, especially if the number of Item nodes is very large.
MATCH (a:Item {uniqueid: uniqueid1})
MATCH (b:Item {uniqueid: uniqueid2})
To speed this up, you can create an index on Item nodes uniqueid property, like this:
CREATE INDEX unique_id_index FOR (n:Item) ON (n.uniqueid)
When you'll run your import query after creating the index, it will be much faster. But it will still take a bit of time as there are 2.5 million relationships. Read more about indexing in neo4j here.

Aside from suggestion from Charchit about create an index, I will recommend using an APOC function apoc.periodic.iterate which will execute the query into parallel batches of 10k rows.
https://neo4j.com/labs/apoc/4.4/overview/apoc.periodic/apoc.periodic.iterate/
For example:
CALL apoc.periodic.iterate(
"LOAD CSV WITH HEADERS FROM 'file:///filename.csv' as relationships RETURN relationships",
"WITH relationships.uniqueid1 as uniqueid1, relationships.uniqueid2 as uniqueid2, relationships.extraproperty1 as extraproperty1, relationships.rela as rela... , relationships.extrapropertyN as extrapropertyN
WHERE relations.rela = "manager_relationship"
MATCH (a:Item {uniqueid: uniqueid1})
MATCH (b:Item {uniqueid: uniqueid2})
MERGE (b) - [rel: relationship_name {propertyvalue1: extraproperty1,...propertyvalueN: extrapropertyN }] -> (a)",{batchSize:10000, parallel:true})
The first parameter will return all the data in the csv file then it will divide the rows into 10k per batch and it will run it in parallel using default concurrency (50 workers).
I use it often where I load 40M nodes/edges in about 30mins.

Related

Do I need to create all nodes by hand in Neo4j?

I am probably missing something because I am very new to Neo4j, but looking at their Movie graph - probably the very first graph to play with when you are learning the platform - they give us a really big piece of code where every node and labels and properties are imputed by hand, one after the other. Ok, it seems fair to a small graph for learning purpose. But, how should I proceed when I want to import a CSV and create a graph from this data? I believe a hand-imput is not expected at all.
My data look something like this:
date
origin
destiny
value
type
balance
01-05-2021
A
B
500
transf
2500
It has more than 10 thousand rows like this.
I loaded it as:
LOAD CSV FROM "file:///MyData.csv" AS data
RETURN data;
and it worked. The data was loaded etc. But now I have some questions:
1- How do I proceeed if I want origin to be a node and destiny to be another node with type to be edges with value as property? I mean, I know how to create it like (a)->[]->(b) but how to create the entire graph without creating edge by edge, node by node, property by property etc...?
2- Am I able to select the date and see something like a time evolution for this graph? I want to see all transactions in 20-05-2021, 01-05-2021 etc and see how it evolves. Is it possible?
As example in the official docs says here: https://neo4j.com/docs/operations-manual/current/tutorial/neo4j-admin-import/#tutorial-neo4j-admin-import
You may want to create 3 separate files for the import:
First: you need the movies.csv to import nodes with label :Movie
movieId:ID,title,year:int,:LABEL
tt0133093,"The Matrix",1999,Movie
tt0234215,"The Matrix Reloaded",2003,Movie;Sequel
tt0242653,"The Matrix Revolutions",2003,Movie;Sequel
Second: you need actors.csv to import nodes with label :Actor
personId:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor
Finally, you can import relationships
As you see, actors and movies are already imported. So now you just need to specify the relationships. In the example, you're importing ROLE relationship in the given format:
:START_ID,role,:END_ID,:TYPE
keanu,"Neo",tt0133093,ACTED_IN
keanu,"Neo",tt0234215,ACTED_IN
keanu,"Neo",tt0242653,ACTED_IN
laurence,"Morpheus",tt0133093,ACTED_IN
laurence,"Morpheus",tt0234215,ACTED_IN
laurence,"Morpheus",tt0242653,ACTED_IN
carrieanne,"Trinity",tt0133093,ACTED_IN
carrieanne,"Trinity",tt0234215,ACTED_IN
carrieanne,"Trinity",tt0242653,ACTED_IN
So as you see in the header, you've got values:
START_ID - where the relationship starts, from which node
role - property name (you can specify multiple properties here, just make sure the csv format contains data for it)
:END_IN - where the relationship ends, to which node
:TYPE - type of the relationship
That's all :)

Preventing duplicate SIMILAR relationships when using algo.similarity.jaccard on continuously updated data

I am computing the Jaccard similarity index for a category of nodes in a graph using the algo.similarity.jaccard algorithm from the Neo4j graph algorithm's library. Once calculating the Jaccard similarity and indicating a cutoff, I am storing the metric in a relationship between the nodes (this is a feature of the algorithm). I am trying to see the change of the graph over time as I get new data to add into the graph (I will be reloading my CSV file with new data and merging in new nodes/relationships).
A problem I foresee is that once I run the Jaccard algorithm again with the updated graph, it will create duplicate relationships. This is the Neo4j documentation example of the code that I am using:
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard(data, {topK: 1, similarityCutoff: 0.1, write:true})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
Is there a way to specify I do not want to have duplicate relationships each time I run this code with an updated graph? Manually, I'd use MERGE instead of CREATE but seeing as though this an algorithm from a library, I'm not sure how to go about that. FYI I will not have the ability to add changes to a library plug in and it seems like there is no way to store the relationship under a different label such as SIMILARITY2.
There are at least 2 ways to avoid duplicate relationships from multiple calls to algo.similarity.jaccard:
Delete the existing relationships (by default, they have the SIMILAR type) before each call. This is probably the easiest approach.
Omit the write:true option when making the calls (so that the procedure won't create relationships at all), and write your own Cypher code to optionally create relationships that do not already exist (using MERGE).
[UPDATED]
Here is an example of the second approach (using the
algo.similarity.jaccard.stream variant of the procedure, which yields more useful values for our purposes):
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {topK: 1, similarityCutoff: 0.1})
YIELD item1, item2, similarity
WHERE item1 < item2
WITH algo.getNodeById(item1) AS n1, algo.getNodeById(item2) AS n2, similarity
MERGE (n1)-[s:SIMILAR]-(n2)
SET s.score = similarity
RETURN *
Since the procedure will return the same node pair twice (with the same similarity score), the WHERE clause is used to filter out one of the pairs, to speed up processing. The algo.getNodeById() utility function is used to get a node by its native ID. And the MERGE clause's relationship pattern does not specify a value for score, so that it will match an existing relationship even if it has a different value. The SET clause for setting the score is placed after the MERGE, which also helps to ensure the value is up to date.

Smart way to generate edges in Neo4J for big graphs

I want to generate a graph from a csv file. The rows are the vertices and the columns the attributes. I want to generate the edges by similarity on the vertices (not necessarily with weights) in a way, that when two vertices have the same value of some attribute, an edge between those two will have the same attribute with value 1 or true.
The simplest cypher query that occurs to me looks somewhat like this:
Match (a:LABEL), (b:LABEL)
WHERE a.attr = b.attr
CREATE (a)-[r:SIMILAR {attr : 1}]->(b)
The graph has about 148000 vertices and the Java Heap Sizeoption is: dynamically calculated based on available system resources.
The query I posted gives a Neo.DatabaseError.General.UnknownFailure with a hint to Java Heap Space above.
A problem I could think of, is that a huge cartesian product is build first to then look for matches to create edges. Is there a smarter, maybe a consecutive way to do that?
I think you need a little change model: no need to connect every node to each other by the value of a particular attribute. It is better to have a an intermediate node to which you will bind the nodes with the same value attribute.
This can be done at the export time or later.
For example:
Match (A:LABEL) Where A.attr Is Not Null
Merge (S:Similar {propName: 'attr', propValue: A.attr})
Merge (A)-[r:Similar]->(S)
Later with separate query you can remove similar node with only one connection (no other nodes with an equal value of this attribute):
Match (S:Similar)<-[r]-()
With S, count(r) As r Where r=1
Detach Delete S
If you need connect by all props, you can use next query:
Match (A:LABEL) Where A.attr Is Not Null
With A, Keys(A) As keys
Unwind keys as key
Merge (S:Similar {propName: key, propValue: A[key]})
Merge (A)-[:Similar]->(S)
You're right that a huuuge cartesian product will be produced.
You can iterate the a nodes in batches of 1000 for eg and run the query by incrementing the SKIP value on every iteration until it returns 0.
MATCH (a:Label)
WITH a LIMIT SKIP 0 LIMIT 1000
MATCH (b:Label)
WHERE b.attr = a.attr AND id(b) > id(a)
CREATE (a)-[:SIMILAR_TO {attr: 1}]->(b)
RETURN count(*) as c

What is a good data structure to save a dictionary?

I am designing a word filter that can filter out bad words (200 words in list) in an article (about 2000 words). And there I have a problem that what data structure I need to save this bad word list, so that the program can use a little time to find the bad word in articles?
-- more details
If the size of bad word list is 2000, the article is 50000, and the program will procedure about 1000 articles one time. Which data structure I should choose, a less then O(n^2) solution in searching?
You can use HashTable because its average complexity is O(1) for insert and search and your data just 2000 words.
http://en.wikipedia.org/wiki/Hash_table
A dictionary usually is a mapping from one thing (word in 1st language) to another thing (word in 2nd language). You don't seem to need this mapping here, but just a set of words.
Most languages provide a set data structure out of the box that has insert and membership testing methods.
A small example in Python, comparing a list and a set:
import random
import string
import time
def create_word(min_len, max_len):
return "".join([random.choice(string.ascii_lowercase) for _ in
range(random.randint(min_len, max_len+1))])
def create_article(length):
return [create_word(3, 10) for _ in range(length)]
wordlist = create_article(50000)
article = " ".join(wordlist)
good_words = []
bad_words_list = [random.choice(wordlist) for _ in range(2000)]
print("using list")
print(time.time())
for word in article.split(" "):
if word in bad_words_list:
continue
good_words.append(word)
print(time.time())
good_words = []
bad_words_set = set(bad_words_list)
print("using set")
print(time.time())
for word in article.split(" "):
if word in bad_words_set:
continue
good_words.append(word)
print(time.time())
This creates an "article" of 50000 randomly created "words" with a length between 3 and 10 letters, then picks 2000 of those words as "bad words".
First, they are put in a list and the "article" is scanned word by word if a word is in this list of bad words. In Python, the in operator tests for membership. For an unordered list, there's no better way than scanning the whole list.
The second approach uses the set datatype that is initialized with the list of bad words. A set has no ordering, but way faster lookup (again using the in keyword) if an element is contained. That seems to be all you need to know.
On my machine, the timings are:
using list
1421499228.707602
1421499232.764034
using set
1421499232.7644095
1421499232.785762
So it takes about 4 seconds with a list and 2 hundreths of a second with a set.
I think the best structure, you can use there is set. - http://en.wikipedia.org/wiki/Set_%28abstract_data_type%29
I takes log_2(n) time to add element to structure (once-time operation) and the same answer every query. So if you will have 200 elements in data structure, your program will need to do only about 8 operations to check, does the word is existing in set.
You need a Bag data structure for this problem. In a Bag data structure elements have no order but is designed for fast lookup of an element in the Bag. It time complexity is O(1). So for N words in an article overall complexity turns out to be O(N). Which is the best you can achieve in this case. Java Set is an example of Bag implementation in Java.

neo4j cypher : how to query a linked list

I'm having a bit of trouble to design a cypher query.
I have a graph data structures that records some data in time, using
(starting_node)-[:last]->(data1)-[:previous]->(data2)-[:previous]->(data3)->...
Each of the data nodes has a date, and some data as attributes that I want to sum.
Now, for the sake of the example, let's say I want to query what happened last week.
The closer I got is to query something like
start n= ... // definition of the many starting nodes here
match n-[:last]->d1, path = d1-[:previous*0..7]->dn
where dn.date > some_date_a_week_ago
Unfortunately, as I get the right path, I also get all the intermediate paths (from 2 days ago, from 3 days ago...etc).
Since there is many starting nodes, and thus many possible path lengths, I cannot ask for the longest path in my query. Furthermore, dn.date can be different from date_a_week_ago ( if there is only one data node this week, and one data node last month, then the expected path is of length=1).
Any tip on how to filter the intermediate paths in my query ?
Thanks in advance !
ps : by the way, I'm quite new with the graph modeling, and I'd be interested with any answer that would require to change the graph structure if needed.
You can add a further point "dnnext" in your path, and add a condition to ensure the "dn" is the last one that satisfis the condifition,
start n= ... // definition of the many starting nodes here
match n-[:last]->d1, path = d1-[:previous*0..7]->dn-[:previous*0..1]->dnnext
where dn.date > some_date_a_week_ago and dnnext < some_date_a_week

Resources