We are using Neo4j to find best-matches for a speed-dating type of conference. Prior to the conference each Person fills out a form that specifies:
Languages (one or more)
Location (one to five preferred)
Interests (one to five interests)
We've ingested the data into Neo4j such that People, Languages, Locations, and Interests are all node types. The labels on the nodes represent the literal values, e.g. (Person:Dave)-[r:knows]->(Language:English).
We would like to iterate through all Person nodes and find all matches to other Person nodes who have the same Language and Location and Interests.
In pseudocode: Languages(English||Spanish) && Location (Maryland||DC||Virginia) && Interests(Books||Movies||Food||Sports)
I'm pretty new to Cypher so I would appreciate any help. Thanks!
Try something like this:
MATCH (p:Person) WITH p
// SKIP 0 LIMIT 1000 -- optional if you have big data
MATCH (p)-[r:SPEAKS]->(lang:Language)<-[r2:SPEAKS]-(p2:Person),
      (p)-[r3:LIVES]->(loc:Location)<-[r4:LIVES]-(p2),
      (p)-[r5:LIKES]->(i:Interest)<-[r6:LIKES]-(p2)
WHERE id(p) < id(p2)
RETURN DISTINCT p, p2
You might need a different way to exclude duplicates than id(p) < id(p2), as that approach won't work together with LIMIT.
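If you also want to rank candidate pairs by how much they have in common, a small variant of the same pattern can count shared interests. This is only a sketch, assuming the same SPEAKS/LIVES/LIKES relationship types used above:
MATCH (p:Person)-[:SPEAKS]->(:Language)<-[:SPEAKS]-(p2:Person),
      (p)-[:LIVES]->(:Location)<-[:LIVES]-(p2)
WHERE id(p) < id(p2)
MATCH (p)-[:LIKES]->(i:Interest)<-[:LIKES]-(p2)
// count distinct shared interests per pair and sort best matches first
RETURN p, p2, count(DISTINCT i) AS sharedInterests
ORDER BY sharedInterests DESC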
I am trying to populate my graph database with relationships that I currently have access to in a file.
Each line in the relationship CSV contains the unique IDs of the two nodes the relationship connects, as well as the kind of relationship it is, along the lines of:
uniqueid1,uniqueid2,relationship_name,property1_value,..., propertyn_value
I already had all the nodes created and was working on matching the nodes with the uniqueids specified in each file and then creating the relationship between them.
However, the relationships are taking a long time to create, and my suspicion is that I am doing something wrong.
The CSV file has about 2.5 million lines covering different relationship types, so I manually set the relationships.rela property to one of them and try to create everything involved in that relationship before following up with the next type using my WHERE clause.
The properties on each node have been reduced to an ellipsis (...) and the names redacted.
I currently have the query to create the relationships set up in the following way
:auto USING PERIODIC COMMIT 100 LOAD CSV WITH HEADERS FROM 'file:///filename.csv' as relationships
WITH relationships.uniqueid1 as uniqueid1, relationships.uniqueid2 as uniqueid2, relationships.extraproperty1 as extraproperty1, relationships.rela as rela... , relationships.extrapropertyN as extrapropertyN
WHERE rela = "manager_relationship"
MATCH (a:Item {uniqueid: uniqueid1})
MATCH (b:Item {uniqueid: uniqueid2})
MERGE (b) - [rel: relationship_name {propertyvalue1: extraproperty1,...propertyvalueN: extrapropertyN }] -> (a)
RETURN count(rel)
Would appreciate if alternate patterns could be recommended.
Indexing is a mechanism that databases use to speed up data lookups. In your case, since Item nodes are not indexed, these two matches can take a lot of time, especially if the number of Item nodes is very large.
MATCH (a:Item {uniqueid: uniqueid1})
MATCH (b:Item {uniqueid: uniqueid2})
To speed this up, you can create an index on the Item nodes' uniqueid property, like this:
CREATE INDEX unique_id_index FOR (n:Item) ON (n.uniqueid)
When you run your import query after creating the index, it will be much faster, though it will still take some time since there are 2.5 million relationships. You can read more about indexing in the Neo4j documentation.
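If each uniqueid is guaranteed to be unique, an alternative is a uniqueness constraint, which both enforces uniqueness and is backed by an index. A sketch using the Neo4j 4.4+ syntax (the constraint name is illustrative):
// Enforce uniqueness of Item.uniqueid; this also creates the backing index.
CREATE CONSTRAINT item_uniqueid_unique IF NOT EXISTS
FOR (n:Item) REQUIRE n.uniqueid IS UNIQUE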
Aside from Charchit's suggestion to create an index, I recommend using the APOC procedure apoc.periodic.iterate, which will execute the query in parallel batches of 10k rows.
https://neo4j.com/labs/apoc/4.4/overview/apoc.periodic/apoc.periodic.iterate/
For example:
CALL apoc.periodic.iterate(
"LOAD CSV WITH HEADERS FROM 'file:///filename.csv' AS relationships RETURN relationships",
"WITH relationships.uniqueid1 AS uniqueid1, relationships.uniqueid2 AS uniqueid2, relationships.extraproperty1 AS extraproperty1, relationships.rela AS rela... , relationships.extrapropertyN AS extrapropertyN
WHERE rela = 'manager_relationship'
MATCH (a:Item {uniqueid: uniqueid1})
MATCH (b:Item {uniqueid: uniqueid2})
MERGE (b)-[rel:relationship_name {propertyvalue1: extraproperty1,...propertyvalueN: extrapropertyN}]->(a)",{batchSize:10000, parallel:true})
The first parameter returns all the rows from the CSV file; the procedure then divides them into batches of 10k and runs the second query on each batch in parallel, using the default concurrency (50 workers).
I use it often where I load 40M nodes/edges in about 30mins.
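One caveat: because these batches create relationships, running them with parallel:true can cause lock contention or deadlocks when two batches touch the same nodes. If you run into that, keep the batching but drop the parallelism by changing the config map:
// Serial batches avoid deadlocks when batches share nodes (slower but safe).
{batchSize:10000, parallel:false}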
I have a directed multigraph. With this query I tried to find all the nodes that are connected to the node with the uuid n1_34:
MATCH (n1:Node{uuid: "n1_34"}) -[r]- (n2:Node) RETURN n2, r
This gives me a list of n2 nodes (n1_1187, n2_2280, n2_1834, n2_932 and n2_722) and their relationships among themselves, which is exactly what I need.
Nodes n1_1187, n2_2280, n2_1834, n2_932 and n2_722 are connected to the node n1_34.
Now I need to order them based on the number of relationships each one has within this subgraph. For example, n1_1187 should be on top with 4 relationships, while the others have 1 relationship each.
I followed this post: Extract subgraph from Neo4j graph with Cypher, but it gives me the same result as the query above. I also tried returning count(r), but it gives me 1, since it counts each distinct relationship rather than the relationships that share a common source/target.
Usually with networkx I can copy this result into a subgraph and then count the relationships of each node. Can I do that with Neo4j without modifying the current graph? How?
Please help. Or is there another way?
This snippet will recreate your graph for testing purposes:
WITH ['n1_34,n1_1187','n1_34,n2_2280','n1_34,n2_1834','n1_34,n2_722', 'n1_34,n2_932','n1_1187,n2_2280','n1_1187,n2_932','n1_1187,n2_1834', 'n1_1187,n2_722'] AS node_relationships
UNWIND node_relationships AS relationship
WITH split(relationship, ",") AS nodes
MERGE (n1:Node {label: nodes[0]})
MERGE (n2:Node {label: nodes[1]})
MERGE (n1)-[:LINK]-(n2)
Once that is run, the graph has n1_34 linked to the other five nodes, with n1_1187 additionally linked to each of the four n2_* nodes.
Then this Cypher will select the nodes in the subgraph and count up each of their respective associated links, but only links to other nodes already in the subgraph:
MATCH (n1:Node {label:'n1_34'})-[:LINK]-(n2:Node)
WITH collect(DISTINCT n2) AS subgraph_nodes
UNWIND subgraph_nodes AS subgraph_node
MATCH (subgraph_node)-[r:LINK]-(n3:Node)
WHERE n3 IN subgraph_nodes
RETURN subgraph_node.label, count(r) ORDER BY count(r) DESC
Running the above yields n1_1187 with a count of 4, and the remaining nodes with a count of 1 each.
This query should do what you need:
MATCH (n1:Node{uuid: "n1_34"})-[r]-(n2:Node)
RETURN n1, n2, count(*) AS freq
ORDER BY freq DESC
Using PROFILE to assess the efficiency of some of the existing solutions against @DarrenHick's sample data, the following is the most efficient one I have found, needing only 84 DB hits:
MATCH (n1:Node{label:'n1_34'})-[:LINK]-(n2:Node)
WITH COLLECT(n2) AS nodes
UNWIND nodes AS n
RETURN n, SIZE([(n)-[:LINK]-(n3) WHERE n3 IN nodes | null]) AS cnt
ORDER BY cnt DESC
Darren's solution (adjusted to return subgraph_node instead of subgraph_node.label, for parity) requires 92 DB hits.
@LuckyChandrautama's own solution (provided in a comment to Darren's answer, and adjusted to match Darren's sample data) uses 122 DB hits.
This shows the importance of using PROFILE to assess the performance of different Cypher solutions against the actual data. You should try doing that with your actual data to see which one works best for you.
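To reproduce this kind of comparison, prefix a candidate query with PROFILE and compare the total DB hits in the resulting plan. For example, profiling the query above:
PROFILE
MATCH (n1:Node {label:'n1_34'})-[:LINK]-(n2:Node)
WITH COLLECT(n2) AS nodes
UNWIND nodes AS n
RETURN n, SIZE([(n)-[:LINK]-(n3) WHERE n3 IN nodes | null]) AS cnt
ORDER BY cnt DESC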
I want to generate a graph from a CSV file. The rows are the vertices and the columns the attributes. I want to generate the edges by similarity between the vertices (not necessarily with weights), such that when two vertices have the same value for some attribute, the edge between them carries that attribute with value 1 or true.
The simplest cypher query that occurs to me looks somewhat like this:
MATCH (a:LABEL), (b:LABEL)
WHERE a.attr = b.attr
CREATE (a)-[r:SIMILAR {attr : 1}]->(b)
The graph has about 148000 vertices, and the Java heap size option is dynamically calculated based on available system resources.
The query I posted fails with Neo.DatabaseError.General.UnknownFailure and a hint about the Java heap space mentioned above.
A problem I could think of is that a huge cartesian product is built first, only to then look for matches to create edges. Is there a smarter, maybe consecutive, way to do that?
I think you need a small change to the model: there is no need to connect every node to each other by the value of a particular attribute. It is better to have an intermediate node to which you bind all the nodes with the same attribute value.
This can be done at import time or later.
For example:
MATCH (A:LABEL) WHERE A.attr IS NOT NULL
MERGE (S:Similar {propName: 'attr', propValue: A.attr})
MERGE (A)-[r:Similar]->(S)
Later, with a separate query, you can remove Similar nodes that have only one connection (i.e. no other node has an equal value for this attribute):
MATCH (S:Similar)<-[r]-()
WITH S, count(r) AS cnt WHERE cnt = 1
DETACH DELETE S
If you need to connect by all properties, you can use this query:
MATCH (A:LABEL) WHERE A.attr IS NOT NULL
WITH A, keys(A) AS keys
UNWIND keys AS key
MERGE (S:Similar {propName: key, propValue: A[key]})
MERGE (A)-[:Similar]->(S)
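With this model in place, finding every node that shares a value with a given node becomes a cheap two-hop traversal instead of a cartesian comparison. A minimal sketch, where $id is an illustrative parameter:
// All nodes b sharing any attribute value with node a, grouped by attribute.
MATCH (a:LABEL)-[:Similar]->(s:Similar)<-[:Similar]-(b:LABEL)
WHERE id(a) = $id
RETURN s.propName, s.propValue, collect(b) AS similar_nodes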
You're right that a huge cartesian product will be produced.
You can iterate over the a nodes in batches of 1000, for example, and re-run the query, incrementing the SKIP value on every iteration until it returns 0:
MATCH (a:Label)
WITH a SKIP 0 LIMIT 1000
MATCH (b:Label)
WHERE b.attr = a.attr AND id(b) > id(a)
CREATE (a)-[:SIMILAR_TO {attr: 1}]->(b)
RETURN count(*) AS c
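On Neo4j 4.4 or later, the same batching can be written without manual SKIP bookkeeping by using CALL { ... } IN TRANSACTIONS; this is a sketch under that version assumption:
:auto MATCH (a:Label)
CALL {
  WITH a
  MATCH (b:Label)
  WHERE b.attr = a.attr AND id(b) > id(a)
  CREATE (a)-[:SIMILAR_TO {attr: 1}]->(b)
} IN TRANSACTIONS OF 1000 ROWS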
I'm trying to approach a multilingual graph database, but I'm struggling to come up with an optimal model.
My current proposal is to make two node types: Movie and MovieTranslation.
Movie holds all relationships, such as likes, related, ratings and comments. MovieTranslation contains all translatable data (title, plot, genres); a Movie node does not contain these kinds of properties, only the original_title.
Movie and MovieTranslation are tied together by a translation relationship.
When I query nodes, I would check whether they have a translation relationship for the queried locale (en_US, for example). If so, I merge the translation with the main node as the result.
I think this way might not be the best, but I can't think of a better one.
Do you guys have a better suggestion for the database model? It would be very appreciated.
I'm using neo4j, if you need this information.
Thanks,
Vinicius.
I suggest moving the original title to its own node as well; call it MovieTitle. "Complicating" your model in this way should actually "simplify" (or at least standardise) your queries, because you're always looking in one place for film titles (which also helps with indexing and searching).
You're assuming that films only have one original title, which isn't the case. A Korea-Japan co-production will have at least two original titles. Whole genres of Japanese cinema were released with different original Japanese titles in cinemas and on VHS.
Distinct from the idea of an original title is that of specific language titles. The same film released in different Chinese-speaking countries will have different Chinese-language titles that are deemed more marketable to the specific local audiences.
To get the original title:
MATCH (c:Country)<-[:HAS_NATIONALITY]-(m:Movie)-[:HAS_TITLE]->(t:MovieTitle)-[:HAS_NATIONALITY]->(c)
WHERE m.id = 1
RETURN collect([t.title, c.country_code])
To get the original title in China:
MATCH (m:Movie)-[:HAS_TITLE]->(t:MovieTitle)-[:HAS_NATIONALITY]->(c:Country)
WHERE c.country_code = "CN"
RETURN m, collect([t.title, c.country_code])
To get all language titles:
MATCH (m:Movie)-[:HAS_TITLE]->(t:MovieTitle)-[:HAS_NATIONALITY]->(c:Country)-[:HAS_LANGUAGE]->(l:Language)
RETURN m, collect([t.title, l.language_code])
To get all Chinese-language titles:
MATCH (m:Movie)-[:HAS_TITLE]->(t:MovieTitle)-[:HAS_NATIONALITY]->(c:Country)-[:HAS_LANGUAGE]->(l:Language)
WHERE l.language_code = "zh"
RETURN m, collect([t.title, c.name])
I would separate plot and genre into their own nodes. There is an argument that different national cinemas have unique genres, but if westerns and samurai dramas are both sub-genres of period dramas then you want to find them both on a period drama search.
I would still have the idea of Translation nodes, but don't confuse them with the domain you're modelling. Translation should be domain-ignorant and, for simple words/phrases like "romantic comedy", could almost be a third-party graph plug-in released by GraphAware in 2025.
Get the French-language genre titles of a specific film:
MATCH (m:Movie)-[:HAS_GENRE*]->(g:Genre)-[:HAS_TRANSLATION]->(t:Translation)-[:HAS_LANGUAGE]->(l:Language)
WHERE m.id = 100 AND l.language_code = "fr"
RETURN collect(t.translation)
Get all romantic comedies:
MATCH (m:Movie)-[:HAS_GENRE*]->(g:Genre)-[:HAS_TRANSLATION]->(t:Translation)
WHERE t.translation = "comédie romantique"
RETURN m
Unlike movie titles and genres, plots are altogether simpler, because you're modelling the film's story as a blob of text and not as a domain object in itself. Perhaps later you may do textual analysis on the plot texts to find themes, gender bias, etc., and model this in the graph as well.
Get the French language plot for a specific movie:
MATCH (m:Movie)-[:HAS_PLOT]->(p:Plot)-[:HAS_LANGUAGE]->(l:Language)-[:HAS_TRANSLATION]->(t:Translation)
WHERE m.id = 100 AND t.translation = "French"
RETURN p.plot
(Please treat the Cypher queries as pseudo-code. I didn't make a graph and test them.)
I think the model is ok.
You can RETURN movie, translation or RETURN {movie: movie, translation: translation}.
Currently, converting nodes to maps and combining those maps is not yet supported; that's something on the roadmap.
How and where do you want to use the nodes? If it's for rendering, you can just access the two columns or entries. If it's for graph visualization, you can also combine them into a single node in the JSON source for the viz.
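As a concrete version of the locale lookup described in the question, here is a minimal sketch, assuming a translation relationship named TRANSLATION and a locale property on MovieTranslation (both names are illustrative):
// Fetch a movie together with its translation for the requested locale, if any.
MATCH (m:Movie {id: $movieId})
OPTIONAL MATCH (m)-[:TRANSLATION]->(t:MovieTranslation {locale: $locale})
RETURN m AS movie, t AS translation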
I'm having a bit of trouble designing a Cypher query.
I have a graph data structure that records some data in time, using
(starting_node)-[:last]->(data1)-[:previous]->(data2)-[:previous]->(data3)->...
Each of the data nodes has a date, plus some data as attributes that I want to sum.
Now, for the sake of the example, let's say I want to query what happened last week.
The closest I got is to query something like
start n= ... // definition of the many starting nodes here
match n-[:last]->d1, path = d1-[:previous*0..7]->dn
where dn.date > some_date_a_week_ago
Unfortunately, along with the right path, I also get all the intermediate paths (from 2 days ago, from 3 days ago, etc.).
Since there are many starting nodes, and thus many possible path lengths, I cannot ask for the longest path in my query. Furthermore, dn.date can be different from date_a_week_ago (if there is only one data node this week and one data node last month, then the expected path has length 1).
Any tip on how to filter out the intermediate paths in my query?
Thanks in advance!
PS: by the way, I'm quite new to graph modeling, and I'd be interested in any answer that would require changing the graph structure if needed.
You can add a further node dnnext to your path, with a condition to ensure that dn is the last node that satisfies the date condition:
start n= ... // definition of the many starting nodes here
match n-[:last]->d1, path = d1-[:previous*0..7]->dn-[:previous*0..1]->dnnext
where dn.date > some_date_a_week_ago and dnnext.date <= some_date_a_week_ago
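For reference, the same idea in modern Cypher (the START clause was removed in later Neo4j versions), using an existential subquery (Neo4j 4.0+) to assert that dn is the last node inside the window; labels and parameters here are illustrative:
// dn lies within the window, and no node after dn is still inside it,
// so only the full path for the week is returned.
MATCH (n:Start)-[:last]->(d1)
MATCH path = (d1)-[:previous*0..7]->(dn)
WHERE dn.date > $aWeekAgo
  AND NOT EXISTS {
    MATCH (dn)-[:previous]->(next)
    WHERE next.date > $aWeekAgo
  }
RETURN path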