I have around 644 nodes in my graph database (Neo4j). I need to compute the distances between all of these 644 nodes and display them visually in the GUI. I want to pre-compute and store the distance between every pair of nodes in the database itself, rather than retrieving the nodes onto the server, computing the distances on the fly, and then showing them in the GUI.
I want to understand how to write such a query in Cypher. Please let me know.
I think this can work:
// half cross product
match (a),(b)
where id(a) < id(b)
match p=shortestPath((a)-[*]-(b))
with a,b,length(p) as l
create (a)-[:DISTANCE {distance:l}]->(b)
Set 4950 properties, created 4950 relationships, returned 0 rows in 4328 ms
But the browser visualization will blow up with this, just so you know.
Regarding your distance measure (it won't be that fast but should work):
MATCH (a:User)-[:READ]->(book)<-[:READ]-(b:User)
WITH a, b, count(*) AS common,
     length((a)-[:READ]->()) AS a_read,
     length((b)-[:READ]->()) AS b_read
// toFloat avoids integer division, which would truncate the ratio to 0 or 1
CREATE UNIQUE (a)-[:DISTANCE {distance: toFloat(common)/(a_read+b_read-common)}]-(b)
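The ratio in the query above, common / (a_read + b_read - common), is the Jaccard index of the two users' book sets. As a quick sanity check outside the database, here is a minimal Python sketch; the function name `jaccard` and the sample titles are made up for illustration.

```python
def jaccard(a_books, b_books):
    """Return |A intersect B| / |A union B| for two sets of book ids."""
    common = len(a_books & b_books)
    union = len(a_books) + len(b_books) - common
    # Two empty sets share nothing; report 0.0 instead of dividing by zero.
    return common / union if union else 0.0

alice = {"Dune", "Emma", "Ulysses"}
bob = {"Dune", "Emma", "Walden", "Hamlet"}
score = jaccard(alice, bob)  # 2 shared books out of 5 distinct ones
```

Note that this value grows as two users share more books, so it behaves like a similarity. If a true distance is needed, 1 - score is one common choice.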
I'm currently writing my master's thesis about clusterings in graphs. My prof said he wants the graph to be represented as a hash table, because it needs less space than the adjacency matrix and it is faster at checking whether an edge exists between two vertices than adjacency lists.
Anyway, I have a lot of problems understanding how a graph can be built with (perfect) hash functions. I know there should be two tables inside each other. The first includes every node and the second contains all the adjacent vertices. But how do I find a hash function that does this correctly?
After I built the graph I have to assign a weight to each edge. Is it better to build a new graph or keep the old one? How can I assign the weights correctly to each edge and how do I save it?
And the last question: How fast can I do a degree query for one vertex? O(1)?
Sorry for all these questions but I read so many papers and I'm still confused.
Thank you in advance for any help!!!
Lisa
You have to ask your professor, but I would assume it is something simple.
E.g. let us say you have a triangle A,B,C then in the hash you just represent it as
A {B,C}
B {A,C}
C {A,B}
So the link A,B can be looked up from either A's entry or B's entry.
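A minimal Python sketch of the same idea, assuming an undirected graph and using nested dictionaries (outer key: vertex, inner key: neighbour, value: edge weight). The class name `HashGraph` and its methods are hypothetical, and Python's built-in dict uses a general-purpose hash rather than a perfect hash function, so the O(1) lookups are average-case.

```python
from collections import defaultdict

class HashGraph:
    """Undirected graph stored as a dict of dicts: vertex -> {neighbour: weight}."""

    def __init__(self):
        self.adj = defaultdict(dict)

    def add_edge(self, u, v, weight=1.0):
        # Store the edge under both endpoints so it can be found from either side.
        self.adj[u][v] = weight
        self.adj[v][u] = weight

    def has_edge(self, u, v):
        # Two hash lookups; no scan over the neighbour list.
        return u in self.adj and v in self.adj[u]

    def degree(self, v):
        # The inner dict tracks its own size, so this is a constant-time call.
        return len(self.adj[v]) if v in self.adj else 0

g = HashGraph()
for u, v in [("A", "B"), ("B", "C"), ("A", "C")]:
    g.add_edge(u, v)
g.add_edge("A", "B", weight=2.5)  # re-adding an edge updates its weight in place
```

With this layout the weights live next to the edges, so assigning them later does not require building a second graph.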
I have a unidirectional graph.
The structure is as follows:
There are about 20,000 nodes in the graph.
I make the simplest request: MATCH (b1)-[:NEXT_BAR*10]->(b2) RETURN b1.id, b2.id LIMIT 5
The request is processed quickly.
But if I increase the number of relationships, the query takes much longer to process. In other words, the speed depends on the number of relationships.
This request takes longer than 5 minutes to complete: MATCH (b1)-[:NEXT_BAR*10000]->(b2) RETURN b1.id, b2.id LIMIT 5
This is still a simplified version. The request can have more than two nodes and the number of relationships can still be a range.
How can I optimize a query with a large number of relationships?
Perhaps there are other graph DBMS where there is no such problem?
Variable-length relationship queries have exponential time and memory complexity.
If R is the average number of suitable relationships per node, and D is the depth of the search, then the complexity is O(R ** D). This complexity will exist in any DBMS.
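To make the O(R ** D) bound concrete, here is a small arithmetic sketch. It assumes every node has exactly R matching outgoing relationships, and the helper name `max_paths` is made up for illustration.

```python
def max_paths(r, d):
    """Upper bound on paths explored from one start node: r choices at each of d hops."""
    return r ** d

chain = max_paths(1, 10000)    # a pure chain: one candidate path per start node
branching = max_paths(2, 30)   # two outgoing relationships per node, depth 30
```

Under this assumption a chain stays at one path per start node however deep it goes, while a branching factor of just 2 already gives over a billion candidate paths at depth 30.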
The theory is simple here, but there are a couple of intricacies in the query execution.
-[:NEXT_BAR*10000]- matches a path that is precisely 10000 edges long, so the query engine spends some time finding these paths. Another thing to mention is that in (b1)-[...]->(b2), b1 and b2 are not specific, which means that the query engine has to scan all nodes. If there is a limit, yes, the scan should return only a limited number of items. The whole execution also depends on the efficiency of the variable-length path implementation.
Some of the following might help:
Is it feasible to start from a specific node?
If there are branches, the only hope is aggressive filtering because of exponential complexity (as cybersam well explained).
Use a smaller number in the variable expand, or a range, e.g., [:NEXT_BAR*..10000]. In this case, the query engine will match any path up to 10000 edges long (different semantics, but maybe applicable).
* means the DFS type of execution. On the other hand, BFS might be the right approach. Memgraph (DISCLAIMER: I'm the co-founder and CTO) also supports BFS type of execution with filtering lambda.
Here is a Python script I've used to generate and import data into Memgraph. By using small nodes_no you can quickly notice the execution patterns.
import mgclient

# Make a connection to the database.
connection = mgclient.connect(
    host='127.0.0.1',
    port=7687,
    sslmode=mgclient.MG_SSLMODE_REQUIRE)
connection.autocommit = True
cursor = connection.cursor()

# Clean and setup database instance.
cursor.execute("""MATCH (n) DETACH DELETE n;""")
cursor.execute("""CREATE INDEX ON :Node(id);""")

# Import dataset.
nodes_no = 10

# Create nodes.
for identifier in range(0, nodes_no):
    cursor.execute("""CREATE (:Node {id: "%s"});""" % identifier)

# Create edges.
for identifier in range(1, nodes_no):
    cursor.execute("""
        MATCH (start_node:Node {id: "%s"})
        MATCH (end_node:Node {id: "%s"})
        CREATE (start_node)-[:NEXT_BAR]->(end_node);
    """ % (identifier - 1, identifier))
I want to create the initial population in a genetic algorithm. A population consists of paths between two nodes (source and destination). How can I find all possible paths between two nodes in an undirected graph?
Thanks
You could take a recursive approach to this problem. Do something along the lines of the following. (be warned I have not refined this).
Start by selecting a random node from the graph as a start node. And select a random node as the end node.
Look at all the connections to other nodes from the start one. Do not return to previous nodes. If there are no possible connections left stop.
If the node is the end node then stop and record the path. If not, then look at all the connections to that node, and repeat this step.
Repeat this process with every pair of nodes in the graph.
I'm sure you can see the recursive part to this solution. I'm afraid I cannot write up this solution currently but I hope this might point you in the right direction.
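The recursive idea described above can be sketched in Python as follows. This is a minimal sketch, assuming the graph is given as a dictionary mapping each node to a list of its neighbours; the function name `all_simple_paths` and the sample graph are made up for illustration.

```python
def all_simple_paths(graph, source, target):
    """Yield every path from source to target that visits no node twice."""
    def extend(path, visited):
        node = path[-1]
        if node == target:
            # Reached the destination: record a copy of the current path.
            yield list(path)
            return
        for neighbour in graph.get(node, ()):
            if neighbour not in visited:  # do not return to previous nodes
                visited.add(neighbour)
                path.append(neighbour)
                yield from extend(path, visited)
                # Backtrack so other branches can reuse this neighbour.
                path.pop()
                visited.remove(neighbour)
    yield from extend([source], {source})

# Undirected graph: every edge is listed under both endpoints.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}
paths = list(all_simple_paths(graph, "A", "D"))
```

The number of simple paths can grow exponentially with the size of the graph, so for an initial population it may be more practical to stop the generator after a fixed number of paths than to enumerate them all.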
I have a Neo4j graph with directed cycles. I have had no issue finding all descendants of A assuming I don't care about loops using this Cypher query:
match (n:TEST{name:"A"})-[r:MOVEMENT*]->(m:TEST)
return n,m,last(r).movement_time
The relationships between my nodes have a timestamp property on them, movement_time. I've simulated that in my test data below using numbers that I've imported as floats. I would like to traverse the graph using the timestamp as a constraint. Only follow relationships that have a greater movement_time than the movement_time of the relationship that brought us to this node.
Here is the CSV sample data:
from,to,movement_time
A,B,0
B,C,1
B,D,1
B,E,1
B,X,2
E,A,3
Z,B,5
C,X,6
X,A,7
D,A,7
Here is what the graph looks like:
I would like to calculate the descendants of every node in the graph and include the timestamp from the last relationship using Cypher; so I'd like my output data to look something like this:
Node:[{Descendant,Movement Time},...]
A:[{B,0},{C,1},{D,1},{E,1},{X,2}]
B:[{C,1},{D,1},{E,1},{X,2},{A,7}]
C:[{X,6},{A,7}]
D:[{A,7}]
E:[{A,3}]
X:[{A,7}]
Z:[{B,5}]
This non-Neo4J implementation looks similar to what I'm trying to do: Cycle enumeration of a directed graph with multi edges
This one is not 100% what you want, but very close:
MATCH (n:TEST)-[r:MOVEMENT*]->(m:TEST)
WITH n, m, r, [x IN range(0,length(r)-2) |
(r[x+1]).movement_time - (r[x]).movement_time] AS deltas
WHERE ALL (x IN deltas WHERE x>0)
RETURN n, collect(m), collect(last(r).movement_time)
ORDER BY n.name
We basically find all the paths between any of your nodes (beware: Cartesian products get very expensive on non-trivial datasets). In the WITH we're building a collection, deltas, that holds the difference between each pair of subsequent movement_time properties.
The WHERE applies an ALL predicate to filter out those having any non-positive value - aka we guarantee increasing values of movement_time along the path.
The RETURN then just assembles the results - but not as a map, instead one collection for the reachable nodes and the last value of movement_time.
The current issue is that we have duplicates since e.g. there are multiple paths from B to A.
As a general note: this problem can be solved much more elegantly and with better performance using the Java traversal API (http://neo4j.com/docs/stable/tutorial-traversal.html). There you would have a PathExpander that skips paths with decreasing movement_time early, instead of collecting all paths and filtering them afterwards (as Cypher does).
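To illustrate the early-pruning idea, here is a minimal in-memory Python sketch. It is not the Neo4j traversal API, just the same strategy applied to the sample CSV rows from the question; the function name `timed_descendants` is made up for illustration.

```python
from collections import defaultdict

def timed_descendants(edges, start):
    """Return {(node, movement_time)} reachable from start along strictly increasing times."""
    out = defaultdict(list)
    for src, dst, t in edges:
        out[src].append((dst, t))

    found = set()
    stack = [(start, float("-inf"))]  # (current node, time of the edge that led here)
    while stack:
        node, last_time = stack.pop()
        for dst, t in out[node]:
            # Prune early: only follow edges later than the one we arrived on,
            # and skip (node, time) states that were already expanded.
            if t > last_time and (dst, t) not in found:
                found.add((dst, t))
                stack.append((dst, t))
    return found

edges = [
    ("A", "B", 0), ("B", "C", 1), ("B", "D", 1), ("B", "E", 1), ("B", "X", 2),
    ("E", "A", 3), ("Z", "B", 5), ("C", "X", 6), ("X", "A", 7), ("D", "A", 7),
]
from_c = timed_descendants(edges, "C")
```

Because a node can be reached at several different times, the result may contain more than one entry per node; picking one of them (for example the earliest) is left as a choice for the caller.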
Given a directed graph, I need to find all vertices v such that, if u is reachable from v, then v is also reachable from u. I know that these vertices can be found using BFS or DFS, but that seems inefficient. I was wondering whether there is a better solution to this problem. Any help would be appreciated.
Fundamentally, you're not going to do any better than some kind of search (as you alluded to). I wouldn't worry too much about efficiency: these algorithms are generally linear in the number of nodes + edges.
The problem is a bit underspecified, so I'll make some assumptions about your data structure:
You know vertex u (because you didn't ask to find it)
You can iterate both the inbound and outbound edges of a node (efficiently)
You can iterate all nodes in the graph
You can (efficiently) associate a couple bits of data along with each node
In this case, use a convenient search starting from vertex u (depth/breadth, doesn't matter) twice: once following the outbound edges (marking nodes as "reachable from u") and once following the inbound edges (marking nodes as "reaching u"). Finally, iterate through all nodes and compare the two bits according to your purpose.
Note: as worded, your result set includes all nodes that do not reach vertex u. If you intended the conjunction instead of the implication, then you can save a little time by incorporating the test in the second search, rather than scanning all nodes in the graph. This also relieves assumption 3.
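The two-search approach can be sketched in Python as follows. This is a minimal sketch, assuming the graph is given as a list of directed (source, target) pairs; the names `reachable` and `classify` and the sample edges are made up for illustration.

```python
from collections import deque

def reachable(adj, start):
    """Breadth-first search: return the set of nodes reachable from start (including start)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def classify(edges, u):
    nodes = set()
    forward, backward = {}, {}
    for src, dst in edges:
        nodes.update((src, dst))
        forward.setdefault(src, []).append(dst)    # outbound edges
        backward.setdefault(dst, []).append(src)   # inbound edges
    from_u = reachable(forward, u)   # nodes reachable from u
    to_u = reachable(backward, u)    # nodes that can reach u
    # Implication as worded in the question: "v reaches u" implies "u reaches v".
    implication = {v for v in nodes if v not in to_u or v in from_u}
    # Conjunction: v and u reach each other.
    conjunction = from_u & to_u
    return implication, conjunction

edges = [("u", "a"), ("a", "u"), ("u", "b"), ("c", "u"), ("d", "e")]
implication, conjunction = classify(edges, "u")
```

In this example the implication set contains d and e, which never reach u at all, while the conjunction set contains only u and a. The conjunction needs no scan over all nodes, which matches the note above.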