Neo4j MATCH then MERGE too many DB hits - graph

This is the query:
MATCH (n:Client{curp:'SOME_VALUE'})
WITH n
MATCH (n)-[:HIZO]-()-[r:FB]-()-[:HIZO]-(m:Client)
WHERE ID(n)<>ID(m)
AND NOT (m)-[:FB]->(n)
MERGE (n)-[:FB]->(m) RETURN m.curp
Why is the Merge stage getting so many DB hits if the query already narrowed down
n, m pairs to 6,781 rows?
Details of that stage shows this:
n, m, r
(n)-[ UNNAMED155:FB]->(m)

Keep in mind that queries build up rows, and operations in your query get run on every row that is built up.
Because the pattern in your match may find multiple paths to the same :Client, it will build up multiple rows with the same n and m (but possibly different r, but as you aren't using r anywhere else in your query, I encourage you to remove the variable).
This means that even though you mean to MERGE a single relationship between n and a distinct m, this MERGE operation will actually be run for every single duplicate row of n and m. One of those MERGEs will create the relationship, the others will be wasting cycles matching on the relationship that was created without doing anything more.
That's why we should be able to lower our db hits by only considering distinct pairs of n and m before doing the MERGE.
Also, since your query made sure we're only considering n and m where the relationship doesn't exist, we can safely use CREATE instead of MERGE, and it should save us some db hits because MERGE always attempts a MATCH first, which isn't necessary.
An improved query might look like this:
MATCH (n:Client{curp:'SOME_VALUE'})
WITH n
MATCH (n)-[:HIZO]-()-[:FB]-()-[:HIZO]-(m:Client)
WHERE n <> m
AND NOT (m)-[:FB]->(n)
WITH DISTINCT n, m
MERGE (n)-[:FB]->(m)
RETURN m.curp
EDIT
Returning the query to use MERGE for the :FB relationship, as attempts to use CREATE instead ended up not being as performant.

Related

Why is SQLite query on two indexed columns so slow?

I have a table with around 65 million rows that I'm trying to run a simple query on. The table and indexes looks like this:
CREATE TABLE E(
x INTEGER,
t INTEGER,
e TEXT,
A,B,C,D,E,F,G,H,I,
PRIMARY KEY(x,t,e,I)
);
CREATE INDEX ET ON E(t);
CREATE INDEX EE ON E(e);
The query I'm running looks like this:
SELECT MAX(t), B, C FROM E WHERE e='G' AND t <= 9878901234;
I need to run this queries for thousands of different values of t and was expecting each query to run in a fraction of a second. However, the above query is taking nearly 10 seconds to run!
I tried running the query plan but only get this:
0|0|0|SEARCH TABLE E USING INDEX EE (e=?)
So this should be using the index. With a binary search I would expect worse case only 26 tests, which I would be pretty quick.
Why is my query so slow?
Each table in a query can use one index. Since your WHERE clause looks at multiple columns, you can use a multi-column index. For these, all but the last column used from the index has to test for equality; the last one used can be used for greater than/less than.
So:
CREATE INDEX e_idx_e_t ON E(e, t);
should give you a boost.
For further reading about how Sqlite uses indexes, the Query Planner documentation is a good introduction.
You're also mixing an aggregate function (max(t)) and columns (B and C) that aren't part of a group. In Sqlite's case, this means that it will pick values for B and C from the row with the maximum t value; other databases usually throw an error.

Express conditions on two consecutive variable length relationships?

How to express a conditions for two consecutive variable length relationships?
Consider this partial query
MATCH(t1:Type{myID: 1})-[r:relType]->(:Type)-[rels:relType*0..]-(t2:Type{myID:100})
WHERE r.attr1>10
Basically I am trying to saying that there could be one or more relations from t1 to t2. The first relation r should satisfy a given condition on its attribute.
If this is the only relation between the two nodes then it's ok.
It at least another relation exist I want to add another condition such as:
WHERE r.attr1>10 AND r_next.attr2> r_prev.attr2+r_prev.attr1
where r_next and r_prev are consecutive relations: ()-[r_prev]->()-[r_next]-(). Note that at the first step r_prev is the first relation r.
I know rels is a collection but I do not know how to express such a condition.
Consecutive comparison like this isn't easy at this time, and it can't currently be evaluated during expansion.
You can do some filtering on this after, but it will be ugly.
We'll make use of the APOC Procedures for apoc.coll.pairsMin(), which takes a collection and returns a list of adjacent pairs.
MATCH (t1:Type{myID: 1}), (t2:Type{myID:100})
MATCH (t1)-[r:relType]->(:Type)-[rels:relType*0..]-(t2)
WHERE r.attr1>10
WITH t1, t2, apoc.coll.pairsMin(rels) as pairs
WHERE all(pair in pairs WHERE pair[0].attr1 + pair[0].attr2 < pair[1].attr2)
RETURN t1, t2 //or whatever you want to return from this

Dataflow: Hot keys in a cross join

I am using Google dataflow+ Scio to do a cross join of a dataset with itself to find out the topK most similar ones by doing cosine similarity. The data set has around 200k records and total size of the dataset is ~300MB.
I am joining the dataset with itself by passing it as a side input by setting the workerCacheMB to 500MB.
The dataset is a tuple and it looks like this: (String,Set(Integer)). The first element in the tuple is the URI and the next element is a set of entity indexes.
Most records in the dataset have under 500 entity indexes. However, there are about 7000 records which have over 10k entities and the maximum one has 171k entities.
I have some hot keys and hence the worker utitlization looks like this:
After it scaled up to 80 nodes and then scaled down to 1 node, it had already processed about 90% of the records. I assume, the hot keys have got stuck in the last one node and it took the rest of the time to process all the hotkeys.
I tried the --experiments=shuffle_mode=service option. Though it gave an improvement, the problem persists. I was thinking about ways to use the sharded HotKey join mentioned here, how ever since I need to find similarity I don't think I can afford to split the hot entities and rejoin them.
I was wondering if there is a way to solve it or if I basically have to live with this.
Obviously, this is a crude way to find sims. However, I am interested in finding solution to the Data engineering part of the problem, while letting ML engineers iterate on the Sim finding algorithms.
The stripped down version of the code looks as follows:
private def findSimilarities(e1: Set[Integer], e2: Set[Integer]): Float = {
val common = e1.intersect(e2)
val cosine = (common.size.toFloat) / (e1.size + e2.size).toFloat
cosine
}
val topN = sortedReverseTake[ElementSims](250)(by(_.getScore))
elements
.withSideInputs(elementsSI)
.flatMap { case (e1, si) =>
val fromUri = e1._1.toString
val fromEntities = e1._2
val sideInput: List[(String, Set[Integer])] = si(elementsSI)
val sims: List[ElementSims] = findSimilarities(fromUri,fromEntities,
sideInput)
topN(sims)
}
.toSCollection
.saveAsAvroFile(outputGCS)

Smart way to generate edges in Neo4J for big graphs

I want to generate a graph from a csv file. The rows are the vertices and the columns the attributes. I want to generate the edges by similarity on the vertices (not necessarily with weights) in a way, that when two vertices have the same value of some attribute, an edge between those two will have the same attribute with value 1 or true.
The simplest cypher query that occurs to me looks somewhat like this:
Match (a:LABEL), (b:LABEL)
WHERE a.attr = b.attr
CREATE (a)-[r:SIMILAR {attr : 1}]->(b)
The graph has about 148000 vertices and the Java Heap Sizeoption is: dynamically calculated based on available system resources.
The query I posted gives a Neo.DatabaseError.General.UnknownFailure with a hint to Java Heap Space above.
A problem I could think of, is that a huge cartesian product is build first to then look for matches to create edges. Is there a smarter, maybe a consecutive way to do that?
I think you need a little change model: no need to connect every node to each other by the value of a particular attribute. It is better to have a an intermediate node to which you will bind the nodes with the same value attribute.
This can be done at the export time or later.
For example:
Match (A:LABEL) Where A.attr Is Not Null
Merge (S:Similar {propName: 'attr', propValue: A.attr})
Merge (A)-[r:Similar]->(S)
Later with separate query you can remove similar node with only one connection (no other nodes with an equal value of this attribute):
Match (S:Similar)<-[r]-()
With S, count(r) As r Where r=1
Detach Delete S
If you need connect by all props, you can use next query:
Match (A:LABEL) Where A.attr Is Not Null
With A, Keys(A) As keys
Unwind keys as key
Merge (S:Similar {propName: key, propValue: A[key]})
Merge (A)-[:Similar]->(S)
You're right that a huuuge cartesian product will be produced.
You can iterate the a nodes in batches of 1000 for eg and run the query by incrementing the SKIP value on every iteration until it returns 0.
MATCH (a:Label)
WITH a LIMIT SKIP 0 LIMIT 1000
MATCH (b:Label)
WHERE b.attr = a.attr AND id(b) > id(a)
CREATE (a)-[:SIMILAR_TO {attr: 1}]->(b)
RETURN count(*) as c

Matching specific items in several discrete collections

I have a problem whereby I have several discrete lists of ID's eg.
List (A) 1,2,3,4,5,7,8
List (B) 2,3,4,5
List (C) 4,2,8,9,1
etc...
I then have another collection of ID's...
For example: 1,2,4
I need to try and match one into each list. If I can perfectly match all ID's in my secondary collection (one collection ID matched with an ID from each list) then I get a true result....
I have found that it becomes complicated because if you simply iterate over the lists matching the first collection/list pair that you encounter it may result in you precluding a possible combination further on down the line hence returning a false negative result.
For example:
List (A) 1,2,3,4
List (B) 1,2,3,4
List (C) 3,4
Collection is: 3,1,2
The first ID from the collection (3) matches with an entry in list A, the second ID in the collection (1) matches an item in list B, however the final ID in the collection (2) DOESNT match any entry in list C however if you rearrange the order of the collection to be: 2,1,3 then a match is found.... Therefore I am looking for some form of logic for attempting a match on all possible combinations in an efficient manner(?)
To make it more complicated the ID's are actually GUID's so cant just be sorted in ascending order
I hope I have described this well enough to make it clear what I am attempting and with a bit of luck somebody will be able to tell me that what I need to do is very easy and I am missing something real simple!
I am forced to code this in VB6 but any methods or pseudo code would be great. The backend of this is SQL server so if a solution using TSQL was possible this would be even better as all of the ID's are held in tables already.
Many thanks in advance.
Jake, yep the lists and the collection both contain GUIDS. I used plain integers to simplify the problem a bit.
Once a list has been matched it cant be searched again, hence the ordering problem that I tried to explain. If you say that a list as 'matched' then no further attempts to match this will be performed. It is this very behaviour that can cause a false negative.
'Sending' the collection in in every possible combination of orders would work but would be a massive job .....
I feel I must be missing a really straightforward concept or solution here??!!
Thanks for your assistance so far.
I don't see a way around checking each GUID contained in the lists against each GUID in the collection. You would have to keep record of in which lists each GUID in the collection occurs.
To use your example of the Collection (3, 1, 2), 3 occurs in List A, B and C.
You will basically be left with this dataset.
3 (A, B, C)
1 (A, B)
2 (A, B)
Once you have distilled it down to this dataset you can determine whether there are any GUIDs with zero occurrences in the lists which would result in a negative.
I am not at all well versed in algorithms, but this is how I would proceed after that :
Start with the first set (A, B, C), and check how many times it occurs further on in the dataset. In this case no occurrences are found.
Moving on to the next set (A, B), if the number of occurrences of this set is found to be greater than the length of this set, i.e. more than two occurrences, would result in a negative. If the number of occurrences match the length exactly, as is the case here, the set (A, B) can be removed from any further consideration.
3 (C)
1 ()
2 ()
I guess you would continue to repeat the process until a negative is identified or all the occurrences have been excluded. There is probably a recognized algorithm for this sort of problem, but my knowledge is a bit lacking in that respect. :(

Resources