I am using Google Dataflow + Scio to do a cross join of a dataset with itself to find the top-K most similar records via cosine similarity. The dataset has around 200k records and a total size of ~300 MB.
I am joining the dataset with itself by passing it as a side input, with workerCacheMB set to 500 MB.
Each record in the dataset is a tuple of the form (String, Set[Integer]): the first element is the URI and the second is a set of entity indexes.
Most records in the dataset have under 500 entity indexes. However, there are about 7000 records which have over 10k entities and the maximum one has 171k entities.
I have some hot keys, and hence worker utilization tails off: after the job scaled up to 80 nodes and then back down to 1 node, it had already processed about 90% of the records. I assume the hot keys got stuck on that last node, which then took the rest of the time to process them.
I tried the --experiments=shuffle_mode=service option. Though it gave an improvement, the problem persists. I was thinking about ways to use the sharded hot-key join mentioned here; however, since I need to compute similarity, I don't think I can afford to split the hot entities and rejoin them.
I was wondering if there is a way to solve this or if I basically have to live with it.
Obviously, this is a crude way to compute similarity. However, I am interested in solving the data-engineering part of the problem while letting the ML engineers iterate on the similarity algorithms.
The stripped down version of the code looks as follows:
// Crude overlap-based score standing in for cosine similarity
private def findSimilarities(e1: Set[Integer], e2: Set[Integer]): Float = {
  val common = e1.intersect(e2)
  val cosine = common.size.toFloat / (e1.size + e2.size).toFloat
  cosine
}
val topN = sortedReverseTake[ElementSims](250)(by(_.getScore))
elements
  .withSideInputs(elementsSI)
  .flatMap { case (e1, si) =>
    val fromUri = e1._1.toString
    val fromEntities = e1._2
    val sideInput: List[(String, Set[Integer])] = si(elementsSI)
    // Overload (not shown) that scores fromEntities against every record in the side input
    val sims: List[ElementSims] = findSimilarities(fromUri, fromEntities, sideInput)
    topN(sims)
  }
  .toSCollection
  .saveAsAvroFile(outputGCS)
Related
I am trying to populate my graph db with relationships that I currently have access to in a file.
They are in a relationship CSV where each line contains the unique IDs of the two nodes the relationship connects, as well as the relationship type.
Each line of the relationship CSV looks something along the lines of:
uniqueid1,uniqueid2,relationship_name,property1_value,..., propertyn_value
I have already created all the nodes and am now matching the nodes whose uniqueids appear in each line and then creating the relationship between them.
However, each relationship is taking a long time to create, and my suspicion is that I am doing something wrong.
The CSV file has about 2.5 million lines with different relationship types, so I manually set the relationships.rela filter to one type, run through creating all the relationships of that type, and then follow up with the next type using my WHERE clause.
The properties of each node have been abbreviated with an ellipsis (...) and the names redacted.
I currently have the query to create the relationships set up as follows:
:auto USING PERIODIC COMMIT 100 LOAD CSV WITH HEADERS FROM 'file:///filename.csv' as relationships
WITH relationships.uniqueid1 as uniqueid1, relationships.uniqueid2 as uniqueid2, relationships.extraproperty1 as extraproperty1, relationships.rela as rela... , relationships.extrapropertyN as extrapropertyN
WHERE rela = "manager_relationship"
MATCH (a:Item {uniqueid: uniqueid1})
MATCH (b:Item {uniqueid: uniqueid2})
MERGE (b) - [rel: relationship_name {propertyvalue1: extraproperty1,...propertyvalueN: extrapropertyN }] -> (a)
RETURN count(rel)
I would appreciate it if alternate patterns could be recommended.
Indexing is a mechanism that databases use to speed up data lookups. In your case, since Item nodes are not indexed, these two MATCHes can take a lot of time, especially if the number of Item nodes is very large.
MATCH (a:Item {uniqueid: uniqueid1})
MATCH (b:Item {uniqueid: uniqueid2})
To speed this up, you can create an index on the Item nodes' uniqueid property, like this:
CREATE INDEX unique_id_index FOR (n:Item) ON (n.uniqueid)
When you run your import query after creating the index, it will be much faster. But it will still take a while, as there are 2.5 million relationships. Read more about indexing in Neo4j here.
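Before re-running the import, you can also make sure the index has finished populating; a minimal check, assuming Neo4j 4.2+ for SHOW INDEXES (older versions can use CALL db.indexes() instead):
// wait for index population to finish, then list indexes and their state
CALL db.awaitIndexes();
SHOW INDEXES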
Aside from the suggestion from Charchit to create an index, I recommend using the APOC procedure apoc.periodic.iterate, which will execute the query in parallel batches of 10k rows.
https://neo4j.com/labs/apoc/4.4/overview/apoc.periodic/apoc.periodic.iterate/
For example:
CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM 'file:///filename.csv' as relationships RETURN relationships",
  "WITH relationships.uniqueid1 as uniqueid1, relationships.uniqueid2 as uniqueid2, relationships.extraproperty1 as extraproperty1, relationships.rela as rela... , relationships.extrapropertyN as extrapropertyN
   WHERE rela = 'manager_relationship'
   MATCH (a:Item {uniqueid: uniqueid1})
   MATCH (b:Item {uniqueid: uniqueid2})
   MERGE (b) - [rel: relationship_name {propertyvalue1: extraproperty1,...propertyvalueN: extrapropertyN }] -> (a)",
  {batchSize:10000, parallel:true})
The first statement returns all the rows from the CSV file; the rows are then divided into batches of 10k, and the second statement is run for each batch in parallel using the default concurrency (50 workers).
I use it often; loading 40M nodes/edges takes me about 30 minutes.
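If you need to tune the parallelism, the worker count can be set explicitly alongside batchSize. A sketch of the same call with explicit settings (concurrency and retries are, to the best of my knowledge, standard apoc.periodic.iterate config keys; the extra properties are omitted for brevity):
CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM 'file:///filename.csv' as relationships RETURN relationships",
  "WITH relationships.uniqueid1 as uniqueid1, relationships.uniqueid2 as uniqueid2, relationships.rela as rela
   WHERE rela = 'manager_relationship'
   MATCH (a:Item {uniqueid: uniqueid1})
   MATCH (b:Item {uniqueid: uniqueid2})
   MERGE (b)-[rel:relationship_name]->(a)",
  {batchSize:10000, parallel:true, concurrency:20, retries:3})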
I am computing the Jaccard similarity index for a category of nodes in a graph using the algo.similarity.jaccard algorithm from the Neo4j Graph Algorithms library. After calculating the Jaccard similarity and applying a cutoff, I store the metric in a relationship between the nodes (this is a feature of the algorithm). I am trying to see how the graph changes over time as I get new data to add to it (I will be reloading my CSV file with new data and merging in new nodes/relationships).
A problem I foresee is that once I run the Jaccard algorithm again with the updated graph, it will create duplicate relationships. This is the Neo4j documentation example of the code that I am using:
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard(data, {topK: 1, similarityCutoff: 0.1, write:true})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
Is there a way to specify that I do not want duplicate relationships each time I run this code on an updated graph? Manually, I'd use MERGE instead of CREATE, but since this is an algorithm from a library, I'm not sure how to go about that. FYI, I will not be able to make changes to the library plugin, and it seems there is no way to store the relationship under a different type such as SIMILARITY2.
There are at least 2 ways to avoid duplicate relationships from multiple calls to algo.similarity.jaccard:
Delete the existing relationships (by default, they have the SIMILAR type) before each call; see the snippet after these two options. This is probably the easiest approach.
Omit the write:true option when making the calls (so that the procedure won't create relationships at all), and write your own Cypher code to optionally create relationships that do not already exist (using MERGE).
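For the first approach, clearing out the previously written relationships is a one-line query; a minimal sketch, assuming the default SIMILAR relationship type (adjust it if you set writeRelationshipType):
MATCH (:Person)-[r:SIMILAR]->(:Person)
DELETE r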
[UPDATED]
Here is an example of the second approach (using the
algo.similarity.jaccard.stream variant of the procedure, which yields more useful values for our purposes):
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {topK: 1, similarityCutoff: 0.1})
YIELD item1, item2, similarity
WHERE item1 < item2
WITH algo.getNodeById(item1) AS n1, algo.getNodeById(item2) AS n2, similarity
MERGE (n1)-[s:SIMILAR]-(n2)
SET s.score = similarity
RETURN *
Since the procedure will return the same node pair twice (with the same similarity score), the WHERE clause is used to filter out one of the pairs, to speed up processing. The algo.getNodeById() utility function is used to get a node by its native ID. And the MERGE clause's relationship pattern does not specify a value for score, so that it will match an existing relationship even if it has a different value. The SET clause for setting the score is placed after the MERGE, which also helps to ensure the value is up to date.
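For contrast, a hypothetical fragment that puts the score inside the MERGE pattern (in place of the MERGE and SET lines above) would stop matching the existing relationship as soon as the similarity changes, and a duplicate would be created on the next run:
// anti-pattern: a changed score no longer matches, so a second SIMILAR relationship gets created
MERGE (n1)-[s:SIMILAR {score: similarity}]-(n2)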
I am looking to run a model that randomly matches 2 datasets (person and vacancy) based on matching characteristics of both datasets.
A person can have a role type, location, and other attributes, and the vacancies will be looking for these characteristics.
My current methodology uses a for loop to work through the vacancies, subset the person table based on the matching characteristics, and randomly pick a person.
Rough outline of current code:
for (i in 1:nrow(Vacancy)) {
  individual_vacancy <- Vacancy[i, ]
  available_person <- person[...matching conditions from individual_vacancy... & Available == 1]
  Vacancy$personid[i] <- randomsampleofavailableperson
  person$Available[personid == randomsampleofavailableperson] <- 0
}
This is very slow and computationally expensive due to the size of the dataset and, from what I can tell, the looping and writing back to the original datasets.
Are there any methodologies for this kind of problem / R packages I'd be able to take advantage of?
Edit: This is a 1:1 matching problem, which is where the difficulty lies, i.e. one person must be allocated to one vacancy. The Available flag update ensures this at the moment.
This is the query:
MATCH (n:Client{curp:'SOME_VALUE'})
WITH n
MATCH (n)-[:HIZO]-()-[r:FB]-()-[:HIZO]-(m:Client)
WHERE ID(n)<>ID(m)
AND NOT (m)-[:FB]->(n)
MERGE (n)-[:FB]->(m) RETURN m.curp
Why is the Merge stage getting so many DB hits if the query already narrowed down
n, m pairs to 6,781 rows?
The details of that stage show this:
n, m, r
(n)-[ UNNAMED155:FB]->(m)
Keep in mind that queries build up rows, and operations in your query get run on every row that is built up.
Because the pattern in your match may find multiple paths to the same :Client, it will build up multiple rows with the same n and m (and possibly different r; since you aren't using r anywhere else in your query, I encourage you to remove the variable).
This means that even though you mean to MERGE a single relationship between n and a distinct m, this MERGE operation will actually be run for every single duplicate row of n and m. One of those MERGEs will create the relationship, the others will be wasting cycles matching on the relationship that was created without doing anything more.
That's why we should be able to lower our db hits by only considering distinct pairs of n and m before doing the MERGE.
Also, since your query made sure we're only considering n and m where the relationship doesn't exist, we can safely use CREATE instead of MERGE, and it should save us some db hits because MERGE always attempts a MATCH first, which isn't necessary.
An improved query might look like this:
MATCH (n:Client{curp:'SOME_VALUE'})
WITH n
MATCH (n)-[:HIZO]-()-[:FB]-()-[:HIZO]-(m:Client)
WHERE n <> m
AND NOT (m)-[:FB]->(n)
WITH DISTINCT n, m
MERGE (n)-[:FB]->(m)
RETURN m.curp
EDIT
The query has been returned to using MERGE for the :FB relationship, as attempts to use CREATE instead turned out not to be as performant.
I want to generate a graph from a CSV file. The rows are the vertices and the columns the attributes. I want to generate the edges based on similarity between the vertices (not necessarily with weights), such that when two vertices have the same value for some attribute, the edge between them gets that attribute with value 1 or true.
The simplest Cypher query that occurs to me looks somewhat like this:
Match (a:LABEL), (b:LABEL)
WHERE a.attr = b.attr
CREATE (a)-[r:SIMILAR {attr : 1}]->(b)
The graph has about 148,000 vertices, and the Java heap size option is: dynamically calculated based on available system resources.
The query I posted gives a Neo.DatabaseError.General.UnknownFailure with a hint about the Java heap space mentioned above.
One problem I can think of is that a huge Cartesian product is built first before looking for matches to create the edges. Is there a smarter, perhaps more incremental, way to do that?
I think you need a small change to the model: there is no need to connect every node to every other node by the value of a particular attribute. It is better to have an intermediate node to which you bind all the nodes with the same attribute value.
This can be done when loading the data or later.
For example:
Match (A:LABEL) Where A.attr Is Not Null
Merge (S:Similar {propName: 'attr', propValue: A.attr})
Merge (A)-[r:Similar]->(S)
Later, with a separate query, you can remove Similar nodes that have only one connection (i.e. no other nodes with an equal value of this attribute):
Match (S:Similar)<-[r]-()
With S, count(r) As r Where r=1
Detach Delete S
If you need to connect by all properties, you can use the following query:
Match (A:LABEL) Where A.attr Is Not Null
With A, Keys(A) As keys
Unwind keys as key
Merge (S:Similar {propName: key, propValue: A[key]})
Merge (A)-[:Similar]->(S)
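With this model, pairs of nodes that share an attribute value are found through the common Similar node rather than a direct edge; a sketch using the labels and property names from the queries above:
// find pairs of LABEL nodes that share at least one attribute value
MATCH (a:LABEL)-[:Similar]->(s:Similar)<-[:Similar]-(b:LABEL)
WHERE id(a) < id(b)
RETURN a, b, s.propName AS sharedProperty, s.propValue AS sharedValue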
You're right that a huge Cartesian product will be produced.
You can iterate over the a nodes in batches of 1000, for example, and run the query repeatedly, incrementing the SKIP value on every iteration until it returns 0.
MATCH (a:Label)
WITH a SKIP 0 LIMIT 1000
MATCH (b:Label)
WHERE b.attr = a.attr AND id(b) > id(a)
CREATE (a)-[:SIMILAR_TO {attr: 1}]->(b)
RETURN count(*) as c
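If APOC is available, the same batching can be done without manually bumping the SKIP value, along the lines of the apoc.periodic.iterate approach shown earlier in this document; a sketch (parallel is left off, since concurrent relationship creation touching the same nodes can run into lock contention):
CALL apoc.periodic.iterate(
  "MATCH (a:Label) RETURN a",
  "MATCH (b:Label)
   WHERE b.attr = a.attr AND id(b) > id(a)
   CREATE (a)-[:SIMILAR_TO {attr: 1}]->(b)",
  {batchSize:1000, parallel:false})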