Neo4j - Cypher: mutual object with traversing relationships - graph

I have a small Graph:
CREATE
(Dic1:Dictioniary { name:'Dic1' }),
(Dic2:Dictioniary { name: 'Dic2' }),
(Dic3:Dictioniary { name: 'Dic3' }),
(File1:File { name: 'File1' }),
(File2:File { name: 'File2' }),
(File3:File { name: 'File3' }),
(Dic2)-[:contains]->(Dic1),
(Dic1)-[:contains]->(File1),
(Dic3)-[:contains]->(File2),
(File1)-[:references]->(File3),
(File2)-[:references]->(File3)
I need a Cypher query to find out whether, for example, Dic2 and Dic3 have paths/relationships that lead to the same File.
In this case it would be true; the mutual File is File3.
Thanks for your help

When you are looking for just two dictionaries you can achieve this in a single statement:
MATCH (d2:Dictioniary { name:'Dic2' }),(d3:Dictioniary { name:'Dic3' })
MATCH (d2)-[:contains|references*]->(f:File)<-[:contains|references*]-(d3)
RETURN f
The two unbounded variable-length path matches can be expensive in general, but the cost stays manageable here because both ends of the pattern are bound from the outset by the two dictionary matches.
If you had an arbitrary number of Dictionaries to test you could do something like:
MATCH (d1:Dictioniary { name:'Dic1' }),(d2:Dictioniary { name:'Dic2' }),(d3:Dictioniary { name:'Dic3' })
WITH [d1,d2,d3] AS ds
MATCH (d)-[:contains|references*]->(f:File)
WHERE d IN ds
WITH f, ds, COLLECT(d) AS fds
WHERE size(ds) = size(fds)
RETURN f
This matches the dictionaries that you are interested in first, and for each of them in turn it finds the files that they reference. Importantly, the File object is preserved and the Dictionaries that reference it are collected into a list (fds). If we know that we had 3 dictionaries to begin with (size(ds)) and that a given file has the same number of related dictionaries (size(fds)), then all dictionaries must reference it.
If there may be multiple paths to a given File from a given Dictionary, insert the DISTINCT modifier into the second WITH statement so that each Dictionary is counted only once:
WITH f, ds, COLLECT(DISTINCT(d)) AS fds
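The counting approach above is equivalent to intersecting the sets of files reachable from each dictionary. A minimal Python sketch of that idea on the example graph, with no Neo4j involved; the `EDGES` table and the `reachable_files` and `common_files` helpers are illustrative names, not part of any driver API:

```python
# Example graph from the question as adjacency lists; both relationship
# types (contains, references) are followed, so they are merged here.
EDGES = {
    'Dic2': ['Dic1'],
    'Dic1': ['File1'],
    'Dic3': ['File2'],
    'File1': ['File3'],
    'File2': ['File3'],
}


def reachable_files(start):
    """Return every File node reachable from `start` via one or more hops."""
    seen, stack = set(), list(EDGES.get(start, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(EDGES.get(node, []))
    # Stand-in for the :File label check (assumed naming convention).
    return {n for n in seen if n.startswith('File')}


def common_files(dictionaries):
    """Files reachable from all of the given dictionaries."""
    sets = [reachable_files(d) for d in dictionaries]
    return set.intersection(*sets) if sets else set()


print(common_files(['Dic2', 'Dic3']))          # {'File3'}
print(common_files(['Dic1', 'Dic2', 'Dic3']))  # {'File3'}
```

Using sets here plays the same role as DISTINCT in the Cypher version: a dictionary that reaches a file along several paths still counts once.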

How to get portion of variable-length pattern path where relationship attribute >= some value?

How do I return just the portion of a variable length pattern path that meets some criteria?
In the example below, the following cypher query will return both blue & red nodes.
MATCH p=(a:Person)-[:KNOWS*1..10]->(b:Person)-[:KNOWS]->(j:Person { name: 'Jane'})
RETURN p
However, I want to return the blue nodes that have an incoming relationship with confidence_factor >= 0.75. The issue is that everything I've tried either:
eliminates the entire upper mixed blue/red path because it has one relationship that fails the test, or
eliminates just the Jim->Erin relationship because it fails the test.
Effectively, I want all sequential nodes going backwards from Jane where the relationship confidence_factor >= 0.75, but a given path should stop as soon as it encounters a relationship that fails that test and NOT CONTINUE, even if other relationships between nodes in that path might pass (e.g. Tom-[:KNOWS]->Jim).
CREATE (one:Person { name: 'Tom'})
,(two:Person { name: 'Jim'})
,(three:Person { name: 'Erin'})
,(four:Person { name: 'Kevin'})
,(five:Person { name: 'Skylar'})
,(six:Person { name: 'Jane'})
,(one)-[:KNOWS {confidence_factor:0.80}]->(two)
,(two)-[:KNOWS {confidence_factor:0.05}]->(three)
,(three)-[:KNOWS {confidence_factor:0.85}]->(six)
,(four)-[:KNOWS {confidence_factor:0.90}]->(five)
,(five)-[:KNOWS {confidence_factor:0.95}]->(six)
;
You should already get multiple rows with different path lengths. Just look for those paths where all relationships match your criteria.
MATCH p = (:Person {name: 'Jane'})<-[:KNOWS*1..10]-(:Person)
WHERE all(r IN relationships(p) WHERE r.confidence_factor >= 0.75)
RETURN p
If you want to extract the nodes from the path without Jane (remove tail() if you want Jane to be included):
MATCH p = (:Person { name: 'Jane'})<-[:KNOWS*1..10]-(:Person)
WHERE all(r IN relationships(p) WHERE r.confidence_factor >= 0.75)
UNWIND tail(nodes(p)) as person
RETURN DISTINCT person
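The all() predicate gives the "stop at the first failing relationship" behaviour because every prefix of a longer path is matched as its own row, so a path survives only up to the point where a relationship fails. The same walk can be sketched in plain Python on the sample data; the `KNOWS` list and the `trusted_ancestors` function are hypothetical names used only for illustration:

```python
# (source, target, confidence_factor) triples from the CREATE statement.
KNOWS = [
    ('Tom', 'Jim', 0.80),
    ('Jim', 'Erin', 0.05),
    ('Erin', 'Jane', 0.85),
    ('Kevin', 'Skylar', 0.90),
    ('Skylar', 'Jane', 0.95),
]


def trusted_ancestors(target, threshold=0.75, max_depth=10):
    """Walk KNOWS edges backwards from `target`, stopping at weak edges."""
    found = set()
    frontier = [target]
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            for src, dst, confidence in KNOWS:
                # Only cross relationships that pass the test; a failing
                # one ends that branch, so nodes behind it are never seen.
                if dst == node and confidence >= threshold and src not in found:
                    found.add(src)
                    next_frontier.append(src)
        frontier = next_frontier
    return found


print(sorted(trusted_ancestors('Jane')))  # ['Erin', 'Kevin', 'Skylar']
```

Tom and Jim are excluded even though Tom-[:KNOWS]->Jim passes the test, because the walk never gets past the Jim->Erin relationship.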

Producing files in dagster without caring about the filename

In the dagster tutorial, in the Materializations section, we choose a filename (sorted_cereals_csv_path) for our intermediate output, and then yield it as a materialization:
import csv
import os

from dagster import EventMetadataEntry, Materialization, Output, solid


@solid
def sort_by_calories(context, cereals):
    # Sort the data (removed for brevity)
    sorted_cereals_csv_path = os.path.abspath(
        'calories_sorted_{run_id}.csv'.format(run_id=context.run_id)
    )
    with open(sorted_cereals_csv_path, 'w') as fd:
        writer = csv.DictWriter(fd, fieldnames)
        writer.writeheader()
        writer.writerows(sorted_cereals)
    # Report the file that was written so it shows up in the run's event log
    yield Materialization(
        label='sorted_cereals_csv',
        description='Cereals data frame sorted by caloric content',
        metadata_entries=[
            EventMetadataEntry.path(
                sorted_cereals_csv_path, 'sorted_cereals_csv_path'
            )
        ],
    )
    yield Output(None)
However, this relies on being able to use the local filesystem (which may not be true), the file will likely get overwritten by later runs (which is not what I want), and it also forces us to come up with a filename which will never be used.
What I'd like to do in most of my solids is just say "here is a file object, please store it for me", without concerning myself with where it's going to be stored. Can I materialize a file without considering all these things? Should I use python's tempfile facility for this?
Actually it seems this is answered in the output_materialization example.
You basically define a type:
@usable_as_dagster_type(
    name='LessSimpleDataFrame',
    description='A more sophisticated data frame that type checks its structure.',
    input_hydration_config=less_simple_data_frame_input_hydration_config,
    output_materialization_config=less_simple_data_frame_output_materialization_config,
)
class LessSimpleDataFrame(list):
    pass
This type has an output_materialization strategy that reads the config:
def less_simple_data_frame_output_materialization_config(
    context, config, value
):
    csv_path = os.path.abspath(config['csv']['path'])
    # Save data to this path
And you specify this path in the config:
execute_pipeline(
    output_materialization_pipeline,
    {
        'solids': {
            'sort_by_calories': {
                'outputs': [
                    {'result': {'csv': {'path': 'cereal_out.csv'}}}
                ],
            }
        }
    },
)
You still have to come up with a filename for each intermediate output, but you can do it in the config, which can differ per-run, instead of defining it in the pipeline itself.
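Since the path lives in the config, it can also be generated rather than typed by hand. A rough sketch, assuming a unique name in the system temp directory is acceptable; `build_run_config` is a hypothetical helper and not part of dagster, and only the standard library is used:

```python
import os
import tempfile
import uuid


def build_run_config(solid_name='sort_by_calories', output_dir=None):
    """Build an environment dict with a unique CSV path for one run."""
    # Assumption: any writable directory will do; default to the temp dir.
    output_dir = output_dir or tempfile.gettempdir()
    filename = 'cereal_out_{}.csv'.format(uuid.uuid4().hex)
    csv_path = os.path.join(output_dir, filename)
    return {
        'solids': {
            solid_name: {
                'outputs': [{'result': {'csv': {'path': csv_path}}}],
            }
        }
    }


config = build_run_config()
print(config['solids']['sort_by_calories']['outputs'][0]['result']['csv']['path'])
```

The returned dict has the same shape as the second argument of the execute_pipeline call above, so each run would write to its own file without the pipeline code knowing the name.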

Python Cerberus dependencies on nested list level

Does Cerberus 1.2 support dependency validation on a list?
For instance the schema looks as follows:
schema = {
    'list_1': {
        'type': 'list',
        'schema': {
            'type': 'dict',
            'schema': {
                'simple_field': {'type': 'boolean'},
                'not_simple_field': {
                    'type': 'dict',
                    'schema': {
                        'my_field': {'dependencies': {'simple_field': True}}
                    }
                }
            }
        }
    }
}
The rule that I'd like to check is that my_field should only exist when simple_field is True. How would I translate that in Cerberus?
As of now Cerberus 1.2 does not support this feature. I've overridden the Validator class method _lookup_field in order to implement this functionality.
Here's the link to a feature request on GitHub
Here's my implementation:
# Method of a cerberus.Validator subclass; requires: from typing import Tuple
def _lookup_field(self, path: str) -> Tuple:
    """
    Implement relative paths with dot (.) notation as used
    in Python relative imports
    - A single leading dot indicates a relative import
      starting with the current package.
    - Two or more leading dots give a relative import to the parent(s)
      of the current package, one level per dot after the first
    Return: Tuple(dependency_name: str, dependency_value: Any)
    """
    # Python relative imports use a single leading dot
    # for the current level, however no dot in Cerberus
    # does the same thing, thus we need to check 2 or more dots
    if path.startswith('..'):
        parts = path.split('.')
        # Count the dots in the requested path (the `path` argument)
        dot_count = path.count('.')
        context = self.root_document
        # Walk down from the root to the enclosing container
        for key in self.document_path[:dot_count]:
            context = context[key]
        context = context.get(parts[-1])
        return parts[-1], context
    else:
        return super()._lookup_field(path)
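To see what the relative lookup resolves to, the same steps can be run on plain dicts and lists without Cerberus. This is only a sketch: `resolve_relative` is a hypothetical stand-in for the override, it counts the dots in the dependency path itself, and the `document_path` tuple is an assumption about what the validator would hold while checking my_field:

```python
def resolve_relative(root_document, document_path, path):
    """Mirror the '..' branch of the override on plain dicts and lists."""
    parts = path.split('.')
    dot_count = path.count('.')
    context = root_document
    # Descend only part of the way to the current field, which leaves
    # `context` at an enclosing container of the field being validated.
    for key in document_path[:dot_count]:
        context = context[key]
    return parts[-1], context.get(parts[-1])


document = {
    'list_1': [
        {'simple_field': True, 'not_simple_field': {'my_field': 'x'}},
    ]
}
# Assumed path to the dict holding my_field: list_1 -> item 0 -> not_simple_field.
document_path = ('list_1', 0, 'not_simple_field')

print(resolve_relative(document, document_path, '..simple_field'))
# ('simple_field', True)
```

Under these assumptions, a dependency key of '..simple_field' on my_field would point at the simple_field sibling of not_simple_field in the same list item.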

How to attach relationships in existing nodes in neo4j?

I'm trying to build a graph from a CSV file, but I'm not able to add additional relationships to the existing nodes.
My actual code is:
USING PERIODIC COMMIT 10000
LOAD CSV FROM 'my_file.csv' AS line
MERGE (p:Title { title: line[0]})
MERGE (a:Author { name: line[1]})
MERGE (a)-[:COLABORATE_IN]->(p)
WITH line WHERE line[2] IS NOT NULL
MERGE (b:Author {name: line[2]})
MERGE (b)-[:COLABORATE_IN]->(p) //not working
RETURN line[2]
It should be simple. It creates the nodes and the first relationships correctly, but for line[2] it only creates the relationships for new nodes. What could I do?
Thanks
Everything that is not passed through the WITH clause is not available to the next part of the query:
MERGE (a:Author { name: line[1]})
MERGE (a)-[:COLABORATE_IN]->(p)
WITH line WHERE line[2] IS NOT NULL
// p is no longer available here
Just add the p identifier to the WITH clause to make it available in the remaining part of the query:
USING PERIODIC COMMIT 10000
LOAD CSV FROM 'my_file.csv' AS line
MERGE (p:Title { title: line[0]})
MERGE (a:Author { name: line[1]})
MERGE (a)-[:COLABORATE_IN]->(p)
WITH p, line
WHERE line[2] IS NOT NULL
MERGE (b:Author {name: line[2]})
MERGE (b)-[:COLABORATE_IN]->(p)
RETURN line[2]

No Idea how to create a specific MapReduce in CouchDB

I've got 3 types of documents in my db:
{
  param: "a",
  timestamp: "t"
} (Type 1)
{
  param: "b",
  partof: "a"
} (Type 2)
{
  param: "b",
  timestamp: "x"
} (Type 3)
(I can't alter the layout...;-( )
Type 1 defines a start timestamp, it's like the start event. A Type 1 is connected to several Type 3 docs by Type 2 documents.
I want to get the latest Type 3 (highest timestamp) and the corresponding type 1 document.
How may I organize my Map/Reduce?
Easy. For highly relational data, use a relational database.
As user jhs stated before me, your data is relational, and if you can't change it, then you might want to reconsider using CouchDB.
By relational we mean that each "type 1" or "type 3" document in your data "knows" only about itself, and "type 2" documents hold the knowledge about the relation between documents of the other types. With CouchDB, you can only index by fields in the documents themselves, and go one level deeper when querying by using include_docs=true. Thus, what you asked for cannot be achieved with a single CouchDB query, because some of the desired data is two levels away from the requested document.
Here is a two-query solution:
{
  "views": {
    "param-by-timestamp": {
      "map": "function(doc) { if (doc.timestamp) emit(doc.timestamp, [doc.timestamp, doc.param]); }",
      "reduce": "function(keys, values) { return values.reduce(function(p, c) { return c[0] > p[0] ? c : p }) }"
    },
    "partof-by-param": {
      "map": "function(doc) { if (doc.partof) emit(doc.param, doc.partof); }"
    }
  }
}
You query it first with param-by-timestamp?reduce=true to get the latest timestamp in value[0] and its corresponding param in value[1], and then query again with partof-by-param?key="<what you got in previous query>". If you need to fetch the full documents together with the timestamp and param, then you will have to play with include_docs=true and provide the correct _doc values.
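The two-step flow can be tried out in memory before writing the views. A small Python sketch that imitates both queries over a list of dicts; the sample documents (with numeric timestamps) and the function names are made up for illustration, and nothing here talks to CouchDB:

```python
# Hypothetical sample documents following the three layouts in the question.
DOCS = [
    {'param': 'a', 'timestamp': 10},   # Type 1 (start event)
    {'param': 'b', 'partof': 'a'},     # Type 2 (link)
    {'param': 'b', 'timestamp': 25},   # Type 3
    {'param': 'c', 'partof': 'a'},     # Type 2 (link)
    {'param': 'c', 'timestamp': 40},   # Type 3
]


def param_by_timestamp(docs):
    """First query: map docs with a timestamp, reduce to the latest one."""
    rows = [[d['timestamp'], d['param']] for d in docs if 'timestamp' in d]
    return max(rows, key=lambda row: row[0])


def partof_by_param(docs, key):
    """Second query: look up the partof value(s) for a given param."""
    return [d['partof'] for d in docs if 'partof' in d and d['param'] == key]


latest_timestamp, latest_param = param_by_timestamp(DOCS)
print(latest_timestamp, latest_param)       # 40 c
print(partof_by_param(DOCS, latest_param))  # ['a']
```

The second result names the Type 1 document that the latest Type 3 document belongs to, which is the pairing the question asks for.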
