FROM and FROM NAMED graphs in SPARQL

I am a bit confused about FROM and FROM NAMED graphs in SPARQL. I have read the parts of the SPARQL specification relating to these two constructs; I just want to confirm my understanding.
Suppose an RDF Dataset is located at IRI I. I is made up of:
a default graph G
3 named graphs {(I1,G1), (I2,G2), (I3,G3)}
Now, suppose I have a SPARQL query:
SELECT *
FROM I
FROM I1
FROM NAMED I2
WHERE { ... }
So, if I understand correctly, to evaluate this query the SPARQL service constructs the query dataset behind the scenes, and this dataset will contain:
a default graph which is the merge of I and I1
a named graph I2
Is this understanding right?

The FROM and FROM NAMED clauses describe the dataset to be queried. How that dataset comes into being is not part of the SPARQL spec. There is a universe of graphs from which I, I1, and I2 are taken.
You are correct that the dataset for the query will have a default graph which is the merge of I and I1, and also a named graph I2.
Whether those graphs are taken from the underlying dataset is implementation dependent. It is common to provide that (the universe of graphs is the set of named graphs in the dataset), but it is also possible that I, I1, and I2 are fetched from the web (the universe of graphs is the web).
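As a hedged sketch (the example.org IRIs are placeholders standing in for I, I1, and I2 from the question), plain triple patterns match the merged default graph, while the named graph must be addressed explicitly with the GRAPH keyword:

SELECT *
FROM <http://example.org/I>          # merged into the default graph
FROM <http://example.org/I1>         # merged into the default graph
FROM NAMED <http://example.org/I2>
WHERE {
  ?s ?p ?o .                                        # matches the merge of I and I1
  GRAPH <http://example.org/I2> { ?s2 ?p2 ?o2 }     # matches the named graph I2
}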

Order the result by the number of relationships

I have a directed multigraph. With this query I tried to find all the nodes that are connected to the node with the uuid n1_34:
MATCH (n1:Node{uuid: "n1_34"}) -[r]- (n2:Node) RETURN n2, r
This gives me a list of n2 nodes (n1_1187, n2_2280, n2_1834, n2_932 and n2_722) and the relationships among them, which is exactly what I need.
Nodes n1_1187, n2_2280, n2_1834, n2_932 and n2_722 are connected to the node n1_34
Now I need to order them based on the number of relationships each has within this subgraph. So, for example, n1_1187 should be on top with 4 relationships, while the others have 1 relationship each.
I followed this post: Extract subgraph from Neo4j graph with Cypher, but it gives me the same result as the query above. I also tried returning count(r), but it gives me 1, since it counts unique relationships rather than the relationships sharing a common source/target.
Usually with networkx I can copy this result into a subgraph and then count the relationships of each node. Can I do that with Neo4j without modifying the current graph? How? Or is there another way?
This snippet will recreate your graph for testing purposes:
WITH ['n1_34,n1_1187','n1_34,n2_2280','n1_34,n2_1834','n1_34,n2_722', 'n1_34,n2_932','n1_1187,n2_2280','n1_1187,n2_932','n1_1187,n2_1834', 'n1_1187,n2_722'] AS node_relationships
UNWIND node_relationships AS relationship
WITH split(relationship, ",") AS nodes
MERGE (n1:Node {label: nodes[0]})
MERGE (n2:Node {label: nodes[1]})
MERGE (n1)-[:LINK]-(n2)
Once that is run, the graph matches the one described in the question.
Then this CQL will select the nodes in the subgraph and then subsequently count up each of their respective associated links, but only to other nodes existing already in the subgraph:
MATCH (n1:Node {label: 'n1_34'})-[:LINK]-(n2:Node)
WITH collect(DISTINCT n2) AS subgraph_nodes
UNWIND subgraph_nodes AS subgraph_node
MATCH (subgraph_node)-[r:LINK]-(n3:Node)
WHERE n3 IN subgraph_nodes
RETURN subgraph_node.label, count(r)
ORDER BY count(r) DESC
Running the above yields n1_1187 with a count of 4 and each of the other nodes with a count of 1, as expected.
This query should do what you need:
MATCH (n1:Node{uuid: "n1_34"})-[r]-(n2:Node)
RETURN n1, n2, count(*) AS freq
ORDER BY freq DESC
Using PROFILE to assess the efficiency of some of the existing solutions against @DarrenHick's sample data, the following is the most efficient one I have found, needing only 84 DB hits:
MATCH (n1:Node{label:'n1_34'})-[:LINK]-(n2:Node)
WITH COLLECT(n2) AS nodes
UNWIND nodes AS n
RETURN n, SIZE([(n)-[:LINK]-(n3) WHERE n3 IN nodes | null]) AS cnt
ORDER BY cnt DESC
Darren's solution (adjusted to return subgraph_node instead of subgraph_node.label, for parity) requires 92 DB hits.
@LuckyChandrautama's own solution (provided in a comment on Darren's answer, and adjusted to use Darren's sample data) needs 122 DB hits.
This shows the importance of using PROFILE to assess the performance of different Cypher solutions against the actual data. You should try doing that with your actual data to see which one works best for you.
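Profiling is just a matter of prefixing the candidate query with the PROFILE keyword, for example:

PROFILE
MATCH (n1:Node {label: 'n1_34'})-[:LINK]-(n2:Node)
WITH COLLECT(n2) AS nodes
UNWIND nodes AS n
RETURN n, SIZE([(n)-[:LINK]-(n3) WHERE n3 IN nodes | null]) AS cnt
ORDER BY cnt DESC

The execution plan, with per-operator DB hits, is then shown alongside the result.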

Handle a string return from R to Tableau and SPLIT it

I connect Tableau to R and execute an R function for recommending products. When R finishes, the return value is a string that contains all product details, like below:
ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA\nF003|NA|PROD_ABC\nF004|NA|PROD_ABC1\nC005|PROD_ABC2|NA\nC005|PRODABC3|PRODABC4
(Each line is separated by \n, indicating end of line.)
On Tableau, I display the calculated field which is as below:
ID|Existing_Prod|Recommended_Prod
C001|NA|PROD008
C002|PROD003|NA
F003|NA|PROD_ABC
F004|NA|PROD_ABC1
C005|PROD_ABC2|NA
C005|PRODABC3|PRODABC4
The above data reaches Tableau through a calculated field as a single string, which I want to split on the pipe character ('|') into three columns.
I used Split function on the calculated field :
SPLIT([R_Calculated_Field],'|',1)
SPLIT([R_Calculated_Field],'|',2)
SPLIT([R_Calculated_Field],'|',3)
But the error says "SPLIT function cannot be applied on Table calculations", which is self-explanatory. Are there any alternatives to solve this? I googled for best practices to handle integration between R and Tableau, and all I could find was simple k-means clustering code.
Make sure you understand how partitioning and addressing work for table calcs. Table calcs pass vectors of arguments to the R script and receive a single vector in response. The cardinality of those vectors depends on the partitioning of the table calc. You can view that by editing the table calc and clicking "specific dimensions". The fields that are not checked determine the partitioning, and thus the cardinality of the arguments you send to and receive from R.
This means it might be tricky to map your problem onto this infrastructure, though not necessarily impossible. It was designed to send a series of vector arguments with one cell per partitioning dimension, say, Manufacturer, and to get back one vector with one result per Manufacturer (or whatever combination of fields partitions your data for the table calc). It sounds like you are expecting an arbitrary-length list of recommendations. It shouldn't be too hard to have your R script turn the string into a vector before returning, but the size of the vector has to make sense.
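As a minimal sketch of that last step (the string is the example from the question, truncated), splitting the return value on the newline delimiter yields one element per line:

# Split the single "\n"-delimited return string into a character vector,
# one element per line, so Tableau receives one value per partition row.
result_string <- "ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA"
as_vector <- strsplit(result_string, "\n", fixed = TRUE)[[1]]
as_vector
# [1] "ID|Existing_Prod|Recommended_Prod" "C001|NA|PROD008" "C002|PROD003|NA"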
As an example of an approach that fits this model more easily, say you had a Tableau view with one row per Product (and you had N products), plus some other aggregated measure fields in the view per Product. (In Tableau speak, the view's level of detail is at the Product level.)
It would be straightforward to pass those measures as a series of argument vectors to R, each vector having N values, and then have R return a vector of reals of length N, where the value at each position is a recommender score for the product at that position. (Which is why the ordering, aka addressing, of the vectors also matters.)
Then you could filter out low scoring products from the view and visually distinguish highly recommended products.
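A hedged sketch of such a calculated field, assuming a hypothetical R function recommender_score() available on the Rserve side and a measure SUM([Sales]) that exists in the view:

SCRIPT_REAL("
  # .arg1 arrives as a vector with one value per product in the partition;
  # recommender_score() (hypothetical) must return a vector of the same length.
  recommender_score(.arg1)
", SUM([Sales]))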
So the first step to understanding R integration is to understand how table calcs operate with partitioning and addressing and to think in terms of vectors of fixed lengths passed in both directions.
If this model doesn’t support your use case well, you might be able to do something useful with URL actions or the JavaScript API.

Preventing duplicate SIMILAR relationships when using algo.similarity.jaccard on continuously updated data

I am computing the Jaccard similarity index for a category of nodes in a graph using the algo.similarity.jaccard algorithm from the Neo4j Graph Algorithms library. After calculating the Jaccard similarity and applying a cutoff, I store the metric in a relationship between the nodes (this is a feature of the algorithm). I am trying to see how the graph changes over time as I get new data to add into the graph (I will be reloading my CSV file with new data and merging in new nodes/relationships).
A problem I foresee is that once I run the Jaccard algorithm again with the updated graph, it will create duplicate relationships. This is the Neo4j documentation example of the code that I am using:
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard(data, {topK: 1, similarityCutoff: 0.1, write:true})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
Is there a way to specify that I do not want duplicate relationships each time I run this code on an updated graph? Manually, I'd use MERGE instead of CREATE, but seeing as this is an algorithm from a library, I'm not sure how to go about that. FYI, I will not be able to add changes to a library plugin, and it seems there is no way to store the relationship under a different type such as SIMILARITY2.
There are at least 2 ways to avoid duplicate relationships from multiple calls to algo.similarity.jaccard:
Delete the existing relationships (by default, they have the SIMILAR type) before each call; a snippet follows this list. This is probably the easiest approach.
Omit the write:true option when making the calls (so that the procedure won't create relationships at all), and write your own Cypher code to optionally create relationships that do not already exist (using MERGE).
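A minimal sketch of the first approach (assuming the relationships were written with the default SIMILAR type):

// Remove previously written SIMILAR relationships before re-running the algorithm
MATCH ()-[s:SIMILAR]->()
DELETE s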
[UPDATED]
Here is an example of the second approach (using the
algo.similarity.jaccard.stream variant of the procedure, which yields more useful values for our purposes):
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {topK: 1, similarityCutoff: 0.1})
YIELD item1, item2, similarity
WHERE item1 < item2
WITH algo.getNodeById(item1) AS n1, algo.getNodeById(item2) AS n2, similarity
MERGE (n1)-[s:SIMILAR]-(n2)
SET s.score = similarity
RETURN *
Since the procedure will return the same node pair twice (with the same similarity score), the WHERE clause filters out one of the pairs to speed up processing. The algo.getNodeById() utility function is used to get a node by its native ID. The MERGE clause's relationship pattern does not specify a value for score, so it will match an existing relationship even if that relationship has a different score value. The SET clause that assigns the score is placed after the MERGE, which ensures the value is up to date.

Using R nested lists as simple binary tree

I have a simple problem constrained to R. I have what is effectively a sort of binary tree, where only the terminal leaves have values associated with them. A toy example is visible here.
Essentially, I perform an operation between the leaves with the greatest depth (in a tie of depth, order doesn't matter). I have made it addition here, but, in reality, they're getting plugged into a more complicated formula.
I am limited to R for my code. This structure can be represented with this command, though I obtain it via other means:
testBranch<-list(list(list(list(20,15),40),list(10,30)),5) #Depth of 4
I have a working function to determine how deep the deepest level is, but nested lists in R are mind-boggling. Any clue how to efficiently find the set of indexes needed to access the deepest values? For instance, in the toy example above,
testBranch[[1]][[1]][[1]]
would give me what I'd like, a list containing 2 elements. Using my addition example, I could then do this:
indexesOI <- getIndexes(testBranch)   # e.g. c(1, 1, 1)
testBranch[[indexesOI]] <- testBranch[[indexesOI]][[1]] + testBranch[[indexesOI]][[2]]
# testBranch now has depth of 3
Resulting in the tree corresponding to step 1 in the toy example, which can be represented in R by:
testBranchStep1<-list(list(list(35,40),list(10,30)),5)
I am open to using packages, if need be. I'm just not looking to rewrite a whole node class/DFS in R, as I don't have much experience with the class system. I have looked into data.tree, but have had no luck coercing my nested lists into its data structure.
Any help you can provide would be great! Pardon the hastily made ASCII trees. I am largely self-taught and haven't asked many questions here, so please let me know, too, if I need to adjust my formatting! Thanks!
You can do this with data.tree.
library(data.tree)
testBranch <- list(list(list(list(20,15),40),list(10,30)),5)
tree <- FromListSimple(testBranch)
tree
This will print the tree:
       levelName
1 Root
2  °--1
3      ¦--1
4      ¦   °--1
5      °--2
data.tree provides many utility functions and properties (make sure you read the vignettes). To know the depth, in particular, use this:
height <- tree$height
Which yields:
[1] 4
You can then traverse the tree and find the nodes with maximum height:
maxDepthLeaves <- Traverse(tree, filterFun = function(node) node$level == height)
This traversal gives the list of nodes at the maximum level (only one node in this case). You can then use Get to retrieve any value from the traversal, e.g. the name, the position, or the pathString:
Get(maxDepthLeaves, 'pathString')
Displaying as:
1
"Root/1/1/1"
Sounds like you are halfway there. Whenever you find the deepest node(s), you can output the index into a list. Here's a recursive function in pseudo-code since I don't know R.
If tree is a leaf node
    If current depth is greater than max-depth
        Delete list of indices
        Append current index to list of indices
        Set max-depth to current depth
    Else if current depth is equal to max-depth
        Append current index to list of indices
Else
    For each element in the tree
        Get current index
        Recursively call this function, passing in the current index
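Since the question is constrained to R, here is a hedged translation of that recursion (deepest_paths is a name introduced for this sketch); rather than mutating a shared list, it returns index paths and keeps only the longest ones:

# Recursively collect the index path of every leaf, then keep only the
# paths of maximum length (i.e. the deepest leaves).
deepest_paths <- function(tree, path = integer(0)) {
  if (!is.list(tree)) {
    return(list(path))                      # leaf: report the path so far
  }
  paths <- unlist(lapply(seq_along(tree), function(i) {
    deepest_paths(tree[[i]], c(path, i))    # recurse, appending the child's index
  }), recursive = FALSE)
  depths <- vapply(paths, length, integer(1))
  paths[depths == max(depths)]              # keep only the deepest paths
}

testBranch <- list(list(list(list(20, 15), 40), list(10, 30)), 5)
deepest_paths(testBranch)                   # list(c(1,1,1,1), c(1,1,1,2)): paths to 20 and 15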

Smart way to generate edges in Neo4J for big graphs

I want to generate a graph from a CSV file. The rows are the vertices and the columns the attributes. I want to generate the edges by similarity between the vertices (not necessarily with weights), in such a way that when two vertices have the same value for some attribute, the edge between them carries that attribute with value 1 or true.
The simplest Cypher query that occurs to me looks somewhat like this:
Match (a:LABEL), (b:LABEL)
WHERE a.attr = b.attr
CREATE (a)-[r:SIMILAR {attr : 1}]->(b)
The graph has about 148,000 vertices, and the Java heap size option is: dynamically calculated based on available system resources.
The query I posted gives a Neo.DatabaseError.General.UnknownFailure with a hint about the Java heap space setting above.
A problem I can think of is that a huge cartesian product is built first, which is then scanned for matches to create edges. Is there a smarter, perhaps incremental, way to do that?
I think you need to change the model a little: there is no need to connect every node to each other by the value of a particular attribute. It is better to have an intermediate node to which you bind the nodes with the same attribute value.
This can be done at export time or later.
For example:
Match (A:LABEL) Where A.attr Is Not Null
Merge (S:Similar {propName: 'attr', propValue: A.attr})
Merge (A)-[r:Similar]->(S)
Later, with a separate query, you can remove Similar nodes that have only one connection (i.e. no other nodes with an equal value for this attribute):
Match (S:Similar)<-[r]-()
With S, count(r) As r Where r=1
Detach Delete S
If you need to connect by all properties, you can use this query:
Match (A:LABEL) Where A.attr Is Not Null
With A, Keys(A) As keys
Unwind keys as key
Merge (S:Similar {propName: key, propValue: A[key]})
Merge (A)-[:Similar]->(S)
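To read similar pairs back out of this model, a hedged sketch (pairs are connected through the shared intermediate node):

Match (a:LABEL)-[:Similar]->(s:Similar)<-[:Similar]-(b:LABEL)
Where id(a) < id(b)
Return a, b, s.propName, s.propValue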
You're right that a huge cartesian product will be produced.
You can iterate the a nodes in batches of 1000, for example, and run the query repeatedly, incrementing the SKIP value on every iteration until the query returns 0.
MATCH (a:Label)
WITH a SKIP 0 LIMIT 1000
MATCH (b:Label)
WHERE b.attr = a.attr AND id(b) > id(a)
CREATE (a)-[:SIMILAR_TO {attr: 1}]->(b)
RETURN count(*) as c
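If the APOC plugin is available (an assumption, not something the question mentions), apoc.periodic.iterate can take care of the batching and commit each batch in its own transaction:

CALL apoc.periodic.iterate(
  "MATCH (a:LABEL) RETURN a",
  "MATCH (b:LABEL)
   WHERE b.attr = a.attr AND id(b) > id(a)
   CREATE (a)-[:SIMILAR_TO {attr: 1}]->(b)",
  {batchSize: 1000, parallel: false}  // 1000 a-nodes per committed batch
)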
