I'm attempting to build a recommendation engine for a library system.
This is my db schema:
My starting point is a LoanerCard. The flow is then supposed to look like this: get all copies I have loaned -> get each copy's material -> get all copies of that material (including the original) -> get the LoanerCards that loaned those copies -> get all copies those cards have loaned -> return the material names of those copies plus an aggregated count to indicate the strength of each recommendation.
My best attempt so far has resulted in this query:
MATCH (L:LoanerCard {Barcode:"10007"})-[:LOANED]->(myLoans)-[:COPY_OF]-(masterMaterial),
      (masterMaterial)<-[:COPY_OF]-(allCopies),
      (allCopies)<-[:LOANED]-(coLoaners),
      (coLoaners)-[r:LOANED]->(theirCopies),
      (theirCopies)-[:COPY_OF]-(materials)
RETURN materials.Title AS Recommended, count(*) AS Strength
ORDER BY Strength DESC
My issue is that the traversal never includes the original copy or the LoanerCards adjacent to it, so it only covers the area circled in red in my diagram and never reaches LoanerCards 10817 and 10558.
How can I design my query so it includes these?
Within a single MATCH clause, Cypher never traverses the same relationship twice (relationship uniqueness), so the paths that double back through your original copy are filtered out. To traverse the same relationships twice, you need to split your MATCH clause in two.
Try this:
MATCH (:LoanerCard {Barcode:"10007"})-[:LOANED]->()-[:COPY_OF]-(masterMaterial)
MATCH (masterMaterial)<-[:COPY_OF]-()<-[:LOANED]-()-[:LOANED]->()-[:COPY_OF]-(materials)
RETURN materials.Title AS Recommended, count(*) AS Strength
ORDER BY Strength DESC
I have a large number of nodes representing accounts, which we could label as say (a :Account). Each (:Account) can have potentially tens of thousands of (t :Transaction) nodes connected to it, each representing the data for a transaction that occurred involving that account.
The (:Transaction) nodes have a timestamp property. Given a query date, what would be the most efficient way to get the latest (:Transaction) node for each (a :Account) that occurs on or before that date? This could be one way to do it:
// run for all account nodes
MATCH (a:Account)
OPTIONAL MATCH (a)-->(t:Transaction)
WHERE t.timestamp <= date("2014-03-07")
WITH a, max(t.timestamp) AS latest
OPTIONAL MATCH (a)-->(t:Transaction {timestamp: latest})
RETURN a, t
However, I'm not sure this method is very efficient when the number of (t) nodes connected to each (a) becomes very large. Is there a way to write the query, or to index the database, such that the query time scales linearly with the number of accounts, regardless of the number of transactions connected to those account nodes?
For disclosure: I posted a version of this question on the Neo4j community forum, but I'm hoping the greater traffic on this site gives it more exposure.
In neo4j 3.5, a new "index-backed order by" optimization was added. This means that if you create a "native" index (see here for the details), the index is stored in sorted order, and an ORDER BY on the indexed property won't actually have to do any sorting.
So, assuming that you have created an index on :Transaction(timestamp), like so:
CREATE INDEX ON :Transaction(timestamp);
then, in neo4j 3.5+, this query (with an optional hint to use that index) should avoid any sorting when finding the Transaction with the maximum timestamp for each Account:
MATCH (a:Account)-->(t:Transaction)
USING INDEX t:Transaction(timestamp)
WHERE t.timestamp <= date("2014-03-07")
WITH a, t
ORDER BY t.timestamp DESC
RETURN a, COLLECT(t)[0] AS transaction
This query should do the following:
Use the index to get all Transaction nodes with an appropriate timestamp (in descending order, without sorting).
Get the Account nodes related to each Transaction.
For each distinct Account node, create a list of all the related Transaction nodes (in descending timestamp order, without sorting), and take the first one from the list.
Return each distinct Account node and its most recent qualifying Transaction node.
This query should scale linearly with the number of qualifying Transactions. If your use case permits it, you could get faster results by shrinking that set further with a lower bound in your WHERE clause as well.
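For example, with a hypothetical 90-day lower bound (the exact cutoff date is an assumption about your data, not something the optimization requires):
MATCH (a:Account)-->(t:Transaction)
USING INDEX t:Transaction(timestamp)
// hypothetical lower bound: ignore anything older than ~90 days before the query date
WHERE date("2013-12-07") <= t.timestamp <= date("2014-03-07")
WITH a, t
ORDER BY t.timestamp DESC
RETURN a, COLLECT(t)[0] AS transaction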
As a bit of a learning project, I am working to replace a somewhat slow program in Perl with a Chapel implementation. I've got the algorithms down, but I'm struggling with the best way to reference the data in Chapel. I can do a direct translation, but it seems likely that I'm missing a better way.
Details of existing program:
I have a graph with ~32000 nodes and ~2.1M edges. State is saved in
data files, but it's run as a daemon that keeps data in memory.
Each node has a numeric ID (assigned by another system) and a variety of
other attributes defined by string, integer, and boolean values.
The edges are directional and have a couple of boolean values
associated with them.
I have an external system that interacts with this daemon that I cannot change. It makes requests, such as "Add node (int) with these attributes", "find shortest path from node (int) to node (int)", or "add edges from node (int) to node(s) (int, int, int)"
In Perl, the program uses hashes with common integer IDs for node and edge attributes. I can certainly replicate this in Chapel with associative arrays.
Is there a better way to bundle this all together? I've been trying to wrap my head around ways to have opaque node and edge types with each item's attributes defined, but I'm struggling with how to reference them by integer ID in an easy fashion.
If somebody can provide an ideal way to do the following, it would get me the push I need.
Create two nodes with xx attributes identified by integer ID.
Create an edge between the two with xx attributes.
Respond to request "show me the xx attribute of node (int)"
Cheers, and thanks.
As you might expect, there are a number of ways to approach this in Chapel, though given your historical approach and your external system's interface, associative domains and arrays are definitely an appropriate way to go. Specifically, your desire to refer to nodes by integer IDs makes associative domains/arrays a natural match.
For Chapel newbies: associative domains are essentially sets of arbitrary values, like the set of integer node IDs in this case. Associative arrays are mappings from the indices of an associative domain to elements (variables) of a given type. Essentially, the domain represents the keys and the array the values in a key-value store or hash table.
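For example, here is a minimal sketch of the idea in isolation (the names keys and vals are purely illustrative):
var keys: domain(int);      // an associative domain: a set of integer keys
var vals: [keys] string;    // an associative array mapping those keys to strings

keys += 42;                 // add a key to the set...
vals[42] = "hello";         // ...and associate a value with it
writeln(vals[42]);          // prints: hello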
To represent the nodes and edges themselves, I'm going to take the approach of using Chapel records. Here's my record for a node:
record node {
  var id: int;
  var str: string,
      i: int,
      flag: bool;
  var edges: [1..0] edge;
}
As you can see, it stores its id as an integer, arbitrary attribute fields of various types (a string str, an integer i, and a boolean flag — you can probably come up with better names for your program), and an array of edges which I'll return to in a second. Note that it may or may not be necessary for each node to store its ID... perhaps in any context where you'd have the node, you would already know its ID, in which case storing it could be redundant. Here I stored it just to show you could, not because you must.
Returning to the edges: In your question, it sounded as though edges might have their own integer IDs and get stored in the same pool as the nodes, but here I've taken a different approach: In my experience, given a node, I typically want the set of edges leading out of it, so I have each node store an array of its outgoing edges. Here, I'm using a dense 1D array of edges which is initially empty (1..0 is an empty range in Chapel since 1 > 0). You could also use an associative array of edges if you wanted to give them each a unique ID. Or you could remove the edges from the node data structure altogether and store them globally. Feel free to ask follow-up questions if you'd prefer a different approach.
Here's my record for representing an edge:
record edge {
  var from, to: int,
      flag1, flag2: bool;
}
The first two fields (from and to) indicate the nodes that the edge connects. As with the node ID above, it may be that the from field is redundant / unnecessary, but I've included it here for completeness. The two flag fields are intended to represent the data attributes you'd associate with an edge.
Next, I'll create my associative domain and array to represent the set of node IDs and the nodes themselves:
var NodeIDs: domain(int),
    Nodes: [NodeIDs] node;
Here, NodeIDs is an associative domain (set) of integer IDs representing the nodes. Nodes is an associative array that maps from those integers to values of type node (the record we defined above).
Now, turning to your three operations:
Create two nodes with xx attributes identified by integer ID.
The following declaration creates a node variable named n1 with some arbitrary attributes using the default record constructor/initializer that Chapel provides for records that don't define their own:
var n1 = new node(id=1, "node 1", 42, flag=true);
I can then insert it into the array of nodes as follows:
Nodes[n1.id] = n1;
This assignment effectively adds n1.id to the NodeIDs domain and copies n1 into the corresponding array element in Nodes. Here's an assignment that creates a second anonymous node and adds it to the set:
Nodes[2] = new node(id=2, "node 2", i=133);
Note that in the code above, I've assumed you want to choose the IDs for each node explicitly (e.g., perhaps your data file establishes the node IDs?). Another approach (sketched below) is to determine them automatically as the nodes are created, using a global counter (an atomic one if you're creating nodes in parallel).
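Here's a minimal sketch of that counter-based approach (the helper name createNode is my invention, and you'd adapt the argument list to your attributes):
// a global atomic counter so nodes can be created safely from parallel tasks
var nextID: atomic int;

proc createNode(str: string, i: int, flag: bool): int {
  const id = nextID.fetchAdd(1) + 1;        // IDs starting at 1
  NodeIDs += id;                            // associative domains are parSafe by default
  Nodes[id] = new node(id, str, i, flag);
  return id;
}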
Having populated our Nodes, we can then iterate over them serially or in parallel (here I'm doing it in parallel; replacing forall with for will make them serial):
writeln("Printing all node IDs (in an arbitrary order):");
forall nid in NodeIDs do
writeln("I have a node with ID ", nid);
writeln("Printing all nodes (in an arbitrary order):");
forall n in Nodes do
writeln(n);
The order in which these loops print the IDs and nodes is arbitrary for two reasons: (1) they're parallel loops; (2) associative domains and arrays store their elements in an arbitrary order.
Create an edge between the two with xx attributes.
Since I associated the edges with nodes, I took the approach of creating a method on the node type that will add an edge to it:
proc node.addEdge(to: int, flag1: bool, flag2: bool) {
  edges.push_back(new edge(id, to, flag1, flag2));
}
This procedure takes the destination node ID and the edge attributes as its arguments, creates an edge from that information (supplying the originating node's ID as the from field), and uses the push_back() method on rectangular arrays to append it to the node's list of edges.
I then call this routine three times to create some edges for node 2 (including redundant and self-edges since I only have two nodes so far):
Nodes[2].addEdge(n1.id, true, false);
Nodes[2].addEdge(n1.id, false, true);
Nodes[2].addEdge(2, false, false);
And at this point, I can loop over all of the edges for a given node as follows:
writeln("Printing all edges for node 2: (in an arbitrary order):");
forall e in Nodes[2].edges do
writeln(e);
Here, the arbitrary printing order is only due to the use of the parallel loop. If I'd used a serial for loop, I'd traverse the edges in the order they were added due to the use of a 1D array to represent them.
Respond to request "show me the xx attribute of node (int)"
You've probably got this by now, but I can get at arbitrary attributes of a node simply by indexing into the Nodes array. For example, the expression:
...Nodes[2].str...
would give me the string attribute of node 2. Here's a little helper routine I wrote to get at (and print) various attributes:
proc showAttributes(id: int) {
  if (!NodeIDs.member(id)) {
    writeln("No such node ID: ", id);
    return;
  }
  writeln("Printing the complete attributes for node ", id);
  writeln(Nodes[id]);
  writeln("Printing its string field only:");
  writeln(Nodes[id].str);
}
And here are some calls to it:
showAttributes(n1.id);
showAttributes(2);
showAttributes(3);
I am working to replace a somewhat slow program in perl with a Chapel implementation
Given that speed is one of your reasons for looking at Chapel, once your program is correct, re-compile it with the --fast flag to get it running quickly.
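For example (the source file name here is just a placeholder):
chpl --fast mygraph.chpl -o mygraph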
I have two questions:
How do I index this query?
g.V(vertexId).repeat(out().hasLabel('location')).emit().tree().next()
In the Titan 1.0 documentation, indexes are only shown being created after the data has already been inserted.
However, in the generate-modern.groovy file here, we see that indexing is done before the creation of vertices, which seems reasonable. But when I try to use buildMixedIndex, it throws an
illegal argument exception: Unknown external index backend: search
My approach was
def location = mgmt.makeVertexLabel("location").make()
def displayName = mgmt.makePropertyKey("displayName").dataType(String.class).cardinality(Cardinality.SINGLE).make()
def shortName = mgmt.makePropertyKey("shortName").dataType(String.class).cardinality(Cardinality.SINGLE).make()
def description = mgmt.makePropertyKey("description").dataType(String.class).cardinality(Cardinality.SINGLE).make()
def latitude = mgmt.makePropertyKey("latitude").dataType(String.class).cardinality(Cardinality.SINGLE).make()
def longitude = mgmt.makePropertyKey("longitude").dataType(String.class).cardinality(Cardinality.SINGLE).make()
def locationByName = mgmt.buildIndex("displayNameAndShortNameAndDescriptionAndLatitudeAndLongitude", Vertex.class).
        addKey(displayName).addKey(shortName).addKey(description).
        addKey(latitude).addKey(longitude).indexOnly(location).buildMixedIndex("search")
Where am I going wrong?
If that query is taking a long time, the problem is likely that it is visiting too many elements or it is stuck in an infinite loop. The existing JanusGraph/Titan indexes won't help for that. You already have a direct vertex lookup by id, g.V(vertexId), and the rest of the query is traversing the neighborhood from that vertex. I'd suggest using edge labels, i.e. out('friends'), to limit the number of edges you visit. You could also use simplePath() to eliminate cyclic paths. You could also use times() or until() to keep a limit on the number of times you loop with the repeat() step.
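For example, here's a sketch of your traversal with those guards added (the depth limit of 5 is an arbitrary assumption; tune it to your data):
g.V(vertexId).
  repeat(out().hasLabel('location').simplePath()).
  times(5).emit().
  tree().next()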
The configuration example you referenced only used composite indexes, which do not require an indexing backend.
Mixed indexes require configuring an indexing backend, either Elasticsearch, Lucene, or Solr. Pick one of these, then make sure you pass the correct configuration properties when you initialize your graph. You can find several examples in the distribution zip file in the conf directory. For example, in the janusgraph-cassandra-es.properties, you'll find:
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1
index.search.elasticsearch.client-only=true
where the X in index.X.backend (here, search) is the index configuration name that you must pass to buildMixedIndex(X).
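For instance, if your properties file also defined a second backend under the hypothetical name geo:
index.geo.backend=elasticsearch
index.geo.hostname=127.0.0.1
then indexes against that backend would be created with buildMixedIndex("geo") instead.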
Both composite and mixed indexes are only available for the first-level Gremlin query (the initial graph lookup), not for the second level. A vertex-centric index is required for the second-level (traversal) part of the query.
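For example, here's a sketch of building a vertex-centric index in Titan 1.0 (the edge label connects and property key time are assumptions, not part of your schema):
mgmt = graph.openManagement()
time = mgmt.makePropertyKey('time').dataType(Integer.class).make()
connects = mgmt.makeEdgeLabel('connects').make()
// index the 'connects' edges incident to each vertex, sorted by time descending
mgmt.buildEdgeIndex(connects, 'connectsByTime', Direction.BOTH, Order.decr, time)
mgmt.commit()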
Let's suppose this Cypher query (Neo4j):
MATCH (m:Meeting)
WHERE m.startDate > 1405591031731
RETURN m.name
With millions of Meeting nodes in the graph, which strategy should I choose to make this kind of query fast?
Indexing the Meeting's startDate property?
Indexing it but with a LuceneTimeline?
Avoiding indexes and preferring a structure like this one?
However, such a structure seems relevant for querying a range of dates (FROM => TO), not just a FROM.
I don't have use cases where I would query a range: FROM this startDate TO this endDate.
By the way, it seems that simple indexes only deal with equality (not comparisons like >).
Any advice?
Take a look at this answer: How to filter edges by time stamp in neo4j?
When selecting nodes using relational operators, it is best to select on an intermediate node that groups your meeting nodes into discrete intervals of time. When adding meetings to the database, you determine which interval each timestamp falls within and get or create the intermediate node representing that interval.
You could run the following query from the Neo4j shell on your millions of meeting nodes to group meetings into 10-second intervals, assuming your startDate timestamps are in milliseconds:
MATCH (meeting:Meeting)
MERGE (interval:Interval { timestamp: toInt(meeting.startDate / 10000) })
MERGE (meeting)-[:ON]->(interval);
Then for your queries you could do:
MATCH (interval:Interval)
WHERE interval.timestamp > toInt(1405591031731 / 10000)
WITH interval
MATCH (interval)<-[:ON]-(meeting:Meeting)
RETURN meeting
Note that the bound is divided by 10000 as well, since interval.timestamp stores the millisecond timestamp divided by 10000.
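And if you ever did need a FROM => TO range, the same structure supports it. Here's a sketch with a hypothetical upper bound one day after the lower one:
MATCH (interval:Interval)
WHERE interval.timestamp >= toInt(1405591031731 / 10000)
  AND interval.timestamp < toInt(1405677431731 / 10000)
WITH interval
MATCH (interval)<-[:ON]-(meeting:Meeting)
RETURN meeting.name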