Neo4j / Good way to retrieve nodes created from a specific startDate - graph

Let's suppose this Cypher query (Neo4j):
MATCH(m:Meeting)
WHERE m.startDate > 1405591031731
RETURN m.name
In the case of millions of Meeting nodes in the graph, which strategy should I choose to make this kind of query fast?
Indexing the Meeting's startDate property?
Indexing it but with a LuceneTimeline?
Avoiding an index and preferring such a structure?
However, this structure seems relevant for querying by a range of dates (FROM => TO), not just a FROM.
I don't have use cases where I would query a range: FROM this startDate TO this endDate.
By the way, it seems that simple indexes only work when dealing with equality... (not comparisons like >).
Any advice?

Take a look at this answer: How to filter edges by time stamp in neo4j?
When selecting nodes using relational operators, it is best to select on an intermediate node that is used to group your meeting nodes into a discrete interval of time. When adding meetings to the database you would determine which interval each timestamp occurred within and get or create the intermediate node that represents that interval.
You could run the following query from the Neo4j shell on your millions of meeting nodes; it groups meetings together into 10-second intervals (assuming your timestamp is in milliseconds).
MATCH (meeting:Meeting)
MERGE (interval:Interval { timestamp: toInt(meeting.timestamp / 10000) })
MERGE (meeting)-[:ON]->(interval);
Then for your queries you could do:
MATCH (interval:Interval) WHERE interval.timestamp > toInt(1405591031731 / 10000) // compare against the same 10-second buckets
WITH interval
MATCH (interval)<-[:ON]-(meeting:Meeting)
RETURN meeting
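A rough client-side sketch (assuming the official neo4j Python driver and placeholder connection details) that buckets the query date the same way before passing it in as a parameter:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (interval:Interval) WHERE interval.timestamp > $bucket
WITH interval
MATCH (interval)<-[:ON]-(meeting:Meeting)
RETURN meeting.name AS name
"""

def meetings_since(start_millis):
    # Same 10-second bucketing that was used when the Interval nodes were created.
    bucket = start_millis // 10000
    with driver.session() as session:
        return [record["name"] for record in session.run(QUERY, bucket=bucket)]

print(meetings_since(1405591031731))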

Related

What are the factors that can make the item count differ from the item id count in couchbase?

On a 3-node Couchbase Community Edition 5.0.1 build 5003 cluster, Couchbase indicates that it contains 12268503 items. However, when counting the ids, the result is 6132875.
What are the factors that can make the item count differ from the item id count in couchbase?
More precisely, when the following N1QL query is executed on a bucket, say Product:
SELECT count(1) FROM Product
It gives
12268503
While when the count is made on the item ids
SELECT count(META(Product).id) FROM Product
It returns:
6132875
That is, the number of ids is less than 50% of the number of items.
Also, there were no operations (0 ops/s) on the bucket for several hours, which rules out the possibility of the primary index not having caught up due to a traffic peak.
I pored through the Couchbase blog & docs without finding any clues as to this count difference. Any pointer is much appreciated.
If the query has no predicate and no join, and the projection is a single expression such as count(*) or count(constant), the query gets its result from the bucket stats and returns it directly (in sub-milliseconds).
SELECT count(*) FROM Product;
SELECT count(1) FROM Product;
The following is almost the same, but the COUNT argument is an expression, so the query has to use an index and do the aggregation. (Since the document key is unique and must be a string, the optimizer could in principle treat this like the previous case, but as of now there is no such optimization.)
SELECT count(META(Product).id) FROM Product
In the second case an index is used, and your index might have pending items and not yet be caught up. Try using scan_consistency, and check the index stats to start with.
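For example, one way to try this is to set scan_consistency to request_plus on the N1QL REST endpoint; the host, port and credentials below are placeholders. request_plus makes the query wait until the index has caught up with all mutations received before the query was issued:

import requests

resp = requests.post(
    "http://localhost:8093/query/service",
    auth=("Administrator", "password"),
    data={
        "statement": "SELECT count(META(Product).id) FROM Product;",
        # Wait for the index to catch up before answering.
        "scan_consistency": "request_plus",
    },
)
print(resp.json()["results"])

If the two counts converge under request_plus, the index simply hadn't caught up; if they still differ, the cause lies elsewhere.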

Efficient way to get most recent of many transaction nodes connected to a single account node by date

I have a large number of nodes representing accounts, which we could label as say (a :Account). Each (:Account) can have potentially tens of thousands of (t :Transaction) nodes connected to it, each representing the data for a transaction that occurred involving that account.
The (:Transaction) nodes have a date property. Given a date to query on what would be the most efficient way to get the latest (:Transaction) node for each (a :Account) that occurs before or on the query date? This could be one way to do it:
// run for all address nodes
match (a :Address)
optional match (a)-->(t :Transaction)
where t.timestamp <= date("2014-03-07")
with a, max(t.timestamp) as latest
optional match (a)-->(t :Transaction)
where t.timestamp = latest
return a, t
However I'm not sure if this method is very efficient when the number of (t) connected to each (a) becomes very large. Is there a way to write the query or to index the database such that the query time scales linearly with the number of accounts, no matter the number of transactions connected to those account nodes?
For disclosure I posted a version of this question on the neo4j community forum, but I'm hoping the greater traffic on this site gives this question more exposure.
In neo4j 3.5, a new "index-backed order by" optimization was added. This means that if you create a "native" index (see here for the details), then the index will be stored in sorted order, and the ORDER BY clause on a property on which the index is used won't actually have to do any sorting.
So, assuming that you have created an index on :Transaction(timestamp), like so:
CREATE INDEX ON :Transaction(timestamp);
then, in neo4j 3.5+, this query (with an optional hint to use that index) should avoid any sorting when finding the Transaction with the maximum timestamp for each Address:
MATCH (a:Address)-->(t:Transaction)
USING INDEX t:Transaction(timestamp)
WHERE t.timestamp <= date("2014-03-07")
WITH a, t
ORDER BY t.timestamp DESC
RETURN a, COLLECT(t)[0] AS transaction
This query should do the following:
Use the index to get all Transaction nodes with an appropriate timestamp (in descending order, without sorting).
Get the Address nodes related to each Transaction.
For each distinct Address node, create a list of all the related Transaction nodes (in descending timestamp order, without sorting), and get the first one from the list.
Return each distinct Address node and its most recent appropriate Transaction node.
This query will scale linearly with the number of appropriate Transactions. If your use case permits it, you could get faster results by reducing the number of appropriate Transactions by also putting a lower bound in your WHERE clause.
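If it helps, here is a minimal sketch of that bounded variant through the neo4j Python driver; the connection details and the lower parameter (how far back you are willing to look) are assumptions, not part of the original question:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (a:Address)-->(t:Transaction)
USING INDEX t:Transaction(timestamp)
WHERE $lower <= t.timestamp <= $queryDate
WITH a, t
ORDER BY t.timestamp DESC
RETURN a, COLLECT(t)[0] AS transaction
"""

def latest_transactions(query_date, lower_bound):
    # lower_bound narrows the index range scan, e.g. the query date minus the
    # longest gap you expect between transactions on an account.
    # Pass datetime.date values if t.timestamp is stored as a Cypher date.
    with driver.session() as session:
        return list(session.run(QUERY, queryDate=query_date, lower=lower_bound))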

Proper handling of date operations in Gremlin

I am using AWS Neptune Gremlin with gremlin_python.
My date in property is stored as datetime as required in Neptune specs.
I created it using Python code like this:
properties_dict['my_date'] = datetime.fromtimestamp(my_date, timezone.utc)
and then constructed the Vertex with properties:
for prop in properties_dict:
    query += """.property("%s", "%s")""" % (prop, properties_dict[prop])
Later when interacting with the constructed graph, I am only able to find the vertices by an exact string matching query like the following:
g.V().hasLabel('Object').has("my_date", "2017-12-01 00:00:00+00:00").valueMap(True).limit(3).toList()
What's the best way for dealing with date or datetime in Gremlin?
How can I do range queries such as "give me all Vertices that have date in year 2017"?
Personally, I prefer to store date/time values as days/seconds/milliseconds since epoch. This will definitely work on any Graph DB and makes range queries much simpler. Also, the conversion to days or seconds since epoch and back should be a simple method call in pretty much any language.
So, when you create your properties dictionary, you could simplify your code by changing it to:
properties_dict['my_date'] = my_date
... as my_date should represent the number of seconds since epoch. And a range query would be as simple as:
g.V().has("Object", "my_date", P.between(startTimestamp, endTimestamp)).
limit(3).valueMap(True)
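For example, the "all vertices in year 2017" query then only needs the epoch seconds for the start of 2017 and 2018. A sketch, assuming g is your already-constructed traversal source and my_date holds seconds since epoch:

from datetime import datetime, timezone
from gremlin_python.process.traversal import P

# Inclusive start of 2017, exclusive start of 2018
# (P.between includes the lower bound and excludes the upper bound).
start_2017 = int(datetime(2017, 1, 1, tzinfo=timezone.utc).timestamp())
start_2018 = int(datetime(2018, 1, 1, tzinfo=timezone.utc).timestamp())

vertices = (g.V()
             .has("Object", "my_date", P.between(start_2017, start_2018))
             .valueMap(True)
             .toList())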

Neo4j Cypher query to find nodes that are not connected too slow

Given we have the following Neo4j schema (simplified but it shows the important point). There are two types of nodes NODE and VERSION. VERSIONs are connected to NODEs via a VERSION_OF relationship. VERSION nodes do have two properties from and until that denote the validity timespan - either or both can be NULL (nonexistent in Neo4j terms) to denote unlimited. NODEs can be connected via a HAS_CHILD relationship. Again these relationships have two properties from and until that denote the validity timespan - either or both can be NULL (nonexistent in Neo4j terms) to denote unlimited.
EDIT: The validity dates on VERSION nodes and HAS_CHILD relations are independent (even though the example coincidentally shows them being aligned).
The example shows two NODEs A and B. A has two VERSIONs AV1 until 6/30/17 and AV2 starting from 7/1/17 while B only has one version BV1 that is unlimited. B is connected to A via a HAS_CHILD relationship until 6/30/17.
The challenge now is to query the graph for all nodes that aren't a child (that are root nodes) at one specific moment in time. Given the example above, the query should return just B if the query date is e.g. 6/1/17, but it should return B and A if the query date is e.g. 8/1/17 (because A isn't a child of B as of 7/1/17 any more).
The current query today is roughly similar to this one:
MATCH (n1:NODE)
OPTIONAL MATCH (n1)<-[c]-(n2:NODE), (n2)<-[:VERSION_OF]-(nv2:ITEM_VERSION)
WHERE (c.from <= {date} <= c.until)
AND (nv2.from <= {date} <= nv2.until)
WITH n1 WHERE c IS NULL
MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
WHERE nv1.from <= {date} <= nv1.until
RETURN n1, nv1
ORDER BY toLower(nv1.title) ASC
SKIP 0 LIMIT 15
This query works relatively fine in general but it starts getting slow as hell when used on large datasets (comparable to real production datasets). With 20-30k NODEs (and about twice the number of VERSIONs) the (real) query takes roughly 500-700 ms on a small Docker container running on Mac OS X, which is acceptable. But with 1.5M NODEs (and about twice the number of VERSIONs) the (real) query takes a little more than 1 minute on a bare-metal server (running nothing else than Neo4j). This is not really acceptable.
Do we have any option to tune this query? Are there better ways to handle the versioning of NODEs (which I doubt is the performance problem here) or the validity of relationships? I know that relationship properties cannot be indexed, so there might be a better schema for handling the validity of these relationships.
Any help or even the slightest hint is greatly appreciated.
EDIT after answer from Michael Hunger:
Percentage of root nodes:
With the current example data set (1.5M nodes) the result set contains about 2k rows. That's less than 1%.
ITEM_VERSION node in first MATCH:
We're using the ITEM_VERSION nv2 to filter the result set to ITEM nodes that have no connection to other ITEM nodes at the given date. That means that either no relationship must exist that is valid for the given date, or the connected item must not have an ITEM_VERSION that is valid for the given date. I'm trying to illustrate this:
// date 6/1/17
// n1 returned because relationship not valid
(nv1 ...)->(n1)-[X_HAS_CHILD ...6/30/17]->(n2)<-(nv2 ...)
// n1 not returned because relationship and connected item n2 valid
(nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...)
// n1 returned because connected item n2 not valid even though relationship is valid
(nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...6/30/17)
No use of relationship-types:
The problem here is that the software features a user-defined schema and ITEM nodes are connected by custom relationship-types. As we can't have multiple types/labels on a relationship, the only common characteristic of these relationships is that they all start with X_. That's been left out of the simplified example here. Would searching with the predicate type(r) STARTS WITH 'X_' help here?
What Neo4j version are you using?
What percentage of your 1.5M nodes will be found as roots at your example date, and if you don't have the limit, how much data comes back? Perhaps the issue is not in the match so much as in the sorting at the end?
I'm not sure why you had the VERSION nodes in your first part; at least you don't describe them as relevant for determining a root node.
You didn't use relationship-types.
MATCH (n1:NODE) // matches 1.5M nodes
// has to do 1.5M * degree optional matches
OPTIONAL MATCH (n1)<-[c:HAS_CHILD]-(n2) WHERE (c.from <= {date} <= c.until)
WITH n1 WHERE c IS NULL
// how many root nodes are left?
// # root nodes * version degree (1..2)
MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
WHERE nv1.from <= {date} <= nv1.until
// has to sort all those
WITH n1, nv1, toLower(nv1.title) as title
RETURN n1, nv1
ORDER BY title ASC
SKIP 0 LIMIT 15
I think a good start for improvement would be to match on nodes using an index so you can quickly get a smaller relevant subset of nodes to search. Your approach right now must inspect all your :NODEs and all their relationships and patterns off of them every single time, which, as you've found, won't scale with your data.
Right now the only nodes in your graph with date/time properties are your :ITEM_VERSION nodes, so let's start with those. You'll need an index on :ITEM_VERSION's from and until properties for fast lookup.
The nulls are going to be problematic for your lookups, as any inequality against a null value returns null, and most workarounds for dealing with nulls (using COALESCE() or several ANDs/ORs for null cases) seem to prevent usage of index lookups, which is the point of my particular suggestion.
I would encourage you to replace your nulls in from and until with min and max values, which should let you take advantage of finding nodes by index lookup:
MATCH (version:ITEM_VERSION)
WHERE version.from <= {date} <= version.until
MATCH (version)<-[:VERSION_OF]-(node:NODE)
...
That should at least provide quick access to a smaller subset of nodes at the start for continuing your query.
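A rough sketch of that one-time migration, run through the neo4j Python driver; the connection details and the 0 / 99999999999999 sentinel values are placeholder assumptions, so pick bounds that match your date encoding:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical sentinels standing in for "unlimited"; they must never collide
# with real dates in your data.
MIN_DATE = 0
MAX_DATE = 99999999999999

with driver.session() as session:
    # Replace missing from/until values so that range predicates (and index
    # lookups) always see a comparable value. Batch this for very large graphs.
    session.run("""
        MATCH (v:ITEM_VERSION)
        SET v.from  = coalesce(v.from,  $minDate),
            v.until = coalesce(v.until, $maxDate)
        """, minDate=MIN_DATE, maxDate=MAX_DATE)
    # Indexes for fast range lookup on the version validity bounds.
    session.run("CREATE INDEX ON :ITEM_VERSION(`from`)")
    session.run("CREATE INDEX ON :ITEM_VERSION(`until`)")

Once the sentinels are in place, the from/until range predicates can be answered from the indexes instead of a label scan.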

What's the best way to query DynamoDB based on a date range?

As part of migrating from SQL to DynamoDB I am trying to create a DynamoDB table. The UI allows users to search based on 4 attributes: start date, end date, name of event, and source of event.
The table has 6 attributes; the above four are a subset of them, the other two being priority and location. The query as described above makes it mandatory to search based on those four values. What's the best way to store the information in DynamoDB so that querying based on start date and end date is fairly easy?
I thought of creating a GSI with the hash key as start date, the range key as end date, and a GSI on the remaining two attributes?
In short:
My table in DynamoDB will have 6 attributes
EventName, Location, StartDate, EndDate, Priority and source.
Query will have 4 mandatory attributes
StartDate, EndDate, Source and Event Name.
Thanks for the help.
You can use greater than/less than comparison operators as part of your query: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html
So you could try to build a table with schema:
(EventName (hashKey), "StartDate-EndDate" (sortKey), other attributes)
In this case the sort key is basically a combination of start and end date, allowing you to use >= (on the first part) and <= (on the second part)... DynamoDB uses ASCII-based alphabetical ordering... so let's assume your sortKey looks like the following: "73644-75223"; you could then use >= "73000-" AND <= "76000-" to find the events whose start date falls within that range.
Additionally, you could create a GSI on your table for each of your remaining attributes that need to be read via query. You could then project the data that you want to fetch with the query into your index. In contrast to LSI, queries from GSI do not fetch attributes that are not projected. Be aware of the additional costs (read/write) involved in using GSI (and LSI)... and the additional memory required by data projections...
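For illustration, here is a minimal boto3 sketch of such a query; the table name "Events" and the combined sort key attribute "StartDateEndDate" are placeholders for whatever schema you end up with:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Events")

response = table.query(
    KeyConditionExpression=(
        Key("EventName").eq("my-event")
        # Range condition on the combined "StartDate-EndDate" sort key.
        & Key("StartDateEndDate").between("73000-", "76000-")
    )
)
items = response["Items"]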
Hope it helps.
