OrientDB Query Result set of vertices with an empty collection of edges, vertices - orientdb-2.1

This might be a simple question, but I am confused. Please help.
I am using OrientDB 2.1.9 and I am experimenting with VehicleHistoryGraph database.
In Studio's Browse mode, I set the limit to 9 records and entered this simple query:
select out() from Person
The result set contains 9 records, BUT only two of these persons have Bought a vehicle. The rest are displayed with empty collections []. This confuses me: I would expect to get back only those two vertices, with their collections of edges.
How do I get back only the two persons that bought something?
I also noticed that there is an unwind operator in select. Is it useful in this case? Can you give an example?

Your query asks for out(), so out() is computed in all cases, and you're shown the results. If you only want the rows for which out().size() > 0 then you can construct a query like this:
select out() from v let n=out().size() where $n > 0
If you think that one ought to be able to write this more succinctly, e.g. like so:
select out() as n from v where n > 0
then join the club (e.g. by supporting this enhancement request).
(select out() from v where out().size() > 0 is supported.)
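To make the difference between the unfiltered and the filtered query concrete, here is a minimal Python sketch. It is not OrientDB code: the `people` dictionary and the function names are hypothetical stand-ins that only model the behaviour described above, with one result row per vertex unless a condition on the size of the collection removes it. The `flatten` helper shows, in the spirit of unwinding, how the remaining collections could be turned into one row per element.

```python
# Hypothetical stand-in for Person vertices and the vertices they point to.
people = {
    "alice": ["car-1"],
    "bob": [],
    "carol": ["car-2", "car-3"],
    "dave": [],
}

def select_out(rows):
    """Model of `select out() from Person`: one row per vertex, even if empty."""
    return [list(out) for out in rows.values()]

def select_out_non_empty(rows):
    """Model of adding `where out().size() > 0`: drop vertices with no outgoing edges."""
    return [list(out) for out in rows.values() if len(out) > 0]

def flatten(rows):
    """Turn each collection into one row per element (similar in spirit to unwinding)."""
    return [item for row in rows for item in row]

all_rows = select_out(people)
kept = select_out_non_empty(people)
flat = flatten(kept)

print(all_rows)  # includes the empty collections
print(kept)      # only vertices that have outgoing edges
print(flat)      # one entry per connected vertex
```

The point of the sketch is that the empty collections are not an error: the projection is evaluated for every row, and only a condition on the collection size removes rows.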

Related

NebulaGraph Database: How to get all the vertices of each tag?

I want to get all the vertices of each tag in the Nebula Graph Database.
I tried using fetch prop on player * yield properties(vertex) to get the results, but this was not possible.
(root@nebula) [basketballplayer]> fetch prop on player * yield properties(vertex)
[ERROR (-1004)]: SyntaxError: syntax error near `* yield '
I also tried the Neo4j-style statement match (v:player) return v, but it didn't work either.
(root@nebula) [basketballplayer]> match (v:player) return v
[ERROR (-1005)]: Scan vertices or edges need to specify a limit number, or limit number can not push down.
Who can teach me how to use the Nebula Graph database correctly?
By design, a scan over all data of a tag or edge type (like a table scan in a tabular DBMS) is prohibited by default.
Data in NebulaGraph is stored in a linked, graph-oriented way (think of a graph traversal, which starts from known nodes and then expands multiple hops along the edges/relationships). Enabling a non-graph scan of data in a distributed graph database like NebulaGraph is therefore costly.
To enable such queries, either an index must be explicitly created beforehand, or a LIMIT clause [1] must be used (which also avoids a full scan).
[1]: example of query(need index for starting node) with LIMIT clause
MATCH (v:player) RETURN v LIMIT 100
Note: the index is only related to the starting node seeking of the query pattern.

Why can't I MATCH (v:<tag>)-[e:<edge>]-(v2:<tag>) RETURN v LIMIT 10 in the NebulaGraph database

The Nebula Graph docs say that "When traversing all vertices of the specified Tag or edge of the specified Edge Type, such as MATCH (v:player) RETURN v LIMIT N, there is no need to create an index, but you need to use LIMIT to limit the number of output results." But when I ran the statement in the preceding screenshot, I was told that I had not specified a limit number, even though I had.
What is the correct way to RETURN v without creating indexes?
I ran into the same issue before. When you specify both a tag and an edge type in a query at the same time, you need to create an index for the tag or the edge type first.
Create an index for the tag company first and then try to execute it again.

Cypher: Getting n neighboring Nodes for each Node of certain type

I'm using Neo4j Server 4.2.5.
The Pattern on which I want to run my query looks as follows:
(Artist)-[similar_to {score: <float>}]->(Artist)
Now what I want to do is get the 5 [similar_to] relations with the highest scores for each artist.
I've tried using Neo4j's collect() function to collect all the artists into a list and then using UNWIND to iterate over that. Sadly the LIMIT clause seems to limit the total number of returned records and not the returned records per iteration.
Any help would be appreciated.
Thanks in advance
To get the 5 rels with the highest scores, this should do it.
MATCH (n:Artist)-[r:similar_to]->(:Artist)
WITH n,r
ORDER BY r.score DESC
RETURN n, COLLECT(r)[..5] AS relsWithHighestScores
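The idea behind this query (sort all relationships by score first, then collect per artist and keep a slice) can be sketched in plain Python. This is only an illustration of the technique, not Neo4j code; the `rels` triples and the function name `top_n_per_artist` are made up for the example.

```python
from collections import defaultdict

# Hypothetical (artist, similar_artist, score) triples standing in for
# (Artist)-[similar_to {score}]->(Artist) relationships.
rels = [
    ("a", "b", 0.9),
    ("a", "c", 0.4),
    ("a", "d", 0.7),
    ("b", "a", 0.9),
    ("b", "c", 0.1),
]

def top_n_per_artist(rels, n=5):
    """Return the n highest-scoring relationships for each source artist."""
    # Corresponds to: WITH n, r ORDER BY r.score DESC
    ordered = sorted(rels, key=lambda rel: rel[2], reverse=True)
    # Corresponds to: COLLECT(r), grouped by the artist n
    grouped = defaultdict(list)
    for rel in ordered:
        grouped[rel[0]].append(rel)
    # Corresponds to the slice [..5] applied to each collected list
    return {artist: items[:n] for artist, items in grouped.items()}

top = top_n_per_artist(rels, n=2)
print(top)
```

Because the slice is applied to each collected list rather than to the overall result, the limit works per artist, which is what a plain LIMIT could not do.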

Neo4j Cypher query to find nodes that are not connected too slow

Given we have the following Neo4j schema (simplified but it shows the important point). There are two types of nodes NODE and VERSION. VERSIONs are connected to NODEs via a VERSION_OF relationship. VERSION nodes do have two properties from and until that denote the validity timespan - either or both can be NULL (nonexistent in Neo4j terms) to denote unlimited. NODEs can be connected via a HAS_CHILD relationship. Again these relationships have two properties from and until that denote the validity timespan - either or both can be NULL (nonexistent in Neo4j terms) to denote unlimited.
EDIT: The validity dates on VERSION nodes and HAS_CHILD relations are independent (even though the example coincidentally shows them being aligned).
The example shows two NODEs A and B. A has two VERSIONs AV1 until 6/30/17 and AV2 starting from 7/1/17 while B only has one version BV1 that is unlimited. B is connected to A via a HAS_CHILD relationship until 6/30/17.
The challenge now is to query the graph for all nodes that aren't a child (that are root nodes) at one specific moment in time. Given the example above, the query should return just B if the query date is e.g. 6/1/17, but it should return B and A if the query date is e.g. 8/1/17 (because A isn't a child of B as of 7/1/17 any more).
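As a cross-check of the intended semantics, the rule can be written down as a small Python sketch. This is only a model of the example above, under the assumptions that `None` stands for a missing (unlimited) bound and that a parent only counts if it has a version valid on the query date; the data structures and function names are hypothetical.

```python
from datetime import date

def valid_on(start, end, day):
    """True if day lies within [start, end]; None means unbounded on that side."""
    return (start is None or start <= day) and (end is None or day <= end)

# Hypothetical encoding of the example: validity ranges of each node's versions.
versions = {
    "A": [(None, date(2017, 6, 30)), (date(2017, 7, 1), None)],  # AV1, AV2
    "B": [(None, None)],                                         # BV1
}
# HAS_CHILD relationships as (parent, child, from, until).
has_child = [("B", "A", None, date(2017, 6, 30))]

def root_nodes(day):
    """Nodes with a valid version on `day` that are not the child of a valid parent."""
    roots = []
    for node, node_versions in versions.items():
        if not any(valid_on(s, e, day) for s, e in node_versions):
            continue  # the node itself has no version valid on that day
        is_child = any(
            child == node
            and valid_on(s, e, day)
            and any(valid_on(vs, ve, day) for vs, ve in versions[parent])
            for parent, child, s, e in has_child
        )
        if not is_child:
            roots.append(node)
    return sorted(roots)

print(root_nodes(date(2017, 6, 1)))
print(root_nodes(date(2017, 8, 1)))
```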
The current query looks roughly like this:
MATCH (n1:NODE)
OPTIONAL MATCH (n1)<-[c]-(n2:NODE), (n2)<-[:VERSION_OF]-(nv2:ITEM_VERSION)
WHERE (c.from <= {date} <= c.until)
AND (nv2.from <= {date} <= nv2.until)
WITH n1 WHERE c IS NULL
MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
WHERE nv1.from <= {date} <= nv1.until
RETURN n1, nv1
ORDER BY toLower(nv1.title) ASC
SKIP 0 LIMIT 15
This query works reasonably well in general, but it gets very slow on large datasets (comparable to real production datasets). With 20-30k NODEs (and about twice the number of VERSIONs) the (real) query takes roughly 500-700 ms on a small Docker container running on Mac OS X, which is acceptable. But with 1.5M NODEs (and about twice the number of VERSIONs) the (real) query takes a little more than 1 minute on a bare-metal server (running nothing other than Neo4j). This is not really acceptable.
Do we have any option to tune this query? Are there better ways to handle the versioning of NODEs (which I doubt is the performance problem here) or the validity of relationships? I know that relationship properties cannot be indexed, so there might be a better schema for handling the validity of these relationships.
Any help or even the slightest hint is greatly appreciated.
EDIT after answer from Michael Hunger:
Percentage of root nodes:
With the current example data set (1.5M nodes) the result set contains about 2k rows. That's less than 1%.
ITEM_VERSION node in first MATCH:
We're using the ITEM_VERSION nv2 to filter the result set down to ITEM nodes that have no connection to other ITEM nodes at the given date. That means that either no relationship may exist that is valid for the given date, or the connected item must not have an ITEM_VERSION that is valid for the given date. I'm trying to illustrate this:
// date 6/1/17
// n1 returned because relationship not valid
(nv1 ...)->(n1)-[X_HAS_CHILD ...6/30/17]->(n2)<-(nv2 ...)
// n1 not returned because relationship and connected item n2 valid
(nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...)
// n1 returned because connected item n2 not valid even though relationship is valid
(nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...6/30/17)
No use of relationship-types:
The problem here is that the software features a user-defined schema and ITEM nodes are connected by custom relationship-types. As we can't have multiple types/labels on a relationship, the only common characteristic of these kinds of relationships is that they all start with X_. That's been left out of the simplified example here. Would searching with the predicate type(r) STARTS WITH 'X_' help here?
What Neo4j version are you using?
What percentage of your 1.5M nodes will be found as roots at your example date, and if you don't have the limit how much data comes back? Perhaps the issue is not in the match so much as in the sorting at the end?
I'm not sure why you had the VERSION nodes in your first part, at least you don't describe them as relevant for determining a root node.
You didn't use relationship-types.
MATCH (n1:NODE) // matches 1.5M nodes
// has to do 1.5M * degree optional matches
OPTIONAL MATCH (n1)<-[c:HAS_CHILD]-(n2) WHERE (c.from <= {date} <= c.until)
WITH n1 WHERE c IS NULL
// how many root nodes are left?
// # root nodes * version degree (1..2)
MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
WHERE nv1.from <= {date} <= nv1.until
// has to sort all those
WITH n1, nv1, toLower(nv1.title) as title
RETURN n1, nv1
ORDER BY title ASC
SKIP 0 LIMIT 15
I think a good start for improvement would be to match on nodes using an index so you can quickly get a smaller relevant subset of nodes to search. Your approach right now must inspect all your :NODEs and all their relationships and patterns off of them every single time, which, as you've found, won't scale with your data.
Right now the only nodes in your graph with date/time properties are your :ITEM_VERSION nodes, so let's start with those. You'll need an index on :ITEM_VERSION's from and until properties for fast lookup.
The nulls are going to be problematic for your lookups, as any inequality against a null value returns null, and most workarounds to working with nulls (using COALESCE() or several ANDs/ORs for null cases) seem to prevent usage of index lookups, which is the point of my particular suggestion.
I would encourage you to replace your nulls in from and until with min and max values, which should let you take advantage of finding nodes by index lookup:
MATCH (version:ITEM_VERSION)
WHERE version.from <= {date} <= version.until
MATCH (version)-[:VERSION_OF]->(node:NODE)
...
That should at least provide quick access to a smaller subset of nodes at the start for continuing your query.
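The suggestion to replace nulls with min and max values can be sketched in Python as follows. This is an illustration only, assuming `None` plays the role of a missing from/until property and that `date.min` and `date.max` are acceptable sentinels; the version names and helper functions are hypothetical.

```python
from datetime import date

# Sentinels that replace missing bounds so every version has a concrete range.
MIN_DATE = date.min
MAX_DATE = date.max

def normalise(start, end):
    """Replace missing bounds with sentinel values."""
    return (
        start if start is not None else MIN_DATE,
        end if end is not None else MAX_DATE,
    )

# Hypothetical versions with open-ended validity expressed as None.
raw_versions = {
    "AV1": (None, date(2017, 6, 30)),
    "AV2": (date(2017, 7, 1), None),
    "BV1": (None, None),
}
normalised = {name: normalise(*bounds) for name, bounds in raw_versions.items()}

def valid_names(day):
    """A plain range check suffices once no bound is missing."""
    return sorted(
        name for name, (start, end) in normalised.items() if start <= day <= end
    )

print(valid_names(date(2017, 6, 1)))
print(valid_names(date(2017, 8, 1)))
```

With every bound present, the validity test becomes a simple two-sided comparison with no special cases for missing values, which is the shape of predicate the answer hopes an index lookup can serve.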

Get the number of records from MDX query with Subcubes

I'm developing a system that generates MDX queries from "FilterCriterias" entities, together with related info such as the number of records a query returns, so I need a generic way to get the number of records of an MDX query that uses subcubes. For a normal query I do something like:
WITH
MEMBER [MyCount] AS
Count([Date].[Date].MEMBERS)
SELECT
{[MyCount]} ON 0
FROM [Adventure Works];
But I have problems when I use this approach in slightly more complex queries like this one:
WITH
MEMBER [MyCount] AS
Count([Date].[Date].MEMBERS)
SELECT
{[MyCount]} ON 0
FROM
(
SELECT
{[Measures].[Sales Amount]} ON 0
,{[Date].[Date].&[20050701] : [Date].[Date].&[20051231]} ON 1
FROM
(
SELECT
{[Sales Channel].[Sales Channel].&[Internet]} ON 0
FROM [Adventure Works]
)
);
I guess the logical result would be the number of records of [Date].[Members] left in the subcube, but I get a result without columns or rows. I'm a newbie in the MDX language and I don't understand this behavior. Is there some generic way to get the number of records from a "base" query, just like SELECT COUNT(*) FROM () in plain SQL?
The structure is quite different from a relational SELECT COUNT(*) FROM ().
I believe that the structure of a sub-select will be very similar to that of a sub-cube and reading through this definition from MSDN (https://msdn.microsoft.com/en-us/library/ms144774.aspx) of what a sub-cube contains tells us that it isn't a straight filter like in a relational query:
Admittedly I still find this behaviour rather "enigmatic" (a polite way of saying "I do not understand it")
Is there a workaround?

Resources