Use views and table valued functions as node or edge tables in match clauses


I would like to use table valued functions in MATCH clauses in the same way as is possible with Node tables. Is there a way to achieve this?
The need for table valued functions
There can be various use cases for using table valued functions or views as Node tables. For instance, mine is the following.
I have Node tables that contain NVarChar(max) fields that I would like to search for literal text. I need only equality searching and no full-text searching, so I opted for an index on the hash value of the text field, as suggested by Remus Rusanu in his answer to "SQL server - worth indexing large string keys?" and in https://www.brentozar.com/archive/2013/05/indexing-wide-keys-in-sql-server/. A table valued function handles using the CHECKSUM index; see "Msg 207 Invalid column name $node_id for pseudo column in inline table valued function".
Example data definitions
CREATE TABLE [Tags](
[tag] NVarChar(max),
[tagHash] AS CHECKSUM([Tag]) PERSISTED NOT NULL
) as Node;
CREATE TABLE [Sites](
[endPoint] NVarChar(max),
[endPointHash] AS CHECKSUM([endPoint]) PERSISTED NOT NULL
) as Node;
CREATE TABLE [Links] as Edge;
CREATE INDEX [IX_TagsByName] ON [Tags]([tagHash]);
GO
CREATE FUNCTION [TagsByName](
@tag NVarChar(max))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT
$node_id AS [NodeId],
[tag],
[tagHash]
FROM [dbo].[Tags]
WHERE [tagHash] = CHECKSUM(@tag) AND
[tag] = @tag;
[TagsByName] returns the $node_id with an alias NodeId as suggested by https://stackoverflow.com/a/45565410/814206. However, real Node tables contain two more internal columns which I do not know how to export.
Desired query
I would like to query the database similar to this:
SELECT *
FROM [TagsByName]('important') as t,
[Sites] as s,
[Links] as l
WHERE MATCH ([t]-([l])->[s])
However, this results in the error [1]:
Msg 13901, Level 16, State 2, Line ...
Identifier 't' in a MATCH clause is not a node table or an alias for a node table.
Is there a way to do this?
PS. There are some workarounds, but they do not look as elegant as the MATCH query, especially considering that my actual query involves matching more relations and more string equality tests. I will post these workarounds as answers and hope that someone comes up with a better idea.
[1] This is a very specific difference between views and tables, relevant to "Difference between View and table in sql"; it occurs only in SQL Server 2017 and only when using SQL Graph.

Workaround
Revert to traditional relational joins via JOIN clauses or FROM with <table_or_view_name> and WHERE clauses. In queries that match on more relations, the latter has the advantage that SQL Server 2017 Graph can MATCH on FROM <table_or_view_name> but not on FROM <table_source> JOIN <table_source>.
SELECT *
FROM [TagsByName]('important') as t,
[Sites] as s,
[Links] as l
WHERE t.NodeId = l.$from_id AND
l.$to_id = s.$node_id;

Workaround
Add the Node table twice to the FROM clause, once as a table and once as a table valued function, and join them via the $node_id in the WHERE clause:
SELECT *
FROM [TagsByName]('important') as t1,
[Tags] as t2,
[Sites] as s,
[Links] as l
WHERE MATCH ([t2]-([l])->[s]) AND
t1.[NodeId] = t2.$node_id
Does this affect performance?

Workaround
Do not use the table valued function, but include its expression in the WHERE clause:
SELECT *
FROM [Tags] as t,
[Sites] as s,
[Links] as l
WHERE MATCH ([t]-([l])->[s]) AND
[t].[tagHash] = CHECKSUM('important') AND
[t].[tag] = 'important'
Downside: this is easy to get wrong, for example by forgetting the comparison on the CHECKSUM.

Related

When to create multi-column indices in SQLite?

Assume I have a table in an SQLite database:
CREATE TABLE orders (
id INTEGER PRIMARY KEY,
price INTEGER NOT NULL,
updateTime INTEGER NOT NULL
); -- optionally: ) WITHOUT ROWID;
What indices should I create to optimize the following query:
SELECT * FROM orders WHERE price > ? ORDER BY updateTime DESC;
Do I create two indices:
CREATE INDEX i_1 ON orders(price);
CREATE INDEX i_2 ON orders(updateTime);
or one complex index?
CREATE INDEX i_3 ON orders(price, updateTime);
What would the time complexity of the query be?

From The SQLite Query Optimizer Overview/WHERE Clause Analysis:
If an index is created using a statement like this:
CREATE INDEX idx_ex1 ON ex1(a,b,c,d,e,...,y,z);
Then the index might
be used if the initial columns of the index (columns a, b, and so
forth) appear in WHERE clause terms. The initial columns of the index
must be used with the = or IN or IS operators. The right-most column
that is used can employ inequalities.
As explained also in The SQLite Query Optimizer Overview/The Skip-Scan Optimization with an example:
Because the left-most column of the index does not appear in the WHERE
clause of the query, one is tempted to conclude that the index is not
usable here. However, SQLite is able to use the index.
This means that if you create an index like:
CREATE INDEX idx_orders ON orders(updateTime, price);
it might be used to optimize the WHERE clause even though updateTime does not appear there.
Also, from The SQLite Query Optimizer Overview/ORDER BY Optimizations:
SQLite attempts to use an index to satisfy the ORDER BY clause of a
query when possible. When faced with the choice of using an index to
satisfy WHERE clause constraints or satisfying an ORDER BY clause,
SQLite does the same cost analysis described above and chooses the
index that it believes will result in the fastest answer.
Since updateTime is defined first in the composite index, the index may also be used to optimize the ORDER BY clause.
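Whether SQLite actually picks a given index depends on its cost analysis, so it is worth checking rather than assuming. The following is a minimal sketch, using Python's standard sqlite3 module and an in-memory copy of the orders table from the question, that prints the EXPLAIN QUERY PLAN output for each indexing option. The helper name query_plan is made up for this illustration, and the plan text differs between SQLite versions, so no particular output is claimed here.

```python
import sqlite3

def query_plan(index_statements):
    """Return the EXPLAIN QUERY PLAN steps for the question's query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " price INTEGER NOT NULL,"
        " updateTime INTEGER NOT NULL)"
    )
    for statement in index_statements:
        conn.execute(statement)
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM orders WHERE price > ? ORDER BY updateTime DESC",
        (100,),
    ).fetchall()
    conn.close()
    # The last column of each row is the human-readable plan step.
    return [row[-1] for row in rows]

# Option 1: two single-column indices, as in the question.
two_indices = query_plan([
    "CREATE INDEX i_1 ON orders(price)",
    "CREATE INDEX i_2 ON orders(updateTime)",
])

# Option 2: the composite index suggested in the answer.
composite = query_plan([
    "CREATE INDEX idx_orders ON orders(updateTime, price)",
])

for label, plan in (("two indices", two_indices), ("composite", composite)):
    print(label)
    for step in plan:
        print("   ", step)
```

Running this against a database with realistic data (and after ANALYZE) gives a better picture, because the planner's choice depends on table statistics.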

SQLite: treat non-existent column as NULL

I have a query like this (simplified and anonymised):
SELECT
Department.id,
Department.name,
Department.manager_id,
Employee.name AS manager_name
FROM
Department
LEFT OUTER JOIN Employee
ON Department.manager_id = Employee.id;
The field Department.manager_id may be NULL. If it is non-NULL then it is guaranteed to be a valid id for precisely one row in the Employee table, so the OUTER JOIN is there just for the rows in the Department table where it is NULL.
Here is the problem: old instances of the database do not have this Department.manager_id column at all. In those cases, I would like the query to act as if the field did exist but was always NULL, so e.g. the manager_name field is returned as NULL. If the query only used the Department table then I could just use SELECT * and check for the column in my application, but the JOIN seems to make this impossible. I would prefer not to modify the database, partly so that I can load the database in read only mode. Can this be done just by clever adjustment of the query?

For completeness, here is an answer that does not require munging both possible schemas into one query (but still doesn't need you to actually do the schema migration):
Check for the schema version, and use that to determine which SELECT query to issue (i.e. with or without the manager_id column and JOIN) as a separate step. Here are a few possibilities to determine the schema version:
The ideal situation is that you already keep track of the schema by assigning version numbers to the schema and recording them in the database. Commonly this is done with either:
The user_version pragma.
A table called "Schema" or similar with one row containing the schema version number.
You can directly determine whether the column is present in the table. Two possibilities:
Use the table_info pragma to determine the list of columns in the table.
Use a simple SELECT * FROM Table LIMIT 1 and look at what columns are returned (this is probably better as it is independent of the database engine).
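As a rough sketch of the table_info approach, assuming Python's standard sqlite3 module as the client: the helper names has_column and fetch_departments are invented for this example, and the fallback query simply returns NULL for the two manager fields.

```python
import sqlite3

QUERY_WITH_MANAGER = """
SELECT Department.id, Department.name, Department.manager_id,
       Employee.name AS manager_name
FROM Department
LEFT OUTER JOIN Employee ON Department.manager_id = Employee.id
"""

# Fallback for old databases: same result columns, always NULL.
QUERY_WITHOUT_MANAGER = """
SELECT Department.id, Department.name,
       NULL AS manager_id, NULL AS manager_name
FROM Department
"""

def has_column(conn, table, column):
    # PRAGMA table_info yields one row per column; index 1 is the name.
    # The table name is interpolated, so it must come from trusted code.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)

def fetch_departments(conn):
    # Pick the query that matches the schema actually present.
    if has_column(conn, "Department", "manager_id"):
        return conn.execute(QUERY_WITH_MANAGER).fetchall()
    return conn.execute(QUERY_WITHOUT_MANAGER).fetchall()

# Old schema: no manager_id column.
old_db = sqlite3.connect(":memory:")
old_db.execute("CREATE TABLE Department (id INTEGER PRIMARY KEY, name TEXT)")
old_db.execute("INSERT INTO Department VALUES (1, 'Sales')")

# New schema: manager_id refers to Employee.id.
new_db = sqlite3.connect(":memory:")
new_db.execute("CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT)")
new_db.execute(
    "CREATE TABLE Department (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
new_db.execute("INSERT INTO Employee VALUES (10, 'Ann')")
new_db.execute("INSERT INTO Department VALUES (1, 'Sales', 10)")

old_rows = fetch_departments(old_db)
new_rows = fetch_departments(new_db)
print(old_rows)
print(new_rows)
```

Both connections only read from the database after setup, so the same check works when the file is opened in read-only mode.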

This seems to work:
SELECT
Dept.id,
Dept.name,
Dept.manager_id,
Employee.name AS manager_name
FROM
(SELECT *, NULL AS manager_id FROM Department) AS Dept
LEFT OUTER JOIN Employee
ON Dept.manager_id = Employee.id;
If the manager_id column is present in Department then it is used for the join, whereas if it is not then Dept.manager_id and Employee.name are both NULL.
If I swap the column order in the subquery:
(SELECT NULL AS manager_id, * FROM Department) AS Dept
then the Dept.manager_id and Employee.name are both NULL even if the Department.manager_id column exists, so it seems that Dept.manager_id refers to the first column in the Dept subquery that has that name. It would be good to find a reference in the SQLite documentation saying that this behaviour is guaranteed (or explicitly saying that it is not), but I can't find anything (e.g. in the SELECT or expression pages).
I haven't tried this with other database systems so I don't know if it will work with anything other than SQLite.
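Since this relies on behaviour that the answer could not find documented, it may be worth running a small check against the SQLite version you actually ship. Here is a minimal sketch with Python's standard sqlite3 module; the helper run_on_schema and the sample rows are made up for this illustration, and it assumes the Employee table exists in both schema versions.

```python
import sqlite3

QUERY = """
SELECT Dept.id, Dept.name, Dept.manager_id, Employee.name AS manager_name
FROM (SELECT *, NULL AS manager_id FROM Department) AS Dept
LEFT OUTER JOIN Employee ON Dept.manager_id = Employee.id
"""

def run_on_schema(with_manager_column):
    """Run the subquery trick against an old or a new Department schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO Employee VALUES (10, 'Ann')")
    if with_manager_column:
        conn.execute(
            "CREATE TABLE Department "
            "(id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
        )
        conn.execute("INSERT INTO Department VALUES (1, 'Sales', 10)")
    else:
        conn.execute("CREATE TABLE Department (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("INSERT INTO Department VALUES (1, 'Sales')")
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows

old_rows = run_on_schema(with_manager_column=False)
new_rows = run_on_schema(with_manager_column=True)
print("SQLite version:", sqlite3.sqlite_version)
print("old schema:", old_rows)
print("new schema:", new_rows)
```

On the old schema the manager fields must come back as NULL. On the new schema, check that the printed row carries the real manager before relying on the trick.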

Function returning varchar inside select

I am trying to generalize the SQL that splits a string/varchar into records. Here is the working SQL:
SELECT test.* FROM test JOIN (
SELECT level nbr, REGEXP_SUBSTR('1,3', '(.*?)(,|$)', 1, level, NULL, 1) value
FROM dual CONNECT BY level <= REGEXP_COUNT('1,3', ',')+1 ORDER BY level
) requested ON test.id=requested.value
What I mean by generalizing is moving the recurring SQL (in this case the part between the parentheses in the working SQL above) to a procedure/function so it can be reused. In this case I'm trying to find a way to insert a generated inner select statement. This is what the generalized SQL might look like:
SELECT t.* FROM table t JOIN (<GENERATED_INNER_SELECT>) my ON t.x=my.x;
However, I haven't succeeded yet. Calling my function directly to generate the inner select statement resulted in:
ORA-00900: invalid SQL statement
And using the function in the generalized SQL resulted in:
ORA-00907: missing right parenthesis
Neither of these errors makes any sense to me in this context.
Perhaps you can help? Check out the full case on dbfiddle.

If you generate a SQL fragment to use as a subquery then the overall statement that embeds that as a subquery would have to be executed dynamically too.
It would be simpler to have the function actually doing the split itself, and returning a collection - as a schema-level collection type:
CREATE TYPE T_NUMBERS AS TABLE OF NUMBER
/
CREATE OR REPLACE FUNCTION split(p_string VARCHAR2, p_seperator VARCHAR2 DEFAULT ',')
RETURN T_NUMBERS AS
L_NUMBERS T_NUMBERS;
BEGIN
SELECT REGEXP_SUBSTR(p_string, '(.*?)(,|$)', 1, level, NULL, 1)
BULK COLLECT INTO L_NUMBERS
FROM dual
CONNECT BY level <= REGEXP_COUNT(p_string, ',')+1;
RETURN L_NUMBERS;
END split;
/
SELECT * FROM TEST
WHERE id MEMBER OF (split('1,3'))
/
ID NAM
---------- ---
1 foo
3 foe
or if you prefer the table collection expression approach:
SELECT t.*
FROM TABLE(split('1,3')) tmp
JOIN test t ON t.id = tmp.column_value;
It would be even simpler if the query could be called with a collection of numbers in the first place, but without seeing how the call is being made - and the string generated - it's hard to say exactly how you'd need to change that. You could even use a built-in collection type then, instead of having to define your own:
SELECT t.*
FROM TABLE(SYS.ODCINUMBERLIST(1,3)) tmp
JOIN test t ON t.id = tmp.column_value;
but it relies on the caller being able to pass the numbers in rather than a string (note the lack of single quotes...)

How to reverse order of SQLite CTE

If I do a simple select, I can order the result using ORDER BY. But I cannot do this with a WITH RECURSIVE CTE, because I am using it to find a path from a leaf in a tree back up to the root, and the order that the CTE creates the result is not an order that can be obtained by sorting, therefore there is no ORDER BY I can reverse to get the reverse order.
The problem I have is, this constructs the results from leaf to root, but for a subsequent part of the query I need it to be in the reverse order, from the root to the leaf. But I cannot construct the query this way because it would wind up following all branches in the tree instead of the single path that I need. Thus, I need to somehow reverse the order of the resulting CTE. How can I do this?
I have done a bit of looking, and there are some similar questions for other (non-SQLite) databases which seem to suggest that the result of a CTE doesn't actually have any defined order. I am not sure if that is true for SQLite: I always see it output the table in the same child-to-parent order, and in fact there are other cases (such as creating temporary tables, as in a previous question I asked) where, if the table were not guaranteed to have this property, the only possible solution would break, making the problem impossible to solve.

The documentation says:
An ORDER BY clause on the recursive-select can be used to control whether the search of a tree is depth-first or breadth-first.
However, you want to sort the ultimate output of the CTE.
This can be done easily because you are using a normal SELECT to access the CTE:
WITH RECURSIVE test1(id, parent) AS (
VALUES(3, 2)
UNION ALL
SELECT test.id, test.parent
FROM test JOIN test1 ON test1.parent = test.id)
SELECT *
FROM test1
ORDER BY id -- this sorts normally
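If no existing column gives the desired order, one option is to have the CTE record how many steps each row is from the starting leaf and sort on that. The sketch below uses Python's standard sqlite3 module; the depth column, the CTE name path, and the sample rows are assumptions added for this illustration, not part of the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, parent INTEGER)")
# Ids are deliberately not in tree order, so sorting by id would not help.
# 7 is the root; 2 and 9 are its children; 5 is a child of 2.
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [(7, None), (2, 7), (9, 7), (5, 2)],
)

def path_from_root(conn, leaf_id):
    """Walk from the leaf up to the root, then return ids root-first."""
    rows = conn.execute(
        """
        WITH RECURSIVE path(id, parent, depth) AS (
            SELECT id, parent, 0 FROM test WHERE id = ?
            UNION ALL
            SELECT test.id, test.parent, path.depth + 1
            FROM test JOIN path ON path.parent = test.id
        )
        SELECT id FROM path
        ORDER BY depth DESC  -- highest depth is the root, so root comes first
        """,
        (leaf_id,),
    ).fetchall()
    return [row[0] for row in rows]

root_to_leaf = path_from_root(conn, 5)
print(root_to_leaf)  # [7, 2, 5]
```

Only the single path is followed, because the recursion still runs from leaf to root; the reversal happens in the final SELECT.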

You can use multiple elements in your order by. For example the following will order the tree by name after performing the depth first search. Here I am sorting first by LEVEL descending then NAME ascending. The output would be a sorted tree with children underneath the appropriate parent.
WITH RECURSIVE TEMPTREE (
id,
name,
level
)
AS (
SELECT flow_id,
name,
0
FROM DATATABLE
WHERE parent_id IS NULL
UNION ALL
SELECT DATATABLE.id,
DATATABLE.name,
TEMPTREE.level + 1
FROM DATATABLE
JOIN
TEMPTREE ON DATATABLE.parent_id = TEMPTREE.id
ORDER BY 3 DESC, 2 ASC
)
SELECT substr('..........', 1, level * 3) || name AS Name,
id
FROM TEMPTREE;

Hierarchical Database Select / Insert Statement (SQL Server)

I have recently stumbled upon a problem with selecting relationship details from one table and inserting them into another table. I hope someone can help.
I have a table structure as follows, e.g.:
ID (PK)  Name       ParentID
1        Myname     0
2        nametwo    1
3        namethree  2
This is the table I need to select from to get all the relationship data. There could be an unlimited number of sub-links (is there a function I can create to do the looping?).
Once I have all the data, I need to insert it into another table, and the IDs will have to change because they must go in order (e.g. ID "2" cannot be a child of ID "3"). I am hoping I can use the same function for both the selecting and the inserting.

If you are using SQL Server 2005 or above, you may use recursive queries to get your information. Here is an example:
With tree (id, Name, ParentID, [level])
As (
Select id, Name, ParentID, 1
From [myTable]
Where ParentID = 0
Union All
Select child.id
,child.Name
,child.ParentID
,parent.[level] + 1 As [level]
From [myTable] As [child]
Inner Join [tree] As [parent]
On [child].ParentID = [parent].id)
Select * From [tree];
This query will return the row requested by the first portion (Where ParentID = 0) and all sub-rows recursively. Does this help you?
I'm not sure I understand what you want to have happen with your insert. Can you provide more information in terms of the expected result when you are done?
Good luck!

For the retrieval part, you can take a look at Common Table Expression. This feature can provide recursive operation using SQL.
For the insertion part, you can use the CTE above to regenerate the ID, and insert accordingly.
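One possible reading of "regenerate the ID" is sketched below. It uses Python's standard sqlite3 module as a stand-in for SQL Server so that it can be run anywhere; the target table newTable, the sample IDs, and the renumbering rule (order by level, then old id) are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id INTEGER PRIMARY KEY, Name TEXT, ParentID INTEGER)")
conn.execute("CREATE TABLE newTable (id INTEGER PRIMARY KEY, Name TEXT, ParentID INTEGER)")
# Source IDs are out of order on purpose: child 3 has parent 10.
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [(10, "Myname", 0), (3, "nametwo", 10), (7, "namethree", 3)],
)

# Same recursive CTE as in the answer above, ordered so parents come first.
rows = conn.execute(
    """
    WITH RECURSIVE tree(id, Name, ParentID, level) AS (
        SELECT id, Name, ParentID, 1 FROM myTable WHERE ParentID = 0
        UNION ALL
        SELECT child.id, child.Name, child.ParentID, parent.level + 1
        FROM myTable AS child
        JOIN tree AS parent ON child.ParentID = parent.id
    )
    SELECT id, Name, ParentID FROM tree ORDER BY level, id
    """
).fetchall()

# Map each old id to its new sequential id; 0 stays the "no parent" marker.
new_id = {0: 0}
for position, (old_id, _name, _parent_id) in enumerate(rows, start=1):
    new_id[old_id] = position

conn.executemany(
    "INSERT INTO newTable VALUES (?, ?, ?)",
    [(new_id[old_id], name, new_id[parent_id]) for old_id, name, parent_id in rows],
)

copied = conn.execute("SELECT id, Name, ParentID FROM newTable ORDER BY id").fetchall()
print(copied)
```

Because rows are numbered level by level, every parent receives a smaller new ID than any of its children.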

I hope this link helps: Self-Joins in SQL

This is the problem of finding the transitive closure of a graph in SQL. SQL does not support this directly, which leaves you with three common strategies:
use a vendor specific SQL extension
store the Materialized Path from the root to the given node in each row
store the Nested Sets, that is the interval covered by the subtree rooted at a given node when nodes are labeled depth first
The first option is straightforward, and if you don't need database portability is probably the best. The second and third options have the advantage of being plain SQL, but require maintaining some de-normalized state. Updating a table that uses materialized paths is simple, but for fast queries your database must support indexes for prefix queries on string values. Nested sets avoid needing any string indexing features, but can require updating a lot of rows as you insert or remove nodes.
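To make the materialized-path option concrete, here is a minimal sketch using Python's standard sqlite3 module. The node table, the "/1/2/" path format, and the helper descendants are assumptions chosen for this illustration; other path encodings work equally well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT, path TEXT NOT NULL)"
)
# Each row stores the full path of ids from the root down to itself.
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [
        (1, "Myname", "/1/"),
        (2, "nametwo", "/1/2/"),
        (3, "namethree", "/1/2/3/"),
        (4, "other", "/1/4/"),
    ],
)

def descendants(conn, node_id):
    """Return ids of all nodes below node_id using a prefix match on path."""
    (path,) = conn.execute(
        "SELECT path FROM node WHERE id = ?", (node_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT id FROM node WHERE path LIKE ? AND id <> ? ORDER BY path",
        (path + "%", node_id),
    ).fetchall()
    return [row[0] for row in rows]

print(descendants(conn, 1))  # [2, 3, 4]
print(descendants(conn, 2))  # [3]
```

The whole subtree comes back from one non-recursive query, which is the appeal of this layout; the cost is that moving a node means rewriting the path of every row beneath it.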
If you're fine with always using MSSQL, I'd use the vendor specific option Adrian mentioned.