sqlite: how does order by in recursive queries work? - recursion

https://www.sqlite.org/lang_with.html#withorderby shows an example of how order-by on a recursion level can change Depth-First to Breadth-First ordering of a tree. Richard Hipp in
http://sqlite.1065341.n5.nabble.com/why-does-the-recursive-example-sort-alphabetically-td80404.html
says that order-by in recursive part has a special meaning: determines the order of recursion.
I've played with order-by on fields other than recursion level and ... I cannot predict the results. :( Could someone please explain how it works?
Basically, let's look at an example similar to Alice/Bob/Cindy from the sqlite docs (the first link above), but with names mixed up a bit (not alphabetically ordered within the tree, and inserted in a random order) and then with "order by name" (instead of recursion level) in the recursive part.
CREATE TABLE child_parent(
name TEXT PRIMARY KEY,
parent TEXT
);
INSERT INTO child_parent VALUES('KKK','HHH');
INSERT INTO child_parent VALUES('HHH','LLL');
INSERT INTO child_parent VALUES('MMM','CCC');
INSERT INTO child_parent VALUES('CCC','QQQ');
INSERT INTO child_parent VALUES('QQQ','LLL');
INSERT INTO child_parent VALUES('TTT','QQQ');
INSERT INTO child_parent VALUES('AAA','HHH');
INSERT INTO child_parent VALUES('EEE','TTT');
INSERT INTO child_parent VALUES('UUU','CCC');
INSERT INTO child_parent VALUES('FFF','MMM');
INSERT INTO child_parent VALUES('LLL',NULL);
The child_parent structure is:
-- LLL -> QQQ
-- -> TTT
-- -> EEE
-- -> CCC
-- -> UUU
-- -> MMM
-- -> FFF
-- -> HHH
-- -> AAA
-- -> KKK
First select - with "ORDER BY new_name asc"
WITH RECURSIVE
tree(name,level) AS (
SELECT cp.name, 0
FROM child_parent cp
WHERE cp.parent IS NULL
UNION ALL
SELECT ch.name AS new_name, tree.level+1 AS new_lvl
FROM child_parent ch
INNER JOIN tree ON ch.parent=tree.name
ORDER BY new_name ASC
)
SELECT *, substr('...................',1,level*3) || name FROM tree;
-- name level substr('...................',1,level*3) || name
-- ---------- ---------- -----------------------------------------------
-- LLL 0 LLL
-- HHH 1 ...HHH
-- AAA 2 ......AAA
-- KKK 2 ......KKK
-- QQQ 1 ...QQQ
-- CCC 2 ......CCC
-- MMM 3 .........MMM
-- FFF 4 ............FFF
-- TTT 2 ......TTT
-- EEE 3 .........EEE
-- UUU 3 .........UUU
First select - with "ORDER BY new_name desc"
WITH RECURSIVE
tree(name,level) AS (
SELECT cp.name, 0
FROM child_parent cp
WHERE cp.parent IS NULL
UNION ALL
SELECT ch.name AS new_name, tree.level+1 AS new_lvl
FROM child_parent ch
INNER JOIN tree ON ch.parent=tree.name
ORDER BY new_name DESC
)
SELECT *, substr('...................',1,level*3) || name FROM tree;
-- name level substr('...................',1,level*3) || name
-- ---------- ---------- -----------------------------------------------
-- LLL 0 LLL
-- QQQ 1 ...QQQ
-- TTT 2 ......TTT
-- HHH 1 ...HHH
-- KKK 2 ......KKK
-- EEE 3 .........EEE
-- CCC 2 ......CCC
-- UUU 3 .........UUU
-- MMM 3 .........MMM
-- FFF 4 ............FFF
-- AAA 2 ......AAA
Basically the question is: how to think about db behaviour to predict that the result of the last two queries would be as above. Could you describe what happens step-by-step, e.g. at each recursion level?
SQLite 3.15.2.0.37

The documentation says:
The basic algorithm for computing the content of the recursive table is as follows:
Run the initial-select and add the results to a queue.
While the queue is not empty:
a. Extract a single row from the queue.
b. Insert that single row into the recursive table
c. Pretend that the single row just extracted is the only row in the recursive table and run the recursive-select, adding all results to the queue.
[…]
If an ORDER BY clause is present, it determines the order in which rows are extracted from the queue in step 2a.
When the ORDER BY is applied to the tree level column, the result is a breadth-first or depth-first search. (With ASC, the 'older' rows are extracted first; with DESC, the most recent row at the lowest level is extracted first.)
But when applied to some other column, the order is no longer related to the structure of the tree.

Answer
Oh, I see, so the queue contains only a list of single rows and there are two points that I have missed:
the queue gets re-ordered after EVERY small STEP and NOT after every recursion level
the results are added to the results at the moment when the row is processed from the queue, and not when they are added to the queue.
Step-by-step example
Example: so in the last case above (ORDER BY new_name DESC):
initial select => result: empty, queue (LLL)
the second part of recursive select based on the first record of the queue
result: (LLL), queue empty
we get 2 records: QQQ, HHH
they both get added to the queue in order of ORDER BY: queue {QQQ, HHH}
first record from the queue gets picked up
result: (LLL), (QQQ); queue {HHH}
first based on QQQ we get 2 new records: TTT and AAA
they get added to the queue BUT at this moment (!) the queue wll get ordered and it will be in fact {TTT, HHH, AAA}
first record from the queue gets picked up
result: (LLL), (QQQ), (TTT); queue {HHH, AAA}
select based on TTT does not add anything to the queue
first record from the queue gets picked up:
result: (LLL), (QQQ), (TTT), (HHH); queue {AAA}
query based on HHH leads to two new rows: KKK and CCC, and the queue after this step is {KKK, CCC, AAA}
and so on
Summary
Yep, we do get the results as from the above selects
and in the case of ordering by recursion level these steps do lead to Depth-First or Breadth-First.
Thanks!

Related

How to create a sqlite recursive view that properly uses an index for the first row

We're using sqlite version 3.16.0.
I would like to create some views to simplify some common recursive operations I do on our schema. However, these views turn out to be significantly slower than running the SQL directly.
Specifically, a view to show me the ancestors for a given node:
CREATE VIEW ancestors AS
WITH RECURSIVE ancestors
(
leafid
, parentid
, name
, depth
)
AS
(SELECT id
, parentid
, name
, 1
FROM objects
UNION ALL
SELECT a.leafid
, f.parentid
, f.name
, a.depth + 1
FROM objects f
JOIN ancestors a
ON f.id = a.parentid
) ;
when used with this query:
SELECT *
FROM ancestors
WHERE leafid = 157609;
yields this result:
sele order from deta
---- ------------- ---- ----
2 0 0 SCAN TABLE objects
3 0 1 SCAN TABLE ancestors AS a
3 1 0 SEARCH TABLE objects AS f USING INTEGER PRIMARY KEY (rowid=?)
1 0 0 COMPOUND SUBQUERIES 0 AND 0 (UNION ALL)
0 0 0 SCAN SUBQUERY 1
Run Time: real 0.374 user 0.372461 sys 0.001483
Yet running the query directly (with a WHERE constraint on the initial query for the same row), yields:
WITH RECURSIVE ancestors
(
leafid, parentid, name, depth
)
AS
(SELECT id, parentid , name, 1
FROM objects
WHERE id = 157609
UNION ALL
SELECT a.leafid, f.parentid , f.name, a.depth + 1
FROM objects f
JOIN ancestors a
ON f.id = a.parentid
)
SELECT *
FROM ancestors;
Run Time: real 0.021 user 0.000249 sys 0.000111
sele order from deta
---- ------------- ---- ----
2 0 0 SEARCH TABLE objects USING INTEGER PRIMARY KEY (rowid=?)
3 0 1 SCAN TABLE ancestors AS a
3 1 0 SEARCH TABLE objects AS f USING INTEGER PRIMARY KEY (rowid=?)
1 0 0 COMPOUND SUBQUERIES 0 AND 0 (UNION ALL)
0 0 0 SCAN SUBQUERY 1
The second result is around 15 times faster because we're using the PK index on objects to get the initial row, whereas the view seems to scan the entire table, filtering on leaf node only after the ancestors for all rows are found.
Is there any way to write the view such that I can apply a constraint on a consuming select that would be applied to the optimization of the initial query?
You are asking for the WHERE leafid = 157609 to be moved inside the first subquery. This is the push-down optimization, and SQLite tries to do it whenever possible.
However, this is possible only if the database is able to prove that the result is guaranteed to be the same. For this particular query, you know that the transformation would be valid, but, at the moment, there is no algorithm to make this proof for recursive CTEs.

BQ array lookup: similar to NTH, but based on index, not position

The NTH function is really useful for extracting nested array elements in BQ, but its utility for a given table depends on each row's nested array containing the same amount of elements, and in the same order. If I have a 2+ column nested array where one column is variable name/ID, and the different instances of the array in different rows have inconsistent naming and/or ordering, is there an elegant way to fetch/pivot a variable based on the variable name/ID?
For example, if row1 has customDimensions array:
index value
4 aaa
23 bbb
70 ccc
and row2 has customDimensions array:
index value
4 ddd
70 eee
I'd want to run something like
SELECT
NTHLOOKUP(70, customdims.index, customdims.value) as val70,
NTHLOOKUP(4, customdims.index, customdims.value) as val4,
NTHLOOKUP(23, customdims.index, customdims.value) as val23
from my_table;
And get:
val70 val4 val23
ccc aaa bbb
eee ddd (null)
I've been able to get this sort of result by making a subquery for each desired index value, unnesting the array in each and filtering WHERE index = (value), but that gets really ugly as the variables pile up. Is there an alternative?
EDIT: Based on Mikhail's answer below (thank you!!) I was able to write my query more elegantly. Not quite as slick as an NTHLOOKUP, but I'll take it:
select id,
max(case when index = 41 then value[OFFSET(0)] else '' end) as val41,
max(case when index = 59 then value[OFFSET(0)] else '' end) as val59
from
(select
concat(array1.thing1, array1.thing2) as id,
cd.index,
ARRAY_AGG(distinct cd.value) as value
FROM my_table g
,unnest(array1) as array1
,unnest(array1.customDimensions) as cd
where index in (41,59)
group by 1,2
order by 1,2
) x
group by 1
order by 1
The best I can "offer" is below (BigQuery Standard SQL)
#standardSQL
WITH `project.dataset.my_table` AS (
SELECT ARRAY<STRUCT<index INT64, value STRING>>
[(4, 'aaa'), (23, 'bbb'), (70, 'ccc')] customDimensions
UNION ALL
SELECT ARRAY<STRUCT<index INT64, value STRING>>
[(4, 'ddd'), (70, 'eee')] customDimensions
)
SELECT cd.index, ARRAY_AGG(cd.value) VALUES
FROM `project.dataset.my_table`,
UNNEST(customDimensions) cd
GROUP BY cd.index
with result as below
Row index values
1 4 aaa
ddd
2 23 bbb
3 70 ccc
eee
I would recommend to stay with this flatten version as it serves most of practical cases I can think of
But if you still want to further pivot this - there are quite a number of posts related to how to pivot in BigQuery
I've been able to get this sort of result by making a subquery for each desired index value, unnesting the array in each and filtering WHERE index = (value), but that gets really ugly as the variables pile up. Is there an alternative?
Yes, you can use a user-defined function to encapsulate the common logic. For example,
CREATE TEMP FUNCTION NTHLOOKUP(
targetIndex INT64,
customDimensions ARRAY<STRUCT<index INT64, value STRING>>
) AS (
(SELECT value FROM UNNEST(customDimensions)
WHERE index = targetIndex)
);
SELECT
NTHLOOKUP(70, customDimensions) as val70,
NTHLOOKUP(4, customDimensions) as val4,
NTHLOOKUP(23, customDimensions) as val23
from my_table;

Recursively add to a data table in SAS

I am new to SAS. I need to do x-iterations to populate my dataset called MYRS.
Each iteration needs to JOIN TABLE1 with (TABLE2+ MYRS) MINUS the records which are already in MYRS table.
Then, I need to update MYRS table with additional matches. The GOAL is to track a chain of emails.
MYRS is essentially a copy of TABLE1 and contains matching records. Kind of tricky. (simplified schema). Table1 Can have DUPS.
For example
TABLE1:
ID | EMAIL1 | EMAIL2 | EMAIL3 | EMAIL4|
1 | A | s | d | F
2 | g | F | j | L
3 | z | x | L | v
4 | z | x | L | v
2 | g | F | j | L
TABLE2:
EMAIL
A
MYRS (starts as empty dataset)
EMAIL1 | EMAIL2 | EMAIL3 | EMAIL4
Logic: TABLE1 has email that matches email in TABLE2. Therefore this record need to show up. Other records don't match anything in TABLE2. But because Record1 and Record2 share the same ALTERNATIVE email F, Record2 also need to be shown. But because Record2 and Record3 share same alternative email L, Record3 also needs to be shown. And so fourth...
proc sql;
SELECT TABLE1.id,
TABLE1.email1,
TABLE1.email2,
TABLE1.email3,
TABLE1.email4
FROM TABLE1
INNER JOIN (
SELECT EMAIL
FROM TABLE2
UNION
SELECT EMAIL1 AS EMAIL
FROM MYRS
UNION
SELECT EMAIL2 AS EMAIL
FROM MYRS
UNION
SELECT EMAIL3 AS EMAIL
FROM MYRS
UNION
SELECT EMAIL4 AS EMAIL
FROM MYRS
)
ON EMAIL=EMAIL1 OR EMAIL=EMAIL2 OR EMAIL=EMAIL3 OR EMAIL=EMAIL4
WHERE TABLE1.id NOT IN (
SELECT DISTINCT ID
FROM MYRS
)
quit;
How can I create the following logic:
Wrap this into some sort of function
Before sql execution, count amount of records in MYDS and SAVE the count
Execute SQL and update MYDS
Count amount of records in MYDS
If MYDS count did not change, stop execution
Else, goto #3
I am very new to SAS (3 days to be exact) and trying to put everything together. (I would use the logic above if I was to do that in Java)
Here is a macro approach, it mostly follows your logic but transforms your data first and the input/output is a list of IDs (you can easily get to and from emails with this).
This code will probably introduce quite a few SAS features that you are unfamiliar with, but the comments and explanations below should help . If any of it is still unclear take a look at the links or add a comment.
It expects input data:
inData: Your TABLE1 with ID and EMAIL* variables
matched: An initial list of known wanted IDs
It returns:
matched: An updated list of wanted IDs
/* Wrap the processing in a macro so that we can use a %do loop */
%macro looper(maxIter = 5);
/* Put all the emails in one column to make comparison simpler */
proc transpose data = inData out = trans (rename = (col1 = email));
by ID;
var email:;
run;
/* Initialise the counts for the %where condition */
%let _nMatched = 0;
%let nMatched = 1;
%let i = 0;
/* Loop until no new IDs are added (or maximum number of iterations) */
%do %while(&_nMatched. < &nMatched. and &i < &maxIter.);
%let _nMatched = &nMatched.;
%let i = %eval(&i. + 1);
%put NOTE: Loop &i.: &nMatched. matched.;
/* Move matches to a temporary table */
proc datasets library = work nolist nowarn;
delete _matched;
change matched = _matched;
quit;
/* Get new matched IDs */
proc sql noprint;
create table matched as
select distinct c.ID
from _matched as a
left join trans as b
on a.ID = b.ID
left join trans as c
on b.email = c.email;
/* Get new count */
select count(*) into :nMatched from matched;
quit;
%end;
%mend looper;
%looper(maxIter = 10);
The interesting bits are:
proc transpose: Converts the input into a deep table so that all the email addresses are in one variable, this makes writing the email comparison logic simpler (less repetition needed) and puts the data in a format that will make it easier for you to clean the email addresses if necessary (think upcase(), strip(), etc.).
%macro %mend: The statements used to define a macro. This is necessary as you cannot use macro logic or loops in open code. I've also added an argument so you can see how that works.
%let and select into :: Two ways to create macro variables. Macro variables are referenced with the prefix & and are used to insert text into the SAS program before it is executed.
%do %while() %end: One of the ways to perform a loop within a macro. The code within will be run repeatedly until the condition evaluates to false.
proc datasets: A procedure for performing admin tasks on datasets and libraries. Used here to delete and rename temporary tables.

Common Table Expression in sqlite using rowid

I found a good article on converting adjacency to nested sets at http://dataeducation.com/the-hidden-costs-of-insert-exec/
The SQL language used is Microsoft SQL Server (I think) and I am trying to convert the examples given in the article to sqlite (as this is what I have easy access to on my Macbook).
The problem I appear to be having is converting the part of the overall CTE query to do with the Employee Rows
EmployeeRows AS
(
SELECT
EmployeeLevels.*,
ROW_NUMBER() OVER (ORDER BY thePath) AS Row
FROM EmployeeLevels
)
I converted this to
EmployeeRows AS
(
SELECT
EmployeeLevels.*,
rowid AS Row
FROM EmployeeLevels
ORDER BY thePath
)
and the CTE query runs (no syntax errors) but the output I get is a table without the Row and Lft and Rgt columns populated
ProductName ProductID ParentProductID TreePath HLevel Row Lft Rgt
----------- ---------- --------------- ---------- ---------- ---------- ---------- ----------
Baby Goods 0 0 1
Baby Food 10 0 0.10 2
All Ages Ba 100 10 0.10.100 3
Strawberry 200 100 0.10.100.2 4
Baby Cereal 250 100 0.10.100.2 4
Beginners 150 10 0.10.150 3
Formula Mil 300 150 0.10.150.3 4
Heinz Formu 310 300 0.10.150.3 5
Nappies 20 0 0.20 2
Small Pack 400 20 0.20.400 3
Bulk Pack N 450 20 0.20.450 3
I think the start of the problem is the Row is not getting populated and therefore the Lft and Rgt columns do not get populated by the following parts of the query.
Are there any sqlite experts out there to tell me:
am I translating the rowid part of the query correctly
does sqlite support a rowid in a part of a CTE query
is there a better way? :)
Any help appreciated :)
am I translating the rowid part of the query correctly
No.
The SQL:
SELECT
EmployeeLevels.*,
rowid AS Row
FROM EmployeeLevels
ORDER BY thePath
has the Row defined as the rowid of table EmployeeLevels in SQLite, ignoring the order clause. Which is different from the intention of ROW_NUMBER() OVER (ORDER BY thePath) AS Row
does sqlite support a rowid in a part of a CTE query
Unfortunately no. I assume you mean this:
WITH foo AS (
SELECT * FROM bar ORDER BY col_a
)
SELECT rowid, *
FROM foo
but SQLite will report no such column of rowid in foo.
is there a better way?
Not sure it is better but at least it works. In SQLite, you have a mechanism of temp table which exists as long as your connection opens and you didn't delete it deliberately. Rewrite the above SQL in my example:
CREATE TEMP TABLE foo AS
SELECT * FROM bar ORDER BY col_a
;
SELECT rowid, *
FROM foo
;
DROP TABLE foo
;
This one will run without SQLite complaining.
update:
As of SQLite version 3.25.0, window function is supported. Hence you can use row_number() over (order by x) expression in your CTE if you happen to use a newer SQLite

How can I concatenate(or merge) values from 2 result sets with the same PK?

I don't know if I'm being dumb here but I can't seem to find an efficient way to do this. I wrote a very long and inefficient query that does what I need, but what I WANT is a more efficient way.
I have 2 result sets that displays an ID (a PK which is generic/from the same source in both sets) and a FLAG (A - approve and V - Validate).
Result Set 1
ID FLAG
1 V
2 V
3 V
4 V
5 V
6 V
Result Set 2
ID FLAG
2 A
5 A
7 A
8 A
I want to "merge" these two sets to give me this output:
ID FLAG
1 V
2 (V/A)
3 V
4 V
5 (V/A)
6 V
7 A
8 A
Neither of the 2 result sets will at any time have all the ID's to make a simple left join with a case statement on the other result set an easy solution.
I'm currently doing a union between the two sets to get ALL the ID's. Thereafter I left join the 2 result sets to get the required '(V/A)' by use of a case statement.
There must be a more efficient way but I just can't seem to figure it out now as I'm running low on amps... I need a holiday... :-/
Thanks in advance!
Use a FULL OUTER JOIN:
SELECT ID,
CASE
WHEN t1.FLAG IS NULL THEN t2.FLAG
WHEN t2.FLAG IS NULL THEN t1.FLAG
ELSE '(' || t1.FLAG || '/' || t2.FLAG || ')'
END AS MERGED_FLAG
FROM TABLE1 t1
FULL OUTER JOIN TABLE2 t2
USING (ID)
ORDER BY ID
See this SQLFiddle.
Share and enjoy.
I think that you can use xmlagg. Here an exemple :
SELECT deptno,
SUBSTR (REPLACE (REPLACE (XMLAGG (XMLELEMENT ("x", ename)
ORDER BY ename),'</x>'),'<x>','|'),2) as concated_list
FROM emp
GROUP BY deptno
ORDER BY deptno;
Bye

Resources