Renumber table rows with a recursive statement - sqlite

To understand the behaviour of recursion (in SQLite), I tried the following statements to re-number the rows of a table with a recursive statement:
Let's create a sample table,
CREATE TABLE tb
(x TEXT(1) PRIMARY KEY);
INSERT INTO tb
VALUES ('a'), ('b'), ('c');
and re-number the rows starting from, say, 2 via
SELECT tb.x as x, tb.rowid + 1 as idx from tb;
/* yields expected:
a|2
b|3
c|4
*/
Attempting to do the same with a recursive WITH (neglecting ROWID) results in divergence; here, I have added LIMIT 6 to cut it off:
WITH RECURSIVE
newtb AS (
SELECT tb.x, 2 AS idx FROM tb
UNION ALL
SELECT tb.x, newtb.idx + 1
FROM tb, newtb
LIMIT 6 -- only to prevent divergence!
)
SELECT * FROM newtb;
/* yields indefinitely:
a|2
b|2
c|2
a|3
b|3
c|3
...
*/
Why does the recursion not stop when it reaches the end of table tb? Could this be prevented?
Note that the problem can be reformulated as how to produce the result of the following procedural pseudo-code in SQLite (without too much ado):
tb := {'a', 'b', 'c'};
num := {1, 2, 3};
result := {}; # initialize an empty table
for i in {1, ..., length(tb)} # assume index starts from 1
append tuple(num[i], tb[i]) to result;
end for
# result will be {(1, 'a'), (2, 'b'), (3, 'c')}
This is equivalent to the zip operation in a language like Python.
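For reference, a minimal Python sketch of the pseudo-code above. The variable names simply mirror the pseudo-code, and `zip` pairs the i-th elements of the two sequences.

```python
# Pair each element of tb with the number at the same position.
tb = ["a", "b", "c"]
num = [1, 2, 3]

result = list(zip(num, tb))
print(result)
```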
According to a hint by @CPerkins, one can achieve this goal very elegantly via window functions (for SQLite >= 3.25); e.g.,
SELECT (row_number() OVER (ORDER BY x)) + 2 AS newId, x FROM tb;

Why does the recursion not stop when it reaches the end of table tb?
Because that's the way it is designed to be, and it is extremely useful. It is little different from most languages that have some form of recursion, and it is often an efficient and effective way to resolve some programming issues, such as traversing a directory tree.
Most computer programming languages support recursion by allowing a
function to call itself from within its own code. Some functional
programming languages do not define any looping constructs but rely
solely on recursion to repeatedly call code. Computability theory
proves that these recursive-only languages are Turing complete; they
are as computationally powerful as Turing complete imperative
languages, meaning they can solve the same kinds of problems as
imperative languages even without iterative control structures such as
while and for.
(Source: Recursion (computer science))
If you used LIMIT (SELECT count() FROM tb) instead of LIMIT 6 then the recursion would stop based upon the number of rows in the table.
However, if you are looking to renumber (by adding 1 to the rowid) then you would be looking at something more like :-
WITH RECURSIVE
cte(idx,newidx) AS (
SELECT (SELECT max(rowid) FROM tb),(SELECT max(rowid) FROM tb) +1
UNION ALL
SELECT
idx-1, newidx-1 FROM cte
WHERE idx > 0
)
SELECT (SELECT x FROM tb WHERE tb.rowid = cte.idx) AS x, newidx, idx AS original FROM cte WHERE x IS NOT NULL;
This would (assuming that tb had rows with a, b and c .... X, Y and Z and that rows d-w had been deleted) result in :-
SQLite's reasoning is :-
Recursive common table expressions provide the ability to do
hierarchical or recursive queries of trees and graphs, a capability
that is not otherwise available in the SQL language.
SQL As Understood By SQLite - WITH clause
Could this be prevented?
Yes. You can avoid recursion altogether, as there may be alternatives. But, as with recursion in other languages, if you do use it you need some means of detecting when the recursion should finish. A WHERE or LIMIT clause provides this.
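The query in the question diverges because `FROM tb, newtb` is a cross join: every row taken from the recursion queue is paired with all rows of tb, so the recursive step never comes back empty. As a hedged sketch (run through Python's `sqlite3` module only to make it self-contained), the recursive step below instead joins each row to the next larger key. When no larger key exists the join produces nothing and the recursion ends without a LIMIT. Ordering by `x` is an assumption made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tb (x TEXT PRIMARY KEY);
    INSERT INTO tb VALUES ('a'), ('b'), ('c');
""")

# Start at the smallest key; each step moves to the next larger key.
# After the largest key the join yields no row, which ends the recursion.
rows = conn.execute("""
    WITH RECURSIVE newtb(x, idx) AS (
        SELECT (SELECT min(x) FROM tb), 2
        UNION ALL
        SELECT tb.x, newtb.idx + 1
        FROM newtb
        JOIN tb ON tb.x = (SELECT min(t.x) FROM tb AS t WHERE t.x > newtb.x)
    )
    SELECT x, idx FROM newtb ORDER BY idx
""").fetchall()
print(rows)
```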

Related

Can I speed up calculations between R and Sqlite by using data.tables?

I have a sqlite database of about 1.4 million rows and 16 columns.
I have to run an operation on 80,000 id's :
Get all rows associated with that id
convert to R date object and sort by date
calculate difference between 2 most recent dates
For each id I have been querying sqlite from R using dbSendQuery and dbFetch for step 1, while steps 2 and 3 are done in R. Is there a faster way? Would it be faster or slower to load the entire sqlite table into a data.table ?
It heavily depends on how you are working on that problem.
Normally, loading the whole query result into memory and then doing the operations is faster, from what I have experienced and seen in benchmark graphics; I cannot show you a benchmark right now. Hopefully it also makes sense logically, because otherwise you have to repeat several operations multiple times on multiple data.frames. As you can see here, 80k rows are pretty fast, faster than 3x 26xxx rows.
However, you could have a look at the parallel package and use multiple cores on your machine to load subsets of your data and process them in parallel, each on a separate core.
Here you can find information how to do this:
http://jaehyeon-kim.github.io/2015/03/Parallel-Processing-on-Single-Machine-Part-I
If you're doing all that in R and fetching rows from the database 80,000 times in a loop... you'll probably have better results doing it all in one go in sqlite instead.
Given a skeleton table like:
CREATE TABLE data(id INTEGER, timestamp TEXT);
INSERT INTO data VALUES (1, '2019-07-01'), (1, '2019-06-25'), (1, '2019-06-24'),
(2, '2019-04-15'), (2, '2019-04-14');
CREATE INDEX data_idx_id_time ON data(id, timestamp DESC);
a query like:
SELECT id
, julianday(first_ts)
- julianday((SELECT max(d2.timestamp)
FROM data AS d2
WHERE d.id = d2.id AND d2.timestamp < d.first_ts)) AS days_difference
FROM (SELECT id, max(timestamp) as first_ts FROM data GROUP BY id) AS d
ORDER BY id;
will give you
id days_difference
---------- ---------------
1 6.0
2 1.0
An alternative for modern versions of sqlite (3.25 or newer) (EDIT: On a test database with 16 million rows and 80000 distinct ids, it runs considerably slower than the above one, so you don't want to actually use it):
WITH cte AS
(SELECT id, timestamp
, lead(timestamp, 1) OVER id_by_ts AS next_ts
, row_number() OVER id_by_ts AS rn
FROM data
WINDOW id_by_ts AS (PARTITION BY id ORDER BY timestamp DESC))
SELECT id, julianday(timestamp) - julianday(next_ts) AS days_difference
FROM cte
WHERE rn = 1
ORDER BY id;
(The index is essential for performance for both versions. Probably want to run ANALYZE on the table at some point after it's populated and your index(es) are created, too.)
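As a rough illustration of the "one go" idea from the caller's side, the sketch below issues the correlated-subquery version once and collects every id's result from a single round trip. Python's `sqlite3` module stands in here for the R DBI calls, which is an assumption made only to keep the example self-contained; the table and index are the skeleton ones from above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data(id INTEGER, timestamp TEXT);
    INSERT INTO data VALUES (1, '2019-07-01'), (1, '2019-06-25'), (1, '2019-06-24'),
                            (2, '2019-04-15'), (2, '2019-04-14');
    CREATE INDEX data_idx_id_time ON data(id, timestamp DESC);
""")

# One query for all ids instead of one query per id.
query = """
    SELECT id,
           julianday(first_ts)
           - julianday((SELECT max(d2.timestamp)
                        FROM data AS d2
                        WHERE d.id = d2.id AND d2.timestamp < d.first_ts)) AS days_difference
    FROM (SELECT id, max(timestamp) AS first_ts FROM data GROUP BY id) AS d
    ORDER BY id
"""
differences = dict(conn.execute(query).fetchall())
print(differences)
```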

Alternative for recursive aggregate queries not supported in sqlite3

I would like to perform a SQL computation of a system evolving in time as
v <- v + a (*) v
where v is a vector of N components (N >> 10), a is an N-by-N matrix, fairly sparse, (*) denotes matrix multiplication, and the evolution is recursively computed as a sequence of timesteps, with each step using the previous value of v. a changes with time as an external factor, but it is sufficient for this question to assume a is constant.
I could do this recursion loop in an imperative language, but the underlying data was kind of messy and SQL was brilliant for normalising. It would be kind of neat to just finish the job in one language.
I found that matrix multiplication is fine. Recursion is fine too, as of sqlite 3.8. But matrix multiplication inside a recursion loop does not appear to be possible. Here is my progress so far (also at http://sqlfiddle.com/#!5/ed521/1 ):
-- Example vector v
DROP TABLE IF EXISTS coords;
CREATE TABLE coords( row INTEGER PRIMARY KEY, val FLOAT );
INSERT INTO coords
VALUES
(1, 0.0 ),
(2, 1.0 );
-- Example matrix a
DROP TABLE IF EXISTS matrix;
CREATE TABLE matrix( row INTEGER, col INTEGER, val FLOAT, PRIMARY KEY( row, col ) );
INSERT INTO matrix
VALUES
( 1, 1, 0.0 ),
( 1, 2, 0.03 ),
( 2, 1, -0.03 ),
( 2, 2, 0.0 );
-- The timestep equation can also be expressed: v <- ( I + a ) (*) v, where the
-- identity matrix I is first added to a.
UPDATE matrix
SET val = val + 1.0
WHERE row == col;
-- Matrix multiply to evaluate the first step.
SELECT a.row AS row, SUM( a.val*v.val ) AS val
FROM coords v
JOIN matrix a
ON a.col == v.row
GROUP BY a.row;
Here is where the problem arises. I can't see how to do a matrix multiply without a GROUP BY (aggregation) operation, but Sqlite3 specifically does not permit aggregation inside of a recursion loop:
-- Recursive matrix multiply to evaluate a sequence of steps.
WITH RECURSIVE trajectory( row, val ) AS
(
SELECT row, val
FROM coords
UNION ALL
SELECT a.row AS row, SUM( a.val*v.val ) AS val
FROM trajectory v -- recursive sequence of steps
--FROM coords v -- non-recursive first step only
JOIN matrix a
ON a.col == v.row
GROUP BY a.row
LIMIT 50
)
SELECT *
FROM trajectory;
Returns
Error: recursive aggregate queries not supported
No doubt the designers had some clear reason for excluding it! I am surprised that JOINs are allowed but GROUP BYs are not. I am not sure what my alternatives are, though.
I've found a few other recursion examples but they all seem to have carefully selected problems for which aggregation or self-joins inside the loop are not required. In the docs( https://www.sqlite.org/lang_with.html ) an example query walks a tree recursively, and performs an avg() on the output. This is subtly different: the aggregation happens outside the loop, and tree-walking uses JOINs but no aggregation inside the recursion loop. That problem proceeds only because the recursion does not depend on the aggregations, as it does in this problem.
Another example, the Fibonacci generator is an example of an N = 2 linear dynamical system, but with N = 2 the implementations can just hard-code the two values and the matrix multiply directly into the query, so no aggregating SUM() is needed. More generally with N >> 10 it is not feasible to go down this path.
Any help would be much appreciated. Thanks!
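One possible workaround, sketched here under stated assumptions and not taken from the question, is to keep the matrix multiply in SQL but drive the timestep loop from a host language. Each step is then an ordinary, non-recursive aggregate query. The `trajectory` table with its `step` column is hypothetical, and the matrix already has the identity added, as in the UPDATE above. This gives up the goal of staying in one language, so it is only a fallback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE coords(row INTEGER PRIMARY KEY, val FLOAT);
    INSERT INTO coords VALUES (1, 0.0), (2, 1.0);
    CREATE TABLE matrix(row INTEGER, col INTEGER, val FLOAT, PRIMARY KEY(row, col));
    -- I + a, i.e. the example matrix after the UPDATE in the question
    INSERT INTO matrix VALUES (1, 1, 1.0), (1, 2, 0.03), (2, 1, -0.03), (2, 2, 1.0);
    -- hypothetical table holding v at every timestep
    CREATE TABLE trajectory(step INTEGER, row INTEGER, val FLOAT, PRIMARY KEY(step, row));
    INSERT INTO trajectory SELECT 0, row, val FROM coords;
""")

# Each iteration multiplies the previous step's vector by the matrix.
for step in range(1, 4):
    conn.execute("""
        INSERT INTO trajectory(step, row, val)
        SELECT ?, a.row, SUM(a.val * v.val)
        FROM trajectory AS v
        JOIN matrix AS a ON a.col = v.row
        WHERE v.step = ?
        GROUP BY a.row
    """, (step, step - 1))

first_step = conn.execute(
    "SELECT row, val FROM trajectory WHERE step = 1 ORDER BY row"
).fetchall()
print(first_step)
```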

The Eight-Queen Puzzle in Programming in Lua Fourth Edition

I'm currently reading Programming in Lua Fourth Edition and I'm already stuck on the first exercise of "Chapter 2. Interlude: The Eight-Queen Puzzle."
The example code is as follows:
N = 8 -- board size
-- check whether position (n, c) is free from attacks
function isplaceok (a, n ,c)
for i = 1, n - 1 do -- for each queen already placed
if (a[i] == c) or -- same column?
(a[i] - i == c - n) or -- same diagonal?
(a[i] + i == c + n) then -- same diagonal?
return false -- place can be attacked
end
end
return true -- no attacks; place is OK
end
-- print a board
function printsolution (a)
for i = 1, N do -- for each row
for j = 1, N do -- and for each column
-- write "X" or "-" plus a space
io.write(a[i] == j and "X" or "-", " ")
end
io.write("\n")
end
io.write("\n")
end
-- add to board 'a' all queens from 'n' to 'N'
function addqueen (a, n)
if n > N then -- all queens have been placed?
printsolution(a)
else -- try to place n-th queen
for c = 1, N do
if isplaceok(a, n, c) then
a[n] = c -- place n-th queen at column 'c'
addqueen(a, n + 1)
end
end
end
end
-- run the program
addqueen({}, 1)
The code's quite commented and the book's quite explicit, but I can't answer the first question:
Exercise 2.1: Modify the eight-queen program so that it stops after
printing the first solution.
At the end of this program, a contains all possible solutions; I can't figure out whether addqueen (a, n) should be modified so that a contains only one possible solution, or whether printsolution (a) should be modified so that it only prints the first possible solution.
Even though I'm not sure I fully understand backtracking, I tried to implement both hypotheses without success, so any help would be much appreciated.
At the end of this program, a contains all possible solutions
As far as I understand the solution, a never contains all possible solutions; it holds either one complete solution or one incomplete/incorrect one that the algorithm is working on. The algorithm is written in a way that simply enumerates possible solutions, skipping those that generate conflicts as early as possible (for example, if the first and second queens attack each other, the second queen is moved without checking positions for the other queens, as they couldn't complete a solution anyway).
So, to stop after printing the first solution, you can simply add os.exit() after the printsolution(a) line.
Listing 1 is an alternative to implement the requirement. The three lines, commented respectively with (1), (2), and (3), are the modifications to the original implementation in the book and as listed in the question. With these modifications, if the function returns true, a solution was found and a contains the solution.
-- Listing 1
function addqueen (a, n)
if n > N then -- all queens have been placed?
return true -- (1)
else -- try to place n-th queen
for c = 1, N do
if isplaceok(a, n, c) then
a[n] = c -- place n-th queen at column 'c'
if addqueen(a, n + 1) then return true end -- (2)
end
end
return false -- (3)
end
end
-- run the program
a = {1}
if not addqueen(a, 2) then print("failed") end
printsolution(a)
a = {1, 4}
if not addqueen(a, 3) then print("failed") end
printsolution(a)
Let me start from Exercise 2.2 in the book, which, based on my past experience explaining "backtracking" algorithms to other people, may help to better understand the original implementation and my modifications.
Exercise 2.2 requires generating all possible permutations first. A straightforward and intuitive solution is in Listing 2, which uses nested for-loops to generate all permutations and validates them one by one in the innermost loop. Although it fulfills the requirement of Exercise 2.2, the code does look awkward. It is also hard-coded to solve an 8x8 board.
-- Listing 2
local function allsolutions (a)
-- generate all possible permutations
for c1 = 1, N do
a[1] = c1
for c2 = 1, N do
a[2] = c2
for c3 = 1, N do
a[3] = c3
for c4 = 1, N do
a[4] = c4
for c5 = 1, N do
a[5] = c5
for c6 = 1, N do
a[6] = c6
for c7 = 1, N do
a[7] = c7
for c8 = 1, N do
a[8] = c8
-- validate the permutation
local valid
for r = 2, N do -- start from 2nd row
valid = isplaceok(a, r, a[r])
if not valid then break end
end
if valid then printsolution(a) end
end
end
end
end
end
end
end
end
end
-- run the program
allsolutions({})
Listing 3 is equivalent to Listing 2 when N = 8. The for-loop in the else-end block does what the whole set of nested for-loops in Listing 2 does. Using a recursive call makes the code not only compact but also flexible, i.e., it is capable of solving an NxN board and a board with pre-set rows. However, recursive calls sometimes do cause confusion. I hope the code in Listing 2 helps.
-- Listing 3
local function addqueen (a, n)
n = n or 1
if n > N then
-- verify the permutation
local valid
for r = 2, N do -- start from 2nd row
valid = isplaceok(a, r, a[r])
if not valid then break end
end
if valid then printsolution(a) end
else
-- generate all possible permutations
for c = 1, N do
a[n] = c
addqueen(a, n + 1)
end
end
end
-- run the program
addqueen({}) -- empty board, equivalent allsolutions({})
addqueen({1}, 2) -- a queen in 1st row and 1st column
Comparing the code in Listing 3 with the original implementation, the difference is that Listing 3 validates only after all eight queens are placed on the board, while the original implementation validates every time a queen is added and does not go on to the next row if the newly added queen causes a conflict. This is what "backtracking" is all about: it is a "brute-force" search that abandons a search branch as soon as it finds a node that cannot lead to a solution, and it has to reach a leaf of the search tree to confirm a valid solution.
Back to the modifications in Listing 1.
(1) When the function hits this point, it reaches a leaf of the search tree and a valid solution is found, so let it return true representing success.
(2) This is the point to stop the function from further searching. In the original implementation, the for-loop continues regardless of what happened in the recursive call. With modification (1) in place, the recursive call returns true if a solution was found, in which case the function needs to stop and propagate the success signal back; otherwise, it continues the for-loop, searching for other possible solutions.
(3) This is the point where the function returns after finishing the for-loop. With modifications (1) and (2) in place, reaching this point means it failed to find a solution, so let it explicitly return false to represent failure.
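The same early-return idea can be sketched in Python (a loose translation of Listing 1, not code from the book). The names `is_place_ok` and `add_queen` are chosen here for illustration, rows are 0-based, and columns run from 1 to N as in the Lua version.

```python
N = 8  # board size

def is_place_ok(a, n, c):
    """Return True if column c in row n is not attacked by the queens in rows 0..n-1."""
    for i in range(n):
        if a[i] == c or a[i] - i == c - n or a[i] + i == c + n:
            return False
    return True

def add_queen(a, n=0):
    """Place queens from row n onward; return True as soon as one full solution exists."""
    if n >= N:
        return True                    # (1) all queens placed: report success
    for c in range(1, N + 1):
        if is_place_ok(a, n, c):
            a[n] = c
            if add_queen(a, n + 1):    # (2) propagate success and stop searching
                return True
    return False                       # (3) no column worked in this row

board = [0] * N
found = add_queen(board)
print(found, board)
```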

Oracle query to count rows based on value from next record

Input values to the query : 1-20
Values in the database : 4,5, 15,16
I would like a query that gives me results as following
Value - Count
===== - =====
1 - 3
6 - 9
17 - 3
So basically, first generate continuous numbers from 1 to 20, count available numbers.
I wrote a query but I can not get it to fully work:
with avail_ip as (
SELECT (0) + LEVEL AS val
FROM DUAL
CONNECT BY LEVEL < 20),
grouped_tab as (
select val,lead(val,1,0) over (order by val) next_val
from avail_ip u
where not exists (
select 'x' from (select 4 val from dual) b
where b.val=u.val) )
select
val,next_val-val difference,
count(*) over (partition by next_val-val) avail_count
from grouped_tab
order by 1
It gives me a count, but I am not sure how to compress the rows to just three rows.
I was not able to add complete query, I kept getting 'error occurred while submission'. For some reason it does not like union clause. So I am attaching query as a image :(
More details of exact requirement:
I am writing an IP management module and I need to find the available (free) IP addresses within an IP block. The block could be /16 or /24 or even /12. To make it even more challenging, I also support IPv6, so I will have a lot more numbers to manage. All issued IP addresses are stored in decimal format. So my thought is to first generate all IP decimals within the block range, from network address to broadcast address. For example, in a /24 there would be 256 addresses, and in a /16 there would be 64K.
Second, find all used addresses within the block, and work out the number of available addresses for each starting IP. So in the above example, starting at 1, 3 addresses are available; starting at 6, 9 are available.
My last concern would be the query should be able to run fast enough to run through millions of numbers.
And sorry again, if my original question was not clear enough.
Similar sort of idea to what you tried:
with all_values as (
select :start_val + level - 1 as val
from dual
connect by level <= (:end_val - :start_val) + 1
),
missing_values as (
select val
from all_values
where not exists (select null from t42 where id = val)
),
chains as (
select val,
val - (row_number() over (order by val) + :start_val - 1) as chain
from missing_values
)
select min(val), count(*) as gap_count
from chains
group by chain
order by min(val);
With start_val as 1 and end_val as 20, and your data in table t42, that gets:
MIN(VAL) GAP_COUNT
---------- ----------
1 3
6 9
17 4
I've made end_val inclusive though; not sure if you want it to be inclusive or exclusive. And I've perhaps made it more flexible than you need - your version also assumes you're always starting from 1.
The all_values CTE is basically the same as yours, generating all the numbers between the start and end values - 1 to 20 (inclusive!) in this case.
The missing_values CTE removes the values that are in the table, so you're left with 1,2,3,6,7,8,9,10,11,12,13,14,17,18,19,20.
The chains CTE does the magic part. This gets the difference between each value and where you would expect it to be in a contiguous list. The difference - what I've called 'chain' - is the same for all contiguous missing values; 1,2,3 all get 0, 6 to 14 all get 2, and 17 to 20 all get 4. That chain value can then be used to group by, and you can use the aggregate count and min to get the answer you need.
SQL Fiddle of a simplified version that is specifically for 1-20, showing the data from each intermediate step. This would work for any upper limit, just by changing the 20, but assumes you'll always start from 1.
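To see why the 'chain' value works as a grouping key, here is a small Python sketch of the same idea, outside Oracle. The function name `free_runs` is made up for this illustration. Within a run of contiguous missing values, the value minus its position in the list of missing values stays constant, so grouping on that difference separates the runs.

```python
from itertools import groupby

def free_runs(start_val, end_val, used):
    """Return (first_value, run_length) for each run of contiguous unused values."""
    used = set(used)
    missing = [v for v in range(start_val, end_val + 1) if v not in used]
    runs = []
    # position p and value v differ by a constant inside a contiguous run
    for _, group in groupby(enumerate(missing), key=lambda pv: pv[1] - pv[0]):
        values = [v for _, v in group]
        runs.append((values[0], len(values)))
    return runs

runs = free_runs(1, 20, [4, 5, 15, 16])
print(runs)
```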

Can recursion be dynamic programming?

I was asked to use dynamic programming to solve a problem. I have mixed notes on what constitutes dynamic programming. I believe it requires a "bottom-up" approach, where smallest problems are solved first.
One thing I have contradicting information on, is whether something can be dynamic programming if the same subproblems are solved more than once, as is often the case in recursion.
For instance. For Fibonacci, I can have a recursive algorithm:
RecursiveFibonacci(n)
if (n=1 or n=2)
return 1
else
return RecursiveFibonacci(n-1) + RecursiveFibonacci(n-2)
In this situation, the same sub-problems may be solved over and over again. Does this mean it is not dynamic programming? That is, if I wanted dynamic programming, would I have to avoid re-solving subproblems, for example by using an array of length n and storing the solution to each subproblem (the first entries of the array are 1, 1, 2, 3, 5, 8, 13, 21)?
Fibonacci(n)
F1 = 1
F2 = 1
for i=3 to n
Fi=Fi-1 + Fi-2
return Fn
Dynamic programs can usually be succinctly described with recursive formulas.
But if you implement them with simple recursive computer programs, these are often inefficient for exactly the reason you raise: the same computation is repeated. Fibonacci is an example of repeated computation, though it is not a dynamic program.
There are two approaches to avoiding the repetition.
Memoization. The idea here is to cache the answer computed for each set of arguments to the recursive function and return the cached value when it exists.
Bottom-up table. Here you "unwind" the recursion so that results at levels less than i are combined to the result at level i. This is usually depicted as filling in a table, where the levels are rows.
One of these methods is implied for any DP algorithm. If computations are repeated, the algorithm isn't a DP. So the answer to your question is "yes."
So an example... Let's try the problem of making change of c cents given you have coins with values v_1, v_2, ... v_n, using a minimum number of coins.
Let N(c) be the minimum number of coins needed to make c cents. Then one recursive formulation is
N(c) = 1 + min_{i = 1..n} N(c - v_i)
The base cases are N(0)=0 and N(k)=inf for k<0.
To memoize this requires just a hash table mapping c to N(c).
In this case the "table" has only one dimension, which is easy to fill in. Say we have coins with values 1, 3, 5, then the N table starts with
N(0) = 0, the initial condition.
N(1) = 1 + min(N(1-1), N(1-3), N(1-5)) = 1 + min(0, inf, inf) = 1
N(2) = 1 + min(N(2-1), N(2-3), N(2-5)) = 1 + min(1, inf, inf) = 2
N(3) = 1 + min(N(3-1), N(3-3), N(3-5)) = 1 + min(2, 0, inf) = 1
You get the idea. You can always compute N(c) from N(d), d < c in this manner.
In this case, you need only remember the last 5 values because that's the biggest coin value. Most DPs are similar. Only a few rows of the table are needed to get the next one.
The table is k-dimensional for k independent variables in the recursive expression.
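Both approaches to the coin-change recurrence above can be sketched in a few lines of Python. The function names are illustrative only, and the coin set (1, 3, 5) is the one from the example. The memoized version caches N(c) per argument, while the table version fills N(0), N(1), ... in increasing order.

```python
import math
from functools import lru_cache

COINS = (1, 3, 5)

@lru_cache(maxsize=None)
def min_coins_memo(c):
    """Top-down: the recursive formula plus a cache keyed on c."""
    if c == 0:
        return 0
    if c < 0:
        return math.inf
    return 1 + min(min_coins_memo(c - v) for v in COINS)

def min_coins_table(c):
    """Bottom-up: fill a one-dimensional table from N(0) up to N(c)."""
    table = [0] + [math.inf] * c
    for amount in range(1, c + 1):
        table[amount] = 1 + min(
            (table[amount - v] for v in COINS if amount - v >= 0),
            default=math.inf,
        )
    return table[c]

print([min_coins_table(c) for c in range(4)])
```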
We think of a dynamic programming approach to a problem if it has
overlapping subproblems
optimal substructure
In very simple words we can say dynamic programming has two faces, they are top-down and bottom-up approaches.
In your case, it is a top-down approach if you are talking about the recursion.
In the top-down approach, we write a recursive (brute-force) solution and memoize the results, so that a stored result is reused when the same subproblem comes up again; in other words, it is brute-force + memoization. The brute-force part can be written as a simple recurrence relation.
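Applied to the Fibonacci function from the question, a top-down version might look like the sketch below. It uses Python's `functools.lru_cache` as the memo, which is one convenient choice and not the only one. The cache statistics show that each subproblem is computed only once.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    """Recursive Fibonacci with fib(1) = fib(2) = 1; results are cached per n."""
    if n in (1, 2):
        return 1
    return fib(n - 1) + fib(n - 2)

first_eight = [fib(i) for i in range(1, 9)]
print(first_eight)
print(fib.cache_info().misses)  # number of subproblems actually computed
```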

Resources