SQLite recalculates random() multiple times in subquery or CTE - sqlite

If I run the following query:
WITH cte AS(SELECT random() AS rand)
SELECT rand,rand FROM cte;
the value rand is calculated once, and the same value appears twice in the result.
If I run that with a table:
DROP TABLE IF EXISTS data;
CREATE TABLE data(n int);
INSERT INTO data(n) VALUES(1),(2),(3),(4),(5);
WITH cte AS(SELECT random() AS rand FROM data)
SELECT rand,rand FROM cte;
or, if you prefer, with a recursive CTE:
WITH
data AS (SELECT 1 AS n UNION ALL SELECT 1+n FROM data WHERE n<5),
cte AS(SELECT random() AS rand FROM data)
SELECT rand,rand FROM cte;
… the value for rand is recalculated for every reference, and so each row in the result set has two different values. See http://sqlfiddle.com/#!5/73fd4/1
I expected random() to be recalculated for each row, but I didn’t expect the rand value to be recalculated again for each reference outside the CTE.
I don’t think this is standard behaviour, and it certainly isn’t how PostgreSQL, SQL Server and MySQL work.
How do I get SQLite to calculate the rand value only once per iteration?

I can offer you the following workaround which behaves the way you want:
WITH data AS (
SELECT 1 AS n, random() AS rand
UNION ALL SELECT 1 + n, random()
FROM data WHERE n < 5
)
SELECT rand, rand FROM data;
This approach just places the calls to random() directly inside your recursive CTE. For whatever reason, SQLite does not try to get too clever and cache away the initial call to random(): each recursive step gets a fresh value, and each row keeps the value it was generated with.
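If you are on SQLite 3.35.0 or later, the MATERIALIZED hint may be another option (a sketch, using the data table from the question; I have not verified it on every version): it asks SQLite to evaluate the CTE once into a transient table, so both references to rand read the stored column instead of re-evaluating random().
WITH cte AS MATERIALIZED (
SELECT random() AS rand FROM data -- evaluated once per row of data
)
SELECT rand, rand FROM cte; -- both references read the stored value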

Related

SQLite RANDOM() function in CTE

I found some behavior of the RANDOM() function in SQLite that doesn't seem correct.
I want to generate random groups using RANDOM() and CASE. However, it looks like the CTE is not behaving correctly.
First, let's create a table
DROP TABLE IF EXISTS tt10ROWS;
CREATE TEMP TABLE tt10ROWS (
some_int INTEGER);
INSERT INTO tt10ROWS VALUES
(1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
SELECT * FROM tt10ROWS;
Incorrect behaviour
WITH
-- 2.a add columns with random number and save in CTE
STEP_01 AS (
SELECT
*,
ABS(RANDOM()) % 4 + 1 AS RAND_1_TO_4
FROM tt10ROWS)
-- 2.b - get random group
select
*,
CASE
WHEN RAND_1_TO_4 = 1 THEN 'GROUP_01'
WHEN RAND_1_TO_4 = 2 THEN 'GROUP_02'
WHEN RAND_1_TO_4 = 3 THEN 'GROUP_03'
WHEN RAND_1_TO_4 = 4 THEN 'GROUP_04'
END AS GROUP_IT
from STEP_01;
With this query we get a result that has plausible values in the RAND_1_TO_4 column, but the GROUP_IT column is incorrect: the groups don't match the numbers, and some groups are even missing (NULL).
Correct behaviour
I found a workaround for this problem by creating a temporary table instead of using a CTE. It helped.
-- 1.a - add column with random number 1-4 and save as TEMP TABLE
drop table if exists ttSTEP01;
CREATE TEMP TABLE ttSTEP01 AS
SELECT
*,
ABS(RANDOM()) % 4 + 1 AS RAND_1_TO_4
FROM tt10ROWS;
-- 1.b - get random group
select
*,
CASE
WHEN RAND_1_TO_4 = 1 THEN 'GROUP_01'
WHEN RAND_1_TO_4 = 2 THEN 'GROUP_02'
WHEN RAND_1_TO_4 = 3 THEN 'GROUP_03'
WHEN RAND_1_TO_4 = 4 THEN 'GROUP_04'
END AS GROUP_IT
from ttSTEP01;
QUESTION
What is the reason behind this behaviour, where the GROUP_IT column is not generated properly?
If you look at the bytecode generated by the incorrect query using EXPLAIN, you'll see that every time the RAND_1_TO_4 column is referenced, its value is recalculated and a new random number is used (I suspect, but am not 100% sure, that this has something to do with random() being a non-deterministic function). The NULL values appear when none of the CASE tests end up being true.
When you insert into a temporary table and then use that for the rest, the values of course remain static and it works as expected.
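For reference, a minimal way to inspect this yourself is to prefix the query with EXPLAIN (a sketch; output omitted, and the exact opcodes vary by SQLite version, but each place RANDOM() ends up being evaluated typically shows up as its own function call in the listing):
EXPLAIN
WITH STEP_01 AS (
SELECT *, ABS(RANDOM()) % 4 + 1 AS RAND_1_TO_4 FROM tt10ROWS)
SELECT *,
CASE WHEN RAND_1_TO_4 = 1 THEN 'GROUP_01' END AS GROUP_IT -- CASE shortened just for inspection
FROM STEP_01;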

Sqlite rowid > too many levels of trigger recursion, but trigger works fine?

I have a table: MYTABLE(ID int);
I am using this trigger to generate N rows of ids in MYTABLE:
create trigger mytrigger after insert on MYTABLE
when new.id < 1000
begin
insert into MYTABLE select max(id)+1 from MYTABLE;
end;
insert into MYTABLE values (1);
It works fine; SQLite generates ids from 1 to 1000 for me.
But when I substitute:
when new.id < 1000
with larger number like:
when new.id < 10000000
I receive an error: too many levels of trigger recursion
Now my question is: what's the point of a trigger if it cannot handle at least a million iterations? Is there any way to solve that, or should I just go and insert each row myself? :)
Triggers are not meant to have an arbitrary level of recursion.
The mechanism for arbitrary recursion is the recursive common table expression:
INSERT INTO MyTable(id)
WITH RECURSIVE n(i) AS (
SELECT 1
UNION ALL
SELECT i + 1 FROM n WHERE i < 1000
)
SELECT i FROM n;
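The same pattern should scale to the ten million rows asked about in the question; a sketch with the larger bound (recursive CTEs are evaluated iteratively, so they are not subject to the trigger recursion limit):
INSERT INTO MyTable(id)
WITH RECURSIVE n(i) AS (
SELECT 1
UNION ALL
SELECT i + 1 FROM n WHERE i < 10000000 -- the bound from the question
)
SELECT i FROM n;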

Teradata Aggregation computation locally vs globally

I have two tables Table1 and Table2 with both tables having primary index as col1,col2,col3 and col4.
I join these two tables and do a group by on a set of columns which includes the PI of the tables.
Can someone tell me why, in the explain plan, I get "Aggregate Intermediate Results are computed globally"
rather than locally? My understanding is that when the GROUP BY columns contain all the PI columns,
the aggregate results are computed locally rather than globally.
select
A.col1
,A.col2
,A.col3
,A.col4
,col5
,col6
,col7
,col8
,col9
,SUM(col10)
,COUNT(col11)
from
table1 A
left outer join
table2 B
on A.col1 = B.col1
and A.col2 = B.col2
and A.col3 = B.col3
and A.col4 = B.col4
group by A.col1,A.col2,A.col3,A.col4,col5,col6,col7,col8,col9
Below is the Explain plan for the Query
1) First, we lock a distinct DATEBASE_NAME."pseudo table" for read on a
RowHash to prevent global deadlock for DATEBASE_NAME.S.
2) Next, we lock a distinct DATEBASE_NAME."pseudo table" for write on a
RowHash to prevent global deadlock for
DATEBASE_NAME.TARGET_TABLE.
3) We lock a distinct DATEBASE_NAME."pseudo table" for read on a RowHash
to prevent global deadlock for DATEBASE_NAME.E.
4) We lock DATEBASE_NAME.S for read, we lock
DATEBASE_NAME.TARGET_TABLE for write, and we lock
DATEBASE_NAME.E for read.
5) We do an all-AMPs JOIN step from DATEBASE_NAME.S by way of a RowHash
match scan with no residual conditions, which is joined to
DATEBASE_NAME.E by way of a RowHash match scan. DATEBASE_NAME.S and
DATEBASE_NAME.E are left outer joined using a merge join, with
condition(s) used for non-matching on left table ("(NOT
(DATEBASE_NAME.S.col1 IS NULL )) AND ((NOT
(DATEBASE_NAME.S.col2 IS NULL )) AND ((NOT
(DATEBASE_NAME.S.col3 IS NULL )) AND (NOT
(DATEBASE_NAME.S.col4 IS NULL ))))"), with a join condition of (
"(DATEBASE_NAME.S.col1 = DATEBASE_NAME.E.col1) AND
((DATEBASE_NAME.S.col2 = DATEBASE_NAME.E.col2) AND
((DATEBASE_NAME.S.col3 = DATEBASE_NAME.E.col3) AND
(DATEBASE_NAME.S.col4 = DATEBASE_NAME.E.col4 )))"). The input
table DATEBASE_NAME.S will not be cached in memory. The result goes
into Spool 3 (all_amps), which is built locally on the AMPs. The
result spool file will not be cached in memory. The size of Spool
3 is estimated with low confidence to be 675,301,664 rows (
812,387,901,792 bytes). The estimated time for this step is 3
minutes and 37 seconds.
6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
way of an all-rows scan , grouping by field1 (
DATEBASE_NAME.S.col1 ,DATEBASE_NAME.S.col2
,DATEBASE_NAME.S.col3 ,DATEBASE_NAME.S.col4
,DATEBASE_NAME.E.col5
,DATEBASE_NAME.S.col6 ,DATEBASE_NAME.S.col7
,DATEBASE_NAME.S.col8 ,DATEBASE_NAME.S.col9). Aggregate
Intermediate Results are computed globally, then placed in Spool 4.
The aggregate spool file will not be cached in memory. The size
of Spool 4 is estimated with low confidence to be 506,476,248 rows
(1,787,354,679,192 bytes). The estimated time for this step is 1
hour and 1 minute.
7) We do an all-AMPs MERGE into DATEBASE_NAME.TARGET_TABLE
from Spool 4 (Last Use). The size is estimated with low
confidence to be 506,476,248 rows. The estimated time for this
step is 33 hours and 12 minutes.
8) We spoil the parser's dictionary cache for the table.
9) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> No rows are returned to the user as the result of statement 1.
If you just use col1, col2, col3, col4 to aggregate, would it then be aggregated locally?
More details at this URL:
http://www.teradataforum.com/teradata/20040526_133730.htm
I believe that is because of the intermediate spool. You are grouping by columns from that spool, not from the original tables. I was able to compute the aggregate intermediate results locally using a volatile table.
Essentially what happened in this case is that I took the spool from step 5, gave it a name, and enforced a PI on it. Since the PI of the volatile table is the same as that of the initial tables, generating the volatile table is also a local AMP operation.
CREATE VOLATILE TABLE x AS
(
SELECT
A.col1
,A.col2
,A.col3
,A.col4
,col5
,col6
,col7
,col8
,col9
--,SUM(col10)
--,COUNT(col11)
from
table1 A
left outer join
table2 B
on A.col1 = B.col1
and A.col2 = B.col2
and A.col3 = B.col3
and A.col4 = B.col4
--group by A.col1,A.col2,A.col3,A.col4,col5,col6,col7,col8,col9
)
WITH DATA PRIMARY INDEX (col1, col2, col3, col4)
ON COMMIT PRESERVE ROWS -- keep the rows after the implicit commit so the follow-up SELECT can use them
;
SELECT
col1
,col2
,col3
,col4
,col5
,col6
,col7
,col8
,col9
,SUM(col10)
,COUNT(col11)
from
x
GROUP BY
col1,col2,col3,col4,col5,col6,col7,col8,col9

Select random row from a sqlite table

I have a sqlite table with the following schema:
CREATE TABLE foo (bar VARCHAR)
I'm using this table as storage for a list of strings.
How do I select a random row from this table?
Have a look at Selecting a Random Row from an SQLite Table
SELECT * FROM table ORDER BY RANDOM() LIMIT 1;
The following solutions are much faster than anktastic's (the count(*) costs a lot, but if you can cache it, then the difference shouldn't be that big), which itself is much faster than "order by random()" when you have a large number of rows, although they have a few drawbacks.
If your rowids are rather packed (i.e., few deletions), then you can do the following (using (select max(rowid) from foo)+1 instead of max(rowid)+1 gives better performance, as explained in the comments):
select * from foo where rowid = (abs(random()) % (select (select max(rowid) from foo)+1));
If you have holes, you will sometimes try to select a non-existent rowid, and the select will return an empty result set. If this is not acceptable, you can provide a default value like this:
select * from foo where rowid = (abs(random()) % (select (select max(rowid) from foo)+1)) or rowid = (select max(rowid) from foo) order by rowid limit 1;
This second solution isn't perfect: the probability is higher for the last row (the one with the highest rowid), but if you often add rows to the table, it will become a moving target and the distribution should be much better.
Yet another solution: if you often select random rows from a table with lots of holes, you might want to create a helper table whose gap-free rowids map to the rowids of the original table:
create table random_foo(foo_id);
Then, periodically, re-fill the table random_foo:
delete from random_foo;
insert into random_foo select rowid from foo;
And to select a random row, you can use my first method (there are no holes here). Of course, this last method has some concurrency problems, but rebuilding random_foo is a maintenance operation that's not likely to happen very often.
Yet another way, which I recently found on a mailing list, is to put a trigger on delete that moves the row with the biggest rowid into the rowid of the deleted row, so that no holes are left; a sketch follows.
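A minimal sketch of such a trigger, assuming the foo table from the question (the trigger name foo_fill_gap is made up; the WHEN clause skips the case where the deleted row was already the last one, so there is no hole to fill):
create trigger foo_fill_gap after delete on foo
when OLD.rowid < (select max(rowid) from foo) -- only act when a hole was actually created
begin
-- reuse the freed rowid for the current last row, keeping rowids dense
update foo set rowid = OLD.rowid
where rowid = (select max(rowid) from foo);
end;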
Lastly, note that the behavior of rowid and an INTEGER PRIMARY KEY AUTOINCREMENT is not identical (with rowid, when a new row is inserted, max(rowid)+1 is chosen, whereas it is highest-value-ever-seen+1 for an autoincrement primary key), so the last solution won't work with an autoincrement in random_foo, but the other methods will.
You need to put "order by RANDOM()" in your query.
Example:
select * from quest order by RANDOM();
Let's see a complete example.
Create a table:
CREATE TABLE quest (
id INTEGER PRIMARY KEY AUTOINCREMENT,
quest TEXT NOT NULL,
resp_id INTEGER NOT NULL
);
Inserting some values:
insert into quest(quest, resp_id) values ('1024/4',6), ('256/2',12), ('128/1',24);
A default select:
select * from quest;
| id | quest  | resp_id |
|  1 | 1024/4 |       6 |
|  2 | 256/2  |      12 |
|  3 | 128/1  |      24 |
A select random:
select * from quest order by RANDOM();
| id | quest  | resp_id |
|  3 | 128/1  |      24 |
|  1 | 1024/4 |       6 |
|  2 | 256/2  |      12 |
Each time you select, the order will be different.
If you want to return only one row
select * from quest order by RANDOM() LIMIT 1;
| id | quest  | resp_id |
|  2 | 256/2  |      12 |
Each time you select, the returned row will be different.
What about:
SELECT COUNT(*) AS n FROM foo;
then choose a random number m in [0, n) and
SELECT * FROM foo LIMIT 1 OFFSET m;
You can even save the first number (n) somewhere and only update it when the database count changes. That way you don't have to do the SELECT COUNT every time.
Here is a modification of #ank's solution:
SELECT *
FROM table
LIMIT 1
OFFSET ABS(RANDOM()) % MAX((SELECT COUNT(*) FROM table), 1)
This solution also works for indices with gaps, because we randomize an offset in the range [0, count). MAX is used to handle the case of an empty table.
Here are simple test results on a table with 16k rows:
sqlite> .timer on
sqlite> select count(*) from payment;
16049
Run Time: real 0.000 user 0.000140 sys 0.000117
sqlite> select payment_id from payment limit 1 offset abs(random()) % (select count(*) from payment);
14746
Run Time: real 0.002 user 0.000899 sys 0.000132
sqlite> select payment_id from payment limit 1 offset abs(random()) % (select count(*) from payment);
12486
Run Time: real 0.001 user 0.000952 sys 0.000103
sqlite> select payment_id from payment order by random() limit 1;
3134
Run Time: real 0.015 user 0.014022 sys 0.000309
sqlite> select payment_id from payment order by random() limit 1;
9407
Run Time: real 0.018 user 0.013757 sys 0.000208
SELECT bar
FROM foo
ORDER BY Random()
LIMIT 1
I came up with the following solution for large sqlite3 databases:
SELECT * FROM foo WHERE rowid = abs(random()) % (SELECT max(rowid) FROM foo) + 1;
The abs(X) function returns the absolute value of the numeric argument
X.
The random() function returns a pseudo-random integer between
-9223372036854775808 and +9223372036854775807.
The operator % outputs the integer value of its left operand modulo its right operand.
Finally, you add +1 to prevent the rowid from being 0.
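A small worked illustration of the formula (the literals are made up for the example): if max(rowid) is 5 and random() happens to return -7, then abs(-7) % 5 + 1 = 2 + 1 = 3, so the row with rowid 3 is picked.
select abs(-7) % 5 + 1; -- 7 % 5 = 2, plus 1 gives 3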

SQLite - getting closest value

I have an SQLite database with a certain column of type "double".
I want to get the row whose value in this column is closest to a specified one.
For example, in my table I have:
id: 1; value: 47
id: 2; value: 56
id: 3; value: 51
And I want to get a row that has its value closest to 50. So I want to receive id: 3 (value = 51).
How can I achieve this goal?
Thanks.
Using an order-by, SQLite will scan the entire table and load all the values into a temporary b-tree to order them, making any index useless. This will be very slow and use a lot of memory on large tables:
explain query plan select * from 'table' order by abs(10 - value) limit 1;
0|0|0|SCAN TABLE table
0|0|0|USE TEMP B-TREE FOR ORDER BY
You can get the next lower or higher value using the index like this:
select min(value) from 'table' where value >= N;
select max(value) from 'table' where value <= N;
And you can use union to get both from a single query:
explain query plan
select min(value) from 'table' where value >= 10
union select max(value) from 'table' where value <= 10;
1|0|0|SEARCH TABLE table USING COVERING INDEX value_index (value>?)
2|0|0|SEARCH TABLE table USING COVERING INDEX value_index (value<?)
0|0|0|COMPOUND SUBQUERIES 1 AND 2 USING TEMP B-TREE (UNION)
This will be pretty fast even on large tables. You could simply load both values and evaluate them in your code, or use even more SQL to select one of them in various ways:
explain query plan select v from
( select min(value) as v from 'table' where value >= 10
union select max(value) as v from 'table' where value <= 10)
order by abs(10-v) limit 1;
2|0|0|SEARCH TABLE table USING COVERING INDEX value_index (value>?)
3|0|0|SEARCH TABLE table USING COVERING INDEX value_index (value<?)
1|0|0|COMPOUND SUBQUERIES 2 AND 3 USING TEMP B-TREE (UNION)
0|0|0|SCAN SUBQUERY 1
0|0|0|USE TEMP B-TREE FOR ORDER BY
or
explain query plan select 10+v from
( select min(value)-10 as v from 'table' where value >= 10
union select max(value)-10 as v from 'table' where value <= 10)
group by v having max(abs(v)) limit 1;
2|0|0|SEARCH TABLE table USING COVERING INDEX value_index (value>?)
3|0|0|SEARCH TABLE table USING COVERING INDEX value_index (value<?)
1|0|0|COMPOUND SUBQUERIES 2 AND 3 USING TEMP B-TREE (UNION)
0|0|0|SCAN SUBQUERY 1
0|0|0|USE TEMP B-TREE FOR GROUP BY
Since you're interested in values both arbitrarily greater and less than the target, you can't avoid doing two index searches. If you know that the target is within a small range, though, you could use "between" to only hit the index once:
explain query plan select * from 'table' where value between 9 and 11 order by abs(10-value) limit 1;
0|0|0|SEARCH TABLE table USING COVERING INDEX value_index (value>? AND value<?)
0|0|0|USE TEMP B-TREE FOR ORDER BY
This will be around 2x faster than the union query above when it only evaluates 1-2 values, but if you start having to load more data it will quickly become slower.
This should work:
SELECT * FROM table
ORDER BY ABS(? - value)
LIMIT 1
Where ? represents the value you want to compare against.
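For example, with the sample data from the question (and assuming the table is named 'table' as in the earlier answer), binding 50 for the parameter, or substituting it as a literal, should return the row with id 3, since |50 - 51| = 1 is the smallest difference:
SELECT * FROM 'table'
ORDER BY ABS(50 - value) -- 50 is the target value from the question
LIMIT 1;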
