Why does sqlite rescan the table with distinct? - sqlite

I need to get the distinct elements of ref and alt. My query is very efficient until I add DISTINCT, at which point it rescans the base table. Since I have a temp table, shouldn't it simply use that as a source of data?
sqlite> explain query plan
...> select t1.ref, t1.alt from (SELECT * from Sample_szes where str_id = 'STR_832206') as t1;
selectid|order|from|detail
1|0|0|SEARCH TABLE vcfBase AS base USING INDEX vcfBase_strid_idx (str_id=?) (~10 rows)
1|1|1|SEARCH TABLE vcfhomozyg AS hzyg USING INDEX homozyg_strid_idx (str_id=?) (~10 rows)
2|0|0|SEARCH TABLE vcfBase AS base USING INDEX vcfBase_strid_idx (str_id=?) (~10 rows)
2|1|1|SEARCH TABLE vcfAlt AS alt USING INDEX vcfAlt_strid_idx (str_id=?) (~2 rows)
2|2|2|SEARCH TABLE altGT AS gt USING INDEX altGT_strid_idx (str_id=?) (~2 rows)
0|0|0|COMPOUND SUBQUERIES 1 AND 2 (UNION ALL)
Add distinct and it rescans the large base table.
sqlite> explain query plan
...> select distinct t1.ref, t1.alt from (SELECT * from Sample_szes where str_id = 'STR_832206') as t1;
selectid|order|from|detail
2|0|0|SCAN TABLE vcfBase AS base (~1000000 rows)
2|1|1|SEARCH TABLE vcfhomozyg AS hzyg USING INDEX homozyg_strid_idx (str_id=?) (~10 rows)
3|0|0|SCAN TABLE vcfBase AS base (~1000000 rows)
3|1|1|SEARCH TABLE vcfAlt AS alt USING INDEX vcfAlt_strid_idx (str_id=?) (~2 rows)
3|2|2|SEARCH TABLE altGT AS gt USING INDEX altGT_strid_idx (str_id=?) (~2 rows)
1|0|0|COMPOUND SUBQUERIES 2 AND 3 (UNION ALL)
0|0|0|SCAN SUBQUERY 1 (~1400000 rows)
0|0|0|USE TEMP B-TREE FOR DISTINCT

You should create a composite index on the ref and alt columns so that SQLite can use it for the DISTINCT. Otherwise a temporary B-tree (index) is created, which requires a full scan to sort the data for that index.
I believe the explanation is as follows, per the SQLite documentation:
If a SELECT query contains an ORDER BY, GROUP BY or DISTINCT clause,
SQLite may need to use a temporary b-tree structure to sort the output
rows. Or, it might use an index. Using an index is almost always much
more efficient than performing a sort.
If a temporary b-tree is required, a record is added to the EXPLAIN
QUERY PLAN output with the "detail" field set to a string value of the
form "USE TEMP B-TREE FOR xxx", where xxx is one of "ORDER BY", "GROUP
BY" or "DISTINCT". For example:
sqlite> EXPLAIN QUERY PLAN SELECT c, d FROM t2 ORDER BY c;
QUERY PLAN
|--SCAN TABLE t2
`--USE TEMP B-TREE FOR ORDER BY
In this case using the temporary b-tree can be avoided by creating an
index on t2(c), as follows:
sqlite> CREATE INDEX i4 ON t2(c);
sqlite> EXPLAIN QUERY PLAN SELECT c, d FROM t2 ORDER BY c;
QUERY PLAN
`--SCAN TABLE t2 USING INDEX i4
EXPLAIN QUERY PLAN - 1.2. Temporary Sorting B-Trees
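The effect described in the quoted documentation can be checked from Python's built-in sqlite3 module. The following is a minimal sketch, not the asker's setup: the table `calls`, its columns, and the index name `calls_ref_alt_idx` are hypothetical stand-ins, and the plan text varies between SQLite versions. It only shows a plain table; whether the same holds for a view built on UNION ALL is not something this sketch demonstrates.

```python
import sqlite3


def plan_details(conn, sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN for the given SQL."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
# Hypothetical stand-in table; this is NOT the schema from the question.
conn.execute("CREATE TABLE calls (str_id TEXT, ref TEXT, alt TEXT)")

query = "SELECT DISTINCT ref, alt FROM calls"

# Without a suitable index, SQLite has to build a temporary b-tree.
plan_without_index = plan_details(conn, query)

# With a composite index on the DISTINCT columns, the index order can be used.
conn.execute("CREATE INDEX calls_ref_alt_idx ON calls (ref, alt)")
plan_with_index = plan_details(conn, query)

print("SQLite library version:", sqlite3.sqlite_version)
print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

Printing `sqlite3.sqlite_version` alongside the plans makes it easier to compare results across machines, since the planner changes from release to release.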

I think I might have found the answer. On my Mac I have the following version of SQLite, and there the same DISTINCT query searches vcfBase through its index instead of scanning it:
SQLite version 3.19.3 2017-06-27 16:48:08
sqlite> explain query plan
...> select distinct t1.ref, t1.alt from (SELECT * from Sample_szes where str_id = 'STR_832206') as t1;
2|0|1|SEARCH TABLE vcfhomozyg AS hzyg USING INDEX homozyg_strid_idx (str_id=?)
2|1|0|SEARCH TABLE vcfBase AS base USING INDEX vcfBase_strid_idx (str_id=?)
3|0|1|SEARCH TABLE vcfAlt AS alt USING INDEX vcfAlt_strid_idx (str_id=?)
3|1|0|SEARCH TABLE vcfBase AS base USING INDEX vcfBase_strid_idx (str_id=?)
3|2|2|SEARCH TABLE altGT AS gt USING INDEX altGT_strid_idx (str_id=?)
1|0|0|COMPOUND SUBQUERIES 2 AND 3 (UNION ALL)
0|0|0|SCAN SUBQUERY 1
0|0|0|USE TEMP B-TREE FOR DISTINCT

Related

Trying to get one Column from random query in sqlite

Hello I have the following query in SQLite. ISBN is a text variable.
insert into BOOK_ORDER
SELECT OrderID FROM tableOrder WHERE OrderID = 1 UNION SELECT ISBN FROM BOOK ORDER BY random() LIMIT 1;
I am trying to add two columns together
However I get an error:
1st ORDER BY term does not match any column in the result set:
insert into BOOK_ORDER
SELECT OrderID FROM tableOrder WHERE OrderID = 1 UNION SELECT ISBN FROM BOOK
ORDER BY random() LIMIT 1;
I want to have a two column result table:
OrderID ISBN
4 192374125
EDIT:
I think I need to use a cross join, can someone help me?
UNION adds rows, not columns (and it removes duplicate rows).
To get more columns, you have to list all of them in the SELECT clause.
In this case, you can use subqueries:
insert into BOOK_ORDER
VALUES ((SELECT OrderID FROM tableOrder WHERE OrderID = 1),
(SELECT ISBN FROM BOOK ORDER BY random() LIMIT 1));
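As a rough illustration, the subquery form above can be run against an in-memory database from Python. The column types and the sample rows are assumptions made up for this sketch; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema and sample data; the question does not show the real definitions.
conn.executescript("""
    CREATE TABLE tableOrder (OrderID INTEGER PRIMARY KEY);
    CREATE TABLE BOOK (ISBN TEXT);
    CREATE TABLE BOOK_ORDER (OrderID INTEGER, ISBN TEXT);
    INSERT INTO tableOrder VALUES (1);
    INSERT INTO BOOK VALUES ('192374125'), ('192374126'), ('192374127');
""")

# Each scalar subquery supplies one column of the single inserted row.
conn.execute("""
    INSERT INTO BOOK_ORDER
    VALUES ((SELECT OrderID FROM tableOrder WHERE OrderID = 1),
            (SELECT ISBN FROM BOOK ORDER BY random() LIMIT 1))
""")

rows = conn.execute("SELECT OrderID, ISBN FROM BOOK_ORDER").fetchall()
print(rows)  # one row: order 1 paired with a randomly chosen ISBN
```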

how to query by row number in oracle

Is there any way to query directly by row number in a table in Oracle? In other words, to achieve the same effect of ordinary lookup in an array in some basic language like C or Java. I've not yet tried virtual columns.
For instance, the following is an example of an efficient query, but it wastes disk space:
create table ary (row_position_id number(10) NOT NULL,
datum binary_float NOT NULL);
declare i pls_integer;
begin
for i in 0..10000000
loop
insert into ary values (i, dbms_random.normal());
end loop;
commit;
end;
create unique index ary_rp on ary(row_position_id);
Now I'm going to create a set of query values to store in another "parameter" table:
create table query_values (qval number(10) NOT NULL);
declare i pls_integer;
begin
for i in 0..10000
loop
insert into query_values values (abs(mod(dbms_random.random, 10000000)));
end loop;
commit;
end;
Now, having these query values, I'm going to query the original table:
select d.* from ary d where exists (select 0 from query_values v
where d.row_position_id = v.qval);
Now, this query would be fine -- it would use INDEX UNIQUE SCAN and TABLE access by ROWID. The problem I have is that the row_position_id takes up as much space in the table blocks as the actual data (the DATUM column).
I am aware of Index-organized tables and also Virtual Columns (which cannot be used with IOTs). And, of course, things like ROWNUM and ROW_NUMBER are irrelevant here (unless I'm misunderstanding something).
Also worth pointing out, this table is static data -- once loaded, it will never change. I would likely do an ALTER TABLE ARY READ ONLY;
What I would really like is:
create table ary (datum binary_float not null);
-- load rows in a specific order
-- efficiently query this table by implicit row position
Thanks very much!
Henry
I think you're going to want to keep the extra column. Here's why:
As you said, ROWNUM and ROW_NUMBER are not applicable here because they are generated as rows are returned in the query; they will not tell you anything about insert order.
What about ROWID? ROWID is just where a row is stored - again, from the docs:
The data object number of the object
The data block in the data file in which the row resides
The position of the row in the data block (first row is 0)
The data file in which the row resides (first file is 1). The file number is relative to the tablespace.
The "position in the data block" sounds interesting, but you would have no idea in what order the data blocks were filled (Oracle could use whatever data blocks it can quickly make use of), so this would not be a reliable option. Even so, you would have to parse ROWIDs, which are not human readable (e.g. in 12g they look like this: *BAGAASMCwQL+ ).
Another option is ORA_ROWSCN which is interesting in that it does give you some idea of order, in terms of the system change number. However, it doesn't come for free. Just to start, you have to create your table with the ROWDEPENDENCIES option and as per docs:
ROWDEPENDENCIES Specify ROWDEPENDENCIES if you want to enable
row-level dependency tracking. This setting is useful primarily to
allow for parallel propagation in replication environments. It
increases the size of each row by 6 bytes.
The other catch with this is that you would have to follow each inserted row with a commit so that each row gets a different SCN.
If you're willing to go this far, you'll still have to convert the rows to have indexes (starting with, say, 0 or 1) that you can use to join to other tables.
Here's a quick sample of what it would involve:
DROP TABLE temp;
CREATE TABLE temp
( a number(10)
, b varchar2(10)
)
ROWDEPENDENCIES
;
-- one commit after all rows
INSERT INTO temp VALUES (1, 'A');
INSERT INTO temp VALUES (2, 'B');
INSERT INTO temp VALUES (3, 'C');
INSERT INTO temp VALUES (4, 'D');
INSERT INTO temp VALUES (5, 'E');
INSERT INTO temp VALUES (6, 'F');
COMMIT;
SELECT X.*, ROWNUM
FROM (SELECT T.*
, ORA_ROWSCN
FROM TEMP T
ORDER BY ORA_ROWSCN
) x
;
A B ORA_ROWSCN ROWNUM
1 A 2272340 1
2 B 2272340 2
6 F 2272340 3
4 D 2272340 4
5 E 2272340 5
3 C 2272340 6
Whoops. Those rows are definitely not in the order they came in.
Now using one commit per row:
TRUNCATE TABLE temp;
INSERT INTO temp VALUES (1, 'A');
COMMIT;
INSERT INTO TEMP VALUES (2, 'B');
COMMIT;
INSERT INTO temp VALUES (3, 'C');
COMMIT;
INSERT INTO temp VALUES (4, 'D');
COMMIT;
INSERT INTO temp VALUES (5, 'E');
COMMIT;
INSERT INTO temp VALUES (6, 'F');
COMMIT;
SELECT X.*, ROWNUM
FROM (SELECT T.*
, ORA_ROWSCN
FROM TEMP T
ORDER BY ORA_ROWSCN
) x
;
A B ORA_ROWSCN ROWNUM
1 A 2272697 1
2 B 2272699 2
3 C 2272701 3
4 D 2272703 4
5 E 2272705 5
6 F 2272707 6
Better. But if you've got a significant number of rows it's not going to go in fast. (I think this is what you would do if you intentionally wanted to slow down your inserts. ;) )
I think that's about as good as you'll get trying to get around using your own column, BUT there is still hope to economize storage: you can do away with the table + index and just go with an index-organized table. It's basically an index that you query directly.
It's just this easy:
CREATE TABLE TEMP2
( A NUMBER(10)
, B VARCHAR2(10)
, CONSTRAINT PK_CONSTRAINT PRIMARY KEY (A)
)
ORGANIZATION INDEX
;
There are other parameters you'll want to consider for this as well, but for more info check out... the docs.

Select random row from a sqlite table

I have a sqlite table with the following schema:
CREATE TABLE foo (bar VARCHAR)
I'm using this table as storage for a list of strings.
How do I select a random row from this table?
Have a look at Selecting a Random Row from an SQLite Table
SELECT * FROM table ORDER BY RANDOM() LIMIT 1;
The following solutions are much faster than anktastic's (the count(*) costs a lot, but if you can cache it, the difference shouldn't be that big), which itself is much faster than "order by random()" when you have a large number of rows, although they have a few drawbacks.
If your rowids are rather packed (i.e. few deletions), then you can do the following (using (select max(rowid) from foo)+1 instead of max(rowid)+1 gives better performance, as explained in the comments):
select * from foo where rowid = (abs(random()) % (select (select max(rowid) from foo)+1));
If you have holes, you will sometimes try to select a nonexistent rowid, and the select will return an empty result set. If this is not acceptable, you can provide a default value like this:
select * from foo where rowid = (abs(random()) % (select (select max(rowid) from foo)+1)) or rowid = (select max(rowid) from foo) order by rowid limit 1;
This second solution isn't perfect : the distribution of probability is higher on the last row (the one with the highest rowid), but if you often add stuff to the table, it will become a moving target and the distribution of probabilities should be much better.
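If the skew towards the last row matters, one alternative is to retry in application code whenever the random rowid falls into a hole. This is a swapped-in variant rather than the SQL above, sketched with Python's sqlite3 module; the helper name `random_row_by_rowid` and the `max_tries` parameter are hypothetical, and the sample data is made up.

```python
import random
import sqlite3


def random_row_by_rowid(conn, max_tries=100):
    """Pick a random existing row of foo by guessing rowids (hypothetical helper).

    Returns (rowid, bar), or None if the table is empty or every guess hit a hole.
    """
    max_rowid = conn.execute("SELECT max(rowid) FROM foo").fetchone()[0]
    if max_rowid is None:  # empty table
        return None
    for _ in range(max_tries):
        candidate = random.randint(1, max_rowid)  # inclusive on both ends
        row = conn.execute(
            "SELECT rowid, bar FROM foo WHERE rowid = ?", (candidate,)
        ).fetchone()
        if row is not None:  # a miss means the rowid was deleted; try again
            return row
    return None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (bar VARCHAR)")
conn.executemany(
    "INSERT INTO foo (bar) VALUES (?)", [(f"s{i}",) for i in range(1, 11)]
)
conn.execute("DELETE FROM foo WHERE rowid IN (2, 5, 9)")  # create some holes

row = random_row_by_rowid(conn)
print(row)
```

Each lookup is a primary-key search, so the cost depends on how many guesses miss; with many holes the retry count grows and the other approaches in this answer become more attractive.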
Yet another solution, if you often select random stuff from a table with lots of holes, then you might want to create a table that contains the rows of the original table sorted in random order :
create table random_foo(foo_id);
Then, periodically, re-fill the table random_foo:
delete from random_foo;
insert into random_foo select id from foo;
And to select a random row, you can use my first method (there are no holes here). Of course, this last method has some concurrency problems, but the rebuilding of random_foo is a maintenance operation that's not likely to happen very often.
Yet, yet another way, that I recently found on a mailing list, is to put a trigger on delete to move the row with the biggest rowid into the current deleted row, so that no holes are left.
Lastly, note that the behavior of rowid and an integer primary key autoincrement is not identical (with rowid, when a new row is inserted, max(rowid)+1 is chosen, whereas it is highest-value-ever-seen+1 for a primary key), so the last solution won't work with an autoincrement in random_foo, but the other methods will.
You need to put "order by RANDOM()" on your query.
Example:
select * from quest order by RANDOM();
Let's see a complete example.
Create a table:
CREATE TABLE quest (
id INTEGER PRIMARY KEY AUTOINCREMENT,
quest TEXT NOT NULL,
resp_id INTEGER NOT NULL
);
Inserting some values:
insert into quest(quest, resp_id) values ('1024/4',6), ('256/2',12), ('128/1',24);
A default select:
select * from quest;
| id | quest | resp_id |
1 1024/4 6
2 256/2 12
3 128/1 24
--
A select random:
select * from quest order by RANDOM();
| id | quest | resp_id |
3 128/1 24
1 1024/4 6
2 256/2 12
--*Each time you select, the order will be different.
If you want to return only one row
select * from quest order by RANDOM() LIMIT 1;
| id | quest | resp_id |
2 256/2 12
--*Each time you select, the return will be different.
What about:
SELECT COUNT(*) AS n FROM foo;
then choose a random number m in [0, n) and
SELECT * FROM foo LIMIT 1 OFFSET m;
You can even save the first number (n) somewhere and only update it when the database count changes. That way you don't have to do the SELECT COUNT every time.
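The two steps above can be sketched in application code as follows. This is a minimal illustration using Python's sqlite3 module; the helper name `random_row_by_offset` and the sample rows are assumptions for the example, not part of the answer.

```python
import random
import sqlite3


def random_row_by_offset(conn):
    """Return one random row of foo via COUNT(*) and OFFSET, or None if empty."""
    n = conn.execute("SELECT COUNT(*) FROM foo").fetchone()[0]
    if n == 0:
        return None
    m = random.randrange(n)  # uniform integer in [0, n)
    return conn.execute("SELECT bar FROM foo LIMIT 1 OFFSET ?", (m,)).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (bar VARCHAR)")
conn.executemany("INSERT INTO foo (bar) VALUES (?)", [("a",), ("b",), ("c",)])

picked = random_row_by_offset(conn)
print(picked)
```

Because the offset counts rows rather than rowids, gaps left by deletions do not matter here; the trade-off is that SQLite has to step over the first m rows to reach the chosen one.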
Here is a modification of @ank's solution:
SELECT *
FROM table
LIMIT 1
OFFSET ABS(RANDOM()) % MAX((SELECT COUNT(*) FROM table), 1)
This solution also works for indices with gaps, because we randomize an offset in a range [0, count). MAX is used to handle a case with empty table.
Here are simple test results on a table with 16k rows:
sqlite> .timer on
sqlite> select count(*) from payment;
16049
Run Time: real 0.000 user 0.000140 sys 0.000117
sqlite> select payment_id from payment limit 1 offset abs(random()) % (select count(*) from payment);
14746
Run Time: real 0.002 user 0.000899 sys 0.000132
sqlite> select payment_id from payment limit 1 offset abs(random()) % (select count(*) from payment);
12486
Run Time: real 0.001 user 0.000952 sys 0.000103
sqlite> select payment_id from payment order by random() limit 1;
3134
Run Time: real 0.015 user 0.014022 sys 0.000309
sqlite> select payment_id from payment order by random() limit 1;
9407
Run Time: real 0.018 user 0.013757 sys 0.000208
SELECT bar
FROM foo
ORDER BY Random()
LIMIT 1
I came up with the following solution for the large sqlite3 databases:
SELECT * FROM foo WHERE rowid = abs(random()) % (SELECT max(rowid) FROM foo) + 1;
The abs(X) function returns the absolute value of the numeric argument X.
The random() function returns a pseudo-random integer between -9223372036854775808 and +9223372036854775807.
The operator % outputs the integer value of its left operand modulo its right operand.
Finally, you add +1 to prevent rowid equal to 0.

SQLite - query involving 2 tables

I want to choose a row from a certain table and order the results basing on another table.
Here are my tables:
lang1_words:
word_id - word
statuses:
word_id - status
In each table word_id corresponds to a value in another table.
Here is my query:
SELECT statuses.word_id FROM statuses, lang1_words
WHERE statuses.status >= 0
ORDER BY lang1_words.word ASC
But it returns more than one row for the same word_id, and the results are not sorted alphabetically.
What is the problem with my query and how can I achieve my goal?
Thanks.
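You need to join the two tables. Without a join condition, the comma-separated FROM list pairs every row of statuses with every row of lang1_words (a cross join), which is why each word_id is repeated. One way of doing it is: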
SELECT statuses.word_id FROM
statuses JOIN lang1_words ON statuses.word_id = lang1_words.word_id
WHERE statuses.status >= 0
ORDER BY lang1_words.word ASC
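The difference between the two queries can be seen on a few rows. The sketch below uses Python's sqlite3 module; the column types and the sample words are assumptions, since the question only gives the table and column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed column types and made-up sample data.
conn.executescript("""
    CREATE TABLE lang1_words (word_id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE statuses (word_id INTEGER, status INTEGER);
    INSERT INTO lang1_words VALUES (1, 'pear'), (2, 'apple'), (3, 'mango');
    INSERT INTO statuses VALUES (1, 0), (2, 1), (3, -1);
""")

# Original query: no join condition, so every matching status row is paired
# with every word.
cross = conn.execute("""
    SELECT statuses.word_id FROM statuses, lang1_words
    WHERE statuses.status >= 0
    ORDER BY lang1_words.word ASC
""").fetchall()

# Joined query: each status row is paired only with its own word.
joined = conn.execute("""
    SELECT statuses.word_id FROM
    statuses JOIN lang1_words ON statuses.word_id = lang1_words.word_id
    WHERE statuses.status >= 0
    ORDER BY lang1_words.word ASC
""").fetchall()

print(len(cross))  # 2 qualifying statuses x 3 words = 6 rows
print(joined)      # word_id 2 ('apple') first, then word_id 1 ('pear')
```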

SQLite - getting closest value

I have SQLite database and I have in it certain column of type "double".
I want to get a row that has in this column value closest to a specified one.
For example, in my table I have:
id: 1; value: 47
id: 2; value: 56
id: 3; value: 51
And I want to get a row that has its value closest to 50. So I want to receive id: 3 (value = 51).
How can I achieve this goal?
Thanks.
Using an order-by, SQLite will scan the entire table and load all the values into a temporary b-tree to order them, making any index useless. This will be very slow and use a lot of memory on large tables:
explain query plan select * from 'table' order by abs(10 - value) limit 1;
0|0|0|SCAN TABLE table
0|0|0|USE TEMP B-TREE FOR ORDER BY
You can get the next lower or higher value using the index like this:
select min(value) from 'table' where value >= N;
select max(value) from 'table' where value <= N;
And you can use union to get both from a single query:
explain query plan
select min(value) from 'table' where value >= 10
union select max(value) from 'table' where value <= 10;
1|0|0|SEARCH TABLE table USING COVERING INDEX value_index (value>?)
2|0|0|SEARCH TABLE table USING COVERING INDEX value_index (value<?)
0|0|0|COMPOUND SUBQUERIES 1 AND 2 USING TEMP B-TREE (UNION)
This will be pretty fast even on large tables. You could simply load both values and evaluate them in your code, or use even more sql to select one in various ways:
explain query plan select v from
( select min(value) as v from 'table' where value >= 10
union select max(value) as v from 'table' where value <= 10)
order by abs(10-v) limit 1;
2|0|0|SEARCH TABLE table USING COVERING INDEX value_index (value>?)
3|0|0|SEARCH TABLE table USING COVERING INDEX value_index (value<?)
1|0|0|COMPOUND SUBQUERIES 2 AND 3 USING TEMP B-TREE (UNION)
0|0|0|SCAN SUBQUERY 1
0|0|0|USE TEMP B-TREE FOR ORDER BY
or
explain query plan select 10+v from
( select min(value)-10 as v from 'table' where value >= 10
union select max(value)-10 as v from 'table' where value <= 10)
group by v having max(abs(v)) limit 1;
2|0|0|SEARCH TABLE table USING COVERING INDEX value_index (value>?)
3|0|0|SEARCH TABLE table USING COVERING INDEX value_index (value<?)
1|0|0|COMPOUND SUBQUERIES 2 AND 3 USING TEMP B-TREE (UNION)
0|0|0|SCAN SUBQUERY 1
0|0|0|USE TEMP B-TREE FOR GROUP BY
Since you're interested in values both arbitrarily greater and less than the target, you can't avoid doing two index searches. If you know that the target is within a small range, though, you could use "between" to only hit the index once:
explain query plan select * from 'table' where value between 9 and 11 order by abs(10-value) limit 1;
0|0|0|SEARCH TABLE table USING COVERING INDEX value_index (value>? AND value<?)
0|0|0|USE TEMP B-TREE FOR ORDER BY
This will be around 2x faster than the union query above when it only evaluates 1-2 values, but if you start having to load more data it will quickly become slower.
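The option mentioned earlier, loading both neighbouring values and comparing them in your own code, might look like this in Python. The table name `measurements`, the index name, and the helper `closest_value` are hypothetical; the sample values are the ones from the question.

```python
import sqlite3


def closest_value(conn, target):
    """Return the stored value nearest to target, or None if the table is empty."""
    # Each query can be answered by a single search on the index on value.
    above = conn.execute(
        "SELECT min(value) FROM measurements WHERE value >= ?", (target,)
    ).fetchone()[0]
    below = conn.execute(
        "SELECT max(value) FROM measurements WHERE value <= ?", (target,)
    ).fetchone()[0]
    # min()/max() yield NULL (None) when no row lies on that side of the target.
    candidates = [v for v in (below, above) if v is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda v: abs(v - target))


conn = sqlite3.connect(":memory:")
# Hypothetical table and index names for this sketch.
conn.execute("CREATE TABLE measurements (id INTEGER PRIMARY KEY, value REAL)")
conn.execute("CREATE INDEX measurements_value_idx ON measurements (value)")
conn.executemany(
    "INSERT INTO measurements (id, value) VALUES (?, ?)",
    [(1, 47), (2, 56), (3, 51)],
)

nearest = closest_value(conn, 50)
print(nearest)  # 51.0, the value stored for id 3
```

On a tie, this sketch returns the lower value because it is listed first; a different tie-breaking rule would need an explicit comparison.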
This should work:
SELECT * FROM table
ORDER BY ABS(? - value)
LIMIT 1
Where ? represents the value you want to compare against.

Resources