I have a very long-running query (and several subsequent amateur attempts at improved queries) involving LEFT JOINs against a MariaDB database.
The engine is InnoDB.
The tables are pb_work (4 columns) and pb_instance (9 columns).
Both tables contain 1,000,000 rows (theoretically one work can have many instances; this is just for testing purposes).
All involved id columns are binary(16) (representing UUIDs).
The initial query was the following:
SELECT * FROM pb_work LEFT JOIN pb_instance pi on pb_work.id = pi.work_id;
(Takes ~1m24s)
Naively I tried to limit that, as my use case allows requesting things in batches:
SELECT * FROM pb_work LEFT JOIN pb_instance pi on pb_work.id = pi.work_id LIMIT 300;
(Takes ~55s; honestly this has me wondering why it is faster at all, as I thought the join happens before the limiting anyway)
Assuming the problem was too many rows coming from the first table, I tried:
SELECT * FROM
(SELECT * FROM pb_work LIMIT 300) as subq
LEFT JOIN pb_instance pi on subq.id = pi.work_id;
(Takes ~1m, which has me wondering why it's slower; executing the inner SELECT alone takes 94ms)
EXPLAIN output for the first and second versions:
id | select_type | table   | type | possible_keys        | key                  | key_len | ref           | rows    | extra
---|-------------|---------|------|----------------------|----------------------|---------|---------------|---------|------
1  | SIMPLE      | pb_work | ALL  | NULL                 | NULL                 | NULL    | NULL          | 1113950 |
1  | SIMPLE      | pi      | ALL  | IDX_CA4ED742BB3453DB | IDX_CA4ED742BB3453DB | NULL    | db.pb_work.id | 1       | Using where; Using join buffer (flat, BNL join)
EXPLAIN output for the third version:
id | select_type | table      | type | possible_keys        | key                  | key_len | ref     | rows    | extra
---|-------------|------------|------|----------------------|----------------------|---------|---------|---------|------
1  | PRIMARY     | <derived2> | ALL  | NULL                 | NULL                 | NULL    | NULL    | 300     |
1  | PRIMARY     | pi         | ref  | IDX_CA4ED742BB3453DB | IDX_CA4ED742BB3453DB | 16      | subq.id | 1       | Using where; Using join buffer (flat, BNL join)
2  | DERIVED     | pb_work    | ALL  | NULL                 | NULL                 | NULL    | NULL    | 1113950 |
There are indexes on those tables. Here is the output of SHOW INDEXES FROM ..., omitting unrelated indexes.
pb_work:
Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Ignored
-----------|----------|--------------|-------------|-----------|-------------|----------|--------|------|------------|---------|---------------|--------
0          | PRIMARY  | 1            | id          | A         | 0           | NULL     | NULL   |      | BTREE      |         |               | NO
pb_instance:
Non_unique | Key_name             | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Ignored
-----------|----------------------|--------------|-------------|-----------|-------------|----------|--------|------|------------|---------|---------------|--------
0          | PRIMARY              | 1            | id          | A         | 0           | NULL     | NULL   |      | BTREE      |         |               | NO
1          | IDX_CA4ED742BB3453DB | 1            | work_id     | A         | 0           | NULL     | NULL   |      | BTREE      |         |               | NO
I have very little knowledge about DBMSs, so I can't do much with that information.
Is there a way to speed this up?
Related
I have an application that I noticed suddenly decreased in speed after copying the live database to a test database. Both databases are on the same server, and the test database was created from a mysqldump of the live data, so they are identical and on the same instance.
When I EXPLAIN a slow query on the two databases, one is using indexes and the other is not. This is happening with more than one type of query, but I will show one example:
Here is the query I'm running:
SELECT * FROM product
INNER JOIN product_category pc
ON product.id = pc.product_id
INNER JOIN category c
ON c.id = pc.category_id
WHERE
(c.discount_amount > 0 OR c.discount_percent > 0)
AND (c.`discount_start_date` <= NOW() OR c.`discount_start_date` IS NULL)
AND (c.`discount_end_date` >= NOW() OR c.`discount_end_date` IS NULL)
Here is the EXPLAIN result from the live database:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | c | index_merge | PRIMARY,category_discount_start_date,category_discount_end_date,category_discount_amount,category_discount_percent | category_discount_amount,category_discount_percent | 8,8 | NULL | 10 | Using sort_union(category_discount_amount,category_discount_percent); Using where
1 | SIMPLE | pc | ref | category_id,product_id | category_id | 4 | lollipop_site.c.id | 19 |
1 | SIMPLE | product | eq_ref | PRIMARY | PRIMARY | 4 | lollipop_site.pc.product_id | 1 |
And here is the EXPLAIN result from the test database:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | product | ALL | PRIMARY | NULL | NULL | NULL | 1 |
1 | SIMPLE | pc | ALL | category_id,product_id | NULL | NULL | NULL | 1 | Using where; Using join buffer (flat, BNL join)
1 | SIMPLE | c | eq_ref | PRIMARY,category_discount_start_date,category_discount_end_date,category_discount_amount,category_discount_percent | PRIMARY | 4 | lollipop_sandbox.pc.category_id | 1 | Using where
I am using MariaDB inside Docker; the version is 10.7.3-MariaDB-1:10.7.3+maria~focal.
I'm hoping someone can shed some light on why the server is using a different query plan for the same query on the same data, just in different databases.
Note: this query previously used a WHERE id IN (SELECT product_id FROM... style query, and I converted it as recommended by other Stack Overflow answers. This install has a number of those queries, which are also having this problem.
I'm having a similar issue with a query not using primary keys, in my case on two different MariaDB servers. MariaDB is on exactly the same version, and the configuration is also the same. Here is my query:
select .....
from model_number mn
inner join manufacturer m on (mn.manufacturer_id = m.id)
inner join product_type pt on (mn.product_type_id = pt.id)
inner join user cu on cu.id = mn.created_by
inner join user uu on uu.id = mn.updated_by
inner join replacement r on (mn.id = r.model_number_by_id)
inner join mapping ma on r.model_number_by_id = ma.model_number_id and r.physical_item_type_id = ma.physical_item_type_id
where r.model_number_id = 1355
and r.physical_item_type_id = 4
Indexes (primary keys) are not used for m, pt, cu and uu.
The join order in the query plan is also different:
server using primary keys: r, mn, uu, pt, m, cu, ma
server not using primary keys: r, m, pt, cu, uu, mn, ma
I have no clue what is wrong.
Here is the query plan from the server where everything works as expected:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | r | ref | PRIMARY,fk_replacement_model_number_by,fk_replacement_physical_item_type | fk_replacement_physical_item_type | 8 | const,const | 3 | Using index
1 | SIMPLE | mn | eq_ref | PRIMARY,model_number_unq_idx2,fk_model_number_user_c,fk_model_number_user_u,fk_model_number_product_type | PRIMARY | 4 | vat_warehouse.r.model_number_by_id | 1 | Using where
1 | SIMPLE | uu | eq_ref | PRIMARY | PRIMARY | 1 | vat_warehouse.mn.updated_by | 1 |
1 | SIMPLE | pt | eq_ref | PRIMARY | PRIMARY | 4 | vat_warehouse.mn.product_type_id | 1 |
1 | SIMPLE | m | eq_ref | PRIMARY | PRIMARY | 4 | vat_warehouse.mn.manufacturer_id | 1 |
1 | SIMPLE | cu | eq_ref | PRIMARY | PRIMARY | 1 | vat_warehouse.mn.created_by | 1 |
1 | SIMPLE | ma | ref | old_mapping_uniq_idx,fk_old_mampping_physical_item_type | old_mapping_uniq_idx | 8 | vat_warehouse.r.model_number_by_id,const | 1 | Using index
And here is the query plan from the server where indexes are not used:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | r | ref | PRIMARY,fk_replacement_model_number_by,fk_replacement_physical_item_type | fk_replacement_physical_item_type | 8 | const,const | 1 | Using index
1 | SIMPLE | m | ALL | PRIMARY | NULL | NULL | NULL | 1 | Using join buffer (flat, BNL join)
1 | SIMPLE | pt | ALL | PRIMARY | NULL | NULL | NULL | 1 | Using join buffer (incremental, BNL join)
1 | SIMPLE | cu | ALL | PRIMARY | NULL | NULL | NULL | 1 | Using join buffer (incremental, BNL join)
1 | SIMPLE | uu | ALL | PRIMARY | NULL | NULL | NULL | 1 | Using join buffer (incremental, BNL join)
1 | SIMPLE | mn | eq_ref | PRIMARY,model_number_unq_idx2,fk_model_number_user_c,fk_model_number_user_u,fk_model_number_product_type | PRIMARY | 4 | vat_warehouse.r.model_number_by_id | 1 | Using where
1 | SIMPLE | ma | ref | old_mapping_uniq_idx,fk_old_mampping_physical_item_type | old_mapping_uniq_idx | 8 | vat_warehouse.r.model_number_by_id,const | 1 | Using index
We're using SQLite version 3.16.0.
I would like to create some views to simplify some common recursive operations I do on our schema. However, these views turn out to be significantly slower than running the SQL directly.
Specifically, a view to show me the ancestors for a given node:
CREATE VIEW ancestors AS
WITH RECURSIVE ancestors
(
leafid
, parentid
, name
, depth
)
AS
(SELECT id
, parentid
, name
, 1
FROM objects
UNION ALL
SELECT a.leafid
, f.parentid
, f.name
, a.depth + 1
FROM objects f
JOIN ancestors a
ON f.id = a.parentid
) ;
when used with this query:
SELECT *
FROM ancestors
WHERE leafid = 157609;
yields this result:
sele order from deta
---- ------------- ---- ----
2 0 0 SCAN TABLE objects
3 0 1 SCAN TABLE ancestors AS a
3 1 0 SEARCH TABLE objects AS f USING INTEGER PRIMARY KEY (rowid=?)
1 0 0 COMPOUND SUBQUERIES 0 AND 0 (UNION ALL)
0 0 0 SCAN SUBQUERY 1
Run Time: real 0.374 user 0.372461 sys 0.001483
Yet running the query directly (with a WHERE constraint on the initial query for the same row) yields:
WITH RECURSIVE ancestors
(
leafid, parentid, name, depth
)
AS
(SELECT id, parentid , name, 1
FROM objects
WHERE id = 157609
UNION ALL
SELECT a.leafid, f.parentid , f.name, a.depth + 1
FROM objects f
JOIN ancestors a
ON f.id = a.parentid
)
SELECT *
FROM ancestors;
Run Time: real 0.021 user 0.000249 sys 0.000111
sele order from deta
---- ------------- ---- ----
2 0 0 SEARCH TABLE objects USING INTEGER PRIMARY KEY (rowid=?)
3 0 1 SCAN TABLE ancestors AS a
3 1 0 SEARCH TABLE objects AS f USING INTEGER PRIMARY KEY (rowid=?)
1 0 0 COMPOUND SUBQUERIES 0 AND 0 (UNION ALL)
0 0 0 SCAN SUBQUERY 1
The second result is around 15 times faster because we're using the PK index on objects to get the initial row, whereas the view seems to scan the entire table, filtering on leaf node only after the ancestors for all rows are found.
Is there any way to write the view such that a constraint applied by a consuming SELECT is pushed into the optimization of the initial query?
You are asking for the WHERE leafid = 157609 to be moved inside the first subquery. This is the push-down optimization, and SQLite tries to do it whenever possible.
However, this is possible only if the database is able to prove that the result is guaranteed to be the same. For this particular query, you know that the transformation would be valid, but at the moment there is no algorithm to make this proof for recursive CTEs.
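One workaround, if you still want a view, is to feed the starting row in through a small parameter table rather than through a WHERE on the view, so the anchor member of the CTE is already restricted. A minimal sketch, assuming the same objects schema (the param table and the view name are made up here):
CREATE TABLE param (leafid INTEGER);
CREATE VIEW ancestors_from_param AS
WITH RECURSIVE ancestors (leafid, parentid, name, depth) AS (
    SELECT o.id, o.parentid, o.name, 1
    FROM objects o
    JOIN param p ON o.id = p.leafid   -- anchor restricted via the parameter table
    UNION ALL
    SELECT a.leafid, f.parentid, f.name, a.depth + 1
    FROM objects f
    JOIN ancestors a ON f.id = a.parentid
)
SELECT * FROM ancestors;
-- usage: set the parameter, then select from the view
DELETE FROM param;
INSERT INTO param VALUES (157609);
SELECT * FROM ancestors_from_param;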
There is an overwhelming chance that this might be an incredibly stupid question, so bear with me :)
Over the last couple of weeks I have been learning and using SQLite for a project. I love the concept of keys, but there is one thing that I cannot wrap my head around.
How do you reference the foreign key when inserting a big dataset into the db? I'll give you an example:
I'm inserting, say, 300 rows of data, each row containing ("a","b","c","d","e","f","g"). Everything goes into the same table (original_table).
Now that I have my data in the db, I want to create another table (secondary_table) for the values "c". I then naturally want original_table to have a foreign key which links to secondary_table's primary key.
I understand that you could create the foreign key before inserting and then replace "c" with the corresponding integer before you insert. This, however, seems very inefficient, as you would have to replace huge amounts of data before inserting.
So my question is: how can I have the foreign key replace the text in an already created table?
Cheers
So my question is how can I have the foreign key replace the text in
an already created table?
Yes and no.
That is, you can replace column C with the reference to the secondary table (as has been done below, in addition to adding the newly suggested column), BUT without dropping the table you CANNOT redefine the column's attributes, and therefore cannot give it a type affinity of INTEGER (not really an issue) or specify that it has the FOREIGN KEY constraint.
A mass update is probably not an issue for something like 300 rows (it's not even done within a transaction here).
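If you really do want the INTEGER affinity and the FOREIGN KEY constraint on column C itself, the usual SQLite route is to rebuild the table. A minimal sketch, with a made-up original_table_new name, assuming secondary_table exists as created further below :-
CREATE TABLE original_table_new (
    A TEXT, B TEXT,
    C INTEGER REFERENCES secondary_table(id), -- redefined with INTEGER affinity and the FK
    D TEXT, E TEXT, F TEXT, G TEXT
);
INSERT INTO original_table_new
    SELECT A, B, (SELECT id FROM secondary_table WHERE c_value = C), D, E, F, G
    FROM original_table;
DROP TABLE original_table;
ALTER TABLE original_table_new RENAME TO original_table;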
How do you reference the foreign key when inserting a big dataset in
the db?
Here's the SQL showing how you could do this; but instead of trying to play around with column C, it adds a new column that effectively makes column C redundant. The new column has INTEGER type affinity and also has the FOREIGN KEY constraint applied.
300 rows is nothing; the example code uses 3000 rows, although column C only contains a short text value.
:-
-- Create the original table with column c having a finite number of values (0-25)
DROP TABLE IF EXISTS original_table;
CREATE TABLE IF NOT EXISTS original_table (A TEXT, B TEXT, C TEXT, D TEXT, E TEXT, F TEXT, G TEXT);
-- Load the original table with some data
WITH RECURSIVE counter(cola,colb,colc,cold,cole,colf,colg) AS (
SELECT random() % 26 AS cola, random() % 26 AS colb,abs(random() % 26) AS colc,random() % 26 AS cold,random() % 26 AS cole,random() % 26 AS colf,random() % 26 AS colg
UNION ALL
SELECT random() % 26 AS cola, random() % 26 AS colb,abs(random()) % 26 AS colc,random() % 26 AS cold,random() % 26 AS cole,random() % 26 AS colf,random() % 26 AS colg
FROM counter LIMIT 3000
)
INSERT INTO original_table SELECT * FROM counter;
SELECT * FROM original_table ORDER BY C ASC; -- Query 1 the original original_table
-- Create the secondary table by extracting values from the C column of the original table
DROP TABLE IF EXISTS secondary_table;
CREATE TABLE IF NOT EXISTS secondary_table (id INTEGER PRIMARY KEY, c_value TEXT);
INSERT INTO secondary_table (c_value) SELECT DISTINCT C FROM original_table ORDER BY C ASC;
SELECT * FROM secondary_table; -- Query 2 the new secondary table
-- Add the new column as a Foreign key to reference the new secondary_table
ALTER TABLE original_table ADD COLUMN secondary_table_reference INTEGER REFERENCES secondary_table(id);
SELECT * FROM original_table; -- Query 3 the altered original_table but without any references
-- Update the original table to apply the references to the secondary_table
UPDATE original_table
SET secondary_table_reference = (SELECT id FROM secondary_table WHERE c_value = C)
-- >>>>>>>>>> NOTE USE ONLY 1 OR NONE OF THE FOLLOWING 2 LINES <<<<<<<<<<
, C = null; -- OPTIONAL TO CLEAR COLUMN C
-- , C = (SELECT id FROM secondary_table WHERE c_value = C) -- ANOTHER OPTION SET C TO REFERENCE SECONDARY TABLE
;
SELECT * FROM original_table; -- Query 4 the final original table i.e. with references applied (column C now not needed)
Hopefully the comments explain things.
Results :-
Query 1 The original table without the secondary table :-
Query 2 The secondary table as generated from the original table :-
Query 3 The altered original_table without references applied :-
Query 4 The original table after application of references (applied to new column and old C column) :-
Timings (would obviously depend on numerous factors) :-
-- Create the original table with column c having a finite number of values (0-25)
DROP TABLE IF EXISTS original_table
> OK
> Time: 0.94s
CREATE TABLE IF NOT EXISTS original_table (A TEXT, B TEXT, C TEXT, D TEXT, E TEXT, F TEXT, G TEXT)
> OK
> Time: 0.353s
-- Load the original table with some data
WITH RECURSIVE counter(cola,colb,colc,cold,cole,colf,colg) AS (
SELECT random() % 26 AS cola, random() % 26 AS colb,abs(random() % 26) AS colc,random() % 26 AS cold,random() % 26 AS cole,random() % 26 AS colf,random() % 26 AS colg
UNION ALL
SELECT random() % 26 AS cola, random() % 26 AS colb,abs(random()) % 26 AS colc,random() % 26 AS cold,random() % 26 AS cole,random() % 26 AS colf,random() % 26 AS colg
FROM counter LIMIT 3000
)
INSERT INTO original_table SELECT * FROM counter
> Affected rows: 3000
> Time: 0.67s
SELECT * FROM original_table ORDER BY C ASC
> OK
> Time: 0.012s
-- Query 1 the original original_table
-- Create the secondary table by extracting values from the C column of the original table
DROP TABLE IF EXISTS secondary_table
> OK
> Time: 0.328s
CREATE TABLE IF NOT EXISTS secondary_table (id INTEGER PRIMARY KEY, c_value TEXT)
> OK
> Time: 0.317s
INSERT INTO secondary_table (c_value) SELECT DISTINCT C FROM original_table ORDER BY C ASC
> Affected rows: 26
> Time: 0.24s
SELECT * FROM secondary_table
> OK
> Time: 0s
-- Query 2 the new secondary table
-- Add the new column as a Foreign key to reference the new secondary_table
ALTER TABLE original_table ADD COLUMN secondary_table_reference INTEGER REFERENCES secondary_table(id)
> OK
> Time: 0.31s
SELECT * FROM original_table
> OK
> Time: 0.01s
-- Query 3 the altered original_table but without any references
-- Update the original table to apply the references to the secondary_table
UPDATE original_table
SET secondary_table_reference = (SELECT id FROM secondary_table WHERE c_value = C)
-- , C = null; -- OPTIONAL TO CLEAR COLUMN C
, C = (SELECT id FROM secondary_table WHERE c_value = C)
> Affected rows: 3000
> Time: 0.743s
SELECT * FROM original_table
> OK
> Time: 0.01s
-- Query 4 the final original table i.e. with references applied (column C now not needed)
> not an error
> Time: 0s
Supplementary Query
The following query utilises the combined tables :-
SELECT A,B,D,E,F,G, secondary_table.c_value FROM original_table JOIN secondary_table ON secondary_table_reference = secondary_table.id;
To result in :-
Note the data will not correlate with the previous results as this was run as a separate run and the data is generated randomly.
I have a database with a "num" table like this:
user_id | number | unix_time
-----------------------------
123 2 xxxxxxxx
123 40 xxxxxxxx
123 24 xxxxxxxx
333 23 xxxxxxxx
333 67 xxxxxxxx
854 90 xxxxxxxx
I'd like to select the last 5 numbers inserted by each user_id, but I can't figure out how to do it.
I tried:
SELECT b.n, a.user_id
FROM num a
JOIN num b on a.user_id = b.user_id
WHERE (
SELECT COUNT(*)
FROM num b2
WHERE b2.n <= b.n
AND b2.user_id = b.user_id
) <= 5
I am adapting the answer from (sql query - how to apply limit within group by).
I use "2" instead of "5" to make the effect visible within your sample data.
Note that I used actual dates instead of your "xxxxxxxx", assuming that you most likely mean "most recent 5" when you write "last 5"; that only works with actual times.
select * from toy a
where a.ROWID IN
( SELECT b.ROWID FROM toy b
WHERE b.user_id = a.user_id
ORDER by unix_time DESC
LIMIT 2
) ;
How it's done:
make on-the-fly tables (i.e. the part within the ()), one for each user_id: WHERE b.user_id = a.user_id
order each on-the-fly table separately (that is the first trick) by doing the ordering inside the (), chronologically backwards: ORDER by unix_time DESC
limit each on-the-fly table separately (that is the second trick) by doing the limiting inside the (), to 5 (2 in the example) entries: LIMIT 2
select everything from the actual table, select * from toy, but only those rows which occur in the union of all the on-the-fly tables: where a.ROWID IN (
introduce the distinguishing alias "a" for the whole-table view of the table: toy a
introduce the distinguishing alias "b" for the single-user_id view of the table: toy b
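For what it's worth, on SQLite 3.25 or newer the same per-user limit can be written with a window function instead of the correlated ROWID subquery. A minimal sketch against the same toy table (assuming a 3.25+ SQLite):
SELECT user_id, number, unix_time
FROM (
    SELECT user_id, number, unix_time,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY unix_time DESC) AS rn
    FROM toy
)
WHERE rn <= 2; -- 5 in your case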
By the way, here is the dump of what I used for testing (it is a convenient way of providing most of an MCVE):
BEGIN TRANSACTION;
CREATE TABLE toy (user_id int, number int, unix_time date);
INSERT INTO toy VALUES(123,2,'1970-01-01 05:33:20');
INSERT INTO toy VALUES(123,40,'1970-01-01 06:56:40');
INSERT INTO toy VALUES(123,24,'1970-01-01 08:20:00');
INSERT INTO toy VALUES(333,23,'1970-01-01 11:06:40');
INSERT INTO toy VALUES(333,67,'1970-01-01 12:30:00');
INSERT INTO toy VALUES(854,90,'1970-01-01 13:53:20');
COMMIT;
If you want to select the last 5 records from an SQLite database, then use the query:
SELECT * FROM table_name ORDER BY user_id DESC LIMIT 5;
Using this query you can select the last n transactions... Hope I helped you.
I found a good article on converting adjacency lists to nested sets at http://dataeducation.com/the-hidden-costs-of-insert-exec/
The SQL dialect used is Microsoft SQL Server's (I think), and I am trying to convert the examples given in the article to SQLite (as that is what I have easy access to on my MacBook).
The problem I appear to be having is converting the part of the overall CTE query that deals with the employee rows:
EmployeeRows AS
(
SELECT
EmployeeLevels.*,
ROW_NUMBER() OVER (ORDER BY thePath) AS Row
FROM EmployeeLevels
)
I converted this to:
EmployeeRows AS
(
SELECT
EmployeeLevels.*,
rowid AS Row
FROM EmployeeLevels
ORDER BY thePath
)
and the CTE query runs (no syntax errors), but the output I get is a table without the Row, Lft and Rgt columns populated:
ProductName ProductID ParentProductID TreePath HLevel Row Lft Rgt
----------- ---------- --------------- ---------- ---------- ---------- ---------- ----------
Baby Goods 0 0 1
Baby Food 10 0 0.10 2
All Ages Ba 100 10 0.10.100 3
Strawberry 200 100 0.10.100.2 4
Baby Cereal 250 100 0.10.100.2 4
Beginners 150 10 0.10.150 3
Formula Mil 300 150 0.10.150.3 4
Heinz Formu 310 300 0.10.150.3 5
Nappies 20 0 0.20 2
Small Pack 400 20 0.20.400 3
Bulk Pack N 450 20 0.20.450 3
I think the start of the problem is that Row is not getting populated, and therefore the Lft and Rgt columns do not get populated by the subsequent parts of the query.
Are there any SQLite experts out there who can tell me:
am I translating the rowid part of the query correctly?
does SQLite support a rowid in part of a CTE query?
is there a better way? :)
Any help appreciated :)
am I translating the rowid part of the query correctly
No.
The SQL:
SELECT
EmployeeLevels.*,
rowid AS Row
FROM EmployeeLevels
ORDER BY thePath
has Row defined as the rowid of the table EmployeeLevels in SQLite, ignoring the ORDER BY clause, which is different from the intention of ROW_NUMBER() OVER (ORDER BY thePath) AS Row.
does SQLite support a rowid in part of a CTE query?
Unfortunately, no. I assume you mean this:
WITH foo AS (
SELECT * FROM bar ORDER BY col_a
)
SELECT rowid, *
FROM foo
but SQLite will report that there is no such column rowid in foo.
is there a better way?
Not sure it is better, but at least it works. SQLite has a temp-table mechanism: a temp table exists as long as your connection is open and you haven't deliberately dropped it. Rewriting the SQL from my example above:
CREATE TEMP TABLE foo AS
SELECT * FROM bar ORDER BY col_a
;
SELECT rowid, *
FROM foo
;
DROP TABLE foo
;
This one will run without SQLite complaining.
Update:
As of SQLite version 3.25.0, window functions are supported. Hence you can use a row_number() over (order by x) expression in your CTE if you happen to be on a newer SQLite.
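On 3.25+, the CTE part from the article should therefore work essentially as written; roughly (assuming the surrounding EmployeeLevels CTE from the question is in scope):
EmployeeRows AS
(
    SELECT
        EmployeeLevels.*,
        ROW_NUMBER() OVER (ORDER BY thePath) AS Row -- a true row number, ordered by thePath
    FROM EmployeeLevels
)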