Nested subquery is too slow - outer join equivalent? - sqlite

I'm collecting some basic statistics on our codebase and am trying to write a query against the following schema:
A files table holding all the files (synthetic primary key id, unique path, and a region column which holds who the file belongs to).
A file_stats table holding data for the files on a specific date (the primary key is the combination of date and file_id).
CREATE TABLE files (
id INT PRIMARY KEY,
path VARCHAR(255) NOT NULL UNIQUE,
region VARCHAR(4) CHECK (region IN ('NYK', 'LDN', 'CORE', 'TKY'))
);
CREATE TABLE file_stats (
date DATE NOT NULL,
file_id INT NOT NULL REFERENCES files,
num_lines INT NOT NULL,
CONSTRAINT file_stats__pk PRIMARY KEY(date, file_id)
);
I'm trying to create a query which will return all combinations of dates and regions in the tables and the number of files for that combination.
The simple approach of
SELECT date, region, COUNT(*) FROM file_stats fs, files f WHERE fs.file_id = f.id
GROUP BY date, region
doesn't work, as not all regions are represented on all dates.
I've tried
SELECT
d.date,
r.region,
(SELECT COUNT(*) FROM file_stats fs, files f
WHERE fs.file_id = f.id AND fs.date = d.date AND f.region = r.region
) AS num_files
FROM
(SELECT DISTINCT date FROM file_stats) AS d,
(SELECT DISTINCT region FROM files) AS r
but the performance is unacceptable because of the nested subquery.
I've tried LEFT OUTER JOINS, but never seem to be able to make them work.
The database is SQLITE
Can anyone suggest a better query?

SELECT date, region, COUNT(*) FROM file_stats fs, files f WHERE fs.file_id = f.id
GROUP BY date, region
doesn't work as not all regions are represented at all dates.
Assuming you mean it works correctly, but you need all the dates to show whether a region might appear there or not, then you need two things.
A calendar table.
A left join on the calendar table.
After you have a calendar table, something like this . . .
SELECT c.cal_date, f.region, COUNT(f.id)
FROM calendar c
LEFT JOIN file_stats fs ON (fs.date = c.cal_date)
LEFT JOIN files f ON (fs.file_id = f.id)
GROUP BY c.cal_date, f.region
I used cal_date above. The name you use depends on your calendar table. This will get you started. You can use a spreadsheet to generate the dates.
CREATE TABLE calendar (cal_date date primary key);
INSERT INTO "calendar" VALUES('2011-01-01');
INSERT INTO "calendar" VALUES('2011-01-02');
INSERT INTO "calendar" VALUES('2011-01-03');
INSERT INTO "calendar" VALUES('2011-01-04');
INSERT INTO "calendar" VALUES('2011-01-05');
INSERT INTO "calendar" VALUES('2011-01-06');
INSERT INTO "calendar" VALUES('2011-01-07');
INSERT INTO "calendar" VALUES('2011-01-08');
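Instead of a spreadsheet, the dates can also be generated inside SQLite with a recursive common table expression. The following is only a sketch: it drives SQLite from Python's standard sqlite3 module (an assumption about the environment), and the helper name build_calendar and the date range are made up for illustration.

```python
import sqlite3


def build_calendar(conn, first_day, last_day):
    """Fill calendar(cal_date) with one row per day from first_day to last_day."""
    conn.execute("CREATE TABLE IF NOT EXISTS calendar (cal_date DATE PRIMARY KEY)")
    conn.execute(
        """
        WITH RECURSIVE days(d) AS (
            SELECT date(:first)
            UNION ALL
            -- keep adding one day until the last day has been produced
            SELECT date(d, '+1 day') FROM days WHERE d < date(:last)
        )
        INSERT OR IGNORE INTO calendar (cal_date)
        SELECT d FROM days
        """,
        {"first": first_day, "last": last_day},
    )


conn = sqlite3.connect(":memory:")
build_calendar(conn, "2011-01-01", "2011-01-08")
calendar_rows = [row[0] for row in
                 conn.execute("SELECT cal_date FROM calendar ORDER BY cal_date")]
print(calendar_rows)
```

INSERT OR IGNORE makes the helper safe to call again with an overlapping range.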
If you're certain that all the dates are in file_stats, you can do without a calendar table. But there are some cautions.
select fs.date, f.region, count(*)
from file_stats fs
left join files f on (f.id = fs.file_id)
group by fs.date, f.region;
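This will work if your data is right, but your tables don't guarantee the data will be right. file_stats.file_id is declared with REFERENCES files, but SQLite only enforces foreign keys when PRAGMA foreign_keys is switched on for the connection, so file_stats can contain file id numbers that don't exist in files. Nothing requires every file to have a row in file_stats, either. Let's have some sample data.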
insert into files values (1, 'a long path', 'NYK');
insert into files values (2, 'another long path', 'NYK');
insert into files values (3, 'a shorter long path', 'LDN'); -- not in file_stats
insert into file_stats values ('2011-01-01', 1, 35);
insert into file_stats values ('2011-01-02', 1, 37);
insert into file_stats values ('2011-01-01', 2, 40);
insert into file_stats values ('2011-01-01', 4, 35); -- not in files
Running this query (same as immediately above, but add ORDER BY) . . .
select fs.date, f.region, count(*)
from file_stats fs
left join files f on (f.id = fs.file_id)
group by fs.date, f.region
order by fs.date, f.region;
. . . returns
2011-01-01||1
2011-01-01|NYK|2
2011-01-02|NYK|1
'LDN' doesn't show, because there's no row in file_stats with file id number 3. One row has a null region, because no row in files has file id number 4.
You can quickly find mismatched rows with a left join.
select f.id, fs.file_id
from files f
left join file_stats fs on (fs.file_id = f.id)
where fs.file_id is null;
returns
3|
meaning that there's a row in files that has id 3, but no row in file_stats that has file_id 3. Swap the tables around to find the rows in file_stats that have no matching row in files.
select fs.file_id, f.id
from file_stats fs
left join files f on (fs.file_id = f.id)
where f.id is null;
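Orphan rows such as file id 4 can also be rejected at insert time. SQLite leaves foreign key enforcement off unless it is enabled per connection with PRAGMA foreign_keys. The sketch below assumes Python's standard sqlite3 module as the client and an SQLite build with foreign key support compiled in; the variable name orphan_rejected is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Enforcement is off by default and must be enabled on every new connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE files (id INT PRIMARY KEY, "
    "path VARCHAR(255) NOT NULL UNIQUE, region VARCHAR(4))"
)
conn.execute(
    """CREATE TABLE file_stats (
           date DATE NOT NULL,
           file_id INT NOT NULL REFERENCES files,
           num_lines INT NOT NULL,
           PRIMARY KEY (date, file_id))"""
)
conn.execute("INSERT INTO files VALUES (1, 'a long path', 'NYK')")
conn.execute("INSERT INTO file_stats VALUES ('2011-01-01', 1, 35)")

try:
    # File id 4 does not exist in files, so this row is an orphan.
    conn.execute("INSERT INTO file_stats VALUES ('2011-01-01', 4, 35)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```

This only stops file_stats rows without a file. Files without any file_stats rows remain legal and still need the left join check above.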

One way of doing what you want (slower, because of the cost of the second half) is a UNION of the combinations that have a count with a manufactured list of the combinations that have a zero count:
-- Include the counts for date/region pairs that HAVE files
SELECT date, region, COUNT(*) as COUNT1
FROM file_stats fs, files f
WHERE fs.file_id = f.id
GROUP BY date, region
UNION
SELECT DISTINCT date, region, 0 as COUNT1
FROM file_stats fs0, files f0
WHERE NOT EXISTS (
SELECT 1
FROM file_stats fs, files f
WHERE fs.file_id = f.id
AND fs.date=fs0.date
AND f.region=f0.region
)
I'm not entirely sure why you're opposed to using temp tables. For example (the table should be small: just # of days * # of regions):
CREATE TEMPORARY TABLE combinations (date DATE, region VARCHAR(4));
INSERT INTO combinations SELECT DISTINCT fs.date, f.region FROM files f, file_stats fs;
SELECT c.date, c.region, SUM(CASE WHEN fs.file_id IS NULL THEN 0 ELSE 1 END) AS num_files
FROM combinations c
LEFT JOIN files f ON f.region = c.region
LEFT OUTER JOIN file_stats fs ON fs.date = c.date AND fs.file_id = f.id
GROUP BY c.date, c.region;

I suspect that it has to scan file_stats and files for every single row of the output. The following version might be substantially faster, and it won't require creating new tables.
SELECT d.date
, r.region
, count(f.id) AS num_files
FROM (SELECT DISTINCT date FROM file_stats) AS d,
(SELECT DISTINCT region FROM files) AS r
LEFT JOIN file_stats AS fs
ON fs.date = d.date
LEFT JOIN files f
ON f.id = fs.file_id
AND f.region = r.region
GROUP BY d.date, r.region;
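To see what this cross-join-then-left-join shape returns, here is a small sketch run against sample rows like those in the earlier answer (minus the orphan row). It assumes Python's standard sqlite3 module as the client; the names QUERY and counts are illustrative. The query is written with an explicit CROSS JOIN and uses files.id as the join column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (id INT PRIMARY KEY, path VARCHAR(255) NOT NULL UNIQUE,
                    region VARCHAR(4));
CREATE TABLE file_stats (date DATE NOT NULL, file_id INT NOT NULL,
                         num_lines INT NOT NULL, PRIMARY KEY (date, file_id));
INSERT INTO files VALUES (1, 'a long path', 'NYK'),
                         (2, 'another long path', 'NYK'),
                         (3, 'a shorter long path', 'LDN');
INSERT INTO file_stats VALUES ('2011-01-01', 1, 35),
                              ('2011-01-02', 1, 37),
                              ('2011-01-01', 2, 40);
""")

# Every date is paired with every region first; the left joins then attach
# matching files, and COUNT(f.id) ignores the NULLs left by non-matches.
QUERY = """
SELECT d.date, r.region, COUNT(f.id) AS num_files
FROM (SELECT DISTINCT date FROM file_stats) AS d
CROSS JOIN (SELECT DISTINCT region FROM files) AS r
LEFT JOIN file_stats AS fs ON fs.date = d.date
LEFT JOIN files AS f ON f.id = fs.file_id AND f.region = r.region
GROUP BY d.date, r.region
ORDER BY d.date, r.region
"""
counts = conn.execute(QUERY).fetchall()
for row in counts:
    print(row)
```

LDN has no file_stats rows at all in this data, yet it appears on both dates with a count of 0, which is the behaviour the question asks for.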

Related

Is it possible to compare a value to multiple columns in an "IN" clause?

select m.value
from MY_TABLE m
where m.value in (select m2.some_third_value, m2.some_fourth_value
from MY_TABLE_2 m2
where m2.first_val member of v_my_array
or m2.second_val member of v_my_array_2)
Is it possible to write a select similar to this, where m.value is compared to two columns and has to match at least one of those? Something like where m.value in (select m2.first_val, m2.second_val). Or is writing two separate selects unavoidable here?
No. When there are multiple columns in the IN clause, there must be the same number of columns in the WHERE clause. The pairwise query compares each record in the WHERE clause against the records returned by the sub-query. The statement below
SELECT *
FROM table_main m
WHERE ( m.col_1, m.col_2 ) IN (SELECT s.col_a,
s.col_b
FROM table_sub s)
is equivalent to
SELECT *
FROM table_main m
WHERE EXISTS (SELECT 1
FROM table_sub s
WHERE m.col_1 = s.col_a
AND m.col_2 = s.col_b)
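One way to search both columns in a single SELECT statement is to OUTER JOIN the second table to the first table with an OR condition. Note that the WHERE clause below discards the unmatched rows, so the join behaves like an inner join, and a row in table_main is returned once for every matching row in table_sub.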
SELECT m.*
FROM table_main m
LEFT JOIN table_sub s ON (m.col_1 = s.col_a OR m.col_1 = s.col_b)
WHERE m.col_1 = s.col_a
OR m.col_1 = s.col_b
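If each matching row should come back only once, an EXISTS subquery with an OR condition is another option. The sketch below uses SQLite through Python's standard sqlite3 module purely as a stand-in engine (the question itself is about Oracle), and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_main (col_1 INTEGER);
CREATE TABLE table_sub (col_a INTEGER, col_b INTEGER);
INSERT INTO table_main VALUES (1), (2), (3);
INSERT INTO table_sub VALUES (1, 9), (8, 2), (1, 1);
""")

# EXISTS: a row of table_main qualifies if ANY table_sub row matches
# in either column; each qualifying row is returned once.
matches = [row[0] for row in conn.execute("""
    SELECT m.col_1
    FROM table_main m
    WHERE EXISTS (SELECT 1
                  FROM table_sub s
                  WHERE m.col_1 = s.col_a OR m.col_1 = s.col_b)
    ORDER BY m.col_1
""")]

# Join form: one output row per matching pair, so duplicates can appear.
join_matches = [row[0] for row in conn.execute("""
    SELECT m.col_1
    FROM table_main m
    JOIN table_sub s ON (m.col_1 = s.col_a OR m.col_1 = s.col_b)
    ORDER BY m.col_1
""")]

print(matches, join_matches)
```

The value 1 matches two rows in table_sub, so the join returns it twice while EXISTS returns it once.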

Getting a min(date) AND max(date) AND their respective titles

I have three tables that I would like to select from
Table 1 has a bunch of static information about a user like their idnumber, name, registration date
Table 2 has the idnumber of the user, course number, and the date they registered for the course
Table 3 has the course number, and the title of the course
I am trying to use one query that will select the columns mentioned in table 1, with the most recent course they registered (name and date registered) as well as their first course registered (name and date registered)
Here is what I came up with
SELECT u.idst, u.userid, u.firstname, u.lastname, u.email, u.register_date,
MIN(l.date_inscr) as mindate, MAX(l.date_inscr) as maxdate, lc.coursename
FROM table1 u,table3 lc
LEFT JOIN table2 l
ON l.idCourse = lc.idCourse
WHERE u.idst = 12787
AND u.idst = l.idUser
This gives me everything I need, and the dates are correct, but I have no idea how to display BOTH course names: the most recent and the first.
Any help would be great.
Thanks!
You can get your desired results by generating the min/max date_inscr for each user in a derived table and then joining that twice to table2 and table3, once to get each course name:
SELECT u.idst, u.userid, u.firstname, u.lastname, u.email, u.register_date,
l.mindate, lc1.coursename as first_course,
l.maxdate, lc2.coursename as latest_course
FROM table1 u
LEFT JOIN (SELECT idUser, MIN(date_inscr) AS mindate, MAX(date_inscr) AS maxdate
FROM table2
WHERE idUser = 12787
GROUP BY idUser
) l ON l.idUser = u.idst
LEFT JOIN table2 l1 ON l1.idUser = l.idUser AND l1.date_inscr = l.mindate
LEFT JOIN table3 lc1 ON lc1.idCourse = l1.idCourse
LEFT JOIN table2 l2 ON l2.idUser = l.idUser AND l2.date_inscr = l.maxdate
LEFT JOIN table3 lc2 ON lc2.idCourse = l2.idCourse
As @BillKarwin pointed out, this is more easily done using two separate queries.
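A small worked run of the derived-table approach may make the double join easier to follow. This is a sketch only: it uses SQLite through Python's standard sqlite3 module as a stand-in engine, trims the column list, and the user and course rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (idst INTEGER PRIMARY KEY, firstname TEXT);
CREATE TABLE table2 (idUser INTEGER, idCourse INTEGER, date_inscr TEXT);
CREATE TABLE table3 (idCourse INTEGER PRIMARY KEY, coursename TEXT);
INSERT INTO table1 VALUES (12787, 'Ann');
INSERT INTO table3 VALUES (1, 'Intro'), (2, 'Intermediate'), (3, 'Advanced');
INSERT INTO table2 VALUES (12787, 1, '2012-01-10'),
                          (12787, 2, '2012-06-01'),
                          (12787, 3, '2013-02-15');
""")

# The derived table l holds one row per user with the first and last dates;
# table2/table3 are then joined twice, once per date, to pick up each name.
row = conn.execute("""
    SELECT u.idst, l.mindate, lc1.coursename, l.maxdate, lc2.coursename
    FROM table1 u
    LEFT JOIN (SELECT idUser,
                      MIN(date_inscr) AS mindate,
                      MAX(date_inscr) AS maxdate
               FROM table2
               GROUP BY idUser) l ON l.idUser = u.idst
    LEFT JOIN table2 l1 ON l1.idUser = l.idUser AND l1.date_inscr = l.mindate
    LEFT JOIN table3 lc1 ON lc1.idCourse = l1.idCourse
    LEFT JOIN table2 l2 ON l2.idUser = l.idUser AND l2.date_inscr = l.maxdate
    LEFT JOIN table3 lc2 ON lc2.idCourse = l2.idCourse
    WHERE u.idst = ?
""", (12787,)).fetchone()

print(row)
```

If a user registered for two courses on the same first or last date, the joins would return one row per tie, so real data may need an extra tie-breaker.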

No more spool space in Teradata while trying Update

I'm trying to update a table with too many rows (388,000).
This is the query:
update DL_RG_ANALYTICS.SH_historico
from
(
SELECT
CAST((MAX_DIA - DIA_PAGO) AS INTEGER) AS DIAS_AL_CIERRE_1
FROM
(SELECT * FROM DL_RG_ANALYTICS.SH_historico A
LEFT JOIN
(SELECT ANO||MES AS ANO_MES, MAX(DIA) AS MAX_DIA FROM DL_RG_ANALYTICS.SH_CALENDARIO
GROUP BY 1) B
ON A.ANOMES = B.ANO_MES
) M) N
SET DIAS_AL_CIERRE = DIAS_AL_CIERRE_1;
Any help is appreciated.
The first thing I'd do is replace the SELECT * with only the columns you need. You can also remove the M derived table to make it easier to read:
UPDATE DL_RG_ANALYTICS.SH_historico
FROM (
SELECT CAST((MAX_DIA - DIA_PAGO) AS INTEGER) AS DIAS_AL_CIERRE_1
FROM DL_RG_ANALYTICS.SH_historico A
LEFT JOIN (
SELECT ANO || MES AS ANO_MES, MAX(DIA) AS MAX_DIA
FROM DL_RG_ANALYTICS.SH_CALENDARIO
GROUP BY 1
) B ON A.ANOMES = B.ANO_MES
) N
SET DIAS_AL_CIERRE = DIAS_AL_CIERRE_1;
What indexes are defined on the SH_CALENDARIO table? If there is a composite index of (ANO, MES) then you should re-write your LEFT JOIN sub-query to GROUP BY these two columns since you concatenate them together anyways. In general, you want to perform joins, GROUP BY and OLAP functions on indexes, so there will be less row re-distribution and they will run more efficiently.
Also, this query is updating all rows in the table with the same value. Is this intended, or do you want to include extra columns in your WHERE clause?

SQLITE, Create a temp table then select from it

Just wondering how I can create a temp table and then select from it further down the script.
Example.
CREATE TEMPORARY TABLE TEMP_TABLE1 AS
Select
L.ID,
SUM(L.cost)/2 as Costs,
from Table1 L
JOIN Table2 C on L.ID = C.ID
Where C.name = 'mike'
Group by L.ID
Select
Count(L.ID)
from Table1 L
JOIN TEMP_TABLE1 TT1 on L.ID = TT1.ID;
Where L.ID not in (TT1)
And Sum(L.Cost) > TT1.Costs
Ideally I want to have a temp table then use it later in the script to reference from.
Any help would be great!
You simply refer to the table as temp.<table> or just <table>; the latter is unambiguous only if the table name is unique.
As per the documentation :-
If a schema-name is specified, it must be either "main", "temp", or
the name of an attached database. In this case the new table is
created in the named database. If the "TEMP" or "TEMPORARY" keyword
occurs between the "CREATE" and "TABLE" then the new table is created
in the temp database. It is an error to specify both a schema-name and
the TEMP or TEMPORARY keyword, unless the schema-name is "temp". If no
schema name is specified and the TEMP keyword is not present then the
table is created in the main database.
SQL As Understood By SQLite - CREATE TABLE
The following example creates 3 tables :-
table1 with 3 columns as a permanent table.
table1 a temporary copy of the permanent table1.
temp_table another temporary copy of the permanent table1.
:-
DROP TABLE IF EXISTS temp.table1;
DROP TABLE IF EXISTS table1;
DROP TABLE IF EXISTS temp_table;
CREATE TABLE table1 (columnA INTEGER,columnB INTEGER, columnC INTEGER);
When creating the permanent table 1 it is loaded with 4 rows
:-
INSERT INTO table1 (columnA,columnB,columnC) VALUES
(1,5,20),
(2,7,21),
(3,8,80),
(4,3,63);
CREATE TEMP TABLE table1 AS SELECT * FROM table1;
CREATE TEMPORARY TABLE temp_table AS SELECT * FROM table1;
Both temp tables are then used in a UNION ALL to basically duplicate the rows, but with an indicator of the source table as a new column, from_table.
Note that two forms of referring to the temp tables are used: temp.<table> and just the table name.
The latter is unambiguous only if the temporary table has a unique name; otherwise an unqualified name resolves to the temp table first.
:-
SELECT 'temp_table' AS from_table,* FROM temp_table
UNION ALL
SELECT 'temp.table1' as from_table,* FROM temp.table1;
The result is the four rows of table1 twice: once labelled temp_table and once labelled temp.table1.
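How an unqualified name is resolved when a temp table shares its name with a permanent one can be checked with a short sketch. It assumes Python's standard sqlite3 module as the client and uses a cut-down two-row version of table1; the variable names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (columnA INTEGER, columnB INTEGER, columnC INTEGER);
INSERT INTO table1 VALUES (1, 5, 20), (2, 7, 21);
-- temp copy with the same name as the permanent table
CREATE TEMP TABLE table1 AS SELECT * FROM main.table1;
-- make the two tables differ so they can be told apart
DELETE FROM temp.table1 WHERE columnA = 2;
""")

unqualified = conn.execute("SELECT COUNT(*) FROM table1").fetchone()[0]
from_main = conn.execute("SELECT COUNT(*) FROM main.table1").fetchone()[0]
from_temp = conn.execute("SELECT COUNT(*) FROM temp.table1").fetchone()[0]

print(unqualified, from_main, from_temp)
```

The unqualified name picks up the temp table, so writing main.table1 or temp.table1 explicitly is the safer habit whenever names overlap.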
Re addition of example :-
CREATE TEMPORARY TABLE TEMP_TABLE1 AS
Select
L.ID,
SUM(L.cost)/2 as Costs,
from Table1 L
JOIN Table2 C on L.ID = C.ID
Where C.name = 'mike'
Group by L.ID
Select
Count(L.ID)
from Table1 L
JOIN TEMP_TABLE1 TT1 on L.ID = TT1.ID;
Where L.ID not in (TT1)
And Sum(L.Cost) > TT1.Costs
There are a few issues with this example. Apart from the misuse of the aggregate (commented out below), the following works.
Note for my convenience I've added an _ to the table names.
:-
DROP TABLE IF EXISTS Table_1;
DROP TABLE IF EXISTS Table_2;
DROP TABLE If EXISTS temp.temp_table1;
CREATE TABLE Table_1 (ID INTEGER PRIMARY KEY, cost REAL);
CREATE TABLE Table_2 (ID INTEGER PRIMARY KEY, name TEXT);
INSERT INTO Table_1 (cost) VALUES (100.45),(56.78),(99.99);
INSERT INTO Table_2 (name) VALUES ('mike'),('mike'),('fred');
CREATE TEMP TABLE temp_table1 AS
SELECT L.ID,
sum(L.cost)/2 as Costs
FROM Table_1 L
JOIN Table_2 C ON L.ID = C.ID
WHERE C.name = 'mike'
GROUP BY L.ID;
SELECT
count(L.ID)
FROM Table_1 L
JOIN temp_table1 TT1 ON L.ID = TT1.[L.ID]
WHERE
L.ID NOT IN (TT1.[L.ID])
-- AND Sum(L.cost) > TT1.costs --<<<< misuse of aggregate
The issues stem from the column name being L.ID, so it has to be enclosed (the rules in SQL As Understood By SQLite - SQLite Keywords apply); [ and ] have been used above.
Of course, you could circumvent the need for enclosure by naming the column using AS, e.g. SELECT
L.ID AS lid, --<<<< AS lid ADDED
SUM(L.cost)/2 as Costs, .......
Adding the following may be suitable for getting around the misuse of aggregate :-
GROUP BY L.ID
HAVING sum(L.cost) > TT1.costs
Adding the following to the end of the script :-
SELECT
count(L.ID), *
FROM Table_1 L
JOIN temp_table1 TT1 ON L.ID = TT1.[L.ID];
results in :-
If this is only to be used by one SELECT statement then you can use the WITH clause:
WITH TmpTable(id,cost) AS
(
...SELECT statement that returns the two columns (id and cost)...
)
SELECT id, cost FROM TmpTable WHERE ...;
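Filling in the placeholder with the query from the question gives something like the sketch below. It assumes Python's standard sqlite3 module as the client; the sample rows and the cost > 30 filter are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (ID INTEGER PRIMARY KEY, cost REAL);
CREATE TABLE Table2 (ID INTEGER PRIMARY KEY, name TEXT);
INSERT INTO Table1 VALUES (1, 100.0), (2, 50.0), (3, 80.0);
INSERT INTO Table2 VALUES (1, 'mike'), (2, 'mike'), (3, 'fred');
""")

# The CTE exists only for the duration of this one statement,
# so nothing has to be created or dropped.
rows = conn.execute("""
    WITH TmpTable(id, cost) AS (
        SELECT L.ID, SUM(L.cost) / 2
        FROM Table1 L
        JOIN Table2 C ON L.ID = C.ID
        WHERE C.name = 'mike'
        GROUP BY L.ID
    )
    SELECT id, cost FROM TmpTable WHERE cost > 30 ORDER BY id
""").fetchall()

print(rows)
```

Naming the columns in TmpTable(id, cost) also avoids any question about what the result columns are called.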

SQLITE equivalent for Oracle's ROWNUM?

I'm adding an 'index' column to a table in SQLite3 to allow the users to easily reorder the data, by renaming the old database and creating a new one in its place with the extra columns.
The problem I have is that I need to give each row a unique number in the 'index' column when I INSERT...SELECT the old values.
A search I did turned up a useful term in Oracle called ROWNUM, but SQLite3 doesn't have that. Is there something equivalent in SQLite?
You can use one of the special row names ROWID, OID or _ROWID_ to get the rowid of a row. See http://www.sqlite.org/lang_createtable.html#rowid for further details (and note that these names can be hidden by ordinary columns called ROWID and so on).
Many people here seem to mix up ROWNUM with ROWID. They are not the same concept, and Oracle has both.
ROWID is a unique ID of a database row. It's almost invariant (it changes during import/export but is the same across different SQL queries).
ROWNUM is a calculated field corresponding to the row number in the query result. It's always 1 for the first row, 2 for the second, and so on. It is not linked to any table row, and the same table row could have very different ROWNUMs depending on how it is queried.
SQLite has a ROWID but no ROWNUM. The only equivalent I found is the ROW_NUMBER() window function, available from SQLite 3.25 (see http://www.sqlitetutorial.net/sqlite-window-functions/sqlite-row_number/).
You can achieve what you want with a query like this:
insert into new
select *, row_number() over ()
from old;
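With an empty OVER () the numbering order is left to SQLite. When the numbers should follow a particular ordering, an ORDER BY can be placed inside the window. A sketch, assuming an SQLite library of version 3.25 or later behind Python's standard sqlite3 module; the column is called idx here because index is an SQL keyword, and the sample rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE old (col1, col2);
INSERT INTO old VALUES ('d', 3), ('s', 3), ('d', 1), ('w', 45),
                       ('b', 5465), ('w', 3), ('b', 23);
CREATE TABLE new (col1, col2, idx INTEGER);
-- number the rows in (col1, col2) order while copying them across
INSERT INTO new
SELECT col1, col2, ROW_NUMBER() OVER (ORDER BY col1, col2)
FROM old;
""")

numbered = conn.execute("SELECT idx, col1, col2 FROM new ORDER BY idx").fetchall()
for row in numbered:
    print(row)
```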
No, SQLite doesn't have a direct equivalent to Oracle's ROWNUM.
If I understand your requirement correctly, you should be able to add a numbered column based on ordering of the old table this way:
create table old (col1, col2);
insert into old values
('d', 3),
('s', 3),
('d', 1),
('w', 45),
('b', 5465),
('w', 3),
('b', 23);
create table new (colPK INTEGER PRIMARY KEY AUTOINCREMENT, col1, col2);
insert into new select NULL, col1, col2 from old order by col1, col2;
The new table contains:
.headers on
.mode column
select * from new;
colPK col1 col2
---------- ---------- ----------
1 b 23
2 b 5465
3 d 1
4 d 3
5 s 3
6 w 3
7 w 45
The AUTOINCREMENT does what its name suggests: each additional row has the previous' value incremented by 1.
I believe you want to use the LIMIT clause in SQLite.
SELECT * FROM TABLE can return thousands of records.
However, you can constrain this by adding the LIMIT keyword.
SELECT * FROM TABLE LIMIT 5;
This will return the first 5 records from your query, if available.
Use this code to create a row number running from 0 to (row count - 1):
SELECT (SELECT COUNT(*)
FROM Table_name AS t2
WHERE t2.col1 < t1.col1) + (SELECT COUNT(*)
FROM Table_name AS t3
-- ties on col1 are broken by rowid (assumed tie-breaker)
WHERE t3.col1 = t1.col1 AND t3.rowid < t1.rowid) AS rowNum, * FROM Table_name t1 ORDER BY rowNum ASC

Resources