Optimize JOIN for a large dataset in SQLite

I need to create a table with the running total (aka cumulative sum) per client per day from a large dataset (200 GB). However, my current code is too slow: it has been running for days and still has not finished.
My data come from two tables. The table Orders holds the order id and client id. The table Transactions holds the order id, date, and status (status=5 means the order was approved, status=7 means it was rejected).
CREATE TABLE RunningTotal AS
SELECT DISTINCT Orders.ClientId,
       Transactions.Date,
       SUM(CASE WHEN Transactions.Status = 5 THEN 1 ELSE -1 END)
           OVER (PARTITION BY Orders.ClientId ORDER BY Transactions.Date) AS RunTotal
FROM Transactions
LEFT JOIN Orders ON Transactions.Order_Id = Orders.Id
WHERE Transactions.Status IN (5, 7);
I tested with a toy table and my current code produces the expected results: the net running total of approved orders per client per day. The problem is how to make it scale to a large dataset. I might be wrong, but I believe my bottleneck is the JOIN: if I run a similar query against just one table, it takes minutes.
The command EXPLAIN QUERY PLAN gives me the following:
CO-ROUTINE 3
SEARCH TABLE Transactions USING INDEX idx_trans_status (Status=?)
SEARCH TABLE Orders USING INDEX idx_orders_id (Id=?)
USE TEMP B-TREE FOR ORDER BY
SCAN SUBQUERY 3
USE TEMP B-TREE FOR DISTINCT
USE TEMP B-TREE FOR ORDER BY
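One thing worth testing first (a suggestion, not something from the original post; whether it helps depends on the data): covering indexes shaped to this query, so both lookups in the plan above can be satisfied from the index alone instead of touching the wide table rows:
CREATE INDEX idx_trans_cover ON Transactions (Status, Order_Id, Date);
CREATE INDEX idx_orders_cover ON Orders (Id, ClientId);
Run ANALYZE after creating them so the planner has fresh statistics.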
EDIT:
The output table would look like:
CREATE TABLE RunningTotal (
ClientId INTEGER,
Date DATE,
RunTotal INTEGER,
PRIMARY KEY (ClientId, Date)
)
ClientId  Date        RunTotal
--------  ----------  --------
1         2018-12-28  110
1         2018-12-30  125
3         2018-10-15  87
3         2018-11-22  93
3         2018-11-24  99
'Orders' table looks like:
CREATE TABLE Orders (
Id INTEGER,
ClientId INTEGER,
Class INTEGER,
Recall INTEGER,
Family INTEGER,
CreationDate DATE,
DeletionDate DATE,
Level INTEGER,
Notes TEXT,
Tags TEXT,
Operator_ID INTEGER,
Node INTEGER,
PRIMARY KEY (Id)
)
'Transactions' table looks like:
CREATE TABLE Transactions (
Id INTEGER,
Order_Id INTEGER,
Date DATE,
Status INTEGER,
Amount INTEGER,
Operator_ID INTEGER,
Notes TEXT,
PRIMARY KEY (Id)
)
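Given these schemas, one rewrite worth benchmarking (a sketch, not a verified fix, and it assumes every transaction has a matching order, so the LEFT JOIN can become a plain JOIN): pre-aggregate the per-day net count with GROUP BY first, then apply the window function, so DISTINCT and the temp B-trees never have to process one row per transaction:
CREATE TABLE RunningTotal AS
SELECT ClientId,
       Date,
       SUM(DayNet) OVER (PARTITION BY ClientId ORDER BY Date) AS RunTotal
FROM (
    SELECT o.ClientId,
           t.Date,
           SUM(CASE WHEN t.Status = 5 THEN 1 ELSE -1 END) AS DayNet
    FROM Transactions t
    JOIN Orders o ON t.Order_Id = o.Id
    WHERE t.Status IN (5, 7)
    GROUP BY o.ClientId, t.Date
);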

Related

INNER JOIN to get the max value from one table and return the name from another table (SQLite)

I have created 2 tables:
CREATE TABLE locations(location STRING, id INT);
CREATE TABLE temperatures(location_id INT, temperature INT, day INT)
I am using SELECT location_id, day, MAX(temperature) FROM temperatures; to get the max temperature for every day from a single location (the location is specified in the URL by id). Is there a way I can get the location from the locations table based on the id? So the result will be something like:
61.1435,-1.1234 | 5 |35.5
instead of :
0 | 5 | 35.5
In simple words I want to replace the location_id of the result with the actual location (coordinates).
Why don't you join the two tables? Something like:
SELECT locations.location, temperatures.day, temperatures.temperature FROM temperatures, locations WHERE temperatures.location_id = locations.id
Or search Google for JOINs; you will find your answer.
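Since the asker also wants the per-day maximum, here is a hedged sketch with an explicit JOIN and a GROUP BY (the ? placeholder for the id taken from the URL is an assumption):
SELECT locations.location, temperatures.day, MAX(temperatures.temperature) AS max_temperature
FROM temperatures
JOIN locations ON locations.id = temperatures.location_id
WHERE temperatures.location_id = ?   -- the id mentioned in the URL
GROUP BY temperatures.day;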

How to write an SQLite query that uses 4 tables and count()?

I've got 4 tables that I want data from.
Trip_T
tripID (PK)
userID (FK)
...
User_T
userID (PK)
username
...
Excursion_T
excursionID (PK)
tripID (FK)
...
POI_T
poiID (PK)
excursionID (FK)
...
I want to create a table with one row for each trip in the db.
Each row should include the tripID, title, the user's name associated with the trip, the number of excursions made on the trip, and the number of POIs (points of interest) associated with those excursions.
I'm using the following query:
SELECT Trip_T.tripID, Trip_T.title, User_T.username,
COUNT(DISTINCT Excursion_T.excursionID) AS numExcursions,
COUNT(DISTINCT POI_T.poiID) AS numPOI
FROM Trip_T
INNER JOIN User_T ON User_T.userID = Trip_T.userID
INNER JOIN Excursion_T ON Excursion_T.tripID = Trip_T.tripID
INNER JOIN POI_T ON POI_T.excursionID = Excursion_T.excursionID
Even though I have multiple trips in the db, each with multiple excursions and pois, the query returns 1 row with what looks like the total number of excursions and total number of pois for all trips.
Any help is appreciated.
You forgot to add grouping to your query:
GROUP BY Trip_T.tripID, Trip_T.title, User_T.username
This way, the counters correspond to each triplet of (Trip_T.tripID, Trip_T.title, User_T.username).
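Putting it together, the corrected query is the asker's SELECT list (with the missing comma restored) plus the GROUP BY:
SELECT Trip_T.tripID, Trip_T.title, User_T.username,
       COUNT(DISTINCT Excursion_T.excursionID) AS numExcursions,
       COUNT(DISTINCT POI_T.poiID) AS numPOI
FROM Trip_T
INNER JOIN User_T ON User_T.userID = Trip_T.userID
INNER JOIN Excursion_T ON Excursion_T.tripID = Trip_T.tripID
INNER JOIN POI_T ON POI_T.excursionID = Excursion_T.excursionID
GROUP BY Trip_T.tripID, Trip_T.title, User_T.username;
Note that the INNER JOINs drop trips that have no excursions or no POIs; LEFT JOINs would keep them with counts of zero.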

Oracle: Composite unique key with Date column

I have created a table with a composite unique key as below:
create table test11
(
aa number,
bb varchar2(10),
cc DATE,
dd number,
ee NUMBER
);
CREATE UNIQUE INDEX TEST11_IDX ON TEST11 (AA,BB,CC);
Now, whenever I try to insert data, I get this error:
ORA-00001: unique constraint (CDUREFDB.TEST11_IDX) violated
INSERT INTO TEST11 VALUES (1, 'AA', SYSDATE, 1, 1);
commit;
INSERT INTO TEST11 VALUES (1, 'AA', SYSDATE, 1, 1);
commit;
Is that because the DATE column stores the value down to the second? It seems so, because the query below returns these results:
select to_char(CC,'DD-Mon-YY HH:Mi:SS AM') from test11;
TO_CHAR(CC,'DD-MON-YYHH:MI:SSAM')
---------------------------------
17-Mar-16 04:28:37 PM
17-Mar-16 04:28:43 PM
So, what can be done so that only the date part (not the hours, minutes, and seconds) is considered in the unique key?
Also, the DATE column above (CC) has a partition on it.
UPDATE:
In this table, we have a RANGE partition on the DATE column (CC), and we plan to remove partitions periodically (i.e., after an interval of some days).
So if I don't use CC directly in the unique index (using trunc(CC) instead, as Justin suggested), then when I try to insert data after some old partition has been removed, I get the error ORA-01502: index 'CDUREFDB.TEST111_IDX' or partition of such index is in unusable state.
UPDATE 1:
As per Justin's suggestion below, this issue is resolved by creating a virtual column, like below:
CREATE TABLE TEST11
(
AA NUMBER,
BB VARCHAR2(10),
CC DATE,
DD NUMBER ,
EE NUMBER,
FF DATE generated always AS (TRUNC(CC)) virtual
)
PARTITION BY RANGE
(
FF
)
INTERVAL
(
NUMTODSINTERVAL(1,'DAY')
)
(
PARTITION partition_test_1 VALUES LESS THAN (TO_DATE('01-APR-2006','dd-MON-yyyy'))
);
CREATE UNIQUE INDEX TEST111_IDX ON TEST11 (AA,BB,FF) LOCAL; -- creating unique local index
A date always has a time component so your two rows have different cc values. You could create a function-based index based on the trunc(cc) value which will set the time component to midnight.
CREATE UNIQUE INDEX TEST11_IDX
ON TEST11 (AA,BB,trunc(CC));
Of course, that means that if you want a query to use the index, you'd want to ensure that your predicate is on trunc(cc) rather than cc.
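To illustrate (a hedged example of my own, not from the posts above): with the function-based index in place, two inserts on the same calendar day collide even though their time components differ, and a query that filters on trunc(cc) can use the index:
INSERT INTO test11 VALUES (1, 'AA', SYSDATE, 1, 1);        -- succeeds
INSERT INTO test11 VALUES (1, 'AA', SYSDATE + 1/24, 1, 1); -- ORA-00001: same day, one hour later
SELECT * FROM test11 WHERE trunc(cc) = DATE '2016-03-17';  -- predicate matches the index expression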

Should an INNER JOIN on a UNION with a GROUP BY take hours in SQLite?

I'm trying to learn SQLite and am searching for techniques to speed up my query. I see people here trying to squeeze out milliseconds, when I'm easily into the megaseconds. I have one SQLite db with four tables, although I'm only querying three of them. Here's the query (I am using R to invoke it):
SELECT a.date, a.symbol, SUM (a.oi*a.contract_close) AS oi, c.ret, c.prc
FROM (SELECT date, symbol, oi, contract_close FROM ann
UNION
SELECT date, symbol AS sym, oi, contract_close FROM qtr
WHERE oi > 100 AND contract_close > 0 AND date > 20090600) a
INNER JOIN
(SELECT date, symbol || '1C' AS sym, ret, prc FROM crsp
WHERE prc > 5 AND date>20090600) c
ON a.date = c.date AND a.symbol = c.sym
GROUP BY a.date, a.symbol
I have an index on each table by date and symbol and just VACUUMed, but it's still very slow, as in an hour plus (and note that I'm looking at a six-month subset... I really want to query back to 2003).
Is this just a cache size issue? I have a relatively new laptop (MacBook Pro with 4 GB RAM). Thanks!
Here's the .schema:
CREATE TABLE ann
( "date" INTEGER,
symbol TEXT,
contract_type_1 TEXT,
contract_type_2 TEXT,
product_type TEXT,
block_volume INTEGER,
oi_change INTEGER,
oi INTEGER,
efp_volume INTEGER,
total_volume INTEGER,
name TEXT,
contract_change INTEGER,
contract_open INTEGER,
contract_high INTEGER,
contract_low INTEGER,
contract_close INTEGER,
contract_settle INTEGER
);
CREATE TABLE crsp
( "date" INTEGER,
symbol TEXT,
permno INTEGER,
prc REAL,
ret REAL,
vwretd REAL,
ewretd REAL,
sprtrn REAL
);
CREATE TABLE dly
( "date" INTEGER,
symbol TEXT,
expiration INTEGER,
product_type TEXT,
shares_per_contract INTEGER,
"open" REAL,
high REAL,
low REAL,
"last" REAL,
settle REAL,
change REAL,
total_volume INTEGER,
efp_volume INTEGER,
block_volume INTEGER,
oi INTEGER
);
CREATE TABLE qtr
( "date" INTEGER,
symbol TEXT,
total_volume INTEGER,
block_volume INTEGER,
efp_volume INTEGER,
contract_high INTEGER,
contract_low INTEGER,
contract_open INTEGER,
contract_close INTEGER,
contract_settle INTEGER,
oi INTEGER,
oi_change INTEGER,
shares_per_contract INTEGER,
expiration INTEGER,
product_type TEXT,
unk TEXT,
name TEXT
);
CREATE INDEX idx_ann_date_sym ON ann (date, symbol);
CREATE INDEX idx_crsp_date_sym ON ann (date, symbol);
CREATE INDEX idx_dly_date_sym ON ann (date, symbol);
CREATE INDEX idx_qtr_date_sym ON ann (date, symbol);
You don't mention the critical piece of information, which is how many rows are in each table and how many are in your result set. A query shouldn't take an hour unless you have really enormous data sets.
That said, a few things I notice about your query:
I assume you're aware that in your UNION the WHERE clause only applies to the second table and you're getting the entire "ann" table included?
UNION ALL is generally faster than plain UNION unless you really need the de-duplication provided by plain UNION.
You do not need to repeat the filter for the date field on both sides of the JOIN. One side is enough, and you may get different speed results depending on which side of the JOIN you put the filter. By using it in both places you could be confusing the query optimizer.
I'm not sure what "AS sym" is doing in the second SELECT in the UNION, because that column will be named "symbol" in the output (from the first SELECT in the UNION) and you're relying on the name symbol in your main SELECT statement.
In your main SELECT statement you don't have c.ret and c.prc in aggregate functions, but you don't include them in the GROUP BY, so it's not clear to me what value you expect to see in the results in the event that c contains multiple rows for a GROUP BY set.
The JOIN cannot be optimized because you're calculating one of the JOIN values as part of an inner SELECT. I'm not sure if there's a clever way to rewrite the JOIN conditions to be optimizable without storing a calculated symbol value in crsp.
Depending on the distribution of symbol and date values, you might want to reverse the order of the columns in your indexes (but only if you solve the problem of calculating the symbol value). A sketch folding several of these points together follows.
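Here is a hedged rewrite combining the UNION ALL, filter-both-arms, and single-sided date filter points; the MAX() wrappers assume ret and prc are constant within each (date, symbol) group, which only the poster can confirm:
SELECT a.date, a.symbol,
       SUM(a.oi * a.contract_close) AS oi,
       MAX(c.ret) AS ret,
       MAX(c.prc) AS prc
FROM (SELECT date, symbol, oi, contract_close FROM ann
      WHERE oi > 100 AND contract_close > 0 AND date > 20090600
      UNION ALL
      SELECT date, symbol, oi, contract_close FROM qtr
      WHERE oi > 100 AND contract_close > 0 AND date > 20090600) a
INNER JOIN
     (SELECT date, symbol || '1C' AS sym, ret, prc FROM crsp
      WHERE prc > 5) c
ON a.date = c.date AND a.symbol = c.sym
GROUP BY a.date, a.symbol;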
How fast does this run without the INNER JOIN? Check the speed of both halves of the join.
Try:
Selecting and sorting by date and symbol from c.
Inner joining to the union a instead of to table c.
Dropping the symbol AS sym alias in one half of the union, or aliasing it in both halves.

Select random row from a sqlite table

I have a sqlite table with the following schema:
CREATE TABLE foo (bar VARCHAR)
I'm using this table as storage for a list of strings.
How do I select a random row from this table?
Have a look at Selecting a Random Row from an SQLite Table
SELECT * FROM table ORDER BY RANDOM() LIMIT 1;
The following solutions are much faster than anktastic's (the count(*) costs a lot, but if you can cache it, then the difference shouldn't be that big), which itself is much faster than "order by random()" when you have a large number of rows, although they have a few inconveniences.
If your rowids are rather densely packed (i.e., few deletions), then you can do the following (using (select max(rowid) from foo)+1 instead of max(rowid)+1 gives better performance, as explained in the comments):
select * from foo where rowid = (abs(random()) % (select (select max(rowid) from foo)+1));
If you have holes, you will sometimes try to select a non-existent rowid, and the select will return an empty result set. If this is not acceptable, you can provide a default value like this:
select * from foo where rowid = (abs(random()) % (select (select max(rowid) from foo)+1)) or rowid = (select max(rowid) from foo) order by rowid limit 1;
This second solution isn't perfect: the probability is skewed toward the last row (the one with the highest rowid), but if you often add rows to the table, it becomes a moving target and the distribution should get much better.
Yet another solution: if you often select random rows from a table with lots of holes, you might want to create a table that contains the ids of the original table in random order:
create table random_foo(foo_id);
Then, periodically, re-fill the table random_foo:
delete from random_foo;
insert into random_foo select rowid from foo order by random();
And to select a random row, you can use my first method (there are no holes here). Of course, this last method has some concurrency problems, but re-building random_foo is a maintenance operation that's not likely to happen very often.
Yet another way, which I recently found on a mailing list, is to put a trigger on delete that moves the row with the biggest rowid into the slot of the deleted row, so that no holes are left.
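A minimal, untested sketch of that trigger idea (assuming foo is a plain rowid table, as in this question):
CREATE TRIGGER foo_fill_hole AFTER DELETE ON foo
BEGIN
    -- move the highest-rowid row into the gap left by the deleted row
    UPDATE foo SET rowid = OLD.rowid
    WHERE rowid = (SELECT max(rowid) FROM foo)
      AND OLD.rowid < rowid;
END;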
Lastly, note that the behavior of rowid and an INTEGER PRIMARY KEY AUTOINCREMENT is not identical (with rowid, when a new row is inserted, max(rowid)+1 is chosen, whereas it is highest-value-ever-seen+1 for an autoincrement primary key), so the last solution won't work with an autoincrement in random_foo, but the other methods will.
You need to put "order by RANDOM()" in your query.
Example:
select * from quest order by RANDOM();
Let's see a complete example:
Create a table:
CREATE TABLE quest (
id INTEGER PRIMARY KEY AUTOINCREMENT,
quest TEXT NOT NULL,
resp_id INTEGER NOT NULL
);
Inserting some values:
insert into quest(quest, resp_id) values ('1024/4',6), ('256/2',12), ('128/1',24);
A default select:
select * from quest;
id  quest   resp_id
--  ------  -------
1   1024/4  6
2   256/2   12
3   128/1   24
A random select:
select * from quest order by RANDOM();
id  quest   resp_id
--  ------  -------
3   128/1   24
1   1024/4  6
2   256/2   12
Each time you select, the order will be different.
If you want to return only one row:
select * from quest order by RANDOM() LIMIT 1;
id  quest   resp_id
--  ------  -------
2   256/2   12
Each time you select, the returned row will be different.
What about:
SELECT COUNT(*) AS n FROM foo;
then choose a random number m in [0, n) and
SELECT * FROM foo LIMIT 1 OFFSET m;
You can even save the first number (n) somewhere and only update it when the database count changes. That way you don't have to do the SELECT COUNT every time.
Here is a modification of @ank's solution:
SELECT *
FROM table
LIMIT 1
OFFSET ABS(RANDOM()) % MAX((SELECT COUNT(*) FROM table), 1)
This solution also works for tables with gaps in their ids, because we randomize an offset in the range [0, count). MAX is used to handle the case of an empty table.
Here are simple test results on a table with 16k rows:
sqlite> .timer on
sqlite> select count(*) from payment;
16049
Run Time: real 0.000 user 0.000140 sys 0.000117
sqlite> select payment_id from payment limit 1 offset abs(random()) % (select count(*) from payment);
14746
Run Time: real 0.002 user 0.000899 sys 0.000132
sqlite> select payment_id from payment limit 1 offset abs(random()) % (select count(*) from payment);
12486
Run Time: real 0.001 user 0.000952 sys 0.000103
sqlite> select payment_id from payment order by random() limit 1;
3134
Run Time: real 0.015 user 0.014022 sys 0.000309
sqlite> select payment_id from payment order by random() limit 1;
9407
Run Time: real 0.018 user 0.013757 sys 0.000208
SELECT bar
FROM foo
ORDER BY Random()
LIMIT 1
I came up with the following solution for large sqlite3 databases:
SELECT * FROM foo WHERE rowid = abs(random()) % (SELECT max(rowid) FROM foo) + 1;
The abs(X) function returns the absolute value of the numeric argument
X.
The random() function returns a pseudo-random integer between
-9223372036854775808 and +9223372036854775807.
The operator % outputs the integer value of its left operand modulo its right operand.
Finally, you add +1 to prevent the computed rowid from being 0, since rowids start at 1.
