SQLite query runs 10 times slower than MSAccess query - sqlite

I have an 800MB MS Access database that I migrated to SQLite (the SQLite database, after migration, is around 330MB). The structure of the database is as follows:
The table Occurrence has 1,600,000 records. The table looks like:
CREATE TABLE Occurrence
(
SimulationID INTEGER, SimRunID INTEGER, OccurrenceID INTEGER,
OccurrenceTypeID INTEGER, Period INTEGER, HasSucceeded BOOL,
PRIMARY KEY (SimulationID, SimRunID, OccurrenceID)
)
It has the following indexes:
CREATE INDEX "Occurrence_HasSucceeded_idx" ON "Occurrence" ("HasSucceeded" ASC)
CREATE INDEX "Occurrence_OccurrenceID_idx" ON "Occurrence" ("OccurrenceID" ASC)
CREATE INDEX "Occurrence_SimRunID_idx" ON "Occurrence" ("SimRunID" ASC)
CREATE INDEX "Occurrence_SimulationID_idx" ON "Occurrence" ("SimulationID" ASC)
The table OccurrenceParticipant has 3,400,000 records. The table looks like:
CREATE TABLE OccurrenceParticipant
(
SimulationID INTEGER, SimRunID INTEGER, OccurrenceID INTEGER,
RoleTypeID INTEGER, ParticipantID INTEGER
)
It has the following indexes:
CREATE INDEX "OccurrenceParticipant_OccurrenceID_idx" ON "OccurrenceParticipant" ("OccurrenceID" ASC)
CREATE INDEX "OccurrenceParticipant_ParticipantID_idx" ON "OccurrenceParticipant" ("ParticipantID" ASC)
CREATE INDEX "OccurrenceParticipant_RoleType_idx" ON "OccurrenceParticipant" ("RoleTypeID" ASC)
CREATE INDEX "OccurrenceParticipant_SimRunID_idx" ON "OccurrenceParticipant" ("SimRunID" ASC)
CREATE INDEX "OccurrenceParticipant_SimulationID_idx" ON "OccurrenceParticipant" ("SimulationID" ASC)
The table InitialParticipant has 130 records. The structure of the table is
CREATE TABLE InitialParticipant
(
ParticipantID INTEGER PRIMARY KEY, ParticipantTypeID INTEGER,
ParticipantGroupID INTEGER
)
The table has the following indexes:
CREATE INDEX "initialpart_participantTypeID_idx" ON "InitialParticipant" ("ParticipantGroupID" ASC)
CREATE INDEX "initialpart_ParticipantID_idx" ON "InitialParticipant" ("ParticipantID" ASC)
The table ParticipantGroup has 22 records. It looks like
CREATE TABLE ParticipantGroup (
ParticipantGroupID INTEGER, ParticipantGroupTypeID INTEGER,
Description varchar (50), PRIMARY KEY( ParticipantGroupID )
)
The table has the following index:
CREATE INDEX "ParticipantGroup_ParticipantGroupID_idx" ON "ParticipantGroup" ("ParticipantGroupID" ASC)
The table tmpSimArgs has 18 records. It has the following structure:
CREATE TABLE tmpSimArgs (SimulationID varchar, SimRunID int(10))
And the following indexes:
CREATE INDEX tmpSimArgs_SimRunID_idx ON tmpSimArgs(SimRunID ASC)
CREATE INDEX tmpSimArgs_SimulationID_idx ON tmpSimArgs(SimulationID ASC)
The table ‘tmpPartArgs’ has 80 records. It has the below structure:
CREATE TABLE tmpPartArgs(participantID INT)
And the below index:
CREATE INDEX tmpPartArgs_participantID_idx ON tmpPartArgs(participantID ASC)
I have a query that involves multiple INNER JOINs. The problem is that the Access version of the query takes about a second, whereas the SQLite version of the same query takes 10 seconds (about 10 times slower). Migrating back to Access is not possible for me; SQLite is my only option.
I am new to writing database queries, so these queries might look naive. Please point out anything you see that is faulty or childish.
The query in Access is (the entire query takes 1 second to execute):
SELECT ParticipantGroup.Description, Occurrence.SimulationID, Occurrence.SimRunID, Occurrence.Period, Count(OccurrenceParticipant.ParticipantID) AS CountOfParticipantID FROM
(
ParticipantGroup INNER JOIN InitialParticipant ON ParticipantGroup.ParticipantGroupID = InitialParticipant.ParticipantGroupID
) INNER JOIN
(
tmpPartArgs INNER JOIN
(
(
tmpSimArgs INNER JOIN Occurrence ON (tmpSimArgs.SimRunID = Occurrence.SimRunID) AND (tmpSimArgs.SimulationID = Occurrence.SimulationID)
) INNER JOIN OccurrenceParticipant ON (Occurrence.OccurrenceID = OccurrenceParticipant.OccurrenceID) AND (Occurrence.SimRunID = OccurrenceParticipant.SimRunID) AND (Occurrence.SimulationID = OccurrenceParticipant.SimulationID)
) ON tmpPartArgs.participantID = OccurrenceParticipant.ParticipantID
) ON InitialParticipant.ParticipantID = OccurrenceParticipant.ParticipantID WHERE (((OccurrenceParticipant.RoleTypeID)=52 Or (OccurrenceParticipant.RoleTypeID)=49)) AND Occurrence.HasSucceeded = True GROUP BY ParticipantGroup.Description, Occurrence.SimulationID, Occurrence.SimRunID, Occurrence.Period;
The SQLite query is as follows (this query takes around 10 seconds):
SELECT ij1.Description, ij2.occSimulationID, ij2.occSimRunID, ij2.Period, Count(ij2.occpParticipantID) AS CountOfParticipantID FROM
(
SELECT ip.ParticipantGroupID AS ipParticipantGroupID, ip.ParticipantID AS ipParticipantID, ip.ParticipantTypeID, pg.ParticipantGroupID AS pgParticipantGroupID, pg.ParticipantGroupTypeID, pg.Description FROM ParticipantGroup as pg INNER JOIN InitialParticipant AS ip ON pg.ParticipantGroupID = ip.ParticipantGroupID
) AS ij1 INNER JOIN
(
SELECT tpa.participantID AS tpaParticipantID, ij3.* FROM tmpPartArgs AS tpa INNER JOIN
(
SELECT ij4.*, occp.SimulationID as occpSimulationID, occp.SimRunID AS occpSimRunID, occp.OccurrenceID AS occpOccurrenceID, occp.ParticipantID AS occpParticipantID, occp.RoleTypeID FROM
(
SELECT tsa.SimulationID AS tsaSimulationID, tsa.SimRunID AS tsaSimRunID, occ.SimulationID AS occSimulationID, occ.SimRunID AS occSimRunID, occ.OccurrenceID AS occOccurrenceID, occ.OccurrenceTypeID, occ.Period, occ.HasSucceeded FROM tmpSimArgs AS tsa INNER JOIN Occurrence AS occ ON (tsa.SimRunID = occ.SimRunID) AND (tsa.SimulationID = occ.SimulationID)
) AS ij4 INNER JOIN OccurrenceParticipant AS occp ON (occOccurrenceID = occpOccurrenceID) AND (occSimRunID = occpSimRunID) AND (occSimulationID = occpSimulationID)
) AS ij3 ON tpa.participantID = ij3.occpParticipantID
) AS ij2 ON ij1.ipParticipantID = ij2.occpParticipantID WHERE (((ij2.RoleTypeID)=52 Or (ij2.RoleTypeID)=49)) AND ij2.HasSucceeded = 1 GROUP BY ij1.Description, ij2.occSimulationID, ij2.occSimRunID, ij2.Period;
I don’t know what I am doing wrong here. I have all the indexes, but I think I am missing some key index that will do the trick. The interesting thing is that, before migrating, my ‘research’ on SQLite suggested that it is faster, smaller and better in all aspects than Access. But I can't seem to get SQLite to query faster than Access. I reiterate that I am new to SQLite and have little knowledge or experience of it, so if any learned soul could help me out with this, it would be much appreciated.

I have reformatted your code (using my home-brew SQL formatter) to hopefully make it easier for others to read.
Reformatted Query:
SELECT
ij1.Description,
ij2.occSimulationID,
ij2.occSimRunID,
ij2.Period,
Count(ij2.occpParticipantID) AS CountOfParticipantID
FROM (
SELECT
ip.ParticipantGroupID AS ipParticipantGroupID,
ip.ParticipantID AS ipParticipantID,
ip.ParticipantTypeID,
pg.ParticipantGroupID AS pgParticipantGroupID,
pg.ParticipantGroupTypeID,
pg.Description
FROM ParticipantGroup AS pg
INNER JOIN InitialParticipant AS ip
ON pg.ParticipantGroupID = ip.ParticipantGroupID
) AS ij1
INNER JOIN (
SELECT
tpa.participantID AS tpaParticipantID,
ij3.*
FROM tmpPartArgs AS tpa
INNER JOIN (
SELECT
ij4.*,
occp.SimulationID AS occpSimulationID,
occp.SimRunID AS occpSimRunID,
occp.OccurrenceID AS occpOccurrenceID,
occp.ParticipantID AS occpParticipantID,
occp.RoleTypeID
FROM (
SELECT
tsa.SimulationID AS tsaSimulationID,
tsa.SimRunID AS tsaSimRunID,
occ.SimulationID AS occSimulationID,
occ.SimRunID AS occSimRunID,
occ.OccurrenceID AS occOccurrenceID,
occ.OccurrenceTypeID,
occ.Period,
occ.HasSucceeded
FROM tmpSimArgs AS tsa
INNER JOIN Occurrence AS occ
ON (tsa.SimRunID = occ.SimRunID)
AND (tsa.SimulationID = occ.SimulationID)
) AS ij4
INNER JOIN OccurrenceParticipant AS occp
ON (occOccurrenceID = occpOccurrenceID)
AND (occSimRunID = occpSimRunID)
AND (occSimulationID = occpSimulationID)
) AS ij3
ON tpa.participantID = ij3.occpParticipantID
) AS ij2
ON ij1.ipParticipantID = ij2.occpParticipantID
WHERE (
(
(ij2.RoleTypeID) = 52
OR
(ij2.RoleTypeID) = 49
)
)
AND ij2.HasSucceeded = 1
GROUP BY
ij1.Description,
ij2.occSimulationID,
ij2.occSimRunID,
ij2.Period;
As per JohnFx (above), I was confused by the derived views. I think there is actually no need for them, especially since they are all inner joins. So, below I have attempted to reduce the complexity. Please review and test for performance. I have had to do a cross join with tmpSimArgs since it is only joined to Occurrence - I assume this is the desired behaviour.
SELECT
pg.Description,
occ.SimulationID,
occ.SimRunID,
occ.Period,
COUNT(occp.ParticipantID) AS CountOfParticipantID
FROM ParticipantGroup AS pg
INNER JOIN InitialParticipant AS ip
ON pg.ParticipantGroupID = ip.ParticipantGroupID
CROSS JOIN tmpSimArgs AS tsa
INNER JOIN Occurrence AS occ
ON tsa.SimRunID = occ.SimRunID
AND tsa.SimulationID = occ.SimulationID
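INNER JOIN OccurrenceParticipant AS occp
ON occ.OccurrenceID = occp.OccurrenceID
AND occ.SimRunID = occp.SimRunID
AND occ.SimulationID = occp.SimulationID
-- tie each participant row to InitialParticipant, as the original query does
AND ip.ParticipantID = occp.ParticipantID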
INNER JOIN tmpPartArgs AS tpa
ON tpa.participantID = occp.ParticipantID
WHERE occ.HasSucceeded = 1
AND (occp.RoleTypeID = 52 OR occp.RoleTypeID = 49 )
GROUP BY
pg.Description,
occ.SimulationID,
occ.SimRunID,
occ.Period;

I have presented a smaller, scaled-down version of my query. I hope this is clearer and more legible than my earlier one.
SELECT5 * FROM
(
SELECT4 * FROM ParticipantGroup as pg INNER JOIN InitialParticipant AS ip ON pg.ParticipantGroupID = ip.ParticipantGroupID
) AS ij1 INNER JOIN
(
SELECT3 * FROM tmpPartArgs AS tpa INNER JOIN
(
SELECT2 * FROM
(
SELECT1 * FROM tmpSimArgs AS tsa INNER JOIN Occurrence AS occ ON (tsa.SimRunID = occ.SimRunID) AND (tsa.SimulationID = occ.SimulationID)
) AS ij4 INNER JOIN OccurrenceParticipant AS occp ON (occOccurrenceID = occpOccurrenceID) AND (occSimRunID = occpSimRunID) AND (occSimulationID = occpSimulationID)
) AS ij3 ON tpa.participantID = ij3.occpParticipantID
) AS ij2 ON ij1.ipParticipantID = ij2.occpParticipantID WHERE (((ij2.RoleTypeID)=52 Or (ij2.RoleTypeID)=49)) AND ij2.HasSucceeded = 1
The application that I am working on is a Simulation application, and in order to understand the context of the above query I thought it necessary to give a brief explanation of the application.
Let us assume there is a planet with some initial resources and living agents. The planet is allowed to exist for 1000 years and the actions performed by the agents are monitored and stored in the database. After 1000 years the planet is destroyed and re-created with the same set of initial resources and living agents as the first time. This (the creation and destruction) is repeated 18 times and all the actions of the agents performed during those 1000 years are stored in the database. Thus our entire experiment consists of 18 re-creations, which is termed the ‘Simulation’. Each of the 18 times the planet is recreated is termed a run and each of the 1000 years of a run is called a period. So a ‘Simulation’ consists of 18 runs and each run consists of 1000 periods.
At the start of each run, we assign the ‘Simulation’ an initial set of knowledge items and dynamic agents that interact with each other and the items. A knowledge item is stored by an agent inside a knowledge store. The knowledge store is also considered to be a participating entity in our Simulation. But this concept (regarding knowledge stores) is not important. I have tried to be detailed about every SELECT statement and the tables involved.
SELECT1: I think this query could be replaced by just the table ‘Occurrence’, since it does not do much. The table Occurrence stores the different actions taken by the agents in each period of every simulation run of a particular ‘Simulation’. Normally each ‘Simulation’ consists of 18 runs, and each run consists of 1000 periods. An agent is allowed to take an action in every period of every run in the ‘Simulation’. But the Occurrence table does not store any details about the agents that perform the actions. The Occurrence table might store data related to multiple ‘Simulations’.
SELECT2: This query simply returns the details of actions performed in every period of every run of a ‘Simulation’ along with the details of all participants of the ‘Simulation’ like their respective ParticipantIDs. The OccurrenceParticipant table stores records for every participating entity of the Simulation and that includes agents, knowledge stores, knowledge items, etc.
SELECT3: This query returns only those records from the pseudo table ij3 that are due to agents and knowledge items. All records in ij3 concerning knowledge stores will be filtered out.
SELECT4: This query attaches the ‘Description’ field to every record of ‘InitialParticipant’. Please note that the column ‘Description’ is an Output column of the entire query. The table InitialParticipant contains a record for every agent and every knowledge item that is initially assigned to the ‘Simulation’
SELECT5: This final query returns all records from the pseudo table ij2 for which the RoleType of the participating entity (which may either be an agent or a knowledge item) is 49 or 52.

I would suggest moving the ij2.RoleTypeID filtering from the outermost query to ij3, use IN instead of OR, and move the HasSucceeded query to ij4.
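The suggestion above can be tried out on a toy database. The following is a minimal sketch, not a benchmark: it uses Python's sqlite3 module with the table definitions from the question, a handful of made-up rows, and a shortened form of the query in which the RoleTypeID filter sits inside ij3 (written with IN) and the HasSucceeded filter sits inside ij4. The two outer derived tables are written as plain joins for brevity, which is a simplification of mine. Whether this placement actually speeds things up depends on the SQLite version and its planner, so it is worth checking with EXPLAIN QUERY PLAN on the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Occurrence (
    SimulationID INTEGER, SimRunID INTEGER, OccurrenceID INTEGER,
    OccurrenceTypeID INTEGER, Period INTEGER, HasSucceeded BOOL,
    PRIMARY KEY (SimulationID, SimRunID, OccurrenceID));
CREATE TABLE OccurrenceParticipant (
    SimulationID INTEGER, SimRunID INTEGER, OccurrenceID INTEGER,
    RoleTypeID INTEGER, ParticipantID INTEGER);
CREATE TABLE InitialParticipant (
    ParticipantID INTEGER PRIMARY KEY, ParticipantTypeID INTEGER,
    ParticipantGroupID INTEGER);
CREATE TABLE ParticipantGroup (
    ParticipantGroupID INTEGER, ParticipantGroupTypeID INTEGER,
    Description varchar (50), PRIMARY KEY (ParticipantGroupID));
CREATE TABLE tmpSimArgs (SimulationID varchar, SimRunID int(10));
CREATE TABLE tmpPartArgs (participantID INT);

-- Made-up rows, only to exercise the query.
INSERT INTO ParticipantGroup VALUES (1, 1, 'Group A');
INSERT INTO InitialParticipant VALUES (10, 1, 1), (11, 1, 1);
INSERT INTO tmpSimArgs VALUES (1, 1);
INSERT INTO tmpPartArgs VALUES (10), (11);
INSERT INTO Occurrence VALUES (1, 1, 100, 1, 5, 1), (1, 1, 101, 1, 5, 0);
INSERT INTO OccurrenceParticipant VALUES
    (1, 1, 100, 52, 10), (1, 1, 100, 49, 11),
    (1, 1, 100, 7, 10), (1, 1, 101, 52, 10);
""")

query = """
SELECT pg.Description, ij3.SimulationID, ij3.SimRunID, ij3.Period,
       COUNT(ij3.ParticipantID) AS CountOfParticipantID
FROM ParticipantGroup AS pg
INNER JOIN InitialParticipant AS ip
        ON pg.ParticipantGroupID = ip.ParticipantGroupID
INNER JOIN (
    SELECT ij4.SimulationID AS SimulationID, ij4.SimRunID AS SimRunID,
           ij4.Period AS Period, occp.ParticipantID AS ParticipantID
    FROM (
        SELECT occ.SimulationID AS SimulationID, occ.SimRunID AS SimRunID,
               occ.OccurrenceID AS OccurrenceID, occ.Period AS Period
        FROM tmpSimArgs AS tsa
        INNER JOIN Occurrence AS occ
                ON tsa.SimRunID = occ.SimRunID
               AND tsa.SimulationID = occ.SimulationID
        WHERE occ.HasSucceeded = 1             -- filter moved into ij4
    ) AS ij4
    INNER JOIN OccurrenceParticipant AS occp
            ON ij4.OccurrenceID = occp.OccurrenceID
           AND ij4.SimRunID = occp.SimRunID
           AND ij4.SimulationID = occp.SimulationID
    WHERE occp.RoleTypeID IN (49, 52)          -- filter moved into ij3, IN instead of OR
) AS ij3 ON ip.ParticipantID = ij3.ParticipantID
INNER JOIN tmpPartArgs AS tpa ON tpa.participantID = ij3.ParticipantID
GROUP BY pg.Description, ij3.SimulationID, ij3.SimRunID, ij3.Period
"""
rows = conn.execute(query).fetchall()
print(rows)
```

With the sample rows, only occurrence 100 has succeeded and only two of its participant rows carry role 49 or 52, so the single group reports a count of 2.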

Related

How to write complex recursive maria db query

I'm trying to write a recursive query for use on an old and poorly designed database, so the queries get quite complex.
Here are the (relevant) table relationships:
Because people asked - here is the creation code for these tables:
CREATE TABLE CircuitLayout(
CircuitLayoutID int,
PRIMARY KEY (CircuitLayoutID)
);
CREATE TABLE LitCircuit (
LitCircuitID int,
CircuitLayoutID int,
PRIMARY KEY (LitCircuitID),
FOREIGN KEY (CircuitLayoutID) REFERENCES CircuitLayout(CircuitLayoutID)
);
CREATE TABLE CircuitLayoutItem(
CircuitLayoutItemID int,
CircuitLayoutID int,
TableName varchar(255),
TablePK int,
PRIMARY KEY (CircuitLayoutItemID),
FOREIGN KEY (CircuitLayoutID) REFERENCES CircuitLayout(CircuitLayoutID)
);
TableName refers to another table in the database and thus TablePK is a primary key from the specified table
One of the valid options for TableName is LitCircuit
I'm trying to write a query that will select a circuit and any circuit it is related to.
I am having trouble understanding the syntax for recursive CTEs.
My non-functional attempt is this:
WITH RECURSIVE carries AS (
SELECT LitCircuit.LitCircuitID AS recurseList FROM LitCircuit
JOIN CircuitLayoutItem ON LitCircuit.CircuitLayoutID = CircuitLayoutItem.CircuitLayoutID
WHERE CircuitLayoutItem.TableName = "LitCircuit" AND CircuitLayoutItem.TablePK IN (00340)
UNION
SELECT LitCircuit.LitCircuitID AS CircuitIDs FROM LitCircuit
JOIN CircuitLayout ON LitCircuit.CircuitLayoutID = CircuitLayoutItem.CircuitLayoutID
WHERE CircuitLayoutItem.TableName = "LitCircuit" AND CircuitLayoutItem.TablePK IN (SELECT recurseList FROM carries)
)
SELECT * FROM carries;
The "00340" is a dummy number for testing; it would be replaced with an actual list in real use.
What I'm attempting to do is get a list of LitCircuitIDs based on one or many LitCircuitIDs. That's the anchor member, and that works fine.
What I want to do is take this result and feed it back into itself.
I lack an understanding of how to access data from the anchor member:
I don't know if it is a table with the columns from the select in the anchor or if it is simply a list of resulting values.
I don't understand if or where I need to include "carries" in the FROM part of a query.
If I were to write this function in python I would do it like this:
def get_circuits(circuit_list):
    # CircuitLayoutItem and LitCircuit are assumed to be dicts keyed by primary key
    result_list = []
    for layout_item_key, layout_item in CircuitLayoutItem.items():
        if layout_item['TableName'] == "LitCircuit" and layout_item['TablePK'] in circuit_list:
            layout = layout_item['CircuitLayoutID']
            for circuit_key, circuit in LitCircuit.items():
                if circuit["CircuitLayoutID"] == layout:
                    result_list.append(circuit_key)
    # recurse only while new circuits are found, otherwise this never returns
    if result_list:
        result_list.extend(get_circuits(result_list))
    return result_list
How do I express this in SQL?
danblack's comment made me realize something I was missing:
Here is what I was trying to do:
WITH RECURSIVE carries AS (
SELECT LitCircuit.LitCircuitID FROM LitCircuit
JOIN CircuitLayoutItem ON LitCircuit.CircuitLayoutID = CircuitLayoutItem.CircuitLayoutID
WHERE CircuitLayoutItem.TableName = 'LitCircuit' AND CircuitLayoutItem.TablePK IN (00340)
UNION ALL
SELECT LitCircuit.LitCircuitID FROM carries
JOIN CircuitLayoutItem ON carries.LitCircuitID = CircuitLayoutItem.TablePK
JOIN LitCircuit ON CircuitLayoutItem.CircuitLayoutID = LitCircuit.CircuitLayoutID
WHERE CircuitLayoutItem.TableName = 'LitCircuit'
)
SELECT DISTINCT LitCircuitID FROM carries;
I did not think of the CTE as a table to query against - rather just a result set, so I did not realize you have to SELECT from it - or in general treat it like a table.
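To see the "CTE as a table" idea in action, here is a small sketch that runs the query above against an in-memory SQLite database through Python's sqlite3 module. SQLite is used here only because it is easy to run and also supports WITH RECURSIVE; the original question targets MariaDB. The rows are made up for illustration, and the starting ID is passed as a bound parameter instead of the literal 00340.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CircuitLayout (CircuitLayoutID int, PRIMARY KEY (CircuitLayoutID));
CREATE TABLE LitCircuit (
    LitCircuitID int, CircuitLayoutID int,
    PRIMARY KEY (LitCircuitID),
    FOREIGN KEY (CircuitLayoutID) REFERENCES CircuitLayout(CircuitLayoutID));
CREATE TABLE CircuitLayoutItem (
    CircuitLayoutItemID int, CircuitLayoutID int,
    TableName varchar(255), TablePK int,
    PRIMARY KEY (CircuitLayoutItemID),
    FOREIGN KEY (CircuitLayoutID) REFERENCES CircuitLayout(CircuitLayoutID));

-- Made-up data: circuit 340 sits in layout 2, circuit 341 sits in layout 3.
INSERT INTO CircuitLayout VALUES (1), (2), (3);
INSERT INTO LitCircuit VALUES (340, 1), (341, 2), (342, 3);
INSERT INTO CircuitLayoutItem VALUES
    (1, 2, 'LitCircuit', 340),
    (2, 3, 'LitCircuit', 341),
    (3, 1, 'SomeOtherTable', 342);  -- ignored: not a LitCircuit reference
""")

query = """
WITH RECURSIVE carries AS (
    -- anchor member: circuits whose layout contains the starting circuit
    SELECT LitCircuit.LitCircuitID FROM LitCircuit
    JOIN CircuitLayoutItem ON LitCircuit.CircuitLayoutID = CircuitLayoutItem.CircuitLayoutID
    WHERE CircuitLayoutItem.TableName = 'LitCircuit' AND CircuitLayoutItem.TablePK IN (?)
    UNION ALL
    -- recursive member: carries is queried like any other table
    SELECT LitCircuit.LitCircuitID FROM carries
    JOIN CircuitLayoutItem ON carries.LitCircuitID = CircuitLayoutItem.TablePK
    JOIN LitCircuit ON CircuitLayoutItem.CircuitLayoutID = LitCircuit.CircuitLayoutID
    WHERE CircuitLayoutItem.TableName = 'LitCircuit'
)
SELECT DISTINCT LitCircuitID FROM carries ORDER BY LitCircuitID
"""
circuit_ids = [row[0] for row in conn.execute(query, (340,))]
print(circuit_ids)
```

Starting from 340, the anchor finds 341 and the recursive member then finds 342. The sample data is deliberately acyclic: with UNION ALL, circuits that reference each other in a loop would keep the recursion going, so if cycles are possible it may be safer to use UNION, which drops rows that were already produced.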

Update row with value from next row sqlite

I have the following columns in a SQLite DB.
id,ts,origin,product,bid,ask,nextts
1,2016-10-18 20:20:54.733,SourceA,Dow,1.09812,1.0982,
2,2016-10-18 20:20:55.093,SourceB,Oil,7010.5,7011.5,
3,2016-10-18 20:20:55.149,SourceA,Dow,18159.0,18161.0,
How can I populate the 'next timestamp' column (nextts) with the next timestamp for the same product (ts), from the same source? I've been trying the following, but I can't seem to put a subquery in an UPDATE statement.
UPDATE TEST a SET nextts = (select ts
from TEST b
where b.id> a.id and a.origin = b.origin and a.product = b.product
order by id asc limit 1);
If I call this, I can display it, but I haven't found a way of updating the value yet.
select a.*,
(select ts
from TEST b
where b.id> a.id and a.origin = b.origin and a.product = b.product
order by id asc limit 1) as nextts
from TEST a
order by origin, a.id;
The problem is that you're using a table alias for the table in the UPDATE statement, which is not allowed. You can drop the alias there and use unaliased (but table-name-prefixed) references to its columns, while keeping aliased references in the SELECT, like this:
UPDATE TEST
SET nextts = (
SELECT b.ts
FROM TEST b
WHERE b.id > TEST.id AND
TEST.origin = b.origin AND
TEST.product = b.product
ORDER BY b.id ASC
LIMIT 1
);
Prefixing the unaliased column references with the table name is necessary for SQLite to identify that you're referring to the unaliased table. Otherwise the id column would be resolved to the id from the closest[*] possible data source, which in this case is the aliased table (alias b), while we're interested in the unaliased table, so we need to tell SQLite that explicitly.
[*] The closest data source is the one listed in the same query, then the parent query, then the parent's parent query, and so on. SQLite looks for the first data source in the query hierarchy (going from the innermost part outwards) that defines this column.
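As a quick check, the UPDATE above can be run on the three sample rows from the question using Python's sqlite3 module. This is only a sketch: the column types are my assumption, since the question lists column names but not the table definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are assumed; the question only gives the column names.
conn.execute("""
CREATE TABLE TEST (
    id INTEGER PRIMARY KEY, ts TEXT, origin TEXT, product TEXT,
    bid REAL, ask REAL, nextts TEXT)
""")
sample_rows = [
    (1, "2016-10-18 20:20:54.733", "SourceA", "Dow", 1.09812, 1.0982),
    (2, "2016-10-18 20:20:55.093", "SourceB", "Oil", 7010.5, 7011.5),
    (3, "2016-10-18 20:20:55.149", "SourceA", "Dow", 18159.0, 18161.0),
]
conn.executemany(
    "INSERT INTO TEST (id, ts, origin, product, bid, ask) VALUES (?, ?, ?, ?, ?, ?)",
    sample_rows,
)

# Correlated subquery: TEST.* refers to the row being updated, b.* to candidates.
conn.execute("""
UPDATE TEST
SET nextts = (
    SELECT b.ts
    FROM TEST b
    WHERE b.id > TEST.id AND
          TEST.origin = b.origin AND
          TEST.product = b.product
    ORDER BY b.id ASC
    LIMIT 1
)
""")
result = conn.execute("SELECT id, nextts FROM TEST ORDER BY id").fetchall()
print(result)
```

Row 1 receives the timestamp of row 3 (same origin and product); rows 2 and 3 have no later match, so their nextts stays NULL.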

How to improve sqlite SELECT performance?

Some SELECT statements take several seconds to return data and I would like to know if and how I could improve performance. The DB normally is quite small (~10-40MB) but the larger it gets, the longer it takes.
One example query which takes very long is the following:
SELECT intf_id FROM interfaces
WHERE intfType IN (SELECT intfType FROM interfaces
WHERE intf_id=39151)
AND macAddress IN (SELECT l2_addr FROM neighbor
INNER JOIN nlink ON nlink.neighbor_neighbor_id=neighbor.neighbor_id
INNER JOIN interfaces ON interfaces.intf_id=nlink.interfaces_intf_id
WHERE interfaces.intf_id=39151)
AND status LIKE 'UP' AND phl=1 AND intf_id <> 39151
Maybe it's because of the nested SELECT statements?
The DB layout is as follows: (diagram not preserved)
EXPLAIN QUERY PLAN output (CSV):
"0","0","0","SCAN TABLE interfaces USING COVERING INDEX ii1"
"0","0","0","EXECUTE LIST SUBQUERY 1"
"1","0","0","SEARCH TABLE interfaces USING INTEGER PRIMARY KEY (rowid=?)"
"0","0","0","EXECUTE LIST SUBQUERY 2"
"2","0","2","SEARCH TABLE interfaces USING INTEGER PRIMARY KEY (rowid=?)"
"2","1","0","SCAN TABLE neighbor"
"2","2","1","SEARCH TABLE nlink USING COVERING INDEX sqlite_autoindex_nlink_1 (neighbor_neighbor_id=? AND interfaces_intf_id=?)"
Could you try creating an index on interfaces.macAddress and using the following joined query instead? The subqueries seem to be slower in this case.
SELECT interface_RH.intf_id FROM interfaces AS interface_LH
INNER JOIN nlink ON nlink.interfaces_intf_id = interface_LH.intf_id
INNER JOIN neighbor ON nlink.neighbor_neighbor_id = neighbor.neighbor_id
INNER JOIN interfaces AS interface_RH ON interface_RH.macAddress = neighbor.l2_addr
WHERE
interface_LH.intf_id=39151
AND interface_RH.status LIKE 'UP'
AND interface_RH.phl = 1
AND interface_RH.intf_id <> 39151
AND interface_RH.intfType = interface_LH.intfType
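The sketch below compares the two forms on a toy database using Python's sqlite3 module. The schema is hypothetical: the question's layout diagram is not available, so the column types, the nlink primary key (guessed from sqlite_autoindex_nlink_1 in the query plan) and the index name interfaces_macAddress_idx are assumptions, and the rows are made up. It only shows that both queries agree on this data; timing has to be measured on the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema, reconstructed from the column names used in the queries.
CREATE TABLE interfaces (
    intf_id INTEGER PRIMARY KEY, intfType TEXT, macAddress TEXT,
    status TEXT, phl INTEGER);
CREATE TABLE neighbor (neighbor_id INTEGER PRIMARY KEY, l2_addr TEXT);
CREATE TABLE nlink (
    neighbor_neighbor_id INTEGER, interfaces_intf_id INTEGER,
    PRIMARY KEY (neighbor_neighbor_id, interfaces_intf_id));
-- The index suggested in the answer (the name is made up).
CREATE INDEX interfaces_macAddress_idx ON interfaces (macAddress);

INSERT INTO interfaces VALUES
    (39151, 'eth', 'aa:aa', 'UP', 1),
    (2, 'eth', 'bb:bb', 'UP', 1),     -- expected match
    (3, 'eth', 'cc:cc', 'UP', 1),     -- different MAC address
    (4, 'wifi', 'bb:bb', 'UP', 1);    -- different interface type
INSERT INTO neighbor VALUES (1, 'bb:bb');
INSERT INTO nlink VALUES (1, 39151);
""")

subquery_sql = """
SELECT intf_id FROM interfaces
WHERE intfType IN (SELECT intfType FROM interfaces WHERE intf_id = 39151)
  AND macAddress IN (SELECT l2_addr FROM neighbor
      INNER JOIN nlink ON nlink.neighbor_neighbor_id = neighbor.neighbor_id
      INNER JOIN interfaces ON interfaces.intf_id = nlink.interfaces_intf_id
      WHERE interfaces.intf_id = 39151)
  AND status LIKE 'UP' AND phl = 1 AND intf_id <> 39151
"""
join_sql = """
SELECT interface_RH.intf_id FROM interfaces AS interface_LH
INNER JOIN nlink ON nlink.interfaces_intf_id = interface_LH.intf_id
INNER JOIN neighbor ON nlink.neighbor_neighbor_id = neighbor.neighbor_id
INNER JOIN interfaces AS interface_RH ON interface_RH.macAddress = neighbor.l2_addr
WHERE interface_LH.intf_id = 39151
  AND interface_RH.status LIKE 'UP'
  AND interface_RH.phl = 1
  AND interface_RH.intf_id <> 39151
  AND interface_RH.intfType = interface_LH.intfType
"""
subquery_rows = conn.execute(subquery_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()
print(subquery_rows, join_rows)

# Inspect the plan to see whether the new index is picked up (output varies by version).
for step in conn.execute("EXPLAIN QUERY PLAN " + join_sql):
    print(step)
```

One difference to keep in mind: the join form can return the same interface more than once if several neighbor rows share an l2_addr, whereas the IN form cannot. Adding DISTINCT to the join query restores the original semantics.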

SQLite outer join column filtering

As a training exercise I'm working on a fictional SQLite database resembling League of Legends, and I need to perform a left outer join to get a table of all players and if they have skins that are not called 'Classic', return those too.
I currently have this query:
SELECT * FROM players
LEFT OUTER JOIN (SELECT * FROM playerchampions WHERE NOT championskin = 'Classic')
ON name = playername
This returns what I am looking for, but also a lot of columns I don't want (player experience, player IP, player RP, and playername from the playerchampions table). The code for the two tables is as follows:
CREATE TABLE players (
name TEXT PRIMARY KEY,
experience INTEGER,
currencyip INTEGER,
currencyrp INTEGER
);
CREATE TABLE playerchampions (
playername TEXT REFERENCES players ( name ) ON UPDATE CASCADE,
championname TEXT REFERENCES champions ( name ) ON UPDATE CASCADE,
championskin TEXT REFERENCES skins ( skinname ) ON UPDATE CASCADE,
PRIMARY KEY ( playername, championname, championskin )
);
As I said, the query executes, but I can't use SELECT players.name, playerchampions.championname, playerchampions.championskin as the playerchampions columns are not given their proper table name when returned.
How do I fix this?
Try using aliases:
SELECT p.name, c.championskin FROM players p LEFT OUTER JOIN (SELECT pc.playername playername, pc.championskin championskin FROM playerchampions pc WHERE NOT pc.championskin = 'Classic') c ON p.name = c.playername;
Not sure if it's exactly what you need, but it will get you closer...
SELECT * FROM players p LEFT OUTER JOIN playerchampions pc ON (p.name = pc.playername) WHERE NOT pc.championskin = 'Classic'
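One detail worth checking with the last query: a condition on the right-hand table placed in WHERE is evaluated after the outer join, so players whose only skins are 'Classic' (or who have no rows at all) are dropped, because their pc columns are NULL. Putting the condition in the ON clause keeps those players. The sketch below, using Python's sqlite3 module, shows the difference on made-up rows; the REFERENCES clauses are omitted because the champions and skins tables are not part of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE players (
    name TEXT PRIMARY KEY, experience INTEGER,
    currencyip INTEGER, currencyrp INTEGER);
-- Foreign key clauses left out to keep the sketch self-contained.
CREATE TABLE playerchampions (
    playername TEXT, championname TEXT, championskin TEXT,
    PRIMARY KEY (playername, championname, championskin));

-- Made-up rows: alice owns one non-Classic skin, bob owns only a Classic one.
INSERT INTO players VALUES ('alice', 100, 10, 1), ('bob', 200, 20, 2);
INSERT INTO playerchampions VALUES
    ('alice', 'Ahri', 'Classic'),
    ('alice', 'Ahri', 'Foxfire'),
    ('bob', 'Garen', 'Classic');
""")

# Filter in the ON clause: every player is kept, unmatched ones get NULLs.
filter_in_on = conn.execute("""
SELECT p.name, pc.championname, pc.championskin
FROM players AS p
LEFT OUTER JOIN playerchampions AS pc
    ON p.name = pc.playername AND pc.championskin <> 'Classic'
ORDER BY p.name
""").fetchall()

# Filter in the WHERE clause: rows with NULL skins fail the test and disappear.
filter_in_where = conn.execute("""
SELECT p.name, pc.championname, pc.championskin
FROM players AS p
LEFT OUTER JOIN playerchampions AS pc ON p.name = pc.playername
WHERE NOT pc.championskin = 'Classic'
ORDER BY p.name
""").fetchall()

print(filter_in_on)
print(filter_in_where)
```

Selecting explicit columns through the aliases p and pc also avoids the unwanted columns that SELECT * brings along.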

Is it possible to use WHERE clause in same query as PARTITION BY?

I need to write SQL that keeps only the minimum 5 records for each identifiable record in a table. For this, I use PARTITION BY and delete all records where the value returned is greater than 5. When I attempt to use the WHERE clause in the same query as the PARTITION BY statement, I get the error "Ordered Analytical Functions not allowed in WHERE Clause". So, in order to get it to work, I have to use three subqueries. My SQL looks like this:
delete mydb.mytable where (field1,field2) in
(
select field1,field2 from
(
select field1,field2,
Rank() over
(
partition BY field1
order by field1,field2
) n
from mydb.mytable
) x
where n > 5
)
The innermost subquery just returns the raw data. Since I can't use WHERE there, I wrapped it with a subquery, the purpose of which is to 1) use WHERE to get records greater than 5 in rank and 2) select only field1 and field2. The reason why I select only those two fields is so that I can use the IN statement for deleting those records in the outermost query.
It works, but it appears a bit cumbersome. I'd like to consolidate the inner two subqueries into a single subquery. Is this possible?
Sounds like you need to use the QUALIFY clause which is the HAVING clause for Window Aggregate functions. Below is my take on what you are trying to accomplish.
Please do not run this SQL directly against your production data without first testing it.
/* Physical Delete */
DELETE TGT
FROM MyDB.MyTable TGT
INNER JOIN
(SELECT Field1
, Field2
FROM MyDB.MyTable
QUALIFY ROW_NUMBER() OVER (PARTITION BY Field1 ORDER BY Field1, Field2)
> 5
) SRC
ON TGT.Field1 = SRC.Field1
AND TGT.Field2 = SRC.Field2
/* Logical Delete */
UPDATE TGT
FROM MyDB.MyTable TGT
,
(SELECT Field1
, Field2
FROM MyDB.MyTable
QUALIFY ROW_NUMBER() OVER (PARTITION BY Field1 ORDER BY Field1, Field2)
> 5
) SRC
SET Deleted = 'Y'
/* RecordExpireDate = Date - 1 */
WHERE TGT.Field1 = SRC.Field1
AND TGT.Field2 = SRC.Field2
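For readers without a Teradata system at hand, the "keep at most 5 rows per Field1" idea can be tried in SQLite through Python's sqlite3 module. Note the swap: SQLite has no QUALIFY clause, so this sketch filters the ROW_NUMBER() result in a derived table instead, which is the emulation and not the Teradata syntax shown above. It assumes a SQLite build with window function support (3.25 or later), and the table name and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (field1 TEXT, field2 INTEGER)")
# Made-up data: group 'a' has 7 rows, group 'b' has 3 rows.
data = [("a", n) for n in range(1, 8)] + [("b", n) for n in range(1, 4)]
conn.executemany("INSERT INTO mytable VALUES (?, ?)", data)

# Number the rows inside each field1 group, then delete everything ranked above 5.
conn.execute("""
DELETE FROM mytable
WHERE rowid IN (
    SELECT rid FROM (
        SELECT rowid AS rid,
               ROW_NUMBER() OVER (PARTITION BY field1 ORDER BY field2) AS n
        FROM mytable
    )
    WHERE n > 5
)
""")

remaining = dict(
    conn.execute("SELECT field1, COUNT(*) FROM mytable GROUP BY field1").fetchall()
)
kept_a = [row[0] for row in conn.execute(
    "SELECT field2 FROM mytable WHERE field1 = 'a' ORDER BY field2")]
print(remaining, kept_a)
```

Group 'a' is trimmed to its five lowest field2 values, while group 'b' is left untouched because it never exceeds five rows.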
