How can I get a good query plan without selecting partitions by a literal value? - database-partitioning

I have a large table, foos, partitioned by foo_type. The following yields a good query plan (just one partition selected):
select count(*) from foos where foo_type=1;
But if I try to change the literal "1" to an (equivalent) subquery - I end up with a plan that scans every partition -
select count(*) from foos where foo_type=(select min(foo_type) from favorite_foo_types);
How can I write a query that uses a subselect in a 'where' clause such as that and doesn't end up scanning every partition?

You didn't provide DDL, which is probably why no one answered the question. The short answer: dynamic partition elimination does work in Greenplum, but the explain plan looks different from a plan where the literal value is provided.
Example:
First create your favorite_foo_types table.
create table public.favorite_foo_types
(id int, foo_type int)
distributed by (id);
insert into public.favorite_foo_types
values (1, 1), (2,2), (3,3), (4,4), (5,5);
analyze public.favorite_foo_types;
Next, create your partitioned table.
create table public.foos
(id int, foo_type int)
distributed by (id)
partition by list (foo_type)
(
partition foo_1 values (1),
partition foo_2 values (2),
partition foo_3 values (3),
partition foo_4 values (4),
partition foo_5 values (5)
);
insert into public.foos
select i as id, case when i between 1 and 1999 then 1
when i between 2000 and 3999 then 2
when i between 4000 and 5999 then 3
when i between 6000 and 7999 then 4
when i between 8000 and 9999 then 5 end as foo_type
from generate_series(1,9999) as i;
analyze public.foos;
Here is the plan when using a literal value. You can see it picks just one partition.
explain analyze
select count(*)
from public.foos
where foo_type = 1;
Aggregate (cost=0.00..431.07 rows=1 width=8)
Rows out: 1 rows with 0.722 ms to first row, 0.723 ms to end, start offset by 0.298 ms.
-> Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..431.07 rows=1 width=8)
Rows out: 2 rows at destination with 0.717 ms to first row, 0.718 ms to end, start offset by 0.299 ms.
-> Aggregate (cost=0.00..431.07 rows=1 width=8)
Rows out: Avg 1.0 rows x 2 workers. Max 1 rows (seg0) with 0.287 ms to end, start offset by 0.663 ms.
-> Sequence (cost=0.00..431.07 rows=1000 width=4)
Rows out: Avg 999.5 rows x 2 workers. Max 1000 rows (seg0) with 0.036 ms to first row, 0.215 ms to end, start offset by 0.663 ms.
-> Partition Selector for foos (dynamic scan id: 1) (cost=10.00..100.00 rows=50 width=4)
Filter: foo_type = 1
Partitions selected: 1 (out of 5)
Rows out: 0 rows (seg0) with 0.004 ms to end, start offset by 0.663 ms.
-> Dynamic Table Scan on foos (dynamic scan id: 1) (cost=0.00..431.07 rows=1000 width=4)
Filter: foo_type = 1
Rows out: Avg 999.5 rows x 2 workers. Max 1000 rows (seg0) with 0.032 ms to first row, 0.140 ms to end, start offset by 0.667 ms.
Partitions scanned: Avg 1.0 (out of 5) x 2 workers. Max 1 parts (seg0).
Slice statistics:
(slice0) Executor memory: 408K bytes.
(slice1) Executor memory: 195K bytes avg x 2 workers, 195K bytes max (seg0).
Statement statistics:
Memory used: 128000K bytes
Settings: optimizer=on
Optimizer status: PQO version 1.650
Total runtime: 1.162 ms
Now, your query:
explain analyze
select count(*)
from public.foos
where foo_type=(select min(foo_type) from public.favorite_foo_types);
Aggregate (cost=0.00..863.04 rows=1 width=8)
Rows out: 1 rows with 6.466 ms to end, start offset by 24 ms.
-> Gather Motion 2:1 (slice3; segments: 2) (cost=0.00..863.04 rows=1 width=8)
Rows out: 2 rows at destination with 5.415 ms to first row, 6.459 ms to end, start offset by 24 ms.
-> Aggregate (cost=0.00..863.04 rows=1 width=8)
Rows out: Avg 1.0 rows x 2 workers. Max 1 rows (seg0) with 4.514 ms to end, start offset by 24 ms.
-> Hash Join (cost=0.00..863.04 rows=5000 width=1)
Hash Cond: foos.foo_type = inner.min
Rows out: Avg 999.5 rows x 2 workers. Max 1000 rows (seg0) with 3.464 ms to first row, 4.441 ms to end, start offset by 24 ms.
Executor memory: 1K bytes avg, 1K bytes max (seg0).
Work_mem used: 1K bytes avg, 1K bytes max (seg0). Workfile: (0 spilling, 0 reused)
(seg0) Hash chain length 1.0 avg, 1 max, using 1 of 524341 buckets.
-> Dynamic Table Scan on foos (dynamic scan id: 1) (cost=0.00..431.10 rows=5000 width=4)
Rows out: Avg 999.5 rows x 2 workers. Max 1000 rows (seg0) with 0.382 ms to first row, 0.478 ms to end, start offset by 27 ms.
Partitions scanned: Avg 1.0 (out of 5) x 2 workers. Max 1 parts (seg0).
-> Hash (cost=100.00..100.00 rows=50 width=4)
Rows in: Avg 1.0 rows x 2 workers. Max 1 rows (seg0) with 0.197 ms to end, start offset by 27 ms.
-> Partition Selector for foos (dynamic scan id: 1) (cost=10.00..100.00 rows=50 width=4)
Filter: foos.id = min
Rows out: Avg 1.0 rows x 2 workers. Max 1 rows (seg0) with 0.189 ms to first row, 0.190 ms to end, start offset by 27 ms.
-> Broadcast Motion 1:2 (slice2) (cost=0.00..431.00 rows=2 width=4)
Rows out: Avg 1.0 rows x 2 workers at destination. Max 1 rows (seg0) with 0.015 ms to end, start offset by 27 ms.
-> Aggregate (cost=0.00..431.00 rows=1 width=4)
Rows out: 1 rows with 0.020 ms to end, start offset by 26 ms.
-> Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..431.00 rows=1 width=4)
Rows out: 2 rows at destination with 0.009 ms to first row, 0.010 ms to end, start offset by 26 ms.
-> Aggregate (cost=0.00..431.00 rows=1 width=4)
Rows out: Avg 1.0 rows x 2 workers. Max 1 rows (seg0) with 0.079 ms to end, start offset by 25 ms.
-> Table Scan on favorite_foo_types (cost=0.00..431.00 rows=3 width=4)
Rows out: Avg 2.5 rows x 2 workers. Max 3 rows (seg0) with 0.065 ms to first row, 0.067 ms to end, start offset by 25 ms.
Slice statistics:
(slice0) Executor memory: 414K bytes.
(slice1) Executor memory: 245K bytes avg x 2 workers, 245K bytes max (seg0).
(slice2) Executor memory: 253K bytes (entry db).
(slice3) Executor memory: 8493K bytes avg x 2 workers, 8493K bytes max (seg0). Work_mem: 1K bytes max.
Statement statistics:
Memory used: 128000K bytes
Settings: optimizer=on
Optimizer status: PQO version 1.650
Total runtime: 30.161 ms
Notice that the query plan has "Dynamic Table Scan on foos" and, below that, "Partitions scanned: Avg 1.0 (out of 5)". This means it dynamically eliminated 4 partitions and scanned only 1.
There is also a graphical plan checker on greenplum.org which can help you read the plan.
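As a side note, the plans above were produced with the GPORCA optimizer (note "Settings: optimizer=on" in the plan output), and the dynamic Partition Selector shown here comes from it. A minimal check, if you want to confirm the setting in your own session before re-reading the plan:
set optimizer = on;
explain analyze
select count(*)
from public.foos
where foo_type = (select min(foo_type) from public.favorite_foo_types);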

Related

Get the previous calculated record divided by group - SQL

I'm struggling to build two calculated columns (named balance and avg). My original SQLite table is:
name seq side price qnt
groupA 1 B 30 100
groupA 2 B 36 200
groupA 3 S 23 300
groupA 4 B 30 100
groupA 5 B 54 400
groupB 1 B 70 300
groupB 2 B 84 300
groupB 3 B 74 600
groupB 4 S 90 100
Rational for the 2 calculated new columns:
balance: the first line of each group (seq = 1) must have the same value as qnt. The next records follow the formula below (Excel-style scheme):
if(side="B"; `previous balance record` + `qnt`; `previous balance record` - `qnt`)
avg: the first line of each group (seq = 1) must have the same value as price. The next records follow the formula below (Excel-style scheme):
if(side="B"; ((`price` * `qnt`) + (`previous balance record` * `previous avg record`)) / (`qnt` + `previous balance record`); `previous avg record`)
Example with numbers (the second row of groupA is calculated below):
--> balance: 100 + 200 = 300
--> avg: ((36 * 200) + (100 * 30)) / (200 + 100) = 34
I think this problem must be solved with a recursive CTE because I need the previous record, which is itself being calculated each time.
I don't want to aggregate the groups - my goal is to display every record.
Finally, this is what I expect as the output:
name seq side price qnt balance avg
groupA 1 B 30 100 100 30
groupA 2 B 36 200 300 34
groupA 3 S 23 300 0 34
groupA 4 B 30 100 100 30
groupA 5 B 54 400 500 49,2
groupB 1 B 70 300 300 70
groupB 2 B 84 300 600 77
groupB 3 B 74 600 1200 75,5
groupB 4 S 90 100 1100 75,5
Thank you in advance!
Here is my dbfiddle test: https://dbfiddle.uk/TSarc3Nl
I tried to explain part of the code (in comments) to make things easier.
The balance can be derived from a cumulative sum (using a case expression for when to deduct instead of add).
Then the recursive part just needs a case expression of its own.
WITH
adjust_table AS
(
SELECT
*,
SUM(
CASE WHEN side='B'
THEN qnt
ELSE -qnt
END
)
OVER (
PARTITION BY name
ORDER BY seq
)
AS balance
FROM
mytable
),
recurse AS
(
SELECT adjust_table.*, price AS avg FROM adjust_table WHERE seq = 1
UNION ALL
SELECT
n.*,
CASE WHEN n.side='B'
THEN ((n.price * n.qnt * 1.0) + (s.balance * s.avg)) / (n.qnt + s.balance)
ELSE s.avg
END
AS avg
FROM
adjust_table n
INNER JOIN
recurse s
ON n.seq = s.seq + 1
AND n.name = s.name
)
SELECT
*
FROM
recurse
ORDER BY
name,
seq
https://dbfiddle.uk/mWz945pG
I'm not sure what the avg is meant to be doing, though, so it's possible I got that wrong and/or it could be simplified to not need recursion.
NOTE: Never use a comma (implicit join) to join tables; use explicit JOIN ... ON syntax.
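For illustration only (the tables a and b here are made up, not from the fiddle), the same join written both ways:
-- implicit comma join: avoid
-- SELECT * FROM a, b WHERE a.id = b.a_id;
-- explicit join: prefer
SELECT * FROM a JOIN b ON a.id = b.a_id;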
EDIT: Recursion-less version
Use window functions to accumulate the balance, and also the total spent.
Then use that to enable another window function that accumulates how much 'spend' is being 'recouped' by sales.
Your avg is then the adjusted spend divided by the current balance.
WITH
accumulate AS
(
SELECT
*,
SUM(
CASE WHEN side='B' THEN qnt ELSE -qnt END
)
OVER (
PARTITION BY name
ORDER BY seq
)
AS balance,
1.0
*
SUM(
CASE WHEN side='B' THEN price * qnt END
)
OVER (
PARTITION BY name
ORDER BY seq
)
AS total_spend
FROM
mytable
)
SELECT
*,
(
total_spend
-
SUM(
CASE WHEN side='S'
THEN qnt * total_spend / (balance+qnt)
ELSE 0
END
)
OVER (
PARTITION BY name
ORDER BY seq
)
-- cumulative sum for amount 'recouped' by sales
)
/ NULLIF(balance, 0)
AS avg
FROM
accumulate
https://dbfiddle.uk/O0HEr556
Note: I still don't understand why you're calculating the avg price this way, but it matched your desired results/formulae, without recursion.

MariaDB running total up to N and rows NOT included in its calculation

I have a table which, amongst other columns, has amt and created (timestamp).
I'm trying to:
1. calculate the running total of amt up to N;
2. get all the rows not included in the calculation leading to the sum up to N.
I'm doing this in code, but was wondering if there is a way to get these with SQL, ideally in one query.
Looking around, it's easy to find examples of calculating a running total (e.g. https://stackoverflow.com/a/1290936/400048), but less so to find a running total up to N that then only returns the rows not involved in calculating N.
You can use the window version of the SUM aggregate function to get the running total for each row.
CREATE TABLE TEST (ID BIGINT PRIMARY KEY, AMT INT, CREATED TIMESTAMP);
INSERT INTO TEST VALUES
(1, 1, TIMESTAMP '2000-01-01 00:00:00'),
(2, 2, TIMESTAMP '2000-01-02 00:00:00'),
(3, 1, TIMESTAMP '2000-01-03 00:00:00'),
(4, 3, TIMESTAMP '2000-01-04 00:00:00'),
(5, 5, TIMESTAMP '2000-01-05 00:00:00'),
(6, 1, TIMESTAMP '2000-01-07 00:00:00');
SELECT ID, AMT, SUM(AMT) OVER (ORDER BY CREATED) RT, CREATED FROM TEST ORDER BY CREATED;
> ID AMT RT CREATED
> -- --- -- -------------------
> 1 1 1 2000-01-01 00:00:00
> 2 2 3 2000-01-02 00:00:00
> 3 1 4 2000-01-03 00:00:00
> 4 3 7 2000-01-04 00:00:00
> 5 5 12 2000-01-05 00:00:00
> 6 1 13 2000-01-07 00:00:00
Then you can use a non-standard QUALIFY clause in H2 or a subquery (in both MariaDB and H2) to filter out rows below the limit.
If N is a running total limit and by “rows not included in the calculation” you mean rows above the limit, the queries will look like these:
-- Simple non-standard query for H2
SELECT ID, AMT, SUM(AMT) OVER (ORDER BY CREATED) RT, CREATED FROM TEST
QUALIFY RT > 10 ORDER BY CREATED;
-- Equivalent standard query with subquery for MariaDB, H2, and many others
SELECT * FROM (
SELECT ID, AMT, SUM(AMT) OVER (ORDER BY CREATED) RT, CREATED FROM TEST
) T WHERE RT > 10 ORDER BY CREATED;
> ID AMT RT CREATED
> -- --- -- -------------------
> 5 5 12 2000-01-05 00:00:00
> 6 1 13 2000-01-07 00:00:00
RT - AMT in the first row here is a running total of all previous rows. You can select it separately, if you wish:
-- Non-standard query for H2
SELECT SUM(AMT) OVER (ORDER BY CREATED) RT FROM TEST
QUALIFY RT < 10 ORDER BY CREATED DESC FETCH FIRST ROW ONLY;
-- Non-standard query for MariaDB or H2
SELECT RT FROM (
SELECT ID, AMT, SUM(AMT) OVER (ORDER BY CREATED) RT, CREATED FROM TEST
) T WHERE RT < 10 ORDER BY CREATED DESC LIMIT 1;
-- Standard query for H2 and others (but not for MariaDB)
SELECT RT FROM (
SELECT ID, AMT, SUM(AMT) OVER (ORDER BY CREATED) RT, CREATED FROM TEST
) T WHERE RT < 10 ORDER BY CREATED DESC FETCH FIRST ROW ONLY;
> RT
> --
> 7
If you meant something else, the QUALIFY or WHERE criteria will be different.
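For example, if instead you want the rows that were included in the running total up to N, flipping the comparison in the portable subquery form is enough (a sketch, reusing the same limit of 10):
SELECT * FROM (
SELECT ID, AMT, SUM(AMT) OVER (ORDER BY CREATED) RT, CREATED FROM TEST
) T WHERE RT <= 10 ORDER BY CREATED;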

Creating even ranges based on values in an oracle table

I have a big table which is 100k rows in size, and the PRIMARY KEY is of datatype NUMBER. The data in this column is populated using a random number generator.
So my question is: is there a SQL query that can partition the table evenly by ranges of values? E.g., if my column values are like this:
1
2
3
4
5
6
7
8
9
10
and I would like them broken into three partitions, I would expect an output like this:
Range 1 1-3
Range 2 4-7
Range 3 8-10
It sounds like you want the WIDTH_BUCKET() function; see the Oracle documentation for more details.
This query will give you the start and end range for a table of 1250 rows split into 20 buckets based on id:
with bkt as (
select id
, width_bucket(id, 1, 1251, 20) as id_bucket
from t23
)
select id_bucket
, min(id) as bkt_start
, max(id) as bkt_end
, count(*)
from bkt
group by id_bucket
order by 1
;
The two middle parameters specify the min and max values; the last parameter specifies the number of buckets. The output is the rows between the minimum and maximum bounds, split as evenly as possible into the specified number of buckets. Be careful with the min and max parameters; I've found poorly chosen bounds can have an odd effect on the split.
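Applied to the 10-row example in the question with three buckets, a sketch would look like the following (your_table is a stand-in name; note that WIDTH_BUCKET draws the boundaries at equal widths, so here it returns 1-4 / 5-7 / 8-10 rather than the exact 1-3 / 4-7 / 8-10 shown in the question):
with bkt as (
select id
, width_bucket(id, 1, 11, 3) as id_bucket
from your_table
)
select id_bucket
, min(id) as bkt_start
, max(id) as bkt_end
, count(*)
from bkt
group by id_bucket
order by 1
;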
This solution works without the WIDTH_BUCKET function. While it is more verbose and certainly less efficient, it will split the data as evenly as possible, even if some ID values are missing.
CREATE TABLE t AS
SELECT rownum AS id
FROM dual
CONNECT BY level <= 10;
WITH
data AS (
SELECT id, rownum as row_num
FROM t
),
total AS (
SELECT count(*) AS total_rows
FROM data
),
parts AS (
SELECT rownum as part_no, total.total_rows, total.total_rows / 3 as part_rows
FROM dual, total
CONNECT BY level <= 3
),
bounds AS (
SELECT parts.part_no,
parts.total_rows,
parts.part_rows,
COALESCE(LAG(data.row_num) OVER (ORDER BY parts.part_no) + 1, 1) AS start_row_num,
data.row_num AS end_row_num
FROM data
JOIN parts
ON data.row_num = ROUND(parts.part_no * parts.part_rows, 0)
)
SELECT bounds.part_no, d1.ID AS start_id, d2.ID AS end_id
FROM bounds
JOIN data d1
ON d1.row_num = bounds.start_row_num
JOIN data d2
ON d2.row_num = bounds.end_row_num
ORDER BY bounds.part_no;
PART_NO START_ID END_ID
---------- ---------- ----------
1 1 3
2 4 7
3 8 10

column delta values

I am planning to store query data in an sqlite3 database.
I have these fields in sqlite3
UNIX_EPOCH, CUMULATIVE_QUERY_RATE
1452128581, 150
1452128582, 190
1452128583, 220
1452128584, 270
I want to get a queries-per-second column as below:
QPS
0
40
30
50
How do I do this in sqlite3?
You simply have to subtract the value of the previous second:
SELECT unix_epoch,
(SELECT T1.cumulative_query_rate - T2.cumulative_query_rate
FROM SuperSecretTableName AS T2
WHERE T1.unix_epoch - 1 = T2.unix_epoch
) AS qps
FROM SuperSecretTableName AS T1;
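If your SQLite is 3.25 or newer (window function support), a LAG-based variant is an alternative sketch (not the approach above): it avoids the correlated subquery, does not require the epochs to be exactly one second apart, and returns 0 for the first row instead of NULL:
SELECT unix_epoch,
cumulative_query_rate
- LAG(cumulative_query_rate, 1, cumulative_query_rate)
OVER (ORDER BY unix_epoch) AS qps
FROM SuperSecretTableName;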

how to improve Oracle select performance?

The query below takes more than 1 minute; how can I improve the performance? A full scan is happening on both tables. How can I avoid it?
query plan:
SELECT STATEMENT ALL_ROWSCost: 62 Bytes: 14,355 Cardinality: 45
3 HASH JOIN Cost: 62 Bytes: 14,355 Cardinality: 45
1 TABLE ACCESS FULL TABLE SYSADM.POSITIONS Cost: 9 Bytes: 520 Cardinality: 4
2 TABLE ACCESS FULL TABLE SYSADM.PORTCONSUMPTION Cost: 52 Bytes: 797,202 Cardinality: 4,218
SELECT
a.Consumption AS Consumption ,
a.Cost AS Cost ,
a.CreatedBy AS CreatedBy ,
a.CreatedDate AS CreatedDate ,
a.UpdatedBy AS UpdatedBy ,
a.UpdatedDate AS UpdatedDate
FROM PortConsumption a
JOIN Positions b
ON a.PortRotationId = b.Id
WHERE b.VoyageId ='82A042031E1B4C38A9832A6678A695A4';
Positions (115,970 records)
Id - Primary key (indexed)
VoyageId - indexed
PortConsumption (1,291,000 records)
Id - Primary key (indexed)
PortRotationId - indexed
After executing:
dbms_stats.gather_table_stats ('SYSADM', 'POSITIONS');
dbms_stats.gather_table_stats ('SYSADM', 'PORTCONSUMPTION');
the full scan is no longer happening, but performance is still about the same; it takes 50 secs.
Plan
SELECT STATEMENT ALL_ROWSCost: 20 Bytes: 16,536 Cardinality: 52
6 NESTED LOOPS
4 NESTED LOOPS Cost: 20 Bytes: 16,536 Cardinality: 52
2 TABLE ACCESS BY INDEX ROWID TABLE SYSADM.POSITIONS Cost: 5 Bytes: 520 Cardinality: 4
1 INDEX RANGE SCAN INDEX SYSADM.INX_POSITIONS_VOYAGEID Cost: 3 Cardinality: 4
3 INDEX RANGE SCAN INDEX SYSADM.INX_PORTCONS_PORTROTID Cost: 2 Cardinality: 12
5 TABLE ACCESS BY INDEX ROWID TABLE SYSADM.PORTCONSUMPTION Cost: 4 Bytes: 2,256 Cardinality: 12
You need to gather stats on the tables, because Oracle currently thinks POSITIONS has 4 rows (not 115,970) and PORTCONSUMPTION has 4,218 rows (not 1.2 million), and hence that full scans of both are the best way to answer the query.
This code will gather stats on the 2 tables using default settings:
begin
dbms_stats.gather_table_stats ('SYSADM', 'POSITIONS');
dbms_stats.gather_table_stats ('SYSADM', 'PORTCONSUMPTION');
end;
See DBMS_STATS documentation for more details on how to gather stats.
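If you want more control than the defaults, here is a sketch with a couple of commonly used options (the specific parameter choices are illustrative, not something this answer prescribes):
begin
dbms_stats.gather_table_stats(
ownname => 'SYSADM',
tabname => 'PORTCONSUMPTION',
estimate_percent => dbms_stats.auto_sample_size,
cascade => true -- also gather statistics on the table's indexes
);
end;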
