postgres: Joining a small table with a large table - postgresql-9.1

Postgres server version: 9.1.9
explain analyze
select * from A, B where A.groupid = B.groupid;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=1.23..8.64 rows=2 width=104) (actual time=0.076..204.212 rows=3 loops=1)
   Merge Cond: (A.groupid = B.groupid)
   ->  Index Scan using A_pkey on A  (cost=0.00..68144.37 rows=1065413 width=88) (actual time=0.008..115.366 rows=120938 loops=1)
   ->  Sort  (cost=1.03..1.03 rows=2 width=16) (actual time=0.013..0.016 rows=3 loops=1)
         Sort Key: B.groupid
         Sort Method: quicksort  Memory: 25kB
         ->  Seq Scan on B  (cost=0.00..1.02 rows=2 width=16) (actual time=0.002..0.004 rows=3 loops=1)
 Total runtime: 204.257 ms
(8 rows)
Table A has 1 million+ rows. Table B has 3 rows. In the actual production query there are other WHERE clauses on table A that reduce the number of rows to 15k+, but the query still takes more than 50 ms and appears in our slow query logs.
Is there any way to improve performance here? I suspect the index scan on the larger table is causing the slowness.
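Since B has only 3 rows, the plan you usually want is a nested loop that probes A by groupid. A minimal sketch, assuming groupid is not already the leading column of a usable index on A (a_groupid_idx is a made-up name):
-- skip this if groupid already leads an existing index on A
CREATE INDEX a_groupid_idx ON A (groupid);
ANALYZE A;

-- with only 3 rows in B, the planner can now probe A once per groupid
-- instead of walking A in primary-key order for the merge join
EXPLAIN ANALYZE
SELECT * FROM A JOIN B ON A.groupid = B.groupid;
For the production query, a composite index whose leading columns match the extra WHERE clauses on A may help more than an index on groupid alone.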

Related

Can I speed up calculations between R and Sqlite by using data.tables?

I have a SQLite database of about 1.4 million rows and 16 columns.
I have to run an operation on 80,000 ids:
Get all rows associated with that id
Convert to R date objects and sort by date
Calculate the difference between the 2 most recent dates
For each id I have been querying SQLite from R using dbSendQuery and dbFetch for step 1, while steps 2 and 3 are done in R. Is there a faster way? Would it be faster or slower to load the entire SQLite table into a data.table?
It heavily depends on how you are working on that problem.
Normally, loading the whole result into memory and then doing the operations there will be faster, from what I have experienced and seen in benchmarks, though I cannot show you one right now. Logically it also makes sense, because otherwise you have to repeat several operations multiple times on multiple data.frames. Fetching 80k rows in one go is pretty fast, faster than, say, three separate fetches of ~26xxx rows.
However, you could also have a look at the parallel package and use multiple cores on your machine to load subsets of your data and process them in parallel, each subset on its own core.
Here you can find information on how to do this:
http://jaehyeon-kim.github.io/2015/03/Parallel-Processing-on-Single-Machine-Part-I
If you're doing all that in R and fetching rows from the database 80,000 times in a loop, you'll probably have better results doing it all in one go in SQLite instead.
Given a skeleton table like:
CREATE TABLE data(id INTEGER, timestamp TEXT);
INSERT INTO data VALUES (1, '2019-07-01'), (1, '2019-06-25'), (1, '2019-06-24'),
(2, '2019-04-15'), (2, '2019-04-14');
CREATE INDEX data_idx_id_time ON data(id, timestamp DESC);
a query like:
SELECT id
     , julianday(first_ts)
       - julianday((SELECT max(d2.timestamp)
                    FROM data AS d2
                    WHERE d.id = d2.id AND d2.timestamp < d.first_ts)) AS days_difference
FROM (SELECT id, max(timestamp) AS first_ts FROM data GROUP BY id) AS d
ORDER BY id;
will give you
id          days_difference
----------  ---------------
1           6.0
2           1.0
An alternative for modern versions of SQLite (3.25 or newer) (EDIT: on a test database with 16 million rows and 80,000 distinct ids, it runs considerably slower than the one above, so you don't want to actually use it):
WITH cte AS
  (SELECT id, timestamp
        , lead(timestamp, 1) OVER id_by_ts AS next_ts
        , row_number() OVER id_by_ts AS rn
   FROM data
   WINDOW id_by_ts AS (PARTITION BY id ORDER BY timestamp DESC))
SELECT id, julianday(timestamp) - julianday(next_ts) AS days_difference
FROM cte
WHERE rn = 1
ORDER BY id;
(The index is essential for performance in both versions. You probably also want to run ANALYZE on the table at some point after it is populated and your index(es) are created.)
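As a quick sanity check for the skeleton above (the exact plan wording varies by SQLite version, so treat the comment as illustrative):
-- refresh planner statistics once the table and index are populated
ANALYZE;
-- confirm the grouping subquery is driven by data_idx_id_time rather than a full scan
EXPLAIN QUERY PLAN
SELECT id, max(timestamp) AS first_ts FROM data GROUP BY id;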

Why does sqlite rescan the table with distinct?

I need to get the distinct elements of ref and alt. I have a very efficient query, but when I add distinct it rescans the base table. Since I have a temp table, shouldn't it simply use that as the source of data?
sqlite> explain query plan
   ...> select t1.ref, t1.alt from (SELECT * from Sample_szes where str_id = 'STR_832206') as t1;
selectid|order|from|detail
1|0|0|SEARCH TABLE vcfBase AS base USING INDEX vcfBase_strid_idx (str_id=?) (~10 rows)
1|1|1|SEARCH TABLE vcfhomozyg AS hzyg USING INDEX homozyg_strid_idx (str_id=?) (~10 rows)
2|0|0|SEARCH TABLE vcfBase AS base USING INDEX vcfBase_strid_idx (str_id=?) (~10 rows)
2|1|1|SEARCH TABLE vcfAlt AS alt USING INDEX vcfAlt_strid_idx (str_id=?) (~2 rows)
2|2|2|SEARCH TABLE altGT AS gt USING INDEX altGT_strid_idx (str_id=?) (~2 rows)
0|0|0|COMPOUND SUBQUERIES 1 AND 2 (UNION ALL)
Add distinct and it rescans the large base table.
sqlite> explain query plan
   ...> select distinct t1.ref, t1.alt from (SELECT * from Sample_szes where str_id = 'STR_832206') as t1;
selectid|order|from|detail
2|0|0|SCAN TABLE vcfBase AS base (~1000000 rows)
2|1|1|SEARCH TABLE vcfhomozyg AS hzyg USING INDEX homozyg_strid_idx (str_id=?) (~10 rows)
3|0|0|SCAN TABLE vcfBase AS base (~1000000 rows)
3|1|1|SEARCH TABLE vcfAlt AS alt USING INDEX vcfAlt_strid_idx (str_id=?) (~2 rows)
3|2|2|SEARCH TABLE altGT AS gt USING INDEX altGT_strid_idx (str_id=?) (~2 rows)
1|0|0|COMPOUND SUBQUERIES 2 AND 3 (UNION ALL)
0|0|0|SCAN SUBQUERY 1 (~1400000 rows)
0|0|0|USE TEMP B-TREE FOR DISTINCT
You should create a composite index on the ref and alt columns. That index could then be used; otherwise the temporary B-TREE (an index built on the fly) is created, which requires scanning and sorting all the data to build it. A sketch of such an index follows the quoted documentation below.
I believe the explanation is as per the documentation:
If a SELECT query contains an ORDER BY, GROUP BY or DISTINCT clause, SQLite may need to use a temporary b-tree structure to sort the output rows. Or, it might use an index. Using an index is almost always much more efficient than performing a sort.
If a temporary b-tree is required, a record is added to the EXPLAIN QUERY PLAN output with the "detail" field set to a string value of the form "USE TEMP B-TREE FOR xxx", where xxx is one of "ORDER BY", "GROUP BY" or "DISTINCT". For example:
sqlite> EXPLAIN QUERY PLAN SELECT c, d FROM t2 ORDER BY c;
QUERY PLAN
|--SCAN TABLE t2
`--USE TEMP B-TREE FOR ORDER BY
In this case using the temporary b-tree can be avoided by creating an index on t2(c), as follows:
sqlite> CREATE INDEX i4 ON t2(c);
sqlite> EXPLAIN QUERY PLAN SELECT c, d FROM t2 ORDER BY c;
QUERY PLAN
`--SCAN TABLE t2 USING INDEX i4
EXPLAIN QUERY PLAN - 1.2. Temporary Sorting B-Trees
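Applied to this schema, the suggestion would look roughly like the sketch below. It assumes ref and alt live on vcfBase (the table being rescanned); the index name is made up, and whether the temp B-tree actually disappears depends on how Sample_szes is defined:
-- hypothetical: adjust the table name to wherever ref and alt actually live
CREATE INDEX vcfBase_ref_alt_idx ON vcfBase(ref, alt);
ANALYZE;
-- re-check the plan; the goal is to lose the "USE TEMP B-TREE FOR DISTINCT" step
EXPLAIN QUERY PLAN
SELECT DISTINCT t1.ref, t1.alt
FROM (SELECT * FROM Sample_szes WHERE str_id = 'STR_832206') AS t1;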
I think I might have found the answer. On my Mac I have the following version of SQLite:
SQLite version 3.19.3 2017-06-27 16:48:08
sqlite> explain query plan
...> select distinct t1.ref, t1.alt from (SELECT * from Sample_szes where str_id = 'STR_832206') as t1;
2|0|1|SEARCH TABLE vcfhomozyg AS hzyg USING INDEX homozyg_strid_idx (str_id=?)
2|1|0|SEARCH TABLE vcfBase AS base USING INDEX vcfBase_strid_idx (str_id=?)
3|0|1|SEARCH TABLE vcfAlt AS alt USING INDEX vcfAlt_strid_idx (str_id=?)
3|1|0|SEARCH TABLE vcfBase AS base USING INDEX vcfBase_strid_idx (str_id=?)
3|2|2|SEARCH TABLE altGT AS gt USING INDEX altGT_strid_idx (str_id=?)
1|0|0|COMPOUND SUBQUERIES 2 AND 3 (UNION ALL)
0|0|0|SCAN SUBQUERY 1
0|0|0|USE TEMP B-TREE FOR DISTINCT

BigQuery export to csv and load in R for huge tables

I have a 100 GB table which I want to process in R. When I export it to CSV I get 500 CSV files; when I read them into R as data tables and bind them, I get a huge data table that can't be saved or loaded, even when I increase the memory of the virtual instance R is installed on.
I wanted to try a different approach: split the original table, export the pieces, and then process each table separately in R. The problem is that I don't want the split to "break" in the middle of some grouping. For example, my key variable is "visit", and each visit may have several rows. I don't want a visit to end up split across different sub-tables (because all my processing in R uses visit as the grouping variable of the data table). What is the best way to do this? I tried ordering the visit ids by time, exporting only their names to a separate CSV, etc., but all the ORDER BY attempts ended with an error (not enough resources). The table currently contains more than 100M rows, with 64 variables.
I wanted to try a different approach: split the original table …
The problem is that I don't want the split to "break" in the middle of some grouping.
Below is how to identify batches such that all rows for the same visitId end up in the same batch.
For each batch, the minimum and maximum visitId are identified, so you can then extract only the rows whose visitId lies between those min and max values, thus controlling the size of the data to be processed (see the extraction sketch after the Standard SQL version below).
1 – Batching by number of rows
Replace 1000000 below with whatever you want the batch size to be, in terms of number of rows
#legacySQL
SELECT
  batch,
  SUM(size) AS size,
  COUNT(visitId) AS visitids_count,
  MIN(visitId) AS visitId_min,
  MAX(visitId) AS visitId_max
FROM (
  SELECT
    visitId,
    size,
    INTEGER(CEIL(total/1000000)) AS batch
  FROM (
    SELECT
      visitId,
      size,
      SUM(size) OVER(ORDER BY visitId) AS total
    FROM (
      SELECT visitId, COUNT(1) AS size
      FROM [yourproject:yourdataset.yourtable]
      GROUP BY visitId
    )
  )
)
GROUP BY batch
2 – Batching by batch size in bytes
Replace 1000000000 below with whatever you want the batch size to be, in bytes, and replace 123 below with the estimated average size of one row in bytes
#legacySQL
SELECT
  batch,
  SUM(size) AS size,
  COUNT(visitId) AS visitids_count,
  MIN(visitId) AS visitId_min,
  MAX(visitId) AS visitId_max
FROM (
  SELECT
    visitId,
    size,
    INTEGER(CEIL(total/1000000000)) AS batch
  FROM (
    SELECT
      visitId,
      size,
      SUM(size) OVER(ORDER BY visitId) AS total
    FROM (
      SELECT visitId, SUM(123) AS size
      FROM [yourproject:yourdataset.yourtable]
      GROUP BY visitId
    )
  )
)
GROUP BY batch
The above helps you prepare for properly splitting your original table using each batch's min and max values.
Hope this helps you proceed further.
Note: the above assumes a reasonably even distribution of rows per visitId and a relatively large number of rows in the table (as in your example), so the batches will be roughly evenly sized.
Note 2: I realized I wrote this quickly in Legacy SQL, so below is a version in Standard SQL in case you want to migrate or are already using it.
#standardSQL
SELECT
  batch,
  SUM(size) AS size,
  COUNT(visitId) AS visitids_count,
  MIN(visitId) AS visitId_min,
  MAX(visitId) AS visitId_max
FROM (
  SELECT
    visitId,
    size,
    CAST(CEIL(total/1000000) AS INT64) AS batch
  FROM (
    SELECT
      visitId,
      size,
      SUM(size) OVER(ORDER BY visitId) AS total
    FROM (
      SELECT visitId, COUNT(1) AS size
      FROM `yourproject.yourdataset.yourtable`
      GROUP BY visitId
    )
  )
)
GROUP BY batch
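Once you have the batch boundaries, extracting one batch is just a range filter on visitId. A minimal sketch in Standard SQL; @visitid_min and @visitid_max are hypothetical query parameters standing in for the visitId_min / visitId_max values returned for the batch being exported:
#standardSQL
-- @visitid_min / @visitid_max: supply per batch, e.g. as query parameters
-- or by substituting the literal values from the batching query above
SELECT *
FROM `yourproject.yourdataset.yourtable`
WHERE visitId BETWEEN @visitid_min AND @visitid_max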

Teradata Aggregation computation locally vs globally

I have two tables, Table1 and Table2, both having col1, col2, col3 and col4 as their primary index.
I join these two tables and do a GROUP BY on a set of columns which includes the PI of both tables.
Can someone tell me why in the explain plan I get "Aggregate Intermediate Results are computed globally" rather than locally? My understanding is that when the GROUP BY columns include all the PI columns, aggregate results are computed locally rather than globally.
select
    A.col1
    ,A.col2
    ,A.col3
    ,A.col4
    ,col5
    ,col6
    ,col7
    ,col8
    ,col9
    ,SUM(col10)
    ,COUNT(col11)
from
    table1 A
left outer join
    table2 B
on  A.col1 = B.col1
and A.col2 = B.col2
and A.col3 = B.col3
and A.col4 = B.col4
group by A.col1,A.col2,A.col3,A.col4,col5,col6,col7,col8,col9
Below is the Explain plan for the Query
1) First, we lock a distinct DATEBASE_NAME."pseudo table" for read on a
RowHash to prevent global deadlock for DATEBASE_NAME.S.
2) Next, we lock a distinct DATEBASE_NAME."pseudo table" for write on a
RowHash to prevent global deadlock for
DATEBASE_NAME.TARGET_TABLE.
3) We lock a distinct DATEBASE_NAME."pseudo table" for read on a RowHash
to prevent global deadlock for DATEBASE_NAME.E.
4) We lock DATEBASE_NAME.S for read, we lock
DATEBASE_NAME.TARGET_TABLE for write, and we lock
DATEBASE_NAME.E for read.
5) We do an all-AMPs JOIN step from DATEBASE_NAME.S by way of a RowHash
match scan with no residual conditions, which is joined to
DATEBASE_NAME.E by way of a RowHash match scan. DATEBASE_NAME.S and
DATEBASE_NAME.E are left outer joined using a merge join, with
condition(s) used for non-matching on left table ("(NOT
(DATEBASE_NAME.S.col1 IS NULL )) AND ((NOT
(DATEBASE_NAME.S.col2 IS NULL )) AND ((NOT
(DATEBASE_NAME.S.col3 IS NULL )) AND (NOT
(DATEBASE_NAME.S.col4 IS NULL ))))"), with a join condition of (
"(DATEBASE_NAME.S.col1 = DATEBASE_NAME.E.col1) AND
((DATEBASE_NAME.S.col2 = DATEBASE_NAME.E.col2) AND
((DATEBASE_NAME.S.col3 = DATEBASE_NAME.E.col3) AND
(DATEBASE_NAME.S.col4 = DATEBASE_NAME.E.col4 )))"). The input
table DATEBASE_NAME.S will not be cached in memory. The result goes
into Spool 3 (all_amps), which is built locally on the AMPs. The
result spool file will not be cached in memory. The size of Spool
3 is estimated with low confidence to be 675,301,664 rows (
812,387,901,792 bytes). The estimated time for this step is 3
minutes and 37 seconds.
6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
way of an all-rows scan , grouping by field1 (
DATEBASE_NAME.S.col1 ,DATEBASE_NAME.S.col2
,DATEBASE_NAME.S.col3 ,DATEBASE_NAME.S.col4
,DATEBASE_NAME.E.col5
,DATEBASE_NAME.S.col6 ,DATEBASE_NAME.S.col7
,DATEBASE_NAME.S.col8 ,DATEBASE_NAME.S.col9). Aggregate
Intermediate Results are computed globally, then placed in Spool 4.
The aggregate spool file will not be cached in memory. The size
of Spool 4 is estimated with low confidence to be 506,476,248 rows
(1,787,354,679,192 bytes). The estimated time for this step is 1
hour and 1 minute.
7) We do an all-AMPs MERGE into DATEBASE_NAME.TARGET_TABLE
from Spool 4 (Last Use). The size is estimated with low
confidence to be 506,476,248 rows. The estimated time for this
step is 33 hours and 12 minutes.
8) We spoil the parser's dictionary cache for the table.
9) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> No rows are returned to the user as the result of statement 1.
If you aggregate using just col1, col2, col3, col4, then it should be aggregated locally.
More details at this URL:
http://www.teradataforum.com/teradata/20040526_133730.htm
I believe that is because of the intermediate spool: you are grouping by columns from that spool, not from the original table. I was able to compute the aggregate intermediate results locally using a volatile table.
Essentially, what happens in this case is that I take the spool from step 5, give it a name and enforce a PI on it. Since the PI of the volatile table is the same as that of the initial tables, generating the volatile table is also a local-AMP operation.
CREATE VOLATILE TABLE x AS
(
    SELECT
        A.col1
        ,A.col2
        ,A.col3
        ,A.col4
        ,col5
        ,col6
        ,col7
        ,col8
        ,col9
        --,SUM(col10)
        --,COUNT(col11)
    from
        table1 A
    left outer join
        table2 B
    on  A.col1 = B.col1
    and A.col2 = B.col2
    and A.col3 = B.col3
    and A.col4 = B.col4
    --group by A.col1,A.col2,A.col3,A.col4,col5,col6,col7,col8,col9
)
WITH DATA PRIMARY INDEX (col1, col2, col3, col4)
ON COMMIT PRESERVE ROWS   -- keep the rows available for the follow-up SELECT
;
SELECT
    col1
    ,col2
    ,col3
    ,col4
    ,col5
    ,col6
    ,col7
    ,col8
    ,col9
    ,SUM(col10)
    ,COUNT(col11)
from
    x
GROUP BY
    col1,col2,col3,col4,col5,col6,col7,col8,col9

Group by ranges in SQLite

I have a SQLite table which contains a numeric field field_name. I need to group by ranges of this column, something like this: SELECT CAST(field_name/100 AS INT), COUNT(*) FROM table GROUP BY CAST(field_name/100 AS INT), but including ranges which have no values (COUNT for them should be 0). I can't figure out how to write such a query.
You can do this by using a join and (though kludgy) an extra table.
The extra table would contain each of the values you want a row for in the response to your query (this not only fills in the missing CAST(field_name/100 AS INT) values between your returned values, but also lets you expand the range: if your current groups were 5, 6, 7 you could include 0 through 10).
In other flavors of SQL you'd be able to right join or full outer join, and you'd be on your way. Alas, SQLite doesn't offer these.
Accordingly, we'll use a cross join (join everything to everything) and then filter. If you've got a relatively small database or a small number of groups, you're in good shape. If you have large numbers of both, this will be a very intensive way to go about this (the cross join result will have #ofRowsOfData * #ofGroups rows, so watch out).
Example:
TABLE: groups_for_report
desired_group
-------------
0
1
2
3
4
5
6
Table: data
fieldname  other_field
---------  --------------
250        somestuff
230        someotherstuff
600        stuff
you would use a query like
select groups_for_report.desired_group, count(data.fieldname)
from data
cross join groups_for_report
where CAST(fieldname/100.0 AS INT)=desired_group
group by desired_group;
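If the cross join becomes too expensive, one alternative sketch (not part of the original answer) drives the query from the helper rows with a LEFT JOIN, which SQLite does support, and generates those rows on the fly with a recursive CTE (SQLite 3.8.3 or newer); wanted_groups and the 0-6 range are illustrative:
WITH RECURSIVE wanted_groups(desired_group) AS (
    SELECT 0
    UNION ALL
    SELECT desired_group + 1 FROM wanted_groups WHERE desired_group < 6
)
SELECT g.desired_group, COUNT(d.fieldname) AS cnt
FROM wanted_groups AS g
LEFT JOIN data AS d
       ON CAST(d.fieldname/100.0 AS INT) = g.desired_group
GROUP BY g.desired_group
ORDER BY g.desired_group;
Groups with no matching rows come back with a count of 0, because COUNT(d.fieldname) ignores the NULLs produced by the outer join.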
