Teradata performance impacted due to count distinct

There is no join in the query; it is a simple query with two COUNT(DISTINCT ...) expressions, yet it is consuming more than 9k CPU.
I have collected the necessary stats but am unable to reduce the CPU. Can you please suggest some good ways to reduce the impact CPU?

I think the target table is a SET table, so your query is spending a lot of CPU on duplicate row elimination.
1) Test your select query on a MULTISET table.
insert into multiset_table
select count(distinct col1) from source_table;
I also believe that your primary index is skewed, which is the reason for the high impact CPU.
2) Make sure your primary index is unique.
select hashamp(hashbucket(hashrow(<primary index columns>))), count(*) (bigint) cnt from target_table group by 1 order by 2 desc;
If the cnt column is not distributed evenly, then change the primary index of the table to more unique columns (see the sketch below).
Only two things can cause a merge to run slowly:
1) Target table is SET table
2) Primary index of the target table is badly skewed
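As a rough illustration only (the table and column names here are placeholders, not from the question), you could rebuild the data into a MULTISET volatile table with a more unique primary index and compare the CPU of the same insert/select:
-- sketch: a MULTISET copy of the target with a less skewed primary index
create multiset volatile table tmp_target as (
  select * from target_table
) with no data
primary index (a_more_unique_col)   -- placeholder: pick a column (or columns) with high uniqueness
on commit preserve rows;

insert into tmp_target
select * from source_table;         -- rerun the original insert/select here and compare CPU in DBQL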

Related

sqlite group-by with sort-by-desc does not work as expected

I have an aggregate query that is properly indexed to return ordered results fast (a simple index scan). This works as expected when the ordering is ascending (ASC), but reversing the order (DESC) results in SQLite creating a TEMP B-TREE.
SQLite version 3.26.0
CREATE TABLE t1(x,y);
INSERT INTO t1 VALUES(1,1);
INSERT INTO t1 VALUES(1,2);
INSERT INTO t1 VALUES(2,1);
CREATE INDEX ix1 ON t1(x,y);
EXPLAIN QUERY PLAN SELECT x,max(y) FROM t1 GROUP BY x ORDER BY x;
EXPLAIN QUERY PLAN SELECT x,max(y) FROM t1 GROUP BY x ORDER BY x DESC; -- This query constructs a TEMP B-TREE, why?
When running the above code you will see query 1 simply running an index scan, while query 2, in addition to running an index scan, also builds a TEMP B-TREE to order the result, destroying performance.
The index created supports traversal in both directions, so I would expect the same performance for both ASC and DESC ordering.
Is this a known limitation in SQLite and aggregates, or am I expecting/doing something wrong?
The code for looping over an index goes backwards only when needed. For implementing GROUP BY itself, going backwards is never needed, so it is never tried.
In reaction to your report on the sqlite-users mailing list, SQLite version 3.30.0 will have code to handle this case:
/* The GROUP BY processing doesn't care whether rows are delivered in
** ASC or DESC order - only that each group is returned contiguously.
** So set the ASC/DESC flags in the GROUP BY to match those in the
** ORDER BY to maximize the chances of rows being delivered in an
** order that makes the ORDER BY redundant. */
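For versions before 3.30.0, one possible workaround (a sketch, not part of the original answer) is to do the grouping in a subquery so the outer ORDER BY only has to sort one row per group:
-- the inner query scans ix1 to build the groups; the outer sort touches one row per x
SELECT x, my FROM (SELECT x, max(y) AS my FROM t1 GROUP BY x) ORDER BY x DESC;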

How can I understand the sqlite query plan?

I executed a query on SQLite and the relevant part of the query plan is
0|1|5|SCAN TABLE edges AS e1 (~250000 rows)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE dihedral USING AUTOMATIC COVERING INDEX (TYPE=? AND EDGE=?) (~7 rows)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 3
3|0|0|SEARCH TABLE bounds USING AUTOMATIC COVERING INDEX (FACE=? AND EDGE=?) (~7 rows)
where the relevant part of the WHERE clause is
exists (select dihedral.edge from dihedral where dihedral.type=2 and dihedral.edge=e1.edge) and
exists (select bounds.edge from bounds where bounds.face=f1.face and bounds.edge=e1.edge) and
I understand this is not a highly efficient query; I just want to increase the performance.
This is my guess:
There is no subquery flattening, right?
The two EXISTS subqueries introduce correlated subqueries, and they are actually executed as indexed nested loops, right?
Reading the query: the tables dihedral and bounds are independent of each other, and both are correlated with the outer edges table, so the computational complexity would be O(n^2) with no index. However, as there are covering indexes, the performance should be much better, right? I found on the wiki that an index lookup is O(log(N)) or even better, so the overall performance should be O(n*log(N)), is this right?
Could anyone help me understand what is happening? Thanks.
SQLite does support subquery flattening, but it is not possible for an EXISTS subquery like the ones here.
The AUTOMATIC shows that the database creates a temporary index just for this query.
This is a strong indication that you should create these indexes permanently:
CREATE INDEX dihedral_type_edge ON dihedral(type, edge);
CREATE INDEX bounds_face_edge ON bounds(face, edge);
The outer query goes through all edge rows, and for each row, searches in the indexes.
This would result in O(edge * (log(dihedral) + log(bounds))).
The temporary index creation requires sorting these tables, so the entire runtime ends up being O(dihedral*log(dihedral) + bounds*log(bounds) + edge*(log(dihedral)+log(bounds))).
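To see the difference, you could compare the plan before and after creating a permanent index. A minimal, hypothetical reproduction with one-column stand-ins (not the asker's real schema):
CREATE TABLE edges(edge INTEGER);
CREATE TABLE dihedral(type INTEGER, edge INTEGER);
-- without a permanent index, the plan typically shows an AUTOMATIC COVERING INDEX
EXPLAIN QUERY PLAN
SELECT * FROM edges AS e1
WHERE EXISTS (SELECT 1 FROM dihedral WHERE dihedral.type=2 AND dihedral.edge=e1.edge);
-- after creating the permanent index, the same plan should use it instead
CREATE INDEX dihedral_type_edge ON dihedral(type, edge);
EXPLAIN QUERY PLAN
SELECT * FROM edges AS e1
WHERE EXISTS (SELECT 1 FROM dihedral WHERE dihedral.type=2 AND dihedral.edge=e1.edge);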

What is the fastest way of selecting by a list of strings in sqlite database?

I have a database with roughly the following structure:
table1 (name) -< table2 -< table3 (score)
where -< means a 1-to-many relationship. What I need to do is, for every string in a given list, find the linked entry from table3 with the maximum score value. The way I do it now is quite slow, and I wonder if it could be sped up.
How I am doing it now:
SELECT k.score,k.yaw,k.pitch,k.roll,k.kp_number,k.ke_number,k.points,k.elems --various fields of third table
FROM File
JOIN FaceDetection AS d ON d.f_id=File.file_id --joining second table
JOIN FaceKey AS k ON k.face_det=d.fd_id --joining third table
WHERE name=:fld
ORDER BY k.score DESC
I open a transaction, prepare a query with the above text, retrieve the entries I am interested in from the database in a loop, and then commit the transaction. What are better, faster ways?
Indexes can be used for all the columns that are used for lookups or sorting, but a query cannot use more than one index per table.
Check the EXPLAIN QUERY PLAN output to see whether this query does table scans or uses indexes.
You are not returning values from any table but FaceKey, so you do not actually need to do a join.
However, rewriting the query as below might or might not help:
SELECT score,
       yaw,
       pitch,
       roll,
       kp_number,
       ke_number,
       points,
       elems
FROM FaceKey
WHERE face_det IN (SELECT fd_id
                   FROM FaceDetection
                   WHERE f_id IN (SELECT file_id
                                  FROM File
                                  WHERE name = :fld))
ORDER BY score DESC
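If the plan reports table scans, permanent indexes on the lookup columns should help. A sketch, assuming these columns are not already covered by primary keys or existing indexes (the index names are illustrative):
-- hypothetical supporting indexes for the lookups above
CREATE INDEX idx_File_name ON File(name);
CREATE INDEX idx_FaceDetection_f_id ON FaceDetection(f_id);
CREATE INDEX idx_FaceKey_face_det_score ON FaceKey(face_det, score);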

Explanation on index on a datetime field and included columns

I have a SQL Server table with the usual
intID (primary key), field1, field2, many other fields..., and a datetime TimeOperation
99% of my different kinds of queries start with TimeOperation BETWEEN startTime AND endTime, then select * (or count(*)) where fieldA=xxx, and join with other smaller tables.
select * because, more or less, I need all the fields.
I obviously created an index on TimeOperation ... but performance is not good enough, so I want to add some index key columns or included columns, but I'm a little bit confused.
I get the difference between the two, but I don't get how much adding a column in each case impacts on speed and on size.
I guess that the biggest improvement would be to create an index including ALL the columns, is that right? (but I can't afford it in terms of space)
And if I often use field1=xxx for example, adding field1 to the index key columns (after TimeOperation) would give better performance, right?
Also... just to be sure how an index with included columns works: if I select rows with TimeOperation in a certain range, SQL seeks my TimeOperation index for the rows I'm interested in, and it is faster than scanning the whole table because in the index the TimeOperation values are in ascending order, is that right? But then I need the rest of the data fields of those rows... how does SQL retrieve that data? I guess it keeps a sort of bookmark to those rows in the index, right? But then it has to hit the table multiple times... so including all the columns in the index will save the time to hit the table, is that correct?
Thanks!
Mattia
We would need more information on your table and examples of your queries to address this fully, but:
DateTime columns should be highly selective by themselves, so an index with TimeOperation as the first column should address the bulk of queries against TimeOperation.
Do not add all columns blindly to an index, or even as included columns - this will make the index page density worse and be counterproductive (you would be duplicating your table in an index).
If all data in your database centres around TimeOperation, you might consider building your clustered index around it.
If you have queries just on field1 = x, then you need a separate index just for field1 (assuming that it is suitably selective), i.e. no TimeOperation on the index if it's not in the WHERE clause of your query.
Yes, you are right: when SQL locates a record in an index, it needs to do a key (or RID) lookup back into the cluster to retrieve the rest of the columns. If your nonclustered index INCLUDEs the other columns in your select statement, the lookup can be avoided. But since you are using SELECT(*), covering indexes are unlikely to help.
Edit
Explanation - Selectivity and density are explained in detail here. E.g. only if your queries against TimeOperation return a small number of rows (rule of thumb is < 5%, but this isn't always the case) will the index be used, i.e. your query is selective enough for SQL to choose the index on TimeOperation.
The basic starting point would be:
CREATE TABLE [MyTable]
(
intID INT IDENTITY(1,1) NOT NULL,
field1 NVARCHAR(20),
-- .. More columns, which may be selected, but not filtered
TimeOperation DateTime,
CONSTRAINT PK_MyTable PRIMARY KEY (IntId)
);
And the basic indexes will be
CREATE NONCLUSTERED INDEX IX_MyTable_1 ON [MyTable](TimeOperation);
CREATE NONCLUSTERED INDEX IX_MyTable_2 ON [MyTable](Field1);
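As a rough way to apply the < 5% rule of thumb mentioned above, you could measure what fraction of rows a typical time range returns (a sketch; the sample dates are placeholders):
-- what fraction of MyTable falls inside a typical query range?
DECLARE @x datetime = '2024-01-01', @y datetime = '2024-01-02';
SELECT CAST(SUM(CASE WHEN TimeOperation BETWEEN @x AND @y THEN 1 ELSE 0 END) AS float)
       / COUNT(*) AS fraction_in_range
FROM [MyTable];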
Clustering Consideration / Option
If most of your records are inserted in 'serial' ascending TimeOperation order, i.e. intId and TimeOperation will both increase in tandem, then I would leave the clustering on intID (the default) (i.e. table DDL is PRIMARY KEY CLUSTERED (IntId), which is the default anyway).
However, if there is NO correlation between IntId and TimeOperation, and IF most of your queries are of the form SELECT * FROM [MyTable] WHERE TimeOperation between xx and yy then CREATE CLUSTERED INDEX CL_MyTable ON MyTable(TimeOperation) (and changing PK to PRIMARY KEY NONCLUSTERED (IntId)) should improve this query (Rationale: since contiguous times are kept together, fewer pages need to be read, and the bookmark lookup will be avoided). Even better, if values of TimeOperation are guaranteed to be unique, then CREATE UNIQUE CLUSTERED INDEX CL_MyTable ON MyTable(TimeOperation) will improve density as it will avoid the uniqueifier.
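A minimal sketch of that switch, assuming the MyTable definition above (purely illustrative; it rebuilds the table and its indexes):
-- make the primary key nonclustered and cluster on TimeOperation instead
ALTER TABLE [MyTable] DROP CONSTRAINT PK_MyTable;
ALTER TABLE [MyTable] ADD CONSTRAINT PK_MyTable PRIMARY KEY NONCLUSTERED (intID);
CREATE CLUSTERED INDEX CL_MyTable ON [MyTable](TimeOperation);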
Note - for the rest of this answer, I'm assuming that your IntId and TimeOperations ARE strongly correlated and hence the clustering is by IntId.
Covering Indexes
As others have mentioned, your use of SELECT (*) is bad practice and inter alia means covering indexes won't be of any use (the exception being COUNT(*)).
If your queries weren't SELECT(*), but instead e.g.
SELECT TimeOperation, field1
FROM [MyTable]
WHERE TimeOperation BETWEEN x and y -- and returns < 5% data.
Then altering your index on TimeOperation to include field1
CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation) INCLUDE(Field1);
OR adding both to the index (with the most common filter first, or the most selective first if both filters are always present)
CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation, Field1);
Either will avoid the rid / key lookup. The second (,) option will address your query where BOTH TimeOperation and Field1 are filtered in a WHERE or HAVING clause.
Re : What's the difference between index on (TimeOperation, Field1) and separate indexes?
e.g.
CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation, Field1);
will not be useful for the query
SELECT ... FROM MyTable WHERE Field1 = 'xyz';
The index will only be useful for queries which filter on TimeOperation:
SELECT ... FROM MyTable WHERE TimeOperation between x and y;
OR
SELECT ... FROM MyTable WHERE TimeOperation between x and y AND Field1 = 'xyz';
Hope this helps?
An index, at its most basic, creates a layer of the "hypertree" structure behind the scenes, which allows the SQL engine to more easily find rows with particular values for indexed columns. Each index creates a different way to "drill down" into the table's data using a binary search (log N performance). Each index you add makes selecting by that index faster, at the cost of slowing insertions/updates (the data must be inserted and then each index must be updated).
An index, therefore, should normally be created for combinations of columns that are commonly used to filter records. I would indeed create an index on TimeOperation, and TimeOperation alone.
NEVER simply create an index including all columns of a table, especially a wide one such as this.

Sqlite Query Optimization (using Limit and Offset)

Following is the query that I use for getting a fixed number of records from a database with millions of records:
select * from myTable LIMIT 100 OFFSET 0
What I observed is, if the offset is very high like say 90000, then it takes more time for the query to execute. Following is the time difference between 2 queries with different offsets:
select * from myTable LIMIT 100 OFFSET 0      -- execution time is less than 1 sec
select * from myTable LIMIT 100 OFFSET 95000  -- execution time is almost 15 secs
Can anyone suggest how to optimize this query? I mean, the query execution time should be equally fast for any number of records I wish to retrieve, from any OFFSET.
Newly added:
The actual scenario is that I have a database with more than 1 million records. But since it's an embedded device, I just can't do "select * from myTable" and then fetch all the records from the query; my device crashes. Instead, what I do is keep fetching records batch by batch (batch size = 100 or 1000 records) as per the query mentioned above. But as I mentioned, it becomes slow as the offset increases. So, my ultimate aim is to read all the records from the database. But since I can't fetch all the records in a single execution, I need some other efficient way to achieve this.
As JvdBerg said, indexes are not used in LIMIT/OFFSET.
Simply adding 'ORDER BY indexed_field' will not help either.
To speed up pagination you should avoid LIMIT/OFFSET and use a WHERE clause instead. For example, if your primary key field is named 'id' and has no gaps, then your code above can be rewritten like this:
SELECT * FROM myTable WHERE id>=0 AND id<100          -- very fast!
SELECT * FROM myTable WHERE id>=95000 AND id<95100    -- as fast as the previous line!
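If the ids do have gaps, a keyset-pagination variant (a sketch, not part of the original answer; :last_seen_id is a hypothetical bound parameter) avoids OFFSET by remembering the last id returned by the previous batch:
-- first batch: :last_seen_id = 0; then feed back the largest id from each batch
SELECT * FROM myTable WHERE id > :last_seen_id ORDER BY id LIMIT 100;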
By doing a query with an offset of 95000, all 95000 previous records are processed. You should create an index on the table and use it for selecting records.
As #user318750 said, if you know you have a contiguous index, you can simply use
select * from Table where index >= %start and index < %(start+size)
However, those cases are rare. If you don't want to rely on that assumption, use a sub-query, for example using rowid, which is always indexed,
select * from Table where rowid in (
select rowid from Table limit %size offset %start)
This speeds things up especially if you have "fat" rows (e.g. that contain blobs).
If maintaining the record order is important (it usually isn't), you need to order the indices first:
select * from Table where rowid in (
select rowid from Table order by rowid limit %size offset %start)
A related trick is to look up only the rowid at the desired offset and fetch that single row:
select * from data where rowid = (select rowid from data limit 1 offset 999999);
With SQLite, you don't need to get all rows returned at once in a big fat array; you can get called back for every row. This way, you can process the results as they come in, which should address both your crashing and performance issues.
I guess you're not using C, as you would already be using a callback, but this technique should be available in any other language.
JavaScript example (from https://www.npmjs.com/package/sqlite3):
db.each("SELECT rowid AS id, info FROM lorem", function(err, row) {
console.log(row.id + ": " + row.info);
});
