I executed a query on SQLite, and the relevant part of the plan is:
0|1|5|SCAN TABLE edges AS e1 (~250000 rows)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE dihedral USING AUTOMATIC COVERING INDEX (TYPE=? AND EDGE=?) (~7 rows)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 3
3|0|0|SEARCH TABLE bounds USING AUTOMATIC COVERING INDEX (FACE=? AND EDGE=?) (~7 rows)
where the WHERE clause of the query contains
exists (select dihedral.edge from dihedral where dihedral.type=2 and dihedral.edge=e1.edge) and
exists (select bounds.edge from bounds where bounds.face=f1.face and bounds.edge=e1.edge) and
I understand this is not a highly efficient query; I just want to improve its performance.
This is my guess:
There is no subquery flattening, right?
The two EXISTS clauses introduce correlated subqueries, and they are actually executed as indexed nested loops, right?
Reading the query: the tables dihedral and bounds are independent of each other, but both are correlated with the outer edges table, so without any index the computational complexity would be O(n^2). However, since there are covering indexes, the performance should be much better, right? I found on Wikipedia that an index lookup costs O(log(N)) or even better, so the overall complexity should be O(n*log(N)). Is this right?
Could anyone help me understand what happened? Thanks.
SQLite does support subquery flattening, but that is not possible for an EXISTS subquery like this one.
The AUTOMATIC keyword shows that the database creates a temporary index just for this query.
This is a strong indication that you should create these indexes permanently:
CREATE INDEX dihedral_type_edge ON dihedral(type, edge);
CREATE INDEX bounds_face_edge ON bounds(face, edge);
The outer query goes through all edge rows, and for each row, searches in the indexes.
This would result in O(edge * (log(dihedral) + log(bounds))).
The temporary index creation requires sorting these tables, so the entire runtime ends up being O(dihedral*log(dihedral) + bounds*log(bounds) + edge*(log(dihedral)+log(bounds))).
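As a quick sanity check (a sketch reusing the table and column names from the question; the f1/faces correlation is omitted since that table is not shown), you can re-run EXPLAIN QUERY PLAN once the permanent indexes above exist and confirm that the AUTOMATIC keyword disappears:

EXPLAIN QUERY PLAN
SELECT e1.edge
FROM edges AS e1
WHERE EXISTS (SELECT 1 FROM dihedral
              WHERE dihedral.type = 2 AND dihedral.edge = e1.edge);
-- The subquery line should now read something like:
-- SEARCH TABLE dihedral USING COVERING INDEX dihedral_type_edge (type=? AND edge=?)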
I have a simple Core (SQL) API query that gets a count of rows. If I run the EXISTS and the IN separately, each query is around 2-3 RUs, but if I do (EXISTS "OR" IN) -- I can even do (EXISTS "OR" TRUE) -- then it jumps up to 45 RUs. It makes more sense for me to do 2 different queries than 1. Why does the OR cause the RU consumption to go up?
These are the queries I've tried and experimented with.
SELECT VALUE COUNT(1) FROM ROOT r -- 850 rows, 2-3 RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) -- 830 rows, 2-3 RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND r.id IN (......) -- 830 rows, 2-3 RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND EXISTS(SELECT s FROM s IN r.shared WHERE s.id = ID) -- 840 rows, 2-3 RUs
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND (EXISTS(SELECT s FROM s IN r.shared WHERE s.id = ID) OR r.id IN (...)) -- 840 rows, 45 RUs
This is also cross-listed on Microsoft Q/A.
Disclaimer: I have no internal view of the Cosmos DB engine, and what follows is just a general guess.
There may be tricks involved regarding data cardinality, how your index is set up, and if/how the predicate tree can be pruned, but overall it is not too surprising that OR makes for a harder query: you can't have a covering index for an OR predicate, and that forces data lookups.
For index-covered ANDs only, basically:
get matching entries from indexes for indexable predicates and take intersection.
return count
With OR-s you can't work on indexes alone:
get matching entries from indexes for indexable predicates and take intersection.
look up documents (or required parts)
evaluate non-indexable predicates (like A OR B) on all matching documents
return count
Obviously the second approach requires a lot more computation and memory, hence the higher RU. The query engine can do all kinds of tricks, but the fact is that it must fetch extra data to make sure your "hard" predicates are taken into account.
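If splitting the query is acceptable (as the question already hints), a client-side workaround is to compute the OR count by inclusion-exclusion, reusing the question's own predicates. This is a sketch; ID and the (...) list are the question's placeholders:

-- count(A OR B) = count(A) + count(B) - count(A AND B), computed as three queries
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND EXISTS(SELECT s FROM s IN r.shared WHERE s.id = ID) -- count(A)
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND r.id IN (...) -- count(B)
SELECT VALUE COUNT(1) FROM ROOT r WHERE IS_NULL(r.deletedAt) AND EXISTS(SELECT s FROM s IN r.shared WHERE s.id = ID) AND r.id IN (...) -- count(A AND B)

Each query sticks to AND-only predicates, so each should stay in the cheap 2-3 RU range; whether three cheap round trips beat one 45 RU query is something to measure against your workload.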
BTW, if you're unhappy with RU consumption, you should always check which indexes were applied and how, and whether you can improve anything by setting up different indexes.
See: Indexing metrics in Azure Cosmos DB.
More complex queries having higher RU is still to be expected, though.
In my Android application, I use Cursor c = db.rawQuery(query, null); to query data from a local SQLite database, and one of the query strings looks like the following:
SELECT t1.* FROM table t1
WHERE NOT EXISTS (
SELECT 1 FROM table t2
WHERE t2.start_time = t1.start_time AND t2.stop_time > t1.stop_time
)
However, the query gets very slow when the database gets huge. I have been trying to introduce indexing to speed up the query, but so far without much success, so it would be great to have some help here; it is also hard to find examples of this for Android applications.
You can create a composite index for the columns start_time and stop_time:
CREATE INDEX idx_name ON table_name(start_time, stop_time);
You can read in The SQLite Query Optimizer Overview:
The ON and USING clauses of an inner join are converted into
additional terms of the WHERE clause prior to WHERE clause analysis
...
and:
If an index is created using a statement like this:
CREATE INDEX idx_ex1 ON ex1(a,b,c,d,e,...,y,z);
Then the index might be used if the initial columns of the index
(columns a, b, and so forth) appear in WHERE clause terms. The initial
columns of the index must be used with the = or IN or IS operators.
The right-most column that is used can employ inequalities.
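In this query, start_time is used with = and stop_time with >, so the composite index satisfies both rules. As a quick check (a sketch; table_name stands for your actual table, matching the CREATE INDEX above), you can ask SQLite whether the subquery now uses the index:

EXPLAIN QUERY PLAN
SELECT t1.* FROM table_name t1
WHERE NOT EXISTS (
    SELECT 1 FROM table_name t2
    WHERE t2.start_time = t1.start_time AND t2.stop_time > t1.stop_time
);
-- The subquery should report something like:
-- SEARCH TABLE table_name AS t2 USING COVERING INDEX idx_name (start_time=? AND stop_time>?)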
You may have to uninstall the app from the device so that the database is deleted and recreated on the next run, or increase the database version number so that you can create the index in the onUpgrade() method.
I have an aggregate query that is properly indexed to return ordered results fast (a simple index scan). This works as expected when the ordering is ascending (ASC), but reversing the order (DESC) results in SQLite creating a TEMP B-TREE.
SQLite version 3.26.0
CREATE TABLE t1(x,y);
INSERT INTO t1 VALUES(1,1);
INSERT INTO t1 VALUES(1,2);
INSERT INTO t1 VALUES(2,1);
CREATE INDEX ix1 ON t1(x,y);
EXPLAIN QUERY PLAN SELECT x,max(y) FROM t1 GROUP BY x ORDER BY x;
EXPLAIN QUERY PLAN SELECT x,max(y) FROM t1 GROUP BY x ORDER BY x DESC; -- This query constructs a TEMP B-TREE, why?
When running the above code you will see query 1 simply running an index scan, while query 2, in addition to running an index scan, also builds a TEMP B-TREE to order the result, destroying performance.
The index supports traversal in both directions, so I would expect the same performance for both ASC and DESC ordering.
Is this a known limitation of SQLite and aggregates, or am I expecting/doing something wrong?
The code for looping over an index goes backwards only when needed. For implementing GROUP BY itself, going backwards is never needed, so it is never tried.
In reaction to your report on the sqlite-users mailing list, SQLite version 3.30.0 will have code to handle this case:
/* The GROUP BY processing doesn't care whether rows are delivered in
** ASC or DESC order - only that each group is returned contiguously.
** So set the ASC/DESC flags in the GROUP BY to match those in the
** ORDER BY to maximize the chances of rows being delivered in an
** order that makes the ORDER BY redundant. */
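Since the fix ships in 3.30.0, it is worth confirming which SQLite version your build actually links, and until you can upgrade you can keep the cheap ASC plan and reverse the (typically small) grouped result in application code; a minimal sketch:

SELECT sqlite_version(); -- the GROUP BY DESC optimization needs 3.30.0 or later
SELECT x, max(y) FROM t1 GROUP BY x ORDER BY x; -- plain index scan; reverse rows client-side if DESC is needed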
There is no join in the query; it is a simple query with two COUNT(DISTINCT ...) aggregates, yet it is consuming more than 9k CPU.
I have collected the necessary statistics but have been unable to reduce the CPU. Please suggest some good methods to reduce the CPU impact.
I think the target table is a SET table, so your query is taking a lot of CPU (duplicate row elimination).
1) Test your SELECT query against a MULTISET table (a sketch follows this list):
insert into multiset_table
select count(distinct col1) from source_table;
Also, I believe your primary index may be skewed, which would be the reason for the high CPU impact.
2) Make sure your primary index is unique.
SELECT HASHAMP(HASHBUCKET(HASHROW(<primary index columns>))), CAST(COUNT(*) AS BIGINT) AS cnt FROM target_table GROUP BY 1 ORDER BY 2 DESC;
If the cnt column is not distributed evenly, then change the primary index of the table to more unique columns.
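The sketch promised under point 1 (multiset_table is a hypothetical name; CREATE TABLE ... AS copies the structure and data of target_table, and col1 stands in for a reasonably unique primary index column):

CREATE MULTISET TABLE multiset_table AS
    (SELECT * FROM target_table) WITH DATA
    PRIMARY INDEX (col1);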
Only two things can cause the merge step to run slowly:
1) Target table is SET table
2) Primary index of the target table is badly skewed
I have a SQL Server table with the usual
intID (primary key), field1, field2, many other fields..., datetime TimeOperation
99% of my different kinds of queries start with a TimeOperation BETWEEN startTime AND endTime, then select * (or count(*)) where fieldA=xxx, and join with other smaller tables.
select * because more or less I need all the fields.
I obviously created an index on TimeOperation... but performance is not good enough, so I want to add some index key columns or included columns, but I'm a little bit confused.
I get the difference between the two, but I don't get how much adding a column in each case impacts speed and size.
I guess the biggest improvement would come from creating an index including ALL the columns, is that right? (But I can't afford it in terms of space.)
And if I often use field1=xxx, for example, adding field1 to the index key columns (after TimeOperation) would give better performance, right?
Also... just to be sure how an index with included columns works: if I select rows with TimeOperation in a certain range, SQL seeks my TimeOperation index for the rows I'm interested in, and that is faster than scanning the whole table because in the index the TimeOperation values are in ascending order, is that right? But then I need all the rest of the data fields of those rows... how does SQL retrieve that data? I guess the index keeps a sort of bookmark to those rows, right? But then it has to hit the table multiple times... so including all the columns in the index would save the time spent hitting the table, is that correct?
Thanks!
Mattia
We would need more information on your table and examples of your queries to address this fully, but:
DateTime columns should be highly selective by themselves, so an index with TimeOperation as the first column should address the bulk of queries against TimeOperation.
Do not add all columns blindly to an index, or even as included columns - this will make the index page density worse and be counterproductive (you would be duplicating your table in an index).
If all data in your database centres around TimeOperation, you might consider building your clustered index around it.
If you have queries just on field1 = x, then you need a separate index just for field1 (assuming that it is suitably selective), i.e. no TimeOperation in the index if it's not in the WHERE clause of your query.
Yes, you are right: when SQL locates a record via a non-clustered index, it needs to do a key (or RID) lookup back into the clustered index to retrieve the rest of the columns. If your non-clustered index INCLUDEs the other columns in your select statement, the lookup can be avoided. But since you are using SELECT *, covering indexes are unlikely to help.
Edit
Explanation - selectivity and density are explained in detail here. E.g. only if your queries against TimeOperation return a small percentage of rows (the rule of thumb is < 5%, but this isn't a hard rule) will the index be used, i.e. your query must be selective enough for SQL to choose the index on TimeOperation.
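To gauge that yourself (a sketch against the [MyTable] example below; the date literals are placeholders), compare the row count of a typical window against the whole table:

DECLARE @startTime datetime = '20240101', @endTime datetime = '20240102';
SELECT CAST(COUNT(*) AS float) / (SELECT COUNT(*) FROM [MyTable]) AS fraction
FROM [MyTable]
WHERE TimeOperation BETWEEN @startTime AND @endTime;
-- A fraction well below 0.05 suggests the nonclustered index on TimeOperation will be used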
The basic starting point would be:
CREATE TABLE [MyTable]
(
intID INT IDENTITY(1,1) NOT NULL,
field1 NVARCHAR(20),
-- .. More columns, which may be selected, but not filtered
TimeOperation DateTime,
CONSTRAINT PK_MyTable PRIMARY KEY (IntId)
);
And the basic indexes will be
CREATE NONCLUSTERED INDEX IX_MyTable_1 ON [MyTable](TimeOperation);
CREATE NONCLUSTERED INDEX IX_MyTable_2 ON [MyTable](Field1);
Clustering Consideration / Option
If most of your records are inserted in 'serial' ascending TimeOperation order, i.e. intID and TimeOperation both increase in tandem, then I would leave the clustering on intID, i.e. the table DDL stays PRIMARY KEY CLUSTERED (IntId), which is the default anyway.
However, if there is NO correlation between IntId and TimeOperation, and IF most of your queries are of the form SELECT * FROM [MyTable] WHERE TimeOperation between xx and yy then CREATE CLUSTERED INDEX CL_MyTable ON MyTable(TimeOperation) (and changing PK to PRIMARY KEY NONCLUSTERED (IntId)) should improve this query (Rationale: since contiguous times are kept together, fewer pages need to be read, and the bookmark lookup will be avoided). Even better, if values of TimeOperation are guaranteed to be unique, then CREATE UNIQUE CLUSTERED INDEX CL_MyTable ON MyTable(TimeOperation) will improve density as it will avoid the uniqueifier.
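Spelled out, that re-clustering would look something like this (a sketch; the names follow the examples above, and the whole table is rewritten, so do it in a maintenance window):

ALTER TABLE [MyTable] DROP CONSTRAINT PK_MyTable; -- table temporarily becomes a heap
CREATE CLUSTERED INDEX CL_MyTable ON [MyTable](TimeOperation); -- rewrites the table in time order
ALTER TABLE [MyTable] ADD CONSTRAINT PK_MyTable PRIMARY KEY NONCLUSTERED (intID);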
Note - for the rest of this answer, I'm assuming that your IntId and TimeOperations ARE strongly correlated and hence the clustering is by IntId.
Covering Indexes
As others have mentioned, your use of SELECT * is bad practice and, inter alia, means covering indexes won't be of any use (the exception being COUNT(*)).
If your queries weren't SELECT *, but instead e.g.
SELECT TimeOperation, field1
FROM [MyTable]
WHERE TimeOperation BETWEEN x AND y -- and returns < 5% of the data
Then altering your index on TimeOperation to include field1
CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation) INCLUDE(Field1);
OR adding both to the index (with the most common filter first, or the most selective first if both filters are always present)
CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation, Field1);
Either will avoid the RID/key lookup. The second (composite key) option will also address queries where BOTH TimeOperation and Field1 are filtered in a WHERE or HAVING clause.
Re : What's the difference between index on (TimeOperation, Field1) and separate indexes?
e.g.
CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation, Field1);
will not be useful for the query
SELECT ... FROM MyTable WHERE Field1 = 'xyz';
The index will only be useful for queries that include TimeOperation:
SELECT ... FROM MyTable WHERE TimeOperation between x and y;
OR
SELECT ... FROM MyTable WHERE TimeOperation between x and y AND Field1 = 'xyz';
Hope this helps?
An index, at its most basic, creates a tree structure behind the scenes (typically a B-tree) that lets the SQL engine find rows with particular values of the indexed columns more easily. Each index creates a different way to "drill down" into the table's data with a search costing O(log N). Each index you add makes selecting by that index faster, at the cost of slowing insertions/updates (the data must be written and the indexes maintained).
An index, therefore, should normally be created for combinations of columns that are commonly used to filter records. I would indeed create an index on TimeOperation, and TimeOperation alone.
NEVER simply create an index including all columns of a table, especially a wide one such as this.