SQLite - Selecting a non-indexed column with GROUP BY

I have a situation similar to the one in the question below.
Mysql speed up max() group by
SELECT MAX(id) id, cid FROM table GROUP BY cid
To optimize the query above (taken from that question), creating an index on (cid, id) does the trick.
However, when I add a column that is not indexed to the SELECT list, the query slows down drastically.
For example,
SELECT MAX(id) id, cid, newcolumn FROM table GROUP BY cid
If I create an index on (cid, id, newcolumn), the query time drops back to the minimum. It seems I have to index every column I select when using GROUP BY.
Is there any way other than indexing all the columns to be selected?

When all the columns used in the query are part of the index (which is then called a covering index), SQLite can get all values from the index and does not need to access the table itself.
When adding a column that is not indexed, each record must be looked up in both the index and the table.
Furthermore, the order of the records in the table is unlikely to be the same as the order in the index, so the table's pages are not read in order, and are read multiple times, which means that caching will not work as well.
The newcolumn values must be read from either the table or an index; there is no other mechanism to store data.
tl;dr: no
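To see the difference for yourself, you can ask SQLite for its query plan. The following is a minimal sketch using Python's built-in sqlite3 module; the table name t, the index names, and the helper query_plan are made up for illustration, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

QUERY_TWO_COLUMNS = "SELECT MAX(id) AS id, cid FROM t GROUP BY cid"
QUERY_THREE_COLUMNS = "SELECT MAX(id) AS id, cid, newcolumn FROM t GROUP BY cid"


def query_plan(conn, sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN, joined into one line."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Hypothetical schema standing in for the table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, cid INTEGER, newcolumn TEXT)")
conn.execute("CREATE INDEX idx_cid_id ON t (cid, id)")

# Both selected columns are in the index, so the table is never touched.
plan_covered = query_plan(conn, QUERY_TWO_COLUMNS)

# newcolumn is not in any index, so SQLite has to read the table as well.
plan_not_covered = query_plan(conn, QUERY_THREE_COLUMNS)

# After adding newcolumn to an index, the query is covered again.
conn.execute("CREATE INDEX idx_cid_id_new ON t (cid, id, newcolumn)")
plan_covered_again = query_plan(conn, QUERY_THREE_COLUMNS)

print(plan_covered)
print(plan_not_covered)
print(plan_covered_again)
```

A plan that mentions COVERING INDEX means the values came from the index alone; a plan without it means the table had to be read too.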

Related

Keep a maximum of 5 rows in a table (Room/Sqlite)

I want to store the last 5 searches a user performed in a SQLite table using Room. How can I always delete the oldest entry when there are more than 5 entries?
I don't want to add a date column and sort by date, as for privacy reasons I don't want to store the time when a user performed the search
I don't want to use an autoincrement id column, as it's theoretically limited at some given maximum that the ID can be
Could I maybe use the rowid? So checking if the number of entries in the table is larger than 5, then sort by rowid ascending and delete the first entry? Any other ideas?
"I don't want to use an autoincrement id column, as it's theoretically limited at some given maximum that the ID can be"
By that logic one shouldn't use autoincrement values in a database at all. I doubt you really have data for which a Long, with its maximum value of 9223372036854775807, would not be enough or could ever be exhausted.
As one more alternative, you can use the following scheme once there are 5 rows in your table, assuming you have an int id field:
1. Delete the row with the minimal id (I guess it would be 0).
2. Update all the rows with one query, decreasing their id by 1:
@Query("UPDATE search_table SET id = id - 1")
fun reorderData()
3. Insert the new row.
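A different route, closer to the rowid idea from the question, is to insert first and then delete everything except the newest 5 rows. This is only a sketch in plain SQLite via Python's sqlite3 module, not Room code; the column term and the helpers add_search and recent_searches are hypothetical names. The id column here is an INTEGER PRIMARY KEY without AUTOINCREMENT, which in SQLite is an alias for the rowid.

```python
import sqlite3

MAX_ROWS = 5  # how many recent searches to keep


def add_search(conn, term):
    """Insert a search term, then trim the table to the newest MAX_ROWS rows."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO search_table (term) VALUES (?)", (term,))
        conn.execute(
            "DELETE FROM search_table WHERE id NOT IN "
            "(SELECT id FROM search_table ORDER BY id DESC LIMIT ?)",
            (MAX_ROWS,),
        )


def recent_searches(conn):
    """Return the stored terms, oldest first."""
    rows = conn.execute("SELECT term FROM search_table ORDER BY id")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
# id is an alias for the rowid; no timestamp is stored.
conn.execute(
    "CREATE TABLE search_table (id INTEGER PRIMARY KEY, term TEXT NOT NULL)"
)
```

In Room the same DELETE statement could presumably go into a @Query method, but that part is untested here.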

sql query for extracting one column from many tables

I need your support for a query in SQLite Studio.
I am dealing with a database made up of 1,000 different tables.
Half of them (all named "news" + an identification number, like 04AD86) contain the column "category", which I am interested in. This column can have from 100 to 200 records in each table.
Could you suggest a query that extracts "category" from every table and returns a list of all possible categories (without duplicate records)?
Thanks a lot
You will probably need dynamic SQL to handle this in a single query. If you don't mind doing this over several queries, then here is one option. First do a query to obtain all the tables which contain the category column:
SELECT name
FROM sqlite_master
WHERE type = 'table' AND name LIKE 'news%'
Next, for the actual queries to obtain the unique categories, you can perform a series of unions to get your list. Here is what it would look like:
SELECT DISTINCT category
FROM news04AD86
UNION
SELECT DISTINCT category
FROM news05BG34
UNION
...
The DISTINCT keyword will remove duplicates within any given news table, and UNION will remove duplicates which might occur between one table and another.
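If you can drive the queries from a script, the two steps above can be combined. Here is a rough sketch with Python's sqlite3 module that runs one query per news table and merges the results in a set, so you never have to write out a 500-way UNION by hand. The function names distinct_categories and quote_identifier are made up, and it assumes every table matching 'news%' really has a category column.

```python
import sqlite3


def quote_identifier(name):
    """Quote a table name so it can be spliced into SQL text safely."""
    return '"' + name.replace('"', '""') + '"'


def distinct_categories(conn):
    """Collect the distinct category values from every table named news*."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name LIKE 'news%'"
        )
    ]
    categories = set()
    for table in tables:
        # Table names cannot be bound as parameters, hence the quoting helper.
        sql = (
            "SELECT DISTINCT category FROM "
            + quote_identifier(table)
            + " WHERE category IS NOT NULL"
        )
        categories.update(row[0] for row in conn.execute(sql))
    return sorted(categories)
```

Calling distinct_categories(sqlite3.connect("your.db")) would then give you the sorted list; the file name is of course a placeholder.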

How can I replace certain values by their average in an sqlite database?

I have an sqlite database with a table that logs electric power values over time, i.e. there is a timestamp column and one for the associated power value.
With a value coming in roughly every second, this table grows significantly over time. Which is why I want to thin out old values, for example by replacing all 60 values in a minute with their average.
I know how to query for the average.
I know how to insert the query's result back into the table.
But how do I delete the original values without also deleting the newly inserted average value (which has a timestamp within the same range)?
Note that I would like to perform the operation entirely inside sqlite query language, i.e. without storing for example row ids in the C code that is executing the queries.
The easiest way would be to use a temporary table:
BEGIN;
CREATE TEMP TABLE Averages AS
SELECT MIN(Timestamp), AVG(Value)
FROM MyTable
WHERE (old)
GROUP BY (minute);
DELETE FROM MyTable WHERE (old);
INSERT INTO MyTable(Timestamp, Value) SELECT * FROM Averages;
DROP TABLE Averages;
COMMIT;
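To make the placeholders concrete, here is one possible way to fill them in, driven from Python's sqlite3 module. It is a sketch under several assumptions: Timestamp holds integer Unix seconds, (old) means Timestamp < cutoff, (minute) means Timestamp / 60 with integer division, and the connection was opened with isolation_level=None so that BEGIN and COMMIT can be issued explicitly. The temp table is created with explicit columns and filled by INSERT ... SELECT rather than CREATE TABLE ... AS, just to keep the bound parameter in an ordinary statement.

```python
import sqlite3


def thin_old_values(conn, cutoff):
    """Replace all rows older than cutoff with one averaged row per minute.

    Assumes conn was opened with isolation_level=None (manual transactions)
    and that Timestamp holds integer Unix seconds.
    """
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        cur.execute("CREATE TEMP TABLE Averages (Timestamp INTEGER, Value REAL)")
        # One averaged row per minute, stamped with the minute's first sample.
        cur.execute(
            "INSERT INTO Averages "
            "SELECT MIN(Timestamp), AVG(Value) FROM MyTable "
            "WHERE Timestamp < ? GROUP BY Timestamp / 60",
            (cutoff,),
        )
        # The averages are safe in the temp table, so the originals can go.
        cur.execute("DELETE FROM MyTable WHERE Timestamp < ?", (cutoff,))
        cur.execute(
            "INSERT INTO MyTable (Timestamp, Value) "
            "SELECT Timestamp, Value FROM Averages"
        )
        cur.execute("DROP TABLE Averages")
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise
```

Because the averages live in a separate table while the DELETE runs, there is no need to tell old rows and new averages apart by timestamp.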

Explanation on index on a datetime field and included columns

I have a sqlserver table with the usual
intID(primary key),field1,field2,manyotherfields..., datetime TimeOperation
99% of my different kinds of queries start with TimeOperation BETWEEN startTime AND endTime, and then select * (or count(*)) where fieldA=xxx, and join with other smaller tables.
select * because more or less I need all the fields.
I obviously created an index on TimeOperation ... but performance is not good enough, so I want to add some index key columns or index included columns, but I'm a little bit confused.
I get the difference between the two, but I don't get how much adding a column in each case impacts on speed and on size.
I guess that the biggest improvement would be to create an index including ALL the columns, is that right? (But I can't afford it in terms of space.)
And if I often use field1=xxx, for example, adding field1 to the index key columns (after TimeOperation) would give better performance, right?
Also...just to be sure how an index with included columns works: if I select rows with TimeOperation in a certain range, SQL seeks my TimeOperation index for the rows I'm interested in, and it is faster than scanning the whole table because in the index the TimeOperation values are in ascending order, is that right? But then I need all the rest of the data fields of those rows...how does SQL retrieve that data? I guess it has a sort of bookmark to those rows in the index, right? But then it has to hit the table multiple times... so including all the columns in the index would save the time needed to hit the table, is that correct?
Thanks!
Mattia
We will need more information on your table and examples of your queries to address this fully, but:
DateTime columns should be highly selective by themselves, so an index with TimeOperation as the first column should address the bulk of queries against TimeOperation.
Do not blindly add all columns to an index, not even as included columns - this will make the index page density worse and be counterproductive (you would be duplicating your table in an index).
If all data in your database centres around TimeOperation, you might consider building your clustered index around it.
If you have queries just on field1 = x then you need a separate index just for field1 (assuming that it is suitably selective), i.e. no TimeOperation in the index if it's not in the WHERE clause of your query.
Yes, you are right: when SQL Server locates a record in a nonclustered index, it needs to do a key (or RID) lookup back into the clustered index (or heap) to retrieve the rest of the columns. If your nonclustered index includes the other columns in your SELECT statement, the lookup can be avoided. But since you are using SELECT *, covering indexes are unlikely to help.
Edit
Explanation - Selectivity and density are explained in detail here. For example, the index on TimeOperation will be used only if your queries against TimeOperation return a small fraction of the rows (a rule of thumb is < 5%, but this doesn't always hold), i.e. only if the query is selective enough for SQL Server to choose that index.
The basic starting point would be:
CREATE TABLE [MyTable]
(
intID INT IDENTITY(1,1) NOT NULL,
field1 NVARCHAR(20),
-- .. More columns, which may be selected, but not filtered
TimeOperation DateTime,
CONSTRAINT PK_MyTable PRIMARY KEY (IntId)
);
And the basic indexes will be
CREATE NONCLUSTERED INDEX IX_MyTable_1 ON [MyTable](TimeOperation);
CREATE NONCLUSTERED INDEX IX_MyTable_2 ON [MyTable](Field1);
Clustering Consideration / Option
If most of your records are inserted in 'serial' ascending TimeOperation order, i.e. intId and TimeOperation will both increase in tandem, then I would leave the clustering on intID (the default) (i.e. table DDL is PRIMARY KEY CLUSTERED (IntId), which is the default anyway).
However, if there is NO correlation between IntId and TimeOperation, and IF most of your queries are of the form SELECT * FROM [MyTable] WHERE TimeOperation between xx and yy then CREATE CLUSTERED INDEX CL_MyTable ON MyTable(TimeOperation) (and changing PK to PRIMARY KEY NONCLUSTERED (IntId)) should improve this query (Rationale: since contiguous times are kept together, fewer pages need to be read, and the bookmark lookup will be avoided). Even better, if values of TimeOperation are guaranteed to be unique, then CREATE UNIQUE CLUSTERED INDEX CL_MyTable ON MyTable(TimeOperation) will improve density as it will avoid the uniqueifier.
Note - for the rest of this answer, I'm assuming that your IntId and TimeOperations ARE strongly correlated and hence the clustering is by IntId.
Covering Indexes
As others have mentioned, your use of SELECT * is bad practice and inter alia means covering indexes won't be of any use (the exception being COUNT(*)).
If your queries weren't SELECT *, but instead e.g.
SELECT TimeOperation, field1
FROM [MyTable]
WHERE TimeOperation BETWEEN x and y -- and returns < 5% data.
Then altering your index on TimeOperation to include field1
CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation) INCLUDE(Field1);
OR adding both to the index (with the most common filter first, or the most selective first if both filters are always present)
CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation, Field1);
Either will avoid the RID / key lookup. The second option (both columns in the index key) will also address your query where BOTH TimeOperation and Field1 are filtered in a WHERE or HAVING clause.
Re : What's the difference between index on (TimeOperation, Field1) and separate indexes?
e.g.
CREATE NONCLUSTERED INDEX IX_MyTable ON [MyTable](TimeOperation, Field1);
will not be useful for the query
SELECT ... FROM MyTable WHERE Field1 = 'xyz';
The index will only be useful for the queries which have TimeOperation
SELECT ... FROM MyTable WHERE TimeOperation between x and y;
OR
SELECT ... FROM MyTable WHERE TimeOperation between x and y AND Field1 = 'xyz';
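The leading-column rule is not specific to SQL Server, so if you want to poke at it without a server at hand, here is a small analogue in SQLite via Python's sqlite3 module. Treat it purely as an illustration: SQLite's planner is not SQL Server's, the column Other and the helper plan_details are made up, and the date literals are arbitrary.

```python
import sqlite3


def plan_details(conn, sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN, joined into one line."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable ("
    "intID INTEGER PRIMARY KEY, Field1 TEXT, Other TEXT, TimeOperation TEXT)"
)
conn.execute("CREATE INDEX IX_MyTable ON MyTable (TimeOperation, Field1)")

# Filter only on the second key column: the composite index is not used.
plan_field_only = plan_details(
    conn, "SELECT * FROM MyTable WHERE Field1 = 'xyz'"
)

# Filter on the leading key column: the index can be searched.
plan_time_range = plan_details(
    conn,
    "SELECT * FROM MyTable "
    "WHERE TimeOperation BETWEEN '2020-01-01' AND '2020-01-31'",
)

# Filter on both columns: the same index is searched.
plan_both = plan_details(
    conn,
    "SELECT * FROM MyTable "
    "WHERE TimeOperation BETWEEN '2020-01-01' AND '2020-01-31' "
    "AND Field1 = 'xyz'",
)

print(plan_field_only)
print(plan_time_range)
print(plan_both)
```

On SQL Server itself you would look at the actual execution plan instead, but the pattern of which queries can seek on the composite index should be the same.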
Hope this helps?
An index, at its most basic, creates a B-tree structure behind the scenes, which allows the SQL engine to find rows with particular values for the indexed columns more easily. Each index creates a different way to "drill down" into the table's data using a tree search (O(log N) performance). Each index you add makes selecting by that index faster, at the cost of slowing insertions/updates (the data must be written and then every index must be updated).
An index, therefore, should normally be created for combinations of columns that are commonly used to filter records. I would indeed create an index on TimeOperation, and TimeOperation alone.
NEVER simply create an index including all columns of a table, especially a wide one such as this.

How can I insert a new table row into every other row in an existing table?

Ok I have a sqlite db, that has roughly 100 rows. It is kind of a strange thing that I'm trying to do, but I need to insert a new row between each of the existing rows.
I have been trying to use the Insert statement as follows, but haven't had any luck:
insert into t1(column1) values("hello") where id%2 == 0
So I'm basically trying to use the %-operator to tell me if the id is even or odd. For every even id number, I'd like to insert a new row.
What am I missing? What can I do differently? How can I insert a new row into every other row and have the index updated as well?
Thanks
Your question assumes that the rows have some kind of built-in order to them, and that you can insert rows between other rows. That's not true.
It is true that rows have an order on disk, and that the id column is usually assigned in order, but that's an implementation detail. When you perform a query, the database is free to return the rows in any order it chooses, unless you specify what you want with an ORDER BY clause.
Now, I'm assuming what you really want is to insert rows between the existing rows in id order. One way to get what you want would look like this:
UPDATE t1 SET id = id * 2;
INSERT INTO t1 (id, column1) SELECT id + 1, 'hello' FROM t1;
The UPDATE would double the ids of all the existing rows (so 1,2,3 becomes 2,4,6); then the INSERT would query t1 and use the result to insert a new set of rows with id values one more than the existing ones (so for 2,4,6 it adds 3,5,7).
I haven't tested the above statements, so I don't know if they would work or if they require some extra trickery (like a temporary table) since we are querying and updating the same table in one statement. Also I may have made a syntax error.
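For what it's worth, here is a variant that can be tried out with Python's sqlite3 module. It assumes id is an INTEGER PRIMARY KEY holding only positive values. Since doubling the ids in place may collide with ids that have not been updated yet (1 becomes 2 while the old 2 still exists), this sketch takes a detour through negative values first; the helper name interleave_rows is made up.

```python
import sqlite3


def interleave_rows(conn, filler):
    """Double every id, then insert a filler row after each existing row.

    Assumes t1.id is an INTEGER PRIMARY KEY with positive values only.
    """
    with conn:  # one transaction: commit on success, roll back on error
        # Step 1: move ids to -2, -4, -6, ... so no update hits an existing id.
        conn.execute("UPDATE t1 SET id = -(id * 2)")
        # Step 2: flip the sign, giving 2, 4, 6, ...
        conn.execute("UPDATE t1 SET id = -id")
        # Step 3: add one new row after each existing row (3, 5, 7, ...).
        conn.execute(
            "INSERT INTO t1 (id, column1) SELECT id + 1, ? FROM t1",
            (filler,),
        )
```

Remember that you still need ORDER BY id when reading the rows back if you want to see them interleaved.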
Don't consider the rows as pre-ordered in the database. A database will store them as they come in, or according to an index. It's your task to order them on retrieval (i.e. when you query for data) according to your needs.
