Deleting duplicate rows - sqlite

I am learning SQLite and constructed a line which I thought would delete dups but it deletes all rows instead.
DELETE from tablename WHERE rowid not in (SELECT distinct(timestamp) from tablename);
I expected this to delete rows with a duplicate (leaving one). I know I can simply create a new table with the distinct rows, but why does what I have done not work? Thanks

If timestamp is a column in the table and it is what you want to compare in order to delete duplicates, then do this:
delete from tablename
where exists (
select 1 from tablename t
where t.rowid < tablename.rowid and t.timestamp = tablename.timestamp
)

With recent versions of sqlite, the following is an alternative:
DELETE FROM tablename
WHERE rowid IN (SELECT rowid
FROM (SELECT rowid, row_number() OVER (PARTITION BY timestamp) AS rownum
FROM tablename)
WHERE rownum >= 2);

why does what I have done not work?
Consider the WHERE condition:
rowid not in (SELECT distinct(timestamp) from tablename)
The simple answer is that you are not comparing data in the same columns, nor are they columns with the same type of data. rowid is an automatically assigned integer column, and I assume that the timestamp column is either a numeric or string column containing time values, or perhaps custom-generated sequential numeric values. Because a rowid will likely never match any value in timestamp, the NOT IN condition is true for every row, so every row of the table is deleted.
SQL is rather explicit and so there are no hidden/mysterious column comparisons. It will not automatically compare the rowid's from one query with another. Notice that the various alternative statements do something to distinguish rows with duplicate key values (timestamp in your case), either by direct comparison between main query and subquery, or using windowing functions to uniquely label rows with duplicate values, etc.
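To see the mismatch concretely, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and sample values are assumptions made up for illustration (the question never shows its schema); only the DELETE statement comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the question only tells us there is a timestamp column.
conn.execute("CREATE TABLE tablename (timestamp INTEGER, event TEXT)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?)",
    [(1536074432, "a"), (1536074432, "b"), (1536074434, "c")],
)

# The subquery yields timestamps, while the outer query compares rowids.
timestamps = {r[0] for r in conn.execute("SELECT DISTINCT timestamp FROM tablename")}
rowids = {r[0] for r in conn.execute("SELECT rowid FROM tablename")}
overlap = rowids & timestamps  # rowids 1..3 never equal one of the timestamps

# The statement from the question: NOT IN is true for every row.
deleted = conn.execute(
    "DELETE FROM tablename WHERE rowid NOT IN "
    "(SELECT DISTINCT timestamp FROM tablename)"
).rowcount
remaining = conn.execute("SELECT count(*) FROM tablename").fetchone()[0]
print(overlap, deleted, remaining)
```

Running the subquery on its own, as done here, is usually the quickest way to spot that the two sides of the comparison come from different columns.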
Just for kicks, here's another alternative that uses NOT IN like your original code.
DELETE FROM tablename
WHERE rowid NOT IN (
SELECT max(t.rowid) FROM tablename t
GROUP BY t.timestamp )
First notice that this is comparing rowid with max(t.rowid), values which derive from the same column.
Because the subquery groups on t.timestamp, the aggregate function max() will return the greatest/last t.rowid separately for each set of rows with the same t.timestamp value. The resultant list will exclude t.rowid values that are less than the maximum. Thus, the NOT IN operation will not find those lesser values and will return true so they will be deleted.
It also uses basic SQL (no window functions... the OVER keyword). It will likely be more efficient than the alternative that references the outer query from the subquery, because this statement can execute the subquery just once and then use an efficient index to match individual records... it doesn't need to rerun the query for each row. For that matter, it should also be more efficient than the windowing function, because the window partition essentially "groups" on the partitioned columns, but must then execute the windowing function for each row, an extra step not present in the basic aggregate query. Efficiency is not always critical, but something important to consider.
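As a rough check of the two plain-SQL variants above, the following sketch runs each DELETE against the same sample data through Python's sqlite3 module. The schema, the sample rows, and the helper name dedupe are assumptions for illustration only. It says nothing about relative speed; it only shows which row of each duplicate set survives.

```python
import sqlite3

# Hypothetical sample data: (timestamp, event); rowids are assigned 1..6 in order.
ROWS = [(100, "a"), (100, "b"), (200, "c"), (100, "d"), (300, "e"), (300, "f")]

def dedupe(delete_sql):
    """Load the sample rows, run one DELETE, and return the surviving rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tablename (timestamp INTEGER, event TEXT)")
    conn.executemany("INSERT INTO tablename VALUES (?, ?)", ROWS)
    conn.execute(delete_sql)
    return conn.execute(
        "SELECT rowid, timestamp, event FROM tablename ORDER BY rowid"
    ).fetchall()

# NOT IN + max(rowid): keeps the last row of each timestamp group.
kept_latest = dedupe(
    "DELETE FROM tablename WHERE rowid NOT IN "
    "(SELECT max(t.rowid) FROM tablename t GROUP BY t.timestamp)"
)

# Correlated EXISTS: deletes any row that has an earlier twin, keeping the first.
kept_earliest = dedupe(
    "DELETE FROM tablename WHERE EXISTS ("
    "SELECT 1 FROM tablename t "
    "WHERE t.rowid < tablename.rowid AND t.timestamp = tablename.timestamp)"
)

print(kept_latest)
print(kept_earliest)
```

Both leave exactly one row per timestamp; they differ only in whether the earliest or the latest rowid is the one kept.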
By the way, the distinct keyword is not a function and does not need or accept parentheses. It is a directive that applies to the entire select statement. The subquery is being interpreted as
SELECT DISTINCT (timestamp) FROM tablename
where DISTINCT is interpreted in isolation and the parentheses are interpreted as a separate expression.
Update
These two queries will return the same data:
SELECT DISTINCT timestamp FROM tablename;
SELECT timestamp FROM tablename GROUP BY timestamp;
Both results eliminate duplicate rows from the output by showing only unique/distinct values, but neither has a "handle" (other data column) which indicates which rows to keep and which rows to eliminate. In other words, these queries return distinct values, but the results lose all relationship to the source rows and so have no use in specifying which source rows to delete (or keep). To understand better, you should run subqueries separately to inspect what they return so that you can understand and verify what data you're working with.
To make those queries useful, we need to do something to distinguish rows with duplicate key values. The rows need a "handle"--some other key value to select for either deleting or keeping those rows. Try this...
SELECT DISTINCT rowid, timestamp FROM tablename;
But that won't work, because DISTINCT applies to ALL returned columns. Since rowid is already unique, every row is output separately, so the query is of no use.
SELECT max(rowid), timestamp FROM tablename GROUP BY timestamp;
That query preserves the unique grouping, but provides just one rowid per timestamp as the "handle" to include/exclude for deletion.
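The three queries discussed in this update can be compared side by side with a small sketch in Python's sqlite3 module. The schema and sample rows are assumptions for illustration; an ORDER BY is added to each query only to make the printed output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data; rowids are assigned 1, 2, 3 in insert order.
conn.execute("CREATE TABLE tablename (timestamp INTEGER, event TEXT)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?)",
    [(100, "a"), (100, "b"), (200, "c")],
)

# Distinct values only: no way to tell which source row each value came from.
distinct_only = conn.execute(
    "SELECT DISTINCT timestamp FROM tablename ORDER BY timestamp"
).fetchall()

# DISTINCT over (rowid, timestamp): rowid is unique, so nothing collapses.
with_rowid = conn.execute(
    "SELECT DISTINCT rowid, timestamp FROM tablename ORDER BY rowid"
).fetchall()

# GROUP BY with max(rowid): one "handle" per distinct timestamp.
handles = conn.execute(
    "SELECT max(rowid), timestamp FROM tablename "
    "GROUP BY timestamp ORDER BY timestamp"
).fetchall()

print(distinct_only)
print(with_rowid)
print(handles)
```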

Try this:
DELETE FROM liens WHERE
id IN
( SELECT * FROM (SELECT min(id) FROM liens GROUP BY lkey HAVING count(*) > 1 ) AS c)
Each run removes one duplicate per lkey, so repeat it until no rows are deleted.

Related

SELECT rows from SQLite Database where condition met with limit return rows with smallest IDs

I have a database with an INTEGER PRIMARY KEY AUTOINCREMENT column named id and a condition flag column, call it condition, which is an INTEGER.
I would like to be able to SELECT a given number of rows N where condition=1. That is easy enough to query (for example if N=10):
SELECT data FROM table_name WHERE condition=1 LIMIT 10;
However I would like to be guaranteed that the rows I receive are also those rows with the smallest values of id from the full set of rows where condition=1. For example if rows with id between 1 and 20 have condition=1 I would like my query to be guaranteed to return rows with id=1 - 10.
My understanding is that ORDER BY is completed after the query so I don't think including ORDER BY id would make this a guarantee. Is there a way to guarantee this?
Well you are wrong:
SELECT data FROM table_name WHERE condition=1 ORDER BY id LIMIT 10;
is what you need.
It will sort the rows you need and then the limit is applied.
From http://www.sqlitetutorial.net/sqlite-limit/
SQLite LIMIT and ORDER BY clause
We typically use the LIMIT clause with ORDER BY clause, because we are
interested in getting the number of rows in a specified order, not in
unspecified order.
The ORDER BY clause appears before the LIMIT clause in the SELECT
statement.
SQLite sorts the result set before getting the number of
rows specified in the LIMIT clause.
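A quick way to convince yourself is to try it. The sketch below uses Python's sqlite3 module with a made-up table matching the question's description: 30 rows, of which ids 1 to 20 have condition=1. The row contents are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, condition INTEGER, data TEXT)"
)
# Hypothetical data: ids 1..20 get condition=1, ids 21..30 get condition=0.
conn.executemany(
    "INSERT INTO table_name (condition, data) VALUES (?, ?)",
    [(1 if i <= 20 else 0, f"row{i}") for i in range(1, 31)],
)

# The filter is applied, the matches are ordered by id, then LIMIT cuts the list.
ids = [
    row[0]
    for row in conn.execute(
        "SELECT id, data FROM table_name WHERE condition=1 ORDER BY id LIMIT 10"
    )
]
print(ids)
```

Without the ORDER BY the same rows may well come back in practice, but that order would be an implementation detail rather than something the query asks for.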

Date difference between two separate rows in SQLite with no ID

I have data in SQLite like this (a few thousands of rows):
1536074432|startRecording
1536074434|stopRecording
1536074443|startRecording
1536074447|stopRecording
1536074458|startRecording
1536074462|stopRecording
And I'd like to get the amounts of seconds passed between two consecutive distinct events (basically how many seconds of video I've recorded).
I know about another similar question (Date Difference between consecutive rows), but in my case it's different because I cannot get the "next" row by ID; I have to get it based on a different event name.
There is an answer that works magic, but it's specific to SQL Server ( Query to find the time difference between successive events ), and I need this for SQLite.
I could do this in Oracle with the LAG / LEAD functions, but no idea how to do it in SQLite.
I could also do this with a separate parsing script, but I think it would be more efficient to be able to do this directly from a query.
Even though there is no id in the table, sqlite stores a rowid (from sqlite CREATE_TABLE doc):
ROWIDs and the INTEGER PRIMARY KEY
Except for WITHOUT ROWID tables, all rows within SQLite tables have a 64-bit signed integer key that uniquely identifies the row within its table. This integer is usually called the "rowid". The rowid value can be accessed using one of the special case-independent names "rowid", "oid", or "_rowid_" in place of a column name. If a table contains a user defined column named "rowid", "oid" or "_rowid_", then that name always refers the explicitly declared column and cannot be used to retrieve the integer rowid value.
Assuming perfectly clean data as described :) how about:
select a.rowid,a.time,a.event,b.rowid,b.time,b.event,b.time - a.time as elapsed --,sum(b.time-a.time)
from t2 a, t2 b
where a.rowid % 2 = 1
and b.rowid = a.rowid + 1
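Here is a minimal sketch of that rowid pairing, run through Python's sqlite3 module on the six sample rows from the question. The table name t2 and the column names time and event follow the query above; the extra check on the event names is an addition of this sketch, meant as a guard against data that is not perfectly alternating. It still assumes the rows were inserted in time order with consecutive rowids and no deletions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t2 (time INTEGER, event TEXT)")
conn.executemany(
    "INSERT INTO t2 VALUES (?, ?)",
    [
        (1536074432, "startRecording"),
        (1536074434, "stopRecording"),
        (1536074443, "startRecording"),
        (1536074447, "stopRecording"),
        (1536074458, "startRecording"),
        (1536074462, "stopRecording"),
    ],
)

# Pair every odd rowid (a start) with the row right after it (a stop).
pairs = conn.execute(
    """
    SELECT a.time, b.time, b.time - a.time AS elapsed
    FROM t2 a JOIN t2 b ON b.rowid = a.rowid + 1
    WHERE a.rowid % 2 = 1
      AND a.event = 'startRecording'
      AND b.event = 'stopRecording'
    ORDER BY a.rowid
    """
).fetchall()

total = sum(elapsed for _, _, elapsed in pairs)
print(pairs, total)
```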

Make select query return in order of arguments

I have a relatively simple select query which asks for rows by a column value (this is not controlled by me). I pass in a variable argument of id values to be returned. Here's an example:
select * from team where id in (2, 1, 3)
I'm noticing that as the database changes its order over time, my results are changing order as well. Is there a way to make SQLite guarantee results in the same order as the arguments?
If you could have so many IDs that the query becomes unwieldy, use a temporary table to store them:
CREATE TEMPORARY TABLE SearchIDs (
ID,
OrderNr INTEGER PRIMARY KEY
);
(The OrderNr column is autoincrementing so that it automatically gets proper values when you insert values.)
To do the search, you have to fill this table:
INSERT INTO SearchIDs(ID) VALUES (2), (1), (3) ... ;
SELECT Team.*
FROM Team
JOIN SearchIDs USING (ID)
ORDER BY SearchIDs.OrderNr;
DELETE FROM SearchIDs;
Try this!
select * from team where id in (2, 1, 3) order by
case id when 2 then 0
when 1 then 1
when 3 then 2
end
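Since the id list is variable, the CASE expression has to be generated. The following is a sketch of one way to do that from Python's sqlite3 module; the helper name fetch_in_order, the name column, and the sample rows are hypothetical. The ids are bound as parameters (once for IN, once for the CASE), and only the integer positions are formatted into the SQL text. It assumes a non-empty list of ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical team table with a few rows.
conn.execute("CREATE TABLE team (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO team VALUES (?, ?)",
    [(1, "one"), (2, "two"), (3, "three"), (4, "four")],
)

def fetch_in_order(conn, ids):
    """Return the rows whose id is in ids, in the order the ids were given."""
    placeholders = ", ".join("?" for _ in ids)
    # Each id maps to its position in the argument list.
    whens = " ".join(f"WHEN ? THEN {pos}" for pos in range(len(ids)))
    sql = (
        f"SELECT * FROM team WHERE id IN ({placeholders}) "
        f"ORDER BY CASE id {whens} END"
    )
    return conn.execute(sql, list(ids) + list(ids)).fetchall()

rows = fetch_in_order(conn, [2, 1, 3])
print(rows)
```

For very long id lists the temporary-table approach is likely the better fit, since SQLite limits the number of bound parameters per statement.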

SQLITE Insert Multiple Rows Using Select as Value

I have a sqlite statement that will only insert one row.
INSERT INTO queue (TransKey, CreateDateTime, Transmitted)
VALUES (
(SELECT Id from trans WHERE Id != (SELECT TransKey from queue)),
'2013-12-19T19:47:33',
0
)
How would I have it insert every row where Id from trans != (SELECT TransKey from queue) in one statement?
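INSERT INTO queue (TransKey, CreateDateTime, Transmitted)
SELECT Id, '2013-12-19T19:47:33', 0
FROM trans WHERE Id NOT IN (SELECT TransKey FROM queue)
Note the NOT IN: a subquery used as a scalar value, as in Id != (SELECT ...), yields only its first row, so it would compare each Id against a single TransKey.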
There are two different "flavors" of INSERT. The one you're using (VALUES) inserts one or more rows that you "create" in the INSERT statement itself. The other flavor (SELECT) inserts a variable number of rows that are retrieved from one or more other tables in the database.
While it's not immediately obvious, the SELECT version allows you to include expressions and simple constants -- as long as the number of columns lines up with the number of columns you're inserting, the statement will work (in other databases, the types of the values must match the column types as well).
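A small sketch of the SELECT flavor, run through Python's sqlite3 module. The table definitions and sample rows are assumptions, since the question does not show its schema. It uses NOT IN so that every TransKey already in queue is excluded, which is an assumption about what the question intends.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schemas for the two tables named in the question.
conn.execute("CREATE TABLE trans (Id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE queue (TransKey INTEGER, CreateDateTime TEXT, Transmitted INTEGER)"
)
conn.executemany("INSERT INTO trans VALUES (?)", [(1,), (2,), (3,), (4,)])
# Transactions 1 and 3 are already queued.
conn.executemany(
    "INSERT INTO queue VALUES (?, '2013-12-01T00:00:00', 1)", [(1,), (3,)]
)

# One statement inserts a row for every trans.Id not yet in queue.
# The two constants are repeated for each selected row.
conn.execute(
    """
    INSERT INTO queue (TransKey, CreateDateTime, Transmitted)
    SELECT Id, '2013-12-19T19:47:33', 0
    FROM trans
    WHERE Id NOT IN (SELECT TransKey FROM queue)
    """
)

queued = [r[0] for r in conn.execute("SELECT TransKey FROM queue ORDER BY TransKey")]
new_rows = conn.execute(
    "SELECT TransKey, CreateDateTime FROM queue "
    "WHERE Transmitted = 0 ORDER BY TransKey"
).fetchall()
print(queued, new_rows)
```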

How do I find out if a SQLite index is unique? (With SQL)

I want to find out, with an SQL query, whether an index is UNIQUE or not. I'm using SQLite 3.
I have tried two approaches:
SELECT * FROM sqlite_master WHERE name = 'sqlite_autoindex_user_1'
This returns information about the index ("type", "name", "tbl_name", "rootpage" and "sql"). Note that the sql column is empty when the index is automatically created by SQLite.
PRAGMA index_info(sqlite_autoindex_user_1);
This returns the columns in the index ("seqno", "cid" and "name").
Any other suggestions?
Edit: The above example is for an auto-generated index, but my question is about indexes in general. For example, I can create an index with "CREATE UNIQUE INDEX index1 ON visit (user, date)". It seems no SQL command will show if my new index is UNIQUE or not.
PRAGMA INDEX_LIST('table_name');
Returns a table with 3 columns:
seq Unique numeric ID of index
name Name of the index
unique Uniqueness flag (nonzero if UNIQUE index.)
Edit
Since SQLite 3.16.0 you can also use table-valued pragma functions which have the advantage that you can JOIN them to search for a specific table and column. See @mike-scotty's answer.
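A minimal sketch of the PRAGMA index_list check, using Python's sqlite3 module. The visit table and index1 come from the question; the note column and index2 are hypothetical additions so that the output shows a non-unique index and an auto-generated one as well. The code relies on the name and unique flag being the second and third columns of each returned row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "note TEXT UNIQUE" makes SQLite create an automatic index for the constraint.
conn.execute("CREATE TABLE visit (user TEXT, date TEXT, note TEXT UNIQUE)")
conn.execute("CREATE UNIQUE INDEX index1 ON visit (user, date)")
conn.execute("CREATE INDEX index2 ON visit (date)")

# Map each index name to its uniqueness flag.
info = {
    row[1]: bool(row[2])
    for row in conn.execute("PRAGMA index_list('visit')")
}
for name, is_unique in sorted(info.items()):
    print(name, is_unique)
```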
Since no one's come up with a good answer, I think the best solution is this:
If the index name starts with "sqlite_autoindex", it is an auto-generated index for a single UNIQUE column
Otherwise, look for the UNIQUE keyword in the sql column in the table sqlite_master, with something like this:
SELECT * FROM sqlite_master WHERE type = 'index' AND sql LIKE '%UNIQUE%'
You can programmatically build a select statement to see if any tuples point to more than one row. If you get back three columns, foo, bar and baz, create the following query:
select count(*) from t
group by foo, bar, baz
having count(*) > 1
If that returns any rows, your index is not unique, since more than one row maps to the given tuple. If sqlite3 supports derived tables (I've yet to have the need, so I don't know off-hand), you can make this even more succinct:
select count(*) from (
select count(*) from t
group by foo, bar, baz
having count(*) > 1
)
This will return a single row result set, denoting the number of duplicate tuple sets. If positive, your index is not unique.
You are close:
1) If the index name starts with "sqlite_autoindex", it is an auto-generated index for a PRIMARY KEY or UNIQUE constraint. However, it will be in the sqlite_master or sqlite_temp_master table, depending on whether the table being indexed is temporary.
2) You need to watch out for table names and columns that contain the substring unique, so you want to use:
SELECT * FROM sqlite_master WHERE type = 'index' AND sql LIKE 'CREATE UNIQUE INDEX%'
See the sqlite website documentation on Create Index
As of sqlite 3.16.0 you could also use pragma functions:
SELECT distinct il.name
FROM sqlite_master AS m,
pragma_index_list(m.name) AS il,
pragma_index_info(il.name) AS ii
WHERE m.type='table' AND il.[unique] = 1;
The above statement will list all names of unique indexes.
SELECT DISTINCT m.name as table_name, ii.name as column_name
FROM sqlite_master AS m,
pragma_index_list(m.name) AS il,
pragma_index_info(il.name) AS ii
WHERE m.type='table' AND il.[unique] = 1;
The above statement will return all tables and their columns if the column is part of a unique index.
From the docs:
The table-valued functions for PRAGMA feature was added in SQLite version 3.16.0 (2017-01-02). Prior versions of SQLite cannot use this feature.

Resources