How to randomly delete 20% of the rows in a SQLite table

Good afternoon. We were wondering how to randomly delete 20% of the rows in an SQLite table with 15000 rows. We noticed that this question was solved on Stack Overflow for SQL Server: Select n random rows from SQL Server table.
But the SQL Server script does not appear to function properly in SQLite. How can we convert the SQL Server script to an SQLite equivalent? Thank you.

Alternatively, since the random() function in SQLite returns a signed 64-bit integer, we can calculate a point within this space as (2^63) * 0.6. Signed integers greater than this make up 40% of the positive signed 64-bit integers, and therefore 20% of the whole signed range.
Truncated to the integer below, this is 5534023222112865484.
Therefore you should be able to get roughly 20% of your rows with a simple:
SELECT * FROM mytable WHERE random() > 5534023222112865484
Or in your case, since you want to delete that many:
DELETE FROM mytable WHERE random() > 5534023222112865484
I hope you enjoy this approach. It may be suitable if you want high performance, since it avoids sorting the table; but note that random() is evaluated once per row, so each row is deleted independently with probability 0.2 and you get approximately 20% of the rows, not an exact count. If you need exactly 20%, it is probably not worth the risk.
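A quick sanity check of the threshold (a sketch; the recursive CTE just generates 100000 random draws, and the result should come out close to 0.2):
-- comparisons yield 0/1 in SQLite, so avg() gives the matching fraction
WITH RECURSIVE n(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM n LIMIT 100000)
SELECT avg(random() > 5534023222112865484) FROM n;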

Not quite 'random', but if you have an identity column on the table you could DELETE FROM mytable WHERE ID % 5 = 0, which should statistically delete very close to a fifth of the rows (assuming the IDs have no large gaps).
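To preview how many rows that predicate would remove before running the DELETE (a sketch, assuming the table is named mytable):
-- should be close to count(*) / 5 when the IDs are contiguous
SELECT count(*) FROM mytable WHERE ID % 5 = 0;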

Try:
DELETE FROM mytable
WHERE ROWID IN (SELECT ROWID FROM mytable ORDER BY RANDOM() LIMIT 3000)
If you want to calculate the 20% in a subquery instead of hard-coding 3000, replace the LIMIT value with:
LIMIT (SELECT CAST(COUNT(*) * 0.2 AS INT) FROM mytable)

SQLite - ORDER BY RAND() provides a hint, so something like this should work:
DELETE FROM mytable WHERE id IN (
    SELECT id FROM mytable ORDER BY RANDOM() LIMIT (
        SELECT CAST(COUNT(id) * 0.2 AS INT) FROM mytable
    )
);

Related

Why does this SQLite query not use an index for the correlated subquery?

Consider a SQLite database for things with parts, containing the following tables:
CREATE TABLE thing (id integer PRIMARY KEY, name text, total_cost real);
CREATE TABLE part (id integer PRIMARY KEY, cost real);
CREATE TABLE thing_part (thing_id REFERENCES thing(id), part_id REFERENCES part(id));
I have an index to find the parts of a thing:
CREATE INDEX thing_part_idx ON thing_part (thing_id);
To illustrate the problem, I'm using the following queries to fill the tables with random data:
INSERT INTO thing(name)
WITH RECURSIVE
    cte(x) AS (
        SELECT 1
        UNION ALL
        SELECT 1 FROM cte LIMIT 10000
    )
SELECT hex(randomblob(4)) FROM cte;
INSERT INTO part(cost)
WITH RECURSIVE
    cte(x) AS (
        SELECT 1
        UNION ALL
        SELECT 1 FROM cte LIMIT 10000
    )
SELECT abs(random()) % 100 FROM cte;
INSERT INTO thing_part (thing_id, part_id)
SELECT thing.id, abs(random()) % 10000 + 1
FROM thing, (SELECT 1 UNION ALL SELECT 1), (SELECT 1 UNION ALL SELECT 1);
So each thing is associated with a small number of parts (4 in this example).
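To double-check that fan-out, a quick count (40000 thing_part rows over 10000 things, so this should print 4.0):
SELECT (SELECT count(*) FROM thing_part) * 1.0 / (SELECT count(*) FROM thing);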
At this point, I have not yet set the total cost of the things. I thought I could use the following query:
UPDATE thing SET total_cost = (
    SELECT sum(part.cost)
    FROM thing_part, part
    WHERE thing_part.thing_id = thing.id
      AND thing_part.part_id = part.id);
but it is extremely slow (I did not have the patience to wait for it to complete).
EXPLAIN QUERY PLAN shows that both thing and thing_part are scanned in full; only the lookup in part uses the rowid:
SCAN TABLE thing
EXECUTE CORRELATED SCALAR SUBQUERY 0
SCAN TABLE thing_part
SEARCH TABLE part USING INTEGER PRIMARY KEY (rowid=?)
If I look at the query plan for the inner query with a fixed thing_id, i.e.
SELECT sum(part.cost)
FROM thing_part, part
WHERE thing_part.thing_id = 1000
AND thing_part.part_id = part.id;
it does use the thing_part_idx:
SEARCH TABLE thing_part USING INDEX thing_part_idx (thing_id=?)
SEARCH TABLE part USING INTEGER PRIMARY KEY (rowid=?)
I would expect the first query to be equivalent to iterating over all rows of thing and executing the inner query each time, but obviously that's not the case. Why? Should I use a different index or rewrite my query or maybe do the iteration in the client to generate multiple queries instead?
In case it matters, I'm using SQLite version 3.22.0.
SQLite might use dynamic typing, but column types still matter: they determine each column's affinity. An index can be used only when the database can prove that index lookups behave the same as comparisons against the actual table values, which often requires the affinities involved to be compatible.
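One way to inspect the declared types (and therefore the affinities) is the pragma_table_info() table-valued function (available since SQLite 3.16); with the original schema above, the type column comes back empty:
-- an empty "type" means no affinity was declared for the column
SELECT name, type FROM pragma_table_info('thing_part');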
So when you tell the database that the thing_part values are integers:
CREATE TABLE thing_part (
    thing_id integer REFERENCES thing(id),
    part_id integer REFERENCES part(id)
);
then the index on that will have the correct affinity, and will be used:
QUERY PLAN
|--SCAN TABLE thing
`--CORRELATED SCALAR SUBQUERY
   |--SEARCH TABLE thing_part USING INDEX thing_part_idx (thing_id=?)
   `--SEARCH TABLE part USING INTEGER PRIMARY KEY (rowid=?)
I would rewrite your query as:
-- calculate the sum for every thing_id at once
WITH cte AS (
    SELECT thing_part.thing_id, sum(part.cost) AS s
    FROM thing_part
    JOIN part ON thing_part.part_id = part.id
    GROUP BY thing_part.thing_id
)
UPDATE thing
SET total_cost = (SELECT s FROM cte WHERE thing.id = cte.thing_id);
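One caveat: a thing with no rows in thing_part gets total_cost = NULL from the correlated subquery. If you would rather store 0, wrap the lookup in coalesce() (a sketch):
WITH cte AS (
    SELECT thing_part.thing_id, sum(part.cost) AS s
    FROM thing_part
    JOIN part ON thing_part.part_id = part.id
    GROUP BY thing_part.thing_id
)
UPDATE thing
-- coalesce() covers things that have no parts at all
SET total_cost = coalesce((SELECT s FROM cte WHERE thing.id = cte.thing_id), 0);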

Teradata performance impacted due to count distinct

There is no join in the query; it is a simple query with two count distincts, yet it is consuming more than 9k CPU.
I have collected the necessary stats but have been unable to reduce the CPU. Please suggest some good methods to reduce the impact CPU.
I think the target table is a SET table, which is why your query is taking a lot of CPU (duplicate row elimination).
1) Test your INSERT ... SELECT against a MULTISET table:
insert into multiset_table
select count(distinct col1) from source_table;
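For reference, a minimal sketch of such a target table (the table and column names are placeholders):
-- MULTISET tables skip the per-row duplicate check that SET tables perform
CREATE MULTISET TABLE multiset_table (
    distinct_cnt BIGINT
) PRIMARY INDEX (distinct_cnt);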
I also suspect that your primary index is skewed, which would explain the high impact CPU.
2) Make sure your primary index is unique. You can check the row distribution across AMPs with:
select hashamp(hashbucket(hashrow(<primary index columns>))) as amp_no,
       cast(count(*) as bigint) as cnt
from target_table
group by 1
order by 2 desc;
If cnt is not distributed evenly, change the primary index of the table to a more unique column combination.
In my experience, only two things cause a merge step to run slow:
1) the target table is a SET table;
2) the primary index of the target table is badly skewed.

How to read the last record in SQLite table?

Is there a way to read the value of the last record inserted in an SQLite table without going through the previous records?
I ask this question for performance reasons.
There is a C API function named sqlite3_last_insert_rowid() which returns the integer key (ROWID) of the most recent insert operation: http://www.sqlite.org/c3ref/last_insert_rowid.html
This only helps if you know the last insert happened on the table you care about.
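The same value is also exposed at the SQL level, so you don't need the C API to read it:
-- rowid of the most recent successful INSERT on this database connection
SELECT last_insert_rowid();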
If you need the last row in a table, regardless of whether the last insert happened on that table or not, you will have to use a SQL query:
SELECT * FROM mytable WHERE ROWID IN ( SELECT max( ROWID ) FROM mytable );
When you sort the records by ID in reverse order, the last record will be returned first.
(Because of the implicit index on the autoincrementing column, this is efficient.)
If you aren't interested in any other records, use LIMIT:
SELECT *
FROM MyTable
ORDER BY _id DESC
LIMIT 1

Calculating the percentage of dates (SQL Server)

I'm trying to add an auto-calculated field in SQL Server 2012 Express that stores the % of project completion, computed by comparing the project dates:
ALTER TABLE dbo.projects
ADD PercentageCompleted AS (select COUNT(*) FROM projects WHERE project_finish > project_start) * 100 / COUNT(*)
But I am getting this error:
Msg 1046, Level 15, State 1, Line 2
Subqueries are not allowed in this context. Only scalar expressions are allowed.
What am I doing wrong?
Even if it were possible (it isn't), it is not something you would want as a calculated column anyway:
it will be the same value in each row
the entire table would need to be updated after every insert/update
You should consider doing this in a stored procedure or a user-defined function instead; a sketch of the latter follows below. Or, even better, do it in the business logic of your application.
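For the user-defined-function route, a minimal sketch (the function name is made up here; it assumes the project_start/project_finish columns from the question):
CREATE FUNCTION dbo.ProjectPercentageCompleted()
RETURNS decimal(5,2)
AS
BEGIN
    DECLARE @pct decimal(5,2);
    -- completed projects as a percentage of all projects
    SELECT @pct = 100.0 * COUNT(CASE WHEN project_finish > project_start THEN 1 END)
                  / NULLIF(COUNT(*), 0)
    FROM dbo.projects;
    RETURN @pct;
END
-- usage: SELECT dbo.ProjectPercentageCompleted();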
I don't think you can do that. You could write a trigger to figure it out or do it as part of an update statement.
Are you storing "percentageCompleted" as a column in the same table as your project data?
If so, I would not recommend it, because the same value would be duplicated in every row.
If you don't care about duplicating the data, try something like this, separating the steps out:
ALTER TABLE dbo.projects
ADD PercentageCompleted decimal(3,2) -- you could also store it as a varchar or char
GO -- run the ALTER in its own batch before referencing the new column
declare @percentageVariable decimal(3,2)
select @percentageVariable = (select count(*) from projects where project_finish > project_start) * 1.0 / (select count(*) from projects) -- ratio of completed to total
update projects
set PercentageCompleted = @percentageVariable
This will give you a decimal value in that table; you can then format it as a percentage on SELECT if you wish, by multiplying by 100 and appending '%'.
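For example (a sketch of that formatting step):
-- render the stored ratio (e.g. 0.20) as a percentage string (e.g. 20.00%)
SELECT CAST(PercentageCompleted * 100 AS varchar(10)) + '%' AS pct
FROM dbo.projects;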

Sqlite Query Optimization (using Limit and Offset)

Following is the query that I use for getting a fixed number of records from a database with millions of records:-
select * from myTable LIMIT 100 OFFSET 0
What I observed is that if the offset is very high, say 95000, the query takes much longer to execute. Here is the time difference between two queries with different offsets:
select * from myTable LIMIT 100 OFFSET 0     -- execution time is less than 1 sec
select * from myTable LIMIT 100 OFFSET 95000 -- execution time is almost 15 secs
Can anyone suggest how to optimize this query, so that the execution time is equally fast for any OFFSET?
Newly added:
The actual scenario is that I have a database with more than 1 million records. But since it's an embedded device, I just can't do "select * from myTable" and then fetch all the records from the query; my device crashes. Instead, I keep fetching records batch by batch (batch size = 100 or 1000 records) using the query mentioned above. But as I mentioned, it becomes slow as the offset increases. So my ultimate aim is to read all the records from the database, and since I can't fetch them all in a single execution, I need some other efficient way to do it.
As JvdBerg said, indexes are not used with LIMIT/OFFSET.
Simply adding 'ORDER BY indexed_field' will not help either.
To speed up pagination you should avoid LIMIT/OFFSET and use a WHERE clause instead. For example, if your primary key field is named 'id' and has no gaps, then your code above can be rewritten like this:
SELECT * FROM myTable WHERE id>=0 AND id<100         -- very fast!
SELECT * FROM myTable WHERE id>=95000 AND id<95100   -- as fast as the previous line!
With an offset of 95000, all 95000 preceding records are processed first. You should create an index on the table and use it for selecting records.
As @user318750 said, if you know you have a contiguous index, you can simply use
select * from Table where index >= %start and index < %(start+size)
However, those cases are rare. If you don't want to rely on that assumption, use a sub-query, for example on rowid, which is always indexed:
select * from Table where rowid in (
    select rowid from Table limit %size offset %start)
This speeds things up especially if you have "fat" rows (e.g. that contain blobs).
If maintaining the record order is important (it usually isn't), you need to order the rowids first:
select * from Table where rowid in (
    select rowid from Table order by rowid limit %size offset %start)
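For the batch-scan scenario in the question there is another option, keyset pagination: remember the largest rowid you have seen and continue from there, so earlier rows are never re-processed (a sketch; :last_rowid is a parameter your application tracks between batches, starting at 0):
SELECT * FROM myTable WHERE rowid > :last_rowid ORDER BY rowid LIMIT 100;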
The same trick fetches a single row at a large offset cheaply:
select * from data where rowid = (select rowid from data limit 1 offset 999999);
With SQLite, you don't need to get all the rows back at once in one big array: you can have a callback invoked for every row. That way you can process the results as they come in, which should address both your crashing and your performance issues.
I guess you're not using C, as you would already be using a callback, but this technique should be available in any other language.
JavaScript example (from https://www.npmjs.com/package/sqlite3):
db.each("SELECT rowid AS id, info FROM lorem", function(err, row) {
    console.log(row.id + ": " + row.info);
});
