Sqlite3: How to interrupt a long-running update without rollback?

I have a long-running multirow update such as:
UPDATE T set C1 = calculation(C2) where C1 is NULL
If the table is large, this update may take many seconds or even minutes. During this time, all other queries on this table fail with "database is locked" after the connection timeout expires (currently my timeout is 5 seconds).
I would like to stop this update query after, say, 3 seconds, then restart it. Hopefully, after several restarts the entire table will be updated.
Another option is to stop this update query before making any other request (this will require inter-process cooperation, but it may be doable). But I cannot find a way to stop the update query without rolling back all the previously updated records.
I tried calling interrupt and returning non-zero from the progress_handler. Both of these approaches abort the update command and roll back all the changes. So, it appears that sqlite treats this update as a transaction, which does not make much sense in this case because all rows are independent. But I cannot start a new transaction for each row, can I? If interrupt and progress_handler cannot help me, what else can I do?
I also tried UPDATE with LIMIT and also WHERE custom_condition(C1). These approaches do allow me to terminate the update earlier, but they are significantly slower than the regular update, and they cannot terminate the query at a specific time (before another connection's timeout expires).
Any other ideas? This multirow update is such a common operation that I hope other people have a good solution for it.

So, it appears that sqlite treats this update as a transaction, which does not make much sense in this case because all rows are independent.
No, that actually makes perfect sense, because you're not executing multiple, independent updates. You're executing a single update statement. The fine manual says:
No changes can be made to the database except within a transaction.
Any command that changes the database (basically, any SQL command
other than SELECT) will automatically start a transaction if one is
not already in effect. Automatically started transactions are
committed when the last query finishes.
If you can determine the range of keys involved, you can execute multiple update statements. For example, if the key is an integer and you determine the range to be from 1 to 1,000,000, you can write code to execute this series of updates:
begin transaction;
UPDATE T set C1 = calculation(C2)
where C1 is NULL and your_key between 1 and 100000;
commit;
begin transaction;
UPDATE T set C1 = calculation(C2)
where C1 is NULL and your_key between 100001 and 200000;
commit;
Other possibilities:
You can sleep for a bit between transactions to give other queries a chance to execute.
You can also time execution using application code, and calculate a best guess at range values that will avoid timeouts and still give good performance.
You can select the keys for the rows that will be updated, and use their values to optimize the range of keys.
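For the last suggestion, a minimal sketch (reusing the hypothetical T, C1 and your_key names from the batches above): find the actual key range of the rows that still need work, then split that range into batches in your application code.
SELECT min(your_key) AS lo, max(your_key) AS hi, count(*) AS remaining
FROM T
WHERE C1 IS NULL;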
In my experience, it's unusual to treat updates this way, but it sounds like it fits your application.
But I cannot start a new transaction for each row, can I?
Well, you can, but it's probably not going to help. It's essentially the same as the method above, using a single key instead of a range. I wouldn't fire you for testing that, though.
On my desktop, I can insert 100k rows in 1.455 seconds, and update 100k rows with a simple calculation in 420 ms. If you're running on a phone, that's probably not relevant.

You mentioned poor performance with LIMIT. Do you have a lastupdated column with an index on it? At the top of your procedure you would capture the datetime the run commenced (UPDATE_COMMENCED below) and use it for every batch in the run:
update foo
set myvalue = 'x', lastupdated = UPDATE_COMMENCED
where id in
(
select id from foo where lastupdated < UPDATE_COMMENCED
limit SOME_REASONABLE_NUMBER
)
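If the column and index don't exist yet, something along these lines would set them up (SQLite syntax assumed; the index name is illustrative):
ALTER TABLE foo ADD COLUMN lastupdated;
CREATE INDEX idx_foo_lastupdated ON foo(lastupdated);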
P.S. With respect to slowness:
I also tried UPDATE with LIMIT and also WHERE custom_condition(C1). These approaches do allow me to terminate the update earlier, but they are significantly slower than the regular update...
If you're willing to give other processes access to stale data, and your update is designed so as not to hog system resources, why is there a need to have the update complete within a certain amount of time? There seems to be no need to worry about performance in absolute terms. The concern should be relative to other processes -- make sure they're not blocked.

I also posted this question at
http://thread.gmane.org/gmane.comp.db.sqlite.general/81946
and got several interesting answers, such as:
divide the range of rowid into slices and update one slice at a time
use the AUTOINCREMENT feature to start a new update at the place where the previous update ended (via LIMIT 10000)
create a trigger that calls SELECT RAISE(FAIL, ...) to abort the update without rollback
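Here is a rough sketch of the trigger idea from the last item, for SQLite; the deadline table, trigger name, and 3-second budget are made up for illustration. The point is that RAISE(FAIL, ...) stops the UPDATE but does not back out the rows it has already changed (whether those partial changes end up committed in autocommit mode is worth verifying for your setup).
CREATE TEMP TABLE update_deadline(expires_at INTEGER);
INSERT INTO update_deadline VALUES (strftime('%s','now') + 3);

CREATE TEMP TRIGGER stop_long_update
BEFORE UPDATE OF C1 ON T
WHEN strftime('%s','now') > (SELECT expires_at FROM update_deadline)
BEGIN
    SELECT RAISE(FAIL, 'time budget exceeded');
END;

-- the long-running statement now aborts itself once the budget is spent
UPDATE T set C1 = calculation(C2) where C1 is NULL;
-- refresh the deadline row (or drop the temp trigger) before the next run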

Related

Running COUNT in ClickHouse immediately after INSERT returns 0

I'm running an INSERT query into a Distributed table of ReplicatedMergeTree with 2 nodes (single shard).
After the INSERT, I want to check the number of INSERTED records, so I run a COUNT query on the Distributed table.
At first, the COUNT returns 0. After several seconds (it can take more than a minute) the count returns the correct number.
I've checked using SHOW PROCESSLIST that the INSERT query has finished running.
Is there a way to verify that everything is in order before executing the COUNT?
It seems you may need to use the FINAL keyword. It is mentioned that one should try to avoid it, so you might be better off checking the table design and storage engine, but it could be a good interim solution.
https://clickhouse.com/docs/en/sql-reference/statements/select/from/
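For example (the table name is a placeholder), the modifier goes right after the table in the FROM clause:
SELECT count() FROM my_distributed_table FINAL;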

Why does this transaction produce a deadlock?

My application runs on Maria DB using a master-master Galera replication setup.
The application can handle deadlocks, but I've been working to minimize those that occur as they fill up my log files. There remains one transaction that gets regular deadlocks and I don't know how to avoid it.
The process deletes a record from one table, does a couple of operations on other tables and then finally inserts a record into the original table.
The transaction looks broadly like this:
1. DELETE FROM table_a WHERE `id` = 'Foo'
2. REPLACE INTO table_b ( ... )
3. UPDATE table_c SET ....
4. INSERT INTO table_a (id,...) VALUES ('Bar',...)
The final insert regularly gets a deadlock although retrying the transaction fixes it. What is it about this pattern that causes a deadlock? What can I do to reduce the occurrence?
Question: Is the 'deadlock' in the node you are writing to? Or does the deadlock not occur until COMMIT; that is, when trying to reconcile across the cluster?
If on the writing node...
As soon as possible in the transaction, do
SELECT id FROM table_a WHERE ... FOR UPDATE;
to signal what row(s) you will be inserting in step 4.
Also, consider changing REPLACE to an equivalent INSERT .. ON DUPLICATE KEY UPDATE ... I don't know if it will directly help with the deadlock, but at least it is (probably) more efficient.
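A sketch of that rewrite, with made-up columns (id being the primary/unique key): a statement like REPLACE INTO table_b (id, col1) VALUES (?, ?) becomes
INSERT INTO table_b (id, col1)
VALUES (?, ?)
ON DUPLICATE KEY UPDATE col1 = VALUES(col1);
Unlike REPLACE, this updates the existing row in place instead of deleting and re-inserting it.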
If on the cluster...
Are you touching lots of different rows? Are you using round-robin to pick which node to write to?
In any case, speeding up the transaction will help. Is there anything that can be pulled out of the transaction? Some thoughts there:
Normalization can generally be done in its own transaction.
If you have an "if" in the transaction, it might be worth it to do a tentative test beforehand. (But you probably need to keep the "if" in the transaction.)

Controlling read locks on table for multithreaded plsql execution

I have a driver table with a flag that determines whether that record has been processed or not. I have a stored procedure that reads the table, picks a record up using a cursor, does some stuff (inserts into another table) and then updates the flag on the record to say it's been processed. I'd like to be able to execute the SP multiple times in parallel to speed up processing.
The obvious answer seemed to be to use 'for update skip locked' in the select for the cursor, but it seems this means I cannot commit within the loop (to update the processed flag and commit my inserts) without getting the "fetch out of sequence" error.
Googling tells me Oracle's AQ is the answer, but for the time being this option is not available to me.
Other suggestions? This must be a pretty common requirement, but I've been unable to find anything useful.
TIA!
A
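For concreteness, a sketch of the pattern the question describes, with made-up table and column names; the COMMIT inside the loop is what raises the "fetch out of sequence" error (ORA-01002), because committing releases the row locks the FOR UPDATE cursor depends on.
DECLARE
  CURSOR c_work IS
    SELECT id
    FROM   driver_table
    WHERE  processed = 'N'
    FOR UPDATE SKIP LOCKED;
BEGIN
  FOR rec IN c_work LOOP
    -- "does some stuff": insert into another table
    INSERT INTO processed_records (id) VALUES (rec.id);
    -- mark the driver row as processed
    UPDATE driver_table
    SET    processed = 'Y'
    WHERE  CURRENT OF c_work;
    -- COMMIT;  -- committing here invalidates the open FOR UPDATE cursor
  END LOOP;
  COMMIT;  -- committing once after the loop avoids the error, at the cost of batching
END;
/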

Inserting Result of Time-Consuming Stored Procedures to Tables

I have a website that runs a stored procedure when you open the home page. That stored procedure processes data from 4 relational tables and produces a result. Since the DB records have increased, the stored procedure can take more than 10 seconds to complete, which is too much for a home page.
So I think inserting the result of the stored procedure into a new table regularly, and using that table for the home page, could be a good way to solve the problem, but I am not sure if it is good practice for SQL Server.
Is there any better solution for my case?
Edit: Those 4 tables are updated every 15 minutes with about 30 inserts.
If you are willing to have a "designated victim" update the cache as needed (which may also cause other users to wait), you can do something like this in a stored procedure (SP); a sketch follows the steps below:
Start a transaction to block access to the cache.
Check the date/time of the cache entries. (This requires either adding a CacheUpdated column to the cache table or storing the value elsewhere.)
If the cached data is sufficiently recent then return the data and end the transaction.
Delete the cached data and run a new query to refill it with an appropriate CacheUpdated date/time.
Return the cached data and end the transaction.
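A minimal T-SQL sketch of those steps; the cache table, column names, the 10-minute freshness window, and the source query are assumptions rather than details from the original post.
CREATE PROCEDURE dbo.GetHomePageData
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    -- Take an update lock so only one caller rebuilds a stale cache at a time.
    IF NOT EXISTS (
        SELECT 1
        FROM dbo.HomePageCache WITH (UPDLOCK, HOLDLOCK)
        WHERE CacheUpdated > DATEADD(MINUTE, -10, GETUTCDATE())
    )
    BEGIN
        DELETE FROM dbo.HomePageCache;

        -- Stand-in for the slow 4-table query from the question.
        INSERT INTO dbo.HomePageCache (SomeValue, CacheUpdated)
        SELECT SomeValue, GETUTCDATE()
        FROM dbo.SomeExpensiveView;
    END;

    -- Everyone returns whatever is now in the cache.
    SELECT SomeValue FROM dbo.HomePageCache;

    COMMIT TRANSACTION;
END;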
If the update time becomes too long for users to wait, or the cache rebuild blocks too many users, you can run a stored procedure at a scheduled interval by creating a job in SQL Server Agent. The SP would:
Save the current date/time, e.g. as #Now.
Run the query to update the cache marking each row with CacheUpdated = #Now.
Delete any cache rows where CacheUpdated != #Now.
The corresponding SP for users would simply return the oldest set of data, i.e. Min( CacheUpdated ) rows. If there is only one set, that's what they get. If an update is in progress then they'll get the older complete set, not the work in progress.
As far as you have explained your issue, I see no problem in doing that, but you should explain more, since we don't know what type of data you are collecting and how it increases over time, so that a better solution can be suggested.

SQLite query delay

I have a very simple query:
SELECT count(id), min(id), max(id), sum(size), sum(frames), sum(catalog_size + file_size)
FROM table1
table1 holds around 3000 to 4000 records.
My problem is that it takes around 20 seconds for this query to run. And since it is called more than once, the delay is pretty obvious to the customer.
Is it normal for this query to take 20 seconds? Is there any way to improve the run time?
If I run this same query from SQLite Manager it takes milliseconds to execute. The delay only occurs if the query is called from our software. EXPLAIN and EXPLAIN QUERY PLAN didn't help much. We use SQLite version 3.7.3 and Windows XPe.
Any thoughts how to troubleshoot this issue or improve the performance of the query?
All the sums require that every single record in the table must be read.
If the table contains more columns than those shown above, then the data read from the disk contains both useful and useless values (especially if you have big blobs). In that case, you can try to reduce the data needed to be read for this query by creating a covering index over exactly those columns needed for this query.
In SQLite versions before 3.7.15, you need to add an ORDER BY for the first index field to force SQLite to use that index, but this doesn't work for all queries. (For your query, try updating to this beta, or wait for 3.7.15.)
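If the table has more columns than the query needs, a covering index along these lines (the index name is illustrative) lets SQLite answer the query from the index alone:
CREATE INDEX idx_table1_sums
ON table1(id, size, frames, catalog_size, file_size);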
