I have a very simple query:
SELECT count(id), min(id), max(id), sum(size), sum(frames), sum(catalog_size + file_size)
FROM table1
table1 holds around 3000 to 4000 records.
My problem is that it takes around 20 seconds for this query to run. And since it is called more than once, the delay is pretty obvious to the customer.
Is it normal for this query to take 20 seconds? Is there any way to improve the run time?
If I run this same query from SQLite Manager it takes milliseconds to execute. The delay only occurs when the query is called from our software. EXPLAIN and EXPLAIN QUERY PLAN didn't help much. We use SQLite version 3.7.3 on Windows XPe.
Any thoughts how to troubleshoot this issue or improve the performance of the query?
All the sums require that every single record in the table be read.
If the table contains more columns than those shown above, then the data read from the disk contains both useful and useless values (especially if you have big blobs). In that case, you can try to reduce the data needed to be read for this query by creating a covering index over exactly those columns needed for this query.
In SQLite versions before 3.7.15, you need to add an ORDER BY for the first index field to force SQLite to use that index, but this doesn't work for all queries. (For your query, try updating to this beta, or wait for 3.7.15.)
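A minimal sketch of such a covering index, assuming the query touches only the columns shown above (adjust the name and column list to the real schema):
-- Covering index: SQLite can answer the aggregates from the index pages
-- alone instead of reading the full table rows (and any large blobs).
CREATE INDEX table1_stats_idx
ON table1 (id, size, frames, catalog_size, file_size);
-- On versions before 3.7.15, adding ORDER BY id to the SELECT is the
-- nudge mentioned above; it may or may not help for this aggregate query.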
I'm running an INSERT query into a Distributed table of ReplicatedMergeTree with 2 nodes (single shard).
After the INSERT, I want to check the number of INSERTED records, so I run a COUNT query on the Distributed table.
At first, the COUNT returns 0. After several seconds (it can take more than a minute) the count returns the correct number.
I've checked using SHOW PROCESSLIST that the INSERT query has finished running.
Is there a way to verify that everything is in order before executing the COUNT?
It seems you may need to use the FINAL keyword. The documentation mentions that one should try to avoid it, so you might be better off revisiting the table design and storage engine, but it could be a good interim solution.
https://clickhouse.com/docs/en/sql-reference/statements/select/from/
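If you go that route, a hedged sketch (the table name here is made up, and FINAL is only accepted for the MergeTree variants listed in the linked documentation):
-- Ask ClickHouse to merge data parts at query time before counting.
SELECT count()
FROM my_distributed_table FINAL;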
I have a long-running multirow update such as:
UPDATE T set C1 = calculation(C2) where C1 is NULL
If the table is large, this update may take many seconds or even minutes.
During this time all other queries on this table fail with "database is locked"
once the connection timeout expires (currently my timeout is 5 seconds).
I would like to stop this update query after, say, 3 seconds,
then restart it. Hopefully, after several restarts the entire table will be updated.
Another option is to stop this update query before making any other request
(this will require inter-process cooperation, but it may be doable).
But I cannot find a way to stop the update query without rolling back
all previously updated records.
I tried calling interrupt and returning a non-zero value from the progress_handler.
Both these approaches abort the update command
and roll back all the changes.
So, it appears that SQLite treats this update as a transaction,
which does not make much sense in this case because all rows are independent.
But I cannot start a new transaction for each row, can I?
If interrupt and progress_handler cannot help me, what else can I do?
I also tried UPDATE with LIMIT and also WHERE custom_condition(C1).
These approaches do allow me to terminate the update earlier,
but they are significantly slower than a regular update
and they cannot terminate the query at a specific time
(before another connection's timeout expires).
Any other ideas?
This multirow update is such a common operation
that, I hope, other people have a good solution for it.
So, it appears that SQLite treats this update as a transaction, which does not make much sense in this case because all rows are independent.
No, that actually makes perfect sense, because you're not executing multiple, independent updates. You're executing a single update statement. The fine manual says
No changes can be made to the database except within a transaction.
Any command that changes the database (basically, any SQL command
other than SELECT) will automatically start a transaction if one is
not already in effect. Automatically started transactions are
committed when the last query finishes.
If you can determine the range of keys involved, you can execute multiple update statements. For example, if a key is an integer, and you determine the range to be from 1 to 1,000,000, you can write code to execute this series of updates.
begin transaction;
UPDATE T set C1 = calculation(C2)
where C1 is NULL and your_key between 1 and 100000;
commit;
begin transaction;
UPDATE T set C1 = calculation(C2)
where C1 is NULL and your_key between 100001 and 200000;
commit;
Other possibilities . . .
You can sleep for a bit between transactions to give other queries a chance to execute.
You can also time execution using application code, and calculate a best guess at range values that will avoid timeouts and still give good performance.
You can select the keys for the rows that will be updated, and use their values to optimize the range of keys.
In my experience, it's unusual to treat updates this way, but it sounds like it fits your application.
But I cannot start a new transaction for each row, can I?
Well, you can, but it's probably not going to help. It's essentially the same as the method above, using a single key instead of a range. I wouldn't fire you for testing that, though.
On my desktop, I can insert 100k rows in 1.455 seconds, and update 100k rows with a simple calculation in 420 ms. If you're running on a phone, that's probably not relevant.
You mentioned poor performance with LIMIT. Do you have a lastupdated column with an index on it? At the top of your procedure you would get the COMMENCED_DATETIME and use it for every batch in the run:
update foo
set myvalue = 'x', lastupdated = UPDATE_COMMENCED
where id in
(
select id from foo where lastupdated < UPDATE_COMMENCED
limit SOME_REASONABLE_NUMBER
)
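If that index does not exist yet, something along these lines would cover it (names match the example above):
-- Index on the lastupdated column so the inner select stays cheap.
CREATE INDEX idx_foo_lastupdated ON foo (lastupdated);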
P.S. With respect to slowness:
I also tried UPDATE with LIMIT and also WHERE custom_condition(C1). These approaches do allow me to terminate update earlier, but they are significantly slower than regular update...
If you're willing to give other processes access to stale data, and your update is designed so as not to hog system resources, why is there a need to have the update complete within a certain amount of time? There seems to be no need to worry about performance in absolute terms. The concern should be relative to other processes -- make sure they're not blocked.
I also posted this question at
http://thread.gmane.org/gmane.comp.db.sqlite.general/81946
and got several interesting answers, such as:
divide the range of rowid into slices and update one slice at a time
use the AUTOINCREMENT feature to start a new update at the place where the previous update ended (with LIMIT 10000)
create a trigger that calls select raise(fail, ...) to abort the update without a rollback (see the sketch below)
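A minimal sketch of that last idea, assuming a small helper table (update_clock) that the updating connection fills just before running the big UPDATE; both the helper table and the 3-second budget are made-up details:
-- Helper table holding the start time of the current update run.
CREATE TABLE update_clock (started INTEGER NOT NULL);
-- RAISE(FAIL) stops the UPDATE statement but, unlike ABORT/ROLLBACK,
-- keeps the rows the statement has already changed.
CREATE TRIGGER stop_long_update
BEFORE UPDATE OF C1 ON T
WHEN strftime('%s', 'now') - (SELECT started FROM update_clock) > 3
BEGIN
    SELECT RAISE(FAIL, 'time budget exceeded, restart the update later');
END;
-- Before each run:
DELETE FROM update_clock;
INSERT INTO update_clock VALUES (strftime('%s', 'now'));
UPDATE T SET C1 = calculation(C2) WHERE C1 IS NULL;
Repeating the run until the UPDATE finishes without the FAIL error gives the incremental behaviour asked about, at the cost of a per-row check in the trigger.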
Let's say I have a table in a database with 10k records. I don't actually need to use those 10k records anymore, but I still need to keep them in the database. That very table is now going to be used to store new data, so there will be more records coming in on top of the 10k already present. As opposed to the "old" 10k records, I do need to work with the newly inserted data. Right now I'm doing this to get the data I need:
List<Stuff> l = (from x in db.Table
                 where x.id > id
                 select x).ToList();
My question now is: how does the where clause in LINQ (or in SQL in general) work under the covers? Is the ENTIRE table going to be searched until (x.id > id) is true? Because, let's say, the table grows from 10k records to 20k. It'd be a little silly to look through all 20k records if I know that I only have to start looking from a certain point.
I've had performance problems (not dramatic, but bad enough to be agitated by them) with this while using LINQ to Entities, which I kind of don't understand because it should be no problem at all for a modern computer to sift through a mere 20k records. I've been advised to use a stored procedure instead of a LINQ query, but I don't know whether or not this will boost performance.
Any feedback will be appreciated.
It's going to behave just like a similarly worded SQL query would. The question is whether the overhead you're experiencing is happening in the query or in the conversion of the query to a list. The query itself as you've written should equate literally to:
Select ID, Column1, Column2, Column3, ... , Column(n+1)
From db.Table
Where ID > id
This query should be fairly fast, depending on the nature of the data. The query itself will not be executed until it is acted upon, however. In this case, you're converting it to a list, which is the equivalent of acting upon it. I can't find the comment someone made to me about this practice, but I've found it to be quite helpful in keeping performance clean. Unless you have some very specific need, you should leave your queries as IQueryable. Converting them to lists doubles the effort, because first the query must be executed and then the result set must be converted into an appropriate IEnumerable (a List in this case).
So you have two potential bottlenecks. The simple query could be taking a long time to query a massive collection of data, or the number of records could be bottlenecking at the point where the List is created. Another possibility is the nature of ID in this case. If it is numeric, that will save you some time. If it's performing a text-based search then it's going to be heavier.
To answer your specific question: yes, it's going to search every record in the database and return all of the records that match the expression. Edit: if the database has a proper index on the column in question, it will not search EVERY record but rather will use the index to perform the search (from a comment by #Pleun).
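For completeness, the kind of index that comment refers to; the table and index names below are placeholders, and if id is already the primary key it will normally be indexed already:
-- Index on the column used in the WHERE clause so the filter becomes an
-- index seek instead of a full table scan.
CREATE INDEX IX_Table1_Id ON Table1 (id);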
As for using a stored procedure: the idea that it will magically boost performance is largely hogwash, but it's a perfectly acceptable alternative. I have several programs that routinely run similar queries against a database with over 40 million records, and the only performance issue I've run into so far has been CPU usage when multiple users are running rapid-fire queries. To solve your specific issue, I'd recommend that you tune the query a little in SQL Server Management Studio until it returns to your interface with acceptable speed. Then you can convert that query into an equivalent LINQ statement. As long as you leave it as an IQueryable, it should exhibit similar results.
I'm writing an application which produces a lot of data to store in a database.
The DB schema is very simple: it's a table with just 4 columns, but I must fill it with more than 30000 rows.
I'm using SQLite and QSql as API.
Data is produced very fast (no sleeps) and I'm using QSqlQuery to insert one row at a time.
However it seems that it takes 7-8 seconds to store 100 rows (I'm using QTime for time counting).
I tried using QSqlTableModel but I noticed no performance improvements, even calling QSqlTableModel::submitAll every 1000 rows (QTime shows 70-80 seconds for 1000 rows).
Is there any way to store rows faster? What is the fastest way to fill a table with SQLite?
You could try looking at whether you've got transactions set up correctly; commits are expensive because they have to sync to disk, so inserting each row in its own implicit transaction will be slow.
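A rough sketch of the idea in plain SQL (the table and column names are invented); in Qt the equivalent is wrapping the QSqlQuery loop in QSqlDatabase::transaction() and QSqlDatabase::commit():
-- One explicit transaction around the whole batch: one disk sync instead
-- of one per row.
BEGIN TRANSACTION;
INSERT INTO samples (c1, c2, c3, c4) VALUES (1, 0.5, 'a', 'b');
INSERT INTO samples (c1, c2, c3, c4) VALUES (2, 0.7, 'c', 'd');
-- ... the remaining ~30000 inserts from the application loop ...
COMMIT;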
Also bear in mind that SQLite is more heavily optimized for reading anyway.
You might try dropping any indexes at the start and then adding them back after all records have been imported. Results will vary of course if you're emptying the table first or just appending new records.
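If the table does carry indexes, the drop-and-recreate approach could look roughly like this (the index name and column are placeholders):
DROP INDEX IF EXISTS idx_samples_c1;
-- ... bulk insert all rows here ...
CREATE INDEX idx_samples_c1 ON samples (c1);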
We have a web service method which accepts some data and puts it in a Lucene index. We use it to index new and updated entries from our ASP.NET web app.
These entries are stored in a large SQL Server table (20M rows and growing), and I need a way to reindex the whole table in case the current index gets deleted or corrupted. I'm not sure what the optimal way to retrieve chunks of data from a large table is. Currently, we use the fact that the table has an autoincrement PK, so we get chunks of 1000 rows until it starts to return nothing. Kind of like (in pseudo language):
i = 0
while (true)
{
    result = SELECT col1, col2, col3 FROM mytable WHERE pk >= i AND pk < i + 1000
    .... if result is empty 20 times in a row, break ....
    .... otherwise send result to web service to reindex ....
    i = i + 1000
}
This way, we don't need to SELECT COUNT(*), which would be a big performance killer, and we just move up the pk values until we stop getting any results. This has its downside: if we have a hole greater than 20,000 values somewhere in the table, it will stop indexing, assuming it has reached the end, but that's a tradeoff we have to live with for now.
Can anyone suggest a more efficient way of getting data from a table to index? I would assume we are not the first ones facing this problem - search engines are widely used nowadays :)
For what we do with Lucene, we rarely need to reindex everything. I can't remember coming across any case where the whole index was corrupted (Lucene is actually quite safe/good at this), but there have been many times when individual items needed to be reindexed for one reason or another. I'd say the most frequent reindexing patterns would be:
reindex items by given id (or set of ids)
reindex items by given period of time
The latter, of course, requires a separate db index on the relevant date field(s), which can be a bit costly for 20M+ records, but we decided to go for it (our biggest deployment had up to 10M records) as disk space is cheap these days anyway.
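As an illustration of the second pattern, the batch query might look something like this; last_modified and the parameter names are assumptions, not columns from the question:
-- Pull one time slice of rows to re-feed into the indexing web service.
SELECT col1, col2, col3
FROM mytable
WHERE last_modified >= @period_start
  AND last_modified <  @period_end;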
EDIT: added a few explanations as per the question author's comment.
If the source data structure changes and all records need to be reindexed, our approach is to roll out new code which ensures all new data is correct (basically, it forms the correct Lucene Document from that moment on). Then we can reindex things in batches by providing the relevant period ranges. To a certain extent, this also applies to Lucene version changes.
Why is a COUNT(*) a performance killer? What about MAX(id)? I'm thinking that an index would provide the information needed for those queries. You do have an index on your primary key, right?
I actually just figured it out: I can use IDENT_CURRENT(table_name) to get the last generated id and use that instead of MAX() or COUNT() - this method should blow the other two away :)
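For reference, a quick sketch of the two options mentioned above (T-SQL; the table name matches the pseudocode):
-- MAX on an indexed primary key is answered from the index, not a scan.
SELECT MAX(pk) FROM mytable;
-- SQL Server-specific: the last identity value generated for the table,
-- regardless of session or scope.
SELECT IDENT_CURRENT('mytable');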