Running COUNT in ClickHouse immediately after INSERT returns 0

I'm running an INSERT query into a Distributed table backed by ReplicatedMergeTree, with 2 nodes (a single shard).
After the INSERT, I want to check the number of inserted records, so I run a COUNT query on the Distributed table.
At first, the COUNT returns 0. After several seconds (it can take more than a minute), the COUNT returns the correct number.
I've checked using SHOW PROCESSLIST that the INSERT query has finished running.
Is there a way to verify that everything is in order before executing the COUNT?

It seems you may need to use the FINAL keyword. The documentation mentions that one should try to avoid it, so you might be better off revisiting the table design and storage engine, but it could be a good interim solution.
https://clickhouse.com/docs/en/sql-reference/statements/select/from/
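A minimal sketch of what that would look like, assuming the underlying engine supports FINAL (the table name is illustrative):

-- FINAL forces ClickHouse to fully merge the data before returning it
SELECT count()
FROM my_distributed_table FINAL;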

Related

OpenEdge Database Row Version

I am attempting to implement a row version strategy for tables in our OpenEdge database.
The simple solution I have come up with is to add an integer iRowVersion field to each table and have the write trigger validate and increment the field as follows:
TRIGGER PROCEDURE FOR WRITE OF Customer OLD BUFFER oldCustomer.

IF Customer.iRowVersion < oldCustomer.iRowVersion THEN
    RETURN ERROR "RowVersion Out Of Date".

ASSIGN Customer.iRowVersion = Customer.iRowVersion + 1.
This will prevent any concurrent changes from being overwritten, however I am unsure whether incrementing by one per row is the best approach.
SQL Server's ROWVERSION is incremented across the entire database, and emulating that approach would use a sequence instead:
ASSIGN Customer.iRowVersion = NEXT-VALUE(rowVersionSequence).
In our large database where many records will be changing, this has the potential to increase the sequence very quickly. Having a sequence per table would curtail this but seems over the top and the +1 approach keeps it simple.
To clarify the question - would it be better to increment a row version number based on the rows last version, or should the SQL like approach be taken - making every row version unique to the database.
Additionally if going down the SQL style route, would the create trigger need to assign an initial row version? (otherwise all new unmodified records initialise at 0).
To version control records in the OpenEdge database I now have a solution that should work well, and is fairly simple.
Each table that needs to have a row version will have a RowVersion field, of type Integer.
We have a program that generates write triggers when we create new tables, so updating this to add some new code has been simple. The write trigger now checks the record to see if the table has a RowVersion field, and if so it then increments the version by 1.
Checking to make sure the row version matches before updating is the responsibility of the programmer in the code / script they are running.
There were several reasons for this method, but it keeps things simple:
Integers are simple and easy to read when running queries and debugging the database. Given how our application is used, it is unlikely we would ever overflow an integer either.
A sequence is not needed to keep rowversions unique. They don't need to be. Each record just increments its own row version.
Although ProDataSets can do optimistic locking, there is no guarantee that the records in use will always be read / written using these, and therefore a field gives us the flexibility to write different code depending on the use.
Row versions should usually be checked before updating, but if there are data issues, fix scripts might need to be run to overwrite data regardless. For this reason we leave the checking to a calling procedure (and not the trigger) for a write operation to a record.
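Expressed in generic SQL rather than ABL, the caller-side check described above amounts to a compare-and-swap on the version field; a sketch with illustrative names and values:

-- read the record, remembering the version the user saw
SELECT CustNum, Name, iRowVersion
FROM Customer
WHERE CustNum = 42;

-- later, write the change only if nobody bumped the version in the meantime;
-- the write trigger itself increments iRowVersion on a successful write
UPDATE Customer
SET Name = 'New name'
WHERE CustNum = 42
  AND iRowVersion = 7;  -- the version read earlier
-- zero rows updated means the record changed underneath us: re-read and retry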

Sqlite3: How to interrupt a long running update without roll back?

I have a long running multirow update such as:
UPDATE T set C1 = calculation(C2) where C1 is NULL
If the table is large, this update may take many seconds or even minutes. During this time, all other queries on this table fail with "database is locked" after the connection timeout expires (currently my timeout is 5 seconds).
I would like to stop this update query after, say, 3 seconds, then restart it. Hopefully, after several restarts the entire table will be updated. Another option is to stop this update query before making any other request (this would require inter-process cooperation, but it may be doable). But I cannot find a way to stop the update query without rolling back all previously updated records.
I tried calling interrupt and returning non-0 from progress_handler. Both of these approaches abort the update command and roll back all the changes. So, it appears that sqlite treats this update as a transaction, which does not make much sense in this case because all rows are independent. But I cannot start a new transaction for each row, can I?
If interrupt and progress_handler cannot help me, what else can I do? I also tried UPDATE with LIMIT and also WHERE custom_condition(C1). These approaches do allow me to terminate the update earlier, but they are significantly slower than a regular update, and they cannot terminate the query at a specific time (before another connection's timeout expires).
Any other ideas? This multirow update is such a common operation that, I hope, other people have a good solution for it.
So, it appears that sqlite treats this update as a transaction, which does not make much sense in this case because all rows are independent.
No, that actually makes perfect sense, because you're not executing multiple, independent updates. You're executing a single update statement. The fine manual says:
No changes can be made to the database except within a transaction. Any command that changes the database (basically, any SQL command other than SELECT) will automatically start a transaction if one is not already in effect. Automatically started transactions are committed when the last query finishes.
If you can determine the range of keys involved, you can execute multiple update statements. For example, if the key is an integer and you determine the range to be from 1 to 1,000,000, you can write code to execute this series of updates:
begin transaction;
UPDATE T set C1 = calculation(C2)
where C1 is NULL and your_key between 1 and 100000;
commit;
begin transaction;
UPDATE T set C1 = calculation(C2)
where C1 is NULL and your_key between 100001 and 200000;
commit;
Other possibilities:
You can sleep for a bit between transactions to give other queries a chance to execute.
You can also time execution using application code, and calculate a best guess at range values that will avoid timeouts and still give good performance.
You can select the keys for the rows that will be updated, and use their values to optimize the range of keys.
In my experience, it's unusual to treat updates this way, but it sounds like it fits your application.
But I cannot start a new transaction for each row, can I?
Well, you can, but it's probably not going to help. It's essentially the same as the method above, using a single key instead of a range. I wouldn't fire you for testing that, though.
On my desktop, I can insert 100k rows in 1.455 seconds, and update 100k rows with a simple calculation in 420 ms. If you're running on a phone, that's probably not relevant.
You mentioned poor performance with LIMIT. Do you have a lastupdated column with an index on it? At the top of your procedure you would get the COMMENCED_DATETIME and use it for every batch in the run:
update foo
set myvalue = 'x', lastupdated = UPDATE_COMMENCED
where id in
(
    select id from foo
    where lastupdated < UPDATE_COMMENCED
    limit SOME_REASONABLE_NUMBER
);
P.S. With respect to slowness:
I also tried UPDATE with LIMIT and also WHERE custom_condition(C1). These approaches do allow me to terminate update earlier, but they are significantly slower than regular update...
If you're willing to give other processes access to stale data, and your update is designed so as not to hog system resources, why is there a need to have the update complete within a certain amount of time? There seems to be no need to worry about performance in absolute terms. The concern should be relative to other processes -- make sure they're not blocked.
I also posted this question at
http://thread.gmane.org/gmane.comp.db.sqlite.general/81946
and got several interesting answers, such as:
divide the range of rowid into slices and update one slice at a time
use the AUTOINCREMENT feature to start a new update where the previous update ended (via LIMIT 10000)
create a trigger that calls select raise(fail, ...) to abort the update without a rollback
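The last item on that list exploits SQLite's FAIL conflict resolution, which aborts the current statement without backing out the changes it has already made and without ending the transaction. A rough sketch of the idea (table and trigger names are illustrative, and the per-row WHEN check adds some overhead):

-- one-row table holding a unix-time deadline
CREATE TABLE abort_after (deadline INTEGER);

CREATE TRIGGER stop_long_update
BEFORE UPDATE ON T
FOR EACH ROW
WHEN (SELECT CAST(strftime('%s','now') AS INTEGER) > deadline FROM abort_after)
BEGIN
    -- FAIL aborts the UPDATE but keeps the rows it has already changed
    SELECT RAISE(FAIL, 'time slice exhausted');
END;

-- before each run: give the UPDATE a 3-second slice, then let it fail
DELETE FROM abort_after;
INSERT INTO abort_after VALUES (strftime('%s','now') + 3);
UPDATE T SET C1 = calculation(C2) WHERE C1 IS NULL;
-- repeat until the UPDATE completes without hitting the deadline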

Oracle does a full table scan but returns results so quickly

When I open up TOAD and do a select * from table, the results (first 500 rows) come back almost instantly. But the explain plan shows a full table scan and the table is very large.
How come the results come back so quickly?
In general, Oracle does not need to materialize the entire result set before it starts returning the data (there are, of course, cases where Oracle has to materialize the result set in order to sort it before it can start returning data). Assuming that your query doesn't require the entire result set to be materialized, Oracle will start returning the data to the client process whether that client process is TOAD or SQL*Plus or a JDBC application you wrote. When the client requests more data, Oracle will continue executing the query and return the next page of results. This allows TOAD to return the first 500 rows relatively quickly even if it would ultimately take many hours for Oracle to execute the entire query and to return the last row to the client.
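A quick way to see the difference, assuming a hypothetical big_table with no index on the sort column:

-- rows stream back to the client as the scan produces them;
-- TOAD shows the first 500 after only a moment
select * from big_table;

-- the sort forces the whole result set to be materialized first,
-- so no rows appear until the full scan and sort complete
select * from big_table order by some_unindexed_col;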
Toad only returns the first 500 rows for performance, but if you were to run that query through an Oracle interface, JDBC for example, it would return the entire result set. My best guess is that the explain plan shows the plan for the full query, as if no subset of the records were being fetched; that's how I use it. I don't have a source for this other than my own experience with it.

SQLite query delay

I have a very simple query:
SELECT count(id), min(id), max(id), sum(size), sum(frames), sum(catalog_size + file_size)
FROM table1
table1 holds around 3000 to 4000 records.
My problem is that it takes around 20 seconds for this query to run. And since it is called more than once, the delay is pretty obvious to the customer.
Is it normal for this query to take 20 seconds? Is there any way to improve the run time?
If I run this same query from SQLite Manager it takes milliseconds to execute. The delay only occurs if the query is called from our software. EXPLAIN and EXPLAIN QUERY PLAN didn't help much. We use SQLite 3.7.3 version and Windows XPe.
Any thoughts how to troubleshoot this issue or improve the performance of the query?
All the sums require that every single record in the table must be read.
If the table contains more columns than those shown above, then the data read from the disk contains both useful and useless values (especially if you have big blobs). In that case, you can try to reduce the data needed to be read for this query by creating a covering index over exactly those columns needed for this query.
In SQLite versions before 3.7.15, you need to add an ORDER BY for the first index field to force SQLite to use that index, but this doesn't work for all queries. (For your query, try updating to this beta, or wait for 3.7.15.)
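A sketch of the covering index suggested above (the index name is illustrative; it lists exactly the columns the query reads, so SQLite can answer the query from the index alone without touching the table's rows):

CREATE INDEX table1_covering_idx
ON table1 (id, size, frames, catalog_size, file_size);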

Reindexing a large SQL Server database to Lucene

We have a web service method which accepts some data and puts it in a Lucene index. We use it to index new and updated entries from our ASP.NET web app.
These entries are stored in a large SQL Server table (20M rows and growing), and I need a way to reindex the whole table in case the current index gets deleted or corrupted. I'm not sure what the optimal way is to retrieve chunks of data from a large table. Currently, we use the fact that the table has an autoincrement PK, so we fetch chunks of 1000 rows until we stop getting results back. Kind of like (in pseudo language):
i = 0
while (true)
{
    SELECT col1, col2, col3 FROM mytable WHERE pk BETWEEN i + 1 AND i + 1000
    .... if result is empty 20 times in a row, break ....
    .... otherwise send result to web service to reindex ....
    i = i + 1000
}
This way, we don't need to SELECT COUNT(*), which would be a big performance killer, and we just move up the pk values until we stop getting any results. This has its con: if there is a hole greater than 20,000 values somewhere in the table, indexing will stop there, assuming the end was reached, but that's a tradeoff we have to live with for now.
Can anyone suggest a more efficient way of getting data from a table to index? I would assume we are not the first ones facing this problem - search engines are widely used nowadays :)
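For what it's worth, a keyset-pagination variant sidesteps the hole problem entirely, because each batch starts from the last key actually seen rather than probing fixed ranges; a T-SQL sketch (column and variable names are illustrative):

DECLARE @last_pk int = 0;
-- fetch one batch; after indexing it, set @last_pk to the largest pk returned
SELECT TOP (1000) pk, col1, col2, col3
FROM mytable
WHERE pk > @last_pk
ORDER BY pk;
-- an empty batch means the end of the table was truly reached, holes and all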
For what we do with Lucene, we rarely need to reindex everything. I can't remember coming across any case where the whole index was corrupted (Lucene is actually quite safe/good at this), but there have been many times when individual items needed to be reindexed for one reason or another. I'd say the most frequent reindexing patterns would be:
reindex items by given id (or set of ids)
reindex items by given period of time
The latter, of course, requires a separate db index on the relevant date field(s), which may be a bit costly for 20M+ records, but we decided to go for it (our biggest deployment had up to 10M records) as disk space is cheap these days anyway.
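With such an index in place, a period-based reindexing batch is a plain range query; a sketch, assuming a hypothetical modified_date column:

SELECT col1, col2, col3
FROM mytable
WHERE modified_date >= @period_start
  AND modified_date < @period_end;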
EDIT: added a few explanations as per the question author's comment.
If the source data structure changes, requiring reindexing of all records, our approach is to roll out new code which ensures all new data is correct (basically, it forms the correct Lucene Document from that moment on). Then we can reindex things in batches by providing the relevant period ranges. To a certain extent, this also applies to Lucene version changes.
Why is a COUNT(*) a performance killer? What about MAX(id)? I'm thinking that an index would provide the information needed for those queries. You do have an index on your primary key, right?
I actually just figured it out - I can use IDENT_CURRENT('table_name') to get the last generated id, and use that instead of MAX() or COUNT() - this method should blow the other two away :)
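For reference, that call looks like the line below. One caveat worth knowing: IDENT_CURRENT reports the last identity value generated for the table in any session, so it can be higher than the largest id actually present if inserts were rolled back:

SELECT IDENT_CURRENT('mytable') AS last_generated_id;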
