Can MariaDB return incomplete data? - mariadb

I am using MySQL Connector to connect to a MariaDB server.
A function in my program periodically retrieves all entries in a table (with a select * from ... without any wheres, limits, etc.).
After it gets the data, it checks whether these rows (identified by an auto-incremented id) are already present in its memory, and if not, it adds them. But if a row is present in the in-memory list and missing from the retrieved list, that row must be deleted from memory.
Deleting that row from memory is not the only thing that happens. It also deletes a number of other tables/files linked to that row. So if the connector somehow fails, does not retrieve the full list, and does not report this, I'll get into trouble.
It might be a slightly stupid question, but I couldn't work out whether I need any additional safety measures.
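One defensive measure for the sync step described above can be sketched as follows. This is a minimal sketch that does not settle whether the connector can silently return a partial result; it only refuses to delete anything when the fetched list looks incomplete. The function name plan_sync and the expected_count argument (for example, the result of a separate SELECT COUNT(*) run in the same transaction) are assumptions for illustration.

```python
def plan_sync(memory_ids, fetched_ids, expected_count):
    """Return (ids_to_add, ids_to_delete), or raise if the fetch looks incomplete.

    expected_count is a hypothetical independent row count, e.g. from a
    separate SELECT COUNT(*) issued in the same transaction as the SELECT *.
    """
    fetched = set(fetched_ids)
    if len(fetched) != expected_count:
        # Do not delete anything based on a list we cannot trust.
        raise RuntimeError(
            f"fetched {len(fetched)} ids but expected {expected_count}; skipping sync"
        )
    known = set(memory_ids)
    return fetched - known, known - fetched


to_add, to_delete = plan_sync({1, 2, 3}, [2, 3, 4], expected_count=3)
```

Destructive cleanup of linked tables and files would then run only for the ids in to_delete, and only when no exception was raised.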

Related

Delete all keys from rocksdb (drop all)

I have a rocksdb instance with multithreaded read/write access. At some point an arbitrary thread needs to process a request to clear the whole database, basically delete all keys. How can I do it with the smallest disturbance to the other threads? Obviously, as everything is parallel, there is no need for a definite moment at which the database gets cleared and the new writes go to an empty one, and it is okay if some parallel reads are still getting the old data for some time.
I see DeleteRange, but my keys are irregular, there is no such thing as an upper bound
I see DeleteFile, but the comment says it will be gone in rocksdb 7.0. Also, this looks like a bad idea in a multithreaded environment
Interestingly, I could not find a recipe for such a seemingly common use case
I see DeleteRange, but my keys are irregular, there is no such thing as an upper bound
Do they have a common prefix? Generally you would prefix the keys with the 'table name' like 'users' or 'messages' and then you can drop the entire messages range which would be like dropping the entire table
If you don't, then I would suggest rereading the docs to make sure you are using rocksdb correctly, but the only other option is to loop over the keys and delete each entry
An alternative is to grab a lock, swap in a new clean DB, and delete the old data folder entirely
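For the common-prefix suggestion above, the missing piece is usually the end key of the range. Here is a minimal sketch of how an exclusive upper bound can be derived from a prefix, assuming keys are compared as raw bytes. The helper name prefix_upper_bound is hypothetical, and the actual range delete call is left out.

```python
from typing import Optional


def prefix_upper_bound(prefix: bytes) -> Optional[bytes]:
    """Smallest key that sorts after every key starting with prefix.

    Returns None when no such key exists (empty prefix or all 0xFF bytes),
    in which case the range extends to the end of the key space.
    """
    buf = bytearray(prefix)
    while buf:
        if buf[-1] != 0xFF:
            buf[-1] += 1      # bump the last byte that can still grow
            return bytes(buf)
        buf.pop()             # 0xFF cannot grow, so drop it and carry
    return None


begin = b"messages"
end = prefix_upper_bound(begin)  # every key with the prefix lies in [begin, end)
```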

OpenEdge Database Row Version

I am attempting to implement a row version strategy for tables in our OpenEdge database.
The simple solution I have come up with is to add an integer iRowVersion field to each table and have the write trigger validate and increment the field as follows:
TRIGGER PROCEDURE FOR WRITE OF Customer OLD BUFFER oldCustomer.

IF Customer.iRowVersion < oldCustomer.iRowVersion THEN
    RETURN ERROR "RowVersion Out Of Date".

ASSIGN Customer.iRowVersion = Customer.iRowVersion + 1.
This will prevent any concurrent changes from being overwritten; however, I am unsure whether incrementing by one per row is the best approach.
SQL ROWVERSION is incremented across the entire database, and emulating that approach would use a sequence instead:
ASSIGN Customer.iRowVersion = NEXT-VALUE(rowVersionSequence).
In our large database where many records will be changing, this has the potential to increase the sequence very quickly. Having a sequence per table would curtail this but seems over the top and the +1 approach keeps it simple.
To clarify the question: would it be better to increment a row version number based on the row's last version, or should the SQL-like approach be taken, making every row version unique to the database?
Additionally, if going down the SQL-style route, would the create trigger need to assign an initial row version? (Otherwise all new unmodified records initialise at 0.)
To version control records in the OpenEdge database I now have a solution that should work well, and is fairly simple.
Each table that needs to have a row version will have a RowVersion field, of type Integer.
We have a program that generates write triggers when we create new tables, so updating this to add some new code has been simple. The write trigger now checks the record to see if the table has a RowVersion field, and if so it then increments the version by 1.
Checking to make sure the row version matches before updating is the responsibility of the programmer in the code / script they are running.
There were several reasons for this method, but it keeps things simple:
Integers are simple and easy to read when running queries and debugging the database. Given our application uses, it is unlikely we would ever overflow an integer either.
A sequence is not needed to keep rowversions unique. They don't need to be. Each record just increments its own row version.
Although ProDataSets can do optimistic locking, there is no guarantee that the records in use will always be read / written using these, and therefore a field gives us the flexibility to write different code depending on the use.
Usually row versions should be checked before updating, but if there are data issues, fix scripts might need to be run to overwrite data regardless. For this reason we leave the check to the calling procedure (and not the trigger) for a write operation on a record.
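The split of responsibilities described above (the caller checks the version, the write trigger increments it) can be sketched in a language-neutral way. This is a minimal Python illustration using an in-memory dict in place of a database table. The names update_row and StaleRowError are hypothetical, and the increment inside update_row stands in for what the write trigger does.

```python
class StaleRowError(Exception):
    """Raised when the caller's copy of the row is out of date."""


def update_row(table, row_id, expected_version, changes):
    row = table[row_id]
    # Caller-side check: the version read earlier must still be current.
    if row["RowVersion"] != expected_version:
        raise StaleRowError(
            f"row {row_id}: expected version {expected_version}, "
            f"found {row['RowVersion']}"
        )
    row.update(changes)
    # Stand-in for the write trigger: each row increments its own version.
    row["RowVersion"] += 1
    return row["RowVersion"]


customers = {1: {"Name": "A", "RowVersion": 0}}
new_version = update_row(customers, 1, expected_version=0, changes={"Name": "B"})
```

A fix script that must overwrite data regardless would simply skip the check and write directly.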

riak - unable to delete keys in a bucket

I am using riak version 1.4.10 and it is in a ring with two hosts. I am unable to get rid of keys left over from previous operations using simple delete operations on keys. When I list the keys for a bucket, it shows me the old keys, however if I try to retrieve the data associated with a key, no data is found. When I try to delete the key, it still persists. What could be the cause of this? Is there a way to wipe the keys in the bucket so it starts from a clean slate? I don't care about any of the data in riak, but I would rather not have to reinstall everything again.
You are probably seeing the tombstones of the old data. Since Riak is an eventually consistent data store, it needs to keep track of deletes as if they were ordinary writes, at least for a little while.
If data is present on one node, but not another, how do you tell whether it is a PUT that hasn't propagated yet, or a DELETE?
Riak solves this by using a tombstone. Whenever you delete something, instead of just wiping the data immediately, Riak replaces the existing value with a special value that it knows means deleted. This special value contains a vclock that is descended from the previous value, and metadata indicating deleted. So when it comes time to decide the above question, Riak simply compares the vclock of the value with that of the tombstone. Whichever descends from the other must be the correct one.
To solve the problem of an ever growing data size that contains mostly tombstones, tombstones are reaped after a time. The time is set using the delete_mode setting. After the DELETE is processed, and the tombstone has been written to the primary vnodes, the delete process issues a GET request for the key. Whenever the GET process encounters a tombstone, and all of the primary vnodes responded with the same tombstone, it schedules the tombstone to be reaped according to the delete_mode setting.
So if you want to actually get rid of the tombstones, check your delete_mode setting to make sure it is not set to 'keep', and issue a get for each one to make sure it is really gone.
Or if you are just wiping the data store to restart your tests, stop Riak, delete all the files under the data_root for the backend you are using, and restart.
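The vclock comparison described above can be illustrated with a simplified model. This is a sketch of the general vector clock idea, not Riak's actual implementation. Clocks are represented here as plain dicts mapping an actor name to a counter, and the function names are assumptions.

```python
def descends(a, b):
    """True if vclock a has seen every event recorded in vclock b."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def resolve(value_clock, tombstone_clock):
    """Decide whether the key is deleted, present, or in conflict."""
    if descends(tombstone_clock, value_clock):
        return "deleted"    # the delete happened after the write
    if descends(value_clock, tombstone_clock):
        return "present"    # the write happened after the delete
    return "conflict"       # neither descends from the other


put_clock = {"node_a": 1}
delete_clock = {"node_a": 1, "node_b": 1}   # tombstone written after the PUT
outcome = resolve(put_clock, delete_clock)
```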

Attaching two memory databases

I am collecting data every second and storing it in a ":memory:" database. Inserting data into this database happens inside a transaction.
Every time a request is sent to the server, the server reads data from the first in-memory database, does some calculation, stores the result in the second database, and sends it back to the client. For this, I am creating another ":memory:" database to store the aggregated information from the first one. I cannot use the same database because the aggregation is a large calculation, and it cannot run inside the insert transaction (if one calculation takes 5 seconds, I lose the other 4 seconds of data). I also cannot create a table in the same database, because I would not be able to write the aggregate data while the original data is being collected and inserted (that happens inside a transaction, once every second).
-- Sometimes I want to retrieve data from both databases. How can I link these two in-memory databases? With an ATTACH DATABASE statement I can attach the second database to the first one, but when the next request comes in, how do I check whether the second database exists?
-- Suppose I attach the second in-memory database to the first one. Will it lock the second database when we write data to the first one?
-- Is there any other way to store this aggregated data??
As far as I got your idea, I don't think that you need two databases at all. I suppose you are misinterpreting the idea of transactions in sql.
If you are beginning a transaction other processes will be still allowed to read data. If you are reading data, you probably don't need a database lock.
A possible workflow could look as the following.
Insert some data to the database (use a transaction just for the insertion process)
Perform heavy calculations on the database (but do not use a transaction, otherwise it will prevent other processes from inserting any data into your database). Even if this step includes really heavy computation, you can still insert and read data using another process, as SELECT statements will not lock your database.
Write results to the database (again, by using a transaction)
Just make sure that heavy calculations are not performed within a transaction.
If you want a more detailed description of this solution, look at the documentation about the file locking behaviour of sqlite3: http://www.sqlite.org/lockingv3.html
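The workflow above can be sketched with Python's sqlite3 module. This is a minimal single-connection illustration, assuming hypothetical tables named samples and aggregates. The point it shows is that only the two write steps are wrapped in short transactions, while the calculation runs outside any of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (ts INTEGER, value REAL)")
conn.execute("CREATE TABLE aggregates (n INTEGER, mean REAL)")


def insert_samples(rows):
    # Short transaction: committed on success, rolled back on exception.
    with conn:
        conn.executemany("INSERT INTO samples VALUES (?, ?)", rows)


def aggregate():
    # Read the data, then compute outside of any write transaction.
    values = [row[0] for row in conn.execute("SELECT value FROM samples")]
    n = len(values)
    mean = sum(values) / n if n else 0.0
    # Second short transaction, used only to store the result.
    with conn:
        conn.execute("INSERT INTO aggregates VALUES (?, ?)", (n, mean))
    return n, mean


insert_samples([(1, 2.0), (2, 4.0)])
result = aggregate()
```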

Reindexing a large SQL Server database to Lucene

We have a web service method which accepts some data and puts it in Lucene index. We use it to index new and updated entries from our asp.net web app.
These entries are stored in a large SQL Server table (20M rows and growing), and I need a way to reindex the whole table in case the current index gets deleted or corrupted. I'm not sure what the optimal way is to retrieve chunks of data from a large table. Currently, we rely on the fact that the table has an autoincrement PK, so we fetch chunks of 1000 rows until it starts to return nothing. Something like this (in pseudo language):
i = 0
while (true)
{
    // half-open range (i, i + 1000] so consecutive chunks do not overlap
    SELECT col1, col2, col3 FROM mytable WHERE pk > i AND pk <= i + 1000
    .... if result is empty 20 times in a row, break ....
    .... otherwise send result to web service to reindex ....
    i = i + 1000
}
This way, we don't need SELECT COUNT(*), which would be a big performance killer; we just move up the pk values until we stop getting any results. This has its downside: if there is a hole of more than 20,000 values somewhere in the table, indexing will stop on the assumption that it has reached the end, but that's a tradeoff we have to live with for now.
Can anyone suggest a more efficient way of getting data from a table to index? I would assume we are not the first ones facing this problem - search engines are widely used nowadays :)
For what we do with Lucene, we rarely need to reindex everything. I can't remember coming across any case where the whole index was corrupted (Lucene is actually quite safe/good at this), but there have been many times when individual items needed to be reindexed for one reason or another. I'd say the most frequent reindexing patterns would be:
reindex items by given id (or set of ids)
reindex items by given period of time
The latter, of course, requires separate db index on the relevant date field(s) which should be a bit costly for 20M+ records but we decided to go for it (our biggest deployment had up to 10M records) as disk space is cheap these days anyway.
EDIT: added few explanations as per question author's comment.
If the source data structure changes, requiring reindexing of all records, our approach is to roll out new code which ensures all new data is correct (basically, it forms a correct Lucene Document from this moment on). After that we can reindex things in batches (manually), by providing the relevant period ranges. To a certain extent, this also applies to Lucene version changes.
Why is COUNT(*) a performance killer? What about MAX(id)? I'm thinking that an index would provide the information needed for those queries. You do have an index on your primary key, right?
I actually just figured it out - I can use IDENT_CURRENT(table_name) to get the last generated id, and use that instead of MAX() or Count() - this method should blow the other two away :)
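With a known upper bound on the id, the chunk loop no longer needs the "empty 20 times in a row" heuristic. Here is a minimal sketch of that idea, using Python's sqlite3 as a stand-in for SQL Server and MAX(pk) as a stand-in for IDENT_CURRENT. The function name iter_chunks and the sample data are assumptions.

```python
import sqlite3


def iter_chunks(conn, max_pk, chunk_size=1000):
    """Yield non-empty chunks of rows, walking pk ranges up to max_pk."""
    lower = 0
    while lower < max_pk:
        upper = lower + chunk_size
        rows = conn.execute(
            "SELECT pk, col1 FROM mytable WHERE pk > ? AND pk <= ? ORDER BY pk",
            (lower, upper),
        ).fetchall()
        if rows:              # holes of any size are simply skipped
            yield rows
        lower = upper


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (pk INTEGER PRIMARY KEY, col1 TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)", [(1, "a"), (2, "b"), (50000, "c")]
)
max_pk = conn.execute("SELECT MAX(pk) FROM mytable").fetchone()[0]
chunks = list(iter_chunks(conn, max_pk))
```

Each yielded chunk would be sent to the web service for reindexing. Rows inserted after max_pk was read are not covered by this pass.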
