Delete all keys from rocksdb (drop all)

I have a rocksdb instance with multithreaded read/write access. At some point an arbitrary thread needs to process a request to clear the whole database, basically delete all keys. How can I do it with the smallest disturbance to the other threads? Obviously, as everything is parallel, there is no need for a definite moment at which the database gets cleared and the new writes go to an empty one, and it is okay if some parallel reads are still getting the old data for some time.
I see DeleteRange, but my keys are irregular; there is no such thing as an upper bound.
I see DeleteFile, but the comment says it will be gone in rocksdb 7.0. Also, this looks like a bad idea in a multithreaded environment.
Interestingly, I could not find a recipe for such a seemingly common use case.

I see DeleteRange, but my keys are irregular, there is no such thing as an upper bound
Do they have a common prefix? Generally you would prefix the keys with the 'table name', like 'users' or 'messages', and then you can drop the entire 'messages' range with DeleteRange, which is like dropping the entire table.
If you don't, then I would suggest rereading the docs to make sure you are using rocksdb correctly; otherwise the only other option is to loop over the keys and delete each entry (see the sketch below).
An alternative is to grab a lock, swap out to a new, clean DB, and delete the old data folder entirely.
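A minimal sketch of that loop-and-delete approach, assuming the third-party python-rocksdb bindings and a placeholder database path (none of this comes from the question itself):

# Sketch only: wipe every key by iterating and batching the deletes.
# Assumes the python-rocksdb bindings; the path is a placeholder.
import rocksdb

db = rocksdb.DB("./mydb", rocksdb.Options(create_if_missing=True))

batch = rocksdb.WriteBatch()
it = db.iterkeys()
it.seek_to_first()
for key in it:
    batch.delete(key)

# Readers keep seeing the old data until the batch is applied; writes that
# arrive after the iterator was created are not removed by this pass.
db.write(batch, sync=True)

For a very large database that single batch gets big, so in practice you might flush it every few thousand deletes, or go with the swap-to-a-new-DB suggestion instead.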

Related

How to delete a lot of data in "Realtime Database"

I want to delete all the data in my "Realtime Database" without increasing the "Usage Load" shown in the console.
Any idea how to delete that data? There are 420,000+ records in the Realtime Database.
[Screenshot: Usage Load graph]
If you can help me, that would be very useful.
There is support for deleting large nodes built into the Firebase CLI these days as explained in this blog How to Perform Large Deletes in the Realtime Database:
If you want to delete a large node, the new recommended approach is to use the Firebase CLI (> v6.4.0). The CLI automatically detects a large node and performs a chunked delete efficiently.
$ firebase database:remove /path/to/delete
My initial write-up is below. 👇 I'm pretty sure the CLI mentioned above implements precisely this approach, so that's likely a faster way to accomplish this, but I'm still leaving this explanation as it may be useful as background.
Deleting data is a write operation, so it's by definition going to put load on the database. Deleting a lot of data causes a lot of load, either as a spike in a short period or (if you spread it out) as a lifted load for a longer period. Spreading the load out is the best way to minimize impact for your regular users.
The best way to delete a long, flat list of keys (as you seem to have) is to:
Get a list of those keys, either from a backup of your database (which happens out of band), or by using the shallow parameter on the REST API.
Delete the data in reasonable batches, where reasonable depends on the amount of data you store per key. If each key is just a few properties, you could start deleting 100 keys per batch, and check how that impacts the load to determine if you can ramp up to more keys per batch.
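As a hedged illustration of those two steps (this is not the CLI's actual implementation), here is a sketch against the REST API with Python's requests; the database URL, node path, and batch size are placeholders, and a real run would also need an auth token or service-account credential:

# Sketch: delete a large, flat list of keys in batches over the REST API.
import requests

DB_URL = "https://your-project.firebaseio.com"   # placeholder
NODE = "items"                                   # placeholder node to clear
BATCH_SIZE = 100

# 1. Fetch just the key names using the shallow parameter.
keys = requests.get(f"{DB_URL}/{NODE}.json", params={"shallow": "true"}).json() or {}
keys = list(keys)

# 2. Delete in batches: writing null removes a location, and PATCH performs
#    a multi-location update, so each request clears one batch of keys.
for i in range(0, len(keys), BATCH_SIZE):
    batch = {key: None for key in keys[i:i + BATCH_SIZE]}
    requests.patch(f"{DB_URL}/{NODE}.json", json=batch).raise_for_status()

Pausing between batches, or watching the Usage graph while ramping BATCH_SIZE up, is how you spread the load out as described above.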

Can you get a log file of 'reads' on specific RECID(Tablename) in Progress-4GL/Openedge at RunTime without access to Source Code?

I want to know which tables are being read by a query.
for each Customer where CustomerID = 12345.
Eventually this customer will be found in the example above, but Progress must 'read' many tables before getting to customer 12345.
How do I know exactly which tables are read (By CustomerID), prior to getting to customer 12345?
*NOTE: I do not have access to modify the code being run for this selection. Ideally I would run a separate set of code that is executed at the same time as the customer query above to track the reads.
EDIT: More clearly - Can you track reads from a given program (.p) OR ProcessID and output either a RECID or the PrimaryKey to a file?
I understand the information is being read off the Disk and probably stored in a database buffer. So how would I get at the information in the database buffer?
You seem to be mixing up a few different things.
In a situation like your example, where you FIND a specific record in one, and only one, table, there is just a single record read. Progress will find that record by first scanning a relevant index. That might be 2 or 3 "logical reads" of the b-tree to get to the proper node. The record block and index blocks may or may not be read from disk - that depends on what has happened previously.
There are "Virtual System Tables" available that can tell you how many READ operations take place against a particular table or index. But they do not trace the specific ROWID or other identifying data. _TableStat and _IndexStat are aggregates for all users on the system, _UserTableStat and _UserIndexStat are specific to a particular user's activity. You do need to set the -tablerangesize and -indexrangesize parameters adequately to take advantage of these.
If you have enabled the table and index statistics then you can use a tool like ProTop - http://protop.wss.com to get insight into this activity. Or you can write your own code.
OpenEdge Auditing does not track reads. That would be prohibitively expensive.
It's probably not really a good idea but, in theory, you could write FIND triggers for the tables you are interested in. That doesn't require access to the application source but you would need a development license. It will probably kill performance to do this though - so unless this is a non-production test environment that you just want to fiddle with, I wouldn't really do that.
You mention wanting to know how you got to that point. That sounds more like you might need to have a "4gl trace". One easy way to get the stack trace of a running process is to execute:
$DLC/bin/proGetStack PID (UNIX)
or
%DLC%\bin\proGetStack PID (Windows)
This command will generate a "protrace.pid" file containing a 4gl stack trace and other interesting information.
There are also more complicated ways to get that info like using PROMON and the "client statement cache" or setting various log entry types at session startup. But proGetStack is pretty convenient and requires no code or scripting changes.
Some great options from Tom above, and all of them may be relevant to you. The one option he only skirts around is logging. I feel obliged to expand on this because I'm giving a talk on it in a couple of weeks!
Assuming you are running a modern version of Progress, or even 10.2B08, then you have client logging available to you. Start your session with these additional options:
-clientlog "\somefolder\somefile.txt"
-logentrytypes "QryInfo:3"
This will log all the info of all the queries in your session to the file you specified above. If you navigate to the point in the system where you want to analyse your query and empty the logfile and save it, you can then run the offending query and see all the detail you need.
The output tells you all sorts of useful info, including the number of reads on each table, compared with the number returned to the user. You also get the index selected.
Using Tom's advice and/or this will get you what you need.

Teradata DELETE ALL vs DROP+CREATE

I've been recently assigned on a project using Teradata.
I've been told to strictly use DROP+CREATE instead of DELETE ALL, because the latter "leaves some space allocated someway". This is counter-intuitive to me, and I think it's probably wrong. I searched the web for a comparison between the two methods, but I found nothing.
This only reinforces my belief that DELETE ALL doesn't suffer from the issue above.
However, if this is the case, I must prove it (both practically and theoretically).
So, my question is: is there a difference in space allocation between the two methods? If not, is there an official document (user guide, technical specification, whatever else) that proves it?
Thank you!
There's a discussion here: http://teradataforum.com/teradata/20120403_105705.htm about the very same subject (although it does not really answer the "leaves some space allocated someway" part). They actually recommend DELETE ALL but for other (performance) reasons:
I'll quote just in case the link goes dead:
"Delete all" will be quicker, although being practical there often isn't a lot of difference in the performance of them.
However, especially for a process that is run regularly (say a daily batch process) then I recommend the "delete all" approach. This will do less work as it only removes the data and leaves the definition in place. Remember that if you remove the definition then this requires accessing multiple dictionary tables, and of course you then have to access those same tables (typically) when you re-create the object.
Apart from the performance aspect, the downside of the drop/create approach is that every time you create an object Teradata inserts "default rows" into the AccessRights table, even if subsequent access to the object is controlled via Role security and/or database level security. As you may well know the AccessRights table can easily get large and very skewed. In my experience many sites have a process which cleans this table on a regular basis, removing redundant rows. If your (typically batch) processes regularly drop/create objects, then you're simply adding rows into the table which have previously been removed by a clean process, and which will be removed in the future by the same process. This all sounds like a complete waste of time to me.
Your impression is correct: you didn't find any reference to "DELETE leaves some space allocated" anywhere, because it's simply wrong :-)
DELETE ALL is similar to a TRUNCATE in other DBMSes and in most cases uses fastpath processing.
First of all, you cannot do DROP/CREATE in one transaction in Teradata (in Oracle there are other problems with everyday DDL), so when ETL processes become complicated you might end up with a dependency where more important business processes depend on less important ones (for example, you might see the customers table empty just because the interest rates were not refreshed, or because one minor column had a varchar value that was too long).
My opinion: Use transactions and modular programming. In Teradata this means avoiding DDL where possible and using DELETE/UPDATE/MERGE/INSERT instead of DROP/CREATE.
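To make the comparison concrete, here is a hedged sketch using the official teradatasql Python driver; the connection details and the table are placeholders, not anything from the original question:

# Sketch: the two approaches side by side.
import teradatasql

with teradatasql.connect(host="tdhost", user="dbc", password="dbc") as con:
    with con.cursor() as cur:
        # Recommended for recurring batch jobs: remove the rows, keep the definition.
        # An unconditional DELETE like this is normally fastpath processed.
        cur.execute("DELETE FROM mydb.staging_table ALL")

        # The DROP/CREATE alternative touches the dictionary (and AccessRights)
        # tables on every run, which is the overhead the quoted post warns about.
        # cur.execute("DROP TABLE mydb.staging_table")
        # cur.execute("CREATE TABLE mydb.staging_table (id INTEGER, payload VARCHAR(100))")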
We have a slightly different situation in Postgres where DDL statements are transactional.

riak - unable to delete keys in a bucket

I am using riak version 1.4.10 and it is in a ring with two hosts. I am unable to get rid of keys left over from previous operations using simple delete operations on keys. When I list the keys for a bucket, it shows me the old keys, however if I try to retrieve the data associated with a key, no data is found. When I try to delete the key, it still persists. What could be the cause of this? Is there a way to wipe the keys in the bucket so it starts from a clean slate? I don't care about any of the data in riak, but I would rather not have to reinstall everything again.
You are probably seeing the tombstones of the old data. Since Riak is an eventually consistent data store, it needs to keep track of deletes as if they were ordinary writes, at least for a little while.
If data is present on one node, but not another, how do you tell if it is a PUT that hasn't propagated yet, or a DELETE?
Riak solves this by using a tombstone. Whenever you delete something, instead of just wiping the data immediately, Riak replaces the existing value with a special value that it knows means deleted. This special value contains a vclock that is descended from the previous value, and metadata indicating deleted. So when it comes time to decide the above question, Riak simply compares the vclock of the value with that of the tombstone. Whichever descends from the other must be the correct one.
To solve the problem of an ever growing data size that contains mostly tombstones, tombstones are reaped after a time. The time is set using the delete_mode setting. After the DELETE is processed, and the tombstone has been written to the primary vnodes, the delete process issues a GET request for the key. Whenever the GET process encounters a tombstone, and all of the primary vnodes responded with the same tombstone, it schedules the tombstone to be reaped according to the delete_mode setting.
So if you want to actually get rid of the tombstones, check your delete_mode setting to make sure it is not set to 'keep', and issue a get for each one to make sure it is really gone.
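As a hedged sketch of that last step over Riak's HTTP interface (the host and bucket name are placeholders, and listing keys is itself an expensive, maintenance-only operation):

# Sketch: list the bucket's keys and issue a GET for each one.
import requests

BASE = "http://localhost:8098"   # placeholder Riak node
BUCKET = "mybucket"              # placeholder bucket

keys = requests.get(f"{BASE}/buckets/{BUCKET}/keys", params={"keys": "true"}).json()["keys"]

for key in keys:
    # A 404 means only a tombstone (or nothing) remains; the read itself lets
    # Riak schedule the tombstone for reaping, subject to delete_mode.
    resp = requests.get(f"{BASE}/buckets/{BUCKET}/keys/{key}")
    print(key, resp.status_code)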
Or if you are just wiping the data store to restart your tests, stop Riak, delete all the files under the data_root for the backend you are using, and restart.

Solution for previewing user changes and allowing rollback/commit over a period of time

I have asked a few questions today as I try to think through to the solution of a problem.
We have a complex data structure where all of the various entities are tightly interconnected, with almost all entities heavily reliant/dependant upon entities of other types.
The project is a website (MVC3, .NET 4), and all of the logic is implemented using LINQ-to-SQL (2008) in the business layer.
What we need to do is have a user "lock" the system while they make their changes (there are other reasons for this which I won't go into here that are not database related). While this user is making their changes we want to be able to show them the original state of entities which they are updating, as well as a "preview" of the changes they have made. When finished, they need to be able to rollback/commit.
We have considered these options:
Holding open a transaction for the length of time a user takes to make multiple changes stinks, so that's out.
Holding a copy of all the data in memory (or cached to disk) is an option, but there is a heck of a lot of it, so it seems unreasonable.
Maintaining a set of secondary tables, or attempting to use session state to store changes, but this is complex and difficult to maintain.
Using two databases, flipping between them by connection string, and using T-SQL to manage replication, putting them back in sync after commit/rollback. I.e. switching on/off, forcing snapshot, reversing direction etc.
We're a bit stumped for a solution that is relatively easy to maintain. Any suggestions?
Our solution to a similar problem is to use a locking table that holds locks per entity type in our system. When the client application wants to edit an entity, we do a "GetWithLock" which gets the client the most up-to-date version of the entity's data as well as obtaining a lock (a GUID that is stored in the lock table along with the entity type and the entity ID). This prevents other users from editing the same entity. When you commit your changes with an update, you release the lock by deleting the lock record from the lock table. Since stored procedures are the API we use for interacting with the database, this allows a very straightforward way to lock/unlock access to specific entities.
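A hedged sketch of that lock-table pattern, using pyodbc and inline SQL instead of the stored procedures mentioned above; the table names, columns, and connection string are illustrative placeholders, and EntityLock is assumed to have a UNIQUE constraint on (EntityType, EntityId):

# Sketch of the pessimistic lock-table pattern described above.
import uuid
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=MyDb;Trusted_Connection=yes")

def get_with_lock(entity_type, entity_id):
    """Acquire the edit lock and read the current row; returns (row, lock_guid)."""
    lock_guid = str(uuid.uuid4())
    cur = conn.cursor()
    # The unique constraint makes a second concurrent insert fail, so only one
    # user at a time can hold the lock for a given entity.
    cur.execute(
        "INSERT INTO EntityLock (EntityType, EntityId, LockGuid) VALUES (?, ?, ?)",
        entity_type, entity_id, lock_guid)
    conn.commit()
    cur.execute("SELECT * FROM Customer WHERE CustomerId = ?", entity_id)  # placeholder entity table
    return cur.fetchone(), lock_guid

def release_lock(entity_type, entity_id, lock_guid):
    """Release the lock once the update is committed or the edit is cancelled."""
    cur = conn.cursor()
    cur.execute(
        "DELETE FROM EntityLock WHERE EntityType = ? AND EntityId = ? AND LockGuid = ?",
        entity_type, entity_id, lock_guid)
    conn.commit()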
On the client side, we implement IEditableObject on the UI model classes. Our model classes hold a reference to the instance of the service entity that was retrieved on the service call. This allows the UI to do a Begin/End/Cancel Edit and do the commit or rollback as necessary. By holding the instance of the original service entity, we are able to see the original and current data, which would allow the user to get that "preview" you're looking for.
While our solution does not implement LINQ, I don't believe there's anything unique in our approach that would prevent you from using LINQ as well.
HTH
Consider this:
Long transactions make the system less scalable. If you issue an UPDATE command, the update locks last until commit/rollback, preventing other transactions from proceeding.
Secondary tables/databases can be modified by concurrent transactions, so you cannot rely on the data in them. The only way around that is to lock them => see no. 1.
Serializable transactions in some database engines use versioned data in your tables, so after the first command is executed the transaction sees exactly the data that was available at execution time. This might help you show the changes made by the user, but you have no guarantee of being able to save them back into storage.
DataSets contain old/new versions of the data, but that is unfortunately outside your technology aim.
Use a set of secondary tables.
The problem is that your connection should see two versions of data while the other connections should see only one (or two, one of them being their own).
While it is possible theoretically and is implemented in Oracle using flashbacks, SQL Server does not support it natively, since it has no means to query previous versions of the records.
You can issue a query like this:
SELECT *
FROM mytable
AS OF TIMESTAMP
TO_TIMESTAMP('2010-01-17', 'YYYY-MM-DD')
in Oracle but not in SQL Server.
This means that you need to implement this functionality yourself (placing the new versions of rows into your own tables).
Sounds like an ugly problem, and raises a whole lot of questions you won't be able to go into on SO. I got the following idea while reading your problem, and while it "smells" as bad as the others you list, it may help you work up an eventual solution.
First, have some kind of locking system, as described by #user580122, to flag/record the fact that one of these transactions is going on. (Be sure to include some kind of periodic automated check, to test for lost or abandoned transactions!)
Next, for every change you make to the database, log it somehow, either in the application or in a dedicated table somewhere. The idea is, given a copy of the database at state X, you could re-run the steps submitted by the user at any time.
Next up is figuring out how to use database snapshots. Read up on these in BOL; the general idea is you create a point-in-time snapshot of the database, do whatever you want with it, and eventually throw it away. (Only available in SQL 2005 and up, Enterprise edition only.)
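For reference, a hedged sketch of the snapshot plumbing that idea relies on, via pyodbc; the server, database name, logical file name, and snapshot path are all placeholders:

# Sketch: create, revert to, and drop a database snapshot.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=master;Trusted_Connection=yes",
    autocommit=True)  # CREATE/RESTORE DATABASE cannot run inside a user transaction
cur = conn.cursor()

# Take the point-in-time snapshot before the meta-transaction starts.
cur.execute("""
    CREATE DATABASE MyDb_Snapshot
    ON (NAME = MyDb_Data, FILENAME = 'C:\\Snapshots\\MyDb_Data.ss')
    AS SNAPSHOT OF MyDb
""")

# To throw changes away, revert the live database to the snapshot...
# cur.execute("RESTORE DATABASE MyDb FROM DATABASE_SNAPSHOT = 'MyDb_Snapshot'")

# ...and once the snapshot is no longer needed, drop it.
cur.execute("DROP DATABASE MyDb_Snapshot")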
So:
A user comes along and initiates one of these meta-transactions.
A flag is marked in the database showing what is going on. A new transaction cannot be started if one is already in process. (Again, check for lost transactions now and then!)
Every change made to the database is tracked and recorded in such a fashion that it could be repeated.
If the user decides to cancel the transaction, you just drop the snapshot, and nothing is changed.
If the user decides to keep the transaction, you drop the snapshot, and then immediately re-apply the logged changes to the "real" database. This should work, since your requirements imply that, while someone is working on one of these, no one else can touch the related parts of the database.
Yep, this sure smells, and it may not apply too well to your problem. Hopefully the ideas here help you work something out.

Resources