I have been using c++ and work with sqlite. In python, I have an executemany operation in the library but the c++ library I am using does not have that operation.
I was wondering how the executemany operation optimizes queries to make them faster.
I was looking at the sqlite c/c++ api and saw that there were two commands, sqlite3_reset and sqlite3_clear_bindings, that can be used to clear and reuse prepared statements.
Is this what python does to batch and speedup executemany queries (at least for inserts)? Thanks for your time.
executemany just binds the parameters, executes the statements, and calls sqlite3_reset, in a loop.
Python does not give you direct access to the statement after it has been prepared, so this is the only way to reuse it.
However, SQLite does not take much time for preparing statements, so this is unlikely to have much of an effect on performance.
The most important thing for performance is to batch statements in a transaction; Python tries to be clever and to do this automatically (independently from executemany).
I looked into some of the related posts and found the folowing which was very detailed on ways to improve sqlite batch insert performace. These principles could effectively be used to create an executemany function.
Improve INSERT-per-second performance of SQLite?
The biggest improvement changes were indeed as #CL. said, turning it all into one transaction. The author of the other post also found significant improvement by using and reusing prepared statements and playing with some pragma settings.
Related
I have a big gremlin query that is basically to filter results, is made of many has() and where() steps that can be written in any order and gives the same result, some of them are expensive and some of them are cheaper.
If i call the cheaper steps first I guess the expensive ones are going to be executed with less iterations because many vertices were filtered, this is true when coding in any language but in a database implementation I don't know if the Gremlin steps are executed in the order that are written.
I know this kind of things usually depends on the Gremlin database implementation but maybe you can give me some kind of general answer. Also I've tried to make some benchmarks but to build good ones in my specific case is too time consuming, so maybe you can help me with your knowledge of how databases are implemented internally.
As you mention, it really does depend on the query engine and the way optimized query plans are developed. Some engines will try to reorder parts of queries based on the estimated cardinality of elements being tested. Amazon Neptune works that way for example. In general it is best to filter out as much as possible as soon as possible. So in a social network you would not want to start with something like g.V().hasLabel(‘person’) unless you are confident the query engine is able to reorder such queries.
I'm working with some SQLite code in a C++ project that has several hundred prepared statements compiled at once performing operations on a comparable number of tables. All of the statements are simple selects and updates, but the individualized nature of the tables necessitates correspondingly specific SQL, so attempting the reuse of fewer statements for multiple tables is unrealistic. The statements are generally compiled once for the lifetime of the program and finalized on exit. Insofar as concurrency is concerned, at most two or three statements will ever be executed simultaneously on their own threads.
With the number of tables (and therefore, statements) expected to grow continually throughout development, I'd like to be aware of any potential problems with this design before things get any more complex. Having so many statements feels like code smell to me, not to mention a potential debugging nightmare.
I haven't found anything in the docs about prepared statement limits. Are there any practical limits to the number of prepared statements for a single SQLite database connection? Can high numbers of prepared statements cause performance issues?
Prepared statements do not need much memory.
While optimizing away the SQL parsing overhead is probably not worth the effort, it will not hurt.
According to the documentation for zumero_sync:
If a large amount of information needs to be pulled from the server,
this function may need to be called more than once.
In my Android app that uses Zumero that's no problem; I just keep calling zumero_sync until the return value doesn't start with "0;".
However, now I'm trying to write an admin script that also syncs with my server dbfiles. I'd like to use the sqlite3 shell, and have the script pass the SQL to execute via command line arguments. I need to call zumero_sync in a loop (which SQLite doesn't support) to make sure the db is fully synced. If I had to, I could invoke sqlite3 in a loop (reading its output, looking for "0;"), or even write a C++ app to call the SQLite/Zumero functions natively. But it certainly would be easier if a single zumero_sync was enough.
I guess my real question is: could zumero_sync be changed so it completes the sync before returning? If there are cases where the existing behavior is more useful, maybe there could be a parameter for specifying which mode to use?
I see two basic questions here:
(1) Why does zumero_sync() work the way it does?
(2) Can it work differently?
I'll answer (2) first, since it's easier: Yes, it could work differently. Rather, we could (and probably will, soon, you brought this up) implement an additional function, named something like zumero_sync_complete(), which performs [the guts of] zumero_sync() in a loop and returns after the sync is complete.
We didn't implement zumero_sync_complete() because it doesn't add much value. It's a simple loop, so you can darn well write it yourself. :-)
Er, except in scripting environments which don't support loops. Like the sqlite3 shell.
Answer to (1):
The Zumero sync protocol is designed to give the server the flexibility to return partial results if it wants to do so. And for the sake of reducing load on the server (and increasing its scalability) it often does want to do exactly that.
Given that, one reason to expose this to the client is to increase the client's flexibility as well. As long we're making multiple roundtrips, we might as well give the client an opportunity to do something (like, maybe, update a progress bar) in between them.
Another thing a client might want to do in between loop iterations is handle an error.
Or, in the case of a multithreaded client, it might want to deal with changes that happened on the client while the sync is going on.
Which raises the question of how locking should be managed? Do we hold the sqlite write lock during the entire loop? Or only when absolutely necessary?
Bottom line: A robust app would probably want to implement the loop itself so that it can make its own decisions and retain full control over things.
But, as you observe, the sqlite3 shell doesn't have loops. And it's not an app. And it doesn't have threads. Or progress bars. So it's a use case where a simpler-and-less-powerful form of zumero_sync() would make sense.
I am reworking a .NET application that so far has been running slowly. Our databases are Oracle, and the code is written in VB. When writing queries, I typically pass the parameters to a middle tier function which builds the raw SQL. I have a database class that has a function ExecuteQuery which takes in a SQL string and returns a DataTable. This uses an OleDbDataAdapter to run the query on the database.
I found some existing code that sends the SQL and a parameter to a stored procedure which as far as I can tell, opens the query and ouputs it to a SYS_REFCURSOR / DataSet.
I don't know why it's set up this way, but could someone tell me which is better performance-wise? Or the pros/cons to doing it this way?
Thanks in advance
Stored Procedures vs dynamic SQL have the exact same performance. In other words there is no performance advantage of one over the other. (Incidentally, I am a HUGE believer in using stored procs for everything for a host of other reasons but that's not the topic on hand).
Bottle necks can occur for many reasons.
For one, if you are actually code generating select statements it is highly probable that those statements are very unoptimized for the data the app needs. For example, doing a SELECT * which pulls 50 columns back versus a SELECT ID, Description which just pulls the two you need in your application at that point. In this example, the amount of data that has to be read from disk, transferred over the network wire, and pushed into objects in memory of the web server isn't trivial.
These will have to be evaluated on a case by case basis.
I would highly suggest that if you have a "slow" application that you need to improve the performance of the very first thing you ought to do is profile the application. What part of it is running slow? It might be inside the database server, it might be in your middle tier, it may even be a function of your network bandwidth or memory / load limitations on your web server. Heck, there might even be a WAIT command lurking somewhere in there placed by some previous programmer that left the company...
In short, you have at this point absolutely no idea on where to begin. So looking at actual code is premature. Go profile the app and see where things are slowing down. You might find that performance may radically improve simply by putting more memory in the database server.... Which is a much cheaper alternative than rewriting, testing and deploying vast amounts of code.
a stored procedure will definitely have better performance over building a raw query in code and executing it, but the important thing to realize is that, that difference in performance won't be your performance issue, there are many other things that will affect performance much more than just changing just query to be a stored procedure, even if you run a stored procedure and process the results using adapters, data tables, data sets, you're still incurring in a lot of performance, specially if you pass those large objects around (I have seen cases where datasets are returned wrapped in web service calls), so, don't focus on that, focus on caching data, having a good query, create the proper indexes, minimize the use of datasets, datatables, that will yield better benefits than just moving queries to stored procedures
Using SQLite in a memory-constrained embedded system with a fixed set of queries, it seems that code and data savings could be made if the queries could be 'pre-prepared'. That is, the prepared statement is produced by (an equivalent of) sqlite3_prepare_v2() at build time, and only _bind(), _step() etc need to be called at runtime, referencing one or more sqlite3_stmt* pointers that are effectively static data. The entire SQL parsing (and query planning?) engine could be eliminated from the target.
I realise that there is considerable complexity hidden behind the sqlite3_stmt* pointer, and that this is highly unlikely to be practical with the current sqlite3 implementation - but is the concept feasible?
This was discussed on the SQLite-users mailing list in 2006. At that time D. Richard Hipp supported a commercial version of SQLite that ran compiled statements on a stripped down target, which did not have any SQL parser. Perhaps you could check with hwaci to see if this product is still available.