Known reasons why sqlite3_open_v2 can take over 60s on Windows? - sqlite

I will start with the TL;DR version, as this may be enough for some of you:
We are trying to investigate an issue that we see in diagnostic data of our C++ product.
The issue was pinpointed to be caused by timeout on sqlite3_open_v2 which supposedly takes over 60s to complete (we only give it 60s).
We tried multiple different configurations but were never able to reproduce even a 5 s delay on this call.
So the question is whether there are known scenarios in which sqlite3_open_v2 can take that long (on Windows)?
Now to the details:
We are using version 3.10.2 of SQLite. We went through the changelogs from that version to the current one, and nothing in the bugfix sections suggests that an issue addressed in a later SQLite release could have caused our problem.
The issue affects around 0.1% of unique users across all supported versions of Windows (Win 7, Win 8, Win 10). There are no manual user complaints/reports about it, which may suggest that the problem happens in a context where something serious enough is going on with the user's machine/system that they don't expect anything to work. So something that indicates a system-wide failure is a valid possibility, as long as it can plausibly happen to 0.1% of random Windows users.
There is no data indicating that the same issue ever occurred on Mac, which is also a supported platform with a large enough sample of diagnostic data.
We are using Poco (https://github.com/pocoproject/poco, version 1.7.2) as the tool for accessing our SQLite database, but we've analyzed the Poco code and it seems that a failure at that level could (possibly) explain only ~1% of all collected samples. This is how we've determined that the problem lies in sqlite3_open_v2 taking a long time.
This happens in both DELETE journal mode and WAL mode.
It seems that after this problem happens the first time for a particular user, each subsequent call to sqlite3_open_v2 takes that long until the user restarts the whole application (possibly the machine; there is no way to tell from our data).
We are using the following flag setup for sqlite3_open_v2 (as in Poco):
sqlite3_open_v2(..., ..., SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_URI, NULL);
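For completeness, here is a minimal sketch of how that call could be instrumented to time the open and capture SQLite's error text when it is slow or fails; the wrapper name and the 1-second logging threshold are purely illustrative, not our actual code:
#include <chrono>
#include <cstdio>
#include <sqlite3.h>

sqlite3* open_with_diagnostics(const char* path)
{
    sqlite3* db = nullptr;
    const auto start = std::chrono::steady_clock::now();

    const int rc = sqlite3_open_v2(
        path, &db,
        SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_URI,
        nullptr);

    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();

    if (rc != SQLITE_OK || ms > 1000) {
        // sqlite3_open_v2 usually returns a handle even on failure,
        // so the error message is normally available here.
        std::fprintf(stderr, "sqlite3_open_v2: %lld ms, rc=%d (%s)\n",
                     static_cast<long long>(ms), rc,
                     db ? sqlite3_errmsg(db) : "no handle");
    }
    if (rc != SQLITE_OK && db) {
        sqlite3_close(db);
        db = nullptr;
    }
    return db;
}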
This usually doesn't happen on startup of the application, so it's not likely to be caused by something happening while our application is not running. That includes power cut-offs causing data destruction (which tends to return SQLITE_CORRUPT anyway, as mentioned in https://www.sqlite.org/howtocorrupt.html).
We were never able to reproduce this issue locally even though we tried different things:
Multiple threads writing to and reading from the DB with the synchronization required by a particular journal mode.
Keeping an SQLite connection open for a long time while working on the DB normally in the meantime.
Trying to hit the HDD hard with other data (dumping /dev/rand (WSL) to multiple files from different processes while accessing the DB normally).
Trying to force antivirus software to scan the DB on every file access (tested with Avast with basically everything enabled, including "scan on open" and "scan on write").
Breaking our internal synchronization required by particular journal modes.
Calling WinAPI CreateFile with all possible combinations of file-sharing options on the DB file - this caused issues, but sqlite3_open_v2 always returned fast, just with an error.
Calling WinAPI LockFile on random parts of the DB file, which by the way is a nice way of reproducing SQLITE_IOERR, but no luck with reproducing the discussed issue (a rough sketch of this experiment appears after this list).
Some additional attempts to actually stress the Poco layer and double-check that our static analysis of the code is right.
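For reference, here is roughly what the LockFile experiment mentioned above looked like (the path is a placeholder and the byte range and timings are arbitrary):
#include <windows.h>
#include <cstdio>

int main()
{
    // Placeholder path - point this at the DB file used by the application.
    HANDLE h = CreateFileA("C:\\path\\to\\app.db",
                           GENERIC_READ | GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                           nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h == INVALID_HANDLE_VALUE) {
        std::fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }
    // Hold an exclusive lock on an arbitrary byte range; SQLite calls in the
    // other process then tend to fail with SQLITE_IOERR rather than hang.
    if (!LockFile(h, 0, 0, 4096, 0)) {
        std::fprintf(stderr, "LockFile failed: %lu\n", GetLastError());
    }
    Sleep(60 * 1000);               // keep the lock for a minute
    UnlockFile(h, 0, 0, 4096, 0);
    CloseHandle(h);
    return 0;
}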
We've tried to look for similar issues online, but the only somewhat relevant thing we've found was sqlite3-open-v2-performance-degrades-as-number-of-opens-increase. It doesn't seem to explain our case though, as the numbers of parallel connections involved are way beyond what we have, as well as beyond what typical Windows users would have (unless there is some reasonably popular app making heavy use of SQLite that we don't know about).
It's very unlikely that this issue is caused by the DB being accessed through a network share, as we are putting the DB file inside %appdata% - unless there is some fairly standard Windows configuration that sets %appdata% to a remote share.
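If it helps, a small check like the one below is what we would consider adding to the diagnostics to verify whether %appdata% actually resolves to a local drive; SHGetFolderPathW and GetDriveTypeW are standard Win32, and the exact output format is just an example:
#include <windows.h>
#include <shlobj.h>   // SHGetFolderPathW, link with Shell32.lib
#include <cstdio>

int main()
{
    wchar_t path[MAX_PATH] = L"";
    if (FAILED(SHGetFolderPathW(nullptr, CSIDL_APPDATA, nullptr, 0, path))) {
        std::fprintf(stderr, "could not resolve APPDATA\n");
        return 1;
    }
    if (path[0] == L'\\') {
        // UNC path (\\server\share\...), i.e. %appdata% is on a network share.
        std::wprintf(L"%ls is a UNC path\n", path);
        return 0;
    }
    const wchar_t root[4] = { path[0], L':', L'\\', L'\0' };  // e.g. L"C:\\"
    const UINT type = GetDriveTypeW(root);
    std::wprintf(L"%ls -> drive type %u (%ls)\n", path, type,
                 type == DRIVE_REMOTE ? L"remote" : L"local/other");
    return 0;
}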
Do you have any ideas what could cause this issue?
Maybe some hints on what else we should check, or what additional diagnostic data we could collect from users to help pinpoint the real reason why this happens?
Thanks in advance

Related

Mgo-based app code structure dealing with connection pool and tcp timeouts

I'm curious how I should structure a JSON REST API server in Go with the mgo library. I have dozens of collections related to each other. I've created a gist with a sample part of the file structure in my current approach.
It works great, but from time to time I encounter downtime caused by this error: "read tcp 10.168.30.100:37288: i/o timeout". I suppose that I handle the mgo connection pool inappropriately. Are there any examples showing how I should build big applications based on mgo?
This error message implies a roundtrip to the database took longer than the timeout period you defined. Just increasing that timeout should get rid of the problem, assuming you don't have any real issues that are causing the application to behave in a sluggish manner.
In general, this error doesn't imply you have any kind of scale issues, other than the fact maybe you have an increasing amount of data in some collections and certain queries may be getting too slow and need re-thinking (indexes, etc).
There's also no need to restart the application. You can either Refresh the problematic session, or Close and re-create the session in case you're using copies of a master session. The state of mgo and the pool of connections is still fine. It's just warning you that this specific session observed an issue on the wire, and so you have to acknowledge it before the session will be valid again.
As usual, also make sure to be using the latest release to avoid problems that have already been fixed, if any.

Avoid sqlite vacuum "database is locked" by application level retrying

I'm using a simple SQLite DB as a persistent message-queueing mechanism between processes. To reduce the file size after it exceeds a certain limit, I wanted to use the VACUUM command. Generally all of this works nicely, only that every now and then I get a "database is locked" error when vacuuming.
After reading through various resources on the web, I understand that there is nothing I can do about this at the SQLite level.
However, besides the side questions "Why is that the case? What would be the problem with retrying to obtain the required lock using the regular busyHandler mechanism?", I came up with the idea of implementing that very same busyHandler mechanism myself, at the application level.
Now the essential question: Anything wrong with this?
Many thx!!
There isn't really much of a difference between using SQLite's built-in retrying mechanism and implementing it yourself.
However, writing code that you do not need to write is wasted effort and increases the chance of introducing bugs.
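To make the comparison concrete, here is a rough sketch of the two options against the SQLite C API; the timeout, attempt count and sleep interval are arbitrary example values, not a recommendation:
#include <sqlite3.h>
#include <chrono>
#include <thread>

// 1) Built-in retrying: let SQLite itself retry busy locks for up to 5 seconds.
int vacuum_with_busy_timeout(sqlite3* db)
{
    sqlite3_busy_timeout(db, 5000);   // milliseconds
    return sqlite3_exec(db, "VACUUM;", nullptr, nullptr, nullptr);
}

// 2) Essentially the same thing re-implemented at application level.
int vacuum_with_manual_retry(sqlite3* db, int attempts)
{
    int rc = SQLITE_BUSY;
    for (int i = 0; i < attempts && (rc == SQLITE_BUSY || rc == SQLITE_LOCKED); ++i) {
        rc = sqlite3_exec(db, "VACUUM;", nullptr, nullptr, nullptr);
        if (rc == SQLITE_BUSY || rc == SQLITE_LOCKED)
            std::this_thread::sleep_for(std::chrono::milliseconds(250));
    }
    return rc;
}
Both behave much the same from SQLite's point of view; the manual version only buys you a place to log or count the retries.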
After working on this for another while, I changed my application logic: I removed the vacuuming altogether and instead use "pragma max_page_count" to limit the DB size. The only thing to watch out for is that this setting is per connection, i.e. the pragma must be set again after each DB connect.
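A minimal sketch of that, assuming the SQLite C API is used directly (the 10000-page cap is just an example value):
#include <sqlite3.h>

sqlite3* open_capped(const char* path)
{
    sqlite3* db = nullptr;
    if (sqlite3_open(path, &db) != SQLITE_OK)
        return nullptr;
    // The pragma only applies to this connection, so it has to be issued
    // after every open. With a 4096-byte page size, 10000 pages is ~40 MB.
    sqlite3_exec(db, "PRAGMA max_page_count = 10000;", nullptr, nullptr, nullptr);
    return db;
}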
Regarding the original problem: I found that @CL was mostly right - vacuuming indeed involves the standard busy handler. On Linux this also works nicely; on Windows, however, it does not work reliably (that might be coincidental, though, and rather a matter of system speed, ...). Using a little printf() in my custom busyHandler I can see that it is invoked in most cases; sometimes, however, it is not and the VACUUM just returns "DB locked".
Possibly this is caused by concurrent processes trying to vacuum at the same time (?). In any case, the reworked design is much cleaner/easier.

ORACLE/ASP.NET: ORA-2020 - Too many database links... what's causing this?

Here's the scenario...
We have an internal website that is running the latest version of the ODAC (Oracle Client). It opens database connections, runs a stored procedure or packaged method, then disconnects. Connection pooling is turned on, and we are currently under version 11g in both our development and test environments, but under 10gR2 in our production environment. This happens on Production.
A few days ago, a process began firing off an ORA-2020 error. The process is called from a page on our internal website. The user simply sets a date, hits a button, and a job is started on another system that is separate from the website. The call itself, however, uses a database link to run a function.
We've scoured the SQL and found that it only uses that one database link. And since these links are on a per-session basis and the user isn't exceeding the default limit of 4, how is it possible that we are getting an ORA-2020 error?
We have run a number of tests to try to push past the default limit of 4. ODAC, from what I recall, runs a commit after each connection, and I can't produce any errors by running 4 DB links and then a piece of SQL with 1 DB link directly after. The only way I can bring up this error is if I run a query with 4 DB links, and then a function or piece of dynamic SQL with a database link within it. That isn't our situation, though, as this issue is sporadic. It isn't ALWAYS happening.
Questions
Is it possible that connection pooling is allowing User B to use User A's connection after the initial process was run, thus adding to the open links number if User B runs a SQL statement with more database links?
Is this a scenario where we should up our limit past 4? What are the disadvantages of increasing the number?
Do I need to explicitly close open database links before disconnecting from the database? Oracle documentation seems to suggest it should automatically happen, but "on occasion"... doesn't.
Firstly, the simple solution: I'd double-check that in the production database the limit on open links is actually 4.
select *
from v$system_parameter
where name = 'OPEN_LINKS'
Assuming you're not going to get off that lightly:
Is it possible that connection pooling is allowing User B to use User A's connection after the initial process was run, thus adding to the open links number if User B runs a SQL statement with more database links?
You say that you explicitly close the session, which, according to the documentation, should mean that all links associated with that session are closed. Other than that I confess complete ignorance on this point.
Is this a scenario where we should up our limit past 4? What are the disadvantages of increasing the number?
There aren't any disadvantages that I can think of. Tom Kyte suggested, albeit a long time ago, that each open database link uses about 500k of PGA memory. If you don't have much memory to spare then this will obviously cause a problem, but it should be more than fine for most situations.
There are, however, unintended consequences: imagine that you up this number to 100. Somebody codes something that continually opens links and pulls a lot of data through all of them (select * from my_massive_table or similar). Instead of 4 sessions doing this, you have 100, all attempting to transfer hundreds of gigabytes simultaneously. Your network dies under the strain...
There's probably more but you get the picture.
Do I need to explicitly close open database links before disconnecting from the database? Oracle documentation seems to suggest it should automatically happen, but "on occasion"... doesn't.
As you've noted, the best answer is "probably not", which isn't much help. You don't mention exactly how you're terminating the session, but if you're killing it rather than closing it gracefully, then definitely.
Using a database link spawns a child process on the remote server. Because your server is no longer in absolute charge of that process, there's a myriad of things that could cause it to become orphaned or otherwise not close on termination of the parent process. By no means does this happen all the time, but it can and does.
I would do two things.
In your process, if an exception is encountered, e-mail the results of the following query to yourself.
select *
from v$dblink
As a minimum, you will at least know what database links are open in the session, and it gives you some way of tracing them.
Follow the documentation's advice; specifically the following:
"You may have occasion to close the link manually. For example, close
links when:
The network connection established by a link is used infrequently in an application.
The user session must be terminated."
The first case seems to fit your situation exactly. Unless your process is time-sensitive, which doesn't seem to be the case, what have you got to lose? The syntax is:
alter session close database link <linkname>
We ended up increasing the link amount, but we never did find the root cause.

sporadic ASP.NET data error: "Cannot find table 0"

Having deployed a new build of an ASP.NET site in a production environment, I am logging dozens of data errors every second, almost always with the error "Cannot find table 0." We use DataSets and frequently refer to Tables[0], and while I understand the defensive coding practice of checking the DataSet for tables before accessing Tables[0], it has never been a problem in the past. A certain page will load fine one second, and then be missing one of its data-driven components the next. Just seeing if this rings a bell for anyone.
More detail: I used a different build server this time, and while I imagine the compiler settings are the same on both, I have a hard time believing there's a switch that makes 50% of my database calls come back with no tables. I also switched the project to VS 2008, but then reverted all of those changes when I switched back to VS 2005. I notice that the built assembly has a new MyLibrary.XmlSerializers.dll where it didn't use to, but I also can't imagine that that's causing all the trouble. (It also doesn't fall down on calls to MyLibrary, or at least no more than at any other time.)
Updated to add: I've discovered that the troublesome build is a "Release" build, where the working build was compiled as "Debug". Could that explain it?
Rolling back to the build before these changes fixed it. (Rebooting the SQL Server, the step we tried before that, did not.)
The trouble also seems to be load-based - this cruised through our integration and QA environments without a problem, and even our smoke test environment - the one that points to production data - is fine under light load.
Does this have the distinguishing characteristics of anything you might have seen in the past?
Bumping this old question because we have encountered the same issue, and perhaps our solution will give more insight into what causes this.
Essentially, this problem occurs in a production environment under very heavy load, in a Windows service that uses multiple threads to process several jobs simultaneously (100 users use the same DB via an ASP.NET web app and there are about 60 transactions/second on older hardware with SQL Server 2000).
No variables are shared; that is, connections are opened anew, a transaction is started, the operations are executed, the transaction is committed and the connection is closed.
Under heavy load sometimes one of the following exceptions occurs:
NullReferenceException: Object reference not set to an instance of an object.
at System.Data.SqlClient.SqlInternalConnectionTds.get_IsLockedForBulkCopy()
or
System.Data.SqlClient.SqlException: The server failed to resume the transaction. Desc:3400000178
or
New request is not allowed to start because it should come with valid transaction descriptor
or
This SqlTransaction has completed; it is no longer usable
It seems that somehow a connection within the pool becomes corrupted and remains associated with previously used transactions. Furthermore, if such a connection is retrieved from the pool, sqlAdapter.Fill(dataset) results in an empty dataset, causing "Cannot find table 0". Because our service would retry the operation (reading the job list) on failure, and it would always get the same corrupt connection from the pool, it would fail with this error until restarted.
We resolved the issue by calling SqlConnection.ClearPool(connection) on exception, to make sure such a connection is discarded from the pool, and by restructuring the application so that fewer threads access the same resources simultaneously.
I have no clue what exactly caused this issue, so I am not sure we have really fixed it; maybe we just made it so rare that it has not occurred again yet.
I've fought precisely this error message before. The key is that an underlying data method is swallowing a timeout exception.
You're probably doing something like this:
var table = GetEmployeeDataSet().Tables[0];
GetEmployeeDataSet is swallowing an exception, probably a timeout exception, which is why it only happens sporadically - it happens under load. You need to do the following to fix it:
Modify the underlying code to not swallow the exception, but rather let it bubble up to the next level so you can identify it properly.
Identify the query (or queries) causing the problem, and then rewrite, re-index, denormalize or throw hardware at it. See this for more info: System.Data.SqlClient.SqlException: Timeout expired
I've seen something similar. I believe our problem had to do with failed sessions being re-used (once the session object failed it went into a poor state and could not recover.) We fixed it by increasing the memory for the session pool and increasing the frequency of the web application recycling.
It also was "caused" by a new version that at first blush did not seem to have any change to cause such an effect. However, eventually it became clear that the logic of the program was opening and closing a lot more connections (maybe 20% more) than it used to. This small change pushed the limit of our prior configuration.
You might check the SQL Server logs for errors, or the web server event log. It sounds like your connection pool could be out of open connections, or your DB could be.
Which database calls changed between versions?
The error is obviously telling you one of your database calls isn't returning any data on occasion; I can't think of any cases where a code/assembly issue would cause it.
I have seen something like this when doing something with nHibernate Sessions in a non-thread-safe manner. That would explain why you only see it under load. Would need to see your code to guess at what isn't thread-safe though.

SQL Server query runs slower from ADO.NET than in SSMS

I have a query from a web site that takes 15-30 seconds, while the same query runs in 0.5 seconds from SQL Server Management Studio. I cannot see any locking issues using SQL Profiler, nor can I reproduce the delay manually from SSMS. A week ago, I detached and reattached the database, which seemed to miraculously fix the problem. Today, when the problem reared its ugly head again, I tried merely rebuilding the indexes. This also fixed the problem. However, I don't think it's necessarily an index problem, since the indexes wouldn't be automatically rebuilt on a simple detach/attach, to my knowledge.
Any idea what could be causing the delay? My first thought was that perhaps some parameter sniffing on the stored procedure being called (said stored proc runs a CTE, if that matters) was causing a bad query plan, which would explain the intermittent nature of the problem. Since both detaching / reattaching and an index rebuild should theoretically invalidate the cached query plan, this makes sense, but I'm unsure how to verify this. Additionally, why wouldn't the same query (copied directly from SQL Profiler with the exact same parameters) exhibit the same delay when run manually through SSMS?
Any thoughts?
I know I am weighing in on this topic very late, but I wanted to post a solution that I found when having a similar issue. In brief, adding the SET ARITHABORT ON command at the outset of my procedures brought website query performance in line with the performance seen from SQL Server tools. This option is typically set on the connection when you run a query from Query Analyzer or SSMS (you can change that option, but it is the default).
In my case, I had about 15 different stored procs doing mathematical aggregates (SUMs, COUNTs, AVGs, STDEVs) across a fairly sizeable set of data (10s to 100s of thousands of rows) - adding the SET ARITHABORT ON option moved them all from running in 3-5 seconds each to 20-30ms.
So, hopefully that helps someone else out there.
If a bad plan is cached then the same bad plan should be used from SSMS too, if you run the very same query with identical arguments.
There cannot be a better solution than finding the root cause. Trying to peek and poke at various settings in the hope that it fixes the problem will never give you the confidence that it is actually fixed. Besides, next time the system may have a different problem and you'll believe this same problem has resurfaced and apply a bad solution.
The best thing to try is to capture the bad execution plan. The Showplan XML Event Class Profiler event is your friend; with it you can get the plan of the ADO.NET call. This is a very heavy event, though, so you should attach Profiler and capture it only when the problem manifests itself, in a short session.
Query IO statistics can also be of help. The RPC:Completed and SQL:BatchCompleted events both include Reads and Writes, so you can compare the amount of logical IO performed by the ADO.NET invocation vs. the SSMS one. A large difference (for exactly the same query and params) indicates different plans. sys.dm_exec_query_stats is another avenue of investigation; you can find your query plan(s) in there and inspect the execution stats.
All of this should help establish with certainty whether the problem is a bad plan or something else, to start with.
I have been having the same problem.
The only way I can fix it is by setting ARITHABORT ON,
but unfortunately, when it occurs again, I have to set ARITHABORT OFF.
I have no clue what ARITHABORT has to do with this, but it works. I have been having this problem for over 2 years now with still no solution. The databases I am working with are over 300 GB, so maybe it is a size issue...
The closest I got to resolving this problem was an earlier post:
Google Groups post
Let me know if you have managed to completely solve this problem, as it is very frustrating!
Is it possible that your ADO.NET query is running after the system has been busy doing other things, so that the data it needs is no longer in RAM? And when you test on SSMS, it is?
You can check for that by running the following two commands from SSMS before you run the query:
CHECKPOINT
DBCC DROPCLEANBUFFERS
If that causes the SSMS query to run slowly, then there are some tricks you can play on the ADO.NET side to help it run faster.
Simon Sabin has a great session on "when a query plan goes wrong" (http://sqlbits.com/Sessions/Event5/When_a_query_plan_goes_wrong) that discusses how to address this issue within procs by using various "optimize for" hints and such to help a proc generate a consistent plan and not use the default parameter sniffing.
However, I've got an issue with an ad-hoc query (not in a proc) where the SSMS plan and the ASP plan are exactly the same - clustered index / table scan - and yet the ASP query takes 3+ minutes instead of 1 second. (In this case a table scan happens to be a decent way of fetching the results.)
Anyone care to explain that one?
