Mgo-based app code structure dealing with connection pool and tcp timeouts

Mgo-based app code structure dealing with connection pool and tcp timeouts - tcp

I'm curious how should I structure JSON REST API server in Go language with Mgo library. I have dozens of collections related with each other. I've created the gist with sample part of file structure in my current approach.
It works great, but from time to time I encounter downtime caused by this error: "read tcp 10.168.30.100:37288: i/o timeout". I suppose that I handle mgo connection pool inapropriately. Are there any examples showing how should I create big applications based on mgo?

This error message implies a roundtrip to the database took longer than the timeout period you defined. Just increasing that timeout should get rid of the problem, assuming you don't have any real issues that are causing the application to behave in a sluggish manner.
In general, this error doesn't imply you have any kind of scale issues, other than the fact maybe you have an increasing amount of data in some collections and certain queries may be getting too slow and need re-thinking (indexes, etc).
There's also no need to restart the application. You can either Refresh the problematic session, or Close and re-create the session in case you're using copies of a master session. The state of mgo and the pool of connections is still fine. It's just warning you that this specific session observed an issue on the wire, and so you have to acknowledge it before the session will be valid again.
As usual, also make sure to be using the latest release to avoid problems that have already been fixed, if any.

Related

ASP.NET: Session state has created a session id, but cannot save it because the response was already flushed by the application

in an old ASP.NET Web Forms application, developed in Visual Studio 2010,
suddenly does not run anymore, and in the log file appears this message:
Session state has created a session id,
but cannot save it because the response was already flushed by the application.
No new deployment has been made, and no code modifications take place.
Until now I didn't find any solution to this.
What I have to check?
I state that the source code is no longer available, and therefore it would be very difficult to change the code and proceed with a new deployment.
Thanks in advance.
Luis

This would suggest that someone might be hitting the site and jumping directly to some URL (and thus code) that say does some response redirect to another page or some such.
Remember, when code behind runs, and say re-directs to another page, in most cases the running code for the current page is terminated, and that is normal behaviors.
However, the idea that you going to debug code and debug a web site when you don't have the code to debug? Gee, I don't see how that's going to work at all. As noted, if this just started, then it sounds like incoming requests are to pages that don't expect to be hit "first", but some pages that expect to be ONLY called from other pages in the site when some session() and imporant values are setup BEFORE such pages are to be hit.
It also not clear if the site is using sql based sessions, or just in-memory sessions. In memory can (and is) faster, but it also not particually relaible. Now, if you deployed to a new web server or new hosting, then often session errrors can now start to appear, and this is due to the MASSIVE HUGE LARGE DIFFERENT of using cloud based hosting vs that of older hosting soluions that run on a single server.
Clould computing is real utility computing, and thus when you host a web site on such systems, then in-memory session() cannot be used anymore, since multiple servers can and will be used to "dish out" web pages. Since more then one server might be used, then obvisouly in-memory sesson() can't work, since a few web pages might be served out by one server, and then a few more pages might be served out by another server. And using shared memory for a session is limited to ONE server, since multiplel servers don't and can't transfer their memory to other servers.
So, this suggests that you want to be sure that sql server based sessions are being used here - and for any kind of server farm, or any kind of system that does load balances between more then one server, then of course you HAVE to use sql server based sessions, since in memory can't work in that kind of environment.
The error could also be due to excessive server loads - often the session database is "locked" for a short period of time, and can thus often be a bottleneck. So, for say years you might not see a issue, but then as load and use of the web site increases, then this can become noticed where as in the past it was not. I suppose the database used for storing sessions could be checked, or looked at, since as you note, you don't have the ability to test + debug the code. I would REALLY but REALLY work towards solving and fixing this lack of source code for the web site, since without that, you have really no means to manage, maintain, and fix issues for that web site.
But, abrupt terminations of web pages? As noted, this could be a error triggered in code, and thus the page never finished what it was supposed to do. And as noted, perhaps a page that expects some session() values but does not have them as explained above could be the problem. It not clear if your errors also shows what URL this was occurring for.
While nothing seems to have changed - something obviously did.
Ultimate, you need to get that source code, or deal with the people + vendor that supplies the code for that site. If you don't have a vendor, and you don't have source code, you quite much attempting to work on a car that you cant even open the hood to check what's going on under that hood.
so, one suggestion here? Someone is hitting a page that expected some value(s) in session to exist. Often the simple solution is to shove ANY simple or dummy value into session so session REALLY does get created, and then when the page attempts to save the session(), there is one to save!!!
In other words, this error often occurs when session is attempted to be saved, but no sesison exists. For such pages, as noted, a simple tiny small code change of doing this session("zoozoo") = "my useless text" will fix this error. So, it sounds like session is being lost.
As noted, a error on a web page can also trigger a app-pool re-start. If app-pool re-starts, then session is lost (in memory session). Now, with session being lost, then any page that decides to terminate early AND ALSO having used session() will trigger this error.
So, this sounds like app-pool is being re-started and session is being lost. (you can google why app-pool restarts and for the many reasons). However, critical to this issue would be are you using sql based sessions, or in-memory (server) sessions? So, this sounds like some code is triggering a error, and with a error triggered, app-pool re-starts. And with app-pool being restarted, then in-memory session is blown away. And now, without ANY session at all, then attempts to save the session trigger the exact error message you see. (and this is why shoving a dummy value into the session allows and can fix some pages - since you can't save a "nothing" session, and if you do, then you get that exact error message.
but, as noted, you can't make these simple changes to code anyway, right?
But, first on this issue - are you using memory based sessions or not? And that feature can be setup and configured in IIS, and without changes to the code base. So, one quick fix might be to turn on sql server based sessions. It will cost web site performance (10%), but the increased reliability is more then worth the performance hit.
Another area to look at? Are AJAX calls being made to a page, and again without any previous session having been created? So, once again, we down to a change in end user behaviors, and possible those hitting a page first before having logged in, or done other things - and again one would see this error crop up.

Known reasons why sqlite3_open_v2 can take over 60s on windows?

I will start with TL;DR version as this may be enough for some of you:
We are trying to investigate an issue that we see in diagnostic data of our C++ product.
The issue was pinpointed to be caused by timeout on sqlite3_open_v2 which supposedly takes over 60s to complete (we only give it 60s).
We tried multiple different configurations, but never were able to reproduce even 5s delay on this call.
So the question is if maybe there are some known scenarios in which sqlite3_open_v2 can take that long (on windows)?
Now to the details:
We are using version 3.10.2 of SQLite. We went through changelogs from this version till now and nothing we've found in the bugfixes section seems to suggest that there was some issue that was addressed in consecutive SQLite releases and may have caused our problem.
The issue we see affects around 0.1% unique user across all supported versions of windows (Win 7, Win 8, Win 10). There are no manual user complaints/reports about that - this can suggest that a problem happens in the context where something serious enough is happening with the user machine/system that he doesn't expect anything to work. So something that indicates system-wide failure is a valid possibility as long as it can possibly happen for 0.1% of random windows users.
There are no data indicating that the same issue ever occurred on Mac which is also supported platform with large enough sample of diagnostic data.
We are using Poco (https://github.com/pocoproject/poco, version: 1.7.2) as a tool for accessing our SQLite database, but we've analyzed the Poco code and it seems that failure on this code level can only (possibly) explain ~1% of all collected samples. This is how we've determined that the problem lies in sqlite3_open_v2 taking a long time.
This happens on both DELETE journal mode as well as on WAL.
It seems like after this problem happens the first time for a particular user each consecutive call to sqlite3_open_v2 takes that long until the user restarts whole application (possibly machine, no way to tell from our data).
We are using following flags setup for sqlite3_open_v2 (as in Poco):
sqlite3_open_v2(..., ..., SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_URI, NULL);
This usually doesn't happen on startup of the application so it's not likely to be caused by something happening while our application is not running. This includes power cuts offs causing data destruction (which tends to return SQLITE_CORRUPT anyway, as mentioned in https://www.sqlite.org/howtocorrupt.html).
We were never able to reproduce this issue locally even though we tried different things:
Multiple threads writing and reading from DB with synchronization required by a particular journaling system.
Keeping SQLite connection open for a long time and working on DB normally in a meanwhile.
Trying to hit HDD hard with other data (dumping /dev/rand (WSL) to multiple files from different processes while accessing DB normally).
Trying to force antivirus software to scan DB on every file access (tested with Avast with basically everything enabled including "scan on open" and "scan on write").
Breaking our internal synchronization required by particular journaling systems.
Calling WinAPI CreateFile with all possible combinations of file sharing options on DB file - this caused issues but sqlite3_open_v2 always returned fast - just with an error.
Calling WinAPI LockFile on random parts of DB file which is btw. nice way of reproducing SQLITE_IOERR, but no luck with reproducing the discussed issue.
Some additional attempts to actually stretch the Poco layer and double-check if our static analysis of codes is right.
We've tried to look for similar issues online but anything somewhat relevant we've found was here sqlite3-open-v2-performance-degrades-as-number-of-opens-increase. This doesn't seem to explain our case though, as the numbers of parallel connections are way beyond what we have as well as what would typical windows users have (unless there is some somewhat popular app exploiting SQLite which we don't know about).
It's very unlikely that this issue is caused by db being accessed through network share as we are putting the DB file inside %appdata% unless there is some pretty standard windows configuration which sets %appdata% to be a remote share.
Do you have any ideas what can cause that issue?
Maybe some hints on what else should we check or what additional diagnostic data that we can collect from users would be useful to pinpoint the real reason why that happens?
Thanks in advance

Classic ASP 'Requests Executing' never greater than 1

We have a complex app that serves AJAX JSON streams (using ADO to grab the data) using brief ASP servlets. Any given session can fire up from 10-20 of these requests simultaneously. We encountered a significant performance problem way earlier than we expected as load built. (Server is a dual-XEON, RAID 5, 4gb, etc). Sleuthing around in perfmon we noticed that the 'Requests Executing' figure is perpetually stuck at 1. Never gets any higher. Research indicates that numbers of 20-50 are not uncommon. Requests Queued will hover around 10-20 and Wait Time climbs as well.
We have fiddled with ASPProcessorThreadMax set to 40 from default of 25 with no effect. It seems to be only able to work a single request at a time, which, needless to say, won't work. I can't find anything that describes this particular problem. Anny help is greatly appreciated.

ASP Session object is constrained to a Single Threaded Apartment (STA). As a result requests to ASP scripts for the same session can only be processed sequentially.
An additional reason why you might only ever see 1 executing ASP script even across multiple sessions is where debugging has be enabled for ASP. This causes the ASP processing to ignore ASPProcessorThreadMax and pretend it were set to 1.
To eliminate the problem ensure debugging is not enabled and turn off "Enable Session State". If you are using the Session object in your code you will need to find an alternative, like DB backed state.
However, how many active concurrent sessions are you expecting in the live production? Perhaps the overall user experience will not truely be impacted by the serialisation of requests per session.

ORACLE/ASP.NET: ORA-2020 - Too many database links... what's causing this?

Here's the scenario...
We have an internal website that is running the latest version of the ODAC (Oracle Client). It opens database connections, runs a stored procedure or packaged method, then disconnects. Connection pooling is turned on, and we are currently under version 11g in both our development and test environments, but under 10gR2 in our production environment. This happens on Production.
A few days ago, a process began firing off a ORA-2020 error. The process is called from a webpage on our internal website. The user simply sets a date, hits a button, and a job is started on another system that is separate from the website. The call itself, however, uses a database link to run a function.
We've scoured the SQL to find that it only uses that one database link. And since these links are on a per session basis and the user isn't exceeding the default limit of 4, how is it possible that we are getting a ORA-2020 error.
We have ran a number of tests to try to push over the default limit of 4. ODAC, from what I recall, runs a commit after each connection, and I can't seem to run 4 DB links, then run a piece of SQL with 1 DB link directly after with any errors. The only way I can bring up this error is if I run a query with 4 DB links, then a function or piece of dynamic SQL with a database link within it. We don't have that problem as this issue is sporadic. It isn't ALWAYS happening.
Questions
Is it possible that connection pooling is allowing User B to use User A's connection after the initial process was run, thus adding to the open links number if User B runs a SQL statement with more database links?
Is this a scenario where we should up our limit past 4? What are the disadvantages of increasing the number?
Do I need to explicitly close open database links before disconnecting from the database? Oracle documentation seems to suggest it should automatically happen, but "on occasion"... doesn't.

Firstly, the simple solution: I'd double check that in the production database the number of default links is actually 4.
select *
from v$system_parameter
where name = 'OPEN_LINKS'
Assuming you're not going to get off that lightly:
Is it possible that connection pooling is allowing User B to use User
A's connection after the initial process was run, thus adding to the
open links number if User B runs a SQL statement with more database
links?
You say that you explicitly close the session, which, according to the documentation, should mean that all links associated with that session are closed. Other than that I confess complete ignorance on this point.
Is this a scenario where we should up our limit past 4? What are the
disadvantages of increasing the number?
There aren't any disadvantages that I can think of. Tom Kyte suggests, albeit a long time ago, that each open database link uses 500k of PGA memory. If you don't have any then this will obviously cause a problem but it should be more than fine for most situations.
There are, however, unintended consequences: Imagine that you up this number to 100. Somebody codes something that continually opens links and draws a lot of data through all them select * from my_massive_table or similar. Instead of 4 sessions doing this you have 100, which is attempting to transfer hundreds of gigabytes simultaneously. Your network dies under the strain...
There's probably more but you get the picture.
Do I need to explicitly close open database links before disconnecting
from the database? Oracle documentation seems to suggest it should
automatically happen, but "on occasion"... doesn't.
As you've noted the best answer is "probably not", which isn't much help. You don't mention exactly how you're terminating the session but if you're killing it rather than closing gracefully then definitely.
Using a database link spawns a child process on the remote server. Because your server is no longer in absolute charge of this process there's a myriad of things that could cause it to become orphaned or otherwise not close on termination of the parent process. By no means does this happen the whole time but it can and does.
I would do two things.
In your process, if an exception is encountered, e-mail the results of the following query to yourself.
select *
from v$dblink
As a minimum at least you will know what database links are open in the session and give you some way of tracing them.
Follow the documentations advice; specifically the following:
"You may have occasion to close the link manually. For example, close
links when:
The network connection established by a link is used infrequently in an application.
The user session must be terminated."
The first seems to exactly fit your situation. Unless your process is time-sensitive, which doesn't seem to be the case, then what have you got to lose? The syntax is:
alter session close database link <linkname>

We ended up increasing the link amount, but we never did find the root cause.

sporadic ASP.NET data error: "Cannot find table 0"

Having deployed a new build of an ASP.NET site in a production environment, I am logging dozens of data errors every second, almost always with the error "Cannot find table 0." We use datasets and frequently refer to Table[0], and while I understand the defensive coding practice of checking the dataset for tables before accessing Table[0], it's never been a problem in the past. A certain page will load fine one second, and then be missing one of its data-driven components the next. Just seeing if this rings a bell for anyone.
More detail: I used a different build server this time, and while I imagine the compiler settings are the same on both, I have a hard time thinking that there's a switch that makes 50% of my database calls come back with no tables. I also switched the project to VS 2008, but then reverted all of those changes when I switched back to VS 2005. I notice that the built assembly has a new MyLibrary.XmlSerializers.dll, where it didn't used to, but I also can't imagine that that's causing all the trouble. (It also doesn't fall down on calls to MyLibrary, or at least no more than any other time.)
Updated to add: I've discovered that the troublesome build is a "Release" build, where the working build was compiled as "Debug". Could that explain it?
Rolling back to the build before these changes fixed it. (Rebooting the SQL Server, the step we tried before that, did not.)
The trouble also seems to be load-based - this cruised through our integration and QA environments without a problem, and even our smoke test environment - the one that points to production data - is fine under light load.
Does this have the distinguishing characteristics of anything you might have seen in the past?

Bumping this old question because we have encountered the same issue and perhaps our solution would give more insight in what causes this.
Essentially this problem occurs in a production environment that is under very heavy load in a Windows service that uses multiple threads to process several jobs simultaneously (100 users use the same DB via ASP.NET web app and there are about 60 transactions/second on older hardware with SQL Server 2000).
No variables are shared, that is connections are opened anew, transaction is started, operations executed, transaction committed and connection closes.
Under heavy load sometimes one of the following exceptions occurs:
NullReferenceException: Object reference not set to an instance of an
object.
at System.Data.SqlClient.SqlInternalConnectionTds.get_IsLockedForBulkCopy()
or
System.Data.SqlClient.SqlException:
The server failed to resume the transaction. Desc:3400000178
or
New request is not allowed to start because it should come with valid transaction descriptor
or
This SqlTransaction has completed; it is no longer usable
It seems somehow the connection that is within the pool becomes corrupted and remains associated with previously used transactions. Furthermore, if such connection is retrieved from pool then sqlAdapter.Fill(dataset) results in an empty dataset, causing "Cannot find table 0". Because our service would retry the operation (reading job list) on failure and it would always get the same corrupt connection from the pool it would fail with this error until restarted.
We removed the issue by using SqlConnection.ClearPool(connection) on exception to make sure this connection is discarded from the pool and restructuring the application so less threads access the same resources simultaneously.
I have no clue who exactly caused this issue so I am not sure we have really fixed that, maybe just made it so rare it had not occurred again yet.

I've fought precisely this error message before. The key is that an underlying data method is swallowing a timeout exception.
You're probably doing something like this:
var table = GetEmployeeDataSet().Tables[0];
GetEmployeeDataSet is swallowing an exception, probably a timeout exception, which is why it only happens sporadically - it happens under load. You need to do the following to fix it:
Modify the underlying code to not swallow the exception, but rather let it bubble up to the next level so you can identify it properly.
Identify the query(s) causing the problem, and then rewrite, reindex, denormalize or throw hardware at the problem. See this for more info: System.Data.SqlClient.SqlException: Timeout expired

I've seen something similar. I believe our problem had to do with failed sessions being re-used (once the session object failed it went into a poor state and could not recover.) We fixed it by increasing the memory for the session pool and increasing the frequency of the web application recycling.
It also was "caused" by a new version that at first blush did not seem to have any change to cause such an effect. However, eventually it became clear that the logic of the program was opening and closing a lot more connections (maybe 20% more) than it used to. This small change pushed the limit of our prior configuration.

You might check the SQL Server logs for errors. Or, the Web server event log. It sounds like your connection pool could be out of open connections or your db could be out.

Which database calls changed between versions?
The error is obviously telling you one of your database calls isn't returning any data on occasion; I can't think of any cases where a code/assembly issue would cause it.

I have seen something like this when doing something with nHibernate Sessions in a non-thread-safe manner. That would explain why you only see it under load. Would need to see your code to guess at what isn't thread-safe though.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex