How do .NET SQL connection pools scale in a multi-node environment?

I have some fairly typical SQL calls in an app that look something like this (Dapper typically in the middle), .NET 6:
var connection = new SqlConnection("constring");
using (connection)
{
    await connection.OpenAsync();
    using var command = new SqlCommand("sql", connection);
    await command.ExecuteNonQueryAsync();
    await connection.CloseAsync();
}
A request to the app probably generates a half-dozen calls like this, usually returning in 0 to 10 ms. I almost never see SQL utilization (it's SQL Azure) go beyond a high of 5%.
The problem comes when a bot hits the app with 50+ simultaneous requests, all arriving within the same 300 or so milliseconds. This causes the classic error:
InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached
I have the following things in place:
- I have the connection string set to a max pool size of 250 (see the sketch just after this list).
- I'm running three nodes as an Azure App Service.
- The call stacks are all async.
- I do have ARR Affinity on because I'm using SignalR, but I assume the load balancer spreads out the requests, as the bot likely isn't sending ARR cookies.
- The App Services and SQL Server do not break a sweat even with these traffic storms.
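For reference, here is roughly what that connection string looks like; the server, database, and credentials below are placeholders, and the relevant part is the Max Pool Size key (the pool is keyed per exact connection string, per process):
// Hypothetical Azure SQL connection string with the pool size raised to 250.
var connectionString =
    "Server=tcp:myserver.database.windows.net,1433;Initial Catalog=mydb;" +
    "User ID=appuser;Password=...;Encrypt=True;Max Pool Size=250;";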
Here's the question: How do I scale this? I assume human users don't see this and the connection pool exhaustion heals quickly, but it creates a lot of logging noise. The App Service and SQL Server instance are not at all stressed or working beyond their limits, so it appears it's the connection pool mechanics that are a problem. They're kind of a black box abstraction, but a leaky abstraction since I clearly need to know more about them to make them work right.

ASP.NET Core's built-in rate limiting, introduced with .NET 7 on top of the System.Threading.RateLimiting package, is really the right solution here. Test how many concurrent requests your API app and database can comfortably handle, and stall or reject additional requests.
Take the analogy of an emergency room. Do you want to let everyone who walks in the door into the back? No; once all the rooms are full, you make them wait in the waiting room, or you send them away.
So put in a request throttle like this:
using System.Threading.RateLimiting;

builder.Services.AddRateLimiter(options =>
{
    options.GlobalLimiter = PartitionedRateLimiter.Create<HttpContext, string>(httpContext =>
        RateLimitPartition.GetFixedWindowLimiter(
            partitionKey: httpContext.Request.QueryString.Value!,
            factory: partition => new FixedWindowRateLimiterOptions
            {
                AutoReplenishment = true,
                PermitLimit = 50,
                QueueLimit = 10,
                Window = TimeSpan.FromSeconds(10)
            }));

    options.OnRejected = (context, cancellationToken) =>
    {
        context.HttpContext.Response.StatusCode = StatusCodes.Status429TooManyRequests;
        return new ValueTask();
    };
});
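Registering the options alone doesn't throttle anything; the middleware (part of ASP.NET Core since .NET 7) also has to be added to the request pipeline. A minimal sketch, assuming the usual minimal-hosting Program.cs, with MapControllers standing in for whatever endpoints the app already maps:
var app = builder.Build();

app.UseRateLimiter();   // enable the limiter configured above, ahead of the endpoints it protects

app.MapControllers();   // placeholder: map whatever controllers/hubs the app actually uses

app.Run();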

Related

Firebase: First write is slow

I'm currently developing a hybrid mobile app using Ionic. When the app starts up and a user writes to the Realtime Database for the first time, the write is always delayed by around 10 seconds or more, but any subsequent writes are almost instantaneous (less than 1 second).
My measurement of the delay is based on watching the database in the Firebase console.
Is this a known issue, or am I doing something wrong? Please share your views.
EDIT:
The write is happening via Firebase Cloud Function.
This is the call to the Firebase Cloud function
this.http.post(url + "/favouritesAndNotes", obj, this.httpOptions)
    .subscribe((data) => {
        console.log(data);
    }, (error) => {
        console.log(error);
    });
This is the actual function
app.post('/favouritesAndNotes', async (request, response) => {
    var db = admin.database().ref("users/" + request.body.uid);
    var favourites = request.body.favourites;
    var notes = request.body.notes;
    // Wait for the writes to finish before responding, so the function isn't ended mid-write.
    if (favourites !== undefined) {
        await db.child("favourites/").set(favourites);
    }
    if (notes !== undefined) {
        await db.child("notes/").set(notes);
    }
    console.log("Write successful");
    response.status(200).end();
});
The first time you interact with the Firebase Database in a client instance, the client/SDK has to do quite a few things:
- If you're using authentication, it needs to check whether the token it has is still valid, and refresh it if not.
- It needs to find the server that the database is currently hosted on.
- It needs to establish a web socket connection.
Each of these may take multiple round trips, so even if you're only a few hundred ms from the servers, it adds up.
Subsequent operations from the same client don't have to perform these steps, so they are going to be much faster.
If you want to see what's actually happening, I recommend checking the Network tab of your browser. For the realtime database specifically, I recommend checking the WS/Web Socket panel of the Network tab, where you can see the actual data frames.

Hitting 100 active connections limit in test env with only two users

I have a single web client and a few Lambda functions which use the Admin SDK. I've noticed recently that I've bumped into the 100 simultaneous connection limit, but I really shouldn't be anywhere near that limit. Also, it would appear that the connections established by my Lambda functions are not dropping off even after the function has completed.
Any idea on:
- how I can prevent this run-up on connections from happening?
- how I can release connections established by past Lambda scripts?
- how I can monitor which processes/threads/stacks are holding connections?
Note: this is a testing environment I'm working out of so I'd prefer to keep this in the free tier and my requirements should definitely not be running into the 100 active limit. I am on a paid plan in prod.
I attempt to avoid calling initializeApp more than once by using the following connection code. In the example I'm talking about I only have a single database as a backend and so the default "name" of DEFAULT is used each time.
const runningApps = new Set(firebase.apps.map(i => i.name));
this.app = runningApps.has(name)
    ? firebase.app()
    : firebase.initializeApp({
          credential: firebase.credential.cert(serviceAccount),
          databaseURL: config.databaseUrl
      });
I'm now trying to explicitly close connections with goOffline, but that leads to another issue: on the second connection (i.e., when the DEFAULT app is already set up and it just reuses the connection already established) I get the following logging:
# Generated as result of `goOnline`
Connecting to Firebase: [https://xyz.firebaseio.com]
appears to be already connected
# Listening on ".info/connected" comes back as true, resulting in:
AbstractedAdmin: connected to [DEFAULT]
# but then I get this error
NotAllowed: You must first connect before using the database() API at Object._getFirebaseType
The fact that you have unexpected incoming connections to the database makes it seem like the stale instances keep an open connection.
The best I can think of is to call goOffline() in your function before it completes, to explicitly disconnect. That would probably also mean you have to call goOnline() at the start of the function, since it might be running on an instance that previously went offline. Both goOnline and goOffline are synchronous calls afaik, but there's definitely going to be some time between going online and the data becoming available in your app.
If Lambda has a way for you to detect life-cycle events of its instances, that would be the preferred place to call goOffline and goOnline.
admin.initializeApp should only get called once in your script/node app.
The Firebase SDKs talk HTTP/2 to the Firebase cloud system, so I'm not sure why you would run into max-connection issues, as unique sockets are not stood up per call.
One thing to look out for is that calls to 3rd-party APIs (such as SendGrid) are not supported on the free tier.

F# Http.AsyncRequestStream just 'hangs' on long queries

I am working with:
let callTheAPI = async {
    printfn "\t\t\tMAKING REQUEST at %s..." (System.DateTime.Now.ToString("yyyy-MM-ddTHH:mm:ss"))
    let! response = Http.AsyncRequestStream(url, query, headers, httpMethod, requestBody)
    printfn "\t\t\t\tREQUEST MADE."
    return response
}
And
let cts = new System.Threading.CancellationTokenSource()
let timeout = 1000 * 60 * 4 // 4 minutes (no grace)
cts.CancelAfter(timeout)
let response = Async.RunSynchronously(callTheAPI, timeout, cts.Token)
use respStrm = response.ResponseStream
respStrm.Flush()
writeLinesTo output (responseLines respStrm)
to call a web API (REST), and the let! response = Http.AsyncRequestStream(url,query,headers,httpMethod,requestBody) line just hangs on certain queries, particularly ones that take a long time (>4 minutes). This is why I have made it async and put in a 4-minute timeout. (I collect the calls that time out and re-make them with smaller time-range parameters.)
I started with Http.RequestStream from FSharp.Data first, but I couldn't add a timeout to it, so the script would just 'hang'.
I have looked at the API's IIS server and the application pool Worker Process active requests in IIS manager and I can see the requests come in and go again. They then 'vanish' and the F# script hangs. I can't find an error message anywhere on the script side or server side.
I included the Flush() and removed the timeout and it still hung. (Removing the Async in the process)
Additional:
Successful calls are made. Failed calls can be followed by successful calls. However, it seems to get to a point where all the calls time out, and they do so without even reaching the server any more. (Worker Process Active Requests doesn't show the query.)
Update:
I made the Fsx script output the queries and ran them through IRM with no issues (I have a timeout and it never locks up). I have a suspicion that there is an issue with FSharp.Data.Http.
Async.RunSynchronously blocks. Read the remarks section in the docs: RunSynchronously. Instead, use Async.AwaitTask.

SignalR Issues with multiple requests

In my application, I'm hosting a fairly CPU-intensive engine on a web server, which is connected to clients via SignalR. From the client, the server is signalled to do some work (via an AJAX request), and every 200 ms it sends down a queue of "animation events" which describe the work being done.
This is the code used to set up the connection on the client:
$.connection.hub.start({ transport: ['webSockets', 'serverSentEvents', 'longPolling'] })
And here's the related code in the backend:
private const int PUSH_INTERVAL = 200;
private ManualResetEvent _mrs;

private void SetupTimer(bool running)
{
    if (running)
    {
        UpdateTimer = new Timer(PushEventQueue, null, 0, PUSH_INTERVAL);
    }
    else
    {
        /* Lock here to prevent race condition where the final call to PushEventQueue()
         * could be followed by the timer calling PushEventQueue() one last time and
         * thus the End event would not be the final event to arrive clientside,
         * which causes a crash */
        _mrs = new ManualResetEvent(false);
        UpdateTimer.Dispose(_mrs);
        _mrs.WaitOne();
        Observer.End();
        PushEventQueue(null);
    }
}

private void PushEventQueue(object state)
{
    SentMessages++;
    SignalRConnectionManager<SimulationHub>.PushEventQueueToClient(ConnectionId, new AnimationEventSeries { AnimationPackets = SimulationObserver.EventQueue.FlushQueue(), UpdateTime = DateTime.UtcNow });
}

public static void PushEventQueueToClient(string connectionId, AnimationEventSeries series)
{
    HubContext.Clients.Client(connectionId).queue(series);
}
And for completeness' sake, the related Javascript method:
self.hub.client.queue = function(data) {
self.eventQueue.addEvents(data);
};
When testing this functionality on localhost, it works absolutely smoothly, with no delay (as you would expect), using serverSentEvents as a transport method.
However, when used in production, this more often than not takes a very long time to complete. Using SignalR's logging and a bit of my own instrumentation, it can be seen that the first series of events reaches the client within a couple of seconds, which is totally acceptable. However, after that SignalR often gives the following error:
Keep alive has been missed, connection may be dead/slow.
Followed soon after by:
Keep alive timed out. Notifying transport that connection has been lost.
This will happen a few times, and then eventually, up to a minute later, the events will arrive, with my own instrumentation showing that they were sent from the server approximately 200 ms apart, as expected. It can also be seen that in production they were sent with the primary transport method, web sockets.
Is anyone aware of any issues that sending multiple SignalR requests on a timer might cause? Like I say, this primarily seems to happen with web sockets. I've been told that using web sockets is best practice, so I'm keen to keep using them, but if there isn't a workaround to these kinds of issues, then I'm afraid I'll have to remove them permanently.
Edit
I've now removed the option to use web sockets on the live site, and I'm running into the same issues with server sent events - several failed attempts to reconnect after the first queue update arrives.
Summing up our discussion, I don't think there are specific issues with WebSockets/SignalR on Azure.
I've got sample code here: https://github.com/jonegerton/SignalR.StockTicker which can be used for testing, with some minor tweaks (I'll probably develop it as a test platform at some point).
It's based on the sample project from MS, which can be found here: https://github.com/SignalR/SignalR-StockTicker.
I've put an example on Azure here (http://stockticker.azurewebsites.net) for testing purposes. It has the default transport configurations enabled (i.e. webSockets >> serverSentEvents >> longPolling).

HSM - cryptoki - Sessions - Timeout

My application accesses the HSM via an ASP.NET web service through PKCS#11. I initialise the cryptoki library and obtain a session handle. The web service holds on to this handle to perform encryption/decryption/signing/verifying in batch mode.
The problem I am facing is this:
The ASP.NET web service times out after 20 minutes. This, I think, unloads the cryptoki library, and the session handle held by the web service becomes invalid. Yes, I agree that the ASP.NET web service can be reconfigured not to time out, which would keep the cryptoki library always loaded.
My question is: what happens to the session handle that I obtained from the HSM in the first place? Will it be lost, or will it sit there unused? I am asking this because I am not closing the opened session properly by calling C_CloseSession.
The web service is implemented via a thread pool.
Thanks
You are supposed to call C_Finalize() when you are done using the cryptoki library. A well-written implementation might be robust against you not doing so, but there are no guarantees. Your open sessions may be kept alive on the HSM and perhaps in the driver.
Strongly consider calling C_Finalize() from your Application_End().
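For example, a rough sketch of that in Global.asax, assuming the cryptoki library is loaded once through the Pkcs11Interop wrapper used in the last question below; the path and member names are illustrative only:
using System;
using Net.Pkcs11Interop.HighLevelAPI;

public class Global : System.Web.HttpApplication
{
    // Hypothetical: the cryptoki library loaded once for the whole application.
    public static Pkcs11 HsmLibrary { get; private set; }

    protected void Application_Start(object sender, EventArgs e)
    {
        HsmLibrary = new Pkcs11(@"C:\path\to\cryptoki.dll", true);
    }

    protected void Application_End(object sender, EventArgs e)
    {
        // Disposing the wrapper calls C_Finalize() on the underlying library,
        // which closes any sessions still open on the HSM.
        HsmLibrary?.Dispose();
    }
}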
From the theoretical perspective, you should read the PKCS#11 spec; it is all written there, from section 6.6 onwards.
From the practical perspective, an application becomes a cryptoki application after it calls C_Initialize. The concept of a session and its identifier may (or may not) be relayed by a small wrapper library to a long-running PKCS#11 process that actually talks to the HSM. If the process that was a cryptoki application dies, so do all its virtual resources (which is what a session is).
Where exactly is the problem? Opening a session is usually a pretty cheap operation. Unless you are sure (have measured) that it is the bottleneck, don't optimize: open and close a session per request if you can't control the lifespan of the cryptoki process, as sketched below.
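In code, that per-request pattern is just the following (a sketch only, using the Pkcs11Interop-style API from the question further down; slot is assumed to have been obtained at startup):
// Hypothetical per-request usage: the using block closes the session (C_CloseSession) when done,
// so nothing is left dangling on the HSM even if the worker process later recycles.
using (Session session = slot.OpenSession(true))
{
    // ... perform the encrypt/decrypt/sign/verify calls for this one request ...
}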
If I understood correctly, you need to create a "global" login for that session, and then open/close a session for each local operation.
So:
- Keep a global variable with the "login" (done once, on startup or whenever you want).
- Check the global login status whenever you create a new session.
- Create individual sessions for each action (closing the "local" session, not the global login).
With this you get a global variable holding a logged-in session, and individual sessions that use that global login, roughly as sketched below.
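A sketch of that idea, again on top of the Pkcs11Interop-style API from the question below; slot and userLoginPin are assumed to exist already:
// Hypothetical "global login": one long-lived session logged in once at startup.
// In PKCS#11 the login state belongs to the token for the whole application,
// so other sessions opened by the same app share it.
Session loginSession = slot.OpenSession(true);
loginSession.Login(CKU.CKU_USER, userLoginPin);

// Per action: a short-lived session that reuses that login and is closed when done.
using (Session workSession = slot.OpenSession(true))
{
    // ... encrypt/decrypt/sign/verify for this one action ...
}

// Only on shutdown: log out and close the global session.
loginSession.Logout();
loginSession.Dispose();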
Good luck
I also have this problem, and the year is 2020 :S
This time it's the .NET Framework + REST API combination that has the problem.
I'm using the HSM for a decrypt method. I have an interactive-channel login method, and we need to run a performance test. The service holds a single Pkcs11 instance:
pkcs11 = new Pkcs11(hsmPath, true);
slot = GetUsableSlot(pkcs11);
TokenInfo tokenInfo = slot.GetTokenInfo();
session = slot.OpenSession(true);
session.Login(CKU.CKU_USER, userLoginPin);
secretKey = GenerateKey(session);
And this is the Decrypt method.
public byte[] Decrypt(byte[] encryptedTextByteArray)
{
    Mechanism mechanism = new Mechanism(CKM.CKM_AES_ECB);
    byte[] sourceData = encryptedTextByteArray;
    byte[] decryptedData = null;
    using (MemoryStream inputStream = new MemoryStream(sourceData), outputStream = new MemoryStream())
    {
        try
        {
            session.Decrypt(mechanism, secretKey, inputStream, outputStream, 4096);
        }
        catch (Pkcs11Exception ex)
        {
            throw;
        }
        decryptedData = outputStream.ToArray();
    }
    return decryptedData;
}
When I run a performance test using the Postman runner, there is no problem with one thread.
If I increase the thread count, these errors appear:
First error: CKR_OPERATION_ACTIVE
Next error: CKR_DEVICE_MEMORY
I tried these approaches:
- For every request, closing the session and opening a new session for the next request. It did not succeed; the same errors appeared (and of course request and response times increased).
- For every request, closing the connection and opening a new connection for the next request. The same errors appeared (and again request and response times increased).
Can anyone help me? :)
