Okay, so "async all the way down" is the mandate. But when is it problematic?
For example, if you have limited access to a resource, such as a DbConnection or a file, when do you stop using async methods in favor of synchronous ones?
Let's review the complexity of an asynchronous database call:
(I've omitted .ConfigureAwait(false) for readability.)
// Step 1: OK, no big deal, our connection is closed, so let's open it and wait.
await connection.OpenAsync();
// Connection is open! Let's do some work.
// Step 2: Acquire a reader.
using (var reader = await command.ExecuteReaderAsync())
{
    // Step 3: Start reading results.
    while (await reader.ReadAsync())
    {
        // Get the data.
    }
}
Steps:
Step 1: Should be reasonably innocuous and nothing to worry about.
Step 2: But now we've acquired an open connection from a potentially limited connection pool. What if, while waiting in step 2, other long-running tasks are at the head of the line in the task scheduler?
Step 3: Even worse, we now await with an open connection (and most likely added latency).
Aren't we holding open a connection longer than necessary? Isn't this an undesirable result? Wouldn't it be better to use synchronous methods to lessen the overall connection time, ultimately resulting in our data driven application performing better?
Of course I understand that async doesn't mean faster but async methods provide the opportunity for more total throughput. But as I've observed, there can definitely be weirdness when there are tasks scheduled in-between awaits that ultimately delay the operation, and essentially behave like blocking because of the limitations of the underlying resource.
[Note: this question is focused on ADO, but this also applies to file reads and writes.]
Hoping for some deeper insight. Thank you.
There are a few things to consider here:
Database connection pool limits, specifically the "Max Pool Size", which defaults to 100. The database connection pool has an upper limit on the maximum number of connections. Be sure to set "Max Pool Size=X", where X is the maximum number of database connections you want to have. This applies to both sync and async.
The thread pool settings. The thread pool will not add threads quickly if your load spikes; it only adds a new thread every 500ms or so. See the MSDN Threading Guidelines from 2004 and The CLR Thread Pool 'Thread Injection' Algorithm. On one of my projects, load spiked and requests were delayed due to a lack of available threads to service them; the busy-thread count climbed as new threads were added. Remember that every thread requires 1MB of memory for its stack, so 1,000 threads ~= 1GB of RAM just for threads.
The load characteristics of your project; this relates to the thread pool.
The type of system you are providing; I will assume you are talking about an ASP.NET-style app/API.
The throughput (requests/sec) vs. latency (sec/request) requirements. Async will add to latency but increase throughput.
The database/query performance; this relates to the 50ms recommendation below.
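As a concrete illustration of the thread-injection point above, the pool minimums can be inspected and raised at startup so a burst doesn't wait on the roughly-500ms-per-thread injection heuristic. This is a minimal sketch; the value 200 is an arbitrary example, not a recommendation:

```csharp
using System;
using System.Threading;

class Program
{
    static void Main()
    {
        // Read the current minimums (worker threads and I/O completion threads).
        ThreadPool.GetMinThreads(out int worker, out int iocp);
        Console.WriteLine($"Current minimums: worker={worker}, iocp={iocp}");

        // Raise the worker minimum so a load spike doesn't wait on the
        // thread-injection delay. Returns false if the values are invalid
        // (e.g. above the configured maximums).
        bool ok = ThreadPool.SetMinThreads(200, iocp);
        Console.WriteLine($"SetMinThreads succeeded: {ok}");
    }
}
```

Threads above the minimum are still created lazily; this only changes when the injection delay kicks in, and oversizing it wastes memory, since each thread still reserves ~1MB of stack.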
The article The overhead of async/await in .NET 4.5. (Edit 2018-04-16: the recommendation below applied to WinRT UI-based applications.)
Avoid using async/await for very short methods or having await
statements in tight loops (run the whole loop asynchronously instead).
Microsoft recommends that any method that might take longer than 50ms
to return should run asynchronously, so you may wish to use this
figure to determine whether it’s worth using the async/await pattern.
Also watch Diagnosing issues in ASP.NET Core Applications - David Fowler & Damian Edwards, which talks about issues with the thread pool and using async, sync, etc.
Hopefully this helps
if you have limited access to a resource, as in a DbConnection or a file, when do you stop using async methods in favor of synchronous?
You shouldn't need to switch to synchronous at all. Generally speaking, async only works if it's used all the way. Async-over-sync is an antipattern.
Consider the asynchronous code:
using (connection)
{
    await connection.OpenAsync();
    using (var reader = await command.ExecuteReaderAsync())
    {
        while (await reader.ReadAsync())
        {
        }
    }
}
In this code, the connection is held open while the command is executed and the data is read. Anytime that the code is waiting on the database to respond, the calling thread is freed up to do other work.
Now consider the synchronous equivalent:
using (connection)
{
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
        }
    }
}
In this code, the connection is held open while the command is executed and the data is read. Anytime that the code is waiting on the database to respond, the calling thread is blocked.
With both of these code blocks, the connection is held open while the command is executed and the data is read. The only difference is that with the async code, the calling thread is freed up to do other work.
What if when waiting for step 2, other long running tasks are at the head of the line in the task scheduler?
The time to deal with thread pool exhaustion is when you run into it. In the vast majority of scenarios, it isn't a problem and the default heuristics work fine.
This is particularly true if you use async everywhere and don't mix in blocking code.
For example, this code would be more problematic:
using (connection)
{
    await connection.OpenAsync();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
        }
    }
}
Now you have asynchronous code that, when it resumes, blocks a thread pool thread on I/O. Do that a lot, and you can end up in a thread pool exhaustion scenario.
Even worse now, we await with an open connection (and most likely added latency).
The added latency is minuscule. Like sub-millisecond (assuming no thread pool exhaustion). It's immeasurably small compared to random network fluctuations.
Aren't we holding open a connection longer than necessary? Isn't this an undesirable result? Wouldn't it be better to use synchronous methods to lessen the overall connection time, ultimately resulting in our data driven application performing better?
As noted above, synchronous code would hold the connection open just as long. (Well, OK, a sub-millisecond amount less, but that Doesn't Matter).
But as I've observed, there can definitely be weirdness when there are tasks scheduled in-between awaits that ultimately delay the operation, and essentially behave like blocking because of the limitations of the underlying resource.
It would be worrying if you observed this on the thread pool. That would mean you're already at thread pool exhaustion, and you should carefully review your code and remove blocking calls.
It's less worrying if you observed this on a single-thread scheduler (e.g., UI thread or ASP.NET Classic request context). In that case, you're not at thread pool exhaustion (though you still need to carefully review your code and remove blocking calls).
As a concluding note, it sounds as though you're trying to add async the hard way. It's harder to start at a higher level and work your way to a lower level. It's much easier to start at the lower level and work your way up. E.g., start with any I/O-bound APIs like DbConnection.Open / ExecuteReader / Read, and make those asynchronous first, and then let async grow up through your codebase.
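To sketch that bottom-up approach (the repository class and the Task.Delay stand-ins are mine, not from the original post): keep the existing synchronous method, add a true async counterpart, and migrate callers one by one.

```csharp
using System;
using System.Threading.Tasks;

class UserRepository
{
    // Existing synchronous method: blocks the calling thread during "I/O".
    public string GetUserName(int id)
    {
        System.Threading.Thread.Sleep(50); // stand-in for a blocking DB read
        return $"user-{id}";
    }

    // Step one of bottom-up async: add a true async counterpart that awaits
    // instead of blocking. (Task.Delay stands in for OpenAsync/ReadAsync.)
    public async Task<string> GetUserNameAsync(int id)
    {
        await Task.Delay(50); // stand-in for an awaited DB read
        return $"user-{id}";
    }
}

class Program
{
    static async Task Main()
    {
        var repo = new UserRepository();
        // Callers migrate one by one from GetUserName to GetUserNameAsync,
        // and async "grows up" through the codebase.
        Console.WriteLine(await repo.GetUserNameAsync(42)); // prints "user-42"
    }
}
```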
Due to the way database connection pooling works at the lower protocol levels, the high-level open/close commands don't have much effect on performance. Generally, the internal thread scheduling of IO is not a bottleneck unless you have some really long-running tasks - we're talking something CPU-intensive or, worse, blocking - inside. That will quickly exhaust your thread pool and things will start queuing up.
I would also suggest you investigate http://steeltoe.io, particularly its circuit breaker (Hystrix) implementation. The way it works is that it lets you group your code into commands, and have command execution managed by command groups, which are essentially dedicated and segregated thread pools. The advantage is that if you have a noisy, long-running command, it can only exhaust its own command group's thread pool without affecting the rest of the app. There are many other advantages to this portion of the library, the primary one being the circuit breaker implementation, and one of my personal favorites: collapsers. Imagine multiple incoming calls for a query GetObjectById being grouped into a single select * where id in (1,2,3) query, with the results then mapped back onto the separate inbound requests. The DB call is just an example; it can be anything, really.
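To make the collapser idea concrete, here is a minimal hand-rolled sketch (this is not the Steeltoe/Hystrix API, and error handling and timeouts are omitted): callers request a single id, the collapser buffers requests for a short window, issues one batched lookup, and fans the results back out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Minimal collapser sketch: buffers single-id requests for a short window,
// then runs ONE batch query and completes each caller's task individually.
class Collapser
{
    private readonly Func<int[], Task<Dictionary<int, string>>> _batchFetch;
    private readonly TimeSpan _window;
    private readonly object _gate = new object();
    private Dictionary<int, TaskCompletionSource<string>> _pending;

    public Collapser(Func<int[], Task<Dictionary<int, string>>> batchFetch, TimeSpan window)
    {
        _batchFetch = batchFetch; // e.g. issues SELECT ... WHERE id IN (...)
        _window = window;
    }

    public Task<string> GetAsync(int id)
    {
        lock (_gate)
        {
            if (_pending == null)
            {
                _pending = new Dictionary<int, TaskCompletionSource<string>>();
                _ = FlushAfterWindowAsync(); // first caller in the window schedules the flush
            }
            if (!_pending.TryGetValue(id, out var tcs))
            {
                tcs = new TaskCompletionSource<string>();
                _pending[id] = tcs; // duplicate ids share one pending entry
            }
            return tcs.Task;
        }
    }

    private async Task FlushAfterWindowAsync()
    {
        await Task.Delay(_window);
        Dictionary<int, TaskCompletionSource<string>> batch;
        lock (_gate) { batch = _pending; _pending = null; }

        // One query for every id collected during the window.
        var results = await _batchFetch(batch.Keys.ToArray());
        foreach (var kv in batch)
            kv.Value.TrySetResult(results[kv.Key]);
    }
}
```

The real library adds per-request timeouts, fallbacks, and metrics on top of the same idea.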
Significant amounts of iteration introduce significant added latency and extra CPU usage
See http://telegra.ph/SqlDataReader-ReadAsync-vs-Read-04-18 for details.
As suspected:
Using async does not come without cost and requires consideration.
Certain types of operations lend themselves well to async, and others are problematic (for what should be obvious reasons).
High-volume synchronous/blocking code has its downsides, but for the most part it is well managed by modern threading:
Testing / Profiling
4 x 100 parallel queries, 1,000 records each query.
Performance Profile for Synchronous Query
Average Query: 00:00:00.6731697, Total Time: 00:00:25.1435656
Performance Profile for Async Setup with Synchronous Read
Average Query: 00:00:01.4122918, Total Time: 00:00:30.2188467
Performance Profile for Fully Async Query
Average Query: 00:00:02.6879162, Total Time: 00:00:32.6702872
Assessment
The above results were run on SQL Server 2008 R2 using a .NET Core 2 console application. I invite anyone who has access to a modern instance of SQL Server to replicate these tests, to see if there is a reversal in the trend. If you find my testing method flawed, please comment so I can correct and retest.
As you can easily see in the results, the more asynchronous operations we introduce, the longer the queries take, and the longer the total time to complete. Even worse, fully asynchronous execution uses more CPU overhead, which is counterproductive to the idea that using async tasks would provide more available thread time. This overhead could be due to how I'm running these tests, but it's important to treat each test in a similar way to compare. Again, if anyone has a way to prove that async is better, please do.
I'm proposing here that "async all the way" has its limitations and should be seriously scrutinized at certain iterative levels (like file or data access).
Related
I have a busy ASP.NET 5 Core app (thousands of requests per second) that uses SQL Server. Recently we decided to try to switch some hot code paths to async database access and... the app didn't even start. I get this error:
The timeout period elapsed prior to obtaining a connection from the
pool. This may have occurred because all pooled connections were in
use and max pool size was reached.
And I see the number of threads in the thread pool growing to 40... 50... 100...
The code pattern we use is fairly simple:
using (var cn = new SqlConnection(connectionString))
{
    cn.Open();
    var data = await cn.QueryAsync("SELECT x FROM Stuff WHERE id=@id", new { id }); // QueryAsync is from Dapper
}
I made a process dump and all threads are stuck on the cn.Open() line, just sitting there and waiting.
This mostly happens during application "recycles" on IIS, when the app process is restarted and HTTP requests are queued from one process to another, resulting in tens of thousands of requests in the queue that need to be processed.
Well, yeah, I get it. I think I know what's happening: async makes the app scale more. And while the database is busy responding to my query, control is returned to other threads, which try to open more, and more, and more connections in parallel. The connection pool maxes out. But why aren't the closed connections returned to the pool immediately after the work is finished?
Switching from async to "traditional" code fixes the problem immediately.
What are my options?
Increasing max pool size from the default 100? Tried 200, didn't help. Should I try, like, 10000?
Using OpenAsync instead of Open? Didn't help.
I thought I'm running into this problem https://github.com/dotnet/SqlClient/issues/18 but nope, I'm on a newer version of SqlClient and it's said to be fixed. Supposedly.
Not use async with database access at all? Huh...
Do we really have to come up with our own throttling mechanisms when using async like this answer suggests? I'm surprised there's no built-in workaround...
P.S. Taking a closer look at the process dump, I checked the Tasks report and discovered literally tens of thousands of blocked tasks in the waiting state. And there are exactly 200 db-querying tasks (which is the size of the connection pool) waiting for queries to finish.
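For reference, the throttling idea from that linked answer looks roughly like this: cap concurrent DB work at (or below) "Max Pool Size" with a SemaphoreSlim, so excess requests wait in memory instead of timing out while trying to grab a pooled connection. (A sketch; the names are mine.)

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class DbGate
{
    // Cap concurrent DB work at (or below) "Max Pool Size" so excess
    // requests queue here instead of timing out inside the connection pool.
    private static readonly SemaphoreSlim Throttle = new SemaphoreSlim(100, 100);

    public static async Task<T> RunThrottledAsync<T>(Func<Task<T>> dbWork)
    {
        await Throttle.WaitAsync();
        try
        {
            // Open the connection, query, and dispose entirely inside the
            // gate, so at most 100 connections are ever in use at once.
            return await dbWork();
        }
        finally
        {
            Throttle.Release();
        }
    }
}
```

Usage would be something like `var data = await DbGate.RunThrottledAsync(() => QueryStuffAsync(id));`, where QueryStuffAsync wraps the Dapper snippet above.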
Well, after a bit of digging, investigating source codes and tons of reading, it appears that async is not always a good idea for DB calls.
As Stephen Cleary (the god of async who wrote many books about it) has nailed it - and it really clicked with me:
If your backend is a single SQL server database, and every single request
hits that database, then there isn't a benefit from making
your web service asynchronous.
So, yes, async helps you free up some threads, but the first thing these threads do is rush back to the database.
Also this:
The old-style common scenario was client <-> API <-> DB, and in that
architecture there's no need for asynchronous DB access
However, if your database is a cluster or a cloud or some other "autoscaling" thing, then yes, async database access makes a lot of sense.
Here's also an old archive.org article by Rick Anderson that I found useful: https://web.archive.org/web/20140212064150/http://blogs.msdn.com/b/rickandy/archive/2009/11/14/should-my-database-calls-be-asynchronous.aspx
We're going to create a new API in .NET Core 2.1. This Web API will handle high traffic, something like 10,000 transactions per minute or higher. I usually create my API like below.
[HttpGet("some")]
public IActionResult SomeTask(int id)
{
    var result = _repository.GetData(id);
return Ok(result);
}
If we implement our Web API like below, what would be the benefit?
[HttpGet("some")]
public async Task<IActionResult> SomeTask(int id)
{
    var result = await _repository.GetData(id);
return Ok(result);
}
We're also going to use EF Core for this new API. Should we use the EF async methods as well, if we make the action an async Task?
What you're really asking is the difference between sync and async. In very basic terms, async allows the possibility of a thread switch, i.e. work begins on one thread, but finishes on another, whereas sync holds onto the same thread.
That in itself doesn't really mean much without the context of what's happening in a particular application. In the case of a web application, you have a thread pool. The thread pool generally comprises 1,000 threads, as that's the typical default across web servers. That number can be less or more; it's not really important to this discussion. It is important to note, though, that there is a very real physical limit to the maximum number of threads in a pool, since each one consumes some amount of system resources.
This thread pool, then, is often also referred to as the "max requests", since generally speaking one request = one thread. Therefore, if you have a thread pool of 1000, you can theoretically serve 1000 simultaneous requests. Anything over that gets queued and will be handled once one of the threads is made available. That is where async comes in.
Async work is pretty much I/O work: querying a database, read/writing to a file system, making a request to another service, such as an API, etc. With all of those, there's generally some period of idle time. For example, with a database query, you make the query, and then you wait. It takes some amount of time for the query to make it to the database server, for the database server to process it and generate the result set, and then for the database server to send the result back. Async allows the active thread to be returned to the pool during such periods, where it can then service other requests. As such, assuming you have an action like this that is making a database query, if it was sync and you received 1001 simultaneous requests to that action, the first 1000 would begin processing and the last one would be queued. That last request could not be handled until one of the other 1000 completely finished. Whereas, with async, as soon as one of the thousand handed off the query to the database server, it could be returned to the thread pool to handle that waiting request.
This is all a little high level. In actuality, there's a lot that goes into this and it's really not so simple. Async doesn't guarantee that the thread will be released. Certain work, particularly CPU-bound work, can never be async, so even if you do it in an async method, it runs as if it were sync. However, generally speaking, async will handle more requests than sync in a scenario where you're thread-starved. It does come at a cost, though. The extra work of switching between threads adds some amount of overhead, even if it's minuscule, so async will almost invariably be slower than sync, even if only by nanoseconds. However, async is about scale, not performance, and the performance hit is generally an acceptable trade for the increased ability to scale.
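The scale-versus-performance point can be demonstrated without a web server: a hundred concurrent awaited delays finish in roughly the time of one, because no thread is held while waiting.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        var sw = Stopwatch.StartNew();

        // 100 concurrent "requests", each waiting 100ms on simulated I/O.
        // While awaiting, no thread is consumed, so the waits all overlap.
        var requests = Enumerable.Range(0, 100).Select(_ => Task.Delay(100));
        await Task.WhenAll(requests);

        sw.Stop();
        // Total is on the order of 100ms, not 100 * 100ms = 10s.
        Console.WriteLine($"100 concurrent waits took {sw.ElapsedMilliseconds}ms");
    }
}
```

With blocking sleeps instead of awaited delays, the same workload would need 100 threads to finish in the same wall-clock time.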
I understand the benefits of async on the frontend (some kind of UI). With async the UI does not block and the user can "click" on other things while waiting for another operation to execute.
But what is the benefit of async programming on the backend?
The main benefit is that there can be various slow operations on the backend, which would otherwise tie up threads that other requests need. These operations could be:
1. Database operations
2. File operations
3. Remote calls to other services (servers), etc.
You don't want to block a thread while these operations are in progress.
First of all, there is a benefit to handling more than one request at a time. Frameworks like ASP.NET or Django create new (or reuse existing) threads for each request.
If you mean async operations within the thread of a particular request, that's more complicated. In most cases, it does not help at all, because of the overhead of spawning a new thread. But we have things like schedulers in C#, for example, which help a lot. When correctly used, they free up a lot of CPU time normally wasted on waiting.
For example, you send a picture to a server. Your request is handled in a new thread. This thread can do everything on its own: unpack the picture and save it to disk, then update the database.
Or, you can write to disk AND update the database at the same time. The thread that is done first is our focus here. When used without a scheduler, it starts spinning in a loop, checking whether the other thread is done, which takes CPU time. If you use a scheduler, it frees that thread. When the other task is done, it probably uses yet another pre-created thread to finish handling your request.
That scenario does make it seem like it's not worth the fuss, but it is easy to imagine more complicated tasks that can be done at the same time instead of sequentially. On top of that, schedulers are rather smart and will make it so the total time needed is lowest and CPU usage moderate.
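The picture-upload example above, written with Task.WhenAll so that no thread spins waiting for the other operation (SaveToDiskAsync and UpdateDatabaseAsync are hypothetical stand-ins, with Task.Delay simulating the I/O):

```csharp
using System;
using System.Threading.Tasks;

class UploadHandler
{
    // Hypothetical stand-ins for the real disk write and database update;
    // Task.Delay simulates the I/O wait.
    public static Task SaveToDiskAsync(byte[] picture) => Task.Delay(100);
    public static Task UpdateDatabaseAsync(int pictureId) => Task.Delay(100);

    public static async Task HandleAsync(byte[] picture, int pictureId)
    {
        // Both operations run at the same time, and no thread spins in a
        // loop waiting for the other: the continuation runs when both finish.
        await Task.WhenAll(SaveToDiskAsync(picture), UpdateDatabaseAsync(pictureId));
        Console.WriteLine("Both done");
    }
}
```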
I need to put a customized logging system of sorts in place for an ASP.NET application. Among other things, it has to log some data per request. I've thought of two approaches:
Approach #1: Commit each entry per request. For example: A log entry is created and committed to the database on every request (using a transient DbContext). I'm concerned that this commit puts an overhead on the serving of the request that would not scale well.
Approach #2: Buffer entries, commit periodically. For example: A log entry is created and added to a concurrent buffer on every request (using a shared lock). When a limit in that buffer is exceeded, an exclusive lock is acquired, the buffered entries are committed to the database in one go (using another, also transient DbContext, created and destroyed only for each commit) and the buffer is emptied. I'm aware that this would make the "committing" request slow, but it's acceptable. I'm also aware that closing/restarting the application could result in loss of uncommitted log entries because the AppDomain will change in that case, but this is also acceptable.
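Approach #2 in minimal form, in case it helps frame the comparison (a sketch; the commit delegate stands in for the transient DbContext bulk insert, and like the real thing it loses whatever is still buffered at shutdown):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

class BufferedLogger
{
    private readonly ConcurrentQueue<string> _buffer = new ConcurrentQueue<string>();
    private readonly int _limit;
    private readonly Action<List<string>> _commit; // stand-in for the DbContext bulk insert
    private readonly object _flushLock = new object();

    public BufferedLogger(int limit, Action<List<string>> commit)
    {
        _limit = limit;
        _commit = commit;
    }

    public void Log(string entry)
    {
        _buffer.Enqueue(entry);
        if (_buffer.Count >= _limit)
            Flush();
    }

    private void Flush()
    {
        // Exclusive lock so only one request pays the commit cost at a time.
        lock (_flushLock)
        {
            if (_buffer.Count < _limit) return; // another thread already flushed
            var batch = new List<string>();
            while (batch.Count < _limit && _buffer.TryDequeue(out var e))
                batch.Add(e);
            _commit(batch);
        }
    }
}
```

With a limit of 10, every tenth Log call pays the commit cost; the entries in between are a cheap enqueue.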
I've implemented both approaches within my requirements, I've tested them and I've strained them as much as I could in a local environment. I haven't deployed yet and thus I cannot test them in real conditions. Both seem to work equally well, but I can't draw any conclusions like this.
Which of these two approaches is the best? I'm concerned about performance during peaks of a couple thousand users. Are there any pitfalls I'm not aware of?
To solve your concern with option 1 about slowing down each request, why not use the TPL to offload the logging to a different thread? Something like this:
public class Logger
{
    public static void Log(string message)
    {
        // Fire and forget on the thread pool. Note: any exception thrown in
        // SaveMessageToDB is unobserved here; in production you'd want a
        // try/catch inside the delegate.
        Task.Run(() => SaveMessageToDB(message));
    }

    private static void SaveMessageToDB(string message)
    {
        // etc.
    }
}
The HTTP request thread wouldn't have to wait while the entry is written. You could also adapt option 2 to do the same sort of thing to write the accumulated set of messages in a different thread.
I implemented a solution that is similar to option 2, but in addition to a count limit, there was also a time limit. If no log entries had been added in a certain number of seconds, the queue would be dumped to the db.
Use log4net, and set its buffer size appropriately. Then you can go home and have a beer the rest of the day... I believe it's Apache licensed, which means you're free to modify/recompile it for your own needs (fitting whatever definition of "integrated in the application, not third party" you have in mind).
Seriously though - it seems way premature to optimize out a single DB insert per request at the cost of a lot of complexity. If you're doing 10+ log calls per request, it would probably make sense to buffer per-request - but that's vastly simpler and less error prone than writing high-performance multithreaded code.
Of course, as always, the real proof is in profiling - so fire up some tests, and get some numbers. At minimum, do a batch of straight inserts vs your buffered logger and determine what the difference is likely to be per-request so you can make a reasonable decision.
Intuitively, I don't think it'd be worth the complexity - but I have been wrong on performance before.
I am working on ASP.NET project and yesterday I saw a piece of code that uses System.Threading.Thread to offload some tasks to a new thread. The thread runs a few SQL statements and logs the result.
Isn't it better to use another approach? For example to have a Windows Service that performs the SQL batch. Then the web page will just enqueue the batch (via WCF).
In general, what are the best practices for multithreading in ASP.NET? Are there justified usages of threads/TPL tasks/etc. in a web page?
My thought when using multi-threading in ASP.NET:
ASP.NET recycles the AppDomain for various reasons, such as a change to web.config, or periodically to avoid memory leaks. The thing is, you don't know the exact time of a recycle. A long-running thread is not suitable, because when ASP.NET recycles, it will take your thread down with it. The right approach in this case is that long-running tasks should run in a background process via a queue, like you mention.
For short-running, fire-and-forget tasks, TPL or async/await are the most appropriate, because they do not block a thread in the thread pool that could be utilized for HTTP requests.
In my opinion this should be solved by raising some kind of flag in the database and a Windows service that periodically checks the flag and starts the job. If the job is too frequent a dedicated queue solution should be used (MSMQ, RabbitMQ, etc.) to avoid overloading the database or the table growing too fast. I don't think communicating directly with the Windows service via WCF or anything else is a good idea because this may result in dropped messages.
That being said sometimes a project needs to run in a shared hosting and cannot setup a dedicated Windows service. In this case a thread is acceptable as a work around that should be removed as soon as the project grows enough to have its own server.
I believe all other threading in ASP.NET is a sign of a problem, except for using Tasks to represent async operations, or in the extremely rare case where you want to perform a computation in parallel in a web project and your project has very few concurrent users (fewer concurrent users than the number of cores).
Why Tasks are useful in ASP.NET?
First reason to use Tasks for async operations is that as of .NET 4.5 async APIs return Tasks :)
Async operations (not to be confused with parallel computations) may be web service calls, database calls, etc. They may be useful for two things:
Fire several of them at once and your job will take time equal to the longest operation. If you fire them in sequential (non-async) fashion, they will take time equal to the sum of the times of each operation, which is obviously more.
They can improve scalability by releasing the thread executing the page - Node.js style. ASP.NET has supported this since forever, but in version 4.5 it is really easy to use. I'll go as far as claiming that it is easier than Node.js because of async/await. Releasing the thread is important because you may deplete the threads in the pool by having them wait. The result is that your website becomes slow once there is a certain number of users, despite the fact that CPU usage is around 30%, simply because new requests are waiting in the queue. If you increase the number of threads in the thread pool, you pay the price of constant context switching by the OS. At a certain point you will get 100% CPU usage, but 40% of it will be spent on context switching. You will increase the throughput, but with diminishing returns. A lot of threads also increase the memory footprint.