Async Operation in A Flink Sink using CompletableFuture - asynchronous

Background
Planning to set a up data pipeline using Flink.
The flow looks like this
Kafka --> Flink Job --> gRPC endpoint
Story so far
Blocking implementation is up and running. But that will not scale for high QPS
Tried simulating async behavior here
Problem
For Async Behavior not sure how the behavior would be
if CompletableFuture is used, per message it will be processed in Async manner, but will the next message be fetched for processing before processing of first is complete ? In other words, there is a way to achieve async processing within a task manager. But what is the behavior of Task manager in fetching next message / tuple ? Will is wait till Async process is complete or will it submit to CompletableFuture / Thread and fetch next message ? Not clear about that
Will using a custom threadpool cause any issues if not shutdown as the pipeline will be running over a long period ?
Any other solution to achieve async behavior in Flink sink ?

I would leverage Flink's support for async operators, and have a DiscardingSink, versus trying to implement a custom async sink.
And no, I don't see any reason why having a persistent thread pool would cause problems.

Related

flink checkpointing halts while async i/o operator is awaiting response

Async i/o operator (1.15.2) is waiting for the future to return (basically let the embedded function complete its processing), till then checkpointing gets halt at this operator. And only proceeds once it gets finished. attaching the screenshot.
Checkpointing in progress
Basically I would like to test the scenario where async io is waiting for the response & job is restarted. So ideally the last checkpoint will get restored & technically the async i/o processing should get restarted again with the same set of data.
But as checkpoint is stuck, & if i restart the job, than previous checkpoint is restored which doesn’t have the correct set of data.
Please guide how should I tackle this scenario.
You should never let Flink end up waiting; all i/o should be done asynchronously. Otherwise, as you've seen, you'll end up interfering with checkpointing.
The async i/o operator is designed to help with this, but you must either use an asynchronous client library for connecting to the external service, or make your synchronous calls in another thread. See https://stackoverflow.com/a/64225825/2000823 for an example of how to do this.

Async. programming in .Net Core

I was reading the documentation of Microsoft specifically the Async programming article and I didn't understand this section while he is explaining the work of the server's threads when using Async code.
because it(The server) uses async and await, each of its threads is freed up when the I/O-bound work starts, rather than when it finishes.
Could anyone help what does it mean by the threads r freed up when the I/O starts??
Here is the article : https://learn.microsoft.com/en-us/dotnet/standard/async-in-depth
When ASP.NET gets an HTTP request, it takes a thread from the thread pool and uses that to execute the handler for that request (e.g., a specific controller action).
For synchronous actions, the thread stays assigned to that HTTP request until the action completes. For asynchronous actions, the await in the action method may cause the thread to return an incomplete task to the ASP.NET runtime. In this case, ASP.NET will free up the thread to handle other requests while the I/O is in flight.
Further reading about the difference between synchronous and asynchronous request handling and how asynchronous work doesn't require a thread at all times.
When your application makes a call to an external resource like Database or HttpClient thread, that initiated connection needs to wait.
Until it gets a response, it waits idly.
In the asynchronous approach, the thread gets released as soon as the app makes an external call.
Here is an article about how it happens:
https://medium.com/#karol.rossa/asynchronous-programming-73b4f1988cc6
And performance comparison between async and sync apporach
https://medium.com/#karol.rossa/asynchronous-performance-1be01a71925d
Here's an analogy for you: have you ever ordered at a restaurant with a large group and had someone not be ready to order when the waiter came to them? Did they bring in a different waiter to wait for him or did the waiter just come back to him after he took other people's orders?
The fact that the waiter is allowed to come back to him later means that he's freed up immediately after calling on him rather than having to wait around until he's ready.
Asynchronous I/O works the same way. When you do a web service call, for example, the slowest part (from the perspective of the client at least) is waiting for the result to come back: most of the delay is introduced by the network (and the other server), during which time the client thread would otherwise have nothing to do but wait. Async allows the client to do other things in the background.

Sending a message to a service bus queue via .NET Core Logger?

I have created a custom logger which sends messages to a service bus queue. My problem is that the Microsoft.Extensions.Logging.ILogger has no async Log method, so am having to call an async method synchronously which can cause issues.
_queueClient.SendAsync(message).Wait();
So my questions is, what is the safest way for me to accomplish this without it causing thread starvation?
Net Core Logging is designed specifically to not use async methods. They believe that logging should be quick enough that it doesnt need it.
"Logging should be so fast that it isn't worth the performance cost of
asynchronous code."
https://learn.microsoft.com/en-us/aspnet/core/fundamentals/logging/?view=aspnetcore-2.2
The recommendation is that if you need to log stuff like this, you use a "Batched Logger" instead.
"Instead, synchronously add log messages to an in-memory queue and have
a background worker pull the messages out of the queue to do the
asynchronous work of pushing data to SQL Server."
So, basically, you should move your queue calls into a background worker (HostedService) and get it to do the async calls in the background worker instead. Your normal "logger" method should call something that sticks the messages into a MemoryCache queue or something similar and then your background worker will fetch the out of there in the background. I'm sure (off the top of my head) the Azure Logger has a simple example of this sort of thing

Web API 2 - are all REST requests asynchronous?

Do I need to do anything to make all requests asynchronous or are they automatically handled that way?
I ran some tests and it appears that each request comes in on its own thread, but I figure better to ask as I might have tested wrong.
Update: (I have a bad habit of not explaining fully - sorry) Here's my concern. A client browser makes a REST request to my server of http://data.domain/com/employee_database/?query=state:Colorado. That comes in to the appropriate method in the controller. That method queries the database and returns an object which is then turned into a JSON structure and returned to the calling app.
Now let's say 10,000 clients all make a similar query to the same server. So I have 10,000 requests coming in at once. Will my controller method be called simultaneously in 10,000 distinct threads? Or must the first request return before the second request is called?
I'm not asking about the code in my handler method having asynchronous components. For my case the request becomes a single SQL query so the code has nothing that can be handled asynchronously. And until I get the requested data, I can't return from the method.
No REST is not async by default. the request are handled synchronously. However, your web server (IIS) has a number of max threads setting which can work at the same time, and it maintains a queue of the request received. So, the request goes in the queue and if a thread is available it gets executed else, the request waits in the IIS queue till a thread is available
I think you should be using async IO/operations such as database calls in your case. Yes in Web Api, every request has its own thread, but threads can run out if there are many consecutive requests. Also threads use memory so if your api gets hit by too many request it may put pressure on your system.
The benefit of using async over sync is that you use your system resources wisely. Instead of blocking the thread while it is waiting for the database call to complete in sync implementation, the async will free the thread to handle more requests or assign it what ever process needs a thread. Once IO (database) call completes, another thread will take it from there and continue with the implementation. Async will also make your api run faster if your IO operations take longer to complete.
To be honest, your question is not very clear. If you are making an HTTP GET using HttpClient, say the GetAsync method, request is fired and you can do whatever you want in your thread until the time you get the response back. So, this request is asynchronous. If you are asking about the server side, which handles this request (assuming it is ASP.NET Web API), then asynchronous or not is up to how you implemented your web API. If your action method, does three things, say 1, 2, and 3 one after the other synchronously in blocking mode, the same thread is going to the service the request. On the other hand, say #2 above is a call to a web service and it is an HTTP call. Now, if you use HttpClient and you make an asynchronous call, you can get into a situation where one request is serviced by more than one thread. For that to happen, you should have made the HTTP call from your action method asynchronously and used async keyword. In that case, when you call await inside the action method, your action method execution returns and the thread servicing your request is free to service some other request and ultimately when the response is available, the same or some other thread will continue from where it was left off previously. Long boring answer, perhaps but difficult to explain just through words by typing, I guess. Hope you get some clarity.
UPDATE:
Your action method will execute in parallel in 10,000 threads (ideally). Why I'm saying ideally is because a CLR thread pool having 10,000 threads is not typical and probably impractical as well. There are physical limits as well as limits imposed by the framework as well but I guess the answer to your question is that the requests will be serviced in parallel. The correct term here will be 'parallel' but not 'async'.
Whether it is sync or async is your choice. You choose by the way to write your action. If you return a Task, and also use async IO under the hood, it is async. In other cases it is synchronous.
Don't feel tempted to slap async on your action and use Task.Run. That is async-over-sync (a known anti-pattern). It must be truly async all the way down to the OS kernel.
No framework can make sync IO automatically async, so it cannot happen under the hood. Async IO is callback-based which is a severe change in programming model.
This does not answer what you should do of course. That would be a new question.

Custom Windows Workflow activity that executes an asynchronous operation - redone using generic service

I am writing a custom Windows Workflow Foundation activity, that starts some process asynchronously, and then should wake up when an async event arrives.
All the samples I’ve found (e.g. this one by Kirk Evans) involve a custom workflow service, that does most of the work, and then posts an event to the activity-created queue. The main reason for that seems to be that the only method to post an event [that works from a non-WF thread] is WorkflowInstance.EnqueueItem, and the activities don’t have access to workflow instances, so they can't post events (from non-WF thread where I receive the result of async operation).
I don't like this design, as this splits functionality into two pieces, and requires adding a service to a host when a new activity type is added. Ugly.
So I wrote the following generic service that I call from the activity’s async event handler, and that can reused by various async activities (error handling omitted):
class WorkflowEnqueuerService : WorkflowRuntimeService
{
public void EnqueueItem(Guid workflowInstanceId, IComparable queueId, object item)
{
this.Runtime.GetWorkflow(workflowInstanceId).EnqueueItem(queueId, item, null, null);
}
}
Now in the activity code, I can obtain and store a reference to this service, start my async operation, an when it completes, use this service to post an event to my queue. The benefits of this - I keep all the activity-specific code inside activity, and I don't have to add new services for each activity types.
But seeing the official and internet samples doing it will specialized non-reusable services, I would like to check if this approach is OK, or I’m creating some problems here?
There is a potential problem here with regard to workflow persistence.
If you create long running worklfows that are persisted in a database to the runtime will be able to restart these workflows are not reloaded into memory until there is some external event that reloads them. As there they are responsible for triggering the event themselves but cannot until they are reloaded. And we have a catch 22 :-(
The proper way to do this is using an external service. And while this might feel like dividing the code into two places it really isn't. The reason is that the workflow is responsible for the big picture, IE what should be done. And the runtime service is responsible for the actual implementation or how it should be done. That way you can change the how without changing the why and when part.
A followup - regardless of all the reasons, why it "should be done" using a service, this will be directly supported by .NET 4.0, which provides a clean way for an activity to start an asynchronous work, while suspending the persistence of the activity.
See
http://msdn.microsoft.com/en-us/library/system.activities.codeactivitycontext.setupasyncoperationblock(VS.100).aspx
for details.

Resources