I have two workflows hosted as services, and from workflow-1 I invoke workflow-2 through durable duplex. Workflow-1 sends two requests to workflow-2, creating two instances, and keeps running. When the workflow-2 instances are done with their job, they call back the same running workflow-1 instance through correlation. My question is: what happens if the two workflow-2 instances invoke workflow-1 at the same time? Will the calls be executed one by one in queue fashion, or will they be executed on different threads at the same time?
There is no multi-threaded execution of a single workflow instance. The second call will be scheduled when it is received and will start processing as soon as the scheduler gets to it.
PROBLEM
Our PROCESSING SERVICE serves UI, API, and internal clients and listens for commands from Kafka.
A few API clients might create a lot of generation tasks (one task is N messages) in a short time. With Kafka, we can't control how commands are distributed, because each command goes to a partition that is consumed by exactly one processing instance (aka worker). As a result, UI requests could be left waiting too long while API requests are being processed.
In an ideal implementation, we would handle all tasks evenly, regardless of their size. The capacity of the processing service would be distributed among all active tasks, and even if the cluster is heavily loaded, we could be sure that a newly arrived task will start processing almost immediately, at least before the processing of all other tasks ends.
SOLUTION
Instead, we want an architecture that looks more like the following diagram, where we have separate queues per combination of customer and endpoint. This architecture gives us much better isolation, as well as the ability to dynamically adjust throughput on a per-customer basis.
On the side of the producer
the task comes from the client
immediately create a queue for this task
send all messages to this queue
On the side of the consumer
in one process, you constantly update the list of queues
in other processes, you follow this list and consume, for example, 1 message from each queue (see the sketch after this list)
scale consumers
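A minimal sketch of the consumer side described above, assuming RabbitMQ's Java client; QueueRegistry, the queue names, and process() are hypothetical placeholders, not an existing API:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.GetResponse;
import java.util.List;

// Round-robin consumer: take at most one message from each per-task queue
// per pass, so no single large task can monopolize a worker.
public class FairConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker address
        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            while (true) {
                // QueueRegistry stands in for the process that maintains the queue list
                List<String> queues = QueueRegistry.currentQueues();
                for (String queue : queues) {
                    GetResponse response = channel.basicGet(queue, false);
                    if (response != null) {
                        process(response.getBody());
                        channel.basicAck(response.getEnvelope().getDeliveryTag(), false);
                    }
                }
            }
        }
    }

    static void process(byte[] body) { /* handle one message */ }
}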
QUESTION
Is there any common solution to such a problem, using RabbitMQ or any other tooling? Historically, we use Kafka on the project, so an approach that uses it would be great, but we can use any technology for the solution.
Why not use Spark to execute the messages within the task? What I'm thinking is that each worker creates a Spark context that then parallelizes the messages. The function that is mapped can be based on which Kafka topic the user is consuming. I suspect, however, that your queues might hold tasks containing a mixture of messages, UI, API calls, etc., which would result in a more complex mapping function. If you're not using a standalone cluster and are using YARN or something similar, you can change the queuing method that the Spark master is using.
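A rough sketch of that idea, assuming a plain Java Spark job where handle() stands in for the per-message-type mapping function:

import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Each worker hands the messages of one task to Spark, which fans them out
// across the cluster.
public class TaskRunner {

    public static void processTask(List<String> messages) {
        SparkConf conf = new SparkConf().setAppName("task-processor");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // handle() is a placeholder for the mapping function keyed off the message type
            sc.parallelize(messages).foreach(msg -> handle(msg));
        }
    }

    static void handle(String msg) { /* process one message */ }
}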
As I understand the problem, you want to create request isolation per customer using dynamically allocated queues, which would allow each customer's tasks to be executed independently. The problem looks similar to the head-of-line blocking issue in networking.
Dynamically allocating queues is difficult. It can also lead to an explosion in the number of queues, which can be a burden on the infrastructure, and some queues could end up empty or very lightly loaded. RabbitMQ won't help here; it is a queue with a different protocol than Kafka.
One alternative is to use a custom partitioner in Kafka that can look at the partition load and balance the tasks based on it. This works if the tasks are independent in nature and no state store is maintained in the worker.
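A skeleton of such a partitioner; note that Kafka's Partitioner interface does not expose load, so PartitionLoadGauge below is a hypothetical component you would have to feed yourself (e.g. from consumer lag metrics):

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Picks the partition our (assumed) gauge reports as least loaded, falling
// back to a random partition when no load data is available yet.
public class LoadAwarePartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        Integer leastLoaded = PartitionLoadGauge.leastLoaded(topic, numPartitions); // hypothetical
        return leastLoaded != null ? leastLoaded
                : ThreadLocalRandom.current().nextInt(numPartitions);
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}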
The other alternative would be to load balance at the customer level. In this case you select a dedicated set of predefined queues for a set of customers: customers with certain IDs are served by a given set of queues. The downside is that some queues can have less load than others. This solution is similar to Virtual Output Queuing in networking.
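For this customer-level alternative, the routing can be as simple as a stable hash from customer ID to a fixed queue set (a sketch; the queue names are made up):

import java.util.List;

// Static sharding: a customer's tasks always land in the same queue, so one
// noisy customer only affects the customers sharing that shard.
public class CustomerRouter {

    static String queueFor(String customerId, List<String> queues) {
        int idx = Math.floorMod(customerId.hashCode(), queues.size());
        return queues.get(idx);
    }

    public static void main(String[] args) {
        List<String> queues = List.of("tasks-0", "tasks-1", "tasks-2", "tasks-3");
        System.out.println(queueFor("customer-42", queues));
    }
}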
My understanding is that partitioning the messages does not ensure an even load balance. I think you should avoid overengineering with custom machinery on top of the Kafka partitioner and instead think about a good partitioning key that allows you to use Kafka efficiently.
An app makes 3 simultaneous HTTP requests to a web server. Using an asynchronous technique, how many worker threads will be tied up waiting for the data?
None would be tied up, because when you're doing asynchronous work, you're not always using a thread.
For example, if you make an async web service request, your client will not be using any threads between the "send" and the "receive".
You unwind after the "send", and the "receive" occurs on an I/O completion port, at which time your callback is invoked and you are then using a thread again. (Note that in this scenario your callback is executed on an I/O thread; ASP.NET only uses worker threads, and only includes worker threads when it counts the threads in use.)
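The question is about ASP.NET, but the same principle can be seen with, say, Java's built-in async HTTP client: the calling thread unwinds right after the send, and a thread is used again only when each callback fires (the URL is a placeholder):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Three simultaneous requests; no thread blocks while waiting for data.
public class AsyncRequests {
    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        for (int i = 1; i <= 3; i++) {
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("https://example.org/data/" + i)).build(); // placeholder URL
            client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                  .thenAccept(resp -> System.out.println(resp.statusCode()));
        }
        Thread.sleep(5000); // keep the JVM alive long enough for the callbacks
    }
}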
I'm using an Azure environment and developing in .NET
I am running a web app (ClientApp) that takes client data to perform a series of calculations. The calculations are performance intensive, so they are running on a separate web app (CalcApp).
Currently, the ClientApp sends the calculation request to the CalcApp. The requests from every client are put into a common queue and run one at a time, FIFO. My goal is to create separate queues for each client and run several calculations concurrently.
I am thinking of using Azure Service Bus queues to accomplish this. On the ClientApp, the service bus code would check for an existing queue for that client and create one if needed. On the CalcApp, the app would periodically check for existing queues; if it finds a new queue, it would create a new QueueClient that uses OnMessageAsync() with RunCalculationsAsync() as the callback.
Is this feasible or even a good idea?
I would consider using multiple consumers instead, perhaps with a topic denoting the "client" if you need to differentiate the type of processing based on which client originated it. Each client can add an entry into the queue, and the consumers "fight" over the messages. There is no chance of the same message being processed twice if you follow this approach.
I'm not sure having multiple queues is necessary.
Here is more information on the Competing Consumers pattern: https://msdn.microsoft.com/en-us/library/dn568101.aspx
You could also build one consumer and spawn multiple threads. In this model, you would have one queue and one consumer but still be able to run more than one calculation at a time. Ultimately, though, competing consumers is far more scalable, and you can use a combination of both strategies.
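A minimal, broker-agnostic sketch of competing consumers; with Azure Service Bus the BlockingQueue would be the remote queue and each worker its own receiver, but the shape is the same (runCalculations() is a stand-in for the real work):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// One shared queue, several workers racing to take messages; each message
// is handed to exactly one worker.
public class CompetingConsumers {
    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        ExecutorService workers = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            workers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        runCalculations(queue.take());
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        // Clients just enqueue work; the workers "fight" over it.
        for (int i = 0; i < 10; i++) queue.add("job-" + i);
    }

    static void runCalculations(String msg) { System.out.println("processed " + msg); }
}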
I need to process about 250,000 documents per day with an EJB 3.1 asynchronous method in order to cope with an overall long-running task.
I do this to use more threads and process more documents concurrently. Here's an example in pseudocode:
// this returns about 250,000 documents per day
List<Document> documentList = Persistence.listDocumentsToProcess();

for (Document currentDocument : documentList) {
    // this is the asynchronous call
    ejbInstance.processAsynchronously(currentDocument);
}
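For context, a minimal sketch of what the bean behind ejbInstance might look like (the class and names are assumed, not from the question):

import java.util.concurrent.Future;
import javax.ejb.AsyncResult;
import javax.ejb.Asynchronous;
import javax.ejb.Stateless;

// The @Asynchronous method returns to the caller immediately; the body runs
// on the EJB container's work manager thread pool discussed below.
@Stateless
public class DocumentProcessor {

    @Asynchronous
    public Future<Void> processAsynchronously(Document document) {
        // ... long-running processing of a single document ...
        return new AsyncResult<>(null);
    }
}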
Suppose I have a thread pool of size 10 and 4 processor cores; my questions are:
how many documents will the application server process SIMULTANEOUSLY?
what happens when all threads in the pool are processing documents and one more asynchronous call comes in? Will this work like a sort of JMS queue?
would I have any improvement adopting a JMS queue solution?
I work with Java EE 6 and WebSphere 8.5.5.2
The default configuration for asynchronous EJB method calls is as follows (from the infocenter):
The EJB container work manager has the following thread pool settings:
Minimum number of threads = 1
Maximum number of threads = 5
Work request queue size = 0 work objects
Work request queue full action = Block
Remote Future object duration = 86400 seconds
So trying to answer your questions:
how many documents will the application server process SIMULTANEOUSLY? (assuming a thread pool of size 10)
This thread pool is for all EJB async calls, so first you need to assume that your application is the only one using EJB async calls. Then you will potentially have 10 runnable instances that will be processed in parallel. Whether they will be processed concurrently depends on the number of cores/threads available in the system, so you can't give an accurate number (some cores/threads may be doing web work, for example, or another process may be using the CPU).
what happens when all threads in the pool are processing documents and one more asynchronous call comes in?
It depends on the Work request queue size and Work request queue full action settings. If there are no available threads in the pool, requests will be queued until the queue size is reached. After that it depends on the action, which might be Block or Fail.
would I have any improvement adopting a JMS Queue solution
Depends on your needs. Here are some pros and cons of a JMS solution.
Pros:
Persistence - if you use JMS, your asynchronous tasks can be persistent, so in case of a server failure you will not lose them, and they will be processed after restart or by another cluster member. The EJB async queue is held only in memory, so queued tasks are lost in case of failure.
Scalability - if you put tasks on a queue, they can be processed concurrently by many servers in the cluster, not limited to a single JVM.
Expiration and priorities - you can define different expiration times or priorities for your messages.
Cons:
More complex application - you will need to implement an MDB to process your tasks (see the sketch after this list).
More complex infrastructure - you will need a database to store the queues (a file system can be used for a single server, and a shared file system for clusters), or an external messaging solution like WebSphere MQ.
A bit lower performance for processing a single item and higher load on the server, as each task has to be serialized/deserialized to persistent storage.
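For illustration, a minimal MDB corresponding to the first con above; the destination JNDI name and the Document payload are assumptions:

import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.ObjectMessage;

// Each queued document is delivered to onMessage and processed by a pooled
// MDB instance; the JMS queue itself is persistent, unlike the EJB async queue.
@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationType",
                              propertyValue = "javax.jms.Queue"),
    @ActivationConfigProperty(propertyName = "destination",
                              propertyValue = "jms/DocumentQueue") // assumed JNDI name
})
public class DocumentMdb implements MessageListener {

    @Override
    public void onMessage(Message message) {
        try {
            Document document = (Document) ((ObjectMessage) message).getObject();
            // ... same processing as the @Asynchronous method ...
        } catch (JMSException e) {
            throw new RuntimeException(e);
        }
    }
}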
I have an application that runs as a web service and submits jobs to Spark on user request. The job queue needs to be limited per user. I am planning to use Airflow as an orchestration framework to manage the job queues, but while it supports parallel DAG execution, it is optimized for batch processing rather than real time. Is Airflow designed to handle ~200 DAG executions per second with multiple queues (one per user), or should I look for alternatives?
Do you have data moving from one task to another? Does timing matter here, since you mention real time? With Airflow, workflows are expected to be mostly static or slowly changing, and it is mostly for ETL batch processing. You can speed up the Airflow heartbeat, but it would be good to build a POC with your use case to test it out.
Below is from the official Airflow documentation: https://airflow.apache.org/#beyond-the-horizon
Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!). Airflow is not in the Spark Streaming or Storm space, it is more comparable to Oozie or Azkaban.