Mule - Batch Job sync or async - asynchronous

I have two batch jobs in different flows. The first does an upsert in Salesforce and, when it finishes, it calls the second flow, which contains another batch job.
This image represents the flows:
But when I look at the log in the console, the log entries of the second batch are sometimes mixed with those of the first.
I get the feeling that the batch processes are asynchronous and the second batch is called even though the first batch is still being processed.
Am I wrong? Should I pay attention to the order of the logs?
If I wanted it to be totally synchronous, what would be the best way?

Mule batch is asynchronous; it is fire and forget. If you want to call the second batch job after the first one completes, invoke the second batch in the 'On Complete' phase of the first batch, as shown in the picture below.
If you need to do some work before invoking the second batch, you can use a request-reply scope around the batch component to make it behave synchronously.

Yes, the batch job is asynchronous. As soon as batch execute is triggered, the flow moves on to the next event processor.
If batch job 2 must only run after batch job 1, you can use the On Complete phase of the first batch job to emit an event indicating that it has finished, and use that event to trigger the second batch job.
Alternatively, if the batch jobs are that closely related, you might be able to combine them into a single batch job with multiple batch steps.

Related

Correct Approach For Airflow DAG Project

I am trying to see if Airflow is the right tool for some functionality I need for my project. We are trying to use it as a scheduler for running a sequence of jobs that start at a particular time (or possibly on demand).
The first "task" is to query the database for the list of job id's to sequence through.
For each job in the sequence send a REST request to start the job
Wait until job completes or fails (via REST call or DB query)
Go to next job in sequence.
I am looking for recommendations on how to break down the functionality discussed above into an airflow DAG. So far my approach would :
create a Hook for the database and another for the REST server.
create a custom operator that handles the start and monitoring of the "job" (steps 2 and 3)
use a sensor to handle polling/waiting for the job to complete (a rough sketch follows below)
Thanks
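A rough sketch of how those pieces could fit together; the REST endpoints, status values, and hard-coded job list below are assumptions for illustration only, not part of the original question:

from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor

REST_BASE = "http://jobs.example.com/api"  # placeholder REST server

def start_job(job_id, **_):
    # Step 2: send a REST request to start one job (endpoint is an assumption).
    requests.post(f"{REST_BASE}/jobs/{job_id}/start", timeout=30).raise_for_status()

def job_finished(job_id, **_):
    # Step 3: poll the job status; returning True completes the sensor,
    # raising an exception fails it.
    status = requests.get(f"{REST_BASE}/jobs/{job_id}", timeout=30).json()["status"]
    if status == "FAILED":
        raise RuntimeError(f"Job {job_id} failed")
    return status == "COMPLETED"

with DAG("job_sequence", start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False) as dag:
    previous = None
    # Step 1 (the database query for job ids) is hard-coded here for brevity.
    for job_id in ["job-a", "job-b"]:
        start = PythonOperator(task_id=f"start_{job_id}", python_callable=start_job, op_args=[job_id])
        wait = PythonSensor(
            task_id=f"wait_{job_id}",
            python_callable=job_finished,
            op_args=[job_id],
            mode="reschedule",  # frees the worker slot between polls
            poke_interval=60,
        )
        start >> wait
        if previous:
            previous >> start  # step 4: the next job only starts after the previous one finishes
        previous = wait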

Running process not shown in active process Instances and difference between ASYNC & SYNC tasks

My workflow is quite simple: I have two scripts, the first script is ASYNC and the second is SYNC. In each script I have a loop from 0 to Integer.MAX_VALUE, as follows:
for (int i = 0; i < Integer.MAX_VALUE; i++) {
    System.out.println("value is " + i);
}
When I run my process, it starts working and I can see that my log file is being filled. But when I want to stop it, I find nothing in my active process instances, nor in the completed or aborted ones. Even if I check my database, I have nothing related to this process in ProcessInstanceInfo or ProcessInstanceLog. Weird, isn't it? What could be the reason?
The goal of creating this workflow is to see the difference between ASYNC and SYNC tasks, because as I understand it, when an ASYNC task starts running the workflow does not have to wait until that task finishes. But what I see is that my ASYNC task keeps running and the workflow does not move on to the next task. So my second question is: can anyone explain the difference between ASYNC and SYNC with a good example to learn from? I would appreciate an answer to at least one of my two questions. Thanks.
What do you mean by stop? Do you abort the process instance?
In the scripts you can populate process variables with kcontext.setVariable("variable_name", "variable_value"). This will be reflected in the DB if you have defined the process variable as persistent in the process model.
As for the tasks: a sync task returns flow control to the process only when it has completed. With an async task, in contrast, the process flow continues immediately after the async task is submitted for execution.

How to launch a Dataflow job with Apache Airflow and not block other tasks?

Problem
Airflow tasks of the type DataflowTemplateOperator take a long time to complete. This means other tasks can be blocked by it (correct?).
When we run more of these tasks, we would need a bigger Cloud Composer cluster (in our case) to execute tasks that are essentially blocking while they shouldn't be (they should be async operations).
Options
Option 1: just launch the Dataflow job and let the Airflow task succeed immediately
Option 2: write a wrapper as explained here and use a reschedule mode as explained here
Option 1 does not seem feasible as the DataflowTemplateOperator only has an option to specify the wait time between completion checks called poll_sleep (source).
For the DataflowCreateJavaJobOperator there is an option check_if_running to wait for completion of a previous job with the same name (see this code).
It seems that after launching a job, the wait_for_finish is executed (see this line), which boils down to an "incomplete" job (see this line).
For Option 2, I need Option 1.
Questions
Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that checks whether the job finished correctly, but in a reschedule mode.
Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.
Answers
Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
A: Partly yes. Airflow has a parallelism option in its configuration which defines the number of tasks that can execute at a time across the system. Having a task occupy such a slot might slow down execution in the system, but this issue is bound to happen as you increase the number of tasks and DAGs. You can increase this setting in the configuration depending on your needs.
Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
A: Yes. You can use a PythonOperator and, in the python_callable, use the Dataflow hook to launch the job in async mode (launch and don't wait).
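As a rough sketch of that idea, this variant calls the Dataflow templates.launch REST API directly through google-api-python-client rather than the Airflow hook, because the hook's non-blocking arguments differ between provider versions; the project, region, bucket and job names below are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from googleapiclient.discovery import build

def launch_template_async(project_id, region, gcs_template_path, job_name, parameters, **_):
    # projects.locations.templates.launch creates the Dataflow job and returns
    # as soon as it is created; it does not wait for the job to finish.
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    response = dataflow.projects().locations().templates().launch(
        projectId=project_id,
        location=region,
        gcsPath=gcs_template_path,
        body={"jobName": job_name, "parameters": parameters},
    ).execute()
    return response["job"]["id"]  # pushed to XCom so a later task can check the job

with DAG("dataflow_async_launch", start_date=datetime(2023, 1, 1), schedule_interval=None, catchup=False) as dag:
    launch_dataflow_job = PythonOperator(
        task_id="launch_dataflow_job",
        python_callable=launch_template_async,
        op_kwargs={
            "project_id": "my-project",                                   # placeholder
            "region": "europe-west1",                                     # placeholder
            "gcs_template_path": "gs://my-bucket/templates/my-template",  # placeholder
            "job_name": "my-dataflow-job",
            "parameters": {},
        },
    )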
Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that checks whether the job finished correctly, but in a reschedule mode.
A: When you say reschedule, I'm assuming that you are going to retry the task that checks whether the job finished correctly. If so, you can set that task to retry and configure the delay at which you want the retries to happen.
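Continuing the sketch from the previous answer, and assuming the launch task above pushed the Dataflow job id to XCom, the "check in reschedule mode" idea could be a PythonSensor along these lines (task ids, intervals, and placeholders are assumptions):

from airflow.sensors.python import PythonSensor
from googleapiclient.discovery import build

def dataflow_job_done(project_id, region, job_id, **_):
    # Poll the Dataflow job state: True ends the sensor, an exception fails the task.
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    job = dataflow.projects().locations().jobs().get(
        projectId=project_id, location=region, jobId=job_id,
    ).execute()
    state = job["currentState"]
    if state in ("JOB_STATE_FAILED", "JOB_STATE_CANCELLED"):
        raise RuntimeError(f"Dataflow job {job_id} ended in state {state}")
    return state == "JOB_STATE_DONE"

# Inside the same DAG as launch_dataflow_job above:
wait_for_dataflow_job = PythonSensor(
    task_id="wait_for_dataflow_job",
    python_callable=dataflow_job_done,
    op_kwargs={
        "project_id": "my-project",   # placeholder
        "region": "europe-west1",     # placeholder
        "job_id": "{{ ti.xcom_pull(task_ids='launch_dataflow_job') }}",
    },
    mode="reschedule",   # releases the worker slot between checks instead of blocking it
    poke_interval=300,   # check every 5 minutes
)
# launch_dataflow_job >> wait_for_dataflow_job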
Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.
A: I think I answered this in the second question.

Google Cloud RunTask before it's scheduled to run

When using Google Cloud Tasks, how can I prematurely run a task that is in the queue? I need to run the task before it is scheduled to run. For example, the user chooses to navigate away from the page and is prompted; if they accept the prompt to move away from that page, I need to clear the queued task item programmatically.
I will be running this with a firebase-function on the backend.
Looking at the API for Cloud Tasks found here, it seems we have primitives to:
list - get a list of tasks that are queued to run
delete - delete a task that is queued to run
run - forces a task to run now
Based on these primitives, we seem to have all the "bits" necessary to achieve your ask.
For example:
To run a task now that is scheduled to run in the future:
List all the tasks
Find the task that you want to run now
Delete the task
Run a task (now) using the details of the retrieved task
We appear to have a REST API as well as client libraries for the popular languages.
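As a short sketch of those primitives using the google-cloud-tasks Python client (the asker plans a Firebase Function, i.e. Node.js, but the same calls exist in every client library; the project, queue, and matching logic below are placeholders):

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
queue = client.queue_path("my-project", "us-central1", "my-queue")  # placeholders

# 1 + 2: list the queued tasks and find the one to act on (matching logic is an assumption).
target = next(t for t in client.list_tasks(parent=queue) if "user-123" in t.name)

# 3a: force it to run immediately instead of at its scheduled time ...
client.run_task(name=target.name)

# 3b: ... or, if the user navigated away, remove it from the queue instead.
# client.delete_task(name=target.name)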

Post Dynamic condition in Control M

We have multiple jobs that serve as threads for file loading, but we want to trigger the jobs only when a file is received, so we created a file watcher job in Control-M. We want to trigger a thread job for each file, so that one file is processed by a single thread job.
For example: if only one file is received, only one thread job should be triggered, say the Thread1 job. If a new file is received a minute later, then since the Thread1 job is already running, the Thread2 job should be initiated.
I think that if we could post a condition programmatically in Control-M, my problem would be solved.
Please help and comment if any more information is required.
You could have the file watcher post a generic out-condition, then configure a dummy job at the start of each thread that requires exclusive control over a control resource and, on completion, deletes its in-condition and kicks off the rest of the thread.
3 files arrive.
The file watcher completes and posts the out-condition.
Only one thread header can start; it then removes the out-condition and continues its thread.
The file watcher runs again, completes and posts the out-condition.
Only one thread header can start; it then removes the out-condition and continues its thread.
etc.
It's not clear where you are trying to use the conditions, but it is possible to add conditions programmatically using the ctmcontb utility.
ex: ctmcontb -ADD Condition_Name ODAT
