Running process not shown in active process Instances and difference between ASYNC & SYNC tasks - asynchronous

My workflow is quite simple, I have two script, first script is ASYNC and the second is SYNC. In each script I have a loop from 0 to Integer.MAX_VALUE as follow
for(int i=0;i<Integer.MAX_VALUE;i++)
System.out.println("value is "+i);
When I run my process, it starts working and I can see in my log file that it is being filled. But when I want to stop it, I find nothing in my active process instances, neither in completed process or even in aborted. even if I check my data base, I have nothing related to this process in the ProcessInstanceInfo or even ProcessInstanceLog. So weird isn't it? what could be the reason?
The goal from creating this workflow is to see the difference between ASYNC and SYNC tasks, because as I know that ASYNC tasks when they start running, the workflow don't have to wait until this task finish, but what I have is that my task ASYNC is still running and it didn't go to next task. So my second question is can any one give me the difference between ASYNC and SYNC with a good example to learn. I would appreciate if I'll get at least one answer on one of my two questions. thanks

What do you stop? Do you abort the process instance ?
In the scripts you can populate the process variables with kcontext.setVariable("variable_name","variable_value"). This will reflect in DB if you have defined the process variable persistent in the process model.
The tasks, the sync one will return the flow control to the process when is completed. In contrast to the async one, process flow will continue immediately after it sends the async tasks to execute.

Related

How to launch a Dataflow job with Apache Airflow and not block other tasks?

Problem
Airflow tasks of the type DataflowTemplateOperator take a long time to complete. This means other tasks can be blocked by it (correct?).
When we run more of these tasks, that means we would need a bigger Cloud Composer cluster (in our case) to execute tasks that are essentially blocking while they shouldn't be (they should be async operations).
Options
Option 1: just launch the job and airflow job is successful
Option 2: write a wrapper as explained here and use a reschedule mode as explained here
Option 1 does not seem feasible as the DataflowTemplateOperator only has an option to specify the wait time between completion checks called poll_sleep (source).
For the DataflowCreateJavaJobOperator there is an option check_if_running to wait for completion of a previous job with the same name (see this code)
It seems that after launching a job, the wait_for_finish is executed (see this line), which boils down to an "incomplete" job (see this line).
For Option 2, I need Option 1.
Questions
Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that looks if the job finished correctly, but in a reschedule mode.
Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.
Answers
Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
A: Partly yes. Airflow has parallelism option in the configuration which define the number of tasks that should execute at a time across the system. Having a task block this slot might slow down the execution in the system but this issue is bound to happen as you increase the number of tasks and DAGs. You can increase this in the configuration depending on your needs
Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
A: Yes. You can use PythonOperator and in the python_callable you can use the dataflow hook to launch the job in async mode (launch and don't wait).
Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that looks if the job finished correctly, but in a reschedule mode.
A: When you say reschedule, I'm assuming that you are going to retry the task that looks for job that checks if the job finished correctly. If I'm right, you can set the task on retry mode and the delay at which you want the retry to happen.
Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.
A: I think I answered this in the second question.

Google Cloud RunTask before its scheduled to run

When using Google Cloud Tasks, how can i prematurely run a tasks that is in the queue. I have a need to run the task before it's scheduled to run. For example the user chooses to navigate away from the page and they are prompted. If they accept the prompt to move away from that page, i need to clear the queued task item programmatically.
I will be running this with a firebase-function on the backend.
Looking at the API for Cloud Tasks found here it seems we have primitives to:
list - get a list of tasks that are queued to run
delete - delete a task this is queued to run
run - forces a task to run now
Based on these primitives, we seem to have all the "bits" necessary to achieve your ask.
For example:
To run a task now that is scheduled to run in the future.
List all the tasks
Find the task that you want to run now
Delete the task
Run a task (now) using the details of the retrieved task
We appear to have a REST API as well as language bound libraries for the popular languages.

event listener in airflow

I am having a task which will listen for certain events and kick-off other functions.
This function (the listener) subscribes to a kafka topic and runs forever, or at least until it will get a 'stop' event.
Wrapping this as an airflow operator doesn't seem to work properly.
Meaning, if I send the stop event, it will not process it, or anything else for that matter.
Is it possible to run busy loop functions in airflow ?
No, do not run infinite loops in an Airflow task.
Airflow is designed as a batch processor - long/infinite running tasks are counter to it's entire scheduling and processing model and while it might "work", it will lock up a task runner slot.

Running tasks asynchronously that will never run simultaneously

I was wondering if there's a way running tasks asynchronously that will run in the background (Using Celery for example) which will never run simultaneously?
Which means, each task can run by itself simultaneously with it self but not with another tasks that interfere with the actions of the first task.
For example,
Task A: Reads from a file (can run simultaneously with it self (with other tasks that reads from files)
Task B: Writes to a file (Should not run simultaneously with the read tasks (With task A))
Essentially, what I need is a way for tasks A and B to find out if the other task is running and if it is, then delay itself and wait until it's done (probably with blocking the task queue)
Does defining a queue for the tasks solves the problem? or is it just a queue for the execution of tasks (So it will execute the 2nd task in the queue without waiting for the result of the first one)?
Is using a lock my only solution here?
If the lock solution is the only one, what's the correct way of implementing this?
I have found this:
Ensuring a task is only executed one at a time
But it uses django's cache as a lock and I'm not running my programs in a django environment so it doesn't work for me.

Pagodabox or PHPfog + Gearman

All,
I'm looking for a good way to do some job backgrounding through either of these two services.
I see PHPFog supports IronWorks, but i need something more realtime. Through these cloud based PaaS services, I'm not able to use popen(background.php --token=1234). So I'm thinking the best solution, might be to try to kick off a gearman worker to handle the job. (Actually my preferred method would be to use websockets to keep a connection open and receive feedback from the job, rather than long polling a db table through AJAX, but none of these guys support websockets)
Question 1 is, is there a better solution than using gearman to offload the job?
Question 2 is, http://help.pagodabox.com/customer/portal/articles/430779 I see pagodabox supports 'worker listeners' ... has anybody set this up with gearman? Would it work?
Thanks
I am using PagodaBox with a background worker in an application I am building right now. Basically, PagodaBox daemonizes a PHP process for you (meaning it will continually run in the background), so all you really have to do is create a script that checks a database table for tasks to run, runs them, and then sleeps a bit so it's not running too many queries against your database.
This is a simplified version of what I have running:
// Remove time limit
set_time_limit(0);
// Show ALL errors
error_reporting(-1);
// Run daemon
echo "--- Starting Daemon ---\n";
while(true) {
// Query 'work_queue' table for new tasks
// Loop over items and do whatever tasks are associated with them
// Update row to mark task as completed
// Wait a bit
sleep(30);
}
A benefit to this approach is that it's easy to test via CLI:
php tasks.php
You will see all the echo statements come through in console as it's running, and of course it's much easier to do than a more complicated setup with other dependencies like Gearman.
So whenever you add a new task to the table, the maximum amount of time you'll wait for that task to be started in a batch is 30 seconds (or whatever your sleep time is). This is better and preferable to cron jobs, because if you setup a cron job to run every minute (the lowest possible interval) and the work you have to do takes longer than a minute, another cron process will start working on the same queue and you could end up with quite a lot of duplicated task work that could cause a lot of issues that are hard to debug and troubleshoot. So if you instead have either only one background worker that runs all tasks, or multiple background workers that work on different task types, you will never run into this issue.

Resources