Running tasks asynchronously that will never run simultaneously - asynchronous

I was wondering if there's a way to run tasks asynchronously in the background (using Celery, for example) such that certain tasks never run simultaneously.
That is, each task can run concurrently with other instances of itself, but not with other tasks that interfere with the actions of the first task.
For example,
Task A: reads from a file (can run simultaneously with itself, i.e. with other tasks that read from files)
Task B: writes to a file (must not run simultaneously with the read tasks, i.e. with Task A)
Essentially, what I need is a way for tasks A and B to find out whether the other task is running and, if it is, to delay themselves and wait until it's done (probably by blocking the task queue).
Does defining a queue for the tasks solve the problem, or is it just a queue for dispatching tasks (so it would execute the second task in the queue without waiting for the result of the first one)?
Is using a lock my only solution here?
If the lock solution is the only one, what's the correct way of implementing this?
I have found this:
Ensuring a task is only executed one at a time
But it uses Django's cache as a lock, and I'm not running my programs in a Django environment, so it doesn't work for me.
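For reference, the lock-based option mentioned above could look roughly like the sketch below. This is a minimal, simplified sketch under assumptions, not a drop-in solution: it assumes a Redis instance is reachable (used here both as the Celery broker and as the shared lock store), the task names and key names are made up, and the reader/writer coordination is deliberately simple rather than fully race-free.

import time

import redis
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")
r = redis.Redis.from_url("redis://localhost:6379/0")

WRITE_LOCK = "file:write_lock"      # hypothetical key held while a writer is active
READER_COUNT = "file:reader_count"  # hypothetical counter of in-flight readers

@app.task(bind=True, max_retries=None)
def read_file(self, path):
    # Readers may overlap with each other, but should back off while a writer is active.
    if r.exists(WRITE_LOCK):
        raise self.retry(countdown=1)   # a writer holds the lock: try again shortly
    r.incr(READER_COUNT)
    try:
        with open(path) as fh:
            return fh.read()
    finally:
        r.decr(READER_COUNT)

@app.task(bind=True, max_retries=None)
def write_file(self, path, data):
    # Only one writer at a time, and no overlap with readers.
    if not r.set(WRITE_LOCK, "1", nx=True, ex=60):
        raise self.retry(countdown=1)   # another writer holds the lock: try again shortly
    try:
        while int(r.get(READER_COUNT) or 0) > 0:
            time.sleep(0.1)             # wait for in-flight readers to drain
        with open(path, "w") as fh:
            fh.write(data)
    finally:
        r.delete(WRITE_LOCK)

The idea is simply that readers retry while a writer holds the lock, and a writer retries until it can claim the lock, then waits for in-flight readers to finish before touching the file. There is still a small window between a reader's check and its increment, so a production version would need a proper read/write lock (for example via a Lua script or a library that provides one).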

Related

Can Airflow persist access to metadata of short-lived dynamically generated tasks?

I have a DAG that, whenever there are files detected by FileSensor, generates tasks for each file to (1) move the file to a staging area, (2) trigger a separate DAG to process the file.
FileSensor -> Move(File1) -> TriggerDAG(File1) -> Done
           |-> Move(File2) -> TriggerDAG(File2) -^
In the DAG definition file, the middle tasks are generated by iterating over the directory that FileSensor is watching, a bit like this:
# def generate_move_task(f: Path) -> BashOperator
# def generate_dag_trigger(f: Path) -> TriggerDagRunOperator
with dag:
    for filepath in Path(WATCH_DIR).glob("*"):
        sensor_task >> generate_move_task(filepath) >> generate_dag_trigger(filepath)
The Move task moves the files that lead to the task generation, so the next DAG run won't have FileSensor re-trigger either the Move or the TriggerDAG tasks for this file. In fact, the scheduler won't generate the tasks for this file at all, since after all files go through Move, the input directory has no contents to iterate over anymore.
This gives rise to two problems:
After execution, the task logs and renderings are no longer available. The Graph View only shows the DAG as it is now (empty), not as it was at runtime. (The Tree View shows the tasks' runs and states, but clicking on the "square" and picking any details leads to an Airflow error.)
The downstream tasks can be memory-holed due to a race condition. The first task is to move the originating file to a staging area. If that takes longer than the scheduler polling period, the scheduler no longer collects the downstream TriggerDAG(File1) task, which means that task is not scheduled to be executed even though the upstream task ran successfully. It's as if the downstream task never existed.
The race condition issue is solved by changing the task sequence to Copy(File1) -> TriggerDAG(File1) -> Remove(File1), but the broader problem remains: is there a way to persist dynamically generated tasks, or at least a way to consistently access them through the Airflow interface?
While it isn't clear, I'm assuming that the downstream DAG(s) you trigger via your orchestrator DAG are NOT dynamically generated for each file (like your Move & TriggerDAG tasks); in other words, unlike your Move tasks that keep appearing and disappearing (based on files), the downstream DAGs are static and always stay there.
You've already built a relatively complex workflow that does advanced stuff like generating tasks dynamically and triggering external DAGs. I think with a slight modification to your DAG structure, you can get rid of your troubles (which are also quite advanced, IMO).
Relocate the Move task(s) from your upstream orchestrator DAG to the downstream (per-file) process DAG(s)
Make the upstream orchestrator DAG do two things:
Sense / wait for files to appear
For each file, trigger the downstream processing DAG (which in effect you are already doing).
For the orchestrator DAG, you can do it in either of two ways:
have a single task that does file sensing + triggering downstream DAGs for each file
have two tasks (I'd prefer this):
the first task senses files and, when they appear, publishes their list in an XCom
the second task reads that XCom and, for each file, triggers its corresponding DAG
But whichever way you choose, you'll have to replicate the relevant bits of code from
FileSensor (to be able to sense files and then publish their names in XCom) and
TriggerDagRunOperator (so as to be able to trigger multiple DAGs with a single task)
(The original answer includes a diagram depicting the two-task approach; a rough code sketch of the same idea follows.)
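A minimal sketch of that two-task approach, assuming Airflow 1.10-era import paths; WATCH_DIR, the task ids, and the downstream DAG id process_file_dag are placeholders, and the dag object is assumed to be defined elsewhere in the file (as in the question's snippet):

from datetime import datetime
from pathlib import Path

from airflow.api.common.experimental.trigger_dag import trigger_dag
from airflow.operators.python_operator import PythonOperator

WATCH_DIR = "/data/incoming"   # placeholder watch directory

def sense_files(**context):
    # Replicates the relevant bit of FileSensor: list the files and publish them via XCom.
    # (A real implementation would poke/wait until files appear, as FileSensor does.)
    files = [str(p) for p in Path(WATCH_DIR).glob("*") if p.is_file()]
    context["ti"].xcom_push(key="new_files", value=files)

def trigger_per_file(**context):
    # Replicates the relevant bit of TriggerDagRunOperator: one trigger per file,
    # all issued from a single, static task.
    files = context["ti"].xcom_pull(task_ids="sense_files", key="new_files") or []
    for f in files:
        trigger_dag(
            dag_id="process_file_dag",   # downstream per-file processing DAG (placeholder name)
            run_id="trig_{}_{}".format(Path(f).name, datetime.utcnow().isoformat()),
            conf={"filepath": f},
        )

with dag:
    sense = PythonOperator(task_id="sense_files", python_callable=sense_files,
                           provide_context=True)
    trigger = PythonOperator(task_id="trigger_per_file", python_callable=trigger_per_file,
                             provide_context=True)
    sense >> trigger

Because these two tasks never appear or disappear, their logs and XCom values stay accessible per run; only the XCom payload (the list of files) varies.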
The short answer to the title question is, as of Airflow 1.10.11, no, this doesn't seem possible as stated. To render DAG/task details, the Airflow webserver always consults the DAGs and tasks as they are currently defined and collected to DagBag. If the definition changes or disappears, tough luck. The dashboard just shows the log entries in the table; it doesn't probe the logs for prior logic (nor does it seem to store much of it other than the headline).
y2k-shubham provides an excellent solution to the unspoken question of "how can I write DAGs/tasks so that the transient metadata are accessible?" The subtext of his solution: convert the transient metadata into something Airflow stores per task run, but keep the tasks themselves fixed. XCom is the solution he uses here, and it does show up in the task instance details / logs.
Will Airflow implement persistent interface access to fleeting one-time tasks whose definition disappears from the DagBag? It's possible but unlikely, for two reasons:
It would require the webserver to probe the historical logs instead of just the current DagBag when rendering the dashboard, which would require extra infrastructure to keep the web interface snappy, and could make the display very confusing.
As y2k-shubham notes in a comment to another question of mine, fleeting and changing tasks/DAGs are an Airflow anti-pattern. I'd imagine that would make this a tough sell as the next feature.

Running process not shown in active process Instances and difference between ASYNC & SYNC tasks

My workflow is quite simple: I have two scripts, the first script is ASYNC and the second is SYNC. In each script I have a loop from 0 to Integer.MAX_VALUE, as follows:
for (int i = 0; i < Integer.MAX_VALUE; i++)
    System.out.println("value is " + i);
When I run my process, it starts working and I can see my log file being filled. But when I want to stop it, I find nothing in my active process instances, nor in the completed or even the aborted ones. Even when I check my database, there is nothing related to this process in ProcessInstanceInfo or even ProcessInstanceLog. Weird, isn't it? What could be the reason?
The goal of creating this workflow is to see the difference between ASYNC and SYNC tasks, because as I understand it, when an ASYNC task starts running, the workflow doesn't have to wait until that task finishes. But what I see is that my ASYNC task keeps running and the flow doesn't move on to the next task. So my second question is: can anyone explain the difference between ASYNC and SYNC with a good example? I would appreciate an answer to at least one of my two questions. Thanks.
What do you stop? Do you abort the process instance?
In the scripts you can populate the process variables with kcontext.setVariable("variable_name", "variable_value"). This will be reflected in the DB if you have defined the process variable as persistent in the process model.
As for the tasks: a sync task returns flow control to the process when it completes. In contrast, with an async task, the process flow continues immediately after the async task is sent off for execution.
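As a generic illustration of that last point (plain Python and threads, not jBPM code): with a sync call the flow waits for the work to finish, while with an async dispatch the flow continues immediately and the work completes in the background.

import threading
import time

def long_task(label):
    # Stand-in for a long-running script task.
    time.sleep(2)
    print(label + " task finished")

# SYNC: flow control comes back only after the task completes.
long_task("sync")
print("this line runs only after the sync task is done")

# ASYNC: the task is handed off and the flow continues immediately.
threading.Thread(target=long_task, args=("async",)).start()
print("this line runs immediately, while the async task is still working")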

How to launch a Dataflow job with Apache Airflow and not block other tasks?

Problem
Airflow tasks of the type DataflowTemplateOperator take a long time to complete. This means other tasks can be blocked by them (correct?).
When we run more of these tasks, it means we would need a bigger Cloud Composer cluster (in our case) to execute tasks that are essentially blocking while they shouldn't be (they should be async operations).
Options
Option 1: just launch the job and airflow job is successful
Option 2: write a wrapper as explained here and use a reschedule mode as explained here
Option 1 does not seem feasible as the DataflowTemplateOperator only has an option to specify the wait time between completion checks called poll_sleep (source).
For the DataflowCreateJavaJobOperator there is an option check_if_running to wait for completion of a previous job with the same name (see this code)
It seems that after launching a job, the wait_for_finish is executed (see this line), which boils down to an "incomplete" job (see this line).
For Option 2, I need Option 1.
Questions
Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that looks if the job finished correctly, but in a reschedule mode.
Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.
Answers
Am I correct to assume that Dataflow tasks will block others in Cloud Composer/Airflow?
A: Partly, yes. Airflow has a parallelism option in the configuration which defines the number of tasks that can execute at a time across the system. Having a task occupy one of these slots can slow down execution in the system, but this issue is bound to happen as you increase the number of tasks and DAGs. You can increase this setting in the configuration depending on your needs.
Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something)
A: Yes. You can use a PythonOperator and, in the python_callable, use the Dataflow hook to launch the job in async mode (launch and don't wait); a rough sketch of this idea follows after these answers.
Is there an easy way to write this myself? I'm thinking of just executing a bash launch script, followed by a task that looks if the job finished correctly, but in a reschedule mode.
A: When you say reschedule, I'm assuming that you are going to retry the task that checks whether the job finished correctly. If so, you can set that task to retry and configure the delay at which you want the retries to happen.
Is there another way to avoid blocking other tasks while running dataflow jobs? Basically this is an async operation and should not take resources.
A: I think I answered this in the second question.
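To make the second answer concrete, here is a minimal sketch of launching a Dataflow template from a PythonOperator without waiting for it to finish. It uses the Dataflow REST API through google-api-python-client rather than the provider hook (whose method names vary between provider versions); PROJECT, REGION, TEMPLATE_GCS_PATH, the job name, and the parameters are placeholders, application-default credentials are assumed, and dag is assumed to be defined elsewhere.

from airflow.operators.python_operator import PythonOperator
from googleapiclient.discovery import build

PROJECT = "my-project"                                      # placeholder
REGION = "europe-west1"                                     # placeholder
TEMPLATE_GCS_PATH = "gs://my-bucket/templates/my-template"  # placeholder

def launch_dataflow_template(**context):
    service = build("dataflow", "v1b3", cache_discovery=False)
    response = service.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath=TEMPLATE_GCS_PATH,
        body={
            "jobName": "my-job-{}".format(context["ds_nodash"]),
            "parameters": {"input": "gs://my-bucket/input/*"},   # placeholder parameters
        },
    ).execute()
    # Hand the job id to a later task (e.g. a rescheduling sensor) that polls the job state.
    context["ti"].xcom_push(key="dataflow_job_id", value=response["job"]["id"])

launch_job = PythonOperator(
    task_id="launch_dataflow_job",
    python_callable=launch_dataflow_template,
    provide_context=True,   # Airflow 1.x style; not needed on Airflow 2
    dag=dag,
)

A follow-up task can then poll projects.locations.jobs.get with that job id (retrying, or in a sensor's reschedule mode), so no worker slot is held while the Dataflow job runs.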

Google Cloud RunTask before its scheduled to run

When using Google Cloud Tasks, how can I prematurely run a task that is in the queue? I have a need to run the task before it's scheduled to run. For example, the user chooses to navigate away from the page and they are prompted; if they accept the prompt to move away from that page, I need to clear the queued task item programmatically.
I will be running this with a firebase-function on the backend.
Looking at the API for Cloud Tasks found here, it seems we have primitives to:
list - get a list of tasks that are queued to run
delete - delete a task that is queued to run
run - force a task to run now
Based on these primitives, we seem to have all the "bits" necessary to achieve your ask.
For example:
To run a task now that is scheduled to run in the future:
List all the tasks
Find the task that you want to run now
Delete the task
Run a task (now) using the details of the retrieved task
We appear to have a REST API as well as language-bound libraries for the popular languages.
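As a rough sketch in Python (the Node.js client used from a Firebase Function is analogous); the project, location, queue, and task id are placeholders:

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "my-queue")   # placeholders

# 1. List the tasks currently queued to run.
for task in client.list_tasks(parent=parent):
    # 2. Find the task you care about (here, matched by a placeholder task id suffix).
    if task.name.endswith("/tasks/my-task-id"):
        # Either force it to run right now...
        client.run_task(name=task.name)
        # ...or remove it from the queue entirely instead:
        # client.delete_task(name=task.name)
        break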

Pagodabox or PHPfog + Gearman

All,
I'm looking for a good way to do some job backgrounding through either of these two services.
I see PHPFog supports IronWorks, but I need something more realtime. Through these cloud-based PaaS services, I'm not able to use popen('background.php --token=1234'). So I'm thinking the best solution might be to kick off a Gearman worker to handle the job. (Actually, my preferred method would be to use websockets to keep a connection open and receive feedback from the job, rather than long-polling a DB table through AJAX, but neither of these providers supports websockets.)
Question 1 is: is there a better solution than using Gearman to offload the job?
Question 2 is: I see PagodaBox supports 'worker listeners' (http://help.pagodabox.com/customer/portal/articles/430779). Has anybody set this up with Gearman? Would it work?
Thanks
I am using PagodaBox with a background worker in an application I am building right now. Basically, PagodaBox daemonizes a PHP process for you (meaning it will continually run in the background), so all you really have to do is create a script that checks a database table for tasks to run, runs them, and then sleeps a bit so it's not running too many queries against your database.
This is a simplified version of what I have running:
// Remove time limit
set_time_limit(0);
// Show ALL errors
error_reporting(-1);

// Run daemon
echo "--- Starting Daemon ---\n";
while (true) {
    // Query 'work_queue' table for new tasks
    // Loop over items and do whatever tasks are associated with them
    // Update row to mark task as completed

    // Wait a bit before polling again
    sleep(30);
}
A benefit to this approach is that it's easy to test via CLI:
php tasks.php
You will see all the echo statements come through in the console as it's running, and of course it's much easier to do than a more complicated setup with other dependencies like Gearman.
So whenever you add a new task to the table, the maximum amount of time you'll wait before that task is picked up in a batch is 30 seconds (or whatever your sleep time is). This is preferable to cron jobs, because if you set up a cron job to run every minute (the lowest possible interval) and the work takes longer than a minute, another cron process will start working on the same queue, and you can end up with a lot of duplicated task work that is hard to debug and troubleshoot. If instead you have either a single background worker that runs all tasks, or multiple background workers that each handle a different task type, you will never run into this issue.

Resources