Airflow Executor

I'm still fairly new to using Airflow. I have a few DAGs scheduled to run, but they run sequentially. I want them to run in parallel with each other. So let's say I have a task scheduled to run at 1 pm and another task scheduled to run at 2 pm. If the first task isn't done by 2 pm, I still want the 2 pm task to start running and allow the 1 pm task to keep going as well.

As explained in https://www.astronomer.io/guides/airflow-executors-explained/, you should use the LocalExecutor if you want to run more than a single task at any point in time.
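For example, a minimal airflow.cfg sketch for switching to the LocalExecutor (the connection string below is a placeholder; the LocalExecutor needs a metadata database that supports parallel connections, such as Postgres or MySQL, rather than SQLite):

[core]
executor = LocalExecutor
# placeholder credentials; point this at your own Postgres/MySQL instance
sql_alchemy_conn = postgresql+psycopg2://user:pass@localhost:5432/airflow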

Related

A DAG is preventing other smaller DAGs' tasks from starting

I have a big DAG with around 400 tasks that starts at 8:00 and runs for about 2.5 hours.
There are some smaller DAGs that need to start at 9:00; they are scheduled, but are not able to start until the first DAG finishes.
I reduced concurrency to 6, so the big DAG is running only 6 parallel tasks; however, this does not solve the issue that the tasks in the other DAGs don't start.
There is no other global configuration limiting the number of running tasks, and the other smaller DAGs usually run in parallel.
What can be the issue here?
Airflow version: 2.1 with the Local Executor and a Postgres backend, running on a 20-core server.
I don't think it's related to concurrency. It could be related to Airflow's use of the mini-scheduler.
When a task finishes, the task supervisor process performs a "mini scheduler", attempting to schedule more tasks of the same DAG. This means the DAG finishes quicker, as its downstream tasks are set to the Scheduled state directly; however, one of its side effects is that it can cause starvation for other DAGs in some circumstances. A case like the one you present, where one very big DAG takes a very long time to complete and starts before the smaller DAGs, may be exactly the case where starvation can happen.
Try setting schedule_after_task_execution = False in airflow.cfg; that should solve your issue.
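For reference, that setting lives in the [scheduler] section of airflow.cfg:

[scheduler]
# disable the per-task "mini scheduler" so one long DAG cannot starve others
schedule_after_task_execution = False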
Why not use the option to trigger a DAG after the previous one finishes?
In the first DAG, insert a call to the next one as follows:
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger_new_dag = TriggerDagRunOperator(
    task_id="trigger_new_dag",           # any unique task id
    trigger_dag_id="downstream_dag_id",  # dag_id of the DAG to trigger
    dag=dag,
)
This operator will start a new DAG run after the tasks before it have executed.
Documentation: https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/trigger_dagrun/index.html

Airflow runs manual and scheduled DAG even though max_active_runs_per_dag=1

Good Day,
we use Airflow to orchestrate runs of our jobs.
The job in question is usually scheduled for 2:30 and takes quite some time.
Due to a new data source it was expected to run for a full day.
Since our jobs don't run in parallel, we set max_active_runs_per_dag to 1 to ensure that there are no multiple instances of the same job, even when it takes more than 24 hours. In general this seems to work, but it didn't in this case.
What happened:
We triggered a manual run at 13:00.
At 2:30 (the next day) the scheduled run was triggered and ran simultaneously.
Expectation:
The scheduled run should wait for the manual run to finish.
More Information:
The Airflow instance did not restart.
Airflow version 1.10.2
I thank you for any advice.
This seems to be an open issue that will be fixed in 2.1 and 1.15.
No workaround has been provided yet.
https://github.com/apache/airflow/issues/9975
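For reference, the same limit can also be pinned per DAG in the DAG definition itself. A minimal sketch for a recent Airflow 2.x install (the DAG id, schedule, and dummy task are illustrative; note that the versions discussed in the question were affected by the linked bug regardless of where the limit is set):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="nightly_job",            # hypothetical id
    start_date=datetime(2021, 1, 1),
    schedule_interval="30 2 * * *",  # 2:30 every day
    max_active_runs=1,               # at most one active run at a time
    catchup=False,
) as dag:
    run_job = DummyOperator(task_id="run_job")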

How to Build Dynamic Queues using Apache Airflow?

I have just started to explore Apache Airflow.
Is there any way to run a job that looks at the running DAGs and moves tasks from those DAGs into a new DAG, creating the new DAG and adding the tasks to it?
For example: DAG A has four tasks, and the 4th one has been waiting for 7 hours to start. The goal is to create a new DAG and move that task to it automatically.
Scenario: we actually have around 40 VMs, and each job's runtime varies on its own instance. For example, task A will take 2 hours today but might take 12 hours tomorrow in the same DAG. What I need is to move a task to another DAG, so it runs on another VM immediately, whenever its waiting time exceeds a certain threshold.
The main benefit is to keep every task's waiting time as low as possible by building dynamic DAGs.

How long should we keep the airflow scheduler running?

I'm very confused about how Airflow picks up changes to DAG code with the scheduler.
Can someone clarify how the Airflow scheduler works with new code? Do I need to stop and rerun airflow scheduler every time I change code in my DAGs? Or can I just set --num_runs 1 and run it every time I make new changes?
Thanks!
The scheduler should be running all the time. You should just run airflow scheduler without a num_runs param. The scheduler is designed to be a long-running process, an infinite loop. It orchestrates the work being done; it is the heart of Airflow. If it's not running, you aren't scheduling more work to be done.
Each iteration of the loop reloads what is called the DagBag, a collection of loaded DAGs. Any modifications to a DAG, as well as the removal or addition of DAGs, should be reflected in the next scheduler loop.
Airflow's scheduler periodically and continuously scans the DAGs folder to refresh the DAGs. If you didn't change the config, this is done with only a few seconds of pause between each round.
The --num_runs parameter was not introduced for refreshing purposes but for reliability:
Airflow officially advises here that the scheduler should be restarted frequently using the num_runs and/or the run_duration configuration parameters.
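For context, a sketch of the 1.x-era airflow.cfg knobs that advice refers to, shown with their defaults (run_duration was removed in the Airflow 2.0 scheduler rework, so treat this as a 1.x example):

[scheduler]
# both default to -1, i.e. keep scheduling indefinitely;
# set either to a finite value to make the scheduler exit periodically
# so a supervisor (e.g. systemd) can restart it
num_runs = -1
run_duration = -1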

Create a task schedule that runs a task after another one finishes

Currently I have a task that runs every 5 minutes.
What I want is to have that task rerun every time it completes, with a 1-minute delay.
What I have in mind is to create multiple tasks, task A and task B. Task B will run after task A completes, and vice versa. But I'm not sure how to execute that.
I have found a workaround for my situation. What I do is create a loop in which task A runs, followed by task B, with a delay in between.
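A minimal sketch of that pattern, assuming a recent Airflow 2.x install; the DAG id, the task bodies, and the 60-second delay are all illustrative. The last task re-triggers the same DAG, so each run starts about a minute after the previous one finishes:

from datetime import datetime
import time

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

def do_work():
    pass  # placeholder for the real task logic

def wait_one_minute():
    time.sleep(60)  # crude fixed delay between runs

with DAG(
    dag_id="self_retriggering_dag",  # hypothetical id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,          # triggered once manually, then by itself
    catchup=False,
) as dag:
    task_a = PythonOperator(task_id="task_a", python_callable=do_work)
    task_b = PythonOperator(task_id="task_b", python_callable=do_work)
    delay = PythonOperator(task_id="delay", python_callable=wait_one_minute)
    rerun_self = TriggerDagRunOperator(
        task_id="rerun_self",
        trigger_dag_id="self_retriggering_dag",  # points back at this DAG
    )
    task_a >> task_b >> delay >> rerun_self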
