In our application, one DAG run processes 50 records every 10 minutes. To load historical data (~80k records) in a short period of time, we increased max_active_runs to 3 and decreased the interval to 2 minutes.
When a DAG run starts, its first task picks up the first 50 eligible records, marks them as IN-PROGRESS, and proceeds.
The issue we noticed is that when multiple DAG runs start execution at the same time (when there is some delay from the previous run), the same records are picked up by more than one DAG run.
Is there a way to force a delay between multiple active DAG runs?
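Aside from delaying runs, the duplicate pick-up itself can be prevented by claiming rows atomically: filter on the pending status and flip it in a single UPDATE, so two concurrent runs can never claim the same row. A minimal sketch, using sqlite3 as a stand-in for the production database (on Postgres you would more likely use SELECT ... FOR UPDATE SKIP LOCKED); the table and column names are made up:

```python
import sqlite3
import uuid

def claim_batch(conn, batch_size=50):
    """Atomically claim up to batch_size PENDING records for this run.

    The UPDATE both filters on status and changes it in one statement,
    so a row can only ever be claimed by a single run_id.
    """
    run_id = str(uuid.uuid4())
    with conn:  # one transaction: claim, then read back what we claimed
        conn.execute(
            "UPDATE records SET status = 'IN_PROGRESS', claimed_by = ? "
            "WHERE id IN (SELECT id FROM records WHERE status = 'PENDING' LIMIT ?)",
            (run_id, batch_size),
        )
        rows = conn.execute(
            "SELECT id FROM records WHERE claimed_by = ?", (run_id,)
        ).fetchall()
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, status TEXT, claimed_by TEXT)"
)
conn.executemany(
    "INSERT INTO records (status) VALUES (?)", [("PENDING",)] * 120
)

first = claim_batch(conn)
second = claim_batch(conn)
print(len(first), len(second), set(first) & set(second))  # → 50 50 set()
```

With this in place, overlapping DAG runs simply claim disjoint batches, and no scheduler-level delay is needed.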
I have a big DAG with around 400 tasks that starts at 8:00 and runs for about 2.5 hours.
There are some smaller DAGs that need to start at 9:00; they are scheduled but are not able to start until the first DAG finishes.
I reduced concurrency to 6. The DAG now runs only 6 parallel tasks, but this does not solve the issue that tasks in the other DAGs don't start.
There is no other global configuration limiting the number of running tasks; the other, smaller DAGs usually run in parallel.
What can be the issue here?
Airflow version: 2.1 with LocalExecutor and a Postgres backend, running on a 20-core server.
Tasks of active DAGs not starting
I don't think it's related to concurrency. It could be related to Airflow's mini-scheduler.
When a task finishes, the task supervisor process performs a "mini scheduler" pass, attempting to schedule more tasks of the same DAG. This means the DAG finishes more quickly, since its downstream tasks are set to scheduled state directly; however, one of its side effects is that it can cause starvation for other DAGs in some circumstances. A case like the one you present, where one very big DAG takes a very long time to complete and starts before the smaller DAGs, may be exactly the situation where starvation happens.
Try setting schedule_after_task_execution = False in airflow.cfg; it should solve your issue.
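For reference, the setting lives in the [scheduler] section of airflow.cfg; a minimal fragment (all surrounding settings omitted):

```ini
[scheduler]
# Disable the per-task "mini scheduler" pass so that one long-running DAG
# cannot starve the scheduling of other DAGs.
schedule_after_task_execution = False
```

The same option can also be set via the environment variable AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION=False, following Airflow's AIRFLOW__SECTION__KEY convention.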
Why not trigger the next DAG after the previous one finishes?
In the first DAG, add a task that triggers the next one, as follows:
trigger_new_dag = TriggerDagRunOperator(
    task_id=[task name],
    trigger_dag_id=[triggered dag],
    dag=dag,
)
This operator triggers a run of the target DAG after the previous one has executed.
Documentation: https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/trigger_dagrun/index.html
I am completely new to Airflow and am trying to grasp the concepts of scheduling and default args.
I have a scenario where I would like to schedule my DAG hourly to do some data transfer task between a source and a database. What I am trying to understand is, let's say one of my DAG runs has triggered at 00:00 AM. Now if it takes more than an hour for this run to successfully complete all of its tasks (say 1 hour 30 min), does it mean that the next DAG run that was supposed to be triggered at 01:00 AM will NOT get triggered, but the DAG run from 02:00 AM will get triggered?
Yes.
In order to avoid this, you need catchup=True on the DAG object.
Reference : https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html
(Search for Catchup)
The Airflow scheduler monitors all DAGs and tasks in Airflow. Default arguments can be used to create tasks with default parameters in a DAG.
The first DAG run is based on start_date, and subsequent runs follow schedule_interval sequentially. The scheduler doesn't trigger a run until the interval it covers has ended. For your requirement, you can set catchup=True on the DAG so that it runs for each completed interval and the scheduler executes them sequentially. Catchup creates DAG runs for every data interval since the last one that has not yet started.
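To make the catch-up behaviour concrete, here is a small plain-Python sketch (no Airflow required) of how the scheduler derives the missed hourly runs it will backfill when catchup=True; the function name is illustrative, not part of Airflow's API:

```python
from datetime import datetime, timedelta

def missed_intervals(last_run_end, now, interval):
    """Return the start of every fully completed interval not yet run.

    A run is only triggered once its data interval has fully ended,
    which is why the last (still open) interval is excluded.
    """
    starts = []
    t = last_run_end
    while t + interval <= now:
        starts.append(t)
        t += interval
    return starts

# The DAG last ran for the interval ending at 00:00; it is now 03:30.
runs = missed_intervals(
    datetime(2023, 1, 1, 0, 0),
    datetime(2023, 1, 1, 3, 30),
    timedelta(hours=1),
)
print([r.hour for r in runs])  # → [0, 1, 2]
```

With catchup=True these three runs are created and executed sequentially; with catchup=False only the most recent one would be.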
I am trying to define periods of time where the scheduler will not mark tasks as ready.
For example, I have a DAG with a lengthy backfill period. The execution of the backfill will run throughout the day until the DAG is caught up. However, I do not want any executions of the DAG between midnight and 2 AM.
Is it possible to achieve this through configuration?
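The thread leaves this open. One common pattern (not shown here in the thread) is to gate the DAG's first task with a condition that short-circuits during the blackout window, e.g. via Airflow's ShortCircuitOperator, whose callable skips all downstream tasks when it returns False. The window check itself is plain Python; the function name and hours below are illustrative:

```python
from datetime import datetime

def outside_blackout(now=None, start_hour=0, end_hour=2):
    """Return True when `now` falls outside the [start_hour, end_hour) window.

    Intended as the python_callable of a short-circuit task: whenever it
    returns False, the downstream tasks of that run are skipped.
    """
    now = now or datetime.now()
    return not (start_hour <= now.hour < end_hour)

print(outside_blackout(datetime(2023, 1, 1, 1, 30)))   # → False (inside 00:00-02:00)
print(outside_blackout(datetime(2023, 1, 1, 14, 0)))   # → True
```

Skipped runs would then need to be re-run or picked up by the next scheduled interval, so this suits backfills where each run re-selects its own work.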
After pausing a DAG for 2-3 days, resuming it with catchup=False immediately runs the most recent missed execution.
For example, a DAG that sends data to an external system is scheduled to run every day at 19:00.
Stopping the DAG for 4 days and re-enabling it at 11:00 runs the DAG immediately with yesterday's execution date, and then again at 19:00 that day.
In this case the DAG runs twice on the day it's resumed.
Is it possible to resume the DAG so that the first run actually happens at 19:00?
With the default operators, we cannot achieve exactly what you are expecting. The closest thing Airflow has is LatestOnlyOperator. It is one of the simplest operators and needs only the following configuration:
latest_only = LatestOnlyOperator(task_id='latest_only')
This lets the downstream tasks run only if the current time falls between the current execution date and the next execution date. So in your case it would skip the three missed days, but yesterday's run would still trigger the jobs.
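The skip rule described above boils down to a window test; a simplified plain-Python rendering (the real LatestOnlyOperator implementation differs in its details):

```python
from datetime import datetime

def latest_only_should_run(now, execution_date, next_execution_date):
    """Downstream tasks run only when `now` falls inside this run's own
    schedule window; older (catch-up) runs fall outside it and are skipped."""
    return execution_date <= now < next_execution_date

# Resumed at 11:00 on Jan 5. Yesterday's 19:00 run owns the window
# [Jan 4 19:00, Jan 5 19:00), so it still runs...
now = datetime(2023, 1, 5, 11, 0)
print(latest_only_should_run(now, datetime(2023, 1, 4, 19), datetime(2023, 1, 5, 19)))  # → True
# ...while an older window is skipped.
print(latest_only_should_run(now, datetime(2023, 1, 3, 19), datetime(2023, 1, 4, 19)))  # → False
```

This is why LatestOnlyOperator cannot postpone the resumed run to 19:00: yesterday's window still contains the current time, so that run is not skipped.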
I am in the process of setting up cross-dependent DAGS using the Airflow documentation. I have a particular use case where my DAG B requires that DAG A runs first - however, if DAG A is delayed long enough DAG B should still run. So I'm essentially looking for a way to wire an OR operation between 2 sensors.
Say DAG B needs to run daily by 5 PM; this is how I would do it in code:
while True:
    current_time = get_current_time()
    if dag_a_completed or current_time > five_pm:
        run_dag_b()
        break
This is much simpler to do in code however not seeing how this is done with Airflow.
Interesting problem, here's how I think it can be accomplished
You need an ExternalTaskSensor in the beginning of your DAG-B, as told in the Cross-DAG Dependencies guide, to hold off the execution of DAG-B until DAG-A completes.
Here you must also set the timeout param so that the sensor fails after a certain maximum time.
Then in the first actual task of DAG-B (the one that comes immediately after the ExternalTaskSensor), set trigger_rule=TriggerRule.ALL_DONE to ensure that the actual processing of DAG-B starts irrespective of whether DAG-A completes within the stipulated time; in other words
execution of DAG-B will be held off until DAG-A completes, but only for a maximum delta duration
If DAG-A completes within this duration, then DAG-B will begin executing immediately after that
but if DAG-A fails to complete within this duration, DAG-B will begin executing anyway after this duration has passed
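Putting the three points together, a sketch of DAG-B under those assumptions (this requires an Airflow 2 installation; all DAG/task ids, the daily schedule, and the 6-hour timeout are placeholders to adapt):

```python
from datetime import timedelta

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="dag_b",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Hold off DAG-B until DAG-A's run for the same logical date completes,
    # but give up (fail the sensor) after a maximum wait.
    wait_for_dag_a = ExternalTaskSensor(
        task_id="wait_for_dag_a",
        external_dag_id="dag_a",
        timeout=6 * 60 * 60,   # maximum wait in seconds
        mode="reschedule",     # free the worker slot between pokes
    )

    # ALL_DONE: run whether the sensor succeeded or timed out and failed.
    process = PythonOperator(
        task_id="process",
        python_callable=lambda: print("running DAG-B work"),
        trigger_rule=TriggerRule.ALL_DONE,
    )

    wait_for_dag_a >> process
```

This reproduces the OR logic from the question: `process` starts as soon as DAG-A is seen to complete, or unconditionally once the sensor's timeout has elapsed.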