I have scheduled a DAG with a simple bash task to run every 5th minute:
# bash_dag.py
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 5, 30)
}

dag = DAG(
    'bash_count',
    default_args=default_args,
    schedule_interval='*/5 * * * *',
    catchup=False
)

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag
)
Scheduling works fine and the DAG executes at every 5-minute threshold. However, I have noticed a significant delay between the 5-minute threshold and the task queueing time. In the examples shown in the image, task queueing takes between 3 and 50 seconds. For example, the last DAG execution in the image was supposed to be triggered at 20:05:00, but the task instance was queued 28 seconds later (20:05:28).
I'm surprised this is the case, since the DAG being scheduled has a single, very simple task. Is this a normal Airflow delay? Should I expect further delays when dealing with more complex DAGs?
I'm running a local Airflow server with Postgres as the metadata database on a 16 GB Mac running macOS Mojave. The machine is not resource constrained.
I have an Airflow DAG set up to run monthly (with the @monthly schedule_interval). The next DAG runs seem to be scheduled, but they don't appear as "queued" in the Airflow UI. I don't understand, because everything else seems fine. Here is how my DAG is configured:
with DAG(
    "dag_name",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@monthly",
    catchup=True,
    default_args={"retries": 5, "retry_delay": timedelta(minutes=1)},
) as dag:
Do you get no runs at all when you unpause your DAG, or do you get one backfilled run that says "Last run 2023-01-01, 00:00:00"?
In the latter case Airflow is behaving as intended: the run that just happened is the one that would actually have been queued and run at midnight on 2023-02-01. :)
I used your configuration on a new simple DAG and it gave me one backfilled successful run with the run ID scheduled__2023-01-01T00:00:00+00:00, i.e. a run for the data interval 2023-01-01 (the logical_date) to 2023-02-01. That is the run that would actually have been queued at midnight on 2023-02-01.
The next run is scheduled for the logical date 2023-02-01, which means it covers the data from 2023-02-01 to 2023-03-01. This run will only actually be queued at midnight on 2023-03-01, as the Run After date shows.
This guide might help with the terminology Airflow uses around schedules.
I'm assuming you wanted the DAG to backfill two runs, one that would have happened on 2023-01-01 and one that would have happened on 2023-02-01. This DAG should do that:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.empty import EmptyOperator

with DAG(
    "dag_name_3",
    start_date=datetime(2022, 12, 1),
    schedule_interval="@monthly",
    catchup=True,
    default_args={"retries": 5, "retry_delay": timedelta(minutes=1)},
) as dag:
    t1 = EmptyOperator(task_id="t1")
I am using Apache Airflow version 1.10.3 with the sequential executor, and I would like the DAG run to fail after a certain amount of time if it has not finished. I tried setting dagrun_timeout, as in the example code below:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2019, 6, 1),
    'retries': 0,
}

dag = DAG(
    'min_timeout',
    default_args=default_args,
    schedule_interval=timedelta(minutes=5),
    dagrun_timeout=timedelta(seconds=30),
    max_active_runs=1
)

t1 = BashOperator(
    task_id='fast_task',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='slow_task',
    bash_command='sleep 45',
    dag=dag)

t2.set_upstream(t1)
slow_task alone takes more than the time limit set by dagrun_timeout, so my understanding is that Airflow should stop the DAG run. However, this does not happen, and slow_task is allowed to run for its entire duration. Once it finishes, the run is marked as failed, but this does not kill the task or DAG as desired. Using execution_timeout on slow_task does cause the task to be killed at the specified time limit, but I would prefer an overall time limit for the DAG rather than specifying execution_timeout for each task.
Is there anything else I should try to achieve this behavior, or any mistakes I can fix?
The Airflow scheduler runs its loop at least every SCHEDULER_HEARTBEAT_SEC (the default is 5 seconds).
Bear in mind the "at least" here, because the scheduler performs some actions during each cycle that may delay the start of the next one.
These actions include:
parsing the DAG files
filling up the DagBag
checking each DagRun and updating its state
scheduling the next DagRun
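The heartbeat interval mentioned above is configurable in airflow.cfg; a minimal sketch showing the default value:

```ini
[scheduler]
# How often (in seconds) the scheduler should run its loop.
# Heavy DAG parsing can stretch an individual cycle beyond this.
scheduler_heartbeat_sec = 5
```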
In your example, the delayed task isn't terminated at the dagrun_timeout because the scheduler performs its next cycle after the task completes.
According to Airflow documentation:
dagrun_timeout (datetime.timedelta) – specify how long a DagRun should be up before timing out / failing, so that new DagRuns can be created. The timeout is only enforced for scheduled DagRuns, and only once the # of active DagRuns == max_active_runs.
So dagrun_timeout won't apply to non-scheduled DagRuns (e.g. manually triggered ones), nor while the number of active DagRuns is below the max_active_runs parameter.
I have the following Airflow DAG:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Runs every minute
dag = DAG(
    dag_id='example_dag',
    start_date=datetime(2020, 1, 1),
    schedule_interval='*/1 * * * *'
)

t1 = BashOperator(task_id='bash_task', bash_command='echo Hello!', dag=dag)
The problem is that Airflow is scheduling and executing tasks from past dates: the first minute of 2020, the second minute of 2020, the third minute of 2020, and so on.
I want Airflow to schedule and execute only the runs that occur after the DAG is deployed (i.e. if I deploy today, I want the first task to execute in the next minute), not to execute the expired ones.
Any advice? Thanks!
I found the answer here. Read the "Catchup and Idempotent DAG" section.
In the airflow.cfg file, I set min_file_process_interval to 120 seconds. The DAG has a schedule interval of one minute, but it is being scheduled only every 120 seconds (matching min_file_process_interval). Is this expected?
I changed min_file_process_interval to 200 seconds, and the DAG schedule started being picked up every 200 seconds.
Just to clarify, in the opposite case, i.e. a DAG schedule interval of 2 minutes and a min_file_process_interval of 1 minute, the DAG runs fine on its schedule.
Below is my dag:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 5, 14),
    'retries': 0,
}

dag = DAG('print', catchup=False, default_args=default_args,
          schedule_interval='*/1 * * * *')

t1 = BashOperator(
    task_id='print_date',
    bash_command='echo `date` , ',
    dag=dag)
Per the Airflow documentation:
In cases where there are only a small number of DAG definition files, the loop could potentially process the DAG definition files many times a minute. To control the rate of DAG file processing, the min_file_process_interval can be set to a higher value. This parameter ensures that a DAG definition file is not processed more often than once every min_file_process_interval seconds.
It means Airflow parses each DAG file at most once every min_file_process_interval seconds. Thus, you should set your DAG schedule interval to a multiple of min_file_process_interval.
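For reference, the setting lives under the [scheduler] section of airflow.cfg; a sketch using the value from the question:

```ini
[scheduler]
# A DAG definition file is parsed at most once per this many seconds,
# so a tighter DAG schedule cannot be honored more often than this.
min_file_process_interval = 120
```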
We are running Airflow version 1.9 in celery executor mode. Our task instances are stuck in retry mode: when a job fails, the task instance moves to retry, then tries to run the task, and then falls back to a new retry time.
First state:
Task is not ready for retry yet but will be retried automatically. Current date is 2018-08-28T03:46:53.101483 and task will be retried at 2018-08-28T03:47:25.463271.
After some time:
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
- The scheduler is down or under heavy load
If this task instance does not start soon please contact your Airflow administrator for assistance
After some time, it again went into retry mode:
Task is not ready for retry yet but will be retried automatically. Current date is 2018-08-28T03:51:48.322424 and task will be retried at 2018-08-28T03:52:57.893430.
This is happening for all DAGs. We created a test DAG and tried to collect logs from both the scheduler and the worker.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'Pramiti',
    'depends_on_past': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG('airflow-examples.test_failed_dag_v2', description='Failed DAG',
          schedule_interval='*/10 * * * *',
          start_date=datetime(2018, 9, 7), default_args=default_args)

b = BashOperator(
    task_id="ls_command",
    bash_command="mdr",  # intentionally invalid command so the task fails and retries
    dag=dag
)
Task Logs
Scheduler Logs
Worker Logs