I just started using Airflow and I want to run my DAG to load historical data, so I'm running this command:
airflow backfill my_dag -s 2018-07-30 -e 2018-08-01
Airflow runs my DAG only for 2018-07-30. My expectation was that it would run for 2018-07-30, 2018-07-31 and 2018-08-01.
Here's part of my DAG's code:
import airflow
import configparser
import os
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator
from airflow.models import Variable
from datetime import datetime
def getConfFileFullPath(fileName):
    return os.path.join(os.path.abspath(os.path.dirname(__file__)), fileName)

config = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
config.read([getConfFileFullPath('pipeline.properties')])

args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2018, 7, 25),
    'end_date': airflow.utils.dates.days_ago(1)
}

dag_id = 'my_dag'
dag = DAG(
    dag_id=dag_id, default_args=args,
    schedule_interval=None, catchup=False)
...
So am I doing anything wrong with my dag configuration?
Problem: schedule_interval=None
In order to initiate multiple runs within your defined date range, you need to set a schedule interval for the DAG. For example, try:
schedule_interval='@daily'
The start date, end date and schedule interval define how many runs the scheduler will initiate when backfill is executed.
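For example, a minimal sketch of the DAG from the question with a concrete schedule (the '@daily' preset here is just an assumption, pick whatever cadence matches your data):
dag = DAG(
    dag_id='my_dag',
    default_args=args,
    # one run per day, so the backfill range 2018-07-30..2018-08-01
    # should produce three runs instead of one
    schedule_interval='@daily',
    catchup=False)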
Airflow scheduling and presets
Related
I have an Airflow DAG set up to run monthly (with the @monthly schedule_interval). The next DAG runs seem to be scheduled, but they don't appear as "queued" in the Airflow UI. I don't understand why, because everything else seems fine. Here is how my DAG is configured:
with DAG(
    "dag_name",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@monthly",
    catchup=True,
    default_args={"retries": 5, "retry_delay": timedelta(minutes=1)},
) as dag:
Do you get no runs at all when you unpause your DAG, or one backfilled run that says Last Run 2023-01-01, 00:00:00?
In the latter case Airflow is behaving as intended: the run that just happened is the one that would actually have been queued and run at midnight on 2023-02-01. :)
I used your configuration on a new simple DAG and it gave me one successful backfilled run with the run ID scheduled__2023-01-01T00:00:00+00:00, i.e. running for the data interval 2023-01-01 (logical_date) to 2023-02-01, which is the run that would actually have been queued at midnight on 2023-02-01.
The next run is scheduled for the logical date 2023-02-01, covering the data from 2023-02-01 to 2023-03-01. That run will only actually be queued and executed at midnight on 2023-03-01, as the Run After date shows.
This guide might help with terminology Airflow uses around schedules.
I'm assuming you wanted the DAG to backfill two runs, one that would have happened on 2023-01-01 and one that would have happened on 2023-02-01. This DAG should do that:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.empty import EmptyOperator

with DAG(
    "dag_name_3",
    start_date=datetime(2022, 12, 1),
    schedule_interval="@monthly",
    catchup=True,
    default_args={"retries": 5, "retry_delay": timedelta(minutes=1)},
) as dag:
    t1 = EmptyOperator(task_id="t1")
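If you want to double-check which data interval each backfilled run covered, a small variation (my own sketch, not part of the answer above, and it assumes Airflow 2.2+ where the data_interval_* template variables exist) is to have the task print the templated dates:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash import BashOperator

with DAG(
    "dag_name_3_debug",  # hypothetical DAG id, for illustration only
    start_date=datetime(2022, 12, 1),
    schedule_interval="@monthly",
    catchup=True,
    default_args={"retries": 5, "retry_delay": timedelta(minutes=1)},
) as dag:
    # Each run logs its logical date and data interval, so you can see
    # which month a given backfilled run actually covers.
    t1 = BashOperator(
        task_id="print_interval",
        bash_command=(
            "echo logical_date={{ ds }} "
            "interval={{ data_interval_start }}..{{ data_interval_end }}"
        ),
    )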
I have the following airflow DAG:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
#Runs every 1 minute
dag = DAG(dag_id='example_dag', start_date=datetime(2020, 1, 1), schedule_interval='*/1 * * * *')
t1 = BashOperator(task_id='bash_task', bash_command='echo Hello!', dag=dag)
The problem is that Airflow is scheduling and executing tasks for past dates: the first minute of 2020, the second minute of 2020, the third minute of 2020, and so on.
I want Airflow to schedule and execute only the runs that occur after the DAG is deployed (i.e. if I deploy today, I want the first task to execute within the next minute), not to execute runs from the past.
Any advice? Thanks!
I found the answer here. Read the "Catchup and Idempotent DAG" section.
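For reference, a minimal sketch of the DAG from the question with catchup turned off (my adaptation, not code from the linked article):
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# catchup=False stops the scheduler from creating a run for every past
# minute since start_date; it only schedules from the most recent interval onwards.
dag = DAG(
    dag_id='example_dag',
    start_date=datetime(2020, 1, 1),
    schedule_interval='*/1 * * * *',
    catchup=False,
)

t1 = BashOperator(task_id='bash_task', bash_command='echo Hello!', dag=dag)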
I have scheduled a DAG with a simple bash task to run every 5th minute:
# bash_dag.py
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 5, 30)
}

dag = DAG(
    'bash_count',
    default_args=default_args,
    schedule_interval='*/5 * * * *',
    catchup=False
)

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag
)
Scheduling works fine and the DAG executes at every 5-minute threshold. However, I have noticed a significant delay between the 5-minute threshold and the time the task instance is queued. For the examples shown in the image, queueing takes between 3 and 50 seconds; for example, the last DAG execution in the image was supposed to be triggered after 20:05:00, but the task instance was not queued until 28 seconds later (20:05:28).
I'm surprised by this, since the DAG being scheduled has a single, very simple task. Is this a normal Airflow delay? Should I expect larger delays when dealing with more complex DAGs?
I'm running a local Airflow server with Postgres as the metadata database on a 16 GB Mac running macOS Mojave. The machine is not resource-constrained.
Is there a way to run a task if the upstream task succeeded or failed but not if the upstream was skipped?
I am familiar with trigger_rule with the all_done parameter, as mentioned in this other question, but that triggers the task when the upstream has been skipped. I only want the task to fire on the success or failure of the upstream task.
I don't believe there is a single trigger rule that covers both success and failure. What you could do is set up two downstream tasks, one with the trigger rule all_success and one with the trigger rule all_failed. That way, each of them is only triggered when the parent ahead of it succeeds or fails, respectively.
I have included code below for you to test for expected results easily.
So, say you have three tasks.
task1 is the task that can succeed or fail
task2 is your success-only task
task3 is your failure-only task
# dags/latest_only_with_trigger.py
import datetime as dt

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule

dag = DAG(
    dag_id='stackoverflowtest',
    schedule_interval=dt.timedelta(minutes=5),
    start_date=dt.datetime(2019, 2, 20)
)

task1 = DummyOperator(task_id='task1', dag=dag)
# runs only when task1 succeeds
task2 = DummyOperator(task_id='task2', dag=dag,
                      trigger_rule=TriggerRule.ALL_SUCCESS)
# runs only when task1 fails
task3 = DummyOperator(task_id='task3', dag=dag,
                      trigger_rule=TriggerRule.ALL_FAILED)

###### ORCHESTRATION ###
task2.set_upstream(task1)
task3.set_upstream(task1)
Hope this helps!
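As a side note (my addition, not part of the original answer), the same wiring can also be written with the bitshift syntax, which is the more common idiom nowadays:
# task1 fans out to the success-only and failure-only tasks.
task1 >> [task2, task3]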
We are running Airflow 1.9 with the Celery executor. Our task instances are stuck in retry mode: when the job fails, the task instance goes into retry, then attempts to run again, and then falls back to a new retry time.
First state:
Task is not ready for retry yet but will be retried automatically. Current date is 2018-08-28T03:46:53.101483 and task will be retried at 2018-08-28T03:47:25.463271.
After some time:
All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
- The scheduler is down or under heavy load
If this task instance does not start soon please contact your Airflow administrator for assistance
After some time, it went back into retry mode:
Task is not ready for retry yet but will be retried automatically. Current date is 2018-08-28T03:51:48.322424 and task will be retried at 2018-08-28T03:52:57.893430.
This is happening for all DAGs. We created a test DAG and gathered logs for both the scheduler and the worker.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'Pramiti',
    'depends_on_past': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG('airflow-examples.test_failed_dag_v2', description='Failed DAG',
          schedule_interval='*/10 * * * *',
          start_date=datetime(2018, 9, 7), default_args=default_args)

b = BashOperator(
    task_id="ls_command",
    bash_command="mdr",  # intentionally invalid command so the task fails and retries
    dag=dag
)
Task Logs
Scheduler Logs
Worker Logs