Airflow: DAG status is success even when no task ran

In 2 out of 10 runs, the DAG status is automatically set to success even though no task inside it ran. Below are the DAG args that were passed, and its tree view.
args = {
    'owner': 'xyz',
    'depends_on_past': False,
    'catchup': False,
    'start_date': datetime(2019, 7, 8),
    'email': ['a@b.c'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'provide_context': True,
    'retry_delay': timedelta(minutes=2)
}
And I am passing the DAG as a context manager like this:
with DAG(PARENT_DAG_NAME, default_args=args, schedule_interval='30 * * * *') as main_dag:
    task1 = DummyOperator(
        task_id='Raw_Data_Ingestion_Started',
    )
    task2 = DummyOperator(
        task_id='Raw_Data_Ingestion_Completed',
    )
    task1 >> task2
Any idea what could be the issue? Is it something I need to change in the config file? And this behaviour is not periodic.

According to the official Airflow documentation on DummyOperator:
Operator that does literally nothing. It can be used to group tasks in a DAG.
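For context, here is a minimal sketch of that usage (the extra ingestion tasks are made up for illustration); the dummy tasks do no work themselves and only mark the boundaries of a group of real tasks:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator

with DAG('grouping_example', start_date=datetime(2019, 7, 8),
         schedule_interval='30 * * * *', catchup=False) as dag:
    # Dummy tasks as start/end markers, as in the question above.
    start = DummyOperator(task_id='Raw_Data_Ingestion_Started')
    end = DummyOperator(task_id='Raw_Data_Ingestion_Completed')

    # Hypothetical real work grouped between the two markers.
    ingest_a = BashOperator(task_id='ingest_source_a', bash_command='echo a')
    ingest_b = BashOperator(task_id='ingest_source_b', bash_command='echo b')

    start >> [ingest_a, ingest_b] >> end
The markers make the tree and graph views easier to read without affecting what actually runs.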

Related

How to limit the number of DAG retries?

I have a DAG configured like this:
AIRFLOW_DEFAULT_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'dagrun_timeout': timedelta(hours=5)
}
DAILY_RUNNER = DAG(
    'daily_runner',
    max_active_runs=1,
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 17 * * *",
    default_args=AIRFLOW_DEFAULT_ARGS)
My current understanding is that retries means a task will be retried once before failing for good. Is there a way to set a similar limit for the number of times a DAG gets retried? If I have a DAG in the running state, I want to be able to set it to failed from within the UI once and have it stop rerunning.
Currently, there is no way to set retries at the DAG level.
Please refer to the answer below for retrying a set of tasks, or the whole DAG, in case of failures:
Can a failed Airflow DAG Task Retry with changed parameter
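To make the scope of retries concrete, here is a minimal sketch (the failing task is made up for illustration): the retries value in default_args is applied to each task individually, and dagrun_timeout only takes effect when passed to the DAG itself rather than via default_args; there is no equivalent knob that re-runs a whole DAG run.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

AIRFLOW_DEFAULT_ARGS = {
    'owner': 'airflow',
    'retries': 1,                        # each task is retried once on failure
    'retry_delay': timedelta(minutes=5),
}

DAILY_RUNNER = DAG(
    'daily_runner',
    max_active_runs=1,
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 17 * * *",
    dagrun_timeout=timedelta(hours=5),   # DAG-level setting, so it goes here
    default_args=AIRFLOW_DEFAULT_ARGS)

# This task would be retried once and then fail; the DAG run itself is not retried.
flaky = BashOperator(task_id='flaky_step', bash_command='exit 1', dag=DAILY_RUNNER)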

Apache Airflow DAG schedules at midnight UTC

I created an Apache Airflow DAG with the following default args. I want this DAG to run every day at 10 PM UTC, but it always runs at 12 AM UTC and ignores the time I set in start_date. Is this not the right way? Thanks.
default_args = {
    'owner': config.OWNER,
    'depends_on_past': False,
    'start_date': datetime(2018, 10, 14, 22, 0, 0),
    'email': [config.ALERT_EMAIL],
    'email_on_failure': True,
    'email_on_retry': False,
    'retry_delay': timedelta(minutes=1),
    'retries': 2,
}
# DAG
dag = DAG('Test',
          default_args=default_args,
          description='Initial setup',
          schedule_interval='@daily')
You can also use the cron format in your schedule_interval argument, like this:
# DAG
dag = DAG('Test',
          default_args=default_args,
          description='Initial setup',
          schedule_interval='0 22 * * *')
Regarding the schedule_interval you have at least three options:
datetime.timedelta
dateutil.relativedelta
cron style string
The schedule_interval defines how often the DAG runs. This timedelta object is added to your latest task instance's execution_date to figure out the next schedule. And keep in mind that the start_date for the task determines the execution_date of the first task instance.
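For intuition, a small sketch using croniter (which Airflow itself depends on) of why '@daily' fires at midnight and ignores the time component of start_date, while the explicit cron string fires at 22:00:
from datetime import datetime
from croniter import croniter

start = datetime(2018, 10, 14, 22, 0, 0)

# '@daily' is shorthand for '0 0 * * *', so the next fire time is midnight,
# regardless of the 22:00 in start_date.
print(croniter('0 0 * * *', start).get_next(datetime))   # 2018-10-15 00:00:00
print(croniter('0 22 * * *', start).get_next(datetime))  # 2018-10-15 22:00:00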
All of the above is correct.
I have encountered an issue where, in Airflow 2.0, schedule_interval is ignored when placed in the default_args. When I removed it from there and put it in the DAG declaration, everything worked. I could verify this by looking at the DAG details in the UI.
Example:
default_args = {
    'owner': 'Hector Hoffman',
    'depends_on_past': False,
    'start_date': start_date,
    'schedule_interval': '0 5 * * *',
    'email': ['hector@email.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 0,
    'on_failure_callback': task_fail_slack_alert
}
Results in the schedule_interval being ignored (screenshot of the DAG details omitted).
Whereas, when I put it in the DAG:
with models.DAG(
    "dealstampede_workflow",
    default_args=default_args,
    catchup=False,
    schedule_interval='0 5 * * *'
) as dag:
Results in the schedule taking effect (screenshot of the DAG details omitted).
If anyone has any insight as to why schedule_interval doesn't work in default_args, I'd appreciate feedback. Thanks.
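One likely explanation (my reading, not confirmed in this thread): default_args only supplies default keyword arguments to the operators in the DAG, so DAG-level settings such as schedule_interval, catchup or max_active_runs are silently ignored there and have to be passed to DAG() directly. A minimal sketch (start_date and the dummy task are made up):
from datetime import datetime
from airflow import models
from airflow.operators.dummy import DummyOperator

# Operator-level defaults only.
default_args = {
    'owner': 'Hector Hoffman',
    'retries': 0,
}

# DAG-level settings go on the DAG itself.
with models.DAG(
    'dealstampede_workflow',
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    catchup=False,
    schedule_interval='0 5 * * *',
) as dag:
    DummyOperator(task_id='noop')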

Apache Airflow Task Instance state is blank

I have the DAG configured like below:
args = {
    'owner': 'XXX',
    'depends_on_past': False,
    'start_date': datetime(2018, 2, 26),
    'email': ['sample@sample.com'],
    'email_on_failure': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
dag = DAG(dag_id='Daily_Report',
          default_args=args,
          schedule_interval='0 11 * * *',
          dagrun_timeout=timedelta(seconds=30))
I have a BashOperator and a Databricks operator:
run_this = BashOperator(task_id='run_report',
                        bash_command=templated_command,
                        dag=dag)

notebook_run = DatabricksSubmitRunOperator(
    task_id='notebook_run',
    notebook_task=notebook_task,
    existing_cluster_id='xxxx',
    dag=dag)
I'm wiring them up with run_this.set_downstream(notebook_run).
The BashOperator runs fine, but the Databricks operator doesn't run; it just shows a blank state like below:
(screenshot: blank task instance status in the Airflow UI)
Anything I'm missing? I'm using the Airflow version from Databricks here: https://github.com/databricks/incubator-airflow
Try highlighting the text in the white label. It will probably say "None". White on white is terrible UX so I'm not sure why Airflow does it that way.

How to pass parameters to an HQL run via Airflow

I would like to know how I can pass a parameter to a Hive query script run via Airflow. If I want to add a parameter only for this script, say target_db = mydatabase, how can I do that? Do I need to add it to the default_args and then call it in the op_kwargs of the script?
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': datetime(2017, 11, 1),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}
dag = DAG(dag_name, default_args=default_args, schedule_interval="@daily")
t_add_step = PythonOperator(
    task_id='add__step',
    provide_context=True,
    python_callable=add_emr_step,
    op_kwargs={
        'aws_conn_id': dag_params['aws_conn_id'],
        'create_job_flow_task': 'create_emr_flow',
        'get_step_task': 'get_email_step'
    },
    dag=dag
)
Assuming you are invoking Hive using BashOperator, it would look something like this:
...
set_hive_db = BashOperator(
    bash_command="""
    hive --database {{params.database}} -f {{params.hql_file}}
    """,
    params={
        "database": "testingdb",
        "hql_file": "myhql.hql"
    },
    dag=dag
)
...
Another approach would be to put USE database inside your hql file and just call hive -f hqlfile.hql in your BashOperator.
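If you want to keep the database as a real parameter instead of hard-coding it, here is a sketch using Hive's --hiveconf mechanism (the DAG id, task id and start_date are made up; target_db is the parameter from the question):
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('hive_param_example', start_date=datetime(2017, 11, 1),
          schedule_interval='@daily')

# Inside myhql.hql, reference the value as:  USE ${hiveconf:target_db};
run_hql = BashOperator(
    task_id='run_hql_with_param',
    bash_command="""
    hive --hiveconf target_db={{ params.target_db }} -f {{ params.hql_file }}
    """,
    params={
        'target_db': 'mydatabase',
        'hql_file': 'myhql.hql',
    },
    dag=dag,
)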

unwanted DAG runs in Airflow

I configured my DAG like this:
default_args = {
    'owner': 'Aviv',
    'depends_on_past': False,
    'start_date': datetime(2017, 1, 1),
    'email': ['aviv@oron.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=1)
}
dag = DAG(
    'MyDAG',
    schedule_interval=timedelta(minutes=3),
    default_args=default_args,
    catchup=False
)
and for some reason, when I un-pause the DAG, it is executed twice immediately.
Any idea why? And is there any rule I can apply to tell this DAG to never run more than once at the same time?
You can specify max_active_runs like this:
dag = airflow.DAG(
    'customer_staging',
    schedule_interval="@daily",
    dagrun_timeout=timedelta(minutes=60),
    template_searchpath=tmpl_search_path,
    default_args=args,
    max_active_runs=1)
I've never seen it happening. Are you sure that those runs are not backfills? See: https://stackoverflow.com/a/47953439/9132848
I think it's because you have missed the scheduled time and Airflow is backfilling it automatically when you turn it on again. You can disable this by setting
catchup_by_default = False in airflow.cfg.
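Note that catchup_by_default in airflow.cfg changes the global default; the same thing can be pinned per DAG, and combined with max_active_runs so that only one run is ever in flight. A minimal sketch based on the DAG from the question:
from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'Aviv',
    'start_date': datetime(2017, 1, 1),
    'retries': 0,
}

dag = DAG(
    'MyDAG',
    schedule_interval=timedelta(minutes=3),
    default_args=default_args,
    catchup=False,       # don't backfill missed intervals when un-paused
    max_active_runs=1,   # never more than one active run at a time
)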
