How to enable a SubDAG in Airflow?

The Airflow documentation mentions the following:
"SubDAGs must have a schedule and be enabled. Even though subdags are triggered as part of a larger dag, if their schedule is set to None or '@once', the subdag operator will succeed without doing anything."
But it is not clear how we can enable the SubDAGs. Is there a way to enable a SubDAG?

You can create a SubDAG like this:
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

default_args = {
    'email_on_failure': False,
    'email_on_retry': False,
    'start_date': datetime(2017, 12, 16),
}

schedule_interval = "@daily"

def create_subdag(main_dag, subdag_id):
    # The subdag's dag_id must be the parent dag_id plus a dot plus the subdag id.
    subdag = DAG('{0}.{1}'.format(main_dag.dag_id, subdag_id),
                 default_args=default_args)
    DummyOperator(
        task_id='foo',
        dag=subdag)
    return subdag

main_dag = DAG(
    dag_id='main_dag',
    schedule_interval=schedule_interval,
    default_args=default_args,
    max_active_runs=1
)

my_subdag = SubDagOperator(
    task_id='subdag',
    dag=main_dag,
    retries=3,
    subdag=create_subdag(main_dag, 'subdag')
)
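Note that, per the documentation quoted in the question, the subdag itself also needs a schedule; if it has none, the SubDagOperator may succeed without doing anything. One common way to satisfy this (a sketch, not necessarily the original answerer's intent) is to give the subdag the parent's schedule_interval inside create_subdag:

def create_subdag(main_dag, subdag_id):
    # Give the subdag the same schedule as its parent so it is not treated
    # as unscheduled (based on the docs quote above).
    subdag = DAG('{0}.{1}'.format(main_dag.dag_id, subdag_id),
                 schedule_interval=main_dag.schedule_interval,
                 default_args=default_args)
    DummyOperator(task_id='foo', dag=subdag)
    return subdag

Also keep in mind that the subdag is registered as a DAG named main_dag.subdag, and like any DAG it has to be enabled (unpaused) for its tasks to run.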

Related

Airflow Task triggered manually but remains in queued state

I am using Airflow 2.3.1 with the LocalExecutor and MS SQL Server as the metadata DB.
I am trying to trigger a DAG manually; it shows as queued but nothing happens. There are no other tasks running when this DAG is triggered. When I hover over the task, it says "Not yet started".
I tried restarting the scheduler and webserver, but nothing changed. The code of the DAG is as follows:
from datetime import datetime, timedelta

import pendulum

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2022, 5, 27),
    'email': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

MIDT_dag = DAG(
    'Dag_1',
    default_args=default_args,
    catchup=False,
    description='Test DAG',
    schedule_interval=timedelta(days=1),
)

task_1 = BashOperator(
    task_id='first_task',
    bash_command=r"/srv/python3_8_13/venv/bin/python /srv/source_code/InputToRawMIDT_Amadeus_Spark_Linux.py",
    dag=MIDT_dag,
)

task_2 = BashOperator(
    task_id='second_task',
    bash_command='echo Testing',
    dag=MIDT_dag,
)

task_1 >> task_2
Appreciate any help.
Thanks
Manoj George
It seems like your DAG is paused (disabled).
Open the UI, choose DAGs in the menu, and enable it with the toggle next to the DAG name.
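If you prefer the command line, the same can be done with the Airflow 2.x CLI (substituting your own dag_id, here the Dag_1 from your file):

airflow dags unpause Dag_1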

Implementing cross-DAG dependency in Apache airflow

I am trying to implement a dependency between two DAGs, say A and B. DAG A runs once every hour and DAG B runs every 15 minutes.
Each time DAG B starts its run, I want to make sure DAG A is not in a running state.
If DAG A is found to be running, then DAG B has to wait until DAG A completes the run.
If DAG A is not running, DAG B can proceed with its tasks.
DAG A:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'dependency',
    'depends_on_past': False,
    'start_date': datetime(2020, 9, 10, 10, 1),
    'email': ['xxxx.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('DAG_A', schedule_interval='0/60 * * * *', max_active_runs=1, catchup=False,
         default_args=default_args) as dag:
    task1 = DummyOperator(task_id='task1', retries=1, dag=dag)
    task2 = DummyOperator(task_id='task2', retries=1, dag=dag)
    task3 = DummyOperator(task_id='task3', retries=1, dag=dag)
    task1 >> task2 >> task3
DAG B:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'dependency',
    'depends_on_past': False,
    'start_date': datetime(2020, 9, 10, 10, 1),
    'email': ['xxxx.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('DAG_B', schedule_interval='0/15 * * * *', max_active_runs=1, catchup=False,
         default_args=default_args) as dag:
    task4 = DummyOperator(task_id='task4', retries=1, dag=dag)
    task5 = DummyOperator(task_id='task5', retries=1, dag=dag)
    task6 = DummyOperator(task_id='task6', retries=1, dag=dag)
    task4 >> task5 >> task6
I have tried using the ExternalTaskSensor operator, but I am unable to understand its behaviour: does the sensor trigger the next task as soon as it finds DAG A in the success state, or does it otherwise wait for the run to complete?
Thanks in advance.
I think the only way you can achieve that in a "general" way is to use some external locking mechanism.
You can achieve quite a good approximation, though, using pools:
https://airflow.apache.org/docs/apache-airflow/1.10.3/concepts.html?highlight=pool
If you set the pool size to 1 and assign the tasks of both DAG A and DAG B to that pool, only one of them can run at a time. You can also set priority_weight in whatever way you see best, in case you need to prioritise A over B or the other way round. A minimal sketch of this idea follows.
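For illustration only, assuming a pool named dag_ab_lock has already been created with a single slot (Admin -> Pools in the UI), the tasks of both DAGs would simply reference it:

from airflow.operators.dummy_operator import DummyOperator

# In DAG_A and DAG_B alike: every task that must not overlap with the other
# DAG is assigned to the shared single-slot pool. priority_weight is optional
# and only matters when tasks from both DAGs are queued on the pool at once.
task1 = DummyOperator(
    task_id='task1',
    pool='dag_ab_lock',    # hypothetical pool name, created beforehand
    priority_weight=10,    # e.g. give DAG_A's tasks precedence over DAG_B's
    dag=dag,
)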
You could use ExternalTaskSensor to achieve what you are looking for. The key aspect is to initialize this sensor with the correct execution_date, which in your example is the execution_date of the last DagRun of DAG_A.
Check this example, where DAG_A runs every 9 minutes for 200 seconds and DAG_B runs every 3 minutes for 30 seconds. These values are arbitrary and only for demo purposes; they could be pretty much anything.
DAG A (nothing new here):
import time

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

def _executing_task(**kwargs):
    print("Starting task_a")
    time.sleep(200)
    print("Completed task_a")

dag = DAG(
    dag_id="example_external_task_sensor_a",
    default_args={"owner": "airflow"},
    start_date=days_ago(1),
    schedule_interval="*/9 * * * *",
    tags=['example_dags'],
    catchup=False
)

with dag:
    start = DummyOperator(
        task_id='start')
    task_a = PythonOperator(
        task_id='task_a',
        python_callable=_executing_task,
    )
    chain(start, task_a)
DAG B:
import time

from airflow import DAG
from airflow.utils.db import provide_session
from airflow.models.dag import get_last_dagrun
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.sensors.external_task import ExternalTaskSensor

def _executing_task():
    time.sleep(30)
    print("Completed task_b")

@provide_session
def _get_execution_date_of_dag_a(exec_date, session=None, **kwargs):
    dag_a_last_run = get_last_dagrun(
        'example_external_task_sensor_a', session)
    print(dag_a_last_run)
    print(f"EXEC DATE: {dag_a_last_run.execution_date}")
    return dag_a_last_run.execution_date

dag = DAG(
    dag_id="example_external_task_sensor_b",
    default_args={"owner": "airflow"},
    start_date=days_ago(1),
    schedule_interval="*/3 * * * *",
    tags=['example_dags'],
    catchup=False
)

with dag:
    start = DummyOperator(
        task_id='start')
    wait_for_dag_a = ExternalTaskSensor(
        task_id='wait_for_dag_a',
        external_dag_id='example_external_task_sensor_a',
        allowed_states=['success', 'failed'],
        execution_date_fn=_get_execution_date_of_dag_a,
        poke_interval=30
    )
    task_b = PythonOperator(
        task_id='task_b',
        python_callable=_executing_task,
    )
    chain(start, wait_for_dag_a, task_b)
We are using the execution_date_fn parameter of the ExternalTaskSensor to obtain the execution_date of the last DagRun of DAG_A. If we don't do so, the sensor will wait for a DAG_A run with the same execution_date as the current run of DAG_B, which may not exist in many cases.
The function _get_execution_date_of_dag_a queries the metadata DB to obtain that execution date, using get_last_dagrun from the Airflow models.
Finally, the other important parameter is allowed_states=['success', 'failed'], where we are telling it to wait until DAG_A is found in one of those states (i.e. if DAG_A is still running, the sensor will keep poking).
Try it out and let me know if it works for you!
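One small addition that is not part of the original answer: if the wait can be long, Airflow sensors also support mode='reschedule', which releases the worker slot between pokes instead of keeping it occupied. Applied to the sensor above, it is just one extra argument:

wait_for_dag_a = ExternalTaskSensor(
    task_id='wait_for_dag_a',
    external_dag_id='example_external_task_sensor_a',
    allowed_states=['success', 'failed'],
    execution_date_fn=_get_execution_date_of_dag_a,
    poke_interval=30,
    mode='reschedule',  # free the worker slot between pokes
)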

Airflow tasks not getting to the running state

I am trying to run a simple BashOperator task in Airflow. When the DAG is triggered manually, it lists the tasks in the Tree and Graph views, but the tasks always stay in the not-started state.
I have restarted my Airflow scheduler. I am running Airflow on localhost using a Kubectl image on Docker Compose.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['vijayraghunath21@gmail.com'],
    'email_on_success': True,
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=2),
}

with DAG(
    dag_id='bash_demo',
    default_args=default_args,
    description='Bash Demo',
    start_date=datetime(2021, 1, 1),
    # schedule_interval='0 2 * * *',
    schedule_interval=None,
    max_active_runs=1,
    catchup=False,
    tags=['bash_demo'],
) as dag:
    dag.doc_md = __doc__

    # Task 1
    dummy_task = DummyOperator(task_id='dummy_task')

    # Task 2
    bash_task = BashOperator(
        task_id='bash_task',
        bash_command="echo 'command executed from BashOperator'")

    dummy_task >> bash_task
DAG Image
As shown in the image you added, the DAG is set to off, thus it's not running. You should click the toggle button to set it to on.
This issue can be avoided in two ways:
Global solution: set dags_are_paused_at_creation = False in airflow.cfg (see the snippet after the code below). This will affect all DAGs in the system.
Local solution: use is_paused_upon_creation in the DAG constructor:
with DAG(
    dag_id='bash_demo',
    ...
    is_paused_upon_creation=False,
) as dag:
This parameter specifies whether the DAG is paused when it is created for the first time. If the DAG already exists, the parameter is ignored.
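For completeness, the global option mentioned above goes into airflow.cfg; as a sketch (the setting sits under the [core] section):

[core]
dags_are_paused_at_creation = False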

Apache-Airflow - Task is in the none state when running DAG

I just started with Airflow and wanted to run a simple DAG with a BashOperator that outputs 'Hello' to the console.
I noticed that my DAG run is indefinitely stuck in 'Running'.
When I go to the task details, I get this:
Task is in the 'None' state which is not a valid state for execution. The task must be cleared in order to be run.
Any suggestions or hints are much appreciated.
Dag:
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'dude_whose_doors_open_like_this_-W-',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'email': ['yessure@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'Test',
    default_args=default_args,
    description='Test',
    schedule_interval=timedelta(days=1)
)

t1 = BashOperator(
    task_id='ECHO',
    bash_command='echo "Hello"',
    dag=dag
)

t1
I've managed to solve it by adding 'start_date': dt(1970, 1, 1), to the default_args object and adding schedule_interval=None to my DAG object.
Could you remove the last line, the bare t1? It isn't necessary. Also, start_date shouldn't be set dynamically (e.g. with days_ago), as this can lead to problems with the scheduling.
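Putting both suggestions together, a minimal corrected version of the DAG (a sketch only, keeping the original operator) would use a static start_date and drop the trailing t1 line:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'dude_whose_doors_open_like_this_-W-',
    'start_date': datetime(1970, 1, 1),  # static start_date instead of days_ago(2)
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'Test',
    default_args=default_args,
    description='Test',
    schedule_interval=None,  # trigger manually, or set a fixed interval
)

t1 = BashOperator(
    task_id='ECHO',
    bash_command='echo "Hello"',
    dag=dag,
)
# No trailing bare t1 line is needed.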

Airflow + Sentry - no information from dags/tasks

I am trying to start using Sentry to grab information from Airflow. I am using the newest version of Airflow (since v1.10.6, Sentry is integrated with Airflow). However, I am not able to get any information about the DAG or task status.
I prepared a simple DAG which should fail, but I don't receive anything on Sentry. The connection is established, because when I make a typo, for example in the imports, the error information is caught by Sentry. For this example I used the SequentialExecutor.
from datetime import timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.trigger_rule import TriggerRule
from airflow.utils.dates import days_ago
from airflow import AirflowException

################################################################################
# dag
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(seconds=3),
}

dag = DAG(
    'debug_sentry',
    default_args=default_args,
    schedule_interval=None,
)

################################################################################
# first_task
def _first_task_callable(*args, **kwargs):
    pass

first_task = PythonOperator(
    task_id='first_task',
    python_callable=_first_task_callable,
    provide_context=True,
    trigger_rule=TriggerRule.ONE_SUCCESS,
    dag=dag
)

################################################################################
# second_task_which_fails
def _second_task_which_fails_callable(*args, **kwargs):
    a = 1
    b = 0
    c = a / b
    return c

second_task_which_fails = PythonOperator(
    task_id='second_task_which_fails',
    python_callable=_second_task_which_fails_callable,
    provide_context=True,
    trigger_rule=TriggerRule.ONE_SUCCESS,
    dag=dag
)

################################################################################
# third_task
def _third_task_callable(*args, **kwargs):
    pass

third_task = PythonOperator(
    task_id='third_task',
    python_callable=_third_task_callable,
    provide_context=True,
    trigger_rule=TriggerRule.ONE_SUCCESS,
    dag=dag
)

################################################################################
first_task >> second_task_which_fails >> third_task
What did I do wrong, or did I miss something in the configuration in airflow.cfg?
[sentry]
sentry_dsn = https://<my_dsn>
There was a recent fix to the Sentry integration in Airflow, as per https://github.com/apache/airflow/pull/7232. Try updating Airflow to a version that includes this commit?
