This is my DAG file in the dags folder.
"""
Code that goes along with the Airflow tutorial located at:
http://airflow.readthedocs.org/en/latest/tutorial.html
"""
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from work_file import Test


class Main(Test):
    def __init__(self):
        super(Test, self).__init__()

    def create_dag(self):
        default_args = {
            "owner": "airflow",
            "depends_on_past": False,
            "start_date": datetime(2015, 6, 1),
            "email": ["airflow@airflow.com"],
            "email_on_failure": False,
            "email_on_retry": False,
            "retries": 1,
            "retry_delay": timedelta(minutes=5),
            # 'queue': 'bash_queue',
            # 'pool': 'backfill',
            # 'priority_weight': 10,
            # 'end_date': datetime(2016, 1, 1),
        }

        dag = DAG("python_dag", default_args=default_args, schedule_interval='0 * * * *')

        dummy_task = DummyOperator(task_id='dummy_task', retries=3)
        python_task = PythonOperator(task_id='python_task', python_callable=self.my_func)

        dummy_task >> python_task


if __name__ == "__main__":
    a = Main()
    a.create_dag()
This is my other file work_file.py which is in the same dags folder.
class Test:
    def __init__(self):
        pass

    def my_func(self):
        return "Hello"
Aim: The aim is to call my_func from my DAG file.
Problem: There is no error in the UI, but my DAG python_dag is not visible there.
My webserver and scheduler are both running; I've tried restarting them, but nothing changed.
I have also imported the file (from work_file import Test).
Thanks in advance!
There are multiple problems with the DAG:
The operators are not assigned to any DAG. Add dag=dag to the constructors. E.g., DummyOperator(..., dag=dag).
create_dag() does not return the DAG. Add return dag.
The DAG script is not executed as top-level code, so the module's __name__ is never '__main__' and create_dag() is never called. Remove the if __name__ == "__main__": guard.
The DAG objects must be in the global namespace of the module. Assign the return value of create_dag() to a variable: dag = a.create_dag().
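Putting those fixes together, the DAG file would look roughly like this (a sketch based on the code above, with default_args trimmed for brevity):
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from work_file import Test


class Main(Test):
    def create_dag(self):
        default_args = {
            "owner": "airflow",
            "start_date": datetime(2015, 6, 1),
            "retries": 1,
            "retry_delay": timedelta(minutes=5),
        }
        dag = DAG("python_dag", default_args=default_args, schedule_interval='0 * * * *')

        # dag=dag attaches the operators to the DAG
        dummy_task = DummyOperator(task_id='dummy_task', retries=3, dag=dag)
        python_task = PythonOperator(task_id='python_task', python_callable=self.my_func, dag=dag)

        dummy_task >> python_task
        return dag


# Executed at import time, so the DAG object ends up in the module's global namespace.
a = Main()
dag = a.create_dag()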
Let's take an example DAG.
Here is the code for it.
import logging
from airflow import DAG
from datetime import datetime, timedelta
from airflow.models import TaskInstance
from airflow.operators.python import PythonOperator
from airflow.operators.dummy import DummyOperator


def task_failure_notification_alert(context):
    logging.info("Task context details: %s", str(context))


def dag_failure_notification_alert(context):
    logging.info("DAG context details: %s", str(context))


def red_exception_task(ti: TaskInstance, **kwargs):
    raise Exception('red')


default_args = {
    "owner": "analytics",
    "start_date": datetime(2021, 12, 12),
    'retries': 0,
    'retry_delay': timedelta(),
    "schedule_interval": "@daily"
}

dag = DAG('logger_dag',
          default_args=default_args,
          catchup=False,
          on_failure_callback=dag_failure_notification_alert
          )

start_task = DummyOperator(task_id="start_task", dag=dag, on_failure_callback=task_failure_notification_alert)

red_task = PythonOperator(
    dag=dag,
    task_id='red_task',
    python_callable=red_exception_task,
    provide_context=True,
    on_failure_callback=task_failure_notification_alert
)

end_task = DummyOperator(task_id="end_task", dag=dag, on_failure_callback=task_failure_notification_alert)

start_task >> red_task >> end_task
Two callback functions, task_failure_notification_alert and dag_failure_notification_alert, are wired up to be called on failures.
In case of a task failure, the logs from the task-level on_failure_callback show up in the task's log in the UI.
But I am unable to find the logs for the DAG-level on_failure_callback anywhere in the UI. Where can I see them?
Under airflow/logs, find the scheduler folder; inside it, look for the date on which you ran the DAG (for example 2022-12-03), and there you will see a log file named after the DAG file (typically something like $AIRFLOW_HOME/logs/scheduler/2022-12-03/<dag_file>.py.log). The DAG-level on_failure_callback is executed by the scheduler's DAG-processing code rather than by a task, which is why its output ends up there instead of in the task logs.
I am using Airflow 2.3.4.
I am triggering the DAG with a config. When I hardcode the config values, this DAG runs successfully.
But when I trigger it after passing a config, my tasks never start, yet the run status turns green (Success).
Please help me understand what's going wrong!
from datetime import datetime, timedelta
from airflow import DAG
from pprint import pprint
from airflow.operators.python import PythonOperator
from operators.jvm import JVMOperator

args = {
    'owner': 'satyam',
    'depends_on_past': False,
    'start_date': datetime.utcnow(),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag_params = {
    'dag_id': 'synthea_etl_end_to_end_with_config',
    'start_date': datetime.utcnow(),
    'end_date': datetime(2025, 2, 5),
    'default_args': args,
    'schedule_interval': timedelta(hours=4)
}

dag = DAG(**dag_params)


# [START howto_operator_python]
def print_context(ds, **kwargs):
    """Print the Airflow context and ds variable from the context."""
    pprint(kwargs)
    pprint(ds)
    return 'Whatever you return gets printed in the logs'


jvm_task = JVMOperator(
    task_id='jvm_task',
    correlation_id='123456',
    jar='/home/i1136/Synthea/synthea-with-dependencies.jar',
    options={
        'java_args': [''],
        'jar_args': ["-p {{ dag_run.conf['population_count'] }} --exporter.fhir.export {{ dag_run.conf['fhir'] }} --exporter.ccda.export {{ dag_run.conf['ccda'] }} --exporter.csv.export {{ dag_run.conf['csv'] }} --exporter.csv.append_mode {{ dag_run.conf['csv'] }} --exporter.baseDirectory /home/i1136/Synthea/output_dag_config"]
    })

print_context_task = PythonOperator(task_id='print_context_task', provide_context=True, python_callable=print_context, dag=dag)

jvm_task.set_downstream(print_context_task)
The problem is 'start_date': datetime.utcnow(), which is always >= the dag_run start date; in that case Airflow will mark the run as succeeded without running it. For this variable it's better to choose the minimum date of your runs. If you don't have one, you can use yesterday's date, but note that the next day you will not be able to re-run the tasks that failed on the previous day:
import pendulum

dag_params = {
    ...,
    'start_date': pendulum.yesterday(),
    ...,
}
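If the goal is simply to avoid the always-moving start date, another common option (a sketch, not from the original answer; the 2022-01-01 date is an arbitrary example, and it reuses the DAG import and args dict from the file above) is a fixed date in the past combined with catchup=False, so Airflow does not backfill every interval since that date:
from datetime import datetime, timedelta

dag_params = {
    'dag_id': 'synthea_etl_end_to_end_with_config',
    'start_date': datetime(2022, 1, 1),      # any fixed date before your first run
    'end_date': datetime(2025, 2, 5),
    'default_args': args,                    # args dict defined earlier in the file
    'schedule_interval': timedelta(hours=4),
    'catchup': False,                        # skip the runs between start_date and now
}
dag = DAG(**dag_params)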
In my case it was a small bug in the Python script, not detected by Airflow even after refreshing.
I am trying to implement a dependency between 2 DAGs, say A and B. DAG A runs once every hour and DAG B runs every 15 minutes.
Each time DAG B starts its run, I want to make sure DAG A is not in a running state.
If DAG A is found to be running, then DAG B has to wait until DAG A completes its run.
If DAG A is not running, DAG B can proceed with its tasks.
DAG A:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'dependency',
    'depends_on_past': False,
    'start_date': datetime(2020, 9, 10, 10, 1),
    'email': ['xxxx.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('DAG_A', schedule_interval='0/60 * * * *', max_active_runs=1, catchup=False,
         default_args=default_args) as dag:
    task1 = DummyOperator(task_id='task1', retries=1, dag=dag)
    task2 = DummyOperator(task_id='task2', retries=1, dag=dag)
    task3 = DummyOperator(task_id='task3', retries=1, dag=dag)
    task1 >> task2 >> task3
DAG B:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'dependency',
    'depends_on_past': False,
    'start_date': datetime(2020, 9, 10, 10, 1),
    'email': ['xxxx.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('DAG_B', schedule_interval='0/15 * * * *', max_active_runs=1, catchup=False,
         default_args=default_args) as dag:
    task4 = DummyOperator(task_id='task4', retries=1, dag=dag)
    task5 = DummyOperator(task_id='task5', retries=1, dag=dag)
    task6 = DummyOperator(task_id='task6', retries=1, dag=dag)
    task4 >> task5 >> task6
I have tried using the ExternalTaskSensor operator, but I am unable to understand whether the sensor triggers the next task when it finds DAG A in a success state, or otherwise waits for the run to complete.
Thanks in advance.
I think the only way you can achieve that in a "general" way is to use some external locking mechanism.
You can get quite a good approximation, though, using pools:
https://airflow.apache.org/docs/apache-airflow/1.10.3/concepts.html?highlight=pool
If you set the pool size to 1 and assign the tasks of both DAG A and DAG B to that pool, only one of them can be running at a time. You can also add priority_weight in whatever way you see best, in case you need to prioritise A over B or the other way round.
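A minimal sketch of that pool approach, assuming a pool named dag_a_b_lock with a single slot has already been created in the UI (or, on Airflow 2.x, with airflow pools set dag_a_b_lock 1 "serialize A and B"):
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

SHARED_POOL = 'dag_a_b_lock'   # hypothetical pool name, configured with 1 slot

with DAG('DAG_A', schedule_interval='0/60 * * * *', max_active_runs=1,
         catchup=False, start_date=datetime(2020, 9, 10, 10, 1)) as dag:
    # Every task claims the single slot, so while a DAG_A task holds it,
    # DAG_B tasks assigned to the same pool stay queued.
    task1 = DummyOperator(task_id='task1', pool=SHARED_POOL, priority_weight=10)
    task2 = DummyOperator(task_id='task2', pool=SHARED_POOL, priority_weight=10)
    task1 >> task2
DAG_B's tasks would get the same pool= but a lower priority_weight (e.g. 1) if A should win when both are queued. Note that this serialises individual tasks rather than whole DAG runs, which is why it is only an approximation.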
You could use the ExternalTaskSensor to achieve what you are looking for. The key aspect is to initialize this sensor with the correct execution_date, which in your example is the execution_date of the last DagRun of DAG_A.
Check this example, where DAG_A runs every 9 minutes for 200 seconds and DAG_B runs every 3 minutes for 30 seconds. These values are arbitrary and only for demo purposes; they could be pretty much anything.
DAG A (nothing new here):
import time

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago


def _executing_task(**kwargs):
    print("Starting task_a")
    time.sleep(200)
    print("Completed task_a")


dag = DAG(
    dag_id="example_external_task_sensor_a",
    default_args={"owner": "airflow"},
    start_date=days_ago(1),
    schedule_interval="*/9 * * * *",
    tags=['example_dags'],
    catchup=False
)

with dag:
    start = DummyOperator(
        task_id='start')

    task_a = PythonOperator(
        task_id='task_a',
        python_callable=_executing_task,
    )

    chain(start, task_a)
DAG B:
import time

from airflow import DAG
from airflow.utils.db import provide_session
from airflow.models.dag import get_last_dagrun
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
from airflow.sensors.external_task import ExternalTaskSensor


def _executing_task():
    time.sleep(30)
    print("Completed task_b")


@provide_session
def _get_execution_date_of_dag_a(exec_date, session=None, **kwargs):
    dag_a_last_run = get_last_dagrun(
        'example_external_task_sensor_a', session)
    print(dag_a_last_run)
    print(f"EXEC DATE: {dag_a_last_run.execution_date}")
    return dag_a_last_run.execution_date


dag = DAG(
    dag_id="example_external_task_sensor_b",
    default_args={"owner": "airflow"},
    start_date=days_ago(1),
    schedule_interval="*/3 * * * *",
    tags=['example_dags'],
    catchup=False
)

with dag:
    start = DummyOperator(
        task_id='start')

    wait_for_dag_a = ExternalTaskSensor(
        task_id='wait_for_dag_a',
        external_dag_id='example_external_task_sensor_a',
        allowed_states=['success', 'failed'],
        execution_date_fn=_get_execution_date_of_dag_a,
        poke_interval=30
    )

    task_b = PythonOperator(
        task_id='task_b',
        python_callable=_executing_task,
    )

    chain(start, wait_for_dag_a, task_b)
We are using the execution_date_fn param of the ExternalTaskSensor to obtain the execution_date of the last DagRun of DAG_A. If we didn't, the sensor would wait for a DAG_A run with the same execution_date as the current run of DAG_B, which in many cases does not exist.
The function _get_execution_date_of_dag_a queries the metadata DB to obtain that execution date, using get_last_dagrun from the Airflow models.
Finally, the other important parameter is allowed_states=['success', 'failed'], which tells the sensor to wait until DAG_A is found in one of those states (i.e. if it is in a running state, the sensor will keep poking).
Try it out and let me know if it works for you!
I want to make a parent DAG with a few child DAGs that get called via the SubDagOperator.
I can only find examples of how to dynamically create SubDAGs inside the SubDagOperator task.
However, in this case I want standalone child DAGs that are already defined in a DAG .py file, and to stitch those together in a parent DAG.
If I set up the SubDagOperator task with just the DAG name of the child DAG:
task_1 = SubDagOperator(
    task_id="task_1",
    subdag=child_dag_name,
    dag=parent_dag
)
I get the following Error:
NameError: name 'child_dag_name' is not defined
This answer relies as much on knowledge of Python as it does on know-how of Airflow.
Recall that:
Python: importing a module means that all top-level (indentation zero) code is executed immediately, during the import.
Airflow: only those DAG objects that occur at the top level (indentation zero) of a dag-definition file are picked up by the scheduler / webserver.
Keeping the above two things in mind, here's what you can do:
Create a helper / utility function in your child_dag.py file to instantiate and return a DAG object for the child DAG.
Use that helper function for instantiating the top-level child DAG as well as for creating the SubDagOperator task.
dag_object_builder.py
from typing import Dict, Any
from airflow.models import DAG


def create_dag_object(dag_id: str, dag_params: Dict[str, Any]) -> DAG:
    dag: DAG = DAG(dag_id=dag_id, **dag_params)
    return dag
child_dag.py
from datetime import datetime
from typing import Dict, Any

from airflow.models import DAG

from src.main.subdag_example import dag_object_builder

default_args: Dict[str, Any] = {
    "owner": "my_owner",
    "email": ["my_username@my_domain.com"],
    "weight_rule": "downstream",
    "retries": 1
}

...

def create_child_dag_object(dag_id: str) -> DAG:
    my_dag: DAG = dag_object_builder.create_dag_object(
        dag_id=dag_id,
        dag_params={
            "start_date": datetime(year=2019, month=7, day=10, hour=21, minute=30),
            "schedule_interval": None,
            "max_active_runs": 1,
            "default_view": "graph",
            "catchup": False,
            "default_args": default_args
        }
    )
    return my_dag


my_child_dag: DAG = create_child_dag_object(dag_id="my_child_dag")
parent_dag.py
from datetime import datetime
from typing import Dict, Any

from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator

from src.main.subdag_example import child_dag
from src.main.subdag_example import dag_object_builder

default_args: Dict[str, Any] = {
    "owner": "my_owner",
    "email": ["my_username@my_domain.com"],
    "weight_rule": "downstream",
    "retries": 1
}

my_parent_dag: DAG = dag_object_builder.create_dag_object(
    dag_id="my_parent_dag",
    dag_params={
        "start_date": datetime(year=2019, month=7, day=10, hour=21, minute=30),
        "schedule_interval": None,
        "max_active_runs": 1,
        "default_view": "graph",
        "catchup": False,
        "default_args": default_args
    }
)

...

my_subdag_task: SubDagOperator = SubDagOperator(
    task_id="my_subdag_task",
    dag=my_parent_dag,
    subdag=child_dag.create_child_dag_object(dag_id="my_subdag")
)
If your intention is simply to link DAGs together and you don't have any particular requirement that necessitates a SubDagOperator, then I would suggest using the TriggerDagRunOperator instead, since SubDAGs have their share of nuisances.
Read more about it here: Wiring top-level DAGs together
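For reference, a minimal sketch of that alternative (the import path below is the Airflow 1.10 one, matching the SubDagOperator import above; in Airflow 2 the operator lives in airflow.operators.trigger_dagrun):
from airflow.operators.dagrun_operator import TriggerDagRunOperator

# In parent_dag.py, instead of the SubDagOperator task above:
trigger_child = TriggerDagRunOperator(
    task_id="trigger_my_child_dag",
    trigger_dag_id="my_child_dag",   # dag_id of the standalone child DAG
    dag=my_parent_dag,
)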
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from workflow.task import some_task

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['jimin.park1@aig.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=1),
    'start_date': airflow.utils.dates.days_ago(0)
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG('JiminTest', default_args=default_args, schedule_interval='*/1 * * * *', catchup=False)

t1 = PythonOperator(
    task_id='Task1',
    provide_context=True,
    python_callable=some_task,
    dag=dag
)
The actual some_task itself simply appends a timestamp to a file. As you can see in the DAG definition above, the task is configured to run every minute.
def some_task(ds, **kwargs):
    current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open("test.txt", "a") as myfile:
        myfile.write(current_time + '\n')
I simply ran tail -f on the output file and started the webserver without the scheduler running. The function was called and lines were appended to the file as soon as the webserver started up. When I start the scheduler, the file gets appended to on every execution loop.
What I want is for the function to be executed every minute as intended, not on every execution loop.
The scheduler parses each DAG file on every scheduler loop, and parsing executes all of the import statements.
Is there any code running at module level in the file you are importing the function from?
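To illustrate the point, here is a sketch of how workflow/task.py should be laid out (the module path is assumed from the import in the question): anything at indentation zero runs on every parse, while the body of some_task runs only when the task instance is actually executed.
# workflow/task.py
import datetime

# Code placed here, at module level, runs every time the scheduler
# (or webserver) parses the DAG file that imports this module.

def some_task(ds, **kwargs):
    # This body runs only when the scheduler actually executes Task1,
    # i.e. once per minute as scheduled.
    current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open("test.txt", "a") as myfile:
        myfile.write(current_time + '\n')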
Try checking the scheduler_heartbeat_sec config parameter in your config file; for your case it should be smaller than 60 seconds.
If you want the scheduler not to catch up on previous runs, set catchup_by_default to False (I am not sure whether this is relevant to your question, though).
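If you want to confirm which values are actually in effect, one way (a quick sketch using Airflow's configuration API, which should work on recent 1.10 and 2.x installs) is:
from airflow.configuration import conf

# Both options live in the [scheduler] section of airflow.cfg.
print(conf.getint("scheduler", "scheduler_heartbeat_sec"))
print(conf.getboolean("scheduler", "catchup_by_default"))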
Please indicate which Apache Airflow version you are using.