Airflow - Broken DAG - Timeout - airflow

I have a DAG that executes a function that connects to a Postgres DB, deletes the contents in the table and then inserts a new data set.
I am trying this locally, and when I try to run it the web server takes a long time to connect and in most cases doesn't succeed. However, as part of the connecting process it seems to be executing the queries in the back end. Since I have a delete function, I see the data getting deleted from the table (basically one of the functions gets executed) even though I have not scheduled the script or started it manually. Could someone advise what I am doing wrong here?
One error that pops out in the UI is
Broken DAG: [/Users/user/airflow/dags/dwh_sample23.py] Timeout
I also see an "i" icon next to the DAG id in the UI that says: This DAG isn't available in the web server's DAG object.
Given below is the code I am using:
## Third party Library Imports
import pandas as pd
import psycopg2
import airflow
from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'admin',
'depends_on_past': False,
'start_date': datetime(2018, 5, 21),
'retries': 1,
'retry_delay': timedelta(minutes=1),
}
dag = DAG('dwh_sample23', default_args=default_args)
#######################
## Login to DB
def db_login():
    ''' This function connects to the Data Warehouse and returns the cursor to execute queries '''
    global dwh_connection
    try:
        dwh_connection = psycopg2.connect(" dbname = 'dbname' user = 'user' password = 'password' host = 'hostname' port = '5439' sslmode = 'require' ")
    except:
        print("I am unable to connect to the database.")
    print('Success')
    return(dwh_connection)
def tbl1_del():
    ''' This function takes clears all rows from tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute("""DELETE FROM tbl1;""")
    dwh_connection.commit()
def pop_tbl1():
    ''' This function populates all rows in tbl1 '''
    cur = dwh_connection.cursor()
    cur.execute(""" INSERT INTO tbl1
    select id,name,price from tbl2;""")
    dwh_connection.commit()
db_login()
tbl1_del()
pop_tbl1()
dwh_connection.close()
##########################################
t1 = BashOperator(
task_id='DB_Connect',
python_callable=db_login(),
bash_command='python3 ~/airflow/dags/dwh_sample23.py',
dag=dag)
t2 = BashOperator(
task_id='del',
python_callable=tbl1_del(),
bash_command='python3 ~/airflow/dags/dwh_sample23.py',
dag=dag)
t3 = BashOperator(
task_id='populate',
python_callable=pop_tbl1(),
bash_command='python3 ~/airflow/dags/dwh_sample23.py',
dag=dag)
t1.set_downstream(t2)
t2.set_downstream(t3)
Could anyone assist? Thanks.

Instead of using BashOperator you can use PythonOperator and call db_login(), tbl1_del(), pop_tbl1() in PythonOperator
## Third party Library Imports
import pandas as pd
import psycopg2
import airflow
from airflow import DAG
from airflow.operators import PythonOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'admin',
'depends_on_past': False,
'start_date': datetime(2018, 5, 21),
'retries': 1,
'retry_delay': timedelta(minutes=1),
}
dag = DAG('dwh_sample23', default_args=default_args)
#######################
## Login to DB
def db_login():
    ''' This function connects to the Data Warehouse and returns the connection '''
    # No bare except: a failed connection should raise so that the task fails
    return psycopg2.connect(" dbname = 'dbname' user = 'user' password = 'password' host = 'hostname' port = '5439' sslmode = 'require' ")
def tbl1_del():
    ''' This function clears all rows from tbl1 '''
    # Tasks run in separate processes, so each one opens its own connection
    dwh_connection = db_login()
    cur = dwh_connection.cursor()
    cur.execute("""DELETE FROM tbl1;""")
    dwh_connection.commit()
    dwh_connection.close()
def pop_tbl1():
    ''' This function populates all rows in tbl1 '''
    dwh_connection = db_login()
    cur = dwh_connection.cursor()
    cur.execute(""" INSERT INTO tbl1
    select id,name,price from tbl2;""")
    dwh_connection.commit()
    dwh_connection.close()
def check_login():
    ''' This function only verifies that the Data Warehouse is reachable '''
    db_login().close()
# Nothing is called at module level: the functions run only inside the tasks
##########################################
t1 = PythonOperator(
    task_id='DB_Connect',
    python_callable=check_login,  # pass the function itself, without parentheses
    dag=dag)
t2 = PythonOperator(
    task_id='del',
    python_callable=tbl1_del,
    dag=dag)
t3 = PythonOperator(
    task_id='populate',
    python_callable=pop_tbl1,
    dag=dag)
t1.set_downstream(t2)
t2.set_downstream(t3)

This is really old by now, but we got this error in prod, I found this question, and I think it would be nice for it to have an answer.
Some of the code is getting executed during DAG load, i.e. you actually run
db_login()
tbl1_del()
pop_tbl1()
dwh_connection.close()
##########################################
inside the webserver and scheduler loops, when they load the DAG definition from the file.
I believe you didn't intend that to happen.
Everything should work just fine if you remove these 4 lines.
In general, don't place calls to functions you want the executors to run at file/module level, because when the scheduler's or webserver's interpreter loads the file to get the DAG definition, it invokes them.
Try putting this in your DAG file and check the webserver logs to see what happens.
from time import sleep
def do_some_printing():
    print(1111111)
    sleep(60)
do_some_printing()
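The same effect can be reproduced without Airflow. The sketch below writes a small stand-in for a DAG file to a temporary directory and imports it by path, which is roughly what happens when a DAG file is parsed. The file contents and the names delete_rows, safe_task and load_like_a_parser are made up for illustration.

```python
import importlib.util
import tempfile
import textwrap
from pathlib import Path

# Stand-in for a DAG file: one function is called at module level, one is not.
DAG_FILE_SOURCE = textwrap.dedent('''
    calls = []

    def delete_rows():
        calls.append("delete_rows")

    def safe_task():
        calls.append("safe_task")

    delete_rows()  # module level: runs every time the file is loaded
''')


def load_like_a_parser(path):
    """Import a Python file by path, much like a DAG file loader would."""
    spec = importlib.util.spec_from_file_location("fake_dag_file", str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    dag_path = Path(tmp) / "fake_dag_file.py"
    dag_path.write_text(DAG_FILE_SOURCE)
    module = load_like_a_parser(dag_path)

# Only the module-level call ran; safe_task was defined but never invoked.
executed_on_import = list(module.calls)
print(executed_on_import)
```

Loading the file is enough to run delete_rows, while safe_task stays untouched until something calls it explicitly.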

Related

Airflow - Trying to execute a set of Python functions

I am trying to execute an Airflow script that consists of a couple of Python functions. These functions basically query a database and perform a few tasks. I am trying to execute this in Airflow so that I can monitor each of these functions separately. Given below is the code I am trying to execute; it fails with the following error:
Subtask: NameError: name 'task_instance' is not defined
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2018, 1, 23, 12),
'email': ['airflow@airflow.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")
#######################
## Login to DB
def db_log(**kwargs):
    global db_con
    try:
        db_con = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = 'port' sslmode = 'require' ")
    except:
        print("Connection Failed.")
    print('Connected successfully')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(value="db_con", key="db_log")
    return (db_con)
def insert_data(**kwargs):
    v1 = task_instance.xcom_pull(key="db_con", task_ids='db_log')
    return (v1)
    cur = db_con.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
#def job_run():
# db_log()
# insert_data()
##########################################
t1 = PythonOperator(
task_id='Connect',
python_callable=db_log,provide_context=True,
dag=dag)
t2 = PythonOperator(
task_id='Query',
python_callable=insert_data,provide_context=True,
dag=dag)
t1 >> t2
Could anyone assist on this. Thanks..
Update 1 :
Encountered an error
AttributeError: 'NoneType' object has no attribute 'execute'
pointing to the last line on the above piece of code
cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
Complete code:
## Third party Library Imports
import pandas as pd
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2018, 5, 29, 12),
'email': ['airflow@airflow.com']
}
dag = DAG('sample1', default_args=default_args)
## Login to DB
def db_log(**kwargs):
    global db_con
    try:
        db_con = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = '5439'")
    except:
        print("I am unable to connect")
    print('Connection Task Complete')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(key="dwh_connection" , value = "dwh_connection")
    return (dwh_connection)
t1 = PythonOperator(
task_id='DWH_Connect',
python_callable=data_warehouse_login,provide_context=True,
dag=dag)
#######################
def insert_data(**kwargs):
    task_instance = kwargs['task_instance']
    db_con_xcom = task_instance.xcom_pull(key="dwh_connection", task_ids='DWH_Connect')
    cur = db_con_xcom
    cur.execute("""insert into tbl_1 select limit 2 """)
##########################################
t2 = PythonOperator(
task_id='DWH_Connect1',
python_callable=insert_data,provide_context=True,dag=dag)
t1 >> t2
This is a basic error message from Python.
NameError: name 'task_instance' is not defined
tells you that task_instance is nowhere to be found when you want to use it.
The task instance is provided in the context which is already being passed to the function.
Airflow sends the context with the setting
provide_context=True,
within the task. Also the definition accepts kwargs:
def insert_data(**kwargs):
which is also correct.
Correction
You first need to take the task instance out of the context like so:
task_instance = kwargs['task_instance']
Then you can use the task instance to use xcom_pull. So it should look like this (put in a few comments as well):
def insert_data(**kwargs):
    task_instance = kwargs['task_instance']
    db_con_xcom = task_instance.xcom_pull(key="db_con", task_ids='db_log')
    #return (v1) # wrong, why return here?
    #cur = db_con.cursor() # wrong, db_con might not be available
    cur = db_con_xcom
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
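One detail in these snippets may be worth checking: the value is pushed under key "db_log" by the task whose task_id is 'Connect', but it is pulled with key "db_con" and task_ids='db_log'. When nothing matches, xcom_pull returns None, which would lead to an error like 'NoneType' object has no attribute 'execute'. Note also that the pushed value is the string "db_con", not a connection. The sketch below models the lookup with a hypothetical in-memory class; FakeXCom is not an Airflow API, just an illustration of matching on task id and key.

```python
class FakeXCom:
    """Hypothetical in-memory model: values are stored per (task_id, key)."""

    def __init__(self):
        self._store = {}

    def push(self, task_id, key, value):
        self._store[(task_id, key)] = value

    def pull(self, task_ids, key):
        # Like xcom_pull, return None when no entry matches.
        return self._store.get((task_ids, key))


xcom = FakeXCom()

# What the first task does: task 'Connect' pushes the *string* "db_con"
# under the key "db_log".
xcom.push(task_id="Connect", key="db_log", value="db_con")

# What the second task does: key and task id are swapped/mismatched.
mismatched = xcom.pull(task_ids="db_log", key="db_con")

# Pulling with the task id and key that were actually used.
matched = xcom.pull(task_ids="Connect", key="db_log")

print(mismatched)
print(repr(matched))
```

Even the matching pull only yields a plain string, so calling execute on it would still fail. Opening the connection inside the task that needs it avoids passing it between tasks at all.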
Since the question is becoming bigger I think it is appropriate to add a second answer.
Even after the edit from the comment "I removed the indentation portion of the code" I am still not sure about this bit of code:
def db_log(**kwargs):
    global db_con
    try:
        db_con = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = '5439'")
    except:
        print("I am unable to connect")
    print('Connection Task Complete')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(key="dwh_connection" , value = "dwh_connection")
    return (dwh_connection)
It should look like this:
def db_log(**kwargs):
    global db_con
    try:
        db_con = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = '5439'")
    except:
        print("I am unable to connect")
    print('Connection Task Complete')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(key="dwh_connection" , value = "dwh_connection")
    #return (dwh_connection) # don't need a return here
Besides that the idea in your other question (Python - AttributeError: 'NoneType' object has no attribute 'execute') to use a PostgresHook seems interesting to me. You might want to pursue that thought in the other question.

Airflow - Script does not execute when triggered

I have an Airflow script that tries to insert data from one table into another; I am using an Amazon Redshift DB. The script given below does not execute when triggered. The task status remains 'no status' in the Graph view and no other error is shown.
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2017, 1, 23, 12),
'email': ['airflow@airflow.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")
#######################
## Login to DB
def db_login():
    global db_conn
    try:
        db_conn = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = '5439' sslmode = 'require' ")
    except:
        print("I am unable to connect to the database.")
    print('Connection Task Complete: Connected to DB')
    return (db_conn)
#######################
def insert_data():
    cur = db_conn.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2 ;""")
    db_conn.commit()
    print('ETL Task Complete')
def job_run():
    db_login()
    insert_data()
##########################################
t1 = PythonOperator(
task_id='DBConnect',
python_callable=job_run,
bash_command='python3 ~/airflow/dags/sample.py',
dag=dag)
t1
Could anyone assist to find where the problem could be. Thanks
Updated Code (05/28)
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
#from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2018, 1, 23, 12),
'email': ['airflow@airflow.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")
#######################
## Login to DB
def data_warehouse_login():
    global dwh_connection
    try:
        dwh_connection = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = 'port' sslmode = 'require' ")
    except:
        print("Connection Failed.")
    print('Connected successfully')
    return (dwh_connection)
def insert_data():
    cur = dwh_connection.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
    dwh_connection.commit()
    print('Task Complete: Insert success')
def job_run():
    data_warehouse_login()
    insert_data()
##########################################
t1 = PythonOperator(
task_id='DWH_Connect',
python_callable=job_run(),
# bash_command='python3 ~/airflow/dags/sample.py',
dag=dag)
t1
Log message when running the script
[2018-05-28 11:36:45,300] {jobs.py:343} DagFileProcessor26 INFO - Started process (PID=26489) to work on /Users/user/airflow/dags/sample.py
[2018-05-28 11:36:45,306] {jobs.py:534} DagFileProcessor26 ERROR - Cannot use more than 1 thread when using sqlite. Setting max_threads to 1
[2018-05-28 11:36:45,310] {jobs.py:1521} DagFileProcessor26 INFO - Processing file /Users/user/airflow/dags/sample.py for tasks to queue
[2018-05-28 11:36:45,310] {models.py:167} DagFileProcessor26 INFO - Filling up the DagBag from /Users/user/airflow/dags/sample.py
/Users/user/anaconda3/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
Task Complete: Insert success
[2018-05-28 11:36:50,964] {jobs.py:1535} DagFileProcessor26 INFO - DAG(s) dict_keys(['latest_only', 'example_python_operator', 'test_utils', 'example_bash_operator', 'example_short_circuit_operator', 'example_branch_operator', 'tutorial', 'example_passing_params_via_test_command', 'latest_only_with_trigger', 'example_xcom', 'example_http_operator', 'example_skip_dag', 'example_trigger_target_dag', 'example_branch_dop_operator_v3', 'example_subdag_operator', 'example_subdag_operator.section-1', 'example_subdag_operator.section-2', 'example_trigger_controller_dag', 'insert_data2']) retrieved from /Users/user/airflow/dags/sample.py
[2018-05-28 11:36:51,159] {jobs.py:1169} DagFileProcessor26 INFO - Processing example_subdag_operator
[2018-05-28 11:36:51,167] {jobs.py:566} DagFileProcessor26 INFO - Skipping SLA check for <DAG: example_subdag_operator> because no tasks in DAG have SLAs
[2018-05-28 11:36:51,170] {jobs.py:1169} DagFileProcessor26 INFO - Processing sample_dag
[2018-05-28 11:36:51,174] {jobs.py:354} DagFileProcessor26 ERROR - Got an exception! Propagating...
Traceback (most recent call last):
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 346, in helper
pickle_dags)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/utils/db.py", line 53, in wrapper
result = func(*args, **kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 1581, in process_file
self._process_dags(dagbag, dags, ti_keys_to_schedule)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 1171, in _process_dags
dag_run = self.create_dag_run(dag)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/utils/db.py", line 53, in wrapper
result = func(*args, **kwargs)
File "/Users/user/anaconda3/lib/python3.6/site-packages/airflow/jobs.py", line 776, in create_dag_run
if next_start <= now:
TypeError: '<=' not supported between instances of 'NoneType' and 'datetime.datetime'
Log from the Graph View
* Log file isn't local.
* Fetching here: http://:8793/log/sample_dag/DWH_Connect/2018-05-28T12:23:57.595234
*** Failed to fetch log file from worker.
* Reading remote logs...
* Unsupported remote log location.
Instead of a single PythonOperator, you need a PythonOperator and a BashOperator.
You are getting the error because PythonOperator doesn't have a bash_command argument.
t1 = PythonOperator(
    task_id='DBConnect',
    python_callable=db_login,
    dag=dag
)
t2 = BashOperator(
    task_id='Run_Python_File',
    bash_command='python3 ~/airflow/dags/sample.py',
    dag=dag
)
t1 >> t2
To extend the answer kaxil provided: you should be using an IDE to develop for Airflow. PyCharm works fine for me.
That being said, please make sure to look up the available fields in the docs next time. For PythonOperator, see the docs here:
https://airflow.apache.org/code.html#airflow.operators.PythonOperator
Signature looks like:
class airflow.operators.PythonOperator(python_callable, op_args=None, op_kwargs=None, provide_context=False, templates_dict=None, templates_exts=None, *args, **kwargs)
and for BashOperator, see the docs here:
https://airflow.apache.org/code.html#airflow.operators.BashOperator
Signature is:
class airflow.operators.BashOperator(bash_command, xcom_push=False, env=None, output_encoding='utf-8', *args, **kwargs)
Highlights are from me to show the parameters you have been using.
Make sure to dig through the documentation a bit before using an Operator is my recommendation.
EDIT
After seeing the code update there is one thing left:
Make sure when defining python_callable in a task that you do so without brackets, otherwise the code will be called (which is very unintuitive if you don't know about it). So your code should look like this:
t1 = PythonOperator(
    task_id='DWH_Connect',
    python_callable=job_run,
    dag=dag)
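To see why the brackets matter, here is a minimal sketch that does not need Airflow. FakeOperator is a hypothetical stand-in that only stores the callable until execute is called; it is not the real PythonOperator, and the events list exists only to record when job_run actually runs.

```python
events = []


def job_run():
    events.append("job_run executed")


class FakeOperator:
    """Hypothetical stand-in for an operator: keeps the callable for later."""

    def __init__(self, task_id, python_callable):
        self.task_id = task_id
        self.python_callable = python_callable

    def execute(self):
        return self.python_callable()


# With brackets: job_run() runs immediately, while the file is being read,
# and the operator receives its return value (None) instead of a function.
wrong = FakeOperator(task_id="wrong", python_callable=job_run())
events_after_wrong_definition = list(events)

# Without brackets: nothing runs at definition time.
events.clear()
right = FakeOperator(task_id="right", python_callable=job_run)
events_after_right_definition = list(events)

# The function runs only when the task itself is executed.
right.execute()
print(events_after_wrong_definition, events_after_right_definition, events)
```

This also matches the log output in the question, where "Task Complete: Insert success" is printed while the DAG file is still being processed.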

Airflow - Run each python function separately

I have the Airflow script below, which runs all the Python functions as one function. I would like each Python function to run individually so that I can keep track of each function and its status.
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
#from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2018, 1, 23, 12),
'email': ['airflow@airflow.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")
#######################
## Login to DB
def db_log():
    global db_con
    try:
        db_con = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = 'port' sslmode = 'require' ")
    except:
        print("Connection Failed.")
    print('Connected successfully')
    return (db_con)
def insert_data():
    cur = db_con.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
def job_run():
    db_log()
    insert_data()
##########################################
t1 = PythonOperator(
task_id='DB_Connect',
python_callable=job_run,
# bash_command='python3 ~/airflow/dags/sample.py',
dag=dag)
t1
The above script works just fine but would like to split this by function to keep better track. Could anyone assist on this. Tnx..
Updated Code (version 2):
## Third party Library Imports
import psycopg2
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
#from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import io
# Following are defaults which can be overridden later on
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2018, 1, 23, 12),
'email': ['airflow@airflow.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
dag = DAG('sample_dag', default_args=default_args, catchup=False, schedule_interval="@once")
#######################
## Login to DB
def db_log(**kwargs):
    global db_con
    try:
        db_con = psycopg2.connect(
            " dbname = 'name' user = 'user' password = 'pass' host = 'host' port = 'port' sslmode = 'require' ")
    except:
        print("Connection Failed.")
    print('Connected successfully')
    task_instance = kwargs['task_instance']
    task_instance.xcom_push(value="db_con", key="db_log")
    return (db_con)
def insert_data(**kwargs):
    v1 = task_instance.xcom_pull(key="db_con", task_ids='db_log')
    return (v1)
    cur = db_con.cursor()
    cur.execute("""insert into tbl_1 select id,bill_no,status from tbl_2 limit 2;""")
#def job_run():
# db_log()
# insert_data()
##########################################
t1 = PythonOperator(
task_id='Connect',
python_callable=db_log,provide_context=True,
dag=dag)
t2 = PythonOperator(
task_id='Query',
python_callable=insert_data,provide_context=True,
dag=dag)
t1 >> t2
There are two possible solutions for this:
A) Create several tasks, one per function
Tasks in Airflow run in separate processes. Variables defined as global won't work, since the second task usually cannot see the variables of the first task.
Introducing: XCOM. This is a feature of Airflow and we answered a few questions for this already, for example here (with examples): Python Airflow - Return result from PythonOperator
EDIT
You have to provide context and pass the context along as written in the examples. For your example, this would mean:
add provide_context=True, to your PythonOperator
change the signature of job_run to def job_run(**kwargs):
pass the kwargs to data_warehouse_login with data_warehouse_login(**kwargs) inside the function
B) Create one complete function
In this very scenario I'd still remove the global (just call insert_data, call data_warehouse_login from within and return the connection) and use just one task.
If an error occurs, throw an exception. Airflow will handle these just fine. Just make sure to put appropriate messages in the exception and use the best exception type.
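Option B can be sketched as follows. To keep the example runnable anywhere, sqlite3 stands in for psycopg2 and a temporary file stands in for the warehouse; the table columns, the sample rows and the helper make_demo_db are assumptions made for illustration only.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def make_demo_db(db_path):
    """Create the two tables from the question with a few made-up rows."""
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("CREATE TABLE tbl_1 (id INTEGER, bill_no TEXT, status TEXT)")
        conn.execute("CREATE TABLE tbl_2 (id INTEGER, bill_no TEXT, status TEXT)")
        conn.executemany(
            "INSERT INTO tbl_2 VALUES (?, ?, ?)",
            [(1, "B-1", "paid"), (2, "B-2", "open"), (3, "B-3", "paid")],
        )
        conn.commit()


def insert_data(db_path):
    """One task: connect, copy two rows, commit, close. No globals."""
    # No try/except here: if connecting or the query fails, the exception
    # propagates to the caller, so the failure is visible instead of printed.
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute(
            "INSERT INTO tbl_1 SELECT id, bill_no, status FROM tbl_2 LIMIT 2"
        )
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM tbl_1").fetchone()[0]


with tempfile.TemporaryDirectory() as tmp:
    good_db = os.path.join(tmp, "demo.db")
    make_demo_db(good_db)
    rows_in_tbl_1 = insert_data(good_db)

    empty_db = os.path.join(tmp, "empty.db")  # this database has no tables
    try:
        insert_data(empty_db)
        failure = None
    except sqlite3.OperationalError as exc:
        failure = str(exc)

print(rows_in_tbl_1)
print(failure)
```

With psycopg2 the shape would be the same: insert_data is passed as python_callable, and a raised exception makes the task fail and retry according to default_args.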

Triggering A SubDag

EDITED
I have edited this question by considering the inputs from @tobi6
I copied the subdag operator from Airflow source code
Source code: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/subdag_operator.py
I modified a few things in the execute method. The changes were made to trigger the SubDag and wait until the SubDag completes execution. The trigger is working great but the tasks are not being executed (DAG is in the running/Green state while the tasks are in the null/White state).
Please refer below for the changes I made:
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator, Pool
from airflow.utils.decorators import apply_defaults
from airflow.utils.db import provide_session
from airflow.utils.state import State
from airflow.executors import GetDefaultExecutor
from time import sleep
import logging
from datetime import datetime
class SubDagOperator(BaseOperator):
template_fields = tuple()
ui_color = '#555'
ui_fgcolor = '#fff'
@provide_session
@apply_defaults
def __init__(
self,
subdag,
executor=GetDefaultExecutor(),
*args, **kwargs):
"""
Yo dawg. This runs a sub dag. By convention, a sub dag's dag_id
should be prefixed by its parent and a dot. As in `parent.child`.
:param subdag: the DAG object to run as a subdag of the current DAG.
:type subdag: airflow.DAG
:param dag: the parent DAG
:type subdag: airflow.DAG
"""
import airflow.models
dag = kwargs.get('dag') or airflow.models._CONTEXT_MANAGER_DAG
if not dag:
raise AirflowException('Please pass in the `dag` param or call '
'within a DAG context manager')
session = kwargs.pop('session')
super(SubDagOperator, self).__init__(*args, **kwargs)
# validate subdag name
if dag.dag_id + '.' + kwargs['task_id'] != subdag.dag_id:
raise AirflowException(
"The subdag's dag_id should have the form "
"'{{parent_dag_id}}.{{this_task_id}}'. Expected "
"'{d}.{t}'; received '{rcvd}'.".format(
d=dag.dag_id, t=kwargs['task_id'], rcvd=subdag.dag_id))
# validate that subdag operator and subdag tasks don't have a
# pool conflict
if self.pool:
conflicts = [t for t in subdag.tasks if t.pool == self.pool]
if conflicts:
# only query for pool conflicts if one may exist
pool = (
session
.query(Pool)
.filter(Pool.slots == 1)
.filter(Pool.pool == self.pool)
.first()
)
if pool and any(t.pool == self.pool for t in subdag.tasks):
raise AirflowException(
'SubDagOperator {sd} and subdag task{plural} {t} both '
'use pool {p}, but the pool only has 1 slot. The '
'subdag tasks will never run.'.format(
sd=self.task_id,
plural=len(conflicts) > 1,
t=', '.join(t.task_id for t in conflicts),
p=self.pool
)
)
self.subdag = subdag
self.executor = executor
def execute(self, context):
dag_run = self.subdag.create_dagrun(
conf=context['dag_run'].conf,
state=State.RUNNING,
execution_date=context['execution_date'],
run_id='trig__' + str(datetime.utcnow()),
external_trigger=True
)
while True:
if dag_run.get_state() == State.FAILED or dag_run.get_state() == State.SUCCESS:
break
else:
sleep(10)
continue
Below is the code that shows how I'm using the same
from airflow import DAG
from operators.sd_operator import SubDagOperator # My SubDag Operator
from airflow.operators.python_operator import PythonOperator
import logging
from datetime import datetime
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2017, 7, 17),
'email': ['airflow@example.com'],
'email_on_failure': False,
'email_on_retry': False,
}
def print_dag_details(**kwargs):
    logging.info(str(kwargs['dag_run'].conf))
with DAG('example_dag', schedule_interval=None, catchup=False, default_args=default_args) as dag:
    task_1 = SubDagOperator(
        subdag=sub_dag_func('example_dag', 'sub_dag_1'),
        task_id='sub_dag_1'
    )
    task_2 = SubDagOperator(
        subdag=sub_dag_func('example_dag', 'sub_dag_2'),
        task_id='sub_dag_2',
    )
    print_kwargs = PythonOperator(
        task_id='print_kwargs',
        python_callable=print_dag_details,
        provide_context=True
    )
    print_kwargs >> task_1 >> task_2
Any information you provide would be helpful. Thanks in advance.
It is a bit hard to understand your question without context.
"I copied the subdag operator and modified a few things in the execute method."
From where was this copied?
"The trigger is working great ..."
What does this look like?
There are a few things I saw in the code:
It might be helpful to add assigned fields to the function call of sub_dag_func, e.g. sub_dag_func(subdag='parent_dag'...).
In the binary shift definition, used to set upstream / downstream there are tasks defined I cannot find in the DAG (df_job_1, df_job_2). This might be connected to SubDAGs (haven't looked into them yet).
The name of the sub dag seems inconsistent with the comment in the code saying By convention, a sub dag's dag_id should be prefixed by its parent and a dot but it is sub_dag_1, sub_dag_2

Airflow dynamic DAG and Task Ids

I mostly see Airflow being used for ETL/big data related jobs. I'm trying to use it for business workflows wherein a user action triggers a set of dependent tasks in the future. Some of these tasks may need to be cleared (deleted) based on certain other user actions.
I thought the best way to handle this would be via dynamic task ids. I read that Airflow supports dynamic dag ids. So, I created a simple python script that takes DAG id and task id as command line parameters. However, I'm running into problems making it work. It gives dag_id not found error. Has anyone tried this? Here's the code for the script (call it tmp.py) which I execute on command line as python (python tmp.py 820 2016-08-24T22:50:00 ):
from __future__ import print_function
import os
import sys
import shutil
from datetime import date, datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
execution = '2016-08-24T22:20:00'
if len(sys.argv) > 2 :
    dagid = sys.argv[1]
    taskid = 'Activate' + sys.argv[1]
    execution = sys.argv[2]
else:
    dagid = 'DAGObjectId'
    taskid = 'Activate'
default_args = {'owner' : 'airflow', 'depends_on_past': False, 'start_date':date.today(), 'email': ['fake@fake.com'], 'email_on_failure': False, 'email_on_retry': False, 'retries': 1}
dag = DAG(dag_id = dagid,
default_args=default_args,
schedule_interval='@once',
)
globals()[dagid] = dag
task1 = BashOperator(
task_id = taskid,
bash_command='ls -l',
dag=dag)
fakeTask = BashOperator(
task_id = 'fakeTask',
bash_command='sleep 5',
retries = 3,
dag=dag)
task1.set_upstream(fakeTask)
airflowcmd = "airflow run " + dagid + " " + taskid + " " + execution
print("airflowcmd = " + airflowcmd)
os.system(airflowcmd)
After much trial and error, I was able to figure this out. Hopefully, it will help someone. Here's how it works: you need an iterator or an external source (file/database table) to generate DAGs/tasks dynamically through a template. You can keep the DAG and task names static and just assign them ids dynamically in order to differentiate one DAG from another. You put this Python script in the dags folder. When you start the Airflow scheduler, it runs through this script on every heartbeat and writes the DAGs to the dag table in the database. If a DAG (unique DAG id) has already been written, it simply skips it. The scheduler also looks at the schedule of individual DAGs to determine which ones are ready for execution. If a DAG is ready for execution, it executes it and updates its status.
Here's a sample code:
from airflow.operators import PythonOperator
from airflow.operators import BashOperator
from airflow.models import DAG
from datetime import datetime, timedelta
import sys
import time
dagid = 'DA' + str(int(time.time()))
taskid = 'TA' + str(int(time.time()))
input_file = '/home/directory/airflow/textfile_for_dagids_and_schedule'
def my_sleeping_function(random_base):
    '''This is a function that will run within the DAG execution'''
    time.sleep(random_base)
def_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime.now(), 'email_on_failure': False,
'retries': 1, 'retry_delay': timedelta(minutes=2)
}
with open(input_file,'r') as f:
    for line in f:
        args = line.strip().split(',')
        if len(args) < 7:  # args[0] through args[6] are read below
            continue
        dagid = 'DAA' + args[0]
        taskid = 'TAA' + args[0]
        yyyy = int(args[1])
        mm = int(args[2])
        dd = int(args[3])
        hh = int(args[4])
        mins = int(args[5])
        ss = int(args[6])
        dag = DAG(
            dag_id=dagid, default_args=def_args,
            schedule_interval='@once', start_date=datetime(yyyy,mm,dd,hh,mins,ss)
        )
        myBashTask = BashOperator(
            task_id=taskid,
            bash_command='python /home/directory/airflow/sendemail.py',
            dag=dag)
        task2id = taskid + '-X'
        task_sleep = PythonOperator(
            task_id=task2id,
            python_callable=my_sleeping_function,
            op_kwargs={'random_base': 10},
            dag=dag)
        task_sleep.set_upstream(myBashTask)
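The line parsing in this loop can be isolated into a small helper that is easy to test on its own. This is only a sketch: the input format (an id followed by year, month, day, hour, minute and second, comma-separated) is inferred from the indexes used in the code, and parse_schedule_line is a made-up name.

```python
from datetime import datetime


def parse_schedule_line(line):
    """Return (id_suffix, start_date) for a valid line, or None to skip it."""
    fields = [field.strip() for field in line.strip().split(",")]
    # Seven fields are needed: the id plus six date/time components.
    if len(fields) < 7:
        return None
    try:
        year, month, day, hour, minute, second = (int(f) for f in fields[1:7])
        start_date = datetime(year, month, day, hour, minute, second)
    except ValueError:
        # Non-numeric or out-of-range values: skip the line instead of
        # breaking the load of the whole file.
        return None
    return fields[0], start_date


sample_lines = [
    "820,2016,8,24,22,50,0",
    "821,2016,8,24,22,50",     # only six fields
    "822,2016,13,24,22,50,0",  # month 13 does not exist
    "",
]
parsed = [parse_schedule_line(line) for line in sample_lines]
print(parsed)
```

Skipping bad lines keeps one malformed entry from breaking every DAG generated from the same file.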
From How can I create DAGs dynamically?:
Airflow looks in you [sic] DAGS_FOLDER for modules that contain DAG objects in their global namespace, and adds the objects it finds in the DagBag. Knowing this all we need is a way to dynamically assign variable in the global namespace, which is easily done in python using the globals() function for the standard library which behaves like a simple dictionary.
for i in range(10):
    dag_id = 'foo_{}'.format(i)
    globals()[dag_id] = DAG(dag_id)
    # or better, call a function that returns a DAG object!
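A minimal sketch of the "function that returns a DAG object" idea, runnable without Airflow. FakeDAG and build_dag are hypothetical stand-ins, and plain dictionaries play the role of the module's global namespace. It also shows why rebinding a single name such as dag inside a loop leaves only the last object visible at module level.

```python
from dataclasses import dataclass


@dataclass
class FakeDAG:
    """Hypothetical stand-in for airflow.DAG; only carries an id."""
    dag_id: str


def build_dag(dag_id):
    """Factory: everything that configures one DAG would live in here."""
    return FakeDAG(dag_id=dag_id)


def discover(namespace):
    """Collect DAG-like objects from a namespace, as described in the quote."""
    return sorted(v.dag_id for v in namespace.values() if isinstance(v, FakeDAG))


# Pattern 1: the same name is rebound on every iteration.
namespace_rebinding = {}
for i in range(3):
    namespace_rebinding["dag"] = build_dag("foo_{}".format(i))

# Pattern 2: each DAG is stored under its own, unique name.
namespace_unique = {}
for i in range(3):
    dag_id = "foo_{}".format(i)
    namespace_unique[dag_id] = build_dag(dag_id)

found_rebinding = discover(namespace_rebinding)
found_unique = discover(namespace_unique)
print(found_rebinding)
print(found_unique)
```

In a real DAG file the second pattern would be written as globals()[dag_id] = build_dag(dag_id), with build_dag returning an Airflow DAG.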
Copying my answer from this question.
This feature is achieved using Dynamic Task Mapping, which is only available in Airflow 2.3 and higher.
More documentation and examples here:
Official Dynamic Task Mapping documentation
Tutorial from Astronomer
Example:
from datetime import datetime
from airflow import DAG
from airflow.decorators import task
@task
def make_list():
    # This can also be from an API call, checking a database, -- almost anything you like, as long as the
    # resulting list/dictionary can be stored in the current XCom backend.
    return [1, 2, {"a": "b"}, "str"]
@task
def consumer(arg):
    # Each mapped task instance receives one element of the list
    print(arg)
with DAG(dag_id="dynamic-map", start_date=datetime(2022, 4, 2)) as dag:
    consumer.expand(arg=make_list())
example 2:
from airflow import XComArg
task = MyOperator(task_id="source")
downstream = MyOperator2.partial(task_id="consumer").expand(input=XComArg(task))
The graph view and tree view are also updated:
Relevant issues here:
https://github.com/apache/airflow/projects/12
