I’m looking for a way to execute my DAG every M minutes, N times, where M and N are integers I control.
Is there an elegant way to handle this in Airflow? Let’s assume that I have the following tasks and dependencies.
t1 = BashOperator(
    task_id='t1',
    bash_command='echo 1',
)
t2 = BashOperator(
    task_id='t2',
    bash_command='echo 1',
)
t3 = BashOperator(
    task_id='t3',
    bash_command='echo 1',
)
t1 >> [t2, t3]
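One way to get this, assuming M and N are known when the DAG file is parsed (a sketch, not an answer from the thread): schedule the DAG every M minutes with a timedelta, and cap the number of runs by setting end_date.

from datetime import datetime, timedelta
from airflow import DAG

M, N = 5, 10  # hypothetical values
start = datetime(2021, 1, 1)

dag = DAG(
    dag_id='every_m_minutes_n_times',  # hypothetical dag_id
    start_date=start,
    # execution dates fall at start, start+M, ..., start+(N-1)*M minutes,
    # so N runs are scheduled in total
    end_date=start + timedelta(minutes=M * (N - 1)),
    schedule_interval=timedelta(minutes=M),
    catchup=True,
)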
Related
I want to return two or more tasks from a function; they should run in sequence, at the spot where they're inserted into the dependencies. See below.
t1 = PythonOperator()

def generate_tasks():
    t2 = PythonOperator()
    t3 = PythonOperator()
    return magic(t2, t3)  # magic needed here (preferably)

t1 >> generate_tasks()  # otherwise here
# desired result: t1 >> t2 >> t3
Is this doable? As I understand it, Airflow 2.0 achieves this with a TaskGroup, but we're on Google's Composer, and 2.0 won't be available for a while.
Best workaround I've found:
t1 = PythonOperator()

def generate_tasks():
    t2 = PythonOperator()
    t3 = PythonOperator()
    return [t2, t3]

tasks = generate_tasks()
t1 >> tasks[0] >> tasks[1]
But I'd really like that to be abstracted away, as it more or less defeats the purpose of having multiple operators returned from a single function. We want it to be a single unit as far as the end user knows, even though it can be composed of 2 or more tasks.
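Before 2.0, one way to hide the chaining behind a function (a sketch; chain lives in airflow.utils.helpers in 1.10.x and moved to airflow.models.baseoperator in 2.x):

from airflow.utils.helpers import chain

def generate_tasks():
    t2 = PythonOperator()
    t3 = PythonOperator()
    return [t2, t3]

chain(t1, *generate_tasks())  # sets t1 >> t2 >> t3 without exposing the list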
How to do it with TaskGroup in Airflow 2.0:
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago
from airflow.utils.task_group import TaskGroup

class Encryptor:
    def encrypt_and_archive(self):
        with TaskGroup("archive_and_encrypt") as section_1:
            encrypt = DummyOperator(task_id="encrypt")
            archive = BashOperator(task_id="archive", bash_command='echo 1')
            encrypt >> archive
        return section_1

with DAG(dag_id="example_return_task_group", start_date=days_ago(2), tags=["example"]) as dag:
    start = DummyOperator(task_id="start")
    encrypt_and_archive = Encryptor().encrypt_and_archive()
    end = DummyOperator(task_id='end')
    # 👇 single variable, containing two tasks
    start >> encrypt_and_archive >> end
Which creates a graph where encrypt >> archive is collapsed into a single archive_and_encrypt group node between start and end.
Is something similar remotely doable before 2.0?
You didn't explain what magic(t2, t3) is.
TaskGroup is strictly a UI feature; it doesn't affect the DAG logic. From your description it seems you are looking for specific logic (otherwise, what is magic?).
I believe this is what you are after:
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2021, 1, 24),
}

def generate_tasks():
    operator_list = []
    for i in range(5):  # replace with whatever logic you need to create tasks dynamically
        # dag here is the module-level DAG defined below
        op = DummyOperator(task_id=f"t{str(i)}_task", dag=dag)
        if i > 0:
            operator_list[i - 1] >> op
        operator_list.append(op)
    return operator_list
with DAG(
    dag_id='loop',
    default_args=default_args,
    schedule_interval=None,
) as dag:
    start_op = DummyOperator(task_id='start_task')
    end_op = DummyOperator(task_id='end_task')

    tasks = generate_tasks()
    start_op >> tasks[0]
    tasks[-1] >> end_op
You can replace the DummyOperator with any operator you'd like.
Can we see or get the output of SQL executed by JdbcOperator?
with DAG(dag_id='Exasol_DB_Checks',
         schedule_interval='@hourly',
         default_args=default_args,
         catchup=False,
         template_searchpath=tmpl_search_path) as dag:

    start_task = DummyOperator(task_id='start_task', dag=dag)

    sql_task_1 = JdbcOperator(task_id='sql_cmd',
                              jdbc_conn_id='Exasol_db',
                              sql=['select current_timestamp;',
                                   'select current_user from DUAL;',
                                   "test.sql"],
                              autocommit=True,
                              params={"my_param": "{{ var.value.source_path }}"})

    start_task >> sql_task_1
Maybe you can use a JdbcHook inside a PythonOperator for your needs:
def do_work():
    jdbc_hook = JdbcHook(jdbc_conn_id="some_db")  # note: no trailing comma, or you get a tuple
    jdbc_conn = jdbc_hook.get_conn()
    jdbc_cursor = jdbc_conn.cursor()
    jdbc_cursor.execute('SELECT ......')
    row = jdbc_cursor.fetchone()[0]

task = PythonOperator(
    task_id='task1',
    python_callable=do_work,
    dag=dag,
)
https://airflow.apache.org/docs/stable/concepts.html#hooks
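If the goal is to actually see the query output, one option (an extension beyond the original answer) is to return the result from the callable; PythonOperator pushes the return value to XCom by default, so it appears under the task instance's XCom tab in the UI:

def do_work():
    jdbc_hook = JdbcHook(jdbc_conn_id="some_db")  # same hypothetical connection id as above
    # get_records comes from DbApiHook, which JdbcHook extends
    return jdbc_hook.get_records('select current_timestamp;')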
Normally we define the operators within the same Python file where the DAG is defined (see this basic example). I was doing the same, but my tasks are themselves big and use custom operators, so I wanted a polymorphic project structure where all tasks using the same operator live in a separate file. For simplicity, here is a very basic example. I have an operator x with several tasks. This is my project structure:
main_directory
├──tasks
| ├──operator_x
| | └──op_x.py
| ├──operator_y
| : └──op_y.py
|
└──dag.py
op_x.py has the following method:
def prepare_task():
    from main_directory.dag import dag
    t2 = BashOperator(
        task_id='print_inner_date',
        bash_command='date',
        dag=dag)
    return t2
and dag.py contains the following code:
from main_directory.tasks.operator_x import prepare_task

default_args = {
    'retries': 5,
    'retry_delay': dt.timedelta(minutes=5),
    'on_failure_callback': gen_email(EMAIL_DISTRO, retry=False),
    'on_retry_callback': gen_email(EMAIL_DISTRO, retry=True),
    'start_date': dt.datetime(2019, 5, 10)
}

dag = DAG('test_dag', default_args=default_args, schedule_interval=dt.timedelta(days=1))

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = prepare_task()
Now when I execute this in my Airflow environment and run airflow list_dags, I get the desired DAG named test_dag listed. But when I run airflow list_tasks -t test_dag, I only get one task with id print_date, and NOT the one defined inside the subdirectory with id print_inner_date. Can anyone help me understand what I am missing?
Your code would create cyclic imports. Instead, try the following:
op_x.py should have:
def prepare_task(dag):
    t2 = BashOperator(
        task_id='print_inner_date',
        bash_command='date',
        dag=dag)
    return t2
dag.py:
from main_directory.tasks.operator_x import prepare_task

default_args = {
    'retries': 5,
    'retry_delay': dt.timedelta(minutes=5),
    'on_failure_callback': gen_email(EMAIL_DISTRO, retry=False),
    'on_retry_callback': gen_email(EMAIL_DISTRO, retry=True),
    'start_date': dt.datetime(2019, 5, 10)
}

dag = DAG('test_dag', default_args=default_args, schedule_interval=dt.timedelta(days=1))

t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = prepare_task(dag=dag)
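If print_date is meant to run before print_inner_date, wire the dependency in dag.py as usual (an assumption about the intended order; the original leaves the two tasks independent):

t1 >> t2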
Also make sure that main_directory is in your PYTHONPATH.
I have parallel execution of two tasks in my DAG, shown below.
In the real world these could be 15 or 20 tasks, with the input parameters coming from an array, like below.
fruits = ["apples", "bananas"]

bad_dag = DAG('bad_dag_3', default_args=default_args, schedule_interval=None)

t0 = BashOperator(
    task_id="print",
    bash_command='echo "Beginning parallel tasks next..." ',
    dag=bad_dag)

t1 = BashOperator(
    task_id="fruit_" + fruits[0],
    params={"fruits": fruits},
    bash_command='echo fruit= {{params.fruits[0]}} ',
    dag=bad_dag)

t2 = BashOperator(
    task_id="fruit_" + fruits[1],
    params={"fruits": fruits},
    bash_command='echo fruit= {{params.fruits[1]}} ',
    dag=bad_dag)

t0 >> [t1, t2]
What's the best way for me to write this DAG, so I don't have to rewrite the same BashOperator over and over again like I have above?
I can't use a loop, because I can't parallelize the tasks if I use a loop.
Use the DAG below. The idea is that the task_id of each task just needs to be unique; Airflow handles the rest. A loop doesn't prevent parallelism: each iteration only adds a t0 >> task edge, so the looped tasks have no dependencies on each other and can run in parallel.
fruits = ["apples", "bananas"]

bad_dag = DAG('bad_dag_3', default_args=default_args, schedule_interval=None)

t0 = BashOperator(
    task_id="print",
    bash_command='echo "Beginning parallel tasks next..." ',
    dag=bad_dag)

for fruit in fruits:
    task_t = BashOperator(
        task_id="fruit_" + fruit,
        params={"fruit": fruit},
        bash_command='echo fruit= {{params.fruit}} ',
        dag=bad_dag)

    t0 >> task_t
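If you later need a fan-in step after the parallel tasks, keep the looped tasks in a list (a small extension of the same idea; the done task is hypothetical):

t_end = BashOperator(
    task_id="done",
    bash_command='echo "All fruits printed" ',
    dag=bad_dag)

fruit_tasks = []
for fruit in fruits:
    fruit_tasks.append(BashOperator(
        task_id="fruit_" + fruit,
        params={"fruit": fruit},
        bash_command='echo fruit= {{params.fruit}} ',
        dag=bad_dag))

t0 >> fruit_tasks      # fan out: t0 before every fruit task
fruit_tasks >> t_end   # fan in: every fruit task before t_end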
Edit:
This now works. I had defined ex_func_airflow(var_1=i), which was causing the issue.
I would like to create tasks in Airflow by looping over a list.
tabs = [1, 2, 3, 4, 5]

for i in tabs:
    task = PythonOperator(
        task_id=name,
        provide_context=False,
        op_args=[i],
        python_callable=ex_func_airflow,
        dag=dag)

    task_0 >> task >> task_1
When this is run in Airflow, the argument that is passed is always the last element of that list.
So I'm essentially running:
ex_func_airflow(5)
five times, instead of running
ex_func_airflow(1)
ex_func_airflow(2)
ex_func_airflow(3)
...and so on.
How can I pass the correct argument to each task?
The following code works for me:
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def print_context(ds, **kwargs):
    print("hello")

def ex_func_airflow(i):
    print(i)

dag = DAG(
    dag_id="loop_dag",
    schedule_interval=None,
    start_date=datetime(2018, 12, 31),
)

task_0 = PythonOperator(
    task_id='task_0',
    provide_context=True,
    python_callable=print_context,
    dag=dag)

task_1 = PythonOperator(
    task_id='task_1',
    provide_context=True,
    python_callable=print_context,
    dag=dag)

tabs = [1, 2, 3, 4, 5]

for i in tabs:
    task = PythonOperator(
        task_id=f'task_tab_{i}',
        provide_context=False,
        op_args=[i],
        python_callable=ex_func_airflow,
        dag=dag)

    task_0 >> task >> task_1
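A plausible cause of the "always the last element" behaviour (an assumption; the question doesn't show how the callable was originally wired up) is Python's late binding when a lambda closes over the loop variable:

for i in tabs:
    task = PythonOperator(
        task_id=f'task_tab_{i}',
        python_callable=lambda: ex_func_airflow(i),  # every lambda sees the final i
        dag=dag)

Passing op_args=[i], as in the working version above, binds the value when the operator is created; a default argument (lambda i=i: ex_func_airflow(i)) achieves the same thing.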