I have a DAG on GCP Airflow with tasks like the below:
with DAG(dag_name, schedule_interval='0 6 * * *', default_args=default_dag_args) as dag:
    notify_start = po.PythonOperator(
        task_id = 'notify-on-start',
        python_callable = slack,
        op_kwargs={'msg': slack_start}
    )
    create_dataproc_cluster = d.create_cluster(default_dag_args['cluster_name'], service_account, num_workers)
    [assorted dataproc tasks]
    notify_on_fail = po.PythonOperator(
        task_id = 'notify-on-task-failure',
        python_callable = slack,
        op_kwargs={'msg': slack_error, 'err': True},
        trigger_rule = trigger_rule.TriggerRule.ONE_FAILED
    )
    delete_cluster = d.delete_cluster(default_dag_args['cluster_name'])
    notify_finish = po.PythonOperator(
        task_id = 'notify-on-completion',
        python_callable = slack,
        op_kwargs={'msg': slack_finish},
        trigger_rule = trigger_rule.TriggerRule.ALL_DONE
    )
    notify_start >> create_dataproc_cluster >> [assorted dataproc tasks] >> delete_cluster >> notify_on_fail >> notify_finish
The problem I am facing is that if one of the dataproc tasks fails, the notify_on_fail task does not trigger, despite having the ONE_FAILED trigger rule. Instead, the DAG spins down the cluster and sends the all-clear message (notify_finish). Are my tasks in the wrong order, or is something else wrong?
Based on what you expect, I think the dependencies should be defined like this:
notify_start >> create_dataproc_cluster >> [assorted dataproc tasks]
[assorted dataproc tasks] >> notify_on_fail >> delete_cluster
[assorted dataproc tasks] >> notify_finish >> delete_cluster
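To see why the original ordering never fires notify_on_fail, note that a trigger rule is evaluated only against a task's direct upstream tasks. The sketch below is a simplified pure-Python model of three rules, not Airflow's implementation; `should_run` and the state strings are hypothetical names used for illustration, and it assumes delete_cluster runs and succeeds even after a dataproc failure, as the question describes.

```python
def should_run(rule, upstream_states):
    """Toy model: decide whether a task runs, given the final states
    of its DIRECT upstream tasks only."""
    failed = sum(s in ("failed", "upstream_failed") for s in upstream_states)
    succeeded = sum(s == "success" for s in upstream_states)
    if rule == "all_success":
        return succeeded == len(upstream_states)
    if rule == "one_failed":
        return failed >= 1
    if rule == "all_done":
        return True  # every upstream has finished, whatever its state
    raise ValueError(f"unknown rule: {rule}")


# Original chain: dataproc task (failed) >> delete_cluster >> notify_on_fail.
# notify_on_fail only sees delete_cluster, which succeeded.
original = should_run("one_failed", ["success"])

# Suggested layout: the dataproc tasks are direct upstreams of notify_on_fail.
suggested = should_run("one_failed", ["success", "failed"])

print(original, suggested)  # False True
```

In the original chain the failure is hidden behind delete_cluster, so the ONE_FAILED rule has nothing to react to, while notify_finish with ALL_DONE runs either way.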
I have a few tasks that can be run at the same time. When they're finished I need to run a final task. I've tried to do this using task grouping like so:
import datetime
import airflow
from airflow.utils.task_group import TaskGroup
with airflow.DAG(
    'my_dag',
    catchup=False,
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1),
) as dag:
    with TaskGroup(group_id='task_group_1') as tg1:
        task1 = MyOperator(
            task_id='task1',
            dag=dag,
        )
        task2 = MyOperator(
            task_id='task2',
            dag=dag,
        )
        [task1, task2]
    final_task = MyOtherOperator(
        task_id="final_task",
        dag=dag
    )
    tg1 >> final_task
However, what happens here is that final_task is run multiple times, once after each task in the task group:
task1 -> final_task
task2 -> final_task
What I want is for the tasks in the group to run in parallel and, once they have all finished, for the final task to run just once:
[task1, task2] -> final_task
I thought using task groups would help me accomplish this requirement but it isn't working as expected. Can anyone help? Thank you.
EDIT: Here is the result from the Airflow docs example. It results in task3 being run after both group1.task1 and group1.task2. I need it to run just once after both of the grouped tasks are finished.
LAST EDIT:
It turns out I misunderstood tree view - graph view confirms the grouping operation though I am still getting some other errors for the final task. Thanks for helping me learn more about DAGs.
Try removing [task1, task2] from the TaskGroup so that it looks like the following:
import datetime
import airflow
from airflow.utils.task_group import TaskGroup
with airflow.DAG(
    'my_dag',
    catchup=False,
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1),
) as dag:
    with TaskGroup(group_id='task_group_1') as tg1:
        task1 = MyOperator(
            task_id='task1',
            dag=dag,
        )
        task2 = MyOperator(
            task_id='task2',
            dag=dag,
        )
    final_task = MyOtherOperator(
        task_id="final_task",
        dag=dag
    )
    tg1 >> final_task
I don't think you need to return anything from the TaskGroup as you're doing. Just reference the TaskGroup as a dependency.
Here is an example from the Apache Airflow documentation:
with TaskGroup("group1") as group1:
    task1 = EmptyOperator(task_id="task1")
    task2 = EmptyOperator(task_id="task2")
task3 = EmptyOperator(task_id="task3")
group1 >> task3
Also, you don't need to use TaskGroups to achieve this functionality. You could simply do this:
import datetime
import airflow
from airflow.utils.task_group import TaskGroup
with airflow.DAG(
    'my_dag',
    catchup=False,
    default_args=default_args,
    schedule_interval=datetime.timedelta(days=1),
) as dag:
    task1 = MyOperator(
        task_id='task1',
        dag=dag,
    )
    task2 = MyOperator(
        task_id='task2',
        dag=dag,
    )
    final_task = MyOtherOperator(
        task_id="final_task",
        dag=dag
    )
    task1 >> final_task
    task2 >> final_task
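Either layout describes the same graph: final_task is a single node with two upstream tasks, so it runs once per DAG run. The tree view simply lists it under each parent. A minimal sketch using the standard-library graphlib module (Python 3.9+), which stands in here for the scheduler's ordering and is not part of Airflow:

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it waits on (its upstream tasks).
upstream = {
    "task1": set(),
    "task2": set(),
    "final_task": {"task1", "task2"},
}

order = list(TopologicalSorter(upstream).static_order())

# final_task appears exactly once, after both of its upstream tasks.
print(order.count("final_task"), order[-1])  # 1 final_task
```

In the DAG file itself, the two dependency lines can also be written as `[task1, task2] >> final_task`.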
I have two DAGs: DAG_A and DAG_B.
DAG_A triggers DAG_B through a TriggerDagRunOperator.
My tasks in DAG_B:
with DAG(
    dag_id='DAG_B',
    default_args=default_args,
    schedule_interval='@once',
    description='ETL pipeline for processing users'
) as dag:
    start = DummyOperator(
        task_id='start')
    delete_xcom_task = PostgresOperator(
        task_id='clean_up_xcom',
        postgres_conn_id='postgres_default',
        sql="delete from xcom where dag_id='DAG_A' and task_id='TASK_A' ")
    end = DummyOperator(
        task_id='end')
        #trigger_rule='none_failed')
    # num_table is set by DAG_A. Will have an empty list initially.
    iterable_string = Variable.get("num_table", default_var="[]")
    iterable_list = ast.literal_eval(iterable_string)
    for index, table in enumerate(iterable_list):
        table = table.strip()
        read_src1 = PythonOperator(
            task_id=f'Read_Source_data_{table}',
            python_callable=read_src,
            op_kwargs={'index': index}
        )
        upload_file_to_directory_bulk1 = PythonOperator(
            task_id=f'ADLS_Loading_{table}',
            python_callable=upload_file_to_directory_bulk,
            op_kwargs={'index': index}
        )
        write_Snowflake1 = PythonOperator(
            task_id=f'Snowflake_Staging_{table}',
            python_callable=write_Snowflake,
            op_kwargs={'index': index}
        )
        task_sf_storedproc1 = DummyOperator(
            task_id=f'Snowflake_Processing_{table}'
        )
        start >> read_src1 >> upload_file_to_directory_bulk1 >> write_Snowflake1 >> task_sf_storedproc1 >> delete_xcom_task >> end
After running airflow db init and starting the webserver and scheduler, DAG_B fails in the task delete_xcom_task.
[2021-06-22 08:04:43,647] {taskinstance.py:871} INFO - Dependencies not met for <TaskInstance: Target_DIF.clean_up_xcom 2021-06-22T08:04:27.861718+00:00 [queued]>, dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 2 non-success(es). upstream_tasks_state={'total': 2, 'successes': 0, 'skipped': 0, 'failed': 0, 'upstream_failed': 0, 'done': 0}, upstream_task_ids={'Snowflake_Processing_products', 'Snowflake_Processing_inventories'}
[2021-06-22 08:04:43,651] {local_task_job.py:93} INFO - Task is not able to be run
However, both DAGs succeed from the second run onward.
Can anyone explain to me what is happening internally?
How can I avoid the failure during the first run?
Thanks.
I suspect that the problem is schedule_interval='@once' for DAG_B. When you add the DAG for the first time, that schedule_interval tells the scheduler to run the DAG once. So DAG_B is triggered once by the scheduler, not by DAG_A. Any preparations that DAG_A needs to make for DAG_B to run successfully have not been done yet, so DAG_B fails.
Later on, DAG_A runs as scheduled and triggers DAG_B as expected. Both succeed.
To avoid DAG_B being triggered by the scheduler, set schedule_interval=None.
Pseudo Python code:
def update_tags(product):
    # update tag based on stuff
    ...
def products(p, h, **kwargs):
    # Get products
    for product in products:
        update_tags(product)
with models.DAG(
    "fetchProducts",
    default_args=default_args,
    schedule_interval=None
) as dag:
    get_products = PythonOperator(
        task_id = 'get_products',
        python_callable = products,
        op_kwargs = {'p':params, 'h':headers},
        provide_context=True
    )
    get_products >> ??
What I am trying to achieve:
When the products loop runs, I want each iteration to spawn a task (update_tags) in Airflow and then proceed to the next iteration, i.e. not wait for the task to finish.
My airflow.cfg is using the LocalExecutor.
Currently I have a DAG consisting of 4 operators as shown below:
with DAG('dag', default_args=args, schedule_interval=schedule_interval, catchup=True) as dag:
    main_dag = PythonOperator(
        task_id='1',
        python_callable=func,
        provide_context=True,
        dag=dag)
    run_after_main_dag_1 = PythonOperator(
        task_id='1',
        python_callable=foo,
        provide_context=True,
        dag=dag)
    run_after_main_dag_2 = BranchPythonOperator(
        task_id='2',
        python_callable=foo,
        provide_context=True)
    run_after_main_dag_2_2 = PythonOperator(
        task_id='3',
        python_callable=foo,
        provide_context=False,
        dag=dag)
    #this runs sequential, but shouldn't.
    main_dag >> run_after_main_dag_1 >> run_after_main_dag_2 >> run_after_main_dag_2_2
Here's what I'd like to achieve:
Run the main_dag operator.
Once main_dag is finished, start run_after_main_dag_1 and run_after_main_dag_2 in parallel, as they do not depend on each other.
I simply can't find how to achieve this anywhere in the docs. There must be a simple syntax I have completely overlooked.
Does anyone know how to make this happen?
So there was a simple answer:
main_dag >> run_after_main_dag_1
main_dag >> run_after_main_dag_2 >> run_after_main_dag_2_2
In Airflow, >> and << are used to set downstream and upstream dependencies.
Your code
main_dag >> run_after_main_dag_1 >> run_after_main_dag_2 >> run_after_main_dag_2_2 #sequentially
defines a chain that runs sequentially: run_after_main_dag_1's upstream is main_dag, run_after_main_dag_2's upstream is run_after_main_dag_1, and so on.
To decouple run_after_main_dag_1 and run_after_main_dag_2, define the relationships so that both have main_dag as their upstream task:
main_dag >> run_after_main_dag_1 # It is just dependent on main_dag
main_dag >> run_after_main_dag_2 # It is just dependent on main_dag
Airflow will then kick off the two tasks in parallel once the main_dag task finishes its execution.
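The effect of >> can be illustrated with a toy class. This is only a sketch of the idea, not Airflow's operator code: `Task` and its `upstream` attribute are hypothetical names, and the one property borrowed from Airflow is that `a >> b` evaluates to `b`, which is what makes chaining work.

```python
class Task:
    """Toy stand-in for an operator; it only tracks upstream task ids."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()

    def __rshift__(self, other):
        # a >> b: b gains a as an upstream task; returning b allows chaining.
        other.upstream.add(self.task_id)
        return other


main_dag = Task("main_dag")
after_1 = Task("run_after_main_dag_1")
after_2 = Task("run_after_main_dag_2")
after_2_2 = Task("run_after_main_dag_2_2")

# Fan-out: both tasks depend only on main_dag, so they can run in parallel.
main_dag >> after_1
main_dag >> after_2 >> after_2_2

print(after_1.upstream, after_2.upstream, after_2_2.upstream)
# {'main_dag'} {'main_dag'} {'run_after_main_dag_2'}
```

Because neither after_1 nor after_2 lists the other as upstream, nothing forces one to wait for the other.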
I read "How to use airflow xcoms with MySqlOperator", and while it has a similar title, it doesn't really address my issue.
I have the following code:
def branch_func_is_new_records(**kwargs):
    ti = kwargs['ti']
    xcom = ti.xcom_pull(task_ids='query_get_max_order_id')
    string_to_print = 'Value in xcom is: {}'.format(xcom)
    logging.info(string_to_print)
    if int(xcom) > int(LAST_IMPORTED_ORDER_ID):
        return 'import_orders'
    else:
        return 'skip_operation'
query_get_max_order_id = 'SELECT COALESCE(max(orders_id),0) FROM warehouse.orders where orders_id>1 limit 10'
get_max_order_id = MySqlOperator(
    task_id='query_get_max_order_id',
    sql= query_get_max_order_id,
    mysql_conn_id=MyCon,
    xcom_push=True,
    dag=dag)
branch_op_is_new_records = BranchPythonOperator(
    task_id='branch_operation_is_new_records',
    provide_context=True,
    python_callable=branch_func_is_new_records,
    dag=dag)
get_max_order_id >> branch_op_is_new_records >> import_orders
branch_op_is_new_records >> skip_operation
The MySqlOperator returns a number, and based on that number the BranchPythonOperator chooses the next task. The MySqlOperator is guaranteed to return a value greater than 0.
My problem is that nothing is pushed to XCom by the MySqlOperator.
In the UI, when I go to XCom, I see nothing. The BranchPythonOperator obviously reads nothing, so my code fails.
Why doesn't XCom work here?
The MySQL operator currently (Airflow 1.10.0 at the time of writing) doesn't support returning anything in XCom, so for now the fix is to write a small operator yourself. You can do this directly in your DAG file (untested, so there may be silly errors):
from airflow.operators.mysql_operator import MySqlOperator as BaseMySqlOperator
from airflow.hooks.mysql_hook import MySqlHook
class ReturningMySqlOperator(BaseMySqlOperator):
    def execute(self, context):
        self.log.info('Executing: %s', self.sql)
        hook = MySqlHook(mysql_conn_id=self.mysql_conn_id,
                         schema=self.database)
        return hook.get_first(
            self.sql,
            parameters=self.parameters)
def branch_func_is_new_records(**kwargs):
    ti = kwargs['ti']
    xcom = ti.xcom_pull(task_ids='query_get_max_order_id')
    string_to_print = 'Value in xcom is: {}'.format(xcom)
    logging.info(string_to_print)
    if str(xcom) == 'NewRecords':
        return 'import_orders'
    else:
        return 'skip_operation'
query_get_max_order_id = 'SELECT COALESCE(max(orders_id),0) FROM warehouse.orders where orders_id>1 limit 10'
get_max_order_id = ReturningMySqlOperator(
    task_id='query_get_max_order_id',
    sql= query_get_max_order_id,
    mysql_conn_id=MyCon,
    # xcom_push=True,
    dag=dag)
branch_op_is_new_records = BranchPythonOperator(
    task_id='branch_operation_is_new_records',
    provide_context=True,
    python_callable=branch_func_is_new_records,
    dag=dag)
get_max_order_id >> branch_op_is_new_records >> import_orders
branch_op_is_new_records >> skip_operation
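One thing to check when adapting this: get_first returns the first row of the result rather than a bare number, so the value pulled from XCom should be a one-element sequence that needs unpacking before the comparison from the question can be applied. The sketch below uses the standard-library sqlite3 module as a stand-in for MySQL to show the shape of the value; the table contents, `choose_branch`, and the `LAST_IMPORTED_ORDER_ID` value are made up for illustration.

```python
import sqlite3

LAST_IMPORTED_ORDER_ID = 2  # hypothetical value for the example

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (orders_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (3,)])

# fetchone() returns the first row as a tuple, which is the shape a
# "first row" helper hands back to the caller.
row = conn.execute(
    "SELECT COALESCE(max(orders_id), 0) FROM orders WHERE orders_id > 1"
).fetchone()
conn.close()


def choose_branch(xcom_value):
    """Return the task_id to follow, unpacking the single-column row."""
    max_order_id = int(xcom_value[0])
    if max_order_id > int(LAST_IMPORTED_ORDER_ID):
        return "import_orders"
    return "skip_operation"


print(row, choose_branch(row))  # (3,) import_orders
```

Indexing with `[0]` works whether XCom hands the row back as a tuple or as a list.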