Dynamically generated tasks in task group Airflow and condition for EmailOperator

I want to build the following DAG in Airflow.
If the search_jira_tickets task finds new tickets, it should return the list of tickets to process according to the scheme above. There are a few problems:
I get the Airflow exception TypeError: 'XComArg' object is not iterable when I iterate over the list returned by the function serch_new_jira_tickets(). I need to iterate because one ticket can be good and another not. Here is my DAG:
@task
def serch_new_jira_tickets():
    jql = 'MY_JQL_QUERY'
    issues_list = jira.search_issues(jql)
    if issues_list:
        return issues_list
    else:
        raise AirflowSkipException('No new issues found')

@task
def check_ticket(issue):
    ...

@task
def process_ticket(issue):
    ...

with DAG(
    dag_id='update_tickets',
    default_args=default_args,
    schedule_interval='@hourly'
) as dag:
    new_tickets = serch_new_jira_tickets()
    for ticket in new_tickets:
        with TaskGroup(group_id='process_funds_jira_tickets') as group:
            email_manager = EmailOperator(
                task_id='send_email',
                to='me@example.com',
                subject='Value in jira ticket was updated',
                html_content='Value in ticket has been updated',
                dag=dag)
            check_ticket = check_ticket(ticket)
            process_ticket = process_ticket(ticket)
            check_ticket >> process_ticket >> email_manager
    new_tickets >> group
I don't know how to make the EmailOperator conditional, so that it runs only if one of the fields of the Jira ticket equals 100 and does nothing otherwise. In other words, if one of the values in the process_ticket task == 100, run the email_manager task; otherwise skip it.

For your first problem, I think what you want is the Dynamic Task Mapping feature introduced in Airflow 2.3. The TypeError occurs because serch_new_jira_tickets() returns an XComArg, a placeholder for a value that exists only once the task has run, so there is nothing to loop over while the DAG file is parsed. Before 2.3, looping over a variable number of tasks could only be done with hacks.
Assuming you can use Airflow 2.3, modify your task serch_new_jira_tickets (sic) to return a list of tickets. If there are no tickets, it should return an empty list.
You can then remove your TaskGroup and do this:
new_tickets = serch_new_jira_tickets()
# expand() keyword names must match the task's parameter name (issue)
checked = check_ticket.expand(issue=new_tickets)
processed = process_ticket.expand(issue=checked)
emailed = email_manager.expand(subject=processed)
I think the EmailOperator would need to be tweaked as well, but I'm not sure how you are passing in the template parameters. Perhaps your process_ticket task returns subject strings?
email_manager = EmailOperator.partial(
    task_id='send_email',
    to='me@example.com',
    # subject is left out here because expand(subject=...) supplies it per ticket;
    # an argument cannot be passed to both partial() and expand()
    html_content='Value in ticket has been updated',
    dag=dag)
For your second problem, I suspect you want the ShortCircuitOperator. You would add a ShortCircuitOperator upstream of the EmailOperator (or of a dummy task): when its callable returns a falsy value, the downstream tasks are skipped.
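A minimal sketch of the check such a ShortCircuitOperator could run, assuming process_ticket returns the ticket's fields as a dictionary. The dictionary shape, the name should_send_email and the field names are hypothetical. Only the predicate is shown; the operator wiring is left as a comment:

```python
def should_send_email(ticket_fields, target=100):
    """Return True when at least one field value equals the target."""
    return any(value == target for value in ticket_fields.values())


# Inside the DAG this function would be handed to the operator as its
# python_callable, with the operator placed between process_ticket and send_email.

print(should_send_email({"progress": 100, "priority": 3}))
print(should_send_email({"progress": 40, "priority": 3}))
```

Because the predicate is a plain function, it can be checked without running a scheduler.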

Related

Pull list xcoms in TaskGroups not working

My airflow code has the below Python Operator callable where I am creating a list and pushing it to xcoms:
keys = []
values = []

def attribute_count_check(e_run_id, **context):
    job_run_id = int(e_run_id)
    da = "select count (distinct row_num) from dds_metadata.dds_temp_att_table where run_id ={}".format(job_run_id)
    cursor.execute(da)
    res = cursor.fetchall()
    view_res = [x for row in res for x in row]
    count_of_sql = view_res[0]
    print(count_of_sql)
    if count_of_sql < 1:
        print("deleting of cluster")
        return 'delete_cluster'
    else:
        print("triggering attr_check")
        num_attributes_per_task = num_attr  # job_config
        diff = math.ceil(count_of_sql / num_attributes_per_task)
        instance = int(diff)
        n = num_attributes_per_task
        global values
        global keys
        for r in range(1, instance + 1):
            # a = r
            keys.append(r)
            lower_ranges = (n * (r - 1)) + 1
            upper_range = (n * (r - 1)) + n
            b = (lower_ranges, upper_range)
            values.append(b)
        task_instance = context['task_instance']
        task_instance.xcom_push(key="di_keys", value=keys)
        task_instance.xcom_push(key="di_values", value=values)
The XComs from the job are shown in the screenshot below:
Now I am trying to fetch the values from XComs to create clusters dynamically with the code below:
with TaskGroup('dataproc_create_cluster', prefix_group_id=False) as dataproc_create_clusters:
    for i in zip('{{ ti.xcom_pull(key="di_keys")}}', '{{ ti.xcom_pull(key="di_values")}}'):
        dynmaic_create_cluster = DataprocCreateClusterOperator(
            task_id="create_cluster_{}".format(list(eval(str(i)))[0]),
            project_id='{0}'.format(PROJECT),
            cluster_config=CLUSTER_GENERATOR_CONFIG,
            region='{0}'.format(REGION),
            cluster_name="dataproc-cluster-{}-sit".format(str(i[0])),
        )
But I am getting the below error:
Broken DAG: [/opt/airflow/dags/Cluster_config.py] Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models/baseoperator.py", line 547, in __init__
validate_key(task_id)
File "/usr/local/lib/python3.6/site-packages/airflow/utils/helpers.py", line 56, in validate_key
"dots and underscores exclusively".format(k=k)
airflow.exceptions.AirflowException: The key (create_cluster_{) has to be made of alphanumeric characters, dashes, dots and underscores exclusively
So I changed the task_id as below:
task_id="create_cluster_"+re.sub(r'\W+', '', str(list(eval(str(i)))[0])),
After which I got the below error:
airflow.exceptions.DuplicateTaskIdFound: Task id 'create_cluster_' has already been added to the DAG
This made me think that the value in XComs is being parsed one character at a time, so I set render_template_as_native_obj=True.
But I am still getting the duplicate task id error.
Regarding the jinja2 templating outside of templated fields
First, you can only use jinja2 templating in templated fields. Simply put, there are two phases: parsing the DAG, which happens first, and executing the tasks. At the moment your DAG is parsed, no tasks have run yet, so there is no TaskInstance and no XCom pull available. Templated fields are different: their values are rendered at the moment the task executes, when the TaskInstance and the XCom pull are available.
For example, a PythonOperator has the following templated fields:
template_fields: Sequence[str] = ('templates_dict', 'op_args', 'op_kwargs')
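To see why the loop in the question produced those two errors, here is a small sketch in plain Python with no Airflow involved. At parse time each template is just a string, so zip() pairs up its characters. The question's list(eval(str(i)))[0] is equivalent to pair[0] below:

```python
import re

keys_template = '{{ ti.xcom_pull(key="di_keys")}}'
values_template = '{{ ti.xcom_pull(key="di_values")}}'

# Nothing has been rendered yet, so zip() walks the two strings character by character.
pairs = list(zip(keys_template, values_template))

# First attempt from the question: the task id ends in a literal brace.
raw_ids = ["create_cluster_{}".format(pair[0]) for pair in pairs[:2]]

# Second attempt: stripping non-word characters leaves the same id twice.
clean_ids = ["create_cluster_" + re.sub(r"\W+", "", pair[0]) for pair in pairs[:2]]

print(raw_ids)
print(clean_ids)
```

The first list explains the invalid key create_cluster_{, and the second explains the DuplicateTaskIdFound error for create_cluster_.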
Changing the number of tasks based on the result of a task
Second, you cannot change the number of tasks a DAG contains based on the output of a task. Airflow simply does not support this, with one exception: mapped tasks (available from Airflow 2.3). There is a nice example in the docs that I copied here:
from datetime import datetime

from airflow import DAG
from airflow.decorators import task

@task
def make_list():
    # This can also be from an API call, checking a database, -- almost anything you like, as long as the
    # resulting list/dictionary can be stored in the current XCom backend.
    return [1, 2, {"a": "b"}, "str"]

@task
def consumer(arg):
    # Each mapped task instance receives one element of the list
    print(arg)

with DAG(dag_id="dynamic-map", start_date=datetime(2022, 4, 2)) as dag:
    consumer.expand(arg=make_list())
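For the cluster question above, a mapped task needs an upstream task that returns one item per cluster. A rough sketch of that list, reusing the range arithmetic from attribute_count_check. The helper name build_ranges is hypothetical; in a DAG its body would sit inside an @task function whose result is passed to expand():

```python
import math


def build_ranges(total_rows, rows_per_task):
    """Return consecutive (lower, upper) chunks of rows_per_task rows each.

    As in the original code, the last upper bound may exceed total_rows.
    """
    chunks = math.ceil(total_rows / rows_per_task)
    return [
        (rows_per_task * (r - 1) + 1, rows_per_task * r)
        for r in range(1, chunks + 1)
    ]


print(build_ranges(25, 10))
```

Returning an empty list when there are no rows fits mapped tasks well, since zero items simply means zero mapped task instances.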

Airflow Broken DAG error during dynamic task creation with variables

I am trying to create dynamic tasks depending on an Airflow Variable.
My code is:
default_args = {
    'start_date': datetime(year=2021, month=6, day=20),
    'provide_context': True
}

with DAG(
    dag_id='Target_DIF',
    default_args=default_args,
    schedule_interval='@once',
    description='ETL pipeline for processing users'
) as dag:
    iterable_list = Variable.get("num_table")
    for index, table in enumerate(iterable_list):
        read_src1 = PythonOperator(
            task_id=f'read_src_{table}',
            python_callable=read_src,
        )
        upload_file_to_directory_bulk1 = PythonOperator(
            task_id=f'upload_file_to_directory_bulk_{table}',
            python_callable=upload_file_to_directory_bulk
        )
        write_Snowflake1 = PythonOperator(
            task_id=f'write_Snowflake_{table}',
            python_callable=write_Snowflake
        )
        # TaskGroup level dependencies
        # DAG level dependencies
        start >> read_src1 >> upload_file_to_directory_bulk1 >> write_Snowflake1 >> end
I am facing the below error :
Broken DAG: [/home/dif/airflow/dags/target_dag.py] Traceback (most recent call last):
airflow.exceptions.AirflowException: The key (read_src_[) has to be made of alphanumeric characters, dashes, dots and underscores exclusively
The code works perfectly with this change:
#iterable_list = Variable.get("num_table")
iterable_list = ['inventories', 'products']
Start and End are dummy operators.
Airflow variable has data as shown in the image.
My expected dynamic workflow:
I am able to achieve the above flow with a list but not with an Airflow Variable.
Any leads on the cause of the error are appreciated. Thanks.
Variable.get("num_table") returns a string.
Thus your loop is actually iterating over the characters of "['inventories', 'products']", which is why in the first iteration of the loop task_id=f'read_src_{table}' becomes read_src_[, and [ is not a valid character for a task_id.
You should convert the string into a list.
Save your variable as "inventories,products" and then you can do:
iterable_string = Variable.get("num_table")
iterable_list = iterable_string.split(",")
for index, table in enumerate(iterable_list):
Note that calling Variable.get("num_table") in top-level code is a very bad practice: top-level code runs every time the DAG file is parsed, so each parse queries the metadata database.
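The character-by-character behaviour can be reproduced without Airflow. In this sketch the string literals stand in for whatever Variable.get() would return, and the sample values are assumptions:

```python
stored_value = "['inventories', 'products']"  # a str, not a list

# Iterating a string yields single characters, starting with the bracket.
first_item = next(iter(stored_value))

# Storing a plain comma-separated value instead makes split() enough.
tables = "inventories,products".split(",")
task_ids = [f"read_src_{table}" for table in tables]

print(repr(first_item), task_ids)
```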
The problem is that by default, Airflow reads the variables as str. Try using this:
iterable_list = Variable.get("num_table", deserialize_json=True)
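With deserialize_json=True the stored string is parsed as JSON, so the Variable has to hold valid JSON, which means double-quoted strings. A plain-Python sketch of the difference, using json.loads directly (the sample values are assumptions):

```python
import json

valid = '["inventories", "products"]'
python_style = "['inventories', 'products']"

tables = json.loads(valid)

# Single-quoted strings are valid Python but not valid JSON.
try:
    json.loads(python_style)
    parsed_python_style = True
except json.JSONDecodeError:
    parsed_python_style = False

print(tables, parsed_python_style)
```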
I was able to arrive at the solution with the following modifications:
import ast
...
...
iterable_string = Variable.get("num_table",default_var="[]")
iterable_list = ast.literal_eval(iterable_string)
...
Airflow variables are stored as strings.
So my data was stored as "[tab1,tab2]".
So I have used literal_eval to convert the string back to list.
I have also added an empty list as default so that if no values are present in the variable num_table, I will not process further.
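A small sketch of that conversion in isolation, assuming the stored string quotes each table name. The helper name parse_table_list is hypothetical:

```python
import ast


def parse_table_list(raw):
    """Turn a stored Python-literal string into a list of table names."""
    return list(ast.literal_eval(raw))


print(parse_table_list("['tab1', 'tab2']"))
print(parse_table_list("[]"))

# Unquoted names are not literals, so literal_eval rejects them.
try:
    parse_table_list("[tab1,tab2]")
    unquoted_ok = True
except ValueError:
    unquoted_ok = False
print(unquoted_ok)
```

The "[]" default yields an empty list, so the loop body never runs and no tasks are created.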

combine BranchPythonOperator and PythonVirtualenvOperator

I have a PythonVirtualenvOperator which reads some data from a database - if there is no new data, then the DAG should end there, otherwise it should call additional tasks e.g
#dag.py
load_data >>[if_data,if_no_data]>>another_task>>last_task
I understand that it can be done using BranchPythonOperator, but I can't see how to combine the venv and the branch operator.
Is it doable?
This can be solved using XCom.
load_data can push the number of records it processed (new data).
Your pipeline can be:
def choose(**context):
    value = context['ti'].xcom_pull(task_ids='load_data')
    if int(value) > 0:
        return 'if_data'
    return 'if_no_data'

branch = BranchPythonOperator(
    task_id='branch_task',
    provide_context=True,  # Remove this line if Airflow>=2.0.0
    python_callable=choose)

load_data >> branch >> [if_data, if_no_data] >> another_task >> last_task
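The branch callable can be exercised outside Airflow with a stand-in for the task instance. In this sketch FakeTaskInstance is a hypothetical test double that models only xcom_pull, and the record counts are made up:

```python
class FakeTaskInstance:
    """Stand-in for Airflow's TaskInstance; only xcom_pull is modelled."""

    def __init__(self, pushed):
        self._pushed = pushed

    def xcom_pull(self, task_ids=None, key="return_value"):
        return self._pushed.get((task_ids, key))


def choose(**context):
    value = context["ti"].xcom_pull(task_ids="load_data")
    return "if_data" if int(value) > 0 else "if_no_data"


with_rows = FakeTaskInstance({("load_data", "return_value"): 3})
without_rows = FakeTaskInstance({("load_data", "return_value"): 0})

print(choose(ti=with_rows), choose(ti=without_rows))
```

The returned strings must match the task_id values of the downstream tasks, since that is how the branch operator decides which path to follow.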

Airflow: How to template or pass the output of a Python Callable function as arguments to other tasks?

I'm new to Airflow and working on making my ETL pipeline more re-usable. Originally, I had a few lines of top-level code that would determine the job_start based on a few user input parameters, but I found through much searching that this would trigger at every heartbeat which was causing some unwanted behavior in truncating the table.
Now I am investigating wrapping this top level code into a Python Callable so it is secure from the refresh, but I am unsure of the best way to pass the output to my other tasks. The gist of my code is below:
def get_job_dts():
    # Do something to determine the appropriate job_start_dt and job_end_dt
    # Package up as a list as inputs to other PythonCallables using op_args
    job_params = [job_start_dt, job_end_dt]
    return job_params

t0 = PythonOperator(
    task_id='get_dates',
    python_callable=get_job_dts,
    dag=dag
)

t1 = PythonOperator(
    task_id='task_1',
    python_callable=first_task,
    op_args=job_params,  # <-- How do I send job_params to op_args??
    dag=dag
)

t0 >> t1
I've searched around and hear mentions of jinja templates, variables, or xcoms, but I'm fuzzy on how to implement it. Does anyone have an example I could look at where I can save that list into a variable that can be used by my other tasks?
The best way to do this is to push your value into XCom in get_job_dts, and pull the value back from Xcom in first_task.
def get_job_dts(**kwargs):
    # Do something to determine the appropriate job_start_dt and job_end_dt
    # Package up as a list as inputs to other PythonCallables using op_args
    job_params = [job_start_dt, job_end_dt]
    # Push job_params into XCom
    kwargs['ti'].xcom_push(key='job_params', value=job_params)
    return job_params

def first_task(ti, **kwargs):
    # Pull job_params from XCom
    job_params = ti.xcom_pull(task_ids='get_dates', key='job_params')
    # And then do the rest

t0 = PythonOperator(
    task_id='get_dates',
    provide_context=True,  # needed on Airflow 1.x so that kwargs['ti'] is available
    python_callable=get_job_dts,
    dag=dag
)

t1 = PythonOperator(
    task_id='task_1',
    provide_context=True,
    python_callable=first_task,
    # op_args is not needed: first_task reads job_params from XCom
    dag=dag
)

t0 >> t1
As RyantheCoder mentioned, XCOM is the way to go. My implementation is geared towards the tutorial where I implicitly perform a push automatically from a return value in a PythonCallable.
I am still confused by the difference in passing in (ti, **kwargs) vs. using (**context) to the function that is pulling. Also, where does "ti" come from?
Any clarifications appreciated.
def get_job_dts(**kwargs):
    # Do something to determine the appropriate job_start_dt and job_end_dt
    # Package up as a list as inputs to other PythonCallables using op_args
    job_params = [job_start_dt, job_end_dt]
    # Automatically pushes to XCOM, refer to: Airflow XCOM tutorial: https://airflow.apache.org/concepts.html?highlight=xcom#xcoms
    return job_params

def first_task(**context):
    # Change task_ids to whatever task pushed the XCOM vars you need, rest are standard notation
    job_params = context['task_instance'].xcom_pull(task_ids='get_dates')
    # And then do the rest

t0 = PythonOperator(
    task_id='get_dates',
    python_callable=get_job_dts,
    dag=dag
)

t1 = PythonOperator(
    task_id='task_1',
    provide_context=True,
    python_callable=first_task,
    dag=dag
)

t0 >> t1
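On the (ti, **kwargs) versus (**context) question: both signatures receive the same data. Airflow passes the context entries to the callable as keyword arguments, so a parameter literally named ti captures the ti entry and **kwargs collects the rest, while **context collects everything, ti included. A plain-Python sketch, with a hypothetical dictionary standing in for the real context:

```python
def with_named_ti(ti, **kwargs):
    # 'ti' is bound by name, so it no longer appears in kwargs.
    return ti, sorted(kwargs)


def with_context(**context):
    # Everything, including 'ti', lands in the dictionary.
    return context["ti"], sorted(context)


fake_context = {"ti": "<task instance>", "ds": "2021-06-20", "dag": "<dag>"}

print(with_named_ti(**fake_context))
print(with_context(**fake_context))
```

The names kwargs and context are only conventions; what matters is whether ti is listed as its own parameter.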
Since you mention changing the job start and end time dynamically, I suppose what you need is to create the DAG dynamically rather than just pass arguments to it. In particular, changing the start time and interval without changing the DAG name will cause unexpected results, so it is strongly discouraged. You can refer to this link to see whether that strategy helps.

Airflow - xcom value access into custom operator

I have been using Airflow for the last 6 months, and I have been happy defining workflows in it.
I have the scenario below, where I am not able to get the XCom value (highlighted in yellow).
Please find the sample code below:
Work Flow
def push_function(**context):
    context['ti'].xcom_push(key='reportid', value='xyz')

dummy_operator = DummyOperator(
    task_id='Start',
    dag=main_dag
)

push_function_task = PythonOperator(
    task_id='push_function',
    provide_context=True,
    python_callable=push_function,
    op_kwargs={},
    dag=main_dag)

push_function_task.set_upstream(dummy_operator)

custom_task = CustomOperator(
    dag=main_dag,
    task_id='import_data',
    provide_context=True,
    url="http://www.google.com/{}".format("{{task_instance.xcom_pull(task_ids='push_function')}}")
)

custom_task.set_upstream(push_function_task)
Notes:
1. CustomOperator is my own operator, written for downloading the data for the given URL
Please help me.
Thanks,
Samanth
I believe you have a mismatch in keys when pushing and pulling the XCom. Each XCom value is tied to a DAG ID, task ID, and key. If you are pushing with the reportid key, then you need to pull with it as well.
Note that if a key is not specified to xcom_pull(), it uses the default of return_value. This is because if a task returns a result, Airflow automatically pushes it to XCom under the return_value key.
This gives you two options to fix your issue:
1) Continue to push to the reportid key and make sure you pull from it as well
def push_function(**context):
    context['ti'].xcom_push(key='reportid', value='xyz')

...

custom_task = CustomOperator(
    ...
    url="http://www.google.com/{}".format("{{ task_instance.xcom_pull(task_ids='push_function', key='reportid') }}")
)
2) Have push_function() return the value you want to push to XCom, then pull from the default key.
def push_function(**context):
    return 'xyz'

...

custom_task = CustomOperator(
    ...
    url="http://www.google.com/{}".format("{{ task_instance.xcom_pull(task_ids='push_function') }}")
)
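The key mismatch can be pictured with a toy in-memory model. This is not Airflow code: ToyXCom is a hypothetical class that only mimics the lookup by task ID and key, with return_value as the default key:

```python
class ToyXCom:
    """Toy in-memory model of XCom lookups; not Airflow code."""

    def __init__(self):
        self._store = {}

    def push(self, task_id, value, key="return_value"):
        self._store[(task_id, key)] = value

    def pull(self, task_id, key="return_value"):
        return self._store.get((task_id, key))


xcom = ToyXCom()
xcom.push("push_function", "xyz", key="reportid")

# Pulling without a key looks under return_value and finds nothing.
print(xcom.pull("push_function"))
# Pulling with the matching key finds the value.
print(xcom.pull("push_function", key="reportid"))
```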
