When using TriggerDagRunOperator to trigger another DAG, the run just gets a generic ID like trig__<timestamp>.
Is it possible to give this run_id a meaningful name so I can easily identify different DAG runs?
You can't do this directly with the TriggerDagRunOperator, as the run_id is generated inside its execute method. However, you could implement your own operator, say CustomTriggerDagRunOperator, that behaves the way you want/need. For example:
import json

from airflow.api.common.experimental.trigger_dag import trigger_dag
from airflow.operators.dagrun_operator import TriggerDagRunOperator, DagRunOrder
from airflow.utils import timezone


class CustomTriggerDagRunOperator(TriggerDagRunOperator):

    def execute(self, context):
        # Build the run_id ourselves instead of letting the parent class do it
        if self.execution_date is not None:
            run_id = 'trig__{}'.format(self.execution_date)
            self.execution_date = timezone.parse(self.execution_date)
        else:
            run_id = 'trig__' + timezone.utcnow().isoformat()
        # Append the ID of the DAG being triggered so runs are easy to identify
        run_id += f'_{self.trigger_dag_id}'
        dro = DagRunOrder(run_id=run_id)
        if self.python_callable is not None:
            dro = self.python_callable(context, dro)
        if dro:
            trigger_dag(dag_id=self.trigger_dag_id,
                        run_id=dro.run_id,
                        conf=json.dumps(dro.payload),
                        execution_date=self.execution_date,
                        replace_microseconds=False)
        else:
            self.log.info("Criteria not met, moving on")
The example above just appends the ID of the triggered DAG; you can use the same strategy to set the run_id to anything you like.
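For instance, a minimal usage sketch (the task and DAG names here are placeholders), where the python_callable can also rewrite the run_id:

def set_custom_run_id(context, dro):
    # dro is the DagRunOrder built in execute(); its run_id can be overwritten here
    dro.run_id = 'my_meaningful_name__{}'.format(context['ds'])
    return dro

trigger = CustomTriggerDagRunOperator(
    task_id='trigger_downstream',
    trigger_dag_id='downstream_dag',  # placeholder target DAG
    python_callable=set_custom_run_id,
    dag=dag,
)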
Another option is the experimental REST API, available in Airflow 1.10+:
POST /api/experimental/dags/<DAG_ID>/dag_runs

{
    "conf": {"customer_id": "ABCDEF6000", "trans_id": "AN6000"},
    "run_id": "5dd89388-642a-4bf2-8776-d7a5284ee0d0"
}
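For example, with the requests library (the host, DAG ID, and payload below are placeholders):

import requests

# Trigger a DAG run with an explicit, meaningful run_id
resp = requests.post(
    'http://localhost:8080/api/experimental/dags/my_dag/dag_runs',
    json={
        'conf': {'customer_id': 'ABCDEF6000', 'trans_id': 'AN6000'},
        'run_id': 'my_meaningful_run_id',
    },
)
resp.raise_for_status()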
I want to build the following DAG in Airflow:
If there are new tickets found by the search task, it should return a list of tickets that I then process according to the scheme above. There are a few problems:
I get the Airflow exception TypeError: 'XComArg' object is not iterable when I iterate over the list returned by serch_new_jira_tickets(). I need the iteration because one ticket can be good and another not. Here is my DAG:
from airflow.decorators import task
from airflow.exceptions import AirflowSkipException

@task
def serch_new_jira_tickets():
    jql = 'MY_JQL_QUERY'
    issues_list = jira.search_issues(jql)
    if issues_list:
        return issues_list
    else:
        raise AirflowSkipException('No new issues found')

@task
def check_ticket(issue):
    ...

@task
def process_ticket(issue):
    ...
with DAG(
    dag_id='update_tickets',
    default_args=default_args,
    schedule_interval='@hourly'
) as dag:
    new_tickets = serch_new_jira_tickets()
    for ticket in new_tickets:
        with TaskGroup(group_id='process_funds_jira_tickets') as group:
            email_manager = EmailOperator(
                task_id='send_email',
                to='me@example.com',
                subject='Value in jira ticket was updated',
                html_content='Value in ticket has been updated',
                dag=dag)
            check_ticket = check_ticket(ticket)
            process_ticket = process_ticket(ticket)
            check_ticket >> process_ticket >> email_manager
        new_tickets >> group
I don't know how to create a condition for the EmailOperator under which it would execute only if one of the fields in the Jira ticket == 100; otherwise nothing should happen. That is, if one of the values in the process_ticket task == 100, run the email_manager task, otherwise don't.
For your first problem, I think what you want is the new Dynamic Task Mapping feature in Airflow 2.3. Prior to this version, any kind of for loop over a variable number of tasks could only be done with hacks.
Assuming you are able to use Airflow 2.3, you need to modify your task serch_new_jira_tickets (sic) to return a list of tickets. If there are no tickets, it should return an empty list.
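For example, a minimal sketch of the modified task (keeping the original function name and assuming the same jira client from the question):

@task
def serch_new_jira_tickets():
    jql = 'MY_JQL_QUERY'
    # Return an empty list instead of raising AirflowSkipException,
    # so the mapped tasks downstream simply expand over zero tickets
    return jira.search_issues(jql) or []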
You can then remove your TaskGroup and do this:
new_tickets = serch_new_jira_tickets()
checked = check_ticket.expand(ticket=new_tickets)
processed = process_ticket.expand(ticket=checked)
emailed = email_manager.expand(subject=processed)
I think the EmailOperator would need to be tweaked as well, but I'm not sure how you are passing in the template parameters. Perhaps your process_ticket task returns subject strings?
email_manager = EmailOperator.partial(
    task_id='send_email',
    to='me@example.com',
    # subject is omitted here because it is supplied per-ticket via .expand() above
    html_content='Value in ticket has been updated',
    dag=dag)
For your second problem, I suspect you want the ShortCircuitOperator: add one more task, a ShortCircuitOperator in front of the EmailOperator, whose callable returns True when the field == 100 (so the email runs) and False otherwise (so it is skipped).
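A minimal sketch, under the assumption that process_ticket returns the field's value via XCom (the task and field names here are placeholders):

from airflow.operators.python import ShortCircuitOperator

def field_is_100(ti):
    # Assumption: process_ticket returned the relevant field value
    return ti.xcom_pull(task_ids='process_ticket') == 100

should_email = ShortCircuitOperator(
    task_id='should_email',
    python_callable=field_is_100,
)

# Downstream tasks (here, the EmailOperator) are skipped when the callable returns False
should_email >> email_manager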
I currently have an Airflow DAG that takes multiple values as arguments, and I plan to use them dynamically to run the steps within the DAG.
For example, I have this method to push the values into XCom:
import sys

def push_to_xcom(ds, **kwargs):
    shape_change_tables = []
    ss_cd = {}
    env = ''
    if 'env' in kwargs['dag_run'].conf:
        env = kwargs['dag_run'].conf['env']
    else:
        env = 'dev'
        print("by default environment taken as 'dev'")
    if isinstance(kwargs['dag_run'].conf['ss_cd'], dict):
        ss_cd = dict(kwargs['dag_run'].conf['ss_cd'])
    else:
        print('<<<<<<<<<<Pass ss_cd as an argument>>>>>>>>>>>')
        sys.exit(-1)
    if isinstance(kwargs['dag_run'].conf['shape'], list):
        shape_change_tables = list(kwargs['dag_run'].conf['shape'])
    else:
        print('<<<<<<<<<<Pass shape change tables as an argument>>>>>>>>>>>')
        sys.exit(-1)
    kwargs['ti'].xcom_push(key='shape_change_tables', value=shape_change_tables)
    kwargs['ti'].xcom_push(key='ss_cd', value=ss_cd)
    kwargs['ti'].xcom_push(key='env', value=env)
I'd need to use those 3 XCom variables outside the operator, within the same DAG.
Let's say I need to use the variable shape_change_tables, which is a list of dictionaries, in the following loop:
for i in json.loads(open('somejson.json', 'r').read())['tables'].keys():
    # I'd want to use the above XCom variable shape_change_tables in the condition below
    if i in {val for dic in shape_change_tables for val in dic.values()}:
        ...
How can I pull the value? Can I simply use the line below before the if condition?
shape_change_tables = ti.xcom_pull(key='shape_change_tables', task_ids='push_to_xcom')
If anyone has come across the same, any help would be appreciated.
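One thing to keep in mind is that xcom_pull needs a live task instance, so the pull has to happen inside another task (e.g. a second PythonOperator) rather than at the top level of the DAG file. A rough sketch of what that could look like, with the callable name being a placeholder:

def use_shape_change_tables(**kwargs):
    ti = kwargs['ti']
    shape_change_tables = ti.xcom_pull(key='shape_change_tables',
                                       task_ids='push_to_xcom')
    wanted = {val for dic in shape_change_tables for val in dic.values()}
    for i in json.loads(open('somejson.json', 'r').read())['tables'].keys():
        if i in wanted:
            ...  # handle the table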
I have a PythonVirtualenvOperator which reads some data from a database. If there is no new data, the DAG should end there; otherwise it should call additional tasks, e.g.:
# dag.py
load_data >> [if_data, if_no_data] >> another_task >> last_task
I understand that this can be done using the BranchPythonOperator, but I can't see how to combine the virtualenv with the branch operator.
Is it doable?
This can be solved using XCom.
load_data can push the number of records it processed (new data).
Your pipeline can be:
def choose(**context):
    value = context['ti'].xcom_pull(task_ids='load_data')
    if int(value) > 0:
        return 'if_data'
    return 'if_no_data'

branch = BranchPythonOperator(
    task_id='branch_task',
    provide_context=True,  # Remove this line if Airflow >= 2.0.0
    python_callable=choose)

load_data >> branch >> [if_data, if_no_data] >> another_task >> last_task
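As for combining this with the virtualenv: PythonVirtualenvOperator pushes its callable's return value to XCom automatically, so load_data only needs to return the record count. A rough sketch, with the query logic left as a placeholder:

def load_data_fn():
    # Placeholder: query the database and count the new records
    new_records = fetch_new_records()  # hypothetical helper
    return len(new_records)

load_data = PythonVirtualenvOperator(
    task_id='load_data',
    python_callable=load_data_fn,
    requirements=['some-db-driver'],  # placeholder requirement
    dag=dag,
)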
I'm new to Airflow and working on making my ETL pipeline more reusable. Originally I had a few lines of top-level code that would determine the job_start based on a few user-input parameters, but I found through much searching that top-level DAG code is re-evaluated at every scheduler heartbeat, which was causing some unwanted behavior in truncating the table.
I am now investigating wrapping this top-level code in a Python callable so it is safe from the refresh, but I am unsure of the best way to pass the output to my other tasks. The gist of my code is below:
def get_job_dts():
    # Do something to determine the appropriate job_start_dt and job_end_dt
    # Package up as a list as inputs to other PythonCallables using op_args
    job_params = [job_start_dt, job_end_dt]
    return job_params

t0 = PythonOperator(
    task_id='get_dates',
    python_callable=get_job_dts,
    dag=dag
)

t1 = PythonOperator(
    task_id='task_1',
    python_callable=first_task,
    op_args=job_params,  # <-- How do I send job_params to op_args??
    dag=dag
)

t0 >> t1
I've searched around and seen mentions of Jinja templates, Variables, and XComs, but I'm fuzzy on how to implement them. Does anyone have an example where that list is saved into a variable that can be used by my other tasks?
The best way to do this is to push your value into XCom in get_job_dts, and pull the value back from XCom in first_task.
def get_job_dts(**kwargs):
    # Do something to determine the appropriate job_start_dt and job_end_dt
    job_params = [job_start_dt, job_end_dt]
    # Push job_params into XCom
    kwargs['ti'].xcom_push(key='job_params', value=job_params)
    return job_params

def first_task(ti, **kwargs):
    # Pull job_params back out of XCom
    job_params = ti.xcom_pull(key='job_params', task_ids='get_dates')
    # And then do the rest

t0 = PythonOperator(
    task_id='get_dates',
    python_callable=get_job_dts,
    dag=dag
)

t1 = PythonOperator(
    task_id='task_1',
    provide_context=True,
    python_callable=first_task,
    # op_args is no longer needed; the values arrive via XCom
    dag=dag
)

t0 >> t1
As RyantheCoder mentioned, XCom is the way to go. My implementation is geared towards the tutorial, where a push is performed implicitly from the return value of the Python callable.
I am still confused by the difference between passing (ti, **kwargs) and (**context) into the pulling function. Also, where does "ti" come from?
Any clarifications appreciated.
def get_job_dts(**kwargs):
    # Do something to determine the appropriate job_start_dt and job_end_dt
    # Package up as a list as inputs to other PythonCallables using op_args
    job_params = [job_start_dt, job_end_dt]
    # Returning the value automatically pushes it to XCom, see the Airflow XCom docs:
    # https://airflow.apache.org/concepts.html?highlight=xcom#xcoms
    return job_params

def first_task(**context):
    # Change task_ids to whatever task pushed the XCom vars you need, rest is standard notation
    job_params = context['task_instance'].xcom_pull(task_ids='get_dates')
    # And then do the rest

t0 = PythonOperator(
    task_id='get_dates',
    python_callable=get_job_dts,
    dag=dag
)

t1 = PythonOperator(
    task_id='task_1',
    provide_context=True,
    python_callable=first_task,
    dag=dag
)

t0 >> t1
Since you mention changing the task start and end times dynamically, I suspect what you actually need is a dynamically generated DAG rather than just passing args into the DAG. In particular, changing the start time and interval without changing the DAG name can cause unexpected results, and I would strongly suggest not doing so. You can refer to this link to see whether that strategy helps.
This is my operator:
bigquery_check_op = BigQueryOperator(
    task_id='bigquery_check',
    bql=SQL_QUERY,
    use_legacy_sql=False,
    bigquery_conn_id=CONNECTION_ID,
    trigger_rule='all_success',
    xcom_push=True,
    dag=dag
)
When I check the Rendered page in the UI, nothing appears there.
When I run the SQL in the console it returns the value 1400, which is correct.
Why doesn't the operator push the XCom?
I can't use BigQueryValueCheckOperator. That operator is designed to fail when a value check fails, and I don't want anything to fail. I simply want to branch the code based on the return value of the query.
Here is how you might be able to accomplish this with the BigQueryHook and the BranchPythonOperator:
from airflow.operators.python_operator import BranchPythonOperator
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

def big_query_check(**context):
    sql = context['templates_dict']['sql']
    bq = BigQueryHook(bigquery_conn_id='default_gcp_connection_id',
                      use_legacy_sql=False)
    conn = bq.get_conn()
    cursor = conn.cursor()
    cursor.execute(sql)
    result = cursor.fetchone()
    # Do something with the result, return the task_id to branch to
    if result[0] == 0:
        return "task_a"
    else:
        return "task_b"

sql = "SELECT COUNT(*) FROM sales"

branching = BranchPythonOperator(
    task_id='branching',
    python_callable=big_query_check,
    provide_context=True,
    templates_dict={"sql": sql},
    dag=dag,
)
First we create a Python callable that we can use to execute the query and select which task_id to branch to. Second, we create the BranchPythonOperator.
The simplest answer is that xcom_push is not one of the parameters accepted by BigQueryOperator, BaseOperator, or LoggingMixin.
The BigQueryGetDataOperator does return (and thus push) some data, but it works by table and column name. You could chain this behavior by making your query write its output to a uniquely named table (maybe use {{ ds_nodash }} in the name), then using that table as the source for this operator, and then branching with the BranchPythonOperator, as sketched below.
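A rough sketch of that chain (the project, dataset, and table names are placeholders):

from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_get_data import BigQueryGetDataOperator

# Write the query result to a uniquely named table
run_query = BigQueryOperator(
    task_id='run_query',
    bql=SQL_QUERY,
    use_legacy_sql=False,
    destination_dataset_table='my_project.my_dataset.check_{{ ds_nodash }}',
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id=CONNECTION_ID,
    dag=dag,
)

# Read it back; the returned rows are pushed to XCom
get_data = BigQueryGetDataOperator(
    task_id='get_data',
    dataset_id='my_dataset',
    table_id='check_{{ ds_nodash }}',
    max_results='1',
    bigquery_conn_id=CONNECTION_ID,
    dag=dag,
)

run_query >> get_data  # then branch on the XCom pushed by get_data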
You might instead try to use the BigQueryHook's get_conn().cursor() to run the query and work with some data inside the BranchPythonOperator.
Elsewhere we chatted and came up with something along these lines for the callable of a BranchPythonOperator:
cursor = BigQueryHook(bigquery_conn_id='connection_name').get_conn().cursor()
# one of these two, depending on your Airflow version:
cursor.execute(SQL_QUERY)  # newer hook API
cursor.job_id = cursor.run_query(bql=SQL_QUERY, use_legacy_sql=False)  # older hook API
result = cursor.fetchone()
return "task_one" if result[0] == 1400 else "task_two"  # depends on results format