When I trigger my DAG I pass an extra set of parameters in dag_run.conf['metadata'], so my trigger event looks like this:
{'bucket': 'blah-blah',
'contentType': 'text/json',
'crc32c': '375Jog==', 'customTime': '1970-01-01T00:00:00.000Z',
'etag': 'CJCqi+DTrO0CEAk=', 'eventBasedHold': False, 'generation': '1606821286696208',
'id': 'xxx',
'kind': 'storage#object',
'md5Hash': 'xxxx',
'mediaLink': 'xxxx',
'metadata': {'url': 'xxx',
'date_extracted': '20201115',
'file_type': 'xxxx',
'filename': 'xxxx.json',
'row_count': '30', 'time_extracted': '063013'},
}
I have a Python function that runs as the on_failure_callback, but the context it receives is completely different from the dag_run context.
The context passed to function on failure is:
{'conf': <airflow.configuration.AutoReloadableProxy object at 0x7fa275de3c18>,
'dag': <DAG: my_dag>, 'ds': '2020-12-09',
'next_ds': '2020-12-09',
'next_ds_nodash': '20201209',
....}
Is there a way to pass dag_run.conf['metadata'] as part of the new context?
I have tried using functools.partial, but "{{ dag_run.conf['metadata'] }}" is passed through as a literal string.
My DataflowTemplateOperator looks like this:
DataflowTemplateOperator(
task_id="df_task1",
job_name="df-{{ dag_run.conf['trace_id'] }}-{{dag_run.conf['file_type']}}",
template="gs://dataflow/my-df-job",
on_failure_callback= partial(task_fail_slack_alert,"{{ dag_run.conf['metadata'] }}"),
parameters={
"filePath":"{{ dag_run.conf['file_name'] }}",
"traceId":"{{ dag_run.conf['trace_id'] }}"
},
)
My callable function just prints its arguments for now:
def task_fail_slack_alert(dag_run,context):
print("payload {}".format(context))
print("dag_run {}".format(dag_run))
Log:
INFO - payload {'conf': <airflow.configuration.AutoReloadableProxy object at 0x7fa275de3c18>...}
INFO - dag_run {{ dag_run.conf['metadata'] }}
You can't use {{ dag_run.conf['metadata'] }} like that: on_failure_callback is not a templated field, so the Jinja string is never rendered. You can access the conf from the context that Airflow passes to the callback function:
def task_fail_slack_alert(context):
dag_run = context.get('dag_run')
task_instances = dag_run.get_task_instances()
print(task_instances)
print(dag_run.conf.get('metadata'))
print(context.get('exception'))
DataflowTemplateOperator(
task_id="df_task1",
job_name="df-{{ dag_run.conf['trace_id'] }}-{{dag_run.conf['file_type']}}",
template="gs://dataflow/my-df-job",
on_failure_callback=task_fail_slack_alert,
parameters={
"filePath":"{{ dag_run.conf['file_name'] }}",
"traceId":"{{ dag_run.conf['trace_id'] }}"
},
)
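If you still want to bind extra static arguments to the callback (a Slack channel name, for example), functools.partial is fine as long as the bound values are not Jinja strings, since on_failure_callback is never templated. A minimal sketch, with the channel argument being purely hypothetical:
from functools import partial

def task_fail_slack_alert(channel, context):
    # Airflow invokes the callback with the failed task's context as the last positional argument
    dag_run = context.get("dag_run")
    metadata = dag_run.conf.get("metadata") if dag_run else None
    print(f"channel={channel} metadata={metadata} exception={context.get('exception')}")

# bind only the static value; anything templated must be read from the context instead
alert_callback = partial(task_fail_slack_alert, "#my-alerts-channel")
# then pass on_failure_callback=alert_callback to DataflowTemplateOperator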
I have a list of users (this is not actual Ansible code, just an example):
users:
-tom
-jerry
-pluto
Now I am trying to create a dynamic structure (a list of dictionaries?) like this, in order to set random passwords for them at runtime:
users_and_passwords:
-tom, "random_password_here1"
-jerry, "random_password_here2"
-pluto, "random_password_here3"
How can I write a set_fact task to generate random password for each user and save it for later use?
How can I then read that and use it in other tasks later?
So far I have tried declaring a list of users (later to be looped from a file) and an empty list of passwords. I am now trying to populate the password list with random strings and then combine the two into a single {user: password} dict. I can't figure out an easier way to do this, and I am stuck in the "populate password list" phase.
users:
  - tom
  - jerry
  - pluto
passwords: []
tasks:
- name: generate passwords in an empty Dict
set_fact:
passwords: "{{ passwords | default({}) + {{ lookup('password', '/dev/null chars=ascii_lowercase,ascii_uppercase,digits,punctuation length=20') }} }}"
with_items: "{{users}}"
Rather than having a complex dictionary that would read:
- tom: password123456
- jerry: password123456789
- muttley: password12345
Aim for a structure that would be normalized, like:
- name: tom
password: password123456
- name: jerry
password: password123456789
- name: muttley
password: password12345
This can be achieved with the task:
- set_fact:
users_with_passwords: >-
{{
users_with_passwords | default([]) +
[{
'name': item,
'password': lookup(
'password',
'/dev/null chars=ascii_lowercase,ascii_uppercase,digits,punctuation length=20'
)
}]
}}
loop: "{{ users }}"
And it can be easily accessed via something like:
- debug:
msg: "{{ item.name }}'s password is `{{ item.password }}`"
loop: "{{ users_with_passwords }}"
loop_control:
label: "{{ item.name }}"
Given the playbook
- hosts: localhost
gather_facts: no
vars:
users:
- tom
- jerry
- muttley
tasks:
- set_fact:
users_with_passwords: >-
{{
users_with_passwords | default([]) +
[{
'name': item,
'password': lookup(
'password',
'/dev/null chars=ascii_lowercase,ascii_uppercase,digits,punctuation length=20'
)
}]
}}
loop: "{{ users }}"
- debug:
msg: "{{ item.name }}'s password is `{{ item.password }}`"
loop: "{{ users_with_passwords }}"
loop_control:
label: "{{ item.name }}"
This yields:
TASK [set_fact] *************************************************************
ok: [localhost] => (item=tom)
ok: [localhost] => (item=jerry)
ok: [localhost] => (item=muttley)
TASK [debug] ****************************************************************
ok: [localhost] => (item=tom) =>
msg: tom's password is `tLY>#jg6k/_|sqke{-mm`
ok: [localhost] => (item=jerry) =>
msg: jerry's password is `Liu1wF#gPM$q^z~g|<E1`
ok: [localhost] => (item=muttley) =>
msg: muttley's password is `BGHL_QUTHmbn\(NGW`pJ`
How can I write a set_fact task to generate random password for each user and save it for later use?
This sounds like a task for the password lookup plugin, which "retrieves or generates a random password, stored in a file". For example:
- name: Create random but idempotent password
set_fact:
toms_password: "{{ lookup('password', '/dev/null', seed=username) }}"
How can I then read that and use it in other tasks later?
Just by using the variable name that was defined during set_fact.
Further reading
Ansible documentation Lookup plugins
How to generate single reusable random password with Ansible
Ansible: Generate random passwords automatically for users
Creating user passwords from an Ansible Playbook
How do I create a user and set a password using Ansible
I want to get the status of a SparkSubmitOperator, transform it into some value that my API understands, and pass it within the payload of a SimpleHttpOperator so that I can update the job status in my DB. I want to do this using the TaskFlow API.
I wrote the code below but I get this error when I try to load it:
Broken DAG: [/opt/airflow/dags/export/inapp_clicks/export.py] Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/baseoperator.py", line 1378, in set_downstream
self._set_relatives(task_or_task_list, upstream=False, edge_modifier=edge_modifier)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/baseoperator.py", line 1316, in _set_relatives
task_object.update_relative(self, not upstream)
AttributeError: 'function' object has no attribute 'update_relative'
Code:
from datetime import datetime
from airflow.decorators import dag, task
from airflow.models import Variable
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
@dag(schedule_interval=None, start_date=datetime.now(), tags=["export", "inapp"])
def export_inapp_clicks():
DEFAULT_NUM_EXECUTORS = 2
DEFAULT_EXECUTOR_CORES = 3
DEFAULT_EXECUTOR_MEMORY = "2g"
DEFAULT_DRIVER_MEMORY = "1g"
    @task()
def update_job_status(dag, ti, execution_date):
jst = dag.get_task("export_inapp_clicks_job_submission")
jsti = TaskInstance(jst, execution_date)
xcom_value = ti.xcom_pull(task_ids="export_inapp_clicks_job_submission")
print("Task:", jst)
print("Task Instance:", jsti)
print("Task State:", jsti.current_state())
print("XCOM Value:", xcom_value)
# TODO: call API via SimpleHttpOperator
job_submission = SparkSubmitOperator(
task_id="export_inapp_clicks_job_submission",
conn_id="yarn",
name="{{ dag_run.conf['name'] }}",
conf=Variable.get("export_inapp_clicks_conf", deserialize_json=True),
jars=Variable.get("export_inapp_clicks_jars"),
application=Variable.get("pyspark_executor_path"),
application_args=[
"--module",
"export_inapp_clicks.export",
"--org-id",
"{{ dag_run.conf['orgId'] }}",
"--app-id",
"{{ dag_run.conf['appId'] }}",
"--inapp-id",
"{{ dag_run.conf['inappId'] }}",
"--start-date",
"{{ dag_run.conf['startDate'] }}",
"--end-date",
"{{ dag_run.conf['endDate'] }}",
"--data-path",
Variable.get("event_data_path"),
"--es-nodes",
Variable.get("es_nodes"),
"--destination",
Variable.get("export_inapp_clicks_output"),
"--explain",
"--debug",
"--encode-columns",
"--log-level",
"WARN"
],
py_files=Variable.get("export_inapp_clicks_py_files"),
num_executors=Variable.get("export_inapp_clicks_num_executors", DEFAULT_NUM_EXECUTORS),
executor_cores=Variable.get("export_inapp_clicks_executor_cores", DEFAULT_EXECUTOR_CORES),
executor_memory=Variable.get("export_inapp_clicks_executor_memory", DEFAULT_EXECUTOR_MEMORY),
driver_memory=Variable.get("export_inapp_clicks_driver_memory", DEFAULT_DRIVER_MEMORY),
status_poll_interval=10
)
job_submission >> update_job_status
export_dag = export_inapp_clicks()
Consider the following example; the first task corresponds to your SparkSubmitOperator task.
_get_upstream_task takes care of getting the state of the first task from within the second one, by performing a query against the metadata database.
DAG definition, first two tasks:
import json
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago
from airflow.utils.db import provide_session
from airflow.models.taskinstance import TaskInstance
from airflow.providers.http.operators.http import SimpleHttpOperator
@dag(
default_args= {"owner": "airflow"},
schedule_interval=None,
start_date=days_ago(0),
catchup=False,
tags=["custom_example", "TaskFlow"],
)
def taskflow_previous_task():
    @provide_session
def _get_upstream_task(upstream_task_id, dag, execution_date, session=None, **_):
upstream_ti = (
session.query(TaskInstance)
.filter(
TaskInstance.dag_id == dag.dag_id,
TaskInstance.execution_date == execution_date,
TaskInstance.task_id == upstream_task_id,
)
.first()
)
return upstream_ti
    @task
def job_submission_task(**context):
print(f"Task Id: {context['ti'].task_id}")
return {"job_data": "something"}
    @task(trigger_rule='all_done')
def update_job_status(job_data, **context):
print(f"Data from previous Task: {job_data['job_data']}")
upstream_ti = _get_upstream_task("job_submission_task", **context)
print(f"Upstream_ti state: {upstream_ti.state}")
return upstream_ti.state
job_results = job_submission_task()
job_status = update_job_status(job_results)
job_submission_task returns a dict that is passed to update_job_status via XCom using XcomArg, which is a core feature of the TaskFlow API. By doing so you avoid explicitly performing xcom_pull() and xcom_push() operations.
Once you get the TaskInstance object from the _get_upstream_task method, you can return its state and retrieve it again from the last task, which will perform the HTTP request:
Final task, end of DAG definition:
task_post_op = SimpleHttpOperator(
task_id="post_op",
endpoint="post",
data=json.dumps({"job_status": f"{job_status}"}),
headers={"Content-Type": "application/json"},
log_response=True,
)
job_status >> task_post_op
example_dag = taskflow_previous_task()
Since the data param of SimpleHttpOperator is templated, you can retrieve the XCom value from the second task using Jinja:
data=json.dumps({"job_status": f"{job_status}"}),
Execution logs:
Task_1:
AIRFLOW_CTX_DAG_RUN_ID=manual__2021-08-20T23:15:15.226853+00:00
[2021-08-20 23:15:17,148] {logging_mixin.py:104} INFO - Task Id: job_submission_task
[2021-08-20 23:15:17,148] {python.py:151} INFO - Done. Returned value was: {'job_data': 'something'}
[2021-08-20 23:15:17,202] {taskinstance.py:1211} INFO - Marking task as SUCCESS.
Task_2:
AIRFLOW_CTX_DAG_ID=taskflow_previous_task
AIRFLOW_CTX_TASK_ID=update_job_status
AIRFLOW_CTX_EXECUTION_DATE=2021-08-20T23:15:15.226853+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2021-08-20T23:15:15.226853+00:00
[2021-08-20 23:15:18,768] {logging_mixin.py:104} INFO - Data from previous Task: something
[2021-08-20 23:15:18,792] {logging_mixin.py:104} INFO - Upstream_ti state: success
[2021-08-20 23:15:18,793] {python.py:151} INFO - Done. Returned value was: success
[2021-08-20 23:15:18,874] {taskinstance.py:1211} INFO - Marking task as SUCCESS.
Task_3:
AIRFLOW_CTX_DAG_ID=taskflow_previous_task
AIRFLOW_CTX_TASK_ID=post_op
AIRFLOW_CTX_EXECUTION_DATE=2021-08-20T23:15:15.226853+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2021-08-20T23:15:15.226853+00:00
[2021-08-20 23:15:21,201] {http.py:111} INFO - Calling HTTP method
[2021-08-20 23:15:21,228] {base.py:78} INFO - Using connection to: id: http_default. Host: https://www.httpbin.org, Port: None, Schema: , Login: , Password: None, extra: {}
[2021-08-20 23:15:21,245] {http.py:140} INFO - Sending 'POST' to url: https://www.httpbin.org/post
[2021-08-20 23:15:21,973] {http.py:115} INFO - {
"args": {},
"data": "{\"job_status\": \"success\"}",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Content-Length": "25",
"Content-Type": "application/json",
"Host": "www.httpbin.org",
"User-Agent": "python-requests/2.25.1",
"X-Amzn-Trace-Id": "Root=1-61203789-0136b7557ba4e0116bb5e16d"
},
"json": {
"job_status": "success"
},
"origin": "200.73.153.254",
"url": "https://www.httpbin.org/post"
}
[2021-08-20 23:15:22,027] {taskinstance.py:1211} INFO - Marking task as SUCCESS.
Let me know if that worked for you; I tried to use as many TaskFlow features as I could.
Source: Docs1 Docs2
Edit:
Added trigger_rule='all_done' to update_job_status task.
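As a side note, and not part of the example above: if you prefer to avoid the manual session query, Airflow 2.x also lets you reach the upstream state through the dag_run object that is already in the task context. A rough, untested sketch of update_job_status written that way:
@task(trigger_rule='all_done')
def update_job_status(job_data, **context):
    print(f"Data from previous Task: {job_data['job_data']}")
    # dag_run is injected into the context; get_task_instance looks up the upstream TaskInstance for us
    upstream_ti = context["dag_run"].get_task_instance("job_submission_task")
    print(f"Upstream_ti state: {upstream_ti.state}")
    return upstream_ti.state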
I have a PythonOperator and a BigQueryInsertJobOperator in my DAG. The result returned by the PythonOperator should be passed to the BigQueryInsertJobOperator in the params field.
Below is the script I am running.
def get_columns():
field = "name"
return field
with models.DAG(
"xcom_test",
default_args=default_args,
schedule_interval="0 0 * * *",
tags=["xcom"],
) as dag:
t1 = PythonOperator(task_id="get_columns", python_callable=get_columns, do_xcom_push=True)
t2 = BigQueryInsertJobOperator(
task_id="bigquery_insert",
project_id=project_id,
configuration={
"query": {
"query": "{% include 'xcom_query.sql' %}",
"useLegacySql": False,
}
},
force_rerun=True,
provide_context=True,
params={
"columns": "{{ ti.xcom_pull(task_ids='get_columns') }}",
"project_id": project_id
},
)
The xcom_query.sql file looks like this:
INSERT INTO `{{ params.project_id }}.test.xcom_test`
{{ params.columns }}
select 'Andrew' from `{{ params.project_id }}.test.xcom_test`
While running this, the columns param is rendered as a literal string, resulting in an error. Below is how the query was rendered:
INSERT INTO `project.test.xcom_test`
{{ ti.xcom_pull(task_ids='get_columns') }}
select 'Andrew' from `project.test.xcom_test`
Any idea what I am missing?
I found the reason why my DAG is failing.
The params field of BigQueryInsertJobOperator is not a templated field, and hence calling task_instance.xcom_pull inside params will not work this way.
Instead, you can directly access the task_instance variable from the Jinja template:
INSERT INTO `{{ params.project_id }}.test.xcom_test`
({{ task_instance.xcom_pull(task_ids='get_columns') }})
select
'Andrew' from `{{ params.project_id }}.test.xcom_test`
https://marclamberti.com/blog/templates-macros-apache-airflow/ - this article explains how to identify template parameters in Airflow.
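A further option, sketched here under the assumption that it matches your provider version: configuration itself is a templated field on BigQueryInsertJobOperator, and Airflow renders nested values, so the XCom pull can also be inlined in the query instead of going through the .sql include:
t2 = BigQueryInsertJobOperator(
    task_id="bigquery_insert",
    project_id=project_id,
    configuration={
        "query": {
            # rendered because "configuration" is a templated field; params values still resolve
            "query": (
                "INSERT INTO `{{ params.project_id }}.test.xcom_test` "
                "({{ task_instance.xcom_pull(task_ids='get_columns') }}) "
                "SELECT 'Andrew' FROM `{{ params.project_id }}.test.xcom_test`"
            ),
            "useLegacySql": False,
        }
    },
    force_rerun=True,
    params={"project_id": project_id},
)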
I have a simple playbook where I am fetching some data from a Vault server using curl.
tasks:
- name: role_id
shell: 'curl \
--header "X-Vault-Token: s.ddDblh8DpHkOu3IMGbwrM6Je" \
--cacert vault-ssl-cert.chained \
https://active.vault.service.consul:8200/v1/auth/approle/role/cpanel/role-id'
register: 'vault_role_id'
- name: test1
debug:
msg: "{{ vault_role_id.stdout }}"
The output is like this:
TASK [test1] *********************************************************************************************************************************************************************
ok: [localhost] => {
"msg": {
"auth": null,
"data": {
"role_id": "65d02c93-689c-eab1-31ca-9efb1c3e090e"
},
"lease_duration": 0,
"lease_id": "",
"renewable": false,
"request_id": "8bc03205-dcc2-e388-57ff-cdcaef84ef69",
"warnings": null,
"wrap_info": null
}
}
Everything is OK if I am accessing a first-level attribute, like .stdout in the previous example. I need to reach a deeper-level attribute, like vault_role_id.stdout.data.role_id. When I try this, it fails with the following error:
"The task includes an option with an undefined variable. The error was: 'ansible.utils.unsafe_proxy.AnsibleUnsafeText object' has no attribute 'data'\n\n
Do you have a suggestion for how to properly get attribute values from deeper levels of this object hierarchy?
"The task includes an option with an undefined variable. The error was: 'ansible.utils.unsafe_proxy.AnsibleUnsafeText object' has no attribute 'data'\n\n
Yes, because what's happening is that rendering it into msg: with {{ coerces the JSON text into a Python dict, while vault_role_id.stdout itself is still just a string. If you do want a dict, either use msg: "{{ (vault_role_id.stdout | from_json).data.role_id }}", or use set_fact: {vault_role_data: "{{ vault_role_id.stdout }}"} and then vault_role_data will be a dict for the same reason it was coerced by your msg.
You can see the opposite process by prefixing the msg with any characters:
- name: this one is text
debug:
msg: vault_role_id is {{ vault_role_id.stdout }}
- name: this one is coerced
debug:
msg: '{{ vault_role_id.stdout }}'
While this isn't what you asked: you should also add --fail to your curl so it exits with a non-zero return code if the request returns a non-200 response, or use the more Ansible-y approach via the uri: module with the return_content: yes parameter.
Is it possible to parse JSON string inside an airflow template?
I have a HttpSensor which monitors a job via a REST API, but the job id is in the response of the upstream task which has xcom_push marked True.
I would like to do something like the following; however, this code gives the error jinja2.exceptions.UndefinedError: 'json' is undefined:
t1 = SimpleHttpOperator(
http_conn_id="s1",
task_id="job",
endpoint="some_url",
method='POST',
data=json.dumps({ "foo": "bar" }),
xcom_push=True,
dag=dag,
)
t2 = HttpSensor(
http_conn_id="s1",
task_id="finish_job",
endpoint="job/{{ json.loads(ti.xcom_pull(\"job\")).jobId }}",
response_check=lambda response: True if response.json().state == "complete" else False,
poke_interval=5,
dag=dag
)
t2.set_upstream(t1)
You can add a custom Jinja filter to your DAG with the parameter user_defined_filters to parse the json.
a dictionary of filters that will be exposed
in your jinja templates. For example, passing
dict(hello=lambda name: 'Hello %s' % name) to this argument allows
you to {{ 'world' | hello }} in all jinja templates related to
this DAG.
dag = DAG(
...
user_defined_filters={'fromjson': lambda s: json.loads(s)},
)
t1 = SimpleHttpOperator(
task_id='job',
xcom_push=True,
...
)
t2 = HttpSensor(
endpoint='job/{{ (ti.xcom_pull("job") | fromjson)["jobId"] }}',
...
)
However, it may be cleaner to just write your own custom JsonHttpOperator plugin (or add a flag to SimpleHttpOperator) that parses the JSON before returning, so that you can directly reference {{ ti.xcom_pull("job")["jobId"] }} in the template.
class JsonHttpOperator(SimpleHttpOperator):
def execute(self, context):
text = super(JsonHttpOperator, self).execute(context)
return json.loads(text)
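Assuming the DAG from the question (and that json and the operator imports are already in scope), usage would then look roughly like this sketch: t1 becomes a JsonHttpOperator, so the sensor can index straight into the already-parsed XCom value:
t1 = JsonHttpOperator(
    http_conn_id="s1",
    task_id="job",
    endpoint="some_url",
    method="POST",
    data=json.dumps({"foo": "bar"}),
    xcom_push=True,
    dag=dag,
)

t2 = HttpSensor(
    http_conn_id="s1",
    task_id="finish_job",
    endpoint='job/{{ ti.xcom_pull("job")["jobId"] }}',
    response_check=lambda response: response.json()["state"] == "complete",
    poke_interval=5,
    dag=dag,
)
t2.set_upstream(t1)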
Alternatively, it is also possible to make the json module available inside the template by adding it to the DAG's user_defined_macros, as shown below. However, it is probably a better idea to create a plugin like Daniel said.
dag = DAG(
'dagname',
default_args=default_args,
schedule_interval="#once",
user_defined_macros={
'json': json
}
)
then
finish_job = HttpSensor(
task_id="finish_job",
endpoint="kue/job/{{ json.loads(ti.xcom_pull('job'))['jobId'] }}",
response_check=lambda response: True if response.json()['state'] == "complete" else False,
poke_interval=5,
dag=dag
)