Passing xcom of SimpleHttpOperator to DatabricksSubmitRunOperator - airflow

I have airflow DAG which consists of the following steps:
SimpleHttpOperator
DatabricksSubmitRunOperator
The SimpleHttpOperator connects to an API and gets the response. I want to then pass the response to the DatabricksSubmitRunOperator which can then send the response to Databricks.
How can I pull the response from the SimpleHttpOperator and pass it to DatabricksSubmitRunOperator?

Try the below
sho_task = SimpleHttpOperator(
    task_id="sho_task_id",
    method="GET",
    endpoint="/some/endpoint",
    http_conn_id="connection_id",  # Must be configured in the Airflow UI under Connections as an HTTP connection type
    response_filter=lambda response: response.json()["nested"]["property"],
    do_xcom_push=True,  # Important parameter!
    dag=dag
)

dsro_task = DatabricksSubmitRunOperator(
    task_id='databricks_task',
    notebook_task={
        'notebook_path': '/some/notebook/path',
        'base_parameters': {'return_from_prev_task': "{{ ti.xcom_pull('sho_task_id') }}"}
    },
    dag=dag
)
sho_task >> dsro_task
There are two things to note here. First, you need to set do_xcom_push=True in order to save the response to XCom. Second, the SimpleHttpOperator stores the response as text (a string) by default. If you need a particular nested property, either set response_filter=lambda response: response.json()["nested"]["property"], or set response_filter=lambda response: response.json() and then use {{ ti.xcom_pull('sho_task_id')["nested"]["property"] }} in the DatabricksSubmitRunOperator.
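For completeness, a minimal sketch of that second variant (same placeholder task ids, endpoint, and property names as above):
sho_task = SimpleHttpOperator(
    task_id="sho_task_id",
    method="GET",
    endpoint="/some/endpoint",
    http_conn_id="connection_id",
    response_filter=lambda response: response.json(),  # push the whole parsed body to XCom
    do_xcom_push=True,
    dag=dag
)

dsro_task = DatabricksSubmitRunOperator(
    task_id='databricks_task',
    notebook_task={
        'notebook_path': '/some/notebook/path',
        # pull the nested property at render time instead of in the response_filter
        'base_parameters': {'return_from_prev_task': "{{ ti.xcom_pull('sho_task_id')['nested']['property'] }}"}
    },
    dag=dag
)

sho_task >> dsro_task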

Related

Accessing task_instance or ti via simpleHttpOperator to do an xcom push

TLDR
In the Python callable for a SimpleHttpOperator response function, I am trying to push an XCom that combines information from two sources (a hash of the filename/path and an object lookup from a DB) to a specified key.
Longer Tale
I have a FileSensor written which grabs all new files and passes them to MultiDagRun to process the (scientific) information in the files in parallel as XCom. Works great. The SimpleHttpOperator POSTs filepath info to a submission API and receives back a task_id, which it must then use to read a response from another (slow-running) API to get the result. This I all have working fine: files get scanned, multiple DAGs are launched to process them, and objects are returned.
But... I cannot puzzle out how to push the result to an XCom inside the Python response function for the SimpleHttpOperator.
My Google-, SO-, and Reddit-fu has failed me here (and it seems overkill to use the PythonOperator, though that's my next stop). I notice a lot of people asking similar questions, though.
How do you use context or ti or task_instance or context['task_instance'] with the response function? (I cannot use the "Returned Value" XCom, as I need to distinguish the XCom keys for parallel processing, AFAIK.) By default I have context set to True in the default_args.
I'm sure I am missing something simple here, but I'm stumped as to what it is (note: I did try **kwargs and ti = kwargs['ti'] below as well before hitting SO).
def _handler_object_result(response, file):
    # Note: related to the API I am calling, not Airflow-internal task ids
    header_result = response.json()
    task_id = header_result["task"]["id"]
    api = "https://redacted.com/api/task/result/{task_id}".format(task_id=task_id)
    resp = requests.get(api, verify=False).json()
    data = json.loads(resp["data"])
    file_object = json.dumps(data["OBJECT"])
    file_hash = hash(file)
    # This is the part that is not working, as I am unsure how
    # to access the task instance to do the xcom_push
    ti.xcom_push(key=file_hash, value=file_object)
    if ti.xcom_pull(key=file_hash):
        return True
    else:
        return False
and the Operator:
object_result = SimpleHttpOperator(
    task_id="object_result",
    method='POST',
    data=json.dumps({"file": "{{ dag_run.conf['file'] }}", "keyword": "object"}),
    http_conn_id="coma_api",
    endpoint="/api/v1/file/describe",
    headers={"Content-Type": "application/json"},
    extra_options={"verify": False},
    response_check=lambda response: _handler_object_result(response, "{{ dag_run.conf['file'] }}"),
    do_xcom_push=False,
    dag=dag,
)
I was really expecting the task_instance object to be available in some fashion, either by default or via configuration, but each variation that has worked elsewhere (FileSensor, PythonOperator, etc.) hasn't worked here, and I've been unable to Google the magic words to make it accessible.
You can try using the get_current_context() function in your response_check function:
from airflow.operators.python import get_current_context

def _handler_object_result(response, file):
    # Note: related to the API I am calling, not Airflow-internal task ids
    header_result = response.json()
    task_id = header_result["task"]["id"]
    api = "https://redacted.com/api/task/result/{task_id}".format(task_id=task_id)
    resp = requests.get(api, verify=False).json()
    data = json.loads(resp["data"])
    file_object = json.dumps(data["OBJECT"])
    file_hash = hash(file)
    ti = get_current_context()["ti"]  # <- Try this
    ti.xcom_push(key=file_hash, value=file_object)
    if ti.xcom_pull(key=file_hash):
        return True
    else:
        return False
That function is a nice way of accessing the task's execution context when context isn't explicitly handy, or when you don't want to pass context attributes around just to reach them deep in your logic stack.
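As a related sketch (assuming Airflow 2.x, where get_current_context is available): the same context also exposes dag_run, so the file value could be read straight from dag_run.conf instead of being passed in as a rendered string:
from airflow.operators.python import get_current_context

def _handler_object_result(response):
    context = get_current_context()             # full execution context dict
    ti = context["ti"]                          # task instance, for xcom_push/xcom_pull
    file = context["dag_run"].conf.get("file")  # the file path the DAG run was triggered with
    ti.xcom_push(key=str(hash(file)), value=response.json())
    return True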

Using Airflow's HttpSensor with a unique X-Request-ID header on each request

I want to monitor an endpoint via an HTTP GET request until I see a "SUCCESS".
The endpoint requires a unique X-Request-ID header on each request. If I do not include this field, or if I send the same UUID twice, I get a 400 Bad Request back. I tried:
monitor_job = HttpSensor(
    task_id='monitor_job',
    http_conn_id='',
    endpoint='http://some_endpoint',
    request_params={},
    response_check=lambda response: 'SUCCESS' in response.text,
    poke_interval=5,
    dag=dag,
    headers={
        'X-Request-ID': str(uuid.uuid4())  # returns a random uuid
    }
)
I am seeing that the first GET request works fine; it waits 5 seconds, but the next GET request fails because it sends the same GUID. I need it to send a new value in the X-Request-ID header on each request.
Is this possible with HttpSensor or otherwise?
The best alternative approach I can think of would be to move the GET request into a loop in Python code (probably using the requests library) and use a PythonSensor. That is more code to write, and it feels like a workaround.
I am not currently using http_conn_id, simply to match the style of related code in the codebase. I can use it if that would help.
I'm running on Airflow v2.2.2
If you run the sensor in poke mode, the task is created once and put to sleep between pokes. In this case, uuid.uuid4() is called once and you get the same UUID for all the queries. You have a few options:
1. You can change the mode to reschedule:
monitor_job = HttpSensor(
    task_id='monitor_job',
    http_conn_id='',
    endpoint='http://some_endpoint',
    request_params={},
    response_check=lambda response: 'SUCCESS' in response.text,
    poke_interval=5,
    dag=dag,
    headers={
        'X-Request-ID': str(uuid.uuid4())  # returns a random uuid
    },
    mode="reschedule",
)
2. You can override the sensor's poke method to change the headers value:
import uuid

from airflow.providers.http.sensors.http import HttpSensor
from airflow.utils.context import Context

class MyHttpSensor(HttpSensor):
    def poke(self, context: Context) -> bool:
        # regenerate the header before every poke so each request carries a fresh UUID
        self.headers = {
            'X-Request-ID': str(uuid.uuid4())
        }
        return super().poke(context)

monitor_job = MyHttpSensor(
    task_id='monitor_job',
    http_conn_id='',
    endpoint='http://some_endpoint',
    request_params={},
    response_check=lambda response: 'SUCCESS' in response.text,
    poke_interval=5,
    dag=dag,
)
3. You can also override the sensor and call render_template_fields on each poke, then provide the X-Request-ID as a Jinja template:
class MyHttpSensor(HttpSensor):
    def poke(self, context: Context) -> bool:
        # re-render the templated fields (including headers) before every poke
        self.render_template_fields(context)
        return super().poke(context)

monitor_job = MyHttpSensor(
    task_id='monitor_job',
    http_conn_id='',
    endpoint='http://some_endpoint',
    request_params={},
    response_check=lambda response: 'SUCCESS' in response.text,
    poke_interval=5,
    dag=dag,
    headers={
        # macros.uuid is available in the Jinja template context
        'X-Request-ID': "{{ macros.uuid.uuid4() }}"
    },
)
I recommend the second option, but if your API takes a long time to reach "SUCCESS", then the first one is best with poke_interval >= 60, in order to release the worker slot and let the worker run other tasks.
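The approaches can also be combined; a minimal sketch (assuming the MyHttpSensor subclass from the second option) that regenerates the UUID on each poke and releases the worker slot between pokes:
monitor_job = MyHttpSensor(
    task_id='monitor_job',
    http_conn_id='',
    endpoint='http://some_endpoint',
    request_params={},
    response_check=lambda response: 'SUCCESS' in response.text,
    mode='reschedule',   # worker slot is released between pokes
    poke_interval=60,
    dag=dag,
)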

Schema not being passed into Airflow MySqlHook

I'm using the MySqlHook to show the results of my SQL query in the Airflow logs. All is working, except that I want to limit the results to a particular schema. Even though I'm passing in that schema, MySqlHook doesn't seem to recognize it.
Here's my function that uses the hook:
def func(mysql_conn_id, sql, schema):
    """Print results of sql query """
    print("schema:", schema)
    hook = MySqlHook(mysql_conn_id=mysql_conn_id, schema=schema)
    df = hook.get_pandas_df(sql=sql)
    print("\n" + df.to_string())
The schema that I pass in does come up in the logs from my print statement.
But when I look at the results of the second print statement (the df.to_string()), the connection that shows up in Airflow has no schema.
INFO - Using connection to: id: dbjobs_mysql. Host: kb-qa.local, Port: None, Schema: , Login: my_user, Password: ***, extra: {}
And the query is running for more than just the given schema.
Looking at the source code, it seems like what I did is what it's expecting:
class MySqlHook(DbApiHook):
    conn_name_attr = 'mysql_conn_id'
    default_conn_name = 'mysql_default'
    supports_autocommit = True

    def __init__(self, *args, **kwargs):
        super(MySqlHook, self).__init__(*args, **kwargs)
        self.schema = kwargs.pop("schema", None)
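One way to sanity-check which database the hook's session actually lands on (a minimal sketch; the provider-package import path is assumed, older installs use airflow.hooks.mysql_hook) is to ask MySQL itself:
from airflow.providers.mysql.hooks.mysql import MySqlHook

def check_current_schema(mysql_conn_id, schema):
    """Print the database the hook's session is actually using."""
    hook = MySqlHook(mysql_conn_id=mysql_conn_id, schema=schema)
    # SELECT DATABASE() reports the schema of the live session,
    # independent of what the connection log line prints.
    print(hook.get_first("SELECT DATABASE()"))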

What happens to Airflow Sensor running on Celery worker if worker itself goes down

Use Case:
The DAG has a REST API task defined (using RestOperator) which hits an application API and starts execution of a process/task that performs some business function. Its execution status is monitored with an Airflow sensor, which polls for task-completion status via an API call.
Question:
If a Celery node goes down, what will happen to a sensor that was running on that node?
If the sensor dies with the worker, how can sensor execution be propagated to another node (to avoid loss of functionality)?
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.providers.http.sensors.http import HttpSensor

default_args = {
    'start_date': datetime.today().strftime('%Y-%m-%d'),
    'end_date': None
}

dag = DAG(
    'Rest_Monitor',
    default_args=default_args,
    schedule_interval=None,
    catchup=False)

HttpOperator = SimpleHttpOperator(
    task_id='RestOperator',
    method='POST',
    endpoint='https://localhost:8080/api/task/execute',
    headers={
        "Content-Type": "application/json"
    },
    data={
        "taskId": "1234",
    },
    dag=dag)

def resp_check(response):
    # placeholder check: return True when the API reports Status == Success
    return "True if Status = Success"

HttpSensor = HttpSensor(
    dag=dag,
    task_id='http_sensor_head_method',
    http_conn_id='http_default',
    endpoint='https://localhost:8080/api/task/1234/status',
    request_params={},
    method='HEAD',
    response_check=resp_check,
    timeout=5,
    poke_interval=1)

HttpOperator >> HttpSensor
This logic is managed by Airflow core and its coordination of workers via the scheduler. Basically, the scheduler polls for, or is sent, task state and log data from the worker node at regular intervals.
Essentially, the scheduler tracks data coming from the worker both about the worker itself and about the task the worker is supposed to be running. If the worker becomes non-responsive, or not enough data is being returned about the task, the scheduler will begin its shutdown-and-fail-the-try logic.
If there are retries left for the task, a new copy of the task will be added to the queue for any live worker in your cluster to pick up.
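In practice, the main knob you control from the DAG is retries on the sensor itself; a minimal sketch (parameter values are illustrative) on top of the sensor from the question:
status_sensor = HttpSensor(
    dag=dag,
    task_id='http_sensor_head_method',
    http_conn_id='http_default',
    endpoint='https://localhost:8080/api/task/1234/status',
    request_params={},
    method='HEAD',
    response_check=resp_check,
    poke_interval=60,
    timeout=600,
    mode='reschedule',                 # frees the worker slot between pokes
    retries=3,                         # a lost/failed try is re-queued for any live worker
    retry_delay=timedelta(minutes=5),
)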

How to read response body from SimpleHttpOperator

I am new to Airflow. I have written code to submit an HTTP POST using SimpleHttpOperator. In this case the POST request returns a token; I need help on how to read the response body.
get_templates = SimpleHttpOperator(
    task_id='get_templates',
    method='POST',
    endpoint='myendpoint',
    http_conn_id='myconnection',
    trigger_rule="all_done",
    headers={"Content-Type": "application/json"},
    xcom_push=True,
    dag=dag
)
Looks like the POST was successful. Now my question is how to read the response body.
This is the output of the code; there are no errors:
[2019-05-06 20:08:40,518] {http_hook.py:128} INFO - Sending 'POST' to url: https://auth.reltio.com/oauth//token?username=perf_api_user&password=perf_api_user!&grant_type=password
/usr/lib/python2.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
The execute function of the SimpleHttpOperator returns the response.text (source). By looking at the Airflow documentation for XCom, you can see that:
... if a task returns a value (either from its Operator’s execute() method, or from a PythonOperator’s python_callable function), then an XCom containing that value is automatically pushed.
meaning the response body is pushed to the XCom and is available for downstream tasks to access.
For example, you could have a PythonOperator fetching it via:
response_body = context['task_instance'].xcom_pull(task_ids='get_templates')
Additionally, if you just want to log the response instead of processing it, you can set the log_response argument of the SimpleHttpOperator constructor to True.
If you use Airflow 2, the xcom_push argument is not available in SimpleHttpOperator. In this case, let's say you call /my-url in a task call_api; to get the response and pass it to another task, you need to read from the XCom return_value that is automatically set by the SimpleHttpOperator:
import json

from airflow.operators.python import PythonOperator
from airflow.providers.http.operators.http import SimpleHttpOperator

call_api = SimpleHttpOperator(
    task_id='call_api',
    http_conn_id=api_connection,
    method='GET',
    endpoint='/my-url',
    response_filter=lambda response: json.loads(response.text),
    log_response=True,  # Shows the response in the task log
    dag=dag
)

def _read_response(ti):
    val = ti.xcom_pull(
        task_ids='call_api',
        key='return_value'
    )
    print(val)

read_response = PythonOperator(
    task_id='read_response',
    python_callable=_read_response,
    dag=dag
)
You can also specify dag_id in ti.xcom_pull to select which DAG to pull from.
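For example (a minimal sketch; 'my_dag' is a placeholder dag id):
val = ti.xcom_pull(
    dag_id='my_dag',        # defaults to the current DAG if omitted
    task_ids='call_api',
    key='return_value'
)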