How to read response body from SimpleHttpOperator - airflow

I am new to Airflow. I have written code to submit an HTTP POST using SimpleHttpOperator. In this case the POST request returns a token; I need help with how to read the response body.
get_templates = SimpleHttpOperator(
    task_id='get_templates',
    method='POST',
    endpoint='myendpoint',
    http_conn_id='myconnection',
    trigger_rule="all_done",
    headers={"Content-Type": "application/json"},
    xcom_push=True,
    dag=dag
)
Looks like the POST was successful. Now my question is how to read the response body.
This is the output of the code; there are no errors:
[2019-05-06 20:08:40,518] {http_hook.py:128} INFO - Sending 'POST' to url: https://auth.reltio.com/oauth//token?username=perf_api_user&password=perf_api_user!&grant_type=password
/usr/lib/python2.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)

The execute function of the SimpleHttpOperator returns the response.text (source). By looking at the Airflow documentation for XCom, you can see that:
... if a task returns a value (either from its Operator’s execute() method, or from a PythonOperator’s python_callable function), then an XCom containing that value is automatically pushed.
meaning the response body is pushed to the XCom and is available for downstream tasks to access.
For example, you could have a PythonOperator fetching it via:
response_body = context['task_instance'].xcom_pull(task_ids='get_templates')
Additionally, if you just want to log the response instead of process it, you can just set the log_response of the SimpleHttpOperator constructor to True.
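Since the XCom holds the raw response.text, a downstream callable typically parses it as JSON to extract the token. A minimal sketch, purely illustrative; the "access_token" field name is an assumption about the auth API's payload, not something confirmed by the question:

```python
import json

def extract_token(response_body):
    """Parse the raw XCom string (response.text) into JSON and pull the token.

    The "access_token" field name is an assumption about the auth API's payload.
    """
    payload = json.loads(response_body)
    return payload["access_token"]

# Inside a PythonOperator callable this would look like:
#   body = context['task_instance'].xcom_pull(task_ids='get_templates')
#   token = extract_token(body)
sample_body = '{"access_token": "abc123", "token_type": "bearer", "expires_in": 3600}'
token = extract_token(sample_body)
```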

If you use Airflow 2, the xcom_push argument is no longer available in SimpleHttpOperator. In this case, let's say you call /my-url in a task called call_api; to get the response and pass it to another task, you need to read from the return_value XCom that is automatically written by the SimpleHttpOperator:
import json

from airflow.operators.python import PythonOperator
from airflow.providers.http.operators.http import SimpleHttpOperator

call_api = SimpleHttpOperator(
    task_id='call_api',
    http_conn_id=api_connection,
    method='GET',
    endpoint='/my-url',
    response_filter=lambda response: json.loads(response.text),
    log_response=True,  # Shows the response in the task log
    dag=dag
)

def _read_response(ti):
    val = ti.xcom_pull(
        task_ids='call_api',
        key='return_value'
    )
    print(val)

read_response = PythonOperator(
    task_id='read_response',
    python_callable=_read_response,
    dag=dag
)
You can also specify dag_id in ti.xcom_pull to select which DAG to pull from.
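To see what the dag_id argument buys you, here is a toy stand-in for the TaskInstance XCom lookup. This is not Airflow code, just an illustration of how XCom values are keyed by (dag_id, task_id, key) and how the lookup defaults to the calling task's own DAG:

```python
class FakeTaskInstance:
    """Toy stand-in for Airflow's TaskInstance, for illustration only.
    XCom values are keyed by (dag_id, task_id, key)."""

    def __init__(self, current_dag_id, store):
        self.current_dag_id = current_dag_id
        self._store = store  # {(dag_id, task_id, key): value}

    def xcom_pull(self, task_ids, key='return_value', dag_id=None):
        # Without dag_id, the lookup defaults to the calling task's own DAG.
        dag_id = dag_id or self.current_dag_id
        return self._store.get((dag_id, task_ids, key))

store = {
    ('my_dag', 'call_api', 'return_value'): {'status': 'ok'},
    ('other_dag', 'call_api', 'return_value'): {'status': 'stale'},
}
ti = FakeTaskInstance('my_dag', store)
```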

Related

Accessing task_instance or ti via simpleHttpOperator to do an xcom push

TLDR
In the Python callable for a SimpleHttpOperator response function, I am trying to push an XCom to a specified key that combines information from two sources (a hash of the filename/path and an object lookup from a DB).
Longer Tale
I have a FileSensor written which grabs all new files and passes them to a MultiDagRun to parallel-process the (scientific) information in the files as XCom. Works great. The SimpleHttpOperator POSTs filepath info to a submission API and receives back a task_id, which it must then use to read the result from another (slow-running) API. This all works fine: files get scanned, it launches multiple DAGs to process them, and returns objects.
But... I cannot puzzle out how to push the result to an xcom inside the python response function for the simpleHttpOperator.
My Google-, SO-, and Reddit-fu has failed me here (and it seems overkill to use the PythonOperator, though that's my next stop). I notice a lot of people asking similar questions, though.
How do you use context or ti or task_instance or context['task_instance'] with the response function? (I cannot use the "return value" XCom, as I need to distinguish the XCom keys for parallel processing, afaik.) As a default, I have context set to True in the default_args.
I'm sure I am missing something simple here, but I'm stumped as to what it is (note: I did try **kwargs and ti = kwargs['ti'] below as well before hitting SO)...
def _handler_object_result(response, file):
    # Note: related to the API I am calling, not Airflow internal task ids
    header_result = response.json()
    task_id = header_result["task"]["id"]
    api = "https://redacted.com/api/task/result/{task_id}".format(task_id=task_id)
    resp = requests.get(api, verify=False).json()
    data = json.loads(resp["data"])
    file_object = json.dumps(data["OBJECT"])
    file_hash = hash(file)
    # This is the part that is not working, as I am unsure how
    # to access the task instance to do the xcom_push
    ti.xcom_push(key=file_hash, value=file_object)
    if ti.xcom_pull(key=file_hash):
        return True
    else:
        return False
and the Operator:
object_result = SimpleHttpOperator(
    task_id="object_result",
    method='POST',
    data=json.dumps({"file": "{{ dag_run.conf['file'] }}", "keyword": "object"}),
    http_conn_id="coma_api",
    endpoint="/api/v1/file/describe",
    headers={"Content-Type": "application/json"},
    extra_options={"verify": False},
    response_check=lambda response: _handler_object_result(response, "{{ dag_run.conf['file'] }}"),
    do_xcom_push=False,
    dag=dag,
)
I was really expecting the task_instance object to be available in some fashion, either by default or via configuration, but each variation that has worked elsewhere (FileSensor, PythonOperator, etc.) hasn't worked here, and I've been unable to google the magic words to make it accessible.
You can try using the get_current_context() function in your response_check function:
from airflow.operators.python import get_current_context
def _handler_object_result(response, file):
    # Note: related to the API I am calling, not Airflow internal task ids
    header_result = response.json()
    task_id = header_result["task"]["id"]
    api = "https://redacted.com/api/task/result/{task_id}".format(task_id=task_id)
    resp = requests.get(api, verify=False).json()
    data = json.loads(resp["data"])
    file_object = json.dumps(data["OBJECT"])
    file_hash = hash(file)
    ti = get_current_context()["ti"]  # <- Try this
    ti.xcom_push(key=file_hash, value=file_object)
    if ti.xcom_pull(key=file_hash):
        return True
    else:
        return False
That function is a nice way of accessing the task's execution context when context isn't explicitly handy, or when you don't want to pass context attributes around just to access them deep in your logic stack.

Using Airflow's HttpSensor with a unique X-Request-ID header on each request

I want to monitor an endpoint via a HTTP GET Request until I see a "SUCCESS".
The endpoint requires a unique X-Request-ID header on each request. If I do not include this field, or if I send the same UUID twice, I get a 400 Bad Request back. I tried:
monitor_job = HttpSensor(
    task_id='monitor_job',
    http_conn_id='',
    endpoint='http://some_endpoint',
    request_params={},
    response_check=lambda response: 'SUCCESS' in response.text,
    poke_interval=5,
    dag=dag,
    headers={
        'X-Request-ID': str(uuid.uuid4())  # returns a random uuid
    }
)
I am seeing that the first GET request works fine, it waits 5 seconds, but the next GET request fails, as it tries to send a request with the same GUID. I'd need it to send a new value in the X-Request-ID header on each request.
Is this possible with HttpSensor or otherwise?
The best alternative approach I can think of would be to move the GET request into a loop in python code (probably using the requests library), and use a PythonSensor. This is more code to write and it feels like a workaround.
I am not currently using http_conn_id just to match style with related code in the codebase. I can use it if it would help.
I'm running on Airflow v2.2.2
If you run the sensor in poke mode, the task is created once and put to sleep between pokes. In this case, uuid.uuid4() is called once and you will have the same uuid for all the requests.
You can change the mode to reschedule:
monitor_job = HttpSensor(
    task_id='monitor_job',
    http_conn_id='',
    endpoint='http://some_endpoint',
    request_params={},
    response_check=lambda response: 'SUCCESS' in response.text,
    poke_interval=5,
    dag=dag,
    headers={
        'X-Request-ID': str(uuid.uuid4())  # returns a random uuid
    },
    mode="reschedule",
)
You can override the sensor code to regenerate the header value on each poke:
import uuid

from airflow.providers.http.sensors.http import HttpSensor
from airflow.utils.context import Context

class MyHttpSensor(HttpSensor):
    def poke(self, context: Context) -> bool:
        self.headers = {
            'X-Request-ID': str(uuid.uuid4())
        }
        return super().poke(context)

monitor_job = MyHttpSensor(
    task_id='monitor_job',
    http_conn_id='',
    endpoint='http://some_endpoint',
    request_params={},
    response_check=lambda response: 'SUCCESS' in response.text,
    poke_interval=5,
    dag=dag,
)
You can also override the sensor and call the method render_template_fields on each poke, then provide the X-Request-ID as a Jinja template:
class MyHttpSensor(HttpSensor):
    def poke(self, context: Context) -> bool:
        # Re-render templated fields so the header is re-evaluated on every poke
        self.render_template_fields(context)
        return super().poke(context)

monitor_job = MyHttpSensor(
    task_id='monitor_job',
    http_conn_id='',
    endpoint='http://some_endpoint',
    request_params={},
    response_check=lambda response: 'SUCCESS' in response.text,
    poke_interval=5,
    dag=dag,
    headers={
        # Note: uuid is not a built-in template variable; you need to expose
        # it, e.g. via the DAG's user_defined_macros={"uuid": uuid}
        'X-Request-ID': "{{ uuid.uuid4() }}"
    },
)
I recommend the second option, but if your API takes a long time to reach "SUCCESS", then the first one is best with poke_interval >= 60, in order to release the worker slot and let the worker run other tasks.
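The root cause is visible without Airflow at all: the headers dict is built once, when the operator is constructed at DAG-parse time, so the uuid is fixed from then on. A minimal illustration:

```python
import uuid

# The operator's headers are evaluated once, at construction time:
headers = {'X-Request-ID': str(uuid.uuid4())}

# Every subsequent poke reuses the same dict, hence the same request id:
first_poke = headers['X-Request-ID']
second_poke = headers['X-Request-ID']

# Getting a fresh id per poke requires rebuilding the header each time,
# which is what the subclass approaches above do:
fresh = {'X-Request-ID': str(uuid.uuid4())}
```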

Calling multiple cloud functions inside Google Composer

I am writing a DAG to call two functions inside my workflow in Google Composer. I have created a class for it by subclassing SimpleHttpOperator:
class cfSFTP2GCSOp(SimpleHttpOperator):
    def execute(self, context):
        http = HttpHook(self.method, http_conn_id=self.http_conn_id)
        self.log.info("Calling HTTP method")
        target_audience = 'https://dw-etl-transactor-unzip-files-nowvpwp6oq-uc.a.run.app'
        request = google.auth.transport.requests.Request()
        idt = id_token.fetch_id_token(request, target_audience)
        self.headers = {'Authorization': "Bearer " + idt}
        response = http.run(self.endpoint,
                            self.data,
                            self.headers,
                            self.extra_options)
        self.log.info(response)
        # Compare the status code, not the string repr of the Response object
        if response.status_code == 200:
            return True
        else:
            return False
which I use inside the task to call the function:
gcp_cf_dw_etl_transactor_unzip_files = cfSFTP2GCSOp(
    task_id='gcp_cf_dw_etl_unzip_files',
    method='POST',
    http_conn_id='gcp_cf_dw_etl_unzip_files',
    data={},
    endpoint='/',
    headers={},
    response_check=lambda response: response.status_code == 200,
    dag=dag,
)
I currently use one task per Cloud Function I need to call, but what happens if I need to call many Cloud Functions? Is it possible to call all of them inside the same task, or do I need to continue doing what I am currently doing?
Thanks in advance!

Trace failed fastapi requests with opencensus

I'm using opencensus-python to track requests to my python fastapi application running in production, and exporting the information to Azure AppInsights using the opencensus exporters. I followed the Azure Monitor docs and was helped out by this issue post which puts all the necessary bits in a useful middleware class.
Only to realize later on that requests that caused the app to crash, i.e. unhandled 5xx type errors, would never be tracked, since the call to execute the logic for the request fails before any tracing happens. The Azure Monitor docs only talk about tracking exceptions through the logs, but this is separate from the tracing of requests, unless I'm missing something. I certainly wouldn't want to lose out on failed requests, these are super important to track! I'm accustomed to using the "Failures" tab in app insights to monitor any failing requests.
I figured the way to track these requests is to explicitly handle any internal exceptions using try/catch and export the trace, manually setting the result code to 500. But I found it really odd that there seems to be no documentation of this, on opencensus or Azure.
The problem I have now is: this middleware function is expected to pass back a "response" object, which fastapi then uses as a callable object down the line (not sure why) - but in the case where I caught an exception in the underlying processing (i.e. at await call_next(request)) I don't have any response to return. I tried returning None but this just causes further exceptions down the line (None is not callable).
Here is my version of the middleware class. It's very similar to the issue post I linked, but I'm try/catching around await call_next(request) rather than just letting it fail unhandled. Scroll down to the final 5 lines of code to see that.
import logging

from fastapi import Request
from opencensus.trace import (
    attributes_helper,
    execution_context,
    samplers,
)
from opencensus.ext.azure.trace_exporter import AzureExporter
from opencensus.trace import span as span_module
from opencensus.trace import tracer as tracer_module
from opencensus.trace import utils
from opencensus.trace.propagation import trace_context_http_header_format
from opencensus.ext.azure.log_exporter import AzureLogHandler
from starlette.types import ASGIApp

from src.settings import settings

HTTP_HOST = attributes_helper.COMMON_ATTRIBUTES["HTTP_HOST"]
HTTP_METHOD = attributes_helper.COMMON_ATTRIBUTES["HTTP_METHOD"]
HTTP_PATH = attributes_helper.COMMON_ATTRIBUTES["HTTP_PATH"]
HTTP_ROUTE = attributes_helper.COMMON_ATTRIBUTES["HTTP_ROUTE"]
HTTP_URL = attributes_helper.COMMON_ATTRIBUTES["HTTP_URL"]
HTTP_STATUS_CODE = attributes_helper.COMMON_ATTRIBUTES["HTTP_STATUS_CODE"]

module_logger = logging.getLogger(__name__)
module_logger.addHandler(AzureLogHandler(
    connection_string=settings.appinsights_connection_string
))

class AppInsightsMiddleware:
    """
    Middleware class to handle tracing of fastapi requests and exporting the data to AppInsights.
    Most of the code here is copied from a github issue: https://github.com/census-instrumentation/opencensus-python/issues/1020
    """

    def __init__(
        self,
        app: ASGIApp,
        excludelist_paths=None,
        excludelist_hostnames=None,
        sampler=None,
        exporter=None,
        propagator=None,
    ) -> None:
        self.app = app
        self.excludelist_paths = excludelist_paths
        self.excludelist_hostnames = excludelist_hostnames
        self.sampler = sampler or samplers.AlwaysOnSampler()
        self.propagator = (
            propagator or trace_context_http_header_format.TraceContextPropagator()
        )
        self.exporter = exporter or AzureExporter(
            connection_string=settings.appinsights_connection_string
        )

    async def __call__(self, request: Request, call_next):
        # Do not trace if the url is in the exclude list
        if utils.disable_tracing_url(str(request.url), self.excludelist_paths):
            return await call_next(request)
        try:
            span_context = self.propagator.from_headers(request.headers)
            tracer = tracer_module.Tracer(
                span_context=span_context,
                sampler=self.sampler,
                exporter=self.exporter,
                propagator=self.propagator,
            )
        except Exception:
            module_logger.error("Failed to trace request", exc_info=True)
            return await call_next(request)
        try:
            span = tracer.start_span()
            span.span_kind = span_module.SpanKind.SERVER
            span.name = "[{}]{}".format(request.method, request.url)
            tracer.add_attribute_to_current_span(HTTP_HOST, request.url.hostname)
            tracer.add_attribute_to_current_span(HTTP_METHOD, request.method)
            tracer.add_attribute_to_current_span(HTTP_PATH, request.url.path)
            tracer.add_attribute_to_current_span(HTTP_URL, str(request.url))
            execution_context.set_opencensus_attr(
                "excludelist_hostnames", self.excludelist_hostnames
            )
        except Exception:  # pragma: NO COVER
            module_logger.error("Failed to trace request", exc_info=True)
        try:
            response = await call_next(request)
            tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, response.status_code)
            tracer.end_span()
            return response
        # Explicitly handle any internal exception here, and set status code to 500
        except Exception as exception:
            module_logger.exception(exception)
            tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, 500)
            tracer.end_span()
            return None
I then register this middleware class in main.py like so:
app.middleware("http")(AppInsightsMiddleware(app, sampler=samplers.AlwaysOnSampler()))
Explicitly handle any exception that may occur while processing the API request. That allows you to finish tracing the request, setting the status code to 500. You can then re-raise the exception so the application still surfaces the failure as expected.
try:
    response = await call_next(request)
    tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, response.status_code)
    tracer.end_span()
    return response
# Explicitly handle any internal exception here, and set status code to 500
except Exception as exception:
    module_logger.exception(exception)
    tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, 500)
    tracer.end_span()
    raise exception
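The crucial part is the re-raise: the 500 is recorded on the span before the exception propagates, so the framework's own error handling still runs and no callable-shaped response object is needed. A framework-free sketch of this record-then-re-raise pattern (the traced helper and recorded_codes list are illustrative names, not part of opencensus or FastAPI):

```python
def traced(call, recorded_codes):
    """Record a status code for the trace, then either return the response
    or re-raise so the caller still sees the failure. Illustrative only."""
    try:
        response = call()
        recorded_codes.append(200)
        return response
    except Exception:
        recorded_codes.append(500)
        raise  # a bare raise preserves the original exception and traceback

codes = []
try:
    traced(lambda: 1 / 0, codes)  # the failure is recorded, then propagates
except ZeroDivisionError:
    pass
```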

Passing xcom of SimpleHttpOperator to DatabricksSubmitRunOperator

I have airflow DAG which consists of the following steps:
SimpleHttpOperator
DatabricksSubmitRunOperator
The SimpleHttpOperator connects to an API and gets the response. I want to then pass the response to the DatabricksSubmitRunOperator which can then send the response to Databricks.
How can I pull the response from the SimpleHttpOperator and pass it to DatabricksSubmitRunOperator?
Try the below
sho_task = SimpleHttpOperator(
    task_id="sho_task_id",
    method="GET",
    endpoint="/some/endpoint",
    http_conn_id="connection_id",  # Must be configured in the Airflow UI under Connections as an HTTP connection type
    response_filter=lambda response: response.json()["nested"]["property"],
    do_xcom_push=True,  # Important parameter!
    dag=dag
)

dsro_task = DatabricksSubmitRunOperator(
    task_id='databricks_task',
    notebook_task={
        'notebook_path': '/some/notebook/path',
        'base_parameters': {'return_from_prev_task': "{{ ti.xcom_pull('sho_task_id') }}"}
    },
    dag=dag
)

sho_task >> dsro_task
There are 2 things to note here. First, you need to set do_xcom_push=True in order to save the response to XCom. Second, the SimpleHttpOperator stores the response as text (a string) by default. If you need to access a certain nested property, set response_filter=lambda response: response.json()["nested"]["property"], or set response_filter=lambda response: response.json() and then use {{ ti.xcom_pull('sho_task_id')["nested"]["property"] }} in the DatabricksSubmitRunOperator.
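The effect of response_filter can be checked without Airflow by applying the same lambda to a stub object that mimics the relevant bits of requests.Response (the stub class and sample payload below are illustrative, not real Airflow or requests code):

```python
import json

class StubResponse:
    """Minimal stand-in for requests.Response, for illustration only."""
    def __init__(self, text):
        self.text = text

    def json(self):
        return json.loads(self.text)

# The same lambda that is passed as response_filter above:
response_filter = lambda response: response.json()["nested"]["property"]

resp = StubResponse('{"nested": {"property": "value-for-databricks"}}')
filtered = response_filter(resp)  # this is what lands in XCom for the Databricks task
```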
