Accessing task_instance or ti via SimpleHttpOperator to do an xcom push - airflow

TLDR
In the Python callable for a SimpleHttpOperator response function, I am trying to push an XCom that combines information from two sources to a specified key (a hash of the filename/path and an object lookup from a DB).
Longer Tale
I have a FileSensor written which grabs all new files and passes them to MultiDagRun to parallel-process the (scientific) information in the files as XCom. Works great. The SimpleHttpOperator POSTs the filepath info to a submission API and receives back a task_id, which it must then use to read the response from another (slow-running) API to get the result. I have all of this working fine: files get scanned, multiple DAGs launch to process them, and objects are returned.
But... I cannot puzzle out how to push the result to an XCom inside the Python response function for the SimpleHttpOperator.
My Google-, SO-, and Reddit-fu has failed me here (and it seems overkill to use the PythonOperator, though that's my next stop). I notice a lot of people asking similar questions, though.
How do you use context, ti, task_instance, or context['task_instance'] with the response function? (I cannot use the "return value" XCom, as I need to distinguish the XCom keys because of the parallel processing, afaik.) In the default_args I have context set to True.
I'm sure I am missing something simple here, but I'm stumped as to what it is (note: I did try **kwargs and ti = kwargs['ti'] below as well before hitting SO).
def _handler_object_result(response, file):
    # Note: related to api I am calling not Airflow internal task ids
    header_result = response.json()
    task_id = header_result["task"]["id"]
    api = "https://redacted.com/api/task/result/{task_id}".format(task_id=task_id)
    resp = requests.get(api, verify=False).json()
    data = json.loads(resp["data"])
    file_object = json.dumps(data["OBJECT"])
    file_hash = hash(file)
    # This is the part that is not working as I am unsure how
    # to access the task instance to do the xcom_push
    ti.xcom_push(key=file_hash, value=file_object)
    if ti.xcom_pull(key=file_hash):
        return True
    else:
        return False
and the Operator:
object_result = SimpleHttpOperator(
    task_id="object_result",
    method='POST',
    data=json.dumps({"file": "{{ dag_run.conf['file'] }}", "keyword": "object"}),
    http_conn_id="coma_api",
    endpoint="/api/v1/file/describe",
    headers={"Content-Type": "application/json"},
    extra_options={"verify": False},
    response_check=lambda response: _handler_object_result(response, "{{ dag_run.conf['file'] }}"),
    do_xcom_push=False,
    dag=dag,
)
I was really expecting the task_instance object to be available in some fashion, either by default or via configuration, but each variation that has worked elsewhere (FileSensor, PythonOperator, etc.) hasn't worked here, and I have been unable to google the magic words to make it accessible.

You can try using the get_current_context() function in your response_check function:
import json

import requests
from airflow.operators.python import get_current_context

def _handler_object_result(response, file):
    # Note: related to api I am calling not Airflow internal task ids
    header_result = response.json()
    task_id = header_result["task"]["id"]
    api = "https://redacted.com/api/task/result/{task_id}".format(task_id=task_id)
    resp = requests.get(api, verify=False).json()
    data = json.loads(resp["data"])
    file_object = json.dumps(data["OBJECT"])
    file_hash = hash(file)
    ti = get_current_context()["ti"]  # <- Try this
    ti.xcom_push(key=file_hash, value=file_object)
    if ti.xcom_pull(key=file_hash):
        return True
    else:
        return False
That function is a nice way of accessing the task's execution context when it isn't explicitly handy, or when you don't want to pass context attributes around just to reach it deep in your logic stack.
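For reference, get_current_context() returns the same context dictionary that a PythonOperator callable receives, so other fields are available from it as well. A minimal sketch (the function name and XCom key below are illustrative, not from the question above):

from airflow.operators.python import get_current_context

def _some_response_check(response):
    context = get_current_context()   # full execution context dict
    ti = context["ti"]                # the running TaskInstance
    conf = context["dag_run"].conf    # e.g. the payload that triggered the DAG
    ti.xcom_push(key="status_code", value=response.status_code)
    return response.status_code == 200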

Related

Calling multiple cloud functions inside Google Composer

I am writing a DAG to call two Cloud Functions inside my workflow in Google Composer. I have created a class for it by subclassing SimpleHttpOperator:
import google.auth.transport.requests
from google.oauth2 import id_token

# Import paths may differ depending on your Airflow/Composer version
from airflow.providers.http.hooks.http import HttpHook
from airflow.providers.http.operators.http import SimpleHttpOperator

class cfSFTP2GCSOp(SimpleHttpOperator):
    def execute(self, context):
        http = HttpHook(self.method, http_conn_id=self.http_conn_id)
        self.log.info("Calling HTTP method")
        target_audience = 'https://dw-etl-transactor-unzip-files-nowvpwp6oq-uc.a.run.app'
        request = google.auth.transport.requests.Request()
        idt = id_token.fetch_id_token(request, target_audience)
        self.headers = {'Authorization': "Bearer " + idt}
        response = http.run(self.endpoint,
                            self.data,
                            self.headers,
                            self.extra_options)
        self.log.info(response)
        if response == "<Response [200]>":
            return True
        else:
            return False
which I use inside the task to call the function:
gcp_cf_dw_etl_transactor_unzip_files = cfSFTP2GCSOp(
    task_id='gcp_cf_dw_etl_unzip_files',
    method='POST',
    http_conn_id='gcp_cf_dw_etl_unzip_files',
    data={},
    endpoint='/',
    headers={},
    response_check=lambda response: True if response == "<Response [200]>" is True else False,
    dag=dag,
)
Right now I use two tasks, one for each Cloud Function I need to call, but what happens if I need to call many more Cloud Functions? Is it possible to call all of them inside the same task, or do I need to continue doing what I am currently doing?
Thanks in advance!

Trace failed fastapi requests with opencensus

I'm using opencensus-python to track requests to my Python FastAPI application running in production, and exporting the information to Azure Application Insights using the opencensus exporters. I followed the Azure Monitor docs and was helped out by this issue post, which puts all the necessary bits in a useful middleware class.
Only later did I realize that requests that caused the app to crash, i.e. unhandled 5xx-type errors, would never be tracked, since the call that executes the logic for the request fails before any tracing happens. The Azure Monitor docs only talk about tracking exceptions through the logs, but this is separate from the tracing of requests, unless I'm missing something. I certainly wouldn't want to lose out on failed requests; these are super important to track! I'm accustomed to using the "Failures" tab in App Insights to monitor any failing requests.
I figured the way to track these requests is to explicitly handle any internal exceptions using try/except and export the trace, manually setting the result code to 500. But I found it really odd that there seems to be no documentation of this, on opencensus or Azure.
The problem I have now is: this middleware function is expected to pass back a "response" object, which FastAPI then uses as a callable object down the line (not sure why), but in the case where I caught an exception in the underlying processing (i.e. at await call_next(request)) I don't have any response to return. I tried returning None, but this just causes further exceptions down the line (None is not callable).
Here is my version of the middleware class. It's very similar to the issue post I linked, but I'm wrapping await call_next(request) in try/except rather than just letting it fail unhandled. Scroll down to the final 5 lines of code to see that.
import logging

from fastapi import Request
from opencensus.trace import (
    attributes_helper,
    execution_context,
    samplers,
)
from opencensus.ext.azure.trace_exporter import AzureExporter
from opencensus.trace import span as span_module
from opencensus.trace import tracer as tracer_module
from opencensus.trace import utils
from opencensus.trace.propagation import trace_context_http_header_format
from opencensus.ext.azure.log_exporter import AzureLogHandler
from starlette.types import ASGIApp

from src.settings import settings

HTTP_HOST = attributes_helper.COMMON_ATTRIBUTES["HTTP_HOST"]
HTTP_METHOD = attributes_helper.COMMON_ATTRIBUTES["HTTP_METHOD"]
HTTP_PATH = attributes_helper.COMMON_ATTRIBUTES["HTTP_PATH"]
HTTP_ROUTE = attributes_helper.COMMON_ATTRIBUTES["HTTP_ROUTE"]
HTTP_URL = attributes_helper.COMMON_ATTRIBUTES["HTTP_URL"]
HTTP_STATUS_CODE = attributes_helper.COMMON_ATTRIBUTES["HTTP_STATUS_CODE"]

module_logger = logging.getLogger(__name__)
module_logger.addHandler(AzureLogHandler(
    connection_string=settings.appinsights_connection_string
))
class AppInsightsMiddleware:
    """
    Middleware class to handle tracing of fastapi requests and exporting the data to AppInsights.
    Most of the code here is copied from a github issue: https://github.com/census-instrumentation/opencensus-python/issues/1020
    """
    def __init__(
        self,
        app: ASGIApp,
        excludelist_paths=None,
        excludelist_hostnames=None,
        sampler=None,
        exporter=None,
        propagator=None,
    ) -> None:
        self.app = app
        self.excludelist_paths = excludelist_paths
        self.excludelist_hostnames = excludelist_hostnames
        self.sampler = sampler or samplers.AlwaysOnSampler()
        self.propagator = (
            propagator or trace_context_http_header_format.TraceContextPropagator()
        )
        self.exporter = exporter or AzureExporter(
            connection_string=settings.appinsights_connection_string
        )

    async def __call__(self, request: Request, call_next):
        # Do not trace if the url is in the exclude list
        if utils.disable_tracing_url(str(request.url), self.excludelist_paths):
            return await call_next(request)
        try:
            span_context = self.propagator.from_headers(request.headers)
            tracer = tracer_module.Tracer(
                span_context=span_context,
                sampler=self.sampler,
                exporter=self.exporter,
                propagator=self.propagator,
            )
        except Exception:
            module_logger.error("Failed to trace request", exc_info=True)
            return await call_next(request)
        try:
            span = tracer.start_span()
            span.span_kind = span_module.SpanKind.SERVER
            span.name = "[{}]{}".format(request.method, request.url)
            tracer.add_attribute_to_current_span(HTTP_HOST, request.url.hostname)
            tracer.add_attribute_to_current_span(HTTP_METHOD, request.method)
            tracer.add_attribute_to_current_span(HTTP_PATH, request.url.path)
            tracer.add_attribute_to_current_span(HTTP_URL, str(request.url))
            execution_context.set_opencensus_attr(
                "excludelist_hostnames", self.excludelist_hostnames
            )
        except Exception:  # pragma: NO COVER
            module_logger.error("Failed to trace request", exc_info=True)
        try:
            response = await call_next(request)
            tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, response.status_code)
            tracer.end_span()
            return response
        # Explicitly handle any internal exception here, and set status code to 500
        except Exception as exception:
            module_logger.exception(exception)
            tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, 500)
            tracer.end_span()
            return None
I then register this middleware class in main.py like so:
app.middleware("http")(AppInsightsMiddleware(app, sampler=samplers.AlwaysOnSampler()))
Explicitly handle any exception that may occur while processing the API request. That allows you to finish tracing the request, setting the status code to 500. You can then re-raise the exception so that the application still surfaces the error as expected.
try:
    response = await call_next(request)
    tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, response.status_code)
    tracer.end_span()
    return response
# Explicitly handle any internal exception here, and set status code to 500
except Exception as exception:
    module_logger.exception(exception)
    tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, 500)
    tracer.end_span()
    raise exception

What is the best way to check if a file exists on an Azure Datalake using Apache Airflow?

I have a DAG that shall check whether a file has been uploaded to Azure Data Lake in a specific directory. If so, it allows other DAGs to run.
I thought about using a FileSensor, but I assume the fs_conn_id parameter is not enough to authenticate against a Data Lake.
There is no AzureDataLakeSensor in the Azure provider, but you can easily implement one, since the AzureDataLakeHook has a check_for_file function; all that is needed is to wrap this function in a sensor class that implements the poke() function of BaseSensorOperator. By doing so you can use the Microsoft Azure Data Lake connection directly.
I didn't test it but this should work:
from typing import Sequence

from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook
from airflow.sensors.base import BaseSensorOperator

class MyAzureDataLakeSensor(BaseSensorOperator):
    """
    Sense for files in Azure Data Lake

    :param path: The Azure Data Lake path to find the objects. Supports glob
        strings (templated)
    :param azure_data_lake_conn_id: The Azure Data Lake conn
    """

    template_fields: Sequence[str] = ('path',)
    ui_color = '#901dd2'

    def __init__(
        self, *, path: str, azure_data_lake_conn_id: str = 'azure_data_lake_default', **kwargs
    ) -> None:
        super().__init__(**kwargs)
        self.path = path
        self.azure_data_lake_conn_id = azure_data_lake_conn_id

    def poke(self, context: "Context") -> bool:
        hook = AzureDataLakeHook(azure_data_lake_conn_id=self.azure_data_lake_conn_id)
        self.log.info('Poking for file in path: %s', self.path)
        try:
            hook.check_for_file(file_path=self.path)
            return True
        except FileNotFoundError:
            pass
        return False
Usage example:
MyAzureDataLakeSensor(
    task_id='adls_sense',
    path='folder/file.csv',
    azure_data_lake_conn_id='azure_data_lake_default',
    mode='reschedule'
)
First of all, have a look at the official Microsoft operators for Airflow.
We can see that there are dedicated operators for Azure Data Lake Storage; unfortunately, only the ADLSDeleteOperator seems available at the moment.
This ADLSDeleteOperator uses an AzureDataLakeHook, which you should reuse in your own custom operator to check for file presence.
My advice is to create a child class of CheckOperator that uses the ADLS hook to check whether the file provided as input exists, via the hook's check_for_file function.
UPDATE: as pointed out in the comments, CheckOperator is tied to SQL queries and is deprecated. Using your own custom sensor or custom operator is the way to go (see the sketch below).
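For illustration only, a minimal sketch of such a custom operator built around the hook's check_for_file function (the class name is made up here, and check_for_file is assumed to return a boolean):

from airflow.models import BaseOperator
from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

class AzureDataLakeFileCheckOperator(BaseOperator):  # hypothetical name
    def __init__(self, *, path: str, azure_data_lake_conn_id: str = 'azure_data_lake_default', **kwargs):
        super().__init__(**kwargs)
        self.path = path
        self.azure_data_lake_conn_id = azure_data_lake_conn_id

    def execute(self, context):
        hook = AzureDataLakeHook(azure_data_lake_conn_id=self.azure_data_lake_conn_id)
        # check_for_file is assumed to return True/False for the given path
        if not hook.check_for_file(file_path=self.path):
            raise FileNotFoundError(f"No file found at {self.path}")
        self.log.info("Found file at %s", self.path)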
I had severe issues using the proposed API, so I embedded the Microsoft API into Airflow instead. This worked fine. All you need to do then is use this operator and pass account_url and access_token.
from azure.storage.filedatalake import DataLakeServiceClient
from airflow.sensors.base import BaseSensorOperator

class AzureDataLakeSensor(BaseSensorOperator):
    def __init__(self, path, filename, account_url, access_token, **kwargs):
        super().__init__(**kwargs)
        self._client = DataLakeServiceClient(
            account_url=account_url,
            credential=access_token
        )
        self.path = path
        self.filename = filename

    def poke(self, context):
        container = self._client.get_file_system_client(file_system="raw")
        dir_client = container.get_directory_client(self.path)
        file = dir_client.get_file_client(self.filename)
        return file.exists()
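A usage example might look like this (a sketch only: the task id, paths, URL, and token are placeholders, and note that the file system name "raw" is hardcoded in the sensor's poke method above):

check_file = AzureDataLakeSensor(
    task_id='adls_sense_file',                              # hypothetical task id
    path='incoming/2023',                                   # directory inside the "raw" file system
    filename='data.csv',
    account_url='https://<account>.dfs.core.windows.net',   # placeholder account URL
    access_token='<token>',                                 # placeholder credential
    mode='reschedule',
    dag=dag,
)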

Passing xcom of SimpleHttpOperator to DatabricksSubmitRunOperator

I have airflow DAG which consists of the following steps:
SimpleHttpOperator
DatabricksSubmitRunOperator
The SimpleHttpOperator connects to an API and gets the response. I want to then pass the response to the DatabricksSubmitRunOperator which can then send the response to Databricks.
How can I pull the response from the SimpleHttpOperator and pass it to DatabricksSubmitRunOperator?
Try the below
sho_task = SimpleHttpOperator(
    task_id="sho_task_id",
    method="GET",
    endpoint="/some/endpoint",
    http_conn_id="connection_id",  # Must be configured in Airflow UI under connections as HTTP connection type
    response_filter=lambda response: response.json()["nested"]["property"],
    do_xcom_push=True,  # Important parameter!
    dag=dag
)

dsro_task = DatabricksSubmitRunOperator(
    task_id='databricks_task',
    notebook_task={
        'notebook_path': '/some/notebook/path',
        'base_parameters': {'return_from_prev_task': "{{ ti.xcom_pull('sho_task_id') }}"}
    },
    dag=dag
)

sho_task >> dsro_task
There are two things to note here. First, you need to set do_xcom_push=True in order to save the response to XCom. Second, the SimpleHttpOperator stores the response as text (a string) by default. If you need to access a certain nested property, you can either set response_filter=lambda response: response.json()["nested"]["property"], or set response_filter=lambda response: response.json() and then use {{ ti.xcom_pull('sho_task_id')["nested"]["property"] }} in the DatabricksSubmitRunOperator, as sketched below.
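A sketch of that second variant, reusing the hypothetical task ids and paths from the answer above (provider import paths shown are for Airflow 2):

from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from airflow.providers.http.operators.http import SimpleHttpOperator

sho_task = SimpleHttpOperator(
    task_id="sho_task_id",
    method="GET",
    endpoint="/some/endpoint",
    http_conn_id="connection_id",
    response_filter=lambda response: response.json(),  # push the whole parsed JSON body
    do_xcom_push=True,
    dag=dag
)

dsro_task = DatabricksSubmitRunOperator(
    task_id='databricks_task',
    notebook_task={
        'notebook_path': '/some/notebook/path',
        # the nested property is extracted while rendering the template
        'base_parameters': {'return_from_prev_task': "{{ ti.xcom_pull('sho_task_id')['nested']['property'] }}"}
    },
    dag=dag
)

sho_task >> dsro_task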

How to read response body from SimpleHttpOperator

I am new to Airflow. I have written code to submit an HTTP POST using SimpleHttpOperator. In this case the POST request returns a token; I need help on how to read the response body.
get_templates = SimpleHttpOperator(
    task_id='get_templates',
    method='POST',
    endpoint='myendpoint',
    http_conn_id='myconnection',
    trigger_rule="all_done",
    headers={"Content-Type": "application/json"},
    xcom_push=True,
    dag=dag
)
Looks like POST was successful. Now my question is how to read the response body.
This is the output of the code; there are no errors:
[2019-05-06 20:08:40,518] {http_hook.py:128} INFO - Sending 'POST' to url: https://auth.reltio.com/oauth//token?username=perf_api_user&password=perf_api_user!&grant_type=password
/usr/lib/python2.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
The execute function of the SimpleHttpOperator returns the response.text (source). By looking at the Airflow documentation for XCom, you can see that:
... if a task returns a value (either from its Operator’s execute() method, or from a PythonOperator’s python_callable function), then an XCom containing that value is automatically pushed.
meaning the response body is pushed to the XCom and is available for downstream tasks to access.
For example, you could have a PythonOperator fetching it via:
response_body = context['task_instance'].xcom_pull(task_ids='get_templates')
Additionally, if you just want to log the response instead of processing it, you can set the log_response argument of the SimpleHttpOperator constructor to True.
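A minimal sketch of such a PythonOperator, assuming the Airflow 1.x-style setup used in the question (the task id and callable name are illustrative):

from airflow.operators.python_operator import PythonOperator

def _process_token(**context):
    # Pull the response body that get_templates pushed automatically
    response_body = context['task_instance'].xcom_pull(task_ids='get_templates')
    print(response_body)

process_token = PythonOperator(
    task_id='process_token',        # hypothetical task id
    python_callable=_process_token,
    provide_context=True,           # required on Airflow 1.x; not needed on 2.x
    dag=dag
)

get_templates >> process_token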
If you use Airflow 2, the xcom_push argument is not available in SimpleHttpOperator. In this case, let's say you call /my-url in a task call_api; to get the response and pass it to another task, you need to read from the XCom return_value that is automatically defined by the SimpleHttpOperator:
call_api = SimpleHttpOperator(
    task_id='call_api',
    http_conn_id=api_connection,
    method='GET',
    endpoint='/my-url',
    response_filter=lambda response: json.loads(response.text),
    log_response=True,  # Shows response in the task log
    dag=dag
)

def _read_response(ti):
    val = ti.xcom_pull(
        task_ids='call_api',
        key='return_value'
    )
    print(val)

read_response = PythonOperator(
    task_id='read_response',
    python_callable=_read_response,
    dag=dag
)
You can also specify dag_id in ti.xcom_pull to select which DAG to pull the XCom from.
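For instance, inside _read_response above (a sketch; the dag_id value is a placeholder):

val = ti.xcom_pull(
    dag_id='api_dag',        # placeholder: the DAG that ran call_api
    task_ids='call_api',
    key='return_value'
)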
