Airflow - How to handle Asynchronous API calls?

I'm trying to figure out how to best approach the below problem. Essentially I have an external API Service that I am sending requests to and getting results for.
POST = Send request and the response you get back is a URL which you can use for the GET requests to retrieve your results.
GET = Poll the returned URL from the POST request until you get a successful result.
What would be the best way to approach this in airflow? My idea is to essentially have 2 tasks running in parallel.
One sending the POST requests and then saving the response URL to XCOM.
The other would run continuously in a while loop, reading new URL responses from the XCom store and polling them. It would then delete the entry from the XCom store once it has retrieved a successful result from that URL.
Do you think this is the correct way of going about it? Or should I possibly use the asyncio library in Python?
Any help much appreciated
Thanks,

You can achieve what you are describing using SimpleHttpOperator and HttpSensor from Airflow (no need to install any extra package).
Consider this example that uses the http_default connection to httpbin.
The task to perform POST request:
task_post_op = SimpleHttpOperator(
    task_id='post_op',
    # http_conn_id='your_conn_id',
    endpoint='post',
    data=json.dumps({"priority": 5}),
    headers={"Content-Type": "application/json"},
    response_check=lambda response: response.json()['json']['priority'] == 5,
    response_filter=lambda response: 'get',  # e.g. lambda response: json.loads(response.text)
    dag=dag,
)
By providing response_filter you can manipulate the response result, which will be the value pushed to XCom. In your case, you should return the endpoint you want to poll in the next task.
response_filter: A function allowing you to manipulate the response
text. e.g response_filter=lambda response: json.loads(response.text).
The callable takes the response object as the first positional argument
and optionally any number of keyword arguments available in the context dictionary.
:type response_filter: A lambda or defined function.
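For instance, if your API's POST response body contained the polling URL under a "url" key (that key name is just an assumption for illustration), the filter could simply be:
# Push only the polling endpoint to XCom, not the whole response body.
# The "url" key is an assumption about the API's response shape.
response_filter=lambda response: response.json()["url"],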
Note that the response_check param is optional.
The task to perform GET requests:
Use the HttpSensor to poke until the response_check callable evaluates to true.
task_http_sensor_check = HttpSensor(
    task_id='http_sensor_check',
    # http_conn_id='your_conn_id',
    endpoint=task_post_op.output,
    request_params={},
    response_check=lambda response: "httpbin" in response.text,
    poke_interval=5,
    dag=dag,
)
As the endpoint param we are passing the XCom value pulled from the previous task, using XComArg.
Use poke_interval to define the time in seconds that the sensor should wait between tries.
Remember to create a Connection of your own defining the base URL, port, etc.
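For completeness, here is a rough sketch of how the two tasks above could be assembled into a DAG. The import paths are those of the Airflow HTTP provider; the DAG id, schedule, and dates are assumptions for illustration:
import json
import pendulum

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.providers.http.sensors.http import HttpSensor

with DAG(
    dag_id="poll_external_api",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule_interval=None,  # newer Airflow versions use `schedule`
    catchup=False,
) as dag:
    # POST the request; response_filter decides what gets pushed to XCom.
    task_post_op = SimpleHttpOperator(
        task_id="post_op",
        http_conn_id="http_default",
        endpoint="post",
        data=json.dumps({"priority": 5}),
        headers={"Content-Type": "application/json"},
        response_filter=lambda response: "get",
    )

    # Poll the endpoint pulled from XCom until response_check returns True.
    task_http_sensor_check = HttpSensor(
        task_id="http_sensor_check",
        http_conn_id="http_default",
        endpoint=task_post_op.output,
        request_params={},
        response_check=lambda response: "httpbin" in response.text,
        poke_interval=5,
    )

    # Using task_post_op.output (an XComArg) already creates this dependency,
    # but it does no harm to state it explicitly.
    task_post_op >> task_http_sensor_check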
Let me know if that worked for you!

Related

Airflow Long (time) URL request not receiving response

I'm setting up a DAG in airflow (Cloud Composer) to trigger some Cloud Run jobs that take upwards of 30 mins to complete.
Instead of using SimpleHttpOperator, I'm using the PythonOperator to obtain an OIDC token, which allows me to trigger services that require authentication.
def make_authorized_get_request(service_url, **kwargs):
    auth_req = google.auth.transport.requests.Request()
    id_token = google.oauth2.id_token.fetch_id_token(auth_req, service_url)
    headers = {"Authorization": f"Bearer {id_token}"}
    req = requests.get(service_url, headers=headers, timeout=(5, 3600))
    status = req.status_code
    response_json = req.json()
    return response_json, status

# t1 as request first report
big3_request = PythonOperator(
    task_id='big3_request',
    python_callable=make_authorized_get_request,
    op_kwargs={"service_url": "cloud run url"},
)
I can see the cloud run job completes successfully, but the task in airflow doesn't seem to pick up on the response (a json containing some data about the job and a status code) and just keeps on running, unless a timeout is set, in which case it errors (although the response should be available by then).
I've pointed this task towards a shorter service to test, and it picks up the response fine.
How can I get the task to pick up on the response? Or do I need to use another operator?

How should REST-like API errors be handled in an Airflow hook?

While Airflow provides AirflowException and its derivatives, I'm curious how an error that occurs in a hook when talking to an external API should be handled. For example, say we have a REST-like API and are using requests to talk to it: if our response object does not have a successful response, e.g. response.ok == False, we'd like to raise some kind of error (and in our specific case, ideally ensure Sentry knows about it). We could create an AirflowException here and embed some meaningful context in our message, but this feels a little brittle and prone to losing context.
Any exception raised in your code that you do not catch will be caught by Airflow, which will retry the task.
You can raise AirflowException with a new message. This is helpful when you want to change the error message.
A good example would be the check_response function in HttpHook.
def check_response(self, response: requests.Response) -> None:
    """
    Checks the status code and raise an AirflowException exception on non 2XX or 3XX
    status codes

    :param response: A requests response object
    :type response: requests.response
    """
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        self.log.error("HTTP error: %s", response.reason)
        self.log.error(response.text)
        raise AirflowException(str(response.status_code) + ":" + response.reason)
Using Airflow exceptions gives you control over how the task behaves. For example, AirflowFailException can be used when you want to tell Airflow to fail the task immediately (ignoring the retries parameter).
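For illustration, a brief sketch of how both exceptions might be used together in hook-like code (the function name, URL handling, and status-code policy are assumptions, not part of the original answer):
import requests
from airflow.exceptions import AirflowException, AirflowFailException

def call_api(url: str) -> dict:
    # Hypothetical helper for a REST-like API.
    response = requests.get(url, timeout=30)
    if response.status_code in (401, 403):
        # An auth problem will not fix itself; fail the task immediately, skipping retries.
        raise AirflowFailException(f"{response.status_code}: {response.reason}")
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        # Other non-2XX/3XX responses may be transient, so let Airflow retry.
        raise AirflowException(f"{response.status_code}: {response.reason}")
    return response.json()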

python grpc: setting timeout per grpc call

Is there a way to specify a timeout per gRPC call with Python?
I am experiencing a delay of more than a minute in receiving a response.
I want the API to return an error if it takes longer than a specified time. I am using a blocking gRPC call.
You can look up the information you want at gRPC Python's API reference. Setting timeout should be as simple as:
channel = grpc.insecure_channel(...)
stub = ...(channel)
stub.AnRPC(request, timeout=5) # 5 seconds timeout
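If the call exceeds the deadline, the blocking stub raises an RpcError with the DEADLINE_EXCEEDED status code; a small sketch of catching it (the stub and method names are placeholders, as above):
import grpc

try:
    response = stub.AnRPC(request, timeout=5)  # per-call deadline of 5 seconds
except grpc.RpcError as err:
    if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
        # The server did not answer within the deadline.
        print("RPC timed out")
    else:
        raise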

How to make async requests using HTTPoison?

Background
We have an app that deals with a considerable amount of requests per second. This app needs to notify an external service, by making a GET call via HTTPS to one of our servers.
Objective
The objective here is to use HTTPoison to make async GET requests. I don't really care about the response of the requests, all I care is to know if they failed or not, so I can write any possible errors into a logger.
If it succeeds I don't want to do anything.
Research
I have checked the official documentation for HTTPoison and I see that they support async requests:
https://hexdocs.pm/httpoison/readme.html#usage
However, I have 2 issues with this approach:
They use flush to show the request was completed. I can't log into the app and manually flush to see how the requests are going; that would be insane.
They don't show any notifications mechanism for when we get the responses or errors.
So, I have a simple question:
How do I get asynchronously notified that my request failed or succeeded?
I assume that the default HTTPoison.get is synchronous, as shown in the documentation.
This could be achieved by spawning a new process per request. Consider something like:
notify = fn response ->
  # Any handling logic - write to DB? Send a message to another process?
  # Here, I'll just print the result
  IO.inspect(response)
end

spawn(fn ->
  resp = HTTPoison.get("http://google.com")
  notify.(resp)
end) # spawn will not block, so it will attempt to execute the next spawn straight away

spawn(fn ->
  resp = HTTPoison.get("http://yahoo.com")
  notify.(resp)
end) # This will be executed immediately after the previous `spawn`
Please take a look at the documentation of spawn/1 I've pointed out here.
Hope that helps!

How can I send a query to database after the request is handled?

I have an application that does the following:
1) After the app receives a GET request, it reads the client's cookies for identification.
2) It stores the identification information in a PostgreSQL DB.
3) It sends the appropriate response and finishes the handling process.
But this way the client is also waiting for me to store the data in PostgreSQL. I don't want this; what I want is:
1) After the app receives a GET request, it reads the client's cookies for identification.
2) It sends the appropriate response and finishes the handling process.
3) It stores the identification information in the PostgreSQL DB.
In the second version, the storing happens after the client has received his response, so he won't have to wait for it. I've searched for a solution but haven't found anything so far. I believe I'm searching with the wrong keywords, because this seems like a common problem.
Any feedback is appreciated.
You should add a callback to the IOLoop, via some code like this:
from functools import partial
from tornado import ioloop

def somefunction(*args):
    # call the DB
    ...

... now in your get() or post() handler ...

io_loop = ioloop.IOLoop.instance()
io_loop.add_callback(partial(somefunction, arg, arg2))
... rest of your handler ...
self.finish()
This will get called after the response is returned to the user, on the next iteration through the event loop, and will invoke your DB processor somefunction.
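Putting this together, a minimal sketch of a handler that responds first and writes to the database afterwards (the handler name, cookie name, and DB function are illustrative assumptions):
from datetime import datetime
from functools import partial

from tornado import ioloop, web

def save_identification(user_id, visited_at):
    # Hypothetical blocking write to PostgreSQL, e.g. via psycopg2.
    pass

class TrackingHandler(web.RequestHandler):
    def get(self):
        user_id = self.get_cookie("user_id")
        self.write("ok")
        self.finish()  # the response is sent to the client here
        # Schedule the DB write for the next pass through the IOLoop,
        # so the client does not wait for it.
        io_loop = ioloop.IOLoop.instance()
        io_loop.add_callback(partial(save_identification, user_id, datetime.utcnow()))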
If you don't want to wait for Postgres to respond, you could try:
1) An async Postgres driver
2) Putting the DB jobs on a queue and letting the queue handle the DB write; try RabbitMQ (a rough sketch follows below)
Remember: because you return to the user before you write to the DB, you have to think about how to handle write errors.
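As a rough illustration of option 2, a sketch that hands the write off to RabbitMQ with the pika client (the queue name and message shape are assumptions); a separate consumer process would perform the actual INSERT into Postgres:
import json
import pika

# Connect to a local RabbitMQ broker and declare a durable queue for DB jobs.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="db_writes", durable=True)

def enqueue_identification(user_id):
    # The request handler only publishes the job; it never waits for Postgres.
    channel.basic_publish(
        exchange="",
        routing_key="db_writes",
        body=json.dumps({"user_id": user_id}),
    )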
