I'm setting up a DAG in Airflow (Cloud Composer) to trigger some Cloud Run jobs that take upwards of 30 minutes to complete.
Instead of using SimpleHttpOperator, I'm using the PythonOperator to obtain an OIDC token, which allows me to trigger services that require authenticated requests.
import google.auth.transport.requests
import google.oauth2.id_token
import requests

def make_authorized_get_request(service_url, **kwargs):
    # Fetch an OIDC identity token with the Cloud Run URL as the audience.
    auth_req = google.auth.transport.requests.Request()
    id_token = google.oauth2.id_token.fetch_id_token(auth_req, service_url)
    headers = {"Authorization": f"Bearer {id_token}"}
    # timeout=(connect timeout, read timeout) in seconds
    req = requests.get(service_url, headers=headers, timeout=(5, 3600))
    status = req.status_code
    response_json = req.json()
    return response_json, status
# t1 as request first report
big3_request = PythonOperator(
    task_id='big3_request',
    python_callable=make_authorized_get_request,
    op_kwargs={"service_url": "cloud run url"},
)
I can see that the Cloud Run job completes successfully, but the Airflow task doesn't seem to pick up the response (a JSON payload with some data about the job, plus a status code) and just keeps running. If I set a timeout, the task errors instead, even though the response should be available well before then.
When I point this task at a shorter-running service to test, it picks up the response fine.
How can I get the task to pick up the response? Or do I need to use another operator?
I have an Airflow DAG that executes 10 tasks (exporting different data from the same source) in parallel, every 15 minutes. I've also enabled 'email_on_failure' to get notified of failures.
Once every month or so, the tasks start failing for a couple of hours because the data source is not available, causing Airflow to generate hundreds of emails (10 emails every 15 minutes) until the raw data source is available again.
Is there a better way to avoid being spammed with emails when consecutive runs fail?
For example, is it possible to send an email only on the first failing run (i.e. when the previous run was successful)?
To customise the logic in callbacks you can use on_failure_callback and define a Python function to call on failure/success. In this function you can access the task instance.
A property on this task instance is try_number, which you can check before sending an alert. An example could be:
def task_fail_email_alert(context):
    try_number = context["ti"].try_number
    if try_number == 1:
        # send alert
        pass
    else:
        # do nothing
        pass

some_operator = BashOperator(
    task_id="some_operator",
    bash_command="""
    echo "something"
    """,
    on_failure_callback=task_fail_email_alert,
    dag=dag,
)
You can then implement the code to send an email in this function, rather than use the built-in email_on_failure. The EmailOperator is available via from airflow.operators.email import EmailOperator.
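For example, a minimal sketch of that callback which sends its own email via Airflow's send_email helper (this assumes SMTP is configured for your deployment; the recipient address and message text are placeholders):

from airflow.utils.email import send_email

def task_fail_email_alert(context):
    ti = context["ti"]
    if ti.try_number == 1:
        send_email(
            to="alerts@example.com",  # placeholder recipient
            subject=f"Airflow task failed: {ti.task_id} ({context['run_id']})",
            html_content=f"Task {ti.task_id} failed on its first try.",
        )
    # otherwise do nothing, so repeated failures don't keep emailing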
Given that your tasks run concurrently and one or more of them could fail, I would suggest treating the dispatch of failure emails the way you would a shared resource.
You need to implement a lock that is "dagrun-aware", i.e. one that knows about the DagRun.
You can back this lock with an in-memory store like Redis, an object store like S3, a file on a shared filesystem, or a database. How you choose to implement it is up to you.
In your on_failure_callback implementation, acquire the lock. If acquisition succeeds, carry on and dispatch the email; otherwise, pass.
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

class OnlyOnceLock:
    def __init__(self, run_id):
        self.run_id = run_id

    def acquire(self):
        # Returns False if run_id already exists in the backing store.
        # S3 example
        hook = S3Hook()
        key = self.run_id
        bucket_name = 'coordinated-email-alerts'
        try:
            hook.head_object(key, bucket_name)
            return False
        except Exception:
            # This is the first time the lock is acquired
            hook.load_string('fakie', key, bucket_name)
            return True

    def __enter__(self):
        return self.acquire()

    def __exit__(self, exc_type, exc_val, exc_tb):
        pass

def on_failure_callback(context):
    error = context['exception']
    task = context['task']
    run_id = context['run_id']
    ti = context['ti']
    with OnlyOnceLock(run_id) as lock:
        if lock:
            ti.email_alert(error, task)
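To wire this up, a minimal sketch (the DAG id, schedule and task below are placeholders) would attach the callback through default_args so that all of your parallel export tasks share it:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="exports_every_15_min",  # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="*/15 * * * *",
    catchup=False,
    default_args={"on_failure_callback": on_failure_callback},
) as dag:
    # placeholder for one of the 10 parallel export tasks
    export_1 = BashOperator(task_id="export_1", bash_command="echo export")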
This question is really about the different coroutines in base_events.py and streams.py that deal with network connections, network servers, and their higher-level equivalents under Streams. Since it's not really clear how to group these functions, I am going to use start_server() to explain what I don't understand about these coroutines and haven't found online (unless I missed something obvious).
When running the following code, I can create a server that handles incoming messages from a client, and I also periodically print out the tasks that the event loop is handling to see how the tasks work. What surprises me is that, not long after the program starts, the task that created the server is in the finished state. I expected a task in the finished state to be a completed task that does nothing other than pass back its result or exception.
However, that's clearly not the whole story: the event loop is still running and handling incoming messages from clients, and the application keeps running. The monitor, though, shows that all tasks are completed and no new task is dispatched to handle a new incoming message.
So my question is this:
What is going on underneath asyncio that I am missing and that explains the behavior I am seeing? For example, I would have expected a task (or one task per message) handling incoming messages to be in the pending state.
Why does asyncio.Task.all_tasks() return finished tasks? I would have thought that once a task has completed it is garbage collected (so long as nothing else references it).
I have seen similar behavior with other asyncio functions, e.g. using create_connection() with a websocket from a site. I know these coroutines usually return a tuple such as (reader, writer) or (transport, protocol), but I don't understand how it all ties together or what other documentation/code to read to give me more insight. Any help is appreciated.
import asyncio
from pprint import pprint

async def echo_message(reader, writer):
    data = await reader.read(1000)
    message = data.decode()
    addr = writer.get_extra_info('peername')
    print('Received %r from %r' % (message, addr))

    print('Send: %r' % message)
    writer.write(message.encode())
    await writer.drain()

    print('Close the client socket')
    writer.close()

async def monitor():
    while True:
        tasks = asyncio.Task.all_tasks()
        pprint(tasks)
        await asyncio.sleep(60)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.create_task(monitor())
    loop.create_task(asyncio.start_server(echo_message, 'localhost', 7777, loop=loop))
    loop.run_forever()
Outputs:
###
# Soon after starting the application, monitor prints out:
###
{<Task pending coro=<start_server() running ...>,
<Task pending coro=<monitor() running ...>,
<Task pending coro=<BaseEventLoop._create_server_getaddrinfo() running ...>}
###
# After things have initialized and the server has started, the next printout is:
###
{<Task finished coro=<start_server() done ...>,
<Task pending coro=<monitor() running ...>,
<Task finished coro=<BaseEventLoop._create_server_getaddrinfo() done ...>}
I'm trying to figure out how best to approach the problem below. Essentially, I have an external API service that I send requests to and retrieve results from.
POST = send a request; the response you get back is a URL which you can use in GET requests to retrieve your results.
GET = poll the URL returned by the POST request until you get a successful result.
What would be the best way to approach this in Airflow? My idea is essentially to have 2 tasks running in parallel.
One sending the POST requests and then saving the response URL to XCom.
The other would run continuously in a while loop, reading new URLs from the XCom store and issuing GET requests. It would then delete the entry from the XCom store once it has retrieved a successful result from that URL.
Do you think this is the correct way of going about it? Or should I possibly use the asyncio library in Python?
Any help much appreciated
Thanks,
You can achieve what you are describing using SimpleHttpOperator and HttpSensor from Airflow (no need to install any extra package).
Consider this example that uses the http_default connection pointing to httpbin.
The task to perform the POST request:

import json

from airflow.providers.http.operators.http import SimpleHttpOperator

task_post_op = SimpleHttpOperator(
    task_id='post_op',
    # http_conn_id='your_conn_id',
    endpoint='post',
    data=json.dumps({"priority": 5}),
    headers={"Content-Type": "application/json"},
    response_check=lambda response: response.json()['json']['priority'] == 5,
    response_filter=lambda response: 'get',  # e.g. lambda response: json.loads(response.text)
    dag=dag,
)
By providing response_filter you can manipulate the response result, which will be the value pushed to XCom. In your case, you should return the endpoint you want to poll in the next task.
response_filter: A function allowing you to manipulate the response
text. e.g response_filter=lambda response: json.loads(response.text).
The callable takes the response object as the first positional argument
and optionally any number of keyword arguments available in the context dictionary.
:type response_filter: A lambda or defined function.
Note that the response_check param is optional.
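For instance, a minimal sketch of a response_filter callable (the links['poll'] field used here is hypothetical; adapt it to whatever your API actually returns):

import json

def extract_poll_endpoint(response, **context):
    # Hypothetical response shape: {"links": {"poll": "get?job_id=123"}}
    # Return the endpoint, relative to the connection's base URL, that the
    # sensor below should poll; this return value is what gets pushed to XCom.
    return json.loads(response.text)["links"]["poll"]

# then pass response_filter=extract_poll_endpoint to SimpleHttpOperator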
The task to perform the GET requests:
Use the HttpSensor to poke until the response_check callable evaluates to True.

from airflow.providers.http.sensors.http import HttpSensor

task_http_sensor_check = HttpSensor(
    task_id='http_sensor_check',
    # http_conn_id='your_conn_id',
    endpoint=task_post_op.output,
    request_params={},
    response_check=lambda response: "httpbin" in response.text,
    poke_interval=5,
    dag=dag,
)
As the endpoint param we are passing the XCom value pulled from the previous task, using XComArg.
Use poke_interval to define the time in seconds that the sensor should wait between tries.
Remember to create a Connection of your own defining the base URL, port, etc.
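For reference, a sketch of one way to express that connection, using the AIRFLOW_CONN_<CONN_ID> environment-variable form that Airflow resolves into a connection (httpbin.org is just an example host here; in practice you would export this in the scheduler/worker environment rather than set it from Python):

import os

# The value is a URI encoding conn_type, host, port, credentials, etc.
os.environ["AIRFLOW_CONN_HTTP_DEFAULT"] = "http://httpbin.org"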
Let me know if that worked for you!
My use case is to control a lot of scheduled jobs across microservices using Airflow. The solution I am trying is to use Airflow as a centralized job scheduler and trigger jobs by making HTTP calls. Some of these jobs will run for a long time, e.g. more than 10 minutes or up to 1 hour.
How can I regularly check the status of these jobs from Airflow? What if the remote task has finished but Airflow does not know about its success? Could I publish a job-completion event to Kafka and have Airflow listen on Kafka to get the job status?
There are many ways you could do this with Airflow and your microservices. In general, you will want to use a sensor; that's the appropriate Airflow construct for something like this. Start by reading up on the BaseSensorOperator and on operators in general. In Airflow, sensors are used just like operators (sensors are operators). So you can create a pipeline like this:
http_post_task -> http_sensor_task -> success_task
Where http_post_task will trigger a job, http_sensor_task will check periodically to see if the job is done (e.g. GET request the microservice and check for 200, maybe?), and success_task will execute after the http_sensor_task is successful.
Your http_sensor_task will need to be your own custom sensor. Here is some pseudocode that can help you create this sensor (remember, sensors are used like operators). Consider the case where you make a request to the microservice and then make another request to check the status of the job (a GET request, checking for 200); you would extend the BaseSensorOperator something like this:
from datetime import datetime
from time import sleep

import requests

from airflow.exceptions import AirflowSensorTimeout, AirflowSkipException
from airflow.operators.sensors import BaseSensorOperator
from airflow.utils.decorators import apply_defaults

class HTTPSensorOperator(BaseSensorOperator):
    """
    Pokes a URL until it returns 200.
    """
    ui_color = '#000000'

    @apply_defaults
    def __init__(self, url, *args, **kwargs):
        super(HTTPSensorOperator, self).__init__(*args, **kwargs)
        self.url = url

    def poke(self, context):
        """
        GET the url and return True if the response is 200, False otherwise.
        """
        r = requests.get(self.url)
        return r.status_code == 200

    def execute(self, context):
        """
        Check the url and wait for it to return 200.
        """
        started_at = datetime.utcnow()
        while not self.poke(context):
            if (datetime.utcnow() - started_at).total_seconds() > self.timeout:
                if self.soft_fail:
                    raise AirflowSkipException("Polling {0} took too long.".format(self.url))
                else:
                    raise AirflowSensorTimeout("Polling {0} took too long.".format(self.url))
            sleep(self.poke_interval)
        self.log.info("Success criteria met. Exiting.")
Then use the operator like:
http_sensor_task = HTTPSensorOperator(
    task_id="http_sensor_task",
    url="http://localhost/check_job?job_id=1",
    timeout=3600,  # 1 hour
    dag=dag,
)
So you'll have to decide how your microservices will communicate with Airflow. Just off the top of my head, I'm thinking you'll make one request to trigger a job and then make subsequent requests (maybe every 10 seconds) to check on it.
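To tie it together, a rough sketch of the whole pipeline (the trigger and check URLs are placeholders, the trigger task uses plain requests from a PythonOperator as an assumption since how you call your microservice is up to you, and import paths may differ across Airflow versions):

from datetime import datetime

import requests

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator

def trigger_job(**kwargs):
    # Placeholder trigger endpoint on your microservice.
    r = requests.post("http://localhost/start_job")
    r.raise_for_status()

with DAG("microservice_job", start_date=datetime(2023, 1, 1), schedule_interval="@hourly", catchup=False) as dag:
    http_post_task = PythonOperator(task_id="http_post_task", python_callable=trigger_job)
    http_sensor_task = HTTPSensorOperator(
        task_id="http_sensor_task",
        url="http://localhost/check_job?job_id=1",  # placeholder
        timeout=3600,
        poke_interval=10,
    )
    success_task = DummyOperator(task_id="success_task")

    http_post_task >> http_sensor_task >> success_task

Good luck!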
I am working with:
let callTheAPI = async {
    printfn "\t\t\tMAKING REQUEST at %s..." (System.DateTime.Now.ToString("yyyy-MM-ddTHH:mm:ss"))
    let! response = Http.AsyncRequestStream(url,query,headers,httpMethod,requestBody)
    printfn "\t\t\t\tREQUEST MADE."
}
And
let cts = new System.Threading.CancellationTokenSource()
let timeout = 1000*60*4 // 4 minutes (4 mins no grace)
cts.CancelAfter(timeout)
Async.RunSynchronously(callTheAPI,timeout,cts.Token)

use respStrm = response.ResponseStream
respStrm.Flush()
writeLinesTo output (responseLines respStrm)
To call a web API (REST), and the let! response = Http.AsyncRequestStream(url,query,headers,httpMethod,requestBody) line just hangs on certain queries, particularly ones that take a long time (>4 minutes). This is why I have made it async and put a 4-minute timeout on it. (I collect the calls that time out and retry them with smaller time-range parameters.)
I started with Http.RequestStream from FSharp.Data first, but I couldn't add a timeout to it, so the script would just 'hang'.
I have looked at the API's IIS server and the application pool Worker Process active requests in IIS Manager, and I can see the requests come in and go again. They then 'vanish' and the F# script hangs. I can't find an error message anywhere on the script side or the server side.
I added the Flush() and removed the timeout, and it still hung (removing the async in the process).
Additional:
Successful calls are made, and failed calls can be followed by successful ones. However, it seems to get to a point where all the calls time out, and they do so without even reaching the server any more (Worker Process Active Requests doesn't show the query).
Update:
I made the .fsx script output the queries and ran them through IRM with no issues (I set a timeout and it never locks up). I suspect there is an issue with FSharp.Data.Http.
Async.RunSynchronously blocks. Read the remarks section in the docs: RunSynchronously. Instead, use Async.AwaitTask.