How to check the status of a long-running HTTP task with Airflow?

My use case is to control a lot of scheduled jobs across microservices using Airflow. The solution I am trying is to use Airflow as a centralized job scheduler and trigger jobs by making HTTP calls. Some of these jobs will run for a long time, e.g. more than 10 minutes or up to 1 hour.
How can I regularly check the status of these jobs from Airflow? What if the remote task has finished but Airflow does not know about the job's success? Can I publish the job-completion event to Kafka and have Airflow listen on Kafka to get the status of the job?

There are many ways you could do this with Airflow and your microservices. In general, you will want to use a sensor; that's the appropriate Airflow object for something like this. Start by checking out the BaseSensorOperator and the docs on operators. In Airflow, sensors are used just like operators (sensors are operators). So you can create a job like this:
http_post_task -> http_sensor_task -> success_task
Where http_post_task will trigger a job, http_sensor_task will check periodically to see if the job is done (e.g. GET request the microservice and check for a 200, maybe?), and success_task will execute after http_sensor_task succeeds.
Your http_sensor_task will need to be your own custom sensor. Here is some pseudo code that can help you create this sensor (remember, sensors are used like operators). Consider the case where you make a request to the microservice and then make another request to check the status of the job (GET request and check for a 200); you would extend the BaseSensorOperator kind of like this:
from datetime import datetime
from time import sleep

import requests

from airflow.exceptions import AirflowException, AirflowSkipException
from airflow.operators.sensors import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


class HTTPSensorOperator(BaseSensorOperator):
    """
    Pokes a URL until it returns 200.
    """
    ui_color = '#000000'

    @apply_defaults
    def __init__(self, url, *args, **kwargs):
        super(HTTPSensorOperator, self).__init__(*args, **kwargs)
        self.url = url

    def poke(self, context):
        """
        GET the url and return True if the response is 200, False otherwise.
        """
        r = requests.get(self.url)
        return r.status_code == 200

    def execute(self, context):
        """
        Check the url and wait for it to return 200.
        """
        started_at = datetime.utcnow()
        while not self.poke(context):
            if (datetime.utcnow() - started_at).total_seconds() > self.timeout:
                if self.soft_fail:
                    raise AirflowSkipException("Polling {0} took too long.".format(self.url))
                else:
                    raise AirflowException("Polling {0} took too long.".format(self.url))
            sleep(self.poke_interval)
        self.log.info("Success criteria met. Exiting.")
Then use the operator like:
http_sensor_task = HTTPSensorOperator(
    task_id="http_sensor_task",
    url="http://localhost/check_job?job_id=1",
    timeout=3600,  # 1 hour
    dag=dag,
)
So you'll have to decide how your microservices will communicate with Airflow. Just off the top of my head, I'm thinking you'll make one request to trigger a job and then make subsequent requests (maybe every 10 seconds) to check on the job. Good luck!
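To make that concrete, here is a minimal sketch of how the three tasks could be wired into a DAG, assuming Airflow 1.10-era import paths (they differ slightly in 2.x) and made-up start_job/check_job endpoints behind the default http_default connection:
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.http_operator import SimpleHttpOperator

dag = DAG(
    dag_id="microservice_jobs",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
)

# Kick off the long-running job on the microservice
# (uses the "http_default" connection by default).
http_post_task = SimpleHttpOperator(
    task_id="http_post_task",
    method="POST",
    endpoint="start_job?job_id=1",
    dag=dag,
)

# Poll the job-status endpoint until it returns 200,
# using the custom HTTPSensorOperator defined above.
http_sensor_task = HTTPSensorOperator(
    task_id="http_sensor_task",
    url="http://localhost/check_job?job_id=1",
    poke_interval=10,  # check every 10 seconds
    timeout=3600,      # give up after 1 hour
    dag=dag,
)

# Runs only after the sensor confirms the job finished.
success_task = DummyOperator(task_id="success_task", dag=dag)

http_post_task >> http_sensor_task >> success_task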

Related

Airflow/Cloud Composer execute custom logic when DAG run is killed

I'm currently working on a custom sensor operator that launches a job in a remote server and periodically polls for its status. In some cases I may need to cancel/delete this DAG run programmatically with another program. Since the actual job is executed remotely, I want to be able to shut down the remote job before the DAG run is canceled/deleted. Since only the operator has the details of the remote job, I'm wondering if there is a way to trigger some custom logic to shut down the remote job right before the DAG run is going to be canceled/deleted?
Here is a summarized version of my operator code:
class JobOperator(base_sensor_operator.BaseSensorOperator):

    def poke(self, context: Any) -> bool:
        if not self._job_started:
            self._job_info = LaunchJob(self._job_config)
        else:
            status = PollJob(self._job_info)
            if status == "SUCCESS":
                pass  # some logic.
            else:
                pass  # some logic.
I'm not sure if Airflow has such a trigger to execute some logic before DAG run deletion. Any help would be appreciated!

Airflow: How to only send email alerts when it is not a consecutive failure?

I have an Airflow DAG that executes 10 tasks (exporting different data from the same source) in parallel, every 15 minutes. I've also enabled 'email_on_failure' to get notified of failures.
Once every month or so, the tasks start failing for a couple of hours because the data source is unavailable, causing Airflow to generate hundreds of emails (10 emails every 15 minutes) until the raw data source is available again.
Is there a better way to avoid being spammed with emails when consecutive runs fail?
For example, is it possible to only send an email on failure for the first run that starts failing (i.e. the previous run was successful)?
To customise the logic in callbacks you can use on_failure_callback and define a Python function to call on failure/success. In this function you can access the task instance.
A property on this task instance is try_number, which you can check before sending an alert. An example could be:
from airflow.operators.bash import BashOperator


def task_fail_email_alert(context):
    try_number = context["ti"].try_number
    if try_number == 1:
        pass  # send alert
    else:
        pass  # do nothing


some_operator = BashOperator(
    task_id="some_operator",
    bash_command="""
    echo "something"
    """,
    on_failure_callback=task_fail_email_alert,
    dag=dag,
)
You can then implement the code to send an email in this function, rather than use the built-in email_on_failure. The EmailOperator is available by importing from airflow.operators.email import EmailOperator.
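As a rough sketch of what the "send alert" branch could do, assuming SMTP is configured for the Airflow deployment (the recipient address is a placeholder), you could call Airflow's send_email helper directly from the callback:
from airflow.utils.email import send_email


def task_fail_email_alert(context):
    ti = context["ti"]
    if ti.try_number == 1:
        # First failure of this task instance: send one alert email.
        send_email(
            to="alerts@example.com",  # placeholder recipient
            subject="Airflow alert: {} failed in {}".format(ti.task_id, ti.dag_id),
            html_content="Task {} failed on run {}.".format(ti.task_id, context["run_id"]),
        )
    # Otherwise do nothing, so later retries stay quiet.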
Given that your tasks run concurrently and one or more of them could fail, I would suggest treating the dispatch of failure emails as you would a shared resource.
You need to implement a lock that is "dagrun-aware", i.e. one that knows about the DagRun.
You can back this lock with an in-memory database like Redis, an object store like S3, the filesystem, or a relational database. How you choose to implement it is up to you.
In your on_failure_callback implementation, you acquire that lock. If the acquisition succeeds, go ahead and dispatch the email; otherwise, pass.
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


class OnlyOnceLock:
    def __init__(self, run_id):
        self.run_id = run_id

    def acquire(self):
        # Returns False if run_id already exists in the backing store.
        # S3 example:
        hook = S3Hook()
        key = self.run_id
        bucket_name = 'coordinated-email-alerts'
        try:
            hook.head_object(key, bucket_name)
            return False
        except:
            # This is the first time the lock is acquired.
            hook.load_string('fakie', key, bucket_name)
            return True

    def __enter__(self):
        return self.acquire()

    def __exit__(self, exc_type, exc_val, exc_tb):
        pass


def on_failure_callback(context):
    error = context['exception']
    task = context['task']
    run_id = context['run_id']
    ti = context['ti']

    with OnlyOnceLock(run_id) as lock:
        if lock:
            ti.email_alert(error, task)

Airflow Long (time) URL request not receiving response

I'm setting up a DAG in Airflow (Cloud Composer) to trigger some Cloud Run jobs that take upwards of 30 minutes to complete.
Instead of using SimpleHttpOperator, I'm using the PythonOperator to obtain an OIDC token so that I can trigger services that require authentication.
import google.auth.transport.requests
import google.oauth2.id_token
import requests

from airflow.operators.python import PythonOperator


def make_authorized_get_request(service_url, **kwargs):
    auth_req = google.auth.transport.requests.Request()
    id_token = google.oauth2.id_token.fetch_id_token(auth_req, service_url)
    headers = {"Authorization": f"Bearer {id_token}"}
    req = requests.get(service_url, headers=headers, timeout=(5, 3600))
    status = req.status_code
    response_json = req.json()
    return response_json, status


# t1: request first report
big3_request = PythonOperator(
    task_id='big3_request',
    python_callable=make_authorized_get_request,
    op_kwargs={"service_url": "cloud run url"},
)
I can see the Cloud Run job completes successfully, but the task in Airflow doesn't seem to pick up on the response (a JSON containing some data about the job and a status code) and just keeps on running, unless a timeout is set, in which case it errors (although the response should be available by then).
I've pointed this task towards a shorter-running service to test, and it picks up the response fine.
How can I get the task to pick up on the response? Or do I need to use another operator?

In Airflow, how do you get a parent task's task_id?

I have a branch task that relies on an XCom set by its direct upstream task. The upstream task ids are generated via a loop, such as task_1, task_2 ... task_n.
So something like this:
task_n >> branch[task_a, task_b]
Is there a way for a branch to access an XCom set by its direct upstream? I know I could use op_kwargs and pass the task id to the branch. I just wanted to see if there was a more Airflow-native way to do it.
The BranchPythonOperator should be created with provide_context=True, and its python callable can look something like this:
def branch_callable(task_instance, task, **kwargs):
    upstream_ids = task.upstream_task_ids  # an iterable
    xcoms = task_instance.xcom_pull(task_ids=upstream_ids)
    # process the xcoms of the direct upstream tasks
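For completeness, here is a sketch of how the branch operator itself could be declared with that callable, assuming Airflow 1.10-style imports (in 2.x the import path is airflow.operators.python and the context is passed automatically, so provide_context is not needed); task_n, task_a, task_b and dag are carried over from the question:
from airflow.operators.python_operator import BranchPythonOperator

branch = BranchPythonOperator(
    task_id="branch",
    python_callable=branch_callable,
    provide_context=True,  # exposes task_instance, task, etc. to the callable
    dag=dag,
)

task_n >> branch >> [task_a, task_b]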

Airflow - mark a specific task_id of given dag_id and run_id as success or failure

Can I externally (with an HTTP request?) mark a specific task_id associated with a dag_id and run_id as success/failure?
My task is a long-running task on an external system, and I don't want my task to poll the system to find the status, since we could have several thousand tasks running at the same time.
Ideally I want my task to:
make an HTTP request to start my external job
go to sleep
once the job is finished, it (the external system, or a post-build action of my job) informs Airflow that the task is done (identified by task_id, dag_id and run_id)
Thanks
You can solve this by sending SQL queries directly to Airflow's metadata DB:
UPDATE task_instance
SET state = 'success',
    try_number = 0
WHERE
    task_id = 'YOUR-TASK-ID'
    AND dag_id = 'YOUR-DAG-ID'
    AND execution_date = '2019-06-27T16:56:17.789842+00:00';
Notes:
- The execution_date filter is crucial: Airflow identifies DagRuns by execution_date, not really by their run_id. This means you really need to get your DagRun's execution/run date to make it work (a lookup sketch follows these notes).
- The try_number = 0 part is added because sometimes Airflow will reset the task back to failed if it notices that try_number is already at its limit (max_tries).
You can see it in Airflow's source code here: https://github.com/apache/airflow/blob/750cb7a1a08a71b63af4ea787ae29a99cfe0a8d9/airflow/models/dagrun.py#L203
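If you only know the run_id, one way to recover the matching execution_date from Python (instead of querying the metadata DB by hand) is the DagRun.find helper; the dag_id and run_id below are placeholders:
from airflow.models import DagRun

# Look up the DagRun to get the execution_date that the UPDATE above filters on.
runs = DagRun.find(dag_id="YOUR-DAG-ID", run_id="YOUR-RUN-ID")
if runs:
    print(runs[0].execution_date)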
Airflow doesn't yet have a REST endpoint for this. However, you have a couple of options:
- Use the Airflow command-line utilities to mark the job as success, e.g. from Python via Popen (see the sketch below).
- Directly update the Airflow DB table task_instance.
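A rough sketch of the first option, assuming Airflow 2's "airflow tasks run --mark-success" flag (older 1.10 installs use "airflow run -m" instead); the dag_id, task_id and execution date are placeholders:
import subprocess

# Mark one task instance as success without actually running it.
# The exact CLI syntax differs slightly between Airflow versions.
proc = subprocess.Popen(
    [
        "airflow", "tasks", "run",
        "YOUR-DAG-ID",
        "YOUR-TASK-ID",
        "2019-06-27T16:56:17.789842+00:00",  # execution date / run id of the DagRun
        "--mark-success",
    ],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
stdout, stderr = proc.communicate()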
