How to trigger Airflow DAG from AWS SQS?

I would like to trigger an Airflow DAG based on SQS messages. I am quite new to Airflow but this is how I think it should be done:
Option 1
Use the Airflow SQS Sensor. From my understanding, this waits on SQS messages to proceed with the execution of an already triggered DAG. Does this mean a DAG would always need to be running and waiting for SQS messages to catch any eventual new messages and process them? Does this also mean I should schedule my DAG on a very short interval so that when an SQS message gets handled by one DAG run, another run is created to handle the next SQS messages?
Option 2
Add a Lambda function (or something similar) that watches for SQS messages and uses the Airflow API to trigger DAGs when needed.
Ultimately, I would like to minimise the number of moving parts needed to trigger a DAG, so I would prefer an Airflow built-in way of watching SQS.
Thank you

Both options are valid; however, Option 2 is basically an alternative implementation of the sensor. I think the better solution is Option 1 with some modification:
Use SQSSensor but with mode='reschedule'. That way, every once in a while the sensor "wakes up" and checks whether the criteria are met. Note that this is not like sleep(x): when the criteria aren't met, Airflow releases the worker for other tasks that need to run and returns the SQSSensor to the scheduling queue.
You can read more about the sensor modes in the docs.
from airflow.providers.amazon.aws.sensors.sqs import SQSSensor

SQSSensor(
    task_id='test_task',
    dag=dag,
    sqs_queue='your_queue',
    aws_conn_id='aws_default',
    mode='reschedule')
Note that the sensor will run indefinitely until the criteria are met. You can set a timeout on the sensor task (there are other possible reasons for a timeout, like cluster policy and other defaults, but that is another topic).
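For illustration, here is a minimal sketch (the queue name and interval values are placeholders, not from the original post) that combines reschedule mode with a timeout:
from airflow.providers.amazon.aws.sensors.sqs import SQSSensor

# poke_interval controls how often the rescheduled sensor re-checks the queue;
# timeout fails the task if no message has arrived within that window.
wait_for_message = SQSSensor(
    task_id='wait_for_sqs_message',
    sqs_queue='your_queue',
    aws_conn_id='aws_default',
    mode='reschedule',
    poke_interval=5 * 60,   # re-check every 5 minutes
    timeout=6 * 60 * 60,    # give up after 6 hours
    dag=dag,
)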

Related

how to use up_for_reschedule with python operator

In our orchestration we use one DAG to trigger other child DAGs, and until these child DAGs finish, our master DAG keeps running to check their status. We used a sleep of 5 minutes so the status of the child DAGs is checked every 5 minutes. As this task is continually in a running state, it consumes resources on one worker. Recently we came across up_for_reschedule. Does this solve my problem of releasing the worker? Is it possible to use up_for_reschedule with the Python operator? If yes, is there any documentation I can refer to?
There are a couple of options here.
You can use a PythonSensor instead of an operator, in reschedule mode. In reschedule mode, Airflow will reschedule the task instance when the sensor condition is not yet met, releasing the worker slot in the meantime.
from airflow.sensors.python import PythonSensor

task = PythonSensor(
    task_id='sensor_example',
    mode='reschedule',
    python_callable=func,  # callable that returns True once the condition is met
)
You can link a bunch of TriggerDagRunOperators, but this is best if you just want to create the DAG runs and not wait for status checks. Unlike sensors, it doesn't have a reschedule mode, so when wait_for_completion is True it will hold the worker slot.
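A rough sketch of that option, assuming an Airflow 2.x release where TriggerDagRunOperator accepts wait_for_completion (the child DAG ids below are placeholders):
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# Each operator triggers one child DAG; with wait_for_completion=True the task
# polls the triggered run every poke_interval seconds and occupies a worker
# slot until that run reaches a terminal state.
trigger_child_a = TriggerDagRunOperator(
    task_id='trigger_child_a',
    trigger_dag_id='child_dag_a',   # placeholder DAG id
    wait_for_completion=True,
    poke_interval=60,
    dag=dag,
)
trigger_child_b = TriggerDagRunOperator(
    task_id='trigger_child_b',
    trigger_dag_id='child_dag_b',   # placeholder DAG id
    wait_for_completion=True,
    poke_interval=60,
    dag=dag,
)
trigger_child_a >> trigger_child_b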

Polling multiple SQS messages using Airflow SQSSensor

I am using these SQSSensor settings to poll messages:
fetch_sqs_message = SQSSensor(
    task_id="...",
    sqs_queue="...",
    aws_conn_id="aws_default",
    max_messages=10,
    wait_time_seconds=30,
    poke_interval=60,
    timeout=300,
    dag=dag
)
I would assume that every time it polls, it should fetch up to 10 messages; my queue had around 5 when I tested this.
But each time I trigger the DAG, it only polls 1 message at a time, which I found out from the SQS message count.
Why is it doing this? How can I get it to poll as many messages as possible?
Recently, a new feature has been added to SQSSensor so that the sensor can poll SQS multiple times instead of only once.
You can check out this merged PR
For example, if num_batches is set to 3, SQSSensor will poll the queue 3 times before returning the results.
Disclaimer: I contributed to this feature.
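A hedged sketch of how that could look, assuming an amazon provider release recent enough to expose num_batches (newer provider versions name the class SqsSensor; adjust the import to match your installed version):
from airflow.providers.amazon.aws.sensors.sqs import SqsSensor

# num_batches makes the sensor call ReceiveMessage several times per poke,
# so more than a single batch of up to max_messages can be collected.
fetch_sqs_messages = SqsSensor(
    task_id='fetch_sqs_messages',
    sqs_queue='...',
    aws_conn_id='aws_default',
    max_messages=10,
    num_batches=3,          # poll the queue 3 times before returning
    wait_time_seconds=30,
    poke_interval=60,
    timeout=300,
    dag=dag,
)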

Is there a way to externally check the status of each task in a DAG?

I was looking through the different API endpoints that Airflow offers, but I could not find one that would suit my needs. Essentially, I want to monitor the state of each task within the DAG without having to specify each task I am trying to monitor. Ideally, I would be able to ping the DAG and the response would tell me the state of each task and which tasks are running, retrying, etc.
You can use the Airflow REST API that ships with it: https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html
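For example, a minimal sketch against the task-instances endpoint of the stable REST API (the host, credentials, dag_id and dag_run_id are placeholders, and basic auth assumes the basic_auth backend is enabled):
import requests

BASE_URL = 'http://localhost:8080/api/v1'   # placeholder Airflow webserver
AUTH = ('admin', 'admin')                    # placeholder credentials

# List every task instance of one DAG run and print its current state.
resp = requests.get(
    f'{BASE_URL}/dags/my_dag/dagRuns/my_run_id/taskInstances',
    auth=AUTH,
)
resp.raise_for_status()
for ti in resp.json()['task_instances']:
    print(ti['task_id'], ti['state'])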

Airflow - pause/unpause individual dagruns of the same DAG running in parallel

We are currently evaluating Airflow for a project. I was wondering if there is a way to stop/start individual DAG runs while running a DAG multiple times in parallel. Pause/unpause on a dag_id seems to pause/unpause all the DAG runs under that DAG; instead, we want to pause individual DAG runs (or tasks within them). Let me know if this is achievable in Airflow.
If it's not possible, here are other alternatives I am thinking of; let me know your opinion on these:
Change task state: mark all tasks under a DAG run as Failed or Success. That way that particular DAG run is stopped in its tracks without affecting other DAG runs.
An Airflow sensor that pulls this information from S3, HTTP, SQL, or somewhere else to pause the current DAG run, i.e. a task that checks S3 each time to see whether this particular DAG run (and no other) needs to be stopped.
Subdags: can we pause/unpause subdags? That way, for each parallel user request, we issue a subdag and can pause user A's subdag without impacting other users' subdags.
There's nothing "baked" into Airflow to support this, but you could (ab)use the state of the DagRun by changing it to "failed" to pause and then back to "running" to resume; you won't be able to blanket unpause, but for testing it should be workable.
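If you want to script that rather than click through the UI, here is a hedged sketch against the stable REST API's "update a DAG run" endpoint (available in newer Airflow 2 releases; the host, credentials and ids are placeholders, and moving the run back to a running state may still require the UI or the metadata database):
import requests

BASE_URL = 'http://localhost:8080/api/v1'   # placeholder Airflow webserver
AUTH = ('admin', 'admin')                    # placeholder credentials

# Mark one specific DAG run as failed so its tasks stop being scheduled,
# without pausing the DAG or touching its other runs.
resp = requests.patch(
    f'{BASE_URL}/dags/my_dag/dagRuns/my_run_id',
    json={'state': 'failed'},
    auth=AUTH,
)
resp.raise_for_status()
print(resp.json()['state'])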

Airflow: Only allow one instance of task

Is there a way to specify that a task can only run once concurrently? So in the tree above, where DAG concurrency is 4, Airflow would start task 4 instead of a second instance of task 2?
This DAG is a little special because there is no order between the tasks. The tasks are independent but related in purpose and are therefore kept in one DAG, so as not to create an excessive number of single-task DAGs.
max_active_runs is 2 and dag_concurrency is 4. I would like it to start all 4 tasks and only start a task in the next run if the same task in the previous run is done.
I may have misunderstood your question, but I believe you want all the tasks in a single DAG run to finish before the tasks begin in the next DAG run, so that a DAG only executes once the previous execution is complete.
If that is the case, you can make use of the max_active_runs parameter of the dag to limit how many running concurrent instances of a DAG there are allowed to be.
More information here (refer to the last dot point): https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled
max_active_runs defines how many running concurrent instances of a DAG there are allowed to be.
The Airflow operator documentation describes the task_concurrency argument. Just set it to one.
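A minimal sketch (the callable and task names are illustrative; newer Airflow versions rename the argument to max_active_tis_per_dag):
from airflow.operators.python import PythonOperator

# task_concurrency=1 caps this task at one running instance across all active
# runs of the DAG, so the next run's copy waits for the previous one to finish.
task_2 = PythonOperator(
    task_id='task_2',
    python_callable=my_callable,   # illustrative callable
    task_concurrency=1,
    dag=dag,
)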
From the official docs for trigger rules:
depends_on_past (boolean) when set to True, keeps a task from getting triggered if the previous schedule for the task hasn’t succeeded.
So future DAG runs will wait for the previous ones to finish successfully before executing.
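For example, a sketch that applies it to every task through default_args (the DAG id and dates are placeholders):
from datetime import datetime

from airflow import DAG

# With depends_on_past=True, each task instance waits for the same task in the
# previous DAG run to succeed before it is scheduled.
dag = DAG(
    dag_id='independent_tasks',
    start_date=datetime(2021, 1, 1),
    schedule_interval='@daily',
    max_active_runs=2,
    default_args={'depends_on_past': True},
)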
In airflow.cfg, under [core], you will find:
dag_concurrency = 16
# The number of task instances allowed to run concurrently by the scheduler
You're free to change this to what you desire.
