Trigger airflow task on sensor timeout - airflow

I currently have a PythonSensor which waits for files on an ftp server. Is it possible to have this sensor trigger a task on timeout? I am trying to create the following dag:
[image: airflow sensor diagram]
I have taken a look at BranchPythonOperator but it seems like I no longer get the benefits of rescheduling a task if it fails the first time.

Have you tried to use trigger_rule="all_failed" in your task?
all_failed: All upstream tasks are in a failed or upstream_failed state
See http://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html?highlight=all_failed#trigger-rules
And an example here http://airflow.apache.org/docs/apache-airflow/stable/faq.html?highlight=all_failed#how-to-trigger-tasks-based-on-another-task-s-failure
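For illustration, here is a minimal sketch of that layout (the check_ftp callable, dag_id, and task ids are made up for this example): the sensor fails on timeout, and the "on_timeout" task only fires via trigger_rule="all_failed".

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy import DummyOperator
    from airflow.sensors.python import PythonSensor

    def check_ftp():
        # Hypothetical placeholder: return True once the expected files exist.
        return False

    with DAG(
        dag_id="ftp_watch",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        wait_for_files = PythonSensor(
            task_id="wait_for_files",
            python_callable=check_ftp,
            mode="reschedule",      # frees the worker slot between pokes
            poke_interval=60,
            timeout=30 * 60,        # sensor fails after 30 minutes
        )

        process = DummyOperator(task_id="process")  # runs on sensor success

        on_timeout = DummyOperator(
            task_id="on_timeout",
            trigger_rule="all_failed",  # runs only when the sensor has failed
        )

        wait_for_files >> [process, on_timeout]

This way the sensor keeps its reschedule behaviour, and the timeout branch is handled purely by the trigger rule rather than by branching logic.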

Related

How to write deferrable SqlSensor in Airflow?

I have a number of DAGs that wait for EOD settlement, and limited worker slots. The settlement ends at varying times, so while waiting for the settlement I want to run a different DAG in the worker slot. From the Airflow documentation, a deferrable operator looks like a fit for this kind of purpose. I'm new to Python and Airflow. Can somebody explain how to write a deferrable SQL sensor?
I have looked at the deferrable time sensor examples, but can't make them work with SQL sensors.
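For reference, the general deferrable pattern (Airflow >= 2.2) looks roughly like the sketch below. This is not an official deferrable SqlSensor: the trigger's import path my_triggers.SqlPollTrigger is a placeholder, and the hook call inside run() is synchronous for brevity (a production trigger should use an async driver or run the query in a thread so it does not block the triggerer's event loop).

    import asyncio
    from typing import Any, AsyncIterator, Dict, Tuple

    from airflow.models import BaseOperator
    from airflow.triggers.base import BaseTrigger, TriggerEvent

    class SqlPollTrigger(BaseTrigger):
        def __init__(self, conn_id: str, sql: str, poll_interval: float = 60.0):
            super().__init__()
            self.conn_id = conn_id
            self.sql = sql
            self.poll_interval = poll_interval

        def serialize(self) -> Tuple[str, Dict[str, Any]]:
            # Tells Airflow how to recreate this trigger in the triggerer process.
            return (
                "my_triggers.SqlPollTrigger",  # hypothetical import path
                {"conn_id": self.conn_id, "sql": self.sql,
                 "poll_interval": self.poll_interval},
            )

        async def run(self) -> AsyncIterator[TriggerEvent]:
            from airflow.hooks.base import BaseHook
            while True:
                hook = BaseHook.get_connection(self.conn_id).get_hook()
                if hook.get_first(self.sql):  # blocking call; see note above
                    yield TriggerEvent({"status": "done"})
                    return
                await asyncio.sleep(self.poll_interval)

    class DeferrableSqlSensor(BaseOperator):
        def __init__(self, *, conn_id: str, sql: str, **kwargs):
            super().__init__(**kwargs)
            self.conn_id = conn_id
            self.sql = sql

        def execute(self, context):
            # Suspends the task and frees the worker slot until the trigger fires.
            self.defer(
                trigger=SqlPollTrigger(self.conn_id, self.sql),
                method_name="execute_complete",
            )

        def execute_complete(self, context, event=None):
            # Resumed in a fresh worker slot once the trigger fires.
            return event

While the task is deferred it occupies no worker slot, which is exactly what you want when many DAGs wait for settlement at once.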

Airflow Scheduler handling queueing of dags

I have the following airflow setup
Executor : KubernetesExecutor
airflow version : 2.1.3
airflow config : parallelism = 256
I have the below scenario
I have a number of DAGs (e.g. 10) that depend on the success state of a task from another DAG. That task kept failing, with retries enabled for 6 attempts.
All the dependent DAGs run hourly, so the scheduler kept adding their runs to the queue. I could see around 800 DAG runs queued and nothing running, so I ended up manually changing their state to failed.
Below are my questions from this event.
Is there a limit on the number of DAG runs that can run concurrently in an Airflow setup?
Is there a limit on how many DAG runs can be queued?
When DAG runs are queued, how does the scheduler decide which one to pick? Is it based on queued time?
Is it possible to set up priority among the queued DAGs?
How does Airflow 2.1.3 treat tasks in the queue? Are they counted against the max_active_runs parameter?
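For context, a minimal sketch of the DAG-level knobs involved in these questions (names as of Airflow 2.1.x; the dag_id and task id are placeholders). Global caps live in airflow.cfg under [core]: parallelism, dag_concurrency, and max_active_runs_per_dag.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy import DummyOperator

    with DAG(
        dag_id="hourly_dependent_dag",  # placeholder
        start_date=datetime(2021, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
        max_active_runs=1,  # cap on concurrent runs of this DAG; extra runs stay queued
        concurrency=4,      # cap on concurrently running task instances of this DAG
    ) as dag:
        work = DummyOperator(
            task_id="work",
            priority_weight=10,      # queued tasks with higher weight get slots first
            weight_rule="absolute",  # use the weight as-is instead of summing downstream
        )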

Airflow DAG getting psycopg2.OperationalError when running tasks with KubernetesPodOperator

Context
We are running Airflow 2.1.4 on an AKS cluster. The Airflow metadata database is an Azure-managed PostgreSQL (8 CPU). We have a DAG with around 30 tasks; each task uses a KubernetesPodOperator (from apache-airflow-providers-cncf-kubernetes==2.2.0) to execute some container logic. Airflow is configured with the official Airflow Helm chart. The executor is Celery.
Issue
Usually the first five or so tasks execute successfully (taking one or two minutes each) and get marked as done (and colored green) in the Airflow UI. The tasks after that are also executed successfully on AKS, but are not marked as completed in Airflow. In the end this leads to the error message below and marks the already-finished task as a fail:
[2021-12-15 11:17:34,138] {pod_launcher.py:333} INFO - Event: task.093329323 had an event of type Succeeded
...
[2021-12-15 11:19:53,866] {base_job.py:230} ERROR - LocalTaskJob heartbeat got an exception
psycopg2.OperationalError: could not connect to server: Connection timed out
Is the server running on host "psql-airflow-dev-01.postgres.database.azure.com" (13.49.105.208) and accepting
TCP/IP connections on port 5432?
Similar posting
This issue is also described in this post: https://www.titanwolf.org/Network/q/98b355ff-d518-4de3-bae9-0d1a0a32671e/y where a link to Stack Overflow in the post no longer works.
The metadata database (Azure-managed PostgreSQL) is not overloaded, and the AKS node pool we are using does not show any sign of stress either. It seems like the scheduler cannot pick up / detect a finished task after a couple of tasks have run.
We also looked at several configuration options as stated here.
We have been trying for a number of days now to get this solved, but unfortunately with no success.
Does anyone have any idea what the cause could be? Any help is appreciated!
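One commonly suggested mitigation for idle metadata-DB connections being silently dropped by cloud networking (an assumption here, not a confirmed diagnosis of this case) is enabling TCP keepalives on the SQLAlchemy connection. In Airflow 2.1.x the [core] sql_alchemy_connect_args option can point at an importable dict; the module and attribute names below are hypothetical.

    # airflow_local_settings.py -- module importable by Airflow (name is an
    # assumption for this sketch). Referenced from airflow.cfg:
    #   [core]
    #   sql_alchemy_connect_args = airflow_local_settings.keepalive_kwargs
    keepalive_kwargs = {
        "keepalives": 1,           # enable TCP keepalives (libpq/psycopg2 parameters)
        "keepalives_idle": 30,     # seconds idle before the first probe
        "keepalives_interval": 5,  # seconds between probes
        "keepalives_count": 5,     # failed probes before dropping the connection
    }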

Airflow: How to get the return output of one task to set the dependencies of the downstream tasks to run?

We have a kubernetes pod operator that will spit out a python dictionary that will define which further downstream kubernetes pod operators to run along with their dependencies and the environment variables to pass into each operator.
How do I get this python dictionary object back into the executor's context (or is it worker's context?) so that airflow can spawn the downstream kubernetes operators?
I've looked at BranchOperator and TriggerDagRunOperator and XCOM push/pull and Variable.get and Variable.set, but nothing seems to quite work.
We have a kubernetes pod operator that will spit out a python dictionary that will define which further downstream kubernetes pod operators to run
This is possible, albeit not in the way you are trying. You'll have to have all possible KubernetesPodOperators already in your workflow and then skip those that need not be run.
An elegant way to do this would be to attach a ShortCircuitOperator before each KubernetesPodOperator that reads the XCom (dictionary) published by the upstream KubernetesPodOperator and determines whether or not to continue with the downstream task.
EDIT-1
Actually, a cleaner way would be to just raise an AirflowSkipException within the task that you want to skip (rather than using a separate ShortCircuitOperator to do this), as in the sketch below.
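A minimal sketch of that suggestion, using a Python callable as the gate (the upstream task_id "generate_plan" and the plan structure are hypothetical):

    from airflow.exceptions import AirflowSkipException

    def run_if_planned(**context):
        # Pull the plan dictionary pushed by the upstream pod task
        # ("generate_plan" is a placeholder task_id).
        plan = context["ti"].xcom_pull(task_ids="generate_plan") or {}
        my_id = context["task"].task_id
        if my_id not in plan:
            # Marks this task instance as skipped; with default trigger rules
            # its downstream tasks are skipped as well.
            raise AirflowSkipException(f"{my_id} not in the upstream plan")
        # ...otherwise proceed, e.g. launch the pod with plan[my_id] env vars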
How do I get this python dictionary ... so that airflow can spawn the downstream kubernetes operators...
No, you can't dynamically spawn new tasks based on the output of an upstream task.
Think of it this way: for the scheduler it is imperative to know all the tasks (their task_ids, trigger_rules, priority_weight, etc.) ahead of time so as to be able to execute them when the right time comes. If tasks were to just keep coming up dynamically, then Airflow's scheduler would have to become akin to an operating system scheduler (!). For more details read the EDIT-1 part of this answer.

How to implement polling in Airflow?

I want to use Airflow to implement data flows that periodically poll external systems (FTP servers, etc.), check for new files matching certain conditions, and then run a bunch of tasks for those files. Now, I'm a newbie to Airflow and read that sensors are what you would use for this kind of case, and I actually managed to write a sensor that works fine when I run "airflow test" for it. But I'm a bit confused about the relation between the sensor's poke_interval and the DAG scheduling. How should I define those settings for my use case? Or should I use some other approach? I just want Airflow to run the tasks when those files become available, and not flood the dashboard with failures when no new files were available for a while.
Your understanding is correct: using a sensor is the way to go when you want to poll, either by using an existing sensor or by implementing your own.
Sensors are, however, always part of a DAG, and they do not execute outside of its boundaries. DAG execution depends on the start_date and schedule_interval, but you can leverage this together with a sensor to make a DAG effectively depend on the status of an external server: one possible approach is to start the whole DAG with a sensor that checks for the condition and skips the rest of the DAG if the condition is not met (you can make sensors mark downstream tasks as skipped rather than failed by setting their soft_fail parameter to True). You can get a polling interval of one minute by using the most frequent scheduling option (* * * * *); if you need an even shorter polling time, you can tweak the sensor's poke_interval and timeout parameters.
Keep in mind, however, that execution times are probably not guaranteed by Airflow itself, so for very short polling times you may want to investigate alternatives (or at least consider approaches different from the one I've just shared).
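A minimal sketch of the approach described above (the new_files_available callable and dag_id are placeholders): an every-minute DAG whose first task is a sensor with soft_fail=True, so quiet periods show up as skips rather than failures.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy import DummyOperator
    from airflow.sensors.python import PythonSensor

    def new_files_available():
        # Hypothetical placeholder: check the FTP server for matching files.
        return False

    with DAG(
        dag_id="poll_ftp",
        start_date=datetime(2021, 1, 1),
        schedule_interval="* * * * *",  # most frequent cron schedule
        catchup=False,
        max_active_runs=1,
    ) as dag:
        check = PythonSensor(
            task_id="check_for_files",
            python_callable=new_files_available,
            poke_interval=30,
            timeout=50,      # give up before the next scheduled run starts
            soft_fail=True,  # timeout marks the task skipped, not failed
        )
        process = DummyOperator(task_id="process_files")

        check >> process

With default trigger rules, the skip propagates to process_files, so runs with no new files leave a row of skipped (not red) tasks in the dashboard.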
