Airflow SIGTERM While EMR Job is Running

I have an Airflow DAG where each step is an AWS EMR task. Once Airflow reaches one of the steps, it sends the SIGTERM signal as follows:
{emr_step.py:73} INFO - Poking step XXXXXXXX on cluster XXXXXXXX
{emr_base.py:66} INFO - Job flow currently RUNNING
{local_task_job.py:199} WARNING - State of this instance has been externally set to failed. Terminating instance.
{process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 7632
This happens even though the EMR job is still running and healthy. One major difference between the EMR job that Airflow fails on and the rest of my EMR jobs is that it triggers another system and waits to hear back from it. In other words, it stays idle until it hears back from that other system. My impression is that Airflow thinks the EMR job has failed, when in fact it is just waiting on the other system.
Is there any way to ask Airflow to wait longer for this EMR job?
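For context, the waiting step is driven by an EmrStepSensor; below is a simplified sketch of how such a sensor can be configured with longer timeouts (the DAG/task ids, XCom sources, and timeout values are illustrative, and I am not certain which of these limits Airflow is actually hitting in my case):

from datetime import timedelta

from airflow import DAG
from airflow.providers.amazon.aws.sensors.emr_step import EmrStepSensor
from airflow.utils.dates import days_ago

# Illustrative names and values; the cluster and step ids would come from
# earlier EMR tasks in the real DAG.
with DAG(
    dag_id="emr_pipeline",
    start_date=days_ago(1),
    schedule_interval=None,
    dagrun_timeout=timedelta(hours=24),  # limit for the whole DAG run
) as dag:
    watch_step = EmrStepSensor(
        task_id="watch_long_step",
        job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
        step_id="{{ task_instance.xcom_pull(task_ids='add_step', key='return_value')[0] }}",
        aws_conn_id="aws_default",
        poke_interval=300,                      # check the EMR step every 5 minutes
        timeout=60 * 60 * 24,                   # sensor gives up after 24 hours
        mode="reschedule",                      # free the worker slot between pokes
        execution_timeout=timedelta(hours=24),  # per-task limit
    )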

Related

airflow jobs stuck in running status

There is a strange phenomenon when using Airflow. When I run the airflow scheduler, the unidentified jobs suddenly become running, and when I exit the scheduler, they turn into success.
I tried airflow db reset, but I keep getting that job whenever I run the scheduler. Can you tell me why?

Airflow stops scheduling tasks after a few days of runs

Airflow Version 2.0.2
I have three schedulers running in a Kubernetes cluster using the CeleryExecutor with a Postgres backend. Everything seems to run fine for a couple of weeks, but then the Airflow scheduler stops scheduling some tasks. I've done an airflow db reset followed by an airflow db init and a fresh deployment of the Airflow-specific images. Below are some of the errors I've received from logging in the database:
LOG: could not receive data from client: Connection timed out
STATEMENT: SELECT slot_pool.pool AS slot_pool_pool, slot_pool.slots AS slot_pool_slots
FROM slot_pool FOR UPDATE NOWAIT
According to https://github.com/apache/airflow/issues/19811 the slot_pool issue is expected behavior, but I cannot figure out why DAGs suddenly stop being scheduled on time. For reference, there are ~500 DAGs being run every 15 minutes.
The slot_pool table looks like this:
select * from slot_pool;
id | pool | slots | description
----+--------------+-------+--------------
1 | default_pool | 128 | Default pool
(1 row)
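To check whether one of the three schedulers is simply stuck waiting on that row lock, I have also been running a diagnostic of my own against the metadata database (the DSN is a placeholder and the query is mine, not something Airflow ships):

import psycopg2

# My own diagnostic, not something Airflow ships: list every backend that
# holds or waits on a lock against slot_pool while the schedulers are running.
conn = psycopg2.connect("host=<metadata-db-host> dbname=airflow user=airflow")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT a.pid, a.state, l.mode, l.granted, a.query
        FROM pg_locks l
        JOIN pg_class c ON c.oid = l.relation
        JOIN pg_stat_activity a ON a.pid = l.pid
        WHERE c.relname = 'slot_pool'
        ORDER BY l.granted;
        """
    )
    for pid, state, mode, granted, query in cur.fetchall():
        print(pid, state, mode, granted, (query or "")[:80])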
I have looked at several posts, but none of them seem to explain the issue or provide a solution. Below are a few of them:
Airflow initdb slot_pool does not exists
Running multiple Airflow Schedulers cause Postgres locking issues
Airflow tasks get stuck at "queued" status and never gets running

Airflow DAG getting psycopg2.OperationalError when running tasks with KubernetesPodOperator

Context
We are running Airflow 2.1.4 on an AKS cluster. The Airflow metadata database is an Azure managed PostgreSQL (8 CPU). We have a DAG with about 30 tasks; each task uses a KubernetesPodOperator (apache-airflow-providers-cncf-kubernetes==2.2.0) to execute some container logic. Airflow is configured with the official Airflow Helm chart. The executor is Celery.
Issue
Usually the first five or so tasks execute successfully (taking one or two minutes each) and get marked as done (colored green) in the Airflow UI. The tasks after that also execute successfully on AKS, but Airflow does not mark them as completed. In the end this leads to the error message below, and the already finished task is marked as failed:
[2021-12-15 11:17:34,138] {pod_launcher.py:333} INFO - Event: task.093329323 had an event of type Succeeded
...
[2021-12-15 11:19:53,866]{base_job.py:230} ERROR - LocalTaskJob heartbeat got an exception
psycopg2.OperationalError: could not connect to server: Connection timed out
Is the server running on host "psql-airflow-dev-01.postgres.database.azure.com" (13.49.105.208) and accepting
TCP/IP connections on port 5432?
Similar posting
This issue is also described in this post: https://www.titanwolf.org/Network/q/98b355ff-d518-4de3-bae9-0d1a0a32671e/y where the link to Stack Overflow in that post no longer works.
The metadata database (Azure managed PostgreSQL) is not overloaded, and the AKS node pool we are using does not show any sign of stress. It seems like the scheduler cannot pick up / detect a finished task after a couple of tasks have run.
We have also looked at several configuration options as stated here.
We have been trying to get this solved for a number of days now, but unfortunately without success.
Does anyone have any ideas what the cause could be? Any help is appreciated!
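One thing we are experimenting with (not yet confirmed to help) is forcing TCP keepalives on the metadata connection, so that Azure's idle-connection cutoff does not silently drop the heartbeat connection. A sketch, assuming sql_alchemy_connect_args is pointed at this dict and the module (our own name) is importable on the scheduler and workers:

# airflow_db_keepalive.py -- our own module, placed on the PYTHONPATH of the
# Airflow images and referenced from the config, e.g.:
#   AIRFLOW__CORE__SQL_ALCHEMY_CONNECT_ARGS=airflow_db_keepalive.keepalive_kwargs
# (sql_alchemy_connect_args takes the import path of a dict of psycopg2
# connection arguments).
keepalive_kwargs = {
    "keepalives": 1,            # enable TCP keepalives on the psycopg2 connection
    "keepalives_idle": 30,      # seconds of idleness before the first probe
    "keepalives_interval": 5,   # seconds between probes
    "keepalives_count": 5,      # failed probes before the connection is considered dead
}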

Airflow State of this instance has been externally set to shutdown. Taking the poison pill

Some of the Airflow tasks are automatically getting shut down.
I am using Airflow version 1.10.6 with the Celery Executor. The database is PostgreSQL and the broker is Redis. The Airflow infrastructure is deployed on Azure.
A few tasks are getting shut down after 15 hours, a few are getting stopped after 30 minutes. These are long-running tasks, and I have set the execution_timeout to 100 hours.
Is there any configuration that can prevent these tasks from being shut down by Airflow?
{local_task_job.py:167} WARNING - State of this instance has been externally set to shutdown. Taking the poison pill.
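For reference, this is roughly how the long-running tasks are defined; the DAG, operator, and task names are illustrative, and only the execution_timeout mirrors the real setup (Airflow 1.10.x import paths):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # 1.10.x import path


def run_long_job(**context):
    """Placeholder for the real long-running work."""


dag = DAG(
    dag_id="long_running_dag",  # illustrative
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
)

long_task = PythonOperator(
    task_id="long_running_task",  # illustrative
    python_callable=run_long_job,
    execution_timeout=timedelta(hours=100),  # the 100-hour limit mentioned above
    dag=dag,
)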

Cloudera Mesos - When mesos-slave is stopped, current job gets 'LOST' status

Cloudera Mesos - When mesos-slave is stopped, it stops any further job processing. However, if a job is currently in progress, it ends up in 'LOST' status. How can I prevent this? Also, how can I tell Mesos to complete the current job and then shut down the mesos-slave?
Thanks
There is a maintenance operator API in Mesos where you can ask Mesos to drain a node (finish all tasks) and then turn it off, but this feature must be supported by the framework you are using.
http://mesos.apache.org/documentation/latest/maintenance/
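In case it helps, here is a rough sketch of scheduling such a maintenance window through the master's /maintenance/schedule endpoint; verify the path and JSON schema against the linked docs for your Mesos version, and note that the host, IP, and timings below are placeholders:

import time

import requests

# Rough sketch based on the maintenance docs linked above; all values are placeholders.
MASTER = "http://mesos-master.example.com:5050"

schedule = {
    "windows": [
        {
            "machine_ids": [
                {"hostname": "slave-01.example.com", "ip": "10.0.0.11"}
            ],
            "unavailability": {
                "start": {"nanoseconds": int(time.time() * 1e9)},  # drain starting now
                "duration": {"nanoseconds": int(3600 * 1e9)},      # one-hour window
            },
        }
    ]
}

resp = requests.post(MASTER + "/maintenance/schedule", json=schedule)
resp.raise_for_status()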
