I have an Airflow DAG with nearly 50 tasks running in parallel. Occasionally I get this exception: "Exception while trying to heartbeat! Sleeping for 5.0 seconds". The Airflow documentation says this is due to tasks failing to send their periodic heartbeat, so the scheduler considers them zombies. Any fix would be helpful.
If you are making connections to a database or using many ports, try to close the connections and free the ports. That might work.
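For illustration, a minimal sketch of what that could look like inside a task callable (the connection details and query are placeholders, not values from the question):

```python
import psycopg2  # placeholder: any DB-API client works the same way

def my_task_callable(**context):
    # Open the connection inside the task and always close it, so the task
    # does not keep connections/ports occupied while it is heartbeating.
    conn = psycopg2.connect(host="my-db-host", dbname="mydb")  # placeholder values
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            cur.fetchall()
    finally:
        conn.close()  # free the connection even if the query fails
```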
Related
I have a .NET Core worker process using the latest library IBMMQDotnetClient 9.2.4. I am running into an issue in a test where there are about 10k+ messages in the queue. The application processes about 5.5k messages and then processes a random number of messages (mostly one message per heartbeat interval, though I have seen 100+) for every heartbeat interval (currently the default value of 5). Has anyone run into this issue? It would be great if someone knows the cause.
Update: There is no issue when putting messages in the queue; it only happens when getting from the queue.
Context
We are running Airflow 2.1.4 on an AKS cluster. The Airflow metadata database is an Azure-managed PostgreSQL (8 CPU). We have a DAG with about 30 tasks, and each task uses a KubernetesPodOperator (via apache-airflow-providers-cncf-kubernetes==2.2.0) to execute some container logic. Airflow is deployed with the official Airflow Helm chart. The executor is Celery.
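For illustration, a task of this kind would be defined roughly as follows (a hedged sketch; the DAG id, namespace, and image are placeholders, not our actual values):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    dag_id="container_logic_dag",  # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    tasks = [
        KubernetesPodOperator(
            task_id=f"container_task_{i}",
            name=f"container-task-{i}",
            namespace="airflow",  # placeholder namespace
            image="myregistry.azurecr.io/container-logic:latest",  # placeholder image
            get_logs=True,
            is_delete_operator_pod=True,
        )
        for i in range(30)
    ]
```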
Issue
Usually the first 5 or so tasks execute successfully (taking 1 or 2 minutes each) and get marked as done (colored green) in the Airflow UI. The tasks after that also execute successfully on AKS, but are not marked as completed in Airflow. In the end this leads to the error message below and the already finished tasks being marked as failed:
[2021-12-15 11:17:34,138] {pod_launcher.py:333} INFO - Event: task.093329323 had an event of type Succeeded
...
[2021-12-15 11:19:53,866] {base_job.py:230} ERROR - LocalTaskJob heartbeat got an exception
psycopg2.OperationalError: could not connect to server: Connection timed out
Is the server running on host "psql-airflow-dev-01.postgres.database.azure.com" (13.49.105.208) and accepting
TCP/IP connections on port 5432?
Similar posting
This issue is also described in this post: https://www.titanwolf.org/Network/q/98b355ff-d518-4de3-bae9-0d1a0a32671e/y, although the Stack Overflow link in that post no longer works.
The metadata database (Azure-managed PostgreSQL) is not overloaded, and the AKS node pool we are using does not show any sign of stress. It seems like the scheduler cannot pick up / detect a finished task after a couple of tasks have run.
We also looked at several configuration options, as stated here.
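For example, one such option is enabling TCP keepalives on the metadata database connection. A hedged sketch with placeholder values, assuming [core] sql_alchemy_connect_args is available in 2.1.4 and pointed at this dict by its import path (e.g. airflow_local_settings.keepalive_kwargs):

```python
# airflow_local_settings.py (any module importable by Airflow)
# Hypothetical sketch: point [core] sql_alchemy_connect_args at this dict so
# psycopg2 uses TCP keepalives for connections to the metadata database.
keepalive_kwargs = {
    "keepalives": 1,
    "keepalives_idle": 30,
    "keepalives_interval": 5,
    "keepalives_count": 5,
}
```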
We have been trying for a number of days to get this solved, but so far without success.
Does anyone have any idea what the cause could be? Any help is appreciated!
I currently have a PythonSensor which waits for files on an FTP server. Is it possible to have this sensor trigger a task on timeout? I am trying to create the following DAG:
airflow sensor diagram
I have taken a look at the BranchPythonOperator, but it seems I would no longer get the benefit of rescheduling the sensor if it fails the first time.
Have you tried using trigger_rule="all_failed" on your task?
all_failed: All upstream tasks are in a failed or upstream_failed state
See http://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html?highlight=all_failed#trigger-rules
And an example here http://airflow.apache.org/docs/apache-airflow/stable/faq.html?highlight=all_failed#how-to-trigger-tasks-based-on-another-task-s-failure
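For illustration, a minimal sketch of what that could look like with a PythonSensor (the callable, intervals, and task names are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor
from airflow.utils.trigger_rule import TriggerRule

def _check_ftp():
    # Placeholder check; return True once the expected files are on the FTP server.
    return False

with DAG(
    dag_id="ftp_sensor_with_timeout_branch",  # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    wait_for_files = PythonSensor(
        task_id="wait_for_files",
        python_callable=_check_ftp,
        mode="reschedule",   # keep the rescheduling behaviour
        poke_interval=300,
        timeout=60 * 60,     # the sensor fails after 1 hour without files
    )

    process_files = PythonOperator(
        task_id="process_files",
        python_callable=lambda: print("files arrived"),  # placeholder work
    )

    handle_timeout = PythonOperator(
        task_id="handle_timeout",
        python_callable=lambda: print("sensor timed out"),  # placeholder work
        trigger_rule=TriggerRule.ALL_FAILED,  # runs only if the sensor failed
    )

    wait_for_files >> [process_files, handle_timeout]
```

With this layout the sensor keeps its reschedule behaviour, process_files runs on success (the default all_success trigger rule), and handle_timeout runs only when the sensor ultimately fails, e.g. after hitting its timeout.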
Some of my Airflow tasks are automatically getting shut down.
I am using Airflow version 1.10.6 with the Celery executor. The database is PostgreSQL and the broker is Redis. The Airflow infrastructure is deployed on Azure.
A few tasks are getting shut down after 15 hours, and a few are getting stopped after 30 minutes. These are long-running tasks, and I have set execution_timeout to 100 hours.
Is there any configuration that can prevent these tasks from being shut down by Airflow?
{local_task_job.py:167} WARNING - State of this instance has been externally set to shutdown. Taking the poison pill.
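For context, the timeout is set roughly like this (a sketch with placeholder names; the real callable does the long-running work):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {
    "owner": "airflow",
    "execution_timeout": timedelta(hours=100),  # allow long-running tasks
}

with DAG(
    dag_id="long_running_dag",  # placeholder name
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
) as dag:
    long_task = PythonOperator(
        task_id="long_task",
        python_callable=lambda: None,  # placeholder for the long-running work
    )
```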
We are running Flink on a 3-VM cluster. Each VM has about 40 GB of RAM. Each day we stop some jobs and start new ones. After some days, starting a new job is rejected with a "Cannot allocate memory" error:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000340000000, 12884901888, 0) failed; error='Cannot allocate memory' (errno=12)
Investigation shows that the task manager's RAM keeps growing, to the point where it exceeds the allowed 40 GB, even though the jobs are cancelled.
I don't have access (yet) to the cluster so I tried some tests on a standalone cluster on my laptop and monitored the task manager RAM:
With jvisualvm I can see everything working as intended: I load the job's memory, then clean it and wait (a few minutes) for the GC to fire. The heap is released.
Whereas with top, memory is high, and stays high.
At the moment we are restarting the cluster every morning to work around this memory issue, but we can't afford that anymore, as we will need jobs running 24/7.
I'm pretty sure it's not a Flink issue but can someone point me in the right direction about what we're doing wrong here?
In standalone mode, Flink may not release resources the way you expect, for example resources held by a static member of an instance.
It is highly recommended to use YARN or Kubernetes as the runtime environment.